Luca Barbiero Department of Information Engineering University of Padua

A thesis submitted for the Bachelor degree in Information engineering September 23rd, 2011

Supervisor: Michele Pavon

Day of the defense:

Signature from head of committee:

Contents

1 Introduction
2 Preliminaries
2.1 Random vectors
2.2 Multivariate normal distribution
2.3 Entropy
2.3.1 Information divergence
2.4 Lagrange multipliers
2.4.1 Lagrangian
3 Introduction to maximum entropy methods
3.1 Heuristic
3.1.1 Boltzmann's dice
3.2 Formal approach
3.2.1 The minimum discrimination information principle
4 Covariance Selection
4.1 The problem
4.2 A rule
4.3 Link to the general framework
4.3.1 Generalization to Matrix Completion Problems
5 Quasi-Newton methods
5.1 Newton's step
5.1.1 Minimization and maximization problems
5.1.1.1 The multivariate case
5.2 Approximation
5.2.1 Entropy approach
References

1

Introduction

The aim of this thesis is to give an insight into the motivations and the applications of maximum entropy methods in Information Theory. Such techniques, although they take different forms, all apply, as a postulate, the principle of maximum entropy introduced by the physicist Edwin Thompson Jaynes in 1957. After briefly recalling some mathematical tools in Chapter 2, in Chapter 3 we'll introduce, first heuristically and then formally, the motivation of the entropic approach, and show with a famous example that this approach agrees closely with nature: this is the true reason for the wide spectrum of applications it finds. Subsequently, in Chapter 4 we'll focus on an apparently unrelated context, that of matrix completion, referring to the work of the statistician Arthur P. Dempster, who in 1972 with his "Covariance Selection" theory [4] gave rise to a whole stream of research in that field. We'll observe that, despite the different formulation, Dempster's work is nothing but an application of the maximum entropy principle and, more interestingly, that it opens the door to a general approach to matrix completion, regardless of why a completion is needed: the need can come from 1) a lack of reliable information or 2) a goal of computational saving. Finally, in Chapter 5, remaining in the context of matrix completion, we'll treat a case of the second type, which looks different from the one presented at the end of Chapter 4 but, again, applies the same original idea.


2

Preliminaries

We first recall some mathematical tools that will be useful in what follows.

2.1

Random vectors

Random vectors (rve) are the multivariate extension of random variables (rv). A rve $X = (X_1, \dots, X_n)^T \in \mathbb{R}^n$ is a map

$$X : \Omega \to \mathbb{R}^n, \qquad \omega \mapsto X(\omega) = (X_1(\omega), \dots, X_n(\omega))^T. \tag{2.1}$$

The probability measure induced by $X$ on $\mathbb{R}^n$,

$$P(X \in E), \qquad E \subset \mathbb{R}^n, \tag{2.2}$$

fully characterizes the rve $X$ in a statistical sense. The distribution of $X$ is in turn characterized by the multidimensional cumulative distribution function (CDF)

$$F_X(x_1, \dots, x_n) = P(X_1 \le x_1, \dots, X_n \le x_n). \tag{2.3}$$

Analysing the CDF we can distinguish among discrete, continuous and mixed rve. Consider, for instance, the continuous case: the absolutely continuous CDFs are the subclass of the continuous CDFs that admit a probability density function (PDF) $f_X(x_1, \dots, x_n)$. At its points of continuity, the PDF is obtained from the CDF by differentiation

$$f_X(x_1, \dots, x_n) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F_X(x_1, \dots, x_n). \tag{2.4}$$

The usefulness of the PDF is that, when it exists, it reduces the computation of probabilities to a multiple integration

$$P(X \in E) = \int \cdots \int_E f_X(x_1, \dots, x_n)\, dx_1 \cdots dx_n. \tag{2.5}$$

The PDF doesn't always exist, as previously mentioned. In what follows, we'll make frequent use of the concepts of mean vector and covariance matrix of a rve. The expected value of $X$ is the vector in $\mathbb{R}^n$

$$E[X] = (E[X_1], \dots, E[X_n])^T. \tag{2.6}$$

The covariance matrix (or simply covariance, when it's clear that we're in a multidimensional context) is the matrix in $\mathbb{R}^{n \times n}$

$$\Sigma = E[(X - E[X])(X - E[X])^T], \tag{2.7}$$

whose $(i, j)$ element is $\sigma_{ij} = \mathrm{cov}(X_i, X_j)$, i.e. the covariance between the $i$th and the $j$th component of the rve. When $i = j$, $\sigma_{ii}$ is the variance of the $i$th component. The covariance matrix is symmetric (because $\mathrm{cov}(X_i, X_j) = \mathrm{cov}(X_j, X_i)$) and positive semidefinite: for every $a \in \mathbb{R}^n$

$$a^T \Sigma a = a^T E[(X - E[X])(X - E[X])^T] a = E[a^T (X - E[X])(X - E[X])^T a] = \mathrm{var}(a^T X) \ge 0, \tag{2.8}$$

where we used the linearity of expectation. It is positive definite exactly when no nontrivial linear combination $a^T X$ is almost surely constant.

2.2

Multivariate normal distribution

A random vector $X \in \mathbb{R}^n$ is said to have a multivariate normal distribution if either of the following equivalent conditions holds:

1. every linear combination of its components $Y = a_1 X_1 + \dots + a_n X_n$ is normally distributed;

2. there exist a random $l$-vector $Z$ whose components are independent standard normal random variables, an $n$-vector $\mu$ and an $n \times l$ matrix $A$ such that $X = AZ + \mu$.

In words, every multivariate normal distribution is an affine transformation of the so-called standard multivariate normal. If the covariance matrix $\Sigma$ is nonsingular, the PDF of $X$ exists and can be expressed analytically as

$$f_X(x) = (2\pi)^{-n/2}\, |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right). \tag{2.9}$$

We remark that, as in the unidimensional case, the multivariate normal distribution is fully determined by its mean vector $\mu$ and covariance matrix $\Sigma$. Moreover, since a normal distribution can be made zero-mean by subtracting its mean (which can be estimated empirically), we stress the fact that it's the covariance matrix $\Sigma$ which characterizes the distribution, and observe that it's the inverse of the covariance matrix

$$\Sigma^{-1} = \begin{pmatrix} \sigma^{11} & \cdots & \sigma^{1n} \\ \vdots & \ddots & \vdots \\ \sigma^{n1} & \cdots & \sigma^{nn} \end{pmatrix} \tag{2.10}$$

that appears in the analytical expression of the distribution, where the $\sigma^{ij}$ are its components.
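To make the role of $\Sigma^{-1}$ in (2.9) concrete, here is a minimal sketch that evaluates the density for $n = 2$, where the inverse can be written explicitly. The function name `normal_pdf_2d` and the numerical values are illustrative assumptions, not part of the original text.

```python
import math

def normal_pdf_2d(x, mu, sigma):
    """Density (2.9) for n = 2; sigma is a nonsingular 2x2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Entries of the inverse covariance, i.e. the sigma^{ij} of (2.10).
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

# Hypothetical example: standard normal and a correlated normal, both at the mean.
peak_standard = normal_pdf_2d([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
peak_correlated = normal_pdf_2d([1.0, -1.0], [1.0, -1.0], [[2.0, 0.5], [0.5, 1.0]])
print(peak_standard, peak_correlated)
```

At the mean the quadratic form vanishes, so the value reduces to $(2\pi)^{-1}|\Sigma|^{-1/2}$: a larger determinant spreads the mass and lowers the peak.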

2.3

Entropy

Entropy is a measure of randomness or, more precisely, of the unpredictability associated with random vectors (univariate random variables are special cases of rve). The higher the entropy, the smaller our ability to predict events a priori: high entropy means that we gain (on average) a large amount of information when an outcome occurs, hence we can think of this central concept as, in the end, a quantification of our ignorance about random phenomena. In particular, our ignorance about a rve is maximum when its probability distribution is uniform, i.e. every outcome is equally likely and we have no further information about the outcomes before the experiment. Consider a discrete rve $X \in \mathbb{R}^n$ (the continuous case being analogous) with a finite sample space $\mathcal{X}$ of cardinality $M$ and a valid probability mass function (PMF) for it, which, for ease of notation, here and in what follows we denote simply by $p$ (for both continuous and discrete distributions). A consistent entropy function $H(p)$ on the space of probability distributions must satisfy the following properties:

1. if $X$ is a.s. constant then $H(p) = 0$, otherwise $H(p) > 0$;

2. if $p^*$ is uniform over its alphabet, i.e. $p^*_i = 1/M$, $i = 1, \dots, M$, then $p^* = \arg\max H(p)$, otherwise $H(p) < H(p^*)$.

The above properties formalize the heuristic intuition we discussed previously. Now we introduce the analytical form of entropy proposed by C.E. Shannon in 1948:

$$H(p) = -\sum_{i=1}^{M} p_i \log p_i, \tag{2.11}$$

where $0 \log 0 = 0$ by definition. Note that entropy depends only on the PMF, not on the values taken by the rve. This measure of the entropy of a distribution satisfies the properties we stated; in particular it has a unique global maximum. Note that the base of the logarithm is not important, provided it's greater than 1: in statistical mechanics base $e$ is used, whereas in Information Theory base 2 is preferred (so that the entropy of a fair coin is 1 bit, the unit of measure of information).
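The definition (2.11) and the two consistency properties can be checked numerically. The following is a minimal sketch in base 2; the function name `shannon_entropy` and the loaded-die PMF are illustrative assumptions.

```python
import math

def shannon_entropy(p, base=2.0):
    """H(p) = -sum p_i log p_i, with the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

h_coin = shannon_entropy([0.5, 0.5])            # fair coin
h_const = shannon_entropy([1.0, 0.0, 0.0])      # a.s. constant rv
h_uniform = shannon_entropy([1 / 6] * 6)        # fair die
h_loaded = shannon_entropy([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])  # hypothetical loaded die

print(h_coin, h_const, h_uniform, h_loaded)
```

The constant distribution gives zero entropy (property 1), and the loaded die has lower entropy than the uniform one over the same alphabet (property 2).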

2.3.1

Information divergence

We now present a very powerful instrument, introduced by Kullback and Leibler in 1951 [6]. Consider two valid probability distributions $p$ and $q$ (again we focus on discrete distributions: the continuous case can be treated by replacing sums with integrals), with the only restriction that the support of $p$ is contained in the support of $q$:

$$q_i = 0 \Rightarrow p_i = 0 \qquad \forall i. \tag{2.12}$$

The information divergence, or relative entropy, or KL-index of $p$ from $q$ is defined to be

$$D(p\|q) = \sum_i p_i \log \frac{p_i}{q_i}. \tag{2.13}$$

Note that $D(\cdot\|\cdot)$ does not induce a metric on the space of probability distributions, since it is not symmetric and, most importantly, it does not satisfy the triangle inequality. Nevertheless, it enjoys two properties:

1. $D(p\|q) \ge 0$;

2. $D(p\|q) = 0$ if and only if $p = q$.
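A small numerical sketch of (2.13) illustrates the lack of symmetry and the two properties. It also checks that, when the reference distribution is uniform over $M$ outcomes, the definition reduces to $\log M - H(p)$. The distributions `p`, `q` and the helper names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum p_i log(p_i / q_i); requires q_i = 0 => p_i = 0, as in (2.12)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 log 0 = 0 by convention
        if qi == 0:
            raise ValueError("support of p is not contained in the support of q")
        total += pi * math.log(pi / qi)
    return total

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]
u = [1 / 3] * 3          # uniform reference distribution, M = 3

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
d_pu = kl_divergence(p, u)
print(d_pq, d_qp, d_pu)
```

Here `d_pq` and `d_qp` differ, which is why the divergence can only be read as a pseudo-distance.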


Put another way, we can see the information divergence as a pseudo-distance of $p$ from $q$, in some sense. The case in which $q$ is a uniform distribution is interesting: as seen before, $q$ then has maximum entropy among all distributions on a sample space of cardinality $M$, and the smaller the divergence of $p$ from $q$, the higher the entropy of $p$. In fact, with $q_i = 1/M$,

$$D(p\|q) = \sum_{i=1}^{M} p_i \log (M p_i) = \log M + \sum_{i=1}^{M} p_i \log p_i = H_{\max} - H(p), \tag{2.14}$$

where $H_{\max} = \log M$ is the entropy of the uniform distribution. It's easy to see that $D(p\|q) \to 0$ when $H(p) \to H_{\max}$.

2.4

Lagrange multipliers

The method of Lagrange multipliers provides a strategy for finding the maxima and minima of a function subject to constraints. Note that in this section we leave the field of random phenomena to recall some results from Analysis, so that here $X \subset \mathbb{R}^n$ is an open set and $\Gamma \subset \mathbb{R}^n$ is a constraint defined to be

$$\Gamma = \{x \in X : g(x) = b\}, \tag{2.15}$$

where $b = (b_1, \dots, b_m)$ is fixed and $g : X \to \mathbb{R}^m$ is a $C^1$ function with components $g = (g_1, \dots, g_m)$. Let's briefly recall the main results in this field of Analysis.

Definition 1. A point $x^* \in \Gamma$ is said to be a relative maximum (resp. minimum) constrained to $\Gamma$ for a function $f : X \to \mathbb{R}$ if there exists a neighborhood $U$ of $x^*$ such that $f(x^*) \ge f(x)$ (resp. $f(x^*) \le f(x)$) $\forall x \in U \cap \Gamma$. A constrained maximum or minimum is also called a constrained extreme.

We're now ready to state the main result of this section. Recall that $x^* \in \Gamma$ is said to be a regular point of (2.15) if $\nabla g(x^*) \neq 0$ (for $m > 1$, if the gradients $\nabla g_1(x^*), \dots, \nabla g_m(x^*)$ are linearly independent).

Theorem (on Lagrange multipliers) 1. Let $x^* \in \Gamma$ be a regular point of $\Gamma$ and a constrained extreme for a function $f : X \to \mathbb{R}$ differentiable at $x^*$. Then there exist $\lambda_1, \dots, \lambda_m \in \mathbb{R}$ such that

$$\nabla f(x^*) = \sum_{i=1}^{m} \lambda_i \nabla g_i(x^*). \tag{2.16}$$

In particular, $\lambda_1, \dots, \lambda_m$ are called the Lagrange multipliers for the constrained extreme problem.


It follows that constrained maxima and minima must be sought among the irregular points of the constraint and the regular ones that satisfy (2.16). In particular, if $\Gamma$ consists only of regular points, the constrained extreme problem reduces to solving the system of $n + m$ equations in the $n + m$ unknowns $x_1, \dots, x_n, \lambda_1, \dots, \lambda_m$

$$\begin{cases} \partial_{x_j} f(x_1, \dots, x_n) = \sum_{i=1}^{m} \lambda_i\, \partial_{x_j} g_i(x_1, \dots, x_n), & j = 1, \dots, n \\ g_i(x_1, \dots, x_n) = b_i, & i = 1, \dots, m \end{cases} \tag{2.17}$$

2.4.1

Lagrangian

The function $L : X \times \mathbb{R}^m \to \mathbb{R}$ defined to be

$$L(x, \lambda) = f(x) - \langle \lambda, g(x) - b \rangle = f(x_1, \dots, x_n) - \sum_{i=1}^{m} \lambda_i [g_i(x_1, \dots, x_n) - b_i] \tag{2.18}$$

is called the Lagrangian of the constrained extreme problem. The following result follows from the Lagrange multipliers theorem.

Corollary 1. Let $f$ be a $C^1$ function having a local extreme constrained to $\Gamma$ at $x^*$, and let $x^*$ be a regular point of $\Gamma$. Then there exists $\lambda = (\lambda_1, \dots, \lambda_m)$ such that $(x^*, \lambda)$ is a free critical point for $L$.

Proof. If $x^* \in \Gamma$ is a local constrained extreme for $f$, then there exists $\lambda = (\lambda_1, \dots, \lambda_m) \in \mathbb{R}^m$ such that $x^*_1, \dots, x^*_n, \lambda_1, \dots, \lambda_m$ solve the system (2.17). Because

$$\begin{aligned} \partial_{x_j} L(x, \lambda) &= \partial_{x_j} f(x_1, \dots, x_n) - \sum_{i=1}^{m} \lambda_i\, \partial_{x_j} g_i(x_1, \dots, x_n), \\ \partial_{\lambda_i} L(x, \lambda) &= -[g_i(x_1, \dots, x_n) - b_i], \end{aligned} \tag{2.19}$$

this is equivalent to stating that $(x^*, \lambda) \in X \times \mathbb{R}^m$ is a free critical point for $L$.

The mathematical usefulness of the Lagrangian is now clear: it reduces a constrained extreme problem to an unconstrained one.
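As a worked instance of the corollary (the function and the constraint below are chosen only for illustration and do not come from the text), consider minimizing $f(x_1, x_2) = x_1^2 + x_2^2$ on the line $g(x_1, x_2) = x_1 + x_2 = 1$. Every point of the constraint is regular, since $\nabla g = (1, 1)^T \neq 0$, so the extreme can be found as a free critical point of the Lagrangian:

```latex
\begin{align*}
L(x_1, x_2, \lambda) &= x_1^2 + x_2^2 - \lambda\,(x_1 + x_2 - 1), \\
\partial_{x_1} L &= 2x_1 - \lambda = 0, \\
\partial_{x_2} L &= 2x_2 - \lambda = 0, \\
\partial_{\lambda} L &= -(x_1 + x_2 - 1) = 0
\quad\Longrightarrow\quad x_1^* = x_2^* = \tfrac{1}{2},\ \lambda = 1 .
\end{align*}
```

The first two equations are (2.16) written componentwise and the third is the constraint itself; the value $f(x^*) = 1/2$ is the squared distance of the line from the origin, hence a constrained minimum.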


3

Introduction to maximum entropy methods

In Information Theory, the maximum entropy principle is a postulate which states that, in model fitting problems subject to known constraints (or incomplete information), the probability distribution that best represents the current state of knowledge is the one with the largest entropy.

3.1

Heuristic

For many decades it has been recognized, through theoretical advances as well as applied results, that the notion of entropy defines a kind of measure on the space of probability distributions, such that those of high entropy are in some sense preferable to others. The justification for this has been stated in a variety of intuitive forms: higher entropy distributions represent more "disorder", they are "smoother", "more probable", "less predictable", they "assume less", according to Shannon's interpretation of entropy as an information measure. The recurrent idea behind all these keywords is that in a model fitting task, given some incomplete information, the best choice seems to be to determine the model so that it allows the widest spectrum of behaviors compatible with the constraints. This is precisely what we accomplish when we maximize entropy subject to the constraints: we choose a model that describes the experimental evidence obtained, without (erroneously) biasing it towards specific behaviors on nonexistent grounds. It is well known that tending to maximum entropy means tending to the uniform distribution, which is above all the distribution of complete ignorance.

3.1.1

Boltzmann’s dice

Suppose that $n$ dice are thrown on a table. We are faced with the task of determining the frequencies $p_i = n_i / n$, where $n_i$ is the number of dice showing face $i$. In the absence of any experimental evidence (no constraints) we're led to choose a priori the uniform distribution, which assigns $p_i = 1/6$, $i = 1, \dots, 6$. Indeed there's no reason to think that any face is more probable than any other or, put another way, it would seem highly irrational to make any estimate other than the uniform one. Suppose now we're given the following experimental evidence: the total number of spots showing is $n\alpha$,

$$\sum_{i=1}^{6} i\, n_i = n\alpha. \tag{3.1}$$

Note that from (3.1) it follows that

$$\sum_{i=1}^{6} i\, \frac{n_i}{n} = \alpha = E[X], \tag{3.2}$$

where $X$ is the random variable which denotes the number of spots shown by one die. Consider the general case in which $E[X]$ differs from 3.5, the well-known expected number of spots shown by a fair die. Now the uniform distribution is not suitable to fit the model. One way to proceed is to count the number of ways that $n$ dice can fall so that $n_i$ dice show face $i$. There are

$$\binom{n}{n_1, \dots, n_6} = \frac{n!}{n_1! \cdots n_6!} \tag{3.3}$$

such ways, where (3.3) is the multinomial coefficient, which in combinatorics is the number of ways in which an $n$-element set can be partitioned into 6 disjoint sets having $n_i$ elements each, $i = 1, \dots, 6$. The macrostate indexed by $(n_1, \dots, n_6)$ corresponds to (3.3) microstates, each one having probability $1/6^n$. We wish to maximize (3.3) in order to find the most probable macrostate, under the constraint (3.1). Using a crude Stirling's approximation, $n! \approx (n/e)^n$, and recalling that $\sum_{i} n_i = n$, we find that

$$\begin{aligned}
\binom{n}{n_1, \dots, n_6} &\approx \frac{(n/e)^n}{\prod_{i=1}^{6} (n_i/e)^{n_i}} = \frac{n^n e^{-n}}{e^{-n} \prod_{i=1}^{6} n_i^{n_i}} = \frac{n^n}{\prod_{i=1}^{6} n_i^{n_i}} \\
&= \exp\Big(\ln\Big[n^n \prod_{i=1}^{6} \frac{1}{n_i^{n_i}}\Big]\Big) = \exp\Big(n \ln n + \ln \prod_{i=1}^{6} \frac{1}{n_i^{n_i}}\Big) = \exp\Big(n \ln n - \sum_{i=1}^{6} n_i \ln n_i\Big) \\
&= \exp\Big(\sum_{i=1}^{6} n_i \ln n - \sum_{i=1}^{6} n_i \ln n_i\Big) = \exp\Big[\sum_{i=1}^{6} n_i (\ln n - \ln n_i)\Big] \\
&= \exp\Big(-\sum_{i=1}^{6} n_i \ln \frac{n_i}{n}\Big) = \exp\Big[n\Big(-\sum_{i=1}^{6} \frac{n_i}{n} \ln \frac{n_i}{n}\Big)\Big] \\
&= \exp\Big[n H\Big(\frac{n_1}{n}, \dots, \frac{n_6}{n}\Big)\Big].
\end{aligned} \tag{3.4}$$

By the monotonicity of the exponential, under the constraint (3.1), maximizing (3.3) is approximately equivalent to maximizing $H(n_1/n, \dots, n_6/n)$, i.e. the entropy of the distribution to be determined. Thus, the distribution of maximum entropy is the one that can be realized in the greatest number of ways: since the only constraint we have is the mean number of spots showing, determining the frequencies (i.e. the PMF) by maximizing the entropy subject to this constraint is a very good idea, because in so doing our model leaves open the widest set of behaviors. Moreover, for large $n$, the overwhelming majority of all possible distributions compatible with our information have entropy very close to the maximum, and when $n \to \infty$ any frequency distribution other than the one of maximum entropy becomes highly atypical of those allowed by the constraints. This is the central result of Jaynes' Concentration Theorem in [1].
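The quality of the crude Stirling approximation behind (3.4) can be checked numerically. The sketch below compares the exact logarithm of the multinomial coefficient (3.3), computed through `math.lgamma`, with $nH$ for a fixed set of frequencies and growing $n$. The chosen frequencies are a hypothetical example.

```python
import math

def log_multinomial(counts):
    """Exact ln( n! / (n_1! ... n_6!) ), using lgamma(k + 1) = ln k!."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)

def entropy(freqs):
    """Shannon entropy in nats, with 0 ln 0 = 0."""
    return -sum(f * math.log(f) for f in freqs if f > 0)

# Hypothetical macrostates with fixed frequencies (1/12, 1/12, 1/6, 1/6, 1/4, 1/4).
ratios = []
for n in (60, 600, 6000, 60000):
    counts = [n // 12, n // 12, n // 6, n // 6, n // 4, n // 4]
    freqs = [c / n for c in counts]
    ratios.append(log_multinomial(counts) / (n * entropy(freqs)))

print(ratios)  # the ratio approaches 1 from below as n grows
```

The ratio tends to 1, which is the sense in which maximizing the number of microstates and maximizing the entropy become equivalent for large $n$.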

3.2

Formal approach

The formal framework of any maximum entropy method (ME) was introduced by Jaynes in [3] as follows. We discuss the univariate case for ease of treatment, without loss of generality. Consider a rv $X$, its sample space $\mathcal{X} = \{x_1, \dots, x_n\}$ and the three entities:

1. a valid probability distribution $p = \{p_i\}_{i=1,\dots,n}$, $\sum_{i=1}^{n} p_i = 1$;

2. a consistent entropy measure, for example that of Shannon, $H(p) = -\sum_{i=1}^{n} p_i \ln p_i$;

3. a set of linear constraints $\sum_{i=1}^{n} p_i\, g_r(x_i) = a_r$, $r = 1, \dots, m$.

Notice that, although widely used, Shannon's entropy measure is not the only one: what is really important is that we take a consistent measure of entropy, as discussed in Section 2.3. Furthermore, we remark that the constraints must be linear in $p$: they are usually moment constraints. We are faced with a constrained extreme problem (see Section 2.4) in which we have to maximize entropy subject to a set of linear constraints (which, with a multidimensional notation, we called $\Gamma$ in Section 2.4). The Lagrangian (2.18) of the problem is

$$L(p_1, \dots, p_n, \lambda_0, \dots, \lambda_m) = -\sum_{i=1}^{n} p_i \ln p_i - (\lambda_0 - 1)\Big[\sum_{i=1}^{n} p_i - 1\Big] - \sum_{r=1}^{m} \lambda_r \Big[\sum_{i=1}^{n} p_i\, g_r(x_i) - a_r\Big]. \tag{3.5}$$

Maximizing $L$, i.e. imposing

$$\begin{cases} \partial_{p_i} L = -(\ln p_i + 1) - \sum_{r=1}^{m} \lambda_r g_r(x_i) - (\lambda_0 - 1) = 0, & i = 1, \dots, n \\ \partial_{\lambda_r} L = -\big(\sum_{i=1}^{n} p_i\, g_r(x_i) - a_r\big) = 0, & r = 1, \dots, m \end{cases} \tag{3.6}$$

we obtain

$$p_i = \exp[-(\lambda_0 + \lambda_1 g_1(x_i) + \dots + \lambda_m g_m(x_i))], \qquad i = 1, \dots, n, \tag{3.7}$$

while the equations on the partial derivatives with respect to $\lambda_r$ simply lead back to the constraints. In order to determine the Lagrange multipliers, we substitute (3.7) into the normalization and constraint equations to get the system of $m + 1$ (nonlinear) equations in $m + 1$ unknowns

$$\begin{cases} e^{\lambda_0} = \sum_{i=1}^{n} \exp[-(\lambda_1 g_1(x_i) + \dots + \lambda_m g_m(x_i))] \\ a_r\, e^{\lambda_0} = \sum_{i=1}^{n} g_r(x_i) \exp[-(\lambda_1 g_1(x_i) + \dots + \lambda_m g_m(x_i))], & r = 1, \dots, m \end{cases} \tag{3.8}$$

which can be solved by numerical methods. Again, we remark that the continuous case can be treated by replacing sums with integrals, provided that the integrals involved converge; in the finite discrete case no such problem arises, since entropy is a bounded function, smooth in the interior of its domain.
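As an illustration of how (3.7) and (3.8) can be solved numerically, the following sketch treats the dice of Section 3.1.1 with the single constraint $g_1(x_i) = i$, $a_1 = \alpha$. With one constraint the system reduces to a scalar equation in $\lambda_1$, solved here by bisection. The value $\alpha = 4.5$, the bracketing interval and the function names are assumptions made for the example, and bisection is only one of many possible numerical methods.

```python
import math

FACES = range(1, 7)

def maxent_pmf(lam):
    """PMF of the form (3.7) with one constraint g_1(x_i) = i: p_i proportional to exp(-lam * i)."""
    weights = [math.exp(-lam * i) for i in FACES]
    z = sum(weights)  # plays the role of exp(lambda_0) in (3.8)
    return [w / z for w in weights]

def mean(p):
    return sum(i * pi for i, pi in zip(FACES, p))

def solve_lambda(alpha, lo=-20.0, hi=20.0, tol=1e-12):
    """Bisection on lambda_1; the mean of maxent_pmf is strictly decreasing in lambda."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(maxent_pmf(mid)) > alpha:
            lo = mid  # mean still too large: lambda must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam = solve_lambda(4.5)
p = maxent_pmf(lam)
p_fair = maxent_pmf(solve_lambda(3.5))
print(lam, p)
```

For $\alpha > 3.5$ the multiplier is negative and the probabilities increase with the face value; for $\alpha = 3.5$ the multiplier vanishes and the uniform distribution is recovered.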

3.2.1

The minimum discrimination information principle

The minimum discrimination information principle (MDI), due to Kullback, extends the framework introduced by Jaynes. Suppose we replace the entropy measure, the second entity, with the information divergence (2.13). Now we seek a constrained minimum instead of a maximum, but what is really interesting is that we have a fourth entity in the new framework: the distribution $q$. From the MDI point of view, ME seeks to determine the distribution $p$, among those that satisfy the constraints, for which $D(p\|u)$ is a minimum, with $u$ denoting the uniform distribution. Kullback's MDI extends this concept: it seeks to minimize the relative entropy $D(p\|q)$, i.e. to determine the distribution $p$ that satisfies the constraints and is closest to a given distribution $q$. This fourth entity, a sort of "settable reference distribution" that takes the place of the distribution of absolute maximum entropy (the uniform one), makes MDI more flexible than Jaynes' ME and allows, as we will see, interesting applications in contexts that seem to have little in common with probability distributions.
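A short sketch of how the derivation of Section 3.2 changes under MDI, assuming the same constraints and the same multiplier conventions as in (3.5) (the Lagrangian below is written by analogy with (3.5) and is not taken from the cited references): replacing $-H(p)$ with $D(p\|q)$ and looking for a free critical point gives

```latex
\begin{align*}
L(p, \lambda) &= \sum_{i=1}^{n} p_i \ln\frac{p_i}{q_i}
  + (\lambda_0 - 1)\Big[\sum_{i=1}^{n} p_i - 1\Big]
  + \sum_{r=1}^{m} \lambda_r \Big[\sum_{i=1}^{n} p_i\, g_r(x_i) - a_r\Big], \\
\partial_{p_i} L &= \ln\frac{p_i}{q_i} + 1 + (\lambda_0 - 1)
  + \sum_{r=1}^{m} \lambda_r\, g_r(x_i) = 0 \\
\Longrightarrow\quad p_i &= q_i \exp\big[-(\lambda_0 + \lambda_1 g_1(x_i) + \dots + \lambda_m g_m(x_i))\big].
\end{align*}
```

The solution is the reference distribution $q$ reweighted by the same exponential factor as in (3.7). When $q$ is uniform, the constant factor $q_i = 1/n$ is absorbed into $\lambda_0$ and the ME solution (3.7) is recovered.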


4

Covariance Selection

We now discuss the covariance selection theory introduced by Dempster in [4].

4.1

The problem

Suppose we are faced with the task of fitting a model known to be described by a multivariate normal distribution (2.9). Recall that the normal distribution has the welcome property of being fully determined by its second order description, i.e. its mean vector and covariance matrix, and in fact by the latter alone once it has been reduced to a zero-mean distribution. So the fitting procedure consists in determining the covariance structure

$$\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_{nn} \end{pmatrix}, \tag{4.1}$$

i.e. the set of parameters $\sigma_{ij}$, $i, j = 1, \dots, n$. Typically, we have a sample of $m$ $n$-variate observations $x_1, \dots, x_m$ and hence an estimated $n \times n$ sample covariance matrix $S$, derived using the formula

$$S = \frac{1}{m} \sum_{l=1}^{m} (x_l - \bar{x})(x_l - \bar{x})^T, \tag{4.2}$$

where the observations are taken as column vectors and

$$\bar{x} = \frac{1}{m} \sum_{l=1}^{m} x_l. \tag{4.3}$$

However, the computational ease with which the set of parameters can be estimated should not obscure how unwise such an estimation is when the data are limited. Hence, we identify a subset of parameters whose estimates from the data we trust and look for a valid completion of the covariance structure. The insight that underlies Dempster's covariance selection is the principle of parsimony in parametric model fitting, which suggests that parameters should be introduced only when the data indicate they are required. Note that what appears in (2.9) is not the covariance matrix $\Sigma$ but its inverse $\Sigma^{-1}$, so that parameter reduction may reasonably be attempted by setting certain $\sigma^{ij}$ to 0. Parameter reduction involves a tradeoff between benefits and costs: by annihilating a substantial number of parameters, the amount of noise in a fitted model due to estimation error is significantly reduced but, on the other hand, errors of misspecification are introduced because the null values are incorrect. Every decision to fit a model involves an implicit balance between these two kinds of errors.

4.2

A rule

Let $I$ be a subset of the index pairs $(i, j)$ with $1 \le i \le j \le n$ and $J$ the set of remaining pairs. Think of $J$ as the set of entries whose estimates we trust and of $I$ as the complementary set of parameters. The formal rule that makes the insight of the previous section concrete is the following.

Rule 1. Choose $\hat{\Sigma}$ to be the positive definite symmetric matrix such that $S$ and $\hat{\Sigma}$ are identical for index pairs $(i, j) \in J$ while $\hat{\Sigma}^{-1}$ is identically 0 for index pairs $(i, j) \in I$.

This choice, which we name Dempster's completion, may at first look less natural than setting the unspecified elements of $\Sigma$ to zero. It has nevertheless considerable advantages compared to the latter [4]. Dempster established the following far-reaching result.

Theorem 1. Assume that a symmetric, positive-definite completion of $\Sigma$ exists. Then there exists a unique Dempster's completion $\Sigma^{o}$. This completion maximizes the entropy

$$H(p) = -\int_{\mathbb{R}^n} \log(p(x))\, p(x)\, dx = \frac{1}{2} \log(\det \Sigma) + \frac{1}{2} n (1 + \log 2\pi) \tag{4.4}$$

among zero-mean Gaussian distributions having the prescribed elements $\sigma_{ij}$, $(i, j) \in J$.

Thus, Dempster's completion $\Sigma^{o}$ solves a maximum entropy problem, i.e., it maximizes entropy under linear constraints [7].
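Rule 1 and Theorem 1 can be illustrated on the smallest nontrivial case: a $3 \times 3$ covariance in which only the entry $(1, 3)$ is untrusted. For this specific pattern, setting the derivative of the determinant with respect to the missing entry to zero gives the closed form $\hat{\sigma}_{13} = \sigma_{12}\sigma_{23}/\sigma_{22}$; this formula is an assumption of the sketch, valid for this pattern only, and the matrix `S` is a made-up example.

```python
def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def completed(s, x):
    """Copy of s with the unspecified (1,3) and (3,1) entries set to x."""
    m = [row[:] for row in s]
    m[0][2] = m[2][0] = x
    return m

def cofactor_13(m):
    """Cofactor of entry (1,3); for symmetric m, the (1,3) entry of the inverse is this over det."""
    return m[1][0] * m[2][1] - m[1][1] * m[2][0]

# Hypothetical sample covariance: the (1,3) entry is the one we do not trust.
S = [[2.0, 1.0, None], [1.0, 2.0, 1.0], [None, 1.0, 2.0]]

x_star = S[0][1] * S[1][2] / S[1][1]   # closed form for this 3x3 pattern
sigma_hat = completed(S, x_star)

print(x_star, det3(sigma_hat), det3(completed(S, 0.0)))
```

The completed matrix has a vanishing $(1, 3)$ entry in its inverse, as Rule 1 requires, and a larger determinant (hence, by (4.4), a larger entropy) than the completion obtained by setting the missing entry to zero.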


4.3

Link to the general framework

Dempster's covariance selection revisits the earlier work of Jaynes from a different point of view. Instead of determining a probability distribution by solving a constrained extreme problem, Dempster thought in terms of parameter reduction, but the underlying idea is the same: given incomplete information on the model, a good way of fitting it is to leave open the widest spectrum of possible behaviors. This target is achieved in both cases, even if at first sight they appear to have little in common (indeed, maximum entropy seems to be a consequence in Dempster's work rather than a goal). The connection, however, is closer than it seems. The incomplete information on the covariance structure is nothing but a set of linear constraints on the distribution, while the a priori assumption that the distribution is a (multivariate) normal one is not restrictive, as it can be shown that if the linear constraints are the second order description (mean vector and covariance matrix) the maximum entropy distribution is normal [5]. Finally, the fact that Rule 1 leads to the maximum entropy normal distribution follows from Theorem 1, which summarizes Dempster's statistical theory.

4.3.1

Generalization to Matrix Completion Problems

Dempster's Covariance Selection is, in conclusion, just one instance, albeit a really important one, of matrix completion. There the original problem is the unreliability of the collected data: this is the reason why we start with a subset of the entries of the matrix and need to find a valid completion. As we will see in the following chapter, this entropic approach is well suited to other matrix completion contexts. We'll focus on a different original problem, that of reducing a significant computational burden. Observe in Theorem 1 that maximizing the entropy of a normal distribution is equivalent, apart from additive constants and positive factors and considering the monotonicity of the logarithm, to maximizing $\det \Sigma$: we can think of every symmetric, positive-definite matrix as the covariance structure of a multivariate normal distribution and apply Rule 1 to it. Furthermore, M. Pavon and A. Ferrante proved in [7] that symmetry and positive-definiteness are not necessary, since the constrained extremization of the determinant only involves the positive part of the matrix. Hence this approach can be extended to every matrix, including the rectangular case.


5

Quasi-Newton methods

In numerical analysis, Newton's method is an algorithm for finding successively better approximations to the roots of a smooth, real-valued function. The idea of the method is as follows: one starts with an initial guess which is reasonably close to the true root, then the function is approximated by its tangent line (which can be computed using the tools of calculus), and one computes the x-intercept of this tangent line (which is easily done with elementary algebra). This x-intercept will typically be a better approximation to the function's root than the original guess, and the method can be iterated.

5.1

Newton’s step

Consider, for ease of exposition, the unidimensional case. Let $X \subset \mathbb{R}$ be a compact set and $f : X \to \mathbb{R}$ a differentiable function. Suppose we have some current approximation for the position of one root, say $x_k$. Then the formula for a better approximation $x_{k+1}$ is derived from the definition of the derivative as follows:

$$f'(x_k) = \frac{\Delta y}{\Delta x} = \frac{f(x_k) - 0}{x_k - x_{k+1}}, \qquad k \ge 0. \tag{5.1}$$

Then by simple algebra we get

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}, \qquad k \ge 0. \tag{5.2}$$

We should start with some arbitrary initial value $x_0$: the closer to the root, the better. In the absence of any intuition about where the zero might lie, we could try different initial values spread over a reasonably small interval, chosen by appealing to the intermediate value theorem.
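The iteration (5.2) translates directly into a few lines of code. The sketch below applies it to the hypothetical function $f(x) = x^2 - 2$, whose positive root is $\sqrt{2}$; the function name `newton_root`, the tolerance and the iteration cap are assumptions of the example.

```python
def newton_root(f, df, x0, tol=1e-12, max_iter=50):
    """Iterate x_{k+1} = x_k - f(x_k) / f'(x_k) until the step is below tol."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Hypothetical example: f(x) = x^2 - 2 with derivative f'(x) = 2x.
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```

Starting from $x_0 = 1$ the iterates converge in a handful of steps; a starting point where $f'$ vanishes (here $x_0 = 0$) would make the step undefined, which is one reason why the choice of $x_0$ matters.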

5.1.1

Minimization and maximization problems

Newton's method can be easily extended to maxima and minima problems: it's sufficient to ask for $f$ to be twice differentiable and look for the roots of its first derivative, according to Fermat's theorem on stationary points:

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}, \qquad k \ge 0. \tag{5.3}$$

5.1.1.1

The multivariate case

In a multivariate context (by far the most interesting case, on which we'll concentrate in the next section), i.e. $X \subset \mathbb{R}^n$, $f : X \to \mathbb{R}$, and under the hypothesis $f \in C^2$, (5.3) becomes

$$x_{k+1} = x_k - [H_f(x_k)]^{-1} \nabla f(x_k), \qquad k \ge 0, \tag{5.4}$$

where $\nabla f(x_k)$ and $H_f(x_k)$ are respectively the gradient and the Hessian matrix of $f$ at $x_k$.
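A minimal two-dimensional sketch of the step (5.4) follows. The test function $f(x, y) = \cosh x + xy + y^2$ is a hypothetical choice: its Hessian has determinant $2\cosh x - 1 \ge 1$, so it is positive definite everywhere, and its unique minimum is at the origin. The $2 \times 2$ inverse is written explicitly to keep the example free of external libraries.

```python
import math

def grad(x, y):
    """Gradient of f(x, y) = cosh(x) + x*y + y^2."""
    return (math.sinh(x) + y, x + 2.0 * y)

def hessian(x, y):
    """Hessian of the same f; positive definite for every (x, y)."""
    return ((math.cosh(x), 1.0), (1.0, 2.0))

def newton_minimize(x, y, tol=1e-12, max_iter=50):
    for _ in range(max_iter):
        gx, gy = grad(x, y)
        (a, b), (c, d) = hessian(x, y)
        det = a * d - b * c
        # step = H^{-1} * grad, using the explicit inverse of a 2x2 matrix.
        sx = (d * gx - b * gy) / det
        sy = (-c * gx + a * gy) / det
        x, y = x - sx, y - sy
        if math.hypot(sx, sy) < tol:
            break
    return x, y

x_min, y_min = newton_minimize(1.0, 1.0)
print(x_min, y_min)
```

Each iteration needs the full Hessian and the solution of a linear system, which is exactly the cost that the approximations of the next section try to avoid.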

5.2

Approximation

In the execution of the algorithm, the most expensive part (computationally speaking) is computing, storing and inverting the Hessian. Quasi-Newton methods (QN) seek to approximate the Hessian matrix (or its inverse) for the $k$th step by accumulating information from the preceding steps, using only first derivatives (or their finite-difference approximations) [8]. Consider the second order Taylor expansion

$$f(x_k + \Delta x_k) \approx f(x_k) + \nabla f(x_k)^T \Delta x_k + \frac{1}{2} \Delta x_k^T H_f(x_k) \Delta x_k, \qquad \Delta x_k = x_{k+1} - x_k. \tag{5.5}$$

Taking the gradient of both sides with respect to $\Delta x_k$, we get

$$\nabla f(x_k + \Delta x_k) \approx \nabla f(x_k) + H_f(x_k) \Delta x_k. \tag{5.6}$$

Let $B_k$ be an approximation of $H_f(x_k)$ ($B_0$ is usually taken to be the identity). In QN one employs Newton's step (5.4) with $B_k$ in place of $H_f(x_k)$ and, in view of (5.6), requires the next approximation $B_{k+1}$ to satisfy the secant equation

$$\nabla f(x_k + \Delta x_k) = \nabla f(x_k) + B_{k+1} \Delta x_k. \tag{5.7}$$

In more than one dimension, the secant equation is underdetermined. Various methods are used to find a symmetric $B_{k+1}$ closest (according to some metric) to the current approximation $B_k$ and satisfying (5.7). The underlying idea in all QN methods is to avoid calculating the Hessian at every Newton step, approximating it by rank one (or rank two) updates specified by gradient evaluations. Historically, remarkable examples of QN are the DFP formula of Davidon–Fletcher–Powell (the first updating scheme proposed), BFGS of Broyden–Fletcher–Goldfarb–Shanno and the SR1 (Symmetric Rank 1) method.
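As an illustration of a rank two update, the sketch below implements the standard BFGS formula for the Hessian approximation, $B_{k+1} = B_k - \frac{B_k s s^T B_k}{s^T B_k s} + \frac{y y^T}{y^T s}$, with $s = \Delta x_k$ and $y = \nabla f(x_{k+1}) - \nabla f(x_k)$, and checks that the result satisfies the secant equation $B_{k+1} s = y$. The formula is not spelled out in the text above, and the vectors `s` and `y` are made-up values chosen so that $y^T s > 0$.

```python
def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bfgs_update(b, s, y):
    """B_{k+1} = B - (B s)(B s)^T / (s^T B s) + y y^T / (y^T s), for symmetric B."""
    bs = mat_vec(b, s)
    sbs = dot(s, bs)
    ys = dot(y, s)   # must be positive to preserve positive definiteness
    n = len(s)
    return [[b[i][j] - bs[i] * bs[j] / sbs + y[i] * y[j] / ys
             for j in range(n)] for i in range(n)]

B0 = [[1.0, 0.0], [0.0, 1.0]]   # usual initial choice: the identity
s = [1.0, 2.0]                  # hypothetical step Delta x_k
y = [3.0, 1.0]                  # hypothetical gradient difference
B1 = bfgs_update(B0, s, y)

print(B1, mat_vec(B1, s))
```

The updated matrix stays symmetric and, because $y^T s > 0$ in this example, positive definite.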

5.2.1

Entropy approach

Consider now the case where $f$ is a strongly convex function, i.e.

$$\exists \alpha > 0 : \qquad H_f(x_k) > \alpha I_n, \qquad \forall k > 0; \tag{5.8}$$

in this case, $B_k$ should be positive definite. Recall from Section 2.3.1 the definition of relative entropy and its interpretation. In the case of two zero-mean multivariate normal distributions $p$, $q$ with covariance matrices $P$, $Q$ respectively, (2.13) has a closed form:

$$\begin{aligned}
D(p\|q) &= \int \log \frac{p(x)}{q(x)}\, p(x)\, dx \\
&= \int \log\Big\{ \frac{|P|^{-1/2}}{|Q|^{-1/2}} \exp\Big[-\frac{1}{2} x^T (P^{-1} - Q^{-1}) x\Big] \Big\}\, p(x)\, dx \\
&= \log |P Q^{-1}|^{-1/2} + \int \Big[-\frac{1}{2} x^T (P^{-1} - Q^{-1}) x\Big] p(x)\, dx \\
&= \frac{1}{2} \log |P^{-1} Q| + \frac{1}{2} \int x^T (Q^{-1} - P^{-1}) x\, p(x)\, dx \\
&= \frac{1}{2} \Big[\log |P^{-1} Q| + \int \mathrm{tr}\big((Q^{-1} - P^{-1})\, x x^T\big)\, p(x)\, dx\Big] \\
&= \frac{1}{2} \Big[\log |P^{-1} Q| + \mathrm{tr}\Big((Q^{-1} - P^{-1}) \int x x^T p(x)\, dx\Big)\Big] \\
&= \frac{1}{2} \big[\log |P^{-1} Q| + \mathrm{tr}\big((Q^{-1} - P^{-1}) P\big)\big] \\
&= \frac{1}{2} \big[\log |P^{-1} Q| + \mathrm{tr}(Q^{-1} P - I_n)\big] \\
&= \frac{1}{2} \big[\log |P^{-1} Q| + \mathrm{tr}(Q^{-1} P) - n\big].
\end{aligned} \tag{5.9}$$

Notice that $D(p\|q)$ depends only on the covariance matrices $P$, $Q$ and so, with an abuse of notation, we introduce

$$D(P\|Q) = \frac{1}{2} \big[\log |P^{-1} Q| + \mathrm{tr}(Q^{-1} P) - n\big], \tag{5.10}$$

which can be thought of as a (pseudo) distance between (symmetric and positive definite) matrices. This result gives rise to an important application of the MDI of Section 3.2.1, which we know to be a refinement of the original ME. Consider the minimization problem

$$\min D(B_{k+1}\|B_k) \tag{5.11}$$

subject to the linear constraint

$$B_{k+1} \Delta x_k = \nabla f(x_{k+1}) - \nabla f(x_k). \tag{5.12}$$

Here we are faced with the task of finding the matrix $B_{k+1}$ nearest to the current approximation $B_k$ of the Hessian, i.e. its update, according to the generalized entropic approach of Section 4.3.1 and subject to the linear constraint given by the secant equation (5.12). The Lagrangian of the problem is

$$L(B_{k+1}, \lambda_{k+1}) = \frac{1}{2} \big[\log |B_{k+1}^{-1} B_k| + \mathrm{tr}(B_k^{-1} B_{k+1}) - n\big] + \lambda_{k+1}^T \big[B_{k+1} \Delta x_k - \nabla f(x_{k+1}) + \nabla f(x_k)\big]. \tag{5.13}$$

Imposing $\delta L(B_{k+1}, \lambda_{k+1}; \delta B) = 0$ for all $\delta B$ we get

$$(B_{k+1})^{-1} = B_k^{-1} + 2 \Delta x_k \lambda_{k+1}^T. \tag{5.14}$$

This is the step on which it's possible to construct iterative schemes to update cyclically $B_{k+1}^{-1}$ and $\lambda_k$. Note that in (5.14) $(B_{k+1})^{-1}$ is a rank one update of $B_k^{-1}$, in the same spirit as the low-rank updates of conventional QN methods.

The maximum entropy approach shows in this application all its versatility: we're not considering a model fitting task but an optimization one. We should not forget, however, that what underlies (5.10) is a (pseudo) distance defined on the space of probability distributions, and that the matrices involved in $D(\cdot\|\cdot)$ must be thought of as the covariance matrices of multivariate normal distributions. It is not by chance that at the beginning of this section we required the function $f$ to be strongly convex in the region of interest: this gives the Hessian positive-definiteness in addition to symmetry, which is a property of every Hessian matrix of a $C^2$ function.
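The closed form (5.10) is straightforward to evaluate. The following sketch is restricted to the $2 \times 2$ case, so that determinant and inverse can be written explicitly; the matrices `P` and `Q` and the helper names are hypothetical.

```python
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def gaussian_kl(p, q):
    """D(P||Q) of (5.10) for 2x2 symmetric positive definite P, Q (n = 2)."""
    log_det_term = math.log(det2(q) / det2(p))   # log |P^{-1} Q|
    qinv_p = mat_mul(inv2(q), p)
    trace = qinv_p[0][0] + qinv_p[1][1]          # tr(Q^{-1} P)
    return 0.5 * (log_det_term + trace - 2)

P = [[2.0, 0.5], [0.5, 1.0]]
Q = [[1.0, 0.0], [0.0, 1.0]]

d_pq = gaussian_kl(P, Q)
d_qp = gaussian_kl(Q, P)
print(d_pq, d_qp, gaussian_kl(P, P))
```

As in the discrete case, the quantity is nonnegative, vanishes when the two matrices coincide and is not symmetric in its arguments, which is why it is only a pseudo-distance.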


References

[1] Edwin T. Jaynes, On the Rationale of Maximum-Entropy Methods, Proceedings of the IEEE, vol. 70, 1982.
[2] H.K. Kesavan, J.N. Kapur, The Generalized Maximum Entropy Principle, IEEE Transactions on Systems, Man, and Cybernetics, vol. 19, 1989.
[3] Edwin T. Jaynes, Information Theory and Statistical Mechanics, Physical Review, vol. 106, 1957.
[4] A. P. Dempster, Covariance Selection, Biometrics, vol. 28, 1972.
[5] Thomas M. Cover, Joy A. Thomas, Elements of Information Theory, Wiley, 1991.
[6] S. Kullback, R. A. Leibler, On Information and Sufficiency, Annals of Mathematical Statistics, vol. 22, 1951.
[7] M. Pavon, A. Ferrante, Matrix Completion à la Dempster by the Principle of Parsimony, IEEE Transactions on Information Theory, 2011.
[8] P.E. Gill, W. Murray, Numerical Methods for Constrained Optimization, Academic Press, 1974.
