# Maximum Entropy


Maximum Entropy. Information Theory 2013, Lecture 9, Chapter 12

Tohid Ardeshiri

May 22, 2013

Why Maximum Entropy distribution?

$$\max_{f(x)} h(f) \quad \text{subject to} \quad E\,r(X) = \alpha$$

• The temperature of a gas corresponds to the expected squared velocity of its molecules. What about the distribution of the velocity?
• How are the velocities of the molecules distributed in the presence of gravity, subject to a total energy constraint?

1. Maxwell-Boltzmann distribution,
2. Exponential distribution of the air density in the atmosphere in the vertical direction,
3. $\tfrac{2}{5}$th in kinetic energy and $\tfrac{3}{5}$th in potential energy,
4. The distribution of the velocities is independent of the height of the molecule.

Outline

This lecture will cover
• Maximum Entropy Distributions.
• Anomalous Maximum Entropy Problem.
• Spectrum Estimation.
• Entropy of a Gaussian Process.
• Burg’s Entropy Theorem.

All illustrations are borrowed from the book, Wikipedia, and the lecture given by Thomas M. Cover at Stanford: http://classx.stanford.edu/ClassX/system/users/web/pg/view_subject.php?subject=EE376B_SPRING_2010_2011

Maximum Entropy Distributions

Maximize the entropy $h(f)$ over all probability densities $f$ satisfying

1. $f(x) \ge 0$, with equality outside the support set $S$,
2. $\int_S f(x)\,dx = 1$,
3. $\int_S f(x)\,r_i(x)\,dx = \alpha_i$, for $1 \le i \le m$.

Example 1: $S = (-\infty, \infty)$, $EX = 0$, $EX^2 = \sigma^2$ $\Rightarrow$ $f(x) = \mathcal{N}(x; 0, \sigma^2)$

Example 2: $S = [0, +\infty)$, $EX = \lambda$ $\Rightarrow$ $f(x) = \mathrm{Exp}(x; \lambda^{-1})$

Example 3: $S = [a, b]$, no constraint $\Rightarrow$ $f(x) = \mathcal{U}(x; a, b)$
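Example 1 can be sanity-checked with closed-form entropies. The sketch below compares the Gaussian with two other zero-mean densities of the same variance; the Laplace and uniform comparison densities and the value `sigma2 = 2.0` are illustrative choices, not part of the lecture.

```python
import math

def gaussian_entropy(sigma2):
    """Differential entropy (nats) of N(0, sigma2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma2)

def laplace_entropy(sigma2):
    """Entropy (nats) of a Laplace density with scale b, whose variance is 2 b^2."""
    b = math.sqrt(sigma2 / 2)
    return 1 + math.log(2 * b)

def uniform_entropy(sigma2):
    """Entropy (nats) of a uniform density of width w, whose variance is w^2 / 12."""
    w = math.sqrt(12 * sigma2)
    return math.log(w)

sigma2 = 2.0  # assumed common variance for the comparison
entropies = {
    "gaussian": gaussian_entropy(sigma2),
    "laplace": laplace_entropy(sigma2),
    "uniform": uniform_entropy(sigma2),
}
for name, value in entropies.items():
    print(f"{name:8s} h = {value:.4f} nats")
```

All three densities satisfy the same two moment constraints, and the Gaussian has the largest entropy of the three, as the example claims.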

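Finding the solution using Calculus

Maximize the entropy $h(f)$ over all probability densities $f$ satisfying

1. $f(x) \ge 0$, with equality outside the support set $S$,
2. $\int_S f(x)\,dx = 1$,
3. $\int_S f(x)\,r_i(x)\,dx = \alpha_i$, for $1 \le i \le m$.

Entropy is a concave function defined over a convex set. Form the functional

$$J(f) = -\int f \ln f + \lambda_0 \int f + \sum_{i=1}^{m} \lambda_i \int r_i f$$

and differentiate with respect to $f(x)$:

$$\frac{\partial J}{\partial f(x)} = -\ln f(x) - 1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x)$$

Setting the derivative to zero gives

$$f(x) = e^{-1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x)}$$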
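In practice the multipliers are found by solving the constraint equations numerically. The following is a minimal sketch under assumed settings that are not in the lecture: support $S = [0, 1]$, a single constraint $r_1(x) = x$ with target mean `alpha`, a midpoint grid for the integrals, and bisection on $\lambda_1$ (the mean of $e^{\lambda_1 x}$ on $[0, 1]$ increases with $\lambda_1$). Normalization absorbs $\lambda_0 - 1$.

```python
import numpy as np

def maxent_density_on_unit_interval(alpha, n_grid=20000):
    """Return (lam, x, f) with f(x) proportional to exp(lam * x) on [0, 1] and mean alpha."""
    dx = 1.0 / n_grid
    x = (np.arange(n_grid) + 0.5) * dx  # midpoints of the grid cells

    def mean_for(lam):
        # Shifting the exponent by 0.5 avoids overflow; the constant cancels in the ratio.
        w = np.exp(lam * (x - 0.5))
        return np.sum(x * w) / np.sum(w)

    lo, hi = -200.0, 200.0  # assumed bracket for lam
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < alpha:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)

    w = np.exp(lam * x)
    f = w / (np.sum(w) * dx)  # normalize so that the density integrates to 1
    return lam, x, f

lam, x, f = maxent_density_on_unit_interval(alpha=0.3)
dx = x[1] - x[0]
print("lambda_1 =", lam)
print("mean     =", np.sum(x * f) * dx)
print("entropy  =", -np.sum(f * np.log(f)) * dx)
```

With `alpha = 0.5` the multiplier comes out at zero and the density reduces to the uniform density of Example 3.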

Theorem 12.1.1: Maximum Entropy Distribution

Theorem: Let $f^*(x) = e^{-1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x)}$, $x \in S$, where $\lambda_0, \lambda_1, \ldots, \lambda_m$ are chosen so that $f^*$ satisfies

1. $f(x) \ge 0$, with equality outside the support set $S$,
2. $\int_S f(x)\,dx = 1$,
3. $\int_S f(x)\,r_i(x)\,dx = \alpha_i$, for $1 \le i \le m$.

Then $f^*$ uniquely maximizes $h(f)$ over all probability densities $f$ satisfying the constraints.

Proof using Information Inequality

Let $g$ be any density satisfying the constraints of Theorem 12.1.1. Then

$$\begin{aligned}
h(g) &= -\int_S g \ln g = -\int_S g \ln\left(\frac{g}{f^*}\, f^*\right) \\
&= -D(g\|f^*) - \int_S g \ln f^* \\
&\le -\int_S g \ln f^* \\
&= -\int_S g \left(-1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i\right) \\
&= -\int_S f^* \left(-1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i\right) \\
&= -\int_S f^* \ln f^* = h(f^*).
\end{aligned}$$

The inequality is the information inequality $D(g\|f^*) \ge 0$. Replacing $g$ by $f^*$ in the next-to-last step is allowed because both densities satisfy the same normalization and moment constraints.

Note: Equality holds iff $D(g\|f^*) = 0$, i.e., $g = f^*$ except on a set of measure 0, which gives uniqueness.
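The key step of the proof, that $-\int g \ln f^*$ equals $h(f^*)$ whenever $g$ meets the constraints, can be checked numerically. This sketch assumes a concrete pair that is not in the lecture: $f^* = \mathcal{N}(0, 1)$ and $g$ a Laplace density with mean 0 and variance 1, integrated on a midpoint grid over $[-40, 40]$.

```python
import numpy as np

# Midpoint grid; both densities are negligible outside [-40, 40].
n = 80000
dx = 80.0 / n
x = -40.0 + (np.arange(n) + 0.5) * dx

f_star = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # N(0, 1), the maximum entropy density
b = 1 / np.sqrt(2)                                # Laplace scale giving variance 2 b^2 = 1
g = np.exp(-np.abs(x) / b) / (2 * b)              # competitor with the same mean and variance

def integrate(values):
    return float(np.sum(values) * dx)

# Closed-form log density avoids log(0) where f_star underflows in the tails.
ln_f_star = -x**2 / 2 - 0.5 * np.log(2 * np.pi)

h_g = -integrate(g * np.log(g))
h_f = -integrate(f_star * ln_f_star)
cross = -integrate(g * ln_f_star)                 # -∫ g ln f*
kl = integrate(g * (np.log(g) - ln_f_star))       # D(g || f*)

print(f"h(g) = {h_g:.4f}, h(f*) = {h_f:.4f}, D(g||f*) = {kl:.4f}")
```

Because $\ln f^*$ is a quadratic in $x$, the cross term depends on $g$ only through its first two moments, which is why it matches $h(f^*)$.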

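Anomalous Maximum Entropy Problem

Maximize the entropy $h(f)$ over all probability densities $f$ satisfying

$$\int_{-\infty}^{+\infty} f(x)\,dx = 1, \qquad \int_{-\infty}^{+\infty} x f(x)\,dx = \alpha_1, \qquad \int_{-\infty}^{+\infty} x^2 f(x)\,dx = \alpha_2.$$

The solution has the form $f(x) = e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2}$, that is, $\mathcal{N}(\alpha_1, \alpha_2 - \alpha_1^2)$.

Now add the third-moment constraint

$$\int_{-\infty}^{+\infty} x^3 f(x)\,dx = \alpha_3.$$

The candidate form becomes $f(x) = e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3}$, which cannot be normalized over the real line when $\lambda_3 \ne 0$, because the cubic term makes the exponent grow without bound in one direction. Unless $\alpha_3$ happens to equal the third moment of the Gaussian, the maximum is not attained, but the supremum is unchanged:

$$\sup h(f) = h\big(\mathcal{N}(\alpha_1, \alpha_2 - \alpha_1^2)\big) = \frac{1}{2} \ln 2\pi e(\alpha_2 - \alpha_1^2)$$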
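The failure to normalize can be seen by integrating the candidate form over growing intervals. The sketch below is only an illustration with assumed values $\lambda_0 = \lambda_1 = 0$, $\lambda_2 = -1/2$ and $\lambda_3 = 0.01$; the function name `truncated_integral` is hypothetical.

```python
import numpy as np

def truncated_integral(lam3, L, n=400000):
    """Midpoint-rule integral of exp(-x^2/2 + lam3 * x^3) over [-L, L]."""
    dx = 2.0 * L / n
    x = -L + (np.arange(n) + 0.5) * dx
    return float(np.sum(np.exp(-x**2 / 2 + lam3 * x**3)) * dx)

lam3 = 0.01  # assumed small cubic coefficient
integrals = {L: truncated_integral(lam3, L) for L in (20, 40, 60)}
gaussian_mass = truncated_integral(0.0, 20)  # lam3 = 0 recovers sqrt(2 * pi)

for L, value in integrals.items():
    print(f"L = {L:2d}: integral = {value:.6g}")
```

For moderate `L` the integral looks like a slightly perturbed Gaussian mass, but once `L` passes the point where the cubic term overtakes the quadratic (here $x = 50$), the integral explodes, so no choice of $\lambda_0$ can normalize the density on the whole real line.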

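Entropy rates of a Gaussian Process

The differential entropy rate of a stochastic process $\{X_i\}$, $X_i \in \mathbb{R}$, is

$$h(X) = \lim_{n\to\infty} \frac{1}{n} h(X_1, X_2, \ldots, X_n) = \lim_{n\to\infty} h(X_n \mid X_{n-1}, \ldots, X_1),$$

where both limits exist and agree for a stationary process.

Since the stochastic process is Gaussian, the conditional distribution is also Gaussian, and hence $h(X_n \mid X_{n-1}, \ldots, X_1) = \frac{1}{2} \log 2\pi e \sigma^2$, where $\sigma^2$ is the conditional variance of $X_n$ given $X_{n-1}, \ldots, X_1$. Therefore $\lim_{n\to\infty} h(X_n \mid X_{n-1}, \ldots, X_1) = \frac{1}{2} \log 2\pi e \sigma_\infty^2$, where $\sigma_\infty^2$ is the variance of the error in the best estimate of $X_n$ given the infinite past. Thus

$$h(X) = \frac{1}{2} \log 2\pi e \sigma_\infty^2$$

The entropy rate corresponds to the minimum mean-squared error of the best estimator of a sample of the process given the infinite past:

$$\sigma_\infty^2 = \frac{1}{2\pi e}\, 2^{2h(X)}$$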

Entropy rates of a Gaussian Process II

For a stationary Gaussian stochastic process we have

$$h(X_1, X_2, \ldots, X_n) = \frac{1}{2} \log (2\pi e)^n |K^{(n)}|$$

where $K^{(n)}_{ij} = R(i - j) = E(X_i - E X_i)(X_j - E X_j)$.

Kolmogorov has shown that

$$h(X) = \frac{1}{2} \log(2\pi e) + \frac{1}{4\pi} \int_{-\pi}^{\pi} \log S(\lambda)\,d\lambda$$
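The three expressions for the entropy rate can be compared on a process where everything is known in closed form. The sketch assumes a first-order autoregressive Gaussian process $X_i = -a X_{i-1} + Z_i$ with $Z_i \sim \mathcal{N}(0, \sigma^2)$, for which $R(k) = \sigma^2 (-a)^{|k|} / (1 - a^2)$, $S(\lambda) = \sigma^2 / |1 + a e^{-i\lambda}|^2$ and $\sigma_\infty^2 = \sigma^2$. The values of `a`, `sigma2`, `n` and the grid size are arbitrary choices; logarithms are base 2.

```python
import numpy as np

# Assumed example process (not from the slides): X_i = -a X_{i-1} + Z_i, Z_i ~ N(0, sigma2).
a, sigma2 = 0.6, 2.0

# 1) Closed form: the prediction error variance given the infinite past is sigma2.
h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma2)

# 2) Finite-n formula: (1/n) * (1/2) log (2 pi e)^n |K^(n)| with K_ij = R(i - j).
n = 400
idx = np.arange(n)
lags = np.abs(np.subtract.outer(idx, idx))
K = sigma2 * (-a) ** lags / (1 - a**2)
sign, logdet = np.linalg.slogdet(K)  # natural-log determinant, stable for large n
h_finite = 0.5 * (n * np.log2(2 * np.pi * np.e) + logdet / np.log(2)) / n

# 3) Kolmogorov's formula; the mean over a uniform grid equals (1 / 2 pi) times the integral.
m = 4096
lam = -np.pi + (np.arange(m) + 0.5) * (2 * np.pi / m)
S = sigma2 / np.abs(1 + a * np.exp(-1j * lam)) ** 2
h_kolmogorov = 0.5 * np.log2(2 * np.pi * np.e) + np.mean(np.log2(S)) / 2

print(f"closed form : {h_closed:.6f} bits")
print(f"finite n    : {h_finite:.6f} bits")
print(f"Kolmogorov  : {h_kolmogorov:.6f} bits")
```

The finite-$n$ value sits slightly above the other two because the first sample has no past to condition on; the gap shrinks like $1/n$.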

Spectrum estimation

• Autocorrelation function for a stationary zero-mean stochastic process $\{X_i\}$: $R(k) = E X_i X_{i+k}$.
• Power spectral density: $S(\lambda) = \sum_{m=-\infty}^{\infty} R(m) e^{-im\lambda}$, $-\pi < \lambda \le \pi$, is indicative of the structure of the process.
• Periodogram, truncating and windowing:

$$\hat{R}(k) = \frac{1}{n-k} \sum_{i=1}^{n-k} X_i X_{i+k}$$

• Burg suggested that, instead of setting the autocorrelations at high lags to zero, we set them to values that make the fewest assumptions about the data, i.e., values that maximize the entropy rate of the process.
• Burg assumed the process to be stationary and Gaussian and found that the process that maximizes the entropy subject to the correlation constraints is an autoregressive Gaussian process of appropriate order.
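The sample autocorrelation estimator $\hat{R}(k)$ above can be sketched directly. The function name `sample_autocorrelation` and the toy input are illustrative; the data are assumed to be zero mean already, as on the slide.

```python
import numpy as np

def sample_autocorrelation(x, max_lag):
    """Estimate R(k) = E X_i X_{i+k} for k = 0..max_lag by averaging the n - k available products."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[: n - k], x[k:]) / (n - k) for k in range(max_lag + 1)])

r_hat = sample_autocorrelation([1.0, 2.0, 3.0, 4.0], max_lag=2)
print(r_hat)
```

Each lag $k$ is averaged over only $n - k$ products, so the estimates at high lags rest on few samples and are noisy.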

Burg’s Maximum Entropy Theorem

Theorem: The maximum entropy rate stochastic process $\{X_i\}$ satisfying the constraints

$$E X_i X_{i+k} = \alpha_k, \qquad k = 0, 1, \ldots, p, \quad \text{for all } i, \tag{1}$$

is the $p$th-order Gauss-Markov process of the form

$$X_i = -\sum_{k=1}^{p} a_k X_{i-k} + Z_i,$$

where the $Z_i$ are i.i.d. $\sim \mathcal{N}(0, \sigma^2)$ and $a_1, a_2, \ldots, a_p, \sigma^2$ are chosen to satisfy (1).

Remark: We do not assume that $\{X_i\}$ is

1. zero mean,
2. Gaussian, or
3. wide-sense stationary.
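One way to choose $a_1, \ldots, a_p$ and $\sigma^2$ so that (1) holds is to multiply the recursion by $X_{i-l}$ and take expectations, which gives the linear system $\sum_{k=1}^{p} \alpha_{|l-k|}\, a_k = -\alpha_l$ for $l = 1, \ldots, p$, together with $\sigma^2 = \alpha_0 + \sum_{k=1}^{p} a_k \alpha_k$. The sketch below solves that system with a dense solver; the function name `yule_walker` is a label chosen here, and a valid (positive definite) set of $\alpha_k$ with $p \ge 1$ is assumed.

```python
import numpy as np

def yule_walker(alpha):
    """Return (a, sigma2) for X_i = -sum_k a_k X_{i-k} + Z_i matching alpha_0..alpha_p."""
    alpha = np.asarray(alpha, dtype=float)
    p = len(alpha) - 1
    idx = np.arange(p)
    R = alpha[np.abs(np.subtract.outer(idx, idx))]  # p x p Toeplitz matrix with entries alpha_|l-k|
    a = np.linalg.solve(R, -alpha[1:])
    sigma2 = alpha[0] + np.dot(a, alpha[1:])
    return a, sigma2

# Autocorrelations of the AR(1) process X_i = -0.5 X_{i-1} + Z_i with unit noise variance.
a, sigma2 = yule_walker([4.0 / 3.0, -2.0 / 3.0])
print("a =", a, "sigma2 =", sigma2)
```

The example recovers the coefficient 0.5 and noise variance 1 of the process that generated the two autocorrelation values.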

Proof of Burg’s Theorem I

• Let $X_1, X_2, \ldots, X_n$ be any stochastic process that satisfies the constraints.
• Let $Z_1, Z_2, \ldots, Z_n$ be a Gaussian process with the same covariance matrix as $X_1, X_2, \ldots, X_n$.
• Let $Y_1, Y_2, \ldots, Y_n$ be a $p$th-order Gauss-Markov process with the same distribution as $Z_1, Z_2, \ldots, Z_n$ for all orders up to $p$.
• Recall that the multivariate normal distribution maximizes the entropy over all vector-valued random variables under a covariance constraint.
• Recall that conditioning reduces the entropy.
• Since the conditional entropy depends only on the $p$th-order distribution, $h(Z_i \mid Z_{i-1}, Z_{i-2}, \ldots, Z_{i-p}) = h(Y_i \mid Y_{i-1}, Y_{i-2}, \ldots, Y_{i-p})$ and $h(Z_1, \ldots, Z_p) = h(Y_1, \ldots, Y_p)$.

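Proof of Burg’s Theorem II

$$\begin{aligned}
h(X_1, X_2, \ldots, X_n) &\le h(Z_1, Z_2, \ldots, Z_n) \\
&= h(Z_1, \ldots, Z_p) + \sum_{i=p+1}^{n} h(Z_i \mid Z_{i-1}, Z_{i-2}, \ldots, Z_1) \\
&\le h(Z_1, \ldots, Z_p) + \sum_{i=p+1}^{n} h(Z_i \mid Z_{i-1}, Z_{i-2}, \ldots, Z_{i-p}) \\
&= h(Y_1, \ldots, Y_p) + \sum_{i=p+1}^{n} h(Y_i \mid Y_{i-1}, Y_{i-2}, \ldots, Y_{i-p}) \\
&= h(Y_1, \ldots, Y_n)
\end{aligned}$$

where the last equality follows by the Markovity of $Y_i$.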

Proof of Burg’s Theorem III

Dividing by $n$ and taking the limit, we obtain

$$\lim_{n\to\infty} \frac{1}{n} h(X_1, X_2, \ldots, X_n) \le \lim_{n\to\infty} \frac{1}{n} h(Y_1, \ldots, Y_n) = \frac{1}{2} \log 2\pi e \sigma^2$$

which is the entropy rate of the Gauss-Markov process. Hence, the maximum entropy rate stochastic process satisfying the constraints is the $p$th-order Gauss-Markov process satisfying the constraints.
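The second inequality of the proof can be illustrated in the smallest nontrivial case. The sketch assumes $p = 1$ with constraints $\alpha_0 = 1$, $\alpha_1 = 0.5$ and a Gaussian segment of length $n = 3$, so the only free quantity is the lag-2 autocorrelation $R(2)$. It scans candidate values of $R(2)$ and compares the maximizer with the first-order Gauss-Markov extension $R(2) = \alpha_1^2 / \alpha_0$. All numbers are illustrative.

```python
import numpy as np

r0, r1 = 1.0, 0.5  # assumed constraints alpha_0 and alpha_1 (p = 1)

def gaussian_entropy_n3(r2):
    """h(X1, X2, X3) in bits for a zero-mean Gaussian with Toeplitz covariance (r0, r1, r2)."""
    K = np.array([[r0, r1, r2], [r1, r0, r1], [r2, r1, r0]])
    sign, logdet = np.linalg.slogdet(K)
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * (3 * np.log2(2 * np.pi * np.e) + logdet / np.log(2))

# Candidate values of R(2) that keep the covariance positive definite.
candidates = np.linspace(-0.4, 0.9, 1301)
entropies = np.array([gaussian_entropy_n3(r2) for r2 in candidates])
best_r2 = candidates[np.argmax(entropies)]

print("entropy-maximizing R(2):", best_r2)
print("Gauss-Markov extension :", r1**2 / r0)
```

The scan peaks at the Gauss-Markov value, in line with the claim that among Gaussian processes with the same first $p + 1$ autocorrelations the $p$th-order Markov one has the largest entropy.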

A bare-bones summary of the proof

• The entropy of a finite segment of a stochastic process is bounded above by the entropy of a segment of a Gaussian random process with the same covariance structure.
• This entropy is in turn bounded above by the entropy of the minimal-order Gauss-Markov process satisfying the given covariance constraints.
• Such a process exists and has a convenient characterization by means of the Yule-Walker equations.
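To connect the summary back to spectrum estimation, the sketch below solves the Yule-Walker equations for an assumed set of constraints and then evaluates the spectrum of the resulting autoregressive process, $S(\lambda) = \sigma^2 / |1 + \sum_{k=1}^{p} a_k e^{-ik\lambda}|^2$. That closed form is the standard spectral density of such a process and is not stated on the slides; the constraint values and grid size are arbitrary. As a check, the autocorrelations are recovered from $S(\lambda)$ by numerical integration.

```python
import numpy as np

alpha = np.array([1.0, 0.5, 0.1])  # assumed constraints alpha_0, alpha_1, alpha_2 (p = 2)
p = len(alpha) - 1

# Yule-Walker equations for X_i = -sum_k a_k X_{i-k} + Z_i.
idx = np.arange(p)
R = alpha[np.abs(np.subtract.outer(idx, idx))]
a = np.linalg.solve(R, -alpha[1:])
sigma2 = alpha[0] + a @ alpha[1:]

# Spectrum of the fitted autoregressive process on a midpoint grid over (-pi, pi).
n = 4096
lam = -np.pi + (np.arange(n) + 0.5) * (2 * np.pi / n)
k = np.arange(1, p + 1)
S = sigma2 / np.abs(1 + np.exp(-1j * np.outer(lam, k)) @ a) ** 2

# R(m) = (1 / 2 pi) * integral of S(lambda) cos(m lambda); the grid mean approximates it.
recovered = np.array([np.mean(S * np.cos(m * lam)) for m in range(p + 1)])

print("a =", a, "sigma2 =", sigma2)
print("recovered autocorrelations:", recovered)
```

The recovered values match the constraints, while the autocorrelations at lags beyond $p$ are whatever the autoregressive recursion implies rather than zero.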