Outline
● A new kind of prior
● Information entropy: a measure of amount of uncertainty
● Shannon's derivation
● Wallis derivation
● Maximum entropy distributions
● Objections against maximum entropy

A new kind of prior
● Ex.: Translating the English “in” into French
● p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
● From analyzing texts, we know
– p(dans) + p(en) = 3/10
– p(dans) + p(à) = 1/2
● Cannot use principle of indifference
● Goal: Assign a probability distribution as uniform as possible while agreeing with constraints.
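As a rough illustration of the goal above, the following sketch searches for the most uniform distribution, measured by the entropy −∑ p_i log p_i, that agrees with the three constraints. It assumes the constraints leave one free parameter t = p(dans) and that the two unconstrained words share the remaining mass equally; the helper names (`entropy`, `distribution`, `maximize`) are made up for this example and are not from the slides.

```python
import math

def entropy(p):
    """Entropy -sum p_i log p_i (natural log), with 0 log 0 taken as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def distribution(t):
    """Distribution over (dans, en, à, au cours de, pendant) for p(dans) = t.

    The constraints fix p(en) = 3/10 - t and p(à) = 1/2 - t. The remaining
    mass 1/5 + t is split equally between the two unconstrained words
    (assumption: an equal split is the most uniform choice for them).
    """
    rest = (0.2 + t) / 2
    return [t, 0.3 - t, 0.5 - t, rest, rest]

def maximize(f, lo, hi, iters=200):
    """Ternary search for the maximum of a concave function on [lo, hi]."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if f(a) < f(b):
            lo = a
        else:
            hi = b
    return (lo + hi) / 2

# t must stay in [0, 3/10] so that every probability is non-negative.
t_star = maximize(lambda t: entropy(distribution(t)), 0.0, 0.3)
p_star = distribution(t_star)
```

Setting the derivative of the entropy with respect to t to zero gives t² − 1.8 t + 0.3 = 0, so the search should land near t = 0.9 − √0.51 ≈ 0.186.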

What doesn't work
● Maximizing variance
– Leads to unjustified solutions
● Minimizing sum of squares
– May end up with negative $p_i$
● “Fixing” them is not an option
– Different principles of reasoning for different constraint values
– Assigns zero probability to situations that are not ruled out by prior information
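The negative-probability problem can be seen in a small case. The sketch below assumes three outcomes with values 1, 2, 3, a normalization constraint, and a prescribed mean F; this setup and the function name `min_sum_of_squares` are illustrative choices, not taken from the slides. The Lagrange conditions for minimizing ∑ p_i² make p_i linear in i, which yields the closed form used here.

```python
from fractions import Fraction

def min_sum_of_squares(F):
    """Minimizer of sum p_i^2 subject to sum p_i = 1 and sum i*p_i = F, i = 1, 2, 3.

    Stationarity gives 2*p_i = a + b*i, i.e. p_i is linear in i. Solving the
    two linear constraints for a and b gives the expressions below.
    Nothing in this procedure enforces p_i >= 0.
    """
    F = Fraction(F)
    return [(8 - 3 * F) / 6, Fraction(1, 3), (3 * F - 4) / 6]

# A mean in the middle of the range gives the uniform distribution ...
uniform_case = min_sum_of_squares(2)
# ... but a mean close to the upper end forces p_1 below zero.
extreme_case = min_sum_of_squares(Fraction(29, 10))
```

In this setup the first probability is negative whenever F > 8/3, and the third whenever F < 4/3, even though such means are perfectly achievable by valid distributions.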

A measure of uncertainty
Requirements for a measure of uncertainty of a probability distribution:
(1) Measure is a real-valued function $H(p_1, \ldots, p_n)$
(2) Continuity: A small change in $p_i$ may cause only a small change in uncertainty
(3) Common sense: More possibilities → more uncertainty. Formally: $h(n) \le h(n+1)$, where $h(n) = H(1/n, \ldots, 1/n)$ with $1/n$ repeated n times
(4) Consistency: All ways of working out H need to yield the same value

Functional equations for H
● Given two alternatives with probabilities $p_1$, $q$
– Uncertainty: $H(p_1, q)$
● Second alternative really consists of two different alternatives with probabilities $p_2$, $p_3$ (so $q = p_2 + p_3$)
● What's $H(p_1, p_2, p_3)$?
$$H(p_1, p_2, p_3) = H(p_1, q) + q\,H\!\left(\frac{p_2}{q}, \frac{p_3}{q}\right)$$

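Functional equations for H (cont.)
● Generalization
– n alternatives with probabilities $p_i$
– Group them into composite propositions
[Figure: alternatives grouped into composite propositions $w_1$, $w_2$, $w_3$, with $w_1 = p_1 + p_2 + p_3$, $w_2 = \ldots$]
● With $w_1 = p_1 + \cdots + p_k$, $w_2 = p_{k+1} + \cdots + p_{k+m}$, ...:
$$H(p_1, \ldots, p_n) = H(w_1, \ldots, w_r) + w_1 H\!\left(\frac{p_1}{w_1}, \ldots, \frac{p_k}{w_1}\right) + w_2 H\!\left(\frac{p_{k+1}}{w_2}, \ldots, \frac{p_{k+m}}{w_2}\right) + \cdots$$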
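The grouping rule can be checked numerically. The sketch below anticipates the form H = −∑ p_i log p_i that the derivation arrives at and compares both sides of the rule for one arbitrary partition; the probabilities and the function names are hypothetical, chosen only for the check.

```python
import math

def H(p):
    """Entropy -sum p_i log p_i (natural log), with 0 log 0 taken as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def grouped_entropy(groups):
    """Right-hand side of the grouping rule for a partition into groups.

    Each group is a list of probabilities; w_j is the total probability of
    group j, and the inner H is taken over the conditional probabilities.
    """
    w = [sum(g) for g in groups]
    total = H(w)
    for w_j, g in zip(w, groups):
        total += w_j * H([x / w_j for x in g])
    return total

# Hypothetical example: six alternatives grouped into three composites.
groups = [[0.1, 0.2, 0.1], [0.25, 0.05], [0.3]]
p = [x for g in groups for x in g]

direct = H(p)
via_groups = grouped_entropy(groups)
```

Both values agree up to floating-point rounding, which is what requirement (4), consistency, demands of any candidate H.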

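Deriving h
● Consider rational $p_i = n_i / N$, $N = \sum_j n_j$
– Imagine $p_i$ stands for a composition of $n_i$ propositions with equal probabilities.
[Figure: $p_1$, $p_2$, $p_3$, $p_4$ each split into equally probable propositions, with $p_1 = 3/13$, $p_2 = \ldots$]
● Then
$$h(N) = h\!\left(\sum_j n_j\right) = H(p_1, \ldots, p_n) + \sum_i p_i\, h(n_i)$$
● If all $n_i = m$, we get $h(mn) = h(m) + h(n)$
● This is solved by $h(n) = K \log n$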
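A quick numerical check of these relations, assuming K = 1 and the entropy form −∑ p_i log p_i that the derivation leads to. The counts below are hypothetical; only the first one (3 out of 13) echoes the example on the slide.

```python
import math

def H(p):
    """Entropy -sum p_i log p_i (natural log), with 0 log 0 taken as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def h(n, K=1.0):
    """Uncertainty of n equally probable propositions, h(n) = K log n."""
    return K * math.log(n)

# Hypothetical counts n_i with N = 13, so p_1 = 3/13 as in the slide's example.
n_counts = [3, 4, 2, 4]
N = sum(n_counts)
p = [n_i / N for n_i in n_counts]

# h(N) = H(p_1, ..., p_n) + sum_i p_i h(n_i)
lhs = h(N)
rhs = H(p) + sum(p_i * h(n_i) for p_i, n_i in zip(p, n_counts))

# Special case with all n_i = m: h(m n) = h(m) + h(n)
additive_gap = h(4 * 6) - (h(4) + h(6))
```

`lhs` and `rhs` coincide, and `additive_gap` vanishes up to rounding, for any positive K.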

Finally, a measure of uncertainty
● Using the functional equations and $h(n) = \log n$:
$$H(p_1, \ldots, p_n) = -\sum_i p_i \log p_i$$
● H is called information entropy
– Not to be confused with experimental entropy
● Showed only necessity
– Proof of uniqueness is in the book ;)

Wallis derivation
● Goal: Assign probabilities $p_i$ to m different propositions subject to constraints
● Game:
– Distribute the n ≫ m quanta of probability randomly among the m propositions; if proposition i receives $n_i$ quanta, then $p_i = n_i / n$
– Check if the resulting assignment satisfies the constraints
– If yes: done, else: repeat game.

Wallis derivation (cont.)
● What's the probability of getting a specific assignment?
– Multinomial distribution: $m^{-n} \cdot W$, where $W = \dfrac{n!}{n_1! \cdots n_m!}$
– Larger W ⇒ more likely result
● As n → ∞: $\dfrac{1}{n} \log W \to H(p_1, \ldots, p_m)$
● Thus, the most likely assignment is the one that maximizes entropy
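The limit can be watched numerically. This sketch fixes a hypothetical assignment p = (0.5, 0.3, 0.2), computes log W through the log-gamma function to avoid huge factorials, and measures how far (1/n) log W is from H as n grows. The function names and the chosen numbers are illustrative only.

```python
import math

def log_multiplicity(counts):
    """log W for W = n! / (n_1! ... n_m!), using lgamma(k + 1) = log k!."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)

def H(p):
    """Entropy -sum p_i log p_i (natural log), with 0 log 0 taken as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Hypothetical assignment p = (0.5, 0.3, 0.2), expressed in tenths so that
# the counts n_i = n * p_i are exact integers whenever n is a multiple of 10.
tenths = [5, 3, 2]
p = [t / 10 for t in tenths]

def entropy_gap(n):
    """H(p) - (1/n) log W for the counts n_i = n * p_i."""
    counts = [n * t // 10 for t in tenths]
    return H(p) - log_multiplicity(counts) / n

gaps = [entropy_gap(n) for n in (10, 100, 1000, 10000, 100000)]
```

The gap shrinks roughly like (log n)/n, consistent with Stirling's approximation.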

Maximum entropy distributions
● Let's put the measure to work
● Given propositions $A_1, \ldots, A_n$, variable x can take corresponding values $x_1, \ldots, x_n$
● Of course, we want $\sum_i p_i = 1$
● Our prior tells us that $F_k = \langle f_k(x) \rangle = \sum_i p_i f_k(x_i)$
– I.e., the expected values for the functions $f_k$ are given

Maximum entropy distributions (cont.)
● Using the Lagrange method, it is shown that
$$p_i = \exp\!\left(-\lambda_0 - \sum_{j=1}^{m} \lambda_j f_j(x_i)\right)$$
– $\lambda_j$ are Lagrange multipliers
– $\lambda_j$ are chosen so that the constraints are satisfied
● Alternative derivation of $p_i$ shows that it indeed maximizes H
– Necessary because the Lagrange method doesn't work if the maximum is at a cusp
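For a single constraint (m = 1, f(x) = x) the multiplier can be found by simple bisection. The sketch below assumes a six-sided die with prescribed mean 4.5, an example chosen here for illustration rather than taken from the slides; `maxent_mean` and its search bounds are likewise hypothetical. The normalization plays the role of λ_0, since exp(−λ_0) = 1/Z.

```python
import math

def maxent_mean(xs, F, lo=-50.0, hi=50.0, iters=200):
    """Maximum-entropy p_i = exp(-lam * x_i) / Z whose mean equals F.

    Assumes min(xs) < F < max(xs) and that the multiplier lies in [lo, hi],
    which is adequate for small values such as die faces.
    """
    def dist(lam):
        weights = [math.exp(-lam * x) for x in xs]
        Z = sum(weights)  # Z = exp(lambda_0) enforces sum p_i = 1
        return [w / Z for w in weights]

    def mean(lam):
        return sum(p * x for p, x in zip(dist(lam), xs))

    # The mean decreases monotonically in lam (its derivative is minus the
    # variance), so bisection finds the multiplier matching F.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) > F:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return lam, dist(lam)

faces = [1, 2, 3, 4, 5, 6]
lam, p = maxent_mean(faces, 4.5)
```

With more than one constraint the same idea applies, but the multipliers have to be solved for jointly, for instance with a multidimensional root finder.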

Objections

Round 1
● “'Maximum uncertainty' is a negative thing which can't possibly lead to any useful predictions.”
● This is a “play on words”
● The principle doesn't create “new” uncertainty, it merely tries to avoid unwarranted assumptions

Round 2
● “Probabilities obtained by MAXENT are irrelevant to physical predictions because they have nothing to do with frequencies.”
– “The probability distribution which maximizes the entropy is numerically identical with the frequency distribution which can be realized in the greatest number of ways.”
– “If the information incorporated into the maximum entropy analysis includes all the constraints actually operating in the random experiment, then the distribution predicted by maximum entropy is overwhelmingly the most likely to be observed experimentally.”

Round 3
● “The principle only works when the constraints are averages; in practice, they are real measurements, and not averages over anything.” [?]
● The principle also works for other constraints
● If there are constraints on the width of the distribution, we can incorporate them

Round 4
● “Different people have different information, so the results are basically arbitrary.”
● Consider Mr A and Mr B; Mr B has some additional information that Mr A hasn't
– If Mr B's additional information is implied by Mr A's information, they will arrive at the same distribution
– If Mr B's additional information is contradictory to his previous information, no distribution can be found
– If Mr B's additional information is neither redundant nor contradictory, his distribution will indeed have a lower entropy

“The principle of maximum entropy is not an oracle telling which predictions must be right; it is a rule for inductive reasoning that tells us which predictions are most strongly indicated by our present information.”
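The third case for Mr A and Mr B can be made concrete with the translation example. In this hypothetical sketch, Mr A knows only that the five probabilities sum to 1, while Mr B additionally knows p(dans) + p(en) = 3/10. The maximum entropy distribution for each is written down directly, using the fact that entropy is largest when probability is spread evenly within each group of alternatives that the information does not distinguish.

```python
import math

def H(p):
    """Entropy -sum p_i log p_i (natural log), with 0 log 0 taken as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Order: dans, en, à, au cours de, pendant.
# Mr A: normalization only, so the uniform distribution.
mr_a = [1 / 5] * 5

# Mr B: also knows p(dans) + p(en) = 3/10. The mass 3/10 is split evenly
# over (dans, en), and the remaining 7/10 evenly over the other three.
mr_b = [0.15, 0.15] + [0.7 / 3] * 3

entropy_a = H(mr_a)
entropy_b = H(mr_b)
```

Mr B's entropy comes out lower than Mr A's log 5: his extra information is neither redundant nor contradictory, so it narrows the distribution.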

The end.

JAYNES