Choosing the Regularization Parameter

At our disposal: several regularization methods, based on filtering of the SVD components. ... Problem formulation: balance the fi...


1 Perspectives on regularization

2 The discrepancy principle

3 Generalized cross validation (GCV)

4 The L-curve criterion

5 The NCP method

Intro to Inverse Problems

Chapter 5

Reg. Parameter Choice

1 / 33

Once Again: Tikhonov Regularization

Focus on Tikhonov regularization; ideas carry over to many other methods. Recall that the Tikhonov solution $x_\lambda$ solves the problem

\[ \min_x \left\{ \|A x - b\|_2^2 + \lambda^2 \|x\|_2^2 \right\} , \]

and that it is formally given by

\[ x_\lambda = (A^T A + \lambda^2 I)^{-1} A^T b = A_\lambda^\# b , \]

where $A_\lambda^\# = (A^T A + \lambda^2 I)^{-1} A^T$ is a “regularized inverse.”

Our noise model:

\[ b = b^{\mathrm{exact}} + e , \]

where $b^{\mathrm{exact}} = A x^{\mathrm{exact}}$ and $e$ is the error.
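As a rough illustration of the two expressions for $x_\lambda$, the NumPy sketch below computes the Tikhonov solution once from the SVD (using the filter factors $\sigma_i^2/(\sigma_i^2+\lambda^2)$) and once from the normal equations. The test operator `green_matrix`, the problem size, and the noise level are assumptions made for this example and do not come from the slides.

```python
import numpy as np

def green_matrix(n):
    """Hypothetical test operator: midpoint discretization on [0,1] of the
    kernel K(s,t) = s(t-1) for s < t and t(s-1) otherwise."""
    h = 1.0 / n
    t = (np.arange(n) + 0.5) * h
    S, T = np.meshgrid(t, t, indexing="ij")
    return h * np.where(S < T, S * (T - 1.0), T * (S - 1.0)), t

def tikhonov_svd(A, b, lam):
    """x_lam = V diag(sigma_i / (sigma_i^2 + lam^2)) U^T b."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T @ (s / (s**2 + lam**2) * (U.T @ b))

rng = np.random.default_rng(0)
n = 32
A, t = green_matrix(n)
x_exact = np.sin(np.pi * t)
b = A @ x_exact + 1e-4 * rng.standard_normal(n)   # b = b_exact + e

lam = 1e-3
x_svd = tikhonov_svd(A, b, lam)
# Same solution from the formal expression (A^T A + lam^2 I)^{-1} A^T b.
x_ne = np.linalg.solve(A.T @ A + lam**2 * np.eye(n), A.T @ b)
print(np.linalg.norm(x_svd - x_ne))
```

The SVD route is convenient when many values of $\lambda$ are to be tried, since the factorization is computed only once.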

Classical and Pragmatic Parameter-Choice

Assume we are given the problem $A x = b$ with $b = b^{\mathrm{exact}} + e$ and $b^{\mathrm{exact}} = A x^{\mathrm{exact}}$, and that we have a strategy for choosing the regularization parameter $\lambda$ as a function of the “noise level” $\|e\|_2$. Then classical parameter-choice analysis is concerned with the convergence rates of $x_\lambda \to x^{\mathrm{exact}}$ as $\|e\|_2 \to 0$ and $\lambda \to 0$. This is an important and natural requirement for algorithms that choose $\lambda$.

Our focus here is on the typical situation in practice: the norm $\|e\|_2$ is not known, and the errors are fixed (it is not practical to repeat the measurements).

The pragmatic approach to choosing the regularization parameter is based on the forward error, or on the backward/prediction error.

An Example (Image of Io, a Moon of Jupiter)

[Figure: five image panels labeled Exact, Blurred, λ too large, λ ≈ ok, and λ too small.]

Perspectives on Regularization

Problem formulation: balance the fit (residual) and the size of solution.

\[ x_\lambda = \arg\min_x \left\{ \|A x - b\|_2^2 + \lambda^2 \|L x\|_2^2 \right\} \]

Cannot be used for choosing $\lambda$.

Forward error: balance regularization errors and perturbation errors.

\[ x^{\mathrm{exact}} - x_\lambda = x^{\mathrm{exact}} - A_\lambda^\# (b^{\mathrm{exact}} + e) = (I - A_\lambda^\# A)\, x^{\mathrm{exact}} - A_\lambda^\# e . \]

Backward/prediction error: balance contributions from the exact data and the perturbation.

\[ b^{\mathrm{exact}} - A x_\lambda = b^{\mathrm{exact}} - A A_\lambda^\# (b^{\mathrm{exact}} + e) = (I - A A_\lambda^\#)\, b^{\mathrm{exact}} - A A_\lambda^\# e . \]

More About the Forward Error

The forward error in the SVD basis:

\[ x^{\mathrm{exact}} - x_\lambda = x^{\mathrm{exact}} - V \Phi^{[\lambda]} \Sigma^{-1} U^T b \]
\[ = x^{\mathrm{exact}} - V \Phi^{[\lambda]} \Sigma^{-1} U^T A x^{\mathrm{exact}} - V \Phi^{[\lambda]} \Sigma^{-1} U^T e \]
\[ = V (I - \Phi^{[\lambda]}) V^T x^{\mathrm{exact}} - V \Phi^{[\lambda]} \Sigma^{-1} U^T e . \]

The first term is the regularization error:

\[ \Delta x_{\mathrm{bias}} = V (I - \Phi^{[\lambda]}) V^T x^{\mathrm{exact}} = \sum_{i=1}^n \left(1 - \varphi_i^{[\lambda]}\right) (v_i^T x^{\mathrm{exact}})\, v_i , \]

and we recognize this as (minus) the bias term. The second error term is the perturbation error:

\[ \Delta x_{\mathrm{pert}} = V \Phi^{[\lambda]} \Sigma^{-1} U^T e . \]
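The split of the forward error into a regularization part and a perturbation part can be checked numerically. The sketch below does this for Tikhonov filter factors $\varphi_i^{[\lambda]} = \sigma_i^2/(\sigma_i^2+\lambda^2)$; the test operator, the exact solution, and the noise level are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 32
h = 1.0 / n
t = (np.arange(n) + 0.5) * h
S, T = np.meshgrid(t, t, indexing="ij")
A = h * np.where(S < T, S * (T - 1.0), T * (S - 1.0))   # hypothetical test operator
x_exact = t * (1.0 - t)
e = 1e-4 * rng.standard_normal(n)
b = A @ x_exact + e

U, s, Vt = np.linalg.svd(A)
lam = 1e-3
phi = s**2 / (s**2 + lam**2)                        # Tikhonov filter factors

x_lam = Vt.T @ (phi / s * (U.T @ b))                # regularized solution
dx_bias = Vt.T @ ((1.0 - phi) * (Vt @ x_exact))     # regularization error
dx_pert = Vt.T @ (phi / s * (U.T @ e))              # perturbation error

# x_exact - x_lam should equal dx_bias - dx_pert up to rounding.
print(np.linalg.norm(x_exact - x_lam - (dx_bias - dx_pert)))
print(np.linalg.norm(dx_bias), np.linalg.norm(dx_pert))
```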

Regularization and Perturbation Errors: TSVD

For TSVD solutions, the regularization and perturbation errors take the form

\[ \Delta x_{\mathrm{bias}} = \sum_{i=k+1}^{n} (v_i^T x^{\mathrm{exact}})\, v_i , \qquad \Delta x_{\mathrm{pert}} = \sum_{i=1}^{k} \frac{u_i^T e}{\sigma_i}\, v_i . \]

We use the truncation parameter $k$ to prevent the perturbation error from blowing up (due to the division by the small singular values), at the cost of introducing bias in the regularized solution.

A “good” choice of the truncation parameter $k$ should balance these two components of the forward error (see next slide).

The behavior of $\|x_k\|_2$ and $\|A x_k - b\|_2$ is closely related to these errors; see the analysis in §5.1.
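A small sketch of how the two TSVD error norms behave as functions of $k$ is given below. Because the two error components are orthogonal, the total forward error satisfies $\|x^{\mathrm{exact}} - x_k\|_2^2 = \|\Delta x_{\mathrm{bias}}\|_2^2 + \|\Delta x_{\mathrm{pert}}\|_2^2$. The test problem is a hypothetical one chosen for this example; in practice $x^{\mathrm{exact}}$ and $e$ are of course unknown.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 32
h = 1.0 / n
t = (np.arange(n) + 0.5) * h
S, T = np.meshgrid(t, t, indexing="ij")
A = h * np.where(S < T, S * (T - 1.0), T * (S - 1.0))   # hypothetical test operator
x_exact = np.sin(np.pi * t) + 0.5 * np.sin(3 * np.pi * t)
e = 1e-4 * rng.standard_normal(n)
b = A @ x_exact + e

U, s, Vt = np.linalg.svd(A)
coef_x = Vt @ x_exact            # v_i^T x_exact
coef_e = (U.T @ e) / s           # u_i^T e / sigma_i

ks = np.arange(n + 1)
bias = np.array([np.linalg.norm(coef_x[k:]) for k in ks])   # ||dx_bias|| for each k
pert = np.array([np.linalg.norm(coef_e[:k]) for k in ks])   # ||dx_pert|| for each k
total = np.sqrt(bias**2 + pert**2)                          # ||x_exact - x_k||
k_best = int(np.argmin(total))
print(k_best, bias[k_best], pert[k_best])
```

The bias norm can only decrease with $k$ and the perturbation norm can only increase, so the minimizer of the total error sits where the two are of comparable size.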

The Regularization and Perturbation Errors

The norm of the regularization and perturbation error for TSVD as a function of the truncation parameter k. The two different errors approximately balance each other for k = 11.

The TSVD Residual

Let $k_\eta$ denote the index that marks the transition between decaying and flat coefficients $|u_i^T b|$. Due to the discrete Picard condition, the coefficients $|u_i^T b|/\sigma_i$ will also decay, on the average, for all $i < k_\eta$.

\[ k < k_\eta: \quad \|A x_k - b\|_2^2 \approx \sum_{i=k+1}^{k_\eta} (u_i^T b)^2 + (n - k_\eta)\,\eta^2 \approx \sum_{i=k+1}^{k_\eta} (u_i^T b^{\mathrm{exact}})^2 \]

\[ k > k_\eta: \quad \|A x_k - b\|_2^2 \approx (n - k)\,\eta^2 . \]

For $k < k_\eta$ the residual norm decreases steadily with $k$. For $k > k_\eta$ it decreases much more slowly. The transition between the two types of behavior occurs at $k = k_\eta$, when the regularization and perturbation errors are balanced.

The Discrepancy Principle

Recall that $E(\|e\|_2) \approx n^{1/2} \eta$. We should ideally choose $k$ such that $\|A x_k - b\|_2 \approx (n - k)^{1/2} \eta$.

The discrepancy principle (DP) seeks to combine this: assume we have an upper bound $\delta_e$ for the noise level, then solve

\[ \|A x_\lambda - b\|_2 = \tau\, \delta_e , \qquad \text{where} \quad \|e\|_2 \le \delta_e \]

and $\tau$ is some parameter, $\tau = O(1)$. See next slide.

A statistician’s point of view. Write $x_\lambda = A_\lambda^\# b$ and assume that $\mathrm{Cov}(b) = \eta^2 I$; choose the $\lambda$ that solves

\[ \|A x_\lambda - b\|_2 = \left( \|e\|_2^2 - \eta^2\, \mathrm{trace}(A A_\lambda^\#) \right)^{1/2} . \]

Note that the right-hand side now depends on $\lambda$.

Both versions of the DP are very sensitive to the estimate $\delta_e$.
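One possible way to solve the discrepancy equation $\|A x_\lambda - b\|_2 = \tau\,\delta_e$ for Tikhonov regularization is bisection on $\log\lambda$, since the residual norm increases monotonically with $\lambda$. The sketch below assumes a square nonsingular test matrix, takes $\delta_e = \|e\|_2$ as known exactly (an idealization), and uses $\tau = 1$; the function names and search interval are hypothetical choices for this example.

```python
import numpy as np

def residual_norm(s, beta, lam):
    """||A x_lam - b||_2 for square nonsingular A, with beta = U^T b."""
    return np.linalg.norm(lam**2 / (s**2 + lam**2) * beta)

def discrepancy_lambda(s, beta, target, lo=1e-10, hi=1e2, iters=100):
    """Bisection on log(lambda) for residual_norm(lambda) = target."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)                 # midpoint in log scale
        if residual_norm(s, beta, mid) < target:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(4)
n = 32
h = 1.0 / n
t = (np.arange(n) + 0.5) * h
S, T = np.meshgrid(t, t, indexing="ij")
A = h * np.where(S < T, S * (T - 1.0), T * (S - 1.0))   # hypothetical test operator
x_exact = np.sin(np.pi * t)
e = 1e-4 * rng.standard_normal(n)
b = A @ x_exact + e

U, s, Vt = np.linalg.svd(A)
beta = U.T @ b
tau, delta_e = 1.0, np.linalg.norm(e)          # assumed known here
lam_dp = discrepancy_lambda(s, beta, tau * delta_e)
x_dp = Vt.T @ (s / (s**2 + lam_dp**2) * beta)
print(lam_dp, np.linalg.norm(A @ x_dp - b), tau * delta_e)
```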

Illustration of the Discrepancy Principle

The choice $\|A x_k - b\|_2 \approx (n - k_\eta)^{1/2} \eta$ leads to a too large value of the truncation parameter $k$, while the more conservative choice $\|A x_k - b\|_2 \approx \|e\|_2$ leads to a better value of $k$.

The L-Curve for Tikhonov Regularization

Recall that the L-curve is a log-log plot of the solution norm versus the residual norm, with $\lambda$ as the parameter.

Parameter-Choice and the L-Curve

Recall that the L-curve basically consists of two parts.

A “flat” part where the regularization error dominates.

A “steep” part where the perturbation error dominates.

The optimal regularization parameter (in the pragmatic sense) must lie somewhere near the L-curve’s corner.

The component $b^{\mathrm{exact}}$ dominates when $\lambda$ is large:
$\|x_\lambda\|_2 \approx \|x^{\mathrm{exact}}\|_2$ (constant),
$\|b - A x_\lambda\|_2$ increases with $\lambda$.

The error $e$ dominates when $\lambda$ is small:
$\|x_\lambda\|_2$ increases with $\lambda^{-1}$,
$\|b - A x_\lambda\|_2 \approx \|e\|_2$ (constant).

The L-Curve Criterion

The flat and the steep parts of the L-curve represent solutions that are dominated by regularization errors and perturbation errors, respectively.

The balance between these two errors must occur near the L-curve’s corner.

The two parts, and the corner, are emphasized in log-log scale.

Log-log scale is insensitive to scalings of $A$ and $b$.

An operational definition of the corner is required. Write the L-curve as

\[ \left( \log \|A x_\lambda - b\|_2 ,\; \log \|x_\lambda\|_2 \right) \]

and seek the point with maximum curvature.

The Curvature of the L-Curve

We want to derive an analytical expression for the L-curve’s curvature $\hat c_\lambda$ in log-log scale. Define

\[ \xi = \|x_\lambda\|_2^2 , \qquad \rho = \|A x_\lambda - b\|_2^2 \]

and

\[ \hat\xi = \log \xi , \qquad \hat\rho = \log \rho . \]

Then the curvature is given by

\[ \hat c_\lambda = 2\, \frac{\hat\rho'\, \hat\xi'' - \hat\rho''\, \hat\xi'}{\left( (\hat\rho')^2 + (\hat\xi')^2 \right)^{3/2}} , \]

where a prime denotes differentiation with respect to $\lambda$.

This can be used to define the “corner” of the L-curve as the point with maximum curvature.

Illustration

An L-curve and the corresponding curvature $\hat c_\lambda$ as a function of $\lambda$. The corner, which corresponds to the point with maximum curvature, is marked by the red circle; it occurs for $\lambda_L = 4.86 \cdot 10^{-3}$.

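A More Practical Formula

The first derivatives of $\hat\xi$ and $\hat\rho$ satisfy

\[ \hat\xi' = \xi'/\xi , \qquad \hat\rho' = \rho'/\rho , \qquad \rho' = -\lambda^2 \xi' . \]

The second derivatives satisfy

\[ \hat\xi'' = \frac{\xi''\, \xi - (\xi')^2}{\xi^2} , \qquad \hat\rho'' = \frac{\rho''\, \rho - (\rho')^2}{\rho^2} , \]

and they are interrelated by

\[ \rho'' = \frac{d}{d\lambda}\left( -\lambda^2 \xi' \right) = -2 \lambda\, \xi' - \lambda^2 \xi'' . \]

When all this is inserted into the equation for $\hat c_\lambda$, the terms with $\xi''$ cancel and we get

\[ \hat c_\lambda = 2\, \frac{\xi \rho}{|\xi'|}\, \frac{\lambda^2 \xi' \rho + 2 \lambda\, \xi \rho + \lambda^4 \xi\, \xi'}{\left( \lambda^4 \xi^2 + \rho^2 \right)^{3/2}} . \]

Here the denominator follows from $(\hat\rho')^2 + (\hat\xi')^2 = (\xi')^2 (\lambda^4 \xi^2 + \rho^2)/(\xi^2 \rho^2)$, and the absolute value comes from $((\xi')^2)^{3/2} = |\xi'|^3$ (note that $\xi' < 0$).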

Efficient Computation of the Curvature

The quantities $\xi$ and $\rho$ are readily available. It is straightforward to show that

\[ \xi' = \frac{4}{\lambda}\, x_\lambda^T z_\lambda , \]

where $z_\lambda$ is given by

\[ z_\lambda = \left( A^T A + \lambda^2 I \right)^{-1} A^T (A x_\lambda - b) , \]

i.e., $z_\lambda$ is the solution to the problem

\[ \min_z \left\| \begin{pmatrix} A \\ \lambda I \end{pmatrix} z - \begin{pmatrix} A x_\lambda - b \\ 0 \end{pmatrix} \right\|_2 . \]

This can be used to compute $z_\lambda$ efficiently, when we already have a factorization of the coefficient matrix.
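The sketch below evaluates the log-log curvature from $\xi$, $\rho$, and $\xi' = (4/\lambda)\, x_\lambda^T z_\lambda$, and compares it with a finite-difference evaluation of the definition $2(\hat\rho'\hat\xi'' - \hat\rho''\hat\xi')/((\hat\rho')^2 + (\hat\xi')^2)^{3/2}$. The closed form used in `curvature` is the one obtained by substituting $\rho' = -\lambda^2\xi'$ into that definition, which gives $\lambda^4\xi^2 + \rho^2$ in the denominator and $|\xi'|$ in the prefactor. The test problem, the dense solves, and the grid search for the corner are assumptions made for illustration; they are not the Regularization Tools implementation.

```python
import numpy as np

def curvature(A, b, lam):
    """Log-log L-curve curvature from xi, rho and xi' = (4/lam) x^T z."""
    n = A.shape[1]
    M = A.T @ A + lam**2 * np.eye(n)
    x = np.linalg.solve(M, A.T @ b)          # Tikhonov solution x_lam
    r = A @ x - b
    z = np.linalg.solve(M, A.T @ r)          # z_lam, same coefficient matrix
    xi, rho = x @ x, r @ r
    dxi = 4.0 / lam * (x @ z)
    num = lam**2 * dxi * rho + 2.0 * lam * xi * rho + lam**4 * xi * dxi
    return 2.0 * xi * rho / abs(dxi) * num / (lam**4 * xi**2 + rho**2) ** 1.5

def curvature_fd(s, beta, lam, rel_step=1e-3):
    """Finite-difference evaluation of the curvature definition (square A)."""
    def log_pt(l):
        d = s**2 + l**2
        xi = np.sum((s * beta / d) ** 2)       # ||x_l||^2
        rho = np.sum((l**2 * beta / d) ** 2)   # ||A x_l - b||^2
        return np.log(rho), np.log(xi)
    h = rel_step * lam
    (r0, x0), (r1, x1), (r2, x2) = log_pt(lam - h), log_pt(lam), log_pt(lam + h)
    dr, dx = (r2 - r0) / (2 * h), (x2 - x0) / (2 * h)
    ddr, ddx = (r2 - 2 * r1 + r0) / h**2, (x2 - 2 * x1 + x0) / h**2
    return 2.0 * (dr * ddx - ddr * dx) / (dr**2 + dx**2) ** 1.5

rng = np.random.default_rng(5)
n = 32
h = 1.0 / n
t = (np.arange(n) + 0.5) * h
S, T = np.meshgrid(t, t, indexing="ij")
A = h * np.where(S < T, S * (T - 1.0), T * (S - 1.0))   # hypothetical test operator
b = A @ (np.sin(np.pi * t) + 0.5 * np.sin(3 * np.pi * t)) + 1e-4 * rng.standard_normal(n)
U, s, Vt = np.linalg.svd(A)
beta = U.T @ b

lams = np.logspace(-4, 0, 50)
kappa = np.array([curvature(A, b, l) for l in lams])
lam_L = lams[np.argmax(kappa)]               # crude corner estimate on a grid
print(lam_L, kappa.max())
```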

Discrete L-Curves

The L-curve may be discrete, corresponding to a discrete regularization parameter $k$.

It may have local, fine-grained “corners” (that do not appear with a continuous parameter).

Two-step approach (older versions of Regularization Tools):

1. Perform a local smoothing of the L-curve points.

2. Use the smoothed points as control points for a cubic spline curve, compute its “corner,” and return the original point closest to this corner.

Another two-step approach (current version of Regularization Tools):

1. Prune the discrete L-curve for small local corners.

2. Use the remaining points to determine the largest angle between neighbor points.

The Prediction Error

A different kind of goal: find the value of $\lambda$ or $k$ such that $A x_\lambda$ or $A x_k$ predicts the exact data $b^{\mathrm{exact}} = A x^{\mathrm{exact}}$ as well as possible.

We split the analysis in two cases, depending on $k$:

\[ k < k_\eta: \quad \|A x_k - b^{\mathrm{exact}}\|_2^2 \approx k\, \eta^2 + \sum_{i=k+1}^{k_\eta} (u_i^T b^{\mathrm{exact}})^2 \]

\[ k > k_\eta: \quad \|A x_k - b^{\mathrm{exact}}\|_2^2 \approx k\, \eta^2 . \]

For $k < k_\eta$ the norm of the prediction error decreases with $k$. For $k > k_\eta$ the norm increases with $k$.

The minimum arises near the transition, i.e., for $k \approx k_\eta$. Hence it makes good sense to search for the regularization parameter that minimizes the prediction error.

But $b^{\mathrm{exact}}$ is unknown . . .
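The two approximations above come from the exact identity $\|A x_k - b^{\mathrm{exact}}\|_2^2 = \sum_{i \le k} (u_i^T e)^2 + \sum_{i > k} (u_i^T b^{\mathrm{exact}})^2$, where each noise term has expected value $\eta^2$. The sketch below evaluates this identity for a hypothetical test problem in which $b^{\mathrm{exact}}$ and $e$ are available (which they are not in practice).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 32
h = 1.0 / n
t = (np.arange(n) + 0.5) * h
S, T = np.meshgrid(t, t, indexing="ij")
A = h * np.where(S < T, S * (T - 1.0), T * (S - 1.0))   # hypothetical test operator
x_exact = t * (1.0 - t) * np.exp(t)
b_exact = A @ x_exact
e = 1e-4 * rng.standard_normal(n)
b = b_exact + e

U, s, Vt = np.linalg.svd(A)
be = U.T @ b_exact               # u_i^T b_exact
ee = U.T @ e                     # u_i^T e

ks = np.arange(n + 1)
# Squared prediction error of the TSVD solution x_k for every k.
pred2 = np.array([np.sum(ee[:k] ** 2) + np.sum(be[k:] ** 2) for k in ks])
k_pred = int(np.argmin(pred2))
print(k_pred, np.sqrt(pred2[k_pred]))
```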

(Ordinary) Cross-Validation

Leave-one-out approach: skip the $i$th element $b_i$ and predict this element.

\[ A^{(i)} = A([1 : i-1,\; i+1 : m],\, :) \]
\[ b^{(i)} = b([1 : i-1,\; i+1 : m]) \]
\[ x_\lambda^{(i)} = \left( A^{(i)} \right)_\lambda^\# b^{(i)} \qquad \text{(Tikh. sol. to reduced problem)} \]
\[ b_i^{\mathrm{predict}} = A(i,\, :)\, x_\lambda^{(i)} \qquad \text{(prediction of “missing” element.)} \]

The optimal $\lambda$ minimizes the quantity

\[ C(\lambda) = \sum_{i=1}^m \left( b_i - b_i^{\mathrm{predict}} \right)^2 . \]

But $\lambda$ is hard to compute, and depends on the ordering of the data.
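A direct, brute-force sketch of the leave-one-out procedure is shown below: for every $i$ the reduced Tikhonov problem is solved and the skipped element is predicted. The small random test problem and the $\lambda$ grid are assumptions for this example only.

```python
import numpy as np

def tikhonov(A, b, lam):
    """Tikhonov solution via the normal equations."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam**2 * np.eye(n), A.T @ b)

def loo_cv(A, b, lam):
    """C(lam): sum of squared leave-one-out prediction errors."""
    m = A.shape[0]
    total = 0.0
    for i in range(m):
        keep = np.arange(m) != i                 # drop row i
        x_i = tikhonov(A[keep], b[keep], lam)    # solution of the reduced problem
        total += (b[i] - A[i] @ x_i) ** 2        # error in predicting b_i
    return total

rng = np.random.default_rng(3)
m, n = 12, 5
A = rng.standard_normal((m, n))                  # hypothetical small test problem
b = A @ np.ones(n) + 0.1 * rng.standard_normal(m)

lams = np.logspace(-3, 1, 30)
C = np.array([loo_cv(A, b, lam) for lam in lams])
lam_cv = lams[np.argmin(C)]
print(lam_cv)
```

Each evaluation of $C(\lambda)$ requires $m$ separate solves in this form, which illustrates why the ordinary cross-validation parameter is expensive to compute.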

Generalized Cross-Validation

Want a scheme for which $\lambda$ is independent of any orthogonal transformation of $b$ (incl. a permutation of the elements).

Minimize the GCV function

\[ G(\lambda) = \frac{\|A x_\lambda - b\|_2^2}{\mathrm{trace}(I_m - A A_\lambda^\#)^2} , \]

where

\[ \mathrm{trace}(I_m - A A_\lambda^\#) = m - \sum_{i=1}^n \varphi_i^{[\lambda]} . \]

Easy to compute the trace term when the SVD is available.

For TSVD the trace term is particularly simple:

\[ m - \sum_{i=1}^n \varphi_i^{[\lambda]} = m - k . \]
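With the SVD at hand, the GCV function is cheap to evaluate. The following sketch implements $G(\lambda)$ for Tikhonov regularization and the TSVD variant with trace term $m - k$; the overdetermined random test matrix and the grid minimization are illustrative assumptions, not part of the slides.

```python
import numpy as np

def gcv(A, b, lam):
    """GCV function G(lam) for Tikhonov regularization, via the thin SVD."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    phi = s**2 / (s**2 + lam**2)              # Tikhonov filter factors
    r = b - U @ (phi * (U.T @ b))             # residual b - A x_lam
    return (r @ r) / (A.shape[0] - np.sum(phi)) ** 2

def gcv_tsvd(A, b, k):
    """GCV function for TSVD, where the trace term is m - k."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    r = b - U[:, :k] @ (U[:, :k].T @ b)       # residual b - A x_k
    return (r @ r) / (A.shape[0] - k) ** 2

rng = np.random.default_rng(8)
m, n = 40, 10
A = rng.standard_normal((m, n)) * 0.5 ** np.arange(n)   # hypothetical test matrix
b = A @ np.ones(n) + 1e-2 * rng.standard_normal(m)

lams = np.logspace(-4, 1, 100)
G = np.array([gcv(A, b, lam) for lam in lams])
lam_gcv = lams[np.argmin(G)]
print(lam_gcv)
```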

The GCV Function

The GCV function $G(\lambda)$ for Tikhonov regularization; the red circle shows the parameter $\lambda_{\mathrm{GCV}}$ as the minimum of the GCV function, while the cross indicates the location of the optimal parameter.

Occasional Failure

Occasional failure leading to a too small $\lambda$; more pronounced for correlated noise.

Extracting Signal in Noise

An observation about the residual vector.

If $\lambda$ is too large, not all information in $b$ has been extracted.

If $\lambda$ is too small, only noise is left in the residual.

Choose the $\lambda$ for which the residual vector changes character from “signal” to “noise.”

Our tool: the normalized cumulative periodogram (NCP). Let $p_\lambda \in \mathbb{R}^{n/2}$ be the residual’s power spectrum, with elements

\[ (p_\lambda)_k = \left| \mathrm{dft}(A x_\lambda - b)_k \right|^2 , \qquad k = 1, 2, \ldots, n/2 . \]

Then the vector $c(r_\lambda) \in \mathbb{R}^{n/2-1}$ with elements

\[ c(r_\lambda)_k = \frac{\|p_\lambda(2 : k+1)\|_1}{\|p_\lambda(2 : n/2)\|_1} , \qquad k = 1, \ldots, n/2 - 1 \]

is the NCP for the residual vector.
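The NCP definition translates almost directly into a few lines of NumPy. In the sketch below the slide's 1-based range $p_\lambda(2 : n/2)$ becomes the 0-based slice `p[1 : n // 2]`, so the zero-frequency term is skipped; the function name `ncp` and the two synthetic "residuals" are assumptions made for this example.

```python
import numpy as np

def ncp(r):
    """Normalized cumulative periodogram of a vector r of even length n."""
    n = len(r)
    p = np.abs(np.fft.fft(r)) ** 2        # power spectrum
    p = p[1 : n // 2]                     # skip DC term; q = n/2 - 1 elements
    return np.cumsum(p) / np.sum(p)

n = 64
j = np.arange(n)
# All power at the lowest nonzero frequency: the NCP jumps to 1 immediately.
c_low = ncp(np.cos(2 * np.pi * j / n))
# All power at the highest retained frequency: the NCP stays near 0 until the end.
c_high = ncp(np.cos(2 * np.pi * (n // 2 - 1) * j / n))
print(c_low[:3], c_high[-3:])
```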

NCP Analysis

Left to right: 10 instances of white-noise residuals, 10 instances of residuals dominated by low-frequency components, and 10 instances of residuals dominated by high-frequency components. The dashed lines show the Kolmogorov-Smirnov limits $\pm 1.35\, q^{-1/2} \approx \pm 0.12$ for a 5% significance level, with $q = n/2 - 1$.

The Transition of the NCPs

Plots of NCPs for various regularization parameters $\lambda$, for the test problem deriv2(128,2) with rel. noise level $\|e\|_2 / \|b^{\mathrm{exact}}\|_2 = 10^{-5}$.

Implementation of NCP Criterion

Two ways to implement a pragmatic NCP criterion.

Adjust the regularization parameter until the NCP lies solely within the K-S limits.

Choose the regularization parameter for which the NCP is closest to a straight line $c_{\mathrm{white}} = (1/q,\, 2/q,\, \ldots,\, 1)^T$.

The latter is implemented in Regularization Tools.
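Both variants of the criterion can be sketched with two small helper functions: one tests whether the NCP stays within the limits $\pm 1.35\, q^{-1/2}$, the other measures the distance to $c_{\mathrm{white}}$. The helper names `within_ks_limits` and `ncp_distance` are hypothetical, and the synthetic inputs only illustrate the two extreme cases; this is not the Regularization Tools code.

```python
import numpy as np

def ncp(r):
    """Normalized cumulative periodogram (DC term skipped), length q = n/2 - 1."""
    n = len(r)
    p = np.abs(np.fft.fft(r)) ** 2
    p = p[1 : n // 2]
    return np.cumsum(p) / np.sum(p)

def ncp_distance(r):
    """d = || c(r) - c_white ||_2 with c_white = (1/q, 2/q, ..., 1)."""
    c = ncp(r)
    q = len(c)
    return np.linalg.norm(c - np.arange(1, q + 1) / q)

def within_ks_limits(r, factor=1.35):
    """True if the NCP deviates from c_white by at most factor / sqrt(q)."""
    c = ncp(r)
    q = len(c)
    return bool(np.max(np.abs(c - np.arange(1, q + 1) / q)) <= factor / np.sqrt(q))

n = 64
impulse = np.zeros(n)
impulse[0] = 1.0                                   # flat power spectrum
low_freq = np.cos(2 * np.pi * np.arange(n) / n)    # power at one low frequency
print(ncp_distance(impulse), within_ks_limits(impulse))
print(ncp_distance(low_freq), within_ks_limits(low_freq))
```

In a parameter-choice loop, `r` would be the residual $A x_\lambda - b$ for each candidate $\lambda$, and one would pick the $\lambda$ with the smallest distance.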

Summary of Methods (Tikhonov)

Discrepancy principle (discrep): Choose $\lambda = \lambda_{\mathrm{DP}}$ such that $\|A x_\lambda - b\|_2 = \nu_{\mathrm{dp}} \|e\|_2$.

L-curve criterion (l_curve): Choose $\lambda = \lambda_{\mathrm{L}}$ such that the curvature $\hat c_\lambda$ is maximum.

GCV criterion (gcv): Choose $\lambda = \lambda_{\mathrm{GCV}}$ as the minimizer of

\[ G(\lambda) = \frac{\|A x_\lambda - b\|_2^2}{\left( m - \sum_{i=1}^n \varphi_i^{[\lambda]} \right)^2} . \]

NCP criterion (ncp): Choose $\lambda = \lambda_{\mathrm{NCP}}$ as the minimizer of $d(\lambda) = \|c(r_\lambda) - c_{\mathrm{white}}\|_2$.

Comparison of Methods

To evaluate the performance of the four methods, we need the optimal regularization parameter $\lambda_{\mathrm{opt}}$:

\[ \lambda_{\mathrm{opt}} = \arg\min_\lambda \|x^{\mathrm{exact}} - x_\lambda\|_2 . \]

This allows us to compute the four ratios

\[ R_{\mathrm{DP}} = \frac{\lambda_{\mathrm{DP}}}{\lambda_{\mathrm{opt}}} , \qquad R_{\mathrm{L}} = \frac{\lambda_{\mathrm{L}}}{\lambda_{\mathrm{opt}}} , \qquad R_{\mathrm{GCV}} = \frac{\lambda_{\mathrm{GCV}}}{\lambda_{\mathrm{opt}}} , \qquad R_{\mathrm{NCP}} = \frac{\lambda_{\mathrm{NCP}}}{\lambda_{\mathrm{opt}}} , \]

one for each parameter-choice method, and study their distributions via plots of their histograms (in log scale).

The closer these ratios are to one, the better, so a spiked histogram located at one is preferable.
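For a synthetic problem where $x^{\mathrm{exact}}$ is known, $\lambda_{\mathrm{opt}}$ and one of the ratios can be estimated by a simple grid search, as in the sketch below. Only $R_{\mathrm{GCV}}$ is computed here, for a single noise realization; the test problem, the grid, and the use of a grid minimizer instead of a proper optimizer are all assumptions of this example. A histogram would require repeating it over many noise realizations.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 32
h = 1.0 / n
t = (np.arange(n) + 0.5) * h
S, T = np.meshgrid(t, t, indexing="ij")
A = h * np.where(S < T, S * (T - 1.0), T * (S - 1.0))   # hypothetical test operator
x_exact = np.sin(np.pi * t) + 0.5 * np.sin(3 * np.pi * t)
b = A @ x_exact + 1e-4 * rng.standard_normal(n)

U, s, Vt = np.linalg.svd(A)
beta = U.T @ b
lams = np.logspace(-5, 0, 200)

def x_tik(lam):
    """Tikhonov solution from the SVD."""
    return Vt.T @ (s / (s**2 + lam**2) * beta)

def gcv(lam):
    """GCV function for square A; w_i = 1 - phi_i."""
    w = lam**2 / (s**2 + lam**2)
    return np.sum((w * beta) ** 2) / np.sum(w) ** 2

err = np.array([np.linalg.norm(x_exact - x_tik(l)) for l in lams])
lam_opt = lams[np.argmin(err)]                       # grid estimate of lambda_opt
lam_gcv = lams[np.argmin([gcv(l) for l in lams])]    # grid estimate of lambda_GCV
R_gcv = lam_gcv / lam_opt
print(lam_opt, lam_gcv, R_gcv)
```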

First Example: gravity


Second Example: shaw


Summary

The discrepancy principle is a simple method that seeks to reveal when the residual vector is noise-only. It relies on a good estimate of $\|e\|_2$, which may be difficult to obtain in practice.

The L-curve criterion is based on an intuitive heuristic and seeks to balance the two error components via inspection (manual or automated) of the L-curve. This method fails when the solution is very smooth.

The GCV criterion seeks to minimize the prediction error, and it is often a very robust method, with occasional failure, often leading to ridiculous under-smoothing that reveals itself.

The NCP criterion is a statistically based method for revealing when the residual vector is noise-only, based on the power spectrum. It can mistake low-frequency (LF) noise for signal and thus lead to under-smoothing.