Near optimal polynomial regression on norming meshes

Len Bos ∗, Federico Piazzon †, Marco Vianello †



∗ Department of Computer Science, University of Verona (Italy). Email: [email protected]
† Department of Mathematics, University of Padova (Italy). Email: [email protected], [email protected]

Abstract: We connect the approximation-theoretic notions of polynomial norming mesh and Tchakaloff-like quadrature to the statistical theory of optimal designs, obtaining near optimal polynomial regression at a near optimal number of sampling locations on domains with different shapes.

2010 AMS subject classification: 62K05, 65C60, 65D32.

Keywords: optimal designs, polynomial regression, norming set, polynomial norming mesh, Tchakaloff-like quadrature.

I. INTRODUCTION

In this paper we apply the approximation-theoretic notion of norming set (in particular, of "polynomial norming mesh") on a multidimensional compact set $K$ within the statistical theory of optimal polynomial regression designs on $K$. Moreover, we propose an approach to reduce the sampling cardinality, based on a recent implementation of Tchakaloff-like quadrature.

We denote by $\mathbb{P}_n^d(K)$ the space of polynomials of total degree not greater than $n$ restricted to a compact set $K \subset \mathbb{R}^d$, and by $\|f\|_Y$ the sup-norm of a bounded function on the compact set $Y$. We recall that a polynomial norming mesh on $K$ (with constant $c > 0$), hereafter simply called "norming mesh", is a sequence of norming subsets $X_n \subset K$ such that

$$\|p\|_K \le c\,\|p\|_{X_n}, \quad \forall p \in \mathbb{P}_n^d(K), \qquad \mathrm{card}(X_n) = O(N^s), \tag{1}$$

where $s \ge 1$ and $N = N_n(K) = \dim(\mathbb{P}_n^d(K))$. Observe that necessarily $\mathrm{card}(X_n) \ge N$, since $X_n$ is $\mathbb{P}_n^d(K)$-determining. On the other hand, $N = O(n^\beta)$ with $\beta \le d$; in particular $N = \binom{n+d}{d} \sim n^d/d!$ on polynomial determining compact sets (i.e., polynomials vanishing there vanish everywhere in $\mathbb{R}^d$), but we can have $\beta < d$, for example on compact algebraic varieties, like the sphere in $\mathbb{R}^d$, where $N = \binom{n+d}{d} - \binom{n-2+d}{d}$.

We recall that norming meshes can be computed on a wide family of compact sets, containing for example all compact sets satisfying a Markov polynomial inequality, in particular all compact sets with Lipschitz boundary. Norming meshes with $O(N)$ cardinality, often called "optimal", are known for several classes of compact sets, for example polytopes and smooth bodies, cf. [13], [15]. It is also worth recalling that norming meshes are preserved by affine transformations, and can be incrementally constructed by finite unions and products; cf. the seminal paper [8].
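The two dimension formulas above are easy to evaluate. The following minimal Python sketch does so; the function names `dim_total_degree` and `dim_on_sphere` are hypothetical helpers introduced here for illustration, and the second one simply encodes the sphere formula quoted in the text.

```python
from math import comb


def dim_total_degree(n, d):
    """Dimension N = binom(n+d, d) of polynomials of total degree <= n in d variables."""
    return comb(n + d, d)


def dim_on_sphere(n, d):
    """Dimension binom(n+d, d) - binom(n-2+d, d) of degree-n polynomials on the sphere of R^d."""
    # For n < 2 the second binomial is taken as zero (no lower-degree correction).
    lower = comb(n - 2 + d, d) if n >= 2 else 0
    return comb(n + d, d) - lower


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, dim_total_degree(n, 2), dim_on_sphere(n, 3))
```

On the sphere of $\mathbb{R}^3$ the second formula grows like $n^2$ instead of $n^3$, which is the case $\beta < d$ mentioned above.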

Norming meshes give good discrete models of a compact set for polynomial fitting. For example, it is easily seen that the uniform norm of the (unweighted) Least Squares operator on a norming mesh, say $L_n : C(K) \to \mathbb{P}_n^d(K)$, fulfills the estimate

$$\|L_n\| = \sup_{f \neq 0} \frac{\|L_n f\|_K}{\|f\|_K} \le c\,\sqrt{\mathrm{card}(X_n)}. \tag{2}$$

In addition, norming meshes play a role in the computation of good interpolation sets of Fekete and Leja type, and have been applied in the fields of polynomial optimization and pluripotential numerics; cf., e.g., [5], [16], [18], [19].

The problem of reducing the sampling cardinality while keeping estimate (2) invariant (Least Squares compression) has been solved in [17], via weighted Least Squares on $N_{2n}$ Caratheodory-Tchakaloff points extracted from the norming mesh by Linear or Quadratic Programming. Nevertheless, also reducing the Least Squares uniform operator norm, though much more costly, is important in applications, and this will be addressed in the next section via the theory of optimal designs.

II. NEAR OPTIMAL DESIGNS BY NORMING MESHES

In statistics a design is a probability measure $\mu$ supported on a (discrete or continuous) compact set $K \subset \mathbb{R}^d$. The search for designs that optimize some property of statistical estimators (optimal designs) began at least one century ago; the corresponding literature is so vast and still growing that we cannot even attempt any kind of survey. We may for example quote the classical book [1] and the very recent paper [10].

Below we recall some relevant notions and results, in order to connect the theory of optimal designs with the theory of norming meshes. In what follows we assume that $\mathrm{supp}(\mu)$ is determining for $\mathbb{P}^d(K)$ (the space of $d$-variate real polynomials restricted to $K$); for a fixed degree $n$, we could even assume that $\mathrm{supp}(\mu)$ is determining for $\mathbb{P}_n^d(K)$.

The diagonal of the reproducing kernel for $\mu$ in $\mathbb{P}_n^d(K)$ (often called the Christoffel polynomial)

$$K_n^\mu(x,x) = \sum_{j=1}^{N} p_j^2(x), \tag{3}$$

where $\{p_j\}$ is any $\mu$-orthonormal basis of $\mathbb{P}_n^d(K)$, plays a key role in the theory of optimal designs (it can be shown that $K_n^\mu(x,x)$ is independent of the choice of the orthonormal basis). Indeed, a relevant property is that

$$\|p\|_K \le \sqrt{\max_{x\in K} K_n^\mu(x,x)}\;\|p\|_{L^2_\mu(K)}, \quad \forall p \in \mathbb{P}_n^d(K). \tag{4}$$

Now, by (3) we get immediately $\int_K K_n^\mu(x,x)\,d\mu = N$, which implies that $\max_{x\in K} K_n^\mu(x,x) \ge N$. A probability measure $\mu^* = \mu^*(K)$ is then called a G-optimal design for polynomial regression of degree $n$ on $K$ if

$$\min_{\mu} \max_{x\in K} K_n^\mu(x,x) = \max_{x\in K} K_n^{\mu^*}(x,x) = N. \tag{5}$$
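As a concrete check of definitions (3)-(5), the simplest univariate case can be worked out by hand. The choice $K = [-1,1]$, $n = 1$, and the two measures compared below are illustrative assumptions, not taken from the source.

```latex
% Illustrative example: K = [-1,1], d = 1, n = 1, hence N = 2.
% Measure 1: \mu = (\delta_{-1} + \delta_{1})/2, with \mu-orthonormal basis p_1 = 1, p_2 = x.
\[
K_1^{\mu}(x,x) = 1 + x^2, \qquad \max_{x\in[-1,1]} K_1^{\mu}(x,x) = 2 = N .
\]
% Measure 2: \lambda = dx/2, with \lambda-orthonormal basis p_1 = 1, p_2 = \sqrt{3}\,x.
\[
K_1^{\lambda}(x,x) = 1 + 3x^2, \qquad
\int_{-1}^{1} K_1^{\lambda}(x,x)\,\frac{dx}{2} = 2 = N, \qquad
\max_{x\in[-1,1]} K_1^{\lambda}(x,x) = 4 > N .
\]
```

Both measures integrate their Christoffel polynomial to $N$, but only the two-point measure $\mu$ attains the lower bound $N$ on the maximum, so it is G-optimal for degree 1 on $[-1,1]$.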

Observe that, since $\int_K K_n^\mu(x,x)\,d\mu = N$ for every $\mu$, an optimal design also has the property $K_n^{\mu^*}(x,x) = N$, $\mu^*$-a.e. in $K$.

A cornerstone of optimal design theory, the well-known Kiefer-Wolfowitz General Equivalence Theorem [12], says that the difficult min-max problem (5) is equivalent to the much simpler maximization

$$\max_{\mu} \det(G_n^\mu), \qquad G_n^\mu = \left( \int_K q_i(x)\,q_j(x)\,d\mu \right)_{1\le i,j\le N}, \tag{6}$$

where $G_n^\mu$ is the Gram matrix of $\mu$ in a fixed polynomial basis $\{q_i\}$ (also called the information matrix in statistics). This kind of optimality is called D-optimality, and ensures that an optimal measure always exists, since the set of Gram matrices of probability measures is compact (and convex); see e.g. [1], [2], [4] for a general proof of these results, valid for both continuous and discrete compact sets. An optimal measure is not unique and not necessarily discrete (unless $K$ is discrete itself), but an equivalent atomic optimal measure always exists by Tchakaloff's Theorem on positive quadratures of degree $2n$ for $K$; cf. [20] for a general proof of Tchakaloff's Theorem. Moreover, it has been proved in [2], [3] that optimal designs converge weakly as $n \to \infty$ to the pluripotential-theoretic equilibrium measure of the compact set.

G-optimality can be interpreted in a probabilistic as well as in a deterministic framework. From a statistical point of view, a G-optimal design is the probability measure that minimizes the maximum prediction variance by $n$-th degree polynomial regression, cf. [1]. From the approximation theory point of view, denoting by $L_n^{\mu^*} : C(K) \to \mathbb{P}_n^d(K)$ the corresponding weighted Least Squares projection operator, in view of (4) and (5) for every $f \in C(K)$ we have the chain of estimates

$$\frac{\|L_n^{\mu^*} f\|_K}{\sqrt{N}} \le \|L_n^{\mu^*} f\|_{L^2_{\mu^*}(K)} \le \|f\|_{L^2_{\mu^*}(K)} \le \|f\|_K, \tag{7}$$

and thus $\|L_n^{\mu^*}\| \le \sqrt{N}$, i.e. a G-optimal measure minimizes (the estimate of) the weighted Least Squares operator norm.

There is a vast literature on the computation of D-optimal designs, with many different approaches and methods. A classical approach is given by the discretization of $K$ and then the D-optimization over the discrete set. In the discretization framework, the possible role of norming meshes seems to have been overlooked. A simple but meaningful result is given in the following proposition.

Proposition 1: Let $K \subset \mathbb{R}^d$ be a compact set admitting a norming mesh $\{X_n\}$ with constant $c$. Then for every $n, m \in \mathbb{N}_+$, the probability measure

$$\nu = \nu(n,m) = \mu^*(X_{2mn}) \tag{8}$$

is a near G-optimal design on $K$, in the sense that

$$\max_{x\in K} K_n^\nu(x,x) \le c_m N, \qquad c_m = c^{1/m}. \tag{9}$$
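A short argument for (9) can be sketched in the present notation; it is given here as a plausible derivation and is not quoted from the source. It assumes only that $\nu$ is G-optimal of degree $n$ on $X_{2mn}$, so that its Christoffel polynomial has maximum $N$ on the mesh, and that $X_{2mn}$ is norming for polynomials of degree $2mn$ as in (1).

```latex
% K_n^\nu(x,x) is a nonnegative polynomial of degree at most 2n,
% hence its m-th power is a polynomial of degree at most 2mn.
\[
\max_{x\in K}\bigl(K_n^{\nu}(x,x)\bigr)^{m}
\;\le\; c \max_{x\in X_{2mn}}\bigl(K_n^{\nu}(x,x)\bigr)^{m}
\;=\; c\,N^{m} ,
\]
% and taking m-th roots on both sides gives
\[
\max_{x\in K} K_n^{\nu}(x,x) \;\le\; c^{1/m}\,N \;=\; c_m N .
\]
```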

Proposition 1 shows that norming meshes are good discretizations of a compact set for the purpose of computing a near G-optimal measure, and that the G-optimality maximum condition (5) is approached at a rate proportional to $1/m$, since $c_m \sim 1 + \log(c)/m$. Recalling the statistical notion of G-efficiency on $K$, we have

$$G_{\mathrm{eff}}(\nu) = \frac{N}{\max_{x\in K} K_n^\nu(x,x)} \ge c^{-1/m}, \tag{10}$$

whereas concerning the norm of the corresponding weighted Least Squares projection operator

$$\|L_n^\nu\| \le \sqrt{c_m N}, \tag{11}$$

i.e. the discrete probability measure $\nu$ nearly minimizes (the estimate of) such a norm. It is worth recalling that a better rate proportional to $1/m^2$ can be obtained on certain compact sets, such as triangles and quadrangles, cube, simplex, (sections of) sphere and ball, smooth convex bodies, where low cardinality norming meshes can be constructed via the approximation-theoretic notion of Dubiner distance and suitable geometric transformations; cf. [6], [7], [19], [24].

III. TCHAKALOFF-LIKE DESIGN CONCENTRATION

Proposition 1 and the General Equivalence Theorem suggest a standard way to compute near G-optimal designs. First, one constructs a norming mesh such as $X_{2mn}$, then computes a D-optimal design for degree $n$ on such a set by one of the available algorithms. Observe that such designs will be in general approximate, that is, we compute a discrete probability measure $\tilde\nu \approx \nu$ such that on the norming mesh

$$\max_{x\in \mathrm{mesh}} K_n^{\tilde\nu}(x,x) \le \tilde N \approx N \tag{12}$$

(with $\tilde N$ not necessarily an integer); nevertheless, estimates (9)-(11) still hold with $\tilde\nu$ and $\tilde N$ replacing $\nu$ and $N$, respectively.

Again, we cannot even attempt to survey the vast literature on computational methods for D-optimal designs; we may quote among others the class of exchange algorithms and the class of multiplicative algorithms, cf. e.g. [11], [14] and the references therein.

Our computational strategy is roughly the following. We first approximate a D-optimal design for degree $n$ on the norming mesh by a standard multiplicative algorithm, and then we concentrate the measure via Caratheodory-Tchakaloff compression of degree $2n$, keeping the Christoffel polynomial, and thus the G-efficiency, invariant. Such a compression is based on a suitable implementation of a discrete version of

the well-known Tchakaloff Theorem [20], [23], which in general asserts that any (probability) measure has a representing atomic measure with the same polynomial moments up to a given degree, with cardinality not exceeding the dimension of the corresponding polynomial space; for an implementation see e.g. [17], [21] and the references therein. In such a way we get near optimality with respect to both G-efficiency and support cardinality, since the latter will not exceed $N_{2n} = \dim(\mathbb{P}_{2n}^d(K))$.

To simplify the notation, in what follows $X = X_{2mn}$, $M = \mathrm{card}(X)$, $w = \{w_i\}$ are the weights of a probability measure on $X$ ($w_i \ge 0$, $\sum_i w_i = 1$), and $K_n^w(x,x)$ is the corresponding Christoffel polynomial.

The first step is the application of the standard Titterington's multiplicative algorithm (cf. [14]) to compute a sequence $w(k)$ of weight arrays

$$w_i(k+1) = w_i(k)\,\frac{K_n^{w(k)}(x_i,x_i)}{N}, \quad 1 \le i \le M, \quad k \ge 0, \tag{13}$$

where we take $w(0) = (1/M, \dots, 1/M)$. Observe that the weights $w_i(k+1)$ determine a probability measure on $X$, since they are clearly nonnegative and $\sum_i w_i(k)\,K_n^{w(k)}(x_i,x_i) = N$. The sequence $w(k)$ is known to converge as $k \to \infty$, for any initial choice of probability weights, to the weights of a D-optimal design (with a nondecreasing sequence of Gram determinants), cf. e.g. [11] and the references therein.

In order to implement (13), we need an efficient way to compute the right-hand side. Denote by $V_n = (\phi_j(x_i)) \in \mathbb{R}^{M\times N}$ the rectangular Vandermonde matrix at $X$ in a fixed polynomial basis $(\phi_1, \dots, \phi_N)$, and by $D(w)$ the diagonal matrix of a weight array $w$. In order to avoid severe ill-conditioning that may already occur for relatively low degrees, instead of the standard monomial basis we have used the product Chebyshev basis of the smallest box containing $X$, a choice that turns out to work effectively in multivariate instances; cf. e.g. [5], [16], [17]. In view of the rectangular QR factorization $D^{1/2}(w)\,V_n = QR$ with $Q = (q_{ij})$ orthogonal (rectangular) and $R$ square upper triangular, the polynomials $(p_1, \dots, p_N) = (\phi_1, \dots, \phi_N)\,R^{-1}$ form a $w$-orthonormal basis and we can write

$$w_i\,K_n^w(x_i,x_i) = w_i \sum_{j=1}^{N} p_j^2(x_i) = \sum_{j=1}^{N} q_{ij}^2, \quad 1 \le i \le M. \tag{14}$$

The latter equation shows that we can update the weights at each step of (13) by a single QR factorization, using directly the squared 2-norms of the rows of the orthogonal matrix $Q$.

The convergence of (13) can be slow, but a few iterations usually suffice to obtain a good design on $X$. Indeed, in all our numerical tests with bivariate norming meshes, after 10 or 20 iterations we already get 90% G-efficiency on $X$, and 95% after 20 or 30 iterations; cf. Figure 1 (top) for a typical convergence profile. On the other hand, 99% G-efficiency would require hundreds, and 99.9% thousands of iterations. When a G-efficiency very close to 1 is sought, one should adopt one of the more sophisticated approximation algorithms

available in the literature, cf. e.g. [10], [11], [14] and the references therein.

Though the designs given by (13) will concentrate in the limit on the support of an optimal design, which typically is of relatively low cardinality (with respect to $M$), the cardinality of the support can be reduced even after a small number of iterations by a suitable implementation of Tchakaloff's Theorem, which we describe below.

Let $V_{2n} \in \mathbb{R}^{M\times N_{2n}}$ be the rectangular Vandermonde matrix at $X$ with respect to a fixed polynomial basis for $\mathbb{P}_{2n}^d(X) = \mathbb{P}_{2n}^d(K)$ (recall that the chosen norming mesh is determining on $K$ for polynomials of degree up to $2n$), and $w$ the weight array of a probability measure supported on $X$ (in our instance, the weights produced by (13) after a suitable number of iterations, to get a prescribed G-efficiency on $X$). In this fully discrete framework Tchakaloff's Theorem corresponds to the existence of a sparse solution $u$ to the underdetermined moment system

$$V_{2n}^t\,u = b = V_{2n}^t\,w, \quad u \ge 0, \tag{15}$$

where $b$ is the vector of discrete $w$-moments of the polynomial basis up to degree $2n$. The celebrated Caratheodory Theorem on conical finite-dimensional linear combinations [9] ensures that such a solution exists and has no more than $N_{2n}$ nonzero components.

In order to compute a sparse solution, we can resort to Linear or Quadratic Programming. We recall here the second approach, which turned out to be the most efficient in all the tests on bivariate discrete measure compression for degrees in the order of tens that we carried out, cf. [17]. It consists of seeking a sparse solution $\hat u$ to the NonNegative Least Squares problem

$$\|V_{2n}^t\,\hat u - b\|_2^2 = \min_{u \ge 0} \|V_{2n}^t\,u - b\|_2^2 \tag{16}$$

using the Lawson-Hanson active set algorithm, which is implemented for example in the Matlab native function lsqnonneg. The nonzero components of $\hat u$ determine the resulting design, whose support, say $T = \{x_i : \hat u_i > 0\}$, has at most $N_{2n}$ points.

Observe that by construction $K_n^{\hat u}(x,x) = K_n^w(x,x)$ on $K$, since the underlying probability measures have the same moments up to degree $2n$ and hence generate the same orthogonal polynomials. Now, since

$$\max_{x\in K} K_n^w(x,x) \le c_m \max_{x\in X} K_n^w(x,x) = \frac{c_m N}{\theta},$$

where $\theta$ is the G-efficiency of $w$ on $X$, in terms of G-efficiency on $K$ we have the estimate

$$G_{\mathrm{eff}}(\hat u) = G_{\mathrm{eff}}(w) \ge \frac{\theta}{c_m}, \tag{17}$$

cf. Proposition 1, while in terms of the uniform norm of the weighted Least Squares operator we get the estimate

$$\|L_n^{\hat u}\| \le \sqrt{\frac{c_m N}{\theta}}. \tag{18}$$
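The two computational steps, the multiplicative iteration (13)-(14) and the compression (15)-(16), can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Matlab code: the test case is univariate (degree $n$ on $[-1,1]$ with a Chebyshev-Lobatto grid standing in for the norming mesh), SciPy's `nnls` replaces Matlab's `lsqnonneg`, and all function and variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls


def christoffel(V, w):
    """Values K_n^w(x_i, x_i) at all mesh points, for weights w and Vandermonde matrix V."""
    # sqrt(D(w)) V = Q R, so the columns of P = V R^{-1} sample a w-orthonormal basis
    _, R = np.linalg.qr(np.sqrt(w)[:, None] * V)
    P = np.linalg.solve(R.T, V.T).T
    return np.sum(P**2, axis=1)


def multiplicative_design(V, iters):
    """Iterate (13) from uniform weights; by (14), w_i * K_i are the squared row norms of Q."""
    M, N = V.shape
    w = np.full(M, 1.0 / M)
    for _ in range(iters):
        w = w * christoffel(V, w) / N
    return w


def compress(V2, w):
    """Sparse nonnegative weights u with the same moments as w, via NNLS on (15)-(16)."""
    b = V2.T @ w
    u, residual = nnls(V2.T, b)
    return u, residual


# Assumed univariate test case: Chebyshev-Lobatto grid with 2mn + 1 points on [-1, 1]
n, m = 4, 5
x = np.cos(np.pi * np.arange(2 * m * n + 1) / (2 * m * n))
V = np.polynomial.chebyshev.chebvander(x, n)        # M x N, Chebyshev basis
V2 = np.polynomial.chebyshev.chebvander(x, 2 * n)   # M x N_2n

N = V.shape[1]
w0 = np.full(x.size, 1.0 / x.size)
w = multiplicative_design(V, iters=50)
u, residual = compress(V2, w)

# G-efficiency measured on the mesh: N / max_i K(x_i, x_i)
geff_start = N / christoffel(V, w0).max()
geff_w = N / christoffel(V, w).max()
geff_u = N / christoffel(V, u).max()
print(f"G-efficiency on mesh: start {geff_start:.3f}, iterated {geff_w:.3f}, compressed {geff_u:.3f}")
print(f"support: {np.count_nonzero(u)} of {x.size} points, moment residual {residual:.1e}")
```

The sketch evaluates the Christoffel polynomial through $V R^{-1}$ so that the same routine also works for the compressed weights, which vanish at most mesh points; for the update (13) alone, the squared row norms of $Q$ would suffice.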

We now present a bivariate example on a nonconvex polygon. An application of polygonal compact sets is the approximation of geographical regions; for example, the 27-sided polygon in Figure 1 resembles the shape of France. The problem could be that of locating a near minimal number of sampling stations (sensors) to reconstruct a scalar or vector field (such as rainfall, pollutant concentration, geomagnetic field, ...) by near optimal regression on the whole region. With polygons we can resort to triangulation and finite union, constructing on each triangle a norming mesh by the Duffy transform of a Chebyshev grid of the square with approximately $(2mn)^2$ points; here $c_m = 1/\cos^2(\pi/(2m))$ for any triangle and hence, by finite union, for the whole polygon, cf. [7], [8]. The results corresponding to $n = 8$ and $m = 5$ are reported in Figure 1; all the computations have been made in Matlab R2017b on a 2.7 GHz Intel Core i5 CPU with 16GB RAM. The whole norming mesh of about 168500 points is compressed into 153 sampling nodes and weights (a compression ratio of 3 orders of magnitude), still ensuring 95% G-efficiency, in about 22 seconds.
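The figures quoted for this example can be cross-checked against estimates (17) and (18) with elementary arithmetic. The short Python sketch below does this; the variable names are illustrative, and the mesh size is the approximate value reported in the text.

```python
import math

n, m, d = 8, 5, 2
theta = 0.95           # G-efficiency reached on the mesh, as reported
mesh_size = 168500     # approximate mesh cardinality, as reported

N = math.comb(n + d, d)                        # dimension of degree-n bivariate polynomials
N2n = math.comb(2 * n + d, d)                  # bound on the compressed support size
c_m = 1.0 / math.cos(math.pi / (2 * m)) ** 2   # mesh constant for triangles and polygons
geff_lower = theta / c_m                       # lower bound (17) for G-efficiency on K
norm_upper = math.sqrt(c_m * N / theta)        # upper bound (18) for the operator norm
ratio = mesh_size / N2n                        # compression ratio

print(f"N = {N}, N_2n = {N2n}, c_m = {c_m:.4f}")
print(f"G-efficiency on K >= {geff_lower:.3f}, operator norm <= {norm_upper:.2f}")
print(f"compression ratio about {ratio:.0f}")
```

In particular $N_{2n} = 153$ matches the number of sampling nodes reported above, and 95% G-efficiency on the mesh guarantees roughly 86% on the whole polygon.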


Fig. 1: Top: G-efficiency of the approximate optimal designs computed by (13) on a norming mesh with about 168500 points of a 27-sided nonconvex polygon (upper curve, $n = 8$, $m = 5$), and estimate (17) (lower curve); Bottom: Caratheodory-Tchakaloff compressed support (153 points) after $k = 35$ iterations ($G_{\mathrm{eff}} = 0.95$).

ACKNOWLEDGMENTS

Work partially supported by the DOR funds and the Project BIRD 181249 of the University of Padova, and by the GNCS-INdAM. This research has been accomplished within the RITA "Research ITalian network on Approximation".

REFERENCES

[1] A.K. Atkinson and A.N. Donev, Optimum Experimental Designs, Clarendon Press, Oxford, 1992.
[2] T. Bloom, L. Bos, N. Levenberg and S. Waldron, On the convergence of optimal measures, Constr. Approx. 32 (2010), 159–179.
[3] T. Bloom, L. Bos and N. Levenberg, The Asymptotics of Optimal Designs for Polynomial Regression, arXiv preprint: 1112.3735.
[4] L. Bos, Some remarks on the Fejér problem for Lagrange interpolation in several variables, J. Approx. Theory 60 (1990), 133–140.
[5] L. Bos, J.P. Calvi, N. Levenberg, A. Sommariva and M. Vianello, Geometric Weakly Admissible Meshes, Discrete Least Squares Approximations and Approximate Fekete Points, Math. Comp. 80 (2011), 1601–1621.
[6] L. Bos, F. Piazzon and M. Vianello, Near G-Optimal Tchakaloff Designs, submitted, 2019.
[7] L. Bos and M. Vianello, Low cardinality admissible meshes on quadrangles, triangles and disks, Math. Inequal. Appl. 15 (2012), 229–235.
[8] J.P. Calvi and N. Levenberg, Uniform approximation by discrete least squares polynomials, J. Approx. Theory 152 (2008), 82–100.
[9] C. Caratheodory, Über den Variabilitätsbereich der Fourierschen Konstanten von positiven harmonischen Funktionen, Rend. Circ. Mat. Palermo 32 (1911), 193–217.
[10] Y. De Castro, F. Gamboa, D. Henrion, R. Hess and J.-B. Lasserre, Approximate Optimal Designs for Multivariate Polynomial Regression, Ann. Statist. 47 (2019), 127–155.
[11] H. Dette, A. Pepelyshev and A. Zhigljavsky, Improving updating rules in multiplicative algorithms for computing D-optimal designs, Comput. Stat. Data Anal. 53 (2008), 312–320.
[12] J. Kiefer and J. Wolfowitz, The equivalence of two extremum problems, Canad. J. Math. 12 (1960), 363–366.
[13] A. Kroó, On optimal polynomial meshes, J. Approx. Theory 163 (2011), 1107–1124.
[14] A. Mandal, W.K. Wong and Y. Yu, Algorithmic Searches for Optimal Designs, in: Handbook of Design and Analysis of Experiments, Chapman & Hall/CRC, New York, 2015.
[15] F. Piazzon, Optimal Polynomial Admissible Meshes on Some Classes of Compact Subsets of $\mathbb{R}^d$, J. Approx. Theory 207 (2016), 241–264.
[16] F. Piazzon, Pluripotential Numerics, Constr. Approx., published online 21 June 2018.
[17] F. Piazzon, A. Sommariva and M. Vianello, Caratheodory-Tchakaloff Least Squares, SampTA 2017, IEEE Xplore Digital Library.
[18] F. Piazzon and M. Vianello, A note on total degree polynomial optimization by Chebyshev grids, Optim. Lett. 12 (2018), 63–71.
[19] F. Piazzon and M. Vianello, Markov inequalities, Dubiner distance, norming meshes and polynomial optimization on convex bodies, Optim. Lett., published online 01 January 2019.
[20] M. Putinar, A note on Tchakaloff's theorem, Proc. Amer. Math. Soc. 125 (1997), 2409–2414.
[21] A. Sommariva and M. Vianello, Compression of multivariate discrete measures and applications, Numer. Funct. Anal. Optim. 36 (2015), 1198–1223.
[22] A. Sommariva and M. Vianello, Discrete norming inequalities on sections of sphere, ball and torus, J. Inequal. Spec. Funct. 9 (2018), 113–121.
[23] V. Tchakaloff, Formules de cubatures mécaniques à coefficients non négatifs (French), Bull. Sci. Math. 81 (1957), 123–134.
[24] M. Vianello, Subperiodic Dubiner distance, norming meshes and trigonometric polynomial optimization, Optim. Lett. 12 (2018), 1659–1667.