# OPTIMAL DESIGN FOR LINEAR MODELS WITH CORRELATED OBSERVATIONS

Submitted to the Annals of Statistics

By Holger Dette∗, Andrey Pepelyshev† and Anatoly Zhigljavsky‡

Ruhr-Universität Bochum∗, RWTH Aachen† and Cardiff University‡

In the common linear regression model the problem of determining optimal designs for least squares estimation is considered in the case where the observations are correlated. A necessary condition for the optimality of a given design is provided, which extends the classical equivalence theory for optimal designs in models with uncorrelated errors to the case of dependent data. If the regression functions are eigenfunctions of an integral operator defined by the covariance kernel, it is shown that the corresponding measure defines a universally optimal design. For several models universally optimal designs can be identified explicitly. In particular, it is proved that the uniform distribution is universally optimal for a class of trigonometric regression models with a broad class of covariance kernels and that the arcsine distribution is universally optimal for the polynomial regression model with correlation structure defined by the logarithmic potential. To the best knowledge of the authors these findings provide the first explicit results on optimal designs for regression models with correlated observations which are not restricted to the location scale model.

AMS 2000 subject classifications: Primary 62K05, 60G10; secondary 31A10, 45C05.

Keywords and phrases: optimal design, correlated observations, integral operator, eigenfunctions, arcsine distribution, logarithmic potential.

1. Introduction. Consider the common linear regression model

$$y(x) = \theta_1 f_1(x) + \dots + \theta_m f_m(x) + \varepsilon(x), \tag{1.1}$$

where $f_1(x), \dots, f_m(x)$ are linearly independent, continuous functions, $\varepsilon(x)$ denotes a random error process or field, $\theta_1, \dots, \theta_m$ are unknown parameters and $x$ is the explanatory variable, which varies in a compact design space $\mathcal{X} \subset \mathbb{R}^d$. We assume that $N$ observations, say $y_1, \dots, y_N$, can be taken at experimental conditions $x_1, \dots, x_N$ to estimate the parameters in the linear regression model (1.1). If an appropriate estimate, say $\hat\theta$, of the parameter $\theta = (\theta_1, \dots, \theta_m)^T$ has been chosen, the quality of the statistical analysis can be further improved by choosing an appropriate design for the experiment. In particular, an optimal design minimizes a functional of the variance-covariance matrix of the estimate $\hat\theta$, where the functional should reflect certain aspects of the goal of the experiment.

In contrast to the case of uncorrelated errors, where numerous results and a rather complete theory are available [see for example the monograph of Pukelsheim (2006)], the construction of optimal designs for dependent observations is intrinsically more difficult. On the other hand, this problem is of particular practical interest, as in most applications there exists correlation between different observations. Typical examples include models where the explanatory variable $x$ represents time and all observations correspond to one subject. In such situations optimal experimental designs are very difficult to find even in simple cases. Some exact optimal design problems were considered in Boltze and Näther (1982), Näther (1985a), Ch. 4, Näther (1985b), Pázman and Müller (2001) and Müller and Pázman (2003), who derived optimal designs for the location scale model

$$y(x) = \theta + \varepsilon(x). \tag{1.2}$$

Exact optimal designs for specific linear models have been investigated in Dette et al. (2008a), Kiselak and Stehlik (2008) and Harman and Štulajter (2010). Because explicit solutions of optimal design problems for correlated observations are rarely available, several authors have proposed to determine optimal designs based on asymptotic arguments [see for example Sacks and Ylvisaker (1966, 1968), Bickel and Herzberg (1979), Näther (1985a), Zhigljavsky et al. (2010)]. Roughly speaking, there exist three approaches to embed the optimal design problem for regression models with correlated observations in an asymptotic optimal design problem. The first one is due to Sacks and Ylvisaker (1966, 1968), who assumed that the covariance structure of the error process $\varepsilon(x)$ is fixed and that the number of design points tends to infinity. Alternatively, Bickel and Herzberg (1979) and Bickel et al. (1981) considered a different model, where the correlation function depends on the sample size. Recently, Zhigljavsky et al. (2010) extended the Bickel-Herzberg approach and allowed the variance (in addition to the correlation function) to vary as the number of observations changes. As a result, the corresponding optimality criteria contain a kernel with a singularity at zero. The focus in all of these papers is again mainly on the location scale model (1.2).

The difficulties in the development of the optimal design theory for correlated observations can be explained by a different structure of the covariance of the least squares estimator in model (1.1), which is of the form $M^{-1} B M^{-1}$ for certain matrices $M$ and $B$ depending on the design. As a consequence, the corresponding design problems are in general not convex (except for the location scale model (1.2), where $M = 1$).


The present paper is devoted to the problem of determining optimal designs for more general models with correlated observations than the simple location scale model (1.2). In Section 2 we present some preliminary discussion and introduce the necessary notation. In Section 3 we investigate general conditions for design optimality. One of the main results of the paper is Theorem 3.3, where we derive necessary and sufficient conditions for the universal optimality of designs. By relating the optimal design problems to eigenvalue problems for integral operators we identify a broad class of multi-parameter regression models where the universally optimal designs can be determined explicitly. It is also shown that in this case the least squares estimate with the corresponding optimal design has the same covariance matrix as the weighted least squares estimate with its optimal design. In other words, under the conditions of Theorem 3.3 least squares estimation combined with an optimal design can never be improved by weighted least squares estimation. In Section 4 several applications are presented. In particular, we show that for a trigonometric system of regression functions involving only cosine terms with an arbitrary periodic covariance kernel the uniform distribution is universally optimal. We also prove that the arcsine design is universally optimal for the polynomial regression model with the logarithmic covariance kernel and derive some universal optimality properties of the Beta distribution. To the best of our knowledge these results provide the first explicit solution of optimal design problems for regression models with correlated observations which differ from the location scale model. In Section 5 we provide an algorithm for computing optimal designs for any regression model with specified covariance function and investigate the efficiency of the arcsine and uniform distribution in polynomial regression models with exponential correlation functions. Finally, Section 6 contains some conclusions, and technical details are given in Section 7.

2. Preliminaries.

2.1. The asymptotic covariance matrix. Consider the linear regression model (1.1), where $\varepsilon(x)$ is a stochastic process with

$$E\varepsilon(x) = 0, \qquad E\varepsilon(x)\varepsilon(x') = K(x, x'), \qquad x, x' \in \mathcal{X} \subset \mathbb{R}^d. \tag{2.1}$$

Throughout this paper we call the function $K(x, x')$ the covariance kernel and assume that it is continuous at all points $(x, x') \in \mathcal{X} \times \mathcal{X}$ except possibly at the diagonal points $(x, x)$. We also assume that $K(x, x') \neq 0$ for at least one pair $(x, x')$ with $x \neq x'$. An important case appears when the error process is stationary and the covariance kernel is of the form $K(x, x') = \sigma^2 \rho(x - x')$, where $\rho(0) = 1$ and $\rho(\cdot)$ is called the correlation function.


If $N$ observations, say $y = (y_1, \dots, y_N)^T$, are available at experimental conditions $x_1, \dots, x_N$ and the covariance kernel is known, the vector of parameters can be estimated by the weighted least squares method, i.e.

$$\hat\theta = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y,$$

where $X = (f_i(x_j))_{j=1,\dots,N}^{i=1,\dots,m}$ and $\Sigma = (K(x_i, x_j))_{i,j=1,\dots,N}$. The variance-covariance matrix of this estimate is given by

$$\mathrm{Var}(\hat\theta) = (X^T \Sigma^{-1} X)^{-1}.$$

If the correlation structure of the process is not known, one usually uses the ordinary least squares estimate $\tilde\theta = (X^T X)^{-1} X^T y$, which has the covariance matrix

$$\mathrm{Var}(\tilde\theta) = (X^T X)^{-1} X^T \Sigma X (X^T X)^{-1}. \tag{2.2}$$

An exact experimental design $\xi_N = \{x_1, \dots, x_N\}$ is a collection of $N$ points in $\mathcal{X}$, which defines the time points or experimental conditions where observations are taken. Optimal designs for weighted or ordinary least squares estimation minimize a functional of the covariance matrix of the weighted or ordinary least squares estimate, respectively, and numerous optimality criteria have been proposed in the literature to discriminate between competing designs [see Pukelsheim (2006)].

Note that the weighted least squares estimate can only be used if the correlation structure of the errors is known, and its misspecification can lead to a considerable loss of efficiency. At the same time, the ordinary least squares estimate does not employ the structure of the correlation. Obviously the ordinary least squares estimate can be less efficient than the weighted least squares estimate, but in many cases the loss of efficiency is small. For example, consider the location scale model (1.2) with a stationary error process, the Gaussian correlation function $\rho(t) = e^{-\lambda t^2}$ and the exact design $\xi = \{-1, -2/3, -1/3, 1/3, 2/3, 1\}$. Suppose that the guessed value of $\lambda$ equals 1 while the true value is 2. Then the variance of the weighted least squares estimate is 0.528, computed as

$$(X^T \Sigma_{guess}^{-1} X)^{-1} X^T \Sigma_{guess}^{-1} \Sigma_{true} \Sigma_{guess}^{-1} X (X^T \Sigma_{guess}^{-1} X)^{-1},$$

while the variance of the ordinary least squares estimate is 0.433. If the guessed value of $\lambda$ equals the true value, then the variance of the weighted least squares estimate is 0.382. A similar relation between the variances holds if the location scale model and the Gaussian correlation function are replaced by a polynomial model and a triangular or exponential correlation function, respectively. For a more detailed discussion of the advantages of the ordinary least squares estimate over the weighted least squares estimate see Bickel and Herzberg (1979) and Section 5.1 in Näther (1985a).
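The three variances in this example can be recomputed with a few lines of code. The following is a minimal sketch, not the authors' implementation: it assumes unit variance ($\sigma^2 = 1$), uses only the Python standard library, and the helper names (`gauss_corr`, `solve`, `variances`) are hypothetical. For the location scale model the design matrix $X$ is a column of ones, so the sandwich formula reduces to $w^T \Sigma_{true} w / (1^T w)^2$ with $w = \Sigma_{guess}^{-1} 1$.

```python
import math

def gauss_corr(points, lam):
    """Correlation matrix with entries exp(-lam * (x_i - x_j)^2)."""
    return [[math.exp(-lam * (a - b) ** 2) for b in points] for a in points]

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z

def quad_form(u, S, v):
    """Bilinear form u^T S v."""
    return sum(u[i] * S[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

def variances(points, lam_true, lam_guess):
    """Return (Var OLS, Var WLS with guessed lambda) for the model y = theta + eps."""
    n = len(points)
    ones = [1.0] * n
    s_true = gauss_corr(points, lam_true)
    s_guess = gauss_corr(points, lam_guess)
    var_ols = quad_form(ones, s_true, ones) / n ** 2      # formula (2.2) with X = 1
    w = solve(s_guess, ones)                              # w = Sigma_guess^{-1} 1
    var_wls = quad_form(w, s_true, w) / sum(w) ** 2       # sandwich formula of the example
    return var_ols, var_wls

design = [-1, -2 / 3, -1 / 3, 1 / 3, 2 / 3, 1]
var_ols, var_wls_misspecified = variances(design, lam_true=2.0, lam_guess=1.0)
_, var_wls_correct = variances(design, lam_true=2.0, lam_guess=2.0)
print(var_ols, var_wls_misspecified, var_wls_correct)
```

The printed values can be compared with the figures 0.433, 0.528 and 0.382 quoted in the text.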


Throughout this article we will concentrate on optimal designs for the ordinary least squares estimate. These designs also require the specification of the correlation structure, but a potential loss caused by its misspecification at the stage of design construction is typically much smaller than the loss caused by the misspecification of the correlation structure in the weighted least squares estimate. Moreover, in this paper we will demonstrate that there are many situations where the combination of the ordinary least squares estimate with the corresponding (universally) optimal design yields the same covariance matrix as the weighted least squares estimate on the basis of a (universally) optimal design for weighted least squares estimation [see the discussions in Section 4 and Section 6].

Because even in simple models the exact optimal designs are difficult to find, most authors use asymptotic arguments to determine efficient designs for the estimation of the model parameters [see Sacks and Ylvisaker (1966, 1968), Bickel and Herzberg (1979) or Zhigljavsky et al. (2010)]. Sacks and Ylvisaker (1966, 1968) and Näther (1985a), Chapter 4, assumed that the design points $\{x_1, \dots, x_N\}$ are generated by the quantiles of a distribution function, that is

$$x_i = a\big((i-1)/(N-1)\big), \qquad i = 1, \dots, N,$$

where the function $a : [0, 1] \to \mathcal{X}$ is the inverse of a distribution function. If $\xi_N$ denotes a design with $N$ points and corresponding quantile function $a(\cdot)$, the covariance matrix of the least squares estimate $\tilde\theta = \tilde\theta_{\xi_N}$ given in (2.2) can be written as

$$\mathrm{Var}(\tilde\theta) = D(\xi_N) = M^{-1}(\xi_N) B(\xi_N, \xi_N) M^{-1}(\xi_N), \tag{2.3}$$

where

$$M(\xi_N) = \int_{\mathcal{X}} f(u) f^T(u)\, \xi_N(du), \tag{2.4}$$

$$B(\xi_N, \xi_N) = \int_{\mathcal{X}} \int_{\mathcal{X}} K(u, v) f(u) f^T(v)\, \xi_N(du)\, \xi_N(dv), \tag{2.5}$$

and $f(u) = (f_1(u), \dots, f_m(u))^T$ denotes the vector of regression functions. Following Kiefer (1974) we call any probability measure $\xi$ on $\mathcal{X}$ (more precisely on an appropriate Borel field) an approximate design or simply a design. The definition of the matrices $M(\xi)$ and $B(\xi, \xi)$ can be extended to an arbitrary design $\xi$, provided that the corresponding integrals exist. The matrix

$$D(\xi) = M^{-1}(\xi) B(\xi, \xi) M^{-1}(\xi) \tag{2.6}$$

is called the covariance matrix for the design $\xi$ and can be defined for any probability measure $\xi$ supported on the design space $\mathcal{X}$ such that the matrices $B(\xi, \xi)$ and $M^{-1}(\xi)$ are well-defined. The set of such measures will be denoted by $\Xi$. An (approximate) optimal design minimizes a functional of the covariance matrix $D(\xi)$ over the set $\Xi$, and a universally optimal design $\xi^*$ (if it exists) minimizes the matrix $D(\xi)$ with respect to the Loewner ordering, that is

$$D(\xi^*) \le D(\xi) \quad \text{for all } \xi \in \Xi.$$
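The link between the exact covariance matrix (2.2) and the design matrices (2.4) and (2.5) can be checked numerically: for an exact design with equal weights $1/N$ one has $M = X^T X / N$ and $B = X^T \Sigma X / N^2$, so the factors of $N$ cancel in $M^{-1} B M^{-1}$. The sketch below is illustrative only. It assumes the straight-line model $f(x) = (1, x)^T$, the kernel $K(u, v) = e^{-|u-v|}$ and the quantile function $a(t) = 2t - 1$; none of these choices, nor the helper names, come from the text.

```python
import math

def f(x):
    """Regression vector f(x) = (1, x)^T (illustrative assumption)."""
    return [1.0, x]

def kernel(u, v):
    """Covariance kernel K(u, v) = exp(-|u - v|) (illustrative assumption)."""
    return math.exp(-abs(u - v))

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(A):
    """Inverse of a 2 x 2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

def ols_covariance(points):
    """Right-hand side of (2.2): (X^T X)^-1 X^T Sigma X (X^T X)^-1."""
    X = [f(x) for x in points]                                   # N x m matrix
    sigma = [[kernel(u, v) for v in points] for u in points]     # N x N matrix
    Xt = transpose(X)
    xtx_inv = inv2(matmul(Xt, X))
    return matmul(matmul(xtx_inv, matmul(matmul(Xt, sigma), X)), xtx_inv)

def design_covariance(points, weights):
    """D(xi) = M^-1 B M^-1 from (2.4)-(2.6) for a discrete design."""
    m = len(f(points[0]))
    M = [[0.0] * m for _ in range(m)]
    B = [[0.0] * m for _ in range(m)]
    for u, wu in zip(points, weights):
        fu = f(u)
        for i in range(m):
            for j in range(m):
                M[i][j] += wu * fu[i] * fu[j]
        for v, wv in zip(points, weights):
            fv, k = f(v), kernel(u, v)
            for i in range(m):
                for j in range(m):
                    B[i][j] += wu * wv * k * fu[i] * fv[j]
    m_inv = inv2(M)
    return matmul(matmul(m_inv, B), m_inv)

N = 5
points = [2 * (i - 1) / (N - 1) - 1 for i in range(1, N + 1)]   # x_i = a((i-1)/(N-1))
var_ols = ols_covariance(points)
d_xi = design_covariance(points, [1.0 / N] * N)
print(var_ols)
print(d_xi)
```

Both printed matrices should agree up to rounding, which is the content of (2.3) for this particular design.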

Note that on the basis of this asymptotic analysis the kernel $K(u, v)$ has to be well defined for all $u, v \in \mathcal{X}$. On the other hand, Zhigljavsky et al. (2010) extended the approach in Bickel and Herzberg (1979) and proposed an alternative approximation for the covariance matrix in (2.2), where the variance of the observations also depends on the sample size. As a result they obtained an approximating matrix of the form (2.5), where the kernel $K(u, v)$ in the matrix $B(\xi, \xi)$ may have singularities at the diagonal. Because in this paper we are interested in designs minimizing functionals of the matrix $D(\xi)$ (independently of the type of approximation which has been used to derive it), we will also consider singular kernels in the following discussion. Moreover, we call $K(u, v)$ a covariance kernel even if it has singularities at the diagonal. As illustrated in Example 4.2 below, covariance kernels with singularities at the diagonal can be used as approximations to the standard covariance kernels.

Note that in general the function $D(\xi)$ is not convex (with respect to the Loewner ordering) on the space of all approximate designs. This implies that even if one determines optimal designs by minimizing a convex functional, say $\Phi$, of the matrix $D(\xi)$, the corresponding functional $\xi \to \Phi(D(\xi))$ is generally not convex on the space of designs $\Xi$. Consider for example the case $m = 1$, where $D(\xi)$ is given by

$$D(\xi) = \Big[ \int f^2(u)\, \xi(du) \Big]^{-2} \int\!\!\int K(u, v) f(u) f(v)\, \xi(du)\, \xi(dv), \tag{2.7}$$

and it is obvious that this functional is not necessarily convex. On the other hand, for the location scale model (1.2) we have $m = 1$, $f(x) = 1$ for all $x$, and this expression reduces to $D(\xi) = \int\!\!\int K(u, v)\, \xi(du)\, \xi(dv)$. In the stationary case $K(u, v) = \sigma^2 \rho(u - v)$, where $\rho(\cdot)$ is a correlation function, this functional is convex on the set of all probability measures on the domain $\mathcal{X}$, see Lemma 1 in Zhigljavsky et al. (2010) and Lemma 4.3 in Näther (1985a). For this reason (namely the convexity of the functional $D(\xi)$) most of the literature discussing asymptotic optimal design problems for least squares estimation in the presence of correlated observations considers the location scale model, which corresponds to the estimation of the mean of a stationary process [see for example Boltze and Näther (1982), Näther (1985a,b)].

2.2. The set of admissible design points. Recall the definition of the vector $f(x) = (f_1(x), \dots, f_m(x))^T$ in the regression model (1.1). We define the sets $\mathcal{X}_0 = \{x \in \mathcal{X} : f(x) = 0\}$ and $\mathcal{X}_1 = \mathcal{X} \setminus \mathcal{X}_0 = \{x \in \mathcal{X} : f(x) \neq 0\}$ and assume that designs $\xi_0$ and $\xi_1$ are concentrated on $\mathcal{X}_0$ and $\mathcal{X}_1$, respectively. Consider the design $\xi_\alpha = \alpha \xi_0 + (1 - \alpha) \xi_1$ with $0 \le \alpha < 1$; note that if the design $\xi_\alpha$ is concentrated on the set $\mathcal{X}_0$ only (corresponding to the case $\alpha = 1$), then the construction of estimates is not possible. Otherwise we have

$$M(\xi_\alpha) = \int f(x) f^T(x)\, \xi_\alpha(dx) = (1 - \alpha) M(\xi_1), \qquad M^{-1}(\xi_\alpha) = \frac{1}{1 - \alpha} M^{-1}(\xi_1)$$

and

$$B(\xi_\alpha, \xi_\alpha) = \int\!\!\int K(x, u) f(x) f^T(u)\, \xi_\alpha(dx)\, \xi_\alpha(du) = (1 - \alpha)^2 B(\xi_1, \xi_1).$$

Therefore,

$$D(\xi_\alpha) = M^{-1}(\xi_\alpha) B(\xi_\alpha, \xi_\alpha) M^{-1}(\xi_\alpha) = M^{-1}(\xi_1) B(\xi_1, \xi_1) M^{-1}(\xi_1) = D(\xi_1)$$

for all $0 \le \alpha < 1$. Consequently, observations taken at points from the set $\mathcal{X}_0$ do not change the estimate $\hat\theta$ and its covariance matrix. If we use the convention $0 \cdot \infty = 0$, it follows that $\int\!\!\int K(x, u) f(x) f^T(u)\, \xi_0(dx)\, \xi_0(du) = 0$, and this statement is also true for covariance kernels $K(x, u)$ with a singularity at $x = u$. Summarizing this discussion, we assume throughout this paper that $f(x) \neq 0$ for all $x \in \mathcal{X}$.

3. Characterizations of optimal designs.

3.1. General optimality criteria. Recall the definition of the information matrix in (2.4) and define

$$B(\xi, \nu) = \int_{\mathcal{X}} \int_{\mathcal{X}} K(u, v) f(u) f^T(v)\, \xi(du)\, \nu(dv),$$

where $\xi$ and $\nu \in \Xi$ are two arbitrary designs and $K(u, v)$ is an arbitrary covariance kernel. The two main examples of the kernel function $K(u, v)$ will be $K(u, v) = \sigma^2 \rho(u - v)$ and $K(u, v) = r(u - v)$, where $\rho(t)$ is a correlation function and $r(t)$ is a non-negative definite function with a singularity at zero. The latter type arises naturally if the Bickel-Herzberg approach [see Bickel and Herzberg (1979)] is extended such that the variance (in addition to the correlation function) varies as the number of observations changes [see Zhigljavsky et al. (2010) and the discussion in Section 4.4].

According to the discussion in the previous paragraph, the asymptotic covariance matrix of the least squares estimator $\hat\theta$ is proportional to the matrix $D(\xi)$ defined in (2.3). Let $\Phi(\cdot)$ be a monotone, real valued functional defined on the space of symmetric $m \times m$ matrices, where the monotonicity of $\Phi(\cdot)$ means that $A \ge B$ implies $\Phi(A) \ge \Phi(B)$. Then the optimal design $\xi^*$ minimizes the function

$$\Phi(D(\xi)) \tag{3.1}$$

on the space $\Xi$ of all designs. In addition to monotonicity, we shall also assume differentiability of the functional $\Phi(\cdot)$; that is, the existence of the matrix of derivatives

$$C = \frac{\partial \Phi(D)}{\partial D} = \Big( \frac{\partial \Phi(D)}{\partial D_{ij}} \Big)_{i,j=1,\dots,m},$$

where $D$ is any symmetric non-negative definite matrix of size $m \times m$. The following lemma is crucial in the proof of the optimality theorem below.

Lemma 3.1 Let $\xi$ and $\nu$ be two designs and $\Phi$ be a differentiable functional. Set $\xi_\alpha = (1 - \alpha)\xi + \alpha\nu$ and assume that the matrices $M(\xi)$ and $B(\xi, \xi)$ are nonsingular. Then the directional derivative of $\Phi$ at the design $\xi$ in the direction of $\nu - \xi$ is given by

$$\frac{\partial \Phi(D(\xi_\alpha))}{\partial \alpha} \Big|_{\alpha=0} = 2[b(\nu, \xi) - \varphi(\nu, \xi)],$$

where

$$\varphi(\nu, \xi) = \mathrm{tr}\big(M(\nu) D(\xi) C(\xi) M^{-1}(\xi)\big), \qquad b(\nu, \xi) = \mathrm{tr}\big(M^{-1}(\xi) C(\xi) M^{-1}(\xi) B(\xi, \nu)\big)$$

and

$$C(\xi) = \frac{\partial \Phi(D)}{\partial D} \Big|_{D = D(\xi)}.$$


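Proof. Straightforward calculation shows that

$$\frac{\partial}{\partial \alpha} M^{-1}(\xi_\alpha) \Big|_{\alpha=0} = M^{-1}(\xi) - M^{-1}(\xi) M(\nu) M^{-1}(\xi)$$

and

$$\frac{\partial}{\partial \alpha} B(\xi_\alpha, \xi_\alpha) \Big|_{\alpha=0} = B(\xi, \nu) + B(\nu, \xi) - 2B(\xi, \xi).$$

Using the formula for the derivative of a product and the two formulas above, we obtain

$$\frac{\partial}{\partial \alpha} D(\xi_\alpha) \Big|_{\alpha=0} = -2M^{-1}(\xi) M(\nu) D(\xi) + 2M^{-1}(\xi) B(\xi, \nu) M^{-1}(\xi).$$

Note that the matrices $M(\xi_\alpha)$ and $B(\xi_\alpha, \xi_\alpha)$ are nonsingular for small non-negative $\alpha$ (that is, for all $\alpha \in [0, \alpha_0)$, where $\alpha_0$ is a small positive number), which follows from the non-degeneracy of $M(\xi)$ and $B(\xi, \xi)$ and the continuity of $M(\xi_\alpha)$ and $B(\xi_\alpha, \xi_\alpha)$ with respect to $\alpha$. Using the above formula and the fact that $\mathrm{tr}(H(A + A^T)) = 2\,\mathrm{tr}(HA)$ for any $m \times m$ matrix $A$ and any $m \times m$ symmetric matrix $H$, we obtain

$$\frac{\partial \Phi(D(\xi_\alpha))}{\partial \alpha} \Big|_{\alpha=0} = \mathrm{tr}\Big( C(\xi) \frac{\partial}{\partial \alpha} D(\xi_\alpha) \Big|_{\alpha=0} \Big) = 2[b(\nu, \xi) - \varphi(\nu, \xi)]. \qquad \Box$$

Note that the functions $b(\nu, \xi)$ and $\varphi(\nu, \xi)$ can be represented as

$$b(\nu, \xi) = \int b(x, \xi)\, \nu(dx), \qquad \varphi(\nu, \xi) = \int \varphi(x, \xi)\, \nu(dx),$$

where

$$\varphi(x, \xi) = \varphi(\xi_x, \xi) = f^T(x) D(\xi) C(\xi) M^{-1}(\xi) f(x), \tag{3.2}$$

$$b(x, \xi) = b(\xi_x, \xi) = \mathrm{tr}\big(C(\xi) M^{-1}(\xi) B(\xi, \xi_x) M^{-1}(\xi)\big), \tag{3.3}$$

and $\xi_x$ is the probability measure concentrated at a point $x$.

Lemma 3.2 For any design $\xi$ such that the matrices $M(\xi)$ and $B(\xi, \xi)$ are nonsingular we have

$$\int \varphi(x, \xi)\, \xi(dx) = \int b(x, \xi)\, \xi(dx) = \mathrm{tr}\big(D(\xi) C(\xi)\big), \tag{3.4}$$

where the functions $\varphi(x, \xi)$ and $b(x, \xi)$ are defined in (3.2) and (3.3), respectively.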


Proof. Straightforward calculation shows that

$$\int \varphi(x, \xi)\, \xi(dx) = \mathrm{tr}\Big( D(\xi) C(\xi) M^{-1}(\xi) \int f(x) f^T(x)\, \xi(dx) \Big) = \mathrm{tr}\big(D(\xi) C(\xi)\big).$$

We also have

$$\int B(\xi, \xi_x)\, \xi(dx) = \int \Big[ \int\!\!\int K(u, v) f(u) f^T(v)\, \xi(du)\, \xi_x(dv) \Big] \xi(dx) = \int \Big[ \int K(u, x) f(u) f^T(x)\, \xi(du) \Big] \xi(dx) = B(\xi, \xi),$$

which implies

$$\int b(x, \xi)\, \xi(dx) = \mathrm{tr}\Big( M^{-1}(\xi) C(\xi) M^{-1}(\xi) \int B(\xi, \xi_x)\, \xi(dx) \Big) = \mathrm{tr}\big(D(\xi) C(\xi)\big). \qquad \Box$$

The first main result of this section provides a necessary condition for the optimality of a given design.

Theorem 3.1 Let $\xi^*$ be any design minimizing the functional $\Phi(D(\xi))$. Then the inequality

$$\varphi(x, \xi^*) \le b(x, \xi^*) \tag{3.5}$$

holds for all $x \in \mathcal{X}$, where the functions $\varphi(x, \xi)$ and $b(x, \xi)$ are defined in (3.2) and (3.3), respectively. Moreover, there is equality in (3.5) for $\xi^*$-almost all $x$, that is, $\xi^*(A) = 0$, where $A = A(\xi^*) = \{x \in \mathcal{X} \mid \varphi(x, \xi^*) < b(x, \xi^*)\}$ is the set of $x \in \mathcal{X}$ such that the inequality (3.5) is strict.

Proof. Consider any design $\xi^*$ minimizing the functional $\Phi(D(\xi))$. The necessary condition for an element to be a minimizer of a differentiable functional states that the directional derivative from this element in any direction is non-negative. In the case of the design $\xi^*$ and the functional $\Phi(D(\xi))$ this yields for any design $\nu$

$$\frac{\partial \Phi(D(\xi_\alpha))}{\partial \alpha} \Big|_{\alpha=0} \ge 0,$$

where $\xi_\alpha = (1 - \alpha)\xi^* + \alpha\nu$. The inequality (3.5) now follows from Lemma 3.1 applied with $\nu = \xi_x$. The assumption that the inequality (3.5) is strict for all $x \in A$ with $\xi^*(A) > 0$ is in contradiction with the identity (3.4). $\Box$


Remark 3.1 In the classical theory of optimal design, convex optimality criteria are almost always considered. However, in at least one paper, namely Torsney (1986), an optimality theorem for rather general non-convex optimality criteria was established and used (in the case of non-correlated observations).

3.2. An alternative representation of the necessary condition of optimality. Consider for a design $\xi \in \Xi$ the associated space

$$L_2(\mathcal{X}, \xi) = \Big\{ h : \mathcal{X} \to \mathbb{R} \ \Big|\ \int h^2(x)\, \xi(dx) < \infty \Big\}.$$

Note that $f_1, \dots, f_m \in L_2(\mathcal{X}, \xi)$ and define for a design $\xi \in \Xi$ the vector-valued function $\int K(x, u) f(u)\, \xi(du)$. Projecting each component of this function onto the subspace $\mathrm{span}\{f_1, \dots, f_m\} \subset L_2(\mathcal{X}, \xi)$, we obtain the representation

$$\int K(x, u) f(u)\, \xi(du) = \Lambda f(x) + g(x), \tag{3.6}$$

where $\Lambda$ is a uniquely defined $m \times m$ matrix (note that $\Lambda = B(\xi, \xi) M^{-1}(\xi)$) and the vector $g$ satisfies

$$\int g(x) f^T(x)\, \xi(dx) = 0. \tag{3.7}$$

This representation defines $g(x)$ uniquely for $\xi$-almost all $x \in \mathcal{X}$. As $f(x)$ is continuous, we can also assume that $g(x)$ is continuous. Using (3.6) and the symmetry of the matrix $B(\xi, \xi)$ we obtain

$$B(\xi, \xi) = \int \Big[ \int K(x, u) f(u)\, \xi(du) \Big] f^T(x)\, \xi(dx) = \int \Lambda f(x) f^T(x)\, \xi(dx) + \int g(x) f^T(x)\, \xi(dx) = \Lambda M(\xi) = M(\xi) \Lambda^T,$$

which gives for the matrix $D$ in (2.6)

$$D(\xi) = M^{-1}(\xi) B(\xi, \xi) M^{-1}(\xi) = M^{-1}(\xi) \Lambda = \Lambda^T M^{-1}(\xi).$$

Similarly, we obtain for the quantity (3.2)

$$\varphi(x, \xi) = f^T(x) D(\xi) C(\xi) M^{-1}(\xi) f(x) = f^T(x) \Lambda^T M^{-1}(\xi) C(\xi) M^{-1}(\xi) f(x) = f^T(x) M^{-1}(\xi) C(\xi) M^{-1}(\xi) \Lambda f(x).$$


We also have

$$B(\xi, \xi_x) = \int K(x, u) f(u)\, \xi(du)\, f^T(x) = \Lambda f(x) f^T(x) + g(x) f^T(x),$$

which gives

$$b(x, \xi) = \mathrm{tr}\big(C(\xi) M^{-1}(\xi) B(\xi, \xi_x) M^{-1}(\xi)\big) = \varphi(x, \xi) + r(x, \xi),$$

where the function $r$ is defined by

$$r(x, \xi) = f^T(x) M^{-1}(\xi) C(\xi) M^{-1}(\xi) g(x).$$

The following result is now an obvious corollary of Theorem 3.1.

Corollary 3.1 The necessary condition of optimality of the design $\xi$ can be written in the form $r(x, \xi) \ge 0$ for all $x$.

3.3. D-optimality. For D-optimality there exists an analogue of the celebrated 'Equivalence Theorem' of Kiefer and Wolfowitz (1960), which characterizes optimal designs minimizing the D-optimality criterion $\Phi(D(\xi)) = \ln \det(D(\xi))$.

Theorem 3.2 Let $\xi^*$ be any D-optimal design. Then for all $x \in \mathcal{X}$ we have

$$d(x, \xi^*) \le b(x, \xi^*), \tag{3.8}$$

where the functions $d$ and $b$ are defined by $d(x, \xi) = f^T(x) M^{-1}(\xi) f(x)$ and

$$b(x, \xi) = \mathrm{tr}\big(B^{-1}(\xi, \xi) B(\xi, \xi_x)\big) = f^T(x) B^{-1}(\xi, \xi) \int K(u, x) f(u)\, \xi(du), \tag{3.9}$$

respectively. Moreover, there is equality in (3.8) for $\xi^*$-almost all $x$.

Proof. In the case of the D-optimality criterion $\Phi(D(\xi)) = \ln \det(D(\xi))$, we have $C(\xi) = D^{-1}(\xi)$, which gives

$$\varphi(x, \xi) = f^T(x) D(\xi) D^{-1}(\xi) M^{-1}(\xi) f(x) = d(x, \xi).$$

Similarly, substituting $C(\xi) = D^{-1}(\xi) = M(\xi) B^{-1}(\xi, \xi) M(\xi)$ into (3.3) yields the expression (3.9) for $b(x, \xi)$. Reference to Theorem 3.1 completes the proof. $\Box$

Note that, since $C(\xi) = M(\xi) B^{-1}(\xi, \xi) M(\xi)$, the function $r(x, \xi)$ for the D-criterion is given by

$$r(x, \xi) = f^T(x) M^{-1}(\xi) C(\xi) M^{-1}(\xi) g(x) = f^T(x) B^{-1}(\xi, \xi) g(x)$$

and, consequently, the necessary condition of D-optimality can be written as

$$f^T(x) B^{-1}(\xi, \xi) g(x) \ge 0 \quad \text{for all } x \in \mathcal{X}.$$

The following statement illustrates a remarkable similarity between D-optimal design problems in the cases of correlated and non-correlated observations. The proof easily follows from Lemma 3.2 and Theorem 3.2.

Corollary 3.2 For any design $\xi$ such that the matrices $M(\xi)$ and $B(\xi, \xi)$ are nonsingular we have

$$\int d(x, \xi)\, \xi(dx) = \int b(x, \xi)\, \xi(dx) = m,$$

where $b(x, \xi)$ is defined in (3.9) and $m$ is the number of parameters in the regression model (1.1).
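The identity of Corollary 3.2 is easy to verify numerically for a design with finite support, where the integrals become weighted sums. The following sketch is an illustration under stated assumptions rather than part of the original argument: the quadratic model $f(x) = (1, x, x^2)^T$, the kernel $K(u, v) = e^{-|u-v|}$, the four support points and their weights are arbitrary choices, and all function names are hypothetical.

```python
import math

def f(x):
    """Regression vector of the quadratic model (illustrative assumption)."""
    return [1.0, x, x * x]

def kernel(u, v):
    """Exponential covariance kernel (illustrative assumption)."""
    return math.exp(-abs(u - v))

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(A)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * c for a, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def outer_sum(terms, m=3):
    """Sum of c * a b^T over triples (c, a, b)."""
    S = [[0.0] * m for _ in range(m)]
    for c, a, b in terms:
        for i in range(m):
            for j in range(m):
                S[i][j] += c * a[i] * b[j]
    return S

def bilinear(u, A, v):
    return sum(u[i] * A[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

def d_and_b(points, weights):
    """Return the functions d(x, xi) and b(x, xi) of Theorem 3.2 for a discrete design."""
    M = outer_sum((w, f(u), f(u)) for u, w in zip(points, weights))
    B = outer_sum((wu * wv * kernel(u, v), f(u), f(v))
                  for u, wu in zip(points, weights)
                  for v, wv in zip(points, weights))
    m_inv, b_inv = inverse(M), inverse(B)

    def d(x):
        return bilinear(f(x), m_inv, f(x))

    def b(x):
        # vector  int K(u, x) f(u) xi(du)  for the discrete design
        kf = [sum(w * kernel(u, x) * f(u)[i] for u, w in zip(points, weights))
              for i in range(3)]
        return bilinear(f(x), b_inv, kf)

    return d, b

points = [-1.0, -0.4, 0.3, 1.0]
weights = [0.3, 0.2, 0.1, 0.4]
d, b = d_and_b(points, weights)
int_d = sum(w * d(x) for x, w in zip(points, weights))
int_b = sum(w * b(x) for x, w in zip(points, weights))
print(int_d, int_b)
```

Both sums should equal $m = 3$ up to rounding, whatever nondegenerate discrete design is used, since they are the traces of $M^{-1} M$ and $B^{-1} B$.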

Fig 1. The functions $b(x, \xi)$ and $d(x, \xi)$ for the regression model (1.1) with $f(x) = (1, x, x^2)^T$ and the covariance kernels $K(u, v) = e^{-|u-v|}$ (left), $K(u, v) = \max(0, 1 - |u - v|)$ (middle) and $K(u, v) = -\log(u - v)^2$ (right), and the arcsine distribution $\xi_a$.

Example 3.1 Consider the quadratic regression model $y(x) = \theta_1 + \theta_2 x + \theta_3 x^2 + \varepsilon(x)$ with design space $\mathcal{X} = [-1, 1]$. In Figure 1 we plot the functions $b(x, \xi)$ and $d(x, \xi)$ for the covariance kernels $K(u, v) = e^{-|u-v|}$, $K(u, v) = \max\{0, 1 - |u - v|\}$ and $K(u, v) = -\log(u - v)^2$, where the design is the arcsine distribution with density

$$p(x) = \frac{1}{\pi \sqrt{1 - x^2}}, \qquad x \in (-1, 1). \tag{3.10}$$

Throughout this paper this design will be denoted by $\xi_a$. By definition, the function $d(x, \xi)$ is the same for different covariance kernels, but the function $b(x, \xi)$ depends on the choice of the kernel. From the left and middle panels we see that the arcsine distribution does not satisfy the necessary condition of Theorem 3.1 for the kernels $K(u, v) = e^{-|u-v|}$ and $\max\{0, 1 - |u - v|\}$ and is therefore not D-optimal for the quadratic regression model. On the other hand, for the logarithmic kernel $K(u, v) = -\log(u - v)^2$ the necessary condition is satisfied and the arcsine distribution is a candidate for the D-optimal design. We will show in Theorem 4.5 that the arcsine distribution $\xi_a$ is universally optimal and as a consequence optimal with respect to a broad class of criteria including the D-optimality criterion.

3.4. c-optimality. For the c-optimality criterion $\Phi(D(\xi)) = c^T D(\xi) c$, we have $C(\xi) = c c^T$. Consequently,

$$\varphi(x, \xi) = f^T(x) M^{-1}(\xi)\, c c^T M^{-1}(\xi) \Lambda f(x) = c^T M^{-1}(\xi) \Lambda f(x) f^T(x) M^{-1}(\xi)\, c$$

and

$$r(x, \xi) = b(x, \xi) - \varphi(x, \xi) = c^T M^{-1}(\xi) g(x) f^T(x) M^{-1}(\xi) c.$$

Therefore, the necessary condition for c-optimality simplifies to

$$c^T M^{-1}(\xi) g(x) f^T(x) M^{-1}(\xi) c \ge 0 \quad \text{for all } x \in \mathcal{X}. \tag{3.11}$$

Example 3.2 Consider again the quadratic regression model $y(x) = \theta_1 + \theta_2 x + \theta_3 x^2 + \varepsilon(x)$ with design space $\mathcal{X} = [-1, 1]$. Assume the triangular correlation function $\rho(x) = \max\{0, 1 - |x|\}$. Let $\xi = \{-1, 0, 1; 1/3, 1/3, 1/3\}$ be the design assigning weights $1/3$ to the points $-1$, $0$ and $1$. For this design, the matrices $M(\xi)$ and $D(\xi)$ are

$$M(\xi) = \begin{pmatrix} 1 & 0 & 2/3 \\ 0 & 2/3 & 0 \\ 2/3 & 0 & 2/3 \end{pmatrix}, \qquad D(\xi) = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1/2 & 0 \\ -1 & 0 & 3/2 \end{pmatrix},$$

and the matrix $\Lambda$ is given by $\Lambda = \mathrm{diag}(1/3, 1/3, 1/3)$. Since $\int K(x, u) f(u)\, \xi(du) = (1/3, x/3, |x|/3)^T$, the representation (3.6) yields

$$g(x) = \big(0, 0, |x|(1 - |x|)/3\big)^T.$$

If $c = (0, 1, 0)^T$ then $r(x, \xi) = 0$ for all $x \in [-1, 1]$ and thus the design $\xi$ satisfies the necessary condition for c-optimality in (3.11). If $c = (1, 0, 1)^T$ then $r(x, \xi) = \frac{3}{4} |x|^3 (1 - |x|) \ge 0$ for all $x \in [-1, 1]$ and the design $\xi$ also satisfies (3.11). The corresponding functions $b$ and $\varphi$ are displayed in the left panel of Figure 3. Numerical analysis shows that for both vectors this design is in fact c-optimal. However, it is not optimal for every c-optimality criterion. For example, if $c = (1, 0, 0)^T$ then $r(x, \xi) = -3|x|(1 - |x|)(1 - x^2) \le 0$ for all $x \in [-1, 1]$, showing that the design is not c-optimal [see the middle panel of Figure 3]. For this case the c-optimal design is displayed in Figure 2. The corresponding functions $b$ and $\varphi$ are shown in the right panel of Figure 3.
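The quantities in Example 3.2 can be reproduced in exact rational arithmetic. The sketch below is only an illustration of the formulas (2.4) to (2.6), (3.6) and (3.11) for this three-point design; the helper names are hypothetical, and the evaluation point $x = 1/2$ is an arbitrary choice.

```python
from fractions import Fraction as Fr

def f(x):
    """Regression vector f(x) = (1, x, x^2)^T of the quadratic model."""
    return [Fr(1), x, x * x]

def kernel(u, v):
    """Triangular kernel max(0, 1 - |u - v|)."""
    return max(Fr(0), 1 - abs(u - v))

def inverse(A):
    """Exact Gauss-Jordan inverse for a matrix of Fractions."""
    n = len(A)
    aug = [list(row) + [Fr(int(i == j)) for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * c for a, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

POINTS = [Fr(-1), Fr(0), Fr(1)]   # support of the design
W = Fr(1, 3)                      # equal weights

M = [[sum(W * f(u)[i] * f(u)[j] for u in POINTS) for j in range(3)] for i in range(3)]
B = [[sum(W * W * kernel(u, v) * f(u)[i] * f(v)[j] for u in POINTS for v in POINTS)
      for j in range(3)] for i in range(3)]
M_inv = inverse(M)
D = matmul(matmul(M_inv, B), M_inv)   # covariance matrix (2.6)
Lam = matmul(B, M_inv)                # Lambda = B M^-1 from (3.6)

def g(x):
    """Remainder g(x) = int K(x, u) f(u) xi(du) - Lambda f(x) from (3.6)."""
    kf = [sum(W * kernel(x, u) * f(u)[i] for u in POINTS) for i in range(3)]
    lf = [sum(Lam[i][j] * f(x)[j] for j in range(3)) for i in range(3)]
    return [a - b for a, b in zip(kf, lf)]

def r(c, x):
    """Left-hand side of (3.11): (c^T M^-1 g(x)) * (f^T(x) M^-1 c)."""
    gx, fx = g(x), f(x)
    left = sum(c[i] * M_inv[i][j] * gx[j] for i in range(3) for j in range(3))
    right = sum(fx[i] * M_inv[i][j] * c[j] for i in range(3) for j in range(3))
    return left * right

x0 = Fr(1, 2)
print(D, Lam, g(x0))
print(r([0, 1, 0], x0), r([1, 0, 1], x0), r([1, 0, 0], x0))
```

At $x = 1/2$ the closed-form expressions of the example give $r = 0$, $r = \frac{3}{4} \cdot \frac{1}{8} \cdot \frac{1}{2} = \frac{3}{64}$ and $r = -3 \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{3}{4} = -\frac{9}{16}$ for the three vectors $c$, which the printed values can be compared against.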

Fig 2. The c-optimal design for the quadratic model and the triangular correlation function, where $c = (1, 0, 0)^T$.

Fig 3. The functions $b(x, \xi)$ and $\varphi(x, \xi)$ for the c-optimality criterion. (a): the design $\xi = \{-1, 0, 1; 1/3, 1/3, 1/3\}$ and the vector $c = (1, 0, 1)^T$. (b): the design $\xi = \{-1, 0, 1; 1/3, 1/3, 1/3\}$ and the vector $c = (1, 0, 0)^T$. (c): the optimal design for the vector $c = (1, 0, 0)^T$ displayed in Figure 2.

3.5. Universal optimality. In this section we consider the matrix D(ξ) defined in (2.6) as the matrix optimality criterion which we are going to minimize on the set Ξ of all designs, such that the matrices B(ξ, ξ) and M −1 (ξ) (and therefore the matrix D(ξ)) are well-defined. Recall that a design ξ ∗ is universally optimal if D(ξ ∗ ) ≤ D(ξ) in the sense of the Loewner ordering for any design ξ ∈ Ξ. Note that a design ξ ∗ is universally optimal if and only if ξ ∗ is c-optimal for any vector c ∈ Rm \{0}; that is, cT D(ξ ∗ )c ≤ cT D(ξ)c for any ξ ∈ Ξ. The following result gives a necessary and sufficient condition for the optimality of a given design ξ ∗ . Theorem 3.3 Consider the regression model (1.1) with covariance kernel K and m > 1 regression functions. Assume that the design ξ ∗ ∈ Ξ satis-

16

H. DETTE ET AL.

fies (3.6) with some m × m nonsingular matrix Λ. Then the design ξ ∗ is universally optimal in the set Ξ if and only if g(x) = 0 for all x ∈ X . Proof. Consider the regression model y(x) = f T (x)θ+ε(x), where the full ∫ ˆ trajectory {y(x)|x ∈ X } can be observed. Let θ(µ) = y(x)µ(dx) be a general linear unbiased estimate of the parameter θ, where µ = (µ1 , . . . , µm )T is a vector of signed measures. For example, the least squares estimate for a deˆ ξ ), where µξ (dx) = M −1 (ξ)f (x)ξ(dx). sign ξ in this model is obtained as θ(µ ˆ The condition of unbiasedness of the estimate θ(µ) means that [∫ ] ∫ ˆ θ = E[θ(µ)] =E µ(dx)y(x) = µ(dx)f T (x) θ for all θ ∈ Rm , which is equivalent to the condition ∫ ∫ (3.12) µ(dx)f T (x) = f (x)µT (dx) = Im , where Im denotes the m×m identity matrix. In the following discussion we define M as the set of all signed vector measures supported on X satisfying condition (3.12). ˆ For a given vector c ∈ Rm , the variance of the estimate cT θ(µ) is given by ∫ ∫ Tˆ T Φc (µ) = Var(c θ(µ)) = c E[ε(x)ε(u)]µ(dx)µT (du)c ∫ ∫ T = c K(x, u)µ(dx)µT (du)c, and a minimizer of this expression with respect to µ ∈ M determines the best linear estimate for cT θ and the corresponding c-optimal design simultaneously. Note that the set M of signed vector-valued measures is convex and the optimality criterion (3.13) is also convex. Similar arguments as given in Section 3.1 show that the directional derivative of Φc at µ∗ in the direction of ν − µ∗ is given by ∂ ∂ = Φc (µα ) Φc ((1 − α)µ∗ + αν) ∂α ∂α α=0 α=0 [∫ ∫ ] ∫ ∫ T ∗ T ∗ ∗T = 2c K(x, u)µ (dx)ν (du) − K(x, u)µ (dx)µ (du) c. Because Φ c is convex the optimality of µ∗ is equivalent to the condition ∂ ≥ 0 for all ν ∈ M. Therefore, the signed measure µ∗ minimizes ∂α Φc (µα ) α=0


the functional Φc(µ) if and only if the inequality

$$\Phi_c(\mu^*,\nu) = c^T\iint K(x,u)\,\mu^*(dx)\,\nu^T(du)\,c \;\ge\; \Phi_c(\mu^*) \qquad (3.13)$$

holds for all ν ∈ ℳ. Let ξ* ∈ Ξ denote a measure satisfying (3.6) with g(x) = 0 for all x ∈ X. Define the vector-valued measure µ0(dx) = M^{-1}(ξ*) f(x) ξ*(dx); then it follows for all ν ∈ ℳ that

$$\Phi_c(\mu_0,\nu) = c^T\iint K(x,u)\,M^{-1}(\xi^*)f(x)\,\xi^*(dx)\,\nu^T(du)\,c = c^T M^{-1}(\xi^*)\Lambda\int f(u)\,\nu^T(du)\,c = c^T M^{-1}(\xi^*)\Lambda c,$$

where we used (3.12) for the measure ν in the last identity. On the other hand, µ0 ∈ ℳ also satisfies (3.12), and using the identity (3.6) with g(x) ≡ 0 once more we obtain

$$\Phi_c(\mu_0) = c^T\iint K(x,u)\,\mu_0(dx)\,\mu_0^T(du)\,c = c^T\int\Big[\int K(x,u)\,M^{-1}(\xi^*)f(x)\,\xi^*(dx)\Big]\mu_0^T(du)\,c = c^T M^{-1}(\xi^*)\Lambda\int f(u)\,\mu_0^T(du)\,c = c^T M^{-1}(\xi^*)\Lambda c.$$

This yields that for µ* = µ0 we have equality in (3.13) for all ν ∈ ℳ, which shows that the vector-valued measure µ0(dx) = M^{-1}(ξ*) f(x) ξ*(dx) minimizes the function Φc for any c ≠ 0 over the set ℳ of signed vector-valued measures.

Now we return to the minimization of the function D(ξ) in the class of all designs ξ ∈ Ξ. For any ξ ∈ Ξ, define the corresponding vector-valued measure µ_ξ(dx) = M^{-1}(ξ) f(x) ξ(dx) and note that µ_ξ ∈ ℳ. We obtain

$$c^T D(\xi)c = c^T M^{-1}(\xi)B(\xi,\xi)M^{-1}(\xi)c = \Phi_c(\mu_\xi) \;\ge\; \min_{\mu\in\mathcal{M}}\Phi_c(\mu) = \Phi_c(\mu_0) = c^T D(\xi^*)c.$$

Since the design ξ* does not depend on the particular vector c, it follows that ξ* is universally optimal.

In order to prove the converse we will show that for any design ξ for which g(x) ≢ 0 in (3.6) there exists a vector c ∈ R^m \ {0} such that ξ is not c-optimal. For this purpose we will make use of the following auxiliary result, which is proved in the Appendix [see Section 7].
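The Loewner inequality c^T D(ξ) c ≥ c^T D(ξ*) c established above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not part of the original argument: it takes f(x) = (1, x)^T on [−1, 1], the triangular correlation function with λ = 1/2 and the two-point design of Theorem 4.3(a) below as ξ*, and it assumes that B(ξ, ξ) in (2.6) is the double integral of K(u, v) f(u) f^T(v) with respect to ξ(du) ξ(dv). The helper name `design_covariance` and the fixed grid of competing designs are hypothetical choices made for this illustration.

```python
import numpy as np

def rho(t, lam=0.5):
    """Triangular correlation function max{0, 1 - lam*|t|}."""
    return np.maximum(0.0, 1.0 - lam * np.abs(t))

def design_covariance(points, weights, lam=0.5):
    """D(xi) = M^{-1} B M^{-1} for f(x) = (1, x)^T and a discrete design.

    Assumption: B(xi, xi) = sum_i sum_j w_i w_j K(x_i, x_j) f(x_i) f(x_j)^T.
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    F = np.stack([np.ones_like(points), points], axis=1)   # rows are f(x_i)^T
    WF = weights[:, None] * F
    M = F.T @ WF                                           # information matrix
    K = rho(points[:, None] - points[None, :], lam)        # kernel matrix
    B = WF.T @ K @ WF
    M_inv = np.linalg.inv(M)
    D = M_inv @ B @ M_inv
    return 0.5 * (D + D.T)                                 # remove rounding asymmetry

# Candidate universally optimal design: mass 1/2 at each endpoint.
D_star = design_covariance([-1.0, 1.0], [0.5, 0.5])

# Competing designs: random weights on a fixed 9-point grid, mixed with the
# uniform weights so that M(xi) stays well conditioned.
rng = np.random.default_rng(0)
grid = np.linspace(-1.0, 1.0, 9)
min_eigs = []
for _ in range(200):
    w = 0.5 / grid.size + 0.5 * rng.dirichlet(np.ones(grid.size))
    diff = design_covariance(grid, w) - D_star
    min_eigs.append(np.linalg.eigvalsh(diff).min())

print(D_star)
print(min(min_eigs))   # nonnegative up to rounding if D(xi) - D(xi*) is PSD
```

A nonnegative smallest eigenvalue of D(ξ) − D(ξ*) for every sampled ξ is what the Loewner ordering requires; the check samples only finitely many designs and is therefore an illustration, not a proof.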


Lemma 3.3 Let m > 1 and a, b ∈ R^m be two linearly independent vectors. Then there exists a vector c ∈ R^m such that S_c = c^T a b^T c < 0.

Assume now that ξ is a design such that g(x) ≠ 0 for at least one x ∈ X in (3.6). According to (3.11), the necessary condition for c-optimality of a design ξ is r_c(x, ξ) = c^T M^{-1}(ξ) g(x) f^T(x) M^{-1}(ξ) c ≥ 0 for all x ∈ X. The function g(x) cannot be proportional to f(x) at every x ∈ X, as otherwise the condition (3.7) could not hold. Choose any point x0 ∈ X such that g(x0) ≠ 0 and g(x0) is not proportional to f(x0) (recall also that f(x) ≠ 0 for all x ∈ X). Then r_c(x0, ξ) = c^T M^{-1}(ξ) g(x0) f^T(x0) M^{-1}(ξ) c = c^T a b^T c with a = M^{-1}(ξ) g(x0) and b = M^{-1}(ξ) f(x0). Using Lemma 3.3, we deduce that there exists a vector c such that r_c(x0, ξ) < 0. Therefore the design ξ is not c-optimal and as a consequence also not universally optimal. □

A careful inspection of the proof of Theorem 3.3 shows that the proof of the converse part is not valid in the case m = 1. Indeed, examples in Zhigljavsky et al. (2010) show that there exist optimal designs which satisfy the condition (3.6) with g(x) ≢ 0. In this case the condition is only sufficient and we state the result here for the sake of completeness.

Theorem 3.4 Consider the one-parameter regression model

$$y(x) = \theta f(x) + \varepsilon(x), \quad x \in \mathcal{X}. \qquad (3.14)$$

Assume that for a given design ξ* there exists a constant λ > 0 such that the identity

$$\lambda f(x) = \int K(u,x)\, f(u)\,\xi^*(du) \qquad (3.15)$$

holds for all x ∈ X. Then the design ξ* is (universally) optimal for the model (3.14).

4. Optimal designs for specific kernels and models.

4.1. Optimality and Mercer's theorem. In this section we consider the case when the regression functions are proportional to eigenfunctions from Mercer's theorem. To be precise, let X denote a compact subset of a metric


space and let ν denote a measure on the corresponding Borel field with positive density. Consider the integral operator

$$T_K(f)(\cdot) = \int_{\mathcal{X}} K(\cdot,u)\, f(u)\,\nu(du) \qquad (4.1)$$

on L2(ν). Under certain assumptions on the kernel (for example if K(u, v) is symmetric, continuous and positive definite) T_K defines a compact, self-adjoint operator. In this case Mercer's Theorem [see e.g. Kanwal (1997)] shows that there exist a countable number of eigenfunctions φ1, φ2, ... with positive eigenvalues λ1, λ2, ... of the operator T_K, that is

$$T_K(\varphi_\ell) = \lambda_\ell\,\varphi_\ell, \quad \ell = 1, 2, \ldots \qquad (4.2)$$

The next statement follows directly from Theorem 3.3.

Theorem 4.1 Let X be a compact subset of a metric space and assume that the covariance kernel K(x, u) defines an integral operator T_K of the form (4.1), where the eigenfunctions satisfy (4.2). Consider the regression model (1.1) with f(x) = L(φ_{i1}(x), ..., φ_{im}(x))^T and the covariance kernel K(x, u), where L ∈ R^{m×m} is a non-singular matrix. Then the design ν is universally optimal.

We note that the Mercer expansion is known analytically for certain covariance kernels. For example, if ν is the uniform distribution on the interval X = [−1, 1] and the covariance kernel is of exponential type, that is K(x, u) = e^{−λ|x−u|}, then the eigenfunctions are given by φ_k(x) = sin(ω_k x + kπ/2), k ∈ N, where ω1, ω2, ... are positive roots of the equation tan(2ω) = −2λω/(λ² − ω²). Similarly, consider as a second example the covariance kernel K(x, u) = min{x, u} and X = [0, 1]. In this case the eigenfunctions of the corresponding integral operator are given by φ_k(x) = sin((k + 1/2)πx), k ∈ N. In the following subsection we provide a further example of the application of Mercer's theorem.

4.2. Uniform design for periodic covariance functions. Consider the regression functions

$$f_j(x) = \begin{cases} 1 & \text{if } j = 1, \\ \sqrt{2}\cos(2\pi(j-1)x) & \text{if } j \ge 2 \end{cases} \qquad (4.3)$$


and the design space X = [0, 1]. Assume that the correlation function ρ(x) is periodic with period 1, that is ρ(x) = ρ(x + 1), and let a covariance kernel be defined by K(u, v) = σ²ρ(u − v) with σ² = 1. An example of a correlation function ρ(x) satisfying this property is provided by a convex combination of the functions {cos(2πx), cos²(2πx), ...}.

Theorem 4.2 Consider the regression model (1.1) with regression functions f_{i1}(x), ..., f_{im}(x) (1 ≤ i1 < ··· < im) defined in (4.3) and a correlation function ρ(x) that is periodic with period 1. Then the uniform design is universally optimal.

Proof. We will show that the identity

$$\int_0^1 K(u,x)\, f_j(u)\,du = \int_0^1 \rho(u-x)\, f_j(u)\,du = \lambda_j f_j(x) \qquad (4.4)$$

holds for all x ∈ [0, 1], where λ_j = ∫_0^1 ρ(u) cos(2π(j − 1)u) du (j ≥ 1). The assertion then follows from Theorem 4.1.

To prove (4.4), we define A_j(v) = ∫_0^1 ρ(u − v) f_j(u) du, which should be shown to be λ_j f_j(v). For j = 1 we have A_1(v) = λ_1 because ∫_0^1 ρ(u − v) du = ∫_0^1 ρ(u) du = λ_1 by the periodicity of the function ρ(x). For j = 2, 3, ... we note that

$$A_j(v) = \int_0^1 \rho(u-v)\, f_j(u)\,du = \int_{-v}^{1-v} f_j(u+v)\rho(u)\,du = \int_0^{1-v} f_j(u+v)\rho(u)\,du + \int_{-v}^{0} f_j(u+v)\rho(u)\,du.$$

Because of the periodicity we have

$$\int_{-v}^{0} f_j(u+v)\rho(u)\,du = \int_{1-v}^{1} f_j(u+v)\rho(u)\,du,$$

which gives A_j(v) = ∫_0^1 f_j(u + v) ρ(u) du. A simple calculation now shows

$$A_j''(v) = -b_j^2 A_j(v), \qquad (4.5)$$

where b_j² = (2π(j − 1))² and

$$A_j(0) = \sqrt{2}\int_0^1 \cos(2\pi(j-1)u)\rho(u)\,du = \sqrt{2}\,\lambda_j, \qquad A_j'(0) = -\sqrt{2}\, b_j\int_0^1 \sin(2\pi(j-1)u)\rho(u)\,du = 0.$$

Therefore (from the theory of differential equations) the unique solution of (4.5) is of the form A_j(v) = c1 cos(b_j v) + c2 sin(b_j v), where c1 and c2 are determined by the initial conditions, that is A_j(0) = c1 = √2 λ_j and A_j'(0) = b_j c2 = 0. This yields A_j(v) = √2 λ_j cos(2π(j − 1)v) = λ_j f_j(v) and proves the identity (4.4). □

4.3. Optimal designs for the triangular covariance function. Let us now consider the triangular correlation function defined by

$$\rho(x) = \max\{0,\, 1 - \lambda|x|\}. \qquad (4.6)$$

The following theorem presents the optimal design for the linear model with a triangular correlation function.

Theorem 4.3 Consider the model (1.1) with f(x) = (1, x)^T, X = [−1, 1], and the triangular correlation function (4.6).
(a) If λ ∈ (0, 1/2], then the design ξ* = {−1, 1; 1/2, 1/2} is universally optimal.
(b) If λ ∈ N, then the design supported at 2λ + 1 points x_k = −1 + k/λ, k = 0, 1, ..., 2λ with equal weights is universally optimal.

Proof. For a proof of part (a) we use arguments as given in the proof of Theorem 4.4 in Zhigljavsky et al. (2010) and obtain ∫ ρ(x − u) f_i(u) ξ*(du) = c_i f_i(x) for i = 1, 2, with c_1 = 1 − λ and c_2 = λ. Thus, the assumptions of Theorem 3.3 are fulfilled.

Part (b). Straightforward but tedious calculations show that M(ξ*) = diag(1, γ), where γ = Σ_{k=0}^{2λ} x_k²/(2λ + 1) = (λ + 1)/(3λ). Also we have ∫ ρ(x − u) f_i(u) ξ*(du) = f_i(x)/(2λ + 1) for i = 1, 2. Thus, the assumptions of Theorem 3.3 are fulfilled. □

The designs provided in Theorem 4.3 are also optimal for the location scale model, see Zhigljavsky et al. (2010). However, unlike the results of previous subsections, the result of Theorem 4.3 cannot be extended to polynomial models of higher order. We conclude this section with an example which shows that there exist c-optimal designs that are not universally optimal.

Example 4.1 Consider the model (1.1) with f(x) = (1, x)^T, X = [−1, 1], and the correlation function ρ(x) = max{0, 1 − |x|}. It is easy to see that the design ξ = {−1, 0, 1; p, 1 − 2p, p} with any p ∈ (0, 1/2) is c-optimal for the vector c = (0, 1)^T but not universally optimal (unless p = 1/3, when the design is universally optimal).
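The eigenfunction-type identities behind Theorem 4.3 can be checked numerically for discrete designs. The sketch below is an illustration only: it evaluates ∫ ρ(x − u) f(u) ξ*(du) for f(u) = (1, u)^T on a grid of x values and compares the result with a multiple of f(x). The proportionality constants used here (1 − λ and λ in part (a), 1/(2λ + 1) in part (b)) come from a direct calculation with the triangular kernel, and the function name `kernel_action` and the chosen values of λ are hypothetical.

```python
import numpy as np

def rho(t, lam):
    """Triangular correlation function (4.6)."""
    return np.maximum(0.0, 1.0 - lam * np.abs(t))

def kernel_action(x, points, weights, lam):
    """Return the rows  int rho(x - u) f(u) xi(du)  for f(u) = (1, u)^T
    and a discrete design xi with the given support points and weights."""
    k = rho(x[:, None] - points[None, :], lam)            # (len(x), n_points)
    f = np.stack([np.ones_like(points), points], axis=1)  # (n_points, 2)
    return (k * weights) @ f                               # (len(x), 2)

x = np.linspace(-1.0, 1.0, 201)
f_x = np.stack([np.ones_like(x), x], axis=1)

# Part (a): two-point design with lambda in (0, 1/2].
lam_a = 0.4
act_a = kernel_action(x, np.array([-1.0, 1.0]), np.array([0.5, 0.5]), lam_a)
err_a = np.max(np.abs(act_a - f_x * np.array([1.0 - lam_a, lam_a])))

# Part (b): equally weighted design on 2*lambda + 1 equidistant points.
lam_b = 3
pts_b = -1.0 + np.arange(2 * lam_b + 1) / lam_b
w_b = np.full(pts_b.size, 1.0 / pts_b.size)
act_b = kernel_action(x, pts_b, w_b, lam_b)
err_b = np.max(np.abs(act_b - f_x / pts_b.size))

# Second diagonal entry of M(xi*) in part (b).
gamma_b = np.sum(w_b * pts_b**2)

print(err_a, err_b, gamma_b)
```

Both errors vanishing up to rounding means that condition (3.6) holds with g ≡ 0 and a diagonal matrix Λ for these two designs, which is the hypothesis needed in Theorem 3.3.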


4.4. Polynomial regression models and singular kernels. In this section we consider the polynomial regression model, that is f(x) = (1, x, ..., x^{m−1})^T, with logarithmic covariance kernel

$$K(u,v) = \gamma - \beta\ln(u-v)^2, \quad \beta > 0,\ \gamma \ge 0, \qquad (4.7)$$

and the kernel

$$K(u,v) = \gamma + \beta/|u-v|^{\alpha}, \quad 0 \le \alpha < 1,\ \gamma \ge 0,\ \beta > 0, \qquad (4.8)$$

for which the universally optimal designs can be found explicitly. Moreover, for the covariance kernel (4.7) we will also establish uniqueness of the universally optimal design.

Random processes with singular covariance functions are interesting in their own right. They also appear naturally as approximations to many standard covariance functions K̃(u, v) = σ²ρ̃(u − v) with ρ̃(0) = 1 if σ² is large. A general scheme for this type of approximation is investigated in Zhigljavsky et al. (2010), Section 4. More precisely, these authors discussed the case where the covariance kernel can be represented as σ_δ² ρ̃_δ(t) = r ∗ h_δ(t) with a singular kernel r(t) and a smoothing kernel h_δ(·) (here δ is a smoothing parameter and ∗ denotes the convolution operator). The basic idea is illustrated in the following example.

Fig 4. The logarithmic covariance kernel r(t) = −ln t² and the covariance kernel (4.9), where δ = 0.02, 0.05, 0.1.

Example 4.2 Consider the covariance kernel K̃(u, v) = ρ_δ(u − v), where

$$\rho_\delta(t) = 2 - \frac{1}{\delta}\log\Big(\frac{|t+\delta|^{t+\delta}}{|t-\delta|^{t-\delta}}\Big). \qquad (4.9)$$


For several values of δ, the function ρ_δ is displayed in Figure 4. A straightforward calculation shows that ρ_δ(t) = r ∗ h_δ(t), where r(t) = −ln t² and h_δ is the density of the uniform distribution on the interval [−δ, δ]. As illustrated by Figure 4, the function ρ_δ(·) is well approximated by the singular kernel r(·) if δ is small. In Figure 5 we display the D-optimal designs (constructed numerically) for the quadratic model with a stationary error process with covariance kernel K̃(u, u + t) = ρ_δ(t), where ρ_δ is defined in (4.9) and δ = 0.02, 0.05, 0.1. As one can see, for small δ these designs are very close to the arcsine design, which is the D-optimal design for the quadratic model and the logarithmic kernel, as proved in Theorem 4.5 of the following section.

Fig 5. The D-optimal designs for the quadratic model with covariance kernel (4.9), where δ = 0.02 (left), δ = 0.05 (middle) and δ = 0.1 (right). The gray line represents the arcsine density.

In Table 1 we show the efficiency of the arcsine distribution (obtained by minimizing det(D(ξ)) with the logarithmic kernel) in the quadratic regression model with the kernel (4.9). We observe a very high efficiency with respect to the D-optimality criterion. Even in the case δ = 0.1 the efficiency is 93.6% and it converges quickly to 100% as δ approaches 0.

Table 1. Efficiency of the arcsine design ξa for the quadratic model and the kernel (4.9).

δ          0.02    0.04    0.06    0.08    0.1
Eff(ξa)    0.998   0.978   0.966   0.949   0.936
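The smoothed kernel (4.9) of Example 4.2 is simple to evaluate, and its relation to the singular kernel r(t) = −ln t² can be checked directly. The following is a minimal sketch, with hypothetical helper names `xlogabs`, `rho_delta` and `r`: it evaluates (4.9), compares it with a midpoint-rule average of r over [t − δ, t + δ] (the convolution with the uniform density h_δ), and records how far ρ_δ is from r at a few points away from the origin.

```python
import numpy as np

def xlogabs(x):
    """x * ln|x|, with the limiting value 0 at x = 0."""
    x = np.asarray(x, dtype=float)
    safe = np.where(x == 0.0, 1.0, np.abs(x))
    return x * np.log(safe)

def rho_delta(t, delta):
    """Smoothed logarithmic kernel (4.9)."""
    t = np.asarray(t, dtype=float)
    return 2.0 - (xlogabs(t + delta) - xlogabs(t - delta)) / delta

def r(t):
    """Singular logarithmic kernel r(t) = -ln t^2."""
    return -np.log(np.asarray(t, dtype=float) ** 2)

# Convolution check at t = 0.5: average r over [t - delta, t + delta].
t0, delta0, n_nodes = 0.5, 0.1, 20_000
offsets = delta0 * (2.0 * (np.arange(n_nodes) + 0.5) / n_nodes - 1.0)
conv_value = r(t0 + offsets).mean()

# Distance between rho_delta and r on points away from the singularity.
t = np.array([0.25, 0.5, 1.0])
gaps = {d: float(np.max(np.abs(rho_delta(t, d) - r(t)))) for d in (0.1, 0.05, 0.02)}

print(conv_value, float(rho_delta(t0, delta0)))
print(gaps)
print(float(rho_delta(0.0, 0.02)))   # finite value at the origin
```

Unlike r, the smoothed kernel is finite at t = 0, where (4.9) reduces to 2 − 2 ln δ.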

4.4.1. Optimality of the arcsine design. We will need the following lemma, which states a result in the theory of Fredholm-Volterra integral equations [see Mason and Handscomb (2002), Ch. 9, page 211].


Lemma 4.1 The Chebyshev polynomials of the first kind T_n(x) = cos(n arccos x) are the eigenfunctions of the integral operator with the kernel H(x, v) = −ln(x − v)²/√(1 − v²). More precisely, for all n = 0, 1, ... we have

$$\lambda_n T_n(x) = -\int_{-1}^{1} T_n(v)\ln(x-v)^2\,\frac{dv}{\pi\sqrt{1-v^2}}, \quad x \in [-1,1],$$

where λ_0 = 2 ln 2 and λ_n = 2/n for n ≥ 1.

With the next result we address the problem of uniqueness of the optimal design. In particular, we give a new characterization of the arcsine distribution. A proof can be found in the Appendix.

Theorem 4.4 Let ζ be a random variable supported on the interval [−1, 1]. Then ζ is given by the arcsine distribution with density (3.10) if and only if the equality

$$E\big[T_n(\zeta)\,(-\ln(\zeta-x)^2)\big] = c_n T_n(x)$$

holds for almost all x ∈ [−1, 1], where c_n = 2/n if n ∈ N and c_0 = 2 ln 2 if n = 0.

The following result is now an immediate consequence of Theorems 3.3 and 4.4.

Theorem 4.5 Consider the polynomial regression model (1.1) with f(x) = (1, x, x², ..., x^{m−1})^T, x ∈ [−1, 1], and the covariance kernel (4.7). Then the probability measure with arcsine density (3.10) is the unique universally optimal design.

Proof. We assume without loss of generality that β = 1 and consider the function ρ(x) = −ln x² + γ with positive γ. Let p denote the arcsine density (3.10). From Lemma 4.1 we obtain

$$\int_{-1}^{1}\big(-\ln(u-x)^2+\gamma\big)T_n(u)\,p(u)\,du = -\int_{-1}^{1}\ln(u-x)^2\,T_n(u)\,p(u)\,du + \gamma\delta_{n0} = \lambda_n T_n(x) + \gamma\delta_{n0},$$

where δ_{xy} denotes Kronecker's symbol and we have used the fact that ∫_{−1}^{1} T_n(u)/√(1 − u²) du = 0 whenever n ≥ 1. Consequently, the arcsine distribution satisfies (3.6) with g(x) ≡ 0 and the statement follows from Theorems 3.3 and 4.4. □
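The eigenvalue relation of Lemma 4.1, on which the proof of Theorem 4.5 rests, can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the substitution v = cos φ, which turns the arcsine-weighted integral into an average over φ ∈ [0, π], and approximates that average by the midpoint rule. The function names and the number of nodes are hypothetical, and the logarithmic singularity limits the accuracy, so only a few correct digits should be expected.

```python
import numpy as np

def log_potential_of_chebyshev(n, x, num_nodes=200_000):
    """Approximate  -(1/pi) * int_{-1}^{1} T_n(v) ln(x - v)^2 / sqrt(1 - v^2) dv.

    With v = cos(phi) the integral equals the average of
    cos(n*phi) * ln(x - cos(phi))^2 over phi in [0, pi].
    """
    phi = (np.arange(num_nodes) + 0.5) * np.pi / num_nodes   # midpoint nodes
    integrand = np.cos(n * phi) * np.log((x - np.cos(phi)) ** 2)
    return -integrand.mean()

def chebyshev_T(n, x):
    """Chebyshev polynomial of the first kind on [-1, 1]."""
    return np.cos(n * np.arccos(x))

def eigenvalue(n):
    """Eigenvalues stated in Lemma 4.1."""
    return 2.0 * np.log(2.0) if n == 0 else 2.0 / n

errors = []
for n in range(4):
    for x in (-0.7, 0.3):
        lhs = eigenvalue(n) * chebyshev_T(n, x)
        rhs = log_potential_of_chebyshev(n, x)
        errors.append(abs(lhs - rhs))

print(max(errors))
```

A small maximal error over the tested n and x is consistent with the arcsine distribution satisfying (3.6) with g ≡ 0 for the logarithmic kernel.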


4.4.2. Generalized arcsine designs. For α ∈ (0, 1) consider the Gegenbauer polynomials C_m^{(α)}(x), which are orthogonal with respect to the weight function

$$p_\alpha(x) = \frac{(\Gamma(\alpha+\tfrac{1}{2}))^2}{2^{\alpha}\,\Gamma(2\alpha+1)}\,(1-x^2)^{\alpha-1/2}, \quad x \in [-1,1]. \qquad (4.10)$$

For the choice α = 0 the Gegenbauer polynomials C_m^{(α)}(x) are proportional to the Chebyshev polynomials of the first kind T_m(x). Throughout this paper we will call the corresponding beta-distributions generalized arcsine designs, emphasizing the fact that the distribution is symmetric and the parameter α varies in the interval (0, 1). The following result (from the theory of Fredholm-Volterra integral equations of the first kind with special kernel, see Fahmy et al. (1999)) establishes an analogue of Lemma 4.1 for the kernel

$$H(u,v) = \frac{1}{|u-v|^{\alpha}\,(1-v^2)^{(1-\alpha)/2}}. \qquad (4.11)$$

Lemma 4.2 The Gegenbauer polynomials C_n^{(α/2)}(x) are the eigenfunctions of the integral operator with the kernel defined in (4.11). More precisely, for all n = 0, 1, ... we have

$$\lambda_n C_n^{(\alpha/2)}(x) = \int_{-1}^{1}\frac{1}{|x-v|^{\alpha}}\,C_n^{(\alpha/2)}(v)\,\frac{dv}{(1-v^2)^{(1-\alpha)/2}}$$

for all x ∈ [−1, 1], where λ_n = πΓ(n + α)/(cos(απ/2) Γ(α) n!).

The following result generalizes Theorem 8 of Zhigljavsky et al. (2010) from the case of a location scale model to polynomial regression models.

Theorem 4.6 Consider the polynomial regression model (1.1) with f(x) = (1, x, x², ..., x^{m−1})^T, x ∈ [−1, 1], and covariance kernel (4.8). Then the design with generalized arcsine density defined in (4.10) is universally optimal.

Proof. It is easy to see that the optimal design does not depend on β and we thus assume β = 1. To prove the statement for the kernel ρ(x) = 1/|x|^α + γ with positive γ we recall the definition of p_α in (4.10) and obtain from Lemma 4.2

$$\int\Big(\frac{1}{|u-x|^{\alpha}}+\gamma\Big)C_n^{(\alpha/2)}(u)\,p_{\alpha/2}(u)\,du = \int\frac{1}{|u-x|^{\alpha}}\,C_n^{(\alpha/2)}(u)\,p_{\alpha/2}(u)\,du \;\propto\; C_n^{(\alpha/2)}(x)$$

for any n ∈ N, since ∫ C_n^{(α/2)}(u) p_{α/2}(u) du = 0. Therefore the design ξ* with density p_{α/2} satisfies condition (3.6) with g(x) ≡ 0 and the optimality follows from Theorem 3.3. □


5. Numerical construction of optimal designs.

5.1. An algorithm for computing optimal designs. Numerical computation of optimal designs for a common linear regression model (1.1) with given correlation function can be performed by an extension of the multiplicative algorithm proposed by Dette et al. (2008b) for the case of non-correlated observations. Note that the proposed algorithm constructs a discrete design which can be considered as an approximation to a design which satisfies the necessary conditions of optimality of Theorem 3.1. By choosing a fine discretization {x1, ..., xn} of the design space X and running the algorithm long enough, the approximation error can be made arbitrarily small (in the case when convergence is achieved).

Denote by ξ^{(r)} = {x1, ..., xn; w1^{(r)}, ..., wn^{(r)}} the design at iteration r, where w1^{(0)}, ..., wn^{(0)} are nonzero weights, for example, uniform. We propose the following updating rule for the weights:

$$w_i^{(r+1)} = \frac{w_i^{(r)}\big(\psi(x_i,\xi^{(r)})-\beta_r\big)}{\sum_{j=1}^{n} w_j^{(r)}\big(\psi(x_j,\xi^{(r)})-\beta_r\big)}, \quad i = 1,\ldots,n, \qquad (5.1)$$

where β_r is a tuning parameter (the only condition on β_r is the positivity of all the weights in (5.1)), ψ(x, ξ) = φ(x, ξ)/b(x, ξ) and the functions φ(x, ξ) and b(x, ξ) are defined in (3.2) and (3.3), respectively. The condition (3.5) takes the form ψ(x, ξ*) ≤ 1 for all x ∈ X. The rule (5.1) means that at the next iteration the weight of a point x = x_j increases if the condition (3.5) does not hold at this point.

A measure ξ* is a fixed point of the iteration (5.1) if and only if ψ(x, ξ*) = 1 for all x ∈ supp(ξ*) and ψ(x, ξ*) ≤ 1 for all x ∈ X \ supp(ξ*). That is, a design ξ* is a fixed point of the iteration (5.1) if and only if it satisfies the optimality condition of Theorem 3.1. We were not able to prove theoretically the convergence of the iterations (5.1) to the design satisfying the optimality condition of Theorem 3.1, but we observed this convergence in all numerical studies. In particular, for the cases where we could derive the optimal designs explicitly, we observed convergence of the algorithm to the optimal design.

The algorithm (5.1) can be easily extended to cover the case of singular covariance kernels. Alternatively, a singular kernel can be approximated by a non-singular one using the technique described in Zhigljavsky et al. (2010, Section 4).

5.2. Efficiencies of the uniform and arcsine densities. In the present section we numerically study the efficiency (with respect to the D-optimality


criterion) of two designs for different models. Specifically, we consider the uniform design and the arcsine design for the model (1.1) with f(x) = (1, x, ..., x^{m−1})^T and different correlation functions, where the design space is given by the interval [−1, 1]. We determine the efficiency of a design ξ as

$$\mathrm{Eff}(\xi) = \Big(\frac{\det D(\xi^*)}{\det D(\xi)}\Big)^{1/m},$$

where ξ* is the design computed by the algorithm described in the previous section (applied to the D-optimality criterion). We considered polynomial regression models of degree ≤ 3 and the correlation functions ρ(x) = e^{−λ|x|} and ρ(x) = e^{−λx²} for various values of λ. The results are depicted in Tables 2 and 3, respectively. We observe that the efficiency of the arcsine design is always larger than the efficiency of the uniform design. Moreover, the absolute difference between the efficiencies of the two designs tends to increase as the degree m − 1 of the polynomial increases. On the other hand, the efficiencies of the uniform design and the arcsine design tend to decrease as m increases.

Table 2. Efficiencies of the uniform design ξu and the arcsine design ξa for the polynomial regression model of degree m − 1 and the exponential correlation function ρ(x) = e^{−λ|x|}.

m   design     λ=0.5   λ=1.5   λ=2.5   λ=3.5   λ=4.5   λ=5.5
1   Eff(ξu)    0.913   0.888   0.903   0.919   0.933   0.944
1   Eff(ξa)    0.966   0.979   0.987   0.980   0.968   0.954
2   Eff(ξu)    0.857   0.832   0.847   0.867   0.886   0.901
2   Eff(ξa)    0.942   0.954   0.970   0.975   0.973   0.966
3   Eff(ξu)    0.832   0.816   0.826   0.842   0.860   0.876
3   Eff(ξa)    0.934   0.938   0.954   0.968   0.976   0.981
4   Eff(ξu)    0.826   0.818   0.823   0.835   0.849   0.864
4   Eff(ξa)    0.934   0.936   0.945   0.957   0.967   0.975

Table 3. Efficiencies of the uniform design ξu and the arcsine design ξa for the polynomial regression model of degree m − 1 and the Gaussian correlation function ρ(x) = e^{−λx²}.

m   design     λ=0.5   λ=1.5   λ=2.5   λ=3.5   λ=4.5   λ=5.5
1   Eff(ξu)    0.758   0.789   0.811   0.830   0.842   0.853
1   Eff(ξa)    0.841   0.907   0.924   0.932   0.934   0.935
2   Eff(ξu)    0.756   0.698   0.709   0.725   0.739   0.753
2   Eff(ξa)    0.843   0.833   0.853   0.868   0.877   0.885
3   Eff(ξu)    0.803   0.662   0.684   0.699   0.711   0.720
3   Eff(ξa)    0.866   0.771   0.818   0.844   0.859   0.869
4   Eff(ξu)    0.797   0.630   0.617   0.627   0.648   0.665
4   Eff(ξa)    0.842   0.713   0.722   0.746   0.776   0.799
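The updating rule (5.1) and the efficiency measure of Section 5.2 can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function ψ = φ/b of the text depends on (3.2) and (3.3), which are defined earlier in the paper, so the demonstration below substitutes a stand-in. As a clearly labeled assumption, it uses the classical function f^T(x) M^{-1}(ξ) f(x)/m for D-optimality with uncorrelated observations and f(x) = (1, x)^T, for which the limiting design is known to put mass 1/2 on each endpoint of [−1, 1]. All function names are hypothetical.

```python
import numpy as np

def multiplicative_update(weights, psi_values, beta=0.0):
    """One step of rule (5.1): w_i <- w_i (psi_i - beta) / sum_j w_j (psi_j - beta)."""
    shifted = np.asarray(psi_values, dtype=float) - beta
    if np.any(shifted <= 0.0):
        raise ValueError("beta must keep every factor psi - beta positive")
    new_weights = weights * shifted
    return new_weights / new_weights.sum()

def efficiency(D_opt, D):
    """Eff = (det D(xi*) / det D(xi))^(1/m), as defined in Section 5.2."""
    m = D.shape[0]
    return (np.linalg.det(D_opt) / np.linalg.det(D)) ** (1.0 / m)

def psi_uncorrelated(grid, weights):
    """Stand-in for psi (assumption, not the psi of the text):
    f^T(x) M^{-1}(xi) f(x) / m with f(x) = (1, x)^T, i.e. the
    uncorrelated D-optimality case."""
    F = np.stack([np.ones_like(grid), grid], axis=1)
    M = F.T @ (weights[:, None] * F)
    return np.einsum("ij,jk,ik->i", F, np.linalg.inv(M), F) / F.shape[1]

# Start from uniform weights on a discretization of [-1, 1] and iterate.
grid = np.linspace(-1.0, 1.0, 21)
weights = np.full(grid.size, 1.0 / grid.size)
for _ in range(500):
    weights = multiplicative_update(weights, psi_uncorrelated(grid, weights))

print(weights[0], weights[-1])                   # mass concentrates on the endpoints
print(psi_uncorrelated(grid, weights).max())     # at most 1 up to rounding at a fixed point
```

In the correlated setting only `psi_uncorrelated` would need to be replaced by an implementation of ψ = φ/b; the update step and the efficiency formula are unchanged.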


6. Conclusions. In this paper we have addressed the problem of constructing optimal designs for least squares estimation in regression models with correlated observations. The main challenge in problems of this type is that, in contrast to “classical” optimal design theory for uncorrelated data, the corresponding optimality criteria are not convex (except for the location scale model). By relating the design problem to an integral operator problem, universally optimal designs can be identified explicitly for a broad class of regression models and correlation structures. Particular attention is paid to a trigonometric regression model involving only cosine terms, where it is proved that the uniform distribution is universally optimal for any periodic kernel of the form K(u, v) = ρ(u − v). For the classical polynomial regression model with a covariance kernel given by the logarithmic potential it is proved that the arcsine distribution is universally optimal. Moreover, optimal designs are derived for several other regression models. So far optimal designs for regression models with correlated observations have only been derived explicitly for the location scale model, and to the best of our knowledge the results presented in this paper provide the first explicit solutions to this type of problem for a general class of models with more than one parameter.

We have concentrated on the construction of optimal designs for least squares estimation (LSE), because the best linear unbiased estimator (BLUE) requires the knowledge of the correlation matrix. While the BLUE is often sensitive with respect to misspecification of the correlation structure, the corresponding optimal designs for the LSE show a remarkable robustness. Moreover, the difference between BLUE and LSE is often surprisingly small and in many cases BLUE and LSE with certain correlation functions are asymptotically equivalent [see Rao (1967), Kruskal (1968)].

Indeed, consider the location scale model y(x) = θ + ε(x) with K(u, v) = ρ(u − v), where a full trajectory of the process y(x) is available. Define the (linear unbiased) estimate θ̂(G) = ∫ y(x) dG(x), where G(x) is a distribution function of a signed probability measure. A celebrated result of Grenander (1950) states that the “estimator” θ̂(G*) is BLUE if and only if ∫ ρ(u − x) dG*(u) is constant for all x ∈ X. This result was extended by Näther (1985a), Sect. 4.3, to the case of random fields with constant mean. Consequently, if G*(x) is a distribution function of a non-signed (rather than signed) probability measure, then LSE coincides with BLUE and an asymptotic optimal design for LSE is also an asymptotic optimal design for BLUE. Hajek (1956) proved that G* is a distribution function of a non-signed probability measure if the correlation function ρ is convex on the interval (0, ∞). Zhigljavsky et al. (2010) showed that G* is a proper distribution function


for certain families of correlation functions including non-convex ones.

In Theorem 3.3 we have characterized the cases where there exist universally optimal designs for ordinary least squares estimation. Specifically, a design ξ* is universally optimal for least squares estimation if and only if the condition (3.6) with g(x) ≡ 0 is satisfied. Moreover, the proof of Theorem 3.3 shows that in this case the signed vector-valued measure µ(dx) = M^{-1}(ξ*) f(x) ξ*(dx) associated with the LSE minimizes (with respect to the Loewner ordering) the matrix

$$\iint K(x,u)\,\mu(dx)\,\mu^T(du)$$

in the space ℳ of all vector-valued signed measures. Because this matrix is the covariance of the linear estimate ∫ y(x) µ(dx) (where µ is a vector of signed measures), it follows that under the assumptions of Theorem 3.3 the LSE combined with the universally optimal design ξ* gives exactly the same asymptotic covariance matrix as the BLUE and the optimal design for the BLUE.

7. Appendix: some technical details.

Proof of Lemma 3.3. If the vectors a and b are linearly independent, then there exists at least one vector, say c, such that the angle between the vectors c and a is strictly smaller than π/2 and the angle between the vectors c and b is strictly larger than π/2. Therefore, the assertion follows from the well known formula a^T c = (a^T a)^{1/2} (c^T c)^{1/2} cos φ, where φ denotes the angle between the vectors a and c. □

Proof of Theorem 4.4. Note that the part “if” of the statement follows from Lemma 4.1 and we should prove the part “only if”. Nevertheless, we provide a proof of the part “if” since it will be the base for proving the part “only if”. Since the statement for n = 0 is proved in Schmidt and Zhigljavsky (2009), we consider the case n ∈ N in the rest of the proof. Using the transformation φ = arccos u and ψ = arccos x we obtain T_n(cos φ) = cos(nφ) and

$$\int_{-1}^{1}\frac{\ln(u-x)^2}{\pi\sqrt{1-u^2}}\,T_n(u)\,du = \int_0^{\pi}\frac{\ln(\cos\varphi-x)^2}{\pi\sin\varphi}\,\cos(n\varphi)\sin\varphi\,d\varphi.$$

Consequently, in order to prove Theorem 4.4 we have to show that the function

$$\int_0^{\pi}\ln(\cos\varphi-\cos\psi)^2\cos(n\varphi)\,\mu(d\varphi)$$


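is proportional to cos(nψ) if and only if µ has a uniform density on the interval [0, π]. Extending µ to the interval [0, 2π] as a symmetric (with respect to the center π) measure, µ(A) = µ(2π − A), and defining the measure µ̃ as µ̃(A) = µ(2A)/2 for all Borel sets A ⊂ [0, π], we obtain

$$\int_0^{\pi}\ln(\cos\varphi-\cos\psi)^2\cos(n\varphi)\,\mu(d\varphi) = \frac{1}{2}\int_0^{2\pi}\cos(n\varphi)\ln(\cos\varphi-\cos\psi)^2\,\mu(d\varphi)$$

$$= \frac{1}{2}\int_0^{2\pi}\cos(n\varphi)\ln\Big(2\sin\frac{\varphi-\psi}{2}\sin\frac{\varphi+\psi}{2}\Big)^2\mu(d\varphi)$$

$$= \frac{1}{2}\int_0^{2\pi}\cos(n\varphi)\ln 2^2\,\mu(d\varphi) + \frac{1}{2}\int_0^{2\pi}\cos(n\varphi)\ln\Big(\sin\frac{\varphi-\psi}{2}\Big)^2\mu(d\varphi) + \frac{1}{2}\int_0^{2\pi}\cos(n\varphi)\ln\Big(\sin\frac{\varphi+\psi}{2}\Big)^2\mu(d\varphi)$$

$$= 0 + \int_0^{\pi}\cos(2n\varphi)\ln\sin^2(\varphi-\psi/2)\,\tilde\mu(d\varphi) + \int_0^{\pi}\cos(2n\varphi)\ln\sin^2(\varphi+\psi/2)\,\tilde\mu(d\varphi)$$

$$= 2\int_0^{\pi}\cos(2n\varphi-n\psi+n\psi)\ln\sin^2(\varphi-\psi/2)\,\tilde\mu(d\varphi)$$

$$= 2\cos(n\psi)\int_0^{\pi}\cos(2n\varphi-n\psi)\ln\sin^2(\varphi-\psi/2)\,\tilde\mu(d\varphi) - 2\sin(n\psi)\int_0^{\pi}\sin(2n\varphi-n\psi)\ln\sin^2(\varphi-\psi/2)\,\tilde\mu(d\varphi).$$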

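The part “if” follows from the facts that the functions cos(2nz) ln sin²(z) and sin(2nz) ln sin²(z) are π-periodic and

$$\int_0^{\pi}\sin(2n\varphi-n\psi)\ln\sin^2(\varphi-\psi/2)\,\frac{d\varphi}{\pi} = \int_0^{\pi}\sin(2n\varphi)\ln\sin^2(\varphi)\,\frac{d\varphi}{\pi} = 0,$$

$$\int_0^{\pi}\cos(2n\varphi-n\psi)\ln\sin^2(\varphi-\psi/2)\,\frac{d\varphi}{\pi} = \int_0^{\pi}\cos(2n\varphi)\ln\sin^2(\varphi)\,\frac{d\varphi}{\pi} = -1/n.$$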

To prove the part “only if”, we need to show that the convolution of cos(2nz) ln sin²(z) and µ̃(z), i.e.

$$\int_0^{\pi}\cos(2n(\varphi-t))\ln\sin^2(\varphi-t)\,\tilde\mu(d\varphi),$$

is constant for almost all t ∈ [0, π] if and only if µ̃ is uniform; and the same holds for the convolution of sin(2nz) ln sin²(z) and µ̃(z). This, however, follows from Schmidt and Zhigljavsky (2009, Lem. 3) since cos(2nz) ln sin²(z) ∈ L2([0, π]) and all complex Fourier coefficients of these functions are non-zero. Indeed,

$$\int_0^{\pi}\cos(2nt)\ln\sin^2(t)\sin(2kt)\,dt = 0 \quad \forall\, k \in \mathbb{Z},$$

$$\int_0^{\pi}\cos(2nt)\ln\sin^2(t)\cos(2kt)\,dt = (\gamma_{|n+k|}+\gamma_{|n-k|})/2 \quad \forall\, k \in \mathbb{Z},$$

where γ_0 = −2π log 2 and γ_k = −π/k for k ∈ N, see formula 4.384.3 in Gradshteyn and Ryzhik (1965).
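The trigonometric integrals used in this proof can be confirmed numerically. The following sketch is an illustration with hypothetical helper names: it approximates averages over [0, π] by the midpoint rule, whose nodes avoid the logarithmic singularities of ln sin²(φ) at the endpoints, and compares them with the stated values −1/n and 0, as well as with the combination (γ_{|n+k|} + γ_{|n−k|})/2 for one pair (n, k).

```python
import numpy as np

def mean_over_half_period(g, num_nodes=200_000):
    """Midpoint-rule approximation of (1/pi) * int_0^pi g(phi) dphi."""
    phi = (np.arange(num_nodes) + 0.5) * np.pi / num_nodes
    return g(phi).mean()

def log_sin_sq(phi):
    """ln sin^2(phi); finite at all midpoint nodes."""
    return np.log(np.sin(phi) ** 2)

def gamma(k):
    """gamma_0 = -2*pi*log 2 and gamma_k = -pi/k for k >= 1."""
    return -2.0 * np.pi * np.log(2.0) if k == 0 else -np.pi / k

cos_coeff = {n: mean_over_half_period(lambda p, n=n: np.cos(2 * n * p) * log_sin_sq(p))
             for n in (1, 2, 3)}
sin_coeff = {n: mean_over_half_period(lambda p, n=n: np.sin(2 * n * p) * log_sin_sq(p))
             for n in (1, 2, 3)}

# Mixed integral for n = 2, k = 1, returned as an integral (not an average).
n, k = 2, 1
mixed = np.pi * mean_over_half_period(
    lambda p: np.cos(2 * n * p) * log_sin_sq(p) * np.cos(2 * k * p))
expected_mixed = 0.5 * (gamma(abs(n + k)) + gamma(abs(n - k)))

print(cos_coeff, sin_coeff)
print(mixed, expected_mixed)
```

The cosine averages close to −1/n and the vanishing sine averages are exactly the two facts used for the part “if”; the mixed integral illustrates why the Fourier coefficients appearing in the part “only if” do not vanish.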

Acknowledgements. This work has been supported in part by the Collaborative Research Center “Statistical modeling of nonlinear dynamic processes” (SFB 823, Teilprojekt C2) of the German Research Foundation (DFG). Parts of this paper were written during a visit of the authors at the Isaac Newton Institute, Cambridge, UK, and the authors would like to thank the institute for its hospitality and financial support. We are also grateful to the referees and the associate editor for their constructive comments on an earlier version of this manuscript.

References.

Bickel, P. J. and Herzberg, A. M. (1979). Robustness of design against autocorrelation in time I: Asymptotic theory, optimality for location and linear regression. Annals of Statistics, 7(1):77–95.
Bickel, P. J., Herzberg, A. M., and Schilling, M. F. (1981). Robustness of design against autocorrelation in time II: Optimality, theoretical and numerical results for the first-order autoregressive process. Journal of the American Statistical Association, 76(376):870–877.
Boltze, L. and Näther, W. (1982). On effective observation methods in regression models with correlated errors. Math. Operationsforsch. Statist. Ser. Statist., 13:507–519.
Dette, H., Kunert, J., and Pepelyshev, A. (2008a). Exact optimal designs for weighted least squares analysis with correlated errors. Statistica Sinica, 18(1):135–154.
Dette, H., Pepelyshev, A., and Zhigljavsky, A. (2008b). Improving updating rules in multiplicative algorithms for computing D-optimal designs. Computational Statistics and Data Analysis, 53(2):312–320.
Fahmy, M. H., Abdou, M. A., and Darwish, M. A. (1999). Integral equations and potential-theoretic type integrals of orthogonal polynomials. Journal of Computational and Applied Mathematics, 106:245–254.
Gradshteyn, I. S. and Ryzhik, I. M. (1965). Table of Integrals, Series, and Products. Academic Press, New York-London.
Grenander, U. (1950). Stochastic processes and statistical inference. Ark. Mat., 1:195–277.
Hajek, J. (1956). Linear estimation of the mean value of a stationary random process with convex correlation function. Czechoslovak Mathematical Journal, 6(81):94–117.
Harman, R. and Štulajter, F. (2010). Optimal prediction designs in finite discrete spectrum linear regression models. Metrika, 72(2):281–294.
Kanwal, R. (1997). Linear Integral Equations. Birkhäuser, Boston.
Kiefer, J. (1974). General equivalence theory for optimum designs (Approximate Theory). Annals of Statistics, 2:849–879.
Kiefer, J. and Wolfowitz, J. (1960). The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366.
Kiselak, J. and Stehlik, M. (2008). Equidistant and D-optimal designs for parameters of Ornstein-Uhlenbeck process. Statistics & Probability Letters, 78(12):1388–1396.
Kruskal, W. (1968). When are Gauss-Markov and least squares estimators identical? A coordinate-free approach. Annals of Mathematical Statistics, 39:70–75.
Mason, J. C. and Handscomb, D. C. (2002). Chebyshev Polynomials. Oxford University Computing Laboratory, England, UK.
Müller, W. G. and Pázman, A. (2003). Measures for designs in experiments with correlated errors. Biometrika, 90:423–434.
Näther, W. (1985a). Effective Observation of Random Fields. Teubner Verlagsgesellschaft, Leipzig.
Näther, W. (1985b). Exact design for regression models with correlated errors. Statistics, 16:479–484.
Pázman, A. and Müller, W. G. (2001). Optimal design of experiments subject to correlated errors. Statistics and Probability Letters, 52:29–34.
Pukelsheim, F. (2006). Optimal Design of Experiments. SIAM, Philadelphia.
Rao, C. R. (1967). Least squares theory using an estimated dispersion matrix and its application to measurement of signals. Proc. Fifth Berkeley Sympos., Univ. California Press, Berkeley, Calif., pages 355–372.
Sacks, J. and Ylvisaker, N. D. (1966). Designs for regression problems with correlated errors. Annals of Mathematical Statistics, 37:66–89.
Sacks, J. and Ylvisaker, N. D. (1968). Designs for regression problems with correlated errors; many parameters. Annals of Mathematical Statistics, 39:49–69.
Schmidt, K. and Zhigljavsky, A. (2009). A characterization of the arcsine distribution. Statistics and Probability Letters, 79:2451–2455.
Torsney, B. (1986). Moment inequalities via optimal design theory. Linear Algebra Appl., 82:237–253.
Zhigljavsky, A., Dette, H., and Pepelyshev, A. (2010). A new approach to optimal design for linear models with correlated observations. Journal of the American Statistical Association, 105:1093–1103.

Fakultät für Mathematik
Ruhr-Universität Bochum
Bochum, 44780, Germany
E-mail: [email protected]

Institute of Statistics
RWTH Aachen University
Aachen, 52056, Germany
E-mail: [email protected]

School of Mathematics
Cardiff University
Cardiff, CF24 4AG, UK
E-mail: [email protected]