Measuring spatial dispersion: exact results on the variance of random spatial distributions

Pablo Jensen · Julien Michel


Abstract Measuring the spatial distribution of the locations of many entities (trees, atoms, economic activities, etc.), and, more precisely, their deviations from purely random configurations, is a powerful method to unravel their underlying interactions. Several coefficients have been developed in the past to quantify such deviations. To ascertain the statistical significance of an empirical deviation, it is important to know the variances of these coefficients for random distributions. For lack of a proper analytical expression, the significance is usually obtained by generating many random configurations with Monte Carlo simulations. In the present paper, we present an exact analytical expression for the variance of several spatial coefficients for random distributions, and we rigorously show that the distributions of these coefficients asymptotically follow a Normal law. These two results eliminate the need for cumbersome Monte Carlo simulations. They also make it possible to understand qualitatively the main factors that may change the variance: number of sites, spatial inhomogeneity, etc.

Keywords Spatial distribution · Point processes · Statistical significance · Localisation

JEL C19 · R12 · L81

Pablo Jensen
Université de Lyon
(a) LET-CNRS, Université Lyon-2, 69007 Lyon, FRANCE
(b) Laboratoire de Physique, Ecole Normale Supérieure de Lyon et CNRS, 69007 Lyon, FRANCE
(c) Institut des Systèmes Complexes Rhône-Alpes, IXXI-CNRS, 69007 Lyon, FRANCE
E-mail: [email protected]

Julien Michel
Université de Lyon
(d) Unité de Mathématiques Pures et Appliquées-UMR 5669, ENS Lyon, 46 allée d'Italie, F-69364 Lyon Cedex 07, FRANCE
E-mail: [email protected]

1 Introduction

Measuring the spatial distribution of industries [Hoo37], atoms [EB03], trees [WPF96] or retail stores [HG84, Col06, Jen06] is a powerful method to understand the underlying mechanisms of their interactions. Several methods have been developed in the past to quantify the deviations of the empirical distributions from purely random distributions, supposed to correspond to the non-interacting case [Rip76, Bes77, EG97, MS99]. Recently, a method originally developed by G. Duranton and H. Overman [DO05], and later modified by Marcon and Puech [MP07], has been proposed. Its main interest is that it takes as reference for the underlying space not a homogeneous one, as the former methods do [Rip76, Bes77, EG97, MS99], but the overall spatial distribution of sites, thus automatically taking into account the many inhomogeneities of the actual geographical space. For instance, retail stores are inhomogeneously distributed because of rivers, mountains or specific town regulations (parks, purely residential zones, etc.). Therefore, it is interesting to take this inhomogeneous distribution as the reference when testing the random distribution of, for instance, bakeries in a town. Furthermore, by using precise location data (x and y coordinates), this method avoids all the well-known contiguity problems, summarized in the 'modifiable areal unit problem' [YK50, Unw96, Ope84, BCL07]. However, the method has two main drawbacks:
1. the need for precise location data (i.e. x, y coordinates, and not only the knowledge that a site belongs to a given geographical area),
2. the need for Monte Carlo simulations in order to compute the statistical significance of the deviations from a random distribution.

Point (1) is probably going to be less crucial as precisely spatialized data become more common. Moreover, it can be argued that, when only region-type data exist, it can be more convenient to locate all the sites at the region centroid and then apply the 'continuous' method, thus avoiding contiguity problems. This paper solves point (2) and, more generally, gives analytic formulas to compute the variance of some characteristics of purely random distributions of points (meaning non-interacting distributions).
This is all the more important since the variance is needed to ascertain the statistical significance of a deviation from the reference value of the spatial distribution. Lacking an analytical expression for the variance, the usual method to compute the statistical significance of a deviation consists in generating random distributions by Monte Carlo simulations and counting the proportion of distributions that deviate more than the measured distribution. However, Monte Carlo simulations can be cumbersome to implement, and this method can be prohibitively time-consuming for large samples, since typically many thousands of runs are necessary to compute deviations precisely. The new indices defined in this article present the great advantage, compared to the classical indices, of having a reference value constantly equal to 1 (no additional Monte Carlo computation of this reference is thus needed). The deviation from this reference value for purely random configurations is clearly observed for both clustered and self-excluding configurations, and those behaviors are readily identified by comparing the indices with 1: a value significantly greater than 1 means clustering, whereas a value significantly lower than 1 means exclusion. The significance levels for this discrimination are determined thanks to the knowledge of the variances of the indices. The indices introduced cover both the case of discrete locations and the case of continuous locations with respect to an a priori density measure. Our main results are theoretical values for the variance of all the indices introduced, giving easy access to their actual numerical computation. This numerical computation is much faster than the classical Monte Carlo simulations that have been used up to now in the literature. Thus, testing deviation from randomness for a spatial configuration of millions of stores becomes practically feasible.
Furthermore, the asymptotic normality proved in the appendix ensures that the variance suffices to calculate the confidence intervals.


The paper is organized as follows: in section 2 we define the cumulative coefficient [MP07] measuring the spatial dispersion or aggregation of locations (stores, etc.) in a discrete setting. In section 3 we derive explicit values for the variance of the discrete cumulative coefficients and give an example on the actual locations of bakeries in the city of Lyon. Section 4 is devoted to a more theoretical derivation of the variance in the continuous setting. Finally, in section 5, we extend our calculations to other spatial coefficients, such as a differential index and the widely used Duranton-Overman index [DO05]. The exact value of the variance for the discrete case is given in proposition 1 for the inter-coefficient, which measures the dependency of the locations of a certain type of stores with respect to another type of fixed stores. Proposition 2 gives the value of the variance of the intra-coefficient, which measures the dependency between stores of the same type. The testing procedure is detailed in proposition 3. The continuous version is treated in section 4, and the computation of the variances in this case is performed in propositions 4 and 5. Theorems 1 and 2 state central limit theorems justifying the normal approximation for large numbers of stores. Propositions 6 and 7 give the value of the variance of the differential coefficients and of the Duranton-Overman indicator, respectively.

List of Figures
1 Comparison of variances obtained from simulations and proposition 1 for the inter-coefficient as a function of $N_B$ (logarithmic scales).
2 Comparison of variances obtained from simulations and proposition 2 as a function of $N_A$ (linear scales).
3 Plot of the intra-coefficient for bakeries in the city of Lyon with respect to r.
4 Plot of the intra-coefficient for clustered (top) and self-excluding (bottom) configurations.
5 Distribution of random $a_{AB}$ values compared to the Normal law centered at 1 with variance computed from proposition 1.
6 Distribution of random $a_{AA}$ values compared to a Normal law centered at 1 with variance computed from proposition 2.

List of Tables
1 Comparison of the expressions of the discrete and continuous variances of the inter-coefficient.
2 Enumeration of the different terms of cardinality 3 and their respective values.

2 Discrete setting for the characterisation of spatial dispersion and aggregation

The indicators studied here quantify deviations of empirical distributions of points from purely random, non-interacting distributions. One can be interested in the interaction of a set of points among themselves, or with some other set of points. From now on we shall work with two different types of points: A and B. We define two indicators, referred to respectively as the inter- and intra-coefficients [MP07], to characterize the (cumulative) spatial interaction between sites closer than a distance r. The inter-coefficient describes the type of interactions of fixed A points with random B points, whereas the intra-coefficient is intended to measure the independence between points of type A. One can also work with indicators characterizing the (differential) spatial distributions between distances $r$ and $r + \delta r$ (with $\delta r \ll r$) [DO05]. Those differential coefficients are potentially more sensitive to spatial variations of the distributions because they do not integrate features from 0 to r. We shall start by calculating the variance of the cumulative coefficient and then extend our results to other quantifiers of spatial distributions.

In a discrete setting, stores (points) are located randomly on a discrete (finite) set T of fixed points. We shall use the following definitions and notations:
– there are $N_t$ sites, of which $N_A$ sites are of type A and $N_B$ sites are of type B,
– for any site S, we define:
  – $N_t(S, r)$: the total number of sites distinct from S that are at a distance less than r from site S,
  – $N_A(S, r)$: the number of A sites in this same region,
  – $N_B(S, r)$: the number of B sites in this same region.

Note that site S is never counted in those quantities, whatever its state.

Remark 1 The notation $N_t(D)$ (resp. $N_A(D)$ and $N_B(D)$) will denote the total number of sites (resp. the number of A sites and of B sites) in a subset D of T. Thus, for instance, $N_A(S, r)$ stands for $N_A(B(S, r) \setminus \{S\})$, where $B(S, r)$ denotes the ball centered at S with radius r.

In this discrete model, the locations of stores A and B are distributed over the total number of possible sites, with mutual exclusion at the same site. Therefore, the geographical characteristics of the studied area are carried by the actual locations of those $N_t$ possible sites.

Remark 2 The coefficients that we introduce depend on the reference distance r; however, we shall drop this dependency in the notation unless strictly necessary.

In the following two subsections we define the inter- and intra-coefficients with the following goal: both coefficients must be easy to compute on actual data, and they should be easily compared to a reference value associated with a random distribution of points. The computation of their variance then gives an easy-to-implement test of this randomness hypothesis, by a standard confidence-interval argument based on Chebyshev's inequality [YK50]. This testing procedure is detailed in subsection 3.4.

2.1 inter-coefficient

In order to quantify the dependency between two different types of points, we set the following context: the set T has a fixed subset of $N_A$ stores of type A, and the distribution of the subset $\{B_i, i = 1 \dots N_B\}$ of type B stores is assumed to be uniform on the set of subsets of cardinal $N_B$ of $T \setminus \{A_1, \dots, A_{N_A}\}$: this is equivalent to an urn model with $N_B$ draws without replacement from an urn of cardinal $N_t - N_A$. Under this reference random hypothesis, the presence of a point of type A at those locations should not modify (on average) the density of type B stores: the local B spatial concentration $N_B(A_i, r) / (N_t(A_i, r) - N_A(A_i, r))$ should be close (on average) to the concentration over the whole town, $N_B / (N_t - N_A)$. We define the inter-coefficient as

$$a_{AB} = \frac{N_t - N_A}{N_A N_B} \sum_{i=1}^{N_A} \frac{N_B(A_i, r)}{N_t(A_i, r) - N_A(A_i, r)} \tag{1}$$

where $N_A(A_i, r)$, $N_B(A_i, r)$ and $N_t(A_i, r)$ are respectively the number of A points, the number of B points and the total number of points in the r-neighborhood of point $A_i$ (not counting $A_i$), i.e. points at a distance smaller than r. In this definition, the right-hand side may contain fractions with zero in both the numerator and the denominator: those fractions 0/0 are taken as equal to 1. It is straightforward to check that

Lemma 1 For all r > 0, we have $E[a_{AB}] = 1$.

We can deduce a qualitative behaviour in the following sense: if the observed value of the inter-coefficient is greater than 1, we may deduce that A stores have a tendency to attract B stores, whereas lower values mean a rejection tendency.

2.2 intra-coefficient

Let us assume that we are interested in the distribution of $N_A$ points in the set T, represented by the subset $\{A_i, i = 1 \dots N_A\} \subset T$. The reference law for this set, called pure random distribution, is that this subset is uniformly chosen at random from the set of all subsets of cardinal $N_A$ of T: this is equivalent to an urn model with $N_A$ draws without replacement from an urn of cardinal $N_t$. Intuitively, under this (random) reference law, the local concentration of stores of type A around a given store of type A, represented by the ratio $N_A(A_i, r)/N_t(A_i, r)$, should on average not depend on the presence of this last store, and should thus be (almost) equal to the global concentration $N_A/N_t$. This leads us to introduce the following intra-coefficient:

$$a_{AA} = \frac{N_t - 1}{N_A (N_A - 1)} \sum_{i=1}^{N_A} \frac{N_A(A_i, r)}{N_t(A_i, r)}. \tag{2}$$

In this definition, the fraction 0/0 is again taken as equal to 1 in the right-hand side. Under the pure randomness hypothesis, it is straightforward to check that the average of this coefficient is equal to 1:

Lemma 2 For all r > 0, we have $E[a_{AA}] = 1$.

We also deduce a qualitative behaviour in the following sense: if the observed value of the intra-coefficient is greater than 1, we may deduce that A stores tend to aggregate, whereas lower values indicate a dispersion tendency.

3 Computation of the variance in the discrete setting, inference for the stores data of the city of Lyon

The computation of the variances of the inter- and intra-coefficients involves few mathematical difficulties. The most important feature is that, in the computation of the second moment of these coefficients, the possible overlaps of neighborhoods yield a loss of independence.
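As a concrete reference for the quantities whose variance is studied, the coefficients of equations (1) and (2) can be evaluated directly from a list of site coordinates. The following is a minimal sketch, not the authors' implementation: the function names, the brute-force neighbour search and the masks `is_a` and `is_b` (marking which sites hold an A or a B store) are assumptions made for illustration.

```python
from math import hypot


def neighbors(sites, r):
    """For each site index, the indices of the other sites at distance < r."""
    n = len(sites)
    return [[j for j in range(n) if j != i
             and hypot(sites[i][0] - sites[j][0], sites[i][1] - sites[j][1]) < r]
            for i in range(n)]


def ratio(num, den):
    # Convention of the text: a fraction 0/0 is taken as equal to 1.
    return 1.0 if den == 0 else num / den


def inter_coefficient(sites, is_a, is_b, r):
    """Equation (1): B stores around A stores, relative to the non-A sites."""
    nb = neighbors(sites, r)
    n_t, n_a, n_b = len(sites), sum(is_a), sum(is_b)
    total = 0.0
    for i in range(n_t):
        if is_a[i]:
            b_near = sum(is_b[j] for j in nb[i])          # N_B(A_i, r)
            non_a_near = sum(not is_a[j] for j in nb[i])  # N_t(A_i, r) - N_A(A_i, r)
            total += ratio(b_near, non_a_near)
    return (n_t - n_a) / (n_a * n_b) * total


def intra_coefficient(sites, is_a, r):
    """Equation (2): A stores around A stores, relative to all sites."""
    nb = neighbors(sites, r)
    n_t, n_a = len(sites), sum(is_a)
    total = sum(ratio(sum(is_a[j] for j in nb[i]), len(nb[i]))
                for i in range(n_t) if is_a[i])
    return (n_t - 1) / (n_a * (n_a - 1)) * total
```

The quadratic neighbour search is only meant for small examples; averaging `intra_coefficient` over all subsets of a small site set is a quick way to check Lemma 2 numerically.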


3.1 Inter covariance

Let us recall that there are $N_A$ particular (fixed) sites of type A, and that sites B are taken randomly. The r-neighborhoods of the fixed positions $A_1, \dots, A_{N_A}$ may intersect. We shall denote by $C^r_{i,j} = B(A_i, r) \cap B(A_j, r)$ the intersection of the two balls of centers $A_i$ and $A_j$ and radius r. In the computation of the second order moment of $a_{AB}$ we shall use the simplified notations:
– $N_B^i = N_B(A_i, r)$,
– $N_B^{ij}$ the number of B stores in $C^r_{i,j}$,
– $N_B^{i \setminus j}$ the number of B stores in $B(A_i, r) \setminus C^r_{i,j}$.

Using those notations, we may write:

$$a_{AB}^2 = \left(\frac{N_t - N_A}{N_A N_B}\right)^2 \left(\sum_{i=1}^{N_A} \frac{N_B^i}{N_t(A_i, r) - N_A(A_i, r)}\right)^2, \tag{3}$$

$$\phantom{a_{AB}^2} = \left(\frac{N_t - N_A}{N_A N_B}\right)^2 \left\{ \sum_{i=1}^{N_A} \left(\frac{N_B^i}{N_t(A_i, r) - N_A(A_i, r)}\right)^2 + \sum_{i \neq j} \frac{N_B^i}{N_t(A_i, r) - N_A(A_i, r)} \, \frac{N_B^j}{N_t(A_j, r) - N_A(A_j, r)} \right\}, \tag{4}$$

$$\phantom{a_{AB}^2} = \left(\frac{N_t - N_A}{N_A N_B}\right)^2 \left\{ \sum_{i=1}^{N_A} \left(\frac{N_B^i}{N_t(A_i, r) - N_A(A_i, r)}\right)^2 + \sum_{i \neq j} \frac{N_B^{i \setminus j} N_B^{j \setminus i} + N_B^{i \setminus j} N_B^{ij} + N_B^{ij} N_B^{j \setminus i} + \left(N_B^{ij}\right)^2}{(N_t(A_i, r) - N_A(A_i, r))(N_t(A_j, r) - N_A(A_j, r))} \right\}. \tag{5}$$

All the terms in the right-hand side are either squares of numbers of B stores in some region, or products of two such terms in disjoint regions: the computation of their expectation is quite easy thanks to the following elementary lemma:

Lemma 3 Let $D$ be a subset of cardinal $d$ of $T \setminus \mathcal{A} = T \setminus \{A_1, \dots, A_{N_A}\}$, and let $D'$ be a disjoint subset of cardinal $d'$ of $T \setminus \mathcal{A}$. Under the pure random hypothesis, the numbers $N_B(D)$ and $N_B(D')$ of B stores in $D$ and $D'$ have the following properties:

$$P(N_B(D) = k) = \binom{d}{k} \binom{N_t - N_A - d}{N_B - k} \bigg/ \binom{N_t - N_A}{N_B},$$

$$E[N_B(D)] = d \, \frac{N_B}{N_t - N_A},$$

$$E[N_B(D)^2] = d(d - 1) \, \frac{N_B (N_B - 1)}{(N_t - N_A)(N_t - N_A - 1)} + d \, \frac{N_B}{N_t - N_A},$$

$$E[N_B(D) N_B(D')] = d d' \, \frac{N_B (N_B - 1)}{(N_t - N_A)(N_t - N_A - 1)}.$$

Remark 3 If we set $p = d/(N_t - N_A)$, the variance $\sigma^2(N_B(D))$ becomes

$$\sigma^2(N_B(D)) = p(1 - p) N_B \, \frac{N_t - N_A - N_B}{N_t - N_A - 1}.$$
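The moments of Lemma 3 and the variance of Remark 3 can be checked by summing directly over the hypergeometric law, in exact rational arithmetic. This is only a verification sketch: the function names and the shorthand `m` for $N_t - N_A$ are assumptions made here, not notation from the text.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k, d, m, n_b):
    """P(N_B(D) = k) for |D| = d, m = N_t - N_A available sites, n_b B stores."""
    return Fraction(comb(d, k) * comb(m - d, n_b - k), comb(m, n_b))


def moments_from_law(d, m, n_b):
    """First and second moments of N_B(D), summed term by term from the law."""
    ks = range(min(d, n_b) + 1)
    mean = sum(k * hypergeom_pmf(k, d, m, n_b) for k in ks)
    second = sum(k * k * hypergeom_pmf(k, d, m, n_b) for k in ks)
    return mean, second


def lemma3_mean(d, m, n_b):
    return Fraction(d * n_b, m)


def lemma3_second_moment(d, m, n_b):
    return Fraction(d * (d - 1) * n_b * (n_b - 1), m * (m - 1)) + Fraction(d * n_b, m)


def remark3_variance(d, m, n_b):
    p = Fraction(d, m)
    return p * (1 - p) * n_b * Fraction(m - n_b, m - 1)
```

For instance, with d = 3, m = 10 and three more illustrative values one can compare `moments_from_law` with the closed forms and confirm that the second moment minus the squared mean equals `remark3_variance`.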


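Using the computations of this lemma, the value of the variance of the inter-coefficient follows easily: if we denote by $\langle \cdot \rangle_A$ the average over the A sites,

$$\langle u(A_i) \rangle_A := \frac{1}{N_A} \sum_{i=1}^{N_A} u(A_i), \quad \text{and} \quad \langle k(A_i, A_j) \rangle_A := \frac{2}{N_A (N_A - 1)} \sum_{1 \le i < j \le N_A} k(A_i, A_j),$$

we obtain

Proposition 1 The variance of the inter-coefficient is given by

$$\sigma^2(a_{AB}) = - \frac{N_t - N_A - N_B}{N_B (N_t - N_A - 1)} + \frac{1}{N_A} \, \frac{N_t - N_A}{N_B} \, \frac{N_t - N_A - N_B}{N_t - N_A - 1} \left\langle \frac{1}{N_t(A_i, r) - N_A(A_i, r)} \right\rangle_A + \frac{N_t - N_A - N_B}{N_t - N_A - 1} \, \frac{N_t - N_A}{N_B} \left(1 - \frac{1}{N_A}\right) \langle x_{ij} \rangle_A,$$

where

$$\langle x_{ij} \rangle_A = \left\langle \frac{N_t(C_{i,j}) - N_A(C_{i,j})}{(N_t(A_i, r) - N_A(A_i, r))(N_t(A_j, r) - N_A(A_j, r))} \right\rangle_A.$$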
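The formula of Proposition 1 translates almost line by line into code. The sketch below is an illustration under stated assumptions rather than the authors' program: sites are labelled 0, ..., $N_t - 1$, `nbrs[s]` is assumed to hold the set of sites in the r-neighborhood of site `s` (excluding `s`), and every A site is assumed to have at least one non-A neighbor so that no 0/0 term arises. A brute-force enumeration of all B configurations is included to check the closed form on small examples.

```python
from fractions import Fraction
from itertools import combinations


def inter_variance(n_t, a_sites, nbrs, n_b):
    """Variance of a_AB from Proposition 1, in exact rational arithmetic."""
    a_set = set(a_sites)
    n_a = len(a_sites)
    m = n_t - n_a                                  # sites available to B stores
    free = {s: nbrs[s] - a_set for s in a_sites}   # non-A neighbours of each A site
    inv_mean = sum(Fraction(1, len(free[s])) for s in a_sites) / n_a
    x_mean = Fraction(0)                           # <x_ij>_A, zero if only one A site
    if n_a > 1:
        x_sum = sum(Fraction(len(free[s] & free[t]), len(free[s]) * len(free[t]))
                    for s, t in combinations(a_sites, 2))
        x_mean = 2 * x_sum / (n_a * (n_a - 1))
    c = Fraction(m - n_b, m - 1)
    ratio = Fraction(m, n_b)
    return (-c / n_b
            + c * ratio * inv_mean / n_a
            + c * ratio * (1 - Fraction(1, n_a)) * x_mean)


def inter_variance_bruteforce(n_t, a_sites, nbrs, n_b):
    """Same variance, obtained by enumerating every placement of the B stores."""
    a_set = set(a_sites)
    free_sites = [s for s in range(n_t) if s not in a_set]
    n_a, m = len(a_sites), n_t - len(a_sites)
    values = []
    for b in combinations(free_sites, n_b):
        b_set = set(b)
        total = sum(Fraction(len(nbrs[s] & b_set), len(nbrs[s] - a_set))
                    for s in a_sites)
        values.append(Fraction(m, n_a * n_b) * total)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

The enumeration grows combinatorially and is only usable for a handful of sites, which is precisely the cost that the closed form avoids.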

3.2 Intra covariance

The computation of the second moment of the intra-coefficient is a little trickier. When computing the second order moment of $a_{AA}$ we are led to deal with square terms,

$$\sum_i \left(\frac{N_A^i}{N_t(A_i)}\right)^2$$

(with obvious notations), which are treated using the same arguments as for the inter case, and cross-products

$$\sum_{i \neq j} \frac{N_A^i}{N_t(A_i)} \, \frac{N_A^j}{N_t(A_j)}.$$

To compute the average of those terms, it is better to start from the very definition of the average:

$$\langle u(A_i, A_j) \rangle_t = \frac{1}{\text{total number of configurations}} \sum_{\text{all configurations}} u(A_i, A_j), \tag{6}$$

where the function u depends on both locations $A_i$ and $A_j$ and on the other points of type A:

$$u(A_i, A_j) = \frac{N_A^i}{N_t(A_i)} \, \frac{N_A^j}{N_t(A_j)}.$$

Now, we can order all the possible A configurations over the $N_t$ sites by grouping those that keep fixed the positions of $A_i$ and $A_j$. Therefore, eq. (6) becomes:

$$\langle u(A_i, A_j) \rangle_t = \frac{1}{\text{total number of configurations}} \sum_{s, t \in T} \; \sum_{\substack{\text{configurations such that} \\ A_i = s, \, A_j = t}} u(A_i, A_j). \tag{7}$$


For s and t fixed, the inner sum in eq. (7), once correctly rescaled, can be interpreted as the average of $u(A_i, A_j)$ when $N_A - 2$ sites are randomly chosen out of $N_t - 2$ sites. In other words, eq. (7) fixes the positions of $A_i$ and $A_j$ and averages over the positions of all the other A's, in order to compute the average by analogy with the inter case seen above, with, formally, $N_B \equiv N_A - 2$. One only has to be careful to separate the sum into two terms, for which the value of the product of random variables $N_A^i N_A^j$ is different: when $A_i$ and $A_j$ are neighbors, one has to add 1 to the random value of both $N_A^i$ and $N_A^j$:

$$N_A^i = N_A^{i \setminus j} + N_A^{ij} + 1, \qquad N_A^j = N_A^{j \setminus i} + N_A^{ij} + 1,$$

where $N_A^{i \setminus j}$ denotes the number of A neighbors of site $A_i$ among all points of T that are r-neighbors of $A_i$ and not r-neighbors of $A_j$. We introduce the following localised average $\langle \cdot \rangle_n$: for any function v defined on couples of points of T, set

$$\langle v(T_i, T_j) \rangle_n = \frac{2}{N_t (N_t - 1)} \sum_{1 \le i < j \le N_t} v(T_i, T_j) \, 1_{d(T_i, T_j) \le r}, \tag{8}$$

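where $1_{d(s,t) \le r}$ is equal to one if and only if s and t are at a distance less than r, and to 0 otherwise. We then obtain the following result:

Proposition 2 The variance of the intra-coefficient is the sum of four terms:

$$\sigma^2(a_{AA}) = \sum_{i=1}^{4} \mathrm{Var}(a_{AA})_i,$$

where

$$\mathrm{Var}(a_{AA})_1 = - \frac{N_t - N_A}{(N_t - 2)(N_A - 1)},$$

$$\mathrm{Var}(a_{AA})_2 = \frac{(N_t - 1)(N_t - N_A)}{N_A (N_A - 1)(N_t - 2)} \left\langle \frac{1}{N_t^i} \right\rangle_t,$$

$$\mathrm{Var}(a_{AA})_3 = \frac{(N_t - 1)^2 (N_t - N_A)(N_t - N_A - 1)}{(N_t - 2)(N_t - 3) N_A (N_A - 1)} \left\langle \frac{1}{N_t^i N_t^j} \right\rangle_n,$$

$$\mathrm{Var}(a_{AA})_4 = \frac{(N_t - 1)^2 (N_t - N_A)(N_A - 2)}{(N_t - 2)(N_t - 3) N_A (N_A - 1)} \langle x_{ij} \rangle_t,$$

where $\langle x_{ij} \rangle_t$ is defined analogously to $\langle x_{ij} \rangle_A$:

$$\langle x_{ij} \rangle_t = \frac{2}{N_t (N_t - 1)} \sum_{1 \le i < j \le N_t} \frac{N_t(T^r_{i,j})}{N_t(T_i, r) \, N_t(T_j, r)},$$

where $T^r_{i,j} = B(T_i, r) \cap B(T_j, r) \setminus \{T_i, T_j\}$.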
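The four terms of Proposition 2 depend on the geometry only through three averages over the site set T, so they can be computed once and reused for every value of $N_A$. The sketch below makes this explicit; it is an illustration under assumptions, not the authors' code: `nbrs[s]` is assumed to be the set of r-neighbors of site `s` (excluding `s`), the neighbor relation is assumed symmetric, every site is assumed to have at least one neighbor, and $\langle 1/N_t^i \rangle_t$ is read as the plain average of $1/N_t(s, r)$ over all sites. An exhaustive enumeration is included as a cross-check for small site sets.

```python
from fractions import Fraction
from itertools import combinations


def intra_variance(nbrs, n_a):
    """Variance of a_AA as the sum of the four terms of Proposition 2."""
    n_t = len(nbrs)
    deg = [len(v) for v in nbrs]                       # N_t(T_i, r)
    pair_norm = Fraction(2, n_t * (n_t - 1))
    inv_mean = sum(Fraction(1, d) for d in deg) / n_t  # <1/N_t^i>_t
    inv_pair = Fraction(0)                             # <1/(N_t^i N_t^j)>_n
    x_mean = Fraction(0)                               # <x_ij>_t
    for i, j in combinations(range(n_t), 2):
        if j in nbrs[i]:                               # indicator d(T_i, T_j) <= r
            inv_pair += Fraction(1, deg[i] * deg[j])
        # Common neighbours; i and j are never in their own neighbour sets.
        x_mean += Fraction(len(nbrs[i] & nbrs[j]), deg[i] * deg[j])
    inv_pair *= pair_norm
    x_mean *= pair_norm
    denom = (n_t - 2) * (n_t - 3) * n_a * (n_a - 1)
    v1 = -Fraction(n_t - n_a, (n_t - 2) * (n_a - 1))
    v2 = Fraction((n_t - 1) * (n_t - n_a), n_a * (n_a - 1) * (n_t - 2)) * inv_mean
    v3 = Fraction((n_t - 1) ** 2 * (n_t - n_a) * (n_t - n_a - 1), denom) * inv_pair
    v4 = Fraction((n_t - 1) ** 2 * (n_t - n_a) * (n_a - 2), denom) * x_mean
    return v1 + v2 + v3 + v4


def intra_variance_bruteforce(nbrs, n_a):
    """Same variance, obtained by enumerating every choice of the A sites."""
    n_t = len(nbrs)
    values = []
    for a in combinations(range(n_t), n_a):
        a_set = set(a)
        total = sum(Fraction(len(nbrs[s] & a_set), len(nbrs[s])) for s in a)
        values.append(Fraction(n_t - 1, n_a * (n_a - 1)) * total)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

Both functions require at least four sites and at least two A stores, since the prefactors of Proposition 2 contain $N_t - 3$ and $N_A - 1$ in their denominators.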


3.3 Numerical evidence and comparison with Monte Carlo simulations

We have computed the theoretical variances for the set of locations T corresponding to the locations of stores in the city of Lyon, and compared the results with approximate values obtained by Monte Carlo simulations. Figures 1 and 2 give the result of this comparison for different categories of retail store locations in Lyon. Figure 1 presents the comparison of the theoretical inter variance and of a Monte Carlo simulation in two situations. The red line shows the analytical results for $N_A = 211$ sites, the circles corresponding to the estimates of the variance from simulation results over a million B configurations for different values of $N_B$ ($\langle 1/N_i \rangle_A = 0.116$, $\langle x_{ij} \rangle_A = 0.00019$). The blue line corresponds to the analytical results for $N_A = 72$ sites; the squares correspond to the estimates from simulation ($\langle 1/N_i \rangle_A = 0.0674$, $\langle x_{ij} \rangle_A = 0.000081$). The $N_A$ values ($N_A = 72$ and $N_A = 211$) are somewhat arbitrary, but nevertheless represent typical numbers of stores of a given activity in a town of roughly a million inhabitants.

Fig. 1 Comparison of variances obtained from simulations and proposition 1 for the inter-coefficient as a function of NB (logarithmic scales).

Figure 2 shows the agreement between the analytical intra results (circles) and the variance obtained by a Monte Carlo simulation of a million configurations (red line). Clearly, the computations closely follow the evolution of the variances as a function of $N_A$ and $N_B$. For illustration, the average values for Lyon's sites are (for r = 100 m): $N_t = 7839$, $\langle 1/N_t^i \rangle_t = 0.112$, $\langle 1/(N_t^i N_t^j) \rangle_n = 1.29 \times 10^{-5}$ and $\langle x_{ij} \rangle_t = 1.19 \times 10^{-4}$. The calculation times for the variances of the 53 retail activities (i.e. approximately 2000 terms similar to the points shown in figures 1 and 2) are: 21 seconds with the exact formulas, to be compared with 50 minutes for 10,000 Monte Carlo simulations. Those simulations still show some deviations larger than 5% from the exact variance values (mean absolute deviation: 2%). Note that these values are given to allow comparison of the two calculation times, but their absolute values are far from optimal. Standard computation tricks such as R-trees [MNPT05] allow a drastic reduction of calculation times for the exact computation of the variances, by computing distances only between sites which are close enough. This leads to a calculation time of about 1 second for 7841 sites with the exact formulas and, more importantly, to a less steep increase with the number of sites. Therefore, the variances with millions of sites can be calculated in a few hours.
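The idea of restricting distance computations to nearby sites can be illustrated without an R-tree. The sketch below swaps in a simpler technique, a uniform grid of cell size r, which is an assumption of this example and not the structure cited in the text: two sites closer than r necessarily fall in the same or in adjacent cells, so only nine cells need to be scanned per site. The function name is hypothetical.

```python
from collections import defaultdict
from math import floor, hypot


def neighbor_counts(sites, r):
    """N_t(S, r) for every site S, scanning only the 3 x 3 block of grid cells around S."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(sites):
        cells[(floor(x / r), floor(y / r))].append(idx)
    counts = [0] * len(sites)
    for idx, (x, y) in enumerate(sites):
        cx, cy = floor(x / r), floor(y / r)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get avoids creating empty cells in the defaultdict.
                for other in cells.get((cx + dx, cy + dy), ()):
                    ox, oy = sites[other]
                    if other != idx and hypot(x - ox, y - oy) < r:
                        counts[idx] += 1
    return counts
```

When the sites are spread over many cells, the cost is close to linear in the number of sites instead of quadratic, which is the same qualitative gain as the one described above.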

Fig. 2 Comparison of variances obtained from simulations and proposition 2 as a function of NA (linear scales).

3.4 Testing the pure randomness hypothesis

The analytical expressions for the variance of the inter- and intra-coefficients are the major tool for easily defining a test of the randomness hypothesis. Indeed, if we define $H_0^B$ to be the hypothesis "the locations of B sites are purely random", and $H_0^A$ the hypothesis "the locations of A sites are purely random", then Chebyshev's inequality gives the following results:

Proposition 3 Let α be a (small) number in (0, 1). Under hypothesis $H_0^B$, for any configuration of sites A, we have:

$$P\left(|a_{AB} - 1| \le q_\alpha^{AB}\right) \ge 1 - \alpha, \tag{9}$$

where $q_\alpha^{AB} = \sqrt{\sigma^2(a_{AB})/\alpha}$. Under hypothesis $H_0^A$ we have

$$P\left(|a_{AA} - 1| \le q_\alpha^{AA}\right) \ge 1 - \alpha, \tag{10}$$

where $q_\alpha^{AA} = \sqrt{\sigma^2(a_{AA})/\alpha}$.

This implies that $[1 - q_\alpha^{AB}, 1 + q_\alpha^{AB}]$ is a confidence interval with level (at least) 1 − α for $a_{AB}$ under hypothesis $H_0^B$ (and likewise for $a_{AA}$ under $H_0^A$).

Under the alternative hypotheses, it can be shown in both cases that the inter- and intra-coefficients have a distinct behavior (this will be highlighted in the next subsection): the expected value of those coefficients is no longer equal to 1 (higher values correspond to clustering whereas lower values correspond to exclusion). Thus we may formulate the following testing procedure:
– If $|a_{AB} - 1| > q_\alpha^{AB}$, reject hypothesis $H_0^B$.
– If $|a_{AA} - 1| > q_\alpha^{AA}$, reject hypothesis $H_0^A$.

Otherwise accept the hypothesis.
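The testing procedure amounts to comparing the distance of the observed coefficient from 1 with the Chebyshev half-width $\sqrt{\sigma^2/\alpha}$. A minimal sketch follows; the function names and the default α = 0.05 are assumptions made for illustration, and the variance is supposed to have been computed beforehand from the exact formulas.

```python
from math import sqrt


def chebyshev_halfwidth(variance, alpha):
    """Half-width q_alpha = sqrt(variance / alpha) of the confidence interval around 1."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie in (0, 1)")
    return sqrt(variance / alpha)


def reject_randomness(coefficient, variance, alpha=0.05):
    """True when |coefficient - 1| exceeds q_alpha, i.e. the hypothesis is rejected."""
    return abs(coefficient - 1.0) > chebyshev_halfwidth(variance, alpha)
```

Because Chebyshev's inequality holds for any distribution, the resulting interval is conservative: its actual level is at least 1 − α.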

This testing procedure is illustrated in figure 3: the intra-coefficient is computed for different values of r (circles), and the vertical bars correspond to the (lower) half confidence interval centered at 1 with level at least 1 − α = 0.95.

Fig. 3 Plot of the intra-coefficient for bakeries in the city of Lyon with respect to r.

Figure 3 shows the practical importance of variance calculations for economic interpretations of the data. Although $a_{AA}$ remains well below the reference value (i.e. 1), bakeries are significantly dispersed only up to 150 m (where the hypothesis $H_0^A$ is rejected). For longer distances, their spatial locations approach a random pattern: the hypothesis $H_0^A$ is not rejected for large values of r.


3.5 Clustering and exclusion: examples and numerical evidence of the deviation from the pure random case

The theoretical computations in the case of the pure random hypothesis give the reference model; however, we need to show that different situations yield (radically) different values for the inter- and intra-coefficients. This is also illustrated in [MP07], p. 15–17. For the inter case it is straightforward to generate random configurations showing clustering or exclusion around sites of type A: let us consider, for instance, a pure random configuration of the B sites outside the $r_0$-neighborhoods of the A points (if the total number of sites satisfying this condition is larger than $N_B$); then it is clear that by construction $a_{AB} = 0$ for all $r \le r_0$. On the contrary, if we concentrate (randomly) the B points around the A points, we may obtain a coefficient (much) greater than 1. We shall detail the generation of such random configurations, showing clustering or exclusion properties, for the intra-coefficient. Let us consider the square lattice $[-N, N]^2$, and denote by $g(r)$ the number of integer points inside the disk $B(0, r)$. We assume that $N_A \, g(r) \ll N^2$. Define the following subsets of configurations:
– Self-excluding configurations: let $E_{r, N_A, N}$ denote the subset of all point configurations on the square lattice such that no two points are at a distance less than r;
– Clustered configurations: let $\varepsilon \in (0, 1)$, define $N_{A,\varepsilon} = \lfloor \varepsilon N_A \rfloor$, and assume further that $N_{A,\varepsilon} \, g(r) \gg N_A$. Define $C(\varepsilon, r, N_A, N)$ as the subset of all point configurations $\{x_i, i = 1, \dots, N_A\}$ constructed in the following way: the configuration $x_1, \dots, x_{N_{A,\varepsilon}}$ belongs to $E_{r, N_{A,\varepsilon}, N - r}$ (it is called the mother configuration) and $x_{N_{A,\varepsilon}+1}, \dots, x_{N_A}$ is a pure random configuration in the sublattice $\bigcup_{i=1}^{N_{A,\varepsilon}} B(x_i, r) \cap \mathbb{Z}^2$ (the progeny configuration).

It is clear that if ε and N are chosen correctly those two subsets of configurations are non-empty (large); thus one may consider the uniform distribution over those two subsets, giving two random configurations. The generation of such random configurations may be easily performed using the Metropolis-Hastings algorithm¹. We have performed this simulation on this regular lattice with N = 250 and small configurations, and performed a large number of simulations giving similar results. This choice was dictated by the simulation time of the configurations, the parameters chosen giving the desired clustered or diluted behaviors. The confidence intervals were computed using the exact formulas for the variances, with a confidence level of 1 − α = 0.95. The numerical data of 30 locations in the square lattice of size 501 × 501 are simulated with an exclusion radius of 45 for the self-excluding configuration, and the clustered region is made of five clusters of radius 20. Results are summarized in figure 4: the top picture shows in white diamonds the intra-coefficient for a (random) clustered configuration generated by the procedure described above, for r ranging from 30 to 70; one observes that the obtained coefficient is always greater than 1. The confidence region is situated under the blue line. The bottom picture represents in black diamonds the intra-coefficient for a (random) self-excluding configuration generated by the procedure described above, for r ranging from 30 to 70; one observes that the obtained coefficient is always less than 1. The confidence region is situated above the blue line (note that the two scales are different). We observe, except for small r for the self-excluding configuration, that the intra-coefficients obtained do not belong to the confidence region; this shows that the intra-coefficient does actually discriminate between clustered, self-excluding and purely random configurations.

¹ A fixed configuration belonging to $E_{r, N_A, N}$ is iteratively modified by moving at random one of its points; the movement is accepted if the new configuration is still in $E_{r, N_A, N}$. This procedure is repeated a large number of times (see [Häg02]). The clustered configuration is generated this way for the mother configuration; the progeny is generated by an urn without replacement scheme.
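The move-and-accept scheme of footnote 1 for self-excluding configurations can be sketched as follows. The footnote does not specify how a point is moved; the sketch assumes that a uniformly chosen point is relocated to a uniformly chosen lattice site of $[-N, N]^2$, which is a symmetric proposal, so accepting exactly the moves that stay in the self-excluding set leaves the uniform distribution on that set invariant. The function names and parameters are hypothetical.

```python
import random
from math import hypot


def is_self_excluding(points, r):
    """True when no two points are at a distance less than r."""
    return all(hypot(p[0] - q[0], p[1] - q[1]) >= r
               for i, p in enumerate(points) for q in points[i + 1:])


def sample_self_excluding(start, r, half_width, steps, rng):
    """Random walk on self-excluding configurations of the lattice [-half_width, half_width]^2."""
    points = list(start)
    if not is_self_excluding(points, r):
        raise ValueError("the starting configuration must be self-excluding")
    for _ in range(steps):
        k = rng.randrange(len(points))
        # Assumed proposal: relocate point k uniformly on the lattice.
        candidate = (rng.randint(-half_width, half_width),
                     rng.randint(-half_width, half_width))
        others = points[:k] + points[k + 1:]
        # Accept only if the new configuration is still self-excluding.
        if all(hypot(candidate[0] - q[0], candidate[1] - q[1]) >= r for q in others):
            points[k] = candidate
    return points
```

A call such as `sample_self_excluding(start, 45, 250, 100000, random.Random(0))` mirrors the parameters quoted above (exclusion radius 45 on the 501 × 501 lattice); the number of steps needed to forget the starting configuration is not given in the text and would have to be chosen empirically.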

Fig. 4 Plot of the intra-coefficient for clustered (top) and self-excluding (bottom) configurations.

We may remark that the confidence interval is quite large, because the size of the configuration is small; however, the deviation is already significant.

4 Continuous setting and asymptotic normality

The computations in a discrete setting are often more complicated than in a continuous setting. This is clearly the case when the question of asymptotic normality is addressed. In this section, we first give the equivalent definitions of the inter- and intra-coefficients in a general continuous setting (allowing spatial heterogeneities) and give the exact computation of their variances. This result is then compared to the discrete one in the large $N_t$ limit. We also prove a central limit theorem for the homogeneous case. This result, although restricted to ideal homogeneous environments, can be interpreted as a justification for the normal approximation for inhomogeneous environments. This approximation is needed to sharpen the confidence interval in the testing procedure (the width of the confidence interval is still proportional to the square root of the variance, but with a smaller constant depending on α and the Normal law). We begin with a short paragraph on Poisson point processes, which are the natural way to model the pure random hypothesis in a continuous framework.


4.1 Poisson point processes The Poisson point process [Kin93] is the natural way to generate random sets of points in (possibly) unbounded domains. We recall shortly its definition below in a subset of the plane in a diffuse context: let D be a Borel subset of R2 , Λ a diffuse2 non-negative σ -finite Borel measure on D, the Poisson point process with intensity measure Λ is the random locally finite3 set of points X of D such that – for each Borel subset A of D with finite Λ measure, the number of points of X in A, denoted by ](X ∩ A), is a Poisson random variable with parameter Λ (A): Λ (A)k exp(−Λ (A)), k! – for disjoint Borel subsets B1 , . . . , Bn of D, the random variables (](X ∩ Bi )) for i ∈ {1, . . . , n} are independent. ∀k ≥ 0, P(](X ∩ A) = k) =

The term intensity for $\Lambda$ may be interpreted as a local density: the average number of points in a bounded subset $A$ is equal to $\Lambda(A)$ and, as $\Lambda$ is assumed to be diffuse, there exists a Borel measurable function $\lambda$ defined on $D$ such that $\Lambda(A) = \int_A \lambda(x)\,dx$, so that the function $\lambda$ plays the role of a density. The Poisson point process has two important properties [Kin93]:

Lemma 4 Let $X$ be a Poisson point process with intensity measure $\Lambda$ on $D$, and $A$ a bounded Borel subset of $D$. Then, conditionally on the event $\{\#(X \cap A) = k\}$ where $k \ge 1$, the points of $X \cap A$ are distributed as $k$ independent points with common law $\Lambda(\cdot \cap A)/\Lambda(A)$.

This property also characterizes the Poisson point process. The other important property is the Campbell-Mecke-Slivnyak formula, which states that the Poisson point process conditioned on the negligible event $\{\#(X \cap \{x\}) = 1\}$ is equal in law to a Poisson point process with the same intensity measure plus the fixed point $x$:

Lemma 5 Let $f$ be a non-negative function defined on the product space $D \times \mathcal{P}_{l.f.}(D)$, where $\mathcal{P}_{l.f.}(D)$ is the set of all locally finite subsets of $D$, and let $D' \subset D$. Then, for a Poisson point process $X$ with intensity measure $\Lambda$ on $D$,

$$E\left[\sum_{x \in X \cap D'} f(x, X)\right] = \int_{D'} E\left[f(x, X \cup \{x\})\right] d\Lambda(x).$$

In this formula, a sum indexed by an empty set is taken to be zero. In the following sections, when dealing with the intra-coefficient, the locations of the stores of type A are the points of a Poisson point process $X_A$ with given intensity $\Lambda_A$, and we write $N_A(x, r) = \#(X_A \cap B(x, r))$. When dealing with the inter-coefficient, the locations of the points of type A are assumed to be fixed, the B stores are the points of a Poisson point process $X_B$ with given intensity $\Lambda_B$, and we write $N_B(x, r) = \#(X_B \cap B(x, r))$. The notation $N_t(x, r)$ has no equivalent in this continuous setting. Both intensities $\Lambda_A$ and $\Lambda_B$ are assumed to be finite ($\Lambda_A(D), \Lambda_B(D) < +\infty$), so that the total numbers of A and B stores are almost surely finite.

Footnote 2: By diffuse we mean absolutely continuous with respect to the Lebesgue measure.
Footnote 3: By locally finite we mean that, for each compact subset $K$ of $D$, the number of points of $X$ in $K$ is almost surely finite.
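As an illustration of Lemma 4, the following Python sketch simulates a homogeneous Poisson point process on a rectangle: it first draws the Poisson number of points, then places them independently and uniformly. The function names, the rectangular domain and the constant density `lam` are assumptions made for this example, and the simple inversion sampler is only meant for moderate means.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson(mean) variate by inversion of the cumulative distribution.

    Restricted to moderate means so that exp(-mean) does not underflow.
    """
    if not 0.0 <= mean < 700.0:
        raise ValueError("this simple sampler assumes 0 <= mean < 700")
    u = rng.random()
    k = 0
    p = math.exp(-mean)      # P(N = 0)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= mean / k        # P(N = k) from P(N = k - 1)
        cdf += p
    return k


def poisson_points(lam, width, height, rng):
    """Homogeneous Poisson point process of density lam on [0, width) x [0, height).

    By Lemma 4, given the number of points, the points are independent and
    uniformly distributed on the domain.
    """
    n = sample_poisson(lam * width * height, rng)
    return [(rng.random() * width, rng.random() * height) for _ in range(n)]
```

For instance, `poisson_points(5.0, 4.0, 2.0, random.Random(0))` returns a list whose length is Poisson distributed with mean 40.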


4.2 Definition of the inter and intra-coefficients

The generalization of the $a_{AB}$ coefficient is obtained by considering the relative weights of neighborhoods for the $\Lambda_B$ measure. Indeed, let us rewrite the discrete $a_{AB}$ in the following way:

$$a_{AB} = \frac{1}{N_A} \sum_{i=1}^{N_A} \frac{N_B(A_i, r)}{N_B} \left(\frac{N_t(A_i, r) - N_A(A_i, r)}{N_t - N_A}\right)^{-1}.$$

The right-most term is the inverse of the relative weight of the neighborhood $B(A_i, r)$ in the set $T$. The analog of this formula is thus given by

$$a^P_{AB} = \begin{cases} 1 & \text{if } N_B = 0, \\[4pt] \dfrac{1}{N_A} \displaystyle\sum_{i=1}^{N_A} \dfrac{N_B(A_i, r)}{N_B} \left(\dfrac{\Lambda_B(V_i)}{\Lambda_B(D)}\right)^{-1} & \text{otherwise,} \end{cases}$$

where $V_i = B(A_i, r) \cap D$. The computation of the expectation of this coefficient is straightforward using the definition of the Poisson point process $X_B$: one easily checks that it is equal to 1. Similarly, the discrete $a_{AA}$ can be rewritten as

$$a_{AA} = \frac{1}{N_A} \sum_{i=1}^{N_A} \frac{N_A(A_i, r)}{N_A - 1} \left(\frac{N_t(A_i, r)}{N_t - 1}\right)^{-1}.$$

The natural extension of the $a_{AA}$ term to the Poisson case is then, in the same way as above,

$$a^P_{AA} = \begin{cases} 1 & \text{if } N_A = 0, \\[4pt] \dfrac{1}{N_A} \displaystyle\sum_{i=1}^{N_A} \dfrac{N_A(A_i, r)}{N_A - 1} \left(\dfrac{\Lambda_A(V_i)}{\Lambda_A(D)}\right)^{-1} & \text{otherwise.} \end{cases}$$

Using lemma 5 it is clear that the average value of this intra-coefficient is also equal to 1.
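The two definitions above can be evaluated directly once the relative weights of the neighborhoods are known. The following Python sketch assumes that the caller supplies these weights (`weights[i]` standing for $\Lambda(V_i)/\Lambda(D)$); the function names, the closed-ball convention `<= r` and the value 1 returned when fewer than two A points are present are choices made for this example only.

```python
import math


def _count_within(center, points, r, skip_index=None):
    """Number of points within distance r of center (optionally skipping one index)."""
    cx, cy = center
    return sum(
        1
        for k, (x, y) in enumerate(points)
        if k != skip_index and math.hypot(x - cx, y - cy) <= r
    )


def inter_coefficient(a_points, b_points, r, weights):
    """Continuous inter-coefficient a^P_AB.

    weights[i] is the relative weight Lambda_B(V_i) / Lambda_B(D) of the
    neighborhood of the i-th A point (supplied by the caller).
    """
    n_b = len(b_points)
    if n_b == 0:
        return 1.0
    total = 0.0
    for a, p_i in zip(a_points, weights):
        total += _count_within(a, b_points, r) / (n_b * p_i)
    return total / len(a_points)


def intra_coefficient(a_points, r, weights):
    """Continuous intra-coefficient a^P_AA; each point is excluded from its own count.

    Returns 1 when there are fewer than two A points (convention assumed here
    to avoid a division by zero).
    """
    n_a = len(a_points)
    if n_a < 2:
        return 1.0
    total = 0.0
    for i, (a, p_i) in enumerate(zip(a_points, weights)):
        total += _count_within(a, a_points, r, skip_index=i) / ((n_a - 1) * p_i)
    return total / n_a
```

With a single A point at the centre of the domain, a neighborhood of relative weight 0.25 and one of two B points falling inside it, `inter_coefficient` returns 2: the neighborhood holds half of the B points while it carries a quarter of the weight.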

4.3 Computation of the variances

In order to compute the variance of the inter and intra-coefficients, we need an extended version of Lemma 5 [Kin93]:

Lemma 6 Let $f$ be a non-negative function defined on the product space $D^2 \times \mathcal{P}_{l.f.}(D)$, and let $D' \subset D$. Then, for a Poisson point process $X$ with intensity measure $\Lambda$ on $D$,

$$E\left[\sum_{x \neq y \in X \cap D'} f(x, y, X)\right] = \int_{D'^2} E\left[f(x, y, X \cup \{x, y\})\right] d\Lambda(x)\, d\Lambda(y).$$

This lemma is sometimes stated for functions that are symmetric in their first two arguments, $g(x, y, X) = g(y, x, X)$, and sums over pairs of distinct points:

$$E\left[\sum_{\{x, y\} \subset X \cap D'} g(x, y, X)\right] = \frac{1}{2} \int_{D'^2} E\left[g(x, y, X \cup \{x, y\})\right] d\Lambda(x)\, d\Lambda(y).$$

Using this lemma the computation of the variances is straightforward (though a little tricky), and we get


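Proposition 4 The variance of the inter-coefficient is given by

$$\mathrm{Var}(a^P_{AB}) = \left(-1 + \frac{1}{N_A}\left\langle \frac{1}{p^B_i} \right\rangle_A + \frac{N_A - 1}{N_A}\left\langle \frac{p^B_{ij}}{p^B_i\, p^B_j} \right\rangle_A\right) E[N_B^{-1}; N_B > 0],$$

where the quantities $p^B_i$ and $p^B_{ij}$ are given by

$$p^B_i = \frac{\Lambda_B(V_i)}{\Lambda_B(D)}, \qquad p^B_{ij} = \frac{\Lambda_B(V_i \cap V_j)}{\Lambda_B(D)},$$

the averages $\langle \cdot \rangle_A$ are taken with respect to the fixed points of type A as in the first sections, and the last term is equal to

$$E[N_B^{-1}; N_B > 0] = \sum_{k \ge 1} \frac{\Lambda_B(D)^k}{k \cdot k!}\, \exp(-\Lambda_B(D)).$$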
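Proposition 4 is easy to evaluate numerically. The Python sketch below is one possible implementation, under the assumption that the pair average $\langle \cdot \rangle_A$ runs over all ordered pairs $i \neq j$ of A points (the averages are defined in earlier sections, so this convention is an assumption here); the function names and the inputs `p` and `p_pair` are hypothetical.

```python
import math


def expected_inverse_count(lam, tol=1e-16):
    """E[1/N; N > 0] for N ~ Poisson(lam), by direct summation of the series."""
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    total = 0.0
    k = 1
    while True:
        # Poisson weight P(N = k), computed in log space to avoid overflow.
        weight = math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
        total += weight / k
        if k > lam and weight < tol:
            return total
        k += 1


def inter_variance(p, p_pair, lam_b_total):
    """Variance of a^P_AB from Proposition 4.

    p[i]         : p_i^B  = Lambda_B(V_i) / Lambda_B(D)
    p_pair[i][j] : p_ij^B = Lambda_B(V_i & V_j) / Lambda_B(D), used for i != j
    lam_b_total  : Lambda_B(D)
    The pair average is taken over ordered pairs i != j (assumed convention).
    """
    n_a = len(p)
    mean_inverse = sum(1.0 / p_i for p_i in p) / n_a
    mean_pair = 0.0
    if n_a > 1:
        pair_sum = sum(
            p_pair[i][j] / (p[i] * p[j])
            for i in range(n_a)
            for j in range(n_a)
            if i != j
        )
        mean_pair = pair_sum / (n_a * (n_a - 1))
    bracket = -1.0 + mean_inverse / n_a + (n_a - 1) / n_a * mean_pair
    return bracket * expected_inverse_count(lam_b_total)
```

For a single A point the bracket reduces to $1/p^B_1 - 1$, which is the variance of a rescaled binomial proportion averaged over the Poisson number of B points.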

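The computation of the intra variance is a bit more complicated. Using the same arguments as in Proposition 4, we obtain

Proposition 5 The variance of the intra-coefficient is the sum of the following terms:

$$\begin{aligned}
\mathrm{Var}(a^P_{AA}) = {} & -\Lambda_A(D)^2\, E\left[\frac{1}{(N_A + 1)(N_A + 2)^2}\right] - \Lambda_A(D)\, E\left[\frac{1}{N_A (N_A + 1)^2}; N_A > 0\right] \\
& + E\left[\frac{1}{N_A (N_A + 1)^2}; N_A > 0\right] \int_D \frac{1}{p(a)}\, d\Lambda_A(a) \\
& + E\left[\frac{N_A}{(N_A + 1)^2 (N_A + 2)^2}\right] \int_{D^2} \frac{p(a, b)}{p(a)\, p(b)}\, d\Lambda_A(a)\, d\Lambda_A(b) \\
& + E\left[\frac{1}{(N_A + 1)^2 (N_A + 2)^2}\right] \int_{D^2} \frac{1_{a \sim b}}{p(a)\, p(b)}\, d\Lambda_A(a)\, d\Lambda_A(b),
\end{aligned}$$

where $p(a) = \Lambda_A(B(a, r) \cap D)/\Lambda_A(D)$, $p(a, b) = \Lambda_A(B(a, r) \cap B(b, r) \cap D)/\Lambda_A(D)$, $1_{a \sim b} = 1$ if $|a - b| < r$ and 0 otherwise, and

$$E[Z(N_A); N_A > 0] = \sum_{n=1}^{+\infty} Z(n)\, \frac{\Lambda_A(D)^n}{n!}\, \exp(-\Lambda_A(D)).$$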
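The Poisson expectations entering Proposition 5 are plain series and can be summed numerically. The following Python sketch is a minimal illustration; the function name is hypothetical and the stopping rule assumes that the function `z` is bounded.

```python
import math


def poisson_expectation(z, lam, tol=1e-16):
    """E[z(N); N > 0] for N ~ Poisson(lam), summing the series term by term.

    The summation stops once it is past the mode of the distribution and the
    Poisson weight falls below tol, which is adequate for bounded z.
    """
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    total = 0.0
    n = 1
    while True:
        # Poisson weight P(N = n), computed in log space to avoid overflow.
        weight = math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))
        total += z(n) * weight
        if n > lam and weight < tol:
            return total
        n += 1
```

For example, `poisson_expectation(lambda n: 1.0 / (n * (n + 1) ** 2), lam)` gives the factor appearing in the second and third terms of Proposition 5 for `lam` equal to $\Lambda_A(D)$.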

4.4 Comparison with the discrete case: asymptotic equivalence

Classically, the Poisson (point) process can be seen as a limit of some discrete model; we therefore expect to obtain similar values for both approaches in large domains (equivalently, for large $N_t$). Let us start with the discrete case. For $N_t \gg 1$, $N_A \gg 1$, $N_t \gg N_A$ and $N_t \gg N_B$, the variance of $a_{AB}$ becomes

$$\sigma^2(a_{AB}) \simeq \frac{1}{N_B}\left(-1 + \frac{N_t}{N_A}\left\langle \frac{1}{N_t(A_i) - N_A(A_i)} \right\rangle_A + N_t \langle x_{ij} \rangle_A\right),$$

which would be indistinguishable in Figure 1 from the complete theoretical expression. It appears that $\mathrm{Var}(a_{AB})$ decreases roughly as $1/N_B$ for fixed A's. The last two terms, which characterize the distribution of A, can be more or less important depending on the spatial distribution of the A stores. Under the same conditions, the variance of $a_{AA}$ can be approximated by

$$\sigma^2(a_{AA}) \simeq \frac{1}{N_A}\left(-1 + N_t \langle x_{ij} \rangle_t + \frac{N_t}{N_A}\left\langle \frac{1}{N_t^i} \right\rangle_t + \frac{N_t^2}{N_A}\left\langle \frac{1}{N_t^i N_t^j} \right\rangle_n\right),$$

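which, again, is indistinguishable in Figure 2 from the complete theoretical expression (see footnote 4), and $\mathrm{Var}(a_{AA})$ decreases roughly as $1/N_A$. To compare these limits with the continuous case, let us display the results side by side:

$$\begin{array}{l|l}
\text{Discrete setting} & \text{Continuous (Poisson) setting} \\ \hline
\sigma^2(a_{AB}) = -\dfrac{1}{N_B} & \sigma^2(a^P_{AB}) = -E[N_B^{-1}; N_B > 0] \\[8pt]
\quad + \dfrac{N_t}{N_A N_B}\left\langle \dfrac{1}{N_t(A_i) - N_A(A_i)} \right\rangle_A & \quad + \dfrac{1}{N_A}\left\langle \dfrac{1}{p^B_i} \right\rangle_A E[N_B^{-1}; N_B > 0] \\[8pt]
\quad + \dfrac{N_t}{N_B} \langle x_{ij} \rangle_A & \quad + \dfrac{N_A - 1}{N_A}\left\langle \dfrac{p^B_{ij}}{p^B_i\, p^B_j} \right\rangle_A E[N_B^{-1}; N_B > 0]
\end{array}$$

Table 1 Comparison of the expressions of the discrete and continuous variances of the inter-coefficient.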

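The first lines are clearly similar. To see the similarity of the last two lines, one has to replace the quantities $N_t \langle 1/(N_t(A_i) - N_A(A_i)) \rangle_A$ and $N_t \langle x_{ij} \rangle_A$ by their respective values:

$$N_t \left\langle \frac{1}{N_t(A_i) - N_A(A_i)} \right\rangle_A = \left\langle \frac{1}{(N_t(A_i) - N_A(A_i))/N_t} \right\rangle_A \sim \left\langle \frac{1}{p^B_i} \right\rangle_A,$$

$$N_t \langle x_{ij} \rangle_A = \left\langle \frac{(N_t(C_{i,j}) - N_A(C_{i,j}))/N_t}{\big((N_t(A_i, r) - N_A(A_i, r))/N_t\big)\big((N_t(A_j, r) - N_A(A_j, r))/N_t\big)} \right\rangle_A \sim \left\langle \frac{p^B_{ij}}{p^B_i\, p^B_j} \right\rangle_A.$$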

For the intra terms, the comparison proceeds in the same manner. In the homogeneous case, with intensity measure $\Lambda_A$ a multiple of the Lebesgue measure, $d\Lambda_A = \lambda_A\, dx$, define $N_A = \lambda_A |D|$ and $n_r = \lambda_A \pi r^2$, the mean numbers of shops in the whole domain and in a neighborhood of radius $r$ respectively. We obtain the following asymptotics for large domains:

$$\sigma^2(a^P_{AA}) \simeq \frac{2}{N_A\, n_r}.$$

From this asymptotic relation we deduce that $\sqrt{N_A}\,(a^P_{AA} - 1)$ has a finite variance in the limit as the domain $D$ increases. In the next subsection we prove, in the homogeneous case, an asymptotic normality property that explains and extends this remark.

Footnote 4: For Lyon's configuration, where $N_t \langle x_{ij} \rangle_t = 0.933$, $N_t \langle 1/N_t^i \rangle_t = 1040$ and $N_t^2 \langle 1/(N_t^i N_t^j) \rangle_n = 793$, the last two terms dominate as long as $N_A \ll 300$, while for $N_A \gg 300$ all terms become of the same order of magnitude.

4.5 Asymptotic normality in the homogeneous case

The asymptotic normality of coefficients such as the ones studied here does not seem to have been studied in the literature. As said in the introduction, this property gives sharpened confidence intervals for testing the pure randomness hypothesis. We carried out some tests with Monte Carlo simulations on actual subsets of sites of the city of Lyon:

– For the inter-coefficient, we simulated 87 B sites around 917 A sites that are strongly aggregated (intra-coefficient $a_{AA} = 2.17$). The 87 B sites are chosen randomly among the 6922 free T sites. The circles correspond to the evaluation of the probability density function of $a_{AB}$ at different points in [0.6, 1.4], performed by direct simulation. Note that the distribution in this A configuration is more subject to large fluctuations due to the strong aggregation. The result in Figure 5 is compared to the Normal law centered at 1 with variance computed from proposition 1.

– In the $a_{AA}$ case, the distribution is also indistinguishable from the Normal law, as shown in Figure 6. This figure is obtained by randomly choosing 917 A sites among 7839 T sites. The continuous line in Figure 6 corresponds to a Normal law centered at 1 with variance computed from proposition 2.

The agreement of the estimated distributions with the corresponding Normal laws suggests that such asymptotic normality actually holds. In the following, we prove that these distributions do converge to Normal laws when the domain (or the intensity) is large, in the intra case with constant intensity:

$$d\Lambda_A = \lambda_A\, dx.$$


Fig. 5 Distribution of random aAB values compared to the Normal law centered at 1 with variance computed from proposition 1.

Fig. 6 Distribution of random aAA values compared to a Normal law centered at 1 with variance computed from proposition 2.
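The Monte Carlo procedure used for the inter-coefficient (B sites drawn at random among the free T sites, A sites kept fixed) can be sketched as follows in Python. This is only an illustration on toy data, not the code used for the Lyon sites: the function names are ours, B sites are assumed to be drawn from `free_sites`, and every A site is assumed to have at least one free site within distance `r`.

```python
import math
import random


def neighbours(center, sites, r):
    """Sites lying within distance r of center."""
    cx, cy = center
    return [s for s in sites if math.hypot(s[0] - cx, s[1] - cy) <= r]


def discrete_inter_coefficient(a_sites, b_sites, free_sites, r):
    """Discrete a_AB; free_sites are the N_t - N_A sites of T not occupied by A."""
    b_set = set(b_sites)
    n_free, n_b = len(free_sites), len(b_sites)
    total = 0.0
    for a in a_sites:
        near = neighbours(a, free_sites, r)   # free sites around this A point
        b_near = sum(1 for s in near if s in b_set)
        total += (b_near / n_b) * (n_free / len(near))
    return total / len(a_sites)


def simulate_inter(a_sites, free_sites, n_b, r, trials, rng):
    """Values of a_AB for `trials` random placements of n_b B sites."""
    return [
        discrete_inter_coefficient(a_sites, rng.sample(free_sites, n_b), free_sites, r)
        for _ in range(trials)
    ]
```

The empirical mean of the returned values should be close to 1, and their histogram can be compared with the Normal law whose variance is given by the analytical expression.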


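Let us rewrite $a^P_{AA}$ on the square domain $D_n = [0, n)^2 = \bigcup_{i,j=0}^{n-1} [i, i+1) \times [j, j+1)$ (those smaller squares being denoted by $D_{i,j}$):

$$\begin{aligned}
a^P_{AA}(D_n) &= \sum_{a \in X_A \cap D_n} \frac{1}{N_A(D_n)\, p_n(a)}\, \frac{N_A(a)}{N_A(D_n) - 1} \\
&= \sum_{i,j=0}^{n-1} \left(\sum_{a \in X_A \cap D_{i,j}} \frac{1}{N_A(D_n)\, p_n(a)}\, \frac{N_A(a)}{N_A(D_n) - 1}\right) \\
&= \sum_{i,j=0}^{n-1} \left(\sum_{a \in X_A \cap D_{i,j}} \frac{\lambda_A(D_n)}{N_A(D_n)\, \lambda_A(V(a))}\, \frac{N_A(a)}{N_A(D_n) - 1}\right) \\
&= \underbrace{\frac{\lambda_A(D_n)}{N_A(D_n)}}_{\xrightarrow[n \to +\infty]{\text{a.s.}}\ 1}\ \underbrace{\frac{n^2}{N_A(D_n) - 1}}_{\xrightarrow[n \to +\infty]{\text{a.s.}}\ \lambda_A^{-1}}\ \frac{1}{n^2} \sum_{i,j=0}^{n-1} \underbrace{\left(\sum_{a \in X_A \cap D_{i,j}} \frac{N_A(a)}{\lambda_A(V(a))}\right)}_{Y_{i,j}}.
\end{aligned}$$

It is straightforward to adapt the classical mixing central limit theorem for sequences to a two-dimensional setting: the random variables $(Y_{i,j})_{i,j \ge 0}$ form a mixing sequence (as they are indeed independent at large distance), so that we easily get

Theorem 1 As $n$ tends to infinity one has

$$\sqrt{N_A(D_n)} \left(a^P_{AA} - \frac{\lambda_A^2\, n^4}{N_A(D_n)\,(N_A(D_n) - 1)}\right) \xrightarrow[n \to +\infty]{\text{law}} \mathcal{N}\left(0,\ 4 + \frac{2}{\lambda_A \pi r^2}\right).$$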

This theorem has a bias in the mean, and therefore in the asymptotic variance. This may be corrected, as the intra term may also be written as a V-statistic:

$$a^P_{AA} = \frac{\lambda_A(D)}{N_A (N_A - 1)} \sum_{x \neq y \in X_A} \frac{1_{|x - y| \le r}}{\lambda_A(V(x))}.$$

For the sake of simplicity we shall still work in the square $D_n$. Let $(Y_k)_{k \ge 1}$ be independent identically distributed random variables, uniformly distributed on $[0, 1)^2$, and $N_n$ an independent Poisson random variable with parameter $\lambda_A n^2$, so that

$$\begin{aligned}
a^P_{AA}(D_n) - 1 &= \frac{\lambda_A n^2}{N_n (N_n - 1)} \sum_{1 \le i \neq j \le N_n} \frac{1_{n|Y_i - Y_j| \le r}}{\lambda_A(B(nY_i, r))} - 1, \\
&= \frac{1}{N_n (N_n - 1)} \sum_{1 \le i \neq j \le N_n} \left(\frac{1_{|Y_i - Y_j| \le r/n}}{|B(Y_i, r/n) \cap (0, 1)^2|} - 1\right),
\end{aligned}$$

this is a sort of V-statistic, indexed by the random variable $N_n$, where the summand will be denoted by $G^{(n)}_{i,j}$ (it is not symmetric). This random variable

– is centered,

– is bounded by $1 + 4n^2/(\pi r^2)$,

– and satisfies, denoting $p_{r/n}(Y) = |B(Y, r/n) \cap [0, 1)^2|$ and $p_{r/n}(Y_1, Y_2) = |B(Y_1, r/n) \cap B(Y_2, r/n) \cap [0, 1)^2|$,

$$\begin{aligned}
E[(G^{(n)}_{1,2})^2] &= E\left[\frac{1}{p_{r/n}(Y)}\right] - 1, \\
E[G^{(n)}_{1,2} G^{(n)}_{2,1}] &= E\left[\frac{1_{|Y_1 - Y_2| \le r/n}}{p_{r/n}(Y_1)\, p_{r/n}(Y_2)}\right] - 1, \\
E[G^{(n)}_{1,2} G^{(n)}_{2,3}] &= 0, \qquad E[G^{(n)}_{1,2} G^{(n)}_{1,3}] = 0, \\
E[G^{(n)}_{1,2} G^{(n)}_{3,2}] &= E\left[\frac{p_{r/n}(Y_1, Y_2)}{p_{r/n}(Y_1)\, p_{r/n}(Y_2)}\right] - 1, \\
E[G^{(n)}_{1,2} G^{(n)}_{3,1}] &= 0, \qquad E[G^{(n)}_{1,2} G^{(n)}_{3,4}] = 0.
\end{aligned}$$

From those relations we have the following asymptotics:

$$E[(G^{(n)}_{1,2})^2] \sim \frac{n^2}{\pi r^2}, \qquad E[G^{(n)}_{1,2} G^{(n)}_{2,1}] \sim \frac{n^2}{\pi r^2}, \qquad E[G^{(n)}_{1,2} G^{(n)}_{3,2}] \le C n^{-2}.$$

This asymptotic degeneracy of most of the terms in the moments above changes the usual speed of the central limit theorem for V-statistics, and gives the following central limit result:

Theorem 2 As the domain $D$ tends towards $\mathbb{R}^2$ in a regular way,

$$\sqrt{N_A(D)}\, \left(a^P_{AA} - 1\right) \xrightarrow[D \to \mathbb{R}^2]{\text{law}} \mathcal{N}\left(0,\ \frac{2}{\lambda_A \pi r^2}\right).$$

The proof of this result is rather technical: it proceeds by computing the moments of the left-hand side and their asymptotics. An extended sketch of the proof is given in the appendix. In this result we recover the right asymptotics for the variance, $\sigma^2(a^P_{AA}) \sim 2/(\lambda_A \pi r^2 \cdot \lambda_A |D|)$. The existence of central limit theorems for inhomogeneous intensities and/or for the inter-coefficients is still an open problem in this setting, even if numerical evidence seems to justify this approximation.
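Theorem 2 leads to a simple significance test of the pure randomness hypothesis in the homogeneous case. The following Python sketch computes the corresponding z-score and two-sided p-value; the function names are hypothetical, and the density $\lambda_A$ is assumed to be known (in practice it would typically be estimated by $N_A/|D|$).

```python
import math


def intra_z_score(a_observed, n_a, lam_a, r):
    """Standardized deviation of an observed intra-coefficient under Theorem 2.

    sqrt(N_A) (a - 1) is compared to a Normal law with variance 2 / (lam_a pi r^2).
    """
    sigma2 = 2.0 / (lam_a * math.pi * r * r)
    return math.sqrt(n_a) * (a_observed - 1.0) / math.sqrt(sigma2)


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard Normal variable Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For instance, with `lam_a = 2 / math.pi`, `r = 1` and `n_a = 100`, an observed value of 1.2 gives a z-score of 2 and a p-value of about 0.046.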

5 Differential coefficients and comparison with the Duranton-Overman indicators

In this section we briefly discuss the differentiated version of the inter and intra-coefficients introduced above, and finally show some comparative results with the indicator introduced in [DO05].


5.1 Differential coefficients

The differentiated versions of the inter and intra-coefficients are the discrete derivatives of $a_{AB}$ and $a_{AA}$. Let $\delta r$ be positive. We use the following notation:

$N_A(A_i, r, r + \delta r)$ = number of A points $x$ such that $r \le |A_i - x| < r + \delta r$,
$N_B(A_i, r, r + \delta r)$ = number of B points $x$ such that $r \le |A_i - x| < r + \delta r$,
$\tilde{N}_t^i(r, \delta r)$ = number of points $x$ of $T$ such that $r \le |T_i - x| < r + \delta r$.

Then the formulas giving the variance of the differentiated indicators are exactly the same as before, using those new quantities. The indicators are:

$$d_{AB} := \frac{N_t - N_A}{N_A N_B} \sum_{i=1}^{N_A} \frac{N_B(A_i, r, r + \delta r)}{N_t(A_i, r, r + \delta r) - N_A(A_i, r, r + \delta r)} \tag{11}$$

$$d_{AA} := \frac{N_t - 1}{N_A (N_A - 1)} \sum_{i=1}^{N_A} \frac{N_A(A_i, r, r + \delta r)}{N_t(A_i, r, r + \delta r)} \tag{12}$$
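The indicators (11) and (12) only require counting points in shells. The Python sketch below is one way to compute them; the function names and the label-based input format are assumptions, the code supposes $r > 0$ (so that a point never lies in its own shell), and A points whose shell is empty are skipped, which is a convention chosen for this example.

```python
import math


def shell_count(center, points, r, dr):
    """Number of points x with r <= |center - x| < r + dr."""
    cx, cy = center
    return sum(1 for (x, y) in points if r <= math.hypot(x - cx, y - cy) < r + dr)


def differential_inter(sites, labels, r, dr):
    """d_AB of equation (11); labels[k] is 'A', 'B' or anything else."""
    a_pts = [s for s, lab in zip(sites, labels) if lab == "A"]
    b_pts = [s for s, lab in zip(sites, labels) if lab == "B"]
    n_t, n_a, n_b = len(sites), len(a_pts), len(b_pts)
    total = 0.0
    for a in a_pts:
        # Non-A sites in the shell: N_t(A_i, r, r+dr) - N_A(A_i, r, r+dr).
        free_shell = shell_count(a, sites, r, dr) - shell_count(a, a_pts, r, dr)
        if free_shell > 0:
            total += shell_count(a, b_pts, r, dr) / free_shell
    return (n_t - n_a) / (n_a * n_b) * total


def differential_intra(sites, labels, r, dr):
    """d_AA of equation (12)."""
    a_pts = [s for s, lab in zip(sites, labels) if lab == "A"]
    n_t, n_a = len(sites), len(a_pts)
    total = 0.0
    for a in a_pts:
        t_shell = shell_count(a, sites, r, dr)
        if t_shell > 0:
            total += shell_count(a, a_pts, r, dr) / t_shell
    return (n_t - 1) / (n_a * (n_a - 1)) * total
```

Both functions need at least two A points (and one B point for `differential_inter`) to avoid a division by zero.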

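Proposition 6 The variance of the differential inter-coefficient is given by

$$\begin{aligned}
\sigma^2(d_{AB}) = {} & \frac{N_t - N_A}{N_A N_B}\, \frac{N_t - N_A - N_B}{N_t - N_A - 1} \left\langle \frac{1}{N_t(A_i, r, r + \delta r) - N_A(A_i, r, r + \delta r)} \right\rangle_A \\
& - \frac{N_t - N_A - N_B}{N_B (N_t - N_A - 1)} + \frac{N_t - N_A - N_B}{N_t - N_A - 1}\, \frac{N_t - N_A}{N_B}\, \langle y_{ij} \rangle_A \left(1 - \frac{1}{N_A}\right),
\end{aligned}$$

where

$$\langle y_{ij} \rangle_A = \left\langle \frac{N_t(C'_{i,j}) - N_A(C'_{i,j})}{\big(N_t(A_i, r, r + \delta r) - N_A(A_i, r, r + \delta r)\big)\big(N_t(A_j, r, r + \delta r) - N_A(A_j, r, r + \delta r)\big)} \right\rangle_A$$

and $C'_{i,j} = \{x \in \mathbb{R}^2 : r \le |x - A_i| < r + \delta r \text{ and } r \le |x - A_j| < r + \delta r\}$.

The variance of the differential intra-coefficient is the sum of four terms:

$$\sigma^2(d_{AA}) = \sum_{i=1}^{4} \mathrm{Var}(d_{AA})_i,$$

where

$$\begin{aligned}
\mathrm{Var}(d_{AA})_1 &= -\frac{N_t - N_A}{(N_t - 2)(N_A - 1)}, \\
\mathrm{Var}(d_{AA})_2 &= \frac{(N_t - 1)(N_t - N_A)}{N_A (N_A - 1)(N_t - 2)} \left\langle \frac{1}{\tilde{N}_t^i(r, \delta r)} \right\rangle_t, \\
\mathrm{Var}(d_{AA})_3 &= \frac{(N_t - 1)^2 (N_t - N_A)(N_t - N_A - 1)}{(N_t - 2)(N_t - 3) N_A (N_A - 1)} \left\langle \frac{1}{\tilde{N}_t^i(r, \delta r)\, \tilde{N}_t^j(r, \delta r)} \right\rangle_n, \\
\mathrm{Var}(d_{AA})_4 &= \frac{(N_t - 1)^2 (N_t - N_A)(N_A - 2)}{(N_t - 2)(N_t - 3) N_A (N_A - 1)}\, \langle z_{ij} \rangle_t,
\end{aligned}$$

where $\langle z_{ij} \rangle_t$ is defined analogously to $\langle y_{ij} \rangle_A$:

$$\langle z_{ij} \rangle_t = \frac{2}{N_t (N_t - 1)} \sum_{1 \le i < j \le N_t} \frac{N_t(\tilde{T}^{r, \delta r}_{i,j})}{N_t(T_i, r, r + \delta r)\, N_t(T_j, r, r + \delta r)},$$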

where $\tilde{T}^{r, \delta r}_{i,j} = \{x \in \mathbb{R}^2 : r \le |x - T_i| < r + \delta r \text{ and } r \le |x - T_j| < r + \delta r\}$.

5.2 Duranton-Overman indicators

This differential coefficient is very close to the indicator introduced by Duranton and Overman. Let us recall their definition: let $f$ be a non-negative kernel with total mass 1 and $h > 0$ the typical length scale of the discretisation; then $\hat{K}$ is defined as the kernel density estimate for $n$ points with inter-distances $d_{i,j}$:

$$\hat{K}_{f,h}(r) = \frac{2}{h\, n(n - 1)} \sum_{1 \le i < j \le n} f\left(\frac{r - d_{i,j}}{h}\right).$$

If we rewrite this with the particular kernel $f_0(x) = 1_{(-1, 0]}(x)$ and $h = \delta r$, this gives for $N_A$ points of type A

$$\hat{K}_{f_0, \delta r}(r) = \frac{1}{N_A (N_A - 1)} \sum_{i=1}^{N_A} \frac{N_A(A_i, r, r + \delta r)}{\delta r}.$$

The computation of the expectation of this coefficient is very similar to the computations detailed before:

Lemma 7 For all $r > 0$ and $\delta r > 0$, we have under the pure randomness hypothesis:

$$E\left[\hat{K}_{f_0, \delta r}(r)\right] = \frac{\langle \tilde{N}_t^i(r, \delta r) \rangle_t}{\delta r\, (N_t - 1)}.$$

The computation of the variance is a little more tricky, but we obtain after a few steps of calculation:

Proposition 7 For all $r > 0$ and $\delta r > 0$, we have under the pure randomness hypothesis:

$$\begin{aligned}
\sigma^2\left[\hat{K}_{f_0, \delta r}(r)\right] = \frac{1}{(\delta r)^2} \Bigg\{ \frac{1}{N_A (N_A - 1)} \Bigg[ & \frac{N_A - 2}{(N_t - 1)(N_t - 2)}\, \frac{2N_t - N_A - 3}{N_t - 3} \left\langle \tilde{N}_t^i(r, \delta r)^2 \right\rangle_t \\
& + \frac{2 (N_t - N_A)}{(N_t - 1)(N_t - 2)} \left\langle \tilde{N}_t^i(r, \delta r) \right\rangle_t \\
& + \frac{(N_A - 2)(N_A - 3)}{(N_t - 2)(N_t - 3)} \left\langle \tilde{N}_t^i(r, \delta r)\, \tilde{N}_t^j(r, \delta r) \right\rangle_t \\
& + \frac{2 (N_A - 2)(N_t - N_A)}{(N_t - 2)(N_t - 3)} \left\langle \tilde{N}_t^{(ij)} \right\rangle_t \Bigg] - \frac{1}{(N_t - 1)^2} \left\langle \tilde{N}_t(r, \delta r) \right\rangle_t^2 \Bigg\},
\end{aligned}$$

where $\langle \tilde{N}_t^{(ij)} \rangle_t$ is the following average over all couples of points of $T$:

$$\left\langle \tilde{N}_t^{(ij)} \right\rangle_t = \frac{2}{N_t (N_t - 1)} \sum_{1 \le i < j \le N_t} N_t(\tilde{T}^{r, \delta r}_{i,j}).$$

The computation of the variance for general kernels $f$ may also be performed along the same lines.
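The kernel estimate and Lemma 7 can be checked on small configurations by enumerating all possible choices of the A points. The Python sketch below does this by brute force; the function names are ours and the exhaustive enumeration is only practical for a handful of sites.

```python
import math
from itertools import combinations


def box_kernel(x):
    """Kernel f0: indicator function of the interval (-1, 0]."""
    return 1.0 if -1.0 < x <= 0.0 else 0.0


def k_hat(points, r, h, kernel=box_kernel):
    """Kernel estimate K_hat_{f,h}(r) for the given points."""
    n = len(points)
    total = sum(
        kernel((r - math.hypot(p[0] - q[0], p[1] - q[1])) / h)
        for p, q in combinations(points, 2)
    )
    return 2.0 * total / (h * n * (n - 1))


def exact_mean_k_hat(sites, n_a, r, dr):
    """Average of K_hat over all subsets of n_a sites (pure randomness hypothesis)."""
    values = [k_hat(list(subset), r, dr) for subset in combinations(sites, n_a)]
    return sum(values) / len(values)


def lemma7_mean(sites, r, dr):
    """Right-hand side of Lemma 7: <N_t^i(r, dr)>_t / (dr (N_t - 1))."""
    n_t = len(sites)
    shell_sizes = [
        sum(1 for q in sites if r <= math.hypot(p[0] - q[0], p[1] - q[1]) < r + dr)
        for p in sites
    ]
    return (sum(shell_sizes) / n_t) / (dr * (n_t - 1))
```

On any small set of sites and for $r > 0$, `exact_mean_k_hat(sites, n_a, r, dr)` and `lemma7_mean(sites, r, dr)` should agree up to rounding errors.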


5.3 Drawbacks and advantages of the different coefficients

The two coefficients proposed by Duranton and Overman [DO05] and Marcon and Puech [MP07] share a number of advantages detailed in the introduction: they use the inhomogeneous underlying space as a reference and exploit precise location data (x and y coordinates). Their main difference is the integrative or differential approach they use. Duranton and Overman focus on the distribution between two distances r and r + δr, while Marcon and Puech integrate the distribution from 0 to r. The differential coefficient makes it possible to zoom in on precise distances and to measure deviations from randomness in more detail. In contrast, our coefficient, inspired by Marcon and Puech's approach, is simpler to interpret because it converges to 1 as r approaches the system size, which makes deviations from randomness easy to quantify. In Duranton and Overman's approach, absolute values of the coefficients are meaningless: it is the deviations from the random values that show whether the spatial distribution is aggregated or dispersed.

6 Conclusions and perspective

This paper gives analytic formulas for the variance of some coefficients of purely random (that is, non-interacting) distributions of points. We rigorously show that these distributions asymptotically follow a Normal law. These results make it possible to dispense with Monte Carlo simulations, which can be cumbersome to implement and prohibitively time consuming for large samples. Our analytical expressions also help to understand qualitatively the main factors that may change the variance: number of sites, spatial inhomogeneities, etc. A natural extension is to get a better mathematical understanding of a given situation, whether clustered or excluding, through a precise description of a (random) point process having the same behavior. The first way to introduce dependence between sites is to consider Gibbs point processes (or Markov point processes), characterized by an interaction potential as well as a general potential linked with the landscape, taking into account for instance the population density or some other geographical or economic feature. The main topics in this framework will be the estimation of the actual characteristics (or parameters) of those potentials, and the comparison of those parameters for different economic or geographic situations. This work is in progress.

References

[BCL07] A. Briant, P.-P. Combes, and M. Lafourcade. Do the size and shape of spatial units jeopardize economic geography estimations? http://www.vcharite.univ-mrs.fr/PP/combes/maup.pdf, (2007).
[Bes77] J. E. Besag. Comments on Ripley's paper. Journal of the Royal Statistical Society B, 39:193–195, (1977).
[Col06] Collective. Competition complementarity in retailing. Revue Belge de géographie, (1-2), (2006).
[DO05] G. Duranton and H. G. Overman. Testing for localisation using micro-geographic data. The Review of Economic Studies, 72:1077, (2005).
[EB03] T. Egami and S. Billinge. Underneath the Bragg Peaks: structural analysis of complex materials. Material Series. Pergamon, (2003).
[EG97] G. Ellison and E. L. Glaeser. Geographic concentration in US manufacturing industries: A dartboard approach. Journal of Political Economy, 105(5):889–927, (1997).
[Häg02] O. Häggström. Finite Markov chains and algorithmic applications, vol. 52 of London Mathematical Society Student Texts. Cambridge University Press, Cambridge, (2002).
[HG84] E. M. Hoover and F. Giarratani. An introduction to regional economics. http://www.rri.wvu.edu/WebBook/Giarratani/contents.htm, (1984).
[Hoo37] E. M. Hoover. Location theory and the shoe and leather industries. Cambridge, MA: Harvard University Press, (1937).
[Jen06] Pablo Jensen. Network-based predictions of retail store commercial categories and optimal locations. Phys. Rev. E, 74, (2006).
[Kin93] J. F. C. Kingman. Poisson processes, volume 3 of Oxford Studies in Probability. The Clarendon Press Oxford University Press, New York, (1993).
[MNPT05] Y. Manolopoulos, A. Nanopoulos, A. N. Papadopoulos, and Y. Theodoridis. R-trees: Theory and Applications. Springer-Verlag, (2005).
[MP07] E. Marcon and F. Puech. Measures of the geographic concentration of industries: Improving distance-based methods. http://halshs.archives-ouvertes.fr/halshs-00372617/fr/, (2009).
[MS99] F. Maurel and B. Sedillot. A measure of the geographic concentration of French manufacturing industries. Regional Science and Urban Economics, 29(5):575–604, (1999).
[Ope84] S. Openshaw. The Modifiable Areal Unit Problem. Norwich: Geo Books, (1984).
[Par98] Katy Paroux. Quelques théorèmes centraux limites pour les processus Poissoniens de droites dans le plan. Adv. in Appl. Probab., 30(3):640–656, (1998).
[Rip76] B. D. Ripley. The second-order analysis of stationary point processes. J. Appl. Probability, 13(2):255–266, (1976).
[Unw96] D. J. Unwin. GIS, spatial analysis and spatial statistics. Progress in Human Geography, 20:540–551, (1996).
[WPF96] J. S. Ward, G. R. Parker, and F. J. Ferrandino. Long-term spatial dynamics in an old-growth deciduous forest. For. Ecol. Manage., 83:189–202, (1996).
[YK50] G. Udny Yule and M. G. Kendall. An Introduction to the Theory of Statistics. Hafner Publishing Co., New York, N. Y., (1950). 14th ed.

Appendix: proof of theorem 2

Proof We give the proof only in the square case, as the generalization to domains $D$ converging to $\mathbb{R}^2$ in a regular way (that is, with a bounded ratio between the diameter of $D$ and the radius of the greatest disc included in $D$) follows the same lines. Let us proceed as in the classical combinatorial proof of the central limit theorem for V-statistics. This proof starts with the computation of the moments:

$$\begin{aligned}
E\left[\left(a^P_{AA}(D_n) - 1\right)^k\right] &= \sum_{N \ge 0} E\left[\left(a^P_{AA}(D_n) - 1\right)^k \,\Big|\, N_n = N\right] P(N_n = N), \\
&= \sum_{N \ge 2} \frac{1}{(N(N - 1))^k}\, E\left[\left(\sum_{\substack{i,j=1 \\ i \neq j}}^{N} G^{(n)}_{i,j}\right)^k\right] \frac{(\lambda_A n^2)^N}{N!}\, \exp(-\lambda_A n^2).
\end{aligned}$$

The inner expectation can be expanded as

$$E\left[\sum_{i_1, j_1, \dots, i_k, j_k = 1}^{N}\ \prod_{s=1}^{k} G^{(n)}_{i_s, j_s}\right] = \sum_{i_1, j_1, \dots, i_k, j_k = 1}^{N} E\left[\prod_{s=1}^{k} G^{(n)}_{i_s, j_s}\right].$$

The expectation of such a product is equal to zero as soon as there exists an index $s_0$ such that $\{i_{s_0}, j_{s_0}\}$ does not intersect the set $\{i_s, j_s;\ s \neq s_0\}$. For any choice of the indices, the expectation of such a product is bounded by $(1 + 4n^2/(\pi r^2))^k \sim C n^{2k}$. Our purpose is now to count the terms in this sum that really matter. From the first remark above, we may introduce the following equivalence relation: let $\mathbf{(i,j)} = \{(i_1, j_1), \dots, (i_k, j_k)\}$, and let $(i, j)$ and $(k, l)$ be two couples of indices in this set; then $(i, j) \sim (k, l)$ if and only if there exist $(i_1, j_1), \dots, (i_t, j_t) \in \mathbf{(i,j)}$ such that

$$\{i, j\} \cap \{i_1, j_1\} \neq \emptyset,\ \dots,\ \{i_u, j_u\} \cap \{i_{u+1}, j_{u+1}\} \neq \emptyset,\ \dots,\ \text{and } \{i_t, j_t\} \cap \{k, l\} \neq \emptyset.$$

Then $\mathbf{(i,j)}$ is the disjoint union of the classes for this relation and, as soon as a class is a singleton, the expectation of the product, denoted by $P_{\mathbf{(i,j)}}$, is zero. Hence all the classes, denoted by $\mathbf{(i,j)}_1, \dots, \mathbf{(i,j)}_v$, must have a cardinality greater than or equal to 2 in order to have a non-zero term, and $P_{\mathbf{(i,j)}}$ is equal to the product of the expectations on each class:

$$P_{\mathbf{(i,j)}} = \prod_{w=1}^{v} E\left[\prod_{(i,j) \in \mathbf{(i,j)}_w} G^{(n)}_{i,j}\right] = \prod_{w=1}^{v} P_{\mathbf{(i,j)}_w}.$$

From now on the type $t_{\mathbf{(i,j)}}$ of $\mathbf{(i,j)}$ will be the (ordered) sequence of the cardinalities of its classes.

– Let us first assume that $k = 2p$. If $t_{\mathbf{(i,j)}} = (a_1, \dots, a_v)$, the number of degrees of freedom for the choice of such indices $\mathbf{(i,j)}$ is at most $N^{\sum_{w=1}^{v}(a_w + 1)}$, that is $N^{v + 2p}$. As each class has at least two elements one has $v \le p$, so that the number of degrees of freedom is at most $N^{3p}$. Let us assume that $\mathbf{(i,j)}_w$ is reordered as $\{((i_1, j_1)), \dots, ((i_1, j_1)), \dots\}$, where $((i, j))$ denotes either $(i, j)$ or $(j, i)$. We already know that if this class $\mathbf{(i,j)}_w$ has two elements ($a_w = 2$), the value of the corresponding expected product is either

  – $\sim n^2/(\pi r^2)$ if it is $\{((i, j)), ((i, j))\}$,
  – or $C n^{-2}$ if it is $\{((i, j)), ((j, k))\}$.

Hence for $t_{\mathbf{(i,j)}} = \{2, \dots, 2\}$, the value of $P_{\mathbf{(i,j)}}$ is either

  – $\sim (n^2/(\pi r^2))^p$ if each class is of the form $\{((i, j)), ((i, j))\}$,
  – $C n^{2(p - 2p')}$ if there are $(p - p')$ classes of the form $\{((i, j)), ((i, j))\}$ and $p'$ classes of the form $\{((i, j)), ((j, k))\}$, with $p' \in \{1, \dots, p\}$.

The number of terms of the first type is of order $N^{2p}$, whereas the number of terms of the second type is of order $N^{2p + p'}$, giving the following orders of magnitude for all those terms:

$$N^{2p} n^{2p}, \quad \text{and} \quad \forall p' \in \{1, \dots, p\},\ N^{2p + p'} n^{2(p - 2p')},$$

so that if $N$ behaves like $n^2$, as expected, the maximum order is achieved for the first type of terms. A precise enumeration of those terms of type $\{2, \dots, 2\}$ can be achieved, in a way quite similar to [Par98]. For those terms that weigh the most among the ones of type $\{2, \dots, 2\}$, we have to choose $2p$ distinct integers in $\{1, \dots, N\}$, hence $N!/(N - 2p)!$ possibilities; we also have to choose $p$ pairs of integers in $\{1, \dots, 2p\}$, yielding $(2p)!/2^p$ possibilities. Each of those pairs is provided with a couple of the afore-mentioned integers, and each pair is either of type $\{(i, j), (i, j)\}$, $\{(i, j), (j, i)\}$, $\{(j, i), (i, j)\}$, or $\{(j, i), (j, i)\}$, each with the same approximate value $n^2/(\pi r^2)$.


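This gives a total amount of $(N!\,(2p)!)/((N - 2p)!\, 2^p)$, which should be divided by $2^p p!$ to avoid repetitions of the same terms. Thus the sum of those terms becomes

$$\sum_{t_{\mathbf{(i,j)}} = \{2, \dots, 2\}} E\left[\prod_{s=1}^{k} G^{(n)}_{i_s, j_s}\right] = \frac{N!\,(2p)!}{(N - 2p)!\, 2^p\, 2^p\, p!} \left(\frac{4n^2}{\pi r^2}\right)^p + \text{smaller terms}.$$

Let us assume that we have proven that the other terms are really negligible; then we can state

$$\begin{aligned}
E\left[\left(a^P_{AA}(D_n) - 1\right)^{2p}\right] &= \sum_{N \ge 2} \frac{1}{(N(N - 1))^{2p}}\, E\left[\left(\sum_{\substack{i,j=1 \\ i \neq j}}^{N} G^{(n)}_{i,j}\right)^{2p}\right] \frac{(\lambda_A n^2)^N}{N!}\, \exp(-\lambda_A n^2), \\
&\simeq \sum_{N \ge 2} \frac{1}{(N(N - 1))^{2p}}\, \frac{N!\,(2p)!}{(N - 2p)!\, 2^p\, 2^p\, p!} \left(\frac{4n^2}{\pi r^2}\right)^p \frac{(\lambda_A n^2)^N}{N!}\, \exp(-\lambda_A n^2), \\
&\simeq \frac{(2p)!}{2^p\, p!} \left(\frac{2}{\pi r^2}\right)^p n^{2p}\, E\left[\frac{1}{N_n^{2p}}\right],
\end{aligned}$$

where we recall that $N_n$ is Poisson distributed with mean $\lambda_A n^2$. Since $N_n/(\lambda_A n^2)$ tends to 1, this yields the desired asymptotics

$$\lim_{n \to +\infty} N_n^p\, E\left[\left(a^P_{AA}(D_n) - 1\right)^{2p}\right] = \frac{(2p)!}{2^p\, p!} \left(\frac{2}{\lambda_A \pi r^2}\right)^p.$$

Let us show that the other terms are indeed negligible. The computation for a general term is rather tedious, so we shall only give a sketch of the proof for classes $\mathbf{(i,j)}_w$ of cardinality 3. Firstly, we neglect the boundary effects by replacing $p_{r/n}(Y)$ by the constant $q_{r/n} = \pi r^2/n^2$ almost surely. As

$$\begin{aligned}
& (1_{|Y_i - Y_j| \le r/n} - q_{r/n})(1_{|Y_k - Y_l| \le r/n} - q_{r/n})(1_{|Y_s - Y_t| \le r/n} - q_{r/n}) \\
&\quad = 1_{|Y_i - Y_j| \le r/n}\, 1_{|Y_k - Y_l| \le r/n}\, 1_{|Y_s - Y_t| \le r/n} \\
&\qquad - q_{r/n} \left(1_{|Y_i - Y_j| \le r/n}\, 1_{|Y_k - Y_l| \le r/n} + 1_{|Y_k - Y_l| \le r/n}\, 1_{|Y_s - Y_t| \le r/n} + 1_{|Y_i - Y_j| \le r/n}\, 1_{|Y_s - Y_t| \le r/n}\right) \\
&\qquad + q_{r/n}^2 \left(1_{|Y_i - Y_j| \le r/n} + 1_{|Y_k - Y_l| \le r/n} + 1_{|Y_s - Y_t| \le r/n}\right) - q_{r/n}^3,
\end{aligned}$$

we may sum up the number of such terms and their values in the following table: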

$$\begin{array}{l|l|l}
\text{Type of term} & \text{Number of terms} & \text{Value} \\ \hline
((1,2))((2,3))((3,4)) & \sim N^4 & 0 \\
((1,2))((2,3))((3,1)) & \sim N^3 & 0 \\
((1,2))((2,3))((3,2)) & \sim N^3 & 0 \\
((1,2))((1,2))((2,3)) & \sim N^3 & 0 \\
((1,2))((1,2))((1,2)) & \sim N^2 & \sim q_{r/n}^{-2}
\end{array}$$

Table 2 Enumeration of the different terms of cardinality 3 and their respective values.


Hence those terms contribute at most $N^2 n^4$. Terms of type, for instance, $\{3, 3, 2, \dots, 2\}$ contribute an amount of at most $(N^2 n^4)^2 N^{2q} n^{2q}$, where $2p = 6 + 2q$, yielding an order of magnitude $N^{2p - 2} n^{2p + 2}$ versus $N^{2p} n^{2p}$ for the terms $\{2, \dots, 2\}$. Recalling that the right order for $N$ is $n^2$ shows the negligibility of those terms.

– If $k = 2p + 1$, one may generalize the asymptotics above to conclude that the moment rescaled by the factor $N_n^{p + 1/2}$ tends to 0 in a direct, though technical, way.

This concludes the proof, as the moments converge to the moments of the Normal law.
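The combinatorial factor $(2p)!/(2^p p!)$ obtained above is the number of ways of splitting $2p$ objects into $p$ unordered pairs, which is also the moment of order $2p$ of a standard Normal variable. The short Python sketch below checks this count by explicit enumeration and evaluates the limiting even moments; the function names are hypothetical.

```python
import math


def count_pairings(elements):
    """Number of ways to split the given elements into unordered pairs."""
    if not elements:
        return 1
    rest = elements[1:]
    # Pair the first element with each remaining one, then pair up what is left.
    return sum(count_pairings(rest[:k] + rest[k + 1:]) for k in range(len(rest)))


def pairing_formula(p):
    """Closed form (2p)! / (2^p p!)."""
    return math.factorial(2 * p) // (2 ** p * math.factorial(p))


def limiting_even_moment(p, lam_a, r):
    """Moment of order 2p of the Normal law N(0, 2 / (lam_a pi r^2))."""
    sigma2 = 2.0 / (lam_a * math.pi * r * r)
    return pairing_formula(p) * sigma2 ** p
```

For $p = 1$ the last function returns the limiting variance $2/(\lambda_A \pi r^2)$ of Theorem 2.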