<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">OJS</journal-id><journal-title-group><journal-title>Open Journal of Statistics</journal-title></journal-title-group><issn pub-type="epub">2161-718X</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/ojs.2013.31005</article-id><article-id pub-id-type="publisher-id">OJS-27917</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  Why Well Spread Probability Samples Are Balanced
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Grafström</surname><given-names>Anton</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Lundström</surname><given-names>Niklas L. P.</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>Department of Forest Resource Management, Swedish University of Agricultural Sciences, Umeå, Sweden </addr-line></aff><aff id="aff2"><addr-line>Department of Mathematics and Mathematical Statistics, Umeå University, Umeå, Sweden </addr-line></aff><author-notes><corresp id="cor1">* E-mail: <email>anton.grafstrom@slu.se</email> (AG); <email>niklas.lundstrom@math.umu.se</email> (NLPL)</corresp></author-notes><pub-date pub-type="epub"><day>20</day><month>02</month><year>2013</year></pub-date><volume>03</volume><issue>01</issue><fpage>36</fpage><lpage>41</lpage><history><date date-type="received"><day>22</day><month>11</month><year>2012</year></date><date date-type="rev-recd"><day>24</day><month>12</month><year>2012</year></date><date date-type="accepted"><day>08</day><month>01</month><year>2013</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
   When sampling from a finite population there is often auxiliary information available at the unit level. Such information can be used to improve the estimation of the target parameter. We show that probability samples that are well spread in the auxiliary space are balanced, or approximately balanced, on the auxiliary variables. A consequence of this balancing effect is that the Horvitz-Thompson estimator will be a very good estimator for any target variable that can be well approximated by a Lipschitz continuous function of the auxiliary variables. Hence we give a theoretical motivation for the use of well spread probability samples. Our conclusions imply that using well spread samples, combined with the Horvitz-Thompson estimator, is a good strategy in a variety of situations. 
 
</p></abstract><kwd-group><kwd>Balanced Sample; Local Pivotal Method; Spatial Balance; Spatially Correlated Poisson Sampling; Voronoi Polytopes</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>In many fields there has been great interest in selecting samples that are well spread or spatially balanced. Such samples are considered to produce good estimates for target variables that exhibit spatial trends, see e.g. [1,2]. The focus in this paper is to explain the connection between a well spread sample and a balanced sample. Roughly speaking, a sample is well spread if the number of selected units is close to what is expected on average, in every part of the auxiliary space. A sample is balanced on a variable if the Horvitz-Thompson (HT) estimator of the total of that variable agrees exactly with the known population total of the variable. In fact, with a short analysis, this paper clarifies why a well spread sample is approximately balanced. We also explain that, if the sample is well spread, the variance of commonly used estimators is usually low.</p><p>It is well known that samples that are balanced or approximately balanced on the auxiliary variables may be selected by using the cube method, see e.g. [<xref ref-type="bibr" rid="scirp.27917-ref3">3</xref>]. Sampling methods for selection of well spread samples in a general auxiliary space, by utilizing a distance function, are more recent and less well known than the cube method. Two such methods are the local pivotal method (LPM) and spatially correlated Poisson sampling (SCPS). The LPM design, based on the pivotal method [<xref ref-type="bibr" rid="scirp.27917-ref4">4</xref>], was first introduced in [<xref ref-type="bibr" rid="scirp.27917-ref5">5</xref>]. 
The other method, SCPS, was first introduced in [<xref ref-type="bibr" rid="scirp.27917-ref6">6</xref>] and it is a special case of the method described in [<xref ref-type="bibr" rid="scirp.27917-ref7">7</xref>].</p><p>In many areas, such as forest inventories, environmental studies, and even in official statistics, different forms of stratification are commonly used to obtain samples that are well spread geographically or with respect to other available information. Often, stratification is used as a variance reduction technique without particular interest in the different strata. Constructing a stratified sampling design is often not straightforward, especially if several mixed auxiliary variables are available. It is not uncommon that statisticians try to stratify using several variables, but crossing all strata of all variables usually results in cells that are too small. In such situations it may be preferable and less complicated to define a distance measure in the auxiliary space, and then use a sampling method that in general avoids selecting nearby units, thus forcing the sample to be well spread.</p><p>In Section 2, a theoretical motivation for the balancing effect of well spread samples is given. In Section 3, we give arguments indicating that using well spread samples provides a small anticipated variance for the HT-estimator under a very general super-population model. Some sampling methods for selecting well spread samples are briefly discussed in Section 4. Final comments are provided in Section 5.</p></sec><sec id="s2"><title>2. Main Results</title><p>We start by introducing some notation and assumptions. Let <img src="5-1240169\efb802e8-a5be-43fd-89e3-bfc8b697a488.jpg" /> be a population of N units. We wish to select a probability sample s of size n in order to estimate some characteristics of U. It is assumed that we have access to auxiliary information at the unit level, i.e. 
the values of q auxiliary variables <img src="5-1240169\fdeadeb0-8c94-4eea-b5fc-a928f327fce3.jpg" /> are known for each unit<img src="5-1240169\a9258deb-9194-4637-b75b-e543f33a6971.jpg" />. We also assume that it is possible to calculate the distance <img src="5-1240169\a04b7fc3-c322-44bb-a130-39e328b49814.jpg" /> between two units i and j in the auxiliary space. Usually the total</p><p><img src="5-1240169\b8cef0f2-1077-47a4-88e4-86eeb7966c70.jpg" />of one or more target variables are the parameters we wish to estimate. It is assumed that each population unit i is included in the sample with a known probability<img src="5-1240169\80141cc0-58bb-48c9-856f-240c9c8c27ac.jpg" />, <img src="5-1240169\1ae5f04e-74a2-45d4-81db-c07f53243172.jpg" />, with<img src="5-1240169\6563c527-be3a-4fad-bb31-aaa13d954f26.jpg" />, where n is the sample size. In this case the unbiased and commonly used HT-estimator [<xref ref-type="bibr" rid="scirp.27917-ref8">8</xref>] of Y is</p><p><img src="5-1240169\96c982bd-d371-4708-8da7-80a725bc3810.jpg" /></p><p>We are now ready to formalize what a well spread sample is. As suggested in [<xref ref-type="bibr" rid="scirp.27917-ref2">2</xref>], we use Voronoi polytopes to measure how well spread a sample is. The Voronoi polytope<img src="5-1240169\f7b55759-9440-4882-8729-585f72eaa0c2.jpg" />, for<img src="5-1240169\3379efe5-d7be-4647-90d8-c3a230cce86f.jpg" />, includes all population units j satisfying <img src="5-1240169\e26174dc-1a2b-443e-be0a-a23d7af17cf3.jpg" /> for all sample units<img src="5-1240169\bc5ae474-392f-40a9-8c36-17f9d15c60e3.jpg" />. Let n<sub>i</sub> denote the number of units in p<sub>i</sub>, with the correction that if a unit j is included in m<sub>j</sub> polytopes, then j is counted as<img src="5-1240169\1bc55d0e-90c9-47bc-807e-7392b76268c9.jpg" />. Next, let v<sub>i</sub> be the sum of the inclusion probabilities in p<sub>i</sub>. 
Again, if a unit j is included in m<sub>j</sub> polytopes, then its inclusion probability is divided equally <img src="5-1240169\fc0951c5-35d7-450a-a6d2-8d7437c1732f.jpg" /> to each of the m<sub>j</sub> polytopes. Hence,</p><p><img src="5-1240169\c4e29f4b-8aff-41b2-8dea-755e34c07048.jpg" /></p><p>Note that <img src="5-1240169\cc55e478-2047-4aa1-a714-dc27aa004c28.jpg" /> and<img src="5-1240169\3b2bf4a6-aafb-4ec8-8ede-89e88804c863.jpg" />. We are now ready to give the definition of a well spread sample.</p><p>Definition 1 A sample is said to be well spread (or spatially balanced) with respect to the inclusion probabilities if each v<sub>i</sub> is equal or close to 1.</p><p>As a measure of how well spread a sample is, we may use</p><disp-formula id="scirp.27917-formula105172"><label>(1)</label><graphic position="anchor" xlink:href="5-1240169\3c2b348e-9814-4939-9f20-aa9c8258e442.jpg"  xlink:type="simple"/></disp-formula><p>see e.g. [<xref ref-type="bibr" rid="scirp.27917-ref2">2</xref>]. A small value of B indicates a very well spread sample. The mean of B over repeated samples is an indicator of how well spread samples a design produces. We next define a balanced and an approximately balanced sample.</p><p>Definition 2 We say that a sample s is balanced on the auxiliary x-variables if</p><p><img src="5-1240169\aee5fed9-35d3-418f-b466-18ae016dd09f.jpg" /></p><p>Moreover, a sample is said to be approximately balanced if <img src="5-1240169\f2cf5630-63f4-48ba-9cf4-e8b3dcd99be5.jpg" /> is close to<img src="5-1240169\cc80c08c-adfc-4692-bdfc-93e3b0d5ae80.jpg" />.</p><p>In order to show that well spread samples are balanced we start by making three quite strong assumptions on the sample and the population. Later we will relax these assumptions a bit and then show that a larger class of well spread samples are approximately balanced. 
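As a minimal sketch of how the spread measure B in (1) could be computed, the following Python function assigns every population unit to its nearest sample unit, splits ties equally between polytopes, and returns the sums v<sub>i</sub> together with B. It assumes Euclidean distance and assumes that (1) is the mean of (v<sub>i</sub> - 1)<sup>2</sup> over the sample units; the function and argument names are illustrative and are not taken from the cited papers.<preformat>
```python
import math


def voronoi_spread(points, probs, sample, tol=1e-12):
    """Return (v, B) for a sample, following the Voronoi construction in the text.

    points: list of coordinate tuples, one per population unit
    probs:  inclusion probabilities, one per population unit
    sample: list of indices of the sampled units
    tol:    tolerance used to detect ties in distance (an assumption)
    """
    v = {i: 0.0 for i in sample}
    for j, x in enumerate(points):
        dists = {i: math.dist(x, points[i]) for i in sample}
        d_min = min(dists.values())
        # Unit j may lie in several polytopes; its probability is split equally.
        nearest = [i for i in sample if d_min + tol >= dists[i]]
        for i in nearest:
            v[i] += probs[j] / len(nearest)
    n = len(sample)
    # Assumed form of (1): mean squared deviation of v_i from 1.
    B = sum((v_i - 1.0) ** 2 for v_i in v.values()) / n
    return v, B
```
</preformat> Averaging the returned B over repeated samples from a design gives the design-level indicator mentioned above. 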
We first assume the following.&#160;</p><p>(A.0) In each polytope the inclusion probabilities sum to 1, i.e.</p><p><img src="5-1240169\7a2c60d8-4503-4553-a64f-f7086baee4a1.jpg" /></p><p>(A.1) In each polytope the inclusion probabilities are equal, i.e. for every<img src="5-1240169\d766a4b3-16c4-4343-9e59-3745efc65fe2.jpg" />, we assume</p><p><img src="5-1240169\644377ff-9d84-424d-bbe1-33429f888dfd.jpg" /></p><p>(A.2) In each polytope the auxiliary variables are equal for all units, i.e. for every<img src="5-1240169\70c79fdc-0500-476f-be96-697e0fcdc859.jpg" />,</p><p><img src="5-1240169\5124daa9-8d3f-44fe-990f-fd3d929b0349.jpg" /></p><p>The assumptions (A.0) and (A.1) tell us that the size n<sub>i</sub> of the polytope <img src="5-1240169\c717cba0-f715-404f-9640-dc92de76ff25.jpg" /> is equal to <img src="5-1240169\9916848b-847d-4302-9f52-7cef16e1447e.jpg" /> and (A.2) tells us that<img src="5-1240169\97dce651-1f09-4acb-a830-dae38c6d9459.jpg" />, for<img src="5-1240169\f419e0b8-1876-4aab-a00d-e301169cb245.jpg" />. Note that<img src="5-1240169\578e885e-783d-4e30-b997-21fdeddbd900.jpg" />, <img src="5-1240169\73063261-ac3e-4e6c-9e46-e06e5ddea852.jpg" />and hence <img src="5-1240169\61c38ba5-3bb0-4692-9127-4d78705c3475.jpg" /> are allowed to vary between polytopes. 
Under the three assumptions it follows that</p><disp-formula id="scirp.27917-formula105173"><label>(2)</label><graphic position="anchor" xlink:href="5-1240169\8cba8b17-1d1d-4710-8d65-0709ba5890be.jpg"  xlink:type="simple"/></disp-formula><p>Thus the sample is balanced on any function <img src="5-1240169\e7d0aa2c-b252-436f-a654-5a5a25fff409.jpg" /> and in particular, it is balanced on the auxiliary variables if we put <img src="5-1240169\e41ab0f7-85da-45e4-b00b-da7693b9b2a7.jpg" />.</p><p>The next step consists of introducing the following three new and less restrictive assumptions.&#160;</p><p><img src="5-1240169\4af5d23b-c352-4996-a5dc-b8a25df56532.jpg" /> For each polytope p<sub>i</sub>, the inclusion probabilities satisfy</p><p><img src="5-1240169\cdadc662-6784-49b3-8d4d-6d876bf315d3.jpg" /></p><p><img src="5-1240169\4a94634d-c4d6-4410-a9b1-3a88f81e08f5.jpg" /> In each polytope <img src="5-1240169\94c99890-c1c4-4a6a-bad3-191bc2ee4c96.jpg" />, we have</p><p><img src="5-1240169\2ab95658-e783-4ab7-804b-0f42394131b5.jpg" /></p><p><img src="5-1240169\8a87f987-3735-49c8-9fd3-782799bc782f.jpg" /> The target is a Lipschitz continuous function of the auxiliary variables, i.e.</p><p><img src="5-1240169\c0a32816-9505-4102-a37f-b27c7a2dd374.jpg" /></p><p>Remark 1 Concerning the validity of assumption <img src="5-1240169\43924bb1-4bf2-4e9c-9d6a-c0188440f96f.jpg" />, the inclusion probabilities are (if unequal) supposed to be derived from the auxiliary x-variables, for instance chosen proportional to one of the x-variables, so they should not vary much within a polytope. 
Remember that the polytopes are constructed by grouping together units with similar x-values.</p><p>We are now ready to state and prove our main result.</p><p>Theorem 1 Let <img src="5-1240169\de9be5fb-376d-419b-a243-4c1efe182b31.jpg" /> be a well spread sample satisfying <img src="5-1240169\41faff87-3723-42c5-b5b8-44448770d3a5.jpg" /> for all <img src="5-1240169\54a54755-ef63-42e5-954a-c9134a8721a0.jpg" /> and for some <img src="5-1240169\c32737e2-2be3-4fe0-9cd0-8c3f7dad0194.jpg" />. Assume also that <img src="5-1240169\bae73ed8-d82c-4cb7-8cae-52c4b5deb46f.jpg" /> is from a population satisfying assumptions <img src="5-1240169\848e3d8c-a821-4da0-ba23-c412a6fb07e5.jpg" />. Then <img src="5-1240169\3c143ef3-a37c-4fb0-bdc7-0ffa0efc1b5a.jpg" /> is approximately balanced. In particular,</p><p><img src="5-1240169\08b4708f-ff50-41a9-8b0b-0d8ea8f64f52.jpg" /></p><p>By sending <img src="5-1240169\3e5a2a96-f63d-49a9-bfc0-3f663b898e2c.jpg" /> we obtain exact balance on the target since</p><p><img src="5-1240169\50ed1086-4716-400f-a6ac-fb9523127b36.jpg" /></p><p>Note also that if we put <img src="5-1240169\b3b38669-ee2b-44e4-8bd4-d8912de9859e.jpg" /> then we get a balanced sample, see Definition 2. Besides telling us that, when <img src="5-1240169\421c78ee-2dd5-46c7-aef5-d955afe26cd1.jpg" /> are small, the sample will be approximately balanced on <img src="5-1240169\caa51399-db61-4e07-a2a2-c748f110c062.jpg" /> and <img src="5-1240169\d1c63cb3-bda0-428c-a33c-b84565ff8c2e.jpg" />, Theorem 1 also gives bounds for the target parameter <img src="5-1240169\2b664789-425e-4460-bb7c-483f2da8a815.jpg" />. We can however do better than the bounds in Theorem 1. For instance, we have</p><p><img src="5-1240169\8da7ce94-1e0a-40d7-8423-e27b0faabc07.jpg" /></p><p>but these bounds are constructed by applying a worst-case scenario within each polytope, so we cannot expect the bounds to be very good.</p><p>Proof of Theorem 1. 
By assumption <img src="5-1240169\0da65fc7-41bb-495c-95f5-fff5278ed895.jpg" /> and since <img src="5-1240169\cb214cb3-64d5-43d0-9822-08e9b62519f1.jpg" /> we have, for all<img src="5-1240169\56c38f95-4737-4504-a336-2167ab28a08b.jpg" />,</p><disp-formula id="scirp.27917-formula105174"><label>(3)</label><graphic position="anchor" xlink:href="5-1240169\41711bb1-8d1e-4dc1-8c9f-90d730b48f03.jpg"  xlink:type="simple"/></disp-formula><p>The inequalities in (3) give</p><disp-formula id="scirp.27917-formula105175"><label>(4)</label><graphic position="anchor" xlink:href="5-1240169\bd508486-86db-4a6d-8797-b4fae7892368.jpg"  xlink:type="simple"/></disp-formula><p>Moreover, from assumptions <img src="5-1240169\278ac960-ca77-4ab6-9dac-912c312d8732.jpg" /> and <img src="5-1240169\51a04439-d305-402f-b0d6-104b24827017.jpg" /> we see that</p><disp-formula id="scirp.27917-formula105176"><label>(5)</label><graphic position="anchor" xlink:href="5-1240169\54b6bae1-9e1b-492d-902d-4e601aaf4212.jpg"  xlink:type="simple"/></disp-formula><p>Theorem 1 now follows from (4) and (5). In particular,</p><p><img src="5-1240169\f7283157-b9eb-4e21-89ec-9b4833b3fd27.jpg" /></p><p>and</p><p><img src="5-1240169\57e491ab-263a-4c9d-bd54-146be19194eb.jpg" /></p><p>The proof is complete.</p></sec><sec id="s3"><title>3. Variance under a General Model</title><p>It is interesting to see how well spread samples perform under a general super-population model. 
Following [<xref ref-type="bibr" rid="scirp.27917-ref9">9</xref>], but here with a possibly non-linear model, we assume</p><disp-formula id="scirp.27917-formula105177"><label>(6)</label><graphic position="anchor" xlink:href="5-1240169\32fe70e7-9cc0-448a-8dc0-1d058287dbb3.jpg"  xlink:type="simple"/></disp-formula><p>where <img src="5-1240169\c310a9a3-83f6-49d7-be64-a5dbf1bd49eb.jpg" /> is a Lipschitz continuous function, <img src="5-1240169\cbdbe79a-cb5e-47c3-9d59-c926404cb944.jpg" />, <img src="5-1240169\786c0012-5cb0-4b44-a9f9-0751f524886c.jpg" />, and <img src="5-1240169\e949ef42-382b-43b8-a4d4-24babe08e649.jpg" /> is a Lipschitz continuous function. Moreover,</p><p><img src="5-1240169\6b189303-b780-4767-8b8a-d2a9dd244b1e.jpg" /></p><p>where <img src="5-1240169\769752b4-aa89-4c62-8dbd-991fa29c034f.jpg" />, <img src="5-1240169\fc7ffc8d-80f6-4470-9fcc-2e3cf31cfa2a.jpg" /> and <img src="5-1240169\c0ba1586-67ca-42fd-8c8b-8189aba49847.jpg" /> are the expectation, variance and covariance under the model. The correlations <img src="5-1240169\f77cf182-d90b-4d57-9533-08b0866fa824.jpg" /> are supposed to be a decreasing function of the distance between the units i and j.</p><p>With some routine calculations, the anticipated variance of the HT-estimator under model (6) can be shown to be</p><disp-formula id="scirp.27917-formula105178"><label>(7)</label><graphic position="anchor" xlink:href="5-1240169\8f6d2843-ee1b-4631-aa6d-099d869cdeb0.jpg"  xlink:type="simple"/></disp-formula><p>where <img src="5-1240169\82319b87-2a44-4008-90ff-fa25dff1306d.jpg" /> is the expectation under the design. 
Now, if we study expression (7), it becomes evident that we want the samples to be as balanced as possible on <img src="5-1240169\e4f4633c-25ba-43c4-9e5f-f2a0dcbc98ba.jpg" /> to minimize the term</p><p><img src="5-1240169\cf6527cf-30b4-4a81-a766-d8e6c2fba835.jpg" /></p><p>We also want to make sure that <img src="5-1240169\7d43d3c7-100a-4196-bc9b-40e6019871b8.jpg" /> is small whenever <img src="5-1240169\a798bc76-afa7-4d9d-8b3b-7fec52b16417.jpg" /> is large in order to minimize the term</p><disp-formula id="scirp.27917-formula105179"><label>(8)</label><graphic position="anchor" xlink:href="5-1240169\0e2087b0-4a26-4a28-b81b-a64a44c183a4.jpg"  xlink:type="simple"/></disp-formula><p>If the samples are selected to be well spread (i.e. small joint inclusion probabilities for nearby units), then both terms in (7) become small. However, if the model standard deviation <img src="5-1240169\48cd015d-ff32-4285-b889-6320050f9797.jpg" /> is known, it is also possible to choose the inclusion probabilities to reduce the variance further. The diagonal term of (8) is dominant, i.e.</p><disp-formula id="scirp.27917-formula105180"><label>(9)</label><graphic position="anchor" xlink:href="5-1240169\ec85e6b6-1ac6-4e9a-8015-8f39d9ab1fd4.jpg"  xlink:type="simple"/></disp-formula><p>With the constraint of fixed sample size <img src="5-1240169\2e2afff5-847c-4109-9807-06a8a2d87f5d.jpg" /></p><p>and by using a Lagrangian function, it follows that the minimum in <img src="5-1240169\cc03b2e9-8fe4-489d-8a7d-500ccd3cd41b.jpg" /> of the right hand side of (9) is</p><disp-formula id="scirp.27917-formula105181"><label>(10)</label><graphic position="anchor" xlink:href="5-1240169\5179997b-59c4-42f8-aee1-99819000ec01.jpg"  xlink:type="simple"/></disp-formula><p>if each <img src="5-1240169\92324911-354e-49d3-af70-8bbbe4278e8d.jpg" />. As a result, a very efficient sampling design under this general model is to select samples that are well spread in the x-space with inclusion probabilities given by (10). 
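As a small numerical illustration of this choice of inclusion probabilities, the Python sketch below assumes that the minimizing probabilities referred to in (10) are proportional to the model standard deviations, π<sub>i</sub> = nσ<sub>i</sub>/Σσ<sub>j</sub>, which is what a Lagrangian argument gives when the diagonal term is proportional to Σσ<sub>i</sub><sup>2</sup>(1/π<sub>i</sub> - 1). Both the assumed form of the diagonal term and the function names are illustrative assumptions, since the formulas are only available as images here.<preformat>
```python
def proportional_probabilities(sigma, n):
    """Inclusion probabilities proportional to sigma_i that sum to n."""
    total = sum(sigma)
    pi = [n * s / total for s in sigma]
    if max(pi) > 1.0:
        # The closed form is only valid when no probability exceeds 1.
        raise ValueError("some probability exceeds 1")
    return pi


def diagonal_term(sigma, pi):
    """Assumed diagonal part: sum of sigma_i^2 * (1 / pi_i - 1)."""
    return sum(s * s * (1.0 / p - 1.0) for s, p in zip(sigma, pi))
```
</preformat> Under these assumptions, comparing diagonal_term for the proportional probabilities with its value for equal probabilities n/N shows how much the diagonal part can be reduced when the σ<sub>i</sub> differ between units. 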
The requirements needed in order for the samples to be approximately balanced on <img src="5-1240169\220bdcc8-357e-4801-af4c-8541937dba08.jpg" /> are then fulfilled. The inclusion probabilities will not vary much within the polytopes since <img src="5-1240169\ff9ce4b5-7c96-4c4e-a2ac-19ffb2fdf503.jpg" /> is supposed to be a Lipschitz continuous function of x. Hence, with this strategy, the anticipated variance of the HT-estimator becomes small.</p><p>It is not possible to balance the sample directly on <img src="5-1240169\a8a67ce6-ca29-411d-b651-544e5995a849.jpg" /> since the function is obviously not known in advance. Probably, the best we can do in practice is to make sure the samples are well spread in x to have a balancing effect on the unknown function <img src="5-1240169\e2b58c9a-0dc0-4287-ba34-7b588e80dd74.jpg" />, and hence also have small <img src="5-1240169\ff2b46aa-61fb-41ed-8e02-89186db8e584.jpg" /> when <img src="5-1240169\4f796019-db16-4d60-b26b-fc7d4b4f0194.jpg" /> may be large.</p><p>If we use well spread probability samples together with the HT-estimator, the estimator will be very efficient (i.e. have a small variance) if the population is close to a realization of the model (6). Note also that the approach is purely design based, and the estimator maintains design unbiasedness and design consistency even if the model is false.</p><p>Example 2, given in the next section, supports the above statements. In particular, the example compares different sampling methods with respect to variance and spatial balance. It is clear that methods producing well spread samples yield better balance and hence smaller variance.</p></sec><sec id="s4"><title>4. Some Methods for Selecting Well Spread Samples</title><p>Besides spatial stratification, one of the first more novel designs for selecting a well spread sample is called generalized random tessellation stratified (GRTS), and was introduced in [<xref ref-type="bibr" rid="scirp.27917-ref2">2</xref>]. 
The GRTS design uses a specific random mapping to map two (or more) dimensions to one dimension. Basically the units are re-ordered to a list and units close in the list tend to also be close in the auxiliary space. Then a systematic <img src="5-1240169\3e8fdef4-92e3-46c2-a086-478b7b1e36ab.jpg" />ps sample is selected from the list, making sure the sample becomes well spread in the list and hence also in the auxiliary space. A drawback of GRTS is that a lot of information is lost in the mapping, especially if the space has many dimensions (i.e. many auxiliary variables). However, for two dimensions, the GRTS produces rather well spread samples.</p><p>Another idea is to map dimensions to one by use of space-filling curves, and one such design was presented and evaluated in [<xref ref-type="bibr" rid="scirp.27917-ref10">10</xref>]. However, we believe that mapping several dimensions to one is not the best way to achieve a well spread sample. Too much information is lost in such a mapping.</p><p>A more recent idea to achieve well spread samples is to first define a distance measure in the auxiliary space. To do so, let <img src="5-1240169\e3d976e8-adc3-4458-b6d8-50e20bbf0c36.jpg" /> be all available auxiliary variables, where <img src="5-1240169\2887b8b2-de11-47f3-adbe-809353f11761.jpg" /> correspond to the quantitative variables and <img src="5-1240169\aec8b8ee-fddc-4376-aba8-3fa07e36029f.jpg" /> to the qualitative variables. To measure the distance between unit i and j in this q-dimensional space, [<xref ref-type="bibr" rid="scirp.27917-ref11">11</xref>] propose the following definition of distance</p><p><img src="5-1240169\7f12a85d-f643-4546-ade3-1cfbfbb2d6d8.jpg" /></p><p>where <img src="5-1240169\7eb535f1-b0ff-4fa9-ad04-2d4ec7c68cbb.jpg" /> is the standardized version of<img src="5-1240169\34356225-e9dc-4f9b-8ad7-f529b59b6a69.jpg" />. By standardizing, the auxiliary variables are approximately of equal importance. 
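One plausible way to code such a mixed distance is sketched below in Python: squared differences of the standardized quantitative variables are added to a count of mismatches among the qualitative variables, and the square root is taken. This particular combination, the use of the population standard deviation, and the function names are assumptions made for illustration; the exact definition should be checked against [<xref ref-type="bibr" rid="scirp.27917-ref11">11</xref>].<preformat>
```python
import math


def standardize(values):
    """Shift values to mean 0 and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # Note: a constant variable has sd 0 and cannot be standardized this way.
    return [(v - mean) / sd for v in values]


def mixed_distance(quant_i, quant_j, qual_i, qual_j):
    """Distance between two units from already standardized quantitative
    values and raw qualitative values (assumed form, see the lead-in)."""
    squared = sum((a - b) ** 2 for a, b in zip(quant_i, quant_j))
    mismatches = sum(1 for a, b in zip(qual_i, qual_j) if a != b)
    return math.sqrt(squared + mismatches)
```
</preformat> Each quantitative variable would be passed through standardize once for the whole population before any pairwise distances are computed. 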
However, the above distance function is just an example and in a particular situation some other distance function may be more appropriate. Given the distance measure, the design should create a negative correlation of the inclusion indicators for close units, so that two close units seldom appear in the sample together. Such a design is not necessarily complicated. For instance, the local pivotal method (LPM) introduced in [<xref ref-type="bibr" rid="scirp.27917-ref5">5</xref>] is quite simple. The LPM is based on the pivotal method [<xref ref-type="bibr" rid="scirp.27917-ref4">4</xref>]. The main idea in LPM is to make similar units (i.e. nearby units) compete with each other for inclusion in the sample. The LPM successively updates the prescribed vector of inclusion probabilities <img src="5-1240169\976c19a9-a625-4a3a-bcef-d11d3c842d4a.jpg" /> to become a vector with zeros and ones, where the ones indicate inclusion in the sample. In one step of LPM, two close units i and j with <img src="5-1240169\ca909886-ce52-4d4b-90eb-26bdad0c7356.jpg" /> and <img src="5-1240169\5860b0d6-d6e1-4f92-a266-4c9ae31805e8.jpg" /> are chosen to compete. The winner takes as much probability mass as possible from the other unit. Hence, the winner receives the new probability <img src="5-1240169\e666ce40-62d2-4c34-b237-26d4b8463082.jpg" /> and the loser gets the new probability <img src="5-1240169\921f102f-ba01-4c03-a0e1-d109eef42377.jpg" />. Thus, if <img src="5-1240169\ccac8c28-a9e0-407d-9ac9-240e3d3fb78a.jpg" />, then a = 1 and the winning unit will definitely be in the sample. If <img src="5-1240169\facfae12-a897-412b-8e18-ac73c1e69161.jpg" />, then the loser will definitely not be in the sample (since b = 0). 
The reduced probability vector <img src="5-1240169\453e970a-b646-49cc-a673-ef7d039f63d0.jpg" /> is updated as</p><p><img src="5-1240169\c0fcb66c-99df-4e68-b044-aedd3fc3430a.jpg" /></p><p>Now, replace <img src="5-1240169\f8d7fefc-25ee-44a6-9120-0ce4b5618b7b.jpg" /> with <img src="5-1240169\5316d2e5-4b70-4c74-89ce-5ca13dca56a4.jpg" />. The final outcome is decided for at least one unit in each update, and thus the procedure has at most N steps. In each update, unit i is chosen randomly (with equal probabilities among the units with <img src="5-1240169\4bceddbb-3887-4d99-9a61-116577a6f743.jpg" />) and then its nearest neighbor <img src="5-1240169\cc58b0ca-965a-4d54-83ae-93cff3c7086c.jpg" /> (among the units with <img src="5-1240169\f0a92783-4713-4802-a139-f9df72655d7d.jpg" />) is chosen.</p><p>Another method, spatially correlated Poisson sampling (SCPS), was first described in [<xref ref-type="bibr" rid="scirp.27917-ref6">6</xref>] and it is a special case of the method introduced in [<xref ref-type="bibr" rid="scirp.27917-ref7">7</xref>]. The SCPS algorithm is a bit more complicated than LPM, but is based on the same idea. Weights are used to create a negative correlation between the inclusion indicators of nearby units, forcing the sample to be well spread. For more on the methods discussed above for selecting well spread samples, we refer the reader to the previously mentioned papers.</p><p>The two designs, LPM and SCPS, were used in [<xref ref-type="bibr" rid="scirp.27917-ref11">11</xref>] to obtain well spread probability samples. The fact that LPM and SCPS produce well spread samples has been justified by both theoretical results and simulation results in the previously mentioned papers. Variance estimators for the HT-estimator under well spread samples were suggested in the papers [11,12]. 
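The LPM steps described earlier in this section can be sketched in a few lines of Python. The sketch follows the verbal description only: a random undecided unit i, its nearest undecided neighbor j, a winner that receives a = min(1, π<sub>i</sub> + π<sub>j</sub>) and a loser that receives b = π<sub>i</sub> + π<sub>j</sub> - a. The winning probability (π<sub>i</sub> - b)/(a - b) is chosen here so that the expected updated probability equals the old one; this, the Euclidean distance, the tolerance eps and all names are assumptions of the sketch, not the authors' implementation.<preformat>
```python
import math
import random


def local_pivotal_method(points, probs, rng, eps=1e-9):
    """Select a sample by repeated pairwise competitions between close units.

    points: list of coordinate tuples in the auxiliary space
    probs:  prescribed inclusion probabilities, one per unit
    rng:    a random.Random instance supplied by the caller
    """
    pi = list(probs)
    # Units whose outcome is undecided (probability strictly between 0 and 1).
    active = [k for k, p in enumerate(pi) if 1.0 - eps > p > eps]
    while len(active) > 1:
        i = rng.choice(active)
        # Nearest undecided neighbour of i.
        j = min((k for k in active if k != i),
                key=lambda k: math.dist(points[i], points[k]))
        total = pi[i] + pi[j]
        a = min(1.0, total)  # new probability of the winner
        b = total - a        # new probability of the loser
        # Chance that i wins, so that the expected new pi[i] equals the old one.
        p_i_wins = (pi[i] - b) / (a - b)
        if p_i_wins > rng.random():
            pi[i], pi[j] = a, b
        else:
            pi[i], pi[j] = b, a
        active = [k for k in active if 1.0 - eps > pi[k] > eps]
    # If the probabilities do not sum to an integer, one unit may remain.
    for k in active:
        pi[k] = 1.0 if pi[k] > rng.random() else 0.0
    return [k for k, p in enumerate(pi) if p > 0.5]
```
</preformat> Each competition fixes at least one unit at 0 or 1, so the loop ends after at most N updates, in line with the description above. 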
To our knowledge LPM and SCPS are the designs that in general produce the lowest mean value of the balance measure (1) in general auxiliary space with prescribed inclusion probabilities.</p><p>When it comes to efficiency of the HT-estimator for well spread samples, we can also make heuristic arguments that such samples produce a low variance of the HT-estimator. When the sample size is fixed, the variance of <img src="5-1240169\991e6ae7-7ad5-4aac-bafa-79e95249beec.jpg" /> can be written as</p><disp-formula id="scirp.27917-formula105182"><label>(11)</label><graphic position="anchor" xlink:href="5-1240169\0b5011ed-813f-4dff-ac8e-55394a7988cc.jpg"  xlink:type="simple"/></disp-formula><p>A property that e.g. the LPM and SCPS designs have is that <img src="5-1240169\0cac6349-1426-4490-ad9a-673f960bbbca.jpg" /> is small (minimum or close to minimum) when <img src="5-1240169\d8af37dd-caf7-465a-aeb9-9d808f2961b5.jpg" /> is small and <img src="5-1240169\9aa5fd5f-c5d8-457d-a8d2-fa9fbc84d65e.jpg" /> is large (close to <img src="5-1240169\12aa6c50-9833-4094-8662-d059ac281264.jpg" />) when</p><p><img src="5-1240169\e9d3f369-b7ce-4433-a7ff-ff5347d472c5.jpg" /> is large. If <img src="5-1240169\a604becf-ab09-4dd8-be45-657faf0ebef1.jpg" /> is small when <img src="5-1240169\9b4edf1e-5b74-4b84-a485-ebe274237d78.jpg" /> is small, then <img src="5-1240169\62f8f3ac-d658-4e98-a820-f3deb624d605.jpg" /> is small when <img src="5-1240169\a7621448-d079-4d62-a84a-965682026075.jpg" /> is small, i.e. <img src="5-1240169\ab22c6ce-f367-4128-8964-a5cdfef2c642.jpg" /> is small when <img src="5-1240169\51a9fc93-d08f-405b-85f5-9e4801bf9f7e.jpg" /> is large. Also, if <img src="5-1240169\c06ae841-3608-4175-b6b6-c38d04b75f76.jpg" /> is large (i.e. <img src="5-1240169\9cf60d9a-3494-4079-b3b1-5d4bc6c5340f.jpg" /></p><p>is large), then <img src="5-1240169\3b1e5e11-3e48-41f3-b4ae-907377af043e.jpg" /> is small since <img src="5-1240169\ea5ef8ea-8fc1-4399-bde0-e00346ab20ce.jpg" />. 
As a result the variance (11) becomes small.</p><p>For well spread samples, the balancing property can only be shown to hold exactly in very specific situations, i.e. under assumptions (A.0)-(A.2), see (2). For a categorical auxiliary variable, the sample will be balanced if the design produces stratification with fixed sample size for each category. A simple example follows.</p><p>Example 1. Let U be a population of males U<sub>m</sub> and females U<sub>f</sub>. Let x be the only auxiliary variable and let <img src="5-1240169\df2e846a-66bc-49fb-9390-a9dcef421319.jpg" /> if male and <img src="5-1240169\ea760011-adfd-4615-add5-1f5412ed0302.jpg" /> if female. Also, let <img src="5-1240169\a5abeabe-5326-4e77-866f-04cd1618190b.jpg" /> and <img src="5-1240169\89f657e2-0a16-4649-841e-165c129b53bb.jpg" /> be the inclusion probabilities, where n<sub>m</sub> and n<sub>f</sub> are integers. In this special case, e.g. the LPM and SCPS automatically produce stratification with fixed sample sizes. Hence we have</p><p><img src="5-1240169\0ddf7d1b-4b6f-40f0-a335-2fb1986c58c8.jpg" /></p><p>where s<sub>m</sub> and s<sub>f</sub> are the sampled males and females respectively.</p><p>Example 2. We compare the different sampling methods LPM, SCPS, GRTS and simple random sampling (SRS) using a model satisfying (6). In particular, the population is generated from <img src="5-1240169\dfe970ea-09d1-428e-8dc1-323f12cbaad0.jpg" /> with</p><p><img src="5-1240169\de8b9ec3-0103-489a-906b-a1a449b835fa.jpg" /></p><p>The population size is N = 200 and the x-values are generated from a uniform distribution on the unit square. Using Euclidean distance, the covariance function for <img src="5-1240169\8035af50-e43b-4ae9-9325-916a057657c4.jpg" /> is defined as <img src="5-1240169\1a2b75ac-6d9c-4659-a745-de4e159292d6.jpg" />, which is a simple covariance function used for stationary fields [<xref ref-type="bibr" rid="scirp.27917-ref13">13</xref>]. 
The <img src="5-1240169\b3a0847b-e881-4566-8ed7-0a8206c7d65c.jpg" />-values are generated in two steps. First, independent and identically distributed random data <img src="5-1240169\7943d3a8-66d1-42ac-ab79-1fddfa455841.jpg" /> are generated from <img src="5-1240169\afcc77e1-2f07-4673-a6ae-b332a492a6fb.jpg" />. Then, the <img src="5-1240169\5771cb0c-aab2-47b7-9d1a-b26bc7e8fe60.jpg" />-values are constructed as <img src="5-1240169\92b12f12-28e6-4918-8638-6e39d15638aa.jpg" /> using the covariance matrix <img src="5-1240169\d074d05c-c1ca-425c-baac-677b8d0f0a93.jpg" />. In this example <img src="5-1240169\f7eff929-59c3-49a9-b35b-f83966444602.jpg" /> and <img src="5-1240169\0cb7d4de-2115-442a-af7e-6b418bf3fcaf.jpg" />. The units are sampled with equal inclusion probabilities and <img src="5-1240169\fb74efc0-6ea5-4c94-91c2-1eec960ddd1e.jpg" /> units are sampled.</p><p>The target parameter is <img src="5-1240169\0881c5c2-2d93-4fc3-bc9e-a8fb000aa674.jpg" />. In our particular realization the true value is <img src="5-1240169\9e085690-f741-4db0-b5e7-41e6965887ad.jpg" />. The results are presented in <xref ref-type="table" rid="table1">Table 1</xref>. A clear connection between well spread samples and variance can be observed. A design with a small expected value of B, see (1), gives better estimates. Concerning the anticipated variance, we get similar results if we average over repeated realizations from the model.</p><p><xref ref-type="table" rid="table1">Table 1</xref>. Results for Example 2. Empirical variance <img src="5-1240169\e78a1832-3611-47f2-8379-9d15a29deeb3.jpg" /> of the HT-estimator and the mean of the measure B for 1000 samples of size 50.</p><p><img src="5-1240169\e4495942-dbe0-4a21-9b0d-6992587effd2.jpg" /></p></sec><sec id="s5"><title>5. Final Comments</title><p>It has been shown that in general there is a significant balancing effect for well spread samples. 
Usually, well spread samples are not as balanced on the auxiliary x-variables as samples selected by the cube method, but nearly so if the sample size is not too small. However, for target variables that are non-linear in x, well spread samples are likely to be more balanced on the target variables than samples selected by the cube method. In that way, well spread samples are good for more general situations. Hopefully, the fact that a significant balancing effect has been shown will increase interest in using well spread probability samples when auxiliary x-variables are available.</p><p>It is also possible to combine the cube method with an idea similar to that used in the LPM, yielding a local cube method. Then samples that are both well spread (spatially balanced) and balanced on the auxiliary variables can be selected. Such a method was developed in [<xref ref-type="bibr" rid="scirp.27917-ref9">9</xref>].</p><p>In [1,14], properties of spatial total estimators are studied under a tessellation stratified design in a continuous universe. With assumptions on the target function similar to those used in this paper, they show that the convergence rate of the variance of the total estimator is <img src="5-1240169\855f88c3-faaf-4e0c-8841-7e70147d4390.jpg" /> for such a design. Even though our setting is different and does not imply a strict stratification, this indicates that spreading the sample locations well probably gives a small variance when there are spatial trends.</p><p>In the setting of Voronoi polytopes used in this paper, we may consider the nearest neighbor estimator (NN-estimator) in place of the HT-estimator. 
The NN-estimator of Y is, if <img src="5-1240169\745a8bb8-85d5-411c-bb8a-a423aca34568.jpg" /> is the number of units in polytope <img src="5-1240169\32554f71-6294-48e9-8ad7-731081a12b4b.jpg" />,</p><p><img src="5-1240169\bfe21af1-69c3-4c8b-8e34-a0c76bbc7445.jpg" /></p><p>Under the assumptions (A.0) and (A.1), we have <img src="5-1240169\24d1b0b3-5410-4ebb-a95a-7fa4aba6e7c2.jpg" /> and the NN-estimator is equal to the HT-estimator. This implies that the NN-estimator will be approximately design unbiased for well spread samples. Moreover, the NN-estimator can probably adjust for some minor spatial imbalance in the sample by using the realized polytope sizes <img src="5-1240169\c1b5d12e-e25b-4966-9096-7b986415a987.jpg" /> instead of <img src="5-1240169\bcd39f5c-d9df-42f0-b68d-9dac94b76e3e.jpg" />, which can be viewed as the estimated polytope sizes. The possible benefit of using the NN-estimator in place of the HT-estimator will be investigated in a future paper.</p></sec><sec id="s6"><title>6. Acknowledgements</title><p>Thanks to Lennart Bondesson and an anonymous reviewer for helpful comments that improved this manuscript.</p></sec><sec id="s7"><title>REFERENCES</title></sec></body><back><ref-list><title>References</title><ref id="scirp.27917-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">L. Barabesi and S. Franceschi, “Sampling Properties of Spatial Total Estimators under Tessellation Stratified Designs,” Environmetrics, Vol. 22, No. 3, 2011, pp. 271-278. doi:10.1002/env.1046</mixed-citation></ref><ref id="scirp.27917-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">D. L. Stevens Jr. and A. R. Olsen, “Spatially Balanced Sampling of Natural Resources,” Journal of the American Statistical Association, Vol. 99, No. 465, 2004, pp. 262-278. doi:10.1198/016214504000000250</mixed-citation></ref><ref id="scirp.27917-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">J.-C. 
Deville and Y. Tillé, “Efficient Balanced Sampling: the Cube Method,” Biometrika, Vol. 91, No. 4, 2004, pp. 893-912. doi:10.1093/biomet/91.4.893</mixed-citation></ref><ref id="scirp.27917-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">J.-C. Deville and Y. Tillé, “Unequal Probability Sampling without Replacement through a Splitting Method,” Biometrika, Vol. 85, No. 1, 1998, pp. 89-101. 
doi:10.1093/biomet/85.1.89</mixed-citation></ref><ref id="scirp.27917-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">A. Grafström, N. L. P. Lundström and L. Schelin, “Spatially Balanced Sampling through the Pivotal Method,” Biometrics, Vol. 68, No. 2, 2012, pp. 514-520. 
doi:10.1111/j.1541-0420.2011.01699.x </mixed-citation></ref><ref id="scirp.27917-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">A. Grafström, “Spatially Correlated Poisson Sampling,” Journal of Statistical Planning and Inference, Vol. 142, No. 1, 2012, pp. 139-147. 
doi:10.1016/j.jspi.2011.07.003 </mixed-citation></ref><ref id="scirp.27917-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">L. Bondesson and D. Thorburn, “A List Sequential Sampling Method Suitable for Real-Time Sampling,” Scandinavian Journal of Statistics, Vol. 35, No. 3, 2008, pp. 466-483. doi:10.1111/j.1467-9469.2008.00596.x</mixed-citation></ref><ref id="scirp.27917-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">D. G. Horvitz and D. J. Thompson, “A Generalization of Sampling without Replacement from a Finite Universe,” Journal of the American Statistical Association, Vol. 47, No. 260, 1952, pp. 663-685. 
doi:10.1080/01621459.1952.10483446</mixed-citation></ref><ref id="scirp.27917-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">A. Grafström and Y. Tillé, “Doubly Balanced Spatial Sampling with Spreading and Restitution of Auxiliary Totals,” Environmetrics, in Press, 2012.  
doi:10.1002/env.2194 </mixed-citation></ref><ref id="scirp.27917-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">A. J. Lister and C. T. Scott, “Use of Space-Filling Curves to Select Sample Locations in Natural Resource Monitoring Studies,” Environmental Monitoring and Assessment, Vol. 149, No. 1-4, 2009, pp. 71-80. 
doi:10.1007/s10661-008-0184-y</mixed-citation></ref><ref id="scirp.27917-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">A. Grafström and L. Schelin, “How to Select Representative Samples,” Unpublished, 2012. </mixed-citation></ref><ref id="scirp.27917-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">D. L. Stevens Jr. and A. R. Olsen, “Variance Estimation for Spatially Balanced Samples of Environmental Resources,” Environmetrics, Vol. 14, No. 6, 2003, pp. 593-610. doi:10.1002/env.606</mixed-citation></ref><ref id="scirp.27917-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">N. A. C. Cressie, “Statistics for Spatial Data,” Wiley, New York, 1993.</mixed-citation></ref><ref id="scirp.27917-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">L. Barabesi and M. Marcheselli, “A Modified Monte Carlo Integration,” International Mathematical Journal, Vol. 3, No. 5, 2003, pp. 555-565.</mixed-citation></ref></ref-list></back></article>