<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">OJS</journal-id><journal-title-group><journal-title>Open Journal of Statistics</journal-title></journal-title-group><issn pub-type="epub">2161-718X</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/ojs.2023.135036</article-id><article-id pub-id-type="publisher-id">OJS-128498</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  An Adaptive Sequential Replacement Method for Variable Selection in Linear Regression Analysis
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Jixiang</surname><given-names>Wu</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Johnie</surname><given-names>N. Jenkins</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Jack</surname><given-names>C. McCarty Jr.</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>Genetics and Sustainable Agriculture Research Unit, USDA-ARS, Mississippi State, USA</addr-line></aff><pub-date pub-type="epub"><day>07</day><month>09</month><year>2023</year></pub-date><volume>13</volume><issue>05</issue><fpage>746</fpage><lpage>760</lpage><history><date date-type="received"><day>25</day><month>September</month><year>2023</year></date><date date-type="rev-recd"><day>22</day><month>October</month><year>2023</year></date><date date-type="accepted"><day>25</day><month>October</month><year>2023</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  With the rapid development of DNA technologies, high throughput genomic data have become a powerful tool for locating desirable genetic loci associated with traits of importance in various crop species. However, current genetic association mapping analyses are focused on identifying individual QTLs. This study aimed to identify a set of QTLs or genetic markers that can capture genetic variability for marker-assisted selection. Selecting a set of k loci that maximizes genetic variation out of high throughput genomic data is a challenging issue. In this study, we proposed an adaptive sequential replacement (ASR) method, which can be considered a variant of the sequential replacement (SR) method. Through Monte Carlo simulation and comparison with four other selection methods (exhaustive, SR, forward, and backward), we found that the ASR method yields consistent and repeatable results comparable to those of the exhaustive method with much reduced computational intensity.
 
</p></abstract><kwd-group><kwd>Adaptive Sequential Replacement</kwd><kwd>Association Mapping</kwd><kwd>Exhaustive Method</kwd><kwd>Global Optimal Solution</kwd><kwd>Sequential Replacement</kwd><kwd>Variable Selection</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>With the rapid development of DNA technologies, high throughput genomic data have become a powerful tool for locating desirable genetic loci associated with traits of importance in various crop species. It is well known that many quantitative traits, such as crop yield, plant height, and seed quality, are controlled by many individual quantitative trait loci (QTLs) with minor effects and possible interactions with environmental conditions. Current genetic association mapping, however, focuses on identifying individual QTLs. It is therefore crucial to identify a set of QTLs that can capture sufficient genetic variability for marker-assisted selection. Selecting a set of loci that maximizes genetic variation out of high throughput genomic data is desirable but still computationally challenging.</p><p>Simple interval and composite interval mapping have commonly been used to identify QTLs in controlled mapping populations (such as F2, RI, or DH) when linkage maps are available [<xref ref-type="bibr" rid="scirp.128498-ref1">1</xref>] - [<xref ref-type="bibr" rid="scirp.128498-ref7">7</xref>]. These methods aim to identify individual QTLs by combining linear regression with the expectation-maximization (EM) algorithm [<xref ref-type="bibr" rid="scirp.128498-ref4">4</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref7">7</xref>]. When a linkage map is constructed from high throughput genomic data, the distance between two flanking marker loci is often less than 2 centimorgans (cM), so the window size commonly used in interval mapping may not be required. 
Genome-wide association studies (GWAS), on the other hand, have focused on identifying individual genetic loci contributing to phenotypic variation in an uncontrolled/random mapping population with or without population structure [<xref ref-type="bibr" rid="scirp.128498-ref8">8</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref9">9</xref>].</p><p>Because QTLs may be linked or may interact, the total heritability of a set of QTLs is sometimes not the sum of the heritabilities of the individually identified QTLs. Therefore, it is important to select a set of loci that captures the maximum genetic variation for a trait of interest. Such a process is the variable selection process in multiple linear regression, which aims to select the best subset of k variables out of the total p candidate independent variables. Given p genetic markers/loci, there are 2<sup>p</sup> − 1 possible linear models to be examined. The all-possible regression approach (sometimes called the exhaustive method, the term used throughout this study for consistency) is guaranteed to find the best subset because it examines every possible model [<xref ref-type="bibr" rid="scirp.128498-ref10">10</xref>]. However, a serious challenge associated with the exhaustive method is that the number of all possible models can be very large even for a small number of independent variables [<xref ref-type="bibr" rid="scirp.128498-ref11">11</xref>]. Because of the high computational demand of the exhaustive method, heuristic methods are more frequently used for variable selection in linear regression analysis. 
They include forward selection (FS), backward elimination (BE), and stepwise selection (SS) [<xref ref-type="bibr" rid="scirp.128498-ref12">12</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref13">13</xref>], which are currently available in several widely used R packages such as MASS, leaps, and olsrr [<xref ref-type="bibr" rid="scirp.128498-ref14">14</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref15">15</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref16">16</xref>]. Although these variable selection procedures are very popular in the literature, a considerable number of limitations have also been identified, arising from collinearity and/or interactions among predictor variables [<xref ref-type="bibr" rid="scirp.128498-ref17">17</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref18">18</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref19">19</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref20">20</xref>]. For example, an excellent model could be overlooked by these selection methods because only one variable is added or deleted at a time, and thus these procedures may not always yield the optimal regression model [<xref ref-type="bibr" rid="scirp.128498-ref18">18</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref19">19</xref>]. Such findings were reported in Chapter 3 of the book “Subset Selection in Regression Analysis, 2<sup>nd</sup> Edition” (Miller, 2002).</p><p>Many variable selection methods may select a large number of significant variables. Mathematically, it is desirable to predict a response variable using more of the significantly contributing variables. Sometimes, however, a plant breeder may be interested in identifying only four or five genetic markers rather than all contributing markers for a marker-assisted selection (MAS) practice. 
Therefore, selecting k variables (where k is given, e.g., 4 or 5) so as to obtain the smallest residual sum of squares (RSS) or the largest coefficient of determination (R<sup>2</sup>) could be another desirable option for a breeding practice. This procedure requires a total of C<sub>p</sub><sup>k</sup> = p!/[k!(p − k)!] models to be examined to identify the best k-variable model if the exhaustive method is applied. For example, a global search for the best subset of five (k = 5) variables out of 100 (p = 100) needs to examine over 75 million models.</p><p>To avoid the high computational demand of the exhaustive method, many scientists have developed alternative variable selection methods that improve the chance of finding the best subset for a given size of k variables [<xref ref-type="bibr" rid="scirp.128498-ref21">21</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref22">22</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref23">23</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref24">24</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref25">25</xref>]. Among these, a sequential replacement (SR) algorithm was proposed to improve variable selection with much reduced computational intensity [<xref ref-type="bibr" rid="scirp.128498-ref11">11</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref18">18</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref26">26</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref27">27</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref28">28</xref>], and the SR method is available in a widely used R package (leaps) [<xref ref-type="bibr" rid="scirp.128498-ref14">14</xref>]; however, we discovered that the SR method could sometimes yield inconsistent or undesired results, as demonstrated in this study. 
Therefore, it is important to improve the SR method so that both power and speed can be sustained in selecting an optimal k-variable model.</p><p>In this study, our first objective was to propose an adaptive sequential replacement (ASR) method that improves the likelihood of reaching the best-fitting model with much reduced computational intensity. As detailed in Methodologies, we integrated stochasticity and adaptivity with the sequential replacement (SR) method to avoid local optimal solutions and to avoid unnecessary computation once it is evident that a best-fitting model has been reached. The power of the ASR method was evaluated with simulated data. Our second objective was to compare the results of the ASR method with those of four other methods (SR, exhaustive, forward, and backward) using two actual genetic marker data sets. The purpose of this study is to provide a method with improved power to capture desirable genetic variation from high throughput genomic data for marker-assisted selection with reduced computational intensity.</p></sec><sec id="s2"><title>2. Methodologies</title><sec id="s2_1"><title>2.1. The ASR Algorithm</title><p>The SR procedure was detailed by Miller and usually converges rapidly. Unfortunately, this type of replacement algorithm does not guarantee convergence upon the best-fitting k-variable model [<xref ref-type="bibr" rid="scirp.128498-ref18">18</xref>]. In this study, we proposed the ASR algorithm to avoid local optimal solutions, together with a criterion to determine when the optimal solution has been reached. The criterion used throughout this study is the adjusted coefficient of determination, R<sub>A</sub><sup>2</sup> (hereafter r-square for simplicity). The ASR procedure is detailed as follows:</p><p>Step 1: Stochasticity process. Randomly select a subset of k (k ≥ 2) variables out of p candidate variables and set the variable index vector as id0. Run this k-variable linear regression analysis and calculate the r-square value as R<sub>A0</sub><sup>2</sup>. 
This step uses stochasticity to avoid local optimal solutions.</p><p>Step 2: Sequential replacement process. Replace the first variable in id0 with each of the remaining (p − k) candidate variables, one at a time. For each replacement, run the k-variable linear regression analysis with the new variable index vector id1 and calculate the r-square value R<sub>A1</sub><sup>2</sup>. If R<sub>A1</sub><sup>2</sup> &gt; R<sub>A0</sub><sup>2</sup>, set R<sub>A0</sub><sup>2</sup> = R<sub>A1</sub><sup>2</sup> and id0 = id1.</p><p>Step 3: Repeat step 2 for the second and each subsequent variable in id0. Save R<sub>A0</sub><sup>2</sup> and id0.</p><p>Step 4: Adaptivity process. Repeat steps 1 to 3 until (1) the three largest r-square values R<sub>A</sub><sup>2</sup> are identical, (2) the difference between the largest and third largest R<sub>A</sub><sup>2</sup> is less than a given Δ (e.g. 0.001), or (3) a given maximum number of iterations (e.g. 100) is reached without condition (1) or (2) being met. Save the largest r-square R<sub>A</sub><sup>2</sup> with the corresponding variables.</p><p>Step 5: Repeat steps 1 to 4 N times (e.g. N = 5 or 10). Record the largest r-square R<sub>A</sub><sup>2</sup> with the corresponding variable index vector.</p><p>Stochasticity is used in step 1 to avoid local optimal solutions. If condition (1) in step 4 is met, it is very likely that the optimal solution has been achieved. Step 5 helps increase the probability of reaching the optimal solution if condition (1) is not met. If k is small (less than 4) or several candidate variables have a strong linear relationship with the response variable y, condition (1) in step 4 will be met rapidly. Given p = 100 and k = 5, the number of linear regressions to be assessed by the exhaustive (all-possible subset regression) method is</p><p>C<sub>p</sub><sup>k</sup> = p!/[k!(p − k)!] = 75,287,520.</p><p>With our method, in contrast, only k(p − k) + 1 = 476 multiple regression models need to be assessed in steps 1 to 3. If step 4 is repeated 50 times and step 5 is repeated 5 times, the total number will be up to 476 × 50 × 5 = 119,000, which is about 0.16% of the number of models required by the exhaustive method. 
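</p><p>Steps 1 to 5 can be sketched in code as follows. This is a minimal illustration written in Python with the standard library only, not the authors' R implementation; the function names (solve, adjusted_r2, one_pass, asr, model_counts) and the default arguments are hypothetical choices made for this sketch. Conditions (1) and (2) of step 4 are merged into a single test, because identical values also differ by less than Δ.</p><preformat><![CDATA[
```python
import math
import random


def solve(A, b):
    """Solve a small square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for c in reversed(range(n)):
        tail = sum(M[c][j] * x[j] for j in range(c + 1, n))
        x[c] = (M[c][n] - tail) / M[c][c]
    return x


def adjusted_r2(X, y, cols):
    """Adjusted R^2 of the least-squares fit of y on an intercept plus columns `cols` of X."""
    n, k = len(y), len(cols)
    rows = [[1.0] + [X[i][j] for j in cols] for i in range(n)]
    m = k + 1
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(m)] for a in range(m)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(m)]
    beta = solve(xtx, xty)  # normal equations
    ybar = sum(y) / n
    rss = sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    tss = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))


def one_pass(X, y, k, rng):
    """Steps 1 to 3: random start, then sequential replacement at each position."""
    p = len(X[0])
    id0 = rng.sample(range(p), k)              # step 1: random k-subset
    best = adjusted_r2(X, y, id0)
    for pos in range(k):                       # steps 2 and 3
        for cand in range(p):
            if cand in id0:
                continue
            id1 = id0[:pos] + [cand] + id0[pos + 1:]
            r2 = adjusted_r2(X, y, id1)
            if r2 > best:                      # keep the replacement only if it improves
                best, id0 = r2, id1
    return best, sorted(id0)


def asr(X, y, k, delta=1e-3, max_iter=100, n_restarts=5, seed=1):
    """Steps 4 and 5: repeat passes until the three largest values agree within delta."""
    rng = random.Random(seed)
    overall = (-math.inf, [])
    for _ in range(n_restarts):                # step 5
        top = []                               # three largest results of this restart
        for _ in range(max_iter):              # step 4
            top = sorted(top + [one_pass(X, y, k, rng)], reverse=True)[:3]
            if len(top) == 3 and top[0][0] - top[2][0] < delta:
                break
        overall = max(overall, top[0])
    return overall


def model_counts(p, k, n_iter, n_restarts):
    """Exhaustive model count and the upper bound quoted in the text for this procedure."""
    per_pass = k * (p - k) + 1
    return math.comb(p, k), per_pass * n_iter * n_restarts
```
]]></preformat><p>For the example in the text, model_counts(100, 5, 50, 5) returns (75287520, 119000). The sketch solves the normal equations directly for brevity; a production implementation would more likely use a QR-based least-squares routine.</p><p>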
In addition, either step 4 or step 5 can be run with parallel computing to increase the computational speed proportionally in reaching the optimal solution.</p></sec><sec id="s2_2"><title>2.2. Data Analysis</title><p>We intended to compare the results of the ASR method with those of other commonly used methods. Such a comparison is hindered by several factors. For example, both forward selection and backward selection are available in several R packages, but these two methods focus on selecting all significant variables rather than a k-variable model only. The exhaustive method is computationally prohibitive for a large number of candidate variables. On the other hand, the power and Type I error of a given method can be determined on their own by simulation. Therefore, for the simulated data we focused on applying the ASR method alone, whereas in the applications we compared the results of five methods: forward, backward, exhaustive, SR, and ASR. The first four methods are available in the R package leaps [<xref ref-type="bibr" rid="scirp.128498-ref14">14</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref15">15</xref>], while the ASR method was developed by the first author of this paper and will be available upon request. In this study, all data simulations and actual data processing were conducted in RStudio [<xref ref-type="bibr" rid="scirp.128498-ref29">29</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref30">30</xref>].</p></sec></sec><sec id="s3"><title>3. Results</title><sec id="s3_1"><title>3.1. Simulation Results</title><p>In our simulation study, a total of 100 independent variables (p = 100) were used, of which five (k = 5) were related to the response variable with equal contribution. 
The regression model used for simulation is as follows,</p><p>y<sub>i</sub> = b<sub>0</sub> + b<sub>1</sub>X<sub>1i</sub> + b<sub>2</sub>X<sub>2i</sub> + b<sub>3</sub>X<sub>3i</sub> + b<sub>4</sub>X<sub>4i</sub> + b<sub>5</sub>X<sub>5i</sub> + e<sub>i</sub></p><p>where y<sub>i</sub> is the response variable for observation i; b<sub>0</sub> is the intercept; b<sub>1</sub> - b<sub>5</sub> are the slopes for variables X<sub>1</sub> to X<sub>5</sub>, respectively; and e<sub>i</sub> is the random error for observation i. For simplicity, the intercept and all slopes were preset to 1. Five sets of coefficients of correlation (r = 0.00, 0.20, 0.40, 0.60, and 0.80) among the first 10 variables are provided in <xref ref-type="table" rid="table1">Table 1</xref>, namely S1 (r = 0.00), S2 (r = 0.20), S3 (r = 0.40), S4 (r = 0.60), and S5 (r = 0.80). Four coefficients of determination (R<sup>2</sup> = 0.20, 0.40, 0.60, and 0.80), equivalent to the total heritability from the five variables/loci, were used. The above-mentioned parameters were used to generate simulated data. The mean power of the five variables for each setting and the mean adjusted coefficients of determination for the selected models (R&#772;<sub>AS</sub><sup>2</sup>) and the true models (R&#772;<sub>AT</sub><sup>2</sup>) were calculated over 200 simulations.</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> Coefficients of correlation between the five true variables (X<sub>1</sub> - X<sub>5</sub>) and the five noise variables (X<sub>6</sub> - X<sub>10</sub>)</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >X<sub>1</sub></th><th align="center" valign="middle" >X<sub>2</sub></th><th align="center" valign="middle" >X<sub>3</sub></th><th align="center" valign="middle" >X<sub>4</sub></th><th align="center" valign="middle" >X<sub>5</sub></th><th align="center" valign="middle" >X<sub>6</sub></th><th align="center" valign="middle" >X<sub>7</sub></th><th align="center" valign="middle" >X<sub>8</sub></th><th align="center" valign="middle" >X<sub>9</sub></th><th align="center" valign="middle" >X<sub>10</sub></th></tr></thead><tr><td align="center" 
valign="middle" >X<sub>1</sub></td><td align="center" valign="middle" >1.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >r</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td></tr><tr><td align="center" valign="middle" >X<sub>2</sub></td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >1.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >r</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td></tr><tr><td align="center" valign="middle" >X<sub>3</sub></td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >1.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >r</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td></tr><tr><td align="center" valign="middle" >X<sub>4</sub></td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >1.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >r</td><td align="center" 
valign="middle" >0.00</td></tr><tr><td align="center" valign="middle" >X<sub>5</sub></td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >1.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >r</td></tr><tr><td align="center" valign="middle" >X<sub>6</sub></td><td align="center" valign="middle" >r<sup>†</sup></td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >1.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td></tr><tr><td align="center" valign="middle" >X<sub>7</sub></td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >r</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >1.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td></tr><tr><td align="center" valign="middle" >X<sub>8</sub></td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >r</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >1.00</td><td 
align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td></tr><tr><td align="center" valign="middle" >X<sub>9</sub></td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >r</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >1.00</td><td align="center" valign="middle" >0.00</td></tr><tr><td align="center" valign="middle" >X<sub>10</sub></td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >r</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >0.00</td><td align="center" valign="middle" >1.00</td></tr></tbody></table></table-wrap><p><sup>†</sup>: r = 0.00, 0.20, 0.40, 0.60, and 0.80 for settings S1 - S5, respectively.</p><p>The results are summarized in <xref ref-type="table" rid="table2">Table 2</xref> and <xref ref-type="table" rid="table3">Table 3</xref>. The mean powers for the target variables being selected by our ASR method were 98.2%, 98.5%, 98.2%, 98.1%, and 96.5% for the five settings S1, S2, S3, S4, and S5, respectively, when the coefficient of determination was as low as 0.20. When the coefficient of determination was 0.40 or higher, the mean power for the target variables being selected was 100.0% in every setting. 
Therefore, the simulation results suggest that the ASR method can be used to identify the best k-variable model, which captures the maximum variation in a linear regression analysis.</p><p>Comparing the mean adjusted coefficients of determination between the selected and the true models (R&#772;<sub>AS</sub><sup>2</sup> vs R&#772;<sub>AT</sub><sup>2</sup>) helps us determine the efficiency of finding an optimal or better subset in linear regression analysis. The mean coefficients of determination for the selected models and the true models are summarized in <xref ref-type="table" rid="table3">Table 3</xref>. The results in <xref ref-type="table" rid="table3">Table 3</xref> showed that mean R<sup>2</sup> for selected models was</p><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> Mean powers of five quantitative variables being selected for five settings of correlation coefficients (r = 0.00, 0.20, 0.40, 0.60, and 0.80 for S1, S2, S3, S4, and S5, respectively) among the first 10 variables and four coefficients of determination (R<sup>2</sup> = 0.20, 0.40, 0.60, and 0.80), each based on 200 simulated data sets</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  rowspan="2"  >Setting</th><th align="center" valign="middle"  colspan="4"  >Coefficient of determination</th></tr></thead><tr><td align="center" valign="middle" >0.20</td><td align="center" valign="middle" >0.40</td><td align="center" valign="middle" >0.60</td><td align="center" valign="middle" >0.80</td></tr><tr><td align="center" valign="middle" >S1</td><td align="center" valign="middle" >0.982</td><td align="center" valign="middle" >1.000</td><td align="center" valign="middle" >1.000</td><td align="center" valign="middle" >1.000</td></tr><tr><td align="center" valign="middle" >S2</td><td align="center" valign="middle" >0.985</td><td align="center" valign="middle" >1.000</td><td align="center" valign="middle" >1.000</td><td align="center" valign="middle" >1.000</td></tr><tr><td 
align="center" valign="middle" >S3</td><td align="center" valign="middle" >0.982</td><td align="center" valign="middle" >1.000</td><td align="center" valign="middle" >1.000</td><td align="center" valign="middle" >1.000</td></tr><tr><td align="center" valign="middle" >S4</td><td align="center" valign="middle" >0.981</td><td align="center" valign="middle" >1.000</td><td align="center" valign="middle" >1.000</td><td align="center" valign="middle" >1.000</td></tr><tr><td align="center" valign="middle" >S5</td><td align="center" valign="middle" >0.965</td><td align="center" valign="middle" >1.000</td><td align="center" valign="middle" >1.000</td><td align="center" valign="middle" >1.000</td></tr></tbody></table></table-wrap><table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title> Mean adjusted coefficients of determination for selected models (R&#772;<sub>AS</sub><sup>2</sup>) and true models (R&#772;<sub>AT</sub><sup>2</sup>) over 200 simulations for five correlation settings (S1 - S5) with four different coefficients of determination (0.20, 0.40, 0.60, and 0.80)</title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" ></th><th align="center" valign="middle"  colspan="4"  >R<sup>2</sup></th></tr></thead><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" >0.20</td><td align="center" valign="middle" >0.40</td><td align="center" valign="middle" >0.60</td><td align="center" valign="middle" >0.80</td></tr><tr><td align="center" valign="middle"  rowspan="2"  >S1</td><td align="center" valign="middle" >R&#772;<sub>AS</sub><sup>2</sup></td><td align="center" valign="middle" >0.1995</td><td align="center" valign="middle" >0.3995</td><td align="center" valign="middle" >0.6000</td><td align="center" valign="middle" >0.7979</td></tr><tr><td align="center" valign="middle" >R&#772;<sub>AT</sub><sup>2</sup></td><td align="center" valign="middle" >0.1992</td><td 
align="center" valign="middle" >0.3995</td><td align="center" valign="middle" >0.6000</td><td align="center" valign="middle" >0.7979</td></tr><tr><td align="center" valign="middle"  rowspan="2"  >S2</td><td align="center" valign="middle" >R&#772;<sub>AS</sub><sup>2</sup></td><td align="center" valign="middle" >0.2069</td><td align="center" valign="middle" >0.3978</td><td align="center" valign="middle" >0.5990</td><td align="center" valign="middle" >0.8014</td></tr><tr><td align="center" valign="middle" >R&#772;<sub>AT</sub><sup>2</sup></td><td align="center" valign="middle" >0.2065</td><td align="center" valign="middle" >0.3978</td><td align="center" valign="middle" >0.5990</td><td align="center" valign="middle" >0.8014</td></tr><tr><td align="center" valign="middle"  rowspan="2"  >S3</td><td align="center" valign="middle" >R&#772;<sub>AS</sub><sup>2</sup></td><td align="center" valign="middle" >0.2004</td><td align="center" valign="middle" >0.3976</td><td align="center" valign="middle" >0.5977</td><td align="center" valign="middle" >0.7974</td></tr><tr><td align="center" valign="middle" >R&#772;<sub>AT</sub><sup>2</sup></td><td align="center" valign="middle" >0.2000</td><td align="center" valign="middle" >0.3976</td><td align="center" valign="middle" >0.5977</td><td align="center" valign="middle" >0.7974</td></tr><tr><td align="center" valign="middle"  rowspan="2"  >S4</td><td align="center" valign="middle" >R&#772;<sub>AS</sub><sup>2</sup></td><td align="center" valign="middle" >0.2024</td><td align="center" valign="middle" >0.3969</td><td align="center" valign="middle" >0.5969</td><td align="center" valign="middle" >0.7991</td></tr><tr><td align="center" valign="middle" >R&#772;<sub>AT</sub><sup>2</sup></td><td align="center" valign="middle" >0.2022</td><td align="center" valign="middle" >0.3969</td><td align="center" valign="middle" >0.5969</td><td align="center" valign="middle" >0.7991</td></tr><tr><td align="center" valign="middle"  rowspan="2"  >S5</td><td align="center" valign="middle" >R&#772;<sub>AS</sub><sup>2</sup></td><td align="center" valign="middle" >0.2028</td><td 
align="center" valign="middle" >0.3955</td><td align="center" valign="middle" >0.5998</td><td align="center" valign="middle" >0.7989</td></tr><tr><td align="center" valign="middle" >R&#772;<sub>AT</sub><sup>2</sup></td><td align="center" valign="middle" >0.2020</td><td align="center" valign="middle" >0.3955</td><td align="center" valign="middle" >0.5998</td><td align="center" valign="middle" >0.7989</td></tr></tbody></table></table-wrap><p>slightly higher than that for the true models when the pre-set R<sup>2</sup> was 0.20. Checking individual R<sup>2</sup> values, we observed that the R<sup>2</sup> of each selected model was either equal to or higher than that of the true model (detailed results not provided). When R<sup>2</sup> was 0.40 or higher, the R<sup>2</sup> of each selected model and that of the true model were identical for each simulated data set. The results in <xref ref-type="table" rid="table3">Table 3</xref> were highly consistent with those in <xref ref-type="table" rid="table2">Table 2</xref>. On one hand, when R<sup>2</sup> is small, some true variables may occasionally be replaced by noise variables, which causes a slightly higher R<sup>2</sup> for that simulated data set due to Type I error. On the other hand, these results imply that the ASR method was able to identify a subset of variables with the highest R<sup>2</sup>, which is desired mathematically.</p></sec><sec id="s3_2"><title>3.2. Applications</title><p>In our first application, we applied the ASR method to a fruit fly wing data set, which was used for QTL analysis [<xref ref-type="bibr" rid="scirp.128498-ref31">31</xref>]. The total number of polymorphic DNA markers on chromosome 2 is 37 (p = 37) after 11 co-existing markers were deleted. In this application, we were able to include the SR, exhaustive, forward, and backward selection methods in our comparisons for k = 1 to 14. 
The SR, exhaustive, forward, and backward methods are available in the leaps package [<xref ref-type="bibr" rid="scirp.128498-ref14">14</xref>]. The results in <xref ref-type="table" rid="table4">Table 4</xref> showed that the R<sup>2</sup> values for the exhaustive and ASR methods were identical, indicating that our ASR method has an improved probability of determining the best k-variable/marker model. In most cases (i.e., k = 1 to 6, 8, 10, and 11), the SR method had the same R<sup>2</sup> values as the ASR and exhaustive methods, while it had slightly lower R<sup>2</sup> values for k = 9, 12, and 13 and far lower values for k = 7 and 14. Both the backward and forward selection methods had consistently and slightly lower R<sup>2</sup> values than the ASR and exhaustive selection methods, except at k = 1 for the forward selection method (<xref ref-type="table" rid="table4">Table 4</xref>). These two selection methods also yielded consistently lower R<sup>2</sup> values than the SR method for k ≥ 2, except for k = 7 and 14. The backward method had consistently higher R<sup>2</sup> values than the forward method for all cases except k = 1. It was surprising to notice that the R<sup>2</sup> values for k = 7 and k = 14 for the SR method were far lower than those for the other four methods. We also noticed that the SR method yielded inconsistent and lower R<sup>2</sup> values for k = 7 and 14 when the order of the 37 markers was randomized several times (results not shown here). Without investigating the R scripts in the leaps package, it is hard to conclude whether such outcomes were caused by the algorithm itself or by bugs in the leaps package.</p><p>In application 2, a barley data set, which was analyzed in our previous publication [<xref ref-type="bibr" rid="scirp.128498-ref32">32</xref>], was used to compare the SR, ASR, forward, and backward selection methods. The data set includes 391 single nucleotide polymorphisms (SNPs) and 762 heading date data points. 
Because of the high computational intensity, the exhaustive method in the leaps package could not be applied to this data set; however, we were able to compare the results among four methods (SR, ASR, forward, and backward). The results showed that the SR and ASR methods performed equally well for k = 1 to 4, 6, and 7, while the ASR method performed better for the remaining cases (<xref ref-type="table" rid="table5">Table 5</xref>). The forward method had higher R<sup>2</sup> values than the backward method except for k = 8 and 9. Both the SR and ASR methods had higher R<sup>2</sup> values than both the forward and backward methods, except for k = 1 to 3, where the forward method had the same R<sup>2</sup> values as the SR and ASR methods, and for k = 9, where the SR value (0.501345) was lower than the backward value (0.503484).</p><table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title> Adjusted coefficients of determination of k-marker subsets (k = 1 to 14) for five selection methods for the data with fruit fly wing shape and 37 RFLP markers [<xref ref-type="bibr" rid="scirp.128498-ref31">31</xref>] </title></caption><table><tbody><thead><tr><th align="center" valign="middle" >k</th><th align="center" valign="middle" >SR<sup>†</sup></th><th align="center" valign="middle" >Exhaustive</th><th align="center" valign="middle" >ASR<sup>‡</sup></th><th align="center" valign="middle" >Forward</th><th align="center" valign="middle" >Backward</th></tr></thead><tr><td align="center" valign="middle" >1</td><td align="center" valign="middle" >0.514068</td><td align="center" valign="middle" >0.514068</td><td align="center" valign="middle" >0.514068</td><td align="center" valign="middle" >0.514068</td><td align="center" valign="middle" >0.445102</td></tr><tr><td align="center" valign="middle" >2</td><td align="center" valign="middle" >0.688750</td><td align="center" valign="middle" >0.688750</td><td align="center" valign="middle" >0.688750</td><td align="center" valign="middle" >0.642101</td><td align="center" valign="middle" >0.675081</td></tr><tr><td 
align="center" valign="middle" >3</td><td align="center" valign="middle" >0.843880</td><td align="center" valign="middle" >0.843880</td><td align="center" valign="middle" >0.843880</td><td align="center" valign="middle" >0.813288</td><td align="center" valign="middle" >0.836392</td></tr><tr><td align="center" valign="middle" >4</td><td align="center" valign="middle" >0.877729</td><td align="center" valign="middle" >0.877729</td><td align="center" valign="middle" >0.877729</td><td align="center" valign="middle" >0.867078</td><td align="center" valign="middle" >0.868122</td></tr><tr><td align="center" valign="middle" >5</td><td align="center" valign="middle" >0.903693</td><td align="center" valign="middle" >0.903693</td><td align="center" valign="middle" >0.903693</td><td align="center" valign="middle" >0.894649</td><td align="center" valign="middle" >0.897927</td></tr><tr><td align="center" valign="middle" >6</td><td align="center" valign="middle" >0.920823</td><td align="center" valign="middle" >0.920823</td><td align="center" valign="middle" >0.920823</td><td align="center" valign="middle" >0.911156</td><td align="center" valign="middle" >0.917549</td></tr><tr><td align="center" valign="middle" >7</td><td align="center" valign="middle" >0.230306</td><td align="center" valign="middle" >0.925672</td><td align="center" valign="middle" >0.925672</td><td align="center" valign="middle" >0.921786</td><td align="center" valign="middle" >0.922730</td></tr><tr><td align="center" valign="middle" >8</td><td align="center" valign="middle" >0.930037</td><td align="center" valign="middle" >0.930037</td><td align="center" valign="middle" >0.930037</td><td align="center" valign="middle" >0.926434</td><td align="center" valign="middle" >0.927707</td></tr><tr><td align="center" valign="middle" >9</td><td align="center" valign="middle" >0.931577</td><td align="center" valign="middle" >0.931974</td><td align="center" valign="middle" >0.931974</td><td align="center" valign="middle" 
>0.930089</td><td align="center" valign="middle" >0.930469</td></tr><tr><td align="center" valign="middle" >10</td><td align="center" valign="middle" >0.933271</td><td align="center" valign="middle" >0.933271</td><td align="center" valign="middle" >0.933271</td><td align="center" valign="middle" >0.931636</td><td align="center" valign="middle" >0.933069</td></tr><tr><td align="center" valign="middle" >11</td><td align="center" valign="middle" >0.934016</td><td align="center" valign="middle" >0.934016</td><td align="center" valign="middle" >0.934016</td><td align="center" valign="middle" >0.932646</td><td align="center" valign="middle" >0.933897</td></tr><tr><td align="center" valign="middle" >12</td><td align="center" valign="middle" >0.934320</td><td align="center" valign="middle" >0.934341</td><td align="center" valign="middle" >0.934341</td><td align="center" valign="middle" >0.933326</td><td align="center" valign="middle" >0.934223</td></tr><tr><td align="center" valign="middle" >13</td><td align="center" valign="middle" >0.934592</td><td align="center" valign="middle" >0.934617</td><td align="center" valign="middle" >0.934617</td><td align="center" valign="middle" >0.934060</td><td align="center" valign="middle" >0.934529</td></tr><tr><td align="center" valign="middle" >14</td><td align="center" valign="middle" >0.652253</td><td align="center" valign="middle" >0.934866</td><td align="center" valign="middle" >0.934866</td><td align="center" valign="middle" >0.934638</td><td align="center" valign="middle" >0.934806</td></tr></tbody></table></table-wrap><p><sup>†</sup>: sequential replacement and <sup>‡</sup>: adaptive sequential replacement.</p><table-wrap id="table5" ><label><xref ref-type="table" rid="table5">Table 5</xref></label><caption><title> Adjusted coefficients of determination of k-marker subset (k = 1 to 10) for four selection methods for the data with barley heading date and 391 SNPs [<xref ref-type="bibr" rid="scirp.128498-ref33">33</xref>] 
</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >k</th><th align="center" valign="middle" >SR<sup>†</sup></th><th align="center" valign="middle" >ASR<sup>‡</sup></th><th align="center" valign="middle" >Forward</th><th align="center" valign="middle" >Backward</th></tr></thead><tr><td align="center" valign="middle" >1</td><td align="center" valign="middle" >0.318846</td><td align="center" valign="middle" >0.318846</td><td align="center" valign="middle" >0.318846</td><td align="center" valign="middle" >0.268571</td></tr><tr><td align="center" valign="middle" >2</td><td align="center" valign="middle" >0.371503</td><td align="center" valign="middle" >0.371503</td><td align="center" valign="middle" >0.371503</td><td align="center" valign="middle" >0.363640</td></tr><tr><td align="center" valign="middle" >3</td><td align="center" valign="middle" >0.407045</td><td align="center" valign="middle" >0.407045</td><td align="center" valign="middle" >0.407045</td><td align="center" valign="middle" >0.402651</td></tr><tr><td align="center" valign="middle" >4</td><td align="center" valign="middle" >0.434137</td><td align="center" valign="middle" >0.434137</td><td align="center" valign="middle" >0.432367</td><td align="center" valign="middle" >0.423071</td></tr><tr><td align="center" valign="middle" >5</td><td align="center" valign="middle" >0.450320</td><td align="center" valign="middle" >0.450409</td><td align="center" valign="middle" >0.450128</td><td align="center" valign="middle" >0.440955</td></tr><tr><td align="center" valign="middle" >6</td><td align="center" valign="middle" >0.467692</td><td align="center" valign="middle" >0.467692</td><td align="center" valign="middle" >0.462793</td><td align="center" valign="middle" >0.460830</td></tr><tr><td align="center" valign="middle" >7</td><td align="center" valign="middle" >0.483355</td><td align="center" valign="middle" >0.483355</td><td align="center" valign="middle" >0.475074</td><td align="center" 
valign="middle" >0.474516</td></tr><tr><td align="center" valign="middle" >8</td><td align="center" valign="middle" >0.492946</td><td align="center" valign="middle" >0.494132</td><td align="center" valign="middle" >0.486796</td><td align="center" valign="middle" >0.490362</td></tr><tr><td align="center" valign="middle" >9</td><td align="center" valign="middle" >0.501345</td><td align="center" valign="middle" >0.504865</td><td align="center" valign="middle" >0.498145</td><td align="center" valign="middle" >0.503484</td></tr><tr><td align="center" valign="middle" >10</td><td align="center" valign="middle" >0.513704</td><td align="center" valign="middle" >0.514244</td><td align="center" valign="middle" >0.508404</td><td align="center" valign="middle" >0.504725</td></tr></tbody></table></table-wrap><p><sup>†</sup>: sequential replacement and <sup>‡</sup>: adaptive sequential replacement.</p></sec><sec id="s3_3"><title>3.3. Repeatability, Consistency, and Speed</title><p>Repeatability and consistency of a method are important when the method involves stochasticity. The data analyses in Applications 1 and 2 of this study were repeated independently 20 times with the ASR method. The results, including the mean, minimum, and maximum of the adjusted coefficients of determination for different k-marker models, are summarized in <xref ref-type="table" rid="table6">Table 6</xref> and <xref ref-type="table" rid="table7">Table 7</xref>, respectively. The results showed that the probability of the best model being determined for each k-marker set was at least 65% for the first application (<xref ref-type="table" rid="table6">Table 6</xref>), while this probability varied widely among different k-marker sets for the second application, from 10% (k = 6) to 100% (k = 1 to 3) (<xref ref-type="table" rid="table7">Table 7</xref>). 
The differences among the mean, minimum, and maximum of the adjusted coefficients of determination for each k-marker set were very small in both cases (<xref ref-type="table" rid="table6">Table 6</xref> and <xref ref-type="table" rid="table7">Table 7</xref>). For example, the difference between minimum and maximum of adjusted</p><table-wrap id="table6" ><label><xref ref-type="table" rid="table6">Table 6</xref></label><caption><title> Minimum, maximum, and mean values over 20 replications for R<sub>A</sub><sup>2</sup> for different k-marker models for the data with fruit fly wing shape and 37 RFLP markers [<xref ref-type="bibr" rid="scirp.128498-ref31">31</xref>] </title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >k = 1</th><th align="center" valign="middle" >k = 2</th><th align="center" valign="middle" >k = 3</th><th align="center" valign="middle" >k = 4</th><th align="center" valign="middle" >k = 5</th></tr></thead><tr><td align="center" valign="middle" >Mean</td><td align="center" valign="middle" >0.514068</td><td align="center" valign="middle" >0.688750</td><td align="center" valign="middle" >0.843880</td><td align="center" valign="middle" >0.877727</td><td align="center" valign="middle" >0.903693</td></tr><tr><td align="center" valign="middle" >Min</td><td align="center" valign="middle" >0.514068</td><td align="center" valign="middle" >0.688750</td><td align="center" valign="middle" >0.843880</td><td align="center" valign="middle" >0.877675</td><td align="center" valign="middle" >0.903693</td></tr><tr><td align="center" valign="middle" >Max</td><td align="center" valign="middle" >0.514068(20<sup>†</sup>)</td><td align="center" valign="middle" >0.688750(20)</td><td align="center" valign="middle" >0.843880(20)</td><td align="center" valign="middle" >0.877729(19)</td><td align="center" valign="middle" >0.903693(20)</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >k = 6</td><td 
align="center" valign="middle" >k = 7</td><td align="center" valign="middle" >k = 8</td><td align="center" valign="middle" >k = 9</td><td align="center" valign="middle" >k = 10</td></tr><tr><td align="center" valign="middle" >Mean</td><td align="center" valign="middle" >0.920797</td><td align="center" valign="middle" >0.925672</td><td align="center" valign="middle" >0.930037</td><td align="center" valign="middle" >0.931974</td><td align="center" valign="middle" >0.933193</td></tr><tr><td align="center" valign="middle" >Min</td><td align="center" valign="middle" >0.920560</td><td align="center" valign="middle" >0.925672</td><td align="center" valign="middle" >0.930037</td><td align="center" valign="middle" >0.931974</td><td align="center" valign="middle" >0.932899</td></tr><tr><td align="center" valign="middle" >Max</td><td align="center" valign="middle" >0.920823(18)</td><td align="center" valign="middle" >0.925672(20)</td><td align="center" valign="middle" >0.930037(20)</td><td align="center" valign="middle" >0.931974(20)</td><td align="center" valign="middle" >0.933271(15)</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >k = 11</td><td align="center" valign="middle" >k = 12</td><td align="center" valign="middle" >k = 13</td><td align="center" valign="middle" >k = 14</td><td align="center" valign="middle" ></td></tr><tr><td align="center" valign="middle" >Mean</td><td align="center" valign="middle" >0.934005</td><td align="center" valign="middle" >0.934332</td><td align="center" valign="middle" >0.934608</td><td align="center" valign="middle" >0.934864</td><td align="center" valign="middle" ></td></tr><tr><td align="center" valign="middle" >Min</td><td align="center" valign="middle" >0.933901</td><td align="center" valign="middle" >0.934286</td><td align="center" valign="middle" >0.934498</td><td align="center" valign="middle" >0.934856</td><td align="center" valign="middle" ></td></tr><tr><td align="center" valign="middle" 
>Max</td><td align="center" valign="middle" >0.934016(18)</td><td align="center" valign="middle" >0.934341(13)</td><td align="center" valign="middle" >0.934616(17)</td><td align="center" valign="middle" >0.934866(16)</td><td align="center" valign="middle" ></td></tr></tbody></table></table-wrap><p><sup>†</sup>: The number of times the maximum R<sub>A</sub><sup>2</sup> was reached over 20 independent trials.</p><table-wrap id="table7" ><label><xref ref-type="table" rid="table7">Table 7</xref></label><caption><title> Minimum, maximum, and mean values over 20 replications for R<sub>A</sub><sup>2</sup> for different k-marker models for the data with barley heading date and 391 SNPs [<xref ref-type="bibr" rid="scirp.128498-ref33">33</xref>] </title></caption><table><tbody><thead><tr><th align="center" valign="middle" ></th><th align="center" valign="middle" >k = 1</th><th align="center" valign="middle" >k = 2</th><th align="center" valign="middle" >k = 3</th><th align="center" valign="middle" >k = 4</th><th align="center" valign="middle" >k = 5</th></tr></thead><tr><td align="center" valign="middle" >Mean</td><td align="center" valign="middle" >0.318846</td><td align="center" valign="middle" >0.371503</td><td align="center" valign="middle" >0.407045</td><td align="center" valign="middle" >0.433075</td><td align="center" valign="middle" >0.450381</td></tr><tr><td align="center" valign="middle" >Min</td><td align="center" valign="middle" >0.318846</td><td align="center" valign="middle" >0.371503</td><td align="center" valign="middle" >0.407045</td><td align="center" valign="middle" >0.432367</td><td align="center" valign="middle" >0.450128</td></tr><tr><td align="center" valign="middle" >Max</td><td align="center" valign="middle" >0.318846(20<sup>†</sup>)</td><td align="center" valign="middle" >0.371503(20)</td><td align="center" valign="middle" >0.407045(20)</td><td align="center" valign="middle" >0.434137(8)</td><td align="center" valign="middle" >0.450409(18)</td></tr><tr><td align="center" valign="middle" 
></td><td align="center" valign="middle" >k = 6</td><td align="center" valign="middle" >k = 7</td><td align="center" valign="middle" >k = 8</td><td align="center" valign="middle" >k = 9</td><td align="center" valign="middle" >k = 10</td></tr><tr><td align="center" valign="middle" >Mean</td><td align="center" valign="middle" >0.466777</td><td align="center" valign="middle" >0.481149</td><td align="center" valign="middle" >0.493443</td><td align="center" valign="middle" >0.504992</td><td align="center" valign="middle" >0.514079</td></tr><tr><td align="center" valign="middle" >Min</td><td align="center" valign="middle" >0.466675</td><td align="center" valign="middle" >0.477830</td><td align="center" valign="middle" >0.489611</td><td align="center" valign="middle" >0.502217</td><td align="center" valign="middle" >0.513864</td></tr><tr><td align="center" valign="middle" >Max</td><td align="center" valign="middle" >0.467692(2)</td><td align="center" valign="middle" >0.483355(10)</td><td align="center" valign="middle" >0.494132(14)</td><td align="center" valign="middle" >0.505904(5)</td><td align="center" valign="middle" >0.514667(5)</td></tr></tbody></table></table-wrap><p><sup>†</sup>: The number of times the maximum R<sub>A</sub><sup>2</sup> was reached over 20 independent trials.</p><p>coefficients of determination was less than 0.0004 (equivalent to less than 0.040% below the exhaustive search result) across the 14 cases in application 1 (<xref ref-type="table" rid="table6">Table 6</xref>) and less than 0.006 (equivalent to less than 1.143% below the best model) in application 2 (<xref ref-type="table" rid="table7">Table 7</xref>). The results in <xref ref-type="table" rid="table6">Table 6</xref> and <xref ref-type="table" rid="table7">Table 7</xref> showed that the ASR method is repeatable and consistent in searching for the best k-variable model. 
However, repeating steps 1 to 4 several times is recommended to reach better solutions when a large number of genetic markers are closely linked.</p><p>The time used for selecting the best k-variable set gives some insight into using the ASR method. The computer used for data processing was a Dell laptop with an Intel<sup>&#174;</sup> Xeon<sup>&#174;</sup> W-11855M CPU @ 3.20 GHz and 64.0 GB RAM. With the SR method, the time used in application 2 was less than 1 second for k = 1 to 10. With the ASR method, it averaged 28 minutes in total (k = 1 to 10) over 20 replications. Compared to the SR method, the ASR method is slow; however, this amount of time is appealing because the method can offer improved power at a much lower computational cost than the exhaustive method. With the ASR method, the analysis for k = 1 to 10 in application 2 was completed within a lunch break, which is acceptable.</p></sec></sec><sec id="s4"><title>4. Discussion</title><p>Selecting a set of markers that captures the maximum genetic variation from high-throughput data is highly desirable. The exhaustive search method is guaranteed to achieve the best solutions, but it can be computationally prohibitive. On the other hand, many stepwise variable selection methods in linear regression analysis are heuristic and approximate and do not guarantee the optimal solution, as shown in our two applications (<xref ref-type="table" rid="table4">Table 4</xref> and <xref ref-type="table" rid="table5">Table 5</xref>), though they offer desirable computational speed. The SR method starts with a model selected by a forward or stepwise selection method and then sequentially replaces each variable in the model with the remaining variables [<xref ref-type="bibr" rid="scirp.128498-ref18">18</xref>] [<xref ref-type="bibr" rid="scirp.128498-ref34">34</xref>]. Mathematically, the SR method should be better than forward and backward selection methods. 
Therefore, the SR method has the advantage of speed, but it could result in a model that is not globally optimal because it lacks stochasticity. In this study, we were motivated to develop the ASR method, which has a high likelihood of identifying a k-marker set that captures the maximum genetic variation with much reduced computational intensity compared to the exhaustive method.</p><p>The first key feature of the ASR method is stochasticity, which helps avoid local optimal solutions. Our ASR selection method starts with a completely random k-variable subset every time, as described in step 1. Thus, increasing the number of random starts should increase the possibility of avoiding local optimal solutions and thus improve the possibility of reaching the global optimal solution. However, such practice may add significant computational burden compared to the SR method. The other key feature of the ASR method is adaptivity, so that the search process can be terminated once the optimal solution is achieved, as stated in condition (1) at step 4. Our logic is that there could be several local optimal solutions for a k-variable model, but the global optimal solution is unique. If the best solution has appeared at least three times during the search process, we take this as evidence that the global solution has been achieved and that no additional search is needed. For example, we preset 50 random starts at step 4, but the method reached the global solution with as few as five to 19 iterations for k = 2 to 14 in application 1, which greatly reduced computational intensity. However, users may need to increase the iteration number in condition (3) at step 4 or N at step 5 to increase the likelihood of achieving the best solution if condition (1) at step 4 is not met. 
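</p><p>The random-start search with adaptive stopping discussed above can be illustrated with a short Python sketch. It is not the authors' implementation: steps 1 to 5 are defined elsewhere in the paper, so the function names (sequential_replacement, asr_search), the parameters (max_starts, repeats_needed, seed), and the generic score callable are assumptions made for illustration only. The score callable stands in for any subset criterion, such as the adjusted coefficient of determination of the fitted k-variable model; the defaults of 50 and 3 mirror the 50 random starts and the three appearances of the best solution mentioned in the text.</p><preformat><![CDATA[
```python
import random


def sequential_replacement(subset, p, score):
    """Swap variables in and out until no single swap improves the score.

    subset: list of k distinct variable indices in range(p)
    score:  hypothetical callable mapping a list of indices to a number
            (larger is better), e.g. an adjusted R^2 of the fitted model.
    """
    current = list(subset)
    best = score(current)
    improved = True
    while improved:
        improved = False
        for pos in range(len(current)):
            for cand in range(p):
                if cand in current:
                    continue
                # Replace the variable at position pos with the candidate.
                trial = current[:pos] + [cand] + current[pos + 1:]
                s = score(trial)
                if s > best + 1e-12:
                    current, best, improved = trial, s, True
    return frozenset(current), best


def asr_search(p, k, score, max_starts=50, repeats_needed=3, seed=0):
    """Random-restart sequential replacement with an adaptive stop.

    Assumed stopping rule: quit as soon as the best score found so far
    has been reached by repeats_needed separate random starts, or after
    max_starts random starts, whichever comes first.
    """
    rng = random.Random(seed)
    best_set, best_score, hits = None, float("-inf"), 0
    for start in range(1, max_starts + 1):
        init = rng.sample(range(p), k)  # completely random k-subset
        subset, s = sequential_replacement(init, p, score)
        if s > best_score + 1e-12:
            best_set, best_score, hits = subset, s, 1  # new best solution
        elif abs(s - best_score) <= 1e-12:
            hits += 1                                  # best solution seen again
        if hits >= repeats_needed:
            break
    return best_set, best_score, start


# Toy usage: an additive score, so the best 3-subset holds the three
# largest weights. A real analysis would plug in a regression criterion.
weights = [5, 4, 3, 2, 1, 0, 0, 0]
best_set, best_score, starts_used = asr_search(
    p=8, k=3, score=lambda subset: sum(weights[i] for i in subset))
print(sorted(best_set), best_score, starts_used)
```
]]></preformat><p>In this toy call the score is additive, so every random start converges to the same subset and the search stops after three starts. With a regression criterion and correlated predictors, different starts may end in different local optima, which is the situation the repeated random starts are meant to handle.</p><p>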
In addition, this study showed that the ASR method is robust in obtaining highly repeatable and/or consistent best k-variable/marker models, as shown in <xref ref-type="table" rid="table6">Table 6</xref> and <xref ref-type="table" rid="table7">Table 7</xref>.</p><p>The power of the ASR method was numerically evaluated with simulated data by presetting five contributing variables at different levels of the coefficient of determination, where each coefficient of determination is equivalent to heritability in genetics. Our simulation results showed that the ASR method was able to identify the optimal subset of true variables when the coefficient of determination was at least 0.40. This conclusion was supported by the coefficients of determination for the selected models and the true models and by the mean power of the five preset variables being selected over 200 simulations (<xref ref-type="table" rid="table3">Table 3</xref> and <xref ref-type="table" rid="table4">Table 4</xref>). Even when the preset coefficient of determination was as low as 0.20 and the target variables were highly correlated with noise variables (0.80), the ASR method sustained a high power, as demonstrated in <xref ref-type="table" rid="table2">Table 2</xref>. In addition, we noticed that each individual R<sup>2</sup> value obtained by the ASR method was equal to or slightly greater than that for the true model for each simulated data set due to Type I error (<xref ref-type="table" rid="table4">Table 4</xref>, the individual results not provided here), indicating that the ASR method is likely capable of selecting a better model. However, Type I error should be expected when the coefficient of determination or heritability is low.</p><p>Because of the small number of variables in application 1, we were able to compare five methods: the forward, backward, SR, exhaustive, and ASR methods. 
Both the ASR and exhaustive methods achieved identical k-variable models (k = 1 to 14) (<xref ref-type="table" rid="table4">Table 4</xref>). The SR method determined the same subset of variables as the exhaustive method in most cases, indicating that the SR method has the ability to achieve the best model. However, some models identified by the SR method occasionally had slightly lower coefficients of determination than those determined by the exhaustive and ASR methods. In two cases, the subsets determined by the SR method showed far lower coefficients of determination than those of the other four methods (k = 7 and 14 in the first application). These results showed that the SR method may sometimes lack consistency in generating the optimal solutions. In application 2, the ASR method had an equal or higher adjusted coefficient of determination compared to the SR method for k = 1 to 10 (<xref ref-type="table" rid="table5">Table 5</xref>), suggesting that the ASR method sustains an improved power and is preferred over the SR method for identifying better models.</p><p>Several key factors may influence the possibility of finding the optimal model. The first factor is the degree to which the subset is linearly associated with the response/dependent variable. This is equivalent to heritability in a genetic association mapping study. Higher heritability is associated with higher power to identify the best model. The second factor is the degree of collinearity among the variables associated with the dependent variable. With strong linear associations between the predictive variables and the response variable and weak collinearity among the predictive variables, the optimal k-variable model is reached much more rapidly than with weak associations and/or strong collinearity among the predictive variables. 
Although increasing the number of random starts (iteration number) helps achieve the optimal solution, which is one desirable feature of this ASR algorithm, more iterations are required when the number of variables or genetic markers is high. For example, more than 100 iterations are likely required to meet condition (1) or (2) at step 4 for k ≥ 6 in the second application of this study.</p><p>Many association and QTL mapping studies showed that even though a single marker/locus was significantly associated with a quantitative trait of interest, using the single marker for marker-assisted selection (MAS) was still far from the efficiency needed for breeding selection. Thus, selecting k markers as a subset that can capture desirable genetic variation is desired for MAS application. However, more markers are not necessarily better. In breeding practice, each additional bi-allelic DNA marker used for MAS would double the field/lab work. On the other hand, our previous study on barley association mapping analysis showed that many selected SNP markers were significant, yet the total coefficient of determination stabilized at some point as the number of SNP markers increased during our forward selection process [<xref ref-type="bibr" rid="scirp.128498-ref35">35</xref>]. The results from this study as presented in <xref ref-type="table" rid="table3">Table 3</xref> and <xref ref-type="table" rid="table4">Table 4</xref> also showed a similar pattern. Therefore, the number of markers/variables to select should be determined by several key factors, such as the degree of association between the selected genetic markers and the trait of interest and the affordability/availability of labor and land. The ASR method in this study can help breeders capture the maximum genetic variation associated with a particular k-marker set.</p><p>The ASR selection method can potentially identify the best k-variable subset. 
It is also possible to extend this method to forward and backward variable selection with slight modifications. For example, if all k variables in the model are significant, then steps 1 to 5 can be repeated with k + 1 variables. This process can be repeated until no more new variables can be added, which corresponds to ASR-based forward selection. On the other hand, if one or more variables are not significant in the k-variable solution, then steps 1 to 5 can be repeated with k − 1 variables. This process continues until no more variables can be eliminated, which corresponds to ASR-based backward selection. Additional comparisons between ASR-based forward/backward selection methods and commonly used forward/backward selection methods are ongoing.</p></sec><sec id="s5"><title>Acknowledgements</title><p>This study was partially supported by USDA-ARS (project # 6064-21000-016) and the USDA-NIFA hatch project (SD00H525-14) while the senior author was formerly working at South Dakota State University.</p></sec><sec id="s6"><title>Disclaimer</title><p>Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity provider and employer.</p></sec><sec id="s7"><title>Conflicts of Interest</title><p>The authors declare no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s8"><title>Cite this paper</title><p>Wu, J.X., Jenkins, J.N. and McCarty Jr., J.C. (2023) An Adaptive Sequential Replacement Method for Variable Selection in Linear Regression Analysis. Open Journal of Statistics, 13, 746-760. https://doi.org/10.4236/ojs.2023.135036</p></sec></body><back><ref-list><title>References</title><ref id="scirp.128498-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Jansen, R.C. 
(1993) Interval Mapping of Multiple Quantitative Trait Loci. Genetics, 135, 205-211. https://doi.org/10.1093/genetics/135.1.205</mixed-citation></ref><ref id="scirp.128498-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Jansen, R.C. and Stam, P. (1994) High-Resolution of Quantitative Traits into Multiple Loci via Interval Mapping. Genetics, 136, 1447-1455.  
https://doi.org/10.1093/genetics/136.4.1447</mixed-citation></ref><ref id="scirp.128498-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Xu, S.H. (1995) A Comment on the Simple Regression Method for Interval Mapping. Genetics, 141, 1657-1659. https://doi.org/10.1093/genetics/141.4.1657</mixed-citation></ref><ref id="scirp.128498-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Zeng, Z.B. (1994) Precision Mapping of Quantitative Trait Loci. Genetics, 136, 1457-1468. https://doi.org/10.1093/genetics/136.4.1457</mixed-citation></ref><ref id="scirp.128498-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Zeng, Z.B. (2005) QTL Mapping and the Genetic Basis of Adaptation: Recent Developments. Genetica, 123, 25-37. https://doi.org/10.1007/s10709-004-2705-0</mixed-citation></ref><ref id="scirp.128498-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Haley, C.S. and Knott, S.A. (1992) A Simple Regression Method for Mapping Quantitative Trait Loci in Line Crosses Using Flanking Markers. Heredity, 69, 315-324.  
https://doi.org/10.1038/hdy.1992.131</mixed-citation></ref><ref id="scirp.128498-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Lander, E.S. and Botstein, D. (1989) Mapping Mendelian Factors Underlying Quantitative Traits Using RFLP Linkage Maps. Genetics, 121, 185-199.  
https://doi.org/10.1093/genetics/121.1.185</mixed-citation></ref><ref id="scirp.128498-ref8"><label>8</label><mixed-citation publication-type="book" xlink:type="simple">Hayes, B. (2013) Overview of Statistical Methods for Genome-Wide Association Studies (GWAS). In: Gondro, C., van der Werf, J. and Hayes, B., Eds., Genome-Wide Association Studies and Genomic Prediction, Humana Press, Totowa, 149-169.  
https://doi.org/10.1007/978-1-62703-447-0_6</mixed-citation></ref><ref id="scirp.128498-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Yu, J.M., Pressoir, G., Briggs, W.H., Bi, I.V., Yamasaki, M., Doebley, J.F., McMullen, M.D., Gaut, B.S., Nielsen, D.M., Holland, J.B., Kresovich, S. and Buckler, E.S. (2006) A Unified Mixed-Model Method for Association Mapping That Accounts for Multiple Levels of Relatedness. Nature Genetics, 38, 203-208.  
https://doi.org/10.1038/ng1702</mixed-citation></ref><ref id="scirp.128498-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Berk, K.N. (1977) Tolerance and Condition in Regression Computations. Journal of the American Statistical Association, 72, 863-866.  
https://doi.org/10.1080/01621459.1977.10479972</mixed-citation></ref><ref id="scirp.128498-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Gorman, J.W. and Toman, R.J. (1966) Selection of Variables for Fitting Equations to Data. Technometrics, 8, 27-51. https://doi.org/10.1080/00401706.1966.10490322</mixed-citation></ref><ref id="scirp.128498-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Efroymson, M. (1966) Stepwise Regression—A Backward and Forward Look. Eastern Regional Meetings of the Institute of Mathematical Statistics, Florham Park, New Jersey.</mixed-citation></ref><ref id="scirp.128498-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Draper, N. and Smith, H. (1966) Applied Regression Analysis. John Wiley &amp; Sons, New York.</mixed-citation></ref><ref id="scirp.128498-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Lumley, T. (2020) Regression Subset Selection. Version 3.1.  
https://CRAN.R-project.org/package=leaps</mixed-citation></ref><ref id="scirp.128498-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Hebbali, A. (2020) Olsrr: Tools for Building OLS Regression Models. Version 0.5.3. https://CRAN.R-project.org/package=olsrr</mixed-citation></ref><ref id="scirp.128498-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics with S. Springer, New York. https://doi.org/10.1007/978-0-387-21706-2</mixed-citation></ref><ref id="scirp.128498-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Hocking, R.R. (1976) The Analysis and Selection of Variables in Linear Regression (A Biometrics Invited Paper). Biometrics, 32, 1-49. https://doi.org/10.2307/2529336</mixed-citation></ref><ref id="scirp.128498-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Miller, A.J. (2002) Subset Selection in Regression. Chapman &amp; Hall/CRC, Boca Raton.</mixed-citation></ref><ref id="scirp.128498-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Mantel, N. (1970) Why Stepdown Procedures in Variable Selection. Technometrics, 12, 621-625. https://doi.org/10.1080/00401706.1970.10488701</mixed-citation></ref><ref id="scirp.128498-ref20"><label>20</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Huberty</surname><given-names> C.J. </given-names></name>,<etal>et al</etal>. (<year>1989</year>)<article-title>Problems with Stepwise Methods—Better Alternatives</article-title><source> Advances in Social Science Methodology</source><volume> 1</volume>,<fpage> 43</fpage>-<lpage>70</lpage>.<pub-id pub-id-type="doi"></pub-id></mixed-citation></ref><ref id="scirp.128498-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">Kirton, H.C. 
(1967) Best Models in Multiple Regression Analysis. N.S.W. Department of Agriculture, Sydney.</mixed-citation></ref><ref id="scirp.128498-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">Hocking, R.R. and Leslie, R.N. (1967) Selection of the Best Subset in Regression Analysis. Technometrics, 9, 531-540.  
https://doi.org/10.1080/00401706.1967.10490502</mixed-citation></ref><ref id="scirp.128498-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">Beale, E.M.L., Kendall, M.G. and Mann, D.W. (1967) The Discarding of Variables in Multivariate Analysis. Biometrika, 54, 357-366.  
https://doi.org/10.1093/biomet/54.3-4.357</mixed-citation></ref><ref id="scirp.128498-ref24"><label>24</label><mixed-citation publication-type="other" xlink:type="simple">LaMotte, L.R. and Hocking, R.R. (1970) Computational Efficiency in the Selection of Regression Variables. Technometrics, 12, 83-93.  
https://doi.org/10.1080/00401706.1970.10488636</mixed-citation></ref><ref id="scirp.128498-ref25"><label>25</label><mixed-citation publication-type="other" xlink:type="simple">Furnival, G.M. and Wilson, R.W. (1974) Regression by Leaps and Bounds. Technometrics, 42, 69-79. https://doi.org/10.1080/00401706.2000.10485982</mixed-citation></ref><ref id="scirp.128498-ref26"><label>26</label><mixed-citation publication-type="other" xlink:type="simple">Miller, A.J. (1984) Selection of Subsets of Regression Variables. Journal of the Royal Statistical Society Series A, 147, 389-425. https://doi.org/10.2307/2981576</mixed-citation></ref><ref id="scirp.128498-ref27"><label>27</label><mixed-citation publication-type="other" xlink:type="simple">Barr, A.J., Goodnight, J.H. and Sall, J.P. (1979) SAS User’s Guide. SAS Institute, Raleigh.</mixed-citation></ref><ref id="scirp.128498-ref28"><label>28</label><mixed-citation publication-type="other" xlink:type="simple">Lerner, J.V. and Games, P.A. (1981) Maximum R2 Improvement and Stepwise Multiple Regression as Related to Overfitting. Psychological Reports, 48, 979-983.  
https://doi.org/10.2466/pr0.1981.48.3.979</mixed-citation></ref><ref id="scirp.128498-ref29"><label>29</label><mixed-citation publication-type="other" xlink:type="simple">R Core Team (2023) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.</mixed-citation></ref><ref id="scirp.128498-ref30"><label>30</label><mixed-citation publication-type="other" xlink:type="simple">RStudio Team (2022) RStudio: Integrated Development for R. RStudio, Inc., Boston.</mixed-citation></ref><ref id="scirp.128498-ref31"><label>31</label><mixed-citation publication-type="other" xlink:type="simple">Weber, K., Eisman, R., Higgins, S., Morey, L., Patty, A., Tausek, M. and Zeng, Z.B. (2001) An Analysis of Polygenes Affecting Wing Shape on Chromosome 2 in Drosophila Melanogaster. Genetics, 159, 1045-1057.  
https://doi.org/10.1093/genetics/159.3.1045</mixed-citation></ref><ref id="scirp.128498-ref32"><label>32</label><mixed-citation publication-type="other" xlink:type="simple">Xu, Y., Wu, Y. and Wu, J. (2018) Capturing Pair-Wise Epistatic Effects Associated with Three Agronomic Traits in Barley. Genetica, 146, 161-170.  
https://doi.org/10.1007/s10709-018-0008-0</mixed-citation></ref><ref id="scirp.128498-ref33"><label>33</label><mixed-citation publication-type="other" xlink:type="simple">Xu, Y., Bai, G.H., Graybosch, R., Wu, Y. and Wu, J. (2017) Marker Association Analysis with Three Agronomic Traits in Hard Winter Wheat Lines under Diverse Environments. Journal of Applied Bioinformatics &amp; Computational Biology, 6, 2. https://doi.org/10.4172/2329-9533.1000136</mixed-citation></ref><ref id="scirp.128498-ref34"><label>34</label><mixed-citation publication-type="other" xlink:type="simple">Myers, R.H. (1990) Classical and Modern Regression with Applications. PWS-KENT Publishing Company, Boston.</mixed-citation></ref><ref id="scirp.128498-ref35"><label>35</label><mixed-citation publication-type="other" xlink:type="simple">Xu, Y., Wu, Y., Gonda, M. and Wu, J. (2015) A Linkage Based Imputation Method for Missing SNP Markers in Association Mapping. Journal of Applied Bioinformatics &amp; Computational Biology, 4, 1.</mixed-citation></ref></ref-list></back></article>