<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JCC</journal-id><journal-title-group><journal-title>Journal of Computer and Communications</journal-title></journal-title-group><issn pub-type="epub">2327-5219</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jcc.2019.710005</article-id><article-id pub-id-type="publisher-id">JCC-95688</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  Initial Value Filtering Optimizes Fast Global K-Means
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Han</surname><given-names>Jintao</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Li</surname><given-names>Haiming</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>School of Computer Science and Technology, Shanghai University of Electric Power, Shanghai, China</addr-line></aff><pub-date pub-type="epub"><day>10</day><month>10</month><year>2019</year></pub-date><volume>07</volume><issue>10</issue><fpage>52</fpage><lpage>62</lpage><history><date date-type="received"><day>2</day><month>September</month><year>2019</year></date><date date-type="rev-recd"><day>11</day><month>October</month><year>2019</year></date><date date-type="accepted"><day>14</day><month>October</month><year>2019</year></date></history><permissions><copyright-statement>&#169; Copyright 2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  The K-means clustering algorithm is an important unsupervised learning algorithm and plays an important role in big data processing, computer vision and other research fields. However, because it is sensitive to the initial partition, outliers, noise and other factors, its clustering results in data analysis, image segmentation and other fields are unstable and not robust. Building on the fast global K-means clustering algorithm, this paper proposes an improved K-means clustering algorithm. Through a neighborhood filtering mechanism, the points in the neighborhood of an already selected initial clustering center do not participate in the selection of the next initial clustering center, which reduces the randomness of the initial partition and makes the initial partition more efficient. Mahalanobis distance is used in the clustering process to better account for the global structure of the data. Compared with the traditional clustering algorithm and other optimized algorithms, the proposed algorithm gives improved results in tests on a real data set.
 
</p></abstract><kwd-group><kwd>K-Means</kwd><kwd>Cluster</kwd><kwd>Neighbourhood</kwd><kwd>Mahalanobis Distance</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>With the development of artificial intelligence, researchers have explored more and more application scenarios for intelligent algorithms [<xref ref-type="bibr" rid="scirp.95688-ref1">1</xref>], and various machine learning algorithms have become research hotspots. Machine learning algorithms can be roughly divided into supervised learning, unsupervised learning and semi-supervised learning. The K-means algorithm is an important clustering algorithm in unsupervised learning [<xref ref-type="bibr" rid="scirp.95688-ref2">2</xref>]. It plays an important role not only in big data analysis but also in computer vision tasks such as image segmentation [<xref ref-type="bibr" rid="scirp.95688-ref3">3</xref>].</p><p>The K-means algorithm is simple and easy to understand, and is usually the first choice for cluster analysis of large samples [<xref ref-type="bibr" rid="scirp.95688-ref4">4</xref>]. However, in the traditional K-means algorithm, the number of clustering centers is chosen empirically by inspecting the data, and the initial locations of the clustering centers are random. As a result, the algorithm is unstable and easily affected by noise and outliers. In recent years, researchers have developed many optimized algorithms [<xref ref-type="bibr" rid="scirp.95688-ref5">5</xref>] - [<xref ref-type="bibr" rid="scirp.95688-ref12">12</xref>]. For example, paper [<xref ref-type="bibr" rid="scirp.95688-ref5">5</xref>] used residual analysis to automatically obtain the initial cluster centers and the number of clusters from the decision graph, which removes the need to specify the number of clusters manually. 
However, this method is complex to implement and performs poorly on sparsely distributed data sets. In paper [<xref ref-type="bibr" rid="scirp.95688-ref6">6</xref>], the median was used as the clustering center object together with the K-means++ clustering method, which gives a better clustering effect than the traditional clustering method, but the amount of computation is large and the time complexity increases. Literature [<xref ref-type="bibr" rid="scirp.95688-ref7">7</xref>] takes the point with the largest number of nearest-neighbor data points as the initial center point, which has been applied effectively to anomaly detection in marine data, but its effect on massive high-dimensional data is weak.</p><p>This paper presents a fast global K-means optimization algorithm based on neighborhood screening. While replacing the random selection of the K clustering centers used in traditional K-means, it speeds up the search for the initial clustering centers. In addition, Mahalanobis distance [<xref ref-type="bibr" rid="scirp.95688-ref13">13</xref>] is used in the clustering process, which makes the clustering take the global structure of the data into account and makes the algorithm more suitable for image processing.</p></sec><sec id="s2"><title>2. K-Means Clustering</title><p>The K-means algorithm is a classical clustering algorithm with a wide range of applications. This section summarizes the algorithm and the optimized algorithms derived from it.</p><sec id="s2_1"><title>2.1. Traditional K-Means Clustering</title><p>The execution of the classic K-means algorithm is divided into the following steps:</p><p>Step 1: The user inputs the parameter K [<xref ref-type="bibr" rid="scirp.95688-ref5">5</xref>], the number of initial clustering centers, which is generally obtained from the given data samples by empirical observation. 
The algorithm randomly generates K clustering centers <inline-formula><tex-math>m_1, m_2, \cdots, m_k</tex-math></inline-formula>, which represent the clusters <inline-formula><tex-math>c_1, c_2, \cdots, c_k</tex-math></inline-formula>.</p><p>Step 2: Calculate the Euclidean distance from each sample point x<sub>i</sub> in the data set <inline-formula><tex-math>D = \{ x_i \mid x_i \in \mathbb{R}^m, \ i = 1, 2, \cdots, n \}</tex-math></inline-formula> to the K clustering centers [<xref ref-type="bibr" rid="scirp.95688-ref6">6</xref>], and put each sample into the cluster <inline-formula><tex-math>c_j \ (j = 1, 2, \cdots, k)</tex-math></inline-formula> whose clustering center is nearest. The Euclidean distance represents the degree of similarity between the sample point and the cluster center: the smaller the distance, the higher the similarity. The calculation is shown in formula (1).</p><p><disp-formula><label>(1)</label><tex-math>\mathrm{Dist}(x_i, m_j) = (x_i - m_j)^{T} (x_i - m_j)</tex-math></disp-formula></p><p>where <inline-formula><tex-math>i \in \{1, 2, \cdots, n\}</tex-math></inline-formula> and <inline-formula><tex-math>j \in \{1, 2, \cdots, k\}</tex-math></inline-formula>.</p><p>Step 3: Calculate the mean value of all sample points in each cluster, and replace the clustering centers from step 1 with the obtained mean values.</p><p>Step 4: Repeat step 2 and step 3 until the clustering centers obtained in two consecutive iterations no longer change, then end the clustering.</p><p>The traditional K-means clustering algorithm is simple in concept and easy to implement, and is one of the most widely studied and applied clustering algorithms. However, random selection of the initial clustering centers makes the clustering results and the clustering efficiency unstable, and can trap the algorithm in a local optimum [<xref ref-type="bibr" rid="scirp.95688-ref7">7</xref>].</p></sec><sec id="s2_2"><title>2.2. Fast Global K-Means</title><p>The fast global K-means algorithm is an improvement on the traditional K-means algorithm. It finds the initial clustering centers by considering the global data, which reduces the sensitivity of the algorithm to outliers and noise [<xref ref-type="bibr" rid="scirp.95688-ref14">14</xref>] [<xref ref-type="bibr" rid="scirp.95688-ref15">15</xref>]. 
The algorithm flow chart is shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>.</p><p>The calculation of b<sub>n</sub> is shown in formula (2), where N is the total number of samples, <inline-formula><tex-math>d_{k-1}^{j}</tex-math></inline-formula> is the minimum distance between sample point x<sub>j</sub> and the k − 1 existing initial clustering centers, and x<sub>n</sub> is a candidate sample point other than the existing clustering centers.</p><p><disp-formula><label>(2)</label><tex-math>b_n = \sum_{j=1}^{N} \max \left( d_{k-1}^{j} - \| x_n - x_j \|^{2}, \ 0 \right)</tex-math></disp-formula></p><p>This algorithm removes the randomness of the initial clustering centers [<xref ref-type="bibr" rid="scirp.95688-ref8">8</xref>], and reduces the number of clustering iterations and therefore the clustering time. However, when selecting each clustering center, distances must be calculated repeatedly for every sample point, which increases the time complexity of initial value selection.</p></sec><sec id="s2_3"><title>2.3. Global K-Means Algorithm</title><p>The global K-means algorithm mentioned in literature [<xref ref-type="bibr" rid="scirp.95688-ref9">9</xref>] replaces the maximum relative distance b<sub>n</sub> from the existing clustering centers in the fast global K-means algorithm with the maximum absolute distance <inline-formula><tex-math>\| x_n - x_j \|^{2}</tex-math></inline-formula> between two points. During the selection of an initial cluster center, d<sub>j</sub> is calculated only as the distance between the candidate cluster center x<sub>n</sub> and each other sample point x<sub>j</sub>, and the values of d<sub>j</sub> are summed. Finally, the point with the minimum accumulated value is selected as the clustering center.</p><p>This method reduces the number of computational steps needed to select an initial cluster center and reduces the time complexity of the algorithm to some extent. However, it ignores the influence of the already selected initial clustering centers on the next one, which loosens the constraints on initial value selection and increases the randomness.</p></sec></sec><sec id="s3"><title>3. 
Initial Value Filtering Optimizes Fast Global K-Means</title><p>In this paper, the selection of the initial cluster centers is optimized by neighborhood screening. When selecting an initial clustering center, the points within the minimum radius of an existing clustering center do not participate in the selection of the next initial clustering center, which reduces the time the fast global K-means algorithm spends selecting the initial values. When updating the clustering centers, Mahalanobis distance is used instead of Euclidean distance, which lets the algorithm take the global structure of the data into account and makes it more suitable for applications in computer vision.</p><sec id="s3_1"><title>3.1. Neighborhood Filter</title><p>In practical applications, the cluster centers are separated from one another by some distance, so the next cluster center must lie outside a certain neighborhood of the known cluster centers. According to formula (2), a point that maximizes b<sub>n</sub> cannot lie within a certain neighborhood of a known initial cluster center. Therefore, the sample points in that neighborhood do not need to be evaluated in the search for the next initial cluster center. When the overall distribution of the samples is unknown, the size of the neighborhood depends largely on the number of clustering centers k.</p><p>Suppose the first initial cluster center m<sub>1</sub> is located at the median point of the samples, sample x<sub>max</sub> is the sample point farthest from m<sub>1</sub>, and the distance between them is d<sub>mmax</sub>. In the extreme case, the K initial clustering centers are evenly distributed on the line segment joining x<sub>max</sub> and m<sub>1</sub>, with the two endpoints of the segment being two of the initial clustering centers, so the distance between adjacent initial clustering centers is <inline-formula><tex-math>d_r = d_{m\max} / (k - 1)</tex-math></inline-formula>. 
The radius must leave enough sample points as candidates for the next initial clustering center after each initial clustering center is determined, while keeping the time complexity of the algorithm as low as possible. In this paper, the minimum radius R is therefore chosen as half of the spacing d<sub>r</sub>, as in formula (3)</p><p><disp-formula><label>(3)</label><tex-math>R = \frac{d_{m\max}}{2 (k - 1)}</tex-math></disp-formula></p><p>where k is the number of clustering centers and d<sub>mmax</sub> is the maximum distance between any sample point and the first initial clustering center (i.e., the sample median).</p><p>Taking the selection of the second initial clustering center as an example, calculate the distance d<sub>m</sub> between every sample in the initial sample set <inline-formula><tex-math>D = \{ x_1, x_2, \cdots, x_n \}</tex-math></inline-formula> and the first initial clustering center m<sub>1</sub>. The set <inline-formula><tex-math>D_1 = \{ x_1, x_2, \cdots, x_m \}</tex-math></inline-formula>, composed of the sample points whose distance d<sub>m</sub> is greater than R, is selected. Each sample x<sub>n</sub> in D<sub>1</sub> is in turn taken as a candidate for the second clustering center, and b<sub>n</sub> is calculated according to formula (2) to determine the second initial clustering center m<sub>2</sub>.</p><p>Then, the distance between each sample point in D<sub>1</sub> and m<sub>2</sub> is calculated, and the points whose distance is greater than the minimum radius R form the set D<sub>2</sub>. The centers <inline-formula><tex-math>m_3, m_4, \cdots, m_k</tex-math></inline-formula> are obtained in the same way.</p></sec><sec id="s3_2"><title>3.2. Mahalanobis Distance</title><p>Most current research on the K-means clustering algorithm clusters on the basis of Euclidean distance, but Euclidean distance is only suitable for clusters with a spherical structure, and it considers neither the correlation between variables nor the differences in importance among variables [<xref ref-type="bibr" rid="scirp.95688-ref10">10</xref>]. It therefore has shortcomings when applied to highly correlated data and to fuzzy image segmentation. Mahalanobis distance is a method of calculating distance similarity proposed by P. C. 
Mahalanobis, an Indian statistician. It can be used to measure the degree of difference between two random variables that follow the same distribution with covariance matrix Σ. When the covariance matrix Σ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. The Mahalanobis distance is shown in formula (4).</p><p><disp-formula><label>(4)</label><tex-math>M(x_i, x_j) = (x_i - x_j)^{T} \Sigma^{-1} (x_i - x_j)</tex-math></disp-formula></p><p>Here x<sub>i</sub> and x<sub>j</sub> are two vectors in the same sample set, Σ is the covariance matrix of the samples, and M(x<sub>i</sub>, x<sub>j</sub>) is the Mahalanobis distance between the two samples.</p><p>Compared with the Euclidean distance, the Mahalanobis distance reflects the internal relationship between sample attributes [<xref ref-type="bibr" rid="scirp.95688-ref11">11</xref>], describes the global relationship between two sample points, and contains more neighborhood and spatial information [<xref ref-type="bibr" rid="scirp.95688-ref12">12</xref>], so it can give better analysis results in big data processing and image segmentation.</p></sec><sec id="s3_3"><title>3.3. Average Error</title><p>The K-means clustering algorithm usually uses the sum of squared clustering errors D to represent the clustering effect, which is the sum of the squared distances from the samples to the K cluster centers and is defined in formula (5).</p><p><disp-formula><label>(5)</label><tex-math>D = \sum_{j=1}^{K} \sum_{i=1}^{N} | x_i - m_j |^{2}</tex-math></disp-formula></p><p>where x<sub>i</sub> is the i-th of the N samples and m<sub>j</sub> is the j-th of the K clustering centers. To make the values easier to compare, this paper uses the average error L to represent the clustering effect, which is defined in formula (6).</p><p><disp-formula><label>(6)</label><tex-math>L = \frac{1}{N} D</tex-math></disp-formula></p><p>For the same data set, the smaller the value of L, the better the clustering effect.</p></sec><sec id="s3_4"><title>3.4. 
Algorithm Steps</title><p>Steps of the fast global K-means clustering algorithm based on neighborhood screening and Mahalanobis distance:</p><p>Input: K: the number of clusters;</p><p>D: a data set containing n objects.</p><p>Output: the sets and categories of the K clusters.</p><p>Method:</p><p>(1) Calculate the median value of all samples as the initial cluster center m<sub>1</sub> of the first cluster, and set s = 1.</p><p>(2) Calculate the distance d<sub>j</sub> from each sample point x<sub>j</sub> to the clustering center m<sub>1</sub>, take <inline-formula><tex-math>d_{m\max} = \max(d_j), \ j = 1, 2, \cdots, n</tex-math></inline-formula>, and set <inline-formula><tex-math>D_1 = D</tex-math></inline-formula>.</p><p>(3) Calculate the minimum radius R in set D<sub>s</sub>, as shown in formula (3).</p><p>(4) Set s = s + 1; if s &gt; k, jump to (7).</p><p>(5) With <inline-formula><tex-math>m_1, m_2, \cdots, m_{s-1}</tex-math></inline-formula> as the first s − 1 cluster centers, calculate the distance <inline-formula><tex-math>d_{s-1}^{j}</tex-math></inline-formula> from each sample x<sub>j</sub> in the set D<sub>s−1</sub> to the cluster center m<sub>s−1</sub>. The new data set <inline-formula><tex-math>D_s = \{ x_1, x_2, \cdots, x_m \}</tex-math></inline-formula> is composed of the samples x<sub>j</sub> whose distance is greater than R.</p><p>(6) Calculate b<sub>n</sub> according to formula (2), select the x<sub>n</sub> with the largest b<sub>n</sub> as the s-th cluster center m<sub>s</sub>, and jump to (3).</p><p>(7) Calculate the Mahalanobis distance from every sample <inline-formula><tex-math>x_j \ (j = 1, 2, \cdots, n)</tex-math></inline-formula> in set D to each of the k cluster centers <inline-formula><tex-math>m_i \ (i = 1, 2, \cdots, k)</tex-math></inline-formula>, and assign each sample point to the cluster whose center is closest.</p><p>(8) Calculate the sample mean <inline-formula><tex-math>m_i^{N}</tex-math></inline-formula> of each cluster. If <inline-formula><tex-math>m_i^{N} = m_i</tex-math></inline-formula> for every cluster, jump to (9); otherwise, set <inline-formula><tex-math>m_i = m_i^{N}</tex-math></inline-formula> and jump to (7).</p><p>(9) Output the sets, class numbers and average error of the K clusters, and end the clustering.</p></sec></sec><sec id="s4"><title>4. 
Experiment and Results</title><p>Experimental environment: a 64-bit Windows 10 system with an Intel Core i5 8th Gen CPU and 8 GB of memory, using the Python 3.5 development environment and the PyCharm IDE.</p><p>Experimental data: a two-dimensional data set and the Wine quality-red standard data set from UCI were used in the experiment. Data source: http://archive.ics.uci.edu/ml/.</p><sec id="s4_1"><title>4.1. Simulation Result</title><p>In this paper, the running time of the algorithm and the average error (the mean of the sum of squared errors) are used as the evaluation criteria of the clustering effect.</p><p>Clustering experiments were carried out with the traditional K-means algorithm, the fast global K-means algorithm (FGK-means), the fast global K-means algorithm based on neighborhood screening (RFGK-means), and the fast global K-means algorithm based on neighborhood screening and Mahalanobis distance (RMFGK-means). On the two-dimensional data set, the number of clustering centers was set to 4, and the clustering effect is shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>.</p><p>Each algorithm was run 10 times and the results were averaged. The time and average error of each algorithm are shown in <xref ref-type="table" rid="table1">Table 1</xref> (rounded to three decimal places).</p><p>The Wine quality-red file contains 1600 rows, and each row has 12 attributes, of which the quality attribute can be used as the sample label. After removing the header row and the final quality attribute, 1599 samples remain, and the 11 features of each sample are normalized for clustering. The number of cluster centers was set to 6, and the results were averaged over 10 clustering runs. 
The clustering results are shown in <xref ref-type="table" rid="table2">Table 2</xref> (rounded to three decimal places).</p><p>In this paper, the ratio of the number of correctly classified samples to the total number of samples is defined as the correct classification rate, which is used to test the clustering effect of RFGK-means and RMFGK-means.</p><p>According to the 6 quality levels of Wine quality-red, the original samples are divided into classes D1 to D6, and the clustering results are divided into classes DA1 to DA6. The samples in each of DA1 to DA6 are matched against the samples in each of D1 to D6, and the number of identical samples is recorded. The fitting results of RFGK-means are shown in <xref ref-type="table" rid="table3">Table 3</xref>. Finally, for each cluster, the quality class that shares the largest number of samples with it and has not already been assigned to another cluster is taken as its similar set, and the number of identical samples in each pair of similar sets is counted. The similar set results of RFGK-means are shown in <xref ref-type="table" rid="table4">Table 4</xref>.</p><p>The fitting results and similar set results after RMFGK-means clustering are shown in <xref ref-type="table" rid="table5">Table 5</xref> and <xref ref-type="table" rid="table6">Table 6</xref> respectively.</p><p>The resulting clustering effects of RFGK-means and RMFGK-means are compared in <xref ref-type="table" rid="table7">Table 7</xref> (rounded to three decimal places).</p><p>According to these data, the correct classification rate obtained by RMFGK-means clustering (0.454) is higher than that obtained by RFGK-means clustering (0.438).</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> Clustering results of two-dimensional simulation data set</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Algorithm</th><th align="center" valign="middle" >Time of Initial Value Selection 
(s)</th><th align="center" valign="middle" >Time of Clustering (s)</th><th align="center" valign="middle" >Total Time (s)</th><th align="center" valign="middle" >Average Error</th></tr></thead><tr><td align="center" valign="middle" >K-means</td><td align="center" valign="middle" >0.016</td><td align="center" valign="middle" >0.564</td><td align="center" valign="middle" >0.580</td><td align="center" valign="middle" >1.507</td></tr><tr><td align="center" valign="middle" >FGK-means</td><td align="center" valign="middle" >9.671</td><td align="center" valign="middle" >0.506</td><td align="center" valign="middle" >10.177</td><td align="center" valign="middle" >1.507</td></tr><tr><td align="center" valign="middle" >RFGK-means</td><td align="center" valign="middle" >8.685</td><td align="center" valign="middle" >0.488</td><td align="center" valign="middle" >9.173</td><td align="center" valign="middle" >1.507</td></tr><tr><td align="center" valign="middle" >RMFGK-means</td><td align="center" valign="middle" >8.683</td><td align="center" valign="middle" >0.741</td><td align="center" valign="middle" >9.489</td><td align="center" valign="middle" >1.508</td></tr></tbody></table></table-wrap><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> Clustering results of the Wine quality-red data set</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Algorithm</th><th align="center" valign="middle" >Time of Initial Value Selection (s)</th><th align="center" valign="middle" >Time of Clustering (s)</th><th align="center" valign="middle" >Total Time (s)</th><th align="center" valign="middle" >Average Error</th></tr></thead><tr><td align="center" valign="middle" >K-means</td><td align="center" valign="middle" >0.238</td><td align="center" valign="middle" >7.625</td><td align="center" valign="middle" >7.862</td><td align="center" valign="middle" >0.107</td></tr><tr><td align="center" valign="middle" >FGK-means</td><td align="center" 
valign="middle" >184.426</td><td align="center" valign="middle" >6.158</td><td align="center" valign="middle" >190.585</td><td align="center" valign="middle" >0.104</td></tr><tr><td align="center" valign="middle" >RFGK-means</td><td align="center" valign="middle" >112.670</td><td align="center" valign="middle" >5.514</td><td align="center" valign="middle" >118.184</td><td align="center" valign="middle" >0.100</td></tr><tr><td align="center" valign="middle" >RMFGK-means</td><td align="center" valign="middle" >112.388</td><td align="center" valign="middle" >42.530</td><td align="center" valign="middle" >155.407</td><td align="center" valign="middle" >0.115</td></tr></tbody></table></table-wrap><table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title> Results of RFGK-means fitting</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >RFGK-means</th><th align="center" valign="middle" >D1</th><th align="center" valign="middle" >D2</th><th align="center" valign="middle" >D3</th><th align="center" valign="middle" >D4</th><th align="center" valign="middle" >D5</th><th align="center" valign="middle" >D6</th></tr></thead><tr><td align="center" valign="middle" >DA1</td><td align="center" valign="middle" >2</td><td align="center" valign="middle" >6</td><td align="center" valign="middle" >140</td><td align="center" valign="middle" >178</td><td align="center" valign="middle" >67</td><td align="center" valign="middle" >6</td></tr><tr><td align="center" valign="middle" >DA2</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >5</td><td align="center" valign="middle" >272</td><td align="center" valign="middle" >143</td><td align="center" valign="middle" >13</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >DA3</td><td align="center" valign="middle" >7</td><td align="center" valign="middle" >28</td><td align="center" valign="middle" 
>389</td><td align="center" valign="middle" >231</td><td align="center" valign="middle" >29</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >DA4</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >10</td><td align="center" valign="middle" >42</td><td align="center" valign="middle" >166</td><td align="center" valign="middle" >31</td><td align="center" valign="middle" >4</td></tr><tr><td align="center" valign="middle" >DA5</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >3</td><td align="center" valign="middle" >22</td><td align="center" valign="middle" >133</td><td align="center" valign="middle" >123</td><td align="center" valign="middle" >10</td></tr><tr><td align="center" valign="middle" >DA6</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >40</td><td align="center" valign="middle" >17</td><td align="center" valign="middle" >8</td><td align="center" valign="middle" >0</td></tr></tbody></table></table-wrap><table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title> Results of RFGK-means similar set</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Classification of clustering results</th><th align="center" valign="middle" >DA1</th><th align="center" valign="middle" >DA2</th><th align="center" valign="middle" >DA3</th><th align="center" valign="middle" >DA4</th><th align="center" valign="middle" >DA5</th><th align="center" valign="middle" >DA6</th></tr></thead><tr><td align="center" valign="middle" >Classification of sample quality</td><td align="center" valign="middle" >D4</td><td align="center" valign="middle" >D1</td><td align="center" valign="middle" >D3</td><td align="center" valign="middle" >D2</td><td align="center" valign="middle" >D5</td><td align="center" valign="middle" >D6</td></tr><tr><td align="center" 
valign="middle" >Same sample size</td><td align="center" valign="middle" >178</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >389</td><td align="center" valign="middle" >10</td><td align="center" valign="middle" >123</td><td align="center" valign="middle" >0</td></tr></tbody></table></table-wrap><table-wrap id="table5" ><label><xref ref-type="table" rid="table5">Table 5</xref></label><caption><title> Fitting results of RMFGK-means</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >RMFGK-means</th><th align="center" valign="middle" >D1</th><th align="center" valign="middle" >D2</th><th align="center" valign="middle" >D3</th><th align="center" valign="middle" >D4</th><th align="center" valign="middle" >D5</th><th align="center" valign="middle" >D6</th></tr></thead><tr><td align="center" valign="middle" >DA1</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >6</td><td align="center" valign="middle" >211</td><td align="center" valign="middle" >64</td><td align="center" valign="middle" >7</td><td align="center" valign="middle" >1</td></tr><tr><td align="center" valign="middle" >DA2</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >3</td><td align="center" valign="middle" >137</td><td align="center" valign="middle" >210</td><td align="center" valign="middle" >59</td><td align="center" valign="middle" >6</td></tr><tr><td align="center" valign="middle" >DA3</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >18</td><td align="center" valign="middle" >10</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >DA4</td><td align="center" valign="middle" >6</td><td align="center" valign="middle" >36</td><td align="center" valign="middle" >389</td><td align="center" valign="middle" >295</td><td align="center" 
valign="middle" >42</td><td align="center" valign="middle" >1</td></tr><tr><td align="center" valign="middle" >DA5</td><td align="center" valign="middle" >2</td><td align="center" valign="middle" >6</td><td align="center" valign="middle" >135</td><td align="center" valign="middle" >271</td><td align="center" valign="middle" >159</td><td align="center" valign="middle" >12</td></tr><tr><td align="center" valign="middle" >DA6</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >15</td><td align="center" valign="middle" >18</td><td align="center" valign="middle" >3</td><td align="center" valign="middle" >0</td></tr></tbody></table></table-wrap><table-wrap id="table6" ><label><xref ref-type="table" rid="table6">Table 6</xref></label><caption><title> RMFGK-means similar set results</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Classification of clustering results</th><th align="center" valign="middle" >DA1</th><th align="center" valign="middle" >DA2</th><th align="center" valign="middle" >DA3</th><th align="center" valign="middle" >DA4</th><th align="center" valign="middle" >DA5</th><th align="center" valign="middle" >DA6</th></tr></thead><tr><td align="center" valign="middle" >Classification of sample quality</td><td align="center" valign="middle" >D2</td><td align="center" valign="middle" >D5</td><td align="center" valign="middle" >D1</td><td align="center" valign="middle" >D3</td><td align="center" valign="middle" >D4</td><td align="center" valign="middle" >D6</td></tr><tr><td align="center" valign="middle" >Same sample size</td><td align="center" valign="middle" >6</td><td align="center" valign="middle" >59</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >389</td><td align="center" valign="middle" >271</td><td align="center" valign="middle" >0</td></tr></tbody></table></table-wrap><table-wrap id="table7" ><label><xref 
ref-type="table" rid="table7">Table 7</xref></label><caption><title>Comparison of clustering performance</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Algorithm</th><th align="center" valign="middle" >Number of correctly classified samples</th><th align="center" valign="middle" >Correct classification rate</th></tr></thead><tr><td align="center" valign="middle" >RFGK-means</td><td align="center" valign="middle" >701</td><td align="center" valign="middle" >0.438</td></tr><tr><td align="center" valign="middle" >RMFGK-means</td><td align="center" valign="middle" >726</td><td align="center" valign="middle" >0.454</td></tr></tbody></table></table-wrap></sec><sec id="s4_2"><title>4.2. Experimental Analysis</title><p>With traditional K-means, both the clustering time and the average error fluctuate considerably. Because the initial centers are selected at random, the clustering time is unstable and the result easily falls into a local optimum. The other three algorithms search for the initial cluster centers globally and output them consistently, so their clustering results are stable. RFGK-means and RMFGK-means select the initial cluster centers faster than FGK-means. Using Mahalanobis distance instead of Euclidean distance takes the global distribution of the data into account, which can improve the accuracy of clustering results on real data sets: in <xref ref-type="table" rid="table7">Table 7</xref>, RMFGK-means correctly classifies 726 samples (rate 0.454), compared with 701 (rate 0.438) for RFGK-means.</p></sec></sec><sec id="s5"><title>5. Conclusion</title><p>The fast global K-means algorithm based on neighborhood screening effectively shortens the initial value search and makes the algorithm more robust, while its clustering quality remains essentially consistent with that of the fast global K-means algorithm. 
Using Mahalanobis distance instead of Euclidean distance in the clustering process takes the overall distribution of the data into account, which improves the algorithm's resistance to noise and its clustering accuracy. However, Mahalanobis distance is more expensive to compute, so the clustering time, and with it the total running time of the algorithm, increases to some extent. The RMFGK-means algorithm is most advantageous when clustering highly correlated data.</p></sec><sec id="s6"><title>Conflicts of Interest</title><p>The authors declare no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s7"><title>Cite this paper</title><p>Han, J.T. and Li, H.M. (2019) Initial Value Filtering Optimizes Fast Global K-Means. Journal of Computer and Communications, 7, 52-62. https://doi.org/10.4236/jcc.2019.710005</p></sec></body><back><ref-list><title>References</title><ref id="scirp.95688-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Cui, Y.H., Shang, C., Chen, S.Q. and Hao, J.Y. (2019) Overview of AI: Developments of AI Techniques. Radio Communications Technology, 45, 225-231.</mixed-citation></ref><ref id="scirp.95688-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Gbadoubissa, J.E.Z., Ari, A.A.A. and Gueroui, A.M. (2018) Efficient K-Means Based Clustering Scheme for Mobile Networks Cell Sites Management. Journal of King Saud University - Computer and Information Sciences. https://doi.org/10.1016/j.jksuci.2018.10.015</mixed-citation></ref><ref id="scirp.95688-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Han, X. (2017) Research of the Micro Pipeline Robot Based on Machine Vision. Master's Thesis, Tianjin University of Technology, Tianjin.</mixed-citation></ref><ref id="scirp.95688-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Cong, S.A. and Wang, X.X. 
(2018) Research Review on K-Means Algorithm. Electronic Technology &amp; Software Engineering, No. 17, 155-156.</mixed-citation></ref><ref id="scirp.95688-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Jia, R.Y. and Li, Y.G. (2018) K-Means Algorithm of Clustering Number and Centers Self-Determination. Computer Engineering and Applications, 54, 152-158.</mixed-citation></ref><ref id="scirp.95688-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Liu, Y., Wu, S., Zhou, H.-H., Wu, X.-J. and Han, L.-Y. (2019) Research on Optimization Method Based on K-Means Clustering Algorithm. Information Technology, 43, 66-70.</mixed-citation></ref><ref id="scirp.95688-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Jiang, H., Ji, F., Wang, H.-J., Wang, X. and Luo, Y.-D. (2018) Improved Kmeans Algorithm for Ocean Data Anomaly Detection. Computer Engineering and Design, 39, 3132-3136.</mixed-citation></ref><ref id="scirp.95688-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Wang, H. and Qin, L.B. (2012) Method of Image Segmentation Based on Fast Global K-Means Algorithm and Region Merging. Computer Engineering and Applications, 48, 187-190+223.</mixed-citation></ref><ref id="scirp.95688-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Tao, Y., Yang, F., Liu, Y. and Dai, B. (2018) Research and Optimization of K-Means Clustering Algorithm. Computer Technology and Development, 28, 90-92.</mixed-citation></ref><ref id="scirp.95688-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Yi, Q., Teng, S.H. and Zhang, W. (2012) Intrusion Detection Based on K-Means Clustering Algorithm Based on Mahalanobis Distance. 
Journal of Jiangxi Normal University (Natural Science), 36, 284-287.</mixed-citation></ref><ref id="scirp.95688-ref11"><label>11</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Liu</surname><given-names> Y.H.</given-names></name>, <etal>et al</etal>. (<year>2018</year>) <article-title>Design and Implementation of an Improved K-Means Clustering Algorithm for Natural Image Segmentation</article-title>. <source>Journal of Huainan Normal University</source>, <volume>20</volume>, <fpage>120</fpage>-<lpage>125</lpage>.</mixed-citation></ref><ref id="scirp.95688-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Wang, Y., Qi, X.H. and Duan, Y.X. (2019) Image Segmentation of FCM Algorithm Based on Kernel Function and Markov Distance. Application Research of Computers, 1-5.</mixed-citation></ref><ref id="scirp.95688-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Hoffelder, T. (2019) Equivalence Analyses of Dissolution Profiles with the Mahalanobis Distance. Biometrical Journal, 61, 779-782. https://doi.org/10.1002/bimj.201700257</mixed-citation></ref><ref id="scirp.95688-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Liu, C. and Xie, D.-Y. (2015) An Improved Fast Global K-Means Clustering Segmentation Algorithm. Journal of Qinghai Normal University (Natural Science Edition), 31, 1-5.</mixed-citation></ref><ref id="scirp.95688-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Lai, J.Z.C. and Huang, T.-J. (2010) Fast Global K-Means Clustering Using Cluster Membership and Inequality. Pattern Recognition, 43, 1954-1963. https://doi.org/10.1016/j.patcog.2009.11.021</mixed-citation></ref></ref-list></back></article>