<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JCC</journal-id><journal-title-group><journal-title>Journal of Computer and Communications</journal-title></journal-title-group><issn pub-type="epub">2327-5219</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jcc.2020.84003</article-id><article-id pub-id-type="publisher-id">JCC-99419</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  An SLGC Model for Asian Food Image Classification
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Ruoqi</surname><given-names>Wu</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Shuai</surname><given-names>Zhao</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Zhijian</surname><given-names>Qu</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff2"><addr-line>Department of Computer Science, University of York, York, UK</addr-line></aff><aff id="aff1"><addr-line>Department of Computer Science and Technology, Shandong University of Technology, Zibo, China</addr-line></aff><pub-date pub-type="epub"><day>30</day><month>03</month><year>2020</year></pub-date><volume>08</volume><issue>04</issue><fpage>26</fpage><lpage>43</lpage><history><date date-type="received"><day>2,</day>	<month>February</month>	<year>2020</year></date><date date-type="rev-recd"><day>6,</day>	<month>April</month>	<year>2020</year>	</date><date date-type="accepted"><day>9,</day>	<month>April</month>	<year>2020</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  As a fine-grained classification problem, food image classification faces many difficulties in the specific implementation. Different countries and regions have different eating habits. In particular, Asian food images have a complicated structure, and the related classification methods are still very scarce. There is an urgent need to develop a feature extraction and fusion scheme based on the characteristics of Asian food images. To solve the above problems, we proposed an image classification model SLGC (SURF-Local and Global Color) that combines image segmentation and feature fusion. By studying the unique structure of Asian foods, the color features of the images are merged into the representation vectors in the local and global dimensions, respectively, thereby further enhancing the effect of feature extraction. The experimental results show that the SLGC model can express the intrinsic characteristics of Asian food images more comprehensively and improve classification accuracy.
 
</p></abstract><kwd-group><kwd>Asian Food</kwd><kwd> Image Classification</kwd><kwd> Image Segmentation</kwd><kwd> Feature Fusion</kwd><kwd> Bag of Features</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>With the improvement of living standards, people have begun to pursue a more scientific and healthy diet. Food image classification uses a computer to automatically analyze the food images provided by a user and return the matching food name, so that the user’s diet and nutrient intake can be further predicted [<xref ref-type="bibr" rid="scirp.99419-ref1">1</xref>].</p><p>Research on food identification has been under way since the 1990s. SVM-based multiple kernel learning, multi-feature fusion and other methods have been applied to food recognition [<xref ref-type="bibr" rid="scirp.99419-ref2">2</xref>]. Hongsheng He et al. present an automatic food classification method, DietCam, which specifically addresses the variation of food appearances [<xref ref-type="bibr" rid="scirp.99419-ref3">3</xref>]. Shota Sasano et al. propose to characterize colour and texture information with a patch-based bag of features model, which can greatly improve classification accuracy [<xref ref-type="bibr" rid="scirp.99419-ref4">4</xref>]. Because food images mostly show table scenes, they inevitably contain extraneous objects such as tableware, condiments and tablecloths, which increase the complexity of the scene. As a result, the food subject is not clearly delineated, which hampers feature extraction and seriously degrades image classification. 
Convolutional neural networks have also become a very effective method in computer vision [<xref ref-type="bibr" rid="scirp.99419-ref5">5</xref>] [<xref ref-type="bibr" rid="scirp.99419-ref6">6</xref>] and are increasingly used for food image classification [<xref ref-type="bibr" rid="scirp.99419-ref7">7</xref>]. Takumi Ege et al. use Faster R-CNN as a food detector to detect each dish in a photo of multiple dishes and then estimate the food calories from that photo [<xref ref-type="bibr" rid="scirp.99419-ref8">8</xref>]. Shu Naritomi et al. implement Japanese food category transformation in mixed reality using both image generation and HoloLens [<xref ref-type="bibr" rid="scirp.99419-ref9">9</xref>]. To achieve good accuracy, image classification with convolutional neural networks requires an extensive dataset for training. However, data collection and processing for Asian food images are progressing slowly, which makes large-scale deep learning training challenging. Moreover, although image classification based on deep learning is highly accurate, its results often lack interpretability, which is not conducive to the in-depth analysis of Asian food-specific structures and feature extraction methods [<xref ref-type="bibr" rid="scirp.99419-ref10">10</xref>].</p><p>From the perspective of food attributes, European, American and Asian foods differ significantly in structure, morphology, texture, and colour [<xref ref-type="bibr" rid="scirp.99419-ref11">11</xref>]. In the West, most food is well structured, and the cooking style is relatively monotonous. 
However, Asian foods have varied shapes and unclear structures, and the appearance of a dish varies greatly with the cooking method, so it is necessary to develop an image classification scheme suited to Asian food.</p><p>We propose an image classification model, SLGC (SURF-Local and Global Color), that combines image segmentation and feature fusion. First, the GrabCut algorithm is used to segment the original image and extract the food subject. Then, we improve the BoF (Bag of Features) model: we extract the SURF (Speeded Up Robust Features) features of the image and take the colour information in the neighbourhood of each SURF feature point as the “local colour feature”. The local colour feature is merged with the SURF feature and quantized. By clustering and building a feature dictionary, we obtain the “image representation vector” of the original image. Finally, the “global colour feature” of the original image is extracted and merged with the image representation vector, and the merged features are input into a classification model based on the SVM (Support Vector Machine) for training and classification.</p></sec><sec id="s2"><title>2. The Theoretical Model</title><p>The SLGC model proposed in this paper comprises image segmentation, image feature extraction, local and global feature fusion, and classification. <xref ref-type="fig" rid="fig1">Figure 1</xref> shows the framework of SLGC.</p><p>To reduce the influence of the background of the food image on feature extraction, SLGC uses the GrabCut algorithm to segment food images. GrabCut is an interactive segmentation algorithm that uses GMMs (Gaussian mixture models) to estimate the colour distributions of the segmented object and the background, based on a specified bounding box around the object [<xref ref-type="bibr" rid="scirp.99419-ref12">12</xref>] [<xref ref-type="bibr" rid="scirp.99419-ref13">13</xref>]. 
Equation (1) shows the energy function of GrabCut.</p><p>$$E(\alpha, k, \theta, z) = U(\alpha, k, \theta, z) + V(\alpha, z) \tag{1}$$</p><p>Here $\alpha$ is the transparency coefficient, $k$ is the number of GMM components, and $\theta = \{\pi_k, \mu_k, \Sigma_k\}$ collects the ratio (weight), mean and covariance of each GMM component. The region data term $U$ measures how well the labelling $\alpha$ matches the pixels $z$. The smoothness term $V$ reaches its minimum at the image boundary, so minimizing the energy function $E$ determines the best segmentation.</p><p><xref ref-type="fig" rid="fig2">Figure 2</xref> shows the effect of using the GrabCut algorithm for food image segmentation. Column (a) is the original image, which contains interference such as tableware, tablecloths and other foods. Column (b) shows the main body of the food marked in the original image. Column (c) shows the result of the image segmentation, and column (d) is the final result after cropping.</p><p>SLGC uses SURF to extract local feature information from the image. The core of the SURF algorithm is the Hessian matrix. By computing the determinant of the Hessian matrix at each pixel of the image, we can obtain the feature points of the image, as shown in Equation (2).</p><p>$$H(x, \sigma) = \begin{pmatrix} \dfrac{\partial^2}{\partial x^2} &amp; \dfrac{\partial^2}{\partial x \partial y} \\ \dfrac{\partial^2}{\partial x \partial y} &amp; \dfrac{\partial^2}{\partial y^2} \end{pmatrix} \cdot L(x, y, \sigma) = \begin{pmatrix} L_{xx}(x, y, \sigma) &amp; L_{xy}(x, y, \sigma) \\ L_{xy}(x, y, \sigma) &amp; L_{yy}(x, y, \sigma) \end{pmatrix} \tag{2}$$</p><p>$L_{xx}(x, y, \sigma)$ is the convolution of the Gaussian second-order derivative with the original image at point $(x, y)$.</p><p>To ensure that the extracted feature points are rotation invariant, the main direction of each feature point must also be determined. The Haar wavelet responses are computed in a circular neighbourhood centred on the feature point with a radius of six times its scale, and the responses are summed within a 60˚ sector window. 
Then the feature direction vector is obtained, as shown in Equation (3).</p><p>$$m_w = \sum_w d_x + \sum_w d_y, \qquad \theta_w = \arctan\left(\sum_w d_x \Big/ \sum_w d_y\right) \tag{3}$$</p><p>$m_w$ and $\theta_w$ represent the magnitude and direction of the feature direction vector, respectively. Centred on the feature point, a square region with a side of 20 times the scale is divided into 16 sub-blocks along the main direction. In each sub-block, $\sum d_x$, $\sum d_y$, $\sum |d_x|$ and $\sum |d_y|$ are computed to generate the SURF feature descriptor, so its dimension $D_s$ is a fixed value of 16 × 4 = 64.</p><p>Compared with SIFT (Scale-invariant feature transform), using SURF to extract the features of food images can effectively improve the speed while maintaining an appropriate number of feature points [<xref ref-type="bibr" rid="scirp.99419-ref14">14</xref>], which benefits the subsequent real-time processing of food images.</p><p><xref ref-type="fig" rid="fig3">Figure 3</xref> shows the effect of extracting feature points from a food image using SURF. Column (a) is the original image. To make the feature points clear, the threshold is set to 6000 when extracting the SURF feature points, which are marked in the image in column (b). The centre of each blue circle is the position of a feature point, and the different radii represent different scales.</p><p><xref ref-type="fig" rid="fig4">Figure 4</xref> shows how the feature vector information changes during local-global feature fusion.</p><p>The primary function of the feature fusion part of SLGC is to fuse the SURF feature of the food image with the local colour feature using the BoF model, which strengthens the representation ability of the local feature information. The global colour feature of the image is then added to the image representation vector to complete the global feature fusion and further improve feature extraction. Finally, the representation vector fused from the local and global features is input into the SVM for training and classification. 
The following sections focus on the process of feature fusion.</p></sec><sec id="s3"><title>3. Feature Fusion</title><p>The critical work of SLGC lies in feature fusion. We made the following improvements to the BoF model. In the two steps of “feature point information quantization” and “formation of final representation vector”, the local colour feature and the global colour feature of the image are respectively merged into the final representation vector. These two improvements effectively improve image classification.</p><sec id="s3_1"><title>3.1. Bag of Features Model</title><p>The BoF model was proposed by Csurka and gradually applied to the field of image processing [<xref ref-type="bibr" rid="scirp.99419-ref15">15</xref>]. The necessary steps are as follows. First, the training images are sampled by region, and the feature points are located and each described by a feature vector, as shown in <xref ref-type="fig" rid="fig5">Figure 5</xref>.</p><p>Then, the set of feature vectors is processed by the K-means clustering algorithm to obtain the feature dictionary. With different feature extraction methods, the number of feature points that can be located in each image also differs. If the dataset contains $m$ images, and the number of feature points of the $i$-th image is $\beta_i$, then the cluster number $K$ selected in this paper is given by Equation (4).</p><p>$$K = \sum_{i=1}^{m} \beta_i \tag{4}$$</p><p>Finally, with reference to the feature dictionary, the feature words of an image are extracted and the frequency of occurrence of each word is counted to obtain the image representation vector of the original image, as shown in <xref ref-type="fig" rid="fig6">Figure 6</xref>.</p></sec><sec id="s3_2"><title>3.2. 
Local Feature Fusion</title><p>SURF feature points are relatively distinctive in the image and reflect its essential information, so the colour information of the pixels in their neighbourhood is also critical. The SURF information is extracted and fused with the colour information in the neighbourhood of each feature point so that the image content can be represented more accurately and comprehensively. <xref ref-type="fig" rid="fig7">Figure 7</xref> shows an example of the location of a feature point and the selection of its neighbourhood pixels.</p><p>Point P is a SURF feature point, and R is the neighbourhood radius. Assume for illustration that R is 2 (subsequent experiments determine the actual optimal value of R), and that the RGB colour space is used to represent the colour information of the pixels in the neighbourhood of the feature point. The local colour feature is then formed, and its dimension is given by Equation (5).</p><p>$$D_c = (2R^2 + 2R + 1) \times 3 \tag{5}$$</p><p>$R$ represents the radius of the neighbourhood and $D_c$ represents the dimension of the local colour feature.</p><p>After obtaining the local colour feature, we improve the BoF model in the feature point quantization step: the local colour feature is combined with the corresponding SURF descriptor to complete the local feature fusion. Since the feature vector and the colour information are measured on different scales, they must be normalized separately before the fusion to eliminate the influence of their differing magnitudes. <xref ref-type="fig" rid="fig8">Figure 8</xref> shows the specific process of local feature fusion.</p><p>Taking the visual word representing “bread” on the left as an example, we obtain its SURF feature vector and the colour information in the neighbourhood of radius R. Then, we splice the two into a local fused vector. 
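</p><p>The splicing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the helper names (diamond_offsets, l2_normalize, fuse_local) are hypothetical, L2 normalization is an assumption (the text only says the two parts are normalized separately), and the city-block neighbourhood shape is inferred from the pixel count 2R² + 2R + 1 in Equation (5).</p><preformat><![CDATA[
```python
import math


def diamond_offsets(radius):
    """Offsets (dx, dy) with |dx| + |dy| <= radius (assumed neighbourhood shape)."""
    return [(dx, dy)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if abs(dx) + abs(dy) <= radius]


def l2_normalize(vec):
    """Scale vec to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_local(surf_descriptor, image, x, y, radius):
    """Concatenate a normalized SURF descriptor with normalized local colours.

    image is a row-major grid of (r, g, b) tuples; the feature point (x, y)
    is assumed to lie at least `radius` pixels away from the image border.
    """
    colours = []
    for dx, dy in diamond_offsets(radius):
        colours.extend(image[y + dy][x + dx])
    return l2_normalize(surf_descriptor) + l2_normalize(colours)


# Toy data: a 7x7 single-colour image and a placeholder 64-dimensional descriptor.
image = [[(200, 120, 40)] * 7 for _ in range(7)]
descriptor = [1.0] * 64
fused = fuse_local(descriptor, image, x=3, y=3, radius=2)
print(len(fused))  # 64 + (2*2**2 + 2*2 + 1) * 3 = 103
```
]]></preformat><p>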
Its dimension is given by Equation (6).</p><p>$$D_l = D_s + D_c \tag{6}$$</p><p>$D_s$ represents the dimension of the SURF feature vector, which is a fixed value of 64, $D_c$ represents the dimension of the local colour feature, and $D_l$ represents the dimension of the local fused vector after feature splicing. All the local fused vectors in the image are clustered, and the feature dictionary is generated to form the image representation vector, which completes the local feature fusion process.</p><p>Take the images of “Udon noodles” and “pies” in UEC FOOD 100 as an example. We marked the positions of the SURF feature points in the images of column (a), and column (b) is the colour histogram corresponding to the image on its left. Since both images contain a large area of yellow and their main bodies are similar in colour, the degree of discrimination is low. However, by analyzing the locations of the feature points, we found that most of them lie in the “embellishment area”, that is, the position of the red sausage and the chopped green onion. Unlike the Western habit of adorning food containers, Asian cuisines tend to embellish dishes directly with brightly coloured, fresh-tasting ingredients. Because the colour and shape of the embellishment area are prominent, feature points appear in its vicinity with high probability, and the colour of the embellishment area reflects the type of food well. By analyzing the colour information in the neighbourhood of the feature points of the two foods in <xref ref-type="fig" rid="fig9">Figure 9</xref>, it is possible to distinguish the two types of food by the difference between red and green.</p></sec><sec id="s3_3"><title>3.3. Global Feature Fusion</title><p>Before SURF is used to extract the features of food images, the original images must be converted to greyscale. 
The greyscale image inevitably loses its colour features, which carry crucial information and are essential for colour-rich food images. In particular, the global colour feature can represent the most widely distributed colours in the image and can play a significant role in classification.</p><p>This article uses the HSV colour space to represent the global colour feature of a food image. The HSV colour space locates colours by three values: H (hue), S (saturation), and V (value). Compared with the RGB colour space, the HSV space expresses the brightness and vividness of a colour intuitively, which is closer to the way humans naturally perceive food images.</p><p>To keep the dimension of the global colour feature from becoming too high, we quantize the HSV space according to Equation (7).</p><p>$$H = \begin{cases} 0 &amp; H \in [316, 20] \\ 1 &amp; H \in [21, 40] \\ 2 &amp; H \in [41, 75] \\ 3 &amp; H \in [76, 155] \\ 4 &amp; H \in [156, 190] \\ 5 &amp; H \in [191, 270] \\ 6 &amp; H \in [271, 295] \\ 7 &amp; H \in [296, 315] \end{cases} \qquad S = \begin{cases} 0 &amp; S \in [0, 0.2] \\ 1 &amp; S \in [0.2, 0.7] \\ 2 &amp; S \in [0.7, 1] \end{cases} \qquad V = \begin{cases} 0 &amp; V \in [0, 0.2] \\ 1 &amp; V \in [0.2, 0.7] \\ 2 &amp; V \in [0.7, 1] \end{cases} \tag{7}$$</p><p>Based on this quantization, the colour components are combined according to Equation (8). Because H takes 8 levels and S and V take 3 levels each, G takes one of 8 × 3 × 3 = 72 values (0 to 71), which yields a 72-dimensional colour feature vector that represents the overall colour feature of the image.</p><p>$$G = 9H + 3S + V \tag{8}$$</p><p>To complete the global feature fusion, we improved the BoF model in the step “formation of final representation vector”, as shown in <xref ref-type="fig" rid="fig10">Figure 10</xref>.</p><p>The “image representation vector” is formed by clustering all the local fused vectors of the image into a feature dictionary and counting the word frequencies. It contains the SURF features of the image and the colour features in the neighbourhood of the feature points. 
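</p><p>The quantization of Equations (7) and (8) can be sketched as follows. This is only an illustrative reading, with hypothetical function names: hue is assumed to be given in degrees from 0 to 360 and S and V in the range 0 to 1, values that fall on a boundary shared by two intervals of Equation (7) are assigned to the lower level, and the 72-dimensional vector is assumed to be a normalized histogram of G over the pixels, which the text does not state explicitly.</p><preformat><![CDATA[
```python
# Upper hue bounds (degrees) of levels 0..7 from Equation (7).
HUE_UPPER = [20, 40, 75, 155, 190, 270, 295, 315]


def quantize_hue(h):
    """Map a hue in degrees to one of 8 levels; hues above 315 wrap to level 0."""
    for level, upper in enumerate(HUE_UPPER):
        if h <= upper:
            return level
    return 0  # wrap-around part of the interval [316, 20]


def quantize_sv(x):
    """Map saturation or value in [0, 1] to one of 3 levels (ties go down)."""
    if x <= 0.2:
        return 0
    if x <= 0.7:
        return 1
    return 2


def colour_index(h, s, v):
    """G = 9H + 3S + V from Equation (8); the result lies in 0..71."""
    return 9 * quantize_hue(h) + 3 * quantize_sv(s) + quantize_sv(v)


def global_colour_feature(pixels):
    """Normalized 72-bin histogram of G over (h, s, v) pixels (assumed form)."""
    hist = [0.0] * 72
    for h, s, v in pixels:
        hist[colour_index(h, s, v)] += 1.0
    total = sum(hist)
    return [c / total for c in hist] if total else hist


pixels = [(10, 0.9, 0.9), (350, 0.9, 0.9), (120, 0.5, 0.1)]
feature = global_colour_feature(pixels)
print(colour_index(120, 0.5, 0.1))  # 9*3 + 3*1 + 0 = 30
```
]]></preformat><p>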
The image representation vector is spliced with the HSV global colour feature of the image to form the “final representation vector”. Its dimension is given by Equation (9).</p><p>$$D_g = K + D_h \tag{9}$$</p><p>$K$ is the number of clusters from Equation (4), which is also the dimension of the image representation vector. $D_h$ is the dimension of the global colour feature, which is a fixed value of 72, and $D_g$ is the dimension of the final representation vector after the global feature fusion. <xref ref-type="fig" rid="fig11">Figure 11</xref> shows a colour histogram comparison of different foods, which indicates that global feature fusion is reasonable and practical.</p><p>Column (a) shows two kinds of food, “fried rice” and “vegetables”, which have different primary colour tones, beige and turquoise respectively. The colour histograms in column (b) show that the distributions of the two colours differ significantly. Therefore, global feature fusion can further improve the ability of the feature vector to express the content of a food image and further improve classification accuracy.</p></sec></sec><sec id="s4"><title>4. Experimental Results and Analysis</title><p>To verify the validity of SLGC, we validate its key steps from two perspectives. First, we compare the classification results before and after image segmentation to see whether segmentation contributes to classification accuracy. Second, by comparing the classification accuracy under different values of the neighbourhood radius, and the classification accuracy before and after the global feature fusion, we study the influence of local feature fusion and global feature fusion on image classification separately.</p><sec id="s4_1"><title>4.1. Dataset</title><p>The datasets used in the experiments were Caltech101 and UEC FOOD 100. Caltech101 is an integrated image dataset with rich content and different types of images. 
UEC FOOD 100 is a food image dataset created by Yoshiyuki Kawano of the University of Electro-Communications. Most of its images show popular Japanese foods, which fully reflect the structural characteristics of Asian food. <xref ref-type="table" rid="table1">Table 1</xref> shows the structure of the datasets.</p><p>We use an SVM with an RBF (Radial Basis Function) kernel as the classifier for food images and a one-versus-rest strategy for multi-class classification. Parameters are optimized with the grid.py tool of Libsvm. We did not divide the training and test sets manually. Instead, K-fold cross-validation (K-CV) is used to divide and train the dataset K times and obtain K classification models. The average value $p_{ave}$ of the classification accuracies $p_i$ of the K models is taken as the performance of this K-CV; from this, the optimal parameters C and gamma are determined, and the image classifier is then constructed with the optimal parameters to obtain the best classification accuracy P.</p><p>The experimental environment is the Windows 10 operating system with an Intel Core i5 CPU and 16 GB of memory; the programming environment is PyCharm 2018.2.7 with Python 2.7. 
We modified the relevant code of OpenCV 3.3 to extract the SIFT and SURF features and used Libsvm for parameter optimization and SVM training.</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> Dataset structure</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Dataset</th><th align="center" valign="middle" >Number of images</th><th align="center" valign="middle" >Number of image types</th><th align="center" valign="middle" >Image content</th></tr></thead><tr><td align="center" valign="middle" >Caltech101</td><td align="center" valign="middle" >9145</td><td align="center" valign="middle" >102</td><td align="center" valign="middle" >Complex</td></tr><tr><td align="center" valign="middle" >UEC FOOD 100</td><td align="center" valign="middle" >14366</td><td align="center" valign="middle" >100</td><td align="center" valign="middle" >Food</td></tr></tbody></table></table-wrap></sec><sec id="s4_2"><title>4.2. Image Segmentation</title><p>To deal with tablecloths, tableware and other interference in the background of food images, the GrabCut algorithm is used to segment each image and extract the food subject. Using UEC FOOD 100 as the experimental dataset, classification experiments before and after image segmentation were carried out on image data of different scales. This experiment mainly evaluates and quantifies the effect of image segmentation on classification accuracy. <xref ref-type="fig" rid="fig12">Figure 12</xref> shows the experimental results.</p><p>In <xref ref-type="fig" rid="fig12">Figure 12</xref>, the abscissa indicates the size of the dataset (the number of image types), and the ordinate indicates the classification accuracy. 
By extracting the SURF and SIFT features and performing image classification experiments on the dataset before and after segmentation, we observed that, with the same experimental data, image segmentation achieves a 4% to 6% improvement with SIFT and a 5% to 8% improvement with SURF. This shows that image segmentation can effectively highlight the food subject and avoid the adverse effects of background interference on subsequent processing.</p><p>At the same time, image segmentation effectively reduces the resources used in subsequent processing. During the experiment, the size of the dataset has a decisive impact on the efficiency of the classification model. Furthermore, we can conclude from Equation (6) that the dimension of the SURF information in the local fused vector is fixed, so the dimension of the local fused vector depends on the neighbourhood radius of the feature point. Therefore, the value of the neighbourhood radius also affects the execution efficiency of the model. As shown in <xref ref-type="table" rid="table2">Table 2</xref>, we measured feature extraction with and without image segmentation under different data sizes and values of the neighbourhood radius. 
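</p><p>The size of the saving can be read directly off the timings in Table 2. The short calculation below is only an arithmetic check of those four reported values; the variable names are illustrative.</p><preformat><![CDATA[
```python
# Feature extraction times in minutes from Table 2: (with segmentation, without).
timings = {
    "10 types, R = 2": (83.9, 149.8),
    "20 types, R = 4": (189.2, 343.2),
}

reductions = {}
for setting, (with_seg, without_seg) in timings.items():
    # Relative time saved when segmentation is applied first.
    reductions[setting] = (without_seg - with_seg) / without_seg * 100.0
    print(setting, round(reductions[setting], 1))
# 10 types, R = 2 -> 44.0
# 20 types, R = 4 -> 44.9
```
]]></preformat><p>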
The experiments show that image segmentation reduces the feature extraction time by about 45%.</p><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> The effect of image segmentation on feature extraction</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Dataset size</th><th align="center" valign="middle" >Value of the neighbourhood radius</th><th align="center" valign="middle" >Image segmentation</th><th align="center" valign="middle" >Time (min)</th></tr></thead><tr><td align="center" valign="middle" >10 types of images, 100 for each</td><td align="center" valign="middle" >R = 2</td><td align="center" valign="middle" >Yes</td><td align="center" valign="middle" >83.9</td></tr><tr><td align="center" valign="middle" >10 types of images, 100 for each</td><td align="center" valign="middle" >R = 2</td><td align="center" valign="middle" >No</td><td align="center" valign="middle" >149.8</td></tr><tr><td align="center" valign="middle" >20 types of images, 100 for each</td><td align="center" valign="middle" >R = 4</td><td align="center" valign="middle" >Yes</td><td align="center" valign="middle" >189.2</td></tr><tr><td align="center" valign="middle" >20 types of images, 100 for each</td><td align="center" valign="middle" >R = 4</td><td align="center" valign="middle" >No</td><td align="center" valign="middle" >343.2</td></tr></tbody></table></table-wrap></sec><sec id="s4_3"><title>4.3. Local Feature Fusion</title><p>The purpose of this experiment is to evaluate how different values of the neighbourhood radius of the feature points affect local feature extraction. Therefore, we intentionally ignore the SURF information and extract only the colour information in the neighbourhood of each feature point as the final representation vector of the image. 
The evaluation criterion is the accuracy of the image classification, and <xref ref-type="fig" rid="fig13">Figure 13</xref> shows the experimental results.</p><p>The abscissa in <xref ref-type="fig" rid="fig13">Figure 13</xref> represents the value of the neighbourhood radius of the feature point, and the ordinate represents the accuracy of the image classification. The experiments were carried out on Caltech101 and UEC FOOD 100 respectively, using the first 30 types of images from Caltech101 and the first 20 types of images from UEC FOOD 100. In the experiment, the colour information of all pixels within different neighbourhood radii of the SURF feature points is extracted and represented in the RGB and HSV colour spaces respectively. With this information used as the final representation vector of the image, its dimension can be calculated from Equation (5). The final representation vector is quantized and input into the SVM for classification.</p><p>The experimental results in <xref ref-type="fig" rid="fig13">Figure 13</xref> show that as the neighbourhood radius R increases, the accuracy of image classification also increases steadily. After several rounds of testing, we determined that R = 5 gives the best feature representation, and that the RGB space is generally better for representing local colours.</p><p>At the same time, we found that food images are more sensitive to local colour features than the integrated images. This is because the feature points of food images, especially Asian food images, are mostly located in the “embellishment area”, so the colour information in the neighbourhood of the feature points is more valuable.</p></sec><sec id="s4_4"><title>4.4. Global Feature Fusion</title><p>The purpose of this experiment is to evaluate the influence of global feature fusion on the overall feature extraction effect. 
Therefore, the single feature extraction method (SIFT/SURF) is used as a benchmark against which to compare the classification performance of local feature fusion (SIFT/SURF + RGB) and local-global feature fusion (SIFT/SURF + RGB + HSV). The evaluation criterion is the accuracy of the image classification, and <xref ref-type="fig" rid="fig14">Figure 14</xref> shows the experimental results.</p><p>The abscissa in <xref ref-type="fig" rid="fig14">Figure 14</xref> indicates the size of the dataset (the number of image types), and the ordinate indicates the classification accuracy. The experiments were carried out on Caltech101 and UEC FOOD 100, respectively, and the scale of the experimental data is consistent with the previous section. In the experiment, we represent the overall colour information of the image in the HSV colour space and splice it with the local features of the image (SIFT/SURF + RGB), with the neighbourhood radius R set to 2. These comparison experiments show that adding the global colour feature improves the classification accuracy of the integrated images by about 5% and that of the food images by about 3%. Global feature fusion, building on local feature fusion, thus further enhances feature extraction. The global colour feature contributes less to the classification of food images because the colour differences between food types are smaller than those between the types of integrated images, and many types of food share the same colour tone.</p></sec><sec id="s4_5"><title>4.5. Experimental Results</title><p>We randomly selected 20 types of images from UEC FOOD 100 as experimental data. According to Equation (4), the clustering value K is 1368, and the neighbourhood radius R is 5. After the local feature fusion, the dimension $D_l$ of the local fused vector of each visual word is 247, according to Equation (6). 
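</p><p>The dimensions used in this experiment follow directly from Equations (5), (6) and (9) with R = 5 and K = 1368. The short check below reproduces the arithmetic; the function and constant names are illustrative only.</p><preformat><![CDATA[
```python
SURF_DIM = 64    # D_s, fixed dimension of the SURF descriptor
GLOBAL_DIM = 72  # D_h, dimension of the quantized HSV colour feature


def local_colour_dim(radius):
    """D_c from Equation (5): three colour channels per neighbourhood pixel."""
    return (2 * radius ** 2 + 2 * radius + 1) * 3


def local_fused_dim(radius):
    """D_l from Equation (6): SURF descriptor plus local colour feature."""
    return SURF_DIM + local_colour_dim(radius)


def final_dim(num_clusters):
    """D_g from Equation (9): image representation vector plus global colour."""
    return num_clusters + GLOBAL_DIM


print(local_colour_dim(5))  # 183
print(local_fused_dim(5))   # 247
print(final_dim(1368))      # 1440
```
]]></preformat><p>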
After the global feature fusion, the dimension D<sub>g</sub> of the final representation vector of each image is 1440, according to Equation (9). Under these experimental conditions, we tested the components of the SLGC in order to summarize and quantify the contributions of image segmentation, local feature fusion and global feature fusion to the image classification accuracy.</p><p>As can be seen from <xref ref-type="table" rid="table3">Table 3</xref>, when the other conditions are the same, the SURF descriptor outperforms the SIFT descriptor and generally improves the classification accuracy by about 4%. With the same feature extraction method, segmenting the food image improves the classification accuracy by about 4%; local feature fusion improves it by about 5%, and global feature fusion by about 3%.</p><p>Then, using the same dataset and parameters as in the above experiment, the SLGC is compared with other models (<xref ref-type="table" rid="table4">Table 4</xref>).</p><p>First, the baseline methods Color Histogram and Bag of SIFT Features each rely on a single feature extraction method and cannot adapt feature extraction to the characteristics of food images, which reduces their classification performance. 
OM performed well on the PFID, but because its unique feature combination structure applies only to PFID, its classification accuracy declines on Asian food datasets.</p>
<table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title>Comparison of P values of different feature extraction methods</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Extraction methods</th><th align="center" valign="middle" >Image segmentation</th><th align="center" valign="middle" >Local feature fusion</th><th align="center" valign="middle" >Global feature fusion</th><th align="center" valign="middle" >P-Value/%</th></tr></thead><tr><td align="center" valign="middle" >10 types of images, 100 for each</td><td align="center" valign="middle" >R = 2</td><td align="center" valign="middle" >Yes</td><td align="center" valign="middle" >83.9</td><td align="center" valign="middle" >48%/52%</td></tr><tr><td align="center" valign="middle" >10 types of images, 100 for each</td><td align="center" valign="middle" >R = 2</td><td align="center" valign="middle" >No</td><td align="center" valign="middle" >149.8</td><td align="center" valign="middle" >50%/56%</td></tr><tr><td align="center" valign="middle" >20 types of images, 100 for each</td><td align="center" valign="middle" >R = 4</td><td align="center" valign="middle" >Yes</td><td align="center" valign="middle" >189.2</td><td align="center" valign="middle" >57%/61%</td></tr><tr><td align="center" valign="middle" >20 types of images, 100 for each</td><td align="center" valign="middle" >R = 4</td><td align="center" valign="middle" >No</td><td align="center" valign="middle" >343.2</td><td align="center" valign="middle" >60%/64%</td></tr></tbody></table></table-wrap>
<table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title>Comparison of P values of different classification models</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Classification model</th><th align="center" valign="middle" >Dataset</th><th align="center" valign="middle" >Image content</th><th align="center" valign="middle" >P-Value/%</th></tr></thead><tr><td align="center" valign="middle" >Color Histogram</td><td align="center" valign="middle" >UEC FOOD 100</td><td align="center" valign="middle" >Asian food</td><td align="center" valign="middle" >52%</td></tr><tr><td align="center" valign="middle" >Bag of SIFT Features</td><td align="center" valign="middle" >UEC FOOD 100</td><td align="center" valign="middle" >Asian food</td><td align="center" valign="middle" >49%</td></tr><tr><td align="center" valign="middle" >OM</td><td align="center" valign="middle" >PFID</td><td align="center" valign="middle" >Fast food</td><td align="center" valign="middle" >78%</td></tr><tr><td align="center" valign="middle" >OM</td><td align="center" valign="middle" >UEC FOOD 100</td><td align="center" valign="middle" >Asian food</td><td align="center" valign="middle" >57%</td></tr><tr><td align="center" valign="middle" >Texture + SIFT + MKL</td><td align="center" valign="middle" >UEC FOOD 100</td><td align="center" valign="middle" >Asian food</td><td align="center" valign="middle" >61%</td></tr><tr><td align="center" valign="middle" >SLGC</td><td align="center" valign="middle" >UEC FOOD 100</td><td align="center" valign="middle" >Asian food</td><td align="center" valign="middle" >64%</td></tr></tbody></table></table-wrap>
<p>On the UEC FOOD 100 dataset, we achieved performance similar to that of Texture + SIFT + MKL, reaching more than 60%. In addition, because Texture + SIFT + MKL does not preprocess the original image, we use the GrabCut algorithm to extract the food subject from the image, which further improves the classification accuracy to about 64%.</p></sec></sec><sec id="s5"><title>5. 
Conclusions</title><p>To explore the intrinsic characteristics of Asian food images and improve classification accuracy, this paper studies and analyzes the unique structure and colour characteristics of Asian foods and proposes SLGC, an image classification model based on feature fusion. It constructs a feature representation that combines SURF features, local colour information and global colour information, which can extract the features of Asian food images comprehensively and efficiently. In addition, an image segmentation algorithm is used to separate out invalid interference information and highlight the food subject, which further improves the classification performance. The experimental results show that the SLGC based on feature fusion can effectively improve image classification performance.</p><p>In the course of this research, we tried many feature matching and fusion schemes. The theoretical and practical basis of these schemes lies not only in computer vision technologies but also in an in-depth study of the characteristics of Asian food. However, the study of the characteristics of Asian food in this article is still insufficient. In future work, we can try to fuse deeper texture and structural features to strengthen feature expression and improve the understanding of Asian food pictures.</p></sec><sec id="s6"><title>Acknowledgements</title><p>This work was supported by the Outstanding Youth Innovation Teams in Higher Education of Shandong Province (2019KJN048).</p></sec><sec id="s7"><title>Conflicts of Interest</title><p>The authors declare no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s8"><title>Cite this paper</title><p>Wu, R.Q., Zhao, S. and Qu, Z.J. (2020) An SLGC Model for Asian Food Image Classification. Journal of Computer and Communications, 8, 26-43. 
https://doi.org/10.4236/jcc.2020.84003</p></sec><sec id="s9"><title>NOTES</title></sec></body><back><ref-list><title>References</title><ref id="scirp.99419-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Subhi, M.A. and Ali, S.M. (2018) A Deep Convolutional Neural Network for Food Detection and Recognition. 2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES), Borneo, 3-6 December 2018, 284-287. https://doi.org/10.1109/IECBES.2018.8626720</mixed-citation></ref><ref id="scirp.99419-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Tiankaew, U., Chunpongthong, P. and Mettanant, V. (2018) A Food Photography App with Image Recognition for Thai Food. 2018 Seventh ICT International Student Project Conference (ICT-ISPC), Nakhon Pathom, 11-13 July 2018, 1-6. https://doi.org/10.1109/ICT-ISPC.2018.8523925</mixed-citation></ref><ref id="scirp.99419-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">He, H., Kong, F. and Tan, J. (2015) DietCam: Multiview Food Recognition Using a Multikernel SVM. IEEE Journal of Biomedical and Health Informatics, 20, 848-855. https://doi.org/10.1109/JBHI.2015.2419251</mixed-citation></ref><ref id="scirp.99419-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Sasano, S., Han, X.H. and Chen, Y.W. (2016) Food Recognition by Combined Bags of Color Features and Texture Features. 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Datong, China, 15-17 October 2016, 815-819. https://doi.org/10.1109/CISP-BMEI.2016.7852822</mixed-citation></ref><ref id="scirp.99419-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Liang, G., Hong, H., Xie, W., et al. (2018) Combining Convolutional Neural Network with Recursive Neural Network for Blood Cell Image Classification. IEEE Access, 6, 36188-36197. 
https://doi.org/10.1109/ACCESS.2018.2846685</mixed-citation></ref><ref id="scirp.99419-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Zhao, W., Jiao, L., Ma, W., et al. (2017) Superpixel-Based Multiple Local CNN for Panchromatic and Multispectral Image Classification. IEEE Transactions on Geoscience and Remote Sensing, 55, 4141-4156. https://doi.org/10.1109/TGRS.2017.2689018</mixed-citation></ref><ref id="scirp.99419-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Hnoohom, N. and Yuenyong, S. (2018) Thai Fast Food Image Classification Using Deep Learning. 2018 International ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI-NCON), 116-119. https://doi.org/10.1109/ECTI-NCON.2018.8378293</mixed-citation></ref><ref id="scirp.99419-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Ege, T. and Yanai, K. (2017) Estimating Food Calories for Multiple-Dish Food Photos. 2017 4th IAPR Asian Conference on Pattern Recognition (ACPR), Nanjing, 26-29 November 2017, 646-651. https://doi.org/10.1109/ACPR.2017.145</mixed-citation></ref><ref id="scirp.99419-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Naritomi, S., Tanno, R., Ege, T. and Yanai, K. (2018) FoodChangeLens: CNN-Based Food Transformation on HoloLens. 2018 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), Taichung, Taiwan, 10-12 December 2018, 197-199. https://doi.org/10.1109/AIVR.2018.00046</mixed-citation></ref><ref id="scirp.99419-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Sharma, O. (2019) Deep Challenges Associated with Deep Learning. 
2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, India, 14-16 February 2019, 72-75. https://doi.org/10.1109/COMITCon.2019.8862453</mixed-citation></ref><ref id="scirp.99419-ref11"><label>11</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Wang</surname><given-names> D.Y.</given-names></name>, <etal>et al</etal>. (<year>2019</year>) <article-title>Comparison and Reference of the Differences between Chinese and Western Food and Nutrition Development</article-title>. <source>Food and Nutrition in China</source>, <volume>25</volume>, <fpage>5</fpage>-<lpage>8</lpage>.</mixed-citation></ref><ref id="scirp.99419-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Jaisakthi, S.M., Mirunalini, P. and Aravindan, C. (2018) Automated Skin Lesion Segmentation of Dermoscopic Images Using GrabCut and k-Means Algorithms. IET Computer Vision, 12, 1088-1095. https://doi.org/10.1049/iet-cvi.2018.5289</mixed-citation></ref><ref id="scirp.99419-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Ren, D., Jia, Z., Yang, J., et al. (2017) A Practical Grabcut Color Image Segmentation Based on Bayes Classification and Simple Linear Iterative Clustering. IEEE Access, 5, 18480-18487. https://doi.org/10.1109/ACCESS.2017.2752221</mixed-citation></ref><ref id="scirp.99419-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Mustafa, R. and Dhar, P. (2018) A Method to Recognize Food Using Gist and SURF Features. 2018 Joint 7th International Conference on Informatics, Electronics &amp; Vision (ICIEV) and 2018 2nd International Conference on Imaging, Vision &amp; Pattern Recognition (icIVPR), 127-130. 
https://doi.org/10.1109/ICIEV.2018.8641072</mixed-citation></ref><ref id="scirp.99419-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Zhu, Q., Zhong, Y., Zhao, B., et al. (2016) Bag-of-Visual-Words Scene Classifier with Local and Global Features for High Spatial Resolution Remote Sensing Imagery. IEEE Geoscience and Remote Sensing Letters, 13, 747-751. https://doi.org/10.1109/LGRS.2015.2513443</mixed-citation></ref></ref-list></back></article>