<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JILSA</journal-id><journal-title-group><journal-title>Journal of Intelligent Learning Systems and Applications</journal-title></journal-title-group><issn pub-type="epub">2150-8402</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jilsa.2022.141001</article-id><article-id pub-id-type="publisher-id">JILSA-119132</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  Toward an Intelligent System for Taurine Cattle Recognition
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Fulbert</surname><given-names>Bembamba</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Frédéric</surname><given-names>T. Ouédraogo</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Soudré</surname><given-names>Albert</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Amadou</surname><given-names>Traoré</given-names></name><xref ref-type="aff" rid="aff3"><sup>3</sup></xref></contrib></contrib-group><aff id="aff3"><addr-line>Institut de l’Environnement et de Recherches Agricoles (INERA), Ouagadougou, Burkina Faso</addr-line></aff><aff id="aff2"><addr-line>Université Norbert Zongo, Koudougou, Burkina Faso</addr-line></aff><aff id="aff1"><addr-line>Université Norbert Zongo, Laboratoire Mathématiques, Informatique et Applications (LAMIA), Koudougou, Burkina Faso</addr-line></aff><pub-date pub-type="epub"><day>28</day><month>02</month><year>2022</year></pub-date><volume>14</volume><issue>01</issue><fpage>1</fpage><lpage>13</lpage><history><date date-type="received"><day>1</day><month>February</month><year>2022</year></date><date date-type="rev-recd"><day>21</day><month>February</month><year>2022</year></date><date date-type="accepted"><day>28</day><month>February</month><year>2022</year></date></history><permissions><copyright-statement>&#169; Copyright 2014 by authors and Scientific Research Publishing Inc. 
</copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution-NonCommercial International License (CC BY-NC). http://creativecommons.org/licenses/by-nc/4.0/</license-p></license></permissions><abstract><p>
 
 
  Unlike zebus, taurine cattle have a natural ability to resist trypanosomosis, a parasitic disease endemic to the humid areas of West Africa. However, repeated crossbreeding between zebus and taurine cattle is jeopardizing the genetic heritage of the taurines and their ability to resist trypanosomosis. To strengthen protection and conservation efforts, it is essential to accurately distinguish purebred taurines from crossbreds. In this study, five machine learning models were built using morphological data collected from 1968 cattle. These models were trained to determine whether a given individual is a purebred taurine or not. The classifiers yielded promising results. The random forest model and the RBF kernel SVM performed best, with up to 86% and 85% accuracy respectively. Moreover, the study of the correlation coefficients and the feature importance scores allowed us to identify the most discriminating morphological traits.
 
</p></abstract><kwd-group><kwd>Machine Learning</kwd><kwd>Trypanosomosis</kwd><kwd>Purebred Taurine</kwd><kwd>Accuracy</kwd><kwd>Model</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>In a world awash in data, Artificial Intelligence (AI) is becoming an increasingly important part of our lives. This science at the junction of algebra, statistics, probability and computer science has diversified to meet a wide range of needs. Among the different branches of AI is Machine Learning (ML), which is used when it is difficult or impossible to define explicit instructions for a computer to solve a problem, but many illustrative examples are at hand. A classical program uses a procedure and the data it receives (input) to produce answers (output); a Machine Learning program, by contrast, uses the data and the answers to produce the procedure that makes it possible to obtain the latter from the former [<xref ref-type="bibr" rid="scirp.119132-ref1">1</xref>].</p><p>AI in general, and machine learning in particular, are progressively becoming strategic research axes for decision support solutions in several fields such as finance, marketing and security. AI has also entered agriculture and livestock farming, especially by contributing to improving the health and production of animals [<xref ref-type="bibr" rid="scirp.119132-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.119132-ref3">3</xref>], but also in the field of genetic improvement and conservation [<xref ref-type="bibr" rid="scirp.119132-ref4">4</xref>]. This is the case for the West African taurine cattle, also known as Lobi or Baoul&#233;. Taurine cattle are tolerant to trypanosomosis, though smaller in size and with lower productivity than most zebu-type cattle [<xref ref-type="bibr" rid="scirp.119132-ref5">5</xref>]. 
Trypanosomosis is the main parasitic disease of ruminants in wetlands, causing enormous economic losses to producers. However, for the Sahel region, these wetlands are the most suitable places for livestock production because of the abundance of fodder and pasture. The effects of climate change are accelerating the migration of zebus to these areas that were once known as taurine sanctuaries. Uncontrolled and indiscriminate crossbreeding among local cattle types is thus taking place, leading to the dilution of trypano-tolerance and to threats to the genetic integrity of West African taurine cattle types [<xref ref-type="bibr" rid="scirp.119132-ref5">5</xref>]. As a result, empirical methods of distinguishing the two subspecies, formerly based on visual differences in morphological traits (size, presence of hump, etc.), no longer work. An effective yet very costly method is the laboratory analysis of blood samples. Our study aims to propose a low-cost method based on machine learning techniques to make this distinction easily. In the long run, we plan to integrate the results achieved here with image processing applications to identify purebred taurines from their images.</p><p>This paper is structured as follows. Section 2 presents the context of the problem we address. Section 3 gives an overview of related work. Section 4 describes the methods and provides definitions and background. Sections 5 and 6 present the results and discuss them, respectively. Finally, Section 7 concludes the paper.</p></sec><sec id="s2"><title>2. Context</title><p>There are two subspecies of cattle: zebus and taurines. The taurine cattle live in the wetlands. 
This fodder-rich region is unfortunately infested with tsetse flies, the vector of an endemic parasitic disease called trypanosomosis, which causes enormous losses to livestock:</p><p>&#183; direct economic losses due to morbidity;</p><p>&#183; stunted growth of young animals;</p><p>&#183; weight loss;</p><p>&#183; low milk production;</p><p>&#183; infertility;</p><p>&#183; abortion in cows;</p><p>&#183; etc.</p><p>The taurines are special in that they have a natural resistance to trypanosomosis. Unfortunately, this genetic trait is undermined by repeated crossbreeding with zebus over several generations, due to the seasonal transhumance of the latter towards the wetlands and to the deliberate actions of breeders who seek larger animals through these crossings.</p><p>In order to preserve this type of cattle, it is necessary to find out whether a given individual is a pure taurine or not. The empirical segregation methods are less and less accurate because of the massive crossings. The only formal method is genetic analysis, which is too costly in time and resources. We therefore turn to artificial intelligence for this characterization.</p><p>A conservation project working in the Sahel that focuses on the preservation of taurine cattle has produced several scientific publications on the topic, though in the natural and social sciences. For our study, we use the data collected by this research project. Phenotypic data were measured on several thousand cattle in accordance with the 2012 FAO guidelines [<xref ref-type="bibr" rid="scirp.119132-ref6">6</xref>] for the phenotypic characterization of animal genetic resources (<xref ref-type="table" rid="table1">Table 1</xref>). Blood samples were also taken for laboratory analysis. 
These analyses made it possible, among other things, to determine formally whether an individual is a purebred taurine (with full trypano-tolerance capacity), a pure zebu (no trypano-tolerance capacity) or a crossbred (partial trypano-tolerance capacity). In the present work, we use the first dataset of 1968 individuals (taurines, zebus, crossbreds) in which six traits have been assessed: height at withers, chest girth, body length, weight, sex and age.</p></sec><sec id="s3"><title>3. Related Work</title><p>Animal species identification is an important issue for the modernization of livestock farming. The scientific literature reveals different techniques that replace direct observation methods. These techniques are mainly based on body measurements, images or biological markers. One important issue is how to obtain the body features. Traditional direct measurement of animals consumes time and effort. For instance, the use of scales for live weight measurement requires a vehicle, qualified personnel and special facilities. To overcome this difficulty, [<xref ref-type="bibr" rid="scirp.119132-ref7">7</xref>] and [<xref ref-type="bibr" rid="scirp.119132-ref6">6</xref>] used barymetric equations, applicable to Niger Azawak and Burkina Faso taurine cattle, to estimate weight. 
The main drawback of this technique is that it provides low accuracy with adult animals because of possible fattening or the physiological state of females.</p><table-wrap id="table1"><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title>List of quantitative traits</title></caption><table><thead><tr><th align="center" valign="middle">Head measurements</th><th align="center" valign="middle">Body measurements</th></tr></thead><tbody><tr><td align="center" valign="middle">cranial length, head width, head length, cranial width, facial length, facial width, muzzle circumference, distance between horn tips, distance between horn bases, horn length, ear length</td><td align="center" valign="middle">height at withers, thoracic perimeter, height at sacrum, body length, length of scapula ischium, hip width, ischium width, tail length, chest depth, shoulder width, chest width, teat length, weight</td></tr></tbody></table></table-wrap><p>Rudendko [<xref ref-type="bibr" rid="scirp.119132-ref8">8</xref>] derived cows’ weight using artificial neural network algorithms. This is achieved in two steps: first, a convolutional neural network (CNN) is used to detect cows in the picture, and the stereopsis method allows the system to obtain their size measurements, such as withers height, hip height, body length and hip width, via photogrammetry; second, these measurements are used to determine the cow’s live weight.</p><p>References [<xref ref-type="bibr" rid="scirp.119132-ref9">9</xref>] and [<xref ref-type="bibr" rid="scirp.119132-ref10">10</xref>] trained CNN classifiers to assign images of dogs to the appropriate class out of 120 breeds. The problem is tackled as an image classification problem using a deep convolutional neural network. 
In [<xref ref-type="bibr" rid="scirp.119132-ref10">10</xref>], the image is divided into numerous lattices and the extracted descriptors serve as input for the CNNs that are trained to identify dog breeds.</p><p>Reference [<xref ref-type="bibr" rid="scirp.119132-ref11">11</xref>] implemented an effective breed identification system using single nucleotide polymorphism (SNP) genetic markers genotyped from pig meat products. Six machine learning methods were trained to perform this identification task. SVM yielded the most accurate performance.</p><p>The identification methods outlined above rely on techniques that are costly in computational as well as material resources. Our approach, which also offers good accuracy, is based on Machine Learning, using phenotypic data collected from hundreds of cows to predict their subspecies. This method can be integrated into a lightweight and affordable intelligent system for breed recognition in the social and economic context of the Sahel.</p></sec><sec id="s4"><title>4. Methods</title><sec id="s4_1"><title>4.1. Conceptual Framework</title><p>The problem we have to deal with is to decide whether a designated bovine individual is a pure taurine or not. For this purpose, we have its morphological measurements. To train our model, we also have at our disposal the measurements of thousands of other individuals (examples) with their label: the “pure” character. The problem is, therefore, a supervised learning matter. According to [<xref ref-type="bibr" rid="scirp.119132-ref12">12</xref>], supervised learning is the machine learning task of learning a function that maps an input to an output based on input-output pairs.</p><p>For the sake of simplicity, we restrict our attention in this phase to determining whether the individual is pure or not, regardless of the interbreeding rate. So, the space of labels is binary: {pure, not pure}. 
We are thus reduced to a binary classification problem.</p></sec><sec id="s4_2"><title>4.2. Selection of Algorithms</title><p>Machine Learning relies on different algorithms to solve data problems. Choosing an appropriate classification algorithm for a particular task requires practice and experience [<xref ref-type="bibr" rid="scirp.119132-ref13">13</xref>]. At this stage of our study, we have chosen a limited number of the most commonly used algorithms, making sure that they are as representative as possible of the different types of algorithms: linear, non-linear, instance-based, Bayesian, and ensemble methods.</p><p>After a model is trained, we evaluate its performance on the test set to estimate how accurate future predictions will be in similar situations. To compare models, we can compare their accuracy, precision, or recall values. Reference [<xref ref-type="bibr" rid="scirp.119132-ref14">14</xref>] recommends that AUC (Area Under the Curve) be used in preference to overall accuracy for single-number evaluation of machine learning algorithms.</p></sec><sec id="s4_3"><title>4.3. Overview of a Few ML Algorithms</title><p>In the following lines, we briefly describe the general principles of the machine learning algorithms that we implement in this study. 
We use the following notation:</p><p>&#183; n: the number of examples;</p><p>&#183; m: the number of features of an example;</p><p>&#183; <inline-formula><inline-graphic xlink:href="/html.scirp.org/file/1-9601526x2.png" xlink:type="simple"/></inline-formula>: <inline-formula><inline-graphic xlink:href="/html.scirp.org/file/1-9601526x3.png" xlink:type="simple"/></inline-formula>: the ith example;</p><p>&#183; <inline-formula><inline-graphic xlink:href="/html.scirp.org/file/1-9601526x4.png" xlink:type="simple"/></inline-formula>; <inline-formula><inline-graphic xlink:href="/html.scirp.org/file/1-9601526x5.png" xlink:type="simple"/></inline-formula> or simply <inline-formula><inline-graphic xlink:href="/html.scirp.org/file/1-9601526x6.png" xlink:type="simple"/></inline-formula> if there is no ambiguity: the jth feature of the ith example;</p><p>&#183; <inline-formula><inline-graphic xlink:href="/html.scirp.org/file/1-9601526x7.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="/html.scirp.org/file/1-9601526x8.png" xlink:type="simple"/></inline-formula>: respectively the true class label and the predicted class label of the ith training example;</p><p>&#183; <inline-formula><inline-graphic xlink:href="/html.scirp.org/file/1-9601526x9.png" xlink:type="simple"/></inline-formula>: the jth model weight.</p><sec id="s4_3_1"><title>4.3.1. Random Forest (RF)</title><p>Decision Trees are considered to be one of the most popular approaches for representing classifiers [<xref ref-type="bibr" rid="scirp.119132-ref15">15</xref>]. However, they are known to have high variance and so tend to overfit. Random forest is an ensemble method that combines several trees in order to reduce overfitting. Ensemble methods apply the “wisdom of the crowd” concept. 
This concept is based on the idea that combining many weak learners results in a performance that is far beyond the individual performance of those learners, because their errors compensate for each other. Reference [<xref ref-type="bibr" rid="scirp.119132-ref16">16</xref>] proposes building the individual trees of the forest using different variables: at each node, a number p of variables smaller than the total number is selected before applying the splitting criterion.</p></sec><sec id="s4_3_2"><title>4.3.2. Logistic Regression (LR)</title><p>Logistic regression is a fast model to train and is effective on binary classification problems. It is one of the most widely used algorithms for classification in industry [<xref ref-type="bibr" rid="scirp.119132-ref17">17</xref>]. The basic principle is similar to linear regression, where the hypothesis space consists of linear combinations of the variables:</p><disp-formula id="scirp.119132-formula30"><label>(1)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x10.png?20220811165608078"  xlink:type="simple"/></disp-formula><p>This linear hypothesis can yield very high values as well as very low values (below zero). Logistic regression transforms this output using the sigmoid function to return a probability value between 0 and 1. 
Concretely, we apply the sigmoid function:</p><disp-formula id="scirp.119132-formula31"><label>(2)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x11.png?20220811165608078"  xlink:type="simple"/></disp-formula><p>Therefore, we have:</p><disp-formula id="scirp.119132-formula32"><label>(3)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x12.png?20220811165608078"  xlink:type="simple"/></disp-formula><p>with</p><disp-formula id="scirp.119132-formula33"><label>(4)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x13.png?20220811165608078"  xlink:type="simple"/></disp-formula><p>The model is trained by minimizing the cost function <inline-formula><inline-graphic xlink:href="/html.scirp.org/file/1-9601526x14.png" xlink:type="simple"/></inline-formula>, using the gradient descent technique:</p><disp-formula id="scirp.119132-formula34"><label>(5)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x15.png?20220811165608078"  xlink:type="simple"/></disp-formula></sec><sec id="s4_3_3"><title>4.3.3. 
Na&#239;ve Bayes (NB)</title><p>The model comprises two types of probabilities that can be calculated directly from the training data:</p><p>&#183; The prior probability of each class.</p><p>&#183; The conditional probability of each x value given each class.</p><disp-formula id="scirp.119132-formula35"><label>(6)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x16.png?20220811165608078"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="/html.scirp.org/file/1-9601526x17.png" xlink:type="simple"/></inline-formula> is the posterior probability,</p><p><inline-formula><inline-graphic xlink:href="/html.scirp.org/file/1-9601526x18.png" xlink:type="simple"/></inline-formula> is the class prior probability,</p><p><inline-formula><inline-graphic xlink:href="//html.scirp.org/file/1-9601526x19.png" xlink:type="simple"/></inline-formula> is the likelihood,</p><p><inline-formula><inline-graphic xlink:href="//html.scirp.org/file/1-9601526x20.png" xlink:type="simple"/></inline-formula> is the predictor prior probability.</p><p>The predictor prior probability term is constant with regard to the class values. 
Therefore, we can write:</p><disp-formula id="scirp.119132-formula36"><label>(7)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x21.png?20220811165608078"  xlink:type="simple"/></disp-formula><p>The different Bayes classifiers differ mainly by the assumptions they make regarding the distribution of <inline-formula><inline-graphic xlink:href="//html.scirp.org/file/1-9601526x22.png" xlink:type="simple"/></inline-formula> [<xref ref-type="bibr" rid="scirp.119132-ref18">18</xref>].</p><p>With the naive conditional independence assumption, for example, this expression becomes:</p><disp-formula id="scirp.119132-formula37"><label>(8)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x23.png?20220811165608078"  xlink:type="simple"/></disp-formula></sec><sec id="s4_3_4"><title>4.3.4. K-Nearest Neighbors (KNN)</title><p>The nearest neighbors algorithm is one of the simplest predictive models there is [<xref ref-type="bibr" rid="scirp.119132-ref13">13</xref>]. Predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those instances. We select in advance the number (K) of neighbors to consider and the distance metric to apply. KNN is called “lazy” not because of its apparent simplicity, but because it does not learn a discriminative function from the training data and memorizes the training dataset instead [<xref ref-type="bibr" rid="scirp.119132-ref17">17</xref>]. KNN belongs to a subcategory of non-parametric models described as instance-based learning.</p></sec><sec id="s4_3_5"><title>4.3.5. Kernel Support Vector Machine (SVMk)</title><p>SVM might be one of the most powerful and widely used classifiers and can be considered an extension of the perceptron [<xref ref-type="bibr" rid="scirp.119132-ref17">17</xref>]. 
In SVM, the optimization objective is to maximize the margin, defined as the distance between the separating hyperplane (decision boundary) and the support vectors. The support vectors are the training examples that are closest to this hyperplane. The optimal hyperplane can be written as:</p><disp-formula id="scirp.119132-formula38"><label>(9)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x24.png?20220811165608078"  xlink:type="simple"/></disp-formula><p>where w is the weight vector, x the input vector and b the bias. For all elements of the training set, w and b should satisfy [<xref ref-type="bibr" rid="scirp.119132-ref19">19</xref>]:</p><disp-formula id="scirp.119132-formula39"><label>(10)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x25.png?20220811165608078"  xlink:type="simple"/></disp-formula><p>Support vectors are those <inline-formula><inline-graphic xlink:href="//html.scirp.org/file/1-9601526x26.png" xlink:type="simple"/></inline-formula> for which</p><disp-formula id="scirp.119132-formula40"><label>(11)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x27.png?20220811165608078"  xlink:type="simple"/></disp-formula><p>The training objective is to find the right parameters (<inline-formula><inline-graphic xlink:href="//html.scirp.org/file/1-9601526x28.png" xlink:type="simple"/></inline-formula> and b) so that the hyperplane separates the data and maximizes the margin <inline-formula><inline-graphic xlink:href="//html.scirp.org/file/1-9601526x29.png" xlink:type="simple"/></inline-formula>, which is equivalent to minimizing <inline-formula><inline-graphic xlink:href="//html.scirp.org/file/1-9601526x29.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="//html.scirp.org/file/1-9601526x30.png" xlink:type="simple"/></inline-formula>.</p><p>SVM is also popular because it can be kernelized to solve nonlinear classification problems. 
In practice, we use a mapping function to transform the training data into a higher-dimensional feature space, and then train a linear SVM to classify the data in this new feature space. The “kernel trick” avoids the expensive explicit computation of this mapping. We define a kernel function as:</p><disp-formula id="scirp.119132-formula41"><label>(12)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x31.png?20220811165608078"  xlink:type="simple"/></disp-formula><p>One popular kernel function is the Radial Basis Function (RBF), which can be written as:</p><disp-formula id="scirp.119132-formula42"><label>(13)</label><graphic position="anchor" xlink:href="//html.scirp.org/file/1-9601526x32.png?20220811165608078"  xlink:type="simple"/></disp-formula><p>Here, <inline-formula><inline-graphic xlink:href="//html.scirp.org/file/1-9601526x33.png" xlink:type="simple"/></inline-formula> is a free parameter to be optimized. The kernel value k can be interpreted as a similarity score, ranging from 0 (very dissimilar examples) to 1 (exactly similar examples).</p></sec></sec><sec id="s4_4"><title>4.4. Data Preparation</title><p>The quality of the data and the amount of useful information that it contains are key factors that determine how well a machine learning algorithm can learn [<xref ref-type="bibr" rid="scirp.119132-ref17">17</xref>]. Data preparation is the process of transforming raw data so that they can be run through machine learning algorithms. This involves handling categorical data and missing values, rescaling data, etc. Supervised machine learning techniques require splitting data into multiple parts for the training and testing steps. However, when dividing a dataset, we have to keep in mind that we are withholding valuable information that the learning algorithm could benefit from. At the same time, the smaller the test set, the more inaccurate the estimation of the generalization error. 
Therefore, dividing a dataset into training and test sets is all about balancing this tradeoff [<xref ref-type="bibr" rid="scirp.119132-ref20">20</xref>]. In this work, we used the “hold-out” method: we split the dataset into two chunks, the calibration (training) sample and the test sample, in proportions of 70% and 30%.</p></sec></sec><sec id="s5"><title>5. Results</title><p>In the data preparation process, we cleaned the data by discarding entries that contained missing values or outliers. These inconsistent data represented 8.6% of the entire dataset. In the end, 1797 observations were retained for the study.</p><p>Data exploration allowed us to visualize the shape of the feature distributions. We used pair plots to assess the correlation between the features. Graphs plotting the features against one another, on the one hand, and against the label, on the other hand, showed that the interdependence is not negligible. In particular, a strong correlation was noted between weight and chest girth, with a correlation coefficient of 0.948463 (<xref ref-type="fig" rid="fig1">Figure 1</xref>).</p><p>Correlations between the different descriptors and the “pure” trait were also analyzed. Height at withers has the strongest correlation with the label: −0.472985. This is corroborated by the feature importance graphic (<xref ref-type="fig" rid="fig3">Figure 3</xref>).</p><p>We trained the five algorithms presented earlier on the calibration set. The resulting predictive models were evaluated on the test sample to measure their generalization ability. Hyperparameters were adjusted to yield the best performance. 
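</p><p>The accuracy, precision and recall used to compare the models can all be derived from the counts of true and false positives and negatives on the test sample. The following is a minimal sketch of that computation, not the code used in the study: the function names and the toy labels (1 for pure taurine, 0 for not pure) are illustrative assumptions.</p><preformat>
```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) as fractions between 0 and 1."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels (1 = pure taurine, 0 = not pure); illustrative only, not study data.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
```
</preformat><p>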
The results obtained are shown in <xref ref-type="table" rid="table2">Table 2</xref> and <xref ref-type="fig" rid="fig2">Figure 2</xref>.</p><table-wrap id="table2"><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title>Performance measures</title></caption><table><thead><tr><th align="center" valign="middle"></th><th align="center" valign="middle">Accuracy</th><th align="center" valign="middle">Precision</th><th align="center" valign="middle">Recall</th></tr></thead><tbody><tr><td align="center" valign="middle">RF</td><td align="center" valign="middle">86%</td><td align="center" valign="middle">0.87</td><td align="center" valign="middle">0.88</td></tr><tr><td align="center" valign="middle">NB</td><td align="center" valign="middle">64%</td><td align="center" valign="middle">0.62</td><td align="center" valign="middle">0.58</td></tr><tr><td align="center" valign="middle">LR</td><td align="center" valign="middle">81%</td><td align="center" valign="middle">0.79</td><td align="center" valign="middle">0.80</td></tr><tr><td align="center" valign="middle">SVMk</td><td align="center" valign="middle">85%</td><td align="center" valign="middle">0.83</td><td align="center" valign="middle">0.85</td></tr><tr><td align="center" valign="middle">KNN</td><td align="center" valign="middle">83%</td><td align="center" valign="middle">0.83</td><td align="center" valign="middle">0.78</td></tr></tbody></table></table-wrap><p>ROC (Receiver Operating Characteristic) curves were also drawn (see <xref ref-type="fig" rid="fig4">Figure 4</xref>). As recommended by [<xref ref-type="bibr" rid="scirp.119132-ref14">14</xref>], AUC (Area Under the Curve) values were calculated in order to compare the performance of the different methods.</p><p>From the AUC perspective, it appears that the nonlinear kernel SVM is the best-performing algorithm (AUC = 0.9202), followed by Random Forest (AUC = 0.9161). 
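</p><p>As a reminder of what these values measure, the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; it is an illustration only, not the procedure used in the study, and the function name and the toy scores are assumptions.</p><preformat>
```python
def auc_from_scores(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts for one half.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores (predicted probability of "pure"); illustrative only, not study data.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = auc_from_scores(y_true, scores)
```
</preformat><p>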
K-Nearest Neighbors (K = 10) and Logistic Regression have very similar performance. Their ROC curves overlap at some cut-off points and their AUC values are very close: 0.8963 and 0.8919 respectively. Naive Bayes yields the worst result. The Random Forest model remains the best as far as accuracy (86%), precision (87%) and recall (88%) are concerned.</p></sec><sec id="s6"><title>6. Discussion</title><p>Naive Bayes yields the lowest performance: 64% accuracy. This is due to the algorithm’s strong independence assumption. It is clearly difficult to completely decouple age from chest girth or weight, for example. Reference [<xref ref-type="bibr" rid="scirp.119132-ref1">1</xref>] showed that the Naive Bayes model reaches its best performance in two opposite cases: completely independent features and functionally dependent features. In the present case, the level of feature dependence lies in between.</p><p>Furthermore, the analysis of the coefficients of the regression model confirms that size (height at withers) is the most significant discriminant variable between the two subspecies. The negative sign of the coefficients indicates an inverse relationship. This reinforces the general view that taurines are smaller than zebus. The feature importance plot (<xref ref-type="fig" rid="fig3">Figure 3</xref>) supports this, since height at withers and body length score the highest.</p><p>The strong correlation between weight and chest girth can be explained by the measurement technique used in the field. Indeed, technicians used a weigh band, a tool that deduces the weight from the chest girth, which is measured directly on the animal [<xref ref-type="bibr" rid="scirp.119132-ref6">6</xref>].</p><p>The performance ranking showed that nonlinear models provide better results. Random Forest gives an accuracy of 86%, while kernel SVM and KNN achieved 85% and 83% respectively. These algorithms often lead to models with high variance. 
There is therefore a non-negligible risk of overfitting.</p></sec><sec id="s7"><title>7. Conclusions</title><p>Trypanosomosis, which is prevalent in humid areas of West Africa, leads to a drop in livestock production and higher operating costs. The taurines, unlike the zebu species, have an innate ability to resist this disease. Unfortunately, uncontrolled crossbreeding between these two species of cattle dilutes this resistance and threatens the genetic heritage of the taurines. Innovative means such as machine learning applications are needed to help preserve the taurine species and its precious trypanotolerance. To achieve this, it is crucial to distinguish purebred taurines from other cattle. In this study, we applied five machine learning algorithms to train supervised models for this identification task. Random Forest performed best, with up to 86% accuracy, 88% recall and an AUC of 0.9161. The study confirmed that height at withers is the most discriminating of the six descriptors analyzed.</p><p>To obtain better results, the study should be continued by including the other morphological variables (<xref ref-type="table" rid="table1">Table 1</xref>). As this preliminary study indicates, nonlinear methods seem to be more effective. This trend could be further explored by implementing more advanced models such as artificial neural networks or XGBoost. Moreover, the generalization capacity of the models trained here could be improved by adding resampling methods such as bootstrapping.</p></sec><sec id="s8"><title>Acknowledgements</title><p>We wish to express our sincere gratitude to all the partners who provided us with data for this study, in particular the Local Cattle Breed of Burkina Faso (LoCaBreed) team. We would like to thank the Austrian Partnership Programme in Higher Education and Research for Development (APPEAR project 120). 
We are also grateful to the Minist&#232;re de l’Enseignement Sup&#233;rieur, de la Recherche Scientifique et de l’Innovation (MESRSI) of Burkina Faso.</p></sec><sec id="s9"><title>Conflicts of Interest</title><p>The authors declare no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s10"><title>Cite this paper</title><p>Bembamba, F., Ou&#233;draogo, F.T., Albert, S. and Traor&#233;, A. (2022) Toward an Intelligent System for Taurine Cattle Recognition. Journal of Intelligent Learning Systems and Applications, 14, 1-13. https://doi.org/10.4236/jilsa.2022.141001</p></sec></body><back><ref-list><title>References</title><ref id="scirp.119132-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Rish, I., et al. (2001) An Empirical Study of the Naive Bayes Classifier. IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, 4-6 August 2001, 41-46.</mixed-citation></ref><ref id="scirp.119132-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Wanga, H.P., Ghani, N. and Kalegele, K. (2015) Designing a Machine Learning-Based Framework for Enhancing Performance of Livestock Mobile Application System. American Journal of Software Engineering and Applications, 4, 56. https://doi.org/10.11648/j.ajsea.20150403.13</mixed-citation></ref><ref id="scirp.119132-ref3"><label>3</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Olasehinde</surname><given-names> O. </given-names></name>,<etal>et al</etal>. (<year>2021</year>)<article-title>Infrared Thermography and Machine Learning in Livestock Production</article-title><source> International Journal of Advanced Research and Review</source><volume> 6</volume>,<fpage> 38</fpage>-<lpage>57</lpage>.</mixed-citation></ref><ref id="scirp.119132-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Libbrecht, M.W. 
and Noble, W.S. (2015) Machine Learning Applications in Genetics and Genomics. Nature Reviews Genetics, 16, 321-332. https://doi.org/10.1038/nrg3920</mixed-citation></ref><ref id="scirp.119132-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Ouedraogo, D., Ouedraogo-Kone, S., Yougbare, B., et al. (2021) Population Structure, Inbreeding and Admixture in Local Cattle Populations Managed by Community-Based Breeding Programs in Burkina Faso. Journal of Animal Breeding and Genetics, 138, 379-388. https://doi.org/10.1111/jbg.12529</mixed-citation></ref><ref id="scirp.119132-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Yougbare, B., Soudre, A., Ouedraogo, D., et al. (2021) Genome-Wide Association Study of Trypanosome Prevalence and Morphometric Traits in Purebred and Crossbred Baoulé Cattle of Burkina Faso. PLOS ONE, 16, e0255089. https://doi.org/10.1371/journal.pone.0255089</mixed-citation></ref><ref id="scirp.119132-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Dodo, K., Pandey, V.S. and Illiassou, M.S. (2001) Utilisation de la barymétrie pour l’estimation du poids chez le zébu Azawak au Niger. Revue d’élevage et de médecine vétérinaire des pays tropicaux, 54, 63-68. https://doi.org/10.19182/remvt.9808</mixed-citation></ref><ref id="scirp.119132-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Rudenko, O., Megel, Y., Bezsonov, O., et al. (2020) Cattle Breed Identification and Live Weight Evaluation on the Basis of Machine Learning and Computer Vision. Proceedings of the Third International Workshop on Computer Modeling and Intelligent Systems (CMIS-2020), Zaporizhzhia, 27 April-1 May, 2020, 939-954. https://doi.org/10.32782/cmis/2608-70</mixed-citation></ref><ref id="scirp.119132-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Raduly, Z., Sulyok, C., et al. 
(2018) Dog Breed Identification Using Deep Learning. IEEE 16th International Symposium on Intelligent Systems and Informatics (SISY), Subotica, 13-15 September 2018, 271-276. https://doi.org/10.1109/SISY.2018.8524715</mixed-citation></ref><ref id="scirp.119132-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Kumar, R., Sharma, M., Dhawale, K., et al. (2019) Identification of Dog Breeds Using Deep Learning. 2019 IEEE 9th International Conference on Advanced Computing (IACC), Tiruchirappalli, 13-14 December 2019, 193-198. https://doi.org/10.1109/IACC48062.2019.8971604</mixed-citation></ref><ref id="scirp.119132-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Xu, Z.T., Diao, S.Q., Teng, J.Y., et al. (2021) Breed Identification of Meat Using Machine Learning and Breed Tag SNPs. Food Control, 125, Article ID: 107971. https://doi.org/10.1016/j.foodcont.2021.107971</mixed-citation></ref><ref id="scirp.119132-ref12"><label>12</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Mahesh</surname><given-names> B. </given-names></name>,<etal>et al</etal>. (<year>2020</year>)<article-title>Machine Learning Algorithms—A Review</article-title><source> International Journal of Science and Research (IJSR)</source><volume> 9</volume>,<fpage> 381</fpage>-<lpage>386</lpage>.</mixed-citation></ref><ref id="scirp.119132-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Grus, J. (2015) Data Science from Scratch: First Principles with Python. O’Reilly, Sebastopol.</mixed-citation></ref><ref id="scirp.119132-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Bradley, A.P. (1997) The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognition, 30, 1145-1159. 
https://doi.org/10.1016/S0031-3203(96)00142-2</mixed-citation></ref><ref id="scirp.119132-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Rokach, L. and Maimon, O. (2009) Classification Trees. In: Data Mining and Knowledge Discovery Handbook, Springer, Berlin, 149-174. https://doi.org/10.1007/978-0-387-09823-4_9</mixed-citation></ref><ref id="scirp.119132-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Breiman, L. (2001) Random Forests. Machine Learning, 45, 5-32. https://doi.org/10.1023/A:1010933404324</mixed-citation></ref><ref id="scirp.119132-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Raschka, S. and Mirjalili, V. (2019) Python Machine Learning: Machine Learning and Deep Learning with Python, Scikit-Learn and TensorFlow 2. 3rd Edition, Packt Publishing, Birmingham.</mixed-citation></ref><ref id="scirp.119132-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011) Scikit-Learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.</mixed-citation></ref><ref id="scirp.119132-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Huang, K.X., Xiao, C., Glass, L.M., et al. (2021) Machine Learning Applications for Therapeutic Tasks with Genomics Data. Patterns, 2, Article ID: 100328.</mixed-citation></ref><ref id="scirp.119132-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Hawkins, D.M., Basak, S.C. and Mills, D. (2003) Assessing Model Fit by Cross-Validation. Journal of Chemical Information and Computer Sciences, 43, 579-586. 
https://doi.org/10.1021/ci025626i</mixed-citation></ref></ref-list></back></article>