<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JCC</journal-id><journal-title-group><journal-title>Journal of Computer and Communications</journal-title></journal-title-group><issn pub-type="epub">2327-5219</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jcc.2023.112006</article-id><article-id pub-id-type="publisher-id">JCC-123309</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  Supervised Learning Algorithm on Unstructured Documents for the Classification of Job Offers: Case of Cameroun
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Fritz</surname><given-names>Sosso Makembe</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Roger</surname><given-names>Atsa Etoundi</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Hippolyte</surname><given-names>Tapamo</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>Department of Computer Science, Faculty of Science, University of Yaoundé 1, Yaoundé, Cameroon</addr-line></aff><pub-date pub-type="epub"><day>15</day><month>02</month><year>2023</year></pub-date><volume>11</volume><issue>02</issue><fpage>75</fpage><lpage>88</lpage><history><date date-type="received"><day>20</day><month>January</month><year>2023</year></date><date date-type="rev-recd"><day>24</day><month>February</month><year>2023</year></date><date date-type="accepted"><day>27</day><month>February</month><year>2023</year></date></history><permissions><copyright-statement>&#169; Copyright 2014 by authors and Scientific Research Publishing Inc.</copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution-NonCommercial International License (CC BY-NC). http://creativecommons.org/licenses/by-nc/4.0/</license-p></license></permissions><abstract><p>
 
 
  Supervised learning algorithms are now routinely used in data science for text classification, yet African textual data have rarely been studied with these methods. This article examines the particularities of such data and measures the precision of the naive Bayes, decision tree, and SVM (Support Vector Machine) algorithms on a corpus of IT job offers collected from the internet. The differences in precision are attributed to the data imbalance problem in machine learning. That problem is usually framed in terms of the number of documents in each class or subclass; here, we extend it to the distribution of word counts across a set of documents. The results are compared with those obtained on a set of French IT offers. Classification precision ranges from 88% to 90% for the French offers against at most 67% for the Cameroonian offers. The contribution of this study is twofold. First, it shows that, within a similar job category, job offers published online in Cameroon are less structured than those available in France. Second, it supports the hypothesis that sets of texts whose word counts follow a symmetrical distribution obtain better results with supervised learning algorithms.
 
</p></abstract><kwd-group><kwd>Job Offer</kwd><kwd>Underemployment</kwd><kwd>Text Classification</kwd><kwd>Imbalanced Data</kwd><kwd>Symmetric Word Distribution</kwd><kwd>Supervised Learning</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>In 2020, according to the World Bank [<xref ref-type="bibr" rid="scirp.123309-ref1">1</xref>], the unemployment rate in Cameroon was estimated at 3.4% and the underemployment rate at 84.7%. According to the International Labour Organization (ILO) [<xref ref-type="bibr" rid="scirp.123309-ref2">2</xref>], underemployment occurs when the duration or productivity of a person’s employment is inadequate in relation to other possible jobs that the person is willing and able to do. In Cameroon, it is interpreted as a failure of the labor market and is characterized mainly by the misuse of professional skills. Moreover, with the advent of ICT (Information and Communication Technologies), a new form of labor market is emerging in Cameroon: the online supply of and demand for jobs through websites and mobile applications. In this vein, Jonas Hjort et al. (2019) [<xref ref-type="bibr" rid="scirp.123309-ref3">3</xref>] provide evidence of the impact of the internet on the labor market in 12 African countries. Farrukh Suvankulov et al. (2012) [<xref ref-type="bibr" rid="scirp.123309-ref4">4</xref>] show that job seekers who used the internet saw their probability of being reemployed within 12 months increase from 7.1% to 12.7%. Good communication in this market requires the categorization of job offers. Job boards do group job offers into categories, but in Cameroon these categories correspond to the company’s field of activity and not to the description of the offer itself. The job market therefore fails because offers are poorly classified from the point of view of the job seeker. 
It is therefore necessary to categorize these offers so that applicants can easily find those that best correspond to their profile. R. Feldman and J. Sanger [<xref ref-type="bibr" rid="scirp.123309-ref5">5</xref>] define automatic text classification as the task of classifying a data instance into a predefined set of categories, i.e., given a set of categories (topics, classes, or labels) and a collection of text documents, classification is the process of automatically identifying the correct topic (or topics) for each document. We can therefore define job classification as the process of automatically grouping job offers that are similar. In other words, it amounts to automatically detecting the field to which a job offer relates from its content. The objective of text classification is thus to automatically assign documents to categories that have been defined either beforehand by an expert or automatically. Classification is supervised when the labeling is done by an expert and unsupervised (clustering) when the labeling is done automatically by a machine. The rest of our work focuses mainly on supervised classification.</p><p>In our context, IT offers include, for example, offers for frontend developers, backend developers, community managers, database managers, internet cafe assistants, or trainers on office software suites. We consider a computer scientist to be anyone who uses the computer to solve a problem. Many works address the supervised classification of job offers and propose several approaches to solve it. However, our experiments have shown that the approaches proposed in the literature are insufficient for Cameroonian job offers. Indeed, the naive Bayes, decision tree, SVM, and recurrent neural network methods give poorer results on Cameroonian offers. 
The question then arises as to why these classical methods for classifying job offers, proposed in the literature, give poorer results on Cameroonian job offers.</p><p>To answer this question, we use data retrieved from the websites Minajobs (in Cameroon) and Monster (in France), which specialize in publishing job offers on the Internet. These data gave us two corpora of documents (one corpus per site). Using supervised learning algorithms, the study explores how the distribution of the number of words in the documents of a corpus affects the precision of these algorithms. That is, we wish to show experimentally that the distribution of the number of words in a corpus must be symmetrical for the above-mentioned algorithms to reach their best results.</p><p>The rest of this work is divided into five main parts: a presentation of related work, followed by a presentation of the data and methods used, then a presentation of the results obtained, followed by a discussion and finally a conclusion.</p></sec><sec id="s2"><title>2. Related Works</title><p>Many works have approached the classification of texts in general, and of job offers in particular, in several ways and with different methods.</p><p>In attempting to improve the results of recommender systems by matching job offers and profiles according to required skills and experience, A. Casagrande et al. (2017) [<xref ref-type="bibr" rid="scirp.123309-ref6">6</xref>], following the work of Dieng (2016) [<xref ref-type="bibr" rid="scirp.123309-ref7">7</xref>] and Florea et al. (2013) [<xref ref-type="bibr" rid="scirp.123309-ref8">8</xref>] [<xref ref-type="bibr" rid="scirp.123309-ref9">9</xref>], propose to automatically detect the sector of activity of job offers using supervised learning techniques. The idea is to automatically assign a class (a universe or sector of activity) to a document (a job offer).</p><p>Moldagulova et al. 
(2017) [<xref ref-type="bibr" rid="scirp.123309-ref10">10</xref>] propose an approach for building a machine learning system in R that uses the KNN method for text document classification. They also show that the impact of the value of k (the number of neighbors) on the classification accuracy of the K-Nearest Neighbors (KNN) algorithm diminishes once k is large. They first analyze the text of two articles collected from two sites (egov.kz; http://www.government.kz/). They then use the word cloud technique to select the most frequent terms in the text and convert the documents into a more manageable representation, in which a document is represented by a vector of terms and their frequencies. Finally, they deploy the KNN algorithm by training the model on “known” data and then using it to classify “unknown” data. The model presented in that article has two main limitations. First, although the authors demonstrate that the impact of k on classification accuracy decreases when k is large, the problem remains of determining the optimal value of k, which varies with the corpus. Second, in this classification method the model is the entire training corpus, which raises a problem of time and space complexity: the entire training corpus must be loaded, and the similarity with every element of the corpus recomputed, each time a new element is classified. This second limitation is a real problem for our corpus, which contains large job offers.</p><p>In a comparative study of supervised text classification methods, Ouchiha, L. (2016) [<xref ref-type="bibr" rid="scirp.123309-ref11">11</xref>] finds that SVM stands out and ranks first in performance. Although the performance of the polynomial SVM far exceeds that of the decision tree (DT), its execution time is significantly greater than that of the DT. 
This led him to use the linear SVM available in WEKA, which gave very good performance in terms of both classification error rate and execution time. He also demonstrated that the Naive Bayes Classifier (NBC) performs well on long documents, due to its particular implementation in KNIME, as does SVM. He further remarks that, as categories are added, the performance of the DT deteriorates more and more; a plausible interpretation is that the DT is subjected to a very large number of descriptors, which leads to overfitting.</p><p>Kameni F. et al. (2020) [<xref ref-type="bibr" rid="scirp.123309-ref12">12</xref>] are interested in the extraction of skills expressed in documents such as CVs or job offers. Based on a CNN (convolutional neural network) classification model, they manage to extract high-level skills from CVs with performance reaching 98.79% for recall and 91.34% for precision. However, these data were retrieved in a very formal context.</p><p>Jakub Nowak et al. (2020) [<xref ref-type="bibr" rid="scirp.123309-ref13">13</xref>] address, using supervised methods, the problem of non-uniformity of job names and descriptions by proposing two models: a convolutional network for text classification, consisting of six convolutional layers and three fully connected layers, and a recurrent network with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells with a convolutional input layer. In this solution, the description of the offer is entered word by word in the order in which it is written; this procedure simulates the way a human reads an ad on the Internet. The convolutional part encodes each written word and outputs the input vector for the recurrent cells. Therefore, all feature maps are combined into a single dimension given to the recurrent cells. The final classification remains with the LSTMs and GRUs. 
The number of calls to the recurrent cells was dynamic and depended on the number of words in each offer in the database. Words were limited to 16 characters. The number of words per offer was also limited to 1024; for offers exceeding this limit, the words after the 1024th were ignored. In addition to these adjustments, the SELU activation function was used throughout the framework as an alternative to the widely used ReLU function; the authors justify this choice by the fact that SELU can output negative values, which speeds up the learning process of the convolutional network. They applied their models to 17,177 job offers obtained from the Emplocity Ltd website (https://emplocity.com/), grouped into five classes. They obtain an accuracy of 84.7% for the LSTM and 86.5% for the GRU.</p><p>The various works mentioned above study different aspects of job postings using supervised learning methods, but none of them focuses on Cameroonian job postings. This paper aims to demonstrate the particularity of Cameroonian job offers and to measure the precision of the naive Bayes, SVM, and decision tree algorithms on a corpus of Cameroonian IT job offers taken from the internet. The results are compared with those obtained on a set of IT job offers from the Monster website in France.</p><p>To do so, we start by submitting the offers to a new pre-processing approach that aims to normalize them. This approach consists, first, in removing the job offers whose word counts are atypical with respect to a certain threshold. By removing atypical offers, we should keep a representative number of the initial offers, and the remaining atypical offers should be negligible or even nonexistent. Second, we apply tokenization and stemming, and finally we remove stopwords and other special characters. 
The offers that have undergone this pre-processing are then transformed into frequency vectors with the TF-IDF method, as done by Diaby, M. et al. (2014) [<xref ref-type="bibr" rid="scirp.123309-ref14">14</xref>] [<xref ref-type="bibr" rid="scirp.123309-ref15">15</xref>]. Finally, we apply supervised learning methods to these frequency vectors to classify the offers and evaluate the classification results.</p></sec><sec id="s3"><title>3. Data and Methods</title><sec id="s3_1"><title>3.1. Data</title><p>We have 7533 job offers from minajobs.net [<xref ref-type="bibr" rid="scirp.123309-ref16">16</xref>] in all sectors of activity. However, we concentrated on the offers written in French and related to computer science. This choice was made because we were not able to obtain labels for the job offers from experts in the other fields. We believe that this choice does not affect the results in the other categories, because job offers in project management or marketing encounter the same difficulties. However, we have some reservations about areas such as medicine or academic training.</p><p>Thus, as shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>, we have a corpus of 726 job offers distributed over 13 classes numbered from 1 to 13 and corresponding respectively to the categories: developer, webmaster, database manager, analyst, digital marketing, designer, network administrator, IS security, computer maintenance, community manager, archivist, tester, system administrator.</p><p>This corpus contains offers that belong to the same category (or domain) but whose lengths (i.e. the number of words contained in the text of the offer) are very different. This is the case of the example presented in the following figures: two offers belonging to the “developer” domain, one with three words and the other with 2341 words, a difference of 2338 words. 
In <xref ref-type="fig" rid="fig2">Figure 2</xref>, anyone who has done software development can apply; the candidate learns nothing more about the client’s need. In <xref ref-type="fig" rid="fig3">Figure 3</xref>, part of the offer is in English because Cameroon is bilingual; however, the language detection algorithm we used classifies it as French because the majority of the text is in French. Several jobs are advertised in the same offer.</p><p>The second dataset is a set of job offers collected from the French website Monster.fr.</p><p>The offers used in [<xref ref-type="bibr" rid="scirp.123309-ref15">15</xref>] are not accessible, because access to the platform from which they were extracted requires an in-person, paid registration. We have therefore retained the French offers because of their accessibility and because, in terms of structure, they are closer to ours than to the offers used in [<xref ref-type="bibr" rid="scirp.123309-ref15">15</xref>]. As shown in <xref ref-type="fig" rid="fig4">Figure 4</xref>, it is a corpus of 1280 IT job offers published on Monster.fr and divided into three classes numbered from 1 to 3 corresponding, respectively, to the categories: computer science engineer, computer graphics, and software architect. We now explain the methodology for comparing the performance of the supervised classification algorithms on the two corpora.</p></sec><sec id="s3_2"><title>3.2. Methods</title><p>The methodological approach used in this work is presented in <xref ref-type="fig" rid="fig5">Figure 5</xref>. The first step is the cleaning step, which consists in removing the atypical descriptions, i.e. those whose word count contrasts greatly with the “normal” measured values. Following Yamada et al. (2020) [<xref ref-type="bibr" rid="scirp.123309-ref17">17</xref>], we relied on the interquartile range to determine the atypical offers. 
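</p><p>As an illustration of this cleaning step, the following Python sketch filters offers by word count with an interquartile-range fence. It is not the authors' implementation: the function names iqr_bounds and remove_atypical, the whitespace-based word count, the “inclusive” quartile method, and the default k = 1.5 (the usual Tukey value) are assumptions made for the example.</p><preformat>
```python
import statistics


def iqr_bounds(word_counts, k=1.5):
    """Return (low, high) = (Q1 - k*IQR, Q3 + k*IQR) for a list of word counts.

    k = 1.5 is an assumed default; the method only requires k to be positive.
    """
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(word_counts, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_atypical(offers, k=1.5):
    """Keep only the offers whose word count lies inside the IQR interval."""
    # Assumption: words are counted by splitting on whitespace.
    counts = [len(text.split()) for text in offers]
    low, high = iqr_bounds(counts, k)
    return [text for text, n in zip(offers, counts) if high >= n >= low]
```
</preformat><p>Called on a list of offer texts, remove_atypical returns only the offers whose word count falls inside the fence; a larger k makes the filter more permissive.</p><p>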
All offers whose word count lies outside the interval [Q1 − k(Q3 − Q1), Q3 + k(Q3 − Q1)], where k is a positive constant and Q1 and Q3 are the first and third quartiles respectively, are considered atypical and removed. We then cut the descriptions into lists of words (tokenization) and transform these words into their root or radical, also called the stem (stemming). Finally, we remove the stopwords.</p></sec><sec id="s3_3"><title>3.3. Pre-Processing of Job Offers</title><p>The preprocessing approach (<xref ref-type="fig" rid="fig6">Figure 6</xref>) is shown in the following diagram:</p><p>Step 1) Removal of outliers: In this step, we remove the outliers, i.e. the offers with an atypical number of words. To do this, we examine the descriptive statistics of the series of word counts per offer, from which we identify the offers that have atypical word counts (visible on the box-and-whisker plot) and remove them. The goal is to obtain a histogram of the distribution of Cameroonian offers that is as close as possible to a symmetrical distribution.</p><p>Step 2) Tokenization: According to Wisdom et al. (1999) [<xref ref-type="bibr" rid="scirp.123309-ref18">18</xref>], tokenization consists in transforming a text into a list of words without separators. In our case, it splits an offer into a list of words.</p><p>Step 3) Stemming: According to Perkins et al. (2010) [<xref ref-type="bibr" rid="scirp.123309-ref19">19</xref>], stemming is a method that consists in extracting the roots of words. In our case, we apply stemming to each word obtained after tokenization and obtain a new list of words made up of the radicals of the tokens; such a radical is also called a stem.</p><p>Step 4) Removal of stopwords: Here we remove words that carry no information. 
For this purpose, we built a stopword dictionary; in addition, regular expressions are used to remove certain elements such as special characters.</p><p>Step 5) Vector representation of job offers: In this step, we represent the job offers numerically using the TF-IDF method. Each job offer, being already a list of stems, is represented by a vector.</p><p><inline-formula><tex-math>\text{TF-IDF} = \text{TF} \times \text{IDF}</tex-math></inline-formula> [<xref ref-type="bibr" rid="scirp.123309-ref20">20</xref>] (1)</p><p>Where:</p><p><inline-formula><tex-math>\text{TF} = \frac{\text{Number of Stem Occurrences}}{\text{Number of Stems}}</tex-math></inline-formula> [<xref ref-type="bibr" rid="scirp.123309-ref20">20</xref>] (2)</p><p><inline-formula><tex-math>\text{IDF} = \log\left(\frac{\text{Number of Descriptions}}{\text{Descriptions Containing Stem}}\right)</tex-math></inline-formula> [<xref ref-type="bibr" rid="scirp.123309-ref20">20</xref>] (3)</p><p>Step 6) Supervised classification methods: We perform single-label classification with the following three supervised classification algorithms:</p><p>&#183; Support Vector Machine (SVM): We used the linear SVM.</p><p>&#183; Na&#239;ve Bayes: We used Gaussian naive Bayes.</p><p>&#183; Decision tree: We used the classification model based on the ID3 [<xref ref-type="bibr" rid="scirp.123309-ref21">21</xref>] algorithm.</p></sec></sec><sec id="s4"><title>4. Results</title><p>Here we present the results of our experiments. We present the descriptive statistics of the data and the results of the classification methods described in Step 6 of Section 3.3, first on the two datasets without any pre-processing, then on the same data after deleting all the offers with an atypical number of words, and finally on the two datasets after applying the complete pre-processing approach described in Section 3.3.</p><sec id="s4_1"><title>4.1. 
Results without the Proposed Approach</title><p>The statistical study carried out on our corpora before any pre-processing gives the following results (<xref ref-type="table" rid="table1">Table 1</xref>).</p><p>From these observations, we notice that the Monster corpus contains two atypical offers while the Cameroonian corpus contains 49, and above all that the difference between the ranges of the two corpora is 1517 words (2338 against 821). In addition, the histograms (<xref ref-type="fig" rid="fig7">Figure 7</xref>, <xref ref-type="fig" rid="fig8">Figure 8</xref>) show that the distribution of the number of words of the Cameroonian offers has a much longer right tail than that of the Monster corpus. This is confirmed by the empirical skewness values obtained (2.531 against 0.356).</p><p>After supervised learning on the corpora, the performance obtained by the different classification methods is summarized in the following table, with precision as the performance evaluation metric (<xref ref-type="table" rid="table2">Table 2</xref>).</p><p>The previous table shows the difference in performance of the classification methods on our corpora. Overall, performance is clearly better on the Monster offers. 
This can be explained by the difference in the shape of the word-count distribution between the two corpora.</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title>Comparison of the descriptive statistics of the Monster and Minajobs job postings (source: authors)</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Elements of comparison</th><th align="center" valign="middle" >Offers from Minajobs</th><th align="center" valign="middle" >Offers from Monster</th></tr></thead><tr><td align="center" valign="middle" >Total number of words</td><td align="center" valign="middle" >227,113</td><td align="center" valign="middle" >22,817</td></tr><tr><td align="center" valign="middle" >Average number of words</td><td align="center" valign="middle" >312.828</td><td align="center" valign="middle" >334.538</td></tr><tr><td align="center" valign="middle" >Minimum number of words</td><td align="center" valign="middle" >3</td><td align="center" valign="middle" >64</td></tr><tr><td align="center" valign="middle" >Maximum number of words</td><td align="center" valign="middle" >2341</td><td align="center" valign="middle" >885</td></tr><tr><td align="center" valign="middle" >Median</td><td align="center" valign="middle" >200</td><td align="center" valign="middle" >333</td></tr><tr><td align="center" valign="middle" >Standard deviation</td><td align="center" valign="middle" >257.591</td><td align="center" valign="middle" >126.969</td></tr><tr><td align="center" valign="middle" >Number of atypical values</td><td align="center" valign="middle" >49</td><td align="center" valign="middle" >2</td></tr><tr><td align="center" valign="middle" >Range</td><td align="center" valign="middle" >2338</td><td align="center" valign="middle" >821</td></tr><tr><td align="center" valign="middle" >Empirical skewness</td><td align="center" valign="middle" >2.531</td><td align="center" valign="middle" 
>0.356</td></tr></tbody></table></table-wrap><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title>Results obtained by supervised learning with the naive Bayes, decision tree, and SVM methods on the Monster and Minajobs job offers without preprocessing (source: authors)</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Methods</th><th align="center" valign="middle"  colspan="2"  >Na&#239;ve Bayes</th><th align="center" valign="middle"  colspan="2"  >Decision tree</th><th align="center" valign="middle"  colspan="2"  >SVM</th></tr></thead><tr><td align="center" valign="middle" >Data Metrics</td><td align="center" valign="middle" >Minajobs offers</td><td align="center" valign="middle" >Monster offers</td><td align="center" valign="middle" >Minajobs offers</td><td align="center" valign="middle" >Monster offers</td><td align="center" valign="middle" >Minajobs offers</td><td align="center" valign="middle" >Monster offers</td></tr><tr><td align="center" valign="middle" >Recall</td><td align="center" valign="middle" >62.79%</td><td align="center" valign="middle" >85.99%</td><td align="center" valign="middle" >69.47%</td><td align="center" valign="middle" >86.56%</td><td align="center" valign="middle" >69.34%</td><td align="center" valign="middle" >86.45%</td></tr><tr><td align="center" valign="middle" >Precision</td><td align="center" valign="middle" >62.79%</td><td align="center" valign="middle" >86.04%</td><td align="center" valign="middle" >68.02%</td><td align="center" valign="middle" >87.37%</td><td align="center" valign="middle" >68.27%</td><td align="center" valign="middle" >87.86%</td></tr><tr><td align="center" valign="middle" >F1-score</td><td align="center" valign="middle" >59.79%</td><td align="center" valign="middle" >85.49%</td><td align="center" valign="middle" >65.98%</td><td align="center" valign="middle" >86.09%</td><td align="center" valign="middle" >65.88%</td><td align="center" 
valign="middle" >86.49%</td></tr></tbody></table></table-wrap></sec><sec id="s4_2"><title>4.2. Results after Deleting Offers with an Outlier Number of Words</title><p>In this part, we first removed all the offers with an atypical number of words (the outliers), then repeated the statistical study on the two corpora without outliers, and finally performed the classification again. The results of the statistical study on the corpora without outliers are summarized in <xref ref-type="table" rid="table3">Table 3</xref>.</p><p>The following figures present the histograms of the distributions of the number of words of the Cameroonian offers after removing the outliers.</p><p>After removing the offers with an atypical number of words, we see, in <xref ref-type="fig" rid="fig9">Figure 9</xref>, that the range of the word-count distribution of the Cameroonian offers has decreased by 1635. The maximum number of words in an offer is now 706, and the skewness has decreased by 1.884651. Thus, the curve and the histogram of the distribution of words in the Cameroonian offers are closer to those of the Monster offers. The Monster offers, on the other hand, change little.</p><p>After supervised learning on the corpora from which the outliers were eliminated, the performance obtained by the different classification methods is summarized in the following table, with precision as the performance evaluation metric (<xref ref-type="table" rid="table4">Table 4</xref>).</p><p>The previous table shows the difference in performance of the classification methods on our corpora after the removal of outliers. The results obtained on the Cameroonian corpus have increased considerably with the removal of outliers. Performance is now close to that obtained on the Monster offers, which has not changed significantly. 
This can be explained by the distribution of words in the corpora: by removing the outliers from the Cameroonian corpus, we have considerably modified the distribution of words in this corpus, giving it a distribution close to that of the Monster corpus. However, this deletion did not have a major influence on the Monster corpus because it contained only two outliers.</p></sec><sec id="s4_3"><title>4.3. Results after Complete Pre-Processing of Offers</title><p>Starting from the corpora devoid of outliers, we continued the pre-processing with tokenization, stemming, and stopword removal. Once the pre-processing was complete, we repeated supervised learning on the pre-processed corpora. The performance obtained by the different classification methods is summarized in the following table, with precision as the performance evaluation metric (<xref ref-type="table" rid="table5">Table 5</xref>).</p><p>The previous table shows that the performance of the classification methods has increased by an average of 9% for the corpus of Cameroonian offers and by 7% for the Monster offers. This implies that the tokenization, stemming, and stopword removal steps have improved the classification of the offers. 
The fact that this increase is 2% greater for the Cameroonian offers than for the Monster offers suggests that the Cameroonian offers contain slightly more stopwords than the Monster offers.</p><table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title>Comparison of the descriptive statistics of the Monster and Minajobs job postings after removing outliers (source: authors)</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Elements of comparison</th><th align="center" valign="middle" >Offers from Minajobs</th><th align="center" valign="middle" >Offers from Monster</th></tr></thead><tr><td align="center" valign="middle" >Total number of words</td><td align="center" valign="middle" >174,026</td><td align="center" valign="middle" >22,446</td></tr><tr><td align="center" valign="middle" >Average number of words</td><td align="center" valign="middle" >257.055</td><td align="center" valign="middle" >333.032</td></tr><tr><td align="center" valign="middle" >Minimum number of words</td><td align="center" valign="middle" >3</td><td align="center" valign="middle" >64</td></tr><tr><td align="center" valign="middle" >Maximum number of words</td><td align="center" valign="middle" >706</td><td align="center" valign="middle" >661</td></tr><tr><td align="center" valign="middle" >Median</td><td align="center" valign="middle" >181</td><td align="center" valign="middle" >333</td></tr><tr><td align="center" valign="middle" >Standard deviation</td><td align="center" valign="middle" >119.016</td><td align="center" valign="middle" >124.085</td></tr><tr><td align="center" valign="middle" >Number of atypical values</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >Range</td><td align="center" valign="middle" >703</td><td align="center" valign="middle" >597</td></tr><tr><td align="center" valign="middle" >Empirical skewness</td><td 
align="center" valign="middle" >0.646</td><td align="center" valign="middle" >0.212</td></tr></tbody></table></table-wrap><table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title>Results obtained by supervised learning with Na&#239;ve Bayes, decision tree, and SVM methods on Monster and Minajobs job offers after eliminating outliers (source: author)</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Methods</th><th align="center" valign="middle"  colspan="2"  >Na&#239;ve Bayes</th><th align="center" valign="middle"  colspan="2"  >Decision tree</th><th align="center" valign="middle"  colspan="2"  >SVM</th></tr></thead><tr><td align="center" valign="middle" >Data Metrics</td><td align="center" valign="middle" >Minajobs offers</td><td align="center" valign="middle" >Monster offers</td><td align="center" valign="middle" >Minajobs offers</td><td align="center" valign="middle" >Monster offers</td><td align="center" valign="middle" >Minajobs offers</td><td align="center" valign="middle" >Monster offers</td></tr><tr><td align="center" valign="middle" >Recall</td><td align="center" valign="middle" >93.89%</td><td align="center" valign="middle" >86.55%</td><td align="center" valign="middle" >86.75%</td><td align="center" valign="middle" >86.42%</td><td align="center" valign="middle" >86.97%</td><td align="center" valign="middle" >86.01%</td></tr><tr><td align="center" valign="middle" >Precision</td><td align="center" valign="middle" >94.93%</td><td align="center" valign="middle" >87.95%</td><td align="center" valign="middle" >87.75%</td><td align="center" valign="middle" >87.93%</td><td align="center" valign="middle" >87.97%</td><td align="center" valign="middle" >87.47%</td></tr><tr><td align="center" valign="middle" >F1-score</td><td align="center" valign="middle" >94.42%</td><td align="center" valign="middle" >86.55%</td><td align="center" valign="middle" >85.75%</td><td align="center" valign="middle" 
>86.89%</td><td align="center" valign="middle" >85.97%</td><td align="center" valign="middle" >86.18%</td></tr></tbody></table></table-wrap><table-wrap id="table5" ><label><xref ref-type="table" rid="table5">Table 5</xref></label><caption><title>Results obtained by supervised learning with Na&#239;ve Bayes, decision tree, and SVM methods on Monster and Minajobs job offers after complete pre-processing (source: author)</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Methods</th><th align="center" valign="middle"  colspan="2"  >Na&#239;ve Bayes</th><th align="center" valign="middle"  colspan="2"  >Decision tree</th><th align="center" valign="middle"  colspan="2"  >SVM</th></tr></thead><tr><td align="center" valign="middle" >Data Metrics</td><td align="center" valign="middle" >Minajobs offers</td><td align="center" valign="middle" >Monster offers</td><td align="center" valign="middle" >Minajobs offers</td><td align="center" valign="middle" >Monster offers</td><td align="center" valign="middle" >Minajobs offers</td><td align="center" valign="middle" >Monster offers</td></tr><tr><td align="center" valign="middle" >Recall</td><td align="center" valign="middle" >93.79%</td><td align="center" valign="middle" >90.88%</td><td align="center" valign="middle" >95.99%</td><td align="center" valign="middle" >96.79%</td><td align="center" valign="middle" >95.12%</td><td align="center" valign="middle" >96.23%</td></tr><tr><td align="center" valign="middle" >Precision</td><td align="center" valign="middle" >94.90%</td><td align="center" valign="middle" >88.95%</td><td align="center" valign="middle" >97.45%</td><td align="center" valign="middle" >97.56%</td><td align="center" valign="middle" >97.90%</td><td align="center" valign="middle" >97.96%</td></tr><tr><td align="center" valign="middle" >F1-score</td><td align="center" valign="middle" >93.75%</td><td align="center" valign="middle" >88.98%</td><td align="center" valign="middle" >96.65%</td><td 
align="center" valign="middle" >96.49%</td><td align="center" valign="middle" >96.90%</td><td align="center" valign="middle" >96.87%</td></tr></tbody></table></table-wrap></sec></sec><sec id="s5"><title>5. Results Analysis</title><p>The experiments carried out in this work show that the classic approaches to classifying job offers give less satisfactory results on Cameroonian job offers. On the other hand, when we change the word distribution of these offers by eliminating the offers of aberrant length, which in our case constituted only 7% of the offers, the word distribution curve of the offers becomes more symmetrical and the precision of the classification methods on these offers increases by nearly 20%. When, in addition to the removal of outliers, we apply tokenization, stemming, and stopword removal, the precision of the classification methods increases by almost 9%. This allows us to conclude that the main problem of the Cameroonian job offers lies in their word count distribution. Indeed, <xref ref-type="table" rid="table1">Table 1</xref> shows that the Cameroonian corpus has a word count of 2338, meaning that some job offers have a very high word count compared to other offers in the same corpus, which generally biases the vector representation of the offers and therefore the classification.</p></sec><sec id="s6"><title>6. Conclusion</title><p>This research provides a strong hypothesis that supervised learning algorithms are more likely to perform better on text sets with a symmetric distribution of word counts. The results of the research indicate that, in the Cameroonian context, published offers must often be reprocessed to better match the expectations of employers and job seekers. 
In this context, we believe that the hypothesis could be justified by demonstrating why the results of job advertisement classification are better when the distribution of the number of words per advertisement is closer to a symmetric distribution, and by determining to what extent aberrant offers should be corrected rather than removed, because even if they are aberrant, they are still offers to be considered.</p></sec><sec id="s7"><title>Conflicts of Interest</title><p>The authors declare no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s8"><title>Cite this paper</title><p>Makembe, F.S., Etoundi, R.A. and Tapamo, H. (2023) Supervised Learning Algorithm on Unstructured Documents for the Classification of Job Offers: Case of Cameroun. Journal of Computer and Communications, 11, 75-88. https://doi.org/10.4236/jcc.2023.112006</p></sec></body></article>