<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JCC</journal-id><journal-title-group><journal-title>Journal of Computer and Communications</journal-title></journal-title-group><issn pub-type="epub">2327-5219</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jcc.2016.45012</article-id><article-id pub-id-type="publisher-id">JCC-66762</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  Improve Data Quality by Processing Null Values and Semantic Dependencies
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Zaidi</surname><given-names>Houda</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Boufarès</surname><given-names>Faouzi</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Pollet</surname><given-names>Yann</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib></contrib-group><aff id="aff2"><addr-line>Laboratory LIPN, University Paris 13, Sorbonne Paris Cité, Villetaneuse, France</addr-line></aff><aff id="aff1"><addr-line>Laboratory CEDRIC, Conservatoire National des Arts et Métiers, Paris, France</addr-line></aff><pub-date pub-type="epub"><day>29</day><month>05</month><year>2016</year></pub-date><volume>04</volume><issue>05</issue><fpage>78</fpage><lpage>85</lpage><history><date date-type="received"><day>12</day>	<month>May</month>	<year>2016</year></date><date date-type="rev-recd"><day>accepted</day>	<month>19</month>	<year>May</year>	</date><date date-type="accepted"><day>26</day>	<month>May</month>	<year>2016</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
   Today, the quantity of data continues to increase; furthermore, the data are heterogeneous, come from multiple sources (structured, semi-structured and unstructured) and have different levels of quality. It is therefore very likely that data are manipulated without knowledge of their structure and semantics. In fact, the metadata may be insufficient or totally absent. Data anomalies may be due to the poverty of their semantic descriptions, or even to the absence of any description. In this paper, we propose an approach to better understand the semantics and the structure of the data. Our approach helps to automatically correct intra-column and inter-column anomalies. We aim to improve the quality of data by processing null values and the semantic dependencies between columns. 
 
</p></abstract><kwd-group><kwd>Data Quality</kwd><kwd> Big Data</kwd><kwd> Contextual Semantics</kwd><kwd> Semantic Dependencies</kwd><kwd> Functional Dependencies</kwd><kwd> Null Values</kwd><kwd> Data Cleaning</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>Data quality is a major challenge because the cost of anomalies can be very high, especially for large enterprise databases that need to exchange information between systems and integrate large amounts of data. Decision making based on erroneous data has a negative influence on the activities of organizations. The quantity of data continues to increase, as do the risks of anomalies. The automatic correction of these anomalies is a topic of growing importance in both business and academia [<xref ref-type="bibr" rid="scirp.66762-ref1">1</xref>] [<xref ref-type="bibr" rid="scirp.66762-ref2">2</xref>]. The data may come from different sources for which metadata are totally absent or, most often, insufficient to reflect the actual content of the data and to treat any anomalies. It is therefore worthwhile to create new data integration tools that help to better understand the semantics and structure of the data. We carry out this work in collaboration with Talend, a vendor of one of the best-known open source data integration and data quality tools [<xref ref-type="bibr" rid="scirp.66762-ref3">3</xref>]. The first part of the project consisted in treating inter-line anomalies [<xref ref-type="bibr" rid="scirp.66762-ref4">4</xref>]-[<xref ref-type="bibr" rid="scirp.66762-ref7">7</xref>]. Indeed, the automatic correction of anomalies within a single column gives better results insofar as the semantics of the column is known. Moreover, the discovery of semantic links that may exist between columns helps to avoid violating the various types of dependency constraints between them. 
This constitutes the second part of the project, which is the subject of this article. Checking dependency constraints on large amounts of data helps to automatically correct anomalies such as null values and violations of some functional dependencies. In this paper, we propose to understand the semantics of the data before correcting it. An intra-column study allows the automatic correction of anomalies in the data with respect to its context. For example, the two strings “Londres” and “London” are equivalent if they appear in the context of city names, but they are different if they are names of people. String similarity algorithms such as Levenshtein [<xref ref-type="bibr" rid="scirp.66762-ref8">8</xref>], Jaro-Winkler [<xref ref-type="bibr" rid="scirp.66762-ref9">9</xref>], Soundex [<xref ref-type="bibr" rid="scirp.66762-ref10">10</xref>] and Metaphone [<xref ref-type="bibr" rid="scirp.66762-ref11">11</xref>] do not take the context into account. Our approach is to reveal the data structure (i.e. the semantic schema) of the data source. First, the syntactic and semantic automatic corrections inside one column can be better focused. Second, possible semantic links between columns can be discovered. Then, the processing of anomalies caused by the violation of dependency constraints can start. In a Big Data context, we use the MapReduce technology to verify the discovered dependencies and to correct the anomalies caused by their violation. The rest of the article is organized as follows. Section 2 describes the semantic recognition of data. Section 3 discusses the intra-column and inter-column data cleaning step. Finally, we compare our approach with other work and present our future goals.</p></sec><sec id="s2"><title>2. Semantic Categorization of Data</title><p>The data categorization step consists in determining the semantics of each column of a data source. 
Indeed, to qualify a data value as syntactically incorrect, it must be evaluated in its context. Several examples illustrate this: 1) The string “Pari” can be considered syntactically incorrect only if it is meant as the French name of the city “Paris”. The words “P&#233;kin” and “Beijing” mean the same thing in two different languages if we know that they are names of cities. “Beijing” could be considered semantically incorrect if the dominant language of the column is French; 2) The three strings “16-10-1996”, “10-16-1996” and “1996-16-10” may represent the same type of information, a date in different formats, each defined by a regular expression; 3) The string “1996-10” is not a date.</p><p>We therefore introduce the use of knowledge stored in a repository called the Data Dictionary (DD) [<xref ref-type="bibr" rid="scirp.66762-ref5">5</xref>] [<xref ref-type="bibr" rid="scirp.66762-ref12">12</xref>] [<xref ref-type="bibr" rid="scirp.66762-ref13">13</xref>]. The DD contains knowledge identified in two modes: 1) Data defined by extension: a list given a priori, such as city names, country names and names of organizations. Valid Strings (DDVS) and Key Words (DDKW) are stored in the DD; 2) Data defined by intension: this knowledge must satisfy some properties, such as matching a regular expression or belonging to an interval of values (DDRE). For example, emails, websites and dates must conform to a model, which can be a regular expression. In other words, the DD can be seen as a set of categories. Each category corresponds to only one data type (String, Number, Date). Some categories can have subcategories, such as languages. We propose a DD that contains not only information grouped by categories (valid strings, rules such as regular expressions) but also semantically valid knowledge such as pre-stored constraints or functional dependencies. 
<xref ref-type="fig" rid="fig1"><xref ref-type="fig" rid="fig">Figure </xref>1</xref> below shows examples of pre-stored knowledge.</p><sec id="s2_1"><title>2.1. Recognition of the Data Structure</title><p>As metadata can be totally absent, we assume that the Data Source (DS) to be corrected is in CSV format and therefore has no schema. The semantic categorization process uses the DD to assign each column of the source to a context and thus to a semantics (a category). It returns a new semantic structure of the DS enriched with constraints and comments. The objective is to propose for each column: 1) a semantic name (the category), 2) possibly a subcategory (the language), 3) a data type (the syntactic domain), 4) intra-column and inter-column constraints and 5) comments. Details are given in [<xref ref-type="bibr" rid="scirp.66762-ref12">12</xref>].</p><p>Given the example DS in <xref ref-type="fig" rid="fig2"><xref ref-type="fig" rid="fig">Figure </xref>2</xref>, semantic recognition consists in finding similarities between the values of the DS and the valid strings in the DD in order to infer the semantic name of each column. We are interested in contextual data quality. We use similarity distance measures and propose to combine two kinds of them: on the one hand “is written as” measures, such as Jaro-Winkler or Levenshtein, and on the other hand “is pronounced as” measures, such as Soundex or Metaphone. Applying our semantic recognition process to the “Patient.csv” file (<xref ref-type="fig" rid="fig2"><xref ref-type="fig" rid="fig">Figure </xref>2</xref>) gives the results in <xref ref-type="fig" rid="fig3"><xref ref-type="fig" rid="fig">Figure </xref>3</xref>. 
Note that the DS has no meaningful schema:</p><p>S (Column 1: String, Column 2: String, Column 3: String, Column 4: String, Column 5: String, Column 6: String, Column 7: String, Column 8: String, Column 9: String, Column 10: String).</p><p>The semantic process that we propose yields a new semantic structure of the data, summarized by:</p>
<fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig1"><xref ref-type="fig" rid="fig">Figure </xref>1</xref></label><caption><title> Extract of the data dictionary for valid strings and semantic dependencies</title></caption></fig>
<table-wrap id="table_fig1" >
<object-id pub-id-type="pii"><xref ref-type="table" rid="table1">Table 1</xref></object-id>
</table-wrap>
<fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig2"><xref ref-type="fig" rid="fig">Figure </xref>2</xref></label><caption><title> Patient.csv file extraction (from the DS)</title></caption></fig>
<table-wrap id="table_fig2" >
<object-id pub-id-type="pii"><xref ref-type="table" rid="table2">Table 2</xref></object-id>
</table-wrap>
<fig id="fig1"  position="float">
<label><xref ref-type="fig" rid="fig3">
<xref ref-type="fig" rid="fig">Figure </xref>3</xref></label>
<caption><title> Semantic categorization of data (Patient.csv)</title></caption></fig>
<table-wrap id="table_fig3" >
<object-id pub-id-type="pii"><xref ref-type="table" rid="table3">Table 3</xref></object-id>
</table-wrap><p>S (Number: Number, First Name: String, Civility: String, Sex: String, Blood Group: String, Date: Date, Phone: Number, City: String, Country: String, Continent: String); with constraints such as Civility &#8712; {M; Mme; Mlle} and functional dependencies such as Civility &#8594; Sex and Country &#8594; Continent.</p></sec>
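<p>The categorization step of Section 2.1 can be illustrated by the following minimal sketch. It is not our actual implementation: the dictionary content, the names DD_VALID_STRINGS, DD_REGEX and categorize, and the two thresholds are hypothetical, and the difflib similarity ratio of the Python standard library is used as a simple stand-in for the Jaro-Winkler, Levenshtein, Soundex and Metaphone measures mentioned above. Each column receives the category matched by the largest share of its non-null values.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import re
from difflib import SequenceMatcher

# Hypothetical, tiny Data Dictionary (DD); the real DD content is not shown here.
DD_VALID_STRINGS = {
    "City": {"paris", "london", "beijing", "tunis"},
    "Country": {"france", "china", "tunisia", "england"},
    "Civility": {"m", "mme", "mlle"},
}
DD_REGEX = {
    "Date": re.compile(r"\d{2}-\d{2}-\d{4}|\d{4}-\d{2}-\d{2}"),
}


def similar(a, b, threshold=0.8):
    """Stand-in similarity test (difflib ratio), not Jaro-Winkler or Levenshtein."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def matches(value, category):
    """True if the value fits the category, by regular expression or valid string."""
    v = value.strip().lower()
    if category in DD_REGEX:
        return DD_REGEX[category].fullmatch(v) is not None
    return any(similar(v, s) for s in DD_VALID_STRINGS[category])


def categorize(column, min_ratio=0.5):
    """Return the category matched by the largest share of non-null values, or None."""
    values = [v for v in column if v not in (None, "", "NULL")]
    if not values:
        return None
    best, best_ratio = None, 0.0
    for category in list(DD_VALID_STRINGS) + list(DD_REGEX):
        ratio = sum(matches(v, category) for v in values) / len(values)
        if ratio > best_ratio:
            best, best_ratio = category, ratio
    return best if best_ratio >= min_ratio else None
```
]]></preformat>
<p>Under these assumptions, a column holding “Paris”, “Pari”, “London” and a null value is assigned to the hypothetical City category, because the misspelled “Pari” is still close enough to a valid string.</p>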
<sec id="s2_2"><title>2.2. Recognition of Semantic Dependencies between Columns</title><p>Note that the syntactic domain of the data can be exploited to identify implausible dependencies. The pre-stored knowledge can thus guide the user and reduce the search space when discovering dependency constraints, and semantically valid dependencies can be suggested to the user. This contrasts with works such as [<xref ref-type="bibr" rid="scirp.66762-ref14">14</xref>] [<xref ref-type="bibr" rid="scirp.66762-ref15">15</xref>], which attempt to verify functional dependencies between all columns. For example, trying to verify the following dependencies is meaningless: BloodGroup &#8594; City, Date &#8594; Country, Date &#8594; FirstName, FirstName &#8594; Date or FirstName &#8594; BloodGroup.</p></sec></sec><sec id="s3"><title>3. Data Cleaning</title><sec id="s3_1"><title>3.1. Correction of Intra-Column Anomalies</title><p>In this section, we present our approach, which exploits the semantic knowledge deduced from the semantic categorization step to correct intra-column anomalies. We call this phase the homogenization/standardization of data. The syntactic correction of data is done by replacing values of the data source with the most similar values in the Data Dictionary, using similarity distance measures. Our cleaning process can correct misspelled values and standardize formats. Some examples of syntactic corrections are given in <xref ref-type="fig" rid="fig4"><xref ref-type="fig" rid="fig">Figure </xref>4</xref>. Note that certain values cannot be transformed at this level, such as the string “Fr” or null values (NULL). The semantic categorization step identifies the dominant category and possibly the dominant subcategory (language) of each column. We propose to unify the data within the same subcategory. 
Values that do not belong to the dominant subcategory are then translated into their synonym in the dominant language. Various formats may be used in the same column; we propose to encode the values uniformly in the dominant format (<xref ref-type="fig" rid="fig5"><xref ref-type="fig" rid="fig">Figure </xref>5</xref>).</p><p>The intra-column transformations do not correct violations of semantic dependencies. Null values are not treated at this stage of the process.</p>
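<p>The syntactic correction described in this section can be sketched as follows, assuming that the Levenshtein distance is the only measure used and that a value is replaced only when its closest valid string lies within a small distance. The list VALID_COUNTRIES, the function correct and the threshold max_distance are hypothetical names and values chosen for illustration; they are not taken from our implementation.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical valid strings of a "Country" category in the Data Dictionary.
VALID_COUNTRIES = ["France", "China", "Tunisia"]


def correct(value, valid_strings, max_distance=2):
    """Replace value by its closest valid string, or keep it if none is close enough."""
    if value is None or value == "NULL":
        return value  # null values are not treated at this stage
    best = min(valid_strings, key=lambda s: levenshtein(value.lower(), s.lower()))
    if levenshtein(value.lower(), best.lower()) <= max_distance:
        return best
    return value
```
]]></preformat>
<p>With these assumptions, the misspelled value “Frence” is replaced by “France”, whereas “Fr” is left unchanged because it is too far from every valid string, which is consistent with the remark above that such values cannot be transformed at this level.</p>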
<fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig4"><xref ref-type="fig" rid="fig">Figure </xref>4</xref></label><caption><title> Examples of intra-column automatic transformations</title></caption></fig>
<table-wrap id="table_fig4" >
<object-id pub-id-type="pii"><xref ref-type="table" rid="table4">Table 4</xref></object-id></table-wrap><fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig5"><xref ref-type="fig" rid="fig">Figure </xref>5</xref></label><caption><title> Intra-column automatic corrections done on the DS (Patient.CSV)</title></caption></fig><table-wrap id="table_fig5" ><object-id pub-id-type="pii"><xref ref-type="table" rid="table5">Table 5</xref></object-id></table-wrap></sec><sec id="s3_2"><title>3.2. Correction of Inter-Column Anomalies</title><p>The correction of intra-column anomalies facilitates the verification of semantic links between columns, which is our ultimate goal. Let S(C) be the schema of the DS, where C is the set of columns, and let X and Y be two subsets of C. We say that X functionally determines Y (noted X &#8594; Y) if and only if, for all rows i and j, x<sub>i</sub> = x<sub>j</sub> implies y<sub>i</sub> = y<sub>j</sub>. In other words, for every value x<sub>i</sub> of X, there is only one corresponding value y<sub>j</sub> of Y. For instance, the functional dependency Country &#8594; Continent cannot be verified using the following instances: {(France, NULL); (Frence, Europe); (China, NULL); (Chine, Asy)}.</p><p>Hence, it is necessary to begin with the intra-column corrections. This also reduces the search space. <xref ref-type="fig" rid="fig5"><xref ref-type="fig" rid="fig">Figure </xref>5</xref> shows the results of data categorization and intra-column correction. Note that the source still contains inter-column anomalies. According to the DD, some semantic dependencies are not plausible, such as Date &#8594; City or BloodGroup &#8594; City. In contrast, Civility &#8594; Sex, City &#8594; Country, City &#8594; Continent and Country &#8594; Continent are proposed for verification.</p><p>The step before the inter-column correction is the verification of each dependency constraint. 
The dependency constraint verification algorithm (Algorithm 1, <xref ref-type="fig" rid="fig">Figure </xref>A1) counts, for each value x<sub>i</sub>, the number &#946;<sub>i</sub> of different values y<sub>i</sub> associated with it. If any &#946;<sub>i</sub> &gt; 1, the dependency is not verified. The algorithm contains two Map phases and two Reduce phases [<xref ref-type="bibr" rid="scirp.66762-ref16">16</xref>]. The Map1 and Reduce1 phases take the DS as input and return the number of occurrences α<sub>i</sub> of each pair, in the form (x<sub>i</sub>; y<sub>i</sub>, α<sub>i</sub>). The Map2 and Reduce2 phases take the result of the previous two phases as input and count the number &#946;<sub>i</sub> of distinct pairs in which each x<sub>i</sub> occurs. If &#946;<sub>i</sub> &gt; 1, the dependency constraint (X &#8594; Y) is violated. We propose an algorithm (Algorithm 2, <xref ref-type="fig" rid="fig">Figure </xref>A1) to correct the dependency anomalies. It uses the valid values stored in the DD. Some null values are then processed and corrected according to deductions extracted from the data. Note that the order in which functional dependencies are treated is very important for completing null values: it is necessary to start with the columns that are the least empty. <xref ref-type="fig" rid="fig">Figure </xref>6 contains the results of the automatic correction of inter-column anomalies.</p><fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig">Figure </xref>6</label><caption><title> Inter-column automatic corrections done on the DS (Patient.CSV): (Civility &#8594; Sex, City &#8594; Country, Country &#8594; Continent)</title></caption></fig><table-wrap id="table_fig6" ><object-id pub-id-type="pii"><xref ref-type="table" rid="table6">Table 6</xref></object-id></table-wrap></sec></sec><sec id="s4"><title>4. 
Related Work</title><p>We studied and compared the functionalities of some data quality tools, such as Talend Data Quality [<xref ref-type="bibr" rid="scirp.66762-ref3">3</xref>] and Pentaho Data Integration [<xref ref-type="bibr" rid="scirp.66762-ref17">17</xref>]. These tools verify only the functional dependencies given by the user, who must therefore know the data schema and the dependencies to be verified. They do not correct anomalies caused by the violation of semantic dependencies between columns, and they do not process null values. The various algorithms for discovering functional dependencies [<xref ref-type="bibr" rid="scirp.66762-ref14">14</xref>] [<xref ref-type="bibr" rid="scirp.66762-ref15">15</xref>] [<xref ref-type="bibr" rid="scirp.66762-ref18">18</xref>] search for dependencies across all possible combinations of the columns of a data source, which increases the size of the search space. These algorithms do not take the semantics of the data into account, whereas our approach focuses on contextual data quality. Applying the MapReduce principle in a Big Data context will make it possible to validate the algorithms on large volumes of data with good performance. Few studies include the MapReduce principle (distribution of data and processing).</p></sec><sec id="s5"><title>5. Conclusion</title><p>The goal of our work is to contribute to the development of new data integration tools that assist the user in the contextual data quality process. Our contribution is to understand the semantics of data before correcting them. Our approach categorizes the data by assigning each column a category and possibly a subcategory. Semantic categorization makes it possible to infer semantic links that may exist between the different columns. 
The recognition of the structure and semantics of data facilitates the detection and correction of various intra-column, inter-column and inter-line anomalies in the same data source. We propose a MapReduce algorithm for verifying dependency constraints in order to detect anomalies caused by their violation. We present an algorithm for the automatic correction of inter-column anomalies, in particular the treatment of null values. The enrichment of the data dictionary will be the subject of our future work.</p></sec><sec id="s6"><title>Cite this paper</title><p>Houda Zaidi, Faouzi Boufar&#232;s, Yann Pollet (2016) Improve Data Quality by Processing Null Values and Semantic Dependencies. Journal of Computer and Communications, 04, 78-85. doi: 10.4236/jcc.2016.45012</p></sec><sec id="s7"><title>Appendix</title>
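<p>As a complement to the algorithms of this appendix, the two-phase verification of a dependency and a simple correction based on the DD can be sketched as below. This is an illustrative, in-memory simulation and not the algorithms of the figure: the helper map_reduce stands in for a real MapReduce framework, null values are assumed to be skipped during verification, a dependency is treated as violated when some value of X is associated with more than one distinct value of Y, and the names verify_fd, correct_with_dd and dd_dependency are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Minimal in-memory stand-in for one Map phase followed by one Reduce phase."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]


def verify_fd(rows, x, y):
    """Return {x value: beta} for the x values that violate x -> y (beta > 1)."""

    def map1(row):
        # Emit ((x, y), 1) for every row in which both values are present.
        if row[x] is None or row[y] is None:
            return []  # assumption: null values are skipped during verification
        return [((row[x], row[y]), 1)]

    def reduce1(pair, ones):
        # alpha: number of occurrences of the pair (x, y).
        return (pair[0], pair[1], sum(ones))

    def map2(triple):
        # Re-key each distinct pair by its x value.
        return [(triple[0], triple[1])]

    def reduce2(x_value, y_values):
        # beta: number of distinct y values associated with this x value.
        return (x_value, len(y_values))

    pairs = map_reduce(rows, map1, reduce1)
    betas = map_reduce(pairs, map2, reduce2)
    return {x_value: beta for x_value, beta in betas if beta > 1}


def correct_with_dd(rows, x, y, dd_dependency):
    """Set y from a DD mapping (x value to y value) when x is known; keep y otherwise."""
    return [{**row, y: dd_dependency.get(row[x], row[y])} for row in rows]
```
]]></preformat>
<p>In this sketch, a row whose Country is known to the hypothetical DD mapping receives the corresponding Continent, whether the original value was null or contradictory, so that a second call to verify_fd reports no remaining violation for those rows.</p>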
<fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig">Figure </xref>A1</label><caption><title> Inter-column automatic correction algorithms</title></caption>
<table-wrap id="table_fig7" ><object-id pub-id-type="pii"><xref ref-type="table" rid="table7">Table 7</xref></object-id></table-wrap></fig></sec></body><back><ref-list><title>References</title><ref id="scirp.66762-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Chu, X., Morcos, J., Ilyas, I.F., Ouzzani, M., Papotti, P., Tang, N. and Ye, Y. (2015) KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, 31 May-4 June 2015, 1247-1261.</mixed-citation></ref><ref id="scirp.66762-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I.F., Ouzzani, M. and Tang, N. (2013) NADEEF: A Commodity Data Cleaning System. Proceedings of ACM SIGMOD International Conference on Management of Data, IEEE Press, New York, 22-27 June 2013, 541-552.</mixed-citation></ref><ref id="scirp.66762-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Talend. https://www.talend.com/</mixed-citation></ref><ref id="scirp.66762-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Ben Salem, A., Boufarès, F. and Correia, S. (2014) Semantic Recognition of a Data Structure in Big-Data. Proceedings of the 6th International Conference on Computational Intelligence and Software Engineering, Vol. 2, Beijing, 11-13 July 2014, 93-103.</mixed-citation></ref><ref id="scirp.66762-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Ben Salem, A. (2015) Qualité contextuelle des données: Détection et nettoyage guidés par la sémantique des données. Ph.D. Thesis, Université Paris 13, Sorbonne Paris Cité, Paris.</mixed-citation></ref><ref id="scirp.66762-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Boufarès, F., Ben Salem, A. and Correia, S. 
(2012) Qualité de données dans les entrepôts de données: Elimination des similaires. Proceedings: 8èmes Journées francophones sur les Entrepôts de Données et l’Analyse en ligne, Bordeaux France, 12-13 Juin 2012, 32-41.</mixed-citation></ref><ref id="scirp.66762-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Boufarès, F., Ben Salem, A., Rehab, M. and Correia, S. (2013) Similar Elimination Data: MFB Algorithm. Proceedings of IEEE-2013 International Conference on Control, Decision and Information Technologies, Hammamet Tunisie, 6-8 May 2013, 289-293.</mixed-citation></ref><ref id="scirp.66762-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Levenshtein. http://fr.wikipedia.org/wiki/Distance_de_Levenshtein</mixed-citation></ref><ref id="scirp.66762-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Jaro-Winkler. http://fr.wikipedia.org/wiki/Distance_de_Jaro-Winkler</mixed-citation></ref><ref id="scirp.66762-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Soundex. https://fr.wikipedia.org/wiki/Soundex</mixed-citation></ref><ref id="scirp.66762-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Metaphone. https://fr.wikipedia.org/wiki/Metaphone</mixed-citation></ref><ref id="scirp.66762-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Zaidi, H., Boufarès, F., Pollet, Y. and Kraiem, N. (2015) Semantic of Data Dependencies to Improve the Data Quality. Proceedings of the 5th International Conference on Model &amp; Data Engineering, LNCS, Vol. 9344, Springer, Rhodes Greece, 26-28 September 2015, 53-61.</mixed-citation></ref><ref id="scirp.66762-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Zaidi, H., Boufarès, F. and Pollet, Y. (2016) Nettoyage de données guidé par la sémantique inter-colonne. 
Proceedings: 16th Conférence Internationale sur l’Extraction et la Gestion des Connaissances, Vol. RNTI-E-30, Reims France, 18-22 Janvier 2016, 549-550.</mixed-citation></ref><ref id="scirp.66762-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Diallo, T. and Novelli, N. (2010) Découverte des dépendances fonctionnelles conditionnelles. Proceedings: 10th Conférence Internationale sur l’Extraction et la Gestion des Connaissances, Hammamet Tunisie, 26-29 Janvier 2010, 315-326.</mixed-citation></ref><ref id="scirp.66762-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Simonenko, E. and Novelli, N. (2012) Extraction de dépendances fonctionnelles approximatives: Une approche incrémentale. Proceedings: 12th Conférence Internationale Francophone sur l’Extraction et la Gestion des Connaissances, Bordeaux France, 31 Janvier-1 Février 2012, 95-100.</mixed-citation></ref><ref id="scirp.66762-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Dean, J. and Ghemawat, S. (2004) MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Conference on Symposium on Operating System Design and Implementation, California, 137-150.</mixed-citation></ref><ref id="scirp.66762-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Pentaho Data Integration. http://www.pentaho.fr/explore/pentaho-data-integration</mixed-citation></ref><ref id="scirp.66762-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Garnaud, E., Hanusse, N., Maabout, S. and Novelli, N. (2013) Calcul parallèle de dépendances. Proceedings: 29e Journées Bases de Données Avancées, Nantes France, Octobre 2013, 1-20.</mixed-citation></ref></ref-list></back></article>