<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JDAIP</journal-id><journal-title-group><journal-title>Journal of Data Analysis and Information Processing</journal-title></journal-title-group><issn pub-type="epub">2327-7211</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jdaip.2019.74015</article-id><article-id pub-id-type="publisher-id">JDAIP-95871</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject><subject> Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  Towards Kikamba Computational Grammar
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Benson</surname><given-names>Kituku</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Wanjiku</surname><given-names>Nganga</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Lawrence</surname><given-names>Muchemi</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib></contrib-group><aff id="aff2"><addr-line>School of Computing and Informatics, University of Nairobi, Nairobi, Kenya</addr-line></aff><aff id="aff1"><addr-line>Computer Science Department, Dedan Kimathi University of Technology, Nyeri, Kenya</addr-line></aff><pub-date pub-type="epub"><day>12</day><month>09</month><year>2019</year></pub-date><volume>07</volume><issue>04</issue><fpage>250</fpage><lpage>275</lpage><history><date date-type="received"><day>8,</day>	<month>September</month>	<year>2019</year></date><date date-type="rev-recd"><day>19,</day>	<month>October</month>	<year>2019</year>	</date><date date-type="accepted"><day>22,</day>	<month>October</month>	<year>2019</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  The under-resourced Kikamba language has few language technology tools because the more efficient and popular data-driven approaches for developing them suffer from data sparseness caused by the lack of digitized corpora. To address this challenge, we have developed a computational grammar for Kikamba within the multilingual Grammatical Framework (GF) toolkit. GF uses the interlingua rule-based translation approach. We followed a morphology-driven strategy: we first developed regular expressions for morphological inflection and then the syntax rules. The grammar was evaluated on one hundred sentences in both English and Kikamba. The results were an encouraging 4-gram BLEU score of 83.05% and a position-independent error rate (PER) of 10.96%. The work contributes the following language technology resources for Kikamba: multilingual machine translation, a morphological analyzer, a computational grammar that provides a platform for developing multilingual applications, and the ability to generate bilingual corpora pairing Kikamba with every language currently defined in GF, which makes it easier to experiment with data-driven approaches.
 
</p></abstract><kwd-group><kwd>Grammar</kwd><kwd> Morphology</kwd><kwd> Syntax</kwd><kwd> Grammatical Framework</kwd><kwd> Under-Resourced language</kwd><kwd> Concord</kwd><kwd> Multilingual</kwd><kwd> Agglutination</kwd><kwd> Kikamba</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>The data-driven approaches commonly used to develop natural language processing (NLP) tools are currently unusable for under-resourced languages because of data sparsity, and this problem might not be resolved in the near future. Demand for NLP tools is high because the exponential growth of the Internet, coupled with the high penetration rate of connected mobile devices, has made a wealth of information available to people. There is, therefore, an urgent need to devise strategies that can accelerate the development of language technology tools and applications for under-resourced languages so that their speakers can keep using their languages in a digital environment. This paper describes the development of a computational grammar for Kikamba, an under-resourced language, using the multilingual Grammatical Framework toolkit.</p><p>Guthrie [<xref ref-type="bibr" rid="scirp.95871-ref1">1</xref>] classifies Kikamba as E55 (language 5 in group 50 of zone E) in the larger Bantu family, and the language has close to four million speakers. Its grammar is agglutinative, tonal and inflectional, and it has a noun class system, or class gender (a noun prefix plus concord on the noun modifiers) [<xref ref-type="bibr" rid="scirp.95871-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref3">3</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref4">4</xref>]. In addition, its orthography consists of seven vowels and fifteen consonants [<xref ref-type="bibr" rid="scirp.95871-ref5">5</xref>]. 
Some descriptive grammar work on Kikamba already exists, though most of it is unpublished: derivational verb morphology [<xref ref-type="bibr" rid="scirp.95871-ref6">6</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref7">7</xref>], noun modification [<xref ref-type="bibr" rid="scirp.95871-ref3">3</xref>], morphosyntax [<xref ref-type="bibr" rid="scirp.95871-ref2">2</xref>] and tone [<xref ref-type="bibr" rid="scirp.95871-ref8">8</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref9">9</xref>]. Gaps remain in these works; for example, the subject marker and negation in verb morphology are described only for the class gender that covers humans. The concord for possessive pronouns, morphophonological changes in adjectives and verbs, and the morphology of compound nouns and adjectives are yet to be described. With respect to language resource tools, to the best of our knowledge only two exist for this language: a part-of-speech tagger and a named entity recognizer [<xref ref-type="bibr" rid="scirp.95871-ref10">10</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref11">11</xref>]. GF has also been used to model language resources for Bantu languages. Kiswahili has a partial morphology analyzer [<xref ref-type="bibr" rid="scirp.95871-ref12">12</xref>], while Tswana, spoken in South Africa, has a mini resource grammar [<xref ref-type="bibr" rid="scirp.95871-ref13">13</xref>]. To date, however, no wide-coverage grammar for a Bantu language has been built in GF. The development of the Kikamba computational grammar is therefore a significant milestone towards a standard Basic Language Resource Kit (BLARK) [<xref ref-type="bibr" rid="scirp.95871-ref14">14</xref>], since it yields a morphological analyzer and multilingual translation through the capabilities of the Grammatical Framework. 
It will also be a catalyst for the provision of information and communication technology (ICT) in Kikamba, thus helping to bridge the digital divide. It will provide a platform for the generation of parallel corpora and treebanks, which are crucial for building NLP tools with data-driven approaches. Finally, it is an electronic preservation effort for the Kikamba language, so that the Kamba people are not disenfranchised in the global information space.</p></sec><sec id="s2"><title>2. Kikamba Descriptive Grammar</title><sec id="s2_1"><title>2.1. Morphology</title><p>Kikamba forms words from morphemes through prefixing and suffixing (agglutination), under the direct influence of the noun class system, noun concord and morphophonological transformation. Only a few borrowed or irregular words deviate from noun class prefixing. Whether the noun class system should be referred to as gender or as noun class has been debated. Some consider a pair of singular and plural noun classes to be a gender [<xref ref-type="bibr" rid="scirp.95871-ref15">15</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref16">16</xref>]. Demuth [<xref ref-type="bibr" rid="scirp.95871-ref17">17</xref>] reinforces this view by proposing that a noun class is a subset of a gender. However, Ibrahim [<xref ref-type="bibr" rid="scirp.95871-ref18">18</xref>] argues that either term, gender or noun class, is tenable, since Bantu genders are not based on natural sex semantics as is the case in Indo-European languages. For the purpose of this paper, we treat each pair of noun classes (singular and plural) as forming a class gender. <xref ref-type="table" rid="table1"><xref ref-type="table" rid="table">Table </xref>1</xref> lists all noun classes for Kikamba [<xref ref-type="bibr" rid="scirp.95871-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref3">3</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref4">4</xref>]. 
The morpheme before the underscore represents the singular noun class and the one after it the plural noun class; together they form the class gender, which is encoded in the third column for use in the GF grammar modeling. We discuss the inflection of the open categories first and the closed categories thereafter.</p><sec id="s2_1_1"><title>2.1.1. Noun</title><p>Noun morphology consists of an obligatory prefix and root plus an optional suffix. The prefix determines the noun class number; Example 1 illustrates its usage, where the notation “c” means class and the number is the noun class number from <xref ref-type="table" rid="table1"><xref ref-type="table" rid="table">Table </xref>1</xref> (for example, c1 means noun class number one), while the root is the radical of the lexical word. The suffix “ni” is used to form a locative noun; the locative is a case (a grammar feature). In effect, a locative noun is a preposition and a noun combined: for example, “at the shop” becomes “dukani” and “on the table” becomes “mesani”. 
The words “shop” and “table” in Kikamba are “duka” and “mesa”; the preposition is therefore realized by adding the suffix “ni”.</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1"><xref ref-type="table" rid="table">Table </xref>1</xref></label><caption><title> Kikamba noun classes</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Classes (c)</th><th align="center" valign="middle" >Class number</th><th align="center" valign="middle" >GF coding</th></tr></thead><tr><td align="center" valign="middle" >mu_a</td><td align="center" valign="middle" >1, 2</td><td align="center" valign="middle" >G1</td></tr><tr><td align="center" valign="middle" >mu_mi</td><td align="center" valign="middle" >3, 4</td><td align="center" valign="middle" >G2</td></tr><tr><td align="center" valign="middle" >i_ma</td><td align="center" valign="middle" >5, 6</td><td align="center" valign="middle" >G3</td></tr><tr><td align="center" valign="middle" >ki_i</td><td align="center" valign="middle" >7, 8</td><td align="center" valign="middle" >G4</td></tr><tr><td align="center" valign="middle" >ka_tu</td><td align="center" valign="middle" >12, 13</td><td align="center" valign="middle" >G5</td></tr><tr><td align="center" valign="middle" >va_ku</td><td align="center" valign="middle" >14, 15</td><td align="center" valign="middle" >G6</td></tr><tr><td align="center" valign="middle" >n_n</td><td align="center" valign="middle" >9, 10</td><td align="center" valign="middle" >G7</td></tr><tr><td align="center" valign="middle" >u_ma</td><td align="center" valign="middle" >11, 6</td><td align="center" valign="middle" >G8</td></tr><tr><td align="center" valign="middle" >u_n</td><td align="center" valign="middle" >11, 10</td><td align="center" valign="middle" >G9</td></tr><tr><td align="center" valign="middle" >ku_ma</td><td align="center" valign="middle" >15, 6</td><td align="center" valign="middle" >G10</td></tr></tbody></table></table-wrap></sec><sec 
id="s2_1_2"><title>2.1.2. Adjective</title><p>The adjective describes and modifies a noun, and its inflection consists of a prefix (concord) that agrees with the class gender of the noun being modified. The adjective is formed by concatenating this prefix with the adjective root [<xref ref-type="bibr" rid="scirp.95871-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref4">4</xref>]. Example 2 demonstrates the structure of the adjective: the prefix is indicated by the noun class and the radical by Adjroot.</p></sec><sec id="s2_1_3"><title>2.1.3. Verbs</title><p>Kikamba is no exception to the complexity of verb morphology in Bantu languages. Its conjugation involves several morphemes (several prefixes, the root, an extension suffix and a final vowel that represents mood) plus grammar features such as person, number, class gender, tense and polarity. <xref ref-type="table" rid="table2"><xref ref-type="table" rid="table">Table </xref>2</xref> describes all the morphemes used in verb inflection [<xref ref-type="bibr" rid="scirp.95871-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref7">7</xref>]. The object marker, infinitive and extension suffix are not obligatory, and in some cases of negative polarity the subject marker and negation marker are fused into one morpheme. Importantly, the focus and negation markers do not co-exist. Finally, the morphemes of a verb embody all the constituents needed to make a sentence, which is why a verb can stand in place of a sentence. 
Examples 3 - 6 demonstrate this principle.</p><table-wrap id="table2" ><label><xref ref-type="table" rid="table2"><xref ref-type="table" rid="table">Table </xref>2</xref></label><caption><title> Architecture of verbs</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Architecture</th><th align="center" valign="middle" >Morpheme</th><th align="center" valign="middle" >Kikamba</th></tr></thead><tr><td align="center" valign="middle" >Prefixes</td><td align="center" valign="middle" >Focus</td><td align="center" valign="middle" >“ni”</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >Negation</td><td align="center" valign="middle" >as per class</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >Subject marker</td><td align="center" valign="middle" >as per class and person</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >Tense/Aspect</td><td align="center" valign="middle" >as per tense</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >Object marker</td><td align="center" valign="middle" >as per class and person</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >Infinitive</td><td align="center" valign="middle" >“ku”</td></tr><tr><td align="center" valign="middle" >Root</td><td align="center" valign="middle" ></td><td align="center" valign="middle" >Root</td></tr><tr><td align="center" valign="middle" >Extension suffix</td><td align="center" valign="middle" >Applicative</td><td align="center" valign="middle" >“i”</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >Causative</td><td align="center" valign="middle" >“ithy”</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >Passive</td><td align="center" valign="middle" >“w”</td></tr><tr><td align="center" valign="middle" 
></td><td align="center" valign="middle" >Reversive</td><td align="center" valign="middle" >“u”</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >Reciprocal</td><td align="center" valign="middle" >“an”</td></tr><tr><td align="center" valign="middle" >Final vowel</td><td align="center" valign="middle" ></td><td align="center" valign="middle" >“a/e”</td></tr></tbody></table></table-wrap><p>Tense</p><p>Reichenbach [<xref ref-type="bibr" rid="scirp.95871-ref19">19</xref>] bases tenses on three points in time: the point of speech, the point of reference and the point of the event, with time reckoned from the speech point [<xref ref-type="bibr" rid="scirp.95871-ref7">7</xref>]. The coincidence of the three points results in the present tense. When the speech point is after the other two points, the past tense occurs. The future tense occurs when the speech point is before the other points. Finally, when the event time precedes the reference time, the result is the perfect tense. Aspect gives a view of the action of the verb, such as beginning, continuing or ended [<xref ref-type="bibr" rid="scirp.95871-ref7">7</xref>]. In Kikamba, tense and aspect are usually combined. Several tenses exist in Kikamba [<xref ref-type="bibr" rid="scirp.95871-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref7">7</xref>]. Here we exemplify the present, future, past and perfect tenses. The following notations are used: Fs for focus, Neg for negation, Agr for the subject marker, root for the root, Tns for tense, Asp for aspect and Fw for the final vowel.</p><p>The morpheme “ka” marks the future tense, also referred to as the indefinite future tense. The tense morpheme sits between the subject marker and the root, as exemplified in Example 3. 
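</p><p>As a rough illustration of the slot order in <xref ref-type="table" rid="table2">Table 2</xref>, the following Python sketch (written outside GF and not part of the grammar) concatenates the morphemes of a verb form in that order and rejects a form that carries both a focus and a negation marker. The function name assemble_verb, the slot names and the segmentation a-ka-kom-a (subject marker, future tense, root, final vowel) are assumptions made for illustration; the segmentation is inferred from the form “niakakoma” cited in this section, and treating the prefix “ni” of that form as filling the focus slot is likewise an assumption.</p><preformat preformat-type="code">
```python
# Slots in the order of Table 2: prefixes, root, extension suffix, final vowel.
SLOTS = (
    "focus", "negation", "subject", "tense", "object",
    "infinitive", "root", "extension", "final_vowel",
)


def assemble_verb(**morphemes: str) -> str:
    """Concatenate the given morphemes in Table 2 order (hypothetical helper)."""
    unknown = set(morphemes) - set(SLOTS)
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    # The focus and negation markers do not co-exist.
    if morphemes.get("focus") and morphemes.get("negation"):
        raise ValueError("focus and negation markers cannot co-occur")
    # Empty (optional) slots contribute nothing to the string.
    return "".join(morphemes.get(slot, "") for slot in SLOTS)


# Assumed segmentation: a (subject) + ka (future) + kom (root) + a (final vowel).
future = assemble_verb(subject="a", tense="ka", root="kom", final_vowel="a")
with_ni = assemble_verb(
    focus="ni", subject="a", tense="ka", root="kom", final_vowel="a"
)
print(future)   # akakoma
print(with_ni)  # niakakoma
```
</preformat><p>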
Kikamba also has a remote future tense, constructed by prefixing “ni” to the future tense form; applied to Example 3, this gives “niakakoma” (gloss: “he will sleep”).</p><p>The past tense is marked by the final-vowel morpheme “ie”, which is subject to phonological rules, and uses the infix “na” to mark aspect [<xref ref-type="bibr" rid="scirp.95871-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref7">7</xref>]. In negative polarity, the infix “nee” is used, as exemplified in Example 4.</p><p>The present tense, in some cases referred to as the present indefinite or habitual tense depending on usage, is marked by the aspect vowel “a” [<xref ref-type="bibr" rid="scirp.95871-ref7">7</xref>], as exemplified by Example 5.</p><p>Finally, the perfect tense is not marked by any morpheme in positive polarity, but in the negative it is marked by the morpheme “na”, as illustrated in Example 6.</p></sec><sec id="s2_1_4"><title>2.1.4. Closed Categories</title><p>The demonstrative is a noun modifier that shows how far the object(s) is/are from the speaker. Unlike Indo-European languages, which have demonstrative strings for near and distant, Kikamba has an extra string for the aforementioned, that is, an object already referred to [<xref ref-type="bibr" rid="scirp.95871-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref3">3</xref>]. Demonstratives inflect for the variable features of class gender and number.</p><p>Personal pronouns in Kikamba stand for absent nouns; in GF they are modeled as noun phrases and therefore have a string and enforce agreement (person, class gender and number). 
The possessive pronoun is a noun modifier depicting ownership; its architecture consists of a prefix dependent on class gender and number [<xref ref-type="bibr" rid="scirp.95871-ref3">3</xref>], as exemplified in Example 7, and a root.</p><p>Elicitation showed that the strings of some prepositions, for example “of”, vary with class gender and number, while most prepositions do not inflect. In addition, some prepositions are fused into the noun, as demonstrated in Example 8, resulting in the locative noun. Cardinal and ordinal numerals can be expressed in words or digits. When expressed in words, the cardinal numerals one to five behave like adjectives and take concord agreement, while the rest are independent of the class gender [<xref ref-type="bibr" rid="scirp.95871-ref3">3</xref>]. Ordinal numbers consist of two strings: the preposition “of” and a numeral string, both dependent on class gender and singular number. Finally, adverbs do not inflect and there are no articles in Kikamba.</p></sec></sec><sec id="s2_2"><title>2.2. Syntax</title><p>The main topology of the Kikamba sentence is subject-verb-object (SVO) [<xref ref-type="bibr" rid="scirp.95871-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref7">7</xref>], whereby the subject is a noun phrase followed by a verb phrase. The verb phrase combines a verb with an object complement, which can itself be a verb phrase, a noun phrase, etc. The presence of the object depends on the verb valence (univalent, divalent or trivalent). For example, with a univalent verb the topology becomes SV, because a one-place verb takes no object argument. 
Syntactic agreement is realized through concord among the lexical items, mainly governed by the class gender of the noun [<xref ref-type="bibr" rid="scirp.95871-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref3">3</xref>].</p><p>Noun phrases are made of a noun and its modifiers, which include adjectives (Adj), determiners (Det), both possessive (poss) and demonstrative (dem), and numerals (Num). Rugemarila [<xref ref-type="bibr" rid="scirp.95871-ref20">20</xref>] has worked extensively on the structure of noun phrases in Bantu languages and has concluded that the structure is as illustrated below, which concurs with the one presented by Mbuvi [<xref ref-type="bibr" rid="scirp.95871-ref3">3</xref>] for Kikamba.</p><p>[dem] [Noun] [Det] [Num] [Adj]</p><p>The structure of a verb phrase is the same as that of a verb and carries all the parameters that are integral to verbs.</p></sec></sec><sec id="s3"><title>3. Translation Approaches</title><p>The three main approaches to machine translation are data-driven, rule-based and hybrid strategies [<xref ref-type="bibr" rid="scirp.95871-ref21">21</xref>]. The data-driven approach, which includes neural network models and statistical models, uses a parallel aligned corpus to make machine translation possible. It is divided into statistical and example-based translation. The rule-based approach uses syntax rules, lexical rules and a lexicon to form a computational grammar based on Chomsky's theories [<xref ref-type="bibr" rid="scirp.95871-ref21">21</xref>]. Word-based, transfer and interlingua are the three subcategories of rule-based approaches. A grammar formalism determines the architecture of the grammar. The hybrid approach combines the above approaches, as either a rule-based guided hybrid or a data-driven hybrid translation. As noted in Section 1, Kikamba is an under-resourced language. 
Thus, very few digital corpora are available, which is why we used the interlingua rule-based translation approach. The Grammatical Framework was chosen for three reasons. First, its multilingual capability enables the creation of the technology in the different languages already defined in GF. Secondly, the separation of tecto-grammatical (abstract syntax) and pheno-grammatical (concrete syntax) structure [<xref ref-type="bibr" rid="scirp.95871-ref22">22</xref>] enables faster development, since one concentrates only on the concrete syntax of the language being developed. Finally, it provides a platform where application grammars can develop controlled natural languages on top of the resource grammars without the application programmer knowing the mechanics of the resource grammars.</p><p>Grammatical Framework (GF henceforth) is a toolkit for the rapid development of multilingual grammar resources and applications, based on the functional programming paradigm and on a logical framework of abstract syntax plus concrete syntax. GF is also a grammar formalism grounded on categorical formalism [<xref ref-type="bibr" rid="scirp.95871-ref23">23</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref24">24</xref>]. GF has one abstract syntax, which defines the categories of trees and the functions that build them, and many concrete syntaxes, one for each language, which provide the linearization of the categories and functions of the trees embodied in the abstract grammar [<xref ref-type="bibr" rid="scirp.95871-ref22">22</xref>]. These parallel concrete-syntax grammars, which are equivalent to parallel multiple context-free grammars, reside in the Grammar Resource Library (GRL) [<xref ref-type="bibr" rid="scirp.95871-ref25">25</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref26">26</xref>], which currently has over 35 languages forming the multilingual ecosystem of GF [<xref ref-type="bibr" rid="scirp.95871-ref13">13</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref27">27</xref>]. 
The GRL is divided into morphological and syntactic components [<xref ref-type="bibr" rid="scirp.95871-ref22">22</xref>]. In the morphology component, inflectional smart paradigms are written with regular expressions [<xref ref-type="bibr" rid="scirp.95871-ref22">22</xref>] to build the morphological lexicons of the categories, while the syntax component implements the syntax rules. In GF, parsing transforms language-specific concrete syntax into abstract trees (language analysis), while linearization transforms abstract trees into strings in a specific language (language generation).</p><p>Grammar features are defined using parameters, which are objects of some type introduced with the keyword param. Below is an illustration of the parameter Number.</p><p>param</p><p>Number = Singular | Plural;</p><p>GF makes a distinction between inherent and variable features of grammar. To gather all features of a specific category together, a record is used. For example, the noun category in Kikamba has the inherent feature class gender and the variable features number and case, and therefore its linearization type, a record gathering all features, would be defined as below.</p><p>N = {s: Number =&gt; Case =&gt; Str; g: Cgender};</p><p>Finally, GF uses the operator “+” for concatenation and the keyword oper to define the operations (functions) that implement the regular expressions of all categories in the morphology modules.</p></sec><sec id="s4"><title>4. Implementing the Kikamba Grammar in GF</title><p>Dictionaries, postgraduate linguistics theses and informants (who speak the language and/or are linguists) formed the data sources for the lexicon and descriptive grammar. Linguists cleaned and authenticated the data and, through elicitation, generated from corpora the morphology and syntax of the categories that were missing in the descriptive grammar. 
The elicitation was performed either through language analysis of the corpus based on linguist judgment or by translation from English to the specific Bantu language, as proposed by Chelliah [<xref ref-type="bibr" rid="scirp.95871-ref28">28</xref>]. Snowball sampling [<xref ref-type="bibr" rid="scirp.95871-ref29">29</xref>] <sup>1</sup>, a non-probabilistic sampling technique, was used to gather the sparse corpora and to identify the few linguists available for the language. The evolutionary prototype model [<xref ref-type="bibr" rid="scirp.95871-ref30">30</xref>] <sup>2</sup> was applied, since every function or module developed in GF had to be tested and refined until it produced the correct output. The interlingua rule-based approach was used to develop the computational grammar following a morphology-driven strategy, which is a bottom-up method: first the lexicon is defined, then the categories and their smart paradigms based on regular expressions, and finally the syntax rules [<xref ref-type="bibr" rid="scirp.95871-ref25">25</xref>]. We therefore first discuss the morphology of the parts of speech and then the syntax rules.</p><sec id="s4_1"><title>4.1. Noun</title><p>Noun inflection was modeled with the grammar features class gender, number and case. Ten class genders were identified, as shown in <xref ref-type="table" rid="table1"><xref ref-type="table" rid="table">Table </xref>1</xref> column 1, and coded as per column 3 to ease their use, since this work is part of a project to create a computational grammar for Bantu languages in Kenya. The case feature consisted of nominative and locative. The locative case was created by adding the suffix “ni” to the nominative form in the lexicon, while number covers singular and plural. 
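</p><p>The locative rule can be sketched in a few lines of Python, as an illustration outside GF rather than the grammar's own code. The function name case_forms is hypothetical; the words “duka” and “mesa” and their locative forms are the ones given in Section 2.1.1, and Nom here simply labels the base form stored in the lexicon.</p><preformat preformat-type="code">
```python
CASES = ("Nom", "Loc")


def case_forms(noun: str) -> dict:
    """Return the base (Nom) and locative (Loc) forms of a noun string."""
    # The locative is the base form plus the suffix "ni".
    return dict(zip(CASES, (noun, noun + "ni")))


for word in ("duka", "mesa"):
    forms = case_forms(word)
    print(forms["Nom"], forms["Loc"])
# duka dukani
# mesa mesani
```
</preformat><p>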
The noun inflects for number and case, with class gender as an inherent feature, as shown by the linearization category (lincat) of the noun below.</p><p>param</p><p>Number = Sg | Pl;</p><p>Case = Nom | Loc;</p><p>Cgender = G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | G9 | G10;</p><p>lincat</p><p>N = {s: Number =&gt; Case =&gt; Str; g: Cgender};</p><p>Kikamba has simple nouns (a single string) and compound nouns (two strings), with inflection happening by changing the prefix (see <xref ref-type="table" rid="table3"><xref ref-type="table" rid="table">Table </xref>3</xref>). The smart paradigm regN implements the simple noun, while compoundN implements the compound noun. Nouns that do not inflect regularly were modeled using iregN (irregular nouns). The make-noun function mkN assembles the smart paradigms, as shown below together with snippets of the smart paradigms.</p><p>mkN = overload {</p><p>mkN: Str -&gt; Cgender -&gt; N = \n, g -&gt; lin N (regN n g);</p><p>mkN: (man, men: N) -&gt; Cgender -&gt; N = compoundN;</p><p>mkN: (man, men: Str) -&gt; Cgender -&gt; N = \s,p,g -&gt; lin N (iregN s p g);};</p><p>The function PrefixPlNom provides the inflection prefix, while each smart paradigm retains the class gender for later concord agreement with the noun modifiers at the syntax stage.</p><table-wrap id="table3" ><label><xref ref-type="table" rid="table3"><xref ref-type="table" rid="table">Table </xref>3</xref></label><caption><title> Kikamba noun morphology</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Noun inflection</th></tr></thead><tr><td align="center" valign="middle" >regN: Str -&gt; Cgender -&gt; Noun = \w, g -&gt; let wpl = case g of { G1 =&gt; case w of {"mwa" + _ =&gt; Predef.drop 2 w; "mwi" + _ =&gt; "e" + Predef.drop 3 w; _ =&gt; PrefixPlNom G1 + Predef.drop 2 w }; G2 =&gt; case w of {"mw" + _ =&gt; "my" + Predef.drop 2 w; _ =&gt; PrefixPlNom G2 + Predef.drop 2 w }; ……………………………………………….. 
_ =&gt; PrefixPlNom g + Predef.drop 2 w}; in iregN w wpl g; compoundN: N -&gt; N -&gt; Cgender -&gt; N = \mundu,muume,g -&gt; { s = \\n,c =&gt; mundu.s ! n ! c ++ muume.s ! n ! c; g = g; lock_N = &lt;&gt; }; iregN: Str -&gt; Str -&gt; Cgender -&gt; Noun = \man,men,g -&gt; { s = table{Sg =&gt; table{Nom =&gt; man; Loc =&gt; man + "ni" | men + "ni" }; Pl =&gt; table{Nom =&gt; men; Loc =&gt; ""}}; g = g; };</td></tr></tbody></table></table-wrap></sec><sec id="s4_2"><title>4.2. Adjective</title><p>Adjectives were implemented using the parameter AForm, which has positive (AAdj) and comparative (AComp) forms plus adverbs (Advv) formed from adjectives; the adjective forms use the variable features class gender and number. The comparative form was implemented by adding the infix “ang” to the positive form just before the final vowel of the adjective. <xref ref-type="table" rid="table4"><xref ref-type="table" rid="table">Table </xref>4</xref> provides a snippet of the smart paradigm for regular adjectives (regA). The function ConsonantAdjprefix provides the class gender prefix for the adjective. The concatenation of the class gender prefix with a vowel-initial adjective root is subject to morphophonological processes.</p><p>AForm = AAdj Cgender Number | AComp Cgender Number | Advv;</p></sec><sec id="s4_3"><title>4.3. Verbs</title><p>The GRL provided a grid of 4*2*2 combinations for implementing verbs: four tenses (present, past, future and conditional), two polarities (positive and negative) and two anteriorities (anterior and simultaneous). This grid expanded to 10*2*4*2*2 because some morphemes in Kikamba verbs, such as the subject marker and object marker, depend on the ten class genders and on number. 
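</p><p>The sizes of the two grids can be checked by enumerating them. The Python sketch below is only a worked arithmetic example; the value labels are illustrative names chosen here, not identifiers taken from the grammar.</p><preformat preformat-type="code">
```python
from itertools import product

TENSES = ("present", "past", "future", "conditional")
POLARITIES = ("positive", "negative")
ANTERIORITIES = ("simultaneous", "anterior")
GENDERS = tuple(f"G{i}" for i in range(1, 11))  # ten class genders
NUMBERS = ("Sg", "Pl")

# Grid of tense x polarity x anteriority combinations.
base_grid = list(product(TENSES, POLARITIES, ANTERIORITIES))
# The same grid once class gender and number are added.
full_grid = list(product(GENDERS, NUMBERS, TENSES, POLARITIES, ANTERIORITIES))

print(len(base_grid))  # 16
print(len(full_grid))  # 320
```
</preformat><p>The twentyfold growth, from 16 to 320 combinations per verb, is the motivation for the split between verb-level and verb-phrase-level morphemes described next.</p><p>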
To improve time and space complexity, we implemented the verb suffixes in <xref ref-type="table" rid="table2"><xref ref-type="table" rid="table">Table </xref>2</xref> at the verb level and left the prefixes to be concatenated at the verb phrase level.</p><p>The verb forms needed to implement the verb and the verb phrase were identified as the present progressive, infinitive, past tense, present definite and neutral forms, and the parameter VForm was used to assemble them as shown below. The VForm constructor VExtension provided the derivational morphology based on the extension suffixes presented in <xref ref-type="table" rid="table2"><xref ref-type="table" rid="table">Table </xref>2</xref>.</p><table-wrap id="table4" ><label><xref ref-type="table" rid="table4"><xref ref-type="table" rid="table">Table </xref>4</xref></label><caption><title> Kikamba adjective morphology</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Adjective inflection</th></tr></thead><tr><td align="center" valign="middle" >regA:Str -&gt; {s: AForm =&gt; Str} = \seo -&gt; {s = table { AAdj G1 Sg=&gt;case Predef.take 1 seo of { “a”|”e”|”i”|”o” =&gt; “mw” + seo; “u” =&gt; “m” + seo; _ =&gt; ConsonantAdjprefix G1 Sg + seo }; ………………………………………………………….. 
AComp g Sg=&gt;let af: Str = case Predef.take 1 seo of { “i” =&gt; “mw” + seo; “a” =&gt; “my” + seo; “u” =&gt; “m” + seo; _ =&gt; ConsonantAdjprefix g Sg + seo }; in init af + “ang” + last af } };</td></tr></tbody></table></table-wrap><p>The smart paradigms regV and iregV were the functions for regular and irregular verbs.</p><p>param</p><p>VExte = EPassive | EApplicative | EReciprocal | ECausative | EDistributive;</p><p>VForm = VPreProg | VInf |VPast | VPreDef | VGen | VExtension VExte;</p><p>oper</p><p>regV: Str -&gt; Verb =\vika -&gt; let root = init vika</p><p>in {s = table{</p><p>VPreProg =&gt; case Predef.dp 1 root of {</p><p>“b” |”v”|”m” =&gt; root + “ete”;</p><p>_ =&gt; root + “ite”};</p><p>VInf =&gt; “ku”+ vika;</p><p>VPast =&gt; root + “ie”;</p><p>VPreDef =&gt; root + “aa”;</p><p>VExtension type =&gt; init vika + extension type + last vika;</p><p>VNeuter =&gt; vika}};</p><p>iregV: Str -&gt; Verb =\vika -&gt; {s=\\_=&gt; vika};</p></sec><sec id="s4_4"><title>4.4. Numeral</title><p>Cardinal and ordinal numerals were implemented in words from 0 up to 999,999. Two parameters were used to model numerals. First, DForm has four forms: unit represents the numerals 0 - 9, teen the forms built with “ikumi na”, ten the range 10 - 99 and hund the range 100 - 999. Second, CardOrd distinguishes ordinal (NOrd) from cardinal (NCard) numerals. The smart paradigm for regular numbers (regNum) was used to implement the numerals. 
Ordinal numerals were formed from cardinal numerals by adding the class gender morpheme supplied by the function Ordprefix.</p><p>param</p><p>DForm = unit | teen | ten | hund;</p><p>CardOrd = NCard | NOrd;</p><p>oper</p><p>regNum: Str -&gt; {s: DForm =&gt; CardOrd =&gt; Cgender =&gt; Str} =</p><p>\six -&gt; {s = table {</p><p>unit =&gt; table {NCard =&gt;\\g =&gt; six;</p><p>NOrd =&gt; \\g =&gt; Ordprefix g ++ six};</p><p>teen =&gt; table {NCard =&gt;\\g =&gt; “ikumi na” ++ six;</p><p>NOrd =&gt; \\g =&gt; Ordprefix g ++ “ikumi na” ++ six};</p><p>ten =&gt; table {NCard =&gt;\\g =&gt; “miongo” ++ six;</p><p>NOrd =&gt; \\g =&gt; Ordprefix g ++ “miongo” ++ six};</p><p>hund =&gt; table {NCard =&gt;\\g =&gt; “maana” ++ six;</p><p>NOrd =&gt; \\g =&gt; Ordprefix g ++ “maana” ++ six} } };</p></sec><sec id="s4_5"><title>4.5. Personal Pronouns and Possessives</title><p>The personal pronoun is a plain string but requires concord agreement in class gender, number and person, since GF treats it as a noun phrase, while the possessive inflects by class gender and number. The PronForm parameter represents these two cases, as depicted below. The make-pronoun function mkPron generates both lexemes from two strings, a class gender, a number and a person, which are supplied by the linearization (lin) of the pronoun, as shown by the example he_Pron below. Finally, the functions ProunSgprefix and ProunSgprefix provided the class-gender-specific prefix for concatenation with the possessive stem, as shown in <xref ref-type="fig" rid="fig1"><xref ref-type="fig" rid="fig">Figure </xref>1</xref>.</p><p>param</p><p>Agr = Ag Cgender Number Person;</p><p>PronForm = Pers | Poss Number Cgender;</p><p>lin</p><p>he_Pron = mkPron “we” “ake” G1 Sg P3;</p></sec><sec id="s4_6"><title>4.6. Other Morphology Categories</title><p>The demonstrative, quantifier and preposition were configured as strings dependent on the class gender and number parameters. Adverbs do not inflect and hence are independent strings. 
The linearization type of the preposition was configured with a Boolean field to distinguish prepositions that are fused with nouns from those that are not. Below are the linearization category and the smart paradigm mkPrep.</p><p>lin</p><p>above_Prep = mkPrep “iulu” False;</p><p>oper</p><p>Prepp = {s: Number =&gt; Cgender =&gt; Str; isFused: Bool};</p><p>mkPrep = overload {</p><p>mkPrep: Str -&gt;Bool-&gt; Prep = \str,bool -&gt; lin Prep {s = \\n,g =&gt; str; isFused = bool};</p><p>mkPrep: (Number =&gt; Cgender =&gt; Str) -&gt;Bool-&gt; Prep = \t,bool -&gt; lin Prep {s = t; isFused = bool}; };</p></sec><sec id="s4_7"><title>4.7. Common Noun (CN)</title><p>In Indo-European languages, the CN is combined with an adjective to form an NP or another CN, and a determiner can later be added as a pre-modifier or post-modifier. In the Kikamba language, however, the determiner is placed between the noun and the adjective. The CN was therefore designed with two string fields, as exemplified below: “s” holds the common noun while “s2” holds the adjective, which makes it easy to insert a determiner between the two. The class gender was retained from the noun since it is used in agreement (concord). All noun modifiers come after the noun, with the exception of some quantifiers. The Kikamba language does not have articles.</p><p>lincat</p><p>CN = CNoun;</p><p>oper</p><p>CNoun: Type = {s: Number =&gt; Case =&gt; Str; g: Cgender; s2: Number =&gt; Str};</p><p>The CN takes pre- and post-modifiers such as an adjective, relative clause, adverb, sentence and noun phrase, and ten syntax rules were constructed based on them. Below is an example of combining an adjective and a common noun.</p><p>AdjCN ap cn = {s = cn.s; g = cn.g; s2 = \\n =&gt; cn.s2! n ++ ap.s ! cn.g ! n};</p></sec><sec id="s4_8"><title>4.8. 
Determiner Phrase (Det)</title><p>Determiner phrases can be either possessive or demonstrative, and were implemented using quantifiers, numbers and possessive pronouns. Three rules were implemented for the determiner phrase; below is an example of one of them, which forms a Det from a quantifier and a number.</p><p>DetQuant quant num = {s = \\Cgender =&gt;quant.s ! num.n!Cgender ++ num.s !Cgender;</p><p>n = num.n; isPre = True};</p></sec><sec id="s4_9"><title>4.9. Adjective Phrase</title><p>The adjective phrase was modeled from the positive adjective, the comparative adjective, post-modifiers of an adjective such as adverbs, and the attachment of an adjective to a sentence. In total, eleven rules were used to implement adjective phrases and the comparative adjective phrase. The next rule exemplifies the implementation. The agreement consists of number and class gender, and the Boolean value allows us to place the adjective phrase after the noun.</p><p>ComparA a np = {s = \\g,n =&gt; a.s !AAdj g n ++ “kuvita” ++ np.s ! npNom; isPre = False};</p></sec><sec id="s4_10"><title>4.10. Noun Phrase (NP)</title><p>The NP was implemented from the common noun, proper names, determiners and pronouns, and also through recursion of NP with adverbs, pre-determiners and determiners. The NP implementation used two parameters: case and agreement (concord). For case, we introduced an extra case, NPoss, to cater for NPs formed from personal and possessive pronouns. Eight rules were implemented to form NPs. Below is an example of how to form an NP by combining a determiner and a common noun in the Kikamba language. The Boolean field associated with the determiner allows pre- and post-determiners of the CN to be placed in the right position.</p><p>DetCN det cn = {s =\\c=&gt; case det.isPre of {</p><p>False =&gt; det.s!cn.g ++ cn.s ! det.n !npcase2case c ++ cn.s2!det.n;</p><p>True =&gt; cn.s ! det.n !npcase2case c ++ det.s!cn.g ++ cn.s2!det.n};</p><p>a =Ag cn.g det.n P3;};</p></sec><sec id="s4_11"><title>4.11. 
Verb Phrase (VP)</title><p>In the VP, the prefix morphemes (focus, negation, subject marker, tense) were concatenated to verbs, as mentioned in section 3.3, to make a complete verb. Since a whole verb can act as a sentence, the sentence parameters polarity, tense and anteriority were used in the design in addition to agreement, as exemplified by the operation (oper) VerbPhrase. Five record fields were used: s for the normal verb, progV for progressive verbs, compl for the object of the verb, imp for imperative verbs and inf for infinitive verbs. The subcategorization of verbs (one-place, two-place and three-place verbs) was handled through compl, and in total 20 rules were implemented based on the regular verb phrase function regVP.</p><p>oper</p><p>VerbPhrase: Type = {</p><p>s: Agr =&gt; Polarity =&gt; Tense =&gt; Anteriority =&gt; Str;</p><p>compl: Agr =&gt; Str;</p><p>progV: Str;</p><p>imp: Polarity =&gt; ImpForm =&gt; Str;</p><p>inf: Str};</p></sec><sec id="s4_12"><title>4.12. Other Syntax Categories</title><p>A clause was formed by combining a noun phrase and a verb phrase and implemented the SVO topology, where the O was the second string of the verb phrase, which implemented the complement of the verb. Below, we illustrate one of the rules for forming clauses. The clauses formed a sentence with the same parameters. However, the difference in GF is that the polarity and tense in clauses are undetermined [<xref ref-type="bibr" rid="scirp.95871-ref25">25</xref>]. Finally, sentences and interrogatives form the utterance (utt), which is the starting category for this computational grammar and was modeled based on definition 2. 
Seven clause rules, eight utterance rules and seven sentence rules were implemented.</p><p>PredVP np vp = let agr = verbAgr np.a in{s=\\pol,tense,anter =&gt; let</p><p>verb: Str = vp.s!Ag agr.g agr.n agr.p !pol!tense!anter;</p><p>obj: Str = vp.compl !Ag agr.g agr.n agr.p; in</p><p>np.s !npNom ++ verb ++ obj};</p></sec></sec><sec id="s5"><title>5. Results</title><p>The Kikamba grammar was subjected to test suites for purposes of testing and evaluation. Testing aimed to improve grammar quality (reducing overgeneration and ensuring coverage) during development, while evaluation aimed to check the coverage and quality of the grammar after development. The linguistic phenomena covered by this grammar are shown in <xref ref-type="fig" rid="fig">Figure </xref>A1 and are the ones that were tested and evaluated. There are three ways to create test suites for testing computational grammars [<xref ref-type="bibr" rid="scirp.95871-ref31">31</xref>] [<xref ref-type="bibr" rid="scirp.95871-ref32">32</xref>].</p><p>&#183; The grammar writer or an expert writes the test suite data or uses already existing test suites.</p><p>&#183; Using naturally occurring corpora or treebanks.</p><p>&#183; Using the comments written for each grammar rule, which show what the rule parses in the grammar.</p><p>We based our evaluation and testing on the aspects of the grammar already developed, as per <xref ref-type="table" rid="table5"><xref ref-type="table" rid="table">Table </xref>5</xref>. Thus, we used method one for evaluation and method three for testing.</p><p>To create the test suite for testing, the comments for each function or rule in the abstract syntax were used. Each comment is an example of what the rule can parse in the English language; the grammar writer added extra English phrases for each rule. The test suite for each rule was translated into Kikamba language phrases or lexicon (gold standard). 
Each rule was implemented so that its linearization output matched the gold standard; otherwise the function was refined and the regression tests were re-run until a match was obtained. Whenever a module changed, the tests were re-run to ensure that no new noise was introduced. This is the standard testing procedure for GF grammars [<xref ref-type="bibr" rid="scirp.95871-ref25">25</xref>] and is also illustrated in <xref ref-type="fig" rid="fig">Figure </xref>2.</p><table-wrap id="table5" ><label><xref ref-type="table" rid="table5"><xref ref-type="table" rid="table">Table </xref>5</xref></label><caption><title> Grammar coverage</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  colspan="2"  >Coverage</th></tr></thead><tr><td align="center" valign="middle" >Sentence</td><td align="center" valign="middle" >Declarative, Questions</td></tr><tr><td align="center" valign="middle" >Tense</td><td align="center" valign="middle" >Present, Future, Past and Conditional</td></tr><tr><td align="center" valign="middle" >Verb</td><td align="center" valign="middle" >One-Place, Two-Place, Verb Phrase</td></tr><tr><td align="center" valign="middle" >Determiners</td><td align="center" valign="middle" >Quantifiers, Numbers and Possessive Pronoun</td></tr><tr><td align="center" valign="middle" >Noun</td><td align="center" valign="middle" >One Place Two-Place, Three Place Complex Noun</td></tr><tr><td align="center" valign="middle" >Adjective</td><td align="center" valign="middle" >Positive, Comparative and Complex</td></tr><tr><td align="center" valign="middle" >Noun Phrase</td><td align="center" valign="middle" >Personal Pronoun and NP Phrase</td></tr><tr><td align="center" valign="middle" >Adverb</td><td align="center" valign="middle" >Modifying Verbs, Numbers and Adjective</td></tr><tr><td align="center" valign="middle" >Others</td><td align="center" valign="middle" >Prepositional and Conjunction</td></tr></tbody></table></table-wrap><p>In our 
evaluation, a 100-sentence test suite was developed from three sources: a linguist who was provided with the 500 lexical entries of different categories in GF so as to generate sentences, the GF online treebanks<sup>3</sup> and Khegai’s [<xref ref-type="bibr" rid="scirp.95871-ref33">33</xref>] work on Russian. The test suite was translated by a Kikamba language expert into the Kikamba test suite (the gold standard). Using the GF Kikamba grammar, the test suite was linearized into the Kikamba language. Where a sentence produced more than one linearization because of lexical variants or synonyms, the one that best fitted the gold standard was taken. The gold standard and the linearization output were compared using the online Tilde<sup>4</sup> machine translation platform and the error rate Perl scripts in order to extract the metrics Bilingual Evaluation Understudy (BLEU), Word Error Rate (WER) and Position Independent Error Rate (PER), which are commonly used for evaluating machine translation [<xref ref-type="bibr" rid="scirp.95871-ref34">34</xref>]. BLEU [<xref ref-type="bibr" rid="scirp.95871-ref35">35</xref>], which ranges from 0 to 1 or is expressed as a percentage, has demonstrated good correlation with human judgment of machine translation, while PER and WER, based on Levenshtein distance [<xref ref-type="bibr" rid="scirp.95871-ref36">36</xref>], were well suited to investigating the errors in the Kikamba language, since it has many nasal insertions, deletions and substitutions. The results were: cumulative 4-gram BLEU of 83.05%, WER of 12.82% and PER of 10.96%.</p><p>We shall demonstrate how coverage of morphology and syntax using the dominant topology was accomplished in four levels. 
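</p><p>To make the WER and PER figures reported above easier to interpret, the following sketch computes both metrics for a single sentence pair. It is a minimal Python illustration and not the Perl scripts used in the evaluation: the function names are hypothetical, PER follows one common formulation (unmatched words, ignoring position, over the reference length), and the tokens in the usage note are placeholders rather than real Kikamba words. A non-empty reference is assumed.</p>
<preformat>
```python
from collections import Counter


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1] / len(ref)


def position_independent_error_rate(reference, hypothesis):
    """Share of words left unmatched when word order is ignored."""
    ref, hyp = reference.split(), hypothesis.split()
    hyp_counts = Counter(hyp)
    # Bag-of-words overlap between reference and hypothesis.
    matches = sum(min(count, hyp_counts[word])
                  for word, count in Counter(ref).items())
    return (max(len(ref), len(hyp)) - matches) / len(ref)
```
</preformat>
<p>For the placeholder pair “noun adj1 adj2 verb” (reference) and “noun adj2 adj1 verb” (output), word_error_rate gives 0.5 while position_independent_error_rate gives 0.0, which shows why a pure reordering is penalized by WER but not by PER.</p><p>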
The Graphviz<sup>5</sup> software will be used to provide the Kikamba parse tree and word alignment after parsing the English equivalent.</p><p>&#183; Normal sentence with simple SVO topology</p><p>&#183; A sentence with a complex Noun Phrase</p><p>&#183; Prepositional usage</p><p>&#183; Normal questions and Wh-questions</p><p><xref ref-type="fig" rid="fig">Figure </xref>3 represents the sentence “these bad men will cut many trees” in the Kikamba language. The verb “cut” is a two-place verb and hence takes an object; the sentence is in the future tense with positive polarity and simultaneous anteriority. The sentence S is created from the clause Cl, which consists of an NP and a VP. The VP, in turn, is made of a VPslash and an NP. Therefore, the sentence is indirectly made of NP VPslash NP, which correspond to the S, V and O of the SVO structure respectively. <xref ref-type="table" rid="table6"><xref ref-type="table" rid="table">Table </xref>6</xref> shows the morphology of individual categories.</p><p><xref ref-type="fig" rid="fig">Figure </xref>4(a) and <xref ref-type="fig" rid="fig">Figure </xref>4(b) demonstrate a complex noun phrase and a one-place verb in word alignment and parse tree respectively; thus, there is no object in the sentence. The gloss of the sentence is “all your big brothers didn’t sleep”. The NP consists of a noun, a possessive determiner, an adjective and a determiner, and the sentence is in the past tense with negative polarity and simultaneous anteriority. The morphology is discussed in <xref ref-type="table" rid="table7"><xref ref-type="table" rid="table">Table </xref>7</xref>. 
All tense, polarity and anteriority implemented in this grammar have been exemplified in <xref ref-type="table" rid="table">Table </xref>A1 at the appendix using the verb “sleep”.</p><p><xref ref-type="fig" rid="fig">Figure </xref>5(a) demonstrates the use of the auxiliary verb “to be”, the preposition</p><table-wrap id="table6" ><label><xref ref-type="table" rid="table6"><xref ref-type="table" rid="table">Table </xref>6</xref></label><caption><title> Words morphology</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Word</th><th align="center" valign="middle" >Category</th><th align="center" valign="middle" >Explanation</th></tr></thead><tr><td align="center" valign="middle" >Andu aume</td><td align="center" valign="middle" >Compound Noun</td><td align="center" valign="middle" >a class gender G1 number Pl prefix ndu-root a class gender G1 number Pl prefix uume-root</td></tr><tr><td align="center" valign="middle" >Aa</td><td align="center" valign="middle" >Quantifiers</td><td align="center" valign="middle" >class gender G1 dependent string</td></tr><tr><td align="center" valign="middle" >Athuku</td><td align="center" valign="middle" >Adjectives</td><td align="center" valign="middle" >a G1 concord prefix thuku Adj root</td></tr><tr><td align="center" valign="middle" >Ma</td><td align="center" valign="middle" >VP</td><td align="center" valign="middle" >Subject marker for class gender G1 and person 3</td></tr><tr><td align="center" valign="middle" >Ka</td><td align="center" valign="middle" >VP</td><td align="center" valign="middle" >Future tense morpheme in simultaneous</td></tr><tr><td align="center" valign="middle" >Tema</td><td align="center" valign="middle" >V2</td><td align="center" valign="middle" >Two place verb (with argument)</td></tr><tr><td align="center" valign="middle" >Miti</td><td align="center" valign="middle" >N</td><td align="center" valign="middle" >mi class gender G2 number Pl prefix ti-root</td></tr><tr><td 
align="center" valign="middle" >Miingi</td><td align="center" valign="middle" >Determiner</td><td align="center" valign="middle" >mi G1 concord prefix ingi Det root</td></tr></tbody></table></table-wrap><table-wrap id="table7" ><label><xref ref-type="table" rid="table7"><xref ref-type="table" rid="table">Table </xref>7</xref></label><caption><title> Words morphology</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Word</th><th align="center" valign="middle" >Category</th><th align="center" valign="middle" >Explanation</th></tr></thead><tr><td align="center" valign="middle" >Ana Inya</td><td align="center" valign="middle" >N2</td><td align="center" valign="middle" >Prefix a class gender G1 number Pl root ndu String to the noun</td></tr><tr><td align="center" valign="middle" >menyu</td><td align="center" valign="middle" >Possessive Det</td><td align="center" valign="middle" >class gender G1 dependent string</td></tr><tr><td align="center" valign="middle" >Anene</td><td align="center" valign="middle" >Adjective</td><td align="center" valign="middle" >a G1 concord prefix and the adjective root is nene</td></tr><tr><td align="center" valign="middle" >Onthe</td><td align="center" valign="middle" >Determiner</td><td align="center" valign="middle" >class gender G1 dependent string</td></tr><tr><td align="center" valign="middle" >Ma</td><td align="center" valign="middle" >VP</td><td align="center" valign="middle" >Subject marker for class gender G1 and person 3</td></tr><tr><td align="center" valign="middle" >Ti</td><td align="center" valign="middle" >VP</td><td align="center" valign="middle" >past tense morpheme in simultaneous</td></tr><tr><td align="center" valign="middle" >nee</td><td align="center" valign="middle" ></td><td align="center" valign="middle" >Infix</td></tr><tr><td align="center" valign="middle" >koma</td><td align="center" valign="middle" >V</td><td align="center" valign="middle" >mi class gender G2 number Pl prefix 
ti-root</td></tr></tbody></table></table-wrap><p>“on”, which is fused with the noun “table” to become “mesani”, and the preposition “of”, which is translated as “kya” based on the class gender G4 of the pen. The gloss of the utterance used is “the pen of John was on the table”. <xref ref-type="fig" rid="fig">Figure </xref>5(b) shows the word alignment between the English and Kikamba languages for the same utterance.</p><p>In the Kikamba language, tone is used to mark a question; hence, the constituents of the declarative sentence are not rearranged. <xref ref-type="fig" rid="fig">Figure </xref>6(a) demonstrates the coverage of the Wh-question “which trees did the wind push?” while <xref ref-type="fig" rid="fig">Figure </xref>6(b) shows the word alignment of the Wh-question in English and Kikamba. <xref ref-type="fig" rid="fig">Figure </xref>7(a) shows the yes-no question “did the students play the song” and its word alignment is demonstrated in <xref ref-type="fig" rid="fig">Figure </xref>7(b).</p><p>The Kikamba grammar is part of, and the initial stage in, creating a shared grammar for Kenyan Bantu languages through bootstrapping strategies, mainly grammar sharing and grammar porting. In order to maintain standard regression testing for any new Bantu language that will be added via bootstrapping, we parsed the hundred English sentences in order to create a treebank test suite. <xref ref-type="table" rid="table5"><xref ref-type="table" rid="table">Table </xref>5</xref> represents the categories covered in the treebanks. Below is an example of a tree which will linearize into “andu aume miongo ili athuku vyu nimananyw’ie nzovi” in the Kikamba language, with a gloss of “the twenty very bad men drank beer” in the English language. The tree starts at the phrase level (PhrUtt) with no conjunction and no vocative, taking a sentence utterance (UttS). The clause (UseCl) is in the past tense and has positive polarity and simultaneous anteriority. 
The function DetCN creates the noun phrase while ComplSlash creates the verb phrase of the two-place verb drink, with the function MassNP creating the complement of the VP as a noun phrase, as shown below.</p><p>PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul) PPos (PredVP (DetCN (DetQuant DefArt (NumCard (NumNumeral (num (pot2as3 (pot1as2 (pot1 n2))))))) (AdjCN (AdAP very_AdA (PositA bad_A)) (UseN man_N))) (ComplSlash (SlashV2a drink_V2) (MassNP (UseN beer_N)))))) NoVoc</p><p>The treebanks created had 2854 functions in total, with the largest tree having 62 functions and the shortest 11 functions. The largest tree was made of two sentences which had complex verb phrases and noun phrases.</p></sec><sec id="s6"><title>6. Discussion</title><p>The statistical machine translation (SMT) work on Dholuo-English and Swahili-Dholuo [<xref ref-type="bibr" rid="scirp.95871-ref37">37</xref>] gave low BLEU scores of 0.29 and 0.15, which the author attributed to a lack of bilingual corpora. Given that the corpus was divided into ten portions, with nine used for training and one for testing, a high BLEU score would have been expected. This suggests that a rule-based system can achieve higher performance for under-resourced languages. The SAWA corpus English to Swahili statistical machine translation [<xref ref-type="bibr" rid="scirp.95871-ref38">38</xref>] resulted in a BLEU score of 35, which is still low. Weku [<xref ref-type="bibr" rid="scirp.95871-ref39">39</xref>] reports a BLEU score of 32.6 on English-Swahili SMT based on Bayesian inference. We could not find an evaluation of a rule-based system using the above metrics to compare with, and in particular none for a Bantu language. 
Therefore, this work is a clear indication of how using a rule-based system can help to produce highly accurate systems for these under-resourced languages.</p><p>The error analysis was done sentence by sentence and <xref ref-type="fig" rid="fig">Figure </xref>8 summarizes the issues which contributed to the noise. Firstly, in the Kikamba language pronouns were usually dropped (pro-drop) since they are represented in the subject marker of the verb, but in some cases they were not dropped. Secondly, some prepositions were fused with the noun but also had strings. Verbs contributed the most significant percentage of the errors, due to morphophonological issues resulting from nasal deletion and insertion, which are present in the Kikamba language [<xref ref-type="bibr" rid="scirp.95871-ref40">40</xref>].</p><p>When a sentence had two adjectives, their order was changed in the translation, which was heavily penalized by WER and BLEU; hence the use of PER, which allows word reordering, and under which the error reduced to 10.96% from the 12.82% of WER.</p></sec><sec id="s7"><title>7. Conclusions</title><p>Through this paper, we have formalized the grammar of the Kikamba language through a high-precision rule-based approach in the interlingual GF environment. The evaluation results are encouraging: a 4-gram BLEU of 83.05%, a WER of 12.82% and a PER of 10.96%. Our contributions are as follows. Firstly, we have provided NLP tools, namely a morphological analyzer and a machine translator, for the under-resourced Kikamba language by extending the GF library, which is a step towards BLARK. Secondly, the wide coverage of the Kikamba computational grammar provides a platform for building multilingual technological applications and for generating the scarce bilingual corpora, paired with other languages present in GF, for experimenting with data-driven methods. 
Finally, we have also created a treebank that can be used to evaluate grammars for other Bantu languages.</p><p>Future work will address the morphophonological rules of verbs, extend the lexicon so as to handle text and finally include questions as part of the grammar.</p></sec><sec id="s8"><title>Acknowledgements</title><p>We would like to acknowledge the contributions made by the following people in terms of Kikamba translation, Kikamba grammar structure and GF expertise: Prof. Kyalo Wamitila, Prof. Angelina Kioko, Dr. Hans Lei&#223;, Dr. Otiso Wambua, Obed Mutiso, Joe Kyalo, Christopher Kithuka, Immaculate Wanza and Rama Munara.</p></sec><sec id="s9"><title>Conflicts of Interest</title><p>The authors do not have any conflict of interest.</p></sec><sec id="s10"><title>Cite this paper</title><p>Kituku, B., Nganga, W. and Muchemi, L. (2019) Towards Kikamba Computational Grammar. Journal of Data Analysis and Information Processing, 7, 250-275. https://doi.org/10.4236/jdaip.2019.74015</p></sec><sec id="s11"><title>Appendix A</title><table-wrap id="table8" ><label><xref ref-type="table" rid="table">Table </xref>A1</label><caption><title> Examples of tense, negation and anteriority</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Form</th><th align="center" valign="middle" >Kikamba</th><th align="center" valign="middle" >English</th></tr></thead><tr><td align="center" valign="middle" >TPresASimulPPos TPresASimulPNeg TPastASimulPPos TPastASimulPNeg TFutASimulPPos TFutASimulPNeg TCondASimulPPos TCondASimulPNeg TPresAAnterPPos TPresAAnterPNeg TPastAAnterPPos TPastAAnterPNeg TFutAAnterPPos TFutAAnterPNeg TCondAAnterPPos TCondAAnterPNeg</td><td align="center" valign="middle" >Nimakomaa we ndakomaa nimanakomie inyui mutineekoma ithyi tukakoma we ndukakoma makeethiwa makomie maikeethiwa makoma ithyi nitwakoma ithyi tuinakoma we niwakomete we ndwakomete nyie ngeethiwa ninakoma makeethiwa matanakoma we niwesaa kukoma we ndesaa kukoma</td><td align="center" 
valign="middle" >they sleep he doesn’t sleep they slept you didn’t sleep we will sleep you won’t sleep they would sleep they wouldn’t sleep we have slept we haven’t slept he had slept you hadn’t slept they will have slept they won’t have slept she would have slept she wouldn’t have slept</td></tr></tbody></table></table-wrap></sec><sec id="s12"><title>NOTES</title></sec></body><back><ref-list><title>References</title><ref id="scirp.95871-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Guthrie, M. (1948) The Classification of the Bantu Languages. The International African Institute by the Oxford University Press, Oxford.</mixed-citation></ref><ref id="scirp.95871-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Kaviti, L.K. (2004) A Minimalist Perspective of The Principles and Parameters in Kikamba Morpho-Syntax. Doctoral Dissertation, University of Nairobi, Kenya.</mixed-citation></ref><ref id="scirp.95871-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Mbuvi, M.K. (2005) The Syntax of Kikamba Noun Modification. Unpublished Master’s Dissertation, University of Nairobi, Kenya.</mixed-citation></ref><ref id="scirp.95871-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Welmers, W.E. (1973) African Language Structures. University of California Press, Oakland, CA.</mixed-citation></ref><ref id="scirp.95871-ref5"><label>5</label><mixed-citation publication-type="book" xlink:type="simple">Kioko, A.N., Njoroge, M.C. and Kuria, P.M. (2012) Harmonizing the Orthography of Gikuyu and Kikamba. In: Iribemwangi, P., Ogechi, O.N. 
and Odour, N., Eds., Book Harmonization and Standardization of Kenyan Languages, Orthography and Other Aspects, The Centre for Advanced Studies of African Society, Cape Town, South Africa, 39-63.</mixed-citation></ref><ref id="scirp.95871-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Kioko, A.N. (1995) The Kikamba Multiple Applicative: A Problem for the Lexical Functional Grammar Analysis. South African Journal of African Languages, 15, 210-216. https://doi.org/10.1080/02572117.1995.10587081</mixed-citation></ref><ref id="scirp.95871-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Munyao, K.M. (2006) The Morph Syntax of Kikamba Verb Derivations: A Minimalist Approach. The University of Nairobi, Nairobi, Kenya.</mixed-citation></ref><ref id="scirp.95871-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Roberts-Kohno, R.R. (2000) Kikamba Phonology and Morphology. Doctoral Dissertation, Ohio State University, Columbus, OH.</mixed-citation></ref><ref id="scirp.95871-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Mutiga, J.M. (2002) The Tone System of Kikamba: A Case Study of Mwingi Dialect. Doctoral Dissertation, University of Nairobi, Kenya.</mixed-citation></ref><ref id="scirp.95871-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Kituku, B., Musumba, G. and Wagacha, P. (2015) Kamba Part of Speech Tagger Using Memory-Based Approach. International Journal on Natural Language Computing, 4, 43-53. https://doi.org/10.5121/ijnlc.2015.4204</mixed-citation></ref><ref id="scirp.95871-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Kituku, B., Wagacha, P. and De Pauw, G. (2011) A Memory-Based Approach to K&amp;#305;kamba Named Entity Recognition. 
Proceedings of the Conference on Human Language Technology for Development, Cairo, Egypt, 106-111.</mixed-citation></ref><ref id="scirp.95871-ref12"><label>12</label><mixed-citation publication-type="book" xlink:type="simple">Ng’ang’a, W. (2012) Building Swahili Resource Grammars for the Grammatical Framework. In: Santos, D., Lindén, K. and Ng’ang’a, W., Eds., Shall We Play the Festschrift Game? Springer, Berlin, Heidelberg, 215-226. https://doi.org/10.1007/978-3-642-30773-7_13</mixed-citation></ref><ref id="scirp.95871-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Pretorius, L., Marais, L. and Berg, A.A. (2017) GF Miniature Resource Grammar for Tswana: Modelling the Proper Verb. Language Resources and Evaluation, 51, 159-189. https://doi.org/10.1007/s10579-016-9341-z</mixed-citation></ref><ref id="scirp.95871-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Krauwer, S. (2003) The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap. Proceedings of SPECOM, 8-15.</mixed-citation></ref><ref id="scirp.95871-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Hyman, L.M. (1979) Phonology and Noun Structure. Aghem Grammatical Structure, 1-72.</mixed-citation></ref><ref id="scirp.95871-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Kihm, A. (2002) What’s in a Noun: Noun Classes, Gender and Nounness. Ms. Université Paris, Paris.</mixed-citation></ref><ref id="scirp.95871-ref17"><label>17</label><mixed-citation publication-type="book" xlink:type="simple">Demuth, K. (2000) Bantu Noun Class Systems: Loan Word and Acquisition Evidence of Semantic Productivity. 
In: Senft, G., Ed., Systems of Nominal Classification, Cambridge University Press, Cambridge, 270-292.</mixed-citation></ref><ref id="scirp.95871-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Ibrahim, M.H. (2014) Grammatical Gender: Its Origin and Development. Walter de Gruyter, 160.</mixed-citation></ref><ref id="scirp.95871-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Reichenbach, H. (1947) The Tenses of Verbs. Time: From Concept to Narrative Construct: A Reader.</mixed-citation></ref><ref id="scirp.95871-ref20"><label>20</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Rugemalira</surname><given-names> J.M. </given-names></name> (<year>2007</year>)<article-title>The Structure of the Bantu Noun Phrase</article-title><source> SOAS Working Papers in Linguistics</source><volume> 15</volume>,<fpage> 135</fpage>-<lpage>148</lpage>.<pub-id pub-id-type="doi"></pub-id></mixed-citation></ref><ref id="scirp.95871-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">Kituku, B., Lawrence, M. and Wanjiku, N. (2016) A Review on Machine Translation Approaches. Indonesian Journal of Electrical Engineering and Computer Science, 1, 182-190.</mixed-citation></ref><ref id="scirp.95871-ref22"><label>22</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Curry</surname><given-names> H.B. </given-names></name> 
(<year>1961</year>)<article-title>Some Logical Aspects of Grammatical Structure</article-title><source> Structure of Language and Its Mathematical Aspects</source><volume> 12</volume>,<fpage> 56</fpage>-<lpage>68</lpage>.<pub-id pub-id-type="doi"></pub-id></mixed-citation></ref><ref id="scirp.95871-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">Ranta, A. (2009) GF: A Multilingual Grammar Formalism. Language and Linguistics Compass, 3, 1242-1265. https://doi.org/10.1111/j.1749-818X.2009.00155.x</mixed-citation></ref><ref id="scirp.95871-ref24"><label>24</label><mixed-citation publication-type="other" xlink:type="simple">Paikens, P. and Gruzitis, N. (2012) An Implementation of a Latvian Resource Grammar in Grammatical Framework. LREC 2012, 1680-1685.</mixed-citation></ref><ref id="scirp.95871-ref25"><label>25</label><mixed-citation publication-type="other" xlink:type="simple">Ranta, A. (2011) Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications, Center for the Study of Language and Information, Stanford, CA.</mixed-citation></ref><ref id="scirp.95871-ref26"><label>26</label><mixed-citation publication-type="other" xlink:type="simple">Ljungl&#246;f, P. (2004) Expressivity and Complexity of the Grammatical Framework. Doctoral Dissertation, Chalmers University, Sweden.</mixed-citation></ref><ref id="scirp.95871-ref27"><label>27</label><mixed-citation publication-type="other" xlink:type="simple">Ranta, A. (2006) Type Theory and Universal Grammar. Philosophia Scienti&#230;. Travaux d’histoire et de philosophie des sciences, 115-131. https://doi.org/10.4000/philosophiascientiae.415</mixed-citation></ref><ref id="scirp.95871-ref28"><label>28</label><mixed-citation publication-type="book" xlink:type="simple">Chelliah, S.L. (2001) The Role of Text Collection and Elicitation in Linguistic Fieldwork. In: Newman, P. 
and Ratliff, M., Eds., Linguistic Fieldwork, Cambridge University Press, Cambridge, 152-165. https://doi.org/10.1017/CBO9780511810206.008</mixed-citation></ref><ref id="scirp.95871-ref29"><label>29</label><mixed-citation publication-type="other" xlink:type="simple">Ngau, P. and Kumssa, L. (2004) Research Design, Data Collection and Analysis. Training Manual No. 12. United Nations Centre for Regional Development, Africa Office.</mixed-citation></ref><ref id="scirp.95871-ref30"><label>30</label><mixed-citation publication-type="other" xlink:type="simple">Carr, M. and Verner, J. (1997) Prototyping and Software Development Approaches. Department of Information Systems, City University of Hong Kong, Hong Kong, 319-338.</mixed-citation></ref><ref id="scirp.95871-ref31"><label>31</label><mixed-citation publication-type="other" xlink:type="simple">Br&#246;ker, N. (2000) The Use of Instrumentation in Grammar Engineering. Proceedings of the 18th Conference on Computational Linguistics, 1, 118-124. https://doi.org/10.3115/990820.990838</mixed-citation></ref><ref id="scirp.95871-ref32"><label>32</label><mixed-citation publication-type="book" xlink:type="simple">Butt, M. and King, T.H. (2003) Grammar Writing, Testing and Evaluation. In: Farghaly, A., Ed., Handbook for Language Engineers, Stanford, CA, 129-180.</mixed-citation></ref><ref id="scirp.95871-ref33"><label>33</label><mixed-citation publication-type="other" xlink:type="simple">Khegai, J. and Ranta, A. (2004) Building and Using a Russian Resource Grammar in GF. In: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, Berlin, Heidelberg, 38-41. https://doi.org/10.1007/978-3-540-24630-5_4</mixed-citation></ref><ref id="scirp.95871-ref34"><label>34</label><mixed-citation publication-type="other" xlink:type="simple">Vilar, D., Xu, J., D’Haro, L.F. and Ney, H. (2006) Error Analysis of Statistical Machine Translation Output. 
LREC 2006, Genoa, Italy, 697-702.</mixed-citation></ref><ref id="scirp.95871-ref35"><label>35</label><mixed-citation publication-type="other" xlink:type="simple">Koehn, P. (2004) Statistical Significance Tests for Machine Translation Evaluation. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, July 2004, 388-395.</mixed-citation></ref><ref id="scirp.95871-ref36"><label>36</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Levenshtein</surname><given-names> V.I. </given-names></name> (<year>1966</year>)<article-title>Binary Codes Capable of Correcting Deletions, Insertions and Reversals</article-title><source> Soviet Physics Doklady</source><volume> 10</volume>,<fpage> 707</fpage>-<lpage>710</lpage>.<pub-id pub-id-type="doi"></pub-id></mixed-citation></ref><ref id="scirp.95871-ref37"><label>37</label><mixed-citation publication-type="other" xlink:type="simple">De Pauw, G., Maajabu, N. and Wagacha, P.W. (2010) A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging. In: Proceedings of the Second Workshop on African Language Technology (AfLaT 2010), European Language Resources Association (ELRA), Valletta, Malta, 15-20.</mixed-citation></ref><ref id="scirp.95871-ref38"><label>38</label><mixed-citation publication-type="other" xlink:type="simple">De Pauw, G., Wagacha, P.W. and de Schryver, G.M. (2011) Towards English-Swahili Machine Translation. Research Workshop of the Israel Science Foundation.</mixed-citation></ref><ref id="scirp.95871-ref39"><label>39</label><mixed-citation publication-type="other" xlink:type="simple">Weku, O.V. (2014) Use of Bayesian Model for Word Alignment in Swahili-English Statistical Machine Translation. 
Master’s Dissertation, University of Nairobi, Kenya.</mixed-citation></ref><ref id="scirp.95871-ref40"><label>40</label><mixed-citation publication-type="other" xlink:type="simple">Kioko, A. (1999) The Verb ‘Be’ in Kikamba: Issues in Identifying the Form. Chemchemi International Journal of Arts and Social Sciences, 94-105.</mixed-citation></ref></ref-list></back></article>