<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JDAIP</journal-id><journal-title-group><journal-title>Journal of Data Analysis and Information Processing</journal-title></journal-title-group><issn pub-type="epub">2327-7211</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jdaip.2015.34014</article-id><article-id pub-id-type="publisher-id">JDAIP-61283</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject><subject> Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  Identifying Semantic in High-Dimensional Web Data Using Latent Semantic Manifold
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>jit</surname><given-names>Kumar</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Sanjeev</surname><given-names>Maskara</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>I-Jen</surname><given-names>Chiang</given-names></name><xref ref-type="aff" rid="aff3"><sup>3</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff3"><addr-line>School of Management, Taipei Medical University, Taiwan</addr-line></aff><aff id="aff2"><addr-line>The Practice PLC, Buckinghamshire, UK</addr-line></aff><aff id="aff1"><addr-line>Goa Institute of Management, Goa, India</addr-line></aff><author-notes><corresp id="cor1">* E-mail:<email>ajitmaskara@gmail.com(JK)</email>;<email>sanjeevmaskara@gmail.com(SM)</email>;<email>ijchiang@ntu.edu.tw(IC)</email>;</corresp></author-notes><pub-date pub-type="epub"><day>02</day><month>11</month><year>2015</year></pub-date><volume>03</volume><issue>04</issue><fpage>136</fpage><lpage>152</lpage><history><date date-type="received"><day>22</day>	<month>September</month>	<year>2015</year></date><date date-type="rev-recd"><day>accepted</day>	<month>16</month>	<year>November</year>	</date><date date-type="accepted"><day>19</day>	<month>November</month>	<year>2015</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. 
</copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  Latent Semantic Analysis involves natural language processing techniques for analyzing relationships between a set of documents and the terms they contain, by producing a set of concepts (related to the documents and terms) called semantic topics. These semantic topics assist search engine users by providing leads to the more relevant documents. We develop a novel algorithm called Latent Semantic Manifold (LSM) that can identify the semantic topics in high-dimensional web data. The LSM algorithm is built upon the concepts of topology and probability. A search tool was also developed using the LSM algorithm. This search tool was deployed for two years at two sites in Taiwan: 1) Taipei Medical University Library, Taipei, and 2) Biomedical Engineering Laboratory, Institute of Biomedical Engineering, National Taiwan University, Taipei. We evaluate the effectiveness and efficiency of the LSM algorithm by comparing it with other contemporary algorithms. The results show that the LSM algorithm outperforms the others. This algorithm can be used to enhance the functionality of currently available search engines.
 
</p></abstract><kwd-group><kwd>Latent Semantic Manifold</kwd><kwd> Conditional Random Field</kwd><kwd> Hidden Markov Model</kwd><kwd> Graph-Based Tree-Width Decomposition</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>In the traditional approach to data gathering, we collect data on a few well-chosen variables, and then manually perform various tasks, such as finding relevant information, analyzing them, making decisions, and so on [<xref ref-type="bibr" rid="scirp.61283-ref1">1</xref>] . However, in this high-tech era, the high volumes of data are generated with high velocity from a variety of re- sources (also known as 3 V―Volume, Velocity, and Variety) [<xref ref-type="bibr" rid="scirp.61283-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref3">3</xref>] . The modern Information and Communica- tion Technology (ICT) infrastructure, the advent of cloud computing, the cheaper availability of storage device, and the low cost computing power have made people capable of recording and storing an enormous amount of data [<xref ref-type="bibr" rid="scirp.61283-ref4">4</xref>] . As a result, gigantic repositories that include data, texts, and media have rapidly grown during recent years [<xref ref-type="bibr" rid="scirp.61283-ref5">5</xref>] - [<xref ref-type="bibr" rid="scirp.61283-ref9">9</xref>] . Nowadays, we create as much information in every two days as we have done since the dawn of civilization [<xref ref-type="bibr" rid="scirp.61283-ref10">10</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref11">11</xref>] . 
Several huge repositories are freely available for public use on the World Wide Web, which causes another problem: the relevant information is buried among the irrelevant.</p><p>To combat the problem of losing relevant information in the overwhelming amount of data, a number of search engines have proliferated recently, which can aid users in searching for content that is relevant to them [<xref ref-type="bibr" rid="scirp.61283-ref8">8</xref>] . As web pages are heterogeneous and of varying quality, they put limitations on search technologies [<xref ref-type="bibr" rid="scirp.61283-ref12">12</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref13">13</xref>] . Moreover, the relationships among words (polysemy, synonymy, and homophony) and sentences (paraphrase, entailment, and contradiction), and ambiguities (lexical and structural) diminish the search engines’ power [<xref ref-type="bibr" rid="scirp.61283-ref14">14</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref15">15</xref>] . Hence, search engines often return inconsistent, uninteresting, and unorganized results [<xref ref-type="bibr" rid="scirp.61283-ref9">9</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref12">12</xref>] . Web users have to devote substantial time and effort to pick out meaningful items from the results returned by the search engines [<xref ref-type="bibr" rid="scirp.61283-ref9">9</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref16">16</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref17">17</xref>] . To give web users better access to relevant information, it is essential for search engines to deal with ambiguities and imprecision [<xref ref-type="bibr" rid="scirp.61283-ref18">18</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref19">19</xref>] . 
Search engines therefore need to be enhanced so that they not only generate results for the web users’ queried terms, but also filter and separate meaningful items from the irrelevant ones [<xref ref-type="bibr" rid="scirp.61283-ref20">20</xref>] - [<xref ref-type="bibr" rid="scirp.61283-ref22">22</xref>] .</p><p>Many effective search engines, such as MedEvi, EBIMed, MEDIE, PubNet, GoPubMed, Argo, and Vivisimo, have provided capabilities to fit search results to the users’ intent. These search engines can discover latent semantics (relationships between a set of documents and the terms they contain) in the documents returned by a search and classify these documents into homogeneous semantic clusters [<xref ref-type="bibr" rid="scirp.61283-ref23">23</xref>] - [<xref ref-type="bibr" rid="scirp.61283-ref33">33</xref>] . In these search engines, each semantic cluster is considered a topic, which summarizes the returned documents. The users can then explore the topics that are relevant to their intent. For example, in these search engines, the query term APC (Adenomatous Polyposis Coli) can yield abstracts of the relevant PubMed articles. In this case, the results will consist not only of abstracts about Adenomatous Polyposis Coli, but also of others such as Antigen Presenting Cells (APC), Anaphase Promoting Complex (APC), and Activated Protein C (APC). The users need to find the articles that are relevant to their intent (here Adenomatous Polyposis Coli) by going through the abstracts returned by the search. In summary, rather than providing a huge number of web links related to the queried terms, search engines need to generate results relevant to the users’ intent.</p><p>In the past, many algorithms/techniques have been deployed to develop semantic search engines as described in the previous paragraph [<xref ref-type="bibr" rid="scirp.61283-ref25">25</xref>] . 
For instance, deterministic search techniques have provided metadata-enhanced search facility, where a user pre-selects different facets to generate more relevant search results [<xref ref-type="bibr" rid="scirp.61283-ref18">18</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref19">19</xref>] . However, scaling metadata-enhanced search facility to the web is difficult and requires many experts to define controlled-vocabulary in order to create unique labels for the concepts having the same terminology [<xref ref-type="bibr" rid="scirp.61283-ref34">34</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref35">35</xref>] . Luhn pointed out that the frequency of terms and their relative positions within a sentence in a document can be used to compute a relative measure of significance, first for the individual words and then for the sentences [<xref ref-type="bibr" rid="scirp.61283-ref36">36</xref>] . Word usage in a document collection tends to follow Zipf’s distribution, in which a few words are used very frequently, but the vast majority only rarely [<xref ref-type="bibr" rid="scirp.61283-ref37">37</xref>] . Therefore, Salton and McGill addressed the <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x6.png" xlink:type="simple"/></inline-formula> scheme, which is a measure of each basic element (term) in a document collection to reveal the significance of elements within the collection [<xref ref-type="bibr" rid="scirp.61283-ref38">38</xref>] . For each document in the collection, the <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x7.png" xlink:type="simple"/></inline-formula> value of each term is determined by the term frequency, that is, the number of occurrences of each term in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. 
We may view each document as a vector with one component corresponding to each term together with a weight for each component. Thus, the <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x8.png" xlink:type="simple"/></inline-formula> scheme can reduce documents of arbitrary length to fixed- length lists of numbers. The <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x9.png" xlink:type="simple"/></inline-formula> weighting schemes are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query. In addition, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x10.png" xlink:type="simple"/></inline-formula>can be successfully used for stop-words filtering in various subject fields including text summarization and classification. No doubt, the revolutionary change was realized in the information retrieval field with the introduction of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x11.png" xlink:type="simple"/></inline-formula> scheme and its variants. However, in the <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x12.png" xlink:type="simple"/></inline-formula> scheme, the document collection is presented as a document-by-term matrix, which is usually enormously high-dimensional and sparse [<xref ref-type="bibr" rid="scirp.61283-ref38">38</xref>] - [<xref ref-type="bibr" rid="scirp.61283-ref40">40</xref>] . Often, for a single document, there are more than thousands of terms in the matrix, and most of the entries are zero. 
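</p><p>The term weighting described above, in which a term's frequency in a document is offset by its frequency in the corpus (commonly written tf-idf), can be illustrated with a minimal sketch. The exact variant is an assumption: raw term counts multiplied by the logarithm of the inverse document frequency. The function name and the three toy documents are hypothetical. The sketch also shows why the resulting document-by-term representation is sparse: a term that occurs in every document receives a weight of zero.</p><preformat><![CDATA[
```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: raw term count times log(N / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, for illustration only.
docs = [
    "the jaguar is a big cat".split(),
    "the jaguar is a fast car".split(),
    "the cat sat on the mat".split(),
]
for row in tf_idf(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```
]]></preformat><p>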
The <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x13.png" xlink:type="simple"/></inline-formula> scheme can bring down some terms; yet, it provides a relatively small amount of reduction, which is not enough to reveal the statistical measures within a document or between documents.</p><p>In the last decades, some other dimension reduction techniques, such as Latent Semantic Indexing, Probabilistic Latent Semantic Indexing, and Latent Dirichlet Allocation models have been proposed to overcome the shortcomings of earlier search engines. But, all these are based on bag-of-words models. The bag-of-words models follow Aldous and de Finetti theorem of exchangeability where the order of terms in a document or order of documents in a corpus can be neglected [<xref ref-type="bibr" rid="scirp.61283-ref41">41</xref>] - [<xref ref-type="bibr" rid="scirp.61283-ref43">43</xref>] . As the spatial information conveyed by the terms in the document or documents in the corpus was highly neglected in these approaches, we found a statistical issue attached with these bags-of-words models [<xref ref-type="bibr" rid="scirp.61283-ref42">42</xref>] - [<xref ref-type="bibr" rid="scirp.61283-ref45">45</xref>] . 
In probability theory, random variables (here, the terms) <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x14.png" xlink:type="simple"/></inline-formula> are said to be exchangeable if the joint distribution <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x15.png" xlink:type="simple"/></inline-formula> is invariant under permutation of its arguments, so that <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x16.png" xlink:type="simple"/></inline-formula> whenever <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x17.png" xlink:type="simple"/></inline-formula> is a permutation of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x18.png" xlink:type="simple"/></inline-formula>. However, we argue that terms are exchangeable, and a relationship between them can be established, only if the terms are located in proximity. For instance, consider a document describing products such as laptops, mobile phones, and notepads. The word “apple” can be associated with a company if it appears in proximity to the words laptop, mobile phone, and notepad. However, if the word “apple” appears several words or pages away in the document, the relationship between “laptop, mobile phone or notepad” and “apple” weakens. Therefore, the criterion that the order of terms in a document can be neglected should be modified to state that the order of terms within a relationship in a document can be neglected. Likewise, the criterion that the order of documents in a corpus can be neglected should be modified to state that the order of documents within a relationship in a corpus can be neglected. 
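</p><p>The difference between full exchangeability and exchangeability restricted to terms in proximity can be made concrete with a small sketch. It is only an illustration of the argument, not part of the LSM algorithm: the function names, the window size, and the two token sequences are hypothetical. A bag-of-words count is unchanged by any permutation of the terms, whereas a co-occurrence count restricted to a sliding window ignores order only inside the window and therefore distinguishes “apple” near “laptop” from “apple” far from it.</p><preformat><![CDATA[
```python
from collections import Counter


def bag_of_words(tokens):
    """Order-free representation: term counts only."""
    return Counter(tokens)


def window_pairs(tokens, window):
    """Count unordered term pairs that occur within `window` positions."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + window]:
            if left != right:
                # frozenset makes the pair unordered: order inside the
                # window is neglected, distance beyond it is not.
                pairs[frozenset((left, right))] += 1
    return pairs


# Two hypothetical fragments that are permutations of each other.
near = "apple laptop phone notepad river bridge".split()
far = "apple river bridge laptop phone notepad".split()

# The bags are identical, so a bag-of-words model cannot tell them apart.
print(bag_of_words(near) == bag_of_words(far))

# The windowed counts differ: "apple" is close to "laptop" only in `near`.
key = frozenset(("apple", "laptop"))
print(window_pairs(near, 2)[key], window_pairs(far, 2)[key])
```
]]></preformat><p>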
For instance, a search term “network” would yield different topics if it occurs near a term such as computer, traffic, artificial neural, or biological neural; hence, only the order of in-relationship terms might be neglected [<xref ref-type="bibr" rid="scirp.61283-ref46">46</xref>] .</p><p>The literature review and our arguments show that there is a need to enhance search engines’ capabilities to reveal latent semantics in high-dimensional web data while preserving the relationship and order of term(s) or document(s). We propose a novel algorithm called Latent Semantic Manifold (LSM), which identifies homogeneous groups in web data while preserving the spatial information about terms in a document or documents in the corpus. This paper aims to explain the Latent Semantic Manifold algorithm (from now on, LSM algorithm), its deployment, and performance evaluation.</p></sec><sec id="s2"><title>2. Methods</title><p>This study consists of three key components: proposing and describing the LSM algorithm, its deployment, and evaluation. They are described in the following subsections.</p><sec id="s2_1"><title>2.1. Algorithm</title><p>The proposed LSM algorithm, which identifies the latent semantics in data, is based upon the concepts of probability and topology. <xref ref-type="fig" rid="fig1">Figure 1</xref> and <xref ref-type="table" rid="table1">Table 1</xref> provide a high-level view of the algorithm. The concepts deployed in the LSM algorithm are explained in the following four steps.</p><p>Step 1 (Identifying relevant fragments from the documents generated by the user query): A user enters a query in a search engine, which generates a set of documents. The relevant fragments (paragraphs in the LSM) are identified from the generated documents. 
The identification of the fragments is handled by the “document preprocessor” of the search engine, which typically normalizes the document stream to a predefined format, breaks the document stream into the desired retrievable units, and isolates and metatags subdocument pieces.</p><p>Step 2 (Recognizing named-entities and constructing a heterogeneous manifold): It is crucial to extract significant “terms” from the fragments (identified in Step 1) to construct heterogeneous manifolds. Notably, we can extract various types of terms with a large number of training documents. However, extracting different types of terms and calculating their marginal and conditional probabilities is highly computation-intensive [<xref ref-type="bibr" rid="scirp.61283-ref47">47</xref>] - [<xref ref-type="bibr" rid="scirp.61283-ref51">51</xref>] . Therefore, we restrict ourselves to identifying nouns (words or phrases) or named-entities in the LSM framework. Hidden Markov Models (HMMs) are often used for part-of-speech tagging and sequential labeling [<xref ref-type="bibr" rid="scirp.61283-ref52">52</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref53">53</xref>] . Yet, in the last decade, discriminative linear chain Conditional Random Field (CRF) models have been used for tagging and sequential labeling of features in the corpus because of their advantages over HMMs [<xref ref-type="bibr" rid="scirp.61283-ref54">54</xref>] - [<xref ref-type="bibr" rid="scirp.61283-ref56">56</xref>] . The primary advantage of CRFs over HMMs is their conditional nature. 
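</p><p>For reference, a linear-chain CRF over a label sequence z and an observation sequence x of length T is conventionally written in the log-linear form below. The feature functions f<sub>k</sub>, the weights λ<sub>k</sub>, and the normalizer Z(x) are the generic textbook symbols and are assumed here for illustration; they are not taken from the implementation used in this study.</p><disp-formula><tex-math><![CDATA[
P(z \mid x) = \frac{1}{Z(x)} \exp\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(z_{t-1}, z_t, x, t) \Bigr),
\qquad
Z(x) = \sum_{z'} \exp\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(z'_{t-1}, z'_t, x, t) \Bigr)
]]></tex-math></disp-formula><p>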
A CRF is a simple framework for labeling and segmenting data that</p><fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig1">Figure 1</xref></label><caption><title> Illustration of LSM algorithm</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/4-2870096x19.png"/></fig><p>models a conditional distribution P(z|x) by selecting the label sequence z, a named category, to label a novel observation sequence x with an associated undirected graph structure that obeys the Markov property. When</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> LSM algorithm that construct semantic manifold</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  colspan="3"  >Algorithm</th></tr></thead><tr><td align="center" valign="middle"  colspan="3"  >Require: A collection of returned documents from a search query. Ensure: A collection of semantic manifolds.</td></tr><tr><td align="center" valign="middle" >Step 1</td><td align="center" valign="middle"  colspan="2"  >Perform feature extractions using discriminative linear chain Conditional Random Field method to generate named entities.</td></tr><tr><td align="center" valign="middle" >Step 2</td><td align="center" valign="middle"  colspan="2"  >Construct a manifold from the set of named entities generated from the document collection.</td></tr><tr><td align="center" valign="middle"  rowspan="4"  >Step 3</td><td align="center" valign="middle"  colspan="2"  >Classify the manifold into isomorphic (homogeneous) categories by using the Graph-based Tree-width Decomposition algorithm starting from a fixed dimension local manifold. 
Require: <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x20.png" xlink:type="simple"/></inline-formula>is the vertex set of named entities that each t<sub>i</sub> is associated with its named categories equipped with a weighted probability. Ensure: <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x21.png" xlink:type="simple"/></inline-formula>is the set of isomorphic semantic manifolds. where<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x22.png" xlink:type="simple"/></inline-formula>.</td></tr><tr><td align="center" valign="middle" >Step 3.1</td><td align="center" valign="middle" >Let a semantic topic set:<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x23.png" xlink:type="simple"/></inline-formula>. Let G = (V, E) be the undirected connected graph generated from the returned documents.</td></tr><tr><td align="center" valign="middle" >Step 3.2</td><td align="center" valign="middle" >Given a tree-width d, find a semantic manifold M<sub>j</sub> generated from single named entities for each semantic category z<sub>i</sub> initially in which |M<sub>j</sub><sub>|</sub> = d and the semantic mapping <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x24.png" xlink:type="simple"/></inline-formula> with a probability<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x25.png" xlink:type="simple"/></inline-formula>, and quantity <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x26.png" xlink:type="simple"/></inline-formula></td></tr><tr><td align="center" valign="middle" >Step 3.3</td><td align="center" valign="middle" >Perform graph decompositions on G starting from M<sub>j</sub>.</td></tr></tbody></table></table-wrap><p>conditioned on the observations that are given in a particular observation sequence, the CRF defines a single log-linear distribution over 
the label sequence. The CRF model does not need to represent the dependencies among the input variables x explicitly, which affords the use of rich, global features of the input and relaxes the strong independence assumptions that HMMs require in order to ensure tractable inference. The relationships among these named-entities form a complex structure called a heterogeneous manifold.</p><p>The named-entities are indicated with their marginal probabilities, and the correlations among named-entities are indicated with their conditional probabilities. For example, jaguar is considered a named-entity, and it is assigned to the animal or vehicle type depending on the overall context of the fragment. As illustrated in <xref ref-type="fig" rid="fig2">Figure 2</xref>, Jaguar is a named-entity with three possible types: animal, vehicle, and instrument. It has marginal probabilities, such as P<sub>animal</sub>(Jaguar), P<sub>vehicle</sub>(Jaguar), and P<sub>instrument</sub>(Jaguar). Likewise, it has conditional probabilities, such as P(Jaguar, Car|Vehicle) and P(Jaguar, Motorcycle|Vehicle).</p><p>Step 3 (Decomposing a heterogeneous manifold into homogeneous manifolds): As mentioned in Step 2, the heterogeneous manifold consists of a complex structure of named-entities, including estimates of marginal and conditional probabilities. A collection of fragment vectors lies on the heterogeneous manifold, which contains some local spaces resembling Euclidean spaces of a fixed number of dimensions. Every point of the n-dimensional heterogeneous manifold has a neighborhood homeomorphic to the n-dimensional Euclidean space R<sup>n</sup>. In addition, all the points in the local spaces are strongly connected. 
Because the heterogeneous manifold is overly complex and the semantics are latent in local spaces, we break it into a collection of homogeneous manifolds instead of retaining a single heterogeneous manifold. Topological and geometrical concepts can be used to represent the latent semantics of a heterogeneous manifold as a collection of homogeneous manifolds. A Graph-based Tree-width Decomposition algorithm is used to decompose a heterogeneous manifold into a collection of homogeneous manifolds [<xref ref-type="bibr" rid="scirp.61283-ref57">57</xref>] . As shown in <xref ref-type="fig" rid="fig3">Figure 3</xref>, taking Jaguar as the heterogeneous manifold, we can decompose it into three homogeneous manifolds bounded by dotted lines in three different colors. In the Graph-based Tree-width Decomposition algorithm, we start by randomly selecting a local manifold of fixed dimension to serve as a separator, as shown in <xref ref-type="fig" rid="fig4">Figure 4</xref> [<xref ref-type="bibr" rid="scirp.61283-ref58">58</xref>] . Afterward, the local manifold is decomposed into two local manifolds that are not adjacent. This decomposition is repeated recursively until no further decomposition is possible. 
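</p><p>The recursive use of separators can be sketched on a small graph. The sketch below is a deliberately simplified stand-in and should not be read as the Graph-based Tree-width Decomposition algorithm of [<xref ref-type="bibr" rid="scirp.61283-ref57">57</xref>]: instead of selecting a random local manifold, it searches by brute force for a vertex set of at most d vertices whose removal disconnects the graph, copies that separator into each resulting part, and recurses. The function names and the neighbor terms around “jaguar” are hypothetical.</p><preformat><![CDATA[
```python
from itertools import combinations


def build_graph(edges):
    """Undirected graph as {vertex: set of neighbors}."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def components(graph, removed):
    """Connected components of `graph` after deleting the `removed` vertices."""
    seen, parts = set(removed), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, part = [start], set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            part.add(v)
            stack.extend(graph[v] - seen)
        parts.append(part)
    return parts


def decompose(graph, vertices, d):
    """Recursively split `vertices` along separators of at most `d` vertices."""
    sub = {v: graph[v].intersection(vertices) for v in vertices}
    for size in range(1, d + 1):
        for sep in combinations(sorted(vertices), size):
            parts = components(sub, set(sep))
            if len(parts) > 1:
                # Each side keeps a copy of the separator, then is split again.
                result = []
                for part in parts:
                    result.extend(decompose(graph, part | set(sep), d))
                return result
    return [vertices]  # no small separator left: keep as one piece


# Hypothetical term graph: "jaguar" links three otherwise unrelated groups.
graph = build_graph([
    ("jaguar", "cat"), ("jaguar", "jungle"), ("cat", "jungle"),
    ("jaguar", "car"), ("jaguar", "engine"), ("car", "engine"),
    ("jaguar", "guitar"), ("jaguar", "amplifier"), ("guitar", "amplifier"),
])
pieces = decompose(graph, set(graph), 1)
for piece in pieces:
    print(sorted(piece))
```
]]></preformat><p>In this toy graph the only separator of size one is “jaguar”, so the sketch returns three pieces, each containing “jaguar” together with one group of neighbors. The brute-force search is exponential in d and is meant only to convey the idea.</p><p>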
We can express the decomposition formally as follows.</p><fig id="fig2"  position="float"><label><xref ref-type="fig" rid="fig2">Figure 2</xref></label><caption><title> An example to demonstrate named-entities, their types, and associated marginal and conditional probabilities</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/4-2870096x27.png"/></fig><fig id="fig3"  position="float"><label><xref ref-type="fig" rid="fig3">Figure 3</xref></label><caption><title> An example to demonstrate Graph-based Tree-width Decomposition</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/4-2870096x28.png"/></fig><fig id="fig4"  position="float"><label><xref ref-type="fig" rid="fig4">Figure 4</xref></label><caption><title> An example to demonstrate the concept of separator under Graph-based Tree-width decomposition</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/4-2870096x29.png"/></fig><p>Let a heterogeneous manifold M<sub>i</sub> for fragment i be the set of homogeneous manifolds, such that <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x30.png" xlink:type="simple"/></inline-formula>. The semantics generated from the homogeneous manifolds of a fragment are independent. In addition, a semantic topic set <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x31.png" xlink:type="simple"/></inline-formula> of the returned documents is associated with semantic mapping <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x32.png" xlink:type="simple"/></inline-formula> with a probability <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x33.png" xlink:type="simple"/></inline-formula>, and quantity <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x34.png" xlink:type="simple"/></inline-formula>. The probabilities indicate the number of documents that are relevant to a homogeneous manifold and match the user’s intent. To induce homogeneous manifolds, it is crucial to extract significant terms from fragments. In addition, we should demonstrate the relevance of each fragment to the homogeneous manifold. The users can then consult only the fragments associated with the homogeneous manifold they want.</p><p>Step 4 (Exploring the homogeneous manifolds): The relevant fragments cluster around their related homogeneous manifolds. For instance, for a user query on the term APC, the fragments aggregate into a collection of homogeneous manifolds as shown in <xref ref-type="fig" rid="fig5">Figure 5</xref>. Each fragment is assigned to a particular homogeneous manifold.</p></sec><sec id="s2_2"><title>2.2. Deployment of the LSM Algorithm</title><p>The LSM algorithm was deployed to develop a search tool. A team of three researchers, including an expert in the Java programming language, developed the tool using the Eclipse Software Development Kit. 
The LSM tool was used for two years at two places in Taiwan: 1) Taipei Medical University Library, Taipei; and 2) Biomedical Engineering Laboratory, Institute of Biomedical Engineering, National Taiwan University, Taipei. The members of the library and lab used the LSM tool to perform semantic searches in the PubMed database.</p></sec><sec id="s2_3"><title>2.3. Performance Evaluation of the LSM Algorithm</title><fig id="fig5"  position="float"><label><xref ref-type="fig" rid="fig5">Figure 5</xref></label><caption><title> An example to demonstrate heterogeneous manifold, homogeneous manifolds and documents associated with homogeneous manifolds</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/4-2870096x35.png"/></fig><p>Data sets: Two data sets, Reuters-21578-Distribution-1 and OHSUMED, were used to evaluate the performance of the LSM algorithm. The Reuters-21578-Distribution-1 is a standard benchmark for text categorization, which consists of newswire articles classified into 135 topics [<xref ref-type="bibr" rid="scirp.61283-ref59">59</xref>] . In our tests, the documents with multiple topics (category labels) were separated from those with a single topic. The topics that had fewer than five documents were removed. <xref ref-type="table" rid="table2">Table 2</xref> shows the summary of the Reuters-21578-Distribution-1 collection. OHSUMED is a clinically oriented Medline collection consisting of 348,566 references. It covers all the references from 270 medical journals belonging to 23 disease categories over a five-year period (1987-1991) [<xref ref-type="bibr" rid="scirp.61283-ref60">60</xref>] .</p><p>Evaluation criteria: Effectiveness and efficiency were measured in an experimental evaluation of the LSM algorithm. Effectiveness is defined as the ability to identify the right cluster (collection of documents). 
As shown in <xref ref-type="table" rid="table3">Table 3</xref>, the generated clusters were verified by human experts to measure the effectiveness. The three measures of effectiveness (Precision, Recall, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x36.png" xlink:type="simple"/></inline-formula>) were calculated using the contingency table in <xref ref-type="table" rid="table3">Table 3</xref>. Precision and Recall are respectively defined as follows:</p><disp-formula id="scirp.61283-formula1273"><graphic  xlink:href="http://html.scirp.org/file/4-2870096x37.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.61283-formula1274"><graphic  xlink:href="http://html.scirp.org/file/4-2870096x38.png"  xlink:type="simple"/></disp-formula><p>Moreover, the <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x39.png" xlink:type="simple"/></inline-formula> measure, which combines Precision and Recall, is defined as follows:</p><disp-formula id="scirp.61283-formula1275"><graphic  xlink:href="http://html.scirp.org/file/4-2870096x40.png"  xlink:type="simple"/></disp-formula><p>The <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x41.png" xlink:type="simple"/></inline-formula> measure is used in this paper; it is obtained by setting <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x42.png" xlink:type="simple"/></inline-formula> to 1, which means that precision and recall have equal weight in evaluating the performance. When many categories are generated and compared, the overall precision and recall are calculated as the average of the precisions and recalls of the individual categories. 
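</p><p>As an illustration of how these measures are computed from contingency counts, the following minimal sketch assumes the standard definitions: precision TP/(TP + FP), recall TP/(TP + FN), and an F-measure with weight β that reduces to the harmonic mean of precision and recall when β = 1. The function names and the counts are hypothetical and are not results from this study.</p><preformat><![CDATA[
```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Per-category precision, recall and F-beta from contingency counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


def macro_average(counts, beta=1.0):
    """Average the per-category scores; `counts` is a list of (tp, fp, fn)."""
    scores = [precision_recall_f(tp, fp, fn, beta) for tp, fp, fn in counts]
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))


# Hypothetical counts for two categories, for illustration only.
print(precision_recall_f(8, 2, 2))
print(macro_average([(8, 2, 2), (5, 5, 0)]))
```
]]></preformat><p>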
F<sub>1</sub> is calculated as the mean of all results, which is a macro-average over the categories.</p><p>In addition, two other evaluation metrics, Normalized Mutual Information (NMI) and overall F-measure, were also used [<xref ref-type="bibr" rid="scirp.61283-ref61">61</xref>] - [<xref ref-type="bibr" rid="scirp.61283-ref63">63</xref>]. Let C denote the topic set defined by experts and C' the topic set generated by a clustering method, both derived from the same corpus X. Let N(X) denote the total number of documents, N(z, X) the number of documents in topic z, and N(z, z', X) the number of documents in both topic z and topic z', for any topic z in C and any topic z' in C'. The mutual information metric MI(C, C') is defined as:</p><disp-formula id="scirp.61283-formula1276"><graphic  xlink:href="http://html.scirp.org/file/4-2870096x43.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x44.png" xlink:type="simple"/></inline-formula></p><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> Statistics of the Reuters-21578 corpus</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Statistics</th><th align="center" valign="middle" >Number of topics</th><th align="center" valign="middle" >Number of documents</th><th align="center" valign="middle" >Documents on a topic</th></tr></thead><tr><td align="center" valign="middle" >Origin</td><td align="center" valign="middle" >135</td><td align="center" valign="middle" >21,578</td><td align="center" valign="middle" >0 - 3945</td></tr><tr><td align="center" valign="middle" >Single topic</td><td align="center" valign="middle" >65</td><td align="center" valign="middle" >8649</td><td align="center" valign="middle" >1 - 3945</td></tr><tr><td align="center" valign="middle" >Single topic (≥5 documents)</td><td 
align="center" valign="middle" >51</td><td align="center" valign="middle" >9494</td><td align="center" valign="middle" >5 - 3945</td></tr></tbody></table></table-wrap><table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title> Contingency table for category (c<sub>i</sub>, where i = natural number)<sup>a</sup></title></caption><table><tbody><thead><tr><th align="center" valign="middle"  colspan="2"  >Category</th><th align="center" valign="middle"  colspan="2"  >Clustering results</th></tr></thead><tr><td align="center" valign="middle"  colspan="2"  ></td><td align="center" valign="middle" >Yes</td><td align="center" valign="middle" >No</td></tr><tr><td align="center" valign="middle"  rowspan="2"  >Expert Judgment</td><td align="center" valign="middle" >Yes</td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x45.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x46.png" xlink:type="simple"/></inline-formula></td></tr><tr><td align="center" valign="middle" >No</td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x47.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x48.png" xlink:type="simple"/></inline-formula></td></tr></tbody></table></table-wrap><p><sup>a</sup>TP: True Positive; FP: False Positive; FN: False Negative; TN: True Negative.</p><p>The mutual information metric MI(C, C') takes a value between zero and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x49.png" xlink:type="simple"/></inline-formula>, where H(C) and H(C') denote the entropies of C and C', respectively. 
A higher <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x49.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x50.png" xlink:type="simple"/></inline-formula> value means that the two topic sets are almost identical, whereas a lower value indicates that they are independent. Therefore, the Normalized Mutual Information metric <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x49.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x50.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x51.png" xlink:type="simple"/></inline-formula> is defined as</p><disp-formula id="scirp.61283-formula1277"><graphic  xlink:href="http://html.scirp.org/file/4-2870096x52.png"  xlink:type="simple"/></disp-formula><p>Let <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x53.png" xlink:type="simple"/></inline-formula> be the F-measure for each cluster <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x53.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x54.png" xlink:type="simple"/></inline-formula>, as defined above. The overall F-measure can be defined as</p><disp-formula id="scirp.61283-formula1278"><graphic  xlink:href="http://html.scirp.org/file/4-2870096x55.png"  xlink:type="simple"/></disp-formula><p>where F(z, z') calculates the F-measure between z and z'.</p><p>Efficiency is measured as the clustering time for a search query under each clustering scheme, with the feature set held fixed.</p><p>Experiments: The experiments were conducted using the Reuters-21578-Distribution-1 and OHSUMED data sets. 
Clusters ranging from two to ten topics were randomly selected to compare LSM with other clustering methods. For each clustering method, each test run was conducted on a selected topic, and the Normalized Mutual Information between the topic and its corresponding cluster was calculated. After fifty test runs for a fixed number of topics k, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/4-2870096x56.png" xlink:type="simple"/></inline-formula>, the final performance scores were obtained by averaging the mutual information measures from these 50 test runs [<xref ref-type="bibr" rid="scirp.61283-ref61">61</xref>]. A t-test assessed whether the homogeneous clusters generated by two methods (LSM vs. other methods) were statistically different from each other, as shown in <xref ref-type="table" rid="table4">Table 4</xref> and <xref ref-type="fig" rid="fig6">Figure 6</xref> in the Results section. We also calculated the overall F-measure for combinations of k arbitrary clusters, each uniquely assigned to a topic from the Reuters-21578-Distribution-1 data set, where k was 3, 15, 30, and 60 [<xref ref-type="bibr" rid="scirp.61283-ref64">64</xref>]. 
Fifty test runs were also performed to compare these LSM results with Frequent Itemset-based Hierarchical Clustering (FIHC) and bisecting k-means, as shown in <xref ref-type="table" rid="table5">Table 5</xref> and <xref ref-type="fig" rid="fig7">Figure 7</xref> in the Results section [<xref ref-type="bibr" rid="scirp.61283-ref64">64</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref65">65</xref>].</p><table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title> Normalized Mutual Information comparison of the LSM algorithm with sixteen other methods using the Reuters-21578-Distribution-1 data set<sup>b</sup></title></caption><table><tbody><thead><tr><th align="center" valign="middle" >k</th><th align="center" valign="middle" >2</th><th align="center" valign="middle" >3</th><th align="center" valign="middle" >4</th><th align="center" valign="middle" >5</th><th align="center" valign="middle" >6</th><th align="center" valign="middle" >7</th><th align="center" valign="middle" >8</th><th align="center" valign="middle" >9</th><th align="center" valign="middle" >10</th><th align="center" valign="middle" >Average</th></tr></thead><tr><td align="center" valign="middle" >LSM</td><td align="center" valign="middle" >0.461</td><td align="center" valign="middle" >0.505</td><td align="center" valign="middle" >0.622</td><td align="center" valign="middle" >0.686</td><td align="center" valign="middle" >0.714</td><td align="center" valign="middle" >0.792</td><td align="center" valign="middle" >0.893</td><td align="center" valign="middle" >0.884</td><td align="center" valign="middle" >0.9</td><td align="center" valign="middle" >0.717</td></tr><tr><td align="center" valign="middle" >CCF</td><td align="center" valign="middle" >0.569</td><td align="center" valign="middle" >0.563</td><td align="center" valign="middle" >0.607</td><td align="center" valign="middle" >0.62</td><td align="center" valign="middle" >0.605</td><td align="center" valign="middle" 
>0.624</td><td align="center" valign="middle" >0.633</td><td align="center" valign="middle" >0.647</td><td align="center" valign="middle" >0.676</td><td align="center" valign="middle" >0.616</td></tr><tr><td align="center" valign="middle" >GMM</td><td align="center" valign="middle" >0.475</td><td align="center" valign="middle" >0.468</td><td align="center" valign="middle" >0.462</td><td align="center" valign="middle" >0.516</td><td align="center" valign="middle" >0.551</td><td align="center" valign="middle" >0.522</td><td align="center" valign="middle" >0.551</td><td align="center" valign="middle" >0.557</td><td align="center" valign="middle" >0.548</td><td align="center" valign="middle" >0.517</td></tr><tr><td align="center" valign="middle" >NB</td><td align="center" valign="middle" >0.466</td><td align="center" valign="middle" >0.348</td><td align="center" valign="middle" >0.401</td><td align="center" valign="middle" >0.405</td><td align="center" valign="middle" >0.409</td><td align="center" valign="middle" >0.404</td><td align="center" valign="middle" >0.435</td><td align="center" valign="middle" >0.411</td><td align="center" valign="middle" >0.418</td><td align="center" valign="middle" >0.411</td></tr><tr><td align="center" valign="middle" >GMM + DFM</td><td align="center" valign="middle" >0.47</td><td align="center" valign="middle" >0.466</td><td align="center" valign="middle" >0.45</td><td align="center" valign="middle" >0.513</td><td align="center" valign="middle" >0.531</td><td align="center" valign="middle" >0.506</td><td align="center" valign="middle" >0.535</td><td align="center" valign="middle" >0.535</td><td align="center" valign="middle" >0.536</td><td align="center" valign="middle" >0.505</td></tr><tr><td align="center" valign="middle" >KM</td><td align="center" valign="middle" >0.404</td><td align="center" valign="middle" >0.402</td><td align="center" valign="middle" >0.461</td><td align="center" valign="middle" >0.525</td><td align="center" 
valign="middle" >0.561</td><td align="center" valign="middle" >0.548</td><td align="center" valign="middle" >0.583</td><td align="center" valign="middle" >0.597</td><td align="center" valign="middle" >0.618</td><td align="center" valign="middle" >0.522</td></tr><tr><td align="center" valign="middle" >KM-NC</td><td align="center" valign="middle" >0.438</td><td align="center" valign="middle" >0.462</td><td align="center" valign="middle" >0.525</td><td align="center" valign="middle" >0.554</td><td align="center" valign="middle" >0.592</td><td align="center" valign="middle" >0.577</td><td align="center" valign="middle" >0.594</td><td align="center" valign="middle" >0.607</td><td align="center" valign="middle" >0.618</td><td align="center" valign="middle" >0.552</td></tr><tr><td align="center" valign="middle" >SKM</td><td align="center" valign="middle" >0.458</td><td align="center" valign="middle" >0.407</td><td align="center" valign="middle" >0.499</td><td align="center" valign="middle" >0.561</td><td align="center" valign="middle" >0.567</td><td align="center" valign="middle" >0.558</td><td align="center" valign="middle" >0.591</td><td align="center" valign="middle" >0.598</td><td align="center" valign="middle" >0.619</td><td align="center" valign="middle" >0.54</td></tr><tr><td align="center" valign="middle" >SKM-NCW</td><td align="center" valign="middle" >0.434</td><td align="center" valign="middle" >0.423</td><td align="center" valign="middle" >0.515</td><td align="center" valign="middle" >0.556</td><td align="center" valign="middle" >0.577</td><td align="center" valign="middle" >0.563</td><td align="center" valign="middle" >0.593</td><td align="center" valign="middle" >0.602</td><td align="center" valign="middle" >0.612</td><td align="center" valign="middle" >0.542</td></tr><tr><td align="center" valign="middle" >BP-NCW</td><td align="center" valign="middle" >0.391</td><td align="center" valign="middle" >0.377</td><td align="center" valign="middle" >0.431</td><td 
align="center" valign="middle" >0.478</td><td align="center" valign="middle" >0.493</td><td align="center" valign="middle" >0.5</td><td align="center" valign="middle" >0.519</td><td align="center" valign="middle" >0.529</td><td align="center" valign="middle" >0.532</td><td align="center" valign="middle" >0.472</td></tr><tr><td align="center" valign="middle" >AA</td><td align="center" valign="middle" >0.443</td><td align="center" valign="middle" >0.415</td><td align="center" valign="middle" >0.488</td><td align="center" valign="middle" >0.531</td><td align="center" valign="middle" >0.571</td><td align="center" valign="middle" >0.542</td><td align="center" valign="middle" >0.587</td><td align="center" valign="middle" >0.594</td><td align="center" valign="middle" >0.611</td><td align="center" valign="middle" >0.531</td></tr><tr><td align="center" valign="middle" >NC</td><td align="center" valign="middle" >0.484</td><td align="center" valign="middle" >0.461</td><td align="center" valign="middle" >0.555</td><td align="center" valign="middle" >0.592</td><td align="center" valign="middle" >0.617</td><td align="center" valign="middle" >0.594</td><td align="center" valign="middle" >0.64</td><td align="center" valign="middle" >0.634</td><td align="center" valign="middle" >0.643</td><td align="center" valign="middle" >0.58</td></tr><tr><td align="center" valign="middle" >RC</td><td align="center" valign="middle" >0.417</td><td align="center" valign="middle" >0.381</td><td align="center" valign="middle" >0.505</td><td align="center" valign="middle" >0.46</td><td align="center" valign="middle" >0.485</td><td align="center" valign="middle" >0.456</td><td align="center" valign="middle" >0.548</td><td align="center" valign="middle" >0.484</td><td align="center" valign="middle" >0.495</td><td align="center" valign="middle" >0.47</td></tr><tr><td align="center" valign="middle" >NMF</td><td align="center" valign="middle" >0.48</td><td align="center" valign="middle" >0.426</td><td 
align="center" valign="middle" >0.498</td><td align="center" valign="middle" >0.559</td><td align="center" valign="middle" >0.591</td><td align="center" valign="middle" >0.552</td><td align="center" valign="middle" >0.603</td><td align="center" valign="middle" >0.601</td><td align="center" valign="middle" >0.623</td><td align="center" valign="middle" >0.548</td></tr><tr><td align="center" valign="middle" >NMF-NCW</td><td align="center" valign="middle" >0.494</td><td align="center" valign="middle" >0.5</td><td align="center" valign="middle" >0.586</td><td align="center" valign="middle" >0.615</td><td align="center" valign="middle" >0.637</td><td align="center" valign="middle" >0.613</td><td align="center" valign="middle" >0.654</td><td align="center" valign="middle" >0.659</td><td align="center" valign="middle" >0.658</td><td align="center" valign="middle" >0.602</td></tr><tr><td align="center" valign="middle" >CF</td><td align="center" valign="middle" >0.48</td><td align="center" valign="middle" >0.429</td><td align="center" valign="middle" >0.503</td><td align="center" valign="middle" >0.563</td><td align="center" valign="middle" >0.592</td><td align="center" valign="middle" >0.556</td><td align="center" valign="middle" >0.613</td><td align="center" valign="middle" >0.609</td><td align="center" valign="middle" >0.629</td><td align="center" valign="middle" >0.553</td></tr><tr><td align="center" valign="middle" >CF-NCW</td><td align="center" valign="middle" >0.496</td><td align="center" valign="middle" >0.505</td><td align="center" valign="middle" >0.595</td><td align="center" valign="middle" >0.616</td><td align="center" valign="middle" >0.644</td><td align="center" valign="middle" >0.615</td><td align="center" valign="middle" >0.66</td><td align="center" valign="middle" >0.66</td><td align="center" valign="middle" >0.665</td><td align="center" valign="middle" >0.606</td></tr></tbody></table></table-wrap><p><sup>b</sup>LSM: Latent semantic manifold; CCF: k-clique 
community finding algorithm; GMM: Gaussian mixture model; NB: Naive Bayes clustering; GMM + DFM: Gaussian mixture model followed by the iterative cluster refinement method; KM: Traditional k-means; KM-NC: Traditional k-means and spectral clustering algorithm based on normalized cut criterion; SKM: Spherical k-means; SKM-NCW: Normalized-cut weighted form; BP-NCW: Spectral clustering based bipartite normalized cut; AA: Average association criterion; NC: Normalized cut criterion; RC: Spectral clustering based on ratio cut criterion; NMF: Non-negative matrix factorization; NMF-NCW: Nonnegative Matrix Factorization-based clustering; CF: Concept factorization; CF-NCW: Clustering by concept factorization.</p><table-wrap id="table5" ><label><xref ref-type="table" rid="table5">Table 5</xref></label><caption><title> Precision, recall, overall F-measure, and Normalized Mutual Information (NMI) of Latent Semantic Manifold on Reuters-21578-Distribution-1 dataset</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >k</th><th align="center" valign="middle" >2</th><th align="center" valign="middle" >3</th><th align="center" valign="middle" >4</th><th align="center" valign="middle" >5</th><th align="center" valign="middle" >6</th><th align="center" valign="middle" >7</th><th align="center" valign="middle" >8</th><th align="center" valign="middle" >9</th><th align="center" valign="middle" >10</th></tr></thead><tr><td align="center" valign="middle" >Precision</td><td align="center" valign="middle" >0.9845</td><td align="center" valign="middle" >0.9579</td><td align="center" valign="middle" >0.9385</td><td align="center" valign="middle" >0.9352</td><td align="center" valign="middle" >0.8909</td><td align="center" valign="middle" >0.9013</td><td align="center" valign="middle" >0.9148</td><td align="center" valign="middle" >0.8913</td><td align="center" valign="middle" >0.8859</td></tr><tr><td align="center" valign="middle" >Recall</td><td align="center" 
valign="middle" >0.7085</td><td align="center" valign="middle" >0.6384</td><td align="center" valign="middle" >0.6453</td><td align="center" valign="middle" >0.6056</td><td align="center" valign="middle" >0.5916</td><td align="center" valign="middle" >0.6543</td><td align="center" valign="middle" >0.6822</td><td align="center" valign="middle" >0.6688</td><td align="center" valign="middle" >0.6805</td></tr><tr><td align="center" valign="middle" >Overall F-measure</td><td align="center" valign="middle" >0.7988</td><td align="center" valign="middle" >0.7297</td><td align="center" valign="middle" >0.7399</td><td align="center" valign="middle" >0.6986</td><td align="center" valign="middle" >0.6822</td><td align="center" valign="middle" >0.7329</td><td align="center" valign="middle" >0.7562</td><td align="center" valign="middle" >0.7343</td><td align="center" valign="middle" >0.7472</td></tr><tr><td align="center" valign="middle" >NMI</td><td align="center" valign="middle" >0.4617</td><td align="center" valign="middle" >0.5051</td><td align="center" valign="middle" >0.6221</td><td align="center" valign="middle" >0.6866</td><td align="center" valign="middle" >0.7148</td><td align="center" valign="middle" >0.7925</td><td align="center" valign="middle" >0.8936</td><td align="center" valign="middle" >0.8848</td><td align="center" valign="middle" >0.9006</td></tr></tbody></table></table-wrap><fig id="fig6"  position="float"><label><xref ref-type="fig" rid="fig6">Figure 6</xref></label><caption><title> Mutual information values of 2 to 10 clusters built by LSM algorithm and other sixteen methods using Reuters-21578-Distribution-1 datasets<sup>c</sup>. c. 
LSM: Latent semantic manifold; GMM: Gaussian mixture model; NB: Naive Bayes clustering; GMM + DFM: Gaussian mixture model followed by the iterative cluster refinement method; KM: Traditional k-means; KM-NC: Traditional k-means and spectral clustering algorithm based on normalized cut criterion; SKM: Spherical k-means; SKM-NCW: Normalized-cut weighted form; BP-NCW: Spectral clustering based bipartite normalized cut; AA: Average association criterion; NC: Normalized cut criterion; RC: Spectral clustering based on ratio cut criterion; NMF: Non-negative matrix factorization; NMF-NCW: Nonnegative Matrix Factorization-based clustering; CF: Concept factorization; CF-NCW: Clustering by concept factorization; CCF: k-clique community finding algorithm</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/4-2870096x57.png"/></fig><p>The average precision, recall, overall F-measure, and Normalized Mutual Information of LSM, LST, PLSI, PLSI + AdaBoost, LDA, and CCF were evaluated using the Reuters-21578-Distribution-1 data set; and LSM, LST, and CCF were evaluated on the OHSUMED data set, as shown in <xref ref-type="table" rid="table6">Table 6</xref> in the Results section [<xref ref-type="bibr" rid="scirp.61283-ref44">44</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref66">66</xref>] - [<xref ref-type="bibr" rid="scirp.61283-ref69">69</xref>]. Besides effectiveness, the efficiency of LSM, LST, and CCF was tested, as shown in <xref ref-type="fig" rid="fig8">Figure 8</xref> in the Results section.</p></sec></sec><sec id="s3"><title>3. 
Results</title><p>The Normalized Mutual Information comparison of the LSM algorithm with the other sixteen methods on the Reuters-21578-Distribution-1 data set is shown in <xref ref-type="table" rid="table4">Table 4</xref> and <xref ref-type="fig" rid="fig6">Figure 6</xref> [<xref ref-type="bibr" rid="scirp.61283-ref61">61</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref69">69</xref>] - [<xref ref-type="bibr" rid="scirp.61283-ref71">71</xref>]. The four metrics (precision, recall, overall F-measure, and Normalized Mutual Information) of LSM on the Reuters-21578-Distribution-1 data set for different values of k are listed in <xref ref-type="table" rid="table5">Table 5</xref>. In addition, the overall F-measure of LSM is compared with those of FIHC and bisecting k-means in <xref ref-type="fig" rid="fig7">Figure 7</xref>. The average precision, recall, overall F-measure, and Normalized Mutual Information of 1) LSM, LST, PLSI, PLSI + AdaBoost, LDA, and CCF on the Reuters-21578-Distribution-1 data set and 2) LSM, LST, and CCF on the OHSUMED data set are shown in <xref ref-type="table" rid="table6">Table 6</xref>. 
The efficiency test results of the three methods, LSM, LST, and CCF, are shown in <xref ref-type="fig" rid="fig8">Figure 8</xref>.</p><fig id="fig7"  position="float"><label><xref ref-type="fig" rid="fig7">Figure 7</xref></label><caption><title> Overall F-measure of three methods, LSM, FIHC, and bisecting k-means, on the Reuters-21578-Distribution-1 data set, where k (x-axis) is 3, 15, 30, or 60 clusters</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/4-2870096x58.png"/></fig><fig id="fig8"  position="float"><label><xref ref-type="fig" rid="fig8">Figure 8</xref></label><caption><title> Efficiency of three clustering methods, wherein x-axis is the number of features and y-axis is run time in milliseconds (LSM: Latent semantic manifold; LST: Latent Semantic Topology; CCF: k-Clique Community Finding)</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/4-2870096x59.png"/></fig><table-wrap id="table6" ><label><xref ref-type="table" rid="table6">Table 6</xref></label><caption><title> The average precision, recall, overall F-measure, and Normalized Mutual Information (NMI) of LSM, LST, PLSI, PLSI + AdaBoost, LDA, and CCF on Reuters-21578-Distribution-1 dataset; and LSM, LST and CCF on OHSUMED<sup>d</sup></title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Dataset</th><th align="center" valign="middle" >Method</th><th align="center" valign="middle" >Precision</th><th align="center" valign="middle" >Recall</th><th align="center" valign="middle" >Overall F-measure</th><th align="center" valign="middle" >NMI</th></tr></thead><tr><td align="center" valign="middle"  rowspan="6"  >Reuters</td><td align="center" valign="middle" >LSM</td><td align="center" valign="middle" >0.81</td><td align="center" valign="middle" >0.773</td><td align="center" valign="middle" >0.786</td><td align="center" valign="middle" 
>0.717</td></tr><tr><td align="center" valign="middle" >LST</td><td align="center" valign="middle" >0.779</td><td align="center" valign="middle" >0.745</td><td align="center" valign="middle" >0.754</td><td align="center" valign="middle" >0.633</td></tr><tr><td align="center" valign="middle" >PLSI</td><td align="center" valign="middle" >0.649</td><td align="center" valign="middle" >0.627</td><td align="center" valign="middle" >0.636</td><td align="center" valign="middle" >0.54</td></tr><tr><td align="center" valign="middle" >PLSI + AdaBoost</td><td align="center" valign="middle" >0.772</td><td align="center" valign="middle" >0.812</td><td align="center" valign="middle" >0.697</td><td align="center" valign="middle" >N/A</td></tr><tr><td align="center" valign="middle" >LDA</td><td align="center" valign="middle" >0.66</td><td align="center" valign="middle" >0.714</td><td align="center" valign="middle" >0.686</td><td align="center" valign="middle" >0.61</td></tr><tr><td align="center" valign="middle" >CCF</td><td align="center" valign="middle" >0.727</td><td align="center" valign="middle" >0.73</td><td align="center" valign="middle" >0.723</td><td align="center" valign="middle" >0.616</td></tr><tr><td align="center" valign="middle"  rowspan="3"  >OHSUMED</td><td align="center" valign="middle" >LSM</td><td align="center" valign="middle" >0.59</td><td align="center" valign="middle" >0.479</td><td align="center" valign="middle" >0.522</td><td align="center" valign="middle" >0.315</td></tr><tr><td align="center" valign="middle" >LST</td><td align="center" valign="middle" >0.586</td><td align="center" valign="middle" >0.388</td><td align="center" valign="middle" >0.456</td><td align="center" valign="middle" >0.257</td></tr><tr><td align="center" valign="middle" >CCF</td><td align="center" valign="middle" >0.514</td><td align="center" valign="middle" >0.54</td><td align="center" valign="middle" >0.513</td><td align="center" valign="middle" 
>0.214</td></tr></tbody></table></table-wrap><p><sup>d</sup>LSM: Latent semantic manifold; LST: Latent semantic topology; PLSI: Probabilistic latent semantic indexing; PLSI + AdaBoost: Probabilistic latent semantic indexing + additive boosting methods; LDA: Latent Dirichlet allocation; CCF: k-clique community finding algorithm.</p></sec><sec id="s4"><title>4. Discussion</title><p>Our findings suggest that the LSM algorithm, which can discover the latent semantics in high-dimensional web data, might play an instrumental role in enhancing search engine functionality. LSM carries out searches based on both keywords and meaning, which can help researchers perform semantic searches on databases. For example, a researcher can search the PubMed database for APC with Adenomatous Polyposis Coli as the intended meaning (the output for the user-queried term APC is shown in <xref ref-type="fig" rid="fig9">Figure 9</xref>).</p><p>APC can also have other meanings, such as Antigen-Presenting Cells, Anaphase Promoting Complex, or Activated Protein C. If documents related to APC, colorectal cancer, and genes are assembled in a homogeneous manifold, that manifold indicates that APC means Adenomatous Polyposis Coli. Similarly, if documents related to APC, Major Histocompatibility Complex, and T-cells are assembled, the manifold indicates that APC means Antigen-Presenting Cells. <xref ref-type="fig" rid="fig9">Figure 9</xref> shows that documents returned for the queried term APC are automatically associated with various homogeneous manifolds (semantic topics). In addition, the researcher can obtain a different vantage point based on the underlying data. 
For example, a search for the medical term NOD2 within the PubMed database retrieved almost 300 abstracts of published or in-press articles (<xref ref-type="fig" rid="fig10">Figure 10</xref> shows the latent semantic topics as a clustering result).</p><p>According to the result, inflammatory bowel disease and its types (Crohn’s disease and ulcerative colitis) are associated with the gene NOD2. The term NOD2 was found to be evenly spread over these three topics: inflammatory bowel disease, Crohn’s disease, and ulcerative colitis. Some evolving topics, such as the bacterial component, were also discovered. However, the result was different when we searched for NOD2 in the Genia Corpus (<xref ref-type="fig" rid="fig11">Figure 11</xref>), which supports the argument that the researcher can obtain a different meaningful vantage point based on the underlying data using the same LSM algorithm [<xref ref-type="bibr" rid="scirp.61283-ref72">72</xref>].</p><p>The results (<xref ref-type="fig" rid="fig10">Figure 10</xref> and <xref ref-type="fig" rid="fig11">Figure 11</xref>) are meaningfully structured, with a possibility of semantic navigation in both databases. This indicates the generalization capability of the LSM algorithm. We used concepts of topology in designing the LSM algorithm. LSM performed better on average than the other sixteen clustering methods, especially as the number of clusters grew (<xref ref-type="table" rid="table4">Table 4</xref> and <xref ref-type="table" rid="table5">Table 5</xref>, and <xref ref-type="fig" rid="fig6">Figure 6</xref> and <xref ref-type="fig" rid="fig7">Figure 7</xref>). In general, we found that LSM produced more accurate results than the other methods. We used a paired t-test to compare the clustering results of the same topics between any two of the methods LSM, LST, and CCF. 
The results of LSM were significantly better than the results of LST, where we used 63 clusters in the experiments (p-value &lt; 0.05) (<xref ref-type="table" rid="table6">Table 6</xref>). Similarly, with a p-value less than 0.05, the results of LSM were significantly better than the results of CCF in 48 randomly selected clusters out of 72 (<xref ref-type="table" rid="table6">Table 6</xref>). The efficiency evaluation of the three methods, LSM, LST, and CCF, demonstrated that LSM performed better than the others. In the case of LSM, the time needed to build a latent semantic manifold does not increase significantly as the data grow larger (<xref ref-type="fig" rid="fig8">Figure 8</xref>).</p><fig id="fig9"  position="float"><label><xref ref-type="fig" rid="fig9">Figure 9</xref></label><caption><title> Result of query term, APC</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/4-2870096x60.png"/></fig><fig id="fig10"  position="float"><label><xref ref-type="fig" rid="fig10">Figure 10</xref></label><caption><title> Clustering result of the query term, NOD2, retrieved from PubMed</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/4-2870096x61.png"/></fig><fig id="fig11"  position="float"><label><xref ref-type="fig" rid="fig11">Figure 11</xref></label><caption><title> Clustering result of the query term, NOD2, retrieved from Genia Corpus</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/4-2870096x62.png"/></fig><p>Limitations and future studies: This study has a few limitations that open up scope for future studies. First, to identify and discriminate the correct topics in the collection of documents, a combination of features and their co-occurrence relationships serves as the clue, and probabilities indicate their significance. 
All features in the documents form a topological probabilistic manifold that is associated with probabilistic measures and denotes the underlying structure. This complex structure is decomposed into inseparable components at various levels (in various levels of skeletons) so that each component corresponds to a topic in the collection of documents. This process is computation-intensive and time-consuming, and it depends strongly on the features and their identification (named entities). Second, some terms with similar meanings, such as anticipate, believe, estimate, expect, intend, and project, were separated into independent topics. Likewise, some terms were repeatedly specified in many topics. These issues might be addressed by utilizing thesauri and other adaptive methods [<xref ref-type="bibr" rid="scirp.61283-ref73">73</xref>]. Third, some tools, such as MedEvi, EBIMed, MEDIE, PubNet, GoPubMed, Argo, and Vivisimo, also perform a latent semantic search in high-dimensional web data [<xref ref-type="bibr" rid="scirp.61283-ref23">23</xref>] - [<xref ref-type="bibr" rid="scirp.61283-ref33">33</xref>]. However, in this study, we did not compare the LSM algorithm-based tool with them. Further study is needed to compare the LSM algorithm-based tool with existing tools to find a space of synergy. Fourth, in this study, the evaluation was carried out mainly by comparison with other latent semantic indexing (LSI) algorithms. However, many alternative approaches for searching, clustering, and categorization exist. Further study is needed to compare this approach with those alternatives. Fifth, there are existing knowledge bases and resources in the biomedical domain, such as MeSH (Medical Subject Headings) [<xref ref-type="bibr" rid="scirp.61283-ref74">74</xref>] [<xref ref-type="bibr" rid="scirp.61283-ref75">75</xref>]. 
Further studies are needed to verify whether the LSM algorithm-based tool can be adapted to the existing knowledge bases or resources.</p></sec><sec id="s5"><title>5. Conclusion</title><p>We found that the LSM algorithm can discover the latent semantics in high-dimensional web data and organize them into several semantic topics. This algorithm can be used to enhance the functionality of currently available search engines.</p></sec><sec id="s6"><title>Acknowledgements</title><p>The National Science Foundation (NSC 98-2221-E-038-012) supported this work.</p></sec><sec id="s7"><title>Cite this paper</title><p>Ajit Kumar, Sanjeev Maskara, I-Jen Chiang (2015) Identifying Semantic in High-Dimensional Web Data Using Latent Semantic Manifold. Journal of Data Analysis and Information Processing, 03, 136-152. doi: 10.4236/jdaip.2015.34014</p></sec></body><back><ref-list><title>References</title><ref id="scirp.61283-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Donoho, D.L. (2000) High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality. AMS Math Challenges Lecture, 1-32. http://mlo.cs.manchester.ac.uk/resources/Curses.pdf</mixed-citation></ref><ref id="scirp.61283-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Laney, D. (2001) 3D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, 6.</mixed-citation></ref><ref id="scirp.61283-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Hoehndorf, R., Rebholz-Schuhmann, D., Haendel, M. and Stevens, R. (2014) Thematic Series on Biomedical Ontologies in JBMS: Challenges and New Directions. Journal of Biomedical Semantics, 5, 15. 
http://dx.doi.org/10.1186/2041-1480-5-15</mixed-citation></ref><ref id="scirp.61283-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Raman, A.C. (2014) Storage Infrastructure for Big Data and Cloud. Handbook of Research on Cloud Infrastructures for Big Data Analytics, 110. http://dx.doi.org/10.4018/978-1-4666-5864-6.ch005</mixed-citation></ref><ref id="scirp.61283-ref5"><label>5</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Ranganathan</surname><given-names>P.</given-names></name>, <etal>et al</etal>. (<year>2011</year>) <article-title>From Microprocessors to Nanostores: Rethinking Data-Centric Systems</article-title>. <source>Computer</source>, <volume>44</volume>, <fpage>39</fpage>-<lpage>48</lpage>.</mixed-citation></ref><ref id="scirp.61283-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., Hide, W. and Rhee, S.Y. (2008) Big Data: The Future of Biocuration. Nature, 455, 47-50. http://dx.doi.org/10.1038/455047a</mixed-citation></ref><ref id="scirp.61283-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Gracia, J., Montiel-Ponsoda, E., Cimiano, P., Gomez-Perez, A., Buitelaar, P. and McCrae, J. (2012) Challenges for the Multilingual Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 11, 63-71. 
http://dx.doi.org/10.1016/j.websem.2011.09.001</mixed-citation></ref><ref id="scirp.61283-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Croft, W.B., Metzler, D. and Strohman, T. (2010) Search Engines: Information Retrieval in Practice. Addison-Wesley, Reading, 88.</mixed-citation></ref><ref id="scirp.61283-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Thomas, P., Starlinger, J., Vowinkel, A., Arzt, S. and Leser, U. (2012) GeneView: A Comprehensive Semantic Search Engine for PubMed. Nucleic Acids Research, 40, W585-W591. http://dx.doi.org/10.1093/nar/gks563</mixed-citation></ref><ref id="scirp.61283-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Every 2 Days We Create As Much Information As We Did up to 2003. 
http://techcrunch.com/2010/08/04/schmidt-data/</mixed-citation></ref><ref id="scirp.61283-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Mayer-Schonberger, V. and Cukier, K. (2013) Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt, Boston.</mixed-citation></ref><ref id="scirp.61283-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Lingwal, S. and Gupta, B. (2012) A Comparative Study of Different Approaches for Improving Search Engine Performance. International Journal of Emerging Trends &amp; Technology in Computer Science, 1, 123-132.</mixed-citation></ref><ref id="scirp.61283-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Freitas, A., Curry, E., Oliveira, J.G. and Riain, S.O. (2012) Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches, and Trends. IEEE Internet Computing, 16, 24-33. http://dx.doi.org/10.1109/MIC.2011.141</mixed-citation></ref><ref id="scirp.61283-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Dalal, M.K. and Zaveri, M.A. (2013) Automatic Classification of Unstructured Blog Text.</mixed-citation></ref><ref id="scirp.61283-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Vercruysse, S. and Kuiper, M. (2012) Jointly Creating Digital Abstracts: Dealing with Synonymy and Polysemy. BMC Research Notes, 5, 601. http://dx.doi.org/10.1186/1756-0500-5-601</mixed-citation></ref><ref id="scirp.61283-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Singer, G., Norbisrath, U. and Lewandowski, D. (2012) Ordinary Search Engine Users Carrying out Complex Search Tasks. Journal of Information Science, 39, 346-358.</mixed-citation></ref><ref id="scirp.61283-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Brossard, D. and Scheufele, D.A. 
(2013) Science, New Media, and the Public. Science, 339, 40-41.  
http://dx.doi.org/10.1126/science.1232329</mixed-citation></ref><ref id="scirp.61283-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Beall, J. (2008) The Weaknesses of Full-Text Searching. The Journal of Academic Librarianship, 34, 438-444.  
http://dx.doi.org/10.1016/j.acalib.2008.06.007</mixed-citation></ref><ref id="scirp.61283-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Liu, L. and Feng, J. (2011) The Notion of “Meaning System” and Its Use for “Semantic Search”. Journal of Computations and Modelling, 1, 97-126.</mixed-citation></ref><ref id="scirp.61283-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Stumme, G., Hotho, A. and Berendt, B. (2006) Semantic Web Mining: State of the Art and Future Directions. Web Semantics: Science, Services and Agents on the World Wide Web, 4, 124-143.  
http://dx.doi.org/10.1016/j.websem.2006.02.001</mixed-citation></ref><ref id="scirp.61283-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">Blanco, R., Halpin, H., Herzig, D.M., Mika, P., Pound, J., Thompson, H.S. and Tran, T. (2013) Repeatable and Reliable Semantic Search Evaluation. Web Semantics: Science, Services and Agents on the World Wide Web, 21, 14-29.  
http://dx.doi.org/10.1016/j.websem.2013.05.005</mixed-citation></ref><ref id="scirp.61283-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">Nessah, D. and Kazar, O. (2013) An Improved Semantic Information Searching Scheme Based Multi-Agent System and an Innovative Similarity Measure. International Journal of Metadata, Semantics and Ontologies, 8, 282-297.  
http://dx.doi.org/10.1504/IJMSO.2013.058411</mixed-citation></ref><ref id="scirp.61283-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">Hogan, A., Harth, A., Umbrich, J., Kinsella, S., Polleres, A. and Decker, S. (2011) Searching and Browsing Linked Data with Swse: The Semantic Web Search Engine. Web Semantics: Science, Services and Agents on the World Wide Web, 9, 365-401. http://dx.doi.org/10.1016/j.websem.2011.06.004</mixed-citation></ref><ref id="scirp.61283-ref24"><label>24</label><mixed-citation publication-type="other" xlink:type="simple">Fazzinga, B., Gianforme, G., Gottlob, G. and Lukasiewicz, T. (2011) Semantic Web Search Based on Ontological Conjunctive Queries. Web Semantics: Science, Services and Agents on the World Wide Web, 9, 453-473.  
http://dx.doi.org/10.1016/j.websem.2011.08.003</mixed-citation></ref><ref id="scirp.61283-ref25"><label>25</label><mixed-citation publication-type="other" xlink:type="simple">Lu, Z.Y. (2011) PubMed and Beyond: A Survey of Web Tools for Searching Biomedical Literature. Database, 2011, baq036. http://dx.doi.org/10.1093/database/baq036</mixed-citation></ref><ref id="scirp.61283-ref26"><label>26</label><mixed-citation publication-type="other" xlink:type="simple">Kim, J.J., Pezik, P. and Rebholz-Schuhmann, D. (2008) MedEvi: Retrieving Textual Evidence of Relations between Biomedical Concepts from Medline. Bioinformatics, 24, 1410-1412. http://dx.doi.org/10.1093/bioinformatics/btn117</mixed-citation></ref><ref id="scirp.61283-ref27"><label>27</label><mixed-citation publication-type="other" xlink:type="simple">Rebholz-Schuhmann, D., Kirsch, H., Arregui, M., Gaudan, S., Riethoven, M. and Stoehr, P. (2007) EBIMed—Text Crunching to Gather Facts for Proteins from Medline. Bioinformatics, 23, e237-e244.  
http://dx.doi.org/10.1093/bioinformatics/btl302</mixed-citation></ref><ref id="scirp.61283-ref28"><label>28</label><mixed-citation publication-type="other" xlink:type="simple">Ohta, T., Tsuruoka, Y., Takeuchi, J., Kim, J.D., Miyao, Y., Yakushiji, A., et al. (2006) An Intelligent Search Engine and GUI-Based Efficient MEDLINE Search Tool Based on Deep Syntactic Parsing. Proceedings of the COLING/ACL on Interactive Presentation Sessions, Sydney, 17-21 July 2006, Association for Computational Linguistics, 17-20.</mixed-citation></ref><ref id="scirp.61283-ref29"><label>29</label><mixed-citation publication-type="other" xlink:type="simple">Douglas, S.M., Montelione, G.T. and Gerstein, M. (2005) PubNet: A Flexible System for Visualizing Literature Derived Networks. Genome Biology, 6, R80. http://dx.doi.org/10.1186/gb-2005-6-9-r80</mixed-citation></ref><ref id="scirp.61283-ref30"><label>30</label><mixed-citation publication-type="other" xlink:type="simple">Doms, A. and Schroeder, M. (2005) GoPubMed: Exploring PubMed with the Gene Ontology. Nucleic Acids Research, 33, W783-W786. http://dx.doi.org/10.1093/nar/gki470</mixed-citation></ref><ref id="scirp.61283-ref31"><label>31</label><mixed-citation publication-type="other" xlink:type="simple">Argo: Genome Browser. http://www.broadinstitute.org/annotation/argo</mixed-citation></ref><ref id="scirp.61283-ref32"><label>32</label><mixed-citation publication-type="other" xlink:type="simple">Engels, R., Yu, T., Burge, C., Mesirov, J.P., DeCaprio, D. and Galagan, J.E. (2006) Combo: A Whole Genome Comparative Browser. Bioinformatics, 22, 1782-1783. http://dx.doi.org/10.1093/bioinformatics/btl193</mixed-citation></ref><ref id="scirp.61283-ref33"><label>33</label><mixed-citation publication-type="other" xlink:type="simple">Koshman, S., Spink, A. and Jansen, B.J. (2006) Web Searching on the Vivisimo Search Engine. Journal of the American Society for Information Science and Technology, 57, 1875-1887. 
http://dx.doi.org/10.1002/asi.20408</mixed-citation></ref><ref id="scirp.61283-ref34"><label>34</label><mixed-citation publication-type="other" xlink:type="simple">Sah, M. and Wade, V. (2012) Automatic Metadata Mining from Multilingual Enterprise Content. Web Semantics: Science, Services and Agents on the World Wide Web, 11, 41-62. http://dx.doi.org/10.1016/j.websem.2011.11.001</mixed-citation></ref><ref id="scirp.61283-ref35"><label>35</label><mixed-citation publication-type="other" xlink:type="simple">Bergamaschi, S., Domnori, E., Guerra, F., TrilloLado, R. and Velegrakis, Y. (2011) Keyword Search Over Relational Databases: A Metadata Approach. Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ACM, New York, 565-576. http://dx.doi.org/10.1145/1989323.1989383</mixed-citation></ref><ref id="scirp.61283-ref36"><label>36</label><mixed-citation publication-type="other" xlink:type="simple">Luhn, H.P. (1958) The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2, 159-165. http://dx.doi.org/10.1147/rd.22.0159</mixed-citation></ref><ref id="scirp.61283-ref37"><label>37</label><mixed-citation publication-type="other" xlink:type="simple">Zipf, G.K. (1949) Human Behavior and the Principle of Least Effort.</mixed-citation></ref><ref id="scirp.61283-ref38"><label>38</label><mixed-citation publication-type="other" xlink:type="simple">Salton, G. and McGill, M.J. (1983) Introduction to Modern Information Retrieval. McGraw-Hill Book Co., New York.</mixed-citation></ref><ref id="scirp.61283-ref39"><label>39</label><mixed-citation publication-type="other" xlink:type="simple">Kupiec, J., Pedersen, J. and Chen, F. (1995) A Trainable Document Summarizer. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, 68-73.  
http://dx.doi.org/10.1145/215206.215333</mixed-citation></ref><ref id="scirp.61283-ref40"><label>40</label><mixed-citation publication-type="other" xlink:type="simple">Gabaix, X. (1999) Zipf’s Law for Cities: An Explanation. Quarterly Journal of Economics, 114, 739-767.  
http://dx.doi.org/10.1162/003355399556133</mixed-citation></ref><ref id="scirp.61283-ref41"><label>41</label><mixed-citation publication-type="other" xlink:type="simple">Aldous, D.J. (1985) Exchangeability and Related Topics. Springer, Berlin, 1-198.  
http://dx.doi.org/10.1007/bfb0099421</mixed-citation></ref><ref id="scirp.61283-ref42"><label>42</label><mixed-citation publication-type="other" xlink:type="simple">Warmuth, W. (1977) De Finetti, B.: Theory of Probability—A Critical Introductory Treatment, Volume 2. John Wiley and Sons, London-New York-Sydney-Toronto 1975. XIV, 375 S. Biometrical Journal, 19, 382.  
http://dx.doi.org/10.1002/bimj.4710190515</mixed-citation></ref><ref id="scirp.61283-ref43"><label>43</label><mixed-citation publication-type="other" xlink:type="simple">Reinhardt, H.E. (1978) Theory of Probability: A Critical Introductory Treatment, Vol. 2 (Bruno de Finetti). SIAM Review, 20, 200-201. http://dx.doi.org/10.1137/1020030</mixed-citation></ref><ref id="scirp.61283-ref44"><label>44</label><mixed-citation publication-type="other" xlink:type="simple">Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003) Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3, 993-1022.</mixed-citation></ref><ref id="scirp.61283-ref45"><label>45</label><mixed-citation publication-type="other" xlink:type="simple">Flores, J.G., Gillard, L., Ferret, O. and de Chandelar, G. (2008) Bag of Senses versus Bag of Words: Comparing Semantic and Lexical Approaches on Sentence Extraction. TAC 2008 Workshop-Notebook Papers and Results, Gaithersburg, 17-19 November 2008, 158-167.</mixed-citation></ref><ref id="scirp.61283-ref46"><label>46</label><mixed-citation publication-type="other" xlink:type="simple">Chanlekha, H. and Collier, N. (2010) Analysis of Syntactic and Semantic Features for Fine-Grained Event-Spatial Understanding in Outbreak News Reports. Journal of Biomedical Semantics, 1, 3.  
http://dx.doi.org/10.1186/2041-1480-1-3</mixed-citation></ref><ref id="scirp.61283-ref47"><label>47</label><mixed-citation publication-type="other" xlink:type="simple">Juang, B.H. and Rabiner, L.R. (1991) Hidden Markov Models for Speech Recognition. Technometrics, 33, 251-272.  
http://dx.doi.org/10.1080/00401706.1991.10484833</mixed-citation></ref><ref id="scirp.61283-ref48"><label>48</label><mixed-citation publication-type="other" xlink:type="simple">Mooij, J.M. and Kappen, H.J. (2007) Sufficient Conditions for Convergence of the Sum-Product Algorithm. IEEE Transactions on Information Theory, 53, 4422-4437. http://dx.doi.org/10.1109/TIT.2007.909166</mixed-citation></ref><ref id="scirp.61283-ref49"><label>49</label><mixed-citation publication-type="other" xlink:type="simple">Yedidia, J.S., Freeman, W.T. and Weiss, Y. (2003) Understanding Belief Propagation and Its Generalizations. Exploring Artificial Intelligence in the New Millennium, 8, 236-239.</mixed-citation></ref><ref id="scirp.61283-ref50"><label>50</label><mixed-citation publication-type="other" xlink:type="simple">Yedidia, J.S., Freeman, W.T. and Weiss, Y. (2005) Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms. IEEE Transactions on Information Theory, 51, 2282-2312.  
http://dx.doi.org/10.1109/TIT.2005.850085</mixed-citation></ref><ref id="scirp.61283-ref51"><label>51</label><mixed-citation publication-type="other" xlink:type="simple">Wagholikar, K.B., Torii, M., Jonnalagadda, S. and Liu, H. (2013) Pooling Annotated Corpora for Clinical Concept Extraction. Journal of Biomedical Semantics, 4, 3. http://dx.doi.org/10.1186/2041-1480-4-3</mixed-citation></ref><ref id="scirp.61283-ref52"><label>52</label><mixed-citation publication-type="other" xlink:type="simple">Baum, L.E. and Petrie, T. (1966) Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics, 37, 1554-1563. http://dx.doi.org/10.1214/aoms/1177699147</mixed-citation></ref><ref id="scirp.61283-ref53"><label>53</label><mixed-citation publication-type="other" xlink:type="simple">Rabiner, L.R. (1989) A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77, 257-286. http://dx.doi.org/10.1109/5.18626</mixed-citation></ref><ref id="scirp.61283-ref54"><label>54</label><mixed-citation publication-type="other" xlink:type="simple">Sutton, C. and McCallum, A. (2011) An Introduction to Conditional Random Fields. Machine Learning, 4, 267-373.  
http://dx.doi.org/10.1561/2200000013</mixed-citation></ref><ref id="scirp.61283-ref55"><label>55</label><mixed-citation publication-type="other" xlink:type="simple">Lafferty, J., McCallum, A. and Pereira, F.C. (2001) Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.</mixed-citation></ref><ref id="scirp.61283-ref56"><label>56</label><mixed-citation publication-type="other" xlink:type="simple">Wallach, H.M. (2004) Conditional Random Fields: An Introduction. Technical Reports (CIS), 22.</mixed-citation></ref><ref id="scirp.61283-ref57"><label>57</label><mixed-citation publication-type="other" xlink:type="simple">Srebro, N. and Jaakkola, T. (2003) Weighted Low-Rank Approximations. Proceedings of the 20th International Conference on Machine Learning, ICML 2003, 3, 720-727.</mixed-citation></ref><ref id="scirp.61283-ref58"><label>58</label><mixed-citation publication-type="other" xlink:type="simple">Diestel, R. (2005) Graph Theory. Springer-Verlag, New York.</mixed-citation></ref><ref id="scirp.61283-ref59"><label>59</label><mixed-citation publication-type="other" xlink:type="simple">Rose, T., Stevenson, M. and Whitehead, M. (2002) The Reuters Corpus Volume 1—From Yesterday’s News to Tomorrow’s Language Resources. Proceedings of the 3rd International Conference on Language Resources and Evaluation, LREC 2002, 2, 827-832.</mixed-citation></ref><ref id="scirp.61283-ref60"><label>60</label><mixed-citation publication-type="book" xlink:type="simple">Hersh, W., Buckley, C., Leone, T.J. and Hickam, D. (1994) OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. In: Croft, B.W. and van Rijsbergen, C.J., Eds., SIGIR’94, Springer, London, 192-201. http://dx.doi.org/10.1007/978-1-4471-2099-5_20</mixed-citation></ref><ref id="scirp.61283-ref61"><label>61</label><mixed-citation publication-type="other" xlink:type="simple">Xu, W. and Gong, Y. (2004) Document Clustering by Concept Factorization. 
Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, 202-209.  
http://dx.doi.org/10.1145/1008992.1009029</mixed-citation></ref><ref id="scirp.61283-ref62"><label>62</label><mixed-citation publication-type="other" xlink:type="simple">Dalli, A. (2003) Adaptation of the F-Measure to Cluster Based Lexicon Quality Evaluation. Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing: Are Evaluation Methods, Metrics and Resources Reusable? Association for Computational Linguistics, Stroudsburg, 51-56.</mixed-citation></ref><ref id="scirp.61283-ref63"><label>63</label><mixed-citation publication-type="other" xlink:type="simple">Kummamuru, K., Lotlikar, R., Roy, S., Singal, K. and Krishnapuram, R. (2004) A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results. Proceedings of the 13th International Conference on World Wide Web, ACM, New York, 658-665. http://dx.doi.org/10.1145/988672.988762</mixed-citation></ref><ref id="scirp.61283-ref64"><label>64</label><mixed-citation publication-type="other" xlink:type="simple">Fung, B.C., Wang, K. and Ester, M. (2003) Hierarchical Document Clustering Using Frequent Itemsets. Proceedings of the 2003 SIAM International Conference on Data Mining, 3, 59-70. http://dx.doi.org/10.1137/1.9781611972733.6</mixed-citation></ref><ref id="scirp.61283-ref65"><label>65</label><mixed-citation publication-type="other" xlink:type="simple">Steinbach, M., Karypis, G. and Kumar, V. (2000) A Comparison of Document Clustering Techniques. KDD Workshop on Text Mining, 400, 525-526.</mixed-citation></ref><ref id="scirp.61283-ref66"><label>66</label><mixed-citation publication-type="other" xlink:type="simple">Cai, L. and Hofmann, T. (2003) Text Categorization by Boosting Automatically Extracted Concepts. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, 182-189. 
http://dx.doi.org/10.1145/860435.860470</mixed-citation></ref><ref id="scirp.61283-ref67"><label>67</label><mixed-citation publication-type="other" xlink:type="simple">Chiang, I.J. (2007) Discover the Semantic Topology in High-Dimensional Data. Expert Systems with Applications, 33, 256-262. http://dx.doi.org/10.1016/j.eswa.2006.05.033</mixed-citation></ref><ref id="scirp.61283-ref68"><label>68</label><mixed-citation publication-type="other" xlink:type="simple">Hofmann, T. (1999) Probabilistic Latent Semantic Indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, 50-57.  
http://dx.doi.org/10.1145/312624.312649</mixed-citation></ref><ref id="scirp.61283-ref69"><label>69</label><mixed-citation publication-type="other" xlink:type="simple">Palla, G., Derenyi, I., Farkas, I. and Vicsek, T. (2005) Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society. Nature, 435, 814-818. http://dx.doi.org/10.1038/nature03607</mixed-citation></ref><ref id="scirp.61283-ref70"><label>70</label><mixed-citation publication-type="other" xlink:type="simple">Dhillon, I.S. and Modha, D.S. (2001) Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning, 42, 143-175. http://dx.doi.org/10.1023/A:1007612920971</mixed-citation></ref><ref id="scirp.61283-ref71"><label>71</label><mixed-citation publication-type="other" xlink:type="simple">Shi, J. and Malik, J. (2000) Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888-905. http://dx.doi.org/10.1109/34.868688</mixed-citation></ref><ref id="scirp.61283-ref72"><label>72</label><mixed-citation publication-type="other" xlink:type="simple">Kim, J.D., Ohta, T., Tateisi, Y. and Tsujii, J.I. (2003) GENIA Corpus—A Semantically Annotated Corpus for Bio-Textmining. Bioinformatics, 19, i180-i182. http://dx.doi.org/10.1093/bioinformatics/btg1023</mixed-citation></ref><ref id="scirp.61283-ref73"><label>73</label><mixed-citation publication-type="other" xlink:type="simple">Cohen, W.W. and Richman, J. (2002) Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, 475-480. http://dx.doi.org/10.1145/775047.775116</mixed-citation></ref><ref id="scirp.61283-ref74"><label>74</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Lipscomb</surname><given-names>C.E.</given-names></name>, <etal>et al</etal>. 
(<year>2000</year>) <article-title>Medical Subject Headings (MeSH)</article-title>. <source>Bulletin of the Medical Library Association</source>, <volume>88</volume>, <fpage>265</fpage>-<lpage>266</lpage>.</mixed-citation></ref><ref id="scirp.61283-ref75"><label>75</label><mixed-citation publication-type="other" xlink:type="simple">Lowe, H.J. and Barnett, G.O. (1994) Understanding and Using the Medical Subject Headings (MeSH) Vocabulary to Perform Literature Searches. JAMA, 271, 1103-1108. http://dx.doi.org/10.1001/jama.1994.03510380059038</mixed-citation></ref></ref-list></back></article>