<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">CS</journal-id><journal-title-group><journal-title>Circuits and Systems</journal-title></journal-title-group><issn pub-type="epub">2153-1285</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/cs.2016.79217</article-id><article-id pub-id-type="publisher-id">CS-69041</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject><subject> Engineering</subject><subject> Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  A Multi-Classifier Based Prediction Model for Phishing Emails Detection Using Topic Modelling, Named Entity Recognition and Image Processing
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>C.</surname><given-names>Emilin Shyni</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>S.</surname><given-names>Sarju</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>S.</surname><given-names>Swamynathan</given-names></name><xref ref-type="aff" rid="aff3"><sup>3</sup></xref></contrib></contrib-group><aff id="aff3"><addr-line>Department of Information Science and Technology, Anna University, Chennai, India</addr-line></aff><aff id="aff2"><addr-line>Department of Computer Science, St. Joseph’s College of Engineering and Technology, Kerala, India</addr-line></aff><aff id="aff1"><addr-line>Department of Information Technology, KCG College of Technology, Chennai, India</addr-line></aff><author-notes><corresp id="cor1">* E-mail:<email>shyniedwin@gmail.com(CES)</email>;</corresp></author-notes><pub-date pub-type="epub"><day>05</day><month>07</month><year>2016</year></pub-date><volume>07</volume><issue>09</issue><fpage>2507</fpage><lpage>2520</lpage><history><date date-type="received"><day>31</day>	<month>March</month>	<year>2016</year></date><date date-type="rev-recd"><day>accepted</day>	<month>21</month>	<year>April</year>	</date><date date-type="accepted"><day>26</day>	<month>July</month>	<year>2016</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). 
http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  Phishing is the act of attempting to steal a user’s financial and personal information, such as credit card numbers and passwords, by posing as a trustworthy party in online communication. Attackers may direct users to a fake website that appears legitimate and then use that site to gather confidential information. To protect users from social engineering techniques such as phishing, various measures have been developed, including improvements in technical security. In this paper, we propose a new technique, namely, “A Prediction Model for the Detection of Phishing e-mails using Topic Modelling, Named Entity Recognition and Image Processing”. The features extracted are Topic Modelling features, Named Entity features and Structural features. A multi-classifier prediction model is used to detect phishing mails. Experimental results show that the multi-classifier technique outperforms the single-classifier prediction techniques. The resulting detection accuracy for phishing e-mails reaches 99%, with the highest False Positive Rate being 2.1%.
 
</p></abstract><kwd-group><kwd>Phishing</kwd><kwd>Conditional Random Field Classifier</kwd><kwd>Latent Dirichlet Allocation</kwd><kwd>Natural Language Processing</kwd><kwd>Machine Learning</kwd><kwd>Image Segmentation</kwd><kwd>Image Processing</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>The internet has a great influence on people’s daily lives. The use of internet-based services, such as online banking and online purchasing, has increased manifold in the past few years. The use of social networking sites and other similar services has also increased greatly in the last decade. Taking advantage of this dependence, social engineering schemes use spoofed emails to steal personal information (identity theft) from users. The email directs the user, via a hyperlink, to a fake web page owned by attackers that looks very similar to a legitimate site. Once the user enters any personal or financial information in that web page, it becomes available to the attackers, who use it to commit fraud and carry out illegal financial transactions. Technical subterfuge schemes trick users into downloading malware onto their computers by clicking on a link embedded in a spoofed email. Using this malware, attackers steal users’ credentials from their own devices. The Anti-Phishing Working Group [<xref ref-type="bibr" rid="scirp.69041-ref1">1</xref>] reported that at least 74,127 unique phishing websites were detected between January 1, 2013 and March 31, 2013.</p><p>In this paper, Topic Modelling features, Named Entity features and Structural features are used to detect phishing emails. Images from the legitimate site and the phished site are extracted, and an image processing technique is used to measure the similarity of those images. The Topic Modelling features were extracted using GibbsLDA, while the Named Entity features were extracted using the CRF Classifier. 
A total of 61 features were extracted and used for training the classifiers. The multi-classifier prediction model is built using Random Forest (RF), Support Vector Machine (SVM) and LogitBoost. The dataset is a corpus of 5260 e-mails comprising both phished and legitimate e-mails, combined in different proportions. Performance is evaluated using Precision, Recall, TPR, FPR and F-Measure.</p><p>The rest of the paper is organized as follows: Related Work in Section 2, Proposed Method in Section 3, Experiments and Results in Section 4, and Discussion of the experiments along with planned future work in Section 5.</p></sec><sec id="s2"><title>2. Related Work</title><p>Phishing e-mails are a particular kind of spam used to obtain personal and financial data from users, so detecting and neutralizing them has higher priority than for other kinds of spam. Phishing mail has some distinctive characteristics compared with legitimate mail. For instance, it is usually not addressed to any specific user (spear phishing mails are an exception), it usually concerns a financial institution, and its content often includes terms associated with finance and urgency.</p><p>Emails are not well-structured documents; they are semi-structured. Chandrasekaran [<xref ref-type="bibr" rid="scirp.69041-ref2">2</xref>] showed that the structural properties of an email can be used to differentiate between a phished e-mail and a legitimate one. The study used 23 style marker features, two structural attributes and 18 functional words to classify e-mails. The accuracy of the model was assessed using the Support Vector Machine (SVM) classifier. 
However, functional words did not help to classify e-mails effectively, because attackers can substitute synonyms for those words.</p><p>Attackers use various techniques to defeat phishing detection mechanisms that rely on the frequency of words related to finance and urgency. Therefore, alternative solutions are needed to detect phishing mails. Topic modelling is a machine learning and natural language processing technique that can be used to identify the topics in a given e-mail. For instance, the topic “finance” contains monetary terms such as “cash”, “money” and “amount”. Instead of measuring the frequency of individual monetary words, we measure the prevalence of the topic in the given mail. Landauer [<xref ref-type="bibr" rid="scirp.69041-ref3">3</xref>] presented a topic modelling technique called Latent Semantic Analysis (LSA), which groups words into topics based on the Singular Value Decomposition (SVD) of the term/document matrix. Hofmann [<xref ref-type="bibr" rid="scirp.69041-ref4">4</xref>] proposed Probabilistic Latent Semantic Indexing (PLSI), an alternative topic modelling procedure with a strong statistical foundation. Latent Dirichlet Allocation (LDA), presented by Blei [<xref ref-type="bibr" rid="scirp.69041-ref5">5</xref>], is a topic modelling technique based on a generative probabilistic model. LDA forms topics based on the context of the words; that is, it can differentiate between a “river bank” and a “financial bank”.</p><p>The majority of phishing mails are not targeted at any specific individual, and phished mails generally target financial organizations. Named Entity Recognition (NER) assigns predefined labels, such as person and organization names, to parts of the given content; this characteristic of NER can be used to recognize phished mails. 
Erik [<xref ref-type="bibr" rid="scirp.69041-ref6">6</xref>] proposed a language-independent named entity recognition scheme called CoNLL-2003, capable of labelling words that are names of individuals, locations and organizations. Nadeau [<xref ref-type="bibr" rid="scirp.69041-ref7">7</xref>] carried out a survey of NER and classification, and found that CoNLL-2003 is well suited for labelling English and German words.</p><p>Spatial layout similarities of web pages are also used [<xref ref-type="bibr" rid="scirp.69041-ref8">8</xref>] to distinguish between a legitimate site and a phished site. An R-tree is constructed and special queries are used to compare the similarity of the pages.</p><p>The goal of this work is to combine structural features, topic modelling features and Named Entity Recognition features for phishing email detection, and thereby improve the accuracy of the detection mechanism.</p></sec><sec id="s3"><title>3. Proposed Method</title><p>In this section, we introduce a new methodology that incorporates natural language processing, machine learning and image processing in the detection of phished emails, as shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>.</p><sec id="s3_1"><title>3.1. Feature Construction</title><p>Each phishing mail in the Multipurpose Internet Mail Extensions (MIME) format is parsed into an HTML file to extract structural features. An HTML parser is then used to convert the HTML file into plain text, which in turn is used to extract the Named Entity features and Topic Modelling features. Topic Modelling features are extracted using GibbsLDA [<xref ref-type="bibr" rid="scirp.69041-ref9">9</xref>] and the Named Entities are extracted using the CRF Classifier. A total of 61 features are used to detect a phishing email. 
The accuracy of the detection model is evaluated using different machine learning classification algorithms.</p><fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig1">Figure 1</xref></label><caption><title>Methodology for phishing detection in corpus email</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/37-7600672x6.png"/></fig><p>Raw email data are typically stored in the MIME format. In this paper, the words and hyperlinks present in the body of the email are used to extract the features. Thus, the body text, including hyperlinks, is extracted using a parser. Two types of parsers are used: the MIME parser and the HTML parser.</p><p>MIME parser: Apache James Mime4j [<xref ref-type="bibr" rid="scirp.69041-ref10">10</xref>] is used to build the parser that extracts the content from e-mail message streams in plain Multipurpose Internet Mail Extensions (MIME) format. It deals only with the structure of the message stream and is designed to be extremely tolerant of messages that violate the MIME standards. Structural features are extracted from the parsed document.</p><p>HTML parser: HTML documents are included in MIME messages as an HTML subpart of the email body. When the MIME parser detects an HTML subpart, it invokes the HTML parser to separate the text, style sheets, hyperlinks and scripts. The extracted text is passed to the CRF Classifier and to the Topic Modelling module.</p></sec><sec id="s3_2"><title>3.2. Named Entity Recognition (NER)</title><p>NER labels sequences of words in a text that are the names of things, such as person and company names, or gene and protein names. A Conditional Random Field (CRF) is used to extract such named entities from the email text, using the NER software developed by Stanford’s Natural Language Processing Group [<xref ref-type="bibr" rid="scirp.69041-ref11">11</xref>]. 
Ramanathan [<xref ref-type="bibr" rid="scirp.69041-ref12">12</xref>] describes in detail the use of Named Entity Recognition for phishing detection.</p><p>Conditional Random Fields</p><p>Conditional Random Fields (CRFs), used in machine learning for structured prediction, are a class of statistical modelling methods.</p><p>Given a vector of input variables <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/37-7600672x7.png" xlink:type="simple"/></inline-formula> and a vector of output variables <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/37-7600672x8.png" xlink:type="simple"/></inline-formula>, a discriminative model estimates the conditional probability P(Y|X), whereas a generative model estimates the joint probability P(Y, X). The CRF is an undirected discriminative model that does not include a model of P(X). Lafferty [<xref ref-type="bibr" rid="scirp.69041-ref13">13</xref>] defined the probability of a particular label sequence Y, given the observation sequence X, as a normalized product of potential functions, each of the form</p><disp-formula id="scirp.69041-formula643"><label>(1)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/37-7600672x9.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/37-7600672x10.png" xlink:type="simple"/></inline-formula> is a transition feature function of the observation sequence and the tags at sites i and i − 1 in the tag sequence; <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/37-7600672x11.png" xlink:type="simple"/></inline-formula> is a state feature function of the tag at site i and the observation sequence; <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/37-7600672x12.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/37-7600672x13.png" xlink:type="simple"/></inline-formula> are parameters to be estimated from the 
training data.</p><p>The probability of a tag sequence Y given an observation sequence X [<xref ref-type="bibr" rid="scirp.69041-ref14">14</xref>] can be written as</p><disp-formula id="scirp.69041-formula644"><label>(2)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/37-7600672x14.png"  xlink:type="simple"/></disp-formula><p>where Z(X) is the normalization factor. The log-likelihood of the CRF is given by</p><disp-formula id="scirp.69041-formula645"><label>(3)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/37-7600672x15.png"  xlink:type="simple"/></disp-formula><p>Conditional Random Fields Classifier (CRF Classifier)</p><p>Stanford’s NER [<xref ref-type="bibr" rid="scirp.69041-ref11">11</xref>] software is published with pre-trained models that have been trained on the CoNLL, MUC6, MUC7 and ACE datasets. The CoNLL 2003 English training data is the data set used in this work. The CRF Named Entity Labeller component labels each word as one of three entity types, namely, location, organization and person. The output of the CRF Named Entity Labeller is used to extract the Named Entity features, which are the first set of features used for phishing detection. <xref ref-type="fig" rid="fig2">Figure 2</xref> shows a Named Entity Recognition result, in which locations, organizations and persons are labelled.</p><p>Topic Modelling</p><p>LDA [<xref ref-type="bibr" rid="scirp.69041-ref5">5</xref>] is a Natural Language Processing (NLP) method used to extract Topics from a collection of documents. Documents are modelled via a hidden Dirichlet random variable that specifies a probability distribution on a latent, low-dimensional Topic space. 
Documents are represented as random mixtures over latent Topics, and each Topic is represented by a distribution over words.</p><fig id="fig2"  position="float"><label><xref ref-type="fig" rid="fig2">Figure 2</xref></label><caption><title>Named entity recognition from the e-mail</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/37-7600672x16.png"/></fig><p>Given the parameters α and β, the joint distribution of a Topic mixture θ, a set of N words w, and a set of N Topics z is given by:</p><disp-formula id="scirp.69041-formula646"><label>(4)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/37-7600672x17.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/37-7600672x18.png" xlink:type="simple"/></inline-formula> is θ<sub>i</sub> for the unique i such that z<sub>n</sub><sup>i</sup> = 1. The marginal distribution of a document is obtained by integrating over θ and summing over z, which is expressed as follows:</p><disp-formula id="scirp.69041-formula647"><label>(5)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/37-7600672x19.png"  xlink:type="simple"/></disp-formula><p>The probability of the corpus D is obtained by taking the product of the marginal probabilities of the single documents, and is expressed as follows:</p><disp-formula id="scirp.69041-formula648"><label>(6)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/37-7600672x20.png"  xlink:type="simple"/></disp-formula><p>The parameters α and β are corpus-level parameters. The document-level variables θ<sub>d</sub> are sampled once per document. The word-level variables w<sub>dn</sub> and z<sub>dn</sub> are sampled once for each word in the document. 
Several algorithms have been developed for LDA inference, which requires estimating the posterior probability distribution of the hidden Topic variables.</p><p>GibbsLDA</p><p>Topic Modelling is implemented using JGibbLDA [<xref ref-type="bibr" rid="scirp.69041-ref9">9</xref>]. Because parameter inference requires less computational time than parameter estimation, JGibbLDA focuses on inferring the hidden/latent Topic structures of unseen data from a model estimated using GibbsLDA++. This component consists of the following sub-components.</p><p>The email text extracted by the HTML parser is pre-processed and passed to the Topic Modelling module, where a Term Document Frequency (TDF) matrix is created. This TDF matrix is used to train the LDA model. LDA requires the number of Topics, K, to be set in advance. In addition, LDA requires two Dirichlet parameters to be specified in advance: α, the parameter of the Dirichlet prior on the per-document Topic distributions, and β, the parameter of the Dirichlet prior on the per-topic word distributions.</p><p>The LDA Topic Probability Extractor extracts the word/topic and topic/document distribution probabilities computed by the LDA model inference sub-component. The topic/document distribution probabilities are used as the second set of features to build the classifier. By using these probability distributions instead of actual words, the classifier is expected to be quite robust in detecting phishing attacks. <xref ref-type="table" rid="table1">Table 1</xref> shows the topic distribution in a given email.</p></sec><sec id="s3_3"><title>3.3. Structural Features</title><p>Emails have various structural features; 10 of them are used in this paper as the third set of features for detecting phished emails. <xref ref-type="table" rid="table2">Table 2</xref> shows the extracted structural features.</p><p>The URL targeted by the email is extracted and checked against the legitimate site. 
All the images from the original site and the targeted site are used for the similarity measure. The images are resized to 300 &#215; 300 pixels and segmented into 25 RGB triplets, as shown in <xref ref-type="fig" rid="fig3">Figure 3</xref>. Each segment has a 30 &#215; 30 pixel size and a 25 &#215; 3 feature vector is created. The similarity is calculated using the Euclidean distance.</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title>Topic distribution using LDA</title></caption><table><thead><tr><th align="center" valign="middle" >Topic</th><th align="center" valign="middle"  colspan="5"  >Representative words</th></tr></thead><tbody><tr><td align="center" valign="middle" >Topic 0</td><td align="center" valign="middle" >Account</td><td align="center" valign="middle" >Paypal</td><td align="center" valign="middle" >Yahoo</td><td align="center" valign="middle" >Business</td><td align="center" valign="middle" >companies</td></tr><tr><td align="center" valign="middle" >Topic 1</td><td align="center" valign="middle" >Subject</td><td align="center" valign="middle" >from</td><td align="center" valign="middle" >email</td><td align="center" valign="middle" >send</td><td align="center" valign="middle" >contact</td></tr><tr><td align="center" valign="middle" >Topic 2</td><td align="center" valign="middle" >Contact</td><td align="center" valign="middle" >People</td><td align="center" valign="middle" >Unsubscribe</td><td align="center" valign="middle" >Information</td><td align="center" valign="middle" >personal</td></tr><tr><td align="center" valign="middle" >Topic 3</td><td align="center" valign="middle" >Investment</td><td align="center" valign="middle" >money</td><td align="center" valign="middle" >amount</td><td align="center" valign="middle" >Bank</td><td align="center" valign="middle" >account</td></tr><tr><td align="center" valign="middle" >Topic 4</td><td align="center" valign="middle" >click</td><td align="center" valign="middle" >urgent</td><td align="center" valign="middle" >invalidate</td><td align="center" valign="middle" >important</td><td align="center" valign="middle" >verify</td></tr></tbody></table></table-wrap><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title>Structural features extracted</title></caption><table><thead><tr><th align="center" valign="middle" >No.</th><th align="center" valign="middle" >Feature Description</th></tr></thead><tbody><tr><td align="center" valign="middle" >1</td><td align="left" valign="middle" >Binary feature indicating whether the word “Dear” is present or not</td></tr><tr><td align="center" valign="middle" >2</td><td align="left" valign="middle" >Binary feature indicating whether an HTML tag is present or not</td></tr><tr><td align="center" valign="middle" >3</td><td align="left" valign="middle" >Binary feature indicating whether JavaScript has been used or not</td></tr><tr><td align="center" valign="middle" >4</td><td align="left" valign="middle" >Binary feature indicating whether the tag “a href” is present or not</td></tr><tr><td align="center" valign="middle" >5</td><td align="left" valign="middle" >Binary feature indicating whether CGI has been used or not</td></tr><tr><td align="center" valign="middle" >6</td><td align="left" valign="middle" >Binary feature indicating whether an opening table tag is present or not</td></tr><tr><td align="center" valign="middle" >7</td><td align="left" valign="middle" >Binary feature indicating whether an OnClick event is present or not</td></tr><tr><td align="center" valign="middle" >8</td><td align="left" valign="middle" >Number of HTML opening comment tags</td></tr><tr><td align="center" valign="middle" >9</td><td align="left" valign="middle" >Binary feature indicating whether the text colour has been set to white</td></tr><tr><td align="center" valign="middle" >10</td><td align="left" valign="middle" >Binary feature indicating whether a URL contains “&amp;”, “%” or “@”</td></tr><tr><td align="center" valign="middle" >11</td><td align="left" valign="middle" >Binary feature indicating whether a URL contains an IP address</td></tr><tr><td align="center" valign="middle" >12</td><td align="left" valign="middle" >Binary feature indicating the image similarity between an original site and a phished one, using image segmentation</td></tr></tbody></table></table-wrap><p>The distance from a feature vector A to itself is zero. The maximum dissimilarity is given by Equation (7).</p><disp-formula id="scirp.69041-formula649"><label>(7)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/37-7600672x21.png"  xlink:type="simple"/></disp-formula><p>where D is the dissimilarity value. This dissimilarity value is also used as one of the features to detect phished mails. <xref ref-type="fig" rid="fig4">Figure 4</xref> and <xref ref-type="fig" rid="fig5">Figure 5</xref> show the original page and the page targeted by the mail, respectively. 
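The segment-based comparison can be illustrated by a minimal Python sketch, given here under stated assumptions rather than as the implementation used in this work: the image is assumed to be a square list of rows of (R, G, B) tuples, a 5 &#215; 5 grid is assumed so that 25 segments are obtained, each segment is summarized by its mean colour, and the helper names segment_means, dissimilarity and solid are hypothetical.
<preformat preformat-type="code">
```python
import math


def segment_means(image, grid=5):
    """Return grid*grid mean (R, G, B) triplets for a square RGB image.

    `image` is a list of rows, each row a list of (r, g, b) tuples. Its side
    length is assumed to be divisible by `grid` (an assumption of this sketch).
    """
    side = len(image)
    cell = side // grid
    features = []
    for gy in range(grid):
        for gx in range(grid):
            pixels = [image[y][x]
                      for y in range(gy * cell, (gy + 1) * cell)
                      for x in range(gx * cell, (gx + 1) * cell)]
            count = len(pixels)
            # One mean value per colour channel for this segment.
            features.append(tuple(sum(p[c] for p in pixels) / count
                                  for c in range(3)))
    return features


def dissimilarity(features_a, features_b):
    """Euclidean distance between two flattened segment feature vectors."""
    squared = sum((a - b) ** 2
                  for seg_a, seg_b in zip(features_a, features_b)
                  for a, b in zip(seg_a, seg_b))
    return math.sqrt(squared)


def solid(color, side=300):
    """Hypothetical test image: a side x side image of a single colour."""
    return [[color] * side for _ in range(side)]


legit = segment_means(solid((255, 255, 255)))
same = segment_means(solid((255, 255, 255)))
dark = segment_means(solid((0, 0, 0)))
print(dissimilarity(legit, same))  # identical images give 0.0
print(dissimilarity(legit, dark))  # differing images give a positive value
```
</preformat>
In this sketch, a distance of zero means that the two images have identical segment colours, and larger values indicate greater dissimilarity. 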
The dissimilarity measure between the two pages is shown in <xref ref-type="table" rid="table3">Table 3</xref>.</p></sec><sec id="s3_4"><title>3.4. Prediction Model</title><p>Multiple-classifier systems try to exploit the differing local behaviour of the base learners to enhance the accuracy of the overall learning system. A multiple classifier is a set of classifiers whose individual predictions are combined in some way to classify new examples. Combining classifiers solves three problems, and <xref ref-type="fig" rid="fig6">Figure 6</xref> shows the classifier combining steps.</p><p>The prediction model predicts the class label (Phished/Legitimate) of a given mail based on a training set constructed from the publicly available email corpus. The features extracted from the given mail are used to construct the ARFF file, which is the input to the prediction model; the output is the class label. Three classifiers are used to construct the prediction model: Random Forest (RF), Support Vector Machine (SVM) and LogitBoost. Each classifier predicts the category to which the mail belongs, and the final decision is taken by majority voting. The Fk in the figure represents the feature sets; the public email corpus is used to collect the features and to train the classifiers. When a new mail arrives, the prediction model assigns a class label based on the training set.</p><p>The majority voting algorithm is shown in <xref ref-type="fig" rid="fig7">Figure 7</xref>; its parameters are the Classifiers (C) and the class Labels (L). 
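A minimal Python sketch of such a vote is given below for illustration; it is not the algorithm of the figure reproduced verbatim, and the function name majority_vote and the classifier keys are hypothetical.
<preformat preformat-type="code">
```python
from collections import Counter


def majority_vote(predictions):
    """Return the class label predicted by most classifiers.

    `predictions` maps a classifier name to its predicted label. With three
    classifiers and two labels, a tie cannot occur.
    """
    counts = Counter(predictions.values())
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical predictions for a single mail.
votes = {"RandomForest": "Phished", "SVM": "Legitimate", "LogitBoost": "Phished"}
print(majority_vote(votes))  # Phished
```
</preformat>
An odd number of base classifiers over two labels guarantees that one label always wins the vote. 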
The algorithm returns the majority class label.</p><fig id="fig3"  position="float"><label><xref ref-type="fig" rid="fig3">Figure 3</xref></label><caption><title> Image</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/37-7600672x22.png"/></fig><fig id="fig4"  position="float"><label><xref ref-type="fig" rid="fig4">Figure 4</xref></label><caption><title> Legitimate page</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/37-7600672x23.png"/></fig><fig id="fig5"  position="float"><label><xref ref-type="fig" rid="fig5">Figure 5</xref></label><caption><title> Phished page</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/37-7600672x24.png"/></fig><fig id="fig6"  position="float"><label><xref ref-type="fig" rid="fig6">Figure 6</xref></label><caption><title> Classifier combining steps</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/37-7600672x25.png"/></fig><fig id="fig7"  position="float"><label><xref ref-type="fig" rid="fig7">Figure 7</xref></label><caption><title> Majority voting algorithm</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/37-7600672x26.png"/></fig><table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title> Dissimilarity calculation</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Legitimate Page Image Name</th><th align="center" valign="middle" >Phished Page Image Name</th><th align="center" valign="middle" >Dissimilarity</th></tr></thead><tr><td align="center" valign="middle" >Logo</td><td align="center" valign="middle" >logo</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" 
>bigflix_hp_small</td><td align="center" valign="middle" >bigflix_hp_small</td><td align="center" valign="middle" >15.234</td></tr><tr><td align="center" valign="middle" >cc_cashback_small</td><td align="center" valign="middle" >cc_cashback_small</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >hsbc-advance</td><td align="center" valign="middle" >hsbc-advance</td><td align="center" valign="middle" >7.892</td></tr><tr><td align="center" valign="middle" >PB_Bigfix</td><td align="center" valign="middle" >PB_Bigfix</td><td align="center" valign="middle" >3.529</td></tr><tr><td align="center" valign="middle" >security_device</td><td align="center" valign="middle" >security_device</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >Total Dissimilarity</td><td align="center" valign="middle"  colspan="2"  >26.655</td></tr></tbody></table></table-wrap></sec></sec><sec id="s4"><title>4. Experiments</title><p>In this section, the performance of the proposed methodology is evaluated and the results are reported. The methodology is evaluated, using openly available standard datasets containing phishing and non-phishing data. The Evaluation of the phishing detection is carried out on email datasets using different classifiers named below.</p><sec id="s4_1"><title>4.1. Data Set Description</title><p>The data used for the evaluation of the proposed system has been obtained from the data sets available in the public domain [<xref ref-type="bibr" rid="scirp.69041-ref15">15</xref>] - [<xref ref-type="bibr" rid="scirp.69041-ref17">17</xref>] . The data set contains 5260 emails in all, and this includes both phished and legitimate mails. The composite mixture of the phished and legitimate mails is given as the data set (<xref ref-type="table" rid="table4">Table 4</xref>), and the features are extracted. 
These features are used as the input to the classifiers, for measuring their performance.</p></sec><sec id="s4_2"><title>4.2. Training and Testing</title><p>The CRF Classifier (Stanford NER, 2013) is trained using the CoNLL 2003 English training data. Topic modelling is done by using the JGibbLDA with the following parameters, Dirichlet prior on the per-document Topic distributions (α), Dirichlet prior on the per-topic word distribution (β), Number of Topics (k) and Number of iterations (i), as shown in <xref ref-type="table" rid="table5">Table 5</xref>. Structural features were also extracted from the email. All 61 features were used to construct the ARFF file, which is the input file format of the WEKA. The k-fold cross validation was used to build the classifier, with a “k” value of 10. Thus 90% of the data is used to build the model, and the remaining 10% used as testing data.</p><table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title> Data sets</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  rowspan="2"  >Data Set 1</th><th align="center" valign="middle" >Phished</th><th align="center" valign="middle" >50%</th></tr></thead><tr><td align="center" valign="middle" >Legitimate</td><td align="center" valign="middle" >50%</td></tr><tr><td align="center" valign="middle"  rowspan="2"  >Data Set 2</td><td align="center" valign="middle" >Phished</td><td align="center" valign="middle" >40%</td></tr><tr><td align="center" valign="middle" >Legitimate</td><td align="center" valign="middle" >60%</td></tr><tr><td align="center" valign="middle"  rowspan="2"  >Data Set 3</td><td align="center" valign="middle" >Phished</td><td align="center" valign="middle" >30%</td></tr><tr><td align="center" valign="middle" >Legitimate</td><td align="center" valign="middle" >70%</td></tr><tr><td align="center" valign="middle"  rowspan="2"  >Data Set 4</td><td align="center" valign="middle" >Phished</td><td 
align="center" valign="middle" >20%</td></tr><tr><td align="center" valign="middle" >Legitimate</td><td align="center" valign="middle" >80%</td></tr><tr><td align="center" valign="middle"  rowspan="2"  >Data Set 5</td><td align="center" valign="middle" >Phished</td><td align="center" valign="middle" >10%</td></tr><tr><td align="center" valign="middle" >Legitimate</td><td align="center" valign="middle" >90%</td></tr></tbody></table></table-wrap><table-wrap id="table5" ><label><xref ref-type="table" rid="table5">Table 5</xref></label><caption><title> LDA parameter values</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Parameter</th><th align="center" valign="middle" >Value</th></tr></thead><tr><td align="center" valign="middle" >Dirichlet Prior on Per-Document Topic Distributions (α)</td><td align="center" valign="middle" >0.5</td></tr><tr><td align="center" valign="middle" >Dirichlet Prior on Per-Topic Word Distributions (β)</td><td align="center" valign="middle" >0.1</td></tr><tr><td align="center" valign="middle" >Number of Topics (k)</td><td align="center" valign="middle" >5</td></tr><tr><td align="center" valign="middle" >Number of Iterations (i)</td><td align="center" valign="middle" >100</td></tr></tbody></table></table-wrap></sec><sec id="s4_3"><title>4.3. Performance Analysis</title><p>The classification performance of the phishing detection is evaluated using the following standard measures. A True Positive (TP) is an instance whose actual and predicted categories are both positive, and a False Positive (FP) is an instance that is predicted as positive although its actual category is negative. Other performance metrics used in classification are accuracy, precision, recall and F-measure. 
The Receiver Operating Characteristic (ROC) curve represents the trade-off between false positives and false negatives at different decision thresholds.</p><disp-formula id="scirp.69041-formula650"><label>(8)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/37-7600672x27.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.69041-formula651"><label>(9)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/37-7600672x28.png"  xlink:type="simple"/></disp-formula><p>where PM indicates phished mail and LM indicates legitimate mail.</p></sec><sec id="s4_4"><title>4.4. Results</title><p>The results obtained from the experimental setup are shown in <xref ref-type="table" rid="table6">Table 6</xref>, which lists the TP and FP rates of all the classifiers on the different data sets. Each classification algorithm gives different results on the different data sets (<xref ref-type="table" rid="table4">Table 4</xref>). The results show that the multi-classifier compares favourably with the individual classifiers.</p><p><xref ref-type="fig" rid="fig8">Figure 8</xref> shows the accuracy of the multi-classifier and the individual classifiers for phishing email detection. The figure indicates that the multi-classification based methodology has a higher accuracy than the others. On every data set it gives an accuracy above 96%, reaching 99%. SVM, Random Forest and LogitBoost give an accuracy above 93%, whereas the multi-classifier stays above 96%.</p><p>A comparison of the classifiers based on precision (P) and recall (R) is shown in <xref ref-type="table" rid="table7">Table 7</xref>. On all the data sets, the multi-classifier gives a higher recall rate than the individual classifiers. 
</p><fig id="fig8"  position="float"><label><xref ref-type="fig" rid="fig8">Figure 8</xref></label><caption><title> Comparison of accuracy of individual classifiers and multi-classifier</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/37-7600672x29.png"/></fig><table-wrap id="table6" ><label><xref ref-type="table" rid="table6">Table 6</xref></label><caption><title> Comparison of TP (True Positive)-FP (False Positive) rate of individual classifiers and multi-classifier</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  rowspan="2"  >Classifier Used</th><th align="center" valign="middle"  colspan="2"  >Data set 1</th><th align="center" valign="middle"  colspan="2"  >Data set 2</th><th align="center" valign="middle"  colspan="2"  >Data set 3</th><th align="center" valign="middle"  colspan="2"  >Data set 4</th><th align="center" valign="middle"  colspan="2"  >Data set 5</th></tr></thead><tr><td align="center" valign="middle" >TP (%)</td><td align="center" valign="middle" >FP (%)</td><td align="center" valign="middle" >TP (%)</td><td align="center" valign="middle" >FP (%)</td><td align="center" valign="middle" >TP (%)</td><td align="center" valign="middle" >FP (%)</td><td align="center" valign="middle" >TP (%)</td><td align="center" valign="middle" >FP (%)</td><td align="center" valign="middle" >TP (%)</td><td align="center" valign="middle" >FP (%)</td></tr><tr><td align="center" valign="middle" >SVM</td><td align="center" valign="middle" >95.5</td><td align="center" valign="middle" >4.5</td><td align="center" valign="middle" >96.8</td><td align="center" valign="middle" >4.0</td><td align="center" valign="middle" >96.5</td><td align="center" valign="middle" >5.3</td><td align="center" valign="middle" >97.0</td><td align="center" valign="middle" >7.3</td><td align="center" valign="middle" >98.0</td><td align="center" 
valign="middle" >13.6</td></tr><tr><td align="center" valign="middle" >Random Forest</td><td align="center" valign="middle" >93.8</td><td align="center" valign="middle" >6.3</td><td align="center" valign="middle" >95.8</td><td align="center" valign="middle" >4.3</td><td align="center" valign="middle" >96.5</td><td align="center" valign="middle" >2.5</td><td align="center" valign="middle" >97.5</td><td align="center" valign="middle" >5.3</td><td align="center" valign="middle" >98.5</td><td align="center" valign="middle" >4.6</td></tr><tr><td align="center" valign="middle" >LogitBoost</td><td align="center" valign="middle" >95.8</td><td align="center" valign="middle" >4.3</td><td align="center" valign="middle" >95.3</td><td align="center" valign="middle" >5.7</td><td align="center" valign="middle" >97.3</td><td align="center" valign="middle" >5.5</td><td align="center" valign="middle" >97.8</td><td align="center" valign="middle" >4.3</td><td align="center" valign="middle" >98.8</td><td align="center" valign="middle" >11.3</td></tr><tr><td align="center" valign="middle" >Multi-Classifier</td><td align="center" valign="middle" >96.3</td><td align="center" valign="middle" >3.8</td><td align="center" valign="middle" >97.0</td><td align="center" valign="middle" >3.7</td><td align="center" valign="middle" >97.5</td><td align="center" valign="middle" >3.9</td><td align="center" valign="middle" >99.0</td><td align="center" valign="middle" >2.1</td><td align="center" valign="middle" >98.8</td><td align="center" valign="middle" >9.0</td></tr></tbody></table></table-wrap><table-wrap id="table7" ><label><xref ref-type="table" rid="table7">Table 7</xref></label><caption><title> Comparisons of precision and recall of individual and multi-classifiers</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  rowspan="2"  >Classifier Used</th><th align="center" valign="middle"  colspan="2"  >Data set 1</th><th align="center" valign="middle"  colspan="2"  >Data set 
2</th><th align="center" valign="middle"  colspan="2"  >Data set 3</th><th align="center" valign="middle"  colspan="2"  >Data set 4</th><th align="center" valign="middle"  colspan="2"  >Data set 5</th></tr></thead><tr><td align="center" valign="middle" >Precision (%)</td><td align="center" valign="middle" >Recall (%)</td><td align="center" valign="middle" >Precision (%)</td><td align="center" valign="middle" >Recall (%)</td><td align="center" valign="middle" >Precision (%)</td><td align="center" valign="middle" >Recall (%)</td><td align="center" valign="middle" >Precision (%)</td><td align="center" valign="middle" >Recall (%)</td><td align="center" valign="middle" >Precision (%)</td><td align="center" valign="middle" >Recall (%)</td></tr><tr><td align="center" valign="middle" >SVM</td><td align="center" valign="middle" >95.5</td><td align="center" valign="middle" >4.5</td><td align="center" valign="middle" >96.8</td><td align="center" valign="middle" >4.0</td><td align="center" valign="middle" >96.5</td><td align="center" valign="middle" >5.3</td><td align="center" valign="middle" >97.0</td><td align="center" valign="middle" >7.3</td><td align="center" valign="middle" >98.0</td><td align="center" valign="middle" >13.6</td></tr><tr><td align="center" valign="middle" >Random Forest</td><td align="center" valign="middle" >93.8</td><td align="center" valign="middle" >6.3</td><td align="center" valign="middle" >95.8</td><td align="center" valign="middle" >4.3</td><td align="center" valign="middle" >96.5</td><td align="center" valign="middle" >2.5</td><td align="center" valign="middle" >97.5</td><td align="center" valign="middle" >5.3</td><td align="center" valign="middle" >98.5</td><td align="center" valign="middle" >4.6</td></tr><tr><td align="center" valign="middle" >LogitBoost</td><td align="center" valign="middle" >95.8</td><td align="center" valign="middle" >4.3</td><td align="center" valign="middle" >95.3</td><td align="center" valign="middle" >5.7</td><td 
align="center" valign="middle" >97.3</td><td align="center" valign="middle" >5.5</td><td align="center" valign="middle" >97.8</td><td align="center" valign="middle" >4.3</td><td align="center" valign="middle" >98.8</td><td align="center" valign="middle" >11.3</td></tr><tr><td align="center" valign="middle" >Multi-Classifier</td><td align="center" valign="middle" >96.3</td><td align="center" valign="middle" >3.8</td><td align="center" valign="middle" >97.0</td><td align="center" valign="middle" >3.7</td><td align="center" valign="middle" >97.5</td><td align="center" valign="middle" >3.9</td><td align="center" valign="middle" >99.0</td><td align="center" valign="middle" >2.1</td><td align="center" valign="middle" >98.8</td><td align="center" valign="middle" >9.0</td></tr></tbody></table></table-wrap><p><xref ref-type="fig" rid="fig9">Figure 9</xref> shows the comparison of all the classifiers based on the precision. In terms of precision, the multi-classifier outperforms the individual classifiers.</p><p>Finally, the performance of the SVM, LogitBoost and Random Forest classifiers is compared using the area under the Receiver Operating Characteristic (ROC) curve. The results indicate that the multi-classifier prediction outperforms the individual classifiers. 
<xref ref-type="fig" rid="fig10">Figure 10</xref> shows the ROC curves for the individual classifiers and <xref ref-type="fig" rid="fig11">Figure 11</xref> shows the ROC curve for the multi-classifier.</p><fig id="fig9"  position="float"><label><xref ref-type="fig" rid="fig9">Figure 9</xref></label><caption><title> Comparison of F-measure of classifiers</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/37-7600672x30.png"/></fig><fig id="fig10"  position="float"><label><xref ref-type="fig" rid="fig10">Figure 10</xref></label><caption><title> ROC for individual classifiers</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/37-7600672x31.png"/></fig><fig id="fig11"  position="float"><label><xref ref-type="fig" rid="fig11">Figure 11</xref></label><caption><title> ROC for multi classifiers</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/37-7600672x32.png"/></fig><p>Considering all the experimental results, the multi-classifier performs well in detecting phishing mails and is capable of overcoming the weaknesses of using the classifiers separately.</p></sec></sec><sec id="s5"><title>5. Discussion and Future Work</title><p>The present work has detailed phishing detection techniques using Gibbs LDA, the CRF classifier and image processing. In addition, the multi-classifier prediction technique overcomes the drawbacks of the individual classifiers.</p><p>Using LDA and CRF improves the performance of detecting phished emails. The CRF’s ability to automatically extract named entities from the body of the emails was instrumental in determining the legitimacy of a given mail. As the CRF extracts names based on the context in which the words appear, it is a useful tool in combating the schemes adopted by phishers. 
LDA is capable of discovering hidden topics in the phishing messages and is also efficient in handling synonyms. The data sets used contain various proportions of phished and legitimate mails, which is useful for evaluating the performance of the classifiers and helps identify the most accurate ones. The addition of structural features also improves the efficiency of phished mail detection, and the image segmentation techniques improve the overall performance of the phishing email detection. The proposed methodology achieves an accuracy of up to 99% with an FP rate of 2.1% for detecting phishing mails. Future work will explore segmentation in the image processing technique and measure the resulting accuracy in detecting phishing e-mails.</p></sec><sec id="s6"><title>Cite this paper</title><p>C. Emilin Shyni, S. Sarju, S. Swamynathan (2016) A Multi-Classifier Based Prediction Model for Phishing Emails Detection Using Topic Modelling, Named Entity Recognition and Image Processing. Circuits and Systems, 7, 2507-2520. doi: 10.4236/cs.2016.79217</p></sec></body><back><ref-list><title>References</title><ref id="scirp.69041-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">APWG (2013) Anti Phishing Working Group. http://www.antiphishing.org</mixed-citation></ref><ref id="scirp.69041-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Chandrasekaran, M., Narayanan, M. and Upadhyaya, S. (2006) Phishing Email Detection Based on Structural Properties. Proceedings of 9th Annual NYS Cyber Security Conference, Albany, 14 June 2006, 2-8.</mixed-citation></ref><ref id="scirp.69041-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Landauer, T.K. and Dumais, S.T. (1997) A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104, 211-240. 
http://dx.doi.org/10.1037/0033-295X.104.2.211</mixed-citation></ref><ref id="scirp.69041-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Hofmann, T. (1999) Probabilistic Latent Semantic Indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, 15-19 August 1999, 50-57. http://dx.doi.org/10.1145/312624.312649</mixed-citation></ref><ref id="scirp.69041-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003) Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3, 993-1022.</mixed-citation></ref><ref id="scirp.69041-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Sang, E.F.T.K. and De Meulder, F. (2003) Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. Proceedings of the Seventh Conference on Natural Language Learning, Edmonton, Canada, 31 May 2003, 142-147.</mixed-citation></ref><ref id="scirp.69041-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Zhang, W., Lu, H., Xu, B. and Yang, H. (2013) Web Phishing Detection Based on Page Spatial Layout Similarity. Informatica, 37, 231-244.</mixed-citation></ref><ref id="scirp.69041-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Nadeau, D. and Sekine, S. (2007) A Survey of Named Entity Recognition and Classification. Lingvisticae Investigationes, 30, 3-26. http://dx.doi.org/10.1075/li.30.1.03nad</mixed-citation></ref><ref id="scirp.69041-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Gibbs LDA (2013) LDA Using Gibbs Sampling. http://jgibblda.sourceforge.net/</mixed-citation></ref><ref id="scirp.69041-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Apache James Mime4J Parser (2013). 
http://james.apache.org/mime4j</mixed-citation></ref><ref id="scirp.69041-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">The Stanford Natural Language Processing Group (2013). http://nlp.stanford.edu</mixed-citation></ref><ref id="scirp.69041-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Ramanathan, V. and Wechsler, H. (2013) Phishing Detection and Impersonated Entity Discovery Using Conditional Random Field and Latent Dirichlet Allocation. Computers and Security, 34, 123-139. http://dx.doi.org/10.1016/j.cose.2012.12.002</mixed-citation></ref><ref id="scirp.69041-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Lafferty, J., McCallum, A. and Pereira, F. (2001) Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of International Conference on Machine Learning, San Francisco, 28 June-1 July 2001, 282-289.</mixed-citation></ref><ref id="scirp.69041-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Wallach, H.M. (2004) Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21.</mixed-citation></ref><ref id="scirp.69041-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Phish Tank (2013). http://www.phishtank.com</mixed-citation></ref><ref id="scirp.69041-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Phishingcorpus Homepage (2013). http://monkey.org/~jose/wiki/doku.php?id=PhishingCorpus</mixed-citation></ref><ref id="scirp.69041-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">SpamAssassin (2013). http://spamassassin.apache.org</mixed-citation></ref></ref-list></back></article>