<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JCC</journal-id><journal-title-group><journal-title>Journal of Computer and Communications</journal-title></journal-title-group><issn pub-type="epub">2327-5219</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jcc.2015.36011</article-id><article-id pub-id-type="publisher-id">JCC-57254</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  Comparative Study of the Parallelization of the Smith-Waterman Algorithm on OpenMP and Cuda C
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>madou</surname><given-names>Chaibou</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Oumarou</surname><given-names>Sie</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff2"><addr-line>Département de Mathématiques et Informatique, Université de Ouagadougou, Ouagadougou, Burkina Faso</addr-line></aff><aff id="aff1"><addr-line>Laboratoire de Mathématiques et Informatique (LAMI), Université de Ouagadougou, Ouagadougou, 
Burkina Faso</addr-line></aff><author-notes><corresp id="cor1">* E-mail:<email>chaibouam@univ-ouaga.bf(MC)</email>;<email>sie@univ-ouaga.bf(OS)</email>;</corresp></author-notes><pub-date pub-type="epub"><day>26</day><month>05</month><year>2015</year></pub-date><volume>03</volume><issue>06</issue><fpage>107</fpage><lpage>117</lpage><history><date date-type="received"><day>14</day>	<month>April</month>	<year>2015</year></date><date date-type="rev-recd"><day>accepted</day>	<month>15</month>	<year>June</year>	</date><date date-type="accepted"><day>18</day>	<month>June</month>	<year>2015</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  In this paper, we present parallel programming approaches for computing the cell values of the scoring matrix used in the Smith-Waterman algorithm for sequence alignment. This algorithm, well known in bioinformatics for its applications, is unfortunately time-consuming on a serial computer. We use a formulation based on the anti-diagonal structure of the data. This representation exposes the parallelizable parts of the algorithm without changing its initial formulation, and it gives us a more flexible formulation. To evaluate this approach, we implement it in OpenMP and CUDA C. The performance obtained shows the benefit of the approach.
 
</p></abstract><kwd-group><kwd>Cuda</kwd><kwd> GP-GPU</kwd><kwd> OpenMP</kwd><kwd> Parallel Computing</kwd><kwd> Smith-Waterman</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>In this paper, we discuss the parallelization of the Smith-Waterman algorithm [<xref ref-type="bibr" rid="scirp.57254-ref1">1</xref>] -[<xref ref-type="bibr" rid="scirp.57254-ref3">3</xref>] for protein sequence alignment. This algorithm makes it possible to compare large protein sequences. Sequence alignment analyses sequences of amino acids to extract similar subsequences. The results of such an analysis answer questions such as:</p><p>• Is a new sequence fully or partially present in the database?</p><p>• Does this sequence contain a given gene?</p><p>• How can a gene migrate from other previously identified genes?</p><p>• etc.</p><p>Answers to these questions help in simulating changes or mutations (used in medicine), in identifying bodies (through the classification of individuals based on genetic maps), in phylogeny (comparing very similar sequences to infer evolutionary relationships of proteins within families), etc.</p><p>Many algorithms are used in sequence alignment. They can be classified into two types:</p><p>• Approaches that give rigorous results but are extremely slow. The algorithm of Needleman and Wunsch [<xref ref-type="bibr" rid="scirp.57254-ref4">4</xref>] for global search and the Smith-Waterman algorithm for local search belong to this category.</p><p>• Very fast approaches whose results are less satisfactory for very large databases. They are a compromise between speed and sensitivity. BLAST<sup>1</sup> [<xref ref-type="bibr" rid="scirp.57254-ref5">5</xref>] and FASTA<sup>2</sup> [<xref ref-type="bibr" rid="scirp.57254-ref6">6</xref>] are two algorithms of this category. The BLAST algorithm uses a heuristic to detect anchor points that locate areas of identical sequences. 
FASTA is used for a quick comparison of protein or nucleotide sequences.</p><p>Our work focuses on the search for satisfactory solutions with reduced execution time.</p></sec><sec id="s2"><title>2. Preliminaries</title><p>The Smith-Waterman algorithm is used to find the best local alignment between two sequences, based on a substitution matrix and a fixed gap penalty. It extracts the longest similar segments of the two aligned sequences.</p><sec id="s2_1"><title>2.1. Principle of the Algorithm</title><p>To determine the similarity between two nucleotide or protein sequences, the Smith-Waterman algorithm compares all possible segments and assigns each a score. It returns the segment with the highest score. For example, let s and t be two sequences to be compared. The algorithm begins by creating a matrix M whose dimensions equal the lengths of the sequences s and t. Then the cell values of M are calculated, starting from the cell in the upper left corner and ending at the cell in the bottom right corner. Formula (1) gives the expression of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x5.png" xlink:type="simple"/></inline-formula> <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x6.png" xlink:type="simple"/></inline-formula>.</p><disp-formula id="scirp.57254-formula124"><label>(1)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/11-1730201x7.png"  xlink:type="simple"/></disp-formula><p>Where:</p><p>-S is a “BLOSUM scoring matrix”<sup>3</sup>.</p><p>-d is a fixed constant corresponding to the alignment of a letter with a gap (-).</p><p>-t[j] s[i] means that the amino acids t[j] and s[i] are aligned.</p><p>-s[i]- and -t[j] correspond respectively to the alignment of the amino acid s[i] or t[j] with a gap (-).</p><p>-M[i][j] is intuitively the score of an alignment ending with t[j] s[i]. 
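</p><p>As an illustration of the recurrence described above, the following Python sketch fills the scoring matrix row by row. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, a toy match/mismatch score replaces the BLOSUM lookup S, the gap constant d is taken as negative and added to the neighbouring score, and a border row and column of zeros is added so that matrix indices start at 1.</p><preformat>
```python
def score_pair(a, b, match=2, mismatch=-1):
    """Toy substitution score standing in for a BLOSUM lookup (assumption)."""
    return match if a == b else mismatch


def fill_scoring_matrix(s, t, d=-2):
    """Fill the local-alignment matrix row by row with a fixed gap score d."""
    rows, cols = len(s) + 1, len(t) + 1
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            M[i][j] = max(
                0,
                M[i - 1][j - 1] + score_pair(s[i - 1], t[j - 1]),  # diagonal
                M[i - 1][j] + d,  # from the cell just above
                M[i][j - 1] + d,  # from the cell to the left
            )
    return M


def best_score(M):
    """Maximum cell value, i.e. the score of the best local alignment."""
    return max(max(row) for row in M)
```
</preformat><p>With these toy scores, two identical sequences of length 4 give a best score of 8, and two sequences with no letter in common give 0.</p><p>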
At each step, when a maximum value is calculated, it is stored along with the direction from which it was obtained: the diagonal (i − 1, j − 1), just above (i − 1, j) or to the left (i, j − 1). Computing the matrix M and the information about the directions from which the highest values are obtained requires much time and memory space.</p><p>To restore the best alignment, we proceed as follows:</p><p>• Find the maximum value in the matrix M. This is the end of the local alignment with the best score.</p><p>• Move to the adjacent cell(s) from which the maximum score was obtained:</p><p>-a movement along the diagonal indicates an alignment of the letters t[j] and s[i];</p><p>-a horizontal movement means that t[j] is aligned with a blank (−) inserted between s[i] and s[i + 1];</p><p>-a vertical movement is analogous to the horizontal one;</p><p>-when a cell with score 0 is reached at (i, j), the optimal local alignment starts at M[i + 1][j + 1].</p><p>• The optimum score is given by the equation:</p><disp-formula id="scirp.57254-formula125"><graphic  xlink:href="http://html.scirp.org/file/11-1730201x9.png"  xlink:type="simple"/></disp-formula></sec><sec id="s2_2"><title>2.2. Application Example</title><p>Here are two sequences to be compared:</p><p>t: CGGGTATC</p><p>s: CCCTAGGT</p><p><xref ref-type="fig" rid="fig1">Figure 1</xref> shows the values in the matrix of scores at the end of the execution of the Smith-Waterman algorithm:</p><p>Once all the cells of the score matrix are calculated, we find the best local alignment by starting from the cell holding the maximum score, then moving back to the cell that was used to determine that score, and so on. The optimal local alignment found by the Smith-Waterman algorithm is:</p><p>C G G G --A T</p><p>---C T A G G T</p></sec><sec id="s2_3"><title>2.3. 
Highlighting Parallelizable Calculations</title><p>As shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>, calculations can be done in parallel along each anti-diagonal.</p><p>At time T1, a single cell is calculated; at time T2, two cells are calculated; at time T3, three cells are calculated; etc.</p><p>In general, the cell M(i,j) is computed at time T<sub>ij</sub> = i + j − 1.</p></sec><sec id="s2_4"><title>2.4. Linear Representation of Cells</title><p>The number of cells calculated at each iteration T<sub>i</sub> is as given above.</p><p>Placing the cells of T1 first, then those of T2 and T3, and so on, we obtain the representation in <xref ref-type="fig" rid="fig3">Figure 3</xref>.</p><fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig1">Figure 1</xref></label><caption><title> Example of sequence alignment</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/11-1730201x10.png"/></fig><fig id="fig2"  position="float"><label><xref ref-type="fig" rid="fig2">Figure 2</xref></label><caption><title> Cells computable at the same time T<sub>i</sub></title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/11-1730201x11.png"/></fig><fig id="fig3"  position="float"><label><xref ref-type="fig" rid="fig3">Figure 3</xref></label><caption><title> Linear representation of the parallelizable cells</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/11-1730201x12.png"/></fig><p>This representation of the scoring matrix clearly shows the tasks that can be performed simultaneously.</p></sec><sec id="s2_5"><title>2.5. Evolution of the Number of Cells at the Same Step</title><p>Suppose we have two sequences of identical size N. 
The number Nb<sub>i</sub> of cells computable at the same time T<sub>i</sub> is given by Formula (2).</p><disp-formula id="scirp.57254-formula126"><label>(2)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/11-1730201x13.png"  xlink:type="simple"/></disp-formula><p>Let Nb denote the sum of the Nb<sub>i</sub>. Formula (3) verifies that the new approach takes into account all N<sup>2</sup> cells of the scoring matrix.</p><disp-formula id="scirp.57254-formula127"><label>(3)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/11-1730201x14.png"  xlink:type="simple"/></disp-formula><p>Thus, without taking into account the dependencies between the cells during the computation, the number of iterations needed to calculate the N<sup>2</sup> cells is 2N − 1.</p><p>So, if <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x15.png" xlink:type="simple"/></inline-formula> represents the average number of cells computable per pass, we have in (4): <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x16.png" xlink:type="simple"/></inline-formula></p><disp-formula id="scirp.57254-formula128"><label>(4)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/11-1730201x17.png"  xlink:type="simple"/></disp-formula></sec></sec><sec id="s3"><title>3. Proposed Models and Materials</title><p>To evaluate the scoring matrix, most existing approaches directly use the matrix in <xref ref-type="fig" rid="fig2">Figure 2</xref> through a double iteration over the rows and columns.</p><p>These approaches have the merit of simplicity. 
However, given the dependencies between the cells, the value of a cell may not be computable during the first pass, so that evaluating the scoring matrix requires more than <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x18.png" xlink:type="simple"/></inline-formula> iterations.</p><p>To remedy this situation, we use an approach that consists of a transformation of the initial matrix [<xref ref-type="bibr" rid="scirp.57254-ref7">7</xref>] -[<xref ref-type="bibr" rid="scirp.57254-ref10">10</xref>] , which does not change its essential properties but rather optimizes the order in which the cells are calculated.</p><sec id="s3_1"><title>3.1. Transformation of the Matrix of Scores</title><p>M represents the matrix of scores, i the row number and j the column number.</p><p>We define a mapping <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x19.png" xlink:type="simple"/></inline-formula> as follows:</p><p>For each cell referenced (i, j):</p><disp-formula id="scirp.57254-formula129"><graphic  xlink:href="http://html.scirp.org/file/11-1730201x20.png"  xlink:type="simple"/></disp-formula><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x21.png" xlink:type="simple"/></inline-formula> is a bijective mapping. It transforms the matrix in <xref ref-type="fig" rid="fig2">Figure 2</xref> into the matrix in <xref ref-type="fig" rid="fig3">Figure 3</xref>.</p><p>In fact <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x22.png" xlink:type="simple"/></inline-formula> is the change of reference (i, j) &#8594; (i, i + j − 1).</p><p>The function f represents a rotation of the scoring matrix cells which transforms anti-diagonals into columns. However, we must keep in mind the corresponding treatment that must be applied to the sequences to be aligned.</p></sec><sec id="s3_2"><title>3.2. 
Dependency of the Calculations in the New Representation</title><p>In the original representation, M[i][j] depends on the values of cells M[i][j − 1], M[i − 1][j] and M[i − 1][j − 1], as shown in <xref ref-type="fig" rid="fig4">Figure 4</xref>.</p><p>In the new representation, computing m[i][j] depends on m[i − 1][j − 2], m[i − 1][j − 1] and m[i][j − 1], as shown in <xref ref-type="fig" rid="fig5">Figure 5</xref>.</p><p>The change of representation makes it possible to calculate together the values of the cells of the same column. As a consequence, the reference into the genomic sequence located on the columns changes, as shown in <xref ref-type="fig" rid="fig5">Figure 5</xref>. The value S[s[i]][t[j]] used in computing m[i][j] must be adjusted accordingly: since j = j’ − i’ + 1, it becomes S[s[i’]][t[j’ − i’ + 1]] for m[i’][j’].</p></sec><sec id="s3_3"><title>3.3. Reconstitution of the Solution</title><p>Once all the values of the score matrix are calculated, we must produce the results. The score is obtained in the same way as in the original form, i.e. as the maximum cell value, but the aligned amino acids are recovered using the function <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x23.png" xlink:type="simple"/></inline-formula>.</p><disp-formula id="scirp.57254-formula130"><graphic  xlink:href="http://html.scirp.org/file/11-1730201x24.png"  xlink:type="simple"/></disp-formula></sec><sec id="s3_4"><title>3.4. Materials</title><sec id="s3_4_1"><title>3.4.1. Dataset</title><p>To examine the acceleration rate, we use the Smith-Waterman algorithm on the alignment of genomic sequences. This algorithm has been the subject of several parallel implementations [<xref ref-type="bibr" rid="scirp.57254-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.57254-ref11">11</xref>] . For illustration, we consider the computation of the cell values of the dynamic programming matrix used in this algorithm. The sequences we use were downloaded from existing genomic databases. 
The substitution matrix used is BLOSUM62 with a gap penalty of −2.</p></sec><sec id="s3_4_2"><title>3.4.2. Specifications of the Sequential Computer Used</title><p>It has the following features:</p><p>Processor: Pentium (R) Dual-Core CPU E5500 @ 2.80 GHz;</p><fig id="fig4"  position="float"><label><xref ref-type="fig" rid="fig4">Figure 4</xref></label><caption><title> Dependencies for calculating M[i][j]</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/11-1730201x25.png"/></fig><fig id="fig5"  position="float"><label><xref ref-type="fig" rid="fig5">Figure 5</xref></label><caption><title> New dependencies for computing m[i][j]</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/11-1730201x26.png"/></fig><p>RAM: 3.00 GB; HD: 500 GB</p><p>Operating System: Ubuntu 11.10.</p><p>C compiler used: gcc version 4.4.6.</p></sec><sec id="s3_4_3"><title>3.4.3. GP-GPU Specifications</title><p>The graphics card used is an NVIDIA GeForce GTX 670, which consists of seven multiprocessors totalling 1344 CUDA cores, clocked at 1.4 GHz, with two (2) gigabytes of memory shared among the cores, 64 KB of constant memory and 64 KB of shared memory per multiprocessor.</p></sec></sec></sec><sec id="s4"><title>4. Experiments and Results</title><sec id="s4_1"><title>4.1. Classical Approach versus Our Approach in Sequential Mode</title><p>In <xref ref-type="fig" rid="fig6">Figure 6</xref>, we compare the sequential computation time of the scoring matrix in its initial representation and in the new organization. We note that the new representation has almost the same performance as the initial representation for sequences of length less than fourteen thousand (14,000) nucleotides. 
This experiment aims to show that, in sequential mode, the two representations of the scoring matrix are equivalent over the range of sequence lengths that we study: neither saves time. This guarantees, for the remainder of the study, that the new representation of the scoring matrix does not induce additional execution time. In what follows, we use the same range of sequence lengths.</p><p>Note that our computing capabilities prevented us from going beyond this sequence size.</p></sec><sec id="s4_2"><title>4.2. OpenMP Implementation of Our Method</title><p>OpenMP is based on the principle of shared memory [<xref ref-type="bibr" rid="scirp.57254-ref12">12</xref>] -[<xref ref-type="bibr" rid="scirp.57254-ref15">15</xref>] . The computation to be performed is decomposed into multiple tasks, which are performed by the available computational units. The code to be executed and the data variables can be stored in a location accessible to all processing units. Shared variables are declared in the shared() clause, while thread-specific variables are listed in the private() clause.</p><p>Experiments were performed on DNA sequences of various lengths using OpenMP. As we shall see, the optimal block size depends mainly on the size N of the sequences to align.</p><p>We propose two options for parallelizing the calculation of the cell values of the scoring matrix in the Smith-Waterman algorithm.</p><p>The scoring matrix is reorganized: all cells in a column can be calculated simultaneously. Thus at each iteration, the cells of a single column are calculated. 
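</p><p>The column-by-column order of the reorganized matrix can be sketched sequentially in Python as follows. This is an illustrative sketch and not the authors' OpenMP code: the function name and the toy match/mismatch scores are assumptions, and the inner loop only marks the part that a parallel runtime could divide among threads. Step k handles the cells with i + j − 1 = k, which form one column of the new representation.</p><preformat>
```python
def fill_by_antidiagonals(s, t, d=-2, match=2, mismatch=-1):
    """Fill the scoring matrix one anti-diagonal (step k) at a time."""
    n, m = len(s), len(t)
    # Border row 0 and column 0 stay at 0 (assumed initialisation).
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for k in range(1, n + m):  # steps 1 .. n + m - 1
        i_first = max(1, k + 1 - m)
        i_last = min(n, k)
        # All cells of step k are independent of one another: this inner
        # loop is the part that could be shared among threads.
        for i in range(i_first, i_last + 1):
            j = k + 1 - i  # so that i + j - 1 == k
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(
                0,
                M[i - 1][j - 1] + sub,  # written at step k - 2
                M[i - 1][j] + d,        # written at step k - 1
                M[i][j - 1] + d,        # written at step k - 1
            )
    return M
```
</preformat><p>Each cell of step k reads only cells written at steps k − 1 and k − 2, which is why the iterations of the inner loop do not depend on one another.</p><p>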
Each thread calculates one or more cells.</p><p>A thread can read the elements updated by another thread in order to calculate its own elements.</p><fig id="fig6"  position="float"><label><xref ref-type="fig" rid="fig6">Figure 6</xref></label><caption><title> Original version versus the new version</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/11-1730201x27.png"/></fig><p>This solution has the advantage of providing the list of elements computable in an iteration, but has the disadvantage of accumulating waiting times.</p><p>It should also be noted that, at this level, the calculation can be organized in multiple loops. At each loop, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x28.png" xlink:type="simple"/></inline-formula> threads are created.</p><p>As there is no extra time apart from the read and write accesses to the cells of the matrix, the runtimes in both cases are the same. So, we treat only one case.</p><sec id="s4_2_1"><title>4.2.1. Mathematical Modeling of the OpenMP Runtime</title><p>We assume:</p><p>N: length of the sequences to be aligned;</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x29.png" xlink:type="simple"/></inline-formula>: execution time of each iteration (i, j). 
It is also the time needed to compute the value of one cell of the matrix;</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x30.png" xlink:type="simple"/></inline-formula>: initialization time before starting calculations;</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x31.png" xlink:type="simple"/></inline-formula>: latency, i.e. the time spent waiting for all threads to finish their tasks in an iteration;</p><p>E(x): denotes the integer part of x.</p><p>From these assumptions:</p><p>The sequential computation time of the cell values of the scoring matrix is given in (5).</p><disp-formula id="scirp.57254-formula131"><label>(5)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/11-1730201x32.png"  xlink:type="simple"/></disp-formula><p>As we shall see, the optimal block size depends mainly on the size N of the sequences to align. We therefore propose two (2) options for parallelizing the calculation of the scoring matrix cells in the Smith-Waterman algorithm.</p><p>During the k<sup>th</sup> execution of the loop, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x33.png" xlink:type="simple"/></inline-formula>, there are exactly k cells to calculate.</p><p>If <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x34.png" xlink:type="simple"/></inline-formula>, there are N − (k − N) (or 2N − k) cells to calculate. 
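</p><p>The cell counts stated above can be checked with a short Python sketch (hypothetical function names; two sequences of the same length N are assumed): step k holds k cells while the anti-diagonal grows and 2N − k cells once it shrinks, and the counts over the 2N − 1 steps add up to N squared.</p><preformat>
```python
def cells_at_step(k, n):
    """Number of cells computable at step k, for k from 1 to 2n - 1."""
    # k cells while the anti-diagonal grows, 2n - k once it shrinks.
    return min(k, 2 * n - k)


def step_profile(n):
    """Cell counts for all 2n - 1 steps of an n-by-n scoring matrix."""
    return [cells_at_step(k, n) for k in range(1, 2 * n)]


def average_cells_per_step(n):
    """Mean number of cells per step: n squared cells over 2n - 1 steps."""
    profile = step_profile(n)
    return sum(profile) / len(profile)
```
</preformat><p>For N = 4 the profile is 1, 2, 3, 4, 3, 2, 1, that is 16 cells in 7 steps.</p><p>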
At each iteration, each thread is responsible for computing <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x35.png" xlink:type="simple"/></inline-formula> cells (first phase) and then <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x35.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x36.png" xlink:type="simple"/></inline-formula>.</p><p>We deduce the total computation time T using OpenMP as follows:</p><disp-formula id="scirp.57254-formula132"><graphic  xlink:href="http://html.scirp.org/file/11-1730201x37.png"  xlink:type="simple"/></disp-formula><p>Hence T is given in Formula (6)</p><disp-formula id="scirp.57254-formula133"><label>(6)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/11-1730201x38.png"  xlink:type="simple"/></disp-formula></sec><sec id="s4_2_2"><title>4.2.2. Determining the Optimum Value of nb_threads</title><disp-formula id="scirp.57254-formula134"><graphic  xlink:href="http://html.scirp.org/file/11-1730201x39.png"  xlink:type="simple"/></disp-formula><p>Differentiating the expression of T with respect to nb_threads, we obtain:</p><disp-formula id="scirp.57254-formula135"><graphic  xlink:href="http://html.scirp.org/file/11-1730201x40.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.57254-formula136"><graphic  xlink:href="http://html.scirp.org/file/11-1730201x41.png"  xlink:type="simple"/></disp-formula><p>No value of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x42.png" xlink:type="simple"/></inline-formula> makes the derivative vanish.</p></sec><sec id="s4_2_3"><title>4.2.3. 
Estimation of the Theoretical Acceleration</title><p>The acceleration is calculated in Formula (7).</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x43.png" xlink:type="simple"/></inline-formula>;</p><disp-formula id="scirp.57254-formula137"><label>(7)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/11-1730201x44.png"  xlink:type="simple"/></disp-formula></sec><sec id="s4_2_4"><title>4.2.4. Measured Accelerations</title><p>We distinguish two cases:</p><p>Case 1: one calculation for each thread per iteration</p><p>As shown in <xref ref-type="fig" rid="fig7">Figure 7</xref>, the parallelization with OpenMP does not significantly affect system performance. For sequences of 2500 and 5000 nucleotides, the peak is reached with two threads and the acceleration is 1.5. For sequences of 10,000 nucleotides, the optimum acceleration is 1.5 and is obtained with 8 threads. The sequence of 14,000 nucleotides gives a maximum acceleration of 1.13 with 2 threads. In summary, with OpenMP, the best results are obtained with two threads.</p><p>Case 2: several calculations for each thread per iteration</p><p>We also tested the case where a thread performs a set of calculations rather than one. The results of this implementation are very similar to those of the previous one. There is a very slight performance improvement, of the order of a few hundredths of a second, in some cases. <xref ref-type="fig" rid="fig8">Figure 8</xref> summarizes the results obtained with two (2) threads when varying the number of cells that each thread calculates per iteration.</p></sec></sec><sec id="s4_3"><title>4.3. GP-GPU Implementation of Our Method</title><p>The initial form of the Smith-Waterman algorithm has many implementations on GP-GPU, as in [<xref ref-type="bibr" rid="scirp.57254-ref16">16</xref>] -[<xref ref-type="bibr" rid="scirp.57254-ref20">20</xref>] . To perform the calculation on GP-GPU, the scoring matrix is represented in vector form. 
Each kernel launched calculates the elements of a column of the new representation. These elements are identified from the parameters (i, j). In total 2N iterations are launched.</p><sec id="s4_3_1"><title>4.3.1. Mathematical Modeling of the GP-GPU’s Runtime</title><p>A GP-GPU implementation starts on a CPU and then uses a kernel (a program running on the GP-GPU). So there is cooperation between the CPU and the GPU cores. Communications between GPU and CPU are carried out through the GP-GPU memory. The CPU copies there the data to be used by the GP-GPU, and also reads back the contents of the GP-GPU memory for reuse in sequential calculations or simply for checking.</p><fig id="fig7"  position="float"><label><xref ref-type="fig" rid="fig7">Figure 7</xref></label><caption><title> Performance based on the number of threads</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/11-1730201x45.png"/></fig><fig id="fig8"  position="float"><label><xref ref-type="fig" rid="fig8">Figure 8</xref></label><caption><title> Performance with two threads, varying chunk size per iteration</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/11-1730201x46.png"/></fig><p>The Kernel</p><p>We assume:</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x47.png" xlink:type="simple"/></inline-formula>: kernel initialization time;</p><p>NB: number of blocks per multiprocessor;</p><p>NWP: number of warps;</p><p>NTW: number of tasks per warp;</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/11-1730201x48.png" xlink:type="simple"/></inline-formula>: execution time of an iteration per warp;</p><p>The theoretical time for calculating the scoring matrix is given in (8)</p><disp-formula id="scirp.57254-formula138"><label>(8)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/11-1730201x49.png"  
xlink:type="simple"/></disp-formula><p>To speed up the calculations, we transferred the data into the GP-GPU memory and selected a grid of 512 &#215; 512 &#215; 512. This setting is sufficient for processing the sequences that we handle.</p></sec><sec id="s4_3_2"><title>4.3.2. Performance on GP-GPU</title><p><xref ref-type="fig" rid="fig9">Figure 9</xref> presents the results obtained with GP-GPU.</p><p>We note that the acceleration increases with the size of the sequences examined.</p></sec></sec><sec id="s4_4"><title>4.4. Comparison of the Implementations on OpenMP and GP-GPU</title><p><xref ref-type="fig" rid="fig10">Figure 10</xref> presents the results obtained.</p><p>For OpenMP we examine three (3) cases. The first case uses two (2) threads.</p><p>The second also uses two (2) threads, with a chunk of fifty (50), and the last one two (2) threads with a chunk of two hundred and fifty (250).</p><p>The best case is the second one. We also notice that beyond 14,000 acids per sequence, the three cases give equivalent results.</p><p>The implementation on GP-GPU gives a better acceleration than OpenMP. For the sequences used, the performance is improved more than 25 times.</p></sec></sec><sec id="s5"><title>5. Conclusions</title><p>In this paper, we present a method based on the rotation of the score matrix in order to improve the implementation of the Smith-Waterman algorithm.</p><p>This transformation makes explicit the parallelism contained in this algorithm and facilitates its exploitation across different parallel platforms.</p><p>We validate the application of this method with OpenMP and CUDA C. For each representation, we also measure the performance while executing the loop of the Smith-Waterman algorithm. 
It appears that the number of threads used with OpenMP affects performance, but that the best number depends on the size of the sequences to be compared. Similarly, on GP-GPU the choice of grid dimensions is an essential element in improving performance. We note only a small performance gain with OpenMP, whereas on the GP-GPU performance increases with the size of the sequences. In the end, GP-GPU improves the performance of computing the Smith-Waterman algorithm. For the sequences used, the performance is improved more than 25 times compared with OpenMP. Ultimately, this study allows the following conclusions:</p><fig id="fig9"  position="float"><label><xref ref-type="fig" rid="fig9">Figure 9</xref></label><caption><title> Performance on GP-GPU</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/11-1730201x50.png"/></fig><fig id="fig10"  position="float"><label><xref ref-type="fig" rid="fig10">Figure 10</xref></label><caption><title> Performance based on the number of threads</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/11-1730201x51.png"/></fig><p>• The use of GP-GPUs can be extended to parallel computing, in addition to the graphics for which they were originally created. The relatively low cost of GP-GPUs will make parallel computing more accessible to the public.</p><p>• In the case of the Smith-Waterman algorithm, we conclude that the GP-GPU accelerates it more than OpenMP.</p><p>• In general, it is recommended to use GP-GPU rather than OpenMP for massively parallel and long calculations.</p><p>• We propose a mathematical model of the computation time of the scoring matrix on OpenMP and GP-GPU. 
This model allows us to make informed choices of the number of threads (OpenMP) and the size of the computing grid (GP-GPU).</p></sec><sec id="s6"><title>NOTES</title></sec></body><back><ref-list><title>References</title><ref id="scirp.57254-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Smith, T.F. and Waterman, M.S. (1981) Identification of Common Molecular Subsequences. Journal of Molecular Biology, 147, 195-197. http://dx.doi.org/10.1016/0022-2836(81)90087-5</mixed-citation></ref><ref id="scirp.57254-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Boukerche, A., Melo, A.C.M.A., Ayala-Rincon, M. and Santana, T.M. (2005) Parallel Smith-Waterman Algorithm for Local DNA Comparison in a Cluster of Workstations. Experimental and Efficient Algorithms, 3503, 464-475. www.springerlink.com/content/xwn2q2qfm4hgvr3t</mixed-citation></ref><ref id="scirp.57254-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Nguyen, V.-H. and Lavenier, D. (2009) PLAST: Parallel Local Alignment Search Tool for Database Comparison. BMC Bioinformatics, 10, 329.</mixed-citation></ref><ref id="scirp.57254-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology, 48, 443-453. http://dx.doi.org/10.1016/0022-2836(70)90057-4</mixed-citation></ref><ref id="scirp.57254-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology, 215, 403-410.</mixed-citation></ref><ref id="scirp.57254-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Pearson, W.R. and Lipman, D.J. 
(1988) Improved Tools for Biological Sequence Comparison. Proceedings of the National Academy of Sciences of the United States of America, 85, 2444-2448. http://dx.doi.org/10.1073/pnas.85.8.2444</mixed-citation></ref><ref id="scirp.57254-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Aluru, S., Futamura, N. and Mehrotra, K. (2003) Parallel Biological Sequence Comparison Using Prefix Computations. Journal of Parallel and Distributed Computing, 63, 264-272.</mixed-citation></ref><ref id="scirp.57254-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Edmiston, E.W., Core, N.G., Saltz, J.H. and Smith, R.M. (1988) Parallel Processing of Biological Sequence Comparison Algorithms. International Journal of Parallel Programming, 17, 259-275. http://dx.doi.org/10.1007/BF02427852</mixed-citation></ref><ref id="scirp.57254-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Rajko, S. and Aluru, S. (2004) Space and Time Optimal Parallel Sequence Alignments. IEEE Transactions on Parallel and Distributed Systems, 15, 1070-1081. http://dx.doi.org/10.1109/TPDS.2004.86</mixed-citation></ref><ref id="scirp.57254-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Sarje, A. and Aluru, S. (2009) Parallel Genomic Alignments on the Cell Broadband Engine. IEEE Transactions on Parallel and Distributed Systems, 20, 1600-1610.</mixed-citation></ref><ref id="scirp.57254-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Lander, E., Mesirov, J.P. and Taylor, W. (1988) Protein Sequence Comparison on a Data Parallel Computer. Proceedings of the International Conference on Parallel Processing, 3, 257-263.</mixed-citation></ref><ref id="scirp.57254-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Eigenmann, R. and Voss, M. (2001) OpenMP Shared Memory Parallel Programming. 
Lecture Notes in Computer Science 2104. Springer-Verlag, Heidelberg.</mixed-citation></ref><ref id="scirp.57254-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Ferrer, M.M.R., Gajinov, V., Unsal, O.S., Cristal, A., Ayguadé, E. and Valero, M. (2008) Nebelung: Execution Environment for Transactional OpenMP. International Journal of Parallel Programming, 36, 326-346.</mixed-citation></ref><ref id="scirp.57254-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Chapman, B., Jost, G. and van der Pas, R. (2008) Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, Cambridge.</mixed-citation></ref><ref id="scirp.57254-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Gonzalez, M., Ayguadé, E., Martorell, X. and Labarta, J. (2001) Defining and Supporting Pipelined Executions in OpenMP. Proceedings of the 2nd International Workshop on OpenMP Applications and Tools, Lafayette, IN, 30-31 July 2001, 155-169. http://dx.doi.org/10.1007/3-540-44587-0_14</mixed-citation></ref><ref id="scirp.57254-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Ligowski, L. and Rudnicki, W. (2009) An Efficient Implementation of Smith Waterman Algorithm on GPU Using CUDA, for Massively Parallel Scanning of Sequence Databases. Proceedings of the 2009 IEEE International Symposium on Parallel &amp; Distributed Processing, Rome, 23-29 May 2009, 1-8. http://dx.doi.org/10.1109/IPDPS.2009.5160931</mixed-citation></ref><ref id="scirp.57254-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Liu, Y., Huang, W., Johnson, J. and Vaidya, S. (2006) GPU Accelerated Smith-Waterman. 
Proceedings of the International Conference on Computational Science, Reading, UK, 28-31 May 2006, 188-195. http://dx.doi.org/10.1007/11758549_29</mixed-citation></ref><ref id="scirp.57254-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Liu, Y., Schmidt, B. and Maskell, D.L. (2009) MSA-CUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA. Proceedings of the 20th IEEE International Conference on Application-Specific Systems, Architectures and Processors, Boston, 7-9 July 2009, 121-128.</mixed-citation></ref><ref id="scirp.57254-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Voss, G., Müller-Wittig, W. and Schmidt, B. (2005) Using Graphics Hardware to Accelerate Biological Sequence Database Scanning. Proceedings of the TENCON 2005 - 2005 IEEE Region 10 Conference, Melbourne, 21-24 November 2005, 1-6.</mixed-citation></ref><ref id="scirp.57254-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Liu, W.G., Schmidt, B., Voss, G., Schröder, A. and Müller-Wittig, W. (2006) Bio-Sequence Database Scanning on a GPU. Proceedings of the 20th International Parallel and Distributed Processing Symposium, Rhodes Island, 25-29 April 2006, 8.</mixed-citation></ref></ref-list></back></article>