<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">OJS</journal-id><journal-title-group><journal-title>Open Journal of Statistics</journal-title></journal-title-group><issn pub-type="epub">2161-718X</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/ojs.2022.122016</article-id><article-id pub-id-type="publisher-id">OJS-116617</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  Quasi-Negative Binomial: Properties, Parametric Estimation, Regression Model and Application to RNA-SEQ Data
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Mohamed</surname><given-names>M. Shoukri</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Maha</surname><given-names>M. Aleid</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib></contrib-group><aff id="aff2"><addr-line>Department of Biostatistics, Epidemiology and Scientific Computing, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia</addr-line></aff><aff id="aff1"><addr-line>Department of Epidemiology and Biostatistics, Schulich School of Medicine and Dentistry, University of Western Ontario, London, Canada</addr-line></aff><pub-date pub-type="epub"><day>14</day><month>03</month><year>2022</year></pub-date><volume>12</volume><issue>02</issue><fpage>216</fpage><lpage>237</lpage><history><date date-type="received"><day>16,</day>	<month>March</month>	<year>2022</year></date><date date-type="rev-recd"><day>16,</day>	<month>April</month>	<year>2022</year>	</date><date date-type="accepted"><day>19,</day>	<month>April</month>	<year>2022</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  Background:
   The Poisson and the Negative Binomial distributions are commonly used to model count data. The Poisson is characterized by the equality of mean and variance, whereas the Negative Binomial has a variance larger than the mean and is therefore appropriate for modeling over-dispersed count data. <b>Objectives:</b> This paper studies a new two-parameter probability distribution, the Quasi-Negative Binomial Distribution (QNBD), which generalizes the well-known negative binomial distribution. The model turns out to be quite flexible for analyzing count data. Our main objectives are to estimate the parameters of the proposed distribution and to discuss its applicability to genetics data. As an application, we use the QNBD regression representation to model genomics data sets. <b>Results:</b> The new distribution is shown to provide a good fit as judged by the Akaike Information Criterion (AIC), a measure of model goodness of fit. The proposed distribution may serve as a viable alternative to other distributions available in the literature for modeling over-dispersed count data arising in various fields of scientific investigation, such as genomics and biomedicine.
 
</p></abstract><kwd-group><kwd>Queuing Models</kwd><kwd>Overdispersion</kwd><kwd>Moment Estimators</kwd><kwd>Delta Method</kwd><kwd>Bootstrap</kwd><kwd>Maximum Likelihood Estimation</kwd><kwd>Fisher’s Information</kwd><kwd>Orthogonal Polynomials</kwd><kwd>Regression Models</kwd><kwd>RNA-Seq Data</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>A random variable X is said to have a “Quasi-Negative Binomial Distribution” (QNBD) if its probability function is given by:</p><p>$P_x = P(X = x) = \frac{\beta - 1}{(\beta - 1) + \beta x} \cdot \frac{\Gamma(\beta + \beta x)\,\theta^{x}(1 - \theta)^{\beta + \beta x - x - 1}}{x!\,\Gamma(\beta + \beta x - x)}, \quad x = 0, 1, 2, \ldots; \quad 0 &lt; \theta &lt; 1; \quad 0 &lt; \beta\theta &lt; 1$ (1)</p><p>The distribution whose probability function is given in (1) was first derived by Tak&#225;cs [<xref ref-type="bibr" rid="scirp.116617-ref1">1</xref>] as a queuing model. He considered a single-server queue in which independent customers arrive according to a Poisson process in batches of size $(\beta - 1)$ with traffic intensity π, and the service time is exponential with mean 1/α. The service time is also assumed to be independent of the interarrival time. Under these conditions, the probability of an arrival is $\theta = \pi/(\pi + \alpha)$ and the probability of a departure is $1 - \theta$. Tak&#225;cs [<xref ref-type="bibr" rid="scirp.116617-ref1">1</xref>] and later Consul and Gupta [<xref ref-type="bibr" rid="scirp.116617-ref2">2</xref>] showed that the probability that a busy period has $(\beta - 1)x$ is, for fixed β, given by (1). The distribution is a member of the Lagrange class of distributions [<xref ref-type="bibr" rid="scirp.116617-ref3">3</xref>] [<xref ref-type="bibr" rid="scirp.116617-ref4">4</xref>] [<xref ref-type="bibr" rid="scirp.116617-ref5">5</xref>].</p><p>The shape of the histogram of X depends on the combination (β, θ). 
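</p><p>The probability function (1) is straightforward to evaluate numerically on the log scale. The following Python sketch is only an illustration under stated assumptions: the function name, the parameter values β = 3 and θ = 0.2, and the truncation of the support at 400 are hypothetical choices and are not taken from the paper (whose own R code is given in Appendix 1). It checks that the probabilities sum to one and that the numerical mean and variance agree with the closed forms (3) and (4) of Section 2.</p><preformat>
```python
import math

def qnbd_log_pmf(x, beta, theta):
    """Log of the QNBD probability function (1) at a count x = 0, 1, 2, ..."""
    a = beta - 1.0
    return (
        math.log(a) - math.log(a + beta * x)
        + math.lgamma(beta + beta * x)
        - math.lgamma(x + 1.0)              # log x!
        - math.lgamma(beta + beta * x - x)
        + x * math.log(theta)
        + (beta + beta * x - x - 1.0) * math.log(1.0 - theta)
    )

# Illustrative values only (not from the paper); here beta * theta = 0.6.
beta, theta = 3.0, 0.2
support = range(400)  # assumed truncation; the neglected tail is negligible here
probs = [math.exp(qnbd_log_pmf(x, beta, theta)) for x in support]

total = sum(probs)
mean = sum(x * p for x, p in zip(support, probs))
var = sum((x - mean) ** 2 * p for x, p in zip(support, probs))

# Closed forms (3) and (4) from Section 2, for comparison.
mean_formula = (beta - 1.0) * theta / (1.0 - beta * theta)
var_formula = (beta - 1.0) * theta * (1.0 - theta) / (1.0 - beta * theta) ** 3
print(total, mean, mean_formula, var, var_formula)
```
</preformat><p>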
In <xref ref-type="fig" rid="fig1">Figure 1</xref> and <xref ref-type="fig" rid="fig2">Figure 2</xref> we can see that the distribution has a much longer tail for large values of β.</p><p>The paper is structured as follows. In Section 2 we demonstrate the connection between the QNBD and the regular exponential family of distributions [<xref ref-type="bibr" rid="scirp.116617-ref6">6</xref>] and derive the higher order central moments of the distribution. In Section 3, we derive the first order approximation of the variances and biases of the moment estimators of (β, θ). In Section 4, we derive the maximum likelihood estimators and their asymptotic variances. In Section 5, we approximate the Fisher information for β using orthogonal polynomials and obtain the asymptotic biases of the maximum likelihood estimators. In Section 6, we develop the regression model, discuss the maximum likelihood estimation of the regression parameters, and investigate a limiting form of the distribution. In Section 7, we apply the models to real-life data arising from genomic studies. We close with a general discussion.</p></sec><sec id="s2"><title>2. Moments of the Distribution</title><p>The simplest approach to derive the higher order central moments of the distribution is to first write (1) in the general form of the linear exponential family.</p><p>For fixed β, the QNBD belongs to the regular exponential family of discrete random variables:</p><p>$P_x = h(x)\exp[\eta(\theta)S(x) - \psi(\theta)]$ (2)</p><p>with</p><p>$h(x) \equiv \frac{\beta - 1}{(\beta - 1) + \beta x} \cdot \frac{\Gamma(\beta + \beta x)}{x!\,\Gamma(\beta + \beta x - x)}$</p><p>$S(x) \equiv x$</p><p>$\eta(\theta) = \log[\theta(1 - \theta)^{\beta - 1}]$</p><p>$\psi(\theta) = -\log(1 - \theta)^{\beta - 1}$</p><p>The mean $\mu'_1$ and variance $\mu_2$ of X are given respectively by:</p><p>$\mu'_1 = \frac{(\beta - 1)\theta}{1 - \beta\theta}$ (3)</p><p>$\mu_2 = \frac{(\beta - 1)\theta(1 - \theta)}{(1 - \beta\theta)^3}$ (4)</p><p>Writing</p><p>$g(\theta) = \exp[\eta(\theta)]$ and $f(\theta) = \exp[\psi(\theta)]$, one can establish a recurrence relationship among the central moments so that:</p><p>$\mu_{r+1} = E[(x - \mu'_1)^{r+1}] = \frac{g(\theta)}{g'(\theta)}\left[\frac{\partial\mu_r}{\partial\theta} + r\,\frac{\partial\mu'_1}{\partial\theta}\,\mu_{r-1}\right], \quad r = 1, 2, \ldots$</p><p>Here</p><p>$\mu'_1 = \frac{g(\theta)}{g'(\theta)} \cdot \frac{\partial\ln f}{\partial\theta}$ (5)</p><p>Therefore, the third and fourth central moments are:</p><p>$\mu_3 = \frac{(\beta - 1)\theta(1 - \theta)(1 - 2\theta + 2\beta\theta - \beta\theta^2)}{(1 - \beta\theta)^5}$ (6)</p><p>$\mu_4 = 3\mu_2^2 + \frac{(\beta - 1)\theta(1 - \theta)}{(1 - \beta\theta)^7}\,M$ (7)</p><p>where,</p><p>$M = 1 - 6\theta + 6\theta^2 + 2\beta\theta(4 - 9\theta + 4\theta^2) + \beta^2\theta^2(6 - 6\theta + \theta^2).$</p><p>Moreover, the fifth central moment is given by:</p><p>$\mu_5 = 10\mu_2\mu_3 + (\beta - 1)\theta(1 - \theta)(1 - \beta\theta)^{-9}B$</p><p>where</p><p>$B = 1 - 14\theta + 36\theta^2 - 24\theta^3 + 2\theta\beta(11 - 42\theta + 28\theta^2) - \theta^2\beta(29 - 96\theta + 58\theta^2) + \theta^2\beta^2(58 - 96\theta + 29\theta^2) - 2\theta^3\beta^2(28 - 42\theta + 11\theta^2) + 2\theta^3\beta^3(12 - 9\theta + \theta^2) - \theta^4\beta^3(18 - 12\theta + \theta^2)$</p></sec><sec id="s3"><title>3. 
Moment Estimators</title><p>Suppose that we have a random sample $x_1, x_2, \ldots, x_n$ with sample mean $\bar{x}$ and sample variance $s^2$:</p><p>$\bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$</p><p>$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$</p><p>Equating the sample statistics to their corresponding population parameters (3) and (4) and solving for θ and β we get</p><p>$\hat\theta = 1 - \frac{\bar{x}(1 + \bar{x})^2}{s^2}$ (8)</p><p>$\hat\beta = 1 + \frac{\bar{x}^2(1 + \bar{x})}{\hat\theta\,s^2}$ (9)</p><p>We use the delta method to evaluate the variances and biases of the moment estimators.</p><p>From Kendall and Ord [<xref ref-type="bibr" rid="scirp.116617-ref7">7</xref>] we have:</p><p>$\operatorname{var}(\hat\theta) = \operatorname{var}(\bar{x})\left(\frac{\partial\hat\theta}{\partial\bar{x}}\right)^2 + \operatorname{var}(s^2)\left(\frac{\partial\hat\theta}{\partial s^2}\right)^2 + 2\operatorname{cov}(\bar{x}, s^2)\left(\frac{\partial\hat\theta}{\partial\bar{x}}\right)\left(\frac{\partial\hat\theta}{\partial s^2}\right)$</p><p>$\operatorname{Bias}(\hat\theta) = \frac{1}{2!}\left[\operatorname{var}(\bar{x})\frac{\partial^2\hat\theta}{\partial\bar{x}^2} + \operatorname{var}(s^2)\frac{\partial^2\hat\theta}{\partial(s^2)^2} + 2\operatorname{cov}(\bar{x}, s^2)\frac{\partial^2\hat\theta}{\partial\bar{x}\,\partial s^2}\right]$</p><p>with similar expressions for $\operatorname{var}(\hat\beta)$ and $\operatorname{Bias}(\hat\beta)$.</p><p>One can show that:</p><p>$V_1 = \operatorname{var}(\hat\theta) = \frac{(1 - \theta)}{n\theta(\beta - 1)(1 - \beta\theta)}[1 + 2\beta\theta - 3\theta]^2 + \frac{(\mu_4 - \mu_2^2)}{n} \cdot \frac{(1 - \beta\theta)^6}{\theta^2(\beta - 1)^2} - \frac{2\mu_3}{n} \cdot \frac{(1 - \beta\theta)^4}{\theta^2(\beta - 1)^2}$</p><p>$V_2 = \operatorname{var}(\hat\beta) = \frac{(1 - \theta)[1 + \beta\theta - 2\theta]^2}{n\theta(1 - \theta)(\beta - 1)} + \frac{(\mu_4 - \mu_2^2)}{n} \cdot \frac{(1 - \beta\theta)^8}{\theta^2(1 - \theta)^2(\beta - 1)^2} - \frac{2\mu_3}{n} \cdot \frac{(1 - \beta\theta)^6[1 + \beta\theta - 2\theta]}{\theta^2(1 - \theta)^2(\beta - 1)^2}$</p><p>$\operatorname{Bias}(\hat\theta) = -\frac{4(\beta - 1)}{n(1 - \beta\theta)}[2 + \beta\theta - 3\theta] - \frac{\mu_4 - \mu_2^2}{n}\left[\frac{(1 - \beta\theta)^6}{\theta^2(1 - \theta)(\beta - 1)^2}\right] + \frac{\mu_3}{n}\left[\frac{(1 - \beta\theta)^4(1 + 2\beta\theta - 3\theta)}{(\beta - 1)^2\theta^2(1 - \theta)}\right]$</p><p>$\operatorname{Bias}(\hat\beta) = \frac{1}{n}\left[1 + \frac{(\mu_4 - \mu_2^2)(1 - \beta\theta)^7}{\theta^2(1 - \theta)^2(\beta - 1)^2} - \frac{\mu_3(1 - \beta\theta)^3}{\theta^2(1 - \theta)^2}(1 + \beta\theta - 2\theta)\right]$</p><p>$\operatorname{cov}(\hat\beta, \hat\theta) = \frac{\mu_2}{n}\left(\frac{\partial\theta}{\partial\bar{x}}\right)\left(\frac{\partial\beta}{\partial\bar{x}}\right) + \frac{\mu_4 - \mu_2^2}{n}\left(\frac{\partial\theta}{\partial s^2}\right)\left(\frac{\partial\beta}{\partial s^2}\right) + \frac{\mu_3}{n}\left[\left(\frac{\partial\theta}{\partial\bar{x}}\right)\left(\frac{\partial\beta}{\partial s^2}\right) + \left(\frac{\partial\theta}{\partial s^2}\right)\left(\frac{\partial\beta}{\partial\bar{x}}\right)\right]$</p><p>Note that the determinant of the variance-covariance matrix of the moment estimators is given by:</p><p>$D = \operatorname{var}(\hat\theta) \cdot \operatorname{var}(\hat\beta) - \operatorname{cov}^2(\hat\theta, \hat\beta)$</p><p>Example: Modeling the number of brain lesions to predict Multiple Sclerosis.</p><p>The use of gadolinium (Gd) with T1-weighted imaging can identify areas of breakdown in the blood-brain barrier and increases the reliability in detecting active Multiple Sclerosis (MS) lesions [<xref ref-type="bibr" rid="scirp.116617-ref8">8</xref>]. The number of new Gd-enhancing lesions is a widely used end point for monitoring disease activity and for evaluating the effect of treatments in phase II clinical trials. In these studies, the results of the Magnetic Resonance Imaging (MRI) end point are in the form of counts [<xref ref-type="bibr" rid="scirp.116617-ref9">9</xref>]. To deal with the problem of overdispersion, the negative binomial distribution is used to model this type of data.</p><p>As an application of the QNBD, we simulated lesion count data resembling the situation described in [<xref ref-type="bibr" rid="scirp.116617-ref8">8</xref>] (<xref ref-type="table" rid="table1">Table 1</xref>).</p><p>The sample size = 116 subjects.</p><p>The histogram of the data is given in <xref ref-type="fig" rid="fig3">Figure 3</xref>.</p><p>The y-axis shows the frequency of each value of x.</p><p>mean(x) = 3.37, and var(x) = 69.63.</p><p>The moment estimates are $\hat\theta = 0.077$ and $\hat\beta = 10.227$.</p><p>Bootstrapping the distribution of the moment estimators gives:</p><p>$\operatorname{SE}(\hat\theta) = 0.344$, and $\operatorname{SE}(\hat\beta) = 0.106$ (<xref ref-type="fig" rid="fig4">Figure 4</xref>).</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> Distribution of the number of brain lesions</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >X</th><th align="center" valign="middle" 
>0</th><th align="center" valign="middle" >1</th><th align="center" valign="middle" >2</th><th align="center" valign="middle" >3</th><th align="center" valign="middle" >4</th><th align="center" valign="middle" >5</th><th align="center" valign="middle" >6</th><th align="center" valign="middle" >7</th><th align="center" valign="middle" >8</th><th align="center" valign="middle" >9</th><th align="center" valign="middle" >10</th><th align="center" valign="middle" >11</th><th align="center" valign="middle" >12</th><th align="center" valign="middle" >13</th><th align="center" valign="middle" >14</th><th align="center" valign="middle" >15</th><th align="center" valign="middle" >16</th><th align="center" valign="middle" >17</th><th align="center" valign="middle" >18</th><th align="center" valign="middle" >50</th><th align="center" valign="middle" >60</th></tr></thead><tr><td align="center" valign="middle" >Freq</td><td align="center" valign="middle" >56</td><td align="center" valign="middle" >14</td><td align="center" valign="middle" >7</td><td align="center" valign="middle" >5</td><td align="center" valign="middle" >3</td><td align="center" valign="middle" >3</td><td align="center" valign="middle" >2</td><td align="center" valign="middle" >2</td><td align="center" valign="middle" >2</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >1</td></tr></tbody></table></table-wrap><p>The distribution is negatively skewed. 
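</p><p>The moment estimates quoted for the lesion data can be reproduced approximately from the reported summary statistics alone. The sketch below is a hedged illustration: the function name is hypothetical, θ is obtained from (8), and β is obtained by solving the mean equation (3) for β at the estimated θ, which is one algebraically valid way to complete the moment solution and is an assumption about the intended form of (9). With mean 3.37 and variance 69.63 it returns values close to, but not identical with, the reported 0.077 and 10.227, presumably because the published summary statistics are rounded.</p><preformat>
```python
def qnbd_moment_estimates(xbar, s2):
    """Moment estimates (theta_hat, beta_hat) from a sample mean and variance."""
    ratio = xbar * (1.0 + xbar) ** 2 / s2
    if not (s2 > 0.0 and 1.0 > ratio):
        # theta_hat must lie in (0, 1); this needs s2 > xbar * (1 + xbar)**2.
        raise ValueError("sample variance too small for a valid moment estimate")
    theta_hat = 1.0 - ratio  # equation (8)
    # Solve the mean equation (3), xbar = (beta - 1) theta / (1 - beta theta),
    # for beta at theta = theta_hat (assumed form of the second moment equation).
    beta_hat = (xbar + theta_hat) / (theta_hat * (1.0 + xbar))
    return theta_hat, beta_hat

# Summary statistics reported in the text for the simulated lesion counts.
theta_hat, beta_hat = qnbd_moment_estimates(3.37, 69.63)
print(round(theta_hat, 4), round(beta_hat, 3))
```
</preformat><p>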
The empirical bias of the moment estimator $\hat\theta$ is $\operatorname{bias}(\hat\theta) = -0.1659$.</p><p>Similarly, $\operatorname{bias}(\hat\beta) = -9.77$.</p><p>From <xref ref-type="fig" rid="fig5">Figure 5</xref> we may infer that the distribution of $\hat\beta$ seems to be a mixture of two distributions, or is bimodal. From these results, we may conclude that the moment estimators are not reliable unless we have an extremely large sample. In the next section, we discuss maximum likelihood estimation.</p></sec><sec id="s4"><title>4. Maximum Likelihood Estimators (MLE)</title><p>It is well known that the estimators obtained by the method of maximum likelihood possess optimal properties such as asymptotic normality and efficiency. Based on a simple random sample, the log-likelihood function (l) is given by:</p><p>$l = n\log(\beta - 1) - \sum_{i=1}^{n}\log(\beta - 1 + \beta x_i) + \sum_{i=1}^{n}\sum_{j=1}^{x_i}\log[\beta(1 + x_i) - j] + n\bar{x}\log\theta + n(\beta - 1)(1 + \bar{x})\log(1 - \theta)$ (10)</p><p>Setting $\partial l/\partial\theta$ equal to zero and solving for θ we get:</p><p>$\tilde\theta = \frac{\bar{x}}{\beta(1 + \bar{x}) - 1}$ (11)</p><p>Similarly, setting $\partial l/\partial\beta$ equal to zero and solving for β we get:</p><p>$\tilde\beta = (1 + \bar{x})^{-1} + \bar{x}(1 + \bar{x})^{-1}\left[1 - \exp\left[\Omega_x - \left((\tilde\beta - 1) + (1 - \bar{x})\right)^{-1}\right]\right]^{-1}$ (12)</p><p>where</p><p>$\Omega_x = -\left(n(1 + \bar{x})\right)^{-1}\left[\sum_{i=1}^{n}\frac{(1 + x_i)}{\tilde\beta(1 + x_i) - 1} - \sum_{i=1}^{n}\sum_{j=1}^{x_i}\frac{(1 + x_i)}{\tilde\beta(1 + x_i) - j}\right]$</p><p>The MLEs of θ and β are thus obtained by solving (11) and (12) iteratively, noting that (12) has the form $\tilde\beta = f(\tilde\beta)$, that is, a fixed-point equation.</p><p>Elements of the variance-covariance matrix of $(\tilde\theta, \tilde\beta)$ are obtained by inverting Fisher’s information matrix. 
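</p><p>The likelihood equations can also be solved by profiling, which may be easier to follow than the fixed-point form. The sketch below is an illustration under stated assumptions and is not the authors’ R code from Appendix 1: it swaps the fixed-point iteration on (12) for a plain grid search over β, eliminating θ through the score equation for θ, that is θ = x̄ / (β(1 + x̄) − 1). The frequency table is hypothetical (expected frequencies for 1000 observations under β = 3 and θ = 0.2), chosen so that the maximizer is known in advance.</p><preformat>
```python
import math

def qnbd_log_pmf(x, beta, theta):
    """Log of the QNBD probability function (1)."""
    a = beta - 1.0
    return (
        math.log(a) - math.log(a + beta * x)
        + math.lgamma(beta + beta * x) - math.lgamma(x + 1.0)
        - math.lgamma(beta + beta * x - x)
        + x * math.log(theta)
        + (beta + beta * x - x - 1.0) * math.log(1.0 - theta)
    )

def theta_given_beta(beta, xbar):
    """Root of the theta score equation for a fixed beta."""
    return xbar / (beta * (1.0 + xbar) - 1.0)

def profile_loglik(beta, values, freqs):
    """Log-likelihood of a frequency table with theta profiled out."""
    n = sum(freqs)
    xbar = sum(f * x for x, f in zip(values, freqs)) / n
    theta = theta_given_beta(beta, xbar)
    return sum(f * qnbd_log_pmf(x, beta, theta) for x, f in zip(values, freqs))

# Hypothetical data: expected frequencies under beta = 3, theta = 0.2.
values = list(range(400))
freqs = [1000.0 * math.exp(qnbd_log_pmf(x, 3.0, 0.2)) for x in values]

# Grid search over beta from 1.5 to 6.0 in steps of 0.01 (assumed search range).
grid = [1.5 + 0.01 * k for k in range(451)]
beta_tilde = max(grid, key=lambda b: profile_loglik(b, values, freqs))
xbar = sum(f * x for x, f in zip(values, freqs)) / sum(freqs)
theta_tilde = theta_given_beta(beta_tilde, xbar)
print(beta_tilde, theta_tilde)
```
</preformat><p>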
We can show that</p><p>$i_{\theta\theta} = -E\left[\frac{\partial^2 l}{\partial\theta^2}\right] = \frac{n(\beta - 1)}{\theta(1 - \theta)(1 - \beta\theta)}$ (13)</p><p>$i_{\theta\beta} = -E\left[\frac{\partial^2 l}{\partial\theta\,\partial\beta}\right] = \frac{n}{1 - \beta\theta}$ (14)</p><p>$i_{\beta\beta} = -E\left[\frac{\partial^2 l}{\partial\beta^2}\right] = \frac{n}{(\beta - 1)^2} - E\left[\sum_{i=1}^{n}\frac{(1 + x_i)^2}{[\beta(1 + x_i) - 1]^2} - \sum_{i=1}^{n}\sum_{j=1}^{x_i}\frac{(1 + x_i)^2}{[\beta(1 + x_i) - j]^2}\right]$ (15)</p><p>$\operatorname{var}(\tilde\theta) = i_{\beta\beta}/\Delta, \quad \operatorname{var}(\tilde\beta) = i_{\theta\theta}/\Delta,$</p><p>and</p><p>$\operatorname{cov}(\tilde\theta, \tilde\beta) = -i_{\theta\beta}/\Delta$</p><p>where $\Delta = i_{\theta\theta} \cdot i_{\beta\beta} - i_{\theta\beta}^2$</p><p>We note that on using the digamma approximation we can write</p><p>$i_{\beta\beta} = \frac{n}{(\beta - 1)^2} - E\left[\sum_{i=1}^{n}\frac{(1 + x_i)^2}{[\beta(1 + x_i) - 1]^2} - \sum_{i=1}^{n}\frac{x_i(1 + x_i)^2}{[1 + \beta(1 + x_i)][1 + \beta(1 + x_i) - x_i]}\right]$</p><p>The R code for fitting the QNBD is given in Appendix 1.</p></sec><sec id="s5"><title>5. Orthogonal Polynomial Approximation for $i_{\beta\beta}$</title><p>The evaluation of the asymptotic variance-covariance matrix is difficult because $P_{22} = -E\left[\frac{\partial^2\log l}{\partial\beta^2}\right]$ does not have a tractable form. To overcome this difficulty, following [<xref ref-type="bibr" rid="scirp.116617-ref5">5</xref>] we employ an asymptotic expansion for $\frac{\partial\log P_x}{\partial\beta}$ as a linear combination of orthogonal polynomials. From Morgan et al. [<xref ref-type="bibr" rid="scirp.116617-ref9">9</xref>], if $P_x$ is a distribution function with finite moments $\mu_r$ of all orders, then the point $x_0$ is a point of increase for $P_x$ if $P_{x_0 + h} &gt; P_{x_0 - h}$ for every $h &gt; 0$. 
If the distribution function P has at least Y points of increase, Cram&#233;r [<xref ref-type="bibr" rid="scirp.116617-ref10">10</xref>] has proved that there exists a sequence of polynomials $G_0(x), G_1(x), \ldots$ uniquely determined by the following conditions:</p><p>1) $G_n(x)$ is of degree n, and the coefficient of $x^n$ in $G_n(x)$ is positive</p><p>2) $G_n(x)$ satisfy the orthogonality conditions</p><p>$\sum_{x=0}^{\infty} G_r(x)G_s(x)P_x = \begin{cases} E(G_r^2(x)) &amp; \text{if } r = s \\ 0 &amp; \text{if } r \ne s \end{cases} \quad (r, s = 0, 1, 2, \ldots)$</p><p>Szegő [<xref ref-type="bibr" rid="scirp.116617-ref11">11</xref>] derived the formal Fourier expansion of a continuous function $h(x)$ in terms of a set of orthogonal polynomials such that:</p><p>$h(x) = \sum_{r=0}^{\infty} a_r G_r(x)$</p><p>where $a_r$ are selected so that:</p><p>$\sum_{x=0}^{\infty}\left[\frac{\partial\log P_x}{\partial\beta} - \sum_{r=0}^{\infty} a_r G_r(x)\right]^2 P_x$</p><p>is minimum. He showed that</p><p>$a_0 \equiv 0, \quad a_1 = \frac{\partial\mu'_1}{\partial\beta}\Big/E(G_1^2(x)),$</p><p>$a_2 = \left(\frac{\partial\mu_2}{\partial\beta} - \frac{\mu_3}{\mu_2}\frac{\partial\mu'_1}{\partial\beta}\right)\Big/E(G_2^2(x))$</p><p>Direct calculations give:</p><p>$G_0 \equiv 1$, $G_1(x) = x - \mu'_1$ and $G_2(x) = (x - \mu'_1)^2 - \frac{\mu_3}{\mu_2}(x - \mu'_1) - \mu_2$ are the orthogonal polynomials associated with the probability function $P_x$, where $E(G_1^2(x)) = \mu_2$. 
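</p><p>The orthogonality of the first two polynomials, and the closed form of E(G2²) that is denoted Ω in the text, can be checked numerically. The following Python sketch is illustrative only: the helper names, the parameter values β = 3 and θ = 0.2, and the truncation of the support at 600 are assumptions and do not come from the paper.</p><preformat>
```python
import math

def qnbd_pmf(x, beta, theta):
    """QNBD probability function (1), evaluated through its logarithm."""
    a = beta - 1.0
    log_p = (
        math.log(a) - math.log(a + beta * x)
        + math.lgamma(beta + beta * x) - math.lgamma(x + 1.0)
        - math.lgamma(beta + beta * x - x)
        + x * math.log(theta)
        + (beta + beta * x - x - 1.0) * math.log(1.0 - theta)
    )
    return math.exp(log_p)

beta, theta = 3.0, 0.2          # illustrative values, not from the paper
xs = range(600)                 # assumed truncation of the support
probs = [qnbd_pmf(x, beta, theta) for x in xs]

def expect(fn):
    """Expectation of fn(X) under the truncated probability table."""
    return sum(fn(x) * p for x, p in zip(xs, probs))

mu1 = expect(lambda x: x)
mu2 = expect(lambda x: (x - mu1) ** 2)
mu3 = expect(lambda x: (x - mu1) ** 3)
mu4 = expect(lambda x: (x - mu1) ** 4)

def g1(x):
    return x - mu1

def g2(x):
    return (x - mu1) ** 2 - (mu3 / mu2) * (x - mu1) - mu2

cross = expect(lambda x: g1(x) * g2(x))      # should vanish
norm1 = expect(lambda x: g1(x) ** 2)         # should equal mu2
norm2 = expect(lambda x: g2(x) ** 2)         # should equal omega
omega = mu4 - mu3 ** 2 / mu2 - mu2 ** 2
print(cross, norm1, mu2, norm2, omega)
```
</preformat><p>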
Moreover, we write</p><p>$\Omega = E(G_2^2(x)) = E\left[(x - \mu'_1)^4 + \frac{\mu_3^2}{\mu_2^2}(x - \mu'_1)^2 + \mu_2^2 - 2\frac{\mu_3}{\mu_2}(x - \mu'_1)^3 - 2\mu_2(x - \mu'_1)^2 + 2\frac{\mu_3}{\mu_2}(x - \mu'_1)\mu_2\right]$</p><p>Hence</p><p>$\Omega = \mu_4 + \frac{\mu_3^2}{\mu_2} + \mu_2^2 - 2\frac{\mu_3^2}{\mu_2} - 2\mu_2^2 = \mu_4 - \frac{\mu_3^2}{\mu_2} - \mu_2^2$</p><p>Now, since $\frac{\partial\mu'_1}{\partial\beta} = \frac{\partial}{\partial\beta}\left[\frac{(\beta - 1)\theta}{1 - \beta\theta}\right] = \frac{\theta(1 - \theta)}{(1 - \beta\theta)^2}$, and</p><p>$\frac{\partial\mu_2}{\partial\beta} = \frac{\partial}{\partial\beta}\left[\frac{(\beta - 1)\theta(1 - \theta)}{(1 - \beta\theta)^3}\right] = \frac{\theta(1 - \theta)(1 + 2\beta\theta - 3\theta)}{(1 - \beta\theta)^4}$</p><p>then</p><p>$\frac{\partial\log P_x}{\partial\beta} = \frac{\partial\mu'_1}{\partial\beta}\,\frac{(x - \mu'_1)}{E(G_1^2(x))} + \frac{\left(\frac{\partial\mu_2}{\partial\beta} - \frac{\mu_3}{\mu_2} \cdot \frac{\partial\mu'_1}{\partial\beta}\right)}{E(G_2^2(x))}\left[(x - \mu'_1)^2 - \frac{\mu_3}{\mu_2}(x - \mu'_1) - \mu_2\right]$</p><p>Since</p><p>$nE\left[\frac{\partial\log P}{\partial\beta}\right]^2 = -E\left[\frac{\partial^2\log l}{\partial\beta^2}\right]$</p><p>then</p><p>$i_{\beta\beta} \simeq n\left[\frac{\theta(1 - \theta)}{(\beta - 1)(1 - \beta\theta)} + \frac{\theta^4(1 - \theta)^2}{(1 - \beta\theta)^6\left[\mu_4 - \frac{\mu_3^2}{\mu_2} - \mu_2^2\right]}\right]$</p><p>The asymptotic relative efficiency of the moment estimators is therefore given by:</p><p>$\mathrm{Eff} = \frac{1}{\Delta D}$</p><p>For the lesion data Eff = 16.6%. We interpret this number as follows: the generalized variance of the maximum likelihood estimators is only 16.6% of that of the moment estimators, so the moment estimators need a considerably larger sample to match the precision of the maximum likelihood estimators.</p><p>Asymptotic biases of the MLE</p><p>Unlike the moment estimators, $(\tilde\theta, \tilde\beta)$ do not have closed form expressions, and the delta method cannot be used to obtain their asymptotic biases. Shenton and Wallington [<xref ref-type="bibr" rid="scirp.116617-ref12">12</xref>] used an approach that depends on the asymptotic expansion of the log-likelihood function. 
We denote the biases of θ ˜ , and β ˜ by b 1 ( θ ˜ ) and b 2 ( β ˜ ) , and these are the solutions of the system of equations</p><p>( P 11 P 12 P 21 P 22 ) [ b 1 ( θ ˜ ) b 2 ( β ˜ ) ] = [ − A 1 2 D n − A 2 2 D n ]</p><p>In the above system of equation we have the following notations:</p><p>D = P 11 P 22 − P 12 2</p><p>A 1 = P 22 − P 1 , 11 − 2 P 12 P 1 , 12 + P 11 P 1 , 22</p><p>A 2 = P 22 − P 2 , 11 − 2 P 12 P 2 , 12 + P 11 P 2 , 22</p><p>And</p><p>P 1 , 11 = E [ P − 2 ∂ P ∂ θ ⋅ ∂ 2 P ∂ θ 2 ]</p><p>P 1 , 12 = E [ P − 2 ∂ P ∂ θ ⋅ ∂ 2 P ∂ θ ∂ β ]</p><p>P 1 , 22 = E [ P − 2 ∂ P ∂ θ ⋅ ∂ 2 P ∂ β 2 ]</p><p>P 2 , 11 = E [ P − 2 ∂ P ∂ β ⋅ ∂ 2 P ∂ θ 2 ]</p><p>P 2 , 12 = E [ P − 2 ∂ P ∂ β ⋅ ∂ 2 P ∂ θ ∂ β ]</p><p>P 2 , 22 = E [ P − 2 ∂ P ∂ β ⋅ ∂ 2 P ∂ β 2 ]</p><p>Since</p><p>∂ P ∂ θ = ( x − μ ′ 1 ) ( 1 − β θ ) θ ( 1 − θ ) ⋅ P</p><p>We can show that</p><p>P 1 , 11 = μ 2 w 1 + μ 3 w 2</p><p>where</p><p>w 1 = ( 1 − β θ ) ( 2 θ − β θ 2 − 1 ) θ 3 ( 1 − θ ) 3</p><p>and</p><p>w 2 = ( 1 − β θ ) 3 θ 3 ( 1 − θ ) 3</p><p>P 1 , 22 = [ θ ( 1 − θ ) ( 1 − β θ ) ] − 1</p><p>P 2 , 11 = ( 1 − θ ) − 2 ,     P 1 , 22 = P 2 , 22 = 0</p><p>P 2 , 12 = D 1 + D 2 + D 3</p><p>where</p><p>D 1 = 1 − 2 θ + β θ ( 2 − 0 ) ( β − 1 ) ( 1 − θ ) 2</p><p>D 2 = [ μ 2 ∂ μ 2 ∂ β − μ 3 ∂ μ ′ 1 ∂ β μ 2 μ 4 − μ 3 2 − μ 2 3 ] 2 [ μ 5 + μ 3 3 μ 2 2 − 2 μ 3 μ 4 μ 2 ]</p><p>D 3 = 2 ( 1 − β θ ) 5 ( β − 1 ) 2 θ ( 1 − θ ) 2 [ μ 2 − ∂ μ 2 ∂ β − μ 3 ∂ μ ′ 1 ∂ β ]</p><p>where</p><p>∂ μ ′ 1 ∂ β = θ ( 1 − θ ) ( 1 − β θ ) 2     and     ∂ μ 2 ∂ β = θ ( 1 − θ ) ( 1 + 2 β θ − 3 θ ) ( 1 − β θ ) 4</p><p>Finally, using the above information we can show that P 2 , 22 = 0 .</p><p>Solving the system of equations, we obtain the asymptotic biases so that</p><p>bias ( θ ˜ ) = θ 2 ( 1 − θ ) 3 ( 1 − β θ ) 2 2 n [ ( β − 1 ) ( 1 − θ ) P 22 − θ ( 1 − β θ ) ] 2     &#215; [ P 22 ( 1 − θ ) 2 − 2 P 1 , 12 ( 1 − θ ) + 2 P 22 θ ( 1 − θ ) ( 1 − β θ ) − 2 β ( β − 1 ) P 22 2 θ ( 1 − β θ ) 2 ]</p><p>bias ( β ˜ ) = θ 2 ( 1 − θ ) 2 ( 1 − β θ ) 2 n [ ( β − 
1 ) ( 1 − θ ) P 22 − ( 1 − β θ ) ] 2     &#215; [ 2 β ( β − 1 ) β P 22 θ ( 1 − β θ ) − 2 θ ( 1 − θ ) + 2 ( β − 1 ) P 22 θ − ( β − 1 ) P 22 1 − θ ]</p><p>For the lesion data, the biases of the maximum likelihood estimators are given by:</p><p>$\operatorname{bias}(\tilde\theta) = 0.003$, and $\operatorname{bias}(\tilde\beta) = 0.002$</p></sec><sec id="s6"><title>6. Quasi Negative Binomial Regression</title><p>Our aim in this section is to develop a regression model based on the QNBD. The approach is facilitated by the fact that the QNBD is a member of the regular exponential family, as shown in [<xref ref-type="bibr" rid="scirp.116617-ref13">13</xref>]. We employ the transformation:</p><p>$\tau(\theta_i) = z_i^{T}\gamma, \quad i = 1, 2, \ldots, n$ (16)</p><p>Here we assume $\tau(\theta_i)$ to be a monotone, differentiable, and positive function of θ [<xref ref-type="bibr" rid="scirp.116617-ref13">13</xref>]. In (16), $z_i$ is a $\nu \times 1$ vector ($\nu &lt; n$) of explanatory variables and γ is a vector of regression parameters. To estimate $\gamma_1, \gamma_2, \ldots, \gamma_q$ and β, we assume that</p><p>$x_i \sim \mathrm{QNBD}(\theta_i, \beta), \quad i = 1, 2, \ldots, n$</p><p>are independent random variables and</p><p>$\operatorname{logit}[\theta_i(z)] = z_i^{T}\gamma$ (17)</p><p>In this section, we derive the maximum likelihood estimators of the regression parameters and of the parameter β, together with their asymptotic properties. 
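The log-likelihood function is given by:</p><p>$l = n\log(\beta - 1) - \sum_{i=1}^{n}\log(\beta - 1 + \beta x_i) + \sum_{i=1}^{n}\log\Gamma(\beta + \beta x_i) - \sum_{i=1}^{n}\log\Gamma(\beta + \beta x_i - x_i) + \sum_{i=1}^{n} x_i\log\theta_i + \sum_{i=1}^{n}(\beta - 1)(1 + x_i)\log(1 - \theta_i) = l_1(\beta, x_i) + l_2(\beta, x_i, \theta_i)$</p><p>$\frac{\partial l_2}{\partial\gamma_r} = \sum_{i=1}^{n} x_i\frac{\partial}{\partial\gamma_r}[\log\theta_i] + \sum_{i=1}^{n}(\beta - 1)(1 + x_i)\frac{\partial}{\partial\gamma_r}[\log(1 - \theta_i)] = \sum_{i=1}^{n} x_i z_{ir}\left[1 + e^{z_i^{T}\gamma}\right]^{-1} - \sum_{i=1}^{n}(\beta - 1)(1 + x_i)z_{ir}e^{z_i^{T}\gamma}\left[1 + e^{z_i^{T}\gamma}\right]^{-1}$</p><p>$\sigma_{r\beta} \equiv \frac{\partial^2 l_2}{\partial\beta\,\partial\gamma_r} = -\sum_{i=1}^{n}(1 + x_i)z_{ir}e^{z_i^{T}\gamma}\left[1 + e^{z_i^{T}\gamma}\right]^{-1}$</p><p>$I_{\beta r} = -E\left[\frac{\partial^2 l}{\partial\beta\,\partial\gamma_r}\right] = \sum_{i=1}^{n} z_{ir}\frac{e^{z_i^{T}\gamma}}{1 + e^{z_i^{T}\gamma}}E[1 + x_i] = \sum_{i=1}^{n} z_{ir}e^{z_i^{T}\gamma}\left[\left(1 + e^{z_i^{T}\gamma}\right)\left(1 - (\beta - 1)e^{z_i^{T}\gamma}\right)\right]^{-1}$</p><p>$\frac{\partial^2 l_2}{\partial\gamma_r\,\partial\gamma_s} = -\sum_{i=1}^{n} x_i z_{ir}z_{is}\frac{e^{z_i^{T}\gamma}}{\left(1 + e^{z_i^{T}\gamma}\right)^2} - (\beta - 1)\sum_{i=1}^{n}(1 + x_i)z_{ir}z_{is}\left[\frac{e^{z_i^{T}\gamma}}{1 + e^{z_i^{T}\gamma}} - \left[\frac{e^{z_i^{T}\gamma}}{1 + e^{z_i^{T}\gamma}}\right]^2\right]$</p><p>$-E\left[\frac{\partial^2 l_2}{\partial\gamma_r\,\partial\gamma_s}\right] \equiv \sigma_{rs} = \sum_{i=1}^{n} z_{ir}z_{is}\theta_i(1 - \theta_i)\frac{(\beta - 1)\theta_i}{1 - \beta\theta_i} + (\beta - 1)\sum_{i=1}^{n} z_{ir}z_{is}\frac{(1 - \theta_i)}{1 - \beta\theta_i}\left[\theta_i - \theta_i^2\right]$</p><p>$-E\left[\frac{\partial^2 l}{\partial\gamma_r\,\partial\gamma_s}\right] = (\beta - 1)\sum_{i=1}^{n} z_{ir}z_{is}\frac{\theta_i(1 - \theta_i)}{1 - \beta\theta_i}$</p><p>where,</p><p>$\theta_i = e^{z_i^{T}\gamma}/\left(1 + e^{z_i^{T}\gamma}\right)$</p><p>$\frac{\partial l}{\partial\beta}$ can be approximated using the result:</p><p>$\frac{\Gamma'(y)}{\Gamma(y)} = -\delta - \frac{1}{y} + \sum_{j=1}^{\infty}\frac{y}{j(y + j)}$</p><p>where δ is Euler’s constant. 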
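</p><p>The series quoted above for Γ′(y)/Γ(y) can be checked numerically. The sketch below is illustrative and rests on two assumptions: δ is taken to be the Euler-Mascheroni constant (approximately 0.5772), for which this standard series holds, and the infinite sum is truncated at a hypothetical 200000 terms. A central difference of log Γ serves only as an independent check.</p><preformat>
```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant (assumed meaning of delta)

def digamma_series(y, terms=200000):
    """Truncated series -delta - 1/y + sum over j of y / (j * (y + j))."""
    partial = sum(y / (j * (y + j)) for j in range(1, terms + 1))
    return -EULER_GAMMA - 1.0 / y + partial

def digamma_numeric(y, h=1e-5):
    """Central difference of log Gamma, used only as an independent check."""
    return (math.lgamma(y + h) - math.lgamma(y - h)) / (2.0 * h)

checks = [(y, digamma_series(y), digamma_numeric(y)) for y in (1.0, 2.0, 7.5)]
for y, series_value, numeric_value in checks:
    print(y, series_value, numeric_value)
```
</preformat><p>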
Therefore</p><p>$\frac{\partial l}{\partial\beta} \simeq \frac{n}{\beta - 1} - \sum_{i=1}^{n}\frac{(1 + x_i)}{\beta(1 + x_i) - 1} + \sum_{i=1}^{n}(1 + x_i)\left[\log(1 + \beta(1 + x_i)) - \log(\beta(1 + x_i) + 1 - x_i)\right] + \sum_{i=1}^{n}(1 + x_i)\log(1 - \theta_i)$</p><p>$\frac{\partial l}{\partial\beta} = \frac{n}{\beta - 1} - \sum_{i=1}^{n}\frac{(1 + x_i)}{(\beta - 1 + \beta x_i)} + \sum_{i=1}^{n}\frac{\Gamma'(\beta + \beta x_i)}{\Gamma(\beta + \beta x_i)}(1 + x_i) - \sum_{i=1}^{n}\frac{\Gamma'(\beta + \beta x_i - x_i)}{\Gamma(\beta + \beta x_i - x_i)}(1 + x_i) + \sum_{i=1}^{n} x_i\log(1 - \theta_i) + \sum_{i=1}^{n}\log(1 - \theta_i) = \frac{n}{\beta - 1} - \sum_{i=1}^{n}\frac{(1 + x_i)}{\beta - 1 + \beta x_i} + \sum_{i=1}^{n}\frac{\Gamma'(\beta + \beta x_i)}{\Gamma(\beta + \beta x_i)}(1 + x_i) - \sum_{i=1}^{n}\frac{\Gamma'(\beta + \beta x_i - x_i)}{\Gamma(\beta + \beta x_i - x_i)}(1 + x_i) + \sum_{i=1}^{n}(1 + x_i)\log(1 - \theta_i)$</p><p>$\frac{\partial^2 l}{\partial\beta^2} \doteq -\frac{n}{(\beta - 1)^2} + \sum_{i=1}^{n}\frac{(1 + x_i)^2}{[\beta(1 + x_i) - 1]^2} + \sum_{i=1}^{n}(1 + x_i)\left\{\frac{(1 + x_i)}{1 + \beta(1 + x_i)} - \frac{(1 + x_i)}{1 + \beta(1 + x_i) - x_i}\right\}$</p><p>Simplifying we get:</p><p>$-\frac{\partial^2 l}{\partial\beta^2} = \frac{n}{(\beta - 1)^2} - \sum_{i=1}^{n}\frac{(1 + x_i)^2}{[\beta(1 + x_i) - 1]^2} + \sum_{i=1}^{n}\frac{x_i(1 + x_i)^2}{[1 + \beta(1 + x_i)]^2 - x_i[1 + \beta(1 + x_i)]}$</p><p>$\sigma_{\beta\beta} = -E\left[\frac{\partial^2 l}{\partial\beta^2}\right]$ can be approximated by:</p><p>$\sigma_{\beta\beta} = \frac{1}{(\beta - 1)^2}\sum_{i=1}^{n}\theta_i(2 - \theta_i) + (\beta - 1)\sum_{i=1}^{n}\frac{\theta_i(1 - \theta_i)^2}{(1 - \beta\theta_i)(1 - 2\beta\theta_i + \beta)(1 - 3\beta\theta_i + \beta + \theta_i)}$</p><p>The variance-covariance matrix of the estimated regression parameters γ and β is given by the inverse of Fisher’s information matrix:</p><p>$\Sigma = \begin{bmatrix} \sigma_{\gamma\gamma} &amp; \sigma_{\gamma\beta} \\ \sigma_{\gamma\beta}^{T} &amp; \sigma_{\beta\beta} \end{bmatrix}^{-1} = \begin{bmatrix} M &amp; O \\ O^{T} &amp; C \end{bmatrix}$</p><p>where M is a $q \times q$ symmetric matrix whose elements are $m_{ij} = \operatorname{cov}(\hat\gamma_i, \hat\gamma_j)$, O is a $q \times 1$ vector whose elements are $O_j = \operatorname{cov}(\hat\gamma_j, \hat\beta)$, and C is a $1 \times 1$ element with $C = \operatorname{var}(\hat\beta)$.</p><p>The simplest approach to obtain the maximum likelihood estimators of γ and β is to solve the equations</p><p>$\frac{\partial l}{\partial\beta} = 0$ and $\frac{\partial l}{\partial\gamma_r} = 0$, $r = 1, 2, \ldots, q$, iteratively using a numerical technique such as Newton-Raphson. 
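</p><p>Before running an iterative solver it is useful to confirm that the score for the regression parameters is coded correctly. The following Python sketch is a minimal illustration under stated assumptions, not the authors’ implementation: it evaluates the QNBD regression log-likelihood under the logit link (17), computes the analytic score with respect to γ, which reduces to the sum over i of z_ir [x_i(1 − θ_i) − (β − 1)(1 + x_i)θ_i], and compares it with a central finite difference. The toy covariates, counts and parameter values are hypothetical.</p><preformat>
```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(z, gamma):
    return sum(zj * gj for zj, gj in zip(z, gamma))

def qnb_reg_loglik(gamma, beta, z_rows, counts):
    """QNBD log-likelihood with logit(theta_i) = z_i' gamma.
    The constant -log(x_i!) is included; it does not affect the score."""
    total = 0.0
    for z, x in zip(z_rows, counts):
        theta = inv_logit(linear_predictor(z, gamma))
        total += (
            math.log(beta - 1.0) - math.log(beta - 1.0 + beta * x)
            + math.lgamma(beta + beta * x) - math.lgamma(x + 1.0)
            - math.lgamma(beta + beta * x - x)
            + x * math.log(theta)
            + (beta - 1.0) * (1.0 + x) * math.log(1.0 - theta)
        )
    return total

def qnb_reg_score_gamma(gamma, beta, z_rows, counts):
    """Analytic derivative of the log-likelihood with respect to each gamma_r."""
    score = [0.0] * len(gamma)
    for z, x in zip(z_rows, counts):
        theta = inv_logit(linear_predictor(z, gamma))
        resid = x * (1.0 - theta) - (beta - 1.0) * (1.0 + x) * theta
        for r, zr in enumerate(z):
            score[r] += zr * resid
    return score

# Hypothetical toy data: intercept plus one covariate, five observations.
z_rows = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0)]
counts = [3, 1, 2, 0, 1]
gamma, beta = [-1.5, -0.2], 2.0

analytic = qnb_reg_score_gamma(gamma, beta, z_rows, counts)
h = 1e-6
numeric = []
for r in range(len(gamma)):
    up = list(gamma)
    up[r] += h
    down = list(gamma)
    down[r] -= h
    diff = qnb_reg_loglik(up, beta, z_rows, counts) - qnb_reg_loglik(down, beta, z_rows, counts)
    numeric.append(diff / (2.0 * h))
print(analytic, numeric)
```
</preformat><p>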
Following Cox and Hinkley [<xref ref-type="bibr" rid="scirp.116617-ref14">14</xref>], as $n \to \infty$ and under certain regularity conditions, the maximum likelihood estimators $\hat\phi = (\hat\gamma, \hat\beta)$ are asymptotically normal and consistent.</p><p>That is</p><p>$\sqrt{n}(\hat\phi - \phi) \to N_{q+1}(0, \Sigma)$ in law</p><p>Limiting form of the QNBD: The Quasi-Poisson Distribution</p><p>As $\beta \to \infty$ and $\theta \to 0$ so that $\beta\theta = \alpha$, the distribution (1) takes the following form:</p><p>$P_x = \frac{\alpha^x}{x!}(1 + x)^{x - 1}e^{-\alpha(1 + x)}, \quad 0 &lt; \alpha &lt; 1$</p><p>$\mu = E(x) = \frac{\alpha}{1 - \alpha}, \quad \operatorname{var}(x) = \frac{\alpha}{(1 - \alpha)^3}$</p><p>Therefore, $\operatorname{var}(x) = \mu(1 + \mu)^2$. Expressing the distribution in terms of the mean parameter μ, the limiting distribution can be written as:</p><p>$P_x = \frac{(1 + x)^{x - 1}}{x!}\left(\frac{\mu}{1 + \mu}\right)^{x}e^{-\frac{\mu}{1 + \mu}(1 + x)}$ (18)</p><p>In a paper that follows, we shall discuss the issues of maximum likelihood estimation for the parameter μ of the probability function (18) and the regression model associated with it.</p></sec><sec id="s7"><title>7. Data Analysis: RNA-Seq Data: Modeling the Distribution of Read Counts</title><p>Over the past decade, various statistical analysis tools have been developed to analyze expression profiling data generated by microarrays (reviewed in [<xref ref-type="bibr" rid="scirp.116617-ref15">15</xref>] [<xref ref-type="bibr" rid="scirp.116617-ref16">16</xref>] [<xref ref-type="bibr" rid="scirp.116617-ref17">17</xref>]). Before these tools can be applied to RNA-Seq data, it is worth noting that microarray data and RNA-Seq data are inherently different [<xref ref-type="bibr" rid="scirp.116617-ref16">16</xref>]. Microarray data is “analog” since expression levels are represented as continuous hybridization signal intensities. In contrast, RNA-Seq data is “digital”, representing expression levels as discrete counts. 
This inherent difference leads to the difference in the parametric statistical methods that are used since they often depend on the assumptions of the random mechanism that generates the data. The Poisson, Binomial and Negative binomial distributions are more suitable for modeling discrete data in an RNA-Seq experiment. Therefore, a statistical method developed for microarray data analysis cannot be directly applied to RNA-Seq data analysis without first examining the underlying distributions. Recently several statistical methods have been developed to deal specifically with RNA-Seq count data [<xref ref-type="bibr" rid="scirp.116617-ref17">17</xref>]. In an RNA-Seq dataset, the expression levels of a specific gene were modeled using the Poisson distribution. This Poisson model is verified in the case where there are only technical replicates using a single source of RNA [<xref ref-type="bibr" rid="scirp.116617-ref15">15</xref>]. In the Poisson model, over-dispersion occurs if the sample variance is greater than the sample mean. There could be several sources that cause over-dispersion in RNA-Seq data, including the variability in biological replicates due to heterogeneity within a population of cells, possible correlation between gene expressions due to regulation, and other uncontrolled variations [<xref ref-type="bibr" rid="scirp.116617-ref18">18</xref>]. The existence of over-dispersion in real data was observed in several previous studies [<xref ref-type="bibr" rid="scirp.116617-ref18">18</xref>]. Popular models to safeguard against over-dispersion include the negative binomial distribution, or two-stage Poisson distribution [<xref ref-type="bibr" rid="scirp.116617-ref19">19</xref>], as discussed below.</p><p>When over-dispersion is observed across the samples, the gene counts cannot be estimated accurately by a simple Poisson model [<xref ref-type="bibr" rid="scirp.116617-ref20">20</xref>]. 
One way to handle this problem is to allow the Poisson mean to be a random variable and then model the gene counts by the resulting marginal distribution. Specifically, if the Poisson mean follows a Gamma distribution, then the marginal distribution of the gene count is Negative Binomial with mean $\mu_i$ and variance $\mu_i(1 + \varepsilon\mu_i)$, where ε is the dispersion parameter [<xref ref-type="bibr" rid="scirp.116617-ref20">20</xref>].</p><p>Yoon and Nam [<xref ref-type="bibr" rid="scirp.116617-ref21">21</xref>] [<xref ref-type="bibr" rid="scirp.116617-ref22">22</xref>] showed that the gene dispersion value, as estimated under negative binomial modelling of read counts, is the key determinant of the read count bias.</p><p>When multiple samples are available, instead of modeling the raw expression, we model the gene counts as a function of the experimental sample and the gene dispersion as covariates. For highly expressed genes we fitted the QNB regression model to published data that we downloaded from http://woldlab.caltech.edu/rnaseq/.</p><p>The published data were downloaded from http://www.ncbi.nlm.nih.gov/sra/ as the fastq files: SRA010153 for the MAQC data, SRP000727 for the human data (the two low-coverage MAQC samples were excluded), SRX000559-SRX000564 for the yeast data.</p><p>We analyzed the read counts of the mouse brain tissue data under four experimental conditions:</p><p>Z<sub>1</sub> = Chrom_ chr11, Z<sub>2</sub> = Chrom chr9_ra, Z<sub>3</sub> = Chrom chrUn_ra, and Z<sub>4</sub> = Chrom chr13_ra, and d = the gene dispersion levels. Z<sub>j</sub> are modeled as categorical variables, with Z<sub>4</sub> being the reference category, and d is measured on a continuous scale. 
<xref ref-type="fig" rid="fig6">Figure 6</xref> shows the histograms of the read counts for the four groups; summary statistics are given in Tables 2(a)-(d).</p><p>We now analyze the data using three count regression models: the Poisson, the Negative Binomial, and the QNB (Tables 3-5).</p><table-wrap-group id="2"><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> (a) Summary statistics of the read count data for Chrom-Chr1 sample; (b) Summary statistics of the read count data for Chrom-Chr13 sample; (c) Summary statistics of the read count data for Chrom-Chr9_ran sample; (d) Summary statistics of the read count data for Chrom-ChrUn_ran sample</title></caption><table-wrap id="2_1"><caption><title> (a) Chrom-chr1</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Variable</th><th align="center" valign="middle" >N</th><th align="center" valign="middle" >Mean</th><th align="center" valign="middle" >Std Dev</th><th align="center" valign="middle" >Minimum</th><th align="center" valign="middle" >Maximum</th></tr></thead><tr><td align="center" valign="middle" >d</td><td align="center" valign="middle" >36823</td><td align="center" valign="middle" >6.668</td><td align="center" valign="middle" >7.997</td><td align="center" valign="middle" >1.0</td><td align="center" valign="middle" >75.0</td></tr><tr><td align="center" valign="middle" >count</td><td align="center" valign="middle" >36823</td><td align="center" valign="middle" >7.99</td><td align="center" valign="middle" >8.905</td><td align="center" valign="middle" >1.0</td><td align="center" valign="middle" >68.0</td></tr></tbody></table></table-wrap><table-wrap id="2_2"><caption><title> (b) Chrom-chr13_ra</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Variable</th><th align="center" valign="middle" >N</th><th align="center" valign="middle" >Mean</th><th align="center" valign="middle" >Std Dev</th><th align="center" valign="middle" >Minimum</th><th align="center" valign="middle" >Maximum</th></tr></thead><tr><td align="center" valign="middle" >d</td><td align="center" valign="middle" >13</td><td align="center" valign="middle" >21.307</td><td align="center" valign="middle" >8.586</td><td align="center" valign="middle" >2.0</td><td align="center" valign="middle" >25.0</td></tr><tr><td align="center" valign="middle" >count</td><td align="center" valign="middle" >13</td><td align="center" valign="middle" >1.077</td><td align="center" valign="middle" >0.277</td><td align="center" valign="middle" >1.0</td><td align="center" valign="middle" >2.0</td></tr></tbody></table></table-wrap><table-wrap id="2_3"><caption><title> (c) Chrom-chr9_ran</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Variable</th><th align="center" valign="middle" >N</th><th align="center" valign="middle" >Mean</th><th align="center" valign="middle" >Std Dev</th><th align="center" valign="middle" >Minimum</th><th align="center" valign="middle" >Maximum</th></tr></thead><tr><td align="center" valign="middle" >d</td><td align="center" valign="middle" >698</td><td align="center" valign="middle" >10.126</td><td align="center" valign="middle" >9.293</td><td align="center" valign="middle" >1.0</td><td align="center" valign="middle" >50.0</td></tr><tr><td align="center" valign="middle" >count</td><td align="center" valign="middle" >698</td><td align="center" valign="middle" >3.030</td><td align="center" valign="middle" >2.369</td><td align="center" valign="middle" >1.0</td><td align="center" valign="middle" >13.0</td></tr></tbody></table></table-wrap><table-wrap id="2_4"><caption><title> (d) Chrom-chrUn_ran</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Variable</th><th align="center" valign="middle" >N</th><th align="center" valign="middle" >Mean</th><th align="center" valign="middle" >Std Dev</th><th align="center" valign="middle" >Minimum</th><th align="center" valign="middle" >Maximum</th></tr></thead><tr><td align="center" valign="middle" >d</td><td align="center" valign="middle" >89</td><td align="center" valign="middle" >22.843</td><td align="center" valign="middle" >6.626</td><td align="center" valign="middle" >1.0</td><td align="center" valign="middle" >25.0</td></tr><tr><td align="center" valign="middle" >count</td><td align="center" valign="middle" >89</td><td align="center" valign="middle" >1.157</td><td align="center" valign="middle" >0.541</td><td align="center" valign="middle" >1.0</td><td align="center" valign="middle" >4.0</td></tr></tbody></table></table-wrap></table-wrap-group><table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title> Results of fitting the data to the Poisson regression model</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >covariate</th><th align="center" valign="middle" >Estimate</th><th align="center" valign="middle" >SE</th><th align="center" valign="middle" 
>P-Value</th></tr></thead><tr><td align="center" valign="middle" >(Intercept)</td><td align="center" valign="middle" >1.936</td><td align="center" valign="middle" >0.267</td><td align="center" valign="middle" >0.0000</td></tr><tr><td align="center" valign="middle" >Z<sub>1</sub></td><td align="center" valign="middle" >0.671</td><td align="center" valign="middle" >0.267</td><td align="center" valign="middle" >0.012*</td></tr><tr><td align="center" valign="middle" >Z<sub>2</sub></td><td align="center" valign="middle" >−0.021</td><td align="center" valign="middle" >0.268</td><td align="center" valign="middle" >0.939</td></tr><tr><td align="center" valign="middle" >Z<sub>3</sub></td><td align="center" valign="middle" >0.427</td><td align="center" valign="middle" >0.285</td><td align="center" valign="middle" >0.134</td></tr><tr><td align="center" valign="middle" >d</td><td align="center" valign="middle" >−0.127</td><td align="center" valign="middle" >0.0006</td><td align="center" valign="middle" >&lt;0.000001</td></tr></tbody></table></table-wrap><p>AIC: 314241</p><table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title> Results of fitting data to the Negative Binomial regression model</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >covariate</th><th align="center" valign="middle" >Estimate</th><th align="center" valign="middle" >SE</th><th align="center" valign="middle" >P-Value</th></tr></thead><tr><td align="center" valign="middle" >(Intercept)</td><td align="center" valign="middle" >2.36746</td><td align="center" valign="middle" >0.33</td><td align="center" valign="middle" >0.06e-13***</td></tr><tr><td align="center" valign="middle" >Z<sub>1</sub></td><td align="center" valign="middle" >0.16405</td><td align="center" valign="middle" >0.33</td><td align="center" valign="middle" >0.619</td></tr><tr><td align="center" valign="middle" >Z<sub>2</sub></td><td align="center" valign="middle" 
>−0.39876</td><td align="center" valign="middle" >0.33</td><td align="center" valign="middle" >0.229</td></tr><tr><td align="center" valign="middle" >Z<sub>3</sub></td><td align="center" valign="middle" >0.21046</td><td align="center" valign="middle" >0.35</td><td align="center" valign="middle" >0.551</td></tr><tr><td align="center" valign="middle" >d</td><td align="center" valign="middle" >−0.10833</td><td align="center" valign="middle" >0.001</td><td align="center" valign="middle" >&lt;0.00001</td></tr></tbody></table></table-wrap><p>Dispersion parameter for the Negative Binomial model: 2.0488 (SE = 0.0185). AIC: 214866</p><table-wrap id="table5" ><label><xref ref-type="table" rid="table5">Table 5</xref></label><caption><title> Results of fitting data to the QNBD</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >covariate</th><th align="center" valign="middle" >Estimate</th><th align="center" valign="middle" >SE</th><th align="center" valign="middle" >P-Value</th></tr></thead><tr><td align="center" valign="middle" >(Intercept)</td><td align="center" valign="middle" >−5.1351</td><td align="center" valign="middle" >0.2826</td><td align="center" valign="middle" >0.0000</td></tr><tr><td align="center" valign="middle" >Z<sub>1</sub></td><td align="center" valign="middle" >0.1839</td><td align="center" valign="middle" >0.2696</td><td align="center" valign="middle" >0.2475</td></tr><tr><td align="center" valign="middle" >Z<sub>2</sub></td><td align="center" valign="middle" >0.1029</td><td align="center" valign="middle" >0.2704</td><td align="center" valign="middle" >0.3519</td></tr><tr><td align="center" valign="middle" >Z<sub>3</sub></td><td align="center" valign="middle" >0.1255</td><td align="center" valign="middle" >0.2872</td><td align="center" valign="middle" >0.3310</td></tr><tr><td align="center" valign="middle" >d</td><td align="center" valign="middle" >−0.0224</td><td align="center" valign="middle" >0.0005</td><td align="center" valign="middle" 
>0.0000</td></tr><tr><td align="center" valign="middle" >β estimate</td><td align="center" valign="middle" >134.8252</td><td align="center" valign="middle" >54.8250</td><td align="center" valign="middle" >0.0070</td></tr></tbody></table></table-wrap><p>AIC = 213104</p><p>1) Modeling read count as a Poisson regression model: glm(formula = y ~ Z<sub>1</sub> + Z<sub>2</sub> + Z<sub>3</sub> + d, family = poisson, data = ratdata);</p><p>2) Modeling read count using the Negative Binomial regression model to account for overdispersion;</p><p>3) Quasi-negative binomial regression model.</p></sec><sec id="s8"><title>8. Comments on the Data Fitting</title><p>We used three count regression models to fit the RNA-SEQ data. All models were fitted in R [<xref ref-type="bibr" rid="scirp.116617-ref23">23</xref>]. The first is a Poisson regression model, the second is the well-known negative binomial, and the third is the proposed QNB regression model. The Poisson model is fitted in R with the “glm” function, while the negative binomial model is fitted using the “MASS” package. The R code for fitting the QNB regression model is provided in Appendix 2. We based the comparisons among these models on the AIC values (the smaller the better). Clearly, the Poisson model, with the largest AIC = 314241, is the worst, as it fails to properly account for the overdispersion in the data. A remarkable improvement is attained when the negative binomial regression model is used (AIC = 214866). Although the QNB regression model has the smallest AIC = 213104, its improvement over the negative binomial model is modest. We still believe that our proposed model should be a close competitor to the negative binomial model.</p></sec><sec id="s9"><title>9. Discussion</title><p>There has been a growing interest among bioinformaticians and statisticians in constructing flexible distributions for counts that exhibit overdispersion to improve the modeling of count data. 
As a result, significant progress has been made towards generalizing some well-known discrete models, which have been successfully applied to problems arising in several areas of research. The proposed distribution was used to model two data sets; it was shown to provide a better fit than several other related models, including some with the same number of parameters. In a future paper, we shall demonstrate the applicability of the limiting form of our proposed distribution to genomics data, together with inference procedures using multiple samples. Finally, we believe that the inferential results developed in this article should find numerous applications in bioinformatics, genomics, medicine, data engineering, and other areas of the physical sciences.</p></sec><sec id="s10"><title>Acknowledgements</title><p>The authors acknowledge the positive comments made by anonymous reviewers of this work.</p></sec><sec id="s11"><title>Conflicts of Interest</title><p>The authors declare no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s12"><title>Cite this paper</title><p>Shoukri, M.M. and Aleid, M.M. (2022) Quasi-Negative Binomial: Properties, Parametric Estimation, Regression Model and Application to RNA-SEQ Data. Open Journal of Statistics, 12, 216-237. 
https://doi.org/10.4236/ojs.2022.122016</p></sec><sec id="s13"><title>Appendices</title><p>Appendix 1: R-CODE for Fitting the Univariate Version of the QNBD Using the Maximum Likelihood Method Applied to the “Brain Lesion” Data</p><p># maxlogL() is assumed here to come from the EstimationTools package</p><p>library(EstimationTools)</p><p># QNBD probability mass function at x (on the log scale when log = TRUE)</p><p>QNBD &lt;- function(x,theta,beta,log = FALSE){</p><p>loglik &lt;- log((((beta-1)/(beta-1+beta*x))*(factorial(beta-1+beta*x))/</p><p>(factorial(x)*factorial(beta-1+beta*x-x))*</p><p>((theta^x)*(1-theta)^(beta*x+beta-1-x))))</p><p>if(log == FALSE)</p><p>density &lt;- exp(loglik)</p><p>else density &lt;- loglik</p><p>return(density)</p><p>}</p><p># x is the vector of observed counts</p><p>parameter &lt;- maxlogL(x = x,dist = &quot;QNBD&quot;,start = c(.01,2),optimizer = 'optim')</p><p>summary(parameter)</p><p>The results of fitting by the method of maximum likelihood are:</p><p>AIC = 426.76, θ̃ = 0.298 &#177; 0.015, β̃ = 2.81 &#177; 0.057</p><p>Appendix 2: R-CODE: QNB Regression Fitting by the Method of Maximum Likelihood Applied to the RNA_SEQ Read Count Data</p><p># Negative log-likelihood of the QNB regression model</p><p># x1, x2, x3 and x4 are the covariates (taken here to be Z1, Z2, Z3 and d) and must exist in the workspace</p><p>llik=function(y,par){</p><p>b0=par[1]</p><p>b1=par[2]</p><p>b2=par[3]</p><p>b3=par[4]</p><p>b4=par[5]</p><p>beta=par[6]</p><p>n=length(y)</p><p>eta=b0+b1*x1+b2*x2+b3*x3+b4*x4</p><p>mu=exp(eta)/(1+exp(eta))</p><p>ll=sum(log(beta-1)-log(beta-1+beta*y)</p><p>+lgamma(beta+beta*y)-lgamma(1+y)-lgamma(beta+beta*y-y)</p><p>+y*log(mu)+(beta+beta*y-1-y)*log(1-mu))</p><p>return(-ll)</p><p>}</p><p>res=optim(par=c(2,.6,-.02,.42,-.12,2.1),llik,y=y,method=&quot;BFGS&quot;,hessian=TRUE)</p><p>theta=res$par</p><p>theta</p><p>#CALCULATING THE STANDARD ERRORS OF MLE</p><p>out3=nlm(llik,theta,y=y,hessian=TRUE)</p><p>fish=out3$hessian</p><p>solve(fish)</p><p>element=diag((solve(fish)))</p><p>se=sqrt(element)</p><p># normal Q-Q plot of the counts (the nlm output carries no residuals)</p><p>qqnorm(y)</p><p>z=theta/se</p><p># one-sided p-values</p><p>p_value=1-pnorm(abs(z))</p><p>result.GNBD=data.frame(theta,se,z,p_value)</p><p>result.GNBD=round(result.GNBD,4)</p></sec></body><back><ref-list><title>References</title><ref id="scirp.116617-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Takács, L. (1962) A Generalization of the Ballot Problem and Its Application in the Theory of Queues. Journal of the American Statistical Association, 57, 327-337.  
https://doi.org/10.1080/01621459.1962.10480662</mixed-citation></ref><ref id="scirp.116617-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Consul, P.C. and Gupta, H.C. (1980) The Generalized Negative Binomial Distribution and Its Characterization by Zero Regression. SIAM Journal on Applied Mathematics, 39, 231-237. https://doi.org/10.1137/0139020</mixed-citation></ref><ref id="scirp.116617-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Consul, P.C. and Shenton, L.R. (1972) Use of Lagrange Expansion for Generating Generalized Probability Distributions. SIAM Journal on Applied Mathematics, 23, 239-248. https://doi.org/10.1137/0123026</mixed-citation></ref><ref id="scirp.116617-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Consul, P.C. and Famoye, F. (2006) Lagrangian Probability Distributions. Birkhäuser, Boston.</mixed-citation></ref><ref id="scirp.116617-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Shoukri, M.M. (1980) Estimation of Generalized Discrete Distributions. Unpublished PhD Thesis, The University of Calgary, Calgary.</mixed-citation></ref><ref id="scirp.116617-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Nelder, J.A. and Wedderburn, R.W.M. (1972) Generalized Linear Models. Journal of the Royal Statistical Society, Series A, 135, 370-384.  
https://doi.org/10.2307/2344614</mixed-citation></ref><ref id="scirp.116617-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Kendall, M. and Ord, K. (2009) The Advanced Theory of Statistics. Vol. 1, 6th Edition, Griffin, London.</mixed-citation></ref><ref id="scirp.116617-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Rudick, R., Antel, J., Confavreux, C., Cutter, G., Ellison, G., et al. (1996) Clinical Outcomes Assessment in Multiple Sclerosis. Annals of Neurology, 40, 469-479. https://doi.org/10.1002/ana.410400321</mixed-citation></ref><ref id="scirp.116617-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Morgan, C.J., Aban, I.B., Katholi, C.R. and Cutter, G.R. (2010) Modeling Lesion Counts in Multiple Sclerosis When Patients Have Been Selected for Baseline Activity. Multiple Sclerosis, 16, 926-934. https://doi.org/10.1177/1352458510373110</mixed-citation></ref><ref id="scirp.116617-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Cramér, H. (1946) Mathematical Methods of Statistics. Princeton University Press, Princeton.</mixed-citation></ref><ref id="scirp.116617-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Szegő, G. (1939) Orthogonal Polynomials. Vol. 23, Colloquium Publications, American Mathematical Society, New York.</mixed-citation></ref><ref id="scirp.116617-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Shenton, L.R. and Wallington, P.A. (1962) The Bias of the Moment Estimators with an Application to the Negative Binomial Distribution. Biometrika, 49, 193-204.  
https://doi.org/10.1093/biomet/49.1-2.193</mixed-citation></ref><ref id="scirp.116617-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. Chapman and Hall, London.</mixed-citation></ref><ref id="scirp.116617-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Cox, D.R. and Hinkley, D. (1974) Theoretical Statistics. Chapman and Hall, London.</mixed-citation></ref><ref id="scirp.116617-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">McCarthy, D.J., Chen, Y. and Smyth, G.K. (2012) Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation. Nucleic Acids Research, 40, 4288-4297. https://doi.org/10.1093/nar/gks042</mixed-citation></ref><ref id="scirp.116617-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Pan, W. (2002) A Comparative Review of Statistical Methods for Discovering Differentially Expressed Genes in Replicated Microarray Experiments. Bioinformatics, 18, 546-554. https://doi.org/10.1093/bioinformatics/18.4.546</mixed-citation></ref><ref id="scirp.116617-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. and Gilad, Y. (2008) RNA-Seq: An Assessment of Technical Reproducibility and Comparison with Gene Expression Arrays. Genome Research, 18, 1509-1517.  
https://doi.org/10.1101/gr.079558.108</mixed-citation></ref><ref id="scirp.116617-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Koch, C.M., Chiu, S.F., Akbarpour, M., Bharat, A., Ridge, K.M., Bartom, E.T. and Winter, D.R. (2018) A Beginner’s Guide to Analysis of RNA Sequencing Data. American Journal of Respiratory Cell and Molecular Biology, 59, 145-157.  
https://doi.org/10.1165/rcmb.2017-0430TR</mixed-citation></ref><ref id="scirp.116617-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Yoon, S., Kim, S.Y. and Nam, D. (2016) Improving Gene-Set Enrichment Analysis of RNA-Seq Data with Small Replicates. PLoS ONE, 11, e0165919.  
https://doi.org/10.1371/journal.pone.0165919</mixed-citation></ref><ref id="scirp.116617-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Auer, P.L. and Doerge, R.W. (2011) A Two-Stage Poisson Model for Testing RNA-Seq Data. Statistical Applications in Genetics and Molecular Biology, 10, 26.  
https://doi.org/10.2202/1544-6115.1627</mixed-citation></ref><ref id="scirp.116617-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">Yoon, S. and Nam, D. (2017) Gene Dispersion Is the Key Determinant of the Read Count Bias in Differential Expression Analysis of RNA-Seq Data. BMC Genomics, 18, Article No. 408. https://doi.org/10.1186/s12864-017-3809-0</mixed-citation></ref><ref id="scirp.116617-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">Robinson, M.D. and Smyth, G.K. (2008) Small-Sample Estimation of Negative Binomial Dispersion, with Applications to SAGE Data. Biostatistics, 9, 321-332.  
https://doi.org/10.1093/biostatistics/kxm030</mixed-citation></ref><ref id="scirp.116617-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">https://cran.r-project.org/bin/windows/base/</mixed-citation></ref></ref-list></back></article>