Real-Time COVID-19 Forecasting for Four States of India Using a Regression Transmission Model
1. Introduction
As of 18th July 2020, more than a million cases of COVID-19 had been reported from India [1]. The first case was reported on 31st of January 2020 [2]. As the COVID-19 pandemic continued to sweep across the world, the country took a series of measures to address it. These included improving the testing capacity and adopting a testing strategy to identify cases [3]. However, uncertainty over the duration and burden of the pandemic is visible in reports, both peer-reviewed and not peer-reviewed [4]. These reports estimate the size of the epidemic to range from a few hundred thousand to a few hundred million cases, with the peak falling between April and July 2020 [5] [6]. Since the first reported case, India has implemented several non-pharmaceutical interventions to address the pandemic [7].
India follows a federal structure in which health is a state subject and the centre plays a supporting role in times of need. While the earlier models provide the bigger picture, planning the response requires short-term projections that keep a close eye on the upcoming wave of cases and can help the states in local decision making. The states in India differ in population size, density and direct connectivity to other parts of the world [8]. The burden of COVID-19, in terms of both cases and deaths, differs from state to state [1]. The strength of the health system is also not uniform across the country, so the local response is expected to differ [9].
Short-term forecasting has been found useful for similar infectious diseases earlier [10] [11] [12]. These models use the reported cases as inputs and apply, respectively, a discrete-time stochastic model, a conditional intensity of accumulation of cases using a non-parametric probability method, or generation-dependent growth factors to develop simple but robust forecasting models. Short-term projections using growth models and a modified SIR (Susceptible-Infected-Recovered) model have also been used for the early epidemic [13] [14]. In view of the above, this paper focuses on creating real-time, short-term projections for COVID-19 that can be helpful for the states in India and are applicable not only to the early but also to the later part of the epidemic.
2. Methodology
2.1. Data
The data required for the model were available from several public domains. However, because of reported discrepancies, the authors chose sources that had the desired information and were reported by the government through daily bulletins [15]. We collected daily updated data on the number of confirmed cases from the respective state government daily bulletins and dashboards, which had been reporting daily since the first case was identified in each state. The data were available on the respective state government websites [16] [17] [18] [19]. The states included were Kerala, Tamil Nadu, Andhra Pradesh, and Odisha. The data collection period was from 30th January 2020 to 18th July 2020.
2.2. Model
We re-calibrated a semi-mechanistic discrete-time stochastic compartmental disease model; the details of the model can be found elsewhere [20]. The model consisted of two integer states or compartments of “Susceptible” - “Exposed” - “Infectious”, and can thus be considered mechanistic. The mean latent period was assumed to be 2.5 days. The duration of infectiousness was taken from the literature, which showed a pre-symptomatic period of 5 - 6 days, with 97% of infected persons showing symptoms before 12 days [21]; in India, disease detection took around 4 days from the day of sample collection [22]. As soon as people were detected, they were removed from the non-infected population through quarantine measures, which prevented them from transmitting COVID-19 further. We therefore assumed the infectious period to be 8 days, with a range of 4 to 12 days.
All new infections entered the “Exposed” compartment following a Poisson distribution whose mean was the product of the time-varying reproductive rate r(t) and the proportion infected at the time. The model used the available information on the latent and infectious periods of the disease to move people between the compartments using independent geometric transitions with a first-order Euler method [23]. Random-walk methods have been used elsewhere for modelling the reproductive rate in outbreak situations [24]. The model used a multiplicative normal random walk with a log-linear drift to generate the r(t) parameter. Assuming a uniform prior distribution for the parameters, the model was fit to the number of reported cases [10] [25]. The reported cases were adjusted at each state space using particle Markov Chain Monte Carlo simulation with Metropolis-Hastings updating [26].
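The transmission process described above can be illustrated with a minimal Python sketch. It is not the authors' R/C++ implementation: the function names, the one-day time step, the choice of r(t) times the infectious count divided by the infectious period as the Poisson mean, and the per-day drift and volatility parameters are all assumptions made for illustration. Only the 2.5-day latent period and 8-day infectious period come from the text.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate: Knuth's method for small means, normal approximation otherwise."""
    if mean <= 0:
        return 0
    if mean > 50:
        return max(0, round(rng.gauss(mean, math.sqrt(mean))))
    threshold = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def binomial(rng, n, p):
    """Number of successes in n independent trials with success probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate(days, r0, drift, sigma, exposed0, infectious0,
             latent_days=2.5, infectious_days=8.0, seed=1):
    """Simulate one trajectory with a one-day time step (illustrative only)."""
    rng = random.Random(seed)
    # First-order Euler step with dt = 1 day: daily exit probability = 1 / mean duration.
    p_leave_exposed = min(1.0, 1.0 / latent_days)
    p_leave_infectious = min(1.0, 1.0 / infectious_days)
    r, exposed, infectious, removed = r0, exposed0, infectious0, 0
    path = []
    for day in range(days):
        # ASSUMPTION: expected new infections per day = r(t) * I(t) / infectious period.
        new_exposed = poisson(rng, r * infectious / infectious_days)
        # Independent per-person (geometric waiting time) transitions between compartments.
        to_infectious = binomial(rng, exposed, p_leave_exposed)
        to_removed = binomial(rng, infectious, p_leave_infectious)
        exposed += new_exposed - to_infectious
        infectious += to_infectious - to_removed
        removed += to_removed
        # Multiplicative normal random walk with a drift term on the log scale.
        r *= math.exp(drift + sigma * rng.gauss(0.0, 1.0))
        path.append({"day": day, "r": r, "new_exposed": new_exposed,
                     "exposed": exposed, "infectious": infectious, "removed": removed})
    return path


if __name__ == "__main__":
    # Hypothetical starting values, chosen only to keep the example small.
    for state in simulate(days=14, r0=1.2, drift=0.0, sigma=0.05,
                          exposed0=10, infectious0=20):
        print(state)
```

Running many such trajectories with different seeds gives a spread of outcomes; in a fitting procedure, trajectories that agree better with the reported cases would receive more weight.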
The interquartile (25th-75th percentile) range and the 5th-95th percentile range were calculated. The model used a minimum of 400 particles (400 - 800) and 30,000 iterations, until the overall acceptance value was within the acceptable range of 20 - 30 per cent (Table 1) [27] [28]. The starting timeline and the initial cases varied between states because the first case was detected at a different time in each state. Credible intervals for the reported cases, forecasted cases and time-varying reproduction number were generated from the posterior distribution samples. Two-week forecasts starting from 5th July 2020 were generated and validated against the actual reported data for the forecast period. R version 4.0 was used for the analysis [29]. C++ was used through the Rcpp package for computational efficiency, and the ggplot2 package was used to produce charts [30].
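As a rough illustration of Metropolis-Hastings updating, acceptance-rate monitoring and percentile summaries, the following Python sketch fits a single Poisson rate to a handful of made-up daily counts under a uniform prior. It is a deliberately simplified stand-in: the study used a particle filter to estimate the likelihood of a full compartmental model, whereas here the likelihood is computed exactly, and the function names, bounds, step size and counts are all hypothetical.

```python
import math
import random
import statistics


def log_likelihood(rate, counts):
    """Poisson log-likelihood up to an additive constant (stand-in for a particle-filter estimate)."""
    return sum(c * math.log(rate) - rate for c in counts)


def metropolis_hastings(counts, iterations, step, lower=0.0, upper=100.0, seed=1):
    """Random-walk Metropolis-Hastings with a uniform prior on (lower, upper)."""
    rng = random.Random(seed)
    current = statistics.mean(counts)  # start near the data, so no burn-in is discarded here
    current_ll = log_likelihood(current, counts)
    samples, accepted = [], 0
    for _ in range(iterations):
        proposal = current + rng.gauss(0.0, step)
        if lower < proposal < upper:  # the uniform prior has zero density outside the bounds
            proposal_ll = log_likelihood(proposal, counts)
            if rng.random() < math.exp(min(0.0, proposal_ll - current_ll)):
                current, current_ll = proposal, proposal_ll
                accepted += 1
        samples.append(current)
    return samples, accepted / iterations


def summarize(samples):
    """Median, interquartile range and 5th-95th percentile range of posterior samples."""
    q = statistics.quantiles(samples, n=20, method="inclusive")  # cut points at 5%, 10%, ..., 95%
    return {"p5": q[0], "p25": q[4], "median": q[9], "p75": q[14], "p95": q[18]}


if __name__ == "__main__":
    hypothetical_counts = [12, 15, 9, 14, 11, 13, 10, 16]  # made-up data, not from the study
    samples, acceptance = metropolis_hastings(hypothetical_counts, iterations=5000, step=3.0)
    print(f"acceptance rate: {acceptance:.2f}")
    print(summarize(samples))
```

In practice the proposal step size would be tuned until the acceptance rate falls in the target band, which is the role the 20 - 30 per cent range plays in the text.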
3. Results
The results show an increasing number of cases, to varying degrees, for all the states except Tamil Nadu, where the number is expected to remain stable [Figures 1-4]. In every figure, the input incidence data are denoted by black dots in the lower left. The time-varying reproductive number r(t) during the same period is denoted with black lines and shaded regions in the upper left. The forecasted results for r(t) and the number of cases are illustrated with blue lines and shaded regions. In all shaded regions, the central line indicates the median, the darker shaded region indicates the interquartile range and the lighter shaded region indicates the 5th-95th percentile range.
The common result across all the states is a clearly time-varying r(t) that is greater than 1 and shows a stable trend over the forecast period. The number of cases shows an increasing trend in the other three states (Table 2). All the forecasted figures are within the 25th to 75th percentile (interquartile) range. However, the first week of projection matches the actual values more closely than the following week.
![]()
Table 2. Comparison of Day-wise projections, upper and lower interquartile range with actual reported COVID-19 cases.
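A comparison of this kind, between day-wise projections with their interquartile bounds and the reported cases, can be quantified with two simple measures: the share of days on which the reported value falls inside the interval, and the mean absolute percentage error of the median forecast. The Python sketch below shows one way to compute them. The function names are hypothetical and the numbers are invented for illustration; they are not the values in Table 2.

```python
def interval_coverage(actual, lower, upper):
    """Fraction of observed values that fall inside [lower, upper], day by day."""
    if not (len(actual) == len(lower) == len(upper)):
        raise ValueError("all series must have the same length")
    if not actual:
        raise ValueError("series must not be empty")
    hits = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return hits / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / actual, expressed in per cent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    # Invented example values, NOT taken from the study.
    reported = [100, 110, 125, 160]
    median_forecast = [98, 112, 120, 150]
    lower_quartile = [90, 100, 110, 140]
    upper_quartile = [105, 120, 130, 158]
    print(interval_coverage(reported, lower_quartile, upper_quartile))
    print(mean_absolute_percentage_error(reported, median_forecast))
```

Computing both measures separately for the first and second forecast week would make the observation that the first week matches more closely easy to check.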
4. Discussion
The study covered a population at risk of nearly 200 million. Though all the states are in the coastal region of India, they differ in language, socioeconomic dynamics and health system capacity [9] [37]. Though the first case of COVID-19 was reported from Kerala in late January, most of the other states in the study reported their first cases in March. This analysis uses reported cases from the public health system and shows the heterogeneity of the epidemic through the time-varying effective reproductive rate and a near-future forecast of the COVID-19 burden, which closely matches the actual number of cases.
![]()
Figure 1. Two weeks forecast: Time varying reproduction number and new cases in Kerala.
![]()
Figure 2. Two weeks forecast: Time varying reproduction number and new cases in Tamil Nadu.
![]()
Figure 3. Two weeks forecast: Time varying reproduction number and new cases in Andhra Pradesh.
![]()
Figure 4. Two weeks forecast: Time varying reproduction number and new cases in Odisha.
4.1. Epidemic Model and Validity
The results from four different states show that the model is robust and can be deployed for other states too. The variations between the actual and forecasted figures need to be analysed with a view to finding reasonable explanations [38]. Most models rest on the assumption that “given current situation remains the same for future”, which is difficult to satisfy, particularly given changing policies on testing and response strategy in the states [39] [40]. Changing case definitions, testing strategies and responses to the pandemic are known to have influenced our understanding of the trajectory of the epidemic, not only in India but around the world [41]. Social distancing and lockdown measures were shown to reduce contagion, an effect reported to reverse once reopening began [42] [43]. Since this model demonstrates that near-future cases can be predicted closely, repeated application of the model at constant time intervals can provide vital information on the projected number of cases in the short run, which can be used for the deployment of mitigation strategies. The criticism that outputs of mathematical models are not always useful should be considered in the right spirit, with a reasonable understanding of the models and of the need to think beyond their one-time application [44] [45].
4.2. Effective Reproduction Number
The effective reproductive number indicates the risk posed by the epidemic at a given point in time. Though an r(t) below unity indicates that the epidemic is losing force, most of the population remains uninfected and therefore unprotected; the population is a long way from herd immunity, and the potential risk of transmission persists. It is also important to note that an r(t) below unity does not exclude the potential for localised outbreaks, which can influence the epidemic trajectory [33] [46]. Nevertheless, the trend in r(t) does provide advance information on the highly unpredictable course of this epidemic.
4.3. Need for Better Data
Any model output is only as good as its data and assumptions. Details of the day of symptom onset, day of sample collection, day of diagnosis and day of reporting were not available in the respective public domains of the states, and were therefore not considered in the model. Availability of this information may further improve the model performance. Also, as the capacity for testing was gradually increased, the delay between sample collection and testing declined [3]. Local data for each state, with a state-specific serial interval or generation time, could have improved this analysis. Other published studies from India have also relied on external information sources [7] [47]. Those replicating the study can focus on further improving the model by addressing these gaps. The states’ responses were mixed, with some states opting for institutional quarantine only and others for both home and institutional quarantine. An additional limitation is that the model relies on published case reports and is thus sensitive to local variation in testing strategy.
4.4. Consensus on Models
Many types of forecasting and predictive models have been published on the COVID-19 epidemic in India [48] [49]. All these models are relevant to the scientific quest and add value to the knowledge base on this novel virus. However, different models provide different results, which creates confusion and makes it difficult for policymakers to take decisions. There are other epidemic response initiatives in India that focus on a single tool for epidemic projection, leading to harmony in the response [50]. Another way of managing a large number of real-time scientific publications is to aggregate them all and add mutual accountability [51]. Most importantly, real-time forecasts are possible only if the forecasting is repeated at regular intervals [14].
5. Conclusion
The real-time short-term forecasting used in the four states provides a good approximation of the near-future epidemic trajectory. The tool is available in the public domain and needs to be applied repeatedly at regular intervals to track the epidemic at the local level. This can be used for understanding the future burden.
Acknowledgements
The constant guidance and sharing of original codes by Jason Asher are highly appreciated. The paper also acknowledges the critical inputs of Dr. Shailaja Tetali and Dr. Jammy Rajesh.
Ethics Statement
The analysis used publicly available secondary data and no patient information; ethical clearance was therefore not required.