Forecasting of COVID-19 epidemic size in four high hitting nations (USA, Brazil, India and Russia) by Fb-Prophet machine learning model

Purpose – As of July 30, 2020, more than 17 million novel coronavirus disease 2019 (COVID-19) cases had been registered, including 671,500 deaths. There is not yet any medicine or vaccine to control this dangerous pandemic, and researchers are applying mathematical and time series epidemic models to predict disease severity from nationwide data. Design/methodology/approach – In this study, the authors considered daily COVID-19 infection data from the four most affected nations (the USA, Brazil, India and Russia) to conduct a 60-day forecast of total infections. To do so, the authors adopted a machine learning (ML) model called Fb-Prophet: the total numbers of confirmed cases in the four countries up to the end of July were collected, and projections were made with the Prophet logistic growth model. Findings – Results highlighted that by late September, the estimated outbreak can reach 7.56, 4.65, 3.01 and 1.22 million cases in the USA, Brazil, India and Russia, respectively. The authors found some underestimation and overestimation of daily cases, and the linear model of actual vs predicted cases yielded a p-value below 2.2e-16 and an R² value of 0.995. Originality/value – In this paper, the authors adopted the Fb-Prophet ML model because it can predict the epidemic trend and derive an epidemic curve.


Introduction
The latest epidemic, caused by novel coronavirus disease 2019 (COVID-19), has already spread all over the world [1]. The world has been brought to the brink of stagnation and is struggling with newly registered infections every day [2], and researchers have confirmed that the present pandemic is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [3]. By the end of July 2020, over 17 million people had been infected globally and 650,000 people had died because of this deadly virus [4]. The virus is reported to spread easily through physical contact or through inhalation of droplets released when an infected person talks, coughs or sneezes.
The capacity to recognize the rate at which a virus spreads is vital in the battle against pandemics. Monitoring the pace of spread at any given time can help national authorities plan public health measures and make policy to address pandemic outcomes [5]. Some recent studies propose transmission dynamic models to make virus spread in a specific population easier to understand and to propose preventive measures [6][7][8]. In particular, forecasting with time series models can successfully analyze COVID-19 disease characteristics and the cumulative number of infections [9]. The present study is in line with research on the calculation of COVID-19 cases in China, in which time series and panel data models successfully controlled for endogeneity, dependence and unobserved heterogeneity [6]. The authors of that study presented a linear relationship between confirmed cases and deaths and a nonlinear relationship between total registered cases and confirmed cases. The spreading characteristics of COVID-19 were compared with those of earlier coronaviruses (i.e. SARS and Middle East respiratory syndrome (MERS)) using a propagation growth model in ref [10]. The results indicated that the COVID-19 transmission rate is almost double that of SARS and MERS and that, without human intervention, infected cases doubled every two to three days.
Many researchers worldwide have produced studies on COVID-19 predictions in severely affected countries. An Indian study [11] used a susceptible-exposed-infected-recovered (SEIR) compartmental model to understand virus longevity and to manage healthcare systems at regional levels. Another study based on the SEIR model estimated the virus dynamics by adding an isolation compartment and proposed measures for controlling infection rates [12]. An Italian study that adopted autoregressive integrated moving average (ARIMA) time series models predicted confirmed and recovered cases under the continuation of a 60-day national lockdown, and the results achieved 93.75% and 84.4% for confirmed and recovered cases, respectively [13]. The work of [14] developed four time series models, namely, autoregressive (AR), moving average (MA), a combination of both (ARMA) and ARIMA, to identify the best-fitting model for predicting COVID-19 spread in Saudi Arabia. The outcomes suggested that the ARIMA model outperformed the other three models.
The impact of seasonal characteristics on virus spread in Wuhan and Italy was analyzed in ref [15] using time series models. The results highlighted that the cold weather in early 2020 was a major driver of the virus spread in Wuhan, and a similar pattern was observed in Northern Italy. Continuing from the above studies, we developed a COVID-19 predictive model for the four most affected nations (the USA, Brazil, India and Russia) to calculate the total possible infections by the end of September 2020. In this paper, we adopted the Fb-Prophet machine learning (ML) model because it can predict the epidemic trend and derive an epidemic curve [16].
The rest of the paper is organized as follows. The next section presents the data sources and the Prophet model equations. In Section 3, two-month projections of cumulative infections are presented for the four included countries. Finally, Section 4 summarizes the main results of the present work along with suggested measures to follow in the fight against COVID-19.

Data sources
Many open COVID-19 data sources are available for epidemic forecasting. The most recent daily outbreak data were retrieved from the Johns Hopkins University dashboard, which displays country-level epidemic trends [4]. The data have been updated automatically on a daily basis since the start of the epidemic. The analysis uses COVID-19 data for the four nations mentioned above from January 20 to July 30, 2020. The dashboard reports nationwide infection figures, including deaths, confirmed cases and total confirmed cases.
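As a minimal sketch of how country-level series of this kind could be prepared, the snippet below reshapes a wide table (one row per region, one column per date) into cumulative totals for one country. The column layout, the function name `cumulative_by_country`, the country names and all numbers are assumptions made for illustration; the paper does not describe its preprocessing.

```python
import csv
import io
from datetime import datetime

# Synthetic sample in an assumed wide layout: metadata columns, then one column per date.
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,7/28/20,7/29/20,7/30/20
,Otherland,20.0,78.0,100,150,210
Region A,Exampleland,0.0,0.0,10,12,15
Region B,Exampleland,1.0,1.0,5,6,9
"""

META_COLUMNS = {"Province/State", "Country/Region", "Lat", "Long"}


def cumulative_by_country(csv_text, country):
    """Return [(date, total_cases)] for one country, summing its regional rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    totals = {}
    for row in reader:
        if row["Country/Region"] != country:
            continue
        for column, value in row.items():
            if column in META_COLUMNS:
                continue
            # Date headers are assumed to look like month/day/two-digit year.
            day = datetime.strptime(column, "%m/%d/%y").date()
            totals[day] = totals.get(day, 0) + int(value)
    return sorted(totals.items())


series = cumulative_by_country(SAMPLE_CSV, "Exampleland")
```

Each element of `series` pairs a date with the cumulative count on that date, which matches the two-column input (time and total cases) that the forecasting model expects.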

Fb-Prophet model
ML algorithms for predictive analysis work by training on historical data; deep learning, linear regression, artificial neural networks and Bayesian algorithms are examples [17]. These algorithms select the most suitable model according to the features of the dataset and predict future outcomes. This study applied similar practices to COVID-19 prediction with global epidemic data. We applied Fb-Prophet, an open-source framework released by Facebook in 2017 that performs time series forecasting with an additive model.
In Fb-Prophet, nonlinear trends are fitted with daily, weekly and yearly seasonality, plus holiday effects [16]. It works best with historical data that span several seasons and show strong seasonal effects, and it is largely automatic, with limited manual involvement. A well-specified Prophet model not only helps with future predictions but also helps to detect anomalies and fill gaps left by missing values. Most scholars prefer to conduct epidemic forecasting with either time series models (e.g. ARIMA) or SEIR models. This paper uses a nonlinear time series model with three components, namely trend, seasonality and holidays: $y(t) = g(t) + s(t) + h(t) + \epsilon_t$, where $g(t)$ is a piecewise linear or logistic growth curve that models nonperiodic changes in the time series, $s(t)$ represents seasonal changes, $h(t)$ represents the effects of holidays with irregular schedules, and $\epsilon_t$ is the error term. To fit and forecast the effects of seasonal changes, the model relies on a Fourier series, and the seasonal component is written as $s(t) = \sum_{n=1}^{N} \left( a_n \cos\left(\frac{2\pi n t}{T}\right) + b_n \sin\left(\frac{2\pi n t}{T}\right) \right)$, where the parameters $[a_1, b_1, \ldots, a_N, b_N]$ need to be estimated for a given $N$ and $T$ is the period of the seasonal pattern. In this research, we created an instance of the Prophet class and used its fit and predict methods. The model input is always a time series with two features: $t$, the time, and $y$, the total number of cases in a particular country.
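To make the additive structure concrete, the following sketch evaluates a simplified logistic trend, a Fourier seasonal term and their sum in plain Python. It is an illustration under stated assumptions, not the library's implementation: the logistic curve uses a fixed capacity `cap`, growth rate `k` and offset `m` without changepoints, and every parameter value is hypothetical.

```python
import math


def logistic_trend(t, cap, k, m):
    """Simplified logistic growth curve g(t) = cap / (1 + exp(-k * (t - m)))."""
    return cap / (1.0 + math.exp(-k * (t - m)))


def fourier_seasonality(t, coeffs, period):
    """Seasonal term s(t) = sum over n of a_n*cos(2*pi*n*t/T) + b_n*sin(2*pi*n*t/T)."""
    total = 0.0
    for n, (a_n, b_n) in enumerate(coeffs, start=1):
        angle = 2.0 * math.pi * n * t / period
        total += a_n * math.cos(angle) + b_n * math.sin(angle)
    return total


def additive_model(t, cap, k, m, coeffs, period, holiday_effect=0.0):
    """Noise-free additive model y(t) = g(t) + s(t) + h(t)."""
    trend = logistic_trend(t, cap, k, m)
    seasonal = fourier_seasonality(t, coeffs, period)
    return trend + seasonal + holiday_effect


# Hypothetical parameters, chosen only for illustration.
coeffs = [(50.0, -20.0), (10.0, 5.0)]  # (a_1, b_1), (a_2, b_2), so N = 2
y_day_30 = additive_model(30, cap=1_000_000, k=0.05, m=100, coeffs=coeffs, period=7)
```

With `period=7` and time measured in days, the seasonal term repeats every week, which is the kind of weekly pattern examined in the Results section.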

Results
This section presents the experimental results of COVID-19 trend forecasting for four countries based on historical epidemic data (January 20, 2020-July 30, 2020). Weekly epidemic trends and model performance are further analyzed to understand model effectiveness.

Epidemic trend forecasting
The model input consists of two variables: time (in months) and total confirmed cases. We fitted the Prophet model without daily and yearly seasonality because we do not have sufficient data to estimate them. Figure 1
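The configuration described above could be sketched as follows with the open-source `prophet` Python package. This is a hedged illustration rather than the authors' code: the helper names, the synthetic counts and especially the capacity `cap` required by logistic growth are assumptions, since the paper does not report the capacity it used.

```python
from datetime import date, timedelta


def build_prophet_rows(start, cumulative_cases, cap):
    """Build rows with the column names Prophet expects: ds (date), y (value), cap (ceiling)."""
    return [
        {"ds": start + timedelta(days=i), "y": cases, "cap": cap}
        for i, cases in enumerate(cumulative_cases)
    ]


def fit_and_forecast(rows, cap, horizon_days=60):
    """Fit a logistic-growth model with weekly seasonality only and forecast ahead."""
    # Imported lazily so the data-preparation helper works without these packages.
    import pandas as pd
    from prophet import Prophet  # older releases: from fbprophet import Prophet

    df = pd.DataFrame(rows)
    model = Prophet(
        growth="logistic",
        daily_seasonality=False,   # switched off, as described in the text
        yearly_seasonality=False,  # switched off, as described in the text
        weekly_seasonality=True,
    )
    model.fit(df)
    future = model.make_future_dataframe(periods=horizon_days)
    future["cap"] = cap  # logistic growth needs a capacity for future dates too
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]


# Synthetic counts and a hypothetical capacity, for illustration only.
rows = build_prophet_rows(date(2020, 1, 20), [1, 1, 2, 5, 8], cap=1_000_000)
```

Calling `fit_and_forecast(rows, cap=1_000_000)` on a realistic series would return the point forecast `yhat` with its lower and upper uncertainty bounds for the observed dates plus the 60-day horizon.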

Epidemic curve of daily registered cases
We plotted daily epidemic characteristic curves for each country to understand the behavior of the disease. These curves reflect only patterns in the reported data and do not confirm the actual number of infections per day. This may be because of fluctuations in the available data relative to the true situation; it is possible that not every infected person was tested or confirmed [18]. For instance, in the USA a low epidemic size was observed on Wednesday and a high epidemic size on Saturday (refer to Figure 2), but this does not necessarily mean that people in the USA are more exposed to the virus on Saturdays. In Brazil, the epidemic size is high on Friday and Saturday and low on Monday. Moreover, a high epidemic size can be observed in India and Russia on Sunday and Monday, and a low one on Wednesday for India and Friday for Russia. The model parametric relationship between COVID-19 confirmed cases in the four nations is presented in Table 1. The forecast values generated by the model are statistically significant with 95% confidence intervals (CI), and the corresponding predictive R² value ranges from 99.91% to 99.99%. A summary of the observed daily cases is presented in Table 2. To analyze the developed model, we fitted linear regression models of actual cases (x-axis) versus predicted cases (y-axis). In these models, we found both underestimation and overestimation only for India, whereas a simple linear trend was observed for the other three countries, the USA, Brazil and Russia (refer to Figure 3). These models were further validated by R² values ranging from 0.9951 to 0.9999 at the 5% significance level. For all four models, a p-value of <2.2e-16 was observed, which indicates that the model relationship is statistically significant (95% CI).
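The two diagnostics discussed above, the weekday pattern of daily cases and the regression of predicted on actual cases, can be sketched in plain Python as below. The function names and all numbers are hypothetical and serve only to show the calculations; they do not reproduce the values in Tables 1 and 2.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def weekday_means(start, daily_cases):
    """Average daily new cases per weekday, given consecutive days from `start`."""
    sums, counts = {}, {}
    for i, cases in enumerate(daily_cases):
        name = WEEKDAYS[(start + timedelta(days=i)).weekday()]
        sums[name] = sums.get(name, 0) + cases
        counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}


def linear_fit_r2(actual, predicted):
    """Least-squares fit of predicted on actual; returns (slope, intercept, R^2)."""
    n = len(actual)
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in actual)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in predicted)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic numbers, for illustration only; July 6, 2020 was a Monday.
means = weekday_means(date(2020, 7, 6), [10, 12, 8, 11, 15, 20, 18, 12, 14])
slope, intercept, r2 = linear_fit_r2([100, 200, 300, 400], [110, 190, 310, 405])
```

A slope near 1 with an intercept near 0 indicates little systematic over- or underestimation, while R² measures how tightly the predictions follow the actual counts.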

Discussion
According to the modeling outcomes, the epidemic in the four hardest-hit nations will get even worse by late September. The projected epidemic in the USA can reach up to 7.5 million infected cases. This result follows up on the research of [19], which describes the timeline of live forecasting. The epidemic in the USA has been viewed as worse than in other countries, including a high number of deaths. A similar situation can also be observed in the other three countries. Because of the fast-spreading nature of the virus, daily COVID-19 infections are rising steeply. Our forecasting results support this point and should alert these highly affected countries to manage their healthcare systems better, especially ICU care. Since the outbreak already exceeds the capacity of national health services, governments should strongly alert the public to key prevention measures.
In the USA, over three million infections had occurred as of July 07, 2020, and the results suggest that this number can more than double in the next two months. It can be assumed that slow control, imprecise policies and lack of awareness are why the USA is on the list of countries with a large pandemic [20]. As mentioned, these figures might not be exact because of the limited number of tests conducted. In developing countries like India, the growth of the epidemic has alarmed the national authorities. India has already surpassed the epidemic size of Russia and become the third largest COVID-19 pandemic nation after the USA and Brazil. Our model forecasts that India can become the second worst-hit nation by late September. With the gradual lifting of the strict lockdown imposed in March, the country has allowed most businesses to reopen because of the economic crisis. The present pandemic has already become a dreadful threat to humankind. Some European countries, such as Germany, Spain and Italy, are already progressing past the epidemic peak, which has resulted in a decline in new cases. Brazil is also viewed as a site of massive epidemic spread and ranks after the USA in the number of deaths. Meanwhile, the epidemic sizes of the other two countries, India and Russia, will reach up to 3.15 and 1.22 million confirmed cases, respectively, by the end of September. Besides forecasting, our study also highlighted the weekly epidemic characteristics, including daily seasonal modeling. These models are effective for understanding the dynamic spread of COVID-19 and for suggesting immediate actions to control the epidemic. Since the beginning of the COVID-19 pandemic, a number of statistical and mathematical modeling studies have predicted national and global epidemics with varying degrees of accuracy. The uncertainty in prediction accuracy depends on the assumptions made about the available data.
These forecasting outcomes might vary widely because of differences in input parameters and assumptions. During novel pandemics like COVID-19, the quality and availability of data keep changing as the epidemic progresses, which causes uncertainty in predictions at early stages that improves at later stages.
By incorporating the Fb-Prophet ML model, we achieved more than 99% prediction accuracy. However, we found a small bias in the linear model for India's epidemic forecast, with the possibility of either overestimation or underestimation. Another study that used the Fb-Prophet model estimated a total epidemic size of 1,737,272 for Brazil, 283,029 for Russia and 330,043 for India by mid-June 2020, and a global outbreak of 14.12 million infections that will peak in October [22].

Recommendations
The novel COVID-19 pandemic has been affecting almost every nation in the world. COVID-19 is a deadly disease of the 21st century that had resulted in over 650,000 deaths by the end of July 2020 and is still ongoing. In particular, the four countries mentioned are currently facing a severe epidemic. It is understandable that testing every individual is beyond the capacity of any country. However, imposing partial lockdowns in cities, avoiding international travel and shutting down malls, theaters and gyms could bring this epidemic under practical control. Healthcare authorities should make mask-wearing compulsory in public and keep the ban on large gatherings in place. Our model results highlighted that the epidemic size could double and that a peak could be observed in October 2020. In that scenario, all national governments should consider imposing a second-phase national lockdown without relaxation, and national authorities should make sure that people stay confined at home. Healthcare centers and hospitals need to manage patient flow and address issues such as overcrowding and bed availability. Universities and other educational institutions are encouraged to continue e-learning methods.

Strengths and limitations
We used the Fb-Prophet ML model for the forecasting analysis. In SEIR models, it is assumed that every susceptible individual has an equal chance of coming into contact with another person and that the transmission rate remains the same throughout the epidemic. Such models also assume similar transmission rates for quarantined and non-quarantined populations. At the same time, time series models like ARIMA deal with one or more values per time step, and parameter tuning is mandatory to achieve good accuracy. In contrast, the Fb-Prophet model does not require interpolation of missing data and improves forecasting by incorporating seasonal modeling. Despite its high prediction accuracy, the adopted Fb-Prophet model has some limitations. Primarily, for lack of clearer data on daily and yearly seasonality, more detailed predictions are not possible, but the model still helps to forecast future cumulative infections. To the best of our knowledge, the forecasting results generated in this work are effective for the current pandemic situation.

Conclusions
The present analysis was conducted on live COVID-19 epidemic data for the USA, Brazil, India and Russia, retrieved from the Johns Hopkins University dashboard. The projections highlight that there is a chance of an epidemic peak in early October in those countries. This raises the possibility of a second-phase national lockdown without relaxation; otherwise, there could be a second-wave outbreak. This study proposed a forecasting method based on the Fb-Prophet model for COVID-19 analysis. Prophet is well suited to nonlinear trends fitted with daily, weekly and yearly seasonality plus holiday effects. We applied only the time series data (ds) as model trend terms, which leaves a knowledge gap for future research. Encoding preventive measures such as lockdowns and travel bans as holiday effects in the model could enhance the significance of the research. The model proposed in our work can improve estimates of infection numbers in other countries and thereby help national authorities to plan health policy interventions better.
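The suggestion of treating preventive measures as holiday effects could be sketched as follows with the `prophet` package, which accepts a table of named events with optional windows. This extension was not implemented in the paper; the helper names, event names, dates and durations below are hypothetical placeholders.

```python
from datetime import date


def intervention_rows(name, start, duration_days):
    """One row in the layout Prophet uses for holidays: holiday, ds, lower_window, upper_window."""
    return [{
        "holiday": name,
        "ds": start,
        "lower_window": 0,
        "upper_window": duration_days - 1,  # extend the effect over the whole measure
    }]


def build_model_with_interventions(rows):
    """Create an unfitted Prophet model that treats the interventions as holiday effects."""
    # Imported lazily so the helper above can be used without these packages.
    import pandas as pd
    from prophet import Prophet

    holidays = pd.DataFrame(rows)
    holidays["ds"] = pd.to_datetime(holidays["ds"])
    return Prophet(holidays=holidays, daily_seasonality=False, yearly_seasonality=False)


# Hypothetical dates and durations, not taken from the paper.
rows = (
    intervention_rows("lockdown", date(2020, 4, 1), duration_days=21)
    + intervention_rows("travel_ban", date(2020, 4, 10), duration_days=30)
)
```

The model returned by `build_model_with_interventions(rows)` would then be fitted in the usual way, and the estimated effect of each named event would appear as a separate component of the forecast.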