I. Introduction
Coronavirus disease 2019 (COVID-19) is rapidly spreading across the globe and has become a significant public health threat to humankind infecting millions worldwide [
1]. India is a lower-middle-income country in the South-East Asia region with a population of 1.3 billion. India reported its first case of COVID-19 on January 30, 2020 [
2]. The case numbers were almost static for over a month and gradually started to increase during early March. As of July 7, 2020, India recorded 719,665 cases and 20,160 people succumbed to COVID-19 [
1]. Considering the rising menace of COVID-19, it is essential to explore the methods and resources that might predict the case numbers expected and help in identifying the locations of outbreaks. This will help us understand what to expect and prepare for in terms of caseload and intensive care requirements.
India has an established disease surveillance system, the Integrated Disease Surveillance Program (IDSP), to identify the signals, suspects, and cases of certain notified diseases [
3]. IDSP enables the government to make evidence-based decisions on outbreaks. However, the system captures data only when people access healthcare services. All around the world, non-conventional, informal data sources such as school absenteeism, over-the-counter drug disbursement, Internet search engines, and social media are being explored and used to supplement formal disease surveillance systems in predicting outbreaks [
4]. Search engines and social media tools include Google Trends (GT; Web search, YouTube search, News search, Image search), Twitter, Wikipedia, Baidu, Weibo, and so forth. GTs have been used over the last decade to provide reliable predictions of outbreaks of influenza and other diseases [
4–
7].
Internet usage among Indians has been on the rise, reaching about 451 million (36%) active users every month, with two-thirds of them being daily users [
8]. Search engines are among the most commonly used facilities on the Internet for finding information on a wide range of subjects. Among search engines, Google has a virtual monopoly in India, with 98.8% of the total search engine market share [
9]. YouTube is an archive/database of videos uploaded across the world on multiple subjects and topics. India has a major share of people accessing and watching YouTube, with around 265 million active users monthly [
10]. Both Google and YouTube are free to use, and the data on search terms and patterns are openly available.
It has been shown that relative search volumes (RSV) of terms specific to a disease from GTs can predict outbreaks of that particular disease in India [
6]. However, it is necessary to determine and confirm the correlation, if any, between GTs and other diseases in the country [
6]. COVID-19 is one such disease, with the additional feature of being a novel infection. Hence, we conducted a study to analyze the potential use of GTs to monitor public concern regarding the COVID-19 epidemic in India and to evaluate the ability of GT data to predict the COVID-19 outbreak in India.
II. Methods
Our study was based on Google Trends, the database of the most commonly used search engine in India, using different keywords that the public might have used to access information on COVID-19, from January 30, 2020 to April 15, 2020. All data used in our study were openly available, and no explicit permission was required to use them.
1. Data on Google Search Terms
The Google Trends homepage (
www.google.com/trends) features clustered topics that Google detects to be related and trending together on either Web search, YouTube, or Google News. Trending keywords are collected based on Google’s Knowledge Graph technology, and the data are normalized and presented on a scale from 0 to 100, where each point on the graph is divided by the highest point, which is set to 100 [
6,
11]. On the results page, the user can add topics to compare them simultaneously in the charts by clicking the + Compare button or remove an item by clicking the “x” that appears in its box when the user hovers his or her cursor over it. Using this comparison method, we assessed 15 possible keywords that the Indian population might have used. Among them, the five most commonly used keywords were considered. The Google search terms used for the analysis were “coronavirus”, “COVID”, “COVID 19”, “corona”, and “virus”.
Web search is a generic search, irrespective of whether the content is images, videos, or text news. News search is specific for articles published in the media. The study period RSVs for each of the search terms were retrieved from the GTs for India [
12]. The RSV number represents the proportion of popularity of a term relative to the peak popularity during the reference period for the selected region. Hence, it gives a relative weight in terms of temporal and spatial aspects for search phrases in Google. A value of 100 means the term was at the peak of its popularity, while a value of 25 indicates that the search term was 25% as popular as that of its peak popularity during the specified time in the particular region. The reference period for the RSV data for the search terms was from January 30, 2020 to April 15, 2020. India reported its first case of COVID-19 on January 30, 2020 [
2].
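The 0 to 100 scaling described above can be illustrated with a minimal sketch. It assumes a hypothetical series of raw daily search-interest values for one term in one region; Google does not publish its exact procedure, so the function name `relative_search_volume` and the input numbers are illustrative assumptions rather than Google's implementation.

```python
def relative_search_volume(raw_counts):
    """Scale a series so that its peak within the reference period equals 100.

    Simplified illustration only: Google's actual normalization is not public.
    """
    peak = max(raw_counts)
    if peak == 0:
        # No search interest at all in the window: every point is 0.
        return [0 for _ in raw_counts]
    return [round(100 * count / peak) for count in raw_counts]


# Hypothetical raw search-interest values for one term over five days.
raw = [12, 30, 48, 120, 60]
print(relative_search_volume(raw))  # [10, 25, 40, 100, 50]
```

In this example, the second day has an RSV of 25, meaning the term was 25% as popular as on its peak day within the same period and region.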
2. Data on the Number of COVID-19 Cases
The number of daily new confirmed cases and the cumulative confirmed cases in India were obtained for the period until April 15 from
https://datahub.io/. The data were sourced from the upstream repository maintained by the Johns Hopkins University Center for Systems Science and Engineering (CSSE), which obtains its data for India from the World Health Organization (WHO). A confirmed case is defined as one in which the patient tests positive for COVID-19 in the reverse transcription-polymerase chain reaction (RT-PCR) test.
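The two case series are related by a first difference: daily new cases are the day-to-day change in the cumulative count. The sketch below shows that relationship on hypothetical numbers; it is not a description of how the dataset itself was prepared, and the function name `daily_new_cases` is an assumption made for illustration.

```python
def daily_new_cases(cumulative):
    """Derive daily new cases as first differences of a cumulative series.

    The first day keeps its own cumulative count as its number of new cases.
    """
    new_cases = [cumulative[0]]
    for previous, current in zip(cumulative, cumulative[1:]):
        new_cases.append(current - previous)
    return new_cases


# Hypothetical cumulative confirmed cases over six days.
cumulative = [1, 1, 3, 3, 6, 10]
print(daily_new_cases(cumulative))  # [1, 0, 2, 0, 3, 4]
```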
3. Statistical Analysis
Data were downloaded in Excel format. The analysis was done using SPSS trial version 26.0 (IBM, Armonk, NY, USA). Spearman correlation was used to determine the correlation of the Google search terms with the daily new confirmed cases and the daily cumulative cases. To establish temporal relationships, we also performed a lag correlation analysis for lags of up to 30 days. An r-value of >0.7 was considered a high correlation, and a p-value of <0.05 was considered statistically significant.
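For readers who wish to reproduce this type of analysis without SPSS, a lag Spearman correlation can be sketched in plain Python as follows. This is a minimal illustration, not the SPSS procedure used in the study: the function names, the convention that RSV on day t is paired with cases on day t + lag, and the example series are all assumptions.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


def lagged_spearman(rsv, cases, lag):
    """Correlate RSV on day t with case counts on day t + lag (lag >= 0)."""
    if lag < 0 or lag >= len(rsv) - 1:
        raise ValueError("lag must leave at least two paired observations")
    if lag == 0:
        return spearman(rsv, cases)
    return spearman(rsv[:-lag], cases[lag:])


# Hypothetical series in which cases follow search interest by two days.
rsv = [1, 3, 7, 9, 6, 4, 2, 5]
cases = [0, 0, 10, 30, 70, 90, 60, 40]
for lag in range(0, 4):
    print(lag, round(lagged_spearman(rsv, cases, lag), 2))
```

In this toy example the coefficient is highest at a lag of 2 days, because the case series was constructed to mirror the search series two days later.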
III. Results
Figure 1 shows the overall trends of data from the keyword search for “coronavirus”, “COVID”, “corona”, “COVID 19”, and “virus” (infective agent category) during the selected period and the overall mean RSV of these keywords. It was observed that, among the search terms used, “coronavirus” and “corona” were the terms most commonly used by Google users.
Figure 1 also shows that the dynamics of GT data in India were related to public concern at the time of various important announcements and actions taken by the government of India. The spike in search volumes started after the WHO declared COVID-19 as a pandemic on March 11, 2020 and when the Indian government made it a notifiable disease on March 14, 2020. It reached its peak immediately after India instituted a nationwide lockdown on March 24, 2020.
Figure 2 presents the correlation matrix between the two most common keywords used in various sub-searches and cumulative confirmed cases, daily new cases, and cumulative deaths. The calculated Spearman correlation coefficients were found to be highly significant for all variables at the
p-value level of 0.01.
1. Lag Correlation Analysis
Table 1 and
Figure 3A–3C show the lag Spearman correlation between the RSV from GTs for various sub-searches (Web search, YouTube search, and News search) and the cumulative laboratory-confirmed COVID-19 cases. The correlation of the News search terms “coronavirus” and “corona” with the daily cumulative cases was high (
r > 0.7) for lag periods of 21 days and 20 days, respectively. The strength of correlation increases as the lag period decreases, reaching the maximum (
r = 0.83) during lag periods of 11 days and 9 days for “coronavirus” and “corona”, respectively. The correlation fluctuates and falls thereafter. GTs for the search terms “coronavirus” and “corona” in Web searches were found to be highly correlated (
r > 0.7) with the daily cumulative cases, for a lag period of 15 days from the peak of the cumulative case numbers. Similar to the News search, the strength of correlation increases as the lag period decreases, and it reaches the maximum (
r = 0.89), on day zero, i.e., the day on which the cases peak. With regard to YouTube search, a high correlation exists between the terms “coronavirus” and “corona” and cumulative cases with lag periods of 20 days and 19 days, respectively. The strength of correlation reaches the maximum for the terms “coronavirus” (
r = 0.86) and “corona” (
r = 0.84), 11 days and 9 days, respectively, before the day the cases peak.
Table 2 and
Figure 3D–3F show the lag Spearman correlation between the RSV from GTs for various sub-searches and the daily new laboratory-confirmed COVID-19 cases. Correlation between the Web search terms “coronavirus” and “corona” was high (
r > 0.7) with the daily new cases, from lag periods of 14 days and 15 days, respectively. The strength of correlation increases as the lag period decreases, and it reaches the maximum during lag periods of 4 days for “corona” (
r = 0.82) and of 4 days and zero days for “coronavirus” (
r = 0.81). News search GTs for the terms “coronavirus” and “corona” were found to be highly correlated (
r > 0.7) with the daily new cases 21 days and 19 days before the cases peak, respectively. The strength of correlation reaches the maximum during lag periods of 13 days and 9 days for “coronavirus” (
r = 0.77) and “corona” (
r = 0.78), respectively. In YouTube search, a high correlation exists between the terms “coronavirus” and “corona” and new case numbers with lag periods of 20 days and 19 days, respectively. The strength of correlation reaches the maximum 15 days for the term “coronavirus” (
r = 0.82) and 10 days for “corona” (
r = 0.79) before the day the cases peak.
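The two quantities reported for each search term above, namely the longest lag at which the correlation is still high (r > 0.7) and the lag at which the correlation is strongest, can be read off a lag profile as in the following sketch. The dictionary of coefficients is hypothetical and is not taken from Table 1 or Table 2; the function name `summarise_lag_profile` is likewise an assumption.

```python
def summarise_lag_profile(r_by_lag, threshold=0.7):
    """Summarise a lag-correlation profile.

    Returns a tuple of:
      - the longest lag (in days) whose coefficient exceeds the threshold,
        or None if no lag does;
      - the lag with the largest coefficient;
      - that largest coefficient.
    """
    high_lags = [lag for lag, r in r_by_lag.items() if r > threshold]
    longest_high_lag = max(high_lags) if high_lags else None
    best_lag = max(r_by_lag, key=r_by_lag.get)
    return longest_high_lag, best_lag, r_by_lag[best_lag]


# Hypothetical Spearman coefficients keyed by lag in days (illustrative only).
profile = {0: 0.60, 5: 0.72, 9: 0.80, 11: 0.83, 15: 0.78, 21: 0.71, 25: 0.55}
print(summarise_lag_profile(profile))  # (21, 11, 0.83)
```

For this illustrative profile, the correlation is high from a lag of 21 days onward and peaks at a lag of 11 days.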
IV. Discussion
Search queries have been widely used to predict disease outbreaks all over the world [
13,
14]. The fundamental principle behind this theory is that symptomatic and soon to be symptomatic people, among others, will search for details about the disease on the internet before reaching a health facility or accessing healthcare [
15]. This will cause a spike in search queries for the particular disease before the patients are captured by the routine disease surveillance system of the health authorities. Our analysis revealed that the terms “coronavirus” and “corona” were the most popular terms used for Google search in India. Li et al. [
14] in their study from China included the term “pneumonia” as well because, during the early stages of the pandemic, COVID-19 was identified as “pneumonia of unknown etiology”. However, by the time the first case emerged in India, it was established to be caused by a coronavirus [
16].
We found that the GTs from Google Web, Google News, and YouTube searches correlate strongly with the cumulative and new COVID-19 case numbers. The maximum lag period for predicting COVID-19 cases was found to be 21 days with the News search for the term “coronavirus”, that is, the search volume for “coronavirus” peaked 21 days before the peak number of cases. Li et al. [
14] reported that search engines were able to predict the COVID-19 outbreak in China 1 to 2 weeks in advance, a shorter lag period than that found for India.
The greater lag time for India may be attributed to the fact that Indians were sensitized to the corona disease by news from China and other countries, which could have influenced their search behavior. The Internet search pattern and behavior of the population depend on the influence of various factors, such as peer groups, mass media bulletins, government actions, social media interactions, and so forth. They are among the determinants of health-seeking behavior [
17]. The series of disease control measures by India, such as suspending international travel and countrywide lockdown to establish physical distancing, may also have played a role in a gradual increase rather than rapid spiking of the COVID-19 case number [
18]. However, Li et al. [
14] compared the search terms with the new suspected and new confirmed cases, whereas we considered cumulative confirmed and new confirmed cases. The maximum strength of correlation for new confirmed case numbers was found with the term “coronavirus” in Google Web search (
r = 0.82) and YouTube search (
r = 0.82), while the strength of correlation was higher (
r = 0.96) in China [
14].
In recent years, GTs have been widely explored as an option to predict various diseases. Shin et al. [
5] found in their study in Korea that GTs were useful in predicting Middle East respiratory syndrome coronavirus (MERS-CoV) outbreaks 4 days in advance of the routine disease surveillance system, which is a shorter lag period than our findings. The greater lag period in our study could have been due to the curiosity associated with the novel infection, COVID-19. Santangelo et al. [
19] reported that GTs could predict a measles outbreak as early as 4 weeks before the conventional surveillance data in Italy.
In contrast, Provenzano et al. [
20] reported no advance prediction capability for Wikipedia trends with maximum correlation happening on day zero. Carneiro and Mylonakis [
15] reported the ability of GTs to predict influenza outbreaks 7 to 10 days earlier than conventional systems. Wilson et al. [
21] concluded that GTs could only be explored as supplementary to conventional systems because the Google Flu Trends system did not offer any early prediction and its predictions were in line with the formal surveillance systems for influenza-like illness (ILI) cases in New Zealand. GTs are recommended for countries that do not have well-established and robust disease surveillance systems. Not only the prediction of cases but also the effectiveness of disease control measures has been assessed using GTs. Google searches of COVID-19 control-related terms like “handwashing” have been found to be negatively correlated with the increase in the number of COVID-19 cases, thus acting as an indicator of the effectiveness of COVID-19 prevention strategies [
22].
However, our study based on GTs should be interpreted cautiously because it had the following limitations. We included only search terms in the English language. India is a multilingual country, but search terms in the other major Indian languages were not accounted for in our study. The fundamental measure of association studied here is correlation, and even a strong correlation per se cannot be used as sufficient evidence for making GTs a primary tool of surveillance [
23].
The details of the algorithm by which Google generates these search data are also unclear. GTs require a large proportion of regular internet users in the country to be an effective predictor [
15]. However, the exact quantification of this proportion is not available from the literature. Hence, the data obtained from GTs represent only one segment of the population. GTs are also strongly influenced by the media popularity of a particular disease [
24], as people will be inclined to look into a disease or condition that is actively displayed and discussed in the popular media.
This phenomenon might have occurred in our study, as we saw a spike in searches using keywords related to COVID-19 whenever a landmark decision was taken by the WHO or the Indian government, which might have had greater media dissemination. It might have caused a disproportionate swing among the public in their internet searching patterns, and may have led to overestimation of the ground reality of the disease. On the other hand, if the general public has poor knowledge about a disease, then the epidemiological burden of that particular disease tends to be underestimated by GTs [
24]. Ours was a retrospective study: the RSVs from GTs, which were used to correlate the search terms with the disease burden, were calculated from retrospective data. Real-time prediction of outbreaks and their lag times requires mathematical modelling in addition to internet search data. Hence, future research should focus on strategies to improve the reliability of GTs in disease prediction by formulating mathematical models incorporating internet search data. In the meantime, GTs should not be used as a replacement for robust disease surveillance; rather, they should be explored only to supplement it [
21].
In conclusion, our study revealed that Google Web, YouTube, and News searches might be useful to predict outbreaks of COVID-19 2 to 3 weeks earlier than the routine disease surveillance or reporting system in India. This can be further explored and tested for each state in India, using the search terms in the state-specific languages. However, Google search data may be considered only as a supplementary tool in COVID-19 monitoring and planning in India until more evidence is generated on its reliability and real-time prediction efficacy. Further, positive search terms, such as “handwashing” and “masks”, which are related to public awareness, can be explored for their usefulness in assessing the effectiveness of COVID-19 transmission prevention measures at large.