ramikrispin / coronavirus

The coronavirus dataset

Home Page: https://ramikrispin.github.io/coronavirus/

License: Other

R 10.93% Dockerfile 1.79% Shell 2.28% HTML 46.14% Julia 0.55% JavaScript 38.32%
covid-19 rstats dataset covid19-data covid19

coronavirus's Introduction

coronavirus

R-CMD Data Pipeline CRAN_Status_Badge lifecycle License: MIT GitHub commit Downloads

The coronavirus package provides a tidy format for the COVID-19 dataset collected by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. The dataset includes daily new confirmed and death cases from January 2020 through March 2023, and recovered cases through August 2022.

More details are available here, and a CSV version of the package dataset is available here

Data source: https://github.com/CSSEGISandData/COVID-19

Source: Centers for Disease Control and Prevention’s Public Health Image Library

Important Notes

  • As of March 10th, 2023, JHU CCSE stopped collecting and tracking new cases
  • As of August 4th, 2022, JHU CCSE stopped tracking recovery cases; please see this issue for more details
  • Negative values and/or anomalies may occur in the data for the following reasons:
    • Daily cases are calculated from the raw data, which is in cumulative format, by taking the daily difference. Some retroactive updates (such as removing false-positive cases) are not tied to the day on which they actually occurred
    • Anomalies or errors in the raw data
    • Please see this issue for more details
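The daily-difference mechanics described above can be sketched with illustrative numbers (not real data): a downward revision of the cumulative series shows up as a negative daily value.

```r
# Illustrative numbers only: a cumulative series with a retroactive
# downward revision on day 4 (e.g., false positives removed).
cumulative <- c(100, 120, 150, 140, 145)

# Daily cases are the day-over-day difference of the cumulative series;
# the revision shows up as a negative daily value.
daily <- diff(c(0, cumulative))
daily
#> [1] 100  20  30 -10   5
```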

Vignettes

Additional documentation is available in the following vignettes:

Installation

Install the CRAN version:

install.packages("coronavirus")

Install the GitHub version (refreshed on a daily basis):

# install.packages("devtools")
devtools::install_github("RamiKrispin/coronavirus")

Datasets

The package provides the following two datasets:

  • coronavirus - tidy (long) format of the JHU CCSE datasets, which includes the following columns:

    • date - The date of the observation, using Date class
    • province - Name of province/state, for countries where data is provided split across multiple provinces/states
    • country - Name of country/region
    • lat - Latitude
    • long - Longitude
    • type - An indicator for the type of cases (confirmed, death, recovered)
    • cases - Number of cases on a given date
    • uid - Country code
    • province_state - Province or state if applicable
    • iso2 - Two-letter officially assigned country code
    • iso3 - Three-letter officially assigned country code
    • code3 - UN country code
    • fips - Federal Information Processing Standards code that uniquely identifies counties within the USA
    • combined_key - Country and province (if applicable)
    • population - Country or province population
    • continent_name - Continent name
    • continent_code - Continent code
  • covid19_vaccine - a tidy (long) format of the Johns Hopkins Centers for Civic Impact global vaccination dataset by country. This dataset includes the following columns:

    • country_region - Country or region name
    • date - Data collection date in YYYY-MM-DD format
    • doses_admin - Cumulative number of doses administered. When a vaccine requires multiple doses, each one is counted independently
    • people_partially_vaccinated - Cumulative number of people who received at least one vaccine dose. When the person receives a prescribed second dose, it is not counted twice
    • people_fully_vaccinated - Cumulative number of people who received all prescribed doses necessary to be considered fully vaccinated
    • report_date_string - Data report date in YYYY-MM-DD format
    • uid - Country code
    • province_state - Province or state if applicable
    • iso2 - Two-letter officially assigned country code
    • iso3 - Three-letter officially assigned country code
    • code3 - UN country code
    • fips - Federal Information Processing Standards code that uniquely identifies counties within the USA
    • lat - Latitude
    • long - Longitude
    • combined_key - Country and province (if applicable)
    • population - Country or province population
    • continent_name - Continent name
    • continent_code - Continent code

The refresh_coronavirus_jhu function enables loading the data directly from the package repository using the Covid19R project data standard format:

covid19_df <- refresh_coronavirus_jhu()
#> Loading 2020 data
#> Loading 2021 data
#> Loading 2022 data
#> Loading 2023 data

head(covid19_df)
#>         date    location location_type location_code location_code_type
#> 1 2021-12-31 Afghanistan       country            AF         iso_3166_2
#> 2 2020-03-24 Afghanistan       country            AF         iso_3166_2
#> 3 2022-11-02 Afghanistan       country            AF         iso_3166_2
#> 4 2020-03-23 Afghanistan       country            AF         iso_3166_2
#> 5 2021-08-09 Afghanistan       country            AF         iso_3166_2
#> 6 2023-03-02 Afghanistan       country            AF         iso_3166_2
#>       data_type value      lat     long
#> 1     cases_new    28 33.93911 67.70995
#> 2 recovered_new     0 33.93911 67.70995
#> 3     cases_new    98 33.93911 67.70995
#> 4 recovered_new     0 33.93911 67.70995
#> 5    deaths_new    28 33.93911 67.70995
#> 6     cases_new    18 33.93911 67.70995

Usage

data("coronavirus")

head(coronavirus)
#>         date province country     lat      long      type cases   uid iso2 iso3
#> 1 2020-01-22  Alberta  Canada 53.9333 -116.5765 confirmed     0 12401   CA  CAN
#> 2 2020-01-23  Alberta  Canada 53.9333 -116.5765 confirmed     0 12401   CA  CAN
#> 3 2020-01-24  Alberta  Canada 53.9333 -116.5765 confirmed     0 12401   CA  CAN
#> 4 2020-01-25  Alberta  Canada 53.9333 -116.5765 confirmed     0 12401   CA  CAN
#> 5 2020-01-26  Alberta  Canada 53.9333 -116.5765 confirmed     0 12401   CA  CAN
#> 6 2020-01-27  Alberta  Canada 53.9333 -116.5765 confirmed     0 12401   CA  CAN
#>   code3    combined_key population continent_name continent_code
#> 1   124 Alberta, Canada    4413146  North America             NA
#> 2   124 Alberta, Canada    4413146  North America             NA
#> 3   124 Alberta, Canada    4413146  North America             NA
#> 4   124 Alberta, Canada    4413146  North America             NA
#> 5   124 Alberta, Canada    4413146  North America             NA
#> 6   124 Alberta, Canada    4413146  North America             NA

Summary of the total confirmed cases by country (top 20):

library(dplyr)

summary_df <- coronavirus %>% 
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  arrange(-total_cases)

summary_df %>% head(20) 
#> # A tibble: 20 × 2
#>    country        total_cases
#>    <chr>                <dbl>
#>  1 US               103802702
#>  2 India             44690738
#>  3 France            39866718
#>  4 Germany           38249060
#>  5 Brazil            37076053
#>  6 Japan             33320438
#>  7 Korea, South      30615522
#>  8 Italy             25603510
#>  9 United Kingdom    24658705
#> 10 Russia            22075858
#> 11 Turkey            17042722
#> 12 Spain             13770429
#> 13 Vietnam           11526994
#> 14 Australia         11399460
#> 15 Argentina         10044957
#> 16 Taiwan*            9970937
#> 17 Netherlands        8712835
#> 18 Iran               7572311
#> 19 Mexico             7483444
#> 20 Indonesia          6738225

Summary of new cases during the past 24 hours by country and type (as of 2023-03-09):

library(tidyr)

coronavirus %>% 
  filter(date == max(date)) %>%
  select(country, type, cases) %>%
  group_by(country, type) %>%
  summarise(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type,
              values_from = total_cases) %>%
  arrange(-confirmed)
#> # A tibble: 201 × 4
#> # Groups:   country [201]
#>    country        confirmed death recovery
#>    <chr>              <dbl> <dbl>    <dbl>
#>  1 US                 46931   590        0
#>  2 United Kingdom     28783     0        0
#>  3 Australia          13926   115        0
#>  4 Russia             12385    38        0
#>  5 Belgium            11570    39        0
#>  6 Korea, South       10335    12        0
#>  7 Japan               9834    80        0
#>  8 Germany             7829   127        0
#>  9 France              6308    11        0
#> 10 Austria             5283    21        0
#> # … with 191 more rows

Plotting the worldwide distribution of active, recovered, and death cases over time:

library(plotly)

coronavirus %>% 
  group_by(type, date) %>%
  summarise(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(date) %>%
  mutate(active = confirmed - death - recovery) %>%
  mutate(active_total = cumsum(active),
         recovered_total = cumsum(recovery),
         death_total = cumsum(death)) %>%
  plot_ly(x = ~ date,
          y = ~ active_total,
          name = 'Active',
          fillcolor = '#1f77b4',
          type = 'scatter',
          mode = 'none',
          stackgroup = 'one') %>%
  add_trace(y = ~ death_total,
            name = "Death",
            fillcolor = '#E41317') %>%
  add_trace(y = ~recovered_total,
            name = 'Recovered',
            fillcolor = 'forestgreen') %>%
  layout(title = "Distribution of Covid19 Cases Worldwide",
         legend = list(x = 0.1, y = 0.9),
         yaxis = list(title = "Number of Cases"),
         xaxis = list(title = "Source: Johns Hopkins University Center for Systems Science and Engineering"))

Plot the confirmed cases distribution by country with a treemap plot:

conf_df <- coronavirus %>% 
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  arrange(-total_cases) %>%
  mutate(parents = "Confirmed") %>%
  ungroup() 

plot_ly(data = conf_df,
        type = "treemap",
        values = ~total_cases,
        labels = ~country,
        parents = ~parents,
        domain = list(column = 0),
        name = "Confirmed",
        textinfo = "label+value+percent parent")

data(covid19_vaccine)

head(covid19_vaccine)
#>         date country_region continent_name continent_code combined_key
#> 1 2020-12-29        Austria         Europe             EU      Austria
#> 2 2020-12-29        Bahrain           Asia             AS      Bahrain
#> 3 2020-12-29        Belarus         Europe             EU      Belarus
#> 4 2020-12-29        Belgium         Europe             EU      Belgium
#> 5 2020-12-29         Canada  North America             NA       Canada
#> 6 2020-12-29          Chile  South America             SA        Chile
#>   doses_admin people_at_least_one_dose population uid iso2 iso3 code3 fips
#> 1        2123                     2123    9006400  40   AT  AUT    40 <NA>
#> 2       55014                    55014    1701583  48   BH  BHR    48 <NA>
#> 3           0                        0    9449321 112   BY  BLR   112 <NA>
#> 4         340                      340   11589616  56   BE  BEL    56 <NA>
#> 5       59079                    59078   37855702 124   CA  CAN   124 <NA>
#> 6          NA                       NA   19116209 152   CL  CHL   152 <NA>
#>        lat       long
#> 1  47.5162  14.550100
#> 2  26.0275  50.550000
#> 3  53.7098  27.953400
#> 4  50.8333   4.469936
#> 5  60.0000 -95.000000
#> 6 -35.6751 -71.543000

Taking a snapshot of the data for the most recent date available and calculating the ratio between total administered doses and population size:

df_summary <- covid19_vaccine |>
  filter(date == max(date)) |>
  select(date, country_region, doses_admin, total = people_at_least_one_dose, population, continent_name) |>
  mutate(doses_pop_ratio = doses_admin / population,
         total_pop_ratio = total / population) |>
  filter(country_region != "World", 
         !is.na(population),
         !is.na(total)) |>
  arrange(- total)

head(df_summary, 10)
#>          date country_region doses_admin      total population continent_name
#> 1  2023-03-09          China          NA 1310292000 1404676330           Asia
#> 2  2023-03-09          India          NA 1027379945 1380004385           Asia
#> 3  2023-03-09             US   672076105  269554116  329466283  North America
#> 4  2023-03-09      Indonesia   444303130  203657535  273523621           Asia
#> 5  2023-03-09         Brazil   502262440  189395212  212559409  South America
#> 6  2023-03-09       Pakistan   333759565  162219717  220892331           Asia
#> 7  2023-03-09     Bangladesh   355143411  151190373  164689383           Asia
#> 8  2023-03-09          Japan   382415648  104675948  126476458           Asia
#> 9  2023-03-09         Mexico   225063079   99071001  127792286  North America
#> 10 2023-03-09        Vietnam   266252632   90466947   97338583           Asia
#>    doses_pop_ratio total_pop_ratio
#> 1               NA       0.9328071
#> 2               NA       0.7444759
#> 3         2.039893       0.8181539
#> 4         1.624368       0.7445702
#> 5         2.362927       0.8910225
#> 6         1.510960       0.7343837
#> 7         2.156444       0.9180335
#> 8         3.023611       0.8276319
#> 9         1.761163       0.7752502
#> 10        2.735325       0.9294048

Plot of the total doses and population ratio by country:

# Setting the diagonal lines range
line_start <- 10000
line_end <- 1500 * 10 ^ 6

# Filter the data
d <- df_summary |> 
  filter(country_region != "World", 
         !is.na(population),
         !is.na(total)) 


# Replot it
p3 <- plot_ly() |>
  add_markers(x = d$population,
              y = d$total,
              text = ~ paste("Country: ", d$country_region, "<br>",
                             "Population: ", d$population, "<br>",
                             "Total Doses: ", d$total, "<br>",
                             "Ratio: ", round(d$total_pop_ratio, 2), 
                             sep = ""),
              color = d$continent_name,
              type = "scatter",
              mode = "markers") |>
  add_lines(x = c(line_start, line_end),
            y = c(line_start, line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  add_lines(x = c(line_start, line_end),
            y = c(0.5 * line_start, 0.5 * line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  
  add_lines(x = c(line_start, line_end),
            y = c(0.25 * line_start, 0.25 * line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  add_annotations(text = "1:1",
                  x = log10(line_end * 1.25),
                  y = log10(line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "1:2",
                  x = log10(line_end * 1.25),
                  y = log10(0.5 * line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "1:4",
                  x = log10(line_end * 1.25),
                  y = log10(0.25 * line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "Source: Johns Hopkins University - Centers for Civic Impact",
                  showarrow = FALSE,
                  xref = "paper",
                  yref = "paper",
                  x = -0.05, y = - 0.33) |>
  layout(title = "Covid19 Vaccine - Total Doses vs. Population Ratio (Log Scale)",
         margin = list(l = 50, r = 50, b = 90, t = 70),
         yaxis = list(title = "Number of Doses",
                      type = "log"),
         xaxis = list(title = "Population Size",
                      type = "log"),
         legend = list(x = 0.75, y = 0.05))

Dashboard

Note: the dashboard is currently under maintenance due to recent changes in the data structure. Please see this issue for more details

A supporting dashboard is available here

Data Sources

The raw data is pulled and arranged by the Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE) from the following sources:

coronavirus's People

Contributors

ramikrispin

coronavirus's Issues

Separate update and data access into different packages? 0.3 thought.

One thing that occurs to me is that, as we move forward, rectifying issues in different datasets requires a lot of reprocessing. While some folks might want the latest data scraped directly from a site, adding the raw update scripts has upped the number of dependencies a great deal. If in 0.3 we move to broader access to an array of datasets, as mentioned in #24, then perhaps it would be worth splitting the raw update scripts into a separate package to lower the dependencies? Just a thought. Curious what you think.

Error when devtools::install_github("RamiKrispin/coronavirus")

install.packages("devtools")
devtools::install_github("RamiKrispin/coronavirus")

Error message:
Installing package into ‘/databricks/spark/R/lib’ Error : Failed to install 'coronavirus' from GitHub:
(converted from warning) installation of package ‘units’ had non-zero exit status.

Expected changes in the package CRAN version 0.2.0

As many changes have occurred since the release of v0.1.0, significant changes in the data structure are expected in v0.2.0 (expected to be released to CRAN by May 15). The changes are available on the dev-v020 branch

coronavirus 0.2.0

  • Data changes:
    • coronavirus dataset - Changed the structure of the US data from March 23rd, 2020 forward. The US data is now available at an aggregated level. More information about the changes in the raw data is available in this issue
    • Changes in the column names and order:
      • Province.State changed to province
      • Country.Region changed to country
      • Lat changed to lat
      • Long changed to long
    • The covid_south_korea and covid_iran datasets that were available in the dev version were removed from the package and moved to a new package, covid19wiki, for now available only on GitHub
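Code written against the v0.1.0 column names can be migrated with a simple rename; a minimal sketch using illustrative data (not the real dataset):

```r
# Hypothetical migration helper for code written against the v0.1.0
# column names; df stands in for the old-format dataset.
old_names <- c("Province.State", "Country.Region", "Lat", "Long")
new_names <- c("province", "country", "lat", "long")

df <- data.frame(Province.State = "Hubei", Country.Region = "China",
                 Lat = 30.97, Long = 112.27)

# Rename in place, preserving any other columns.
names(df)[match(old_names, names(df))] <- new_names
names(df)
#> [1] "province" "country"  "lat"      "long"
```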

Problem with Dataset update

Hello, I have a problem with the 18/4/20 data update. I can't update the data, but your dashboard and your repo suggest that you updated the data yesterday.
Any idea?
P.S. I reinstalled the package with devtools::install_github("covid19r/coronavirus") but still nothing

coronavirus::update_datasets()
Updates are available on the coronavirus Dev version, do you want to update? n/YY
Skipping install of 'coronavirus' from a github remote, the SHA1 (f3a23eb) has not changed since last install.
Use force = TRUE to force installation
The data was refresed, please restart your session to have the new data available

The datasets update

Hi Rami;
When I run "update_datasets" I get the following error message: Error in update_datasets() : could not find function "update_datasets".
And if I want to install the package I get the following error message : Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
there is no package called ‘fs’.
What is the problem, please?

Dataset lagging behind other sources

Hi Rami Krispin:

I notice that the coronavirus dataset sometimes lags behind worldometers by days. If you relied on worldometers to update the dataset, that would be the most reliable in my opinion.

For example, the data for May 16 is missing from coronavirus. You could easily import the data from worldometers once a day and update the repository. I am using coronavirus for data older than two days and merging it with worldometers (which publishes the data for today and the day before).

Data not updated

Hi Rami,

Thanks for the excellent work. It seems that the data has not been updated for the last two days.

Thanks
Shubhram

" Azerbaijan" Has Space in the Name

" Azerbaijan" has an initial space in the name that should not exist.

> coronavirus$Country.Region %>% unique 
 [1] "Japan"                "South Korea"          "Thailand"             "Mainland China"       "Macau"               
 [6] "US"                   "Taiwan"               "Singapore"            "Vietnam"              "Hong Kong"           
[11] "France"               "Malaysia"             "Nepal"                "Australia"            "Canada"              
[16] "Cambodia"             "Germany"              "Sri Lanka"            "Finland"              "United Arab Emirates"
[21] "India"                "Philippines"          "Italy"                "Russia"               "Sweden"              
[26] "UK"                   "Spain"                "Belgium"              "Others"               "Egypt"               
[31] "Iran"                 "Israel"               "Lebanon"              "Afghanistan"          "Bahrain"             
[36] "Iraq"                 "Kuwait"               "Oman"                 "Algeria"              "Austria"             
[41] "Croatia"              "Switzerland"          "Brazil"               "Georgia"              "Greece"              
[46] "North Macedonia"      "Norway"               "Pakistan"             "Romania"              "Denmark"             
[51] "Estonia"              "Netherlands"          "San Marino"           " Azerbaijan"          "Belarus"             
[56] "Iceland"              "Lithuania"            "Mexico"               "New Zealand"          "Nigeria"             
[61] "North Ireland" 
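The real fix belongs in the raw-data processing, but until then a possible client-side workaround, sketched on an illustrative vector, is to strip the surrounding whitespace with trimws():

```r
# Illustrative vector reproducing the problem: one name carries a
# leading space, which splits it from its properly-named twin.
countries <- c("Japan", " Azerbaijan", "US")
cleaned <- trimws(countries)
cleaned
#> [1] "Japan"      "Azerbaijan" "US"
```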

Negative values were found in the package

This package is really cool! Thanks for putting them together!

I found some negative values in this package. I checked the raw data source, and didn't see negative values there. I guess some errors might be introduced during the data processing?

> library(coronavirus)
> data("coronavirus")
> coronavirus %>% arrange(cases) %>% head(20)
# A tibble: 20 x 7
   Province.State                         Country.Region   Lat   Long date       cases type     
   <chr>                                  <chr>          <dbl>  <dbl> <date>     <int> <chr>    
 1 "Omaha, NE (From Diamond Princess)"    US              41.3  -96.0 2020-02-24   -11 confirmed
 2 "Diamond Princess cruise ship"         Others          35.4  140.  2020-03-06   -10 confirmed
 3 "From Diamond Princess"                Australia       35.4  140.  2020-02-29    -8 confirmed
 4 "Travis, CA (From Diamond Princess)"   US              38.3 -122.  2020-02-24    -5 confirmed
 5 "New York County, NY"                  US              40.7  -74.0 2020-03-07    -5 confirmed
 6 "Hainan"                               Mainland China  19.2  110.  2020-02-15    -4 recovered
 7 "Guizhou"                              Mainland China  26.8  107.  2020-02-06    -3 recovered
 8 "Ningxia"                              Mainland China  37.3  106.  2020-02-09    -2 recovered
 9 "Heilongjiang"                         Mainland China  47.9  128.  2020-02-11    -2 recovered
10 "Lackland, TX (From Diamond Princess)" US              29.4  -98.6 2020-02-24    -2 confirmed
11 ""                                     Japan           36    138   2020-01-23    -1 confirmed
12 "Queensland"                           Australia      -28.0  153.  2020-01-31    -1 confirmed
13 "Queensland"                           Australia      -28.0  153.  2020-02-02    -1 confirmed
14 "Shanxi"                               Mainland China  37.6  112.  2020-02-03    -1 recovered
15 "Guangxi"                              Mainland China  23.8  109.  2020-02-12    -1 recovered
16 "Hong Kong"                            Hong Kong       22.3  114.  2020-02-21    -1 recovered
17 "Diamond Princess cruise ship"         Others          35.4  140.  2020-02-23    -1 recovered
18 ""                                     Italy           43     12   2020-02-24    -1 recovered
19 "Northern Territory"                   Australia      -12.5  131.  2020-03-06    -1 confirmed
20 ""                                     South Korea     36    128   2020-01-22     1 confirmed

Changes in the structure of the data

As per changes in the raw data, as of March 23rd some major changes are taking place in the format of the data:

  • The US data is now aggregated; the state- and county-level data has been removed from this series
  • The recovered cases were removed from the data

Adding a country filter to the dashboard

Congrats on this visualization, it's the best I've seen!

Will it be possible to add a country filter to the dashboard in the Summary and Trends tab?

I want to see how fast the virus is spreading daily through certain European countries and it's not currently possible as the data is only aggregated.

Error in Hubei data for 3/11/2020

Hi,

It looks like there's an error in your dataset for the confirmed cases entry for Hubei for 3/11/2020 -- the total cumulative cases are listed (67773) instead of the new cases for that day (13), although all other entries for other location-days appear to be reporting new cases for that day (based on comparing with the raw .csv's from Johns Hopkins).

Thanks for the package!
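One way to flag this kind of anomaly (a cumulative total slipping into a daily series) is to mark days whose value exceeds the running total of all prior days; a sketch on illustrative numbers resembling the reported values:

```r
# Illustrative daily series: a cumulative total (67773) slipped into
# the daily column on day 4, as described in the report above.
daily <- c(1800, 977, 2313, 67773, 13)

# Flag days whose value exceeds the running total of all prior days --
# impossible for a genuine daily count once the series is established.
prior_total <- cumsum(c(0, head(daily, -1)))
suspicious <- daily > prior_total & prior_total > 0
which(suspicious)
#> [1] 4
```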

Problem with update data

I can't update the data since 6/4. I use coronavirus::update_datasets(silence = TRUE) but it says "no update available"
Any help?
Thanks

When I pull data("coronavirus") it only brings 1497 cases

Hello,

I'm from Chile, and I noticed that when I run library(coronavirus); data("coronavirus"), even in the devtools version, I don't get information for Chile, only for Australia, Belgium, Cambodia, Canada, Egypt, Finland, France, Germany, Hong Kong, India, Italy, Japan, Macau, Mainland China, Malaysia, etc., and only up to February 16, 2020.

Connecting through Power BI

Hello, first-time GitHubber lol, full-time BI-oneer. I'm looking to create some dashboards from this dataset. I'm using https://github.com/javierluraschi/covid as the repo details, but I'm being asked for an owner. I have put javierluraschi but I'm being told that I do not have access.
any tips?
Kindest Regards
Jimmy

Data structure Changes / dataset updates

Rami:

Why are we having all these data changes?

I can rationalize that the recovered cases might be inaccurate; maybe hospitals are overburdened with saving patients and not able to collect the data.

But I don't understand why the state and province data was removed.

  1. Are you working on getting the state / province data back into the coronavirus dataset?
  2. Will the county / city data be available with Lat / Long data?
  3. Is there an issue with the daily updates to the datasets? I had issues getting the data on the morning of March 29 (6 am EST [+ 4 GMT] )?

Recovered + Active cases

How can we get the recovered and active cases?
Are there going to be any changes to the coronavirus package, including active and recovered columns?
Or is it going to stay as it is?

Problem with update_dataset

Hi Rami,

I upgraded to R 4.0.0 recently. Since then I'm getting the error below with update_dataset(). I removed the coronavirus package and re-installed it several times but no luck.

> update_dataset()
Updates are available on the coronavirus Dev version, do you want to update? n/YY
Downloading GitHub repo RamiKrispin/coronavirus@master
√ checking for file 'C:\Users\31672\AppData\Local\Temp\Rtmp6dHYtb\remotesef875d26408\RamiKrispin-coronavirus-e77e858/DESCRIPTION' (405ms)

  • preparing 'coronavirus': (2s)
    √ checking DESCRIPTION meta-information ...
  • checking for LF line-endings in source and make files and shell scripts
  • checking for empty or unneeded directories
  • building 'coronavirus_0.2.0.tar.gz'

Caught an warning!
<simpleWarning: package ‘coronavirus’ is in use and will not be installed>

US Data by State and County

  1. What is the status of the US data and county information?

  2. What Package / Package location?

  3. Will the data be added to the coronavirus package or to another package?

  4. Will there be history information?

  5. I also want to build a US / state / county / city dashboard, so I need lat and long data.

  6. I see links to a package for US
    a.
    #22

install.packages("covid19us")
devtools::install_github("aedobbyn/covid19us")

b. Is this package a separate effort, or are both of you (Rami and the author, Amanda Dobbyn) working on it together?

Thanks again for all the work you are doing.

Kenney

Error in as.Date.default(df4$date) : do not know how to convert 'df4$date' to class “Date”

From a fresh install today:

> library(coronavirus)
> update_datasets()
[1] "The coronavirus data set is up-to-date"
Error in as.Date.default(df4$date) : 
  do not know how to convert 'df4$date' to class “Date”

sessionInfo:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] coronavirus_0.1.0.9002

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3                  rstudioapi_0.10            
 [3] googleAnalyticsR_0.7.0.9000 magrittr_1.5               
 [5] tidyselect_1.0.0            R6_2.4.1                   
 [7] rlang_0.4.4                 fansi_0.4.1                
 [9] httr_1.4.1                  dplyr_0.8.4                
[11] tools_3.6.1                 packrat_0.4.9-3            
[13] utf8_1.1.4                  cli_2.0.1                  
[15] googleAuthR_1.1.1.9000      cranlogs_2.1.0             
[17] assertthat_0.2.1            digest_0.6.23              
[19] tibble_2.1.3                gargle_0.4.0               
[21] lifecycle_0.1.0             crayon_1.3.4               
[23] purrr_0.3.3                 tidyr_1.0.2                
[25] vctrs_0.2.2                 fs_1.3.1                   
[27] curl_4.3                    memoise_1.1.0              
[29] glue_1.3.1                  compiler_3.6.1             
[31] pillar_1.4.3                jsonlite_1.6.1             
[33] pkgconfig_2.0.3 

License ?

Hello, I was wondering if you were thinking about putting a license on this dataset. Can I use it for my own open source project ?

The number of confirmed cases in Spain on 2020-04-24 is a negative number

In the CRAN version of the coronavirus package the number of confirmed cases in Spain on 2020-04-24 is -10034, which seems like a mistake.

To visualize the number of confirmed cases in Spain I used the following
coronavirus %>% filter(country=="Spain" & type=="confirmed") %>% ggplot(aes(date, cases)) + geom_line()

To identify the specific data point I used the following
coronavirus[which.min(coronavirus$cases),]

Negative case values

Is there a significance to negative case values? Or is it a bug? I didn't see anything in the introduction about it.

Covid-19

This looks great and I'm going to use this for a stats class I teach! There are a couple of places that say "Corvid 19" instead of Covid-19. Corvid is the family of birds that includes crows and ravens :-)

Country name in standard format

Hi,

Could you please use a standard format for country names? When I plot this data on a spatial map, some countries' data are not plotted, e.g. Iran, because in your dataset Iran is named "Iran (Islamic Republic of)".

Thanks

Bug in Italian data ?

There might be a bug in the data for the 12th of March.
When I filter by country, I don't see any new cases that day in Italy or France.

This is the filtered data for Italy:

  date       confirmed death recovered active confirmed_cum death_cum
  <date>         <int> <int>     <int>  <int>         <int>     <int>
1 2020-03-09      1797    97       102   1598          9172       463
2 2020-03-10       977   168         0    809         10149       631
3 2020-03-11      2313   196       321   1796         12462       827
4 2020-03-12         0     0         0      0         12462       827
5 2020-03-13      5198   439       394   4365         17660      1266
6 2020-03-14      3497   175       527   2795         21157      1441

Data before applying the filter:

Italy | 43.0000 | 12.0000 | 2020-03-12 | 0 | recovered

D.

Seems that the data is not up to date

Hi Rami,

Thank you for the excellent work! I installed the package today, and tail() gave me the output below, which implies the data has not been updated through today. Could you help me figure out why? I hope it won't take much of your time. I'm not sure whether other users have the same problem. Thank you in advance!

[screenshot of tail() output]

Best,
Lili

Pkg is GREAT, but the update_datasets() function gives an error in RStudio.

Hi Rami,
Great and critically IMPORTANT pkg!!
Installed the CRAN version.

Q:
Trying to update to the latest version of the coronavirus dataset with:
> update_datasets()
but get:
Error in update_datasets() : could not find function "update_datasets"

Help, Rami! 
Am I using the wrong function syntax?
How can I easily update to the latest coronavirus dataset?
Thanks 10^6 !!

SFd99
San Francisco.
latest Rstudio and R / Ubuntu Linux 18.04 64bits

Update package locally

Hi,

I see that you have updated the data on GitHub, but it didn't update on my local computer, even though I used:

if (!require(coronavirus)) {devtools::install_github("RamiKrispin/coronavirus", upgrade = "always"); library(coronavirus)}

Please help me out.

Add new datasets

  • France province level

  • Switzerland canton level

  • US county level

Some impossible data

Hi,
There are 19 records with negative case numbers?
Regards,
Anthony

A tibble: 19 x 7

Province.State Country.Region Lat Long date cases type

1 "" Japan 36 138 2020-02-07 -20 confirmed
2 "" Korea, South 36 128 2020-03-08 -17 recovered
3 "Diamond Princess" Cruise Ship 35.4 140. 2020-03-06 -10 confirmed
4 "From Diamond Princess" Australia 35.4 140. 2020-02-29 -8 confirmed
5 "Hainan" China 19.2 110. 2020-02-15 -4 recovered
6 "Guizhou" China 26.8 107. 2020-02-06 -3 recovered
7 "Saint Barthelemy" France 17.9 -62.8 2020-03-09 -2 confirmed
8 "Washington, D.C." US 38.9 -77.0 2020-03-10 -2 confirmed
9 "Heilongjiang" China 47.9 128. 2020-02-11 -2 recovered
10 "Ningxia" China 37.3 106. 2020-02-09 -2 recovered
11 "" Japan 36 138 2020-01-23 -1 confirmed
12 "Northern Territory" Australia -12.5 131. 2020-03-06 -1 confirmed
13 "Queensland" Australia -28.0 153. 2020-01-31 -1 confirmed
14 "Queensland" Australia -28.0 153. 2020-02-02 -1 confirmed
15 "" Italy 43 12 2020-02-24 -1 recovered
16 "Diamond Princess" Cruise Ship 35.4 140. 2020-02-23 -1 recovered
17 "Guangxi" China 23.8 109. 2020-02-12 -1 recovered
18 "Hong Kong" China 22.3 114. 2020-02-21 -1 recovered
19 "Shanxi" China 37.6 112. 2020-02-03 -1 recovered

Dataset Issue

Hello, thanks a lot for the resources. At some points the number of cases is negative (<0). Is that fine? Does it have any specific meaning, or is this an error? Some examples are in the screenshot below:
[screenshot of rows with negative case counts]

Data degraded and update_datasets doesn't work

The data has degraded and is only current as of Feb 16.
Also, update_datasets() triggers the following error.

> coronavirus::update_datasets()
Error in getExportedValue(pkg, name) : 
  lazy-load database '/home/teru/R/x86_64-pc-linux-gnu-library/3.6/coronavirus/data/Rdata.rdb' is corrupt
In addition: Warning message:
In getExportedValue(pkg, name) : internal error -3 in R_decompress1

pivot_wider creates multiple rows with NAs

Hi,

First of all, thanks for this data. I'm working on a predictions project and wanted to widen the data by creating a column per type, but for some countries this creates multiple rows with NAs:

pivot_test <- coronavirus %>% filter(country=="Cameroon" & date=="2020-05-18")

pivot_test %>% pivot_wider(names_from = type, values_from = cases)

A tibble: 2 x 8

date province country lat long confirmed death recovered

1 2020-05-18 "" Cameroon 3.85 11.5 424 0 NA
2 2020-05-18 "" Cameroon 3.85 11.5 NA NA 0

This is happening for Cameroon, Canada, Czechia, Grenada, Laos, Mozambique, Syria, Tajikistan, Timor-Leste, Yemen, and China.
I suspect this could be a data issue; for example, I found that Canada does not have its recovered cases divided by province, and I'm not sure whether that could be the cause.

Thanks in advance.
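The duplicate-row behavior reported above can be reproduced without the package: widening produces one output row per distinct key, so if any id column differs between the type rows (the hypothetical lat values below), the same date splits into several rows padded with NAs. A base-R sketch using stats::reshape, assuming a subtly differing id column is indeed the cause:

```r
# Long data where the "recovered" row carries a slightly different lat,
# so it forms its own key and its own (mostly NA) wide row.
long <- data.frame(
  date  = as.Date("2020-05-18"),
  lat   = c(3.848, 3.848, 3.850),  # hypothetical coordinates
  type  = c("confirmed", "death", "recovered"),
  cases = c(424, 0, 0)
)
wide <- reshape(long, idvar = c("date", "lat"),
                timevar = "type", direction = "wide")
wide
# Two rows instead of one: the recovered count lands in its own row,
# with NA for confirmed and death. Dropping or rounding the offending
# id columns before widening collapses it back to a single row.
```

With tidyr, restricting `id_cols` in pivot_wider() to the columns that really identify an observation has the same effect.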

An idea for 0.3 - making data retrieval flexible.

As I look at the infrastructure so far, and where the whole coronavirus data retrieval and sharing effort is going, I had a thought. I'm curious what you think, @RamiKrispin and other users.

So, thus far, we've added two additional data sets. My contribution for 0.2 has been spatial (more on that in another issue) and adding the coronadatascraper data, as well as providing update_coronavirus_raw() and update_coronavirus_cds_raw(). The update_datasets() function works well to pull from QC-ed (as it were) data from a single source, and doesn't require the reprocessing of the raw scripts.

I'm noticing, though, that as more datasets are added, the package grows larger, to the point where it is generating warnings. This is only going to get worse as more data accumulates.

This has made me think. A lot. Particularly about many good rOpenSci packages - I use https://github.com/ropensci/rnoaa a lot for work. These packages provide an architecture that lets people fetch big NOAA datasets, sometimes pre-filtered (they use a database to hold them all), but at least the data doesn't live with the package (except for some small demo datasets). So I was wondering: after the dust clears and 0.2 is on CRAN, as we work toward 0.3, should we rethink our architecture a bit? Consider the following proposal.

  1. We really try and make this package the big aggregator of all SARS-Cov-2 datasets out there.

  2. The datasets are hosted, after processing, on github, and updated daily (this could be more frequent with a different hosting plan - might be worth contacting @ropensci to see how they do it) (or working with them - I have contacts!).

  3. The package has three main functions:
    - get_coronavirus_data_info() - this returns a data frame of the name of the data sets, a brief description, and link to the source.
    - get_coronavirus_data(dataset = "JHU") JHU is the default. This pulls one of the datasets listed in the above function from the QC-ed source from #2 above.
    - get_coronavirus_data_raw(dataset = "JHU") - JHU as the default again. This calls one of our update_raw scripts that pulls the data from its source, so that users who want the freshest data can get it. But we include a warning about the possible issues of doing so (datasets are in flux, etc.)

  4. get_coronavirus_data() can return a tibble or an sf object, when possible. OR - and this is radical - we could have a column listing the available return types for each dataset in get_coronavirus_data_info(). This gives us the flexibility to handle almost any dataset that comes our way.

  5. Each dataset gets a vignette with all of the requisite metadata and a simple example.

Thoughts?
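To make the proposal concrete, here is a hypothetical sketch of what the registry function in point 3 could look like. The function name comes from the proposal above, but the entries, descriptions, and column layout are purely illustrative, not an actual API:

```r
# Hypothetical registry: one row per hosted dataset, which
# get_coronavirus_data() could consult to locate the QC-ed file.
get_coronavirus_data_info <- function() {
  data.frame(
    dataset     = c("JHU", "CDS"),
    description = c("Johns Hopkins CSSE daily case counts",
                    "coronadatascraper aggregated data"),
    source      = c("https://github.com/CSSEGISandData/COVID-19",
                    "https://coronadatascraper.com"),
    stringsAsFactors = FALSE
  )
}
get_coronavirus_data_info()
```

Keeping the registry as plain data would let new datasets be added without touching the fetch functions.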

0 Case Vectors (non-essential)

Is there an easy way to generate/insert a vector for each location and date that reports 0 occurrences of each type when there is no report?

This is probably my problem more than anybody's, but any kind of data animation with ggplot, or looking at cumulative sums, gets really messy without 0-case vectors.
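Filling in explicit zero rows amounts to crossing every location with every date and left-joining the reports (tidyr::complete() does this in one call). A base-R sketch with made-up data:

```r
reports <- data.frame(
  location = c("A", "A", "B"),
  date     = as.Date(c("2020-03-01", "2020-03-03", "2020-03-01")),
  cases    = c(5, 2, 7)
)
# Every location/date combination over the observed range...
grid <- expand.grid(
  location = unique(reports$location),
  date     = seq(min(reports$date), max(reports$date), by = "day"),
  stringsAsFactors = FALSE
)
# ...left-joined to the reports; unmatched combinations become 0.
filled <- merge(grid, reports, all.x = TRUE)
filled$cases[is.na(filled$cases)] <- 0
filled[order(filled$location, filled$date), ]
```

With the coronavirus dataset itself you would cross location with date and type, but the mechanism is the same.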

Problem with update Dataset

I can't update the data for 10/4. I used the update function and reinstalled with devtools::install_github("RamiKrispin/coronavirus"), but still nothing.

Any idea?

Thanks
