ramikrispin / coronavirus

The coronavirus dataset

Home Page: https://ramikrispin.github.io/coronavirus/

License: Other

R 10.93% Dockerfile 1.79% Shell 2.28% HTML 46.14% Julia 0.55% JavaScript 38.32%
covid-19 rstats dataset covid19-data covid19

coronavirus's Introduction

coronavirus

R-CMD Data Pipeline CRAN_Status_Badge lifecycle License: MIT GitHub commit Downloads

The coronavirus package provides a tidy format for the COVID-19 dataset collected by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. The dataset includes daily new confirmed and death cases from January 2020 through March 2023, and recovered cases through August 2022.

More details are available here, and a CSV version of the package dataset is available here

Data source: https://github.com/CSSEGISandData/COVID-19

Source: Centers for Disease Control and Prevention’s Public Health Image Library

Important Notes

  • As of March 10th, 2023, JHU CCSE stopped collecting and tracking new cases
  • As of August 4th, 2022, JHU CCSE stopped tracking recovery cases; please see this issue for more details
  • Negative values and/or anomalies may occur in the data for the following reasons:
    • Daily cases are calculated from the raw data, which is in cumulative format, by taking the daily difference. Some retroactive updates (such as removing false-positive cases) are not tied to the day on which they actually occurred
    • Anomalies or errors in the raw data
    • Please see this issue for more details
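The daily-difference mechanics described above can be sketched with illustrative numbers (not real data): a downward revision of the cumulative series shows up as a negative daily value.

```r
# Illustrative numbers only: a cumulative series with a retroactive
# downward revision on day 4 (e.g., false positives removed).
cumulative <- c(100, 120, 150, 140, 145)

# Daily cases are the day-over-day difference of the cumulative series;
# the revision shows up as a negative daily value.
daily <- diff(c(0, cumulative))
daily
#> [1] 100  20  30 -10   5
```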

Vignettes

Additional documentation is available in the following vignettes:

Installation

Install the CRAN version:

install.packages("coronavirus")

Install the GitHub version (refreshed on a daily basis):

# install.packages("devtools")
devtools::install_github("RamiKrispin/coronavirus")

Datasets

The package provides the following two datasets:

  • coronavirus - tidy (long) format of the JHU CCSE datasets, which includes the following columns:

    • date - The date of the observation, using Date class
    • province - Name of province/state, for countries where data is provided split across multiple provinces/states
    • country - Name of country/region
    • lat - Latitude
    • long - Longitude
    • type - An indicator for the type of cases (confirmed, death, recovered)
    • cases - Number of cases on a given date
    • uid - Country code
    • province_state - Province or state if applicable
    • iso2 - Two-letter officially assigned country code
    • iso3 - Three-letter officially assigned country code
    • code3 - UN country code
    • fips - Federal Information Processing Standards code that uniquely identifies counties within the USA
    • combined_key - Country and province (if applicable)
    • population - Country or province population
    • continent_name - Continent name
    • continent_code - Continent code
  • covid19_vaccine - a tidy (long) format of the Johns Hopkins Centers for Civic Impact global vaccination dataset by country. This dataset includes the following columns:

    • country_region - Country or region name
    • date - Data collection date in YYYY-MM-DD format
    • doses_admin - Cumulative number of doses administered. When a vaccine requires multiple doses, each one is counted independently
    • people_partially_vaccinated - Cumulative number of people who received at least one vaccine dose. When the person receives a prescribed second dose, it is not counted twice
    • people_fully_vaccinated - Cumulative number of people who received all prescribed doses necessary to be considered fully vaccinated
    • report_date_string - Data report date in YYYY-MM-DD format
    • uid - Country code
    • province_state - Province or state if applicable
    • iso2 - Two-letter officially assigned country code
    • iso3 - Three-letter officially assigned country code
    • code3 - UN country code
    • fips - Federal Information Processing Standards code that uniquely identifies counties within the USA
    • lat - Latitude
    • long - Longitude
    • combined_key - Country and province (if applicable)
    • population - Country or province population
    • continent_name - Continent name
    • continent_code - Continent code

The refresh_coronavirus_jhu function enables loading the data directly from the package repository using the Covid19R project data standard format:

covid19_df <- refresh_coronavirus_jhu()
#> Loading 2020 data
#> Loading 2021 data
#> Loading 2022 data
#> Loading 2023 data

head(covid19_df)
#>         date    location location_type location_code location_code_type
#> 1 2021-12-31 Afghanistan       country            AF         iso_3166_2
#> 2 2020-03-24 Afghanistan       country            AF         iso_3166_2
#> 3 2022-11-02 Afghanistan       country            AF         iso_3166_2
#> 4 2020-03-23 Afghanistan       country            AF         iso_3166_2
#> 5 2021-08-09 Afghanistan       country            AF         iso_3166_2
#> 6 2023-03-02 Afghanistan       country            AF         iso_3166_2
#>       data_type value      lat     long
#> 1     cases_new    28 33.93911 67.70995
#> 2 recovered_new     0 33.93911 67.70995
#> 3     cases_new    98 33.93911 67.70995
#> 4 recovered_new     0 33.93911 67.70995
#> 5    deaths_new    28 33.93911 67.70995
#> 6     cases_new    18 33.93911 67.70995

Usage

data("coronavirus")

head(coronavirus)
#>         date province country     lat      long      type cases   uid iso2 iso3
#> 1 2020-01-22  Alberta  Canada 53.9333 -116.5765 confirmed     0 12401   CA  CAN
#> 2 2020-01-23  Alberta  Canada 53.9333 -116.5765 confirmed     0 12401   CA  CAN
#> 3 2020-01-24  Alberta  Canada 53.9333 -116.5765 confirmed     0 12401   CA  CAN
#> 4 2020-01-25  Alberta  Canada 53.9333 -116.5765 confirmed     0 12401   CA  CAN
#> 5 2020-01-26  Alberta  Canada 53.9333 -116.5765 confirmed     0 12401   CA  CAN
#> 6 2020-01-27  Alberta  Canada 53.9333 -116.5765 confirmed     0 12401   CA  CAN
#>   code3    combined_key population continent_name continent_code
#> 1   124 Alberta, Canada    4413146  North America             NA
#> 2   124 Alberta, Canada    4413146  North America             NA
#> 3   124 Alberta, Canada    4413146  North America             NA
#> 4   124 Alberta, Canada    4413146  North America             NA
#> 5   124 Alberta, Canada    4413146  North America             NA
#> 6   124 Alberta, Canada    4413146  North America             NA

Summary of the total confirmed cases by country (top 20):

library(dplyr)

summary_df <- coronavirus %>% 
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  arrange(-total_cases)

summary_df %>% head(20) 
#> # A tibble: 20 × 2
#>    country        total_cases
#>    <chr>                <dbl>
#>  1 US               103802702
#>  2 India             44690738
#>  3 France            39866718
#>  4 Germany           38249060
#>  5 Brazil            37076053
#>  6 Japan             33320438
#>  7 Korea, South      30615522
#>  8 Italy             25603510
#>  9 United Kingdom    24658705
#> 10 Russia            22075858
#> 11 Turkey            17042722
#> 12 Spain             13770429
#> 13 Vietnam           11526994
#> 14 Australia         11399460
#> 15 Argentina         10044957
#> 16 Taiwan*            9970937
#> 17 Netherlands        8712835
#> 18 Iran               7572311
#> 19 Mexico             7483444
#> 20 Indonesia          6738225

Summary of new cases during the past 24 hours by country and type (as of 2023-03-09):

library(tidyr)

coronavirus %>% 
  filter(date == max(date)) %>%
  select(country, type, cases) %>%
  group_by(country, type) %>%
  summarise(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type,
              values_from = total_cases) %>%
  arrange(-confirmed)
#> # A tibble: 201 × 4
#> # Groups:   country [201]
#>    country        confirmed death recovery
#>    <chr>              <dbl> <dbl>    <dbl>
#>  1 US                 46931   590        0
#>  2 United Kingdom     28783     0        0
#>  3 Australia          13926   115        0
#>  4 Russia             12385    38        0
#>  5 Belgium            11570    39        0
#>  6 Korea, South       10335    12        0
#>  7 Japan               9834    80        0
#>  8 Germany             7829   127        0
#>  9 France              6308    11        0
#> 10 Austria             5283    21        0
#> # … with 191 more rows

Plotting the worldwide distribution of active, recovered, and death cases over time:

library(plotly)

coronavirus %>% 
  group_by(type, date) %>%
  summarise(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(date) %>%
  mutate(active = confirmed - death - recovery) %>%
  mutate(active_total = cumsum(active),
         recovered_total = cumsum(recovery),
         death_total = cumsum(death)) %>%
  plot_ly(x = ~ date,
          y = ~ active_total,
          name = 'Active',
          fillcolor = '#1f77b4',
          type = 'scatter',
          mode = 'none',
          stackgroup = 'one') %>%
  add_trace(y = ~ death_total,
            name = "Death",
            fillcolor = '#E41317') %>%
  add_trace(y = ~recovered_total,
            name = 'Recovered',
            fillcolor = 'forestgreen') %>%
  layout(title = "Distribution of Covid19 Cases Worldwide",
         legend = list(x = 0.1, y = 0.9),
         yaxis = list(title = "Number of Cases"),
         xaxis = list(title = "Source: Johns Hopkins University Center for Systems Science and Engineering"))

Plot the confirmed cases distribution by country with a treemap plot:

conf_df <- coronavirus %>% 
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  arrange(-total_cases) %>%
  mutate(parents = "Confirmed") %>%
  ungroup() 

plot_ly(data = conf_df,
        type = "treemap",
        values = ~total_cases,
        labels = ~country,
        parents = ~parents,
        domain = list(column = 0),
        name = "Confirmed",
        textinfo = "label+value+percent parent")

data(covid19_vaccine)

head(covid19_vaccine)
#>         date country_region continent_name continent_code combined_key
#> 1 2020-12-29        Austria         Europe             EU      Austria
#> 2 2020-12-29        Bahrain           Asia             AS      Bahrain
#> 3 2020-12-29        Belarus         Europe             EU      Belarus
#> 4 2020-12-29        Belgium         Europe             EU      Belgium
#> 5 2020-12-29         Canada  North America             NA       Canada
#> 6 2020-12-29          Chile  South America             SA        Chile
#>   doses_admin people_at_least_one_dose population uid iso2 iso3 code3 fips
#> 1        2123                     2123    9006400  40   AT  AUT    40 <NA>
#> 2       55014                    55014    1701583  48   BH  BHR    48 <NA>
#> 3           0                        0    9449321 112   BY  BLR   112 <NA>
#> 4         340                      340   11589616  56   BE  BEL    56 <NA>
#> 5       59079                    59078   37855702 124   CA  CAN   124 <NA>
#> 6          NA                       NA   19116209 152   CL  CHL   152 <NA>
#>        lat       long
#> 1  47.5162  14.550100
#> 2  26.0275  50.550000
#> 3  53.7098  27.953400
#> 4  50.8333   4.469936
#> 5  60.0000 -95.000000
#> 6 -35.6751 -71.543000

Taking a snapshot of the data for the most recent date available and calculating the ratio between total administered doses and population size:

df_summary <- covid19_vaccine |>
  filter(date == max(date)) |>
  select(date, country_region, doses_admin, total = people_at_least_one_dose, population, continent_name) |>
  mutate(doses_pop_ratio = doses_admin / population,
         total_pop_ratio = total / population) |>
  filter(country_region != "World", 
         !is.na(population),
         !is.na(total)) |>
  arrange(- total)

head(df_summary, 10)
#>          date country_region doses_admin      total population continent_name
#> 1  2023-03-09          China          NA 1310292000 1404676330           Asia
#> 2  2023-03-09          India          NA 1027379945 1380004385           Asia
#> 3  2023-03-09             US   672076105  269554116  329466283  North America
#> 4  2023-03-09      Indonesia   444303130  203657535  273523621           Asia
#> 5  2023-03-09         Brazil   502262440  189395212  212559409  South America
#> 6  2023-03-09       Pakistan   333759565  162219717  220892331           Asia
#> 7  2023-03-09     Bangladesh   355143411  151190373  164689383           Asia
#> 8  2023-03-09          Japan   382415648  104675948  126476458           Asia
#> 9  2023-03-09         Mexico   225063079   99071001  127792286  North America
#> 10 2023-03-09        Vietnam   266252632   90466947   97338583           Asia
#>    doses_pop_ratio total_pop_ratio
#> 1               NA       0.9328071
#> 2               NA       0.7444759
#> 3         2.039893       0.8181539
#> 4         1.624368       0.7445702
#> 5         2.362927       0.8910225
#> 6         1.510960       0.7343837
#> 7         2.156444       0.9180335
#> 8         3.023611       0.8276319
#> 9         1.761163       0.7752502
#> 10        2.735325       0.9294048

Plot of the total doses and population ratio by country:

# Setting the diagonal lines range
line_start <- 10000
line_end <- 1500 * 10 ^ 6

# Filter the data
d <- df_summary |> 
  filter(country_region != "World", 
         !is.na(population),
         !is.na(total)) 


# Replot it
p3 <- plot_ly() |>
  add_markers(x = d$population,
              y = d$total,
              text = ~ paste("Country: ", d$country_region, "<br>",
                             "Population: ", d$population, "<br>",
                             "Total Doses: ", d$total, "<br>",
                             "Ratio: ", round(d$total_pop_ratio, 2), 
                             sep = ""),
              color = d$continent_name,
              type = "scatter",
              mode = "markers") |>
  add_lines(x = c(line_start, line_end),
            y = c(line_start, line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  add_lines(x = c(line_start, line_end),
            y = c(0.5 * line_start, 0.5 * line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  
  add_lines(x = c(line_start, line_end),
            y = c(0.25 * line_start, 0.25 * line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  add_annotations(text = "1:1",
                  x = log10(line_end * 1.25),
                  y = log10(line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "1:2",
                  x = log10(line_end * 1.25),
                  y = log10(0.5 * line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "1:4",
                  x = log10(line_end * 1.25),
                  y = log10(0.25 * line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "Source: Johns Hopkins University - Centers for Civic Impact",
                  showarrow = FALSE,
                  xref = "paper",
                  yref = "paper",
                  x = -0.05, y = - 0.33) |>
  layout(title = "Covid19 Vaccine - Total Doses vs. Population Ratio (Log Scale)",
         margin = list(l = 50, r = 50, b = 90, t = 70),
         yaxis = list(title = "Number of Doses",
                      type = "log"),
         xaxis = list(title = "Population Size",
                      type = "log"),
         legend = list(x = 0.75, y = 0.05))

Dashboard

Note: the dashboard is currently under maintenance due to recent changes in the data structure. Please see this issue for more details

A supporting dashboard is available here

Data Sources

The raw data is pulled and arranged by the Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE) from the following sources:

coronavirus's People

Contributors

ramikrispin

coronavirus's Issues

Separate update and data access into different packages? 0.3 thought.

One thing that occurs to me is that, as we move forward, rectifying issues in different datasets requires a lot of reprocessing. While some folks might want the latest data scraped directly from a site, adding the raw update scripts has upped the number of dependencies a great deal. If in 0.3 we move to broader access to an array of datasets, as mentioned in #24, then perhaps it would be worth splitting the raw update scripts into a separate package to lower the dependencies? Just a thought. Curious what you think.

Error when devtools::install_github("RamiKrispin/coronavirus")

install.packages("devtools")
devtools::install_github("RamiKrispin/coronavirus")

Error message:
Installing package into ‘/databricks/spark/R/lib’ Error : Failed to install 'coronavirus' from GitHub:
(converted from warning) installation of package ‘units’ had non-zero exit status.

Expected changes in the package CRAN version 0.2.0

As many changes have occurred since the release of v0.1.0, significant changes in the data structure are expected in v0.2.0 (expected to be released to CRAN by May 15). The changes are available on the dev-v020 branch

coronavirus 0.2.0

  • Data changes:
    • coronavirus dataset - Changed the structure of the US data from March 23rd, 2020 forward. The US data is now available at an aggregated level. More information about the changes in the raw data is available in this issue
    • Changes in the column names and order:
      • Province.State changed to province
      • Country.Region changed to country
      • Lat changed to lat
      • Long changed to long
    • The covid_south_korea and covid_iran datasets that were available in the dev version were removed from the package and moved to a new package, covid19wiki, for now available only on GitHub
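Code written against the v0.1.0 column names can be migrated with a simple rename; a minimal sketch using illustrative data (not the real dataset):

```r
# Hypothetical migration helper for code written against the v0.1.0
# column names; df stands in for the old-format dataset.
old_names <- c("Province.State", "Country.Region", "Lat", "Long")
new_names <- c("province", "country", "lat", "long")

df <- data.frame(Province.State = "Hubei", Country.Region = "China",
                 Lat = 30.97, Long = 112.27)

# Rename in place, preserving any other columns.
names(df)[match(old_names, names(df))] <- new_names
names(df)
#> [1] "province" "country"  "lat"      "long"
```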

Problem with Dataset update

Hello, I have a problem with the 18/4/20 data update. I can't update the data, but your dashboard and your repo suggest that you updated the data yesterday.
Any idea?
P.S. I reinstalled the package with devtools::install_github("covid19r/coronavirus") but still nothing

coronavirus::update_datasets()
Updates are available on the coronavirus Dev version, do you want to update? n/YY
Skipping install of 'coronavirus' from a github remote, the SHA1 (f3a23eb) has not changed since last install.
Use force = TRUE to force installation
The data was refresed, please restart your session to have the new data available

The datasets update

Hi Rami;
When I run "update_datasets" I get the following error message: Error in update_datasets() : could not find function "update_datasets".
And if I want to install the package I get the following error message : Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
there is no package called ‘fs’.
What is the problem, please?

Dataset lagging behind other sources

Hi Rami Krispin:

I notice that the coronavirus dataset sometimes lags behind worldometers by days. If you relied on worldometers to update the dataset, that would be the most reliable in my opinion.

For example, the data for May 16 is missing from coronavirus. You could easily import the data from worldometers once a day and update the repository. I am using coronavirus for data older than two days and merging it with worldometers (which publishes the data for today and the day before).

Data not updated

Hi Rami,

Thanks for the excellent work. It seems that the data has not been updated for the last two days.

Thanks
Shubhram

" Azerbaijan" Has Space in the Name

" Azerbaijan" has an initial space in the name that should not exist.

> coronavirus$Country.Region %>% unique 
 [1] "Japan"                "South Korea"          "Thailand"             "Mainland China"       "Macau"               
 [6] "US"                   "Taiwan"               "Singapore"            "Vietnam"              "Hong Kong"           
[11] "France"               "Malaysia"             "Nepal"                "Australia"            "Canada"              
[16] "Cambodia"             "Germany"              "Sri Lanka"            "Finland"              "United Arab Emirates"
[21] "India"                "Philippines"          "Italy"                "Russia"               "Sweden"              
[26] "UK"                   "Spain"                "Belgium"              "Others"               "Egypt"               
[31] "Iran"                 "Israel"               "Lebanon"              "Afghanistan"          "Bahrain"             
[36] "Iraq"                 "Kuwait"               "Oman"                 "Algeria"              "Austria"             
[41] "Croatia"              "Switzerland"          "Brazil"               "Georgia"              "Greece"              
[46] "North Macedonia"      "Norway"               "Pakistan"             "Romania"              "Denmark"             
[51] "Estonia"              "Netherlands"          "San Marino"           " Azerbaijan"          "Belarus"             
[56] "Iceland"              "Lithuania"            "Mexico"               "New Zealand"          "Nigeria"             
[61] "North Ireland" 
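The real fix belongs in the raw-data processing, but until then a possible client-side workaround, sketched on an illustrative vector, is to strip the surrounding whitespace with trimws():

```r
# Illustrative vector reproducing the problem: one name carries a
# leading space, which splits it from its properly-named twin.
countries <- c("Japan", " Azerbaijan", "US")
cleaned <- trimws(countries)
cleaned
#> [1] "Japan"      "Azerbaijan" "US"
```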

Negative values were found in the package

This package is really cool! Thanks for putting them together!

I found some negative values in this package. I checked the raw data source, and didn't see negative values there. I guess some errors might be introduced during the data processing?

> library(coronavirus)
> data("coronavirus")
> coronavirus %>% arrange(cases) %>% head(20)
# A tibble: 20 x 7
   Province.State                         Country.Region   Lat   Long date       cases type     
   <chr>                                  <chr>          <dbl>  <dbl> <date>     <int> <chr>    
 1 "Omaha, NE (From Diamond Princess)"    US              41.3  -96.0 2020-02-24   -11 confirmed
 2 "Diamond Princess cruise ship"         Others          35.4  140.  2020-03-06   -10 confirmed
 3 "From Diamond Princess"                Australia       35.4  140.  2020-02-29    -8 confirmed
 4 "Travis, CA (From Diamond Princess)"   US              38.3 -122.  2020-02-24    -5 confirmed
 5 "New York County, NY"                  US              40.7  -74.0 2020-03-07    -5 confirmed
 6 "Hainan"                               Mainland China  19.2  110.  2020-02-15    -4 recovered
 7 "Guizhou"                              Mainland China  26.8  107.  2020-02-06    -3 recovered
 8 "Ningxia"                              Mainland China  37.3  106.  2020-02-09    -2 recovered
 9 "Heilongjiang"                         Mainland China  47.9  128.  2020-02-11    -2 recovered
10 "Lackland, TX (From Diamond Princess)" US              29.4  -98.6 2020-02-24    -2 confirmed
11 ""                                     Japan           36    138   2020-01-23    -1 confirmed
12 "Queensland"                           Australia      -28.0  153.  2020-01-31    -1 confirmed
13 "Queensland"                           Australia      -28.0  153.  2020-02-02    -1 confirmed
14 "Shanxi"                               Mainland China  37.6  112.  2020-02-03    -1 recovered
15 "Guangxi"                              Mainland China  23.8  109.  2020-02-12    -1 recovered
16 "Hong Kong"                            Hong Kong       22.3  114.  2020-02-21    -1 recovered
17 "Diamond Princess cruise ship"         Others          35.4  140.  2020-02-23    -1 recovered
18 ""                                     Italy           43     12   2020-02-24    -1 recovered
19 "Northern Territory"                   Australia      -12.5  131.  2020-03-06    -1 confirmed
20 ""                                     South Korea     36    128   2020-01-22     1 confirmed

Changes in the structure of the data

As per changes in the raw data, as of March 23rd some major changes are taking place in the format of the data:

  • The US data is now aggregated; the state- and county-level data has been removed from this series
  • The recovered cases were removed from the data

Adding a country filter to the dashboard

Congrats on this visualization, it's the best I've seen!

Will it be possible to add a country filter to the dashboard in the Summary and Trends tab?

I want to see how fast the virus is spreading daily through certain European countries and it's not currently possible as the data is only aggregated.

Error in Hubei data for 3/11/2020

Hi,

It looks like there's an error in your dataset for the confirmed cases entry for Hubei for 3/11/2020 -- the total cumulative cases are listed (67773) instead of the new cases for that day (13), although all other entries for other location-days appear to be reporting new cases for that day (based on comparing with the raw .csv's from Johns Hopkins).

Thanks for the package!
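One way to flag this kind of anomaly (a cumulative total slipping into a daily series) is to mark days whose value exceeds the running total of all prior days; a sketch on illustrative numbers resembling the reported values:

```r
# Illustrative daily series: a cumulative total (67773) slipped into
# the daily column on day 4, as described in the report above.
daily <- c(1800, 977, 2313, 67773, 13)

# Flag days whose value exceeds the running total of all prior days --
# impossible for a genuine daily count once the series is established.
prior_total <- cumsum(c(0, head(daily, -1)))
suspicious <- daily > prior_total & prior_total > 0
which(suspicious)
#> [1] 4
```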

Problem with update data

I can't update the data since 6/4. I use coronavirus::update_datasets(silence = TRUE) but it says "no update available"
Any help?
Thanks

When I pull data("coronavirus") it only brings 1497 cases

Hello,

I'm from Chile, and I noticed that when I run library(coronavirus); data("coronavirus"), even in the devtools version, I don't get information for Chile, only for Australia, Belgium, Cambodia, Canada, Egypt, Finland, France, Germany, Hong Kong, India, Italy, Japan, Macau, Mainland China, Malaysia, etc., and only up to February 16, 2020.

Connecting through Power BI

Hello, first-time GitHubber lol, full-time BI-oneer. I'm looking to create some dashboards from this dataset. I'm using https://github.com/javierluraschi/covid as the repo details, but I'm being asked for an owner. I have put javierluraschi but I'm being told that I do not have access.
any tips?
Kindest Regards
Jimmy

Data structure Changes / dataset updates

Rami:

Why are we having all these data changes?

I can rationalize that the recovered cases might be inaccurate; maybe hospitals are overburdened with saving patients and not able to collect the data.

But I don't understand why the state and province data was removed.

  1. Are you working on getting the state / province data back into the coronavirus dataset?
  2. Will the county / city data be available with Lat / Long data?
  3. Is there an issue with the daily updates to the datasets? I had issues getting the data on the morning of March 29 (6 am EST [+ 4 GMT] )?

Recovered + Active cases

How can we get the recovered and active cases?
Are there going to be any changes to the coronavirus package, including active and recovered columns?
Or is it going to stay as it is?

Problem with update_dataset

Hi Rami,

I upgraded to R 4.0.0 recently. Since then I'm getting the error below with update_dataset(). I removed the coronavirus package and re-installed it several times but no luck.

> update_dataset()
Updates are available on the coronavirus Dev version, do you want to update? n/YY
Downloading GitHub repo RamiKrispin/coronavirus@master
√ checking for file 'C:\Users\31672\AppData\Local\Temp\Rtmp6dHYtb\remotesef875d26408\RamiKrispin-coronavirus-e77e858/DESCRIPTION' (405ms)

  • preparing 'coronavirus': (2s)
    √ checking DESCRIPTION meta-information ...
  • checking for LF line-endings in source and make files and shell scripts
  • checking for empty or unneeded directories
  • building 'coronavirus_0.2.0.tar.gz'

Caught an warning!
<simpleWarning: package ‘coronavirus’ is in use and will not be installed>

US Data by State and County

  1. What is the status of the US data and county information?

  2. What Package / Package location?

  3. Will the data be added to the coronavirus package or to another package?

  4. Will there be history information?

  5. I also want to build a US / state / county / city dashboard, so I need lat and long data.

  6. I see links to a package for US
    a.
    #22

install.packages("covid19us")
devtools::install_github("aedobbyn/covid19us")

b. Is this package a separate effort, or are both of you (Rami and the author, Amanda Dobbyn) working on it together?

Thanks again for all the work you are doing.

Kenney

Error in as.Date.default(df4$date) : do not know how to convert 'df4$date' to class “Date”

From a fresh install today:

> library(coronavirus)
> update_datasets()
[1] "The coronavirus data set is up-to-date"
Error in as.Date.default(df4$date) : 
  do not know how to convert 'df4$date' to class “Date”

sessionInfo:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] coronavirus_0.1.0.9002

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3                  rstudioapi_0.10            
 [3] googleAnalyticsR_0.7.0.9000 magrittr_1.5               
 [5] tidyselect_1.0.0            R6_2.4.1                   
 [7] rlang_0.4.4                 fansi_0.4.1                
 [9] httr_1.4.1                  dplyr_0.8.4                
[11] tools_3.6.1                 packrat_0.4.9-3            
[13] utf8_1.1.4                  cli_2.0.1                  
[15] googleAuthR_1.1.1.9000      cranlogs_2.1.0             
[17] assertthat_0.2.1            digest_0.6.23              
[19] tibble_2.1.3                gargle_0.4.0               
[21] lifecycle_0.1.0             crayon_1.3.4               
[23] purrr_0.3.3                 tidyr_1.0.2                
[25] vctrs_0.2.2                 fs_1.3.1                   
[27] curl_4.3                    memoise_1.1.0              
[29] glue_1.3.1                  compiler_3.6.1             
[31] pillar_1.4.3                jsonlite_1.6.1             
[33] pkgconfig_2.0.3 

License ?

Hello, I was wondering if you were thinking about putting a license on this dataset. Can I use it for my own open source project ?

The number of confirmed cases in Spain on 2020-04-24 is a negative number

In the CRAN version of the coronavirus package the number of confirmed cases in Spain on 2020-04-24 is -10034, which seems like a mistake.

To visualize the number of confirmed cases in Spain I used the following
coronavirus %>% filter(country=="Spain" & type=="confirmed") %>% ggplot(aes(date, cases)) + geom_line()

To identify the specific data point I used the following
coronavirus[which.min(coronavirus$cases),]

Negative case values

Is there a significance to negative case values? Or is it a bug? I didn't see anything in the introduction about it.

Covid-19

This looks great and I'm going to use this for a stats class I teach! There are a couple of places that say "Corvid 19" instead of Covid-19. Corvid is the family of birds that includes crows and ravens :-)

Country name in standard format

Hi,

Could you please use a standard format for country names? When I plot this data on a spatial map, some countries' data are not plotted, e.g. Iran, because in your dataset Iran is named "Iran (Islamic Republic of)".

Thanks

Bug in Italian data ?

There might be a bug in the data for the 12th of March.
When I filter by country, I don't see any new cases that day in Italy or France.

This is the filtered data for Italy:

  date       confirmed death recovered active confirmed_cum death_cum
  <date>         <int> <int>     <int>  <int>         <int>     <int>
1 2020-03-09      1797    97       102   1598          9172       463
2 2020-03-10       977   168         0    809         10149       631
3 2020-03-11      2313   196       321   1796         12462       827
4 2020-03-12         0     0         0      0         12462       827
5 2020-03-13      5198   439       394   4365         17660      1266
6 2020-03-14      3497   175       527   2795         21157      1441

Data before applying the filter:

Italy | 43.0000 | 12.0000 | 2020-03-12 | 0 | recovered

D.

Seems that the data is not up to date

Hi Rami,

Thank you for the excellent work! I installed the package today, and tail() gave me the output below, which implies the data has not been updated through today. Could you help me figure out why? I hope it won't take much of your time. I'm not sure whether other users have the same problem. Thank you in advance!

[screenshot of tail() output]

Best,
Lili

Pkg is GREAT, but the update_datasets() function gives an error in RStudio.

Hi Rami,
Great and critically IMPORTANT pkg!!
Installed the CRAN version.

Q:
Trying to update to the latest version of the coronavirus dataset with:
> update_datasets()
but get:
Error in update_datasets() : could not find function "update_datasets"

Help, Rami! 
Am I using the wrong function syntax?
How can I easily update to the latest coronavirus dataset?
Thanks 10^6 !!

SFd99
San Francisco.
latest Rstudio and R / Ubuntu Linux 18.04 64bits

Update package locally

Hi,

I see that you have updated the data on GitHub, but it didn't update on my local computer, even though I used:

if (!require(coronavirus)) {devtools::install_github("RamiKrispin/coronavirus", upgrade = "always"); library(coronavirus)}

Please help me out.

Add new datasets

  • France province level

  • Switzerland canton level

  • US county level

Some impossible data

Hi,
There are 19 records with negative case numbers?
Regards,
Anthony

A tibble: 19 x 7

Province.State Country.Region Lat Long date cases type

1 "" Japan 36 138 2020-02-07 -20 confirmed
2 "" Korea, South 36 128 2020-03-08 -17 recovered
3 "Diamond Princess" Cruise Ship 35.4 140. 2020-03-06 -10 confirmed
4 "From Diamond Princess" Australia 35.4 140. 2020-02-29 -8 confirmed
5 "Hainan" China 19.2 110. 2020-02-15 -4 recovered
6 "Guizhou" China 26.8 107. 2020-02-06 -3 recovered
7 "Saint Barthelemy" France 17.9 -62.8 2020-03-09 -2 confirmed
8 "Washington, D.C." US 38.9 -77.0 2020-03-10 -2 confirmed
9 "Heilongjiang" China 47.9 128. 2020-02-11 -2 recovered
10 "Ningxia" China 37.3 106. 2020-02-09 -2 recovered
11 "" Japan 36 138 2020-01-23 -1 confirmed
12 "Northern Territory" Australia -12.5 131. 2020-03-06 -1 confirmed
13 "Queensland" Australia -28.0 153. 2020-01-31 -1 confirmed
14 "Queensland" Australia -28.0 153. 2020-02-02 -1 confirmed
15 "" Italy 43 12 2020-02-24 -1 recovered
16 "Diamond Princess" Cruise Ship 35.4 140. 2020-02-23 -1 recovered
17 "Guangxi" China 23.8 109. 2020-02-12 -1 recovered
18 "Hong Kong" China 22.3 114. 2020-02-21 -1 recovered
19 "Shanxi" China 37.6 112. 2020-02-03 -1 recovered

Dataset Issue

Hello, thanks a lot for the resources. At some points the number of cases is negative (<0). Is that fine? Does it have any specific meaning, or is this an error? Some examples are in the screenshot below:
[screenshot of rows with negative case counts]

Data degraded and update_datasets doesn't work

The data has degraded and is only current as of Feb 16.
Also, update_datasets() triggers the following error.

> coronavirus::update_datasets()
Error in getExportedValue(pkg, name) : 
  lazy-load database '/home/teru/R/x86_64-pc-linux-gnu-library/3.6/coronavirus/data/Rdata.rdb' is corrupt
In addition: Warning message:
In getExportedValue(pkg, name) : internal error -3 in R_decompress1

pivot_wider creates multiple rows with NAs

Hi,

First of all, thanks for this data. I'm working on a predictions project and wanted to widen the data by creating a column per type, but for some countries this creates multiple rows with NAs:

pivot_test <- coronavirus %>% filter(country=="Cameroon" & date=="2020-05-18")

pivot_test %>% pivot_wider(names_from = type, values_from = cases)

A tibble: 2 x 8

date province country lat long confirmed death recovered

1 2020-05-18 "" Cameroon 3.85 11.5 424 0 NA
2 2020-05-18 "" Cameroon 3.85 11.5 NA NA 0

This is happening for Cameroon, Canada, Czechia, Grenada, Laos, Mozambique, Syria, Tajikistan, Timor-Leste, Yemen, and China.
I suspect this could be a data issue; for example, I found that Canada does not have its recovered cases divided by province, and I'm not sure whether that could be the cause.

Thanks in advance.
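The duplicate-row behavior reported above can be reproduced without the package: widening produces one output row per distinct key, so if any id column differs between the type rows (the hypothetical lat values below), the same date splits into several rows padded with NAs. A base-R sketch using stats::reshape, assuming a subtly differing id column is indeed the cause:

```r
# Long data where the "recovered" row carries a slightly different lat,
# so it forms its own key and its own (mostly NA) wide row.
long <- data.frame(
  date  = as.Date("2020-05-18"),
  lat   = c(3.848, 3.848, 3.850),  # hypothetical coordinates
  type  = c("confirmed", "death", "recovered"),
  cases = c(424, 0, 0)
)
wide <- reshape(long, idvar = c("date", "lat"),
                timevar = "type", direction = "wide")
wide
# Two rows instead of one: the recovered count lands in its own row,
# with NA for confirmed and death. Dropping or rounding the offending
# id columns before widening collapses it back to a single row.
```

With tidyr, restricting `id_cols` in pivot_wider() to the columns that really identify an observation has the same effect.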

An idea for 0.3 - making data retrieval flexible.

As I look at the infrastructure so far, and where the whole coronavirus data retrieval and sharing effort is going, I had a thought. I'm curious what you think, @RamiKrispin and other users.

So, thus far, we've added two additional data sets. My contribution for 0.2 has been spatial (more on that in another issue) and adding the coronadatascraper data, as well as providing update_coronavirus_raw() and update_coronavirus_cds_raw(). The update_datasets() function works well to pull from QC-ed (as it were) data from a single source, and doesn't require the reprocessing of the raw scripts.

I'm noticing, though, that as more datasets are added, the package grows larger, to the point where it is generating warnings. This is only going to get worse as more data accumulates.

This has made me think. A lot. Particularly about many good rOpenSci packages - I use https://github.com/ropensci/rnoaa a lot for work. These packages provide an architecture that lets people fetch big NOAA datasets, sometimes pre-filtered (they use a database to hold them all), but at least the data doesn't live with the package (except for some small demo datasets). So I was wondering: after the dust clears and 0.2 is on CRAN, as we work toward 0.3, should we rethink our architecture a bit? Consider the following proposal.

  1. We really try and make this package the big aggregator of all SARS-Cov-2 datasets out there.

  2. The datasets are hosted, after processing, on github, and updated daily (this could be more frequent with a different hosting plan - might be worth contacting @ropensci to see how they do it) (or working with them - I have contacts!).

  3. The package has three main functions:
    - get_coronavirus_data_info() - this returns a data frame of the name of the data sets, a brief description, and link to the source.
    - get_coronavirus_data(dataset = "JHU") JHU is the default. This pulls one of the datasets listed in the above function from the QC-ed source from #2 above.
    - get_coronavirus_data_raw(dataset = "JHU") - JHU as the default again. This calls one of our update_raw scripts that pulls the data from its source, so that users who want the freshest data can get it. But we include a warning about the possible issues of doing so (datasets are in flux, etc.)

  4. get_coronavirus_data() can return a tibble or an sf object, when possible. OR - and this is radical - we could have a column listing the available return types for each dataset in get_coronavirus_data_info(). This gives us the flexibility to handle almost any dataset that comes our way.

  5. Each dataset gets a vignette with all of the requisite metadata and a simple example.

Thoughts?
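To make the proposal concrete, here is a hypothetical sketch of what the registry function in point 3 could look like. The function name comes from the proposal above, but the entries, descriptions, and column layout are purely illustrative, not an actual API:

```r
# Hypothetical registry: one row per hosted dataset, which
# get_coronavirus_data() could consult to locate the QC-ed file.
get_coronavirus_data_info <- function() {
  data.frame(
    dataset     = c("JHU", "CDS"),
    description = c("Johns Hopkins CSSE daily case counts",
                    "coronadatascraper aggregated data"),
    source      = c("https://github.com/CSSEGISandData/COVID-19",
                    "https://coronadatascraper.com"),
    stringsAsFactors = FALSE
  )
}
get_coronavirus_data_info()
```

Keeping the registry as plain data would let new datasets be added without touching the fetch functions.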

0 Case Vectors (non-essential)

Is there an easy way to generate/insert a vector for each location and date that reports 0 occurrences of each type when there is no report?

This is probably my problem more than anybody's, but any kind of data animation with ggplot, or looking at cumulative sums, gets really messy without 0-case vectors.
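Filling in explicit zero rows amounts to crossing every location with every date and left-joining the reports (tidyr::complete() does this in one call). A base-R sketch with made-up data:

```r
reports <- data.frame(
  location = c("A", "A", "B"),
  date     = as.Date(c("2020-03-01", "2020-03-03", "2020-03-01")),
  cases    = c(5, 2, 7)
)
# Every location/date combination over the observed range...
grid <- expand.grid(
  location = unique(reports$location),
  date     = seq(min(reports$date), max(reports$date), by = "day"),
  stringsAsFactors = FALSE
)
# ...left-joined to the reports; unmatched combinations become 0.
filled <- merge(grid, reports, all.x = TRUE)
filled$cases[is.na(filled$cases)] <- 0
filled[order(filled$location, filled$date), ]
```

With the coronavirus dataset itself you would cross location with date and type, but the mechanism is the same.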

Problem with update Dataset

I can't update the data for 10/4. I used the update function and reinstalled with devtools::install_github("RamiKrispin/coronavirus"), but still nothing.

Any idea?

Thanks
