mountainmath / cancensus
R wrapper for calling CensusMapper APIs
Home Page: https://mountainmath.github.io/cancensus/index.html
License: Other
Related to #126
The following code throws a parsing error now.
library(cancensus)
library(dplyr)
dataset <- "CA16"
regions_list10 <- list_census_regions(dataset) %>%
filter(level=="CMA") %>%
top_n(10,pop) %>%
as_census_region_list
csd_geo <- get_census(dataset, level = 'CSD', regions = regions_list10)
Warning: 1 parsing failure.
row col expected actual file
9 rpid a double s_1_35_35 literal data
Data looks fine in download, but the warning is unexpected.
This is from code written pre-CRAN release.
Currently, get_census()
returns variables in wide format:
cancensus::get_census(dataset='CA16', regions=list(CMA="59933"),
vectors=c("v_CA16_408","v_CA16_409","v_CA16_410"),
level='CSD') %>% glimpse()
Rows: 39
Columns: 13
$ GeoUID <chr> "5915001", …
$ Type <fct> CSD, CSD, C…
$ `Region Name` <fct> Langley (DM…
$ `Area (sq km)` <dbl> 314.76313, …
$ Population <dbl> 117285, 258…
$ Dwellings <dbl> 43720, 1226…
$ Households <dbl> 41982, 1184…
$ CD_UID <chr> "5915", "59…
$ PR_UID <chr> "59", "59",…
$ CMA_UID <chr> "59933", "5…
$ `v_CA16_408: Occupied private dwellings by structural type of dwelling data` <dbl> 41980, 1184…
$ `v_CA16_409: Single-detached house` <dbl> 21690, 2730…
$ `v_CA16_410: Apartment in a building that has five or more storeys` <dbl> 1100, 40, 6…
When working with multiple vector variables it might be preferable in some cases to have these in long format. Frequently I will call tidyr::gather()
or pivot_longer()
to collect these variables after retrieval.
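For illustration, here is a minimal sketch of that reshaping, assuming the wide result of the call above has been assigned to census_data (the object name is just for the example):
library(dplyr)
library(tidyr)
census_data <- cancensus::get_census(dataset='CA16', regions=list(CMA="59933"),
                                     vectors=c("v_CA16_408","v_CA16_409","v_CA16_410"),
                                     level='CSD')
# Collect the census vector columns into long format, one row per region-vector pair
census_long <- census_data %>%
  pivot_longer(cols = starts_with("v_CA16_"),
               names_to = "vector",
               values_to = "value")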
Tidycensus has a parameter output
that works like this:
One of "tidy" (the default) in which each row represents an enumeration unit-variable combination, or "wide" in which each row represents an enumeration unit and the variables are in the columns.
And in practice:
get_acs(geography = "county", variables = vars, state = "CA", geometry = TRUE, output = "wide")
I think this would be pretty easy to add but would require some work to make sure the various utility functions still work with data in long format.
I am thinking we should change the default geo_format to NA. I had some conversations with people on older R versions who could not get cancensus to work.
That's probably an extreme case, but I feel we should keep the entry barrier as low as possible. Also, I ran into a couple of bugs in sf/ggplot, so it might still take some time before this matures. So it might be better to let the user explicitly select which geography format they want to work with rather than selecting a default.
Thoughts?
This is a more general look at the issue identified in #87.
The process for creating the variable descriptions should work roughly like this: take each vector's label and walk up the hierarchy via the parent_vector field, concatenating the ancestor labels along the way.
The current process to traverse the label hierarchy to create concatenated variable descriptions works like this, where result is a data frame of the census vectors and labels for that census dataset:
# traverse hierarchy to add description field to variables
result$description <- result$label
list <- result
while (any(!is.na(list$parent_vector))) {
parent_list=result %>% dplyr::filter(vector %in% list$parent_vector)
result$description[!is.na(list$parent_vector)] <-
paste(parent_list$label,result$description[!is.na(list$parent_vector)],sep=", ")
list=parent_list
}
This doesn't actually work correctly: it doesn't index the positions of the parent vectors correctly, which results in misaligned concatenation.
@mountainMath and I have tried a few different approaches over the last while to fix this, but have been stumped. I've had better luck by focusing on recursive join functions but can't quite get it to work.
As an alternative, I put together an approach that uses a function to traverse the label hierarchy for any given vector, then splits the data frame row-wise and applies that function over the resulting list. It works like this:
# traversal function
parent_labels <- function(vector_list) {
base <- vector_list
n <- 0
vector_list <- result[result$vector == base$parent_vector,] %>% distinct(vector, .keep_all = TRUE)
# Recursively assemble all parents of any vector
while(n != nrow(vector_list)) {
n = nrow(vector_list)
new_list <- result %>% filter(vector %in% vector_list$parent_vector)
vector_list <- vector_list %>% rbind(new_list) %>% distinct(vector, .keep_all = TRUE)
}
# Reverse order and collapse the parent labels into a single string
labels <- vector_list[order(desc(row.names(vector_list))),]$label
labels <- paste(labels, collapse = ": ")
return(labels)
}
result_split <- split(result,seq_len(nrow(result)))
full_labels <- lapply(result_split, parent_labels)
Despite using the *apply family (generally among the faster ways to iterate in R), this process is unacceptably slow when run on the entire dataset. I've benchmarked each function call at around 0.015 seconds, but with roughly 5,000 vectors that adds up to over a minute, which is too slow to be used reliably.
IMO, this is the biggest outstanding issue left before cleaning up for CRAN. The vectors are hard to find and hard to parse without reliable descriptions, and this affects search_census_vectors as well. Does anyone have a better solution for addressing this?
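One possible direction, sketched below under the assumption that result has vector, parent_vector and label columns and that every referenced parent appears in result: replace the row-wise recursion with whole-column lookups via named vectors.
# Named-vector lookups: vector code -> parent code, and vector code -> label
parent_of <- setNames(result$parent_vector, result$vector)
label_of  <- setNames(result$label, result$vector)
description <- result$label
current <- result$parent_vector
# Walk up the hierarchy for all rows at once, prepending each ancestor's label
while (any(!is.na(current))) {
  idx <- !is.na(current)
  description[idx] <- paste(label_of[current[idx]], description[idx], sep = ": ")
  current[idx] <- parent_of[current[idx]]
}
result$description <- description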
My goal is to have this package end up on CRAN eventually. There's still work to do to clean up the vector/geo search and discovery, but I think it's pretty close otherwise. I'm not familiar with the process, so if anyone knows what we need to do in order to be approved, please share insights.
(1) In find_census_vectors(), type = c("male", "female")
sometimes works, but throws a warning. Try:
find_census_vectors("full year, full time", "CA16",
type = c("male", "female"),
query_type = "exact")
Could you maybe remove the warning? Just please don't remove the ability to return data for both genders with one command, it is very handy.
(2) And sometimes type = c("male", "female")
doesn't work:
find_census_vectors("part year and/or part time",
"CA16",
type = c("male", "female"),
query_type = "exact")
Says 'No exact matches found. Please check spelling and try again or consider using semantic or keyword search.' However, this works (note exact same query term):
find_census_vectors("part year and/or part time",
"CA16",
type = "male",
query_type = "exact")
Could you please make this consistent and allow setting type = c("male", "female")?
Right now the region parameter looks something like region='{"PR":["59"]}'. It would be nice to be able to use something more idiomatic to the R language, like
regions = list(PR = c("59"), ...)
This should be possible with jsonlite::toJSON, although we may need to ensure that the list elements are vectors to avoid making the server barf.
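A quick sketch of how the conversion could look with jsonlite, keeping list elements as vectors so single regions still serialize as JSON arrays:
library(jsonlite)
regions <- list(PR = c("59"), CMA = c("59933"))
# With the default auto_unbox = FALSE, length-one vectors stay JSON arrays
toJSON(regions)
#> {"PR":["59"],"CMA":["59933"]}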
We're currently providing people with the option to return short census vector variable names via the parameter labels = "short" when calling get_census(...). Somewhere along the way, the functionality I built in during an earlier branch to store and return the detailed variables was lost, but we're still making reference to it in the documentation.
The previous functionality was called via cancensus.labels(), which doesn't appear to be in the master code anymore. I think it's a pretty useful feature when working with short form variable names, and easy enough to implement, so I'm wondering if we removed that on purpose or by accident. If it was an accident, we can re-insert it where appropriate.
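As a purely illustrative sketch (names here are hypothetical, not the old API), the detailed labels could be stashed as an attribute on the object returned with labels = "short" and read back on demand:
# census_data stands in for the result of get_census(..., labels = "short")
attr(census_data, "census_vectors") <- data.frame(
  variable = c("v_CA16_408", "v_CA16_409"),
  label = c("Occupied private dwellings by structural type of dwelling data",
            "Single-detached house"),
  stringsAsFactors = FALSE
)
# Retrieve the detailed labels later, when the short names need translating
attr(census_data, "census_vectors")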
If anyone has a suggestion for a better way of handling the variable descriptions, feel free to comment.
There's currently very little documentation on how to use the additional data open-sourced by CMHC used here:
Can these articles be adapted into a vignette or two to include on the site and bundled with the package?
I followed the cancensus installation instructions but was having some trouble running the example commands:
Querying CensusMapper API for regions data...
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'name,geo_uid,type,population,flag,CMA_UID,CD_UID,PR_UID
Canada,01,C,35151728,,,,
Ontario,35,PR,13448494,,,,
Quebec,24,PR,8164361,,,,
+ vectors=c("v_CA16_408","v_CA16_409","v_CA16_410"),
+ level='CSD', use_cache = FALSE, geo_format = NA)
Querying CensusMapper API...
Downloading: 1.4 kB Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'GeoUID,Type,Region Name,Area (sq km),Population ,Dwellings ,Households ,v_CA16_408: Occupied private dwellings by structural type of dwelling data,v_CA16_409: Single-detached house,v_CA16_410: Apartment in a building that has five or more storeys
5915001,CSD,Langley (DM),314.76313,117285,43720,41982,41980,21690,1100
5915002,CSD,Langley (CY),10.222,25888,12264,11840,11840,2730,40
These are my sessionInfo() results:
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] cancensus_0.1.5
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 digest_0.6.12 dplyr_0.7.4 withr_2.1.0
[5] assertthat_0.2.0 R6_2.2.2 jsonlite_1.5 magrittr_1.5
[9] git2r_0.19.0 httr_1.3.1 rlang_0.1.4 curl_3.0
[13] bindrcpp_0.2 devtools_1.13.4 tools_3.4.3 glue_1.2.0
[17] compiler_3.4.3 pkgconfig_2.0.1 memoise_1.1.0 bindr_0.1
[21] tibble_1.3.4
There is a resolution suggested by @dshkol: install readr with install.packages('readr').
We are making quite heavy use of caching, and I am wondering if we should either have a separate convenience function where we can bundle that code, or use a package for caching.
Also, this code
cache_dir <- system.file("cache/", package = "cancensus")
does not seem to be doing anything, at least for me.
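As a minimal sketch of what such a convenience function might look like, assuming we keep the cancensus.cache_path option mentioned in the startup message and fall back to a temporary directory:
resolve_cache_dir <- function() {
  # Prefer a user-configured persistent cache; otherwise use a session-only directory
  cache_dir <- getOption("cancensus.cache_path",
                         default = file.path(tempdir(), "cancensus_cache"))
  if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  cache_dir
}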
search_census_vectors("income","CA16")
returns too many rows, but
search_census_vectors("median household income","CA16")
returns nothing, not even related vectors. This is because the search process takes the entire string rather than tokenizing it.
search_census_vectors <- function(searchterm, dataset, type=NA, ...) {
#to do: add caching of vector list here
veclist <- list_census_vectors(dataset, ...)
result <- veclist[grep(searchterm, veclist$label, ignore.case = TRUE),]
... lots of other code
We should take the search term, and tokenize it, and search against tokens rather than the complete term. Adding this is a to-do for next version.
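A rough sketch of what tokenized matching could look like (the helper name is hypothetical; this version requires every token to appear somewhere in the label, in any order):
search_census_vectors_tokens <- function(searchterm, dataset, ...) {
  veclist <- list_census_vectors(dataset, ...)
  tokens <- strsplit(tolower(searchterm), "\\s+")[[1]]
  # Keep rows whose label contains every token, regardless of word order
  hits <- Reduce(`&`, lapply(tokens, function(tok) {
    grepl(tok, tolower(veclist$label), fixed = TRUE)
  }))
  veclist[hits, ]
}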
The package currently uses a very unusual naming convention for functions, namely package.function_name
. I haven't seen this style in R before, but R is also famous for its inconsistent naming conventions. I didn't give it much thought for a while, but as the package draws nearer to release, I thought it might be a good idea to think more carefully about naming. I think we should be motivated to have clear, unambiguous names, of course, but beyond that things are more a matter of style.
Most other R packages in my experience do not use a prefix or suffix at all (e.g. dplyr). However, I did take a look at the most recently updated 50 or so API-backed R packages, and there the naming conventions are more varied.
Probably the closest package to this one is the censusapi package, which uses a getCensus() function for querying and a listCensusMetadata() function for inspection. The junr package, which is also somewhat similar, uses get_index(), get_data() and list_titles(), which I think are too generic -- I would prefer to stick "census" in there somewhere. (I think the dwapi package also suffers from this problem.)
Some packages use nouns; for example the hansard package has commons_divisions() and mp_vote_record(), and the HIBPwned package has account_breaches(). I like the specificity of this approach. Others use verbs, e.g. geocode() in the banR package.
My favourite approach at the moment is, as with the censusapi package, to take a specific noun and prefix it with "get", as with the GetSports() function in the pinnacle package, the get_lang() function in the languagelayeR package, the get_owf() function in the openwindfarm package, or even the get_video_details() function in the tuber package.
Of course, there are some packages that use prefixes, e.g. bold_, k9_, ref_, bea, ct_, core_ and nneo_.
There are also some that use suffixes. For example, one package uses a _sidrar suffix for its functions while adopting get_, search_, etc. function names -- so it could also be considered in the first category; another has a search_pv function (that is, it uses a suffix); another has a querySW function. There is also the approach taken by the rosettaApi package, which is to have a single, top-level api() function.
All this taken in stride, my proposal is to make the following changes:
cancensus::cancensus.load becomes cancensus::get_census
cancensus::cancensus.load_data becomes cancensus::get_census_data
cancensus::cancensus.load_geo becomes cancensus::get_census_geometry
cancensus::cancensus.list_datasets becomes list_census_datasets, although I believe @mountainMath wants to add some non-census datasets in the future.
cancensus::cancensus.list_vectors becomes cancensus::list_census_vectors
cancensus::cancensus.search_vectors becomes cancensus::search_census_vectors
cancensus::cancensus.list_regions becomes cancensus::list_census_regions
cancensus::cancensus.census_labels becomes cancensus::census_vectors, since that is what they are termed in the parameters and in the list_* functions.
All internal functions should just ditch their cancensus. prefix, since they're not exported anyway.
We can also keep aliases around for all these changes for a release or two so that no-one's code breaks.
I think that the current function parameters all have pretty good names, so I don't have any suggested changes there.
Another option would be cm as a prefix, which is the same prefix currently used by the API key when it is an environment variable -- so for example cm_get and cm_list_datasets. Yet another would be cancensus::cancensus() as the main entry point (replacing the existing load*() functions); I don't think that is a particularly intuitive approach for this package.
Requires an update for a security flaw in the rendering of an HTML page in the documentation. Not an issue with the package itself.
Could also benefit from an update to the vignette to account for newer versions of the ggplot2 + sf interaction, and other example tools like mapdeck.
There is an issue with the description field in list_census_vectors: it pulls incorrect parent vector labels.
We could add max_level=NA as an additional parameter; then e.g. max_level=1 would only get direct children and no grandchildren.
I'm getting a bunch of deprecation warnings related to usage of the dplyr and tibble packages when fetching information. Here's a reprex (note that I'm hiding my API key). This doesn't impact my work, but it is worth looking into:
library(cancensus)
#> Census data is currently stored temporarily.
#>
#> In order to speed up performance, reduce API quota usage, and reduce unnecessary network calls, please set up a persistent cache directory by setting options(cancensus.cache_path = '<path to cancensus cache directory>')
#>
#> You may add this option, together with your API key, to your .Rprofile.
my_api_key <- "<HIDDEN>"
options(cancensus.api_key = my_api_key)
all_regions <- list_census_regions("CA16")
#> Querying CensusMapper API for regions data...
all_vars <- list_census_vectors("CA16")
region_age_gender_educ <- get_census(dataset = "CA16", regions = list(C = "01"),
vectors = c("v_CA16_65", "v_CA16_66",
"v_CA16_83", "v_CA16_84", "v_CA16_101", "v_CA16_102",
"v_CA16_119", "v_CA16_120", "v_CA16_137", "v_CA16_138", "v_CA16_155", "v_CA16_156",
"v_CA16_173", "v_CA16_174", "v_CA16_191", "v_CA16_192", "v_CA16_209", "v_CA16_210",
"v_CA16_227", "v_CA16_228", "v_CA16_245", "v_CA16_246",
"v_CA16_5055", "v_CA16_5056",
"v_CA16_5058", "v_CA16_5059",
"v_CA16_5064", "v_CA16_5065",
"v_CA16_5073", "v_CA16_5074",
"v_CA16_5076", "v_CA16_5077",
"v_CA16_5079", "v_CA16_5080"),
level = "PR")
#> Census data is currently stored temporarily.
#>
#> In order to speed up performance, reduce API quota usage, and reduce unnecessary network calls, please set up a persistent cache directory by setting options(cancensus.cache_path = '<path to cancensus cache directory>')
#>
#> You may add this option, together with your API key, to your .Rprofile.
#> Querying CensusMapper API...
#> Downloading: 2.2 kB Downloading: 2.2 kB Downloading: 2.2 kB Downloading: 2.2 kB
#> Warning: `data_frame()` is deprecated as of tibble 1.1.0.
#> Please use `tibble()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Warning: `as_data_frame()` is deprecated as of tibble 2.0.0.
#> Please use `as_tibble()` instead.
#> The signature and semantics have changed, see `?as_tibble`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Warning: The `x` argument of `as_tibble.matrix()` must have column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
The package's main function currently accepts a dataset parameter that specifies which census to query, one of c("CA16", "CA11", "CA06") at the moment. It would be nice if there was some way for users to "discover" what data are available, for example with a function like
> cancensus::list_datasets()
# # A tibble: 3 x 2
# dataset description
# <chr> <chr>
# 1 CA16 2016 Census
# 2 CA11 2011 National Household Survey
# 3 CA06 2006 Census
As I said via email, I think there are two ways of doing this: either (1) adding a datasets endpoint to the API that would return this kind of information, and/or (2) just hardcoding it into the package for now.
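A minimal sketch of option (2), hardcoding the currently known datasets (column names follow the example output above):
list_datasets <- function() {
  tibble::tibble(
    dataset = c("CA16", "CA11", "CA06"),
    description = c("2016 Census", "2011 National Household Survey", "2006 Census")
  )
}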
> list_census_regions('CA01')
Querying CensusMapper API for regions data...
Error in handle_cm_status_code(response, NULL) :
could not find function "handle_cm_status_code"
> list_census_regions('ca06')
Querying CensusMapper API for regions data...
Error in handle_cm_status_code(response, NULL) :
could not find function "handle_cm_status_code"
> list_census_regions('CA06')
Querying CensusMapper API for regions data...
Error in handle_cm_status_code(response, NULL) :
could not find function "handle_cm_status_code"
Caught in a CRAN check
https://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-clang/cancensus-00check.html
find_census_vectors('after tax income', dataset = 'CA16', type = 'total', query_type = 'semantic')
Error in if (min(lev_dist_df) > 2 | is.infinite(min(lev_dist_df))) { :
missing value where TRUE/FALSE needed
Right now we set the API key and cache path via options. It might be cleaner to set these as environment variables by default instead. Nothing would change in terms of user experience: we would change the defaults in the function calls, fall back to options in case the environment variables aren't set, and update the docs and vignette accordingly.
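A sketch of the proposed lookup order, assuming an environment variable name along the lines of CM_API_KEY (the exact name is up for discussion), with the existing option as the fallback:
get_api_key <- function() {
  key <- Sys.getenv("CM_API_KEY", unset = "")
  if (!nzchar(key)) key <- getOption("cancensus.api_key", default = NULL)
  key
}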
Currently the functionality of child_census_vectors and parent_census_vectors requires as input a standard census vectors list, like the one produced by a call to list_census_vectors.
This works fine in a workflow where you are browsing through the vectors, but it adds overhead when you already know which vector to target.
Suppose you want to get the child vectors of Non-official languages. Currently you have to:
search_census_vectors("Non-official language",'CA16')
list_census_vectors('CA16') %>%
filter(vector == 'v_CA16_551') %>%
child_census_vectors()
We should allow this to work in the situation where you already know the vector code.
child_census_vectors('v_CA16_551')
Also, currently running the above does not do anything useful, but it does not throw a warning or error either, so that is problematic as well. The above call returns
[1] "v_CA16_551"
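For reference, a rough sketch of the desired shortcut (the helper name is hypothetical; assumes cancensus and dplyr are loaded):
child_census_vectors_by_code <- function(vector_code, dataset = "CA16") {
  # Look up the full vector row first, then hand it to the existing helper
  list_census_vectors(dataset) %>%
    dplyr::filter(vector %in% vector_code) %>%
    child_census_vectors()
}
child_census_vectors_by_code("v_CA16_551")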
Unfortunately I didn't think about this while testing the latest version before submitting to CRAN, but we can add it for the next release, which can come sooner than the last one did.
census_data_sp <- get_census(dataset='CA16', regions=list(CMA="59933"),
vectors=c("v_CA16_408","v_CA16_409","v_CA16_410"),
level='CSD', geo_format = "sp", use_cache = FALSE)
Querying CensusMapper API...
Downloading: 1.4 kB Querying CensusMapper API...
Downloading: 25 kB Error in ogrInfo(dsn = dsn, layer = layer, encoding = encoding, use_iconv = use_iconv, :
Cannot open layer
The "sf" option does work
Cancensus adds the detailed labels as an attribute to the data it returns, which can be read with the label_vectors function when geo_format=NA, but the attribute gets lost when geo_format="sf".
French characters are not imported correctly. Here's an example:
data <- get_census(dataset='CA16', regions=list(PR="24"),
vectors=c("pop2016"= "v_CA16_401"),
level='CSD',geo_format="sf")
data %>% filter(GeoUID == 2401023) %>% pull(name)
[1] "Les Îles-de-la-Madeleine (M�)"
Issues may not be the best place for this, but I wanted to let you know that I've started working on a comprehensive set of documentation/vignettes to detail cancensus usage. I've been in and out of town recently, but over the next week or so I intend to put together this material.
I see it working best as a three-part document:
Any suggestions?
I'm halfway through Part 1, and I'll post them here first when ready. I'm going to be out of town Friday-Monday, but should gradually work towards finishing these by end of next week.
My plan is to put this series up here as vignettes, but also on a site (good reason to finally finish that rmarkdown+blogdown+hugo site). I think a condensed version would also work well as a submission to Rbloggers, R views, as well as general social media.
In the meantime, is there anything we still need to do in order to submit to CRAN?
Replicate by running any of the following:
View(search_census_vectors("income","CA11"))
View(search_census_vectors("income","CA16"))
The description field is concatenating incorrectly. I suspect this is due to an issue in how it traverses the variable hierarchy after we switched to a recursive approach rather than the bumbling hard-coded version that was there before.
Another thing we should consider is which fields should be included in the search. Currently the search works as veclist[grep(searchterm, veclist$label, ignore.case = TRUE),], which searches solely through the label field, but not the description field. If the description field works accurately, it may be useful to include it too -- however, then we may have too many variables returned when someone searches for something general like language or income. Thoughts?
I've been working on fixing vignette builds for Travis, and it has occurred to me that the four existing vignettes (plus the README) cover pretty similar ground. It might be better to amalgamate the existing ones into a single, longer vignette (or perhaps two) with more subsections. I don't think "more is better" really applies here, since most people will probably only read one vignette.
The README file can then highlight a single cool example, and then link to the hosted vignette sections for more info.
It is also possible that some of the existing examples, particularly the thoughtful or complex ones, would make better blog posts than vignette content. It's not unusual to point to blog posts in the README file, either.
Thoughts on this? Should I try amalgamating the vignettes myself? And how should this relate to #65?
When downloading census data using get_census with the sf spatial format, one has to have already loaded the sf library; otherwise the resulting data object will be a tbl_df with a geometry variable rather than an sf class object. This can lead to all sorts of problems when doing any operations on that data, such as transformation, summarization or plotting.
Something to look at. I imagine it's bad practice to force-load the sf library with the get_census call if the user selects the sf format?
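One light-touch possibility, sketched here and not the package's current code, is to check for the sf namespace and set the class explicitly rather than attaching the library:
ensure_sf <- function(result, geo_format) {
  if (identical(geo_format, "sf")) {
    if (!requireNamespace("sf", quietly = TRUE)) {
      stop("geo_format = 'sf' requires the 'sf' package to be installed.")
    }
    # Make sure the returned object carries the sf class even if sf was never attached
    result <- sf::st_as_sf(result)
  }
  result
}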
The new C++-based JSON parser simdjson is making some waves with very impressive benchmarks compared to established parsers like jsonlite. An R implementation just hit CRAN: https://github.com/eddelbuettel/rcppsimdjson.
Perhaps risky to adopt just yet, but we should monitor it. A possible downside would be adding an Rcpp dependency, but the upside could be a significant improvement in parsing speed of long datasets for cancensus and cansim.
There seems to be an issue with caching data and geo_format. The current implementation stores files using the geo_format specified in the first call. If a subsequent call requests the same geographic data in a different geo_format the call fails.
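One possible fix, sketched with hypothetical variable names: fold the requested geo_format into the cache key so the two formats are stored separately.
# query_params and cache_dir stand in for whatever already identifies the request
# and for the existing cache location
cache_key <- digest::digest(list(query_params, geo_format = geo_format))
cache_file <- file.path(cache_dir, paste0(cache_key, ".rds"))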
This is due to the fact that if you query the API for an invalid dataset, like
https://censusmapper.ca/data_sets/CA17/place_names.csv
you get a malformed CSV file with no content instead of some sort of error. I don't know whether it is possible to check for this issue on the server side, or if we should implement validation for the dataset parameter on the client side to prevent it.
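If we go the client-side route, a minimal validation sketch (assuming list_census_datasets() returns a data frame with a dataset column) could look like this:
validate_dataset <- function(dataset) {
  known <- list_census_datasets()$dataset
  if (!dataset %in% known) {
    stop("Invalid dataset '", dataset, "'. See list_census_datasets() for available datasets.",
         call. = FALSE)
  }
  invisible(dataset)
}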
At the moment a basic load
call like the following
census_data <- cancensus::cancensus.load(
dataset='CA16', regions='{"PR":["59"]}',
vectors = c("v_CA16_408"),
level='CD'
)
returns the following sf object:
# Simple feature collection with 39 features and 18 fields
# geometry type: MULTIPOLYGON
# dimension: XY
# bbox: xmin: -123.4319 ymin: 49.00193 xmax: -122.4091 ymax: 49.57428
# epsg (SRID): 4326
# proj4string: +proj=longlat +datum=WGS84 +no_defs
# # A tibble: 39 x 19
# a t dw hh id pop name pop2 rgid rpid ruid Type `Region Name` `Area (sq km)` Population Dwellings Households
# * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 0.28012 CSD 286 285 5915825 471 Matsqui 4 498 59 5915 59933 CSD Matsqui 4 0.28012 471 286 285
# 2 0.53308 CSD 4 4 5915810 10 Musqueam 4 5 59 5915 59933 CSD Musqueam 4 0.53308 10 4 4
# 3 0.02195 CSD 25 22 5915805 54 Coquitlam 1 39 59 5915 59933 CSD Coquitlam 1 0.02195 54 25 22
# 4 1.05651 CSD 932 920 5915806 1855 Burrard Inlet 3 1472 59 5915 59933 CSD Burrard Inlet 3 1.05651 1855 932 920
# 5 0.27764 CSD 178 160 5915807 576 Mission 1 574 59 5915 59933 CSD Mission 1 0.27764 576 178 160
# 6 1.79039 CSD 1433 1309 5915808 2931 Capilano 5 2700 59 5915 59933 CSD Capilano 5 1.79039 2931 1433 1309
# 7 0.58363 CSD 16 15 5915809 49 Barnston Island 3 47 59 5915 59933 CSD Barnston Island 3 0.58363 49 16 15
# 8 0.49049 CSD 40 37 5915811 123 Seymour Creek 2 107 59 5915 59933 CSD Seymour Creek 2 0.49049 123 40 37
# 9 0.30729 CSD 15 15 5915813 40 Katzie 2 0 59 5915 59933 CSD Katzie 2 0.30729 40 15 15
# 10 1.78376 CSD 32 32 5915816 94 McMillan Island 6 68 59 5915 59933 CSD McMillan Island 6 1.78376 94 32 32
# # ... with 29 more rows, and 2 more variables: `v_CA16_408: Occupied private dwellings by structural type of dwelling data` <dbl>, geometry <simple_feature>
There are quite a number of duplicate columns here with unhelpful names; the order looks pretty arbitrary; and some of the columns clearly have the wrong type.
My suggestion for the appropriate result is the transformation (abusing dplyr
here for illustration):
df %>%
mutate_at(vars(Households, Dwellings, Population), funs(as.integer)) %>%
select(id, name, level = t, pop = Population, area = `Area (sq km)`,
dwellings = Dwellings, households = Households, starts_with("v_"))
# Simple feature collection with 39 features and 8 fields
# geometry type: MULTIPOLYGON
# dimension: XY
# bbox: xmin: -123.4319 ymin: 49.00193 xmax: -122.4091 ymax: 49.57428
# epsg (SRID): 4326
# proj4string: +proj=longlat +datum=WGS84 +no_defs
# # A tibble: 39 x 9
# id name level pop area dwellings households
# <chr> <chr> <chr> <int> <dbl> <int> <int>
# 1 5915825 Matsqui 4 CSD 471 0.28012 286 285
# 2 5915810 Musqueam 4 CSD 10 0.53308 4 4
# 3 5915805 Coquitlam 1 CSD 54 0.02195 25 22
# 4 5915806 Burrard Inlet 3 CSD 1855 1.05651 932 920
# 5 5915807 Mission 1 CSD 576 0.27764 178 160
# 6 5915808 Capilano 5 CSD 2931 1.79039 1433 1309
# 7 5915809 Barnston Island 3 CSD 49 0.58363 16 15
# 8 5915811 Seymour Creek 2 CSD 123 0.49049 40 37
# 9 5915813 Katzie 2 CSD 40 0.30729 15 15
# 10 5915816 McMillan Island 6 CSD 94 1.78376 32 32
# # ... with 29 more rows, and 2 more variables: `v_CA16_408: Occupied private dwellings by
# # structural type of dwelling data` <dbl>, geometry <simple_feature>
We should also explain these columns in the help entry.
We discussed this via email, but I think it is worth filing an issue to highlight that this is a desired feature. Something like the current list_datasets()
command for vectors and regions would be nice. They could also return tidy data frames that could be filtered, examined, etc.
I'm imagining something like the following for regions:
> cancensus::list_regions(dataset = "CA16") %>%
+   filter(level %in% c("CMA", "PR"), province == "British Columbia")
Would yield
# # A tibble: 9 x 4
# region level name province
# <chr> <chr> <chr> <chr>
# 1 59915 CMA Kelowna British Columbia
# 2 59925 CMA Kamloops British Columbia
# 3 59930 CMA Chilliwack British Columbia
# 4 59932 CMA Abbotsford - Mission British Columbia
# 5 59933 CMA Vancouver British Columbia
# 6 59935 CMA Victoria British Columbia
# 7 59938 CMA Nanaimo British Columbia
# 8 59970 CMA Prince George British Columbia
# 9 59 PR British Columbia British Columbia
For vectors it could be even simpler: list_vectors(dataset = "CA16") could return a two-column data frame with the vector and its accompanying description.
As emerged in #99, some of the current functions still emit warnings/messages when quietly = TRUE
. This should be resolved, and the vignettes can then be updated to remove warning/message suppression.
This isn't really an issue with the client R package itself, but it would need to be implemented in the package as well: we should surface more informative error messages for malformed API calls and API rate issues so they can be communicated to users.
{cancensus} does not read complex geometries like Nunavut any more; it throws the error
GDAL Error 1: geoJSON object too complex
My hunch is that the {sf} package recently switched its geoJSON driver to rely on GDAL, which has a fairly low memory limit (unless explicitly compiled with a higher one).
One way around this is to use a different geoJSON driver. For example, geojsonsf::geojson_sf is very fast and has no problem reading in large geojson.
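A sketch of that workaround, assuming the raw GeoJSON returned by the API is available as a text string:
# Parse the GeoJSON with geojsonsf instead of sf's GDAL-backed reader
geometries <- geojsonsf::geojson_sf(geojson_text)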
Let's check whether we're affected by the upcoming changes to Travis for open-source projects, and if so, adjust and use one of the recommended alternatives: https://ropensci.org/technotes/2020/11/19/moving-away-travis/
As discussed in #44, it does not seem desirable to use tmap. So we should remove it in favour of ggplot2 examples.
See related #126.
This function is duplicated by just using get_census with the geo parameters set and no variable parameters. We can soft-deprecate it so it still works in legacy code but throws a deprecation warning, and remove it from the active documentation.
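A sketch of what the soft deprecation could look like, assuming the existing signature stays roughly as is:
get_census_geometry <- function(dataset, regions, level, ...) {
  .Deprecated("get_census",
              msg = "get_census_geometry() is deprecated; use get_census(..., geo_format = 'sf') instead.")
  get_census(dataset = dataset, regions = regions, level = level,
             geo_format = "sf", ...)
}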
I noticed that when data is pulled without geo, we lose a couple of useful columns.
cma.ct <- get_census("CA16", regions=list(CMA=cma),
vectors = vectors, level = "CT",
labels = "short", geo_format = NA)
produces
$ GeoUID <chr> "8250001.01", "8250001.02", "8250001.03", "8250001.04", "8250001.05", "8250...
$ Type <fct> CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT,...
$ `Region Name` <fct> Calgary, Calgary, Calgary, Calgary, Calgary, Calgary, Calgary, Calgary, Cal...
$ `Area (sq km)` <dbl> 1.72193, 3.94892, 1.04878, 2.57535, 1.15113, 3.34596, 2.97159, 3.66137, 3.5...
$ Population <dbl> 5232, 6517, 2205, 5942, 2905, 3793, 6123, 5132, 6218, 2837, 5192, 4671, 261...
$ Dwellings <dbl> 2156, 2619, 823, 2325, 1045, 1448, 2410, 2082, 2407, 1073, 1876, 1746, 1056...
$ Households <dbl> 2104, 2571, 820, 2293, 1042, 1445, 2315, 2011, 2382, 1064, 1846, 1737, 1040...
...
Whereas
cma.ct <- get_census("CA16", regions=list(CMA=cma),
vectors = vectors, level = "CT",
labels = "short", geo_format = "sf")
produces
$ `Shape Area` <dbl> 1.88067, 0.58484, 1.41712, 400.47943, 261.22246, 8...
$ Type <fct> CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT, CT...
$ Dwellings <int> 1220, 850, 1943, 1271, 2339, 1666, 1021, 2264, 227...
$ Households <int> 1114, 835, 1928, 1192, 2271, 1508, 923, 2227, 2213...
$ GeoUID <chr> "8250055.00", "8250076.15", "8250052.09", "8250204...
$ Population <int> 3141, 2214, 5733, 3931, 6852, 4116, 2550, 6903, 65...
$ `Adjusted Population (previous Census)` <int> 2906, 2239, 5437, 3448, 6002, 3593, 2448, 5871, 64...
$ PR_UID <chr> "48", "48", "48", "48", "48", "48", "48", "48", "4...
$ CMA_UID <chr> "48825", "48825", "48825", "48825", "48825", "4882...
$ CSD_UID <chr> "4806016", "4806016", "4806016", "4806014", "48060...
$ CD_UID <chr> "4806", "4806", "4806", "4806", "4806", "4806", "4...
$ `Region Name` <fct> Calgary, Calgary, Calgary, Rocky View County, Rock...
$ `Area (sq km)` <dbl> 1.88067, 0.58484, 1.41712, 400.47943, 261.22246, 8...
...
$ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-114.1179 5..., MULTI...
I think that the additional columns for PR_UID, CMA_UID, CSD_UID, CD_UID, etc. should be retained even when no geo format is specified, as it's a common requirement to merge and aggregate at different levels of census geography even when not explicitly working with spatial data. This would also reduce load on the server by cutting down the number of unnecessary calls for spatial data.
I can think of a couple of convenience functions that we might want to add around common cancensus load calls. I can imagine that at least I will use these quite a bit, not sure about others. Should we add cancensus functions to do this?
I noticed that the list_regions call sends down data for some CMAs that don't actually have census data attached to them. This is fixed on the server now, will take a day for the server side cache to expire. So we should all make sure we refresh the cached regions in 24 hours.
Just ran into some errors trying to recompile old code. Reprex below:
library(cancensus)
library(dplyr)
dataset <- "CA16"
regions_list10 <- list_census_regions(dataset) %>%
filter(level=="CMA") %>%
top_n(10,pop) %>%
as_census_region_list
csd_geo <- get_census_geometry(dataset, level = 'Regions', regions = regions_list10)
csd_geo <- get_census_geometry(dataset, level = 'CSD', regions = regions_list10)
csd_geo <- get_census_geometry(dataset, level = 'CD', regions = regions_list10)
csd_geo <- get_census_geometry(dataset, level = 'CMA', regions = regions_list10)
Each of the get_census_geometry calls here fails with the error
Error in get_census(dataset, level, regions, vectors = c(), geo_format = geo_format, :
the level parameter must be one of 'Regions', 'PR', 'CMA', 'CD', 'CSD', 'CT', or 'DA'
In addition: Warning message:
In get_census(dataset, level, regions, vectors = c(), geo_format = geo_format, :
passing regions as a character vector is depreciated, and will be removed in future versions
Flagging this to figure out what is causing the issue, and if it's a deprecation issue, to think about making it a softer deprecation for legacy code.
The original code was written pre-CRAN release.
The labels = "short" option breaks when the sf format is selected. I added a check to fix this, but am now thinking it would be better to implement the code higher up, directly on the dat object. Not sure if that survives the potential merge with the geo data, but that can be dealt with.
Dupe of #47, but for list_vectors()
:
> cancensus.list_vectors("CA17")
## Querying CensusMapper API for vectors data...
## Downloading: 20 B Error: '' does not exist in current working directory ('...').
versus the current list_regions()
:
> cancensus.list_regions("CA17")
## Querying CensusMapper API for regions data...
## Error in cancensus.handle_status_code(response, NULL) :
## Download of Census Data failed. Invalid Dataset Parameter
If this can't be addressed easily on the server side, we should be able to implement dataset validation on the client side.
I think this would be nice to add, but since I don't own the repository I can't set it up myself. If we enable it I'm happy to work on the configuration.
Fair warning, though: R CMD check
fails pretty hard on the vignettes on my local machine, so it would likely fail at the outset on Travis as well.
There appears to be an issue where list_census_regions can, in some situations, load and keep region ids as integers instead of converting them to characters. This is not picked up when converting to a region list using as_census_region_list, which leads to a malformed API call.
This occurs for some users who do not have the readr package installed, for whom the data is loaded using the base read.csv call instead; however, there is no need to force a dependency on readr. This is straightforward to fix either by ensuring that region ids are converted to strings in the first step, or by ensuring that they are strings when sent through as_census_region_list.
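The first option is essentially a one-liner applied right after the data is read, assuming the regions table uses a region column for the ids as in the list_census_regions output:
# Guard against read.csv turning region ids into integers
regions$region <- as.character(regions$region)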
Getting a really weird error when running
get_census("CA16",regions=list(CSD="5915022"), vectors = c("med_hh_inc"="v_CA16_2397"), geo_format = 'sf')
Error message.
Error in rename.sf(., !!!vectors) : internal error: can't find `agr` columns
using sf_0.9-5