colearendt / tidyjson
Tidy your JSON data in R with tidyjson
License: Other
Currently getting this from R CMD check:
checking R code for possible problems ... NOTE
json_structure: no visible binding for global variable ‘level’
json_structure_arrays: no visible binding for global variable ‘type’
json_structure_arrays: no visible binding for global variable
‘document.id’
json_structure_arrays: no visible binding for global variable
‘child.id’
json_structure_arrays: no visible binding for global variable ‘level’
json_structure_arrays: no visible binding for global variable
‘parent.id’
... 9 lines ...
‘parent.id’
json_structure_objects: no visible binding for global variable ‘index’
json_structure_objects: no visible binding for global variable ‘key’
read_json: no visible global function definition for ‘tail’
should_json_structure_expand_more: no visible binding for global
variable ‘level’
Undefined global functions or variables:
child.id document.id index key level parent.id tail type
Consider adding
importFrom("utils", "tail")
to your NAMESPACE file.
I believe this can be solved by avoiding non-standard evaluation and using the `_` versions of the dplyr functions instead.
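Another conventional fix for these NOTEs (a sketch, assuming the variable list above is complete) is to register the non-standard-evaluation names with utils::globalVariables, typically in a zzz.R file, instead of rewriting every verb:

```r
# Register the NSE variable names that R CMD check flags, so the
# "no visible binding for global variable" NOTEs go away. This is an
# alternative to switching to the `_` versions of the dplyr verbs.
registered <- utils::globalVariables(
  c("child.id", "document.id", "index", "key",
    "level", "parent.id", "type"),
  package = "tidyjson"
)
# The `tail` NOTE is separate: it is fixed by adding
# importFrom("utils", "tail") to the NAMESPACE file, as the check suggests.
```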
Currently it leaves the JSON as-is:
'{"a": 1, "b": [1, 2, 3]}' %>% spread_all
#> # A tbl_json: 1 x 2 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id a
#> <chr> <int> <dbl>
#> 1 {"a":1,"b":[1,2... 1 1
Perhaps instead it should strip these away:
'{"a": 1, "b": [1, 2, 3]}' %>% spread_all
#> # A tbl_json: 1 x 2 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id a
#> <chr> <int> <dbl>
#> 1 {"b":[1,2,3]} 1 1
This makes sense since they are already captured in the tbl_json object, and it will make it easier to see that the next steps should be enter_object and then gather_array.
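The proposed behavior can be sketched at the parsed-list level with a hypothetical helper (not part of tidyjson): drop the scalar keys that spread_all has already consumed, keeping only the objects and arrays still worth exploring.

```r
# Hypothetical sketch: remove the scalar keys spread_all would have
# consumed, so only unexplored objects and arrays remain in the JSON.
drop_spread_keys <- function(x) {
  stopifnot(is.list(x))
  is_scalar <- vapply(x, function(el) is.atomic(el) && length(el) == 1,
                      logical(1))
  x[!is_scalar]
}

parsed <- list(a = 1, b = list(1, 2, 3))  # parsed '{"a": 1, "b": [1, 2, 3]}'
drop_spread_keys(parsed)
# leaves only `b`, so the remaining JSON would print as {"b":[1,2,3]}
```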
Can wrap around toJSON, but with appropriate arguments.
This fails, and there is no left_join_ method:
new <- '[1, 2, 3]' %>% gather_array("num") %>%
left_join(data_frame(num = 1:3, letters = letters[1:3]), by = "num")
expect_is(new, "tbl_json")
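Until a left_join.tbl_json method exists, here is a sketch of what one would need to do, using a hand-rolled match()-based join so x's row order, and hence the "JSON" attribute alignment, is preserved. The tbl_json structure is simulated with a plain data frame; none of this is current tidyjson code.

```r
# Sketch: a left join that keeps the tbl_json class and its "JSON"
# attribute. match() preserves x's row order, so the JSON attribute
# stays aligned with the rows.
left_join_tbl_json <- function(x, y, by) {
  json <- attr(x, "JSON")
  extra <- y[match(x[[by]], y[[by]]), setdiff(names(y), by), drop = FALSE]
  out <- cbind(as.data.frame(x), extra)
  attr(out, "JSON") <- json
  class(out) <- c("tbl_json", "data.frame")
  out
}

# Simulated tbl_json: three array elements, each with its JSON fragment
x <- data.frame(document.id = 1L, num = 1:3)
attr(x, "JSON") <- list("1", "2", "3")
class(x) <- c("tbl_json", "data.frame")

y <- data.frame(num = 1:3, letter = letters[1:3], stringsAsFactors = FALSE)
new <- left_join_tbl_json(x, y, by = "num")
```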
"[[1, 2], [1, 2]]" %>% gather_array %>% gather_array
#> Error: found duplicated column name: array.index
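One way the default column name could avoid this collision (a hypothetical helper, not current tidyjson behavior) is to suffix the default until it is unique:

```r
# Hypothetical: pick a unique default column name so that a second
# gather_array doesn't collide with the first's "array.index".
unique_column_name <- function(existing, default = "array.index") {
  name <- default
  i <- 1L
  while (name %in% existing) {
    i <- i + 1L
    name <- paste0(default, ".", i)
  }
  name
}

unique_column_name(c("document.id", "array.index"))
# "array.index.2"
```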
Last time I submitted to CRAN they complained about these; import the functions from magrittr instead.
Use the companies
dataset and visualize the structure of the JSON documents.
Counts the total number of nodes in the JSON.
Perhaps also unify their code.
Every row should correspond to a "node" in the JSON document with a unique ID, and should identify the most recent key used to access the node, its type, its length, and its parent.
Is it possible to use fromJSON to turn the JSON into lists one layer at a time, so that the JSON remains a string as you slowly unwind it?
This may be much slower, but would lead to a more natural implementation where the JSON remains a column of the data frame formatted as a character string and more easily printed.
Filter works:
companies[1:5] %>% as.tbl_json %>% filter(document.id == 1) %>% attr("JSON") %>% length
#> [1] 1
but slice does not:
companies[1:5] %>% as.tbl_json %>% slice(1) %>% attr("JSON") %>% length
#> [1] 5
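A sketch of what a slice method would need to do: subset the "JSON" attribute in step with the rows (again using a simulated tbl_json, not the real class).

```r
# Sketch: slice rows and the "JSON" attribute together so they stay in sync.
slice_tbl_json <- function(x, i) {
  json <- attr(x, "JSON")
  out <- as.data.frame(x)[i, , drop = FALSE]
  attr(out, "JSON") <- json[i]
  class(out) <- c("tbl_json", "data.frame")
  out
}

x <- data.frame(document.id = 1:5)
attr(x, "JSON") <- as.list(paste0("{\"doc\":", 1:5, "}"))
class(x) <- c("tbl_json", "data.frame")

length(attr(slice_tbl_json(x, 1), "JSON"))
# 1, matching the single remaining row
```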
Causes this not to work:
'{"x": 1}' %>% spread_values(x = jstring("x"))
#> Error in UseMethod("as.tbl_json") :
#> no applicable method for 'as.tbl_json' applied to an object of class "function"
Yet this works:
'{"x": 1}' %>% spread_values(y = jstring("x"))
#> document.id y
#> 1 1 1
Root level should be 0
Deprecate gather_keys with the .Deprecated function; see the backwards compatibility section in http://r-pkgs.had.co.nz/release.html#undefined
Should not throw a warning:
json %>% gather_array %>% gather_array
Should throw a warning:
json %>% gather_array("special") %>% gather_array("special")
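The distinction could hinge on whether the user supplied the name explicitly. A hypothetical check (default names would instead go through silent de-duplication):

```r
# Hypothetical: warn only when an explicitly supplied column name collides
# with an existing column; auto-generated defaults stay silent.
check_column_name <- function(existing, name, explicit) {
  if (explicit && name %in% existing)
    warning("column '", name, "' already exists and will be replaced")
  invisible(name)
}
```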
Should use json_structure and create an igraph object. Initial version of code is in visualization vignette.
Should do the following:
Should be a subset of the logic in json_types, and work in places like gather_factory.
'{}' %>% json_structure
#> Error: wrong result size (2), expected 0 or 1
Should cover package internals as well as purrr vignette.
The following works:
'{"key": "value"}' %>% spread_values(key = jstring("key"))
#> document.id key
#> 1 1 value
but this does not:
'{"key": "value"}' %>% spread_values(key = jstring(key))
#> Error in as_function(.f, ...) : object 'key' not found
e.g.,
json %>%
at_depth(2, `%||%`, NA) %>%
map_df(. %$% tibble(name, email_address, number_of_employees, founded_year))
This should throw an error:
'{"key": "1"}' %>% spread_values(int = jnumber("key"))
#> document.id int
#> 1 1 1
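The stricter behavior could be a type check in the extractor. This is only a sketch of the idea, not the actual jnumber implementation:

```r
# Sketch: a jnumber-style extractor that refuses non-numeric JSON values
# instead of silently coercing strings like "1".
extract_number <- function(value) {
  if (is.null(value)) return(NA_real_)
  if (!is.numeric(value))
    stop("value is of type ", class(value)[1], ", not numeric")
  as.numeric(value)
}

extract_number(1)      # returns 1
# extract_number("1")  would throw an error rather than return 1
```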
Should work like:
'{"a": 1, "b": "x", "c": true}' %>% spread_all_values
sep argument used to separate key names when objects are nested.
Maybe it makes more sense for the tbl_json object to be a tidyjson object, like tbl_df has moved to tibble?
issues %>% gather_array %>% spread_all(recursive = FALSE)
#> Error in `[.data.frame`(z, , final_columns, drop = FALSE) :
#> undefined columns selected
Ideally, would:
- spread_values(key1) instead of spread_values("key1")
- spread_values would determine type automatically (converting NULLs to NAs of the appropriate type)
- spread_values(key1, key2) could be two top-level keys, or key2 nested under key1
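The automatic typing could work like this sketch: infer the column type from the non-NULL values and coerce NULLs to a matching NA (a hypothetical helper, not tidyjson code).

```r
# Sketch: infer a column's type from its non-NULL values and convert
# NULLs to an NA of the same type before collapsing to a vector.
collapse_values <- function(values) {
  nulls <- vapply(values, is.null, logical(1))
  if (all(nulls)) return(rep(NA, length(values)))
  proto <- values[!nulls][[1]]
  na <- if (is.character(proto)) NA_character_
        else if (is.numeric(proto)) NA_real_
        else NA
  values[nulls] <- list(na)
  unlist(values)
}

collapse_values(list("a", NULL, "b"))
# c("a", NA, "b")
```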
Should:
- spread_values versus spread_all
- is_json_X functions to check inputs
E.g., from purrr https://github.com/hadley/purrr/blob/master/R/along.R
#' These functions take the idea of \code{\link{seq_along}} and generalise
#' it to creating lists (\code{list_along}) and repeating values
#' (\code{rep_along}).
Set up a new travis CI integration
This will be very confusing to users:
> companies[1:5] %>% gather_keys %>% filter(is_json_object(.)) %>% gather_keys("key2")
#> # A tbl_json: 15 x 3 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id key key2
#> <chr> <int> <chr> <chr>
#> 1 "52cdef7e4bab8b... 1 _id $oid
#> 2 [[[150,22],"ass... 1 image available_sizes
#> 3 null 1 image attribution
#> 4 "52cdef7f4bab8b... 2 _id $oid
#> 5 [[[150,38],"ass... 2 image available_sizes
#> 6 null 2 image attribution
#> 7 "52cdef7d4bab8b... 3 _id $oid
#> 8 [[[150,36],"ass... 3 image available_sizes
#> 9 null 3 image attribution
#> 10 "52cdef7d4bab8b... 4 _id $oid
#> 11 ... 4 image available_sizes
#> 12 ... 4 image attribution
#> 13 ... 5 _id $oid
#> 14 ... 5 image available_sizes
#> 15 ... 5 image attribution
Nested arrays are difficult to work with. For example,
x <- '[[1, 2], 1]' %>% gather_array %>% json_types
x
#> document.id array.index type
#> 1 1 1 array
#> 2 1 2 number
At this point, there is no way to gather the next array unless we filter on type == 'array'.
x %>% gather_array("level2")
#> Error in gather_array(., "level2") : 1 records are not arrays
x %>% filter(type == "array") %>% gather_array("level2")
#> document.id array.index type level2
#> 1 1 1 array 1
#> 2 1 1 array 2
append_values_number works, but returns NA for the array, and recursive = TRUE doesn't work through the second-level array. Further, it could be that the types are mixed.
Can be something simple.
tbl_json objects should print like tbl_df objects, except they should have an additional column at the end, titled something like attr("JSON"), that shows the first N characters of the concise JSON representation of the JSON attribute.
Something like:
document.id key attr("JSON")
----------- --- ------------
1 "a" [1, 2, 3]
2 "b" true
3 "c" {"k1": "value", "k2": [1, 2], "k3...
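The truncation itself is simple; a base-R sketch of the display helper (the width and function name are assumptions):

```r
# Sketch: truncate each JSON string to at most `width` characters for
# display, replacing the tail of long strings with "...".
truncate_json <- function(json, width = 33) {
  long <- nchar(json) > width
  json[long] <- paste0(substr(json[long], 1L, width - 3L), "...")
  json
}

truncate_json('{"k1": "value", "k2": [1, 2], "k3": [4, 5, 6]}')
# '{"k1": "value", "k2": [1, 2], ...'
```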
In the purrr vignette, we currently have to do %>% head to shorten the display.
Recreate every key tidyjson verb with purrr.
Use purrr
in tidyjson internals wherever possible.
If spread_all
generates a name that already exists in the data frame, then throw a meaningful error about the name conflict.
Same as is done in https://github.com/hadley/purrr/blob/master/R/utils.R#L1
See if you can remove all subsequent calls to library(magrittr)
You can then get rid of rbind_tbl_json
in utils.R
Using this in the purrr vignette, so should export it. Need to make it clearly different from json_types.