r-dbi / bigrquery
An interface to Google's BigQuery from R.
Home Page: https://bigrquery.r-dbi.org
License: Other
Initial state: Connection from bigrquery works, no service token set
Revoke access via https://myaccount.google.com/security#connectedapps
Eventually, I'm seeing the following as a result from query_exec():
Auto-refreshing stale OAuth token.
No encoding supplied: defaulting to UTF-8.
Error: Invalid Credentials
In addition: Warning message:
Unable to refresh token
9: stop(out$err$message, call. = FALSE) at request.r#67
8: process_request(req) at request.r#34
7: bq_post(url, body) at jobs.r#60
6: insert_query_job(query, project, destination_table = destination_table,
default_dataset = default_dataset) at query.r#32
5: query_exec(sql, project = project) at oauth.R#5
4: eval(expr, envir, enclos)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("~/git/R/bigrquery/oauth.R", echo = TRUE)
The second call of query_exec() finally opens the authentication page in the browser
insert_upload_job() asserts that values is a data frame.
Putting bigrquery into a shiny app running on shiny server returns the error "interactive() is not TRUE".
There is already this open issue but I think this one is slightly different for two reasons.
First, the error message returned is different from the error message given by httr::oauth_listener when the session is not interactive.
Second, I've attempted to override both base::interactive and httr::oauth_listener following directions I got on StackOverflow, and neither has done anything, though I suspect the directions might be wrong.
Per @hadley, he was looking for a way to get the result schema for a particular query prior to executing it.
Unfortunately the BigQuery API does not support this today. Per advice from @fhoffa:
http://stackoverflow.com/questions/26265362/bigrquery-error-400-on-authorization
I'm running this code:
devtools::install_github("assertthat")
devtools::install_github("hadley/bigrquery")
library(bigrquery)
project <- "four-iq" # put your projectID here
sql <- 'SELECT year, month, day, weight_pounds FROM [publicdata:samples.natality] LIMIT 5;'
data <- query_exec(sql, project = project)
Then I get to a page with this URL:
The page says:
400. That's an error.
Error: invalid_scope. Some requested scopes were invalid: {invalid=[c("https://www.googleapis.com/auth/bigquery", "https://www.googleapis.com/auth/cloud-platform")]}
Request details: response_type=code scope=c("https://www.googleapis.com/auth/bigquery", "https://www.googleapis.com/auth/cloud-platform") redirect_uri=http://localhost:18491/custom/OAuth/cred state=PEzwlZnwc1 client_id=465736758727.apps.googleusercontent.com
The URL seems to be the problem, so I tried to manually fix it; then I was able to authorize the app.
Then I get this error inside RStudio:
Error in config(token = token) : Unknown RCurl options: token
Any idea how to get bigrquery working?
At least delete and insert, to support one-off datasets for testing.
https://cloud.google.com/bigquery/docs/reference/v2/datasets
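A minimal sketch of what such helpers could look like, wrapping the datasets.insert and datasets.delete endpoints (the names insert_dataset()/delete_dataset() are assumptions, not confirmed API):
library(bigrquery)
insert_dataset("my-project", "test_scratch")   # create a throwaway dataset (assumed helper)
# ... run one-off tests against it ...
delete_dataset("my-project", "test_scratch")   # clean up afterwards (assumed helper)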
Since BigQuery seems to like using Unix timestamps in milliseconds (at least that's how we set it up, and many of the BigQuery time functions deal with msecs), and since R integers are limited to 32 bits (and it's not 1970 anymore), I'm wondering if changing the conversion in extract_data to use as.numeric would be a good idea.
Part of the problem here is due to an inconsistency with how as.integer works. The man page states that:
The answer will be ‘NA’ unless the coercion succeeds. Real
values larger in modulus than the largest integer are coerced to
‘NA’ (unlike S which gives the most extreme integer of the same
sign).
But it turns out that as.integer('1414040723000') returns 2^31 - 1 instead of NA (applying as.numeric first returns NA).
It seems like this is a problematic failure mode - if you see a column of NAs, you realize something failed, whereas a column of maxint is less clearly wrong.
Of course there are lots of workarounds - I was just surprised by this behavior.
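A small demonstration of the behaviours described above (results as reported; exact behaviour may vary by R version):
as.integer("1414040723000")               # reported to return .Machine$integer.max, not NA
as.integer(as.numeric("1414040723000"))   # NA, with a coercion warning
as.numeric("1414040723000") / 1000        # 1414040723: the msec value fits exactly in a double
as.POSIXct(1414040723, origin = "1970-01-01", tz = "UTC")  # back to a readable timestamp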
Hello, and thanks for developing bigrquery; it's been a fantastic help. I'm having an issue extracting data from a column containing comma-separated data:
mutate(df , PL_REF = as.integer(str_extract(N2_PL,"([0-9]+),[0-9]+,[0-9]+")))
Source: bigquery [201573431837:bamsurgeon]
From: variants3 [?? x 31]
Filter: call_method != "jointtest"
Error: Encountered " <STRING_LITERAL> "\'([0-9]+),[0-9]+,[0-9]+\' "" at line 1, column 368.
Was expecting:
")" ...
query invalidQuery. Encountered " <STRING_LITERAL> "\'([0-9]+),[0-9]+,[0-9]+\' "" at line 1, column 368.
Was expecting:
")" ...
I'm able to execute a working query directly in BigQuery, as follows:
SELECT INTEGER(PL_REF) AS PL_REF, INTEGER(PL_HET) AS PL_HET, POW(INTEGER(PL_ALT),10) AS PL_ALT FROM
(SELECT
REGEXP_EXTRACT([N2_PL], '([0-9]+),[0-9]+,[0-9]+') AS [PL_REF],
REGEXP_EXTRACT([N2_PL], '[0-9]+,([0-9]+),[0-9]+') AS [PL_HET],
REGEXP_EXTRACT([N2_PL], '[0-9]+,[0-9]+,([0-9]+)') AS [PL_ALT]
FROM [bamsurgeon.variants3])
LIMIT 10
Also note the power function for the third variable (PL_ALT); I was only able to execute this directly as well. From what I have seen, bigrquery doesn't recognize ** or ^, but it does appear to recognize POW, although without a successful conversion to integer it will not work.
Thanks!
> query_exec("SELECT date('2015-01-01') as a", "bigrquery-1168") %>% dput
structure(list(a = "2015-01-01"), .Names = "a", row.names = 1L, class = "data.frame")
DBItest: "^data_date_.*"
length(list_tables(project, dataset, max_results = 1001))
[1] 1000
I have over 3000 tables in the dataset and could really use all of their names.
project_id vs. project
dataset_id vs. dataset
To me, the _id suffix looks unnecessary, but stripping it everywhere might create incompatibilities.
How do I clear the wrong credential so that the query will re-prompt me to enter the right one? I tried restarting my AWS instance and my computer; neither seems to clear the cached credential. Many thanks!
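A minimal sketch, assuming a bigrquery version that caches its OAuth token in httr's default .httr-oauth file in the working directory:
file.remove(".httr-oauth")   # delete the cached token (location is an assumption)
# the next query_exec() call should reopen the browser and prompt for credentials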
The new OAuth2-related improvements in httr have changed how tokens need to be passed around; in particular, we need to pass in a token object (so it can be refreshed) instead of just a signature.
I'm starting to do this now -- opening this issue for tracking.
Should check that on Windows (which doesn't use UTF-8), standardise_csv() converts to UTF-8 on write and maintains UTF-8 encoding status on read.
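A rough sketch of the round trip in question (standardise_csv() is internal, so the file-based check below is only an assumption of what such a test could look like):
x <- "fa\u00e7ade"                       # non-ASCII string, marked as UTF-8 in R
tmp <- tempfile(fileext = ".csv")
writeLines(x, tmp, useBytes = TRUE)      # write the UTF-8 bytes as-is, even on Windows
y <- readLines(tmp, encoding = "UTF-8")  # re-mark the bytes as UTF-8 on read
identical(x, y)                          # should be TRUE on both platforms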
When result sets are especially large, BigQuery returns an error ("Response too large to return.").
This is easily handled by specifying a couple of options in the query body (allowLargeResults = TRUE and a destination table).
I'm happy to submit a PR for it, but didn't know if anyone else was running into the limitation with larger queries.
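For reference, the relevant job configuration on the jobs.insert request looks roughly like this (the project/dataset/table names are hypothetical):
body <- list(
  configuration = list(
    query = list(
      query = sql,
      allowLargeResults = TRUE,
      destinationTable = list(
        projectId = "my-project",
        datasetId = "scratch",
        tableId = "big_result"
      )
    )
  )
)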
Using devtools v1.6 produces the error below on a Mac when installing via devtools::install_github("bigquery"):
"Error in username %||% getOption("github.user") %||% stop("Unknown username.") :
Unknown username."
On a PC, it results in the following warning:
'Username parameter is deprecated. Please use /bigrquery'
Linux command-line R with bigrquery returns "Error:", i.e. an error with no details. Does anyone have any idea what I might be doing wrong? I just ran an update for all installed packages. Could this be an issue with a new package?
When connecting via the RStudio Server version of this package I've found a problem:
the first request is required to allow access to BigQuery in the browser, but after giving permission, the browser redirects to localhost:1410 with a 400 error. The R session is open on server 192.241. ..., but RStudio listens on port 1410 without any reaction to the permission that was given.
The same problem arises when I'm trying to set_access_cred.
Is this problem caused by Google API, or server configs, or R-package configs?
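One workaround worth trying on a headless or remote server: httr supports out-of-band (copy/paste) authentication instead of the localhost redirect. Whether it resolves this particular setup is untested:
options(httr_oob_default = TRUE)  # paste the auth code into the R console instead
# re-run the query; the browser now shows a code rather than redirecting to localhost:1410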
It seems I recently updated some packages in RStudio, and now I always get this error:
> library(bigrquery)
> billing_project <- "341409650721" # put your project number here
> sql <- "SELECT year, month, day, weight_pounds FROM natality LIMIT 5"
> query_exec("publicdata", "samples", sql, billing = billing_project)
Error: Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0:
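The argument order changed between bigrquery versions: the call above uses the old query_exec(project, dataset, query, billing) form. With newer versions, something along these lines should work (the fully qualified table name is an assumption):
sql <- "SELECT year, month, day, weight_pounds FROM [publicdata:samples.natality] LIMIT 5"
query_exec(sql, project = "341409650721")  # project is now the billing project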
After several tests, I found that when I upload a data.table with insert_upload_job(), if the cell in the last row and last column is empty (NA in R), I get this error:
Running query: RUNNING 21.5s
Error: Too many errors encountered. Limit is: 0.
File: 0 / Line:28 / Field:1 invalid. Bad character (ASCII 0) encountered: field starts with: <U+003c><U+003e>
invalid. Too many errors encountered. Limit is: 0.
Note: line 28 is the last line, which is an empty line.
Here are the last several rows of the data table that didn't pass (copied from the temporary .csv file generated):
"2092558",1,20000,1,1410739200,1410739200,,,
"2099381",1,25000,1,1410739200,1410739200,,,
"2100055",1,10000,1,1410739200,1410739200,,,
"2104395",1,70000,1,1410652800,1410652800,,,
"2106301",1,15000,1,1410739200,1410739200,1,15000,425
"2113586",1,20000,1,1410739200,1410739200,,,
"2117141",1,50000,1,1410739200,1410739200,,,
"2118506",1,10000,1,1410739200,1410739200,,,
Here are two examples that passed:
For this one, I added a new Date column as the last column:
"2099381",1,25000,1,1410739200,1410739200,,,,1410912000
"2100055",1,10000,1,1410739200,1410739200,,,,1410912000
"2104395",1,70000,1,1410652800,1410652800,,,,1410912000
"2106301",1,15000,1,1410739200,1410739200,1,15000,425,1410912000
"2113586",1,20000,1,1410739200,1410739200,,,,1410912000
"2117141",1,50000,1,1410739200,1410739200,,,,1410912000
"2118506",1,10000,1,1410739200,1410739200,,,,1410912000
I noted that the error came only from the last empty line, so I removed the last rows with NA values. It worked (the last empty line was still there):
"2089603",1,22000,1,1410739200,1410739200,,,
"2090306",1,25000,1,1410652800,1410652800,,,
"2092558",1,20000,1,1410739200,1410739200,,,
"2099381",1,25000,1,1410739200,1410739200,,,
"2100055",1,10000,1,1410739200,1410739200,,,
"2104395",1,70000,1,1410652800,1410652800,,,
"2106301",1,15000,1,1410739200,1410739200,1,15000,425
Large integers currently cannot be returned:
> query_exec("SELECT 10000000000", "bigrquery-1168")
f0_
1 NA
Warning message:
NAs introduced by coercion to integer range
> query_exec("SELECT 10000000000.1", "bigrquery-1168")
f0_
1 1e+10
https://github.com/rstats-db/bigrquery/pull/77/files#r50803431. Search the code for ".*/datasets/
To avoid wasting time/money
I am trying to load a CSV file in R then upload it to BigQuery.
The original data in this column is:
V.O.I "LOVASOA AROVANA"
After loading it into R, the string displays as:
[1] "V.O.I \"LOVASOA AROVANA\""
Then, when I try to upload this table to BigQuery, I get this error for that field:
File: 0 / Line:151377 / Field:3 invalid. Data between close double quote (") and field separator: field starts with: <U+003c>V.O.I \<U+003e>
Is it possible to handle the double quotes in bigrquery automatically, or should I do some workaround to solve this issue?
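For what it's worth, the usual CSV convention is to escape an embedded double quote by doubling it, which write.csv does by default (qmethod = "double"); a quick local check:
df <- data.frame(name = 'V.O.I "LOVASOA AROVANA"', stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE, qmethod = "double")
readLines(tmp)[2]  # "V.O.I ""LOVASOA AROVANA""" -- quotes doubled, not backslash-escaped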
DBItest: "roundtrip_timestamp"
1. Failure (at ~/git/R/DBItest/R/test_sql.R#566): DBItest: roundtrip_timestamp
tbl_in is not identical to tbl_out[order(tbl_out$id), ]. Differences:
Objects equal but not identical
Error: Failures detected.
In addition: Warning messages:
1: In check_tzones(target, current) : 'tzone' attributes are inconsistent
2: In check_tzones(target, current) : 'tzone' attributes are inconsistent
Execution halted
I want to perform a query and grab a large amount of data from one table, then add it to another table that already contains data using R.
So if I select a table that already exists I get "Error: Already Exists: Table ic..."
In BigQuery itself, what is happening is that we are appending to a table that already exists.
Can we add this functionality to bigrquery's query_exec? Is it available in the BigQuery API? I imagine it gets complicated when the tables are different.
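The REST API supports this via the writeDisposition field on the query configuration; later bigrquery versions expose something similar, roughly as below (the write_disposition argument name is an assumption about the version in use):
query_exec(sql, project = project,
           destination_table = "mydataset.mytable",  # hypothetical destination
           write_disposition = "WRITE_APPEND")       # append instead of failing on an existing table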
According to the function code, allowLargeResults is set to TRUE when the destination_table option is used. However, I've run a query with destination_table set and I still get the following error:
Error: Response too large to return. Consider setting allowLargeResults to true in your job configuration. For more details, see https://cloud.google.com/bigquery/querying-data#largequeryresults
responseTooLarge. Response too large to return. Consider setting allowLargeResults to true in your job configuration. For more details, see https://cloud.google.com/bigquery/querying-data#largequeryresults
Any ideas why?
The README says that we need to do
devtools::install_github("bigrquery")
We actually need to do
devtools::install_github("hadley/bigrquery")
cheers,
I am making a Shiny app that queries a database on BigQuery and am having trouble getting credentials for other users. When I run query_exec from R, it asks if I want to authorize and then opens a web tab for me to click the Allow button. When I try to run it from the Shiny server with someone else's project ID, it throws an "Access Denied" error.
How do I automate this using the get_access_cred function?
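A minimal sketch, assuming a bigrquery version that provides set_service_token() and a service-account JSON key downloaded from the Google Cloud console (the path is hypothetical):
library(bigrquery)
set_service_token("/srv/shiny-server/keys/service-account.json")  # non-interactive auth
# subsequent query_exec() calls authenticate as the service account; no browser needed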
Dear hadley,
I am trying to reproduce the example in the README.md with a dataset of mine, but authentication first completes successfully and then fails with an "Invalid access credentials have been reset" error. This happens with both Google's sample data and mine.
Any hints? Thanks!
Giacecco
> library(bigrquery)
> query_exec("<project name>", "<dataset name>", sql, billing = billing_project)
Loading required package: Rook
Loading required package: tools
Loading required package: brew
Waiting for authentication in browser...
Authentication complete.
Loading required package: rjson
Error: Invalid access credentials have been reset. Please try again.
>
First, I am a beginner with R. I get an error when I run the following from RStudio (it doesn't happen in R directly):
billing_project <- "" # put your project number here
sql <- "SELECT * FROM [table] LIMIT 5"
query_exec("[project]", "[dataset]", sql, billing = billing_project)
I get the following error after accepting the request for permission via the browser:
"localhost 1410 custom OAuth cred"
What code do I need to add to get around this issue?
has_project <- function() {
env <- Sys.getenv("BIGRQUERY_PROJECT")
if (env == "") {
message("No default billing project found. Either set `billing` argument ",
"in individual calls, or set BIGRQUERY_PROJECT environment variable")
FALSE
} else {
TRUE
}
}
if (has_project()) {
sql <- "SELECT year, month, day, weight_pounds FROM natality LIMIT 5"
query_exec("publicdata", "samples", sql)
}
# Or even make billing = default_billing()
# And then the examples could look like this:
sql <- "SELECT year, month, day, weight_pounds FROM natality LIMIT 5"
query_exec("publicdata", "samples", sql)
The BigQuery API just changed; now, when paging through the results of a tabledata list call, the pageToken is no longer returned if there are no more pages to read. This breaks the list_tabledata() function in tabledata.r.
Craig Citro and I are looking into a fix for this (shouldn't be too bad).
Hi,
The error:
"Error in rbind(deparse.level, ...) :
les nombres de colonnes des arguments ne correspondent pas" (~column number mismatch).
I ran the inner steps of the function. The line causing trouble is:
"do.call("rbind", lapply(data, as.data.frame, row.names = 1L))"
The data is a list of vectors of different lengths, so the list can't be bound to produce a data frame.
This line has no impact on the expected output of list_tables. Would it be possible to remove it?
Best
This should mostly involve providing methods for DBI generics. This will make dplyr implementation easy as it will be able to use more default methods.
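A rough sketch of what the DBI-facing interface could look like once those methods exist (the bigquery() driver constructor and its argument names are assumptions):
library(DBI)
con <- dbConnect(bigrquery::bigquery(),
                 project = "publicdata", dataset = "samples",
                 billing = "my-billing-project")
dbListTables(con)
dbGetQuery(con, "SELECT year, month, day FROM natality LIMIT 5")
dbDisconnect(con)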
Hi,
I am brand new to the BigQuery and R world. I am trying to access some sample data from BigQuery within R. I followed the examples here and some others online, but got this error while trying to execute the query inside R. Could you please share ideas on what could be wrong? Here is the query I am trying to run:
data1<-query_exec(sql,project,dataset)
If there is any tutorial on how to use bigrquery that you can share, that would be great!
Thanks!
From what I can tell, bigrquery auth doesn't store the oauth2 refresh token, just the access token. The access token is only valid for a brief amount of time, which means you need to re-authorize periodically. If you store the refresh token, you shouldn't have to re-authorize.
For server-side implementations where browser confirmation is not possible, server-side auth can be really handy, as described below, using a client ID and a keystore:
https://developers.google.com/bigquery/authorization#service-accounts-server
Hey there Hadley,
Another way to use BigQuery without going through Google's web app is by using bq.
With bq, you basically save a local file, ~/.bigquery.v2.token, which in the latest version (this doesn't seem to be documented on Google's side) is nothing more than a JSON configuration file containing your OAuth2 credentials (tokens, user-agent, etc.).
The second file of interest that one needs when using bq is ~/.bigqueryrc, which contains the default project id (the billing_id for query_exec(...)) and the path to the aforementioned token file.
Now, when those files are present, would you be interested in seeing the web authentication replaced by them? The flow would be quite simple (a rough R sketch follows):
If ~/.bigqueryrc exists:
    use its contents for authentication
    If authentication fails:
        fall back to web-browser authentication
Else:
    default to web-browser authentication
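A rough R sketch of that flow (the JSON structure of the token file is as described above; the jsonlite usage is an assumption):
token_path <- path.expand("~/.bigquery.v2.token")
if (file.exists(token_path)) {
  creds <- jsonlite::fromJSON(token_path)  # OAuth2 tokens, user-agent, etc.
  # build an httr token from creds; if authentication fails, fall back to the browser flow
} else {
  # default: existing web-browser authentication
}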
I've run the following code:
sql<-"SELECT visitorId FROM [table]"
project<-"project-name"
query_exec(sql, project)
and received the following error:
Error in token$sign : $ operator is invalid for atomic vectors
Would anyone be able to shed some light on what's causing this?
Thanks
I am using bigrquery version 0.1.0.9000. I found two issues:
Fixes #45.
Hello,
The recent update of httr to 1.0.0 switched httr to curl instead of RCurl.
This update breaks basic functionality of bigrquery, at the very least with respect to the tbl function. To confirm, please update to httr 1.0.0 and run the following:
library(dplyr)
library(bigrquery)
BILLING_PROJECT <- "<YOUR_BILLING_PROJECT>"
BILLING_DATASET <- "<DATASET_INSIDE_THAT_PROJECT>"
bq <- src_bigquery(BILLING_PROJECT, BILLING_DATASET)
# Upload the test data frame to your project
df <- data.frame(a = 1:100, b = 101:200)
httr_issue <- copy_to(bq, df, name = "httr_issue")
httr_issue2 <- bq %>% tbl("httr_issue")
which will produce the following error if httr is version 1.0.0:
Error in curl::curl_escape(names(url$query)) :
`url` must be a character vector.
and will work correctly if httr is 0.6.1.
The issue is caused by a difference in the behavior of curl escaping with NULLs for curl vs. RCurl: curl::curl_escape(NULL) results in an error, while RCurl::curlEscape(NULL) produces an empty vector.
> curl::curl_escape(NULL)
Error in curl::curl_escape(NULL) :
`url` must be a character vector.
> RCurl::curlEscape(NULL)
character(0)
As of now, the quick fix is to downgrade httr back to v0.6.1:
devtools::install_github("hadley/httr@v0.6.1")
Hi. I am not sure if this has been reported; I have not seen this error among the closed or open issues, so I will post it.
Here is sample data that produces the error. I get the same error when using my own data, but I can't post that because it is from a private Google database with billing information, etc., so please see this sample dataset. Thank you in advance!
library(bigrquery)
# trying Google sample datasets
bq_db = src_bigquery('publicdata', 'samples')
bq_db
% src: bigquery [publicdata/samples]
% tbls: github_nested, github_timeline, gsod, natality, shakespeare, trigrams, wikipedia
d = tbl(bq_db, 'github_nested')
% Error in UseMethod("sql_select") :
% no applicable method for 'sql_select' applied to an object of class "bigquery"
nrow(query_exec("SELECT * FROM [publicdata:samples.wikipedia] limit 300000", project, max_pages = Inf))
This doesn't get me 300,000 rows; it seems that max_pages = Inf doesn't work.
I have been getting this issue since today.
Hi,
I try to run Rscript file.r on my Ubuntu machine, with the following code in the first few lines of the file:
library(bigrquery)
project <- "project"
sql <- "SELECT * FROM [publicdata:samples] LIMIT 250"
query_exec(sql, project = project)
If I do this in the R shell it all works, but when running the script I get this error back:
Error in stopifnot(is(token, "refClass"), is.oauth_endpoint(endpoint), :
could not find function "is"
Calls: query_exec ... get_access_cred -> oauth2.0_token -> new_token -> stopifnot
Execution halted
How can I make it work in a script?
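A likely fix: Rscript does not attach the methods package by default, and is() lives in methods, which matches the "could not find function "is"" error. Loading it explicitly at the top of the script should help:
library(methods)    # Rscript does not attach this by default; provides is()
library(bigrquery)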