Comments (14)
@sz-cgt commented on Nov 9, 2017, 11:09 PM UTC:
A little more digging suggests that dbplyr is using select * from schema.table as alias
to retrieve column names from tables
The problem with this approach is the hive parameter hive.resultset.use.unique.column.names
, which forces result set names to be unique by pre-pending the table name or table alias to all field names when the *
operator is used.
As this field is set true by default (see the TEZ documentation and HIVE-6687, dbplyr will need to work around this somehow if it wants to effectively support Hive (which seems to be part of RStudio's plans).
from dbplyr.
@edgararuiz commented on Nov 10, 2017, 3:58 PM UTC:
Hi, what happens if we remove the select_all()
step, not sure what it adds to the example:
db <- dbConnect(odbc::odbc(), "PRD")
db %>%
tbl(dbplyr::in_schema("1001_2", "dxcg_raw"))
from dbplyr.
@sz-cgt commented on Nov 10, 2017, 7:24 PM UTC:
You mean this?
db <- dbConnect(odbc::odbc(), "PRD")
db %>%
tbl(dbplyr::in_schema("1001_2", "dxcg_raw")) %>%
select_all() %>%
show_query()
That gives select * from 1001_2.dxcg_raw
because dplyr
is making no attempt to discover the field names. Swapping to collect()
gives a tible with alias.column
field names. That's also not the use case.
To summarise:
- no field level activities prior to
collect()
works. It just gives the returned tibble weird column names - select() works because the user is supplying the list of fields.
- select_all(), select_if() and select_at() all have the same problem because they're interrogating the database to get the list of fields.
I suspect all *_all(), *_if(), *_at() functions would fail in a similar way, but I have not tried them.
from dbplyr.
@edgararuiz commented on Nov 10, 2017, 7:32 PM UTC:
Ok, can you help me with clarifying the first point in the summarize list?
The field names on a quick query I ran look fine to me:
> t <- tbl(con, "customers") %>%
+ head(10) %>%
+ collect()
>
> colnames(t)
[1] "id" "name" "email_preferences" "addresses" "orders"
from dbplyr.
@sz-cgt commented on Nov 10, 2017, 8:27 PM UTC:
Two questions/observations:
- you didn't include a schema in your example. That's the same as my second scenario in my original post
- you don't mention what database back-end you're using. As I noted originally, this is a Hive-specific problem. It doesn't occur in PostgreSQL for example
from dbplyr.
@edgararuiz commented on Nov 10, 2017, 8:37 PM UTC:
You're correct, sorry, it is a Hive connection. I just added a new DB, and a new table to that DB, here's the code and its results:
> con
<OdbcConnection> cloudera@Hive
Database: HIVE
Hive Version: 1.1.0-cdh5.7.0
>
> t <- tbl(con, in_schema("newdb", "states")) %>%
+ head(10) %>%
+ collect()
>
> colnames(t)
[1] "abbrev" "match1" "match2" "file"
from dbplyr.
@sz-cgt commented on Nov 10, 2017, 9:00 PM UTC:
what does dbGetQuery(db, "set") %>% filter(grepl(pattern = "unique.column", set))
give if you?
As I mentioned in my second post, the parameter hive.resultset.use.unique.column.names
appears to be the issue. When it's true (the default according to the Hive documentation), you get the following:
hive.resultset.use.unique.column.names
Default Value: true
Added In: Hive 0.13.0 with HIVE-6687
Make column names unique in the result set by qualifying column names with table alias if needed. Table alias will be added to column names for queries of type "select *" or if query explicitly uses table alias "select r1.x..".
from dbplyr.
@hadley commented on May 20, 2018, 2:50 PM UTC:
@edgararuiz can you please close the loop here?
from dbplyr.
@edgararuiz commented on May 22, 2018, 10:50 PM UTC:
Yes, I'll be glad to. My Hive service is down now, so I'll circle back as soon as I have somewhere to test @sz-cgt suggestion
from dbplyr.
@edgararuiz commented on May 24, 2018, 3:01 PM UTC:
@sz-cgt - here is the result:
> dbGetQuery(con, "set") %>%
+ filter(grepl(pattern = "unique.column", set))
set
1 hive.resultset.use.unique.column.names=true
Do you think it is because we're using different versions of Hive?
This is what I'm using:
> dbGetInfo(con)$db.version
[1] "1.1.0-cdh5.7.0"
This is what I see yours is (copy-pasted from you previous post):
> odbc::dbGetInfo(db)
$dbname
[1] "HIVE"
$dbms.name
[1] "Hive"
$db.version
[1] "1.2.1.2.3.4.7-4"
from dbplyr.
@sz-cgt commented on Jun 3, 2018, 3:19 PM UTC:
OK. so you have the default. That means this query select * from some_table
should give back results like the following:
some_table.foo | some_table.bar | ... |
---|---|---|
a | b | ... |
If you put an alias on the end, like this select * from some_table asdf
, you get
asdf.foo | asdf.bar | ... |
---|---|---|
a | b | ... |
Is that what happens or do you get undecorated column names?
from dbplyr.
Do we have a solution for this ticket?
I am connected to a Hive DB and would like to first lazily point to a table without having to repeat table name in subsequent queries.
system('hive --version')
Hive 2.1.1-mapr-1803
...
Tried three ways but all got me to the same error.
1-
dbSendUpdate(conn, 'SET hive.resultset.use.unique.column.names=false')
ivr.vcc.agg <- dplyr::tbl(conn, dbplyr::in_schema("ivr_calls_data", "ivr_vcc_enrich_aggregated"))
that results in
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for SELECT *
FROM ivr_calls_data.ivr_vcc_enrich_aggregated AS "zzz9"
WHERE (0 = 1) (Error while compiling statement: FAILED: ParseException line 2:49 cannot recognize input near 'AS' '"zzz9"' 'WHERE' in table source)
2-
dbSendUpdate(conn, 'SET hive.resultset.use.unique.column.names=true')
dbSendUpdate(conn, 'USE ivr_calls_data')
ivr.vcc.agg <- dplyr::tbl(conn, "ivr_vcc_enrich_aggregated")
which again results in
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for SELECT *
FROM "ivr_vcc_enrich_aggregated" AS "zzz10"
WHERE (0 = 1) (Error while compiling statement: FAILED: ParseException line 2:5 cannot recognize input near '"ivr_vcc_enrich_aggregated"' 'AS' '"zzz10"' in join source)
3- identify DB at dbConnect level
conn <- dbConnect(drv, "jdbc:hive2://<HOST>:<PORT>/ivr_calls_data;",
"<USER>", "<PASS>")
dbSendUpdate(conn, 'SET hive.resultset.use.unique.column.names=false')
ivr.vcc.agg <- dplyr::tbl(conn, "ivr_vcc_enrich_aggregated")
which yells out with the same error.
from dbplyr.
This is a very long thread. Would someone mind summarising it along with a brief reprex that shows the problem?
from dbplyr.
I've closed this issue due to lack of requested reprex. If anyone stills care about this problem, please open a new issue with a reprex.
from dbplyr.
Related Issues (20)
- Use of `as.Date()` in `filter()` returns an error in dbplyr 2.4.0 for Oracle databases HOT 1
- `rows_patch.tbl_lazy` fails when patching more than one column
- dbplyr 2.4.0 - connecting to a table via a database link no longer works HOT 3
- Using a named vector with `dplyr::select(dplyr::all_of(...))` changes the table column names HOT 1
- The `unknown` column appears after I join the 2 tables HOT 4
- Progress bar for collect() HOT 1
- dbplyr 2.4.0 - date filter doesn't work anymore, possibly due to absence of single quotes in translation HOT 5
- Helpful warning if user forgets `I()`
- Update docs to prefer `I()` over `in_schema()` etc
- Check `table_path()` approach with backends HOT 4
- Fix no visible global function definition NOTE
- Automate global variables defintition HOT 1
- Consider recording quotes in simulate class definition
- Research stricter interpolation revdep failures HOT 1
- Incorrect translation of dates with Oracle ODBC. HOT 1
- Pre-release revdeps HOT 2
- Helper to get `last_sql()`
- str_detect() and str_replace() not treating period (.) detection same in snowflake translation HOT 3
- `paste0()` incorrectly uses a `sep` argument
- Snowflake translation error: dropped filter with `anti_join` HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from dbplyr.