<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url=

dbplyr on HIVE produces invalid SQL when using select_*() or select helpers on an in_schema() object about dbplyr HOT 14 CLOSED

tidyverse commented on May 29, 2024

dbplyr on HIVE produces invalid SQL when using select_*() or select helpers on an in_schema() object

from dbplyr.

Comments (14)

commented on May 29, 2024

@sz-cgt commented on Nov 9, 2017, 11:09 PM UTC:

A little more digging suggests that dbplyr is using select * from schema.table as alias to retrieve column names from tables

The problem with this approach is the hive parameter hive.resultset.use.unique.column.names, which forces result set names to be unique by pre-pending the table name or table alias to all field names when the * operator is used.

As this field is set true by default (see the TEZ documentation and HIVE-6687, dbplyr will need to work around this somehow if it wants to effectively support Hive (which seems to be part of RStudio's plans).

from dbplyr.

commented on May 29, 2024

@edgararuiz commented on Nov 10, 2017, 3:58 PM UTC:

Hi, what happens if we remove the select_all() step, not sure what it adds to the example:

db <- dbConnect(odbc::odbc(), "PRD")
db %>% 
  tbl(dbplyr::in_schema("1001_2", "dxcg_raw"))

from dbplyr.

commented on May 29, 2024

@sz-cgt commented on Nov 10, 2017, 7:24 PM UTC:

You mean this?

db <- dbConnect(odbc::odbc(), "PRD")
db %>% 
  tbl(dbplyr::in_schema("1001_2", "dxcg_raw")) %>% 
  select_all() %>% 
  show_query()

That gives select * from 1001_2.dxcg_raw because dplyr is making no attempt to discover the field names. Swapping to collect() gives a tible with alias.column field names. That's also not the use case.

To summarise:

no field level activities prior to collect() works. It just gives the returned tibble weird column names
select() works because the user is supplying the list of fields.
select_all(), select_if() and select_at() all have the same problem because they're interrogating the database to get the list of fields.

I suspect all *_all(), *_if(), *_at() functions would fail in a similar way, but I have not tried them.

from dbplyr.

commented on May 29, 2024

@edgararuiz commented on Nov 10, 2017, 7:32 PM UTC:

Ok, can you help me with clarifying the first point in the summarize list?

The field names on a quick query I ran look fine to me:

> t <- tbl(con, "customers") %>%
+   head(10) %>%
+   collect()
> 
> colnames(t)
[1] "id"                "name"              "email_preferences" "addresses"         "orders"

from dbplyr.

commented on May 29, 2024

@sz-cgt commented on Nov 10, 2017, 8:27 PM UTC:

Two questions/observations:

you didn't include a schema in your example. That's the same as my second scenario in my original post
you don't mention what database back-end you're using. As I noted originally, this is a Hive-specific problem. It doesn't occur in PostgreSQL for example

from dbplyr.

commented on May 29, 2024

@edgararuiz commented on Nov 10, 2017, 8:37 PM UTC:

You're correct, sorry, it is a Hive connection. I just added a new DB, and a new table to that DB, here's the code and its results:

> con
<OdbcConnection> cloudera@Hive
  Database: HIVE
  Hive Version: 1.1.0-cdh5.7.0
> 
> t <- tbl(con, in_schema("newdb", "states")) %>%
+   head(10) %>%
+   collect()
> 
> colnames(t)
[1] "abbrev" "match1" "match2" "file"

from dbplyr.

commented on May 29, 2024

@sz-cgt commented on Nov 10, 2017, 9:00 PM UTC:

what does dbGetQuery(db, "set") %>% filter(grepl(pattern = "unique.column", set)) give if you?

As I mentioned in my second post, the parameter hive.resultset.use.unique.column.names appears to be the issue. When it's true (the default according to the Hive documentation), you get the following:

hive.resultset.use.unique.column.names
Default Value: true
Added In: Hive 0.13.0 with HIVE-6687
Make column names unique in the result set by qualifying column names with table alias if needed. Table alias will be added to column names for queries of type "select *" or if query explicitly uses table alias "select r1.x..".

from dbplyr.

commented on May 29, 2024

@hadley commented on May 20, 2018, 2:50 PM UTC:

@edgararuiz can you please close the loop here?

from dbplyr.

commented on May 29, 2024

@edgararuiz commented on May 22, 2018, 10:50 PM UTC:

Yes, I'll be glad to. My Hive service is down now, so I'll circle back as soon as I have somewhere to test @sz-cgt suggestion

from dbplyr.

commented on May 29, 2024

@edgararuiz commented on May 24, 2018, 3:01 PM UTC:

@sz-cgt - here is the result:

> dbGetQuery(con, "set") %>% 
+   filter(grepl(pattern = "unique.column", set))
                                          set
1 hive.resultset.use.unique.column.names=true

Do you think it is because we're using different versions of Hive?

This is what I'm using:

> dbGetInfo(con)$db.version
[1] "1.1.0-cdh5.7.0"

This is what I see yours is (copy-pasted from you previous post):

> odbc::dbGetInfo(db)
$dbname
[1] "HIVE"
$dbms.name
[1] "Hive"
$db.version
[1] "1.2.1.2.3.4.7-4"

from dbplyr.

commented on May 29, 2024

@sz-cgt commented on Jun 3, 2018, 3:19 PM UTC:

OK. so you have the default. That means this query select * from some_table should give back results like the following:

some_table.foo	some_table.bar	...
a	b	...

If you put an alias on the end, like this select * from some_table asdf, you get

asdf.foo	asdf.bar	...
a	b	...

Is that what happens or do you get undecorated column names?

from dbplyr.

isaac2lord commented on May 29, 2024

Do we have a solution for this ticket?

I am connected to a Hive DB and would like to first lazily point to a table without having to repeat table name in subsequent queries.

system('hive --version')
Hive 2.1.1-mapr-1803
...

Tried three ways but all got me to the same error.

dbSendUpdate(conn, 'SET hive.resultset.use.unique.column.names=false')
ivr.vcc.agg <- dplyr::tbl(conn, dbplyr::in_schema("ivr_calls_data", "ivr_vcc_enrich_aggregated"))

that results in

Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ",  : 
  Unable to retrieve JDBC result set for SELECT *
FROM ivr_calls_data.ivr_vcc_enrich_aggregated AS "zzz9"
WHERE (0 = 1) (Error while compiling statement: FAILED: ParseException line 2:49 cannot recognize input near 'AS' '"zzz9"' 'WHERE' in table source)

dbSendUpdate(conn, 'SET hive.resultset.use.unique.column.names=true')
dbSendUpdate(conn, 'USE ivr_calls_data')
ivr.vcc.agg <- dplyr::tbl(conn, "ivr_vcc_enrich_aggregated")

which again results in

Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ",  : 
  Unable to retrieve JDBC result set for SELECT *
FROM "ivr_vcc_enrich_aggregated" AS "zzz10"
WHERE (0 = 1) (Error while compiling statement: FAILED: ParseException line 2:5 cannot recognize input near '"ivr_vcc_enrich_aggregated"' 'AS' '"zzz10"' in join source)

3- identify DB at dbConnect level

conn <- dbConnect(drv, "jdbc:hive2://<HOST>:<PORT>/ivr_calls_data;",
                  "<USER>", "<PASS>")
dbSendUpdate(conn, 'SET hive.resultset.use.unique.column.names=false')
ivr.vcc.agg <- dplyr::tbl(conn, "ivr_vcc_enrich_aggregated")

which yells out with the same error.

from dbplyr.

hadley commented on May 29, 2024

This is a very long thread. Would someone mind summarising it along with a brief reprex that shows the problem?

from dbplyr.

hadley commented on May 29, 2024

I've closed this issue due to lack of requested reprex. If anyone stills care about this problem, please open a new issue with a reprex.

from dbplyr.

dbplyr on HIVE produces invalid SQL when using select_*() or select helpers on an in_schema() object about dbplyr HOT 14 CLOSED

Comments (14)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent