Giter Site home page Giter Site logo

Comments (14)

 avatar commented on May 29, 2024

@sz-cgt commented on Nov 9, 2017, 11:09 PM UTC:

A little more digging suggests that dbplyr is using select * from schema.table as alias to retrieve column names from tables

The problem with this approach is the hive parameter hive.resultset.use.unique.column.names, which forces result set names to be unique by pre-pending the table name or table alias to all field names when the * operator is used.

As this field is set true by default (see the TEZ documentation and HIVE-6687, dbplyr will need to work around this somehow if it wants to effectively support Hive (which seems to be part of RStudio's plans).

from dbplyr.

 avatar commented on May 29, 2024

@edgararuiz commented on Nov 10, 2017, 3:58 PM UTC:

Hi, what happens if we remove the select_all() step, not sure what it adds to the example:

db <- dbConnect(odbc::odbc(), "PRD")
db %>% 
  tbl(dbplyr::in_schema("1001_2", "dxcg_raw"))

from dbplyr.

 avatar commented on May 29, 2024

@sz-cgt commented on Nov 10, 2017, 7:24 PM UTC:

You mean this?

db <- dbConnect(odbc::odbc(), "PRD")
db %>% 
  tbl(dbplyr::in_schema("1001_2", "dxcg_raw")) %>% 
  select_all() %>% 
  show_query()

That gives select * from 1001_2.dxcg_raw because dplyr is making no attempt to discover the field names. Swapping to collect() gives a tible with alias.column field names. That's also not the use case.

To summarise:

  • no field level activities prior to collect() works. It just gives the returned tibble weird column names
  • select() works because the user is supplying the list of fields.
  • select_all(), select_if() and select_at() all have the same problem because they're interrogating the database to get the list of fields.

I suspect all *_all(), *_if(), *_at() functions would fail in a similar way, but I have not tried them.

from dbplyr.

 avatar commented on May 29, 2024

@edgararuiz commented on Nov 10, 2017, 7:32 PM UTC:

Ok, can you help me with clarifying the first point in the summarize list?

The field names on a quick query I ran look fine to me:

> t <- tbl(con, "customers") %>%
+   head(10) %>%
+   collect()
> 
> colnames(t)
[1] "id"                "name"              "email_preferences" "addresses"         "orders"     

from dbplyr.

 avatar commented on May 29, 2024

@sz-cgt commented on Nov 10, 2017, 8:27 PM UTC:

Two questions/observations:

  1. you didn't include a schema in your example. That's the same as my second scenario in my original post
  2. you don't mention what database back-end you're using. As I noted originally, this is a Hive-specific problem. It doesn't occur in PostgreSQL for example

from dbplyr.

 avatar commented on May 29, 2024

@edgararuiz commented on Nov 10, 2017, 8:37 PM UTC:

You're correct, sorry, it is a Hive connection. I just added a new DB, and a new table to that DB, here's the code and its results:

> con
<OdbcConnection> cloudera@Hive
  Database: HIVE
  Hive Version: 1.1.0-cdh5.7.0
> 
> t <- tbl(con, in_schema("newdb", "states")) %>%
+   head(10) %>%
+   collect()
> 
> colnames(t)
[1] "abbrev" "match1" "match2" "file"  

from dbplyr.

 avatar commented on May 29, 2024

@sz-cgt commented on Nov 10, 2017, 9:00 PM UTC:

what does dbGetQuery(db, "set") %>% filter(grepl(pattern = "unique.column", set)) give if you?

As I mentioned in my second post, the parameter hive.resultset.use.unique.column.names appears to be the issue. When it's true (the default according to the Hive documentation), you get the following:

hive.resultset.use.unique.column.names
Default Value: true
Added In: Hive 0.13.0 with HIVE-6687
Make column names unique in the result set by qualifying column names with table alias if needed. Table alias will be added to column names for queries of type "select *" or if query explicitly uses table alias "select r1.x..".

from dbplyr.

 avatar commented on May 29, 2024

@hadley commented on May 20, 2018, 2:50 PM UTC:

@edgararuiz can you please close the loop here?

from dbplyr.

 avatar commented on May 29, 2024

@edgararuiz commented on May 22, 2018, 10:50 PM UTC:

Yes, I'll be glad to. My Hive service is down now, so I'll circle back as soon as I have somewhere to test @sz-cgt suggestion

from dbplyr.

 avatar commented on May 29, 2024

@edgararuiz commented on May 24, 2018, 3:01 PM UTC:

@sz-cgt - here is the result:

> dbGetQuery(con, "set") %>% 
+   filter(grepl(pattern = "unique.column", set))
                                          set
1 hive.resultset.use.unique.column.names=true

Do you think it is because we're using different versions of Hive?

This is what I'm using:

> dbGetInfo(con)$db.version
[1] "1.1.0-cdh5.7.0"

This is what I see yours is (copy-pasted from you previous post):

> odbc::dbGetInfo(db)
$dbname
[1] "HIVE"
$dbms.name
[1] "Hive"
$db.version
[1] "1.2.1.2.3.4.7-4"

from dbplyr.

 avatar commented on May 29, 2024

@sz-cgt commented on Jun 3, 2018, 3:19 PM UTC:

OK. so you have the default. That means this query select * from some_table should give back results like the following:

some_table.foo some_table.bar ...
a b ...

If you put an alias on the end, like this select * from some_table asdf, you get

asdf.foo asdf.bar ...
a b ...

Is that what happens or do you get undecorated column names?

from dbplyr.

isaac2lord avatar isaac2lord commented on May 29, 2024

Do we have a solution for this ticket?

I am connected to a Hive DB and would like to first lazily point to a table without having to repeat table name in subsequent queries.

system('hive --version')
Hive 2.1.1-mapr-1803
...

Tried three ways but all got me to the same error.

1-

dbSendUpdate(conn, 'SET hive.resultset.use.unique.column.names=false')
ivr.vcc.agg <- dplyr::tbl(conn, dbplyr::in_schema("ivr_calls_data", "ivr_vcc_enrich_aggregated"))

that results in

Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ",  : 
  Unable to retrieve JDBC result set for SELECT *
FROM ivr_calls_data.ivr_vcc_enrich_aggregated AS "zzz9"
WHERE (0 = 1) (Error while compiling statement: FAILED: ParseException line 2:49 cannot recognize input near 'AS' '"zzz9"' 'WHERE' in table source)

2-

dbSendUpdate(conn, 'SET hive.resultset.use.unique.column.names=true')
dbSendUpdate(conn, 'USE ivr_calls_data')
ivr.vcc.agg <- dplyr::tbl(conn, "ivr_vcc_enrich_aggregated")

which again results in

Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ",  : 
  Unable to retrieve JDBC result set for SELECT *
FROM "ivr_vcc_enrich_aggregated" AS "zzz10"
WHERE (0 = 1) (Error while compiling statement: FAILED: ParseException line 2:5 cannot recognize input near '"ivr_vcc_enrich_aggregated"' 'AS' '"zzz10"' in join source)

3- identify DB at dbConnect level

conn <- dbConnect(drv, "jdbc:hive2://<HOST>:<PORT>/ivr_calls_data;",
                  "<USER>", "<PASS>")
dbSendUpdate(conn, 'SET hive.resultset.use.unique.column.names=false')
ivr.vcc.agg <- dplyr::tbl(conn, "ivr_vcc_enrich_aggregated")

which yells out with the same error.

from dbplyr.

hadley avatar hadley commented on May 29, 2024

This is a very long thread. Would someone mind summarising it along with a brief reprex that shows the problem?

from dbplyr.

hadley avatar hadley commented on May 29, 2024

I've closed this issue due to lack of requested reprex. If anyone stills care about this problem, please open a new issue with a reprex.

from dbplyr.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.