Comments (12)
@rstrickland welcome! Would you mind painting a picture of your setup? Hive metastore is part of your Hadoop installation? Or DSE? I don't quite understand - table meta stored in Hive, but actual tables stored in C*? Thanks!
from filodb.
We use a centralized Hive metastore that's shared by multiple Spark and EMR clusters that all serve different purposes. The data is stored in multiple places, including Cassandra. However, with the lack of a good open source Cassandra-Hive driver we've had to resort to creating temp tables every time we want to get at Cassandra data. It would be awesome if Filo supported legitimate Hive tables so we could bypass this step.
Got it. Would it be fine to do this through Spark? That is, you would not use the filo-cli but the Spark API to create tables etc. (since Spark can already connect to the Hive metastore).
On Nov 10, 2015, at 10:09 AM, Robbie Strickland [email protected] wrote:
We use a centralized Hive metastore that's shared by multiple Spark and EMR clusters that all serve different purposes. The data is stored in multiple places, including C*. However, with the lack of a good open source C*-Hive driver we've had to resort to creating temp tables every time we want to get at C* data. It would be awesome if Filo supported legitimate Hive tables so we could bypass this step.
—
Reply to this email directly or view it on GitHub #41 (comment).
We could, but we do have BI tools that use Hive proper (i.e. not the Spark SQL thrift server). Ideally it would be great if that would work as well, but I know that's a bigger effort.
Okay, I think I understand now. You are looking for a proper FiloDB driver for Hive that lets you query FiloDB from Hive itself.
On Nov 10, 2015, at 1:09 PM, Robbie Strickland [email protected] wrote:
We could, but we do have BI tools that use Hive proper (i.e. not the Spark SQL thrift server). Ideally it would be great if that would work as well, but I know that's a bigger effort.
@rstrickland ok so to break this up into two steps:
- Have a Hive driver (like DSE's) that automatically lets you query tables from Spark without having to do a CREATE EXTERNAL TABLE
- Actually support queries directly from Hive without Spark. Hmmmm.... I think this involves yucky input formats and Hive SerDes, etc.
Ok, scoped out the work for Hive metastore support of FiloDB tables for querying in Spark. Spark has a HiveMetastoreCatalog class which has a createDataSourceTable method. So one possibility is that when the FiloDB daemon/library spins up, it automatically resolves differences between FiloDB tables and the Hive catalog. In theory, this sync could also happen when a user requests tables or schema, but that would require a custom Hive plugin in Spark. Need to think about how to automate the syncing.
@rstrickland it appears in Hive you either have to register a table as Hive-supported (i.e. using Hadoop InputFormats) or non-Hive supported (for Spark datasources, for example). Thus there might need to be some hack for namespacing the tables. What do you think?
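To make the startup-sync idea above a bit more concrete, here is a minimal sketch of the reconciliation step only. It is not FiloDB or Spark code: the function name, the SyncPlan type, and the "filo_" prefix used for namespacing are all hypothetical, and the actual registration (e.g. through Spark's catalog) is left out.

```python
from typing import Iterable, List, NamedTuple


class SyncPlan(NamedTuple):
    """Result of comparing FiloDB's table list with the Hive catalog (hypothetical type)."""
    to_register: List[str]  # catalog names that should exist but do not yet
    to_drop: List[str]      # prefixed catalog names whose FiloDB table is gone


def plan_catalog_sync(
    filodb_tables: Iterable[str],
    catalog_tables: Iterable[str],
    prefix: str = "filo_",  # assumed namespacing hack; not an actual FiloDB convention
) -> SyncPlan:
    # Names we expect to find in the catalog, one per FiloDB table.
    wanted = {prefix + name for name in filodb_tables}
    # Only prefixed catalog entries are considered ours; native Hive tables are ignored.
    managed = {name for name in catalog_tables if name.startswith(prefix)}
    return SyncPlan(
        to_register=sorted(wanted - managed),
        to_drop=sorted(managed - wanted),
    )


# Example: "sales" is a native Hive table and is left alone.
plan = plan_catalog_sync(
    filodb_tables=["events", "metrics"],
    catalog_tables=["filo_events", "filo_old", "sales"],
)
```

With these inputs, plan.to_register is ["filo_metrics"] and plan.to_drop is ["filo_old"]. Restricting the diff to prefixed names is one way to keep the sync from touching tables registered as Hive-supported.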
So are you suggesting creating temp tables on startup that would be accessible via Spark only? I think the only way to get bona fide Hive support is to create a Hive SerDe. But your solution (if I'm understanding correctly) would solve the issue of having to recreate temp tables on startup for Spark or Spark SQL jobs, but would not allow for Hive queries via a BI tool.
@rstrickland you are right, the above proposed solution would not enable true Hive-only queries, though you can still connect BI tools to Spark SQL / Thrift server via the JDBC/ODBC drivers. The SerDe/InputFormats required for true Hive-only operation would come as a second step.
Would you guys be willing to test out the Spark-only solution, before the full Hive solution comes? What would the timeframe look like?
We would definitely test it whenever it's ready.
On Sunday, January 10, 2016, Evan Chan [email protected] wrote:
@rstrickland https://github.com/rstrickland you are right, the above proposed solution would not enable true Hive-only queries, though you can still connect BI tools to Spark SQL / Thrift server via the JDBC/ODBC drivers. The SerDe/InputFormats required for true Hive-only operation would come as a second step.
Would you guys be willing to test out the Spark-only solution, before the full Hive solution comes? What would the timeframe look like?
Robbie Strickland | Director, Software Engineering
w: 770-226-2093 e: [email protected]
@rstrickland check out #63 and LMK if this is roughly what you guys are looking for as a first step. Would like some feedback first.
Thanks!
So the initial support has been merged. Let's close this and open a new ticket for any issues or changes desired.