
demo's Introduction

Download | Docs | Benchmarks | Demo


StarRocks, a Linux Foundation project, is the next-generation data platform designed to make data-intensive real-time analytics fast and easy. It delivers query speeds 5 to 10 times faster than other popular solutions. StarRocks performs real-time analytics well even while historical records are being updated, and it can easily enrich real-time analytics with historical data from data lakes. With StarRocks, you can get rid of denormalized tables and still get the best performance and flexibility.

Learn more 👉🏻 What Is StarRocks: Features and Use Cases



Features

  • 🚀 Native vectorized SQL engine: StarRocks adopts vectorization technology to make full use of the parallel computing power of CPU, achieving sub-second query returns in multi-dimensional analyses, which is 5 to 10 times faster than previous systems.
  • 📊 Standard SQL: StarRocks supports ANSI SQL syntax (with full support for TPC-H and TPC-DS). It is also compatible with the MySQL protocol, so various clients and BI software can be used to access StarRocks.
  • 💡 Smart query optimization: StarRocks can optimize complex queries through its cost-based optimizer (CBO). With a better execution plan, data analysis efficiency is greatly improved.
  • ⚡ Real-time update: The update model of StarRocks supports upsert/delete operations by primary key and keeps queries efficient even under concurrent updates (see the primary key sketch after this list).
  • 🪟 Intelligent materialized view: Materialized views in StarRocks are refreshed automatically during data import and selected automatically when queries are executed.
  • ✨ Querying data in data lakes directly: StarRocks allows direct access to data from Apache Hive™, Apache Iceberg™, and Apache Hudi™ without importing.
  • 🎛️ Resource management: This feature allows StarRocks to limit resource consumption for queries and implement isolation and efficient use of resources among tenants in the same cluster.
  • 💠 Easy to maintain: Simple architecture makes StarRocks easy to deploy, maintain and scale out. StarRocks tunes its query plan agilely, balances the resources when the cluster is scaled in or out, and recovers the data replica under node failure automatically.
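
To illustrate the real-time update feature above, here is a minimal, hedged sketch of a primary key table with upsert and delete operations. The table and column names are invented for this example, and the exact syntax may vary slightly across StarRocks versions:

-- Minimal sketch of a primary key table (illustrative names, not from the demos).
CREATE TABLE user_profile (
    user_id BIGINT NOT NULL,
    name    STRING,
    city    STRING
)
PRIMARY KEY (user_id)
DISTRIBUTED BY HASH (user_id)
PROPERTIES ("replication_num" = "1");

-- On a primary key table, INSERT behaves as an upsert: the second statement
-- replaces the row with user_id = 1 instead of creating a duplicate.
INSERT INTO user_profile VALUES (1, 'Alice', 'NYC');
INSERT INTO user_profile VALUES (1, 'Alice', 'SFO');

-- Deletes by primary key are supported as well.
DELETE FROM user_profile WHERE user_id = 1;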

Architecture Overview

StarRocks’s streamlined architecture is mainly composed of two modules: Frontend (FE) and Backend (BE). The entire system eliminates single points of failure through seamless and horizontal scaling of FE and BE, as well as replication of metadata and data.

Starting from version 3.0, StarRocks supports a new shared-data architecture, which can provide better scalability and lower costs.
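
Because StarRocks is MySQL protocol compatible, you can inspect the FE and BE nodes of a running cluster from any MySQL client. A minimal sketch (the exact columns returned vary by version):

-- List frontend (FE) nodes, their roles, and whether they are alive.
SHOW FRONTENDS;

-- List backend (BE) nodes, their alive status, and disk usage.
SHOW BACKENDS;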


Resources

📚 Read the docs

Section   Description
Deploy    Learn how to run and configure StarRocks.
Articles  How-tos, Tutorials, Best Practices and Architecture Articles.
Docs      Full documentation.
Blogs     StarRocks deep dive and user stories.

❓ Get support


Contributing to StarRocks

We welcome all kinds of contributions from the community, individuals and partners. We owe our success to your active involvement.

  1. See Contributing.md to get started.
  2. Set up the StarRocks development environment.
  3. Understand our GitHub workflow for opening a pull request; use this PR Template when submitting a pull request.
  4. Pick a good first issue and start contributing.

📝 License: StarRocks is licensed under Apache License 2.0.

👥 Community Membership: Learn more about different contributor roles in StarRocks community.


Used By

This project is used by the following companies. Learn more about their use cases:

demo's People

Contributors

alberttwong · danroscigno · dependabot[bot] · dzmxcyr · howrocks · imay · jaogoy · kateshaowanjou · kevincai · naah69 · ss892714028 · taoshengyijiua · thimoonxy · yandongxiao

demo's Issues

hudi docker-compose. here's an example of loading a parquet file.

The existing example shows only a 3-row insert. I wanted to show importing a large Parquet file, so here it is for future reference.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import scala.collection.JavaConversions._

val df = spark.read.parquet("s3a://huditest/user_behavior_sample_data.parquet")

val databaseName = "hudi_sample"
val tableName = "hudi_coders_hive"
val basePath = "s3a://huditest/hudi_coders"

df.write.format("hudi").
  option(org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME, tableName).
  option(RECORDKEY_FIELD_OPT_KEY, "UserID").
  option(PRECOMBINE_FIELD_OPT_KEY, "UserID").  
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.mode", "hms").
  option("hoodie.datasource.hive_sync.database", databaseName).
  option("hoodie.datasource.hive_sync.table", tableName).
  option("hoodie.datasource.hive_sync.metastore.uris", "thrift://hive-metastore:9083").
  option("fs.defaultFS", "s3://huditest/").  
  mode(Overwrite).
  save(basePath)

hudi quickstart does not allow you to create tables

mysql> create table user_behavior as
    ->     SELECT * FROM FILES(
    ->         "path" = "s3://huditest/user_behavior_sample_data.parquet",
    ->         "format" = "parquet",
    ->         "aws.s3.access_key" = "admin",
    ->         "aws.s3.secret_key" = "password",
    ->         "aws.s3.region" = "us-west-2",
    ->         "aws.s3.use_instance_profile" = "false",
    ->         "aws.s3.enable_ssl" = "false",
    ->         "aws.s3.enable_path_style_access" = "true",
    ->         "aws.s3.endpoint" = "http://minio:9000"
    -> );
ERROR 1064 (HY000): Table replication num should be less than of equal to the number of available BE nodes. You can change this default by setting the replication_num table properties. Current alive backend is [10004]. table=user_behavior, properties.replication_num=3

The reason is that the compose file starts only 1 FE and 1 BE, while the default table replication factor is 3.
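
As the error message suggests, one workaround is to set replication_num to 1 on the table, since only a single BE is available. A hedged sketch (not verified against this exact compose setup):

-- Override the default replication factor (3) for a single-BE deployment.
create table user_behavior
PROPERTIES ("replication_num" = "1")
as SELECT * FROM FILES(
    "path" = "s3://huditest/user_behavior_sample_data.parquet",
    "format" = "parquet",
    "aws.s3.access_key" = "admin",
    "aws.s3.secret_key" = "password",
    "aws.s3.region" = "us-west-2",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.enable_ssl" = "false",
    "aws.s3.enable_path_style_access" = "true",
    "aws.s3.endpoint" = "http://minio:9000"
);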

Iceberg docker-compose: won't start again after you press Ctrl+C twice to force-stop it.

^CGracefully stopping... (press Ctrl+C again to force)
[+] Stopping 0/3
 ⠹ Container spark-iceberg  Stopping                                                                                                                                                                                              4.3s
 ⠹ Container starrocks-be   Stopping                                                                                                                                                                                              4.3s
 ⠹ Container mc             Stopping                                                                                                                                                                                              4.3s
[+] Killing 4/6
[+] Killing 6/6eberg-rest   Killed                                                                                                                                                                                                0.3s
 ✔ Container iceberg-rest   Killed                                                                                                                                                                                                0.3s
[+] Stopping 6/4rk-iceberg  Killed                                                                                                                                                                                                0.4s
 ✔ Container spark-iceberg  Stopped                                                                                                                                                                                               4.7s
 ✔ Container starrocks-be   Stopped                                                                                                                                                                                               4.7s
 ✔ Container mc             Stopped                                                                                                                                                                                               4.5s
 ✔ Container starrocks-fe   Stopped                                                                                                                                                                                               0.0s
 ✔ Container iceberg-rest   Stopped                                                                                                                                                                                               0.0s
 ✔ Container minio          Stopped                                                                                                                                                                                               0.0s
canceled
atwong@Albert-CelerData iceberg % docker-compose up
[+] Running 6/0
 ✔ Container iceberg-rest   Created                                                                                                                                                                                               0.0s
 ✔ Container minio          Created                                                                                                                                                                                               0.0s
 ✔ Container starrocks-fe   Created                                                                                                                                                                                               0.0s
 ✔ Container mc             Created                                                                                                                                                                                               0.0s
 ✔ Container spark-iceberg  Created                                                                                                                                                                                               0.0s
 ✔ Container starrocks-be   Created                                                                                                                                                                                               0.0s
Attaching to iceberg-rest, mc, minio, spark-iceberg, starrocks-be, starrocks-fe
starrocks-fe   | /opt/starrocks/fe/bin/start_fe.sh: 56: source: not found
starrocks-fe   | /opt/starrocks/fe/bin/start_fe.sh: 67: export_env_from_conf: not found
starrocks-fe   | /opt/starrocks/fe/bin/start_fe.sh: 70: source: not found
starrocks-fe   | /opt/starrocks/fe/bin/start_fe.sh: 74: [[: not found
starrocks-fe   | /opt/starrocks/fe/bin/start_fe.sh: 108: jdk_version: not found
starrocks-fe   | /opt/starrocks/fe/bin/start_fe.sh: 110: [[: not found
starrocks-fe   | /opt/starrocks/fe/bin/start_fe.sh: 133: detect_jvm_xmx: not found
starrocks-fe   | /opt/starrocks/fe/bin/start_fe.sh: 136: [[: not found
starrocks-fe   | Frontend running as process 1. Stop it first.
minio          | MinIO Object Storage Server
minio          | Copyright: 2015-2023 MinIO, Inc.
minio          | License: GNU AGPLv3 <https://www.gnu.org/licenses/agpl-3.0.html>
minio          | Version: RELEASE.2023-11-20T22-40-07Z (go1.21.4 linux/arm64)
minio          |
minio          | Status:         1 Online, 0 Offline.
minio          | S3-API: http://192.168.0.4:9000  http://127.0.0.1:9000
minio          | Console: http://192.168.0.4:9001 http://127.0.0.1:9001
minio          |
minio          | Documentation: https://min.io/docs/minio/linux/index.html
minio          | Warning: The standard parity is set to 0. This can lead to data loss.
iceberg-rest   | 2024-02-02T19:44:34.911 INFO  [org.apache.iceberg.rest.RESTCatalogServer] - Creating catalog with properties: {jdbc.password=password, s3.endpoint=http://minio:9000, jdbc.user=user, io-impl=org.apache.iceberg.aws.s3.S3FileIO, catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog, warehouse=s3://warehouse/, uri=jdbc:sqlite:file:/tmp/iceberg_rest_mode=memory}
iceberg-rest   | 2024-02-02T19:44:34.925 INFO  [org.apache.iceberg.CatalogUtil] - Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
minio          |
minio          |  You are running an older version of MinIO released 2 months before the latest release
minio          |  Update: Run `mc admin update`
minio          |
minio          |
starrocks-fe exited with code 1
iceberg-rest   | 2024-02-02T19:44:35.055 INFO  [org.eclipse.jetty.util.log] - Logging initialized @256ms to org.eclipse.jetty.util.log.Slf4jLog
iceberg-rest   | 2024-02-02T19:44:35.085 INFO  [org.eclipse.jetty.server.Server] - jetty-9.4.51.v20230217; built: 2023-02-17T08:19:37.309Z; git: b45c405e4544384de066f814ed42ae3dceacdd49; jvm 17.0.9+8-LTS
iceberg-rest   | 2024-02-02T19:44:35.093 INFO  [org.eclipse.jetty.server.handler.ContextHandler] - Started o.e.j.s.ServletContextHandler@574b560f{/,null,AVAILABLE}
iceberg-rest   | 2024-02-02T19:44:35.100 INFO  [org.eclipse.jetty.server.AbstractConnector] - Started ServerConnector@b2c9a9c{HTTP/1.1, (http/1.1)}{0.0.0.0:8181}
iceberg-rest   | 2024-02-02T19:44:35.100 INFO  [org.eclipse.jetty.server.Server] - Started @301ms
spark-iceberg  | starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark--org.apache.spark.deploy.master.Master-1-57b8cb2c0cbd.out
mc             | Added `minio` successfully.
mc             | mc: <ERROR> Unable to make bucket `minio/warehouse`. Your previous request to create the named bucket succeeded and you already own it.
mc             | mc: Please use 'mc anonymous'
spark-iceberg  | starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark--org.apache.spark.deploy.worker.Worker-1-57b8cb2c0cbd.out
spark-iceberg  | starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/spark/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-57b8cb2c0cbd.out
spark-iceberg  | starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /opt/spark/logs/spark--org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-57b8cb2c0cbd.out
spark-iceberg  | [I 2024-02-02 19:44:45.609 ServerApp] Package notebook took 0.0000s to import
spark-iceberg  | [I 2024-02-02 19:44:45.616 ServerApp] Package jupysql_plugin took 0.0072s to import
spark-iceberg  | [I 2024-02-02 19:44:45.621 ServerApp] Package jupyter_lsp took 0.0041s to import
spark-iceberg  | [W 2024-02-02 19:44:45.621 ServerApp] A `_jupyter_server_extension_points` function was not found in jupyter_lsp. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
spark-iceberg  | [I 2024-02-02 19:44:45.623 ServerApp] Package jupyter_server_terminals took 0.0018s to import
spark-iceberg  | [I 2024-02-02 19:44:45.623 ServerApp] Package jupyterlab took 0.0000s to import
spark-iceberg  | [I 2024-02-02 19:44:45.636 ServerApp] Package notebook_shim took 0.0000s to import
spark-iceberg  | [W 2024-02-02 19:44:45.636 ServerApp] A `_jupyter_server_extension_points` function was not found in notebook_shim. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
spark-iceberg  | [I 2024-02-02 19:44:45.636 ServerApp] jupysql_plugin | extension was successfully linked.
spark-iceberg  | [I 2024-02-02 19:44:45.636 ServerApp] jupyter_lsp | extension was successfully linked.
spark-iceberg  | [I 2024-02-02 19:44:45.638 ServerApp] jupyter_server_terminals | extension was successfully linked.
spark-iceberg  | [W 2024-02-02 19:44:45.639 LabApp] 'token' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
spark-iceberg  | [W 2024-02-02 19:44:45.639 LabApp] 'password' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
spark-iceberg  | [W 2024-02-02 19:44:45.640 ServerApp] ServerApp.token config is deprecated in 2.0. Use IdentityProvider.token.
spark-iceberg  | [I 2024-02-02 19:44:45.640 ServerApp] jupyterlab | extension was successfully linked.
spark-iceberg  | [I 2024-02-02 19:44:45.643 ServerApp] notebook | extension was successfully linked.
spark-iceberg  | [I 2024-02-02 19:44:45.739 ServerApp] notebook_shim | extension was successfully linked.
spark-iceberg  | [W 2024-02-02 19:44:45.747 ServerApp] WARNING: The Jupyter server is listening on all IP addresses and not using encryption. This is not recommended.
spark-iceberg  | [W 2024-02-02 19:44:45.747 ServerApp] WARNING: The Jupyter server is listening on all IP addresses and not using authentication. This is highly insecure and not recommended.
spark-iceberg  | [I 2024-02-02 19:44:45.748 ServerApp] notebook_shim | extension was successfully loaded.
spark-iceberg  | [I 2024-02-02 19:44:45.748 ServerApp] Registered jupysql-plugin server extension
spark-iceberg  | [I 2024-02-02 19:44:45.748 ServerApp] jupysql_plugin | extension was successfully loaded.
spark-iceberg  | [I 2024-02-02 19:44:45.749 ServerApp] jupyter_lsp | extension was successfully loaded.
spark-iceberg  | [I 2024-02-02 19:44:45.750 ServerApp] jupyter_server_terminals | extension was successfully loaded.
spark-iceberg  | [I 2024-02-02 19:44:45.750 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.9/site-packages/jupyterlab
spark-iceberg  | [I 2024-02-02 19:44:45.750 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
spark-iceberg  | [I 2024-02-02 19:44:45.751 LabApp] Extension Manager is 'pypi'.
spark-iceberg  | [I 2024-02-02 19:44:45.752 ServerApp] jupyterlab | extension was successfully loaded.
spark-iceberg  | [I 2024-02-02 19:44:45.753 ServerApp] notebook | extension was successfully loaded.
spark-iceberg  | [I 2024-02-02 19:44:45.754 ServerApp] Serving notebooks from local directory: /home/iceberg/notebooks
spark-iceberg  | [I 2024-02-02 19:44:45.754 ServerApp] Jupyter Server 2.10.0 is running at:
spark-iceberg  | [I 2024-02-02 19:44:45.754 ServerApp] http://localhost:8888/tree
spark-iceberg  | [I 2024-02-02 19:44:45.754 ServerApp]     http://127.0.0.1:8888/tree
spark-iceberg  | [I 2024-02-02 19:44:45.754 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
spark-iceberg  | [I 2024-02-02 19:44:45.763 ServerApp] Skipped non-installed server(s): bash-language-server, dockerfile-language-server-nodejs, javascript-typescript-langserver, jedi-language-server, julia-language-server, pyright, python-language-server, python-lsp-server, r-languageserver, sql-language-server, texlab, typescript-language-server, unified-language-server, vscode-css-languageserver-bin, vscode-html-languageserver-bin, vscode-json-languageserver-bin, yaml-language-server
starrocks-be   | ERROR 2005 (HY000): Unknown MySQL server host 'starrocks-fe' (-2)
starrocks-be   | Backend running as process 1. Stop it first.
starrocks-be exited with code 1

taobao data loaded on Hudi shows different results than taobao data loaded on StarRocks internal format

https://forum.starrocks.io/t/retail-ecommerce-funnel-analysis-demo-with-1-million-members-and-87-million-record-dataset-using-starrocks/269

VS

mysql> select count(*) from user_behavior;
+----------+
| count(*) |
+----------+
| 86953525 |
+----------+
1 row in set (0.14 sec)

mysql> with tmp1 as (
    ->   with tmp as (
    ->     select
    ->       t.level as level,
    ->       count(UserID) as res
    ->     from
    ->       (
    ->         select
    ->           UserID,
    ->           window_funnel(
    ->             18000,
    ->             `Timestamp`,
    ->             0,
    ->             [BehaviorType = 'pv' ,
    ->             BehaviorType = 'cart',
    ->             BehaviorType = 'buy' ]
    ->           ) as level
    ->         from
    ->           user_behavior
    ->         where `Timestamp` >= '2017-12-02 00:00:00'
    ->             and `Timestamp` <= '2017-12-02 23:59:59'
    ->         group by
    ->           UserID
    ->       ) as t
    ->     where
    ->       t.level > 0
    ->     group by
    ->       t.level
    ->   )
    ->   select
    ->     tmp.level,
    ->     sum(tmp.res) over (
    ->       order by
    ->         tmp.level rows between current row
    ->         and unbounded following
    ->     ) as retention
    ->   from
    ->     tmp
    -> )
    -> select
    ->   tmp1.level,
    ->   tmp1.retention,
    ->   last_value(tmp1.retention) over(
    ->     order by
    ->       tmp1.level rows between current row
    ->       and 1 following
    ->   )/ tmp1.retention as retention_ratio
    -> from
    ->   tmp1;
+-------+-----------+---------------------+
| level | retention | retention_ratio     |
+-------+-----------+---------------------+
|     1 |    913314 | 0.34725078122091635 |
|     2 |    317149 | 0.23266981765668504 |
|     3 |     73791 |                   1 |
+-------+-----------+---------------------+
3 rows in set (1.94 sec)

mysql> with tmp1 as (
    ->   with tmp as (
    ->     select
    ->       ItemID,
    ->       t.level as level,
    ->       count(UserID) as res
    ->     from
    ->       (
    ->         select
    ->           ItemID,
    ->           UserID,
    ->           window_funnel(
    ->             1800,
    ->             timestamp,
    ->             0,
    ->             [BehaviorType = 'pv',
    ->             BehaviorType ='buy' ]
    ->           ) as level
    ->         from
    ->           user_behavior
    ->         where timestamp >= '2017-12-02 00:00:00'
    ->             and timestamp <= '2017-12-02 23:59:59'
    ->         group by
    ->           ItemID,
    ->           UserID
    ->       ) as t
    ->     where
    ->       t.level > 0
    ->     group by
    ->       t.ItemID,
    ->       t.level
    ->   )
    ->   select
    ->     tmp.ItemID,
    ->     tmp.level,
    ->     sum(tmp.res) over (
    ->       partition by tmp.ItemID
    ->       order by
    ->         tmp.level rows between current row
    ->         and unbounded following
    ->     ) as retention
    ->   from
    ->     tmp
    -> )
    -> select
    ->   tmp1.ItemID,
    ->   tmp1.level,
    ->   tmp1.retention / last_value(tmp1.retention) over(
    ->     partition by tmp1.ItemID
    ->     order by
    ->       tmp1.level desc rows between current row
    ->       and 1 following
    ->   ) as retention_ratio
    -> from
    ->   tmp1
    -> order by
    ->   level desc,
    ->   retention_ratio
    -> limit
    ->   10;
+---------+-------+-----------------------+
| ItemID  | level | retention_ratio       |
+---------+-------+-----------------------+
|   59883 |     2 | 0.0003616636528028933 |
|  394978 |     2 | 0.0006357279084551812 |
| 1164931 |     2 | 0.0006648936170212766 |
| 4622270 |     2 | 0.0007692307692307692 |
|  812879 |     2 | 0.0009121313469139556 |
| 1783990 |     2 | 0.0009132420091324201 |
| 3847054 |     2 |  0.000925925925925926 |
| 2742138 |     2 | 0.0009881422924901185 |
|  530918 |     2 | 0.0010193679918450561 |
|  600756 |     2 | 0.0010319917440660474 |
+---------+-------+-----------------------+
10 rows in set (3.67 sec)

mysql> select
    ->   log.BehaviorType,
    ->   count(log.BehaviorType)
    -> from
    ->   (
    ->     select
    ->       ItemID,
    ->       UserID,
    ->       window_funnel(
    ->         1800,
    ->         timestamp,
    ->         0,
    ->         [BehaviorType = 'pv' ,
    ->         BehaviorType = 'buy' ]
    ->       ) as level
    ->     from
    ->       user_behavior
    ->     where timestamp >= '2017-12-02 00:00:00'
    ->         and timestamp <= '2017-12-02 23:59:59'
    ->     group by
    ->       ItemID,
    ->       UserID
    ->   ) as list
    ->   left join (
    ->     select
    ->       UserID,
    ->       array_agg(BehaviorType) as BehaviorType
    ->     from
    ->       user_behavior
    ->     where
    ->       ItemID = 3563468
    ->       and timestamp >= '2017-12-02 00:00:00'
    ->       and timestamp <= '2017-12-02 23:59:59'
    ->     group by
    ->       UserID
    ->   ) as log on list.UserID = log.UserID
    -> where
    ->   list.ItemID = 3563468
    ->   and list.level = 1
    -> group by
    ->   log.BehaviorType
    -> order by
    ->   count(BehaviorType) desc;
+--------------------------------------+-------------------------+
| BehaviorType                         | count(log.BehaviorType) |
+--------------------------------------+-------------------------+
| ["pv"]                               |                    1589 |
| ["pv","pv"]                          |                      52 |
| ["pv","pv","pv"]                     |                      10 |
| ["cart","pv"]                        |                       8 |
| ["cart","pv","pv"]                   |                       6 |
| ["fav","pv"]                         |                       6 |
| ["fav","pv","pv"]                    |                       3 |
| ["pv","pv","pv","pv"]                |                       2 |
| ["cart","pv","pv","pv"]              |                       2 |
| ["cart","pv","pv","pv","pv"]         |                       1 |
| ["pv","pv","pv","pv","pv","pv","pv"] |                       1 |
| ["fav","pv","pv","pv","pv"]          |                       1 |
| ["fav","pv","pv","pv","cart"]        |                       1 |
| ["pv","cart","pv"]                   |                       1 |
| ["pv","pv","cart"]                   |                       1 |
+--------------------------------------+-------------------------+
15 rows in set (3.10 sec)

mysql>

Apache Hudi Demo with StarRocks: Create DF throws Warning

Hello,
I am trying out some of the labs at https://github.com/StarRocks/demo.
After launching the container locally, I exec into it and run /spark-3.2.1-bin-hadoop3.2/bin/spark-shell.

I tried:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import scala.collection.JavaConversions._

val schema = StructType( Array(
                 StructField("language", StringType, true),
                 StructField("users", StringType, true),
                 StructField("id", StringType, true)
             ))

val rowData= Seq(Row("Java", "20000", "a"), 
               Row("Python", "100000", "b"), 
               Row("Scala", "3000", "c"))


val df = spark.createDataFrame(rowData,schema)

Creating the DataFrame throws this warning:

warning: one deprecation (since 2.12.0); for details, enable `:setting -deprecation' or `:replay -deprecation'
54481 [main] WARN  org.apache.hadoop.fs.FileSystem  - Failed to initialize fileystem hdfs://hadoop-master:9000/user/hive/warehouse: java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoop-master
54485 [main] WARN  org.apache.spark.sql.internal.SharedState  - Cannot qualify the warehouse path, leaving it unqualified.
java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoop-master
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:466)
        at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:134)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:374)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:308)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initDFSClient(DistributedFileSystem.java:201)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:186)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
        at org.apache.spark.sql.internal.SharedState$.qualifyWarehousePath(SharedState.scala:282)
        at org.apache.spark.sql.internal.SharedState.liftedTree1$1(SharedState.scala:80)
        at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:79)
        at org.apache.spark.sql.SparkSession.$anonfun$sharedState$1(SparkSession.scala:139)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:139)
        at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:138)
        at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:158)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:156)
        at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:153)
        at org.apache.spark.sql.SparkSession.$anonfun$new$3(SparkSession.scala:113)
        at scala.Option.map(Option.scala:230)
        at org.apache.spark.sql.SparkSession.$anonfun$new$1(SparkSession.scala:113)
        at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:230)
        at org.apache.spark.sql.catalyst.util.CharVarcharUtils$.failIfHasCharVarchar(CharVarcharUtils.scala:63)
        at org.apache.spark.sql.SparkSession.$anonfun$createDataFrame$4(SparkSession.scala:387)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:386)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:46)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:50)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:52)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:54)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:56)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:58)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:60)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:62)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:64)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:66)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:68)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:70)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:72)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:74)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:76)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:78)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:80)
        at $line24.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:82)
        at $line24.$read$$iw$$iw$$iw$$iw.<init>(<console>:84)
        at $line24.$read$$iw$$iw$$iw.<init>(<console>:86)
        at $line24.$read$$iw$$iw.<init>(<console>:88)
        at $line24.$read$$iw.<init>(<console>:90)
        at $line24.$read.<init>(<console>:92)
        at $line24.$read$.<init>(<console>:96)
        at $line24.$read$.<clinit>(<console>)
        at $line24.$eval$.$print$lzycompute(<console>:7)
        at $line24.$eval$.$print(<console>:6)
        at $line24.$eval.$print(<console>)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:747)
        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1020)
        at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:568)
        at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
        at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
        at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:567)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:594)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:564)
        at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:865)
        at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:733)
        at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:435)
        at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:456)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:239)
        at org.apache.spark.repl.Main$.doMain(Main.scala:78)
        at org.apache.spark.repl.Main$.main(Main.scala:58)
        at org.apache.spark.repl.Main.main(Main.scala)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.UnknownHostException: hadoop-master
        ... 92 more
df: org.apache.spark.sql.DataFrame = [language: string, users: string ... 1 more field]

Any idea how I can disable this warning, or should I just ignore it?

Move the hudi demo to use the official apache hive HMS image

  hive-metastore:
    container_name: hive-metastore
    hostname: hive-metastore
    image: 'apache/hive:4.0.0-alpha-2'
    ports:
      - '9083:9083' # Metastore Thrift
    environment:
      SERVICE_NAME: metastore
      HIVE_METASTORE_WAREHOUSE_DIR: /home/data
    volumes:
      - ./data:/home/data

Syntax error in Python connector demo

Bracket position issue: .format() is called on the return value of print().
MiscDemo/connect/python/connector.py, lines 90 to 91:

for (siteid, citycode, pv) in cursor:
    print("{}\t{}\t{}").format(siteid, citycode, pv)

should be

for (siteid, citycode, pv) in cursor:
    print("{}\t{}\t{}".format(siteid, citycode, pv)) #bracket position issue

taobao data set (1GB+ parquet file + JOINS) instruction on Hudi

Run in the Spark container:

rm -f /spark-3.2.1-bin-hadoop3.2/jars/hudi-spark3-bundle_2.12-0.11.1.jar
export SPARK_VERSION=3.2
spark-shell --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.14.1 --driver-memory 24G

Run at the Scala prompt:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import scala.collection.JavaConversions._

val df = spark.read.parquet("s3a://warehouse/user_behavior_sample_data.parquet")

val databaseName = "hudi_ecommerce"
val tableName = "user_behavior"
val basePath = "s3a://huditest/hudi_ecommerce_user_behavior"

df.write.format("hudi").
  option(org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME, tableName).
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.mode", "hms").
  option("hoodie.datasource.hive_sync.database", databaseName).
  option("hoodie.datasource.hive_sync.table", tableName).
  option("hoodie.datasource.hive_sync.metastore.uris", "thrift://hive-metastore:9083").
  option("fs.defaultFS", "s3://huditest/").  
  mode(Overwrite).
  save(basePath)

Run at the Scala prompt:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import scala.collection.JavaConversions._

val df = spark.read.parquet("s3a://warehouse/item_sample_data.parquet")

val databaseName = "hudi_ecommerce"
val tableName = "item"
val basePath = "s3a://huditest/hudi_ecommerce_item"

df.write.format("hudi").
  option(org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME, tableName).
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.mode", "hms").
  option("hoodie.datasource.hive_sync.database", databaseName).
  option("hoodie.datasource.hive_sync.table", tableName).
  option("hoodie.datasource.hive_sync.metastore.uris", "thrift://hive-metastore:9083").
  option("fs.defaultFS", "s3://huditest/").  
  mode(Overwrite).
  save(basePath)

Run in the MySQL client:

CREATE EXTERNAL CATALOG hudi_catalog_hms
PROPERTIES
(
    "type" = "hudi",
    "hive.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://hive-metastore:9083",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.access_key" = "admin",
    "aws.s3.secret_key" = "password",
    "aws.s3.region" = "us-east-1",
    "aws.s3.enable_ssl" = "false",
    "aws.s3.enable_path_style_access" = "true",
    "aws.s3.endpoint" = "http://minio:9000"
);
set catalog hudi_catalog_hms;
show databases;
use hudi_ecommerce;
show tables;
select count(*) from user_behavior;
select count(*) from item;

Optional: run in the MySQL prompt

drop catalog hudi_catalog_hms;

3 SQL exercises are available at https://forum.starrocks.io/t/retail-ecommerce-funnel-analysis-demo-with-1-million-members-and-87-million-record-dataset-using-starrocks/269

datalakehouse tutorial with hudi, iceberg, delta lake external catalog with conversion done through onetable

Using https://github.com/StarRocks/demo/tree/master/documentation-samples/datalakehouse and #56

Create huditest bucket.

yum install -y python3
rm -f /spark-3.2.1-bin-hadoop3.2/jars/hudi-spark3-bundle_2.12-0.11.1.jar
export SPARK_VERSION=3.2
pyspark \
  --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.1 \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"

Run at the pyspark prompt:

from pyspark.sql.types import *

# initialize the bucket
table_name = "people"
local_base_path = "s3://huditest/hudi-dataset"
databaseName = "hudi_onetable"

records = [
   (1, 'John', 25, 'NYC', '2023-09-28 00:00:00'),
   (2, 'Emily', 30, 'SFO', '2023-09-28 00:00:00'),
   (3, 'Michael', 35, 'ORD', '2023-09-28 00:00:00'),
   (4, 'Andrew', 40, 'NYC', '2023-10-28 00:00:00'),
   (5, 'Bob', 28, 'SEA', '2023-09-23 00:00:00'),
   (6, 'Charlie', 31, 'DFW', '2023-08-29 00:00:00')
]

schema = StructType([
   StructField("id", IntegerType(), True),
   StructField("name", StringType(), True),
   StructField("age", IntegerType(), True),
   StructField("city", StringType(), True),
   StructField("create_ts", StringType(), True)
])

df = spark.createDataFrame(records, schema)

hudi_options = {
   'hoodie.table.name': table_name,
   'hoodie.datasource.write.partitionpath.field': 'city',
   'hoodie.datasource.write.hive_style_partitioning': 'true',
   'hoodie.datasource.hive_sync.enable': 'true',
   'hoodie.datasource.hive_sync.mode': 'hms',
   'hoodie.datasource.hive_sync.database': databaseName,
   'hoodie.datasource.hive_sync.table': table_name,
   'hoodie.datasource.hive_sync.metastore.uris': 'thrift://hive-metastore:9083'
}

(
   df.write
   .format("hudi")
   .options(**hudi_options)
   .mode("Overwrite")
   .save(f"{local_base_path}/{table_name}")
)

Run in the MySQL client:

CREATE EXTERNAL CATALOG hudi_catalog_hms
PROPERTIES
(
    "type" = "hudi",
    "hive.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://hive-metastore:9083",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.access_key" = "admin",
    "aws.s3.secret_key" = "password",
    "aws.s3.region" = "us-east-1",
    "aws.s3.enable_ssl" = "false",
    "aws.s3.enable_path_style_access" = "true",
    "aws.s3.endpoint" = "http://minio:9000"
);
set catalog hudi_catalog_hms;
show databases;
use hudi_onetable;
show tables;

output

StarRocks > CREATE EXTERNAL CATALOG hudi_catalog_hms
    -> PROPERTIES
    -> (
    ->     "type" = "hudi",
    ->     "hive.metastore.type" = "hive",
    ->     "hive.metastore.uris" = "thrift://hive-metastore:9083",
    ->     "aws.s3.use_instance_profile" = "false",
    ->     "aws.s3.access_key" = "admin",
    ->     "aws.s3.secret_key" = "password",
    ->     "aws.s3.region" = "us-east-1",
    ->     "aws.s3.enable_ssl" = "false",
    ->     "aws.s3.enable_path_style_access" = "true",
    ->     "aws.s3.endpoint" = "http://minio:9000"
    -> );
Query OK, 0 rows affected (0.44 sec)

StarRocks > set catalog hudi_catalog_hms;
Query OK, 0 rows affected (0.00 sec)

StarRocks > show databases;
use hudi_onetable;
show tables;
+--------------------+
| Database           |
+--------------------+
| default            |
| hudi_onetable      |
| information_schema |
+--------------------+
3 rows in set (0.23 sec)

StarRocks > use hudi_onetable;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
StarRocks > show tables;
+-------------------------+
| Tables_in_hudi_onetable |
+-------------------------+
| people                  |
+-------------------------+
1 row in set (0.00 sec)

StarRocks > select * from people;
+---------------------+------------------------+-----------------------+------------------------+-------------------------------------------------------------------------+------+---------+------+------+---------------------+
| _hoodie_commit_time | _hoodie_commit_seqno   | _hoodie_record_key    | _hoodie_partition_path | _hoodie_file_name                                                       | id   | name    | age  | city | create_ts           |
+---------------------+------------------------+-----------------------+------------------------+-------------------------------------------------------------------------+------+---------+------+------+---------------------+
| 20240222213015211   | 20240222213015211_8_12 | 20240222213015211_8_0 | city=DFW               | b503b61b-b5c9-437d-81a6-3732da898e27-0_8-42-0_20240222213015211.parquet |    6 | Charlie |   31 | DFW  | 2023-08-29 00:00:00 |
| 20240222213015211   | 20240222213015211_2_7  | 20240222213015211_2_0 | city=SFO               | 467cf3fa-18fc-4aa1-a20c-8581b4abd039-0_2-36-0_20240222213015211.parquet |    2 | Emily   |   30 | SFO  | 2023-09-28 00:00:00 |
| 20240222213015211   | 20240222213015211_7_9  | 20240222213015211_7_0 | city=SEA               | fd8ec934-da12-4cb0-9b19-9aaa524b3159-0_7-41-0_20240222213015211.parquet |    5 | Bob     |   28 | SEA  | 2023-09-23 00:00:00 |
| 20240222213015211   | 20240222213015211_4_8  | 20240222213015211_4_0 | city=ORD               | e5bb037a-8141-46d8-b3a8-d4211373d354-0_4-38-0_20240222213015211.parquet |    3 | Michael |   35 | ORD  | 2023-09-28 00:00:00 |
| 20240222213015211   | 20240222213015211_5_10 | 20240222213015211_5_0 | city=NYC               | a239d9f5-ceba-4ca5-ba83-51cea9a2731e-0_5-39-0_20240222213015211.parquet |    4 | Andrew  |   40 | NYC  | 2023-10-28 00:00:00 |
| 20240222213015211   | 20240222213015211_1_11 | 20240222213015211_1_0 | city=NYC               | 294108a6-8702-4bce-a704-0bcb6196e8bf-0_1-35-0_20240222213015211.parquet |    1 | John    |   25 | NYC  | 2023-09-28 00:00:00 |
+---------------------+------------------------+-----------------------+------------------------+-------------------------------------------------------------------------+------+---------+------+------+---------------------+
6 rows in set (5.44 sec)

StarRocks >

Note

Issue with MinIO authentication: apache/incubator-xtable#327. Update: fixed, waiting for the PR to get into main.

Important

You have to compile the OneTable code yourself right now to get the 600+ MB utilities-0.1.0-SNAPSHOT-bundled.jar file. They're working on making it smaller, but right now there is no other option.

export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=password
cd /spark-3.2.1-bin-hadoop3.2/auxjars
java -jar utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig onetable.yaml

Run spark-sql with the Iceberg packages:

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.2.1 \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog" \
--conf "spark.sql.catalog.spark_catalog.type=hive" \
--conf "spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.hive_prod.type=hive"
CREATE SCHEMA iceberg_db LOCATION 's3://warehouse/';

CALL hive_prod.system.register_table(
   table => 'hive_prod.iceberg_db.people',
   metadata_file => 's3://huditest/hudi-dataset/people/metadata/v2.metadata.json'
);

Run in the MySQL client:

CREATE EXTERNAL CATALOG iceberg_catalog_hms
PROPERTIES
(
    "type" = "iceberg",
    "iceberg.catalog.type" = "hive",
    "hive.metastore.uris" = "thrift://hive-metastore:9083",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.access_key" = "admin",
    "aws.s3.secret_key" = "password",
    "aws.s3.region" = "us-east-1",
    "aws.s3.enable_ssl" = "false",
    "aws.s3.enable_path_style_access" = "true",
    "aws.s3.endpoint" = "http://minio:9000"
);
set catalog iceberg_catalog_hms;
show databases;
use iceberg_db;
show tables;

output

StarRocks > CREATE EXTERNAL CATALOG iceberg_catalog_hms
    -> PROPERTIES
    -> (
    ->     "type" = "iceberg",
    ->     "iceberg.catalog.type" = "hive",
    ->     "hive.metastore.uris" = "thrift://hive-metastore:9083",
    ->     "aws.s3.use_instance_profile" = "false",
    ->     "aws.s3.access_key" = "admin",
    ->     "aws.s3.secret_key" = "password",
    ->     "aws.s3.region" = "us-east-1",
    ->     "aws.s3.enable_ssl" = "false",
    ->     "aws.s3.enable_path_style_access" = "true",
    ->     "aws.s3.endpoint" = "http://minio:9000"
    -> );
Query OK, 0 rows affected (0.03 sec)

StarRocks > set catalog iceberg_catalog_hms;
Query OK, 0 rows affected (0.00 sec)

StarRocks > show databases;
+--------------------+
| Database           |
+--------------------+
| default            |
| hudi_onetable      |
| iceberg_db         |
| information_schema |
+--------------------+
4 rows in set (0.15 sec)

StarRocks > use iceberg_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
StarRocks > show tables;
+----------------------+
| Tables_in_iceberg_db |
+----------------------+
| people               |
+----------------------+
1 row in set (0.04 sec)

StarRocks > select * from people;
+---------------------+------------------------+-----------------------+------------------------+-------------------------------------------------------------------------+------+---------+------+------+---------------------+
| _hoodie_commit_time | _hoodie_commit_seqno   | _hoodie_record_key    | _hoodie_partition_path | _hoodie_file_name                                                       | id   | name    | age  | city | create_ts           |
+---------------------+------------------------+-----------------------+------------------------+-------------------------------------------------------------------------+------+---------+------+------+---------------------+
| 20240222213015211   | 20240222213015211_8_12 | 20240222213015211_8_0 | city=DFW               | b503b61b-b5c9-437d-81a6-3732da898e27-0_8-42-0_20240222213015211.parquet |    6 | Charlie |   31 | DFW  | 2023-08-29 00:00:00 |
| 20240222213015211   | 20240222213015211_4_8  | 20240222213015211_4_0 | city=ORD               | e5bb037a-8141-46d8-b3a8-d4211373d354-0_4-38-0_20240222213015211.parquet |    3 | Michael |   35 | ORD  | 2023-09-28 00:00:00 |
| 20240222213015211   | 20240222213015211_5_10 | 20240222213015211_5_0 | city=NYC               | a239d9f5-ceba-4ca5-ba83-51cea9a2731e-0_5-39-0_20240222213015211.parquet |    4 | Andrew  |   40 | NYC  | 2023-10-28 00:00:00 |
| 20240222213015211   | 20240222213015211_1_11 | 20240222213015211_1_0 | city=NYC               | 294108a6-8702-4bce-a704-0bcb6196e8bf-0_1-35-0_20240222213015211.parquet |    1 | John    |   25 | NYC  | 2023-09-28 00:00:00 |
| 20240222213015211   | 20240222213015211_2_7  | 20240222213015211_2_0 | city=SFO               | 467cf3fa-18fc-4aa1-a20c-8581b4abd039-0_2-36-0_20240222213015211.parquet |    2 | Emily   |   30 | SFO  | 2023-09-28 00:00:00 |
| 20240222213015211   | 20240222213015211_7_9  | 20240222213015211_7_0 | city=SEA               | fd8ec934-da12-4cb0-9b19-9aaa524b3159-0_7-41-0_20240222213015211.parquet |    5 | Bob     |   28 | SEA  | 2023-09-23 00:00:00 |
+---------------------+------------------------+-----------------------+------------------------+-------------------------------------------------------------------------+------+---------+------+------+---------------------+
6 rows in set (0.24 sec)

Run spark-sql with the Delta Lake packages:

spark-sql --packages io.delta:delta-core_2.12:2.0.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
--conf "spark.sql.catalogImplementation=hive"
CREATE SCHEMA delta_db LOCATION 's3://warehouse/';

CREATE TABLE delta_db.people USING DELTA LOCATION 's3://huditest/hudi-dataset/people';

Run in the MySQL client:

CREATE EXTERNAL CATALOG deltalake_catalog_hms
PROPERTIES
(
    "type" = "deltalake",
    "hive.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://hive-metastore:9083",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.access_key" = "admin",
    "aws.s3.secret_key" = "password",
    "aws.s3.region" = "us-east-1",
    "aws.s3.enable_ssl" = "false",
    "aws.s3.enable_path_style_access" = "true",
    "aws.s3.endpoint" = "http://minio:9000"
);
set catalog deltalake_catalog_hms;
show databases;
use delta_db;
show tables;

output

StarRocks > CREATE EXTERNAL CATALOG deltalake_catalog_hms
    -> PROPERTIES
    -> (
    ->     "type" = "deltalake",
    ->     "hive.metastore.type" = "hive",
    ->     "hive.metastore.uris" = "thrift://hive-metastore:9083",
    ->     "aws.s3.use_instance_profile" = "false",
    ->     "aws.s3.access_key" = "admin",
    ->     "aws.s3.secret_key" = "password",
    ->     "aws.s3.region" = "us-east-1",
    ->     "aws.s3.enable_ssl" = "false",
    ->     "aws.s3.enable_path_style_access" = "true",
    ->     "aws.s3.endpoint" = "http://minio:9000"
    -> );
Query OK, 0 rows affected (0.06 sec)

StarRocks > set catalog deltalake_catalog_hms;
Query OK, 0 rows affected (0.01 sec)

StarRocks > show databases;
+--------------------+
| Database           |
+--------------------+
| default            |
| delta_db           |
| hudi_onetable      |
| iceberg_db         |
| information_schema |
+--------------------+
5 rows in set (0.04 sec)

StarRocks > use delta_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
StarRocks > show tables;
+--------------------+
| Tables_in_delta_db |
+--------------------+
| people             |
+--------------------+
1 row in set (0.08 sec)

StarRocks > select * from people;
+---------------------+------------------------+-----------------------+------------------------+-------------------------------------------------------------------------+------+---------+------+------+---------------------+
| _hoodie_commit_time | _hoodie_commit_seqno   | _hoodie_record_key    | _hoodie_partition_path | _hoodie_file_name                                                       | id   | name    | age  | city | create_ts           |
+---------------------+------------------------+-----------------------+------------------------+-------------------------------------------------------------------------+------+---------+------+------+---------------------+
| 20240222213015211   | 20240222213015211_4_8  | 20240222213015211_4_0 | city=ORD               | e5bb037a-8141-46d8-b3a8-d4211373d354-0_4-38-0_20240222213015211.parquet |    3 | Michael |   35 | ORD  | 2023-09-28 00:00:00 |
| 20240222213015211   | 20240222213015211_2_7  | 20240222213015211_2_0 | city=SFO               | 467cf3fa-18fc-4aa1-a20c-8581b4abd039-0_2-36-0_20240222213015211.parquet |    2 | Emily   |   30 | SFO  | 2023-09-28 00:00:00 |
| 20240222213015211   | 20240222213015211_8_12 | 20240222213015211_8_0 | city=DFW               | b503b61b-b5c9-437d-81a6-3732da898e27-0_8-42-0_20240222213015211.parquet |    6 | Charlie |   31 | DFW  | 2023-08-29 00:00:00 |
| 20240222213015211   | 20240222213015211_7_9  | 20240222213015211_7_0 | city=SEA               | fd8ec934-da12-4cb0-9b19-9aaa524b3159-0_7-41-0_20240222213015211.parquet |    5 | Bob     |   28 | SEA  | 2023-09-23 00:00:00 |
| 20240222213015211   | 20240222213015211_1_11 | 20240222213015211_1_0 | city=NYC               | 294108a6-8702-4bce-a704-0bcb6196e8bf-0_1-35-0_20240222213015211.parquet |    1 | John    |   25 | NYC  | 2023-09-28 00:00:00 |
| 20240222213015211   | 20240222213015211_5_10 | 20240222213015211_5_0 | city=NYC               | a239d9f5-ceba-4ca5-ba83-51cea9a2731e-0_5-39-0_20240222213015211.parquet |    4 | Andrew  |   40 | NYC  | 2023-10-28 00:00:00 |
+---------------------+------------------------+-----------------------+------------------------+-------------------------------------------------------------------------+------+---------+------+------+---------------------+
6 rows in set (0.11 sec)

StarRocks >
