
impala-tpcds-kit's Introduction

TPC-DS tools for Apache Impala

The official and latest TPC-DS tools and specification can be found at tpc.org

The query templates and sample queries provided in this repo are compliant with the standards set out by the TPC-DS benchmark specification and include only minor query modifications (MQMs) as set out by section 4.2.3 of the specification. The modification list can be found in query-templates/README.md.

If you use this repo for any results publication, please see Fair Use of TPC Benchmarks.

Step 0: Environment Setup

Install the Java JDK and Maven if needed:

sudo yum -y install java-1.8.0-openjdk-devel maven

Install the necessary development tools:

sudo yum -y install git gcc make flex bison byacc curl unzip patch

Step 1: Generate Data

Data generation is done via a MapReduce wrapper around TPC-DS dsdgen. See tpcds-gen/README.md for more details on the commands to generate the flat files.
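As a rough sketch of what that looks like (the directory, parallelism -p, and scale factor -s below are illustrative values only; defer to tpcds-gen/README.md for the authoritative commands):

```shell
# Illustrative invocation of the MapReduce dsdgen wrapper.
# -d: target HDFS directory, -p: number of parallel chunks, -s: scale factor in GB.
cd tpcds-gen
hadoop jar target/tpcds-gen-1.0-SNAPSHOT.jar -d /tmp/tpc-ds/sf100/ -p 10 -s 100
```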

Step 2: Load Data

Adjust the source/text and target/Parquet schema names and flat file paths in the sql files found in the scripts/ directory. See the comments at the top of each.

Create external text file tables:

impala-shell -f impala-external.sql

Create Parquet tables:

impala-shell -f impala-parquet.sql

Load Parquet tables and compute stats:

impala-shell -f impala-insert.sql

Step 3: Run Queries

Sample queries for the 10TB scale factor can be found in the queries/ directory. The query-templates/ directory contains the Apache Impala TPC-DS query templates, which can be used with dsqgen (found in the official TPC-DS tools) to generate queries for other scale factors, or to generate more queries with different substitution variables.
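As a sketch of how dsqgen might be invoked (flag names are from the official TPC-DS tools; the template-list filename and output directory here are illustrative, so verify against your tools version):

```shell
# Illustrative only: generate Impala-dialect queries at scale factor 1000.
# templates.lst and queries-sf1000 are example names.
${TPCDS_ROOT}/tools/dsqgen \
  -DIRECTORY query-templates \
  -INPUT query-templates/templates.lst \
  -DIALECT impala \
  -SCALE 1000 \
  -OUTPUT_DIR queries-sf1000
```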

impala-tpcds-kit's People

Contributors

deinspanjer, dkumarc, drorke, gregrahn, ishaan, jimvin, lskuff, rizaon

impala-tpcds-kit's Issues

Some queries are not executable because of missing tables

It would be nice to have a list of queries that can be executed; some of them fail with errors because of missing tables.
E.g.

impala-shell -d tpcds_parquet -f q1.sql
ERROR: AnalysisException: Could not resolve table reference: 'store_returns'

reason and ship_mode external tables are created with the wrong file locations

create external table reason (
  r_reason_sk int,
  r_reason_id varchar(16),
  r_reason_desc varchar(100)
)
row format delimited fields terminated by '|'
stored as textfile
location '/tmp/tpc-ds/sf10000/web_sales'
tblproperties ('serialization.null.format'='')
;

The reason table's location

location '/tmp/tpc-ds/sf10000/web_sales'

should instead be

location '/tmp/tpc-ds/sf10000/source'

create external table ship_mode (
  sm_ship_mode_sk int,
  sm_ship_mode_id varchar(16),
  sm_type varchar(30),
  sm_code varchar(10),
  sm_carrier varchar(20),
  sm_contract varchar(20)
)
row format delimited fields terminated by '|'
stored as textfile
location '/tmp/tpc-ds/sf10000/reason'
tblproperties ('serialization.null.format'='')
;

The ship_mode table's location

location '/tmp/tpc-ds/sf10000/reason'

should instead be

location '/tmp/tpc-ds/sf10000/ship_mode'

Results are not consistent

The results from SparkSQL (1.6) and Impala (2.3) are not consistent, for example on query 8.
I do not know why. Do you have the expected answers for these queries?

-FILTER error: output not going to stdout

I just cloned impala-tpcds-kit. When running ./gen-dims.sh and ./gen-facts.sh I get:
ERROR: option 'FILTER' or its argument unknown.
The usage text that follows shows that '_FILTER' should be used instead.
With '_FILTER' there is no further error, but the data is written to a local file instead of stdout, so no data is piped to HDFS as the script intends:

${TPCDS_ROOT}/tools/dsdgen \
  -TABLE $t \
  -SCALE ${TPCDS_SCALE_FACTOR} \
  -DISTRIBUTIONS ${TPCDS_ROOT}/tools/tpcds.idx \
  -TERMINATE N \
  -FILTER Y \
  -QUIET Y | hdfs dfs -put - ${FLATFILE_HDFS_ROOT}/${t}/${t}.dat &
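A likely fix, assuming a dsdgen build whose usage text lists '_FILTER' rather than 'FILTER' (not verified against every dsdgen version), is to rename the flag in the script:

```shell
# Sketch: same pipeline as above, with the renamed stdout flag (-_FILTER).
${TPCDS_ROOT}/tools/dsdgen \
  -TABLE $t \
  -SCALE ${TPCDS_SCALE_FACTOR} \
  -DISTRIBUTIONS ${TPCDS_ROOT}/tools/tpcds.idx \
  -TERMINATE N \
  -_FILTER Y \
  -QUIET Y | hdfs dfs -put - ${FLATFILE_HDFS_ROOT}/${t}/${t}.dat &
```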

query5a.sql AnalysisException

Running query 5a on CDH 6.2 fails with: Incompatible return types 'VARCHAR(10)' and 'TIMESTAMP' of exprs 'd_date' and 'CAST('1998-08-04' AS TIMESTAMP)'.

inventory table

It seems the inventory data is not created by gen-dims.sh or gen-facts.sh, yet the inventory table is used in several queries (q21, q22, q37, q39, q72, q82).
I ran the inventory generation myself (by modifying gen-dims.sh), but it takes much longer than any other dimension table and produces 7.7 GB of data, while my largest other dimension table, customer, is 255 MB.

Is this the expected size for the inventory table given the other dimension table sizes and my TPCDS_SCALE_FACTOR of 100? (My store_sales table is 38 GB.)
Thanks

Can impala-tpcds-kit run on a single-node Hadoop cluster?

Hi,
I have set up a single-node Hadoop cluster on a cloud host.
Running: hadoop jar target/tpcds-gen-1.0-SNAPSHOT.jar -d /tmp/tpc-ds/sf10000/ -p 10 -s 100
fails with the error shown in the attached screenshot.
Do you know why? Could we discuss directly over IM, e.g. Skype or WhatsApp?
Thank you very much.

error in impala-load-store_sales.sh - int vs bigint

When I run impala-load-store_sales.sh I get this error:

ERROR: AnalysisException: Possible loss of precision for target table 'tpcds_parquet.store_sales'.
Expression 'ss_sold_date_sk' (type: BIGINT) would need to be cast to INT for column 'ss_sold_date_sk'

I believe this is because et_stores_sales.ss_sold_date_sk is a bigint but store_sales.ss_sold_date_sk is just an int.

I changed store_sales.ss_sold_date_sk to a bigint in impala-load-store_sales.sh, and the script ran without error.
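The change described above can also be sketched as a one-off statement (assuming the tpcds_parquet schema name used elsewhere in these scripts; apply before loading any data):

```shell
# Sketch: widen the Parquet column to BIGINT so the insert from the
# text table does not lose precision. Run before impala-insert.sql.
impala-shell -q "ALTER TABLE tpcds_parquet.store_sales CHANGE ss_sold_date_sk ss_sold_date_sk BIGINT"
```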

Missing tables

Why are only 10 tables generated? The queries may need as many as 25 tables.

Not generating child tables

I've been unable to generate child tables using the gen-facts.sh script. The parent tables generate fine, but the child tables do not appear after that run, and if I attempt to generate them separately I get the following:

ERROR: Table store_returns is a child; it is populated during the build of its parent (e.g., catalog_sales builds catalog returns)

I cannot find where the parent/child relationship is handled during generation. Assistance appreciated.

Impala Scripts no longer work on latest CDH

The scripts fail on the latest CDH release with the newest Impala version. All files created have this error:

ERROR: option 'FILTER' or its argument unknown.
USAGE: dsdgen [options]

Note: When defined in a parameter file (using -p), parmeters should
use the form below. Each option can also be set from the command
line, using a form of '-param [optional argument]'
Unique anchored substrings of options are also recognized, and
case is ignored, so '-sc' is equivalent to '-SCALE'

General Options

ABREVIATION = <s>        -- build table with abreviation <s>
DIR = <s>                -- generate tables in directory <s>
HELP =                   -- display this message
PARAMS = <s>             -- read parameters from file <s>
QUIET = [Y|N]            -- disable all output to stdout/stderr
SCALE = <n>              -- volume of data to generate in GB
TABLE = <s>              -- build only table <s>
UPDATE = <n>             -- generate update data set <n>
VERBOSE = [Y|N]          -- enable verbose output
PARALLEL = <n>           -- build data in <n> separate chunks
CHILD = <n>              -- generate <n>th chunk of the parallelized data
RELEASE = [Y|N]          -- display the release information
_FILTER = [Y|N]          -- output data to stdout
VALIDATE = [Y|N]         -- produce rows for data validation

Advanced Options

DELIMITER = <s>          -- use <s> as output field separator
DISTRIBUTIONS = <s>      -- read distributions from file <s>
FORCE = [Y|N]            -- over-write data files without prompting
SUFFIX = <s>             -- use <s> as output file suffix
TERMINATE = [Y|N]        -- end each record with a field delimiter
VCOUNT = <n>             -- set number of validation rows to be produced
VSUFFIX = <s>            -- set file suffix for data validation
RNGSEED = <n>            -- set RNG seed

DATE is not supported in CDH 5.16 Impala

Running impala-external.sql & impala-parquet.sql scripts on CDH 5.16.2 (impalad version 2.12.0-cdh5.16.2 RELEASE) fails with:

ERROR: AnalysisException: Unsupported data type: DATE
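A possible workaround (a sketch, not verified on CDH 5.16) is to replace the DATE columns in the DDL with STRING before running the scripts; adjust the pattern to match the scripts' actual casing and column layout:

```shell
# Sketch: rewrite DATE columns as STRING in the DDL scripts so that
# older Impala releases, which lack the DATE type, can parse them.
sed -i 's/ date,/ string,/g' impala-external.sql impala-parquet.sql
```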

Exception when generating the store_sales table

The exception below is seen while generating data for the store_sales table:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/bmv/tpcds/store_sales/store_sales_3_30.dat._COPYING_ (inode 16720): File does not exist.

benchmark script

Not an issue as such, but I can't see any script that runs the queries and captures their processing times.
Can you suggest the best way to do this?
For example, should we use impala-shell, and should we display the results on screen or write them to a file (-o) in case that affects performance?
To test multiple users, should all the impala clients connect to the same daemon or to different ones?
Thanks for any advice.

Running ./run-gen-facts.sh produces hdfs.DFSClient: DataStreamer Exception

We completed the previous steps as per the README and everything worked as expected. The fact generation step then failed:

./run-gen-facts.sh
17/01/23 11:00:43 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/impala/tpcds/store_sales/store_sales_7_96.dat._COPYING_ (inode 5391190): File does not exist. Holder DFSClient_NONMAPREDUCE_1532056676_1 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3625)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3428)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3284)

I can provide full stack trace if needed.

Closed: we forgot to run set-nodenum.sh.
