
impala-tpcds-kit's Introduction

TPC-DS tools for Apache Impala

The official and latest TPC-DS tools and specification can be found at tpc.org

The query templates and sample queries provided in this repo are compliant with the standards set out by the TPC-DS benchmark specification and include only minor query modifications (MQMs) as set out by section 4.2.3 of the specification. The modification list can be found in query-templates/README.md.

If you use this repo for any results publication, please see Fair Use of TPC Benchmarks.

Step 0: Environment Setup

Install the Java JDK and Maven if needed:

sudo yum -y install java-1.8.0-openjdk-devel maven

Install the necessary development tools:

sudo yum -y install git gcc make flex bison byacc curl unzip patch

Step 1: Generate Data

Data generation is done via a MapReduce wrapper around TPC-DS dsdgen. See tpcds-gen/README.md for more details on the commands to generate the flat files.
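As a rough sketch of what that looks like (the directory, parallelism -p, and scale factor -s below are illustrative values only; defer to tpcds-gen/README.md for the authoritative commands):

```shell
# Illustrative invocation of the MapReduce dsdgen wrapper.
# -d: target HDFS directory, -p: number of parallel chunks, -s: scale factor in GB.
cd tpcds-gen
hadoop jar target/tpcds-gen-1.0-SNAPSHOT.jar -d /tmp/tpc-ds/sf100/ -p 10 -s 100
```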

Step 2: Load Data

Adjust the source/text and target/Parquet schema names and flat file paths in the sql files found in the scripts/ directory. See the comments at the top of each.

Create external text file tables:

impala-shell -f impala-external.sql

Create Parquet tables:

impala-shell -f impala-parquet.sql

Load Parquet tables and compute stats:

impala-shell -f impala-insert.sql

Step 3: Run Queries

Sample queries for the 10TB scale factor can be found in the queries/ directory. The query-templates/ directory contains the Apache Impala TPC-DS query templates, which can be used with dsqgen (found in the official TPC-DS tools) to generate queries for other scale factors, or to generate more queries with different substitution variables.
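As a sketch of how dsqgen might be invoked (flag names are from the official TPC-DS tools; the template-list filename and output directory here are illustrative, so verify against your tools version):

```shell
# Illustrative only: generate Impala-dialect queries at scale factor 1000.
# templates.lst and queries-sf1000 are example names.
${TPCDS_ROOT}/tools/dsqgen \
  -DIRECTORY query-templates \
  -INPUT query-templates/templates.lst \
  -DIALECT impala \
  -SCALE 1000 \
  -OUTPUT_DIR queries-sf1000
```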

impala-tpcds-kit's People

Contributors

deinspanjer, dkumarc, drorke, gregrahn, ishaan, jimvin, lskuff, rizaon

impala-tpcds-kit's Issues

Some queries are not executable because of missing tables

It would be nice to have a list of queries that can be executed; some of them fail with errors because of missing tables.
E.g.

impala-shell -d tpcds_parquet -f q1.sql
ERROR: AnalysisException: Could not resolve table reference: 'store_returns'

reason and ship_mode external tables are created with the wrong file locations

create external table reason (
  r_reason_sk int,
  r_reason_id varchar(16),
  r_reason_desc varchar(100)
)
row format delimited fields terminated by '|'
stored as textfile
location '/tmp/tpc-ds/sf10000/web_sales'
tblproperties ('serialization.null.format'='')
;

The reason table's location

location '/tmp/tpc-ds/sf10000/web_sales'

should instead be

location '/tmp/tpc-ds/sf10000/source'

create external table ship_mode (
  sm_ship_mode_sk int,
  sm_ship_mode_id varchar(16),
  sm_type varchar(30),
  sm_code varchar(10),
  sm_carrier varchar(20),
  sm_contract varchar(20)
)
row format delimited fields terminated by '|'
stored as textfile
location '/tmp/tpc-ds/sf10000/reason'
tblproperties ('serialization.null.format'='')
;

The ship_mode table's location

location '/tmp/tpc-ds/sf10000/reason'

should instead be

location '/tmp/tpc-ds/sf10000/ship_mode'

Results are not consistent

The results from SparkSQL (1.6) and Impala (2.3) are not consistent, for example on query 8.
I do not know why. Do you have the expected answers for these queries?

-FILTER error: output not going to stdout

I just cloned impala-tpcds-kit. When running ./gen-dims.sh and ./gen-facts.sh I get:
ERROR: option 'FILTER' or its argument unknown.
The usage text that follows shows that '_FILTER' should be used instead.
With '_FILTER' there is no further error, but the data is written to a local file instead of stdout, so no data is piped to HDFS as the script intends:

${TPCDS_ROOT}/tools/dsdgen \
  -TABLE $t \
  -SCALE ${TPCDS_SCALE_FACTOR} \
  -DISTRIBUTIONS ${TPCDS_ROOT}/tools/tpcds.idx \
  -TERMINATE N \
  -FILTER Y \
  -QUIET Y | hdfs dfs -put - ${FLATFILE_HDFS_ROOT}/${t}/${t}.dat &
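A likely fix, assuming a dsdgen build whose usage text lists '_FILTER' rather than 'FILTER' (not verified against every dsdgen version), is to rename the flag in the script:

```shell
# Sketch: same pipeline as above, with the renamed stdout flag (-_FILTER).
${TPCDS_ROOT}/tools/dsdgen \
  -TABLE $t \
  -SCALE ${TPCDS_SCALE_FACTOR} \
  -DISTRIBUTIONS ${TPCDS_ROOT}/tools/tpcds.idx \
  -TERMINATE N \
  -_FILTER Y \
  -QUIET Y | hdfs dfs -put - ${FLATFILE_HDFS_ROOT}/${t}/${t}.dat &
```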

query5a.sql AnalysisException

Running query 5a on CDH 6.2 fails with: Incompatible return types 'VARCHAR(10)' and 'TIMESTAMP' of exprs 'd_date' and 'CAST('1998-08-04' AS TIMESTAMP)'.

inventory table

It seems the inventory data is not created by gen-dims.sh or gen-facts.sh, yet the inventory table is used in several queries (q21, q22, q37, q39, q72, q82).
I ran the inventory generation myself (by modifying gen-dims.sh), but it takes much longer than any other dimension table and produces 7.7 GB of data, while my largest other dimension table, customer, is 255 MB.

Is this the expected size for the inventory table given the other dimension table sizes and my TPCDS_SCALE_FACTOR of 100? (My store_sales table is 38 GB.)
Thanks

Can impala-tpcds-kit run on a single-node Hadoop cluster?

Hi,
I have set up a single-node Hadoop cluster on a cloud host.
Running: hadoop jar target/tpcds-gen-1.0-SNAPSHOT.jar -d /tmp/tpc-ds/sf10000/ -p 10 -s 100
fails with the error shown in the attached screenshot.
Do you know why? Could we discuss directly over IM, e.g. Skype or WhatsApp?
Thank you very much.

error in impala-load-store_sales.sh - int vs bigint

When I run impala-load-store_sales.sh I get this error:

ERROR: AnalysisException: Possible loss of precision for target table 'tpcds_parquet.store_sales'.
Expression 'ss_sold_date_sk' (type: BIGINT) would need to be cast to INT for column 'ss_sold_date_sk'

I believe this is because et_stores_sales.ss_sold_date_sk is a bigint but store_sales.ss_sold_date_sk is just an int.

I changed store_sales.ss_sold_date_sk to a bigint in impala-load-store_sales.sh, and the script ran without error.
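The change described above can also be sketched as a one-off statement (assuming the tpcds_parquet schema name used elsewhere in these scripts; apply before loading any data):

```shell
# Sketch: widen the Parquet column to BIGINT so the insert from the
# text table does not lose precision. Run before impala-insert.sql.
impala-shell -q "ALTER TABLE tpcds_parquet.store_sales CHANGE ss_sold_date_sk ss_sold_date_sk BIGINT"
```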

Missing tables

Why are only 10 tables generated? The queries may need as many as 25 tables.

Not generating child tables

I've been unable to generate child tables using the gen-facts.sh script. The parent tables generate fine, but the child tables do not appear after that run, and if I attempt to generate them separately I get the following:

ERROR: Table store_returns is a child; it is populated during the build of its parent (e.g., catalog_sales builds catalog returns)

I cannot find where the parent/child relationship is handled during generation. Assistance appreciated.

Impala Scripts no longer work on latest CDH

The scripts fail on the latest CDH release with the newest Impala version. All files created have this error:

ERROR: option 'FILTER' or its argument unknown.
USAGE: dsdgen [options]

Note: When defined in a parameter file (using -p), parmeters should
use the form below. Each option can also be set from the command
line, using a form of '-param [optional argument]'
Unique anchored substrings of options are also recognized, and
case is ignored, so '-sc' is equivalent to '-SCALE'

General Options

ABREVIATION = <s>        -- build table with abreviation <s>
DIR = <s>                -- generate tables in directory <s>
HELP =                   -- display this message
PARAMS = <s>             -- read parameters from file <s>
QUIET = [Y|N]            -- disable all output to stdout/stderr
SCALE = <n>              -- volume of data to generate in GB
TABLE = <s>              -- build only table <s>
UPDATE = <n>             -- generate update data set <n>
VERBOSE = [Y|N]          -- enable verbose output
PARALLEL = <n>           -- build data in <n> separate chunks
CHILD = <n>              -- generate <n>th chunk of the parallelized data
RELEASE = [Y|N]          -- display the release information
_FILTER = [Y|N]          -- output data to stdout
VALIDATE = [Y|N]         -- produce rows for data validation

Advanced Options

DELIMITER = <s>          -- use <s> as output field separator
DISTRIBUTIONS = <s>      -- read distributions from file <s>
FORCE = [Y|N]            -- over-write data files without prompting
SUFFIX = <s>             -- use <s> as output file suffix
TERMINATE = [Y|N]        -- end each record with a field delimiter
VCOUNT = <n>             -- set number of validation rows to be produced
VSUFFIX = <s>            -- set file suffix for data validation
RNGSEED = <n>            -- set RNG seed

DATE is not supported in CDH 5.16 Impala

Running impala-external.sql & impala-parquet.sql scripts on CDH 5.16.2 (impalad version 2.12.0-cdh5.16.2 RELEASE) fails with:

ERROR: AnalysisException: Unsupported data type: DATE
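A possible workaround (a sketch, not verified on CDH 5.16) is to replace the DATE columns in the DDL with STRING before running the scripts; adjust the pattern to match the scripts' actual casing and column layout:

```shell
# Sketch: rewrite DATE columns as STRING in the DDL scripts so that
# older Impala releases, which lack the DATE type, can parse them.
sed -i 's/ date,/ string,/g' impala-external.sql impala-parquet.sql
```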

Exception when generating the store_sales table

The exception below is seen while generating data for the store_sales table:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/bmv/tpcds/store_sales/store_sales_3_30.dat._COPYING_ (inode 16720): File does not exist.

benchmark script

Not an issue as such, but I can't see any script that runs the queries and captures their processing times.
Can you suggest the best way to do this?
For example, should we use impala-shell, and should we display the results on screen or write them to a file (-o) in case that affects performance?
To test multiple users, should all the impala clients connect to the same daemon or to different ones?
Thanks for any advice.

Running ./run-gen-facts.sh produces hdfs.DFSClient: DataStreamer Exception

We completed the previous steps as per the README and everything worked as expected. The fact generation step then failed:

./run-gen-facts.sh
17/01/23 11:00:43 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/impala/tpcds/store_sales/store_sales_7_96.dat._COPYING_ (inode 5391190): File does not exist. Holder DFSClient_NONMAPREDUCE_1532056676_1 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3625)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3428)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3284)

I can provide full stack trace if needed.

Closed: we forgot to run set-nodenum.sh.
