
databrickslabs / dbldatagen

265 stars · 14 watchers · 50 forks · 10.25 MB

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) can be used to generate large simulated/synthetic data sets for testing, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines.

Home Page: https://databrickslabs.github.io/dbldatagen

License: Other

Python 99.11% Makefile 0.88% Shell 0.01%
datagen pyspark python data-generation faker spark spark-streaming delta-live-tables deltalake databricks

dbldatagen's People

Contributors

alexott, burakyilmaz321, dependabot[bot], marvinschenkel, nathanknox, nfx, pohlposition, ronanstokes-db


dbldatagen's Issues

Uninstalling and reinstalling wheel on cluster running DBR 8.3 may fail

If a named cluster specification in your Databricks environment has the current or a previous build of the data generator installed, uninstalling the library and then reinstalling it may fail.

Expected Behavior

Uninstall followed by reinstall should succeed

Current Behavior

Uninstall followed by re-install may fail.

Workaround

  • Make sure the wheel does not have a name like dbldatagen-0.2.0rc1-py3-none-any.whl (1), which can result from downloading the file multiple times on the same machine
  • Don't use a saved cluster definition - use a new cluster definition

Our plan is to move to a pip-based install, which should make installation easier.

Your Environment

  • dbldatagen version used: release candidate 2
  • Databricks Runtime version: Databricks 8.3
  • Cloud environment used: Azure

Build fails due to changes to build runner 'ubuntu-latest'

Expected Behavior

Build succeeds

Current Behavior

Build during PR check fails

Steps to Reproduce (for bugs)

Create a pull request - the build will fail due to changes to the default Python version in ubuntu-latest.

Context

The fix is to explicitly install Python 3.8 in the GitHub Action used for the build.

Your Environment

  • dbldatagen version used: 0.3.0
  • Databricks Runtime version: various
  • Cloud environment used: shows in build process, before deployment to cloud

Improve speed and coverage of unit tests

Expected Behavior

Unit tests should run faster - a current run takes 10-12 minutes.

Proposed enhancements

  • Make use of all cores when running unit tests
  • Convert critical tests to use pytest rather than unittest
  • Use default parallelism as the default number of partitions (see the sketch after this list)
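
A minimal sketch of the direction this could take, assuming pytest with the pytest-xdist plugin (the fixture below is hypothetical, not existing project code):

# conftest.py - shared, session-scoped SparkSession for the test suite
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (SparkSession.builder
               .master("local[*]")            # use all local cores
               .appName("dbldatagen-tests")
               .getOrCreate())
    # align shuffle partitions with the default parallelism
    session.conf.set("spark.sql.shuffle.partitions",
                     session.sparkContext.defaultParallelism)
    yield session
    session.stop()

The suite could then be run in parallel with, for example, pytest -n auto.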

Unable to create columns containing lists of strings or integers

Expected Behavior

Given a .withColumn() call with values equal to a list of lists (such as [['A'], ['A', 'B'], ['A', 'B', 'C']]), a column should be created whose possible values are the lists ['A'], ['A', 'B'], and ['A', 'B', 'C'].

Current Behavior

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [A]

Steps to Reproduce (for bugs)

Reproducible example:

def test_array_column():
    import dbldatagen as dg
    from pyspark.sql import SparkSession
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()
    f_spec = (dg.DataGenerator(spark, name='mock-glow-genetic-data', rows=10)
              .withColumn('altAlleles', ArrayType(StringType()), values=[['A']]))
    f_spec.build()

Context

Attempting to generate mock genetic data for the GloWGR and Hail frameworks, where some columns are lists of integers or strings
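
A possible interim workaround (untested sketch): build the array from a Spark SQL array() expression via the expr parameter, rather than passing Python lists through values. The literals used here are illustrative only.

import dbldatagen as dg
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Generate the list column from a SQL expression instead of literal Python lists
f_spec = (dg.DataGenerator(spark, name='mock-glow-genetic-data', rows=10)
          .withColumn('altAlleles', ArrayType(StringType()),
                      expr="array('A', 'B')"))
df = f_spec.build()
df.show(truncate=False)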

Your Environment

  • dbldatagen version used: 0.2.0rc1
  • Databricks Runtime version: Executed locally on Mac OS
  • Cloud environment used: Executed locally on Mac OS

Problem with column which is named "ID"

Expected Behavior

Generation of column with name "ID".

Current Behavior

Exception:
AnalysisException: Reference 'ID' is ambiguous, could be: ID, ID.

Steps to Reproduce (for bugs)

import dbldatagen as dg
from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()

dg.DataGenerator(spark, rows=100, partitions=2).withColumn("ID", T.StringType()).build()

Context

I see that the problem is here
It may be worked around by renaming my column "ID" to "ID_" before generation and then renaming it back afterwards, but that looks a little hacky for production... Why not use something less likely to collide for the internal ID column, like datagen__technical__inner__id for example?
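
A sketch of the rename workaround described above (illustrative only):

import dbldatagen as dg
from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()

# Generate the column under a temporary name, then rename it back to "ID"
# on the built DataFrame
df = (dg.DataGenerator(spark, rows=100, partitions=2)
      .withColumn("ID_", T.StringType())
      .build()
      .withColumnRenamed("ID_", "ID"))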

Your Environment

  • dbldatagen version used: 0.2.0rc1
  • Databricks Runtime version: 10.4 LTS
  • Cloud environment used: AWS

tutorial/basics.py incomplete at the end

The end of the current tutorial/basics.py is incomplete where it uses the when clause:

df = spark.range(10000)

df = df.withColumn("test", df.when())

display(df)

This code, on the other hand, would work:

from pyspark.sql.functions import when

df = spark.range(10000)

df = df.withColumn("test", when(df.id > 5000, "UPPER").otherwise("LOWER"))

display(df)

Add support for Spark 3.1.2

Expected Behavior

Add explicit support for Spark 3.1.2 (included in Databricks Runtime 9.1)

Current Behavior

The current versions of the framework work in Databricks 9.1 (which is based on Spark 3.1.2). However, there are some new features in Spark 3.1.2 that will tidy up the syntax for some date and time constructs.

time string for interval should support both `seconds` and `second`

Expected Behavior

Time intervals can be specified as "12 minutes, 2 seconds". You can also specify "1 minute, 2 seconds". You should be able to specify "1 minute 1 second"

Current Behavior

"1 minute 1 seconds" works but "1 minute 1 second" does not

Steps to Reproduce (for bugs)

Will add test case with bug fix

Context

Your Environment

  • dbldatagen version used: 0.2.0 RC0
  • Databricks Runtime version:
  • Cloud environment used:

Distribution functions (and perhaps others) not compatible with Databricks UC Clusters operating in `shared` mode

Expected Behavior

  1. Set up UC-enabled Databricks interactive cluster.
  2. Use Dbldatagen to create spec using distributions ("dist") functions for data generation.

Expected result: a Dataframe with a million ints in it

Current Behavior

AnalysisException: [UC_COMMAND_NOT_SUPPORTED] UDF/UDAF functions are not supported in Unity Catalog.;
Project [id#835L, cast(((round((gamma_func(1.0, 2.0, 2678957030506407010)#837 * cast(99 as float)), 0) * cast(1 as float)) + cast(1 as float)) as int) AS ip_address#838]
+- Range (0, 1000000, step=1, splits=Some(256))

Steps to Reproduce (for bugs)

  1. Set up UC Cluster
from dbldatagen import DataGenerator
import dbldatagen.distributions as dist

shuffle_partitions_requested = 256
partitions_requested = 256
data_rows = 1 * 1000000  # 1 million

# partition parameters etc.
spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_requested)

dfDataspec = (DataGenerator(spark, rows=data_rows, partitions=partitions_requested)          
                .withColumn("ip_address", "int", minValue=1, maxValue=100, random=True, 
                            distribution=dist.Gamma(1.0,2.0)) 
              )
df = dfDataspec.build()

Context

Your Environment

  • dbldatagen version used: 0.30.0
  • Databricks Runtime version: 12.0 (UC enabled)
  • Cloud environment used: Azure

TODO: improve string generation

Idea:

Improve string generation to support:

  • generation of dummy emails
  • generation of dummy credit card numbers
  • generation of strings of random number of words
  • generation of strings of fixed number of random words
  • generation of strings conforming to a mask (see the template sketch after this list)
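
Some of these can already be approximated with the existing template option (as used elsewhere in these issues); a rough, illustrative sketch - dedicated generators (for example, card numbers with valid checksums) would still be an improvement:

import dbldatagen as dg

# Rough approximations using templates; column names and patterns are illustrative.
# Assumes a SparkSession named `spark`, as in a Databricks notebook.
df = (dg.DataGenerator(spark, rows=1000, partitions=4)
      .withColumn("email", "string", template=r'\w.\w@\w.com')
      .withColumn("card_number", "string", template="dddd dddd dddd dddd")
      .build())
df.show(5, truncate=False)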

DblDatagenerator causes global logger to issue messages twice in some circumstances

Expected Behavior

The logging within the dbldatagenerator package respects the logger that is being used in the package that uses dbldatagenerator.

Current Behavior

dbldatagenerator sometimes calls logging.info and logging.debug without calling getLogger() first, resulting in overwriting whatever logger is being used in the caller's package.

Examples:

Since this happens inside _version.py, which is called from __init__.py, this behaviour occurs whenever we import any class from the dbldatagen package.

Steps to Reproduce (for bugs)

import logging
date_format = "%Y-%m-%d %H:%M:%S"
log_format = "%(asctime)s %(levelname)-8s  %(message)s"
formatter = logging.Formatter(log_format, date_format)
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger = logging.getLogger(__name__)
logger.setLevel(level=logging.INFO)
logger.addHandler(handler)
logger.info("Info message")
# Prints: 2023-03-08 14:18:59 INFO      Info message
from dbldatagen import DataGenerator
# Prints: INFO: Version : VersionInfo(major='0', minor='3', patch='1', release='', build='')
logger.info("Info message")
# Prints:
# 2023-03-08 14:18:59 INFO      Info message
# INFO: Info message
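
For reference, a minimal sketch of the requested behaviour (illustrative only, not the package's actual code): obtain a module-scoped logger inside dbldatagen instead of calling the root-level logging functions, so the caller's configuration is left untouched.

import logging

# e.g. inside _version.py: use a module-level logger rather than logging.info()/logging.debug()
logger = logging.getLogger(__name__)
logger.debug("Version : %s", "0.3.1")  # illustrative message only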

Your Environment

  • dbldatagen version used: 0.3.1
  • Databricks Runtime version: NA
  • Cloud environment used: NA

Make dependency on pyspark development-only, not for the package

Right now, installing the data generator also installs OSS pyspark, which may interfere with the Databricks implementation. For such cases the pyspark dependency should usually be made optional/development-only, relying on something like the findspark package to discover the working PySpark installation. Something like this:

Processing /dbfs/FileStore/wheels/databrickslabs_testdatagenerator-0.10.0_prerel5-py3-none-any.whl
Collecting pyspark>=2.4.0
  Downloading pyspark-3.1.2.tar.gz (212.4 MB)
Requirement already satisfied: numpy in /databricks/python3/lib/python3.8/site-packages (from databrickslabs-testdatagenerator==0.10.0-prerel5) (1.19.2)
Requirement already satisfied: pandas in /databricks/python3/lib/python3.8/site-packages (from databrickslabs-testdatagenerator==0.10.0-prerel5) (1.1.3)
Requirement already satisfied: pyarrow>=0.8.0 in /databricks/python3/lib/python3.8/site-packages (from databrickslabs-testdatagenerator==0.10.0-prerel5) (1.0.1)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Requirement already satisfied: pytz>=2017.2 in /databricks/python3/lib/python3.8/site-packages (from pandas->databrickslabs-testdatagenerator==0.10.0-prerel5) (2020.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /databricks/python3/lib/python3.8/site-packages (from pandas->databrickslabs-testdatagenerator==0.10.0-prerel5) (2.8.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->databrickslabs-testdatagenerator==0.10.0-prerel5) (1.15.0)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py): started
  Building wheel for pyspark (setup.py): finished with status 'done'
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880769 sha256=5ee3c3a1242c356be086dd2f52db9fd511cc9d403bbc575dd69bdee91e139cd0
  Stored in directory: /root/.cache/pip/wheels/df/88/9e/58ef1f74892fef590330ca0830b5b6d995ba29b44f977b3926
Successfully built pyspark
Installing collected packages: py4j, pyspark, databrickslabs-testdatagenerator
Successfully installed databrickslabs-testdatagenerator-0.10.0-prerel5 py4j-0.10.9 pyspark-3.1.2
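
A minimal sketch of one way to express this with setuptools extras (package names and version pins here are illustrative, not the project's actual packaging):

# setup.py sketch: keep pyspark out of install_requires and expose it as an extra
from setuptools import setup, find_packages

setup(
    name="dbldatagen",
    packages=find_packages(),
    install_requires=["numpy", "pandas", "pyarrow"],
    extras_require={
        # installed only on request, e.g. pip install "dbldatagen[dev]"
        "dev": ["pyspark>=3.1.2"],
    },
)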

return_delay missing from distribution example

The Distributions examples mention that return_delay is used and omitted, but it is missing from the actual code example.

The code example is currently:

from pyspark.sql.types import IntegerType

import dbldatagen as dg

row_count = 1000 * 100
testDataSpec = (dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                                 partitions=4, randomSeedMethod='hash_fieldname',
                                 verbose=True)
                .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
                .withColumn("product_code", IntegerType(), uniqueValues=10000, random=True)
                .withColumn("purchase_date", "date",
                            data_range=dg.DateRange("2017-10-01 00:00:00",
                                                    "2018-10-06 11:55:00",
                                                    "days=3"),
                            random=True)
                .withColumn("return_date", "date",
                    expr="date_add('purchase_date', cast(floor(rand() * 100 + 1) as int))")

                )

dfTestData = testDataSpec.build()

I tried the following two examples, which did not work, but I am not sure if they are what the example was trying to show:

Example 1:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import expr


import dbldatagen as dg

row_count = 1000 * 100
testDataSpec = (dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                                 partitions=4, randomSeedMethod='hash_fieldname',
                                 verbose=True)
                .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
                .withColumn("product_code", IntegerType(), uniqueValues=10000, random=True)
                .withColumn("purchase_date", "date",
                            data_range=dg.DateRange("2017-10-01 00:00:00",
                                                    "2018-10-06 11:55:00",
                                                    "days=3"),
                            random=True)
                .withColumn("return_delay", IntegerType(), values=[-1, -2, -3], weights=[9, 2, 1],
                            random=True, omit=True)
                .withColumn("return_date", "date", expr="date_add('purchase_date', 'return_delay')")

                )

dfTestData = testDataSpec.build()
display(dfTestData)

Error 1:

AnalysisException: The second argument of 'date_add' function needs to be an integer.

Example 2:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import expr


import dbldatagen as dg

row_count = 1000 * 100
testDataSpec = (dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                                 partitions=4, randomSeedMethod='hash_fieldname',
                                 verbose=True)
                .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
                .withColumn("product_code", IntegerType(), uniqueValues=10000, random=True)
                .withColumn("purchase_date", "date",
                            data_range=dg.DateRange("2017-10-01 00:00:00",
                                                    "2018-10-06 11:55:00",
                                                    "days=3"),
                            random=True)
                .withColumn("return_delay", IntegerType(), values=[-1, -2, -3], weights=[9, 2, 1],
                            random=True, omit=True)
                .withColumn("return_date", "date", expr="date_add(purchase_date, return_delay)")

                )

dfTestData = testDataSpec.build()
display(dfTestData)

Error 2:

AnalysisException: cannot resolve '`purchase_date`' given input columns: [id]; line 1 pos 9;

The following works, but I am not sure it is the correct way to do this:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import expr


import dbldatagen as dg

row_count = 1000 * 100
testDataSpec = (dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                                 partitions=4, randomSeedMethod='hash_fieldname',
                                 verbose=True)
                .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
                .withColumn("product_code", IntegerType(), uniqueValues=10000, random=True)
                .withColumn("purchase_date", "date",
                            data_range=dg.DateRange("2017-10-01 00:00:00",
                                                    "2018-10-06 11:55:00",
                                                    "days=3"),
                            random=True)
                .withColumn("return_delay", IntegerType(), values=[-1, -2, -3], weights=[9, 2, 1],
                            random=True)

                )

dfTestData = testDataSpec.build()
dfTestData = dfTestData.withColumn("return_date", expr("date_add(purchase_date, return_delay)"))
display(dfTestData)
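
For reference, an untested variant that keeps return_date inside the spec by declaring its dependencies explicitly with baseColumn (following the baseColumn usage described in other issues in this list); whether this is the intended pattern would need confirmation:

from pyspark.sql.types import IntegerType
import dbldatagen as dg

row_count = 1000 * 100
testDataSpec = (dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                                 partitions=4, randomSeedMethod='hash_fieldname')
                .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
                .withColumn("purchase_date", "date",
                            data_range=dg.DateRange("2017-10-01 00:00:00",
                                                    "2018-10-06 11:55:00",
                                                    "days=3"),
                            random=True)
                .withColumn("return_delay", IntegerType(), values=[-1, -2, -3], weights=[9, 2, 1],
                            random=True, omit=True)
                # declare the dependency so both columns exist before the expression is evaluated
                .withColumn("return_date", "date",
                            expr="date_add(purchase_date, return_delay)",
                            baseColumn=["purchase_date", "return_delay"])
                )

dfTestData = testDataSpec.build()
display(dfTestData)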

Cast exception when nested schema passed as input

Expected Behavior

The nested schema shown below should be processed correctly.

Current Behavior

Cast exception is thrown as follows:

PySpark version: 3.3.1
PySpark SparkContext version: 3.3.1
StructType([StructField('id', LongType(), True), StructField('city', StructType([StructField('id', LongType(), True), StructField('population', LongType(), True)]), True)])
Traceback (most recent call last):
  File "nested-schema.py", line 24, in <module>
    res1 = gen1.build(withTempView=True)
  File "/home/pramod/.local/lib/python3.8/site-packages/dbldatagen/data_generator.py", line 925, in build
    df1 = self._buildColumnExpressionsWithSelects(df1)
  File "/home/pramod/.local/lib/python3.8/site-packages/dbldatagen/data_generator.py", line 972, in _buildColumnExpressionsWithSelects
    df1 = df1.select(*build_round)
  File "/home/pramod/.local/lib/python3.8/site-packages/pyspark/sql/dataframe.py", line 2023, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/home/pramod/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/home/pramod/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve '(id + CAST(0 AS BIGINT))' due to data type mismatch: cannot cast bigint to struct<id:bigint,population:bigint>;
'Project [id#0L, cast(cast((id#0L + cast(0 as bigint)) as struct<id:bigint,population:bigint>) as struct<id:bigint,population:bigint>) AS city#2]
+- Range (0, 10, step=1, splits=Some(2))

Steps to Reproduce (for bugs)

The code below is run as $ python3.8 nested_schema.py
PySpark version is 3.3.1

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType, DoubleType
import dbldatagen as datagen

spark = SparkSession.builder \
    .master("local[4]") \
    .appName("Nested Schema") \
    .getOrCreate()

print('PySpark version: ' + spark.version)
print('PySpark SparkContext version: ' + spark.sparkContext.version)

struct_type = StructType([
                StructField('id', LongType(), True),
                StructField("city", StructType([
                    StructField('id', LongType(), True),
                    StructField('population', LongType(), True)
                ]), True)])

print(struct_type)

gen1 = datagen.DataGenerator(sparkSession=spark, name="nested_schema", rows=10, partitions=2) \
      .withSchema(struct_type).withColumn("id")
res1 = gen1.build(withTempView=True)
res1.show(res1.count())
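
An untested workaround sketch: generate the leaf fields as flat columns and assemble the struct with a Spark SQL named_struct() expression after building (column and field names follow the schema above):

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType
import dbldatagen as datagen

spark = SparkSession.builder \
    .master("local[4]") \
    .appName("Nested Schema Workaround") \
    .getOrCreate()

gen1 = (datagen.DataGenerator(sparkSession=spark, name="nested_schema_flat", rows=10, partitions=2)
        .withIdOutput()
        .withColumn("city_id", LongType())
        .withColumn("city_population", LongType(), minValue=1000, maxValue=1000000))

# Compose the nested struct from the flat columns on the built DataFrame
res1 = (gen1.build()
        .selectExpr("id",
                    "named_struct('id', city_id, 'population', city_population) AS city"))
res1.printSchema()
res1.show()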

Context

Your Environment

Improve template text generator

Expected Behavior

The template generator should generate the same text from run to run. The proposal is to use the NumPy random number generator (and a vectorized implementation) to improve repeatability and performance.

Current Behavior

The existing implementation uses the Python random number generator, which is slower and has issues with repeatability.
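
For illustration, a minimal sketch of the kind of seeded, vectorized generation NumPy enables (not the project's implementation):

import numpy as np

# A seeded Generator yields identical output from run to run, and produces all
# choices in a single vectorized call.
rng = np.random.default_rng(seed=42)
letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))
chars = rng.choice(letters, size=(5, 8))        # 5 strings of 8 random letters
words = ["".join(row) for row in chars]
print(words)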

Steps to Reproduce (for bugs)

Context

Your Environment

  • dbldatagen version used:
  • Databricks Runtime version:
  • Cloud environment used:

Collect telemetry on usage

There needs to be a way to collect telemetry on who is using the project and which methods are used. This way we'd have a data-driven way to measure the success of this project and improve quality in the most frequently used parts of the project.

The simplest option would be sending a cluster status check in every method with the following user agent header:

User-Agent: Databricks-Labs-Data-Generator/VERSION (+method_name)

use of base_columns should be allowed as alias for base_column with multiple base_columns

Code to do this is already present, but it does not work for the following snippet:

import dbldatagen as dg
from pyspark.sql.types import StructType, StructField, StringType

shuffle_partitions_requested = 8
partitions_requested = 8
data_rows = 10000000

dataspec = (dg.DataGenerator(spark, rows=10000000, partitions=8)
            .withColumn("name", percent_nulls=1.0, template=r'\w \w|\w a. \w')
            .withColumn("payment_instrument_type", values=['paypal', 'visa', 'mastercard', 'amex'], random=True)
            .withColumn("payment_instrument", minValue=1000000, maxValue=10000000, template="dddd dddddd ddddd")
            .withColumn("email", template=r'\w.\w@\w.com')
            .withColumn("md5_payment_instrument",
                        expr="md5(concat(payment_instrument_type, ':', payment_instrument))",
                        base_columns=['payment_instrument_type', 'payment_instrument'])
            )
df1 = dataspec.build()

df1.display()

Improvement: Want to have more seamless mechanism for generating CDC updates

Currently you can generate simulated updates to an existing data set by either

  • sampling results from an existing dataset and updating fields
  • restricting the number of unique values for a dataset's primary key or composite primary key fields so that you get naturally repeated rows

It would be useful to be able to specify a specific number of updates or range of updates per primary key
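
For reference, a sketch of the second approach as it works today - capping the number of distinct key values so that repeated keys act as simulated updates (names and ranges are illustrative; assumes a SparkSession named `spark`):

import dbldatagen as dg

changes_spec = (dg.DataGenerator(spark, name="cdc_updates", rows=100000, partitions=8)
                # 10,000 distinct keys across 100,000 rows => roughly 10 updates per key
                .withColumn("customer_id", "long", uniqueValues=10000, random=True)
                .withColumn("status", "string", values=["active", "suspended", "closed"], random=True)
                .withColumn("change_date", "date",
                            data_range=dg.DateRange("2023-01-01 00:00:00",
                                                    "2023-12-31 23:59:59",
                                                    "days=1"),
                            random=True))
df_changes = changes_spec.build()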

Remove conda dependency

make create-dev-env
conda create -n dbl_testdatagenerator python=3.7.5
make: conda: No such file or directory
make: *** [create-dev-env] Error 1

People working on the code won't necessarily have conda installed; therefore, the build must not depend on it.

install_requires is pretty small. Installing conda messes with all Python interpreters on the machine.

Setuptools needs to include required packages for working with the library locally (outside of the Databricks environment)

Expected Behavior

When the package is installed, I expect it to install the necessary dependencies.

Current Behavior

It does not at the moment; it assumes that this package is going to run on Databricks, which is usually the case. However, if I am developing code locally, it becomes a problem during testing.

Steps to Reproduce (for bugs)

pip install dbldatagen

Then try to run some code locally for example tests.

It does not install the required packages:
numpy = "1.22.0"
pyspark = "3.1.3"
pyarrow = "1.0.1"
pandas = "1.1.3"
pyparsing = ">=2.4.7,<3.0.9"

Context

Your Environment

local mac computer

Documentation issues

Steps to Reproduce (for bugs)

Documentation content only

Context

The first example in the Data Ranges documentation does not include the correct definition for returnDate.

Changed interim build labelling to comply with PEP 440

Expected Behavior

Build labelling for pre-release builds should be changed to comply with PEP 440.

Current Behavior

Labelling for prerelease builds currently uses the form 0.3.1-a1 - the format needs to be 0.3.1a1.

Steps to Reproduce (for bugs)

Context

Your Environment

  • dbldatagen version used:
  • Databricks Runtime version:
  • Cloud environment used:

issue with use of template with value r"dr_\\v"

import dbldatagen as dg
from pyspark.sql.types import StructType, StructField, StringType

shuffle_partitions_requested = 8
partitions_requested = 8
data_rows = 10000000

schema = StructType([
    StructField("name", StringType(), True),
    StructField("ser", StringType(), True),
    StructField("license_plate", StringType(), True),
    StructField("email", StringType(), True),
])

dataspec = (dg.DataGenerator(spark, rows=10000000, partitions=8)
            .withSchema(schema))

dataspec = (dataspec
            .withColumnSpec("name", percent_nulls=1.0, template=r'\w \w|\w a. \w')
            .withColumnSpec("ser", minValue=1000000, maxValue=10000000, template=r"dr_\v")
            .withColumnSpec("email", template=r'\w.\w@\w.com')
            .withColumnSpec("license_plate", template=r'\n-\n')
            )
df1 = dataspec.build()

df1.display()

Improve build ordering dependencies

Expected Behavior

When a column contains a SQL expression that references other columns defined prior to the current column, adjust the build sequencing so that those columns are created first.

Current Behavior

You have to specify the dependency via the baseColumn attribute in all cases.

Improved behavior

If a simple identifier parser detects valid SQL identifiers in the SQL expression and a) they are not inside a string, and b) they match an existing column name, use separate phases to generate the column that references the other columns.

This will reduce the number of cases where it is necessary to reference baseColumn explicitly.

Add enhanced options and documentation for streaming data generation

Issue to track:

1 - changes to streaming documentation to add
a) Delta Live Tables integration information,
b) example of generation of sliding event time windows
c) example of generation of simple IOT data with timestamps
d) example of generation of late arriving data with the above

2 - add options to simplify generation of the above
The above behavior is supported in the current version, but it would be useful to include options to simplify the generation of streaming data sets (see the sketch after this list)

2a - add options for:
- ageLimit - ignore messages older than n seconds. This helps when benchmarking with trigger-once using a rate stream when significant time has elapsed between runs (otherwise there can be a large message backlog)
3 - support rate-micro-batch source
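
A rough, untested sketch of streaming generation as it stands today (the withStreaming build option and the rowsPerSecond rate-source option are assumed from the project documentation):

import dbldatagen as dg

streaming_spec = (dg.DataGenerator(spark, name="iot_stream", rows=1000000, partitions=8)
                  .withColumn("device_id", "long", uniqueValues=1000, random=True)
                  .withColumn("reading", "double", minValue=0.0, maxValue=100.0, random=True))

# Build as a streaming DataFrame backed by the rate source
df_stream = streaming_spec.build(withStreaming=True,
                                 options={"rowsPerSecond": 500})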

Expose __version__ attribute

Expected Behavior

The __version__ attribute should be exposed so that the following code works:

import dbldatagen as dg
print(dg.__version__)

Current Behavior

The __version__ attribute is not exposed.

Steps to Reproduce (for bugs)

n/a

Context

Your Environment

  • dbldatagen version used: RC 0
  • Databricks Runtime version: tested with Databricks 9.3
  • Cloud environment used: tested on Azure

Enhancement: Generate standard data sets

It would be useful to be able to generate standard data sets, without having to define columns etc., for quick demos and benchmarking of different activities.

The goal would be to make it very easy to quickly generate a data set for benchmarking and other purposes without having to invest much time in learning the details of the data generation framework.

These could be modelled on standard public data sets such as those published as part of Kaggle challenges - for example, standard data sets for customers, purchases, sales, etc.

In particular, for exploring CDC scenarios it would be useful to be able to generate standard complementary data sets for both baseline data and incremental data.

Proposed Behavior

import dbldatagen as dg

# define a standard data set for customers
testdata_generator = (dg.DataGenerator(spark, name="test_dataset", rows=100000, partitions=20)
                       .usingStandardDataset("customers")
                       )

df = testdata_generator.build()  # build our dataset

Generate data based on estimated Delta table size

As I have been using the data generator, I have had to use trial and error to get the table size I require. Not sure if this is feasible, but it would be great to generate data based on the final table size required instead of the number of rows.

Alternatively it might be useful to easily get back stats about the generated table size and use that to iteratively generate more data to reach the desired table size.

Currently I am doing the following to get back the table size, which works well but needs to be run manually each time.

dfTestData.write.format("delta").mode("overwrite").saveAsTable("tfayyaz_db.test_data")
detail = spark.sql("DESCRIBE DETAIL tfayyaz_db.test_data")
print(detail.first()["sizeInBytes"]/1024/1024, "mb")

Thanks
Tahir

Document %pip based install

Expected Behavior

Current Behavior

A %pip cell in a notebook can be used to install directly from GitHub - document this.
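
For example, something like the following could be documented (the exact URL/ref syntax would need to be confirmed):

%pip install git+https://github.com/databrickslabs/dbldatagen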

Steps to Reproduce (for bugs)

Context

Your Environment

  • dbldatagen version used:
  • Databricks Runtime version:
  • Cloud environment used:

Typo in readme

Readme incorrectly states that install is via %pip install dbdatagen

Correct command is %pip install dbldatagen

Automatic generation of dates does not work

(Filed on behalf of colleague)
The following code causes an error during build if the schema contains a Date field:

testDataSpec = (datagen.DataGenerator(sparkSession=spark, name="test_data_set1",
                                      rows=1000000, partitions=4)
                .withSchema(schema)
                .withIdOutput()
                )

df = testDataSpec.build()

TODO: Add wildcard matching to set generation spec for multiple columns at a time

Working with a large schema is very unwieldy.

For example, if a schema has 100s of columns, generating realistic data would require 100s of withColumnSpec statements.

Proposed feature is to add methods for specifying multiple column specs in single call.

There will be multiple variations of this:

datagen.withColumnSpecs(pattern=".*_amt", ...)
datagen.withColumnSpecs(pattern=".*_amt", match_type=DecimalType(38,10), ...)
datagen.withColumnSpecs(columns=["val1","val2"], match_type=DecimalType(38,10), ...)

Error in creating ArrayType cols

Expected Behavior

When creating a column (from an existing schema or new) that is of a composite type, such as an array of integers, the expected behaviour is for the column to be generated in the same manner as if it were just a combination of many integer columns, rather than throwing an error.

Current Behavior

Error thrown: AnalysisException: cannot resolve '(id + CAST(0 AS BIGINT))' due to data type mismatch: cannot cast bigint to array;

Steps to Reproduce (for bugs)

import dbldatagen as dg
from pyspark.sql.types import ArrayType, IntegerType, FloatType, StringType
column_count = 10
data_rows = 1000 * 1000
df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                                                  partitions=4)
                            .withIdOutput()
                             .withColumn("r", FloatType(), expr="floor(rand() * 350) * (86400 + 3600)",
                                         numColumns=column_count)
                            .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
                            .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
                            .withColumn("code3", StringType(), values=['a', 'b', 'c'])
                            .withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True)
                            .withColumn("code5", StringType(), values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
                            .withColumn("a", ArrayType(StringType()))
                            )

df = df_spec.build()
display(df)

Context

Your Environment

  • dbldatagen version used:
  • Databricks Runtime version:
  • Cloud environment used:

Create a data set without pre-existing schemas example fails

I just tried the example from https://databrickslabs.github.io/dbldatagen/public_docs/APIDOCS.html#create-a-data-set-without-pre-existing-schemas and it fails

because of the line numColumns=cls.column_count

with the error:

INFO: effective range: None, None, 1 args: {}
INFO: adding column - `id` with baseColumn : `None`, implicit : True , omit True
INFO: *** using pandas udf for custom functions ***
INFO: Spark version: 3.1.1
INFO: Using spark 3.x
NameError: name 'cls' is not defined

I modified the example to the following and it works.

import dbldatagen as dg
from pyspark.sql.types import FloatType, IntegerType, StringType

row_count=1000 * 100
column_count=5
testDataSpec = (dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                                  partitions=4, randomSeedMethod='hash_fieldname', 
                                  verbose=True)
                   .withIdOutput()
                   .withColumn("r", FloatType(), expr="floor(rand() * 350) * (86400 + 3600)",
                                    numColumns=column_count)
                   .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
                   .withColumn("code2", IntegerType(), minValue=0, maxValue=10, random=True)
                   .withColumn("code3", StringType(), values=['online', 'offline', 'unknown'])
                   .withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True, 
                               percentNulls=0.05)
                   .withColumn("code5", StringType(), values=['a', 'b', 'c'], random=True, 
                                weights=[9, 1, 1])
                   )

dfTestData = testDataSpec.build()
display(dfTestData)

Generating text with baseColumn not consistent

Expected Behavior

When generating any data, if baseColumn is set to a reference column in withColumn then the data generated for the new column should be the same when the value of the reference column is the same.

For example, for the following code:

from dbldatagen import DataGenerator, fakerText  # fakerText requires the optional faker package
from pyspark.sql.types import IntegerType

rows = 10
partitions = 1

unique_customers = 2

generator = (DataGenerator(spark, name="demo", rows=rows, partitions=partitions,
                           randomSeedMethod='hash_fieldname')
             .withIdOutput()
             .withColumn("customer_id", IntegerType(), uniqueValues=unique_customers, baseColumnType="hash")
             .withColumn("first_name", text=fakerText("first_name"), base_column="customer_id")
             .withColumn("phone", template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd', base_column="customer_id")
             )

df = generator.build()
display(df)

I would expect that there would be two first names and they would be consistent for the values in customer_id

Current Behavior

With the example above, I get any number of random values in first_name and phone.

Steps to Reproduce (for bugs)

See the code above

Context

Trying to generate data with consistent values within a row.

Your Environment
