
databrickslabs / dbldatagen

265 stars · 14 watchers · 50 forks · 10.25 MB

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) can be used to generate large simulated/synthetic data sets for testing, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines.

Home Page: https://databrickslabs.github.io/dbldatagen

License: Other

Python 99.11% Makefile 0.88% Shell 0.01%
datagen pyspark python data-generation faker spark spark-streaming delta-live-tables deltalake databricks

dbldatagen's People

Contributors

alexott, burakyilmaz321, dependabot[bot], marvinschenkel, nathanknox, nfx, pohlposition, ronanstokes-db


dbldatagen's Issues

Uninstalling and reinstalling wheel on cluster running DBR 8.3 may fail

If a named cluster specification in your Databricks environment has the current or a previous build of the data generator installed, uninstalling the library and then reinstalling it may fail.

Expected Behavior

Uninstall followed by reinstall should succeed

Current Behavior

Uninstall followed by re-install may fail.

Workaround

  • Make sure the wheel does not have a name like dbldatagen-0.2.0rc1-py3-none-any.whl (1), which can result from downloading the file multiple times on the same machine
  • Don't use a saved cluster definition - use a new cluster definition

Our plan is to move to a pip-based install, which should make installation easier.

Your Environment

  • dbldatagen version used: release candidate 2
  • Databricks Runtime version: Databricks 8.3
  • Cloud environment used: Azure

Build fails due to changes to build runner 'ubuntu-latest'

Expected Behavior

Build succeeds

Current Behavior

Build during PR check fails

Steps to Reproduce (for bugs)

Create a pull request - the build will fail due to changes to the default Python version in ubuntu-latest.

Context

The fix is to explicitly install Python 3.8 in the GitHub Action used for the build.

Your Environment

  • dbldatagen version used: 0.3.0
  • Databricks Runtime version: various
  • Cloud environment used: shows in build process, before deployment to cloud

Improve speed and coverage of unit tests

Expected Behavior

Unit tests should run faster - a current run takes 10-12 minutes.

Proposed enhancements

  • Make use of all cores when running unit tests
  • Convert critical tests to use pytest rather than unittest
  • Use default parallelism as the default number of partitions (see the sketch after this list)
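
A minimal sketch of the direction this could take, assuming pytest with the pytest-xdist plugin (the fixture below is hypothetical, not existing project code):

# conftest.py - shared, session-scoped SparkSession for the test suite
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (SparkSession.builder
               .master("local[*]")            # use all local cores
               .appName("dbldatagen-tests")
               .getOrCreate())
    # align shuffle partitions with the default parallelism
    session.conf.set("spark.sql.shuffle.partitions",
                     session.sparkContext.defaultParallelism)
    yield session
    session.stop()

The suite could then be run in parallel with, for example, pytest -n auto.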

Unable to create columns containing lists of strings or integers

Expected Behavior

Given a .withColumn() call with values equal to a list of lists (such as [['A'], ['A', 'B'], ['A', 'B', 'C']]), a column should be created whose possible values are the lists ['A'], ['A', 'B'], and ['A', 'B', 'C'].

Current Behavior

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [A]

Steps to Reproduce (for bugs)

Reproducible example:

def test_array_column():
    import dbldatagen as dg
    from pyspark.sql import SparkSession
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()
    f_spec = (dg.DataGenerator(spark, name='mock-glow-genetic-data', rows=10)
              .withColumn('altAlleles', ArrayType(StringType()), values=[['A']]))
    f_spec.build()

Context

Attempting to generate mock genetic data for the GloWGR and Hail frameworks, where some columns are lists of integers or strings
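
A possible interim workaround (untested sketch): build the array from a Spark SQL array() expression via the expr parameter, rather than passing Python lists through values. The literals used here are illustrative only.

import dbldatagen as dg
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Generate the list column from a SQL expression instead of literal Python lists
f_spec = (dg.DataGenerator(spark, name='mock-glow-genetic-data', rows=10)
          .withColumn('altAlleles', ArrayType(StringType()),
                      expr="array('A', 'B')"))
df = f_spec.build()
df.show(truncate=False)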

Your Environment

  • dbldatagen version used: 0.2.0rc1
  • Databricks Runtime version: Executed locally on Mac OS
  • Cloud environment used: Executed locally on Mac OS

Problem with column which is named "ID"

Expected Behavior

Generation of column with name "ID".

Current Behavior

Exception:
AnalysisException: Reference 'ID' is ambiguous, could be: ID, ID.

Steps to Reproduce (for bugs)

import dbldatagen as dg
from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()

dg.DataGenerator(spark, rows=100, partitions=2).withColumn("ID", T.StringType()).build()

Context

I see that the problem is here
It may be worked around by renaming my column "ID" to "ID_" before generation and then renaming it back afterwards, but that looks a little hacky for production... Why not use something less likely to collide for the internal ID column, like datagen__technical__inner__id for example?
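
A sketch of the rename workaround described above (illustrative only):

import dbldatagen as dg
from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()

# Generate the column under a temporary name, then rename it back to "ID"
# on the built DataFrame
df = (dg.DataGenerator(spark, rows=100, partitions=2)
      .withColumn("ID_", T.StringType())
      .build()
      .withColumnRenamed("ID_", "ID"))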

Your Environment

  • dbldatagen version used: 0.2.0rc1
  • Databricks Runtime version: 10.4 LTS
  • Cloud environment used: AWS

tutorial/basics.py incomplete at the end

The end of the current tutorial/basics.py is incomplete where it uses the when clause:

df = spark.range(10000)

df = df.withColumn("test", df.when())

display(df)

This code, on the other hand, would work:

from pyspark.sql.functions import when

df = spark.range(10000)

df = df.withColumn("test", when(df.id > 5000, "UPPER").otherwise("LOWER"))

display(df)

Add support for Spark 3.1.2

Expected Behavior

Add explicit support for Spark 3.1.2 (included in Databricks Runtime 9.1)

Current Behavior

The current versions of the framework work in Databricks 9.1 (which is based on Spark 3.1.2). However, there are some new features in Spark 3.1.2 that will tidy up the syntax for some date and time constructs.

time string for interval should support both `seconds` and `second`

Expected Behavior

Time intervals can be specified as "12 minutes, 2 seconds". You can also specify "1 minute, 2 seconds". You should be able to specify "1 minute 1 second"

Current Behavior

"1 minute 1 seconds" works but "1 minute 1 second" does not

Steps to Reproduce (for bugs)

Will add test case with bug fix

Context

Your Environment

  • dbldatagen version used: 0.2.0 RC0
  • Databricks Runtime version:
  • Cloud environment used:

Distribution functions (and perhaps others) not compatible with Databricks UC Clusters operating in `shared` mode

Expected Behavior

  1. Set up UC-enabled Databricks interactive cluster.
  2. Use Dbldatagen to create spec using distributions ("dist") functions for data generation.

Expected result: a Dataframe with a million ints in it

Current Behavior

AnalysisException: [UC_COMMAND_NOT_SUPPORTED] UDF/UDAF functions are not supported in Unity Catalog.;
Project [id#835L, cast(((round((gamma_func(1.0, 2.0, 2678957030506407010)#837 * cast(99 as float)), 0) * cast(1 as float)) + cast(1 as float)) as int) AS ip_address#838]
+- Range (0, 1000000, step=1, splits=Some(256))

Steps to Reproduce (for bugs)

  1. Set up UC Cluster
from dbldatagen import DataGenerator
import dbldatagen.distributions as dist

shuffle_partitions_requested = 256
partitions_requested = 256
data_rows = 1 * 1000000  # 1 million

# partition parameters etc.
spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_requested)

dfDataspec = (DataGenerator(spark, rows=data_rows, partitions=partitions_requested)          
                .withColumn("ip_address", "int", minValue=1, maxValue=100, random=True, 
                            distribution=dist.Gamma(1.0,2.0)) 
              )
df = dfDataspec.build()

Context

Your Environment

  • dbldatagen version used: 0.30.0
  • Databricks Runtime version: 12.0 (UC enabled)
  • Cloud environment used: Azure

TODO: improve string generation

Idea:

Improve string generation to support:

  • generation of dummy emails
  • generation of dummy credit card numbers
  • generation of strings of random number of words
  • generation of strings of fixed number of random words
  • generation of strings conforming to a mask (see the template sketch after this list)
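
Some of these can already be approximated with the existing template option (as used elsewhere in these issues); a rough, illustrative sketch - dedicated generators (for example, card numbers with valid checksums) would still be an improvement:

import dbldatagen as dg

# Rough approximations using templates; column names and patterns are illustrative.
# Assumes a SparkSession named `spark`, as in a Databricks notebook.
df = (dg.DataGenerator(spark, rows=1000, partitions=4)
      .withColumn("email", "string", template=r'\w.\w@\w.com')
      .withColumn("card_number", "string", template="dddd dddd dddd dddd")
      .build())
df.show(5, truncate=False)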

DblDatagenerator causes global logger to issue messages twice in some circumstances

Expected Behavior

The logging within the dbldatagenerator package respects the logger that is being used in the package that uses dbldatagenerator.

Current Behavior

dbldatagenerator sometimes calls logging.info and logging.debug without calling getLogger() first, resulting in overwriting whatever logger is being used in the caller's package.

Examples:

Since this happens inside _version.py, which is called from __init__.py, this behaviour occurs whenever we import any class from the dbldatagen package.

Steps to Reproduce (for bugs)

import logging
date_format = "%Y-%m-%d %H:%M:%S"
log_format = "%(asctime)s %(levelname)-8s  %(message)s"
formatter = logging.Formatter(log_format, date_format)
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger = logging.getLogger(__name__)
logger.setLevel(level=logging.INFO)
logger.addHandler(handler)
logger.info("Info message")
# Prints: 2023-03-08 14:18:59 INFO      Info message
from dbldatagen import DataGenerator
# Prints: INFO: Version : VersionInfo(major='0', minor='3', patch='1', release='', build='')
logger.info("Info message")
# Prints:
# 2023-03-08 14:18:59 INFO      Info message
# INFO: Info message
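
For reference, a minimal sketch of the requested behaviour (illustrative only, not the package's actual code): obtain a module-scoped logger inside dbldatagen instead of calling the root-level logging functions, so the caller's configuration is left untouched.

import logging

# e.g. inside _version.py: use a module-level logger rather than logging.info()/logging.debug()
logger = logging.getLogger(__name__)
logger.debug("Version : %s", "0.3.1")  # illustrative message only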

Your Environment

  • dbldatagen version used: 0.3.1
  • Databricks Runtime version: NA
  • Cloud environment used: NA

Make dependency on pyspark development-only, not for the package

Right now, installing the data generator also installs OSS pyspark, which may interfere with the Databricks implementation. For such cases the pyspark dependency should usually be made optional/development-only, relying on something like the findspark package to discover the working PySpark installation. Something like this:

Processing /dbfs/FileStore/wheels/databrickslabs_testdatagenerator-0.10.0_prerel5-py3-none-any.whl
Collecting pyspark>=2.4.0
  Downloading pyspark-3.1.2.tar.gz (212.4 MB)
Requirement already satisfied: numpy in /databricks/python3/lib/python3.8/site-packages (from databrickslabs-testdatagenerator==0.10.0-prerel5) (1.19.2)
Requirement already satisfied: pandas in /databricks/python3/lib/python3.8/site-packages (from databrickslabs-testdatagenerator==0.10.0-prerel5) (1.1.3)
Requirement already satisfied: pyarrow>=0.8.0 in /databricks/python3/lib/python3.8/site-packages (from databrickslabs-testdatagenerator==0.10.0-prerel5) (1.0.1)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Requirement already satisfied: pytz>=2017.2 in /databricks/python3/lib/python3.8/site-packages (from pandas->databrickslabs-testdatagenerator==0.10.0-prerel5) (2020.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /databricks/python3/lib/python3.8/site-packages (from pandas->databrickslabs-testdatagenerator==0.10.0-prerel5) (2.8.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->databrickslabs-testdatagenerator==0.10.0-prerel5) (1.15.0)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py): started
  Building wheel for pyspark (setup.py): finished with status 'done'
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880769 sha256=5ee3c3a1242c356be086dd2f52db9fd511cc9d403bbc575dd69bdee91e139cd0
  Stored in directory: /root/.cache/pip/wheels/df/88/9e/58ef1f74892fef590330ca0830b5b6d995ba29b44f977b3926
Successfully built pyspark
Installing collected packages: py4j, pyspark, databrickslabs-testdatagenerator
Successfully installed databrickslabs-testdatagenerator-0.10.0-prerel5 py4j-0.10.9 pyspark-3.1.2
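
A minimal sketch of one way to express this with setuptools extras (package names and version pins here are illustrative, not the project's actual packaging):

# setup.py sketch: keep pyspark out of install_requires and expose it as an extra
from setuptools import setup, find_packages

setup(
    name="dbldatagen",
    packages=find_packages(),
    install_requires=["numpy", "pandas", "pyarrow"],
    extras_require={
        # installed only on request, e.g. pip install "dbldatagen[dev]"
        "dev": ["pyspark>=3.1.2"],
    },
)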

return_delay missing from distribution example

The Distributions examples mention that return_delay is used and omitted, but it is missing from the actual code example.

The code example is currently:

from pyspark.sql.types import IntegerType

import dbldatagen as dg

row_count = 1000 * 100
testDataSpec = (dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                                 partitions=4, randomSeedMethod='hash_fieldname',
                                 verbose=True)
                .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
                .withColumn("product_code", IntegerType(), uniqueValues=10000, random=True)
                .withColumn("purchase_date", "date",
                            data_range=dg.DateRange("2017-10-01 00:00:00",
                                                    "2018-10-06 11:55:00",
                                                    "days=3"),
                            random=True)
                .withColumn("return_date", "date",
                    expr="date_add('purchase_date', cast(floor(rand() * 100 + 1) as int))")

                )

dfTestData = testDataSpec.build()

I tried the following two examples, which did not work, but I am not sure if they are what the example was trying to show:

Example 1:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import expr


import dbldatagen as dg

row_count = 1000 * 100
testDataSpec = (dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                                 partitions=4, randomSeedMethod='hash_fieldname',
                                 verbose=True)
                .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
                .withColumn("product_code", IntegerType(), uniqueValues=10000, random=True)
                .withColumn("purchase_date", "date",
                            data_range=dg.DateRange("2017-10-01 00:00:00",
                                                    "2018-10-06 11:55:00",
                                                    "days=3"),
                            random=True)
                .withColumn("return_delay", IntegerType(), values=[-1, -2, -3], weights=[9, 2, 1],
                            random=True, omit=True)
                .withColumn("return_date", "date", expr="date_add('purchase_date', 'return_delay')")

                )

dfTestData = testDataSpec.build()
display(dfTestData)

Error 1:

AnalysisException: The second argument of 'date_add' function needs to be an integer.

Example 2:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import expr


import dbldatagen as dg

row_count = 1000 * 100
testDataSpec = (dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                                 partitions=4, randomSeedMethod='hash_fieldname',
                                 verbose=True)
                .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
                .withColumn("product_code", IntegerType(), uniqueValues=10000, random=True)
                .withColumn("purchase_date", "date",
                            data_range=dg.DateRange("2017-10-01 00:00:00",
                                                    "2018-10-06 11:55:00",
                                                    "days=3"),
                            random=True)
                .withColumn("return_delay", IntegerType(), values=[-1, -2, -3], weights=[9, 2, 1],
                            random=True, omit=True)
                .withColumn("return_date", "date", expr="date_add(purchase_date, return_delay)")

                )

dfTestData = testDataSpec.build()
display(dfTestData)

Error 2:

AnalysisException: cannot resolve '`purchase_date`' given input columns: [id]; line 1 pos 9;

The following works, but I am not sure it is the correct way to do this:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import expr


import dbldatagen as dg

row_count = 1000 * 100
testDataSpec = (dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                                 partitions=4, randomSeedMethod='hash_fieldname',
                                 verbose=True)
                .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
                .withColumn("product_code", IntegerType(), uniqueValues=10000, random=True)
                .withColumn("purchase_date", "date",
                            data_range=dg.DateRange("2017-10-01 00:00:00",
                                                    "2018-10-06 11:55:00",
                                                    "days=3"),
                            random=True)
                .withColumn("return_delay", IntegerType(), values=[-1, -2, -3], weights=[9, 2, 1],
                            random=True)

                )

dfTestData = testDataSpec.build()
dfTestData = dfTestData.withColumn("return_date", expr("date_add(purchase_date, return_delay)"))
display(dfTestData)
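
For reference, an untested variant that keeps return_date inside the spec by declaring its dependencies explicitly with baseColumn (following the baseColumn usage described in other issues in this list); whether this is the intended pattern would need confirmation:

from pyspark.sql.types import IntegerType
import dbldatagen as dg

row_count = 1000 * 100
testDataSpec = (dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                                 partitions=4, randomSeedMethod='hash_fieldname')
                .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
                .withColumn("purchase_date", "date",
                            data_range=dg.DateRange("2017-10-01 00:00:00",
                                                    "2018-10-06 11:55:00",
                                                    "days=3"),
                            random=True)
                .withColumn("return_delay", IntegerType(), values=[-1, -2, -3], weights=[9, 2, 1],
                            random=True, omit=True)
                # declare the dependency so both columns exist before the expression is evaluated
                .withColumn("return_date", "date",
                            expr="date_add(purchase_date, return_delay)",
                            baseColumn=["purchase_date", "return_delay"])
                )

dfTestData = testDataSpec.build()
display(dfTestData)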

Cast exception when nested schema passed as input

Expected Behavior

The nested schema shown below should be processed correctly.

Current Behavior

Cast exception is thrown as follows:

PySpark version: 3.3.1
PySpark SparkContext version: 3.3.1
StructType([StructField('id', LongType(), True), StructField('city', StructType([StructField('id', LongType(), True), StructField('population', LongType(), True)]), True)])
Traceback (most recent call last):
  File "nested-schema.py", line 24, in <module>
    res1 = gen1.build(withTempView=True)
  File "/home/pramod/.local/lib/python3.8/site-packages/dbldatagen/data_generator.py", line 925, in build
    df1 = self._buildColumnExpressionsWithSelects(df1)
  File "/home/pramod/.local/lib/python3.8/site-packages/dbldatagen/data_generator.py", line 972, in _buildColumnExpressionsWithSelects
    df1 = df1.select(*build_round)
  File "/home/pramod/.local/lib/python3.8/site-packages/pyspark/sql/dataframe.py", line 2023, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/home/pramod/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/home/pramod/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve '(id + CAST(0 AS BIGINT))' due to data type mismatch: cannot cast bigint to struct<id:bigint,population:bigint>;
'Project [id#0L, cast(cast((id#0L + cast(0 as bigint)) as struct<id:bigint,population:bigint>) as struct<id:bigint,population:bigint>) AS city#2]
+- Range (0, 10, step=1, splits=Some(2))

Steps to Reproduce (for bugs)

The code below is run as $ python3.8 nested_schema.py
PySpark version is 3.3.1

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType, DoubleType
import dbldatagen as datagen

spark = SparkSession.builder \
    .master("local[4]") \
    .appName("Nested Schema") \
    .getOrCreate()

print('PySpark version: ' + spark.version)
print('PySpark SparkContext version: ' + spark.sparkContext.version)

struct_type = StructType([
                StructField('id', LongType(), True),
                StructField("city", StructType([
                    StructField('id', LongType(), True),
                    StructField('population', LongType(), True)
                ]), True)])

print(struct_type)

gen1 = datagen.DataGenerator(sparkSession=spark, name="nested_schema", rows=10, partitions=2) \
      .withSchema(struct_type).withColumn("id")
res1 = gen1.build(withTempView=True)
res1.show(res1.count())
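
An untested workaround sketch: generate the leaf fields as flat columns and assemble the struct with a Spark SQL named_struct() expression after building (column and field names follow the schema above):

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType
import dbldatagen as datagen

spark = SparkSession.builder \
    .master("local[4]") \
    .appName("Nested Schema Workaround") \
    .getOrCreate()

gen1 = (datagen.DataGenerator(sparkSession=spark, name="nested_schema_flat", rows=10, partitions=2)
        .withIdOutput()
        .withColumn("city_id", LongType())
        .withColumn("city_population", LongType(), minValue=1000, maxValue=1000000))

# Compose the nested struct from the flat columns on the built DataFrame
res1 = (gen1.build()
        .selectExpr("id",
                    "named_struct('id', city_id, 'population', city_population) AS city"))
res1.printSchema()
res1.show()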

Context

Your Environment

Improve template text generator

Expected Behavior

The template generator should generate the same text from run to run. The proposal is to use the NumPy random number generator (and a vectorized implementation) to improve repeatability and performance.

Current Behavior

The existing implementation uses the Python random number generator, which is slower and has issues with repeatability.
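
For illustration, a minimal sketch of the kind of seeded, vectorized generation NumPy enables (not the project's implementation):

import numpy as np

# A seeded Generator yields identical output from run to run, and produces all
# choices in a single vectorized call.
rng = np.random.default_rng(seed=42)
letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))
chars = rng.choice(letters, size=(5, 8))        # 5 strings of 8 random letters
words = ["".join(row) for row in chars]
print(words)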

Steps to Reproduce (for bugs)

Context

Your Environment

  • dbldatagen version used:
  • Databricks Runtime version:
  • Cloud environment used:

Collect telemetry on usage

There needs to be a way to collect telemetry on who is using the project and which methods are used. This way we'd have a data-driven way to measure the success of this project and improve quality in the most frequently used parts of the project.

The simplest option would be sending a cluster status check in every method with the following user agent header:

User-Agent: Databricks-Labs-Data-Generator/VERSION (+method_name)

use of base_columns should be allowed as alias for base_column with multiple base_columns

Code to do this is already present, but it does not work for the following snippet:

import dbldatagen as dg
from pyspark.sql.types import StructType, StructField, StringType

shuffle_partitions_requested = 8
partitions_requested = 8
data_rows = 10000000

dataspec = (dg.DataGenerator(spark, rows=10000000, partitions=8)
            .withColumn("name", percent_nulls=1.0, template=r'\w \w|\w a. \w')
            .withColumn("payment_instrument_type", values=['paypal', 'visa', 'mastercard', 'amex'], random=True)
            .withColumn("payment_instrument", minValue=1000000, maxValue=10000000, template="dddd dddddd ddddd")
            .withColumn("email", template=r'\w.\w@\w.com')
            .withColumn("md5_payment_instrument",
                        expr="md5(concat(payment_instrument_type, ':', payment_instrument))",
                        base_columns=['payment_instrument_type', 'payment_instrument'])
            )
df1 = dataspec.build()

df1.display()

Improvement: Want to have more seamless mechanism for generating CDC updates

Currently you can generate simulated updates to an existing data set by either

  • sampling results from an existing dataset and updating fields
  • restricting the number of unique values for a dataset's primary key or composite primary key fields so that you get naturally repeated rows

It would be useful to be able to specify a specific number of updates or range of updates per primary key
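
For reference, a sketch of the second approach as it works today - capping the number of distinct key values so that repeated keys act as simulated updates (names and ranges are illustrative; assumes a SparkSession named `spark`):

import dbldatagen as dg

changes_spec = (dg.DataGenerator(spark, name="cdc_updates", rows=100000, partitions=8)
                # 10,000 distinct keys across 100,000 rows => roughly 10 updates per key
                .withColumn("customer_id", "long", uniqueValues=10000, random=True)
                .withColumn("status", "string", values=["active", "suspended", "closed"], random=True)
                .withColumn("change_date", "date",
                            data_range=dg.DateRange("2023-01-01 00:00:00",
                                                    "2023-12-31 23:59:59",
                                                    "days=1"),
                            random=True))
df_changes = changes_spec.build()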

Remove conda dependency

make create-dev-env
conda create -n dbl_testdatagenerator python=3.7.5
make: conda: No such file or directory
make: *** [create-dev-env] Error 1

People working on the code won't necessarily have conda installed; therefore, the build must not depend on it.

install_requires is pretty small. Installing conda messes with all Python interpreters on the machine.

Setuptools needs to include required packages for working with the library locally (outside of the Databricks environment)

Expected Behavior

When the package is installed, I expect it to install the necessary dependencies.

Current Behavior

It does not at the moment; it assumes that this package is going to run on Databricks, which is usually the case. However, if I am developing code locally, it becomes a problem during testing.

Steps to Reproduce (for bugs)

pip install dbldatagen

Then try to run some code locally for example tests.

It does not install the required packages:
numpy = "1.22.0"
pyspark = "3.1.3"
pyarrow = "1.0.1"
pandas = "1.1.3"
pyparsing = ">=2.4.7,<3.0.9"

Context

Your Environment

local mac computer

Documentation issues

Steps to Reproduce (for bugs)

Documentation content only

Context

The first example in the Data Ranges documentation does not include the correct definition for returnDate.

Changed interim build labelling to comply with PEP 440

Expected Behavior

Build labelling for pre-release builds should be changed to comply with PEP 440.

Current Behavior

Labelling for prerelease builds currently uses the form 0.3.1-a1 - the format needs to be 0.3.1a1.

Steps to Reproduce (for bugs)

Context

Your Environment

  • dbldatagen version used:
  • Databricks Runtime version:
  • Cloud environment used:

issue with use of template with value r"dr_\\v"

import dbldatagen as dg
from pyspark.sql.types import StructType, StructField, StringType

shuffle_partitions_requested = 8
partitions_requested = 8
data_rows = 10000000

schema = StructType([
    StructField("name", StringType(), True),
    StructField("ser", StringType(), True),
    StructField("license_plate", StringType(), True),
    StructField("email", StringType(), True),
])

dataspec = (dg.DataGenerator(spark, rows=10000000, partitions=8)
            .withSchema(schema))

dataspec = (dataspec
            .withColumnSpec("name", percent_nulls=1.0, template=r'\w \w|\w a. \w')
            .withColumnSpec("ser", minValue=1000000, maxValue=10000000, template=r"dr_\v")
            .withColumnSpec("email", template=r'\w.\w@\w.com')
            .withColumnSpec("license_plate", template=r'\n-\n')
            )
df1 = dataspec.build()

df1.display()

Improve build ordering dependencies

Expected Behavior

When a column contains a SQL expression that references other columns defined prior to the current column, adjust the build sequencing so that those columns are created first.

Current Behavior

You have to specify the dependency via the baseColumn attribute in all cases.

Improved behavior

If a simple identifier parser detects valid SQL identifiers in the SQL expression and a) they are not inside a string, and b) they match an existing column name, use separate phases to generate the column that references the other columns.

This will reduce the number of cases where it is necessary to reference baseColumn explicitly.

Add enhanced options and documentation for streaming data generation

Issue to track:

1 - changes to streaming documentation to add
a) Delta Live Tables integration information,
b) example of generation of sliding event time windows
c) example of generation of simple IOT data with timestamps
d) example of generation of late arriving data with the above

2 - add options to simplify generation of the above
The above behavior is supported in the current version, but it would be useful to include options to simplify the generation of streaming data sets (see the sketch after this list)

2a - add options for:
- ageLimit - ignore messages older than n seconds. This helps when benchmarking with trigger-once using a rate stream when significant time has elapsed between runs (otherwise there can be a large message backlog)
3 - support rate-micro-batch source
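
A rough, untested sketch of streaming generation as it stands today (the withStreaming build option and the rowsPerSecond rate-source option are assumed from the project documentation):

import dbldatagen as dg

streaming_spec = (dg.DataGenerator(spark, name="iot_stream", rows=1000000, partitions=8)
                  .withColumn("device_id", "long", uniqueValues=1000, random=True)
                  .withColumn("reading", "double", minValue=0.0, maxValue=100.0, random=True))

# Build as a streaming DataFrame backed by the rate source
df_stream = streaming_spec.build(withStreaming=True,
                                 options={"rowsPerSecond": 500})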

Expose __version__ attribute

Expected Behavior

The __version__ attribute should be exposed so that the following code works:

import dbldatagen as dg
print(dg.__version__)

Current Behavior

The __version__ attribute is not exposed.

Steps to Reproduce (for bugs)

n/a

Context

Your Environment

  • dbldatagen version used: RC 0
  • Databricks Runtime version: tested with Databricks 9.3
  • Cloud environment used: tested on Azure

Enhancement: Generate standard data sets

It would be useful to be able to generate standard data sets, without having to define columns etc., for quick demos and benchmarking of different activities.

The goal would be to make it very easy to quickly generate a data set for benchmarking and other purposes without having to invest much time in learning the details of the data generation framework.

These could be modelled on standard public data sets such as those published as part of Kaggle challenges - for example, standard data sets for customers, purchases, sales, etc.

In particular, for exploring CDC scenarios it would be useful to be able to generate standard complementary data sets for both baseline data and incremental data.

Proposed Behavior

import dbldatagen as dg

# define a standard data set for customers
testdata_generator = (dg.DataGenerator(spark, name="test_dataset", rows=100000, partitions=20)
                       .usingStandardDataset("customers")
                       )

df = testdata_generator.build()  # build our dataset

Generate data based on estimated Delta table size

As I have been using the data generator, I have had to use trial and error to get the table size I require. Not sure if this is feasible, but it would be great to generate data based on the final table size required instead of the number of rows.

Alternatively it might be useful to easily get back stats about the generated table size and use that to iteratively generate more data to reach the desired table size.

Currently I am doing the following to get back the table size, which works well but needs to be run manually each time.

dfTestData.write.format("delta").mode("overwrite").saveAsTable("tfayyaz_db.test_data")
detail = spark.sql("DESCRIBE DETAIL tfayyaz_db.test_data")
print(detail.first()["sizeInBytes"]/1024/1024, "mb")

Thanks
Tahir

Document %pip based install

Expected Behavior

Current Behavior

A %pip cell in a notebook can be used to install directly from GitHub - document this.
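
For example, something like the following could be documented (the exact URL/ref syntax would need to be confirmed):

%pip install git+https://github.com/databrickslabs/dbldatagen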

Steps to Reproduce (for bugs)

Context

Your Environment

  • dbldatagen version used:
  • Databricks Runtime version:
  • Cloud environment used:

Typo in readme

Readme incorrectly states that install is via %pip install dbdatagen

Correct command is %pip install dbldatagen

Automatic generation of dates does not work

(Filed on behalf of colleague)
The following code causes an error during build if the schema contains a Date field:

testDataSpec = (datagen.DataGenerator(sparkSession=spark, name="test_data_set1",
                                      rows=1000000, partitions=4)
                .withSchema(schema)
                .withIdOutput()
                )

df = testDataSpec.build()

TODO: Add wildcard matching to set generation spec for multiple columns at a time

Working with a large schema is very unwieldy.

For example, if a schema has 100s of columns, generating realistic data would require 100s of withColumnSpec statements.

Proposed feature is to add methods for specifying multiple column specs in single call.

There will be multiple variations of this:

datagen.withColumnSpecs(pattern=".*_amt", ...)
datagen.withColumnSpecs(pattern=".*_amt", match_type=DecimalType(38,10), ...)
datagen.withColumnSpecs(columns=["val1","val2"], match_type=DecimalType(38,10), ...)

Error in creating ArrayType cols

Expected Behavior

When creating a column (from an existing schema or new) that is of a composite type, such as an array of integers, the expected behaviour is for the column to be generated in the same manner as if it were just a combination of many integer columns, rather than throwing an error.

Current Behavior

Error thrown: AnalysisException: cannot resolve '(id + CAST(0 AS BIGINT))' due to data type mismatch: cannot cast bigint to array;

Steps to Reproduce (for bugs)

import dbldatagen as dg
from pyspark.sql.types import ArrayType, IntegerType, FloatType, StringType
column_count = 10
data_rows = 1000 * 1000
df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                                                  partitions=4)
                            .withIdOutput()
                             .withColumn("r", FloatType(), expr="floor(rand() * 350) * (86400 + 3600)",
                                         numColumns=column_count)
                            .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
                            .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
                            .withColumn("code3", StringType(), values=['a', 'b', 'c'])
                            .withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True)
                            .withColumn("code5", StringType(), values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
                            .withColumn("a", ArrayType(StringType()))
                            )

df = df_spec.build()
display(df)

Context

Your Environment

  • dbldatagen version used:
  • Databricks Runtime version:
  • Cloud environment used:

Create a data set without pre-existing schemas example fails

I just tried the example from https://databrickslabs.github.io/dbldatagen/public_docs/APIDOCS.html#create-a-data-set-without-pre-existing-schemas and it fails

because of the line numColumns=cls.column_count

with the error:

INFO: effective range: None, None, 1 args: {}
INFO: adding column - `id` with baseColumn : `None`, implicit : True , omit True
INFO: *** using pandas udf for custom functions ***
INFO: Spark version: 3.1.1
INFO: Using spark 3.x
NameError: name 'cls' is not defined

I modified the example to the following and it works.

import dbldatagen as dg
from pyspark.sql.types import FloatType, IntegerType, StringType

row_count=1000 * 100
column_count=5
testDataSpec = (dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                                  partitions=4, randomSeedMethod='hash_fieldname', 
                                  verbose=True)
                   .withIdOutput()
                   .withColumn("r", FloatType(), expr="floor(rand() * 350) * (86400 + 3600)",
                                    numColumns=column_count)
                   .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
                   .withColumn("code2", IntegerType(), minValue=0, maxValue=10, random=True)
                   .withColumn("code3", StringType(), values=['online', 'offline', 'unknown'])
                   .withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True, 
                               percentNulls=0.05)
                   .withColumn("code5", StringType(), values=['a', 'b', 'c'], random=True, 
                                weights=[9, 1, 1])
                   )

dfTestData = testDataSpec.build()
display(dfTestData)

Generating text with baseColumn not consistent

Expected Behavior

When generating any data, if baseColumn is set to a reference column in withColumn then the data generated for the new column should be the same when the value of the reference column is the same.

For example, for the following code:

from dbldatagen import DataGenerator, fakerText  # fakerText requires the optional faker package
from pyspark.sql.types import IntegerType

rows = 10
partitions = 1

unique_customers = 2

generator = (DataGenerator(spark, name="demo", rows=rows, partitions=partitions,
                           randomSeedMethod='hash_fieldname')
             .withIdOutput()
             .withColumn("customer_id", IntegerType(), uniqueValues=unique_customers, baseColumnType="hash")
             .withColumn("first_name", text=fakerText("first_name"), base_column="customer_id")
             .withColumn("phone", template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd', base_column="customer_id")
             )

df = generator.build()
display(df)

I would expect that there would be two first names and they would be consistent for the values in customer_id

Current Behavior

With the example above, I get any number of random values in first_name and phone.

Steps to Reproduce (for bugs)

See the code above

Context

Trying to generate data with consistent values within a row.

Your Environment
