materializeinc / datagen

Generate authentic-looking mock data based on a SQL, JSON, or Avro schema and produce it to Kafka in JSON or Avro format.

License: Apache License 2.0


datagen's Introduction

Datagen CLI

This command-line interface application allows you to take schemas defined in JSON (.json), Avro (.avsc), or SQL (.sql) and produce believable fake data to Kafka in JSON or Avro format, or to Postgres.

The benefits of using this datagen tool are:

  • You can specify what values are generated using the expansive FakerJS API to craft data that more faithfully imitates your use case. This allows you to more easily apply business logic downstream.
  • This is a relatively simple CLI tool compared to other Kafka data generators that require Kafka Connect.
  • When using the avro output format, datagen connects to Schema Registry. This allows you to take advantage of the benefits of using Schema Registry.
  • Often when you generate random data, your downstream join results won't make sense because it's unlikely a randomly generated field in one dataset will match a randomly generated field in another. With this datagen tool, you can specify relationships between your datasets so that related columns will match up, resulting in meaningful joins downstream. Jump to the end-to-end ecommerce tutorial for a full example.

🚧 Specifying relationships between datasets currently requires using JSON for the input schema.

🚧 The postgres output format currently does not support specifying relationships between datasets.

Installation

npm

npm install -g @materializeinc/datagen

Docker

docker pull materialize/datagen

From Source

git clone https://github.com/MaterializeInc/datagen.git
cd datagen
npm install
npm run build
npm link

Setup

Create a file called .env with the following environment variables

# Kafka Brokers
export KAFKA_BROKERS=

# For Kafka SASL Authentication:
export SASL_USERNAME=
export SASL_PASSWORD=
export SASL_MECHANISM=

# For Kafka SSL Authentication:
export SSL_CA_LOCATION=
export SSL_CERT_LOCATION=
export SSL_KEY_LOCATION=

# Connect to Schema Registry if using '--format avro'
export SCHEMA_REGISTRY_URL=
export SCHEMA_REGISTRY_USERNAME=
export SCHEMA_REGISTRY_PASSWORD=

# Postgres
export POSTGRES_HOST=
export POSTGRES_PORT=
export POSTGRES_DB=
export POSTGRES_USER=
export POSTGRES_PASSWORD=

# MySQL
export MYSQL_HOST=
export MYSQL_PORT=
export MYSQL_DB=
export MYSQL_USER=
export MYSQL_PASSWORD=

The datagen program will read the environment variables from .env in the current working directory. If you are running datagen from a different directory, you can first source /path/to/your/.env before running the command.
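For example, a quick sketch (the .env path is a placeholder, and the schema and flags are only for illustration):

source /path/to/your/.env
datagen -s tests/products.sql -f json -n 10 --dry-run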

Usage

datagen -h
Usage: datagen [options]

Fake Data Generator

Options:
  -V, --version             output the version number
  -s, --schema <char>       Schema file to use
  -f, --format <char>       The format of the produced data (choices: "json", "avro", "postgres", "webhook", "mysql", default: "json")
  -n, --number <char>       Number of records to generate. For infinite records, use -1 (default: "10")
  -c, --clean               Clean (delete) Kafka topics and schema subjects previously created
  -dr, --dry-run            Dry run (no data will be produced to Kafka)
  -d, --debug               Output extra debugging information
  -w, --wait <int>          Wait time in ms between record production
  -rs, --record-size <int>  Record size in bytes, eg. 1048576 for 1MB
  -p, --prefix <char>       Kafka topic and schema registry prefix
  -h, --help                display help for command

Quick Examples

See example input schema files in the examples and tests folders.

Quickstart

  1. Iterate through a schema defined in SQL 10 times, but don't actually interact with Kafka or Schema Registry ("dry run"). Also, see extra output with debug mode.

    datagen \
      --schema tests/products.sql \
      --format avro \
      --dry-run \
      --debug
  2. Same as above, but actually create the schema subjects and Kafka topics, and actually produce the data. There is less output because debug mode is off.

    datagen \
        --schema tests/products.sql \
        --format avro
  3. Same as above, but produce to Kafka continuously. Press Ctrl+C to quit.

    datagen \
        -s tests/products.sql \
        -f avro \
        -n -1
  4. If you want to generate a larger payload, you can use the --record-size option to specify the number of bytes of junk data to add to each record. Here, each record is 1 MB, so generating 1,000 records produces roughly 1 GB of data with the following options:

    datagen \
        -s tests/products.sql \
        -f avro \
        -n 1000 \
        --record-size 1048576

    This will add a recordSizePayload field to the record with the specified size and will send the record to Kafka.

    📓 The maximum message size allowed by your Kafka cluster and topic needs to be set higher than 1 MB for this to work.

  5. Clean (delete) the topics and schema subjects created above

    datagen \
        --schema tests/products.sql \
        --format avro \
        --clean

Generate records with sequence numbers

To simulate auto-incrementing primary keys, you can use the iteration.index variable in the schema.

This is particularly useful when you want to generate a small set of records with a sequence of IDs, for example 1000 records with IDs from 1 to 1000:

[
    {
        "_meta": {
            "topic": "mz_datagen_users"
        },
        "id": "iteration.index",
        "name": "faker.internet.userName()",
    }
]

Example:

datagen \
    -s tests/iterationIndex.json \
    -f json \
    -n 1000 \
    --dry-run

Docker

Call the Docker container like you would call the CLI locally, except:

  • include --rm to remove the container when it exits
  • include -it (interactive TTY) to see the output as you would locally (e.g. colors)
  • mount .env and schema files into the container
  • note that the working directory in the container is /app

docker run \
  --rm -it \
  -v ${PWD}/.env:/app/.env \
  -v ${PWD}/tests/schema.json:/app/blah.json \
  materialize/datagen -s blah.json -n 1 --dry-run

Input Schemas

You can define input schemas using JSON (.json), Avro (.avsc), or SQL (.sql). Within those schemas, you use the FakerJS API to define the data that is generated for each field.

You can pass arguments to faker methods; because the method call is itself a string in the schema, escape any quotes inside the arguments. For example, here is faker.datatype.number with min and max arguments:

"faker.datatype.number({min: 100, max: 1000})"

🚧 Right now, JSON is the only kind of input schema that supports generating relational data.

⚠️ Please inspect your input schema file since faker methods can contain arbitrary JavaScript functions that datagen will execute.

JSON Schema

Here is the general syntax for a JSON input schema:

[
  {
    "_meta": {
      "topic": "<my kafka topic>",
      "key": "<field to be used for kafka record key>" ,
      "relationships": [
        {
          "topic": "<topic for dependent dataset>",
          "parent_field": "<field in this dataset>",
          "child_field": "<matching field in dependent dataset>",
          "records_per": <number of records in dependent dataset per record in this dataset>
        },
        ...
      ]
    },
    "<my first field>": "<method from the faker API>",
    "<my second field>": "<another method from the faker API>",
    ...
  },
  {
    ...
  },
  ...
]
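
For a concrete (hypothetical) illustration of this syntax, the following schema generates users and, for each user, four purchases whose user_id matches the user's id. The topic names, fields, and faker methods are illustrative:

[
  {
    "_meta": {
      "topic": "mz_datagen_users",
      "key": "id",
      "relationships": [
        {
          "topic": "mz_datagen_purchases",
          "parent_field": "id",
          "child_field": "user_id",
          "records_per": 4
        }
      ]
    },
    "id": "faker.datatype.uuid()",
    "name": "faker.internet.userName()"
  },
  {
    "_meta": {
      "topic": "mz_datagen_purchases",
      "key": "id"
    },
    "id": "faker.datatype.uuid()",
    "user_id": "faker.datatype.uuid()",
    "product_id": "faker.datatype.number()"
  }
]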

Go to the end-to-end ecommerce tutorial to walk through an example that uses a JSON input schema with relational data.

SQL Schema

The SQL schema option allows you to use a CREATE TABLE statement to define what data is generated. You specify the FakerJS API method using a COMMENT on the column. Here is an example:

CREATE TABLE "ecommerce"."products" (
  "id" int PRIMARY KEY,
  "name" varchar COMMENT 'faker.internet.userName()',
  "merchant_id" int NOT NULL COMMENT 'faker.datatype.number()',
  "price" int COMMENT 'faker.datatype.number()',
  "status" int COMMENT 'faker.datatype.boolean()',
  "created_at" timestamp DEFAULT (now())
);

This will produce the desired mock data to the topic ecommerce.products.

Producing to Postgres

You can also produce the data to a Postgres database. To do this, you need to specify the -f postgres option and provide Postgres connection information in the .env file. Here is an example .env file:

# Postgres
export POSTGRES_HOST=
export POSTGRES_PORT=
export POSTGRES_DB=
export POSTGRES_USER=
export POSTGRES_PASSWORD=

Then, you can run the following command to produce the data to Postgres:

datagen \
    -s tests/products.sql \
    -f postgres \
    -n 1000

⚠️ You can only produce to Postgres with a SQL schema.

Producing to MySQL

You can also produce the data to a MySQL database. To do this, you need to specify the -f mysql option and provide MySQL connection information in the .env file. Here is an example .env file:

# MySQL
export MYSQL_HOST=
export MYSQL_PORT=
export MYSQL_DB=
export MYSQL_USER=
export MYSQL_PASSWORD=

Then, you can run the following command to produce the data to MySQL:

datagen \
    -s tests/products.sql \
    -f mysql \
    -n 1000

⚠️ You can only produce to MySQL with a SQL schema.

Producing to Webhook

You can also produce the data to a Webhook. To do this, you need to specify the -f webhook option and provide Webhook connection information in the .env file. Here is an example .env file:

# Webhook
export WEBHOOK_URL=
export WEBHOOK_SECRET=

Then, you can run the following command to produce the data to Webhook:

datagen \
    -s tests/products.sql \
    -f webhook \
    -n 1000

⚠️ You can only produce to Webhook with basic authentication.

Avro Schema

🚧 Avro input schema currently does not support arbitrary FakerJS methods. Instead, data is randomly generated based on the type.

Here is an example Avro input schema from tests/schema.avsc that will produce data to a topic called products:

{
  "type": "record",
  "name": "products",
  "namespace": "exp.products.v1",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "productId", "type": ["null", "string"] },
    { "name": "title", "type": "string" },
    { "name": "price", "type": "int" },
    { "name": "isLimited", "type": "boolean" },
    { "name": "sizes", "type": ["null", "string"], "default": null },
    { "name": "ownerIds", "type": { "type": "array", "items": "string" } }
  ]
}
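
To try this schema, point datagen at the file (shown here as a dry run):

datagen \
    -s tests/schema.avsc \
    -f avro \
    -n 10 \
    --dry-run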

datagen's People

Contributors

bobbyiliev, chuck-alt-delete, dependabot[bot], ruf-io, sjwiesman, yajneshpadiyar


datagen's Issues

Feature: Add Materialize integration tests

Is your feature request related to a problem? Please describe.

At the moment we only test datagen's ability to produce data to Kafka. We should also add some Materialize integration tests to make sure that Materialize can actually ingest the produced data.

Additional context

We can use the setup from the Terraform Provider here.

Feature: Allow fields to reuse previously generated values

It would be great if a field could reuse the generated value from another field. This currently works with parent_field and child_field to specify a foreign key relationship in _meta.relationships, but we may want to extend that to other fields as well.

For example, suppose we want a field that takes the value of another field and increases it by a random number between 5 and 10. The most general solution might be to somehow allow referencing other fields by JSON path? This has the benefit that it would work for nested fields as well.

Dependent fields would have to come later in the order so the upstream field has a chance to get generated first.

Related issues

For Debezium change data capture, a record has before and after fields that wrap the values before and after the record changed in the upstream database. Often we'll want to include a field that didn't change, like a primary key id. But with the current implementation, there isn't a way for one field to reference the data generated in another field.

Feature: Add docker compose example

Is your feature request related to a problem? Please describe.

It would be handy to add a docker-compose example, as it would let us showcase how users could run this in a simulated multi-instance setup, e.g. running multiple instances of the datagen script to increase throughput.

Feature: Kafka Benchmark data generator

Is your feature request related to a problem? Please describe.

In some cases, users might want to create a Kafka topic with a specific volume of data, e.g. 10 GB.

Describe the solution you'd like

Add an option to allow users to specify the size of the data they want to generate, e.g. 10 GB, and produce that exact volume of data.

Additional context

We could use the crypto package for example:

const crypto = require('crypto');

// Define the size of an individual message, Confluent for example has a limit of 8-9MB
const size = (5 * 1024 * 1024) / 2;

// Generate a string with that size
function generateRandomString(size) {
    return crypto.randomBytes(size).toString('hex');
}

Loop n number of times to generate the required total data volume:

  for (let i = 1; i <= 2000; i++) {
    // Payload with UPSERT format
    const payload = {
      record_id: i,
      body: generateRandomString(size)
    }
    console.log(payload)
    await producer.send({
      topic: "topic_10gb",
      messages: [{ value: JSON.stringify(payload), key: payload.record_id.toString() }]
    });
  }

Feature: Debug mode -- Show Kafka topic

I am getting an error that tells me to try creating the kafka topic manually, but it's not immediately clear to me how the topic should be named to match what datagen expects. Could debug mode show what the kafka topic name is supposed to be?

Feature: More control over keyspace size

Is your feature request related to a problem? Please describe.

I was looking at faker.datatype.number and other faker methods to see what I might do to limit the keyspace of a collection. With a max, we can ensure that we hit keys multiple times for updates.

Describe the solution you'd like

I’m not sure. Perhaps the user could specify cardinality in _meta? But only when it makes sense, i.e. when the faker method accepts some kind of max argument like faker.datatype.number.

Feature: Add MySQL destination

Is your feature request related to a problem? Please describe.

With the new Materialize MySQL source on the way, it will be nice to have a MySQL destination available for some quick tests.

Describe the solution you'd like

This should be more or less the same as the Postgres destination that we currently have.

Feature: add Postgres target

Is your feature request related to a problem? Please describe.

Although Kafka is a great place to start building out the tool, it'd be useful to extend the generator and support producing data to a Postgres target. This would ensure we don't force users to adopt and/or learn Kafka if their use case just requires direct Postgres CDC to Materialize.

I can also see this being useful for internal testing of our Postgres source.

Describe the solution you'd like

Since we already support .sql files as input, we could run these against the specified connection to bootstrap the target database (which should also be configured for logical replication). Generating relational data sounds a bit tricky, but there are a few SQL Faker implementations out there we could draw inspiration from (e.g. sqlfaker, sql-faker). I'd be motivated to work on a prototype! 🖖

Feature: Allow user to use faker.js API in sql schema definitions

It would be great if you could also use the faker.js collections for SQL-defined schemas. Right now the record is generated based only on the specified type, but it would be nice if the user could specify which preset collection from the faker.js library they want to use.

Feature: Add support for Avro output format

The feature should include:

Tasks

Feature: Add integration tests

Is your feature request related to a problem? Please describe.

At the moment we are only testing the schema parsing and the output.

It would be nice to actually test the Kafka producer, including both JSON and Avro formats.

Describe the solution you'd like

We can create a docker-compose file with Kafka + Confluent Schema Registry and add additional tests that actually run the datagen script outside of dry-run mode.

Additional context

The tests should run on every PR

Bug: avro namespace causes error when name has periods

Describe the bug

When producing tests/products.sql with the avro output format, there is an error in the Avro schema because of a period in the name. If the Avro record name has periods, then it must match the namespace.

To Reproduce

datagen -s tests/products.sql -f avro

Feature: Clean up topics and schema subjects

Is your feature request related to a problem? Please describe.

My schema file might have a dozen Kafka topics and it’s a little cumbersome to delete all of those if I want to clean up.

Describe the solution you'd like

Add a --clean option that takes all the topics in the input schema file and deletes them from Kafka. If using Schema Registry, also delete the associated schema subjects.

Feature: Add CONTRIBUTING.md

Describe the solution you'd like

Add a CONTRIBUTING.md file:

# Contributor instructions

## Testing

???

## Cutting a new release

Perform a manual test of the latest code on `main`. See prior section. Then run:

    git tag -a vX.Y.Z -m vX.Y.Z
    git push origin vX.Y.Z

Feature: support envelope debezium

Is your feature request related to a problem? Please describe.

It could be powerful to support outputting data in a Debezium format. While it is not semantically different, users may want to see data in the format they'll have in production.

publish ARM docker image

Our dockerhub.yml GitHub Actions workflow is missing this line to publish multi-arch images:

platforms: linux/amd64,linux/arm64

Bug: Kafka Topic creation fails

Describe the bug

The Kafka topic creation for Confluent Cloud specifically is failing.

This seems to be working fine with Upstash for example.

To Reproduce

Use a Confluent Cloud Kafka and a non-existing topic

Expected behavior

The script should be able to create a topic if it is missing, as long as we have write access of course

Feature: Define await time during data generation

Is your feature request related to a problem? Please describe.

At the moment the setTimeout is hardcoded to 500 ms, meaning that we wait 500 ms after each message has been produced:

https://github.com/MaterializeInc/datagen/blob/main/src/jsonDataGenerator.js#L210

Describe the solution you'd like

It would be great if the user could specify the setTimeout value as a parameter.

Additional context

We need to make sure that we have a default value, e.g. 500 ms, so that the user does not have to explicitly specify it unless they want to override it.

Feature: Add support for generating relational data

An experimental JSON relational option has been added. We need to add the same for SQL and Avro schemas.

This adds some very basic relational functionality for JSON schemas. You can specify the PK and the FK in the schema, e.g.:

[
    {
        "_meta": {
            "topic": "mz_datagen_users",
            "key": "id"
        },
        "id": "datatype.uuid",
        "name": "internet.userName",
        "email": "internet.exampleEmail",
        "phone": "phone.imei",
        "website": "internet.domainName",
        "city": "address.city",
        "company": "company.name"
    },
    {
        "_meta": {
            "topic": "mz_datagen_posts",
            "foreignKey": "user_id",
            "key": "id"
        },
        "id": "datatype.uuid",
        "user_id": "datatype.uuid",
        "title": "lorem.sentence",
        "body": "lorem.paragraph"
    }
]

A unique ID will be generated and used for both users.id and posts.user_id.

Feature: Allow datagen to run indefinitely

Is your feature request related to a problem? Please describe.

Right now, datagen only runs through a specific number of iterations. It would be nice if you could have it run indefinitely until it's killed.

Describe the solution you'd like

If the user specifies -n -1, we could have it run indefinitely.

Add a license to the repository

The repository should have an appropriate license that waives liability while indicating the code is available for free use.

Bug: "Error creating kafka topic -- alert is not defined"

Describe the bug

✖ Error creating Kafka topic, try creating it manually...
alert is not defined

To Reproduce

Set env variables and:

datagen \
    -s examples/ecommerce.json \
    -f avro \
    -n -1

When I check out tag v0.1.4 it works, so the bug must have been introduced in #78.

Additional context

Looks like we need to import the alerts module in producer.ts. Then, investigate why there is an error creating the Kafka topic in createTopic.ts.

Bug: sql schema parser adds table name to record payload

Describe the bug

The SQL schema parser adds the table name to the payload. This was initially intended for generating all of the records in one topic, which is no longer the case.

To Reproduce

Try generating some records using the SQL test files.

Expected behavior

Remove the table name from the SQL parser output.

Bug: npm `crypto` package deprecated

Describe the bug

Noticed the following warning:

WARN deprecated crypto:
            This package is no longer supported. It's now a built-in Node module. 
            If you've depended on crypto, you should switch to the one that's built-in.

We should look into switching to the built-in one mentioned in the warning output.

To Reproduce

Just run:

npm install

Feature: Add `--prefix` option to prefix topics and schema subjects

Is your feature request related to a problem? Please describe.

It might be nice to include a --prefix option to add a prefix to all the created topics / schema subjects. I came across this because we share a schema registry across a bunch of different Confluent Kafka clusters, and I was getting rejected when registering a schema subject for a topic in a different cluster. With --prefix, different people could run datagen from the same input schema files in different clusters without stepping on each other in the schema registry.

Describe the solution you'd like

add a --prefix option

Feature: Support list type for a field

Is your feature request related to a problem? Please describe.

Datagen doesn’t yet support fields that are lists. I want to do a thing where each order has a list of item IDs, then have the item list be the parent field and items.id be the child field.

As a stopgap, I suppose I could make orders an append-only stream with individual item IDs instead of a list. I would limit the key space of the order IDs to get several items per order. Then I can group by order ID and aggregate the item IDs into a list.

Describe the solution you'd like

I think we would add a case to the record generation function to handle Array type (currently only handles Object and primitive types). There should be one faker method inside the list, and then we generate a random number of elements where the values are generated from the faker method.

For relationships, if the parent field is a list, we would iterate over the list to generate the child records.
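
A minimal sketch of that idea, assuming datagen evaluates faker expression strings from the schema with JavaScript; the helper name and the random length cap are illustrative, not the actual datagen implementation:

import { faker } from '@faker-js/faker';

// Hypothetical helper: evaluate one faker expression repeatedly to build a
// list with a random number of elements (1 to maxItems).
function generateList(fakerExpr: string, maxItems = 5): unknown[] {
  const count = faker.datatype.number({ min: 1, max: maxItems });
  // eval stands in for however datagen interprets faker strings from the schema
  return Array.from({ length: count }, () => eval(fakerExpr));
}

// e.g. generateList('faker.datatype.uuid()') -> a list of 1-5 UUIDs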

Bug: Avro schema not processed when record type is used within fields

Describe the bug

When trying to generate data using an Avro schema that has a record type in the fields, the data is generated as a string and the record type is ignored.

To Reproduce

  1. Consider the schema below, saved at tests/schema.avsc:

{
  "type": "record",
  "name": "products",
  "namespace": "exp.products.v1",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "productId", "type": ["null", "string"] },
    { "name": "title", "type": "string" },
    { "name": "price", "type": "int" },
    { "name": "isLimited", "type": "boolean" },
    { "name": "sizes", "type": ["null", "string"], "default": null },
    { "name": "ownerIds", "type": { "type": "array", "items": "string" } },
    {
      "name": "address",
      "type": {
        "type": "record",
        "name": "Address",
        "fields": [
          { "name": "ID", "type": ["null", "string"], "default": null },
          { "name": "STREET", "type": ["null", "string"], "default": null }
        ]
      }
    }
  ]
}

  2. Run the command datagen -s tests/schema.avsc -n 1 -dr -f avro
  3. The result is as below:

{
  "id": "'N7"pALPHJ",
  "productId": "fi*=2Zp?B*",
  "title": "!yqQ>6{3+I",
  "price": 87678,
  "isLimited": false,
  "sizes": "-653[vwS6+",
  "ownerIds": [
    "%75#fB+%m|",
    "N)fj1rBGOp",
    23720,
    "8>vTA@#)S=",
    "H&QmrQ]TdV",
    10082,
    "YaqG3o"`lx",
    "aB/vXGNfcr",
    57678,
    "s$zC0Rze%X"
  ],
  "address": "8TzdK'H<ak"
}

Expected behavior

The data should be generated as below

{
  "id": ")VA;tD#=%",
  "productId": "zP=8.!:zHh",
  "title": ";qlT+^i{?h",
  "price": 92089,
  "isLimited": false,
  "sizes": ";kKW.\[!,h",
  "ownerIds": [
    92325,
    77425,
    10334,
    22855,
    62197,
    62238,
    6471,
    "<=z&|<zEL2",
    "FN%T:':(g",
    "b|7?WQVzTF"
  ],
  "address": {
    "ID": "G{EZ7q>UAZ",
    "STREET": ".qwz7vv=Uk"
  }
}

Screenshots

Screenshot 2023-04-20 at 6 07 44 pm

Additional context

I have identified the problem to be in the convertAvroSchemaToJson function in the parseAvroSchema.ts file. The if condition checks if ((column.type === 'record')) {, but this will always be false because for a nested record column.type is an object, not a string.
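
A minimal sketch of one possible fix (illustrative only, not the actual datagen code): check the nested type object's own type property rather than comparing the object itself to 'record'.

// Hypothetical check: a nested Avro record field has an object `type`
// whose own `type` property is the string 'record'.
function isNestedRecord(column: { type: unknown }): boolean {
  return (
    typeof column.type === 'object' &&
    column.type !== null &&
    (column.type as { type?: string }).type === 'record'
  );
}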

Feature: Consider migrating to typescript

Is your feature request related to a problem? Please describe.

I find JavaScript very difficult to write and maintain. I need types and access modifiers to formulate my thoughts in code.

Describe the solution you'd like

I want to open the discussion of migrating to typescript, and I'd like to hear everyone's thoughts for and against it.

Feature: More flexible relationships

The implementation of datagen right now starts with the first object in the list and follows the relationships and maps the parent's primary key to the child's foreign key. But sometimes we want to map a field other than the primary key.

For example, suppose we have users, purchases, and items. We want each user to have 3 purchases, and each purchase to have 2 items. But we don't want to pass purchases.id to items.id. We want to pass purchases.item_id to items.id.

I think this should be a pretty simple fix. Each relationship currently has a single field, but really it should have a parent_field and a child_field. So for the purchases -> items relationship, you'd have "parent_field": "item_id" and "child_field": "id".

Improve integration tests

  • Use the Redpanda Docker image like we do in the Docker example to easily remove ZooKeeper
  • Use a path filter to only run tests when relevant files change, i.e. don't run tests for an update to readme.md
  • Is there a way to pre-cache Docker images on the runner so it isn't pulling every time? I found this example but I haven't looked at it deeply.

Feature: Customize statistical distribution for relationship `records_per`

Right now, we can set records_per equal to, say, 5, which means there are 5 child records for every parent record. But what if we want another, non-uniform distribution? Like, sometimes there are 100 child records for a parent record, and sometimes there are 2, following some statistical distribution.

This could potentially help the optimizer team as they investigate performance issues related to different distributions of join keys. cc @aalexandrov

Feature: Improve Producer Performance

Is your feature request related to a problem? Please describe.

Right now, we create and destroy a producer on each iteration. It would be a little more efficient to create a single producer and then disconnect when the program exits.

Feature: Bursty workloads

Right now we have --record-size and --wait, but perhaps we could customize them so that the workload feels bursty -- a lot for 10 minutes, then very slow, then a lot again.

Perhaps running one datagen process might just be inherently limited in its ability to simulate a bursty workload, but I wanted to record the idea.

Use boolean options

We have several options like --debug which are set to the strings true or false.

I think we should use booleans instead, so when the user specifies --debug, that means debug is set to true. It also makes us less likely to introduce bugs when checking whether different options are true or false.

Feature: end to end tests

Is your feature request related to a problem? Please describe.

As the code base matures and increases in functionality, we need testing infrastructure to ensure it works properly and does not regress.

Describe the solution you'd like

I've had great success with test containers for orchestrating external components and end-to-end testing. We can spin up Kafka and schema registries in a test pipeline via docker and then write out records.

Feature: Support Date types for Avro

Is your feature request related to a problem? Please describe.

Right now, if you use faker methods that generate something of type Date, it shows up as an empty record.

Describe the solution you'd like

We can implement a hook like this:

Bug: setting the key to a number results in an error

Describe the bug

Setting the key in a JSON schema to an int results in the following error:

TypeError [ERR_INVALID_ARG_TYPE]: The "string" argument must be of type string or an instance of Buffer or ArrayBuffer. Received type number (22)

To Reproduce

Example schema to reproduce the problem:

{
    "_meta": {
        "topic": "air_quality",
        "key": "id"
    },
    "id": "datatype.number({ \"min\": 1, \"max\": 100 })",
    "timestamp": "date.recent",
    "location": {
        "latitude": "datatype.number({ \"max\": 90, \"min\": -90})",
        "longitude": "datatype.number({ \"max\": 180, \"min\": -180})"
    },
    "pm25": "datatype.float({ \"min\": 10, \"max\": 90 })",
    "pm10": "datatype.float({ \"min\": 10, \"max\": 90 })",
    "temperature": "datatype.float({ \"min\": -10, \"max\": 120 })",
    "humidity": "datatype.float({ \"min\": 0, \"max\": 100 })"
}
