Comments (9)
Do you have a Traceback from the training step? There's no imposed limit from us, but depending on the hyperparameters you have set, it could be causing some exception in the downstream libs.
from gretel-synthetics.
Yeah - just to quickly test the tool, I borrowed the hyperparameters from the heart disease Kaggle example. I'm not sure how to tune them yet.
The code is below.
config = LocalConfig(
    max_lines=0,             # read all lines (zero = no limit)
    epochs=15,               # 15-30 epochs for production
    vocab_size=200,          # tokenizer model vocabulary size
    character_coverage=1.0,  # tokenizer model character coverage percent
    gen_chars=0,             # maximum characters per generated line (zero = no limit)
    gen_lines=10000,         # number of generated text lines
    rnn_units=256,           # dimensionality of the LSTM output space
    batch_size=64,           # batch size
    buffer_size=1000,        # buffer size for shuffling the dataset
    dropout_rate=0.2,        # fraction of the inputs to drop
    dp=True,                 # train with differential privacy
    dp_learning_rate=0.015,  # learning rate for the DP optimizer
    dp_noise_multiplier=1.1, # controls how much noise is added to gradients
    dp_l2_norm_clip=1.0,     # bounds the optimizer's sensitivity to individual training points
    dp_microbatches=256,     # split each batch into microbatches for per-example gradient clipping
    checkpoint_dir=(Path.cwd() / 'checkpoints').as_posix(),
    save_all_checkpoints=False,
    field_delimiter=",",
    input_data_path=annotated_file  # filepath or S3
)
train_rnn(config)
The error I receive is:
RuntimeError: Internal: /Users/travis/build/google/sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()]
My DataFrame is ~45K rows x 1,500 columns. If I shrink the number of rows, I get the same error; however, if I shrink the number of columns, it runs. So I suspect Google's SentencePiece has a maximum text length.
@zakraicik Weird error. Can you provide any sample data we can use to recreate the error? Is it possible that one of your rows is all empty values?
Re: config. Try expanding vocab_size to 20,000. It's generally good to use the biggest vocab size you can support in GPU memory, and this might help with the input text length. The other parameters look good for most datasets. Most of our development to date has been with datasets of 20-50 columns; at 1,500 columns, it may take a lot of samples for the neural network to learn the data structure. Let us know if increasing the vocab size helps!
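A minimal sketch of that change, keeping every other parameter from the config posted above (the 20,000 figure is the suggestion from this comment, not a tuned value):

```python
config = LocalConfig(
    vocab_size=20000,  # expanded tokenizer vocabulary, per the suggestion above
    # ... all other parameters unchanged from the earlier config
)
```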
@zredlined Unfortunately I can't provide any of the data.
I'll try expanding the vocab size and get back to you.
Changing the vocab size doesn't help.
@zakraicik After you output the DataFrame to the training CSV, can you tell us the average strlen of a full row? Maybe we can recreate the error by generating a dataset with similar dimensions.
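One stdlib-only way to answer that, sketched here with dummy values standing in for the real 1,500-column data (in practice you would read the actual training CSV instead of building one in memory):

```python
import csv
import io
import statistics

# Hypothetical stand-in for the training CSV written from the DataFrame:
# 100 rows x 1500 columns of short numeric strings.
rows = [["1.0"] * 1500 for _ in range(100)]
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# Measure the character length of each serialized row.
line_lengths = [len(line) for line in buf.getvalue().splitlines()]
avg_len = statistics.mean(line_lengths)
print(f"average row length: {avg_len:.0f} characters")
```

Even with three-character fields, 1,500 columns plus 1,499 delimiters yields rows of roughly 6,000 characters, far wider than a typical 20-50 column dataset.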
I got it to error on a dummy set I made. We'll take a look.
Got it fixed; will cut a new release. You'll have to override the max line length, since SentencePiece defaults to 2049. I generated lines that were 49,500 characters long and still had to set the limit above 50K, so set the maximum a few thousand characters higher than your longest line.
@zakraicik Cutting v0.9.3, which should let you set a custom line limit that overrides SentencePiece's default.
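Putting the thread's advice together, the usage would presumably look like the sketch below. The parameter name `max_line_len` is an assumption to be checked against the v0.9.3 release notes, and `annotated_file` is the same input path used in the original config:

```python
from pathlib import Path

from gretel_synthetics.config import LocalConfig
from gretel_synthetics.train import train_rnn

# max_line_len is assumed to be the new v0.9.3 override of SentencePiece's
# default (2049); set it a few thousand characters above your longest row.
config = LocalConfig(
    max_line_len=55000,   # longest row here was ~49,500 chars, padded by ~5K
    vocab_size=20000,     # expanded vocabulary, per the earlier suggestion
    checkpoint_dir=(Path.cwd() / 'checkpoints').as_posix(),
    input_data_path=annotated_file,  # same wide CSV as before
)
train_rnn(config)
```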
from gretel-synthetics.