Size Limits (gretel-synthetics issue, closed)

gretelai commented on May 30, 2024
Size Limits

from gretel-synthetics.

Comments (9)

johntmyers commented on May 30, 2024

Do you have a Traceback from the training step? There's no imposed limit from us, but depending on the hyperparameters you have set, it could be causing some exception in the downstream libs.

zakraicik commented on May 30, 2024

Yeah, just to quickly test the tool, I borrowed the hyperparameters from the heart disease Kaggle example. I'm not sure how to tune them yet.

The code is below.

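from pathlib import Path

from gretel_synthetics.config import LocalConfig
from gretel_synthetics.train import train_rnn

config = LocalConfig(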
    max_lines=0, # read all lines (zero)
    epochs=15, # 15-30 epochs for production
    vocab_size=200, # tokenizer model vocabulary size
    character_coverage=1.0, # tokenizer model character coverage percent
    gen_chars=0, # the maximum number of characters possible per-generated line of text
    gen_lines=10000, # the number of generated text lines
    rnn_units=256, # dimensionality of LSTM output space
    batch_size=64, # batch size
    buffer_size=1000, # buffer size to shuffle the dataset
    dropout_rate=0.2, # fraction of the inputs to drop
    dp=True, # let's use differential privacy
    dp_learning_rate=0.015, # learning rate
    dp_noise_multiplier=1.1, # control how much noise is added to gradients
    dp_l2_norm_clip=1.0, # bound optimizer's sensitivity to individual training points
    dp_microbatches=256, # split batches into minibatches for parallelism
    checkpoint_dir=(Path.cwd() / 'checkpoints').as_posix(),
    save_all_checkpoints=False,
    field_delimiter=",",
    input_data_path=annotated_file # filepath or S3
)

train_rnn(config)

The error I receive is

RuntimeError: Internal: /Users/travis/build/google/sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()]

My DF is ~45K rows x 1500 columns. If I shrink the number of rows, I get the same error. However, if I shrink the number of columns, it runs, so I think Google's SentencePiece probably has a maximum text length.

zredlined commented on May 30, 2024

@zakraicik Weird error. Can you provide any sample data we can use to recreate the error? Is it possible that one of your rows is all empty values?
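
One way to check for all-empty rows without sharing the data is a quick scan of the training CSV. The sketch below is only illustrative: `find_empty_rows` is a hypothetical helper (not part of gretel-synthetics), and the in-memory sample stands in for the real file.

```python
import csv
import io


def find_empty_rows(lines, delimiter=","):
    """Return 0-based indices of rows whose fields are all blank."""
    empty = []
    for index, row in enumerate(csv.reader(lines, delimiter=delimiter)):
        # csv.reader yields [] for a completely blank line, which also counts.
        if all(field.strip() == "" for field in row):
            empty.append(index)
    return empty


# Tiny in-memory stand-in for the training CSV (made-up data).
sample = io.StringIO("a,b,c\n,,\n1,2,3\n , ,\n")
empty_rows = find_empty_rows(sample)
print(empty_rows)
```

For a real file, pass an open file handle (opened with `newline=""`, as the `csv` module recommends) instead of the `StringIO` object.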

Re: config. Try expanding the vocab_size in the config to 20000. It's generally good to use the biggest vocab size you can support in GPU memory, and this might help with the input text length. The other parameters look good for most datasets. Most of our development to date has been with datasets that are 20-50 columns. At 1,500 columns, it may take a lot of samples for the neural network to learn the data structure. Let us know if increasing the vocab size helps!

zakraicik commented on May 30, 2024

@zredlined Unfortunately I can't provide any of the data.

I'll try expanding the vocab size and get back to you.

zakraicik commented on May 30, 2024

Changing the vocab size doesn't help.

johntmyers commented on May 30, 2024

@zakraicik After you output the DF to the training CSV, can you tell us what the strlen of a full row is on average? Maybe we can re-create by just generating a dataset that has similar dimensions.
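
The row lengths can be measured without exposing any of the data. A minimal sketch, assuming the training CSV is plain text with one record per line; `row_length_stats` is a hypothetical helper and the sample rows are made up:

```python
import io


def row_length_stats(lines):
    """Return (count, average, longest) line lengths in characters."""
    # Strip only the line terminator so delimiters and spaces still count.
    lengths = [len(line.rstrip("\r\n")) for line in lines]
    if not lengths:
        return 0, 0.0, 0
    return len(lengths), sum(lengths) / len(lengths), max(lengths)


# In practice, iterate over the real file instead, for example:
#   with open("training.csv", encoding="utf-8") as handle:  # placeholder path
#       print(row_length_stats(handle))
sample = io.StringIO("1,2,3\n10,20,30\n100,200,300\n")
count, average, longest = row_length_stats(sample)
print(count, average, longest)
```

If the data contains non-ASCII text, measuring `len(line.encode("utf-8"))` instead may be closer to what SentencePiece counts, since its sentence-length option is documented in bytes.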

johntmyers commented on May 30, 2024

I got it to error on a dummy set I made. We'll take a look.

johntmyers commented on May 30, 2024

Got it fixed, will cut a new release. You'll have to override the max line length. SentencePiece (SP) defaults to 2049. I generated lines that were 49500 characters long and still had to set the limit to > 50k, so you'll want the limit you set to be a few thousand higher than your longest line.
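
The arithmetic behind that advice can be sketched as follows. This is an illustration only: `lines_over_limit` and `suggest_limit` are hypothetical helpers, the 2,000-character headroom is an assumed reading of "a few K", and the name of the config parameter that carries the override is not given in this thread, so check the v0.9.3 release notes for it. The sketch also assumes, consistent with the empty-sentences error above, that lines longer than the limit are dropped before tokenizer training.

```python
def lines_over_limit(lengths, limit):
    """Count the lines that are longer than the given limit."""
    return sum(1 for length in lengths if length > limit)


def suggest_limit(lengths, headroom=2000):
    """Return the longest line length plus some headroom (assumed value)."""
    return (max(lengths) if lengths else 0) + headroom


# Five made-up rows of 49500 characters, as in the dummy set described above.
lengths = [49500] * 5

dropped_at_default = lines_over_limit(lengths, 2049)
limit = suggest_limit(lengths)
dropped_at_new_limit = lines_over_limit(lengths, limit)
print(dropped_at_default, limit, dropped_at_new_limit)
```

With the default, every row in this example exceeds the limit, so nothing would be left to train the tokenizer on; with the suggested limit, no rows are excluded.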

johntmyers commented on May 30, 2024

@zakraicik Cutting v0.9.3, which should let you set a custom line limit that overrides SP's.
