Comments (9)
Do you have a Traceback from the training step? There's no imposed limit from us, but depending on the hyperparameters you have set, it could be causing some exception in the downstream libs.
from gretel-synthetics.
Yeah - just to quickly test the tool, I borrowed the hyperparameters from the heart disease Kaggle example. I'm not sure how to tune them yet.
The code is below.
config = LocalConfig(
    max_lines=0,             # read all lines (zero = no limit)
    epochs=15,               # 15-30 epochs for production
    vocab_size=200,          # tokenizer model vocabulary size
    character_coverage=1.0,  # tokenizer model character coverage percent
    gen_chars=0,             # maximum characters per generated line (zero = no limit)
    gen_lines=10000,         # number of generated text lines
    rnn_units=256,           # dimensionality of the LSTM output space
    batch_size=64,           # batch size
    buffer_size=1000,        # buffer size for shuffling the dataset
    dropout_rate=0.2,        # fraction of the inputs to drop
    dp=True,                 # train with differential privacy
    dp_learning_rate=0.015,  # learning rate for the DP optimizer
    dp_noise_multiplier=1.1, # controls how much noise is added to gradients
    dp_l2_norm_clip=1.0,     # bounds the optimizer's sensitivity to individual training points
    dp_microbatches=256,     # split each batch into microbatches for per-example gradient clipping
    checkpoint_dir=(Path.cwd() / 'checkpoints').as_posix(),
    save_all_checkpoints=False,
    field_delimiter=",",
    input_data_path=annotated_file  # filepath or S3
)
train_rnn(config)
The error I receive is:
RuntimeError: Internal: /Users/travis/build/google/sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()]
My DataFrame is ~45K rows x 1,500 columns. If I shrink the number of rows, I get the same error; however, if I shrink the number of columns, it runs. So I suspect Google's SentencePiece has a maximum text length.
@zakraicik Weird error. Can you provide any sample data we can use to recreate the error? Is it possible that one of your rows is all empty values?
Re: config. Try expanding vocab_size to 20,000. It's generally good to use the biggest vocab size you can support in GPU memory, and this might help with the input text length. The other parameters look good for most datasets. Most of our development to date has been with datasets of 20-50 columns; at 1,500 columns, it may take a lot of samples for the neural network to learn the data structure. Let us know if increasing the vocab size helps!
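A minimal sketch of that change, keeping every other parameter from the config posted above (the 20,000 figure is the suggestion from this comment, not a tuned value):

```python
config = LocalConfig(
    vocab_size=20000,  # expanded tokenizer vocabulary, per the suggestion above
    # ... all other parameters unchanged from the earlier config
)
```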
@zredlined Unfortunately I can't provide any of the data.
I'll try expanding the vocab size and get back to you.
Changing the vocab size doesn't help.
@zakraicik After you output the DataFrame to the training CSV, can you tell us the average strlen of a full row? Maybe we can recreate the error by generating a dataset with similar dimensions.
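One stdlib-only way to answer that, sketched here with dummy values standing in for the real 1,500-column data (in practice you would read the actual training CSV instead of building one in memory):

```python
import csv
import io
import statistics

# Hypothetical stand-in for the training CSV written from the DataFrame:
# 100 rows x 1500 columns of short numeric strings.
rows = [["1.0"] * 1500 for _ in range(100)]
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# Measure the character length of each serialized row.
line_lengths = [len(line) for line in buf.getvalue().splitlines()]
avg_len = statistics.mean(line_lengths)
print(f"average row length: {avg_len:.0f} characters")
```

Even with three-character fields, 1,500 columns plus 1,499 delimiters yields rows of roughly 6,000 characters, far wider than a typical 20-50 column dataset.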
I got it to error on a dummy set I made. We'll take a look.
Got it fixed; will cut a new release. You'll have to override the max line length, since SentencePiece defaults to 2049. I generated lines that were 49,500 characters long and still had to set the limit above 50K, so set the maximum a few thousand characters higher than your longest line.
@zakraicik Cutting v0.9.3, which should let you set a custom line limit that overrides SentencePiece's default.
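Putting the thread's advice together, the usage would presumably look like the sketch below. The parameter name `max_line_len` is an assumption to be checked against the v0.9.3 release notes, and `annotated_file` is the same input path used in the original config:

```python
from pathlib import Path

from gretel_synthetics.config import LocalConfig
from gretel_synthetics.train import train_rnn

# max_line_len is assumed to be the new v0.9.3 override of SentencePiece's
# default (2049); set it a few thousand characters above your longest row.
config = LocalConfig(
    max_line_len=55000,   # longest row here was ~49,500 chars, padded by ~5K
    vocab_size=20000,     # expanded vocabulary, per the earlier suggestion
    checkpoint_dir=(Path.cwd() / 'checkpoints').as_posix(),
    input_data_path=annotated_file,  # same wide CSV as before
)
train_rnn(config)
```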
from gretel-synthetics.