
Comments (4)

amysteier commented on May 29, 2024

How will we pick the default max_line_len? If I use a financial dataset as a reference for average field length, 2048 equates to 300 fields; if I use healthcare data, 2048 equates to 146 fields.

I like the default epochs, and the recommended range seems dead on. Do we alter the epochs if the model peaks before its chosen epoch count, or if it looks like it's still improving when training ends?

The difference between seq_length and max_line_len is confusing. Maybe more explanation.

Should the field_delimiter default be None? If it's structured data and they don't specify one, do we try to auto-detect it?
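One possible shape for that auto-detection, sketched with the standard library (this is an illustration of the idea, not something gretel-synthetics does today): Python's csv.Sniffer can make a best-effort guess at the delimiter from a small sample of the file and fall back to None for free text.

    import csv
    from typing import Optional

    def guess_field_delimiter(path: str, sample_bytes: int = 4096) -> Optional[str]:
        # Best-effort delimiter detection; returns None if the sample does not
        # look like delimited data (e.g. unstructured text).
        with open(path, newline="") as handle:
            sample = handle.read(sample_bytes)
        try:
            return csv.Sniffer().sniff(sample, delimiters=",\t|;").delimiter
        except csv.Error:
            return None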

In vocab_size, this is really the max vocab size. The tokenizer may choose a much smaller number.

On DP, should we say the model eps and delta will be displayed once training completes, or let them figure that out?
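For reference, one way the eps/delta report could be produced (a sketch assuming TensorFlow Privacy's compute_dp_sgd_privacy helper and a hypothetical training-set size; this is not the library's actual reporting code):

    from tensorflow_privacy.privacy.analysis.compute_dp_sgd_privacy import (
        compute_dp_sgd_privacy,
    )

    n_train = 50_000              # assumed number of training examples
    delta = 1.0 / n_train         # common rule of thumb: delta on the order of 1/n
    eps, opt_order = compute_dp_sgd_privacy(
        n=n_train,
        batch_size=64,            # matches the batch_size default below
        noise_multiplier=1.1,     # matches the dp_noise_multiplier default
        epochs=30,                # matches the epochs default
        delta=delta,
    )
    print(f"Training satisfies ({eps:.2f}, {delta:.0e})-differential privacy")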

Maybe start gen_temp with "This parameter is used to control the randomness of predictions by scaling the logits before applying softmax." then continue with the rest of your explanation.
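For illustration, the logit-scaling step described above looks roughly like this (a minimal NumPy sketch, not the library's generation code):

    import numpy as np

    def sample_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> int:
        # Scale logits by 1/temperature before the softmax, then sample an id.
        # temperature < 1.0 sharpens the distribution (more predictable text);
        # temperature > 1.0 flattens it (more surprising text).
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())   # subtract max for numerical stability
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))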

from gretel-synthetics.

zredlined commented on May 29, 2024

Initial check-in for docstrings is here:

Args:
max_lines (optional): Number of rows of file to read. Useful for training on a subset of large files.
If unspecified, max_lines will default to ``0`` (process all lines).
max_line_len (optional): Maximum line length for input training data. Any lines longer than
this length will be ignored. Default is ``2048``.
epochs (optional): Number of epochs to train the model. An epoch is an iteration over the entire
training set provided. For production use cases, 15-50 epochs are recommended.
Default is ``30``.
batch_size (optional): Number of samples per gradient update. Using larger batch sizes can help
make more efficient use of CPU/GPU parallelization, at the cost of memory.
If unspecified, batch_size will default to ``64``.
buffer_size (optional): Buffer size which is used to shuffle elements during training.
Default size is ``10000``.
seq_length (optional): The maximum length, in characters, of a single training input sequence.
Default size is ``100``.
embedding_dim (optional): Vector size for the lookup table used in the neural network
Embedding layer, which maps each character's integer id to a dense vector. Default size is ``256``.
rnn_units (optional): Positive integer, dimensionality of the output space for LSTM layers.
Default size is ``256``.
dropout_rate (optional): Float between 0 and 1. Fraction of the units to drop for the
linear transformation of the inputs. Using a dropout can help to prevent overfitting
by ignoring randomly selected neurons during training. 0.2 (20%) is often used as a good
compromise between retaining model accuracy and preventing overfitting. Default is 0.2.
rnn_initializer (optional): Initializer for the kernel weights matrix, used for the linear
transformation of the inputs. Default is ``glorot_uniform``.
field_delimiter (optional): Delimiter to use for training on structured data. When specified,
the delimiter is passed as a user-specified token to the tokenizer, which can improve
synthetic data quality. For unstructured text, leave as ``None``. For structured text
such as comma or tab separated values, specify "," or "\t" respectively. Default is ``None``.
field_delimiter_token (optional): User specified token to replace ``field_delimiter`` with
while annotating data for training the model. Default is ``<d>``.
vocab_size (optional): Pre-determined vocabulary size prior to neural model training, based on
subword units including byte-pair-encoding (BPE) and unigram language model, with the extension
of direct training from raw sentences. We generally recommend using a large vocabulary
size of 20,000 to 50,000. Default is ``20000``.
character_coverage (optional): The fraction of characters covered by the tokenizer model. Unknown
characters will be replaced with the ``<unk>`` tag. Good defaults are ``0.995`` for languages with
rich character sets like Japanese or Chinese, and ``1.0`` for other languages or machine data.
Default is ``1.0``.
dp (optional): If ``True``, train model with differential privacy enabled. This setting provides
assurances that the models will encode general patterns in data rather than facts
about specific training examples. These additional guarantees can usefully strengthen
the protections offered for sensitive data and content, at a small loss in model
accuracy and synthetic data quality. Default is ``False``.
dp_learning_rate (optional): The higher the learning rate, the more that each update during
training matters. If the updates are noisy (such as when the additive noise is large
compared to the clipping threshold), a low learning rate may help with training.
Default is ``0.015``.
dp_noise_multiplier (optional): The amount of noise sampled and added to gradients during
training. Generally, more noise results in better privacy, at the expense of
model accuracy. Default is ``1.1``.
dp_l2_norm_clip (optional): Clipping threshold for the Euclidean (L2) norm of each gradient
before it is applied to update model parameters. This hyperparameter bounds the optimizer's
sensitivity to individual training points. Default is ``1.0``.
dp_microbatches (optional): Each batch of data is split into smaller units called micro-batches.
Computational overhead can be reduced by increasing the size of micro-batches to include
more than one training example. The number of micro-batches should divide evenly into
the overall ``batch_size``. Default is ``64``.
gen_temp (optional): Low temperatures result in more predictable text. Higher temperatures
result in more surprising text. Experiment to find the best setting. Default is ``1.0``.
gen_chars (optional): Maximum number of characters to generate per line. Default is ``0`` (no limit).
gen_lines (optional): Maximum number of text lines to generate. This value is used by
``generate_text`` and the optional ``line_validator`` to make sure that all lines created
by the model pass validation. Default is ``1000``.
save_all_checkpoints (optional): Set to ``True`` to save all model checkpoints as they are created,
which can be useful for optimal model selection. Set to ``False`` to save only the latest
checkpoint. Default is ``True``.
overwrite (optional): Set to ``True`` to automatically overwrite previously saved model checkpoints.
If ``False``, the trainer will generate an error if checkpoints exist in the model directory.
Default is ``False``.
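For context, a minimal usage sketch showing how a few of these parameters are passed in (the LocalConfig/train_rnn/generate_text imports and the file paths reflect the gretel-synthetics API at the time of this discussion; treat the details as assumptions if your version differs):

    from gretel_synthetics.config import LocalConfig
    from gretel_synthetics.train import train_rnn
    from gretel_synthetics.generate import generate_text

    config = LocalConfig(
        checkpoint_dir="./checkpoints",         # where model checkpoints are written
        input_data_path="./training_data.csv",  # source file to learn from
        field_delimiter=",",                    # structured (CSV) input
        epochs=30,
        vocab_size=20000,
        gen_lines=100,
        dp=False,
    )

    train_rnn(config)

    # Each generated record reports its text and whether it passed validation.
    for line in generate_text(config, line_validator=None):
        print(line.text)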

from gretel-synthetics.

zredlined commented on May 29, 2024

Added Google style doc strings to all config params - linked below.

@amysteier can you review?

from gretel-synthetics.

zredlined commented on May 29, 2024

@amysteier re: default epochs, we can add a Keras callback function that checks for changes in model training loss or accuracy, and stops training after a certain point. Let's add this to the roadmap.
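One way that callback could look (a sketch using Keras's built-in EarlyStopping, not a committed design):

    import tensorflow as tf

    # Halt training once the monitored metric stops improving, so an overly
    # large `epochs` setting no longer wastes time or overfits.
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor="loss",             # or "val_loss" if a validation split exists
        patience=5,                 # epochs with no improvement before stopping
        restore_best_weights=True,  # roll back to the best weights seen
    )

    # model.fit(dataset, epochs=config.epochs, callbacks=[early_stopping])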

Re: field_delimiter, I'd recommend keeping it as an optional parameter. It isn't strictly necessary to specify; it just improves synthetic data quality when you set it.

Re: vocab_size, gen_temp, and seq_length - good call-outs, will clarify in doc strings.

from gretel-synthetics.
