
Comments (4)

amysteier commented on May 29, 2024

How will we pick the default max_line_len? If I use a financial dataset as a reference for average field length, 2048 equates to 300 fields; if I use healthcare data, 2048 equates to 146 fields.

I like the default epochs, and the recommended range seems dead on. Do we alter the epochs if the model peaks before its chosen epoch count, or if it looks like it's still improving when training ends?

The difference between seq_length and max_line_len is confusing. Maybe more explanation.

Should the field_delimiter default be None? If it's structured data and they don't specify one, do we try to auto-detect it?
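One possible shape for that auto-detection, sketched with the standard library (this is an illustration of the idea, not something gretel-synthetics does today): Python's csv.Sniffer can make a best-effort guess at the delimiter from a small sample of the file and fall back to None for free text.

    import csv
    from typing import Optional

    def guess_field_delimiter(path: str, sample_bytes: int = 4096) -> Optional[str]:
        # Best-effort delimiter detection; returns None if the sample does not
        # look like delimited data (e.g. unstructured text).
        with open(path, newline="") as handle:
            sample = handle.read(sample_bytes)
        try:
            return csv.Sniffer().sniff(sample, delimiters=",\t|;").delimiter
        except csv.Error:
            return None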

In vocab_size, this is really the max vocab size. The tokenizer may choose a much smaller number.

On DP, should we say the model eps and delta will be displayed once training completes, or let them figure that out?
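For reference, one way the eps/delta report could be produced (a sketch assuming TensorFlow Privacy's compute_dp_sgd_privacy helper and a hypothetical training-set size; this is not the library's actual reporting code):

    from tensorflow_privacy.privacy.analysis.compute_dp_sgd_privacy import (
        compute_dp_sgd_privacy,
    )

    n_train = 50_000              # assumed number of training examples
    delta = 1.0 / n_train         # common rule of thumb: delta on the order of 1/n
    eps, opt_order = compute_dp_sgd_privacy(
        n=n_train,
        batch_size=64,            # matches the batch_size default below
        noise_multiplier=1.1,     # matches the dp_noise_multiplier default
        epochs=30,                # matches the epochs default
        delta=delta,
    )
    print(f"Training satisfies ({eps:.2f}, {delta:.0e})-differential privacy")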

Maybe start gen_temp with "This parameter is used to control the randomness of predictions by scaling the logits before applying softmax." then continue with the rest of your explanation.
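For illustration, the logit-scaling step described above looks roughly like this (a minimal NumPy sketch, not the library's generation code):

    import numpy as np

    def sample_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> int:
        # Scale logits by 1/temperature before the softmax, then sample an id.
        # temperature < 1.0 sharpens the distribution (more predictable text);
        # temperature > 1.0 flattens it (more surprising text).
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())   # subtract max for numerical stability
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))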

from gretel-synthetics.

zredlined commented on May 29, 2024

Initial check-in for docstrings is here:

Args:
max_lines (optional): Number of rows of file to read. Useful for training on a subset of large files.
If unspecified, max_lines will default to ``0`` (process all lines).
max_line_len (optional): Maximum line length for input training data. Any lines longer than
this length will be ignored. Default is ``2048``.
epochs (optional): Number of epochs to train the model. An epoch is an iteration over the entire
training set provided. For production use cases, 15-50 epochs are recommended.
Default is ``30``.
batch_size (optional): Number of samples per gradient update. Using larger batch sizes can help
make more efficient use of CPU/GPU parallelization, at the cost of memory.
If unspecified, batch_size will default to ``64``.
buffer_size (optional): Buffer size which is used to shuffle elements during training.
Default size is ``10000``.
seq_length (optional): The maximum length, in characters, of a single training input sequence.
Default size is ``100``.
embedding_dim (optional): Vector size for the lookup table used in the neural network
Embedding layer, which maps each character's integer id to a dense vector. Default size is ``256``.
rnn_units (optional): Positive integer, dimensionality of the output space for LSTM layers.
Default size is ``256``.
dropout_rate (optional): Float between 0 and 1. Fraction of the units to drop for the
linear transformation of the inputs. Using a dropout can help to prevent overfitting
by ignoring randomly selected neurons during training. 0.2 (20%) is often used as a good
compromise between retaining model accuracy and preventing overfitting. Default is 0.2.
rnn_initializer (optional): Initializer for the kernel weights matrix, used for the linear
transformation of the inputs. Default is ``glorot_uniform``.
field_delimiter (optional): Delimiter to use for training on structured data. When specified,
the delimiter is passed as a user-specified token to the tokenizer, which can improve
synthetic data quality. For unstructured text, leave as ``None``. For structured text
such as comma or tab separated values, specify "," or "\t" respectively. Default is ``None``.
field_delimiter_token (optional): User specified token to replace ``field_delimiter`` with
while annotating data for training the model. Default is ``<d>``.
vocab_size (optional): Pre-determined vocabulary size prior to neural model training, based on
subword units including byte-pair-encoding (BPE) and unigram language model, with the extension
of direct training from raw sentences. We generally recommend using a large vocabulary
size of 20,000 to 50,000. Default is ``20000``.
character_coverage (optional): The fraction of characters covered by the tokenizer model. Unknown
characters will be replaced with the ``<unk>`` tag. Good defaults are ``0.995`` for languages with
rich character sets like Japanese or Chinese, and ``1.0`` for other languages or machine data.
Default is ``1.0``.
dp (optional): If ``True``, train model with differential privacy enabled. This setting provides
assurances that the models will encode general patterns in data rather than facts
about specific training examples. These additional guarantees can usefully strengthen
the protections offered for sensitive data and content, at a small loss in model
accuracy and synthetic data quality. Default is ``False``.
dp_learning_rate (optional): The higher the learning rate, the more that each update during
training matters. If the updates are noisy (such as when the additive noise is large
compared to the clipping threshold), a low learning rate may help with training.
Default is ``0.015``.
dp_noise_multiplier (optional): The amount of noise sampled and added to gradients during
training. Generally, more noise results in better privacy, at the expense of
model accuracy. Default is ``1.1``.
dp_l2_norm_clip (optional): Clipping threshold for the Euclidean (L2) norm of each gradient
before it is applied to update model parameters. This hyperparameter bounds the optimizer's
sensitivity to individual training points. Default is ``1.0``.
dp_microbatches (optional): Each batch of data is split into smaller units called micro-batches.
Computational overhead can be reduced by increasing the size of micro-batches to include
more than one training example. The number of micro-batches should divide evenly into
the overall ``batch_size``. Default is ``64``.
gen_temp (optional): Low temperatures result in more predictable text. Higher temperatures
result in more surprising text. Experiment to find the best setting. Default is ``1.0``.
gen_chars (optional): Maximum number of characters to generate per line. Default is ``0`` (no limit).
gen_lines (optional): Maximum number of text lines to generate. This value is used by
``generate_text`` and the optional ``line_validator`` to make sure that all lines created
by the model pass validation. Default is ``1000``.
save_all_checkpoints (optional): Set to ``True`` to save all model checkpoints as they are created,
which can be useful for optimal model selection. Set to ``False`` to save only the latest
checkpoint. Default is ``True``.
overwrite (optional): Set to ``True`` to automatically overwrite previously saved model checkpoints.
If ``False``, the trainer will generate an error if checkpoints exist in the model directory.
Default is ``False``.
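For context, a minimal usage sketch showing how a few of these parameters are passed in (the LocalConfig/train_rnn/generate_text imports and the file paths reflect the gretel-synthetics API at the time of this discussion; treat the details as assumptions if your version differs):

    from gretel_synthetics.config import LocalConfig
    from gretel_synthetics.train import train_rnn
    from gretel_synthetics.generate import generate_text

    config = LocalConfig(
        checkpoint_dir="./checkpoints",         # where model checkpoints are written
        input_data_path="./training_data.csv",  # source file to learn from
        field_delimiter=",",                    # structured (CSV) input
        epochs=30,
        vocab_size=20000,
        gen_lines=100,
        dp=False,
    )

    train_rnn(config)

    # Each generated record reports its text and whether it passed validation.
    for line in generate_text(config, line_validator=None):
        print(line.text)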

from gretel-synthetics.

zredlined commented on May 29, 2024

Added Google style doc strings to all config params - linked below.

@amysteier can you review?

from gretel-synthetics.

zredlined commented on May 29, 2024

@amysteier re: default epochs, we can add a Keras callback function that checks for changes in model training loss or accuracy, and stops training after a certain point. Let's add this to the roadmap.
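One way that callback could look (a sketch using Keras's built-in EarlyStopping, not a committed design):

    import tensorflow as tf

    # Halt training once the monitored metric stops improving, so an overly
    # large `epochs` setting no longer wastes time or overfits.
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor="loss",             # or "val_loss" if a validation split exists
        patience=5,                 # epochs with no improvement before stopping
        restore_best_weights=True,  # roll back to the best weights seen
    )

    # model.fit(dataset, epochs=config.epochs, callbacks=[early_stopping])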

Re: field_delimiter, I'd recommend keeping it as an optional parameter. It isn't strictly necessary to specify; it just improves synthetic data quality when you set it.

Re: vocab_size, gen_temp, and seq_length - good call-outs, will clarify in doc strings.

from gretel-synthetics.
