
texygen's Introduction

Texygen is a benchmarking platform to support research on open-domain text generation models. Texygen not only implements a majority of text generation models, but also covers a set of metrics that evaluate the diversity, quality, and consistency of the generated texts. The Texygen platform can help standardize research on text generation and facilitate the sharing of fine-tuned open-source implementations among researchers, which in turn improves the reproducibility and reliability of future research work in text generation.

For more details, please refer to our SIGIR 2018 paper: Texygen: A Benchmarking Platform for Text Generation Models by Yaoming Zhu et al.

Should you have any questions or enquiries, please feel free to contact Yaoming Zhu (ym-zhu [AT] outlook.com) or Weinan Zhang (wnzhang [AT] sjtu.edu.cn).

Requirement

We suggest you run the platform under Python 3.6+ with the following libraries:

  • TensorFlow >= 1.5.0
  • Numpy 1.12.1
  • Scipy 0.19.0
  • NLTK 3.2.3
  • CUDA 7.5+ (Suggested for GPU speed up, not compulsory)

Or just type pip install -r requirements.txt in your terminal.
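
For reference, a requirements.txt consistent with the versions listed above might look like the following (a sketch; check the file shipped with the repository for the authoritative pins):

    tensorflow>=1.5.0
    numpy==1.12.1
    scipy==0.19.0
    nltk==3.2.3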

Implemented Models and Original Papers

  • SeqGAN: "SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient"
  • MaliGAN: "Maximum-Likelihood Augmented Discrete Generative Adversarial Networks"
  • RankGAN: "Adversarial Ranking for Language Generation"
  • LeakGAN: "Long Text Generation via Adversarial Training with Leaked Information"
  • TextGAN: "Adversarial Feature Matching for Text Generation"
  • GSGAN: "GANs for Sequences of Discrete Elements with the Gumbel-softmax Distribution"
  • MLE: standard maximum-likelihood training, used as a baseline

Get Started

git clone https://github.com/geek-ai/Texygen.git
cd Texygen
# run SeqGAN with default setting
python3 main.py
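
The flags that appear throughout the issue reports below are -g (which GAN model to run), -t (training mode, e.g. real for training on a real dataset), and -d (path to the data file). For example, to train SeqGAN on your own corpus (data/your_corpus.txt is a placeholder path):

python3 main.py -g seqgan -t real -d data/your_corpus.txt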

More detailed documentation for the platform and code setup is provided here.

Evaluation Results

BLEU on image COCO caption test dataset:

          SeqGAN   MaliGAN   RankGAN   LeakGAN   TextGAN   MLE
BLEU-2    0.745    0.673     0.743     0.746     0.593     0.731
BLEU-3    0.498    0.432     0.467     0.528     0.463     0.497
BLEU-4    0.294    0.257     0.264     0.355     0.277     0.305
BLEU-5    0.180    0.159     0.156     0.230     0.207     0.189

Mode Collapse (Self-BLEU):

          SeqGAN   MaliGAN   RankGAN   LeakGAN   TextGAN   MLE
S-BLEU-2  0.950    0.918     0.959     0.966     0.942     0.916
S-BLEU-3  0.840    0.781     0.882     0.913     0.931     0.769
S-BLEU-4  0.670    0.606     0.762     0.848     0.804     0.583
S-BLEU-5  0.489    0.437     0.618     0.780     0.746     0.408

More detailed benchmark settings and evaluation results are provided here.

Reference

@article{zhu2018texygen,
  title={Texygen: A Benchmarking Platform for Text Generation Models},
  author={Zhu, Yaoming and Lu, Sidi and Zheng, Lei and Guo, Jiaxian and Zhang, Weinan and Wang, Jun and Yu, Yong},
  journal={SIGIR},
  year={2018}
}


texygen's Issues

How does the operator '@' work?

To whom it may concern,

Hi, I found a line in the RankGAN code that I don't understand. The link is here:

self.scores = tf.reshape(self.feature @ tf.transpose(self.reference, perm=[1, 0]), [-1])

I wonder how this @ operator works.

Thanks.
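
For reference, '@' is Python's matrix-multiplication operator (PEP 465, Python 3.5+), which TensorFlow overloads to tf.matmul. A minimal sketch:

    import tensorflow as tf

    a = tf.ones([4, 8])
    b = tf.ones([8, 3])
    product = a @ b  # identical to tf.matmul(a, b); result shape [4, 3]

This is also why the code requires Python 3.5 or newer (see the SyntaxError issue further down this page).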

How to analyze the results?

I ran main.py with the command python main.py and got a file, experiment-log-mle.csv. I don't know what those values mean. How do I analyze them? And where are the generated sentences?

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 2: illegal multibyte sequence

C:\Users\caocao\Anaconda3\python.exe D:/work/zhaiyao/Texygen-master/Texygen-master/main.py -g seqgan -t real -d data/shi.txt
2018-06-05 11:24:43.458330: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-06-05 11:24:45.339293: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:02:00.0
totalMemory: 4.00GiB freeMemory: 3.34GiB
2018-06-05 11:24:45.361035: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960M, pci bus id: 0000:02:00.0, compute capability: 5.0)
Traceback (most recent call last):
  File "D:/work/zhaiyao/Texygen-master/Texygen-master/main.py", line 85, in <module>
    parse_cmd(sys.argv[1:])
  File "D:/work/zhaiyao/Texygen-master/Texygen-master/main.py", line 73, in parse_cmd
    gan_func(opt_arg['-d'])
  File "D:\work\zhaiyao\Texygen-master\Texygen-master\models\seqgan\Seqgan.py", line 300, in train_real
    wi_dict, iw_dict = self.init_real_trainng(data_loc)
  File "D:\work\zhaiyao\Texygen-master\Texygen-master\models\seqgan\Seqgan.py", line 264, in init_real_trainng
    self.sequence_length, self.vocab_size = text_precess(data_loc)
  File "D:\work\zhaiyao\Texygen-master\Texygen-master\utils\text_process.py", line 75, in text_precess
    train_tokens = get_tokenlized(train_text_loc)
  File "D:\work\zhaiyao\Texygen-master\Texygen-master\utils\text_process.py", line 50, in get_tokenlized
    for text in raw:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 2: illegal multibyte sequence
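
The Windows default codec (gbk here) is being used to read a UTF-8 file. A minimal sketch of a fix, mirroring the get_tokenlized function named in the traceback (the explicit encoding argument is the actual change; the rest of the body is an assumption for illustration):

    import nltk

    def get_tokenlized(file):
        # read the corpus as UTF-8 so Windows does not fall back to gbk
        tokenlized = list()
        with open(file, encoding='utf-8') as raw:
            for text in raw:
                tokenlized.append(nltk.word_tokenize(text.lower()))
        return tokenlized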

In the MLE model, you sample from a multinomial distribution instead of using argmax to generate next_token

    def _g_recurrence(i, x_t, h_tm1, gen_o, gen_x):
        h_t = self.g_recurrent_unit(x_t, h_tm1)  # hidden_memory_tuple
        o_t = self.g_output_unit(h_t)  # batch x vocab, logits not prob
        log_prob = tf.log(tf.nn.softmax(o_t))
        next_token = tf.cast(tf.reshape(tf.multinomial(log_prob, 1), [self.batch_size]), tf.int32)
        x_tp1 = tf.nn.embedding_lookup(self.g_embeddings, next_token)  # batch x emb_dim
        gen_o = gen_o.write(i, tf.reduce_sum(tf.multiply(tf.one_hot(next_token, self.num_vocabulary, 1.0, 0.0),
                                                         tf.nn.softmax(o_t)), 1))  # [batch_size], prob
        gen_x = gen_x.write(i, next_token)  # indices, batch_size
        return i + 1, x_tp1, h_t, gen_o, gen_x

In this function, you sample from a multinomial distribution to generate next_token. However, in most cases next_token is generated by argmax.
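
For comparison, a greedy decoding step would replace the tf.multinomial sampling with an argmax over the logits; a sketch against the o_t defined above:

    # greedy alternative: take the single most likely token instead of sampling
    next_token = tf.cast(tf.argmax(o_t, axis=1), tf.int32)  # shape: [batch_size]

Note that sampling from the softmax is the usual choice when the goal is to draw diverse samples from the learned distribution (which the diversity metrics here require), whereas argmax yields only the single most likely continuation.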

Question about Synthetic Data Experiment

Hi guys. Thanks for the great repo!

I am looking at Section 4.2, Synthetic Data Experiment, of the Texygen paper. I have two questions.

  1. Are all the models trained on the exact same oracle, or is it re-initialized at every run?

  2. Did you do a hyperparameter search for all the models (and presented the best run for each), or did you run one run for each model (with the default parameters in the repo)?

Thanks a lot!

How to use Bleu?

In the Bleu function, what do test_text and real_text mean? Does test_text refer to 'save/test_file.txt'? And what does real_text refer to: 'save/oracle.txt' or 'save/generator.txt'? Thanks!

Unique n-gram: division by length of document

I think there is an error in the unique-n-gram class; the correct computation should be #unique_grams / #grams.

    def get_ng(self):
        document = self.get_reference()
        length = len(document)  # bug: this is the number of sentences, not the number of n-grams
        grams = list()
        for sentence in document:
            grams += self.get_gram(sentence)
        print(grams, len(set(grams)), len(grams))
        # the unique-n-gram ratio should divide by the number of n-grams,
        # not by the number of sentences:
        return len(set(grams)) / len(grams)  # not len(set(grams)) / length

TextGAN and GSGAN generators do not converge

Hi,
I have been feeding a very simple dataset composed of consecutive ints to all GANs available in this library. The data is generated as follows, and stored in the oracle file.

X = [list(range(20))] * 80

I then attempt to train on real data, using this file as input. SeqGAN, LeakGAN, RankGAN, MLE and MaliGAN all manage to learn and reproduce this simple pattern.

The generator of GSGAN is able to restrict its vocabulary to the one given in input (i.e. use only integers between 0 and 20 instead of 0 and 5000), but is not able to capture the ordering.

The generator of TextGAN is neither able to restrict its vocabulary to the one given in input, nor able to learn an ordering.

As a result, the nll-test for these models is far from the one reached by the other models. Any hint on what may be causing this issue?

Question about self-BLEU implementation

As far as I know, the basic idea of self-BLEU scores is to calculate the BLEU scores by choosing each sentence in the set of generated sentences as hypothesis and the others as reference, and then take an average of BLEU scores over all the generated sentences.

However, when looking into the implementation of self-BLEU scores (https://github.com/geek-ai/Texygen/blob/master/utils/metrics/SelfBleu.py), I found an issue with evaluating self-BLEU over training: only on the first evaluation do the reference and hypothesis come from the same "test data" (i.e. the set of generated sentences). After that, the hypothesis keeps being updated but the reference remains unchanged (due to "is_first=False"), which means the hypothesis and reference are no longer from the same "test data", and thus the scores obtained under this implementation are not self-BLEU scores.

To this end, I modified the implementation to make sure that the hypothesis and reference are always from the same “test data” (by simply removing the variables "self.reference" and "self.is_first") and found that the self-BLEU (2-5) scores are always 1 when evaluating all the models.

Please let me know whether my concern makes sense, or whether I have just misunderstood the definition of self-BLEU scores.

BLEU Score Computation

I have two questions:

  1. How do you compute the reported BLEU scores? I see the imports of BLEU in several models but BLEU is never added to the self.metrics collection of any of them.

  2. When computing BLEU, which sentences are considered to be references?

How to get text from generated generator.txt file?

Hi,
I used real data, and as a result of running python main.py -g seqgan -t real -d 'data/narratives.txt'
I got a generator.txt file under the save directory. An example row of this generator.txt file looks like:

4172 2137 1971 4089 842 2056 3806 2100 3845 1955 1111 721 462 3040 3877 2913 4855 1363 374 4203
Now, my question is how can I get the actual text from this generator.txt file?
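
A minimal sketch of the reverse mapping, assuming iw_dict is the index-to-word dictionary produced by text_precess in utils/text_process.py (that name appears in the tracebacks elsewhere on this page); whether its keys are ints or strings depends on how the dictionary was built:

    # convert the id sequences in save/generator.txt back to words
    with open('save/generator.txt', encoding='utf-8') as src, \
            open('save/generator_text.txt', 'w', encoding='utf-8') as dst:
        for line in src:
            words = [iw_dict[token] for token in line.split()]  # or iw_dict[int(token)]
            dst.write(' '.join(words) + '\n')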

Running python main.py -g leakgan hits a flags bug, and data/shi.txt has an encoding bug

Traceback (most recent call last):
  File "main.py", line 85, in <module>
    parse_cmd(sys.argv[1:])
  File "main.py", line 67, in parse_cmd
    gan = set_gan(opt_arg['-g'])
  File "main.py", line 26, in set_gan
    gan = Gan()
  File "E:\all_code\aspectLevel\Texygen\models\leakgan\Leakgan.py", line 64, in __init__
    self.sequence_length = FLAGS.length
  File "D:\Anacond\lib\site-packages\tensorflow\python\platform\flags.py", line 84, in __getattr__
    wrapped(_sys.argv)
  File "D:\Anacond\lib\site-packages\absl\flags\_flagvalues.py", line 630, in __call__
    name, value, suggestions=suggestions)
absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'g'

dataset partition

Hi,

Thanks again for the great benchmarking platform.

I am wondering what exact dataset partition is normally used for the EMNLP2017 News and Image COCO experiments. I ask because you provide a train/test split, but it is unclear what the validation set is.

Also, could you confirm that the reported performance in "Neural Text Generation: Past, Present and Beyond" and "CoT: Cooperative Training for Generative Modeling of Discrete Data" was computed with the same split?

Thank you very much

Small vocab_size raises division by zero in DocEmbSim

Feeding the following real training dataset to a SeqGAN works perfectly:

X = np.random.randint(0, 20, (80, 20))

However, the following dataset with the same dimensionality but 6 symbols instead of 20 raises an error.

X = np.random.randint(0, 6, (80, 20))

In both cases, we used vocab_size = #unique symbols + 1, as suggested in text_process.text_precess(). Here is the corresponding traceback:

Traceback (most recent call last):
  File "texygen/texygen.py", line 85, in train
    gan_func(X)
  File "texygen/models/seqgan/Seqgan.py", line 331, in train_real
    self.evaluate()
  File "texygen/models/seqgan/Seqgan.py", line 80, in evaluate
    scores = super().evaluate()
  File "texygen/models/Gan.py", line 55, in evaluate
    score = metric.get_score()
  File "texygen/utils/metrics/DocEmbSim.py", line 33, in get_score
    return self.get_dis_corr()
  File "texygen/utils/metrics/DocEmbSim.py", line 164, in get_dis_corr
    return np.log10(corr / len(self.oracle_sim))
ZeroDivisionError: division by zero
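
The final frame shows that self.oracle_sim is empty when the vocabulary is this small. A defensive sketch that fails loudly instead of dividing by zero (this guards the symptom; the root cause is presumably that DocEmbSim's skip-gram batching needs more distinct tokens than such a corpus provides):

    # DocEmbSim.get_dis_corr (sketch)
    if len(self.oracle_sim) == 0:
        raise ValueError('DocEmbSim: empty similarity list; the vocabulary '
                         'may be too small for the skip-gram batching')
    return np.log10(corr / len(self.oracle_sim))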

On LeakGAN (non-interleaved) Pretraining

In your code, the LeakGAN generator is pretrained, and only after that is the discriminator pretrained. If I understand correctly, this means useless features (i.e. noise) from the discriminator are leaked to the generator during pretraining, and those features then become "drastically" useful once the discriminator is pretrained (i.e. at the start of adversarial training).

In the original LeakGAN code, as well as the original paper (appendix: algorithm), the authors propose pretraining the generator and discriminator in an interleaved fashion.

Getting error while training the SeqGAN model

Hi,

I got the following error when training the SeqGAN model.

epoch:13
[1.1808003]
[1.1286035]
[1.1499095]
epoch:14
Traceback (most recent call last):
  File "main.py", line 85, in <module>
    parse_cmd(sys.argv[1:])
  File "main.py", line 69, in parse_cmd
    gan.train_oracle()
  File "path/to/Texygen/models/seqgan/Seqgan.py", line 131, in train_oracle
    self.train_discriminator()
  File "path/to/Texygen/models/seqgan/Seqgan.py", line 52, in train_discriminator
    self.dis_data_loader.load_train_data(self.oracle_file, self.generator_file)
  File "path/to/Texygen/models/seqgan/SeqganDataLoader.py", line 70, in load_train_data
    self.labels = np.concatenate([positive_labels, negative_labels], 0)
ValueError: all the input arrays must have same number of dimensions

The command I ran is

python main.py -g seqgan

Do you have any idea how it happens or how to fix it?
Thanks in advance.

potential bug in self-bleu calculations

According to the paper, for self-BLEU each generated sentence is scored against all the other generated sentences as references.

The current self-BLEU implementation includes the selected hypothesis in the list of references. This risks inflating the self-BLEU scores, since there will always be a direct match between the hypothesis and one of the references.

    def get_bleu(self):
        ngram = self.gram
        bleu = list()
        reference = self.get_reference()
        weight = tuple((1. / ngram for _ in range(ngram)))
        with open(self.test_data) as test_data:
            for hypothesis in test_data:
                hypothesis = nltk.word_tokenize(hypothesis)
                bleu.append(nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight,
                                                                    smoothing_function=SmoothingFunction().method1))
        return sum(bleu) / len(bleu)

Should we remove the target hypothesis from the set of references, or am I missing something here?
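
A sketch of the leave-one-out variant proposed above, assuming get_reference() returns the tokenized generated sentences and the same imports as SelfBleu.py (nltk, SmoothingFunction); get_bleu_loo is a hypothetical name, not part of the repo:

    def get_bleu_loo(self):
        # leave-one-out self-BLEU: score each sentence against all the others
        ngram = self.gram
        weight = tuple((1. / ngram for _ in range(ngram)))
        sentences = self.get_reference()
        bleu = list()
        for i, hypothesis in enumerate(sentences):
            references = sentences[:i] + sentences[i + 1:]  # exclude the hypothesis itself
            bleu.append(nltk.translate.bleu_score.sentence_bleu(
                references, hypothesis, weight,
                smoothing_function=SmoothingFunction().method1))
        return sum(bleu) / len(bleu)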

Thanks for the help in advance

Some questions about GSGAN

Hello, I have a few questions about GSGAN.

  1. In your code, the inverse temperature parameter τ (self.tau in file: GsganGenerator.py) is kept at 10. However, in the original paper, the authors suggest starting with a relatively large τ and then annealing it to zero during training (see the sketch after this list).

  2. What's more, I also don't understand why you add Gumbel noise before calculating the output logits.

        def _pretrain_recurrence(i, x_t, h_tm1, g_predictions):
            h_t = self.g_recurrent_unit(x_t, h_tm1)
            h_t = self.add_gumbel(h_t)  # add g_i?????
            o_t = self.g_output_unit(h_t)
            g_predictions = g_predictions.write(i, tf.nn.softmax(o_t))  # batch x vocab_size
            x_tp1 = tf.nn.softmax(o_t / self.tau)
            return i + 1, x_tp1, h_t, g_predictions

Can you give me more illustration about this function?

  3. Finally, why don't you show the performance of GSGAN?
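
On point 1, a hedged sketch of the kind of schedule the Gumbel-Softmax paper describes; tau_max, tau_min, and anneal_rate are hypothetical hyperparameters, not names from this repo:

    import math

    def annealed_tau(step, tau_max=10.0, tau_min=0.1, anneal_rate=1e-4):
        # start hot (smooth, high-entropy relaxations) and cool toward
        # near-one-hot samples as training progresses
        return max(tau_min, tau_max * math.exp(-anneal_rate * step))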

Bugs in running RankGAN

When running python main.py -g rankgan, I got an IndexError:

Traceback (most recent call last):
  File "/home/xxx/Texygen/main.py", line 78, in parse_cmd
    gan.train_oracle()
  File "/home/xxx/Texygen/models/rankgan/Rankgan.py", line 121, in train_oracle
    self.evaluate()
  File "/home/chaiyekun/GAN.tf/Texygen/models/rankgan/Rankgan.py", line 93, in evaluate
    scores = super().evaluate()
  File "/home/xxx/Texygen/models/Gan.py", line 57, in evaluate
    score = metric.get_score()
  File "/home/xxx/Texygen/utils/metrics/DocEmbSim.py", line 32, in get_score
    self.get_gen_sim()
  File "/home/xxx/Texygen/utils/metrics/DocEmbSim.py", line 153, in get_gen_sim
    self.gen_sim = self.get_wordvec(self.generator_file)
  File "/home/xxx/Texygen/utils/metrics/DocEmbSim.py", line 142, in get_wordvec
    batch_size, num_skips, skip_window, data[index])
  File "/home/xxx/Texygen/utils/metrics/DocEmbSim.py", line 72, in generate_batch
    buffer.append(data[self.data_index])
IndexError: list index out of range

get_bleu_fast and get_bleu producing different result

reference = reference[0:self.sample_size]

When you calculate BLEU, the size of the reference list should match exactly across comparisons, since a reference list with a greater number of sentences produces a higher BLEU, as mentioned in the paper that introduced the BLEU score. The line above truncates the original reference list of 10k sentences to 500 sentences, which results in a lower BLEU score. The same can be said for self-BLEU. Did you calculate the COCO (self-)BLEU scores with get_bleu_fast?

How to save only the discriminator model?

In Seqgan.py, I can only save the whole model, like this:

print('adversarial training:')
self.reward = Reward(self.generator, .8)
for epoch in range(self.adversarial_epoch_num):
    # print('epoch:' + str(epoch))
    start = time()
    for index in range(1):
        samples = self.generator.generate(self.sess)
        rewards = self.reward.get_reward(self.sess, samples, 16, self.discriminator)
        feed = {
            self.generator.x: samples,
            self.generator.rewards: rewards
        }
        loss, _ = self.sess.run([self.generator.g_loss, self.generator.g_updates], feed_dict=feed)
        print(loss)
    end = time()
    self.add_epoch()
    print('epoch:' + str(self.epoch) + '\t time:' + str(end - start))
    if epoch % 5 == 0 or epoch == self.adversarial_epoch_num - 1:
        generate_samples(self.sess, self.generator, self.batch_size, self.generate_num, self.generator_file)
        get_real_test_file()
        self.evaluate()
    self.reward.update_params()
    for _ in range(15):
        self.train_discriminator()
self.save('./save/seqgan/model.pth')
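
A sketch of saving just the discriminator, assuming its variables are created under a TF variable scope named 'discriminator' (check the actual scope name in SeqganDiscriminator.py):

    # collect only the discriminator's variables and save them separately
    dis_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='discriminator')
    dis_saver = tf.train.Saver(var_list=dis_vars)
    dis_saver.save(self.sess, './save/seqgan/discriminator.ckpt')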

Error in NLL_oracle / NLL_test reported for LeakGAN

Hi,

It seems that in the current setup, a different temperature is used to calculate NLL_oracle and NLL_test for LeakGAN. Is this intended? In my opinion, this gives a misleading view of the model's quality/diversity tradeoff.

Thanks,
Lucas

[Retracted]

[Retracted]

Looks like the original LeakGAN code did this too.

LeakGAN trains on test data!

Regarding the synthetic data experiment, the nll-test metric is computed with the gen_data_loader (training data), so nll-test is actually nll-train, a useless metric. Highly misleading.

Also, even if you create a separate test set (test_file.txt), the function generator.get_nll calls some training updates. This function is used to compute nll-test. So even in this case, you would still be training on test data!

I suspect this error occurs in other models also ...


Question on vocabulary size of the Chinese poem dataset

I'm trying to reproduce the Poem BLEU-2 result in the SeqGAN paper, but I couldn't find the vocabulary size used in the paper. The RankGAN paper uses a different dataset of 13,123 poems and filters out words that occur fewer than 5 times. Do you know the vocabulary size used in the SeqGAN paper? Thanks a lot!

How to generate data from leakgan: leakgan error

Hello, I need some help generating data using leakgan. While running leakgan on my "test" file, the error below shows up. I would highly appreciate your assistance:

/Texygen$ python main.py -g leakgan -t real -d /data/test.txt
Traceback (most recent call last):
  File "main.py", line 10, in <module>
    from models.rankgan.Rankgan import Rankgan
  File "/home/Texygen/models/rankgan/Rankgan.py", line 5, in <module>
    from models.rankgan.RankganDiscriminator import Discriminator
  File "/home/Texygen/models/rankgan/RankganDiscriminator.py", line 191
    self.scores = tf.reshape(self.feature @ tf.transpose(self.reference, perm=[1, 0]), [-1])
                                          ^
SyntaxError: invalid syntax

please find test file attached
test.txt

On Accurate Evaluation of GANs for Language Generation

Dear authors:

First, I would like to thank you for the work. It is really helpful for standardizing the development of GAN-based text generation methods.

However, Google recently published a paper, "On Accurate Evaluation of GANs for Language Generation", arguing that BLEU is not a good, and possibly even a misleading, metric for GAN-based text generation methods. (See the summary I wrote about this paper.) Therefore, in my humble opinion, it would be better to also report stronger metrics (specifically, the reverse LM score and Fréchet distance) for the methods already implemented in Texygen.

Best,
Howard

Why GAN instead of RNN?

Hi there,

Thank you for the repo. I have been trying some different models recently for my project. So far I have observed that RNN models such as textgenrnn are considerably more effective than GANs. Do you have any benchmark of this repo against RNN models?

Cheers

LeakGAN architecture

Hi,

It seems that the default LeakGAN discriminator architecture in this repository is quite different from the one in the official repo. Specifically, Texygen's implementation has only two layers, while the official one has over 10. Is there any reason for this discrepancy?

Thanks in advance,
Lucas

BLEU of LeakGAN is much lower than in the original paper

BLEU-2 of LeakGAN on WMT is 0.956 in the original paper, whereas it is 0.835 here. Notably, the BLEU scores of LeakGAN and of the other models on COCO also diminished in your benchmark. Do you know what caused the difference? In order to compare my model with the ones in your benchmark, am I supposed to lowercase every word using .lower(), as suggested in text_process.py? Lowercase-only generation wasn't done in LeakGAN, which treated 'A' and 'a' as different tokens, so that may be one factor in the aforementioned difference.

How to prepare the data for a text generation task? Thank you very much.

First, I'm not sure whether the model contains an encoder during training.

EOS means end-of-sentence. The encoder and decoder are parts of a transformer network.

If without-encoder, training time:

target: [E, F, G, H, EOS]
decoder input: [0, E, F, G, H]

If without-encoder, testing time:

decoder input: [0]

If with encoder, training time:

encoder input: [A, B, C, D]
target: [E, F, G, H, EOS]
decoder input: [0, E, F, G, H]

If with-encoder, testing time:

encoder input: [A, B, C, D]
decoder input: [0]

Am I exactly right?

I know this is beyond the topic of this project, but I hope you can help.
Thank you very much.

Changing the number of synthetic reports/sentences being generated?

Hello!

This is a great effort and has been very useful for my work.
I am trying to change the number of records being generated (and saved to test_file.txt) by changing the generate_num parameter in the seqgan.py file.

However, modifying this doesn't seem to have an effect. What's the right way to change the number of synthetic reports being generated?
