geek-ai / texygen
A text generation benchmarking platform
License: MIT License
C:\Users\caocao\Anaconda3\python.exe D:/work/zhaiyao/Texygen-master/Texygen-master/main.py -g seqgan -t real -d data/shi.txt
2018-06-05 11:24:43.458330: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-06-05 11:24:45.339293: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:02:00.0
totalMemory: 4.00GiB freeMemory: 3.34GiB
2018-06-05 11:24:45.361035: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960M, pci bus id: 0000:02:00.0, compute capability: 5.0)
Traceback (most recent call last):
File "D:/work/zhaiyao/Texygen-master/Texygen-master/main.py", line 85, in <module>
parse_cmd(sys.argv[1:])
File "D:/work/zhaiyao/Texygen-master/Texygen-master/main.py", line 73, in parse_cmd
gan_func(opt_arg['-d'])
File "D:\work\zhaiyao\Texygen-master\Texygen-master\models\seqgan\Seqgan.py", line 300, in train_real
wi_dict, iw_dict = self.init_real_trainng(data_loc)
File "D:\work\zhaiyao\Texygen-master\Texygen-master\models\seqgan\Seqgan.py", line 264, in init_real_trainng
self.sequence_length, self.vocab_size = text_precess(data_loc)
File "D:\work\zhaiyao\Texygen-master\Texygen-master\utils\text_process.py", line 75, in text_precess
train_tokens = get_tokenlized(train_text_loc)
File "D:\work\zhaiyao\Texygen-master\Texygen-master\utils\text_process.py", line 50, in get_tokenlized
for text in raw:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 2: illegal multibyte sequence
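The 'gbk' codec in the traceback suggests the data file was opened without an explicit encoding, so Python fell back to the Windows locale default. Below is a minimal sketch of reading a corpus with an explicit encoding. The helper name read_corpus_lines is hypothetical and not part of Texygen, and the sketch assumes the data file is actually UTF-8 encoded.

```python
import os
import tempfile


def read_corpus_lines(path, encoding="utf-8"):
    """Read a text corpus line by line with an explicit encoding.

    Hypothetical helper (not part of Texygen). Passing `encoding` avoids
    falling back to the platform default, e.g. 'gbk' on some Windows setups.
    """
    with open(path, encoding=encoding) as raw:
        return [line.rstrip("\n") for line in raw]


# Demonstration on a temporary UTF-8 file containing non-ASCII text.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as out:
        out.write("床前明月光\n疑是地上霜\n")
    demo_lines = read_corpus_lines(demo_path)
```

If the file is in a different encoding, the same call with that encoding name applies; re-saving the file as UTF-8 is another option.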
Hi, I have a quick question: does self.g_loss need to be divided by self.batch_size (as self.pretrain_loss is)?
Thank you for your reply.
In Seqgan.py, I can only save the entire model like this:
print('adversarial training:')
self.reward = Reward(self.generator, .8)
for epoch in range(self.adversarial_epoch_num):
    # print('epoch:' + str(epoch))
    start = time()
    for index in range(1):
        samples = self.generator.generate(self.sess)
        rewards = self.reward.get_reward(self.sess, samples, 16, self.discriminator)
        feed = {
            self.generator.x: samples,
            self.generator.rewards: rewards
        }
        loss, _ = self.sess.run([self.generator.g_loss, self.generator.g_updates], feed_dict=feed)
        print(loss)
    end = time()
    self.add_epoch()
    print('epoch:' + str(self.epoch) + '\t time:' + str(end - start))
    if epoch % 5 == 0 or epoch == self.adversarial_epoch_num - 1:
        generate_samples(self.sess, self.generator, self.batch_size, self.generate_num, self.generator_file)
        get_real_test_file()
        self.evaluate()
    self.reward.update_params()
    for _ in range(15):
        self.train_discriminator()
self.save('./save/seqgan/model.pth')
I have two questions:
1. How do you compute the reported BLEU scores? I see BLEU imported in several models, but it is never added to the self.metrics collection of any of them.
2. When computing BLEU, which sentences are considered references?
Is there a way to condition the generations of the GANs on some input text similar to how you can include a prompt/seed for transformers (GPT-2/XLNet) when generating language samples?
Hi,
I got the following error when training the SeqGAN model.
epoch:13
[1.1808003]
[1.1286035]
[1.1499095]
epoch:14
Traceback (most recent call last):
File "main.py", line 85, in <module>
parse_cmd(sys.argv[1:])
File "main.py", line 69, in parse_cmd
gan.train_oracle()
File "path/to/Texygen/models/seqgan/Seqgan.py", line 131, in train_oracle
self.train_discriminator()
File "path/to/Texygen/models/seqgan/Seqgan.py", line 52, in train_discriminator
self.dis_data_loader.load_train_data(self.oracle_file, self.generator_file)
File "path/to/Texygen/models/seqgan/SeqganDataLoader.py", line 70, in load_train_data
self.labels = np.concatenate([positive_labels, negative_labels], 0)
ValueError: all the input arrays must have same number of dimensions
The command I ran is
python main.py -g seqgan
Do you have any idea how it happens or how to fix it?
Thanks in advance.
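One way this ValueError can arise, offered as an unconfirmed guess rather than a diagnosis: if one of the two label lists is empty (for example because the generator file was empty when it was read), NumPy turns it into a 1-D array while the other is 2-D. The sketch below reproduces that situation; the [0, 1] label layout and the helper name concat_labels are assumptions for illustration.

```python
import numpy as np

# Assumed label layout: one 2-element row per example.
positive_labels = [[0, 1] for _ in range(3)]  # becomes shape (3, 2)
negative_labels = []                          # becomes shape (0,), i.e. 1-D

try:
    np.concatenate([positive_labels, negative_labels], 0)
    mismatch_raised = False
except ValueError:
    # 2-D and 1-D inputs cannot be concatenated along axis 0.
    mismatch_raised = True


def concat_labels(positive, negative, width=2):
    """Hypothetical guard: force both inputs to shape (n, width) first."""
    pos = np.asarray(positive, dtype=np.int64).reshape(-1, width)
    neg = np.asarray(negative, dtype=np.int64).reshape(-1, width)
    return np.concatenate([pos, neg], 0)


labels = concat_labels(positive_labels, negative_labels)
```

Checking whether the generator file is empty at the failing epoch would confirm or rule out this explanation.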
Line 65 in 08c67a1
When you calculate BLEU, the size of the reference list should match exactly, since a reference list with more sentences produces a higher BLEU score, as noted in the paper that introduced BLEU. The line referenced above truncates the original reference list of 10k sentences to 500 sentences, which lowers the BLEU score. The same applies to self-BLEU. Did you calculate COCO (self-)BLEU with get_bleu_fast?
Hello, I have a few questions about GSGAN.
In your code, the inverse temperature parameter τ (self.tau in file: GsganGenerator.py) is kept at 10. However, in the original paper, the authors suggest starting with a relatively large τ and then annealing it toward zero during training.
In addition, I don't understand why you add Gumbel noise before calculating the output logits.
def _pretrain_recurrence(i, x_t, h_tm1, g_predictions):
    h_t = self.g_recurrent_unit(x_t, h_tm1)
    h_t = self.add_gumbel(h_t)  # add g_i?????
    o_t = self.g_output_unit(h_t)
    g_predictions = g_predictions.write(i, tf.nn.softmax(o_t))  # batch x vocab_size
    x_tp1 = tf.nn.softmax(o_t / self.tau)
    return i + 1, x_tp1, h_t, g_predictions
Could you explain this function in more detail?
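For reference, here is a minimal pure-Python sketch of the usual Gumbel-softmax formulation, in which the Gumbel noise is added to the logits (not to the hidden state) and the temperature is annealed. The function names and the exponential schedule with its constants are illustrative assumptions, not taken from the repository or the paper.

```python
import math
import random


def sample_gumbel(rng, eps=1e-20):
    """Draw one Gumbel(0, 1) sample via the inverse-CDF trick."""
    u = rng.random()
    return -math.log(-math.log(u + eps) + eps)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + g) / tau)."""
    noisy = [(logit + sample_gumbel(rng)) / tau for logit in logits]
    return softmax(noisy)


def annealed_tau(step, tau_start=10.0, tau_min=0.1, decay=0.01):
    """Hypothetical schedule: exponential decay with a lower bound."""
    return max(tau_min, tau_start * math.exp(-decay * step))


logits = [1.0, 2.0, 3.0]
soft = gumbel_softmax(logits, tau=10.0, rng=random.Random(0))   # close to uniform
sharp = gumbel_softmax(logits, tau=0.1, rng=random.Random(0))   # close to one-hot
```

With the same noise, a smaller tau pushes the output toward a one-hot vector, which is why annealing is typically paired with this relaxation.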
Hi,
It seems that in the current setup, different temperatures are used to calculate NLL_oracle and NLL_test for LeakGAN. Is this intended? In my opinion, this gives a misleading view of the model's quality/diversity tradeoff.
Thanks,
Lucas
As far as I know, the basic idea of self-BLEU scores is to calculate the BLEU scores by choosing each sentence in the set of generated sentences as hypothesis and the others as reference, and then take an average of BLEU scores over all the generated sentences.
However, when looking into the implementation of self-BLEU scores (https://github.com/geek-ai/Texygen/blob/master/utils/metrics/SelfBleu.py), I found an issue with evaluating self-BLEU during training: only at the first evaluation do the reference and hypothesis come from the same “test data” (i.e. the set of generated sentences). After that, the hypothesis keeps being updated but the reference remains unchanged (due to “is_first=False”), so the hypothesis and reference no longer come from the same “test data”, and the scores obtained under this implementation are not self-BLEU scores.
To address this, I modified the implementation so that the hypothesis and reference always come from the same “test data” (by simply removing the variables "self.reference" and "self.is_first") and found that the self-BLEU (2-5) scores are always 1 when evaluating all the models.
Please let me know whether my concern makes sense or whether I have misunderstood the definition of self-BLEU.
Hello,
I want to have the BLEU score in addition to NLL and EmbSim, how can I add it to my metrics?
Feeding the following real training dataset to a SeqGAN works perfectly:
X = np.random.randint(0, 20, (80, 20))
However, the following dataset with the same dimensionality but 6 symbols instead of 20 raises an error.
X = np.random.randint(0, 6, (80, 20))
In both cases, we used vocab_size = #unique symbols + 1, as suggested in text_process.text_precess(). Here is the corresponding traceback:
Traceback (most recent call last):
File "texygen/texygen.py", line 85, in train
gan_func(X)
File "texygen/models/seqgan/Seqgan.py", line 331, in train_real
self.evaluate()
File "texygen/models/seqgan/Seqgan.py", line 80, in evaluate
scores = super().evaluate()
File "texygen/models/Gan.py", line 55, in evaluate
score = metric.get_score()
File "texygen/utils/metrics/DocEmbSim.py", line 33, in get_score
return self.get_dis_corr()
File "texygen/utils/metrics/DocEmbSim.py", line 164, in get_dis_corr
return np.log10(corr / len(self.oracle_sim))
ZeroDivisionError: division by zero
Hi,
Thanks again for the great benchmarking platform.
I am wondering what is the exact dataset partition that is normally used for the EMNLP2017 News and Image COCO experiments. I'm asking because you provide a train/test split, but it is unclear what the validation set is.
Also, could you confirm that the reported performance in "Neural Text Generation: Past, Present and Beyond" and "CoT: Cooperative Training for Generative Modeling of Discrete Data" was computed with the same split?
Thank you very much
def _g_recurrence(i, x_t, h_tm1, gen_o, gen_x):
    h_t = self.g_recurrent_unit(x_t, h_tm1)  # hidden_memory_tuple
    o_t = self.g_output_unit(h_t)  # batch x vocab , logits not prob
    log_prob = tf.log(tf.nn.softmax(o_t))
    next_token = tf.cast(tf.reshape(tf.multinomial(log_prob, 1), [self.batch_size]), tf.int32)
    x_tp1 = tf.nn.embedding_lookup(self.g_embeddings, next_token)  # batch x emb_dim
    gen_o = gen_o.write(i, tf.reduce_sum(tf.multiply(tf.one_hot(next_token, self.num_vocabulary, 1.0, 0.0),
                                                     tf.nn.softmax(o_t)), 1))  # [batch_size] , prob
    gen_x = gen_x.write(i, next_token)  # indices, batch_size
    return i + 1, x_tp1, h_t, gen_o, gen_x
In this function, you sample from a multinomial distribution to generate next_token. However, in most cases, next_token is generated by argmax.
Thanks,
Lucas
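To make the difference between the two decoding choices concrete, here is a small pure-Python sketch that contrasts argmax selection with sampling from the predicted distribution. The function names and the toy probabilities are made up for illustration and do not come from the repository.

```python
import random


def greedy_token(probs):
    """Argmax decoding: always return the most probable token id."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sampled_token(probs, rng):
    """Multinomial sampling: draw a token id in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.1, 0.6, 0.3]  # toy next-token distribution over 3 tokens
rng = random.Random(0)

greedy = [greedy_token(probs) for _ in range(5)]            # identical every time
sampled = [sampled_token(probs, rng) for _ in range(1000)]  # varies between draws
```

Argmax is deterministic, so repeated generation from the same state yields the same sequence, whereas sampling produces varied sequences.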
I ran main.py with the command python main.py and got a file named experiment-log-mle.csv. I don't know what the values in it mean. How should I analyze them? And where are the generated sentences?
According to the paper, for Self-BLEU each generated sentence is compared against all the other generated sentences as references.
The current Self-BLEU implementation includes the selected hypothesis in the list of references. This risks inflating the Self-BLEU scores, since there will always be a direct match between the hypothesis and one of the references.
def get_bleu(self):
    ngram = self.gram
    bleu = list()
    reference = self.get_reference()
    weight = tuple((1. / ngram for _ in range(ngram)))
    with open(self.test_data) as test_data:
        for hypothesis in test_data:
            hypothesis = nltk.word_tokenize(hypothesis)
            bleu.append(nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight,
                                                                smoothing_function=SmoothingFunction().method1))
    return sum(bleu) / len(bleu)
Should we remove the target hypothesis from the set of references, or am I missing something here?
Thanks for the help in advance
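The effect described above can be illustrated with a small pure-Python sketch. It deliberately uses a simplified stand-in for BLEU (a single-order clipped n-gram precision with no brevity penalty or smoothing) rather than the NLTK implementation, and all function names are hypothetical. It compares scoring each sentence against the whole set with scoring it against the set minus itself.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, references, n):
    """Clipped n-gram precision: the core quantity inside BLEU (simplified)."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


def self_score(sentences, n, leave_one_out):
    """Average score of each sentence against the others (or against all)."""
    scores = []
    for i, hyp in enumerate(sentences):
        refs = sentences[:i] + sentences[i + 1:] if leave_one_out else sentences
        scores.append(clipped_precision(hyp, refs, n))
    return sum(scores) / len(scores)


sentences = [
    "a man rides a horse".split(),
    "a dog chases a ball".split(),
    "two kids play in the park".split(),
]
with_self = self_score(sentences, 2, leave_one_out=False)    # hypothesis is in refs
without_self = self_score(sentences, 2, leave_one_out=True)  # hypothesis excluded
```

When the hypothesis is among the references, every one of its n-grams matches, so the score is 1 regardless of how diverse the set is; excluding it gives a score that reflects actual overlap.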
I'm new to LeakGAN, SeqGAN, and TextGAN. As I understand it, a GAN generates text so that the discriminator cannot tell real text from generated text.
A language model (LM) is trained to predict the next word and can also be used to generate text.
Thank you!
Hi,
I used real data, and as a result of running python main.py -g seqgan -t real -d 'data/narratives.txt'
I got a generator.txt file under the save directory. A sample row of this generator.txt file looks like:
4172 2137 1971 4089 842 2056 3806 2100 3845 1955 1111 721 462 3040 3877 2913 4855 1363 374 4203
Now, my question is how can I get the actual text from this generator.txt file?
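A minimal sketch of turning such a row of ids back into words, assuming an index-to-word dictionary is available (the training traceback elsewhere in these issues shows an iw_dict being built during real-data training). The helper name decode_line, the toy dictionary, and the assumption that the keys are id strings are all illustrative; if the keys are integers, convert each token with int() first.

```python
def decode_line(line, index_to_word, unknown="<unk>"):
    """Map a whitespace-separated row of token ids back to words.

    Hypothetical helper: `index_to_word` is assumed to map id strings
    (e.g. "42") to words. Ids missing from the mapping become `unknown`.
    """
    words = [index_to_word.get(token, unknown) for token in line.split()]
    return " ".join(words)


# Toy dictionary standing in for the real index-to-word mapping.
iw_dict = {"0": "the", "1": "cat", "2": "sat"}
decoded = decode_line("0 1 2 7", iw_dict)
```

Applying the same function to every line of generator.txt would give one decoded sentence per row.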
To whom it may concern,
Hi, I have a question about the following line in the RankGAN code:
self.scores = tf.reshape(self.feature @ tf.transpose(self.reference, perm=[1, 0]), [-1])
I wonder how the @ operator works.
Thanks.
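For context, @ is Python's matrix-multiplication operator (PEP 465, available since Python 3.5). Python dispatches it to the left operand's __matmul__ method, and for TensorFlow tensors it behaves like tf.matmul. The toy Matrix class below is purely illustrative and only shows the dispatch mechanism with the same "feature times transposed reference" shape of expression.

```python
class Matrix:
    """Tiny illustrative matrix type that supports the @ operator."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def transpose(self):
        return Matrix(zip(*self.rows))

    def __matmul__(self, other):
        # Called by Python when evaluating `self @ other`.
        cols = list(zip(*other.rows))
        return Matrix(
            [[sum(a * b for a, b in zip(row, col)) for col in cols]
             for row in self.rows]
        )


feature = Matrix([[1, 2]])            # shape 1 x 2
reference = Matrix([[3, 4], [5, 6]])  # shape 2 x 2
scores = (feature @ reference.transpose()).rows  # dot product with each reference row
```

Here each entry of scores is the dot product of the feature vector with one row of reference, which is what the quoted expression computes before flattening.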
Hi guys. Thanks for the great repo!
I am looking at Section 4.2, Synthetic Data Experiment, of the Texygen paper. I have two questions.
1. Are all the models trained on the exact same oracle, or is it re-initialized at every run?
2. Did you do a hyperparameter search for all the models (and present the best run for each), or did you do a single run for each model (with the default parameters in the repo)?
Thanks a lot!
Regarding the synthetic data experiment, the nll-test metric is computed with the gen_data_loader (training data) so nll-test is actually nll-train (useless metric). Highly misleading.
Also, even if you create a separate test set (test_file.txt), the function generator.get_nll calls some training updates. This function is used to compute nll-test. So even in this case, you would still be training on test data!
I suspect this error occurs in other models also ...
I would like to find the paper on MLE text generation. Can you point me to any paper describing your implementation?
Hi,
The output under save consists of ids, not natural language. How can I generate readable sentences?
Hello!
This is a great effort and has been very useful for my work.
I am trying to change the number of records being generated (and saved to test_file.txt) by changing the generate_num parameter in the seqgan.py file.
However, modifying this doesn't seem to have an effect. What's the right way to change the number of synthetic reports being generated?
Hi,
It seems that the default LeakGAN discriminator architecture in this repository is quite different from the one in the official repo. Specifically, Texygen's implementation has only two layers, while the official one has over 10 layers. Is there any reason for this discrepancy?
Thanks in advance,
Lucas
In your code, the LeakGAN generator is pretrained first, and only after that is the discriminator pretrained. If I understand correctly, this means useless features (i.e. noise) from the discriminator are leaked to the generator during pretraining, and those features abruptly become useful once the discriminator is pretrained (i.e. at the start of adversarial training).
In the original LeakGAN code, as well as in the original paper (appendix: algorithm), the authors propose interleaving the pretraining of the generator and the discriminator.
please find test file attached
test.txt
Hi there,
Thank you for the repo. I have recently been trying different models for my project. So far I have observed that RNN models such as textgenrnn perform much better than GANs. Do you have any benchmark of this repo against RNN models?
Cheers
The BLEU score of LeakGAN is far lower than that reported in the paper.
Hello!
Why does the number of generated samples not match the number I entered? Each generation produces 9,984 samples, and I cannot see how this number is set. Can it be changed? Or what does this number depend on?
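One plausible explanation, not confirmed anywhere in this thread: if samples are written in whole batches, the requested count is rounded down to a multiple of the batch size. The sketch below shows only the arithmetic; the function name and the values 10000 and 64 are assumptions chosen because they reproduce 9,984.

```python
def samples_written(generated_num, batch_size):
    """Number of samples produced when only full batches are generated."""
    full_batches = generated_num // batch_size  # partial last batch is dropped
    return full_batches * batch_size


count = samples_written(10000, 64)  # 156 full batches of 64
```

Under this assumption, choosing a requested count that is a multiple of the batch size would make the output size match the request.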
How can I use this platform to save models trained on my own datasets and reuse them in production?
Traceback (most recent call last):
File "main.py", line 85, in <module>
parse_cmd(sys.argv[1:])
File "main.py", line 67, in parse_cmd
gan = set_gan(opt_arg['-g'])
File "main.py", line 26, in set_gan
gan = Gan()
File "E:\all_code\aspectLevel\Texygen\models\leakgan\Leakgan.py", line 64, in __init__
self.sequence_length = FLAGS.length
File "D:\Anacond\lib\site-packages\tensorflow\python\platform\flags.py", line 84, in __getattr__
wrapped(_sys.argv)
File "D:\Anacond\lib\site-packages\absl\flags\_flagvalues.py", line 630, in __call__
name, value, suggestions=suggestions)
absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'g'
Dear authors:
First, I would like to thank you for the work. It is really helpful for standardizing the development of GAN-based text generation methods.
However, Google recently published a paper, "On Accurate Evaluation of GANs for Language Generation", arguing that BLEU is not a good metric, and can even be misleading, for GAN-based text generation methods. (See the summary I wrote about this paper.) Therefore, in my humble opinion, it would be better to report better metrics (specifically Reverse LM score and FD) for the methods already implemented in Texygen.
Best,
Howard
When running python main.py -g rankgan, I got an IndexError:
Traceback (most recent call last):
File "/home/xxx/Texygen/main.py", line 78, in parse_cmd
gan.train_oracle()
File "/home/xxx/Texygen/models/rankgan/Rankgan.py", line 121, in train_oracle
self.evaluate()
File "/home/chaiyekun/GAN.tf/Texygen/models/rankgan/Rankgan.py", line 93, in evaluate
scores = super().evaluate()
File "/home/xxx/Texygen/models/Gan.py", line 57, in evaluate
score = metric.get_score()
File "/home/xxx/Texygen/utils/metrics/DocEmbSim.py", line 32, in get_score
self.get_gen_sim()
File "/home/xxx/Texygen/utils/metrics/DocEmbSim.py", line 153, in get_gen_sim
self.gen_sim = self.get_wordvec(self.generator_file)
File "/home/xxx/Texygen/utils/metrics/DocEmbSim.py", line 142, in get_wordvec
batch_size, num_skips, skip_window, data[index])
File "/home/xxx/Texygen/utils/metrics/DocEmbSim.py", line 72, in generate_batch
buffer.append(data[self.data_index])
IndexError: list index out of range
[Retracted]
looks like the original Leakgan code did this too.
Why did you build your benchmark from scratch, instead of using OpenAI gym?
https://github.com/openai/gym
A comparison of your environment vs OpenAI gym would be appreciated
First, I'm not sure whether the model contains the encoder during training.
EOS means end-of-sentence. Encoder and decoder are part of transformer network.
If without-encoder, training time:
target: [E, F, G, H, EOS]
decoder input: [0, E, F, G, H]
If without-encoder, testing time:
decoder input: [0]
If with encoder, training time:
encoder input: [A, B, C, D]
target: [E, F, G, H, EOS]
decoder input: [0, E, F, G, H]
If with-encoder, testing time:
encoder input: [A, B, C, D]
decoder input: [0]
Is my understanding correct?
I know this is beyond the scope of this project, but I hope you can help.
Thank you very much.
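The input/target layout listed above matches the common teacher-forcing convention, which can be sketched in a few lines of plain Python. The function names, the start token 0, and the toy step function are illustrative assumptions; the sketch does not say whether this project's models include an encoder.

```python
def make_decoder_input(target, start_token=0):
    """Training time: shift the target right by one and prepend the start token."""
    return [start_token] + target[:-1]


def greedy_decode(step_fn, start_token=0, eos="EOS", max_len=10):
    """Test time: start from [start_token] and feed predictions back in."""
    tokens = [start_token]
    while len(tokens) <= max_len:
        next_token = step_fn(tokens)  # hypothetical model call on the prefix
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens[1:]  # drop the start token


target = ["E", "F", "G", "H", "EOS"]
decoder_input = make_decoder_input(target)


def toy_step(prefix):
    """Stand-in for a trained model: replays the target one token at a time."""
    return target[len(prefix) - 1]


generated = greedy_decode(toy_step)
```

With an encoder, the encoder input is supplied alongside the decoder input in both phases, and the shifting of the decoder side stays the same.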
How can I use it with Chinese text?
Is the generator loss of MaliGAN correct? It should be:
self.g_loss = -tf.reduce_sum(
    tf.reduce_sum(
        tf.one_hot(tf.to_int32(tf.reshape(self.x, [-1])), self.num_vocabulary, 1.0, 0.0) * tf.log(
            tf.clip_by_value(tf.reshape(self.g_predictions, [-1, self.num_vocabulary]), 1e-20, 1.0)
        ), 1) * tf.reshape(self.rewards, [-1])
)
Hi,
I have been feeding a very simple dataset composed of consecutive ints to all GANs available in this library. The data is generated as follows, and stored in the oracle file.
X = [list(range(20))] * 80
I then attempt to perform a training on real data, using this file in input. Seqgan, Leakgan, Rankgan, Mle and Maligan manage to learn and reproduce this simple pattern.
The generator of GSGAN is able to restrict its vocabulary to the one given in input (i.e. use only integers between 0 and 20 instead of 0 and 5000), but is not able to capture the ordering.
The generator of TextGAN is neither able to restrict its vocabulary to the one given in input, nor able to learn an ordering.
As a result, the nll-test for these models is far from the one reached by the other models. Any hint on what may be causing this issue?
I'm trying to reproduce the Poem BLEU-2 result in the SeqGAN paper, but I couldn't find the vocabulary size used in the paper. The RankGAN paper uses a different dataset of 13,123 poems and filters out words that occur fewer than 5 times. Do you know the vocabulary size used in the SeqGAN paper? Thanks a lot!
In the Bleu function, what do test_text and real_text mean? Does test_text refer to 'save/test_file.txt'? And what does real_text refer to: 'save/oracle.txt' or 'save/generator.txt'? Thanks!
BLEU-2 of LeakGAN on WMT is 0.956 in the original paper, whereas it is 0.835 here. Notably, the BLEU scores of LeakGAN and of the other models on COCO are also lower in your benchmark. Do you know what caused the difference? In order to compare my model with the ones in your benchmark, am I supposed to convert every word to lower case using .lower(), as suggested in text_process.py? LeakGAN did not lower-case its data and treated 'A' and 'a' as different tokens, so that may be one factor in the difference.
I think there is an error in the unique n-gram class: the correct ratio should be #unique_grams / #grams.
def get_ng(self):
    document = self.get_reference()
    length = len(document)  # Is this a bug? The ratio should divide by the number of n-grams, not the number of sentences.
    grams = list()
    for sentence in document:
        grams += self.get_gram(sentence)
    print(grams, len(set(grams)), len(grams))
    return len(set(grams)) / length  # the correct computation should use len(grams) instead of length
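A minimal standalone sketch of the ratio proposed above, dividing the number of unique n-grams by the total number of n-grams. The function names and the toy document are illustrative and not the repository's API.

```python
def get_grams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unique_gram_ratio(document, n):
    """Unique n-grams divided by total n-grams across all sentences."""
    grams = []
    for sentence in document:
        grams += get_grams(sentence, n)
    if not grams:
        return 0.0  # no n-grams of this order at all
    return len(set(grams)) / len(grams)


document = [["a", "b", "a", "b"], ["a", "b", "c"]]
ratio = unique_gram_ratio(document, 2)  # 3 unique bigrams out of 5
```

For this toy document, dividing by the number of sentences instead would give 3 / 2 = 1.5, which cannot be read as a proportion.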