The texygen from geek-ai

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 2: illegal multibyte sequence

C:\Users\caocao\Anaconda3\python.exe D:/work/zhaiyao/Texygen-master/Texygen-master/main.py -g seqgan -t real -d data/shi.txt
2018-06-05 11:24:43.458330: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-06-05 11:24:45.339293: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:02:00.0
totalMemory: 4.00GiB freeMemory: 3.34GiB
2018-06-05 11:24:45.361035: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960M, pci bus id: 0000:02:00.0, compute capability: 5.0)
Traceback (most recent call last):
File "D:/work/zhaiyao/Texygen-master/Texygen-master/main.py", line 85, in <module>
parse_cmd(sys.argv[1:])
File "D:/work/zhaiyao/Texygen-master/Texygen-master/main.py", line 73, in parse_cmd
gan_func(opt_arg['-d'])
File "D:\work\zhaiyao\Texygen-master\Texygen-master\models\seqgan\Seqgan.py", line 300, in train_real
wi_dict, iw_dict = self.init_real_trainng(data_loc)
File "D:\work\zhaiyao\Texygen-master\Texygen-master\models\seqgan\Seqgan.py", line 264, in init_real_trainng
self.sequence_length, self.vocab_size = text_precess(data_loc)
File "D:\work\zhaiyao\Texygen-master\Texygen-master\utils\text_process.py", line 75, in text_precess
train_tokens = get_tokenlized(train_text_loc)
File "D:\work\zhaiyao\Texygen-master\Texygen-master\utils\text_process.py", line 50, in get_tokenlized
for text in raw:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 2: illegal multibyte sequence
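
A likely cause, judging from the traceback: get_tokenlized iterates over a file that was opened without an explicit encoding, so Python on a Chinese-locale Windows machine falls back to GBK. A minimal sketch of the usual fix, assuming the data file is actually UTF-8 (the helper name `read_lines` is hypothetical and not part of Texygen):

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the locale default."""
    with open(path, encoding=encoding) as raw:
        return [line.rstrip("\n") for line in raw]


# Self-contained demo: write a small UTF-8 file, then read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("床前明月光\n疑是地上霜\n")

lines = read_lines(demo_path)
print(lines)
```

If the file uses a different encoding, pass that name instead; the point is to state the encoding explicitly rather than rely on the platform default.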

A quick question about g_loss

Hi, I have a quick question: does self.g_loss need to be divided by self.batch_size, as self.pretrain_loss is?
Thank you for your reply.

How to save only discriminator model?

In Seqgan.py, I can only save the whole model, like this:

    print('adversarial training:')
    self.reward = Reward(self.generator, .8)
    for epoch in range(self.adversarial_epoch_num):
        # print('epoch:' + str(epoch))
        start = time()
        for index in range(1):
            samples = self.generator.generate(self.sess)
            rewards = self.reward.get_reward(self.sess, samples, 16, self.discriminator)
            feed = {
                self.generator.x: samples,
                self.generator.rewards: rewards
            }
            loss, _ = self.sess.run([self.generator.g_loss, self.generator.g_updates], feed_dict=feed)
            print(loss)
        end = time()
        self.add_epoch()
        print('epoch:' + str(self.epoch) + '\t time:' + str(end - start))
        if epoch % 5 == 0 or epoch == self.adversarial_epoch_num - 1:
            generate_samples(self.sess, self.generator, self.batch_size, self.generate_num, self.generator_file)
            get_real_test_file()
            self.evaluate()
        self.reward.update_params()
        for _ in range(15):
            self.train_discriminator()
    self.save('./save/seqgan/model.pth')

BLEU Score Computation

I have two questions:

How do you compute the reported BLEU scores? I see the imports of BLEU in several models but BLEU is never added to the self.metrics collection of any of them.
When computing BLEU, which sentences are considered to be references?

Is there a way to make generation conditional on some input text?

Is there a way to condition the generations of the GANs on some input text similar to how you can include a prompt/seed for transformers (GPT-2/XLNet) when generating language samples?

Getting error while training the SeqGAN model

Hi,

I got the following error when training the SeqGAN model.

epoch:13
[1.1808003]
[1.1286035]
[1.1499095]
epoch:14
Traceback (most recent call last):
  File "main.py", line 85, in <module>
    parse_cmd(sys.argv[1:])
  File "main.py", line 69, in parse_cmd
    gan.train_oracle()
  File "path/to/Texygen/models/seqgan/Seqgan.py", line 131, in train_oracle
    self.train_discriminator()
  File "path/to/Texygen/models/seqgan/Seqgan.py", line 52, in train_discriminator
    self.dis_data_loader.load_train_data(self.oracle_file, self.generator_file)
  File "path/to/Texygen/models/seqgan/SeqganDataLoader.py", line 70, in load_train_data
    self.labels = np.concatenate([positive_labels, negative_labels], 0)
ValueError: all the input arrays must have same number of dimensions

The command I ran is

python main.py -g seqgan

Do you have any idea how it happens or how to fix it?
Thanks in advance.

get_bleu_fast and get_bleu produce different results

Texygen/utils/metrics/Bleu.py

Line 65 in 08c67a1

reference = reference[0:self.sample_size]

When you calculate BLEU, the size of the reference list should match exactly, since a reference list with a greater number of sentences produces a higher BLEU score, as noted in the paper that introduced BLEU. The line above truncates the original reference list of 10k sentences to 500 sentences, which results in a lower BLEU score. The same applies to self-BLEU. Did you calculate the COCO (self-)BLEU with get_bleu_fast?

Some questions about GSGAN

Hello, I have a few questions about GSGAN.

In your code, the inverse temperature parameter τ (self.tau in GsganGenerator.py) is fixed at 10. However, in the original paper, the authors suggest starting with a relatively large τ and then annealing it toward zero during training.
What's more, I don't understand why you add Gumbel noise before calculating the output logits.

        def _pretrain_recurrence(i, x_t, h_tm1, g_predictions):
            h_t = self.g_recurrent_unit(x_t, h_tm1)
            h_t = self.add_gumbel(h_t)  # add g_i?????
            o_t = self.g_output_unit(h_t)
            g_predictions = g_predictions.write(i, tf.nn.softmax(o_t))  # batch x vocab_size
            x_tp1 = tf.nn.softmax(o_t / self.tau)
            return i + 1, x_tp1, h_t, g_predictions

Can you explain this function in more detail?

Finally, why don't you report the performance of GSGAN?
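
For reference, the standard Gumbel-softmax trick perturbs the logits with Gumbel noise and then applies a temperature-scaled softmax. The sketch below is plain Python, not Texygen's GsganGenerator code; the function names and the example logits are assumptions made for illustration.

```python
import math
import random


def sample_gumbel(rng, eps=1e-20):
    """Draw one Gumbel(0, 1) sample via the inverse-CDF transform."""
    u = rng.random()
    return -math.log(-math.log(u + eps) + eps)


def gumbel_softmax(logits, tau, rng):
    """Return a relaxed one-hot vector: softmax((logits + g) / tau)."""
    noisy = [(x + sample_gumbel(rng)) / tau for x in logits]
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [2.0, 1.0, 0.1]  # hypothetical output logits for a 3-word vocabulary
soft = gumbel_softmax(logits, tau=10.0, rng=rng)   # high temperature tends toward uniform
sharp = gumbel_softmax(logits, tau=0.01, rng=rng)  # low temperature tends toward one-hot
print(soft)
print(sharp)
```

In this formulation the noise is added to the logits over the vocabulary, which is the usual point of comparison when asking where the noise enters the recurrence quoted above.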

Error in NLL_oracle / NLL_test reported for LeakGAN

Hi,

It seems that in the current setup, a different temperature is used to calculate NLL_oracle and NLL_test for LeakGAN. Is this intended? In my opinion, this gives a misleading view of the model's quality/diversity tradeoff.

Thanks,
Lucas

Question about self-BLEU implementation

As far as I know, the basic idea of self-BLEU scores is to calculate the BLEU scores by choosing each sentence in the set of generated sentences as hypothesis and the others as reference, and then take an average of BLEU scores over all the generated sentences.

However, when looking into the implementation of self-BLEU scores (https://github.com/geek-ai/Texygen/blob/master/utils/metrics/SelfBleu.py), I found an issue with evaluating self-BLEU during training: only at the first evaluation do the reference and hypothesis come from the same “test data” (i.e. the set of generated sentences). After that, the hypothesis keeps being updated but the reference remains unchanged (due to “is_first=False”), which means the hypothesis and reference no longer come from the same “test data”, and thus the scores obtained under this implementation are not self-BLEU scores.

To this end, I modified the implementation to make sure that the hypothesis and reference are always from the same “test data” (by simply removing the variables "self.reference" and "self.is_first") and found that the self-BLEU (2-5) scores are always 1 when evaluating all the models.

Please let me know whether my concern makes sense or whether I have misunderstood the definition of self-BLEU scores.

How can I get the BLEU on real data?

Hello,
I would like to report the BLEU score in addition to NLL and EmbSim. How can I add it to my metrics?

Small vocab_size raises division by zero in DocEmbSim

Feeding the following real training dataset to a SeqGAN works perfectly:

X = np.random.randint(0, 20, (80, 20))

However, the following dataset with the same dimensionality but 6 symbols instead of 20 raises an error.

X = np.random.randint(0, 6, (80, 20))

In both cases, we used vocab_size = #unique symbols + 1, as suggested in text_process.text_precess(). Here is the corresponding traceback:

Traceback (most recent call last):
  File "texygen/texygen.py", line 85, in train
    gan_func(X)
  File "texygen/models/seqgan/Seqgan.py", line 331, in train_real
    self.evaluate()
  File "texygen/models/seqgan/Seqgan.py", line 80, in evaluate
    scores = super().evaluate()
  File "texygen/models/Gan.py", line 55, in evaluate
    score = metric.get_score()
  File "texygen/utils/metrics/DocEmbSim.py", line 33, in get_score
    return self.get_dis_corr()
  File "texygen/utils/metrics/DocEmbSim.py", line 164, in get_dis_corr
    return np.log10(corr / len(self.oracle_sim))
ZeroDivisionError: division by zero

dataset partition

Hi,

Thanks again for the great benchmarking platform.

I am wondering what is the exact dataset partition that is normally used for the EMNLP2017 News and Image COCO experiments. I'm asking because you provide a train/test split, but it is unclear what the validation set is.

Also, could you confirm that the reported performance in "Neural Text Generation: Past, Present and Beyond" and "CoT: Cooperative Training for Generative Modeling of Discrete Data" was computed with the same split?

Thank you very much

In MLE model, sample from multinomial distribution instead of using argmax to generate next_token

    def _g_recurrence(i, x_t, h_tm1, gen_o, gen_x):
        h_t = self.g_recurrent_unit(x_t, h_tm1)  # hidden_memory_tuple
        o_t = self.g_output_unit(h_t)  # batch x vocab, logits not prob
        log_prob = tf.log(tf.nn.softmax(o_t))
        next_token = tf.cast(tf.reshape(tf.multinomial(log_prob, 1), [self.batch_size]), tf.int32)
        x_tp1 = tf.nn.embedding_lookup(self.g_embeddings, next_token)  # batch x emb_dim
        gen_o = gen_o.write(i, tf.reduce_sum(tf.multiply(tf.one_hot(next_token, self.num_vocabulary, 1.0, 0.0),
                                                         tf.nn.softmax(o_t)), 1))  # [batch_size], prob
        gen_x = gen_x.write(i, next_token)  # indices, batch_size
        return i + 1, x_tp1, h_t, gen_o, gen_x

In this function, next_token is sampled from a multinomial distribution. In most cases, however, next_token is generated by argmax.
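
The difference between the two decoding choices can be seen on a single step. The sketch below is plain Python rather than TensorFlow, and the probabilities and helper names are made up for illustration: greedy decoding always returns the most likely token, while multinomial sampling returns each token in proportion to its probability.

```python
import random
from collections import Counter


def greedy_token(probs):
    """Pick the index with the highest probability (argmax)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sampled_token(probs, rng):
    """Draw an index at random, weighted by its probability (multinomial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution
rng = random.Random(0)

greedy_counts = Counter(greedy_token(probs) for _ in range(1000))
sampled_counts = Counter(sampled_token(probs, rng) for _ in range(1000))
print(greedy_counts)   # every draw is token 0
print(sampled_counts)  # all three tokens appear
```

With argmax the generator would emit the same sequence every time for a given start state, so sampling is one way to obtain varied outputs from the same model.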

Are PAD tokens included in the calculation of NLL?

Thanks,
Lucas

How do I analyze the results?

I ran main.py with the command python main.py and got a file, experiment-log-mle.csv. I don't know what the values in it mean. How do I analyze them? And where are the generated sentences?

potential bug in self-bleu calculations

According to the paper, for self-BLEU calculations each generation is compared against all the other generations as references.

The current Self-BLEU implementation includes the selected hypothesis in the list of references. This risks inflating the self-BLEU scores, as there will always be a direct match between the hypothesis and one of the references.

    def get_bleu(self):
        ngram = self.gram
        bleu = list()
        reference = self.get_reference()
        weight = tuple((1. / ngram for _ in range(ngram)))
        with open(self.test_data) as test_data:
            for hypothesis in test_data:
                hypothesis = nltk.word_tokenize(hypothesis)
                bleu.append(nltk.translate.bleu_score.sentence_bleu(reference, hypothesis, weight,
                                                                    smoothing_function=SmoothingFunction().method1))
        return sum(bleu) / len(bleu)

Should we remove the target hypothesis from the set of references, or am I missing something here?

Thanks in advance for the help.
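
The leave-one-out idea can be sketched without NLTK. The example below is a simplified stand-in that uses clipped n-gram precision instead of full BLEU (no brevity penalty, no smoothing, a single n), and the function names are hypothetical. It only shows how excluding the hypothesis from its own reference set changes the score.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, references, n):
    """Clipped n-gram precision of one hypothesis against several references."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


def self_score(sentences, n, exclude_self):
    """Average precision of each sentence against the others (or against all)."""
    scores = []
    for i, hyp in enumerate(sentences):
        refs = [s for j, s in enumerate(sentences) if not (exclude_self and j == i)]
        scores.append(clipped_precision(hyp, refs, n))
    return sum(scores) / len(scores)


sentences = [s.split() for s in ["a cat sat down", "a dog sat down", "birds fly south today"]]
with_self = self_score(sentences, 2, exclude_self=False)
without_self = self_score(sentences, 2, exclude_self=True)
print(with_self, without_self)
```

On this toy set, keeping each hypothesis in its own reference list gives a perfect score because every sentence matches itself, while leaving it out gives a much lower value that reflects overlap with the other sentences only.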

What is the difference between TextGAN and LM for text generation?

I'm new to LeakGAN, SeqGAN, and TextGAN. As I understand it, a GAN generates text so that the discriminator cannot tell real text from generated text.

An LM (language model) is trained to predict the next word and can also be used to generate text.

Thank you!

How to get text from generated generator.txt file?

Hi,
I used real data, and as a result of running python main.py -g seqgan -t real -d 'data/narratives.txt'
I got a generator.txt file under the save directory. A sample row of this generator.txt file looks like this:

4172 2137 1971 4089 842 2056 3806 2100 3845 1955 1111 721 462 3040 3877 2913 4855 1363 374 4203
Now, my question is: how can I get the actual text from this generator.txt file?
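
Each number in generator.txt is a token index, so turning a row back into words needs the index-to-word dictionary built during preprocessing (Seqgan.py's train_real receives wi_dict and iw_dict from init_real_trainng). A minimal sketch, assuming the dictionary maps integer IDs to words; the toy dictionary and the `decode_line` helper are hypothetical, and IDs missing from the dictionary (for example padding) are skipped:

```python
def decode_line(line, index_to_word):
    """Convert one whitespace-separated row of token IDs back into a sentence."""
    words = []
    for token in line.split():
        word = index_to_word.get(int(token))
        if word is not None:  # skip IDs with no entry, e.g. padding
            words.append(word)
    return " ".join(words)


# Toy dictionary standing in for the real index-to-word mapping.
index_to_word = {0: "the", 1: "cat", 2: "sat", 3: "down"}
sentence = decode_line("0 1 2 3 9 9", index_to_word)
print(sentence)  # the cat sat down
```

If the real dictionary uses string keys, drop the `int(...)` conversion and look the token up as-is.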

How does the operator '@' work?

To whom it may concern,

Hi, I found a problem in the RankGAN code. The link is here:

self.scores = tf.reshape(self.feature @ tf.transpose(self.reference, perm=[1, 0]), [-1])

I wonder how this @ operator works.

Thanks.
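
`@` is Python's matrix-multiplication operator, added in Python 3.5 (PEP 465). `a @ b` calls `a.__matmul__(b)`, and libraries such as NumPy and TensorFlow implement that method as matrix multiplication. On older interpreters the same line is a SyntaxError. A small pure-Python sketch with a hypothetical `Matrix` class shows the mechanism:

```python
class Matrix:
    """Tiny matrix wrapper that supports the @ operator."""

    def __init__(self, rows):
        self.rows = rows

    def __matmul__(self, other):
        # Row-by-column products, as in ordinary matrix multiplication.
        columns = list(zip(*other.rows))
        return Matrix([[sum(a * b for a, b in zip(row, col)) for col in columns]
                       for row in self.rows])


feature = Matrix([[1, 2], [3, 4]])
reference = Matrix([[5, 6], [7, 8]])
product = feature @ reference
print(product.rows)  # [[19, 22], [43, 50]]
```

In the RankGAN line, `self.feature @ tf.transpose(self.reference, perm=[1, 0])` therefore has the same effect as calling `tf.matmul` on the two tensors.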

Question about Synthetic Data Experiment

Hi guys. Thanks for the great repo!

I am looking at Section 4.2, Synthetic Data Experiment, of the Texygen paper. I have two questions.

Are all the models trained on the exact same Oracle, or is it re-initialized at every run?
Did you do a hyperparameter search for all the models (and presented the best run for each), or did you run one run for each model (with the default parameters in the repo)?

Thanks a lot!

LeakGAN trains on test data!

Regarding the synthetic data experiment, the nll-test metric is computed with the gen_data_loader (training data) so nll-test is actually nll-train (useless metric). Highly misleading.

Also, even if you create a separate test set (test_file.txt), the function generator.get_nll calls some training updates. This function is used to compute nll-test. So even in this case, you would still be training on test data!

I suspect this error occurs in other models also ...

Where is the reference for MLE?

I would like to find the paper on MLE text generation. Can you point me to a paper describing your implementation?

The output under save is IDs, not natural language?

Hi,

The output under save consists of IDs, not natural language. How do I generate readable sentences?

Changing the number of synthetic reports/sentences being generated?

Hello!

This is a great effort and has been very useful for my work.
I am trying to change the number of records being generated (and saved to test_file.txt) by changing the generate_num parameter in the seqgan.py file.

However, modifying this doesn't seem to have an effect. What's the right way to change the number of synthetic reports being generated?

LeakGAN architecture

Hi,

It seems that the default LeakGAN discriminator architecture in this repository is quite different from the one in the official repo. Specifically, Texygen's implementation has only two layers, while the official one has over 10 layers. Is there any reason for this discrepancy?

Thanks in advance,
Lucas

On LeakGAN (non-interleaved) Pretraining

In your code, the LeakGAN generator is pretrained, and only after that, the discriminator is pretrained. If I understand correctly, this leads to useless features (i.e. noise) from the discriminator being leaked to the generator during pretraining - and those features becoming "drastically" useful after the discriminator is pretrained (i.e. at the start of adversarial training).

In the original LeakGAN code, as well as in the original paper (appendix: algorithm), the authors propose pretraining the generator and the discriminator in an interleaved fashion.

How to generate data from leakgan : leakgan error

Hello, I need some help generating data using LeakGAN. When I run LeakGAN on my "test" file, the error below shows up. I would highly appreciate your assistance:

/Texygen$ python main.py -g leakgan -t real -d /data/test.txt
Traceback (most recent call last):
File "main.py", line 10, in <module>
from models.rankgan.Rankgan import Rankgan
File "/home/Texygen/models/rankgan/Rankgan.py", line 5, in <module>
from models.rankgan.RankganDiscriminator import Discriminator
File "/home/Texygen/models/rankgan/RankganDiscriminator.py", line 191
self.scores = tf.reshape(self.feature @ tf.transpose(self.reference, perm=[1, 0]), [-1])
^
SyntaxError: invalid syntax

please find test file attached
test.txt

How do I run it on my own real data?

Why GAN instead of RNN?

Hi there,

Thank you for the repo. I have been trying some different models for my project recently. So far I have observed that RNN models such as textgenrnn work much better than GANs. Do you have any benchmark of this repo against RNN models?

Cheers

the BLEU score of LeakGAN in COCO is about 0.84

The BLEU score of LeakGAN is far lower than that reported in the paper.

The result of each generation is 9,984 pieces of data?

Hello!
Why doesn't the output match the number of samples I entered? Each generation produces 9,984 pieces of data. I cannot tell where this number comes from. Can it be changed? Or what does this number depend on?
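
One possible explanation, stated as an assumption rather than a confirmed reading of the code: if samples are written in whole batches, the requested count is rounded down to a multiple of the batch size. With a hypothetical request of 10,000 samples and a batch size of 64, that gives exactly 9,984:

```python
def samples_written(requested, batch_size):
    """Number of samples produced when only whole batches are generated."""
    return (requested // batch_size) * batch_size


total = samples_written(10000, 64)
print(total)  # 9984
```

Under that assumption, choosing a sample count that is a multiple of the batch size would make the output match the request exactly.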

How to Save Trained Models for reuse?

How can I use this platform to save the models which I train on my own datasets and reuse them in the production?

Running python main.py -g leakgan hits a bug, and data/shi.txt has an encoding bug

Traceback (most recent call last):
File "main.py", line 85, in <module>
parse_cmd(sys.argv[1:])
File "main.py", line 67, in parse_cmd
gan = set_gan(opt_arg['-g'])
File "main.py", line 26, in set_gan
gan = Gan()
File "E:\all_code\aspectLevel\Texygen\models\leakgan\Leakgan.py", line 64, in __init__
self.sequence_length = FLAGS.length
File "D:\Anacond\lib\site-packages\tensorflow\python\platform\flags.py", line 84, in __getattr__
wrapped(_sys.argv)
File "D:\Anacond\lib\site-packages\absl\flags\_flagvalues.py", line 630, in __call__
name, value, suggestions=suggestions)
absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'g'

Why is the LeakGAN implementation different from the paper?

On Accurate Evaluation of GANs for Language Generation

Dear authors:

First, I would like to thank you for the work. It is really helpful for standardizing the development of GAN-based text generation methods.

However, Google recently published a paper, "On Accurate Evaluation of GANs for Language Generation", arguing that BLEU is not a good metric, and may even be a misleading one, for GAN-based text generation methods. (See the summary I wrote about this paper.) Therefore, in my humble opinion, it would be better to report better metrics (Reverse LM score and FD, specifically) for the methods that are already implemented in Texygen.

Best,
Howard

Bugs in running RankGAN

When running python main.py -g rankgan, got IndexError:

Traceback (most recent call last):
  File "/home/xxx/Texygen/main.py", line 78, in parse_cmd
    gan.train_oracle()
  File "/home/xxx/Texygen/models/rankgan/Rankgan.py", line 121, in train_oracle
    self.evaluate()
  File "/home/chaiyekun/GAN.tf/Texygen/models/rankgan/Rankgan.py", line 93, in evaluate
    scores = super().evaluate()
  File "/home/xxx/Texygen/models/Gan.py", line 57, in evaluate
    score = metric.get_score()
  File "/home/xxx/Texygen/utils/metrics/DocEmbSim.py", line 32, in get_score
    self.get_gen_sim()
  File "/home/xxx/Texygen/utils/metrics/DocEmbSim.py", line 153, in get_gen_sim
    self.gen_sim = self.get_wordvec(self.generator_file)
  File "/home/xxx/Texygen/utils/metrics/DocEmbSim.py", line 142, in get_wordvec
    batch_size, num_skips, skip_window, data[index])
  File "/home/xxx/Texygen/utils/metrics/DocEmbSim.py", line 72, in generate_batch
    buffer.append(data[self.data_index])
IndexError: list index out of range

[Retracted]

Looks like the original LeakGAN code did this too.

Texygen and OpenAI gym

Why did you build your benchmark from scratch, instead of using OpenAI gym?
https://github.com/openai/gym

A comparison of your environment vs OpenAI gym would be appreciated

How to prepare the data for text generation task. Thank you very much.

First, I'm not sure whether the model contains the encoder during training.

EOS means end-of-sentence. The encoder and decoder are parts of a transformer network.

Without an encoder, at training time:

target: [E, F, G, H, EOS]
decoder input: [0, E, F, G, H]

Without an encoder, at test time:

decoder input: [0]

With an encoder, at training time:

encoder input: [A, B, C, D]
target: [E, F, G, H, EOS]
decoder input: [0, E, F, G, H]

With an encoder, at test time:

encoder input: [A, B, C, D]
decoder input: [0]

Is this exactly right?

I know this is beyond the scope of this project, but I hope you can help.
Thank you very much.
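
The shifted-by-one layout described above can be written down directly. This is a generic teacher-forcing sketch, not Texygen code; the start token 0 follows the example in the question, and the helper name and the `EOS` string are placeholders.

```python
START = 0      # start-of-sequence marker, as in the example above
EOS = "EOS"    # end-of-sentence marker


def make_training_pair(sequence):
    """Build (decoder_input, target) for teacher forcing from one sequence."""
    decoder_input = [START] + list(sequence)
    target = list(sequence) + [EOS]
    return decoder_input, target


decoder_input, target = make_training_pair(["E", "F", "G", "H"])
print(decoder_input)  # [0, 'E', 'F', 'G', 'H']
print(target)         # ['E', 'F', 'G', 'H', 'EOS']
```

At test time the decoder would start from `[START]` alone and append each predicted token until it produces `EOS`.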

How do I use it for Chinese words?

Question of Generator Loss of MaliGAN

Is the generator loss of MaliGAN correct? It should be:

Texygen/models/maligan_basic/MaliganGenerator.py

Line 112 in 3104e22

self.g_loss = -tf.reduce_sum(

        self.g_loss = -tf.reduce_sum(
            tf.reduce_sum(
                tf.one_hot(tf.to_int32(tf.reshape(self.x, [-1])), self.num_vocabulary, 1.0, 0.0) * tf.log(
                    tf.clip_by_value(tf.reshape(self.g_predictions, [-1, self.num_vocabulary]), 1e-20, 1.0)
                ), 1) * tf.reshape(self.rewards, [-1])
        )

What exact environment did you use to run this project successfully? I keep getting errors.

TextGAN and GSGAN generators do not converge

Hi,
I have been feeding a very simple dataset composed of consecutive ints to all GANs available in this library. The data is generated as follows, and stored in the oracle file.

X = [list(range(20))] * 80

I then run training on real data, using this file as input. Seqgan, Leakgan, Rankgan, Mle and Maligan manage to learn and reproduce this simple pattern.

The generator of GSGAN is able to restrict its vocabulary to the one given in input (i.e. use only integers between 0 and 20 instead of 0 and 5000), but is not able to capture the ordering.

The generator of TextGAN is neither able to restrict its vocabulary to the one given in input, nor able to learn an ordering.

As a result, the nll-test for these models is far from the one reached by the other models. Any hint on what may be causing this issue?

Question on vocabulary size of the Chinese poem dataset

I'm trying to reproduce the Poem BLEU-2 result in the SeqGAN paper, but I couldn't find the vocabulary size used in the paper. The RankGAN paper uses a different dataset of 13,123 poems and filters out words that occur fewer than 5 times. Do you know the vocabulary size used in the SeqGAN paper? Thanks a lot!

How to use Bleu?

In the Bleu function, what do test_text and real_text mean? Does test_text refer to 'save/test_file.txt'? And what does real_text refer to: 'save/oracle.txt' or 'save/generator.txt'? Thanks!

BLEU of LeakGAN is much lower than in the original paper

BLEU-2 of LeakGAN on WMT is 0.956 in the original paper, whereas it is 0.835 here. Notably, the BLEU scores of LeakGAN and of the other models on COCO are also lower in your benchmark. Do you know what caused the difference? In order to compare my model with the ones in your benchmark, am I supposed to convert every word to lower case using .lower(), as suggested in text_process.py? LeakGAN did not lower-case its data and treated 'A' and 'a' as different tokens, so that may be one factor in the difference.

Unique ngram, division by length of document

I think there is an error in the unique ngram class: the correct computation should be #unique_grams / #grams.

    def get_ng(self):
        document = self.get_reference()
        length = len(document)  # Is this a bug? The ratio should divide unique n-grams by all n-grams, not by the number of sentences.
        grams = list()
        for sentence in document:
            grams += self.get_gram(sentence)
        print(grams, len(set(grams)), len(grams))
        # The unique n-gram ratio divides by the number of n-grams, not by the number of sentences.
        return len(set(grams)) / length  # The correct computation should use len(grams) instead of length.
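
The proposed correction can be checked on a toy corpus. A minimal sketch in plain Python, with hypothetical helper names rather than the repository's methods, that divides the number of distinct n-grams by the total number of n-grams:

```python
def get_grams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unique_gram_ratio(document, n):
    """Distinct n-grams divided by total n-grams across all sentences."""
    grams = []
    for sentence in document:
        grams += get_grams(sentence, n)
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)


document = [["a", "b", "a", "b"], ["a", "b", "c"]]
ratio = unique_gram_ratio(document, 2)
print(ratio)  # 0.6
```

Here there are 3 distinct bigrams out of 5 in total. Dividing by the number of sentences instead would give 3 / 2 = 1.5, which is no longer a ratio between 0 and 1.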
