Comments (9)

CR-Gjx commented on September 4, 2024

For the last question, it may be a bug I introduced while uploading the code. I will verify it and fix it. Thanks for the reminder.

from leakgan.

CR-Gjx commented on September 4, 2024

If you want to use a real-world dataset, it is better to use the code in the "Image Coco" folder. You only need to modify "realtrain_cotra.txt".

Crista23 commented on September 4, 2024

Hi, thank you for the great work! Would it be possible to give more details on how the real data is pre-processed?

CR-Gjx commented on September 4, 2024

For the real data in the "Image Coco" folder, I first build a vocabulary dictionary, stored in vocab_cotra.pkl, that contains every word. Then every word in the dataset is converted to a number according to that dictionary. In addition, every sentence in the dataset is aligned to a length of 20: if a sentence has fewer than 20 tokens, padding (blank) tokens are appended until it reaches 20. The padding is a special token in the dictionary.
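
A minimal sketch of the preprocessing described above, not the actual script used for the repository. The function names, the toy sentences, the first-seen ordering of ids, and the truncation of sentences longer than the target length are all assumptions made for illustration; only the 'OTHERPAD' token name and the fixed length of 20 come from this thread.

```python
PAD_TOKEN = "OTHERPAD"  # padding token name taken from this thread
SEQ_LEN = 20            # fixed length mentioned above; adjust to your data


def build_vocab(sentences):
    """Return (word_to_id, id_to_word). Ids follow first-seen order (an assumption);
    the padding token is appended last."""
    id_to_word = []
    seen = set()
    for sentence in sentences:
        for word in sentence.split():
            if word not in seen:
                seen.add(word)
                id_to_word.append(word)
    id_to_word.append(PAD_TOKEN)
    word_to_id = {word: idx for idx, word in enumerate(id_to_word)}
    return word_to_id, id_to_word


def encode(sentence, word_to_id, seq_len=SEQ_LEN):
    """Map words to ids, then pad with the padding id up to seq_len.
    Longer sentences are truncated here, which is an assumption."""
    ids = [word_to_id[word] for word in sentence.split()][:seq_len]
    ids += [word_to_id[PAD_TOKEN]] * (seq_len - len(ids))
    return ids


# Hypothetical toy corpus, whitespace-tokenized.
sentences = ["A man is riding a bike .", "It is raining ."]
word_to_id, id_to_word = build_vocab(sentences)
encoded = [encode(s, word_to_id) for s in sentences]

# One line of space-separated ids per sentence, similar in spirit to realtrain_cotra.txt.
for row in encoded:
    print(" ".join(str(i) for i in row))
```

Each printed line has exactly SEQ_LEN ids, with the padding id repeated at the end of short sentences.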

AranKomat commented on September 4, 2024

According to realtrain_cotra.txt, there are 32 tokens per line, and some lines contain more than 20 non-1814 tokens, assuming 1814 here means zero padding. So I assume you meant "32-length" rather than "20-length."

In vocab_cotra.pkl, p4801 aS'OTHERPAD' is the last entry with ' ', so there are only 4801 vocabulary entries for COCO. But main.py says the vocab size is 4839, which doesn't agree. realtrain_cotra.txt also uses 0 as a token in the middle of a sentence, but 0 doesn't appear in vocab_cotra.pkl. Since 0 was designated as the start token, I believe it cannot be used in the middle of a sentence. According to realtrain_cotra.txt, 65 seems to stand for 'A', but according to vocab_cotra.pkl, 'A' is at 67. Likewise, '.' (period) is 193 according to realtrain_cotra.txt, but it's 194 in vocab_cotra.pkl. By the way, does 'OTHERPAD' mean zero padding (instead of 1814)? In vocab_cotra.pkl, there are these lines:

p194
aS'.'
aS'much'

which would mean 194 corresponds to both '.' and 'much'. So I believe your vocab_cotra.pkl is inaccurate. Or is it not?

CR-Gjx commented on September 4, 2024

In fact, when I wrote main.py, I set a larger vocab size to prevent vocabulary overflow. Maybe that is not rigorous, but as training proceeds, the probabilities of the unused tokens go to 0 because they never occur in the training dataset.
As you say, aS'OTHERPAD' is a common word and stands for a blank. In my code, I assume the Generator network can only generate fixed-length sentences, so I added this token to guarantee that all sentences in the dataset have the same length. Some sentences are so short that aS'OTHERPAD' appears many times.

AranKomat commented on September 4, 2024

I added print(word) and print(vocab) to convert.py and found that '.' and 'much' are assigned different, appropriate ids. So I guess the confusion was caused by opening the .pkl file as if it were a text file. I also found that '0' corresponds to 'raining', so it has nothing to do with the start token. A few sentences from realtrain_cotra.txt were translated nicely with convert.py, so I guess there's no problem at all. Sorry for the confusion.
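
For anyone repeating this check, here is a minimal sketch of loading a vocabulary pickle properly and decoding a line of ids, rather than reading the .pkl as text. It is not the repository's convert.py. It assumes the pickle holds a list indexed by token id; the toy vocabulary, the file name vocab_demo.pkl, and the decode helper are hypothetical. In a text-mode (protocol 0) pickle, lines such as `p194` are memo markers written by the pickler, not vocabulary ids, which is likely why the raw file looked inconsistent.

```python
import os
import pickle
import tempfile

# Hypothetical toy vocabulary: the list index is the token id.
vocab = ["raining", "A", ".", "much", "OTHERPAD"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab_demo.pkl")  # hypothetical file name

    # Protocol 0 is the human-readable pickle format.
    with open(path, "wb") as f:
        pickle.dump(vocab, f, protocol=0)

    # Raw view: shows pickle opcodes and memo markers, not a word-to-id table.
    with open(path, "rb") as f:
        raw_text = f.read().decode("latin-1")

    # Proper view: open in binary mode and let pickle rebuild the list.
    with open(path, "rb") as f:
        id_to_word = pickle.load(f)


def decode(line, id_to_word, pad="OTHERPAD"):
    """Turn a line of space-separated ids into words, dropping padding tokens."""
    words = [id_to_word[int(tok)] for tok in line.split()]
    return " ".join(w for w in words if w != pad)


decoded = decode("1 0 2 4 4", id_to_word)
print(raw_text)
print(decoded)
```

Printing `raw_text` next to the loaded list makes it easy to see that the numbers in the raw file do not line up with list positions, while the loaded list decodes ids as expected.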

CR-Gjx commented on September 4, 2024

OK, it may be a .pkl issue. Thanks for finding this.

bharathreddy1997 commented on September 4, 2024

Hi, thank you for the great work! Would it be possible to give more details on how the real data is pre-processed?
Hi, did you understand how the pickle file was generated?
