Questions about input data (code2seq) · 12 comments · closed

tech-srl commented on May 27, 2024
Questions about input data

Comments (12)

lishi0927 commented on May 27, 2024

Okay, I know why this happens.
Because I use a batch size of 2, the reader randomly picks two target labels into each batch;
and your project focuses on one function rather than one file (for example, AboutPage.java has two functions).
Thank you.
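
To illustrate (a hypothetical sketch; the path contexts are elided): a file such as AboutPage.java with two methods yields two independent lines in the .c2s data, one per function:

pre|head <path-context> <path-context> ...
content <path-context> <path-context> ...

With BATCH_SIZE = 2 and shuffling enabled, any two of these lines may land in the same batch, regardless of which file they came from.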

urialon commented on May 27, 2024

Hi @lishi0927 ,
Please provide the exact command line that you ran.
Please also mention if you have changed anything in the code compared to the default (git diff).

Uri

lishi0927 commented on May 27, 2024

Thank you for your reply.
My command line is "python3 reader.py".
The only thing I changed is self.TRAIN_PATH = 'java-small/java-small' on line 206 of reader.py.

urialon commented on May 27, 2024

Oh I see, running reader.py as the main file is mostly meant for debugging.

To use it with a real dataset you'll need to load the dataset's settings from the .dict.c2s file, and create a vocabulary out of them.
You'll need to copy lines 33-55 from model.py and paste them to line 220 in reader.py (possibly with some minor adaptation of variable names).

The issue is that in the test code in reader.py, settings like config.DATA_NUM_CONTEXTS and dictionaries like nodes_to_index are hard-coded.
In real training (as in model.py), these settings are loaded from the dataset's dictionary file, and the dictionaries like nodes_to_index are created based on them.
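
Concretely, a minimal sketch of that pattern (adapted from model.py; exact variable names in your copy may differ slightly):

import pickle

# Load the per-dataset statistics that the preprocessor pickled into the .dict.c2s file
with open('{}.dict.c2s'.format(config.TRAIN_PATH), 'rb') as file:
    subtoken_to_count = pickle.load(file)
    node_to_count = pickle.load(file)
    target_to_count = pickle.load(file)
    max_contexts = pickle.load(file)
    num_training_examples = pickle.load(file)

# Build real vocabularies from the loaded counts instead of the hard-coded test dicts
subtoken_to_index, index_to_subtoken, subtoken_vocab_size = \
    Common.load_vocab_from_dict(subtoken_to_count, add_values=[Common.PAD, Common.UNK],
                                max_size=config.SUBTOKENS_VOCAB_MAX_SIZE)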

Let me know if you tried that and experienced problems.

lishi0927 commented on May 27, 2024

I have tried your advice, but I still get the same bug.
Here is the code of my main function in reader.py. I use the java-small dataset, so java-small/java-small.dict.c2s is in the same folder as the train and test files.

target_word_to_index = {Common.PAD: 0, Common.UNK: 1, Common.SOS: 2,
                        'a': 3, 'b': 4, 'c': 5, 'd': 6, 't': 7}
subtoken_to_index = {Common.PAD: 0, Common.UNK: 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5}
node_to_index = {Common.PAD: 0, Common.UNK: 1, '1': 2, '2': 3, '3': 4, '4': 5}
import numpy as np

class Config:
    def __init__(self):
        self.SAVE_EVERY_EPOCHS = 1
        self.TRAIN_PATH = self.TEST_PATH = 'java-small/java-small'
        self.BATCH_SIZE = 2
        self.TEST_BATCH_SIZE = self.BATCH_SIZE
        self.READER_NUM_PARALLEL_BATCHES = 1
        self.READING_BATCH_SIZE = 2
        self.SHUFFLE_BUFFER_SIZE = 100
        self.MAX_CONTEXTS = 4
        self.DATA_NUM_CONTEXTS = 4
        self.MAX_PATH_LENGTH = 3
        self.MAX_NAME_PARTS = 2
        self.MAX_TARGET_PARTS = 4
        self.RANDOM_CONTEXTS = True
        self.CSV_BUFFER_SIZE = None
        self.SUBTOKENS_VOCAB_MAX_SIZE = 190000
        self.TARGET_VOCAB_MAX_SIZE = 27000

config = Config()
with open('{}.dict.c2s'.format(config.TRAIN_PATH), 'rb') as file:
    subtoken_to_count = pickle.load(file)
    node_to_count = pickle.load(file)
    target_to_count = pickle.load(file)
    max_contexts = pickle.load(file)
    num_training_examples = pickle.load(file)
    print('Dictionaries loaded.')

if config.DATA_NUM_CONTEXTS <= 0:
    config.DATA_NUM_CONTEXTS = max_contexts
subtoken_to_index, index_to_subtoken, subtoken_vocab_size = \
    Common.load_vocab_from_dict(subtoken_to_count, add_values=[Common.PAD, Common.UNK],
                                max_size=config.SUBTOKENS_VOCAB_MAX_SIZE)
print('Loaded subtoken vocab. size: %d' % subtoken_vocab_size)

target_to_index, index_to_target, target_vocab_size = \
    Common.load_vocab_from_dict(target_to_count, add_values=[Common.PAD, Common.UNK, Common.SOS],
                                max_size=config.TARGET_VOCAB_MAX_SIZE)
print('Loaded target word vocab. size: %d' % target_vocab_size)

node_to_index, index_to_node, nodes_vocab_size = \
    Common.load_vocab_from_dict(node_to_count, add_values=[Common.PAD, Common.UNK], max_size=None)
print('Loaded nodes vocab. size: %d' % nodes_vocab_size)
epochs_trained = 0

reader = Reader(subtoken_to_index, target_word_to_index, node_to_index, config, False)

output = reader.get_output()
target_index_op = output[TARGET_INDEX_KEY]
target_string_op = output[TARGET_STRING_KEY]
target_length_op = output[TARGET_LENGTH_KEY]
path_source_indices_op = output[PATH_SOURCE_INDICES_KEY]
node_indices_op = output[NODE_INDICES_KEY]
path_target_indices_op = output[PATH_TARGET_INDICES_KEY]
valid_context_mask_op = output[VALID_CONTEXT_MASK_KEY]
path_source_lengths_op = output[PATH_SOURCE_LENGTHS_KEY]
path_lengths_op = output[PATH_LENGTHS_KEY]
path_target_lengths_op = output[PATH_TARGET_LENGTHS_KEY]
path_source_strings_op = output[PATH_SOURCE_STRINGS_KEY]
path_strings_op = output[PATH_STRINGS_KEY]
path_target_strings_op = output[PATH_TARGET_STRINGS_KEY]

sess = tf.InteractiveSession()
tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.tables_initializer()).run()
reader.reset(sess)

try:
    while True:
        target_indices, target_strings, target_lengths, path_source_indices, \
            node_indices, path_target_indices, valid_context_mask, path_source_lengths, \
            path_lengths, path_target_lengths, path_source_strings, path_strings, \
            path_target_strings = sess.run(
                [target_index_op, target_string_op, target_length_op, path_source_indices_op,
                 node_indices_op, path_target_indices_op, valid_context_mask_op, path_source_lengths_op,
                 path_lengths_op, path_target_lengths_op, path_source_strings_op, path_strings_op,
                 path_target_strings_op])

        print('Target strings: ', Common.binary_to_string_list(target_strings))
        print('Context strings: ', Common.binary_to_string_3d(
            np.concatenate([path_source_strings, path_strings, path_target_strings], -1)))
        print('Target indices: ', target_indices)
        print('Target lengths: ', target_lengths)
        print('Path source strings: ', Common.binary_to_string_3d(path_source_strings))
        print('Path source indices: ', path_source_indices)
        print('Path source lengths: ', path_source_lengths)
        print('Path strings: ', Common.binary_to_string_3d(path_strings))
        print('Node indices: ', node_indices)
        print('Path lengths: ', path_lengths)
        print('Path target strings: ', Common.binary_to_string_3d(path_target_strings))
        print('Path target indices: ', path_target_indices)
        print('Path target lengths: ', path_target_lengths)
        print('Valid context mask: ', valid_context_mask)

        # target_indices = sess.run(target_index_op)
        # print('Target indices: ', target_indices)
except tf.errors.OutOfRangeError:
    print('Done training, epoch reached')
urialon commented on May 27, 2024

Right, I see.
In the line self.DATA_NUM_CONTEXTS = 4, set the initialization value to 0 instead of 4.
This signals the reader to load the value from the dataset's dictionary file rather than use the hard-coded value.
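
That works because of the check already present in the snippet above:

self.DATA_NUM_CONTEXTS = 0  # 0 means: take the value from the .dict.c2s file

# later, after the dictionaries are loaded:
if config.DATA_NUM_CONTEXTS <= 0:
    config.DATA_NUM_CONTEXTS = max_contexts  # value stored in the dataset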

Sorry for not noticing this difference before.

lishi0927 commented on May 27, 2024

Thank you for your patient reply.
It makes sense; I have tried changing the configuration parameters as follows:

def __init__(self):
    self.SAVE_EVERY_EPOCHS = 1
    self.TRAIN_PATH = self.TEST_PATH = 'java-small/java-small'
    self.BATCH_SIZE = 2
    self.TEST_BATCH_SIZE = self.BATCH_SIZE
    self.READER_NUM_PARALLEL_BATCHES = 1
    self.READING_BATCH_SIZE = 2
    self.SHUFFLE_BUFFER_SIZE = 100
    self.MAX_CONTEXTS = 0
    self.DATA_NUM_CONTEXTS = 0
    self.MAX_PATH_LENGTH = 0
    self.MAX_NAME_PARTS = 0
    self.MAX_TARGET_PARTS = 0
    self.RANDOM_CONTEXTS = True
    self.CSV_BUFFER_SIZE = None
    self.SUBTOKENS_VOCAB_MAX_SIZE = 190000
    self.TARGET_VOCAB_MAX_SIZE = 27000

urialon commented on May 27, 2024

Let me know if there is still any problem.

lishi0927 commented on May 27, 2024

I used a simple example, but the result is:

Target strings:  ['content', 'render']
Context strings:  [[], []]
Target indices:  [[0]
 [0]]
Target lengths:  [0 0]
Path source strings:  [[], []]
Path source indices:  []
Path source lengths:  []
Path strings:  [[], []]
Node indices:  []
Path lengths:  []
Path target strings:  [[], []]
Path target indices:  []
Path target lengths:  []
Valid context mask:  []
Target strings:  ['pre|head']
Context strings:  [[]]
Target indices:  [[0]]
Target lengths:  [0]
Path source strings:  [[]]
Path source indices:  []
Path source lengths:  []
Path strings:  [[]]
Node indices:  []
Path lengths:  []
Path target strings:  [[]]
Path target indices:  []
Path target lengths:  []
Valid context mask:  []
Done training, epoch reached

Why are there so many empty outputs?
I use two Java files from the java-small dataset; the attached test_dataset.train.raw.txt is as follows:
test_dataset.train.raw.txt

urialon commented on May 27, 2024

I think this is because you zeroed too many config parameters; you should have zeroed only DATA_NUM_CONTEXTS. None of the following should be zero. Please set them to these values (as in config.py):

self.MAX_CONTEXTS = 200
self.DATA_NUM_CONTEXTS = 0
self.MAX_PATH_LENGTH = 9
self.MAX_NAME_PARTS = 5
self.MAX_TARGET_PARTS = 6

Of course, you can reduce these numbers if, for example, 200 paths are too many to observe at once; see the consolidated sketch below.
Let me know if you experience any additional problems.
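
Putting it together, a sketch of the corrected Config (the remaining fields kept as in the snippet earlier in this thread):

class Config:
    def __init__(self):
        self.SAVE_EVERY_EPOCHS = 1
        self.TRAIN_PATH = self.TEST_PATH = 'java-small/java-small'
        self.BATCH_SIZE = 2
        self.TEST_BATCH_SIZE = self.BATCH_SIZE
        self.READER_NUM_PARALLEL_BATCHES = 1
        self.READING_BATCH_SIZE = 2
        self.SHUFFLE_BUFFER_SIZE = 100
        self.MAX_CONTEXTS = 200        # maximum path contexts kept per example
        self.DATA_NUM_CONTEXTS = 0     # 0: load the actual value from the .dict.c2s file
        self.MAX_PATH_LENGTH = 9
        self.MAX_NAME_PARTS = 5
        self.MAX_TARGET_PARTS = 6
        self.RANDOM_CONTEXTS = True
        self.CSV_BUFFER_SIZE = None
        self.SUBTOKENS_VOCAB_MAX_SIZE = 190000
        self.TARGET_VOCAB_MAX_SIZE = 27000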

lishi0927 commented on May 27, 2024

Thank you.
It makes sense, but I still have a question.
I know that the features from all the different Java files are extracted into the raw files, where each line stores (target name, paths, padding). But when I run reader.py, the target labels are assigned to different tensors.
For example, the function in AboutBlock.java is [render], and the functions in AboutPage.java are [pre|head, content];
hence the train.c2s looks as follows:
content ...
render ...
pre|head ...
But the outputs are:
Target strings: ['render', 'content']
Target strings: ['pre|head']
How are these target labels combined? Why not ['render'] or ['content', 'pre|head']?

urialon commented on May 27, 2024

Great!
If you pass is_evaluating=True when initializing the Reader object, the target labels will not be shuffled and will appear in the same order as in the text file.
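
A minimal sketch, assuming the trailing False in the earlier Reader(...) call is the is_evaluating argument:

# is_evaluating=True disables shuffling, so targets appear in file order
reader = Reader(subtoken_to_index, target_to_index, node_to_index, config, is_evaluating=True)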
