Questions about input data (code2seq) · 12 comments · closed

tech-srl commented on May 27, 2024
Questions about input data

Comments (12)

lishi0927 commented on May 27, 2024

Okay, I know why this happens.
Because I use a batch size of 2, the reader randomly picks two target labels into each batch;
and your project focuses on one function rather than one file (for example, AboutPage.java has two functions).
Thank you.
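
To illustrate (a hypothetical sketch; the path contexts are elided): a file such as AboutPage.java with two methods yields two independent lines in the .c2s data, one per function:

pre|head <path-context> <path-context> ...
content <path-context> <path-context> ...

With BATCH_SIZE = 2 and shuffling enabled, any two of these lines may land in the same batch, regardless of which file they came from.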

urialon commented on May 27, 2024

Hi @lishi0927 ,
Please provide the exact command line that you ran.
Please also mention if you have changed anything in the code compared to the default (git diff).

Uri

lishi0927 commented on May 27, 2024

Thank you for your reply.
My command line is "python3 reader.py".
The only thing I changed is self.TRAIN_PATH = 'java-small/java-small' on line 206 of reader.py.

urialon commented on May 27, 2024

Oh I see, running reader.py as the main file is mostly meant for debugging.

To use it with a real dataset you'll need to load the dataset's settings from the .dict.c2s file, and create a vocabulary out of them.
You'll need to copy lines 33-55 from model.py and paste them to line 220 in reader.py (possibly with some minor adaptation of variable names).

The issue is that in the test code in reader.py, settings like config.DATA_NUM_CONTEXTS and dictionaries like nodes_to_index are hard-coded.
In real training (as in model.py), these settings are loaded from the dataset's dictionary file, and the dictionaries like nodes_to_index are created based on them.
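
Concretely, a minimal sketch of that pattern (adapted from model.py; exact variable names in your copy may differ slightly):

import pickle

# Load the per-dataset statistics that the preprocessor pickled into the .dict.c2s file
with open('{}.dict.c2s'.format(config.TRAIN_PATH), 'rb') as file:
    subtoken_to_count = pickle.load(file)
    node_to_count = pickle.load(file)
    target_to_count = pickle.load(file)
    max_contexts = pickle.load(file)
    num_training_examples = pickle.load(file)

# Build real vocabularies from the loaded counts instead of the hard-coded test dicts
subtoken_to_index, index_to_subtoken, subtoken_vocab_size = \
    Common.load_vocab_from_dict(subtoken_to_count, add_values=[Common.PAD, Common.UNK],
                                max_size=config.SUBTOKENS_VOCAB_MAX_SIZE)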

Let me know if you tried that and experienced problems.

lishi0927 commented on May 27, 2024

I have tried your advice, but I still get the same bug.
Here is the code of my main function in reader.py. I use the java-small dataset, so java-small/java-small.dict.c2s is in the same folder as the train and test files.

target_word_to_index = {Common.PAD: 0, Common.UNK: 1, Common.SOS: 2,
                        'a': 3, 'b': 4, 'c': 5, 'd': 6, 't': 7}
subtoken_to_index = {Common.PAD: 0, Common.UNK: 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5}
node_to_index = {Common.PAD: 0, Common.UNK: 1, '1': 2, '2': 3, '3': 4, '4': 5}
import numpy as np

class Config:
    def __init__(self):
        self.SAVE_EVERY_EPOCHS = 1
        self.TRAIN_PATH = self.TEST_PATH = 'java-small/java-small'
        self.BATCH_SIZE = 2
        self.TEST_BATCH_SIZE = self.BATCH_SIZE
        self.READER_NUM_PARALLEL_BATCHES = 1
        self.READING_BATCH_SIZE = 2
        self.SHUFFLE_BUFFER_SIZE = 100
        self.MAX_CONTEXTS = 4
        self.DATA_NUM_CONTEXTS = 4
        self.MAX_PATH_LENGTH = 3
        self.MAX_NAME_PARTS = 2
        self.MAX_TARGET_PARTS = 4
        self.RANDOM_CONTEXTS = True
        self.CSV_BUFFER_SIZE = None
        self.SUBTOKENS_VOCAB_MAX_SIZE = 190000
        self.TARGET_VOCAB_MAX_SIZE = 27000

config = Config()
with open('{}.dict.c2s'.format(config.TRAIN_PATH), 'rb') as file:
    subtoken_to_count = pickle.load(file)
    node_to_count = pickle.load(file)
    target_to_count = pickle.load(file)
    max_contexts = pickle.load(file)
    num_training_examples = pickle.load(file)
    print('Dictionaries loaded.')

if config.DATA_NUM_CONTEXTS <= 0:
    config.DATA_NUM_CONTEXTS = max_contexts
subtoken_to_index, index_to_subtoken, subtoken_vocab_size = \
    Common.load_vocab_from_dict(subtoken_to_count, add_values=[Common.PAD, Common.UNK],
                                max_size=config.SUBTOKENS_VOCAB_MAX_SIZE)
print('Loaded subtoken vocab. size: %d' % subtoken_vocab_size)

target_to_index, index_to_target, target_vocab_size = \
    Common.load_vocab_from_dict(target_to_count, add_values=[Common.PAD, Common.UNK, Common.SOS],
                                max_size=config.TARGET_VOCAB_MAX_SIZE)
print('Loaded target word vocab. size: %d' % target_vocab_size)

node_to_index, index_to_node, nodes_vocab_size = \
    Common.load_vocab_from_dict(node_to_count, add_values=[Common.PAD, Common.UNK], max_size=None)
print('Loaded nodes vocab. size: %d' % nodes_vocab_size)
epochs_trained = 0

reader = Reader(subtoken_to_index, target_word_to_index, node_to_index, config, False)

output = reader.get_output()
target_index_op = output[TARGET_INDEX_KEY]
target_string_op = output[TARGET_STRING_KEY]
target_length_op = output[TARGET_LENGTH_KEY]
path_source_indices_op = output[PATH_SOURCE_INDICES_KEY]
node_indices_op = output[NODE_INDICES_KEY]
path_target_indices_op = output[PATH_TARGET_INDICES_KEY]
valid_context_mask_op = output[VALID_CONTEXT_MASK_KEY]
path_source_lengths_op = output[PATH_SOURCE_LENGTHS_KEY]
path_lengths_op = output[PATH_LENGTHS_KEY]
path_target_lengths_op = output[PATH_TARGET_LENGTHS_KEY]
path_source_strings_op = output[PATH_SOURCE_STRINGS_KEY]
path_strings_op = output[PATH_STRINGS_KEY]
path_target_strings_op = output[PATH_TARGET_STRINGS_KEY]

sess = tf.InteractiveSession()
tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.tables_initializer()).run()
reader.reset(sess)

try:
    while True:
        target_indices, target_strings, target_lengths, path_source_indices, \
            node_indices, path_target_indices, valid_context_mask, path_source_lengths, \
            path_lengths, path_target_lengths, path_source_strings, path_strings, \
            path_target_strings = sess.run(
                [target_index_op, target_string_op, target_length_op, path_source_indices_op,
                 node_indices_op, path_target_indices_op, valid_context_mask_op, path_source_lengths_op,
                 path_lengths_op, path_target_lengths_op, path_source_strings_op, path_strings_op,
                 path_target_strings_op])

        print('Target strings: ', Common.binary_to_string_list(target_strings))
        print('Context strings: ', Common.binary_to_string_3d(
            np.concatenate([path_source_strings, path_strings, path_target_strings], -1)))
        print('Target indices: ', target_indices)
        print('Target lengths: ', target_lengths)
        print('Path source strings: ', Common.binary_to_string_3d(path_source_strings))
        print('Path source indices: ', path_source_indices)
        print('Path source lengths: ', path_source_lengths)
        print('Path strings: ', Common.binary_to_string_3d(path_strings))
        print('Node indices: ', node_indices)
        print('Path lengths: ', path_lengths)
        print('Path target strings: ', Common.binary_to_string_3d(path_target_strings))
        print('Path target indices: ', path_target_indices)
        print('Path target lengths: ', path_target_lengths)
        print('Valid context mask: ', valid_context_mask)

        # target_indices = sess.run(target_index_op)
        # print('Target indices: ', target_indices)
except tf.errors.OutOfRangeError:
    print('Done training, epoch reached')
urialon commented on May 27, 2024

Right, I see.
In the line self.DATA_NUM_CONTEXTS = 4, set the initialization value to 0 instead of 4.
This signals the reader to load the value from the dataset's dictionary file rather than use the hard-coded value.
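
That works because of the check already present in the snippet above:

self.DATA_NUM_CONTEXTS = 0  # 0 means: take the value from the .dict.c2s file

# later, after the dictionaries are loaded:
if config.DATA_NUM_CONTEXTS <= 0:
    config.DATA_NUM_CONTEXTS = max_contexts  # value stored in the dataset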

Sorry for not noticing this difference before.

lishi0927 commented on May 27, 2024

Thank you for your patient reply.
It makes sense; I have tried changing the configuration parameters as follows:

def __init__(self):
    self.SAVE_EVERY_EPOCHS = 1
    self.TRAIN_PATH = self.TEST_PATH = 'java-small/java-small'
    self.BATCH_SIZE = 2
    self.TEST_BATCH_SIZE = self.BATCH_SIZE
    self.READER_NUM_PARALLEL_BATCHES = 1
    self.READING_BATCH_SIZE = 2
    self.SHUFFLE_BUFFER_SIZE = 100
    self.MAX_CONTEXTS = 0
    self.DATA_NUM_CONTEXTS = 0
    self.MAX_PATH_LENGTH = 0
    self.MAX_NAME_PARTS = 0
    self.MAX_TARGET_PARTS = 0
    self.RANDOM_CONTEXTS = True
    self.CSV_BUFFER_SIZE = None
    self.SUBTOKENS_VOCAB_MAX_SIZE = 190000
    self.TARGET_VOCAB_MAX_SIZE = 27000

urialon commented on May 27, 2024

Let me know if there is still any problem.

lishi0927 commented on May 27, 2024

I used a simple example, but the result is:

Target strings:  ['content', 'render']
Context strings:  [[], []]
Target indices:  [[0]
 [0]]
Target lengths:  [0 0]
Path source strings:  [[], []]
Path source indices:  []
Path source lengths:  []
Path strings:  [[], []]
Node indices:  []
Path lengths:  []
Path target strings:  [[], []]
Path target indices:  []
Path target lengths:  []
Valid context mask:  []
Target strings:  ['pre|head']
Context strings:  [[]]
Target indices:  [[0]]
Target lengths:  [0]
Path source strings:  [[]]
Path source indices:  []
Path source lengths:  []
Path strings:  [[]]
Node indices:  []
Path lengths:  []
Path target strings:  [[]]
Path target indices:  []
Path target lengths:  []
Valid context mask:  []
Done training, epoch reached

Why are there so many empty outputs?
I use two Java files from the java-small dataset; the attached test_dataset.train.raw.txt is as follows:
test_dataset.train.raw.txt

urialon commented on May 27, 2024

I think this is because you zeroed too many config parameters; you should have zeroed only DATA_NUM_CONTEXTS. None of the following should be zero. Please set them to these values (as in config.py):

self.MAX_CONTEXTS = 200
self.DATA_NUM_CONTEXTS = 0
self.MAX_PATH_LENGTH = 9
self.MAX_NAME_PARTS = 5
self.MAX_TARGET_PARTS = 6

Of course, you can reduce these numbers if, for example, 200 paths are too many to observe at once; see the consolidated sketch below.
Let me know if you experience any additional problems.
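
Putting it together, a sketch of the corrected Config (the remaining fields kept as in the snippet earlier in this thread):

class Config:
    def __init__(self):
        self.SAVE_EVERY_EPOCHS = 1
        self.TRAIN_PATH = self.TEST_PATH = 'java-small/java-small'
        self.BATCH_SIZE = 2
        self.TEST_BATCH_SIZE = self.BATCH_SIZE
        self.READER_NUM_PARALLEL_BATCHES = 1
        self.READING_BATCH_SIZE = 2
        self.SHUFFLE_BUFFER_SIZE = 100
        self.MAX_CONTEXTS = 200        # maximum path contexts kept per example
        self.DATA_NUM_CONTEXTS = 0     # 0: load the actual value from the .dict.c2s file
        self.MAX_PATH_LENGTH = 9
        self.MAX_NAME_PARTS = 5
        self.MAX_TARGET_PARTS = 6
        self.RANDOM_CONTEXTS = True
        self.CSV_BUFFER_SIZE = None
        self.SUBTOKENS_VOCAB_MAX_SIZE = 190000
        self.TARGET_VOCAB_MAX_SIZE = 27000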

lishi0927 commented on May 27, 2024

Thank you.
It makes sense, but I still have a question.
I know that the features from all the different Java files are extracted into the raw files, where each line stores (target name, paths, padding). But when I run reader.py, the target labels are assigned to different tensors.
For example, the function in AboutBlock.java is [render], and the functions in AboutPage.java are [pre|head, content];
hence the train.c2s looks as follows:
content ...
render ...
pre|head ...
But the outputs are:
Target strings: ['render', 'content']
Target strings: ['pre|head']
How are these target labels combined? Why not ['render'] or ['content', 'pre|head']?

urialon commented on May 27, 2024

Great!
If you pass is_evaluating=True when initializing the Reader object, the target labels will not be shuffled and will appear in the same order as in the text file.
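
A minimal sketch, assuming the trailing False in the earlier Reader(...) call is the is_evaluating argument:

# is_evaluating=True disables shuffling, so targets appear in file order
reader = Reader(subtoken_to_index, target_to_index, node_to_index, config, is_evaluating=True)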
