Hello, I have tried to use your dataset V2 and found an interesting t

Parallel corpus V2 possibly incorrect (code-docstring-corpus issue, CLOSED)

edinburghnlp commented on June 20, 2024

Parallel corpus V2 possibly incorrect

from code-docstring-corpus.

Comments (4)

Avmb commented on June 20, 2024

I've checked with wc -l under Linux and I don't see any error:

$ wc -l parallel_methods_*
397224 parallel_methods_bodies
397224 parallel_methods_decl
397224 parallel_methods_desc
397224 parallel_methods_meta
1588896 total
$ wc -l parallel_bodies parallel_decl parallel_desc
148602 parallel_bodies
148602 parallel_decl
148602 parallel_desc
445806 total
$ wc -l data_ps.*.train
109108 data_ps.bodies.train
109108 data_ps.declarations.train
109108 data_ps.decldesc.train
109108 data_ps.descriptions.train
109108 data_ps.metadata.train
545540 total

How are you counting the lines? Maybe there is a difference in how newlines or some punctuation characters are handled.
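As an illustration of how two counting methods can disagree on the same text, here is a minimal sketch. The sample string is made up for the example and is not taken from the corpus. It compares counting `\n` characters (what `wc -l` does) with Python's `str.splitlines()`, which also breaks on characters such as form feed (`\x0c`) and the Unicode line separator (`\u2028`).

```python
# Hypothetical sample: three records terminated by '\n', two of which
# contain characters that str.splitlines() also treats as line boundaries.
text = 'one\ntwo\x0cstill two\nthree\u2028still three\n'

# Counting '\n' only, which is how `wc -l` counts lines.
newline_count = text.count('\n')

# str.splitlines() additionally splits on '\x0c' and '\u2028'.
splitlines_count = len(text.splitlines())

print(newline_count, splitlines_count)
```

If the two numbers differ on a real file, the file probably contains one of these extra boundary characters.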

osanwe commented on June 20, 2024

I wrote a simple Reader class for the subsequent data preprocessing. It looks like this:

class Reader:

    def __init__(self, path_to_func, path_to_desc):
        self.path_to_func = path_to_func
        self.path_to_desc = path_to_desc
        self.__read_functions()
        self.__read_descriptions()
        assert len(self.functions) == len(self.descriptions),\
            'Functions count: {}; Desctiprions count: {}\nFunctions file: {}\nDescriptions count: {}'.format(len(self.functions), len(self.descriptions), self.path_to_func, self.path_to_desc)

    def __read_functions(self):
        print('Reading functions data...')
        self.functions = self.__read_file(self.path_to_func)

    def __read_descriptions(self):
        print('Reading descriptions data...')
        self.descriptions = self.__read_file(self.path_to_desc)

    def __read_file(self, filename):
        with open(filename, 'r', encoding='utf-8') as f:
            return f.readlines()

Next, I call this code in the main function:

path_to_func = 'D:\\code-docstring-corpus\\V2\\parallel\\parallel_bodies'
path_to_desc = 'D:\\code-docstring-corpus\\V2\\parallel\\parallel_desc'
r = reader.Reader(path_to_func, path_to_desc)

After that I got the following message:

Reading functions data...
Reading descriptions data...
Traceback (most recent call last):
  File "D:/Sources/DocStringsPredictor/src/main.py", line 37, in <module>
    sys.exit(main())
  File "D:/Sources/DocStringsPredictor/src/main.py", line 23, in main
    r = reader.Reader(path_to_func, path_to_desc)
  File "D:\Sources\DocStringsPredictor\src\reader.py", line 27, in __init__
    'Functions count: {}; Desctiprions count: {}\nFunctions file: {}\nDescriptions count: {}'.format(len(self.functions), len(self.descriptions), self.path_to_func, self.path_to_desc)
AssertionError: Functions count: 148602; Desctiprions count: 148619
Functions file: D:\code-docstring-corpus\V2\parallel\parallel_bodies
Descriptions count: D:\code-docstring-corpus\V2\parallel\parallel_desc

Notepad++ shows the same values.

Edited (June 14th, 2018):
Deleted incorrect Linux example.

osanwe commented on June 20, 2024

Okay. It is a description of the problem and its solution.

Let's take the current version of the repository and count the number of lines:

$ wc -l parallel_bodies parallel_decl parallel_desc
   148602 parallel_bodies
   148602 parallel_decl
   148602 parallel_desc
   445806 total

It shows the correct values.

Next, let's write a simple function to get the number of lines in files:

def get_lines(filename):
    # `latin1` is used instead of `utf-8` because decoding as `utf-8` raises `UnicodeDecodeError`
    with open(filename, 'r', encoding='latin1') as f:
        return len(f.readlines())

Now let's count the lines:

>>> get_lines('./V2/parallel/parallel_desc')
148619
>>> get_lines('./V2/parallel/parallel_bodies')
148602
>>> get_lines('./V2/parallel/parallel_decl')
148602

It seems that something is wrong with the parallel_desc file. Let's check:

>>> f2 = open('./V2/parallel/parallel_desc', 'rb')
>>> c = f2.read()
>>> c.count(b'\n')
148602
>>> c.count(b'\r')
17
>>> c.count(b'\r\n')
0

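The file contains 17 lone \r characters, none of them part of a \r\n pair. Python's text mode treats a lone \r as a line break by default, while wc -l counts only \n, which explains the 17 extra lines (148619 - 148602). It is clear that \r is not needed.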
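The mismatch can be reproduced on a tiny sample. This is a minimal sketch: the sample bytes and the helper names `count_lines_text` and `count_newline_bytes` are made up for illustration. It relies on the documented `newline` parameter of `open()`: with the default `newline=None` a lone `\r` ends a line, while `newline='\n'` splits on `\n` only.

```python
import os
import tempfile


def count_lines_text(path, newline=None):
    """Count lines the way Python's text mode sees them."""
    with open(path, 'r', encoding='latin1', newline=newline) as f:
        return len(f.readlines())


def count_newline_bytes(path):
    """Count b'\\n' bytes, which is what `wc -l` reports."""
    with open(path, 'rb') as f:
        return f.read().count(b'\n')


# Hypothetical sample: three records, the second contains a stray '\r'.
sample = b'first description\nsecond \rdescription\nthird description\n'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

universal = count_lines_text(sample_path)             # lone '\r' counts as a break
strict = count_lines_text(sample_path, newline='\n')  # only '\n' counts
raw = count_newline_bytes(sample_path)                # byte-level count
os.remove(sample_path)

print(universal, strict, raw)  # 4 3 3
```

Passing `newline='\n'` when reading would therefore give the same count as `wc -l` without changing the file, although the stray `\r` characters would then remain inside the description text.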

Removing \r and updating the file:

>>> c = c.replace(b'\r', b'')
>>> c.count(b'\n')
148602
>>> c.count(b'\r')
0
>>> c.count(b'\r\n')
0
>>> f3 = open('./V2/parallel/parallel_desc', 'wb')
>>> f3.write(c)
39562697
>>> f3.close()
>>> f2.close()

Now our function returns the correct value:

>>> get_lines('./V2/parallel/parallel_desc')
148602

and the Reader class works correctly.

Please check and update the other files if possible.
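A check of the other files could be sketched roughly as follows. The helper names (`count_lone_cr`, `strip_lone_cr`, `scan_files`) are hypothetical, and unlike the session above, which removes every \r, this variant removes only a \r that is not followed by \n, so any genuine \r\n endings would be left alone.

```python
import re

# A '\r' that is not immediately followed by '\n'.
LONE_CR = re.compile(rb'\r(?!\n)')


def count_lone_cr(data):
    """Return the number of lone '\\r' bytes in `data`."""
    return len(LONE_CR.findall(data))


def strip_lone_cr(data):
    """Return `data` with lone '\\r' bytes removed; '\\r\\n' pairs are kept."""
    return LONE_CR.sub(b'', data)


def scan_files(paths):
    """Map each path to its lone-'\\r' count, keeping only affected files."""
    report = {}
    for path in paths:
        with open(path, 'rb') as f:
            found = count_lone_cr(f.read())
        if found:
            report[path] = found
    return report


# Small in-memory check on made-up bytes.
sample = b'a\rb\nc\r\nd\r'
print(count_lone_cr(sample))   # 2
print(strip_lone_cr(sample))   # b'ab\nc\r\nd'
```

Calling `scan_files` with the list of corpus files would show which ones need cleaning before any of them is rewritten.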

Thank you.

Avmb commented on June 20, 2024

Fixed.
