Comments (4)

Avmb commented on June 20, 2024

I've checked with wc -l under Linux and I don't see any error:

$ wc -l parallel_methods_*
397224 parallel_methods_bodies
397224 parallel_methods_decl
397224 parallel_methods_desc
397224 parallel_methods_meta
1588896 total
$ wc -l parallel_bodies parallel_decl parallel_desc
148602 parallel_bodies
148602 parallel_decl
148602 parallel_desc
445806 total
$ wc -l data_ps.*.train
109108 data_ps.bodies.train
109108 data_ps.declarations.train
109108 data_ps.decldesc.train
109108 data_ps.descriptions.train
109108 data_ps.metadata.train
545540 total

How are you counting the lines? Maybe there is a difference in how newlines or some punctuation characters are handled.
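For comparison, wc -l reports the number of newline (\n) bytes in a file. A minimal Python sketch of the same count follows; the function name count_newline_bytes and the chunk size are illustrative assumptions, not part of the corpus tooling:

```python
def count_newline_bytes(path, chunk_size=1 << 20):
    # Count b'\n' bytes only, which is what `wc -l` reports.
    # A bare b'\r' is NOT counted as a line break here.
    total = 0
    with open(path, 'rb') as f:
        # Read fixed-size chunks so large corpus files are not loaded at once.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            total += chunk.count(b'\n')
    return total
```

If this number differs from what another tool reports for the same file, the other tool is probably treating some additional character as a line terminator.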

from code-docstring-corpus.

osanwe commented on June 20, 2024

I wrote a simple Reader class for the subsequent data preprocessing. It looks like this:

class Reader:

    def __init__(self, path_to_func, path_to_desc):
        self.path_to_func = path_to_func
        self.path_to_desc = path_to_desc
        self.__read_functions()
        self.__read_descriptions()
        assert len(self.functions) == len(self.descriptions),\
            'Functions count: {}; Desctiprions count: {}\nFunctions file: {}\nDescriptions count: {}'.format(len(self.functions), len(self.descriptions), self.path_to_func, self.path_to_desc)

    def __read_functions(self):
        print('Reading functions data...')
        self.functions = self.__read_file(self.path_to_func)

    def __read_descriptions(self):
        print('Reading descriptions data...')
        self.descriptions = self.__read_file(self.path_to_desc)

    def __read_file(self, filename):
        with open(filename, 'r', encoding='utf-8') as f:
            return f.readlines()

Next I call this code in the main function:

path_to_func = 'D:\\code-docstring-corpus\\V2\\parallel\\parallel_bodies'
path_to_desc = 'D:\\code-docstring-corpus\\V2\\parallel\\parallel_desc'
r = reader.Reader(path_to_func, path_to_desc)

After that I got the following message:

Reading functions data...
Reading descriptions data...
Traceback (most recent call last):
  File "D:/Sources/DocStringsPredictor/src/main.py", line 37, in <module>
    sys.exit(main())
  File "D:/Sources/DocStringsPredictor/src/main.py", line 23, in main
    r = reader.Reader(path_to_func, path_to_desc)
  File "D:\Sources\DocStringsPredictor\src\reader.py", line 27, in __init__
    'Functions count: {}; Desctiprions count: {}\nFunctions file: {}\nDescriptions count: {}'.format(len(self.functions), len(self.descriptions), self.path_to_func, self.path_to_desc)
AssertionError: Functions count: 148602; Desctiprions count: 148619
Functions file: D:\code-docstring-corpus\V2\parallel\parallel_bodies
Descriptions count: D:\code-docstring-corpus\V2\parallel\parallel_desc

Notepad++ shows the same values.

Edited (June 14th, 2018):
Deleted incorrect Linux example.

osanwe commented on June 20, 2024

Okay. Here is a description of the problem and its solution.

Let's take the current version of the repository and count the lines:

$ wc -l parallel_bodies parallel_decl parallel_desc
   148602 parallel_bodies
   148602 parallel_decl
   148602 parallel_desc
   445806 total

It shows the correct values.

Next, let's write a simple function to get the number of lines in files:

def get_lines(filename):
    # `latin1` is used instead of `utf-8` because `utf-8` raises `UnicodeDecodeError`
    with open(filename, 'r', encoding='latin1') as f:
        return len(f.readlines())

After this, let's count the lines:

>>> get_lines('./V2/parallel/parallel_desc')
148619
>>> get_lines('./V2/parallel/parallel_bodies')
148602
>>> get_lines('./V2/parallel/parallel_decl')
148602

It seems that something is wrong with the parallel_desc file. Let's check:

>>> f2 = open('./V2/parallel/parallel_desc', 'rb')
>>> c = f2.read()
>>> c.count(b'\n')
148602
>>> c.count(b'\r')
17
>>> c.count(b'\r\n')
0

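It is clear that the \r characters are not needed. In text mode Python treats each lone \r as a line break, while wc -l counts only \n, which accounts for the 17 extra lines (148602 + 17 = 148619).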
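The mismatch between wc -l and readlines() can be reproduced on a small sample. This is a minimal sketch, assuming a hypothetical three-line sample written to a temporary file (not the corpus data); count_lines is an illustrative helper. Passing newline='\n' to open() makes only \n a line terminator, so the files can also be read without modifying them:

```python
import os
import tempfile

def count_lines(path, newline=None):
    # newline=None enables universal newlines: '\n', '\r' and '\r\n' all end a line.
    # newline='\n' makes only '\n' a line terminator; a bare '\r' stays inside the line.
    with open(path, 'r', encoding='latin1', newline=newline) as f:
        return sum(1 for _ in f)

# Hypothetical sample: three '\n'-terminated lines, one stray '\r' in the second.
sample = b'first\nsecond\rstill second\nthird\n'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

universal = count_lines(tmp.name)             # the bare '\r' counts as a line break
strict = count_lines(tmp.name, newline='\n')  # agrees with wc -l
os.remove(tmp.name)

print(universal, strict)  # prints "4 3"
```

With newline='\n' the stray \r remains in the returned text, so it may still need to be stripped before further preprocessing.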

Removing \r and updating the file:

>>> c = c.replace(b'\r', b'')
>>> c.count(b'\n')
148602
>>> c.count(b'\r')
0
>>> c.count(b'\r\n')
0
>>> f3 = open('./V2/parallel/parallel_desc', 'wb')
>>> f3.write(c)
39562697
>>> f3.close()
>>> f2.close()

Now our function returns the correct value:

>>> get_lines('./V2/parallel/parallel_desc')
148602

and the Reader class works correctly.

Please check and update the other files if possible.
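To find which other files are affected, something like the following could be used. It is only a sketch: find_bare_cr is a hypothetical helper, and it assumes each file fits in memory (parallel_desc is about 40 MB).

```python
from pathlib import Path

def find_bare_cr(directory):
    # Return {file name: count} for files that contain '\r' not followed by '\n'.
    hits = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()  # whole file in memory; fine for files of this size
        # Every '\r' minus those that are part of a '\r\n' pair.
        bare = data.count(b'\r') - data.count(b'\r\n')
        if bare:
            hits[path.name] = bare
    return hits
```

For example, find_bare_cr('./V2/parallel') would list every file in that directory whose line count differs between wc -l and Python's default text mode.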

Thank you.

Avmb commented on June 20, 2024

fixed
