Comments (4)
I've checked with wc -l under Linux and I don't see any error:
$ wc -l parallel_methods_*
397224 parallel_methods_bodies
397224 parallel_methods_decl
397224 parallel_methods_desc
397224 parallel_methods_meta
1588896 total
$ wc -l parallel_bodies parallel_decl parallel_desc
148602 parallel_bodies
148602 parallel_decl
148602 parallel_desc
445806 total
$ wc -l data_ps.*.train
109108 data_ps.bodies.train
109108 data_ps.declarations.train
109108 data_ps.decldesc.train
109108 data_ps.descriptions.train
109108 data_ps.metadata.train
545540 total
How are you counting the lines? Maybe there is a difference in how newlines or some punctuation character are handled.
from code-docstring-corpus.
I wrote a simple Reader class for further data preprocessing. It looks like this:
class Reader:
    def __init__(self, path_to_func, path_to_desc):
        self.path_to_func = path_to_func
        self.path_to_desc = path_to_desc
        self.__read_functions()
        self.__read_descriptions()
        assert len(self.functions) == len(self.descriptions),\
            'Functions count: {}; Desctiprions count: {}\nFunctions file: {}\nDescriptions count: {}'.format(len(self.functions), len(self.descriptions), self.path_to_func, self.path_to_desc)

    def __read_functions(self):
        print('Reading functions data...')
        self.functions = self.__read_file(self.path_to_func)

    def __read_descriptions(self):
        print('Reading descriptions data...')
        self.descriptions = self.__read_file(self.path_to_desc)

    def __read_file(self, filename):
        with open(filename, 'r', encoding='utf-8') as f:
            return f.readlines()
Next I call this code in the main function:
path_to_func = 'D:\\code-docstring-corpus\\V2\\parallel\\parallel_bodies'
path_to_desc = 'D:\\code-docstring-corpus\\V2\\parallel\\parallel_desc'
r = reader.Reader(path_to_func, path_to_desc)
After that I got the following message:
Reading functions data...
Reading descriptions data...
Traceback (most recent call last):
  File "D:/Sources/DocStringsPredictor/src/main.py", line 37, in <module>
    sys.exit(main())
  File "D:/Sources/DocStringsPredictor/src/main.py", line 23, in main
    r = reader.Reader(path_to_func, path_to_desc)
  File "D:\Sources\DocStringsPredictor\src\reader.py", line 27, in __init__
    'Functions count: {}; Desctiprions count: {}\nFunctions file: {}\nDescriptions count: {}'.format(len(self.functions), len(self.descriptions), self.path_to_func, self.path_to_desc)
AssertionError: Functions count: 148602; Desctiprions count: 148619
Functions file: D:\code-docstring-corpus\V2\parallel\parallel_bodies
Descriptions count: D:\code-docstring-corpus\V2\parallel\parallel_desc
Notepad++ shows the same values.
Edited (June 14th, 2018):
Deleted incorrect Linux example.
from code-docstring-corpus.
Okay. Here is a description of the problem and its solution.
Let's take the current version of the repository and count the lines:
$ wc -l parallel_bodies parallel_decl parallel_desc
148602 parallel_bodies
148602 parallel_decl
148602 parallel_desc
445806 total
This shows the correct values.
Next, let's write a simple function to get the number of lines in files:
def get_lines(filename):
    # `latin1` is used instead of `utf-8` because `utf-8` raised a `UnicodeDecodeError`
    with open(filename, 'r', encoding='latin1') as f:
        return len(f.readlines())
Now let's count the lines:
>>> get_lines('./V2/parallel/parallel_desc')
148619
>>> get_lines('./V2/parallel/parallel_bodies')
148602
>>> get_lines('./V2/parallel/parallel_decl')
148602
It seems that something is wrong with the parallel_desc file. Let's check:
>>> f2 = open('./V2/parallel/parallel_desc', 'rb')
>>> c = f2.read()
>>> c.count(b'\n')
148602
>>> c.count(b'\r')
17
>>> c.count(b'\r\n')
0
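The file contains 17 bare \r characters, none of them part of a \r\n pair. In text mode Python treats each bare \r as a line break, which accounts for the 17 extra lines (148619 - 148602 = 17), while wc -l counts only \n. It is clear that \r is not needed.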
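To illustrate the mismatch, here is a minimal sketch on a made-up three-line sample (the sample bytes and the temporary file are assumptions for illustration, not corpus data). It compares the default text mode with `newline='\n'`, which makes `open()` split lines only on `\n`:

```python
import os
import tempfile

# Hypothetical sample: three '\n'-terminated lines, one containing a bare '\r'.
sample = b"first line\nsecond\rline\nthird line\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# Default text mode (newline=None): '\r', '\n' and '\r\n' all end a line.
with open(path, 'r', encoding='latin1') as f:
    universal_count = len(f.readlines())

# newline='\n': only '\n' ends a line, which is what `wc -l` counts.
with open(path, 'r', encoding='latin1', newline='\n') as f:
    strict_count = len(f.readlines())

os.remove(path)
print(universal_count, strict_count)  # 4 3
```

Passing `newline='\n'` in the reading code would be an alternative to editing the file, at the cost of leaving the stray `\r` characters inside the descriptions.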
Removing \r and updating the file:
>>> c = c.replace(b'\r', b'')
>>> c.count(b'\n')
148602
>>> c.count(b'\r')
0
>>> c.count(b'\r\n')
0
>>> f3 = open('./V2/parallel/parallel_desc', 'wb')
>>> f3.write(c)
39562697
>>> f3.close()
>>> f2.close()
Now our function returns the correct value:
>>> get_lines('./V2/parallel/parallel_desc')
148602
and the Reader class works correctly.
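For checking other files, the same steps could be wrapped in small helpers. This is only a sketch: the names `count_bare_cr`, `strip_bare_cr` and `clean_file` are hypothetical, and unlike the session above it removes only a `\r` that is not followed by `\n`, so any genuine `\r\n` line endings would be left alone.

```python
import re

# Matches a carriage return that is NOT followed by a line feed.
BARE_CR = re.compile(rb'\r(?!\n)')


def count_bare_cr(data: bytes) -> int:
    # Number of carriage returns that are not part of a CR LF pair.
    return len(BARE_CR.findall(data))


def strip_bare_cr(data: bytes) -> bytes:
    # Drop bare carriage returns; CR LF pairs and plain LF stay untouched.
    return BARE_CR.sub(b'', data)


def clean_file(path: str) -> int:
    # Rewrite `path` in place only if bare CRs are found; return how many were removed.
    with open(path, 'rb') as f:
        data = f.read()
    removed = count_bare_cr(data)
    if removed:
        with open(path, 'wb') as f:
            f.write(strip_bare_cr(data))
    return removed


# Made-up sample with one bare CR and one CR LF pair.
sample = b'one\rstill one\ntwo\r\nthree\n'
print(count_bare_cr(sample))  # 1
print(strip_bare_cr(sample))  # b'onestill one\ntwo\r\nthree\n'
```

Calling `clean_file` on each corpus file would report how many stray characters it contained. It is worth keeping a backup first, since the file is overwritten.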
Please check and update the other files if possible.
Thank you.
from code-docstring-corpus.
fixed
from code-docstring-corpus.