autociter's Issues
Clean news.rules
Move content extraction to separate module
The code for extracting the content of a webpage is long and inelegant. Furthermore, we should be able to implement and interchangeably try several algorithms. Therefore, content extraction should be moved to a separate module.
Implement automated validation testing
The accuracy of content extraction methods should be automatically tested against a random sample of data.
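One way such a validation test could look is sketched below. The names `sample_accuracy`, `extract`, and `is_match` are hypothetical and not taken from the project. The sketch assumes each record is a `(source, expected)` pair and that the caller supplies both the extraction method and the comparison.

```python
import random


def sample_accuracy(records, extract, is_match, sample_size=50, seed=0):
    """Return the fraction of randomly sampled records extracted correctly.

    records  -- list of (source, expected) pairs (assumed layout)
    extract  -- callable under test, maps a source to extracted content
    is_match -- callable deciding whether extracted content is acceptable
    """
    rng = random.Random(seed)  # fixed seed keeps test runs reproducible
    sample = rng.sample(records, min(sample_size, len(records)))
    if not sample:
        return 0.0
    hits = sum(1 for source, expected in sample
               if is_match(extract(source), expected))
    return hits / len(sample)
```

Passing the extraction method in as an argument means several algorithms can be scored against the same sample without changing the test.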
Implement effective method of finding specific fields in article content
I.e., how can we locate an arbitrarily formatted title in article content?
Improve accuracy of content extraction
Currently, the content extraction system is extremely inaccurate. Alternative methods should be investigated and the most effective one selected.
Implement unit tests for timeout decorator
Rigorously test the newly implemented timeout decorator
Pipeline unit tests should be silent
Running unit tests should not produce any output. Perhaps pipeline could have a global variable that controls whether debugging information is printed?
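A possible alternative to a hand-rolled global flag is the standard logging module, sketched below. The logger name "pipeline" and the functions `set_debug` and `fetch_text` are assumptions made for illustration; they are not the project's actual API.

```python
import logging

# Module-level logger; the NullHandler keeps the module silent
# unless the application configures logging itself.
logger = logging.getLogger("pipeline")
logger.addHandler(logging.NullHandler())


def set_debug(enabled):
    """Switch debugging messages on or off for the whole module."""
    logger.setLevel(logging.DEBUG if enabled else logging.WARNING)


def fetch_text(source):
    """Hypothetical pipeline step that reports what it is doing."""
    logger.debug("processing %r", source)
    return source.strip()
```

With this layout, unit tests stay quiet by default, and a developer can call `set_debug(True)` and attach a handler when the messages are wanted.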
Implement timeout decorator that works on non-UNIX machines
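A minimal, portable sketch is shown below. It runs the wrapped function in a worker thread instead of relying on UNIX signals. This is not the project's existing decorator, and the names `timeout` and `slow_call` are chosen for illustration. One known limitation: the worker thread cannot be killed, so a call that times out keeps running in the background until it finishes.

```python
import functools
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def timeout(seconds):
    """Raise TimeoutError if the decorated call takes longer than `seconds`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            executor = ThreadPoolExecutor(max_workers=1)
            future = executor.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=seconds)
            except FutureTimeout:
                raise TimeoutError(
                    f"{func.__name__} exceeded {seconds}s") from None
            finally:
                # Do not block on the worker; it may still be running.
                executor.shutdown(wait=False)
        return wrapper
    return decorator


@timeout(0.05)
def slow_call():
    """Demo function that always exceeds its time limit."""
    time.sleep(0.3)
    return "done"
```

Calling `slow_call()` raises `TimeoutError`, while a function that returns within its limit passes its result through unchanged.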
Clean RND modules
Currently, the creation, evaluation, and execution modules lack documentation. Furthermore, the code could be made more concise and readable.
Figure out why data_preservation_accuracy takes so long
Something is getting stuck:
- Webpages.source? Webpages.content?
- Timeout decorator not working? (Low-probability case)
Clean citations.csv
- Remove non-English articles (maybe we can add them back in the long long run)
- Remove images / gif files
- Clean redirects
- ??
Add function for constructing rules
Right now rules can be saved to a file, but there is no mechanism for reading those rules. A function should be implemented in the solution module.
Improve _test and test_pipeline styles
The _test and test_pipeline modules currently contain minor style errors. The errors can easily be fixed with some minor edits.
Test extractor methods
Currently all extractor objects have no tests. To prevent further issues later on, unit tests should be promptly written.
Move similarity function to separate module
To prevent code duplication, the often-used similarity function should be moved to a utility module.
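As an illustration of what a shared utility could look like, the sketch below uses difflib from the standard library. The project's actual similarity metric is not shown in this issue, so the choice of `SequenceMatcher` is an assumption.

```python
from difflib import SequenceMatcher


def similarity(first, second):
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, first, second).ratio()
```

Other modules would then import this one function rather than keeping their own copies.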
Implement effective standardization of text that is logical
- Applying .lower() to the entire content does not make sense because it discards capitalization. It also means that other fields, e.g. titles, can no longer be found in the text, since titles are standardized by applying .title().
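One possible approach, sketched here under assumptions, is to leave the content untouched and make only the comparison case-insensitive. The function name `find_field` is hypothetical.

```python
import re


def find_field(content, field):
    """Return the (start, end) span of `field` in `content`, or None.

    Matching ignores case and treats any run of whitespace as equivalent,
    so the original capitalization of `content` is preserved.
    """
    words = field.split()
    if not words:
        return None
    pattern = r"\s+".join(re.escape(word) for word in words)
    match = re.search(pattern, content, flags=re.IGNORECASE)
    return match.span() if match else None
```

Because the function returns a span rather than a normalized copy, the caller can slice the original text and recover the title exactly as it was written.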
Test accuracy of title recognition
The current content extraction algorithm appears to recognize article starts with fair accuracy, but further tests are necessary to confirm this theory.
Make get_text_from_url more modular
Currently, pipeline functionality is implemented in relatively large functions. As a result, code reuse is limited.
Here's an example: for testing purposes, it may be necessary to retrieve text from a PDF. However, the test may require the entire text rather than selected portions. As a result, get_text_from_url is unusable.
Here's my proposal: create new functions get_content_from_url and get_text_from_pdf. Then implement get_content_from_url as a dispatch function that calls both get_text_from_url and get_text_from_pdf and manipulates the resulting text to extract content.
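The proposed split might be sketched as follows. The helper `is_pdf`, the parameter names, and the choice to dispatch on the file extension are all assumptions made for illustration; a real implementation might inspect the Content-Type header instead.

```python
from urllib.parse import urlparse


def is_pdf(url):
    """Guess from the URL path whether the resource is a PDF (assumption)."""
    return urlparse(url).path.lower().endswith(".pdf")


def get_content_from_url(url, text_from_url, text_from_pdf, extract):
    """Dispatch to the right text retriever, then extract the content.

    text_from_url -- callable returning the full text of a web page
    text_from_pdf -- callable returning the full text of a PDF
    extract       -- callable reducing full text to the desired content
    """
    fetch = text_from_pdf if is_pdf(url) else text_from_url
    return extract(fetch(url))
```

Since the retrievers are passed in, a test that needs the entire text of a PDF can call the PDF retriever directly and skip the extraction step.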
Improve standardization.py style
Currently, the standardization module fails to receive a 10/10 score from pylint. This can easily be remedied by a few style changes.
Change Record internal data implementation
If a user changes a record's fields or values, the underlying data dictionary will remain unchanged because the dictionary is constructed once in __init__. Therefore, the internal structure of Record should change.
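One way to avoid the stale dictionary is to build it on demand, as in the sketch below. The field names `title` and `url` are placeholders, since the real fields of Record are not listed in this issue.

```python
class Record:
    """Sketch of a record whose data dictionary always reflects its fields."""

    def __init__(self, title, url):
        self.title = title
        self.url = url

    @property
    def data(self):
        # Built on each access, so later attribute changes are visible.
        return {"title": self.title, "url": self.url}
```

After `record.title = "New"`, reading `record.data["title"]` gives `"New"`, because the dictionary is no longer built once in the constructor.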
Clean redirects from data
Many of the data records contain redirects to irrelevant web pages. To preserve the accuracy of the database, these records should be removed.
Write time / efficiency accuracy test for a model / approach
Add debugging information to content accuracy testing
Currently, validate_webpage_content displays no debugging information. The module should take an optional flag to print the desired information.
Add more data to citation_sample.csv
The current citation_sample.csv file contains ten records. More records are needed to conduct meaningful tests. If someone could help verify the information in the file, that would be extremely helpful.