The autociter from michael153

Clean news.rules

Move content extraction to separate module

The code for extracting the content of a webpage is long and inelegant. Furthermore, we should be able to implement several algorithms and try them interchangeably. Therefore, content extraction should be moved to a separate module.
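
One possible shape for such a module is sketched below. Every name in it (ContentExtractor, TagStrippingExtractor, LongestParagraphExtractor, extract_content) is hypothetical and not taken from the autociter code base, and the two algorithms are deliberately naive placeholders. The sketch only illustrates how a shared interface could let algorithms be swapped.

```python
import re
from abc import ABC, abstractmethod


class ContentExtractor(ABC):
    """Hypothetical common interface for content extraction algorithms."""

    @abstractmethod
    def extract(self, source: str) -> str:
        """Return the article content found in the raw page source."""


class TagStrippingExtractor(ContentExtractor):
    """Placeholder algorithm: drop anything that looks like an HTML tag."""

    def extract(self, source: str) -> str:
        text = re.sub(r"<[^>]+>", " ", source)
        return " ".join(text.split())


class LongestParagraphExtractor(ContentExtractor):
    """Placeholder algorithm: keep only the longest <p> element."""

    def extract(self, source: str) -> str:
        paragraphs = re.findall(
            r"<p[^>]*>(.*?)</p>", source, flags=re.DOTALL | re.IGNORECASE
        )
        if not paragraphs:
            return ""
        longest = max(paragraphs, key=len)
        # Remove any inline tags left inside the chosen paragraph.
        return " ".join(re.sub(r"<[^>]+>", " ", longest).split())


def extract_content(source: str, extractor: ContentExtractor) -> str:
    """Run whichever algorithm the caller passes in."""
    return extractor.extract(source)
```

Because callers depend only on extract_content and the base class, a new algorithm can be tried by passing a different extractor object.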

Implement automated validation testing

The accuracy of content extraction methods should be automatically tested against a random sample of data.
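
A minimal sketch of such a check follows. It assumes labelled data is available as (source, expected content) pairs and that some scoring function exists. The name sample_accuracy and all of its parameters are hypothetical.

```python
import random


def sample_accuracy(extract, labelled_pages, sample_size, score, seed=None):
    """Average score of `extract` over a random sample of labelled pages.

    extract:        callable mapping page source to extracted content
    labelled_pages: list of (source, expected_content) pairs (assumed format)
    score:          callable (extracted, expected) -> float in [0, 1]
    seed:           optional seed so that a failing sample can be reproduced
    """
    rng = random.Random(seed)
    sample = rng.sample(labelled_pages, min(sample_size, len(labelled_pages)))
    if not sample:
        return 0.0
    scores = [score(extract(source), expected) for source, expected in sample]
    return sum(scores) / len(scores)
```

A test could then assert that the returned average stays above a chosen threshold for each extraction method.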

Implement effective method of finding specific fields in article content

That is, how can an arbitrarily formatted title be located in article content?
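
One candidate approach, sketched here as an assumption rather than the project's method, is to slide a window of the title's word count over the content and keep the window that is most similar to the title after normalizing case and whitespace. The helper names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Case-fold and collapse whitespace, for comparison only."""
    return " ".join(text.casefold().split())


def locate_title(title, content, threshold=0.8):
    """Return (start_word, end_word, ratio) for the best window, or None."""
    words = content.split()
    target = normalize(title)
    width = len(title.split())
    if width == 0 or len(words) < width:
        return None
    best = None
    for start in range(len(words) - width + 1):
        window = normalize(" ".join(words[start:start + width]))
        ratio = SequenceMatcher(None, target, window).ratio()
        # Keep the first window with the highest similarity ratio.
        if best is None or ratio > best[2]:
            best = (start, start + width, ratio)
    return best if best[2] >= threshold else None
```

The returned indices refer to words in the original content, so the title can be recovered with its original formatting. This fixed-width window will miss titles whose word count differs in the page, which a fuller solution would need to handle.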

Improve accuracy of content extraction

Currently, the content extraction system is extremely inaccurate. Alternative methods should be investigated and the most effective one selected.

Implement unit tests for timeout decorator

Rigorously test the newly implemented timeout decorator.

Pipeline unit tests should be silent

Running unit tests should not produce any output. Perhaps pipeline could have a global variable that determines whether debugging information is printed?
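
The global-variable idea could look roughly like the sketch below. The names _DEBUG, set_debug, and debug are hypothetical and are not existing pipeline functions.

```python
import sys

# Hypothetical module-level switch; off by default so unit tests stay silent.
_DEBUG = False


def set_debug(enabled):
    """Turn debugging output on or off for the whole module."""
    global _DEBUG
    _DEBUG = bool(enabled)


def debug(message):
    """Write a debugging message to stderr only when the switch is on."""
    if _DEBUG:
        print(message, file=sys.stderr)
```

Pipeline code would call debug(...) instead of print(...), and only an interactive run would call set_debug(True). The standard logging module is an alternative that offers the same control with per-level filtering.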

Implement timeout decorator that works on non-UNIX machines
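
A portable variant could run the wrapped function in a worker thread and stop waiting after the deadline, since that does not rely on UNIX signals. The sketch below is one such approach under that assumption and is not the decorator currently in the repository. Its main limitation is that Python cannot forcibly stop a thread, so a timed-out call keeps running in the background.

```python
import functools
import threading


def timeout(seconds):
    """Raise TimeoutError if the decorated call takes longer than `seconds`."""

    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            outcome = {}

            def target():
                # Record either the return value or the exception raised.
                try:
                    outcome["value"] = function(*args, **kwargs)
                except BaseException as error:
                    outcome["error"] = error

            # A daemon thread does not keep the interpreter alive at exit.
            worker = threading.Thread(target=target, daemon=True)
            worker.start()
            worker.join(seconds)
            if worker.is_alive():
                raise TimeoutError(
                    f"{function.__name__} exceeded {seconds} seconds"
                )
            if "error" in outcome:
                raise outcome["error"]
            return outcome["value"]

        return wrapper

    return decorator
```

Where the abandoned work must really be cancelled, running the call in a separate process is the usual alternative, at the cost of requiring picklable arguments and results.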

Clean RND modules

Currently, the creation, evaluation, and execution modules lack documentation. Furthermore, the code could be made more concise and readable.

Figure out why data_preservation_accuracy takes so long

Something is getting stuck:

Webpages.source? Webpages.content?
Timeout decorator not working?? (Low-probability case)
FUCK

Clean citations.csv

Remove non-English articles (maybe we can add them back in the long run)
Remove image and GIF files
Clean redirects
??

Add function for constructing rules

Right now, rules can be saved to a file, but there is no mechanism for reading them back. A function for this should be implemented in the solution module.

Improve _test and test_pipeline styles

The _test and test_pipeline modules currently contain minor style errors. The errors can easily be fixed with some minor edits.

Test extractor methods

Currently, none of the extractor objects have tests. To prevent issues later on, unit tests should be written promptly.

Move similarity function to separate module

To prevent code duplication, the often-used similarity function should be moved to a utility module.

Implement logical and effective standardization of text

Applying .lower() to the entire content does not make sense because it does not preserve capitalization. It also means that other fields, e.g. titles, can no longer be found in the text, since titles are standardized by applying .title().
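
One way to avoid the conflict, sketched here as an assumption, is to leave the content untouched and ignore case only at comparison time. The function name find_field is hypothetical.

```python
import re


def find_field(field, content):
    """Return the (start, end) span of `field` in `content`, ignoring case.

    The content itself is never modified, so its capitalization survives.
    """
    match = re.search(re.escape(field), content, flags=re.IGNORECASE)
    return match.span() if match else None
```

With this approach a title standardized by .title() can still be found in text that spells it in upper case, and slicing the content with the returned span yields the title exactly as the page wrote it.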

Test accuracy of title recognition

The current content extraction algorithm appears to recognize the start of an article with fair accuracy, but further tests are necessary to confirm this.

Make get_text_from_url more modular

Currently, pipeline functionality is implemented in relatively large functions. As a result, code reuse is limited.

Here's an example: for testing purposes, it may be necessary to retrieve text from a PDF. However, the test may require the entire text rather than selected portions. As a result, get_text_from_url is unusable.

Here's my proposal: create new functions get_content_from_url and get_text_from_pdf. Then implement get_content_from_url as a dispatch function that calls both get_text_from_url and get_text_from_pdf and manipulates the resulting text to extract content.
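
A rough sketch of one reading of this proposal follows, in which the dispatcher picks one of the two text functions per URL. It assumes PDFs can be recognized by their URL path, which will not hold for every link. The helper is_pdf and the extract_content parameter are hypothetical, and the two text functions are passed in as arguments so the sketch runs without network access.

```python
from urllib.parse import urlparse


def is_pdf(url):
    """Guess from the URL path whether the resource is a PDF (assumption)."""
    return urlparse(url).path.lower().endswith(".pdf")


def get_content_from_url(url, get_text_from_url, get_text_from_pdf,
                         extract_content):
    """Dispatch to the right text retriever, then extract the content.

    get_text_from_url, get_text_from_pdf: callables returning the full text
    extract_content: callable that reduces the full text to article content
    """
    if is_pdf(url):
        text = get_text_from_pdf(url)
    else:
        text = get_text_from_url(url)
    return extract_content(text)
```

Tests that need the entire text can call get_text_from_pdf or get_text_from_url directly, while normal callers go through get_content_from_url.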
