autociter's People

Contributors

bveeramani, michael153

autociter's Issues

Add more data to citation_sample.csv

The current citation_sample.csv file contains only ten records. More records are needed to conduct meaningful tests. Help verifying the information already in the file would also be extremely useful.

Clean redirects from data

Many of the data records contain redirects to irrelevant web pages. To preserve the accuracy of the database, these records should be removed.

Make get_text_from_url more modular

Currently, pipeline functionality is implemented in relatively large functions. As a result, code reuse is limited.

Here's an example: for testing purposes, it may be necessary to retrieve text from a PDF. However, the test may require the entire text rather than selected portions, which makes get_text_from_url unusable.

Here's my proposal: create new functions get_content_from_url and get_text_from_pdf. Then implement get_content_from_url as a dispatch function that calls both get_text_from_url and get_text_from_pdf and manipulates the resulting text to extract content.
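
The proposal could be sketched roughly as follows. This is only an illustration: the bodies of get_text_from_url and get_text_from_pdf are placeholders rather than the project's code, select_content is a hypothetical helper, and choosing the extractor by a ".pdf" suffix on the URL path is an assumption about how the dispatch would work.

```python
from urllib.parse import urlparse


def get_text_from_url(url):
    """Placeholder: the real function would download and return the page text."""
    return "web text from " + url


def get_text_from_pdf(url):
    """Placeholder: the real function would return the entire text of the PDF."""
    return "pdf text from " + url


def select_content(text):
    """Hypothetical post-processing step; here it only trims whitespace."""
    return text.strip()


def get_content_from_url(url):
    """Pick an extractor from the URL path, then extract content from its text."""
    path = urlparse(url).path.lower()
    if path.endswith(".pdf"):
        text = get_text_from_pdf(url)
    else:
        text = get_text_from_url(url)
    return select_content(text)
```

With this split, a test that needs the entire text of a PDF can call get_text_from_pdf directly and skip the content selection step.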

Clean RND modules

Currently, the creation, evaluation, and execution modules lack documentation. Furthermore, the code could be made more concise and readable.

Test extractor methods

Currently, none of the extractor objects have tests. To prevent issues later on, unit tests should be written promptly.

Change Record internal data implementation

If a user changes a record's fields or values, the underlying data dictionary will remain unchanged because the dictionary is constructed once, in __init__. Therefore, the internal structure of Record should change.
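
One possible fix is to derive the dictionary on demand instead of storing it. The sketch below assumes Record keeps its fields and values as parallel lists and exposes the dictionary under a property named data; both are guesses for illustration, not the actual class layout.

```python
class Record:
    """Sketch of a record whose dictionary is derived on demand."""

    def __init__(self, fields, values):
        self.fields = list(fields)
        self.values = list(values)

    @property
    def data(self):
        # Rebuilt on every access, so edits to fields or values are reflected.
        return dict(zip(self.fields, self.values))


record = Record(["title", "author"], ["Old Title", "A. Writer"])
record.values[0] = "New Title"
print(record.data["title"])  # New Title
```

The trade-off is that the dictionary is rebuilt on each access, which should be negligible for records with a handful of fields.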

Move content extraction to separate module

The code for extracting the content of a webpage is long and inelegant. Furthermore, we should be able to implement and interchangeably try several algorithms. Therefore, content extraction should be moved to a separate module.
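
A separate module would make it straightforward to register several algorithms and switch between them by name. The following is a minimal sketch of that idea; the two algorithms are toy stand-ins and every name in it is hypothetical.

```python
def longest_paragraph(text):
    """Toy algorithm: return the longest blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return max(paragraphs, key=len, default="")


def everything(text):
    """Baseline algorithm: return the whole text, stripped."""
    return text.strip()


# Registry of interchangeable algorithms, keyed by a short name.
ALGORITHMS = {
    "longest_paragraph": longest_paragraph,
    "everything": everything,
}


def extract_content(text, algorithm="longest_paragraph"):
    """Look up the named algorithm and apply it to the text."""
    return ALGORITHMS[algorithm](text)
```

Trying a new algorithm then only requires adding a function and one registry entry; callers in the pipeline stay unchanged.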

Clean citations.csv

  1. Remove non-English articles (we may add them back in the long run)
  2. Remove image and GIF files
  3. Clean redirects
  4. ??
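
Step 2 could be approached by filtering on the file extension in each URL. The sketch below assumes citations.csv has a column named "url", which may not match the real header, and uses a small in-memory sample in place of the file.

```python
import csv
import io
from urllib.parse import urlparse

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg")


def is_image_url(url):
    """Return True if the URL path ends in a known image extension."""
    return urlparse(url).path.lower().endswith(IMAGE_SUFFIXES)


def drop_image_rows(rows, url_field="url"):
    """Keep only rows whose URL does not point at an image file."""
    return [row for row in rows if not is_image_url(row[url_field])]


# Tiny in-memory stand-in for citations.csv; the column names are assumed.
sample = io.StringIO(
    "title,url\n"
    "Some article,http://example.com/story.html\n"
    "A chart,http://example.com/chart.GIF\n"
)
kept = drop_image_rows(csv.DictReader(sample))
```

Checking the URL path rather than the whole URL avoids false matches on query strings such as "?thumb=logo.png".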

Add function for constructing rules

Right now rules can be saved to a file, but there is no mechanism for reading those rules. A function should be implemented in the solution module.
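
The on-disk format of the saved rules is not stated here, so the following round trip assumes JSON purely for illustration. The function names and the shape of a rule are hypothetical as well.

```python
import json
import os
import tempfile


def save_rules(rules, path):
    """Write the rules to path as JSON (assumed format)."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(rules, file)


def load_rules(path):
    """Read rules previously written by save_rules."""
    with open(path, encoding="utf-8") as file:
        return json.load(file)


# A made-up rule, only to exercise the round trip.
rules = [{"field": "title", "pattern": "<h1>"}]
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "rules.json")
    save_rules(rules, path)
    restored = load_rules(path)
```

Whatever format the existing save code uses, the reader should be its exact inverse so that saving and then loading returns equal rules.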

Test accuracy of title recognition

The current content extraction algorithm appears to recognize article starts with fair accuracy, but further tests are necessary to confirm this theory.

Implement effective standardization of text that is logical

  • Applying .lower() to the entire content does not make sense because it discards capitalization. It also means that other fields, e.g. titles, can no longer be found in the text, since titles are standardized by applying .title().
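
The mismatch described above, and one possible alternative, can be illustrated as follows. The idea is to leave capitalization in the stored text alone and ignore case only at comparison time; the function names are hypothetical and this is not the module's current behavior.

```python
def standardize(text):
    """Collapse runs of whitespace without changing capitalization."""
    return " ".join(text.split())


def contains_ignoring_case(content, field):
    """Return True if field occurs in content, ignoring case and spacing."""
    return standardize(field).casefold() in standardize(content).casefold()


content = "By Jane Doe\nThe  rise of the Machines began quietly."
title = "The Rise Of The Machines"  # as produced by str.title()

print(title in content.lower())                # False
print(contains_ignoring_case(content, title))  # True
```

Because the content keeps its original capitalization, later steps that rely on it (for example, spotting proper nouns) are unaffected.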

Improve standardization.py style

Currently, the standardization module does not receive a 10/10 score from pylint. This can easily be remedied with a few style changes.

Pipeline unit tests should be silent

Running unit tests should not produce any output. Perhaps pipeline could have a global variable that controls whether debugging information is printed?
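
As an alternative to a hand-rolled global flag, the standard logging module gives the same switch. The sketch below swaps print calls for logger.debug; the logger name "pipeline" and the function body are placeholders.

```python
import logging

logger = logging.getLogger("pipeline")
logger.addHandler(logging.NullHandler())


def get_text_from_url(url):
    """Placeholder body; only the logging call matters for this sketch."""
    logger.debug("Fetching %s", url)
    return ""
```

With no logging configuration, debug messages are dropped, so unit tests stay silent. A developer who wants the output can call logging.basicConfig(level=logging.DEBUG) before running the pipeline.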

Improve accuracy of content extraction

Currently the content extraction system is extremely inaccurate. Alternate methods should be investigated and the most effective one should be selected.
