akshayakp97 / tl-dr

An end-to-end event extraction and summarization system.

TL;DR

The project, titled "TL;DR", is a large-scale automated system for extracting violent incidents related to protests, riots, and violence against civilians. It outlines a comprehensive architecture that can identify, categorize, and summarize the target event types and perform entity slot filling against them. It also attempts to relate the recorded events to politics and elections taking place in India, Indonesia, and Thailand, which allows a better understanding of the factors driving these events.

Getting Started

Step 1: Crawling news articles

News Please -

  1. Install News Please.

  2. Use config.cfg_mass and sitelist.hjson_mass to collect the data used to build the classifier model.

  3. Use config_lib.cfg_live and sitelist_COUNTRYNAME.hjson (replace COUNTRYNAME with the corresponding country) to crawl and clean live news articles.
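
As a small illustration of the COUNTRYNAME substitution in item 3, the sketch below builds the pair of file names for one country. The helper name `live_crawl_files` is hypothetical, and the assumption that the country is written in title case inside the file name (for example `sitelist_India.hjson`) is not confirmed by this README; check the actual file names in the repository.

```python
# Countries covered by the project, as listed in the introduction.
COUNTRIES = ("India", "Indonesia", "Thailand")


def live_crawl_files(country):
    """Return the config and sitelist file names for a live crawl.

    Hypothetical helper: it only fills in the COUNTRYNAME placeholder.
    The title-case spelling in the file name is an assumption.
    """
    if country not in COUNTRIES:
        raise ValueError(f"unsupported country: {country!r}")
    return {
        "config": "config_lib.cfg_live",
        "sitelist": f"sitelist_{country}.hjson",
    }


if __name__ == "__main__":
    for name in COUNTRIES:
        print(live_crawl_files(name))
```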

Step 2: Classification of the articles:

Run the following notebooks in the order listed:

  1. Classifier_Builder.ipynb - Builds a classifier for the events described in the news articles.

  2. Doc2Vec_Classifier.ipynb - Builds the classifier using Doc2Vec.

  3. TextRank.ipynb - Runs TextRank and coreference resolution on the text to extract keywords.
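
To give a feel for the keyword extraction in item 3, here is a minimal TextRank sketch in plain Python. It is not the code from TextRank.ipynb: it skips coreference resolution entirely, and the stopword list, window size, and damping factor are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not the notebook's list).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "and",
             "was", "were", "is", "by", "for", "from", "with", "also"}


def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_k=5):
    """Rank words by TextRank over an undirected co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]

    # Link every word to the words that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # PageRank-style iteration: each word shares its score equally
    # among the words it is linked to.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


if __name__ == "__main__":
    sample = "Congress workers created ruckus at the party office in protest"
    print(textrank_keywords(sample))
```

Words that co-occur with many different neighbours accumulate score, which is why frequently mentioned actors tend to surface as keywords.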

Step 3: NER tagging and Summarization System-

sample_input.csv = Input file for the system.
./output = Location where the output files are stored.

  1. topic_modelling.ipynb - Extracts topics from the news articles using topic modelling.
     Input - sample_input.csv
     Output - ./output/topic_modelling.csv

  2. extractor.ipynb - Extracts named entities from the news articles.
     Input - sample_input.csv
     Output - ./output/final_data_with_lat_and_long.csv

  3. evaluation.ipynb - Evaluates the system.
     Input - sample_input.csv, ./output/extracted_data.csv
     Output - ./output/ACLED_rouge_scores.txt (the scores for each category are printed to the console)
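
The evaluation output is a set of ROUGE scores. As a rough illustration of what such a score measures, the sketch below computes ROUGE-1 (unigram overlap) between a generated summary and a reference text. Which ROUGE variants evaluation.ipynb reports, and how it tokenizes, is not stated here, so the whitespace tokenization and the function name `rouge_1` are assumptions.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())

    # Clipped overlap: a word counts at most as often as it
    # appears in both texts.
    overlap = sum((cand & ref).values())

    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(rouge_1("workers protest in patna", "congress workers protest"))
```

Recall shows how much of the reference the summary covers, while precision shows how much of the summary is supported by the reference.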

Step 4: User Interface Setup -

  1. Install Elasticsearch.

  2. Add the index CSE_635 to Elasticsearch as described in ElasticSearch_Index.txt.

  3. From the Kibana admin console, import the dashboard and visualizations from KIbana_VIsualizations.json.

  4. Run Elasticsearch_Indexer.ipynb, a Jupyter notebook that indexes articles into Elasticsearch.
     Input - ./output/final_data_with_lat_and_long.csv
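
The indexing step in item 4 can be pictured as turning each CSV row into one document. The sketch below is an assumption about the general shape, not the contents of Elasticsearch_Indexer.ipynb: it converts CSV text into action dictionaries of the form accepted by the `helpers.bulk` function of the official Elasticsearch Python client, without contacting a server. The column names in the sample are hypothetical.

```python
import csv
import io


def csv_to_actions(csv_text, index_name):
    """Yield one bulk-index action per CSV row.

    Each action names the target index and carries the row as the
    document body. Column names are taken from the CSV header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield {"_index": index_name, "_source": dict(row)}


if __name__ == "__main__":
    # Hypothetical columns; the real file is
    # ./output/final_data_with_lat_and_long.csv
    sample = (
        "date,location,summary\n"
        "4/4/2019,Patna,Congress workers protest\n"
    )
    # Elasticsearch index names must be lowercase, so the README's
    # CSE_635 is written in lowercase here (an assumption about the
    # name actually created).
    for action in csv_to_actions(sample, "cse_635"):
        print(action)
    # With a running cluster, these actions could be passed to
    # elasticsearch.helpers.bulk(client, actions).
```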

Example:

Summary of a news article:

Patna Bihar India, Apr 4 ANI Congress workers created ruckus at the party office here on Thursday in protest against the denial of ticket to former party MP Nikhil Kumar from Aurangabad parliamentary constituency. The workers also shouted the slogan 'Nikhil Kumar Zinadabad.' Kumar was also present in the office when the ruckus took place. Kumar had successfully contested from the seat in 2004 when the Congress fought in alliance with the RJD and the LJP. Kumar, a former Delhi Police Commissioner, unsuccessfully contested from Aurangabad in 2014 against BJP s Sushil Kumar Singh.

Date: 4/4/2019
ranked_list_PERSON: [{'Kumar': 1.0}, {'Patna Bihar': 0.625}, {'Nikhil': 1.0}]
ranked_list_ORG: [{'Congress': 1.0}, {'Kumar': 1.0}, {'ANI': 1.0}]
Location: ['India', 'Aurangabad', 'Patna', 'Bihar', 'India', 'Lok', 'Sabha']

From the above example, we can see that for the ‘PERSON’ entity, terms like ‘Nikhil’ and ‘Kumar’ have been given a higher weight than ‘Patna Bihar’. The date tagged by our system is accurate because the ‘April 4th’ and the ‘Thursday’ mentioned in the text correspond to the same day. Similarly, for the ‘ORG’ entity, the term ‘Congress’ has a higher weight
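
The date check in this example is easy to reproduce. The short sketch below parses the tagged date and looks up its weekday; the function name `weekday_name` is hypothetical, and the month/day order of the date string is an assumption (it makes no difference for 4/4/2019).

```python
from datetime import datetime

# Fixed English names so the result does not depend on the locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_name(date_text):
    """Return the weekday of a date written as month/day/year."""
    parsed = datetime.strptime(date_text, "%m/%d/%Y")
    return WEEKDAYS[parsed.weekday()]


if __name__ == "__main__":
    print(weekday_name("4/4/2019"))  # Thursday
```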
