
nlp-eventidentification-infoextraction's Introduction

Main Goal

Given a data set of news articles, identify the breaking news events and extract information for each event.

Required Modules

This project is built with PyTorch. The required modules are listed below.

  • torch
  • pytorch-transformers
    • For the implementation of BERT.
  • spacy
  • nltk
  • gensim

How to run

The code is organized in Jupyter notebooks.

Before running,

  • make sure the data/ and output/ folders exist in the current path.

  • Put the original data set into the data/ folder and rename it to 'data_news.xlsx'.
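
The two checks above can be scripted. The following is a minimal sketch, not part of the original project: the helper name prepare_workspace is hypothetical, and it only assumes the folder and file names stated in this README.

```python
from pathlib import Path


def prepare_workspace(root="."):
    """Create data/ and output/ under root and report whether the data set is in place.

    Hypothetical helper: the folder names and 'data_news.xlsx' come from the
    README; everything else is an illustrative assumption.
    """
    root = Path(root)
    for name in ("data", "output"):
        # exist_ok=True makes the call safe to repeat.
        (root / name).mkdir(parents=True, exist_ok=True)
    # The notebooks expect the renamed original data set at this location.
    return (root / "data" / "data_news.xlsx").exists()


if __name__ == "__main__":
    if not prepare_workspace():
        print("Put the original data set at data/data_news.xlsx before running the notebooks.")
```

Running the script from the project root creates any missing folder and prints a reminder when the data set has not been copied yet.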

Install the required modules and download the pre-trained model.

pip install -r requirements.txt

# pip install spacy
# python -m spacy download en_core_web_sm
# python -m spacy link en_core_web_sm en_core_web_sm

(You may also encounter the spaCy installation code in the notebook.)

I did the project on my own laptop and tested it on the server. To save time, the shortened data set data_news_shorten.xlsx can be used. It contains the first 1000 items from the original data set.
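
If the shortened file needs to be regenerated, something like the following sketch could be used. It is an assumption about how the truncation might be done, not the author's script: the function names are hypothetical, and it assumes pandas plus an Excel engine such as openpyxl are installed (neither is listed in the modules above).

```python
from pathlib import Path


def shortened_path(original):
    """Map data/data_news.xlsx to data/data_news_shorten.xlsx (hypothetical helper)."""
    original = Path(original)
    return original.with_name(original.stem + "_shorten" + original.suffix)


def make_shortened(original="data/data_news.xlsx", n_rows=1000):
    """Write the first n_rows rows of the original data set to the shortened file."""
    # Imported here so that shortened_path works even without pandas installed.
    import pandas as pd

    # nrows limits how many data rows are read; the header row is not counted.
    df = pd.read_excel(original, nrows=n_rows)
    target = shortened_path(original)
    df.to_excel(target, index=False)
    return target


if __name__ == "__main__":
    print(make_shortened())
```

The default of 1000 rows matches the size of the shortened data set described above.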

Files

  • data/: the folder containing the data set
    • data_news_shorten.xlsx: the truncated data set (the first 1000 items of the original data)
    • data_news.xlsx: the original data (renamed)
    • news_corpus_for_doc2vec.csv: the CSV file for training doc2vec
  • challenge_dataset: custom module for the data set. I originally wanted to build some scripts for this challenge, but the module is not used in the final version of the project.
  • 0. EDA of the News Dataset.ipynb: exploratory data analysis (EDA)
  • 1. Identify the Breaking News Event.ipynb: the notebook for the first task
  • 2 Information Extraction.ipynb: the notebook for the second task
  • 3. Further Idea.ipynb: the notebook for the third task
  • Using BERT NextSentence Prediction for Task1.ipynb: experiments exploring BERT; a supplement to 3. Further Idea.ipynb.

Finally, thanks to Echobox for giving me such a good opportunity to do the challenge!

- Shuo
