Given a data set containing information about news articles, identify the breaking news events and extract information for each event.
This project is built with PyTorch. The required modules are listed below.
- torch
- pytorch-transformers
  - For the implementation of BERT.
- spacy
- nltk
- gensim
The code is organized in Jupyter notebooks.
Before running:
- Make sure the two folders data/ and output/ exist in the current path.
- Put the original data set into the data/ folder and rename it to 'data_news.xlsx'.
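The two preparation steps above can be sketched as a small helper script. This is only a minimal sketch and not part of the project: the function name `prepare_workspace` is hypothetical, and it assumes the notebooks are run from the directory that holds data/ and output/.

```python
from pathlib import Path

# Folder and file names taken from the preparation steps above.
REQUIRED_DIRS = ("data", "output")
DATA_FILE = Path("data") / "data_news.xlsx"


def prepare_workspace(root="."):
    """Create data/ and output/ under root (hypothetical helper).

    Returns True if the renamed data set is already in place, else False.
    """
    root = Path(root)
    for name in REQUIRED_DIRS:
        # exist_ok=True makes the call safe to repeat.
        (root / name).mkdir(parents=True, exist_ok=True)
    return (root / DATA_FILE).is_file()


if __name__ == "__main__":
    if prepare_workspace():
        print("Found data/data_news.xlsx, ready to run the notebooks.")
    else:
        print("Folders are ready, but data/data_news.xlsx is missing.")
```

The script only creates the folders and checks for the file; copying and renaming the original data set is still a manual step.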
Install the required modules and download the pre-trained model:
pip install -r requirements.txt
# pip install spacy
# python -m spacy download en_core_web_sm
# python -m spacy link en_core_web_sm en_core_web_sm
(You may also encounter the spacy installation commands in the notebooks.)
I did the project on my own laptop and tested it on the server. To save time, the shortened data set data_news_shorten.xlsx can be used. It contains the first 1000 items from the original data set.
- data/: the folder containing the data set
  - data_news_shorten.xlsx: the truncated data set (the first 1000 items from the original data)
  - data_news.xlsx: the original data (renamed)
  - news_corpus_for_doc2vec.csv: the CSV file for training doc2vec
- challenge_dataset: a custom module for the data set. It is not used in the final version of my project; in the beginning I wanted to build some scripts for this challenge.
- 0. EDA of the News Dataset.ipynb: EDA
- 1. Identify the Breaking News Event.ipynb: the notebook for the first task
- 2 Information Extraction.ipynb: the notebook for the second task
- 3. Further Idea.ipynb: the notebook for the third task
- Using BERT NextSentence Prediction for Task1.ipynb: my exploration of BERT, a supplement to 3. Further Idea.ipynb
Finally, thanks to Echobox for giving me such a good opportunity to do the challenge!
- Shuo