The nlp_final_project from ah4597

Two different models were used:

1) LSTM model, trained on a dataset of roughly 10k article titles
2) GPT-2 fine-tuned model, using the same dataset of roughly 10k article titles

All articles used were under the "news" category.

The test corpus contained 1611 titles for the LSTM model, but only 20 for the GPT-2 model, due to hardware and time constraints.

The keywords of the test corpus were extracted using the YAKE! library, because it is lightweight, simple to use, and its benchmark tests showed it can outperform other state-of-the-art methods.

If time permits, another method of keyword extraction may be implemented, or a better set of stop words used (beyond the generic list from nltk).

The top 5 keywords/phrases were extracted.

The extraction was run three times: first limiting the size of the word/phrase to 1, so only single words were considered; then limiting it to 3, so up to three words could be considered 'key'; and finally limiting it to 5, so up to five words could be considered 'key'. These three settings are labeled yake1, yake3, and yake5 in the results.
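The three extraction passes described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names (`extract_keywords`, `extract_all`) and constants are hypothetical, and it assumes the `yake` package is installed. How the nltk stop word list was supplied is not shown in this write-up, so the sketch leaves YAKE!'s defaults in place.

```python
# Sketch of the three YAKE! passes (max phrase length 1, 3, 5), top 5 each time.
# Assumes `pip install yake`; all helper and constant names are hypothetical.

NGRAM_LIMITS = (1, 3, 5)  # maximum words per key phrase, as described above
TOP_K = 5                 # number of keywords/phrases kept per article


def extract_keywords(text, max_ngram, top_k=TOP_K, language="en"):
    """Return up to `top_k` key phrases of at most `max_ngram` words each."""
    import yake  # third-party dependency, imported only when extraction runs

    extractor = yake.KeywordExtractor(lan=language, n=max_ngram, top=top_k)
    # extract_keywords returns (phrase, score) pairs; a lower score is better.
    return [phrase for phrase, _score in extractor.extract_keywords(text)]


def extract_all(text):
    """Run every pass and label it yake1 / yake3 / yake5, as in the results."""
    return {f"yake{n}": extract_keywords(text, n) for n in NGRAM_LIMITS}
```

Calling `extract_all(article_text)` on one article body would return a dictionary with one keyword list per setting, which can then be passed to either model.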

Afterwards, these keywords/phrases were fed into both models. The LSTM model received all 5 keywords for each of the 1611 titles (generating a total of 8055 titles). The GPT-2 model received only 3 keywords from each of 20 titles (generating a total of 60 titles), due to time and hardware constraints.

The generated titles were then compared to the original article's title using cosine similarity (from sklearn). The results are reported in the sections below.
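To make the metric concrete, here is a dependency-free sketch of cosine similarity between two titles. The project itself used sklearn's implementation; this stand-in assumes a plain lowercase bag-of-words count vector, which may differ from the vectorization actually used, so its scores will not necessarily match the reported ones. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into word tokens."""
    return re.findall(r"\w+", title.lower())


def cosine_similarity(title_a, title_b):
    """Cosine of the angle between the word-count vectors of two titles."""
    counts_a = Counter(tokenize(title_a))
    counts_b = Counter(tokenize(title_b))
    # Dot product over the words the two titles share.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty title shares nothing with any other title
    return dot / (norm_a * norm_b)


generated = "Sir John Major Russia Fast Facts In The Us And A Man Of"
actual = "John Major Fast Facts"
print(f"similarity: {cosine_similarity(generated, actual):.3f}")
```

A score of 1 means the two titles use the same words in the same proportions, and 0 means they share no words at all.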

LSTM Model:

yake1 : 0.06385564597491072

yake1 best cosine: 0.49630705945783515, index: 739, result: 0

Generated: Skin Whitening What Is What We Need To Know About The

Actual: Skin whitening: What is it, what are the risks and who profits?

yake3 : 0.06427950108759217

yake3 best cosine: 0.5448954693228305, index: 1942, result: 2

Generated: Sir John Major Russia Fast Facts In The Us And A Man Of

Actual: John Major Fast Facts

yake5 : 0.06427181437205604

yake5 best cosine: 0.49958147679900644, index: 813, result: 4

Generated: Leymah Gbowee The World And The Most Of 500 People Bike The

Actual: Leymah Gbowee: The people have awoken, we need to make the most of this moment

Best average result: 6.4% similar (yake3). Best single result: 54% similar (yake3).

GPT-2 Model:

yake1 : 0.045118583378910664

yake1 best cosine: 0.18817272215521147, index: 165, result: 1

Generated: 1) Cnn'It Is A Warzone,' Says Migrant 2) Russian Agreement Puts Ukraine In Eu 'Black Hole'

Actual: 'Help us, we're stranded': International students say they're trapped in northeast Ukraine

yake3 : 0.03172485258032359

yake3 best cosine: 0.31745314091816873, index: 73, result: 1

Generated: Russia Invades Ukraine To 'Hijack' Key Institutions

Actual: Kenya's UN ambassador slams Russia and compares Ukraine crisis to Africa's colonial past

yake5 : 0.06005158784842297

yake5 best cosine: 0.2630507988611436, index: 73, result: 2

Generated: Amazon Prime Video Fast Facts A Fast Facts For The Week

Actual: Amazon's going to Nollywood -- and its deals with studios could shake up one of the world's most prolific filmmaking hubs

Best average result: 6.01% similar (yake5). Best single result: 31.7% similar (yake3).

GPT-2 did best with the largest limit on the keyword/phrase size: the yake-5-keyword average is about 33% higher than the single-keyword average, and nearly double that of yake-3-keyword. The LSTM had nearly the same results across the board, although, surprisingly, yake-3-keyword performed the best.
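The relative changes between settings can be recomputed directly from the GPT-2 averages reported above. A small sketch (the variable and function names are hypothetical):

```python
# Average cosine similarities for GPT-2, copied from the results above.
gpt2_avg = {
    "yake1": 0.045118583378910664,
    "yake3": 0.03172485258032359,
    "yake5": 0.06005158784842297,
}


def relative_gain(new, old):
    """Percentage increase of `new` over `old`."""
    return (new - old) / old * 100.0


gain_vs_yake1 = relative_gain(gpt2_avg["yake5"], gpt2_avg["yake1"])
gain_vs_yake3 = relative_gain(gpt2_avg["yake5"], gpt2_avg["yake3"])

print(f"yake5 vs yake1: +{gain_vs_yake1:.1f}%")
print(f"yake5 vs yake3: +{gain_vs_yake3:.1f}%")
```

With only 20 test titles for GPT-2, these percentages are best read as rough indications rather than stable estimates.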

I would expect yake-5-keyword to stay the best for GPT-2, and to outperform yake-3-keyword for the LSTM model, given some improvements to the keyword extraction and to our stop word list.

Possible improvements:

Improve stopwords list
Different category titles (news is generally pretty broad)
Larger testing corpus for GPT-2 -- The inputs that generated the best results for LSTM weren't tested on GPT-2 due to time constraints
