longtv02 / vietnamese-corpus-search-and-analysis-web-app

This project forked from tuan-lee-23/vietnamese-corpus-search-and-analysis-web-app

Vietnamese corpus search tools and statistical analysis

vietnamese-corpus-search-and-analysis-web-app's Introduction

This project is written entirely in Python (v3.7).

Features:

Corpus search tool:

The tool can search the corpus by:

Ambiguous: search for anything, such as a character, number, or morpheme
Noun (POS tagging)
Verb (POS tagging)
Adjective (POS tagging)
Name of Person (NER model)
Name of Location (NER model)
Name of Organization (NER model)
Show the top 10 words most similar to your input (gensim Word2Vec)
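
The tag-based search modes above can be pictured as a filter over (word, tag) pairs. The following is only a minimal sketch under stated assumptions, not the project's actual code: the function `search_by_tag`, the sample sentences, and the tag labels ("N", "V", "A", "Np") are hypothetical and hand-written here, whereas the real app gets its tags from Underthesea.

```python
from typing import List, Optional, Tuple

TaggedToken = Tuple[str, str]  # (word, tag)

# Hand-written sample data (assumption); a real run would take these
# pairs from a POS tagger or NER model instead.
TAGGED_CORPUS: List[List[TaggedToken]] = [
    [("Hà Nội", "Np"), ("là", "V"), ("thủ đô", "N"), ("đẹp", "A")],
    [("Học sinh", "N"), ("đọc", "V"), ("sách", "N"), ("mới", "A")],
]


def search_by_tag(
    corpus: List[List[TaggedToken]], query: str, tag: Optional[str] = None
) -> List[int]:
    """Return indices of sentences that contain a token matching `query`.

    With tag=None this behaves like the "Ambiguous" mode (plain substring
    match); with a tag it only accepts tokens carrying that tag.
    """
    query = query.lower()
    hits = []
    for index, sentence in enumerate(corpus):
        for word, word_tag in sentence:
            if query in word.lower() and (tag is None or word_tag == tag):
                hits.append(index)
                break  # one match per sentence is enough
    return hits


noun_hits = search_by_tag(TAGGED_CORPUS, "sách", tag="N")
any_hits = search_by_tag(TAGGED_CORPUS, "đ")
```

Restricting by tag is what separates a noun search from a verb search for the same string: here "sách" matches as a noun but would return nothing with `tag="V"`.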

Corpus dataset:

I scraped about 12k description lines from vnexpress.net.
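
The README does not say how the scraping was done. As a rough illustration only, description text could be pulled out of downloaded HTML with the standard library, as sketched below. The `DescriptionParser` class, the `description` CSS class, and the sample markup are all assumptions made for this example; no network request is made.

```python
from html.parser import HTMLParser


class DescriptionParser(HTMLParser):
    """Collect the text of every <p class="description"> element."""

    def __init__(self):
        super().__init__()
        self._in_description = False
        self._buffer = []
        self.descriptions = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "description" in classes:
            self._in_description = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_description:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_description:
            # Join the text pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.descriptions.append(text)
            self._in_description = False


# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<article><h3>Tiêu đề</h3><p class="description">Mô tả   ngắn của <a href="#">bài viết</a>.</p></article>
<article><p class="description">Dòng mô tả thứ hai.</p><p class="meta">bỏ qua</p></article>
"""

parser = DescriptionParser()
parser.feed(SAMPLE_HTML)
corpus_lines = parser.descriptions  # one corpus line per description
```

Each collected string would then be written as one line of the corpus file.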

Libraries used:

Dash + Dash bootstrap components
Plotly
Gensim
Underthesea (Underthesea now requires PyTorch 1.4.0)
nltk
numpy
pandas
statsmodels

How to run:

Open a terminal in the following directory: "Vietnamese-corpus-search-and-analysis-Web-app/"

Using Corpus search app

Run in the terminal:

python src/app.py

Wait about 1 minute for the server to start. When the localhost link appears in the terminal, Ctrl+click it or copy and paste it into your browser.

Using corpus statistical analysis app

Run in the terminal:

python src_statistics/app.py

Wait about 1 minute for the server to start. When the localhost link appears in the terminal, Ctrl+click it or copy and paste it into your browser.

Using another corpus

Rename your corpus file to "vn_express.txt" and place it in resources/, replacing the existing file.
Run "python src/create_NER_pickle.py", then enter the path to your corpus, "resources/vn_express.txt", to build the NER model and the Word2vec model. The output is two files, ner.pik and w2v.pik.
You only need to do this once for each new corpus.
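
The build step only has to run once because its results are cached on disk as pickle files. The sketch below shows that save-then-load pattern with Python's `pickle` module. It is not the content of create_NER_pickle.py: the helpers `save_pickle` and `load_pickle` and the placeholder dictionary are hypothetical, and the real ner.pik and w2v.pik hold whatever objects the project's script produces.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Serialize `obj` to `path` so later runs can skip recomputing it."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a previously saved object back into memory."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Placeholder for an expensive, corpus-dependent result (assumption).
fake_ner_index = {"PER": ["Nguyễn Văn A"], "LOC": ["Hà Nội"], "ORG": ["VnExpress"]}

# A temporary directory stands in for resources/ in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    pik_path = Path(tmp) / "ner.pik"
    save_pickle(fake_ner_index, pik_path)
    restored = load_pickle(pik_path)
```

Only load pickle files you created yourself, since unpickling data from an untrusted source can execute arbitrary code.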

Folder structure:

docs/: documentation folder
- NLP.pptx: slides
src/: source code of corpus search app
src_statistics/: source code of corpus statistical analysis app
resources/:
- ner.pik: pickle file of NER model
- w2v.pik: pickle file of Word2vec model
- vn_express.txt: main corpus data
- corpus_mini.txt: small 2k corpus for fast debugging
- stop_words.txt: Vietnamese stopwords
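
As an illustration of how a stopword file like stop_words.txt might be used, the sketch below loads one entry per line and drops matching tokens. The one-entry-per-line format, the helper names `load_stop_words` and `remove_stop_words`, and the three sample stopwords are assumptions for this example, not a description of the project's code.

```python
import tempfile
from pathlib import Path


def load_stop_words(path):
    """Read one stopword per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token.lower() not in stop_words]


# A tiny temporary file stands in for resources/stop_words.txt.
with tempfile.TemporaryDirectory() as tmp:
    stop_path = Path(tmp) / "stop_words.txt"
    stop_path.write_text("và\nlà\ncủa\n", encoding="utf-8")
    stop_words = load_stop_words(stop_path)

tokens = ["Hà Nội", "là", "thủ đô", "của", "Việt Nam"]
filtered = remove_stop_words(tokens, stop_words)
```

Filtering like this before counting frequencies keeps function words from dominating the statistics.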

Demo

Corpus search tool

Statistical analysis tool
