soaxelbrooke / phrase

A tool for learning significant phrase/term models, and efficiently labeling with them.

License: Apache License 2.0

Rust 92.96% Shell 7.04%
phrase-extraction nlp phrases text-mining text-analysis topic-modeling

phrase's Introduction

A CLI tool and server for learning significant phrase/term models, and efficiently labeling with them.

Installation

Download and extract the release archive for your OS, and put the phrase binary somewhere on the PATH (like /usr/local/bin). If you're using Linux, the GNU binary currently appears to be 5-10x faster than the musl version, so try that first.

For example, installing the Linux binary:

$ wget https://github.com/soaxelbrooke/phrase/releases/download/0.3.6/phrase-0.3.6-x86_64-unknown-linux-gnu.tar.gz
$ tar -xzvf phrase-0.3.6-x86_64-unknown-linux-gnu.tar.gz
$ sudo mv phrase /usr/local/bin/
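
You can confirm the binary ended up on your PATH before moving on:

$ which phrase
/usr/local/bin/phrase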

Use

In general, using phrase falls into 3 steps:

  1. Counting n-grams
  2. Exporting scored models
  3. Significant term/phrase extraction/transform or model serving

N-gram counting is done incrementally: you provide batches of documents as they come in. Model export reads all n-gram counts so far and calculates mutual information-based collocations; you can then deploy the models by shipping the phrase binary and the data/scores_* files to a server. Labeling (identifying all significant terms and phrases in text) or transforming (eagerly replacing the longest phrases found in text) can be done either via the CLI or the web server. Providing labels for documents is not necessary for learning phrases, but it does help, and it also enables significant-term labeling.
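
For example, a deployment might look like the following (hypothetical host and paths; the assumption, consistent with the examples below, is that phrase reads its data/ directory relative to the working directory):

$ ssh deploy@app-server 'mkdir -p /opt/phrase/data'
$ scp phrase deploy@app-server:/opt/phrase/
$ scp data/scores_* deploy@app-server:/opt/phrase/data/
$ ssh deploy@app-server 'cd /opt/phrase && ./phrase serve'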

Training a phrase model

This example uses the assets/reviews.json data in the repo, 10k app reviews:

$ head -1 assets/reviews.json
{"body": "Woww! Moon Invoice is just so amazing. I don\u2019t think any such app exists that works so wonderfully. I am awestruck by the experience.", "category": "Business", "sentiment": "positive"}

First, you need to count n-grams from your data:

$ phrase count --mode json assets/reviews.json --textfield body --labelfield category --labelfield sentiment

(This creates n-gram count files at data/counts_*)

Then, you need to export scored phrase models:

$ phrase export

(This will create scored phrase models at data/scores_*)

Validating Learned Phrases

You can validate the phrases being learned per-label with the show command:

$ phrase show -n 3

Label=News
hash,ngram,score
3178089391134982486,New Yorker,0.5142287028163096
18070968419002659619,long form,0.5096737783425647
16180697492236521925,sleep timer,0.5047391214969927

Label=Business
hash,ngram,score
4727477585106156155,iTimePunch Plus,0.55749203444841
12483914742025992948,Crew Lounge,0.5479129370086021
11796198430323558093,black and white,0.5385891753319711

...

hash is the hash of the stemmed n-gram; ngram is the canonical form of the n-gram, used for display purposes. For phrases, score is a combination of NPMI(phrase, tokens) and NPMI(n-gram, label); for single tokens it is just NPMI(n-gram, label).
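
For reference, NPMI as defined in the cited Bouma paper normalizes pointwise mutual information into the range [-1, 1] (how the two NPMI components are combined for phrases is internal to phrase):

NPMI(x, y) = PMI(x, y) / -log p(x, y), where PMI(x, y) = log( p(x, y) / (p(x) p(y)) )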

Transforming Text

$ echo "The weather channel is great for when I want to check the weather!" | phrase transform --label Weather -
The Weather_Channel is great for when I_want_to check_the_weather!

Supported modes are CSV, JSON, and plaintext (the default). CSV and JSON modes preserve the rest of the document/row, replacing only the text in the fields specified with --textfield (or in the text field if not specified).
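
For example, a JSON-mode transform over the review data might look like this (a sketch, assuming transform accepts the same --mode and --textfield arguments shown for count above):

$ phrase transform --mode json assets/reviews.json --textfield body --label Business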

Serving Scored Phrase Models

$ phrase serve

It also accepts --port and --host parameters.
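
For example, to listen on all interfaces on the default port:

$ phrase serve --host 0.0.0.0 --port 6220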

API Routes

GET /labels - list all available labels for extraction/labeling.

$ curl localhost:6220/labels
{"labels":["Social Networking","Travel","negative","Weather","positive","Business","News","neutral",null]}

POST /analyze - identifies all significant phrases and terms found in the provided documents.

$ curl -XPOST localhost:6220/analyze -d '{"documents": [{"labels": ["Weather", "positive", null], "text": "The weather channel is great for when I want to check the weather!"}]}'
[{"labels":["Weather","positive"],"ngrams":["I want","I want to","I want to check","Weather Channel","channel","check","check the weather","want to","want to check","want to check the weather","weather","when I want","when I want to"],"text":"The weather channel is great for when I want to check the weather!"}]

POST /transform - eagerly replaces the longest phrases found in the provided documents.

$ curl -XPOST localhost:6220/transform -d '{"documents": [{"labels": ["Weather"], "text": "The weather channel is great for when I want to check the weather!"}]}'
[{"label":"Weather","text":"The Weather_Channel is great for when I_want_to check_the_weather!"}]

Labels

Labels are used to learn significant single tokens and to aid in scoring significant phrases. While phrase can be used without providing labels, providing them allows it to learn more nuanced phrases, such as those used by a specific community or when describing a specific product. Labels are generally provided in a field of the input file specified with the --labelfield argument, or set for the whole input with the --label argument.

Providing labels for your data causes phrase to count n-grams into separate bags per label, and during export allows it to calculate an extra significance score based on the label (instead of just co-occurrence). This means that a phrase unique to a label is much more likely to be picked up than if it were being overshadowed in unlabeled data.

An example of a good label would be app category, since apps in each category are related and customer reviews talk about similar subjects. An example of a bad label would be user ID, since it has very high cardinality, causes very poor performance, and likely wouldn't yield useful phrases or terms due to data sparsity per user.
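
For example, label the review data by category via a field in the input, or apply one fixed label to a whole unlabeled file (the second command is a sketch; the file name is a placeholder and plaintext is the default mode):

$ phrase count --mode json assets/reviews.json --textfield body --labelfield category
$ phrase count reviews.txt --label Business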

Performance

It's fast.

It takes 0.66 seconds to count 1- to 5-grams for 10,000 reviews, and ~1.2 seconds to export. Performance depends primarily on n-gram size, the number of labels, and vocab size. For example, labeling on iOS app category (23 labels) using default parameters on an Intel Core i7-7820HQ (Ubuntu):

Task                         Tokens per Second per Thread
Counting n-grams             779,025
Exporting scored models      206,704
Labeling significant terms   354,395
Phrase transformation        345,957

Note: Exports do not gain much from parallelization

Environment Variables

A variety of environment variables can be used:

ROCKET_ADDRESS - The address to serve on; defaults to localhost. (Other Rocket configuration variables can also be set.)

ROCKET_PORT - The port to serve on, defaults to 6220.

LANG - Determines the stemmer language to use, ISO 639-1. Should be set automatically on Unix systems, but can be overridden.

TOKEN_REGEX - The regular expression used to find tokens when learning and labeling phrases.

CHUNK_SPLIT_REGEX - The regular expression used to detect chunk boundaries, across which phrases aren't learned.

HEAD_IGNORES / TAIL_IGNORES - Comma-separated tokens; phrases that start (HEAD_IGNORES) or end (TAIL_IGNORES) with one of these tokens are ignored. For instance, TAIL_IGNORES=the would ignore 'I love the'.

PRUNE_AT - The size at which to prune the n-gram count mapping. Useful for limiting memory usage; default is 5000000.

PRUNE_TO - Controls what size n-gram mappings are pruned to during pruning. Also sets the number of n-grams that are saved after counting (sorted by count). Default is 2000000.

BATCH_SIZE - Controls the document batch size. Causes input streams to be batched, allowing larger than memory datasets. Default is 1000000.

MAX_NGRAM - The highest n-gram size to count to, higher values cause slower counting, but allow for more specific and longer phrases. Default is 5.

MIN_NGRAM - The lowest n-gram size to export, default is 1 (unigrams).

MIN_COUNT - The minimum n-gram count for a phrase or token to be considered significant. Default is 5.

MIN_SCORE - The minimum NPMI score for a term or phrase to be considered significant. Default is 0.1.

MAX_EXPORT - The maximum size of exported models, per label.

NGRAM_DELIM - The delimiter used to join phrases when counting and scoring. Default is .
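
For example, several of these can be combined to learn shorter, higher-confidence phrases (illustrative values; reviews.json is a placeholder path):

$ MAX_NGRAM=3 phrase count --mode json reviews.json --textfield body
$ MIN_NGRAM=2 MIN_COUNT=10 MIN_SCORE=0.2 phrase export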

Citations

Normalized (Pointwise) Mutual Information in Collocation Extraction - Gerlof Bouma


phrase's Issues

Counting halts for ~30 minutes during counting

See the last few log lines:

[2019-05-01T16:17:32Z DEBUG phrase] Pruning ngrams of length 5052241.
[2019-05-01T16:17:37Z DEBUG phrase] Done pruning.
[2019-05-01T16:17:46Z DEBUG phrase] Pruning ngrams of length 5072157.
[2019-05-01T16:17:50Z DEBUG phrase] Done pruning.
[2019-05-01T16:17:59Z DEBUG phrase] Pruning ngrams of length 5115327.
[2019-05-01T16:18:03Z DEBUG phrase] Done pruning.
[2019-05-01T16:18:11Z DEBUG phrase] Pruning ngrams of length 5105722.
[2019-05-01T16:18:16Z DEBUG phrase] Done pruning.
[2019-05-01T16:18:24Z DEBUG phrase] Pruning ngrams of length 5057711.
[2019-05-01T16:18:29Z DEBUG phrase] Done pruning.
[2019-05-01T16:18:39Z DEBUG phrase] Pruning ngrams of length 5060026.
[2019-05-01T16:18:43Z DEBUG phrase] Done pruning.
[2019-05-01T16:18:52Z DEBUG phrase] Pruning ngrams of length 5050255.
[2019-05-01T16:18:57Z DEBUG phrase] Done pruning.
[2019-05-01T16:19:06Z DEBUG phrase] Pruning ngrams of length 5003765.
[2019-05-01T16:19:11Z DEBUG phrase] Done pruning.
[2019-05-01T16:19:21Z DEBUG phrase] Pruning ngrams of length 5083177.
[2019-05-01T16:19:26Z DEBUG phrase] Done pruning.
[2019-05-01T16:19:35Z DEBUG phrase] Pruning ngrams of length 5065380.
[2019-05-01T16:19:40Z DEBUG phrase] Done pruning.
[2019-05-01T16:20:45Z DEBUG phrase] Pruning ngrams of length 5082774.
[2019-05-01T16:20:53Z DEBUG phrase] Done pruning.
[2019-05-01T16:50:00Z DEBUG phrase] Pruning ngrams of length 5042668.
[2019-05-01T16:51:39Z DEBUG phrase] Done pruning.

This only seems to occur for very large inputs.

Add fit-transform style CLI option

There are times during preprocessing for exploratory NLP tasks when you don't really care about keeping a model around. It should be easy to run a single command that fits a phrase model and transforms your input at the same time.
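
For reference, today this takes three separate invocations (following the examples in the README above; reviews.json and texts.txt are placeholder paths):

$ phrase count --mode json reviews.json --textfield body --labelfield category
$ phrase export
$ cat texts.txt | phrase transform --label Business -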

Performance Regression

Something has caused phrase counting to slow down by ~10x. It is likely due to all the copying happening in the newer abstract document readers. Counting the 10k reviews used to take ~1s; now it's taking 5-10s.

Add /count API endpoint

While counting is best done in bulk, it's pretty clear there are instances where you'd prefer services to interface with phrase over HTTP instead of relying on a cron job to keep counts up to date.
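
A hypothetical request, mirroring the existing /analyze payload shape (this endpoint does not exist yet; the body is only a sketch of what it could accept):

$ curl -XPOST localhost:6220/count -d '{"documents": [{"labels": ["Weather"], "text": "The weather channel is great for when I want to check the weather!"}]}'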

Add start and end indexes for detected phrases

It is often difficult to highlight where in the text a phrase occurred, for multiple reasons:

  • A phrase can occur multiple times
  • A returned phrase can be the normalized version

The /analyze API should return a list of spans that match each phrase.
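
A hypothetical response shape (illustration only, not the current output) could extend the /analyze result with character offsets per phrase:

[{"labels":["Weather"],"ngrams":[{"ngram":"Weather Channel","spans":[[4,19]]}],"text":"The weather channel is great for when I want to check the weather!"}]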

Duplicated tokens when using MAX_NGRAM > 10

While trying out mutual information for learning a subword tokenization vocab, MAX_NGRAM=11 results in duplicated characters:

stuart@mkv:~/data/tmp$ rm data/*; MAX_NGRAM=10 TOKEN_REGEX=. phrase count en_reviews.csv --mode csv --textfield body --labelfield rating; phrase export; phrase show

Label=1
hash,ngram,score
18415922922211966513,t h e,0.6363380611966793
17914724369135298022,f o r,0.5871598626336608
3095383692093144700,t h e y,0.575924682014556
4712868740118897308,h a v e,0.5703441057211067
14024107387028683102,w i t h,0.5526794390886509

Label=4
hash,ngram,score
15130871412783076140, ,0.5045632902635742
4572565019114190643,j e g,0.4935512138461849
7304601966317243410,e r,0.4899087702638937
4861016657709414307,👍 🏼,0.4648909308968822
17914724369135298022,f o r,0.46172711934348454

Label=3
hash,ngram,score
17914724369135298022,f o r,0.5493486900984565
18415922922211966513,t h e,0.5400118642648779
823578326311812929,w a s,0.5075134909529068
4572565019114190643,j e g,0.5043512947889531
15130871412783076140, ,0.49597432486554377

Label=2
hash,ngram,score
18415922922211966513,t h e,0.6590091494840076
17914724369135298022,f o r,0.5813162238122089
4712868740118897308,h a v e,0.5733841012831812
3197989230922105892,t o,0.5579905097552169
14024107387028683102,w i t h,0.5374034025991882

Label=5
hash,ngram,score
16756749404574535759,v e r y,0.5988956216366501
18415922922211966513,t h e,0.5757690182994385
14024107387028683102,w i t h,0.5413567960482037
13219107927906441752,q u i c k,0.5331504213044447
15130871412783076140, ,0.5031534046170836

Default
hash,ngram,score
18415922922211966513,t h e,1.0989026833407292
14024107387028683102,w i t h,1.0210042251209719
16756749404574535759,v e r y,1.0038153994719832
17914724369135298022,f o r,0.96870549510887
823578326311812929,w a s,0.967953100921941
stuart@mkv:~/data/tmp$ rm data/*; MAX_NGRAM=11 TOKEN_REGEX=. phrase count en_reviews.csv --mode csv --textfield body --labelfield rating; phrase export; phrase show

Label=1
hash,ngram,score
7885594734979136064,e e d,0.8937949560137818
12177849757917733397,e e y,0.8796409927072046
18153361248725046440,e e r,0.8325780064404189
5849221387069017603,e e r e,0.8266668449847678
14831183891864497587,h h e y,0.7824760802388278

Label=4
hash,ngram,score
7885594734979136064,e e d,0.8886180389152183
14919098748374444916,n n d,0.8242804941093913
12316562843929522493,t t e r,0.6761508403051653
3139600802105402158,å å,0.6754007875749644
17460211092147653343,", ,",0.665983184393488

Label=3
hash,ngram,score
7885594734979136064,e e d,0.882480809621127
6978619936077427553,e e s,0.7774728570525561
14919098748374444916,n n d,0.748207087317821
3737277342115879903,u u t,0.7240416552060133
12316562843929522493,t t e r,0.6951651248885364

Label=2
hash,ngram,score
7885594734979136064,e e d,0.9698827378917166
5849221387069017603,e e r e,0.8089767275911705
18153361248725046440,e e r,0.7870684856450568
3737277342115879903,u u t,0.7589893235021974
14919098748374444916,n n d,0.7530743220470321

Label=5
hash,ngram,score
14919098748374444916,n n d,1.032676725440377
4879210129080536622,r r y,0.9521606573853961
7885594734979136064,e e d,0.871696100885882
18153361248725046440,e e r,0.8448989448505464
11847061388218302127,l l y,0.8346358477052218

Default
hash,ngram,score
7885594734979136064,e e d,1.749894320305351
14919098748374444916,n n d,1.7360713637821104
4879210129080536622,r r y,1.6715589374799673
11322620428672548095,r r s,1.4814410538933094
11847061388218302127,l l y,1.4717042853422688

Saturates memory on large corpus

Hi! This looks like a great tool. I love that it's focused on command line use.

When I took it for a spin on a larger-than-memory corpus, the process got OOM-killed.

Is this expected behaviour? (I respect that working with larger-than-memory datasets is complicated in general)
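
As a partial workaround with the current release, the memory-related environment variables documented above can shrink counting's footprint (illustrative values; big_corpus.json is a placeholder path):

$ PRUNE_AT=1000000 PRUNE_TO=250000 BATCH_SIZE=100000 phrase count --mode json big_corpus.json --textfield body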
