
ddl_nlp's People

Contributors

alamine42, ayota, dvetal, lauralorenz, lcombs


Forkers

lauralorenz

ddl_nlp's Issues

Update documentation

I forgot to check the documentation after all the refactoring. Assume the evaluation docs are dying. This issue is to remind me of that next week :P

Generate Cross-val folds

We need to develop a file that parses the downloaded text and returns a training and a test set for each of the k folds. The next step in the process (gensim word2vec training) should be able to pull just the training data to conduct training. The function should also return test data that is accessible to our evaluation tasks.
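
For a sense of shape, here is a minimal sketch using scikit-learn's KFold; the function name and the document-list interface are illustrative, not the project's actual API.

    from sklearn.model_selection import KFold

    def generate_folds(documents, k=5):
        """Yield (train_docs, test_docs) once per fold."""
        kf = KFold(n_splits=k, shuffle=True, random_state=42)
        for train_idx, test_idx in kf.split(documents):
            yield ([documents[i] for i in train_idx],
                   [documents[i] for i in test_idx])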

Separate corpus build-out from word2vec workflow

The idea here is that a corpus can be built in a number of ways:

  • Wikipedia
  • Medical Abstracts
  • Medical Textbooks
  • Manually adding files
  • Any combination of the above options

This doesn't lend itself well to inclusion in the word2vec Drake workflow because of all the moving parts in corpus creation.

Perhaps a better approach would be to have a master corpus creation script (under the fun_3000/ingestion/ directory) which takes as an argument a selection of source types (wiki, abstracts, textbooks, all) and takes care of calling the appropriate ingestion modules. Once corpus creation is done (and any necessary manual edits are made to it), the Drake workflow can be kicked off to execute the rest of the steps.
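
A rough sketch of what that master script might look like; the ingestion module names and their run() entry points are assumptions, not the real modules.

    from optparse import OptionParser

    from fun_3000.ingestion import (wikipedia_ingest, med_abstract_ingest,
                                    textbook_ingest)

    # Map each source type to the ingestion entry points it should invoke.
    SOURCES = {
        'wiki': [wikipedia_ingest.run],
        'abstracts': [med_abstract_ingest.run],
        'textbooks': [textbook_ingest.run],
    }
    SOURCES['all'] = SOURCES['wiki'] + SOURCES['abstracts'] + SOURCES['textbooks']

    def build_corpus(source, data_dir):
        for ingest in SOURCES[source]:
            ingest(data_dir)

    if __name__ == '__main__':
        parser = OptionParser()
        parser.add_option('-s', '--source', default='all',
                          help='one of: wiki, abstracts, textbooks, all')
        parser.add_option('-d', '--data-dir', dest='data_dir', default='data')
        opts, _ = parser.parse_args()
        build_corpus(opts.source, opts.data_dir)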

Stub the evaluation script

The evaluation script is still in a state that expects multiple model files in the folds-based directory structure, but we took that out in PR #64. Also, we hate it.

This issue is to refactor the evaluation script and its Drakefile component so that it just appends random data to the scores files (unless we can anticipate any of the data at this point, such as boost numbers, in which case use the real data). It should remain a stub so that we can evaluate the visualization PR while we figure out the theoretical evaluation tactic we want to use.

Drop the corpus splitting possibly

We have a step where we split up the corpus into k number of different folds and then run word2vec on those folds separately, resulting in k number of model files. Later on during evaluation we take each of those model files, obtain cross-validated accuracy scores for the evaluation task for each model, take the top value for each model and then average those together to represent the efficacy of the entire corpus.

We think we don't like the corpus splitting in the first stage, and this issue is to get rid of it. In the first place, it's introducing a variable we don't really want to test, namely whether the corpus itself is biased. In the second place, the way we pool the models' evaluation scores to represent the whole corpus makes the evaluation measure unnecessarily complex and indirect.

Set up corpus ingestion on The Cloud

We don't want to sit around watching our computers scrape Wikipedia. This issue is to set up ingestion on an AWS instance to pull the corpus against a larger word set and with a broader radius. We may want to modify the code as well to save against an S3 bucket, or we can do the transfer ourselves after the process has completed.

Grabbing Relevant Terms

Right now we have a process which starts with grabbing terms from a text file to run the get_corpus.py process. However, there are only certain terms that we are really interested in. Those 'terms of extreme interest' can be generated from the fun_3000.evaluation.similarity_evaluation.UMLS.get_words_list function. We basically want this function to run and have some capability that saves the result to a file. Ultimately, we probably just want this to run within Drake as the activity that kicks off the whole process. Another option is to run the function one time, save the result to the location where we want the terms file to exist, and add it to configuration control. Your call.
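
Either way, the one-off version is tiny; something like this, with the import path taken from the issue text (so treat it as an assumption):

    from fun_3000.evaluation.similarity_evaluation import UMLS

    def save_terms(path='data/terms.txt'):
        """Dump the unique evaluation terms to the terms file, one per line."""
        with open(path, 'w') as f:
            for term in UMLS.get_words_list():
                f.write(term + '\n')

    if __name__ == '__main__':
        save_terms()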

Storage: Fix to make sure model files are stored properly

Because we are storing our corpus files in folds, in a folder structure like the one described in our README, we need to be able to store the k model files that are generated when gensim runs k times.

Currently, the fun_3000/word2vec.py file is trying to save files to a folder that does not exist:

'models/jazz/2/train/jazz/2/train.model'

IDEAL: We'd like the folder structure to be built as part of the end of the script, and to follow the pattern

models/<data_dir>/<fold>/train.model

(A fix sketch follows the traceback.)

Current full traceback below:

Traceback (most recent call last):
  File "fun_3000/word2vec.py", line 79, in <module>
    run_model(opts.input_data_dir, opts.parallel_workers, opts.context_window, opts.hidden_layer, opts.model_name)
  File "fun_3000/word2vec.py", line 66, in run_model
    model.save(model_path)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1452, in save
    super(Word2Vec, self).save(*args, **kwargs)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/gensim/utils.py", line 486, in save
    pickle_protocol=pickle_protocol)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/gensim/utils.py", line 359, in _smart_save
    pickle(self, fname, protocol=pickle_protocol)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/gensim/utils.py", line 912, in pickle
    with smart_open(fname, 'wb') as fout:  # 'b' for binary, needed on Windows
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/smart_open/smart_open_lib.py", line 99, in smart_open
    return file_smart_open(parsed_uri.uri_path, mode)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/smart_open/smart_open_lib.py", line 379, in file_smart_open
    return open(fname, mode)
IOError: [Errno 2] No such file or directory: '/Users/donaldvetal/Projects/ddl_nlp/models/jazz/2/train/jazz/2/train.model'
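
A minimal fix sketch, assuming we keep gensim's model.save: build the target directory right before saving (paths here are illustrative).

    import os

    def save_model(model, model_path):
        """Create the target directory if needed, then save the gensim model."""
        model_dir = os.path.dirname(model_path)
        if model_dir and not os.path.isdir(model_dir):
            os.makedirs(model_dir)
        model.save(model_path)

    # e.g. save_model(model, 'models/jazz/2/train.model')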

Average models

Take all models from the multiple folds and average them together to get a single model to feed into the evaluation task.
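
A rough sketch of the mechanical part, using gensim's wv attribute (older gensim exposes vocab on the model itself). One big caveat worth noting: independently trained word2vec spaces are not aligned by default, so whether a simple per-word mean is meaningful is exactly what this issue has to work out.

    import numpy as np
    from gensim.models import Word2Vec

    def average_models(model_paths):
        """Return {word: mean vector} over words shared by all fold models."""
        models = [Word2Vec.load(p) for p in model_paths]
        shared = set(models[0].wv.vocab)
        for m in models[1:]:
            shared &= set(m.wv.vocab)
        return {w: np.mean([m.wv[w] for m in models], axis=0) for w in shared}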

Update README to represent proper directory structure during different steps

The README is out of date about what shape the file system should take during the process. We're using the README as the spec for our glue-together scripts, so this issue is to clean it up.

Change the example directories listed for the generate-folds script in the README to be based on runs, not single terms (e.g. "jazz"), since we moved away from single terms and now process multiple terms at a time.

Ontology pruning system

We need to be able to prune out of the ontology any explicit relationship that we expect to appear in our evaluation metric corpus.

Spacy refactor for corpus cleaning

Several things we should do here that spaCy can help us with (a rough sketch of the wrapper follows the list). 🎊

  • Stop reading all the things in at once; make the inputs and outputs generators. I think spaCy's pipeline method will help here.
  • We need a function to clean random characters out of the corpus: mostly remnants of HTML/LaTeX/other code and copyright symbols. Maybe some of spaCy's token flags, and also its string features, could help here?
  • We need a function to split the corpus into sentences (figure out where spaCy does this).
  • We need a function to throw out invalid sentences (ones that don't have SOV structure, are just numbers, are just headers, etc.).
  • We need a wrapper function that calls all of the above, can be run as part of the ingestion task, and outputs a clean set of sentences for fold making.
  • Input should be a directory of .txt files; output should be a new text file. Drake currently expects it to be named output.txt, based on parameters sent from the Drakefile.
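
A rough sketch of that wrapper, assuming spaCy's nlp.pipe for streaming; the validity rule shown is a stand-in for the real SOV/number/header checks.

    import os
    import spacy

    nlp = spacy.load('en_core_web_sm')

    def iter_documents(input_dir):
        """Generator over raw .txt files so we never read everything at once."""
        for name in os.listdir(input_dir):
            if name.endswith('.txt'):
                with open(os.path.join(input_dir, name)) as f:
                    yield f.read()

    def is_valid(sent):
        """Stand-in sentence filter: demand at least one verb and one noun."""
        pos = {tok.pos_ for tok in sent}
        return 'VERB' in pos and ('NOUN' in pos or 'PROPN' in pos)

    def clean_corpus(input_dir, output_path='output.txt'):
        with open(output_path, 'w') as out:
            for doc in nlp.pipe(iter_documents(input_dir)):
                for sent in doc.sents:
                    if is_valid(sent):
                        out.write(sent.text.strip() + '\n')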

Parameterize boost level

We want to be able to parameterize the boost level of the ontologies during a Drakefile run.

This issue is done when we can send an int to the Drake step that generates the folds, controlling how many times an ontology is appended to each fold.
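
The fold-generation side of this is small; something like the following, with names illustrative:

    def boost_fold(fold_text, ontology_text, boost=1):
        """Append the ontology text to a training fold `boost` times."""
        return fold_text + ('\n' + ontology_text) * boost

    # generate_folds.py would then expose the knob, e.g.:
    # parser.add_option('-b', '--boost', type='int', default=1)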

Ingest medical abstracts

We do not have direct access to a large enough corpus of data so we may have to build it ourselves. We want to mash up wikipedia, medical textbooks, and medical abstracts that are sure to cover our evaluation metric dataset. This issue is specifically intended to address ingesting medical abstracts.

Like #8, we are interested in abstracts referencing our evaluation metrics, so this ingestion module should be able to:

  • query one or more abstract sources for each query in the evaluation metric data set
  • retrieve abstracts within some parameterized limit of search results

Input: evaluation metric data set in csv
Output: all the abstracts as a flat text file to disk
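
One possible shape for the PubMed half, using Biopython's Entrez client (an assumption on our part; any E-utilities wrapper would do):

    from Bio import Entrez

    Entrez.email = 'you@example.com'  # NCBI requires a contact address

    def fetch_pubmed_abstracts(term, retmax=20):
        """Return up to retmax PubMed abstracts for a search term, as text."""
        handle = Entrez.esearch(db='pubmed', term=term, retmax=retmax)
        ids = Entrez.read(handle)['IdList']
        if not ids:
            return ''
        handle = Entrez.efetch(db='pubmed', id=','.join(ids),
                               rettype='abstract', retmode='text')
        return handle.read()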

Classification evaluation metric based on ontology dropout analogies

Classifying things based on information extracted from the ontology but not used to train the neural net. Don's example: given A >> B >> C, we remove B and try to classify on it. Another example: in the ontology we may have that plague is a type of disease and a disease is a type of ailment; we would remove the references to disease and leave the link between plague and ailment.

Input: csv of words with classes assigned
Output: df of word vectors and associated classes, indexed by word

Make readme for evaluation script

Add a section to the main readme that gives a high-level overview of what the evaluation script does and how to use it.

We think the current docs are missing:

  • what the score actually is (the math)
  • example output
  • an evaluation example in the manual example; add that, and move the whole Wikipedia manual-example section into the GitHub wiki

Ingest medical textbooks

We do not have direct access to a large enough corpus of data, so we may have to build it ourselves. We want to mash up wikipedia, medical textbooks, and medical abstracts that are sure to cover our evaluation metric dataset. This issue is specifically intended to address ingesting medical textbooks.

As for possible places to find medical textbooks: honestly, we feel like we only need one, so some hard-core Google fu for one or two in text form that we can programmatically retrieve (or simply one-time click-and-download) would be sufficient.

Input: none
Output: medical textbook(s) as a flat text file on disk

Make wikipedia ingestion ignore Notes/References and External Links sections

There are some sections common to wikipedia pages that are not valuable content for us, because they contain primarily links or other information that is not contextual human language. In particular, the Notes, References, and External Links sections that are optionally included in wikipedia page objects are not valuable to us. This issue is closed when our wikipedia ingestion script returns all sections EXCEPT Notes, References, and External Links when pulling data. See the Wikipedia Page object's sections property and section() method in the python wikipedia package for an idea of how to filter this information.
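
A sketch of the filter using those two hooks (function name illustrative):

    import wikipedia

    SKIP = {'Notes', 'References', 'External links'}

    def page_text_without_link_sections(title):
        """Return page text with the link-heavy sections dropped."""
        page = wikipedia.page(title)
        parts = [page.summary]
        for name in page.sections:
            if name in SKIP:
                continue
            content = page.section(name)
            if content:
                parts.append(content)
        return '\n'.join(parts)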

Make a few prod models

Now that we've got boost more or less together, we should run a few boost models on a big (or maybe our existing full) corpus so we have something to investigate with evaluation. Fun!

Clean up the text

The text pulled in as abstracts and wikipedia articles has a bunch of weird shit in it. This issue is to alter generate_folds.py to include rules that clean this up a bit, so the sentences generated are as complete and sentence-y as possible (a regex sketch follows the list).

We are going to:

  • remove all HTML tags (<...>)
  • remove all LaTeX ({...} groups and $...$ math)
  • remove headers from wikipedia articles
  • remove newlines and carriage returns (these mess up the tokenize script)
  • remove all non-ASCII characters (like copyright symbols)
  • remove extraneous spaces (these also mess up the tokenize script)
  • remove sentences that are less than 7 words long, don't end with a period, don't start with a capital letter, or start with a number
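
A first-pass translation of those rules into regexes; these patterns are approximations to iterate on, not the final cleaning logic.

    import re

    def clean_text(text):
        text = re.sub(r'<[^>]+>', ' ', text)                       # html tags
        text = re.sub(r'\$[^$]*\$', ' ', text)                     # $...$ latex math
        text = re.sub(r'\{[^}]*\}', ' ', text)                     # {...} latex groups
        text = re.sub(r'^=+[^=\n]+=+\s*$', ' ', text, flags=re.M)  # wiki headers
        text = text.replace('\r', ' ').replace('\n', ' ')          # newlines / CRs
        text = text.encode('ascii', 'ignore').decode('ascii')      # non-ascii chars
        return re.sub(r'\s+', ' ', text).strip()                   # extra spaces

    def keep_sentence(sent):
        """Starts with a capital (so not a number), ends in '.', >= 7 words."""
        return (len(sent.split()) >= 7 and sent.endswith('.')
                and sent[:1].isupper())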

Bugfix for medical abstract ingestion with illegal variable reference

2016-08-18 00:38:52,978: INFO : Fetched Myopathy term wiki artifacts.
2016-08-18 00:38:53,952: INFO : Document does not have abstract.
Traceback (most recent call last):
  File "fun_3000/get_corpus.py", line 84, in <module>
    fetch_corpus(search_terms, directory, results)
  File "fun_3000/get_corpus.py", line 41, in fetch_corpus
    med_search.get_medical_abstracts(term, data_dir, results)
  File "/home/laura/ddl_nlp/fun_3000/ingestion/med_abstract_ingest.py", line 172, in get_medical_abstracts
    abstracts_pubmed = fetch_pubmed(search_term, results)
  File "/home/laura/ddl_nlp/fun_3000/ingestion/med_abstract_ingest.py", line 64, in fetch_pubmed
    for item in summary:
UnboundLocalError: local variable 'summary' referenced before assignment

Looks like the reference to summary here belongs inside the try/except.
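
Sketch of the fix; the internals of fetch_pubmed are assumed here, and the point is just that summary is only iterated when the fetch succeeded.

    import logging

    def fetch_pubmed(search_term, results):
        summary = []  # safe default if the fetch below fails
        try:
            summary = query_pubmed(search_term, results)  # hypothetical helper
        except Exception:
            logging.info('Document does not have abstract.')
        return [item for item in summary]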

Update evaluation script to take one of the valid evaluation files

We've been using a "fake" evaluation file for our evaluation task since we merged it in, for the purposes of testing the evaluation workflow rather than accuracy. Now that we have a more normal-sized corpus, we should switch to a real evaluation file and see how we do.

It seems there are 3 options for valid evaluation files:

Right now the file used during evaluation testing is hardcoded into the evaluation script, but it should probably be configurable. We should choose whichever one we want to use, set it as the default configuration in the Drakefile, and refactor evaluation to take the configured similarity file as the test set.

Make CLI better

Instead of using optparse over and over, can we structure the CLI to be more consistent throughout the package? (A sketch follows the list.)

  1. Can we handle the path drama here instead of in the super-fragile utils file?
  2. Implement required vs. optional args, and generally a nicer structure for the CLI.
  3. Consistent commands and usage throughout the package.
  4. And more!
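
One possible shape, using argparse subcommands instead of per-script optparse (illustrative, not the final design; subcommand and flag names are made up):

    import argparse

    def build_parser():
        parser = argparse.ArgumentParser(prog='fun_3000')
        sub = parser.add_subparsers(dest='command', required=True)

        ingest = sub.add_parser('ingest', help='build the corpus')
        ingest.add_argument('terms_file')                      # required positional
        ingest.add_argument('--results', type=int, default=5)  # optional flag

        folds = sub.add_parser('folds', help='generate cross-val folds')
        folds.add_argument('data_dir')
        folds.add_argument('--k', type=int, default=5)
        return parser

    if __name__ == '__main__':
        args = build_parser().parse_args()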

Bugfixes to wikipedia ingest to make it more robust

While running the larger corpus we noticed some errors occurring during wikipedia ingestion. This issue is to address those two errors.

  • Disambiguation errors

To reproduce, use the data/eval_words files as your search term input, though you may be able to reproduce with just the "bad" search term, which was Activase. You'll see this error:

/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))
Traceback (most recent call last):
  File "fun_3000/get_corpus.py", line 85, in <module>
    fetch_corpus(search_terms, directory, results)
  File "fun_3000/get_corpus.py", line 40, in fetch_corpus
    wiki_search.get_wikipedia_pages(term, data_dir, results)
  File "/Users/llorenz/Development/ddl/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 52, in get_wikipedia_pages
    save_wiki_text(search_term, local_file_path)
  File "/Users/llorenz/Development/ddl/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 17, in save_wiki_text
    page = wpg(wiki_search_term)
  File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
    return WikipediaPage(title, redirect=redirect, preload=preload)
  File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
    self.__load(redirect=redirect, preload=preload)
  File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
    raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "active" may refer to: 
Active (album)
Active Records
Active (ship)
Active (1764 ship)
Active (1850)
Active (1877)
Active (sternwheeler)
HMS Active
USCS Active (1852)
USCGC Active
USRC Active
USS Active
Active (whaler)
Active Enterprises
Sky Active
Active (pharmacology)
Active, Alabama
ACTIVE
Locomotion No 1
fraternities and sororities
Active lifestyle
Activation
Activity (disambiguation)
Passive (disambiguation)
All pages beginning with "Active"

As far as we can tell, this happens because the search for Activase returns a results list including the page title Active, but when we later retrieve that page with the wikipedia.page method against the title Active, the wikipedia API raises a DisambiguationError. In these cases we want to drop the search result, since we will not be able to disambiguate it programmatically. This could occur at both the first search (i.e., against Activase) and the second round of search (i.e., against Active), so we will need to catch it in both places.
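
The catch itself is simple; something like this in both spots (function name illustrative):

    import wikipedia

    def try_fetch_page_text(title):
        """Return the page text, or None if the title is ambiguous."""
        try:
            return wikipedia.page(title).content
        except wikipedia.exceptions.DisambiguationError:
            return None  # drop it; we can't disambiguate programmatically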

  • Page ID not found

To reproduce, use the following search terms, though you may be able to reproduce with just the "bad" search term, which was Malignant tumor of lung.

Renal failure
Kidney failure
Abortion
Miscarriage
Heart
Myocardium
Stroke
Delusion
Schizophrenia
Calcification
Stenosis
Tumor metastasis
Adenocarcinoma
Congestive heart failure
Pulmonary edema
Pulmonary fibrosis
Malignant tumor of lung
Diarrhea
Stomach cramps

This will trigger this error:

Traceback (most recent call last):
  File "fun_3000/get_corpus.py", line 85, in <module>
    fetch_corpus(search_terms, directory, results)
  File "fun_3000/get_corpus.py", line 40, in fetch_corpus
    wiki_search.get_wikipedia_pages(term, data_dir, results)
  File "/Users/donaldvetal/Projects/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 52, in get_wikipedia_pages
    save_wiki_text(search_term, local_file_path)
  File "/Users/donaldvetal/Projects/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 17, in save_wiki_text
    page = wpg(wiki_search_term)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
    return WikipediaPage(title, redirect=redirect, preload=preload)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
    self.__load(redirect=redirect, preload=preload)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 345, in __load
    raise PageError(self.title)
wikipedia.exceptions.PageError: Page id "malignant tumors of luna" does not match any pages. Try another id!

This one is a little more confusing, since the original term was actually malignant tumors of lung. At some point the code tries to instantiate a WikipediaPage against the title malignant tumors of luna, but fails to find the page. This is further confused by what I think is a bug in the wikipedia driver's error path: the PageError complains about a page id even though we searched by title, which implies the first positional argument is always interpreted as page_id unless page_id is explicitly set to None.

For this one we might also want to look into the auto_suggest flag, as that may be how we're getting the weird respelling of our original search term.
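
Extending the sketch from the disambiguation bullet: pass auto_suggest=False so the driver can't respell the query, and drop PageError results the same way.

    import wikipedia

    def try_fetch_page_text(title):
        """Return the page text, or None if the page is missing or ambiguous."""
        try:
            return wikipedia.page(title, auto_suggest=False).content
        except (wikipedia.exceptions.PageError,
                wikipedia.exceptions.DisambiguationError):
            return None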

Evaluation

Identifying synonyms
Using the UMLS data set, we use the 566 synonym pairs for two related tasks (a sketch of the first follows the list):

  • Logistic scale: take the top quartile (~140 pairs) as "very similar" (=1) and mark all others as "not similar" (=0). Get word2vec vectors for the train/test pairs, compute the cosine distance between the vectors, and build a classifier off that.
  • Multivariate scale: use the coders' continuous scale (maybe just take the mean and ignore the standard deviation?) to train a multivariate regression model to predict the score of a test set of pairs.
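
A sketch of the logistic-scale task; the column names from the UMNSRS csv and the model path are assumptions.

    import pandas as pd
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    model = Word2Vec.load('models/run/1/train.model')  # illustrative path
    vocab = set(model.wv.vocab)

    pairs = pd.read_csv('data/UMNSRS_similarity.csv')  # assumed columns: Term1, Term2, Mean
    pairs = pairs[pairs.Term1.isin(vocab) & pairs.Term2.isin(vocab)]

    # Single feature: cosine similarity between the two term vectors.
    pairs['cosine'] = [model.wv.similarity(a, b)
                       for a, b in zip(pairs.Term1, pairs.Term2)]
    # Top quartile of coder scores = "very similar" (1), everything else 0.
    pairs['label'] = (pairs.Mean >= pairs.Mean.quantile(0.75)).astype(int)

    clf = LogisticRegression().fit(pairs[['cosine']], pairs['label'])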

Retrieve term list directly from a function

Right now our get_corpus.py file expects the user to provide a file that includes the terms we want to search for. However, we have built a function, similarity_evaluation.FEATURE_BUILDER.get_words_list, that returns the unique term list from the UMNSRS_similarity.csv file. We should add an option to simply grab the terms from this location.

Update word2vec.py to fit into revised workflow

The word2vec.py script needs a bit of attention to confirm it's still doing what we want. This is the start of a list of things we need to look at.

  • Right now the README and what the function actually does are out of sync. Fix the README and confirm word2vec.py actually generates a model specific to each fold.
  • Edit the automagically generated filename to include the data dir name (in our case, {SOME RUN}) as well as the fold number (e.g., {SOME RUN}_{SOME FOLD}.model) to reduce confusion.
  • Other things TK

Cache corpus cleaning step

Separate corpus cleaning from the generate-folds step so that it is a true preprocessing step in our normal workflow. This way we can have a clean corpus that we port around, without having to redo the cleaning every time we generate a new set of folds.

Ingest specific wikipedia articles

We do not have direct access to a large enough corpus of data so we may have to build it ourselves. We want to mash up wikipedia, medical textbooks, and medical abstracts that are sure to cover our evaluation metric dataset. This issue is specifically intended to address ingesting specific wikipedia articles that

  • include entries for all terms from the evaluation metric dataset
  • use a radius parameter to include similar entries around each base entry

We should probably build off our existing wikipedia_ingestion.py, but pass in the results parameter from the wikipedia package (sketch below).

Input: evaluation metric dataset in csv
Output: all the related wikipedia articles as a flat text file to disk
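
A sketch of the radius idea on top of wikipedia.search's results argument (function name illustrative):

    import wikipedia

    def fetch_with_radius(term, radius=3):
        """Return text for the term's page plus `radius` related search hits."""
        texts = []
        for title in wikipedia.search(term, results=radius + 1):
            try:
                texts.append(wikipedia.page(title, auto_suggest=False).content)
            except (wikipedia.exceptions.PageError,
                    wikipedia.exceptions.DisambiguationError):
                continue  # drop missing or ambiguous pages (see the bugfix issue)
        return '\n'.join(texts)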
