
ddl_nlp's People

Contributors

alamine42, ayota, dvetal, lauralorenz, lcombs


Forkers

lauralorenz

ddl_nlp's Issues

Update documentation

I forgot to check the documentation after all the refactoring. Assume the evaluation docs are dying. This issue is to remind me of that next week :P

Generate Cross-val folds

We need to develop a file that parses the downloaded text and returns a training and a test set for each of the k folds. The next step in the process (gensim word2vec training) should be able to pull just the training data to conduct training. The function should also return test data that is accessible to our evaluation tasks.
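
For a sense of shape, here is a minimal sketch using scikit-learn's KFold; the function name and the document-list interface are illustrative, not the project's actual API.

    from sklearn.model_selection import KFold

    def generate_folds(documents, k=5):
        """Yield (train_docs, test_docs) once per fold."""
        kf = KFold(n_splits=k, shuffle=True, random_state=42)
        for train_idx, test_idx in kf.split(documents):
            yield ([documents[i] for i in train_idx],
                   [documents[i] for i in test_idx])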

Separate corpus build-out from word2vec workflow

The idea here is that a corpus can be built in a number of ways:

  • Wikipedia
  • Medical Abstracts
  • Medical Textbooks
  • Manually adding files
  • Any combination of the above options

This doesn't lend itself well to inclusion in the word2vec Drake workflow because of all the moving parts in corpus creation.

Perhaps a better approach would be to have a master corpus creation script (under the fun_3000/ingestion/ directory) which takes as an argument a selection of source types (wiki, abstracts, textbooks, all) and takes care of calling the appropriate ingestion modules. Once corpus creation is done (and any necessary manual edits are made to it), the Drake workflow can be kicked off to execute the rest of the steps.
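
A rough sketch of what that master script might look like; the ingestion module names and their run() entry points are assumptions, not the real modules.

    from optparse import OptionParser

    from fun_3000.ingestion import (wikipedia_ingest, med_abstract_ingest,
                                    textbook_ingest)

    # Map each source type to the ingestion entry points it should invoke.
    SOURCES = {
        'wiki': [wikipedia_ingest.run],
        'abstracts': [med_abstract_ingest.run],
        'textbooks': [textbook_ingest.run],
    }
    SOURCES['all'] = SOURCES['wiki'] + SOURCES['abstracts'] + SOURCES['textbooks']

    def build_corpus(source, data_dir):
        for ingest in SOURCES[source]:
            ingest(data_dir)

    if __name__ == '__main__':
        parser = OptionParser()
        parser.add_option('-s', '--source', default='all',
                          help='one of: wiki, abstracts, textbooks, all')
        parser.add_option('-d', '--data-dir', dest='data_dir', default='data')
        opts, _ = parser.parse_args()
        build_corpus(opts.source, opts.data_dir)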

Stub the evaluation script

The evaluation script is still in a state that expects multiple model files in the folds-based directory structure, but we took that out in PR #64. Also, we hate it.

This issue is to refactor the evaluation script and its Drakefile component so that it just appends random data to the scores files (unless we can anticipate any of the data at this point, such as boost numbers, in which case use the real data). It should remain a stub so that we can evaluate the visualization PR while we figure out the theoretical evaluation tactic we want to use.

Drop the corpus splitting possibly

We have a step where we split up the corpus into k number of different folds and then run word2vec on those folds separately, resulting in k number of model files. Later on during evaluation we take each of those model files, obtain cross-validated accuracy scores for the evaluation task for each model, take the top value for each model and then average those together to represent the efficacy of the entire corpus.

We think we don't like the corpus splitting in the first stage, and this issue is to get rid of it. In the first place, it's introducing a variable we don't really want to test, namely whether the corpus itself is biased. In the second place, the way we pool the models' evaluation scores to represent the whole corpus makes the evaluation measure unnecessarily complex and indirect.

Set up corpus ingestion on The Cloud

We don't want to sit around watching our computers scrape Wikipedia. This issue is to set up ingestion on an AWS instance to pull the corpus against a larger word set and with a broader radius. We may want to modify the code as well to save against an S3 bucket, or we can do the transfer ourselves after the process has completed.

Grabbing Relevant Terms

Right now we have a process which starts with grabbing terms from a text file to run the get_corpus.py process. However, there are only certain terms that we are really interested in. Those 'terms of extreme interest' can be generated from the fun_3000.evaluation.similarity_evaluation.UMLS.get_words_list function. We basically want this function to run and have some capability that saves the result to a file. Ultimately, we probably just want this to run within Drake as the activity that kicks off the whole process. Another option is to run the function one time, save the result to the location where we want the terms file to exist, and add it to configuration control. Your call.
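
Either way, the one-off version is tiny; something like this, with the import path taken from the issue text (so treat it as an assumption):

    from fun_3000.evaluation.similarity_evaluation import UMLS

    def save_terms(path='data/terms.txt'):
        """Dump the unique evaluation terms to the terms file, one per line."""
        with open(path, 'w') as f:
            for term in UMLS.get_words_list():
                f.write(term + '\n')

    if __name__ == '__main__':
        save_terms()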

Storage: Fix to make sure model files are stored properly

Because we are storing our corpus files in folds, in a folder structure like the one described in our README, we need to be able to store the k model files that are generated when gensim runs k times.

Currently, the fun_3000/word2vec.py file is trying to save files to a folder that does not exist:

'models/jazz/2/train/jazz/2/train.model'

IDEAL: We'd like the folder structure to be built as part of the end of the script, and to follow the pattern

models/<data_dir>/<fold>/train.model

(A fix sketch follows the traceback.)

Current full traceback below:

Traceback (most recent call last):
  File "fun_3000/word2vec.py", line 79, in <module>
    run_model(opts.input_data_dir, opts.parallel_workers, opts.context_window, opts.hidden_layer, opts.model_name)
  File "fun_3000/word2vec.py", line 66, in run_model
    model.save(model_path)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1452, in save
    super(Word2Vec, self).save(*args, **kwargs)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/gensim/utils.py", line 486, in save
    pickle_protocol=pickle_protocol)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/gensim/utils.py", line 359, in _smart_save
    pickle(self, fname, protocol=pickle_protocol)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/gensim/utils.py", line 912, in pickle
    with smart_open(fname, 'wb') as fout:  # 'b' for binary, needed on Windows
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/smart_open/smart_open_lib.py", line 99, in smart_open
    return file_smart_open(parsed_uri.uri_path, mode)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/smart_open/smart_open_lib.py", line 379, in file_smart_open
    return open(fname, mode)
IOError: [Errno 2] No such file or directory: '/Users/donaldvetal/Projects/ddl_nlp/models/jazz/2/train/jazz/2/train.model'
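
A minimal fix sketch, assuming we keep gensim's model.save: build the target directory right before saving (paths here are illustrative).

    import os

    def save_model(model, model_path):
        """Create the target directory if needed, then save the gensim model."""
        model_dir = os.path.dirname(model_path)
        if model_dir and not os.path.isdir(model_dir):
            os.makedirs(model_dir)
        model.save(model_path)

    # e.g. save_model(model, 'models/jazz/2/train.model')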

Average models

Take all models from the multiple folds and average them together to get a single model to feed into the evaluation task.
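
A rough sketch of the mechanical part, using gensim's wv attribute (older gensim exposes vocab on the model itself). One big caveat worth noting: independently trained word2vec spaces are not aligned by default, so whether a simple per-word mean is meaningful is exactly what this issue has to work out.

    import numpy as np
    from gensim.models import Word2Vec

    def average_models(model_paths):
        """Return {word: mean vector} over words shared by all fold models."""
        models = [Word2Vec.load(p) for p in model_paths]
        shared = set(models[0].wv.vocab)
        for m in models[1:]:
            shared &= set(m.wv.vocab)
        return {w: np.mean([m.wv[w] for m in models], axis=0) for w in shared}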

Update README to represent proper directory structure during different steps

The README is out of date about what shape the file system should take during the process. We're using the README as the spec for our glue-together scripts, so this issue is to clean it up.

Change the example directories listed for the generate-folds script in the README to be based on runs, not single terms (e.g. "jazz"), since we moved away from single terms and now process multiple terms at a time.

Ontology pruning system

We need to be able to prune out of the ontology any explicit relationship that we expect to appear in our evaluation metric corpus.

Spacy refactor for corpus cleaning

Several things we should do here that spaCy can help us with (a rough sketch of the wrapper follows the list). 🎊

  • Stop reading all the things in at once; make the inputs and outputs generators. I think spaCy's pipeline method will help here.
  • We need a function to clean random characters out of the corpus: mostly remnants of HTML/LaTeX/other code and copyright symbols. Maybe some of spaCy's token flags, and also its string features, could help here?
  • We need a function to split the corpus into sentences (figure out where spaCy does this).
  • We need a function to throw out invalid sentences (ones that don't have SOV structure, are just numbers, are just headers, etc.).
  • We need a wrapper function that calls all of the above, can be run as part of the ingestion task, and outputs a clean set of sentences for fold making.
  • Input should be a directory of .txt files; output should be a new text file. Drake currently expects it to be named output.txt, based on parameters sent from the Drakefile.
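
A rough sketch of that wrapper, assuming spaCy's nlp.pipe for streaming; the validity rule shown is a stand-in for the real SOV/number/header checks.

    import os
    import spacy

    nlp = spacy.load('en_core_web_sm')

    def iter_documents(input_dir):
        """Generator over raw .txt files so we never read everything at once."""
        for name in os.listdir(input_dir):
            if name.endswith('.txt'):
                with open(os.path.join(input_dir, name)) as f:
                    yield f.read()

    def is_valid(sent):
        """Stand-in sentence filter: demand at least one verb and one noun."""
        pos = {tok.pos_ for tok in sent}
        return 'VERB' in pos and ('NOUN' in pos or 'PROPN' in pos)

    def clean_corpus(input_dir, output_path='output.txt'):
        with open(output_path, 'w') as out:
            for doc in nlp.pipe(iter_documents(input_dir)):
                for sent in doc.sents:
                    if is_valid(sent):
                        out.write(sent.text.strip() + '\n')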

Parameterize boost level

We want to be able to parameterize the boost level of the ontologies during a Drakefile run.

This issue is done when we can send an int to the Drake step that generates the folds, controlling how many times an ontology is appended to each fold.
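
The fold-generation side of this is small; something like the following, with names illustrative:

    def boost_fold(fold_text, ontology_text, boost=1):
        """Append the ontology text to a training fold `boost` times."""
        return fold_text + ('\n' + ontology_text) * boost

    # generate_folds.py would then expose the knob, e.g.:
    # parser.add_option('-b', '--boost', type='int', default=1)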

Ingest medical abstracts

We do not have direct access to a large enough corpus of data so we may have to build it ourselves. We want to mash up wikipedia, medical textbooks, and medical abstracts that are sure to cover our evaluation metric dataset. This issue is specifically intended to address ingesting medical abstracts.

Like #8, we are interested in abstracts referencing our evaluation metrics, so this ingestion module should be able to:

  • query one or more abstract sources for each query in the evaluation metric data set
  • retrieve abstracts within some parameterized limit of search results

Input: evaluation metric data set in csv
Output: all the abstracts as a flat text file to disk
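
One possible shape for the PubMed half, using Biopython's Entrez client (an assumption on our part; any E-utilities wrapper would do):

    from Bio import Entrez

    Entrez.email = 'you@example.com'  # NCBI requires a contact address

    def fetch_pubmed_abstracts(term, retmax=20):
        """Return up to retmax PubMed abstracts for a search term, as text."""
        handle = Entrez.esearch(db='pubmed', term=term, retmax=retmax)
        ids = Entrez.read(handle)['IdList']
        if not ids:
            return ''
        handle = Entrez.efetch(db='pubmed', id=','.join(ids),
                               rettype='abstract', retmode='text')
        return handle.read()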

Classification evaluation metric based on ontology dropout analogies

Classifying things based on information extracted from the ontology but not used to train the neural net. Don's example: given A >> B >> C, we remove B and try to classify on it. Another example: in the ontology we may have that plague is a type of disease and a disease is a type of ailment; we would remove the references to disease and leave the link between plague and ailment.

Input: csv of words with classes assigned
Output: df of word vectors and associated classes, indexed by word

Make readme for evaluation script

Add a section to the main readme that gives a high-level overview of what the evaluation script does and how to use it.

We think the current docs are missing:

  • what the score actually is (the math)
  • example output
  • an evaluation example in the manual example; add that, and move the whole Wikipedia manual-example section into the GitHub wiki

Ingest medical textbooks

We do not have direct access to a large enough corpus of data, so we may have to build it ourselves. We want to mash up wikipedia, medical textbooks, and medical abstracts that are sure to cover our evaluation metric dataset. This issue is specifically intended to address ingesting medical textbooks.

As for possible places to find medical textbooks: honestly, we feel like we only need one, so some hard-core Google fu for one or two in text form that we can programmatically retrieve (or simply one-time click-and-download) would be sufficient.

Input: none
Output: medical textbook(s) as a flat text file on disk

Make wikipedia ingestion ignore Notes/References and External Links sections

There are some sections common to wikipedia pages that are not valuable content for us, because they contain primarily links or other information that is not contextual human language. In particular, the Notes, References, and External Links sections that are optionally included in wikipedia page objects are not valuable to us. This issue is closed when our wikipedia ingestion script returns all sections EXCEPT Notes, References, and External Links when pulling data. See the Wikipedia Page object's sections property and section() method in the python wikipedia package for an idea of how to filter this information.
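
A sketch of the filter using those two hooks (function name illustrative):

    import wikipedia

    SKIP = {'Notes', 'References', 'External links'}

    def page_text_without_link_sections(title):
        """Return page text with the link-heavy sections dropped."""
        page = wikipedia.page(title)
        parts = [page.summary]
        for name in page.sections:
            if name in SKIP:
                continue
            content = page.section(name)
            if content:
                parts.append(content)
        return '\n'.join(parts)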

Make a few prod models

Now that we've got boost more or less together, we should run a few boost models on a big (or maybe our existing full) corpus so we have something to investigate with evaluation. Fun!

Clean up the text

The text pulled in as abstracts and wikipedia articles has a bunch of weird shit in it. This issue is to alter generate_folds.py to include rules that clean this up a bit, so the sentences generated are as complete and sentence-y as possible (a regex sketch follows the list).

We are going to:

  • remove all HTML tags (<...>)
  • remove all LaTeX ({...} groups and $...$ math)
  • remove headers from wikipedia articles
  • remove newlines and carriage returns (these mess up the tokenize script)
  • remove all non-ASCII characters (like copyright symbols)
  • remove extraneous spaces (these also mess up the tokenize script)
  • remove sentences that are less than 7 words long, don't end with a period, don't start with a capital letter, or start with a number
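
A first-pass translation of those rules into regexes; these patterns are approximations to iterate on, not the final cleaning logic.

    import re

    def clean_text(text):
        text = re.sub(r'<[^>]+>', ' ', text)                       # html tags
        text = re.sub(r'\$[^$]*\$', ' ', text)                     # $...$ latex math
        text = re.sub(r'\{[^}]*\}', ' ', text)                     # {...} latex groups
        text = re.sub(r'^=+[^=\n]+=+\s*$', ' ', text, flags=re.M)  # wiki headers
        text = text.replace('\r', ' ').replace('\n', ' ')          # newlines / CRs
        text = text.encode('ascii', 'ignore').decode('ascii')      # non-ascii chars
        return re.sub(r'\s+', ' ', text).strip()                   # extra spaces

    def keep_sentence(sent):
        """Starts with a capital (so not a number), ends in '.', >= 7 words."""
        return (len(sent.split()) >= 7 and sent.endswith('.')
                and sent[:1].isupper())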

Bugfix for medical abstract ingestion with illegal variable reference

2016-08-18 00:38:52,978: INFO : Fetched Myopathy term wiki artifacts.
2016-08-18 00:38:53,952: INFO : Document does not have abstract.
Traceback (most recent call last):
  File "fun_3000/get_corpus.py", line 84, in <module>
    fetch_corpus(search_terms, directory, results)
  File "fun_3000/get_corpus.py", line 41, in fetch_corpus
    med_search.get_medical_abstracts(term, data_dir, results)
  File "/home/laura/ddl_nlp/fun_3000/ingestion/med_abstract_ingest.py", line 172, in get_medical_abstracts
    abstracts_pubmed = fetch_pubmed(search_term, results)
  File "/home/laura/ddl_nlp/fun_3000/ingestion/med_abstract_ingest.py", line 64, in fetch_pubmed
    for item in summary:
UnboundLocalError: local variable 'summary' referenced before assignment

Looks like the reference to summary here belongs inside the try/except.
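
Sketch of the fix; the internals of fetch_pubmed are assumed here, and the point is just that summary is only iterated when the fetch succeeded.

    import logging

    def fetch_pubmed(search_term, results):
        summary = []  # safe default if the fetch below fails
        try:
            summary = query_pubmed(search_term, results)  # hypothetical helper
        except Exception:
            logging.info('Document does not have abstract.')
        return [item for item in summary]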

Update evaluation script to take one of the valid evaluation files

We've been using a "fake" evaluation file for our evaluation task since we merged it in, for the purposes of testing the evaluation workflow rather than accuracy. Now that we have a more normal-sized corpus, we should switch to a real evaluation file and see how we do.

It seems there are 3 options for valid evaluation files:

Right now the file used during evaluation testing is hardcoded into the evaluation script, but it should probably be configurable. We should choose whichever one we want to use, set it as the default configuration in the Drakefile, and refactor evaluation to take the configured similarity file as the test set.

Make CLI better

Instead of using optparse over and over, can we structure the CLI to be more consistent throughout the package? (A sketch follows the list.)

  1. Can we handle the path drama here instead of in the super-fragile utils file?
  2. Implement required vs. optional args, and generally a nicer structure for the CLI.
  3. Consistent commands and usage throughout the package.
  4. And more!
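
One possible shape, using argparse subcommands instead of per-script optparse (illustrative, not the final design; subcommand and flag names are made up):

    import argparse

    def build_parser():
        parser = argparse.ArgumentParser(prog='fun_3000')
        sub = parser.add_subparsers(dest='command', required=True)

        ingest = sub.add_parser('ingest', help='build the corpus')
        ingest.add_argument('terms_file')                      # required positional
        ingest.add_argument('--results', type=int, default=5)  # optional flag

        folds = sub.add_parser('folds', help='generate cross-val folds')
        folds.add_argument('data_dir')
        folds.add_argument('--k', type=int, default=5)
        return parser

    if __name__ == '__main__':
        args = build_parser().parse_args()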

Bugfixes to wikipedia ingest to make it more robust

While running the larger corpus we noticed some errors occurring during wikipedia ingestion. This issue is to address those two errors.

  • Disambiguation errors

To reproduce, use the data/eval_words files as your search term input, though you may be able to reproduce with just the "bad" search term, which was Activase. You'll see this error:

/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))
Traceback (most recent call last):
  File "fun_3000/get_corpus.py", line 85, in <module>
    fetch_corpus(search_terms, directory, results)
  File "fun_3000/get_corpus.py", line 40, in fetch_corpus
    wiki_search.get_wikipedia_pages(term, data_dir, results)
  File "/Users/llorenz/Development/ddl/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 52, in get_wikipedia_pages
    save_wiki_text(search_term, local_file_path)
  File "/Users/llorenz/Development/ddl/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 17, in save_wiki_text
    page = wpg(wiki_search_term)
  File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
    return WikipediaPage(title, redirect=redirect, preload=preload)
  File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
    self.__load(redirect=redirect, preload=preload)
  File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
    raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "active" may refer to: 
Active (album)
Active Records
Active (ship)
Active (1764 ship)
Active (1850)
Active (1877)
Active (sternwheeler)
HMS Active
USCS Active (1852)
USCGC Active
USRC Active
USS Active
Active (whaler)
Active Enterprises
Sky Active
Active (pharmacology)
Active, Alabama
ACTIVE
Locomotion No 1
fraternities and sororities
Active lifestyle
Activation
Activity (disambiguation)
Passive (disambiguation)
All pages beginning with "Active"

As far as we can tell, this happens because the search for Activase returns a results list including the page title Active, but when we later retrieve that page with the wikipedia.page method against the title Active, the wikipedia API raises a DisambiguationError. In these cases we want to drop the search result, since we will not be able to disambiguate it programmatically. This could occur at both the first search (i.e., against Activase) and the second round of search (i.e., against Active), so we will need to catch it in both places.
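
The catch itself is simple; something like this in both spots (function name illustrative):

    import wikipedia

    def try_fetch_page_text(title):
        """Return the page text, or None if the title is ambiguous."""
        try:
            return wikipedia.page(title).content
        except wikipedia.exceptions.DisambiguationError:
            return None  # drop it; we can't disambiguate programmatically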

  • Page ID not found

To reproduce, use the following search terms, though you may be able to reproduce with just the "bad" search term, which was Malignant tumor of lung.

Renal failure
Kidney failure
Abortion
Miscarriage
Heart
Myocardium
Stroke
Delusion
Schizophrenia
Calcification
Stenosis
Tumor metastasis
Adenocarcinoma
Congestive heart failure
Pulmonary edema
Pulmonary fibrosis
Malignant tumor of lung
Diarrhea
Stomach cramps

This will trigger this error:

Traceback (most recent call last):
  File "fun_3000/get_corpus.py", line 85, in <module>
    fetch_corpus(search_terms, directory, results)
  File "fun_3000/get_corpus.py", line 40, in fetch_corpus
    wiki_search.get_wikipedia_pages(term, data_dir, results)
  File "/Users/donaldvetal/Projects/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 52, in get_wikipedia_pages
    save_wiki_text(search_term, local_file_path)
  File "/Users/donaldvetal/Projects/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 17, in save_wiki_text
    page = wpg(wiki_search_term)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
    return WikipediaPage(title, redirect=redirect, preload=preload)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
    self.__load(redirect=redirect, preload=preload)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 345, in __load
    raise PageError(self.title)
wikipedia.exceptions.PageError: Page id "malignant tumors of luna" does not match any pages. Try another id!

This one is a little more confusing, since the original term was actually malignant tumors of lung. At some point the code tries to instantiate a WikipediaPage against the title malignant tumors of luna, but fails to find the page. This is further confused by what I think is a bug in the wikipedia driver's error path: the PageError complains about a page id even though we searched by title, which implies the first positional argument is always interpreted as page_id unless page_id is explicitly set to None.

For this one we might also want to look into the auto_suggest flag, as that may be how we're getting the weird respelling of our original search term.
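
Extending the sketch from the disambiguation bullet: pass auto_suggest=False so the driver can't respell the query, and drop PageError results the same way.

    import wikipedia

    def try_fetch_page_text(title):
        """Return the page text, or None if the page is missing or ambiguous."""
        try:
            return wikipedia.page(title, auto_suggest=False).content
        except (wikipedia.exceptions.PageError,
                wikipedia.exceptions.DisambiguationError):
            return None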

Evaluation

Identifying synonyms
Using the UMLS data set, we use the 566 synonym pairs for two related tasks (a sketch of the first follows the list):

  • Logistic scale: take the top quartile (~140 pairs) as "very similar" (=1) and mark all others as "not similar" (=0). Get word2vec vectors for the train/test pairs, compute the cosine distance between the vectors, and build a classifier off that.
  • Multivariate scale: use the coders' continuous scale (maybe just take the mean and ignore the standard deviation?) to train a multivariate regression model to predict the score of a test set of pairs.
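
A sketch of the logistic-scale task; the column names from the UMNSRS csv and the model path are assumptions.

    import pandas as pd
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    model = Word2Vec.load('models/run/1/train.model')  # illustrative path
    vocab = set(model.wv.vocab)

    pairs = pd.read_csv('data/UMNSRS_similarity.csv')  # assumed columns: Term1, Term2, Mean
    pairs = pairs[pairs.Term1.isin(vocab) & pairs.Term2.isin(vocab)]

    # Single feature: cosine similarity between the two term vectors.
    pairs['cosine'] = [model.wv.similarity(a, b)
                       for a, b in zip(pairs.Term1, pairs.Term2)]
    # Top quartile of coder scores = "very similar" (1), everything else 0.
    pairs['label'] = (pairs.Mean >= pairs.Mean.quantile(0.75)).astype(int)

    clf = LogisticRegression().fit(pairs[['cosine']], pairs['label'])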

Retrieve term list directly from a function

Right now our get_corpus.py file expects the user to provide a file that includes the terms we want to search for. However, we have built a function, similarity_evaluation.FEATURE_BUILDER.get_words_list, that returns the unique term list from the UMNSRS_similarity.csv file. We should add an option to simply grab the terms from this location.

Update word2vec.py to fit into revised workflow

The word2vec.py script needs a bit of attention to confirm it's still doing what we want. This is the start of a list of things we need to look at.

  • Right now the README and what the function actually does are out of sync. Fix the README and confirm word2vec.py actually generates a model specific to each fold.
  • Edit the automagically generated filename to include the data dir name (in our case, {SOME RUN}) as well as the fold number (e.g., {SOME RUN}_{SOME FOLD}.model) to reduce confusion.
  • Other things TK

Cache corpus cleaning step

Separate corpus cleaning from the generate-folds step so that it is a true preprocessing step in our normal workflow. This way we can have a clean corpus that we port around, without having to redo the cleaning every time we generate a new set of folds.

Ingest specific wikipedia articles

We do not have direct access to a large enough corpus of data so we may have to build it ourselves. We want to mash up wikipedia, medical textbooks, and medical abstracts that are sure to cover our evaluation metric dataset. This issue is specifically intended to address ingesting specific wikipedia articles that

  • include entries for all terms from the evaluation metric dataset
  • use a radius parameter to include similar entries around each base entry

We should probably build off our existing wikipedia_ingestion.py, but pass in the results parameter from the wikipedia package (sketch below).

Input: evaluation metric dataset in csv
Output: all the related wikipedia articles as a flat text file to disk
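
A sketch of the radius idea on top of wikipedia.search's results argument (function name illustrative):

    import wikipedia

    def fetch_with_radius(term, radius=3):
        """Return text for the term's page plus `radius` related search hits."""
        texts = []
        for title in wikipedia.search(term, results=radius + 1):
            try:
                texts.append(wikipedia.page(title, auto_suggest=False).content)
            except (wikipedia.exceptions.PageError,
                    wikipedia.exceptions.DisambiguationError):
                continue  # drop missing or ambiguous pages (see the bugfix issue)
        return '\n'.join(texts)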
