Repo for DDL research lab project.
I forgot to check the documentation after all the refactor stuff. Assume evaluation is dying. This issue is to remind me of that next week :P
We need to develop a file that parses the downloaded text and returns a training and a test set k-1 times. The next step in the process (gensim word2vec training) should be able to pull just the training data to conduct training. The function should also return test data that is accessible to our evaluation tasks.
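A minimal sketch of what that splitter could look like (the name `kfold_splits` is ours, and the real script would read documents from the on-disk layout described in the README rather than take a list):

```python
import random

def kfold_splits(documents, k=5, seed=42):
    """Yield (train, test) pairs for each of the k folds.

    `documents` is a list of text units (e.g. sentences or files);
    each fold holds out roughly 1/k of them as the test set, and the
    remaining documents form that fold's training set.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # fixed seed so folds are reproducible
    for fold in range(k):
        test = docs[fold::k]  # every k-th document, offset by the fold index
        train = [d for i, d in enumerate(docs) if i % k != fold]
        yield train, test
```

The word2vec step would consume only the `train` half of each pair, while the `test` half gets written wherever the evaluation tasks expect it.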
The idea here is that a corpus can be built in a number of ways:
This doesn't lend itself well to inclusion in the word2vec Drake workflow because of all the moving parts in corpus creation.
Perhaps a better approach would be to have a master corpus creation script (under the fun_3000/ingestion/ directory) which takes as an argument a selection of source types (wiki, abstracts, textbooks, all) and takes care of calling the appropriate ingestion modules. Once corpus creation is done (and any necessary manual edits are made to it), the Drake workflow can be kicked off to execute the rest of the steps.
The evaluation script is still in a state that expects there to be multiple model files in the folds-based directory structure but we took that out in PR #64. Also we hate it.
This issue is to refactor the evaluation script and its Drakefile component so that it just appends random data to the scores files (unless we can at this point anticipate any of the data, such as boost numbers, in which case use the real data). This is just to remain a stub so that we can evaluate the visualization PR as well while we figure out the theoretical evaluation tactic we want to use.
We have a step where we split up the corpus into k number of different folds and then run word2vec on those folds separately, resulting in k number of model files. Later on during evaluation we take each of those model files, obtain cross-validated accuracy scores for the evaluation task for each model, take the top value for each model and then average those together to represent the efficacy of the entire corpus.
We think that we don't like the corpus splitting in the first stage, and this issue is to get rid of it. In the first place, it's introducing a variable we don't really want to test, namely, whether the corpus itself is biased. In the second place, the way we pool the models' evaluation scores to end up representing the whole corpus anyway makes the evaluation measure unnecessarily complex and indirect.
We don't want to sit around watching our computers scrape wikipedia. This issue is to set up ingestion on an AWS instance to pull the corpus against a larger word set and with a broader radius. We may want to modify the code as well to save against an S3 bucket, or we can do the transfer ourselves after the process has completed.
There are cases where we read the entire corpus into memory which is just going to give us more and more pain as the corpus gets larger.
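One way out is a restartable iterator that streams one line at a time, which is all gensim's Word2Vec needs from its `sentences` argument. A sketch (the class name and the `.txt` convention are assumptions about our corpus layout):

```python
import os

class CorpusSentences(object):
    """Stream tokenized sentences from every .txt file under
    `corpus_dir`, one line at a time, so the full corpus never has to
    sit in memory. gensim's Word2Vec accepts any restartable iterable
    of token lists, and it iterates more than once (vocab pass, then
    training passes), which is why this is a class and not a generator.
    """

    def __init__(self, corpus_dir):
        self.corpus_dir = corpus_dir

    def __iter__(self):
        for root, _dirs, files in os.walk(self.corpus_dir):
            for name in sorted(files):
                if not name.endswith('.txt'):
                    continue
                with open(os.path.join(root, name)) as handle:
                    for line in handle:
                        tokens = line.strip().lower().split()
                        if tokens:
                            yield tokens
```

gensim also ships `LineSentence` for the single-file case; something like the above generalizes it to our folder-of-folds layout.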
Right now we have a process which starts with grabbing terms from a text file to run the get_corpus.py process. However, there are only certain terms that we are really interested in. Those 'terms of extreme interest' can be generated from the fun_3000.evaluation.similarity_evaluation.UMLS.get_words_list function. We basically want to run this function and save its output to a file. Ultimately, we probably just want this to run within drake as the activity that kicks off the whole process. Another option is to run that function one time, save the result to the location where we want the terms file to exist, and add it to configuration control. Your call.
We need to get the Drakefile we have to run the full ingestion, fold_generation, and training on each fold in one single run.
Because we are storing our corpus files in folds in a folder structure like the one described in our README, we need to be able to store the k model files that are generated when gensim runs k times.
Currently, the fun_3000/word2vec.py file is trying to save files to a folder that does not exist:
'models/jazz/2/train/jazz/2/train.model'
IDEAL: We'd like the folder structure to be built as a part of the end of the script.
and to follow the pattern
models///train.model
Current Full traceback below:
Traceback (most recent call last):
File "fun_3000/word2vec.py", line 79, in
run_model(opts.input_data_dir, opts.parallel_workers, opts.context_window, opts.hidden_layer, opts.model_name)
File "fun_3000/word2vec.py", line 66, in run_model
model.save(model_path)
File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1452, in save
super(Word2Vec, self).save(*args, **kwargs)
File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/gensim/utils.py", line 486, in save
pickle_protocol=pickle_protocol)
File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/gensim/utils.py", line 359, in _smart_save
pickle(self, fname, protocol=pickle_protocol)
File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/gensim/utils.py", line 912, in pickle
with smart_open(fname, 'wb') as fout: # 'b' for binary, needed on Windows
File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/smart_open/smart_open_lib.py", line 99, in smart_open
return file_smart_open(parsed_uri.uri_path, mode)
File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/smart_open/smart_open_lib.py", line 379, in file_smart_open
return open(fname, mode)
IOError: [Errno 2] No such file or directory: '/Users/donaldvetal/Projects/ddl_nlp/models/jazz/2/train/jazz/2/train.model'
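Since the IOError is just the target directory not existing, the fix is probably to create the model directory right before `model.save` in `run_model`. A sketch (the helper name is ours):

```python
import os

def ensure_parent_dir(model_path):
    """Create the directory portion of `model_path` if it does not
    already exist, so that model.save(model_path) has somewhere to
    write. Returns the path unchanged for convenient inline use."""
    model_dir = os.path.dirname(model_path)
    if model_dir and not os.path.isdir(model_dir):
        os.makedirs(model_dir)
    return model_path
```

Usage in run_model would then be something like `model.save(ensure_parent_dir(model_path))`, after fixing the path construction itself so the fold prefix isn't duplicated.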
Take all models from multiple folds and average together to get single model to feed into evaluation task.
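The averaging step could look like the sketch below, which treats each fold model as a plain dict of word -> vector (a stand-in for gensim's `model.wv`; the function name is ours). One caveat worth flagging: vectors from independently trained word2vec runs live in different coordinate spaces, so naive averaging may need an alignment step before it means anything.

```python
def average_models(models):
    """Average per-word vectors across several fold models.

    Each model is a dict mapping word -> list of floats; a word missing
    from a fold simply contributes nothing for that fold, so its average
    is taken over the folds that actually contain it.
    """
    sums, counts = {}, {}
    for model in models:
        for word, vector in model.items():
            if word not in sums:
                sums[word] = [0.0] * len(vector)
                counts[word] = 0
            sums[word] = [s + v for s, v in zip(sums[word], vector)]
            counts[word] += 1
    return {w: [s / counts[w] for s in sums[w]] for w in sums}
```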
Spin off from #30 to update the README to reflect how pulling ontologies works.
README is out of date about what shape the file system should look like during the process. We're using the README as specs for our glue together scripts so this issue is to clean it up.
Change the example directories listed for the generate folds script in the README to be based on runs, not single terms (e.g. "jazz"), since we moved away from single terms and now handle multiple terms at a time.
We need to be able to prune the ontology so we can remove any explicit relationship that we expect to appear in our evaluation metric corpus.
There are several things we should do here that spaCy can help us with.
We want to be able to parameterize the boost level of the ontologies during drake file run.
This issue is done when we can send an int to the drake step that generates the folds, controlling how many times an ontology is appended to each fold.
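The fold-generation side of that is tiny; something like this sketch (the function name is ours, and `boost` would be the int handed in from the Drakefile):

```python
def boost_fold(fold_sentences, ontology_sentences, boost=1):
    """Return a fold's training sentences with the ontology sentences
    appended `boost` times. boost=0 leaves the fold untouched."""
    if boost < 0:
        raise ValueError('boost must be non-negative')
    return list(fold_sentences) + list(ontology_sentences) * boost
```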
We do not have direct access to a large enough corpus of data so we may have to build it ourselves. We want to mash up wikipedia, medical textbooks, and medical abstracts that are sure to cover our evaluation metric dataset. This issue is specifically intended to address ingesting medical abstracts.
Like #8 we are interested in abstracts referencing our evaluation metrics, so this ingestion module should be able to
Input: evaluation metric data set in csv
Output: all the abstracts as a flat text file to disk
Classifying things
Based on some information extracted from the ontology but not used to train the neural net. Don's example: A >> B >> C, so we remove B and try to classify on it. Another example: the ontology may have "plague is a type of disease" and "a disease is a type of ailment"; we would remove references to disease and leave the link between plague and ailment.
Input: csv of words with classes assigned
Output: df of word vectors and associated classes, indexed by word
TBA
Add section to main readme that gives high level overview of what the evaluation script does and how to use it.
We think the current docs are missing
We do not have direct access to a large enough corpus of data so we may have to build it ourselves. We want to mash up wikipedia, medical textbooks, and medical abstracts that are sure to cover our evaluation metric dataset. This issue is specifically intended to address ingesting medical textbooks. Possible places to find medical textbooks might be:
Honestly we feel like we only need one so some hard core google fu for one or two in text form that we can programmatically retrieve (or simply one time click and download) would be sufficient.
Input: none
Output: medical textbook(s) as a flat text file on disk
There are some sections common to wikipedia pages that are not valuable content to us because they contain primarily links or other information that is not contextual human language. In particular, the Notes, References, and External Links sections that are optionally included in wikipedia page objects are not valuable to us. This issue is closed when our wikipedia ingestion script returns all sections EXCEPT the Notes, References, and External Links sections when pulling data. You can see information on a Wikipedia Page object's sections property and section() method for an idea of how to filter this information with the python wikipedia API.
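A sketch of the filter, written against the interface the python wikipedia package documents (a `sections` list of titles and a `section(title)` accessor); the function name is ours, and note a real page also has lead/summary text before the first section that we'd still want to keep:

```python
# Section titles we never want in the corpus. Both capitalizations are
# listed since wikipedia pages are not perfectly consistent.
EXCLUDED_SECTIONS = {'Notes', 'References', 'External links', 'External Links'}

def filtered_page_text(page):
    """Rebuild a page's text from its sections, skipping the excluded
    ones. `page` is assumed to look like a WikipediaPage from the
    python wikipedia package: `page.sections` is a list of section
    titles and `page.section(title)` returns that section's text
    (or None if it is empty)."""
    parts = []
    for title in page.sections:
        if title in EXCLUDED_SECTIONS:
            continue
        body = page.section(title)
        if body:
            parts.append(body)
    return '\n'.join(parts)
```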
Now that we've got boost more or less together, we should run a few boost models on a big (or maybe our existing full) corpus so we have something to investigate with evaluation. Fun!
The text pulled in as abstracts and wikipedia has a bunch of weird shit in it. Alter generate_folds.py to include rules to clean this up a bit so the sentences generated are as complete and sentence-y as possible.
We are going to:
2016-08-18 00:38:52,978: INFO : Fetched Myopathy term wiki artifacts.
2016-08-18 00:38:53,952: INFO : Document does not have abstract.
Traceback (most recent call last):
File "fun_3000/get_corpus.py", line 84, in <module>
fetch_corpus(search_terms, directory, results)
File "fun_3000/get_corpus.py", line 41, in fetch_corpus
med_search.get_medical_abstracts(term, data_dir, results)
File "/home/laura/ddl_nlp/fun_3000/ingestion/med_abstract_ingest.py", line 172, in get_medical_abstracts
abstracts_pubmed = fetch_pubmed(search_term, results)
File "/home/laura/ddl_nlp/fun_3000/ingestion/med_abstract_ingest.py", line 64, in fetch_pubmed
for item in summary:
UnboundLocalError: local variable 'summary' referenced before assignment
Looks like this here belongs in the try/except.
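Concretely, `summary` is only assigned inside the try block, so when the fetch raises, the later `for item in summary:` loop trips the UnboundLocalError. Initializing it before the try fixes that; a sketch with a stand-in `fetch` callable in place of the real pubmed call:

```python
def collect_summary_items(fetch, term):
    """Fetch a pubmed-style summary for `term` and collect its items.

    `fetch` is a stand-in for the call that currently lives inside the
    try block of fetch_pubmed. Initializing `summary` before the try
    means the loop below always has a value, so a failed fetch yields
    an empty result instead of an UnboundLocalError.
    """
    summary = []  # safe default in case the fetch raises
    try:
        summary = fetch(term)
    except Exception as exc:
        print('Could not fetch summary for %r: %s' % (term, exc))
    return [item for item in summary]
```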
HTTP
IAM users
We've been using a "fake" evaluation file for our evaluation task since we merged it in for the purposes of testing evaluation workflow, not accuracy. Now that we have a more normal-sized corpus, we should switch out to a real evaluation file and see how we do.
It seems there are 3 options for valid evaluation files:
Right now the file used during evaluation testing is hardcoded into the evaluation script but should probably be configurable. We should choose whichever one we want to use, set it as our default configuration in the Drakefile, and refactor evaluation to take the configured similarity file as the test set.
Instead of using optparse over and over, can we better structure the CLI to be more consistent throughout package?
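One option is a single shared argparse parser that each script extends, instead of every module re-declaring its own optparse options. The flag names below are illustrative, not the package's actual options:

```python
import argparse

def build_parser():
    """Shared CLI parser for fun_3000 scripts. Individual scripts can
    add their own arguments on top of these common ones, so option
    names and defaults stay consistent across the package."""
    parser = argparse.ArgumentParser(prog='fun_3000')
    parser.add_argument('-d', '--data-dir', default='data',
                        help='root directory for corpus files')
    parser.add_argument('-k', '--folds', type=int, default=5,
                        help='number of folds to generate')
    return parser
```

A script would then do `parser = build_parser()`, add its own flags, and call `parser.parse_args()`.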
While running the larger corpus we noticed some errors occurring while doing wikipedia ingestion. This issue is to address those two errors.
To reproduce, use the data/eval_words files as your search term input, though you may be able to reproduce just with the "bad" search term, which was Activase. You'll see this error:
/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")
markup_type=markup_type))
Traceback (most recent call last):
File "fun_3000/get_corpus.py", line 85, in <module>
fetch_corpus(search_terms, directory, results)
File "fun_3000/get_corpus.py", line 40, in fetch_corpus
wiki_search.get_wikipedia_pages(term, data_dir, results)
File "/Users/llorenz/Development/ddl/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 52, in get_wikipedia_pages
save_wiki_text(search_term, local_file_path)
File "/Users/llorenz/Development/ddl/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 17, in save_wiki_text
page = wpg(wiki_search_term)
File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
return WikipediaPage(title, redirect=redirect, preload=preload)
File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
self.__load(redirect=redirect, preload=preload)
File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "active" may refer to:
Active (album)
Active Records
Active (ship)
Active (1764 ship)
Active (1850)
Active (1877)
Active (sternwheeler)
HMS Active
USCS Active (1852)
USCGC Active
USRC Active
USS Active
Active (whaler)
Active Enterprises
Sky Active
Active (pharmacology)
Active, Alabama
ACTIVE
Locomotion No 1
fraternities and sororities
Active lifestyle
Activation
Activity (disambiguation)
Passive (disambiguation)
All pages beginning with "Active"
As far as we can tell this is because from the search term Activase we receive a results list including the page title Active, but when we later try to retrieve the wikipedia page with the wikipedia.page method against the title Active, it returns a DisambiguationError from the wikipedia API. In these cases, we want to drop the search result since we will not be able to determine how to disambiguate it programmatically. This could occur at both the first search (i.e. against Activase) and the second round of search (i.e. against Active), so we will need to catch both places.
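Both call sites could route through one small wrapper like the sketch below. The wrapper and its names are ours; for the real package the `skippable_errors` tuple would be `(wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError)`, which covers the second error in this issue as well.

```python
def safe_fetch_page(fetch, title, skippable_errors):
    """Try to fetch a wikipedia page via the `fetch` callable (e.g. a
    wrapped wikipedia.page call), returning None when the lookup raises
    one of the `skippable_errors` we have decided to drop rather than
    disambiguate programmatically."""
    try:
        return fetch(title)
    except skippable_errors:
        print('Skipping ambiguous/missing page: %r' % title)
        return None
```

Callers then just skip the term whenever they get None back.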
To reproduce, use the following search terms, though you may be able to reproduce using just the "bad" search term, which was Malignant tumor of lung:
Renal failure
Kidney failure
Abortion
Miscarriage
Heart
Myocardium
Stroke
Delusion
Schizophrenia
Calcification
Stenosis
Tumor metastasis
Adenocarcinoma
Congestive heart failure
Pulmonary edema
Pulmonary fibrosis
Malignant tumor of lung
Diarrhea
Stomach cramps
This will trigger this error:
Traceback (most recent call last):
File "fun_3000/get_corpus.py", line 85, in <module>
fetch_corpus(search_terms, directory, results)
File "fun_3000/get_corpus.py", line 40, in fetch_corpus
wiki_search.get_wikipedia_pages(term, data_dir, results)
File "/Users/donaldvetal/Projects/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 52, in get_wikipedia_pages
save_wiki_text(search_term, local_file_path)
File "/Users/donaldvetal/Projects/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 17, in save_wiki_text
page = wpg(wiki_search_term)
File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
return WikipediaPage(title, redirect=redirect, preload=preload)
File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
self.__load(redirect=redirect, preload=preload)
File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 345, in __load
raise PageError(self.title)
wikipedia.exceptions.PageError: Page id "malignant tumors of luna" does not match any pages. Try another id!
This one is a little more confusing at this point, since the original term was actually malignant tumors of lung. At some point it tries to instantiate a WikipediaPage against the title malignant tumors of luna but fails to find the page. This is further confused because I think there is a bug in the path for the PageError to return the right error message: here it is complaining about page id, though we searched by title. As I mentioned, I think that is a wikipedia driver bug, since this implies to me the first positional argument will always be interpreted as page_id unless page_id is explicitly set to None.
For this one we might also want to figure out more about the auto_suggest flag, as that may be how we're getting the weird respelling of our original search term.
Placeholder right now for discussion about the OGMS Ontology.
input: ontology URLs that are in rdf/xml
output: sentences
Identifying synonyms
Using the UMLS data set, we use the 566 synonyms to do two related tasks:
Right now our get_corpus.py file expects the user to provide a file that includes the terms we want to search for. However, we have built a function, similarity_evaluation.FEATURE_BUILDER.get_words_list, that returns the unique term list from the UMNSRS_similarity.csv file. We should add an option to simply grab the terms from this location.
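If we end up reimplementing the term extraction for get_corpus.py directly, it is only a few lines; this sketch assumes the two term columns are named `Term1` and `Term2`, which should be checked against the real UMNSRS_similarity.csv header:

```python
import csv

def unique_terms(csv_path, term_columns=('Term1', 'Term2')):
    """Collect the sorted unique terms from a UMNSRS-style similarity
    csv. The column names are assumptions about the file layout."""
    terms = set()
    with open(csv_path) as handle:
        for row in csv.DictReader(handle):
            for column in term_columns:
                if row.get(column):
                    terms.add(row[column].strip())
    return sorted(terms)
```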
Word2vec.py script needs a bit of attention to confirm it's still doing what we want. This is the start of a list of things we need to look at.
Need to add usage instructions for get_corpus and for the individual ingestion scripts, and update the first section accordingly.
Separate corpus cleaning from the generate folds step so that it is actually a preprocess step for our normal workflow. This way we can have a clean corpus that we port around without having to redo it every time we generate a new folds situation.
We do not have direct access to a large enough corpus of data so we may have to build it ourselves. We want to mash up wikipedia, medical textbooks, and medical abstracts that are sure to cover our evaluation metric dataset. This issue is specifically intended to address ingesting specific wikipedia articles that
Probably should simply build off of our existing wikipedia_ingestion.py but throw the results parameter from the wikipedia package in there.
Input: evaluation metric dataset in csv
Output: all the related wikipedia articles as a flat text file to disk