lisc-tools / lisc

Literature Scanner: Automated collection & analyses of the scientific literature.

Home Page: https://lisc-tools.github.io/

License: Apache License 2.0

Python 97.96% TeX 0.92% Makefile 1.12%
literature-mining literature-review meta-analysis scientific-publications text-mining web-scraping

lisc's People

Contributors

bcipolli, danielskatz, hahaschool, jasongfleischer, koudyk, ryanhammonds, tomdonoghue


lisc's Issues

Error on tutorial n°2

I ran
python3 plot_02-CountsAnalysis.py
and got the following error:

For  'frontal lobe'    the highest association is  'audition'        with         377
For  'temporal lobe'   the highest association is  'audition'        with        1299
For  'parietal lobe'   the highest association is  'audition'        with         231
For  'occipital lobe'  the highest association is  'vision'          with         236
For  'vision'          the highest association is  'occipital lobe'  with         236
For  'audition'        the highest association is  'temporal lobe'   with        1299
For  'somatosensory'   the highest association is  'parietal lobe'   with         187
For  'olfaction'       the highest association is  'temporal lobe'   with          71
For  'gustation'       the highest association is  'temporal lobe'   with          44
For  'proprioception'  the highest association is  'parietal lobe'   with           9
For  'nociception'     the highest association is  'temporal lobe'   with         198
[[1.05994414e-02 2.69999284e-02 9.59679152e-03 3.58089236e-03
  1.14588555e-03 2.86471389e-04 1.14588555e-02]
 [6.88251619e-03 4.80666050e-02 6.80851064e-03 2.62719704e-03
  1.62812211e-03 7.40055504e-05 7.32654949e-03]
 [1.75801448e-02 4.77766287e-02 3.86763185e-02 1.65460186e-03
  1.24095140e-03 1.86142709e-03 2.33712513e-02]
 [6.44984969e-02 2.32303908e-02 9.56545504e-03 1.36649358e-03
  0.00000000e+00 0.00000000e+00 1.72178191e-02]]
[[1.16386971e-03 3.37408488e-03 3.15234779e-03 1.86908901e-03
  3.77287304e-04 2.14454214e-04 2.63519801e-04]
 [1.32680867e-03 1.04864621e-02 3.31412104e-03 1.78427825e-03
  7.93622164e-04 6.30596544e-05 3.19257517e-04]
 [7.19747326e-04 2.24813142e-03 5.61106610e-03 4.52872913e-04
  1.80234305e-04 9.45477466e-04 1.88936671e-04]
 [2.02106705e-03 8.35610782e-04 1.08349070e-03 3.03177298e-04
  0.00000000e+00 0.00000000e+00 1.05535063e-04]]
Traceback (most recent call last):
  File "plot_02-CountsAnalysis.py", line 129, in <module>
    plot_matrix(counts.score, counts.terms['B'].labels, counts.terms['A'].labels)
  File "/usr/local/lib/python3.5/dist-packages/lisc/plts/utils.py", line 93, in decorated
    func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/lisc/plts/counts.py", line 45, in plot_matrix
    cmap = get_cmap(cmap)
  File "/usr/local/lib/python3.5/dist-packages/lisc/plts/utils.py", line 49, in get_cmap
    cmap = sns.cubehelix_palette(as_cmap=True)
AttributeError: 'bool' object has no attribute 'cubehelix_palette'

Load words objects directly

Right now, to re-load Words data, you have to load the overall object, and then separately load results data. Q: would it be more convenient to load results directly when using load_object?

# Current loading procedure:
words = load_object('tutorial_words', SCDB('tutorials/lisc_db')) # This doesn't load results
words.results[0].load(directory=SCDB('tutorials/lisc_db')) # This loads the first Words result

The reason for loading separately is that for large scrapes, the data can be large, and we don't always want to load it all together. So we do want an option to not re-load all Articles data. However, it would also be convenient to be able to do so. I'm thinking maybe a keyword option (reload_results or something) that would optionally reload all the data in load_object if set to True, as sketched below.
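
A rough sketch of what that could look like (reload_results is a hypothetical keyword here, not an existing option):

# Possible loading procedure, if a reload_results keyword were added to load_object:
words = load_object('tutorial_words', SCDB('tutorials/lisc_db'), reload_results=True)
# With reload_results=True, this would also load each entry in words.results,
# equivalent to calling words.results[ind].load(directory=...) for each result.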

Note: idea from @ryanhammonds

Question about extracting titles with collect_counts

Hi,

Sorry to bother you. When I used 'collect_counts' to fetch the articles, it successfully showed the number of papers that include the term. However, I want to know the titles of the articles that correspond to those counts. I tried using print(meta_dat), but it did not include the titles. My code and result follow:
##################
coocs, term_counts, meta_dat = collect_counts(
    terms_a=terms_a, terms_b=terms_b, db='pubmed',
    save_and_clear=True, usehistory=True, verbose=True)
##################
[screenshot of the result omitted]

Is there a way to extract the article titles that correspond to those counts?
Thanks for your help!
Best

Add support for searching by date

It would be nice to add support for specifying date ranges to search for with LISC, for EUtils.

There are some notes on optional parameters for dates here:
https://www.ncbi.nlm.nih.gov/books/NBK25499/

There look to be two ways to do this:

    1. search for things in the last n days
    2. search for things between two specified dates

There is also a datetype option to specify things like "modification date" or "publication date". By default, we want publication date, though we might as well make this user definable as well. I think, if it's easy, adding support for both types of date search makes sense. Note that these settings are only relevant to EQuery and ESearch (I think).

Mainly, I think doing this relates to the URL handling in the eutils file:
https://github.com/lisc-tools/lisc/blob/master/lisc/urls/eutils.py

Otherwise, this will also just need some checking through the code for managing & accepting these extra settings.

I'm hoping this is an easy extension! This is the kind of update that depends somewhat on knowing a bit about the codebase, and the overall organization and approach - which in this case is to connect to a RESTful API (from EUtils).
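
For reference, a rough sketch of the EUtils date settings involved (parameter names are from the EUtils docs linked above; how LISC should expose them is an open question):

# Sketch of an ESearch request with date settings added (EUtils parameter names):
base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
settings = {
    'db': 'pubmed',
    'term': 'brain',
    'datetype': 'pdat',       # publication date (the default we want)
    'mindate': '2015/01/01',  # option 2: between two specified dates
    'maxdate': '2020/12/31',
    # 'reldate': 60,          # option 1: within the last n days, instead of min/max
}
url = base_url + '?' + '&'.join('{}={}'.format(key, val) for key, val in settings.items())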

Tutorial 01 - Counts Searching: Comments

"Counts" Section:

  • Two "Counts" section headings. I think the first one is probably not necessary
  • Typo: "Specifically, it search" --> "Specifically, it searches"
  • Typo: "papers where found" --> "papers were found"
  • Explain the numbers for percent of paper overlap
  • Typo: "Print out many" --> "Print out how many"
  • Print out some of the variables such as dat_numbers and dat_percent for co-occurrence between two sets

"Count Object" section:

  • May be useful to include a short note explaining when/why it might be useful to use the object method over the function method
  • May look better if spacing between words in function outputs was reduced
  • Short explanation of what "highest association" means may be useful. Could also be more intuitive if a list with more than two terms was used

"Co-occurrence data - different word lists" section:

  • May look neater if section header was larger
  • Explicitly state (before code) that two separate lists are being used, and co-occurrence of terms between the two lists is being checked
  • Output template may be better if changed from "For the (term)" to "For (term)"

"Synonyms and Exclusion Words" section:

  • Typo: "For example, a using" --> "For example, using"
  • Weird quotations and spacing for "gene" OR "genetic" NOT "protein"
  • Point out that exclusion words correspond to the respective search terms, i.e. that "protein" is the exclusion term for gene/genetic and "subcortical" is the exclusion term for cortex/cortical
  • Include output for counts.terms['A'].labels
  • In the final note, point out that using the scrape_counts function is an alternative to using the Counts object

Automated tests

Are there automated tests or manual steps described so that the function of the software can be verified?

typo in tutorial 00

Hello! Thanks for making this great package!

I noticed an inconsistency in these lines of the Overview tutorial:

147 # For example, the following set of search term components:
148 #
149 # - search terms ['brain', 'cortex']
150 # - inclusion words ['biology', 'biochemistry']
151 # - exclusion words ['body', 'corporeal']
152 #
153 # All combine to give the search term of:
154 #
155 # - `'("gene"OR"genetic)AND("biology"OR"biochemistry")NOT("body"OR"corporeal)'`

In order for the components to actually combine to the search term, I assume that either

  • line 149 should be search terms ['gene', 'genetic'], or
  • line 155 should be '("brain"OR"cortex)AND("biology"OR"biochemistry")NOT("body"OR"corporeal)'

I thought probably the latter was correct, since 'gene' and 'genetic' hadn't been mentioned yet in the examples.

Cheers

Code Check & Add Docstring Examples

For @ryanhammonds : while you are doing a check through the codebase and familiarizing yourself with the code, I think we should try and add more Examples in docstrings.

Here's roughly what I'm thinking:

  • Do a sweep through of the codebase itself, getting familiar with the code, and doing a quick 'audit' check through the code
  • If you notice any broader issues or problems, then you can open issues, and if you see any nitpicks, typos, etc, you can edit them directly to merge
  • As you go, see if and where it might make sense to add an Examples section to the docstring. As a starting point, I'd say functions / classes listed in the API page are all candidates to have doctest examples.
  • Add any Examples sections that you think would be useful to docstrings, and follow doctest format for them all
  • PR the docstring updates, along with any small tweaks

Notes on doctest:

  • Doctest examples can be set up to be executed (with or without a tested output) or use the # doctest:+SKIP directive, after the example line, to indicate it shouldn't be run
  • It's somewhat open ended (up to you) when doctests should be set to run or skip, but I would say don't add too much set-up code just for the sake of having it execute.
  • To a certain extent, doctest examples can show quick versions of stuff that's covered in the tutorials. Try to use them to show clear and specific cases people might want to try.
  • Please add a way to run doctests (specifically) to the summary.sh file (like in fooof), and/or some other way to check & run these examples.
  • Note that you can run doctests in specific files with python -m doctest -v file.py, or run a whole module, with pytest, with pytest --doctest-modules --ignore=module/tests module (that also includes the option to ignore explicit test files)
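
As a rough sketch, a doctest-formatted Examples section could look something like the following (the example content is up to you; add_terms is used here only because it appears elsewhere in these issues):

    Examples
    --------
    Add a set of search terms to an object:

    >>> from lisc import Words
    >>> words = Words()
    >>> words.add_terms([['brain'], ['body']])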

NLP dependency

LISC has requirements for some basic text processing, like tokenization and removing stopwords.

Right now, this is done with a dependency on nltk, and in particular requiring downloaded data files from nltk. Some things can fail if they are run without these downloaded files. It would perhaps be nice to update how nltk stuff is managed or maybe even drop / change the nltk dependency, as it seems a bit bulky for what we need / do.

Possibilities:

  • Can we just copy these files into the module, and use them locally? I think they are pretty small, and this would avoid the download issue.
  • And/or - maybe there is a way to make nltk an optional dependency?
  • There could be other / nicer ways to deal with the data files. Perhaps better error handling and documentation of what files are needed (a rough sketch of this is included below)? Or enforce that they get downloaded at install?
  • We might want to consider switching the dependency used for text processing. Maybe spacy? It seems newer and cleaner.

However / counter-point: if this turns out to be really messy, I don't think it's a huge deal (everything works), and it might be reasonable to just more explicitly say "this module requires these files downloaded from nltk", perhaps try to enforce this during install, and avoid future problems from there.
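
For the error-handling option above, a rough sketch of one possible approach (this assumes 'punkt' and 'stopwords' are the required data files, which should be checked against what the code actually uses):

import nltk

def check_nltk_data():
    """Check that the required nltk data files are available, downloading any that are missing."""
    # Assumed required files: the punkt tokenizer and the stopwords corpus.
    for path, name in [('tokenizers/punkt', 'punkt'), ('corpora/stopwords', 'stopwords')]:
        try:
            nltk.data.find(path)
        except LookupError:
            nltk.download(name)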

Release on PYPI

When it's ready - release LISC 0.1.0 on PyPI.

Note: when pushed to PyPI, add the PyPI badge and the Python badge to the README.

Tutorial 03 - Words Scraping: Comments

"Words" Section:

  • Typo: First sentence incomplete

"Function Approach: scrape_words" section:

  • Add explanation of parameters (maybe as a comment) for scrape_words function
  • Include outputs for dat, d1
  • "Each data object...papers" : could be rephrased/elaborated to be clearer, maybe give examples for what exactly the "data" is
  • "Print out some data": Might be useful to include a short note about what different information can be printed

"Object Approach: Words" Section:

  • Include words.results output
  • Add code for printing out information, such as in function approach part
  • "demonstrated above" --> "demonstrated in Tutorial 01"

"Metadata" Section:

  • Include words.meta_dat output
  • Typo: "scrape It" --> "scrape. It"
  • Specify that the Requester object is saved as 'req'
  • Can the 'en_time' attribute name be changed to 'end_time'? That might be more intuitive
  • Can seconds be included in the time? Minutes only might not be informative for smaller searches.

Updates for Sphinx

Updates for sphinx documentation:

  • Convert the README to rst format
  • Once README is converted, drop the m2r doc requirement (drop readme_link.rst file)
  • In the API list, check links & rendering
    • links: check for root links (currentmodule directive), and consolidate on linking to the 'shallow' link (where __init__ is / where we expect people to import from), and make sure it works so that you get the in-code links in rendered sphinx-gallery pages
    • check rendering of objects on the API page, and edit conf.py to tweak anything needed (such as inherited members). This might involve checking and setting inherited-members in autodoc_default_options in conf.py
  • In sphinx-gallery examples: check through for function / class links, in the text sections. Update to use relative links (leading period) and make sure it works to get the links in the text parts of rendered sphinx-gallery pages
  • Add autosummary templates (the ones copied from FOOOF will add the examples links)
    • make sure the examples links from the API pages are working.
    • make sure backrefs are linked and generated (sphinx_gallery_conf)
  • Update data objects for rendering nicely on API page
    • Edit data objects to be wrapped as classes
    • Add and use a template for data objects
  • Add the sphinx-copybutton extension (add to conf.py)
  • sphinx Makefile options
    • add check command in makefile
    • make sure the install command calls make clean
  • HTML theming and options
    • make body_max_width equal to None in html_theme_options (a conf.py sketch covering these settings follows the notes below)

Notes:

  • for all of these updates (and in particular updating settings in sphinx conf & Makefile, etc) the edits needed here can basically be copied from how things were changed in FOOOF. To see this, check the file diffs in fooof-tools/fooof#170
  • the autosummary, API & sphinx-gallery lists all need to be somewhat coordinated to get all the links to work between each other, and this also relates somewhat to where functions are linked in the actual code. It may make sense to change some init links.
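
As a rough sketch, the conf.py side of a few of these items could look something like the following (exact values should be checked against the FOOOF setup linked above):

# Possible conf.py additions (to be checked against the FOOOF changes):
extensions = [
    'sphinx.ext.autodoc',
    'sphinx_gallery.gen_gallery',
    'sphinx_copybutton',          # add the sphinx-copybutton extension
]

autodoc_default_options = {
    'inherited-members': None,    # render inherited members on the API page
}

html_theme_options = {
    'body_max_width': None,       # let the body use the full page width
}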

URL length limit reached when fetching many articles

Hi there, lisc is awesome and we have used it for a while, but we recently encountered some issues when searching for papers.

To reproduce, we could build a method called find_papers(num: int) first:

from lisc.objects.base import Base
from lisc import Words
import re
import urllib.request

def find_papers(num):
    terms = [['Pembrolizumab']]
    inclusions = [['Predict']]
    exclusions = [['TIDE']]
    base = Base()
    base.add_terms(terms)
    base.add_terms(inclusions, 'inclusions')
    base.add_terms(exclusions, 'exclusions')
    print(base.make_search_term(0))
    words = Words()
    words.add_terms(terms)
    words.add_terms(inclusions, 'inclusions')
    words.add_terms(exclusions, 'exclusions')
    words.run_collection(retmax=num)
    # Return the list of PubMed IDs.
    pmid = words.results[0].ids
    return pmid

We called find_papers(100) and it worked fine, but we want to retrieve all papers, so we increased num to 400+, and we found that past a certain number we suddenly cannot retrieve anything. After reading the code, we found this should be caused by the URL length limit. We have prepared a fix and we hope this can resolve the issue: #85 .

Thank you.

Adam

Add full text scraping with PMC

We currently only offer functionality for collecting 'words' at the level of abstract text.

It would be nice to add functionality to be able to use the PMC database, and collect full text materials.

Make version of tests without collections

The tests are currently fairly slow, mostly because they launch a bunch of test scrapes.

It would be nice to have a switch for the tests, to run a test suite with test data, but without launching any collections. This would serve as a quick version, and allow testing components without internet access. One possible approach is sketched below.
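
A rough sketch of one way such a switch could work (the environment variable and test name here are hypothetical):

import os
import pytest

# Hypothetical switch: only run tests that launch collections when explicitly requested.
RUN_COLLECTIONS = os.getenv('LISC_RUN_COLLECTIONS', '0') == '1'

@pytest.mark.skipif(not RUN_COLLECTIONS, reason='Skipping tests that launch collections.')
def test_collect_counts_live():
    ...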

Add utility to check URL

It would be nice to have a quick way to see / check the URL that is built for a particular search / request.

It might also be useful to have logging, to keep track of every URL requested (easier for debugging).

Sort out DB

LISC inherited a DB object, but it's over-tuned to a specific project / computer. Need to figure out if and how to get this into a generalized form.

Qs:

  • Where to put / how to organize terms files.
  • Would a generic version of a DB object be useful?

logging

It would be nice to have better logging / printing as searches are run. In particular, for debugging, collecting and printing full URLs and/or term lists would be really useful.
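
A minimal sketch of what this could look like, using the standard logging module (where exactly this would hook into the request code is still an open question):

import logging

logger = logging.getLogger('lisc')

def log_request(url, terms=None):
    """Log a requested URL and, optionally, the associated term list, for debugging."""
    logger.info('Requesting URL: %s', url)
    if terms:
        logger.info('Search terms: %s', ', '.join(terms))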

Sort out plots

The plotting functions aren't well organized or set up for this to be a general utility.

Align word and count searches

Hello,
thank you for the nice package and documentation!

I managed to search for co-occurrences using the pattern
terms = [[A], [B]]
inclusions = [C]
arriving at a search string of "(A) AND (B) AND (C)".

How can I design a word search with the same search string?
So far I have only managed "(A) OR (B) AND (C)" or "(A) AND (B OR C)" via inclusions. I can then get the DOIs from the results and find overlaps, but that's rather cumbersome and not within the package framework.

Any help is appreciated.

Multiple terms collect question

Hi,

Thanks for developing the 'lisc' tool. I am curious about collecting words with multiple terms. I have two lists of terms (terms_A and terms_B) and I want to use 'collect_words' to query the titles and abstracts from PubMed. However, I don't know how to collect the co-occurrence of these terms.

BTW, I noticed that 'collect_words' has inclusion and exclusion parameters (collect_words(terms, inclusions=None, exclusions=None)), but I don't know how to use the inclusions - could you give me an example?

Thanks!

Update tests.

ToDo item: update the test suite (the tests haven't been properly updated since the conversion from the ERP_SCANR code).

Co-occurrence search within the pre-filtered data in Articles

Hi @TomDonoghue !

As always, super thankful for your very useful tools :)

I was wondering if you can do a co-occurrence search within the pre-filtered data in Articles, or whether you would have to manage this through the inclusion terms?

In this case for example:

Terms = ['word1', 'word2', 'word3'] --> Articles

TermsA = ['X']
InclusionA = ['x1', 'x2', Terms]

TermsB = ['Y']
InclusionB = ['y1', 'y2', Terms]

Terms: will look for
'word1' OR 'word2' OR 'word3'
Whereas TermsA with InclusionA: will look for
'X' AND ('x1' OR 'x2' OR ('word1' OR 'word2' OR 'word3'))
Or instead will it look for:
'X' AND (('x1' OR 'x2') AND ('word1' OR 'word2' OR 'word3'))
How should it be done to obtain the latter case?

Thanks a lot!

run_collection error

Hi,

I ran into a problem with 'words.run_collection'. When I followed the tutorial as below:
#######################
db = SCDB('lisc_db')
words.run_collection(usehistory=True, retmax=15, save_and_clear=True, directory=db)
#######################
It reported the error:
FileNotFoundError: [Errno 2] No such file or directory: 'lisc_db\data\words\raw\ovary.json'

Could you give me a hand?
Thanks!
