lisc-tools / lisc
Literature Scanner: Automated collection & analyses of the scientific literature.
Home Page: https://lisc-tools.github.io/
License: Apache License 2.0
I ran:
python3 plot_02-CountsAnalysis.py
and got the following error:
For 'frontal lobe' the highest association is 'audition' with 377
For 'temporal lobe' the highest association is 'audition' with 1299
For 'parietal lobe' the highest association is 'audition' with 231
For 'occipital lobe' the highest association is 'vision' with 236
For 'vision' the highest association is 'occipital lobe' with 236
For 'audition' the highest association is 'temporal lobe' with 1299
For 'somatosensory' the highest association is 'parietal lobe' with 187
For 'olfaction' the highest association is 'temporal lobe' with 71
For 'gustation' the highest association is 'temporal lobe' with 44
For 'proprioception' the highest association is 'parietal lobe' with 9
For 'nociception' the highest association is 'temporal lobe' with 198
[[1.05994414e-02 2.69999284e-02 9.59679152e-03 3.58089236e-03
1.14588555e-03 2.86471389e-04 1.14588555e-02]
[6.88251619e-03 4.80666050e-02 6.80851064e-03 2.62719704e-03
1.62812211e-03 7.40055504e-05 7.32654949e-03]
[1.75801448e-02 4.77766287e-02 3.86763185e-02 1.65460186e-03
1.24095140e-03 1.86142709e-03 2.33712513e-02]
[6.44984969e-02 2.32303908e-02 9.56545504e-03 1.36649358e-03
0.00000000e+00 0.00000000e+00 1.72178191e-02]]
[[1.16386971e-03 3.37408488e-03 3.15234779e-03 1.86908901e-03
3.77287304e-04 2.14454214e-04 2.63519801e-04]
[1.32680867e-03 1.04864621e-02 3.31412104e-03 1.78427825e-03
7.93622164e-04 6.30596544e-05 3.19257517e-04]
[7.19747326e-04 2.24813142e-03 5.61106610e-03 4.52872913e-04
1.80234305e-04 9.45477466e-04 1.88936671e-04]
[2.02106705e-03 8.35610782e-04 1.08349070e-03 3.03177298e-04
0.00000000e+00 0.00000000e+00 1.05535063e-04]]
Traceback (most recent call last):
File "plot_02-CountsAnalysis.py", line 129, in <module>
plot_matrix(counts.score, counts.terms['B'].labels, counts.terms['A'].labels)
File "/usr/local/lib/python3.5/dist-packages/lisc/plts/utils.py", line 93, in decorated
func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/lisc/plts/counts.py", line 45, in plot_matrix
cmap = get_cmap(cmap)
File "/usr/local/lib/python3.5/dist-packages/lisc/plts/utils.py", line 49, in get_cmap
cmap = sns.cubehelix_palette(as_cmap=True)
AttributeError: 'bool' object has no attribute 'cubehelix_palette'
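The error itself is informative: `sns` here is a boolean rather than the seaborn module, which is the typical signature of an optional-import helper that returns False when an import fails - suggesting seaborn is not installed in this environment. A minimal sketch of that pattern (illustrative only, not lisc's actual code):

```python
# Illustrative sketch of an optional-import helper that can leave a
# bool where a module is expected (not lisc's actual implementation).
def safe_import(name):
    """Try to import a module, returning False if unavailable."""
    try:
        return __import__(name)
    except ImportError:
        return False

sns = safe_import('seaborn_not_installed_stub')  # simulates missing seaborn
print(type(sns).__name__)  # 'bool' - attribute access then raises AttributeError
```

If this is the cause, installing seaborn (`pip install seaborn`) should make the plotting call work.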
Right now, to re-load Words data, you have to load the overall object, and then separately load the results data. Q: would it be more convenient to load results directly when using load_object?
# Current loading procedure:
words = load_object('tutorial_words', SCDB('tutorials/lisc_db')) # This doesn't load results
words.results[0].load(directory=SCDB('tutorials/lisc_db')) # This loads the first Words result
The reason for loading separately is that for large scrapes the data can be large, and we don't always want to load it all together. So we do want an option to not re-load all Articles data. However, it would also be convenient to be able to do so. I'm thinking maybe a keyword option (reload_results or something) that would optionally reload all the data in load_object if set to True.
Note: idea from @ryanhammonds
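A runnable sketch of how the proposed keyword could behave (reload_results and the stub classes below are assumptions for illustration, not current lisc API):

```python
# Minimal stand-ins for the real objects, to show the intended flow.
class StubResult:
    def __init__(self):
        self.loaded = False
    def load(self, directory=None):
        self.loaded = True  # stands in for loading saved Articles data

class StubWords:
    def __init__(self):
        self.results = [StubResult(), StubResult()]

def load_object(file_name, directory, reload_results=False):
    obj = StubWords()  # stands in for unpickling the saved object
    if reload_results:
        for result in obj.results:      # optionally reload everything
            result.load(directory=directory)
    return obj

words = load_object('tutorial_words', 'tutorials/lisc_db', reload_results=True)
print(all(result.loaded for result in words.results))  # True
```

The default of False keeps the current lightweight behaviour for large scrapes.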
Hi,
Sorry to bother you. When I used collect_counts to fetch the articles, it successfully showed the number of papers that include the terms. However, I want to know the titles of the articles that correspond to those counts. I tried print(meta_dat), but it does not include the titles. Below is my code and result:
##################
coocs, term_counts, meta_dat = collect_counts(
    terms_a=terms_a, terms_b=terms_b, db='pubmed',
    save_and_clear=True, usehistory=True, verbose=True)
##################
Is there a way to extract the articles' titles that correspond to the counts?
Thanks for your help!
Best
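For anyone hitting the same question: collect_counts only tallies hit counts, so titles are never downloaded. A sketch of one way to get them, assuming the collect_words collector and that each returned Articles object exposes a .titles list (worth verifying against the lisc docs for your version):

```python
def fetch_titles(terms, retmax=20):
    """Collect article titles for each search term (makes network calls)."""
    from lisc.collect import collect_words  # deferred: needs lisc installed
    results, meta_data = collect_words(terms, db='pubmed', retmax=retmax)
    # Each entry in results lines up with one search term, in order:
    return {articles.label: articles.titles for articles in results}

# Usage (performs live PubMed requests, so not run here):
# titles = fetch_titles([['frontal lobe'], ['temporal lobe']])
```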
It would be nice to add support for specifying date ranges for LISC searches through EUtils.
There are some notes on optional parameters for dates here:
https://www.ncbi.nlm.nih.gov/books/NBK25499/
There look to be two ways to do this:
There is also a datetype option to specify things like "modification date" or "publication date". By default, we want publication date, but we might as well make this user-definable as well. I think, if it's easy, adding support for both types of date makes sense. Note that these settings are only relevant to EQuery and ESearch (I think).
Mainly, I think doing this relates to the URL handling in the eutils file:
https://github.com/lisc-tools/lisc/blob/master/lisc/urls/eutils.py
Otherwise, this will also just need some checking through the code for managing & accepting these extra settings.
I'm hoping this is an easy extension! This is the kind of update that somewhat depends on knowing a bit about the codebase, and the overall organization and approach - which in this case is to connect to a RESTful API (from EUtils).
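For reference, the NCBI page linked above documents the relevant parameters as datetype, mindate, and maxdate. A quick sketch of what an ESearch URL with those settings looks like (the term and dates are just examples):

```python
from urllib.parse import urlencode

base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
settings = {
    'db': 'pubmed',
    'term': 'brain',
    'datetype': 'pdat',       # 'pdat' = publication date
    'mindate': '2015/01/01',  # dates use YYYY/MM/DD
    'maxdate': '2019/12/31',
}
url = base + '?' + urlencode(settings)
print(url)
```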
"Counts" Section:
"Count Object" section:
"Co-occurrence data - different word lists" section:
"Synonyms and Exclusion Words" section:
Are there automated tests or manual steps described so that the function of the software can be verified?
Hello! Thanks for making this great package!
I noticed an inconsistency in these lines of the Overview tutorial:
147 # For example, the following set of search term components:
148 #
149 # - search terms ['brain', 'cortex']
150 # - inclusion words ['biology', 'biochemistry']
151 # - exclusion words ['body', 'corporeal']
152 #
153 # All combine to give the search term of:
154 #
155 # - `'("gene"OR"genetic)AND("biology"OR"biochemistry")NOT("body"OR"corporeal)'`
In order for the components to actually combine into that search term, I assume that either the search terms should be ['gene', 'genetic'], or the resulting search term should be '("brain"OR"cortex)AND("biology"OR"biochemistry")NOT("body"OR"corporeal)'. I thought the latter was probably correct, since 'gene' and 'genetic' hadn't been mentioned yet in the examples.
Cheers
For @ryanhammonds: while you are doing a check through the codebase and familiarizing yourself with the code, I think we should try and add more Examples in docstrings.
Here's roughly what I'm thinking:
- Add an Examples section to the docstring. As a starting point, I'd say functions / classes listed in the API page are all candidates to have doctest examples.
- Add any Examples sections that you think would be useful to docstrings, and follow doctest format for them all.
Notes on doctest:
- If an example can't be run, use the # doctest:+SKIP directive, after the example line, to indicate it shouldn't be run.
- We can add a summary.sh file (like in fooof), and/or some other way to check & run these examples.
- You can run doctests on a single file with python -m doctest -v file.py, or run a whole module, with pytest, with pytest --doctest-modules --ignore=module/tests module (that also includes the option to ignore explicit test files).
LISC has requirements for some basic text processing, like tokenization and removing stopwords.
Right now, this is done with a dependency on nltk, and in particular requires downloaded data files from nltk. Some things can fail if they are run without these downloaded files. It would perhaps be nice to update how the nltk requirement is managed, or maybe even drop / change the nltk dependency, as it seems a bit bulky for what we need / do.
Possibilities:
- Make nltk an optional dependency?
- Switch to spacy? It seems newer and cleaner.
However / counter-point: if this turns out to be really messy, I don't think it's a huge deal (everything works), and it might be reasonable to just more explicitly say "this module requires these files downloaded from nltk", perhaps try to enforce this during install, and then this would avoid future problems from there.
When it's ready - release LISC 0.1.0 on PyPI.
Note: when pushed to PyPI, add the PyPI badge and the Python badge to the README.
"Words" Section:
"Function Approach: scrape_words" section:
"Object Approach: Words" Section:
"Metadata" Section:
Updates for sphinx documentation:
- Fix in-code links (using the currentmodule directive), and consolidate on linking to the 'shallow' link (where __init__ is / where we expect people to import from), and make sure it works in a way that you get the in-code links in rendered sphinx-gallery pages
- Add inherited-members in autodoc_default_options in conf.py
- Add a check command in the makefile
- Make sure the install command calls make clean
- Set body_max_width equal to None in html_theme_options
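A sketch of the conf.py settings named above (the option names are standard sphinx ones; the members entry is just an illustrative companion value):

```python
# Fragment for conf.py: pull in inherited members during autodoc runs.
autodoc_default_options = {
    'members': True,
    'inherited-members': True,
}

# Fragment for conf.py: lift the theme's content width cap.
html_theme_options = {
    'body_max_width': None,
}
```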
Notes:
Hi there, lisc is awesome and we have used it for a while, but we recently encountered some issues when searching for papers.
To reproduce, we can build a method called find_papers(num: int) first:
from lisc.objects.base import Base
from lisc import Words

def find_papers(num):
    terms = [['Pembrolizumab']]
    inclusions = [['Predict']]
    exclusions = [['TIDE']]

    base = Base()
    base.add_terms(terms)
    base.add_terms(inclusions, 'inclusions')
    base.add_terms(exclusions, 'exclusions')
    print(base.make_search_term(0))

    words = Words()
    words.add_terms(terms)
    words.add_terms(inclusions, 'inclusions')
    words.add_terms(exclusions, 'exclusions')
    words.run_collection(retmax=num)

    # Return a list of PubMed IDs.
    pmid = words.results[0].ids
    return pmid
We call find_papers(100) and it works fine, but we wanted to retrieve all papers, so we increased num to 400+ and found that past a certain number we suddenly could not retrieve anything. After reading the code, we found this is caused by the URL length limit. We have prepared a fix and we hope it can resolve this issue: #85 .
Thank you.
Adam
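To make the failure mode concrete: a GET query string carrying many IDs grows fast, and servers commonly cap URL length at a few thousand characters (the exact limit varies by server). A quick illustration, assuming roughly 8-digit PubMed IDs (the numbers here are made up):

```python
from urllib.parse import urlencode

# 400 fake 8-digit IDs, comma-joined as EUtils-style GET parameters.
ids = [str(30000000 + i) for i in range(400)]
query = urlencode({'db': 'pubmed', 'id': ','.join(ids)})
print(len(query))  # several thousand characters - easily past common limits
```

Moving such requests to POST, as long-ID-list EUtils calls generally require, sidesteps the limit entirely.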
We currently only offer functionality for collecting 'words' at the level of abstract text.
It would be nice to add functionality to be able to use the PMC database, and collect full text materials.
The tests are currently fairly slow, mostly because they launch a bunch of test scrapes.
It would be nice to have a switch for the tests, to run a test suite with test data, but without launching any collections. This would work as a quick version, and to test components without internet.
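One possible shape for the switch, using only the stdlib (the LISC_TEST_OFFLINE variable name and the test names are assumptions):

```python
import os
import unittest

# Toggle with: LISC_TEST_OFFLINE=1 python -m unittest ...
OFFLINE = os.getenv('LISC_TEST_OFFLINE') == '1'

class TestCollects(unittest.TestCase):

    @unittest.skipIf(OFFLINE, 'offline run: skipping live collections')
    def test_live_collection(self):
        pass  # would launch a real scrape against PubMed

    def test_components_with_saved_data(self):
        pass  # runs against locally saved test data, no internet needed
```

The same split can be expressed with a pytest marker instead, if that fits the existing suite better.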
It would be nice to have a quick way to see / check the URL that is built for a particular search / request.
It might also be useful to have logging, to keep track of every URL requested (easier for debugging).
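A minimal sketch of both ideas together - exposing the built URL and logging each request (the function and logger names are illustrative, not lisc API):

```python
import logging

logger = logging.getLogger('lisc.urls')

def request_url(url):
    """Log a URL before it is requested, for later debugging."""
    logger.debug('Requesting: %s', url)
    # ... the actual HTTP request would go here ...
    return url

# With logging configured, every collected URL leaves a trace:
logging.basicConfig(level=logging.DEBUG)
checked = request_url('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
                      'esearch.fcgi?db=pubmed&term=brain')
```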
LISC inherited a DB object, but it's over-tuned to a specific project / computer. We need to figure out if and how to get this into a generalized form.
Qs:
It would be nice to have better logging / printing as searches are run. In particular, for debugging, collecting and printing full URLs and/or term lists would be really useful.
The plotting functions aren't well organized or set up to be a general utility.
Hello,
thank you for the nice package and documentation!
I managed to search for co-occurrences using the pattern
terms = [[A], [B]]
inclusions = [C]
which comes to a search string of "(A) AND (B) AND (C)".
How can I design a word search with the same search string?
So far I only managed "(A) OR (B) AND (C)" or "(A) AND (B OR C)" via inclusions. I can then get the DOIs from the results and find overlaps, but that's rather annoying and not within the package framework.
Any help is appreciated.
Hi,
Thanks for your tool lisc. However, I am curious about collecting words with multiple terms. I have two lists of terms (terms_A and terms_B) and I want to use collect_words to query the titles and abstracts of PubMed. However, I don't know how to use the co-occurrence of these terms.
BTW, I noticed that collect_words has inclusion and exclusion parameters (collect_words(terms, inclusions=None, exclusions=None)), but I don't know how to use the inclusions - could you give me an example?
Thanks!
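Not an authoritative answer, but a sketch of how the pieces usually line up: co-occurrence across two term lists is the job of collect_counts, while collect_words takes one term list plus positionally matched inclusions (an empty inner list meaning no inclusions for that term). Verify against the lisc docs for your version:

```python
# Terms and inclusions pair up by position; the first term would search
# roughly as: ("frontal lobe") AND ("audition" OR "auditory").
terms = [['frontal lobe'], ['temporal lobe']]
inclusions = [['audition', 'auditory'], []]

# from lisc.collect import collect_words   # requires lisc + network
# results, meta_data = collect_words(terms, inclusions=inclusions,
#                                    db='pubmed', retmax=10)

for term, incl in zip(terms, inclusions):
    print(term[0], '->', incl or 'no inclusions')
```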
The eutils.info call that should populate the meta_dat dictionary doesn't seem to be working.
E-utils has changed its requirements regarding API keys:
https://www.ncbi.nlm.nih.gov/books/NBK25497/
ToDo: update support for using API keys. Also note the rate limit changes:
- No key: 3 requests / second
- With key: 10 requests / second
^ToDo: Update rate limiting based on whether the user is authenticated or not.
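A sketch of what authentication-aware throttling could look like (the class is illustrative; the 3 vs 10 requests-per-second figures are the ones from the NCBI notes above):

```python
import time

class RateLimiter:
    """Space out requests based on whether an API key is present."""

    def __init__(self, api_key=None):
        rate = 10 if api_key else 3        # requests per second
        self.min_interval = 1.0 / rate
        self.last_request = 0.0

    def wait(self):
        """Sleep just long enough to respect the allowed rate."""
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

print(round(RateLimiter().min_interval, 2))               # 0.33 - no key
print(round(RateLimiter(api_key='abc').min_interval, 2))  # 0.1 - with key
```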
ToDo Item: Update the test suite (the tests weren't properly updated in the conversion from the ERP_SCANR code).
Hi @TomDonoghue !
As always, super thankful for your very useful tools :)
I was wondering if you could do a co-occurrence search within the pre-filtered data in Articles, or whether you would have to manage this through the inclusion terms.
In this case for example:
Terms = ['word1', 'word2', 'word3'] --> Articles
TermsA = ['X']
InclusionA = ['x1', 'x2', Terms]
TermsB = ['Y']
InclusionB = ['y1', 'y2', Terms]
Terms: will look for
'word1' OR 'word2' OR 'word3'
Whereas TermsA with InclusionA will look for:
'X' AND ('x1' OR 'x2' OR ('word1' OR 'word2' OR 'word3'))
Or instead will it look for:
'X' AND (('x1' OR 'x2') AND ('word1' OR 'word2' OR 'word3'))
How should it be done to obtain the latter case?
Thanks a lot!
Hello! Thank you so much for providing this useful tool. I would like to know whether the collected content is the literature's full text, the abstracts, or the titles only.
Hi,
I ran into a problem with words.run_collection. When I followed the tutorial:
#######################
db = SCDB('lisc_db')
words.run_collection(usehistory=True, retmax=15, save_and_clear=True, directory=db)
#######################
It reported an error:
FileNotFoundError: [Errno 2] No such file or directory: 'lisc_db\data\words\raw\ovary.json'
Could you give me a hand?
Thanks!
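A likely cause is that the SCDB folder layout hadn't been created before saving. lisc ships a helper for this (create_file_structure; check your version's docs for the exact import path); as a stdlib-only fallback, the missing folders can also be created directly. The subfolder names below mirror the error path and are otherwise assumptions - consult the SCDB docs for the full layout:

```python
import os

# Create the folder tree the error message says is missing.
for sub in ('data/words/raw', 'data/counts', 'logs'):
    os.makedirs(os.path.join('lisc_db', sub), exist_ok=True)

print(os.path.isdir(os.path.join('lisc_db', 'data', 'words', 'raw')))  # True
```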