nlesc / litstudy

LitStudy: Using the power of Python to automate scientific literature analysis from the comfort of a Jupyter notebook

Home Page: https://nlesc.github.io/litstudy/

License: Apache License 2.0

Python 69.06% HTML 30.94%
literature-review literature-search literature-review-tool python jupyter systematic-literature-reviews systematic-reviews bibliographics bibliometric-analysis bibliometric-visualization

litstudy's Introduction

LitStudy


LitStudy is a Python package that enables analysis of scientific literature from the comfort of a Jupyter notebook. It provides the ability to select scientific publications and study their metadata through the use of visualizations, network analysis, and natural language processing.

In essence, this package offers five main features (a minimal usage sketch follows the list):

  • Extract metadata from scientific documents sourced from various locations. The data is presented in a standardized interface, allowing for the combination of data from different sources.
  • Filter, select, deduplicate, and annotate collections of documents.
  • Compute and plot general statistics for document sets, such as statistics on authors, venues, and publication years.
  • Generate and plot various bibliographic networks as interactive visualizations.
  • Automatically discover popular topics using natural language processing (NLP).
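
For example, a typical session combines one of the loaders with one of the plotting helpers. The following is only a sketch: the RIS file name is a placeholder, and any other supported loader or plot could be substituted.

import litstudy
import matplotlib.pyplot as plt

# Load an exported RIS file (placeholder name) and plot publications per author.
docs = litstudy.load_ris_file("export.ris")
litstudy.plot_author_histogram(docs)
plt.show()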

Frequently Asked Questions

If you have any questions or run into an error, see the Frequently Asked Questions section of the documentation. If your question or error is not on the list, please check the GitHub issue tracker for a similar issue or create a new issue.

Supported Sources

LitStudy supports the data sources listed below. The available metadata fields (title, authors, venue, abstract, citations, references) vary per source; for the sources marked with an asterisk (*), citation data is limited to a citation count.

  • Scopus
  • SemanticScholar *
  • CrossRef *
  • DBLP
  • arXiv
  • IEEE Xplore *
  • Springer Link *
  • CSV file
  • bibtex file
  • RIS file

Example

An example notebook is available in notebooks/example.ipynb and in the online documentation.

(Screenshot of the example notebook.)

Installation Guide

LitStudy is available on PyPI. A full installation guide is available in the documentation.

pip install litstudy

Or install the latest development version directly from GitHub:

pip install git+https://github.com/NLeSC/litstudy

Documentation

Documentation is available at https://nlesc.github.io/litstudy/.

Requirements

The package has been tested with Python 3.7. Required packages are listed in requirements.txt.

litstudy supports several data sources. Some of these sources (such as Semantic Scholar, CrossRef, and arXiv) are openly available. However, to access the Scopus API, you (or your institute) need a Scopus subscription and you must request an Elsevier Developer API key (see Elsevier Developers). For more information, see the guide by pybliometrics.
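
As a quick key-free check, one of the open sources can be queried first. This is a sketch: the query string and limit are arbitrary, and it assumes the returned DocumentSet can be iterated and that each document exposes a title attribute.

import litstudy

# CrossRef is an open API, so no Scopus key is required for this check.
docs = litstudy.search_crossref("literature review", limit=5)
for doc in docs:
    print(doc.title)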

License

Apache 2.0. See LICENSE.

Change log

See CHANGELOG.md.

Contributing

See CONTRIBUTING.md.

Citation

If you use LitStudy in your work, please cite the following publication:

S. Heldens, A. Sclocco, H. Dreuning, B. van Werkhoven, P. Hijma, J. Maassen & R.V. van Nieuwpoort (2022), "litstudy: A Python package for literature reviews", SoftwareX 20

As BibTeX:

@article{litstudy,
    title = {litstudy: A Python package for literature reviews},
    journal = {SoftwareX},
    volume = {20},
    pages = {101207},
    year = {2022},
    issn = {2352-7110},
    doi = {10.1016/j.softx.2022.101207},
    url = {https://www.sciencedirect.com/science/article/pii/S235271102200125X},
    author = {S. Heldens and A. Sclocco and H. Dreuning and B. {van Werkhoven} and P. Hijma and J. Maassen and R. V. {van Nieuwpoort}},
}

Related work

Don't forget to check out these other amazing software packages!

  • ScientoPy: Open-source, Python-based scientometric analysis tool.
  • pybliometrics: API wrapper for accessing Scopus.
  • ASReview: Active learning for systematic reviews.
  • metaknowledge: Python library for doing bibliometric and network analysis in science.
  • tethne: Python module for bibliographic network analysis.
  • VOSviewer: Software tool for constructing and visualizing bibliometric networks.


litstudy's Issues

`litstudy` sometimes produces documents which have a title that is `None`

Some users are reporting that a Document can sometimes have a title that is None. Many of the internal functions in LitStudy assume that documents always have a valid title and crash when the title is None.

This should be fixed, either by allowing None titles or by making sure that none of the sources can produce a document with a None title.

Improve network layout

pyvis is very slow if the network being visualized is large (i.e., >1000 nodes). The current solution is to disable interactivity in pyvis for large graphs and calculate the graph layout manually using ForceAtlas2 (fa2). However, the results do not look great. This should be improved by using a different graph layout algorithm.

ValueError: keyword grid_b when plotting histograms

I noticed an error when calling litstudy.plot_country_histogram(...) and any of the other plots based on plot_histogram. After digging, I found that the error is caused by changes in matplotlib.

Observed: all plots based on plot_histogram() throw ValueError: keyword grid_b is not recognized, apparently due to changes in matplotlib.
Reproducibility: python==3.8.16, matplotlib==3.7.1

The calls that trigger the error:

    ax.grid(b=False, which='both', axis='y')
    xlabel, ylabel = ylabel, xlabel
else:
    ax.grid(b=False, which='both', axis='x')
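
A likely fix, sketched below as an assumption based on matplotlib's API change rather than a committed patch, is to use the visible keyword that replaced the old b alias of Axes.grid:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# On recent matplotlib releases the `b` alias is rejected; `visible` is the supported keyword.
ax.grid(visible=False, which='both', axis='y')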

Automate selecting the number of topics

Currently, the topic modeling methods require the user to select the number of topics. It would be great if the optimal number of topics was selected automatically. For example, this paper looks interesting:

Zhao, W., Chen, J.J., Perkins, R. et al. A heuristic approach to determine an appropriate number of topics in topic modeling. BMC Bioinformatics 16, S8 (2015). https://doi.org/10.1186/1471-2105-16-S13-S8

module 'networkx' has no attribute 'to_scipy_sparse_matrix'

Hello,

Thank you for creating this great tool!

Issue:

When running litstudy.plot_cocitation_network() I get the error - module 'networkx' has no attribute 'to_scipy_sparse_matrix'.

To reproduce:

Edition Windows 10 Enterprise
Version 22H2
Installed on ‎23-‎05-‎2023
OS build 19045.3693
Experience Windows Feature Experience Pack 1000.19053.1000.0

On both:
Microsoft Edge for Business
Learn more about Microsoft Edge for Business
Version 120.0.2210.61 (Official build) (64-bit)

and

Google chrome Version 119.0.6045.160 (Official Build) (64-bit)

Here is the full error message:

Warning: When cdn_resources is 'local' jupyter notebook has issues displaying graphics on chrome/safari. Use cdn_resources='in_line' or cdn_resources='remote' if you have issues viewing graphics in a notebook.

AttributeError Traceback (most recent call last)
Cell In[49], line 1
----> 1 litstudy.plot_cocitation_network(docs, max_edges=500)

File ~\AppData\Local\anaconda3\envs\LitStudy\lib\site-packages\litstudy\network.py:305, in plot_cocitation_network(docs, max_edges, node_size, **kwargs)
301 """Plot a citation network.
302
303 This is a shorthand for plot_network(build_cocitation_network(docs))."""
304 b, p = split_kwargs(**kwargs)
--> 305 return plot_network(
306 build_cocitation_network(docs, max_edges=max_edges, **b),
307 # min_node_size=node_size,
308 # max_node_size=node_size,
309 **p
310 )

File ~\AppData\Local\anaconda3\envs\LitStudy\lib\site-packages\litstudy\network.py:88, in plot_network(g, height, smooth_edges, max_node_size, min_node_size, largest_component, interactive, controls, scale, iterations, gravity)
85 else:
86 sizes = [g for (_, g) in g.degree]
---> 88 layout = calculate_layout(g, gravity=gravity, iterations=iterations)
89 sizes = np.array(sizes, dtype=np.float32)
90 ratio = (max_node_size - min_node_size) / np.amax(sizes)

File ~\AppData\Local\anaconda3\envs\LitStudy\lib\site-packages\litstudy\network.py:22, in calculate_layout(g, iterations, gravity)
14 from fa2 import ForceAtlas2
16 model = ForceAtlas2(
17 verbose=True,
18 scalingRatio=1,
19 gravity=gravity,
20 )
---> 22 matrix = nx.to_scipy_sparse_matrix(g, dtype="f", format="lil", weight="weight")
23 pos = model.forceatlas2(matrix, iterations=iterations)
25 return dict(zip(g.nodes(), pos))

AttributeError: module 'networkx' has no attribute 'to_scipy_sparse_matrix'
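
A plausible fix, assuming the layout code can work with the sparse array that networkx 3.x returns, is to switch to to_scipy_sparse_array, which replaced the removed function. A minimal sketch:

import networkx as nx

g = nx.karate_club_graph()
# networkx >= 3.0 removed to_scipy_sparse_matrix; to_scipy_sparse_array is the replacement.
matrix = nx.to_scipy_sparse_array(g, dtype="f", format="lil", weight="weight")
print(matrix.shape)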

General Problem with search_

Search, especially for Semantic Scholar and Scopus, seems to be extremely slow. Either I'm doing it wrong or there are issues with connecting to the API.

test2 = litstudy.search_semanticscholar("Meier, D", limit= 10, batch_size = 10)

The same query works fine when using the API outside litstudy. But within litstudy it takes extremely long, throws errors about not finding paper ID XYZ, or, when it does return papers, they are all the same, or it fails with "Endpoint request timed out".

Improve fuzzy matching when calculating statistics

Calculating the statistics requires fuzzy matching of names. Currently, this matching is not too aggressive, since we want to avoid incorrectly matching two different names. The matching algorithm should be improved, possibly by adding additional parameters or by asking the user whether two names are equal (a sketch of the normalization style it builds on follows the list below).

Fuzzy matching appears in three places:

  • Affiliation names (e.g., "University of Amsterdam" == "the University of Amsterdam")
  • Author names (e.g., "John Doe" == "John. M. Doe"?)
  • Venue/conference/journal names (e.g., "Journal on Parallel Computing" == "J. Parallel Computing")
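
Today the comparison is essentially a normalized string equality. The sketch below only illustrates that style of matching; the helper name is hypothetical and not litstudy's actual implementation.

from unidecode import unidecode

def normalize_name(name: str) -> str:
    # Hypothetical helper: transliterate to ASCII, lowercase, and drop punctuation,
    # roughly the kind of canonicalisation that fuzzy matching starts from.
    cleaned = unidecode(name).lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())

print(normalize_name("J. Parallel Computing") == normalize_name("J Parallel Computing"))  # True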

`build_corpus` always removes words having a frequency below 5

In this example, we are looking for mentions of countries, regions or locations on the basis of Abstract and Author and Index Keywords. For this, we are using

corpus = litstudy.build_corpus(docs_springer, ngram_threshold=0.8)

The ngram threshold, even at its lowest possible value (0.1), returns a list of common words found in the abstracts of these papers. However, the frequency does not go below 5 mentions, meaning that references to a number of countries are excluded from the word distribution.

(Screenshot of the resulting word-frequency distribution.)

Is there a way to reduce the ngram threshold further, or some other method, so that we can capture all word mentions, that is, a count of 1 or greater? From this we can then see which words refer to geographical areas, and use the filter(like='_', axis=0) function to find relevant bigrams (e.g., United States).

Thanks,

S
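
One possible workaround, sketched here under the assumption that the frequency cut-off is controlled by the min_docs parameter (with a default of 5, as the issue title suggests), is to pass it explicitly:

import litstudy

# Hypothetical: keep words that occur in only a single abstract by lowering min_docs;
# ngram_threshold only controls how bigrams are merged, not the frequency cut-off.
corpus = litstudy.build_corpus(docs_springer, ngram_threshold=0.8, min_docs=1)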

load_csv not showing up in litstudy=1.0.4

I came across issue #29 when searching for how to resolve AttributeError: module 'litstudy' has no attribute 'load_csv' for litstudy.load_csv(...)
I see that load_csv was added in 1.0.4. This is the version installed in my env.
PS: I installed litstudy today with pip install litstudy

I went searching in my ...\envs\litstudy\lib\site-packages\litstudy folder.
I noticed csv.py is missing from \sources.
I observed that __init__.py in \sources has no entry for load_csv either.

How do I go about resolving this? Thanks.

NB: I'm trying to use a csv file from my query in Scholarly

`refine_semanticscholar` fails due to typeerror

I tried refining with Semantic Scholar, but it seems there is a small issue.

import litstudy

docs = litstudy.search_crossref("Kalverla", limit=10)
docs, not_found = litstudy.refine_semanticscholar(docs)

>> ...
>>
File ~/mambaforge/envs/litstudy/lib/python3.11/site-packages/litstudy/sources/semanticscholar.py:199, in refine_semanticscholar.<locals>.callback(doc)
    196 if isinstance(doc, ScholarDocument):
    197     return doc
--> 199 return fetch_semanticscholar(doc.id, session)

TypeError: fetch_semanticscholar() takes 1 positional argument but 2 were given

I think this can be fixed by changing that erroneous line to return fetch_semanticscholar(doc.id, session=session)

Manipulate and save a DocumentSet object after loading.

Hello, I am wondering whether it is possible to manipulate (like in a pandas table) and save a loaded DocumentSet, such as one loaded from .bib or ieee_csv, or to manipulate and save the data after doing a refinement (for example using refine_scopus).

Thank you!

Look into Python 3.8 support

litstudy currently does not run under Python 3.8. We need to investigate under which versions of Python it does run and why it does not run under other versions. Maybe it is time to drop support for Python 2.x?

Having issues with refine_scopus() function (AttributeError: 'list' object has no attribute '_refine_docs')

Hello, I have been having issues with the refine_scopus() function and would appreciate some insight. After loading in a set of DOIs from an RIS file using

docs_remaining = litstudy.load_ris_file('/dbfs/*****************/DOI.ris')

I now see what type of object docs_remaining is.

print(type(docs_remaining))
<class 'list'>

print(len(docs_remaining))
2723

and then when I go to sort which DOIs are on Scopus using the refine_scopus() function, as follows:

docs_scopus, docs_notfound = litstudy.refine_scopus(docs_remaining)

It throws an error "AttributeError: 'list' object has no attribute '_refine_docs'".

My initial thought is that litstudy.load_ris_file should result in the creation of a DocumentSet with all the DOIs, rather than a list. I attached my commands below to provide additional context.

** For reproducibility, I am using Python 3.9.5 on a databricks 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12) instance **

Thank you so much! Please let me know if I can provide any more information.

---- commands to reproduce----

%pip install litstudy

import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbs

plt.rcParams['figure.figsize'] = (10, 6)
sbs.set('paper')
path = os.path.abspath(os.path.join('..'))
if path not in sys.path:
    sys.path.append(path)

import litstudy

import logging
logging.getLogger().setLevel(logging.CRITICAL)

docs_remaining = litstudy.load_ris_file('/dbfs/*****************/DOI.ris')

print(type(docs_remaining))
<class 'list'>

print(len(docs_remaining))
2723

docs_scopus, docs_notfound = litstudy.refine_scopus(docs_remaining)
Creating config file at /root/.pybliometrics/config.ini with default paths...
Configuration file successfully created at /root/.pybliometrics/config.ini
For details see https://pybliometrics.rtfd.io/en/stable/configuration.html.
AttributeError: 'list' object has no attribute '_refine_docs'

Filtering Words not working in build_corpus method

Description

While building a Corpus using the litstudy.build_corpus() method, I have found that the remove_words argument is not acting how it should:

remove_words = ["institution", "authors", "licensee", "recent", "years", "john", "wiley"]
Corpus = litstudy.build_corpus(docs=db,
                               remove_words=remove_words,
                               min_word_length=3,
                               min_docs=10,
                               max_docs_ratio=0.75,
                               max_tokens=1000,
                               replace_words=None,
                               custom_bigrams=None,
                               ngram_threshold=0.7)

Expected behavior

The expected behavior is pretty straightforward: passing a list of words in remove_words should filter them from the word frequency vector of each document, thus removing them from the Corpus object. This can be checked by confirming that they do not appear in Corpus.dictionary.items():

remove_words = ["institution", "authors", "licensee", "recent", "years", "john", "wiley"]
assert [] == [item for item in Corpus.dictionary.items() if item[1] in remove_words]

Observed behavior

The words are not removed at all:

remove_words = ["institution", "authors", "licensee", "recent", "years", "john", "wiley"]
assert [] == [item for item in Corpus.dictionary.items() if item[1] in remove_words]

Raises:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[28], line 2
      1 remove_words = ["institution", "authors", "licensee", "recent", "years", "john", "wiley"]
----> 2 assert [] == [item for item in Corpus.dictionary.items() if item[1] in remove_words]

AssertionError: 

Export `DocumentSet` to RIS

Currently, there is no way to export results after loading them into memory. One possibility would be to add the ability to export a DocumentSet to a RIS file.

Support for OpenAlex

According to the OpenAlex website, the OpenAlex dataset describes scholarly entities and how those entities are connected to each other.

There is an open web API available. It would be great if OpenAlex was supported in litstudy!
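
For illustration, the open API can already be queried directly with requests. This sketch fetches a single work by DOI; the DOI shown is the litstudy paper itself, used only as an example of what a litstudy source could build on.

import requests

# OpenAlex exposes an open REST API that returns JSON records for works.
resp = requests.get("https://api.openalex.org/works/https://doi.org/10.1016/j.softx.2022.101207")
resp.raise_for_status()
work = resp.json()
print(work["title"])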

Listing document titles

Hi!

I'm quite new to Python (so this might be an easy fix), but I found LitStudy really interesting to look into.

I have tried to load documents from different files (from different databases) and used litstudy.types.DocumentSet.union to get a DocumentSet without duplicates. However, I would like to know which papers are then in this new collection/dataset. Is it possible to get LitStudy to list (e.g., in a pandas DataFrame) the titles of the documents in a specific dataset, or to provide a list/table of the titles at any stage in the process?
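
One possible way to do this, sketched under the assumption that the merged DocumentSet is iterable and each document exposes a title attribute (merged_docs is a placeholder for the set obtained from the union step):

import pandas as pd

# Collect the titles of all documents in the merged set into a DataFrame.
titles = pd.DataFrame({"title": [doc.title for doc in merged_docs]})
print(titles)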

Most Cited Papers and Correlating Authors to Topic Clusters Questions

Hi,
Is there any way of checking which countries / authors / institutes have been citing the papers within my csv file? I would like to find out where the papers within my csv file are most prominently being cited.
I would also be interested in knowing if there is a way to correlate authors to topic clusters, in order to see which authors are most active in each of the clusters.
Thank you!

loading RIS file error (Tag appears multiple times)

(Screenshot of the duplicate-tag error.)

I am trying to mimic the example notebook and load an RIS file of full-detail records exported from Web of Science, but I received a duplicate-tag error and cannot proceed further. Could you please let me know how I can fix this issue?

KeyError: Item DOI

Hi all,

When I tried to load and merge a CSV file using litstudy, I got the error KeyError: Item DOI, even though I downloaded the CSV file from Scopus.

Limiting the number of search results in `search_scopus` might be done incorrectly

search_scopus takes a limit parameter that can be used to limit the number of search results.

However, ScopusSearch will still fetch all results, and limiting is done afterwards. This means that far too many results are downloaded, which can take a long time.

The solution would be to find a way to limit the number of results returned by ScopusSearch. An alternative solution would be to show some kind of progress bar to at least indicate that something is going on.

Filtering outliers from Corpus with strange behavior

Description

While building a Corpus, using the litstudy.build_corpus() method I have found that min_docs and max_docs_ratio are not working as expected.

For example, when forcing outliers to be kept in Corpus by setting min_docs=1 and max_docs_ratio=1, the outliers are still being removed. The following example shows a situation for which no filter should be applied (except smart stemming and stopwords):

Corpus = litstudy.build_corpus(docs=curtailment_docs,
                               remove_words=None,
                               min_word_length=None,
                               min_docs=1,
                               max_docs_ratio=1,
                               max_tokens=1000,
                               replace_words=None,
                               custom_bigrams=None,
                               ngram_threshold=None)

Expected behavior

After performing a "dumb filter" on my database, prior to building the Corpus:

curtailment_docs = docs.filter_docs(lambda d: d.abstract is not None)
curtailment_docs = curtailment_docs.filter_docs(lambda d: 'curtailment' in d.abstract.lower())

I was expecting to see 'curtailment' as a "forced outlier".

'curtailment' in [token[1] for token in list(Corpus.dictionary.items())]

But it gives me:

False

Observations

Please keep in mind that this is not very easy to test. You might need a very specific word that is not a stopword and must be very frequent in a reasonable amount of papers. In my case, I've been reviewing papers about "Curtailment in Power Systems", so I've managed to get a list of about 1000 papers which contain the word curtailment in the abstract, and that is the curtailment_docs that I'm working with.

Export `DocumentSet` to bibtex

Currently, there is no way to export results after loading them into memory. One possibility would be to add the ability to export a DocumentSet to a bibtex file.

The "square_distances" argument of TSNE function used in calculate_embedding is deprecated

According to issue scikit-learn/scikit-learn#12401, the square_distances argument of the TSNE function used in calculate_embedding is deprecated since 0.28.

For scikit-learn versions >= 0.28, this raises an error:

File [c:\Users\user\.pyenv-win-venv\envs\psd1\lib\site-packages\litstudy\nlp.py:410](file:///C:/Users/user/.pyenv-win-venv/envs/psd1/lib/site-packages/litstudy/nlp.py:410), in calculate_embedding(corpus, rank, svd_dims, perplexity, seed)
    407 else:
    408     components = tfidf
--> 410 model = TSNE(
    411     rank, metric="cosine", square_distances=True, perplexity=perplexity, random_state=seed
    412 )
    413 return model.fit_transform(components)

TypeError: __init__() got an unexpected keyword argument 'square_distances'

How to reproduce the error:
Run the example notebook from the documentation and execute the cell that contains:
litstudy.plot_embedding(corpus, topic_model);

Suggested solution: remove the square_distances=True argument.

Willing to commit a change to the main branch: Yes
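
For illustration, the sketch below shows a TSNE call without square_distances that runs on recent scikit-learn releases; the random data only stands in for the tf-idf components used by calculate_embedding.

import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for the document components; recent scikit-learn rejects
# the square_distances keyword, so it is simply omitted here.
components = np.random.rand(50, 10).astype(np.float32)
model = TSNE(n_components=2, metric="cosine", perplexity=5, init="random", random_state=0)
embedding = model.fit_transform(components)
print(embedding.shape)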

`refine-crossref` does not work with the default `session=None`

I tried using refine_crossref with the default kwargs, but it returns an empty result with some warnings:

import litstudy

docs = litstudy.search_semanticscholar("random topic", limit=10)
docs, not_found = litstudy.refine_crossref(docs)

>> WARNING:root:failed to retrieve 10.3115/1667583.1667673: 'NoneType' object has no attribute 'get'

Explicitly passing a requests.Session() solves the problem.

By contrast, refine_semanticscholar creates a session if none is passed:

if session is None:
    session = requests.Session()

I think it would be nice to do the same for refine_crossref.
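
Until that is changed, the workaround described above looks roughly like this (a sketch based on the report, assuming refine_crossref accepts a session keyword argument):

import requests
import litstudy

docs = litstudy.search_semanticscholar("random topic", limit=10)
# Passing an explicit session avoids the "'NoneType' object has no attribute 'get'" warnings.
docs, not_found = litstudy.refine_crossref(docs, session=requests.Session())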

Transforming an existing Scopus produced CSV to match load_ieee_csv() or load_springer_csv format

Is there an easy way to manipulate a CSV that contains data pulled from Scopus so that it matches the format expected by either the load_ieee_csv() or load_springer_csv() functions?

I saw there was an issue about 2 weeks ago that mentioned the possibility of a load_scopus_csv() equivalent function and I just want a little more clarification.

While reading the documentation, it seems that the standard is to query Scopus directly with litstudy, but I was interested in loading existing data.

Would something like converting my csv to ieee or springer format be feasible to do quickly, or would you recommend I query Scopus directly with litstudy until something of a load_scopus_csv() function has been implemented and circle back at that point?

Thank you!

Unexpected results from litstudy.plot_author_histogram()

Dear all,

I am observing some unexpected behavior in litstudy.plot_author_histogram() in the current master branch. I tried loading bibtex files from Web of Science and Scopus, as well as a Scopus CSV. In all cases, I get not just one histogram showing the publication counts per author, but two histograms (here one blue, one orange) that are stacked. Apparently I have two categories here, but I cannot make sense of this and did not find anything mentioned in the documentation.

Best, Lars.

Unable to find documents based on DOI

Hello,
When attempting to load a CSV file, I am getting the 'no document found for DOI' warning for all DOIs within the CSV file. Below is an example of the error:
90%|█████████ | 451/499 [00:51<00:06, 7.57it/s]WARNING:root:no document found for DOI 10.5194/acp-22-395-2022:

My code is as follows:
import os
import sys

path = os.path.abspath(os.path.join('..'))
if path not in sys.path:
    sys.path.append(path)

import litstudy
docs1 = litstudy.load_csv(data_path + 'data2.csv')
docs_scopus, docs_notfound = litstudy.refine_scopus(docs1)
print(len(docs_scopus), 'papers found on Scopus')
print(len(docs_notfound), 'papers were not found and were discarded')

I am unsure as to why it is not able to find the documents based on their DOIs, as they are in the correct format. Any help on this would be greatly appreciated.

Different results from unique() and difference of deduplicated set

Dear all,
I have a document set that returns a duplicate according to unique():
len(docset) -> 1014
len(docset.unique()) -> 1013
However, len(docset-docset.unique()) -> 0
I found this when I wanted to output the title of the duplicate that is supposedly eliminated by unique(); however, I do not get any title, since the difference has zero documents.
Best, Lars.

Fix Load Scopus Csv ScopusCsvDocument authors method

The current Scopus CSV test case has the column value "Authors with Affiliations" formatted as "Author First Name, Author Last Name, Affiliation; ...". However, when using Scopus recently, this format is "Author First Author Last, Affiliation;".

This causes the author name property to become the Author First and Author Last AND the first affiliation. This method also needs to be modified to support multiple affiliations for a single author. This may need to be opened up as a separate issue, although it shouldn't be too difficult to implement.

The plan is to add a new test case for a recently exported Scopus CSV file, then modify the method to pass the tests for both files.
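
As a rough illustration of the newer format, a parser along these lines could split the cell into (name, affiliation) pairs. The helper is hypothetical and not part of litstudy:

def split_authors_with_affiliations(value: str):
    # Hypothetical helper: split a newer-style Scopus "Authors with Affiliations"
    # cell of the form "Name Surname, Affiliation; ..." into (name, affiliation) pairs.
    pairs = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        name, _, affiliation = entry.partition(",")
        pairs.append((name.strip(), affiliation.strip()))
    return pairs

print(split_authors_with_affiliations("Jane Doe, University of Amsterdam; John Smith, VU Amsterdam"))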

Errors when refining from Semanticscholar and Crossref

Hi,
First, thank you for this great Python module! It is really helpful; I am using it to refine and evaluate a bibtex database.
I have been trying to refine my bibtex file with other databases. It works pretty well with Scopus.
CrossRef unfortunately gives me an error with Python 3.10 from within the Spyder editor on macOS:

File /opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
  exec(code, globals, locals)

File ~/process/process.py:124 in <module>
  collectData(bibtexIn)

File ~/process/process.py:118 in collectData
  refinedDB = docsFound | refinedDB

File ~/process/litrev/lib/python3.10/site-packages/litstudy/types.py:304 in __or__
  return self.union(other)

File ~/process/litrev/lib/python3.10/site-packages/litstudy/types.py:204 in union
  left, right = DocumentSet._intersect_indices(self, other)

File ~/process/litrev/lib/python3.10/site-packages/litstudy/types.py:139 in _intersect_indices
  if id.matches(needle):

File ~/process/litrev/lib/python3.10/site-packages/litstudy/types.py:414 in matches
  return fuzzy_match(self._title, other._title)

File ~/process/litrev/lib/python3.10/site-packages/litstudy/common.py:58 in fuzzy_match
  return canonical(lhs) == canonical(rhs)

File ~/process/litrev/lib/python3.10/site-packages/litstudy/common.py:37 in canonical
  key = unidecode(key).lower()

File /opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/unidecode/__init__.py:60 in unidecode_expect_ascii
  bytestring = string.encode('ASCII')

AttributeError: 'list' object has no attribute 'encode'

The interface to Semantic Scholar just does not return any results (despite the DOIs being in the database):

158 papers to be evaluated
  0%|          | 0/158 [00:00<?, ?it/s]WARNING:root:failed to retreive 10.1080/13602365.2017.1376342: Paper with id paper=10.1080/13602365.2017.1376342 not found

Best, Lars.

Issue with error scopus400 which I cannot figure how to fix

Hello, I was trying to run the package and encountered the Scopus400 error ("exceeds the maximum number allowed for the service level") when running the function litstudy.refine_scopus.

Note that I double-checked that I have a valid key, and since it didn't work, I tried the exact content of your example at https://nlesc.github.io/litstudy/example.html, and the error persists.

Can you please help me with it? Do you know how I can troubleshoot this issue?

Thank you

search_scopus issues

Hi, I run into some problems when I run the code cell with

scopus_query = 'title-abs-key(medical data sharing)'
docset = litstudy.search_scopus(scopus_query, docset)

In detail, it gives me the following error:


Scopus401Error Traceback (most recent call last)
in
4
5 scopus_query = 'title-abs-key(medical data sharing)'
----> 6 docset = litstudy.search_scopus(scopus_query, docset)

c:\Users\Judiths\Desktop\CLARIFY survey\automated-literature-analysis\litstudy\search.py in search_scopus(query, docs, retrieve_orcid)
78 affiliations_cache = {}
79 try:
---> 80 retrieved_paper_ids = ScopusSearch(query, view="STANDARD").get_eids()
81 except ScopusQueryError:
82 print("Impossible to process query "{}".".format(query))

c:\users\judiths\anaconda3\envs\mysurvey\lib\site-packages\pybliometrics\scopus\scopus_search.py in init(self, query, refresh, subscriber, view, download, integrity_fields, integrity_action, verbose, **kwds)
195 Search.init(self, query=query, api='ScopusSearch', refresh=refresh,
196 count=count, cursor=subscriber, view=view,
--> 197 download=download, verbose=verbose, **kwds)
198 self.integrity = integrity_fields or []
199 self.action = integrity_action

c:\users\judiths\anaconda3\envs\mysurvey\lib\site-packages\pybliometrics\scopus\classes\search.py in init(self, query, api, refresh, view, count, max_entries, cursor, download, verbose, **kwds)
77 params.update({'start': 0})
78 # Download results
---> 79 res = cache_file(url=SEARCH_URL[api], params=params, **kwds).json()
80 n = int(res['search-results'].get('opensearch:totalResults', 0))
81 self._n = n

c:\users\judiths\anaconda3\envs\mysurvey\lib\site-packages\pybliometrics\scopus\utils\get_content.py in cache_file(url, params, **kwds)
74 except:
75 reason = ""
---> 76 raise errorsresp.status_code
77 else:
78 resp.raise_for_status()

Scopus401Error: Invalid API Key

I have successfully installed the Python virtual environment, but it fails when I run the code. I don't know whether it's a bug in the code or not. If you know why this happens, could you please answer me?

Loading CSV file from Scopus

The Scopus web search engine offers the ability to export search results as a CSV file. It might be interesting to add a load_scopus_csv function, similar to load_ieee_csv and load_springer_csv.

Issues with Python 3.11

I've seen this in the CI and tried it on my machine: gensim cannot be installed at the moment with Python 3.11, so the install fails.
The error is gensim/models/word2vec_inner.c:217:12: fatal error: 'longintrepr.h' file not found, and apparently the Internet does not have an answer at the moment, except to wait until someone fixes the issue upstream.

Add ability to search CrossRef

Currently, for CrossRef, the only supported features are loading one document based on the DOI (fetch_crossref) or refining an existing DocumentSet (refine_crossref). The CrossRef API offers more ways to search papers (for example, based on title) and it would be nice if these were also supported.
