Comments (19)
- Deprecation: Yes, the way sklearn is doing this is quite good. We are already following a similar approach: https://github.com/scopus-api/scopus/blob/master/scopus/scopus_search.py#L9 or https://github.com/scopus-api/scopus/blob/master/scopus/abstract_retrieval.py#L577. In version 2.0, everything that has a future deprecation warning will be removed.
- kwds: Maybe we can do this first in a separate PR. It's relatively easy to implement and does not depend on other issues. I think the retrieval classes don't need this, because they do not have header params that we are not already using.
- Many results: I am 75% sure that an error about too many results comes immediately, i.e. on the first run where we try to get the number of results: https://github.com/scopus-api/scopus/blob/master/scopus/classes/search.py#L82. We should verify this. If not, and the error is thrown only when we reach the end of the search results, then we still need that parameter. We could implement it in two ways: We define a global variable equal to 5000, and a new boolean parameter, which non-subscribers should set to `True`, will check that the number of results is less than or equal to that global variable. Alternatively, we keep it the way it is, but set it to `None` as default.
from pybliometrics.
Also, happy to work on a PR for this
Oh, this is wonderful news! The 5k cap always annoyed me. Also, there must have been changes behind the scenes, both on the architecture and the documentation.
So, it's the `max_entries` that we need to purge and the `count` that we need to make less restrictive, right?
Probably we should also remove the default for the view: It would become more flexible as it caters to both subscribers and non-subscribers.
A PR would be amazing! In the meantime I'd update the documentation.
Actually, you can just modify the "search" class to use the new "cursor" parameter instead of "start" and "count". I'll find some time this week to arrange my code for sharing. It is completely free of record-count limitations; I've used it for some time now and it seems to work.
Yes, that would be amazing, from one of the two of you. This and the next week I am completely unavailable.
Hi @Michael-E-Rose , I think you're going to want to keep the `max_entries` param in the `Search` class, but add that param to `ScopusSearch` to address the issue I mentioned. The reason why you want to keep `max_entries` is b/c non-subscribers still have a 5,000-doc limit, so you'll want to have a check that `max_entries` isn't exceeded for them. The `count` param is a little different. My first thought was to deprecate `count` by making the default value `None`, and throwing a warning whenever the user specifies a non-`None` value for it, saying something along the lines of "we're no longer passing the `count` param to the API. Rather, the default value for `count` will be used depending on your access level" or something like that.
Oops, meant to address that last comment to @gilad1bl . @Michael-E-Rose , I noticed you assigned this issue to yourself. Were you or @gilad1bl planning on putting together a PR for this? If not, I'd be happy to give it a shot.
Hey, @crew102 and @Michael-E-Rose !
Just sent a PR - hope it's good enough to use.
Though I've played it cool so far, this is my first real GitHub contribution so thanks for the motivation!
Hey @gilad1bl , as far as I can tell your PR simply uses a different way to iterate over an entire result set (i.e., it uses a cursor instead of the `start` parameter). Unfortunately this won't solve the issues mentioned in this thread. I would suggest you open up a separate issue for refactoring `scopus` to use a `cursor` param, as it's not quite related to the `max_entries` and `count` issues discussed here (though it's still probably a good idea).
@crew102 thanks! As far as I can tell, this is the only way to iterate over a set of results that contains more than 5,000 entries. Using the `start` parameter would result in an error from Elsevier's side if it is set to more than 5,000. Am I wrong?
I'm not entirely sure what would happen if you set the `start` parameter to above 5,000. It definitely should throw an error if you're not a Scopus subscriber, as only subscribers can download result sets larger than 5,000 (trying to download these large result sets with a cursor parameter would throw an error for non-subscribers too). For subscribers (i.e., those users that have a paid subscription), I imagine that the API doesn't care whether you use the `start` or `cursor` parameter to iterate over large result sets, though I think the `scopus` package should be using a cursor parameter since the API docs started to recommend it.
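For reference, cursor-based iteration can be sketched roughly like this. The `fetch` callable is a hypothetical stand-in for the authenticated HTTP request; the response shape follows the Search API's JSON, where the first cursor is `"*"` and each page carries the cursor for the next one:

```python
def iterate_with_cursor(fetch, query, count=25):
    """Yield entries page by page via cursor pagination (sketch).

    `fetch(params)` stands in for the authenticated GET request to the
    Search API and must return the parsed JSON response as a dict.
    """
    cursor = "*"  # the API's documented starting cursor
    while True:
        page = fetch({"query": query, "cursor": cursor, "count": count})
        entries = page["search-results"].get("entry", [])
        # An exhausted result set comes back empty (or as a single
        # {"error": ...} entry), which ends the iteration.
        if not entries or "error" in entries[0]:
            return
        yield from entries
        cursor = page["search-results"]["cursor"]["@next"]
```

Unlike `start`-based paging, nothing here ever references an absolute offset, which is why the 5,000-record ceiling doesn't apply for subscribers.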
The university I work at is a paid subscriber and I get errors when trying to use a `start` value over 5,000. You are right, though, that it does not completely address the issue that you raised. Do you want me to try and add the option to pass `start` and `count` directly?
Ah, you're right, `start` can't be above 5,000, regardless of the user's access level. But yes, it'd be great if you could add the option to pass `count` and `max_entries` in `ScopusSearch` and other classes that inherit from `Search` (I guess you could add the option to pass in `start` as well, though note the first item below). A few things:

- There are a ton of parameters like `start` that could theoretically be passable to the APIs (e.g., take a look at the parameters you can pass to the Search API - https://dev.elsevier.com/documentation/ScopusSearchAPI.wadl). For a more general solution, the `scopus` package may want to consider using `**kwargs` instead of specifying all of these parameters one by one. In other words, it may be better to pick a few of the most important parameters like `query` and make those explicit parameters in the classes, but allow the user to pass in arbitrary keyword args via `**kwargs`.
- I would keep an eye on making sure any change you make to the parameters is backwards compatible, so it doesn't break people's code.
- If you want to take a shot at a PR for this issue, I would suggest keeping it separate from your existing PR. In other words, I would suggest opening a new PR for "refactoring API parameters."
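To make the `**kwargs` idea concrete, here is a minimal sketch; the class shape and parameter names are simplified for illustration, not the package's actual signatures:

```python
class Search:
    """Simplified sketch of a search base class forwarding **kwds.

    Explicit parameters cover the common cases; anything else the
    caller passes is merged verbatim into the API query parameters.
    """
    def __init__(self, query, count=25, **kwds):
        params = {"query": query, "count": count}
        params.update(kwds)  # e.g. date="2010-2015", subj="MEDI"
        self.params = params
```

A caller could then write `Search("AU-ID(123)", date="2014")` without the class having to name every parameter the API supports.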
Good evening guys! I couldn't follow your discussion until now. I thank both of you for all the contributions, much appreciated!
I agree with @crew102 that backwards compatibility is key. Also, there should be a more flexible approach to the Search classes (via `kwargs`) for the more experienced users. Also, I think we don't need the `max_entries` in the search class as a pre-capture: there will be an error anyways. The pre-capture was introduced very early and should be deprecated.
However, I have not fully understood how the cursor works. Should it completely replace `start` and `end`? Can you please enlighten me?
PS: The fact that I am assigned just means that I, and not John, will be responsible for this issue.
K, a few things here:

> I agree with @crew102 that backwards compatibility is key.

K, we'll need to keep all params that exist in the classes (even if they won't be used) to achieve backwards compat, at least in the short term. Do you have a preferred pattern for deprecating parameters? For example, let's say we want to deprecate the `start` parameter, how do you want to go about doing that? My first thought is to follow the pattern used in scikit-learn, where we would first soft-deprecate the parameter in the next few releases of the package and then later remove that param completely in future versions. The user would get a warning in the next release if they use a deprecated param, similar to how `pooling_func` was deprecated in scikit-learn:
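A minimal sketch of that scikit-learn-style soft deprecation, assuming a sentinel default so the class can tell whether the caller actually passed the parameter (the class and message here are illustrative, not the package's real code):

```python
import warnings

_DEPRECATED = object()  # sentinel: distinguishes "not passed" from any real value

class ScopusSearch:
    """Illustrative class demonstrating the soft-deprecation pattern."""
    def __init__(self, query, start=_DEPRECATED):
        if start is not _DEPRECATED:
            # Caller explicitly passed `start`: warn, but keep working
            warnings.warn(
                "The 'start' parameter is deprecated and will be removed "
                "in a future release; it is ignored.",
                FutureWarning, stacklevel=2)
        self.query = query
```

Existing code keeps running unchanged; only users who still pass `start` see the warning, and the parameter can be dropped entirely once the deprecation window has passed.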
> Also, there should be a more flexible approach to the Search classes (via kwargs) for the more experienced users.

K, I can add this functionality. Will the `Retrieval` classes need this as well? If so, let's put that on the back-burner and concentrate on the `Search` classes for now.
> Also, I think we don't need the max_entries in the search class as a pre-capture: There will be an error anyways. The pre-capture was introduced very early and should be deprecated.

I actually think we want to keep the ability to error when the results exceed some user-defined threshold like `max_entries`. For example, consider this hypothetical situation that could occur if `max_entries` is totally removed: A non-subscriber calls ScopusSearch and their query results in 10,000 entries (i.e., 10,000 entries match their query). ScopusSearch then starts downloading the data, and when it gets to 5,000 entries downloaded the API throws an error saying "you've exceeded the amount you can download in a single result set." The user had to A) wait a long time to realize that they can't actually get all the results b/c the API won't throw that error immediately, and B) they've now eaten up 5,000 of the 20,000 of their weekly download limit (https://dev.elsevier.com/api_key_settings.html). It would be a lot better if we could avoid this situation by just erroring after a check on `max_entries` is performed. I suggest we keep the default value of `max_entries` equal to 5,000 but allow the user to set `max_entries=None`, which would basically amount to not performing the check that `max_entries` is not exceeded.
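That fail-fast check could look something like this sketch; the function is hypothetical, and the exception name is only borrowed from the package's error naming style:

```python
class ScopusQueryError(Exception):
    """Raised before downloading when a query matches too many results."""

def check_max_entries(n_results, max_entries=5000):
    """Fail fast if the result set exceeds max_entries (sketch).

    Passing max_entries=None disables the check, so subscribers (or
    users who know what they are doing) can fetch large result sets.
    """
    if max_entries is not None and n_results > max_entries:
        raise ScopusQueryError(
            f"Found {n_results} matches, which exceeds max_entries="
            f"{max_entries}; pass max_entries=None or narrow the query.")
```

Because the API reports the total number of hits on the very first page, this check costs one request instead of 5,000 burned quota entries.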
> However I have not fully understood how the cursor works. Should it completely replace start and end?

Can we actually discuss the cursor in the #87 thread instead?
I implemented the kwds functionality for ScopusSearch via d7f203a because it was a quick thing to do. I did so only for the ScopusSearch class because the other two search classes don't have fields that are not already implemented and which at the same time could be used w/o breaking the code.
Quick testing showed no difference in the number and kind of results returned, though.
Hey @Michael-E-Rose , I was thinking that `**kwds` would be used for the query params, not the header. There are potentially a lot of query params that the user may want to specify - https://dev.elsevier.com/documentation/ScopusSearchAPI.wadl
Yes, you are right, I put it in the wrong dictionary. d079f93 fixes this: ScopusSearch now features keywords that go into the query parameters.
I am testing a bit more now.
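The header/query-parameter distinction, roughly, in a hypothetical helper; the `X-ELS-APIKey` header name follows Elsevier's documented convention, everything else is illustrative:

```python
def build_request(query, api_key, **kwds):
    """Split a Search API call into headers and query params (sketch).

    Header fields identify the client; free-form **kwds such as
    date="2010-2015" belong in the URL query parameters instead.
    """
    headers = {"X-ELS-APIKey": api_key, "Accept": "application/json"}
    params = {"query": query, **kwds}
    return headers, params
```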
To me this issue seems resolved.
ScopusSearch() uses the cursor for subscribers and the count-check for non-subscribers; AuthorSearch() and AffiliationSearch() will deprecate their `max_entries` and `start` parameters.
Thanks to both of you!