Giter Site home page Giter Site logo

count and max_entries parameters need to be passable to __init__ method of all classes that inherit from Search about pybliometrics HOT 19 CLOSED

pybliometrics-dev avatar pybliometrics-dev commented on May 16, 2024
count and max_entries parameters need to be passable to __init__ method of all classes that inherit from Search

from pybliometrics.

Comments (19)

Michael-E-Rose avatar Michael-E-Rose commented on May 16, 2024 1
  1. Deprecation: Yes, the way sklearn is doing this is quite good. We are already following a similar approach: https://github.com/scopus-api/scopus/blob/master/scopus/scopus_search.py#L9 or https://github.com/scopus-api/scopus/blob/master/scopus/abstract_retrieval.py#L577. In version 2.0, everything that has a future deprecation warning will be removed.

  2. kwds: Maybe we can do this first in a separate PR. It's relatively easy to implement and does not depend on other issues. I think the retrieval classes don't need this, because they do not have header params that we are not already using.

  3. Many results: I am 75% sure that an error of having to many results comes immediately, i.e. on the first run where we try to get the number of results: https://github.com/scopus-api/scopus/blob/master/scopus/classes/search.py#L82. We should verify this. If not, and the error is thrown only when we reach the end of the search results, then we still need that parameter. We could implement it in two ways: We define a global variable equal to 5000. A new boolean parameter, that non-subscribers should set to True, will check that the number of results is less than or equal to that global variable. Alternatively, we keep it the way it is, but set it to None as default.

from pybliometrics.

crew102 avatar crew102 commented on May 16, 2024

Also, happy to work on a PR for this

from pybliometrics.

Michael-E-Rose avatar Michael-E-Rose commented on May 16, 2024

Oh, this is wonderful news! The 5k cap always annoyed me. Also, there must have been changes behind the scenes, both on the architecture and the documentation.

So, it's the max_entries that we need to purge and the count that we need to make less restrictive, right?
Probably we should also remove the default for the view: It would become more flexible as it caters to both subscribers and non-subscribers.

A PR would be amazing! In the meantime I'd update the documentation.

from pybliometrics.

gilad1bl avatar gilad1bl commented on May 16, 2024

Actually, you can just modify the "search" class to use the new "cursor" parameter instead of "start" and "count". I'll find some time this week to arrange my code for sharing. It is completely free of record count limitations and I've used it for some time now and it seems to work

from pybliometrics.

Michael-E-Rose avatar Michael-E-Rose commented on May 16, 2024

Yes, that would be amazing, from one of the two of you. This and the next week I am completely unavailable.

from pybliometrics.

crew102 avatar crew102 commented on May 16, 2024

Hi @Michael-E-Rose , I think you're going to want to keep the max_entries param in the Search class, but add that param to ScopusSearch to address the issue I mentioned. The reason why you want to keep max_entries is b/c non-subscribers still have a 5,000 doc limit, so you'll want to have a check that max_records isn't exceeded for them. The count param is a little different. My first thought was to deprecate count by making the default value be None, and throwing a warning whenever the user specifies a none-None value for it, saying something along the lines of "we're no longer passing the count param to the API. Rather, the default value for count will be used depending on your access level" or something like that.

from pybliometrics.

crew102 avatar crew102 commented on May 16, 2024

Opps, meant to address that last comment to @gilad1bl . @Michael-E-Rose , I noticed you assigned this issue to yourself. Were you or @gilad1bl planning on putting together a PR for this? If not, I'd be happy to give it a shot.

from pybliometrics.

gilad1bl avatar gilad1bl commented on May 16, 2024

Hey, @crew102 and @Michael-E-Rose !
Just sent a PR - hope its good enough to use.
Though I've played it cool so far, this is my first real GitHub contribution so thanks for the motivation!

from pybliometrics.

crew102 avatar crew102 commented on May 16, 2024

Hey @gilad1bl , as far as I can tell your PR simply uses a different way to iterate over an entire result set (i.e., it uses a cursor instead of using the start parameter). Unfortunately this won't solve the issues mentioned in this thread. I would suggest you open up a separate issue for refactoring scopus to use a cursor param, as it's not quite related to the max_entries and count issues discussed here (though its still probably a good idea).

from pybliometrics.

gilad1bl avatar gilad1bl commented on May 16, 2024

@crew102 thanks! as far as I can tell, this is the only way to iterate over a set of results that contains more than 5,000 entries. using the start parameter would result in an error from Elsevier's side if it is set to more than 5,000. Am I wrong?

from pybliometrics.

crew102 avatar crew102 commented on May 16, 2024

I'm not entirely sure what would happen if you set the start parameter to above 5,000. It definitely should throw an error if you're not a Scopus subscriber, as only subscribers can download result sets larger than 5,000 (trying to download these large result sets with a cursor parameter for non-subscribers would throw an error too). For subscribers (i.e. those users that have a paid subscription), I image that the API doesn't care if you use the start or cursor parameter to iterate over large result sets, though I think the scopus package should be using a cursor parameter since the API docs started to recommend it.

from pybliometrics.

gilad1bl avatar gilad1bl commented on May 16, 2024

The university I work from is a paid subscriber and I get errors trying to use start that is over 5,000. You are right though, that it does not completely address the issue that you raised. Do you want me to try and add the option to pass start and count directly?

from pybliometrics.

crew102 avatar crew102 commented on May 16, 2024

Ah, you're right, start can't be above 5,000, regardless of the user's access level. But yes, I'd be great if you could add the option to pass count and max_entries in ScopusSearch and other classes that inherit from Search (I guess you could add the option to pass in start as well, though note the first item below). A few things:

  • There are a ton of parameters like start that could theoretically be passable to the APIs (e.g., take a look at the parameters you can pass to the Search API - https://dev.elsevier.com/documentation/ScopusSearchAPI.wadl). For a more general solution, the scopus package may want to consider using **kargs instead of specifying all of these parameters one by one. In other words, it may be better to pick a few of the most important parameters like query and make those explicit parameters in the classes, but allow the user to pass in arbitrary keyword args using the **kargs parameter.
  • I would keep an eye on making sure any change you make in the parameters is backwards compatible, so it doesn't break people's code.
  • If you'd want to take a shot at a PR for this issue, I would suggest keeping it seperate from your existing PR. In other words, I would suggest opening a new PR for "refactoring API parameters."

from pybliometrics.

Michael-E-Rose avatar Michael-E-Rose commented on May 16, 2024

Good evening guys! I couldn't follow your discussion until now. I thank both of your for the all the contributions, much appreciated!

I agree with @crew102 that backwards compatibility is key. Also, there should be a more flexible approach to the Search classes (via kwargs) for the more experienced users. Also, I think we don't need the max_entries in the search class as a pre-capture: There will be an error anyways. The pre-capture was introduced very early and should be deprecated.

However I have not fully understood how the cursor works. Should it completely replace start and end? Can you please enlighten me?

PS: The fact that I am assigned just means that I, and not John, will be responsible for this issue.

from pybliometrics.

crew102 avatar crew102 commented on May 16, 2024

K, a few things here:

  • I agree with @crew102 that backwards compatibility is key.

K, we'll need to keep all params that exist in the classes (even if they won't be used) to achieve backwards compat, at least in the short term. Do you have a a preferred pattern for deprecating paramters? For example, let's say we want to deprecate the start parameter, how do you want to go about doing that? My first thought is to follow the pattern used in scikit-learn, where we would first soft-deprecate the parameter in the next few releases of the package and then later remove that param completely in future versions. The user would get a warning in the next release if they use a deprecated param. For example, take a look at these code snippets that should how pooling_func was deprecated:

https://github.com/scikit-learn/scikit-learn/blob/7b136e92acf49d46251479b75c88cba632de1937/sklearn/cluster/hierarchical.py#L753-L756

https://github.com/scikit-learn/scikit-learn/blob/7b136e92acf49d46251479b75c88cba632de1937/sklearn/cluster/hierarchical.py#L780-L784

  • Also, there should be a more flexible approach to the Search classes (via kwargs) for the more experienced users.

K, I can add this functionality. Will the Retrieval classes need this as well? If so, let's put that on the back-burner and concentrate on Search classes for now.

  • Also, I think we don't need the max_entries in the search class as a pre-capture: There will be an error anyways. The pre-capture as introduced very early and should be deprecated.

I actually think we want to keep the ability to error when the results exceed some user-defined threshold like max_entries. For example, consider this hypothetical situation that could occur if max_entries is totally removed: A non-subscriber calls ScopusSearch and their query results in 10,000 entries (i.e., 10,000 entries match their query). ScopusSearch then starts downloading the data and when it gets up to 5,000 entries downloaded the API throws an error saying "you've exceeded the amount your can download in a single result set." The user had to A) wait a long time to realize that they can't actually get all the results b/c the API won't throw that error immediately and B) they've now eaten up 5,000 of the 20,000 of their weekly download limit (https://dev.elsevier.com/api_key_settings.html). It would be a lot better if we could avoid this situation by just erroring after a check on max_entries is performed. I suggest we keep the default value of max_entries equal to 5,000 but allow the user to set max_entries=None, which would basically amount to not performing the check that max_entries is not exceeded.

  • However I have not fully understood how the cursor works. Should it completely replace start and end?

Can we actually discuss the cursor in the #87 thread instead?

from pybliometrics.

Michael-E-Rose avatar Michael-E-Rose commented on May 16, 2024

I implemented the kwds-functionality for ScopusSearch via d7f203a because it was a quick thing to do. I did so only for the ScopusSearch class because the other two searchers classes don't have fields that are not already implemented and which at the same time could be used w/o breaking the code.
A quick testing showed no difference in the number and kind of results returned, though.

from pybliometrics.

crew102 avatar crew102 commented on May 16, 2024

Hey @Michael-E-Rose , I was thinking that **kwds would be used for the query params, not the header. There are potentially a lot of query params that the user may want to specify - https://dev.elsevier.com/documentation/ScopusSearchAPI.wadl

from pybliometrics.

Michael-E-Rose avatar Michael-E-Rose commented on May 16, 2024

Yes you are right, I put it in the wrong dictionary. d079f93 fixes this: ScopusSearch now features keywords that go in to the query parameters.

I am testing a bit more now.

from pybliometrics.

Michael-E-Rose avatar Michael-E-Rose commented on May 16, 2024

To me this issue seems resolved.

ScopusSearch() uses the cursor for subscribers and the count-check for non-subscribers, AuthorSearch() and AffiliationSearch() will deprecate their max_entries and start parameters.

Thanks to both of you!

from pybliometrics.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.