Giter Site home page Giter Site logo

nypl / libbib Goto Github PK

View Code? Open in Web Editor NEW
24.0 13.0 0.0 1.35 MB

An R package providing WorldCat API communication, functions for validating and normalizing bibliographic codes, translation from call numbers to subject, and other related utilities helpful for assessment librarians

License: GNU General Public License v3.0

R 99.50% Makefile 0.50%
library-science library-assessment marc-records isbn lccn dewey-decimal library-of-congress issn worldcat subject-classifications

libbib's Introduction

libbib

Cran Version

Description

An R package providing WorldCat API communication, functions for validating and normalizing bibliographic codes such as ISBN; ISSN; and LCCN, translation from call numbers (Library of Congress and Dewey Decimal) to their subject classifications or subclassifications, and other related utilities helpful for assessment librarians. Also provides various loadable data files such call number / subject crosswalks and code tables.

Speed of execution and robustness are priorities in this package. To the end of optimizing speed and efficiency, careful consideration is taken to exploit vectorized functions and efficient data.table joins in the code.

On a (real life) example of 3 million very messy ISBN 10s and 13s, aggressive ISBN normalization took less than 1 minute (commodity hardware). It salvaged almost half a million previously unusable ISBNs, bringing the number of usable ISBNs (for matching, cataloging, etc...) from less than 90% to 99.6%!

As for robustness, this package is well tested, with over 300 automated tests, at time of writing.

data.table is a hard dependency of this package. Using data.table internally makes, for example, the call number -> subject conversions very fast. Additionally, loading this package also automatically loads data.table. If you don't use data.table in your own code, everything will work just fine! But you might want to look into it!

Some examples

> get_dewey_decimal_subject_class("823.912")
[1] "Literature (Belles-lettres) and rhetoric"

> get_lc_call_subject_classification(c("ND 237", "PQ2246.M3"),
+                                    subclassification=TRUE)
[1] "Painting"
[2] "French, Italian, Spanish, and Portuguese literature"

> convert_to_isbn_13(c("012491540X", "9004037810"))
[1] "9780124915404" "9789004037816"

> normalize_isbn_13(c("978-9-66-819791-8", "__9__781572411579"))
[1] "9789668197918" "9781572411579"

> normalize_lccn(" 79139101 /AC/r932")
[1] "79139101"

> # Do a WorldCat APU search on 19th century materials on ethics
> # (Dewey code 170s / LC Call prefix BJ)
> results <- worldcat_api_search('($dewey="17*" or $lc_call="bj*")
+                                    and srw.yr="18*"')
> results[,.(oclc, title, result_number, num_results)][1:5]
        oclc                                       title result_number num_results
      <char>                                      <char>         <int>      <char>
1:   8665856 The principles of moral and political ph...             1        1716
2: 191264919                The economy of human life. /             2        1716
3:  22571399                                   Solitude:             3        1716
4:  65250134 The theory of moral sentiments, or, An e...             4        1716
5:  13106952    Letters on the improvement of the mind :             5        1716

> worldcat_api_bib_read_info_by_isbn("9780984201006")
        oclc          isbn   issn                        title
      <char>        <char> <char>                       <char>
1: 462894360 9780984201006   <NA> The Great Debate about Art /
         author                   leader
         <char>                   <char>
1: Harris, Roy, 00000cam a2200000 a 4500
                                       oh08
                                     <char>
1: 091031s2010    ilua     b    000 0 eng c

> worldcat_api_classify_by_oclc("877749545")
        oclc                 title               author total_holdings total_eholdings
      <char>                <char>               <char>          <int>           <int>
1: 877749545 Max Ernst's 'Celebes' Penrose, Roland, Sir              8               0
2: 877749545 Max Ernst's 'Celebes' Penrose, Roland, Sir              8               0
   call_type recommendation holdings http_status_code classify_response_code
      <char>         <char>   <char>            <int>                  <int>
1:       DDC          759.4        6              200                      0
2:       LCC       ND588.E7        6              200                      0

> worldcat_api_locations_by_oclc("877749545", max_libraries=10,
+                                include.bib.info=FALSE)
        oclc institution_identifier
      <char>                 <char>
1: 877749545                    NLE
2: 877749545                    NLW
3: 877749545                    EUM
4: 877749545                    LTU
5: 877749545                    ELU
6: 877749545                  UKUAL
                                institution_name copies
                                          <char> <char>
1:                  National Library of Scotland      1
2:                     National Library of Wales      1
3:              University of Manchester Library      1
4: University of Leicester, David Wilson Library      1
5:     University of London Senate House Library      1
6:                 University of the Arts London      1

> loc_permalink_from_lccn("73167510", format="marcxml")
[1] https://lccn.loc.gov/73167510/marcxml

> loc_permalink_from_lccn(c("2010292065", "2012451004")
[1] "https://lccn.loc.gov/2010292065" "https://lccn.loc.gov/2012451004"

> worldcat_permalink_from_oclc_number("1005106045")
[1] "http://www.worldcat.org/oclc/1005106045"

> get_isbn_10_check_digit("0-124-91540-X", allow.hyphens=TRUE)
[1] "X"

> is_valid_isbn_10(c("012491540X", "9004037812"))
[1] TRUE FALSE

> marc_leader_get_info("00000cam a22000008i 4500")
         record_type      bib_level
              <char>         <char>
1: Language Material Monograph/Item

# The Brothers Karamazov (1970 reissue but original publication date)
> marc_008_get_info("950622r19701880ru            000 0 rus d",
+                    original.pub.date=TRUE)
      pub_date pub_place_code lang_code
         <int>         <char>    <char>
  1:     1880             ru       rus

# reissue publication date
> marc_008_get_info("950622r19701880ru            000 0 rus d")
      pub_date pub_place_code lang_code
         <int>         <char>    <char>
  1:     1970             ru       rus

> data("dewey_subject_crosswalk")
> dewey_subject_crosswalk
Key: <thekey>
     thekey                                           description
     <char>                                                <char>
  1:    000               Computer science, knowledge and systems
  2:    001                                             Knowledge
  ...
990:    998                         Arctic islands and Antarctica
991:    999                               Extraterrestrial worlds

> data("lc_subject_subclassification")
> lc_subject_subclassification
Key: <thekey>
     thekey                                           description
     <char>                                                <char>
  1:     AC                  Collections. Series. Collected works
  2:     AE                                         Encyclopedias
  ...
227:      Z Books (General). Writing. Paleography. Book indust...
228:     ZA                       Information resources/materials

> get_language_from_code(c("yor", "spa"))
[1] "Yoruba"  "Spanish"

> get_country_from_code(c("ck", " NYA"))
[1] "Colombia"         "New York (State)"

> someisbns <- c("9782711875177;garbage-isbn;2711875172;2844268900",
+                "1861897952; 978-1-86189-795-4")
> split_map_filter_reduce(someisbns,
+                         mapfun=
+                           function(x){normalize_isbn(x, convert.to.isbn.13=TRUE)},
+                         filterfun=remove_duplicates_and_nas,
+                         reduxfun=recombine_with_sep_closure())
[1] "9782711875177;9782844268907" "9781861897954"

Functions

ISBNs

  • get_isbn_10_check_digit
  • check_isbn_10_check_digit
  • is_valid_isbn_10
  • normalize_isbn_10
  • get_isbn_13_check_digit
  • check_isbn_13_check_digit
  • is_valid_isbn_13
  • convert_to_isbn_13
  • normalize_isbn_13
  • normalize_isbn

ISSN

  • get_issn_check_digit
  • check_issn_check_digit
  • is_valid_issn
  • normalize_issn

LCCN

  • normalize_lccn

Library of Congress Call Numbers

  • get_lc_call_subject_classification
  • is_valid_lc_call
  • get_lc_call_first_letter
  • get_all_lc_call_subject_letters

Dewey Decimal (DDC) Call Numbers

  • get_dewey_decimal_subject_class
  • get_dewey_decimal_subject_division
  • get_dewey_decimal_subject_section

WorldCat API

  • worldcat_api_search
  • worldcat_api_bib_read_info_by_oclc
  • worldcat_api_bib_read_info_by_isbn
  • worldcat_api_bib_read_info_by_issn
  • worldcat_api_locations_by_oclc
  • worldcat_api_locations_by_isbn
  • worldcat_api_locations_by_issn
  • worldcat_api_classify_by_oclc
  • worldcat_api_classify_by_isbn
  • worldcat_api_classify_by_issn

Interfacing with the web

  • loc_permalink_from_lccn
  • worldcat_permalink_from_issn
  • worldcat_permalink_from_isbn
  • worldcat_permalink_from_oclc_number
  • oclc_classify_link_from_standard_num

Included data files (loadable with data(datafile))

Marc field deconstruction

  • marc_leader_get_info
  • marc_008_get_info

other Marc code translations

  • get_language_from_code
  • get_country_from_code

split / map / filter / reduce and related utilies

  • car
  • remove_duplicates_and_nas
  • recombine_with_sep_closure
  • split_map_filter_reduce

miscellaneous utilities

  • dt_del_cols
  • dt_keep_cols
  • dt_counts_and_percents
  • get_clean_names
  • dt_set_clean_names
  • dt_percent_not_na
  • dt_na_breakdown
  • dt_add_to_col_names

working with special attributes

  • fread_plus_date
  • fwrite_plus_date
  • set_lb_attribute
  • set_lb_date
  • cp_lb_attributes

libbib's People

Contributors

tonyfischetti avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

libbib's Issues

Take a better look at efficiency

  • Is it being as efficient as it could possibly be? No.
  • Is there some optimizations we can do regarding NAs? Yes.
  • Are huge vectors being copied? I don't know.
  • Can certain repeated operations be eliminated with arguments to foundation functions? Yes.
  • Should we do ISBN 10 normalization after ISBN 13 is positive in normalize_ISBN$. Probably not.
  • Should the call to convert_to_isbn_10 in normalize_isbn_10 specify skip.validity.check=FALSE. Oh yeah. (and then get rid of that comment)
  • Is gsub("x", "X"...) more efficient than to_upper? Could be.
  • Can you compile regexes in R? Possibly.
  • Is making that y vector in normalize_ISBN_10 costly? Almost certainly.

404 error assistance

Hello - I'm new to libraries, but competent in R. I'm getting a 404 error on worldcat_api_search:

worldcat_api_search('($dewey="17*" or $lc_call="bj*") and srw.yr="18*"') Error in worldcat_api_get_http_response(fullurl, print.api.responses = debug) : API returned non-OK status code 404

I believe the proxy is working (i.e. it returned 407 before I defined the proxy correctly in my .Renviron). Is there a function to test the connection? If not, any ideas on how can I diagnose whether the issue is proxy, API key or other?

Test new functions

Write testthat tests for is_valid_issn, normalize_issn, and normalize_lccn.

Además... fix Roxygen doc for normalize_lccn

read_xml error

The package libbib installed on ubuntu linux through install.packages command. The following error appears with a proposed example

worldcat_api_classify_by_oclc("877749545")
Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html, :
Space required after the Public Identifier [65]

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.