hathi_analysis's Introduction

N.B. Text below being revised, new charts added, etc.

Since writing this originally, the analysis has evolved somewhat, and this needs to be revised before being shared broadly.

HathiTrust Usage Analysis

As part of some work to expand the SimplyE project to include more materials useful to the research community, I've done some basic analysis of HathiTrust usage.

HathiTrust describes itself as:

a partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future. The mission of HathiTrust is to contribute to research, scholarship, and the common good by collaboratively collecting, organizing, preserving, communicating, and sharing the record of human knowledge. There are more than 120 partners in HathiTrust, and membership is open to institutions worldwide.

It contains more than 15 million volumes, including 5.8 million open volumes in the public domain. NYPL, through its Google Library Project partnership, has contributed more than 300,000 volumes to HathiTrust. For more info see the HathiTrust about page.

The data below covers the period May 8, 2014 - May 7, 2017.

Usage

All Items

This graph displays all volumes, open volumes accessed, and closed volumes with attempted access by year. (See full graph for larger view.)

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~hadro/15.embed"></iframe>

Open items

This graph displays all open volumes and all open volumes accessed by year (toggle "Total volumes" in the legend to see all HathiTrust volumes for a given year as well). (See full graph for larger view.)

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~hadro/19.embed"></iframe>

Headline numbers

15,095,384 retrievable records from the Hathifiles (Pandas choked on a small subset of records which are not included here, and are almost certainly not among the most-viewed items)

3,247,601 distinct HathiTrust volume IDs (that I was able to scrape out of three years of analytics data)

Most viewed titles

See top_open_items.txt for the full list

| Title | Uses (views) | Publication Date |
| --- | --- | --- |
| Quicksand, | 823739 | 1928 |
| The surnames of Scotland, their origin meaning and history / | 619250 | 1962 |
| Solid mensuration, | 452809 | 1934 |
| The human figure / | 348795 | 1907 |
| America is in the heart, a personal history, | 326990 | 1946 |
| Godey's magazine. | 311573 | 1850 |
| Roster of the Confederate soldiers of Georgia, 1861-1865 / | 285162 | 9999 |

Most viewed NYPL titles

See NYPL_top_open_items.txt for the full list

| Title | Uses (views) | Publication Date |
| --- | --- | --- |
| Miranda / | 48465 | 1915 |
| Wife no. 19, or the story of a life in bondage : being a complete exposé of Mormonism, and revealing the sorrows, sacrifices and sufferings of women in polygamy / | 36158 | 1875 |
| Men of West Virginia ... | 30308 | 1903 |
| Illustrated trade catalogue and price list : manufacturers, importers and jobbers of watchmakers', jewelers' and engravers' supplies of every description : optical goods, chains, charms, etc. : originators of the box matetial [sic] and makers of Swartchild's celebrated watchmakers' benches : 1897-1898 / | 28594 | 1897 |
| A standard history of Stark County, Ohio : an authentic narrative of the past, with particular attention to the modern era in the commercial, industrial, civic and social development : a chronicle of the people, with family lineage and memoirs / | 26722 | 1916 |

Closed ("Limited View") titles with most attempted views

See top_closed_items.txt for the full list

| Title | Attempted Uses (views) | Publication Date |
| --- | --- | --- |
| The competent manager : a model for effective performance / | 12226 | 1982 |
| Objects of daily use, with over 1800 figures from University college, London, | 12150 | 1927 |
| My experiences in the world war, | 10976 | 1931 |
| Catalogue of Alexandrian coins, | 9132 | 1933 |
| The regimental history of the 3rd Queen Alexandra's own Gurkha rifles from April 1815 to December 1927, | 9088 | 1929 |

Publication Date analysis

See full chart for better view

This chart describes the frequency of publication year for the top 500,000 requested items in HathiTrust from May 2014 - May 2017.

<iframe width="900" height="800" frameborder="0" scrolling="no" src="https://plot.ly/~hadro/12.embed"></iframe>

Note: the embedded histogram on this page only includes the top 40,000 data points because the free version of plotly has a 40K data point limit; for a full-screen interactive version of this chart with 500,000 data points, see the full Histogram.

Meanwhile, you can toggle the data series on this chart, for example if you want to view just the items among the top 500K that were requested but could not be viewed because they are "limited view" items (i.e. closed for copyright reasons).

Notes:

  • There are 18,455 volumes not displayed on the complete chart, because they did not fall within the 1600-2020 publication date range
    • 10,349 volumes with a publication date either before 1600 or after 2020
      • 7,770 of those have publication dates of "9999", which almost always means they are part of an ongoing publication or serial
    • 8,106 volumes with no valid publication date value in the HathiFiles
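
The breakdown in these notes can be illustrated with a small classifier. This is a minimal sketch, not the code used for the chart: the helper name `classify_pub_date` and the sample values are hypothetical, and it assumes publication dates arrive as strings that may be empty, non-numeric, or float-like (such as "1949.0"). Unlike the notes above, it reports "9999" as its own bucket rather than as a subset of the out-of-range dates.

```python
from collections import Counter

MIN_YEAR, MAX_YEAR = 1600, 2020  # range shown on the complete chart


def classify_pub_date(raw):
    """Sort one raw publication date value into a bucket (hypothetical helper)."""
    try:
        year = int(float(raw))  # accepts "1949" as well as "1949.0"
    except (TypeError, ValueError, OverflowError):
        return "no valid date"
    if year == 9999:
        return "9999 (ongoing or serial)"
    if year < MIN_YEAR or year > MAX_YEAR:
        return "out of range"
    return "charted"


# Made-up sample values, for illustration only
sample = ["1928", "1962.0", "9999", "", None, "1599", "2021", "18uu"]
buckets = Counter(classify_pub_date(value) for value in sample)
print(buckets)
```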

Usage analysis

See full chart for better view

This chart describes the usage curve for the top 10,000 items in HathiTrust from May 2014 - May 2017.

<iframe width="900" height="800" frameborder="0" scrolling="no" src="https://plot.ly/~hadro/11.embed"></iframe>

For a full-screen interactive version of this chart, see Line Chart (or the linear scale version).
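
A usage curve of this kind is a rank-ordered series: items are sorted by views, and views are then plotted against rank, often on a logarithmic scale. The following is a minimal sketch of that ordering step only, using the five highest view counts listed under "Most viewed titles" as input. The variable names are illustrative, and this is not the code behind the chart.

```python
import math

# View counts of the five most viewed titles listed in this document
views = [348795, 823739, 452809, 619250, 326990]

# Rank 1 is the most viewed item
ranked = sorted(views, reverse=True)
curve = [(rank, count, math.log10(count)) for rank, count in enumerate(ranked, start=1)]

for rank, count, log_count in curve:
    print(f"rank {rank}: {count} views (log10 = {log_count:.2f})")
```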

Tools and method

My steps and the tools I used were very roughly as follows:

  • Scraped three years of daily complete Google Analytics URLs, using Corey Harper's helpful PygAnalytics tool
  • Downloaded the complete HathiFiles for May 2017, which include basic bibliographic and rights metadata for every volume in HathiTrust
  • Various slicing, dicing, matching, joining, and other manipulations using the invaluable Pandas Python Data Analysis Library
    • Unholy amounts of regular expressions, via the Pandas .str.extract() and .str.extractall() methods
  • Ingest of ~15 million rows of HathiFiles into a Postgres database, using the Pandas .to_sql() method
  • Data visualization using the Plotly Python Library (including the handy ability to run Plotly in 'offline mode', so you don't have to constantly upload each iteration of a revised chart)
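
The regular-expression step in the list above can be sketched roughly as follows. This is an assumption-heavy illustration rather than the actual scraping code: the page paths are invented, the pattern assumes the volume ID appears in an `id=` query parameter, and it uses the standard library's `re` module where the analysis itself used Pandas string methods.

```python
import re
from collections import Counter

# Assumed shape: the volume ID sits in an "id=" query parameter and ends
# at the next "&" or ";" separator (or at the end of the path).
VOLUME_ID = re.compile(r"[?&;]id=([^&;]+)")

# Invented (page path, pageviews) rows standing in for the analytics export
rows = [
    ("/cgi/pt?id=mdp.39015000000001;view=1up;seq=7", 3),
    ("/cgi/pt?id=mdp.39015000000001&seq=12", 2),
    ("/cgi/pt?id=nyp.33433000000002", 5),
    ("/cgi/ls?q1=quicksand", 9),  # no "id=" parameter, so it is skipped
]

# Sum pageviews per volume ID across all matching paths
views_by_volume = Counter()
for path, pageviews in rows:
    match = VOLUME_ID.search(path)
    if match:
        views_by_volume[match.group(1)] += pageviews

print(views_by_volume.most_common())
```

Because the pattern has a single capture group, the same expression could in principle be passed to a column-wise extraction instead of looping over rows.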

hathi_analysis's Issues

Comparison of CCE to HT

As a user, I would like to compare the number of CCE registrations for books published in the US with HT's data of US published books so that I can determine whether CCE registrations are a good representation for all books published in the US for a particular time.

I have book registration data for 1923-1952 (1953-onward includes non-book type registrations).

AC1: For each year from 1923 to 1952, please count the number of unique titles in HT.

-- unique HathiTrust records for one year (1949 shown)
select count(distinct(hathitrust_record_number))
from hathifiles
where (publication_date = '1949' or publication_date = '1949.0')
  and bibliograhic_format = 'BK'
  and publication_place LIKE '__u'

-- unique OCLC numbers for the same year
select count(distinct(oclc_number))
from hathifiles
where (publication_date = '1949' or publication_date = '1949.0')
  and bibliograhic_format = 'BK'
  and publication_place LIKE '__u'
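
To cover every year in AC1 with one query instead of editing the year by hand, the counts can be grouped by a normalized year. The sketch below is an illustration under stated assumptions: it runs against an in-memory SQLite table as a stand-in for the Postgres database, the rows are invented, and the table and column names (including the spelling `bibliograhic_format`) are copied from the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
conn.execute(
    """CREATE TABLE hathifiles (
           hathitrust_record_number TEXT,
           publication_date TEXT,
           bibliograhic_format TEXT,  -- spelled as in the original queries
           publication_place TEXT)"""
)
# Invented rows, for illustration only
conn.executemany(
    "INSERT INTO hathifiles VALUES (?, ?, ?, ?)",
    [
        ("rec1", "1949", "BK", "nyu"),
        ("rec1", "1949.0", "BK", "nyu"),  # second volume of the same record
        ("rec2", "1949", "BK", "mau"),
        ("rec3", "1949", "SE", "nyu"),    # format is not 'BK'
        ("rec4", "1950.0", "BK", "cau"),
        ("rec5", "1950", "BK", "enk"),    # place code does not end in "u"
        ("rec6", "1960", "BK", "nyu"),    # outside 1923-1952
    ],
)

# The nested CAST folds '1949' and '1949.0' into the same integer year
query = """
SELECT year, COUNT(DISTINCT hathitrust_record_number) AS titles
FROM (
    SELECT CAST(CAST(publication_date AS REAL) AS INTEGER) AS year,
           hathitrust_record_number
    FROM hathifiles
    WHERE bibliograhic_format = 'BK'
      AND publication_place LIKE '__u'
)
WHERE year BETWEEN 1923 AND 1952
GROUP BY year
ORDER BY year
"""
counts = dict(conn.execute(query).fetchall())
print(counts)
```

The double cast is written for SQLite; the equivalent expression on Postgres may need to be adapted.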

Add in the usage

It would be good to see if there are relationships between publication year and usage amount on a volume level. This might need some binning to be useful.

Quick sketch
Axis 1: Years, maybe binned into decades or centuries
Axis 2: Access level, binned into ... 0, 1, 2-5, 6-20, ... 1,000-1,000,000 (not really sure about the bins)
Axis 3: Either number of volumes in year/access bin, or percentage of volumes in year/access bin compared to that entire year's volumes
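
The binning in this sketch could look roughly like the following. It is a minimal illustration, not a settled design: the bins 0, 1, 2-5, 6-20, and 1,000-1,000,000 come from the list above, while the two middle bins, the helper names, and the sample volumes are placeholders invented for the example.

```python
from bisect import bisect_left
from collections import Counter

# Upper (inclusive) edge of each access bin. The two middle bins
# (21-100 and 101-999) are invented placeholders.
EDGES = [0, 1, 5, 20, 100, 999, 1_000_000]
LABELS = ["0", "1", "2-5", "6-20", "21-100", "101-999", "1,000-1,000,000"]


def access_bin(views):
    """Return the label of the access bin for a view count.

    Counts above 1,000,000 fall outside the last bin and raise IndexError.
    """
    return LABELS[bisect_left(EDGES, views)]


def decade(year):
    """Bin a publication year into its decade, e.g. 1928 -> 1920."""
    return year // 10 * 10


# Invented (publication year, views) pairs, one per volume
volumes = [(1928, 1500), (1925, 0), (1921, 3), (1934, 1), (1907, 17), (1903, 0)]

cell_counts = Counter((decade(y), access_bin(v)) for y, v in volumes)
decade_totals = Counter(decade(y) for y, _ in volumes)

# Axis 3, second option: share of each decade's volumes in each access bin
for (dec, label), n in sorted(cell_counts.items()):
    print(dec, label, n, f"{n / decade_totals[dec]:.0%}")
```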

Access relative to supply

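I'd be interested in seeing how the top 40K items represent access relative to the amount of material in HathiTrust. A rough sketch in code to explain it:

import pandas as pd

# data: one row per publication year (placeholder, still to be assembled)
df = pd.DataFrame(data, columns=['year', 'items_in_top40', 'items_in_hathi', 'items_open'])

# Share of all HathiTrust items from each year that appear in the top 40K
df['rel_total'] = df.items_in_top40 / df.items_in_hathi
df.plot(x='year', y='rel_total', kind='scatter')

# Share of open items from each year that appear in the top 40K
df['rel_open'] = df.items_in_top40 / df.items_open
df.plot(x='year', y='rel_open', kind='scatter')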

Access oriented graph

Via NK:

what if we did these graphs with just open volumes, so
all volumes in hathi
all open volumes in hathi
all accessed open volumes in hathi
on a big overlaid bar chart
and then the bottom panel would just be a timeseries of % of open volumes accessed
it would give a bit of a more continuous idea of “if we do the work to open this item, what is the chance it will be accessed?”
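
The bottom-panel series proposed here is a per-year ratio of accessed open volumes to all open volumes. A minimal sketch of that calculation, with invented counts and a hypothetical helper name; it does not draw the chart:

```python
# Invented per-year counts: (open volumes in HathiTrust, open volumes accessed)
open_by_year = {
    1900: (1000, 250),
    1901: (800, 400),
    1902: (0, 0),  # a year with no open volumes
}


def pct_open_accessed(open_total, open_accessed):
    """Percentage of open volumes accessed, or None when there are no open volumes."""
    if open_total == 0:
        return None
    return 100 * open_accessed / open_total


series = {year: pct_open_accessed(*counts) for year, counts in sorted(open_by_year.items())}
print(series)
```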

CC @nkrabben
