Since writing this originally, the analysis has evolved somewhat, and this needs to be revised before being shared broadly.
As part of some work to expand the SimplyE project to include more materials useful to the research community, I've done some basic analysis of HathiTrust usage.
HathiTrust describes itself as:
> a partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future. The mission of HathiTrust is to contribute to research, scholarship, and the common good by collaboratively collecting, organizing, preserving, communicating, and sharing the record of human knowledge. There are more than 120 partners in HathiTrust, and membership is open to institutions worldwide.
It contains more than 15 million volumes, including 5.8 million open volumes in the public domain. NYPL, through its Google Library Project partnership, has contributed more than 300,000 volumes to HathiTrust. For more info see the HathiTrust about page.
The data below covers the period May 8, 2014 - May 7, 2017.
This graph displays all volumes, open volumes accessed, and closed volumes with attempted access by year. (See full graph for larger view.)
<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~hadro/15.embed"></iframe>

This graph displays all open volumes and all open volumes accessed by year (toggle "Total volumes" in the legend to see all HathiTrust volumes for a given year as well). (See full graph for larger view.)
<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~hadro/19.embed"></iframe>

- 15,095,384 retrievable records from the HathiFiles (Pandas choked on a small subset of records, which are not included here and are almost certainly not among the most-viewed items)
- 3,247,601 distinct HathiTrust volume IDs (that I was able to scrape out of three years of analytics data)
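
For readers who want to reproduce the "retrievable records" count, here is a minimal sketch of loading a tab-delimited HathiFile while skipping rows that fail to parse. It is not the code used for this analysis: the five-column layout, the sample rows, and the `load_hathifile` helper are all invented for illustration, and the `on_bad_lines="skip"` option assumes a reasonably recent pandas release.

```python
import csv
import io

import pandas as pd

# Hypothetical five-column layout; the real HathiFiles carry many more fields.
COLUMNS = ["htid", "access", "rights", "title", "pub_date"]

# Invented sample data. The third row has too many fields and stands in for
# the kind of record that a strict parse would reject.
SAMPLE = (
    "mdp.0001\tallow\tpd\tQuicksand,\t1928\n"
    "mdp.0002\tdeny\tic\tThe competent manager\t1982\n"
    "mdp.0003\tallow\tpd\tA row with\ttoo\tmany\tfields\n"
    "mdp.0004\tallow\tpd\tGodey's magazine.\t9999\n"
)


def load_hathifile(source):
    """Read a tab-delimited file, dropping rows that have too many fields."""
    return pd.read_csv(
        source,
        sep="\t",
        names=COLUMNS,
        header=None,
        dtype=str,               # keep IDs and dates as text
        quoting=csv.QUOTE_NONE,  # titles may contain stray quote characters
        on_bad_lines="skip",     # drop malformed rows instead of raising
    )


records = load_hathifile(io.StringIO(SAMPLE))
print(len(records), "retrievable records")
```

Reading every column as text avoids pandas guessing numeric types for identifiers, at the cost of converting dates explicitly later.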
See top_open_items.txt for the full list
Title | Uses | Publication Date |
---|---|---|
Quicksand, | 823739 views | 1928 |
The surnames of Scotland, their origin meaning and history / | 619250 views | 1962 |
Solid mensuration, | 452809 views | 1934 |
The human figure / | 348795 views | 1907 |
America is in the heart, a personal history, | 326990 views | 1946 |
Godey's magazine. | 311573 views | 1850 |
Roster of the Confederate soldiers of Georgia, 1861-1865 / | 285162 views | 9999 |
See NYPL_top_open_items.txt for the full list
See top_closed_items.txt for the full list
Title | Attempted Uses | Publication Date |
---|---|---|
The competent manager : a model for effective performance / | 12226 views | 1982 |
Objects of daily use, with over 1800 figures from University college, London, | 12150 views | 1927 |
My experiences in the world war, | 10976 views | 1931 |
Catalogue of Alexandrian coins, | 9132 views | 1933 |
The regimental history of the 3rd Queen Alexandra's own Gurkha rifles from April 1815 to December 1927, | 9088 views | 1929 |
See full chart for better view
This chart shows the distribution of publication years for the 500,000 most-requested items in HathiTrust from May 2014 - May 2017.
<iframe width="900" height="800" frameborder="0" scrolling="no" src="https://plot.ly/~hadro/12.embed"></iframe>

Note: the embedded histogram on this page includes only the top 40,000 data points because the free version of Plotly has a 40K data point limit. For a full-screen interactive version of this chart with all 500,000 data points, see the full Histogram.
Meanwhile, you can toggle the data series on this chart, for example if you want to view just the items among the top 500K that were requested but could not be viewed because they are "limited view" items (i.e. closed for copyright reasons).
Notes:
- There are 18,455 volumes not displayed on the complete chart, because they did not fall within the 1600-2020 publication date range:
  - 10,349 volumes with a publication date either before 1600 or after 2020
    - 7,770 of those have publication dates of "9999", which almost always means they are part of an ongoing publication or serial
  - 8,106 volumes with no valid publication date value in the HathiFiles
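
The breakdown in the notes above can be reproduced with a few boolean masks. The following is a rough sketch rather than the original analysis code: the `summarize_pub_dates` helper and the sample values are invented, and it assumes publication dates arrive as strings that may be empty or non-numeric.

```python
import pandas as pd


def summarize_pub_dates(raw_dates, first_year=1600, last_year=2020):
    """Count volumes inside and outside the charted publication date range."""
    # Anything that is not a plain number (empty, None, "18uu") becomes NaN.
    years = pd.to_numeric(raw_dates, errors="coerce")

    no_valid_date = years.isna()
    in_range = years.between(first_year, last_year)  # inclusive on both ends
    out_of_range = ~no_valid_date & ~in_range
    ongoing = years == 9999  # placeholder date for serials

    return {
        "displayed": int(in_range.sum()),
        "out_of_range": int(out_of_range.sum()),
        "of_which_9999": int(ongoing.sum()),
        "no_valid_date": int(no_valid_date.sum()),
        "not_displayed": int((~in_range).sum()),
    }


# Invented sample values, not taken from the HathiFiles.
sample = pd.Series(["1928", "1962", "9999", "1550", "2031", "", "18uu", None])
summary = summarize_pub_dates(sample)
print(summary)
```

In this scheme `not_displayed` is always the sum of `out_of_range` and `no_valid_date`, which mirrors how 10,349 and 8,106 add up to 18,455 in the notes.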
See full chart for better view
This chart describes the usage curve for the top 10,000 items in HathiTrust from May 2014 - May 2017.
<iframe width="900" height="800" frameborder="0" scrolling="no" src="https://plot.ly/~hadro/11.embed"></iframe>

(For a full-screen interactive version of this chart, see the Line Chart, or the linear scale version.)
My steps and the tools I used were very roughly as follows:
- Scraped three years of daily complete Google Analytics URLs, using Corey Harper's helpful PygAnalytics tool
- Downloaded the complete HathiFiles for May 2017, which include basic bibliographic and rights metadata for every volume in HathiTrust
- Various slicing, dicing, matching, joining, and other manipulations using the invaluable Pandas Python Data Analysis Library
  - Unholy amounts of regular expressions, via the Pandas `.str.extract()` and `.str.extractall()` methods
- Ingest of ~15 million rows of HathiFiles into a Postgres database, using the Pandas `.to_sql()` method
- Data visualization using the Plotly Python Library (including the handy ability to run Plotly in 'offline mode', so you don't have to constantly upload each iteration of a revised chart)
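
To make the regex, join, and database steps above more concrete, here is a small sketch of how they might fit together. It is illustrative only: the URL shape, the regular expression, the column names, and all sample rows are assumptions, and an in-memory SQLite database stands in for Postgres so the example runs without a server.

```python
import sqlite3

import pandas as pd

# Invented analytics rows; the page_path format is an assumption.
hits = pd.DataFrame({
    "page_path": [
        "/cgi/pt?id=mdp.39015000000001;view=1up;seq=7",
        "/cgi/pt?id=mdp.39015000000001;view=1up;seq=8",
        "/cgi/pt?id=uc1.b000000002&view=1up&seq=1",
        "/cgi/ls?q1=quicksand",  # a search page, no volume ID
    ],
    "pageviews": [10, 5, 3, 99],
})

# Pull the volume ID out of each URL; rows without an ID become NaN.
hits["htid"] = hits["page_path"].str.extract(r"[?;&]id=([^;&]+)", expand=False)

# Total views per volume, ignoring pages that carry no volume ID.
usage = (
    hits.dropna(subset=["htid"])
    .groupby("htid", as_index=False)["pageviews"]
    .sum()
)

# Invented stand-in for the HathiFiles metadata.
meta = pd.DataFrame({
    "htid": ["mdp.39015000000001", "uc1.b000000002", "nyp.33433000000003"],
    "title": ["Quicksand,", "Solid mensuration,", "The human figure /"],
    "access": ["allow", "allow", "deny"],
})

# Join usage to metadata and rank by views.
top = (
    usage.merge(meta, on="htid", how="left")
    .sort_values("pageviews", ascending=False)
    .reset_index(drop=True)
)

# Load the metadata into a database (SQLite here, Postgres in the write-up).
conn = sqlite3.connect(":memory:")
meta.to_sql("hathifiles", conn, index=False, if_exists="replace")
open_count = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM hathifiles WHERE access = 'allow'", conn
)["n"].iloc[0]
conn.close()

print(top)
print(open_count, "open volumes in the metadata table")
```

With a real Postgres instance, the same `.to_sql()` call would take a SQLAlchemy engine in place of the SQLite connection, and passing `chunksize` can help when writing millions of rows.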