This project forked from ropensci/gutenbergr

Search and download public domain texts from Project Gutenberg

Home Page: https://docs.ropensci.org/gutenbergr

gutenbergr: R package to search and download public domain texts from Project Gutenberg

Authors: David Robinson
License: GPL-2

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Download and process public domain works from the Project Gutenberg collection. The package includes:

  • A function gutenberg_download() that downloads one or more works from Project Gutenberg by ID: e.g., gutenberg_download(84) downloads the text of Frankenstein.
  • Metadata for all Project Gutenberg works as R datasets, so that they can be searched and filtered:
    • gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc.
    • gutenberg_authors contains information about each author, such as aliases and birth/death year
    • gutenberg_subjects contains pairings of works with Library of Congress subjects and topics
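
As a rough illustration of how these datasets can be combined, the sketch below looks up works under one subject heading and attaches their titles and authors. It assumes gutenberg_subjects has a subject column and that both datasets share gutenberg_id; the subject string is only an illustrative choice, so adjust it to whatever headings your installed version contains.

```r
library(dplyr)
library(gutenbergr)

# Rows of the subject table matching one heading.
# Assumption: the column is named `subject`; the heading is an example value.
mystery_ids <- gutenberg_subjects %>%
  filter(subject == "Detective and mystery stories")

# Attach title and author from the per-work metadata via the shared ID.
mystery_works <- mystery_ids %>%
  inner_join(gutenberg_metadata, by = "gutenberg_id") %>%
  select(gutenberg_id, title, author, subject)

mystery_works
```

The datasets ship with the package, so this runs without contacting Project Gutenberg.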

Installation

Install the package with:

install.packages("gutenbergr")

Or install the development version using devtools with:

devtools::install_github("ropensci/gutenbergr")

Examples

The gutenberg_works() function retrieves, by default, a table of metadata for all unique English-language Project Gutenberg works that have text associated with them. (The gutenberg_metadata dataset has all Gutenberg works, unfiltered.)

Suppose we wanted to download Emily Brontë's "Wuthering Heights." We could find the book's ID by filtering:

library(dplyr)
library(gutenbergr)

gutenberg_works() %>%
  filter(title == "Wuthering Heights")
#> # A tibble: 1 x 8
#>   gutenberg_id title             author        gutenberg_author_id language
#>          <int> <chr>             <chr>                       <int> <chr>   
#> 1          768 Wuthering Heights Brontë, Emily                 405 en      
#>   gutenberg_bookshelf                                 rights                    has_text
#>   <chr>                                               <chr>                     <lgl>   
#> 1 Gothic Fiction/Movie Books/Best Books Ever Listings Public domain in the USA. TRUE

# or just:
gutenberg_works(title == "Wuthering Heights")
#> # A tibble: 1 x 8
#>   gutenberg_id title             author        gutenberg_author_id language
#>          <int> <chr>             <chr>                       <int> <chr>   
#> 1          768 Wuthering Heights Brontë, Emily                 405 en      
#>   gutenberg_bookshelf                                 rights                    has_text
#>   <chr>                                               <chr>                     <lgl>   
#> 1 Gothic Fiction/Movie Books/Best Books Ever Listings Public domain in the USA. TRUE

Since we see that it has gutenberg_id 768, we can download it with the gutenberg_download() function:

wuthering_heights <- gutenberg_download(768)
#>                                          768 
#> "http://aleph.gutenberg.org/7/6/768/768.zip"
wuthering_heights
#> # A tibble: 12,085 x 2
#>    gutenberg_id text                                                                     
#>           <int> <chr>                                                                    
#>  1          768 "WUTHERING HEIGHTS"                                                      
#>  2          768 ""                                                                       
#>  3          768 ""                                                                       
#>  4          768 "CHAPTER I"                                                              
#>  5          768 ""                                                                       
#>  6          768 ""                                                                       
#>  7          768 "1801.--I have just returned from a visit to my landlord--the solitary"  
#>  8          768 "neighbour that I shall be troubled with.  This is certainly a beautiful"
#>  9          768 "country!  In all England, I do not believe that I could have fixed on a"
#> 10          768 "situation so completely removed from the stir of society.  A perfect"   
#> # … with 12,075 more rows

gutenberg_download() can download multiple books when given a vector of IDs. It also takes a meta_fields argument that adds columns from the metadata to the result.

# 1260 is the ID of Jane Eyre
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
#>                                              768                                             1260 
#>     "http://aleph.gutenberg.org/7/6/768/768.zip" "http://aleph.gutenberg.org/1/2/6/1260/1260.zip"
books
#> # A tibble: 32,744 x 3
#>    gutenberg_id text                                                                     
#>           <int> <chr>                                                                    
#>  1          768 "WUTHERING HEIGHTS"                                                      
#>  2          768 ""                                                                       
#>  3          768 ""                                                                       
#>  4          768 "CHAPTER I"                                                              
#>  5          768 ""                                                                       
#>  6          768 ""                                                                       
#>  7          768 "1801.--I have just returned from a visit to my landlord--the solitary"  
#>  8          768 "neighbour that I shall be troubled with.  This is certainly a beautiful"
#>  9          768 "country!  In all England, I do not believe that I could have fixed on a"
#> 10          768 "situation so completely removed from the stir of society.  A perfect"   
#>    title            
#>    <chr>            
#>  1 Wuthering Heights
#>  2 Wuthering Heights
#>  3 Wuthering Heights
#>  4 Wuthering Heights
#>  5 Wuthering Heights
#>  6 Wuthering Heights
#>  7 Wuthering Heights
#>  8 Wuthering Heights
#>  9 Wuthering Heights
#> 10 Wuthering Heights
#> # … with 32,734 more rows

books %>%
  count(title)
#> # A tibble: 2 x 2
#>   title                           n
#>   <chr>                       <int>
#> 1 Jane Eyre: An Autobiography 20659
#> 2 Wuthering Heights           12085

It can also take the output of gutenberg_works() directly. For example, we could get the text of all of Aristotle's works, each annotated with both gutenberg_id and title, using:

aristotle_books <- gutenberg_works(author == "Aristotle") %>%
  gutenberg_download(meta_fields = "title")
#>                                                 1974 
#>     "http://aleph.gutenberg.org/1/9/7/1974/1974.zip" 
#>                                                 2412 
#>     "http://aleph.gutenberg.org/2/4/1/2412/2412.zip" 
#>                                                 6762 
#>     "http://aleph.gutenberg.org/6/7/6/6762/6762.zip" 
#>                                                 6763 
#>     "http://aleph.gutenberg.org/6/7/6/6763/6763.zip" 
#>                                                 8438 
#>     "http://aleph.gutenberg.org/8/4/3/8438/8438.zip" 
#>                                                12699 
#> "http://aleph.gutenberg.org/1/2/6/9/12699/12699.zip" 
#>                                                26095 
#> "http://aleph.gutenberg.org/2/6/0/9/26095/26095.zip"

aristotle_books
#> # A tibble: 39,950 x 3
#>    gutenberg_id text                                                                    
#>           <int> <chr>                                                                   
#>  1         1974 "THE POETICS OF ARISTOTLE"                                              
#>  2         1974 ""                                                                      
#>  3         1974 "By Aristotle"                                                          
#>  4         1974 ""                                                                      
#>  5         1974 "A Translation By S. H. Butcher"                                        
#>  6         1974 ""                                                                      
#>  7         1974 ""                                                                      
#>  8         1974 "[Transcriber's Annotations and Conventions: the translator left"       
#>  9         1974 "intact some Greek words to illustrate a specific point of the original"
#> 10         1974 "discourse. In this transcription, in order to retain the accuracy of"  
#>    title                   
#>    <chr>                   
#>  1 The Poetics of Aristotle
#>  2 The Poetics of Aristotle
#>  3 The Poetics of Aristotle
#>  4 The Poetics of Aristotle
#>  5 The Poetics of Aristotle
#>  6 The Poetics of Aristotle
#>  7 The Poetics of Aristotle
#>  8 The Poetics of Aristotle
#>  9 The Poetics of Aristotle
#> 10 The Poetics of Aristotle
#> # … with 39,940 more rows

FAQ

What do I do with the text once I have it?

  • The Natural Language Processing CRAN View suggests many R packages related to text mining, especially around the tm package.
  • The tidytext package is useful for tokenization and analysis, especially since gutenbergr downloads books as a data frame already.
  • You could match the wikipedia column in gutenberg_authors to Wikipedia content with the WikipediR package or to pageview statistics with the wikipediatrend package.
  • If you're considering an analysis based on author name, you may find the humaniformat (for extraction of first names) and gender (prediction of gender from first names) packages useful. (Note that humaniformat has a format_reverse function for reversing "Last, First" names.)
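
As a minimal sketch of the tidytext route, the example below tokenizes a small data frame shaped like the output of gutenberg_download(768). The four lines of text are copied from the output shown earlier so that the example needs no download; the names sample_lines, words, and word_counts are illustrative.

```r
library(dplyr)
library(tidytext)

# Stand-in for gutenberg_download(768): one row per line of text.
sample_lines <- tibble(
  gutenberg_id = 768L,
  text = c(
    "1801.--I have just returned from a visit to my landlord--the solitary",
    "neighbour that I shall be troubled with.  This is certainly a beautiful",
    "country!  In all England, I do not believe that I could have fixed on a",
    "situation so completely removed from the stir of society.  A perfect"
  )
)

# Split each line into lowercase word tokens, one row per word.
# The gutenberg_id column is carried along with every token.
words <- sample_lines %>%
  unnest_tokens(word, text)

# Tally how often each word occurs, most frequent first.
word_counts <- words %>%
  count(word, sort = TRUE)

word_counts
```

With a real download, the same two steps apply unchanged to the full text column.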

How were the metadata R files generated?

See the data-raw directory for the scripts that generate these datasets. As of now, these were generated from the Project Gutenberg catalog on 05 May 2016.

Do you respect the rules regarding robot access to Project Gutenberg?

Yes! The package respects these rules and complies to the best of our ability. Namely:

  • Project Gutenberg allows wget to harvest Project Gutenberg using this list of links. The gutenbergr package visits that page once to find the recommended mirror for the user's location.
  • We retrieve the book text directly from that mirror using links in the same format. For example, Frankenstein (book 84) is retrieved from http://www.gutenberg.lib.md.us/8/84/84.zip.
  • We retrieve the .zip file rather than the .txt file to minimize bandwidth on the mirror.

Still, this package is not the right way to download the entire Project Gutenberg corpus (or all works in a particular language). For that, follow their recommendation to use wget or set up a mirror. This package is recommended for downloading a single work, or works for a particular author or topic.
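
The mirror link format mentioned above can be sketched in base R. The helper below is hypothetical and not part of gutenbergr's API; it only reproduces the path pattern visible in the URLs in this README (every digit of the ID except the last becomes a directory). The handling of single-digit IDs under a "0" directory is an assumption.

```r
# Hypothetical helper: build the relative .zip path for a Gutenberg ID.
mirror_zip_path <- function(id) {
  id_chr <- as.character(as.integer(id))
  digits <- strsplit(id_chr, "")[[1]]
  # All digits except the last become nested directories.
  dirs <- head(digits, -1)
  # Assumption: single-digit IDs are stored under a "0" directory.
  if (length(dirs) == 0) dirs <- "0"
  paste0(paste(dirs, collapse = "/"), "/", id_chr, "/", id_chr, ".zip")
}

mirror_zip_path(84)     # "8/84/84.zip"
mirror_zip_path(768)    # "7/6/768/768.zip"
mirror_zip_path(12699)  # "1/2/6/9/12699/12699.zip"
```

Appending the result to a mirror's base URL gives links of the same shape as those printed by gutenberg_download() in the examples.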

Code of Conduct

This project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
