pubcrawl's Introduction

*** IMPORTANT ***

No further development will occur in this package, as it has been superseded by the actively maintained (and quite spiffy!) epubr package.


pubcrawl

Convert ‘epub’ Files to Text

Description

Convert ‘epub’ Files to Text

The ‘epub’ file format is really just a structured ‘ZIP’ archive with metadata, graphics and (usually) ‘HTML’ text. Tools are provided to turn an ‘epub’ file into a tidy data frame.
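
Because an epub is a ZIP archive, its contents can be inspected with base R alone. The sketch below is only an illustration of that point, not how pubcrawl itself works: the helper name `epub_listing()` is hypothetical, and it assumes the text lives in entries ending in `.html`, `.htm` or `.xhtml`.

```r
# Hypothetical helper (not part of pubcrawl): list the HTML-like entries
# inside an epub by reading the ZIP directory without extracting anything.
epub_listing <- function(epub_path) {
  stopifnot(file.exists(epub_path))

  # unzip(list = TRUE) returns a data frame with Name, Length and Date columns
  entries <- utils::unzip(epub_path, list = TRUE)

  # Assumption: chapter text is stored in .html/.htm/.xhtml files
  is_html <- grepl("\\.x?html?$", entries$Name, ignore.case = TRUE)
  entries[is_html, , drop = FALSE]
}

# Usage (the path is a placeholder):
# epub_listing("~/Data/some-book.epub")
```

The `Name`, `Length` and `Date` columns correspond loosely to the `path`, `size` and `date` columns shown in the Usage output.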

What’s Inside The Tin

The following functions are implemented:

  • epub_to_text: Convert an epub file into a data frame of plaintext chapters

NOTE

There are edge cases I’ve totally not covered yet. Feel free to jump in and make this a real, useful package!

TODO

  • Refactor so there aren’t so many heavy dependencies
  • [ ] Try to get hgr on CRAN so it’s not a GH dep (moved the cleaner code into here)
  • Better docs
  • Embed some epubs for examples and tests
  • Setup Travis, Appveyor, code coverage

Installation

devtools::install_github("hrbrmstr/pubcrawl")

Usage

library(pubcrawl)
library(tidyverse)

# current version
packageVersion("pubcrawl")
## [1] '0.1.0'

An O’Reilly epub

epub_to_text("~/Data/R Packages.epub")
## # A tibble: 26 x 4
##    path                         size date                content                                                       
##    <chr>                       <dbl> <dttm>              <chr>                                                         
##  1 OEBPS/cover.html              315 2015-03-24 21:49:16 Cover                                                         
##  2 OEBPS/titlepage01.html        466 2015-03-24 21:49:16 "R Packages\n\nHadley Wickham"                                
##  3 OEBPS/copyright-page01.html  3286 2015-03-24 21:49:16 "R Packages\n\nby Hadley  Wickham\n\n\n\nPrinted in the Unite…
##  4 OEBPS/toc01.html            17557 2015-03-24 21:49:16 "navPrefaceIn This Book\n\nConventions Used in This Book\n\nU…
##  5 OEBPS/preface01.html        17784 2015-03-24 21:49:16 "Preface\n\n\nIn This Book\n\nThis book will guide you from b…
##  6 OEBPS/part01.html             444 2015-03-24 21:49:16 Getting Started
##  7 OEBPS/ch01.html             12007 2015-03-24 21:49:16 "Introduction\n\nIn R, the fundamental unit of shareable code…
##  8 OEBPS/ch02.html             28633 2015-03-24 21:49:18 "Package Structure\n\nThis chapter will start you on the road…
##  9 OEBPS/part02.html             454 2015-03-24 21:49:18 Package Components
## 10 OEBPS/ch03.html             28629 2015-03-24 21:49:18 "R Code\n\nThe first principle of using a package is that all…
## # ... with 16 more rows
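
Since the result is an ordinary data frame, it can be filtered and summarised like any other. The sketch below uses base R only and a hand-built stand-in (four rows copied from the output above, `path` and `content` columns only) rather than a real call to `epub_to_text()`. The `words` column and the `part[0-9]+` file-name pattern are illustrative assumptions, not features of the package.

```r
# Stand-in for part of the epub_to_text() result shown above
chapters <- data.frame(
  path = c("OEBPS/cover.html", "OEBPS/titlepage01.html",
           "OEBPS/part01.html", "OEBPS/part02.html"),
  content = c("Cover", "R Packages\n\nHadley Wickham",
              "Getting Started", "Package Components"),
  stringsAsFactors = FALSE
)

# Rough word count per file: split on runs of whitespace (including newlines)
chapters$words <- lengths(strsplit(trimws(chapters$content), "\\s+"))

# Keep only the part-divider pages, judged by file name
# (assumption: they are named partNN.html, as in this particular book)
parts <- chapters[grepl("/part[0-9]+\\.html$", chapters$path), ]

print(chapters[, c("path", "words")])
print(parts$content)
```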

A Project Gutenberg epub that comes with the package

epub_to_text(system.file("extdat", "augustine.epub", package="pubcrawl")) %>% 
  mutate(path = abbreviate(path))
## # A tibble: 10 x 4
##    path                             size date                content                                                   
##    <chr>                           <dbl> <dttm>              <chr>                                                     
##  1 OEBPS/@@@@@@@3296@3296-@3296--0 63804 2017-10-02 07:00:00 "THE CONFESSIONS\nOF\nSAINT AUGUSTINE\n\nBy Saint Augusti…
##  2 OEBPS/@@@@@@@3296@3296-@3296--1 68504 2017-10-02 07:00:00 "BOOK III\nTo Carthage I came, where there sang all aroun…
##  3 OEBPS/@@@@@@@3296@3296-@3296--2 80192 2017-10-02 07:00:00 "BOOK V\nAccept the sacrifice of my confessions from the …
##  4 OEBPS/@@@@@@@3296@3296-@3296--3 51898 2017-10-02 07:00:00 "O crooked paths! Woe to the audacious soul, which hoped,…
##  5 OEBPS/@@@@@@@3296@3296-@3296--4 80194 2017-10-02 07:00:00 "Anubis, barking Deity, and all         The monster Gods …
##  6 OEBPS/@@@@@@@3296@3296-@3296--5 80718 2017-10-02 07:00:00 "The boy then being stilled from weeping, Euodius took up…
##  7 OEBPS/@@@@@@@3296@3296-@3296--6 65956 2017-10-02 07:00:00 "And Thou knowest how far Thou hast already changed me, w…
##  8 OEBPS/@@@@@@@3296@3296-@3296--7 57022 2017-10-02 07:00:00 "BOOK XII\nMy heart, O Lord, touched with the words of Th…
##  9 OEBPS/@@@@@@@3296@3296-@3296--8 69513 2017-10-02 07:00:00 "BOOK XIII\nI call upon Thee, O my God, my mercy, Who cre…
## 10 OEBPS/@@@@@@@3296@3296-@3296--9 21223 2017-10-02 07:00:00 "The Confessions of Saint Augustine, by Saint Augustine\n…

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

pubcrawl's People

Contributors

hrbrmstr

pubcrawl's Issues

More/better tests

No real edge cases are tested, so adding a few more (legal!) epubs from a few different sources and ensuring the main process works would be 👍

rmarkdown to markdown - checkboxes

Hey Bob,

I have a quick question related to the README of this package.

I see that you are using an Rmd file that compiles to a GitHub md file. Pandoc doesn't necessarily translate everything in Rmd to md, and one of those things is checkboxes (- [ ]); my understanding is that they don't really work with this approach (see here and here).

How did you manage to get the checkboxes to work? Some special pandoc workaround?
Sorry for opening an issue for this and spamming you with notifications, but this would be useful to know, since I currently need workarounds that are not as straightforward as writing a checkbox in Markdown.

Cheers and thanks in advance!

encoding. of course it is encoding...

It would be very nice if the text parsing defaulted to UTF-8, because something doesn't look right in my output for 1001 nights:

Generous Dealing of Yahya Son of KhÃ\u0081Lid with A Man Who Forged A Letter in His Name.

should be

Generous Dealing of Yahya Son of KhÁLid with A Man Who Forged A Letter in His Name.
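
For anyone picking this up: a capital A with an acute accent turning into a capital A with a tilde followed by `\u0081` is the typical result of UTF-8 bytes being decoded as Latin-1. Whether that is what actually happens inside pubcrawl is an assumption and has not been verified against the package code. The base R sketch below only reproduces the symptom on the word "KhÁLid" and shows that it is reversible.

```r
# The correctly encoded word (U+00C1 is "A" with an acute accent)
correct <- "Kh\u00c1Lid"

# Its UTF-8 bytes: 4b 68 c3 81 4c 69 64
bytes <- charToRaw(correct)

# Decoding as Latin-1 means "one byte = one code point", so the two bytes
# c3 81 become the two characters U+00C3 and U+0081
garbled <- intToUtf8(as.integer(bytes))

# Undo the damage: map each code point back to a byte, then declare the
# resulting byte string to be UTF-8
fixed <- rawToChar(as.raw(utf8ToInt(garbled)))
Encoding(fixed) <- "UTF-8"

print(utf8ToInt(garbled))      # code points of the garbled string
print(identical(fixed, correct))
```

If this is the cause, the fix would belong wherever the HTML is first read, by stating the encoding explicitly rather than relying on a platform default.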

Fix issues under Appveyor/Windows

pubcrawl.Rcheck.zip

^^ is output from Appveyor's failed build (it builds but the tests fail)

I will not be spending any time on it since it works on macOS and Linux.

If you do uncover the issue and fix it, please PR and ensure you add yourself to the DESCRIPTION
