Poetry Identification Code from my dissertation runs on zip files containing DJVUXML from the Internet Archive.

Home Page: https://ciir.cs.umass.edu/downloads/poetry

License: Apache License 2.0

Rust 90.83% TSQL 0.38% Python 8.79%
djvuxml internet-archive poetry random-forests machine-learning digital-humanities

Poetry-Identification

Where did this model come from?

For details about where this model came from, or what it does, refer to my dissertation for now.

@phdthesis{foley2019thesis,
  author = {John Foley},
  title = {{Poetry: Identification, Entity Recognition, and Retrieval}},
  year = {2019},
  school = {University of Massachusetts},
}

Can I get some data for this?

Data from my dissertation is available at CIIR/downloads/poetry. The training data used to build the model is there, as well as the output of this model on the 50,000 books from the INEX 2007 challenge (basically a random sample of Internet Archive books).

How do I run the code?

You'll need a bunch of DJVU-XML books available in a zip file. I have so many of these -- email me and we can work something out :)

Prepare

  1. Get Rust.
  2. gunzip ../models/forest-05-2019.json.gz # Extract the model, which is too big for GitHub uncompressed. You only need to do this once; the path assumes you are in the classification directory, like the commands below.

Build and run the code:

cd classification
cargo build --release
./target/release/classification --model ../models/forest-05-2019.json --books input_books.zip > input_books.poetry.jsonl

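Once built, the classification binary is quite portable: Rust statically links its Rust dependencies by default, so you can build it once and copy it to a cluster of Linux machines fairly easily. On the default Linux target the binary still links the system C library dynamically, so the target machines need a compatible glibc.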
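If you have many zip files, the single command above can be wrapped in a small driver. The following is only a sketch, not part of this repository: the helper names `output_path` and `classify_command` are hypothetical, and it assumes the binary and model live at the paths shown above and that each output file should follow the `input_books.zip` to `input_books.poetry.jsonl` naming pattern.

```rust
use std::fs::File;
use std::path::{Path, PathBuf};
use std::process::Command;

/// Derive the output file name from the zip name, following the pattern
/// used above: `input_books.zip` -> `input_books.poetry.jsonl`.
/// (Hypothetical helper, not part of the repository.)
fn output_path(zip: &Path) -> PathBuf {
    zip.with_extension("poetry.jsonl")
}

/// Build, but do not run, the classification command for one zip file.
/// Only the `--model` and `--books` flags shown above are used.
fn classify_command(binary: &Path, model: &Path, zip: &Path) -> Command {
    let mut cmd = Command::new(binary);
    cmd.arg("--model").arg(model).arg("--books").arg(zip);
    cmd
}

fn main() -> std::io::Result<()> {
    // Assumed locations, copied from the commands above.
    let binary = Path::new("./target/release/classification");
    let model = Path::new("../models/forest-05-2019.json");

    // Every command-line argument is treated as a zip file of books.
    for arg in std::env::args().skip(1) {
        let zip = PathBuf::from(arg);
        // Redirect the binary's stdout into the derived .jsonl file.
        let out = File::create(output_path(&zip))?;
        let status = classify_command(binary, model, &zip)
            .stdout(out)
            .status()?;
        if !status.success() {
            eprintln!("classification failed for {}", zip.display());
        }
    }
    Ok(())
}
```

Run from the classification directory, this would process each zip given on the command line in turn; on a cluster you would more likely hand one zip to each job.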

About this Code

This code is written in Rust. There are two packages: djvuxml-rs, a fairly generic library for working with Internet Archive scanned-book files, and classification, which runs through the books using a Random Forest model serialized as JSON and makes predictions at the page level. The Poetry50K collection files at CIIR/downloads/poetry were generated by de-duplicating the output of this code.
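To give a rough idea of what page-level prediction with a Random Forest looks like, here is a minimal sketch. It is not the code or the model format from this repository: the `Tree` layout, the `forest_predict` function, and the two page features are all made up for illustration. Each tree routes a page's feature vector to a leaf, and the forest averages the leaf scores.

```rust
/// A binary decision tree over numeric page features.
/// This layout is hypothetical; the real JSON model may be organized differently.
enum Tree {
    Leaf { poetry_prob: f64 },
    Split {
        feature: usize,
        threshold: f64,
        left: Box<Tree>,
        right: Box<Tree>,
    },
}

impl Tree {
    /// Walk from the root to a leaf and return that leaf's score.
    fn predict(&self, features: &[f64]) -> f64 {
        match self {
            Tree::Leaf { poetry_prob } => *poetry_prob,
            Tree::Split { feature, threshold, left, right } => {
                if features[*feature] <= *threshold {
                    left.predict(features)
                } else {
                    right.predict(features)
                }
            }
        }
    }
}

/// Average the per-tree scores; an empty forest scores 0.0.
fn forest_predict(forest: &[Tree], features: &[f64]) -> f64 {
    if forest.is_empty() {
        return 0.0;
    }
    let total: f64 = forest.iter().map(|t| t.predict(features)).sum();
    total / forest.len() as f64
}

/// Build a one-split tree: score 0.0 at or below the threshold, 1.0 above it.
fn stump(feature: usize, threshold: f64) -> Tree {
    Tree::Split {
        feature,
        threshold,
        left: Box::new(Tree::Leaf { poetry_prob: 0.0 }),
        right: Box::new(Tree::Leaf { poetry_prob: 1.0 }),
    }
}

fn main() {
    // Made-up features: [fraction of short lines, fraction of capitalized line starts].
    let forest = vec![stump(0, 0.5), stump(1, 0.5)];
    let page = [0.8, 0.3];
    println!("poetry score: {}", forest_predict(&forest, &page));
}
```

A real forest has many deeper trees, and a page is typically labeled as poetry when the averaged score crosses some chosen threshold.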

Help? Where's the code for XXX?

I'm slowly cleaning up and open-sourcing all the code. If you're looking for a piece that hasn't been made public yet, please don't hesitate to contact me! File an issue here or check out my personal website to find my latest academic email.
