
Comments (5)

DaveBathnes commented on August 26, 2024

Hi @thedug thanks for raising this!

I was slightly surprised by the sudden interest in these notes but realised OpenLibrary had seen them and linked to the repository. They were written quite a while ago without really expecting anyone to look at them, so I imagine quite a few things have changed now - but it looks like you've done a good job of tackling that problem!

I've got some time booked in to look at this repository properly and make sure there are some decent scripts, so thanks for your notes on this.

from openlibrary-search.

thedug commented on August 26, 2024

I ended up dropping the column and will re-add it.

I also ran into an issue with null chars. I used this to remove them.

tr < ol_dump_editions_2021-11-30_processed.csv -d '\000' > ol_dump_editions_2021-11-30_processed_nonulls.csv
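For anyone who would rather do the same clean-up in Python, here is a minimal sketch of the equivalent step. The function name `strip_nul_bytes` and the chunk size are made up for illustration; it simply streams the file in binary mode and drops every NUL byte, so it should cope with files too large to hold in memory.

```python
def strip_nul_bytes(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path, dropping every NUL (0x00) byte.

    Returns the number of NUL bytes that were removed.
    (Hypothetical helper, not part of the openlibrary-search repository.)
    """
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # end of file
            cleaned = chunk.replace(b"\x00", b"")
            removed += len(chunk) - len(cleaned)
            dst.write(cleaned)
    return removed
```

Calling `strip_nul_bytes("ol_dump_editions_2021-11-30_processed.csv", "ol_dump_editions_2021-11-30_processed_nonulls.csv")` would mirror the `tr` command above.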


DaveBathnes commented on August 26, 2024

Just a quick update on this. I'm almost there with a significant refactor which will properly script the database creation (all currently in this branch https://github.com/LibrariesHacked/openlibrary-search/tree/1-error-loading-editions). I've been testing today, but due to the sheer size of the data it's been running all day. The first attempt failed when disk space ran out!

@thedug On your original question: you were right, of course, that work_key was causing the error, so dropping the column and recreating it once the copy import is done would have fixed it. I think my notes must have omitted the fact that I start off with a table containing only the columns needed for the copy command to work, then add the column and populate it. The editions table just needs work_key added to link it with the works table. The authorship table then links the authors and works tables.
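As a rough illustration of that two-step approach (load first, add the linking column afterwards), here is a minimal sketch. It uses Python's built-in sqlite3 rather than PostgreSQL's copy command so that it runs anywhere. The column names, the helper names, and the assumption that each edition's JSON lists its work under `works[0].key` are illustrative and not taken from the repository's scripts.

```python
import json
import sqlite3


def load_editions(conn, rows):
    """Step 1: create the table with only the columns present in the import file."""
    conn.execute(
        'CREATE TABLE editions ('
        '"type" TEXT, "key" TEXT PRIMARY KEY, revision INTEGER, '
        'last_modified TEXT, data TEXT)'
    )
    conn.executemany("INSERT INTO editions VALUES (?, ?, ?, ?, ?)", rows)


def add_work_key(conn):
    """Step 2: add work_key after the import and populate it from the JSON."""
    conn.execute("ALTER TABLE editions ADD COLUMN work_key TEXT")
    updates = []
    for key, data in conn.execute('SELECT "key", data FROM editions').fetchall():
        # Assumption: the first entry under "works" identifies the parent work.
        works = json.loads(data).get("works") or []
        updates.append((works[0]["key"] if works else None, key))
    conn.executemany('UPDATE editions SET work_key = ? WHERE "key" = ?', updates)
    conn.commit()
```

Because the import file has no work_key column, keeping the table to the file's columns during the load is what lets the bulk import succeed; the extra column only appears once the data is in.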

@gennaios I think making it database agnostic would definitely be a good enhancement. Once I have the database scripts I'd like to refactor them to allow for multiple database engines. There are plenty of complexities to that: indexing the JSON column, for example, uses a PostgreSQL-specific command, and the copy commands are by far the quickest way of getting the data into a PostgreSQL DB. Something more general would work across DBs, though.
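To give a feel for what a more general loader might look like, here is a minimal sketch that reads dump lines and inserts them in batches through Python's DB-API. It assumes each dump line has five tab-separated fields (type, key, revision, last modified, JSON); the table name `records`, the function names, and the batch size are hypothetical. sqlite3 stands in for any engine here, and other drivers may need a different placeholder style than `?`.

```python
import itertools
import sqlite3


def create_table(conn):
    """Create a plain five-column table with no engine-specific features."""
    conn.execute(
        'CREATE TABLE records ('
        '"type" TEXT, "key" TEXT, revision INTEGER, last_modified TEXT, data TEXT)'
    )


def parse_dump_line(line):
    """Split one dump line into its five fields; tabs inside the JSON are kept."""
    type_, key, revision, last_modified, data = line.rstrip("\n").split("\t", 4)
    return type_, key, int(revision), last_modified, data


def load_lines(conn, lines, batch_size=1000):
    """Insert parsed lines in batches and return the number of rows loaded."""
    rows = (parse_dump_line(line) for line in lines if line.strip())
    cursor = conn.cursor()
    total = 0
    while True:
        batch = list(itertools.islice(rows, batch_size))
        if not batch:
            break
        cursor.executemany("INSERT INTO records VALUES (?, ?, ?, ?, ?)", batch)
        total += len(batch)
    conn.commit()
    return total
```

Batched inserts like this will be slower than a native bulk copy, which is the trade-off for not depending on one engine.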

Thanks for the feedback and apologies for very late replies!


gennaios commented on August 26, 2024

By chance, I also found this recently and am interested in it. I'll be importing into SQLite. I'm not sure what the ideal approach would be, but perhaps the dumps could be reformatted so that they can be imported into any database? If that's possible and you'd consider it, that'd be great.


Xaneets commented on August 26, 2024

@DaveBathnes

> Just a quick update on this. I'm almost there with a significant refactor which will properly script the database creation (all currently in this branch https://github.com/LibrariesHacked/openlibrary-search/tree/1-error-loading-editions). I've been testing today but due to the sheer size of data it's been going all day. The first attempt failed when disk space ran out!

To speed up data cleaning, you can use Fast-Open-Library, which is written in Rust. It cleans the data more than 6 times faster.

