
Comments (5)

toolness commented on September 25, 2024

Whoa, I did not know about COPY. Or Typecast, or the other stuff. Thanks!

Basically, the problem I'm trying to solve is that I'd like to maintain an NYC-DB instance that continuously updates itself. Right now the main challenge is updating it with as little downtime as possible, and I can think of two potential solutions:

  1. Just point clients at a Postgres server via DNS. Once every n days, rebuild the whole NYC-DB on a new database server, point DNS at it, and delete the old server. This is similar to how a production deployment of a web app might work, and AWS RDS seems to support it, though alternatively we could use EC2 or some non-AWS cloud provider.

  2. Have a single database server whose individual datasets we update via atomic transactions. For example, a batch job could update the hpd_registrations dataset by dropping the existing tables and populating new ones inside a single transaction, so there's no observable window where the dataset's tables appear empty or only partially populated. It's this case where it seems like it might be useful to have a SQL file of a dataset that I can load into the DB as one transaction, though another way to do it might be to add support to nycdb for wrapping everything it does in a single transaction (including dropping the dataset's tables at the very beginning, if they exist).
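
To illustrate the kind of thing I mean by option 2 (a rough sketch only; the table, column, and file names here are made up, and I'm assuming a local database named nycdb):

# Hypothetical sketch: rebuild a dataset inside one transaction so there is
# never a moment where its tables look empty or half-loaded.
psql -d nycdb <<'SQL'
BEGIN;
DROP TABLE IF EXISTS hpd_registrations;
CREATE TABLE hpd_registrations (registrationid integer, housenumber text, streetname text);
\copy hpd_registrations FROM 'hpd_registrations.csv' WITH CSV HEADER
COMMIT;
SQL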

But also, I am super rusty with databases so everything I just wrote could be ridiculous.


aepyornis commented on September 25, 2024

I'm not opposed to this idea, but I'm not sure I exactly understand the problem that it solves. What would be the purpose of having .sql files without loading them into postgres?

By the way, the program can already easily create .sql dump files of any dataset: nycdb --dump hpd_violations will create a .sql dump of the dataset hpd_violations.

If you want to go directly from a CSV to raw SQL, there's already good software that does that well: see csvsql from csvkit.
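
For example (the file name and connection URL below are just placeholders):

# Emit a CREATE TABLE statement inferred from the CSV, without touching the DB:
csvsql --dialect postgresql hpd_violations.csv > create_hpd_violations.sql

# Or let csvsql create the table and insert the rows into an existing database:
csvsql --db postgresql:///nycdb --insert hpd_violations.csv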

I think the implementation would be a little complicated, as we have no code that crafts INSERT statements. Right now Typecast's job is to parse the CSV and convert each column value into a Python object, which is then handed over to psycopg2, which does the translation to Postgres for us.

You're correct that if we were to utilize Postgres's COPY functionality to upload CSVs instead of using INSERT statements we'd get much better performance, but any single error in the CSV will cause the entire dataset to fail. I'm not sure that the amount of effort it would take to rework nycdb to produce CSVs for COPY is worth it, but I'm certainly open to the idea or a pull request.
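
For reference, the client-side flavor of COPY can be driven from psql along these lines (assuming the target table already exists and its columns line up with the CSV; file and database names are placeholders):

# \copy streams the CSV through the client into the existing table;
# a single malformed row aborts the whole load.
psql -d nycdb -c "\copy hpd_violations FROM 'hpd_violations.csv' WITH CSV HEADER"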

I'd like to understand the motivation more: is the amount of time it takes to load a dataset a big problem? I usually rebuild the dataset once a month to produce the data dumps that I put on GitHub. True, it can take over 10 hours, but I haven't needed to rebuild the entire thing all that often.

Sometimes if I'm in a hurry I speed up the process by parallelizing the loading. I spin up a big server and insert 4 datasets at a time using something like this:

nycdb --list-datasets | xargs -P 4 -I dataset nycdb --load dataset


aepyornis commented on September 25, 2024

Thanks -- I understand what you are going for now.

Do note that by importing a SQL file using psql -- psql -d nycdb < hpd_violations.sql -- you are not necessarily executing the SQL inside a transaction. Depending on how the pg_dump file has been configured, it typically drops the existing table with the same name, creates it, locks it, and then starts inserting the data. Trying to insert millions of rows inside a single transaction will likely give you memory errors. As far as I'm aware, Postgres transactions are not intended to be used to update entire databases with GBs of data.
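
For completeness: psql can be told to run a whole script inside one transaction, but for a multi-GB dataset that's exactly the pattern I'd avoid:

# -1 / --single-transaction wraps the entire file in a single BEGIN/COMMIT.
psql --single-transaction -d nycdb -f hpd_violations.sql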

Some thoughts on this problem:

If you only needed it updated once a week, I'd go with the approach you described in option 1: automate the rebuilding of the database, and once it's ready, just swap the host variable in your production app configuration. (FYI, if Ansible happens to be your automation tool of choice, this repo has an Ansible playbook that will set up a server with Postgres ready to run nycdb.)

These are the update frequencies of each dataset:

Daily:

  • DOB Jobs
  • DOB complaints
  • HPD Violations
  • OATH Hearings

Monthly:

  • HPD Registrations
  • HPD Complaints
  • Department of Finance Rolling Sales
  • ACRIS

Yearly/never change:

  • Pluto
  • Tax Bills
  • 2017 Marshal Evictions

So if you are looking to update NYCDB daily, you only need to update four datasets every day (with a bigger update once a month). A solution might be as simple as scripting nycdb to load those datasets under different table names (or into a different schema, keeping the names the same) and renaming the tables when they are ready.
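
A rough sketch of that swap (the staging table name is made up; assume the fresh data has already been loaded into it):

# Swap a freshly loaded staging table into place; the renames are metadata-only,
# so the switch is effectively instant for readers.
psql -d nycdb <<'SQL'
BEGIN;
DROP TABLE IF EXISTS hpd_violations_old;
ALTER TABLE IF EXISTS hpd_violations RENAME TO hpd_violations_old;
ALTER TABLE hpd_violations_staging RENAME TO hpd_violations;
COMMIT;
SQL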


toolness commented on September 25, 2024

Ahh this is super helpful, thanks!

Oy, I was worried about massive transactions exhausting system resources. I like your idea about renaming the tables, though -- I think I'll do something involving that... Gonna close this issue now.


aepyornis commented on September 25, 2024

glad it was helpful :)

