
Comments (5)

toolness commented on September 25, 2024

Whoa, I did not know about COPY. Or Typecast, or the other stuff. Thanks!

Basically, the problem I'm trying to solve is that I'd like to maintain an NYC-DB instance that continuously updates itself. Right now the main challenge is updating it with as little downtime as possible, and I can think of two potential solutions:

  1. Just point clients at a Postgres server via DNS. Once every n days, rebuild the whole NYC-DB on a new database server, point DNS at it, and delete the old server. This is similar to how a production deployment of a web app might work, and AWS RDS seems to support it, though alternatively we could use EC2 or some non-AWS cloud provider.

  2. Have a single database server whose individual datasets we update via atomic transactions. For example, a batch job could update the hpd_registrations dataset by dropping the existing tables and populating new ones inside a single transaction, so there's no observable window where the dataset's tables appear empty or only partially populated. It's this case where it seems like it might be useful to have a SQL file of a dataset that I can load into the DB as one transaction, though another way to do it might be to add support to nycdb for wrapping everything it does in a single transaction (including dropping the dataset's tables at the very beginning, if they exist).
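
To illustrate the kind of thing I mean by option 2 (a rough sketch only; the table, column, and file names here are made up, and I'm assuming a local database named nycdb):

# Hypothetical sketch: rebuild a dataset inside one transaction so there is
# never a moment where its tables look empty or half-loaded.
psql -d nycdb <<'SQL'
BEGIN;
DROP TABLE IF EXISTS hpd_registrations;
CREATE TABLE hpd_registrations (registrationid integer, housenumber text, streetname text);
\copy hpd_registrations FROM 'hpd_registrations.csv' WITH CSV HEADER
COMMIT;
SQL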

But also, I am super rusty with databases so everything I just wrote could be ridiculous.


aepyornis commented on September 25, 2024

I'm not opposed to this idea, but I'm not sure I exactly understand the problem that it solves. What would be the purpose of having .sql files without loading them into postgres?

By the way, the program can already easily create .sql dump files of any dataset: nycdb --dump hpd_violations will create a .sql dump of the dataset hpd_violations.

If you want to go directly from a CSV to raw SQL, there's already good software that does that well: see csvsql from csvkit.
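
For example (the file name and connection URL below are just placeholders):

# Emit a CREATE TABLE statement inferred from the CSV, without touching the DB:
csvsql --dialect postgresql hpd_violations.csv > create_hpd_violations.sql

# Or let csvsql create the table and insert the rows into an existing database:
csvsql --db postgresql:///nycdb --insert hpd_violations.csv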

I think the implementation would be a little complicated, as we have no code that crafts INSERT statements. Right now Typecast's job is to parse the CSV and convert each column value into a Python object, which is then handed over to psycopg2, which does the translation to Postgres for us.

You're correct that if we were to utilize Postgres's COPY functionality to upload CSVs instead of using INSERT statements we'd get much better performance, but any single error in the CSV will cause the entire dataset to fail. I'm not sure that the amount of effort it would take to rework nycdb to produce CSVs for COPY is worth it, but I'm certainly open to the idea or a pull request.
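
For reference, the client-side flavor of COPY can be driven from psql along these lines (assuming the target table already exists and its columns line up with the CSV; file and database names are placeholders):

# \copy streams the CSV through the client into the existing table;
# a single malformed row aborts the whole load.
psql -d nycdb -c "\copy hpd_violations FROM 'hpd_violations.csv' WITH CSV HEADER"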

I'd like to understand the motivation more: is the amount of time it takes to load a dataset a big problem? I usually rebuild the dataset once a month to produce the data dumps that I put on GitHub. True, it can take over 10 hours, but I haven't needed to rebuild the entire thing all that often.

Sometimes if I'm in a hurry I speed up the process by parallelizing the loading. I spin up a big server and insert 4 datasets at a time using something like this:

nycdb --list-datasets | xargs -P 4 -I dataset nycdb --load dataset


aepyornis commented on September 25, 2024

Thanks -- I understand what you are going for now.

Do note that by importing a SQL file using psql -- psql -d nycdb < hpd_violations.sql -- you are not necessarily executing the SQL inside a transaction. Depending on how the pg_dump file has been configured, it typically drops the existing table with the same name, creates it, locks it, and then starts inserting the data. Trying to insert millions of rows inside a single transaction will likely give you memory errors. As far as I'm aware, Postgres transactions are not intended to be used to update entire databases with GBs of data.
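
For completeness: psql can be told to run a whole script inside one transaction, but for a multi-GB dataset that's exactly the pattern I'd avoid:

# -1 / --single-transaction wraps the entire file in a single BEGIN/COMMIT.
psql --single-transaction -d nycdb -f hpd_violations.sql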

Some thoughts on this problem:

If you only needed it updated once a week, I'd go with the approach you described in option 1: automate the rebuilding of the database, and once it's ready, just swap the host variable in your production app configuration. (FYI, if Ansible happens to be your automation tool of choice, this repo has an Ansible playbook that will set up a server with Postgres ready to run nycdb.)

These are the update frequencies of each dataset:

Daily:

  • DOB Jobs
  • DOB complaints
  • HPD Violations
  • OATH Hearings

Monthly:

  • HPD Registrations
  • HPD Complaints
  • Department of Finance Rolling Sales
  • ACRIS

Yearly/never change:

  • Pluto
  • Tax Bills
  • 2017 Marshal Evictions

So if you are looking to update NYCDB daily, you only need to update four datasets every day (with a bigger update once a month). A solution might be as simple as scripting nycdb to load those datasets under different table names (or into a different schema, keeping the names the same) and renaming the tables when they are ready.
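
A rough sketch of that swap (the staging table name is made up; assume the fresh data has already been loaded into it):

# Swap a freshly loaded staging table into place; the renames are metadata-only,
# so the switch is effectively instant for readers.
psql -d nycdb <<'SQL'
BEGIN;
DROP TABLE IF EXISTS hpd_violations_old;
ALTER TABLE IF EXISTS hpd_violations RENAME TO hpd_violations_old;
ALTER TABLE hpd_violations_staging RENAME TO hpd_violations;
COMMIT;
SQL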


toolness commented on September 25, 2024

Ahh this is super helpful, thanks!

Oy, I was worried about massive transactions exhausting system resources. I like your idea about renaming the tables, though -- I think I'll do something involving that... Gonna close this issue now.


aepyornis commented on September 25, 2024

glad it was helpful :)

