Comments (5)
Whoa, I did not know about COPY, or Typecast, or the other stuff. Thanks!
Basically the problem I am trying to solve is just that I'd like to maintain an NYC-DB instance that continuously updates itself. Right now one problem I have is that of having it be updated with as little downtime as possible, and I can think of two potential solutions:
- Just point to a Postgres server via DNS. Once every n days, rebuild the whole NYC-DB in a new database server, point DNS to it, then delete the old database server. This is similar to how the production deployment of a web app might work, and AWS RDS seems to support it, though alternatively we could use EC2 or some non-AWS cloud provider.
- Have a single database server whose individual datasets we update via atomic transactions. This means that e.g. we could have a batch job that updates the hpd_registrations dataset by dropping the existing tables and populating new tables in a single atomic transaction, so that there's no observable period where the tables for the dataset appear to be empty or only partially populated. It's this case where it seems like it might be useful to just have an SQL file of a dataset that I can dump into a DB as a transaction, but I guess another way to do it might be to add some support to nycdb to have it wrap everything it does in a single transaction (including dropping the dataset tables at the very beginning, if they exist).
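As a sketch of what option 2 might look like (a hypothetical helper, not existing nycdb behavior), the refresh statements for a dataset could be wrapped in a single BEGIN/COMMIT block so readers never observe a dropped or half-loaded table:

```python
# Hypothetical sketch: nycdb does not currently expose this. The helper
# just joins a dataset's refresh statements into one transaction.
def wrap_in_transaction(statements):
    """Join SQL statements into a single BEGIN/COMMIT block."""
    body = "\n".join(s.rstrip(";") + ";" for s in statements)
    return "BEGIN;\n" + body + "\nCOMMIT;"

# Example (table definition is illustrative only):
refresh_sql = wrap_in_transaction([
    "DROP TABLE IF EXISTS hpd_registrations",
    "CREATE TABLE hpd_registrations (registrationid integer, boro text)",
])
```

The generated script could then be fed to psql; whether a transaction of that size is practical is a separate question, discussed below in the thread.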
But also, I am super rusty with databases so everything I just wrote could be ridiculous.
from nycdb.
I'm not opposed to this idea, but I'm not sure I exactly understand the problem that it solves. What would be the purpose of having .sql files without loading them into postgres?
By the way, the program can already easily create .sql dump files of any dataset: `nycdb --dump hpd_violations` will create a .sql dump of the dataset hpd_violations.
If you want to go directly from a csv to raw sql, there's already good software that does that well: see csvsql from csvkit.
I think the implementation would be a little complicated, as we have no code that crafts INSERT statements. Right now Typecast's job is to parse the csv and convert each column into a python object, which is then handed over to psycopg, which does the translation to postgres for us.
You're correct that if we were to utilize Postgres's COPY functionality to upload CSVs instead of using INSERT statements we'd get much better performance, but any single error in the CSV will cause the entire dataset to fail. I'm not sure that the amount of effort it would take to rework nycdb to produce CSVs for COPY is worth it, but I'm certainly open to the idea or a pull request.
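For context, Postgres's COPY text format is tab-delimited and uses `\N` for NULL, and one malformed row aborts the whole batch. A pre-cleaning pass like the sketch below (hypothetical, not part of nycdb) could normalize rows before handing them to COPY:

```python
def rows_to_copy_text(rows):
    """Render rows as Postgres COPY text format: tab-delimited fields,
    \\N for NULL/empty, and backslash/tab/newline escaped."""
    lines = []
    for row in rows:
        fields = []
        for value in row:
            if value is None or value == "":
                fields.append("\\N")  # COPY's NULL marker
            else:
                fields.append(
                    str(value)
                    .replace("\\", "\\\\")
                    .replace("\t", "\\t")
                    .replace("\n", "\\n")
                )
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"
```

Typecast's existing per-column parsing would still be needed upstream of this; the sketch only covers serialization.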
I want to understand the motivation more. Is the amount of time it takes to load a dataset a big problem? I usually re-build the dataset once a month to produce the data dumps that I put on github. True, it can take over 10 hours, but I haven't had the need to build the entire thing all that often.
Sometimes if I'm in a hurry I speed up the process by parallelizing the loading. I spin up a big server and insert 4 datasets at a time using something like this:
`nycdb --list-datasets | xargs -P 4 -I dataset nycdb --load dataset`
Thanks -- I understand what you are going for now.
Do note that by importing a sql file using psql (`psql -d nycdb < hpd_violations.sql`) you are not necessarily executing the sql code inside a transaction. Depending on how the pg_dump file has been configured, it typically drops the existing table with the same name, creates it, locks it, and then starts inserting the data. Trying to insert millions of rows inside a single transaction will likely give you memory errors. As far as I'm aware, Postgres transactions are not intended to be used to update entire databases with GBs of data.
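If single giant transactions are a concern, a common middle ground is committing inserts in fixed-size batches rather than one row or one huge transaction at a time. A minimal sketch of a batching helper (the batch size and usage are illustrative, not anything nycdb does today):

```python
def chunked(rows, size=5000):
    """Yield rows in fixed-size batches so each batch can be
    committed as its own small transaction."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Usage sketch: for batch in chunked(csv_rows): insert and commit batch
```

The trade-off is that a failure mid-load leaves earlier batches committed, so this fits the "load into a staging table, then swap" pattern better than in-place updates.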
Some thoughts on this problem:
If you only needed it updated once a week, I'd go with the approach that you described in option 1. Automate the rebuilding of the database and then, after it's ready, just swap the host variable in your production app configuration (fyi, if ansible happens to be your automation tool of choice, this repo has an ansible playbook that will set up a server with postgres ready to run nycdb).
These are the update frequencies of each dataset:
Daily:
- DOB Jobs
- DOB complaints
- HPD Violations
- OATH Hearings
Monthly:
- HPD Registrations
- HPD Complaints
- Department of Finance Rolling Sales
- ACRIS
Yearly/never change:
- Pluto
- Tax Bills
- 2017 Marshal Evictions
So if you are looking to update NYCDB daily, you only need to update 4 datasets every day (with a bigger update once a month). A solution might be as simple as scripting nycdb to upload those datasets with different table names (or in a different schema, keeping the name the same) and renaming the tables when they are ready.
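The rename trick above can be sketched as generated SQL: load into a staging table, then swap it into place. `ALTER TABLE ... RENAME` is transactional in Postgres, so the swap is effectively atomic. The table and suffix names here are assumptions for illustration:

```python
def swap_table_sql(table, staging_suffix="_staging", old_suffix="_old"):
    """Generate SQL that swaps a freshly loaded staging table into place.
    Readers only ever see the old table or the complete new one."""
    return "\n".join([
        "BEGIN;",
        f"ALTER TABLE IF EXISTS {table} RENAME TO {table}{old_suffix};",
        f"ALTER TABLE {table}{staging_suffix} RENAME TO {table};",
        f"DROP TABLE IF EXISTS {table}{old_suffix};",
        "COMMIT;",
    ])

# e.g. swap_table_sql("hpd_violations") after loading hpd_violations_staging
```

Because only table metadata changes inside the transaction, this avoids the memory concerns of loading millions of rows transactionally.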
Ahh this is super helpful, thanks!
Oy, I was worried about the massive transactions exhausting system resources. I like your idea about renaming the tables though, I think I'll do something involving that... Gonna close this issue now.
glad it was helpful :)