nycdb / nycdb
Database of NYC Housing Data
License: GNU Affero General Public License v3.0
We've recently completed a download of all 2019 Statement of Account (SOA) tax bill data that corresponds to BBLs stored in PLUTO 19_v1. We also converted the PDFs to text and attempted to parse out information about how many rent-stabilized units are mentioned in each BBL's tax bill (if any).
The result is a dataset that is different from the existing rent stabilization data in a number of ways:
We only pulled the rent stabilization numbers from the tax bills. We didn't scrape anything about J-51 and other tax abatements because the most important ones have recently been added to NYC's open data portal and brought into NYCDB by PRs like #80. (If there is strong interest in obtaining other numbers from the PDFs, we can always add them later.)
We are applying as little post-processing to the data as possible, because interviews with users of the existing stabilization data revealed confusion about and distrust of its numbers. In line with this, we would also like to provide direct links to the PDFs of the tax bills the data was scraped from. This lets anyone easily trace the provenance of every statistic if they have any doubts about the accuracy of the scrape.
Given this, we're proposing a new dataset with the following structure:
BBL | soa_downloaded | soa_rentstab_units | soa_url
---|---|---|---
1011587504 | t | 31 | https://nyc-doffer.s3.amazonaws.com/pdf/1/01158/7504/soa-2019-06-05.pdf.br
select streetname from hpd_registrations where streetname ~ 'DRIVEIVE';
Looks like it's an issue with `src/nycdb/address.py`.
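One plausible cause (an assumption on my part, I haven't read `address.py` closely) is naive substring replacement when expanding street-suffix abbreviations: replacing `DR` with `DRIVE` inside a name that is already `DRIVE` yields `DRIVEIVE`. A minimal sketch of the bug and a token-based fix:

```python
# Hypothetical abbreviation table; the real one lives in src/nycdb/address.py
ABBREVIATIONS = {"DR": "DRIVE", "ST": "STREET"}

def expand_naive(street):
    # Buggy: blind substring replacement re-expands already-full names
    for abbr, full in ABBREVIATIONS.items():
        street = street.replace(abbr, full)
    return street

def expand_by_token(street):
    # Safer: only expand a token that is exactly an abbreviation
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in street.split())

expand_naive("OCEAN DRIVE")     # "OCEAN DRIVEIVE" -- the bug
expand_by_token("OCEAN DR")     # "OCEAN DRIVE"
expand_by_token("OCEAN DRIVE")  # "OCEAN DRIVE" -- already-full names untouched
```

If the real code does whole-string replacement like `expand_naive`, a street already stored as `DRIVE` would come out as `DRIVEIVE`, matching the query above.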
Step 6 currently says:

Step 6 - re-install nycdb to register the new dataset

pip uninstall nycdb
pip setup.py install

(from inside of the src directory)

This is quite cumbersome to do after every code change, and it shouldn't be necessary if the developer has installed nycdb in editable mode, e.g. via `pip install -e .`. This is actually how the project is installed in the Docker setup, so at the very least we should add a note mentioning that this step isn't needed when using Docker. Ideally, though, I think we should tell developers to install nycdb in editable mode if they actually want to hack on it, and then just remove step 6 entirely.
Let's take stock of the current situation. The database is currently composed of 8 datasets, whose loaders are written in various languages:
module | language |
---|---|
311 | ruby |
acris | makefile |
dobjobs | python |
dof-sales | python + shell |
hpd-registrations | python + shell |
hpd-violations | shell |
pluto | python + shell |
rent-stabilization | python |
A little more than half rely on Python, but they don't share much code. Moreover, the download scripts leave something to be desired: there's no ability to skip already-downloaded files, resume downloads, import in parallel, etc.
Adding BBLs, processing addresses, and downloading from open data are features that many of the data importers could use.
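As a sketch of what shared downloader code might look like (names and behavior are illustrative, not nycdb's actual API), skipping already-downloaded files can be as simple as:

```python
import os
import urllib.request

def download_file(url, dest):
    """Download url to dest, skipping files that already exist.

    Illustrative sketch only; a real shared helper would also want
    resumable and parallel downloads.
    """
    if os.path.exists(dest):
        return False  # already downloaded, skip
    tmp = dest + ".part"
    urllib.request.urlretrieve(url, tmp)
    # Atomic rename so an interrupted download never looks complete:
    os.replace(tmp, dest)
    return True
```

Factoring logic like this into one place would let every importer benefit from improvements at once.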
Formal request to add DOB violations data https://data.cityofnewyork.us/Housing-Development/DOB-Violations/3h2n-5cm9 :)
parse.rb:58:in `add_bbl': undefined method `include?' for 34:Fixnum (NoMethodError)
While working on #25, I noticed that nycdb.file.download_file() seems to write CSV files in text mode, decoding each 512k chunk as UTF-8. While it's unlikely, I think it's possible for the last byte of a chunk to fall in the middle of a multi-byte UTF-8 sequence, which would cause data to be lost unnecessarily.
(I'm also guessing the downloaded data could be in an encoding other than UTF-8, in which case lots of characters might be garbled, so it might be useful to check the Content-Type header for a charset, but I really have no idea as I'm unfamiliar with the datasets being retrieved.)
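To illustrate the chunk-boundary concern, here's a small demonstration (not nycdb's actual code) of how per-chunk decoding fails and how the stdlib's incremental decoder handles it:

```python
import codecs

data = "café".encode("utf-8")        # 5 bytes: the 'é' is a two-byte sequence
chunk1, chunk2 = data[:4], data[4:]  # the boundary splits the 'é'

# Decoding each chunk independently fails on the partial sequence:
try:
    chunk1.decode("utf-8")
except UnicodeDecodeError:
    pass  # the chunk ends mid-character

# An incremental decoder buffers the incomplete bytes between chunks:
decoder = codecs.getincrementaldecoder("utf-8")()
text = decoder.decode(chunk1) + decoder.decode(chunk2, final=True)
assert text == "café"
```

Downloading in binary mode, or feeding chunks through an incremental decoder like this, would avoid the issue.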
Anyhow, I decided not to address this in #25 because I just wanted to focus on the progress bar logic, but I figured I'd flag it here.
I wasn't sure where else to discuss it and thought a github issue might be OK, but we can move this discussion somewhere else if needed.
Problem statement: Updating NYC-DB takes several hours because each very large table must be rebuilt from scratch from source CSV files. This results in a lot of duplicate effort, because most of the data already exists in the database.
Here's some prior work that works around the lack of an explicit update mechanism, and/or explores potential solutions to the update problem:
https://github.com/justFixNYC/who-owns-what contains a dbtool.py script that orchestrates the running of nycdb by dropping existing tables for datasets and deleting existing CSV files before re-downloading and re-loading them.
JustFixNYC/nycdb-k8s-loader has a number of mechanisms to ease updating the database:
JustFixNYC/nycdb-k8s-loader#10 keeps track of the Last-Modified and/or ETag headers of downloaded files to make conditional requests when downloading datasets in the future. This allows us to easily skip datasets if they haven't changed.
JustFixNYC/nycdb-k8s-loader#4 uses nycdb as a python module to first load datasets into a temporary schema to ensure that the public schema can still be queried while new data is being loaded.
toolness/rust-nycdb-csv-fun explores the notion of creating what amounts to a changelog for CSV files. This effectively lets us retrieve only the rows that have changed or been added since the last version of a CSV file, though it currently has some constraints, such as:
Each row must have a primary key that uniquely identifies it. It's not clear whether all tables have this, but that isn't a dealbreaker, since supporting updates on some tables is still better than supporting them on none.
Row deletion is unsupported. While such support could be added, it's unclear whether it would actually be desirable behavior (for example, when processing a CSV of rolling financial data, we probably want to keep the old financial data that has been "rolled off").
The aforementioned rust-nycdb-csv-fun shows some promise, but there are a few things holding it back from being a full solution. In particular, the SQL scripts that post-process a dataset after it's been loaded often do things that might make it difficult to modify a dataset's tables afterwards, e.g.:
1. Making changes to a table's schema by adding columns with values computed from pre-existing columns, such as adding a bbl column to ACRIS data.
2. Modifying existing columns by "cleaning up" their values, e.g. hpd_registrations/contacts_cleanup.sql.
3. Creating brand-new tables that use data derived from the original dataset tables, such as hpd_registrations/registrations_grouped_by_bbl_with_contacts.sql.
(It should also be noted that who-owns-what has its own custom SQL scripts that execute on top of nycdb's "official" ones, and individuals might have their own custom scripts that they run after installing nycdb. It'd be nice if such things were either still supported, or if guidance could be provided on how to adapt such existing solutions to any new paradigm that we create.)
One way to deal with (1) and (2) is to actually crib from nycdb-k8s-loader and load & post-process the updated rows in a separate temporary schema. Then for any existing rows that were changed, we could drop them from the public schema's version of the table. Finally, we'd just "append" the table in the temporary schema to the one in the public schema.
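A rough SQL sketch of that flow (schema, table, and key names are purely illustrative):

```sql
CREATE SCHEMA staging;
-- ...load and post-process the updated rows into staging.hpd_violations...

-- Drop rows from public that were changed in this update:
DELETE FROM public.hpd_violations AS v
  USING staging.hpd_violations AS s
  WHERE v.violationid = s.violationid;  -- assumes a usable primary key

-- Append the staged rows and clean up:
INSERT INTO public.hpd_violations SELECT * FROM staging.hpd_violations;
DROP SCHEMA staging CASCADE;
```

The upside is that the public table stays queryable the whole time, and the post-processing SQL runs only against the (small) staged rows.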
Another way to deal with (1) and (2) is to move such changes from SQL into the Python code that loads the data into the database. However, this might be slower than doing things in SQL, and it could also increase the barrier to entry for contributors who know SQL but not Python. That said, it could do the opposite for folks who prefer Python over SQL, and it might also be easier to unit test the changes in Python-land too.
Dealing with (3) might be straightforward if we just convert the CREATE TABLE statements to CREATE MATERIALIZED VIEW statements. However, I haven't actually used materialized views before (I've only read about them), so I don't know if this would work in practice.
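For instance, a derived table could hypothetically become something like this (columns are simplified for illustration, this is not the real script):

```sql
CREATE MATERIALIZED VIEW hpd_registrations_grouped_by_bbl AS
  SELECT bbl, count(*) AS registration_count
  FROM hpd_registrations
  GROUP BY bbl;

-- After the underlying table is reloaded, the derived data is one statement away:
REFRESH MATERIALIZED VIEW hpd_registrations_grouped_by_bbl;
```

The appeal is that the derivation logic lives in one place and Postgres handles regenerating it.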
Anyways, that's all I've got for now. Thoughts @aepyornis (or anyone else)?
Hi! I noticed that in 4512e6b the Python version used by the Dockerfile got changed to 3.6. But the reason it was 3.5 was that, by forcing Docker to use the minimum version of Python that NYCDB supports, we wouldn't accidentally use 3.6 features that would break folks on 3.5...
Or are we dropping 3.5 support from NYCDB? Anyways, just thought I'd flag it.
Dominic told me at yesterday's HDC meeting that the city just released an open data feed of J-51 data!
https://data.cityofnewyork.us/City-Government/J-51-Exemption-and-Abatement/y7az-s7wc
Until now this has only been available as a bulk download, which we have some code for parsing, but now that an open data feed is available, we should probably just use that.
I think it'd be useful to have Markdown documentation for each dataset. For example, the hpd_registrations dataset could have a file called hpd_registrations.md that explains things like:
- Derived tables such as hpd_registrations_grouped_by_bbl, and why they might be useful.
- Search features: the dobjobs dataset seems to support full-text search on some columns, which may be useful to document.
I'm not sure where such markdown files would live, though. Here are some ideas:
- Alongside the YAML files that describe the datasets, e.g. src/nycdb/datasets/hpd_registrations.md.
- Packaged in the same file as the dataset metadata, by converting the YAML files into Markdown files with YAML front matter.
- In a separate, new docs/ directory (we'd want to make sure such a directory is included in the published Python package too!).
Regardless of where we put it, we should also remember to update the ADDING_NEW_DATASETS.md file to encourage new dataset creators to document their datasets.
Hello again!
I really like the integration test data for this project. It's been very useful for building tests for any project that uses NYCDB, such as Who Owns What, as well as some other experiments of mine.
However, the test data isn't in the PyPI distribution of the project (at least, I don't think it is). This means that for all my projects I've had to download the ZIP file version of the repository to get the data.
Since the data is only about a megabyte or two in size, I'm curious if it'd be helpful to developers to have it just ship with the PyPI distribution, along with documentation so that devs know it's there? No worries if not, I could be the only person who finds it handy.
As I was creating the documentation entries for these datasets, I noticed that the oath_hearings and ecb_violations tables can be linked using a common field, but it's formatted differently in each table.
In ecb_violations the column is ecbviolationnumber; it's stored as text and is usually 7 or 8 characters long.
In oath_hearings the same column is called ticketnumber; it's also stored as text and is mostly (78%) 10 characters long (the rest are mostly 6-9 characters).
If you left-pad ecbviolationnumber to 10 characters with zeros, the join works and you get correct matches. However, I'm not yet sure what to make of the values of ticketnumber in oath_hearings that are under 10 characters.
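For reference, the padded join I've been testing looks roughly like this (using Postgres's lpad; column aliases are illustrative):

```sql
SELECT o.*, e.*
FROM ecb_violations AS e
JOIN oath_hearings AS o
  ON lpad(e.ecbviolationnumber, 10, '0') = o.ticketnumber;
```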
I'll need to look into this more, but once I'm sure how the two columns should be formatted, what's the best approach for fixing them? I've seen that for some datasets these types of fixes are done in the SQL scripts, but other times they're made in dataset_transformations.py. Is there some logic for when one method is preferred?
I found out that Postgres has a COMMENT SQL command that allows tables and columns (and other database entities) to be documented. Here's an example:
CREATE TABLE boop( foo VARCHAR(50) );
COMMENT ON COLUMN boop.foo IS 'a funky thing';
When using psql, asking it for details on the table shows the comment for the column:
myuser=# \d+ boop
Table "public.boop"
Column | Type | Modifiers | Storage | Stats target | Description
--------+-----------------------+-----------+----------+--------------+---------------
foo | character varying(50) | | extended | | a funky thing
Some database clients will also display these comments. For example, Postico on OS X has a table design view that appears to display comments (though I'm not on OS X so I can't verify this).
If we have nycdb add comments for tables and columns, it could help new users (and even experienced ones) understand the structure of the data better. One way we could do this is by changing the YAML structure of dataset metadata to also support objects for fields instead of strings, e.g.:
# ...
schema:
  table_name: hpd_violations
  fields:
    # ...
    Block: integer
    Lot: integer
    Class:
      type: char(1)
      comment: |
        Indicator of seriousness of the violations, where A is the least serious
        and C is the most serious.
    InspectionDate: date
    ApprovedDate: date
# ...
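To sketch how the loader might turn that structure into SQL (the function and the exact YAML keys are hypothetical, not nycdb's API):

```python
def comment_statements(table_name, fields):
    """Yield COMMENT ON COLUMN statements for fields that carry a comment.

    Assumes the proposed YAML shape: a field's value is either a type
    string or a dict with 'type' and 'comment' keys.
    """
    for name, spec in fields.items():
        if isinstance(spec, dict) and "comment" in spec:
            text = spec["comment"].strip().replace("'", "''")  # escape quotes
            yield f"COMMENT ON COLUMN {table_name}.{name} IS '{text}';"

fields = {
    "Block": "integer",
    "Class": {"type": "char(1)", "comment": "Seriousness indicator (A to C)."},
}
statements = list(comment_statements("hpd_violations", fields))
```

The loader could emit these right after the CREATE TABLE for each dataset.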
Anyways, just an idea. Along with #85 this could help folks acclimatize to NYC-DB better.
While working on #41, I noticed that the Database.insert_rows() method executes an individual SQL INSERT statement for each row in a dataset.
I'm not a database expert by any means, but Django has a bulk_create method that inserts multiple records in a single query, which I've found increases efficiency quite a bit, at least in Django-land. So I'm curious whether this could speed up dataset loading in nycdb too.
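As a rough illustration of the idea (using sqlite3 since it's in the stdlib; with Postgres, psycopg2's executemany or execute_values would be the analogue):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE violations (bbl TEXT, violation_class TEXT)")

rows = [("1011587504", "A"), ("3002920026", "C"), ("2003450011", "B")]
# One call carrying many rows instead of one INSERT statement per row:
conn.executemany("INSERT INTO violations VALUES (?, ?)", rows)
count = conn.execute("SELECT count(*) FROM violations").fetchone()[0]
```

Batching amortizes the per-statement round-trip and parsing overhead, which is usually where per-row INSERT loops lose time.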
At JustFix, we've geocoded and massaged a bunch of marshal eviction data for 2017, 2018, and the first two months of 2019. It would be good to get this into NYCDB as a handy dataset.
It would be nice to have a script that does most of the grunt work involved in creating a new dataset.
I made a simple version of one and used it to create #96:
https://gist.github.com/toolness/ff8d00f36234442d650c63311178f8bd
I'm not sure whether it should be added to the nycdb repository or kept separate, though. It has a rather unorthodox integration test that adds a sample dataset to nycdb and then runs pytest on the test it generated for the new dataset, and I'm not sure if pytest is OK with running itself from within itself.
The code is definitely going to break as we evolve nycdb's architecture (see e.g. #94). I'm not sure if maintaining the script as nycdb evolves will be too much of a burden to keep it in the repo.
Anyways, I figured I'd plop the idea here to see what folks think.
HPD Order to Repair/Vacate Orders
I recently discovered this dataset and am interested in using it for a project that relies on the HDC's auto-updating nycdb instance, so I was hoping to add it to nycdb.
Currently, tables are named after the type of data they contain.
--------+-----------------+-------+----------
public | all_violations | table | deltanyc
public | dobjobs | table | deltanyc
public | dof_sales | table | deltanyc
public | open_violations | table | deltanyc
public | pluto_16v2 | table | deltanyc
It would be really helpful if they were prefixed with the agency name, both to avoid collisions and to help identify the source of the data.
For example, hpd.all_violations
The README currently has a "Future datasets to add" section that contains one item, "census data". To make things more participatory and encourage discussion, I'm wondering if it might be more useful to instead have a new GitHub issue label called "dataset request" (or something similar) and encourage folks to file issues with that label? For examples of such issues, see #22, #24, #76, #77, and #78.
For more context: we'd like to have a NYC-DB hackathon soon, and as a lead-in to that, we want to encourage HDC members to file github issues for potential improvements to NYC-DB. Adding new datasets seems like the kind of thing that it'd be useful to have a label for, especially since it makes it easy to see a list of all currently-requested new datasets.
Currently the date() function from the Typecast class doesn't support ISO 8601 dates (2019-12-01).
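A sketch of what supporting both formats could look like (this is not the actual Typecast.date() implementation):

```python
from datetime import datetime

def parse_date(value):
    """Accept both YYYYMMDD and ISO 8601 (YYYY-MM-DD) date strings."""
    for fmt in ("%Y%m%d", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None  # unparseable
```

Trying the formats in order keeps existing YYYYMMDD data working while accepting the ISO form.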
So in #53 (comment) @aepyornis mentioned:
A solution might be as simple as scripting nycdb to upload those datasets with different table names (or in a different schema keeping the name the same) and renaming the tables when they are ready.
First, I did not know what a Postgres schema was until just now. I was originally going to suggest adding a --table-prefix command-line option, but a schema seems like the more Postgres-y thing to do here.
So do you think it'd be useful to have a --schema command-line argument that lets users specify the schema of a table? I suppose in the code it might be very similar to the "table name prefix" idea, just using <schema-name>. as the prefix. Aside from my use case, it might also be useful for e.g. folks in a multi-tenant situation.
Anyways, if it's too much of a niche thing don't worry about it, I will just script all this myself and not worry about pushing anything upstream!
This should reduce database rebuild time by loading datasets that don't change (such as pluto) from a SQL dump.
***DOF ROLLING SALES***
Setting up python3 virtual environment with xlrd
Downloading/unpacking xlrd
Downloading xlrd-1.0.0-py3-none-any.whl (143kB): 143kB downloaded
Installing collected packages: xlrd
Successfully installed xlrd
Cleaning up...
psql:schema.sql:1: NOTICE: table "dof_sales" does not exist, skipping
DROP TABLE
CREATE TABLE
ERROR: invalid input syntax for integer: "-"
CONTEXT: COPY dof_sales, line 24761, column commercialunits: "-"
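The error suggests the source uses "-" as a placeholder for missing numeric values. A cast that maps such placeholders to NULL would fix it (a sketch, not the project's actual typecast code):

```python
def to_integer(value):
    """Cast a CSV cell to int, treating '' and '-' as missing (NULL)."""
    value = value.strip()
    if value in ("", "-"):
        return None
    return int(value.replace(",", ""))  # tolerate thousands separators
```

Applying something like this to commercialunits (and similar columns) before the COPY would let the load proceed.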
Hello again! I am still very new to all this, so I'm not sure if what I'm proposing is the right thing to do, but here goes...
According to a NYC DOF document on the ACRIS dataset:
The "Document ID" in the master record is used to link all other record types to a master record. To find all of the lot (property) records associated with a master record, simply retrieve all records in the
property dataset with the same “Document ID” as the selected master record. The same process should be repeated for Party, Cross Reference and Remark records.
So it seems like the various tables are all related via the documentid column.
However, when I try looking up the parties involved in a document by joining real_property_master and real_property_parties on documentid, it takes a really long time. Even a basic SELECT * FROM real_property_parties WHERE documentid = 'blah' takes maybe 30-60 seconds.
I also noticed that in acris.sql, there's an index on documentid in real_property_master, but not in real_property_parties. Should there basically be an index on this column in all the ACRIS tables?
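If so, the fix would presumably be one line per table in acris.sql, along the lines of (index name illustrative):

```sql
CREATE INDEX index_real_property_parties_on_documentid
  ON real_property_parties (documentid);
```

Without an index, that WHERE clause forces a sequential scan of the whole table, which would explain the 30-60 second queries.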
I noticed there's a src/tox.ini, but tox doesn't appear to be used by Travis... I was also confused by it because it mentions Python 3.4 while .travis.yml doesn't, which made me wonder what the earliest version of Python we support is.
Anyhow, just thought I'd mention it!
The NYC open data feed for PAD is useful because it essentially maps BINs to BBLs, which is very useful for cross-referencing different kinds of datasets.
It will also be useful in Who Owns What, so that we can let users know that the statistics they're seeing for a building might actually represent aggregate data amongst multiple buildings (e.g. if multiple BINs map to the same BBL).
getting the cloudflare error page
The Housing Litigations NYC Open Data feed contains some nice data on HP Actions filed by tenants that would be useful to have in NYC-DB. It looks like the raw data already has BBL information so it should be easy to import.
add gis boundaries for spatial queries
OrderedDict([('isndobbisviol', '129467'), ('boro', '1'), ('bin', '1014006'), ('block', '00769'), ('lot', '00068'), ('issuedate', '19910606'), ('violationtypecode', 'Z'), ('violationnumber', 'MTF2LE'), ('housenumber', '194'), ('street', '8 AVENUE'), ('dispositiondate', '19940231'), ('dispositioncomments', ''), ('devicenumber', ''), ('description', ''), ('ecbnumber', ''), ('number', 'V*060691ZMTF2LE'), ('violationcategory', 'V*-DOB VIOLATION - DISMISSED'), ('violationtype', 'Z-ZONING'), ('bbl', '1007690068')])
Traceback (most recent call last):
File "/home/nycdb/.local/bin/nycdb", line 11, in <module>
load_entry_point('nycdb==0.1.12', 'console_scripts', 'nycdb')()
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/cli.py", line 102, in main
dispatch(args)
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/cli.py", line 93, in dispatch
Dataset(args.load, args=args).db_import()
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/dataset.py", line 61, in db_import
self.import_schema(schema)
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/dataset.py", line 97, in import_schema
batch = list(itertools.islice(rows, 0, BATCH_SIZE))
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/typecast.py", line 144, in cast_rows
yield self.cast_row(row)
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/typecast.py", line 155, in cast_row
d[column] = self.cast[column.lower()](val)
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/typecast.py", line 180, in <lambda>
d[k] = lambda x: date(x)
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/typecast.py", line 85, in date
date = datetime.datetime.strptime(x, '%Y%m%d')
File "/usr/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
tt, fraction = _strptime(data_string, format)
File "/usr/lib/python3.6/_strptime.py", line 528, in _strptime
datetime_date(year, 1, 1).toordinal() + 1
ValueError: day is out of range for month
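The dispositiondate value '19940231' (February 31st) is an impossible date, which is what trips strptime. If we'd rather load the row with a NULL than crash the whole import, the cast could be made lenient (a sketch, not a patch to typecast.py):

```python
from datetime import datetime

def safe_date(value):
    """Parse YYYYMMDD, returning None for impossible dates like '19940231'."""
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except ValueError:
        return None
```

Whether silently nulling bad source dates is acceptable is a judgment call, but it would keep one corrupt row from aborting the whole dataset load.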
FYI, the website's SSL certificate is giving a Cloudflare error.
https://api.nycdb.info/ Error 526 Ray ID: 4c70e0214aeec5e2 • 2019-04-13 22:28:52 UTC
Invalid SSL certificate
So I was thinking about making a simple script that builds the scaffolding for a new dataset. For example, running python make_new_dataset.py my_new_dataset could create:
- src/nycdb/my_new_dataset.sql,
- src/nycdb/datasets/my_new_dataset.yml,
- src/tests/integration/test_my_new_dataset.py,
and so on. But I realized that the dataset transformation code is the only place where we can't create a new file; instead, we have to edit src/nycdb/dataset_transformations.py. While this is fine, another alternative could be to have each dataset's transformations live in a separate Python file/module/package named after the dataset, just like we already do with SQL and YAML files.
Some advantages of this approach:
Probably first and foremost, it's very discoverable. Devs can already see by browsing the repo's file structure that there are certain SQL and YAML files that correspond to specific datasets. But the location of Python-related code for datasets is opaque; moving each dataset's specific code to its own file/directory would make it easier to locate, and easier to intuit the architecture of NYCDB without having to read documentation.
It's straightforward to "remove" a dataset (e.g. if someone made a typo in the dataset name): just delete all files in the repo that have the dataset's base filename.
It's a bit more extensible, e.g. if we want to define more per-dataset functions, we can just hang them off the dataset's module/package.
If a dataset ends up having lots of dataset-specific transformation code, it can be housed in its own module/package rather than polluting dataset_transformations.py.
Anyways, just a thought I figured I'd put out there.
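Mechanically, the loader could resolve each dataset's transformations with a dynamic import. Here's a sketch (the per-dataset transformations package is hypothetical; the helper itself works with any package):

```python
import importlib

def load_dataset_module(package, dataset_name):
    """Import <package>.<dataset_name>.

    In the proposed layout, 'package' might be something like
    'nycdb.transformations', with one module per dataset, e.g.
    nycdb/transformations/hpd_violations.py.
    """
    return importlib.import_module(f"{package}.{dataset_name}")
```

The loader would then look up transformation functions on the returned module instead of on a single shared dataset_transformations module.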
They're in the original database as 'x' or null, and the columns are defined as boolean, but everything appears null in the data, so I don't think boolean is capturing it. It's easy for me to change boolean to text, but if you want some transformation that gets the data into Postgres as true/false, that's beyond me right now.
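If 'x' does mean true and blank/null means false (an assumption worth verifying against the dataset's data dictionary), the transformation could be as small as:

```python
def x_to_boolean(value):
    """Map the source's 'x'/blank convention onto booleans."""
    if value is None or value.strip() == "":
        return False
    return value.strip().lower() == "x"
```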
When loading multiple datasets at the same time (e.g. make -j 3 nyc-db), the progress bars overlay each other and are not readable.
All of the values in that field show up as null... maybe there is a bug in a date format transformation somewhere?
#91 fixed a warning, but it means we now require pyyaml 5.1+, and it'll break installations where pip has already installed an older version.
We should update the pyyaml dependency in requirements.yml and setup.py!
I've been thinking of some ways to maintain a constantly-updated NYC-DB instance, and I'm realizing that one potential pipeline is really just to use nycdb (the command-line program) as a tool that downloads the datasets and translates them into SQL files, rather than directly communicating with a Postgres instance. It could also be a potential workaround for #42, if that is actually a legitimate performance issue.
Having each dataset in its own SQL file would also give (advanced) users more control over how the data is loaded, since they could use psql to e.g. load a dataset in a single transaction, and so on. It would also make it easy to parallelize the creation of multiple SQL files, if one has lots of compute resources to spare.
I am fairly noobish when it comes to databases of this size, though, so my thinking could be silly.
Anyways, I don't think this should be hard to implement, since it ought to just involve writing the CREATE TABLE and INSERT statements to a file rather than executing them on the database.
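One way to picture it: give the loader a connection-like object that appends statements to a file instead of executing them (this class and its place in nycdb are hypothetical):

```python
class SqlFileWriter:
    """File-backed stand-in for a database connection's execute()."""

    def __init__(self, path):
        self.path = path

    def execute(self, statement):
        # Append the statement, normalizing the trailing semicolon/newline
        with open(self.path, "a") as f:
            f.write(statement.rstrip("; \t\n") + ";\n")
```

If the loader only ever calls execute(), swapping this in behind a flag would produce a .sql file for psql without touching the rest of the pipeline.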
@aepyornis if I were to submit a PR that added such functionality, would you accept it?
It would be really nice for nycdb to have an --unload command that simply undoes anything --load does.
The simple version of this could just drop the tables mentioned in the YAML file, but a more awesome version would also be able to drop any derived tables (e.g. hpd_registrations_grouped_by_bbl or whatever) and functions that were created... I have some code over in JustFixNYC/nycdb-k8s-loader that parses the actual SQL files to glean some of this information, if that'd be helpful, but maybe that's complicating things too much.
Hey, I noticed that some documentation on adding new datasets was added in #57, which is awesome! But it seems like it's Markdown rather than the reStructuredText that its .rst extension implies?
It also seems like it'd be useful to link to the document from the README maybe.
Should I issue a PR that makes these changes?