nycdb / nycdb
Database of NYC Housing Data
License: GNU Affero General Public License v3.0
We've recently completed a download of all 2019 Statement of Account (SOA) tax bill data that corresponds to BBLs stored in PLUTO 19_v1. We also converted the PDFs to text and attempted to parse out information about how many rent-stabilized units are mentioned in each BBL's tax bill (if any).
The result is a dataset that is different from the existing rent stabilization data in a number of ways:
We only pulled the rent stabilization numbers from the tax bills. We didn't scrape anything about J-51 and other tax abatements because the most important ones have recently been added to NYC's open data portal and brought into NYCDB by PRs like #80. (If there is strong interest in obtaining other numbers from the PDFs, we can always add them later.)
We are applying as little post-processing to the data as possible, because interviews with users of the existing stabilization data revealed confusion about and distrust of its numbers. In line with this, we would also like to provide direct links to the PDFs of the tax bills the data was scraped from. This lets anyone easily trace the provenance of every statistic if they have any doubts about the accuracy of the scrape.
Given this, we're proposing a new dataset with the following structure:
BBL | soa_downloaded | soa_rentstab_units | soa_url
---|---|---|---
1011587504 | t | 31 | https://nyc-doffer.s3.amazonaws.com/pdf/1/01158/7504/soa-2019-06-05.pdf.br
select streetname from hpd_registrations where streetname ~ 'DRIVEIVE';
Looks like it's an issue with `src/nycdb/address.py`.
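One plausible cause (an assumption on my part, I haven't read `address.py` closely) is naive substring replacement when expanding street-suffix abbreviations: replacing `DR` with `DRIVE` inside a name that is already `DRIVE` yields `DRIVEIVE`. A minimal sketch of the bug and a token-based fix:

```python
# Hypothetical abbreviation table; the real one lives in src/nycdb/address.py
ABBREVIATIONS = {"DR": "DRIVE", "ST": "STREET"}

def expand_naive(street):
    # Buggy: blind substring replacement re-expands already-full names
    for abbr, full in ABBREVIATIONS.items():
        street = street.replace(abbr, full)
    return street

def expand_by_token(street):
    # Safer: only expand a token that is exactly an abbreviation
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in street.split())

expand_naive("OCEAN DRIVE")     # "OCEAN DRIVEIVE" -- the bug
expand_by_token("OCEAN DR")     # "OCEAN DRIVE"
expand_by_token("OCEAN DRIVE")  # "OCEAN DRIVE" -- already-full names untouched
```

If the real code does whole-string replacement like `expand_naive`, a street already stored as `DRIVE` would come out as `DRIVEIVE`, matching the query above.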
Step 6 currently says:

Step 6 - re-install nycdb to register the new dataset

pip uninstall nycdb
pip setup.py install

(from inside of the src directory)

This is quite cumbersome to do after every code change, and it shouldn't be necessary if the developer has installed nycdb in editable mode, e.g. via `pip install -e .`. This is actually how the project is installed in the Docker setup, so at the very least we should add a note mentioning that this step isn't needed when using Docker. Ideally, though, I think we should tell developers to install nycdb in editable mode if they actually want to hack on it, and then just remove step 6 entirely.
Let's take stock of the current situation. The database is currently composed of 8 datasets, whose loaders are written in various languages:
module | language |
---|---|
311 | ruby |
acris | makefile |
dobjobs | python |
dof-sales | python + shell |
hpd-registrations | python + shell |
hpd-violations | shell |
pluto | python + shell |
rent-stabilization | python |
A little more than half rely on Python, but they don't share much code. Moreover, the download scripts leave something to be desired: there's no ability to skip already-downloaded files, resume downloads, import in parallel, etc.
Adding BBLs, processing addresses, and downloading from open data are features that many of the data importers could use.
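As a sketch of what shared downloader code might look like (names and behavior are illustrative, not nycdb's actual API), skipping already-downloaded files can be as simple as:

```python
import os
import urllib.request

def download_file(url, dest):
    """Download url to dest, skipping files that already exist.

    Illustrative sketch only; a real shared helper would also want
    resumable and parallel downloads.
    """
    if os.path.exists(dest):
        return False  # already downloaded, skip
    tmp = dest + ".part"
    urllib.request.urlretrieve(url, tmp)
    # Atomic rename so an interrupted download never looks complete:
    os.replace(tmp, dest)
    return True
```

Factoring logic like this into one place would let every importer benefit from improvements at once.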
Formal request to add DOB violations data https://data.cityofnewyork.us/Housing-Development/DOB-Violations/3h2n-5cm9 :)
parse.rb:58:in `add_bbl': undefined method `include?' for 34:Fixnum (NoMethodError)
While working on #25, I noticed that nycdb.file.download_file() seems to write CSV files in text mode, decoding each 512k chunk as UTF-8. While it's unlikely, I think it's possible for the last byte of a chunk to fall in the middle of a multi-byte UTF-8 sequence, which would cause data to be lost unnecessarily.
(I'm also guessing the downloaded data could be in an encoding other than UTF-8, in which case lots of characters might be garbled, so it might be useful to check the Content-Type header for a charset, but I really have no idea as I'm unfamiliar with the datasets being retrieved.)
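To illustrate the chunk-boundary concern, here's a small demonstration (not nycdb's actual code) of how per-chunk decoding fails and how the stdlib's incremental decoder handles it:

```python
import codecs

data = "café".encode("utf-8")        # 5 bytes: the 'é' is a two-byte sequence
chunk1, chunk2 = data[:4], data[4:]  # the boundary splits the 'é'

# Decoding each chunk independently fails on the partial sequence:
try:
    chunk1.decode("utf-8")
except UnicodeDecodeError:
    pass  # the chunk ends mid-character

# An incremental decoder buffers the incomplete bytes between chunks:
decoder = codecs.getincrementaldecoder("utf-8")()
text = decoder.decode(chunk1) + decoder.decode(chunk2, final=True)
assert text == "café"
```

Downloading in binary mode, or feeding chunks through an incremental decoder like this, would avoid the issue.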
Anyhow, I decided not to address this in #25 because I just wanted to focus on the progress bar logic, but I figured I'd flag it here.
I wasn't sure where else to discuss it and thought a github issue might be OK, but we can move this discussion somewhere else if needed.
Problem statement: Updating NYC-DB takes several hours because each very large table must be rebuilt from scratch from source CSV files. This results in a lot of duplicate effort, because most of the data already exists in the database.
Here's some prior work that works around the lack of an explicit update mechanism, and/or explores potential solutions to the update problem:
https://github.com/justFixNYC/who-owns-what contains a dbtool.py script that orchestrates the running of nycdb by dropping existing tables for datasets and deleting existing CSV files before re-downloading and re-loading them.
JustFixNYC/nycdb-k8s-loader has a number of mechanisms to ease updating the database:
JustFixNYC/nycdb-k8s-loader#10 keeps track of the Last-Modified and/or ETag headers of downloaded files to make conditional requests when downloading datasets in the future. This allows us to easily skip datasets if they haven't changed.
JustFixNYC/nycdb-k8s-loader#4 uses nycdb as a python module to first load datasets into a temporary schema to ensure that the public schema can still be queried while new data is being loaded.
toolness/rust-nycdb-csv-fun explores the notion of creating what amounts to a changelog for CSV files. This effectively lets us retrieve only the rows that have changed or been added since the last version of a CSV file, though it currently has some constraints, such as:
Each row must have a primary key that uniquely identifies it. It's not clear whether all tables have this, but that isn't a dealbreaker, since supporting updates on some tables is still better than supporting them on none.
Row deletion is unsupported. While such support could be added, it's unclear whether it would actually be desirable behavior (for example, when processing a CSV of rolling financial data, we probably want to keep the old financial data that has been "rolled off").
The aforementioned rust-nycdb-csv-fun shows some promise, but there are a few things holding it back from being a full solution. In particular, the SQL scripts that post-process a dataset after it's been loaded often do things that might make it difficult to modify a dataset's tables afterwards, e.g.:
1. Making changes to a table's schema by adding columns with values computed from pre-existing columns, such as adding a bbl column to ACRIS data.
2. Modifying existing columns by "cleaning up" their values, e.g. hpd_registrations/contacts_cleanup.sql.
3. Creating brand-new tables that use data derived from the original dataset tables, such as hpd_registrations/registrations_grouped_by_bbl_with_contacts.sql.
(It should also be noted that who-owns-what has its own custom SQL scripts that execute on top of nycdb's "official" ones, and individuals might have their own custom scripts that they run after installing nycdb. It'd be nice if such things were either still supported, or if guidance could be provided on how to adapt such existing solutions to any new paradigm that we create.)
One way to deal with (1) and (2) is to actually crib from nycdb-k8s-loader and load & post-process the updated rows in a separate temporary schema. Then for any existing rows that were changed, we could drop them from the public schema's version of the table. Finally, we'd just "append" the table in the temporary schema to the one in the public schema.
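A rough SQL sketch of that flow (schema, table, and key names are purely illustrative):

```sql
CREATE SCHEMA staging;
-- ...load and post-process the updated rows into staging.hpd_violations...

-- Drop rows from public that were changed in this update:
DELETE FROM public.hpd_violations AS v
  USING staging.hpd_violations AS s
  WHERE v.violationid = s.violationid;  -- assumes a usable primary key

-- Append the staged rows and clean up:
INSERT INTO public.hpd_violations SELECT * FROM staging.hpd_violations;
DROP SCHEMA staging CASCADE;
```

The upside is that the public table stays queryable the whole time, and the post-processing SQL runs only against the (small) staged rows.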
Another way to deal with (1) and (2) is to move such changes from SQL into the Python code that loads the data into the database. However, this might be slower than doing things in SQL, and it could also increase the barrier to entry for contributors who know SQL but not Python. That said, it could do the opposite for folks who prefer Python over SQL, and it might also be easier to unit test the changes in Python-land too.
Dealing with (3) might be straightforward if we just convert the CREATE TABLE statements to CREATE MATERIALIZED VIEW statements. However, I haven't actually used materialized views before (I've only read about them), so I don't know if this would work in practice.
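For instance, a derived table could hypothetically become something like this (columns are simplified for illustration, this is not the real script):

```sql
CREATE MATERIALIZED VIEW hpd_registrations_grouped_by_bbl AS
  SELECT bbl, count(*) AS registration_count
  FROM hpd_registrations
  GROUP BY bbl;

-- After the underlying table is reloaded, the derived data is one statement away:
REFRESH MATERIALIZED VIEW hpd_registrations_grouped_by_bbl;
```

The appeal is that the derivation logic lives in one place and Postgres handles regenerating it.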
Anyways, that's all I've got for now. Thoughts @aepyornis (or anyone else)?
Hi! I noticed that in 4512e6b the Python version used by the Dockerfile got changed to 3.6. But the reason it was 3.5 was that, by forcing Docker to use the minimum version of Python that NYCDB supports, we wouldn't accidentally use 3.6 features that would break folks on 3.5...
Or are we dropping 3.5 support from NYCDB? Anyways, just thought I'd flag it.
Dominic told me at yesterday's HDC meeting that the city just released an open data feed of J-51 data!
https://data.cityofnewyork.us/City-Government/J-51-Exemption-and-Abatement/y7az-s7wc
Until now this has only been available as a bulk download, which we have some code for parsing, but now that an open data feed is available, we should probably just use that.
I think it'd be useful to have Markdown documentation for each dataset. For example, the hpd_registrations dataset could have a file called hpd_registrations.md that explains things like:
- Derived tables such as hpd_registrations_grouped_by_bbl, and why they might be useful.
- Search features: the dobjobs dataset seems to support full-text search on some columns, which may be useful to document.
I'm not sure where such markdown files would live, though. Here are some ideas:
- Alongside the YAML files that describe the datasets, e.g. src/nycdb/datasets/hpd_registrations.md.
- Packaged in the same file as the dataset metadata, by converting the YAML files into Markdown files with YAML front matter.
- In a separate, new docs/ directory (we'd want to make sure such a directory is included in the published Python package too!).
Regardless of where we put it, we should also remember to update the ADDING_NEW_DATASETS.md file to encourage new dataset creators to document their datasets.
Hello again!
I really like the integration test data for this project. It's been very useful for building tests for any project that uses NYCDB, such as Who Owns What, as well as some other experiments of mine.
However, the test data isn't in the PyPI distribution of the project (at least, I don't think it is). This means that for all my projects I've had to download the ZIP file version of the repository to get the data.
Since the data is only about a megabyte or two in size, I'm curious if it'd be helpful to developers to have it just ship with the PyPI distribution, along with documentation so that devs know it's there? No worries if not, I could be the only person who finds it handy.
As I was creating the documentation entries for these datasets, I noticed that the oath_hearings and ecb_violations tables can be linked using a common field, but it's formatted differently in each table.
In ecb_violations the column is ecbviolationnumber; it's stored as text and is usually 7 or 8 characters long.
In oath_hearings the same column is called ticketnumber; it's also stored as text and is mostly (78%) 10 characters long (the rest are mostly 6-9 characters).
If you left-pad ecbviolationnumber to 10 characters with zeros, the join works and you get correct matches. However, I'm not yet sure what to make of the values of ticketnumber in oath_hearings that are under 10 characters.
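For reference, the padded join I've been testing looks roughly like this (using Postgres's lpad; column aliases are illustrative):

```sql
SELECT o.*, e.*
FROM ecb_violations AS e
JOIN oath_hearings AS o
  ON lpad(e.ecbviolationnumber, 10, '0') = o.ticketnumber;
```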
I'll need to look into this more, but once I'm sure how the two columns should be formatted, what's the best approach for fixing them? I've seen that for some datasets these types of fixes are done in the SQL scripts, but other times they're made in dataset_transformations.py. Is there some logic for when one method is preferred?
I found out that Postgres has a COMMENT SQL command that allows tables and columns (and other database entities) to be documented. Here's an example:
CREATE TABLE boop( foo VARCHAR(50) );
COMMENT ON COLUMN boop.foo IS 'a funky thing';
When using psql, asking it for details on the table shows the comment for the column:
myuser=# \d+ boop
Table "public.boop"
Column | Type | Modifiers | Storage | Stats target | Description
--------+-----------------------+-----------+----------+--------------+---------------
foo | character varying(50) | | extended | | a funky thing
Some database clients will also display these comments. For example, Postico on OS X has a table design view that appears to display comments (though I'm not on OS X so I can't verify this).
If we have nycdb add comments for tables and columns, it could help new users (and even experienced ones) understand the structure of the data better. One way we could do this is by changing the YAML structure of dataset metadata to also support objects for fields instead of strings, e.g.:
# ...
schema:
  table_name: hpd_violations
  fields:
    # ...
    Block: integer
    Lot: integer
    Class:
      type: char(1)
      comment: |
        Indicator of seriousness of the violations, where A is the least serious
        and C is the most serious.
    InspectionDate: date
    ApprovedDate: date
# ...
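To sketch how the loader might turn that structure into SQL (the function and the exact YAML keys are hypothetical, not nycdb's API):

```python
def comment_statements(table_name, fields):
    """Yield COMMENT ON COLUMN statements for fields that carry a comment.

    Assumes the proposed YAML shape: a field's value is either a type
    string or a dict with 'type' and 'comment' keys.
    """
    for name, spec in fields.items():
        if isinstance(spec, dict) and "comment" in spec:
            text = spec["comment"].strip().replace("'", "''")  # escape quotes
            yield f"COMMENT ON COLUMN {table_name}.{name} IS '{text}';"

fields = {
    "Block": "integer",
    "Class": {"type": "char(1)", "comment": "Seriousness indicator (A to C)."},
}
statements = list(comment_statements("hpd_violations", fields))
```

The loader could emit these right after the CREATE TABLE for each dataset.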
Anyways, just an idea. Along with #85 this could help folks acclimatize to NYC-DB better.
While working on #41, I noticed that the Database.insert_rows() method executes an individual SQL INSERT statement for each row in a dataset.
I'm not a database expert by any means, but Django has a bulk_create method that inserts multiple records in a single query, which I've found increases efficiency quite a bit, at least in Django-land. So I'm curious whether this could speed up dataset loading in nycdb too.
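As a rough illustration of the idea (using sqlite3 since it's in the stdlib; with Postgres, psycopg2's executemany or execute_values would be the analogue):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE violations (bbl TEXT, violation_class TEXT)")

rows = [("1011587504", "A"), ("3002920026", "C"), ("2003450011", "B")]
# One call carrying many rows instead of one INSERT statement per row:
conn.executemany("INSERT INTO violations VALUES (?, ?)", rows)
count = conn.execute("SELECT count(*) FROM violations").fetchone()[0]
```

Batching amortizes the per-statement round-trip and parsing overhead, which is usually where per-row INSERT loops lose time.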
At JustFix, we've geocoded and massaged a bunch of marshal eviction data for 2017, 2018, and the first two months of 2019. It would be good to get this into NYCDB as a handy dataset.
It would be nice to have a script that does most of the grunt work involved in creating a new dataset.
I made a simple version of one and used it to create #96:
https://gist.github.com/toolness/ff8d00f36234442d650c63311178f8bd
I'm not sure whether it should be added to the nycdb repository or kept separate, though. It has a rather unorthodox integration test that adds a sample dataset to nycdb and then runs pytest on the test it generated for the new dataset, and I'm not sure if pytest is OK with running itself from within itself.
The code is definitely going to break as we evolve nycdb's architecture (see e.g. #94). I'm not sure if maintaining the script as nycdb evolves will be too much of a burden to keep it in the repo.
Anyways, I figured I'd plop the idea here to see what folks think.
HPD Order to Repair/Vacate Orders
I recently discovered this dataset and am interested in using it for a project that relies on the HDC's auto-updating nycdb instance, so I was hoping to add it to nycdb.
Currently, tables are named after the type of data they contain.
--------+-----------------+-------+----------
public | all_violations | table | deltanyc
public | dobjobs | table | deltanyc
public | dof_sales | table | deltanyc
public | open_violations | table | deltanyc
public | pluto_16v2 | table | deltanyc
It would be really helpful if they were prefixed with the agency name, both to avoid collisions and to help identify the source of the data.
For example, hpd.all_violations
The README currently has a "Future datasets to add" section that contains one item, "census data". To make things more participatory and encourage discussion, I'm wondering if it might be more useful to instead have a new GitHub issue label called "dataset request" (or something similar) and encourage folks to file issues with that label? For examples of such issues, see #22, #24, #76, #77, and #78.
For more context: we'd like to have a NYC-DB hackathon soon, and as a lead-in to that, we want to encourage HDC members to file github issues for potential improvements to NYC-DB. Adding new datasets seems like the kind of thing that it'd be useful to have a label for, especially since it makes it easy to see a list of all currently-requested new datasets.
Currently the date() function from the Typecast class doesn't support ISO 8601 dates (2019-12-01).
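A sketch of what supporting both formats could look like (this is not the actual Typecast.date() implementation):

```python
from datetime import datetime

def parse_date(value):
    """Accept both YYYYMMDD and ISO 8601 (YYYY-MM-DD) date strings."""
    for fmt in ("%Y%m%d", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None  # unparseable
```

Trying the formats in order keeps existing YYYYMMDD data working while accepting the ISO form.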
So in #53 (comment) @aepyornis mentioned:
A solution might be as simple as scripting nycdb to upload those datasets with different table names (or in a different schema keeping the name the same) and renaming the tables when they are ready.
First, I did not know what a Postgres schema was until just now. I was originally going to suggest adding a --table-prefix command-line option, but a schema seems like the more Postgres-y thing to do here.
So do you think it'd be useful to have a --schema command-line argument that lets users specify the schema of a table? I suppose in the code it might be very similar to the "table name prefix" idea, just using <schema-name>. as the prefix. Aside from my use case, it might also be useful for e.g. folks in a multi-tenant situation.
Anyways, if it's too much of a niche thing don't worry about it, I will just script all this myself and not worry about pushing anything upstream!
This should reduce database rebuild time by loading datasets that don't change (such as pluto) from a SQL dump.
***DOF ROLLING SALES***
Setting up python3 virtual environment with xlrd
Downloading/unpacking xlrd
Downloading xlrd-1.0.0-py3-none-any.whl (143kB): 143kB downloaded
Installing collected packages: xlrd
Successfully installed xlrd
Cleaning up...
psql:schema.sql:1: NOTICE: table "dof_sales" does not exist, skipping
DROP TABLE
CREATE TABLE
ERROR: invalid input syntax for integer: "-"
CONTEXT: COPY dof_sales, line 24761, column commercialunits: "-"
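The error suggests the source uses "-" as a placeholder for missing numeric values. A cast that maps such placeholders to NULL would fix it (a sketch, not the project's actual typecast code):

```python
def to_integer(value):
    """Cast a CSV cell to int, treating '' and '-' as missing (NULL)."""
    value = value.strip()
    if value in ("", "-"):
        return None
    return int(value.replace(",", ""))  # tolerate thousands separators
```

Applying something like this to commercialunits (and similar columns) before the COPY would let the load proceed.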
Hello again! I am still very new to all this, so I'm not sure if what I'm proposing is the right thing to do, but here goes...
According to a NYC DOF document on the ACRIS dataset:
The "Document ID" in the master record is used to link all other record types to a master record. To find all of the lot (property) records associated with a master record, simply retrieve all records in the
property dataset with the same “Document ID” as the selected master record. The same process should be repeated for Party, Cross Reference and Remark records.
So it seems like the various tables are all related via the documentid column.
However, when I try looking up the parties involved in a document by joining real_property_master and real_property_parties on documentid, it takes a really long time. Even a basic SELECT * FROM real_property_parties WHERE documentid = 'blah' takes maybe 30-60 seconds.
I also noticed that in acris.sql, there's an index on documentid in real_property_master, but not in real_property_parties. Should there basically be an index on this column in all the ACRIS tables?
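If so, the fix would presumably be one line per table in acris.sql, along the lines of (index name illustrative):

```sql
CREATE INDEX index_real_property_parties_on_documentid
  ON real_property_parties (documentid);
```

Without an index, that WHERE clause forces a sequential scan of the whole table, which would explain the 30-60 second queries.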
I noticed there's a src/tox.ini, but tox doesn't appear to be used by Travis... I was also confused by it because it mentions Python 3.4 while .travis.yml doesn't, which made me wonder what the earliest version of Python we support is.
Anyhow, just thought I'd mention it!
The NYC open data feed for PAD is useful because it essentially maps BINs to BBLs, which is very useful for cross-referencing different kinds of datasets.
It will also be useful in Who Owns What, so that we can let users know that the statistics they're seeing for a building might actually represent aggregate data amongst multiple buildings (e.g. if multiple BINs map to the same BBL).
getting the cloudflare error page
The Housing Litigations NYC Open Data feed contains some nice data on HP Actions filed by tenants that would be useful to have in NYC-DB. It looks like the raw data already has BBL information so it should be easy to import.
add gis boundaries for spatial queries
OrderedDict([('isndobbisviol', '129467'), ('boro', '1'), ('bin', '1014006'), ('block', '00769'), ('lot', '00068'), ('issuedate', '19910606'), ('violationtypecode', 'Z'), ('violationnumber', 'MTF2LE'), ('housenumber', '194'), ('street', '8 AVENUE'), ('dispositiondate', '19940231'), ('dispositioncomments', ''), ('devicenumber', ''), ('description', ''), ('ecbnumber', ''), ('number', 'V*060691ZMTF2LE'), ('violationcategory', 'V*-DOB VIOLATION - DISMISSED'), ('violationtype', 'Z-ZONING'), ('bbl', '1007690068')])
Traceback (most recent call last):
File "/home/nycdb/.local/bin/nycdb", line 11, in <module>
load_entry_point('nycdb==0.1.12', 'console_scripts', 'nycdb')()
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/cli.py", line 102, in main
dispatch(args)
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/cli.py", line 93, in dispatch
Dataset(args.load, args=args).db_import()
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/dataset.py", line 61, in db_import
self.import_schema(schema)
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/dataset.py", line 97, in import_schema
batch = list(itertools.islice(rows, 0, BATCH_SIZE))
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/typecast.py", line 144, in cast_rows
yield self.cast_row(row)
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/typecast.py", line 155, in cast_row
d[column] = self.cast[column.lower()](val)
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/typecast.py", line 180, in <lambda>
d[k] = lambda x: date(x)
File "/home/nycdb/.local/lib/python3.6/site-packages/nycdb/typecast.py", line 85, in date
date = datetime.datetime.strptime(x, '%Y%m%d')
File "/usr/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
tt, fraction = _strptime(data_string, format)
File "/usr/lib/python3.6/_strptime.py", line 528, in _strptime
datetime_date(year, 1, 1).toordinal() + 1
ValueError: day is out of range for month
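The dispositiondate value '19940231' (February 31st) is an impossible date, which is what trips strptime. If we'd rather load the row with a NULL than crash the whole import, the cast could be made lenient (a sketch, not a patch to typecast.py):

```python
from datetime import datetime

def safe_date(value):
    """Parse YYYYMMDD, returning None for impossible dates like '19940231'."""
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except ValueError:
        return None
```

Whether silently nulling bad source dates is acceptable is a judgment call, but it would keep one corrupt row from aborting the whole dataset load.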
FYI, the website's SSL certificate is giving a Cloudflare error.
https://api.nycdb.info/ Error 526 Ray ID: 4c70e0214aeec5e2 • 2019-04-13 22:28:52 UTC
Invalid SSL certificate
So I was thinking about making a simple script that builds the scaffolding for a new dataset. For example, running python make_new_dataset.py my_new_dataset could create:
- src/nycdb/my_new_dataset.sql,
- src/nycdb/datasets/my_new_dataset.yml,
- src/tests/integration/test_my_new_dataset.py,
and so on. But I realized that the dataset transformation code is the only place where we can't create a new file; instead, we have to edit src/nycdb/dataset_transformations.py. While this is fine, another alternative could be to have each dataset's transformations live in a separate Python file/module/package named after the dataset, just like we already do with SQL and YAML files.
Some advantages of this approach:
Probably first and foremost, it's very discoverable. Devs can already see by browsing the repo's file structure that there are certain SQL and YAML files that correspond to specific datasets. But the location of Python-related code for datasets is opaque; moving each dataset's specific code to its own file/directory would make it easier to locate, and easier to intuit the architecture of NYCDB without having to read documentation.
It's straightforward to "remove" a dataset (e.g. if someone made a typo in the dataset name): just delete all files in the repo that have the dataset's base filename.
It's a bit more extensible, e.g. if we want to define more per-dataset functions, we can just hang them off the dataset's module/package.
If a dataset ends up having lots of dataset-specific transformation code, it can be housed in its own module/package rather than polluting dataset_transformations.py.
Anyways, just a thought I figured I'd put out there.
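Mechanically, the loader could resolve each dataset's transformations with a dynamic import. Here's a sketch (the per-dataset transformations package is hypothetical; the helper itself works with any package):

```python
import importlib

def load_dataset_module(package, dataset_name):
    """Import <package>.<dataset_name>.

    In the proposed layout, 'package' might be something like
    'nycdb.transformations', with one module per dataset, e.g.
    nycdb/transformations/hpd_violations.py.
    """
    return importlib.import_module(f"{package}.{dataset_name}")
```

The loader would then look up transformation functions on the returned module instead of on a single shared dataset_transformations module.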
They're in the original database as 'x' or null, and the columns are defined as boolean, but everything appears null in the data, so I don't think boolean is capturing it. It's easy for me to change boolean to text, but if you want some transformation that gets the data into Postgres as true/false, that's beyond me right now.
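If 'x' does mean true and blank/null means false (an assumption worth verifying against the dataset's data dictionary), the transformation could be as small as:

```python
def x_to_boolean(value):
    """Map the source's 'x'/blank convention onto booleans."""
    if value is None or value.strip() == "":
        return False
    return value.strip().lower() == "x"
```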
When loading multiple datasets at the same time (e.g. make -j 3 nyc-db), the progress bars overlay each other and are not readable.
All of the values in that field show up as null... maybe there is a bug in a date format transformation somewhere?
#91 fixed a warning, but it means we now require pyyaml 5.1+, and it'll break installations where pip has already installed an older version.
We should update the pyyaml dependency in requirements.yml and setup.py!
I've been thinking of some ways to maintain a constantly-updated NYC-DB instance, and I'm realizing that one potential pipeline is really just to use nycdb (the command-line program) as a tool that downloads the datasets and translates them into SQL files, rather than directly communicating with a Postgres instance. It could also be a potential workaround for #42, if that is actually a legitimate performance issue.
Having each dataset in its own SQL file would also give (advanced) users more control over how the data is loaded, since they could use psql to e.g. load a dataset in a single transaction, and so on. It would also make it easy to parallelize the creation of multiple SQL files, if one has lots of compute resources to spare.
I am fairly noobish when it comes to databases of this size, though, so my thinking could be silly.
Anyways, I don't think this should be hard to implement, since it ought to just involve writing the CREATE TABLE and INSERT statements to a file rather than executing them on the database.
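One way to picture it: give the loader a connection-like object that appends statements to a file instead of executing them (this class and its place in nycdb are hypothetical):

```python
class SqlFileWriter:
    """File-backed stand-in for a database connection's execute()."""

    def __init__(self, path):
        self.path = path

    def execute(self, statement):
        # Append the statement, normalizing the trailing semicolon/newline
        with open(self.path, "a") as f:
            f.write(statement.rstrip("; \t\n") + ";\n")
```

If the loader only ever calls execute(), swapping this in behind a flag would produce a .sql file for psql without touching the rest of the pipeline.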
@aepyornis if I were to submit a PR that added such functionality, would you accept it?
It would be really nice for nycdb to have an --unload command that simply undoes anything --load does.
The simple version of this could just drop the tables mentioned in the YAML file, but a more awesome version would also be able to drop any derived tables (e.g. hpd_registrations_grouped_by_bbl or whatever) and functions that were created... I have some code over in JustFixNYC/nycdb-k8s-loader that parses the actual SQL files to glean some of this information, if that'd be helpful, but maybe that's complicating things too much.
Hey, I noticed that some documentation on adding new datasets was added in #57, which is awesome! But it seems like it's Markdown rather than the reStructuredText that its .rst extension implies?
It also seems like it'd be useful to link to the document from the README maybe.
Should I issue a PR that makes these changes?