toddwschneider / nyc-taxi-data

Import public NYC taxi and for-hire vehicle (Uber, Lyft) trip data into a PostgreSQL or ClickHouse database

License: MIT License

R 90.80% Shell 7.70% Ruby 1.50%
clickhouse nyc nyc-taxi-dataset postgresql

nyc-taxi-data's Introduction

New York City Taxi and For-Hire Vehicle Data

Scripts to download, process, and analyze data from 3+ billion taxi and for-hire vehicle (Uber, Lyft, etc.) trips originating in New York City since 2009. There are separate sets of scripts for storing data in either a PostgreSQL or ClickHouse database.

Most of the raw data comes from the NYC Taxi & Limousine Commission.

The repo was originally created in support of this post: Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance

TLC 2022 Parquet Format Update

The TLC changed the raw data format from CSV to Apache Parquet in May 2022, including a full replacement of all historical files. This repo is now updated to handle the Parquet files in one of two ways:

  1. The "old" Postgres-based code still works, by adding an intermediate step that converts each Parquet file into a CSV before using the Postgres COPY command
  2. A separate set of scripts loads the Parquet files directly into a ClickHouse database

As part of the May 2022 update, the TLC added several new columns to the High Volume For-Hire Vehicle (Uber, Lyft) trip files, including information about passenger fares, driver pay, and time spent waiting for passengers. These new fields are available back to February 2019.

This repo no longer works with the old CSV files provided by the TLC. Those files are no longer available to download from the TLC's website, but if you happen to have them lying around and want to use this repo, you should look at this older version of the code from before the Parquet file format change.

ClickHouse Instructions

See the clickhouse directory

PostgreSQL Instructions

1. Install PostgreSQL and PostGIS

Both are available via Homebrew on Mac

2. Install R

From CRAN

Note that R used to be optional for this repo, but is required starting with the 2022 file format change. The scripts use R to convert Parquet files to CSV before loading into Postgres. There are other ways to convert from Parquet to CSV that wouldn't require R, but I found that R's arrow package was faster than some of the other CLI tools I tried

3. Download raw data

./download_raw_data.sh

4. Initialize database and set up schema

./initialize_database.sh

5. Import taxi and FHV data

./import_yellow_taxi_trip_data.sh
./import_green_taxi_trip_data.sh
./import_fhv_taxi_trip_data.sh
./import_fhvhv_trip_data.sh

Note that the full import process might take several hours or possibly even over a day depending on computing power

Schema

  • trips table contains all yellow and green taxi trips. Each trip has a cab_type_id, which references the cab_types table and refers to one of yellow or green
  • fhv_trips table contains all for-hire vehicle trip records, including ride-hailing apps Uber, Lyft, Via, and Juno
  • fhv_bases maps fhv_trips to base names and "doing business as" labels, which include ride-hailing app names
  • nyct2010 table contains NYC census tracts plus the Newark Airport. It also maps census tracts to NYC's official neighborhood tabulation areas
  • taxi_zones table contains the TLC's official taxi zone boundaries. Starting in July 2016, the TLC no longer provides pickup and dropoff coordinates. Instead, each trip comes with taxi zone pickup and dropoff location IDs
  • central_park_weather_observations has summary weather data by date
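The trips-to-cab_types relationship described above can be illustrated with a tiny in-memory sketch. This uses sqlite3 purely for brevity (the project itself uses PostgreSQL), and the column set and sample rows are invented for the demo; only the table names and the cab_type_id foreign key follow the README.

```python
import sqlite3

# Miniature, illustrative version of the schema: each trip references
# cab_types via cab_type_id, which resolves to 'yellow' or 'green'
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cab_types (id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE trips (
        id INTEGER PRIMARY KEY,
        cab_type_id INTEGER REFERENCES cab_types(id)
    );
    INSERT INTO cab_types (id, type) VALUES (1, 'yellow'), (2, 'green');
    INSERT INTO trips (cab_type_id) VALUES (1), (1), (2);
""")

# Count trips per cab type, as one might against the real trips table
rows = conn.execute("""
    SELECT ct.type, COUNT(*)
    FROM trips t
    JOIN cab_types ct ON ct.id = t.cab_type_id
    GROUP BY ct.type
    ORDER BY ct.type
""").fetchall()
print(rows)  # [('green', 1), ('yellow', 2)]
```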

Other data sources

These are bundled with the repository, so there's no need to download them separately.

See Also

Mark Litwintschik has used the taxi dataset to benchmark performance of many different technology stacks, including PostgreSQL and ClickHouse. His summary is here: https://tech.marksblogg.com/benchmarks.html

TLC summary statistics

There's a Ruby script in the tlc_statistics/ folder to import data from the TLC's summary statistics reports:

ruby import_statistics_data.rb

These summary statistics are used in the NYC Taxi & Ridehailing Stats dashboard

Taxi vs. Citi Bike comparison

Code in support of the post When Are Citi Bikes Faster Than Taxis in New York City? lives in the citibike_comparison/ folder

2017 update

Code in support of the 2017 update to the original post lives in the analysis/2017_update/ folder

Questions/issues/contact

[email protected], or open a GitHub issue

nyc-taxi-data's People

Contributors

dbkaplun, seifer08ms, toddwschneider


nyc-taxi-data's Issues

Error: Aesthetics must be either length 1 or the same as the data

After importing all the data and running cat analysis.R | R --no-save I get the following error:

> png(filename = "graphs/taxi_uber_lyft_trips_per_day.png", width = 640, height = 520)
> ggplot(data = trips_per_day, aes(x = date, y = trips_per_day, color = category)) +
+   geom_line(size = 1) +
+   scale_y_continuous("Trips per day\n", labels = unit_format("k", 1/1000, "")) +
+   scale_x_date("") +
+   scale_color_manual("", values = c("#F7B731", "#161629", "#E70B81")) +
+   title_with_subtitle("NYC Taxis Losing Market Share to Uber", "Trips per day in NYC, based on TLC summary data") +
+   theme_tws(base_size = 24) +
+   theme(legend.position = "bottom")
Error: Aesthetics must be either length 1 or the same as the data (1): x, y, colour

airport_trips_summary_monthly_avg table does not exist.

When I run cat analysis.R | R --no-save I get the following error message:

> airport_monthly = query("
+   SELECT *
+   FROM airport_trips_summary_monthly_avg
+   ORDER BY ntacode, airport_code, month
+ ")
Error in postgresqlExecStatement(conn, statement, ...) :
  RS-DBI driver: (could not Retrieve the result : ERROR:  relation "airport_trips_summary_monthly_avg" does not exist
LINE 3:   FROM airport_trips_summary_monthly_avg
               ^
)
Calls: query ... dbSendQuery -> dbSendQuery -> postgresqlExecStatement -> .Call
Execution halted

The table doesn't seem to exist in the database I set up following the installation instructions:

nyc-taxi-data=# \d
                         List of relations
 Schema |                 Name                  |   Type   | Owner
--------+---------------------------------------+----------+-------
 public | airport_pickups                       | table    | mark
 public | airport_pickups_by_type               | table    | mark
 public | airport_trips                         | table    | mark
 public | airport_trips_summary                 | table    | mark
 public | bridge_and_tunnel                     | table    | mark
 public | cab_types                             | table    | mark
 public | cab_types_id_seq                      | sequence | mark
 public | census_tract_pickup_growth_2009_2015  | table    | mark
 public | census_tract_pickups_by_hour          | table    | mark
 public | central_park_weather_observations     | view     | mark
 public | central_park_weather_observations_raw | table    | mark
 public | citigroup_dropoffs                    | table    | mark
 public | custom_geometries                     | table    | mark
 public | daily_dropoffs_by_borough             | table    | mark
 public | daily_pickups_by_borough_and_type     | table    | mark
 public | die_hard_3                            | table    | mark
 public | dropoff_by_lat_long_cab_type          | table    | mark
 public | geography_columns                     | view     | mark
 public | geometry_columns                      | view     | mark
 public | goldman_sachs_dropoffs                | table    | mark
 public | green_tripdata_staging                | table    | mark
 public | green_tripdata_staging_id_seq         | sequence | mark
 public | greenwich_hamptons_dropoffs           | table    | mark
 public | hourly_dropoffs                       | table    | mark
 public | hourly_pickups                        | table    | mark
 public | hourly_uber_2015_pickups              | table    | mark
 public | neighborhood_centroids                | table    | mark
 public | northside_dropoffs                    | table    | mark
 public | northside_pickups                     | table    | mark
 public | nyct2010                              | table    | mark
 public | nyct2010_centroids                    | table    | mark
 public | nyct2010_gid_seq                      | sequence | mark
 public | payment_types                         | table    | mark
 public | pickups_and_weather                   | table    | mark
 public | pickups_comparison                    | table    | mark
 public | raster_columns                        | view     | mark
 public | raster_overviews                      | view     | mark
 public | spatial_ref_sys                       | table    | mark
 public | trips                                 | table    | mark
 public | trips_by_lat_long_cab_type            | table    | mark
 public | trips_id_seq                          | sequence | mark
 public | uber_taxi_zone_lookups                | table    | mark
 public | uber_trips_2015                       | table    | mark
 public | uber_trips_2015_id_seq                | sequence | mark
 public | uber_trips_staging                    | table    | mark
 public | uber_trips_staging_id_seq             | sequence | mark
 public | yellow_tripdata_staging               | table    | mark
 public | yellow_tripdata_staging_id_seq        | sequence | mark

The table's creation isn't mentioned anywhere in analysis/prepare_analysis.sql nor elsewhere in the code repository.

Non-character argument exception with Goldman Sachs Weekday Taxi Drop Offs chart

I've imported the entire dataset into Postgres and run the following:

$ cat analysis.R | R --no-save

It comes back with the following exception:

> png(filename = "graphs/gs_dropoffs.png", width = 640, height = 420)
> ggplot(data = filter(gs, dow %in% 1:5),
+        aes(x = timestamp_for_x_axis)) +
+   geom_histogram(binwidth = 600) +
+   scale_x_datetime("\ndrop off time", labels = date_format("%l %p"), minor_breaks = "1 hour") +
+   scale_y_continuous("taxi drop offs\n", labels = comma) +
+   title_with_subtitle("Goldman Sachs Weekday Taxi Drop Offs at 200 West St", "Based on NYC TLC data from 1/2009–6/2015") +
+   theme_tws(base_size = 19)
Error in strsplit(unitspec, " ") : non-character argument
Calls: <Anonymous> ... fullseq.POSIXt -> seq -> floor_time -> parse_unit_spec -> strsplit
Execution halted

Street related drop off data

I admire the job you have done here. I was just wondering how you generated results like
THIS. The data available in the TLC database only gives drop-off and pickup data for each area, not the exact coordinates. I would be glad if you could explain that a bit more.

2015 Uber data decompressed to wrong folder

The import_uber_trip_data.sh script intends to unzip data/uber-raw-data-janjune-15.csv.zip into the data directory, but instead extracts it to the current working directory that import_uber_trip_data.sh is executed from.

$ cd ~/nyc-taxi-data
$ ./import_uber_trip_data.sh
Tue Feb  9 14:25:18 EET 2016: beginning load for data/uber-raw-data-apr14.csv
...
cat: data/uber-raw-data-janjune-15.csv: No such file or directory
$ ls data/uber-raw-data-janjune-15.csv
ls: cannot access data/uber-raw-data-janjune-15.csv: No such file or directory

$ ls uber-raw-data-janjune-15.csv
uber-raw-data-janjune-15.csv
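The fix is to give the extractor an explicit target directory (for `unzip`, the `-d data/` flag) instead of relying on the current working directory. A minimal stdlib sketch of the same idea, using a throwaway zip built in a temp directory (all file names here are invented for the demo):

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway zip standing in for uber-raw-data-janjune-15.csv.zip
    zip_path = os.path.join(tmp, "example.csv.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("example.csv", "pickup_datetime,lat,lon\n")

    # Extract into an explicit target directory (the equivalent of
    # `unzip example.csv.zip -d data/`), not the current working directory
    data_dir = os.path.join(tmp, "data")
    os.makedirs(data_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(data_dir)

    extracted = os.listdir(data_dir)
    print(extracted)  # ['example.csv']
```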

April 2015 Green Taxi trip data fails to load

While loading the data into PostgreSQL, I got the following error while the April 2015 data for Green Taxis was loading:

Sat Feb 27 22:56:15 EET 2016: beginning load for data/green_tripdata_2015-04.csv
ERROR:  extra data after last expected column
CONTEXT:  COPY green_tripdata_staging, line 2: "2,2015-04-01 00:00:00,2015-04-01 00:08:15,N,1,-73.958816528320312,40.716823577880859,-73.98297119140..."
Sat Feb 27 22:56:16 EET 2016: finished raw load for data/green_tripdata_2015-04.csv

It happened on the second line of the file and no data for that month for Green Taxis was loaded in.

Steps to re-create this:

This was all done on a fresh Ubuntu 14.04.3 LTS installation.

$ sudo apt-get install postgresql-9.3-postgis-2.1 git unzip
$ git clone https://github.com/toddwschneider/nyc-taxi-data.git
$ cd nyc-taxi-data
$ vi minimal_downloads.txt

Contents of minimal_downloads.txt:

https://storage.googleapis.com/tlc-trip-data/2015/green_tripdata_2014-04.csv
https://storage.googleapis.com/tlc-trip-data/2015/green_tripdata_2015-04.csv
https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2014-04.csv
https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-04.csv
$ cat minimal_downloads.txt | xargs -n 1 -P 6 wget -P data/ &
$ sudo su - postgres -c "psql -c 'CREATE USER mark; ALTER USER mark WITH SUPERUSER;'"
$ ./initialize_database.sh
$ ./import_trip_data.sh
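The "extra data after last expected column" error is what Postgres COPY reports when a row has more fields than the staging table has columns; the green taxi files changed their column layout over time. A quick stdlib sketch for spotting mismatched rows before loading (the file contents and the expected column count here are illustrative, not the real staging schema):

```python
import csv
import io

# Expected column count would normally come from the staging table's
# schema; 5 here is illustrative only
EXPECTED_COLS = 5

# Stand-in for a trip data file: the second row has an extra field
sample = io.StringIO(
    "2,2015-04-01 00:00:00,2015-04-01 00:08:15,N,1\n"
    "2,2015-04-01 00:00:00,2015-04-01 00:08:15,N,1,extra\n"
)

# Collect (line number, field count) for every row that doesn't match
bad_rows = [
    (lineno, len(row))
    for lineno, row in enumerate(csv.reader(sample), start=1)
    if len(row) != EXPECTED_COLS
]
print(bad_rows)  # [(2, 6)]
```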

Data size

You said the data is around 260 GB, but my computer only has 100 GB of storage. Could you give me a hint on how to work around this?

Thank you!

Date/time value is out of range in the Uber 2014 data

When I import the Uber data for 2014 I get an error message that the date/time field value is out of range for the value 4/13/2014 0:01:00.

$ ./import_uber_trip_data.sh
Tue Feb  9 14:32:16 EET 2016: beginning load for data/uber-raw-data-apr14.csv
ERROR:  date/time field value out of range: "4/13/2014 0:01:00"
HINT:  Perhaps you need a different "datestyle" setting.
CONTEXT:  COPY uber_trips_staging, line 15076, column pickup_datetime: "4/13/2014 0:01:00"
Tue Feb  9 14:32:16 EET 2016: finished raw load for data/uber-raw-data-apr14.csv

It looks as though this value exists in a number of records:

$ grep -n "4/13/2014 0:01:00" data/uber-raw-data-apr14.csv
15076:"4/13/2014 0:01:00",40.7075,-73.9483,"B02512"
98906:"4/13/2014 0:01:00",40.7342,-73.999,"B02598"
98907:"4/13/2014 0:01:00",40.6316,-73.8876,"B02598"
98908:"4/13/2014 0:01:00",40.7668,-73.9676,"B02598"
98909:"4/13/2014 0:01:00",40.7345,-74.0014,"B02598"
98910:"4/13/2014 0:01:00",40.7463,-73.983,"B02598"
98911:"4/13/2014 0:01:00",40.7059,-73.92,"B02598"
263998:"4/13/2014 0:01:00",40.7396,-74.0276,"B02617"
263999:"4/13/2014 0:01:00",40.7386,-74.0055,"B02617"
264000:"4/13/2014 0:01:00",40.737,-73.9882,"B02617"
264001:"4/13/2014 0:01:00",40.7139,-73.9598,"B02617"
422703:"4/13/2014 0:01:00",40.7409,-74.0075,"B02682"
422704:"4/13/2014 0:01:00",40.714,-73.9579,"B02682"
422705:"4/13/2014 0:01:00",40.7336,-74.0048,"B02682"
422706:"4/13/2014 0:01:00",40.7573,-73.9847,"B02682"
422707:"4/13/2014 0:01:00",40.7162,-73.9549,"B02682"
422708:"4/13/2014 0:01:00",40.7358,-73.998,"B02682"
422709:"4/13/2014 0:01:00",40.6923,-73.9878,"B02682"
422710:"4/13/2014 0:01:00",40.7325,-74.0035,"B02682"
422711:"4/13/2014 0:01:00",40.7335,-74.0043,"B02682"
422712:"4/13/2014 0:01:00",40.6776,-73.912,"B02682"
422713:"4/13/2014 0:01:00",40.6352,-73.9503,"B02682"

Just about every other month of the 2014 Uber data raises the same issue.
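The timestamps themselves are well formed; the HINT in the error points at Postgres's DateStyle setting, which controls whether ambiguous slash-separated dates are read as month/day or day/month ("4/13/2014" only parses if the first field is the month, since 13 is not a valid month). A stdlib sketch of the format in question:

```python
from datetime import datetime

# The Uber 2014 files use M/D/Y timestamps without zero-padding;
# strptime accepts unpadded fields for %m, %d, and %H
value = "4/13/2014 0:01:00"
parsed = datetime.strptime(value, "%m/%d/%Y %H:%M:%S")
print(parsed.isoformat(sep=" "))  # 2014-04-13 00:01:00
```

On the Postgres side, running `SET datestyle = 'ISO, MDY';` in the importing session before the COPY should make the server read these values the same way.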

gzipped data

Not so much of an issue, but would it be possible to gzip the data at the download source? No biggie for me since it'll just download overnight, but it might affect others.

(Great project, btw. I just submitted a FOIA to Chicago for similar data. If they can give me a copy, would you want it? I have parking ticket/towing data from last year if you're interested in that, too)

alternative download script

This is not very exciting, but since I wrote it I thought I'd share it and let you decide whether it'd be of any use. It's very simple: it uses curl instead of wget and compresses directly to disk using xz. It will also die on failures and allow transfers to resume (although that's probably not all that useful for many failures, since it'll have to restart if xz doesn't know the size of the original data).

pull-data.sh.gz

TLC Shapefile

I have a question about the shapefile. Apparently the coordinates in the shapefile do not match the actual latitude longitude.

End of line in linux

One may have trouble running this on Ubuntu: the first step can be problematic because the line endings in the URL file are old-style Mac (CR). One can convert the raw data URLs file to a more appropriate format, however, by issuing the command:

cat raw_data_urls.txt | tr '\r' '\n' | tr -s '\n' > raw_data_urls.translated.txt

(following https://marcelog.github.io/articles/mac_newline_to_unix_eol.html)
Then edit download_raw_data.sh to read:

cat setup_files/raw_data_urls.translated.txt | xargs -n 1 -P 6 wget -c -P data/
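Another way to see what's going on: text-mode reads in Python apply universal-newline translation, so CR, LF, and CRLF line endings all normalize to '\n'. A small sketch of the same normalization the tr pipeline performs:

```python
import io

# Bytes standing in for a file saved with old-style Mac (CR) line endings,
# as in the issue above
raw = b"url1\rurl2\rurl3\r"

# Wrapping the byte stream in a text reader applies universal-newline
# translation, so the CR-delimited lines split cleanly
urls = io.TextIOWrapper(io.BytesIO(raw)).read().splitlines()
print(urls)  # ['url1', 'url2', 'url3']
```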

How to determine Airport Area?

Good work! May I ask how you determined the latitude and longitude ranges of the airports? How did you determine the boundaries of the rectangles?

Invalid date exception when running import_statistics_data

After downloading and importing all the data and running:

./import_trip_data.sh
./import_uber_trip_data.sh
cat analysis/prepare_analysis.sql \
      tlc_statistics/create_statistics_tables.sql | \
      psql nyc-taxi-data

I ran the import_statistics_data.rb script:

$ ruby tlc_statistics/import_statistics_data.rb

And it resulted in the following:

tlc_statistics/import_statistics_data.rb:16:in `strptime': invalid date (ArgumentError)
        from tlc_statistics/import_statistics_data.rb:16:in `block (2 levels) in <main>'
        from tlc_statistics/import_statistics_data.rb:14:in `each'
        from tlc_statistics/import_statistics_data.rb:14:in `block in <main>'
        from /usr/lib/ruby/1.9.1/csv.rb:1354:in `open'
        from tlc_statistics/import_statistics_data.rb:13:in `<main>'
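The traceback points at a summary-report row whose date field doesn't match the format strptime expects. One defensive pattern (sketched here in Python, though the actual script is Ruby) is to skip and record unparseable rows instead of aborting the whole import:

```python
from datetime import datetime

# Illustrative rows: the second date is invalid and would raise, much
# like the Ruby ArgumentError in the traceback above
rows = ["2015-06-01", "2015-13-01", "2015-07-01"]

parsed, skipped = [], []
for value in rows:
    try:
        parsed.append(datetime.strptime(value, "%Y-%m-%d"))
    except ValueError:
        skipped.append(value)  # record and move on instead of crashing

print(len(parsed), skipped)  # 2 ['2015-13-01']
```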

OmniSci (MapD) transformation

Nice repo, thanks a lot. I am working on migrating the scripts to be compatible with the OmniSci (mapd-core) GPU-accelerated SQL analytics engine. I'll open a pull request when done.

Non-character argument error being raised when running analysis.R

When running the following on a fresh Ubuntu 14.04.3 LTS instance I'm getting a "non-character argument" error message. Commenting out a print statement seems to get past it.

$ cat analysis.R | R --no-save
> for (i in 1:nrow(ntas_to_calculate)) {
+   for (j in 1:nrow(airports)) {
+     nta = ntas_to_calculate[i, ]
+     ap = airports[j, ]
+     data = filter(airport, ntacode == nta$ntacode, airport_code == ap$code, trips_count >= min_trips)
+     fname = paste0("graphs/airport/", nta$ntacode, "_", ap$code, ".png")
+
+     if (nrow(data) < 12) {
+       png(filename = fname, width = 640, height = 120)
+       print(insufficient_data)
+       dev.off()
+       next()
+     }
+
+     display_name = nta_display_name(nta$ntacode)
+     if (is.na(display_name)) display_name = nta$ntaname
+     title_text = paste0(display_name, " to ", ap$name, " Taxi Travel Time")
+     title_rel = ifelse(nchar(display_name) > 20, 1, 1.2)
+
+     p = ggplot(data = data, aes(x = timestamp_for_x_axis)) +
+           geom_line(aes(y = pct50, alpha = "  Median   ")) +
+           geom_ribbon(aes(ymin = pct25, ymax = pct75, alpha = " 25–75th percentile   ")) +
+           geom_ribbon(aes(ymin = pct10, ymax = pct90, alpha = "10–90th percentile")) +
+           scale_x_datetime("", labels = date_format("%l %p"),
+                            breaks = "3 hours", minor_breaks = "1 hour") +
+           scale_y_continuous("trip duration in minutes\n") +
+           expand_limits(y = 0) +
+           coord_cartesian(xlim = xlim) +
+           scale_alpha_manual("", values = c(1, 0.2, 0.2)) +
+           title_with_subtitle(title_text,
+                               "Weekdays only, based on NYC TLC data from 1/2009–6/2015") +
+           theme_tws(base_size = 19) +
+           theme(legend.position = "bottom",
+                 plot.title = element_text(size = rel(title_rel))) +
+           guides(alpha = guide_legend(override.aes = list(alpha = c(1, 0.4, 0.2),
+                                                           size = c(1, 0, 0),
+                                                           fill = c(NA, "black", "black"))))
+
+     png(filename = fname, width = 640, height = 420)
+     print(p)
+     add_credits()
+     dev.off()
+   }
+ }
Error in strsplit(unitspec, " ") : non-character argument
Calls: print ... fullseq.POSIXt -> seq -> floor_time -> parse_unit_spec -> strsplit
Execution halted

Here's the diff showing the commented out print statement. I'm not sure if this is a good fix or not so I'll just leave this as a diff here and let you make the call.

diff --git a/analysis/analysis.R b/analysis/analysis.R
index bb64d7b..6646826 100644
--- a/analysis/analysis.R
+++ b/analysis/analysis.R
@@ -273,7 +273,7 @@ for (i in 1:nrow(ntas_to_calculate)) {
                                                           fill = c(NA, "black", "black"))))

     png(filename = fname, width = 640, height = 420)
-    print(p)
+    #print(p)
     add_credits()
     dev.off()
   }

If it makes a difference, in my test environment I've only imported the following files:

https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-jul14.csv
https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-janjune-15.csv.zip
https://storage.googleapis.com/tlc-trip-data/2013/green_tripdata_2013-09.csv
https://storage.googleapis.com/tlc-trip-data/2014/green_tripdata_2014-09.csv
https://storage.googleapis.com/tlc-trip-data/2015/green_tripdata_2015-09.csv
https://storage.googleapis.com/tlc-trip-data/2009/yellow_tripdata_2009-09.csv

Error in strsplit(unitspec, " ") : non-character argument

After setting up the database I'm getting a non-character argument error when running the analysis.R script.

Setup steps:

cat raw_uber_data_urls.txt \
      raw_data_urls.txt | \
      xargs -n 1 -P 6 wget -P data/
./initialize_database.sh
./import_trip_data.sh
./import_uber_trip_data.sh
cat analysis/prepare_analysis.sql \
      tlc_statistics/create_statistics_tables.sql | \
      psql nyc-taxi-data
cd tlc_statistics
ruby import_statistics_data.rb # Patched as in #14

When running this:

$ cat analysis.R | R --no-save

The following error occurs:

> png(filename = "graphs/citi_dropoffs.png", width = 640, height = 420)
> ggplot(data = filter(citi, dow %in% 1:5),
+        aes(x = timestamp_for_x_axis)) +
+   geom_histogram(binwidth = 600) +
+   scale_x_datetime("\ndrop off time", labels = date_format("%l %p"), minor_breaks = "1 hour") +
+   scale_y_continuous("taxi drop offs\n", labels = comma) +
+   title_with_subtitle("Citigroup Weekday Taxi Drop Offs at 388 Greenwich St", "Based on NYC TLC data from 1/2009–6/2015") +
+   theme_tws(base_size = 19)
Error in strsplit(unitspec, " ") : non-character argument
Calls: <Anonymous> ... fullseq.POSIXt -> seq -> floor_time -> parse_unit_spec -> strsplit
Execution halted

taxi_zone_centroids does not exist

Hey Todd,
I followed the instructions given, and the lines below fail with the following error. I am not sure if this is a local issue, but I followed the import instructions and am not aware of any errors.

/* see http://www.charlespetzold.com/etc/AvenuesOfManhattan/ */
CREATE TABLE rotated_taxi_zones AS
SELECT
t.gid,
ST_Rotate(t.geom, 29 * 2 * pi() / 360, m.geom) AS rotated_geom,
ST_X(ST_Rotate(t.geom, 29 * 2 * pi() / 360, m.geom)) AS rotated_x,
ST_Y(ST_Rotate(t.geom, 29 * 2 * pi() / 360, m.geom)) AS rotated_y
FROM taxi_zone_centroids t, manhattan_centroid m;
ALTER TABLE rotated_taxi_zones ADD PRIMARY KEY (gid);

ERROR:  relation "taxi_zone_centroids" does not exist
LINE 7: FROM taxi_zone_centroids t, manhattan_centroid m;

Any idea what the cause of the issue might be? Thanks

Many years of S3 files fail to download

I'm attempting to update a copy of the taxi dataset and have updated to the newest commits. When attempting to download just the yellow/green data (I edited the URLs file), I'm only able to download the 2009 and 2010 data. Both wget and regular browser requests fail for dozens of other URLs that I try from the file. I assume the S3 permissions were changed accidentally, but wondered if there might be something else going on.

Example: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2022-01.csv
Error: 404 not found (wget)

From S3:

<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Key>trip data/yellow_tripdata_2022-01.csv</Key>
<RequestId>BYPM9VSTW4FSM4EV</RequestId>
<HostId>/nQv/eDe2QprYThNg9mlGjozXlQ/7dX1Me4UNxtEkDG881eQV+IqbcbGIizpwazr9IRw7OOFN74=</HostId>
</Error>

Thanks for your work on this project and maintaining the dataset!

Access method "brin" does not exist

After successfully running almost all of the import_trip_data.sh script, I got an error on the final line:

ERROR: access method "brin" does not exist

This appears to come from the final line:
psql nyc-taxi-data -c "CREATE INDEX ON trips USING BRIN (pickup_datetime) WITH (pages_per_range = 32);"

I tried BTREE instead of BRIN without luck. Any suggestions? Is this line required, or is it optional for better performance?

[Info] - Reason for executing populate_yellow_trip for every file

Just wondering if there is a technical reason for this: why is the script populate_yellow_trips executed after every data file once the COPY step finishes? Wouldn't it be more efficient to import all of the source files into the db first, and then execute the populate script? Thanks for the clarification.
