seamless-census's Introduction

seamless-census

Import US Census data into a seamless storage environment.

Usage

Running the download and load steps for the entire US requires ~45 GB of disk space.

Download data

Use the following command to download data from the Census Bureau. First create a temporary directory, in a location with plenty of disk space, to hold the files before they are combined and loaded to S3. The arguments are the temporary directory followed by the two-letter postal abbreviations of the states for which you want to retrieve data (the special code ALL retrieves data for every state, territory and district). The command below, for instance, would download data for the greater Washington, DC megalopolis.

python downloadData.py temporary_dir DC MD VA WV DE

Load data

Use the same temporary directory you used above. If you omit the S3 bucket name, the loader places the tiles in the tiles directory inside the temporary directory.

JAVA_OPTS=-Xmx[several]G mvn exec:java -Dexec.mainClass="com.conveyal.data.census.CensusLoader" -Dexec.args="temporary_dir s3_bucket_name"

Extract data

Now for the fun part. The following command extracts the data stored in the specified S3 bucket, within the specified bounding box (given as north, east, south, west), to the Geobuf file out.pbf.

JAVA_OPTS=-Xmx[several]G mvn exec:java -Dexec.mainClass="com.conveyal.data.census.CensusExtractor" -Dexec.args="s3://bucket_name n e s w out.pbf"
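The bounding box order (n e s w) differs from the west, south, east, north order many GIS tools use, so it is easy to get wrong. The sketch below builds the -Dexec.args value from named parameters. The helper name extractor_args and its parameters are hypothetical and not part of this repository; it only formats a string and does not run Maven.

```python
def extractor_args(bucket, north, east, south, west, out_file="out.pbf"):
    """Format the CensusExtractor arguments in the documented order: n e s w.

    Hypothetical helper for illustration; it is not part of seamless-census.
    """
    if south >= north:
        raise ValueError("south must be less than north")
    # The README shows the bucket with an s3:// prefix for the extract step.
    return f"s3://{bucket} {north} {east} {south} {west} {out_file}"


if __name__ == "__main__":
    # Rough box around Washington, DC; the coordinates are illustrative only.
    print(extractor_args("lodes-data", north=39.1, east=-76.8,
                         south=38.7, west=-77.3))
```

The resulting string can be pasted inside the quotes of -Dexec.args="..." in the command above.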

Data storage

Data is stored in a directory structure, which is kept in Amazon S3. Census data is split up into zoom-level-11 tiles and stored in GeoBuf files. Each file sits in a directory for its source and a subdirectory for its x coordinate, and is named for its y coordinate. For example, us-census-2012/342/815.pbf might contain US LODES data and decennial census data for southeastern Goleta, CA.
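To see how a location maps to a tile path, here is a minimal sketch. It assumes the tiles follow the standard Web Mercator ("slippy map") numbering, which this README does not state explicitly but which is consistent with the Goleta example above. The function names tile_for and tile_key are hypothetical.

```python
import math

ZOOM = 11  # zoom level stated in the README


def tile_for(lon, lat, zoom=ZOOM):
    """Return (x, y) for a point, ASSUMING standard Web Mercator tiling."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    # asinh(tan(lat)) is the Mercator projection of the latitude.
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


def tile_key(source, lon, lat):
    """Build the source/x/y.pbf path described in the README."""
    x, y = tile_for(lon, lat)
    return f"{source}/{x}/{y}.pbf"


if __name__ == "__main__":
    # A point in southeastern Goleta, CA (approximate coordinates).
    print(tile_key("us-census-2012", -119.83, 34.43))
```

Under that assumption, a point near southeastern Goleta resolves to the same path as the example, us-census-2012/342/815.pbf.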

Enumeration units that fall into two tiles should be included in both tiles. It is the responsibility of the data consumer to deduplicate them; this can be done based on IDs. An enumeration unit that is duplicated across tiles must have the same integer ID in both tiles.
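A consumer could deduplicate along these lines. This is a sketch only: it assumes each tile has already been decoded into a list of dicts carrying an integer "id" key, which is an illustrative representation and not the format produced by any particular Geobuf decoder.

```python
def merge_tiles(tiles):
    """Merge features from several tiles, keeping one copy per ID.

    `tiles` is an iterable of feature lists; each feature is assumed to be
    a dict with an integer "id" key (a hypothetical in-memory form).
    """
    seen = set()
    merged = []
    for features in tiles:
        for feature in features:
            fid = feature["id"]
            if fid in seen:
                continue  # same enumeration unit already seen in another tile
            seen.add(fid)
            merged.append(feature)
    return merged


if __name__ == "__main__":
    tile_a = [{"id": 1, "jobs": 10}, {"id": 2, "jobs": 5}]
    tile_b = [{"id": 2, "jobs": 5}, {"id": 3, "jobs": 7}]
    print([f["id"] for f in merge_tiles([tile_a, tile_b])])
```

Keeping the first occurrence is safe here because a duplicated unit carries the same ID in every tile that contains it.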

We have already loaded LODES data from 2013, 2014, 2015, and 2017 into the S3 buckets lodes-data, lodes-data-2014, lodes-data-2015, etc. These buckets and their contents are publicly readable and requester-pays (i.e. accessing them will incur fees on your AWS account). The 2013 data lack Massachusetts and use 2011 data for Kansas, due to data availability. The 2014 and 2015 data do not have these problems. The 2017 data exclude federal employees and use 2016 data for Alaska and South Dakota. See the LODES Technical Documentation for details.

Use in Conveyal Analysis

Any dataset that can be placed in this format can be used in Conveyal Analysis.

seamless-census's People

Contributors

abyrd, ansoncfit, mattwigway

seamless-census's Issues

Progress indicator during downloads

I've had the downloads from the FTP server stall a few times. It's hard to tell whether progress is being made and what the expected waiting time is.

We should show the total volume of data to be downloaded (bytes and number of files) and the progress so far. Seeing the URLs being fetched would also be helpful.

Rather than writing the progress indicator code, this could be achieved by just calling the wget command, or possibly by using some library functionality in Python.
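As a rough illustration of the "library functionality in Python" option, the sketch below copies a stream in chunks and reports progress after each one. It is not code from downloadData.py; the function name copy_with_progress and its parameters are hypothetical, and it works on any file-like objects.

```python
def copy_with_progress(src, dst, total_bytes, report=print, chunk_size=64 * 1024):
    """Copy src to dst in chunks, calling report() with a progress line.

    Hypothetical helper for illustration. total_bytes may be 0 when the
    size is unknown, in which case no percentage is shown.
    """
    done = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        done += len(chunk)
        if total_bytes:
            report(f"{done}/{total_bytes} bytes ({100 * done // total_bytes}%)")
        else:
            report(f"{done} bytes")
    return done
```

With urllib.request.urlopen, the response object can serve as src, and the Content-Length header, when the server provides one, can supply total_bytes. Printing the URL before each call would cover the request to see which files are being fetched.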

mvn exec:java shutdown is not clean

Read 37405 features in 12701msec
[WARNING] thread Thread[java-sdk-http-connection-reaper,5,com.conveyal.data.census.CensusExtractor] was interrupted but is still alive after waiting at least 15000msecs
[WARNING] thread Thread[java-sdk-http-connection-reaper,5,com.conveyal.data.census.CensusExtractor] will linger despite being asked to die via interruption
[WARNING] NOTE: 1 thread(s) did not finish despite being asked to  via interruption. This is not a problem with exec:java, it is a problem with the running code. Although not serious, it should be remedied.
[WARNING] Couldn't destroy threadgroup org.codehaus.mojo.exec.ExecJavaMojo$IsolatedThreadGroup[name=com.conveyal.data.census.CensusExtractor,maxpri=10]
java.lang.IllegalThreadStateException
    at java.lang.ThreadGroup.destroy(ThreadGroup.java:778)
    at org.codehaus.mojo.exec.ExecJavaMojo.execute(ExecJavaMojo.java:328)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:355)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:584)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:216)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:160)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
