eqasim-org / ile-de-france

An open synthetic population of Île-de-France for agent-based transport simulation

License: GNU General Public License v2.0

Python 100.00%
population-synthesis open-data france matsim transport simulation agent-based

ile-de-france's Introduction

An open synthetic population of Île-de-France

Via Île-de-France

This repository contains the code to create an open data synthetic population of the Île-de-France region around Paris, as well as of other regions in France.

Main reference

The main research reference for the synthetic population of Île-de-France is:

Hörl, S. and M. Balac (2021) Synthetic population and travel demand for Paris and Île-de-France based on open and publicly available data, Transportation Research Part C, 130, 103291.

What is this?

This repository contains the code to create an open data synthetic population of the Île-de-France region around Paris, as well as of other regions in France. It takes as input several publicly available data sources to create a data set that closely represents the socio-demographic attributes of persons and households in the region, as well as their daily mobility patterns. Those mobility patterns consist of activities which are performed at certain locations (like work, education, shopping, ...) and which are connected by trips with a certain mode of transport. It is known when and where these activities happen.

Such a synthetic population is useful for many research and planning applications. Most notably, such a synthetic population serves as input to agent-based transport simulations, which simulate the daily mobility behaviour of people on a spatially and temporally detailed scale. Moreover, such data has been used to study the spreading of diseases, or the placement of services and facilities.

The synthetic population for Île-de-France can be generated from scratch by anybody with basic knowledge of Python. Detailed instructions on how to generate a synthetic population with this repository are available below.

Although the synthetic population is independent of the downstream application or simulation tool, we provide the means to create an input population for the agent- and activity-based transport simulation framework MATSim.

This pipeline has been adapted to many other regions and cities around the world and is under constant development. It is released under the GPL license, so feel free to make adaptations, contributions or forks as long as you keep your code open as well!

Documentation

This pipeline fulfils two purposes: first, to create synthetic populations of French regions in CSV and GPKG format, including households, persons and their daily localized activities; second, to make use of infrastructure data to generate the inputs for agent-based transport simulations. These steps are described in the following documents:

Furthermore, we provide documentation on how to make use of the code to create populations and run simulations of other places in France. While these are examples, the code can be adapted to any other scenario as well:

Publications

Versioning

The current version of the pipeline is v1.2.0. You can obtain it by cloning the v1.2.0 tag of this repository. Alternatively, you can also clone the develop branch to make use of the latest developments. The version number will be kept in the develop branch until a new version is officially released.

Note that whenever you create a population with this pipeline, the meta.json in the output will let you know the exact git commit with which the population was created.


ile-de-france's Issues

Distinguish between education destinations

Currently, we do not distinguish between education facility types for different population strata. The following would need to be done:

  • When processing the BPE (data.bpe.cleaned), write out the TYPEQU holding the facility type. See https://www.insee.fr/fr/statistiques/fichier/3568638/Contenu_bpe20_ensemble_xy.pdf
  • When creating education facilities (synthesis.locations.education), use this attribute to distinguish between N education facility types. These N need to be defined in some way. This could either be via age class or cross-over with other attributes (highest diploma / current diploma)
  • These N classes (if more complex than age) should then be implemented in an additional stage or in the synthesis.population.enriched stage (or directly in the location assignment, see below)
  • We need to preprocess the OD matrix for education. Currently, it only gives a probability for each destination municipality, given an origin municipality. What we need is to have N matrices, with zeros for destination municipalities where the selected category n is not available. In total the N matrices should then give the original matrix, so some decomposition needs to be applied here. This needs a new stage, potentially in synthesis.population.spatial.primary.education_matrices.
  • Assignment of facilities happens in two steps. First, we take the OD matrix and assign an education destination municipality to each person in synthesis.population.spatial.primary.candidates. Here, we need to perform the assignment for each of the N categories individually (i.e. select all persons with category n, then select the matrix for category n and perform the assignment as described in the paper / in the code already)
  • After, assignment of a facility from the selected municipalities happens. This does not need much modification in synthesis.population.spatial.primary.locations, only the assignment needs to happen per category n.
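The matrix decomposition described above could be sketched as follows. This is a minimal illustration with numpy: splitting each destination equally among the categories available there is an assumption for demonstration, not the decomposition the pipeline would actually use; any split with the same summation property works.

```python
import numpy as np

def decompose_od_matrix(od, availability):
    """Split an OD matrix into N per-category matrices that sum back
    to the original matrix.

    od:           (origins x destinations) matrix
    availability: (N x destinations) boolean mask, True if category n
                  exists in the destination municipality (every
                  destination must offer at least one category)
    """
    availability = availability.astype(float)

    # Share of each category n in each destination municipality
    # (here: an equal split among the available categories)
    shares = availability / availability.sum(axis=0, keepdims=True)

    # Broadcast: (N, 1, D) * (1, O, D) -> (N, O, D)
    return shares[:, np.newaxis, :] * od[np.newaxis, :, :]
```

By construction, `matrices.sum(axis=0)` recovers the original matrix, and matrix n contains zeros for destination municipalities where category n is not available.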

Non-deterministic output files for MATSim

Currently, only a few MATSim output files are properly unit-tested, because the order in which persons in the population file or links in the network file are written out is random (probably HashMaps are used internally). Either add a feature to MATSim or add some post-processing sorting mechanism.
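A post-processing sort along these lines could be sketched in Python with the standard library. The element and attribute names assume the MATSim plans format (`<person id="...">` children of `<population>`); this is an illustration, not existing pipeline code.

```python
import xml.etree.ElementTree as ET

def sort_persons(plans_xml):
    """Return the plans XML with <person> elements sorted by id, so
    that repeated pipeline runs produce byte-identical files."""
    root = ET.fromstring(plans_xml)
    persons = root.findall("person")

    # Detach all persons, then re-append them in a deterministic order
    for person in persons:
        root.remove(person)
    for person in sorted(persons, key=lambda p: p.get("id")):
        root.append(person)

    return ET.tostring(root, encoding="unicode")
```

The same pattern would apply to links in the network file.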

Automatic reporting

It would be nice to have an automatic and visual report on the pipeline results. This could just be an HTML document with plots on the outputs and the comparison between the generated population and reference data. This will also help to analyze potential changes in the output data when new components are introduced.

Compatibility BPE 2019

BPE 2018 is not available anymore.

  • Update code (BPE is now delivered in CSV, previously DBF)
  • Verify for different scenarios (IDF, Nantes, Lyon, Corsica)
  • Verify consistency of the municipality codes (and retrofit if necessary)
  • Update documentation

Add mode choice stage

Given that we know travel times in the network:

  • Add a mode choice stage in which the mode for each trip is already chosen based on the given information (right now we assume that this will happen in the downstream simulation)
  • Potentially, make primary destination choice dependent on this information (gravity model)
  • Potentially, make secondary destination choice dependent on this information (Potential Path Area approaches)

matsim.simulation.prepare : ScenarioValidator rejects everything

I adapted the ile_de_france pipeline for another city in France.
(you can have a look here : https://github.com/Nitnelav/eqasim-nantes)

I didn't change much apart from being able to set a list of requested commune_id instead of a list of departments...

Everything is working all right until the ScenarioValidator step.
I get a "Person X has car leg without a route" for every single person in my population file... (see attached log file)

I'm a bit confused about where to look in order to fix this ...

Any idea?

out.log

Clarify the generalizability of the synthesis

Hello,

This repository contains code to generate a synthetic population for any department in France effortlessly. This software is a greater scientific contribution than just a population synthesis for Île-de-France.
Why not make that clear by, for instance, changing the repository's and environment's names to "synthetic_population"?
This clarification seems a necessary step for the sustainability of this repository.

Alongside the coherence of the naming of this tool, it would be easier for other people working with it. For example, if they work in different geographical areas, they can keep coherence in their work without the need to fork the repository and change the environment name.

Thank you for your work,

Aina

Integration of DVRP vehicles into Discrete Mode Choice

Hi all,

I am planning to extend the current model with a couple of services such as taxis and (shared) DRT, which rely heavily on operational KPIs such as trip rejection, utilization rate, waiting time, etc.

When implementing respective estimators for DMC I ran into the problem of how to properly estimate variables like the ones listed above.
My idea is to start with initial values and, for the upcoming iterations, read in the drt/taxi output from iteration n-1, but I am struggling to do so. Maybe there is an easier solution?

Best regards,

Alex

Could we make a list of tips

For instance, how to use ftp for those who haven't before. It would save people time.

Or, for instance, some more detailed instructions on how to use this "osmosis_binary" option, since I've tried several versions and it hasn't worked.

Advice on how to make sure a system has the required tools installed and callable (it's not always straightforward, especially on Linux servers... I figure we can spare others from this time sink by writing down what's worked for us).

Writing commit version when code is not cloned

Currently, we write the current git commit version into meta.json when generating a synthetic population. However, some users may opt to download the repository as a whole (as a zip) and then execute the pipeline. In this case we will get an exception when obtaining the current commit identifier because the code directory will not be a git repository.

`synthesis.population.trips` removes all "education" purpose trips

Hi, I am trying to use the EDGT census data for Département 44 (Loire-Atlantique).

I got everything to work (except a few things, notably the income, which is unavailable in the EDGT) until the population.spatial.primary.candidates step.

What happens is that df_persons["has_education_trip"] (line 68) is empty... and that raises an error on a pd.concat call (line 96).

Going back a bit, the issue comes from synthesis.population.trips.

In my test, at the beginning of the process I have the following counts:

```python
for cat in df_trips["preceding_purpose"].cat.categories:
    print("%s : %d" % (cat, len(df_trips[df_trips["preceding_purpose"] == cat])))
```

```
education : 14
home : 35
leisure : 6
other : 7
shop : 4
work : 10
```

At the end I get:

```
education : 0
home : 75
leisure : 52
other : 8
shop : 4
work : 35
```

I know you don't have the data, so you won't be able to just fix it, but could you explain what this process does?
How do you think I should proceed: enforce that at least one education-purpose trip makes it through the process, or handle the case of df_persons["has_education_trip"] being empty?

Strict car access in activity chain matching

As noted by @diallitoz, the pipeline may match activity chains with car trips to people that have no car availability or that don't have a license. Technically, the matching process (see reference paper) matches (age, gender, SC) for 100% of the assigned chains, but only a certain percentage for the "any cars" attribute, due to the minimum number of source observations we enforce in the matching process. So far, the assumption was that the faulty modes will be fixed in the simulation afterwards, where those agents will not have car as an alternative to choose from.

We can think of a process to enforce the matching of car availability. The simplest would be to construct a "car allowed" attribute for the persons that combines the car availability attribute and the driving license attribute. The same can be done for the HTS observations. If "car allowed" is false for a target observation, we then only allow source samples with "car allowed" also false. The inverse is not true (people that theoretically can use a car may choose not to).

Related to #107.
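The combined attribute and the asymmetric filtering described above could be constructed like this. This is a pandas sketch with illustrative column names, not the pipeline's actual schema.

```python
import pandas as pd

def add_car_allowed(df):
    """Combine car availability and driving license into one flag.
    Column names ("has_license", "number_of_cars") are illustrative."""
    df = df.copy()
    df["car_allowed"] = df["has_license"] & (df["number_of_cars"] > 0)
    return df

def restrict_sources(df_source, target_car_allowed):
    """If the target person may not use a car, only allow source
    observations that may not use a car either. The inverse is not
    enforced: people who could use a car may still choose not to."""
    if not target_car_allowed:
        return df_source[~df_source["car_allowed"]]
    return df_source
```

Applied before sampling, this guarantees that no activity chain with car trips is matched to a person without car access, while leaving the matching unrestricted for everyone else.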

Improve modeling of car availability

For some scenarios (Nantes, for instance), we see that exact distance-based mode shares are hard to obtain if households in rural areas get too little car availability. The process should be improved to better represent car availability spatially to be able to work better with urban-rural continuum cases.

Feedback from Aurore

  • On lines 143 and 253 it should be "départements" and not "departements" (it does not make a huge difference, but it is a pity to forget it as Sebastian and you already put so much effort to write everywhere "Île-de-France" and "enquête" with the circumflex accents - which almost all French people forget! - 😉 )

  • In the label of Figure 2, page 5, it should be "arrondissement" and not "arrodinssement" (the "n" is after the "o" and not after the "i")

  • On line 258 there should be a white space between "19" and "municipalities"

  • On line 855 I think that one should write "IDFM" and not "IDFm", as the last word, "mobilités", is at least as important as all the other ones

  • On line 897 it should be "San Francisco" and not "San Francescio"

  • In the paragraph 3.1.2, is it normal that the RP census is never designated by its name?

  • On line 737 and the following ones: the age bins used in the explanation (0 to 15 years old, 15 to 29, 30 to 44) do not correspond to the ones represented in the Figure 11 (0 to 15 years old, 15 to 20, 21 to 44) and it makes it difficult to see the peak you talked about (more families in Alfortville, more students in the 16th arrondissement)

Fixed OSM data

Geofabrik consistently keeps snapshots for the 1st of January in every year. We should use those files to have a reproducible output.

Currently, we are using, for instance:
https://download.geofabrik.de/europe/france/ile-de-france-latest.osm.pbf
It is updated every day.

Instead, we could use the snapshot from 01/01/2022:
https://download.geofabrik.de/europe/france/ile-de-france-220101.osm.pbf

They keep a snapshot for every year, which can be seen here:
https://download.geofabrik.de/europe/france/

Shutil.which returning None

Hello,
When trying to use the pipeline, I noticed an issue that is present in these 4 places:

if shutil.which(context.config("git_binary")) == "":

if shutil.which(context.config("java_binary")) == "":

if shutil.which(context.config("osmosis_binary")) == "":

if shutil.which(context.config("maven_binary")) == "":

In some python versions, python 3.7 at least, shutil.which returns None if no executable is found and the current code only checks against an empty string.

I would suggest to change that line to

 if shutil.which(context.config("git_binary")) in ["", None]: 

Regards,
Tarek Chouaki.
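The behaviour described above can be demonstrated with the standard library alone:

```python
import shutil

def binary_available(binary):
    """shutil.which returns None (not "") when the executable is not
    found on the PATH, so guard against both falsy values."""
    return shutil.which(binary) not in ["", None]
```

Equivalently, `if not shutil.which(...)` covers both cases at once.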

Prefix for output jar

In recent versions, a jar file is written as output for MATSim as well, so one can start the simulation directly. It is always prefixed with "ile_de_france", but it should make use of the prefix configuration value, just like all other files (population, network, etc.).

Remove EPSG warnings

Warnings about CRS can be removed by simply replacing all dict(init = "EPSG:2154") occurrences with "EPSG:2154".

Generating seasonal / weather / weekend scenarios

Currently, we're working on making the selection of activity chains that are used to create a scenario more flexible. This is based on two approaches:

  • Making it possible to make use of weekend observations (or ENTD and EGT)
  • Reducing the set of active activity chains
    • Selecting individual week days (1-7),
    • seasons (based on reference dates in the surveys),
    • and weather (imputing precipitation to the reference dates)

The challenge is to make sure that this doesn't change the current output of the pipeline. So some components need to be added that automatically perform a comparison between the outputs regarding activity chains.

Clean GTFS individually for the use cases

The GTFS code now contains a couple of fixes that are specific to Nantes, Lyon or Toulouse, and it gets longer and longer with every case. The code should be split up by use case, where all the cleaning and fixing takes place, and the general merging / cutting code should be expected to only work with already cleaned data sets.

Automatic selection of input files

We're creating more and more populations with the same method. Basically, preparing the input files is always the same, and only a few are dependent on the area. And these files are easy to identify, because usually we know the regions involved (for the census and for OSM, for instance). So, theoretically, we can automate the whole process of creating a population just by providing a shape file of the area of interest.

Add automatic ping script that checks data availability

Currently, the data sets from IGN that are referenced in the documentation are not available anymore, because they have been replaced with newer data sets. We should add a script that runs every day:

  • Check whether the files are still online (directly URLs of the files)
  • Check that the text we refer to in the documentation can actually still be found on the websites (load HTML and verify)
  • Send an email to maintainers if something is missing

SIRENE data: some establishments have no corresponding headquarter

SIRENE data can wrongly include some establishments (StockEtablissement_utf8.zip) without a corresponding headquarter (StockUniteLegale_utf8.zip).

In this case, the inner merge below will delete some establishments

df_sirene, context.stage("data.sirene.raw_siren"),

and make false the following assertion

assert initial_count == final_count
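The effect can be reproduced with a small pandas example (column names are illustrative): an inner merge silently drops establishments whose SIREN has no legal-unit record, whereas a left merge with `indicator=True` makes the missing headquarters detectable.

```python
import pandas as pd

df_establishments = pd.DataFrame({"siren": [1, 2, 3]})
df_legal_units = pd.DataFrame({"siren": [1, 2]})  # siren 3 has no headquarter

# Inner merge: establishment 3 disappears without any warning
inner = pd.merge(df_establishments, df_legal_units, on="siren", how="inner")
assert len(inner) < len(df_establishments)

# Left merge with indicator: missing headquarters become visible
left = pd.merge(df_establishments, df_legal_units, on="siren",
                how="left", indicator=True)
missing = left[left["_merge"] == "left_only"]
assert list(missing["siren"]) == [3]
```

Depending on the desired behaviour, such rows could then be dropped explicitly (with a logged count) instead of failing the final count assertion.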

BPE Documentation is outdated

Like every year, the old version has been deleted, so now we have BPE 2020. Apparently there is no problem with this version; everything runs fine. We just need to update it in the documentation.

Perform analysis at the end of the pipeline

Currently, as part of how the pipeline was developed, we compare intermediate results (e.g. the population after home and work location assignment) with reference data (e.g. the OD flows). However, it would be better and more versatile if we would only base the comparison on the final output. This would make it easier to replace and adapt algorithms, and set up proper unit tests.

PT agents don't use PT links

Hi,
I'm testing around this simulation approach, since I'd like to apply it also in other regions. After some tries it all ran smoothly, but it seems that the simulation output does not take into account the PT routes.

Looking at the output files, PT is a used mode with a realistic modal share, and when visualizing the plans.xml.gz in Simunto Via, PT is actually included in the trips, so all good till here.
The problem, I think, is that the passengers are not associated with the actual transit routes, and thus the activity is known (PT), but the agent teleports. In fact, in the plans.xml.gz the car vehicles are routed, but the PT vehicles are not. This is similar to bike and walk, but as far as I understand those are actually teleported modes.
Do you have a tip on how to solve this?

I'm attaching an image for reference.


Thanks,
Federico

Cannot create conda env

When trying to run conda env create -f environment.yml I get the following error:

  - geopandas 0.6.1*
  - numba 0.49.0* -> llvmlite >=0.32.0
  - shapely 1.7.0*

[Mac OSX 10.15.5; Python 3.6.2 :: Anaconda 4.4.0 (x86_64)]

Increase extensibility of parameters

Currently, we have

IDFModeParameters parameters = IDFModeParameters.buildDefault();

However, to make it easier to extend this and reuse standard parameters, we should have something like

IDFWithDrtModeParameters parameters = new IDFWithDrtModeParameters();
IDFModeParameters.applyDefault(parameters);
IDFWithDrtModeParameters.applyDefault(parameters);

i.e., instead of creating a new option, just apply the parameter values to the object.

Instructions need an update

step 10 needs to be updated.

a) it seems only Mars 2021 is available, and no longer Décembre 2020
b) the archive is difficult to download using a normal browser (Chrome asks for an app... not sure which one would understand that .7z is a file to download, not an application to open nor a website to go to...). This is the same problem that the other step with geoservices.ign.fr has, I think.

Absolute paths with Osmosis in Windows

After talking to @diallitoz a couple of issues with the pipeline on Windows came up:

  • Osmosis path most likely needs to be set explicitly in config, and it needs to be the bat file
  • Avoid relative paths for Osmosis -> Need to have absolute data_path, but better absolutize all paths in eqasim!

Output Data

I note that the paper says

We do not provide the output data of the pipeline

I wanted to note here that having a cache of the input and output data available would be very useful to me. (I work on routing algorithms and use simulated traffic data as test inputs.) The Open Berlin Scenario and the Chile Scenario both have caches of this data. It would be great if your data were also available for easy use.

Lyon: definition of socioprofessional class variable is not comparable between HTS and Census

columns = ["sex", "any_cars", "age_class", "socioprofessional_class"]

In the HTS, variable P9 is used to define the socioprofessional class. P9 in fact describes the occupational status (working or not). Instead of P9, we can use P11 with some category aggregation to match the census definition.

In Census, variable CS1 describes the socioprofessional category as it should be.

CS1 and P9 are not comparable. Matching HTS and Census assuming CS1 = P9 can bias the outcomes of matching.

Add a way to explicitly define the version

  • Define somewhere the version in the pipeline so it is automatically readable
  • Write out the version in the meta-data and the MATSim output to make sure one can recover from which version everything was generated

Could this tool use for other region?

Hi,
I want to know if I could change the data, or a small part of the code, to create a synthetic population for a region outside of France.
And if I could, is there anything that should be noted?

Thanks!

Cooper

Improve unit tests

... by altering the determinism test such that we hardcode MD5 hashes of the output files and then run the test multiple times. Right now, running the pipeline multiple times in the unit test makes Travis exit because the test takes too long. Therefore, currently, we're using the travis_wait hack, which is not ideal.
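The hashing part of such a test could be sketched with the standard library (the file paths and reference hashes in an actual test would come from a known-good pipeline run):

```python
import hashlib

def file_md5(path):
    """MD5 digest of a file, read in chunks to keep memory bounded
    even for large pipeline outputs."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A determinism test would then assert `file_md5(output) == EXPECTED_HASH` after a single pipeline run, rather than running the whole pipeline twice.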

GTFS Functionality

In all scenarios except IDF, we have multiple GTFS feeds. Currently, one can provide a semicolon-separated list of feeds to be merged.

  • Allow the GTFS path to be a list to avoid the awkward format with semicolons
  • Check if this works well with validation / devalidation, otherwise this might need to be adjusted in synpp (in terms of hashing the config value if it is a complex object)

Furthermore, this often brings problems because sometimes GTFS schedules cannot be found that cover the same period of time. Since with pt2matsim we can either choose the day with the most services (default) or a fixed day, we have situations where some feeds are simply ignored because they don't cover the selected day. We already have code to unify the schedules, i.e., find all "wednesdays" (or whatever day is requested), find in each feed the one with the most active lines, and then shift all of those wednesdays to a specific date that is later used by pt2matsim.

  • Integrate the schedule shifting functionality into the pipeline

GTFS reference data

Currently, we use the day with the most services. It would be better to explicitly define which day to use to create the transit schedule for MATSim. There is a catch: GTFS files usually cover only a limited time frame, so we should make sure to throw an error if the date is not included in this range! (Otherwise pt2matsim will just create an empty transit schedule.)
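The range check suggested above could look like this (a standard-library sketch; GTFS encodes service dates as YYYYMMDD strings, e.g. in calendar.txt):

```python
from datetime import datetime

def check_date_in_feed(requested, start_date, end_date):
    """Raise if the requested day lies outside the feed's service
    period, instead of letting pt2matsim silently produce an empty
    transit schedule. All dates are YYYYMMDD strings."""
    fmt = "%Y%m%d"
    start = datetime.strptime(start_date, fmt)
    end = datetime.strptime(end_date, fmt)
    day = datetime.strptime(requested, fmt)

    if not (start <= day <= end):
        raise ValueError(
            "Requested day %s is outside the GTFS service period %s - %s"
            % (requested, start_date, end_date))
```

With multiple feeds, the check would run per feed so the error message identifies which one does not cover the requested day.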

Refactor reset_index

The pipeline could be simplified (and potentially sped up even further) if we made proper use of Pandas indices. This would also involve checking all the then unnecessary calls to reset_index if we really use the indexing capabilities of pandas.
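The kind of simplification meant here can be illustrated briefly (an illustrative pandas pattern with made-up data, not actual pipeline code):

```python
import pandas as pd

df_households = pd.DataFrame({
    "household_id": [10, 20],
    "income": [30000, 45000]})

df_persons = pd.DataFrame({
    "person_id": [1, 2, 3],
    "household_id": [10, 10, 20]})

# Instead of merging and calling reset_index afterwards, keep the
# household identifier as an index and map the values directly:
income = df_households.set_index("household_id")["income"]
df_persons["income"] = df_persons["household_id"].map(income)
```

The index survives subsequent lookups, so intermediate reset_index / merge round trips become unnecessary.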
