match-gtfs-rt-to-gtfs's Introduction

match-gtfs-rt-to-gtfs

Try to match realtime transit data (e.g. from GTFS Realtime (GTFS-RT)) with GTFS Static data, even if they don't share an ID.

This repo uses @derhuerst/stable-public-transport-ids to compute IDs from the transit data itself:

  1. gtfs-via-postgres is used to import the GTFS Static data into the DB.
  2. match-gtfs-rt-to-gtfs computes these "stable IDs" for all relevant items in the GTFS Static data and stores them in the DB.
  3. When given a piece of realtime data (e.g. from a GTFS Realtime feed), it computes its "stable IDs" and checks whether they match those stored in the DB.
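
The idea behind steps 2 and 3 can be illustrated with a minimal in-memory sketch. Everything in it is an assumption made for illustration: the `stableStopIds` helper, the ID formats and the `Set` standing in for the DB are hypothetical and are not the API of @derhuerst/stable-public-transport-ids.

```javascript
// Hypothetical sketch, NOT the API of @derhuerst/stable-public-transport-ids.
// It derives several IDs per stop from the data itself, so that two datasets
// with different native IDs can still agree on at least one derived ID.
const normalizeName = (name) => {
	return name.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '')
}

const stableStopIds = (stop) => {
	const name = normalizeName(stop.name)
	// Round to 3 decimal places (a grid of roughly 100m) to tolerate small offsets.
	const lat = stop.latitude.toFixed(3)
	const lon = stop.longitude.toFixed(3)
	return [
		`name:${name}`,
		`name+loc:${name}:${lat}:${lon}`,
	]
}

// Step 2: "index" the GTFS Static stop (a Set stands in for the DB here).
const gtfsStop = {id: '070101002285', name: 'U Moritzplatz', latitude: 52.503737, longitude: 13.410944}
const index = new Set(stableStopIds(gtfsStop))

// Step 3: a realtime stop with a different native ID still matches.
const realtimeStop = {id: '900000013101', name: 'U Moritzplatz', latitude: 52.50374, longitude: 13.41094}
const isMatch = stableStopIds(realtimeStop).some((id) => index.has(id))

const otherStop = {id: 'x', name: 'S Grunewald', latitude: 52.4886, longitude: 13.2613}
const isOtherMatch = stableStopIds(otherStop).some((id) => index.has(id))

console.log(isMatch, isOtherMatch)
```

Generating more than one ID per item makes matching more forgiving, at the cost of a larger index.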

Installation

npm install match-gtfs-rt-to-gtfs

Note: match-gtfs-rt-to-gtfs needs PostgreSQL >=14 to work, as its dependency gtfs-via-postgres needs that version. You can check your PostgreSQL server's version with psql -t -c 'SELECT version()'.

Usage

building the database

Let's use the gtfs-to-sql CLI from gtfs-via-postgres to import our GTFS data into PostgreSQL:

gtfs-to-sql path/to/gtfs/*.txt | psql -b

To some extent, match-gtfs-rt-to-gtfs fuzzily matches stop/station & route/line names (more on that below). For that to work, we need to tell it how to "normalize" these names. As an example, we're going to do this for the VBB data:

// normalize.js
const normalizeStopName = require('normalize-vbb-station-name-for-search')
const slugg = require('slugg')

const normalizeLineName = (name) => {
	return slugg(name.replace(/([a-zA-Z]+)\s+(\d+)/g, '$1$2'))
}

module.exports = {
	normalizeStopName,
	normalizeLineName,
	// With VBB vehicles, the headsign is almost always the last stop.
	normalizeTripHeadsign: normalizeStopName,
}
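
To see what the regular expression in normalizeLineName does on its own, here is a small sketch that leaves out slugg. The helper name `joinLetterAndNumber` and the sample line names are made up for illustration.

```javascript
// Joins a run of letters and a following number that are separated by whitespace.
// Names without whitespace between letters and digits are left unchanged.
const joinLetterAndNumber = (name) => name.replace(/([a-zA-Z]+)\s+(\d+)/g, '$1$2')

const examples = ['M 29', 'RE  1', 'U7', 'Bus 100']
const joined = examples.map(joinLetterAndNumber)
console.log(joined)
```

With this step in place, spellings such as "M 29" and "M29" normalize to the same line name before slugg is applied.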

We're going to create two files that specify how to handle the GTFS-RT & GTFS (Static) data, respectively:

// gtfs-rt-info.js
const {
	normalizeStopName,
	normalizeLineName,
	normalizeTripHeadsign,
} = require('./normalize.js')

module.exports = {
	endpointName: 'vbb-hafas',
	normalizeStopName,
	normalizeLineName,
	normalizeTripHeadsign,
}

// gtfs-info.js
const {
	normalizeStopName,
	normalizeLineName,
	normalizeTripHeadsign,
} = require('./normalize.js')

module.exports = {
	endpointName: 'vbb-gtfs',
	normalizeStopName,
	normalizeLineName,
	normalizeTripHeadsign,
}

Now, we're going to use match-gtfs-rt-to-gtfs/build-index.js to import additional data into the database that is needed for matching:

set -o pipefail
./build-index.js gtfs-rt-info.js gtfs-info.js | psql -b

matching data

match-gtfs-rt-to-gtfs does its job using fuzzy matching: as an example, it identifies two departure data points from GTFS-RT & GTFS (at the same time, at the same stop/station and with the same route/line name) as equivalent.

Now, let's match a departure against GTFS:

const createMatch = require('match-gtfs-rt-to-gtfs')
const gtfsRtInfo = require('./gtfs-rt-info.js') // see above
const gtfsInfo = require('./gtfs-info.js') // see above

const gtfsRtDep = {
	tripId: '1|12308|1|86|7112020',
	direction: 'Grunewald, Roseneck',
	line: {
		type: 'line',
		id: 'm29',
		fahrtNr: '22569',
		name: 'M29',
		public: true,
		adminCode: 'BVB',
		mode: 'bus',
		product: 'bus',
		operator: {
			type: 'operator',
			id: 'berliner-verkehrsbetriebe',
			name: 'Berliner Verkehrsbetriebe'
		},
	},

	stop: {
		type: 'stop',
		id: '900000013101',
		name: 'U Moritzplatz',
		location: {latitude: 52.503737, longitude: 13.410944},
	},

	when: '2020-11-07T14:55:00+01:00',
	plannedWhen: '2020-11-07T14:54:00+01:00',
	delay: 60,
	platform: null,
	plannedPlatform: null,
}

const {matchDeparture} = createMatch(gtfsRtInfo, gtfsInfo)
console.log(await matchDeparture(gtfsRtDep))

{
	tripId: '145341691',
	tripIds: {
		'vbb-hafas': '1|12308|1|86|7112020',
		'vbb-gtfs': '145341691',
	},
	routeId: '17449_700',
	direction: 'Grunewald, Roseneck',
	line: {
		type: 'line',
		id: null,
		fahrtNr: '22569',
		fahrtNrs: {'vbb-hafas': '22569'},
		name: 'M29',
		public: true,
		adminCode: 'BVB',
		mode: 'bus',
		product: 'bus',
		operator: {
			type: 'operator',
			id: 'berliner-verkehrsbetriebe',
			name: 'Berliner Verkehrsbetriebe'
		},
	},

	stop: {
		type: 'stop',
		id: '070101002285',
		ids: {
			'vbb-hafas': '900000013101',
			'vbb-gtfs': '070101002285',
		},
		name: 'U Moritzplatz',
		location: {latitude: 52.503737, longitude: 13.410944},
	},

	when: '2020-11-07T14:55:00+01:00',
	plannedWhen: '2020-11-07T14:54:00+01:00',
	delay: 60,
	platform: null,
	plannedPlatform: null,
}

finding the shape of a trip

const findShape = require('match-gtfs-rt-to-gtfs/find-shape')

const someTripId = '24582338' // some U3 trip from the HVV dataset
await findShape(someTripId)

findShape resolves with a GeoJSON LineString:

{
	type: 'LineString',
	coordinates: [
		[10.044385, 53.5872],
		// …
		[10.074888, 53.592473]
	],
}
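
As one possible way to use such a result, the following sketch approximates the length of a LineString with the haversine formula. The helper names are made up for illustration, and only the two coordinates shown above are used, whereas a real shape has many more points.

```javascript
// Approximate length of a GeoJSON LineString in meters (haversine formula).
// GeoJSON positions are [longitude, latitude].
const EARTH_RADIUS_M = 6371000
const toRad = (deg) => deg * Math.PI / 180

const haversine = ([lon1, lat1], [lon2, lat2]) => {
	const dLat = toRad(lat2 - lat1)
	const dLon = toRad(lon2 - lon1)
	const a = Math.sin(dLat / 2) ** 2
		+ Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2
	return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a))
}

const lineStringLength = (shape) => {
	let total = 0
	for (let i = 1; i < shape.coordinates.length; i++) {
		total += haversine(shape.coordinates[i - 1], shape.coordinates[i])
	}
	return total
}

const shape = {
	type: 'LineString',
	coordinates: [
		[10.044385, 53.5872],
		[10.074888, 53.592473],
	],
}
const lengthInMeters = lineStringLength(shape)
console.log(Math.round(lengthInMeters), 'm')
```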

How it works

gtfs-via-postgres adds a view arrivals_departures, which contains every arrival/departure of every trip in the GTFS static dataset. This repo adds another view arrivals_departures_with_stable_ids, which combines the data with the "stable IDs" stored in separate tables. It is then used for the matching process, which works essentially like this:

SELECT *
FROM arrivals_departures_with_stable_ids
WHERE (
	stop_stable_ids && ARRAY['stop-id1', 'stop-id2']
	OR station_stable_ids && ARRAY['station-id1', 'station-id2']
)
AND route_stable_ids && ARRAY['route-id1', 'route-id2']
AND t_departure > '2020-10-16T22:20:48+02:00'
AND t_departure < '2020-10-16T22:22:48+02:00'

Because PostgreSQL executes this query quite efficiently, we don't need to store a pre-computed index of all arrivals/departures, but just an index of their stable stop/station/route IDs.
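
For readers less familiar with PostgreSQL's array overlap operator (&&), here is a rough in-memory analogue of the query above. The row layout and the `findCandidates` helper are assumptions for illustration; the actual matching runs in the database.

```javascript
// Like PostgreSQL's && on arrays: true if both arrays share at least one element.
const overlaps = (a, b) => a.some((id) => b.includes(id))

// Made-up rows, shaped loosely like arrivals_departures_with_stable_ids.
const rows = [
	{trip_id: 'a', stop_stable_ids: ['stop-id1'], station_stable_ids: [], route_stable_ids: ['route-id2'], t_departure: '2020-10-16T22:21:48+02:00'},
	{trip_id: 'b', stop_stable_ids: ['stop-id1'], station_stable_ids: [], route_stable_ids: ['route-id9'], t_departure: '2020-10-16T22:21:48+02:00'},
	{trip_id: 'c', stop_stable_ids: ['stop-id2'], station_stable_ids: [], route_stable_ids: ['route-id1'], t_departure: '2020-10-16T22:30:00+02:00'},
]

const findCandidates = (rows, {stopIds, stationIds, routeIds, after, before}) => {
	const t0 = Date.parse(after)
	const t1 = Date.parse(before)
	return rows.filter((row) => {
		const t = Date.parse(row.t_departure)
		const sameStop = overlaps(row.stop_stable_ids, stopIds)
			|| overlaps(row.station_stable_ids, stationIds)
		// Both time bounds are exclusive, as in the SQL query.
		return sameStop && overlaps(row.route_stable_ids, routeIds) && t > t0 && t < t1
	})
}

const candidates = findCandidates(rows, {
	stopIds: ['stop-id1', 'stop-id2'],
	stationIds: ['station-id1', 'station-id2'],
	routeIds: ['route-id1', 'route-id2'],
	after: '2020-10-16T22:20:48+02:00',
	before: '2020-10-16T22:22:48+02:00',
})
const candidateTripIds = candidates.map((row) => row.trip_id)
console.log(candidateTripIds)
```

Row b fails on the route IDs and row c falls outside the time window, so only row a remains.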

The size of this additional index depends on how many stable IDs your logic generates for each stop/station/route. Consider the 2020-09-25 VBB GTFS Static feed as an example: Without shapes.txt, it is 356MB as CSV files, ~2GB as imported & indexed in the DB by gtfs-via-postgres; match-gtfs-rt-to-gtfs's stable ID indices add another

  • 300MB with few stable IDs per stop/station/route, and
  • 3GB with 10-30 stable IDs each.

API

gtfsInfo/gtfsRtInfo

{
	endpointName: string,
	normalizeStopName: (name: string, stop: FptfStop) => string,
	normalizeLineName: (name: string, line: FptfLine) => string,
	normalizeTripHeadsign: (headsign: string) => string,
}
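
A dependency-free object of this shape might look like the sketch below. The normalization rules and the endpoint name 'my-gtfs' are illustrative assumptions rather than recommendations; real datasets usually need operator-specific rules, as in the VBB example above.

```javascript
// Illustrative normalizers; the second argument (stop/line) is unused here,
// but could be used to normalize differently depending on the item.
const normalizeStopName = (name, stop) => name.toLowerCase().replace(/\s+/g, ' ').trim()
const normalizeLineName = (name, line) => name.toLowerCase().replace(/\s+/g, '')
const normalizeTripHeadsign = (headsign) => normalizeStopName(headsign)

const gtfsInfo = {
	endpointName: 'my-gtfs', // hypothetical name identifying the data source
	normalizeStopName,
	normalizeLineName,
	normalizeTripHeadsign,
}

const normalizedStop = gtfsInfo.normalizeStopName('  U  Moritzplatz ')
const normalizedLine = gtfsInfo.normalizeLineName('M 29')
console.log(normalizedStop, normalizedLine)
```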

Contributing

Note: This repo blends two families of technical terms (GTFS-related ones and FPTF-/hafas-client-related ones), which makes the code somewhat confusing.

If you have a question or need support using match-gtfs-rt-to-gtfs, please double-check your code and setup first. If you think you have found a bug or want to propose a feature, use the issues page.

match-gtfs-rt-to-gtfs's People

Contributors

derhuerst

match-gtfs-rt-to-gtfs's Issues

customisable trip ID behaviour

When matching a trip (meaning a vehicle running as part of a line at a specific point in time) from HAFAS against GTFS data, a problem arises: GTFS does not have this notion of a trip (ID). A GTFS trip specifies the wall-clock time of all departures/arrivals relative to "noon minus 12h", and runs on every service day of its route. This means that we cannot map HAFAS trip IDs 1-to-1 to GTFS trip IDs.

Currently, this project just picks the HAFAS trip ID, but the idea is to use IDs from GTFS whenever possible (a.k.a. whenever the matching worked). There are several options:

  • Use GTFS trip IDs for trips, even though the concepts don't line up. This may lead consumers of the matched feed to make false assumptions and therefore make wrong analyses of the data.
  • Stay with HAFAS trip IDs, because only they map 100% to the data coming from HAFAS.
  • Build custom trip IDs, combining the GTFS trip ID with the service day. This will allow consumers to take a combined trip ID apart and process it appropriately to their use case.
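
The third option could be sketched as follows. The format "<GTFS trip ID>@<service day>" and both helper names are hypothetical; the project does not define such a format.

```javascript
// Hypothetical combined ID format: "<GTFS trip_id>@<service day as YYYY-MM-DD>".
const SEPARATOR = '@'

const buildTripId = (gtfsTripId, serviceDay) => {
	if (!/^\d{4}-\d{2}-\d{2}$/.test(serviceDay)) {
		throw new Error('serviceDay must be formatted as YYYY-MM-DD')
	}
	return gtfsTripId + SEPARATOR + serviceDay
}

const parseTripId = (combined) => {
	// Split at the last separator, so GTFS trip IDs containing "@" still work.
	const i = combined.lastIndexOf(SEPARATOR)
	if (i < 0) throw new Error('not a combined trip ID')
	return {
		gtfsTripId: combined.slice(0, i),
		serviceDay: combined.slice(i + 1),
	}
}

const combined = buildTripId('145341691', '2020-11-07')
const parsed = parseTripId(combined)
console.log(combined, parsed)
```

Consumers that only care about the GTFS trip can strip the service day again, while others can use the full ID to tell individual runs apart.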

allow customising the DB schema

This use case has been brought up in derhuerst/berlin-gtfs-rt-server#9.

berlin-gtfs-rt-server uses hafas-gtfs-rt-feed underneath, which in turn uses gtfs-via-postgres & match-gtfs-rt-to-gtfs. The former already supports using a schema other than public, but match-gtfs-rt-to-gtfs doesn't support it yet.

Two places would have to be adapted:

  • The build-index command needs a new option (e.g. --db-schema).
  • The actual matching logic (index.js) needs a new option (e.g. dbSchema).

The name of the schema should be added to the DB queries in a secure manner, e.g. by using pg-format.
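
To illustrate what "in a secure manner" means here, the sketch below quotes an identifier by hand, following PostgreSQL's rule of wrapping it in double quotes and doubling any embedded double quote. pg-format's %I placeholder serves the same purpose; the `quoteIdent` and `buildQuery` helpers and the `dbSchema` default are assumptions for illustration.

```javascript
// Quote a PostgreSQL identifier: wrap it in double quotes and double any embedded ".
const quoteIdent = (name) => {
	if (typeof name !== 'string' || name.length === 0) throw new Error('invalid identifier')
	if (name.includes('\0')) throw new Error('identifier must not contain NUL')
	return '"' + name.replace(/"/g, '""') + '"'
}

// dbSchema mirrors the option name proposed above; 'public' as default is an assumption.
const buildQuery = (dbSchema = 'public') => {
	return `SELECT * FROM ${quoteIdent(dbSchema)}.arrivals_departures_with_stable_ids`
}

const safeQuery = buildQuery('my_schema')
const hostileQuery = buildQuery('evil"; DROP TABLE stops; --')
console.log(safeQuery)
console.log(hostileQuery)
```

In the second call, the hostile input stays inside a single quoted identifier instead of terminating it.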

findTrip: check if first & last stopovers' trip_id matches

const [
	gtfsFirstDep,
	gtfsLastArr,
] = await Promise.all([
	findDep(firstDep),
	findArr(lastArr),
])
if (!gtfsFirstDep) {
	debug(`couldn't find first departure`, firstDep)
	return null;
}
if (!gtfsLastArr) {
	debug(`couldn't find last arrival`, lastArr)
	return null;
}
debug('matching all stopovers')
const matchStop = createMatchStop(gtfsRtInfo, gtfsInfo)
const gtfsStopovers = await Promise.all(t.stopovers.map(async (st) => {
	return {
		...st,
		stop: await matchStop(st.stop),
	}
}))
debug('done matching!')

Currently, we match the first & last stopover independently, but then never check that they actually returned the same trip_id. This means that, if there is another departure (of the same line at the same stop/station at the same time) but in a different direction, the matched first & last stopovers' trip_ids might actually belong to distinct trips of the same line.
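
The missing check could look roughly like the sketch below. It assumes that both matched stopovers expose the GTFS trip ID as `tripId`, as in the matchDeparture output in the README; the `isSameTrip` helper is hypothetical and not part of the code base.

```javascript
// Hypothetical helper: true only if both matches exist and refer to the same GTFS trip.
const isSameTrip = (gtfsFirstDep, gtfsLastArr) => {
	if (!gtfsFirstDep || !gtfsLastArr) return false
	return gtfsFirstDep.tripId === gtfsLastArr.tripId
}

// Made-up matches for illustration.
const firstDep = {tripId: '145341691', stop: {name: 'U Moritzplatz'}}
const lastArrSameTrip = {tripId: '145341691', stop: {name: 'Grunewald, Roseneck'}}
const lastArrOtherTrip = {tripId: '145341700', stop: {name: 'Grunewald, Roseneck'}}

const consistent = isSameTrip(firstDep, lastArrSameTrip)
const inconsistent = isSameTrip(firstDep, lastArrOtherTrip)
console.log(consistent, inconsistent)
```

In findTrip, a failed check would presumably be handled like the other failure cases, by logging and returning null.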

matching: option to configure tolerable time differences

With some realtime data sources (such as some HAFAS endpoints) and some GTFS datasets, there are minor discrepancies between the realtime data and the planned data, e.g. 1 minute for a departure of a specific trip.

We could introduce an option to allow such time differences when matching.

  • In the respective SQL queries, the time range would have to include the tolerance.
  • There should be logic in JS that fails if there is more than one match, because in these cases the matching can be wrong.

as reported by @hbruch
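
Both points of this proposal can be sketched in a few lines. The helper names and the tolerance of 60 seconds are assumptions for illustration; no such option exists yet.

```javascript
// Widen the matching time range by a tolerance (in seconds) around the planned time.
// The bounds are meant for exclusive comparisons (> and <), as in the existing SQL query.
const timeWindow = (plannedWhen, toleranceSeconds) => {
	const t = Date.parse(plannedWhen)
	return {
		after: new Date(t - toleranceSeconds * 1000).toISOString(),
		before: new Date(t + toleranceSeconds * 1000).toISOString(),
	}
}

// Refuse to guess: with more than one candidate, the match may be wrong.
const pickUniqueMatch = (matches) => {
	if (matches.length > 1) {
		throw new Error(`ambiguous match: ${matches.length} candidates`)
	}
	return matches.length === 1 ? matches[0] : null
}

const win = timeWindow('2020-11-07T14:54:00+01:00', 60)
console.log(win)

let ambiguousError = null
try {
	pickUniqueMatch([{tripId: 'a'}, {tripId: 'b'}])
} catch (err) {
	ambiguousError = err
}
console.log(ambiguousError && ambiguousError.message)
```

A larger tolerance makes matching more forgiving but also makes ambiguous results more likely, which is why the two pieces belong together.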
