mierune / gtfs-parser

parse and aggregate GTFS

Home Page: https://pypi.org/project/gtfs-parser/

License: MIT License

Python 100.00%

geojson geospatial gtfs python traffic

gtfs-parser's Issues

Performance improvement

Large dataset for benchmark
https://opentransportdata.swiss/de/dataset/timetable-2021-gtfs2020

time python -m gtfs_parser parse gtfs_fp2021_2021-12-08_09-10.zip swiss
extracting zipfile...
GTFS loaded.

real    218m5.224s
user    56m11.213s
sys     0m11.147s

`route_ids` in stops.geojson may contain `NaN` values

https://api.gtfs-data.jp/v2/organizations/soedatown/feeds/soedamachibus/files/feed.zip?rid=current

Skip unused files to avoid errors

If an error occurs while reading a file that is not one of the required tables, the program terminates abnormally.

Example

The GTFS feed of Nanto City contains a Shift_JIS-encoded file named "result.csv".
I suspect it is the output of an email sanitization step that was archived by mistake.

"gtfs_parser\gtfs_parser\gtfs.py", line 12, in load_df
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 0: invalid start byte

Proposal

I propose that GTFSFactory only read files that are actually used by this tool.

feed_info, agency, routes, stops, trips, stop_times, shapes, calendar, calendar_dates

The GTFS specification has been extended and now defines many files that this tool does not use.
Skipping those files will also improve performance.
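
The proposal can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the tool's actual GTFSFactory code: the function name load_used_tables is hypothetical, and it assumes the feed is a zip archive with the .txt tables at the top level. Files such as "result.csv" are never opened, so their encoding cannot raise an error.

```python
import csv
import io
import zipfile

# Tables named in the proposal; everything else in the archive is ignored.
USED_TABLES = (
    "feed_info", "agency", "routes", "stops", "trips",
    "stop_times", "shapes", "calendar", "calendar_dates",
)


def load_used_tables(zip_path):
    """Hypothetical helper: return {table_name: list of row dicts}."""
    tables = {}
    with zipfile.ZipFile(zip_path) as zf:
        members = set(zf.namelist())
        for name in USED_TABLES:
            filename = f"{name}.txt"
            if filename not in members:
                continue  # this feed does not provide the table
            with zf.open(filename) as raw:
                # utf-8-sig also strips a leading BOM, which some feeds include
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                tables[name] = list(csv.DictReader(text))
    return tables
```

Iterating over the whitelist rather than over the archive contents means that unknown or broken files are skipped without any per-file error handling.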

`route_ids` in `stops.geojson` can be `nan`

http://www.pref.nara.jp/secure/204903/GTFS-JP(2021-03-18_1033).zip

gtfs_dir type

In gtfs.py, the function GTFS(gtfs_dir: list) -> dict annotates gtfs_dir as list. Why is it list instead of str?

read_routes() should return MultiLineStrings even if shape is not used

Problem

Trips with the same route_id may have different stop patterns, such as outbound and return trips or short-turn (partial-section) trips.
However, read_routes() returns only one LineString per route when shapes are not used.

Cause

read_routes() only looks at the first trip of each route:

trip_id = route["trip_id"].unique()[0]

gtfs-parser/gtfs_parser/parse.py

Lines 93 to 111 in 479af5c

    
    for route_id in merged["route_id"].unique():
        route = merged[merged["route_id"] == route_id]
        trip_id = route["trip_id"].unique()[0]
        route = route[route["trip_id"] == trip_id].sort_values("stop_sequence")
        features.append(
            {
                "type": "Feature",
                "geometry": {
                    "type": "LineString",
                    "coordinates": route[
                        ["stop_lon", "stop_lat"]
                    ].values.tolist(),
                },
                "properties": {
                    "route_id": str(route_id),
                    "route_name": route.route_concat_name.values.tolist()[0],
                },
            }
        )

Solution

read_routes() should return one MultiLineString per route, containing one LineString for each distinct stop pattern.
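
One possible shape for this, sketched with plain Python structures rather than the data frames used in parse.py. The function name routes_as_multilinestrings and the input format (one dict per stop_times row, already joined with stop coordinates and route_id) are assumptions made for the illustration, and route_name is omitted for brevity.

```python
from collections import defaultdict


def routes_as_multilinestrings(rows):
    """Hypothetical helper. rows: dicts with route_id, trip_id, stop_id,
    stop_sequence (numeric), stop_lon and stop_lat."""
    # Collect the rows of every trip, grouped by route.
    trips = defaultdict(lambda: defaultdict(list))
    for row in rows:
        trips[row["route_id"]][row["trip_id"]].append(row)

    features = []
    for route_id, route_trips in trips.items():
        patterns = {}  # stop_id sequence -> coordinates, one per distinct pattern
        for trip_rows in route_trips.values():
            ordered = sorted(trip_rows, key=lambda r: r["stop_sequence"])
            key = tuple(r["stop_id"] for r in ordered)
            if key not in patterns:
                patterns[key] = [[r["stop_lon"], r["stop_lat"]] for r in ordered]
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "MultiLineString",
                "coordinates": list(patterns.values()),
            },
            "properties": {"route_id": str(route_id)},
        })
    return features
```

Trips that share the same stop sequence collapse into a single LineString, so a route with an outbound and a return pattern yields exactly two lines regardless of how many trips it has.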

Related Issue

The performance should also be improved. #1

Sample data

GTFS: Tokyo Toei Bus - ToeiBus-GTFS_20240421.zip

Improve performance of stop unification for frequency aggregation

Problem

Frequency aggregation is slow: it takes about 10 seconds for sample data containing 1580 trips.
This is about 20 times longer than the time needed to read stops and routes (0.5 s).

Cause

Stop aggregation accounts for 94% of the total frequency aggregation time.
Most of that time is spent in __get_similar_stop_tuple().
The function is called once per stop, via map() over the stops data frame.
Each call is slow because it searches and sorts all of the stops.

Profiling results

Sun Apr 21 02:20:36 2024    chitetsu.prof
         18400825 function calls (17686952 primitive calls) in 17.311 seconds
   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    720/1    0.010    0.000   17.334   17.334 {built-in method builtins.exec}
        1    0.000    0.000   17.334   17.334 test_vary_gtfs.py:1(<module>)
        1    0.001    0.001   15.349   15.349 test_vary_gtfs.py:82(main)
        1    0.004    0.004   15.349   15.349 test_vary_gtfs.py:50(exec_test)
        1    0.001    0.001   14.435   14.435 aggregate.py:12(__init__)
        1    0.003    0.003   14.434   14.434 aggregate.py:34(__aggregate_similar_stops)
        7    0.017    0.002   14.391    2.056 {pandas._libs.lib.map_infer}
        6    0.000    0.000   14.127    2.354 series.py:3908(map)
        6    0.000    0.000   14.124    2.354 base.py:1078(_map_values)
     1226    0.071    0.000   13.863    0.011 aggregate.py:91(<lambda>)
     1226    0.082    0.000   13.792    0.011 aggregate.py:134(__get_similar_stop_tuple)
     1228    0.017    0.000    6.339    0.005 frame.py:3197(query)
     1228    0.019    0.000    5.668    0.005 frame.py:3359(eval)
     9877    0.071    0.000    4.191    0.000 frame.py:2869(__getitem__)
     1228    0.021    0.000    3.950    0.003 eval.py:161(eval)
     7376    0.076    0.000    2.737    0.000 managers.py:1436(take)
27257/27244    0.209    0.000    2.714    0.000 series.py:201(__init__)
       25    0.000    0.000    2.666    0.107 __init__.py:1(<module>)
     6149    0.019    0.000    2.654    0.000 generic.py:3355(_take_with_is_copy)
     9820    0.027    0.000    2.383    0.000 common.py:50(new_method)

Solution

I expect the processing to be much faster if it is done once for the entire stops data frame instead of once per stop.
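
A rough sketch of the idea, using plain Python structures instead of a data frame. It groups all stops in a single pass rather than searching the full stop list for every stop. The function name unify_stops is hypothetical, and the similarity rule used here (identical stop_name) is only an assumption for illustration; the actual rule in aggregate.py may differ, for example by also checking distance. With pandas, the analogous step would be a single groupby over the stops data frame.

```python
from collections import defaultdict


def unify_stops(stops):
    """Hypothetical helper: map each stop_id to (name, centroid lon, centroid lat).

    ASSUMPTION: stops are "similar" when they share a stop_name.
    """
    groups = defaultdict(list)
    for stop in stops:  # one pass over all stops, no per-stop search
        groups[stop["stop_name"]].append(stop)

    unified = {}
    for name, members in groups.items():
        # Compute the centroid once per group, then share it with every member.
        lon = sum(s["stop_lon"] for s in members) / len(members)
        lat = sum(s["stop_lat"] for s in members) / len(members)
        for s in members:
            unified[s["stop_id"]] = (name, lon, lat)
    return unified
```

Each stop is visited a constant number of times, so the cost grows linearly with the number of stops instead of with its square.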

Sample data

GTFS: feed_chitetsu_chitetsubus_20240326_191913.zip
Results of cProfile: chitetsu_prof.txt

