performance improve about gtfs-parser HOT 4 CLOSED

Kanahiro commented on September 4, 2024

performance improve

from gtfs-parser.

Comments (4)

liorsteinberg commented on September 4, 2024 1

Hey @Kanahiro,

I think the read_routes function can be radically improved in terms of speed and performance.

I noticed the current implementation for generating GeoJSON features from the merged DataFrame iterates over each unique route_id, which can be quite slow for large DataFrames. Every iteration filters the full DataFrame by route_id and then sorts the result, so the cost grows with the number of routes times the number of rows.

To enhance performance, I suggest using pandas' groupby and apply methods. Sorting the DataFrame once and grouping it by both route_id and trip_id lets a single function build the GeoJSON feature for each group. This avoids re-filtering and re-sorting the whole DataFrame on every iteration and relies on pandas' optimized group processing, which can significantly reduce execution time.

Locally, I replaced this code:

# parse routes
for route_id in merged["route_id"].unique():
    route = merged[merged["route_id"] == route_id]
    trip_id = route["trip_id"].unique()[0]
    route = route[route["trip_id"] == trip_id].sort_values("stop_sequence")
    features.append(
        {
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": route[
                    ["stop_lon", "stop_lat"]
                ].values.tolist(),
            },
            "properties": {
                "route_id": str(route_id),
                "route_name": route.route_concat_name.values.tolist()[0],
            },
        }
    )

with this:

def create_feature(group):
    # Assuming the group is already sorted by stop_sequence, if not, uncomment the next line
    # group = group.sort_values("stop_sequence")
    route_id = group["route_id"].iloc[0]
    route_name = group["route_concat_name"].iloc[0]

    feature = {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            "coordinates": group[["stop_lon", "stop_lat"]].values.tolist(),
        },
        "properties": {
            "route_id": str(route_id),
            "route_name": route_name,
        },
    }

    return feature
    
# Ensure the DataFrame is sorted by stop_sequence before applying the function
merged_sorted = merged.sort_values(["route_id", "trip_id", "stop_sequence"])

# Apply the function to each group of route_id and trip_id
features = merged_sorted.groupby(["route_id", "trip_id"]).apply(create_feature).tolist()
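One difference worth noting: the original loop emits one feature per route (using the first trip_id it finds), while the groupby version above emits one feature per (route_id, trip_id) pair. A minimal sketch that keeps the one-feature-per-route behaviour, assuming `merged` has the same columns as above. The helper name `build_route_features` and the small sample DataFrame are made up for illustration and are not part of gtfs-parser:

```python
import pandas as pd


def build_route_features(merged: pd.DataFrame) -> list:
    """Build one LineString feature per route_id, using that route's first trip."""
    # First trip_id seen for each route, broadcast to every row of that route
    # (plays the role of route["trip_id"].unique()[0] in the loop version).
    first_trip = merged.groupby("route_id")["trip_id"].transform("first")

    # Keep only rows of that first trip, then sort once for the whole frame.
    one_trip = merged[merged["trip_id"] == first_trip].sort_values(
        ["route_id", "stop_sequence"]
    )

    features = []
    # Groups come back ordered by route_id; rows keep their sorted order.
    for route_id, group in one_trip.groupby("route_id"):
        features.append(
            {
                "type": "Feature",
                "geometry": {
                    "type": "LineString",
                    "coordinates": group[["stop_lon", "stop_lat"]].values.tolist(),
                },
                "properties": {
                    "route_id": str(route_id),
                    "route_name": group["route_concat_name"].iloc[0],
                },
            }
        )
    return features


# Hypothetical sample: route r1 has two trips, route r2 has one.
sample = pd.DataFrame(
    {
        "route_id": ["r1", "r1", "r1", "r1", "r2", "r2"],
        "trip_id": ["t1", "t1", "t2", "t2", "t3", "t3"],
        "stop_sequence": [2, 1, 1, 2, 1, 2],
        "stop_lon": [8.2, 8.1, 8.1, 8.2, 7.5, 7.6],
        "stop_lat": [47.2, 47.1, 47.1, 47.2, 46.9, 47.0],
        "route_concat_name": ["A", "A", "A", "A", "B", "B"],
    }
)
features = build_route_features(sample)
print(len(features))
```

Unlike the loop, this returns features ordered by route_id instead of by first appearance in `merged`, which should not matter for a GeoJSON FeatureCollection.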


Kanahiro commented on September 4, 2024 1

500x faster...

time poetry run python -m gtfs_parser aggregate /Users/kanahiro/Downloads/GTFS_FP2021_2021-12-08_09-10 output
GTFS loaded.

real    0m36.841s
user    0m32.781s
sys     0m3.893s

Closing this :)
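The timing above covers a whole `aggregate` run on a large feed. To look at the loop-versus-groupby difference in isolation, here is a rough sketch on synthetic data. All function names, sizes, and coordinates below are made up for illustration, and the measured ratio will depend on the feed and the pandas version:

```python
import time

import numpy as np
import pandas as pd


def make_synthetic(n_routes=200, stops_per_route=50, seed=0):
    """Hypothetical stand-in for the merged DataFrame, with shuffled rows."""
    rng = np.random.default_rng(seed)
    n = n_routes * stops_per_route
    df = pd.DataFrame(
        {
            "route_id": np.repeat(np.arange(n_routes), stops_per_route),
            "stop_sequence": np.tile(np.arange(stops_per_route), n_routes),
            "stop_lon": rng.uniform(5.9, 10.5, n),
            "stop_lat": rng.uniform(45.8, 47.8, n),
        }
    )
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)


def coords_by_filter_loop(df):
    # Scans the whole frame once per route, then sorts that route's rows.
    out = {}
    for route_id in df["route_id"].unique():
        route = df[df["route_id"] == route_id].sort_values("stop_sequence")
        out[route_id] = route[["stop_lon", "stop_lat"]].values.tolist()
    return out


def coords_by_groupby(df):
    # Sorts once, then walks the groups in a single pass.
    ordered = df.sort_values(["route_id", "stop_sequence"])
    return {
        route_id: group[["stop_lon", "stop_lat"]].values.tolist()
        for route_id, group in ordered.groupby("route_id")
    }


df = make_synthetic()
for fn in (coords_by_filter_loop, coords_by_groupby):
    start = time.perf_counter()
    result = fn(df)
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__}: {elapsed:.3f}s for {len(result)} routes")
```

Both functions return the same coordinates per route, so only the elapsed time should differ.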


Kanahiro commented on September 4, 2024

Hi @liorsteinberg, sorry for the late response.
The newest version of gtfs-parser is dramatically improved; please try it if you are still interested :)


liorsteinberg commented on September 4, 2024

Amazing work! Thanks

