
Comments (3)

thbar commented on August 21, 2024

Preliminary notes, in brain dump mode for now:

I first looked at the transport-site database, which stores the validation history, using:

SELECT
    validation_json_size,
    to_char(100 * validation_json_size::float / (SUM(validation_json_size) OVER ()), '999D99%') AS ratio,
    subquery.id,
    resource_id,
    url
FROM (SELECT *, pg_column_size(details) AS validation_json_size FROM validations) AS subquery
INNER JOIN resource r ON r.id = resource_id
WHERE validation_json_size IS NOT NULL
ORDER BY validation_json_size DESC

On a recent database, this gives the following (before optimisation; only the top rows are shown):

validation_json_size ratio id resource_id url
1489511 16.68% 102514 10391 http://breizh.opendatasoft.com/api/datasets/1.0/base-de-donnees-multimodale-transports-publics-en-bretagne-mobibreizh/attachments/17_07_20_mobibreizhbret_gtfs_zip
1103264 12.35% 110330 12714 http://breizh.opendatasoft.com/api/datasets/1.0/base-de-donnees-multimodale-transports-publics-en-bretagne-mobibreizh/attachments/02_10_2020_mobibreizhbret_gtfs_zip
1066277 11.94% 104949 10987 http://breizh.opendatasoft.com/api/datasets/1.0/base-de-donnees-multimodale-transports-publics-en-bretagne-mobibreizh/attachments/10_08_2020_mobibreizhbret_gtfs_zip
1039256 11.64% 101524 10248 http://breizh.opendatasoft.com/api/datasets/1.0/base-de-donnees-multimodale-transports-publics-en-bretagne-mobibreizh/attachments/mobibreizhbret_201706_gtfs_zip
959260 10.74% 114245 16188 https://breizh.opendatasoft.com/api/datasets/1.0/base-de-donnees-multimodale-transports-publics-en-bretagne-mobibreizh/attachments/11_2020_mobibreizhbret_gtfs_zip
953154 10.67% 113066 16189 http://breizh.opendatasoft.com/api/datasets/1.0/base-de-donnees-multimodale-transports-publics-en-bretagne-mobibreizh/attachments/11_2020_mobibreizhbret_gtfs_zip
274778 3.08% 113782 8361 https://data.iledefrance-mobilites.fr/api/v2/catalog/datasets/offre-horaires-tc-gtfs-idf/files/736ca2f956a1b6cc102649ed6fd56d45
191509 2.14% 113623 8781 https://data.centrevaldeloire.fr/api/v2/catalog/datasets/jvmalin-point-dacces-national/files/a98f4cdb41591e3530c1e4f29d39fc53
155512 1.74% 100930 7579 https://www.pigma.org/public/opendata/nouvelle_aquitaine_mobilites/publication/naq-aggregated-gtfs.zip
59756 .67% 113607 9178 https://data.centrevaldeloire.fr/api/v2/catalog/datasets/offre-theorique-mobilite-remi/files/8fdd2d65720750e8064cbfee68426e0f
58506 .66% 104631 10341 http://data.haute-garonne.fr/api/datasets/1.0/lignes-regulieres-format-gtfs/attachments/reseau_lr_gtfs_20200706_zip
49832 .56% 114319 16387 https://ressources.data.sncf.com/api/v2/catalog/datasets/sncf-ter-gtfs/files/24e02fa969496e2caa5863a365c66ec2
39581 .44% 109853 11939 http://data.haute-garonne.fr/api/datasets/1.0/lignes-regulieres-format-gtfs/attachments/reseau_lr_gtfs_20200924_zip
39037 .44% 114445 7840 https://www.pigma.org/public/opendata/nouvelle_aquitaine_mobilites/publication/naq_lim-aggregated-gtfs.zip
37718 .42% 114428 7806 https://www.pigma.org/public/opendata/nouvelle_aquitaine_mobilites/publication/naq_gir-aggregated-gtfs.zip
36156 .40% 100608 9093 http://data.haute-garonne.fr/api/datasets/1.0/lignes-regulieres-format-gtfs/attachments/reseau_lr_gtfs_20200106_zip
35406 .40% 100444 8315 https://static.data.gouv.fr/resources/gtfs-de-la-societe-de-transport-urbain-du-grand-montauban-semtm/20181128-174626/gtfs.zip
35307 .40% 101777 8336 http://data.haute-garonne.fr/api/datasets/1.0/lignes-regulieres-format-gtfs/attachments/reseau_lr_gtfs_20191104_zip
35098 .39% 100586 8322 https://data.mulhouse-alsace.fr/api/datasets/1.0/offre-de-transport-solea-et-tram-train-en-format-gtfs/alternative_exports/sitram_gtfs_2018_2019_zip
30096 .34% 114413 7654 https://opendata.lillemetropole.fr/api/datasets/1.0/transport_arret_transpole-point/alternative_exports/gtfs_zip
27776 .31% 95156 7581 https://www.pigma.org/public/opendata/nouvelle_aquitaine_mobilites/publication/naq-aggregated-netex.zip
27229 .30% 101713 10335 https://static.data.gouv.fr/resources/horaires-theoriques-du-reseau-zoom-le-grand-chalon-gtfs/20200603-143958/gtfs-20200603-01-3-.zip
23419 .26% 107285 10635 https://sig.hautsdefrance.fr/ext/opendata/Transport/GTFS/59/RHDF_GTFS_COM_SCO_59_P1.zip
22496 .25% 113211 16402 https://static.data.gouv.fr/resources/horaires-theoriques-du-reseau-zoom-le-grand-chalon-gtfs-1/20201105-155731/gtfs-20201105-03.zip
22063 .25% 107299 10636 https://sig.hautsdefrance.fr/ext/opendata/Transport/GTFS/59/RHDF_GTFS_COM_SCO_59_P2.zip
19840 .22% 114312 16499 https://ressources.data.sncf.com/api/v2/catalog/datasets/sncf-intercites-gtfs/files/ed829c967a0da1252f02baaf684db32c
19316 .22% 112295 8593 https://trouver.datasud.fr/dataset/44187c20-e037-4733-950a-b4463d314b90/resource/f6342a2c-d02a-405f-9700-6a7121e2e06f/download/gtfs_84.zip
18305 .20% 100654 7906 https://static.data.gouv.fr/resources/offre-de-transports-reseau-dk-bus-de-la-communaute-urbaine-de-dunkerque-gtfs/20190701-034402/gtfs.zip
16947 .19% 114308 8588 https://trouver.datasud.fr/dataset/44187c20-e037-4733-950a-b4463d314b90/resource/db4be056-c7e8-4efb-8299-4b8c6235defe/download/gtfs_06.zip
16711 .19% 113547 11757 https://exs.grandest2.cityway.fr/GTFS.aspx?Key=OPENDATA&OperatorCode=CG68
16617 .19% 113933 10354 https://data.explore.divia.fr/api/datasets/1.0/gtfs-divia-mobilites/attachments/gtfs_diviamobilites_current_zip
16160 .18% 107946 10363 https://exs.grandest2.cityway.fr/GTFS.aspx?Key=OPENDATA&OperatorCode=CG68
15858 .18% 100423 9779 https://static.data.gouv.fr/resources/reseau-taneo-1/20200529-052803/gtfs-lot1-20200417-20201231.zip
15736 .18% 100542 8045 https://static.data.gouv.fr/resources/offre-de-transport-du-reseau-trema-gtfs/20190827-090229/export-2-septembre.zip
15685 .18% 101820 10344 https://static.data.gouv.fr/resources/horaires-du-reseau-ntecc-periode-scolaire-1/20200708-084224/gtfs.zip
15414 .17% 113574 11763 https://exs.grandest2.cityway.fr/GTFS.aspx?Key=OPENDATA&OperatorCode=LIVO
14912 .17% 114412 7634 http://opendata.cts-strasbourg.fr/fichiers/gtfs/google_transit.zip

I then grabbed data from a large dataset:

And ran the validator locally with:

cargo run --release -- --input 12_2020_mobibreizhbret_gtfs.zip > 12_2020_mobibreizhbret_gtfs.validation.json

Finally, I filtered the JSON with jq to get an idea of where most of the payload comes from:

jq -c 'path(..)|[.[]|tostring]|join("/")' 12_2020_mobibreizhbret_gtfs.validation.json | sed -e 's/[0-9]/X/g' | sort | uniq -c | sort -rn | grep "validations/CloseStops"
50998 "validations/CloseStops/XXX/related_objects/XX/object_type"
50998 "validations/CloseStops/XXX/related_objects/XX/name"
50998 "validations/CloseStops/XXX/related_objects/XX/id"
50998 "validations/CloseStops/XXX/related_objects/XX"
38169 "validations/CloseStops/XXX/related_objects/XXX/object_type"
38169 "validations/CloseStops/XXX/related_objects/XXX/name"
38169 "validations/CloseStops/XXX/related_objects/XXX/id"
38169 "validations/CloseStops/XXX/related_objects/XXX"
8675 "validations/CloseStops/XXX/related_objects/X/object_type"
8675 "validations/CloseStops/XXX/related_objects/X/name"
8675 "validations/CloseStops/XXX/related_objects/X/id"
8675 "validations/CloseStops/XXX/related_objects/X"
4702 "validations/CloseStops/XX/related_objects/XX/object_type"
4702 "validations/CloseStops/XX/related_objects/XX/name"
4702 "validations/CloseStops/XX/related_objects/XX/id"
4702 "validations/CloseStops/XX/related_objects/XX"
3861 "validations/CloseStops/XX/related_objects/XXX/object_type"
3861 "validations/CloseStops/XX/related_objects/XXX/name"
3861 "validations/CloseStops/XX/related_objects/XXX/id"
3861 "validations/CloseStops/XX/related_objects/XXX"
 900 "validations/CloseStops/XXX/severity"
 900 "validations/CloseStops/XXX/related_objects"
 900 "validations/CloseStops/XXX/object_type"
 900 "validations/CloseStops/XXX/object_name"
 900 "validations/CloseStops/XXX/object_id"
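For readers without jq at hand, the same path census can be sketched in Python. This is only an illustration of the counting technique: the `sample` document is made up, and only the `validations` / `CloseStops` / `related_objects` / `id` names are taken from the output above.

```python
import re
from collections import Counter


def iter_paths(node, prefix=()):
    """Yield every path in a parsed JSON value, like jq's `path(..)`."""
    yield prefix
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_paths(value, prefix + (str(key),))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_paths(value, prefix + (str(index),))


def count_masked_paths(document):
    """Count paths after replacing every digit with X (the sed step)."""
    counts = Counter()
    for path in iter_paths(document):
        counts[re.sub(r"[0-9]", "X", "/".join(path))] += 1
    return counts


# Made-up miniature document, not real validator output.
sample = {
    "validations": {
        "CloseStops": [
            {"object_id": "a", "related_objects": [{"id": "t1"}, {"id": "t2"}]},
            {"object_id": "b", "related_objects": [{"id": "t3"}]},
        ]
    }
}

counts = count_masked_paths(sample)
for path, count in counts.most_common():
    if path.startswith("validations/CloseStops"):
        print(f"{count:6d} {path}")
```

For a real file, `sample` would be replaced by the result of `json.load` on the validation output.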

I also discussed this with @antoine-de, and indeed:

  • the related objects of the close stops have a large cardinality, because they refer to trips / stop times
  • one optimisation will be to store only the route rather than the trip; some information is lost, but the result is probably good enough for validators
  • this is preferable to trimming the list of errors for now (something we try to avoid at the moment)
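The route-instead-of-trip idea discussed above can be sketched as follows. This is a hypothetical illustration in Python, not the validator's actual code (the validator itself is a Rust project): the `"Trip"` and `"Route"` values of `object_type`, the `trip_to_route` mapping and the sample identifiers are all assumptions, and route names are left out for brevity.

```python
def collapse_to_routes(issue, trip_to_route):
    """Return a copy of `issue` where trips are replaced by de-duplicated routes.

    `trip_to_route` is a hypothetical mapping from trip id to route id.
    """
    seen = set()
    collapsed = []
    for obj in issue.get("related_objects", []):
        route_id = None
        if obj.get("object_type") == "Trip":
            route_id = trip_to_route.get(obj.get("id"))
        if route_id is not None:
            key = ("Route", route_id)
            replacement = {"id": route_id, "object_type": "Route"}
        else:
            # Non-trip objects and trips with no known route are kept unchanged.
            key = (obj.get("object_type"), obj.get("id"))
            replacement = obj
        if key not in seen:
            seen.add(key)
            collapsed.append(replacement)
    return {**issue, "related_objects": collapsed}


# Made-up issue: three trips running on two routes, plus one stop.
issue = {
    "object_id": "stop_1",
    "related_objects": [
        {"id": "stop_2", "object_type": "Stop", "name": "Gare"},
        {"id": "trip_a", "object_type": "Trip", "name": None},
        {"id": "trip_b", "object_type": "Trip", "name": None},
        {"id": "trip_c", "object_type": "Trip", "name": None},
    ],
}
trip_to_route = {"trip_a": "route_1", "trip_b": "route_1", "trip_c": "route_2"}

smaller = collapse_to_routes(issue, trip_to_route)
print(len(issue["related_objects"]), "->", len(smaller["related_objects"]))
```

The gain comes from the de-duplication: many trips usually share a route, so the list shrinks while every issue is still reported.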

I'll come back to this later with a change.

from transport-validator.

thbar commented on August 21, 2024

Another useful query, by @antoine-de:

SELECT
    validations.resource_id,
    close_stops->>'object_id' AS object_id,
    close_stops->>'details',
    json_array_length(close_stops->'related_objects') AS length
FROM validations,
    json_array_elements(validations.details->'CloseStops') AS close_stops
WHERE validations.resource_id IS NOT NULL
ORDER BY length DESC;
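The same ranking can be sketched outside the database, for instance on exported rows. The Python below is a rough equivalent under assumptions: each row is taken to be a dict with `resource_id` and an already-parsed `details`, the sample rows are invented, and a missing `related_objects` list is counted as 0 (the SQL version would return NULL there).

```python
def rank_close_stops(validations):
    """List (resource_id, object_id, length) for CloseStops issues, largest first."""
    rows = []
    for validation in validations:
        if validation.get("resource_id") is None:
            continue
        details = validation.get("details") or {}
        for issue in details.get("CloseStops", []):
            related = issue.get("related_objects") or []
            rows.append((validation["resource_id"], issue.get("object_id"), len(related)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Invented rows standing in for the `validations` table.
validations = [
    {"resource_id": 1, "details": {"CloseStops": [
        {"object_id": "s1", "related_objects": [{"id": "t1"}]},
        {"object_id": "s2", "related_objects": [{"id": "t1"}, {"id": "t2"}, {"id": "t3"}]},
    ]}},
    {"resource_id": None, "details": {"CloseStops": [
        {"object_id": "s3", "related_objects": [{"id": "t4"}] * 10},
    ]}},
    {"resource_id": 2, "details": {"CloseStops": [
        {"object_id": "s4", "related_objects": [{"id": "t5"}, {"id": "t6"}]},
    ]}},
]

ranking = rank_close_stops(validations)
for resource_id, object_id, length in ranking:
    print(resource_id, object_id, length)
```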


thbar commented on August 21, 2024

Solved via #105 for now. This reduces the payload by a factor of 14 for the largest file, and by a factor of 3 to 4 for more modest files, so it is a good improvement.

