Comments (22)
See #45.
Should be fixed with #56.
#56 is not a permanent fix. The proper solution is to rewrite node-gtfs to fully use streams in a non-blocking way. Splitting the data into chunks seems like a workaround.
@derhuerst: can you explain in more detail how using streams would solve the memory overflow issue?

My understanding is that if we use file streams instead of `parse`, this will only reduce the amount of memory consumed by the file contents. However, in our tests this did not seem to be what caused the memory overflow. It rather seems to be caused by the callbacks attached to each insert operation.

Thus, the solution implemented by @gerlacdt is not actually so much the division into chunks, but the use of `insertMany`. The division into chunks is only done because the Node.js driver internally runs into the same issue if a chunk is too big; see the sketch below.
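For illustration, a minimal sketch of that approach in modern async/await style (the `collection` handle and `rows` array are placeholders, not node-gtfs internals):

```js
// Insert rows in fixed-size chunks, so there is one round trip and one
// pending operation per chunk instead of one callback per row.
const CHUNK_SIZE = 10000;

async function insertInChunks(collection, rows) {
  for (let i = 0; i < rows.length; i += CHUNK_SIZE) {
    await collection.insertMany(rows.slice(i, i + CHUNK_SIZE));
  }
}
```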
Node streams are basically just event emitters that "hold" data. Streams can be connected to other streams (see `.pipe()`). One can also read from streams manually (via the `data` event). This alone is nothing special when dealing with large amounts of data (larger than memory). But streams have a backpressure mechanism, which keeps the amount of data "held" in memory low. If one uses streams properly (no blocking operations, no non-flowing stream usage), Node.js scripts can deal with any amount of data. See the Stream Handbook for best practices.
In the build script, synchronous operations are performed. Also, all the data "held" in a stream is read into memory. This is why I think the Node process runs out of memory.
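For contrast, a rough sketch of backpressure-aware stream usage (assuming the `csv-parse` package; the file name and the insert logic are placeholders):

```js
const fs = require('fs');
const {Writable} = require('stream');
const parse = require('csv-parse'); // assumed CSV parser

const sink = new Writable({
  objectMode: true,
  write(row, _enc, callback) {
    // Insert `row` into the db here. Calling `callback` only once the
    // insert has finished is what propagates backpressure up the pipe.
    callback();
  },
});

fs.createReadStream('stop_times.txt')
  .pipe(parse({columns: true}))
  .pipe(sink)
  .on('finish', () => console.log('done'));
```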
I think @balmoovel is right. The out-of-memory error was not caused by loading the files completely into memory.
It was caused by the number of callbacks created in a loop.
Pseudo-algorithm:

```
loop files (agency.txt, stops.txt, stop_times.txt)
    load file
    get csv lines
    loop lines
        mongodb.insert-callback(line)
```
For stop_times.txt this opened 2.6 million callbacks asynchronously! Too many for Node...
So switching to "streams" is really nice-to-have and more Node-like, but will not solve the problem.
By the way, inserting millions of entries one by one is always worse than inserting in batch mode (10,000 at once)!
> I think @balmoovel is right. The out-of-memory error was not caused by loading the files completely into memory.
> It was caused by the number of callbacks created in a loop.
I see this as a problem. For every file, it loads the whole file (or the part of it that could be loaded from disk) into memory. The data gets transformed and then written into the db. But as long as the communication with the db isn't done, it's all still in memory.

Since the script loads the data synchronously (via a `while` loop), there's no chance for the db layer to actually write stuff to the db while it receives more and more data.
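As a simplified, self-contained stand-in for that blocking pattern (`fakeInsert` imitates an async db insert; nothing here is the actual build script):

```js
// The synchronous loop never yields to the event loop, so none of the
// queued callbacks can run until the whole loop has finished; millions
// of pending inserts pile up in memory in the meantime.
function fakeInsert(line, cb) {
  setImmediate(cb); // stands in for an async db insert
}

const lines = new Array(2600000).fill('stop_time row');
let i = 0;
while (i < lines.length) {
  fakeInsert(lines[i], () => {}); // queues one pending callback per line
  i++;
}
```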
> For stop_times.txt this opened 2.6 million callbacks asynchronously! Too many for Node...

AFAIK that's not a problem. (Again, it's the data that is kept in memory that's the problem.)

> So switching to "streams" is really nice-to-have and more Node-like, but will not solve the problem.

As I said, the script already uses streams, but not properly. Streams would solve the problem, since they limit the amount of data being read from the files. Therefore, Node only needs to keep track of a small amount of data & callbacks.

> By the way, inserting millions of entries one by one is always worse than inserting in batch mode (10,000 at once)!

With very limited memory (try running the script on a Raspberry Pi or a VPS), properly implemented streams (working one by one) are still better suited than batch operations. But nevertheless, one can combine streams and batch operations (see the sketch below). (;
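One possible way to combine the two (a sketch; `collection` is an assumed MongoDB collection handle, and the batch size is arbitrary):

```js
const {Writable} = require('stream');

// A writable sink that buffers incoming rows and flushes each batch with
// a single insertMany; withholding the callback until the insert finishes
// lets the stream's backpressure throttle the reader.
function batchedInsert(collection, batchSize = 10000) {
  let batch = [];
  return new Writable({
    objectMode: true,
    write(row, _enc, callback) {
      batch.push(row);
      if (batch.length < batchSize) return callback();
      const rows = batch;
      batch = [];
      collection.insertMany(rows).then(() => callback(), callback);
    },
    final(callback) {
      // Flush whatever is left once the source stream has ended.
      if (batch.length === 0) return callback();
      collection.insertMany(batch).then(() => callback(), callback);
    },
  });
}
```

A parser stream could then simply be `.pipe()`d into such a sink.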
As I have said before:

> However, in our tests this did not seem to be what caused the memory overflow. It rather seems to be caused by the callbacks attached to each insert operation.

We have tested this, and to our understanding, reading the file into memory is not a problem on systems with gigabytes of memory. The problem is the millions of callbacks. If you think that our assumption is wrong, or that you can provide a faster and more stable fix yourself, then please do.
> We have tested this, and to our understanding, reading the file into memory is not a problem on systems with gigabytes of memory.

V8 has a default memory limit of 1.5GB. Also, this script should work on a small VPS with just 512MB of memory.

> If you think that our assumption is wrong, or that you can provide a faster and more stable fix yourself, then please do.

I'm not willing to rewrite this build script, but I'd offer to help with it. I can also recommend things like promises and `co` for making the script more readable.
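Roughly, with `co`, the top-level flow could look like this (the two helper functions are made up for illustration):

```js
const co = require('co');

co(function* () {
  // `downloadGtfs` and `importFile` are hypothetical promise-returning helpers.
  const files = yield downloadGtfs();
  for (const file of files) {
    yield importFile(file); // one file at a time, no callback nesting
  }
}).catch(console.error);
```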
You may also want to have a look at my Berlin-specific GTFS build script.
@gerlacdt I tested your pull request and it worked.
I'm down to rewrite the download script with streams - it's about time it had an overhaul to be more readable.
thx for merging!
Just for information: after the fix, we discovered that we again had problems with out-of-memory errors on Node versions < 4.4.x. (`async.queue` is really memory-hungry...)
So the best option would be rewriting the script with streams. Although I don't know if this will solve all problems, because as far as I understand, `csv.parser.on("readable")` already uses a stream; see the sketch below.
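For reference, that non-flowing pattern looks roughly like this (`parser` stands in for the csv parser stream, `handleRow` for the per-row insert):

```js
parser.on('readable', () => {
  let row;
  // read() drains everything currently buffered; if handleRow only queues
  // an async insert per row, the pending callbacks still pile up.
  while ((row = parser.read()) !== null) {
    handleRow(row);
  }
});
```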
Happy coding!
@brendannee: if you still run into memory issues with streams, have a look at the latest https://github.com/moovel/node-gtfs; it works a bit more stably, using recursion.
Good luck, and thanks a lot for your work!
For what it's worth, none of these solutions are working for me:

- #56
- https://github.com/moovel/node-gtfs
- `--optimize_for_size --max_old_space_size=2000`

Feed: http://www.stm.info/sites/default/files/gtfs/gtfs_stm.zip

```
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
```
Hi,
the solutions proposed above are not working for me either. I keep getting the same error:

```
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
```
To those who cannot import their data using https://github.com/moovel/node-gtfs: can you post a link to the GTFS data you are trying to import?
Here is the link for the Chicago Transit Authority: http://www.transitchicago.com/downloads/sch_data/google_transit.zip
I found the link here: http://www.gtfs-data-exchange.com/agency/chicago-transit-authority/ (uploaded by cta-archiver on Apr 16, 2016).
I excluded calendar dates, fare attributes, fare rules, shapes, frequencies, and feed_info; it threw the error when processing the stop_times file.

Here is the link for Bay Area Rapid Transit: http://www.bart.gov/dev/schedules/google_transit.zip
Found here: http://www.gtfs-data-exchange.com/agency/bay-area-rapid-transit/ (uploaded by bart-archiver on Apr 04, 2016, 02:31).
I excluded calendar dates, fare attributes, fare rules, shapes, frequencies, and feed_info; it threw the error when processing transfers.
I can confirm the issue - I did not find any super quick fix so far, sorry. I promise to look into it when I have some time, but that could be weeks. :(
Here is the GTFS file I was working with: http://www.amt.qc.ca/xdata/trains/google_transit.zip
@brendannee @balmoovel
Fatal error while processing a 3-million-line file.
Source: https://api.transport.nsw.gov.au/v1/publictransport/timetables/complete/gtfs
My experience is the same as @nlambert's:
> For what it's worth, none of these solutions are working for me:
>
> - #56
> - https://github.com/moovel/node-gtfs
> - `--optimize_for_size --max_old_space_size=2000`
I just pushed an update that may help handle importing very large GTFS files.
Try out your large GTFS files and let me know what errors, if any, you get.
Thanks @brendannee, it's working great!
Closing this issue - please comment if you still have memory issues.