Comments (22)

derhuerst commented on July 20, 2024

See #45.

gerlacdt commented on July 20, 2024

Should be fixed with #56.

derhuerst commented on July 20, 2024

#56 is not a permanent fix. The proper solution is to rewrite node-gtfs to fully use streams in a non-blocking way; splitting the data into chunks is just a workaround.

balmoovel commented on July 20, 2024

@derhuerst: can you explain in more detail how using streams would solve the memory overflow issue?

My understanding is that if we use file streams instead of parsing whole files, this will only reduce the amount of memory consumed by the file contents. However, in our tests this did not seem to be what caused the memory overflow; that rather seems to be caused by the callbacks attached to each insert operation.

Thus the solution implemented by @gerlacdt is not actually so much the division into chunks as the use of insertMany. The division into chunks is only done because the Node.js driver internally runs into the same issue if a chunk is too big.
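For illustration, a minimal sketch of that approach, assuming the official mongodb driver and rows that have already been parsed into an array (BATCH_SIZE and importRows are made-up names):

    // Insert rows in fixed-size chunks via insertMany instead of
    // registering one insert callback per row.
    const BATCH_SIZE = 10000; // chunk size is arbitrary

    async function importRows(collection, rows) {
      for (let i = 0; i < rows.length; i += BATCH_SIZE) {
        // one pending operation per chunk instead of one per row
        await collection.insertMany(rows.slice(i, i + BATCH_SIZE));
      }
    }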

derhuerst commented on July 20, 2024

Node streams are basically just event emitters that "hold" data. Streams can be connected to other streams (see .pipe()), and one can also read from streams manually (via the data event). This alone is nothing special when dealing with large amounts of data (larger than memory). But streams have a backpressure mechanism, which keeps the amount of data "held" in memory low. If one uses streams properly (no blocking operations, no non-flowing stream usage), Node.js scripts can deal with any amount of data. Consider the Stream Handbook for best practices.

The build script does synchronous operations, and all the data "held" in a stream is read into memory at once. This is why I think the Node process runs out of memory.
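As a rough sketch of what "using streams properly" could look like here (assuming the csv-parse package and an existing collection handle; this is not the actual build script):

    const fs = require('fs');
    const { parse } = require('csv-parse'); // assumed CSV parser
    const { Writable } = require('stream');

    // The write callback fires only after the row has reached the db,
    // so .pipe() applies backpressure and only a handful of rows are
    // held in memory at any time.
    const toDb = new Writable({
      objectMode: true,
      write(row, _enc, done) {
        collection.insertOne(row).then(() => done(), done); // collection is assumed
      },
    });

    fs.createReadStream('stop_times.txt')
      .pipe(parse({ columns: true }))
      .pipe(toDb)
      .on('finish', () => console.log('import done'));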

gerlacdt commented on July 20, 2024

I think @balmoovel is right. The out-of-memory error was not caused by loading the files completely into memory.
It was caused by the number of callbacks created in a loop.

Pseudo-Algorithm:

    for (const file of ['agency.txt', 'stops.txt', 'stop_times.txt']) {
      const lines = parseCsvLines(loadFile(file)); // whole file in memory
      for (const line of lines) {
        mongodb.insert(line, callback); // one pending callback per line
      }
    }

For stop_times.txt this created 2.6 million callbacks asynchronously! Too many for Node...

So switching to "streams" is really a nice-to-have and more Node-like, but it will not solve the problem.

By the way, inserting millions of entries one by one is always worse than inserting in batch mode (e.g. 10,000 at once)!

derhuerst commented on July 20, 2024

> I think @balmoovel is right. The out-of-memory error was not caused by loading the files completely into memory.
> It was caused by the number of callbacks created in a loop.

I see this as a problem. For every file, the script loads the whole file (or whatever part of it has been read from disk so far) into memory. The data gets transformed and then written into the db, but as long as the communication with the db isn't done, it all stays in memory.

Since the script loads the data synchronously (via a while loop), there's no chance for the db layer to actually write anything to the db while it receives more and more data.

> For stop_times.txt this created 2.6 million callbacks asynchronously! Too many for Node...

AFAIK that's not a problem. (Again, the data that is kept in memory is the problem.)

> So switching to "streams" is really a nice-to-have and more Node-like, but it will not solve the problem.

As I said, the script already uses streams, but not properly. Streams would solve the problem, since they limit the amount of data being read from the files at any one time; Node then only needs to keep track of a small amount of data & callbacks.

> By the way, inserting millions of entries one by one is always worse than inserting in batch mode (e.g. 10,000 at once)!

With very limited memory (try running the script on a Raspberry Pi or a small VPS), properly implemented streams (working one by one) are still better suited than batch operations. But nevertheless, one can combine streams and batch operations. (;
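A sketch of such a combination, a hypothetical Transform stage that groups rows into batches for a downstream insertMany-based writer:

    const { Transform } = require('stream');

    // Groups incoming rows into arrays of `size` so a downstream writer
    // can call insertMany once per batch, while backpressure still
    // limits how much sits in memory.
    function batch(size) {
      let buf = [];
      return new Transform({
        objectMode: true,
        transform(row, _enc, done) {
          buf.push(row);
          if (buf.length >= size) { this.push(buf); buf = []; }
          done();
        },
        flush(done) {
          if (buf.length) this.push(buf);
          done();
        },
      });
    }

    // usage: csvStream.pipe(batch(10000)).pipe(batchedDbWriter)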

balmoovel commented on July 20, 2024

As I have said before:

> However, in our tests this did not seem to be what caused the memory overflow; that rather seems to be caused by the callbacks attached to each insert operation.

We have tested this, and to our understanding reading the file into memory is not a problem on systems with gigabytes of memory. The problem is the millions of callbacks. If you think that either our assumption is wrong or that you can provide a faster and more stable fix yourself, then please do.

derhuerst commented on July 20, 2024

> We have tested this, and to our understanding reading the file into memory is not a problem on systems with gigabytes of memory.

V8 has a default memory limit of 1.5GB. Also, this script should work on a small VPS with just 512MB of memory.
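For reference, that default can be raised per run with a V8 flag (the script name here is a placeholder):

    node --max_old_space_size=4096 import.js   # size is given in MB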

> If you think that either our assumption is wrong or that you can provide a faster and more stable fix yourself, then please do.

I'm not willing to rewrite this build script, but I'd offer to help with it. I can also recommend things like promises and co for making the script more readable.

You may also want to have a look at my Berlin-specific GTFS build script.

brendannee commented on July 20, 2024

@gerlacdt I tested your pull request and it worked.

I'm down to rewrite the download script with streams - it's about time it had an overhaul to be more readable.

gerlacdt commented on July 20, 2024

@brendannee

thx for merging!

Just for information: after the fix, we discovered that we again had out-of-memory errors with Node versions < 4.4.x. (Async.queue is really memory-hungry...)

So the best option would be rewriting the script with streams, although I don't know if this will solve all problems, because as far as I understand, csv.parser.on("readable") already uses a stream.

Happy coding!

balmoovel commented on July 20, 2024

@brendannee: if you still run into memory issues with streams, have a look at the latest https://github.com/moovel/node-gtfs; it works a bit more stably using recursion.
Good luck and thanks a lot for your work!
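For context, a minimal sketch of what insertion "using recursion" could look like (a hypothetical helper, not the fork's actual code). The next chunk is only scheduled from the previous chunk's callback, so only one insert is pending at a time:

    // Recursion replaces the loop that would otherwise register all
    // insert callbacks up front.
    function insertChunks(collection, chunks, done) {
      if (chunks.length === 0) return done();
      collection.insertMany(chunks[0], err => {
        if (err) return done(err);
        insertChunks(collection, chunks.slice(1), done);
      });
    }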

nlambert commented on July 20, 2024

For what it's worth, none of these solutions are working for me.

http://www.stm.info/sites/default/files/gtfs/gtfs_stm.zip

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory

melisoner2006 commented on July 20, 2024

Hi,
the proposed solutions above are not working for me either. I keep getting the same error:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory

balmoovel commented on July 20, 2024

To those who cannot import their data using https://github.com/moovel/node-gtfs: can you post a link to the GTFS data you are trying to import?

melisoner2006 commented on July 20, 2024

Here is the link to the Chicago Transit Authority feed: http://www.transitchicago.com/downloads/sch_data/google_transit.zip

I found the link here: http://www.gtfs-data-exchange.com/agency/chicago-transit-authority/
Uploaded by cta-archiver on Apr 16 2016

I excluded calendar dates, fare attributes, fare rules, shapes, frequencies, and feed_info; it threw the error when processing the stop_times file.

Here is the link to the Bay Area Rapid Transit feed: http://www.bart.gov/dev/schedules/google_transit.zip

Found here: http://www.gtfs-data-exchange.com/agency/bay-area-rapid-transit/
Uploaded by bart-archiver on Apr 04 2016 02:31
I excluded calendar dates, fare attributes, fare rules, shapes, frequencies, and feed_info; it threw the error when processing transfers.

balmoovel commented on July 20, 2024

I can confirm the issue - I did not find any super quick fix so far, sorry. I promise to look into it when I have some time, but that could take weeks. :(

nlambert commented on July 20, 2024

Here is the GTFS file I was working with:

http://www.amt.qc.ca/xdata/trains/google_transit.zip

senpai-notices commented on July 20, 2024

@brendannee @balmoovel
I get a fatal error while processing a 3-million-line file.
Source: https://api.transport.nsw.gov.au/v1/publictransport/timetables/complete/gtfs
My experience is the same as @nlambert's:

> For what it's worth, none of these solutions are working for me.

What I tried: #56, https://github.com/moovel/node-gtfs, and running with --optimize_for_size --max_old_space_size=2000.

brendannee commented on July 20, 2024

I just pushed an update that may help handle importing very large GTFS files.

Try out your large GTFS files and let me know what errors, if any, you get.

senpai-notices commented on July 20, 2024

Thanks @brendannee, it's working great!

brendannee commented on July 20, 2024

Closing this issue - please comment if you still have memory issues.
