Comments (22)
See #45.
Should be fixed with #56.
#56 is not a permanent fix. The proper solution is to rewrite node-gtfs to fully use streams in a non-blocking way. Splitting the data into chunks seems like a workaround.
@derhuerst: can you explain in more detail how using streams would solve the memory overflow issue?

My understanding is that if we use file streams instead of `parse`, this will only reduce the amount of memory consumed by the file contents. However, in our tests this did not seem to be what caused the memory overflow. It rather seems to be caused by the callbacks attached to each insert operation.

Thus, the solution implemented by @gerlacdt is not actually so much the division into chunks, but the use of `insertMany`. The division into chunks is only done because the Node.js driver internally runs into the same issue if a chunk is too big; see the sketch below.
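For illustration, a minimal sketch of that approach in modern async/await style (the `collection` handle and `rows` array are placeholders, not node-gtfs internals):

```js
// Insert rows in fixed-size chunks, so there is one round trip and one
// pending operation per chunk instead of one callback per row.
const CHUNK_SIZE = 10000;

async function insertInChunks(collection, rows) {
  for (let i = 0; i < rows.length; i += CHUNK_SIZE) {
    await collection.insertMany(rows.slice(i, i + CHUNK_SIZE));
  }
}
```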
Node streams are basically just event emitters that "hold" data. Streams can be connected to other streams (see `.pipe()`). One can also read from streams manually (via the `data` event). This alone is nothing special when dealing with large amounts of data (larger than memory). But streams have a backpressure mechanism, which keeps the amount of data "held" in memory low. If one uses streams properly (no blocking operations, no non-flowing stream usage), Node.js scripts can deal with any amount of data. See the Stream Handbook for best practices.
In the build script, synchronous operations are performed. Also, all the data "held" in a stream is read into memory. This is why I think the Node process runs out of memory.
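For contrast, a rough sketch of backpressure-aware stream usage (assuming the `csv-parse` package; the file name and the insert logic are placeholders):

```js
const fs = require('fs');
const {Writable} = require('stream');
const parse = require('csv-parse'); // assumed CSV parser

const sink = new Writable({
  objectMode: true,
  write(row, _enc, callback) {
    // Insert `row` into the db here. Calling `callback` only once the
    // insert has finished is what propagates backpressure up the pipe.
    callback();
  },
});

fs.createReadStream('stop_times.txt')
  .pipe(parse({columns: true}))
  .pipe(sink)
  .on('finish', () => console.log('done'));
```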
I think @balmoovel is right. The out-of-memory error was not caused by loading the files completely into memory.
It was caused by the number of callbacks created in a loop.
Pseudo-algorithm:

```
loop files (agency.txt, stops.txt, stop_times.txt)
    load file
    get csv lines
    loop lines
        mongodb.insert-callback(line)
```
For stop_times.txt this opened 2.6 million callbacks asynchronously! Too many for Node...
So switching to "streams" is really nice-to-have and more Node-like, but will not solve the problem.
By the way, inserting millions of entries one by one is always worse than inserting in batch mode (10,000 at once)!
> I think @balmoovel is right. The out-of-memory error was not caused by loading the files completely into memory.
> It was caused by the number of callbacks created in a loop.
I see this as a problem. For every file, it loads the whole file (or the part of it that could be loaded from disk) into memory. The data gets transformed and then written into the db. But as long as the communication with the db isn't done, it's all still in memory.

Since the script loads the data synchronously (via a `while` loop), there's no chance for the db layer to actually write stuff to the db while it receives more and more data.
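As a simplified, self-contained stand-in for that blocking pattern (`fakeInsert` imitates an async db insert; nothing here is the actual build script):

```js
// The synchronous loop never yields to the event loop, so none of the
// queued callbacks can run until the whole loop has finished; millions
// of pending inserts pile up in memory in the meantime.
function fakeInsert(line, cb) {
  setImmediate(cb); // stands in for an async db insert
}

const lines = new Array(2600000).fill('stop_time row');
let i = 0;
while (i < lines.length) {
  fakeInsert(lines[i], () => {}); // queues one pending callback per line
  i++;
}
```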
> For stop_times.txt this opened 2.6 million callbacks asynchronously! Too many for Node...

AFAIK that's not a problem. (Again, it's the data that is kept in memory that's the problem.)

> So switching to "streams" is really nice-to-have and more Node-like, but will not solve the problem.

As I said, the script already uses streams, but not properly. Streams would solve the problem, since they limit the amount of data being read from the files. Therefore, Node only needs to keep track of a small amount of data & callbacks.

> By the way, inserting millions of entries one by one is always worse than inserting in batch mode (10,000 at once)!

With very limited memory (try running the script on a Raspberry Pi or a VPS), properly implemented streams (working one by one) are still better suited than batch operations. But nevertheless, one can combine streams and batch operations (see the sketch below). (;
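One possible way to combine the two (a sketch; `collection` is an assumed MongoDB collection handle, and the batch size is arbitrary):

```js
const {Writable} = require('stream');

// A writable sink that buffers incoming rows and flushes each batch with
// a single insertMany; withholding the callback until the insert finishes
// lets the stream's backpressure throttle the reader.
function batchedInsert(collection, batchSize = 10000) {
  let batch = [];
  return new Writable({
    objectMode: true,
    write(row, _enc, callback) {
      batch.push(row);
      if (batch.length < batchSize) return callback();
      const rows = batch;
      batch = [];
      collection.insertMany(rows).then(() => callback(), callback);
    },
    final(callback) {
      // Flush whatever is left once the source stream has ended.
      if (batch.length === 0) return callback();
      collection.insertMany(batch).then(() => callback(), callback);
    },
  });
}
```

A parser stream could then simply be `.pipe()`d into such a sink.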
As I have said before:

> However, in our tests this did not seem to be what caused the memory overflow. It rather seems to be caused by the callbacks attached to each insert operation.

We have tested this, and to our understanding, reading the file into memory is not a problem on systems with gigabytes of memory. The problem is the millions of callbacks. If you think that our assumption is wrong, or that you can provide a faster and more stable fix yourself, then please do.
> We have tested this, and to our understanding, reading the file into memory is not a problem on systems with gigabytes of memory.

V8 has a default memory limit of 1.5GB. Also, this script should work on a small VPS with just 512MB of memory.

> If you think that our assumption is wrong, or that you can provide a faster and more stable fix yourself, then please do.

I'm not willing to rewrite this build script, but I'd offer to help with it. I can also recommend things like promises and `co` for making the script more readable.
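Roughly, with `co`, the top-level flow could look like this (the two helper functions are made up for illustration):

```js
const co = require('co');

co(function* () {
  // `downloadGtfs` and `importFile` are hypothetical promise-returning helpers.
  const files = yield downloadGtfs();
  for (const file of files) {
    yield importFile(file); // one file at a time, no callback nesting
  }
}).catch(console.error);
```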
You may also want to have a look at my Berlin-specific GTFS build script.
@gerlacdt I tested your pull request and it worked.
I'm down to rewrite the download script with streams - it's about time it had an overhaul to be more readable.
thx for merging!
Just for information: after the fix, we discovered that we again had problems with out-of-memory errors on Node versions < 4.4.x. (`async.queue` is really memory-hungry...)
So the best option would be rewriting the script with streams. Although I don't know if this will solve all problems, because as far as I understand, `csv.parser.on("readable")` already uses a stream; see the sketch below.
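For reference, that non-flowing pattern looks roughly like this (`parser` stands in for the csv parser stream, `handleRow` for the per-row insert):

```js
parser.on('readable', () => {
  let row;
  // read() drains everything currently buffered; if handleRow only queues
  // an async insert per row, the pending callbacks still pile up.
  while ((row = parser.read()) !== null) {
    handleRow(row);
  }
});
```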
Happy coding!
@brendannee: if you still run into memory issues with streams, have a look at the latest https://github.com/moovel/node-gtfs; it works a bit more stably, using recursion.
Good luck, and thanks a lot for your work!
For what it's worth, none of these solutions are working for me:

- #56
- https://github.com/moovel/node-gtfs
- `--optimize_for_size --max_old_space_size=2000`

Feed: http://www.stm.info/sites/default/files/gtfs/gtfs_stm.zip

```
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
```
Hi,
the solutions proposed above are not working for me either. I keep getting the same error:

```
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
```
To those who cannot import their data using https://github.com/moovel/node-gtfs: can you post a link to the GTFS data you are trying to import?
Here is the link for the Chicago Transit Authority: http://www.transitchicago.com/downloads/sch_data/google_transit.zip
I found the link here: http://www.gtfs-data-exchange.com/agency/chicago-transit-authority/ (uploaded by cta-archiver on Apr 16, 2016).
I excluded calendar dates, fare attributes, fare rules, shapes, frequencies, and feed_info; it threw the error when processing the stop_times file.

Here is the link for Bay Area Rapid Transit: http://www.bart.gov/dev/schedules/google_transit.zip
Found here: http://www.gtfs-data-exchange.com/agency/bay-area-rapid-transit/ (uploaded by bart-archiver on Apr 04, 2016, 02:31).
I excluded calendar dates, fare attributes, fare rules, shapes, frequencies, and feed_info; it threw the error when processing transfers.
I can confirm the issue - I did not find any super quick fix so far, sorry. I promise to look into it when I have some time, but that could be weeks. :(
Here is the GTFS file I was working with: http://www.amt.qc.ca/xdata/trains/google_transit.zip
@brendannee @balmoovel
Fatal error while processing a 3-million-line file.
Source: https://api.transport.nsw.gov.au/v1/publictransport/timetables/complete/gtfs
My experience is the same as @nlambert's:
> For what it's worth, none of these solutions are working for me:
>
> - #56
> - https://github.com/moovel/node-gtfs
> - `--optimize_for_size --max_old_space_size=2000`
I just pushed an update that may help handle importing very large GTFS files.
Try out your large GTFS files and let me know what errors, if any, you get.
Thanks @brendannee, it's working great!
Closing this issue - please comment if you still have memory issues.