Prerequisites [ x ] I'm using the last version. 5.44.0


High memory usage about metascraper HOT 9 CLOSED

DanDubinsky commented on August 31, 2024

High memory usage

from metascraper.

Comments (9)

DanDubinsky commented on August 31, 2024

Here's an update. I wasn't able to use that minify-html lib; yarn add fails with errors for some reason. I tried https://www.npmjs.com/package/html-minifier instead. It fixes the memory issue, but metascraper then returns null for the description, image and title properties, so somehow the minified HTML is confusing metascraper.

Now I'm trying something at the OS level and it seems promising so far. The containers run Debian Linux, and I was able to replace the default memory allocator with jemalloc. I made the switch at 15:00. There were lots of spikes and OOMs before, and no spikes for over an hour after. The RSS memory also seems to go up and down now instead of only up. I'll let it run like this for a few days, see how it behaves, and report the findings back here in case anyone else has this issue.
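
To watch whether RSS keeps climbing or settles after a change like the allocator swap, a small sampler can log the process counters over time. This is only a minimal sketch: the helper names (toMB, sampleMemory, startMemoryLog) and the 60 second interval are assumptions for illustration, and it shows trends only. It does not replace a real profile such as a flame graph.

```javascript
// Hypothetical helper: converts bytes to megabytes with one decimal place.
function toMB (bytes) {
  return Math.round((bytes / 1024 / 1024) * 10) / 10
}

// Takes one snapshot of the counters reported by process.memoryUsage().
function sampleMemory () {
  const { rss, heapUsed, heapTotal, external } = process.memoryUsage()
  return {
    time: new Date().toISOString(),
    rssMB: toMB(rss),
    heapUsedMB: toMB(heapUsed),
    heapTotalMB: toMB(heapTotal),
    externalMB: toMB(external)
  }
}

// Logs a snapshot every intervalMs (assumed default: 60 s).
// Returns a function that stops the sampling.
function startMemoryLog (intervalMs = 60000, log = console.log) {
  const timer = setInterval(() => log(JSON.stringify(sampleMemory())), intervalMs)
  timer.unref() // do not keep the process alive just for logging
  return () => clearInterval(timer)
}
```

Calling startMemoryLog() once at startup gives one JSON line per minute, which makes it easier to compare RSS before and after a change.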

from metascraper.

Kikobeats commented on August 31, 2024

As I suggested in the previous issue, can you use https://github.com/davidmarkclements/0x?

console.log alone is not enough for profiling memory usage; we need to understand what is happening inside Node internals.

from metascraper.

DanDubinsky commented on August 31, 2024

Sure, here are two flame graphs. Looks like parsing.

from metascraper.

DanDubinsky commented on August 31, 2024

Here's another hot section on the flame graph. If you would like to see the condition for yourself, I think you can just run that index.js file: save one file as package.json and the other as index.js, then run yarn and 0x -o index.js.

Thanks,
Dan

from metascraper.

DanDubinsky commented on August 31, 2024

I have a question about the package. For these rules, how much of the HTML does it need to examine? Is it just meta tags?

require('metascraper-description')(),
require('metascraper-image')(),
require('metascraper-title')(),
require('metascraper-url')(),

I'm thinking maybe I can write a simple preprocessor to strip out everything but the meta tags. With smaller HTML, I'm hoping the parser will need a lot less memory. I'm starting to get a little desperate here. It seems that our end users have been entering a lot of links into our app that trigger memory spikes and OOM errors: maybe 30 or 40 out-of-memory errors in just the last day or so.
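
A preprocessor along those lines could look roughly like the sketch below. It is an illustration only: keepMetadataTags and METADATA_PATTERNS are hypothetical names, the choice of tags to keep (title, meta, link, JSON-LD scripts) is an assumption, and the regexes are a rough filter and not an HTML parser, so an attribute value containing ">" would confuse it. Rules that read elements in the page body would find nothing in the reduced document.

```javascript
// Hypothetical list of tags assumed to carry metadata.
// Regex-based matching: a rough filter, not a real HTML parser.
const METADATA_PATTERNS = [
  /<title\b[^>]*>[\s\S]*?<\/title>/gi,
  /<meta\b[^>]*>/gi,
  /<link\b[^>]*>/gi,
  /<script\b[^>]*type=["']application\/ld\+json["'][^>]*>[\s\S]*?<\/script>/gi
]

// Returns a small HTML document containing only the matched tags.
function keepMetadataTags (html) {
  const source = String(html) // accepts a Buffer or a string
  const kept = []
  for (const pattern of METADATA_PATTERNS) {
    for (const match of source.matchAll(pattern)) {
      kept.push(match[0])
    }
  }
  return `<!doctype html><html><head>${kept.join('')}</head><body></body></html>`
}
```

The reduced string would then be passed as the html argument in place of the original page, for example metascraper({ url, html: keepMetadataTags(html) }).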

from metascraper.

Kikobeats commented on August 31, 2024

Interesting proposal. It depends on the package, e.g.:
https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-title/src/index.js

Removing DOM elements is clearly going to help, but it could produce lower-quality results.

Maybe you can test if minifying HTML has any impact? This seems promising:
https://github.com/wilsonzlin/minify-html

from metascraper.

Kikobeats commented on August 31, 2024

Glad to see you're mastering it!

I'm still interested in exploring how we can make metascraper consume less memory, but it seems the memory issue should be fixed upstream in cheerio first: https://github.com/search?q=repo%3Acheeriojs%2Fcheerio+memory&type=issues

I'm going to determine if we can do something effective there. Thanks for digging into this.

from metascraper.

Kikobeats commented on August 31, 2024

@DanDubinsky can you test whether this helps reduce memory consumption?

const { minify } = require('html-minifier-terser')

html = await minify(html.toString(), {
  collapseWhitespace: true,
  conservativeCollapse: true,
  continueOnParseError: true,
  removeComments: true,
  collapseBooleanAttributes: true,
  collapseInlineTagWhitespace: true,
  includeAutoGeneratedTags: true,
  keepClosingSlash: true,
  minifyCSS: true,
  minifyJS: true,
  noNewlinesBeforeTagClose: true,
  preserveLineBreaks: true,
})

await metascraper({ url, html });

from metascraper.

DanDubinsky commented on August 31, 2024

Hey @Kikobeats,

The html-minifier-terser package helped with memory consumption to some degree (heap especially, RSS somewhat), but it seems to have broken metascraper, because the metadata output is all null values:

Meta data {
  description: null,
  image: null,
  title: null,
  url: 'https://docs.google.com/document/d/1bGgGlc1YXSiR3cwbGDhuZUF7bz9djnTlV1qMP0-xhMM?usp=sharing'
}

Also, it looks like using the jemalloc allocator only partially fixed the issue. It seems to have helped a lot with the slow memory leak that was causing our servers to crash after about 3 days, even when users don't paste links to large files in our app. Memory seems to be holding steady between 180 MB and 300 MB. But it hasn't helped with the massive memory spikes we get from individual large files, which cause the containers to crash at random times. We're still getting between 1 and 4 of those per day per container, depending on what the users post.

We seem to be getting fewer of them, but that is most likely because our users aren't as active over weekends.

Next week I'm going to see if I can identify whether the spike files have anything in common besides their size. We had this issue in the past with metascraper 5.0.3, but were able to work around it by skipping scraping for HTML over 3 MB. The issue seems to be worse in the latest metascraper version: here we are skipping all files over 2 MB and it's still crashing. If the files spiking the memory share some other attributes besides size, I can filter those out as well.
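
For anyone wanting to reproduce the size-based workaround, a guard of that kind might be sketched as follows. The names (MAX_HTML_BYTES, htmlByteLength, scrapeIfSmall) and the shape of the return value are assumptions for illustration; the scrape argument stands for whatever function does the real work, such as a configured metascraper instance.

```javascript
// Assumed threshold, taken from the 2 MB limit mentioned above.
const MAX_HTML_BYTES = 2 * 1024 * 1024

// Size in bytes of the HTML, whether it arrives as a string or a Buffer.
function htmlByteLength (html) {
  return typeof html === 'string' ? Buffer.byteLength(html, 'utf8') : html.length
}

// Calls `scrape` only when the HTML is under the limit.
// Oversized input is skipped without ever being parsed.
async function scrapeIfSmall ({ url, html }, scrape, maxBytes = MAX_HTML_BYTES) {
  const bytes = htmlByteLength(html)
  if (bytes > maxBytes) {
    return { skipped: true, bytes, metadata: null }
  }
  return { skipped: false, bytes, metadata: await scrape({ url, html }) }
}
```

Recording the byte count alongside each skipped or scraped URL would also give a data set for checking whether the spiking pages share anything besides size.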

Thanks,
Dan

from metascraper.
