High memory usage (metascraper issue, closed)

DanDubinsky commented on August 31, 2024
High memory usage

from metascraper.

Comments (9)

DanDubinsky commented on August 31, 2024

Here's an update. I wasn't able to use that minify-html lib; yarn add fails with errors for some reason. I tried https://www.npmjs.com/package/html-minifier instead, and it fixes the memory issue, but it also returns all nulls for the description, image and title properties, so somehow the minified HTML is confusing metascraper.

Now I'm trying something at the OS level and it seems promising so far. The containers are running Debian Linux and I was able to replace the default memory allocator with jemalloc. I made the change at 15:00. There were lots of spikes and OOMs before and no spikes for over an hour after. Also, the rss memory now seems to go up and down instead of only up. I'll let it run like this for a few days, see how it behaves, and then report the findings back here in case anyone else has this issue.

Screenshot 2024-02-16 at 4 12 29 PM
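
For anyone repeating the allocator swap, it can be useful to confirm that jemalloc is really loaded in the Node process. The comment does not say how the swap was done; the sketch below assumes it was done by preloading a shared library (for example through LD_PRELOAD), in which case the library shows up in the process memory map on Linux. The helper names are hypothetical.

```javascript
const fs = require('fs')

// Returns true if the given /proc/<pid>/maps text mentions a jemalloc library.
function mentionsJemalloc (mapsText) {
  return mapsText.split('\n').some(line => /jemalloc/i.test(line))
}

// Reads the memory map of the current process (Linux only).
// Returns null when /proc is not readable, e.g. on macOS or Windows.
function isJemallocLoaded () {
  try {
    return mentionsJemalloc(fs.readFileSync('/proc/self/maps', 'utf8'))
  } catch (err) {
    return null
  }
}

console.log('jemalloc loaded:', isJemallocLoaded())
```

A `false` result inside the container would suggest the preload did not take effect for that process.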

Kikobeats commented on August 31, 2024

as I suggested in the previous issue, can you use https://github.com/davidmarkclements/0x ?

just console.log is not enough for profiling memory usage; we need to understand what is happening inside the node internals

DanDubinsky commented on August 31, 2024

Sure, here are two flame graphs. Looks like parsing.
Screenshot 2024-02-14 at 8 57 17 PM
Screenshot 2024-02-14 at 8 57 56 PM

DanDubinsky commented on August 31, 2024

Here's another hot section on the flame graph. If you would like to see the condition for yourself I think you can just run that index.js file. Just save one file as package.json and the other as index.js and then just yarn and 0x -o index.js.

Screenshot 2024-02-14 at 9 08 42 PM

Thanks,
Dan

DanDubinsky commented on August 31, 2024

I have a question about the package. For these rules, how much of the HTML does it need to examine? Is it just meta tags?

require('metascraper-description')(),
require('metascraper-image')(),
require('metascraper-title')(),
require('metascraper-url')(),

I'm thinking maybe I can write a simple preprocessor to strip out everything but the meta tags. With smaller HTML I'm hoping the parser will need a lot less memory. I'm starting to get a little bit desperate here. It seems that our end users have been entering a lot of links into our app that trigger memory spikes and OOM errors. Maybe 30 or 40 out of memory errors in just the last day or so.

Screenshot 2024-02-16 at 9 01 36 AM
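
A minimal sketch of the preprocessor idea, assuming the four rules listed above can work from the contents of `<head>` alone. That assumption is unverified: any rule that falls back to body content would find nothing after this step. `keepHeadOnly` is a hypothetical helper, not part of metascraper, and it uses a naive regex rather than a real HTML parser.

```javascript
// Hypothetical helper: keep only the <head> section of an HTML document.
// Regex-based and deliberately naive; it is not a real HTML parser.
function keepHeadOnly (html) {
  // [\s>] after "<head" avoids matching <header>.
  const match = /<head[\s>][\s\S]*?<\/head\s*>/i.exec(html)
  // If no <head> is found, return the input untouched rather than guessing.
  if (!match) return html
  return `<!DOCTYPE html><html>${match[0]}<body></body></html>`
}

const sample =
  '<html><head><title>Hi</title><meta property="og:title" content="Hi"></head>' +
  '<body><p>a very large body</p></body></html>'

console.log(keepHeadOnly(sample))
```

The reduced string would then be passed as `html` to `metascraper({ url, html })` in place of the full page. Attributes on the original `<html>` tag are dropped, which would matter for rules that read them.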

Kikobeats commented on August 31, 2024

Interesting proposal. It depends on the package, e.g.:
https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-title/src/index.js

Removing DOM elements is clearly going to help, but it could produce less accurate results.

Maybe you can test if minifying HTML has any impact? This seems promising:
https://github.com/wilsonzlin/minify-html

Kikobeats commented on August 31, 2024

glad to see you're mastering it!

I'm still interested in exploring how we can make metascraper consume less memory, but it seems the memory issue should be fixed upstream in cheerio first: https://github.com/search?q=repo%3Acheeriojs%2Fcheerio+memory&type=issues

I'm going to determine if we can do something effective there, thanks for going deep with this.

Kikobeats commented on August 31, 2024

@DanDubinsky can you test whether this helps to reduce memory consumption?

const { minify } = require('html-minifier-terser')

html = await minify(html.toString(), {
  collapseWhitespace: true,
  conservativeCollapse: true,
  continueOnParseError: true,
  removeComments: true,
  collapseBooleanAttributes: true,
  collapseInlineTagWhitespace: true,
  includeAutoGeneratedTags: true,
  keepClosingSlash: true,
  minifyCSS: true,
  minifyJS: true,
  noNewlinesBeforeTagClose: true,
  preserveLineBreaks: true,
})

await metascraper({ url, html });

DanDubinsky commented on August 31, 2024

Hey @Kikobeats,

The html-minifier-terser package helped with the memory consumption to some degree (heap especially, rss less so), but it seems to have broken metascraper, because the metadata output is all null values:

Meta data {
  description: null,
  image: null,
  title: null,
  url: 'https://docs.google.com/document/d/1bGgGlc1YXSiR3cwbGDhuZUF7bz9djnTlV1qMP0-xhMM?usp=sharing'
}
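
To compare heap and rss between the minified and unminified runs in a repeatable way, a small wrapper around `process.memoryUsage()` can help. This is only a rough sketch: `withMemoryDelta` is a hypothetical helper, it samples before and after the call rather than tracking the peak, and garbage collection can make the deltas negative. It does not replace a profiler such as 0x.

```javascript
// Hypothetical helper: run an async function and report how much rss and
// heapUsed changed while it ran. Values are in MB, rounded to one decimal.
async function withMemoryDelta (fn) {
  const toMB = bytes => Math.round((bytes / 1024 / 1024) * 10) / 10
  const before = process.memoryUsage()
  const result = await fn()
  const after = process.memoryUsage()
  return {
    result,
    rssDeltaMB: toMB(after.rss - before.rss),
    heapUsedDeltaMB: toMB(after.heapUsed - before.heapUsed)
  }
}

// Stand-in workload; in the thread's setup this would be
// () => metascraper({ url, html })
withMemoryDelta(async () => 'x'.repeat(1024 * 1024)).then(
  ({ rssDeltaMB, heapUsedDeltaMB }) => console.log({ rssDeltaMB, heapUsedDeltaMB })
)
```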

Also, it looks like using the jemalloc memory allocator only partially fixed the issue. It seems to have helped a lot with the slow memory leak that was causing our servers to crash after about 3 days, even when users don't paste links to large files into our app. Memory seems to be holding steady between 180mb and 300mb. But it hasn't helped with the massive memory spikes we get from individual large files, which cause the containers to crash at random times. We're still getting between 1 and 4 of those per day per container, depending on what the users post.

Screenshot 2024-02-18 at 9 58 33 AM

We seem to be getting fewer of them, but that is most likely because our users aren't as active over weekends.

Screenshot 2024-02-18 at 10 03 25 AM

Next week I'm going to see if I can identify whether the spike files have anything in common besides their size. We had this issue in the past with version 5.0.3 of metascraper, but were able to work around it by skipping the scraping for HTML over 3mb. The issue seems to be worse in the latest metascraper version: here we are skipping all files over 2mb and it's still crashing. If the files spiking the memory have other common attributes besides size, maybe I can filter those out as well.
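
The size-based workaround described above could look roughly like the following. The thread does not show the actual guard, so the helper name, the constant, and the choice to measure UTF-8 bytes are all assumptions.

```javascript
// Hypothetical threshold; the thread mentions cutoffs of 3mb and 2mb.
const MAX_HTML_BYTES = 2 * 1024 * 1024

// Returns true when the HTML is small enough to hand to the scraper.
function isSmallEnoughToScrape (html, maxBytes = MAX_HTML_BYTES) {
  // Buffer.byteLength counts UTF-8 bytes, not UTF-16 code units.
  return Buffer.byteLength(html, 'utf8') <= maxBytes
}

console.log(isSmallEnoughToScrape('<html><head></head><body></body></html>'))
console.log(isSmallEnoughToScrape('a'.repeat(MAX_HTML_BYTES + 1)))
```

A caller would skip the metascraper call when the guard returns `false`. As the comment notes, size alone did not prevent the crashes, so this is a partial mitigation at best.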

Thanks,
Dan
