A web page article parser which returns an object containing the article's formatted text and other attributes including sentiment, keyphrases, people, places, organisations, spelling suggestions, in-article links, metadata and Lighthouse audit results.
Hi.
Maybe you should contact @ginatrapani from @postlight?
Postlight currently maintains Mercury Web Parser, which is all that is left of the Readability API.
They have no time to support and develop it, so maybe you could cooperate somehow?
I would be happy to see a living open source project that does this kind of thing.
The Horseman article parser is fantastic! However, I would like to perform additional analysis on the raw HTML downloaded through Puppeteer without having to crawl the page again with a separate tool and risk getting flagged. The original HTML does not seem to be returned with the article object, and I can't find any included option that would allow it. Would it be possible to include an option to allow return of the original HTML, or is there another way to accomplish this goal?