fmacpro / horseman-article-parser

A web page article parser which returns an object containing the article's formatted text and other attributes including sentiment, keyphrases, people, places, organisations, spelling suggestions, in-article links, meta data & lighthouse audit results.

License: GNU General Public License v3.0

JavaScript 100.00%

article-parser horseman keyphrases lighthouse nodejs puppeteer scraper sentiment spelling-suggestions

horseman-article-parser's People

that-guy-dev odeskvaibhav prizm1 isakatirci vtechguys der-ofenmeister

horseman-article-parser's Issues

Parser doesn't work on BBC

I tried the parser on BBC articles, but it does not work on them.

Article link: https://www.bbc.com/news/health-55040635

On this BBC article, the parser appears to pick up only the title text. It would be really helpful if the article body could be parsed as well.

Maybe you should contact Postlight?

Hi.
Maybe you should contact @ginatrapani from @postlight?
Postlight currently maintains Mercury Web Parser, which is all that is left of the Readability API.
They have no time to support and develop it, so maybe you could cooperate somehow?
I would be happy to see a living open source project that does this kind of thing.

Regards. Anton.

Option to return original page HTML from Puppeteer

The Horseman article parser is fantastic! However, I would like to perform additional analysis on the raw HTML downloaded through Puppeteer without having to crawl the page again with a separate tool and risk getting flagged. The original HTML does not seem to be returned with the article object, and I can't find any included option that would allow it. Would it be possible to include an option to allow return of the original HTML, or is there another way to accomplish this goal?
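
Until such an option exists, one possible workaround is to drive Puppeteer yourself and keep the HTML from the same page load that feeds the parser. The sketch below is only an illustration: `parseWithRawHtml` and the `parseArticle` callback are hypothetical names, not part of horseman-article-parser's API. It relies only on the standard Puppeteer calls `page.goto`, `response.text` and `page.content`.

```javascript
// Hypothetical helper: NOT part of horseman-article-parser's API.
// `page` is assumed to be a Puppeteer Page; `parseArticle` is a caller-supplied
// function (an assumption) that turns HTML into an article object.
async function parseWithRawHtml(page, url, parseArticle) {
  // page.goto resolves to the main-resource response (or null).
  const response = await page.goto(url, { waitUntil: 'networkidle2' });

  // HTML as the server sent it, before any client-side scripts ran.
  const originalHtml = response ? await response.text() : null;

  // Serialized DOM after scripts have run.
  const renderedHtml = await page.content();

  // Parse the rendered HTML, then attach both HTML variants to the result.
  const article = await parseArticle(renderedHtml, url);
  return { ...article, html: { original: originalHtml, rendered: renderedHtml } };
}
```

With a real browser, `page` would come from `puppeteer.launch()` followed by `browser.newPage()`. Because the page is fetched only once, the extra analysis does not trigger a second crawl of the site.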
