j0k3r / graby

Graby helps you extract article content from web pages

License: MIT License

PHP 100.00%

text-rss php extract-website content readability composer hacktoberfest

graby's Issues

503 when parsing some URL

Parsing fails with a 503 for some URLs.
For example this one: https://www.facebook.com/media/set/?set=a.10152476050152344.1073742179.58690437343&type=1

Get PDF meta data

It would be interesting to integrate https://github.com/smalot/pdfparser to read the metadata from a PDF and return it along with a link to download the PDF.

remove copyright notices from images (may be other similar situations)

In the example url http://www.bbc.com/news/entertainment-arts-32547474

The html that is returned looks like:

In the original HTML, there are some extra tags for copyright notices that could/should be replaced

I'm not sure of a generic way to do this: perhaps match any text inside the figure tag, or perhaps the specific text (or similar variations).

Rewrite some embedded content to https if the page is served over https

For instance, embedded YouTube videos (iframes) that are loaded over http from an https page are not shown. We could rewrite the URL for these.
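A minimal sketch of such a rewrite, assuming the extracted HTML is available as a string. The function name `upgradeIframeSources` is hypothetical and not part of Graby's API, and the regular expression only covers quoted `src` attributes.

```php
<?php
// Hypothetical helper (not part of Graby): upgrade http:// iframe sources
// to https:// when the page itself was fetched over https.
function upgradeIframeSources(string $html, string $pageUrl): string
{
    // Leave the markup alone unless the page was served over https.
    if (strtolower((string) parse_url($pageUrl, PHP_URL_SCHEME)) !== 'https') {
        return $html;
    }

    // Rewrite only the scheme of a quoted src attribute inside <iframe> tags.
    return preg_replace(
        '~(<iframe\b[^>]*\bsrc=["\'])http://~i',
        '$1https://',
        $html
    );
}
```

Not every host serves the same content over https, so a real implementation would probably restrict this to a list of known providers.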

htmlawed/htmlawed composer package is not compatible with PHP 7.2 and the package maintainer is not responsive. Since PHP 7.2 was released, I forked the package. I can make you a repo co-owner, if you wish.

Can't install via composer

Output:

~/workspace $ composer require j0k3r/graby
Using version ^1.12 for j0k3r/graby
./composer.json has been created
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - Installation request for j0k3r/graby ^1.12 -> satisfiable by j0k3r/graby[1.12.0].
    - j0k3r/graby 1.12.0 requires htmlawed/htmlawed dev-master -> satisfiable by htmlawed/htmlawed[dev-master] but these conflict with your requirements or minimum-stability.


Installation failed, deleting ./composer.json

Handle referer from siteconfig

Once #61 is merged, we should add referer support for HTTP headers, since some site configs define one:

https://github.com/fivefilters/ftr-site-config/blob/5aa171b1b3cff4c5c7a0b3f27803feab2330c1d7/feeds.feedblitz.com.txt

http_header(referer): http://feedblitz.com
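As a rough illustration, lines of this form could be collected into a header array like so. Only the `http_header(name): value` syntax comes from the site config above; the function name `parseHttpHeaderLines` is hypothetical, and this is not how Graby itself parses site configs.

```php
<?php
// Hypothetical helper: collect http_header(...) directives from the text of
// a site config file into a [lowercased header name => value] array.
function parseHttpHeaderLines(string $siteConfig): array
{
    $headers = [];

    // Site configs are line based, so inspect one line at a time.
    foreach (preg_split('/\R/', $siteConfig) as $line) {
        if (preg_match('/^\s*http_header\(([^)]+)\)\s*:\s*(.+?)\s*$/i', $line, $m)) {
            $headers[strtolower($m[1])] = $m[2];
        }
    }

    return $headers;
}
```

The resulting array could then be merged into the request options of whichever HTTP client is in use.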

Add support for httplug

Instead of relying on Guzzle 5 and locking deps down to this version (see #8), it would be better to add support for httplug so that multiple Guzzle versions (or even other HTTP libraries) can be supported.

Uncaught PHP Exception Exception: Url is not valid

Hi,

In wallabag v2.0.0 I see the following error when importing a JSON file:

[2016-04-06 23:52:33] request.CRITICAL: Uncaught PHP Exception Exception: "Url "http://www.pro-linux.de/news/1/23430/linus-torvalds-über-das-internet-der-dinge.html" is not valid." at wallabag/vendor/j0k3r/graby/src/Graby.php line 388 {"exception":"[object] (Exception(code: 0): Url \"http://www.pro-linux.de/news/1/23430/linus-torvalds-über-das-internet-der-dinge.html\" is not valid. at wallabag/vendor/j0k3r/graby/src/Graby.php:388)"} []

I have no idea why this URL is not valid. Do you?

Edit: maybe because of the umlaut "ü"?
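If the validity check is ASCII-only, as PHP's `filter_var()` with `FILTER_VALIDATE_URL` is documented to be, one possible workaround is to percent-encode non-ASCII bytes before validating. The sketch below assumes UTF-8 input; `encodeNonAsciiBytes` is a hypothetical name, and whether Graby validates URLs this way is an assumption.

```php
<?php
// Hypothetical helper: percent-encode every byte outside the ASCII range,
// so "ü" (UTF-8 bytes C3 BC) becomes "%C3%BC". ASCII characters are untouched.
function encodeNonAsciiBytes(string $url): string
{
    return preg_replace_callback(
        '/[\x80-\xFF]/',
        static function (array $m): string {
            return rawurlencode($m[0]);
        },
        $url
    );
}

// Usage: encode first, then validate the ASCII-only result.
$url = encodeNonAsciiBytes('http://www.pro-linux.de/news/1/23430/linus-torvalds-über-das-internet-der-dinge.html');
$isValid = filter_var($url, FILTER_VALIDATE_URL) !== false;
```

This only helps for non-ASCII characters in the path or query. A non-ASCII host name would need IDN (punycode) conversion instead.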

Add ability to send HTTP headers like user-agent or referer

Hi,

I would love to see graby send the additional HTTP headers configured in some ftr-site-config recipes.

Currently this is not supported:

graby/src/SiteConfig/SiteConfig.php

Line 39 in 026ffc7

// NOT YET USED

Some site configs want to send a user-agent: https://github.com/fivefilters/ftr-site-config/search?utf8=%E2%9C%93&q=user-agent

Another example is wallabag/wallabag#2150, where the website thinks we are Internet Explorer and we get redirected to hell.

Set default title if empty

In the case of wallabag/wallabag#1632, it seems Graby returns an empty string for the title. This should be tested and a default title should be shown.

array(8) {
  ["status"]=>
  int(500)
  ["html"]=>
  string(38) "[unable to retrieve full-text content]"
  ["title"]=>
  string(0) ""
  ["language"]=>
  NULL
  ["url"]=>
  string(86) "https://sulek.fr/index.php?article60/configuration-ipv6-pour-une-dedibox-sous-centos-7"
  ["content_type"]=>
  string(0) ""
  ["open_graph"]=>
  array(0) {
  }
  ["summary"]=>
  string(38) "[unable to retrieve full-text content]"
}

Handle zip archive when there is no zip extension

For example: https://github.com/nathanaccidentally/Cydia-Repo-Template/archive/master.zip

Ability to use custom site config would be great

It would be great if users could supply their own site configs in addition to the ones from the Five Filters site config repository.

Properly handle plain text

http://ibiblio.org/pub/linux/docs/ldpResearch/ldp-historic/LinuxNews.03A

It should be wrapped inside a <pre> to keep structure.

Related wallabag/wallabag#444
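A minimal sketch of the wrapping step, assuming the response has already been identified as plain text. The function name `wrapPlainText` is hypothetical.

```php
<?php
// Hypothetical helper: escape plain text and wrap it in <pre> so that line
// breaks and indentation survive when the result is rendered as HTML.
function wrapPlainText(string $text): string
{
    return '<pre>' . htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . '</pre>';
}
```

Escaping matters here because a plain-text file may contain characters such as `<` or `&` that would otherwise be interpreted as markup.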

Return OpenGraph information

As described in this feature for @wallabag we need open graph data in the response, at least to grab a picture of the content.

wallabag/wallabag#972

Ensure preview image is absolute

Some websites don't use an absolute URL for og:image.
The returned Open Graph image can therefore be relative, which generates an error on the client side.

This is the case for that article: https://sechat.org/posts/1673771

<meta property="og:image" content="/assets/branding/logos/asterisk.png" />
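One way to handle this is to resolve the value against the page URL before returning it. The sketch below is an assumption about how that could look, not Graby's code; `absolutizeImageUrl` is a hypothetical name, and it deliberately skips `../` normalization.

```php
<?php
// Hypothetical helper: turn a relative og:image value into an absolute URL
// using the URL of the page it was found on.
function absolutizeImageUrl(string $image, string $pageUrl): string
{
    // Already absolute (has a scheme): nothing to do.
    if (preg_match('~^[a-z][a-z0-9+.-]*://~i', $image)) {
        return $image;
    }

    $base = parse_url($pageUrl);
    if (!isset($base['scheme'], $base['host'])) {
        return $image; // Cannot resolve without a usable base URL.
    }

    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');

    // Protocol-relative: //cdn.example.com/a.png
    if (strpos($image, '//') === 0) {
        return $base['scheme'] . ':' . $image;
    }

    // Root-relative: /assets/a.png
    if (strpos($image, '/') === 0) {
        return $origin . $image;
    }

    // Path-relative: resolve against the directory of the page path.
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $image;
}
```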

Use background-image as content

Hi there,
One site I'm reading is using background-image on links to display images. On wallabag, the content is empty. Is there a way to fix that? I told them not to do this but you know how it works…

<a href="/image.jpg" style="background-image: url('/image.jpg');"><span></span></a>
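As a sketch of what detecting these images might involve, the snippet below collects `background-image` URLs from inline `style` attributes with PHP's DOM extension. The function name `extractBackgroundImages` is hypothetical, and how the URLs would then be turned into content (for example as `<img>` tags) is left open.

```php
<?php
// Hypothetical helper: list the URLs used as inline background-image values.
function extractBackgroundImages(string $html): array
{
    $previous = libxml_use_internal_errors(true); // Tolerate imperfect markup.
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    $xpath = new DOMXPath($doc);

    // Visit every element whose style attribute mentions background-image.
    foreach ($xpath->query('//*[contains(@style, "background-image")]') as $node) {
        $style = $node->getAttribute('style');
        if (preg_match('~background-image\s*:\s*url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)~i', $style, $m)) {
            $urls[] = trim($m[1]);
        }
    }

    return $urls;
}
```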

Upgrade to Guzzle 6

Can't extract content from edition.cnn.com article. Meta refresh tags were not replaced

grabby.php:

<?php
use Graby\Graby;

require(__DIR__ . '/vendor/autoload.php');
require(__DIR__ . '/src/Graby.php');

$url = 'http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html';
$graby = new Graby(['debug' => true]);
$result = $graby->fetchContent($url);
print_r($result);

Command

php ./grabby.php

returns

Array
(
    [status] => 310
    [html] => [unable to retrieve full-text content]
    [title] => No title found
    [language] =>
    [date] =>
    [authors] => Array
        (
        )

    [url] => http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html
    [content_type] =>
    [open_graph] => Array
        (
        )

    [native_ad] =>
    [all_headers] => Array
        (
        )

    [summary] => [unable to retrieve full-text content]
)

graby.log

[2018-02-25 12:51:35] graby.DEBUG: Graby is ready to fetch [] []
[2018-02-25 12:51:35] graby.DEBUG: . looking for site config for {host} in primary folder {"host":"edition.cnn.com"} []
[2018-02-25 12:51:35] graby.DEBUG: ... found site config {host} {"host":"edition.cnn.com.txt"} []
[2018-02-25 12:51:35] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-02-25 12:51:35] graby.DEBUG: . looking for site config for {host} in primary folder {"host":"global"} []
[2018-02-25 12:51:35] graby.DEBUG: ... found site config {host} {"host":"global.txt"} []
[2018-02-25 12:51:35] graby.DEBUG: Cached site config with key: {key} {"key":"edition.cnn.com"} []
[2018-02-25 12:51:35] graby.DEBUG: . looking for site config for {host} in primary folder {"host":"global"} []
[2018-02-25 12:51:35] graby.DEBUG: ... found site config {host} {"host":"global.txt"} []
[2018-02-25 12:51:35] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-02-25 12:51:35] graby.DEBUG: Cached site config with key: {key} {"key":"global"} []
[2018-02-25 12:51:35] graby.DEBUG: Cached site config with key: {key} {"key":"edition.cnn.com.merged"} []
[2018-02-25 12:51:35] graby.DEBUG: Fetching url: {url} {"url":"http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html"} []
[2018-02-25 12:51:35] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html"} []
[2018-02-25 12:51:35] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html"} []
[2018-02-25 12:51:35] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html"} []
[2018-02-25 12:51:36] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:36] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:36] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:36] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:37] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:37] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:37] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:37] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:37] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:37] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:37] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:37] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:38] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:38] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:38] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:38] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:38] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:38] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:38] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:38] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:39] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:39] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:39] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:39] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:39] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:39] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:39] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:39] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:40] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:40] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:40] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:40] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:40] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:40] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:40] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:40] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:41] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:41] graby.DEBUG: Endless redirect: 11 on "{url}" {"url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:41] graby.DEBUG: Opengraph data: {ogData} {"ogData":[]} []
[2018-02-25 12:51:41] graby.DEBUG: Looking for site config files to see if single page link exists [] []
[2018-02-25 12:51:41] graby.DEBUG: Returning cached and merged site config for {host} {"host":"edition.cnn.com"} []
[2018-02-25 12:51:41] graby.DEBUG: No "single_page_link" config found [] []
[2018-02-25 12:51:41] graby.DEBUG: Attempting to extract content [] []
[2018-02-25 12:51:41] graby.DEBUG: Returning cached and merged site config for {host} {"host":"edition.cnn.com"} []
[2018-02-25 12:51:41] graby.DEBUG: Strings replaced: {count} (find_string and/or replace_string) {"count":0} []
[2018-02-25 12:51:41] graby.DEBUG: Attempting to parse HTML with {parser} {"parser":"libxml"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for language {"pattern":"//html[@lang]/@lang"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {string} to strip element {"string":"highlights"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for body (content length: {content_length}) {"pattern":"//section[contains(@class, 'body-text')]","content_length":232} []
[2018-02-25 12:51:41] graby.DEBUG: Using Readability [] []
[2018-02-25 12:51:41] graby.DEBUG: Detected title: {title} {"title":""} []
[2018-02-25 12:51:41] graby.DEBUG: Trying again without tidy [] []
[2018-02-25 12:51:41] graby.DEBUG: Strings replaced: {count} (find_string and/or replace_string) {"count":0} []
[2018-02-25 12:51:41] graby.DEBUG: Attempting to parse HTML with {parser} {"parser":"libxml"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for language {"pattern":"//html[@lang]/@lang"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {string} to strip element {"string":"highlights"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for body (content length: {content_length}) {"pattern":"//section[contains(@class, 'body-text')]","content_length":154} []
[2018-02-25 12:51:41] graby.DEBUG: Using Readability [] []
[2018-02-25 12:51:41] graby.DEBUG: Detected title: {title} {"title":""} []
[2018-02-25 12:51:41] graby.DEBUG: Success ? {is_success} {"is_success":false} []
[2018-02-25 12:51:41] graby.DEBUG: Extract failed [] []

edition.cnn.com.txt:

body: //section[contains(@class, 'body-text')]

strip_id_or_class: highlights

# Avoid redirecting to 'unsupported browser' page
find_string: <meta http-equiv="refresh"
replace_string: <meta norefresh

test_url: http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html
test_contains: this discriminatory and ineffective practice

test_url: http://rss.cnn.com/rss/edition.rss
test_url: http://rss.cnn.com/rss/edition_technology.rss

The other websites I checked work correctly.
It seems the issue is caused by the IF IE tags not being cut from the page and the <meta refresh tags not being replaced.
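The ordering problem can be illustrated with a small sketch: if the `find_string` / `replace_string` pair from the site config is applied before the meta refresh check, no redirect is detected. Both function names are hypothetical, the sample markup in the usage lines is invented for illustration, and this is not Graby's actual code.

```php
<?php
// Hypothetical helper: apply find_string / replace_string pairs to raw HTML.
function applySiteConfigReplacements(string $html, array $find, array $replace): string
{
    return str_replace($find, $replace, $html);
}

// Hypothetical helper: return the target of a meta refresh tag, or null.
function findMetaRefreshUrl(string $html): ?string
{
    $pattern = '~<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*content=["\']?\d+\s*;\s*url=([^"\'>\s]+)~i';

    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

// Invented sample markup, for illustration only.
$html = '<meta http-equiv="refresh" content="1;url=https://edition.cnn.com/2.67.1/static/unsupp.html">';

// Replacing first means the redirect check no longer finds anything to follow.
$cleaned = applySiteConfigReplacements($html, ['<meta http-equiv="refresh"'], ['<meta norefresh']);
$redirect = findMetaRefreshUrl($cleaned);
```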

Store the publication date of the article

As shown here, Full-Text RSS extracts the publication date of the articles.
Graby should extract it too.

Allow setting custom logger

The default log file will be located in the composer directory which should not even be writeable. For greater flexibility, setting a logger should be allowed similarly to how php-readability allows it.

Is there a way to strip all inline styles?

Hi,

I'm trying to figure out if there is a way to strip all inline styles? I'm debating whether I should fork and add functionality here or post process on the HTML after running through graby.

Joe
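For the post-processing route, a sketch using PHP's DOM extension could look like the following. The function name `stripInlineStyles` is hypothetical, and nothing here relies on Graby itself; it only assumes the extracted HTML is a UTF-8 fragment.

```php
<?php
// Hypothetical helper: remove every style attribute from an HTML fragment.
function stripInlineStyles(string $html): string
{
    $previous = libxml_use_internal_errors(true); // Tolerate imperfect markup.
    $doc = new DOMDocument();
    // The XML declaration prefix is a common trick to make libxml read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@style]') as $node) {
        $node->removeAttribute('style');
    }

    // Serialize only the children of <body> to get a fragment back.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}
```

Because the markup goes through a parser and serializer, the output may differ from the input in whitespace or entity encoding, not only in the removed attributes.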

Problems with images + captions

I'm using the Graby library for a side project. When I try to render a page with embedded images, the figcaption sometimes gets pushed partway down the page.

Has anyone else had this issue?

Here's an example of it happening

Parsing differences between f43.me and a local graby

Hi,
I'm seeing strange behavior with graby on my local machine. I started with wallabag 2.2.2, graby 1.6.0, readability 1.1.6, but I have the same issue with a standalone graby.
This URL http://www.rom-game.fr/news/2446-Jean%20Baudlot%20-%20de%20l%20Eurovision%20a%20Delphine%20Software.html cannot be parsed locally but it works on https://f43.me/feed/test .
Any idea?

Attempted to call function "curl_init" from namespace "Graby\Ring\Client".

From @metasystem on November 10, 2015 16:3

Hi,
I just installed wallabag and hit this issue on Debian Jessie

INFO - Matched route "new_entry".
DEBUG - Read existing security token from the session.
DEBUG - SELECT t0.username AS username_1, t0.username_canonical AS username_canonical_2, t0.email AS email_3, t0.email_canonical AS email_canonical_4, t0.enabled AS enabled_5, t0.salt AS salt_6, t0.password AS password_7, t0.last_login AS last_login_8, t0.locked AS locked_9, t0.expired AS expired_10, t0.expires_at AS expires_at_11, t0.confirmation_token AS confirmation_token_12, t0.password_requested_at AS password_requested_at_13, t0.roles AS roles_14, t0.credentials_expired AS credentials_expired_15, t0.credentials_expire_at AS credentials_expire_at_16, t0.id AS id_17, t0.name AS name_18, t0.created_at AS created_at_19, t0.updated_at AS updated_at_20, t0.authCode AS authCode_21, t0.twoFactorAuthentication AS twoFactorAuthentication_22, t0.trusted AS trusted_23, t24.id AS id_25, t24.theme AS theme_26, t24.items_per_page AS items_per_page_27, t24.language AS language_28, t24.rss_token AS rss_token_29, t24.rss_limit AS rss_limit_30, t24.user_id AS user_id_31 FROM wallabaguser t0 LEFT JOIN wallabagconfig t24 ON t24.user_id = t0.id WHERE t0.id = ?
DEBUG - User was reloaded from a user provider.
DEBUG - Notified event "kernel.request" to listener "Nelmio\CorsBundle\EventListener\CorsListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\DebugHandlersListener::configure".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\ProfilerListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\DumpListener::configure".
DEBUG - Notified event "kernel.request" to listener "Symfony\Bundle\FrameworkBundle\EventListener\SessionListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\FragmentListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\RouterListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Wallabag\CoreBundle\EventListener\LocaleListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\LocaleListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "FOS\RestBundle\EventListener\BodyListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\TranslatorListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\Security\Http\Firewall::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Bundle\AsseticBundle\EventListener\RequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Nelmio\ApiDocBundle\EventListener\RequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Liip\ThemeBundle\EventListener\ThemeRequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Scheb\TwoFactorBundle\Security\TwoFactor\EventListener\RequestListener::onCoreRequest".
DEBUG - Notified event "kernel.controller" to listener "FOS\RestBundle\EventListener\ParamFetcherListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Symfony\Bundle\FrameworkBundle\DataCollector\RouterDataCollector::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Symfony\Component\HttpKernel\DataCollector\RequestDataCollector::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\ControllerListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\ParamConverterListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\HttpCacheListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\SecurityListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "FOS\RestBundle\EventListener\ViewResponseListener::onKernelController".
DEBUG - Graby is ready to fetch
DEBUG - Fetching url: {url}
DEBUG - Trying using method "{method}" on url "{url}"
CRITICAL - Fatal Error: Call to undefined function Graby\Ring\Client\curl_init()
CRITICAL - Uncaught PHP Exception Symfony\Component\Debug\Exception\UndefinedFunctionException: "Attempted to call function "curl_init" from namespace "Graby\Ring\Client"." at /root/wallabag/vendor/j0k3r/graby/src/Ring/Client/SafeCurlHandler.php line 49
DEBUG - Notified event "kernel.request" to listener "Nelmio\CorsBundle\EventListener\CorsListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\DebugHandlersListener::configure".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\ProfilerListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\DumpListener::configure".
DEBUG - Notified event "kernel.request" to listener "Symfony\Bundle\FrameworkBundle\EventListener\SessionListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\FragmentListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\RouterListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Wallabag\CoreBundle\EventListener\LocaleListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\LocaleListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "FOS\RestBundle\EventListener\BodyListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\TranslatorListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\Security\Http\Firewall::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Bundle\AsseticBundle\EventListener\RequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Nelmio\ApiDocBundle\EventListener\RequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Liip\ThemeBundle\EventListener\ThemeRequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Scheb\TwoFactorBundle\Security\TwoFactor\EventListener\RequestListener::onCoreRequest".
DEBUG - Notified event "kernel.controller" to listener "FOS\RestBundle\EventListener\ParamFetcherListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Symfony\Bundle\FrameworkBundle\DataCollector\RouterDataCollector::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Symfony\Component\HttpKernel\DataCollector\RequestDataCollector::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\ControllerListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\ParamConverterListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\HttpCacheListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\SecurityListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "FOS\RestBundle\EventListener\ViewResponseListener::onKernelController".
INFO - Defining the initRuntime() method in the "form" extension is deprecated. Use the needs_environment option to get the Twig_Environment instance in filters, functions, or tests; or explicitly implement Twig_Extension_InitRuntimeInterface if needed (not recommended).
INFO - Defining the getGlobals() method in the "assetic" extension is deprecated without explicitly implementing Twig_Extension_GlobalsInterface.

Installed with pdo_mysql

Copied from original issue: wallabag/wallabag#1511

linuxjournal.com multi-page fetches only first page

I am posting this issue here because I think it is a bug in graby. The site config appears to be valid.

When adding http://www.linuxjournal.com/content/papas-got-brand-new-nas to wallabag, content fetching takes a very long time. Adding this page alone makes wallabag's prod.log grow by 2.4 MB.

The problem only occurs with multi-page articles from linuxjournal.

What I observe:

wallabag needs several minutes to fetch this article
after fetching has finished, only the first page of the article is in wallabag
the article in wallabag ends with the sentence: This article appears to continue on subsequent pages which we could not extract

I have attached the log: prod.txt

At first glance at the log, graby seems to have a problem with the URL for the next page, because it contains an unusual string (maybe the comma?).

Return language

From https://github.com/j0k3r/graby/blob/master/src/Extractor/ContentExtractor.php#L292

Get Content-Type from URL

As discussed in wallabag/wallabag#1335, graby should return the Content-Type of the URL.

HttpClient::fetch returns a binary string

When I try to fetch the URL http://fipcom.net/en.signup, the HttpClient::fetch method returns a binary string, see below (the b before the content).

I don't know why we have this binary string here. It seems to be due to PHP 6 https://stackoverflow.com/questions/4749442/what-does-the-b-in-front-of-string-literals-do

Links get removed

For that url: https://www.washingtonpost.com/world/national-security/trump-to-meet-russian-foreign-minister-at-the-white-house-as-moscows-alleged-election-interference-is-back-in-spotlight/2017/05/10/c6717e4c-34f3-11e7-b412-62beef8121f7_story.html

The first content is converted to:

<h3>By  and ,</h3>

When the original content is:

<span class="pb-byline" itemprop="author" itemscope="" itemtype="http://schema.org/Person">
    By
    <a href="https://www.washingtonpost.com/people/carol-morello/">
        <span itemprop="name">Carol Morello</span>
    </a> 
    and 
    <a href="https://www.washingtonpost.com/people/greg-miller/">
        <span itemprop="name">Greg Miller</span>
    </a>
</span>
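
For reference, here is a minimal Python sketch of a text extraction that keeps the text inside nested links. It is not graby's code: `TextCollector` and `visible_text` are hypothetical names, and the byline markup is a whitespace-trimmed copy of the original content above. It only illustrates what the heading would be expected to contain.

```python
from html.parser import HTMLParser

# Whitespace-trimmed copy of the byline from the report.
BYLINE = """
<span class="pb-byline" itemprop="author">
    By
    <a href="https://www.washingtonpost.com/people/carol-morello/">
        <span itemprop="name">Carol Morello</span>
    </a>
    and
    <a href="https://www.washingtonpost.com/people/greg-miller/">
        <span itemprop="name">Greg Miller</span>
    </a>
</span>
"""


class TextCollector(HTMLParser):
    """Collects every text node, including text nested inside <a> and <span>."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text content of the markup with whitespace collapsed."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join("".join(collector.chunks).split())


print(visible_text(BYLINE))  # By Carol Morello and Greg Miller
```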

Need for a tool to quickly test site configs

Hello,

I do not know where to put that, because it concerns graby, graby-site-config and wallabag.
I was wondering if there was a way to have a small "standalone" version of graby that would read the config files without caching anything and return the content.

Basically, I am trying to help fivefilters (and thus graby-site-config) by writing new config files, but doing it with wallabag running on a not-so-powerful server is really painful. Each time I change a configuration file, I have to clear wallabag's cache, which takes quite long (1-2 minutes on a Cubietruck!), delete the article, and submit it to wallabag again. The whole process can take a few minutes, even when the issue was just a missed comma :( .

Unfortunately, I am hopeless at coding anything in PHP... The ideal would be a PHP file that, without any cache, reads just one config file (or a specified config file) on the fly and, given a URL, displays the content without any stylesheet (thus showing very quickly which parts are titles, paragraphs and so on).

Thanks in advance, and do not hesitate to ask for further details if needed.

Regards

Add Angular controller in ajax_triggers

For some websites, if we find ng-controller in the HTML, we should try to get a static page using _escaped_fragment_.

See https://developers.google.com/webmasters/ajax-crawling/docs/getting-started#2-set-up-your-server-to-handle-requests-for-urls-that-contain-_escaped_fragment_

For example: https://tempostorm.com/articles/the-singleton-special-when-to-oneoff?_escaped_fragment_
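
As a rough illustration of the idea (a Python sketch, not graby's implementation; `escaped_fragment_url` and `static_url_if_angular` are hypothetical names), the trigger could be a plain substring check on the fetched HTML, followed by a second request to the URL with `_escaped_fragment_=` added to the query string. Dropping any `#` fragment is a simplification made here.

```python
from urllib.parse import urlsplit, urlunsplit


def escaped_fragment_url(url):
    """Append an empty _escaped_fragment_ parameter to the query string."""
    parts = urlsplit(url)
    if parts.query:
        query = parts.query + "&_escaped_fragment_="
    else:
        query = "_escaped_fragment_="
    # The "#..." fragment is dropped here for simplicity.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def static_url_if_angular(url, html):
    """Return the static-snapshot URL when the page looks like an Angular app."""
    if "ng-controller" in html:
        return escaped_fragment_url(url)
    return url
```

A caller would fetch the page once, pass the HTML to `static_url_if_angular`, and fetch again only when the returned URL differs from the original.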

Related wallabag/wallabag#1413

Problem with escaped fragment when fetching some websites

Hello !

I come from Wallabag, which is using this project.

I have a problem retrieving content from a web page, because graby automatically adds ?_escaped_fragment_= to the end of the URL, for AJAX crawling purposes.
That's a problem because the website in question returns a 404 error when it detects this escaped fragment. Probably to avoid being fetched by robots?

Still, the content seems to be accessible without the fragment.

A solution would be to fetch the URL again without the escaped fragment if a 404 error is returned?
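
The proposed fallback could look roughly like this Python sketch. It makes no claim about how graby fetches pages: `fetch` is a hypothetical callable returning a `(status, body)` tuple, and both function names are made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_escaped_fragment(url):
    """Remove the _escaped_fragment_ parameter and keep the rest of the URL."""
    parts = urlsplit(url)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "_escaped_fragment_"
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(pairs), parts.fragment)
    )


def fetch_with_fallback(url, fetch):
    """Fetch url; on a 404 caused by the escaped fragment, retry without it."""
    status, body = fetch(url)
    if status == 404 and "_escaped_fragment_" in url:
        return fetch(strip_escaped_fragment(url))
    return status, body
```

Passing the fetcher in as a parameter keeps the retry logic independent of any particular HTTP client.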

Here is the website, you can test with or without the escaped fragment:
https://dzone.com/
https://dzone.com/?_escaped_fragment_=

Thank you !

Antonin

Couldn't fetch Readability\JSLikeHTMLElement

In wallabag, when I try to add this URL http://www.journaldugamer.com/tests/rencontre-ils-bossaient-sur-une-exclu-kinect-qui-ne-sortira-jamais/, I've got this error in my logs:

[2016-12-06 22:43:06] app.ERROR: Error while saving an entry {"exception":"[object] (Symfony\\Component\\Debug\\Exception\\ContextErrorException(code: 0): Warning: DOMDocument::importNode(): Couldn't fetch Readability\\JSLikeHTMLElement at /var/www/wallabag/vendor/j0k3r/graby/src/Graby.php:297)","entry":"[object] (Wallabag\\CoreBundle\\Entity\\Entry: {})"} []

Guzzle, PHP HTTP client conflict

Hello!

I have a problem.
I'm trying to install this package, https://github.com/artesaos/laravel-linkedin, but it requires version 6.1.x of https://github.com/guzzle/guzzle.
Your perfect (it's not irony!) package is blocking the installation because it requires an old version of Guzzle, the PHP HTTP client.

Could you fix it, please?

And thank you for the great package!

Sincerely, Dmitry.

XPath used twice doesn't work

For mobile.twitter.com configuration, I want to do this:

title: (//div[contains(@class, 'TweetDetail-text') or contains(@class, 'tweet-text')])[1]
author: (//*[contains(@class, 'UserNames-displayName') or contains(@class, 'fullname')])[1]
body: (//div[contains(@class, 'TweetDetail-text') or contains(@class, 'tweet-text')])[1]
date: (//div[contains(@class, 'TweetDetail-timeAndGeo') or contains(@class, 'metadata')])[1]

I use the same XPath for title and body, but only the title is extracted correctly. The body is wrong.

Do you have any idea?
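
One possible explanation, which the report does not confirm, is that the node matched for the title is detached from the document once it has been extracted, so the same expression finds nothing when it runs again for the body. The Python sketch below only demonstrates that general effect on a toy document. The `detach` flag, the `extract` helper, and the simplified path are all hypothetical and say nothing about graby's actual behavior.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<html><body>"
    "<div class='tweet-text'>Hello world</div>"
    "<div class='metadata'>1 Jan</div>"
    "</body></html>"
)


def extract(root, path, detach):
    """Return the text of the first match, optionally removing the node."""
    node = root.find(path)
    if node is None:
        return None
    text = "".join(node.itertext()).strip()
    if detach:
        # Find the parent of the matched node and remove the node from it.
        for parent in root.iter():
            if node in list(parent):
                parent.remove(node)
                break
    return text


root = ET.fromstring(DOC)
path = ".//div[@class='tweet-text']"
title = extract(root, path, detach=True)   # "Hello world"
body = extract(root, path, detach=True)    # None, the node is gone
```

If that is the cause, matching the body before the title, or extracting the title without detaching it, would let both fields use the same expression.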

Not working URL

I don't know if this is the right place, but I don't know where else to ask...
I'm using wallabag, which internally uses Graby to extract article contents. I'm having problems with this website, http://www.muylinux.com/2016/03/22/kde-plasma-5-6, but it seems to work in Full-Text RSS and on this website: http://f43.me/feed/test
What can I do to fix that URL in Graby?

Detect if article contains media

If an article contains media (audio / video), graby should add a flag in the response (useful for this issue wallabag/wallabag#880).
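
A minimal Python sketch of such a flag, assuming that "media" means `<audio>` or `<video>` elements in the extracted HTML (embedded players loaded through iframes are deliberately left out, and `MediaDetector` and `contains_media` are hypothetical names):

```python
from html.parser import HTMLParser

MEDIA_TAGS = {"audio", "video"}


class MediaDetector(HTMLParser):
    """Sets found to True as soon as an <audio> or <video> tag is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag in MEDIA_TAGS:
            self.found = True


def contains_media(html):
    """Return True when the markup contains at least one media element."""
    detector = MediaDetector()
    detector.feed(html)
    detector.close()
    return detector.found
```

The boolean could then be returned alongside the extracted content, for example under a key chosen by the library.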

Store the author of an article

This is supported in Full-Text RSS but not in Graby. The same goes for the date (#67).

Try to use Google AMP pages for fetching and/or redirect to original page

See https://support.google.com/webmasters/answer/6340290?hl=fr

If the page's metadata contains a link to an AMP version, maybe try to parse that one instead of the original.
The same goes for opening the original URL (quite interesting when you're on mobile).

Of course, this should be configurable.
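
A possible shape for this, sketched in Python under the assumption that the AMP version is advertised through a `<link rel="amphtml" href="...">` element in the page head (the usual AMP discovery convention). `AmpLinkFinder`, `find_amp_url`, and the `use_amp` switch are hypothetical names, not an existing graby option.

```python
from html.parser import HTMLParser


class AmpLinkFinder(HTMLParser):
    """Remembers the href of the first <link rel="amphtml"> element."""

    def __init__(self):
        super().__init__()
        self.amp_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.amp_url is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "amphtml" in rel_values and attributes.get("href"):
            self.amp_url = attributes["href"]


def find_amp_url(html, use_amp=True):
    """Return the advertised AMP URL, or None if absent or disabled."""
    if not use_amp:
        return None
    finder = AmpLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.amp_url
```

When `find_amp_url` returns a URL, the extractor would fetch and parse that page. Otherwise it would carry on with the original one.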

Originally at wallabag/wallabag#2173

Server Side Request Forgery (SSRF)

As reported for wallabag, graby is vulnerable to SSRF. This means one can bypass restrictions on resources that are only accessible from localhost, such as http://127.0.0.1/server-status.
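
A common mitigation is to resolve the host before fetching and refuse loopback, private, and similar address ranges. The Python sketch below shows that check in isolation. It is an illustration rather than the fix adopted by the project, and `is_forbidden_address`, `is_url_allowed`, and the injectable `resolve` parameter are hypothetical.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_forbidden_address(ip_text):
    """True for addresses that should never be fetched on a user's behalf."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_loopback
        or ip.is_private
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def default_resolve(host):
    """Resolve a host name to the list of IP addresses it maps to."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_url_allowed(url, resolve=default_resolve):
    """Allow the URL only if every resolved address is publicly routable."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        addresses = resolve(host)
    except OSError:
        return False
    return bool(addresses) and not any(is_forbidden_address(a) for a in addresses)
```

The check has to be repeated for every redirect target, and the request should connect to the address that was validated. Otherwise a host that resolves differently on the second lookup can still slip through.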

iframes not correctly handled

From @tcitworld on September 29, 2015 18:42

Guess it's graby-related, but I put it here.

Medium seems to use iframes to display some pictures, and the iframe URL is not rewritten to match the original website.

https://medium.com/vantage/live-photos-are-a-gimmick-92d7ad03bcbe
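
If the problem is that the iframe `src` is relative to the original site, resolving it against the article URL would be enough. This is a guess at the cause, shown as a Python sketch. `absolute_src` is a hypothetical helper, and the `/media/abc` path is invented for the example.

```python
from urllib.parse import urljoin

ARTICLE_URL = "https://medium.com/vantage/live-photos-are-a-gimmick-92d7ad03bcbe"


def absolute_src(page_url, src):
    """Resolve a possibly relative iframe src against the page URL."""
    return urljoin(page_url, src)


# Root-relative and protocol-relative sources become absolute URLs.
print(absolute_src(ARTICLE_URL, "/media/abc"))           # https://medium.com/media/abc
print(absolute_src(ARTICLE_URL, "//cdn.example.com/x"))  # https://cdn.example.com/x
```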

Copied from original issue: wallabag/wallabag#1438

    - Installation request for j0k3r/graby ^1.10 -> satisfiable by j0k3r/graby[1.10.0].
    - j0k3r/graby 1.10.0 requires guzzlehttp/guzzle ^5.2.0 -> satisfiable by guzzlehttp/guzzle[5.2.0, 5.3.0, 5.3.1, 5.3.x-dev] but these conflict with your requirements or minimum-stability.

Add support for `native_ad_clue`

Add support for native_ad_clue (see FTRSS).

More information about native ads can be found on the FiveFilters blog.

Tests\Graby\GrabyFunctionalTest::testDate is failing with incorrect date

On master:

2) Tests\Graby\GrabyFunctionalTest::testDate with data set #1 ('https://www.reddit.com/r/Linu...guide/', '2013-05-30T16:01:58+00:00')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-2013-05-30T16:01:58+00:00
+2013-05-30T16:10:50+00:00

The actual date seems to be that of the first comment below the main content.

Parser: add support for list's start attribute

From @pVesian on June 7, 2017 8:14

Issue details

In HTML, ordered lists can have a "start" attribute that sets the number of the first item, instead of the default of 1.

Environment

Wallabagit & f43.me

Steps to reproduce/test case

Store this article: http://www.timothysykes.com/blog/10-things-know-short-selling/
Scroll to "“Called out” or “Buy in”": it is point 4 in the article, but point 1 in the stored article. Inspecting the HTML shows that the parser removes the "start" attribute.
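
To illustrate the kind of change being asked for, here is a Python sketch of an attribute allow-list that keeps `start` on `<ol>` while dropping other attributes. It does not reflect how graby cleans markup: `ALLOWED_ATTRS`, `AttributeFilter`, and `clean` are hypothetical, and the allow-list is deliberately tiny.

```python
from html import escape
from html.parser import HTMLParser

# Attributes that survive cleaning, per tag. Everything else is dropped.
ALLOWED_ATTRS = {"ol": {"start", "reversed", "type"}, "a": {"href"}}


class AttributeFilter(HTMLParser):
    """Re-emits the markup, keeping only allow-listed attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        allowed = ALLOWED_ATTRS.get(tag, set())
        kept = ""
        for name, value in attrs:
            if name in allowed:
                kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))


def clean(html):
    """Return the markup with non-allow-listed attributes removed."""
    parser = AttributeFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With `start` on the allow-list, `<ol start="4" class="x">` comes out as `<ol start="4">`, so the stored article keeps the original numbering.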

Thanks

Copied from original issue: wallabag/wallabag#3185

Get language from headers

Currently you get the language information from the "lang" attribute, but many sites don't have it and use HTTP headers instead.
My proposal is to use the "lang" attribute and, if it's not available, get the language from the "Content-Language" header.
Example URL without the attribute but with the header:
http://www.investopedia.com/university/introduction-stock-trader-types/pivot-traders.asp
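
The proposed order of precedence can be sketched in a few lines of Python. This is an illustration of the fallback only: `detect_language` is a hypothetical function, the regular expression is a simplification that looks at the `<html>` element alone, and `headers` is assumed to be a plain dictionary of response headers.

```python
import re

# Matches lang="xx" (or xml:lang) on the <html> element; a simplification.
LANG_ATTR = re.compile(
    r'<html\b[^>]*?\blang\s*=\s*["\']?([A-Za-z0-9-]+)', re.IGNORECASE
)


def detect_language(html, headers):
    """Prefer the lang attribute, then fall back to Content-Language."""
    match = LANG_ATTR.search(html)
    if match:
        return match.group(1)
    for name, value in headers.items():
        if name.lower() == "content-language" and value.strip():
            # The header may list several languages; keep the first one.
            return value.split(",")[0].strip()
    return None
```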

Already reported in wallabag, but they sent me here => wallabag/wallabag#2978