
graby's People

Contributors

aaa2000, caneco, girishpanchal30, gitter-badger, holgerausb, j0k3r, jtojnar, kdecherf, nicosomb, phiamo, shtrom, simounet, tcitworld, techexo, vendin, zyuhel


graby's Issues

Switch to newer HTMLawed

The htmlawed/htmlawed Composer package is not compatible with PHP 7.2, and the package maintainer is not responsive. Since PHP 7.2 has been released, I forked the package. I can make you a co-owner of the repo if you wish.

Can't install via composer

Output:

~/workspace $ composer require j0k3r/graby
Using version ^1.12 for j0k3r/graby
./composer.json has been created
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - Installation request for j0k3r/graby ^1.12 -> satisfiable by j0k3r/graby[1.12.0].
    - j0k3r/graby 1.12.0 requires htmlawed/htmlawed dev-master -> satisfiable by htmlawed/htmlawed[dev-master] but these conflict with your requirements or minimum-stability.


Installation failed, deleting ./composer.json

Add support for httplug

Instead of relying on Guzzle 5 and locking dependencies down to that version (see #8), it would be better to add support for httplug so that multiple Guzzle versions (or even other HTTP libraries) can be supported.

Uncaught PHP Exception Exception: Url is not valid

Hi,

In wallabag v2.0.0, I see the following error when importing a JSON file:

[2016-04-06 23:52:33] request.CRITICAL: Uncaught PHP Exception Exception: "Url "http://www.pro-linux.de/news/1/23430/linus-torvalds-über-das-internet-der-dinge.html" is not valid." at wallabag/vendor/j0k3r/graby/src/Graby.php line 388 {"exception":"[object] (Exception(code: 0): Url \"http://www.pro-linux.de/news/1/23430/linus-torvalds-über-das-internet-der-dinge.html\" is not valid. at wallabag/vendor/j0k3r/graby/src/Graby.php:388)"} []

I have no idea why this URL is not valid. Maybe you do?

Edit: maybe because of the umlaut "ü"?
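
The umlaut hypothesis can be checked in isolation. The sketch below assumes the validity check is based on PHP's `filter_var()` with `FILTER_VALIDATE_URL`, which only accepts ASCII URLs; whether Graby validates this way is an assumption, and `encodeNonAscii()` is a hypothetical helper, not part of Graby.

```php
<?php
// Hypothetical helper (not part of Graby): percent-encode every non-ASCII
// byte so that an ASCII-only validator can accept the URL.
function encodeNonAscii(string $url): string
{
    return preg_replace_callback(
        '/[^\x00-\x7F]+/',
        static function (array $m): string {
            return rawurlencode($m[0]);
        },
        $url
    );
}

$url = 'http://www.pro-linux.de/news/1/23430/linus-torvalds-über-das-internet-der-dinge.html';

// The raw URL contains a non-ASCII character, so validation is expected to fail.
var_dump(filter_var($url, FILTER_VALIDATE_URL));

// After encoding, the URL is plain ASCII.
var_dump(filter_var(encodeNonAscii($url), FILTER_VALIDATE_URL));
```

This only covers non-ASCII characters in the path or query; an internationalized host name would need Punycode conversion instead.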

add ability to send HTTP header like user-agent or referer

Hi,

I would love to see graby send the additional HTTP headers that some ftr-site-config recipes configure.

Currently this is not supported:

// NOT YET USED

Some site configs want to send a user-agent: https://github.com/fivefilters/ftr-site-config/search?utf8=%E2%9C%93&q=user-agent

Another example is wallabag/wallabag#2150, where the website thinks we are Internet Explorer and we get redirected to hell.

Set default title if empty

In the case of wallabag/wallabag#1632, it seems Graby returns an empty string for the title. This should be tested and a default title should be shown.

array(8) {
  ["status"]=>
  int(500)
  ["html"]=>
  string(38) "[unable to retrieve full-text content]"
  ["title"]=>
  string(0) ""
  ["language"]=>
  NULL
  ["url"]=>
  string(86) "https://sulek.fr/index.php?article60/configuration-ipv6-pour-une-dedibox-sous-centos-7"
  ["content_type"]=>
  string(0) ""
  ["open_graph"]=>
  array(0) {
  }
  ["summary"]=>
  string(38) "[unable to retrieve full-text content]"
}
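
A minimal sketch of the fallback, applied to a result array shaped like the dump above. `withDefaultTitle()` is a hypothetical helper, not part of Graby's API, and the default text is only an example.

```php
<?php
// Hypothetical helper (not part of Graby): replace an empty or
// whitespace-only title with a default value.
function withDefaultTitle(array $result, string $default = 'No title found'): array
{
    $title = trim((string) ($result['title'] ?? ''));
    if ('' === $title) {
        $result['title'] = $default;
    }

    return $result;
}

$result = withDefaultTitle(['status' => 500, 'title' => '']);
echo $result['title'], PHP_EOL;
```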

Use background-image as content

Hi there,
One site I read uses background-image on links to display images, so in wallabag the content is empty. Is there a way to fix that? I told them not to do this but you know how it works…

<a href="/image.jpg" style="background-image: url('/image.jpg');"><span></span></a>
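
One possible workaround is to rewrite such links before extraction. The sketch below uses PHP's DOM extension to append an `<img>` to every link whose inline style declares a background image. `backgroundImagesToImg()` is a hypothetical helper, and nothing here reflects how Graby handles this internally.

```php
<?php
// Hypothetical pre-processing step (not part of Graby): give links that only
// carry a CSS background image a real <img> child.
function backgroundImagesToImg(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a single <div> so it has exactly one root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $pattern = '/background-image\s*:\s*url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)/i';
    foreach ($xpath->query('//a[contains(@style, "background-image")]') as $link) {
        if (preg_match($pattern, $link->getAttribute('style'), $m)) {
            $img = $doc->createElement('img');
            $img->setAttribute('src', trim($m[1]));
            $link->appendChild($img);
        }
    }

    // Serialize the children of the wrapper <div>, not the wrapper itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

echo backgroundImagesToImg(
    '<a href="/image.jpg" style="background-image: url(\'/image.jpg\');"><span></span></a>'
), PHP_EOL;
```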

Can't extract content from edition.cnn.com article. Meta refresh tags were not replaced

grabby.php:

<?php
use Graby\Graby;

require(__DIR__ . '/vendor/autoload.php');
require(__DIR__ . '/src/Graby.php');

$url = 'http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html';
$graby = new Graby(['debug' => true]);
$result = $graby->fetchContent($url);
print_r($result);

Command

php ./grabby.php

returns

Array
(
    [status] => 310
    [html] => [unable to retrieve full-text content]
    [title] => No title found
    [language] =>
    [date] =>
    [authors] => Array
        (
        )

    [url] => http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html
    [content_type] =>
    [open_graph] => Array
        (
        )

    [native_ad] =>
    [all_headers] => Array
        (
        )

    [summary] => [unable to retrieve full-text content]
)

graby.log

[2018-02-25 12:51:35] graby.DEBUG: Graby is ready to fetch [] []
[2018-02-25 12:51:35] graby.DEBUG: . looking for site config for {host} in primary folder {"host":"edition.cnn.com"} []
[2018-02-25 12:51:35] graby.DEBUG: ... found site config {host} {"host":"edition.cnn.com.txt"} []
[2018-02-25 12:51:35] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-02-25 12:51:35] graby.DEBUG: . looking for site config for {host} in primary folder {"host":"global"} []
[2018-02-25 12:51:35] graby.DEBUG: ... found site config {host} {"host":"global.txt"} []
[2018-02-25 12:51:35] graby.DEBUG: Cached site config with key: {key} {"key":"edition.cnn.com"} []
[2018-02-25 12:51:35] graby.DEBUG: . looking for site config for {host} in primary folder {"host":"global"} []
[2018-02-25 12:51:35] graby.DEBUG: ... found site config {host} {"host":"global.txt"} []
[2018-02-25 12:51:35] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-02-25 12:51:35] graby.DEBUG: Cached site config with key: {key} {"key":"global"} []
[2018-02-25 12:51:35] graby.DEBUG: Cached site config with key: {key} {"key":"edition.cnn.com.merged"} []
[2018-02-25 12:51:35] graby.DEBUG: Fetching url: {url} {"url":"http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html"} []
[2018-02-25 12:51:35] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html"} []
[2018-02-25 12:51:35] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html"} []
[2018-02-25 12:51:35] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html"} []
[2018-02-25 12:51:36] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:36] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:36] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:36] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:37] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:37] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:37] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:37] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:37] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:37] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:37] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:37] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:38] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:38] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:38] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:38] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:38] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:38] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:38] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:38] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:39] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:39] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:39] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:39] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:39] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:39] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:39] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:39] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:40] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:40] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:40] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:40] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:40] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:40] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:40] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:40] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:41] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:41] graby.DEBUG: Endless redirect: 11 on "{url}" {"url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:41] graby.DEBUG: Opengraph data: {ogData} {"ogData":[]} []
[2018-02-25 12:51:41] graby.DEBUG: Looking for site config files to see if single page link exists [] []
[2018-02-25 12:51:41] graby.DEBUG: Returning cached and merged site config for {host} {"host":"edition.cnn.com"} []
[2018-02-25 12:51:41] graby.DEBUG: No "single_page_link" config found [] []
[2018-02-25 12:51:41] graby.DEBUG: Attempting to extract content [] []
[2018-02-25 12:51:41] graby.DEBUG: Returning cached and merged site config for {host} {"host":"edition.cnn.com"} []
[2018-02-25 12:51:41] graby.DEBUG: Strings replaced: {count} (find_string and/or replace_string) {"count":0} []
[2018-02-25 12:51:41] graby.DEBUG: Attempting to parse HTML with {parser} {"parser":"libxml"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for language {"pattern":"//html[@lang]/@lang"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {string} to strip element {"string":"highlights"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for body (content length: {content_length}) {"pattern":"//section[contains(@class, 'body-text')]","content_length":232} []
[2018-02-25 12:51:41] graby.DEBUG: Using Readability [] []
[2018-02-25 12:51:41] graby.DEBUG: Detected title: {title} {"title":""} []
[2018-02-25 12:51:41] graby.DEBUG: Trying again without tidy [] []
[2018-02-25 12:51:41] graby.DEBUG: Strings replaced: {count} (find_string and/or replace_string) {"count":0} []
[2018-02-25 12:51:41] graby.DEBUG: Attempting to parse HTML with {parser} {"parser":"libxml"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for language {"pattern":"//html[@lang]/@lang"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {string} to strip element {"string":"highlights"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for body (content length: {content_length}) {"pattern":"//section[contains(@class, 'body-text')]","content_length":154} []
[2018-02-25 12:51:41] graby.DEBUG: Using Readability [] []
[2018-02-25 12:51:41] graby.DEBUG: Detected title: {title} {"title":""} []
[2018-02-25 12:51:41] graby.DEBUG: Success ? {is_success} {"is_success":false} []
[2018-02-25 12:51:41] graby.DEBUG: Extract failed [] []

edition.cnn.com.txt:

body: //section[contains(@class, 'body-text')]

strip_id_or_class: highlights

# Avoid redirecting to 'unsupported browser' page
find_string: <meta http-equiv="refresh"
replace_string: <meta norefresh

test_url: http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html
test_contains: this discriminatory and ineffective practice

test_url: http://rss.cnn.com/rss/edition.rss
test_url: http://rss.cnn.com/rss/edition_technology.rss

The other websites I checked work correctly.
The issue seems to be that the IF IE tags are not cut from the page and the <meta refresh tags are not replaced.
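
The expected ordering can be illustrated outside Graby: if the site config's find_string/replace_string pair is applied before the page is scanned for a meta refresh, the redirect is no longer detected. Both helpers and the sample markup below are hypothetical and do not mirror Graby's internals.

```php
<?php
// Hypothetical helpers for illustration only.
function applyStringReplacements(string $html, array $find, array $replace): string
{
    return str_replace($find, $replace, $html);
}

// Return the target of a <meta http-equiv="refresh"> tag, or null if none.
function findMetaRefreshUrl(string $html): ?string
{
    $pattern = '/<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*url=["\']?([^"\'>\s]+)/i';
    if (preg_match($pattern, $html, $m)) {
        return $m[1];
    }

    return null;
}

// Made-up sample markup, loosely modelled on the redirect seen in the log.
$html = '<html><head><meta http-equiv="refresh" content="1;url=/2.67.1/static/unsupp.html"></head></html>';

// Same find/replace pair as in edition.cnn.com.txt.
$patched = applyStringReplacements($html, ['<meta http-equiv="refresh"'], ['<meta norefresh']);

var_dump(findMetaRefreshUrl($html));    // redirect target found in the raw page
var_dump(findMetaRefreshUrl($patched)); // nothing to follow after the replacement
```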

Allow setting custom logger

The default log file will be located in the composer directory which should not even be writeable. For greater flexibility, setting a logger should be allowed similarly to how php-readability allows it.

Is there a way to strip all inline styles?

Hi,

I'm trying to figure out if there is a way to strip all inline styles? I'm debating whether I should fork and add functionality here or post process on the HTML after running through graby.

Joe
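
For the post-processing route, a minimal sketch with PHP's DOM extension could look like this. `stripInlineStyles()` is a hypothetical helper applied to the HTML that Graby returns; it is not a Graby feature.

```php
<?php
// Hypothetical post-processing step (not part of Graby): remove every
// style attribute from an HTML fragment.
function stripInlineStyles(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a single <div> so it has exactly one root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@style]') as $node) {
        $node->removeAttribute('style');
    }

    // Serialize the children of the wrapper <div>, not the wrapper itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

echo stripInlineStyles('<p style="color:red">Hi <b style="font-size:2em">there</b></p>'), PHP_EOL;
```

Without an encoding hint, `loadHTML()` may misread non-ASCII text, so real content would likely need a charset declaration added before parsing.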

Attempted to call function "curl_init" from namespace "Graby\Ring\Client".

From @metasystem on November 10, 2015 16:3

Hi,
I just installed wallabag and have an issue on Debian Jessie:

INFO - Matched route "new_entry".
DEBUG - Read existing security token from the session.
DEBUG - SELECT t0.username AS username_1, t0.username_canonical AS username_canonical_2, t0.email AS email_3, t0.email_canonical AS email_canonical_4, t0.enabled AS enabled_5, t0.salt AS salt_6, t0.password AS password_7, t0.last_login AS last_login_8, t0.locked AS locked_9, t0.expired AS expired_10, t0.expires_at AS expires_at_11, t0.confirmation_token AS confirmation_token_12, t0.password_requested_at AS password_requested_at_13, t0.roles AS roles_14, t0.credentials_expired AS credentials_expired_15, t0.credentials_expire_at AS credentials_expire_at_16, t0.id AS id_17, t0.name AS name_18, t0.created_at AS created_at_19, t0.updated_at AS updated_at_20, t0.authCode AS authCode_21, t0.twoFactorAuthentication AS twoFactorAuthentication_22, t0.trusted AS trusted_23, t24.id AS id_25, t24.theme AS theme_26, t24.items_per_page AS items_per_page_27, t24.language AS language_28, t24.rss_token AS rss_token_29, t24.rss_limit AS rss_limit_30, t24.user_id AS user_id_31 FROM wallabaguser t0 LEFT JOIN wallabagconfig t24 ON t24.user_id = t0.id WHERE t0.id = ?
DEBUG - User was reloaded from a user provider.
DEBUG - Notified event "kernel.request" to listener "Nelmio\CorsBundle\EventListener\CorsListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\DebugHandlersListener::configure".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\ProfilerListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\DumpListener::configure".
DEBUG - Notified event "kernel.request" to listener "Symfony\Bundle\FrameworkBundle\EventListener\SessionListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\FragmentListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\RouterListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Wallabag\CoreBundle\EventListener\LocaleListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\LocaleListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "FOS\RestBundle\EventListener\BodyListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\TranslatorListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\Security\Http\Firewall::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Bundle\AsseticBundle\EventListener\RequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Nelmio\ApiDocBundle\EventListener\RequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Liip\ThemeBundle\EventListener\ThemeRequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Scheb\TwoFactorBundle\Security\TwoFactor\EventListener\RequestListener::onCoreRequest".
DEBUG - Notified event "kernel.controller" to listener "FOS\RestBundle\EventListener\ParamFetcherListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Symfony\Bundle\FrameworkBundle\DataCollector\RouterDataCollector::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Symfony\Component\HttpKernel\DataCollector\RequestDataCollector::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\ControllerListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\ParamConverterListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\HttpCacheListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\SecurityListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "FOS\RestBundle\EventListener\ViewResponseListener::onKernelController".
DEBUG - Graby is ready to fetch
DEBUG - Fetching url: {url}
DEBUG - Trying using method "{method}" on url "{url}"
CRITICAL - Fatal Error: Call to undefined function Graby\Ring\Client\curl_init()
CRITICAL - Uncaught PHP Exception Symfony\Component\Debug\Exception\UndefinedFunctionException: "Attempted to call function "curl_init" from namespace "Graby\Ring\Client"." at /root/wallabag/vendor/j0k3r/graby/src/Ring/Client/SafeCurlHandler.php line 49
DEBUG - Notified event "kernel.request" to listener "Nelmio\CorsBundle\EventListener\CorsListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\DebugHandlersListener::configure".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\ProfilerListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\DumpListener::configure".
DEBUG - Notified event "kernel.request" to listener "Symfony\Bundle\FrameworkBundle\EventListener\SessionListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\FragmentListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\RouterListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Wallabag\CoreBundle\EventListener\LocaleListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\LocaleListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "FOS\RestBundle\EventListener\BodyListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\TranslatorListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\Security\Http\Firewall::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Bundle\AsseticBundle\EventListener\RequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Nelmio\ApiDocBundle\EventListener\RequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Liip\ThemeBundle\EventListener\ThemeRequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Scheb\TwoFactorBundle\Security\TwoFactor\EventListener\RequestListener::onCoreRequest".
DEBUG - Notified event "kernel.controller" to listener "FOS\RestBundle\EventListener\ParamFetcherListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Symfony\Bundle\FrameworkBundle\DataCollector\RouterDataCollector::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Symfony\Component\HttpKernel\DataCollector\RequestDataCollector::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\ControllerListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\ParamConverterListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\HttpCacheListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\SecurityListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "FOS\RestBundle\EventListener\ViewResponseListener::onKernelController".
INFO - Defining the initRuntime() method in the "form" extension is deprecated. Use the needs_environment option to get the Twig_Environment instance in filters, functions, or tests; or explicitly implement Twig_Extension_InitRuntimeInterface if needed (not recommended).
INFO - Defining the getGlobals() method in the "assetic" extension is deprecated without explicitly implementing Twig_Extension_GlobalsInterface.

Installed with pdo_mysql

Copied from original issue: wallabag/wallabag#1511
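
The error names the `Graby\Ring\Client` namespace because PHP looks up an unqualified function call in the current namespace first and then globally; `curl_init` exists in neither when the curl extension is not loaded. A quick way to confirm this is to check for the extension up front. `missingExtensions()` below is a hypothetical helper, not something wallabag or Graby ships.

```php
<?php
// Hypothetical pre-flight check: list the required PHP extensions that are
// not loaded in the current runtime.
function missingExtensions(array $required): array
{
    return array_values(array_filter(
        $required,
        static function (string $ext): bool {
            return !extension_loaded($ext);
        }
    ));
}

$missing = missingExtensions(['curl']);
if ([] !== $missing) {
    echo 'Missing PHP extensions: ', implode(', ', $missing), PHP_EOL;
}
```

The CLI and the web server can load different php.ini files, so the check is most useful when run in the same context as the failing request.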

linuxjournal.com multi-page fetches only first page

I am posting this issue here because I think it is a bug in graby. The site config appears to be valid.

When adding http://www.linuxjournal.com/content/papas-got-brand-new-nas to wallabag, content fetching takes a very long time. Adding this one page makes wallabag's prod.log grow by 2.4 MB.

The problem only occurs with multi-page articles from linuxjournal.

What I observe:

  • wallabag needs several minutes to fetch this article
  • once fetching has finished, only the first page of the article is in wallabag
  • the article in wallabag ends with the sentence: This article appears to continue on subsequent pages which we could not extract

I have attached the log: prod.txt

From a first superficial look at the log, graby has a problem with the URL for the next page because it contains an unusual string (maybe the comma?).

Links get removed

For this URL: https://www.washingtonpost.com/world/national-security/trump-to-meet-russian-foreign-minister-at-the-white-house-as-moscows-alleged-election-interference-is-back-in-spotlight/2017/05/10/c6717e4c-34f3-11e7-b412-62beef8121f7_story.html

The first piece of content is converted to:

<h3>By  and ,</h3>

whereas the original content is:

<span class="pb-byline" itemprop="author" itemscope="" itemtype="http://schema.org/Person">
    By
    <a href="https://www.washingtonpost.com/people/carol-morello/">
        <span itemprop="name">Carol Morello</span>
    </a> 
    and 
    <a href="https://www.washingtonpost.com/people/greg-miller/">
        <span itemprop="name">Greg Miller</span>
    </a>
</span>

Need for a tool to quickly test site configs

Hello,

I do not know where to put that, because it concerns graby, graby-site-config and wallabag.
I was wondering if there was a way to have a small "standalone" version of graby that would read the config files without caching anything and return the content.

Basically, I am trying to help fivefilters (and thus graby-site-config) write new config files, but doing it with wallabag running on a not-so-powerful server is really painful. Each time I change a configuration file, I have to clear wallabag's cache, which takes quite long (1-2 minutes on a Cubietruck!), then delete the article and submit it to wallabag again. The whole process can take a few minutes, even when the issue was just a missed comma :(

Unfortunately, I am hopeless at coding anything in PHP. The ideal would be a PHP file that, without any caching, reads just one config file on the fly (or a specified config file) and, given a URL, displays the content without any stylesheet (thus showing very quickly which parts are titles, paragraphs and so on).

Thanks in advance, and do not hesitate to ask further details if needed.

Regards

Problem with escaped fragment when fetching some websites

Hello!

I come from wallabag, which uses this project.

I have a problem retrieving content from a webpage because graby automatically adds ?_escaped_fragment_= at the end of the URL, for the purpose of crawling AJAX pages.
That's a problem because the website in question returns a 404 error when it detects this escaped fragment, probably to avoid being fetched by robots.

Still, the content seems to be accessible without the fragment.

Would a solution be to fetch the URL again without the escaped fragment when a 404 error is returned?

Here is the website, you can test with or without the escaped fragment:
https://dzone.com/
https://dzone.com/?_escaped_fragment_=

Thank you!

Antonin
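
The proposed fallback could be sketched as follows. Both functions are hypothetical and not part of Graby; `$fetch` stands in for whatever performs the HTTP request and is assumed to return an array with a `status` key.

```php
<?php
// Hypothetical helper: remove the _escaped_fragment_ parameter from a URL.
function withoutEscapedFragment(string $url): string
{
    // Drop the parameter whether it comes first (?...) or later (&...) in the query.
    $url = preg_replace('/([?&])_escaped_fragment_=[^&#]*&?/', '$1', $url);

    // Tidy a dangling "?" or "&" left at the end of the query string.
    return preg_replace('/[?&](?=#|$)/', '', $url);
}

// Hypothetical retry wrapper: on a 404, try once more without the parameter.
function fetchWithFallback(string $url, callable $fetch): array
{
    $response = $fetch($url);
    $fallback = withoutEscapedFragment($url);
    if (404 === $response['status'] && $fallback !== $url) {
        $response = $fetch($fallback);
    }

    return $response;
}

// Stub fetcher for demonstration: answers 404 whenever the parameter is present.
$stub = static function (string $url): array {
    $blocked = false !== strpos($url, '_escaped_fragment_');

    return ['status' => $blocked ? 404 : 200, 'url' => $url];
};

print_r(fetchWithFallback('https://dzone.com/?_escaped_fragment_=', $stub));
```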

Couldn't fetch Readability\JSLikeHTMLElement

In wallabag, when I try to add this URL http://www.journaldugamer.com/tests/rencontre-ils-bossaient-sur-une-exclu-kinect-qui-ne-sortira-jamais/, I've got this error in my logs:

[2016-12-06 22:43:06] app.ERROR: Error while saving an entry {"exception":"[object] (Symfony\\Component\\Debug\\Exception\\ContextErrorException(code: 0): Warning: DOMDocument::importNode(): Couldn't fetch Readability\\JSLikeHTMLElement at /var/www/wallabag/vendor/j0k3r/graby/src/Graby.php:297)","entry":"[object] (Wallabag\\CoreBundle\\Entity\\Entry: {})"} []

Xpath used twice doesn't work

For mobile.twitter.com configuration, I want to do this:

title: (//div[contains(@class, 'TweetDetail-text') or contains(@class, 'tweet-text')])[1]
author: (//*[contains(@class, 'UserNames-displayName') or contains(@class, 'fullname')])[1]
body: (//div[contains(@class, 'TweetDetail-text') or contains(@class, 'tweet-text')])[1]
date: (//div[contains(@class, 'TweetDetail-timeAndGeo') or contains(@class, 'metadata')])[1]

I have the same xpath for title and body.
But the content is OK only for the title.
The body is wrong.

Do you have any idea?

Support for Guzzle 6?

My installation seems to be failing, I think because I have Guzzle 6:

    - Installation request for j0k3r/graby ^1.10 -> satisfiable by j0k3r/graby[1.10.0].
    - j0k3r/graby 1.10.0 requires guzzlehttp/guzzle ^5.2.0 -> satisfiable by guzzlehttp/guzzle[5.2.0, 5.3.0, 5.3.1, 5.3.x-dev] but these conflict with your requirements or minimum-stability.

Tests\Graby\GrabyFunctionalTest::testDate is failing with incorrect date

On master:

2) Tests\Graby\GrabyFunctionalTest::testDate with data set #1 ('https://www.reddit.com/r/Linu...guide/', '2013-05-30T16:01:58+00:00')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-2013-05-30T16:01:58+00:00
+2013-05-30T16:10:50+00:00

The actual date seems to be that of the first comment below the main content.

Parser: add support for list's start attribute

From @pVesian on June 7, 2017 8:14

Issue details

In HTML, ordered lists can have a "start" attribute that sets the number of the first item, instead of the default one.

Environment

Wallabagit & f43.me

Steps to reproduce/test case

Store this article: http://www.timothysykes.com/blog/10-things-know-short-selling/
Scroll to "“Called out” or “Buy in”", it's point 4 in the article, but point 1 in the stored article. By inspecting the HTML code, you will find that the parser removes the "start" attribute.

Thanks

Copied from original issue: wallabag/wallabag#3185
