j0k3r / graby
Graby helps you extract article content from web pages
License: MIT License
There is a problem with some URLs.
For example this one: https://www.facebook.com/media/set/?set=a.10152476050152344.1073742179.58690437343&type=1
It would be interesting to integrate https://github.com/smalot/pdfparser to read the metadata from a PDF and return it together with a link to download the PDF.
With the example URL http://www.bbc.com/news/entertainment-arts-32547474
the HTML that is returned looks like:
In the original HTML, there are some extra tags for copyright notices that could/should be replaced.
I'm not sure of a generic way to do this. Perhaps match on text inside the figure tag, or on the specific text (or similar variations).
For instance, embedded (iframe) YouTube videos that are loaded over http from an https page are not shown. We could rewrite the URL for these.
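A minimal sketch of such a rewrite, assuming the embedded host also serves the same path over https (true for YouTube). The helper name `upgradeIframeSources` is hypothetical and not part of Graby:

```php
<?php
// Hypothetical helper, not part of Graby: upgrade http:// iframe sources to https://.
function upgradeIframeSources(string $html): string
{
    // Only the scheme inside <iframe ... src="http://..."> is touched;
    // ordinary links and images are left alone.
    $result = preg_replace(
        '/(<iframe\b[^>]*\bsrc=["\'])http:\/\//i',
        '$1https://',
        $html
    );

    // preg_replace returns null on a regex engine error; fall back to the input.
    return $result ?? $html;
}

$html = '<iframe width="560" src="http://www.youtube.com/embed/abc123"></iframe>';
echo upgradeIframeSources($html), PHP_EOL;
```

A regex is enough for a sketch like this; a DOM-based pass would be more robust against unusual attribute quoting.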
The htmlawed/htmlawed composer package is not compatible with PHP 7.2 and the package maintainer is not responsive. Since PHP 7.2 was released, I forked the package. I can make you a repo co-owner, if you wish.
Output:
~/workspace $ composer require j0k3r/graby
Using version ^1.12 for j0k3r/graby
./composer.json has been created
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.
Problem 1
- Installation request for j0k3r/graby ^1.12 -> satisfiable by j0k3r/graby[1.12.0].
- j0k3r/graby 1.12.0 requires htmlawed/htmlawed dev-master -> satisfiable by htmlawed/htmlawed[dev-master] but these conflict with your requirements or minimum-stability.
Installation failed, deleting ./composer.json
Once #61 is merged, we should add support for the referer HTTP header, since some websites have it configured:
http_header(referer): http://feedblitz.com
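To illustrate, a directive of this shape could be parsed roughly as follows. This is only a sketch: `parseHttpHeaderLine` is a hypothetical name, and Graby's real site-config parser may handle the line differently.

```php
<?php
// Hypothetical parser for illustration; Graby's real site-config parser may differ.
// Turns "http_header(name): value" into ['name' => 'value'], or null if the line
// is not an http_header directive.
function parseHttpHeaderLine(string $line): ?array
{
    if (preg_match('/^http_header\(([^)]+)\):\s*(.+)$/i', trim($line), $m) !== 1) {
        return null;
    }

    // Header names are case-insensitive, so normalise them to lowercase.
    return [strtolower(trim($m[1])) => trim($m[2])];
}

var_dump(parseHttpHeaderLine('http_header(referer): http://feedblitz.com'));
```

The same shape would cover other headers such as `http_header(user-agent)`.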
Hi,
In wallabag v2.0.0, I see the following error when importing a JSON file:
[2016-04-06 23:52:33] request.CRITICAL: Uncaught PHP Exception Exception: "Url "http://www.pro-linux.de/news/1/23430/linus-torvalds-über-das-internet-der-dinge.html" is not valid." at wallabag/vendor/j0k3r/graby/src/Graby.php line 388 {"exception":"[object] (Exception(code: 0): Url \"http://www.pro-linux.de/news/1/23430/linus-torvalds-über-das-internet-der-dinge.html\" is not valid. at wallabag/vendor/j0k3r/graby/src/Graby.php:388)"} []
I have no idea why this URL is not valid. Do you?
Edit: maybe because of the umlaut "ü"?
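If the validation relies on an ASCII-only check (PHP's `filter_var` with `FILTER_VALIDATE_URL` is documented as accepting ASCII URLs only), one possible workaround is to percent-encode the non-ASCII bytes first. A sketch, with the hypothetical helper name `encodeNonAsciiInUrl`; whether Graby validates this way is an assumption:

```php
<?php
// Hypothetical pre-processing step, not Graby's actual code.
function encodeNonAsciiInUrl(string $url): string
{
    // Percent-encode every byte outside printable ASCII, e.g. the two
    // UTF-8 bytes of "ü" become %C3%BC. Spaces are encoded as %20 too.
    $encoded = preg_replace_callback(
        '/[^\x21-\x7e]/',
        static function (array $m): string {
            return rawurlencode($m[0]);
        },
        $url
    );

    // preg_replace_callback returns null on a regex engine error.
    return $encoded ?? $url;
}

$url = "http://www.pro-linux.de/news/1/23430/linus-torvalds-\xC3\xBCber-das-internet-der-dinge.html";
echo encodeNonAsciiInUrl($url), PHP_EOL;
```

This only covers the path and query. An internationalized host name would need IDN (punycode) conversion instead.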
Hi,
I would love to see graby being able to send additional HTTP headers as configured in some ftr-site-config recipes.
Currently this is not supported:
graby/src/SiteConfig/SiteConfig.php
Line 39 in 026ffc7
There are some site configs that want to set the user-agent: https://github.com/fivefilters/ftr-site-config/search?utf8=%E2%9C%93&q=user-agent
Another example is wallabag/wallabag#2150, where the website thinks we are Internet Explorer and we get redirected to hell.
In the case of wallabag/wallabag#1632, it seems Graby returns an empty string for the title. This should be tested and a default title should be shown.
array(8) {
["status"]=>
int(500)
["html"]=>
string(38) "[unable to retrieve full-text content]"
["title"]=>
string(0) ""
["language"]=>
NULL
["url"]=>
string(86) "https://sulek.fr/index.php?article60/configuration-ipv6-pour-une-dedibox-sous-centos-7"
["content_type"]=>
string(0) ""
["open_graph"]=>
array(0) {
}
["summary"]=>
string(38) "[unable to retrieve full-text content]"
}
It would be really great if users could provide their own custom site configs, in addition to the ones from the Five Filters site config repository.
http://ibiblio.org/pub/linux/docs/ldpResearch/ldp-historic/LinuxNews.03A
It should be wrapped inside a <pre> to keep its structure.
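Something along these lines could do it for plain-text responses. This is a sketch only: `wrapPlainText` is a hypothetical helper, and deciding when a response counts as plain text (for example from its Content-Type) is left out.

```php
<?php
// Hypothetical helper, not part of Graby: keep line breaks and indentation
// of a plain-text document by wrapping it in <pre>.
function wrapPlainText(string $text): string
{
    // Escape HTML special characters so the text is displayed literally.
    return '<pre>' . htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . '</pre>';
}

echo wrapPlainText("Linux News\n  - item <one>\n  - item two"), PHP_EOL;
```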
Related wallabag/wallabag#444
As described in this feature for @wallabag we need open graph data in the response, at least to grab a picture of the content.
Some websites don't use an absolute URL for og:image.
So the returned Open Graph image can be relative, which generates an error on the client side.
This is the case for that article: https://sechat.org/posts/1673771
<meta property="og:image" content="/assets/branding/logos/asterisk.png" />
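A rough sketch of how such a value could be made absolute against the page URL. `makeAbsoluteUrl` is a hypothetical helper that covers the common cases only (no `../` normalisation), not Graby's implementation:

```php
<?php
// Hypothetical helper; handles the common cases only (no "../" normalisation).
function makeAbsoluteUrl(string $base, string $url): string
{
    // Already absolute (has a scheme): return as-is.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $url) === 1) {
        return $url;
    }

    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // base is unusable, leave the value untouched
    }

    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative: //cdn.example.com/img.png
    if (strpos($url, '//') === 0) {
        return $parts['scheme'] . ':' . $url;
    }

    // Root-relative: /assets/img.png
    if (strpos($url, '/') === 0) {
        return $origin . $url;
    }

    // Path-relative: resolve against the directory of the base path.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $url;
}

echo makeAbsoluteUrl('https://sechat.org/posts/1673771', '/assets/branding/logos/asterisk.png'), PHP_EOL;
```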
Hi there,
One site I'm reading uses background-image on links to display images. In wallabag, the content is empty. Is there a way to fix that? I told them not to do this but you know how it works…
<a href="/image.jpg" style="background-image: url('/image.jpg');"><span></span></a>
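One possible direction would be to read the URL out of the inline style so it can be turned into a regular image before extraction. A sketch, with the hypothetical helper `backgroundImageUrl`; it only looks at a single style attribute value and is not something Graby provides:

```php
<?php
// Hypothetical helper: pull the first background-image URL out of an inline
// style attribute value, or return null when there is none.
function backgroundImageUrl(string $style): ?string
{
    $pattern = '/background-image\s*:\s*url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)/i';
    if (preg_match($pattern, $style, $m) === 1) {
        return trim($m[1]);
    }

    return null;
}

var_dump(backgroundImageUrl("background-image: url('/image.jpg');"));
```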
grabby.php:
<?php
use Graby\Graby;
require(__DIR__ . '/vendor/autoload.php');
require(__DIR__ . '/src/Graby.php');
$url = 'http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html';
$graby = new Graby(['debug' => true]);
$result = $graby->fetchContent($url);
print_r($result);
Command
php ./grabby.php
returns
Array
(
[status] => 310
[html] => [unable to retrieve full-text content]
[title] => No title found
[language] =>
[date] =>
[authors] => Array
(
)
[url] => http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html
[content_type] =>
[open_graph] => Array
(
)
[native_ad] =>
[all_headers] => Array
(
)
[summary] => [unable to retrieve full-text content]
)
graby.log
[2018-02-25 12:51:35] graby.DEBUG: Graby is ready to fetch [] []
[2018-02-25 12:51:35] graby.DEBUG: . looking for site config for {host} in primary folder {"host":"edition.cnn.com"} []
[2018-02-25 12:51:35] graby.DEBUG: ... found site config {host} {"host":"edition.cnn.com.txt"} []
[2018-02-25 12:51:35] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-02-25 12:51:35] graby.DEBUG: . looking for site config for {host} in primary folder {"host":"global"} []
[2018-02-25 12:51:35] graby.DEBUG: ... found site config {host} {"host":"global.txt"} []
[2018-02-25 12:51:35] graby.DEBUG: Cached site config with key: {key} {"key":"edition.cnn.com"} []
[2018-02-25 12:51:35] graby.DEBUG: . looking for site config for {host} in primary folder {"host":"global"} []
[2018-02-25 12:51:35] graby.DEBUG: ... found site config {host} {"host":"global.txt"} []
[2018-02-25 12:51:35] graby.DEBUG: Appending site config settings from global.txt [] []
[2018-02-25 12:51:35] graby.DEBUG: Cached site config with key: {key} {"key":"global"} []
[2018-02-25 12:51:35] graby.DEBUG: Cached site config with key: {key} {"key":"edition.cnn.com.merged"} []
[2018-02-25 12:51:35] graby.DEBUG: Fetching url: {url} {"url":"http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html"} []
[2018-02-25 12:51:35] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html"} []
[2018-02-25 12:51:35] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html"} []
[2018-02-25 12:51:35] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html"} []
[2018-02-25 12:51:36] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:36] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:36] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:36] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:37] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:37] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:37] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:37] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:37] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:37] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:37] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:37] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:38] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:38] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:38] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:38] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:38] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:38] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:38] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:38] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:39] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:39] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:39] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:39] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:39] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:39] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:39] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:39] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:40] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:40] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:40] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:40] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:40] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/2.67.1/static/unsupp.html [] []
[2018-02-25 12:51:40] graby.DEBUG: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:40] graby.DEBUG: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:40] graby.DEBUG: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://edition.cnn.com/2.67.1/static/unsupp.html"} []
[2018-02-25 12:51:41] graby.DEBUG: Meta refresh redirect found (http-equiv="refresh"), new URL: https://edition.cnn.com/ [] []
[2018-02-25 12:51:41] graby.DEBUG: Endless redirect: 11 on "{url}" {"url":"https://edition.cnn.com/"} []
[2018-02-25 12:51:41] graby.DEBUG: Opengraph data: {ogData} {"ogData":[]} []
[2018-02-25 12:51:41] graby.DEBUG: Looking for site config files to see if single page link exists [] []
[2018-02-25 12:51:41] graby.DEBUG: Returning cached and merged site config for {host} {"host":"edition.cnn.com"} []
[2018-02-25 12:51:41] graby.DEBUG: No "single_page_link" config found [] []
[2018-02-25 12:51:41] graby.DEBUG: Attempting to extract content [] []
[2018-02-25 12:51:41] graby.DEBUG: Returning cached and merged site config for {host} {"host":"edition.cnn.com"} []
[2018-02-25 12:51:41] graby.DEBUG: Strings replaced: {count} (find_string and/or replace_string) {"count":0} []
[2018-02-25 12:51:41] graby.DEBUG: Attempting to parse HTML with {parser} {"parser":"libxml"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for language {"pattern":"//html[@lang]/@lang"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {string} to strip element {"string":"highlights"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for body (content length: {content_length}) {"pattern":"//section[contains(@class, 'body-text')]","content_length":232} []
[2018-02-25 12:51:41] graby.DEBUG: Using Readability [] []
[2018-02-25 12:51:41] graby.DEBUG: Detected title: {title} {"title":""} []
[2018-02-25 12:51:41] graby.DEBUG: Trying again without tidy [] []
[2018-02-25 12:51:41] graby.DEBUG: Strings replaced: {count} (find_string and/or replace_string) {"count":0} []
[2018-02-25 12:51:41] graby.DEBUG: Attempting to parse HTML with {parser} {"parser":"libxml"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for language {"pattern":"//html[@lang]/@lang"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {string} to strip element {"string":"highlights"} []
[2018-02-25 12:51:41] graby.DEBUG: Trying {pattern} for body (content length: {content_length}) {"pattern":"//section[contains(@class, 'body-text')]","content_length":154} []
[2018-02-25 12:51:41] graby.DEBUG: Using Readability [] []
[2018-02-25 12:51:41] graby.DEBUG: Detected title: {title} {"title":""} []
[2018-02-25 12:51:41] graby.DEBUG: Success ? {is_success} {"is_success":false} []
[2018-02-25 12:51:41] graby.DEBUG: Extract failed [] []
edition.cnn.com.txt:
body: //section[contains(@class, 'body-text')]
strip_id_or_class: highlights
# Avoid redirecting to 'unsupported browser' page
find_string: <meta http-equiv="refresh"
replace_string: <meta norefresh
test_url: http://edition.cnn.com/2012/05/13/us/new-york-police-policy/index.html
test_contains: this discriminatory and ineffective practice
test_url: http://rss.cnn.com/rss/edition.rss
test_url: http://rss.cnn.com/rss/edition_technology.rss
The other websites I checked work correctly.
The issue seems to be that the IF IE conditional tags are not stripped from the page and the <meta refresh tags are not replaced.
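For reference, this is roughly what the find_string / replace_string pair in the site config is meant to achieve. In the log above, the meta refresh redirect is followed during fetching, while the "Strings replaced" step only appears later with a count of 0. The sketch below is a hypothetical illustration (`applyStringReplacements` is not Graby's internal function) of applying the replacement to the raw HTML before any redirect detection:

```php
<?php
// Hypothetical illustration of the find_string / replace_string pair shown above;
// Graby's internal implementation may differ.
function applyStringReplacements(string $html, array $find, array $replace): string
{
    // Each find[i] is replaced by replace[i], in order.
    return str_replace($find, $replace, $html);
}

$html = '<head><meta http-equiv="refresh" content="1;url=/2.67.1/static/unsupp.html"></head>';
$clean = applyStringReplacements($html, ['<meta http-equiv="refresh"'], ['<meta norefresh']);

// After the replacement there is no http-equiv="refresh" left to follow.
echo $clean, PHP_EOL;
```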
As shown here, Full-Text RSS extracts the publication date of the articles.
Graby should extract it too.
The default log file will be located in the composer directory which should not even be writeable. For greater flexibility, setting a logger should be allowed similarly to how php-readability allows it.
Hi,
I'm trying to figure out if there is a way to strip all inline styles? I'm debating whether I should fork and add functionality here or post process on the HTML after running through graby.
Joe
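For the post-processing route, a sketch using PHP's DOM extension is below. `stripInlineStyles` is a hypothetical helper applied to the HTML string that graby returns; it is not a graby option.

```php
<?php
// Hypothetical post-processing helper, not part of Graby.
function stripInlineStyles(string $html): string
{
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc = new DOMDocument();
    // The XML prolog hints UTF-8; the <div> wrapper lets us get the fragment back.
    $doc->loadHTML('<?xml encoding="UTF-8"?><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove the style attribute from every element that has one.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@style]') as $node) {
        $node->removeAttribute('style');
    }

    // Serialise only the children of the wrapper, not the html/body scaffolding.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

echo stripInlineStyles('<p style="color:red">Hello <span style="font-size:2em">world</span></p>'), PHP_EOL;
```

This removes style attributes only. `<style>` elements and class-based styling are left untouched.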
I'm using the Graby library for a side project. When I try to render a page with embedded images, the figcaption sometimes gets pushed partway down the page.
Has anyone else run into this?
Hi,
I've got a strange behavior with graby on my local machine. I started with wallabag 2.2.2, graby 1.6.0, readability 1.1.6 but I have the same issue with a standalone graby.
This URL http://www.rom-game.fr/news/2446-Jean%20Baudlot%20-%20de%20l%20Eurovision%20a%20Delphine%20Software.html cannot be parsed on my local machine but it works on https://f43.me/feed/test .
Any idea?
From @metasystem on November 10, 2015 16:3
Hi,
I just installed wallabag and I have an issue on Debian Jessie:
INFO - Matched route "new_entry".
DEBUG - Read existing security token from the session.
DEBUG - SELECT t0.username AS username_1, t0.username_canonical AS username_canonical_2, t0.email AS email_3, t0.email_canonical AS email_canonical_4, t0.enabled AS enabled_5, t0.salt AS salt_6, t0.password AS password_7, t0.last_login AS last_login_8, t0.locked AS locked_9, t0.expired AS expired_10, t0.expires_at AS expires_at_11, t0.confirmation_token AS confirmation_token_12, t0.password_requested_at AS password_requested_at_13, t0.roles AS roles_14, t0.credentials_expired AS credentials_expired_15, t0.credentials_expire_at AS credentials_expire_at_16, t0.id AS id_17, t0.name AS name_18, t0.created_at AS created_at_19, t0.updated_at AS updated_at_20, t0.authCode AS authCode_21, t0.twoFactorAuthentication AS twoFactorAuthentication_22, t0.trusted AS trusted_23, t24.id AS id_25, t24.theme AS theme_26, t24.items_per_page AS items_per_page_27, t24.language AS language_28, t24.rss_token AS rss_token_29, t24.rss_limit AS rss_limit_30, t24.user_id AS user_id_31 FROM wallabaguser t0 LEFT JOIN wallabagconfig t24 ON t24.user_id = t0.id WHERE t0.id = ?
DEBUG - User was reloaded from a user provider.
DEBUG - Notified event "kernel.request" to listener "Nelmio\CorsBundle\EventListener\CorsListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\DebugHandlersListener::configure".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\ProfilerListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\DumpListener::configure".
DEBUG - Notified event "kernel.request" to listener "Symfony\Bundle\FrameworkBundle\EventListener\SessionListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\FragmentListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\RouterListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Wallabag\CoreBundle\EventListener\LocaleListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\LocaleListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "FOS\RestBundle\EventListener\BodyListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\TranslatorListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\Security\Http\Firewall::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Bundle\AsseticBundle\EventListener\RequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Nelmio\ApiDocBundle\EventListener\RequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Liip\ThemeBundle\EventListener\ThemeRequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Scheb\TwoFactorBundle\Security\TwoFactor\EventListener\RequestListener::onCoreRequest".
DEBUG - Notified event "kernel.controller" to listener "FOS\RestBundle\EventListener\ParamFetcherListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Symfony\Bundle\FrameworkBundle\DataCollector\RouterDataCollector::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Symfony\Component\HttpKernel\DataCollector\RequestDataCollector::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\ControllerListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\ParamConverterListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\HttpCacheListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\SecurityListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "FOS\RestBundle\EventListener\ViewResponseListener::onKernelController".
DEBUG - Graby is ready to fetch
DEBUG - Fetching url: {url}
DEBUG - Trying using method "{method}" on url "{url}"
CRITICAL - Fatal Error: Call to undefined function Graby\Ring\Client\curl_init()
CRITICAL - Uncaught PHP Exception Symfony\Component\Debug\Exception\UndefinedFunctionException: "Attempted to call function "curl_init" from namespace "Graby\Ring\Client"." at /root/wallabag/vendor/j0k3r/graby/src/Ring/Client/SafeCurlHandler.php line 49
DEBUG - Notified event "kernel.request" to listener "Nelmio\CorsBundle\EventListener\CorsListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\DebugHandlersListener::configure".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\ProfilerListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\DumpListener::configure".
DEBUG - Notified event "kernel.request" to listener "Symfony\Bundle\FrameworkBundle\EventListener\SessionListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\FragmentListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\RouterListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Wallabag\CoreBundle\EventListener\LocaleListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\LocaleListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "FOS\RestBundle\EventListener\BodyListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\HttpKernel\EventListener\TranslatorListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Component\Security\Http\Firewall::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Symfony\Bundle\AsseticBundle\EventListener\RequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Nelmio\ApiDocBundle\EventListener\RequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Liip\ThemeBundle\EventListener\ThemeRequestListener::onKernelRequest".
DEBUG - Notified event "kernel.request" to listener "Scheb\TwoFactorBundle\Security\TwoFactor\EventListener\RequestListener::onCoreRequest".
DEBUG - Notified event "kernel.controller" to listener "FOS\RestBundle\EventListener\ParamFetcherListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Symfony\Bundle\FrameworkBundle\DataCollector\RouterDataCollector::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Symfony\Component\HttpKernel\DataCollector\RequestDataCollector::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\ControllerListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\ParamConverterListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\HttpCacheListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "Sensio\Bundle\FrameworkExtraBundle\EventListener\SecurityListener::onKernelController".
DEBUG - Notified event "kernel.controller" to listener "FOS\RestBundle\EventListener\ViewResponseListener::onKernelController".
INFO - Defining the initRuntime() method in the "form" extension is deprecated. Use the needs_environment option to get the Twig_Environment instance in filters, functions, or tests; or explicitly implement Twig_Extension_InitRuntimeInterface if needed (not recommended).
INFO - Defining the getGlobals() method in the "assetic" extension is deprecated without explicitly implementing Twig_Extension_GlobalsInterface.
Installed with pdo_mysql
Copied from original issue: wallabag/wallabag#1511
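The "Call to undefined function ... curl_init()" error means PHP could not find curl_init either in the Graby\Ring\Client namespace or globally, which is what happens when the cURL extension is not loaded. A quick check can be run like this; keep in mind that the CLI and the web server may use different PHP configurations, so it is worth running under both:

```php
<?php
// Quick diagnostic: is the cURL extension available to this PHP runtime?
$hasCurl = extension_loaded('curl') && function_exists('curl_init');

echo $hasCurl
    ? "cURL extension is loaded\n"
    : "cURL extension is missing; install or enable it (package name varies by distribution)\n";
```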
I am posting this issue here because I think it is a bug in graby. The site config appears to be valid.
When adding http://www.linuxjournal.com/content/papas-got-brand-new-nas to wallabag, content fetching takes a very long time. Adding this one page alone makes wallabag's prod.log grow by 2.4 MB.
It is only a problem with multi-page articles on linuxjournal.
What I observe:
This article appears to continue on subsequent pages which we could not extract
I have attached the log: prod.txt
From a first superficial look at the log, graby has a problem with the URL for the next page, because it contains an unusual string (maybe the comma?).
As we discussed here wallabag/wallabag#1335, graby should return Content-Type of the URL.
When I try to fetch this URL http://fipcom.net/en.signup, the HttpClient::fetch method returns a binary string, see below (the b before the content).
I don't know why we have this binary string here. It seems to be related to PHP 6: https://stackoverflow.com/questions/4749442/what-does-the-b-in-front-of-string-literals-do
The first content is converted to:
<h3>By and ,</h3>
When the original content is:
<span class="pb-byline" itemprop="author" itemscope="" itemtype="http://schema.org/Person">
By
<a href="https://www.washingtonpost.com/people/carol-morello/">
<span itemprop="name">Carol Morello</span>
</a>
and
<a href="https://www.washingtonpost.com/people/greg-miller/">
<span itemprop="name">Greg Miller</span>
</a>
</span>
Hello,
I do not know where to put that, because it concerns graby, graby-site-config and wallabag.
I was wondering if there was a way to have a small "standalone" version of graby that would read the config files without caching anything and return the content.
Basically, I am trying to help fivefilters (and thus graby-site-config) write new config files, but doing it with wallabag running on a not-so-powerful server is really painful. Each time I make a change in the configuration files, I have to clear wallabag's cache, which takes quite long (between 1-2 minutes on a Cubietruck!), delete the article and submit it again to wallabag. The whole process can take a few minutes, even when the issue was just a missed comma :( .
Unfortunately, I am hopeless at coding anything in PHP... The ideal would be a PHP file without cache that reads just one config file on the fly (or a specified config file) and that, given a URL, displays the content without any stylesheet (thus showing very quickly what the titles, paragraphs and so on are).
Thanks in advance, and do not hesitate to ask further details if needed.
Regards
For some websites, if we find ng-controller in the HTML, we should try to get a static page using _escaped_fragment_.
For example: https://tempostorm.com/articles/the-singleton-special-when-to-oneoff?_escaped_fragment_
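A minimal sketch of that idea, using two hypothetical helpers (`looksLikeAngularPage` and `withEscapedFragment`) that are not Graby's API. Dropping any existing `#...` part instead of mapping it into the parameter value is a simplification:

```php
<?php
// Hypothetical helpers for illustration only.

// Very rough detection: the markup mentions an AngularJS controller.
function looksLikeAngularPage(string $html): bool
{
    return stripos($html, 'ng-controller') !== false;
}

// Build the "static snapshot" URL by appending the _escaped_fragment_ parameter.
function withEscapedFragment(string $url): string
{
    // Drop any "#..." part, then append the parameter with ? or & as appropriate.
    $url = explode('#', $url, 2)[0];
    $separator = strpos($url, '?') === false ? '?' : '&';

    return $url . $separator . '_escaped_fragment_=';
}

$html = '<div ng-controller="ArticleCtrl"></div>';
if (looksLikeAngularPage($html)) {
    echo withEscapedFragment('https://tempostorm.com/articles/the-singleton-special-when-to-oneoff'), PHP_EOL;
}
```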
Related wallabag/wallabag#1413
Hello !
I come from Wallabag, which is using this project.
I have a problem retrieving content from a web page, because graby automatically adds ?_escaped_fragment_= at the end of the URL, for AJAX crawling purposes.
That's a problem because the website in question returns a 404 error when it detects this escaped fragment, probably to avoid being fetched by robots.
Still, the content seems to be accessible without the fragment.
A solution would be to fetch the URL again without the escaped fragment when a 404 error is returned.
Here is the website, you can test with or without the escaped fragment:
https://dzone.com/
https://dzone.com/?_escaped_fragment_=
Thank you !
Antonin
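The proposed fallback could look roughly like this. It is a sketch under assumptions: `$fetch` stands in for whatever performs the HTTP request and is assumed to return an array with a `status` key (as in graby's result arrays shown elsewhere on this page), and both function names are hypothetical.

```php
<?php
// Hypothetical helpers for illustration only.

// Remove the _escaped_fragment_ parameter from a URL, keeping other parameters.
function withoutEscapedFragment(string $url): string
{
    $stripped = preg_replace('/([?&])_escaped_fragment_=[^&#]*&?/', '$1', $url) ?? $url;

    // Clean up a dangling "?" or "&" left at the end.
    return rtrim($stripped, '?&');
}

// $fetch is a hypothetical callable: it takes a URL and returns an array
// containing at least a 'status' key.
function fetchWithFallback(callable $fetch, string $url): array
{
    $response = $fetch($url);
    $stripped = withoutEscapedFragment($url);

    // Retry once without the escaped fragment when the site answers 404.
    if ($response['status'] === 404 && $stripped !== $url) {
        return $fetch($stripped);
    }

    return $response;
}

echo withoutEscapedFragment('https://dzone.com/?_escaped_fragment_='), PHP_EOL;
```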
In wallabag, when I try to add this URL http://www.journaldugamer.com/tests/rencontre-ils-bossaient-sur-une-exclu-kinect-qui-ne-sortira-jamais/, I've got this error in my logs:
[2016-12-06 22:43:06] app.ERROR: Error while saving an entry {"exception":"[object] (Symfony\\Component\\Debug\\Exception\\ContextErrorException(code: 0): Warning: DOMDocument::importNode(): Couldn't fetch Readability\\JSLikeHTMLElement at /var/www/wallabag/vendor/j0k3r/graby/src/Graby.php:297)","entry":"[object] (Wallabag\\CoreBundle\\Entity\\Entry: {})"} []
Hello!
I have a problem.
I'm trying to install this package, https://github.com/artesaos/laravel-linkedin, but it requires version 6.1.x of https://github.com/guzzle/guzzle.
But your perfect (no irony!) package blocks the installation because it requires an old version of Guzzle, the PHP HTTP client.
Could you fix it, please?
And thank you for the great package!
Sincerely, Dmitry.
For the mobile.twitter.com configuration, I want to do this:
title: (//div[contains(@class, 'TweetDetail-text') or contains(@class, 'tweet-text')])[1]
author: (//*[contains(@class, 'UserNames-displayName') or contains(@class, 'fullname')])[1]
body: (//div[contains(@class, 'TweetDetail-text') or contains(@class, 'tweet-text')])[1]
date: (//div[contains(@class, 'TweetDetail-timeAndGeo') or contains(@class, 'metadata')])[1]
I have the same xpath for title and body.
But the content is OK only for the title.
The body is wrong.
Do you have any idea?
I don't know if this is the right place but I don't know where to ask anyway...
I'm using wallabag, which internally uses Graby to extract article contents. I'm having problems with this website http://www.muylinux.com/2016/03/22/kde-plasma-5-6 but it seems that it's working in Full-Text RSS
this website =>http://f43.me/feed/test
What can I do to fix that URL in Graby?
If an article contains media (audio / video), graby should add a flag in the response (useful for this issue wallabag/wallabag#880).
This is supported in Full-Text RSS but not in Graby. The same with the date (#67)
See https://support.google.com/webmasters/answer/6340290?hl=fr
If page meta contains , maybe try to parse this one instead of original one.
Same goes to opening original url (quite interesting when you're on mobile).
Of course, this should be set as a setting.
Originally at wallabag/wallabag#2173
As reported with wallabag, graby is vulnerable to SSRF. This means one can bypass restrictions for resources only accessible to localhost, like http://127.0.0.1/server-status.
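One common mitigation is to resolve the host before fetching and refuse addresses that are not publicly routable. The sketch below is in Python and is not how graby handles it; is_public_address and is_url_allowed are hypothetical names, and the set of rejected ranges is an assumption about what should count as internal.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(ip: str) -> bool:
    """Return True only for addresses that are not internal or special-purpose."""
    addr = ipaddress.ip_address(ip)
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )


def is_url_allowed(url: str) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parts.hostname, None)
    except socket.gaierror:
        return False
    # Every resolved address must be public, otherwise the URL is refused.
    return all(is_public_address(info[4][0]) for info in infos)
```

A check like this is only a first line of defence: redirects have to be validated the same way, and the request should go to the address that was checked, otherwise a second DNS lookup could return a different one.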
From @tcitworld on September 29, 2015 18:42
Guess it's graby-related, but I put it here.
Medium seems to use iframes to display some pictures and the url is not rewritten to match the original website.
https://medium.com/vantage/live-photos-are-a-gimmick-92d7ad03bcbe
Copied from original issue: wallabag/wallabag#1438
nt
See https://github.com/fin1te/safecurl/blob/v1.1/src/fin1te/SafeCurl/SafeCurl.php#L151 : Execption. The typo is fixed in the master branch. I've sent a tweet and will probably contact him by email if necessary.
This causes wallabag to fail dramatically when adding a link with no working connection.
My installation seems to be failing because I have guzzle 6 I think?
- Installation request for j0k3r/graby ^1.10 -> satisfiable by j0k3r/graby[1.10.0].
- j0k3r/graby 1.10.0 requires guzzlehttp/guzzle ^5.2.0 -> satisfiable by guzzlehttp/guzzle[5.2.0, 5.3.0, 5.3.1, 5.3.x-dev] but these conflict with your requirements or minimum-stability.
Add support for native_ad_clue (see FTRSS).
More information about Native Ad can be found on the FiveFilters blog.
On master:
2) Tests\Graby\GrabyFunctionalTest::testDate with data set #1 ('https://www.reddit.com/r/Linu...guide/', '2013-05-30T16:01:58+00:00')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-2013-05-30T16:01:58+00:00
+2013-05-30T16:10:50+00:00
The actual date seems to be the one of the first comment below the main content.
From @pVesian on June 7, 2017 8:14
In HTML, ordered lists can have a "start" attribute that sets the number the list starts from, instead of the default one.
Wallabagit & f43.me
Store this article: http://www.timothysykes.com/blog/10-things-know-short-selling/
Scroll to "“Called out” or “Buy in”", it's point 4 in the article, but point 1 in the stored article. By inspecting the HTML code, you will find that the parser removes the "start" attribute.
Thanks
Copied from original issue: wallabag/wallabag#3185
Currently you are getting the language information from the "lang" tag, but many sites don't have this tag and use HTTP headers instead.
My proposal is to use the "lang" tag and, if it's not available, get the language from the "Content-Language" header.
Example URL without the tag but with the header:
http://www.investopedia.com/university/introduction-stock-trader-types/pivot-traders.asp
Already reported in wallabag but they sent me here => wallabag/wallabag#2978
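The proposed fallback could look roughly like this. It is a Python sketch for illustration, not graby's code: detect_language is a hypothetical name, the regular expression is a simplification that only looks at the lang attribute of the html element, and headers is assumed to be a plain mapping of response header names to values.

```python
import re
from typing import Mapping, Optional

# Simplified match for <html ... lang="xx">; a real parser would be more robust.
LANG_ATTR = re.compile(
    r"<html\b[^>]*?\blang\s*=\s*[\"']?([A-Za-z0-9-]+)", re.IGNORECASE
)


def detect_language(html: str, headers: Mapping[str, str]) -> Optional[str]:
    """Prefer the lang attribute, then fall back to the Content-Language header."""
    match = LANG_ATTR.search(html)
    if match:
        return match.group(1)

    # Header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == "content-language" and value.strip():
            # The header may list several languages; keep the first one.
            return value.split(",")[0].strip()
    return None
```

Keeping the markup as the first source preserves the current behaviour for sites that already declare a language in the page.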