jkphl / micrometa

A meta parser for extracting micro information out of web documents, currently supporting Microformats 1+2, HTML Microdata, RDFa Lite 1.1, JSON-LD and Link Types, written in PHP

Home Page: http://micrometa.jkphl.is

License: MIT License

PHP 85.82% HTML 14.18%

micrometa's People

Contributors

b4rtaz, blankse, chtipepere, jkphl, jspaetzel, lyrixx, madeitbelgium, rbairwell, rvanlaak, sarke, tomgillett

micrometa's Issues

Recursive loop when ID is the same as URL

Example code:

<?php
include("vendor/autoload.php");
$jsonld = <<<EOF
<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "Website",
  "@id": "https://www.example.com/",
  "url": "https://www.example.com/"
}
</script>
EOF;

$parser = new \Jkphl\Micrometa\Ports\Parser();
$parser("https://www.example.com/", $jsonld);

If url is changed to omit the trailing slash, it works fine. I'm assuming this is related to issue #27. I'm attempting to put together a PR to fix it, but I'm a little lost; any advice on the most likely cause would be appreciated.

Option to allow HTML in values

There are some cases when allowing HTML in values is expected.

Example input:

<div itemprop="description">line1<br>line2<br>line3<p>line4</p></div>

Output:

line1line2line3line4

Expected output:

line1<br>line2<br>line3<p>line4</p>
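
The two outputs above correspond to reading an element's text content versus serializing its child nodes. A minimal sketch using PHP's DOM extension rather than micrometa itself; the innerHtml() helper is hypothetical and not part of the library:

```php
<?php
// Hypothetical helper: serialize the children of a node instead of flattening them to text.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<div itemprop="description">line1<br>line2<br>line3<p>line4</p></div>');
$div = $dom->getElementsByTagName('div')->item(0);

$asText = $div->textContent; // markup is dropped
$asHtml = innerHtml($div);   // markup is kept

echo $asText, "\n", $asHtml, "\n";
```

An option on the parser could switch between these two ways of reading a value; how that would be wired into micrometa is left open here.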

Installation with composer

Something is weird about the composer/autoloading integration. Instead of installing the dependencies into my vendor folder, it adds a new vendor folder and then is not compatible with composer's autoload.

However, if I remove the lines searching and including the autoload in Micrometa.php, it works fine with including the class.

(Environment: OS X 10.10.3, PHP 5.5.15, Composer version 1.0-dev)

References broken between separate script tags

I ran into this problem with the New York Times. They have the NewsArticle and the NewsMediaOrganization in separate script tags, and because Jkphl\Micrometa\Infrastructure\Parser\JsonLD parses each tag separately with ML\JsonLD\JsonLD, the latter cannot resolve references across tags (in this case, the publication).

My solution was to rewrite Jkphl\Micrometa\Infrastructure\Parser\JsonLD::parseDom() to collect the schema in one root node, and send that off to ML\JsonLD\JsonLD once.

I don't have time to do a PR tonight, but I will eventually.
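
Until a PR exists, the idea of collecting all blocks under one root can be sketched with PHP's DOM extension alone. This is not the actual parseDom() rewrite; collectJsonLd() and the sample markup are hypothetical, and it assumes each block carries its own @context so the nodes can sit side by side under one @graph:

```php
<?php
// Hypothetical helper: gather every JSON-LD <script> block of a page into one document.
function collectJsonLd(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely clean
    $dom->loadHTML($html);
    libxml_clear_errors();

    $graph = [];
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $node = json_decode($script->textContent, true);
        if (is_array($node)) {
            $graph[] = $node; // skip blocks that are not valid JSON
        }
    }
    return ['@graph' => $graph];
}

$html = '<html><head>'
    . '<script type="application/ld+json">{"@context": "http://schema.org", "@type": "NewsArticle", '
    . '"publisher": {"@id": "https://example.com/#org"}}</script>'
    . '<script type="application/ld+json">{"@context": "http://schema.org", "@type": "NewsMediaOrganization", '
    . '"@id": "https://example.com/#org", "name": "Example News"}</script>'
    . '</head><body></body></html>';

$merged = collectJsonLd($html);
echo json_encode($merged, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), "\n";
```

The merged document could then be handed to the JSON-LD processor once, so that both nodes are visible when the publisher reference is resolved.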

Endless parsing

Hi!

The JSON-LD parser of micrometa 2 never finishes if @id and the http://schema.org/sameAs property have the same URL.
I used the following script for testing:

<?php
require_once  'vendor/autoload.php';
use Jkphl\Micrometa\Ports\Parser;

$htmlSource_jsonld = '<!DOCTYPE html>
<html>
    <head>
        <script type="application/ld+json">
        {
          "@context": "http://schema.org",
          "@type": "organization",
          "@id": "http://www.website.de",
          "sameAs": "http://www.website.de"
        }
        </script>
    </head>
    <body></body>
</html>';

$time_start = microtime(true);

$objMicrometa = new Parser();
$result = $objMicrometa("http://www.website.de", $htmlSource_jsonld);

$time_end = microtime(true);
$time = $time_end - $time_start;
echo "\n $time Seconds Runtime";
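
One plausible cause, not confirmed against the micrometa source, is that a node whose property points back to its own @id is followed again and again. A generic guard against this is to remember which IDs are already being resolved. The sketch below uses a hypothetical reference map and a hypothetical resolve() function:

```php
<?php
// Hypothetical graph: node ID => IDs it references.
// The only node references itself, as in the report above.
$references = [
    'http://www.website.de' => ['http://www.website.de'],
];

// Follow references depth-first, visiting every ID at most once.
function resolve(string $id, array $references, array &$visited = []): array
{
    if (isset($visited[$id])) {
        return []; // already seen: stop instead of recursing again
    }
    $visited[$id] = true;

    $resolved = [$id];
    foreach ($references[$id] ?? [] as $target) {
        $resolved = array_merge($resolved, resolve($target, $references, $visited));
    }
    return $resolved;
}

$result = resolve('http://www.website.de', $references);
print_r($result);
```

Without the $visited check, the same input would recurse until PHP runs out of memory or stack.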

Documentation errata

  • Add a hint to the live demo site
  • Promote the headlines > level 2 to show up on Read The Docs

JSON-LD parser only finds the first item

On 20.03.2017 at 13:59, Claas Kalwa wrote:

Hello Joschi,

I am having trouble extracting multiple JSON-LD items with the
Micrometa V1 parser. It only recognizes the first item, regardless of
whether the items are grouped with @graph or appear separately in their
own script elements.

Attached is an example that, I think, should actually work.

Do you have any idea where the problem might be?

Example source:

<!DOCTYPE html>

<html>
    <head>
        <title>TODO supply a title</title>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">

	<script type="application/ld+json">
	{
	 "@context": "http://schema.org",
	 "@graph": [
	{
	  "name": "Google Inc.",
	  "@type": "LocalBusiness",
	  "address": {
	    "@type": "PostalAddress",
	    "addressCountry": "United States",
	    "streetAddress": "1600 Amphitheatre Parkway",
	    "addressLocality": "Mountain View",
	    "addressRegion": "CA",
	    "postOfficeBoxNumber": null,
	    "postalCode": "94043",
	    "telephone": "+1 650-253-0000",
	    "faxNumber": "+1 650-253-0001"
	  }
	},
	{
	  "name": "Google Ann Arbor",
	  "@type": "LocalBusiness",
	  "address": {
	    "@type": "PostalAddress",
	    "addressCountry": "United States",
	    "streetAddress": "201 S. Division St. Suite 500",
	    "addressLocality": "Ann Arbor",
	    "addressRegion": "MI",
	    "postOfficeBoxNumber": null,
	    "postalCode": "48104",
	    "telephone": "+1 734-332-6500",
	    "faxNumber": "+1 734-332-6501"
	  }
	}
	 ]
	}
	</script>

    </head>
    <body>
        <div>TODO write content</div>
        
    </body>
</html>

Class 'Guzzle\Http\Url' not found

Hi

When trying out the library, I got this message:

 [message] => Class 'Guzzle\Http\Url' not found
 [file] => vendor/jkphl/dom-factory/src/Domfactory/Infrastructure/Dom.php

The thing is, I have upgraded Guzzle to 6.2.3, so it should work. It looks like Guzzle no longer has that class.

Any idea?

Schema.org type serialization invalid when http and https are mixed

Issue #24 lays out the argument that the http and https namespaces are semantically different. This is something we can work around, as the library's implementation is flexible enough for that.

A consequence is that calling item->toObject will lead to an invalid type:

image

An IRI such as http://schema.org/https://schema.org/WebPage is invalid.

What about adding some helpers that sanitize the profile so that http and https are handled equally?

Schema.org has redirected http to https for a while now, so both can be seen as identical. That would solve the following inconsistency:

image

Error in ItemList->getFirstItem()

The getFirstItem() method uses the result of getItems():

$items = $this->getItems(...$types);

and, if all is well, returns the element at index 0:

return $items[0];

Here is the link to this line in code

But getItems() uses array_filter(), which preserves the array's keys. So, for example, if we have

ItemList $items [
    0 => type 'Breadcrumb'
    1 => type 'Breadcrumb'
    2 => type 'Product'
]

then after $items->getItems('Product') it will become

ItemList $items [
   2 => type 'Product'
]

and $items[0] will return null.

I suggest using

public function getFirstItem(...$types)
    {
        $items = array_values($this->getItems(...$types));

or

public function getItems(...$types)
    {
        ...

        return array_values($this->items);
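
The key-preserving behaviour can be reproduced with plain arrays, independent of the library. In this sketch the type names stand in for the items:

```php
<?php
// Plain strings stand in for the items of the list.
$items = [0 => 'Breadcrumb', 1 => 'Breadcrumb', 2 => 'Product'];

// array_filter() keeps the original keys of the elements that pass the callback.
$filtered = array_filter($items, function ($type) {
    return $type === 'Product';
});

$broken = $filtered[0] ?? null;       // no element at key 0 any more
$fixed  = array_values($filtered)[0]; // re-indexed from 0, so this works

var_dump(array_keys($filtered), $broken, $fixed);
```

Either of the two suggested changes re-indexes the array before index 0 is read; reset($items) would be another way to fetch the first element without caring about keys.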

Unknown JSON-LD item

Hi

I'm looking to build a script that sees what data it can glean from any given url, microdata first, then content. Your parser seems perfect for that, but I've noticed a case where an error is thrown in certain situations.

I'm giving the following url:
http://www.currys.co.uk/gbuk/computing/laptops/laptops/lenovo-yoga-510-14-2-in-1-black-10146249-pdt.html

And I'm getting the following warning:

Warning: get_class() expects parameter 1 to be object, array given in C:\Users\danm\Documents\Websites\page-scraper-analyser\vendor\jkphl\micrometa\src\Jkphl\Micrometa\Parser\JsonLD.php on line 217
Unknown JSON-LD item: {"items":[{"id":"_:b0","types":["http:\/\/schema.org\/BreadcrumbList"]

Is it finding microdata but attempting to parse it as JSON-LD?

I've also noticed cases where no data is obtained even though microdata is used on the page. Is this indicative of poor configuration on their end?

Thanks in advance

EDIT

Here's a list of urls with data that either isn't being returned, or is buggy:

I appreciate that some of these may be down to the implementation of the microdata on the pages themselves.

Assume `isPartOf` item as `CreativeWork` when instance of `ValueInterface`

Our implementation traverses the properties of an ItemInterface to look for isPartOf.

The expected value of isPartOf should always be an instance of CreativeWork, as stated in the specification: https://schema.org/isPartOf

As we do not live in an ideal world, for many URLs the value of isPartOf is actually just a string. More technically, isPartOf is returned as a ValueInterface instead of an ItemInterface.

How to fix?

As the expected value should always be an instance of CreativeWork, the interpreter / parser should not parse values of isPartOf as ValueInterface, but should turn them into an ItemInterface of type CreativeWork.

In support of this fix, the Google Structured Data test tool also treats these values as described above.

Another approach: what about adding an isPartOf method to ItemInterface with the return type ?ItemInterface? Or is that JSON-LD specific?

Test data

Poorly coded Microdata attributes

Hi!

Is there any way to deal with poorly coded Microdata attributes which specifically mess up the Vocabulary URI?

For example, fetching "https://www.belibe.it/zoccoli-professionali-dian-eva.html" while looking for Microdata I got a "BreadcrumbList" with some "ListItem" inside (yes, it's ok) but I also got a "https://schema.org/Product" (not a "Product") with "//schema.org/Offer" inside "offer" property. It's clear that they are not using just one Vocabulary URI but three: "http://schema.org", "https://schema.org" and "//schema.org".

How could I get all three as just one?

Thanks!
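
One way to treat the three spellings as one is to normalize type IRIs before comparing them. A minimal sketch, assuming the normalization happens in user code on the IRI strings; normalizeSchemaOrgIri() is hypothetical and not part of micrometa:

```php
<?php
// Hypothetical helper: map http://, https:// and protocol-relative schema.org IRIs
// to one canonical form. Other IRIs are returned unchanged.
function normalizeSchemaOrgIri(string $iri): string
{
    return preg_replace('#^(?:https?:)?//schema\.org/#i', 'http://schema.org/', $iri);
}

$types = [
    'http://schema.org/BreadcrumbList',
    'https://schema.org/Product',
    '//schema.org/Offer',
];
$normalized = array_map('normalizeSchemaOrgIri', $types);

print_r($normalized);
```

Choosing http://schema.org/ as the canonical form is arbitrary here; what matters is that both sides of a comparison go through the same function.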

Inaccessible properties with capital letters

Hi

Not sure if this applies to all properties, as I'm just testing out the library, but I can not access a startDate property of an event microdata type.

Reproducible with:

$url = 'http://www.residentadvisor.net/events.aspx?ai=174'; // just some random event site
$parser = new \Jkphl\Micrometa($url);
$item = $parser->item('http://data-vocabulary.org/Event');
print_r($item);
var_dump($item->startDate);

Outputs:

Jkphl\Micrometa\Parser\Microdata\Item Object
(
[_url:protected] => Jkphl\Utility\Url Object
    (
        [_url:protected] => http://www.residentadvisor.net/events.aspx?ai=174
        [_parts:protected] => Array
            (
                [scheme] => http
                [host] => www.residentadvisor.net
                [path] => /events.aspx
                [query] => Array
                    (
                        [ai] => 174
                    )

            )

    )

[types] => Array
    (
        [0] => http://data-vocabulary.org/Event
    )

[id] => 
[value] => 
[_properties:protected] => stdClass Object
    (
        [startDate] => Array
            (
                [0] => 2014-05-16T00:00
            )

        [summary] => Array
            (
                [0] => 360 Degrees: Osunlade at Bird
            )

        [url] => Array
            (
                [0] => http://www.residentadvisor.net/event.aspx?569206=
            )

    )

)
NULL

In Jkphl\Micrometa\Item::__get any uppercase characters are exchanged as follows:

`startDate` -> `start-date`

If I remove the responsible line, it works just fine.

Is the property stored correctly as startDate or is it intended to be stored as start-date?
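
The transformation described above looks like a camelCase-to-hyphen conversion, which suits Microformats property names but not Microdata ones. A sketch, not taken from the library's source, of a lookup that tries the name as given before falling back to the hyphenated form; hyphenate() and findProperty() are hypothetical:

```php
<?php
// Hypothetical: turn "startDate" into "start-date".
function hyphenate(string $name): string
{
    return strtolower(preg_replace('/([A-Z])/', '-$1', $name));
}

// Hypothetical: look the name up as given first, then in its hyphenated form.
function findProperty(array $properties, string $name)
{
    if (array_key_exists($name, $properties)) {
        return $properties[$name];
    }
    return $properties[hyphenate($name)] ?? null;
}

$microdata    = ['startDate' => ['2014-05-16T00:00']];
$microformats = ['start-date' => ['2014-05-16T00:00']];

var_dump(findProperty($microdata, 'startDate'));
var_dump(findProperty($microformats, 'startDate'));
```

With this order of lookups, both storage conventions answer to $item->startDate.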

Edit: as for the versions I'm using:

    "mf2/mf2": "dev-master",
    "euskadi31/microdata": "dev-master",
    "jkphl/micrometa": "dev-master",

Unable to get "name" from ListItem -> item

Hi, I can't get the name property from ListItem -> item

Please tell me what I can do.

$items = $micrometa($url, '<script type="application/ld+json">
{
 "@context": "http://schema.org",
 "@type": "BreadcrumbList",
 "itemListElement":
 [
  {
   "@type": "ListItem",
   "position": 1,
   "item":
   {
    "@id": "https://example.com/dresses",
    "name": "Dresses"
    }
  },
  {
   "@type": "ListItem",
  "position": 2,
  "item":
   {
     "@id": "https://example.com/dresses/real",
     "name": "Real Dresses"
   }
  }
 ]
}
</script>');
var_dump($items); exit;
  ["dom":protected]=>
  object(DOMDocument)#1998 (35) {
    ["doctype"]=>
    NULL
    ["implementation"]=>
    string(22) "(object value omitted)"
    ["documentElement"]=>
    string(22) "(object value omitted)"
    ["actualEncoding"]=>
    NULL
    ["encoding"]=>
    NULL
    ["xmlEncoding"]=>
    NULL
    ["standalone"]=>
    bool(true)
    ["xmlStandalone"]=>
    bool(true)
    ["version"]=>
    string(3) "1.0"
    ["xmlVersion"]=>
    string(3) "1.0"
    ["strictErrorChecking"]=>
    bool(true)
    ["documentURI"]=>
    string(9) "/var/www/"
    ["config"]=>
    NULL
    ["formatOutput"]=>
    bool(false)
    ["validateOnParse"]=>
    bool(false)
    ["resolveExternals"]=>
    bool(false)
    ["preserveWhiteSpace"]=>
    bool(true)
    ["recover"]=>
    bool(false)
    ["substituteEntities"]=>
    bool(false)
    ["nodeName"]=>
    string(9) "#document"
    ["nodeValue"]=>
    NULL
    ["nodeType"]=>
    int(9)
    ["parentNode"]=>
    NULL
    ["childNodes"]=>
    string(22) "(object value omitted)"
    ["firstChild"]=>
    string(22) "(object value omitted)"
    ["lastChild"]=>
    string(22) "(object value omitted)"
    ["previousSibling"]=>
    NULL
    ["nextSibling"]=>
    NULL
    ["attributes"]=>
    NULL
    ["ownerDocument"]=>
    NULL
    ["namespaceURI"]=>
    NULL
    ["prefix"]=>
    string(0) ""
    ["localName"]=>
    NULL
    ["baseURI"]=>
    string(9) "/var/www/"
    ["textContent"]=>
    string(315) " { "@context": "http://schema.org", "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "item": { "@id": "https://example.com/dresses", "name": "Dresses" } }, { "@type": "ListItem", "position": 2, "item": { "@id": "https://example.com/dresses/real", "name": "Real Dresses" } } ] } "
  }
  ["links":protected]=>
  NULL
  ["items":protected]=>
  array(1) {
    [0]=>
    object(Jkphl\Micrometa\Ports\Item\Item)#2429 (3) {
      ["item":protected]=>
      object(Jkphl\Micrometa\Application\Item\Item)#4104 (8) {
        ["format":protected]=>
        int(4)
        ["value":protected]=>
        NULL
        ["children":protected]=>
        array(0) {
        }
        ["propertyListFactory":protected]=>
        object(Jkphl\Micrometa\Application\Factory\PropertyListFactory)#3988 (0) {
        }
        ["type":protected]=>
        array(1) {
          [0]=>
          object(Jkphl\Micrometa\Domain\Item\Iri)#2514 (2) {
            ["immutableProfile":protected]=>
            string(18) "http://schema.org/"
            ["immutableName":protected]=>
            string(14) "BreadcrumbList"
          }
        }
        ["properties":protected]=>
        object(Jkphl\Micrometa\Application\Item\PropertyList)#4105 (6) {
          ["aliases":protected]=>
          array(1) {
            ["http://schema.org/itemListElement"]=>
            array(1) {
              [0]=>
              string(15) "itemListElement"
            }
          }
          ["aliasFactory":protected]=>
          object(Jkphl\Micrometa\Application\Factory\AliasFactory)#4103 (0) {
          }
          ["values":protected]=>
          array(1) {
            [0]=>
            array(2) {
              [0]=>
              object(Jkphl\Micrometa\Application\Item\Item)#3986 (8) {
                ["format":protected]=>
                int(4)
                ["value":protected]=>
                NULL
                ["children":protected]=>
                array(0) {
                }
                ["propertyListFactory":protected]=>
                object(Jkphl\Micrometa\Application\Factory\PropertyListFactory)#3988 (0) {
                }
                ["type":protected]=>
                array(1) {
                  [0]=>
                  object(Jkphl\Micrometa\Domain\Item\Iri)#4109 (2) {
                    ["immutableProfile":protected]=>
                    string(18) "http://schema.org/"
                    ["immutableName":protected]=>
                    string(8) "ListItem"
                  }
                }
                ["properties":protected]=>
                object(Jkphl\Micrometa\Application\Item\PropertyList)#4108 (6) {
                  ["aliases":protected]=>
                  array(2) {
                    ["http://schema.org/item"]=>
                    array(1) {
                      [0]=>
                      string(4) "item"
                    }
                    ["http://schema.org/position"]=>
                    array(1) {
                      [0]=>
                      string(8) "position"
                    }
                  }
                  ["aliasFactory":protected]=>
                  object(Jkphl\Micrometa\Application\Factory\AliasFactory)#4094 (0) {
                  }
                  ["values":protected]=>
                  array(2) {
                    [0]=>
                    array(1) {
                      [0]=>
                      object(Jkphl\Micrometa\Application\Value\StringValue)#3998 (2) {
                        ["value":protected]=>
                        string(27) "https://example.com/dresses"
                        ["language":protected]=>
                        NULL
                      }
                    }
                    [1]=>
                    array(1) {
                      [0]=>
                      object(Jkphl\Micrometa\Application\Value\StringValue)#4006 (2) {
                        ["value":protected]=>
                        string(1) "1"
                        ["language":protected]=>
                        NULL
                      }
                    }
                  }
                  ["names":protected]=>
                  array(2) {
                    [0]=>
                    object(Jkphl\Micrometa\Domain\Item\Iri)#4093 (2) {
                      ["immutableProfile":protected]=>
                      string(18) "http://schema.org/"
                      ["immutableName":protected]=>
                      string(4) "item"
                    }
                    [1]=>
                    object(Jkphl\Micrometa\Domain\Item\Iri)#4098 (2) {
                      ["immutableProfile":protected]=>
                      string(18) "http://schema.org/"
                      ["immutableName":protected]=>
                      string(8) "position"
                    }
                  }
                  ["nameToCursor":protected]=>
                  array(2) {
                    ["http://schema.org/item"]=>
                    int(0)
                    ["http://schema.org/position"]=>
                    int(1)
                  }
                  ["cursor":protected]=>
                  int(0)
                }
                ["itemId":protected]=>
                string(4) "_:b1"
                ["itemLanguage":protected]=>
                NULL
              }
              [1]=>
              object(Jkphl\Micrometa\Application\Item\Item)#4100 (8) {
                ["format":protected]=>
                int(4)
                ["value":protected]=>
                NULL
                ["children":protected]=>
                array(0) {
                }
                ["propertyListFactory":protected]=>
                object(Jkphl\Micrometa\Application\Factory\PropertyListFactory)#3988 (0) {
                }
                ["type":protected]=>
                array(1) {
                  [0]=>
                  object(Jkphl\Micrometa\Domain\Item\Iri)#4097 (2) {
                    ["immutableProfile":protected]=>
                    string(18) "http://schema.org/"
                    ["immutableName":protected]=>
                    string(8) "ListItem"
                  }
                }
                ["properties":protected]=>
                object(Jkphl\Micrometa\Application\Item\PropertyList)#4101 (6) {
                  ["aliases":protected]=>
                  array(2) {
                    ["http://schema.org/item"]=>
                    array(1) {
                      [0]=>
                      string(4) "item"
                    }
                    ["http://schema.org/position"]=>
                    array(1) {
                      [0]=>
                      string(8) "position"
                    }
                  }
                  ["aliasFactory":protected]=>
                  object(Jkphl\Micrometa\Application\Factory\AliasFactory)#4095 (0) {
                  }
                  ["values":protected]=>
                  array(2) {
                    [0]=>
                    array(1) {
                      [0]=>
                      object(Jkphl\Micrometa\Application\Value\StringValue)#4102 (2) {
                        ["value":protected]=>
                        string(32) "https://example.com/dresses/real"
                        ["language":protected]=>
                        NULL
                      }
                    }
                    [1]=>
                    array(1) {
                      [0]=>
                      object(Jkphl\Micrometa\Application\Value\StringValue)#4096 (2) {
                        ["value":protected]=>
                        string(1) "2"
                        ["language":protected]=>
                        NULL
                      }
                    }
                  }
                  ["names":protected]=>
                  array(2) {
                    [0]=>
                    object(Jkphl\Micrometa\Domain\Item\Iri)#4078 (2) {
                      ["immutableProfile":protected]=>
                      string(18) "http://schema.org/"
                      ["immutableName":protected]=>
                      string(4) "item"
                    }
                    [1]=>
                    object(Jkphl\Micrometa\Domain\Item\Iri)#4076 (2) {
                      ["immutableProfile":protected]=>
                      string(18) "http://schema.org/"
                      ["immutableName":protected]=>
                      string(8) "position"
                    }
                  }
                  ["nameToCursor":protected]=>
                  array(2) {
                    ["http://schema.org/item"]=>
                    int(0)
                    ["http://schema.org/position"]=>
                    int(1)
                  }
                  ["cursor":protected]=>
                  int(0)
                }
                ["itemId":protected]=>
                string(4) "_:b2"
                ["itemLanguage":protected]=>
                NULL
              }
            }
          }
          ["names":protected]=>
          array(1) {
            [0]=>
            object(Jkphl\Micrometa\Domain\Item\Iri)#2526 (2) {
              ["immutableProfile":protected]=>
              string(18) "http://schema.org/"
              ["immutableName":protected]=>
              string(15) "itemListElement"
            }
          }
          ["nameToCursor":protected]=>
          array(1) {
            ["http://schema.org/itemListElement"]=>
            int(0)
          }
          ["cursor":protected]=>
          int(0)
        }
        ["itemId":protected]=>
        string(4) "_:b0"
        ["itemLanguage":protected]=>
        NULL
      }
      ["items":protected]=>
      array(0) {
      }
      ["pointer":protected]=>
      int(0)
    }
  }
  ["pointer":protected]=>
  int(0)
}

Do not assume `ItemInterface` is a collection by extending traversable `ItemListInterface`

The latest version of PHPStan checks generics. In other words, for traversables it also statically analyses the types of their items.

The ItemInterface extends ItemListInterface and is therefore traversable. This is incorrect, because it can also be a single item.

What would be needed, and what would break, if ItemInterface no longer extended ItemListInterface?

Bump monolog up to ^2

Could you bump your monolog dependency to ^2? This issue forces us to use another library :(

JSON LD with `sameas` throws `InvalidArgumentException`

I'm honestly not sure if this is an issue with the underlying JSON LD Parser or not. But an exception is thrown for metadata that includes the sameAs property. Testing with some example code:

{
"@context": "http://schema.org",
"@type": "Website",
"name": "Example Website",
"url": "https://example.com",
"sameAs": [
  "https://facebook.com/Example",
  "https://twitter.com/Example"
]
}

Which appears to be valid:

https://www.dropbox.com/s/sfcs4ifc9lj8izr/Screenshot%202017-06-19%2010.34.02.png?dl=0

Parsing this snippet results in an InvalidArgumentException with the message Empty type list is not allowed. After digging a little bit this appears to be due to the existence of the sameAs property.

Uncaught exception 'Jkphl\Domfactory\Ports\RuntimeException' with message 'cURL error 60: SSL certificate problem: unable to get local issuer certificate

Hi,

I have this error:
Uncaught exception 'Jkphl\Domfactory\Ports\RuntimeException' with message 'cURL error 60: SSL certificate problem: unable to get local issuer certificate in lib\vendor\jkphl\dom-factory\src\Domfactory\Infrastructure\Dom.php:72

Any idea? Is it possible to set something like $client->setOption(CURLOPT_CAINFO, $certKeysPath.'/cacert.pem');?

Example from documentation is not working

Hey guys, I tried to follow the documentation and failed to get a working example without having to look at the demo file.

So, here is how it appears in the documentation:

use Jkphl\Micrometa\Ports\Parser;
$micrometa = new Parser();
$items = $micrometa('http://example.com');

This fails, since the Parser class is nowhere to be found.

"Empty type list is not allowed" when rel=""

I'd like to preface this by saying that I'm not sure whether this actually warrants a change in the library. I just wanted to report this (and my workaround) in case others run into it, and in case there is anything that might be worthwhile adding to the library.

I've encountered a website that contains a lot of good structured data, but unfortunately also contains several links with an empty rel attribute in the footer, unrelated to the structured data type I was trying to retrieve.

Example:

<a href="/some/link" rel="">Some Link</a>

This causes Jkphl\Micrometa\Domain\Exceptions\InvalidArgumentException: Empty type list is not allowed to be thrown which prevents me from grabbing any of the actual data I was looking for which was already successfully retrieved.

From what I can tell, an empty rel attribute isn't strictly invalid, even if unusual.

My workaround is to retrieve the HTML manually, and remove all empty rel attributes before passing it to Micrometa.

Example:

$html = file_get_contents($url);
$html = preg_replace('/rel=["\']{2}/', '', $html);
$items = $parser($url, $html);

Not sure if it's worth adding anything to the library to ignore these empty rel attributes? I'd be happy to come up with a PR if so.

Warm regards.
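
A note on the workaround: the pattern above also matches mismatched quotes (rel="') and attributes whose name merely ends in rel, such as data-rel="". A slightly stricter variant, still only a regex-based sketch with a hypothetical stripEmptyRel() helper:

```php
<?php
// Hypothetical helper: remove rel attributes whose value is empty.
// \s requires whitespace before "rel", so data-rel="" is left alone;
// the backreference \1 requires the closing quote to match the opening one.
function stripEmptyRel(string $html): string
{
    return preg_replace('/\srel=(["\'])\1/i', '', $html);
}

$cleaned = stripEmptyRel('<a href="/some/link" rel="">Some Link</a>');
$kept    = stripEmptyRel('<a href="/x" rel="nofollow" data-rel="">x</a>');

echo $cleaned, "\n", $kept, "\n";
```

The result can be passed to the parser exactly as in the workaround above.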

Parse ld+json wrong

Hi, I have JSON like this:

  {
    "@context": "https://schema.org/",
    "@type": "Product",
    "name": "Extra",
    "image": "https://www..jpg",
    "category": [
      "category",
    ],
    "description": "This Stun",
    "SKU": "11111",
    "Offers": {
      "@type": "Offer",
      "priceCurrency": "GBP",
      "price": "509.99",
      "itemcondition": "http://schema.org/NewCondition",
      "availability": "https://schema.org/PreOrder",
      "url": "https://www."
    },
  }

I try to get offers, but immutableName is wrong and I get an OutOfBoundsException:
image

If I manually change Offers to offers, everything works:
image

I can't change the JSON because it is external, and I think using something like str_replace("Offers", "offers", "json") is a bad idea because the JSON is embedded in an HTML page.

What else can I do?
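
One alternative to a blanket str_replace() is to decode the JSON-LD block, rename only the keys, and re-encode it before parsing. A minimal sketch, assuming the block has already been extracted from the page and is valid JSON; fixKeys() and the $fixes map are hypothetical:

```php
<?php
// Hypothetical helper: rename miscased keys recursively, leaving values untouched.
// Assumption: $fixes is a hand-maintained map of wrong spelling => schema.org spelling.
function fixKeys($data, array $fixes)
{
    if (!is_array($data)) {
        return $data; // scalars are returned unchanged
    }
    $fixed = [];
    foreach ($data as $key => $value) {
        $fixed[$fixes[$key] ?? $key] = fixKeys($value, $fixes);
    }
    return $fixed;
}

$json = '{"@type": "Product", "SKU": "11111", "category": ["category"], '
      . '"Offers": {"@type": "Offer", "itemcondition": "http://schema.org/NewCondition"}}';
$fixes = ['Offers' => 'offers', 'SKU' => 'sku', 'itemcondition' => 'itemCondition'];

$product   = fixKeys(json_decode($json, true), $fixes);
$cleanJson = json_encode($product, JSON_UNESCAPED_SLASHES);

echo $cleanJson, "\n";
```

Because only keys are rewritten, a value that happens to contain the word "Offers" is not affected.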

ML\JsonLD\Exception\JsonLdException: Loading https://schema.org failed

I was trying to parse the data from this link

https://www.lazada.co.id/products/dispenser-mini-air-minum-anak-aneka-karakter-hello-kitty-helo-kity-hk-doraemon-doremon-i363397334-s382716523.html

But then it throws JsonLdException.

When I try in Google schema tool, it works fine.

google-schema

This is my setup

$options = [ 
    'client'  => [
        'timeout' => 30, 
        'curl' => [
            CURLOPT_PROXY => 'proxy-host',
            CURLOPT_PROXYUSERPWD => 'proxy-pass',
        ],  
    ],  
    'request' => [
        'verify'  => false,
        'headers' => [
            'Cache-Control' => 'no-cache, no-store, must-revalidate',
            'Pragma'        => 'no-cache',
            'User-Agent'    => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
            'Expires'       => 'Thu, 19 Nov 1981 08:52:00 GMT',
        ]   
    ],  
];

$micrometa = new Parser();
$micrometa($url, null, FORMAT::ALL, $options);

Is there anything I'm missing?

Author is parsed as _:b1

When parsing the following URL, the author attribute is parsed as _:b1 where it is expected to be empty:

https://eu.usatoday.com/story/news/politics/elections/2016/07/18/donald-trump-hispanics-hillary-clinton/87241102/

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "author": {},
  "dateModified": "0001-01-01T00:00:00Z",
  "datePublished": "0001-01-01T00:00:00Z",
  "image": {},
  "mainEntityOfPage": {},
  "publisher": {
            "@type": "Organization",
    "logo": {},
    "name": "USA TODAY"
  }
}
</script>

image

Getting type by schema.org hierarchy?

Is there a way to select items based on their schema.org inheritance?

One example from the docs:

$event = $item->getFirstItem(
    new Iri('http://microformats.org/profile/', 'h-event'),
    new Iri('http://schema.org/', 'Event')
);

Now say the schema.org Event in question is actually listed as a Festival or BusinessEvent (or whatever). They satisfy all the properties of the Event parent type, but they are not matched by micrometa.

Is there a way to do this without having to list all the possible sub-types?
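
Short of built-in support, the sub-types can be expanded in user code from a hierarchy map. A sketch under the assumption that such a map is maintained by hand or generated from the schema.org vocabulary; the excerpt and expandType() are hypothetical:

```php
<?php
// Assumption: a hand-maintained excerpt of the schema.org hierarchy (parent => direct sub-types).
$subtypes = [
    'Event' => ['BusinessEvent', 'Festival'],
];

// Hypothetical helper: return the type itself followed by all of its descendants.
function expandType(string $type, array $subtypes): array
{
    $all = [$type];
    foreach ($subtypes[$type] ?? [] as $child) {
        $all = array_merge($all, expandType($child, $subtypes));
    }
    return $all;
}

$names = expandType('Event', $subtypes);
print_r($names);
```

Each name could then be wrapped as new Iri('http://schema.org/', $name), as in the docs example above, and the resulting list spread into getFirstItem().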

Is there any way to set a proxy?

My server's IP is blocked by some sites.

curl_setopt($ch, CURLOPT_PROXY, 'http://proxy-server.tld:12345');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'mypass');

Is there any way to set this option?

Microdata Does not parse content="" on item property fields

Problem

According to http://schema.org/Review and https://search.google.com/structured-data/testing-tool/u/0/ , the following schema:

<div itemscope itemtype="http://schema.org/Offer">
    <!--price is 1000, a number, with locale-specific thousands separator
    and decimal mark, and the $ character is marked up with the
    machine-readable code "USD" -->
    <span itemprop="priceCurrency" content="USD">$</span><span
        itemprop="price" content="1000.00">1,000.00</span>
    <link itemprop="availability" href="http://schema.org/InStock" />In stock
</div>

is valid and "priceCurrency" should have the value of "USD" (and price should have "1000.00").
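
The expected behaviour can be sketched with PHP's DOM extension: prefer a machine-readable attribute over the visible text. This is an illustration of the rule described above, not micrometa's Microdata parser; propertyValue() is hypothetical:

```php
<?php
// Hypothetical helper: prefer machine-readable attributes over the visible text.
function propertyValue(DOMElement $element): string
{
    foreach (['content', 'href'] as $attribute) {
        if ($element->hasAttribute($attribute)) {
            return $element->getAttribute($attribute);
        }
    }
    return trim($element->textContent);
}

$html = '<div itemscope itemtype="http://schema.org/Offer">'
    . '<span itemprop="priceCurrency" content="USD">$</span>'
    . '<span itemprop="price" content="1000.00">1,000.00</span>'
    . '<link itemprop="availability" href="http://schema.org/InStock" />In stock'
    . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$values = [];
foreach ((new DOMXPath($dom))->query('//*[@itemprop]') as $element) {
    $values[$element->getAttribute('itemprop')] = propertyValue($element);
}

print_r($values);
```

With this rule the visible "$" and "1,000.00" are ignored in favour of the content attributes.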
