Welcome to URLResolver.php

URLResolver.php is a PHP class that attempts to resolve URLs to a final, canonical link. On the web today, link shorteners, tracking codes and more can result in many different links that ultimately point to the same resource. By following HTTP redirects and parsing web pages for Open Graph and canonical URLs, URLResolver.php attempts to solve this issue.

Patterns Recognized

  • Follows 301, 302, and 303 redirects found in HTTP headers
  • Follows Open Graph URL <meta> tags found in web page <head>
  • Follows Canonical URL <link> tags found in web page <head>
  • Aborts download quickly if content type is not an HTML page

I am open to additional suggestions for improvement.
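
As a rough illustration of the two page-level patterns listed above (og:url and rel=canonical), here is a minimal sketch that uses PHP's built-in DOMDocument. It is not the code URLResolver.php itself uses. The function name find_declared_urls() and its return shape are assumptions made for this example, and for simplicity it scans the whole document rather than only the <head>.

```php
<?php
// Hypothetical helper for illustration only; not part of URLResolver.php.
// Returns the og:url and rel=canonical values declared in an HTML page,
// or null for each one that is missing.
function find_declared_urls(string $html): array
{
    $found = ['og_url' => null, 'canonical' => null];
    if (trim($html) === '') {
        return $found; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // <meta property="og:url" content="...">
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('property')) === 'og:url') {
            $content = trim($meta->getAttribute('content'));
            if ($content !== '') {
                $found['og_url'] = $content;
                break;
            }
        }
    }

    // <link rel="canonical" href="..."> (rel may hold several tokens)
    foreach ($doc->getElementsByTagName('link') as $link) {
        $rels = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        if (in_array('canonical', $rels, true)) {
            $href = trim($link->getAttribute('href'));
            if ($href !== '') {
                $found['canonical'] = $href;
                break;
            }
        }
    }

    return $found;
}
```

A resolver built on this idea would treat either value as the next URL to visit, in the same way it treats the target of an HTTP redirect.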

Usage

Resolving a URL can be as easy as:

<?php require_once('URLResolver.php');

$resolver = new URLResolver();
print $resolver->resolveURL('http://goo.gl/0GMP1')->getURL();

However, in most cases you will want to perform a little extra setup. The following code sets a user agent to identify your crawler (otherwise the default will be used) and also designates a temporary file that can be used for storing cookies during the session. Some web sites will test the browser for cookie support, so this will enhance your results.

<?php require_once('URLResolver.php');
$resolver = new URLResolver();

# Identify your crawler (otherwise the default will be used)
$resolver->setUserAgent('Mozilla/5.0 (compatible; YourAppName/1.0; +http://www.example.com)');

# Designate a temporary file that will store cookies during the session.
# Some web sites test the browser for cookie support, so this enhances results.
$resolver->setCookieJar('/tmp/url_resolver.cookies');

# resolveURL() returns an object that allows for additional information.
$url = 'http://goo.gl/0GMP1';
$url_result = $resolver->resolveURL($url);

# Test to see if any error occurred while resolving the URL:
if ($url_result->didErrorOccur()) {
	print "there was an error resolving $url:\n  ";
	print $url_result->getErrorMessageString();
}

# Otherwise, print out the resolved URL. The HTTP status code gives you
# additional information about the success/failure. For instance, if the
# link resulted in a 404 Not Found error, it would print '404: http://...'
# The successful status code is 200.
else {
	print $url_result->getHTTPStatusCode();
	print ': ';
	print $url_result->getURL();
}

Download and Requirements

License

URLResolver.php is licensed under the MIT License, viewable in the source code.

Download

Download URLResolver.php as a .tar.gz or .zip file.

Requirements

API

URLResolver()

$resolver = new URLResolver();
Create the URL resolver object that you call additional methods on.

$resolver->resolveURL($url);
$url is the link you want to resolve.
Returns a URLResolverResult object that contains the final, resolved URL.

$resolver->setUserAgent($user_agent);
Pass in a string that is sent to each web server to identify your crawler.

$resolver->setCookieJar($cookie_file); # Defaults to cookies disabled
Pass in the path to a file used to store cookies during each resolveURL() call.
*** This file will be removed at the end of each resolveURL() call. ***
If no cookie file is set, cookies will be disabled and results may suffer.
This file must not already exist. If it does, pass true as the second argument to allow overwriting it.

$resolver->setMaxRedirects($max_redirects); # Defaults to 10
Set the maximum number of URL requests to attempt during each resolveURL() call.

$resolver->setMaxResponseDataSize($max_bytes); # Defaults to 120000
Pass in an integer specifying the maximum data to download per request.
Multiple URL requests may occur during each resolveURL() call.
Setting this too low may limit the usefulness of results (default 120000).

$resolver->setRequestTimeout($num_seconds); # Defaults to 30
Set the maximum amount of time, in seconds, any URL request can take.
Multiple URL requests may occur during each resolveURL() call.

$resolver->isDebugMode($value); # Defaults to false
Set $value to true to enable debug mode and false to disable (the default).
This will print out each link visited, along with status codes and link types.
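
To show how the setters above fit together, here is a hedged sketch of a helper that applies a full set of options in one place. It only calls methods documented in this section. The function name configure_resolver() and the specific values (5 redirects, a 10 second timeout) are assumptions chosen for the example, not recommendations from the library.

```php
<?php
// Hypothetical convenience wrapper; not part of URLResolver.php.
// $resolver is expected to be a URLResolver instance (or anything
// exposing the same methods).
function configure_resolver($resolver, string $cookie_file)
{
    // Identify the crawler to each web server.
    $resolver->setUserAgent('Mozilla/5.0 (compatible; YourAppName/1.0; +http://www.example.com)');

    // Enable cookies; the second argument allows overwriting an existing file.
    $resolver->setCookieJar($cookie_file, true);

    // Example limits (assumed values): fewer hops and a shorter timeout
    // than the defaults of 10 redirects and 30 seconds.
    $resolver->setMaxRedirects(5);
    $resolver->setMaxResponseDataSize(120000); // same as the default
    $resolver->setRequestTimeout(10);

    // Keep debug output off.
    $resolver->isDebugMode(false);

    return $resolver;
}
```

Typical use would be `$resolver = configure_resolver(new URLResolver(), '/tmp/url_resolver.cookies');` after loading URLResolver.php.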

URLResolverResult()

$url_result = $resolver->resolveURL($url);
Retrieve the URLResolverResult() object representing the resolution of $url.

$url_result->getURL();
This is the best resolved URL we could obtain after following redirects.

$url_result->getHTTPStatusCode();
Returns the integer HTTP status code for the resolved URL.
Examples: 200 - OK (success), 404 - Not Found, 301 - Moved Permanently, ...

$url_result->hasSuccessHTTPStatus();
Returns true if the HTTP status code for the resolved URL is 200.

$url_result->hasRedirectHTTPStatus();
Returns true if the HTTP status code for the resolved URL is 301, 302, or 303.

$url_result->getContentType();
Returns the value of the Content-Type HTTP header for the resolved URL.
If the header is not provided, null is returned. Examples: text/html, image/jpeg, ...

$url_result->getContentLength();
Returns the size, in bytes, of the content at the resolved URL.
Determined only by the Content-Length HTTP header; null is returned if the header is absent.

$url_result->isOpenGraphURL();
Returns true if resolved URL was marked as the Open Graph URL (og:url)

$url_result->isCanonicalURL();
Returns true if resolved URL was marked as the Canonical URL (rel=canonical)

$url_result->isStartingURL();
Returns true if resolved URL was also the URL you passed to resolveURL().

$url_result->didErrorOccur();
Returns true if an error occurred while resolving the URL.
If this returns false, $url_result is guaranteed to have a status code.

$url_result->getErrorMessageString();
Returns an explanation of what went wrong if didErrorOccur() returns true.

$url_result->didConnectionFail();
Returns true if there was a connection error (no header or no body returned).
May indicate a situation where you are more likely to try at least once more.
If this returns true, didErrorOccur() will return true as well.
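
Since a connection failure is the case where trying again is most likely to help, a retry loop might look like the sketch below. It relies only on resolveURL() and didConnectionFail() as documented here. The function name resolve_with_retry() and its parameters are assumptions for illustration.

```php
<?php
// Hypothetical helper; not part of URLResolver.php.
// Calls resolveURL() up to $max_attempts times, retrying only when the
// result reports a connection failure. Returns the last result obtained.
function resolve_with_retry($resolver, string $url, int $max_attempts = 2, int $delay_seconds = 0)
{
    $max_attempts = max(1, $max_attempts); // always try at least once
    $result = null;

    for ($attempt = 1; $attempt <= $max_attempts; $attempt++) {
        $result = $resolver->resolveURL($url);

        // Any outcome other than a connection failure is final,
        // including HTTP errors such as 404.
        if (!$result->didConnectionFail()) {
            break;
        }

        // Optional pause before the next attempt.
        if ($attempt < $max_attempts && $delay_seconds > 0) {
            sleep($delay_seconds);
        }
    }

    return $result;
}
```

The caller should still check didErrorOccur() on the returned result, because the final attempt may have failed too.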

Changelog

  • v1.1 - June 3, 2014
    • Support HTTP redirect code 303
  • v1.0 - December 3, 2011
    • Initial release supports HTTP header redirects, og:url and rel=canonical

Contributors

mattwright, toddlevy, cfreeh
