
Autopager


Autopager is a Python package which detects and classifies pagination links.

License is MIT.

Installation

Install autopager with pip:

pip install autopager

Autopager depends on a few other packages, such as lxml and python-crfsuite; it will try to install them automatically, but you may need to consult the installation docs for those packages if installation fails.

Autopager works in Python 3.6+.

Usage

The autopager.urls function returns a list of pagination URLs:

>>> import autopager
>>> import requests
>>> autopager.urls(requests.get('http://my-url.org'))
['http://my-url.org/page/1', 'http://my-url.org/page/3', 'http://my-url.org/page/4']

The autopager.select function returns all pagination <a> elements as a parsel.SelectorList (the same type that Scrapy's response.css / response.xpath methods return).

The autopager.extract function returns a list of (link_type, link) tuples, where link_type is one of "PAGE", "PREV", "NEXT" and link is a parsel.Selector instance.

These functions accept HTML page contents (as a unicode string), a requests Response, or a scrapy Response as their first argument.

By default, a prebuilt extraction model is used. To use your own model, use the autopager.AutoPager class; it has the same methods but lets you provide a model path or a model object:

>>> import autopager
>>> pager = autopager.AutoPager('my_model.crf')
>>> pager.urls(html)

You also have to use the AutoPager class if you've cloned the repository from git; the prebuilt model is only available in PyPI releases.

Detection Quality

Web pages can be very different; autopager tries to work for all websites, but some errors are inevitable. As a very rough estimate, expect it to work properly for about 9 out of 10 paginators on websites sampled from the 1M most popular international websites (according to the Alexa Top list).

Contributing

How It Works

Autopager uses machine learning to detect paginators. It classifies <a> HTML elements into 4 classes:

  • PREV - previous page link
  • PAGE - a link to a specific page
  • NEXT - next page link
  • OTHER - not a pagination link

To do that it uses features such as link text, CSS class names, URL parts, and the right/left contexts of each link. A CRF (Conditional Random Field) model is used for learning.
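To make this concrete, here is a sketch of what a feature dictionary for a single link could look like. The helper name and feature keys below are hypothetical illustrations of the kinds of features described above (link text, class names, URL parts), not autopager's actual feature set:

```python
from urllib.parse import urlsplit

def link_to_features(text, css_class, href):
    # Hypothetical feature extractor: one dict per <a> element,
    # fed to the CRF as part of the page-wide sequence.
    parts = urlsplit(href)
    return {
        'text': text.strip().lower(),
        'text-is-digit': text.strip().isdigit(),  # "2" is a strong PAGE signal
        'class': css_class,
        'path': parts.path,
        'query': parts.query,
    }

print(link_to_features('2', 'page-link', '/articles?page=2'))
```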

A web page is represented as a sequence of <a> elements; only <a> elements with non-empty href attributes are included in this sequence.
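A minimal sketch of that link-collection step, using only the standard library (autopager itself works with lxml/parsel; this is just an illustration of the filtering rule):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values of <a> elements that have a non-empty
    href attribute, in document order."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:  # skip <a> elements with a missing or empty href
                self.hrefs.append(href)

parser = LinkCollector()
parser.feed('<a href="/p/1">1</a> <a>anchor</a> <a href="">x</a> <a href="/p/2">2</a>')
print(parser.hrefs)  # ['/p/1', '/p/2']
```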

See also: https://github.com/TeamHG-Memex/autopager/blob/master/notebooks/Training.ipynb

Training Data

Data is stored at autopager/data. Raw HTML source code is in autopager/data/html folder. Annotations are in autopager/data/data.csv file; elements are stored as CSS selectors.

Training data is annotated with 5 non-empty classes:

  • PREV - previous page link
  • PAGE - a link to a specific page
  • NEXT - next page link
  • LAST - 'go to last page' link whose text is not just a number
  • FIRST - 'go to first page' link whose text is not just the number '1'

Because LAST and FIRST are relatively rare, they are converted to PAGE by the pagination model. By keeping these classes during annotation, it may become possible to make the model predict them as well in the future, given more training examples.
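The label collapsing described above amounts to a simple mapping; the helper name here is hypothetical (the real model does this internally):

```python
def collapse_labels(labels):
    # LAST and FIRST annotations are folded into PAGE before training;
    # all other labels pass through unchanged.
    mapping = {'LAST': 'PAGE', 'FIRST': 'PAGE'}
    return [mapping.get(label, label) for label in labels]

print(collapse_labels(['PREV', 'FIRST', 'PAGE', 'NEXT', 'LAST']))
# ['PREV', 'PAGE', 'PAGE', 'NEXT', 'PAGE']
```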

To add a new page to the training data, save it to an HTML file and add a row to the data.csv file. The http://selectorgadget.com/ extension is helpful for getting CSS selectors.

Don't worry if your CSS selectors don't return <a> elements directly (it is easy to accidentally select a parent or a child of an <a> element when using SelectorGadget). If a selected node is not an <a> element, its parent and child <a> elements are tried instead; this is usually what is wanted, because <a> tags are not nested on valid websites.
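The fallback described above can be sketched with lxml (one of autopager's dependencies); to_a_elements is a hypothetical helper written for illustration, not autopager's API:

```python
import lxml.html

def to_a_elements(elem):
    # If the selected node is already an <a>, use it; otherwise try
    # descendant <a> elements first, then walk up to an ancestor <a>.
    if elem.tag == 'a':
        return [elem]
    descendants = elem.findall('.//a')
    if descendants:
        return descendants
    ancestor = elem.getparent()
    while ancestor is not None:
        if ancestor.tag == 'a':
            return [ancestor]
        ancestor = ancestor.getparent()
    return []

root = lxml.html.fromstring('<a href="/p/2"><span>2</span></a>')
span = root.find('span')  # the selector hit the <span>, not the <a>
print([a.get('href') for a in to_a_elements(span)])  # ['/p/2']
```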

When using SelectorGadget, take care not to select anything other than pagination elements. Always check the element count displayed by SelectorGadget and compare it to the number of elements you intended to select.

Some websites change their DOM after rendering. This rarely affects paginator elements, but it can happen. To avoid it, instead of downloading the HTML file via the browser's "Save As..." menu option, use "Copy Outer HTML" in developer tools, or render the HTML with a headless browser (e.g. Splash). If you do so, make sure the saved file is UTF-8 encoded, regardless of the page encoding declared in HTTP headers or <meta> tags.



autopager's People

Contributors

ivanprado, kmike, mehaase


autopager's Issues

ValueError: Invalid IPv6 URL

Here is the traceback

  File "/usr/local/lib/python3.6/dist-packages/autopager/autopager.py", line 51, in extract
    return list(get_shared_autopager().extract(page, direct, prev, next))
  File "/usr/local/lib/python3.6/dist-packages/autopager/autopager.py", line 112, in extract
    xseq = page_to_features(links)
  File "/usr/local/lib/python3.6/dist-packages/autopager/model.py", line 129, in page_to_features
    features = [link_to_features(a) for a in xseq]
  File "/usr/local/lib/python3.6/dist-packages/autopager/model.py", line 129, in <listcomp>
    features = [link_to_features(a) for a in xseq]
  File "/usr/local/lib/python3.6/dist-packages/autopager/model.py", line 55, in link_to_features
    p = urlsplit(href)
  File "/usr/lib/python3.6/urllib/parse.py", line 436, in urlsplit
    raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL

Example which triggers the issue (I don't have the one which actually happened in the wild):

autopager.extract('<a href="http://[">Error</a>')
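The failure can be reproduced with urlsplit alone; a defensive wrapper along these lines (a hypothetical sketch, not autopager's actual fix) would let such malformed hrefs be skipped instead of crashing extraction:

```python
from urllib.parse import urlsplit

def safe_urlsplit(href):
    # urlsplit raises ValueError ("Invalid IPv6 URL") for hrefs like
    # 'http://[' with an unbalanced bracket; return None instead.
    try:
        return urlsplit(href)
    except ValueError:
        return None

print(safe_urlsplit('http://['))        # None
print(safe_urlsplit('/page/2').path)    # /page/2
```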

Error when passing Scrapy response

There seems to be a problem when trying to pass a Scrapy response to autopager. The same page works when using requests instead of Scrapy.

(ipython)➜  TweetScraper git:(master) ✗ scrapy shell http://elcomercio.pe/buscar/ppk
2016-04-09 23:06:27 [scrapy] INFO: Scrapy 1.0.5 started (bot: TweetScraper)
2016-04-09 23:06:27 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-04-09 23:06:27 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'TweetScraper.spiders', 'LOG_LEVEL': 'INFO', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['TweetScraper.spiders'], 'BOT_NAME': 'TweetScraper', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'TweetScraper'}
2016-04-09 23:06:27 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2016-04-09 23:06:27 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-09 23:06:27 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-09 23:06:27 [scrapy] INFO: Enabled item pipelines: SaveToFilePipeline
2016-04-09 23:06:27 [scrapy] INFO: Spider opened
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x108a57090>
[s]   item       {}
[s]   request    <GET http://elcomercio.pe/buscar/ppk>
[s]   response   <200 http://elcomercio.pe/buscar/ppk>
[s]   settings   <scrapy.settings.Settings object at 0x109e7e250>
[s]   spider     <DefaultSpider 'default' at 0x10bdb1890>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: import autopager

In [2]: autopager.select(response)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-355d1f012366> in <module>()
----> 1 autopager.select(response)

/Users/gabriel/.virtualenvs/ipython/lib/python2.7/site-packages/autopager/autopager.pyc in select(page, direct, prev, next)
     36     By default, all link types are returned.
     37     """
---> 38     return get_shared_autopager().select(page, direct, prev, next)
     39
     40

/Users/gabriel/.virtualenvs/ipython/lib/python2.7/site-packages/autopager/autopager.pyc in select(self, page, direct, prev, next)
     96         """
     97         links = self.extract(page, prev=prev, next=next, direct=direct)
---> 98         return parsel.SelectorList([x for y, x in links])
     99
    100     def extract(self, page, direct=True, prev=True, next=True):

/Users/gabriel/.virtualenvs/ipython/lib/python2.7/site-packages/autopager/autopager.pyc in extract(self, page, direct, prev, next)
    110         sel = _any2selector(page)
    111         links = get_links(sel)
--> 112         xseq = page_to_features(links)
    113         yseq = self.crf.predict_single(xseq)
    114         for x, y in zip(links, yseq):

/Users/gabriel/.virtualenvs/ipython/lib/python2.7/site-packages/autopager/model.pyc in page_to_features(xseq)
    126
    127 def page_to_features(xseq):
--> 128     features = [link_to_features(a) for a in xseq]
    129
    130     around = get_text_around_selector_list(xseq, max_length=15)

/Users/gabriel/.virtualenvs/ipython/lib/python2.7/site-packages/autopager/model.pyc in link_to_features(link)
     60     )
     61
---> 62     elem = link.root
     63     elem_target = _elem_attr(elem, 'target')
     64     elem_rel = _elem_attr(elem, 'rel')

AttributeError: 'Selector' object has no attribute 'root'
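The traceback suggests a parsel version mismatch: older parsel releases exposed the underlying lxml element as Selector._root, while newer ones use Selector.root. Upgrading parsel is the real fix; for illustration only, a version-tolerant accessor could look like this (hypothetical helper, shown with dummy classes standing in for real Selectors):

```python
def selector_root(sel):
    # Prefer the modern `root` attribute, fall back to the legacy `_root`.
    root = getattr(sel, 'root', None)
    if root is None:
        root = getattr(sel, '_root', None)
    return root

class OldSelector:       # stands in for an old parsel Selector
    _root = 'lxml-element'

class NewSelector:       # stands in for a modern parsel Selector
    root = 'lxml-element'

print(selector_root(OldSelector()))  # lxml-element
print(selector_root(NewSelector()))  # lxml-element
```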

group pagination links in "paginators"

Currently autopager classifies each <a> element as part of a paginator or not. Because there can be several paginators on a web page, it'd be nice to group <a> links into "paginators".

This feature would be useful if we want to detect the same paginator across different web pages, e.g. by checking for common URLs.

detect pagination options links

Websites often provide links like "show 20/50/100 results per page"; by following them a crawler can fetch the same contents multiple times. It'd be nice to detect these links.
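One possible heuristic for the requested feature, sketched with stdlib URL parsing; the parameter names below are guesses for illustration, not an exhaustive or authoritative list:

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical set of query parameters that commonly control page size.
PER_PAGE_PARAMS = {'per_page', 'page_size', 'pagesize', 'limit'}

def looks_like_page_size_link(href):
    # Flag links whose query string carries a results-per-page knob,
    # since following them re-fetches the same contents.
    query = parse_qs(urlsplit(href).query)
    return any(param in query for param in PER_PAGE_PARAMS)

print(looks_like_page_size_link('/search?q=ppk&per_page=50'))  # True
print(looks_like_page_size_link('/search?q=ppk&page=2'))       # False
```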
