
speedparser's Introduction

speedparser

Speedparser is a black-box-style reimplementation of the Universal Feed Parser. It borrows some feedparser code for date and author parsing, but mostly re-implements feedparser's data normalization algorithms based on feedparser output. It uses lxml for feed parsing and for optional HTML cleaning. Its compatibility with feedparser is very good for a strict subset of fields, but poor for fields outside that subset. See tests/speedparsertests.py for more information on which fields are more or less compatible and which are not.

On an Intel(R) Core(TM) i5 750, running only on one core, feedparser managed 2.5 feeds/sec on the test feed set (roughly 4200 "feeds" in tests/feeds.tar.bz2), while speedparser manages around 65 feeds/sec with HTML cleaning on and 200 feeds/sec with cleaning off.

installing

pip install speedparser

usage

Usage is similar to feedparser:

>>> import speedparser
>>> result = speedparser.parse(feed)
>>> result = speedparser.parse(feed, clean_html=False)

differences

There are a few interface differences and many result differences between speedparser and feedparser. The biggest similarity is that they both return a FeedParserDict() object (with keys accessible as attributes), they both set the bozo key when an error is encountered, and various aspects of the feed and entries keys are likely to be identical or very similar.
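Both libraries signal parse failures through the bozo key rather than raising, so error handling looks the same with either. A minimal handling pattern might look like this (a plain dict stands in for the FeedParserDict result; the helper name is illustrative):

```python
def entries_or_empty(result):
    # Treat a parse with bozo set as failed and ignore its entries;
    # result is the dict-like object returned by either parser.
    if result.get("bozo"):
        return []
    return result.get("entries", [])
```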

speedparser uses different (and in some cases fewer, or no; buyer beware) data cleaning algorithms than feedparser. When cleaning is enabled, lxml's html.cleaner library is used to clean HTML, giving similar but not identical protection against various attributes and elements. If you supply your own Cleaner instance via the clean_html kwarg, speedparser will use it to clean the various attributes of the feed and entries.

speedparser does not attempt to fix character encoding by default, because this processing can take a long time for large feeds. If the encoding value of the feed is wrong, or if you want this extra level of error tolerance, you can either use the chardet module to detect the encoding yourself or pass encoding=True to speedparser.parse, which will fall back to encoding detection if it encounters encoding errors.
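The declared-encoding-first, detection-second behaviour can be approximated outside the library as well. This stdlib-only sketch is illustrative, not speedparser's actual implementation (a real fallback would use chardet rather than latin-1):

```python
def decode_with_fallback(raw: bytes, declared: str = "utf-8") -> str:
    """Decode feed bytes, falling back when the declared encoding is wrong."""
    try:
        return raw.decode(declared)
    except (UnicodeDecodeError, LookupError):
        # latin-1 maps every possible byte, so this never raises;
        # a real implementation would run chardet here instead.
        return raw.decode("latin-1")
```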

If your application is using feedparser to consume many feeds at once and CPU is becoming a bottleneck, you might want to try out speedparser as an alternative (using feedparser as a backup). If you are writing an application that does not ingest many feeds, or where CPU is not a problem, you should use feedparser as it is flexible with bad or malformed data and has a much better test suite.
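The speedparser-first, feedparser-backup strategy suggested above can be sketched as a small wrapper. The parser arguments here are stand-ins for speedparser.parse and feedparser.parse; the function name is hypothetical:

```python
def parse_with_fallback(data, fast_parse, slow_parse):
    # Try the fast parser first; if it flags the feed as bozo,
    # retry with the slower but more tolerant parser.
    result = fast_parse(data)
    if result.get("bozo"):
        result = slow_parse(data)
    return result
```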


speedparser's Issues

lxml.clean cannot defang some raw text ("<3")

lxml.html.clean cannot clean some raw text (in titles, descriptions, etc.), particularly text that might look like HTML but isn't. The first case I noticed: the tag "<title><3</title>" fails with a ParserError('Document is empty'), which is likely an underlying libxml2 issue.

We cannot simply pass the text through, as <3<script>alert('foo');</script> will also raise this same error. Currently, there is a regression test "TestHeartParserError" which confirms this error in speedparser and confirms that feedparser will read this content.
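One possible workaround (not something speedparser currently does; the helper is hypothetical) is to escape `<` characters that cannot begin a tag before handing the text to the cleaner, which keeps real markup like the `<script>` in the second example intact for the cleaner to strip:

```python
import re

def escape_stray_lt(text: str) -> str:
    # Escape "<" characters that cannot start a tag (i.e. not followed
    # by a letter, "/", "!", or "?") so the HTML parser won't choke.
    return re.sub(r"<(?![a-zA-Z/!?])", "&lt;", text)
```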

Using legacy feedparser date parsing breaks on this feed.

http://www.theprizefinder.com/feed/top-prizes

Not exactly sure on the best approach here... I don't want to uninstall feedparser since I use it when parsing fails catastrophically, but I also don't really want to use the broken date stuff. Perhaps we could optionally force feedparser compat instead of using actual feedparser?
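As a point of comparison, RFC 822 pubDate values can be parsed with the stdlib alone, which could serve as a partial replacement for the legacy date routines. This sketch is illustrative, not what speedparser does:

```python
from email.utils import parsedate_to_datetime

def parse_rfc822_date(value):
    # Stdlib alternative for RFC 822 dates like
    # "Mon, 20 Nov 1995 19:12:08 -0500"; returns None on failure.
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
```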

Is speedparser compatible with Python 3.7?

I get an import error when trying to import it:

>>> import speedparser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python\Python37\lib\site-packages\speedparser\__init__.py", line 1, in <module>
    from speedparser import parse
ImportError: cannot import name 'parse' from 'speedparser' (C:\Python\Python37\lib\site-packages\speedparser\__init__.py)

I've checked __init__.py, and changing the first line to from .speedparser import parse fixes the import error; however, it then throws another error:

>>> import speedparser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python\Python37\lib\site-packages\speedparser\__init__.py", line 1, in <module>
    from .speedparser import parse
  File "C:\Python\Python37\lib\site-packages\speedparser\speedparser.py", line 19, in <module>
    import urlparse
ModuleNotFoundError: No module named 'urlparse'

urlparse looks like a Python 2 module. There are at least a few related libraries on PyPI:
https://pypi.org/project/urlparse2/
https://pypi.org/project/urlparse3/
https://pypi.org/project/urlparse4/
but all of them are Python 2 only.
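The usual fix for this class of breakage is a try/except import shim; something along these lines would make the import work on both Python versions (assuming only these names are needed by speedparser.py):

```python
try:
    # Python 3: urlparse was merged into urllib.parse
    from urllib.parse import urlparse, urljoin
except ImportError:
    # Python 2 fallback
    from urlparse import urlparse, urljoin
```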

ImportError: cannot import name 'parse' from partially initialized module 'speedparser'

I installed speedparser using pip install speedparser. When I try importing it in a Python shell, it gives the following error (in both VSCode and the Windows Terminal).

>>> import speedparser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\speedparser\__init__.py", line 1, in <module>
    from speedparser import parse
ImportError: cannot import name 'parse' from partially initialized module 'speedparser' (most likely due to a circular import) (C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\speedparser\__init__.py)

The same error occurs when I import it from a .py file.

I have feedparser installed alongside speedparser. I don't think that is the issue, but it seems like the only thing that might cause a circular import. I don't have any files named speedparser.py or parse.py in the current project directory.


Related issue: #13.

Can't get it to work when parsing a URL

I get an error when I try to parse something.
I try to parse this url: http://rss.cnn.com/rss/money_news_companies.rss

I work with Python 3.4 and tornadoweb, so I made a change in the speedparser.py file: I changed urlparse to urllib.parse, because urlparse is no longer available in Python 3.

The error is:

{'bozo_tb': 'Traceback (most recent call last):\n File "C:\Python34\lib\site-packages\speedparser\speedparser.py", line 685, in parse\n parser = SpeedParser(document, cleaner, unix_timestamp, encoding)\n File "C:\Python34\lib\site-packages\speedparser\speedparser.py", line 582, in init\n self.tree = tree.getroottree()\nAttributeError: 'NoneType' object has no attribute 'getroottree'\n', 'bozo': 1, 'bozo_exception': AttributeError("'NoneType' object has no attribute 'getroottree'",), 'feed': {}, 'entries': []}

Does not work with http://news.ycombinator.com/rss

>>> import speedparser
>>> feed = "http://news.ycombinator.com/rss"
>>> speedparser.parse(feed)
{'bozo_tb': 'Traceback (most recent call last):\n  File "/home/kir/.virtualenvs/rss/lib/python3.5/site-packages/speedparser/speedparser.py", line 688, in parse\n    parser = SpeedParser(document, cleaner, unix_timestamp, encoding)\n  File "/home/kir/.virtualenvs/rss/lib/python3.5/site-packages/speedparser/speedparser.py", line 585, in __init__\n    self.tree = tree.getroottree()\nAttributeError: \'NoneType\' object has no attribute \'getroottree\'\n', 'bozo_exception': AttributeError("'NoneType' object has no attribute 'getroottree'",), 'bozo': 1, 'feed': {}, 'entries': []}
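The identical 'NoneType' object has no attribute 'getroottree' traceback in both of these reports suggests speedparser received the URL string itself rather than the feed document: unlike feedparser, it appears not to fetch URLs for you. A workaround is to download the bytes first and pass those to parse (stdlib sketch; fetch_feed is a hypothetical helper):

```python
from urllib.request import urlopen

def fetch_feed(url: str) -> bytes:
    # Download the raw feed document; pass the returned bytes to
    # speedparser.parse(...) instead of the URL string.
    with urlopen(url) as resp:
        return resp.read()
```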
