lassie's Introduction

Lassie

Lassie is a Python library for retrieving basic content from websites.

Usage

>>> import lassie
>>> lassie.fetch('http://www.youtube.com/watch?v=dQw4w9WgXcQ')
{
    'description': u'Music video by Rick Astley performing Never Gonna Give You Up. YouTube view counts pre-VEVO: 2,573,462 (C) 1987 PWL',
    'videos': [{
        'src': u'http://www.youtube.com/v/dQw4w9WgXcQ?autohide=1&version=3',
        'height': 480,
        'type': u'application/x-shockwave-flash',
        'width': 640
    }, {
        'src': u'https://www.youtube.com/embed/dQw4w9WgXcQ',
        'height': 480,
        'width': 640
    }],
    'title': u'Rick Astley - Never Gonna Give You Up',
    'url': u'http://www.youtube.com/watch?v=dQw4w9WgXcQ',
    'keywords': [u'Rick', u'Astley', u'Sony', u'BMG', u'Music', u'UK', u'Pop'],
    'images': [{
        'src': u'http://i1.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg?feature=og',
        'type': u'og:image'
    }, {
        'src': u'http://i1.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg',
        'type': u'twitter:image'
    }, {
        'src': u'http://s.ytimg.com/yts/img/favicon-vfldLzJxy.ico',
        'type': u'favicon'
    }, {
        'src': u'http://s.ytimg.com/yts/img/favicon_32-vflWoMFGx.png',
        'type': u'favicon'
    }],
    'locale': u'en_US'
}

Install

Install Lassie via pip

$ pip install lassie

or, with easy_install

$ easy_install lassie

But, hey... that's up to you.

Documentation

Documentation can be found here: https://lassie.readthedocs.org/

lassie's People

Contributors

ashibble, cameronmaske, jay754, jmhobbs, jpadilla, litomore, mbeacom, michaelhelmick, slavaganzin, timgates42, xuefeng-zhu, yaph

lassie's Issues

Handle when AMP image lists are lists of strings and not lists of objects

https://techcrunch.com/2016/11/15/chinese-scientists-crispr-a-human-for-the-first-time/

returns

{
    u 'articleBody': u "A group of Chinese scientists injected a human being with cells genetically edited using CRISPR-Cas9 technology. This is the first time CRISPR has been used on a fully formed adult human and\xa0it's encouraged a biomedical battle between\xa0China and the United States.\n\nThe scientists from China are hoping the genetically edited cells will help their patient fend off a virulent type of lung cancer in hopes it might work on other cancer patients who have not responded\xa0to chemotherapy, radiation and other treatments.\n\nHowever, another group of scientists in the U.S. proposed a similar study in June of this year. The $250 million study funded by Sean Parker's new cancer institute is slated to take place at the University of Pennsylvania. The National Institutes of Health (NIH) has already given the research a thumbs up, but it's still awaiting approval from the Food and Drug Administration (FDA).\n\nScientists have already tried to test other gene-editing techniques to treat human diseases. One method taking on HIV proved effective but CRISPR offers a much simpler path to healing by using an enzyme to snip out an unwanted genetic code.\n\nUsing CRISPR-Cas9 technology, scientists could take out all the genes ready to grow a genetically inherited cancer in a person before that cancer starts. In theory, they could also wipe out the disease by removing the genes causing the disease after it has already started wreaking havoc on the body. This is what both the Chinese and U.S. scientists hope to discover, but it looks like China already has its foot in the door.\n\nThe U.S. has a much more stringent medical regulatory system than many parts of the world and\xa0though the trial here is small and only intended for those patients with no other options it still must\xa0go through a\xa0process before we start altering human genetic code.\n\nThe first U.S. trial isn't meant to see whether or not the treatment is effective, however. 
Instead, it's merely to test its safety.\n\nCRISPR isn't fool-proof. Sometimes the Cas9 technology splices genes at the wrong place and can actually cause cancer.\n\nMeanwhile, Editas Biotechnology has proposed running a CRISPR trial by 2017 for genes causing blindness in humans. Stanford also has plans in the works for a human CRISPR trial to repair genes causing sickle cell anemia.\n\nBut China's early steps should be used as a cautionary tale for this new technology. Another group of Chinese scientists already ran CRISPR experiments on human embryos that\xa0didn't go very well\xa0-- at least two-thirds of the embryos were found to have genetic mutations and only a fraction of the 28 surviving embryos (out of 86 total tested) contained the replacement genetic material.\n\nSo it seems as though China has beat the U.S. to being first, we still have a long way to go in determining whether or not the technology is even safe enough at its current iteration to use for currently incurable diseases.", u 'author': {
        u '@type': u 'Person',
        u 'name': u 'Sarah Buhr'
    }, u 'url': u 'https://techcrunch.com/2016/11/15/chinese-scientists-crispr-a-human-for-the-first-time/', u 'image': {
        u '@list': [u 'https://tctechcrunch2011.files.wordpress.com/2016/11/3340435836_d347c3ce3d_b.jpg']
    }, u 'datePublished': u '2016-11-15T22:41:31+00:00', u 'headline': u 'Chinese scientists CRISPR a human', u 'mainEntityOfPage': u 'True', u '@context': u 'http://schema.org', u '@type': u 'Article'
}

Here the '@list' value under 'image' is a list of strings, not a list of objects.
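
One way to tolerate both shapes is to normalize the image value before reading it. The sketch below is not lassie's actual code: normalize_amp_images is a hypothetical helper, and the assumption that object-style entries carry their address under a 'url' key is mine.

```python
def normalize_amp_images(image):
    """Return a flat list of image URL strings from a JSON-LD `image` value.

    Hypothetical helper: accepts a bare string, a dict (optionally wrapping
    its entries in '@list'), or a list mixing strings and dicts.
    """
    if image is None:
        return []
    if isinstance(image, str):
        return [image]
    if isinstance(image, dict):
        if '@list' in image:
            # Unwrap the list and normalize each entry recursively.
            return normalize_amp_images(image['@list'])
        url = image.get('url')  # assumed key for object-style entries
        return [url] if url else []
    if isinstance(image, (list, tuple)):
        urls = []
        for entry in image:
            urls.extend(normalize_amp_images(entry))
        return urls
    # Anything else (numbers, booleans) is ignored rather than raising.
    return []
```

Called on the 'image' value from the response above, this would yield a one-element list containing the JPEG URL.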

Please allow configuring the requests session

It would be useful to be able to configure the requests session used to retrieve the requested URL.

You could perhaps initialize a default session object in the Lassie constructor, which the user could then configure, and/or add a parameter to Lassie.fetch() to override the default session.
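
A rough sketch of what that proposal could look like. SessionFetcher and fetch_html are hypothetical names used only for illustration; this is not lassie's API, just the shape of a default session plus a per-call override.

```python
class SessionFetcher:
    """Hypothetical wrapper illustrating the proposal; not lassie's API."""

    def __init__(self, session=None):
        if session is None:
            # Imported lazily so a caller-supplied session needs no extra setup.
            import requests
            session = requests.Session()
        # Exposed as an attribute so callers can configure it after construction.
        self.session = session

    def fetch_html(self, url, session=None, **request_kwargs):
        # A session passed per call overrides the instance default.
        client = session if session is not None else self.session
        response = client.get(url, **request_kwargs)
        return response.text
```

With something like this, a caller could set headers or proxies once on fetcher.session, or hand in a separately configured session for a single call.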

Add new filters for embeddable items

The idea is to return as much data as we can in the API so users can possibly embed media. (i.e. Spotify tracks)

We'll probably add a new embed.py and return a new embed key in the lassie API response.

Any reason to pin upper versions in requirements.txt?

Hi,

Since lassie is a library, limiting upper versions for dependencies as in

requests>=2.18.4,<3.0.0
beautifulsoup4>=4.9.0,<4.10.0

can lead to conflicts for software using it, e.g. on pip install:

The conflict is caused by:
    The user requested beautifulsoup4==4.10.0
    lassie 0.11.11 depends on beautifulsoup4<4.10.0 and >=4.9.0

Is there any reason for the upper bounds?

How to handle wrong URL setting

I fetched the page http://bibviz.com/ using Lassie. The page has the following og:url setting:
property="og:url" content="/"
Lassie correctly returns the "incorrect" URL.

I wonder whether cases like this should be handled by Lassie, i.e. by being more forgiving of bad markup, or by the calling code. One way to handle this in Lassie would be to check that the URL string starts with a supported protocol, e.g. http://, https://, and possibly others.
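
The protocol check could be sketched as follows. pick_page_url is a hypothetical helper, and falling back to the requested URL is my assumption about what "more forgiving" might mean; resolving the value against the page URL would be another option.

```python
from urllib.parse import urlparse


def pick_page_url(requested_url, og_url):
    """Prefer og:url only when it is an absolute http(s) URL."""
    if og_url:
        parsed = urlparse(og_url)
        # A usable value needs both a supported scheme and a host.
        if parsed.scheme in ('http', 'https') and parsed.netloc:
            return og_url
    # Otherwise keep the URL that was actually requested.
    return requested_url
```

For the page above, pick_page_url('http://bibviz.com/', '/') keeps 'http://bibviz.com/' instead of returning '/'.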

Default keys to be returned?

So right now, lassie will always return the keys images, videos and url:

{
    'images': [],
    'url': 'http://somesitethatdoesnthaveanythingbutexists.com',
    'videos': []
}

Does anyone think that if images or videos are still empty by the time the response is returned, those keys should be popped out of the dict?
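
For discussion, the popping behaviour could be as small as this. drop_empty_media is a hypothetical helper, not something lassie provides.

```python
def drop_empty_media(data, keys=('images', 'videos')):
    """Return a copy of `data` without the listed keys when they are empty."""
    return {
        key: value
        for key, value in data.items()
        if not (key in keys and not value)
    }
```

Applied to the dict above, only the 'url' key would remain. Callers would then need data.get('images', []) rather than data['images'], which is the trade-off to weigh.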

Handle Retrieved File Content

For example:

>>> import lassie
>>> lassie.fetch('http://i.imgur.com/3s7d35n.gif')

This should return a "beautifully crafted dictionary of important information" about the .gif file 😄
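
A sketch of what such a dictionary might contain when the URL points straight at a file. describe_file_url and the exact keys it fills in are assumptions made for illustration; in practice the content type would come from the response's Content-Type header.

```python
import mimetypes
import posixpath
from urllib.parse import urlparse


def describe_file_url(url, content_type=None):
    """Build a lassie-style dict for a URL that is a file, not an HTML page."""
    path = urlparse(url).path
    # Fall back to guessing from the file extension when no header is given.
    guessed_type = content_type or mimetypes.guess_type(path)[0]
    data = {
        'url': url,
        'title': posixpath.basename(path),
        'images': [],
        'videos': [],
    }
    if guessed_type and guessed_type.startswith('image/'):
        data['images'].append({'src': url, 'type': guessed_type})
    return data
```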

Error while fetching URL

Code

import lassie
url = "http://www.whowhatwear.com/la-style-jewelry"
lassie.fetch(url)

Error Log

Traceback (most recent call last):
  File "/Users/sean/knotch/lambda-article-parser/test.py", line 3, in <module>
    lassie.fetch(url)
  File "/Users/sean/knotch/lambda-article-parser/venv/lib/python3.6/site-packages/lassie/api.py", line 43, in fetch
    return l.fetch(url, **kwargs)
  File "/Users/sean/knotch/lambda-article-parser/venv/lib/python3.6/site-packages/lassie/core.py", line 179, in fetch
    self._filter_amp_data(soup, data, url, all_images)
  File "/Users/sean/knotch/lambda-article-parser/venv/lib/python3.6/site-packages/lassie/core.py", line 402, in _filter_amp_data
    image_list = image.get('@list')
AttributeError: 'str' object has no attribute 'get'

Seems very similar to this other issue: #70

Can't get the full article.

Hi, I want to extract the full article from the source URL, but I only get the title of the article and a small part of its text under the "description" key.

Possible relative URL in og:image

I just came across a page with a relative path as the value of og:image. Adding a call to urljoin on the src attribute in line 186 of core.py would be one possibility, but maybe it's better to check for the src prop (and possibly the href prop too) in _filter_meta_data and do it there. What do you think?
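
The urljoin idea, sketched on a single image dict. absolutize_image_src is a hypothetical helper; where exactly it would be called inside lassie is left open.

```python
from urllib.parse import urljoin


def absolutize_image_src(page_url, image):
    """Return a copy of `image` with its 'src' resolved against `page_url`."""
    fixed = dict(image)
    if fixed.get('src'):
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        fixed['src'] = urljoin(page_url, fixed['src'])
    return fixed
```

Because urljoin returns absolute URLs unchanged, the same call would be safe to apply to every image, not only the relative ones.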

Error while fetching URL

Code

import lassie
url = "https://www.eater.com/2018/5/3/17311386/best-restaurants-oakland-bay-area"
lassie.fetch(url)

Error Log


AttributeError Traceback (most recent call last)
in ()
1 url = "https://www.eater.com/2018/5/3/17311386/best-restaurants-oakland-bay-area"
----> 2 lassie.fetch(url)

~/anaconda3/envs/mobi_mule/lib/python3.6/site-packages/lassie/api.py in fetch(url, **kwargs)
41 """
42 l = Lassie()
---> 43 return l.fetch(url, **kwargs)

~/anaconda3/envs/mobi_mule/lib/python3.6/site-packages/lassie/core.py in fetch(self, url, open_graph, twitter_card, touch_icon, favicon, all_images, parser, handle_file_content, canonical)
177 soup = BeautifulSoup(clean_text(html), parser)
178
--> 179 self._filter_amp_data(soup, data, url, all_images)
180
181 if open_graph:

~/anaconda3/envs/mobi_mule/lib/python3.6/site-packages/lassie/core.py in _filter_amp_data(self, soup, data, url, all_images)
394 })
395 elif isinstance(image, object):
--> 396 image_list = image.get('@list')
397 if image_list:
398 for _image in image_list:

AttributeError: 'list' object has no attribute 'get'
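
The traceback stops in the elif isinstance(image, object) branch. In Python every value is an instance of object, so that test also matches lists and strings, neither of which has a .get method. A small illustration, with a stricter dict check shown as one possible alternative (a suggestion, not the project's actual fix):

```python
def matches_naive_check(image):
    # Mirrors the condition shown in the traceback.
    return isinstance(image, object)


def matches_strict_check(image):
    # Only dicts are guaranteed to support .get('@list').
    return isinstance(image, dict)


samples = [{'@list': []}, ['a.jpg'], 'a.jpg']
naive = [matches_naive_check(s) for s in samples]    # [True, True, True]
strict = [matches_strict_check(s) for s in samples]  # [True, False, False]
```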

ImportError: No module named filters

Tried installing this in two environments: OS X with Python 2.7, and Ubuntu with Python 2.6. Tried using both pip and easy_install. Importing lassie in both resulted in the above import error. Here's the OS X error from easy_install:

>>> import lassie
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.macosx-10.8-intel/egg/lassie/__init__.py", line 19, in <module>
  File "build/bdist.macosx-10.8-intel/egg/lassie/api.py", line 11, in <module>
  File "build/bdist.macosx-10.8-intel/egg/lassie/core.py", line 16, in <module>
ImportError: No module named filters

Import fails on Python3.5

It appears something is seriously broken when trying to install lassie with Python 3.5. The install goes fine, but when importing I get this:

Python 3.5.0 (default, Sep 23 2015, 04:41:38)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.72)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lassie
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ben/dev/beavy/venv/src/lassie/lassie/__init__.py", line 19, in <module>
    from .api import fetch
  File "/Users/ben/dev/beavy/venv/src/lassie/lassie/api.py", line 11, in <module>
    from .core import Lassie
  File "/Users/ben/dev/beavy/venv/src/lassie/lassie/core.py", line 13, in <module>
    from bs4 import BeautifulSoup
  File "/Users/ben/dev/beavy/venv/lib/python3.5/site-packages/bs4/__init__.py", line 30, in <module>
    from .builder import builder_registry, ParserRejectedMarkup
  File "/Users/ben/dev/beavy/venv/lib/python3.5/site-packages/bs4/builder/__init__.py", line 308, in <module>
    from . import _htmlparser
  File "/Users/ben/dev/beavy/venv/lib/python3.5/site-packages/bs4/builder/_htmlparser.py", line 7, in <module>
    from html.parser import (
ImportError: cannot import name 'HTMLParseError'

Encoding issues with german umlauts

Hi,

When getting the description from a German website, umlauts such as "ü" and "ä" come back mis-encoded.
Example: https://finanzguru.de/
Result:

Finanzguru - Finanzen magisch einfach Finanzen magisch einfach. Verwalte deine Verträge, kündige per Fingertipp und spare Geld mit meinen Spartipps. Alles an einem Ort und komplett kostenfrei. Einfacher war es noch nie.

I am using lassie within Django.
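
The report does not say where the decoding goes wrong. One common cause of this symptom is UTF-8 bytes being decoded as Latin-1 (ISO-8859-1); whether that is what happens here is an assumption. The snippet below only reproduces that generic failure mode and its reversal on a fragment of the text above.

```python
original = 'Verträge, kündige'

# Encode correctly as UTF-8, then decode with the wrong codec.
garbled = original.encode('utf-8').decode('latin-1')

# Reversing the wrong step recovers the text, as long as nothing was lost.
repaired = garbled.encode('latin-1').decode('utf-8')
```

If the garbled form matches what Django ends up displaying, the response encoding is the place to look first.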

Handle Non-200 HTTP Responses

By default, non-200 responses should return empty content to lassie.

An option to still return whatever response came back from the request will be available through a flag to lassie.fetch, though.
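
The intended behaviour, sketched on a response-like object with status_code and text attributes. content_for_response and the allow_non_200 flag are placeholder names; the issue does not say what the real flag would be called.

```python
def content_for_response(response, allow_non_200=False):
    """Return the response body, or '' for non-200 responses unless allowed."""
    if response.status_code == 200 or allow_non_200:
        return response.text
    # Default: treat error pages as if there were no content to parse.
    return ''
```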

Find all generic metas

Find a list somewhere on the webs and pick what we think should be included in a response
