bryant1410 / blogbar

This project forked from blogbar/blogbar

Blogbar: an aggregator for personal blogs.

Home Page: http://www.blogbar.cc

Python 49.36% HTML 27.70% JavaScript 8.71% CSS 13.75% Shell 0.05% Nginx 0.27% Mako 0.17%

Blogbar

http://www.blogbar.cc

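The death of the personal blog is also its rebirth.

Leave the fast delivery of information to newer media, and let the personal blog return to its original place: a tool for carving and distilling information.

The world is too noisy; here there are only personal frequencies.

Blogbar: an aggregator for personal blogs.

Tech stack

Development setup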

git clone https://github.com/blogbar/blogbar.git
cd blogbar
virtualenv venv
. venv/bin/activate
pip install -r requirements.txt
bower install

Import db/blogbar.sql into your local database.

Save config/development_sample.py as config/development.py and update the settings as needed.

python manage.py run

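Extending

If a blog does not provide a feed but is highly valuable (for example Livid, 王垠, Lifesinger, and so on), you can add a spider for it by subclassing BaseSpider, the base class for blog crawlers (in spiders/base.py). The steps are as follows:

Assign class variables

Reassign the following class variables in the subclass:

url = ""  # Blog URL
posts_url = ""  # URL of the page listing the posts (optional; only needed when it differs from the blog URL)
title = ""  # Blog title
subtitle = ""  # Blog subtitle (optional)
author = ""  # Blog author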
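As a minimal sketch of this step, the block below fills in the class variables on a subclass. The BaseSpider defined here is only a hypothetical stand-in so that the snippet runs on its own; the real class lives in spiders/base.py and may differ. ExampleSpider, its URLs, and the list_url helper are illustrative assumptions, with list_url reflecting the note that posts_url is only needed when it differs from url.

```python
class BaseSpider(object):
    """Hypothetical stand-in for the real BaseSpider in spiders/base.py."""
    url = ""
    posts_url = ""
    title = ""
    subtitle = ""
    author = ""


class ExampleSpider(BaseSpider):
    # All values below are made up for illustration.
    url = "http://blog.example.com"
    posts_url = "http://blog.example.com/archive"  # post list lives on a separate page
    title = "Example Blog"
    author = "Jane Doe"
    # subtitle is optional, so the inherited empty string is kept.


def list_url(spider):
    """Assumed fallback: use posts_url when set, otherwise the blog URL."""
    return spider.posts_url or spider.url
```

With these definitions, list_url(ExampleSpider) yields the archive page, while a spider that leaves posts_url empty falls back to its url.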

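Override methods

Override the following two methods:

  • get_posts: fetch the list of posts
  • get_post: fetch the content of a single post

For details, see the BaseSpider class and the lxml library, which is used to scrape page content.

Debugging

If you need to debug the scraping results while writing a spider, use the test methods provided by test_spider.py: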
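To make the contract of the two methods concrete, here is a rough sketch that runs without any third-party package. It swaps in the standard library's xml.etree.ElementTree for lxml, so the selectors are ElementTree paths rather than lxml's cssselect calls, and the sample markup is invented. The dictionary keys (url, title, published_at) are taken from the Livid example in this README; whether BaseSpider requires exactly these keys is an assumption.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Invented, well-formed sample pages standing in for real blog HTML.
SAMPLE_LIST = """
<html><body>
  <ul class="posts">
    <li><span>05 Jan 2014</span><a href="/first">First post</a></li>
    <li><span>17 Mar 2014</span><a href="/second">Second post</a></li>
  </ul>
</body></html>
"""

SAMPLE_POST = (
    '<html><body><div class="content">'
    "<p>Hello</p><p>World</p>"
    "</div></body></html>"
)


def get_posts(tree):
    """Return one dict per post found on the list page."""
    posts = []
    for li in tree.findall(".//ul[@class='posts']/li"):
        link = li.find("a")
        posts.append({
            "url": link.get("href"),
            "title": link.text,
            "published_at": datetime.strptime(li.find("span").text, "%d %b %Y"),
        })
    return posts


def get_post(tree):
    """Return the inner markup of the content container as a string."""
    content = tree.find(".//div[@class='content']")
    inner = content.text or ""
    # Serialize each child element; tostring also emits any trailing text.
    inner += "".join(ET.tostring(child, encoding="unicode") for child in content)
    return inner


posts = get_posts(ET.fromstring(SAMPLE_LIST))
body = get_post(ET.fromstring(SAMPLE_POST))
```

In a real spider the tree argument would be the parsed page handed over by BaseSpider, and lxml's more forgiving HTML parser would replace the strict XML parsing used here.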

  • $ python test_spider.py get_posts
  • $ python test_spider.py get_post
  • $ python test_spider.py all

See test_spider.py for details.

Submitting

Once the tests pass, open a pull request.

Example

The following sample code crawls Livid's blog:

# coding: utf-8
from .base import BaseSpider, get_inner_html
from datetime import datetime


class LividSpider(BaseSpider):
    url = "http://livid.v2ex.com"
    title = "Livid"
    author = "Livid"

    @staticmethod
    def get_posts(tree):
        """Return one dict per post listed on the blog's index page."""
        posts = []
        for li in tree.cssselect('.posts li'):
            # The first <span> in each item holds the date, in "%d %b %Y" form.
            date_element = li.cssselect('span')[0]
            published_at = datetime.strptime(date_element.text_content(), "%d %b %Y")
            # The first <a> carries both the post URL and its title.
            link = li.cssselect('a')[0]
            posts.append({
                'url': link.get('href'),
                'title': link.text_content(),
                'published_at': published_at
            })
        return posts

    @staticmethod
    def get_post(tree):
        """Return the post body taken from the div.span10 container."""
        content_element = tree.cssselect('div.span10')[0]
        return get_inner_html(content_element)
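The date handling in get_posts relies on the format string "%d %b %Y". As a small sketch of what that format accepts, the hypothetical helper below wraps the same strptime call and strips surrounding whitespace first, since text pulled out of HTML often carries it. The helper name and the sample strings are assumptions for illustration, and %b matches English month abbreviations only under an English or C locale.

```python
from datetime import datetime

DATE_FORMAT = "%d %b %Y"  # day, abbreviated month name, four-digit year


def parse_post_date(text):
    """Parse a date such as '05 Jan 2014', ignoring surrounding whitespace."""
    return datetime.strptime(text.strip(), DATE_FORMAT)


def is_parseable(text):
    """Report whether the text matches DATE_FORMAT instead of raising."""
    try:
        parse_post_date(text)
        return True
    except ValueError:
        return False
```

A string in another layout, such as an ISO date, makes strptime raise ValueError, which is a quick signal that a blog needs its own format string.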

Contributors

bryant1410, hustlzp
