WebCrawlerBD

A web crawler for Baidu Tieba (BD), written in Python

Project Development Summary (Design) Manual (Documentation Requirements)

Ⅰ Project background

Baidu Tieba (Baidu Post Bar) is a broad information platform where a simple search connects you with communities that share your interests and provides a wealth of information. Much of that content is worth saving, especially because some posts may soon become unavailable. I therefore designed a "text crawler" to meet this need: it saves the content of a specified post as txt and jpg files, conveniently and concisely.

Ⅱ Initial project concept

Project name: "Multi-function Post Bar image and text crawler"
Main content of the project: save the images and text of a Post Bar thread to the local disk, with several options for how the content is saved
Main modules of the project: a noise removal tool, a text crawler class, an image crawler class, and a startup GUI class
Goals and planned implementation: users enter the download parameters through a GUI, and the crawler uses the values returned by the interface to fetch exactly the requested content.

Ⅲ Project implementation plan

1. Development schedule:

Before 11/01: decide on the topic of the course project
11/01: complete the "project development process document"; find and read introductory material on crawlers
11/21: write the first image crawler program
11/28: wrap the image crawler in a class
11/29: write the graphical user interface class that passes in the parameters
Before 12/07: study regular expressions and the other knowledge the text crawler requires
12/08: finish the text crawler pseudocode
12/19: finish writing the text crawler program
12/20: wrap the text crawler in a class and connect it to the graphical user interface
12/23: debug; add options such as "only show the original poster" and "display floor numbers"
12/24: tidy the code to follow coding conventions
12/26: complete the "project development summary"

2. Division of work: Tao Wen Zheng: researching information, inspecting page source code, and writing the program

Key technologies in the project and where the knowledge was acquired:

graphical user interface: Python textbook
viewing and understanding web page HTML source: online resources
regular expression matching: online resources
text noise removal: online resources
writing files: online resources
exception handling: Python textbook
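
As an illustration of how three of these techniques (text noise removal, file writing, and exception handling) might fit together, here is a minimal sketch. It is not the project's actual code: the function names `remove_noise` and `save_text` are hypothetical, the cleaning rules are assumptions about what "noise" means here (HTML tags, entities, extra whitespace), and it targets Python 3 rather than the Python 2.7.11 the project was finally converted to.

```python
import re
from html import unescape

# Assumed noise rules: <br> marks a line break, every other tag is dropped.
BR_RE = re.compile(r"<br\s*/?>", re.I)
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"[ \t]+")


def remove_noise(fragment):
    """Turn an HTML fragment into plain text (hypothetical helper)."""
    text = BR_RE.sub("\n", fragment)   # keep line breaks as newlines
    text = TAG_RE.sub("", text)        # drop remaining tags (links, images)
    text = unescape(text)              # &amp; -> &, &lt; -> <, ...
    text = SPACE_RE.sub(" ", text)     # collapse runs of spaces and tabs
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


def save_text(path, text):
    """Write text as UTF-8; report failure instead of crashing."""
    try:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    except OSError as err:
        print("Could not write %s: %s" % (path, err))
        return False
    return True
```

Returning `False` on a write error, rather than raising, lets a GUI caller show a message and carry on; whether the original program does this is not stated in this document.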

Ⅳ Project support conditions

Computer system environment: the "Multi-function Post Bar image and text crawler" was developed on a Windows 10 64-bit computer;
Development software: parts were first developed in Python 3.4 and Python 3.5, then everything was converted to Python 2.7.11; the code was written in Sublime Text 3;
Auxiliary development tools: Sublime Text 3;

Ⅴ Detailed development description and implemented functions

Program structure description:

at the top level, a graphical user interface guides the user to enter the parameters the crawler needs: the post ID, whether to show only the original poster's posts, whether to display floor numbers, and the folder and image names; these are passed to the text crawler;
the text crawler downloads the HTML, uses regular expressions to filter out the required text, calls the noise removal tool to clean it up, and saves the text file;
it then passes the parameters to the image crawler, which downloads the HTML, uses regular expressions to extract the image URLs, downloads the remote images with urlretrieve, and renames them.
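
The image crawler step described above can be sketched roughly as follows. This is a hedged illustration, not the project's implementation: the names `extract_image_urls`, `image_filename`, and `download_images` are hypothetical, and the `class="BDE_Image"` pattern is an assumption about the page markup that may not match the live site. The sketch uses Python 3, where `urlretrieve` lives in `urllib.request` (in Python 2.7 it is `urllib.urlretrieve`).

```python
import os
import re
from urllib.request import urlretrieve  # Python 2.7: from urllib import urlretrieve

# ASSUMPTION: post images carry class="BDE_Image" and the class attribute
# comes before src. Adjust the pattern to the real page source.
IMAGE_RE = re.compile(r'<img[^>]*class="BDE_Image"[^>]*src="([^"]+)"')


def extract_image_urls(html):
    """Return the src of every image tag that matches the assumed pattern."""
    return IMAGE_RE.findall(html)


def image_filename(prefix, index, url):
    """Build the renamed file name, e.g. prefix_1.jpg (hypothetical scheme)."""
    ext = os.path.splitext(url.split("?")[0])[1] or ".jpg"
    return "%s_%d%s" % (prefix, index, ext)


def download_images(urls, folder, prefix, fetch=urlretrieve):
    """Download each URL into folder under its new name; skip failures."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, 1):
        path = os.path.join(folder, image_filename(prefix, index, url))
        try:
            fetch(url, path)
        except OSError as err:  # urllib's URLError is a subclass of OSError
            print("Skipping %s: %s" % (url, err))
            continue
        saved.append(path)
    return saved
```

The `fetch` parameter defaults to `urlretrieve` but can be replaced with a stub, which makes the naming and error-handling logic testable without network access.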
