ProppiScraper: Houses' information

This is a web scraping tool that downloads house listing information available on the internet.

Requirements

Python 3
Pip 3
virtualenv

First Use

Clone this repo.
You should have this folder hierarchy:

proppiScrapper
│   README.md
│   config.json
│   requirements.txt
│   scrapper.bat
│   scrapper.sh
└───logs
└───results
└───temp
└───venv
└───src

Verify the location of python3 with: which python3
Create a virtualenv called 'proppienv' in the venv folder with the following commands:

cd venv
virtualenv -p /path/to/python3 proppienv
cd ..

You should end up with:

proppiScrapper
│   README.md
│   config.json
│   requirements.txt
│   scrapper.bat
│   scrapper.sh
└───logs
└───results
└───temp
└───venv
    └───proppienv
└───src

Activate the virtualenv: source venv/proppienv/bin/activate
Install the requirements: pip3 install -r requirements.txt

Windows

To run the .bat file on Windows, follow the same steps as above, but note that the virtualenv activation script is in the Scripts folder: venv/proppienv/Scripts/activate

You must also add an environment variable named "PROPPI" set to the "proppiScrapper" path.
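
How the scripts consume "PROPPI" is not described here. Purely as an illustration, a Python script could resolve the project folders from that variable as sketched below. The `project_paths` helper and the fallback to the current directory are assumptions, not the scraper's actual code.

```python
import os
from pathlib import Path


def project_paths(environ=os.environ):
    """Build project folder paths from the PROPPI environment variable.

    Hypothetical helper: falls back to the current directory when PROPPI is unset.
    """
    root = Path(environ.get("PROPPI", "."))
    # Folder names come from the hierarchy shown under "First Use".
    return {name: root / name for name in ("logs", "results", "temp", "src")}


if __name__ == "__main__":
    for name, path in project_paths().items():
        print(f"{name}: {path}")
```

Passing the environment as a parameter keeps the helper easy to try out with a plain dictionary instead of the real environment.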

Execution

There are two options:
a) Start scraping by running scrapper.sh or scrapper.bat.
b) Run the main Python file, src/scrapper.py, after activating the virtualenv if it is not already active:

source venv/proppienv/bin/activate
python3 src/scrapper.py

Configuration

To fine-tune the scraper, edit the config.json file.

Main Configuration

"scrap_[site]" is a flag that enables or disables scraping for each site. Possible values are "True" and "False".

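If the flag values are stored as the strings "True" and "False" rather than JSON booleans, code reading them has to compare text, because in Python `bool("False")` is `True`. A minimal sketch of that conversion follows. It assumes the flags sit in a flat JSON object (the real layout of config.json may differ), and `parse_flag` is a hypothetical helper; the keys `scrap_lavoz` and `scrap_meli` are built from the site names listed under Sites Configuration.

```python
import json


def parse_flag(value):
    """Convert a "True"/"False" string (or a real boolean) to a bool."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text not in ("true", "false"):
        raise ValueError(f"expected 'True' or 'False', got {value!r}")
    return text == "true"


# Assumed, simplified config fragment; the real config.json may be nested differently.
sample = json.loads('{"scrap_lavoz": "True", "scrap_meli": "False"}')
enabled_sites = [key[len("scrap_"):] for key, value in sample.items()
                 if key.startswith("scrap_") and parse_flag(value)]
print(enabled_sites)  # ['lavoz']
```

Rejecting anything other than the two documented values surfaces typos in the config early instead of silently treating them as disabled.
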
Requests Configuration

"use_proxy" is a flag that enables or disables scraping through free Internet proxy servers. Possible values are "True", or "False" to connect directly from your machine to the target site.
"max_attempts" is the number of retries when using proxies. Proxies may fail, so proppiScrapper retries with a different proxy up to this many times. If all attempts fail, it tries without a proxy.
"sleep_time" is the time in seconds to wait between requests.

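The retry behaviour described for "max_attempts" can be sketched as follows. This is an illustration, not the scraper's code: `fetch_with_retries` is a hypothetical name, the `fetch` callable stands in for whatever HTTP call the scraper actually makes (so the sketch runs without network access), and pausing for `sleep_time` between failed attempts is an assumption.

```python
import time


def fetch_with_retries(url, fetch, proxies, max_attempts, sleep_time=0):
    """Try up to max_attempts different proxies, then fall back to a direct request.

    fetch(url, proxy) is an assumed callable that returns the page or raises
    on failure; proxy=None means "no proxy".
    """
    for proxy in proxies[:max_attempts]:
        try:
            return fetch(url, proxy)
        except Exception:
            # This proxy failed: wait, then move on to the next one.
            time.sleep(sleep_time)
    # Every proxy attempt failed, so go directly to the site.
    return fetch(url, None)


def fake_fetch(url, proxy):
    """Stand-in fetcher: every proxy fails, only the direct request succeeds."""
    if proxy is not None:
        raise ConnectionError(f"proxy {proxy} failed")
    return f"<html>{url}</html>"


print(fetch_with_retries("https://example.com", fake_fetch,
                         ["p1", "p2", "p3"], max_attempts=2))
# <html>https://example.com</html>
```

With `max_attempts=2`, only the first two proxies are tried before the direct request, even though three are available.
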
Sites Configuration

"from_page" is the page to start from. The minimum value is 1.
"pages" is the number of pages to scrape, starting at "from_page".
"result_filename" is the prefix of the result file name. If you run the scraper more than once a day, it is strongly recommended to change this parameter before each run to keep the results of each execution separate; otherwise the results will be overwritten.
"ids_filename" is the prefix of an internal file that the scraper uses. It behaves the same way as "result_filename".
"publisher_types" selects which publisher types to search. Each site categorizes its listings by publisher type, and you can restrict the search to some of them.
- Possible Values for each site:
  - "lavoz" : ["inmobiliaria","matriculado-CPI","particular"]
  - "meli" : ["inmobiliaria","dueno-directo"]
  - "zonaprop": ["inmobiliaria","dueno-directo"]
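
To make the paging and publisher settings concrete, here is a small sketch. It assumes that "pages" counts "from_page" itself (so from_page = 3 and pages = 2 covers pages 3 and 4), which the descriptions above do not state explicitly. `pages_to_scrape` and `check_publisher_types` are hypothetical helpers, not part of the scraper.

```python
# Allowed publisher types per site, copied from the list above.
PUBLISHER_TYPES = {
    "lavoz": ["inmobiliaria", "matriculado-CPI", "particular"],
    "meli": ["inmobiliaria", "dueno-directo"],
    "zonaprop": ["inmobiliaria", "dueno-directo"],
}


def pages_to_scrape(from_page, pages):
    """Return the page numbers to visit, assuming from_page is included in the count."""
    if from_page < 1:
        raise ValueError("from_page must be at least 1")
    return list(range(from_page, from_page + pages))


def check_publisher_types(site, requested):
    """Return the requested types that are not valid for the given site."""
    return [t for t in requested if t not in PUBLISHER_TYPES[site]]


print(pages_to_scrape(3, 2))  # [3, 4]
print(check_publisher_types("meli", ["particular", "inmobiliaria"]))  # ['particular']
```

The second call shows why the per-site list matters: "particular" is valid for "lavoz" but not for "meli".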
