niaev / link_tracking

A simple Python script tool and package that uses web crawling concepts to find links and pages around the internet and SQLite databases to store found data.

License: The Unlicense

Python 100.00%

web-scraping link-tracking python beautifulsoup unlicense unlicensed

link_tracking's Introduction

link_tracking

Using

You can clone this Git repository and add it to your project to use link_tracking as a package or to use the tracker.py script.

$ git clone https://github.com/Niaev/link_tracking.git

This package is not yet available on the Python Package Index (PyPI).

`tracker` script

This script can be found in the root of this repository. Follow the example below:

$ python3 tracker.py SEEDS_FILE [DEPTH]

SEEDS_FILE - the path to a text file containing a list of links, one per line. Example:

http://link-one.com/
https://link.org/two
...

DEPTH - an optional integer (default is 2) that defines the link tracking depth, that is, how many levels of child page links the tracker follows recursively, searching each child page for further links.

The script tracks links starting from your seeds, scrapes the pages they point to, and stores the results in the SQLite database data/pages.db.
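
As an illustration of the command line described above, here is a minimal sketch of how the two arguments could be read. It is not the code of tracker.py: the function names `read_seeds` and `parse_args` are hypothetical, and the only details taken from the text are the argument names and the default depth of 2.

```python
import argparse
from pathlib import Path


def read_seeds(path):
    """Return the non-empty, stripped lines of a seeds file as a list of URLs."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def parse_args(argv=None):
    """Parse SEEDS_FILE and the optional DEPTH (default 2, as documented above)."""
    parser = argparse.ArgumentParser(prog="tracker.py")
    parser.add_argument("seeds_file", metavar="SEEDS_FILE",
                        help="text file with one link per line")
    parser.add_argument("depth", metavar="DEPTH", nargs="?", type=int, default=2,
                        help="link tracking depth (default: 2)")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    seeds = read_seeds(args.seeds_file)
    print(f"{len(seeds)} seed(s), depth {args.depth}")
```

Making DEPTH an optional positional argument (`nargs="?"`) mirrors the usage line, where the depth may simply be left out.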

as a package

It has two modules: crawler and indexer.

crawler has some functions and the Crawler class, responsible for web crawling.

Class that receives a URL and uses urllib and bs4 to get page information, with functions to track and scrape links
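
To make the idea of tracking links concrete, the sketch below collects the links of a page from its HTML. It is only an illustration and not the Crawler class itself: the package uses bs4, whereas this example swaps in the standard library's `html.parser` so that it runs without extra dependencies. The names `LinkCollector` and `track` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def track(html, base_url):
    """Return every link found in an HTML string, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

In a real crawler, the HTML string would come from a request to `base_url`, for example through `urllib.request.urlopen`.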

indexer has just the Indexer class, responsible for handling, organizing and storing the collected data.

Class that receives a list of links to organize and index
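
Organizing a list of links usually means dropping duplicates and invalid entries before storing anything. The sketch below shows one way to do this; it is not the Indexer class, the function names are hypothetical, and the validity rule (an absolute http or https URL with a host) is an assumption made for the example.

```python
from urllib.parse import urlparse


def remove_duplicates(links):
    """Return a new list without duplicates, keeping first-seen order."""
    return list(dict.fromkeys(links))


def is_valid_link(link):
    """Accept only absolute http(s) URLs that have a host part (assumed rule)."""
    parts = urlparse(link)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def clean_links(links):
    """Drop duplicates and invalid entries, then sort the result."""
    return sorted(link for link in remove_duplicates(links) if is_valid_link(link))
```

`dict.fromkeys` is used instead of `set` because it preserves the order in which links were first seen.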

The code is documented with docstrings and comments. More in-depth documentation is planned for this repository's wiki, which is not yet available.

link_tracking's Issues

urlopen error "Server unavailabe or incorrect domain name: [insert any domain name here]"

Remake - Tasks

Introduction

Specify the project tasks
Create the remake branch in the repository
Delete __init__.py and links.txt in the new branch

Base

Create data/pages.db - data file
Create dbbuilder.py - file that creates the database
Create tracker.py - file that controls the classes
Create crawler.py - file with the Crawler class
Create indexer.py - file with the Indexer class

`dbbuilder.py`

File that creates the database

Import sqlite3
Create the links table
Create the table of links and contents
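
The two tables listed above could be created along these lines. This is a sketch under assumptions, not the contents of dbbuilder.py: the table names `links` and `pages` and all column names are hypothetical, chosen to match the title, description and content that the scraping tasks mention.

```python
import sqlite3


def build_database(path="data/pages.db"):
    """Create the two tables if they do not exist and return the connection.

    The directory of `path` must already exist; pass ":memory:" to try the
    schema without touching the disk.
    """
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS links (
            url TEXT PRIMARY KEY
        );
        CREATE TABLE IF NOT EXISTS pages (
            url         TEXT PRIMARY KEY,
            title       TEXT,
            description TEXT,
            content     TEXT
        );
        """
    )
    conn.commit()
    return conn
```

Using the URL as the primary key is a design choice of this sketch that makes a second insert of the same link fail or be replaced instead of silently duplicating it.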

`tracker.py`

File that will control the classes.

Ask for the seeds - the base links for scraping
Ask for the scraping depth
Start the Crawler
Create the list of links
Start the Indexer
Process the list of links and store it
Create the list of contents
Process the list of contents and store it

`crawler.py`

This file should contain the Crawler class and the functions related to web crawling.

Set up the class
track() - Develop a function that finds links in a page
scrape() - Develop a function that scrapes a page, looking for its title, description and main content
scrape_list() - Develop a function that scrapes an arbitrary list of links, outside of Crawler
scrape_links() - Develop a function that scrapes all the links found by track()
track_with_depht() - Develop a recursive function that finds links in a page, with a limit on the number of depth levels
~~scrape_with_depht() - Develop a recursive function that scrapes a page and all of its links, with a limit on the number of depth levels~~
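
The depth-limited recursive tracking in the list above can be sketched as follows. This is an illustration, not the repository's track_with_depht(): the name `track_with_depth` and the `get_links` parameter (any callable that maps a URL to the links on that page) are assumptions, introduced so that the example runs without network access.

```python
def track_with_depth(url, depth, get_links, seen=None):
    """Recursively collect the links reachable from `url` within `depth` levels.

    `seen` maps each discovered link to the largest remaining depth budget it
    was expanded with, which prevents endless loops on cyclic links.
    """
    if seen is None:
        seen = {}
    if depth <= 0:
        return list(seen)
    for link in get_links(url):
        # Expand a link again only if more depth budget is left than last time.
        if seen.get(link, -1) < depth - 1:
            seen[link] = depth - 1
            track_with_depth(link, depth - 1, get_links, seen)
    return list(seen)


if __name__ == "__main__":
    # A tiny in-memory "web" standing in for real pages.
    graph = {"s": ["x", "y"], "x": ["y"], "y": ["z"]}
    print(track_with_depth("s", 2, lambda u: graph.get(u, [])))
```

Recording the remaining budget rather than a plain visited flag matters in a depth-first search: a page first reached at the deepest level would otherwise never be expanded when it is reached again closer to the seed.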

`indexer.py`

This file should contain the Indexer class and the functions for processing lists of links: removing duplicates, sorting the links, and storing them.

Set up the class
removed_duplis() - Develop a function that removes duplicates from a list and returns a new list without duplicates
valid_links() - Develop a function that removes invalid links and returns a new list with only valid links
order_scraped_links() - Develop a function that sorts dictionaries holding the content of scraped pages
store_links() - Develop a function that stores links in the data file
store_pages() - Develop a function that stores dictionaries holding the content of scraped pages
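
The sorting and storing steps for scraped pages might look like the sketch below. It does not reproduce the Indexer methods: the dictionary keys (`url`, `title`, `description`, `content`), the `pages` table layout and the choice of sorting by URL are all assumptions made for the example.

```python
import sqlite3


def order_scraped_pages(pages):
    """Return the scraped-page dictionaries sorted by URL (an assumed ordering)."""
    return sorted(pages, key=lambda page: page["url"])


def store_pages(conn, pages):
    """Write one row per page dictionary and return how many rows were written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, title TEXT, description TEXT, content TEXT)"
    )
    rows = [
        (p["url"], p.get("title"), p.get("description"), p.get("content"))
        for p in order_scraped_pages(pages)
    ]
    # INSERT OR REPLACE keeps a single row per URL when a page is scraped again.
    conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

A call such as `store_pages(sqlite3.connect("data/pages.db"), pages)` would write to the data file named in the task list, provided the data directory exists.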

Wrap-up

Write the introduction and documentation in README.md (in English)
Write README_PTBR.md
git merge remake

Other tasks that are not as important for development

Rename the repository (no suggestions yet)

tracker.py script keeps getting killed by the terminal due to high RAM consumption

Due to excessive recursive function calls, the tracker.py script consumes a large amount of RAM and keeps getting killed automatically by the terminal on Ubuntu 20.04.
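
One common way to avoid deep recursion in a crawler is to walk the links iteratively with a queue and a set of already seen URLs. The sketch below illustrates that idea only; it is not a patch for tracker.py, it has not been measured against this issue, and the names `track_iteratively` and `get_links` (a callable that maps a URL to the links on that page) are hypothetical.

```python
from collections import deque


def track_iteratively(seeds, depth, get_links):
    """Breadth-first link tracking without recursion.

    Each URL is queued at most once, so memory grows with the number of
    distinct URLs rather than with the number of nested calls.
    """
    seen = set(seeds)
    found = []
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, level = queue.popleft()
        if level >= depth:
            continue  # depth budget used up: do not fetch this page
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                found.append(link)
                queue.append((link, level + 1))
    return found
```

Because a breadth-first walk reaches every page at its smallest possible depth first, a plain visited set is enough here and no page needs to be fetched twice.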

To solve this issue, please check out these links:
