Course: Fundamentals of Web Scraping with Python and XPath

Web scraping: a technique used by data scientists and backend developers to extract information from the internet. The data is accessed over the Hypertext Transfer Protocol (HTTP) or through a browser. The extracted data is usually stored in a database, or even in a spreadsheet, for later analysis. It can be done automatically (with a bot) or manually.

XPath: a language for addressing parts of an XML document. XPath models an XML document as a tree of nodes. There are different types of nodes, including elements, attributes, and text.

Corporations that use web scraping

Articles

14 Web Scraping Tools: Who They Are For & What They Excel At
ParseHub is a system for automated scrap
http://plasmasturm.org/log/xpath101/

Frameworks for web scraping

Scrapy
Puppeteer
Playwright
Cypress

HTTP status categories

Informational responses (100 – 199)
Successful responses (200 – 299)
Redirection messages (300 – 399)
Client error responses (400 – 499)
Server error responses (500 – 599)
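
The ranges above can be turned into a small lookup. This is a minimal sketch in Python; the function name `status_category` is made up for illustration and is not part of any library.

```python
def status_category(code: int) -> str:
    """Return the HTTP status category for a numeric status code."""
    categories = {
        1: "Informational",
        2: "Successful",
        3: "Redirection",
        4: "Client error",
        5: "Server error",
    }
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    # The first digit of the code decides the category.
    return categories[code // 100]


for code in (200, 301, 404, 503):
    print(code, status_category(code))
```

A scraper would typically check this category before trying to parse a response body.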

HTML: HyperText Markup Language

Snippet: a small piece of code

robots.txt tells search engine crawlers which pages or files on your site they may or may not request.

https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt?hl=es&visit_id=638213797110764645-1443811511&rd=1
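
A scraper can check these rules before requesting a page. The sketch below uses `urllib.robotparser` from the Python standard library. The robots.txt content, the `my-scraper` user agent, and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be downloaded from
# https://<site>/robots.txt (RobotFileParser.set_url + read can do that).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url: str, user_agent: str = "my-scraper") -> bool:
    """Return True if the parsed rules allow user_agent to request url."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("https://example.com/quotes/"))
print(may_fetch("https://example.com/private/data.html"))
```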

An XPath expression is made up of nodes (tags).

Site to practice scraping: http://toscrape.com

Nodes (tags): in XPath, a node is a part of the document tree, such as an element (tag), an attribute, or a piece of text. A node can contain other nodes. In other words, XPath lets us navigate to whatever depth we need in order to extract information. Nodes and the relationships between them are described with an axis syntax.

XPath expressions

To write expressions we use $x('') in the browser console. The expression goes between the quotes. Expressions use different symbols, each with its own purpose.

The purpose of each symbol is described below.

/ refers to the root, or a step from one node to the next. e.g. /html/body shows everything inside the body of html
// selects all the nodes with the given tag, at any depth. e.g. //span shows all the span tags
.. selects the parent node. e.g. //span/.. selects the parent nodes of all the span tags
. refers to the current node. e.g. //span/. is equivalent to //span
@ selects attributes. e.g. //div/@class gives the class attributes of all the divs

All of these examples go inside the quotes of $x('').
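
The same symbols can be tried outside the browser. This is a minimal sketch using `xml.etree.ElementTree` from the Python standard library, which supports only a limited subset of XPath. Paths are written relative to the root element (`./body` rather than `/html/body`), and attribute values are read with `.get()` because ElementTree cannot return attributes via `@`. The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
DOC = """
<html>
  <body>
    <div class="container">
      <span class="text">First quote</span>
      <span class="tag-item">life</span>
    </div>
  </body>
</html>
"""
root = ET.fromstring(DOC)  # root is the <html> element

# /html/body : one step at a time from the root
body = root.findall("./body")

# //span : every span at any depth
spans = root.findall(".//span")

# //span/.. : the parents of the spans (each parent is returned once)
parents = root.findall(".//span/..")

# . : the current node
current = root.findall(".")

# //div/@class : read the attribute from each matched element
div_classes = [div.get("class") for div in root.findall(".//div")]

print([e.tag for e in body], [e.text for e in spans])
print([e.tag for e in parents], [e.tag for e in current], div_classes)
```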

XPath predicates

A predicate uses '[]' inside an XPath expression to filter the selected nodes.

Without a predicate

$x('/html/body/div/div')
return (2) [div.row.header-box, div.row]

With a predicate (positions start at 1)

$x('/html/body/div/div[1]')
return [div.row.header-box]

To get the last item, use 'last()'

$x('/html/body/div/div[last()]')
return [div.row]

With attributes

$x('//span[@class="text"]/text()')
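
Positional and attribute predicates are also part of the XPath subset that Python's standard-library `xml.etree.ElementTree` understands. A minimal sketch follows, assuming a hypothetical document loosely shaped like the page queried above. The helper `classes` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """
<html>
  <body>
    <div class="container">
      <div class="row header-box"></div>
      <div class="row">
        <span class="text">Quote A</span>
        <span class="text">Quote B</span>
        <span>by an author</span>
      </div>
    </div>
  </body>
</html>
"""
root = ET.fromstring(DOC)


def classes(elements):
    """Return the class attribute of each element."""
    return [e.get("class") for e in elements]


no_predicate = root.findall("./body/div/div")    # both inner divs
first = root.findall("./body/div/div[1]")        # positions start at 1
last = root.findall("./body/div/div[last()]")    # the last inner div
texts = [s.text for s in root.findall(".//span[@class='text']")]

print(classes(no_predicate), classes(first), classes(last), texts)
```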

XPath operators

'!=' selects all the span elements whose class attribute is not "text". Spans with no class attribute at all are not matched.

$x('//span[@class!="text"]')
(11) [span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.sh-red]

'position()' can be combined with the usual comparison operators

$x('/html/body/div/div[position()>1]')

'or' matches a span if either of the conditions is true

$x('//span[@class="text" or @class="tag-item"]')
(20) [span.text, span.text, span.text, span.text, span.text, span.text, span.text, span.text, span.text, span.text, span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.tag-item, span.tag-item]

'and' requires both conditions to be true. Here the result is empty, because a class attribute cannot be equal to both values at once.

$x('//span[@class="text" and @class="tag-item"]')
[]

'not()' negates a condition. Here it selects the spans that have no class attribute.

$x('//span[not(@class)]')
(11) [span, span, span, span, span, span, span, span, span, span, span]
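
Python's standard-library `xml.etree.ElementTree` does not implement `or`, `and`, `not()` or `position()` comparisons, so the sketch below mirrors the logic of each operator with plain Python filters instead of XPath. The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """
<div>
  <span class="text">Quote A</span>
  <span>by</span>
  <span class="tag-item">life</span>
  <span class="text">Quote B</span>
  <span class="sh-red">heart</span>
</div>
"""
root = ET.fromstring(DOC)
spans = root.findall("./span")

# //span[@class!="text"] : has a class attribute, and it is not "text"
not_text = [s for s in spans
            if s.get("class") is not None and s.get("class") != "text"]

# //span[@class="text" or @class="tag-item"]
either = [s for s in spans if s.get("class") in ("text", "tag-item")]

# //span[@class="text" and @class="tag-item"] : can never be true
both = [s for s in spans
        if s.get("class") == "text" and s.get("class") == "tag-item"]

# //span[not(@class)] : no class attribute at all
no_class = [s for s in spans if "class" not in s.attrib]

# span[position()>1] : every span after the first one
after_first = spans[1:]

print([s.text for s in not_text], [s.text for s in either])
print(both, [s.text for s in no_class], len(after_first))
```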

Wildcards

'/*' selects all the element nodes directly inside the selected node

$x('/html/*')
(2) [head, body]

'//*' selects every element node in the document

$x('//*')
(153) [html, head, meta, title, link, link, style#operaUserStyle, style, body, div.container, div.row.header-box, div.col-md-8, h1, a, div.col-md-4, p, a, div.row, div.col-md-8, div.quote, span.text, span, small.author, a, div.tags, meta.keywords, a.tag, a.tag, a.tag, a.tag, div.quote, span.text, span, small.author, a, div.tags, meta.keywords, a.tag, a.tag, div.quote, span.text, span, small.author, a, div.tags, meta.keywords, a.tag, a.tag, a.tag, a.tag, a.tag, div.quote, span.text, span, small.author, a, div.tags, meta.keywords, a.tag, a.tag, a.tag, a.tag, div.quote, span.text, span, small.author, a, div.tags, meta.keywords, a.tag, a.tag, div.quote, span.text, span, small.author, a, div.tags, meta.keywords, a.tag, a.tag, a.tag, div.quote, span.text, span, small.author, a, div.tags, meta.keywords, a.tag, a.tag, div.quote, span.text, span, small.author, a, div.tags, meta.keywords, a.tag, a.tag, a.tag, …]

'/@*' selects all the attributes of the selected node

$x('//span[@class="text"]/@*')
(20) [class, itemprop, class, itemprop, class, itemprop, class, itemprop, class, itemprop, class, itemprop, class, itemprop, class, itemprop, class, itemprop, class, itemprop]

'//element/@*' selects the attributes of every matching element. Here, of all the div elements inside body.

$x('/html/body//div/@*')
(48) [class, class, class, class, class, class, class, itemscope, itemtype, class, class, itemscope, itemtype, class, class, itemscope, itemtype, class, class, itemscope, itemtype, class, class, itemscope, itemtype, class, class, itemscope, itemtype, class, class, itemscope, itemtype, class, class, itemscope, itemtype, class, class, itemscope, itemtype, class, class, itemscope, itemtype, class, class, class]

'node()' selects every child node of the current node, including text nodes

$x('//span[@class="text" and @itemprop="text"]/node()')
(10) [text, text, text, text, text, text, text, text, text, text]

In-text search in XPath

using the function 'starts-with'

$x('//small[@class="author" and starts-with(.,"A")]')
(4) [small.author, small.author, small.author, small.author]

$x('//small[@class="author" and starts-with(.,"A")]/text()').map(x => x.wholeText)
(4) ['Albert Einstein', 'Albert Einstein', 'Albert Einstein', 'André Gide']

using the function 'contains'

$x('//small[@class="author" and contains(.,"Ro")]/text()').map(x => x.wholeText)
(2) ['J.K. Rowling', 'Eleanor Roosevelt']

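using the function 'ends-with'. Note that ends-with is an XPath 2.0 function; the browser console evaluates XPath 1.0, so this expression is expected to fail there.

$x('//small[@class="author" and ends-with(.,"t")]/text()').map(x => x.wholeText)

using the function 'matches'. Like ends-with, matches is an XPath 2.0 function and is not expected to work in the browser console.

$x('//small[@class="author" and matches(.,"A.*n")]/text()').map(x => x.wholeText)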
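
The four text tests have direct counterparts in Python. This is a minimal sketch that selects the author elements with the standard-library `xml.etree.ElementTree` and then filters on their text in plain Python. The sample document is hypothetical and reuses author names from the outputs above. `re.search` stands in for `matches`, which also looks for the pattern anywhere in the string.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """
<div>
  <small class="author">Albert Einstein</small>
  <small class="author">J.K. Rowling</small>
  <small class="author">André Gide</small>
  <small class="author">Eleanor Roosevelt</small>
</div>
"""
root = ET.fromstring(DOC)
authors = [s.text for s in root.findall(".//small[@class='author']")]

starts_with_a = [a for a in authors if a.startswith("A")]  # starts-with(., "A")
contains_ro = [a for a in authors if "Ro" in a]            # contains(., "Ro")
ends_with_t = [a for a in authors if a.endswith("t")]      # ends-with(., "t")
matches_a_n = [a for a in authors if re.search(r"A.*n", a)]  # matches(., "A.*n")

print(starts_with_a)
print(contains_ro)
print(ends_with_t)
print(matches_a_n)
```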

XPath axes

'self::' is the same as using a dot (.)

$x('/html/body/div/self::div')
[div.container]

'child::' gets the children of the node

$x('/html/body/div/child::div')
(2) [div.row.header-box, div.row]

'descendant::' gets all the nodes at any depth below the node

$x('/html/body/div/descendant::div')
(26) [div.row.header-box, div.col-md-8, div.col-md-4, div.row, div.col-md-8, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.col-md-4.tags-box]

'descendant-or-self::' gets the node itself plus all of its descendants

$x('/html/body/div/descendant-or-self::div')
(27) [div.container, div.row.header-box, div.col-md-8, div.col-md-4, div.row, div.col-md-8, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.quote, div.tags, div.col-md-4.tags-box]
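
Python's standard-library `xml.etree.ElementTree` has no axis syntax, but each axis shown here has a close equivalent. The sketch below assumes a hypothetical document. The mapping in the comments is an approximation for illustration, not an official correspondence.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """
<body>
  <div class="container">
    <div class="row header-box">
      <div class="col-md-8"></div>
    </div>
    <div class="row"></div>
  </div>
</body>
"""
body = ET.fromstring(DOC)
container = body.find("./div")


def classes(elements):
    """Return the class attribute of each element."""
    return [e.get("class") for e in elements]


self_axis = [container]                           # self::div
children = container.findall("./div")             # child::div
descendants = container.findall(".//div")         # descendant::div
descendant_or_self = list(container.iter("div"))  # descendant-or-self::div

print(classes(self_axis))
print(classes(children))
print(classes(descendants))
print(classes(descendant_or_self))
```

`findall(".//div")` leaves out the starting element, while `iter("div")` includes it, which is the same difference as between the two descendant axes.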

khr1stopher / cfws