
web-parsing-scripts

A repository for web scrapers mostly written in Python using the Selenium framework. I'll add more to this repository as I learn more.

Why does this exist?

I've been curious for a while about why some websites show a "Prove you are not a robot" prompt before allowing you access to their content. Then I found out about data mining and browser automation, and I made this repository to document my learning process in this field.

This scraper retrieves the links of trending projects from a topic page specified as a parameter.
The script uses Selenium to dynamically load new articles based on the value of a second parameter: each increment of that parameter loads 40 more entries from the page. To begin the scraping process, the script first loads the initial page, then repeatedly clicks the "Load more" button the number of times given by the second parameter, which allows additional content to be retrieved. Once all the desired content has been loaded, it is extracted and stored in an XML file.
This is the first dynamically loaded page I have "scraped" that justifies the use of an automation framework.
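The flow described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the repository's actual script: the GitHub topics URL, the button locator, and the output file name are all assumptions.

```python
import sys
import xml.etree.ElementTree as ET


def links_to_xml(links):
    """Serialize a list of project URLs into an XML document string."""
    root = ET.Element("links")
    for url in links:
        ET.SubElement(root, "link").text = url
    return ET.tostring(root, encoding="unicode")


def scrape_topic(topic, clicks):
    """Open a topic page and click 'Load more' `clicks` times
    (each click loads 40 more entries), then collect the project links."""
    # Selenium is imported here so links_to_xml() stays usable
    # without a browser or geckodriver installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # assumes geckodriver is on PATH
    try:
        driver.get(f"https://github.com/topics/{topic}")  # assumed URL scheme
        for _ in range(clicks):
            driver.find_element(
                By.XPATH, "//button[contains(., 'Load more')]").click()
        anchors = driver.find_elements(By.CSS_SELECTOR, "article h3 a")
        return [a.get_attribute("href") for a in anchors]
    finally:
        driver.quit()


if __name__ == "__main__" and len(sys.argv) >= 3:
    # e.g. python topic_scraper.py selenium 3
    with open("trending.xml", "w") as out:
        out.write(links_to_xml(scrape_topic(sys.argv[1], int(sys.argv[2]))))
```

The XML serialization is kept in a separate helper so it can be reused or tested without driving a browser.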

A web scraper made with the purpose of getting data about popular posts on the Y Combinator forum, Hacker News. Despite Hacker News being a static website that I could have accessed via requests (e.g., I could have reached every page after the front page simply by appending /?p=$nextpagenumber to the URL), I opted to use the automation framework Selenium instead.
I use Selenium to "click" the "More" button at the bottom of the page so that it loads the next page of the forum to be analyzed, and it keeps going until the number of pages loaded reaches the value read as a parameter at the script's execution.
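A minimal sketch of that click-through loop might look like the following. The CSS selector for post titles and the function names are assumptions, not the repository's actual code.

```python
def page_url(n):
    """The static alternative mentioned above: page n of the forum is
    reachable by appending ?p=n to the front-page URL."""
    return f"https://news.ycombinator.com/news?p={n}"


def scrape_pages(pages):
    """Collect post titles, clicking 'More' to load each next page."""
    # Imported lazily so page_url() can be used without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get("https://news.ycombinator.com/")
        titles = []
        for page in range(pages):
            titles += [a.text for a in
                       driver.find_elements(By.CSS_SELECTOR, ".titleline > a")]
            if page < pages - 1:
                driver.find_element(By.LINK_TEXT, "More").click()
        return titles
    finally:
        driver.quit()
```

Clicking `By.LINK_TEXT, "More"` mirrors what a human reader does at the bottom of each page, which is exactly the behavior the requests-based alternative would skip.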

A basic web parser made for static HTML pages. It gets all the links from a page, and can also do so recursively, up to a recursion depth that can be set. The URL of the scraped page and the recursion depth are read from the command line when executing the script, and everything captured gets written to a text file.
Since it's only for static web pages, I didn't see the need for more complicated frameworks, so I settled for the Beautiful Soup module to avoid overcomplicating things.
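A depth-limited recursive link collector along those lines could be sketched as below. The output file name, the `seen` cycle guard, and the URL resolution via `urljoin` are assumptions layered on top of the description above.

```python
import sys
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return every href in the page, resolved to an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def crawl(url, depth, seen=None):
    """Collect links from `url`, following each one up to `depth` levels."""
    import requests  # imported lazily so extract_links() is usable offline
    seen = set() if seen is None else seen
    if depth < 0 or url in seen:
        return []
    seen.add(url)
    links = extract_links(requests.get(url, timeout=10).text, url)
    collected = list(links)
    for link in links:
        collected += crawl(link, depth - 1, seen)
    return collected


if __name__ == "__main__" and len(sys.argv) >= 3:
    # e.g. python static_parser.py https://example.com 2
    with open("links.txt", "w") as out:
        out.write("\n".join(crawl(sys.argv[1], int(sys.argv[2]))))
```

The `seen` set is not mentioned in the description but is worth having in any recursive crawler, since two pages linking to each other would otherwise recurse until the depth budget runs out on every branch.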

Resources

Installing Firefox without Snap: avoids the issue of Firefox not being found when using the geckodriver engine on Ubuntu, where Firefox is installed from Snap by default.
Gecko Webdriver Installer Distro-agnostic script that downloads and installs the "freshest" release of the Gecko Webdriver and moves it to /usr/bin.

Contributors

lukapopovici
