
Web Scraping with Selenium Python and Beautiful Soup


In this 'Web Scraping with Python' repo, we have covered the following use cases:

  • Web Scraping using Selenium PyUnit
  • Web Scraping using Selenium Pytest
  • Web Scraping of dynamic website using Beautiful Soup and Selenium

The following websites are used for demonstrating web scraping:

As mentioned online, scraping public web data from YouTube is legal as long as you do not go after information that is unavailable to the general public. However, YouTube scraping might still throw errors (or exceptions), particularly when the scraping is done on the cloud Selenium Grid.

Pre-requisites for test execution

Step 1

Create a virtual environment by triggering the virtualenv venv command on the terminal

virtualenv venv

Step 2

Activate the newly created virtual environment by triggering the source venv/bin/activate command on the terminal

source venv/bin/activate

Follow Steps (3) and (4) for performing web scraping on the LambdaTest Cloud Grid:

Step 3

Procure the LambdaTest User Name and Access Key by navigating to the LambdaTest Account Page. You might need to create an account on LambdaTest since it is used for running tests (or scraping) on the cloud Grid.


Step 4

Add the LambdaTest User Name and Access Key to the Makefile located in the parent directory. Once done, save the Makefile.


Dependency/Package Installation

Run the make install command on the terminal to install the desired packages (or dependencies) - Pytest, Selenium, Beautiful Soup, etc.

make install

With this, all the dependencies and environment variables are set. We are all set for web scraping with the desired frameworks (i.e., PyUnit, Pytest, and Beautiful Soup).
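Once make install completes, the scraping scripts can spin up Chrome in headless mode. The snippet below is a minimal sketch of how the browser flags might be assembled; the function name and flag list are illustrative assumptions, not the repo's exact configuration.

```python
def headless_chrome_flags(window_size="1920,1080"):
    """Return Chrome flags commonly used for headless scraping.

    Note: this flag list is illustrative; the repo's tests may
    configure the browser differently.
    """
    return [
        "--headless=new",            # run Chrome without a visible UI
        "--disable-gpu",             # avoid GPU issues in headless environments
        "--no-sandbox",              # often required inside containers/CI
        f"--window-size={window_size}",
    ]

# With Selenium installed, the flags would be applied roughly like this:
# from selenium import webdriver
# options = webdriver.ChromeOptions()
# for flag in headless_chrome_flags():
#     options.add_argument(flag)
# driver = webdriver.Chrome(options=options)
```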

Web Scraping using Selenium PyUnit (Local Execution)

The following websites are used for demonstration:

Follow the below-mentioned steps to perform scraping on your local machine:

Step 1

Set the EXEC_PLATFORM environment variable to local. Trigger the command export EXEC_PLATFORM=local on the terminal.
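EXEC_PLATFORM is the switch the test code can read to decide between a local browser and the cloud grid. A minimal sketch of that branching (the helper name and default are hypothetical):

```python
import os

def resolve_exec_platform(default="local"):
    """Read EXEC_PLATFORM and normalize it to 'local' or 'cloud'.

    Illustrative helper; the repo's tests may read the variable
    differently.
    """
    value = os.getenv("EXEC_PLATFORM", default).strip().lower()
    if value not in ("local", "cloud"):
        raise ValueError(f"Unsupported EXEC_PLATFORM: {value!r}")
    return value
```

The normalization means `EXEC_PLATFORM=Cloud` and `EXEC_PLATFORM=cloud` behave the same, which avoids a common source of confusing "wrong grid" failures.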


Step 2

Trigger the command make clean to remove the __pycache__ folder(s) and .pyc files


Step 3

The Chrome browser is invoked in headless mode. It is recommended to install Chrome on your machine before you proceed to Step (4)

Step 4

Trigger the make scrap-using-pyunit command on the terminal to scrape content from the above-mentioned websites


As seen above, the content from the LambdaTest YouTube channel and the LambdaTest e-commerce playground is scraped successfully!
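Under the hood, a PyUnit test drives Selenium and then parses driver.page_source. The sketch below illustrates the pattern with the stdlib html.parser standing in for Beautiful Soup and a hard-coded HTML fragment standing in for the live page; all class and helper names are hypothetical.

```python
import unittest
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text of <h4> elements, standing in for product titles."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h4":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())

def extract_titles(page_source):
    """Return the product titles found in a page's HTML."""
    parser = TitleParser()
    parser.feed(page_source)
    return parser.titles

class ScrapeTitlesTest(unittest.TestCase):
    def test_titles_extracted(self):
        # In the real suite, page_source would come from driver.page_source
        page_source = "<div><h4>iPhone</h4><h4>MacBook</h4></div>"
        self.assertEqual(extract_titles(page_source), ["iPhone", "MacBook"])
```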

Web Scraping using Selenium Pytest (Local Execution)

The following websites are used for demonstration:

Follow the below-mentioned steps to perform scraping on your local machine:

Step 1

Set the EXEC_PLATFORM environment variable to local. Trigger the command export EXEC_PLATFORM=local on the terminal.


Step 2

The Chrome browser is invoked in headless mode. It is recommended to install Chrome on your machine before you proceed to Step (4)

Step 3

Trigger the command make clean to remove the __pycache__ folder(s) and .pyc files


Step 4

Trigger the make scrap-using-pytest command on the terminal to scrape content from the above-mentioned websites

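The Pytest variant expresses the same checks as plain test_ functions with bare assert statements, which Pytest discovers automatically. A browser-free sketch (the price-parsing helper is illustrative, not the repo's actual code):

```python
def parse_price(raw):
    """Convert a scraped price string like '$99.00' to a float.

    Illustrative helper; the repo's real assertions operate on the
    scraped e-commerce playground content.
    """
    return float(raw.replace("$", "").replace(",", "").strip())

# Pytest collects any function whose name starts with `test_`.
def test_parse_price():
    assert parse_price("$99.00") == 99.0
    assert parse_price("$1,299.50") == 1299.5
```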

Web Scraping using Beautiful Soup

Beautiful Soup is a Python library that is widely used for screen scraping (or web scraping). More information about the library is available on the Beautiful Soup homepage.

The Beautiful Soup (bs4) library is already installed as part of the pre-requisite steps. Hence, it is safe to proceed with scraping using Beautiful Soup. The Scraping Club Infinite Scroll website has infinitely scrolling pages, and Selenium is used to scroll to the end of the page so that all the items on the page can be scraped using the said libraries.
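The scroll-to-the-end logic boils down to a loop: scroll, re-measure the page height, and stop once the height stops growing. It is sketched below with the browser interactions injected as callables so the loop is self-contained; in the real code they would be driver.execute_script(...) calls, and the final page_source would then be handed to Beautiful Soup.

```python
def scroll_to_bottom(get_height, scroll_down, max_rounds=50):
    """Repeatedly scroll until the page height stops increasing.

    get_height/scroll_down stand in for Selenium calls such as
    driver.execute_script("return document.body.scrollHeight") and
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)").
    A real implementation would also sleep/wait between rounds so the
    lazy-loaded items have time to render.
    """
    last_height = get_height()
    for _ in range(max_rounds):
        scroll_down()
        new_height = get_height()
        if new_height == last_height:
            break  # no new content loaded; we have reached the bottom
        last_height = new_height
    return last_height
```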

The following websites are used for demonstration:

Follow the below-mentioned steps to perform web scraping using Beautiful Soup (bs4):

Step 1

Set the EXEC_PLATFORM environment variable to local. Trigger the command export EXEC_PLATFORM=local on the terminal.


Step 2

Trigger the make scrap-using-beautiful-soup command on the terminal to scrape content from the above-mentioned websites


As seen above, content on pages (1) through (5) of the LambdaTest E-Commerce Playground is successfully displayed on the console.
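Scraping pages (1) through (5) amounts to iterating over a page query parameter. A sketch of generating the paginated URLs (the parameter name is an assumption for illustration; check the playground's actual URL scheme):

```python
def page_urls(base_url, pages):
    """Build paginated URLs by appending a `page` query parameter.

    The query-parameter name is illustrative; the e-commerce
    playground's real URLs may use a different scheme.
    """
    sep = "&" if "?" in base_url else "?"
    return [f"{base_url}{sep}page={n}" for n in range(1, pages + 1)]
```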


Also, all 60 items on the Scraping Club Infinite Scroll website are scraped without any issues.

Web Scraping using Selenium Cloud Grid and Python

Note: As mentioned earlier, there could be cases where YouTube scraping fails on the cloud grid (particularly when there have been a number of attempts to scrape the content). Since cookies and other settings are cleared (or sanitized) after every test session, YouTube might mistake genuine web scraping for a bot attack! In such cases, you might come across the following page, where cookie consent has to be given by clicking the "Accept all" button.


You can find more information in this insightful Stack Overflow thread.

Since we are using the LambdaTest Selenium Grid for test execution, it is recommended to create an account on LambdaTest before proceeding with the test execution. Procure the LambdaTest User Name and Access Key by navigating to the LambdaTest Account Page.
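When EXEC_PLATFORM is cloud, Selenium's Remote WebDriver connects to the LambdaTest hub. Below is a sketch of assembling the hub URL from credentials; the environment-variable names are assumptions for illustration, though the hub hostname follows LambdaTest's documented pattern.

```python
import os

def lambdatest_hub_url():
    """Build the Selenium Grid hub URL from LT_USERNAME/LT_ACCESS_KEY.

    The environment-variable names are illustrative; the repo's
    Makefile may pass credentials differently.
    """
    user = os.getenv("LT_USERNAME", "your-username")
    key = os.getenv("LT_ACCESS_KEY", "your-access-key")
    return f"https://{user}:{key}@hub.lambdatest.com/wd/hub"

# With Selenium installed, the remote session would be created roughly as:
# from selenium import webdriver
# driver = webdriver.Remote(command_executor=lambdatest_hub_url(),
#                           options=options)
```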


Web Scraping using Selenium PyUnit (Cloud Execution)

The following websites are used for demonstration:

Follow the below-mentioned steps to perform scraping on the LambdaTest cloud grid:

Step 1

Set the EXEC_PLATFORM environment variable to cloud. Trigger the command export EXEC_PLATFORM=cloud on the terminal.


Step 2

Trigger the command make clean to remove the __pycache__ folder(s) and .pyc files


Step 3

Trigger the make scrap-using-pyunit command on the terminal to scrape content from the above-mentioned websites


As seen above, the content from the LambdaTest YouTube channel and the LambdaTest e-commerce playground is scraped successfully! You can find the status of test execution in the LambdaTest Automation Dashboard.


As seen above, the status of test execution is "Completed". Since the browser is instantiated in headless mode, the video recording is not available on the dashboard.

Web Scraping using Selenium Pytest (Cloud Execution)

The following websites are used for demonstration:

Follow the below-mentioned steps to perform scraping on the LambdaTest cloud grid:

Step 1

Set the EXEC_PLATFORM environment variable to cloud. Trigger the command export EXEC_PLATFORM=cloud on the terminal.


Step 2

Trigger the command make clean to remove the __pycache__ folder(s) and .pyc files


Step 3

Trigger the make scrap-using-pytest command on the terminal to scrape content from the above-mentioned websites


As seen above, the content from the LambdaTest YouTube channel and the LambdaTest e-commerce playground is scraped successfully! You can find the status of test execution in the LambdaTest Automation Dashboard.


As seen above, the status of test execution is "Completed". Since the browser is instantiated in headless mode, the video recording is not available on the dashboard.

Have feedback or need assistance?

Feel free to fork the repo and contribute to making it better! Email himanshu[dot]sheth[at]gmail[dot]com for any queries, or ping me on the following social media sites:

LinkedIn: @hjsblogger
Twitter: @hjsblogger

