youtube-data-scraper's Introduction

Project Overview

Data Scraper

This part of the project navigates YouTube to collect video data (thumbnail images and tabular data). Since YouTube's API greatly limits the number of queries per day, it is not practical to build a dataset with it. This data scraper bypasses the API by simulating a human user navigating the website, collecting data along the way. This makes the collection of a substantial dataset practical.


The data scraper navigating YouTube.


Query Scraper

The query scraper, built specifically for this project, collects YouTube auto-suggestions. The data scraper then enters these suggestions into the search bar to find videos.


The query scraper collecting search terms.


Project Description

Data Scraper

The data scraper is designed to be easy to update when YouTube's website changes. A single file (./utils/constants.py) stores the CSS/tag selectors for the HTML elements that the scraper interacts with. If YouTube renames an element or moves it on the page, updating its selector in this file restores the program to working order within a few minutes.
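As an illustration of this approach, a selectors module could look roughly like the sketch below. The dictionary name, the keys, the selector strings, and the `get_selector` helper are all hypothetical placeholders, not the actual contents of ./utils/constants.py.

```python
# Hypothetical sketch of a central selectors module.
# The keys and selector strings are illustrative placeholders only;
# they are not the project's real values.
SELECTORS = {
    "search_bar": "input#search",
    "video_title": "a#video-title",
    "thumbnail": "img.thumbnail",
}


def get_selector(name: str) -> str:
    """Return the CSS selector registered under `name`."""
    try:
        return SELECTORS[name]
    except KeyError:
        # Fail loudly so a missing selector is noticed immediately.
        raise KeyError(f"No selector registered for {name!r}") from None
```

With a layout like this, a change to YouTube's markup only requires editing one string in one place, since the rest of the scraper looks selectors up by name.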

Thumbnail images of videos collected are numbered and stored in ./data/images.

Tabular data collected are stored in ./data/non_image_scrape/data.csv. Each row of the file holds data from a single YouTube video. The n-th row of data corresponds to the n-th image in ./data/images/. The columns hold the following information:

  • videoTitle: The title of the video.
  • videoUrl: The URL of the video.
  • thumbnailUrl: The URL of the thumbnail of the video that was saved in the ./data/images/ directory.
  • numViews: The number of views the video had at the time it was scraped.
  • numLikes: The number of likes the video had at the time it was scraped.
  • channelName: The name of the channel that published the video.
  • totalChannelViews: The sum of the views on all the videos published by the channel that published the scraped video, at the time it was scraped.
  • numChannelSubscribers: The number of subscribers that the channel that published the scraped video had at the time the video was scraped.
  • videoTags: The video's tags (keywords that the publisher of the video associated with it to make it easier to find).
  • scrapeDate: The date the video was scraped.
  • uploadDate: The date the video was uploaded.
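The following sketch shows one way to read the tabular data back and number the rows so that row n can be matched to the n-th image. It assumes that data.csv begins with a header row using the column names listed above and that numbering starts at 1; neither is confirmed here. The `load_rows` helper is hypothetical.

```python
import csv
import io
from typing import Dict, List, Tuple

# Column names as listed in this README.
COLUMNS = [
    "videoTitle", "videoUrl", "thumbnailUrl", "numViews", "numLikes",
    "channelName", "totalChannelViews", "numChannelSubscribers",
    "videoTags", "scrapeDate", "uploadDate",
]


def load_rows(csv_file) -> List[Tuple[int, Dict[str, str]]]:
    """Read rows from an open CSV file and pair each with its row number.

    Assumes a header row is present and that numbering starts at 1
    (both are assumptions, not documented behaviour).
    """
    reader = csv.DictReader(csv_file)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return list(enumerate(reader, start=1))
```

To use it on the real file, open ./data/non_image_scrape/data.csv with `open(path, newline="", encoding="utf-8")` and pass the file object to `load_rows`. All values are returned as strings, so numeric columns such as numViews need converting before analysis.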

Query Scraper

The query scraper generates queries for the data scraper to use, by following these steps:

  1. Input a frequently used English word (read from a file) into YouTube's search bar.
  2. Wait for auto-suggestions to load.
  3. Scrape the auto-suggestions.
  4. Save the scraped data to a file.

Scraped search terms are stored in ./data/non_image_scrape/queries.txt. Once this file contains the number of queries requested by the user, the queries are processed, filtered, and saved to ./data/non_image_scrape/queries_cleaned.txt (the file that the data scraper uses), and the program exits.
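The exact processing and filtering rules are not described here. As a rough sketch of what such a cleaning pass might involve, the hypothetical `clean_queries` function below normalizes whitespace and case, drops empty lines, and removes duplicates while preserving order. These rules are assumptions chosen for illustration, not the project's actual logic.

```python
from typing import Iterable, List


def clean_queries(raw_lines: Iterable[str]) -> List[str]:
    """Normalize and de-duplicate raw search terms (illustrative rules only)."""
    seen = set()
    cleaned = []
    for line in raw_lines:
        # Collapse runs of whitespace, trim the ends, and lowercase.
        query = " ".join(line.split()).lower()
        # Skip blank lines and terms that were already kept.
        if not query or query in seen:
            continue
        seen.add(query)
        cleaned.append(query)
    return cleaned
```

For example, the lines "  How to Cook ", "how to cook", and "python  tutorial" would reduce to two queries: "how to cook" and "python tutorial".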


Installation

  1. Clone this repository:

    $ git clone https://github.com/RaphaelCaloz/youtube-data-scraper.git
    
  2. Install the required Python libraries:

    $ pip install -r requirements.txt
    
  3. Replace chromedriver.exe in the project root directory with the ChromeDriver version that matches your Google Chrome version. It can be downloaded here: https://chromedriver.chromium.org/downloads


How to Run

  1. Run scrape_queries.py. Enter the number of queries you would like to scrape in the terminal.

  2. Run scrape_data.py.

Note: Running either of these Python files opens a Google Chrome browser window. At any time, the window can be closed to stop the Python program without losing scraped data/queries. Alternatively, press Ctrl+C in the terminal to exit the program safely.
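One common way to make Ctrl+C safe is to write each result to disk as soon as it is collected and to catch `KeyboardInterrupt` around the main loop. The sketch below illustrates that general pattern. It is not taken from the project's code, and `scrape_all` and `scrape_one` are hypothetical names.

```python
from typing import Callable, Iterable, TextIO


def scrape_all(
    queries: Iterable[str],
    scrape_one: Callable[[str], str],
    out: TextIO,
) -> int:
    """Scrape each query and write the result immediately.

    Returns the number of results saved before finishing or being interrupted.
    """
    saved = 0
    try:
        for query in queries:
            result = scrape_one(query)
            out.write(result + "\n")
            out.flush()  # Push the line out right away so an interrupt loses nothing.
            saved += 1
    except KeyboardInterrupt:
        # Ctrl+C: stop quietly; everything written so far is kept.
        pass
    return saved
```

Because each line is flushed before the next query starts, an interruption can lose at most the item that was in progress.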
