Project Overview

Data Scraper

This part of the project navigates YouTube to collect video data (thumbnail images and tabular data). Since YouTube's API greatly limits the number of queries per day, it is not practical to build a dataset with it. This data scraper bypasses the API by simulating a human user navigating the website, collecting data along the way. This makes the collection of a substantial dataset practical.

The data scraper navigating YouTube.

Query Scraper

The query scraper, which was specifically built for this project, collects YouTube auto-suggestions, which the data scraper inputs into the search bar to find videos.

The query scraper collecting search terms.

Project Description

Data Scraper

The data scraper is designed to be easy to update when YouTube's website changes. A single file (./utils/constants.py) stores the CSS/tag selectors for the HTML elements that the scraper interacts with. If YouTube renames an element or moves it on the page, updating its selector in this file restores the program to working order within a few minutes.
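
To illustrate the idea of keeping every selector in one place, here is a minimal sketch of what such a constants file could look like. The names and selector strings below are hypothetical placeholders, not the actual contents of ./utils/constants.py, and they are not guaranteed to match YouTube's current markup.

```python
# Hypothetical sketch of a central selector file. Every name and selector
# string here is an illustrative placeholder, not the project's real value.
SELECTORS = {
    "search_bar": "input#search",        # assumed CSS selector
    "video_title": "a#video-title",      # assumed CSS selector
    "thumbnail": "img",                  # assumed tag selector
}


def get_selector(name: str) -> str:
    """Look up a selector by name, failing loudly if it is missing."""
    try:
        return SELECTORS[name]
    except KeyError:
        raise KeyError(f"No selector named {name!r}; add it to SELECTORS") from None
```

With this layout, a change on YouTube's side only requires editing one string in `SELECTORS`; the scraping code keeps calling `get_selector("search_bar")` unchanged.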

Thumbnail images of videos collected are numbered and stored in ./data/images.

Tabular data collected are stored in ./data/non_image_scrape/data.csv. Each row of the file holds data from a single YouTube video. The n-th row of data corresponds to the n-th image in ./data/images/. The columns hold the following information:

videoTitle: The title of the video.
videoUrl: The URL of the video.
thumbnailUrl: The URL of the video's thumbnail, a copy of which is saved in the ./data/images/ directory.
numViews: The number of views the video had at the time it was scraped.
numLikes: The number of likes the video had at the time it was scraped.
channelName: The name of the channel that published the video.
totalChannelViews: The total number of views across all videos published by the video's channel, at the time the video was scraped.
numChannelSubscribers: The number of subscribers the video's channel had at the time the video was scraped.
videoTags: The video's tags (keywords that the publisher of the video associated with it to make it easier to find).
scrapeDate: The date the video was scraped.
uploadDate: The date the video was uploaded.
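
The row-to-image correspondence described above can be used to load both parts of the dataset together. The following is a rough sketch, assuming that data.csv starts with a header row containing the column names listed above and that image files are named with plain integers (for example 1.jpg, 2.jpg); neither assumption is confirmed by this README. The function name `pair_rows_with_images` is hypothetical.

```python
import csv
from pathlib import Path


def pair_rows_with_images(csv_path, image_dir):
    """Return (row, image_path) pairs, matching the n-th row to the n-th image."""
    # Assumption: the first line of the CSV is a header with the column names.
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Assumption: image file names are integers plus an extension (e.g. 12.jpg).
    # Sorting numerically avoids the "10.jpg before 2.jpg" string-sort trap.
    images = sorted(
        (p for p in Path(image_dir).iterdir() if p.stem.isdigit()),
        key=lambda p: int(p.stem),
    )

    if len(rows) != len(images):
        raise ValueError(f"{len(rows)} rows but {len(images)} images")
    return list(zip(rows, images))
```

Called as `pair_rows_with_images("./data/non_image_scrape/data.csv", "./data/images")`, this would return a list in which each CSV row (as a dictionary keyed by column name) sits next to the path of its thumbnail.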

Query Scraper

The query scraper generates queries for the data scraper to use, by following these steps:

Input a frequently used English word (read from a file) into YouTube's search bar.
Wait for auto-suggestions to load.
Scrape the auto-suggestions.
Save the scraped data to a file.
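
The steps above can be sketched as a simple loop. This is not the project's actual code: `collect_queries` and its parameters are hypothetical names, and the browser interaction (typing the word, waiting, and reading the suggestions) is hidden behind a `fetch_suggestions` callable so that the control flow can be shown without depending on a particular browser automation API.

```python
from typing import Callable, Iterable, List


def collect_queries(
    seed_words: Iterable[str],
    fetch_suggestions: Callable[[str], List[str]],
    output_path: str,
    target_count: int,
) -> List[str]:
    """Gather auto-suggestions for each seed word until target_count is reached."""
    collected: List[str] = []
    # Append mode, so suggestions saved by an earlier run are kept.
    with open(output_path, "a", encoding="utf-8") as out:
        for word in seed_words:
            # Steps 1-3: type the word, wait, and scrape the suggestions.
            # The callable stands in for the real browser interaction.
            suggestions = fetch_suggestions(word)
            # Step 4: save each suggestion on its own line.
            for suggestion in suggestions:
                out.write(suggestion + "\n")
                collected.append(suggestion)
            out.flush()  # keep the file current in case the run is interrupted
            if len(collected) >= target_count:
                break
    return collected
```

Injecting `fetch_suggestions` also makes the loop easy to exercise with a fake function that returns canned suggestions, which is useful when YouTube itself is not reachable.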

Scraped search terms are stored in ./data/non_image_scrape/queries.txt. Once this file contains the number of queries requested by the user, the queries are processed, filtered, and saved to ./data/non_image_scrape/queries_cleaned.txt (the file that the data scraper uses), and the program exits.
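
This README does not spell out what the processing and filtering consist of. As one plausible interpretation, the sketch below normalizes whitespace and letter case, then drops empty lines and duplicates while preserving order. The function name `clean_queries` and these specific rules are assumptions, not the project's documented behavior.

```python
from typing import Iterable, List


def clean_queries(raw_queries: Iterable[str]) -> List[str]:
    """Normalize queries and remove blanks and duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for query in raw_queries:
        # Assumed rule: collapse runs of whitespace and lowercase the text.
        normalized = " ".join(query.split()).lower()
        # Assumed rule: skip empty lines and anything already kept.
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned
```

For example, `clean_queries(["  Cat  videos", "cat videos", "", "Dog"])` returns `["cat videos", "dog"]`.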

Installation

Clone this repository:

$ git clone https://github.com/RaphaelCaloz/youtube-data-scraper.git

Install the required Python libraries:
```
$ pip install -r requirements.txt
```
Replace chromedriver.exe in the project root directory with the ChromeDriver version that matches your installed Google Chrome version. It can be downloaded here: https://chromedriver.chromium.org/downloads
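
To check whether two versions match, it can help to compare the version strings programmatically. The sketch below is not part of this project. It assumes four-part version numbers such as 114.0.5735.90 and treats two versions as matching when their major, minor, and build numbers agree, which is my understanding of ChromeDriver's version selection guidance; the function names are hypothetical.

```python
import re
from typing import Tuple


def version_prefix(version_text: str) -> Tuple[int, int, int]:
    """Extract (major, minor, build) from text containing a four-part version."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError(f"No version number found in {version_text!r}")
    major, minor, build = (int(part) for part in match.groups())
    return (major, minor, build)


def versions_match(chrome_text: str, driver_text: str) -> bool:
    """Return True when the first three version components are identical."""
    # Assumption: the final (patch) component is allowed to differ.
    return version_prefix(chrome_text) == version_prefix(driver_text)
```

The inputs can be the version shown on Chrome's "About" page and the text printed by running the ChromeDriver executable with `--version`. With illustrative values, `versions_match("Google Chrome 114.0.5735.198", "ChromeDriver 114.0.5735.90")` returns `True`.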

How to Run

Run scrape_queries.py. Enter the number of queries you would like to scrape in the terminal.
Run scrape_data.py.

Note: Running either of these Python files opens a Google Chrome browser window. The window can be closed at any time to stop the program without losing scraped data or queries. Alternatively, press Ctrl+C in the terminal to exit the program safely.
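
One common way to make Ctrl+C safe in Python is to catch `KeyboardInterrupt` and save progress in a `finally` block. The sketch below shows that general pattern only; it is not taken from this project, and `scrape_all`, `scrape_one`, and `save` are hypothetical names.

```python
from typing import Any, Callable, Iterable, List


def scrape_all(
    items: Iterable[Any],
    scrape_one: Callable[[Any], Any],
    save: Callable[[List[Any]], None],
) -> List[Any]:
    """Scrape each item, saving whatever was collected even if interrupted."""
    results: List[Any] = []
    try:
        for item in items:
            results.append(scrape_one(item))
    except KeyboardInterrupt:
        # Ctrl+C raises KeyboardInterrupt in the main thread.
        print("Interrupted; saving progress before exiting.")
    finally:
        # Runs on normal completion, on Ctrl+C, and on other errors.
        save(results)
    return results
```

Because the save step sits in `finally`, results gathered before the interruption are written out regardless of how the loop ends.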
