rtamiri / tedscraper

This project forked from corralm/ted-scraper

💬 Scrape TED talk data including transcripts in over 100 languages from TED.com

License: MIT License

NOTE: This script no longer works after some changes were made on TED.com. I may fix it eventually...

TEDscraper

Scrape TED talk data including transcripts in over 100 languages from TED.com

Requirements

Python 3
Beautiful Soup 4
fake-useragent
lxml
Pandas
Requests

Usage

# move to TEDscraper directory
# import module (or use Jupyter Notebook)
from TEDscraper import TEDscraper

# instantiate the scraper & pass in optional arguments
scraper = TEDscraper(lang_code='en', urls='all', topics='all')

# scrape the data and save it to a dictionary
ted_dict = scraper.get_data()

# transform the dictionary to a sorted pandas DataFrame
df = scraper.to_dataframe(ted_dict)

# output DataFrame as CSV
df.to_csv('../data/ted_talks.csv', index=False)

pandas can write other output formats as well, such as df.to_json() or df.to_excel(); see the I/O section of the pandas docs for the full list.

Parameters

  • lang_code
    • English is the default language lang_code='en'
    • You can pass in other language codes using the lang_code param
    • TED translators don't always translate all features
      • Ex: Title and 'About Speaker' might be in English while the transcript is translated to French
  • urls
    • All urls are scraped by default for the selected language urls='all'
    • You may pass in a list of urls. However, there are a few limitations:
      • TED must have the talks available in the language you specify
      • Only one language can be provided per scrape call
  • topics
    • All topics are scraped by default topics='all'
    • You may pass in a list of topics to filter by them
  • force_fetch
    • Talks with known issues are skipped by default force_fetch=False
    • Set it to True (the boolean, not a string) to attempt to scrape them anyway
    • See talks with known issues
  • exclude_transcript
    • All features are scraped by default exclude_transcript=False
    • Set it to True (the boolean, not a string) to exclude the transcript

Attributes

Attribute       Description                                     Data Type
talk_id         Talk identification number provided by TED      int
title           Title of the talk                               string
speaker_1       First speaker in TED's speaker list             string
speakers        Speakers in the talk                            dictionary
occupations     *Occupations of the speakers                    dictionary
about_speakers  *Blurb about each speaker                       dictionary
views           Count of views                                  int
recorded_date   Date the talk was recorded                      string
published_date  Date the talk was published to TED.com          string
event           Event or medium in which the talk was given     string
native_lang     Language the talk was given in                  string
available_lang  All available languages (lang_code) for a talk  list
comments        Count of comments                               int
duration        Duration in seconds                             int
topics          Related tags or topics for the talk             list
related_talks   Related talks (key='talk_id', value='title')    dictionary
url             URL of the talk                                 string
description     Description of the talk                         string
transcript      Full transcript of the talk                     string

*The dictionary key maps to the speaker in 'speakers'.
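
As a rough illustration of how the shared keys tie the three speaker dictionaries together, the sketch below joins them into one record per speaker. Two assumptions are made here that the README does not confirm: that the keys are integers, and that the dictionaries come back as strings after a CSV round trip (which is why ast.literal_eval is used). The function name pair_speakers and the sample row are made up for illustration.

```python
import ast


def pair_speakers(speakers, occupations, about_speakers):
    """Join the three speaker dictionaries on their shared keys."""
    # After saving to CSV and reading back, dictionaries arrive as strings,
    # so parse any string input into a real dict first.
    parsed = [
        ast.literal_eval(value) if isinstance(value, str) else value
        for value in (speakers, occupations, about_speakers)
    ]
    names, jobs, blurbs = parsed
    return [
        {"speaker": name, "occupations": jobs.get(key), "about": blurbs.get(key)}
        for key, name in sorted(names.items())
    ]


# Made-up row for illustration only; this is not real scraped data.
row = {
    "speakers": "{0: 'Ada Example', 1: 'Bo Sample'}",
    "occupations": "{0: ['engineer'], 1: ['writer']}",
    "about_speakers": "{0: 'Ada builds things.'}",
}
paired = pair_speakers(row["speakers"], row["occupations"], row["about_speakers"])
```

A speaker with no entry in occupations or about_speakers gets None for that field instead of raising an error.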

Languages

TED talks have been subtitled in over 100 languages. Here are the top languages:

Code    Language
en      English
es      Spanish
pt-br   Portuguese (Brazilian)
fr      French
it      Italian
zh-cn   Chinese (simplified)
zh-tw   Chinese (traditional)
ko      Korean
ja      Japanese
tr      Turkish
ru      Russian
he      Hebrew

Here is a link to all language codes available as of May 2020.

You can see all the talks for each language at TED – Our Languages.
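
Since a talk can only be scraped in a language TED offers for it, one option is to check the available_lang attribute from an earlier scrape before requesting another language. The sketch below filters talk records by language code. The function talks_available_in is hypothetical, the records are made up, and the URLs are placeholders rather than real TED links.

```python
def talks_available_in(talks, lang_code):
    """Return the URLs of talks whose available_lang list includes lang_code."""
    wanted = lang_code.lower()
    return [
        talk["url"]
        for talk in talks
        # Compare case-insensitively; a missing list counts as "not available".
        if wanted in (code.lower() for code in talk.get("available_lang", []))
    ]


# Made-up records for illustration only; the URLs are placeholders.
talks = [
    {"url": "https://example.com/talk-a", "available_lang": ["en", "fr", "pt-br"]},
    {"url": "https://example.com/talk-b", "available_lang": ["en", "ja"]},
]
french_urls = talks_available_in(talks, "fr")
```

The resulting list could then be passed as the urls argument of a second scrape with lang_code='fr', assuming the scraper accepts those URLs.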

Meta

Author: Miguel Corral Jr.
Email: [email protected]
LinkedIn: https://www.linkedin.com/in/miguelcorraljr/
GitHub: https://github.com/corralm

Distributed under the MIT license. See LICENSE for more information.
