sp500-scraper

Constituent history of the S&P 500 from various data sources.

Usage

Each data source is queried once a day on weekdays. The data is saved in two formats: CSV and Parquet. Generally, each run is saved under a file name that follows the YYYYMMDD convention. I prefer to interact with this data using arrow.
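Because the run date is carried in the file name, snapshots can be ordered without opening them. A minimal Python sketch, assuming the file stem is exactly YYYYMMDD; the file names below are hypothetical and only illustrate the convention:

```python
from datetime import date, datetime
from pathlib import Path


def snapshot_date(path: str) -> date:
    """Parse the YYYYMMDD stem of a snapshot file name into a date."""
    stem = Path(path).name.split(".", 1)[0]  # drop the directory and extension
    return datetime.strptime(stem, "%Y%m%d").date()


# Hypothetical file names, used only for illustration.
files = ["parquet/20221109.parquet", "parquet/20221108.parquet"]

# Pick the most recent snapshot by its encoded date.
latest = max(files, key=snapshot_date)
print(latest, snapshot_date(latest))
```

Since the convention only holds "generally", a file that does not follow it makes `strptime` raise `ValueError`; callers may want to catch that and skip the file.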

First clone the repo:

git clone https://github.com/riazarbi/sp500-scraper

Then, in R (adapt to your language):

library(arrow)
open_dataset("sp500-scraper/wikipedia/sp500/parquet", unify_schemas = TRUE)

Data Source Notes

How far back this data goes depends on how far back I could parse data from websites.

iShares

STATUS: working
FIRST DATE: 2006-10-31

The iShares symbols sometimes differ from SEC tickers (see, for example, Visa Corp in each dataset). This source does include the ISIN, though.

Wikipedia

STATUS: working
FIRST DATE: 2007-03-07

Seems to conform to SEC tickers. The data structure has evolved over time; I've kept all the columns but tried to rename them so they line up as closely as possible. CIK was added on 2014-05-12, which makes symbol joining much easier.
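To illustrate why a CIK column helps with joining, here is a minimal Python sketch that matches rows from two sources on CIK rather than on symbol. The column names (`symbol`, `ticker`, `cik`) and the rows are made up for the example and are not taken from the actual files. The sketch also assumes a CIK may appear either as an integer or as a zero-padded string, so it normalizes both to 10 digits first.

```python
def normalize_cik(value) -> str:
    """Render a CIK as a 10-digit, zero-padded string."""
    return str(int(value)).zfill(10)


def join_on_cik(left, right):
    """Inner-join two lists of row dicts on their normalized CIK."""
    right_by_cik = {normalize_cik(r["cik"]): r for r in right}
    joined = []
    for row in left:
        key = normalize_cik(row["cik"])
        match = right_by_cik.get(key)
        if match is not None:
            # Merge both rows and store the CIK in its normalized form.
            joined.append({**match, **row, "cik": key})
    return joined


# Made-up rows: the symbols differ in punctuation, but the CIK lines up.
wiki_rows = [
    {"symbol": "ABC.A", "cik": "0000001234"},
    {"symbol": "XYZ", "cik": "0000005678"},
]
sec_rows = [{"ticker": "ABC-A", "cik": 1234}]

joined = join_on_cik(wiki_rows, sec_rows)
print(joined)
```

Rows dated before 2014-05-12 have no CIK, so they would still need a symbol-based match.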

Data before 2022-11-01 was collected by traversing the Wikipedia revision history. We pulled all the revisions and omitted any that changed a large percentage of symbols or that were themselves overwritten within 12 hours. We hope this eliminates spurious changes.

Data after 2022-11-01 is parsed from HTML tables; data before 2022-11-01 is parsed from wikitext.
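The two revision filters can be sketched in Python as follows. This is not the code used to build the dataset. The 25% threshold, the comparison against the last kept revision, and the reading of "overwritten" as "followed by another revision" are all assumptions made for the illustration.

```python
from datetime import datetime, timedelta


def filter_revisions(revisions, max_change=0.25, min_lifetime=timedelta(hours=12)):
    """Filter (timestamp, symbol_set) pairs that are sorted by timestamp."""
    kept = []
    for i, (ts, symbols) in enumerate(revisions):
        # Drop revisions that were replaced by a newer one within min_lifetime.
        if i + 1 < len(revisions) and revisions[i + 1][0] - ts < min_lifetime:
            continue
        # Drop revisions that change too many symbols versus the last kept one.
        if kept:
            previous = kept[-1][1]
            changed = len(symbols ^ previous) / max(len(previous), 1)
            if changed > max_change:
                continue
        kept.append((ts, symbols))
    return kept


# Made-up revision history over ten placeholder symbols.
base = {f"S{i}" for i in range(10)}
revisions = [
    (datetime(2020, 1, 1, 0), base),
    (datetime(2020, 1, 2, 0), {"X"}),    # reverted one hour later
    (datetime(2020, 1, 2, 1), base),
    (datetime(2020, 1, 5, 0), {"S0"}),   # removes 90% of the symbols
    (datetime(2020, 1, 10, 0), (base - {"S9"}) | {"T0"}),  # one genuine swap
]

kept = filter_revisions(revisions)
print([ts.isoformat() for ts, _ in kept])
```

In this example the short-lived edit and the mass deletion are dropped, while the single-symbol swap survives because it changes only 2 of 10 symbols.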

Tidyquant

STATUS: working
FIRST DATE: 2022-11-08

We simply run tidyquant::tq_index("SP500") and save the result to file. It contains the SEDOL and CUSIP numbers.

Useful links

SEC ticker <-> CIK mappings (clean, no completeness guarantee): https://www.sec.gov/files/company_tickers.json
All CIK matched to entity name (messy, complete): https://www.sec.gov/Archives/edgar/cik-lookup-data.txt
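A minimal Python sketch of turning the ticker mapping into a lookup table. It assumes the JSON is an object whose values each carry `cik_str`, `ticker` and `title` fields; check that against the live file before relying on it. The sample payload is made up and embedded so the example runs offline.

```python
import json

# Made-up payload that mimics the assumed layout of company_tickers.json.
sample = """
{"0": {"cik_str": 1234, "ticker": "ABC", "title": "Example Corp"},
 "1": {"cik_str": 5678, "ticker": "XYZ", "title": "Sample Inc"}}
"""


def ticker_to_cik(payload: str) -> dict:
    """Map each upper-cased ticker to a 10-digit, zero-padded CIK string."""
    records = json.loads(payload).values()
    return {r["ticker"].upper(): str(r["cik_str"]).zfill(10) for r in records}


mapping = ticker_to_cik(sample)
print(mapping["ABC"])
```

Because the file comes with no completeness guarantee, a missing ticker should be treated as "unknown" rather than "not an SEC registrant".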
