rippinrobr / baseball-stats-db

Tools to help create and maintain databases based on the Baseball Databank Files

License: Apache License 2.0

Makefile 5.44% Go 94.56%

baseball-databank baseball-statistics baseball-players baseball-data

baseball-stats-db's People

dtpoole cbwinslow

baseball-stats-db's Issues

Parallelize the DB Loading

Currently the databases are loaded serially: the files are read and parsed multiple times, and the data is loaded into each db type one at a time. The new process should:

Read the files once
Pass the data to each DB type's worker, which is responsible for loading it into that DB
Add a verbose flag so progress is printed the way it is now only when I need it
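The read-once, fan-out design described above could be sketched as follows. This is only an illustration under stated assumptions: `Record`, `Loader`, `memLoader`, and `FanOut` are hypothetical names, not types from this repository, and the stand-in loader only counts rows instead of talking to a database.

```go
package main

import (
	"fmt"
	"sync"
)

// Record is one parsed CSV row. The shape is hypothetical.
type Record struct {
	File   string
	Fields []string
}

// Loader is a hypothetical interface that each database type would implement.
type Loader interface {
	Name() string
	Load(r Record) error
}

// memLoader is a stand-in loader that only counts the rows it receives.
type memLoader struct {
	name string
	rows int
}

func (m *memLoader) Name() string        { return m.name }
func (m *memLoader) Load(r Record) error { m.rows++; return nil }

// FanOut sends every record to every loader. Each loader has its own
// buffered channel and goroutine, so the input is walked exactly once.
// When verbose is false, per-row progress output is suppressed.
func FanOut(records []Record, loaders []Loader, verbose bool) {
	var wg sync.WaitGroup
	chans := make([]chan Record, len(loaders))
	for i, l := range loaders {
		chans[i] = make(chan Record, 64)
		wg.Add(1)
		go func(l Loader, in <-chan Record) {
			defer wg.Done()
			for r := range in {
				if err := l.Load(r); err != nil {
					fmt.Printf("%s: %v\n", l.Name(), err)
				}
			}
		}(l, chans[i])
	}
	for _, r := range records {
		for _, c := range chans {
			c <- r // blocks only when that worker's buffer is full
		}
		if verbose {
			fmt.Printf("dispatched row from %s\n", r.File)
		}
	}
	for _, c := range chans {
		close(c)
	}
	wg.Wait()
}

func main() {
	sqlite := &memLoader{name: "sqlite"}
	pg := &memLoader{name: "postgres"}
	recs := []Record{
		{File: "People.csv", Fields: []string{"a"}},
		{File: "People.csv", Fields: []string{"b"}},
	}
	FanOut(recs, []Loader{sqlite, pg}, false)
	fmt.Println(sqlite.rows, pg.rows)
}
```

One channel per worker keeps the database types independent of each other, at the cost of holding up to one buffer of rows per database in memory.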

Update Makefile to make the new release structure

Support the tar files with backups and schemas, organized by database
Streamline the Makefile so there is less duplication

Add Chadwick Bureau's Register data to the databases

The Register lists everyone who has played baseball, along with others who have had some involvement with the game.

https://github.com/chadwickbureau/register

Move DB loader util to sports-stats-utilities project

This project should only be backups, schemas and the sqlite db, no binaries. I will move it to https://github.com/rippinrobr/sports-stats-utilities

Loading 2017 Baseball Databank `FieldingPost.csv` throws an error

There was an error while attempting to parse and storethe file /Users/robertrowe/src/baseballdatabank/core/FieldingPost.csv Error: line 105, column 9: strconv.ParseInt: parsing "33.0": invalid syntax

Column 9 is InnOuts which should be an int
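One possible way to tolerate values such as "33.0" in an integer column is to fall back to `strconv.ParseFloat` when `strconv.ParseInt` fails, and accept the value only if it is a whole number. The helper name `ParseIntLenient` is hypothetical; this is a sketch of the idea, not the loader's actual code.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// ParseIntLenient is a hypothetical helper. It parses s as an integer and
// also accepts float-formatted whole numbers such as "33.0". Values with a
// fractional part (for example "33.5") are rejected with an error.
func ParseIntLenient(s string) (int64, error) {
	s = strings.TrimSpace(s)
	if n, err := strconv.ParseInt(s, 10, 64); err == nil {
		return n, nil
	}
	f, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, err
	}
	// NaN fails the Trunc comparison; infinities are rejected explicitly.
	if f != math.Trunc(f) || math.IsInf(f, 0) {
		return 0, fmt.Errorf("parsing %q: not a whole number", s)
	}
	return int64(f), nil
}

func main() {
	for _, s := range []string{"33", "33.0", "33.5"} {
		n, err := ParseIntLenient(s)
		fmt.Println(s, n, err)
	}
}
```

Rejecting fractional values rather than rounding them keeps genuinely bad data visible instead of silently changing it.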

Change Release Artifacts to be packaged up by database type

Instead of one big release file, I want a tgz file for each of the db types

Build in Docker Environment

Hey Rob, great work on this!

I've been trying to make releases locally, but I'm having a few issues.

I'm new to Go, and had a bit of trouble getting everything installed properly (`go get ./internal/... ./cmd/databank-dbloader/... ./cmd/retrosched-dbloader/... ./cmd/retrogl-dbloader/...` finally got me where I needed to be).

Additionally, I've had to alter -inputdir arguments in the Makefile to utilize environment variables so I can customize where data sources live on my machine.

I still haven't successfully compiled a release yet, but the further I get into this, I'm thinking that Dockerizing this tool might be useful for getting anyone up and running quickly. This would allow for a portable Go installation, -inputdir values could all be relative to installation paths within the Docker container, and db types could even have their own Docker image (sqlite, postgres, mysql, mongodb).

I'm willing to put some time into this and work on a PR, but wanted to get your thoughts first.

Restructure the project

Since I am now using my csv-to project to load the data, I want this project to contain just the schema files, database backups, and docker files. The proposed structure is:

/baseballdatabank
           /schemas 
           /backups
           /dockerfiles
/retrosheet 
           /schemas
           /backups
           /dockerfiles

Any new data sources I add would have a similar structure. Each supported database will have its own schema and dockerfile.

Move Go code into its own repository

I want this repo to only be concerned with schemas, database backups, and perhaps helpful scripts

Truncate calls in postgres throw errors

Truncate calls are erring out due to the foreign key relationships.
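In PostgreSQL, `TRUNCATE` refuses to empty a table that is referenced by a foreign key from a table outside the same command, unless `CASCADE` is given. A possible workaround is to truncate all of the tables in one statement. The sketch below only builds the SQL text; `TruncateStmt` is a hypothetical helper, and the table names are assumed to come from a fixed, trusted list because they are not quoted or escaped.

```go
package main

import (
	"fmt"
	"strings"
)

// TruncateStmt is a hypothetical helper that builds one PostgreSQL TRUNCATE
// covering every table in the list. CASCADE also empties any table that
// references these through a foreign key, and RESTART IDENTITY resets the
// sequences owned by the truncated columns. Table names are not escaped, so
// they must come from a trusted, hard-coded list.
func TruncateStmt(tables []string) string {
	if len(tables) == 0 {
		return ""
	}
	return "TRUNCATE TABLE " + strings.Join(tables, ", ") + " RESTART IDENTITY CASCADE"
}

func main() {
	// Example table names only; the real schema may differ.
	fmt.Println(TruncateStmt([]string{"batting", "people"}))
}
```

Because `CASCADE` can empty tables that are not in the list, it is worth checking which tables reference the ones being truncated before relying on it.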

Create Task in Makefile to load and backup DBs

Need to automate the following processes:

Update the SQLite, Postgres and MongoDB databases
Back up the Postgres and MongoDB databases
Tar the backups and remove the uncompressed backups

Add Ability to pass in more than one db from the command line

The first step toward processing the data in parallel is to allow more than one db to be loaded per run, plus the ability to pass in all as an option. Only the code around db connections needs to change.

relates to #23
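Accepting several db types in one argument, with `all` as a shortcut, might look like the sketch below. The `supported` list and the `ExpandDBs` name are assumptions made for illustration; the list simply reuses the database types mentioned elsewhere in these issues.

```go
package main

import (
	"fmt"
	"strings"
)

// supported is an assumed list of db types, taken from the ones named in
// these issues. The real loader may support a different set.
var supported = []string{"sqlite", "postgres", "mysql", "mongodb"}

// ExpandDBs is a hypothetical helper. It turns a comma-delimited argument
// such as "sqlite,postgres" into a de-duplicated list of db types, and
// expands the special value "all" to every supported type.
func ExpandDBs(arg string) ([]string, error) {
	known := make(map[string]bool, len(supported))
	for _, s := range supported {
		known[s] = true
	}
	seen := make(map[string]bool)
	var out []string
	for _, name := range strings.Split(arg, ",") {
		name = strings.ToLower(strings.TrimSpace(name))
		switch {
		case name == "":
			continue
		case name == "all":
			return append([]string(nil), supported...), nil
		case !known[name]:
			return nil, fmt.Errorf("unsupported db type %q", name)
		case !seen[name]:
			seen[name] = true
			out = append(out, name)
		}
	}
	if len(out) == 0 {
		return nil, fmt.Errorf("no db types given")
	}
	return out, nil
}

func main() {
	dbs, err := ExpandDBs("sqlite, postgres")
	fmt.Println(dbs, err)
}
```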

Loading 2017 Baseball Databank `FieldingOFsplit.csv` throws an error

There was an error while attempting to parse and storethe file /Users/robertrowe/src/baseballdatabank/core/FieldingOFsplit.csv Error: line 25848, column 10: strconv.ParseInt: parsing "29.0": invalid syntax

Column 10 is the putouts column which is an int and not a float.

Add --input-files cmd line arg

A comma delimited list of file names to be parsed and stored
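With Go's standard `flag` package, a comma-delimited argument like this can be handled by a small type that implements `flag.Value`. The `fileList` type, the `ParseArgs` function, and the flag set name `dbloader` are hypothetical; only the `--input-files` name comes from the issue.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// fileList is a hypothetical flag.Value that collects comma-delimited names.
type fileList []string

func (f *fileList) String() string { return strings.Join(*f, ",") }

// Set splits the flag value on commas and drops empty entries.
func (f *fileList) Set(v string) error {
	*f = nil
	for _, name := range strings.Split(v, ",") {
		if name = strings.TrimSpace(name); name != "" {
			*f = append(*f, name)
		}
	}
	if len(*f) == 0 {
		return fmt.Errorf("no file names in %q", v)
	}
	return nil
}

// ParseArgs parses the command-line arguments and returns the file names
// given through --input-files (the flag package also accepts -input-files).
func ParseArgs(args []string) ([]string, error) {
	fs := flag.NewFlagSet("dbloader", flag.ContinueOnError)
	var files fileList
	fs.Var(&files, "input-files", "comma-delimited list of file names to parse and store")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return files, nil
}

func main() {
	files, err := ParseArgs([]string{"--input-files", "Teams.csv,FieldingPost.csv"})
	fmt.Println(files, err)
}
```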

Create a friendlier error when schema isn't present.

If someone runs the db_loader without running the schema scripts first they receive an error similar to this:

2018/06/10 04:48:21 Insert error: no such table: people
[followed by 19477 more "Insert error: no such table: people" errors]

I need to write out an error like the ones the Rust compiler creates: something that prints out the error above, followed by "Have you loaded the schema file for your database? Schemas can be found in the 'schemas' directory or in the release files."

latest baseball databank release doesn't include 2018 stats

Just downloaded postgres_databank_backup_2018.01.tgz from the releases page, and ran the contained SQL. The latest yearID I was able to find in the batting and pitching tables is 2017. The post on the releases page says it had been updated to include 2018 stats. Any chance you can update the release to include 2018 stats?

Thanks.

Update Schema files

Schema only files haven't been updated for a bit. Need to get them updated.

Loading 2017 baseball databank Teams.csv file throws an error

2018/03/30 10:30:20 There was an error while attempting to parse and storethe file /Users/robertrowe/src/baseballdatabank/core/Teams.csv Error: line 11, column 23: strconv.ParseInt: parsing "53.0": invalid syntax

That is the SB column, which should be an integer; there is no way to steal a fraction of a base.