matsonj / nba-monte-carlo

Monte Carlo simulation of the NBA season, leveraging dbt, duckdb and evidence.dev

Home Page: https://www.mdsinabox.com

License: MIT License

Makefile 8.33% Dockerfile 3.76% Python 52.84% Svelte 4.75% PLpgSQL 30.33%

nba-monte-carlo's Introduction

Current progress: "Serverless BI"

The latest version of the project is available at mdsinabox.com. The website embraces the notion of "Serverless BI" - the pages are built asynchronously with open source software on commodity hardware and then pushed to a static site. The github action that automatically deploys the site upon PR can be found here.

MDS-in-a-box

This project serves as an end-to-end example of running the "Modern Data Stack" on a single node. The components are designed to be "hot swappable", using a Makefile to create clearly defined interfaces between discrete components in the stack. It runs in many environments with many visualization options. In addition, the data transformation documentation is self-hosted on GitHub Pages.

Many Environments

It runs practically anywhere, and has been tested in the environments below.

Tested configurations (Local, Docker, Devcontainer, and Docker in Devcontainer):

  • Windows (w/WSL) - Docker in Devcontainer is n/a
  • Mac (Ventura)
  • Linux (Ubuntu 20.04)

Beautiful serving layer

(screenshots of the serving layer omitted)

It can also be explored live at mdsinabox.com.

Getting Started

Building MDS-in-a-box in Github Codespaces

Want to try MDS-in-a-box right away? Create a Codespace:


You can run the project in the Codespace with the following command:

make build run

You will need to wait for the pipeline to run and the Evidence configuration to complete. The 4-core Codespace performed significantly better in testing, and is recommended for a better experience.

Once the build completes, you can access the Evidence dashboard by clicking the Open in Browser button on the Ports tab, and log in with the username "admin" and the password "password".

Codespaces also supports "Docker-in-docker", so you can run docker inside the codespace with the following command:

make docker-build docker-run-evidence

Building MDS-in-a-box in Windows

  1. Create your WSL environment. Open a PowerShell terminal running as an administrator and execute:
wsl --install
  • If this was the first time WSL has been installed, restart your machine.
  2. Open Ubuntu in your terminal and update your packages.
sudo apt-get update
  3. Install Python 3.
sudo apt-get install python3.9 python3-pip python3.9-venv
  4. Clone this repo.
mkdir my_projects
cd my_projects
git clone https://github.com/matsonj/nba-monte-carlo.git
# Go one folder level down into the folder that git just created
cd nba-monte-carlo
  5. Build the project.
make build run

Make sure to open up Evidence when prompted (default location is 127.0.0.1:8088). The username is "admin" and the password is "password".

Using Docker

You can build a docker container by running:

make docker-build

Then run the container using

make docker-run-evidence

These are both aliases defined in the Makefile:

docker-build:
	docker build -t mdsbox .

docker-run-evidence:
	docker run \
		--publish 8088:8088 \
		--env MDS_SCENARIOS=10000 \
		--env MDS_INCLUDE_ACTUALS=true \
		--env MDS_LATEST_RATINGS=true \
		--env MDS_ENABLE_EXPORT=true \
		--env ENVIRONMENT=docker \
		mdsbox make run serve

Notes on Design Choices

DuckDB as compute engine

Using DuckDB keeps install and config very simple - it's a single command and runs everywhere. It also, frankly, covers for the sin of building a Monte Carlo simulation in SQL - it would be quite slow without the kind of compute that DuckDB provides.
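For intuition, the core of such a simulation can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the project's actual dbt/SQL model: it assumes the standard Elo win-probability formula and ignores home-court advantage and in-run rating updates.

```python
import random


def elo_win_prob(elo_a: float, elo_b: float) -> float:
    """Probability that team A beats team B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def simulate_matchup(elo_a: float, elo_b: float,
                     scenarios: int = 10_000, seed: int = 42) -> float:
    """Fraction of simulated games that team A wins."""
    rng = random.Random(seed)
    p = elo_win_prob(elo_a, elo_b)
    wins = sum(1 for _ in range(scenarios) if rng.random() < p)
    return wins / scenarios
```

For two teams 200 Elo points apart, the simulated win rate converges toward elo_win_prob(1700, 1500) ≈ 0.76 as the scenario count grows - which is also why a SQL implementation needs a fast engine: the real model repeats this across every game of the season for thousands of scenarios.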

Postgres was also considered for this project, but running Postgres on the same node as the rest of the data stack is not a great pattern.

Using Parquet instead of a database

This project leverages Parquet in addition to the DuckDB database for file storage. This is experimental, and the implementation will evolve over time - especially as the DuckDB format continues to evolve and Iceberg/Delta support is added to DuckDB.

External Tables

dbt-duckdb supports external tables, which are parquet files exported to the data_catalog folder. This allows easier integration with Rill, for example, which can read the parquet files and transform them directly with its own DuckDB implementation.

What's next?

To-dos

  • clean up env vars + implement incremental builds
  • submit your PR or open an issue!

Source Data

The data contained within this project comes from Pro Football Reference, Sports Reference (CFB), Basketball Reference, and DraftKings.

nba-monte-carlo's People

Contributors

aaronsteers, alex-monahan, archiewood, charliermarsh, gregwdata, matsonj, mermelstein, mycaule, pedramnavid, richardscottoz, tayloramurphy, ud3sh


nba-monte-carlo's Issues

duckdb version issue on superset using docker

Environment: MacOS Ventura 13.4.1

Command: make docker-run-superset

Description: When I run the docker container for the superset visuals, I get the error shown below in the dashboard. I tried setting the duckdb-engine version to 0.7.1 in the meltano.yml file, but it still didn't work.
(error screenshot omitted)

IOError from DuckDB when run with Superset

First of all, this is a great project! However, I ran into a minor error on my local machine.

I cloned the project and ran it with make docker-run-superset. Everything went well until I visited the dashboard in Superset. Then I got the following error:

Error: (duckdb.IOException) IO Error: Trying to read a database file with version number 51, but we can only read version 43.
The database file was created with an newer version of DuckDB.

The storage of DuckDB is not yet stable; newer versions of DuckDB cannot read old database files and vice versa.
The storage will be stabilized when version 1.0 releases.

For now, we recommend that you load the database file in a supported version of DuckDB, and use the EXPORT DATABASE command followed by IMPORT DATABASE on the current version of DuckDB.

See the storage page for more information: https://duckdb.org/internals/storage
(Background on this error at: http://sqlalche.me/e/13/e3q8)

Definitely a backward incompatibility with DuckDB but I don't know what DuckDB version I should use.
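One way to find out which storage version wrote a given file, without trial-and-error across DuckDB builds, is to read the file's main header directly. The sketch below assumes the pre-1.0 storage layout described on the DuckDB storage page (an 8-byte checksum, the magic bytes DUCK, then a little-endian 8-byte storage version number); treat the offsets as an assumption and verify them against the storage docs for your file.

```python
import struct


def duckdb_storage_version(path: str) -> int:
    """Read the storage version from a DuckDB database file.

    Assumed pre-1.0 main-header layout: 8-byte checksum,
    b'DUCK' magic, then a little-endian uint64 storage version.
    """
    with open(path, "rb") as f:
        header = f.read(20)
    if len(header) < 20 or header[8:12] != b"DUCK":
        raise ValueError(f"{path} does not look like a DuckDB database file")
    return struct.unpack("<Q", header[12:20])[0]
```

A file reporting version 51 here matches the "version number 51" in the error above; the fix is still the EXPORT DATABASE / IMPORT DATABASE round-trip the message recommends, run from a DuckDB build new enough to open the file.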

Rework how parquet export is orchestrated.

Today there is a pretty janky parquet integration. What I think is a better way to do this is something like the following:

  1. use target-parquet in meltano, and remove target-duckdb. This allows a cleaner interface into duckdb, and also hedges against stability problems with the duckdb database format.
  2. add a dbt tag (?) to identify models to export.
  3. add a var to dbt_project.yml for export_to_parquet: true (or false) if you don't want to export.
  4. run the entire project inside of duckdb.
  5. run a macro to export files, either as a run-operation or via on-run-end. The macro would grab a list of all models with a certain tag, then loop through exporting them one at a time.

This allows for a much cleaner set of dbt models. It is also a clear hand-off into some other system to handle the parquet files. As an example, you could orchestrate this pretty cleanly with meltano invoking the run-operation after dbt build completes.
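The export macro in step 5 boils down to: collect the models carrying a given tag, then emit one DuckDB COPY ... (FORMAT PARQUET) statement per model. A pure-Python sketch of that loop is below; the model names, tag name, and output directory are hypothetical stand-ins, and the real version would be a dbt macro iterating over the project graph.

```python
def export_statements(models: dict[str, set[str]],
                      tag: str = "export",
                      out_dir: str = "data/data_catalog") -> list[str]:
    """Build one DuckDB COPY statement per model carrying `tag`.

    `models` maps model name -> set of dbt tags (an illustrative
    stand-in for walking the dbt graph inside a macro).
    """
    return [
        f"COPY {name} TO '{out_dir}/{name}.parquet' (FORMAT PARQUET)"
        for name, tags in sorted(models.items())
        if tag in tags
    ]
```

Calling export_statements({"reg_season_end": {"export"}, "scratch": {"tmp"}}) yields a single COPY statement for reg_season_end and skips the untagged model.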

automate ingestion

Currently I am updating by hand. Need to figure out a way to get the CSV sources updated w/ Python.

meltano pipeline error

I'm attempting to run on a Mac within pyenv (Python version 3.8.15). No problem running 'make build'. However, I'm running into an error running 'make pipeline'. See the Meltano logs below. It seems like it might relate to a Python 3.8 multiprocessing issue; multiprocessing is a dependency of target-parquet. Any tips?

Meltano Logs

2022-11-15T22:36:12.680915Z [info ] Performing full refresh, ignoring state left behind by any previous runs.
2022-11-15T22:36:13.985106Z [info ] INFO Using supplied catalog /Users//dev/meltano-projects/nba-monte-carlo/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:13.985599Z [info ] INFO Processing 4 selected streams from Catalog cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:13.985788Z [info ] INFO Syncing stream:nba_schedule_2023 cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:13.995274Z [info ] INFO Walking ./data. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:13.995459Z [info ] INFO Found 3 files. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:13.995631Z [info ] INFO Checking 3 resolved objects for any that match regular expression "nba_schedule_2023.csv" and were modified since 2001-01-01 00:00:00+00:00 cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:13.998637Z [info ] INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:13.998865Z [info ] INFO Syncing file "nba_schedule_2023.csv". cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.109560Z [info ] INFO Sending version information to singer.io. To disable sending anonymous usage data, set the config parameter "disable_collection" to true cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.110472Z [info ] INFO writing streams in separate folders cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.126474Z [info ] Traceback (most recent call last): cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.126637Z [info ] File "/Users//dev/meltano-projects/nba-monte-carlo/.meltano/loaders/target-parquet/venv/bin/target-parquet", line 8, in cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.126707Z [info ] sys.exit(main()) cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.126767Z [info ] File "/Users//dev/meltano-projects/nba-monte-carlo/.meltano/loaders/target-parquet/venv/lib/python3.8/site-packages/target_parquet/init.py", line 274, in main cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.126826Z [info ] state = persist_messages( cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.126882Z [info ] File "/Users//dev/meltano-projects/nba-monte-carlo/.meltano/loaders/target-parquet/venv/lib/python3.8/site-packages/target_parquet/init.py", line 225, in persist_messages cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.126936Z [info ] t2.start() cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.126989Z [info ] File "/Users//.pyenv/versions/3.8.15/lib/python3.8/multiprocessing/process.py", line 121, in start cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127063Z [info ] self._popen = self._Popen(self) cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127114Z [info ] File "/Users//.pyenv/versions/3.8.15/lib/python3.8/multiprocessing/context.py", line 224, in _Popen cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127167Z [info ] return _default_context.get_context().Process._Popen(process_obj) cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127219Z [info ] File "/Users//.pyenv/versions/3.8.15/lib/python3.8/multiprocessing/context.py", line 284, in _Popen cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127268Z [info ] return Popen(process_obj) cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127317Z [info ] File "/Users//.pyenv/versions/3.8.15/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in init cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127366Z [info ] super().init(process_obj) cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127415Z [info ] File "/Users//.pyenv/versions/3.8.15/lib/python3.8/multiprocessing/popen_fork.py", line 19, in init cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127466Z [info ] self._launch(process_obj) cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127514Z [info ] File "/Users//.pyenv/versions/3.8.15/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127565Z [info ] reduction.dump(process_obj, fp) cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127614Z [info ] File "/Users//.pyenv/versions/3.8.15/lib/python3.8/multiprocessing/reduction.py", line 60, in dump cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127663Z [info ] ForkingPickler(file, protocol).dump(obj) cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.127713Z [info ] AttributeError: Can't pickle local object 'persist_messages..consumer' cmd_type=elb consumer=True name=target-parquet producer=False stdio=stderr string_id=target-parquet
2022-11-15T22:36:14.142130Z [info ] INFO Wrote 1341 records for stream "nba_schedule_2023". cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.142359Z [info ] INFO Syncing stream:team_ratings cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.142598Z [info ] INFO Walking ./data. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.142795Z [info ] INFO Found 3 files. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.143263Z [info ] INFO Checking 3 resolved objects for any that match regular expression "team_ratings.csv" and were modified since 2001-01-01 00:00:00+00:00 cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.143363Z [info ] INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.143426Z [info ] INFO Syncing file "team_ratings.csv". cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.144764Z [info ] INFO Wrote 30 records for stream "team_ratings". cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.144874Z [info ] INFO Syncing stream:xf_series_to_seed cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.145051Z [info ] INFO Walking ./data. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.145115Z [info ] INFO Found 3 files. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.145438Z [info ] INFO Checking 3 resolved objects for any that match regular expression "xf_series_to_seed.csv" and were modified since 2001-01-01 00:00:00+00:00 cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.145510Z [info ] INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.145568Z [info ] INFO Syncing file "xf_series_to_seed.csv". cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.145835Z [info ] INFO Wrote 14 records for stream "xf_series_to_seed". cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.145935Z [info ] INFO Syncing stream:nba_elo_latest cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.146058Z [info ] INFO Assembled https://projects.fivethirtyeight.com/nba-model/nba_elo_latest.csv as the URL to a source file. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-11-15T22:36:14.337744Z [error ] Loader failed
Traceback (most recent call last):
File "/Users//.pyenv/versions/3.8.15/envs/meltano/lib/python3.8/site-packages/meltano/core/logging/output_logger.py", line 201, in redirect_logging
yield
File "/Users//.pyenv/versions/3.8.15/envs/meltano/lib/python3.8/site-packages/meltano/core/block/extract_load.py", line 457, in run
await self.run_with_job()
File "/Users//.pyenv/versions/3.8.15/envs/meltano/lib/python3.8/site-packages/meltano/core/block/extract_load.py", line 483, in run_with_job
await self.execute()
File "/Users//.pyenv/versions/3.8.15/envs/meltano/lib/python3.8/site-packages/meltano/core/block/extract_load.py", line 449, in execute
await manager.run()
File "/Users//.pyenv/versions/3.8.15/envs/meltano/lib/python3.8/site-packages/meltano/core/block/extract_load.py", line 647, in run
_check_exit_codes(
File "/Users//.pyenv/versions/3.8.15/envs/meltano/lib/python3.8/site-packages/meltano/core/block/extract_load.py", line 799, in _check_exit_codes
raise RunnerError("Loader failed", {PluginType.LOADERS: consumer_code})
meltano.core.runner.RunnerError: Loader failed
2022-11-15T22:36:14.338209Z [error ] Block run completed. block_type=ExtractLoadBlocks err=RunnerError('Loader failed') exit_codes={<PluginType.LOADERS: 'loaders'>: 1} set_number=0 success=False

Split docker actions based on visualization target

Now that Superset, Evidence, and Rill can be visualization options, I want to split the makefile to run the docker container based on which tool is handling the viz.

Superset works now with:

make docker-build
make docker-run-pipeline
make docker-run-superset

I want to continue the same pattern with Rill & Evidence, but I am stuck on how to get it to work. Debugging docker is hell!

make docker-build
make docker-run-pipeline
make docker-run-<viz_tool>

Persist parquet files from target-parquet

Since the meltano pipeline state is persisted as part of the container, running meltano run tap-spreadsheets-anywhere target-parquet in an existing container a second (or really, any subsequent) time will fail because the parquet files are written to /tmp. It looks like the logic inside the .devcontainer drops /tmp when the container is stopped.

The best thing to do here is re-route the data created by the meltano run to a directory that is a part of the repo, but added to .gitignore. This will keep both the meltano run state and the data "in sync" and allow better behavior when re-running the pipeline.
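The re-routing described above can be sketched as a small helper that creates a repo-local output directory and makes sure it is gitignored. The relative path data/data_catalog is an assumption borrowed from the external-tables section; point it at wherever the pipeline actually writes.

```python
from pathlib import Path


def ensure_output_dir(repo_root: Path, rel_dir: str = "data/data_catalog") -> Path:
    """Create a repo-local output dir and add it to .gitignore (idempotent).

    `rel_dir` is an assumed location; adjust to the pipeline's real target.
    """
    out = repo_root / rel_dir
    out.mkdir(parents=True, exist_ok=True)
    gitignore = repo_root / ".gitignore"
    entry = rel_dir.rstrip("/") + "/"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if entry not in lines:
        with gitignore.open("a") as f:
            f.write(entry + "\n")
    return out
```

Because the directory lives inside the repo, the meltano state and the data it describes stop drifting apart when the container is rebuilt.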

Superset Dashboard Files

I ran the entire workflow within a VSCode devcontainer. Cool stuff! Do you have the Superset dashboard definitions (which produce the images in the readme) uploaded to the repo?

Playoff calcs seem off

Have quite a few issues to sort out based on this.

  1. Division winners cannot be play-in eligible (Hawks/Heat), i.e. they get the 6 seed at worst.
  2. Need to apply tiebreakers.
  3. Would expect approx 50/50 East vs West odds, but instead it is more like 75/25. Need to investigate why this is, and whether it is a code issue.

Missing LICENSE

Would love to use this as a base for some of my own projects, can you add a license file? I'd recommend MIT or Apache

Add snapshotting to allow comparison between points in time.

Since ELO rating is re-forecasted after every game, it would be interesting to see how predicted outcomes change over time.

The chart below from 538 is an example of the kind of visual that is possible if we have snapshotting enabled.

(example 538 chart omitted)

It should be noted that this is where persistence comes into play since current models are NOT persisted (by design).
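The idea amounts to persisting each forecast run keyed by an as-of date, so later runs can be compared. A minimal pure-Python sketch follows; the store shape and field names are hypothetical, and in the project this would more likely be a snapshotted dbt model or a dated parquet export.

```python
from dataclasses import dataclass, field


@dataclass
class ForecastStore:
    """Append-only store of (as_of_date, team, win_probability) snapshots."""
    rows: list[tuple[str, str, float]] = field(default_factory=list)

    def snapshot(self, as_of: str, forecasts: dict[str, float]) -> None:
        """Record one run's forecasts under an ISO as-of date."""
        for team, prob in forecasts.items():
            self.rows.append((as_of, team, prob))

    def history(self, team: str) -> list[tuple[str, float]]:
        """Win probability over time for one team, ordered by as-of date."""
        return sorted((d, p) for d, t, p in self.rows if t == team)
```

Each pipeline run appends a dated snapshot instead of overwriting the model, which is exactly the data shape the 538-style chart needs.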

Rework external materializations to allow persistence

Currently, external materializations are basically a toy - they write parquet files into folders, but they only exist inside the environment during execution and then are flushed into the void. Furthermore, not all models that should be persisted this way are persisted!

Options:

  • use Motherduck to handle persistence between runs. CC @Alex-Monahan, this is a very interesting use case for DuckDB + serverless analytics. Would love some thoughts about this. I am thinking "every job run outputs a unique file to Motherduck" or something. My biggest concern is coupling to the DuckDB version, since database files are not backwards compatible.
  • use serverless postgres (i.e. Neon) and write out resulting datasets to a persisted cache.
  • write to S3/GCS/Azure Blob after each job run.
  • I tend to favor this approach because it is the most "generic" but the affordances of a real database (i.e. a catalog) are very nice.

Bug: 'make pipeline' fails - loader fails

When running 'make pipeline', the loader fails. There is an open issue for this here.

The workaround, annoyingly, is to skip the duckdb loader entirely and instead run 'make parquet'.

test unique_xf_series_to_seed_series_id fails

Environment: Ubuntu 20.04.6 via WSL

Command: make docker-run-evidence and make docker-run-superset

Description: When meltano invoke dbt-duckdb build is called while running the Docker container, the test unique_xf_series_to_seed_series_id fails. The same test passes without using Docker. Here are the messages which show up in the terminal after the 75 tests are completed:

23:19:50  Running 1 on-run-end hook
23:19:51  Statements to run:
23:19:51  1 of 1 START hook: nba_monte_carlo.on-run-end.0 ................................ [RUN]
23:19:51  1 of 1 OK hook: nba_monte_carlo.on-run-end.0 ................................... [OK in 0.00s]
23:19:51
23:19:51
23:19:51  Finished running 7 table models, 10 external models, 41 tests, 17 view models, 1 hook in 0 hours 0 minutes and 32.05 seconds (32.05s).
23:19:51
23:19:51  Completed with 1 error and 0 warnings:
23:19:51
23:19:51  Failure in test unique_xf_series_to_seed_series_id (models/_docs.yml)
23:19:51    Got 14 results, configured to fail if != 0
23:19:51
23:19:51    compiled Code at ../docs/compiled/nba_monte_carlo/models/_docs.yml/unique_xf_series_to_seed_series_id.sql
23:19:51
23:19:51  Done. PASS=68 WARN=0 ERROR=1 SKIP=6 TOTAL=75

Superset fails to run with ```ModuleNotFoundError: No module named 'cryptography.hazmat.backends.openssl.x509'```

After cloning the repository and trying to run in docker on master branch, I get this error:

meltano invoke superset fab create-admin --username admin --firstname lebron --lastname james --email [email protected] --password password
2023-01-03T10:30:34.244420Z [error    ] Traceback (most recent call last):
  File "/workspaces/nba-monte-carlo/.meltano/utilities/superset/venv/bin/superset", line 5, in <module>
    from superset.cli.main import superset
  File "/workspaces/nba-monte-carlo/.meltano/utilities/superset/venv/lib/python3.9/site-packages/superset/__init__.py", line 21, in <module>
    from superset.app import create_app
  File "/workspaces/nba-monte-carlo/.meltano/utilities/superset/venv/lib/python3.9/site-packages/superset/app.py", line 23, in <module>
    from superset.initialization import SupersetAppInitializer
  File "/workspaces/nba-monte-carlo/.meltano/utilities/superset/venv/lib/python3.9/site-packages/superset/initialization/__init__.py", line 33, in <module>
    from superset.extensions import (
  File "/workspaces/nba-monte-carlo/.meltano/utilities/superset/venv/lib/python3.9/site-packages/superset/extensions/__init__.py", line 32, in <module>
    from superset.utils.cache_manager import CacheManager
  File "/workspaces/nba-monte-carlo/.meltano/utilities/superset/venv/lib/python3.9/site-packages/superset/utils/cache_manager.py", line 24, in <module>
    from superset.utils.core import DatasourceType
  File "/workspaces/nba-monte-carlo/.meltano/utilities/superset/venv/lib/python3.9/site-packages/superset/utils/core.py", line 76, in <module>
    from cryptography.hazmat.backends.openssl.x509 import _Certificate
ModuleNotFoundError: No module named 'cryptography.hazmat.backends.openssl.x509'

Need help fixing this problem? Visit http://melta.no/ for troubleshooting steps, or to
join our friendly Slack community.

Superset metadata database could not be initialized: `superset db upgrade` failed
make: *** [Makefile:18: superset-visuals] Error 1
make: *** [docker-run-superset] Error 2

It seems to be similar to the error mentioned in #54, but the superset-test branch seems to be gone now. I'm happy to provide a PR if I can be pointed in the right direction on where this might be happening. I am running on macOS Catalina 10.15.7.

`make docker-run` command fails with "plugin 'superset' is not known to meltano" error

Steps to replicate:

  1. clone repository
  2. make docker-build
  3. make docker-run

Error message:

16:45:43  Done. PASS=74 WARN=0 ERROR=0 SKIP=0 TOTAL=74


meltano invoke superset fab create-admin --username admin --firstname lebron --lastname james --email [email protected] --password password
Need help fixing this problem? Visit http://melta.no/ for troubleshooting steps, or to
join our friendly Slack community.

Plugin 'superset' is not known to Meltano
make: *** [Makefile:14: superset-visuals] Error 1
make: *** [Makefile:23: docker-run] Error 2

Evaluate which code can be consolidated into macros across NBA, NFL, NCAAF models.

The following models should in theory be generic:

  • inputs
    • teams
    • schedule
    • actual results
  • outputs
    • elo over time
    • predictions
    • end of season standings (blending predictions + actuals) AKA game log

This will allow additional sports to be added very quickly and easily - NHL, MLB, Premier League (?), College bball

non-generic models + reasons

  • in-season tournament / "champions cups": the games are not fixed, so we have to compute projected winners to slot teams into subsequent games.
  • end of season seeding: tiebreaking methodology depends on specific league rules
  • playoffs: for leagues with playoffs, subsequent games take a dependency on projected outcomes. Additionally, some leagues have different criteria for wins (NFL is 1 game, NBA is best of 7, MLB is hybrid, and so on)

However, the most interesting analysis depends on "end of season seeding" (argh), so we will need to figure out how to build end-of-season seeding models for each sport. Playoff models I am less certain about, because I am not confident that "regular season win totals", which currently drives the ELO ratings, is necessarily predictive of playoff success. For all intents and purposes, regular season and post-season models should probably behave independently.

db file not created

Running from the main branch in Codespaces, I followed the directions in readme.md and was not able to connect to the db from Superset.

Upon investigation, realized the file was not present at /tmp/mdsbox.db after running make pipeline.

(screenshot omitted)

It looks like either the deletion of the path keys in profiles.yml in #33, or that in combination with the switch to use external materialization in #36, is the cause. I have a PR incoming that I will link to this issue to add the path key back in, which seems to fix it.

After doing that, the mdsbox.db file shows up for me:

(screenshot omitted)

Clean up cosmetic issues

Have a pile of cosmetic issues I need to fix; I've been delaying them as I focus more on modeling. Going back to this now since the modeling is mostly stable.

NBA

  • main page
    • upcoming games filter should be sorted alphabetically on team name
    • upcoming games filter only applies to home games
    • upcoming games data table is too wide
    • in-season tournament should be removed
    • standings table should show records as integer, not decimal.
    • add 3 columns to standing table: make playoffs, win championship & elo vs vegas.
  • Historical Matchups
    • playoff wins / losses not populated
      • check underlying data to see if i need to fix the input csv file (likely)
    • elo change table - label the x axis
    • elo change table - label the y axis, make sure lines show up
    • #143
    • #142
    • data is stopping at 2016, why?
  • IST - add championship summary for LAL
  • IST - hide upcoming games
  • Matchup calcs
    • scores should show as integers, not decimals
    • #141
    • table with elo value is in wrong order (fixed in historical matchups page, so replicate that logic here)
  • predictions
    • game id should be int not decimal (marking complete - looks like an issue with evidence sources)
    • table is too wide
    • replace search with a team filter
  • prediction details
    • add a line for avg team (1600)
    • table with elo value is in wrong order (fixed in historical matchups page, so replicate that logic here)
    • last 5 games - scores should be int not decimal
    • team matchups - record should be int not decimal
  • teams
    • records should be int, not dec
    • add elo vs vegas w/red+green indicators
  • team details
    • seed range should be int, not dec
    • win range should be int, not dec
    • recent games - should be int, not dec
    • upcoming schedule - use this table format anywhere we are surfacing predictions (its the right columns)

NCAAF

  • Lock this as of end of regular season (incl army/navy game)

NFL

  • End of season seeding out of order on the bar chart (works in NBA section, so replicate that component)
  • most recent games - sort by date DESC (currently date ASC), and limit it to 5 games
  • scores should render as int, not dec

What to do about Rill & Superset?

Rill & Superset are in the project for legacy purposes but certainly make it a little heavier to maintain. It does seem that many users find value in the instructions on how to configure Superset, and it is a nice tool! Perhaps we just leave them in as options installed with config flags and focus the core workflow on Evidence.

Add NCAAF Forecast

Once the NFL team pages are built out and sorted, add pages for NCAAF teams. The model should be roughly appropriate, although the post-season stuff is much more complex.

Execute a python model

This way of structuring a data stack in a box is just brilliant. I can get everything up and running except making a Python model instead of a SQL model.

I have this very basic py model to test:

import pandas as pd
def model(dbt, session):
    dbt.config(packages=["pandas"])
    data = dbt.ref("train_test_split")
    data['test_column'] = 1

    return data

However, upon a dbt run I get:

Python model failed:
No module named 'pandas'

I consulted the author of dbt-duckdb here, but I am still not able to run it. Is it possible on your end?

misc model updates

nfl

  • add tiebreakers
  • add playoffs

nba

  • add in-season tournament analytics page
  • move evidence queries to sources
  • add tiebreakers for in-season tournament
  • fix h2h tiebreakers for in-season tournament
  • add in-season tournament predictive model
  • handle plug for the 22 teams that didn't make the in-season tournament
  • add end of season tiebreakers
  • fix playoffs (currently showing 0% for all teams to win finals)
