drivendata / cookiecutter-data-science

A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.

Home Page: http://drivendata.github.io/cookiecutter-data-science/

License: MIT License

Makefile 33.61% Python 49.73% Batchfile 16.67%
cookiecutter-data-science cookiecutter cookiecutter-template data-science machine-learning ai


cookiecutter-data-science's Issues

Remove --recursive from s3 sync commands in Makefile

At least as of AWS CLI v1.10.32, the aws s3 sync command does not have a --recursive flag. As a result, running the sync_data_to_s3 or sync_data_from_s3 make rules throws the error:

Unknown options: --recursive
make: *** [sync_data_to_s3] Error 255

The sync operation is recursive by default; see the AWS CLI docs.

The --recursive flag should be removed from the default Makefile.

Imports not right for dotenv snippet in docs

# src/data/dotenv_example.py
from os.path import join, dirname
from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load up the entries as environment variables
load_dotenv(dotenv_path)

database_url = os.environ.get("DATABASE_URL")
other_variable = os.environ.get("OTHER_VARIABLE")

from os.path import join, dirname should just be import os (for os.environ.get).
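With that fix applied, the snippet reads:

# src/data/dotenv_example.py
import os

from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load up the entries as environment variables
load_dotenv(dotenv_path)

database_url = os.environ.get("DATABASE_URL")
other_variable = os.environ.get("OTHER_VARIABLE")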

Add the installation folder to .env

Hi,

I am just starting to use the project. I noticed that most of the commands use relative paths, as in the Makefile.

It is sometimes useful to have access to the full path. For example, when running cron jobs on scripts inside the project, resolving the proper relative paths can be tricky. One can use something like os.path.abspath(__file__) in a script to find the path, but it would be easier if the project folder were written to .env, with the resulting environment variable then used to build paths to the data or visualization folders.

Thanks.
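For illustration, a minimal sketch of what that could look like, assuming a PROJECT_DIR entry has been added to .env (the variable name is hypothetical):

import os

from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

# PROJECT_DIR is an assumed .env entry holding the project's absolute path
project_dir = os.environ["PROJECT_DIR"]
raw_data_dir = os.path.join(project_dir, "data", "raw")
figures_dir = os.path.join(project_dir, "reports", "figures")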

Compatibility of pack to create api driven projects

Hi there! Loved the project, this really reflects the maturity of data science projects and where we are standing. So good!

I am raising this issue because I was wondering whether the current structure can be adapted to an API-driven project, that is, a project in which the analysis and data flow are tied to an API definition.

If yes, what would that look like? Then we can document it (or you can point me to where it already is).
If not, why not? Some books recommend exposing analysis and processing through an API so that results are available to our colleagues in engineering, and even allow for an easy scale-up.

Thank you so much!

Slightly adjust commands in Makefile

Just started using your cookiecutter for the first time. Thank you for the effort, it seems very valuable to me!

I had some comments and I'm happy to create pull requests if those are desired changes. I'm talking about targets in the Makefile here.

requirements:

pip now also supports a constraints file, which seems more appropriate for pinning or requiring specific versions of dependencies.

clean:

find can delete directly: find . -iname "*.pyc" -delete seems pretty clear to me. For Python 3 it could be useful to add find . -iname "__pycache__" -exec rm -rf {} +. The plus at the end, rather than \;, passes all matches to rm in a single invocation rather than running rm once per match.

lint:

Typically I only want to run flake8 on source code, so rather than excluding a bunch of directories, why not call flake8 on the src directory only?

Swap out Sphinx for mkdocs

Sphinx is really good for projects where documentation lives in docstrings in the code. MkDocs is easier to write from scratch, style, and deploy.

Also, I've got a preference for writing Markdown over reStructuredText.

Make separate docs pages instead of one monolithic page

Especially if we're adding more content (e.g., #18), we may want to have a few separate pages. Possible segmentation could be:

  • Project introduction and documentation
  • Directory layout
  • Opinions and philosophy
  • Workflow components and the technologies that are chosen (or are options)
  • Extension strategies (e.g., #16)
  • Links to examples of projects that use the template

Unclear how to use AWS and `make data`

Analysis is a DAG. The sequence in this DAG is critical, so more prescription would be beneficial.

It's unclear how to incorporate AWS and the make sync_data_(to|from)_s3 commands into make data. In addition, the documentation doesn't describe how AWS should be used with the .env file.

  • Should make data call sync_data_from_s3?
  • How should variables from .env be exported so they are available to make sync_data_(to|from)_s3? A Python script, or something else? (One possibility is sketched below.)
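One possible answer to the second question, sketched rather than prescribed: a small Python helper that loads .env itself and shells out to the AWS CLI (the S3_BUCKET variable name is an assumption, and python-dotenv plus the AWS CLI are required):

import os
import subprocess

from dotenv import load_dotenv, find_dotenv

# pull AWS-related settings, e.g. an assumed S3_BUCKET entry, from .env
load_dotenv(find_dotenv())
bucket = os.environ["S3_BUCKET"]

# the equivalent of sync_data_from_s3, callable from a `make data` step
subprocess.run(
    ["aws", "s3", "sync", "s3://{}/data/".format(bucket), "data/"],
    check=True,
)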

Request for a tutorial demonstrating simple implementation of cookie-cutter-data-science framework

Hello,
I'd like to use the cookiecutter-data-science framework for the project I'm working on, but unfortunately I'm having trouble getting started. Specifically, I'm having trouble figuring out how to configure the make_dataset.py file to execute any Python data-making scripts. I'm sure the fix is pretty basic, but I've been spinning my wheels for a while trying to figure this out.

It would be great if you could provide a basic tutorial demonstrating a simple implementation of your framework that people like me could use to get started.
Thanks!
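Until such a tutorial exists, here is a hedged sketch of what a minimal make_dataset.py could look like; the file names and the dropna() cleaning step are placeholders, not part of the template:

# src/data/make_dataset.py -- minimal illustrative version
import logging
from pathlib import Path

import pandas as pd


def main(input_dir="data/raw", output_dir="data/processed"):
    logger = logging.getLogger(__name__)
    logger.info("making final data set from raw data")

    # read a raw file, apply a placeholder cleaning step, write the result
    df = pd.read_csv(Path(input_dir) / "train.csv")
    df = df.dropna()
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    df.to_csv(Path(output_dir) / "train_clean.csv", index=False)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    main()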

Test suite

We should set up a simple test suite. Mostly focused on config testing, I would guess:

  • CI Server that runs tests on multiple OSes: osx/linux/windows
  • Tox for multiple Python versions

Workflow for working with figures and reports

I just started using this cookiecutter and I'm wondering how people are using this directory structure in order to generate figures and reports.

Here's what I'm doing currently:

  • do analysis and generate interesting figure, save them to /reports/figures/
  • write up the final Jupyter notebook report from within /notebooks/reports/; any references to figures point to ../../reports/figures/fig.png
  • export the report as report.html and place in /reports/

The issue now is that when I view report.html, the figures don't have the proper path. How are people getting around this?
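One workaround, offered as a sketch rather than an established convention: symlink the figures directory next to the notebooks, so the same relative reference resolves both in the live notebook and in the exported report (assumes a POSIX system and that the script runs from the project root):

import os

# after this, a notebook in notebooks/reports/ can reference figures as
# figures/fig.png, and the same path still resolves once report.html is
# exported into /reports/ (where reports/figures/ already sits beside it)
os.symlink(
    os.path.abspath("reports/figures"),
    "notebooks/reports/figures",
)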

How will cookiecutter handle database-driven projects?

I see there is S3 syncing, but what about people using SQL databases or HDFS? A few useful thoughts:

  1. There should be a place for database connection strings and for establishing connections (one possibility is sketched below)
  2. Inside src/data we should store Python scripts, but we could add a subdirectory, database_scripts, for .sql, .hql, etc. This would cover database inserts, ETL, in-database data munging, and so on.

Does this seem sensible?
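On the first point, a sketch of what the connection-string plumbing could look like, assuming SQLAlchemy and a DATABASE_URL entry in .env (the .env pattern is already used elsewhere in the template):

import os

from dotenv import load_dotenv, find_dotenv
from sqlalchemy import create_engine, text

load_dotenv(find_dotenv())

# DATABASE_URL, e.g. postgresql://user:pass@host/dbname, lives in .env
engine = create_engine(os.environ["DATABASE_URL"])

with engine.connect() as conn:
    # scripts stored under src/data/database_scripts/ could be run here
    conn.execute(text("SELECT 1"))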

ContextDecodingException

When running cookiecutter https://github.com/drivendata/cookiecutter-data-science in Anaconda 2.3.0 (Python 2.7.11) I get the following exception:

Traceback (most recent call last):
  File "/Users/bencook/anaconda/bin/cookiecutter", line 11, in <module>
    sys.exit(main())
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/click-5.1-py2.7.egg/click/core.py", line 700, in __call__
    return self.main(*args, **kwargs)
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/click-5.1-py2.7.egg/click/core.py", line 680, in main
    rv = self.invoke(ctx)
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/click-5.1-py2.7.egg/click/core.py", line 873, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/click-5.1-py2.7.egg/click/core.py", line 508, in invoke
    return callback(*args, **kwargs)
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/cookiecutter/cli.py", line 106, in main
    config_file=user_config
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/cookiecutter/main.py", line 130, in cookiecutter
    extra_context=extra_context,
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/cookiecutter/generate.py", line 102, in generate_context
    raise ContextDecodingException(our_exc_message)
cookiecutter.exceptions.ContextDecodingException: JSON decoding error while loading "/Users/bencook/.cookiecutters/cookiecutter-data-science/cookiecutter.json".  Decoding error details: "Expecting property name: line 9 column 1 (char 401)"

I get the same exception in a virtual environment with Python 2.7.9.

Here's what my cookiecutter.json looks like:

{
    "project_name": "project_name",
    "repo_name": "{{ cookiecutter.project_name|replace(' ', '_') }}",
    "author_name": "Your name (or your organization/company/team)",
    "description": "A short description of the project.",
    "year": "2016",
    "open_source_license": ["MIT", "BSD", "Not open source"],
    "s3_bucket": "[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')",
}
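The likely culprit is the trailing comma after the "s3_bucket" entry: a trailing comma before a closing brace is invalid JSON, and the reported position (line 9, column 1) points at exactly that closing brace. Removing the comma should resolve the exception.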

docker support

Hi,
Any chance someone could add Docker support plus SQL-in-Docker support, like in this cookiecutter Django project?

The benefits are:

  1. reproducible environment for running the code -> easier to deploy
  2. reproducible database (if needed)

I am new to Docker and cookiecutter; otherwise I would do this myself.

Minor issue with self-documenting make

In a fresh project, running make (Ubuntu) gives me:

$ make
/bin/sh: 1: test: Linux: unexpected operator
Available rules:

clean               Delete all compiled Python files 
create_environment  Set up python interpreter environment 
data                Make Dataset 
lint                Lint using flake8 
requirements        Install Python Dependencies 
sync_data_from_s3   Download Data from S3 
sync_data_to_s3     Upload Data to S3 
test_environment    Test python environment is setup correctly

It looks like a problem with the invocation of test (my uname is Linux).

Seems like this comes from the very last line of the self-documenting rule:

	| more $(shell test $(shell uname) == Darwin && echo '--no-init --raw-control-chars')

Changing the == to = seems to get rid of the /bin/sh: 1: test: Linux: unexpected operator.

@pjbull Can you see if this works on OS X?

Cookiecutter is now in Conda Forge

This works for installation, so you might want to change your creation process to something like:

$ conda config --add channels conda-forge
$ conda install cookiecutter

Make src home to more than just Python code

I routinely have to use R code in my pipeline rules/targets. I propose changing the src organization as follows:

Change the current:

src
├── data
│   └── make_dataset.py
├── features
│   └── build_features.py
├── __init__.py
├── models
│   ├── predict_model.py
│   └── train_model.py
└── visualization
    └── visualize.py

to something akin to this:

src
├── python
│   ├── data
│   │   ├── __init__.py
│   │   └── make_dataset.py
│   ├── features
│   │   ├── build_features.py
│   │   └── __init__.py
│   ├── __init__.py
│   ├── models
│   │   ├── __init__.py
│   │   ├── predict_model.py
│   │   └── train_model.py
│   ├── rules
│   │   ├── __init__.py
│   │   └── template_python_script.py
│   └── visualization
│       ├── __init__.py
│       └── visualize.py
└── R
    └── rules
        └── template_R_script.R

Thoughts?

How would this structure change for R?

I'm working on creating a similar standard for R at my company and was hoping to get some thoughts on whether anything warrants changing to be R-specific.

Rename IPython NB to Jupyter NB

Alright, this is SUPER nitpicky. Jupyter Notebook is the new name for IPython Notebook. The comment above .ipynb_checkpoints/ in the .gitignore should be changed from # IPython NB Checkpoints to # Jupyter NB Checkpoints.

I'm going to submit a PR to make the change (like I said, really nitpicky).

Include nosetests out of the box with top level testing dir

One of the main components my usual data science setup has that this template lacks is a top-level directory for unit and integration testing. Once a model moves to production, it is vital that it ship with unit and integration tests and assurance that the work does not break any other models. I recommend adding a tests directory at the top level of the project so that forked projects can run the testing suite with access to all the proper sub-modules.

Great work; I appreciate the organization!
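As a starting point, here is a hedged sketch of a top-level test module (the file name tests/test_basics.py and its contents are hypothetical) that would run under nosetests or pytest:

# tests/test_basics.py
import sys


def test_python_version():
    # sanity check that the suite runs under a supported interpreter
    assert sys.version_info >= (2, 7)


def test_src_importable():
    # forked projects should be able to import the project's sub-modules;
    # assumes the project root is on sys.path when the suite runs
    import src  # noqa: F401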

Set .gitignore for the data directory

Goal: keep the data/ folder in the project template for illustrative reasons, but by default ignore its contents once the cookiecutter has been instantiated and turned into a git repo.
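One common way to express that in the generated .gitignore is to ignore the folder's contents (data/*) while whitelisting a committed placeholder (e.g., !data/.gitkeep) so the directory itself survives in the repository; the .gitkeep file name is a convention, not a git feature.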

Add default config file to src/

Hi

Should we add a src/config.py or src/settings.py file? I believe this would make it easier to get paths to folders, etc., in make_data.py, for example.

# src/config.py
""" Storing config variables and other settings"""
import inspect
import os
from os.path import join, dirname, abspath
from dotenv import load_dotenv

dotenv_path = join(dirname(__file__), '../.env')
load_dotenv(dotenv_path)

class ParamConfig:
    """Config variables for the project"""
    def __init__(self):
        self.kaggle_username = os.environ.get("KAGGLE_USERNAME")
        self.kaggle_password = os.environ.get("KAGGLE_PASSWORD")
        self.config_dir = dirname(abspath(inspect.getfile(inspect.currentframe())))
        self.root_dir = dirname(self.config_dir)

        # Data directories
        self.data_dir = os.path.join(self.root_dir, 'data')
        self.raw_data_dir = os.path.join(self.data_dir, 'raw')
        self.processed_data_dir = os.path.join(self.data_dir, 'processed')

config = ParamConfig()

I can then import the config variable like so:

# Selective excerpt from src/data/make_data.py as an example
import logging
from os import path

import pandas as pd

from src.config import config

def main(output_zip=False):
    """Create data!"""
    logger = logging.getLogger(__name__)
    logger.info('making final data set from raw data')

    # compression = 'gzip' if output_zip is True else

    # Read raw data (auto unzipping files!)
    train_sales = pd.read_csv(path.join(config.raw_data_dir, 'train.csv.zip'))
    test_sales = pd.read_csv(path.join(config.raw_data_dir, 'test.csv.zip'))
    stores = pd.read_csv(path.join(config.raw_data_dir, 'store.csv.zip'),
                         dtype={'CompetitionOpenSinceYear': str,
                                'CompetitionOpenSinceMonth': str,
                                'Promo2SinceWeek': str,
                                'Promo2SinceYear': str,})

However, note that importing the config this way also requires me to change the Makefile from this:

data:
    python -m src/data/make_dataset.py

to this:

data:
    python -m src.data.make_dataset

I'm not sure if this has any downsides to it. An alternative is also to add the src and/or settings file to the python path.

I'm still learning both Python and data science, so please bear with me if what I'm suggesting or my code is silly :)

Add an opinion about making scripts chatty

I'm generally in favor of keeping the opinions section pithy, but I think this may be a fit.

  • Use real logging, not print statements (we have some boilerplate; see the sketch after this list)
    • easy redirect to multiple places
    • timestamps and module for free
    • easy to see what happens on someone else's instance
  • Include tqdm by default
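For reference, a minimal sketch of the kind of logging setup meant here (the format string is just an example):

import logging

# timestamps and module names come for free, and handlers can redirect
# the output to files, the console, or both
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)

logger = logging.getLogger(__name__)
logger.info("processed %d records", 1000)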

Make default repo_name lowercase

Right now the default repo_name is simply the provided project_name, replacing spaces with underscores: "repo_name": "{{ cookiecutter.project_name|replace(' ', '_') }}".

It would be nice if the default also converted the project_name to lowercase: {{ cookiecutter.project_name.lower().replace(' ', '_') }}.

Thoughts?
