business-science / pytimetk

Time series easier, faster, more fun. Pytimetk.

Home Page: https://business-science.github.io/pytimetk/

License: MIT License

Python 96.04% JavaScript 3.96%
pandas polars time time-series timeseries timeseries-analysis

pytimetk's People

Contributors

gtimothee, iskode, justinkurland, lucaso21, mdancho84, samuelmacedo83, tackes

pytimetk's Issues

More Augment Functions: logarithmic, polynomial, hilbert, wavelet, short fourier

Per @JustinKurland:

There are a lot of opportunities for more augmentation functions:

  • tk.augment_logarithmic()
  • tk.augment_polynomial()
  • tk.augment_hilbert()
  • tk.augment_wavelet()
  • tk.augment_short_fourier() <- Unlike the standard Fourier transform, this breaks a signal into smaller segments to provide a time-varying analysis with adjustable time and frequency resolution.

These are just a few, but all represent further opportunities to add valuable information that has historically been leveraged in the time series and signal processing literature.
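
To make the short Fourier idea concrete, here is a minimal NumPy sketch. The helper name short_fourier and its parameters window_size and hop are assumptions for illustration only; this is not an existing pytimetk function.

```python
import numpy as np

def short_fourier(values, window_size, hop):
    """Hypothetical helper: magnitude spectrum of each windowed segment."""
    values = np.asarray(values, dtype=float)
    window = np.hanning(window_size)  # taper each segment to reduce leakage
    starts = range(0, len(values) - window_size + 1, hop)
    segments = np.stack([values[s:s + window_size] * window for s in starts])
    # One row per segment, one column per frequency bin
    return np.abs(np.fft.rfft(segments, axis=1))

# Signal whose frequency changes halfway through
t = np.arange(256)
signal = np.where(
    t < 128,
    np.sin(2 * np.pi * 4 * t / 64),   # 4 cycles per 64 samples
    np.sin(2 * np.pi * 16 * t / 64),  # 16 cycles per 64 samples
)
spec = short_fourier(signal, window_size=64, hop=64)
dominant_bins = spec.argmax(axis=1)  # dominant frequency bin per segment
```

A smaller window_size gives finer time resolution and coarser frequency resolution, which is the trade-off the issue refers to.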

Tests: Data Wrangling Functions

Create pytest tests for the Data Wrangling Functions. Use ChatGPT to help.

  • tk.summarize_by_time()
  • tk.future_frame()
  • tk.pad_by_time()

#2

`tk.augment_fourier()`

PyTimeTK Roadmap

Phase 1: MVP Package

Develop a minimal package with the most important functions.

Use this guide: https://py-pkgs.org/03-how-to-package-a-python

Priority 1 - Core Data and Data Frame Operations

  • summarise_by_time() / summarize_by_time()
  • Data Sets

Priority 2 - Plot Time Series

  • plot_time_series() - Not sure if we should go with plotly or altair for interactive mode. I feel we should go with plotnine for non-interactive. Will need smooth_vec().

Priority 3 - Data Wrangling

  • future_frame() - We will also need tk_make_future_timeseries() and tk_make_timeseries()
  • pad_by_time()
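
As a rough illustration of what future_frame() is meant to do, the sketch below appends future timestamps to a data frame using plain pandas. The helper make_future_frame and its parameters are hypothetical, and it assumes the last timestamp is already aligned with freq.

```python
import pandas as pd

def make_future_frame(df, date_column, length_out, freq):
    """Hypothetical stand-in: append `length_out` future timestamps to `df`."""
    last = df[date_column].max()
    # date_range includes `last` itself, so generate one extra and drop it
    future_dates = pd.date_range(last, periods=length_out + 1, freq=freq)[1:]
    future = pd.DataFrame({date_column: future_dates})
    # Columns missing from `future` (here: value) are filled with NaN
    return pd.concat([df, future], ignore_index=True)

df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=3, freq="D"),
    "value": [1.0, 2.0, 3.0],
})
extended = make_future_frame(df, "date", length_out=2, freq="D")
```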

Priority 4 - Augment Operations

Note - These functions should overwrite columns that are named the same in the input data frame.

  • tk.augment_timeseries_signature() - tk.get_timeseries_signature()
  • tk.augment_holiday_signature() - Uses holidays package
  • tk.augment_lags() / tk.augment_leads()
  • tk.augment_rolling()
  • tk.augment_fourier()
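
The overwrite rule in the note above could look like this in plain pandas. The helper add_lags and the column naming pattern value_lag_1 are assumptions for illustration, not the package API.

```python
import pandas as pd

def add_lags(df, value_column, lags):
    """Hypothetical stand-in: add lag columns, overwriting same-named columns."""
    out = df.copy()
    for lag in lags:
        # Plain assignment replaces an existing column of the same name
        out[f"{value_column}_lag_{lag}"] = out[value_column].shift(lag)
    return out

df = pd.DataFrame({
    "value": [1.0, 2.0, 3.0],
    "value_lag_1": [9.0, 9.0, 9.0],  # stale column that should be overwritten
})
out = add_lags(df, "value", lags=[1, 2])
```

A lead is the same operation with a negative shift, for example shift(-1).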

Priority 5 - TS Features

  • tk.ts_features()

Phase 2: Expand Functionality

Anomalize in Python

  • Convert Anomalize R package to tk.anomalize()

Time Series Plotting Utilities

  • Plot ACF
  • Plot Anomalies
  • Plot Seasonality
  • Plot STL Decomposition
  • Plot Time Series Regression

Time Series Inspection, Frequency, and Trend

  • TS Summary: tk.ts_summary()
  • Time Scale Template
  • Automatic Frequency Detection
  • Automatic Trend Detection

Applied Tutorials

  • Sales CRM Database Analysis
  • Finance Investment Analysis
  • Demand Forecasting
  • Anomaly Detection
  • Clustering

Phase 3: Extend Sklearn

  • Time Series Splitting / Cross Validation Functionality
  • Preprocessors & Feature Engineering
  • Vectorized Functions - Box Cox,
  • Plot Time Series CV
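
For the vectorized Box Cox item, a minimal NumPy sketch of the standard transform is shown below. The function name box_cox_vec is an assumption, and automatic selection of lambda is left out.

```python
import numpy as np

def box_cox_vec(x, lam):
    """Hypothetical helper: Box-Cox transform for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)            # limiting case as lambda approaches 0
    return (x ** lam - 1.0) / lam   # general case

log_case = box_cox_vec([1.0, np.e], lam=0)
square_case = box_cox_vec([1.0, 3.0], lam=2)
```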

Phase 4: Fill in Function Gaps Where Needed

Add additional functionality that was not identified in Phases 1-3.

summarize_by_time - corr

corr does not appear to be a valid agg_func value.

For example:

df \
    .groupby("category_2") \
    .summarize_by_time(
        date_column='order_date', 
        value_column= ['total_price'],
        freq = "MS",
        agg_func = ['corr']
    )

Will generate the error:
AttributeError: 'corr' is not a valid function for 'DatetimeIndexResamplerGroupby' object

I think simply modifying the docstring here:

        - "sum": Sum of values
        - "mean": Mean of values
        - "median": Median of values
        - "min": Minimum of values
        - "max": Maximum of values
        - "std": Standard deviation of values
        - "var": Variance of values
        - "first": First value in group
        - "last": Last value in group
        - "count": Count of values
        - "nunique": Number of unique values
        - "corr": Correlation between values <- Just remove

is the simplest solution. I am not entirely sure what the intended use for corr was here anyway: was it for comparing to features/covariates, or was it meant to compare from t1 to t2 to t3 ...

Regardless, we should just tweak the docstring for now.

In addition, the function as currently written includes a 'kind' parameter that defaults to 'timestamp'. The docstring does not mention that 'period' also works, so this should be added to the docstring.
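
For reference, the plain pandas sketch below performs a comparable grouped monthly summary with two of the documented aggregations, then converts the timestamps to periods. It assumes that kind = 'period' amounts to a conversion like this; the data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "category_2": ["a", "a", "b", "b"],
    "order_date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-01-10", "2023-02-03"]
    ),
    "total_price": [10.0, 20.0, 5.0, 7.0],
})

# Group by category and month start, then aggregate the value column
summary = (
    df.groupby(["category_2", pd.Grouper(key="order_date", freq="MS")])["total_price"]
      .agg(["sum", "mean"])
      .reset_index()
)

# Period-style output: represent each bucket as a monthly Period
summary_period = summary.assign(
    order_date=summary["order_date"].dt.to_period("M")
)
```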

Tests: Augment Functions

Need pytest tests for the augment functions. (Use ChatGPT to help.)

  • augment_timeseries_signature
  • augment_holiday_signature
  • augment_lags
  • augment_leads
  • augment_rolling

#2

Error in plot_timeseries with engine = Matplotlib

When plotting grouped data with the matplotlib engine, an image size error is raised:

ValueError: Image size of 140000x100000 pixels is too large. It must be less than 2^16 in each direction.

However, if we explicitly define the width and height, matplotlib works as expected.
A default plot size needs to be defined.

Reproducible example:

import timetk as tk

NOT WORKING

df = tk.load_dataset('m4_monthly', parse_dates = ['date'])

fig = (
    df
        .groupby('id')
        .plot_timeseries(
            'date', 'value', 
            color_column = 'id',
            facet_ncol = 2,
            x_axis_date_labels = "%Y",
            engine = 'matplotlib'
        )
)
fig

WORKING

fig = (
    df
        .groupby('id')
        .plot_timeseries(
            'date', 'value', 
            color_column = 'id',
            facet_ncol = 2,
            x_axis_date_labels = "%Y",
            width = 1200,
            height = 800,
            engine = 'matplotlib'
        )
)
fig
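
One possible way to supply the missing default is sketched below. It assumes width and height are given in pixels and that the engine converts them to inches for matplotlib; resolve_figsize and its default values are hypothetical.

```python
MAX_PIXELS = 2 ** 16  # matplotlib requires each side to be below this

def resolve_figsize(width=None, height=None, dpi=100,
                    default_width=800, default_height=600):
    """Hypothetical helper: pixel size to a (width, height) tuple in inches."""
    width = default_width if width is None else width
    height = default_height if height is None else height
    if width >= MAX_PIXELS or height >= MAX_PIXELS:
        raise ValueError(f"Plot size {width}x{height} pixels is too large")
    return (width / dpi, height / dpi)

default_size = resolve_figsize()             # falls back to the defaults
explicit_size = resolve_figsize(1200, 800)   # matches the WORKING example
```

The returned tuple could then be passed as figsize when the figure is created.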

`ts_summary()`: Work on auto frequency

Need to work on how ts_summary() calculates a frequency when pandas frequency inference fails.

Goal: Make auto frequency detection possible (and less brittle).
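
A minimal sketch of one fallback strategy: try pd.infer_freq first, and use the median spacing between timestamps when it returns nothing. The helper detect_frequency is hypothetical, and the median is only one of several reasonable choices.

```python
import pandas as pd

def detect_frequency(index):
    """Hypothetical helper: inferred frequency string, else median spacing."""
    idx = pd.DatetimeIndex(index).sort_values()
    # pd.infer_freq needs at least 3 timestamps and returns None when irregular
    freq = pd.infer_freq(idx) if len(idx) >= 3 else None
    if freq is not None:
        return freq
    # Fallback: typical gap between consecutive timestamps
    return idx.to_series().diff().median()

regular = detect_frequency(pd.date_range("2023-01-01", periods=5, freq="D"))
irregular = detect_frequency(
    ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-05", "2023-01-06"]
)
```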

Pad By Time Grouped - End Behavior

Update pad_by_time behavior for grouped data so that each group is extended to the maximum timestamp across all groups.

Example: groups A and B, where A has values (with gaps) between 1/1/22 and 1/6/22, and B has values between 1/2/22 and 1/5/22.
We expect group B to have values filled in through the latest date across all groups.

In terms of data prep for a global model: if 1/6 is the end of my training data, we would need group B to be extended to 1/6 as well.
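
The requested end behavior can be sketched in plain pandas as follows. The helper pad_groups_to_global_end is hypothetical, and the dates are read as January 1 to January 6, 2022.

```python
import pandas as pd

def pad_groups_to_global_end(df, date_column, group_column, freq="D"):
    """Hypothetical stand-in: pad each group from its own start to the global end."""
    global_end = df[date_column].max()
    padded = []
    for key, g in df.groupby(group_column):
        full_index = pd.date_range(g[date_column].min(), global_end, freq=freq)
        g = (
            g.set_index(date_column)
             .reindex(full_index)       # inserts missing dates as NaN rows
             .rename_axis(date_column)
             .reset_index()
        )
        g[group_column] = key           # restore the group label on padded rows
        padded.append(g)
    return pd.concat(padded, ignore_index=True)

df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(
        ["2022-01-01", "2022-01-03", "2022-01-06",
         "2022-01-02", "2022-01-03", "2022-01-05"]
    ),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
padded = pad_groups_to_global_end(df, "date", "id")
```

Group B now ends on 2022-01-06 with a missing value on the padded rows.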

Quick roadmap corrections

Priority 3 - Augment Operations -> change to Priority 4 - Augment Operations
Note - These functions should overwrite columns that are named the same in the input data frame.

tk_augment_timeseries_signature() - tk_get_timeseries_signature()
tk_augment_lags() / tk_augment_leads() - Will need lag_vec(), lead_vec()
tk_augment_slidify() - May need slidify_vec()
Add tk_augment_holiday_signature() and check it once the merge request is completed

New function: `apply_by_time`

  • summarize_by_time = agg + resample: Simple aggregations to only single columns as a series, highly optimized
  • apply_by_time = apply + resample: More complex aggregations allowing users to access all columns in the data, less optimized
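
The difference between the two can be illustrated with plain pandas. This is only a sketch of the distinction, not the proposed implementation; the column names and the weighted_price function are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03"]),
    "price": [10.0, 20.0, 30.0],
    "qty": [1, 3, 2],
})
monthly = pd.Grouper(key="date", freq="MS")

# agg + resample: one column at a time, fast built-in aggregation
total_qty = df.groupby(monthly)["qty"].sum()

# apply + resample: the function sees several columns of each time bucket
def weighted_price(g):
    return (g["price"] * g["qty"]).sum() / g["qty"].sum()

avg_price = df.groupby(monthly)[["price", "qty"]].apply(weighted_price)
```

The quantity-weighted price needs both columns at once, which a single-column aggregation cannot express.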

Bug: `plot_timeseries` plotly engine - Bollinger Band Example

Getting a weird bug. It only occurs when the color palette has duplicated colors.

When colors are duplicated

import timetk as tk
import pandas as pd

stocks_df = tk.load_dataset("stocks_daily", parse_dates = True)

# Bollinger Bands
bollinger_df = stocks_df[['symbol', 'date', 'adjusted']] \
    .groupby('symbol') \
    .augment_rolling(
        date_column = 'date',
        value_column = 'adjusted',
        window = 20,
        window_func = ['mean', 'std'],
        center = False
    ) \
    .assign(
        upper_band = lambda x: x['adjusted_rolling_mean_win_20'] + 2*x['adjusted_rolling_std_win_20'],
        lower_band = lambda x: x['adjusted_rolling_mean_win_20'] - 2*x['adjusted_rolling_std_win_20']
    )


# Visualize
fig = (bollinger_df

    # zoom in on dates
    .query('date >= "2023-01-01"') 

    # Convert to long format
    .melt(
        id_vars = ['symbol', 'date'],
        value_vars = ["adjusted", "adjusted_rolling_mean_win_20", "upper_band", "lower_band"]
    ) 

    # Group on symbol and visualize
    .groupby("symbol") 
    .plot_timeseries(
        date_column = 'date',
        value_column = 'value',
        color_column = 'variable',
        # Adjust colors for Bollinger Bands
        color_palette =["#2C3E50", "#E31A1C", '#18BC9C', '#18BC9C'],
        smooth = False, 
        facet_ncol = 2,
        width = 900,
        height = 700,
        engine = "plotly" 
    )
)
fig

When colors are not duplicated.

(bollinger_df

    # zoom in on dates
    .query('date >= "2023-01-01"') 

    # Convert to long format
    .melt(
        id_vars = ['symbol', 'date'],
        value_vars = ["adjusted", "adjusted_rolling_mean_win_20", "upper_band", "lower_band"]
    ) 

    # Group on symbol and visualize
    .groupby("symbol") 
    .plot_timeseries(
        date_column = 'date',
        value_column = 'value',
        color_column = 'variable',
        # Adjust colors for Bollinger Bands
        color_palette =["#2C3E50", "#E31A1C", '#18BC9C', '#000000'],
        smooth = False, 
        facet_ncol = 2,
        width = 900,
        height = 700,
        engine = "plotly" 
    )
)

Documentation Instructions - Quarto and Quartodoc

Documentation Instructions

  • Create package documentation (docstrings)
  • Use Quarto and Quartodoc to build the Python package documentation

1. Create Package Documentation

The fastest way to create documentation is to use Mintlify Doc Writer for Python.

IMPORTANT: Quartodoc uses NumPy docstring formatting.

You can then highlight a function and select "Generate Docstring".
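
For reference, a NumPy-style docstring has the following shape. The function add_numbers is a made-up example and not part of the package.

```python
def add_numbers(a, b=1):
    """
    Add two numbers.

    Parameters
    ----------
    a : float
        First value.
    b : float, optional
        Second value. Defaults to 1.

    Returns
    -------
    float
        The sum of `a` and `b`.

    Examples
    --------
    >>> add_numbers(2, 3)
    5
    """
    return a + b

result = add_numbers(2, 3)
```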

2. Use Quartodoc & Quarto to generate Package Documentation

Make sure Quarto and Quartodoc are installed.

The main commands are:

# Change directory to /docs folder
cd docs 

# Build the documentation
quartodoc build

# Preview the website
quarto preview

You should now see the website on your localhost.

Making Tutorials

We will eventually need to make some tutorials and documentation. We will cover this later, after we create the core timetk functions.

Publishing Changes

  • You can just make a pull request with any changes. Once I merge I'll publish.
  • The command is quarto publish gh-pages, which publishes to the gh-pages branch.

Plot_timeseries - bug

Creating a ticket for known bugs in plot_timeseries.

Removed tests on these until bugs are fixed.

  • BUG in plotly with "v" direction

`tk.plot_timeseries()`

Implement tk.plot_timeseries() similar to R timetk plot_time_series().

  • Plotnine Implementation
  • Plotly Implementation

#2

Lead: Matt Dancho & Samuel Macedo
