business-science / pytimetk
Time series easier, faster, more fun. Pytimetk.
Home Page: https://business-science.github.io/pytimetk/
License: MIT License
Dataset: Please use the walmart_sales dataset, since it is a demand forecasting dataset.
Future Frame: Make use of the future frame function to create future dates by ID. This will allow us to show the future forecast.
Plotting: Where possible, use plot_timeseries.
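To make the "future dates by ID" idea concrete, here is a minimal plain-pandas sketch. It does not call pytimetk; the helper name `future_dates_by_id`, the toy rows, and the `id` / `Date` / `Weekly_Sales` column names are assumptions for illustration only.

```python
import pandas as pd

def future_dates_by_id(df, id_col, date_col, periods, freq):
    """Return `periods` future dates for each ID (hypothetical helper)."""
    frames = []
    for key, grp in df.groupby(id_col):
        last = grp[date_col].max()
        # date_range includes `last` itself, so request one extra and drop it
        future = pd.date_range(start=last, periods=periods + 1, freq=freq)[1:]
        frames.append(pd.DataFrame({id_col: key, date_col: future}))
    return pd.concat(frames, ignore_index=True)

# Toy weekly data standing in for the real dataset (values are made up)
sales = pd.DataFrame({
    "id": ["1_1", "1_1", "1_2", "1_2"],
    "Date": pd.to_datetime(["2012-10-19", "2012-10-26", "2012-10-19", "2012-10-26"]),
    "Weekly_Sales": [10.0, 12.0, 7.0, 9.0],
})

future = future_dates_by_id(sales, "id", "Date", periods=3, freq="7D")
```

The resulting frame has one row per ID and future date, and can be concatenated onto the history before forecasting.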
Per @JustinKurland:
There are a lot of opportunities for more augmentation functions:
tk.augment_logrithmic()
tk.augment_polynomial()
tk.augment_hilbert()
tk.augment_wavelet()
tk.augment_short_fourier() <- This is different than the normal Fourier transform in that it breaks a signal into smaller segments to provide a time-varying analysis with adjustable time and frequency resolutions.
These are just a few, but all represent further opportunities to try and add valuable information that has historically been leveraged in the extant time series and signal processing literature.
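As a rough illustration of the short-time Fourier idea described above (segmenting a signal and transforming each segment), here is a NumPy-only sketch. The function name `short_time_fourier`, the Hann taper, and the window/hop sizes are assumptions, not a proposed pytimetk API.

```python
import numpy as np

def short_time_fourier(x, window, hop):
    """Magnitude spectra of overlapping, Hann-tapered segments of x."""
    x = np.asarray(x, dtype=float)
    taper = np.hanning(window)
    starts = range(0, len(x) - window + 1, hop)
    segments = np.stack([x[s:s + window] * taper for s in starts])
    # One row per segment, one column per frequency bin
    return np.abs(np.fft.rfft(segments, axis=1))

# Synthetic signal whose period changes from 16 samples to 4 samples halfway through
t = np.arange(256)
signal = np.where(t < 128, np.sin(2 * np.pi * t / 16), np.sin(2 * np.pi * t / 4))

spec = short_time_fourier(signal, window=64, hop=32)
```

Each row of `spec` is the spectrum of one segment, so the dominant bin moves as the signal's frequency changes over time. A single transform over the whole series would not show when that change happens.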
Looks like timetk is taken on PyPI. pytimetk looks open.
Create a tk.ts_summary() function similar to timetk in R: https://business-science.github.io/timetk/reference/tk_summary_diagnostics.html
Create pytest tests for Data Wrangling Functions. Use ChatGPT to help.
tk.summarize_by_time()
tk.future_frame()
tk.pad_by_time()
Add guide on Data Wrangling. Cover functions with examples:
Will be housed here: https://business-science.github.io/pytimetk/guides/04_wrangling.html
Lead: Lucas O
Implement tk.augment_fourier() similar to how R timetk tk_augment_fourier() and vec_fourier work:
Lead: Justin Kurland
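For orientation, Fourier features are usually sine and cosine terms of a numeric time index at one or more periods. The sketch below is one possible plain-pandas version; the helper name, the column naming scheme, and the choice of days since the Unix epoch as the time index are assumptions, and it is not claimed to match the R implementation.

```python
import numpy as np
import pandas as pd

def add_fourier_features(df, date_col, periods, max_order):
    """Append sin/cos columns for each period and order (hypothetical helper)."""
    out = df.copy()
    # Numeric time index: days since the Unix epoch (an assumption of this sketch)
    t = (out[date_col] - pd.Timestamp("1970-01-01")) / pd.Timedelta(days=1)
    for period in periods:
        for k in range(1, max_order + 1):
            angle = 2 * np.pi * k * t / period
            out[f"{date_col}_sin{k}_{period}"] = np.sin(angle)
            out[f"{date_col}_cos{k}_{period}"] = np.cos(angle)
    return out

daily = pd.DataFrame({"date": pd.date_range("2023-01-01", periods=14, freq="D")})
features = add_fourier_features(daily, "date", periods=[7], max_order=2)
```

With a period of 7 days, each feature repeats weekly, so rows one week apart carry the same values.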
The timetk for python basics guide is the best place to start learning about the package philosophy: https://business-science.github.io/pytimetk/guides/02_timetk_concepts.html
Refer to this thread: #25 (comment)
Develop a minimal package with the most important functions.
Use this guide: https://py-pkgs.org/03-how-to-package-a-python
summarise_by_time() / summarize_by_time()
plot_time_series() - Not sure if we should go with plotly or altair for interactive mode. I feel we should go with plotnine for non-interactive. Will need smooth_vec().
future_frame() - We will also need tk_make_future_timeseries() and tk_make_timeseries()
pad_by_time()
Note - These functions should overwrite columns that are named the same in the input data frame.
tk.augment_timeseries_signature() - tk.get_timeseries_signature()
tk.augment_holiday_signature() - Uses holidays package
tk.augment_lags() / tk.augment_leads()
tk.augment_rolling()
tk.augment_fourier()
tk.ts_features()
tk.anomalize()
tk.ts_summary()
Add additional functionality that was not identified in Phases 1-3.
corr does not appear to be a valid agg_function
For example:
df \
    .groupby("category_2") \
    .summarize_by_time(
        date_column='order_date',
        value_column=['total_price'],
        freq="MS",
        agg_func=['corr']
    )
Will generate the error:
AttributeError: 'corr' is not a valid function for 'DatetimeIndexResamplerGroupby' object
I think simply modifying the docstring here:
- "sum": Sum of values
- "mean": Mean of values
- "median": Median of values
- "min": Minimum of values
- "max": Maximum of values
- "std": Standard deviation of values
- "var": Variance of values
- "first": First value in group
- "last": Last value in group
- "count": Count of values
- "nunique": Number of unique values
- "corr": Correlation between values <- Just remove
is the simplest solution. I am not entirely sure what the intended use for corr here was anyway: was it for comparing to features/covariates, or was it meant to compare from t1 to t2 to t3 ...?
Regardless, we should just tweak the docstring for now.
In addition, the function as currently written includes a 'kind' parameter that defaults to 'timestamp'. The docstring does not say that 'period' also works; this should be documented.
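For comparison, a monthly summary with aggregations from the docstring list can be written in plain pandas as below. This is only a sketch of roughly equivalent behavior, not the pytimetk implementation; the toy rows are made up, and treating 'period' as "convert the timestamp column to monthly periods" is an assumption about what that option is meant to do.

```python
import pandas as pd

orders = pd.DataFrame({
    "category_2": ["Road", "Road", "Mountain", "Road"],
    "order_date": pd.to_datetime(["2011-01-07", "2011-01-20", "2011-01-10", "2011-02-03"]),
    "total_price": [100, 250, 400, 50],
})

# Group by category and month start ("MS"), then apply per-column reducers
monthly = (
    orders
    .groupby(["category_2", pd.Grouper(key="order_date", freq="MS")])["total_price"]
    .agg(["sum", "mean"])
    .reset_index()
)

# Assumed analogue of kind='period': represent each month as a Period
monthly_period = monthly.assign(order_date=monthly["order_date"].dt.to_period("M"))
```

Reducers such as sum and mean operate on one column at a time, whereas a correlation needs two series, which may be why it does not fit this interface.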
Write tests for tk.ts_summary()
Investigate parallel processing:
Any other long running functions?
Pandarallel: https://nalepae.github.io/pandarallel/
Need pytest tests for augment functions. (Use ChatGPT to help)
After reviewing polars and pandas more closely, I'm questioning the separation of the value column and the agg functions.
Here's how polars accomplishes aggregations:
df.group_by("a").agg(
    b_sum=pl.sum("b"),
    c_mean_squared=(pl.col("c") ** 2).mean(),
)
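For reference, pandas offers a comparable pattern through named aggregation, where each output column is declared together with its source column and function. The data frame below is a made-up example mirroring the polars snippet.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["x", "x", "y"],
    "b": [1, 2, 3],
    "c": [2.0, 4.0, 3.0],
})

# Named aggregation: output_name=(source_column, function)
out = df.groupby("a").agg(
    b_sum=("b", "sum"),
    c_mean_squared=("c", lambda s: (s ** 2).mean()),
)
```

In both libraries the column and its aggregation travel together, which is the coupling this issue contrasts with separate value column and agg function arguments.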
#65 Tracks Augmentors discussion
The color_column was not set up properly.
Add a guide Augment Functions including:
tk.augment_timeseries_signature()
tk.augment_holiday_signature()
tk.augment_lags() and tk.augment_leads()
tk.augment_rolling()
Will go here Guide: Adding Features (Augmenting): https://business-science.github.io/pytimetk/guides/05_augmenting.html
Example Guide (Data Wrangling): https://business-science.github.io/pytimetk/guides/04_wrangling.html
When plotting grouped data, matplotlib returns an image size error.
ValueError: Image size of 140000x100000 pixels is too large. It must be less than 2^16 in each direction.
However, if we explicitly define the width and height, matplotlib works as expected.
Need default plot size to be defined.
import timetk as tk

df = tk.load_dataset('m4_monthly', parse_dates = ['date'])

# Without width/height: raises the image size error
fig = (
    df
    .groupby('id')
    .plot_timeseries(
        'date', 'value',
        color_column = 'id',
        facet_ncol = 2,
        x_axis_date_labels = "%Y",
        engine = 'matplotlib'
    )
)
fig

# With explicit width/height: works as expected
fig = (
    df
    .groupby('id')
    .plot_timeseries(
        'date', 'value',
        color_column = 'id',
        facet_ncol = 2,
        x_axis_date_labels = "%Y",
        width = 1200,
        height = 800,
        engine = 'matplotlib'
    )
)
fig
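The arithmetic behind the error can be sketched as follows. Matplotlib sizes figures in inches and multiplies by the DPI to get pixels; the DPI of 100 used here is matplotlib's default figure DPI, and the helper names are hypothetical. Whether the failing call actually passed pixel-like numbers as inches is not confirmed by this issue.

```python
MAX_PIXELS = 2 ** 16  # per-dimension limit quoted in the error message

def figure_pixels(width_in, height_in, dpi=100):
    """Pixel dimensions of a figure sized in inches (hypothetical helper)."""
    return int(width_in * dpi), int(height_in * dpi)

def is_renderable(width_in, height_in, dpi=100):
    """True when both pixel dimensions stay under the limit."""
    w, h = figure_pixels(width_in, height_in, dpi)
    return w < MAX_PIXELS and h < MAX_PIXELS

# The reported 140000x100000 px corresponds to 1400x1000 inches at 100 DPI
reported = figure_pixels(1400, 1000)

# 1200x800 px at 100 DPI is a 12x8 inch figure, well under the limit
explicit_ok = is_renderable(12, 8)
```

A default plot size would need to resolve to inch values of this smaller order before being handed to matplotlib.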
Need to work on how ts_summary() calculates a frequency when the pandas inferred frequency fails.
Goal: Make auto frequency detection possible (and less brittle).
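One possible fallback, sketched under assumptions: try `pd.infer_freq` first, and if it returns nothing (for example because of gaps), fall back to the most common spacing between consecutive timestamps. The function name `detect_frequency` and the choice of the mode as the fallback statistic are illustrative, not the approach adopted in the package.

```python
import pandas as pd

def detect_frequency(dates):
    """Return a pandas frequency string, or the most common gap as a Timedelta."""
    idx = pd.DatetimeIndex(dates).sort_values().unique()
    if len(idx) < 2:
        raise ValueError("Need at least two distinct timestamps")
    freq = None
    if len(idx) >= 3:  # infer_freq raises on fewer than three dates
        try:
            freq = pd.infer_freq(idx)
        except ValueError:
            freq = None
    if freq is not None:
        return freq
    # Fallback: most frequent difference between consecutive timestamps
    deltas = pd.Series(idx[1:] - idx[:-1])
    return deltas.mode().iloc[0]

regular = detect_frequency(pd.date_range("2022-01-01", periods=5, freq="D"))
gappy = detect_frequency(pd.to_datetime(
    ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-05", "2022-01-06"]
))
```

For the gappy series the strict inference fails, but the most common one-day spacing still identifies the series as daily.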
Need to discuss what we want to add for version 0.3.0.
Update pad_by_time behavior for grouped data to extend to the end of the max time of all groups.
Example: groups A and B, where A has values (with gaps) between 1/1/22 and 1/6/22, and B has values between 1/2/22 and 1/5/22.
We expect group B to have values filled in to the latest date across all groups.
In terms of data prep for a global model: if 1/6 is the end of my training data, we would need group B to be extended to 1/6 as well.
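The A/B example can be sketched in plain pandas as below. The helper name is hypothetical, the values are made up, and starting each group at its own first date (rather than a global start) is an assumption of this sketch.

```python
import pandas as pd

def pad_groups_to_global_end(df, id_col, date_col, freq):
    """Fill gaps per group and extend every group to the overall max date."""
    end = df[date_col].max()  # latest date across all groups
    padded = []
    for key, grp in df.groupby(id_col):
        full = pd.date_range(grp[date_col].min(), end, freq=freq)
        filled = (
            grp.set_index(date_col)
            .reindex(full)          # missing dates become NaN rows
            .rename_axis(date_col)
            .reset_index()
        )
        filled[id_col] = key        # restore the ID on the padded rows
        padded.append(filled)
    return pd.concat(padded, ignore_index=True)

df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2022-01-01", "2022-01-03", "2022-01-06",
                            "2022-01-02", "2022-01-05"]),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

padded = pad_groups_to_global_end(df, "id", "date", freq="D")
```

Group B now runs through 1/6 with a missing value on the added date, so both groups share the same end date for a global model.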
Priority 3 - Augment Operations -> change to Priority 4 - Augment Operations
Note - These functions should overwrite columns that are named the same in the input data frame.
tk_augment_timeseries_signature() - tk_get_timeseries_signature()
tk_augment_lags() / tk_augment_leads() - Will need lag_vec(), lead_vec()
tk_augment_slidify() - May need slidify_vec()
Add tk_augment_holiday_signature() and check it once the merge request is completed.
summarize_by_time = agg + resample: Simple aggregations to only single columns as a series, highly optimized
apply_by_time = apply + resample: More complex aggregations allowing users to access all columns in the data, less optimized
Refactor code to use typing: Union.
tk.augment_rolling(): Upgrade to handle rolling regressions.
Getting a weird bug. It's only when the color palette has duplicated colors.
import timetk as tk
import pandas as pd

stocks_df = tk.load_dataset("stocks_daily", parse_dates = True)

# Bollinger Bands
bollinger_df = stocks_df[['symbol', 'date', 'adjusted']] \
    .groupby('symbol') \
    .augment_rolling(
        date_column = 'date',
        value_column = 'adjusted',
        window = 20,
        window_func = ['mean', 'std'],
        center = False
    ) \
    .assign(
        upper_band = lambda x: x['adjusted_rolling_mean_win_20'] + 2*x['adjusted_rolling_std_win_20'],
        lower_band = lambda x: x['adjusted_rolling_mean_win_20'] - 2*x['adjusted_rolling_std_win_20']
    )

# Visualize
fig = (bollinger_df
    # zoom in on dates
    .query('date >= "2023-01-01"')
    # Convert to long format
    .melt(
        id_vars = ['symbol', 'date'],
        value_vars = ["adjusted", "adjusted_rolling_mean_win_20", "upper_band", "lower_band"]
    )
    # Group on symbol and visualize
    .groupby("symbol")
    .plot_timeseries(
        date_column = 'date',
        value_column = 'value',
        color_column = 'variable',
        # Adjust colors for Bollinger Bands ('#18BC9C' appears twice)
        color_palette =["#2C3E50", "#E31A1C", '#18BC9C', '#18BC9C'],
        smooth = False,
        facet_ncol = 2,
        width = 900,
        height = 700,
        engine = "plotly"
    )
)
fig
(bollinger_df
    # zoom in on dates
    .query('date >= "2023-01-01"')
    # Convert to long format
    .melt(
        id_vars = ['symbol', 'date'],
        value_vars = ["adjusted", "adjusted_rolling_mean_win_20", "upper_band", "lower_band"]
    )
    # Group on symbol and visualize
    .groupby("symbol")
    .plot_timeseries(
        date_column = 'date',
        value_column = 'value',
        color_column = 'variable',
        # Adjust colors for Bollinger Bands (all four colors distinct)
        color_palette =["#2C3E50", "#E31A1C", '#18BC9C', '#000000'],
        smooth = False,
        facet_ncol = 2,
        width = 900,
        height = 700,
        engine = "plotly"
    )
)
Write tests for tk.ts_features()
The easiest way to create documentation fast is to use Mintlify Doc Writer for Python
IMPORTANT: Quartodoc uses Numpy Docstring Formatting
You can then highlight a function and select "Generate Docstring".
Make sure Quarto and Quartodoc are installed.
The main commands are:
# Change directory to /docs folder
cd docs
# Build the documentation
quartodoc build
# Preview the website
quarto preview
You should now see a website on your localhost:
We will eventually need to make some tutorials and documentation. Will cover this later after we create the core timetk functions.
quarto publish gh-pages, which publishes to the gh-pages branch.
New dataset for use with the forthcoming anomaly functionality.
Integrate skimpy: https://aeturrell.github.io/skimpy/
skim
clean_column_names
Create a plotly theme template for timetk.
Creating a ticket for known bugs in plot_timeseries.
Removed the tests on these until the bugs are fixed.
Add tests to make sure that plot time series functions properly.
Implement tk.plot_timeseries() similar to R timetk plot_time_series().
Lead: Matt Dancho & Samuel Macedo
Running checklist of backends: #77 (comment)
Convert anomalize R package to timetk for time series anomaly detection.
anomalize
plot_anomalies
plot_anomaly_decomp
Documentation:
Showcase: