

License: Apache License 2.0


kedro-plugins's Introduction

Kedro-Plugins


First-party plugins maintained by the Kedro team.

The Kedro team maintains the following official Kedro plugins:

  • Kedro-Airflow: Allows users to easily deploy Kedro pipelines as Apache Airflow DAGs.

  • Kedro-Datasets: A collection of Kedro's data connectors.

  • Kedro-Docker: Simplifies the process of running a Kedro project within a Docker container.

  • Kedro-Telemetry: Gathers anonymised usage analytics to help drive future development of Kedro. Data will only be collected if consent is given.

kedro-plugins's People

Contributors

afaqueahmad7117, ahdrameraliqb, alamastor, ankatiyar, antonymilne, astrojuanlu, bpmeek, briancechmanek, cshaley, daniel-falk, datajoely, deepyaman, dimeds, edouard59, elenakhaustova, eromerobilbomatica, fdroessler, felipemonroy, hfwittmann, huongg, jmholzer, limdauto, lrcouto, merelcht, noklam, petitlepton, rashidakanchwala, sajidalamqb, sbrugman, stichbury


kedro-plugins's Issues

pip installing kedro-datasets[option] installs different dependencies than kedro[option]

Description

Installing kedro-datasets[option] installs a different set of dependencies than kedro[option]. It appears that kedro-datasets[option] is installing the superset of requirements for all datasets.

Context

This is currently blocking kedro-org/kedro#1495

Steps to Reproduce

  1. pip install "kedro[pandas.CSVDataSet]"; pip freeze > requirements-kedro.txt
  2. pip install "kedro-datasets[pandas.CSVDataSet]"; pip freeze > requirements-kedro-datasets.txt
  3. Compare the requirements

Understanding the average size of a Kedro project

Description

One of the value propositions that Kedro supports is that it's suitable for large-scale analytics projects. A large-scale analytics project has many nodes, datasets and pipelines and might involve many team members.

This task focuses on making it possible to collect data to support this value proposition via kedro-telemetry. Furthermore, I want an easy way to understand this data via Heap Analytics.

Context

Suppose we find this is not true in practice and the average Kedro project is small. In that case, it will affect some design choices and perhaps open up usability functionality that supports smaller projects, e.g. a YAML API for the pipelines and nodes.

We know that YAML is easier to work with; however, we have discouraged the YAML API pattern because it creates maintainability problems for large-scale projects. However, a YAML interface appears to be a pipeline creation pattern used by Ploomber and MLFlow Pipelines, and it may suggest that projects are smaller, which is why users can get away with using it. This needs to be verified.

Possible Implementation

This task will require us to find out a few things.

To work out the size of a Kedro project, probably tracked per username (following work on Heap):

  • Number of datasets

  • Number of nodes

  • Number of pipelines

To work out how many team members are on a project, to show that Kedro helps with collaboration:

  • Number of projects, identified by project_name

  • A count of username per project

Possible Alternatives

Another wild idea is that we may be able to find this information by counting elements on Kedro-Viz. Kedro-Viz already has a way to find large-scale pipelines when the pipeline warning graph appears. However, this information is incomplete because it assumes that users are also using Kedro-Viz and I'm also not sure what size of the pipeline will trigger the warning.


Cannot specify dtypes with `polars.CSVDataSet`

Description

As per title.

Context

I was trying to load a problematic CSV file, which loads just fine when calling pl.read_csv manually:

df = pl.read_csv(
    "../data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv",
    dtypes={
        "product_age": pl.Float64,
        "group_identifier": pl.Utf8,
    },
    parse_dates=True,
)

However, when turning that to a catalog entry:

openrepair-0_3-events-raw:
  type: polars.CSVDataSet
  filepath: data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv
  load_args:
    dtypes:
      product_age: pl.Float64
      group_identifier: pl.Utf8
    parse_dates: true

I get an error, because the dtypes are treated as strings, whereas Polars dtype objects are needed:

...
DataSetError: Failed while loading data from data set CSVDataSet(filepath=/Users/juan_cano/Projects/QuantumBlack 
Labs/talk-kedro-polars/data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv, load_args={'dtypes': 
{'group_identifier': pl.Utf8, 'product_age': pl.Float64}, 'parse_dates': True, 'rechunk': True}, protocol=file, 
save_args={}).
Cannot infer dtype from 'pl.Float64' (type: str)

Unfortunately I don't think Polars supports any other way of passing the dtypes, so I'm not sure if there's a workaround.

More broadly speaking, I guess my problem is that I don't know how to specify non-primitive types in a catalog entry.
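For reference, a minimal workaround sketch, assuming the dtype names are passed as plain strings in the catalog and resolved to Polars dtype objects before pl.read_csv is called (the resolve_polars_dtypes helper below is hypothetical, not part of kedro-datasets):

import polars as pl

def resolve_polars_dtypes(load_args):
    """Replace string values such as "pl.Float64" in load_args["dtypes"]
    with the corresponding Polars dtype classes."""
    dtypes = load_args.get("dtypes")
    if dtypes:
        load_args["dtypes"] = {
            column: getattr(pl, str(name).removeprefix("pl."))
            for column, name in dtypes.items()
        }
    return load_args

# Mirrors the catalog entry above.
load_args = resolve_polars_dtypes(
    {
        "dtypes": {"product_age": "pl.Float64", "group_identifier": "pl.Utf8"},
        "parse_dates": True,
    }
)
df = pl.read_csv(
    "data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv", **load_args
)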

Steps to Reproduce

See above

Expected Result

I should be able to specify dtypes for polars datasets.

Your Environment


  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.18.6
  • Kedro plugin and kedro plugin version used (pip show kedro-airflow): I'm installing kedro-datasets with pip install kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/wmoreiraa/kedro-plugins@4448b9f#subdirectory=kedro-datasets, hence from https://github.com/wmoreiraa/kedro-plugins@4448b9f
  • Python version used (python -V): 3.10.9
  • Operating system and version: macOS Ventura

kedro airflow plugins: ValueError Pipeline input(s) not found in the DataCatalog

When I run the Airflow job, I have this problem:

ValueError: Pipeline input(s) {'X_test', 'y_train', 'X_train'} not found in the DataCatalog

The generated DAG file is:
import sys
from collections import defaultdict
from datetime import datetime, timedelta
from pathlib import Path

from airflow import DAG
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.version import version
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


sys.path.append("/Users/mahao/airflow/dags/pandas_iris_01/src")




class KedroOperator(BaseOperator):
    @apply_defaults
    def __init__(self, package_name: str, pipeline_name: str, node_name: str,
                 project_path: str, env: str, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.package_name = package_name
        self.pipeline_name = pipeline_name
        self.node_name = node_name
        self.project_path = project_path
        self.env = env

    def execute(self, context):
        configure_project(self.package_name)
        with KedroSession.create(self.package_name,
                                 self.project_path,
                                 env=self.env) as session:
            session.run(self.pipeline_name, node_names=[self.node_name])


# Kedro settings required to run your pipeline
env = "local"
pipeline_name = "__default__"
#project_path = Path.cwd()
project_path = "/Users/mahao/airflow/dags/pandas_iris_01"
print(project_path)

package_name = "pandas_iris_01"

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

# Using a DAG context manager, you don't have to specify the dag property of each task
with DAG(
        "pandas-iris-01",
        start_date=datetime(2019, 1, 1),
        max_active_runs=3,
        schedule_interval=timedelta(
            minutes=30
        ),  # https://airflow.apache.org/docs/stable/scheduler.html#dag-runs
        default_args=default_args,
        catchup=False  # enable if you don't want historical dag runs to run
) as dag:

    tasks = {}

    tasks["split"] = KedroOperator(
        task_id="split",
        package_name=package_name,
        pipeline_name=pipeline_name,
        node_name="split",
        project_path=project_path,
        env=env,
    )

    tasks["make-predictions"] = KedroOperator(
        task_id="make-predictions",
        package_name=package_name,
        pipeline_name=pipeline_name,
        node_name="make_predictions",
        project_path=project_path,
        env=env,
    )

    tasks["report-accuracy"] = KedroOperator(
        task_id="report-accuracy",
        package_name=package_name,
        pipeline_name=pipeline_name,
        node_name="report_accuracy",
        project_path=project_path,
        env=env,
    )

    tasks["split"] >> tasks["make-predictions"]

    tasks["split"] >> tasks["report-accuracy"]

    tasks["make-predictions"] >> tasks["report-accuracy"]

Improve kedro-telemetry logging

There are two problems with how kedro-telemetry does logging at the moment:

  1. Sometimes it uses click.secho and sometimes it uses logging. This is inconsistent.
  2. Due to changes in kedro-org/kedro#1461, as it stands, no kedro-telemetry logging message that is < WARNING will display now.

For point 1, technically speaking I think both of these messages should use logging.debug (not info) rather than click.echo. Since by default these messages should not be shown to a user (confirmed by @yetudada in sprint planning), setting them to logging.debug will make them disappear.

So all that should be needed here is modifying L64 and L57 to both use logging.debug, check that the messages disappear when you do kedro run, and update any tests affected.
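A minimal sketch of what the change could look like (the message text below is illustrative, not the actual kedro-telemetry wording):

import logging

logger = logging.getLogger(__name__)

# Debug-level messages are hidden by the default logging configuration,
# so users no longer see them on `kedro run`.
logger.debug("Kedro-Telemetry consent not found; skipping telemetry.")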

Note, as per discussion in kedro-org/kedro#1546, at the moment there's no way for a user to change whether or not the kedro-telemetry messages are shown by altering project-side logging.yml because before_command_run hooks are run before that logging configuration is set up. This is fine for now.

Airflow UI not showing up using kedro

Description

I have configured my kedro project with airflow following the documentation at https://kedro.readthedocs.io/en/stable/deployment/airflow_astronomer.html

Context

Everything went well, but at the end the Airflow UI is not available at localhost:8080:

Env file ".env" found. Loading...
Sending build context to Docker daemon  128.5kB
Step 1/1 : FROM quay.io/astronomer/ap-airflow:2.0.0-buster-onbuild
# Executing 7 build triggers
 ---> Using cache
 ---> Using cache
 ---> Using cache
 ---> Using cache
 ---> Using cache
 ---> Using cache
 ---> 84c5336cc31b
Successfully built 84c5336cc31b
Successfully tagged kedro-airflow-iris_3fe060/airflow:latest


Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
INFO[0001] [0/3] [postgres]: Starting                   
INFO[0001] [1/3] [postgres]: Started                    
INFO[0001] [1/3] [scheduler]: Starting                  
INFO[0002] [2/3] [scheduler]: Started                   
INFO[0002] [2/3] [webserver]: Starting                  
INFO[0002] [3/3] [webserver]: Started                   
Airflow Webserver: http://localhost:8080
Postgres Database: localhost:5432/postgres
The default credentials are admin:admin
(base) sl@MacBook-Pro kedro-airflow-iris % python -V
Python 3.7.12 | packaged by conda-forge | (44db2626, Oct 29 2021, 16:19:11)
[PyPy 7.3.7 with GCC Clang 11.1.0]
  • Kedro version used 0.18.1
  • Kedro plugin and kedro plugin version used (pip show kedro-airflow):
  • Python version used (python -V): 3.7.12
  • Operating system and version: macOS Monterey (M1)

Any help is highly appreciated.

Consistent arguments between `kedro run` CLI and the `--config` yaml and provide examples in documentation

Description


https://discord.com/channels/778216384475693066/846330075535769601/968467444131848204
Currently, there are 2 ways to provide argument to kedro run CLI.

  1. Using the CLI argument kedro run --from-nodes=some_node
  2. Using the --config argument and define the arguments in a YAML file. kedro run --config=config.yml

config.yml

run:
  from_nodes: some_node # Notice this is underscore instead of a dash

The inconsistent API and the lack of examples in the documentation could confuse users.

Few proposed changes:

  1. The YAML config should use the same arguments as the CLI, i.e. using the dash instead of the underscore
  2. Add examples of YAML file in documentation
  3. (Optional, open to discuss) - Support native YAML syntax with lists and dicts. Currently params can be defined as a dict, but other arguments will not accept a list; from_nodes has to be a comma-separated string, e.g. from_nodes: xxxx,yyyy,zzzz
    https://discord.com/channels/778216384475693066/846330075535769601/968467444131848204
  4. [New] - Validate that the arguments are the expected arguments of the run function, and optionally print out the arguments that were parsed. Currently, if we provide an invalid argument the run still proceeds but does nothing with it, which is hard to debug, especially because the CLI arguments can override the YAML config.

Context

A consistent API will make users' lives easier, and there is no strong reason why we want two different APIs.

Possible Implementation


  • A fix in Kedro's run function should be doable; I expect it will be a small change to how the arguments are parsed.


Make Spark datasets more consistent with SQL datasets in terms of connection pattern

(transfer from Jira, created by @lorenabalan)

Follow-up to kedro-org/kedro#1163 and [KED-2865]

The Spark datasets already use a version of singletons, namely the Java object exposed through the PySpark API. However, we can take an extra step and create a singleton connection that works as a class attribute, similar to what we do now for SQL datasets (except it'll probably be an actual singleton, as I don't believe there's a case where users would need to connect to multiple Spark clusters).
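A minimal sketch of the proposed pattern, assuming a class-level SparkSession shared by all Spark datasets (names are illustrative, not the current implementation):

from pyspark.sql import SparkSession


class _SparkConnectionMixin:
    # Shared across all instances, mirroring the class-attribute connection
    # used by the SQL datasets.
    _spark_session = None

    @classmethod
    def _get_spark(cls) -> SparkSession:
        if cls._spark_session is None:
            # getOrCreate() returns the active session if one already exists.
            cls._spark_session = SparkSession.builder.getOrCreate()
        return cls._spark_session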

Multi-level partitioned dataset with filters on rows and columns using PyArrow

Description

Multi-level partitioned dataset with filters on rows and columns, to easily extract only the required data from a datastore.

Context

Easily add a multi-level partitioned dataset.

Possible Implementation

from copy import deepcopy
from typing import Any, Dict
from pathlib import PurePosixPath
from kedro.io import AbstractDataSet
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


class PyarrowParquetDataSet(AbstractDataSet):
    DEFAULT_LOAD_ARGS = {}  # type: Dict[str, Any]
    DEFAULT_SAVE_ARGS = {}  # type: Dict[str, Any]

    def __init__(self, folderpath: str, load_args: Dict[str, Any] = None, save_args: Dict[str, Any] = None):
        self._filepath = PurePosixPath(folderpath)

        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

    def _load(self):
        dataset = pq.ParquetDataset(self._filepath, use_legacy_dataset=False, **self._load_args)
        return dataset

    def _save(self, df: pd.DataFrame) -> None:
        table = pa.Table.from_pandas(df)
        pq.write_to_dataset(table, root_path=self._filepath, use_legacy_dataset=False, **self._save_args)

    def _describe(self) -> Dict[str, Any]:
        """Returns a dict that describes the attributes of the dataset"""
        return dict(filepath=self._filepath, load_args=self._load_args, save_args=self._save_args)

Catalog.yml:

raw_store_2m:
  type: auto_trade.extras.datasets.pyarrow_parquet_dataset.PyarrowParquetDataSet
  folderpath: data/01_raw/ohlcv/
  load_args:
    filters:
  save_args:
    partition_cols: ['symbol','year','month']
    existing_data_behavior: overwrite_or_ignore

MSSQLQueryDataSet that extends pandas.SQLQueryDataSet

At Fieldbox.ai, we needed a Microsoft SQL Server DataSet for a particular project. A read DataSet is enough. For that, we used the pandas.SQLQueryDataSet class and extended it.

Is it possible to add this feature?

We have a working implementation (using pyodbc) and we can work with the Kedro team to make it better, add documentation, tests, and so on.

Here is a code snippet:

"""``SQLDataSet`` to load to an MSSQL backend using `pyodbc`"""

import copy
import datetime as dt
from typing import Any, Dict, Optional

import pandas as pd
import pyodbc
from kedro.extras.datasets.pandas import SQLQueryDataSet
from kedro.io.core import DataSetError


class MSSQLQueryDataSet(SQLQueryDataSet):
    """An extension of SQLQueryDataSet that works with MSSQL (read mode only)."""

    def __init__(
        self,
        sql: str,
        credentials: Optional[Dict[str, Any]] = None,
        load_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._load_args = copy.deepcopy(load_args) if load_args is not None else {}
        self._load_args["sql"] = sql
        self._filepath = None
        expected_cred_keys = {"server", "database", "username", "password", "port"}
        if credentials is None or not expected_cred_keys.issubset(
            set(credentials.keys())
        ):
            raise DataSetError(
                f"Credentials must be provided, with the keys {expected_cred_keys!r}."
            )
        # We delay the pyodbc connection creation to the load, we just set various
        # connection values.
        self.server = credentials["server"]
        self.database = credentials["database"]
        self.username = credentials["username"]
        self.password = credentials["password"]
        self.port = credentials["port"]
        self.adapt_date_params()

    def _describe(self) -> Dict[str, Any]:
        load_args = copy.deepcopy(self._load_args)
        return dict(
            sql=load_args.pop("sql", None),
            load_args=load_args,
        )

    def _make_connection_str(self) -> str:
        # https://learn.microsoft.com/en-us/sql/connect/python/pyodbc/step-3-proof-of-concept-connecting-to-sql-using-pyodbc?view=sql-server-ver16
        return f"DRIVER={{SQL Server}};SERVER={self.server},{self.port};DATABASE={self.database};ENCRYPT=yes;UID={self.username};PWD={self.password}"  # noqa

    def _load(self) -> pd.DataFrame:
        connection = pyodbc.connect(self._make_connection_str())
        df = pd.read_sql_query(self._load_args["sql"], connection)
        return df

    def adapt_date_params(self) -> None:
        """We need to change the format of datetime parameters.

        MSSQL expects datetimes in the exact format %Y-%m-%dT%H:%M:%S.
        Here, we also accept plain dates"""
        new_load_args = {}
        for arg, value in self._load_args.get("params", {}).items():
            try:
                as_date = dt.date.fromisoformat(value)
                new_val = dt.datetime.combine(as_date, dt.time.min)
                new_load_args[arg] = new_val.strftime("%Y-%m-%dT%H:%M:%S")
            except (TypeError, ValueError):
                new_load_args[arg] = value
        if new_load_args:
            self._load_args["params"] = new_load_args
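A hypothetical usage example for the snippet above (the server and credential values are placeholders):

credentials = {
    "server": "my-server.example.com",
    "database": "my_database",
    "username": "kedro_user",
    "password": "***",
    "port": 1433,
}
data_set = MSSQLQueryDataSet(
    sql="SELECT TOP 10 * FROM dbo.sales",
    credentials=credentials,
)
df = data_set.load()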

Let us know if this is something that makes sense. 👌

Choose tool to allow testing a plugin across build environments using a single command

Description

Decide on a tool to use to make it easier to test the plugins in kedro-plugins individually.

Context

The plugins are all in a mono-repo, but are not related and so they all require slightly different run environments.

Possible Tools

  • Use tox per-plugin to allow testing a plugin across build environments using a single command (tox). Risks: conda-installed packages may need to be handled differently, and accessing the pre-commit config above the plugin root is less clean.
    Recommend not using tox 4 for now until we are more confident about broader plugin support (e.g. tox-dev/tox-conda#156)
  • nox
  • hatch

Issue outcome

Decision on which tool to use. This ticket should be a spike. The implementation of the tool is a separate task.

`pandas.ParquetDataSet` does not use pandas API for folders

Description

If the ParquetDataSet points to a folder instead of a file (e.g. if the data is partitioned with hive style, such as Year=2023/Month=1/...), then pyarrow.parquet.ParquetDataset.read is used instead of pandas.read_parquet. The API is not the same, and if I want to use, say, filters, it fails.

Context

I want to read a heavily partitioned parquet dataset, and I want to use filters and columns to limit the IO. Those are exposed by pandas.read_parquet but not pyarrow.parquet.ParquetDataset.read which is the method called by kedro when the parquet dataset points to a folder.
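A possible project-level workaround, sketched under the assumption of local files only (no fsspec/credentials handling); the class name is hypothetical:

import pandas as pd
from kedro.extras.datasets.pandas import ParquetDataSet


class FolderParquetDataSet(ParquetDataSet):
    """Always delegate to pandas.read_parquet, even when filepath is a folder,
    so that load_args such as ``filters`` and ``columns`` are honoured."""

    def _load(self) -> pd.DataFrame:
        return pd.read_parquet(str(self._get_load_path()), **self._load_args)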

Steps to Reproduce

Start a kedro project, install pandas and pyarrow.
Then in a python shell:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np
df = pd.DataFrame({"x": ["a", "a", "b", "b"], "y": range(4)})
schema = pa.Schema.from_pandas(df)
df.to_parquet("data/01_raw/partition", partition_cols=["x"], schema=schema)

Set conf/base/catalog.yml to

part:
  type: pandas.ParquetDataSet
  filepath: data/01_raw/partition
  load_args:
    filters: [[x, ==, a]]

Then in kedro ipython, we get

In [2]: catalog.load("part")
[02/15/23 10:41:55] INFO     Loading data from 'part' (ParquetDataSet)...                         data_catalog.py:343
Traceback (most recent call last):
  File "<redacted>parquet/lib/python3.10/site-packages/kedro/io/core.py", line 186, in load
    return self._load()
  File "<redacted>parquet/lib/python3.10/site-packages/kedro/extras/datasets/pandas/parquet_dataset.py", line 170, in _load
    data = (
        pq.ParquetDataset(load_path, filesystem=self._fs)
        .read(**self._load_args)
        .to_pandas()
    )
TypeError: _ParquetDatasetV2.read() got an unexpected keyword argument 'filters'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<ipython-input-2-3cc17d7dd5a1>", line 1, in <module>
  File "<redacted>parquet/lib/python3.10/site-packages/kedro/io/data_catalog.py", line 347, in load
    result = dataset.load()
  File "<redacted>parquet/lib/python3.10/site-packages/kedro/io/core.py", line 604, in load
    return super().load()
  File "<redacted>parquet/lib/python3.10/site-packages/kedro/io/core.py", line 195, in load
    raise DataSetError(message) from exc
DataSetError: Failed while loading data from data set
ParquetDataSet(filepath=<redacted>data/01_raw/partition, load_args={'filters':
[['x', '==', 1]]}, protocol=file, save_args={}).
_ParquetDatasetV2.read() got an unexpected keyword argument 'filters'

Expected Result

We should get the equivalent of pandas.read_parquet:

In [6]: pd.read_parquet("data/01_raw/partition/", filters=[("x", "==", "a")])
Out[6]: 
   y  x
0  0  a
1  1  a

Actual Result

See above

Your Environment


  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.18.4
  • Python version used (python -V): Python 3.10.9
  • Operating system and version: Ubuntu 20.04.5

Refactor kedro-datasets testing

Description

After discussion of #104 @jmholzer mentioned there are other areas of the test suite that may not be as robust as hoped. This issue proposes going in and exploring the test setup, identifying weaker tests and refactoring/rewriting them.

Should MatplotlibWriter multiple plot functionality be removed in favour of PartitionedDataSet?

MatplotlibWriter currently supports 3 different save modes:

  • save a single plt.figure to a png file
  • save List[plt.figure] to multiple png files (labelled 0.png, 1.png, etc.)
  • save Dict[str, plt.figure] to multiple png files (labelled by dictionary keys)

There's a recently-added overwrite option associated with the latter two modes (kedro-org/kedro#868). This also exists for PartitionedDataSet.

The current behaviour has some problems:

  • it's very weird because it's the only dataset that has multiple save modes possible
  • (less important because this will still need to be solved on kedro-viz even if we change how it works...) it complicates some things in kedro-viz (#1626 kedro-org/kedro-viz#783)

On the other hand, the ability to save multiple plots rather than define one dataset per plot is essential. I have used it myself many times and seen it used a lot.

So, my question is: should we replace the matplotlib save modes that do multiple plots with wrapping MatplotlibWriter in PartitionedDataSet instead? Leaving aside how we would do this technically for the moment, would this be a good change to make, i.e. will this be a user-friendly solution here? Will it allow everything we need to allow in terms of functionality?
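For illustration, wrapping MatplotlibWriter in PartitionedDataSet could look roughly like this (a sketch only; the path and figure names are made up):

import matplotlib.pyplot as plt
from kedro.extras.datasets.matplotlib import MatplotlibWriter
from kedro.io import PartitionedDataSet

fig_a, fig_b = plt.figure(), plt.figure()

plots = PartitionedDataSet(
    path="data/08_reporting/plots",
    dataset=MatplotlibWriter,
    filename_suffix=".png",
)
# One file per dictionary key, e.g. plot_a.png and plot_b.png.
plots.save({"plot_a": fig_a, "plot_b": fig_b})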

My suspicion is that the only reason we don't already use PartitionedDataSet for this is historical (MatplotlibWriter was added to contrib at the same time PartitionedDataSet was added to core).

Tagging @Galileo-Galilei who I suspect will just have the answers here 😀

External tables support for SparkHiveDataSet

Description

SparkHiveDataSet does not allow external Hive tables at the moment. External tables are often encountered when the org database is outside Hive and the table needs to be hosted in Hive. More info is available at: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/using-hiveql/content/hive_create_an_external_table.html

Context

This will broaden the scope for Hive datasets. Right now any externally managed Hive dataset needs to be referenced via a custom dataset, and this happens quite often.

Possible Implementation

Implementation is super simple. The user needs to specify the keyword "EXTERNAL" in the DDL and specify a path for the table schema. Both can be tactically managed/input via the catalog. Based on this input, the dataset should internally be able to decide the next course of action and load/save data accordingly.

Possible Alternatives

Accessing the Hive table via HQL, but this again requires a (custom) HiveQueryDataSet which can access the metastore and run the query (a bit slow).

[spike] Investigate support for TaskFlowAPI in kedro-airflow

Description


Since Airflow 2.0, a simpler TaskFlow API for authoring DAGs has been released as an alternative to the Operator API. At the moment kedro-airflow supports only the Operator API, but it's good to keep an eye on this.
https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html
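For reference, a minimal TaskFlow-style sketch of what a generated DAG might look like; this is not what kedro-airflow produces today, and the package name and project path are placeholders:

from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False)
def kedro_taskflow_example():
    @task
    def run_node(node_name: str):
        from kedro.framework.project import configure_project
        from kedro.framework.session import KedroSession

        configure_project("my_package")  # placeholder package name
        with KedroSession.create("my_package", "/path/to/project") as session:
            session.run(node_names=[node_name])

    # Explicit dependency between the two task invocations.
    run_node("split") >> run_node("make_predictions")


example_dag = kedro_taskflow_example()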


Adding Polars' dataframe as a dataset

Description

Polars is a dataframe library written in Rust with a Python API that is gaining a lot of traction. It has ~13k stars on GitHub, and numerous blog posts are appearing on its performance.

Context

It certainly feels like it's only a matter of time before this library becomes one of the "standard" data analysis tools, like Dask, Spark and others have become. Having an out-of-the-box implementation in Kedro would allow organisations that are beginning to adopt this library to have a clean switch without having to implement a custom AbstractDataSet.

Possible Implementation

Reading and writing CSVs, Parquet and other formats in Polars seems relatively straightforward. One concern is that transcoding between pandas and Polars, Polars and Dask, etc. may require a little more thought, as Polars does not use an index.
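A minimal sketch of what a Polars CSV dataset could look like, assuming local files only and leaving out fsspec, versioning and transcoding concerns (all names are illustrative):

from pathlib import PurePosixPath
from typing import Any, Dict

import polars as pl
from kedro.io import AbstractDataSet


class PolarsCSVDataSet(AbstractDataSet):
    def __init__(self, filepath: str, load_args: Dict[str, Any] = None,
                 save_args: Dict[str, Any] = None):
        self._filepath = PurePosixPath(filepath)
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def _load(self) -> pl.DataFrame:
        return pl.read_csv(str(self._filepath), **self._load_args)

    def _save(self, data: pl.DataFrame) -> None:
        data.write_csv(str(self._filepath), **self._save_args)

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=str(self._filepath),
                    load_args=self._load_args,
                    save_args=self._save_args)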

Add an additional Jinja2-Template

Description

The current kedro-airflow plugin requires the kedro project to be installed on the Airflow worker. This can be a challenge when changes to the kedro project are made and need to be rolled out to the Airflow worker. An alternative to the current extension of the BaseOperator, which generates a KedroOperator, would be to use the PythonVirtualenvOperator. In that alternative setup, the worker would be able to install kedro project updates from a local/official PyPI server upon release of an update. The downside is that there is additional overhead when creating the virtualenv.

Context

I have run into issues when using the current setup, as I am getting OOM errors when running kedro DAGs that use the KedroOperator (the investigation into why is still ongoing). Further to this, and probably due to my inexperience, I was struggling to roll out updates from the kedro project to the Airflow workers without manually updating the worker or without what seemed like a lot of hassle in worker updates.

Possible Implementation

An alternative Jinja2 Template can be provided which follows very much the logic of the existing one but replaces the KedroOperator with a PythonVirtualenvOperator setup.

Basic elements of the setup can be found below:
Kedro function to be executed in the virtual environment:

def kedro_func(package_name, pipeline_name, node_name, project_path, env):
    """
    Example function that will be performed in a virtual environment.

    Importing at the module level ensures that it will not attempt to import the
    library before it is installed.
    """
    from kedro.framework.session import KedroSession
    from kedro.framework.project import configure_project

    configure_project(package_name)
    with KedroSession.create(package_name,
                             project_path,
                             env=env) as session:
        context = session.load_context()
        print(context.catalog.list())
        session.run(pipeline_name, node_names=[node_name])
    print('Finished')

Changes to the task generating loop:

    tasks = {}
    {% for node in pipeline.nodes %}
    tasks["{{ node.name | safe | slugify }}"] = PythonVirtualenvOperator(
        python_callable=kedro_func,
        requirements=['kedro', package_name],
        system_site_packages=False,
        task_id="{{ node.name | safe | slugify  }}",
        op_kwargs={
            "package_name": package_name,
            "pipeline_name": pipeline_name,
            "node_name": "{{ node.name | safe }}",
            "project_path": project_path,
            "env": env,
        }
    )
    {% endfor %}

Would be happy to receive feedback on this and if it is deemed useful I am happy to create a PR.

Malformed docstrings make Kedro docs fail

Description

Kedro doc builds on RTD are failing: https://readthedocs.org/projects/kedro/builds/19843581/

/home/docs/checkouts/readthedocs.org/user_builds/kedro/envs/2425/lib/python3.7/site-packages/kedro/datasets/json/json_dataset.py:docstring of kedro.datasets.json.json_dataset.JSONDataSet:12: WARNING: Unexpected indentation.
/home/docs/checkouts/readthedocs.org/user_builds/kedro/envs/2425/lib/python3.7/site-packages/kedro/datasets/json/json_dataset.py:docstring of kedro.datasets.json.json_dataset.JSONDataSet:13: WARNING: Block quote ends without a blank line; unexpected unindent.
/home/docs/checkouts/readthedocs.org/user_builds/kedro/envs/2425/lib/python3.7/site-packages/kedro/datasets/spark/spark_hive_dataset.py:docstring of kedro.datasets.spark.spark_hive_dataset.SparkHiveDataSet:9: WARNING: Bullet list ends without a blank line; unexpected unindent.

Which points to this:

"""``JSONDataSet`` loads/saves data from/to a JSON file using an underlying
filesystem (e.g.: local, S3, GCS). It uses native json to handle the JSON file.
Example usage for the
`YAML API <https://kedro.readthedocs.io/en/stable/data/\
data_catalog.html#use-the-data-catalog-with-the-yaml-api>`_:
.. code-block:: yaml
cars:
type: json.JSONDataSet
filepath: gcs://your_bucket/cars.json
fs_args:
project: my-project
credentials: my_gcp_credentials
Example usage for the
`Python API <https://kedro.readthedocs.io/en/stable/data/\
data_catalog.html#use-the-data-catalog-with-the-code-api>`_:
::
>>> from kedro_datasets.json import JSONDataSet
>>>
>>> data = {'col1': [1, 2], 'col2': [4, 5], 'col3': [5, 6]}
>>>
>>> data_set = JSONDataSet(filepath="test.json")
>>> data_set.save(data)
>>> reloaded = data_set.load()
>>> assert data == reloaded
"""

"""``SparkHiveDataSet`` loads and saves Spark dataframes stored on Hive.
This data set also handles some incompatible file types such as using partitioned parquet on
hive which will not normally allow upserts to existing data without a complete replacement
of the existing file/partition.
This DataSet has some key assumptions:
- Schemas do not change during the pipeline run (defined PKs must be present for the
duration of the pipeline)
- Tables are not being externally modified during upserts. The upsert method is NOT ATOMIC
to external changes to the target table while executing.
Upsert methodology works by leveraging Spark DataFrame execution plan checkpointing.
Example usage for the
`YAML API <https://kedro.readthedocs.io/en/stable/data/\
data_catalog.html#use-the-data-catalog-with-the-yaml-api>`_:
.. code-block:: yaml
hive_dataset:
type: spark.SparkHiveDataSet
database: hive_database
table: table_name
write_mode: overwrite
Example usage for the
`Python API <https://kedro.readthedocs.io/en/stable/data/\
data_catalog.html#use-the-data-catalog-with-the-code-api>`_:
::
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import (StructField, StringType,
>>> IntegerType, StructType)
>>>
>>> from kedro_datasets.spark import SparkHiveDataSet
>>>
>>> schema = StructType([StructField("name", StringType(), True),
>>> StructField("age", IntegerType(), True)])
>>>
>>> data = [('Alex', 31), ('Bob', 12), ('Clarke', 65), ('Dave', 29)]
>>>
>>> spark_df = SparkSession.builder.getOrCreate().createDataFrame(data, schema)
>>>
>>> data_set = SparkHiveDataSet(database="test_database", table="test_table",
>>> write_mode="overwrite")
>>> data_set.save(spark_df)
>>> reloaded = data_set.load()
>>>
>>> reloaded.take(4)
"""

Context

kedro-org/kedro#2425 and others are blocked.


Create high-level user/project statistics on Heap

Description

Product usage analytics helps us understand how Kedro is used. This information helps us determine if we have succeeded in developing certain features and gives us a guiding point for identifying if we must improve our approach.

We shipped the first version of Kedro-Telemetry to understand the usage of the CLI and Kedro-Viz. However, we're still missing some high-level information like:

  • How many Kedro users do we have? The research question is, "How many users identified by username have run at least one CLI command?"
  • How many users of Kedro-Viz do we have? The research question is, "How many users identified by username have opened Kedro-Viz or run the kedro viz CLI command?"
  • How many users of Kedro-Viz experiment tracking do we have? The research question is, "How many users identified by username have opened Kedro-Viz and have opened the runsList or experiment-tracking pages on Kedro-Viz?"
  • How many projects are using Kedro? The research question is, "How many projects identified by project_name have had someone run at least one CLI command from that project?"

All of these values assume that Kedro-Telemetry is installed and activated according to our consent-based workflow.

Context

Some of the complexities of why it is difficult to do this might lie in defining user identities on Heap.

There are two types of properties that Heap recognises, user and event. Properties are bits of metadata that are captured during user interactions with the application. User properties refer to any data related to the user. In contrast, event properties are metadata associated with any actions the user takes.

Here is what I have observed:

  • On Heap, a User ID is created, and it is unknown whether this field has a 1-to-1 mapping to our username field collected from Kedro-Telemetry. This field generates built-in charts on the Number of Users.
  • And on our side, we send all of our user properties like username or even project_name as event properties when they might be user properties.

Possible Implementation

There are two parts to this:

  1. How do we have a consistent user identifier on Heap? Can we use the User ID field? Can we send username to replace User ID?
  2. How do we make it possible to create a summative view of projects on Heap? Which may have to look at adding project_name or another project identifier to user properties.

Re: Point 2. This would require some discussion about what a user is. Is a user consistently defined by their username, or is a user a combination of username AND project_name?

Replace `identity` in kedro telemetry with unique `username`

Description

This is related to the heap issue. Looking at the kedro-telemetry plugin we currently assign the hashed computer name of users to the identity parameter for Heap. Having a look at the Heap API it makes more sense to assign this to something like username.

Note: The identity parameter needs to be unique, so we may want to find a way to ensure this. (Maybe we hash the concatenation of username and computer name?)

Reason for this:
Heap generates a unique userID that can't be manually changed and isn't related/mapped to the username parameter. However, if Heap finds the same identity it will merge that userID into the Heap profile of the identified user.
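A hedged sketch of the identifier suggested in the note above, i.e. hashing the concatenation of username and computer name (the helper is hypothetical):

import getpass
import hashlib
import socket


def unique_identity() -> str:
    # Combine the OS username with the hostname so that the same username on
    # different machines still produces distinct identities.
    raw = f"{getpass.getuser()}@{socket.gethostname()}"
    return hashlib.sha512(raw.encode("utf-8")).hexdigest()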

Context

#36

Snowflake Data Connectors (SnowPark)

Description

I think there's scope to create a series of data connectors that would allow Kedro users to connect to Snowflake in different ways. This usage pattern was identified in the kedro-org/kedro#1653 research, that sometimes our users want to leverage SQL-based workflows for their data engineering pipelines. These connectors essentially simplify use of Python when you need it for the data science part of your workflow.

While I have created this issue, I think it's important to document why we have seen users create Snowflake datasets and not leverage our pandas.SQLTableDataSet and pandas.SQLQueryDataSet to do the same, especially in the case of workflows based on Pandas.

Possible Implementation

This tasks proposes building out the following data connectors:

  • spark.SnowflakeTableDataSet - Loads and saves data from/to a table as a Spark DataFrame.
  • spark.SnowflakeQueryDataSet - Executes a SQL query on save(), and can load the result of a query as a Spark DataFrame.
  • pandas.SnowflakeTableDataSet - Loads and saves data from/to a table as a Pandas DataFrame.
  • pandas.SnowflakeQueryDataSet - Executes a SQL query on save(), and can load the result of a query as a Pandas DataFrame.

Kedro-Docker should sanitize the repository/tag name

Description

When I check out a Kedro project with a non-lowercase repository name, and I run kedro docker build, I get:

docker build --build-arg KEDRO_UID=501 --build-arg KEDRO_GID=20 --build-arg BASE_IMAGE=python:3.8-buster -t CamelCase-repo /Users/deepyaman/CamelCase-repo
invalid argument "CamelCase-repo" for "-t, --tag" flag: invalid reference format: repository name must be lowercase
See 'docker build --help'.

Context


The error message isn't that clear; even if the behavior isn't changed, it should point users to use the --image option for kedro docker build (not the Docker --tag).

Possible Implementation


Figure out how Docker checks their image name, and sanitize accordingly.
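One possible sanitisation, sketched under the assumption that Docker repository names must be lowercase alphanumerics plus ".", "_" and "-" (the helper is hypothetical):

import re


def sanitize_image_name(name: str) -> str:
    name = name.lower()
    name = re.sub(r"[^a-z0-9._-]+", "-", name)    # replace disallowed characters
    return name.strip("._-") or "kedro-project"   # never return an empty name


print(sanitize_image_name("CamelCase-repo"))  # -> "camelcase-repo"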

Possible Alternatives


Improve the error message.

Implement `exists_function` resolution using `fsspec` in `SparkDataSet`

Description

The exists_function is used by AbstractVersionedDataSet to resolve the path to a stored, versioned dataset. In other dataset implementations, the exists_function is set to the exists method of the dataset's fsspec filesystem object (self._fs.exists). In SparkDataSet, we instead have a complex series of logical tests to determine the exists_function; this is brittle (as every possible file system in use has to be anticipated a priori) and a source of error (as in kedro-org/kedro#1801).
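A hedged sketch of the proposed direction, i.e. reusing the fsspec filesystem inferred from the filepath instead of per-filesystem branching (the path is just an example):

import fsspec
from kedro.io.core import get_protocol_and_path

protocol, path = get_protocol_and_path("data/01_raw/example.parquet")
fs = fsspec.filesystem(protocol)
exists_function = fs.exists  # what AbstractVersionedDataSet would receive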

Context

SparkDataSet sees high use, so making this change would eliminate a source of bugs in a dataset that is important to our users.

kedro airflow plugins: ValueError Pipeline input(s) not found in the DataCatalog

Description

I used the kedro-airflow plugin to generate an Airflow DAG, but I got the following error:

Traceback (most recent call last):
  File "/Users/mahao/airflow/dags/pandas_iris_01_dag.py", line 38, in execute
    session.run(self.pipeline_name, node_names=[self.node_name])
  File "/Users/mahao/miniconda3/envs/airflow/lib/python3.8/site-packages/kedro/framework/session/session.py", line 391, in run
    run_result = runner.run(
  File "/Users/mahao/miniconda3/envs/airflow/lib/python3.8/site-packages/kedro/runner/runner.py", line 74, in run
    raise ValueError(
ValueError: Pipeline input(s) {'X_test', 'y_train', 'X_train'} not found in the DataCatalog

Your Environment


kedro, version 0.18.1
Name: kedro-airflow
Version: 0.5.0
Summary: Kedro-Airflow makes it easy to deploy Kedro projects to Airflow
Home-page: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-airflow
Author: Kedro
Author-email: 
License: Apache Software License (Apache 2.0)
Location: /Users/mahao/miniconda3/envs/airflow/lib/python3.8/site-packages
Requires: kedro, python-slugify, semver
Required-by: 
Python 3.8.13

Extend Kedro-Airflow CLI functionality with optional jinja template

Description

The default Jinja2 template works well as a base for getting started with kedro on Airflow. However, the fact that a few things are hardcoded in the plugin/template makes it hard to modify a variety of things. I would propose a couple of adjustments in order to allow for more customisability:

  • Provide CLI Arguments for folder and name of jinja2 Templates
  • Prefix/Postfix the name of the resulting dag files with the pipeline name if a pipeline was given

I have already implemented those in a branch and was curious if this would be a welcomed PR.

Context

In a recent project I found myself constantly having to adjust the resulting DAG files as I was experimenting with a variety of things. I ended up replacing the template file in the site-packages folder at some point, which struck me as too hacky. The above changes would allow users to provide their own Jinja2 templates, as well as generate Airflow DAGs for multiple pipelines without having to manually rename the output files.

Possible Implementation

Add two more click options (names just examples):

@click.option("-jd", "--jinja-dir", "jinja_template_dir", default=str(Path(__file__).parent))
@click.option("-j", "--jinja-file", "jinja_template", default="airflow_dag_template.j2")

Modify the loader and template respectively:

loader = jinja2.FileSystemLoader(jinja_template_dir)
jinja_env = jinja2.Environment(autoescape=True, loader=loader, lstrip_blocks=True)
jinja_env.filters["slugify"] = slugify
template = jinja_env.get_template(jinja_template)

Adjust the dag_filename if necessary:

if pipeline_name != "__default__":
    dag_filename = f"{package_name}_{pipeline_name}_dag.py"

Looking forward to some feedback on these suggestions.

Kedro-airflow issue: cannot find configuration path /conf/base

Description

I am starting to use airflow for a kedro ML project. I have airflow installed on MAC. I have a kedro project which runs successfully on my local MAC. Currently, trying to convert the kedro pipeline to an airflow executable task.

To do this, I have followed the instructions at https://github.com/quantumblacklabs/kedro-airflow

  1. kedro airflow create

  2. Copied the created dag to the airflow dags/ folder

  3. kedro package

This step created the dist/ with .whl and .egg files

  4. Checkpoints:
% which airflow
/Users/sl/mambaforge-pypy3/envs/airflow_env/bin/airflow
(airflow_env) % which python 
/Users/sl/mambaforge-pypy3/envs/airflow_env/bin/python
  5. The kedro package from step 3 is installed into the airflow executor's venv:

/Users/sl/mambaforge-pypy3/envs/airflow_env/bin/python -m pip install /dist/test_uk-0.1-py3-none-any.whl

Now the docs at https://kedro.readthedocs.io/en/stable/deployment/single_machine.html#package-based suggest that the pyproject.toml file and the conf/, data/ and logs/ folders have to be created manually, as they are not packaged in step 3.

Therefore, I have created these directories in the below location:

 % pwd
/Users/sl/mambaforge-pypy3/envs/airflow_env/lib/python3.10/site-packages/test_uk
(airflow_env) sl@MacBook-Pro test_uk % ls
__init__.py		__pycache__		data			logs			pipelines		settings.py
__main__.py		conf			info.log		pipeline_registry.py	pyproject.toml

Context

This gives the below error:

[2022-07-15, 20:54:59 EEST] {taskinstance.py:1909} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/Users/sl/airflow/dags/test_uk_dag.py", line 36, in execute
    with KedroSession.create(self.package_name,
  File "/Users/sl/mambaforge-pypy3/envs/airflow_env/lib/python3.10/site-packages/kedro/framework/session/session.py", line 172, in create
    session._setup_logging()
  File "/Users/sl/mambaforge-pypy3/envs/airflow_env/lib/python3.10/site-packages/kedro/framework/session/session.py", line 188, in _setup_logging
    conf_logging = self._get_logging_config()
  File "/Users/sl/mambaforge-pypy3/envs/airflow_env/lib/python3.10/site-packages/kedro/framework/session/session.py", line 176, in _get_logging_config
    conf_logging = self._get_config_loader().get(
  File "/Users/sl/mambaforge-pypy3/envs/airflow_env/lib/python3.10/site-packages/kedro/config/templated_config.py", line 161, in get
    config_raw = _get_config_from_patterns(
  File "/Users/sl/mambaforge-pypy3/envs/airflow_env/lib/python3.10/site-packages/kedro/config/common.py", line 69, in _get_config_from_patterns
    raise ValueError(
ValueError: Given configuration path either does not exist or is not a valid directory: /conf/base
[2022-07-15, 20:54:59 EEST] {taskinstance.py:1415} INFO - Marking task as UP_FOR_RETRY. dag_id=test-uk, task_id=preprocess, execution_date=20220714T050000, start_date=20220715T145449, end_date=20220715T145459
[2022-07-15, 20:54:59 EEST] {standard_task_runner.py:92} ERROR - Failed to execute job 8 for task preprocess (Given configuration path either does not exist or is not a valid directory: /conf/base; 85230)
[2022-07-15, 20:54:59 EEST] {local_task_job.py:156} INFO - Task exited with return code 1

Then I tried creating these directories in airflow/ where the dags/ are located:

 airflow % ls
airflow-scheduler.err		airflow-webserver-monitor.pid	airflow-webserver.pid		dags				webserver_config.py
airflow-scheduler.log		airflow-webserver.err		airflow.cfg			data
airflow-scheduler.out		airflow-webserver.log		airflow.db			logs
airflow-scheduler.pid		airflow-webserver.out		conf				pyproject.toml

And I still get the same error as above. I ran out of ideas and spent a couple of days trying different things. It could be an obvious thing, but I'm blind to it at the moment. Any help would be highly appreciated.

Your Environment


  • Kedro version used (pip show kedro or kedro -V):0.18.1
  • Kedro plugin and kedro plugin version used (pip show kedro-airflow): Version: 0.5.0
  • Python version used (python -V): Python 3.10.4
  • Operating system and version: macOS Monterey (M1)

Kedro Telemetry breaks packaged projects due to wrongly assuming `pyproject.toml` exists

Description

Kedro-Telemetry installed alongside a packaged and installed Kedro project breaks the project by assuming that the pyproject.toml file exists. The pyproject.toml is only a recipe for building the project and should not be assumed to exist in the current folder in all cases.

The problem was introduced with #62

Context

When deploying Kedro projects and if you have installed Kedro Telemetry, it breaks your project.

Steps to Reproduce

  1. Create a Kedro project
  2. Add a dependency on kedro-telemetry
  3. Package it through kedro package
  4. Install it in a different environment
  5. Run the project through ./<project> in a folder that contains only conf/

Expected Result

The project should run.

Actual Result

An exception is thrown.

Your Environment


  • Kedro version used (pip show kedro or kedro -V): 0.18.x
  • Kedro plugin and kedro plugin version used (pip show kedro-telemetry): 0.2.2
  • Python version used (python -V): Not relevant
  • Operating system and version: Not relevant

Fix failing test for loading CSVDataSet from S3 bucket

Description

The kedro-datasets test tests/pandas/test_csv_dataset.py::TestCSVDataSetS3::test_load_and_confirm currently fails for Python 3.10. It only fails when all the tests in tests/pandas/test_csv_dataset.py are run together; if it is called in isolation, the test passes.

If the class TestCSVDataSetS3 is moved to the top of the file, so that it is executed (in order) before the remaining tests in the file, the test case will pass. Since this test does not share resources with any other test in the file and also passes when called in isolation, the time taken between fixture definition and running the test may play a role in causing this error.

Context

This bug currently blocks all development work on kedro-datasets.

Steps to Reproduce

  1. Checkout the main branch.
  2. Navigate to the kedro-datasets directory.
  3. Run pytest --no-cov tests/pandas/test_csv_dataset.py

All tests in this file pass apart from TestCSVDataSetS3::test_load_and_confirm.

The following error is produced:

E           kedro.io.core.DataSetError: Failed while loading data from data set CSVDataSet(filepath=test_bucket/test.csv, load_args={}, protocol=s3, save_args={'index': False}).
E           Forbidden
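For reference, the same test passes when selected on its own, which supports the idea that ordering or timing, rather than the test logic itself, is the cause:

pytest --no-cov "tests/pandas/test_csv_dataset.py::TestCSVDataSetS3::test_load_and_confirm"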

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.18.3

Telemetry breaks everything if there are no write permissions

Description

As seen on Databricks Repos, which currently do not have write access, running any kedro command in a directory without write access makes kedro-telemetry raise an error, which then stops kedro from running at all.

This isn't the first time that an exception raised by kedro-telemetry has stopped kedro commands from working rather than just raising a warning and carrying on. I wonder if we should wrap the whole plugin in a try/except to catch any exceptions we don't anticipate. A broken kedro-telemetry shouldn't prevent users from using kedro.

Steps to reproduce

  1. Go to https://gitpod.io/#https://github.com/kedro-org/kedro
  2. pip install kedro-telemetry
  3. chmod a=rx /workspace/project
  4. kedro info

This will give the following exception and not execute the kedro command at all:

PermissionError: [Errno 13] Permission denied: '/workspace/project/.telemetry'
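A minimal sketch of the try/except wrapper suggested above, assuming a CLI-hook shape similar to kedro-telemetry's; the class name and the helper it calls are illustrative, not the plugin's actual internals:

import logging

from kedro.framework.cli.hooks import cli_hook_impl

logger = logging.getLogger(__name__)

class TelemetryCLIHooks:
    @cli_hook_impl
    def before_command_run(self, project_metadata, command_args):
        try:
            _send_heap_event(project_metadata, command_args)  # hypothetical helper wrapping the existing logic
        except Exception as exc:
            # Telemetry must never stop a kedro command, even for errors we did
            # not anticipate (e.g. PermissionError on a read-only repo).
            logger.warning("kedro-telemetry failed and was skipped: %s", exc)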

Kedro-airflow python-slugify version pin is incompatible with latest version of airflow

Description

kedro-airflow pins python-slugify with python-slugify~=4.0.
airflow pins python-slugify with python-slugify>=5.0 and supports the latest version of python-slugify.
A pin of python-slugify>=4.0,<7 would be great if kedro-airflow could support it.
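Concretely, the relaxed pin suggested above would just be this line in kedro-airflow's requirements:

python-slugify>=4.0,<7.0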

Context

I was trying to pip install kedro-airflow with the latest version of airflow and got a dependency resolution error.

Steps to Reproduce

  1. pip install apache-airflow==2.3.4 kedro-airflow
  2. Receive error

Expected Result

Ideally you can install both in the same environment.

Actual Result

You receive an error:

ERROR: Cannot install apache-airflow==2.3.4 and kedro-airflow==0.5.0 because these package versions have conflicting dependencies.

The conflict is caused by:
    apache-airflow 2.3.4 depends on python-slugify>=5.0
    kedro-airflow 0.5.0 depends on python-slugify~=4.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

Your Environment

Include as many relevant details about the environment in which you experienced the bug:
Not applicable

  • Kedro version used (pip show kedro or kedro -V):
  • Kedro plugin and kedro plugin version used (pip show kedro-airflow):
  • Python version used (python -V):
  • Operating system and version:

Kedro-dataset release process

Introduction

How are we going to release when certain libraries are not compatible? For example, if tensorflow has no support for Python 3.11, how do we handle this in our CI?

Background

Since the separation of kedro-datasets, it's now possible to upgrade kedro and kedro-datasets separately. Prior to this, kedro was always compatible with all datasets, so we didn't have this challenge before.

Problem

  • How do we make our CI work and allow certain DataSets to skip CI?
  • Should the user always install the latest version?
    • For example, let's say version 1.0.10 supports Python 3.10 for TensorFlow and 1.0.11 adds support for Python 3.11. In theory, if users are using Python < 3.11, it would not be a problem if they installed 1.0.11.

Possible Solution

  • We could create some kind of tag or decorator to skip tests at the file or module level. It may get a little bit messy.
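As a sketch of what that could look like with pytest's built-in markers (the version bound and reason are illustrative, not a decision):

import sys

import pytest

# Skip every test in this module on Python versions the underlying library does not support yet.
pytestmark = pytest.mark.skipif(
    sys.version_info >= (3, 11),
    reason="tensorflow does not support Python 3.11 yet",
)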

[kedro-docker] Add creation of compose.yml file

Description

I would use kedro-docker more if it could generate a compose.yml file so I could configure volumes and other components manually. Instead of using kedro docker I could just replicate the configuration and use docker-compose directly, making it easier to integrate. It would also give more freedom to make some tweaks with kedro-viz and other plugins.

Context

I tend not to use kedro docker directly; instead I create my own docker-compose file, but kedro docker init already helps a lot.

Possible Implementation

Make kedro docker init create a compose.yml file that makes it possible to run the project via docker-compose up.
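A sketch of the kind of compose.yml the command could generate; the service name, image tag and mounted folders below are assumptions mirroring a subset of what kedro docker run mounts today:

services:
  kedro:
    image: my-kedro-project
    command: kedro run
    volumes:
      - ./conf/local:/home/kedro/conf/local
      - ./data:/home/kedro/data
      - ./logs:/home/kedro/logs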

Kedro-airflow setup.py requires test_requirements.txt, but it's not packaged in pypi

Description

Kedro-airflow setup.py requires test_requirements.txt, but it's not packaged in pypi

Context

I am trying to package kedro-airflow and kedro-viz for conda-forge so that I can conda install them. Conda-forge runs setup.py as part of the build process. Kedro-airflow directly references test_requirements.txt in setup.py, but test_requirements.txt is not included in the source distribution published to PyPI, so running setup.py fails.

Steps to Reproduce

  1. Download the kedro-airflow source distribution from PyPI.
  2. Unzip it and run python setup.py.

Expected Result

Setup.py succeeds.

Actual Result

Setup.py fails because test_requirements.txt is missing.

 FileNotFoundError: [Errno 2] No such file or directory: 'test_requirements.txt'
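One possible fix, sketched under the assumption that setup.py currently opens test_requirements.txt unconditionally, is to make the read tolerant of the file being absent from the sdist (shipping the file via MANIFEST.in would be an alternative):

import os

def _read_requirements(filename):
    # Fall back to an empty list when the file is not shipped in the sdist.
    if not os.path.exists(filename):
        return []
    with open(filename, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip() and not line.startswith("#")]

test_requires = _read_requirements("test_requirements.txt")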

kedro-docker ModuleNotFoundError

Description

I have been trying to dockerize my Kedro project with the commands below:

pip install kedro-docker
kedro docker init
kedro docker build

The Docker file looks like below:

$ cat Dockerfile 
ARG BASE_IMAGE=python:3.6-buster
FROM $BASE_IMAGE

# install project requirements
COPY src/requirements.txt /tmp/requirements.txt

RUN pip install -r /tmp/requirements.txt && rm -f /tmp/requirements.txt

##install some custom python packages py_tools and methods
RUN pip install --extra-index-url https://developer:XXXXXXXXXX/simple py_tools
RUN pip install --extra-index-url https://developer:XXXXXXXXXX/pypi-all/simple methods

# add kedro user
ARG KEDRO_UID=999
ARG KEDRO_GID=0
RUN groupadd -f -g ${KEDRO_GID} kedro_group && \
useradd -d /home/kedro -s /bin/bash -g ${KEDRO_GID} -u ${KEDRO_UID} kedro

# copy the whole project except what is in .dockerignore
WORKDIR /home/kedro
COPY . .
RUN chown -R kedro:${KEDRO_GID} /home/kedro
USER kedro
RUN chmod -R a+w /home/kedro

EXPOSE 8888

CMD ["kedro", "run"]

I am having the issue below:

$ kedro docker run
docker run -v /Users/sl/project/conf/local:/home/kedro/conf/local -v /Users/sl/project/data:/home/kedro/data -v /Users/sl/project/logs:/home/kedro/logs -v /Users/sl/project/notebooks:/home/kedro/notebooks -v /Users/sl/project/references:/home/kedro/references -v /Users/sl/project/results:/home/kedro/results --rm --name project-run project kedro run
2022-06-21 11:17:15,371 - kedro.framework.cli.hooks.manager - INFO - Registered CLI hooks from 1 installed plugin(s): kedro-telemetry-0.2.1
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]: 2022-06-21 11:17:15,399 - kedro_telemetry.plugin - WARNING - Failed to confirm consent. No data was sent to Heap. Exception: 
2022-06-21 11:17:15,402 - kedro.framework.session.store - INFO - `read()` not implemented for `BaseSessionStore`. Assuming empty store.
Model version 20220621-111715
2022-06-21 11:17:15,472 - kedro.framework.session.session - INFO - ** Kedro project kedro
2022-06-21 11:17:15,972 - kedro.framework.session.store - INFO - `save()` not implemented for `BaseSessionStore`. Skipping the step.
Traceback (most recent call last):
  File "/usr/local/bin/kedro", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/cli/cli.py", line 217, in main
    cli_collection()
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/cli/cli.py", line 145, in main
    super().main(
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/cli/project.py", line 352, in run
    session.run(
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/session/session.py", line 344, in run
    pipeline = pipelines[name]
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/project/__init__.py", line 121, in inner
    self._load_data()
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/project/__init__.py", line 153, in _load_data
    register_pipelines = self._get_pipelines_registry_callable(
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/project/__init__.py", line 141, in _get_pipelines_registry_callable
    module_obj = importlib.import_module(pipelines_module)
  File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/kedro/src/project/pipeline_registry.py", line 7, in <module>
    from project.pipelines.data_science import pipeline as ds
  File "/home/kedro/src/project/pipelines/ds/pipeline.py", line 2, in <module>
    from project.pipelines.ds.model import train_model
  File "/home/kedro/src/epc_fi/pipelines/ds/model.py", line 2, in <module>
    from py_tools.ds.FKU.model import Model
ModuleNotFoundError: No module named 'py_tools.ds.FKU'

Environment

* kedro, version 0.18.1
* $ pip show kedro-docker
Name: kedro-docker
Version: 0.3.0
Summary: Kedro-Docker makes it easy to package Kedro projects with Docker.
Home-page: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker
Author: Kedro
Author-email: 
License: Apache Software License (Apache 2.0)
Location: /Users/sl/.conda/envs/test-new/lib/python3.10/site-packages
Requires: semver, anyconfig, kedro
Required-by: 
* Python 3.10.4
* Mac OS, Monterey, M1

Could anyone shed some light on how to fix this? Does anything else need to be defined in the Dockerfile, like the Python path?
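In case it helps with debugging: a quick way to check whether the subpackage actually made it into the image at all (the image name project matches the kedro docker run output above; this is only a diagnostic, not a fix):

docker run --rm project python -c "import py_tools, pkgutil; print(py_tools.__file__); print([m.name for m in pkgutil.iter_modules(py_tools.__path__)])"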

Make Kedro Compatible with Airflow 2.4.2

Description

Airflow is one of the primary deployment solutions for Kedro.
Airflow 2.4.2 depends on attrs>=22.1.0, while Kedro's requirements pin attrs~=21.3:
https://github.com/kedro-org/kedro/blob/c40873a5d4adfbc7b5970cb8b0c1133d53463bbf/dependency/requirements.txt#L2

Context

Unable to install Kedro & Airflow using conda in the same environment

Steps to Reproduce

  1. mamba create -n kedro-airflow "python=3.7.12" "airflow=2.4.2" "kedro>=0.18.3"
  2. The solve fails with:
     Encountered problems while solving:
     • package airflow-2.4.2-py310h51d5547_0 requires attrs >=22.1.0, but none of the providers can be installed

Expected Result

With just the change of attrs, from attrs~=21.3 to attrs>=22.1.0,<22.2, Kedro, Airflow and MLflow are compatible:

mamba create -n kedro-airflow "python=3.7.12" "anyconfig >=0.10.0,<0.11.0" "attrs >=21.3,<22.2" "cachetools >=4.1,<5.0" "click <9.0" "cookiecutter >=2.1.1,<3.0" "dynaconf >=3.1.2, <4.0" "fsspec >=2021.4,<=2022.5.0" "gitpython >=3.0,<4.0" "importlib_metadata >=3.6" "importlib_resources >=1.3" "jmespath >=0.9.5,<1.0" "jupyter_client >=5.1,<7.0" "pip-tools >=6.5,<7.0" "pluggy >=1.0,<1.1" "python >=3.7,<3.11" "python-json-logger >=2.0.0,<3.0.0" "pyyaml >=4.2,<7.0" "rich >=12.0,<13.0" "rope >=0.21.0,<0.22.0" "setuptools >=38.0" "toml >=0.10,<0.11" "toposort >=1.5,<2.0" "airflow=2.4.2" "mlflow >=1.0.0,<2.0.0"

Actual Result

A draft pull request for the conda-forge feedstock shows the build is possible, restoring compatibility with Airflow & MLflow:
conda-forge/kedro-feedstock#25

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used : 0.18.3
  • Python version used : Python 3.7
  • Operating system and version: Linux 64
  • Constrained by a monolithic conda environment: Kedro, Airflow, MLflow, PySpark and Great Expectations must be installed in a single conda environment

Add catalog dataset for HDF formats not supported by pandas

Description

HDF files which were not created by pandas.to_hdf sometimes cannot be read using pandas.read_hdf. Since Kedro's HDFDataSet depends on pandas, these files cannot be added to the catalog at all.

Context

The dynamic simulation software employed in my research group outputs exclusively .h5 files, which contain information we wish to feed to ML models using Kedro. Currently we use a Python script which converts these HDF files into CSVs so we can track them with Kedro, yet this is an inefficient process as we are required to rewrite thousands of files to process them.

Ideally, we would like to add our dataset to our data catalog just like we do with our CSV datasets, but in a way that is able to read any kind of HDF file, unlike kedro.extras.datasets.pandas.HDFDataSet.

Given pandas cannot read HDF files which do not conform to its specs (apparently by design, according to this issue), this simple addition would benefit any user who stores information in HDF files, whether because it is their preferred storage format or because (as in our case) they use software which directly outputs HDF.

Possible Implementation

My research colleague @Eric-OG and I believe we can implement this on our own. I think it's worth noting it'd be our first contribution to an open source project, but we've read the guidelines and so forth.

It would basically involve copying one of the existing datasets (possibly even pandas.HDFDataSet) and adapting it to use another library. We planned on using h5py.
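A rough sketch of what we have in mind, assuming h5py and Kedro's AbstractDataSet interface; the class name, constructor arguments and the decision to read the whole dataset into memory are all illustrative:

from pathlib import PurePosixPath

import h5py

from kedro.io import AbstractDataSet

class H5pyDataSet(AbstractDataSet):
    def __init__(self, filepath: str, dataset_name: str):
        self._filepath = PurePosixPath(filepath)
        self._dataset_name = dataset_name

    def _load(self):
        # Works for any HDF5 file, not only those written by pandas.to_hdf.
        with h5py.File(self._filepath, "r") as file:
            return file[self._dataset_name][()]

    def _save(self, data) -> None:
        with h5py.File(self._filepath, "w") as file:
            file.create_dataset(self._dataset_name, data=data)

    def _describe(self) -> dict:
        return dict(filepath=str(self._filepath), dataset_name=self._dataset_name)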

Possible Alternatives

  • Performing part of our data processing pipelines without Kedro; this is cumbersome and can get harder to maintain, especially since our code will likely be used by new researchers next year;
  • Converting the files to another type already implemented; this is what we do today but it's simply inefficient.

Kedro-Datasets PyPI source distribution Tar.gz does not include requirements.txt

Description

The PyPI source distribution does not include requirements.txt, test_requirements.txt and license.md:

https://files.pythonhosted.org/packages/0f/8e/1ca7d83226970935d2eeb94611dea4be3eb888b2ea9393f343eabd2560b0/kedro-datasets-0.0.6.tar.gz

https://pypi.org/project/kedro-datasets/#files

Context

setup.py refers to requirements.txt and test_requirements.txt.
The license is needed for distribution through conda and is ideally included in the source distribution.

Expected Result

The source distribution should contain the license and the files needed to run setup.py.
It should be similar to the Kedro-Airflow source distribution and could eventually be made available as a conda package.
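A sketch of a MANIFEST.in that would pull the missing files into the sdist (file names as listed above; inlining the requirements into setup.py would be an alternative the maintainers may prefer):

include requirements.txt
include test_requirements.txt
include license.md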

Actual Result

The source distribution tar.gz is missing these files.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.18.3
  • Kedro plugin and kedro plugin version used (pip show kedro-airflow): kedro-datasets
  • Python version used (python -V): 3.7.12
  • Operating system and version: Linux

Kedro-Airflow not working with Astrocloud

Raised by @jweiss-ocurate:

Description

I am trying to run a simple spaceflights example with Astrocloud. I am not sure if anyone has been able to get it to work.

Here is the Dockerfile:
FROM quay.io/astronomer/astro-runtime:4.1.0

RUN pip install --user new_kedro_project-0.1-py3-none-any.whl --ignore-requires-python

Context

I am trying to use kedro-airflow with astrocloud.

Steps to Reproduce

  1. Follow directions here https://kedro.readthedocs.io/en/latest/10_deployment/11_airflow_astronomer.html
  2. Replace the Dockerfile with the one shown above.

Expected Result

Complete Kedro Run on local Airflow image.

Actual Result

Failure in local Airflow image.
[2022-02-26, 16:43:26 UTC] {store.py:32} INFO - read() not implemented for BaseSessionStore. Assuming empty store.
[2022-02-26, 16:43:26 UTC] {session.py:78} WARNING - Unable to git describe /usr/local/airflow
[2022-02-26, 16:43:29 UTC] {local_task_job.py:154} INFO - Task exited with return code Negsignal.SIGKILL

Your Environment

Include as many relevant details about the environment you experienced the bug in:

  • Kedro-Airflow plugin version used (get it by running pip show kedro-airflow): 0.4.1
  • Airflow version (airflow --version):
  • Kedro version used (pip show kedro or kedro -V): 0.17.7
  • Python version used (python -V): > 2.0.0
  • Operating system and version: Ubuntu Linux 20.04
