ml6team / fondant
Production-ready data processing made easy and shareable
Home Page: https://fondant.ai/en/stable/
License: Apache License 2.0
Many objects within the framework have long names. For example: FondantComponentOp, FondantPipeline, FondantClient
Simplify naming by potentially using short abbreviations (e.g. F).
Currently the interface of the component.transform method is:
def transform(
    self, args: argparse.Namespace, dataframe: dd.DataFrame
) -> dd.DataFrame:
Since the dataframe is the main argument, I would put it first. And instead of providing an argparse.Namespace, I would provide the arguments as keyword arguments:
def transform(
    self, dataframe: dd.DataFrame, **kwargs
) -> dd.DataFrame:
This way users can define the keyword arguments they are expecting in the method signature instead of having to unpack an argparse.Namespace. E.g. for a component that filters on image dimensions:
def transform(
    self, dataframe: dd.DataFrame, *, height: int, width: int
) -> dd.DataFrame:
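A minimal sketch of how the component runner could then invoke transform, assuming the user-provided arguments have already been parsed into an argparse.Namespace (the run_transform helper is purely illustrative):

import argparse

import dask.dataframe as dd


def run_transform(component, dataframe: dd.DataFrame, args: argparse.Namespace) -> dd.DataFrame:
    # Turn the parsed namespace into a plain dict and splat it into transform,
    # so user-declared keyword arguments (e.g. height, width) are filled in directly.
    kwargs = vars(args)
    return component.transform(dataframe, **kwargs)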
#41 provides the functionality to generate the output manifest based on the input manifest and the component spec. We should use this in the Component or Dataset class to validate the output of the component.
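A rough sketch of such a check, assuming the functionality from #41 is exposed as something like input_manifest.evolve(component_spec) and that subsets expose their expected fields (both assumptions; the real API may differ):

def validate_output(input_manifest, component_spec, dataframe) -> None:
    # Evolve the manifest with the component spec (assumed API from #41).
    output_manifest = input_manifest.evolve(component_spec)
    # Check that every expected subset field is present as a dataframe column.
    for subset_name, subset in output_manifest.subsets.items():
        for field_name in subset.fields:
            column = f"{subset_name}_{field_name}"
            if column not in dataframe.columns:
                raise ValueError(f"Component did not produce expected column '{column}'")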
We should provide an importing mechanism for our reusable components so they can easily be included in user pipelines.
Loading a local component is done as follows:
from fondant.pipeline import FondantComponentOp

component_op = FondantComponentOp(
    component_spec_path="...",
    arguments={...},
)

where the component_spec_path is a local path.
We no longer want the user to have to specify the component_spec_path, but still want them to be able to provide custom arguments.
We can provide a single ReusableComponentOp class which takes the name of the component to load:
from fondant.components import ReusableComponentOp

component_op = ReusableComponentOp(
    name="...",
    arguments={...},
)
Which we can optionally still wrap as specific classes:
from fondant.components import CaptioningComponentOp

captioning_op = CaptioningComponentOp(
    arguments={...},
)
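A possible sketch of how this could work, assuming the reusable specs ship inside the fondant package under a components/<name>/fondant_component.yaml layout (the layout and file name are assumptions):

from pathlib import Path

from fondant.pipeline import FondantComponentOp


class ReusableComponentOp(FondantComponentOp):
    COMPONENTS_DIR = Path(__file__).parent / "components"

    def __init__(self, name: str, arguments: dict):
        spec_path = self.COMPONENTS_DIR / name / "fondant_component.yaml"
        if not spec_path.exists():
            raise ValueError(f"Unknown reusable component: {name}")
        super().__init__(component_spec_path=str(spec_path), arguments=arguments)


class CaptioningComponentOp(ReusableComponentOp):
    def __init__(self, arguments: dict):
        super().__init__(name="captioning", arguments=arguments)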
We should update the README and create the following detailed documentation pages:
The loaded subsets need to be merged into a single dataframe by joining them on the index, which is a combination of the id and the source columns, before passing it to the user.
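A minimal sketch of that merge with Dask, assuming each loaded subset is a Dask dataframe indexed by the (id, source) combination:

import dask.dataframe as dd


def merge_subsets(subsets: dict) -> dd.DataFrame:
    # subsets maps subset names to Dask dataframes sharing the same index.
    dataframes = list(subsets.values())
    merged = dataframes[0]
    for subset_df in dataframes[1:]:
        # Join on the shared (id, source) index so rows from all subsets line up.
        merged = merged.merge(subset_df, left_index=True, right_index=True, how="left")
    return merged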
Currently, compiling and executing pipelines requires users to have basic Kubeflow knowledge. Additionally, the compiled pipelines do not validate the component specifications of the different chained components (static validation before compiling the pipeline).
The ExpressPipeline class aims to compile the defined pipelines and validate whether they are defined in the correct order (all subsets are present when needed, based on the ordering of the components).
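A sketch of what that static validation could look like, assuming each component spec exposes its name and the subsets it consumes and produces (the attribute names are assumptions):

def validate_pipeline(component_specs: list) -> None:
    available_subsets: set = set()
    for spec in component_specs:
        # Every subset a component consumes must be produced by an earlier component.
        missing = set(spec.input_subsets) - available_subsets
        if missing:
            raise ValueError(
                f"Component '{spec.name}' consumes subsets {missing} "
                "that no earlier component produces."
            )
        available_subsets |= set(spec.output_subsets)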
Create an example pipeline to create a dataset for fine-tuning Stable Diffusion based on a seed dataset from the Hugging Face hub.
The components needed are:
The final dataset should ideally be registered to the Hugging Face hub.
To make sure updates to the Fondant code base don't break existing pipelines, we need a test infrastructure on Kubernetes that spins up a test pipeline, runs some basic components and deletes the test pipeline. Specifically:
This can be achieved similarly to how Kubeflow Pipelines does this: https://github.com/kubeflow/pipelines/tree/master/test.
The goal is to help users start using Fondant.
There are multiple ways we can make this happen with different levels of user involvement.
1. Fondant as SaaS
tbd
2. Marketplace solution
Fondant is available on all the marketplaces of the cloud providers. The user clicks a button and K8s and Kubeflow (and other dependencies like storage and access rights) are all provisioned. (Note that this is a complex process that includes a review with each cloud provider.)
GCP documentation: https://cloud.google.com/marketplace/docs/partners/kubernetes/submitting
3. Makefile
We create Makefiles that will set up a k8s cluster and deploy Kubeflow Pipelines on it. This is a multistep approach and allows for some user customization.
4. Terraform
We create a Terraform module or example code that the user can apply (and maintain). Requires Terraform knowledge.
5. detailed guide
We create documentation on how to set up everything (probably combining already existing guides).
We should add a coverage check of our test suite to GitHub Actions. This provides a view of how well our test suite covers our code and can give indications on when and where to add tests. High coverage can help users trust the quality of Fondant.
We can look at Connexion for how to add this:
https://github.com/spec-first/connexion/blob/15fe2eda8fa1853a4d73a13839d7e35814ad7358/pyproject.toml#L76
https://github.com/spec-first/connexion/blob/15fe2eda8fa1853a4d73a13839d7e35814ad7358/tox.ini#L36
https://github.com/spec-first/connexion/blob/15fe2eda8fa1853a4d73a13839d7e35814ad7358/.github/workflows/pipeline.yml#L25
Create an example pipeline to create a dataset for fine-tuning ControlNet for interior design.
The components needed are:
The final dataset should ideally be registered to the Hugging Face hub.
For the alpha release, we want to have example pipelines for the following use cases:
For this we need the following example components:
The Fondant code is typed, but the typing is not checked. This is an issue if users want to leverage the Fondant typing, as there is no guarantee that it is correct. We should add a typing check to our GitHub Actions to validate this. This will require fixing the types, and might have to be done in multiple steps.
The output manifest of a component should be defined completely by the input manifest and the specification of the component.
This functionality can be used for validation of the data produced by the component, and for static validation of the pipeline before execution.
#42 provides functionality to generate the output manifest from the input manifest and component spec. But it does not yet correctly update the location of the subsets or the index. It should deduce when this is necessary and update the location with the new one.
Based on our offline discussion: a subset needs to be written to a new location if and only if it is defined in the component output_subsets. Since we only need the component specification, this can be done statically. Based on this, we get the following subtasks:
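A minimal sketch of that static rule, assuming a manifest that maps subset names to locations and a spec that lists its output_subsets (the attribute names and path layout are assumptions):

def evolve_locations(locations: dict, component_spec, run_id: str, component_id: str) -> dict:
    evolved = dict(locations)
    # Only subsets the component declares in output_subsets get a new location;
    # all other subsets keep the location from the input manifest.
    for subset_name in component_spec.output_subsets:
        evolved[subset_name] = f"{run_id}/{component_id}/{subset_name}"
    return evolved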
Currently, the user returns a single Fondant dataframe when loading or transforming data. However, there is a need to split this dataframe into multiple subsets based on component specifications and save each subset into a separate location.
This task involves creating a process that allows for efficient and accurate splitting of the dataframe while ensuring each subset is written to its designated location.
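A rough sketch of that splitting step, assuming the user's dataframe uses "<subset>_<field>" column names and the component spec lists the output subsets with their fields (all assumptions):

import dask.dataframe as dd


def write_subsets(dataframe: dd.DataFrame, component_spec, locations: dict) -> None:
    for subset_name, subset in component_spec.output_subsets.items():
        columns = [f"{subset_name}_{field}" for field in subset.fields]
        subset_df = dataframe[columns]
        # Drop the subset prefix so the written files only carry the field names.
        subset_df = subset_df.rename(
            columns={c: c.removeprefix(f"{subset_name}_") for c in columns}
        )
        subset_df.to_parquet(locations[subset_name])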
Not every component needs input and output subsets. A data loading component can start without inputs, while some transform components like filtering might not write new output subsets. We should make these sections optional in the component spec.
Currently, each component takes a --metadata argument, which includes the run_id, component_id and base_path.
However, the run_id and base_path stay the same for all components in a pipeline, hence it might make sense to only have a loading component accept a --metadata argument, which is then added to the manifest and kept for the rest of the pipeline. The component_id can be inferred from the component spec so there's no need to pass that anymore (see #56).
Alternatively, perhaps the pipeline wrapper class could create the initial Manifest, add the metadata to it, and pass that to the loading component. This way, no component would require a --metadata argument.
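A minimal sketch of that alternative, using a stand-in Manifest with a metadata dict (the real Manifest API will differ):

from dataclasses import dataclass, field


@dataclass
class Manifest:
    # Stand-in for the real Manifest; only the metadata handling matters here.
    metadata: dict = field(default_factory=dict)


def create_initial_manifest(run_id: str, base_path: str) -> Manifest:
    # The pipeline wrapper creates the manifest once and attaches the metadata,
    # so downstream components read run_id / base_path from the manifest instead
    # of receiving a --metadata argument.
    return Manifest(metadata={"run_id": run_id, "base_path": base_path})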
Currently the format of the location of subsets or the index is:

f"{run_id}/{component_id}/{subset_name}"

But since a subset is not rewritten for every run, and certainly not by every component, this leads to a directory structure with only sparse data.
This structure makes it hard to find the data versions of a single subset, without making it easy to find the data versions of a specific run, since runs might use datasets created by previous runs once we have pipeline caching.
I would therefore propose to change it to:

f"{subset_name}/{run_id}/{component_id}"
Currently we have two types of components:
LoadComponent: takes as input user arguments for loading a dataset (e.g. dataset_id) and loads the dataset from a remote location. This component also creates the initial manifest based on the metadata passed to the component and the component specs. The user is expected to change the column names to subset_field if the columns of the loaded dataset do not have this format. (This might change.)
TransformComponent: takes as input the evolved manifest based on the component spec and loads the dataset from the artifact registry. It also presents the dataframe to the user as subset_field columns.
Both components share some common functionalities based on the abstract run method:
What is still missing is a WriteComponent (this is currently implemented with a transform component). This component should take as an argument a write_path and write the dataset with an appropriate schema based on the component spec.
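A rough sketch of what such a WriteComponent base class could look like next to LoadComponent and TransformComponent; the write_path argument comes from the description above, everything else is an assumption:

from abc import ABC, abstractmethod

import dask.dataframe as dd


class WriteComponent(ABC):
    @abstractmethod
    def write(self, dataframe: dd.DataFrame, *, write_path: str, **kwargs) -> None:
        """Write the final dataset to write_path with a schema derived from the component spec."""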
Let's take the write_to_hub component as an example. What we want is the following:
Allow users to write the dataset with custom column names.
This can be done by providing a mapping dict argument similar to what Bert proposed for the loading component (link).
The user returns the dataset with the modified names and we take care of writing it to the appropriate location.
Two options here:
Option A: the user calls the to_parquet method themselves, which we normally handle in the backend.
Option B: a DaskDataSink component similar to the DaskDataWriter (we have to change some names), where the biggest difference is that:
- writing is done with the to_parquet method (example). This applies to both the cloud and hf. Also, it will require injecting some dependencies into the Fondant code (writing to hf:// requires hf_hub as a dependency).
- it is implemented as a DaskDataSink class. Wondering whether we should follow a similar approach for the LoadComponent if we want to introduce many sources.
Depends on the choice of solution.
Another thing that we will have to implement is changing the schema passed to the to_parquet method from a dict to a pyarrow.Schema data type. This is mainly to support adding additional metadata to the schema needed for the write_to_hub component.
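A small sketch of building such a pyarrow.Schema with extra metadata attached, as the write_to_hub component would need; the metadata keys and field types shown are assumptions:

import pyarrow as pa


def build_schema(fields: dict, extra_metadata: dict) -> pa.Schema:
    # fields maps column names to pyarrow data types, e.g. {"image": pa.binary()}.
    schema = pa.schema(list(fields.items()))
    # pyarrow metadata values must be strings or bytes.
    return schema.with_metadata({key: str(value) for key, value in extra_metadata.items()})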
If Option B is introduced: the Component class will have to be adjusted by adding a new class and potentially modifying the run method (maybe we'll have to implement a separate run method per component type, since they won't share many common methods anymore if the WriteComponent is introduced), as well as DataIo.
We will have to document all 3 types of components in custom_component.md.
The Component class currently has a type attribute which indicates if the component is used to load data or transform data, which is used in if/else checks in different places. It would be cleaner to split this into LoadComponent and TransformComponent subclasses.
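A minimal sketch of that split (method names are illustrative, not the current Fondant API):

class Component:
    """Shared logic lives here; no type attribute or if/else checks."""


class LoadComponent(Component):
    def load(self, **kwargs):
        """Create the initial dataframe (and manifest) from an external source."""
        raise NotImplementedError


class TransformComponent(Component):
    def transform(self, dataframe, **kwargs):
        """Transform the dataframe described by the incoming manifest."""
        raise NotImplementedError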
We need a release pipeline to release the fondant package to PyPI.
Data types such as Tuples, bool, Lists are not always represented correctly after passing them to Kubeflow. We need to handle them internally after they are passed by the user and parse them as intended.
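A possible sketch of such normalisation, assuming values come back from Kubeflow as strings: try JSON first (covers lists, dicts, bools and numbers) and fall back to Python literal parsing for tuples (the function name is illustrative):

import ast
import json


def parse_kubeflow_value(value: str):
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        pass
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value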