ml6team / fondant
Production-ready data processing made easy and shareable
Home Page: https://fondant.ai/en/stable/
License: Apache License 2.0
Many objects within the framework have long names. For example: FondantComponentOp, FondantPipeline, FondantClient
Simplify naming by potentially using short abbreviations (e.g. F).
Currently the interface of the component.transform method is:
def transform(
    self, args: argparse.Namespace, dataframe: dd.DataFrame
) -> dd.DataFrame:
Since the dataframe is the main argument, I would put it first. And instead of providing an argparse.Namespace, I would provide the arguments as keyword arguments:
def transform(
    self, dataframe: dd.DataFrame, **kwargs
) -> dd.DataFrame:
This way users can define the keyword arguments they are expecting in the method signature instead of having to unpack an argparse.Namespace. E.g. for a component that filters on image dimensions:
def transform(
    self, dataframe: dd.DataFrame, *, height: int, width: int
) -> dd.DataFrame:
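A minimal sketch of how the component runner could then invoke transform, assuming the user-provided arguments have already been parsed into an argparse.Namespace (the run_transform helper is purely illustrative):

import argparse

import dask.dataframe as dd


def run_transform(component, dataframe: dd.DataFrame, args: argparse.Namespace) -> dd.DataFrame:
    # Turn the parsed namespace into a plain dict and splat it into transform,
    # so user-declared keyword arguments (e.g. height, width) are filled in directly.
    kwargs = vars(args)
    return component.transform(dataframe, **kwargs)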
#41 provides the functionality to generate the output manifest based on the input manifest and the component spec. We should use this in the Component or Dataset class to validate the output of the component.
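A rough sketch of such a check, assuming the functionality from #41 is exposed as something like input_manifest.evolve(component_spec) and that subsets expose their expected fields (both assumptions; the real API may differ):

def validate_output(input_manifest, component_spec, dataframe) -> None:
    # Evolve the manifest with the component spec (assumed API from #41).
    output_manifest = input_manifest.evolve(component_spec)
    # Check that every expected subset field is present as a dataframe column.
    for subset_name, subset in output_manifest.subsets.items():
        for field_name in subset.fields:
            column = f"{subset_name}_{field_name}"
            if column not in dataframe.columns:
                raise ValueError(f"Component did not produce expected column '{column}'")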
We should provide an importing mechanism for our reusable components so they can easily be included in user pipelines.
Loading a local component is done as follows:
from fondant.pipeline import FondantComponentOp

component_op = FondantComponentOp(
    component_spec_path="...",
    arguments={...},
)

where the component_spec_path is a local path.
We no longer want the user to have to specify the component_spec_path, but still want them to be able to provide custom arguments.
We can provide a single ReusableComponentOp class which takes the name of the component to load:
from fondant.components import ReusableComponentOp

component_op = ReusableComponentOp(
    name="...",
    arguments={...},
)
Which we can optionally still wrap as specific classes:
from fondant.components import CaptioningComponentOp

captioning_op = CaptioningComponentOp(
    arguments={...},
)
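A possible sketch of how this could work, assuming the reusable specs ship inside the fondant package under a components/<name>/fondant_component.yaml layout (the layout and file name are assumptions):

from pathlib import Path

from fondant.pipeline import FondantComponentOp


class ReusableComponentOp(FondantComponentOp):
    COMPONENTS_DIR = Path(__file__).parent / "components"

    def __init__(self, name: str, arguments: dict):
        spec_path = self.COMPONENTS_DIR / name / "fondant_component.yaml"
        if not spec_path.exists():
            raise ValueError(f"Unknown reusable component: {name}")
        super().__init__(component_spec_path=str(spec_path), arguments=arguments)


class CaptioningComponentOp(ReusableComponentOp):
    def __init__(self, arguments: dict):
        super().__init__(name="captioning", arguments=arguments)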
We should update the README and create the following detailed documentation pages:
The loaded subsets need to be merged into a single dataframe by joining them on the index, which is a combination of the id and the source columns, before passing it to the user.
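A minimal sketch of that merge with Dask, assuming each loaded subset is a Dask dataframe indexed by the (id, source) combination:

import dask.dataframe as dd


def merge_subsets(subsets: dict) -> dd.DataFrame:
    # subsets maps subset names to Dask dataframes sharing the same index.
    dataframes = list(subsets.values())
    merged = dataframes[0]
    for subset_df in dataframes[1:]:
        # Join on the shared (id, source) index so rows from all subsets line up.
        merged = merged.merge(subset_df, left_index=True, right_index=True, how="left")
    return merged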
Currently, compiling and executing pipelines requires users to have basic Kubeflow knowledge. Additionally, the compiled pipelines do not validate the component specifications of the different chained components (static validation before compiling the pipeline).
The ExpressPipeline class aims to compile the defined pipelines and validate whether they are defined in the correct order (all subsets are present when needed, based on the ordering of the components).
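A sketch of what that static validation could look like, assuming each component spec exposes its name and the subsets it consumes and produces (the attribute names are assumptions):

def validate_pipeline(component_specs: list) -> None:
    available_subsets: set = set()
    for spec in component_specs:
        # Every subset a component consumes must be produced by an earlier component.
        missing = set(spec.input_subsets) - available_subsets
        if missing:
            raise ValueError(
                f"Component '{spec.name}' consumes subsets {missing} "
                "that no earlier component produces."
            )
        available_subsets |= set(spec.output_subsets)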
Create an example pipeline to create a dataset for fine-tuning Stable Diffusion based on a seed dataset from the Hugging Face hub.
The components needed are:
The final dataset should ideally be registered to the Hugging Face hub.
To make sure updates to the Fondant code base don't break existing pipelines, we need a test infrastructure on Kubernetes that spins up a test pipeline, runs some basic components and deletes the test pipeline. Specifically:
This can be achieved similarly to how Kubeflow Pipelines does this: https://github.com/kubeflow/pipelines/tree/master/test.
The goal is to help users start using Fondant.
There are multiple ways we can make this happen with different levels of user involvement.
1. Fondant as SaaS
tbd
2. Marketplace solution
Fondant is available on all the marketplaces of the cloud providers. The user clicks a button and K8s and Kubeflow (and other dependencies like storage and access rights) are all provisioned. (Note that this is a complex process that includes a review with each cloud provider.)
GCP documentation: https://cloud.google.com/marketplace/docs/partners/kubernetes/submitting
3. Makefile
We create Makefiles that will set up a k8s cluster and deploy Kubeflow Pipelines on it. This is a multistep approach and allows for some user customization.
4. Terraform
We create a Terraform module or example code that the user can apply (and maintain). Requires Terraform knowledge.
5. detailed guide
We create documentation on how to set up everything (probably combining already existing guides).
We should add a coverage check of our test suite to GitHub Actions. This provides a view of how well our test suite covers our code and can give indications on when and where to add tests. High coverage can help users trust the quality of Fondant.
We can look at Connexion for how to add this:
https://github.com/spec-first/connexion/blob/15fe2eda8fa1853a4d73a13839d7e35814ad7358/pyproject.toml#L76
https://github.com/spec-first/connexion/blob/15fe2eda8fa1853a4d73a13839d7e35814ad7358/tox.ini#L36
https://github.com/spec-first/connexion/blob/15fe2eda8fa1853a4d73a13839d7e35814ad7358/.github/workflows/pipeline.yml#L25
Create an example pipeline to create a dataset for fine-tuning ControlNet for interior design.
The components needed are:
The final dataset should ideally be registered to the Hugging Face hub.
For the alpha release, we want to have example pipelines for the following use cases:
For this we need the following example components:
The Fondant code is typed, but the typing is not checked. This is an issue if users want to leverage the Fondant typing, as there is no guarantee that it is correct. We should add a typing check to our GitHub Actions to validate this. This will require fixing the types, and might have to be done in multiple steps.
The output manifest of a component should be defined completely by the input manifest and the specification of the component.
This functionality can be used for validation of the data produced by the component, and for static validation of the pipeline before execution.
#42 provides functionality to generate the output manifest from the input manifest and component spec. But it does not yet correctly update the location of the subsets or the index. It should deduce when this is necessary and update the location with the new one.
Based on our offline discussion: a subset needs to be written to a new location if and only if it is defined in the component output_subsets. Since we only need the component specification, this can be done statically. Based on this, we get the following subtasks:
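A minimal sketch of that static rule, assuming a manifest that maps subset names to locations and a spec that lists its output_subsets (the attribute names and path layout are assumptions):

def evolve_locations(locations: dict, component_spec, run_id: str, component_id: str) -> dict:
    evolved = dict(locations)
    # Only subsets the component declares in output_subsets get a new location;
    # all other subsets keep the location from the input manifest.
    for subset_name in component_spec.output_subsets:
        evolved[subset_name] = f"{run_id}/{component_id}/{subset_name}"
    return evolved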
Currently, the user returns a single Fondant dataframe when loading or transforming data. However, there is a need to split this dataframe into multiple subsets based on component specifications and save each subset into a separate location.
This task involves creating a process that allows for efficient and accurate splitting of the dataframe while ensuring each subset is written to its designated location.
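A rough sketch of that splitting step, assuming the user's dataframe uses "<subset>_<field>" column names and the component spec lists the output subsets with their fields (all assumptions):

import dask.dataframe as dd


def write_subsets(dataframe: dd.DataFrame, component_spec, locations: dict) -> None:
    for subset_name, subset in component_spec.output_subsets.items():
        columns = [f"{subset_name}_{field}" for field in subset.fields]
        subset_df = dataframe[columns]
        # Drop the subset prefix so the written files only carry the field names.
        subset_df = subset_df.rename(
            columns={c: c.removeprefix(f"{subset_name}_") for c in columns}
        )
        subset_df.to_parquet(locations[subset_name])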
Not every component needs input and output subsets. A data loading component can start without inputs, while some transform components like filtering might not write new output subsets. We should make these sections optional in the component spec.
Currently, each component takes a --metadata argument, which includes the run_id, component_id and base_path.
However, the run_id and base_path stay the same for all components in a pipeline, hence it might make sense to only have a loading component accept a --metadata argument, which is then added to the manifest and kept for the rest of the pipeline. The component_id can be inferred from the component spec so there's no need to pass that anymore (see #56).
Alternatively, perhaps the pipeline wrapper class could create the initial Manifest, add the metadata to it, and pass that to the loading component. This way, no component would require a --metadata argument.
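A minimal sketch of that alternative, using a stand-in Manifest with a metadata dict (the real Manifest API will differ):

from dataclasses import dataclass, field


@dataclass
class Manifest:
    # Stand-in for the real Manifest; only the metadata handling matters here.
    metadata: dict = field(default_factory=dict)


def create_initial_manifest(run_id: str, base_path: str) -> Manifest:
    # The pipeline wrapper creates the manifest once and attaches the metadata,
    # so downstream components read run_id / base_path from the manifest instead
    # of receiving a --metadata argument.
    return Manifest(metadata={"run_id": run_id, "base_path": base_path})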
Currently the format of the location of subsets or the index is:

f"{run_id}/{component_id}/{subset_name}"

But since a subset is not rewritten for every run, and certainly not by every component, this leads to a directory structure with only sparse data.
This structure makes it hard to find the data versions of a single subset, without making it easy to find the data versions of a specific run, since runs might use datasets created by previous runs once we have pipeline caching.
I would therefore propose to change it to:

f"{subset_name}/{run_id}/{component_id}"
Currently we have two types of components:
LoadComponent: takes as input user arguments for loading a dataset (e.g. dataset_id) and loads the dataset from a remote location. This component also creates the initial manifest based on the metadata passed to the component and the component specs. The user is expected to change the column names to subset_field if the columns of the loaded dataset do not have this format. (This might change.)
TransformComponent: takes as input the evolved manifest based on the component spec and loads the dataset from the artifact registry. It also presents the dataframe to the user as subset_field columns.
Both components share some common functionalities based on the abstract run method:
What is still missing is a WriteComponent (this is currently implemented with a transform component). This component should take as an argument a write_path and write the dataset with an appropriate schema based on the component spec.
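A rough sketch of what such a WriteComponent base class could look like next to LoadComponent and TransformComponent; the write_path argument comes from the description above, everything else is an assumption:

from abc import ABC, abstractmethod

import dask.dataframe as dd


class WriteComponent(ABC):
    @abstractmethod
    def write(self, dataframe: dd.DataFrame, *, write_path: str, **kwargs) -> None:
        """Write the final dataset to write_path with a schema derived from the component spec."""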
Let's take the write_to_hub component as an example. What we want is the following:
Allow users to write the dataset with custom column names.
This can be done by providing a mapping dict argument similar to what Bert proposed for the loading component (link).
The user returns the dataset with the modified names and we take care of writing it to the appropriate location.
Two options here:
Option A: the user calls the to_parquet method themselves, which we normally handle in the backend.
Option B: a DaskDataSink component similar to the DaskDataWriter (we have to change some names), where the biggest difference is that:
- writing is done with the to_parquet method (example). This applies to both the cloud and hf. Also, it will require injecting some dependencies into the Fondant code (writing to hf:// requires hf_hub as a dependency).
- it is implemented as a DaskDataSink class. Wondering whether we should follow a similar approach for the LoadComponent if we want to introduce many sources.
Depends on the choice of solution.
Another thing that we will have to implement is changing the schema passed to the to_parquet method from a dict to a pyarrow.Schema data type. This is mainly to support adding additional metadata to the schema needed for the write_to_hub component.
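A small sketch of building such a pyarrow.Schema with extra metadata attached, as the write_to_hub component would need; the metadata keys and field types shown are assumptions:

import pyarrow as pa


def build_schema(fields: dict, extra_metadata: dict) -> pa.Schema:
    # fields maps column names to pyarrow data types, e.g. {"image": pa.binary()}.
    schema = pa.schema(list(fields.items()))
    # pyarrow metadata values must be strings or bytes.
    return schema.with_metadata({key: str(value) for key, value in extra_metadata.items()})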
If Option B is introduced: the Component class will have to be adjusted by adding a new class and potentially modifying the run method (maybe we'll have to implement a separate run method per component type, since they won't share many common methods anymore if the WriteComponent is introduced), as well as DataIo.
We will have to document all 3 types of components in custom_component.md.
The Component class currently has a type attribute which indicates if the component is used to load data or transform data, which is used in if/else checks in different places. It would be cleaner to split this into LoadComponent and TransformComponent subclasses.
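A minimal sketch of that split (method names are illustrative, not the current Fondant API):

class Component:
    """Shared logic lives here; no type attribute or if/else checks."""


class LoadComponent(Component):
    def load(self, **kwargs):
        """Create the initial dataframe (and manifest) from an external source."""
        raise NotImplementedError


class TransformComponent(Component):
    def transform(self, dataframe, **kwargs):
        """Transform the dataframe described by the incoming manifest."""
        raise NotImplementedError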
We need a release pipeline to release the fondant package to PyPI.
Data types such as Tuples, bool, Lists are not always represented correctly after passing them to Kubeflow. We need to handle them internally after they are passed by the user and parse them as intended.
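A possible sketch of such normalisation, assuming values come back from Kubeflow as strings: try JSON first (covers lists, dicts, bools and numbers) and fall back to Python literal parsing for tuples (the function name is illustrative):

import ast
import json


def parse_kubeflow_value(value: str):
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        pass
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value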