quintoandar / butterfree

A tool for building feature stores.

License: Apache License 2.0

Makefile 1.10% Python 98.90%
data-engineering data-science etl etl-framework feature-store package pyspark python

butterfree's People

Contributors

alvaromarquesandrade, fernandrone, gabrandao, github-felipe-caputo, guilhermesalerno, hmeretti, jdvala, jeanineharb, lecardozo, marcelogdeandrade, mmoscardini, moromimay, rafaelleinio, ralphrass, roelschr, thepabloaguilar, thspinto


butterfree's Issues

SonarCloud bugs/vulnerabilities (minor issues) on Cassandra Client

Summary

TL;DR: just take a look at the issues flagged by SonarCloud and fix them in cassandra_client.py.

Age: legacy

Present since: ~ 2020-05-01

Estimated cost: { estimatedcost:simple | estimatedcost:complex | estimatedcost:investigation_needed }

Type: coding and testing

Description 📋

There are just 5 issues according to SonarCloud, so they should be easy to solve.

Impact 💣

Since the client is not being used right now, there isn't much impact.

Pivot missing categories breaks FeatureSet/AggregatedFeatureSet

Summary

When defining a feature set, the pivot is expected to produce all categories, so that the resulting Source dataframe is suitable for transformation. When that doesn't happen, FeatureSet and AggregatedFeatureSet break.

Feature related:

Age: legacy

Estimated cost: investigation_needed

Type: documentation, coding and testing.

Description 📋

If we have a pivot transformation defined in a reader, it's straightforward to declare the expected categories as features when instantiating FeatureSet or AggregatedFeatureSet. If, for some reason, not all categories are found in the Source's resulting dataframe (which can happen if we use a smaller time window, for instance), then the feature set will break because the expected column is missing.

In order to illustrate what's happening, suppose we have the following resulting dataframe from the Source:

    +---+---+-------+------+----+-----+
    | id| ts|balcony|fridge|oven| pool|
    +---+---+-------+------+----+-----+
    |  1|  1|   null| false|true|false|
    |  2|  2|  false|  null|null| null|
    |  1|  3|   null|  null|null| null|
    |  1|  4|   null|  null|null| true|
    |  1|  5|   true|  null|null| null|
    +---+---+-------+------+----+-----+

As a result, a possible AggregatedFeatureSet could be:

# typical butterfree imports for this example
from pyspark.sql import functions

from butterfree.constants import DataType
from butterfree.transform.aggregated_feature_set import AggregatedFeatureSet
from butterfree.transform.features import Feature, KeyFeature, TimestampFeature
from butterfree.transform.transformations import AggregatedTransform
from butterfree.transform.utils import Function

aggregated_feature_set = AggregatedFeatureSet(
    name="example_agg_feature_set",
    entity="entity",
    description="Just a single example.",  # note: this comma was missing
    keys=[
        KeyFeature(
            name="id",
            description="House id.",
            dtype=DataType.BIGINT,
        )
    ],
    timestamp=TimestampFeature(from_column="ts"),
    features=[
        Feature(
            name="balcony_amenity",
            description="description",
            transformation=AggregatedTransform(
                functions=[Function(functions.count, DataType.INTEGER)]
            ),
            from_column="balcony",
        ),
        Feature(
            name="fridge_amenity",
            description="description",
            transformation=AggregatedTransform(
                functions=[Function(functions.count, DataType.INTEGER)]
            ),
            from_column="fridge",
        ),
        Feature(
            name="oven_amenity",
            description="description",
            transformation=AggregatedTransform(
                functions=[Function(functions.count, DataType.INTEGER)]
            ),
            from_column="oven",
        ),
        Feature(
            name="pool_amenity",
            description="description",
            transformation=AggregatedTransform(
                functions=[Function(functions.count, DataType.INTEGER)]
            ),
            from_column="pool",
        ),
    ],
)

Now, if we take a different time window and, for some reason, there is no information regarding the pool amenity, we'd have a resulting Source dataframe like this:

    +---+---+-------+------+----+
    | id| ts|balcony|fridge|oven|
    +---+---+-------+------+----+
    |  1|  6|   null| false|true|
    |  2|  7|  false|  null|null|
    |  1|  8|   null|  null|null|
    |  1|  9|   null|  null|null|
    +---+---+-------+------+----+

Therefore, the pool_amenity feature would break, since there's no pool column anymore.

Impact 💣

We won't be able to use the pivot operation for incremental loads, since we can't be sure that all categories will be present.

Solution Hints :shipit:

We could have a parameter for making a given feature optional. The expected behavior would then be: if the column this feature depends on exists, we perform the transformations; otherwise we simply fill the feature with null (and possibly raise a warning in these cases). A sketch of this idea follows.
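
A minimal sketch of that idea, assuming a hypothetical optional flag (the helper and flag below are illustrative, not part of the current API): before applying transformations, check whether the source column exists and, if not, inject it as a null literal and warn.

import warnings

from pyspark.sql import DataFrame, functions


def ensure_source_column(df: DataFrame, column: str, optional: bool) -> DataFrame:
    """Guarantee `column` exists before transforming a feature.

    Hypothetical helper: optional features get a null column plus a warning;
    required features fail fast with a clear message.
    """
    if column in df.columns:
        return df
    if not optional:
        raise ValueError(f"Required source column '{column}' not found.")
    warnings.warn(f"Optional source column '{column}' is missing; filling with nulls.")
    return df.withColumn(column, functions.lit(None))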

Observations 🤔

We should take care, when implementing this solution, to avoid hiding errors.

Historical and Online Feature Store Writer - Write directly to HDFS

Is there any way to write the Historical and Online feature stores directly to HDFS?

Summary

I would like to write the Historical and Online feature stores to two different HDFS paths, using the Parquet format.
I've tried using the S3Config class to do it, but it doesn't work.
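
Not an official answer, but one workaround is to bypass the writers and persist the constructed feature set dataframe to HDFS with plain PySpark. The construct calls below follow the examples elsewhere on this page, and the path is an illustrative placeholder:

# Workaround sketch: build the dataframes as usual, then write them
# to HDFS ourselves instead of going through a writer config.
source_df = source.construct(spark_client)
feature_df = feature_set.construct(source_df, spark_client)  # assumed signature

# Illustrative HDFS path; adjust namenode host and layout as needed.
feature_df.write.mode("overwrite").parquet(
    "hdfs://namenode:8020/feature_store/historical/my_feature_set"
)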

Bug in streaming feature sets

Summary

Streaming feature sets are not written to entity table.

Feature related:

Age: legacy

Estimated cost: simple

Type: coding

Description 📋

Streaming feature sets are not written to entity tables, only to feature set tables.

Impact 💣

Critical when we use entity tables in our online feature store.

Solution Hints :shipit:

In butterfree/core/load/writers/online_feature_store_writer.py, the loop at line 127 only executes its first iteration (creating the handler that writes to the feature set table), because there is a return statement at line 137.
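
Schematically, this is the classic return-inside-a-loop bug; a sketch of the shape of the fix, where create_stream_handler is an illustrative stand-in for the writer's streaming setup, not a verbatim patch:

# Buggy shape: the return ends the loop after the first table,
# so the entity table never gets a streaming handler.
def start_streams_buggy(tables, df):
    for table in tables:
        return create_stream_handler(table, df)  # exits on iteration one

# Fixed shape: start one handler per table, then return them all.
def start_streams_fixed(tables, df):
    return [create_stream_handler(table, df) for table in tables]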

Related files or evidence (e.g., screenshots)

This is evident for feature set phone_contact_streaming_hsm -- its columns on table wonka.phone_contact_hsm are all empty.

False positive ambiguous columns error when creating features


Summary

A confusing error about ambiguous columns that are not really ambiguous.

Feature related:

Age: new-tech-debt-introduced

Present since: 2020-03-06

Estimated cost: investigation_needed

Type: coding

Description 📋

It seems that using a SQLExpressionTransform when creating features can lead to false positive errors about ambiguous columns.

An example:

source=Source(
                readers=[
                    TableReader(
                        id="availability",
                        database="datalake_ebdb_raw",
                        table="horariosemanalimovel_aud",
                    )
                    .with_(self.column_sum)
                    .with_(
                        pivot,
                        group_by_columns=["imovel_id", "rev"],
                        pivot_column="diaDaSemana",
                        agg_column="column_sum",
                        aggregation=functions.sum,
                        mock_value=0,
                        mock_type="int",
                        with_forward_fill=True,
                    ),
                    TableReader(
                        id="ure",
                        database="datalake_ebdb_clean",
                        table="user_revision_entity",
                    ),
                ],
                query=(
                    """
                    with coalesced_availability as (
                      select 
                        av.imovel_id as id,
                        av.rev,
                        coalesce(`1`, 0) as monday,
                        coalesce(`2`, 0) as tuesday,
                        coalesce(`3`, 0) as wednesday,
                        coalesce(`4`, 0) as thursday,
                        coalesce(`5`, 0) as friday,
                        coalesce(`6`, 0) as saturday,
                        coalesce(`7`, 0) as sunday
                      from availability av
                    ), houses as (
                      select
                        ha.id_house as ha_id,
                        ha.rev as ha_rev,
                        av.rev as av_rev,
                        av.monday,
                        av.tuesday,
                        av.wednesday,
                        av.thursday,
                        av.friday,
                        av.saturday,
                        av.sunday
                      from datalake_ebdb_clean.house_aud ha
                      full outer join coalesced_availability av
                        on av.id = ha.id_house
                          and av.rev <= ha.rev
                    )
                    select distinct
                      ha_id as id,
                      coalesce(av_rev, ha_rev) as ts_revision,
                      monday as available_slots_monday,
                      tuesday as available_slots_tuesday,
                      wednesday as available_slots_wednesday,
                      thursday as available_slots_thursday,
                      friday as available_slots_friday,
                      saturday as available_slots_saturday,
                      sunday as available_slots_sunday,
                      (monday + tuesday + wednesday + thursday + friday + saturday + sunday) as total_available_slots_weekly
                    from houses
                    """
                ),
            ),
            feature_set=FeatureSet(
                name="house_availability",
                entity="house",
                description=(
                    """
                    Holds availability information related to house
                    feature such as "available_slots_monday" or
                    "total_available_slots_weekly"
                    """
                ),
                keys=[
                    KeyFeature(
                        name="id",
                        description="The House's Main ID",
                    )
                ],
                timestamp=TimestampFeature(from_column="ts_revision", from_ms=True),
                features=[
                    Feature(
                        name="available_slots_monday",
                        description="Number indicating available hours for visit on monday",
                        transformation=SQLExpressionTransform(
                            expression="coalesce(available_slots_monday, 9)"
                        ),
                    ),
                  ...

It seems that the part:

Feature(
                        name="available_slots_monday",
                        description="Number indicating available hours for visit on monday",
                        transformation=SQLExpressionTransform(
                            expression="coalesce(available_slots_monday, 9)"
                        ),
                    ),

causes the error:

org.apache.spark.sql.AnalysisException: Reference 'available_slots_monday' is ambiguous, could be: available_slots_monday, available_slots_monday.;

However, if I change the query so that instead of monday as available_slots_monday it simply selects monday, and then do:

Feature(
                        name="available_slots_monday",
                        description="Number indicating available hours for visit on monday",
                        transformation=SQLExpressionTransform(
                            expression="coalesce(monday, 9)"
                        ),

it works!
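
A minimal plain-PySpark reproduction of the suspected mechanism, independent of butterfree (assuming SQLExpressionTransform effectively appends the expression, aliased with the feature name, via selectExpr):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, None), (2, 5)], ["id", "available_slots_monday"])

# Appending an expression aliased to an existing column name yields a
# dataframe with two columns both called 'available_slots_monday'.
df2 = df.selectExpr("*", "coalesce(available_slots_monday, 9) as available_slots_monday")

# Any later reference to the name is now ambiguous and raises
# AnalysisException: Reference 'available_slots_monday' is ambiguous...
df2.select("available_slots_monday").show()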

Impact 💣

Some false positive errors that can be hard to debug.

Critical in: UNKNOWN

Solution Hints :squirrel:

Not sure

Observations 🤔

Related files or evidence (e.g., screenshots)

Complete error:
(screenshot of the full stack trace attached in the original issue)

AnalysisException: Undefined function: 'fat'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 0

I'm trying to build features from a CSV file and Kafka, but I get this error:

from butterfree.extract import Source
from butterfree.extract.readers import FileReader
from butterfree.extract.readers import KafkaReader

kafka_reader = KafkaReader(
    id="events",
    topic="queue.transactions",
    value_schema=schema_kafka,
    connection_string="kafka:29092",
    stream=False
)

readers = [
    kafka_reader,
    FileReader(id="nutrients", path="starbucks-menu-nutrition-drinks.csv", format="csv", schema=schema_file)
]

query = """
select
    *
from
    events
    join nutrients
        on events.product_name = nutrients.name
"""

source = Source(readers=readers, query=query)
source_df = source.construct(spark_client)

Error

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-18-9b5f19e7b117> in <module>
----> 1 source_df = source.construct(spark_client)

/opt/conda/lib/python3.8/site-packages/butterfree/extract/source.py in construct(self, client, start_date, end_date)
     82         """
     83         for reader in self.readers:
---> 84             reader.build(
     85                 client=client, start_date=start_date, end_date=end_date
     86             )  # create temporary views for each reader

/opt/conda/lib/python3.8/site-packages/butterfree/extract/readers/reader.py in build(self, client, columns, start_date, end_date)
    103 
    104         """
--> 105         column_selection_df = self._select_columns(columns, client)
    106         transformed_df = self._apply_transformations(column_selection_df)
    107 

/opt/conda/lib/python3.8/site-packages/butterfree/extract/readers/reader.py in _select_columns(self, columns, client)
    119     ) -> DataFrame:
    120         df = self.consume(client)
--> 121         return df.selectExpr(
    122             *(
    123                 [

/usr/local/spark/python/pyspark/sql/dataframe.py in selectExpr(self, *expr)
   1433         if len(expr) == 1 and isinstance(expr[0], list):
   1434             expr = expr[0]
-> 1435         jdf = self._jdf.selectExpr(self._jseq(expr))
   1436         return DataFrame(jdf, self.sql_ctx)
   1437 

/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    135                 # Hide where the exception came from that shows a non-Pythonic
    136                 # JVM exception message.
--> 137                 raise_from(converted)
    138             else:
    139                 raise

/usr/local/spark/python/pyspark/sql/utils.py in raise_from(e)

AnalysisException: Undefined function: 'fat'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 0
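
Not confirmed from this trace alone, but the 'fat' strongly suggests that a CSV header such as Fat (g) is reaching selectExpr, where the parentheses get parsed as a function call fat(g). A hedged workaround is to give the reader a schema whose names are plain identifiers; the column names below are assumptions about the Starbucks dataset, not its actual headers:

from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Assumption: the failure comes from header names like "Fat (g)".
# Declare identifier-safe names so selectExpr never sees parentheses.
schema_file = StructType(
    [
        StructField("name", StringType()),
        StructField("calories", DoubleType()),
        StructField("fat_g", DoubleType()),    # was something like "Fat (g)"
        StructField("carb_g", DoubleType()),   # was something like "Carb. (g)"
        StructField("protein_g", DoubleType()),
    ]
)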


Consider adding `mypy`

Summary

TL;DR: butterfree ships type hints, but nothing verifies them; add a mypy check.

Age: new-tech-debt-introduced

Estimated cost: investigation_needed

Type: coding and testing

Description 📋

butterfree adopted type hints, so we have to guarantee those hints are right.
Users may run into problems when they use the lib in a typed context with mypy.

Impact 💣

Critical when people use this lib in a typed context and the lib has wrong type hints!

Solution Hints :shipit:

Add a mypy check; a configuration sketch follows the module list below.


Modules with wrong types to fix:

  • configs
  • validations
  • constants
  • dataframe_service
  • reports
  • extract
  • clients
  • load
  • transform
  • pipelines
  • testing
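
A minimal sketch of what the check could look like (file name and flags are illustrative; strictness can be tightened per module over time), with mypy butterfree wired into CI or the Makefile:

# setup.cfg (illustrative)
[mypy]
python_version = 3.7
ignore_missing_imports = True
disallow_untyped_defs = True
warn_return_any = True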

I can fix this issue, if possible!

Set a minimum Python version

Summary

To distribute butterfree correctly, we have to declare which Python versions are allowed to install it. Currently the lib can be installed on any Python version.

python-requires

Present since: ever

Estimated cost: simple

Type: documentation | coding | testing

Impact 💣

Critical when people try to install on unsupported Python versions

Solution Hints :shipit:

Correctly set the python_requires argument in setup.py, as sketched below.
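
A sketch of the change; the exact floor is an assumption and should match the oldest version the test suite actually runs on:

# setup.py (sketch): pip will refuse to install on unsupported interpreters.
from setuptools import find_packages, setup

setup(
    name="butterfree",
    packages=find_packages(),
    python_requires=">=3.6,<4.0",  # assumed floor, align with CI
)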

request: Is it possible to use this lib with databricks-connect?

First of all: congratulations on building this lib.
So far it seems awesome, and I'm thinking about using it in one of our projects.

I'm creating a repo that uses a tool called databricks-connect, which lets us develop locally while running all the heavy processing on top of Databricks clusters. The idea behind this work is to create an image that allows us to easily run the same code locally, in deploy pipelines, and in Airflow DAGs.

Do you have any use case like this at 5A?

btw, the link for pre-processing on this page is broken: https://butterfree.readthedocs.io/en/latest/extract.html

thanks =D

Transform a feature before aggregating it

Hello everyone

Is it possible to transform data (using the SQL or Spark transform) before aggregating it?

For example, I want to aggregate using the max operator, but my numerical field is actually a string that I must first cast to integer.
What I'm doing right now is casting the value in the query and aggregating it in the FeatureSet, but I'd like, if possible, to keep the queries as simple as possible.

Or, say I have a column that is one of 10 possible words, and I want to use a case statement to turn it into a number and then aggregate using max.

Thanks =D
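
One possibility, hedging on the exact API: readers accept pre-transformations via .with_() (as in the pivot example above), so the cast or case mapping can run before the feature set aggregates. The function, table, and column names below are illustrative:

from pyspark.sql import DataFrame, functions

from butterfree.extract.readers import TableReader


def cast_price(df: DataFrame) -> DataFrame:
    # Pre-transformation: cast the string field so max() aggregates numbers.
    return df.withColumn("price", functions.col("price").cast("int"))


reader = TableReader(
    id="listings", database="my_db", table="my_table"  # illustrative names
).with_(cast_price)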

Wrong feature names when using the name parameter

Summary

The name parameter is ignored when we create a Feature with both a transformation and the from_column parameter.

Feature related:
butterfree.transform.features.Feature
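
A hedged reproduction of the report; the feature and column names are illustrative, inferred from the region feature mentioned in the screenshots below:

# Reported behavior: with both `transformation` and `from_column` set,
# the output column keeps the source name instead of `name`.
feature = Feature(
    name="region_name",          # expected output column
    description="Region as string.",
    from_column="region",        # source column
    transformation=SQLExpressionTransform(expression="cast(region as string)"),
)
# Expected: a 'region_name' column; reported: it stays 'region'.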

Age: { age:legacy | age:new-tech-debt-introduced }

Present since: 201X-XX-XX

Estimated cost: { estimatedcost:simple | estimatedcost:complex | estimatedcost:investigation_needed }

Type: { type:documentation | type:coding | type:testing }

Description 📋

A clear and concise description of what the tech debt is and the reason of being created

Impact 💣

Description of the current or possible impact of this tech debt.

Critical in: { N MONTHS | N YEARS | UNKNOWN }

or

Critical when

Solution Hints :shipit:

Description of solution hints that you have in mind.

Observations 🤔

Related files or evidence (e.g., screenshots)

(screenshot attached in the original issue)

When we check the feature set dataframe, we can see that the function was applied to both features, but the name is wrong for the region feature.

(screenshot attached in the original issue)

Depends on issue X
