quintoandar / butterfree

A tool for building feature stores.

License: Apache License 2.0

Makefile 1.10% Python 98.90%
data-engineering data-science etl etl-framework feature-store package pyspark python

butterfree's People

Contributors

alvaromarquesandrade, fernandrone, gabrandao, github-felipe-caputo, guilhermesalerno, hmeretti, jdvala, jeanineharb, lecardozo, marcelogdeandrade, mmoscardini, moromimay, rafaelleinio, ralphrass, roelschr, thepabloaguilar, thspinto


butterfree's Issues

SonarCloud bugs/vulnerabilities (minor issues) on Cassandra Client

Summary

TL;DR: just take a look at the issues flagged by SonarCloud and fix them in cassandra_client.py.

Age: legacy

Present since: ~ 2020-05-01

Estimated cost: { estimatedcost:simple | estimatedcost:complex | estimatedcost:investigation_needed }

Type: coding and testing

Description 📋

There are just 5 issues according to SonarCloud, so they should be easy to solve.

Impact 💣

Since the client is not being used right now, there isn't much impact.

Pivot missing categories breaks FeatureSet/AggregatedFeatureSet

Summary

When defining a feature set, the pivot is expected to produce all categories, so that the resulting Source dataframe is suitable for transformation. When that doesn't happen, FeatureSet and AggregatedFeatureSet break.

Feature related:

Age: legacy

Estimated cost: investigation_needed

Type: documentation, coding and testing.

Description 📋

If we have a pivot transformation defined in a reader, it's straightforward to declare the expected categories as features when instantiating FeatureSet or AggregatedFeatureSet. If, for some reason, not all categories are found in the Source's resulting dataframe (which can happen if we use a smaller time window, for instance), then the feature set will break because the expected column is missing.

In order to illustrate what's happening, suppose we have the following resulting dataframe from the Source:

    +---+---+-------+------+----+-----+
    | id| ts|balcony|fridge|oven| pool|
    +---+---+-------+------+----+-----+
    |  1|  1|   null| false|true|false|
    |  2|  2|  false|  null|null| null|
    |  1|  3|   null|  null|null| null|
    |  1|  4|   null|  null|null| true|
    |  1|  5|   true|  null|null| null|
    +---+---+-------+------+----+-----+

As a result, a possible AggregatedFeatureSet could be:

# typical butterfree imports for this example
from pyspark.sql import functions

from butterfree.constants import DataType
from butterfree.transform.aggregated_feature_set import AggregatedFeatureSet
from butterfree.transform.features import Feature, KeyFeature, TimestampFeature
from butterfree.transform.transformations import AggregatedTransform
from butterfree.transform.utils import Function

aggregated_feature_set = AggregatedFeatureSet(
    name="example_agg_feature_set",
    entity="entity",
    description="Just a single example.",  # note: this comma was missing
    keys=[
        KeyFeature(
            name="id",
            description="House id.",
            dtype=DataType.BIGINT,
        )
    ],
    timestamp=TimestampFeature(from_column="ts"),
    features=[
        Feature(
            name="balcony_amenity",
            description="description",
            transformation=AggregatedTransform(
                functions=[Function(functions.count, DataType.INTEGER)]
            ),
            from_column="balcony",
        ),
        Feature(
            name="fridge_amenity",
            description="description",
            transformation=AggregatedTransform(
                functions=[Function(functions.count, DataType.INTEGER)]
            ),
            from_column="fridge",
        ),
        Feature(
            name="oven_amenity",
            description="description",
            transformation=AggregatedTransform(
                functions=[Function(functions.count, DataType.INTEGER)]
            ),
            from_column="oven",
        ),
        Feature(
            name="pool_amenity",
            description="description",
            transformation=AggregatedTransform(
                functions=[Function(functions.count, DataType.INTEGER)]
            ),
            from_column="pool",
        ),
    ],
)

Now, if we take a different time window and, for some reason, there is no information regarding the pool amenity, we'd have a resulting Source dataframe like this:

    +---+---+-------+------+----+
    | id| ts|balcony|fridge|oven|
    +---+---+-------+------+----+
    |  1|  6|   null| false|true|
    |  2|  7|  false|  null|null|
    |  1|  8|   null|  null|null|
    |  1|  9|   null|  null|null|
    +---+---+-------+------+----+

Therefore, the pool_amenity feature would break, since there's no pool column anymore.

Impact 💣

We won't be able to use the pivot operation for incremental loads, since we can't be sure that all categories will be present.

Solution Hints :shipit:

We could have a parameter for making a given feature optional. The expected behavior would then be: if the column this feature depends on exists, we perform the transformations; otherwise we simply fill the feature with null (and possibly raise a warning in these cases). A sketch of this idea follows.
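
A minimal sketch of that idea, assuming a hypothetical optional flag (the helper and flag below are illustrative, not part of the current API): before applying transformations, check whether the source column exists and, if not, inject it as a null literal and warn.

import warnings

from pyspark.sql import DataFrame, functions


def ensure_source_column(df: DataFrame, column: str, optional: bool) -> DataFrame:
    """Guarantee `column` exists before transforming a feature.

    Hypothetical helper: optional features get a null column plus a warning;
    required features fail fast with a clear message.
    """
    if column in df.columns:
        return df
    if not optional:
        raise ValueError(f"Required source column '{column}' not found.")
    warnings.warn(f"Optional source column '{column}' is missing; filling with nulls.")
    return df.withColumn(column, functions.lit(None))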

Observations 🤔

We should take care, when implementing this solution, to avoid hiding errors.

Historical and Online Feature Store Writer - Write directly to HDFS

Is there any way to write the Historical and Online feature stores directly to HDFS?

Summary

I would like to write the Historical and Online feature stores to two different HDFS paths, using the Parquet format.
I've tried using the S3Config class to do it, but it doesn't work.
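
Not an official answer, but one workaround is to bypass the writers and persist the constructed feature set dataframe to HDFS with plain PySpark. The construct calls below follow the examples elsewhere on this page, and the path is an illustrative placeholder:

# Workaround sketch: build the dataframes as usual, then write them
# to HDFS ourselves instead of going through a writer config.
source_df = source.construct(spark_client)
feature_df = feature_set.construct(source_df, spark_client)  # assumed signature

# Illustrative HDFS path; adjust namenode host and layout as needed.
feature_df.write.mode("overwrite").parquet(
    "hdfs://namenode:8020/feature_store/historical/my_feature_set"
)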

Bug in streaming feature sets

Summary

Streaming feature sets are not written to entity table.

Feature related:

Age: legacy

Estimated cost: simple

Type: coding

Description 📋

Streaming feature sets are not written to entity tables, only to feature set tables.

Impact 💣

Critical when we use entity tables in our online feature store.

Solution Hints :shipit:

In butterfree/core/load/writers/online_feature_store_writer.py, the loop at line 127 only executes its first iteration (creating the handler that writes to the feature set table), because there is a return statement at line 137.
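
Schematically, this is the classic return-inside-a-loop bug; a sketch of the shape of the fix, where create_stream_handler is an illustrative stand-in for the writer's streaming setup, not a verbatim patch:

# Buggy shape: the return ends the loop after the first table,
# so the entity table never gets a streaming handler.
def start_streams_buggy(tables, df):
    for table in tables:
        return create_stream_handler(table, df)  # exits on iteration one

# Fixed shape: start one handler per table, then return them all.
def start_streams_fixed(tables, df):
    return [create_stream_handler(table, df) for table in tables]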

Related files or evidence (e.g., screenshots)

This is evident for feature set phone_contact_streaming_hsm -- its columns on table wonka.phone_contact_hsm are all empty.

False positive ambiguous columns error when creating features


Summary

A confusing error about ambiguous columns that are not really ambiguous.

Feature related:

Age: new-tech-debt-introduced

Present since: 2020-03-06

Estimated cost: investigation_needed

Type: coding

Description 📋

It seems that using a SQLExpressionTransform when creating features can lead to false positive errors about ambiguous columns.

An example:

source=Source(
                readers=[
                    TableReader(
                        id="availability",
                        database="datalake_ebdb_raw",
                        table="horariosemanalimovel_aud",
                    )
                    .with_(self.column_sum)
                    .with_(
                        pivot,
                        group_by_columns=["imovel_id", "rev"],
                        pivot_column="diaDaSemana",
                        agg_column="column_sum",
                        aggregation=functions.sum,
                        mock_value=0,
                        mock_type="int",
                        with_forward_fill=True,
                    ),
                    TableReader(
                        id="ure",
                        database="datalake_ebdb_clean",
                        table="user_revision_entity",
                    ),
                ],
                query=(
                    """
                    with coalesced_availability as (
                      select 
                        av.imovel_id as id,
                        av.rev,
                        coalesce(`1`, 0) as monday,
                        coalesce(`2`, 0) as tuesday,
                        coalesce(`3`, 0) as wednesday,
                        coalesce(`4`, 0) as thursday,
                        coalesce(`5`, 0) as friday,
                        coalesce(`6`, 0) as saturday,
                        coalesce(`7`, 0) as sunday
                      from availability av
                    ), houses as (
                      select
                        ha.id_house as ha_id,
                        ha.rev as ha_rev,
                        av.rev as av_rev,
                        av.monday,
                        av.tuesday,
                        av.wednesday,
                        av.thursday,
                        av.friday,
                        av.saturday,
                        av.sunday
                      from datalake_ebdb_clean.house_aud ha
                      full outer join coalesced_availability av
                        on av.id = ha.id_house
                          and av.rev <= ha.rev
                    )
                    select distinct
                      ha_id as id,
                      coalesce(av_rev, ha_rev) as ts_revision,
                      monday as available_slots_monday,
                      tuesday as available_slots_tuesday,
                      wednesday as available_slots_wednesday,
                      thursday as available_slots_thursday,
                      friday as available_slots_friday,
                      saturday as available_slots_saturday,
                      sunday as available_slots_sunday,
                      (monday + tuesday + wednesday + thursday + friday + saturday + sunday) as total_available_slots_weekly
                    from houses
                    """
                ),
            ),
            feature_set=FeatureSet(
                name="house_availability",
                entity="house",
                description=(
                    """
                    Holds availability information related to house
                    feature such as "available_slots_monday" or
                    "total_available_slots_weekly"
                    """
                ),
                keys=[
                    KeyFeature(
                        name="id",
                        description="The House's Main ID",
                    )
                ],
                timestamp=TimestampFeature(from_column="ts_revision", from_ms=True),
                features=[
                    Feature(
                        name="available_slots_monday",
                        description="Number indicating available hours for visit on monday",
                        transformation=SQLExpressionTransform(
                            expression="coalesce(available_slots_monday, 9)"
                        ),
                    ),
                  ...

It seems that the part:

Feature(
                        name="available_slots_monday",
                        description="Number indicating available hours for visit on monday",
                        transformation=SQLExpressionTransform(
                            expression="coalesce(available_slots_monday, 9)"
                        ),
                    ),

causes the error:

org.apache.spark.sql.AnalysisException: Reference 'available_slots_monday' is ambiguous, could be: available_slots_monday, available_slots_monday.;

However, if I change the query so that instead of monday as available_slots_monday it simply selects monday, and then do:

Feature(
                        name="available_slots_monday",
                        description="Number indicating available hours for visit on monday",
                        transformation=SQLExpressionTransform(
                            expression="coalesce(monday, 9)"
                        ),

it works!
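
A minimal plain-PySpark reproduction of the suspected mechanism, independent of butterfree (assuming SQLExpressionTransform effectively appends the expression, aliased with the feature name, via selectExpr):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, None), (2, 5)], ["id", "available_slots_monday"])

# Appending an expression aliased to an existing column name yields a
# dataframe with two columns both called 'available_slots_monday'.
df2 = df.selectExpr("*", "coalesce(available_slots_monday, 9) as available_slots_monday")

# Any later reference to the name is now ambiguous and raises
# AnalysisException: Reference 'available_slots_monday' is ambiguous...
df2.select("available_slots_monday").show()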

Impact 💣

Some false positive errors that can be hard to debug.

Critical in: UNKNOWN

Solution Hints :squirrel:

Not sure

Observations 🤔

Related files or evidence (e.g., screenshots)

Complete error:
(screenshot of the full stack trace attached in the original issue)

AnalysisException: Undefined function: 'fat'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 0

I'm trying to build features from a CSV file and Kafka, but I get this error:

from butterfree.extract import Source
from butterfree.extract.readers import FileReader
from butterfree.extract.readers import KafkaReader

kafka_reader = KafkaReader(
    id="events",
    topic="queue.transactions",
    value_schema=schema_kafka,
    connection_string="kafka:29092",
    stream=False
)

readers = [
    kafka_reader,
    FileReader(id="nutrients", path="starbucks-menu-nutrition-drinks.csv", format="csv", schema=schema_file)
]

query = """
select
    *
from
    events
    join nutrients
        on events.product_name = nutrients.name
"""

source = Source(readers=readers, query=query)
source_df = source.construct(spark_client)

Error

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-18-9b5f19e7b117> in <module>
----> 1 source_df = source.construct(spark_client)

/opt/conda/lib/python3.8/site-packages/butterfree/extract/source.py in construct(self, client, start_date, end_date)
     82         """
     83         for reader in self.readers:
---> 84             reader.build(
     85                 client=client, start_date=start_date, end_date=end_date
     86             )  # create temporary views for each reader

/opt/conda/lib/python3.8/site-packages/butterfree/extract/readers/reader.py in build(self, client, columns, start_date, end_date)
    103 
    104         """
--> 105         column_selection_df = self._select_columns(columns, client)
    106         transformed_df = self._apply_transformations(column_selection_df)
    107 

/opt/conda/lib/python3.8/site-packages/butterfree/extract/readers/reader.py in _select_columns(self, columns, client)
    119     ) -> DataFrame:
    120         df = self.consume(client)
--> 121         return df.selectExpr(
    122             *(
    123                 [

/usr/local/spark/python/pyspark/sql/dataframe.py in selectExpr(self, *expr)
   1433         if len(expr) == 1 and isinstance(expr[0], list):
   1434             expr = expr[0]
-> 1435         jdf = self._jdf.selectExpr(self._jseq(expr))
   1436         return DataFrame(jdf, self.sql_ctx)
   1437 

/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    135                 # Hide where the exception came from that shows a non-Pythonic
    136                 # JVM exception message.
--> 137                 raise_from(converted)
    138             else:
    139                 raise

/usr/local/spark/python/pyspark/sql/utils.py in raise_from(e)

AnalysisException: Undefined function: 'fat'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 0
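
Not confirmed from this trace alone, but the 'fat' strongly suggests that a CSV header such as Fat (g) is reaching selectExpr, where the parentheses get parsed as a function call fat(g). A hedged workaround is to give the reader a schema whose names are plain identifiers; the column names below are assumptions about the Starbucks dataset, not its actual headers:

from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Assumption: the failure comes from header names like "Fat (g)".
# Declare identifier-safe names so selectExpr never sees parentheses.
schema_file = StructType(
    [
        StructField("name", StringType()),
        StructField("calories", DoubleType()),
        StructField("fat_g", DoubleType()),    # was something like "Fat (g)"
        StructField("carb_g", DoubleType()),   # was something like "Carb. (g)"
        StructField("protein_g", DoubleType()),
    ]
)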


Consider adding `mypy`

Summary

TL;DR: butterfree ships type hints, but nothing verifies them; add a mypy check.

Age: new-tech-debt-introduced

Estimated cost: investigation_needed

Type: coding and testing

Description 📋

butterfree adopted type hints, so we have to guarantee those hints are right.
Users may run into problems when they use the lib in a typed context with mypy.

Impact 💣

Critical when people use this lib in a typed context and the lib has wrong type hints!

Solution Hints :shipit:

Add a mypy check; a configuration sketch follows the module list below.


Modules with wrong types to fix:

  • configs
  • validations
  • constants
  • dataframe_service
  • reports
  • extract
  • clients
  • load
  • transform
  • pipelines
  • testing
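
A minimal sketch of what the check could look like (file name and flags are illustrative; strictness can be tightened per module over time), with mypy butterfree wired into CI or the Makefile:

# setup.cfg (illustrative)
[mypy]
python_version = 3.7
ignore_missing_imports = True
disallow_untyped_defs = True
warn_return_any = True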

I can fix this issue, if possible!

Set a minimum Python version

Summary

To distribute butterfree correctly, we have to declare which Python versions are allowed to install it. Currently the lib can be installed on any Python version.

python-requires

Present since: ever

Estimated cost: simple

Type: documentation | coding | testing

Impact 💣

Critical when people try to install on unsupported Python versions

Solution Hints :shipit:

Correctly set the python_requires argument in setup.py, as sketched below.
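
A sketch of the change; the exact floor is an assumption and should match the oldest version the test suite actually runs on:

# setup.py (sketch): pip will refuse to install on unsupported interpreters.
from setuptools import find_packages, setup

setup(
    name="butterfree",
    packages=find_packages(),
    python_requires=">=3.6,<4.0",  # assumed floor, align with CI
)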

request: Is it possible to use this lib with databricks-connect?

First of all: congratulations on building this lib.
So far it seems awesome, and I'm thinking about using it in one of our projects.

I'm creating a repo that uses a tool called databricks-connect, which lets us develop locally while running all the heavy processing on top of Databricks clusters. The idea behind this work is to create an image that allows us to easily run the same code locally, in deploy pipelines, and in Airflow DAGs.

Do you have any use case like this at 5A?

btw, the link for pre-processing on this page is broken: https://butterfree.readthedocs.io/en/latest/extract.html

thanks =D

Transform a feature before aggregating it

Hello everyone

Is it possible to transform data (using the SQL or Spark transform) before aggregating it?

For example, I want to aggregate using the max operator, but my numerical field is actually a string that I must first cast to integer.
What I'm doing right now is casting the value in the query and aggregating it in the FeatureSet, but I'd like, if possible, to keep the queries as simple as possible.

Or, say I have a column that is one of 10 possible words, and I want to use a case statement to turn it into a number and then aggregate using max.

Thanks =D
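
One possibility, hedging on the exact API: readers accept pre-transformations via .with_() (as in the pivot example above), so the cast or case mapping can run before the feature set aggregates. The function, table, and column names below are illustrative:

from pyspark.sql import DataFrame, functions

from butterfree.extract.readers import TableReader


def cast_price(df: DataFrame) -> DataFrame:
    # Pre-transformation: cast the string field so max() aggregates numbers.
    return df.withColumn("price", functions.col("price").cast("int"))


reader = TableReader(
    id="listings", database="my_db", table="my_table"  # illustrative names
).with_(cast_price)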

Wrong feature names when using the name parameter

Summary

The name parameter is ignored when we create a Feature with both a transformation and the from_column parameter.

Feature related:
butterfree.transform.features.Feature
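
A hedged reproduction of the report; the feature and column names are illustrative, inferred from the region feature mentioned in the screenshots below:

# Reported behavior: with both `transformation` and `from_column` set,
# the output column keeps the source name instead of `name`.
feature = Feature(
    name="region_name",          # expected output column
    description="Region as string.",
    from_column="region",        # source column
    transformation=SQLExpressionTransform(expression="cast(region as string)"),
)
# Expected: a 'region_name' column; reported: it stays 'region'.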

Age: { age:legacy | age:new-tech-debt-introduced }

Present since: 201X-XX-XX

Estimated cost: { estimatedcost:simple | estimatedcost:complex | estimatedcost:investigation_needed }

Type: { type:documentation | type:coding | type:testing }

Description 📋

A clear and concise description of what the tech debt is and the reason of being created

Impact 💣

Description of the current or possible impact of this tech debt.

Critical in: { N MONTHS | N YEARS | UNKNOWN }

or

Critical when

Solution Hints :shipit:

Description of solution hints that you have in mind.

Observations 🤔

Related files or evidence (e.g., screenshots)

(screenshot attached in the original issue)

When we check the feature set dataframe, we can see that the function was applied to both features, but the name is wrong for the region feature.

(screenshot attached in the original issue)

Depends on issue X
