πŸ‘¨πŸ»β€πŸ« Hopsworks Tutorials

We are happy to welcome you to our collection of tutorials dedicated to exploring the fundamentals of Hopsworks and Machine Learning development. The tutorials cover a range of use cases and common topics in the field, and show how to navigate and serve models in a production environment using the Hopsworks Feature Store.

βš™οΈ How to run the tutorials:

For the tutorials to work, you will need a Hopsworks account: go to app.hopsworks.ai and create one. With a managed account, you can simply run the Jupyter notebooks from within Hopsworks.

Generally, the notebooks contain the information you need on how to interact with the Hopsworks platform.

If you have an app.hopsworks.ai account, you can connect to Hopsworks with the following lines; they will prompt you with a link to your API token, which connects you to the feature store.

import hopsworks

project = hopsworks.login()        # prompts with a link to create/paste your API token
fs = project.get_feature_store()   # handle to the project's feature store

In some cases you may also need to install the hopsworks package before you can work with it. Simply start your notebook with:

!pip install -U hopsworks --quiet

The walkthroughs and tutorials are provided as Python notebooks, so you will need to run a Jupyter environment or work within a Google Colaboratory notebook; the latter option might surface some minor errors, or libraries might require different versions to work.

✍🏻 Concepts:

To understand the tutorials, you need to be familiar with general concepts of Machine Learning and Python development. You may find useful background information in the Hopsworks documentation.

πŸ—„οΈ Table of Content:

  • Basic Tutorials:
    • QuickStart: Introductory tutorial to get started quickly.
    • Churn: Predict customers that are at risk of churning.
    • Fraud Batch: Detect Fraud Transactions (Batch use case).
    • Fraud Online: Detect Fraud Transactions (Online use case).
    • Iris: Classify iris flower species.
    • Loan Approval: Predict loan approvals.
  • Advanced Tutorials:
    • Air Quality: Creating an air quality AI assistant that displays and explains air quality indicators for specific dates or periods, using Function Calling for LLMs and a RAG approach without a vector database.
    • Bitcoin: Predict Bitcoin price using timeseries features and tweet sentiment analysis.
    • Citibike: Predict the number of Citi Bike users at each Citi Bike station in New York City.
    • Credit Scores: Predict clients' repayment abilities.
    • Electricity: Predict the electricity prices in several Swedish cities based on weather conditions, previous prices, and Swedish holidays.
    • NYC Taxi Fares: Predict the fare amount for a taxi ride in New York City given the pickup and dropoff locations.
    • Recommender System: Build a recommender system for fashion items.
    • TimeSeries: Timeseries price prediction.
    • LLM PDF: An AI assistant that utilizes a Retrieval-Augmented Generation (RAG) system to provide accurate answers to user questions by retrieving relevant context from PDF documents.
    • Fraud Cheque Detection: Building an AI assistant that detects fraudulent scanned cheque images and generates explanations for the fraud classification, using a fine-tuned open-source LLM.
    • Keras model and Sklearn Transformation Functions with Hopsworks Model Registry: How to register Sklearn transformation functions and a Keras model in the Hopsworks Model Registry, retrieve them, and then use them in training and inference pipelines.
    • PyTorch model and Sklearn Transformation Functions with Hopsworks Model Registry: How to register Sklearn transformation functions and a PyTorch model in the Hopsworks Model Registry, retrieve them, and then use them in training and inference pipelines.
    • Sklearn Transformation Functions with Hopsworks Model Registry: How to register an sklearn.pipeline with transformation functions and a classifier in the Hopsworks Model Registry and use it in training and inference pipelines.
    • Custom Transformation Functions: How to register custom transformation functions in the Hopsworks Feature Store and use them in training and inference pipelines.
  • Integrations:
    • BigQuery Storage Connector: Create an External Feature Group using BigQuery Storage Connector.
    • Google Cloud Storage: Create an External Feature Group using GCS Storage Connector.
    • Redshift: Create an External Feature Group using Redshift Storage Connector.
    • Snowflake: Create an External Feature Group using Snowflake Storage Connector.
    • DBT Tutorial with BigQuery: Perform feature engineering in DBT on BigQuery.
    • WandB: Build a machine learning model with Weights & Biases.
    • Great Expectations: Introduction to Great Expectations concepts and classes which are relevant for integration with the Hopsworks MLOps platform.
    • Neo4j: Perform anti-money laundering (AML) predictions using a Neo4j graph representation of transactions.
    • Polars: Introductory tutorial on using Polars.
    • PySpark Streaming: Real time feature computation from streaming data using PySpark and the Hopsworks Feature Store.
    • Monitoring: How to implement feature monitoring in your production pipeline.
    • Bytewax: Real time feature computation using Bytewax.
    • Apache Beam: Real time feature computation using Apache Beam, Google Cloud Dataflow and Hopsworks Feature Store.
    • Apache Flink: Real time feature computation using Apache Flink and Hopsworks Feature Store.
    • MageAI: Build and operate an ML system with Mage and Hopsworks.

πŸ“ Feedbacks & Comments:

We welcome feedbacks and suggestions, you can contact us on any of the following channels:


hopsworks-tutorials's Issues

Java Beam Example

To run the Java Beam example:

1. Add options into the Pipeline creation:
   Pipeline p = Pipeline.create(options)

2. Specify the GCP region and gcpTempLocation in the Maven command:
   --region=us-central1 --gcpTempLocation=gs:///<your_tmp_directory>

3. The GCP service account also needs serviceusage permissions.

4. The Hopsworks serverless hostname should be c.app.hopsworks.ai

I was unable to get it to run on Windows due to a path error in finding keystore.jks. It worked OK in Google Cloud Shell.
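
For comparison, here is what step 1 looks like in Beam's Python SDK: a minimal sketch with placeholder values (the bucket path and the tiny pipeline are illustrative, not part of the tutorial's Java code).

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pass the Dataflow region and temp location as pipeline options;
# gs://<your_bucket>/tmp is a placeholder for your own GCS path.
options = PipelineOptions([
    "--region=us-central1",
    "--temp_location=gs://<your_bucket>/tmp",
])

with beam.Pipeline(options=options) as p:
    # A trivial pipeline just to show that the options are wired in.
    p | beam.Create([1, 2, 3]) | beam.Map(print)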

Air quality Advanced Tutorial [feature view creation]

In the third notebook, for some reason an operational error is printed after calling selected_features.show(5); the same error appears even when trying to create the training data.

This is the error:

OperationalError Traceback (most recent call last)
File ~/mambaforge/lib/python3.10/site-packages/pandas/io/sql.py:2266, in SQLiteDatabase.execute(self, sql, params)
2265 try:
-> 2266 cur.execute(sql, *args)
2267 return cur

File ~/mambaforge/lib/python3.10/site-packages/pyhive/hive.py:408, in Cursor.execute(self, operation, parameters, **kwargs)
407 response = self._connection.client.ExecuteStatement(req)
--> 408 _check_status(response)
409 self._operationHandle = response.operationHandle

File ~/mambaforge/lib/python3.10/site-packages/pyhive/hive.py:538, in _check_status(response)
537 if response.status.statusCode != ttypes.TStatusCode.SUCCESS_STATUS:
--> 538 raise OperationalError(response)

OperationalError: TExecuteStatementResp(status=TStatus(statusCode=3, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask:28:27', 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:343', 'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:232', 'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:269', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:255', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:541', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:516', 'sun.reflect.GeneratedMethodAccessor268:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:498', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:422', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1821', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy53:executeStatement::-1', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:281', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:712', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1557', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1542', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:750'], sqlState='08S01', errorCode=1, errorMessage='Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask'), operationHandle=None)

During handling of the above exception, another exception occurred:

NotSupportedError Traceback (most recent call last)
File ~/mambaforge/lib/python3.10/site-packages/pandas/io/sql.py:2270, in SQLiteDatabase.execute(self, sql, params)
2269 try:
-> 2270 self.con.rollback()
2271 except Exception as inner_exc: # pragma: no cover

File ~/mambaforge/lib/python3.10/site-packages/pyhive/hive.py:285, in Connection.rollback(self)
284 def rollback(self):
--> 285 raise NotSupportedError("Hive does not have transactions")

NotSupportedError: Hive does not have transactions

The above exception was the direct cause of the following exception:

DatabaseError Traceback (most recent call last)
Cell In[11], line 2
1 # # Uncomment this if you would like to view your selected features
----> 2 selected_features.show(5)

File ~/mambaforge/lib/python3.10/site-packages/hsfs/constructor/query.py:182, in Query.show(self, n, online)
179 read_options = {}
180 sql_query, online_conn = self._prep_read(online, read_options)
--> 182 return engine.get_instance().show(
183 sql_query, self._feature_store_name, n, online_conn, read_options
184 )

File ~/mambaforge/lib/python3.10/site-packages/hsfs/engine/python.py:317, in Engine.show(self, sql_query, feature_store, n, online_conn, read_options)
316 def show(self, sql_query, feature_store, n, online_conn, read_options={}):
--> 317 return self.sql(
318 sql_query, feature_store, online_conn, "default", read_options
319 ).head(n)

File ~/mambaforge/lib/python3.10/site-packages/hsfs/engine/python.py:106, in Engine.sql(self, sql_query, feature_store, online_conn, dataframe_type, read_options, schema)
96 def sql(
97 self,
98 sql_query,
(...)
103 schema=None,
104 ):
105 if not online_conn:
--> 106 return self._sql_offline(
107 sql_query,
108 feature_store,
109 dataframe_type,
110 schema,
111 hive_config=read_options.get("hive_config") if read_options else None,
112 )
113 else:
114 return self._jdbc(
115 sql_query, online_conn, dataframe_type, read_options, schema
116 )

File ~/mambaforge/lib/python3.10/site-packages/hsfs/engine/python.py:144, in Engine._sql_offline(self, sql_query, feature_store, dataframe_type, schema, hive_config)
142 with warnings.catch_warnings():
143 warnings.simplefilter("ignore", UserWarning)
--> 144 result_df = util.run_with_loading_animation(
145 "Reading data from Hopsworks, using Hive",
146 pd.read_sql,
147 sql_query,
148 hive_conn,
149 )
151 if schema:
152 result_df = Engine.cast_columns(result_df, schema)

File ~/mambaforge/lib/python3.10/site-packages/hsfs/util.py:345, in run_with_loading_animation(message, func, *args, **kwargs)
342 end = None
344 try:
--> 345 result = func(*args, **kwargs)
346 end = time.time()
347 return result

File ~/mambaforge/lib/python3.10/site-packages/pandas/io/sql.py:654, in read_sql(sql, con, index_col, coerce_float, params, parse_dates, columns, chunksize, dtype_backend, dtype)
652 with pandasSQL_builder(con) as pandas_sql:
653 if isinstance(pandas_sql, SQLiteDatabase):
--> 654 return pandas_sql.read_query(
655 sql,
656 index_col=index_col,
657 params=params,
658 coerce_float=coerce_float,
659 parse_dates=parse_dates,
660 chunksize=chunksize,
661 dtype_backend=dtype_backend,
662 dtype=dtype,
663 )
665 try:
666 _is_table_name = pandas_sql.has_table(sql)

File ~/mambaforge/lib/python3.10/site-packages/pandas/io/sql.py:2330, in SQLiteDatabase.read_query(self, sql, index_col, coerce_float, parse_dates, params, chunksize, dtype, dtype_backend)
2319 def read_query(
2320 self,
2321 sql,
(...)
2328 dtype_backend: DtypeBackend | Literal["numpy"] = "numpy",
2329 ) -> DataFrame | Iterator[DataFrame]:
-> 2330 cursor = self.execute(sql, params)
2331 columns = [col_desc[0] for col_desc in cursor.description]
2333 if chunksize is not None:

File ~/mambaforge/lib/python3.10/site-packages/pandas/io/sql.py:2275, in SQLiteDatabase.execute(self, sql, params)
2271 except Exception as inner_exc: # pragma: no cover
2272 ex = DatabaseError(
2273 f"Execution failed on sql: {sql}\n{exc}\nunable to rollback"
2274 )
-> 2275 raise ex from inner_exc
2277 ex = DatabaseError(f"Execution failed on sql '{sql}': {exc}")
2278 raise ex from exc

DatabaseError: Execution failed on sql: WITH right_fg0 AS (SELECT *
FROM (SELECT fg1.city_name city_name, fg1.date date, fg1.pm2_5 pm2_5, fg1.pm_2_5_previous_1_day pm_2_5_previous_1_day, fg1.pm_2_5_previous_2_day pm_2_5_previous_2_day, fg1.pm_2_5_previous_3_day pm_2_5_previous_3_day, fg1.pm_2_5_previous_4_day pm_2_5_previous_4_day, fg1.pm_2_5_previous_5_day pm_2_5_previous_5_day, fg1.pm_2_5_previous_6_day pm_2_5_previous_6_day, fg1.pm_2_5_previous_7_day pm_2_5_previous_7_day, fg1.mean_7_days mean_7_days, fg1.mean_14_days mean_14_days, fg1.mean_28_days mean_28_days, fg1.std_7_days std_7_days, fg1.exp_mean_7_days exp_mean_7_days, fg1.exp_std_7_days exp_std_7_days, fg1.std_14_days std_14_days, fg1.exp_mean_14_days exp_mean_14_days, fg1.exp_std_14_days exp_std_14_days, fg1.std_28_days std_28_days, fg1.exp_mean_28_days exp_mean_28_days, fg1.exp_std_28_days exp_std_28_days, fg1.year year, fg1.day_of_month day_of_month, fg1.month month, fg1.day_of_week day_of_week, fg1.is_weekend is_weekend, fg1.sin_day_of_year sin_day_of_year, fg1.cos_day_of_year cos_day_of_year, fg1.sin_day_of_week sin_day_of_week, fg1.cos_day_of_week cos_day_of_week, fg1.unix_time unix_time, fg1.city_name join_pk_city_name, fg1.unix_time join_pk_unix_time, fg1.unix_time join_evt_unix_time, fg0.temperature_max temperature_max, fg0.temperature_min temperature_min, fg0.precipitation_sum precipitation_sum, fg0.rain_sum rain_sum, fg0.snowfall_sum snowfall_sum, fg0.precipitation_hours precipitation_hours, fg0.wind_speed_max wind_speed_max, fg0.wind_gusts_max wind_gusts_max, fg0.wind_direction_dominant wind_direction_dominant, RANK() OVER (PARTITION BY fg1.city_name, fg1.date, fg1.unix_time ORDER BY fg0.unix_time DESC) pit_rank_hopsworks
FROM soll_featurestore.air_quality_1 fg1
INNER JOIN soll_featurestore.weather_1 fg0 ON fg1.city_name = fg0.city_name AND fg1.date = fg0.date AND fg1.unix_time >= fg0.unix_time) NA
WHERE pit_rank_hopsworks = 1) (SELECT right_fg0.city_name city_name, right_fg0.date date, right_fg0.pm2_5 pm2_5, right_fg0.pm_2_5_previous_1_day pm_2_5_previous_1_day, right_fg0.pm_2_5_previous_2_day pm_2_5_previous_2_day, right_fg0.pm_2_5_previous_3_day pm_2_5_previous_3_day, right_fg0.pm_2_5_previous_4_day pm_2_5_previous_4_day, right_fg0.pm_2_5_previous_5_day pm_2_5_previous_5_day, right_fg0.pm_2_5_previous_6_day pm_2_5_previous_6_day, right_fg0.pm_2_5_previous_7_day pm_2_5_previous_7_day, right_fg0.mean_7_days mean_7_days, right_fg0.mean_14_days mean_14_days, right_fg0.mean_28_days mean_28_days, right_fg0.std_7_days std_7_days, right_fg0.exp_mean_7_days exp_mean_7_days, right_fg0.exp_std_7_days exp_std_7_days, right_fg0.std_14_days std_14_days, right_fg0.exp_mean_14_days exp_mean_14_days, right_fg0.exp_std_14_days exp_std_14_days, right_fg0.std_28_days std_28_days, right_fg0.exp_mean_28_days exp_mean_28_days, right_fg0.exp_std_28_days exp_std_28_days, right_fg0.year year, right_fg0.day_of_month day_of_month, right_fg0.month month, right_fg0.day_of_week day_of_week, right_fg0.is_weekend is_weekend, right_fg0.sin_day_of_year sin_day_of_year, right_fg0.cos_day_of_year cos_day_of_year, right_fg0.sin_day_of_week sin_day_of_week, right_fg0.cos_day_of_week cos_day_of_week, right_fg0.unix_time unix_time, right_fg0.temperature_max temperature_max, right_fg0.temperature_min temperature_min, right_fg0.precipitation_sum precipitation_sum, right_fg0.rain_sum rain_sum, right_fg0.snowfall_sum snowfall_sum, right_fg0.precipitation_hours precipitation_hours, right_fg0.wind_speed_max wind_speed_max, right_fg0.wind_gusts_max wind_gusts_max, right_fg0.wind_direction_dominant wind_direction_dominant
FROM right_fg0)
TExecuteStatementResp(status=TStatus(statusCode=3, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask:28:27', 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:343', 'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:232', 'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:269', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:255', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:541', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:516', 'sun.reflect.GeneratedMethodAccessor268:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:498', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:422', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1821', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy53:executeStatement::-1', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:281', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:712', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1557', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1542', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:750'], sqlState='08S01', errorCode=1, errorMessage='Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask'), operationHandle=None)
unable to rollback

What is the "ELASTIC_ENDPOINT"?

When I tried this tutorial (https://github.com/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/recommender-system/3_build_index.ipynb), I got this error.
The logs:
Traceback (most recent call last):
File β€œ/root/hopsworks/recommender-system/build_index.py”, line 55, in
client = OpenSearch(**opensearch_api.get_default_py_config())
File β€œ/kvm/miniconda3/envs/hopsworks/lib/python3.9/site-packages/hopsworks/core/opensearch_api.py”, line 66, in get_default_py_config
url = furl(self._get_opensearch_url())
File β€œ/kvm/miniconda3/envs/hopsworks/lib/python3.9/site-packages/hopsworks/core/opensearch_api.py”, line 33, in _get_opensearch_url
return os.environ[constants.ENV_VARS.ELASTIC_ENDPOINT_ENV_VAR]
File β€œ/kvm/miniconda3/envs/hopsworks/lib/python3.9/os.py”, line 679, in getitem
raise KeyError(key) from None
KeyError: β€˜ELASTIC_ENDPOINT’

I am confused about ELASTIC_ENDPOINT: what is it, and how do I fix it? Help!
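
For what it's worth, ELASTIC_ENDPOINT appears to be an environment variable that Hopsworks sets inside its own notebook/job environment so the OpenSearch API can locate the cluster's OpenSearch service; running the script outside Hopsworks would leave it unset, which matches the KeyError above. A minimal sketch of the intended usage from inside a Hopsworks environment (assuming the opensearch-py package is available):

import hopsworks
from opensearchpy import OpenSearch

project = hopsworks.login()
opensearch_api = project.get_opensearch_api()

# get_default_py_config() resolves the cluster's OpenSearch endpoint from the
# environment, so this is expected to work only inside a Hopsworks notebook/job.
client = OpenSearch(**opensearch_api.get_default_py_config())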

pandas mean function call instead of standard deviation and frequency

Dear Team,

I was going through the fraud_batch 1_feature_groups.ipynb notebook. I noticed that the comments say the code is calculating the standard deviation and the frequency; however, the logic below each comment computes the mean instead.

https://github.com/logicalclocks/hopsworks-tutorials/blob/master/fraud_batch/1_feature_groups.ipynb

# Moving standard deviation of transaction volume.
df_4h_std = pd.DataFrame(cc_group.mean())

# Moving average of transaction frequency.
df_4h_count = pd.DataFrame(cc_group.mean())
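
For illustration, here is a minimal self-contained sketch of what the comments presumably intended, using a hypothetical 4-hour rolling window per card (the data and cc_group below are stand-ins for the notebook's objects, not its actual code):

import pandas as pd

# Hypothetical stand-in for the notebook's grouped rolling window.
df = pd.DataFrame({
    "cc_num": [1, 1, 1, 2, 2],
    "datetime": pd.date_range("2024-01-01", periods=5, freq="h"),
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})
cc_group = df.set_index("datetime").groupby("cc_num")["amount"].rolling("4h")

df_4h_std = pd.DataFrame(cc_group.std())      # moving standard deviation of volume
df_4h_count = pd.DataFrame(cc_group.count())  # moving transaction frequency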

Tutorial to push to Online Store from Databricks

Hello,

Sorry for a naive question. I want to use the open-source Hopsworks Feature Store, as a standalone package, on Databricks clusters. Though I was able to find tutorials on how to connect to and save data in the offline HSFS (see Hopsworks Examples), I'm struggling to find a similar one on how to push data from the offline feature store to the online feature store (RonDB).

Please provide appropriate directions.

Thanks for your time ! !
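
For reference, writing to the online store with hsfs usually comes down to creating the feature group with online_enabled=True, so that inserts are written to both the offline and the online store. A minimal sketch with hypothetical names and data:

import hopsworks
import pandas as pd

project = hopsworks.login()
fs = project.get_feature_store()

# Hypothetical example data and feature group.
df = pd.DataFrame({"cc_num": [1, 2], "amount": [10.0, 20.0]})

fg = fs.get_or_create_feature_group(
    name="transactions",
    version=1,
    primary_key=["cc_num"],
    online_enabled=True,  # inserts also go to the online store (RonDB)
)
fg.insert(df)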

Quickstarts tutorial predict fails with xgboost error

deployment.predict(inputs=fraud_model.input_example)

fails with HTTP Error 500 - XGBClassifier object has no attribute 'use_label_encoder'.

Setting use_label_encoder=False does not resolve the problem, nor does downgrading XGBoost from 2.0.0.
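
One hedged guess: this error pattern typically shows up when a model pickled under one XGBoost version is unpickled under another (use_label_encoder was removed in XGBoost 2.0), i.e. the training and serving environments disagree on the XGBoost version. Saving and loading via XGBoost's native JSON format is more version-tolerant than pickling; a minimal sketch with hypothetical file names and synthetic data:

import numpy as np
import xgboost as xgb

# Tiny synthetic data so the sketch runs end to end.
X = np.random.rand(20, 3)
y = np.random.randint(0, 2, 20)

model = xgb.XGBClassifier(n_estimators=5)
model.fit(X, y)
model.save_model("model.json")   # native, version-tolerant format

clf = xgb.XGBClassifier()
clf.load_model("model.json")     # safe across XGBoost versions, unlike pickle
print(clf.predict(X[:2]))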

Fraud batch training pipeline, part 2

Hi,

I'm looking into your Fraud batch training pipeline tutorial. I get an error in part 2 when trying to create the training dataset. I would be thankful for help. /Daniel

Copy of code and error message:

TEST_SIZE = 0.2

td_version, td_job = feature_view.create_train_test_split(
    description='transactions fraud batch training dataset',
    data_format='csv',
    test_size=TEST_SIZE,
    write_options={'wait_for_job': True},
)

RestAPIError Traceback (most recent call last)
in <module>
1 TEST_SIZE = 0.2
2
----> 3 td_version, td_job = feature_view.create_train_test_split(
4 description = 'transactions fraud batch training dataset',
5 data_format = 'csv',

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/hsfs/feature_view.py in create_train_test_split(self, test_size, train_start, train_end, test_start, test_end, storage_connector, location, description, extra_filter, data_format, coalesce, seed, statistics_config, write_options)
1038 )
1039 # td_job is used only if the python engine is used
-> 1040 td, td_job = self._feature_view_engine.create_training_dataset(
1041 self, td, write_options
1042 )

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/hsfs/core/feature_view_engine.py in create_training_dataset(self, feature_view_obj, training_dataset_obj, user_write_options)
239 ):
240 self._set_event_time(feature_view_obj, training_dataset_obj)
--> 241 updated_instance = self._create_training_data_metadata(
242 feature_view_obj, training_dataset_obj
243 )

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/hsfs/core/feature_view_engine.py in _create_training_data_metadata(self, feature_view_obj, training_dataset_obj)
501
502 def _create_training_data_metadata(self, feature_view_obj, training_dataset_obj):
--> 503 td = self._feature_view_api.create_training_dataset(
504 feature_view_obj.name, feature_view_obj.version, training_dataset_obj
505 )

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/hsfs/core/feature_view_api.py in create_training_dataset(self, name, version, training_dataset_obj)
175 headers = {"content-type": "application/json"}
176 return training_dataset_obj.update_from_response_json(
--> 177 self._client._send_request(
178 "POST", path, headers=headers, data=training_dataset_obj.json()
179 )

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/hsfs/decorators.py in if_connected(inst, *args, **kwargs)
33 if not inst._connected:
34 raise NoHopsworksConnectionError
---> 35 return fn(inst, *args, **kwargs)
36
37 return if_connected

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/hsfs/client/base.py in _send_request(self, method, path_params, query_params, headers, data, stream, files)
169
170 if response.status_code // 100 != 2:
--> 171 raise exceptions.RestAPIError(url, response)
172
173 if stream:

RestAPIError: Metadata operation error: (url: https://hopsworks.glassfish.service.consul:8182/hopsworks-api/api/project/1155/featurestores/1103/featureview/transactions_view_fraud_batch_fv/version/1/trainingdatasets). Server response:
HTTP code: 500, HTTP reason: Internal Server Error, body: b'{"type":"restApiJsonResponse","errorCode":120000,"errorMsg":"A generic error occurred."}', error code: 120000, error msg: A generic error occurred., user msg:

Project name is not a valid Avro name

Hi! I'm learning from the https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/fraud_batch/1_feature_groups.ipynb tutorial. I appreciate that it's helpful and detailed. Thank you! However, I encountered the error FeatureStoreException: Failed to construct Avro Schema: 270_project_fs_featurestore is not a valid Avro name because it does not match the pattern (?:^|\.)[A-Za-z_][A-Za-z0-9_]*$ at the cell with the code trans_fg.insert(trans_df, write_options={"wait_for_job": False}). I was able to resolve it by creating a new project whose name contains alphabetic characters only (see the pattern check sketch after the traceback below).

May I know if this is a bug? When creating the project in the web UI, it accepted the project name 270_project_fs with no errors, but when I call the fg.insert() function it fails.

Below is the complete error message:

InvalidName                               Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/hsfs/feature_group.py in _get_encoded_avro_schema(self)
   1407         try:
-> 1408             avro.schema.parse(schema_s)
   1409         except avro.schema.SchemaParseException as e:

13 frames
/usr/local/lib/python3.9/dist-packages/avro/schema.py in parse(json_string, validate_enum_symbols)
   1147     # construct the Avro Schema object
-> 1148     return make_avsc_object(json_data, names, validate_enum_symbols)

/usr/local/lib/python3.9/dist-packages/avro/schema.py in make_avsc_object(json_data, names, validate_enum_symbols)
   1092                 doc = json_data.get('doc')
-> 1093                 return RecordSchema(name, namespace, fields, names, type, doc, other_props)
   1094             else:

/usr/local/lib/python3.9/dist-packages/avro/schema.py in __init__(self, name, namespace, fields, names, schema_type, doc, other_props)
    877         else:
--> 878             NamedSchema.__init__(self, schema_type, name, namespace, names,
    879                                  other_props)

/usr/local/lib/python3.9/dist-packages/avro/schema.py in __init__(self, type, name, namespace, names, other_props)
    362         # Add class members
--> 363         new_name = names.add_name(name, namespace, self)
    364 

/usr/local/lib/python3.9/dist-packages/avro/schema.py in add_name(self, name_attr, space_attr, new_schema)
    330         """
--> 331         to_add = Name(name_attr, space_attr, self.default_namespace)
    332 

/usr/local/lib/python3.9/dist-packages/avro/schema.py in __init__(self, name_attr, space_attr, default_space)
    261 
--> 262         self._validate_fullname(self._full)
    263 

/usr/local/lib/python3.9/dist-packages/avro/schema.py in _validate_fullname(self, fullname)
    265         for name in fullname.split('.'):
--> 266             validate_basename(name)
    267 

/usr/local/lib/python3.9/dist-packages/avro/schema.py in validate_basename(basename)
    149     if not _BASE_NAME_PATTERN.search(basename):
--> 150         raise InvalidName("{!s} is not a valid Avro name because it "
    151                           "does not match the pattern {!s}".format(

InvalidName: 270_project_fs_featurestore is not a valid Avro name because it does not match the pattern (?:^|\.)[A-Za-z_][A-Za-z0-9_]*$

During handling of the above exception, another exception occurred:

FeatureStoreException                     Traceback (most recent call last)
<ipython-input-16-21d160d2454d> in <cell line: 1>()
----> 1 trans_fg.insert(trans_df, write_options={"wait_for_job": False})

/usr/local/lib/python3.9/dist-packages/hsfs/feature_group.py in insert(self, features, overwrite, operation, storage, write_options, validation_options)
   1073         feature_dataframe = engine.get_instance().convert_to_default_dataframe(features)
   1074 
-> 1075         job, ge_report = self._feature_group_engine.insert(
   1076             self,
   1077             feature_dataframe,

/usr/local/lib/python3.9/dist-packages/hsfs/core/feature_group_engine.py in insert(self, feature_group, feature_dataframe, overwrite, operation, storage, write_options, validation_options)
    111 
    112         return (
--> 113             engine.get_instance().save_dataframe(
    114                 feature_group,
    115                 feature_dataframe,

/usr/local/lib/python3.9/dist-packages/hsfs/engine/python.py in save_dataframe(self, feature_group, dataframe, operation, online_enabled, storage, offline_write_options, online_write_options, validation_id)
    453     ):
    454         if feature_group.stream:
--> 455             return self._write_dataframe_kafka(
    456                 feature_group, dataframe, offline_write_options
    457             )

/usr/local/lib/python3.9/dist-packages/hsfs/engine/python.py in _write_dataframe_kafka(self, feature_group, dataframe, offline_write_options)
    811 
    812         # setup row writer function
--> 813         writer = self._get_encoder_func(feature_group._get_encoded_avro_schema())
    814 
    815         def acked(err, msg):

/usr/local/lib/python3.9/dist-packages/hsfs/feature_group.py in _get_encoded_avro_schema(self)
   1408             avro.schema.parse(schema_s)
   1409         except avro.schema.SchemaParseException as e:
-> 1410             raise FeatureStoreException("Failed to construct Avro Schema: {}".format(e))
   1411         return schema_s
   1412 

FeatureStoreException: Failed to construct Avro Schema: 270_project_fs_featurestore is not a valid Avro name because it does not match the pattern (?:^|\.)[A-Za-z_][A-Za-z0-9_]*$
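
For reference, a quick check of the Avro name pattern from the error against both project names (a minimal sketch):

import re

# The pattern quoted in the error message.
pattern = re.compile(r"(?:^|\.)[A-Za-z_][A-Za-z0-9_]*$")

print(bool(pattern.search("270_project_fs_featurestore")))  # False: starts with a digit
print(bool(pattern.search("project_fs_featurestore")))      # True: starts with a letter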

Project Name

Hello everyone,

I mostly have a question out of curiosity.

When using the Hopsworks feature store, does the project name have to be globally unique, or unique only per person/organization?

I am hosting a course using Hopsworks, and other people taking the course cannot use the same project name as me, even though they have entirely different users and organizations.
