logicalclocks / hopsworks Goto Github PK

Hopsworks - Data-Intensive AI platform with a Feature Store

License: GNU Affero General Public License v3.0

Java 67.12% HTML 6.62% CSS 2.00% JavaScript 7.90% Python 0.19% Shell 0.12% Ruby 15.32% Jupyter Notebook 0.69% Less 0.02%

feature-store aws azure data-science feature-engineering feature-management gcp governance kserve machine-learning

hopsworks's Introduction

What is Hopsworks?

Hopsworks is a data platform for ML with a Python-centric Feature Store and MLOps capabilities. Hopsworks is a modular platform. You can use it as a standalone Feature Store, you can use it to manage, govern, and serve your models, and you can even use it to develop and operate feature pipelines and training pipelines. Hopsworks brings collaboration for ML teams, providing a secure, governed platform for developing, managing, and sharing ML assets - features, models, training data, batch scoring data, logs, and more.

🚀 Quickstart

APP - Serverless (beta)

→ Go to app.hopsworks.ai

Hopsworks is available as a serverless app, simply head to app.hopsworks.ai and register with your Gmail or Github accounts. You will then be able to run a tutorial or access Hopsworks directly and try yourself. This is the prefered way to first experience the platform before diving into more advanced uses and installation requirements.

Azure, AWS & GCP

Managed Hopsworks is our platform for running Hopsworks and the Feature Store in the cloud and integrates directly with the customer AWS/Azure/GCP environment. It also integrates seamlessly with third party platforms such as Databricks, SageMaker and KubeFlow.

If you wish to run Hopsworks on your Azure, AWS or GCP environement, follow one of the following guides in our documentation:

Installer - On-premise

→ Follow the installation instructions.

The hopsworks-installer.sh script downloads, configures, and installs Hopsworks. It is typically run interactively, prompting the user about details of what is installed and where. It can also be run non-interactively (no user prompts) using the '-ni' switch.

Requirements

You need at least one server or virtual machine on which Hopsworks will be installed with at least the following specification:

Centos/RHEL 7.x or Ubuntu 18.04;
at least 32GB RAM,
at least 8 CPUs,
100 GB of free hard-disk space,
outside Internet access (if this server is air-gapped, contact us for support),
a UNIX user account with sudo privileges.

🎓 Documentation and API

Documentation

Hopsworks documentation includes user guides, feature store documentation and an administration guide. We also include concepts to help user navigates the abstractions and logics of the feature stores and MLOps in general:

Feature Store: https://docs.hopsworks.ai/3.0/concepts/fs/
Projects: https://docs.hopsworks.ai/3.0/concepts/projects/governance/
MLOps: https://docs.hopsworks.ai/3.0/concepts/mlops/prediction_services/

APIs

Hopsworks API documentation is divided in 3 categories; Hopsworks API covers project level APIs, Feature Store API covers covers feature groups, feature views and connectors, and finally MLOps API covers Model Registry, serving and deployment.

Hopsworks API - https://docs.hopsworks.ai/hopsworks-api/3.0.1/generated/api/connection/
Feature Store API - https://docs.hopsworks.ai/feature-store-api/3.0.0/generated/api/connection_api/
MLOps API - https://docs.hopsworks.ai/machine-learning-api/3.0.0/generated/connection_api/

Tutorials

Most of the tutorials require you to have at least an account on app.hopsworks.ai. You can explore the dedicated https://github.com/logicalclocks/hopsworks-tutorials repository containing our tutorials or jump directly in one of the existing use cases:

Fraud (batch): https://github.com/logicalclocks/hopsworks-tutorials/tree/master/fraud_batch
Fraud (online): https://github.com/logicalclocks/hopsworks-tutorials/tree/master/fraud_online
Churn prediction https://github.com/logicalclocks/hopsworks-tutorials/tree/master/churn

📦 Main Features

Project-based Multi-Tenancy and Team Collaboration

Hopsworks provides projects as a secure sandbox in which teams can collaborate and share ML assets. Hopsworks' unique multi-tenant project model even enables sensitive data to be stored in a shared cluster, while still providing fine-grained sharing capabilities for ML assets across project boundaries. Projects can be used to structure teams so that they have end-to-end responsibility from raw data to managed features and models. Projects can also be used to create development, staging, and production environments for data teams. All ML assets support versioning, lineage, and provenance provide all Hopsworks users with a complete view of the MLOps life cycle, from feature engineering through model serving.

Development and Operations

Hopsworks provides development tools for Data Science, including conda environments for Python, Jupyter notebooks, jobs, or even notebooks as jobs. You can build production pipelines with the bundled Airflow, and even run ML training pipelines with GPUs in notebooks on Airflow. You can train models on as many GPUs as are installed in a Hopsworks cluster and easily share them among users. You can also run Spark, Spark Streaming, or Flink programs on Hopsworks, with support for elastic workers in the cloud (add/remove workers dynamically).

Available on any Platform

Hopsworks is available as a both managed platform in the cloud on AWS, Azure, and GCP, and can be installed on any Linux-based virtual machines (Ubuntu/Redhat compatible), even in air-gapped data centers. Hopsworks is also available as a serverless platform that manages and serves both your features and models.

🧑‍🤝‍🧑 Community

Contribute

We are building the most complete and modular ML platform available in the market, and we count on your support to continuously improve Hopsworks. Feel free to give us suggestions, report bugs and add features to our library anytime.

Join the community

Ask questions and give us feedback in the Hopsworks Community
Join our Public Slack Channel
Follow us on Twitter
Check out all our latest product releases

Open-Source

Hopsworks is available under the AGPL-V3 license. In plain English this means that you are free to use Hopsworks and even build paid services on it, but if you modify the source code, you should also release back your changes and any systems built around it as AGPL-V3.

hopsworks's People

Contributors

Stargazers

Watchers

Forkers

tkakantousis maismail robzor92 ermiasg alexhopsworks kouzant limmen siroibaf ideaplexus hanklu-2019 kai-chi steffengr semanticbeeng cloudtwter moritzmeister dvgopal hamzamo smkniazi gmartinsribeiro jmscraig farah-nisar kiranvajja berthoug tidesq kiminh zhiliangpersonal suyambuganesh carlee0 cyliu0204 davitbzh baihuijie bcui6611 forestlzj chenwenjia1991 aniruddhachoudhury stjordanis gibchikafa zeta1999 pradyumnakashyap cognoscentai atindriyosanyal luke202001 nameartem giardiv nitinpandey-154 zhuohuwu0603 kpchen micseb lanastazia javierdlrm zhangjunqiang kuafou yevgenchop swipswaps seantbooker lovew-lc guru1966 world2005 rpatil524 isabella232 danielschulz ptzagk animeshinvinci erssebaggala marianavig sukumar45 ilibx gavinljj ciusji faramir jai2033shankar chunmk ivanliu1989 heap-exposome metavai tdoehmen doytsujin leo23 galuszkak jozy0123 dhananjay-mk deividcingolani asener1 bubriks stanpalatnik delphen mailmahee dubeno kennethmhc jack1981 vatj aws-big-data-projects sayandigital marvel-works devopstoday11 mullerhai wuchunfu mithlaer ackuq jimdowling

hopsworks's Issues

Python Engine import error

When I run the following line of code
fs = project.get_feature_store()

I am receiving the following error

ImportError Traceback (most recent call last)
File c:\Users\user\anaconda3\envs\ML\lib\site-packages\hsfs\engine_init_.py:33, in init(engine_type)
32 try:
---> 33 from hsfs.engine import python
34 except ImportError:

File c:\Users\user\anaconda3\envs\ML\lib\site-packages\hsfs\engine\python.py:36
35 from typing import TypeVar, Optional, Dict, Any
---> 36 from confluent_kafka import Producer
37 from tqdm.auto import tqdm

File c:\Users\user\anaconda3\envs\ML\lib\site-packages\confluent_kafka_init_.py:40
20 # end delvewheel patch
21
22 #!/usr/bin/env python
(...)
37 # limitations under the License.
38 #
---> 40 from .deserializing_consumer import DeserializingConsumer
41 from .serializing_producer import SerializingProducer

File c:\Users\user\anaconda3\envs\ML\lib\site-packages\confluent_kafka\deserializing_consumer.py:19
1 #!/usr/bin/env python
2 # -- coding: utf-8 --
...
39 )
40 _engine_type = "python"
41 _engine = python.Engine()

FeatureStoreException: Trying to instantiate Python as engine, but 'python' extras are missing in HSFS installation. Install with pip install hsfs[python].`]

I've tried pretty much everything I can to solve this but obviously I didn't manage to,

Windows 11 Anaconda env with python 3.8.2

Schema inference doesn't support pandas-native types

I am trying out Hopsworks on a toy problem, and I noticed that the type inference/conversion breaks down when I use pandas-native types. Instead, I have to use numpy types. For example:

import pandas as pd
import numpy as np
import hopsworks

ctx = hopsworks.login()
fs = ctx.get_feature_store()

feature_group = fs.get_or_create_feature_group(
    name="foo",
    version="1",
    description="an example",
    primary_key=['index'],
)

# insertion using numpy types works fine
np_data = pd.DataFrame({
    "index": np.arange(10),
    "feature": np.arange(10, 0, -1)
}).astype(np.int8)

feature_group.insert(np_data)

# fails because it can't handle pandas-native Int8
pd_data = pd.DataFrame({
    "index": np.arange(10),
    "feature": np.arange(10, 0, -1)
}).astype(pd.Int8Dtype())

feature_group.insert(pd_data)  # FeatureStoreException

stacktrace

ValueError                                Traceback (most recent call last)
File c:\Users\Sebastian\Documents\Coding-Projects\!deploy_test\.venv\lib\site-packages\hsfs\engine\python.py:368, in Engine.parse_schema_feature_group(self, dataframe, time_travel_format)
    367 try:
--> 368     converted_type = self._convert_pandas_type(
    369         feat_type, arrow_schema.field(feat_name).type
    370     )
    371 except ValueError as e:

File c:\Users\Sebastian\Documents\Coding-Projects\!deploy_test\.venv\lib\site-packages\hsfs\engine\python.py:389, in Engine._convert_pandas_type(self, dtype, arrow_type)
    387     return self._infer_type_pyarrow(arrow_type)
--> 389 return self._convert_simple_pandas_type(dtype)

File c:\Users\Sebastian\Documents\Coding-Projects\!deploy_test\.venv\lib\site-packages\hsfs\engine\python.py:419, in Engine._convert_simple_pandas_type(self, dtype)
    417     return "string"
--> 419 raise ValueError(f"dtype '{dtype}' not supported")

ValueError: dtype 'Int8' not supported

During handling of the above exception, another exception occurred:

FeatureStoreException                     Traceback (most recent call last)
Cell In[66], line 5
      1 pd_data = pd.DataFrame({
      2     "index": np.arange(10),
...
--> 372         raise FeatureStoreException(f"Feature '{name}': {str(e)}")
    373     features.append(feature.Feature(name, converted_type))
    374 return features

FeatureStoreException: Feature 'index': dtype 'Int8' not supported

Are there any plans on supporting this? It would make writing extractors much easier, e.g., we can start relying on automatic type inference via convert_dtypes to convert object type data to more sensible dtypes before writing them to the feature store.

ADLS not supported?

On web document, there is an introduction about ADLS, but when I want to tried it, there is no ADLS option to choose

login user is incorrect

https://hopsworks.readthedocs.io/en/0.10/getting_started/setups/single_machine.html

I start with the virtualbox and Importing an ova image.

in the process of start vm, i comes to type hopsworks0 login:

but when i type in [email protected]
and then password: admin.

it come out incorrect

On-prem install, stuck!

I tried to instll this project, refer to https://docs.hopsworks.ai/3.0/setup_installation/on_prem/hopsworks_installer/

But this install process is stuck for a long time.
Someone could tell the reasons, and how to fix it and success to install

The installation log is:

@tkakantousis @ErmiasG @evsav @stigviaene @SirOibaf

SSL Handshake Certificate Error

Hi,

So I'm trying to insert my Feature Groupe into the Hopswokrs Feature Store and I'm recieving these message:

%3|1686429911.906|FAIL|Air-de-Yassine.lan#producer-1| [thrd:ssl://3.142.251.253:9092/bootstrap]: ssl://3.142.251.253:9092/bootstrap: SSL handshake failed: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed: broker certificate could not be verified, verify that ssl.ca.location is correctly configured or root CA certificates are installed (brew install openssl) (after 112ms in state SSL_HANDSHAKE, 9 identical error(s) suppressed)

I have already installed the OpenSLL lib using brew.
The Feature Groupe is created but the data is never ingested.

Am I missing something here?

Unable to create a project

Hi there,

I am trying to deploy Hopsworks with the main objective of trying the Feature Store component. To do so, I have followed this guide where it is explained how to deploy the whole stack using a GCE image. I followed all the steps and managed to get an instance up and running.

However, when I try to create a project with the service Feature Store (or any other service), nothing happens:

It stays like that and the project is never created. I have tried to create a user, thinking that maybe the problem is that I am not assigning users to the project, but the confirmation email is never sent to my email. I can see the pending request in the admin section, and can click on resend email, but nothing happens.

The main problem is that I don't know how to debug this, could you point me in the correct direction?

Also, this is probably another issue to open, but have you considered providing a docker-based deployment? I think it would be easier to deploy than full images or native services installs.

Thanks for your help!

localhost is not accessible

Hi,
I have completed the installation for a single machine i.e my laptop running ubuntu 18.04 LTS. steps I have followed: https://hopsworks.readthedocs.io/en/0.9/getting_started/setups/single_machine.html
installation takes around 1hr and i am able to see it on VirtualBox.

On accessing the "tail -f karamel-chef/nohup" it say's

cat karamel-chef/nohup.out
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Checking if box 'ubuntu/bionic64' version '20191218.0.0' is up to date...
==> default: Clearing any previously set forwarded ports...
==> default: Using hostname "hopsworks0.logicalclocks.com" as node name for Chef...
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
default: Adapter 1: nat
==> default: Forwarding ports...
default: 22 (guest) => 29206 (host) (adapter 1)
default: 3306 (guest) => 20048 (host) (adapter 1)
default: 9090 (guest) => 27889 (host) (adapter 1)
default: 8080 (guest) => 20766 (host) (adapter 1)
default: 8181 (guest) => 39758 (host) (adapter 1)
default: 9009 (guest) => 25500 (host) (adapter 1)
default: 4848 (guest) => 20001 (host) (adapter 1)
default: 5601 (guest) => 33923 (host) (adapter 1)
default: 3000 (guest) => 61755 (host) (adapter 1)
default: 8083 (guest) => 51157 (host) (adapter 1)
default: 8084 (guest) => 35895 (host) (adapter 1)
default: 8086 (guest) => 55799 (host) (adapter 1)
default: 2003 (guest) => 64851 (host) (adapter 1)
default: 8888 (guest) => 28717 (host) (adapter 1)
default: 11112 (guest) => 64246 (host) (adapter 1)
default: 12358 (guest) => 29026 (host) (adapter 1)
default: 8787 (guest) => 53652 (host) (adapter 1)
default: 42011 (guest) => 42011 (host) (adapter 1)
default: 42012 (guest) => 42012 (host) (adapter 1)
default: 42013 (guest) => 42013 (host) (adapter 1)
default: 2181 (guest) => 47596 (host) (adapter 1)
default: 9092 (guest) => 54353 (host) (adapter 1)
default: 16000 (guest) => 52593 (host) (adapter 1)
default: 8000 (guest) => 25931 (host) (adapter 1)
==> default: Running 'pre-boot' VM customizations...
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
default: SSH address: 127.0.0.1:29206
default: SSH username: vagrant
default: SSH auth method: private key
default: Warning: Connection reset. Retrying...
==> default: Machine booted and ready!
==> default: Checking for guest additions in VM...
==> default: Setting hostname...
==> default: Mounting shared folders...
default: /vagrant => /home/machine/Downloads/karamel-chef
default: /tmp/vagrant-chef/03ca3913f1400189ae6be0273a9b8488/cookbooks => /home/machine/Downloads/karamel-chef/cookbooks
==> default: Machine already provisioned. Run vagrant provision or use the --provision
==> default: flag to force provisioning. Provisioners marked to run always will still run.

I am unable to connect the GUI on the browser, in other words, What URL I need to open. Please help me.

Installation Error in Hopsworks 3.5 Installer

Description:
I am encountering an issue while trying to install Hopsworks 3.5 using the provided installer. The installation process is failing with the following error:
while installing the hops works 3.5 using https://github.com/logicalclocks/karamel-chef
STDERR: remote failure: Error occurred during deployment: Exception while loading the app: CDI deployment failure:WELD-001408: Unsatisfied dependencies for type FeatureStoreTagControllerIface with qualifiers @default

Hopsworks Version: 3.5

Errors in the FeatureView>Training Data documentation

The documentation here reported contains some errors.

Indeed, to delete a training data version, you need to pass during the function call a value for the variable named "training_dataset_version", while in the documentation is written:
% delete a training data version
feature_view.delete_training_dataset(version=1)

Instead, when you want to delete all the possible training datasets, you need to use the function "delete_all_training_datasets()", while on the documentation is written:
% delete all training datasets
feature_view.delete_training_dataset()

Hopsworks never installs with Poetry

I am attempting to add the hopsworks package into my project via poetry.

I have attempted to install multiple versions. Starting with the default version poetry downloads which is 3.4.4 then 3.5.0 & 3.7.0.

The issue is that Resolving dependencies loop just keeps running indefinitely and the package never installs.

I normally use python 3.11.8, but have also tried each install with all versions 3.8 - 3.11.8.

I do not have this issue with any other package using poetry.

Any help would be greatly appreciated. Thank you!

Here is my latest pyproject.toml file if needed.

[tool.poetry.dependencies]
python = ">=3.9,<3.9.7 || >3.9.7,<3.10"
jupyter = "^1.0.0"
pandas = "^2.1.4"
pyarrow = "^14.0.2"
tdqm = "^0.0.1"
plotly = "^5.18.0"
numpy = "^1.26.3"
requests = "^2.31.0"
scikit-learn = "^1.4.0"
xgboost = "^2.0.3"
lightgbm = "^4.3.0"
optuna = "^3.5.0"
python-dotenv = "^1.0.1"

Feature request - Install on top of kubernetes

Hi, I would like to install Hopsworks on kubernetes in opposite on dedicated VM.

Hopsworks web doesn't work

I tried following the tutorial at https://app.hopsworks.ai/ and got stuck waiting on the "creating new project" step. Seems totally frozen, been waiting for minutes...

Issue with date time arguments

Hello,

I have an issue with arguments that take datetime values.

My dataset has a feature of datetime for every hour with a datatype of period[H].

When I use the function FeatureView.get_batch_data and provide an argument to have the dataset start from 00:00 hours, it provides me the dataset from 12:00 hours. Similarly, with the end_time rather than providing the last datetime at 23:00 hours, it provides me with the last datetime of 11:00.

For eg.

    # get feature store instance
    fs = ...

    # get feature view instance
    feature_view = fs.get_feature_view(...)

    # set up dates
    import datetime
    start_date = datetime.datetime(2023, 1, 1)
    end_date = datetime.datetime(2023, 1, 5)

    # get a batch of data
    df = feature_view.get_batch_data(
        start_time=start_date,
        end_time=end_date
    )

If I provide a datetime value of datetime.datetime(2023, 1, 1) and datetime.datetime(2023, 1, 5) then the dataset starts from “2023-01-01 12:00:00” and ends at "2023-01-05 11:00:00".

The resulting dataframe is wrong it needs to be between “2023-01-01 00:00:00” and "2023-01-04 23:00:00".

Installed version:

Python 3.11.7
hopsworks 3.4.4
pandas 2.1.4

Any help will be appreciated, thanks.

exporting certificates does not work with Firefox

Problem:

Using Firefox go to a project in hopsworks UI, then "Settings"
Click "Export Certificates"
Only trustStore is exported. The download of keyStore is missing

It does work with Chrome.

My machine:
Ubuntu 18.10 64-bit
Firefox 66.0.2
Chrome 73.0.3683.86

Backfil job fails when sharding on datetime[ns]

Chances are that this is a user error on my part, though I couldn't work it out from the docs. Figured I'll ask here so that we can see if there is a way to improve the docs and/or if there is an issue.

I'm trying to create a sharded/partitioned feature group, which uses both a primary key and a partition key. While the feature group is created successfully, I can't seem to be able to insert data into it:

import pandas as pd
import numpy as np
import hopsworks
from datetime import datetime, timedelta

rng = np.random.default_rng(1234)

# insertion using numpy types works fine
np_data = (
    pd.DataFrame({
        "index": np.arange(10),
        "feature": np.arange(10, 0, -1),
    })
    .astype(np.int8)
    .assign(event_time=[datetime.now()+timedelta(seconds=int(x)) for x in rng.integers(0, 100, 10)])
)

ctx = hopsworks.login()
fs = ctx.get_feature_store()

feature_group = fs.get_or_create_feature_group(
    name="foo",
    version="1",
    description="an example",
    primary_key=["index"],
    partition_key=["event_time"]
)

feature_group.insert(np_data)  # FeatureStoreException

Here is a link to the (failed) backfill job: https://c.app.hopsworks.ai/p/16549/jobs/named/foo_1_offline_fg_backfill/executions
(I can also share the logs if necessary.)