sap-samples / datasphere-fedml

The publication is a collection of sample code to show how data from SAP and non-SAP systems can be made available for training in ANY hyperscaler machine learning service via several layers of abstraction from data connection to training using our FedML Python libraries.

License: Apache License 2.0

Jupyter Notebook 82.89% Python 17.11%
sample-code sap-data-warehouse-cloud machine-learning data-federation data-to-value hyperscalers btp-use-case-factory sample sap-datasphere 4200

datasphere-fedml's Introduction

FedML

Description


The SAP Federated ML Python libraries (FedML) apply the Data Federation architecture of SAP Datasphere to intelligently source SAP and non-SAP data for machine learning experiments run on ML platforms, removing the need to replicate or move data. By abstracting data connection and data load (for all ML platforms), model training (with provision for user-provided training scripts), model deployment, and inferencing (for hyperscaler machine learning platforms), the FedML library provides end-to-end integration with a few lines of code.

What's New

1. The new version of FedML (available as fedml-dsp on PyPI, V1.0.0):

  • Is machine learning platform-independent: it can be used on any machine learning platform.
  • Supports NVIDIA RAPIDS™, CUDA cuDF and CuPy, and hence can be used for training models in GPU environments.
  • Supports sourcing data from SAP Datasphere models directly into PySpark and CuPy (for GPU) dataframes.
  • Supports SAP AI Core deployment: models that are trained on any ML platform (and containerized independently) can now be deployed in SAP GenAI Hub's AI Core with a couple of lines of code.
  • Supports writing inferenced results back to SAP Datasphere.

Solution Architecture

(Architecture diagram)

2. FedML (Original, V2.0) for hyperscaler platforms [AWS, GCP, Azure and Databricks]:

  • Is pip installable from PyPI for each respective hyperscaler platform.
  • Supports model training and deployment to hyperscaler environment.
  • Supports deployment to SAP Business Technology Platform Kyma environment.
  • Supports inferencing with hyperscaler deployed as well as Kyma deployed models.
  • Supports writing inferenced results back to SAP Datasphere.

Solution Architecture - FedML Hyperscaler libraries

(Architecture diagram)

Requirements

  • SAP Datasphere tenant instance, with connectivity established to the remote data sources, and views exposed, that can be consumed by FedML.

  • Access to corresponding Machine learning Platforms with appropriate configurations. See Configuration section.

Download and Installation

Try out the examples in the samples-notebooks directory of the corresponding library folder.

Configuration

  • For FedML (platform-independent) library specific pre-requisites, configuration and documentation, please refer here
  • For AWS FedML library specific pre-requisites, configuration and documentation, please refer here
  • For GCP FedML library specific pre-requisites, configuration and documentation, please refer here
  • For Azure FedML library specific pre-requisites, configuration and documentation, please refer here
  • For Databricks FedML library specific pre-requisites, configuration and documentation, please refer here

Limitations

None

How to obtain support

This project is provided "as-is" with no expectation for major changes or support.
Create an issue in this repository if you find a bug or have questions about the content.
For additional support, ask a question in SAP Community.

Licensing

Copyright (c) 2021 SAP SE or an SAP affiliate company. All rights reserved. This project is licensed under the Apache Software License, version 2.0 except as noted otherwise in the LICENSE file.

datasphere-fedml's People

Contributors

ajinkyapatil8190, akula86, btbernard, chaturvedakash, karishma-kapur, panktijk, s-krishnamoorthy

datasphere-fedml's Issues

Incompatible packages--Unable to train sample notebook on Vertex AI prebuilt container

I am trying to use Vertex AI to train a linear regression model on data stored in SAP DWC using the fedml_gcp package. I am following the sample notebook and script provided here. I have attached the notebook (test_1.ipynb) and Python script (LinearRegressionScript.py) that I am running, along with the log files, in this zip file: fed ml files.zip

This is the code I ran the first time:

from fedml_gcp import dwcgcp
import os

PROJECT_ID = 'project-name'
REGION = 'asia-southeast1'

BUCKET_NAME = 'sap_dwc_fed'
BUCKET_URI = "gs://"+BUCKET_NAME
BUCKET_FOLDER = 'linear_test'
MODEL_OUTPUT_DIR = BUCKET_URI+'/'+BUCKET_FOLDER

SCRIPT_PATH = 'LinearRegressionScript.py'
JOB_NAME = "linear-regression-training"

MODEL_DISPLAY_NAME = "linear-regression-model"
DEPLOYED_MODEL_DISPLAY_NAME = 'linear-regression-deployed-model'

params = {'project': PROJECT_ID,
          'location': REGION,
          'staging_bucket': BUCKET_URI}

dwc = dwcgcp.DwcGCP(params)

TRAIN_VERSION = "scikit-learn-cpu.0-23"
DEPLOY_VERSION = "sklearn-cpu.1-0"

TRAIN_IMAGE = "us-docker.pkg.dev/vertex-ai/training/{}:latest".format(TRAIN_VERSION)
DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(DEPLOY_VERSION)

table_name = 'sample_dataset'
table_size = 1
job_dir = 'gs://'+BUCKET_NAME

cmd_args = [
    "--table_name=" + str(table_name),
    "--table_size="+ str(table_size),
    "--job-dir=" + str(job_dir),
    "--bucket_name=" + str(BUCKET_NAME),
    "--bucket_folder=" + str(BUCKET_FOLDER)
]

required_packages = [
    'fedml_gcp',
    'matplotlib>=2.2.3',
    'seaborn>=0.9.0',
    'scikit-learn>=0.20.2',
    'pandas>=1.1.4',
    'numpy',
    'hdbcli',
    'pandas-gbq'
]

inputs2 = {
    'display_name':JOB_NAME,
    'script_path':SCRIPT_PATH,
    'container_uri':TRAIN_IMAGE,
    'model_serving_container_image_uri':DEPLOY_IMAGE,
    'requirements':required_packages
}

run_job_params2 = {'model_display_name': MODEL_DISPLAY_NAME,
                   'args': cmd_args,
                   'replica_count': 1,
                   'base_output_dir': MODEL_OUTPUT_DIR,
                   'sync': True}

model = dwc.train_model(training_inputs=inputs2,
                        training_type='custom',
                        params=run_job_params2)
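
For context, each entry in cmd_args reaches the training script as a command-line flag. The sketch below shows one way a script such as LinearRegressionScript.py could read those flags with argparse. The function name parse_training_args and the chosen types and defaults are assumptions for illustration; this is not the content of the attached script.

```python
import argparse


def parse_training_args(argv=None):
    """Parse the flags built in cmd_args (hypothetical helper, not part of fedml_gcp)."""
    parser = argparse.ArgumentParser(description="Training script arguments")
    parser.add_argument("--table_name", required=True)
    # Assumption: table_size is numeric; the notebook passes it as the string "1".
    parser.add_argument("--table_size", type=float, default=1.0)
    # argparse exposes "--job-dir" as args.job_dir (the hyphen becomes an underscore).
    parser.add_argument("--job-dir", default=None)
    parser.add_argument("--bucket_name", required=True)
    parser.add_argument("--bucket_folder", default="")
    return parser.parse_args(argv)


# Same values as in the notebook code, passed explicitly instead of via sys.argv.
example_args = parse_training_args([
    "--table_name=sample_dataset",
    "--table_size=1",
    "--job-dir=gs://sap_dwc_fed",
    "--bucket_name=sap_dwc_fed",
    "--bucket_folder=linear_test",
])
print(example_args.table_name, example_args.table_size, example_args.job_dir)
```

Note that --job-dir is the only flag spelled with a hyphen, so it is read back as job_dir while the others keep their underscore names.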

Running this training job resulted in package dependency issues (please refer to Log_1.json in the attached zip file). Based on the error messages, I changed the required packages to:

required_packages = [
    'fedml_gcp',
    'matplotlib>=2.2.3',
    'seaborn>=0.9.0',
    'scikit-learn>=0.20.2',
    'pandas>=1.1.4',
    'numpy',
    'hdbcli',
    'pandas-gbq',
    'google-auth<2,>=1.25.0',
    'google-auth-oauthlib<0.5,>=0.4.1',
    'google-api-core[grpc]<2.0.0dev,>=1.34.0',
    'google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<2dev,>=1.31.5',
    'google-cloud-core<2.0dev,>=1.1.0',
    'googleapis-common-protos[grpc]<2.0.0dev,>=1.56.0',
    'grpcio<2.0dev,>=1.47.0',
    'packaging<22.0.0dev,>=14.3',
    'protobuf!=3.20.0,!=3.20.1,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5'
]

This resulted in more package dependency issues (please refer to Log_2.json in the attached zip file). The errors shown this time require package versions that contradict the versions specified in the errors in the first log file. Here are the contradicting package requirements:

  1. First log: "google-cloud-logging 1.15.0 has requirement google-cloud-core<2.0dev,>=1.1.0", and the second log: "google-cloud-storage 2.7.0 has requirement google-cloud-core<3.0dev,>=2.3.0"
  2. First log: "google-api-python-client 1.9.3 has requirement google-api-core<2dev,>=1.18.0", and the second log: "pandas-gbq 0.19.1 has requirement google-api-core<3.0.0dev,>=2.10.2"
  3. First log: "tensorboard 2.2.2 has requirement google-auth<2,>=1.6.3" and the second log: "pandas-gbq 0.19.1 has requirement google-auth>=2.13.0"
  4. First log: "tensorboard 2.2.2 has requirement google-auth-oauthlib<0.5,>=0.4.1" and the second log: "pandas-gbq 0.19.1 has requirement google-auth-oauthlib>=0.7.0"
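
The four contradictions above can be checked mechanically. The following is a simplified sketch, not pip's resolver: it only understands ">=" and "<" clauses, and it treats a "dev" suffix as the plain version (so "<2.0dev" is handled as "<2.0"). The helper names are made up for this example.

```python
import re


def parse_version(text, width=4):
    """Turn '2.0dev' or '1.34.0' into a fixed-width tuple of ints, ignoring any suffix."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (width - len(parts))  # pad so (2,) and (2, 0) compare equal
    return tuple(parts)


def bounds(spec):
    """Return (inclusive lower, exclusive upper or None) for a spec like '<2.0dev,>=1.1.0'."""
    lower, upper = parse_version("0"), None
    for clause in spec.split(","):
        clause = clause.strip()
        if clause.startswith(">="):
            lower = max(lower, parse_version(clause[2:]))
        elif clause.startswith("<") and not clause.startswith("<="):
            candidate = parse_version(clause[1:])
            upper = candidate if upper is None else min(upper, candidate)
        else:
            raise ValueError(f"unsupported clause in this sketch: {clause!r}")
    return lower, upper


def compatible(spec_a, spec_b):
    """True if at least one version could satisfy both specs (range overlap only)."""
    lower_a, upper_a = bounds(spec_a)
    lower_b, upper_b = bounds(spec_b)
    lower = max(lower_a, lower_b)
    uppers = [u for u in (upper_a, upper_b) if u is not None]
    return not uppers or lower < min(uppers)


# The requirement pairs quoted from the two log files.
conflicts = {
    "google-cloud-core": ("<2.0dev,>=1.1.0", "<3.0dev,>=2.3.0"),
    "google-api-core": ("<2dev,>=1.18.0", "<3.0.0dev,>=2.10.2"),
    "google-auth": ("<2,>=1.6.3", ">=2.13.0"),
    "google-auth-oauthlib": ("<0.5,>=0.4.1", ">=0.7.0"),
}
for name, (first, second) in conflicts.items():
    verdict = "overlap" if compatible(first, second) else "no common version"
    print(f"{name}: {verdict}")
```

If every pair reports no common version, no single set of pins in required_packages can satisfy both the preinstalled packages and pandas-gbq 0.19.1 at the same time.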

I am using the Vertex AI prebuilt container that is used in the sample notebook provided by SAP as a part of the documentation for fedml-gcp (this container). But the packages pandas-gbq, hdbcli, and fedml_gcp are not getting installed on this container.

Please help!
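
To find out which packages preinstalled in the container impose a pin such as google-auth<2, one option is to run a small diagnostic inside the training image. The sketch below uses only the Python standard library (importlib.metadata, available from Python 3.8); who_requires and normalize are illustrative helper names, and the output depends entirely on what is installed in the environment where it runs.

```python
import re
from importlib import metadata


def normalize(name):
    """Normalize a distribution name: lowercase, runs of '-', '_' and '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def who_requires(package):
    """Map 'distribution version' to the requirement strings that mention `package`."""
    target = normalize(package)
    found = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name", "unknown") if meta else "unknown"
        version = meta.get("Version", "?") if meta else "?"
        for requirement in dist.requires or []:
            # The requirement string starts with the project name, e.g. "google-auth<2,>=1.6.3".
            match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", requirement)
            if match and normalize(match.group(0)) == target:
                found.setdefault(f"{name} {version}", []).append(requirement)
    return found


# List every installed distribution that declares a dependency on google-auth.
for dist_name, requirements in sorted(who_requires("google-auth").items()):
    print(dist_name, "->", requirements)
```

Running the same call for google-api-core, google-cloud-core and google-auth-oauthlib would show which preinstalled packages are responsible for the older upper bounds.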
