webankfintech / prophecis

Prophecis is a one-stop cloud native machine learning platform.

Home Page: https://github.com/WeBankFinTech/Prophecis

License: Apache License 2.0

Makefile 0.20% Dockerfile 0.31% Go 72.32% Python 2.90% Shell 0.68% JavaScript 3.66% HTML 0.32% CSS 0.39% Vue 14.80% SCSS 0.45% Mustache 0.45% Java 3.45% Smarty 0.06% Jsonnet 0.01%
gpu linkis machine-learning ml multi-tenant-management notebook

prophecis's People

Contributors

alexzywu, bleachzk, daylin-gao, finaltestin, hexudong111, james23wang, renzhe-li, tangjiafeng, uuarttt, wallyell, zhengfan199525

prophecis's Issues

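Prophecis 0.3.2: AppConn plugin package and initialization SQL for the DSS/Linkis platform are missing

While preparing the AppConn plugin package and initialization SQL for the DSS/Linkis platform, as required by the installation and deployment guide, I ran into the following problems.
Release 0.3.2 does not provide the corresponding plugin package, and the appconn pom in the 0.3.2 source depends on DSS 1.0.1 and Linkis 1.0.3. When I replaced these with DSS 1.1.0 and Linkis 1.1.1 (the DSS version released in July) and compiled, the build failed with:
MLSSOpenRequestRef.java[XX,XX] error: cannot find symbol.
In addition, the dss_application metadata table that the appconn initialization SQL operates on no longer exists in the metadata schema of DSS 1.1.0 / Linkis 1.1.1.

[Screenshot: compile error after replacing the DSS 1.0.1 / Linkis 1.0.3 dependencies in the source package]

[Screenshot: the mlss initialization SQL references the table dss_application, which does not exist in DSS]

[Screenshot: the DSS metadata database does not contain a dss_application table]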

Images cannot be pulled

Hello,
wedatasphere/prophecis:metrics-0.2.0
wedatasphere/prophecis:jobmonitor-0.2.0
wedataspere/prophecis:minio-2020-06-14
wedatasphere/prophecis:lcm-0.2.0
wedatasphere/prophecis:trainer-0.2.0
None of these images can be pulled. Do they exist in the repository?
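
One detail in the list above is worth a look: one reference is spelled `wedataspere/prophecis`, while the other four use `wedatasphere/prophecis`. Whether that spelling comes from the chart or only from the report is not known, but a mismatch like this is easy to detect mechanically. A minimal sketch, assuming the image references have already been collected into a list:

```python
from collections import Counter


def odd_repositories(image_refs):
    """Return the references whose repository differs from the most common one."""
    repos = [ref.rsplit(":", 1)[0] for ref in image_refs]
    most_common_repo, _ = Counter(repos).most_common(1)[0]
    return [ref for ref, repo in zip(image_refs, repos) if repo != most_common_repo]


# The five references quoted in the issue, copied verbatim.
images = [
    "wedatasphere/prophecis:metrics-0.2.0",
    "wedatasphere/prophecis:jobmonitor-0.2.0",
    "wedataspere/prophecis:minio-2020-06-14",
    "wedatasphere/prophecis:lcm-0.2.0",
    "wedatasphere/prophecis:trainer-0.2.0",
]

print(odd_repositories(images))
```

This only flags a spelling outlier; it says nothing about whether the remaining tags were ever published.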

When will 0.3.x be released?

I have been following Prophecis. Is there an approximate release date for v0.2.x and v0.3.x?

Prophecis-0.3.0 deployment steps

Install helm-3.2.1:

wget https://get.helm.sh/helm-v3.2.1-linux-amd64.tar.gz

tar -xzvf helm-v3.2.1-linux-amd64.tar.gz

cd linux-amd64/

mv helm /usr/bin/

helm version

helm repo list

helm repo add aliyuncs https://apphub.aliyuncs.com

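Install Istio-1.8.2

wget https://github.com/istio/istio/releases/download/1.8.2/istio-1.8.2-linux-amd64.tar.gz

# Unpack into /opt so that the PATH below resolves (assumed location)
tar -xzvf istio-1.8.2-linux-amd64.tar.gz -C /opt

# Add istioctl to PATH
export PATH=$PATH:/opt/istio-1.8.2/bin

# Deploy
istioctl install
# Verify that the Istio pods are Running
kubectl -n istio-system get pods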

Install Seldon Core-1.13.0

wget https://github.com/SeldonIO/seldon-core/archive/refs/tags/v1.13.0.tar.gz

tar -xzvf v1.13.0.tar.gz

cd seldon-core-1.13.0/helm-charts

helm install seldon-core seldon-core-operator --set usageMetrics.enabled=true --namespace seldon-system --set istio.enabled=true


# If the image pull fails
docker pull registry.cn-shenzhen.aliyuncs.com/shikanon/google_containers.spartakus-amd64:v1.1.0
docker tag registry.cn-shenzhen.aliyuncs.com/shikanon/google_containers.spartakus-amd64:v1.1.0 gcr.io/google_containers/spartakus-amd64:v1.1.0



# List the release
helm list -n seldon-system


# Uninstall the release (only if needed)
helm del seldon-core -n seldon-system

Install NFS:

yum install -y nfs-utils rpcbind



systemctl start rpcbind
systemctl enable rpcbind

systemctl start nfs-server
systemctl enable nfs-server

Environment preparation

1. Add a Docker configuration file
vim /root/.docker/config.json
# Add the following configuration
{
        "auths": {
                "": {
                        "auth": ""
                }
        },
        "HttpHeaders": {
                "User-Agent": "Docker-Client/20.10.8-ce (linux)"
        }
}
2. Export the shared directories on the NFS server
mkdir -p /data/bdap-ss/mlss-data/tmp
mkdir -p /mlss/di/jobs/prophecis
mkdir -p /cosdata/mlss-test


vim /etc/exports
/data/bdap-ss/mlss-data/tmp xx.xx.xx.0/24(rw,sync,no_root_squash)
/mlss/di/jobs/prophecis xx.xx.xx.0/24(rw,sync,no_root_squash)
/cosdata/mlss-test xx.xx.xx.0/24(rw,sync,no_root_squash)

exportfs -arv
3. Mount the shared directories on the NFS clients
showmount -e xx.xx.xx.xx

mkdir -p /data/bdap-ss/mlss-data/tmp
mkdir -p /mlss/di/jobs/prophecis
mkdir -p /cosdata/mlss-test


mount xx.xx.xx.xx:/data/bdap-ss/mlss-data/tmp /data/bdap-ss/mlss-data/tmp
mount xx.xx.xx.xx:/mlss/di/jobs/prophecis /mlss/di/jobs/prophecis
mount xx.xx.xx.xx:/cosdata/mlss-test /cosdata/mlss-test
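4. Fix known issues

(1) Duplicate files:

Under /install/Prophecis/templates/di, move learner-configmap.yml and learner-rsa-keys.yml to /install/Prophecis/templates/services, then delete the /install/Prophecis/templates/di folder.

(2) Image addresses

In the installation configuration files, replace every image address that starts with uat.sf.dockerhub.stgwebank/webank/prophecis with wedatasphere/prophecis.

(3) Update the SQL database settings

In the database creation files under install/sql, prophecis.sql and prophecis-data.sql, the first two lines name the database:

CREATE DATABASE IF NOT EXISTS `mlss_gzpc_bdap_uat_01` /*!40100 DEFAULT CHARACTER SET utf8 */;
USE `mlss_gzpc_bdap_uat_01`;

Change the name to your own database. It must match db.name in the MySQL configuration in the next step.

Then paste the contents of prophecis.sql, followed by prophecis-data.sql, into a SQL editor for that database and execute them to create the tables and data.

5. Update the configuration

The following parts of /install/Prophecis/values.yaml need to be changed: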

# Change to your own MySQL connection settings, user name and password
db:
  server: 127.0.0.1
  port: 3306
  name: prophecis_db
  user: prophecis
  pwd: prophecis@wedatasphere

# Address users open in the browser; change to the IP of the host node
gateway:
  address: 127.0.0.1
  port: 30778

# Super administrator user name and password; change as needed, must match the t_superadmin table
admin:
    user: hadoop
    password: hadoop
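
Steps 4 (2), 4 (3) and 5 above amount to text substitutions that have to stay consistent with each other: the database named in the two SQL files must equal `db.name` in values.yaml, and every image address must carry the new prefix. A minimal sketch of both substitutions follows. The function names are made up for illustration, and the image tag in the sample is hypothetical.

```python
import re

OLD_PREFIX = "uat.sf.dockerhub.stgwebank/webank/prophecis"
NEW_PREFIX = "wedatasphere/prophecis"


def retarget_database(sql_text, new_db):
    """Point the leading CREATE DATABASE and USE statements at new_db."""
    quoted = f"`{new_db}`"
    sql_text = re.sub(
        r"(?<=CREATE DATABASE IF NOT EXISTS )`[^`]+`",
        lambda m: quoted,
        sql_text,
        count=1,
    )
    return re.sub(r"(?m)^USE `[^`]+`;", lambda m: f"USE {quoted};", sql_text, count=1)


def swap_image_prefix(text, old=OLD_PREFIX, new=NEW_PREFIX):
    """Replace the internal registry prefix with the public one."""
    return text.replace(old, new)


sample_sql = (
    "CREATE DATABASE IF NOT EXISTS `mlss_gzpc_bdap_uat_01` /*!40100 DEFAULT CHARACTER SET utf8 */;\n"
    "USE `mlss_gzpc_bdap_uat_01`;\n"
)
print(retarget_database(sample_sql, "prophecis_db"))

# Hypothetical image line; the real tags live in the chart's values files.
print(swap_image_prefix("image: uat.sf.dockerhub.stgwebank/webank/prophecis:example-tag"))
```

Passing the same string to `retarget_database` that is set as `db.name` keeps the SQL scripts and the chart pointing at one database.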

Install the components

kubectl create namespace prophecis

kubectl label nodes xx.xx.xx.xx mlss-node-role=platform
# If there are GPU compute nodes, label them NVIDIAGPU
kubectl label nodes xx.xx.xx.xx hardware-type=NVIDIAGPU



## Install the Notebook Controller component
helm install notebook-controller ./notebook-controller
## Install the MinIO component
helm install minio-prophecis --namespace prophecis ./MinioDeployment
## Install the Prophecis component
helm install prophecis ./Prophecis


# List and uninstall
helm list --all
helm del prophecis --namespace default
helm del notebook-controller --namespace default
helm del minio-prophecis --namespace prophecis

Does the mlss-controlcenter-go source differ from what was compiled into the published cc-apiserver-v0.3.0?

I compiled mlss-controlcenter-go from the GitHub source and used it to replace the executable in cc-apiserver-v0.3.0, but at runtime the page reports these two errors:

path /cc/v1/groups/group/storage was not found
Error when checking namespace from cc, {"code":404,"message":"path /cc/v1/groups/users/roles/1/namespaces was not found"}

Is the mlss-controlcenter-go in the published cc-apiserver-v0.3.0 built from the same source as the one on GitHub?

Docker Hub is missing images for some v0.3.2 components

Images referenced in Prophecis/install/value.yaml that are missing:

wedatasphere/prophecis:mllabis-v0.3.2 --> wedatasphere/prophecis:mllabis-v0.3.0
wedatasphere/prophecis:metrics-v0.3.2 --> wedatasphere/prophecis:metrics-v0.3.0
wedatasphere/prophecis:mf-server-v0.3.2 --> wedatasphere/prophecis:mf-server-v0.3.0

Compiling the di/lcm module fails

go build -v -o bin/main
webank/DI/lcm/service/lcm
webank/DI/lcm/service/lcm
service/lcm/splitTraining.go:39:42: not enough arguments in call to learner.CreateServiceSpec
have (string, string)
want (string, string, kubernetes.Interface)
service/lcm/splitTraining.go:78:17: t.helper undefined (type splitTraining has no field or method helper)
service/lcm/splitTraining.go:117:50: too many arguments in call to newConstructLearnerContainer
service/lcm/split_training.go:36:6: method redeclared: splitTraining.jobSpecForLearner
method(*splitTraining) func(string) (*"k8s.io/api/batch/v1".Job, error)
method(*splitTraining) func(*"k8s.io/api/core/v1".Service) (*"k8s.io/api/batch/v1".Job, error)
service/lcm/split_training.go:36:24: splitTraining.jobSpecForLearner redeclared in this block
previous declaration at service/lcm/splitTraining.go:70:6
service/lcm/split_training.go:85:24: splitTraining.Start redeclared in this block
previous declaration at service/lcm/splitTraining.go:37:6
service/lcm/split_training.go:122:33: cannot use serviceSpec (type *"k8s.io/api/core/v1".Service) as type string in argument to t.jobSpecForLearner
service/lcm/split_training.go:149:25: (*splitTraining).NewCreateFromBOM redeclared in this block
previous declaration at service/lcm/splitTraining.go:128:6
service/lcm/split_training.go:179:29: (*splitTraining).NewCreateFromBOM.func1 redeclared in this block
previous declaration at service/lcm/splitTraining.go:137:33
service/lcm/split_training.go:200:25: (*splitTraining).CreateFromBOMForTFJob redeclared in this block
previous declaration at service/lcm/splitTraining.go:179:6
service/lcm/split_training.go:179:29: too many errors
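
The log names two files in the same package, splitTraining.go and split_training.go, and most of the errors are "redeclared" messages pointing from one to the other. One plausible reading, not confirmed by the maintainers, is that both an old and a renamed copy of the file are present in the checkout. A small sketch that lists methods declared in more than one file of a package; the regular expression is a simplification (named receivers only) and the sample sources are hypothetical:

```python
import re
from collections import defaultdict

# Matches method declarations such as: func (t *splitTraining) Start(...)
METHOD_RE = re.compile(r"^func\s*\(\s*\w+\s+\*?(\w+)\s*\)\s*(\w+)\s*\(", re.MULTILINE)


def duplicate_methods(sources):
    """Map 'Type.Method' to the files declaring it, for names declared in 2+ files.

    sources: dict of file name -> Go source text.
    """
    seen = defaultdict(set)
    for file_name, text in sources.items():
        for receiver_type, method in METHOD_RE.findall(text):
            seen[f"{receiver_type}.{method}"].add(file_name)
    return {key: sorted(files) for key, files in seen.items() if len(files) > 1}


# Hypothetical stand-ins for the two files named in the log.
sources = {
    "splitTraining.go": "func (t *splitTraining) Start() error {\n\treturn nil\n}\n",
    "split_training.go": "func (t *splitTraining) Start() error {\n\treturn nil\n}\n",
    "other.go": "func (t *splitTraining) Stop() error {\n\treturn nil\n}\n",
}
print(duplicate_methods(sources))
```

To run it against a real checkout, `sources` could be built with `{p.name: p.read_text() for p in pathlib.Path("service/lcm").glob("*.go")}`.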

After a successful helm install, login fails (LDAP error)

The Prophecis page opens, but admin cannot log in.
The page returns a 502 error.
Oct 15 18:08:30 ai-master kubelet: W1015 18:08:30.651802 10459 kubelet_pods.go:863] Unable to retrieve pull secret prophecis/hubsecret for prophecis/ffdl-trainingdata-8946c74fd-5prbp due to secret "hubsecret" not found. The image pull may not succeed.
Oct 15 18:08:51 ai-master kubelet: W1015 18:08:51.652099 10459 kubelet_pods.go:863] Unable to retrieve pull secret prophecis/hubsecret for prophecis/di-storage-796c9596c-l6ws9 due to secret "hubsecret" not found. The image pull may not succeed.
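
The kubelet warnings say that the pods reference an image pull secret named hubsecret that does not exist in the prophecis namespace. This is only a warning and may be unrelated to the 502, but creating the secret removes one variable. A minimal sketch that builds a `kubernetes.io/dockerconfigjson` Secret manifest; the registry URL and credentials are placeholders, not values from the project:

```python
import base64
import json


def pull_secret_manifest(name, namespace, registry, username, password):
    """Build a kubernetes.io/dockerconfigjson Secret manifest as a dict."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    docker_config = {
        "auths": {registry: {"username": username, "password": password, "auth": auth}}
    }
    encoded = base64.b64encode(json.dumps(docker_config).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "kubernetes.io/dockerconfigjson",
        "metadata": {"name": name, "namespace": namespace},
        "data": {".dockerconfigjson": encoded},
    }


# Placeholder registry and credentials; substitute your own.
manifest = pull_secret_manifest(
    "hubsecret", "prophecis", "https://index.docker.io/v1/", "example-user", "example-password"
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.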

Images cannot be pulled during setup

Two images cannot be pulled: wedataspere/prophecis:minio-2020-06-14 and wedatasphere/prophecis:fluent-bit-1.2.1. Do these images currently exist in the repository?

Building your own notebook image

You can build your own notebook runtime image:
1. Dockerfile:

Other images work similarly.

docker pull jupyter/scipy-notebook:notebook-6.4.10

FROM jupyter/scipy-notebook:notebook-6.4.10
USER root
RUN echo "Asia/Shanghai" > /etc/timezone
ENTRYPOINT ["sh","-c", "jupyter lab --notebook-dir=/home/jovyan --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]

2. Build:

Build it under your own tag:

docker build -t "xxxx/jupyter/scipy-notebook:notebook-6.4.10" .

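The configuration file mf-service.yml specifies port 41012

The deployment guide says to edit the API server manifest:

vim /etc/kubernetes/manifests/kube-apiserver.yaml

and, under spec: containers: - command:, add the following option:

  • --service-node-port-range=30000-40000

Your configuration file uses 41012, yet the guide requires the Kubernetes NodePort range to end at 40000. Don't these conflict?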
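
The question can be checked with simple arithmetic. The `--service-node-port-range` flag only constrains `nodePort` values (the Kubernetes default range is 30000-32767); whether 41012 in mf-service.yml is a `nodePort` or an ordinary `port`/`targetPort` is not shown in the report. Assuming it is a `nodePort`, a minimal sketch of the check:

```python
def parse_port_range(spec):
    """Parse a 'low-high' range string into a pair of integers."""
    low, high = (int(part) for part in spec.split("-"))
    return low, high


def in_node_port_range(port, spec):
    """Return True if port falls inside the inclusive range described by spec."""
    low, high = parse_port_range(spec)
    return low <= port <= high


print(in_node_port_range(41012, "30000-40000"))  # False
print(in_node_port_range(41012, "30000-42000"))  # True
```

So with the documented range a NodePort of 41012 would be rejected; either the range has to extend past 41012 or the service has to use a port inside it. The 42000 upper bound here is only an example.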

Which versions of k8s are supported?

I tested Kubernetes v1.20.0 and v1.18.6, and both show errors. Details below:

Running helm install notebook-controller . in the folder Prophecis/helm-charts/k8s 1.18.6/notebook-controller gives:
Error: template: MLSS/templates/notebook-controller-0.5.1.yaml:115:24: executing "MLSS/templates/notebook-controller-0.5.1.yaml" at <.Values.aide.controller.notebook.repository>: nil pointer evaluating interface {}.controller

Running helm install notebook-controller . in the folder Prophecis/helm-charts/notebook-controller gives:
Error: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "Deployment" in version "apps/v1beta1", unable to recognize "": no matches for kind "StatefulSet" in version "apps/v1beta2"
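
The second error has a known general cause: the `apps/v1beta1` and `apps/v1beta2` API versions were removed in Kubernetes 1.16, so both 1.18.6 and 1.20.0 reject them, and Deployment and StatefulSet are served from `apps/v1`. A minimal sketch that rewrites the version lines in rendered manifest text; it is a text-level helper, not something the project ships:

```python
import re

# apiVersion lines that still use the removed apps/v1beta1 or apps/v1beta2.
OLD_APPS_VERSION = re.compile(
    r"^([ \t]*apiVersion:[ \t]*)apps/v1beta[12][ \t]*$", re.MULTILINE
)


def upgrade_api_versions(manifest_text):
    """Rewrite apps/v1beta1 and apps/v1beta2 apiVersion lines to apps/v1."""
    return OLD_APPS_VERSION.sub(r"\1apps/v1", manifest_text)


old = (
    "apiVersion: apps/v1beta1\nkind: Deployment\n"
    "---\n"
    "apiVersion: apps/v1beta2\nkind: StatefulSet\n"
)
print(upgrade_api_versions(old))
```

Changing the version string alone may not be enough: `apps/v1` requires `spec.selector` on Deployments and StatefulSets, so templates written for the beta versions may need a selector added. The first error (the nil pointer on `.Values.aide.controller`) is a separate problem with missing values and is not addressed by this sketch.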

I encountered the following problem: nginx: [emerg] host not found in resolver "kube-dns.kube-system.svc.cluster.local" in /etc/nginx/conf.d/ui.conf:46

[root@node Prophecis]# kubectl logs -f bdap-ui-deployment-595f6c44bf-jmkb5 -n prophecis
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: /etc/nginx/conf.d/default.conf is not a file or does not exist
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
nginx: [emerg] host not found in resolver "kube-dns.kube-system.svc.cluster.local" in /etc/nginx/conf.d/ui.conf:46
[root@node Prophecis]#
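
The message means nginx could not resolve the host name given to its `resolver` directive at startup. Possible reasons, none confirmed here, include a cluster DNS service that is not named kube-dns or a cluster domain other than cluster.local. Since `resolver` also accepts an IP address, one way to investigate is to read /etc/resolv.conf from any running pod in the namespace and look at the nameserver and search domains it was given. A minimal parsing sketch, with hypothetical file contents:

```python
def parse_resolv_conf(text):
    """Return (nameservers, search_domains) from resolv.conf-style text."""
    nameservers, search = [], []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if not fields:
            continue
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "search":
            search = fields[1:]
    return nameservers, search


# Hypothetical contents; read the real file from inside a pod instead.
sample = """\
nameserver 10.96.0.10
search prophecis.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
"""
print(parse_resolv_conf(sample))
```

If the search list does not end in cluster.local, or the DNS service has another name, the host in ui.conf line 46 would need to be adjusted accordingly.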

Version 0.3.2: the di-storage server starts but reports an RPC error

storage-deployment.yaml

kind: Deployment
apiVersion: apps/v1
metadata:
  name: di-storage
  namespace: prophecis
  labels:
    app.kubernetes.io/managed-by: Helm
    environment: prophecis
    service: di-storage
  annotations:
    deployment.kubernetes.io/revision: '1'
    meta.helm.sh/release-name: prophecis
    meta.helm.sh/release-namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      environment: prophecis
      service: di-storage
  template:
    metadata:
      creationTimestamp: null
      labels:
        environment: prophecis
        service: di-storage
        version: storage-v0.3.2
    spec:
      volumes:
        - name: di-config
          configMap:
            name: di-config
            defaultMode: 420
        - name: timezone-volume
          hostPath:
            path: /usr/share/zoneinfo/Asia/Shanghai
            type: File
        - name: oss-storage
          hostPath:
            path: tmp
            type: Directory
      containers:
        - name: di-storage-rpc-server
          image: 'wedatasphere/prophecis:storage-v0.3.2'
          command:
            - /bin/sh
            - '-c'
          args:
            - DLAAS_PORT=8443 /main
          ports:
            - containerPort: 8443
              protocol: TCP
          env:
            - name: DLAAS_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: DLAAS_ENV
              value: prophecis
            - name: DLAAS_LOGLEVEL
              value: DEBUG
            - name: DLAAS_PUSH_METRICS_ENABLED
              value: 'true'
            - name: LINKIS_ADDRESS
              value: '127.0.0.1:8088'
            - name: LINKIS_TOKEN_CODE
              value: BML-AUTH
            - name: MONGO_ADDRESS
              value: mongo.prophecis.svc.cluster.local
            - name: MONGO_USERNAME
              value: mlssopr
            - name: MONGO_PASSWORD
              value: mlssopr
            - name: MONGO_DATABASE
              value: mlsstest
            - name: MONGO_Authentication_Database
              value: admin
            - name: DLAAS_OBJECTSTORE_TYPE
              valueFrom:
                secretKeyRef:
                  name: storage-secrets
                  key: DLAAS_OBJECTSTORE_TYPE
            - name: DLAAS_OBJECTSTORE_AUTH_URL
              valueFrom:
                secretKeyRef:
                  name: storage-secrets
                  key: DLAAS_OBJECTSTORE_AUTH_URL
            - name: DLAAS_OBJECTSTORE_USER_NAME
              valueFrom:
                secretKeyRef:
                  name: storage-secrets
                  key: DLAAS_OBJECTSTORE_USER_NAME
            - name: DLAAS_OBJECTSTORE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: storage-secrets
                  key: DLAAS_OBJECTSTORE_PASSWORD
            - name: DLAAS_ELASTICSEARCH_SCHEME
              value: http
            - name: DLAAS_ELASTICSEARCH_ADDRESS
              value: 'http://elasticsearch.prophecis.svc.cluster.local:9200'
            - name: DLAAS_ELASTICSEARCH_ADDRESS
              valueFrom:
                secretKeyRef:
                  name: trainingdata-secrets
                  key: DLAAS_ELASTICSEARCH_ADDRESS
            - name: DLAAS_ELASTICSEARCH_USERNAME
              valueFrom:
                secretKeyRef:
                  name: trainingdata-secrets
                  key: DLAAS_ELASTICSEARCH_USERNAME
            - name: DLAAS_ELASTICSEARCH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: trainingdata-secrets
                  key: DLAAS_ELASTICSEARCH_PASSWORD
          resources:
            limits:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: di-config
              mountPath: /etc/mlss/
            - name: timezone-volume
              mountPath: /etc/localtime
            - name: oss-storage
              mountPath: /data/oss-storage
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        mlss-node-role: platform
      securityContext: {}
      imagePullSecrets:
        - name: hubsecret
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
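
One thing that stands out in the manifest above is that DLAAS_ELASTICSEARCH_ADDRESS is declared twice in the container's env list, once with a literal value and once from a secret. Which of the two takes effect is not obvious from the manifest, and whether it has anything to do with the RPC error is unknown. A minimal, standard-library-only sketch for spotting such duplicates in manifest text; it relies on indentation rather than a real YAML parser, so treat it as a rough check:

```python
from collections import Counter


def duplicate_env_names(manifest_text):
    """Return env variable names that appear more than once in env: blocks."""
    names = []
    env_indent = None
    for line in manifest_text.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if stripped == "env:":
            env_indent = indent
            continue
        if env_indent is None or not stripped:
            continue
        # A shallower line, or a sibling key at the same depth, ends the env block.
        if indent < env_indent or (indent == env_indent and not stripped.startswith("- ")):
            env_indent = None
            continue
        if stripped.startswith("- name:"):
            names.append(stripped.split(":", 1)[1].strip())
    return [name for name, count in Counter(names).items() if count > 1]


# Excerpt copied from the Deployment above.
excerpt = """\
          env:
            - name: DLAAS_ELASTICSEARCH_SCHEME
              value: http
            - name: DLAAS_ELASTICSEARCH_ADDRESS
              value: 'http://elasticsearch.prophecis.svc.cluster.local:9200'
            - name: DLAAS_ELASTICSEARCH_ADDRESS
              valueFrom:
                secretKeyRef:
                  name: trainingdata-secrets
                  key: DLAAS_ELASTICSEARCH_ADDRESS
          resources:
            limits:
              cpu: 500m
"""
print(duplicate_env_names(excerpt))
```

Names are pooled across all env blocks in the text, so for a multi-container manifest it is better to pass one container's section at a time.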

The UI module has no Dockerfile

The image I built myself always returns 403 Forbidden after deployment. Could you share the Dockerfile used to build the UI image? Thanks.

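Frequent "Network service exception" errors: are they related to LDAP authentication?

Deployed Prophecis version: v0.3.0. Kubernetes version: 1.18.6. All pods are in the Running state.
1. The deployment guide says Prophecis uses LDAP for unified authentication, but it does not say that an LDAP directory service must be installed. Does LDAP need any particular users to be created?
2. The deployment guide asks for a super administrator user name and password that correspond to the t_superadmin table. That table has a name field; is no password field needed? Do the users created in LDAP have to match the super administrator in t_superadmin?
3. Logging in produces the error below. Is the cause that authentication against the LDAP directory server failed?
[Screenshot of the login error]

4. After clearing the browser cache and reopening the login page I can log in, but clicking through many pages produces a "Network service exception" error.
[Screenshot of the "Network service exception" error]
Is this related to Auth_type:LDAP?

5. Can LDAP unified authentication be configured, that is, switched on or off?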

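notebook-controller source code is missing

The mllabis module contains only the notebook-server source, and it does not match the source of the published image (wedatasphere/prophecis:mllabis-v0.1.1). I also could not find any source for notebook-controller.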

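The documentation has serious problems

Documentation for an open-source product should be clear and readable, and important procedures should be explained in detail. It should not read like the cryptic notes-to-self that many programmers post on CSDN. Overall the documentation is poor, which will significantly hurt adoption of Prophecis.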

Created notebooks stay in the waiting state

  • helm version:version.BuildInfo{Version:"v3.2.1", GitCommit:"fe51cd1e31e6a202cba7dead9552a6d418ded79a", GitTreeState:"clean", GoVersion:"go1.13.10"}
  • docker version:
Client: Docker Engine - Community
 Version:           19.03.9
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        9d988398e7
 Built:             Fri May 15 00:25:27 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.9
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       9d988398e7
  Built:            Fri May 15 00:24:05 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.13
  GitCommit:        9cc61520f4cd876b86e77edfeb88fbcd536d1f9d
 runc:
  Version:          1.0.3
  GitCommit:        v1.0.3-0-gf46b6ba
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

  • k8s version:

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:38:05Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.20", GitCommit:"1f3e19b7beb1cc0110255668c4238ed63dadb7ad", GitTreeState:"clean", BuildDate:"2021-06-16T12:51:17Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

After creating a notebook, its status stays at waiting. How should this be handled, and where can I see the logs?
