qihoo360 / dgl-operator

The DGL Operator makes it easy to run Deep Graph Library (DGL) graph neural network training on Kubernetes

License: Apache License 2.0

Dockerfile 1.22% Makefile 3.45% Go 68.58% Shell 12.57% Python 14.17%

dgl-operator's People

Contributors

ryantd

dgl-operator's Issues

No such file or directory: '/etc/dgl/hostfile'

Deploying examples/v1alpha1/GraphSAGE_dist.yaml fails.

Error from the dgl-graphsage-launcher pod:

Phase 3/5: dispatch partitions
----------
Traceback (most recent call last):
  File "tools/dispatch.py", line 102, in <module>
    main()
  File "tools/dispatch.py", line 44, in main
    with open(args.ip_config) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/etc/dgl/hostfile'
----------
Phase 3/5 error raised

Error from the dgl-operator pod:

021-08-14T09:45:48.716Z	INFO	controllers.DGLJob	Finished reconciling job	{"dgljob": "dgl-operator/dgl-graphsage", "dgl-operator/dgl-graphsage": "80.81µs"}
2021-08-14T09:45:48.722Z	ERROR	controllers.DGLJob	unable to fetch DGLJob	{"dgljob": "dgl-operator/dgl-graphsage", "error": "DGLJob.qihoo.net \"dgl-graphsage\" not found"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132
github.com/Qihoo360/dgl-operator/controllers.(*DGLJobReconciler).Reconcile
/workspace/controllers/dgljob_controller.go:115
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:297
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:252
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:215
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99

Error when running kubectl create -f examples/v1alpha1/GraphSAGE_dist.yaml

Phase 3/5: dispatch partitions

Traceback (most recent call last):
  File "tools/dispatch.py", line 102, in <module>
    main()
  File "tools/dispatch.py", line 52, in main
    with open(args.part_config) as conf_f:
FileNotFoundError: [Errno 2] No such file or directory: '/dgl_workspace/dataset/graphsage.json'

Phase 3/5 error raised
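
Both Phase 3 failures above come from `open()` on a path that does not exist when dispatch runs (`args.ip_config` in one report, `args.part_config` in the other). A minimal sketch of a pre-flight check that lists every missing input at once, before dispatch starts. The helper name `find_missing_inputs` is hypothetical and is not part of `tools/dispatch.py`; why the files are absent is not established by the logs.

```python
import os
import tempfile


def find_missing_inputs(paths):
    """Return the subset of `paths` that are not existing regular files."""
    return [p for p in paths if not os.path.isfile(p)]


if __name__ == "__main__":
    # Demo with one temporary file that exists and one path that does not.
    with tempfile.NamedTemporaryFile() as present:
        absent = present.name + ".missing"
        for path in find_missing_inputs([present.name, absent]):
            print(f"missing input: {path}")
```

In a launcher this could be called with paths such as `/etc/dgl/hostfile` and `/dgl_workspace/dataset/graphsage.json`, so that one message names everything that is missing instead of failing on the first `open()`.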

This will download 1.38GB. Will you proceed? (y/N)

Phase 1/5: load and partition graph

Using backend: pytorch
Partition arguments: Namespace(balance_edges=True, balance_train=True, dataset_url='http://192.168.12.218:8000/ogbn_products.zip', graph_name='graphsage', num_parts=2, output='/dgl_workspace/dataset', part_method='metis', rel_data_path='dataset', undirected=False, workspace='/dgl_workspace')
Download http://192.168.12.218:8000/ogbn_products.zip
Extract /dgl_workspace/dataset/ogbn_products.zip
load ogbn-products
This will download 1.38GB. Will you proceed? (y/N)
Traceback (most recent call last):
  File "code/load_and_partition_graph.py", line 107, in <module>
    g, _ = load_dataset('ogbn-products', args.output, args.dataset_url)
  File "code/load_and_partition_graph.py", line 33, in load_dataset
    data = DglNodePropPredDataset(name=name, root=work_dir)
  File "/usr/local/lib/python3.6/site-packages/ogb/nodeproppred/dataset_dgl.py", line 69, in __init__
    self.pre_process()
  File "/usr/local/lib/python3.6/site-packages/ogb/nodeproppred/dataset_dgl.py", line 98, in pre_process
    if decide_download(url):
  File "/usr/local/lib/python3.6/site-packages/ogb/utils/url.py", line 17, in decide_download
    return input("This will download %.2fGB. Will you proceed? (y/N)\n" % (size)).lower() == "y"
EOFError: EOF when reading a line
WARNING:root:The OGB package is out of date. Your version is 1.3.0, while the latest version is 1.3.5.

Phase 1/5 error raised
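
The traceback shows `decide_download` calling `input()`, which raises `EOFError` because nothing is available on stdin to answer the prompt. A minimal sketch of a confirmation helper that falls back to a default answer in that case. The names `confirm` and `default` are hypothetical and are not part of the `ogb` package.

```python
import io
import sys


def confirm(prompt, default=False):
    """Ask a y/N question; return `default` when stdin cannot supply an answer."""
    try:
        answer = input(prompt)
    except EOFError:
        # No interactive stdin (for example a pod without a TTY): use the default.
        return default
    return answer.strip().lower() == "y"


if __name__ == "__main__":
    # Simulate a non-interactive run by replacing stdin with an empty stream.
    sys.stdin = io.StringIO("")
    print(confirm("This will download 1.38GB. Will you proceed? (y/N)\n", default=True))
```

The design choice here is to make the non-interactive behaviour explicit through `default` rather than letting the job crash on the prompt.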

Large dataset (products) can't be delivered

Phase 2/5: deliver partitions

+ POD_NAME=dgl-sample-cuda-test-214-launcher -c watcher-loop-partitioner
+ shift
+ /opt/kube/kubectl exec dgl-sample-cuda-test-214-launcher -c watcher-loop-partitioner -- /bin/sh -c mkdir -p /dgl_workspace
+ /opt/kube/kubectl cp /dgl_workspace/dataset dgl-sample-cuda-test-214-launcher:/dgl_workspace -c watcher-loop-partitioner
E0214 08:27:30.469178 152 v2.go:105] write tcp 12.100.12.173:49690->12.96.0.1:443: use of closed network connection
error: Internal error occurred: error executing command in container: read unix @->/var/run/docker.sock: read: connection reset by peer
sleep
wake
Launch arguments: Namespace(cmd_type='copy_batch_container', container='watcher-loop-partitioner', ip_config='/etc/dgl/leadfile', num_parts=None, num_samplers=0, num_server_threads=1, num_servers=None, num_trainers=None, part_config=None, source_file_paths='/dgl_workspace/dataset', target_dir='/dgl_workspace', worker_chief_index=0, workspace='/dgl_workspace'), []
12.100.0.117 30050 dgl-sample-cuda-test-214-launcher

['12.100.0.117', '30050', 'dgl-sample-cuda-test-214-launcher']
copy /dgl_workspace/dataset to dgl-sample-cuda-test-214-launcher:/dgl_workspace
Traceback (most recent call last):
  File "/dgl_workspace/tools/launch.py", line 284, in <module>
    main()
  File "/dgl_workspace/tools/launch.py", line 253, in main
    run_cp_container(args)
  File "/dgl_workspace/tools/launch.py", line 105, in run_cp_container
    kubecp_container(source_file_path, pod_name, args.target_dir, args.container)
  File "/dgl_workspace/tools/launch.py", line 46, in kubecp_container
    subprocess.check_call(cmd, shell = True)
  File "/home/gnn/conda/envs/gnn/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -x; /opt/kube/kubectl cp /dgl_workspace/dataset dgl-sample-cuda-test-214-launcher:/dgl_workspace -c watcher-loop-partitioner' returned non-zero exit status 1.

Phase 2/5 error raised

Deliver partitions error

Phase 2/5: deliver partitions

+ POD_NAME=dgl-graphsage-launcher -c watcher-loop-partitioner
+ shift
+ /opt/kube/kubectl exec dgl-graphsage-launcher -c watcher-loop-partitioner -- /bin/sh -c mkdir -p /dgl_workspace
error: unable to upgrade connection: container not found ("watcher-loop-partitioner")
Launch arguments: Namespace(cmd_type='copy_batch_container', container='watcher-loop-partitioner', ip_config='/etc/dgl/leadfile', num_parts=None, num_samplers=0, num_server_threads=1, num_servers=None, num_trainers=None, part_config=None, source_file_paths='/dgl_workspace/dataset', target_dir='/dgl_workspace', worker_chief_index=0, workspace='/dgl_workspace'), []
12.100.10.0 30050 dgl-graphsage-launcher

['12.100.10.0', '30050', 'dgl-graphsage-launcher']
Traceback (most recent call last):
  File "tools/launch.py", line 280, in <module>
    main()
  File "tools/launch.py", line 252, in main
    run_cp_container(args)
  File "tools/launch.py", line 103, in run_cp_container
    kubexec_container(f'mkdir -p {args.target_dir}', pod_name, args.container)
  File "tools/launch.py", line 31, in kubexec_container
    subprocess.check_call(cmd, shell = True)
  File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'sh /etc/dgl/kubexec.sh 'dgl-graphsage-launcher -c watcher-loop-partitioner' 'mkdir -p /dgl_workspace'' returned non-zero exit status 1.

Phase 2/5 error raised

Sometimes a different error occurs:

Phase 2/5: deliver partitions

Launch arguments: Namespace(cmd_type='copy_batch_container', container='watcher-loop-partitioner', ip_config='/etc/dgl/leadfile', num_parts=None, num_samplers=0, num_server_threads=1, num_servers=None, num_trainers=None, part_config=None, source_file_paths='/dgl_workspace/dataset', target_dir='/dgl_workspace', worker_chief_index=0, workspace='/dgl_workspace'), []
30050 dgl-graphsage-launcher

['30050', 'dgl-graphsage-launcher']
Traceback (most recent call last):
  File "tools/launch.py", line 280, in <module>
    main()
  File "tools/launch.py", line 252, in main
    run_cp_container(args)
  File "tools/launch.py", line 100, in run_cp_container
    for pod_info in get_ip_host_pairs(args.ip_config):
  File "tools/launch.py", line 64, in get_ip_host_pairs
    raise RuntimeError("Format error of ip_config.")
RuntimeError: Format error of ip_config.
The entry in /etc/dgl/leadfile appears to be missing its IP address.
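
In the failing run the leadfile line splits into two fields (`['30050', 'dgl-graphsage-launcher']`), where the other runs show three (IP, port, pod name). A minimal sketch of a parser that names the offending line instead of raising a generic format error. `parse_ip_config` is hypothetical: it only assumes the three-field, whitespace-separated layout visible in the logs above and is not the actual `get_ip_host_pairs` implementation.

```python
def parse_ip_config(text):
    """Parse 'ip port pod_name' lines; raise ValueError naming any bad line."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(
                f"line {lineno}: expected 'ip port pod_name', got {fields!r}"
            )
        ip, port, pod_name = fields
        entries.append((ip, int(port), pod_name))
    return entries


if __name__ == "__main__":
    print(parse_ip_config("12.100.10.0 30050 dgl-graphsage-launcher\n"))
    try:
        parse_ip_config("30050 dgl-graphsage-launcher\n")
    except ValueError as err:
        print(err)
```

Reporting the line number and the fields that were actually read makes it easier to tell a missing IP apart from other formatting problems.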

On another cluster:
Phase 2/5: deliver partitions

+ POD_NAME=dgl-graphsage-launcher -c watcher-loop-partitioner
+ shift
+ /opt/kube/kubectl exec dgl-graphsage-launcher -c watcher-loop-partitioner -- /bin/sh -c mkdir -p /dgl_workspace
error: unable to upgrade connection: error dialing backend: dial tcp 127.0.0.1:34248: connect: connection timed out
connection error
