Comments (5)
@leepengcheng Sorry for the late reply. Could you provide more info, like
- Did the dgl-graphsage-partitioner Pod work as expected, and was its final status Completed?
- Please provide the dgl-graphsage-partitioner Pod log.
- Were the dgl-graphsage-worker Pods created as expected at first, but then failed or evicted before the main container of dgl-graphsage-launcher was running?
The regular lifecycle is like
- Partitioning phase
  - The dgl-graphsage-partitioner Pod has a running status, which means the main container of the Pod is running.
  - An initContainer (partitioning finish tracker) of dgl-graphsage-launcher is running.
- Partitioning finished
  - The dgl-graphsage-partitioner Pod has a completed status.
  - The initContainer (partitioning finish tracker) is completed.
  - dgl-graphsage-worker Pods are intended to be created.
  - An initContainer (worker readiness tracker) of dgl-graphsage-launcher is running.
- Training phase
  - The dgl-graphsage-launcher Pod has a running status, which means the main container of the Pod is running.
  - All dgl-graphsage-worker Pods have a running status.
  - Launching the distributed training...
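The lifecycle above can be read as a small decision rule over Pod statuses. Below is a minimal sketch of that rule, not code from dgl-operator: the function name lifecycle_phase and the simplified status strings ("Running", "Completed", and anything else treated as opaque) are assumptions made for illustration.

```python
from typing import List


def lifecycle_phase(partitioner: str, launcher: str, workers: List[str]) -> str:
    """Map simplified Pod statuses to the lifecycle phase described above.

    The status strings are illustrative; real statuses shown by kubectl
    (for example init-container progress) are treated as opaque here.
    """
    # Partitioning phase: the partitioner's main container is running.
    if partitioner == "Running":
        return "partitioning"
    if partitioner == "Completed":
        # Training phase: launcher main container and all workers are running.
        if launcher == "Running" and workers and all(w == "Running" for w in workers):
            return "training"
        # Otherwise the launcher is still waiting in an initContainer.
        return "partitioning finished"
    # Any other partitioner status (e.g. Error) is outside the regular lifecycle.
    return "unexpected"
```

For example, a partitioner in Error status falls outside the three regular phases, so its log is the first thing to check.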
from dgl-operator.
The ogbn dataset is too slow, so I replaced it with the cora_v2 dataset. The dgl-graphsage-partitioner pod has an Error status.
dgl-graphsage-partitioner pod log:
Phase 1/5: load and partition graph
----------
[14:07:14] /opt/dgl/src/graph/transform/metis_partition_hetero.cc:73: Partition a graph with 2708 nodes and 10556 edges into 2 parts and get 323 edge cuts
Using backend: pytorch
/usr/local/lib/python3.6/site-packages/dgl/data/utils.py:285: UserWarning: Property dataset.num_labels will be deprecated, please use dataset.num_classes instead.
warnings.warn('Property {} will be deprecated, please use {} instead.'.format(old, new))
Partition arguments: Namespace(balance_edges=True, balance_train=True, dataset_url='http://snap.stanford.edu/ogb/data/nodeproppred/products.zip', graph_name='graphsage', num_parts=2, output='/dgl_workspace/dataset', part_method='metis', rel_data_path='dataset', undirected=False, workspace='/dgl_workspace')
NumNodes: 2708
NumEdges: 10556
NumFeats: 1433
NumClasses: 7
NumTrainingSamples: 140
NumValidationSamples: 500
NumTestSamples: 1000
Done loading data from cached files.
load 'ogbn-products' takes 0.141 seconds
|V|=2708, |E|=10556
train: 140, valid: 500, test: 1000
Convert a graph into a bidirected graph: 0.003 seconds
Construct multi-constraint weights: 0.006 seconds
Metis partitioning: 0.003 seconds
Reshuffle nodes and edges: 0.041 seconds
Split the graph: 0.001 seconds
Construct subgraphs: 0.005 seconds
part 0 has 1611 nodes and 1394 are inside the partition
part 0 has 5353 edges and 5353 are inside the partition
part 1 has 1513 nodes and 1314 are inside the partition
part 1 has 5203 edges and 5203 are inside the partition
Save partitions: 0.193 seconds
There are 10556 edges in the graph and 0 edge cuts for 2 partitions.
############# partition_graph ###############
----------
Phase 1/5 finished
Phase : 2 seconds
Total : 2 seconds
----------
Phase 2/5: deliver partitions
----------
Launch arguments: Namespace(cmd_type='copy_batch_container', container='watcher-loop-partitioner', ip_config='/etc/dgl/leadfile', num_parts=None, num_samplers=0, num_server_threads=1, num_servers=None, num_trainers=None, part_config=None, source_file_paths='/dgl_workspace/dataset', target_dir='/dgl_workspace', worker_chief_index=0, workspace='/dgl_workspace'), []
Traceback (most recent call last):
  File "tools/launch.py", line 278, in <module>
    main()
  File "tools/launch.py", line 250, in main
    run_cp_container(args)
  File "tools/launch.py", line 98, in run_cp_container
    for pod_info in get_ip_host_pairs(args.ip_config):
  File "tools/launch.py", line 62, in get_ip_host_pairs
    raise RuntimeError("Format error of ip_config.")
RuntimeError: Format error of ip_config.
----------
Phase 2/5 error raised
from dgl-operator.
I solved this problem by adding 'time.sleep(10)', because the partitioner pod finished too quickly (I mounted a local cora dataset) 😱
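A fixed sleep works but depends on timing. As a hedged alternative, the sketch below polls until a file exists and is non-empty, with a timeout. It assumes the failure is a race in which the ip_config file (/etc/dgl/leadfile in the log above) has not been populated yet when partitioning finishes; the thread does not confirm that root cause, and wait_for_nonempty_file is a hypothetical helper, not part of dgl-operator.

```python
import os
import time


def wait_for_nonempty_file(path: str, timeout: float = 60.0,
                           poll_interval: float = 0.5) -> bool:
    """Poll until `path` is a non-empty regular file or `timeout` seconds pass.

    Returns True if the file became available, False on timeout.
    Hypothetical helper; the race on the ip_config file is an assumption.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

It could be called before the delivery phase, for example wait_for_nonempty_file("/etc/dgl/leadfile"), raising an error if it returns False. Note that this only checks for a non-empty file, not that its contents are well formed.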
from dgl-operator.
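leepengcheng
How did you mount the data? I am currently mounting the data and code via NFS, but they cannot be found on the partitioner node. How do I mount them on the partitioner node?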
from dgl-operator.
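leepengcheng
How did you mount the data? I am currently mounting the data and code via NFS, but they cannot be found on the partitioner node. How do I mount them on the partitioner node?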
apiVersion: qihoo.net/v1alpha1
kind: DGLJob
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  partitionMode: DGL-API
  cleanPodPolicy: Running
  dglReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: k3d-sfai-registry:39203/examples:graphsage-dist
            name: dgl-graphsage
            imagePullPolicy: IfNotPresent
            resources:
              requests:
                ephemeral-storage: 10Gi
              limits:
                ephemeral-storage: 15Gi
            command:
            - dglrun
            args:
            - --graph-name
            - graphsage
            # partition
            - --partition-entry-point
            - code/load_and_partition_graph.py
            - --num-partitions
            - "2"
            - --balance-train
            - --balance-edges
            - --dataset-url
            - http://snap.stanford.edu/ogb/data/nodeproppred/products.zip
            # training
            - --train-entry-point
            - code/train_dist.py
            - --num-epochs
            - "1"
            - --batch-size
            - "1000"
            - --num-trainers
            - "1"
            - --num-samplers
            - "4"
            - --num-servers
            - "1"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: k3d-sfai-registry:39203/examples:graphsage-dist
            name: dgl-graphsage
            imagePullPolicy: IfNotPresent
            resources:
              requests:
                memory: 15Gi
                cpu: "2"
                ephemeral-storage: 10Gi
              limits:
                memory: 20Gi
                cpu: "4"
                ephemeral-storage: 15Gi
            volumeMounts:
            - mountPath: /root/.dgl/ogbn_products
              name: ogbn
            - mountPath: /dgl_workspace/code
              name: code
          volumes:
          - name: ogbn
            hostPath:
              path: /home/dgl/data/ogbn_products
              type: Directory
          - name: code
            hostPath:
              path: /home/dgl/dgl/examples/GraphSAGE_dist/code
              type: Directory
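You can use this as a reference. I set up the cluster with K3S, so I had to mount twice. Note the difference between the file path on the host and the path inside the container.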
from dgl-operator.