Comments (5)

ryantd commented on May 29, 2024

@leepengcheng Sorry for the late reply. Could you provide more information:

  1. Did the dgl-graphsage-partitioner Pod work as expected, and was its final status Completed?
  2. Please provide the dgl-graphsage-partitioner Pod log.
  3. Were the dgl-graphsage-worker Pods created as expected at first, but then failed or were evicted before the main container of dgl-graphsage-launcher started running?

The regular lifecycle is as follows:

  1. Partitioning phase
    1. The dgl-graphsage-partitioner Pod has a Running status, which means the main container of the Pod is running.
    2. An initContainer (partitioning finish tracker) of dgl-graphsage-launcher is running.
  2. Partitioning finished
    1. The dgl-graphsage-partitioner Pod has a Completed status.
    2. The initContainer (partitioning finish tracker) has completed.
    3. The dgl-graphsage-worker Pods are expected to be created.
    4. An initContainer (worker readiness tracker) of dgl-graphsage-launcher is running.
  3. Training phase
    1. The dgl-graphsage-launcher Pod has a Running status, which means the main container of the Pod is running.
    2. All dgl-graphsage-worker Pods have a Running status.
    3. The distributed training is launched.
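
As a rough illustration of the lifecycle above, the sketch below maps observed Pod statuses to the phase they suggest, which can help when deciding which log to look at first. The function name, its parameters, and the returned labels are assumptions made for this example; they are not part of dgl-operator.

```python
from typing import List


def infer_phase(partitioner_status: str,
                launcher_main_running: bool,
                worker_statuses: List[str]) -> str:
    """Guess the lifecycle phase from Pod statuses (illustrative only)."""
    if partitioner_status == "Error":
        # The partitioner failed, so workers are never expected to start.
        return "partitioning failed"
    if partitioner_status == "Running":
        return "partitioning"
    if partitioner_status == "Completed":
        all_workers_running = bool(worker_statuses) and all(
            status == "Running" for status in worker_statuses
        )
        if launcher_main_running and all_workers_running:
            return "training"
        # Partitions exist, but the launcher is still waiting on workers.
        return "partitioning finished"
    return "unknown"
```

For example, a partitioner in Error status with no workers created points at the partitioner log rather than the launcher.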

from dgl-operator.

leepengcheng commented on May 29, 2024

@ryantd

The ogbn dataset was too slow, so I replaced it with the cora_v2 dataset. The dgl-graphsage-partitioner Pod ends up with an Error status.

dgl-graphsage-partitioner pod log:

Phase 1/5: load and partition graph
----------
[14:07:14] /opt/dgl/src/graph/transform/metis_partition_hetero.cc:73: Partition a graph with 2708 nodes and 10556 edges into 2 parts and get 323 edge cuts
Using backend: pytorch
/usr/local/lib/python3.6/site-packages/dgl/data/utils.py:285: UserWarning: Property dataset.num_labels will be deprecated, please use dataset.num_classes instead.
  warnings.warn('Property {} will be deprecated, please use {} instead.'.format(old, new))
Partition arguments: Namespace(balance_edges=True, balance_train=True, dataset_url='http://snap.stanford.edu/ogb/data/nodeproppred/products.zip', graph_name='graphsage', num_parts=2, output='/dgl_workspace/dataset', part_method='metis', rel_data_path='dataset', undirected=False, workspace='/dgl_workspace')
  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done loading data from cached files.
load 'ogbn-products' takes 0.141 seconds
|V|=2708, |E|=10556
train: 140, valid: 500, test: 1000
Convert a graph into a bidirected graph: 0.003 seconds
Construct multi-constraint weights: 0.006 seconds
Metis partitioning: 0.003 seconds
Reshuffle nodes and edges: 0.041 seconds
Split the graph: 0.001 seconds
Construct subgraphs: 0.005 seconds
part 0 has 1611 nodes and 1394 are inside the partition
part 0 has 5353 edges and 5353 are inside the partition
part 1 has 1513 nodes and 1314 are inside the partition
part 1 has 5203 edges and 5203 are inside the partition
Save partitions: 0.193 seconds
There are 10556 edges in the graph and 0 edge cuts for 2 partitions.
############# partition_graph ###############
----------
Phase 1/5 finished
Phase : 2 seconds
Total : 2 seconds
----------
Phase 2/5: deliver partitions
----------
Launch arguments: Namespace(cmd_type='copy_batch_container', container='watcher-loop-partitioner', ip_config='/etc/dgl/leadfile', num_parts=None, num_samplers=0, num_server_threads=1, num_servers=None, num_trainers=None, part_config=None, source_file_paths='/dgl_workspace/dataset', target_dir='/dgl_workspace', worker_chief_index=0, workspace='/dgl_workspace'), []
Traceback (most recent call last):
  File "tools/launch.py", line 278, in <module>
    main()
  File "tools/launch.py", line 250, in main
    run_cp_container(args)
  File "tools/launch.py", line 98, in run_cp_container
    for pod_info in get_ip_host_pairs(args.ip_config):
  File "tools/launch.py", line 62, in get_ip_host_pairs
    raise RuntimeError("Format error of ip_config.")
RuntimeError: Format error of ip_config.
----------
Phase 2/5 error raised

leepengcheng commented on May 29, 2024

@ryantd

I solved this problem by adding 'time.sleep(10)', because the partitioner Pod finished too quickly (I mounted a local cora dataset) 😱
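
A fixed sleep works but depends on timing. One possible alternative, assuming the error comes from /etc/dgl/leadfile not yet being populated when the partitioner finishes, is to poll the file until it looks complete. The sketch below is only that: the helper names and the rule "every non-blank line has at least min_fields whitespace-separated fields" are assumptions for illustration, and the real format check lives in get_ip_host_pairs in tools/launch.py.

```python
import os
import time


def ip_config_is_ready(path, min_fields=2):
    """Return True if the file exists and every non-blank line has enough fields.

    The field-count rule is an assumed stand-in for the real format check.
    """
    if not os.path.isfile(path):
        return False
    with open(path) as f:
        rows = [line.split() for line in f if line.strip()]
    return bool(rows) and all(len(fields) >= min_fields for fields in rows)


def wait_for_ip_config(path, timeout=60.0, interval=1.0, min_fields=2):
    """Poll until the ip_config file looks ready; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if ip_config_is_ready(path, min_fields):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling wait_for_ip_config('/etc/dgl/leadfile') before the delivery phase would replace the fixed delay with a bounded wait that returns as soon as the file is usable.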

allendred commented on May 29, 2024

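@leepengcheng

How did you mount the data? I am currently mounting the data and code via NFS, but they cannot be found on the partitioner node. How can I mount them on the partitioner node?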

leepengcheng commented on May 29, 2024

apiVersion: qihoo.net/v1alpha1
kind: DGLJob
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  partitionMode: DGL-API
  cleanPodPolicy: Running
  dglReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: k3d-sfai-registry:39203/examples:graphsage-dist
            name: dgl-graphsage
            imagePullPolicy: IfNotPresent
            resources:
              requests:
                ephemeral-storage: 10Gi
              limits:
                ephemeral-storage: 15Gi
            command:
            - dglrun
            args:
            - --graph-name
            - graphsage
            # partition
            - --partition-entry-point
            - code/load_and_partition_graph.py
            - --num-partitions
            - "2"
            - --balance-train
            - --balance-edges
            - --dataset-url
            - http://snap.stanford.edu/ogb/data/nodeproppred/products.zip
            # training
            - --train-entry-point
            - code/train_dist.py
            - --num-epochs
            - "1"
            - --batch-size
            - "1000"
            - --num-trainers
            - "1"
            - --num-samplers
            - "4"
            - --num-servers
            - "1"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: k3d-sfai-registry:39203/examples:graphsage-dist
            name: dgl-graphsage
            imagePullPolicy: IfNotPresent
            resources:
              requests:
                memory: 15Gi
                cpu: "2"
                ephemeral-storage: 10Gi
              limits:
                memory: 20Gi
                cpu: "4"
                ephemeral-storage: 15Gi
            volumeMounts:
              - mountPath: /root/.dgl/ogbn_products
                name: ogbn
              - mountPath: /dgl_workspace/code
                name: code

          volumes:
          - name: ogbn
            hostPath:
              path: /home/dgl/data/ogbn_products
              type: Directory
          - name: code
            hostPath:
              path: /home/dgl/dgl/examples/GraphSAGE_dist/code
              type: Directory

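You can use this as a reference. I set up the cluster with K3S, so the data has to be mounted twice. Note the difference between the file paths on the host and the paths inside the container.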
