uclasystem / dorylus

Dorylus: Affordable, Scalable, and Accurate GNN Training

Languages: C++ 65.06%, Python 20.26%, Shell 6.63%, Cuda 6.02%, CMake 1.79%, C 0.15%, Makefile 0.09%
Topics: aws-lambda-threads, gnn, serverless-computing

dorylus's People

Contributors

ivanium, johnnt849, josehu07, kevalvora, redhairdragon, uclasystem


dorylus's Issues

Can we run dorylus without ec2man?

Hey authors, @kevalvora @ivanium @josehu07 @redhairdragon @johnnt849
Thank you for making dorylus open source and for writing this detailed wiki!

I am wondering if it is possible to run dorylus without using ec2man. Basically, could I launch some instances on AWS myself, compile weightserver and graphserver into binaries, and run them directly? Is that possible?

Also, for the build--upload-lambda-functions step, I cannot find the example forward-prop-josehu; could you give me a pointer for that?

Do I need to strictly follow the instance types required here for the CPU/GPU backends, or can I launch other instance types?

    For the serverless based backend: ami-07aec0eb32327b38d
    For the CPU based backend: ami-04934e5a63d144d88
    For the GPU based backend: ami-01c390943eecea45c
    For the Weight Server: ami-0901fc9a7bc310a8a

Thanks!

What specific kind of Lambda trigger did you use as its event and why C++?

Hi!
Thank you for this great work! I'm curious about the project's implementation.

I'm wondering what specific kind of Lambda trigger you used to start it. (Specifically, AWS Lambda is an event-based system; the trigger could be an HTTP request, and so on.) Thanks.

Also, why did you use C++ for the Lambda computation? According to the link below, C++ is not listed among Lambda's supported languages. Could you please explain your reasons for using C++ and how you ran C++ code on Lambda? Thanks a lot.
https://aws.amazon.com/lambda/faqs/#:~:text=Q%3A%20What%20languages%20does%20AWS%20Lambda%20support%3F
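For context on the question above: AWS Lambda can run languages outside its managed-runtime list through custom runtimes. A minimal C++ handler built on AWS's open-source aws-lambda-cpp runtime library might look like the sketch below; this illustrates the general mechanism only, and whether dorylus uses exactly this library is an assumption, not something confirmed in this thread.

```cpp
// Sketch of a C++ Lambda handler using the aws-lambda-cpp custom
// runtime library (github.com/awslabs/aws-lambda-cpp). The handler
// logic and payload here are hypothetical placeholders.
#include <aws/lambda-runtime/runtime.h>

using namespace aws::lambda_runtime;

static invocation_response handler(invocation_request const& req) {
    // req.payload holds the JSON event passed to the function.
    return invocation_response::success("{\"ok\": true}", "application/json");
}

int main() {
    // Polls the Lambda runtime API for events and dispatches each
    // invocation to handler() until the execution environment is frozen.
    run_handler(handler);
    return 0;
}
```

The resulting binary is packaged in a zip with a `bootstrap` entry point and uploaded as a function using one of Lambda's `provided` runtimes; such a function can then be fired by any Lambda event source, including direct Invoke API calls from another server.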

Best.

Execution is stuck at Epoch 1

Dear maintainers,
I'm running dorylus on Lambda right now. My graph server master is stuck at Epoch 1, and then it returns:
"[ Node 0 ] [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.864Z 7caf9b09-14b0-4cc9-9ca9-edd041fea72a Task timed out after 600.02 seconds".
My Lambda prints no logs, so I have no idea what's going on. Could you help me solve the problem?
Thank you very much!

More details are presented below.

My Cluster

  • graphserver
    2 EC2 c5n.2xlarge us-east-1f

  • weightserver
    1 EC2 t2.medium us-east-1c

  • lambda
    Named "gcn". Memory 192 MB, timeout 10 min.

    It is configured to be in the same VPC as the graph servers and the weight server (but the GSs and WS are not in the same subnet, as shown above).

My Input Dataset and Configs

Actually, I don't have much knowledge of machine learning, so I use the example simple graph presented in the tutorial as the input dataset.

  • small.graph

    0 1
    0 2
    1 3
    2 4
    3 5
    
  • feature

    The feature dimension passed to ./prepare is 4

    0.3, 0.2, 0.5, 0.7
    0.7, 0.5, 0.5, 0.3
    0.1, 0.2, 0.8, 0.7
    0.3, 0.4, 0.5, 0.1
    0.3, 0.4, 0.2, 0.1
    0.3, 0.6, 0.5, 0.8
    
  • labels

    I set the argument of ./prepare to 1

    0
    3
    2
    1
    2
    3
    
  • layerconfig (renamed to simplegraph.config later)

    4
    8
    10
    
  • Other configs
    All other configs (e.g., gserverport, nodeport) use the default values.

Commands and Full Logs

  • Command:

    • graphserver
      ./run/run-onnode graph simplegraph --l 5 --e 5
    • weightserver
      ./run/run-onnode weight simplegraph
  • Full Logs

    • graphserver

      [ Node 404 ]  Engine starts initialization...
      [ Node 404 ]  Parsed configuration: dThreads = 2, cThreads = 8, datasetDir = /filepool/simplegraph/parts_2/, featuresFile = /filepool/simplegraph/features.bsnap, dshMachinesFile = /home/ubuntu/dshmachines, myPrIpFile = /home/ubuntu/myprip, undirected = false, data port set -> 5000, control port set -> 7000, node port set -> 6000
      [ Node 404 ]  NodeManager starts initialization...
      [ Node   1 ]  Private IP: 172.31.73.108
      [ Node 404 ]  Engine starts initialization...
      [ Node 404 ]  Parsed configuration: dThreads = 2, cThreads = 8, datasetDir = /filepool/simplegraph/parts_2/, featuresFile = /filepool/simplegraph/features.bsnap, dshMachinesFile = /home/ubuntu/dshmachines, myPrIpFile = /home/ubuntu/myprip, undirected = false, data port set -> 5000, control port set -> 7000, node port set -> 6000
      [ Node 404 ]  NodeManager starts initialization...
      [ Node   0 ]  Private IP: 172.31.79.216
      [ Node   0 ]  NodeManager initialization complete.
      [ Node   0 ]  CommManager starts initialization...
      [ Node   1 ]  NodeManager initialization complete.
      [ Node   0 ]  CommManager starts initialization...
      [ Node   0 ]  CommManager initialization complete.
      [ Node   1 ]  CommManager initialization complete.
      [ Node   0 ]  Preprocessing... Output to /filepool/simplegraph/parts_2/graph.0.bin
      [ Node   1 ]  Preprocessing... Output to /filepool/simplegraph/parts_2/graph.1.bin
      Cannot open output file:/filepool/simplegraph/parts_2/graph.0.bin, [Reason: Permission denied]
      Cannot open input file: /filepool/simplegraph/parts_2/graph.0.bin, [Reason: No such file or directory]
      [ Node   0 ]  Finish preprocessing!
      [ Node   0 ]  <GM>: 0 global vertices, 0 global edges,
                      0 local vertices, 0 local in edges, 0 local out edges
                      0 out ghost vertices, 0 in ghost vertices
      [ Node   0 ]  No feature cache, loading raw data...
      [ Node   0 ]  Cannot open output cache file: /filepool/simplegraph/parts_2/feats4.0.bin [Reason: Permission denied]
      Cannot open output file:/filepool/simplegraph/parts_2/graph.1.bin, [Reason: Permission denied]
      Cannot open input file: /filepool/simplegraph/parts_2/graph.1.bin, [Reason: No such file or directory]
      [ Node   1 ]  Finish preprocessing!
      [ Node   1 ]  <GM>: 0 global vertices, 0 global edges,
                      0 local vertices, 0 local in edges, 0 local out edges
                      0 out ghost vertices, 0 in ghost vertices
                      [ Node   1 ]  No feature cache, loading raw data...
      [ Node   1 ]  Cannot open output cache file: /filepool/simplegraph/parts_2/feats4.1.bin [Reason: Permission denied]
      [ Node   1 ]  Engine initialization complete.
      [ Node   0 ]  Engine initialization complete.
      [ Node   0 ]  Number of epochs: 5, validation frequency: 1
      [ Node   0 ]  Sync Epoch 1 starts...
      [ Node   1 ]  Sync Epoch 1 starts...
      [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.864Z 7caf9b09-14b0-4cc9-9ca9-edd041fea72a Task timed out after 600.02 seconds
      [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.938Z 548bbee5-c454-479f-956c-baeb0e8e239b Task timed out after 600.02 seconds
      [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.993Z 5d6f2b65-2386-4169-9868-e43f0c5a0361 Task timed out after 600.10 seconds
      [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.001Z 402b364b-4579-40f4-94a5-03f5efde62c0 Task timed out after 600.10 seconds
      [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.004Z 285502c8-06e2-4585-8a81-983110962a8e Task timed out after 600.10 seconds
      [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.020Z a93af084-45b9-4a86-b0a9-6b3bcc37cb06 Task timed out after 600.10 seconds
      [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.032Z fe90565a-f474-4c29-84c9-1cae963f3d63 Task timed out after 600.09 seconds
      [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.053Z e40b2790-0c6a-4c28-9bec-dc2075b0a5ce Task timed out after 600.02 seconds
      [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.018Z f85d186c-56f3-4cc8-87bf-36aa28034f23 Task timed out after 600.10 seconds
      [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.075Z 627871c8-fa47-4812-af2e-11b9cc7da5fd Task timed out after 600.02 seconds
      
    • weightserver

      Killing existing 'weightserver' processes... 
      weightserver: no process found                                                                             [24/1922]Running WEIGHT servers with: [ MARK # 12 ]...                                                                       ./build/weightserver /home/ubuntu/dshmachines /home/ubuntu/myprip /home/ubuntu/gserverip 5000 65433 55431 /home/ubuntu/simplegraph.config /home/ubuntu/tmpfiles 1 1 0 GCN 0.01 0.02                                                     Binding weight server to tcp://*:65433...                                                                           [ WS   0 ] Initializing nodes...
      [ WS   0 ] All weight servers connected.
      [ WS   0 ] Layer 0 - Weights: (4, 8)
      [ WS   0 ] Layer 1 - Weights: (8, 10)
      [ WS   0 ] All nodes up to date.
      [  INFO  ] Number of lambdas set to 10.
      
    • Lambda

      No log is present in CloudWatch.

Question about the loading of Friendster

@ivanium @kevalvora @josehu07
Hi authors,
Good work!
I use the c5n.4xlarge instance. However, when I run the command to prepare the dataset,

./prepare <PathToRawGraph> <Undirected? (0/1)> <NumVertices> <NumPartitions> <PathToRawFeatures> <DimFeatures> <PathToRawLabels> <LabelKinds>

there is an Out of Memory error (42 GB is not enough). To run Friendster, do I need to use a large-memory node to do the partitioning first, and then distribute the partitions to each node? How large should the RAM be?

Thanks for the help!

Question: how to run dorylus on other platforms

Dear maintainers,

I'm using a FaaS platform other than AWS Lambda, and I would like to run dorylus on other serverless platforms such as OpenWhisk. However, I find that the implementation of dorylus is tightly coupled with AWS Lambda.

Could you please give me some advice on how to run dorylus on other platforms?
E.g., which files should I modify? Approximately how much time or effort would it take?

Thank you!

Invalid message sizes

Hello,

I am currently trying to get dorylus running, and I am running into an issue where the lambda function times out after 900 seconds. To track down the cause of the bug, I modified src/funcs/gcn/ops/network_ops.cpp to add some logging, like this:

...
int recvTensor(zmq::socket_t& socket, Matrix &mat) {
    zmq::message_t tensorHeader(TENSOR_HDR_SIZE);
    zmq::message_t tensorData;

    std::cout << "Calling recv for tensor header of size " << TENSOR_HDR_SIZE << std::endl;
    if (!socket.recv(&tensorHeader)) {
        return 0;
    }
    std::cout << "Received tensor header" << std::endl;

    unsigned resp = parse<unsigned>((char*)tensorHeader.data(), 0);
    if (resp == ERR_HEADER_FIELD) {
        std::cerr << "Got error from server. Consult graph server output" << std::endl;
        return -1;
    }
    std::string name = parseName((char*)tensorHeader.data());

    std::cout << "Calling receive for tensor data" << std::endl;
    if (!socket.recv(&tensorData)) {
        return 0;
    }
    std::cout << "Received tensor data" << std::endl;

    unsigned rows = parse<unsigned>((char*)tensorHeader.data(), 3);
    unsigned cols = parse<unsigned>((char*)tensorHeader.data(), 4);

    FeatType* data = new FeatType[rows * cols];
    std::memcpy(data, tensorData.data(), tensorData.size());

    mat.setName(name.c_str());
    mat.setRows(rows);
    mat.setCols(cols);
    mat.setData(data);

    return 0;
}

void logabbleHeader(void* header, unsigned op, Chunk &chunk) {
    char *ptr = (char *)header;
    memcpy(ptr, &op, sizeof(unsigned));
    std::cout << "Sending data of " << op << " of size " << sizeof(unsigned) << std::endl;
    memcpy(ptr + sizeof(unsigned), &chunk, sizeof(chunk));
    std::cout << "Size of chunk is " << sizeof(chunk) << std::endl;
}

std::vector<Matrix> reqTensors(zmq::socket_t& socket, Chunk &chunk, std::vector<std::string>& tensorRequests) {

#define INIT_PERIOD (5 * 1000u) // 5ms
#define MAX_PERIOD (500 * 1000u)
#define EXP_FACTOR 1.5

    unsigned sleepPeriod = INIT_PERIOD;
    bool empty = true;
    std::vector<Matrix> matrices;
    while (true) {
        zmq::message_t header(HEADER_SIZE);
        logabbleHeader(header.data(), OP::PULL, chunk);
        std::cout << "Sending header of size " << HEADER_SIZE << std::endl;
        socket.send(header, ZMQ_SNDMORE);
        std::cout << "Socket sent header" << std::endl;

        unsigned numTensors = tensorRequests.size();
        std::cout << "Got numTensors of " << numTensors << std::endl;

        for (unsigned u = 0; u < tensorRequests.size(); ++u) {
            std::string& name = tensorRequests[u];
            zmq::message_t tensorHeader(TENSOR_HDR_SIZE);
            populateHeader(tensorHeader.data(), chunk.localId, name.c_str());
            std::cout << "Populated tensor header of size " << TENSOR_HDR_SIZE << std::endl;

            if (u < numTensors - 1) {
                socket.send(tensorHeader, ZMQ_SNDMORE);
            } else {
                socket.send(tensorHeader);
            }
            std::cout << "Sent tensor header " << u << std::endl;
        }

        unsigned more = 1;
        empty = false;
        while (more && !empty) {
            Matrix result;
            std::cout << "Calling recv tensor" << std::endl;
            int ret = recvTensor(socket, result);
            std::cout << "recvTensor returned val of " << ret << std::endl;

            if (ret == -1) {
                for (auto& M : matrices) deleteMatrix(M);
                matrices.clear();
                return matrices;
            }
            if (result.empty()) {
                empty = result.empty();

                for (auto& M : matrices) deleteMatrix(M);
                matrices.clear();
                size_t usize = sizeof(more);
                socket.getsockopt(ZMQ_RCVMORE, &more, &usize);
            } else {
                matrices.push_back(result);

                size_t usize = sizeof(more);
                socket.getsockopt(ZMQ_RCVMORE, &more, &usize);
            }
        }

        if (RESEND && empty) {
            usleep(sleepPeriod);
            sleepPeriod *= EXP_FACTOR;
            sleepPeriod = std::min(sleepPeriod, MAX_PERIOD);
        } else {
            break;
        }
    }

    return matrices;

#undef INIT_PERIOD
#undef MAX_PERIOD
#undef EXP_FACTOR
}
...

with the rest of the gcn lambda function unchanged. When I look at the CloudWatch log, I get the following output:
[screenshot of CloudWatch log output]

Taking a look at the graph-server source code, I see that the issue is in src/graph-server/commmanager/lambdaworker.cpp; specifically, the following checks in LambdaWorker::work are failing:

// recv will return false if timed out.
if (!workersocket.recv(&identity)) {
    continue;
}
if (identity.size() != IDENTITY_SIZE) {
    printLog(manager->nodeId, "identity size %u", identity.size());
    continue;
}
if (!workersocket.recv(&header)) {
    continue;
}
if (header.size() != HEADER_SIZE) {
    printLog(manager->nodeId, "header size %u", header.size());
    continue;
}

I added the following log statement in src/graph-server/main.cpp:

const unsigned IDENTITY_SIZE = sizeof(Chunk) + sizeof(unsigned);
printLog(engine.getNodeId(), "Chunk of size %d, unsigned of size %d, identity of size %d, and header of size %d",
         sizeof(Chunk), sizeof(unsigned), IDENTITY_SIZE, HEADER_SIZE);

and get the output of:

[ Node   0 ]  Chunk of size 32, unsigned of size 4, identity of size 36, and header of size 36

This means that the graph server expects two messages of size 36 each, while the lambda function is sending one message of size 36 and another of size 28, which is causing this issue.

I see that HEADER_SIZE (which is what the graph server checks against) is defined as:

#define HEADER_SIZE (sizeof(unsigned) + sizeof(Chunk))

and chunk has the following definition:

struct Chunk {
    unsigned localId;
    unsigned globalId;
    unsigned lowBound;
    unsigned upBound;
    unsigned layer;
    PROP_TYPE dir;
    unsigned epoch;
    bool vertex;
    ...
};

On the other hand, the lambda function sends a message of size TENSOR_HDR_SIZE, which is defined as:

static const size_t TENSOR_HDR_SIZE = sizeof(unsigned) * 5 + TENSOR_NAME_SIZE;

and it gets populated as such:

populateHeader(tensorHeader.data(), chunk.localId, name.c_str());

which I assume calls the following function:

static inline void
populateHeader(void* header, unsigned op, const char* tensorName, unsigned field1 = 0,
  unsigned field2 = 0, unsigned field3 = 0, unsigned field4 = 0) {
    char* data = (char*)header;
    serialize<unsigned>(data, 0, op);
    std::memcpy(data + sizeof(unsigned), tensorName, TENSOR_NAME_SIZE);
    serialize<unsigned>(data, 3, field1);
    serialize<unsigned>(data, 4, field2);
    serialize<unsigned>(data, 5, field3);
    serialize<unsigned>(data, 6, field4);
}

I would appreciate some advice on how to resolve this issue.
