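Prerequisites
I deployed three nodes using the Ansible FATE-1.7.0 single-sided deployment. The detailed configuration is as follows:
Node 210, Exchange role: certificates were generated with /bin/bash deploy/deploy.sh keys, and both server-side and client-side authentication are enabled.
Its route_table.json is configured as follows: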
{
  "route_table":
  {
    "211":
    {
      "default":[
        {
          "is_secure": true,
          "ip": "10.32.122.211",
          "port": 9371
        }
      ]
    },
    "213":
    {
      "default":[
        {
          "is_secure": true,
          "ip": "10.32.122.213",
          "port": 9371
        }
      ]
    },
.....
The relevant settings in eggroll.properties are:
eggroll.core.security.client.ca.crt.path=/data/projects/data/fate/keys/exchange-client-ca.pem
eggroll.core.security.client.crt.path=/data/projects/data/fate/keys/exchange-client-client.pem
eggroll.core.security.client.key.path=/data/projects/data/fate/keys/exchange-client-client.key
eggroll.core.security.ca.crt.path=/data/projects/data/fate/keys/exchange-ca.pem
eggroll.core.security.crt.path=/data/projects/data/fate/keys/exchange-server.pem
eggroll.core.security.key.path=/data/projects/data/fate/keys/exchange-server.key
Node 213, Host role: the certificates from node 210 were copied to the corresponding directory, for example:
scp deploy/keys/exchange/ca.pem [email protected]:/data/projects/data/fate/keys/host-client-ca.pem
.....
In addition, its route_table.json is configured as follows:
{
  "route_table":
  {
    "default":
    {
      "default":[
        {
          "is_secure": true,
          "ip": "10.32.122.210",
          "port": 9370
        }
      ]
    },
    "213":
    {
      "default":[
        {
          "ip": "10.32.122.213",
          "port": 9370
        }
      ],
      "fateflow":[
        {
          "ip": "10.32.122.213",
          "port": 9360
        }
      ]
    }
  },
  "permission":
  {
    "default_allow": true
  }
}
The relevant settings in eggroll.properties are:
eggroll.rollsite.lan.insecure.channel.enabled=true
eggroll.rollsite.secure.port=9371
eggroll.core.security.client.ca.crt.path=/data/projects/data/fate/keys/host-client-ca.pem
eggroll.core.security.client.crt.path=/data/projects/data/fate/keys/host-client-client.pem
eggroll.core.security.client.key.path=/data/projects/data/fate/keys/host-client-client.key
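As a quick sanity check on a node, the certificate settings above can be read back and each referenced file tested for existence. The following is only a minimal sketch: the helper names load_properties and check_security_paths are made up for illustration, and the sample text is copied from the Host settings in this report. It reports what is configured and says nothing about which keys RollSite actually requires.

```python
import os


def load_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_security_paths(props, exists=os.path.exists):
    """Return {key: (path, file_exists)} for each eggroll.core.security.*.path key."""
    report = {}
    for key, value in sorted(props.items()):
        if key.startswith("eggroll.core.security.") and key.endswith(".path"):
            report[key] = (value, exists(value))
    return report


# Sample text copied from the Host (213) settings in this report.
HOST_PROPS = """
eggroll.rollsite.lan.insecure.channel.enabled=true
eggroll.rollsite.secure.port=9371
eggroll.core.security.client.ca.crt.path=/data/projects/data/fate/keys/host-client-ca.pem
eggroll.core.security.client.crt.path=/data/projects/data/fate/keys/host-client-client.pem
eggroll.core.security.client.key.path=/data/projects/data/fate/keys/host-client-client.key
"""

for key, (path, found) in check_security_paths(load_properties(HOST_PROPS)).items():
    print(f"{key} -> {path} [{'found' if found else 'MISSING'}]")
```

On a real node, the text would come from the actual eggroll.properties file (the shell prompt in this report suggests /data/projects/fate/eggroll/conf, but that location is an assumption).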
Node 211, Guest role: the certificates from node 210 were copied to the corresponding directory, for example:
scp deploy/keys/exchange/ca.pem [email protected]:/data/projects/data/fate/keys/guest-client-ca.pem
.....
In addition, its route_table.json is configured as follows:
{
  "route_table":
  {
    "default":
    {
      "default":[
        {
          "is_secure": true,
          "ip": "10.32.122.210",
          "port": 9371
        }
      ]
    },
    "211":
    {
      "default":[
        {
          "ip": "10.32.122.211",
          "port": 9370
        }
      ],
      "fateflow":[
        {
          "ip": "10.32.122.211",
          "port": 9360
        }
      ]
    }
  },
  "permission":
  {
    "default_allow": true
  }
}
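To compare the route tables of the three nodes side by side, it may help to flatten each one into a list of endpoints with their port and is_secure flag. This is a rough sketch only: summarize_routes is a hypothetical helper, the sample is a trimmed copy of the Guest route table from this report, and treating a missing is_secure as false is an assumption.

```python
import json


def summarize_routes(route_table_json):
    """Return (party, service, ip, port, is_secure) for every endpoint."""
    table = json.loads(route_table_json)["route_table"]
    rows = []
    for party, services in table.items():
        for service, endpoints in services.items():
            for ep in endpoints:
                # Assumption: a missing "is_secure" means a plaintext channel.
                rows.append((party, service, ep["ip"], ep["port"],
                             bool(ep.get("is_secure", False))))
    return rows


# Trimmed copy of the Guest (211) route table from this report.
GUEST_ROUTES = """
{
  "route_table": {
    "default": {"default": [{"is_secure": true, "ip": "10.32.122.210", "port": 9371}]},
    "211": {
      "default": [{"ip": "10.32.122.211", "port": 9370}],
      "fateflow": [{"ip": "10.32.122.211", "port": 9360}]
    }
  },
  "permission": {"default_allow": true}
}
"""

for party, service, ip, port, secure in summarize_routes(GUEST_ROUTES):
    print(f"{party:>8} {service:<9} {ip}:{port} secure={secure}")
```

Running the same helper over the Host and Exchange tables makes it easier to see which entries pair is_secure: true with port 9370 and which with port 9371.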
The relevant settings in eggroll.properties are:
eggroll.rollsite.lan.insecure.channel.enabled=true
eggroll.rollsite.secure.port=9371
eggroll.core.security.client.ca.crt.path=/data/projects/data/fate/keys/guest-client-ca.pem
eggroll.core.security.client.crt.path=/data/projects/data/fate/keys/guest-client-client.pem
eggroll.core.security.client.key.path=/data/projects/data/fate/keys/guest-client-client.key
Test
The fate-rollsite service was restarted on all nodes.
On node 211, I ran:
source /data/projects/fate/bin/init.sh
flow test toy -gid 211 -hid 213
The command failed with the following error:
(venv) app@cestc211:/data/projects/fate/eggroll/conf$ flow test toy -gid 211 -hid 213
{
"jobId": "202204211045098230110",
"retcode": 103,
"retmsg": "Traceback (most recent call last):\n File "/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py", line 124, in submit\n raise Exception("create job failed", response)\nException: ('create job failed', {'guest': {211: {'data': {'components': {'secure_add_example_0': {'need_run': True}}}, 'retcode': 0, 'retmsg': 'success'}}, 'host': {213: {'retcode': <RetCode.FEDERATED_ERROR: 104>, 'retmsg': 'Federated schedule error, Please check rollSite and fateflow network connectivityrpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = "UNAVAILABLE: \n[Roll Site Error TransInfo] \n location msg=UNAVAILABLE: io exception \n stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\n\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240)\n\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221)\n\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140)\n\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\n\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\n\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)\n\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\n\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\n\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\n\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\n\tat 
io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.32.122.213:9371\nCaused by: java.net.ConnectException: finishConnect(..) failed: Connection refused\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:660)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:637)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:524)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:473)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:383)\n\tat io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)\n\tat io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.lang.Thread.run(Thread.java:748)\n \n\nexception trans path: 10.32.122.210(${id}) --> 10.32.122.211(211)"\n\tdebug_error_string = 
"{"created":"@1650509120.047216058","description":"Error received from peer ipv4:10.32.122.211:9370","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"UNAVAILABLE: \\n[Roll Site Error TransInfo] \\n location msg=UNAVAILABLE: io exception \\n stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\\n\\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240)\\n\\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221)\\n\\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140)\\n\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\\n\\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\\n\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\\n\\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)\\n\\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\\n\\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\\n\\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\\n\\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\\n\\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\\n\\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)\\n\\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\\n\\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\\n\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat java.lang.Thread.run(Thread.java:748)\\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.32.122.213:9371\\nCaused by: java.net.ConnectException: finishConnect(..) failed: Connection refused\\n\\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)\\n\\tat io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:660)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:637)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:524)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:473)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:383)\\n\\tat io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)\\n\\tat io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\\n\\tat io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\\n\\tat java.lang.Thread.run(Thread.java:748)\\n \\n\\nexception trans path: 10.32.122.210(${id}) --> 10.32.122.211(211)","grpc_status":14}"\n>'}}})\n"
}
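The retmsg above is long, and the interesting part is the nested "Caused by:" lines, which end in Connection refused: /10.32.122.213:9371. A small sketch like the one below can pull those lines out of such a message. The function name root_causes is invented for illustration, and the excerpt is a shortened copy of the output above.

```python
import re


def root_causes(retmsg):
    """Return the distinct 'Caused by:' lines found in a FATE Flow retmsg."""
    # The message embeds newlines and tabs as literal "\n" / "\t" text,
    # sometimes with doubled backslashes, so turn them back into whitespace.
    text = re.sub(r"\\+n", "\n", retmsg)
    text = re.sub(r"\\+t", "\t", text)
    causes = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Caused by:") and line not in causes:
            causes.append(line)
    return causes


# Shortened excerpt of the retmsg above, kept as raw strings.
EXCERPT = (
    r"\tat java.lang.Thread.run(Thread.java:748)"
    r"\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: "
    r"finishConnect(..) failed: Connection refused: /10.32.122.213:9371"
    r"\nCaused by: java.net.ConnectException: finishConnect(..) failed: Connection refused"
    r"\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)"
)

for cause in root_causes(EXCERPT):
    print(cause)
```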
Additional information
Without certificates (is_secure: false, port: 9370), the job runs normally.
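Since the plaintext port works and the trace reports a refused connection on 9371, one thing that might be worth checking is whether anything is listening on the secure port at all. The sketch below is a plain TCP probe using the standard library; port_is_open is a hypothetical helper name, and it only tests whether a connection can be opened, not whether the TLS handshake or certificate verification would succeed.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

For example, calling port_is_open("10.32.122.213", 9370) and port_is_open("10.32.122.213", 9371) from node 210 would show whether the Host's RollSite accepts connections on each port.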
Expected
The correct configuration when certificates are used.
Thanks!