Error connecting about angel HOT 14 CLOSED

angel-ml avatar angel-ml commented on May 14, 2024
Error connecting

from angel.

Comments (14)

qdj0511 avatar qdj0511 commented on May 14, 2024 1

@binmu123 I just tested it, and adding JAVA_HOME to angel_submit does indeed fix it.

paynie avatar paynie commented on May 14, 2024

Thanks for reporting this issue. You can open http://stghadoop1:8088/cluster/app/application_1498124481939_0016/ to find the detailed failure message and paste it here. 10.12.101.245 is the location of the Angel master.

HyperGroups avatar HyperGroups commented on May 14, 2024

PSAttempt_0_0 failed due to: Exception from container-launch.
Container id: container_1498124481939_0018_01_000002
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
PSAttempt_0_1 failed due to: Exception from container-launch.
Container id: container_1498124481939_0018_01_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
PSAttempt_0_2 failed due to: Exception from container-launch.
Container id: container_1498124481939_0018_01_000004
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
PSAttempt_0_3 failed due to: Exception from container-launch.
Container id: container_1498124481939_0018_01_000005
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1

lshmouse avatar lshmouse commented on May 14, 2024

@HyperGroups Please find the detailed log of the container (container_1498124481939_0018_01_000002) in YARN.
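
One common way to fetch those logs, assuming log aggregation is enabled on the cluster, is the yarn logs -applicationId command. The sketch below derives the application ID from the container ID quoted above and builds that command line. The helper names are hypothetical and are not part of Angel or Hadoop.

```python
import re
from typing import List

# Container IDs look like container_<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>.
# Newer Hadoop releases may add an epoch field (container_e17_...), which is tolerated here.
_CONTAINER_RE = re.compile(r"^container_(?:e\d+_)?(\d+)_(\d+)_(\d+)_(\d+)$")


def application_id_from_container(container_id: str) -> str:
    """Return the ID of the YARN application that owns the container (hypothetical helper)."""
    match = _CONTAINER_RE.match(container_id.strip())
    if match is None:
        raise ValueError(f"not a YARN container ID: {container_id!r}")
    cluster_timestamp, app_sequence = match.group(1), match.group(2)
    return f"application_{cluster_timestamp}_{app_sequence}"


def yarn_logs_command(container_id: str) -> List[str]:
    """Build the `yarn logs` invocation for the application that owns the container."""
    app_id = application_id_from_container(container_id)
    return ["yarn", "logs", "-applicationId", app_id]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without a cluster.
    print(" ".join(yarn_logs_command("container_1498124481939_0018_01_000002")))
```

The aggregated output contains the stderr and stdout of every container in the application, including the PS containers that failed to launch.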

paynie avatar paynie commented on May 14, 2024

@HyperGroups It seems that YARN failed to start the PS. This error is commonly caused by a parameter format error or by an invalid environment (such as the JDK). You can find the error message in the PS container log using the method described in the FAQ.

HyperGroups avatar HyperGroups commented on May 14, 2024

Redirecting to log server for container_1498124481939_0018_01_000001
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all.

---Update
Log Type: stderr
Log Upload Time: Thu Jun 29 17:34:10 +0800 2017
Log Length: 966
Exception in thread "main" java.lang.UnsupportedClassVersionError: com/tencent/angel/ps/impl/ParameterServer : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)

Log Type: stdout
Log Upload Time: Thu Jun 29 17:34:10 +0800 2017
Log Length: 890
-XX:+AggressiveOpts -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSScavengeBeforeRemark -XX:InitialHeapSize=1639134000 -XX:MaxDirectMemorySize=1073741824 -XX:MaxHeapSize=4085252096 -XX:MaxNewSize=1633681408 -XX:MaxPermSize=209715200 -XX:NewSize=1633681408 -XX:PermSize=104857600 -XX:+PrintAdaptiveSizePolicy -XX:+PrintCommandLineFlags -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:SurvivorRatio=4 -XX:+UseAdaptiveSizePolicy -XX:+UseCMSCompactAtFullCollection -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedOops -XX:+UseLargePages -XX:+UseParallelGC
Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory (errno = 12).
Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory (errno = 12).
Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory (errno = 12).

This error appeared before, back when JDK 1.7 was in use. What is installed now is java-javac-1.8.0_77.
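
For context, "major.minor version 52.0" is the class-file version produced when compiling for Java 8, and a Java 7 runtime accepts class files only up to version 51. The error therefore indicates that the container was launched with a JVM older than the one the Angel classes target. The following sketch reads the version from a class-file header and applies that rule. The function names are illustrative only.

```python
import struct

CLASS_MAGIC = 0xCAFEBABE  # first four bytes of every Java class file


def class_file_version(header: bytes):
    """Return (major, minor) from the first 8 bytes of a .class file."""
    if len(header) < 8:
        raise ValueError("need at least 8 bytes of class-file header")
    # Layout: u4 magic, u2 minor_version, u2 major_version, all big-endian.
    magic, minor, major = struct.unpack(">IHH", header[:8])
    if magic != CLASS_MAGIC:
        raise ValueError("not a Java class file")
    return major, minor


def required_java_release(major: int) -> int:
    """Map a class-file major version to a Java release (49 -> 5, 51 -> 7, 52 -> 8)."""
    if major < 49:
        raise ValueError("mapping in this sketch only covers Java 5 and later")
    return major - 44


def jvm_can_load(major: int, jvm_release: int) -> bool:
    """True if a JVM of the given release can load a class with this major version."""
    return required_java_release(major) <= jvm_release


if __name__ == "__main__":
    # Header bytes as they would appear in a class compiled for Java 8 (0x34 == 52).
    header = bytes.fromhex("cafebabe00000034")
    major, minor = class_file_version(header)
    print(major, minor, required_java_release(major), jvm_can_load(major, 7))
```

To inspect a real class, pass the first 8 bytes of the .class file extracted from the jar.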

HyperGroups avatar HyperGroups commented on May 14, 2024

There has been some progress on this issue:
17/06/29 18:32:38 INFO yarn.AngelYarnClient: start pss success
17/06/29 18:32:38 INFO ps.PSPartitioner: blockRow = 1, blockCol=5000000
17/06/29 18:32:38 INFO protobuf.ProtobufUtil: Matrix lr_weight partition num: 2
17/06/29 18:32:38 INFO ps.PSPartitioner: blockRow = 1, blockCol=50
17/06/29 18:32:38 INFO protobuf.ProtobufUtil: Matrix lr_loss partition num: 1

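The Unsupported major.minor version 52.0 error I saw earlier was caused by some machines in the system running Java 1.7.
After switching them to 1.8 that error no longer appeared. The focus then became the error from the first post, while local mode ran successfully.
Later, after some further changes, the Unsupported major.minor version 52.0 error showed up again.
So I tried a few things: downloaded the latest JDK 1.8, set JAVA_HOME to /data/java/jdk in the launch script as described in the wiki, and made the configuration consistent across the test machines. After that there was progress.
Also, JAVA_HOME=/usr/java/jdk may cause problems such as the path not being found, but that later went away. After some trial and error, the test example ran successfully once in YARN mode.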

@paynie
@lshmouse
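
Since the fix came down to every test machine running the same JDK 1.8, a quick consistency check is to compare the version string each node reports. The sketch below parses the first line of java -version output into a release number. How that output is collected from each node (for example over ssh) is left out, and the host names in the usage lines are made up.

```python
import re
from typing import Dict, List

# Matches the quoted version in lines such as: java version "1.8.0_77"
_VERSION_RE = re.compile(r'version "(\d+)(?:\.(\d+))?')


def java_release(version_output: str) -> int:
    """Extract the Java release number from `java -version` output.

    Note that `java -version` writes to stderr, so capture that stream.
    """
    match = _VERSION_RE.search(version_output)
    if match is None:
        raise ValueError("no version string found")
    first, second = int(match.group(1)), match.group(2)
    # Java 8 and earlier report "1.<release>" (e.g. "1.8.0_77"); later JDKs report "<release>...".
    if first == 1 and second is not None:
        return int(second)
    return first


def nodes_below(outputs: Dict[str, str], minimum: int = 8) -> List[str]:
    """Return the hosts whose reported Java release is older than `minimum`."""
    return sorted(host for host, text in outputs.items() if java_release(text) < minimum)


if __name__ == "__main__":
    # Hypothetical hosts and captured output, for illustration only.
    collected = {
        "node-a": 'java version "1.8.0_77"',
        "node-b": 'java version "1.7.0_80"',
    }
    print(nodes_below(collected))
```

Any host listed by nodes_below would still start containers with an older JVM unless JAVA_HOME points it at the newer JDK.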

binmu123 avatar binmu123 commented on May 14, 2024

Local mode runs fine on my machine and completes normally, but YARN mode fails with an error:
17/06/30 02:37:56 INFO yarn.AngelYarnClient: ApplicationSubmissionContext Queuename : default
17/06/30 02:37:56 INFO impl.YarnClientImpl: Submitted application application_1498790102869_0001
17/06/30 02:37:59 INFO yarn.AngelYarnClient: appMaster getTrackingUrl = http://b0c5e9cf3648:8088/cluster/app/application_1498790102869_0001/
17/06/30 02:37:59 INFO yarn.AngelYarnClient: master host=172.16.16.20, port=28343
17/06/30 02:37:59 INFO yarn.AngelYarnClient: start to create rpc client to am
17/06/30 02:42:55 ERROR yarn.AngelYarnClient: submit application to yarn failed.
com.google.protobuf.ServiceException: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /172.16.16.20:28343
at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:312)
at com.sun.proxy.$Proxy14.getAllPSLocation(Unknown Source)
at com.tencent.angel.client.AngelClient.waitForAllPS(AngelClient.java:710)
at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:167)
at com.tencent.angel.ml.MLRunner$class.train(MLRunner.scala:46)
at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:28)
at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:40)
at com.tencent.angel.ml.MLRunner$class.submit(MLRunner.scala:90)
at com.tencent.angel.ml.classification.lr.LRRunner.submit(LRRunner.scala:28)
at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:124)
at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:110)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:110)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /172.16.16.20:28343

It gets stuck at "start to create rpc client to am". In local mode there is no problem: the RPC client can connect to the AM and the job runs normally.
My launch command is:
./angel-submit \
--angel.app.submit.class "com.tencent.angel.ml.classification.lr.LRRunner" \
--angel.train.data.path "hdfs://localhost:9000/project/test/lr_data" \
--angel.log.path "hdfs://localhost:9000/project/test/log" \
--angel.save.model.path "hdfs://localhost:9000/project/test/model" \
--action.type train \
--ml.data.type libsvm \
--ml.feature.num 1024 \
--angel.job.name LR_test
My Hadoop is a standalone setup. Any help would be appreciated.
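
The client log above reports the AM at 172.16.16.20:28343, so one thing worth ruling out is whether the submitting machine can open a TCP connection to that address at all. Below is a minimal sketch using only the Python standard library. It checks reachability only and says nothing about whether the AM process itself started correctly. The function name is hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, timeouts and DNS failures.
        return False


if __name__ == "__main__":
    # Use the "master host" and "port" values printed in your own client log;
    # the values below are the ones from the log in this comment.
    print(can_connect("172.16.16.20", 28343))
```

Run it while the job is still waiting at "start to create rpc client to am". A False result points at addressing or networking between the client and the AM container rather than at the Java version.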

lshmouse avatar lshmouse commented on May 14, 2024

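@HyperGroups Check whether the containers of application_1498790102869_0001 failed to start, and post the detailed logs. The client-side output alone is not enough to identify the problem.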

HyperGroups avatar HyperGroups commented on May 14, 2024

After some trial and error, the LRTest example ran successfully once in YARN mode.
I can now test with a larger data set.

binmu123 avatar binmu123 commented on May 14, 2024

@HyperGroups Hi, have you run into a problem like mine? The RPC connection to the AM fails: local mode works, but in YARN mode the RPC connection times out.

HyperGroups avatar HyperGroups commented on May 14, 2024

@binmu123 This thread is about exactly the problem you are seeing. Try setting JAVA_HOME in the launch script.

binmu123 avatar binmu123 commented on May 14, 2024

OK, thanks. I'll give it a try.

binmu123 avatar binmu123 commented on May 14, 2024

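@HyperGroups Hi, my Java version should be fine; it is 1.8.
I just set JAVA_HOME as well, but it still gets stuck at "start to create rpc client to am" and cannot connect.
Could it be a problem with my Hadoop cluster? I only set up a single node for Hadoop and YARN for testing, all in standalone mode, with every service on one machine.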
