The hpc_wiki_docs from ssrheart

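Problem:

gcc is missing on master, and the default gcc differs across node1~6: node2/4 have 4.8.2, while node1/3/5/6 have 4.4.7 (on node2/4, /usr/bin/gcc is a symlink pointing to /opt/rh/...gcc).

Fix
- node2/4 and master have all been fixed; /usr/bin/gcc on all three is now the original system gcc-4.4.
- (The original gcc executable on master was lost, so it was copied with scp from node1:/usr/bin/gcc.)
- g++ behaves the same way. Full tests were run on master and node2/4: the native gcc is 4.4, and after module load it is gcc 4.8, as expected.
Documentation
- Where gcc-4.8 is needed, users should be advised to consistently use module gcc.
- Add a short section introducing gcc to the documentation.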

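Symptom:

All jobs submitted with qsub stay in the Q state, although running qrun #jobid by hand does start a job.

Possible cause:

If this happens after a reboot, the loop_qrun.sh script may not have been started correctly.

Details:

The factory-installed torque cannot correctly pick jobs from the queue and start them automatically, so we use /home/user/test_pbs/chk_gpu/loop_qrun.sh to poll the job queue in a loop and provide this function. However, the command previously added to /etc/rc.d/rc.local to run this script at boot does not seem to execute correctly (or is there a problem with the execution order?).

On 18/11/28 master was rebooted by mistake. loop_qrun.sh started at boot but failed to check the job queue and start the waiting jobs. Killing the script and rerunning it by hand solved the problem.

Suggested approach for similar incidents in the future:

If, after a reboot, all jobs are in the Q state but qrun #jobid can still start them by hand, check whether the script above is running, then start it, or kill and restart it, as needed.

Whenever master is rebooted from now on, check the status of this script.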
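
The check described above could be sketched roughly as follows. This assumes the default Torque qstat column layout (job ID in the first field, state in the fifth); queued_job_ids is a hypothetical helper name, not something installed on the cluster.

```shell
# queued_job_ids: read default `qstat` output on stdin and print the IDs of
# jobs whose state column (5th field, assumed default layout) is Q.
queued_job_ids() {
    awk '$5 == "Q" { print $1 }'
}

# Is the polling script alive? pgrep -f matches against the full command line.
if command -v pgrep >/dev/null 2>&1 && pgrep -f loop_qrun.sh >/dev/null; then
    echo "loop_qrun.sh is running"
else
    echo "loop_qrun.sh is not running; restart it, or qrun the queued jobs by hand"
fi

# Manual fallback on master (left commented out so nothing is started by accident):
# qstat | queued_job_ids | while read -r id; do qrun "$id"; done
```

If the script is reported as running but jobs still sit in Q, killing and restarting it, as was done on 18/11/28, remains the first thing to try.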

Installing tensorflow via anaconda leads to a crash saying "CUDA driver version is insufficient for CUDA runtime version", so I compiled tensorflow on the server, and the compiled tensorflow works well.
The compiled .whl file is placed at /share/package/comiled-tensorflow and should be installed in a python3 (3.6.5) environment.
The dependencies are:

Glibc 2.14. I don't know whether it is strictly required, but following the steps in #18 should help.
cuda 9.0. I've installed cuda9.0 on all computing nodes; you only need to add cuda9.0 to your PATH and LD_LIBRARY_PATH as follows. cuda8.0 entries may come before cuda9.0 in these variables if tensorflow is the only thing that uses cuda9.0.
export PATH=/usr/local/cuda-9.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib:$LD_LIBRARY_PATH
python 3.6. The whl is compiled with python3.
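
When these lines go into a shell startup file, repeated sourcing keeps growing the variables. A minimal sketch of a guard against that is shown here; prepend_path is a hypothetical helper, and the bin subdirectory is assumed to be where the CUDA 9.0 executables live.

```shell
# prepend_path VAR DIR: put DIR at the front of the colon-separated variable
# VAR unless DIR is already listed there. (Hypothetical helper.)
prepend_path() {
    var_name=$1
    dir=$2
    eval "current=\${$var_name:-}"
    case ":$current:" in
        *":$dir:"*) ;;  # already present, leave the variable unchanged
        *) eval "export $var_name=\"\$dir\${current:+:\$current}\"" ;;
    esac
}

prepend_path PATH /usr/local/cuda-9.0/bin
prepend_path LD_LIBRARY_PATH /usr/local/cuda-9.0/lib64
```

Because the helper only prepends a directory that is missing, existing cuda8.0 entries keep their position relative to one another.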

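2018.12 documentation update items

Description of the disk usage plan after the disk replacement. @ Someone
Symptoms and resolution of the disk mounting problem, in case it comes up again after a future replacement. @sonack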

qsub: Server is shutting down...

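Suspected cause:
A previous service pbs_server restart probably did not run to completion. Normally OK is printed once after the stop and once after the start, showing that the service was shut down and then started again.
The server is shutting down message from qsub indicates that pbs_server is in an abnormal state.

Fix: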

kill -9 <pbs_server pid>
service pbs_server start

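caffe: new build mechanism

Update the documentation on the caffe build mechanism

Reason
caffe has to be recompiled every time a layer is added, so it cannot keep being maintained centrally by the administrators.
Strategy
Maintainers no longer compile caffe. Documentation and the necessary third-party libraries are provided; each user creates a caffe directory under their own directory and compiles it themselves, using module load and similar means.
Status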

tmux installed

tmux has just been installed; some tmux documentation is needed to go with it.

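Ganglia: controlling how the per-node graphs at the bottom of the home page are displayed

Symptom

The per-node performance graphs at the bottom of the Ganglia home page can be displayed in several modes: Auto, Scale and None. Auto is the most sensible, but the default is Scale, so the actual usage cannot be read without checking carefully.

Fix

The PHP code for this is defined in /usr/local/apach2/htdocs/ganglia/templates/default/cluster_view.tpl; the original file has been backed up as clust_view.tpl.backup. The Auto mode is currently hard-coded and the other modes cannot be selected. Recorded here in case it is needed later.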

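py36tensorflow environment installation problem

Installing py36tensorflow

tensorflow-gpu==1.3.0 does not work: it requires cudnn6, while the current development stack uses cudnn5.
tensorflow-gpu==1.2.1 fails with an error requiring GLIBC 2.16 or later. Following an external guide, GLIBC 2.17 was compiled and installed under the home directory, but running with tfpython produced a Segmentation Fault. Following a comment, anaconda python 3.6 was run with glibc 2.17 and also segfaulted; a search suggested this is a glibc bug that was only fixed in 2.20. Switching to glibc 2.21 worked: with GLIBC 2.21 installed, the GLIBC-related errors disappeared. However, import tensorflow still failed because libcusolver.so.8.0 could not be found. After /usr/local/cuda/lib64/ was added to the LD_LIBRARY_PATH environment variable, the error changed to libcuda.so.1 not being found. Unresolved.

Note:

When configuring GLIBC 2.21, configure complained that some tools were too old. The requirements in INSTALL showed that the gcc version was the likely culprit; running module load gcc/4.8.2 to get a newer gcc and then rerunning configure fixed it.
When configuring GLIBC 2.21, configure also complained that LD_LIBRARY_PATH shouldn't contain the current directory. Following an external reference, running unset LD_LIBRARY_PATH in the current directory fixed it.
LD_LIBRARY_PATH should already contain the cuda lib64 directory, so why libcusolver.so.8.0 was reported missing still needs to be looked into.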

node2: GPU 0 appears to drop offline after running for a long time

node4 cuda error

illegal address

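When reading the new disk, the directory structure seen by submitted jobs differs from what a shell on master shows

Symptom

Reading a model from the new disk raises a FileNotFound error; a folder that a program creates on the new disk is created successfully but then cannot be found.

Cause

The new disk is mounted at /share/data only on the master node, while the compute nodes still have only the original master:/share/ mount. /share/data on the compute nodes therefore still points to the old disk, so the space that submitted jobs read and write differs from what is seen on master.

Temporary workaround

On the master node, in a bash shell, run pdsh -w node[1-6] mount master:/share/data /share/data to manually mount master's /share/data at /share/data on compute nodes 1-6. This configuration is lost when the HPC is rebooted, so a final solution still requires contacting the vendor to confirm how automatic mounting of the disk should be configured.
Use df -h to confirm that the mount succeeded; on a compute node, /share/data should then show master:/share/data as its filesystem.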
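
The df check can also be scripted. The sketch below is one possible way to do it, assuming POSIX-format df -P output (filesystem in the first column, mount point in the last); mount_source is a hypothetical helper name.

```shell
# mount_source MOUNTPOINT: read `df -P` output on stdin and print the
# filesystem (first column) mounted at MOUNTPOINT; print nothing if absent.
mount_source() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { print $1 }'
}

# Usage on a single compute node:
#   df -P | mount_source /share/data
# A result of master:/share/data means the new disk is visible there.
# Across all nodes, from master:
#   pdsh -w node[1-6] df -P /share/data
```

An empty result means /share/data is not a separate mount on that node, which matches the faulty state described above.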

`Version 'GLIBC_2.14' not found ` Solution

Contributed by @sonack and @CastellanLiu

1. cp -R /share/package/glibc/glibc-2.14 $HOME/glibc-2.14-src
2. cd $HOME/glibc-2.14-src
3. mkdir build && cd build
4. ../configure --prefix=$HOME/glibc-2.14
5. make -j4 && make install
6. After installation, set the following, where /path_to_glibc is the install prefix from step 4 ($HOME/glibc-2.14):
export LD_LIBRARY_PATH=/path_to_glibc/lib:$LD_LIBRARY_PATH
export LD_PRELOAD=/path_to_glibc/lib/libc.so.6:$LD_PRELOAD
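
Before building, it may be worth checking whether a node's system glibc is already new enough. The following is a rough sketch, assuming GNU sort (for -V version ordering) and a glibc-based system where getconf GNU_LIBC_VERSION is available; version_ge is a hypothetical helper.

```shell
# version_ge A B: succeed when dotted version A is greater than or equal to B.
# Relies on `sort -V` (GNU coreutils version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# On glibc systems, `getconf GNU_LIBC_VERSION` prints the name and version,
# for example "glibc 2.12"; the second field is the version number.
system_glibc=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{ print $2 }')
if [ -n "$system_glibc" ] && version_ge "$system_glibc" 2.14; then
    echo "system glibc $system_glibc already provides GLIBC_2.14"
else
    echo "system glibc (${system_glibc:-unknown}) does not appear to provide GLIBC_2.14"
fi
```

A plain string comparison would rank 2.9 above 2.14, which is why version ordering is used here.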

Restarting ganglia

Master

service gmond restart   # data collection
service gmetad restart   # data display

Node1-6

service gmond restart

Todo List

pycaffe
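Investigate why node2 and node4 went down. Found to be a network card problem.
tensorflow + horovod installation and testing
Alerting (+ email?): try ganglia-monitor
Thread limits?
torch errors. Investigation showed that the path in the torch modulefile being used was wrong, causing a severe mismatch between walltime and runtime, with runtime stuck at 1s.
qhold has no effect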

Question collection thread

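The difference between the multi-node and single-node versions still needs to be explained clearly
The information reported by lscpu was found to be inconsistent with Intel's information for this CPU
The SSH connection time problem needs to be fully understood [Resolved: using tmux is recommended; tmux documentation has been added]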

Matlab multithreading problem

MATLAB tends to create as many computing threads as there are CPU cores, which puts a heavy load on the whole node and often makes every task on that node extremely slow. Some candidate solutions are listed below; further tests are still needed.

maxNumCompThreads(num)
As the official MATLAB documentation says, "Setting the maximum number of computational threads using maxNumCompThreads does not propagate to your next MATLAB session", so it has to be set in every session.
setenv 'OMP_NUM_THREADS' num | setenv OMP_DYNAMIC FALSE
To be tested; refer to the OpenMP environment variables documentation.
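
Since the maxNumCompThreads setting has to be repeated in every session, one option is to bake it into the launch command of a job script. The sketch below only prints such a command so it can be reviewed first. matlab_capped_cmd and my_script are hypothetical names, -nodisplay, -nosplash and -r are standard MATLAB startup options, and whether OMP_NUM_THREADS has any effect on MATLAB is still untested, as noted above.

```shell
# matlab_capped_cmd THREADS SCRIPT: print (do not run) a MATLAB command line
# that caps computational threads for this one session.
matlab_capped_cmd() {
    threads=$1
    script=$2
    printf 'OMP_NUM_THREADS=%s matlab -nodisplay -nosplash -r "maxNumCompThreads(%s); %s; exit"\n' \
        "$threads" "$threads" "$script"
}

# Example: cap a run of the (hypothetical) my_script.m at 4 threads.
matlab_capped_cmd 4 my_script
```

Pasting the printed line into a qsub job script would apply the cap to that job only, leaving other users' sessions untouched.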
