HPC cluster wiki docs
gcc is missing on master, and the default gcc differs across node1~6: nodes 2/4 have 4.8.2, nodes 1/3/5/6 have 4.4.7 (on node2/4, /usr/bin/gcc is a symlink pointing to /opt/rh/...gcc)
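All jobs submitted with qsub stay in the Q state, but running qrun #jobid manually starts them.
If this happens after a reboot, the loop_qrun.sh script may not have been started correctly.
The factory-installed torque does not correctly pick jobs from the queue and start them automatically, so we use /home/user/test_pbs/chk_gpu/loop_qrun.sh to check the job queue in a loop and provide this function. However, the command previously added to /etc/rc.d/rc.local to run this script at boot does not seem to have executed correctly (or is there a problem with the execution order?).
On 18/11/28 the master was rebooted by accident. loop_qrun.sh started at boot but did not correctly check the job queue and start the waiting jobs. Killing the script and rerunning it manually solved the problem.
If, after a reboot, all jobs are in the Q state but qrun #jobid can start them manually, check the execution status of this script and start it, or kill and restart it, as needed.
Whenever the master is rebooted in the future, check the status of this script.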
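The contents of loop_qrun.sh are not recorded here, so the following is only a hypothetical sketch of the idea described above: list the jobs that are still in the Q state and pass each one to qrun. The function names are invented for illustration, and the parsing assumes the default qstat table layout, where the job state is the second-to-last column.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the actual loop_qrun.sh.
# Assumes the default `qstat` layout: Job ID, Name, User, Time Use, S, Queue.

# Read `qstat` output on stdin and print the IDs of jobs whose state is Q.
queued_job_ids() {
    awk 'NF >= 6 && $(NF-1) == "Q" { print $1 }'
}

# One pass over the queue: try to start every queued job with qrun.
# Defined only; nothing is executed until this function is called.
start_queued_once() {
    qstat | queued_job_ids | while read -r job_id; do
        qrun "$job_id" || echo "qrun failed for $job_id" >&2
    done
}
```

Calling `start_queued_once` from a loop with a `sleep` between passes would approximate the polling behaviour; whether that matches the real script is an assumption.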
If tensorflow is installed via anaconda, it crashes with the message "CUDA driver version is insufficient for CUDA runtime version", so I compiled tensorflow on the server, and the compiled build works well.
The compiled .whl file is placed at /share/package/comiled-tensorflow and should be installed in a Python 3 (3.6.5) environment.
The dependencies are:
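Suspected cause:
The earlier service pbs_server restart probably did not run to completion. Normally OK appears once after stop and once after start, showing that the service was stopped and then started.
The server is shutting down error reported by qsub indicates that pbs_server is in an abnormal state.
Solution: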
kill -9 <pbs_server pid>
service pbs_server start
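Since a complete restart is expected to print OK once for stop and once for start, the output of the restart command can be checked mechanically. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it only counts output lines containing the string OK, without relying on the exact wording of the init script messages.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads command output on stdin and succeeds only
# if at least two lines contain "OK" (one for stop, one for start).
restart_completed() {
    [ "$(grep -c 'OK')" -ge 2 ]
}

# Possible usage on the master (commented out so that sourcing this
# file does not touch the service):
#   service pbs_server restart 2>&1 | restart_completed \
#       || echo "pbs_server restart looks incomplete" >&2
```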
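Reason
Caffe has to be recompiled every time a layer is added, so it cannot keep being maintained centrally by the administrators.
Strategy
Maintainers no longer compile it. Documentation and the necessary third-party libraries are provided; each user creates a caffe directory under their own home directory and compiles it themselves, using module load and similar means.
Status
tmux has just been installed; some tmux documentation is needed to support it.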
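The per-node performance charts at the bottom of the Ganglia home page can be displayed in several modes: Auto, Scale, and None. Auto is the most sensible, but the default is Scale, so without checking carefully it is hard to tell the actual usage.
The PHP code for this is defined in /usr/local/apach2/htdocs/ganglia/templates/default/cluster_view.tpl, and the original file has been backed up as clust_view.tpl.backup. The Auto mode is currently hard-coded, so other modes cannot be selected. Recorded here in case it is needed later.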
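py36 tensorflow installation
Running anaconda python 3.6 with glibc 2.17 produced a segmentation fault; this is reportedly a glibc bug that was not fixed until 2.20, and switching to glibc 2.21 worked.
After installing GLIBC 2.21 the GLIBC-related errors disappeared, but import tensorflow still fails because libcusolver.so.8.0 cannot be found. After adding /usr/local/cuda/lib64/ to the LD_LIBRARY_PATH environment variable, the error changes to libcuda.so.1 not being found. Unresolved.
Note:
Run module load gcc/4.8.2 to switch to a newer gcc, then run configure again.
For the error LD_LIBRARY_PATH shouldn't contain the current directory, see here; running unset LD_LIBRARY_PATH in the current directory fixes it.
illegal address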
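When reading a model from the new disk, a FileNotFound error occurs; a folder that the program creates on the new disk is created successfully but then cannot be found.
The new disk is mounted at /share/data only on the master node, while the compute nodes still have only the original master:/share/ mount. /share/data on the compute nodes therefore still points to the original disk, so the storage a submitted job reads and writes differs from what is visible on master.
On the master node, in a bash shell, run pdsh -w node[1-6] mount master:/share/data /share/data to manually mount master's /share/data on /share/data of compute nodes 1-6. This configuration is lost when the HPC reboots, so the final fix still requires contacting the vendor to confirm how to configure automatic mounting of the disk.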
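To confirm on every node which filesystem actually backs /share/data after the manual mount, the df output can be filtered down to the source of that mount point. This is a minimal sketch under stated assumptions: the function name is hypothetical, it expects the single-line-per-filesystem format of `df -P`, and it tolerates the `nodeN:` prefix that pdsh puts in front of each output line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin (optionally prefixed
# by pdsh with "nodeN: ") and print the filesystem mounted exactly at $1.
# Prints nothing for a node where $1 is not a mount point of its own.
mount_source() {
    awk -v mp="$1" '$NF == mp {
        src = $(NF-5)                       # filesystem column of df -P
        if (NF == 7) print $1, src          # keep the pdsh node prefix
        else         print src
    }'
}

# Possible usage from master (commented out; needs the cluster):
#   pdsh -w node[1-6] df -P /share/data | mount_source /share/data
```

A node that still only has master:/share/ mounted reports /share as the mount point for that path, so it produces no line at all, which makes the missing mount easy to spot.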
Run df -h to check whether the mount succeeded; the following output indicates a normal state:
Contributed by @sonack and @CastellanLiu
1. cp -R /share/package/glibc/glibc-2.14 $HOME/glibc-2.14-src
2. cd $HOME/glibc-2.14-src
3. mkdir build && cd build
4. ../configure --prefix=$HOME/glibc-2.14
5. make -j4 && make install
6. [After installation]
export LD_LIBRARY_PATH=/path_to_glibc/lib:$LD_LIBRARY_PATH
export LD_PRELOAD=/path_to_glibc/lib/libc.so.6:$LD_PRELOAD
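One detail of the export lines in step 6: when LD_LIBRARY_PATH is empty beforehand, appending `:$LD_LIBRARY_PATH` leaves a trailing empty entry, and the dynamic loader treats an empty entry as the current directory. That is one way to run into the "LD_LIBRARY_PATH shouldn't contain the current directory" message when configuring glibc. A minimal sketch of a prepend helper that avoids the empty entry (the function name is hypothetical, and /path_to_glibc stays the placeholder used above):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "$1" prepended to the path list "$2",
# adding the ":" separator only when "$2" is non-empty.
prepend_path() {
    printf '%s%s\n' "$1" "${2:+:$2}"
}

# Possible usage (commented out; /path_to_glibc is a placeholder):
#   export LD_LIBRARY_PATH="$(prepend_path /path_to_glibc/lib "$LD_LIBRARY_PATH")"
```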
src
service gmond restart # data collection
service gmetad restart # data display
service gmond restart
MATLAB tends to create as many computational threads as there are CPU cores, which puts a heavy load on the whole node and often makes every task on that node extremely slow. Some candidate solutions are listed below; they still need further testing.
Setting the maximum number of computational threads using maxNumCompThreads does not propagate to your next MATLAB session.
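Because the maxNumCompThreads setting has to be applied again in every session, one candidate is to launch MATLAB through a small wrapper that sets it on the command line each time. The sketch below has not been tested on this cluster; the function name, the MATLAB_BIN override and the script name my_script are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a MATLAB script with a capped thread count.
#   $1 = script name without the .m extension
#   $2 = maximum number of computational threads (default 1)
# MATLAB_BIN can point at a specific matlab executable; defaults to "matlab".
run_matlab_limited() {
    "${MATLAB_BIN:-matlab}" -nodisplay -nosplash \
        -r "maxNumCompThreads(${2:-1}); $1; exit"
}

# Possible usage inside a job script (commented out; needs MATLAB):
#   run_matlab_limited my_script 4
```

For the single-thread case, MATLAB's `-singleCompThread` startup option is another candidate that avoids calling maxNumCompThreads at all.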