
alan-cluster's Introduction

Documentation and guidelines for the Alan GPU cluster at the University of Liège.

The documentation assumes you have access to the private network of the university.

General actions




User account setup

If you do not have an account yet, first request access to the GPU cluster.

Connecting to Alan

Once you have been provided with your account details by e-mail, you can connect to Alan through SSH:

you@local:~ $ ssh you@master.alan.priv

After logging in with the password provided by the account confirmation e-mail, you will be forced to change the password.

The e-mail additionally contains a private authentication key which can be used to connect to the GPU cluster, for instance by manually executing:

you@local:~ $ ssh -i path/to/privatekey you@master.alan.priv

The authentication procedure can be automated by moving the private key

you@local:~ $ cp path/to/privatekey ~/.ssh/id_alan
you@local:~ $ chmod 400 ~/.ssh/id_alan

and adding

Host alan
  HostName master.alan.priv
  User you
  IdentityFile ~/.ssh/id_alan

to ~/.ssh/config.
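Afterwards, connecting is as simple as:

you@local:~ $ ssh alan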

Preparing a conda environment

On your initial login, we will guide you through the automatic installation of a conda environment. Carefully read the instructions. If you cancelled the installation procedure, you can still set up conda by executing:

you@master:~ $ wget https://repo.anaconda.com/archive/Anaconda3-2023.07-1-Linux-x86_64.sh
you@master:~ $ sh Anaconda3-2023.07-1-Linux-x86_64.sh

Alternatively, for a lightweight drop-in replacement for conda, you can install micromamba by executing:

you@master:~ $ bash <(curl -L micro.mamba.pm/install.sh)

Preparing your (Deep Learning) project

The installation of your Deep Learning environment is quite straightforward once conda has been configured. In general, we recommend working with environments on a per-project basis, as this allows for better encapsulation and reproducibility of your experiments.

you@master:~ $ conda create -n myenv python=3.9 -c conda-forge
you@master:~ $ conda activate myenv
(myenv) you@master:~ $ python --version
Python 3.9.13

PyTorch

(myenv) you@master:~ $ conda install pytorch torchvision -c pytorch -c nvidia

Jax

(myenv) you@master:~ $ pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

TensorFlow

(myenv) you@master:~ $ conda install tensorflow-gpu -c conda-forge
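Whichever framework you use, it is a good idea to verify that it can actually see a GPU. A quick sanity check, run from within a Slurm job (see Cluster usage below) since GPU work is not meant to happen on the login node, could look like this for the PyTorch environment above:

(myenv) you@master:~ $ srun --gres=gpu:1 python -c "import torch; print(torch.cuda.is_available())"
True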

Transferring your datasets to Alan

This section shows you how to transfer your datasets to the GPU cluster. It is a good practice to centralize your datasets in a common folder:

you@master:~ $ mkdir -p path/to/datasets
you@master:~ $ cd path/to/datasets

The transfer is initiated using scp from the machine storing the data (e.g. your desktop computer) to the cluster:

you@local:~ $ scp -r my_dataset alan:path/to/datasets

Alternatively, one can rely on rsync:

you@local:~ $ rsync -rv --progress -e ssh my_dataset alan:path/to/datasets

Cluster usage

The CECI cluster documentation features a thorough Slurm guide. Read it carefully before using Alan.

Elementary tutorials can also be found in /tutorials. Read them to get started quickly.

Slurm commands

  • sbatch: submit a job to the cluster.
    • To reserve N GPU(s) add --gres=gpu:N to sbatch.
  • scancel: cancel queued or running jobs.
  • srun: launch a job step.
  • squeue: display jobs currently in the queue and their associated metadata.
  • sacct: display accounting data for jobs (including finished/cancelled jobs).
  • sinfo: get information about the cluster and its nodes.
  • seff: display the resource utilization efficiency of the specified job.

Partitions

The cluster provides several queues or job partitions. We made the design decision to partition the job queues based on the GPU type: 1080ti (GTX 1080 Ti), 2080ti (RTX 2080 Ti), quadro (Quadro RTX 6000) and tesla (Tesla V100). This enables users to request a specific GPU type depending on their needs. A partition can be requested by passing --partition=<partition> to the sbatch command or by setting it in your submission script. If no partition is specified, the job will be scheduled wherever resources are available.

For debugging purposes, e.g. if you would like to quickly test your script, you can also make use of the debug partition by specifying --partition=debug. This partition has a maximum execution time of 15 minutes.
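Putting these options together, a minimal GPU submission script could look like the following sketch (the job name, resources, environment name and script path are placeholders to adapt to your own project):

#!/bin/bash
#SBATCH --job-name=myjob              # placeholder job name
#SBATCH --output=myjob_%j.txt         # %j is replaced by the job ID
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=4096            # in MB
#SBATCH --gres=gpu:1                  # reserve one GPU
#SBATCH --partition=2080ti            # or 1080ti, quadro, tesla, debug, ...
#SBATCH --time=1-00:00:00             # d-hh:mm:ss

# Activate the conda environment created earlier (adjust the path if conda
# is installed elsewhere).
source ~/anaconda3/etc/profile.d/conda.sh
conda activate myenv

python train.py                       # placeholder training script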

A full overview of the available partitions is shown below.

you@master:~ $ sinfo -s
PARTITION       AVAIL  TIMELIMIT   NODELIST
all*               up 14-00:00:0   compute-[01-04,06-13]
debug              up      15:00   compute-05
1080ti             up 14-00:00:0   compute-[01-04]
2080ti             up 14-00:00:0   compute-[06-10]
quadro             up 14-00:00:0   compute-[11-12]
tesla              up 14-00:00:0   compute-13
priority-quadro    up 14-00:00:0   compute-[11-12]
priority-tesla     up 14-00:00:0   compute-13
priority           up 14-00:00:0   compute-[01-04,06-13]

The high-priority partitions priority-quadro and priority-tesla can be used to request either Quadro RTX 6000 or Tesla V100 GPUs while flagging your job as high priority in the job queue. This privilege is only available to some users. The quadro and tesla partitions can be requested by all users, but the priority of the corresponding jobs will be kept as normal.

Your priority status can be obtained by executing:

you@master:~ $ sacctmgr show assoc | grep $USER | grep priority > /dev/null && echo "allowed" || echo "not allowed"

Filesystems

We provide the following filesystems to the user.

  • /home/$USER (Home directory, 11 TB): hosts your main files and binaries. Load data to the GPU from this filesystem only if the dataset fits in memory; do not use this endpoint if your jobs perform a lot of random I/O. Data persistence: yes.
  • /scratch/users/$USER (Global scratch directory, 65 TB): global decentralized filesystem. Store your datasets here if they do not fit in memory, or if they consist of a lot of small files. Loading data to the GPU from this filesystem is fine. Data persistence: no (see below).

Data persistence is only guaranteed on /home/$USER. Backing up data hosted on /scratch is your responsibility. The results of a computation should preferably be stored in /home/$USER.

Recommended ways to load data into the GPU

It is generally not recommended to load small batches from the main storage disk. This translates into a lot of random IO operations on the main storage hard disks of the cluster, which in turn degrades the performance of all jobs. We recommend the following ways to load data into the GPU:

My dataset does not fit in memory

Use the global /scratch filesystem.
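For instance (the dataset name and paths below are placeholders), create a personal folder on the scratch filesystem and copy your dataset there:

you@master:~ $ mkdir -p /scratch/users/$USER/datasets
you@master:~ $ rsync -av path/to/datasets/my_dataset /scratch/users/$USER/datasets/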

My dataset fits in memory

In this case, we recommend simply reading the dataset into memory and loading your batches directly from RAM. This will not cause any issues, as the data is read sequentially from the main RAID array on the master. This has the desirable effect that the heads of the hard disks do not have to move around constantly for every (random) small batch you are trying to load, so the performance of the main cluster storage is not degraded.

Cluster-wide datasets

At the moment we provide the following cluster-wide, read-only datasets which are accessible at /scratch/datasets:

you@master:~ $ ls -l /scratch/datasets

If you would like to propose a new cluster-wide dataset, feel free to submit a proposal.

Centralised Jupyter Access

We provide a centralised Jupyter instance which can be accessed using your Alan account at https://alan.montefiore.uliege.be/jupyter. Launching kernels within existing environments is possible. No additional configuration is required.

Please note that, in order to use existing environments, Anaconda should be installed (compact installations such as miniconda will not work).

Important: Make sure nb_conda_kernels is installed in your base environment if you want your Anaconda environments to show up in JupyterHub. In addition, ensure ipykernel is installed in your environments of interest.
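For example, assuming the environment created earlier is called myenv:

you@master:~ $ conda install -n base nb_conda_kernels
you@master:~ $ conda install -n myenv ipykernel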

Is it possible to access Jupyter Lab?

Yes, simply change the tree keyword in the URL to lab. For instance

https://alan.montefiore.uliege.be/jupyter/user/you/tree

becomes

https://alan.montefiore.uliege.be/jupyter/user/you/lab

Launching multiple servers

We allow you to have more than one running server at the same time. To add a new server, navigate to the control panel.

Via Jupyter Lab

In the top left corner: File -> Hub Control Panel

Via Jupyter Notebook

In the top right corner: Control Panel


alan-cluster's Issues

[New User] Gilles Orban

Are you a temporary user of the cluster?
Adaptive optics simulations, machine learning for wavefront sensing, high contrast imaging performances.
Until: not defined.

Supervisor
Olivier Absil

Institutional email
[email protected]

Additional context
Use of the cluster would be mostly within the framework of EPIC and METIS.

[New User] Maxime Vandegar

Are you a temporary user of the cluster?
I am doing my master thesis, which is focused on deep learning. In that context, I would like to use the cluster to train and test my algorithms. For now, I am doing a literature review and would like to run some code from papers to get a better understanding.

Until: End June 2020

Supervisor
Pr. Gilles Louppe

Institutional email
[email protected]

Additional context
/

[Feature Request] (Tiny) ImageNet

I have copied the TinyImageNet and ImageNet datasets to my personal scratch folder (/scratch/users/jmbegon/). Since they are quite large, it might be of interest to move them to the general dataset scratch folder.

[New User] Louis Nelissen

Are you a temporary user of the cluster?
Until: October

Supervisor
Gilles Louppe

Institutional email
[email protected]

Additional context
I am a student doing an internship under Prof. Louppe (in regards to the research of Matthia Sabatelli in Deep Reinforcement Learning Algorithms).

[New User] Michael Fonder

Are you a temporary user of the cluster?
Kind of... I need access to the cluster for the remaining time of my PhD thesis.
Until: End of PhD

Supervisor
Marc Van Droogenbroeck

Institutional email
removed for privacy purpose

Additional context
/

[New User] Robert Baudinet

Are you a temporary user of the cluster?
No. I work with Pierre Barnabé for the GeMMe laboratory. We aim to classify scraps using multi-sensor images.

Until: not defined.

Supervisor
Eric Pirard
Additional context

[New User] Maxime Noirhomme

Are you a temporary user of the cluster?
I need access to a GPU for my master thesis.
Until: end of June.

Supervisor
Pierre Geurts and Matthia Sabatelli

Additional context
None

[New User] Pierre-Loup Nicolas

Hello,
I would like to get access to the cluster in order to work on my master thesis (my main supervisor is Pierre Geurts), until September 2020.

Supervisor
Pierre Geurts and Raphaël Marée.

Institutional email
[email protected]

Best regards,
Pierre-Loup

[Issue] Issue with gpus on `alan-compute-07`

Describe the issue
I have recently noticed a lot of failing jobs for an experiment I am currently running. I am 100% sure it doesn't come from my script because I have re-scheduled some of the failing jobs and most of them have completed successfully. I have tried several things (reducing batch size in case of OOM, double checking device allocation) before finding that all failing jobs have actually failed on node alan-compute-07 while jobs running on other nodes have all completed successfully.

The jobs fail on this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "[...]python3.6/site-packages/clustertools/environment.py", line 50, in __call__
    self.deserialize_and_run(serialized)
  File "[...]python3.6/site-packages/clustertools/environment.py", line 47, in deserialize_and_run
    lazy_computation()
  File "[...]python3.6/site-packages/clustertools/experiment.py", line 135, in __call__
    self.run(self.result, **actual_parameters)
  File "[...]eval_mh_features_svm_computation.py", line 110, in run
    score = eval_features(net, train_loader, eval_loader, scorer, device, n_jobs=self.n_jobs)
  File "[...]eval_mh_features_svm_computation.py", line 49, in eval_features
    x_train, y_train, groups_train = get_features(network, train_loader, device)
  File "[...]eval_mh_features_svm_computation.py", line 40, in get_features
    features.append(network(x).detach().cpu().numpy())
  File "[...]python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "[...]network.py", line 26, in forward
    return self.pool(self.features(x))
  File "[...]python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "[...]models.py", line 33, in forward
    x = self.layer1(x)
  File "[...]python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "[...]python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "[...]python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "[...]python3.6/site-packages/torchvision/models/resnet.py", line 83, in forward
    out = self.conv2(out)
  File "[...]python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "[...]python3.6/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Additional context

$ sacct -u rmormont --format="JobName%40,Nodelist,State,Elapsed,Start" -s F
[...]
Computati+ alan-compute-07     FAILED   00:01:53 2019-05-21T03:27:35
Computati+ alan-compute-07     FAILED   00:01:45 2019-05-21T03:28:49
Computati+ alan-compute-07     FAILED   00:01:58 2019-05-21T03:29:17
Computati+ alan-compute-07     FAILED   00:01:59 2019-05-21T03:29:28
Computati+ alan-compute-07     FAILED   00:02:03 2019-05-21T03:30:34
Computati+ alan-compute-07     FAILED   00:02:04 2019-05-21T03:31:15
Computati+ alan-compute-07     FAILED   00:02:04 2019-05-21T03:31:27
Computati+ alan-compute-07     FAILED   00:01:46 2019-05-21T03:32:37
Computati+ alan-compute-07     FAILED   00:01:47 2019-05-21T03:33:19
Computati+ alan-compute-07     FAILED   00:01:57 2019-05-21T03:33:31
Computati+ alan-compute-07     FAILED   00:01:53 2019-05-21T03:34:23
Computati+ alan-compute-07     FAILED   00:02:04 2019-05-21T03:35:06
Computati+ alan-compute-07     FAILED   00:01:48 2019-05-21T03:35:28
Computati+ alan-compute-07     FAILED   00:01:50 2019-05-21T03:36:16
Computati+ alan-compute-07     FAILED   00:01:49 2019-05-21T03:37:10
Computati+ alan-compute-07     FAILED   00:02:02 2019-05-21T03:37:16
Computati+ alan-compute-07     FAILED   00:01:56 2019-05-21T03:38:06
Computati+ alan-compute-07     FAILED   00:02:04 2019-05-21T03:38:59
Computati+ alan-compute-07     FAILED   00:01:55 2019-05-21T03:39:18
Computati+ alan-compute-07     FAILED   00:01:36 2019-05-21T13:49:00
Computati+ alan-compute-07     FAILED   00:01:34 2019-05-21T13:50:36
Computati+ alan-compute-07     FAILED   00:01:34 2019-05-21T13:52:10
Computati+ alan-compute-07     FAILED   00:01:35 2019-05-21T14:00:37
Computati+ alan-compute-07     FAILED   00:01:35 2019-05-21T14:02:12
Computati+ alan-compute-07     FAILED   00:01:33 2019-05-21T14:03:47
Computati+ alan-compute-07     FAILED   00:01:33 2019-05-21T14:05:20
Computati+ alan-compute-07     FAILED   00:01:36 2019-05-21T14:21:15
Computati+ alan-compute-07     FAILED   00:01:32 2019-05-21T14:24:51
Computati+ alan-compute-07     FAILED   00:01:35 2019-05-21T14:26:23
Computati+ alan-compute-07     FAILED   00:01:38 2019-05-21T14:27:58
Computati+ alan-compute-07     FAILED   00:01:40 2019-05-21T14:28:42
Computati+ alan-compute-07     FAILED   00:01:34 2019-05-21T14:29:36
Computati+ alan-compute-07     FAILED   00:01:45 2019-05-21T14:30:22
Computati+ alan-compute-07     FAILED   00:01:41 2019-05-21T14:31:10
Computati+ alan-compute-07     FAILED   00:01:39 2019-05-21T14:32:07
Computati+ alan-compute-07     FAILED   00:01:46 2019-05-21T14:32:51
$ sacct -u rmormont --format="JobName,Nodelist,State,Elapsed,Start" | grep "alan-compute-0\(1\|2\|3\|4\|5\|6\)"
[...]
Computati+ alan-compute-01    TIMEOUT   00:25:28 2019-05-21T03:13:17
Computati+ alan-compute-05  COMPLETED   00:07:19 2019-05-21T03:13:47
Computati+ alan-compute-05  COMPLETED   00:06:59 2019-05-21T03:13:47
Computati+ alan-compute-03  COMPLETED   00:06:18 2019-05-21T03:14:28
Computati+ alan-compute-01  COMPLETED   00:12:52 2019-05-21T03:14:38
Computati+ alan-compute-05  COMPLETED   00:13:08 2019-05-21T03:20:46
Computati+ alan-compute-03  COMPLETED   00:12:07 2019-05-21T03:20:46
Computati+ alan-compute-05  COMPLETED   00:13:18 2019-05-21T03:21:07
Computati+ alan-compute-02  COMPLETED   00:09:56 2019-05-21T03:23:24
Computati+ alan-compute-02  COMPLETED   00:16:37 2019-05-21T03:26:32
Computati+ alan-compute-01  COMPLETED   00:07:45 2019-05-21T03:27:30
Computati+ alan-compute-05  COMPLETED   00:07:30 2019-05-21T03:28:19
Computati+ alan-compute-03  COMPLETED   00:06:44 2019-05-21T03:32:53
Computati+ alan-compute-02  COMPLETED   00:11:11 2019-05-21T03:33:20
Computati+ alan-compute-05  COMPLETED   00:13:36 2019-05-21T03:33:54
Computati+ alan-compute-05  COMPLETED   00:13:42 2019-05-21T03:34:25
Computati+ alan-compute-01  COMPLETED   00:13:37 2019-05-21T03:35:15
Computati+ alan-compute-05  COMPLETED   00:07:27 2019-05-21T03:35:49
Computati+ alan-compute-03  COMPLETED   00:06:39 2019-05-21T03:35:58
Computati+ alan-compute-01  COMPLETED   00:13:16 2019-05-21T03:38:49
Computati+ alan-compute-03  CANCELLED   00:02:03 2019-05-21T13:41:15
Computati+ alan-compute-03  CANCELLED   00:02:03 2019-05-21T13:41:15
Computati+ alan-compute-03  COMPLETED   00:14:24 2019-05-21T13:49:00
Computati+ alan-compute-03  COMPLETED   00:20:03 2019-05-21T13:49:00
Computati+ alan-compute-03  COMPLETED   00:14:03 2019-05-21T14:03:24
Computati+ alan-compute-04  COMPLETED   00:20:13 2019-05-21T14:21:15
Computati+ alan-compute-03  COMPLETED   00:15:08 2019-05-21T14:21:15
Computati+ alan-compute-03  COMPLETED   00:20:50 2019-05-21T14:21:15
Computati+ alan-compute-03  COMPLETED   00:20:12 2019-05-21T14:21:15
Computati+ alan-compute-01  COMPLETED   00:14:48 2019-05-21T14:26:20

[Issue] Cannot add dataset

I wanted to place a dataset (actually, one provided by PyTorch) on the cluster, but the folder /data/datasets/ is not writable.

So far I have placed my datasets in a local folder. As far as I'm concerned, this can stay like that, but I guess that if each user has their own copy of ImageNet, we will soon run into storage issues.

GPU allocation

Describe the issue
I have written the following script to submit a job to two GTX 2080 Ti GPUs. I get this error:

1020391 gpu2080ti pythonjo rnath PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

Can you check if the script is ok? The files are all in order and the python code is fine.
Screenshots
#!/bin/bash
#SBATCH --job-name=pythonjob
#SBATCH --time=00:10:00 # hh:mm:ss
#SBATCH --output=output_val.txt
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mem-per-cpu=10240 # 10GB
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=ALL
#SBATCH --partition=gpu2080ti
#SBATCH --comment=DeepLSpectra
python /home/rnath/Code/DeepNet.py

[New User] Pierre Barnabe

Are you a temporary user of the cluster?
Problem: building models for metallic scraps sorting (Project Reverse Metallurgy) on a multi-sensor prototype
Until: not defined.

Supervisor
Eric Pirard

Institutional email
[email protected]

Additional context

[New User] Arnaud Vallot

Are you a temporary user of the cluster?
Master thesis: Active learning in industrial control.
Specifically, I need to perform some tests in order to finish the thesis.
Until: 8 June

Supervisor
Pierre Geurts

Institutional email
It is simply arnaud.vallot at student.uliege.be
Sorry for the peculiar format, I already had issues with bot crawlers on github, and I don't want to publicly display my e-mail.

Unable to login to alan

Hello,
I am unable to log in to Alan. It throws the following error.

(screenshots of the error attached)

Please let me know how I can proceed.

[Feature Request] Centralized TensorBoard storage

Is your feature request related to a problem? Please describe.
It is not really a problem, but I'm trying to make it easier to share with colleagues the progress of model training in tensorboard.
At the moment, I'm regularly starting tensorboard in my SSH connection and I access it locally (but only me) thanks to SSH tunneling.
I just found out about tensorboard dev which makes it really easy to share results online.
However, it seems very limited right now, and once you close the client, you have no way to launch it again later to update results on the same experiment. The only option at the moment seems to be to delete the previous experiment, create a new one (which takes a while), and share the new URL with colleagues.

Describe the solution you'd like
I was wondering if there was some way to run a program "permanently" (i.e. without an SSH connection).
More precisely, I was wondering if it was possible to run tensorboard dev upload --logdir 'my_log_dir' in the background for days (for instance), and, at some point, to kill the execution when it is not needed anymore.
I'm of course assuming here that it doesn't use a lot of resources and that it wouldn't affect the performance of the cluster in any way.

Additional context
(screenshot attached)

[New User] Rakesh Nath

Are you a temporary user of the cluster?
Deep Learning for Astrophysics PhD
Until: 2023

Supervisor
Olivier Absil

[New User] Nathan Greffe

Are you a temporary user of the cluster?
Yes, I was previously accessing the arya GPUs and would like to get access to this cluster for my master thesis.
Until: the 27th of june 2018

Supervisor
Professor Pierre Geurts

Additional context

[New User] Loic Lejoly

Are you a temporary user of the cluster?
Utilization of the cluster to develop models in the scope of my master thesis, which consists in predicting gullies at the continental scale with deep learning.
Until: the end of the master thesis

Supervisor
Gilles Louppe

Additional context

[Feature Request] Remote Live Tracking of Tensorboard

Is your feature request related to a problem? Please describe.
As a user of the cluster, it would be nice to have remote live tracking of TensorBoard, in real time, from my local browser.

Describe the solution you'd like
I implemented a way to do this. It's basically port forwarding. I start a TensorBoard instance on Alan (via SSH), take that port, and do something like:

ssh -N -f -L localhost:local_port:localhost:port_on_alan you@master.alan.priv

And this basically works, whether the logs are on scratch or in the home space.
I also created a script to automate this by opening terminals on your platform of choice and executing the commands.
It is not done yet, since I only did the Linux part; opening terminals and executing commands on macOS (Darwin) and Windows seems a bit different.

Is this the ideal way to do such a thing, or is there a better alternative? If not, I can write a pretty README/how-to and make a PR for it.

[New User] Donovan Derkenne

I need my account reopened until the 5th of January because I haven't finished my master thesis and I need to rerun some tests.

#23

[New User] Corentin Jemine

Are you a temporary user of the cluster?
Yes. I need it for my thesis.
Until: the 8th of June. If I need an extension I will update this thread and mention you ASAP.

Supervisor
Prof. Gilles Louppe

Additional context
I am the student who needed his datasets transferred (VoxCeleb, LibriSpeech...). Note that I will require write access to these datasets.
If there is a possibility for getting access to multiple GPUs (even temporarily), please let me know.

Lionel Mathy

I would like to have access to the Alan cluster to work on my master thesis. For the access end date, I would like to be able to access the cluster until the 15th of September 2021.

Supervisor
The supervisor of the master thesis is Raphaël Marée.

Institutional email
[email protected]

[Question] Time limits

Is there any imposed time limit on all jobs, or one that is agreed by everyone? I've not seen jobs go beyond 24h on the cluster. I have three different models that must be (consecutively) trained for at least one week each. Is it alright if I set the time limit to seven days?

[New User] Anthony Cioppa

Are you a temporary user of the cluster?
Yes
Until: The end of my PhD Thesis

Supervisor
Marc Van Droogenbroeck

Additional context

Training won't start when scheduled as a job

I have this TensorFlow model that I can't manage to get to train with Slurm. It's the script sv2tts/synthesizer_train.sh in my home directory. If I ssh into the node that contains the dataset (alan-compute-02), I can run it directly and it works. But when run as a job, it freezes after initialization and outputs nothing more (I've let it run for 30 minutes). I lack the expertise to properly debug this; could I get a hand?

[New User] Ph. Koch

Are you a temporary user of the cluster?
No
Until: not defined.

Supervisor
Eric Pirard
Godefroid Dislaire

Institutional email
[email protected]

Additional context
Classification and segmentation of rock materials based on multi-spectral/multi-sensor data. Possibly reinforcement learning later on. I work with Godefroid and Pierre Barnabe in GeMMe and just joined the group to support our ML and modelling efforts.
Thanks,
ph

[Feature Request] Debug facilities

Some errors cannot be debugged locally on CPU (typical example: mismatch of CPU and GPU tensors in PyTorch). For that purpose, it would be nice to have an environment as close as possible to the one in which the code will eventually run.

There are four approaches I can think of:

  • Use the usual queue to submit job with low time requirement.
  • Use a debug queue with high priority and low maximum time bound.
  • Use a debug queue with a dedicated GPU.
  • Leave one GPU out of the computing nodes altogether.

The first approach does not require any change but offers no guarantee. Also, I am not sure how the fairness measures will be influenced.

The second is more principled, but there is still no guarantee that the code can be debugged quickly if all GPUs are busy. Usually, one would want to debug the code in real time.

Finally, leaving one GPU out (either explicitly or through a dedicated queue) allows for instantaneous debugging even when the cluster is under heavy load. On the other hand, it also implies that one GPU is rarely used.

I guess the third solution is best if we are ready to sacrifice one GPU.

@waliens @AWehenkel

[New User] Carles Cantero Mitjans

Are you a temporary user of the cluster?
I will be using the cluster for my PhD at the University of Liège

Supervisor
Marc Van Droogenbroeck

Institutional email
[email protected]

Additional context

invalid user id

Describe the issue
When I try to run

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=res.txt

#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun hostname
srun sleep 60

in my batch file, submitted as

sbatch submit.sh

I see this error in the result file:

srun: fatal: Invalid user id: 1046
srun: fatal: Invalid user id: 1046

Please help me understand what's going wrong.

[New User] Olivier Absil

Are you a temporary user of the cluster?
No, I'm a permanent user.

Supervisor
myself

Additional context
N/A

[Issue] Unable to use the GPU with Tensorflow 1.15.0

Description
I am not able to use the GPU anymore (after the recent update of the cluster) with Tensorflow (Python). I am using tensorflow-gpu==1.15.0 and this is the error output I get when running from tensorflow.python.client import device_lib; print(device_lib.list_local_devices()).

2020-09-17 10:43:32.814914: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-09-17 10:43:32.835781: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100010000 Hz
2020-09-17 10:43:32.835994: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5616ac2d1d40 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-17 10:43:32.836028: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-09-17 10:43:32.876617: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-09-17 10:43:33.540098: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5616ac3638e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-09-17 10:43:33.540132: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-09-17 10:43:33.540883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
2020-09-17 10:43:33.542039: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2020-09-17 10:43:33.543019: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2020-09-17 10:43:33.543623: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2020-09-17 10:43:33.544286: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2020-09-17 10:43:33.545065: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2020-09-17 10:43:33.545864: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2020-09-17 10:43:33.546527: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-09-17 10:43:33.546538: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-09-17 10:43:33.546552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-17 10:43:33.546560: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2020-09-17 10:43:33.546566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N

The GPU is detected but some libraries are not found.

To Reproduce
Install tensorflow-gpu==1.15.0
and run from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())

Expected behavior
The possibility to use the GPU

[Issue] `sacct` not displaying reserved/used resources

Describe the issue
I usually use the slurm sacct command to monitor jobs (which are possibly not scheduled anymore, therefore not accessible via squeue). On alan, the command does not print anything regarding the resources used by the jobs (scheduled or not).

To Reproduce
An example command with the resulting (trimmed) output. For those jobs, I have requested 4 CPUs, 30 GB of RAM and 1 GPU.

$ sacct --format="JobID,AllocCPUS,AllocGRES,AveCPU,MinCPU,MinCPUNode,MinCPUTask,Ncpus,NTasks,UserCPU,TotalCPU,ReqMem,MaxVMSize,AveVMSize" --user rmormont
       JobID  AllocCPUS    AllocGRES     AveCPU     MinCPU MinCPUNode MinCPUTask      NCPUS   NTasks    UserCPU   TotalCPU     ReqMem  MaxVMSize  AveVMSize
------------ ---------- ------------ ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ----------
1000146               0                                                                   0            00:00:00   00:00:00         0n
1000148               0                                                                   0            00:00:00   00:00:00         0n
1000149               0                                                                   0            00:00:00   00:00:00         0n
1000150               0                                                                   0            00:00:00   00:00:00         0n
1000151               0                                                                   0            00:00:00   00:00:00         0n
1000152               0                                                                   0            00:00:00   00:00:00         0n
1000153               0                                                                   0            00:00:00   00:00:00         0n
1000154               0                                                                   0            00:00:00   00:00:00         0n
1000155               0                                                                   0            00:00:00   00:00:00         0n
1000156               0                                                                   0            00:00:00   00:00:00         0n
1000157               0                                                                   0            00:00:00   00:00:00         0n
1000158               0                                                                   0            00:00:00   00:00:00         0n
1000159               0                                                                   0            00:00:00   00:00:00         0n

Expected behavior
The command should display the reserved/used resources (memory, cpu, gpu,...).

[New User] Lorenzo König

Are you a temporary user of the cluster?
I will perform FDTD simulations (optical propagation) on the Alan CPUs for my PhD project
Until: not defined.

Supervisor
Olivier Absil

Institutional email
[email protected]

[New User] Anais Halin

Are you a temporary user of the cluster?
Depending on the meaning of the question, more or less. Well, I need access during my PhD thesis.
Until: The end of my PhD thesis

Supervisor
Marc Van Droogenbroeck

Additional context
None.

[New User] Denis Defrère

Are you a temporary user of the cluster?
Finding exoplanets in direct imaging data

Until: not defined.

Supervisor
O. Absil

Additional context
Part of the EPIC project to search for and characterise exoplanets in direct imaging data.

[New User] Loïc Sacré

Are you a temporary user of the cluster?
I am a temporary user; access will be necessary for my master thesis.
Until: September 2019

Supervisor
Raphael Marée

Additional context
None

Invalid user id

Describe the issue
When running a Python script via mpiexec on the Alan CPUs, an error "Invalid user id: 1049" occurs. When I submit it without mpiexec, it generates the output as expected.

To Reproduce
My submit.sh file:

#!/bin/bash
#SBATCH --job-name=OPGtests
#SBATCH --output=optical_propagation_tests.txt
#SBATCH --ntasks=4
#SBATCH --time=59:00
#SBATCH --mem-per-cpu=100
mpiexec -n 4 python finite_HWP.py

And the output file contains only:

srun: fatal: Invalid user id: 1049

Note that the job continues running but no further output is generated.
I noticed that Malavika experienced the same error message a couple of months ago, although in a different context (issue #41).

Additional Context
I have a python environment activated to run Meep simulations (optical propagation) in parallel using mpirun / mpiexec.
Please let me know if you need more detailed information. Thanks!

GPU assignment

Describe the issue
I have asked for two GPUs, and both the code and the shell script clearly request them. However, I get the following error:
2019-05-18 17:03:44.558783: I tensorflow/core/common_runtime/process_util.cc:71] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Traceback (most recent call last):
File "/home/rnath/Code/DeepNet.py", line 62, in
conv1d=get_model()
File "/home/rnath/Code/DeepNet.py", line 58, in get_model
model = keras.utils.multi_gpu_model(model, gpus=2)
File "/home/rnath/miniconda3/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py", line 181, in multi_gpu_model
available_devices))
ValueError: To call multi_gpu_model with gpus=2, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/xla_cpu:0']. Try reducing gpus.

Context
This is after the job has been assigned.

#!/bin/bash
#SBATCH --job-name=DeepSpec
#SBATCH --time=24:10:00 # hh:mm:ss
#SBATCH --output=output_val.txt
#SBATCH --ntasks=3
#SBATCH --gres=gpu:2
#SBATCH --mem-per-cpu=5120 # 5GB
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=ALL
#SBATCH --partition=gpu2080ti
#SBATCH --comment=DeepLSpectra
python /home/rnath/Code/DeepNet.py

[Feature Request] Automatic scratch allocation

Configure Slurm such that scratch directories are automatically allocated, i.e., $SCRATCH points to /scratch/job_id. This directory should be automatically cleaned up when the job completes or crashes.

Let's have the related discussion here @CorentinJ @waliens. I'm currently encountering some issues with slurmd, but only minor things. Should have it up and running by Monday.

[Issue] TODO

Describe the issue
Whenever I try to install a new package with anaconda or try to create a new environment, I get the following error:

CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://repo.anaconda.com/pkgs/main/linux-64/current_repodata.json

To Reproduce
Steps to reproduce the behavior:

  1. conda create --name env

Expected behavior
Anaconda should install the new package or create the new environment.

Screenshots

(screenshot attached)

[New User] Donovan Derkenne

Are you a temporary user of the cluster?
Master thesis on histological image segmentation

Until: 20th august

Supervisor
Raphaël Marée

Institutional email
[email protected]

Additional context
I analyse big histological images (~25k x 15k pixels, 19 images) by processing sequential overlapping patches (1024x1024) with U-Net-like models. I have 17 GB of pre-cropped patches and I developed with TensorFlow 1.12. The training process takes about 5 to 7 days per test on one 1080 Ti. I need 8 CPUs and at least 20 GB of RAM to preprocess and feed data to the GPU optimally.
