Duke Computing Cluster (DCC)
About DCC
Duke Computing Cluster (DCC), or just "the cluster", is, for our purposes, one big computer with lots of memory, processors, graphics processors, etc. (resources). We can run scripts and programs on DCC when running them on our work computers would take too much time or even be impossible. An example of a task that is better run on DCC is the voice type classification (VTC) that we run on the VIHI audio recordings.
Most of the time, you'll be working with DCC from the terminal. The way this works is you connect to DCC in the terminal, after which all the commands you issue in that terminal will be run on DCC and not on your computer.
There are two notable exceptions:
- The `scp` utility can copy files between your computer and the cluster; it has to be run in a terminal where you are *not* connected to DCC.
- FileZilla is a GUI program that runs on your local computer and allows you to browse DCC folders as you'd do in Finder.
Connecting/logging in
To connect to DCC, open the Terminal and run the `ssh` program, telling it the address of DCC and your login (don't run it just yet):
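```bash
ssh <net-id>@dcc-login.oit.duke.edu
```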
After typing this command - if the authentication is successful - you'll see a welcome message, and your prompt will change to something like this:
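```
<net-id>@dcc-login-03 ~ $
```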
Disconnecting
If `@dcc-` is in the prompt, run `exit`. Repeat if it is still there.
Running commands
Once you are connected, any commands you type in the terminal are sent to a shell running on DCC, not your computer. In the example above, `ssh <net-id>@dcc-login.oit.duke.edu` was run on your computer in your local shell, but anything you type after `<net-id>@dcc-login-03 ~ $` will be run on DCC - in a remote shell.
Note the "login" in <net-id>@dcc-login-03
. It tells you that the shell that is running on DCC and processing your commands is a login shell. This literally just means "the shell me, DCC, used to authenticate you on myself". It is also an interactive shell: you can interact with it by typing commands on the terminal and reading what the shell prints back. The DCC docs say that it is highly discouraged to run tasks in a login shell. Not only is it discouraged, but the login shell also has access to very little of the resources so if you are running something in it you might as well do it on your local computer. Instead of running things in the login shell, you should request DCC to run a job or an interactive session.
A job is a set of computing resources (RAM, CPUs, GPUs, etc.) that will be scheduled to run your script. Depending on the available resources, the requested job can start right away, or it may have to wait for other jobs to finish.
An interactive session is a non-login interactive shell in which it is totally OK to run any scripts you want interactively. Just like a job, an interactive session will have the resources you requested available to it.
Storing files
There are two folders in which you can store things:
- The lab's 1 TB storage space: `/hpc/group/bergelsonlab/`
- Your personal folder (not sure about the space quota): `/hpc/home/<net-id>/`
All the files you want to process should be stored in the lab's folder. Your personal folder is for your configuration files and programs.
Setup
1. Have someone (the current admin is Elika) add you to the group using the Research Toolkits website. Check that you have been added by going to that website, clicking on "bergelsonlab", and finding your net-id in the list of members.
2. Set up SSH keys to avoid using two-factor authentication. Instructions here.
3. Make sure connections to DCC don't time out all the time. Use these instructions, but instead of `UseKeychain yes`, put `ServerAliveInterval 240` into the `~/.ssh/config` file. It doesn't have to be the first row under `Host *`; it just needs to be under it and have the same indent as the other lines (see the sketch after this list).
4. Check if you have access by trying to log in to DCC using `ssh`. If some text is printed out and your prompt changes to something like `<net-id>@dcc-login-03 ~ $`, then you are in! Continue to the next step. If access is denied with some sort of error, check again that step 1 was successful. If you are sure it was, ask the lab tech or anyone on staff to help you.
5. Once connected, run the following so that the files/folders you and your scripts create are accessible to others in the lab:
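One common approach (a sketch - not necessarily the exact command the lab uses) is to relax your default permissions so that new files and folders are group-writable:

```bash
# make files/folders you create group-readable and group-writable by default
echo "umask 0002" >> ~/.bashrc
source ~/.bashrc
```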
6. Set up conda (see the Conda section below).
7. Get back to your local shell (either run `exit` in the cluster shell or open a new terminal window). Try copying a file from your computer to your home folder on the cluster using `scp`:
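```bash
# copies <local-file> into your home folder on DCC
scp <local-file> <net-id>@dcc-login.oit.duke.edu:/hpc/home/<net-id>/
```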
Where `<local-file>` is a path to a file on your computer, e.g., `~/Downloads/virus.dmg`. It can also be a relative path. Now that you know how to copy files via the terminal, go and install FileZilla so that you don't have to.
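After steps 2 and 3, your `~/.ssh/config` might look something like this (a sketch - the exact contents depend on the linked instructions):

```
Host *
    UseKeychain yes
    ServerAliveInterval 240
```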
Avoid typing your SSH key passphrase every time
This isn't necessary but will be helpful if you have to type in the passphrase often. The passphrase is used to access your private SSH key when establishing a connection to DCC, e.g., when you run `ssh` or `scp`. An example where you have to type it many times in a row is running `scp` in a loop, as in Copying a set of files. If the "Forever (on macOS)" instructions aren't clear (sorry about that! Zh.), then do it "For a single terminal session" and tell the lab tech to improve the forever instructions.
For a single terminal session
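Run the following (a sketch, assuming a standard `ssh-agent` setup):

```bash
# start an ssh-agent for this terminal session and hand it your key;
# you will be asked for the passphrase once
eval "$(ssh-agent -s)"
ssh-add
```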
You should see something like:
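```
Identity added: /Users/<you>/.ssh/id_rsa (<you>@<your-computer>)
```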
Forever (on macOS)
(from this SO answer).
Open the `~/.ssh/config` file in your text editor (Atom, Sublime, etc.). If this file doesn't exist, create a new file and save it as `~/.ssh/config`.
On macOS, use [cmd]+[shift]+[.] to show the hidden `~/.ssh/` folder.
Alternatively, you can run something like the following in the terminal to create that file and open it for editing in your default text editor:
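```bash
mkdir -p ~/.ssh        # make sure the folder exists
touch ~/.ssh/config    # create the file if it doesn't exist yet
open -t ~/.ssh/config  # macOS: open it in the default text editor
```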
See if the file contains the line `Host *`. If it doesn't, add the following to the top of the file, and save the file:
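```
Host *
    UseKeychain yes
```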
If `Host *` is already there, check if there is a line `UseKeychain yes` or `UseKeychain no` among the lines directly under it that all have the same number of spaces at the beginning. If such a line exists and ends with "yes", you are done. If it exists and ends with "no", change "no" to "yes" and save the file. If the line doesn't exist, add `UseKeychain yes` under `Host *`, using the same number of spaces at the beginning as the other lines under `Host *`.
Copying files
If you are not used to doing things in a terminal or would just rather avoid using it for copying files, set up FileZilla. It will give you a Finder-like GUI that allows you to copy files to/from DCC. The scripts/commands that run things on DCC will still have to be run in a terminal.
Copying a set of files
1. Make sure you don't have to type the passphrase every time you connect to the cluster (see Avoid typing your SSH key passphrase every time above).
2. Collect the paths to all the files you want to copy in a single txt file with one path per line. Let's assume it's called `files.txt` (you can name it anything else).
3. Choose the destination folder on DCC where you want to copy your files. If the files are small, you can use your user folder (`~`). Otherwise, choose a folder somewhere under `/hpc/group/bergelsonlab/`.
4. Make sure the folder you chose already exists on the cluster. If it doesn't - create it. Use either FileZilla or a terminal connected to DCC to check (`cd <path>`) and/or create (`mkdir -p <path>`) the folder.
5. Substitute (1) the destination folder path, (2) your net id, and (3) the name of the file with the file paths in the following code, and then run it in your local (!) terminal.
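For example (a sketch using a `while read` loop; your actual snippet may differ):

```bash
destination_folder=<destination-folder>  # (1) e.g., /hpc/group/bergelsonlab/<some-folder>
net_id=<your-net-id>                     # (2)
paths_file=files.txt                     # (3)

# copy each file listed in the paths file to the destination folder on DCC
while read -r file; do
  scp "$file" "${net_id}@dcc-login.oit.duke.edu:${destination_folder}/"
done < "$paths_file"
```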
Copying one file
To copy one file, we'll use a program called `scp`. We will need to run it in our local shell.
If you are familiar with `cp`: `scp` is a very similar program, except that it runs over an SSH connection.
The general syntax is this:
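```bash
scp <src-file> <dst-folder>
```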
This command will copy the file at the path `<src-file>` to the folder at the path `<dst-folder>`. One of these paths should be a local path on your computer, and the other - a remote path on the cluster. To tell `scp` that one of them is on the cluster, you prepend `<netid>@dcc-login.oit.duke.edu:` to the corresponding path. The local path can be either relative or absolute, but the remote path must be absolute.
Here is what copying to/from DCC would look like.
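```bash
# copying a file from your computer to DCC
scp <local-file> <netid>@dcc-login.oit.duke.edu:<dcc-folder>

# copying a file from DCC to your computer
scp <netid>@dcc-login.oit.duke.edu:<dcc-file> <local-folder>
```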
Where:
- `<local-file>` is a path to a file on your computer, e.g., `~/Downloads/virus.dmg`.
- `<local-folder>` is a path to a folder on your computer, e.g., `~/Downloads/`.
- `<dcc-file>` is a path to a file on DCC, e.g., `/hpc/home/<net-id>/Downloads/virus.dmg`.
- `<dcc-folder>` is a path to a folder on DCC, e.g., `/hpc/home/<net-id>/Downloads/`.
The destination folder must exist before copying. To check if the destination folder on the cluster exists, you can try changing into it. On the cluster, run:
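```bash
cd <dcc-folder>  # errors out if the folder does not exist
```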
The command tends to get quite long, so to improve readability, you can use shell variables:
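A sketch - put the repeated parts into shell variables first:

```bash
dcc=<netid>@dcc-login.oit.duke.edu
dst=/hpc/group/bergelsonlab/<destination-folder>
scp <local-file> "$dcc:$dst"
```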
Running one shortish job
In this case, the easiest thing to do is to just run the command interactively - in a terminal that is sending commands directly to one of the cluster nodes.
Things you need to know before you run the command:
- Do you need a GPU? Ask for a node from the "gpu-common" partition with `-p gpu-common`; omit the option otherwise.
- Does it make sense to ask for more CPUs? How many? Request, for example, 8 CPUs with `-c 8`; omit the option otherwise.
- Does the job require additional RAM (the default is 2 GB)? How much? If you run out of memory, the job will fail - and not always with a message that tells you the reason for the failure. Request, for example, 16 GB with `--mem=16G`; omit the option if 2 GB is enough.
Ask for a node to work on (the part in square brackets is only needed if you want a GPU):

```bash
srun [-p gpu-common --gres=gpu:1] -c <n_cores> --mem=<RAM in GBs>G --pty bash -i
```

Then run your command/script.
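For example, to get an interactive session with 8 CPUs, 16 GB of RAM, and no GPU:

```bash
srun -c 8 --mem=16G --pty bash -i
```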
Running one job using SLURM
Write a SLURM batch script with:
- all the requirements for the node (RAM, cores, type of node, etc.), each prefixed with `#SBATCH` - these have to come before any actual commands;
- additionally, `-e` and `-o` options to redirect the error and output streams (that is what gets printed in the terminal when you run a command) to text files - otherwise, all the text output from your command will be lost;
- the actual commands.
Here is an example (a sketch - the `python` line below is a placeholder for whatever you actually want to run):
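```bash
#!/bin/bash
#SBATCH -c 8        # number of CPUs
#SBATCH --mem=16G   # RAM
#SBATCH -o log.out  # the output stream goes here
#SBATCH -e log.err  # the error stream goes here

# the actual command(s)
python do_the_thing.py
```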
Save the script to `do_the_thing.sh` and ask SLURM to run it with:
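```bash
sbatch do_the_thing.sh
```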
Check the outputs in `log.err` and `log.out` to see how the script is doing and whether it has errored out.
Running multiple independent jobs
NB: this is how I imagine it should work. This part will have to be updated once I actually try to do this.
One script does every job, so it needs a way to do one thing for one job and another thing for another - kind of like the iterator `i` in a for loop. One way to arrange this using SLURM is job arrays: the commands in the main script will be run in the same way for each job, except for the environment variable `SLURM_ARRAY_TASK_ID`. For example, if you need to process 20 files, you would run a job array with IDs ranging from 1 to 20. The main script can then take file number `SLURM_ARRAY_TASK_ID` from an input folder and pass it to the worker script (or just use it directly in the same script - it is just my preference to split them). So, your main script (the one that asks the cluster to run an array of 20 jobs) could have the following lines:
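```bash
#SBATCH --array=1-20        # run 20 jobs, with SLURM_ARRAY_TASK_ID going from 1 to 20
#SBATCH -o logs/job_%a.out  # %a gets replaced with the array task id
#SBATCH -e logs/job_%a.err

# A sketch: take the SLURM_ARRAY_TASK_ID-th file from the input folder and
# pass it to the worker script (<input-folder> and process_one_file.sh are
# made-up placeholders).
input_file="<input-folder>/$(ls <input-folder> | sed -n "${SLURM_ARRAY_TASK_ID}p")"
bash process_one_file.sh "$input_file"
```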
1. Write a script to run one job. Make it take a filename or a job number as an argument.
2. Write a job array script (see https://oit-rc.pages.oit.duke.edu/rcsupportdocs/dcc/slurm/#slurm-job-arrays) that calls the script above.
Conda
Set up
Try `conda --version` in the login shell. If that gives you errors, run:
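A sketch, assuming Anaconda is provided as an environment module on DCC (the exact module name/version may differ):

```bash
module load Anaconda3
```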
Then try `conda --version` again.
Use conda in a SLURM batch script