Duke Computing Cluster (DCC)

About DCC

The Duke Computing Cluster (DCC), or just "the cluster", is, for our purposes, one big computer with lots of memory, processors, graphics processors, etc. (resources). We can run scripts and programs on DCC when doing so on our work computers would take too much time or even be impossible. An example of a task that is better run on DCC is the voice type classification (VTC) that we run on the VIHI audio recordings.

Most of the time, you'll be working with DCC from the terminal. The way this works is you connect to DCC in the terminal, after which all the commands you issue in that terminal will be run on DCC and not on your computer.

There are two notable exceptions:

  • The scp utility can copy files between your computer and the cluster; it has to be run in a terminal where you are not connected to DCC.

  • FileZilla is a GUI program that runs on your local computer and allows you to browse DCC folders as you'd do in Finder.

Connecting/logging in

To connect to DCC, open the Terminal and run the ssh program telling it the address of DCC and your login (don't run it just yet):

ssh <net-id>@dcc-login.oit.duke.edu

After running this command - if the authentication is successful - you'll see a message like this:

$ ssh <net-id>@dcc-login.oit.duke.edu
Last login: Mon Oct 24 13:33:34 2022 from 10.182.14.225
################################################################################
# NOTICE: Service enhancements were deployed 9/12. See https://rc.duke.edu for #
# more information.                                                            #
#                                                                              #
# Contact oitresearchsupport@duke.edu for assistance                           #
# My patch window is wednesday 03:00                                            #
################################################################################
<net-id>@dcc-login-03  ~ $

Disconnecting

If @dcc- is in the prompt, run exit. Repeat if it is still there.
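
For example, disconnecting from the login shell shown above could look roughly like this (the exact messages and your local prompt will differ):

<net-id>@dcc-login-03  ~ $ exit
logout
Connection to dcc-login.oit.duke.edu closed.
<your-laptop> ~ $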

Running commands

Once you are connected, any commands you type in the terminal are sent to a shell running on DCC, not your computer. In the example above, ssh <net-id>@dcc-login.oit.duke.edu was run on your computer in your local shell but anything you type after <net-id>@dcc-login-03 ~ $ will be run on DCC - in a remote shell.

Note the "login" in <net-id>@dcc-login-03. It tells you that the shell that is running on DCC and processing your commands is a login shell. This literally just means "the shell me, DCC, used to authenticate you on myself". It is also an interactive shell: you can interact with it by typing commands on the terminal and reading what the shell prints back. The DCC docs say that it is highly discouraged to run tasks in a login shell. Not only is it discouraged, but the login shell also has access to very little of the resources so if you are running something in it you might as well do it on your local computer. Instead of running things in the login shell, you should request DCC to run a job or an interactive session.

A job is a set of computing resources (RAM, CPUs, GPUs, etc.) that will be scheduled to run your script. Depending on the available resources, the requested job may start right away or may have to wait for other jobs to finish.

An interactive session is a non-login interactive shell through which it is totally OK to run any scripts you want interactively. Just like a job, an interactive session will have the resources you requested available to it.

Storing files

There are two folders in which you can store things:

  1. The lab's 1 TB storage space:

    /hpc/group/bergelsonlab
  2. Your personal folder (not sure about the space quota):

    /hpc/home/<net-id>

All the files you want to process should be stored in the lab's folder. Your personal folder is for your configuration files and programs.

Setup

  1. Have someone (the current admin is Elika) add you to the group using the Research Toolkits website. Check that you have been added by going to that website, clicking on "bergelsonlab", and finding your net-id in the list of members.

  2. Set up SSH keys to avoid using 2-factor authentication. Instructions here.

  3. Make sure connections to DCC don't time out all the time. Use these instructions, but instead of UseKeychain yes, put the following in the ~/.ssh/config file:

    Host *
        ServerAliveInterval 240

    ServerAliveInterval 240 doesn't have to be the first line under Host *; it just needs to be somewhere under it with the same indentation as the other lines.

  4. Check if you have access by trying to log in to DCC using ssh.

    ssh <net-id>@dcc-login.oit.duke.edu

    If some text is printed out and your prompt changes to something like

    <net-id>@dcc-login-01  ~ $

    then you are in! Continue to the next step. If access is denied with some sort of error, check again that step 1 was successful. If you are sure it was, ask the lab tech or anyone on staff to help you.

  5. Once connected, run the following so that the files/folders you and your scripts create are accessible to others in the lab (a quick way to verify this setting is shown after this list):

    echo "umask 002" >> ~/.bashrc
  6. Set up conda (see the Conda section below).

  7. Get back to your local shell (either exit from the cluster shell or open a new terminal window). Try copying a file from your computer to your home folder on the cluster using scp:

    scp <local-file> <net-id>@dcc-login.oit.duke.edu:/hpc/home/<net-id>/

    Where <local-file> is a path to a file on your computer, e.g., ~/Downloads/virus.dmg. It can also be a relative path.

  8. Now that you know how to copy files via the terminal, go and install FileZilla so that you don't have to.
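
As mentioned in step 5, you can verify the umask setting by starting a new connection (so that the updated ~/.bashrc is read) and printing the current mask; it should end in 002:

ssh <net-id>@dcc-login.oit.duke.edu
umask  # should print 0002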

Avoid typing your SSH key passphrase every time

This isn't necessary but will be helpful if you have to type in the passphrase often. The passphrase is used to access your private SSH key when establishing a connection to DCC, e.g., when you run ssh or scp. An example where you have to type it many times in a row is running scp in a loop as in Copying a set of files. If the Forever (on MacOS) instructions aren't clear (sorry about that! Zh.), then do it For a single terminal session and tell the lab tech to improve the forever instructions.

For a single terminal session

eval $(ssh-agent)  # start an SSH agent for this terminal session
ssh-add            # add your key to the agent (you'll be asked for the passphrase one last time)

You should see something like:

$ ssh-add
Enter passphrase for /Users/ek221/.ssh/id_rsa:
Identity added: /Users/ek221/.ssh/id_rsa (ek221@TTS-210192ML)

Forever (on MacOS)

(from this SO answer).

Open the ~/.ssh/config file in your text editor (Atom, Sublime, etc.). If this file doesn't exist, create a new file and save it as ~/.ssh/config.

On macOS, use [cmd]+[shift]+[.] to show the hidden ~/.ssh/ folder.

Alternatively, you can run the following in the terminal to create that file and open it in your default text editor.

touch ~/.ssh/config
open ~/.ssh/config

See if the file contains the line Host *. If it doesn't, add the following to the top of the file, and save the file:

Host *
    UseKeychain yes

If Host * is already there, check whether there is a UseKeychain yes or UseKeychain no line among the lines directly under it (the ones that all start with the same number of spaces). If the line exists and ends with "yes", you are done. If it exists and ends with "no", change "no" to "yes" and save the file. If the line doesn't exist, add UseKeychain yes under Host *, using the same number of leading spaces as the other lines under it.
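
For reference, after this change and the keep-alive setting from Setup step 3, your ~/.ssh/config could look something like this:

Host *
    UseKeychain yes
    ServerAliveInterval 240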

Copying files

If you are not used to doing things in a terminal or would just rather avoid using it for copying files, set up FileZilla. It will give you a Finder-like GUI interface that allows you to copy files to/from DCC. The scripts/commands that will run things on DCC will still have to be run in a terminal.

Copying a set of files

  1. Make sure you don't have to type the passphrase every time you connect to the cluster (see Avoid typing your SSH key passphrase every time above).

  2. Collect the paths to all the files you want to copy in a single txt file with one path per line. Let's assume it's called files.txt (you can name it anything else; an example is shown after this list).

  3. Choose the destination folder on DCC where you want to copy your files. If the files are small, you can use your user folder (~). Otherwise, choose a folder somewhere under /hpc/group/bergelsonlab/

  4. Make sure the folder you chose already exists on the cluster; if it doesn't, create it. Use either FileZilla or a terminal connected to DCC to check (cd <path>) and/or create (mkdir -p <path>) the folder.

  5. Substitute (1) the destination folder path, (2) your net id, and (3) the name of the file with the file paths in the following code, and then run it in your local (!) terminal.

    output_dir=/hpc/group/bergelsonlab/sample_dir
    your_net_id=xyz123
    files=files.txt
    
    dst=$your_net_id@dcc-login.oit.duke.edu:$output_dir
    # Why not just `while read p`? See https://stackoverflow.com/a/1521498/3042770
    while IFS="" read -r p || [ -n "$p" ]
    do
      scp "$p" "$dst"
    done < "$files"
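
For reference, files.txt is just a plain text file with one path per line, e.g. (the paths here are made up for illustration):

~/Downloads/file_01.wav
~/Downloads/file_02.wav
~/Documents/notes.txt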

Copying one file

To copy one file, we'll use a program called scp. We will need to run it in our local shell.

If you are familiar with cp, scp is a very similar program, except that it copies files over an SSH connection.

The general syntax is this:

scp <src-file> <dst-folder>

This command will copy the file at the path <src-file> to the folder at the path <dst-folder>. One of these paths should be a path on your local computer, and the other - a remote path on the cluster. To tell scp that one of those is on the cluster, you prepend <net-id>@dcc-login.oit.duke.edu: to the corresponding path. The local path can be either relative or absolute, but the remote path must be absolute.

Here is what copying to/from DCC would look like.

# Copy to DCC:
scp <local-file> <net-id>@dcc-login.oit.duke.edu:<dcc-folder>
# Copy from DCC:
scp <net-id>@dcc-login.oit.duke.edu:<dcc-file> <local-folder>

Where:

  • <local-file> is a path to a file on your computer, e.g., "~/Downloads/virus.dmg".

  • <local-folder> is a path to a folder on your computer, e.g., "~/Downloads/".

  • <dcc-file> is a path to a file on DCC, e.g., "/hpc/home/<net-id>/Downloads/virus.dmg".

  • <dcc-folder> is a path to a folder on DCC, e.g., "/hpc/home/<net-id>/Downloads/".

The destination folder must exist before copying. To check if the destination folder on the cluster exists, you can try changing into it. On the cluster, run:

cd <dcc-folder>
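
If the folder doesn't exist, create it first (same as in Copying a set of files):

mkdir -p <dcc-folder>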

The command tends to get quite long, so to improve readability, you can use shell variables:

src=<local-path>
output_dir=<dcc-folder>
dst=<net-id>@dcc-login.oit.duke.edu:$output_dir
scp "$src" "$dst"

Running one shortish job

In this case, the easiest thing to do is just run the command interactively - in a terminal that is sending commands directly to one of the cluster nodes.

Things you need to know before you run the command:

  • Do you need a GPU? Ask for a node from the "gpu-common" partition with -p gpu-common --gres=gpu:1; omit these options otherwise.

  • Does it make sense to ask for more CPUs? How many? Request 8 CPUs with -c 8, omit the option otherwise.

  • Does the job require additional RAM (the default is 2 GB)? How much? If you run out of memory, the job will fail, and not always with a message that tells you why. Request, e.g., 16 GB with --mem=16G. Omit the option if 2 GB is enough.

  1. Ask for a node to work on: srun [-p gpu-common --gres=gpu:1] -c <n_cores> --mem=<RAM in GBs>G --pty bash -i

  2. Run your command/script.
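
For example, requesting a GPU node with 8 CPUs and 16 GB of RAM (adjust or drop the options as discussed above) would look like this:

srun -p gpu-common --gres=gpu:1 -c 8 --mem=16G --pty bash -i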

Running one job using SLURM

Write a SLURM batch script with:

  • All the requirements for the node (RAM, cores, type of node, etc.), each on a line starting with #SBATCH; these have to come before any actual commands,

    • Additionally, add -e and -o to redirect the error and output streams (what would be printed in the terminal if you ran the command yourself) to text files. Otherwise, all the text output from your command will be lost.

  • Actual commands.

Here is an example:

#!/bin/bash

# Resource requirements
#SBATCH -p gpu-common
#SBATCH -e log.err
#SBATCH -o log.out
#SBATCH --mem=32G

# The commands
source ~/.bashrc
conda activate pyannote

blab=/hpc/group/bergelsonlab
$blab/VTC/voice_type_classifier/apply.sh $blab/vihi_wavs --device=gpu

Save the script to do_the_thing.sh and ask SLURM to run it with

sbatch do_the_thing.sh

Check log.err and log.out to see how the script is doing and whether it has errored out.
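
You can also keep an eye on the job from the login shell while it runs, for example (assuming you are in the folder you submitted from):

squeue -u <net-id>  # is the job still pending/running?
tail -f log.out     # follow the output file as it is written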

Running multiple independent jobs

NB: this is how I imagine it should work. This part will have to be updated once I actually try to do this.

One script does every job, so it needs a way to do one thing for one job and another thing for another - kind of like the iterator i in a for loop. One way to arrange this using SLURM is job arrays: the commands in the main script are run in the same way for each job, except for the environment variable SLURM_ARRAY_TASK_ID. For example, if you need to process 20 files, you would run a job array with ids ranging from 1 to 20. The main script can then take file number SLURM_ARRAY_TASK_ID from the input folder and pass it to a worker script (or just use it directly in the same script; it is just my preference to split it). So, your main script (the one that asks the cluster to run an array of 20 jobs) could have the following lines:

all_files=($(ls -1 <input_folder>))  # -1 lists one file name per line
one_file=${all_files[(($SLURM_ARRAY_TASK_ID - 1))]}
my_script.sh "$one_file"

  1. Write a script to run one job.

  2. Make it take a filename or a job number as an argument.

  3. Write a job array script (see https://oit-rc.pages.oit.duke.edu/rcsupportdocs/dcc/slurm/#slurm-job-arrays) that calls the script above.
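
Putting it together, a job array script could look roughly like this (an untested sketch; the resource values, <input_folder>, and my_script.sh are placeholders - see the DCC docs linked above for the authoritative syntax):

#!/bin/bash

# Resource requirements for each job in the array
#SBATCH --array=1-20
#SBATCH --mem=16G
#SBATCH -e log_%a.err  # %a is replaced with the array task id
#SBATCH -o log_%a.out

# Pick the file corresponding to this array task
all_files=($(ls -1 <input_folder>))  # one file name per line
one_file=${all_files[(($SLURM_ARRAY_TASK_ID - 1))]}

# Process that one file
my_script.sh "$one_file"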

Conda

Set up

Try conda --version in the login shell. If that gives you errors, run

module load Anaconda3
conda init bash
source ~/.bashrc

Then try conda --version again.
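
Once conda --version works, you can create and activate environments as usual, e.g. (the environment name and Python version here are just placeholders):

conda create -n <env_name> python=3.10
conda activate <env_name>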

Use conda in a SLURM batch script

source ~/.bashrc  # batch jobs start a fresh shell that doesn't read ~/.bashrc on its own, so load the conda setup manually
conda activate <your_env_name>