Working with modules on Bianca¶
Objectives
- Being able to search/load/unload modules
- Create an executable Bash script that uses a module (without SLURM)
Exercises¶
Want more complex/realistic exercises?
The goal of this lesson is to work with the module system in a minimal/fast way. These exercises do not achieve anything useful. See 'Bigger exercises' for more complex/realistic exercises.
- 2a. Verify that the tool cowsay is not available by default
- 2b. Search for the module providing cowsay
- 2c. Load a specific version of that module
- 2d. Verify that the tool cowsay now works
- 2e. Unload that module
- 2f. Verify that the tool cowsay is not available anymore
- 3a. Create an executable script called cow_says_hello.sh. It should load a specific version of the cowsay module, after which it uses cowsay to do something
- 3b. Find out: if the cowsay module is not loaded before running the script, is it loaded after running the script (yes/no)?
Working with a computer cluster module system
1. Background¶
Bianca is a shared Linux computer with all the standard Linux/GNU tools installed, on which all users should be able to do their work independently and undisturbed.
To ensure this, users cannot modify, upgrade or uninstall software themselves; instead, a module system is used. This allows users to independently use their favorite versions of their favorite software.
To have new software installed on Bianca, users must explicitly request a version of a piece of software. As of today, there are around 800 programs and packages, with multiple versions, available on all UPPMAX clusters. Using explicit versions of software is easy to do and improves the reproducibility of the scripts written.
To preserve hard disk space, Bianca also has multiple big databases installed.
Warning
- To access bioinformatics tools, load the bioinfo-tools module first.
2. Working with the module system¶
The module command is the basic interface to the module system.
The ml shortcut command is also available.
- List all modules immediately available, or search for a specific available module
module avail or ml av
module avail *tool* or ml av *tool*
This command is not so smart, though, especially when searching for a specific tool, or a bioinformatics tool. It only reports modules that are immediately available.
For example, ml av R outputs everything that has an r in the name, which is not useful, and ml av samtools finds nothing:
$ ml av samtools
No module(s) or extension(s) found!
Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
It is better to use module spider or ml spider.
If there is an exact match, it reports it first.
$ ml spider R
-------------------------------------------
R:
-------------------------------------------
Versions:
R/3.0.2
R/3.2.3
R/3.3.2
R/3.4.0
R/3.4.3
R/3.5.0
R/3.5.2
R/3.6.0
R/3.6.1
R/4.0.0
R/4.0.4
R/4.1.1
R/4.2.1
Other possible modules matches:
454-dataprocessing ADMIXTURE ANTLR ARCS ARC_assembler ARPACK-NG ART AdapterRemoval AlienTrimmer Amber AnchorWave Arlequin Armadillo ArrowGrid Bamsurgeon BclConverter BioBakery BioBakery_data ...
-------------------------------------------
To find other possible module matches execute:
$ module -r spider '.*R.*'
-------------------------------------------
For detailed information about a specific "R" package (including how to load the modules) use the module's full name.
Note that names that have a trailing (E) are extensions provided by other modules.
For example:
$ module spider R/4.2.1
-------------------------------------------
$ ml spider samtools
-------------------------------------------
samtools:
-------------------------------------------
Versions:
samtools/0.1.12-10
samtools/0.1.19
samtools/1.1
samtools/1.2
samtools/1.3
samtools/1.4
samtools/1.5_debug
samtools/1.5
samtools/1.6
samtools/1.8
samtools/1.9
samtools/1.10
samtools/1.12
samtools/1.14
samtools/1.16
samtools/1.17
Other possible modules matches:
SAMtools
-------------------------------------------
To find other possible module matches execute:
$ module -r spider '.*samtools.*'
-------------------------------------------
For detailed information about a specific "samtools" package (including how to load the modules) use the module's full name.
Note that names that have a trailing (E) are extensions provided by other modules.
For example:
$ module spider samtools/1.17
-------------------------------------------
The final bit of output tells us more about a specific module version, including the special step required to access all bioinformatics modules.
$ ml spider samtools/1.17
-------------------------------------------
samtools: samtools/1.17
-------------------------------------------
You will need to load all module(s) on any one of the lines below before the "samtools/1.17" module is available to load.
bioinfo-tools
Help:
samtools - use samtools 1.17
Version 1.17
This reminds us that we need to load the bioinfo-tools module to be able to load samtools/1.17.
Again, this is required (just once) before loading bioinformatics software.
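The two-step search shown above can be sketched as a short snippet. The variable names tool and version are made up for illustration, and the guard is only there so that the snippet does not fail on a machine without a module system.

```shell
# Illustrative variables (not part of the module system)
tool="samtools"
version="1.17"

if command -v module >/dev/null 2>&1; then
    # Step 1: list all versions of the tool known to the module system
    module spider "$tool"
    # Step 2: show what must be loaded before this specific version
    module spider "$tool/$version"
else
    echo "module command not found: run this on the cluster" >&2
fi
```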
- Load a module
module load <module name> or ml <module name>
When a module is loaded without specifying a version, its "default" version is used, which is almost always the latest version.
However, we rarely want to rely on that.
For reproducibility, we want to load specific versions of our bioinformatics tools.
To load the samtools/1.17 module, which is a bioinformatics module, load bioinfo-tools first and then samtools/1.17.
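A minimal sketch of that loading order, assuming the module names used on this page. The variable names are illustrative, and the guard only keeps the snippet from failing where no module system is installed.

```shell
# Illustrative variables holding the module names from the text above
prerequisite="bioinfo-tools"
tool_module="samtools/1.17"

if command -v module >/dev/null 2>&1; then
    # The prerequisite must be loaded first: it makes the tool module loadable
    module load "$prerequisite" || echo "could not load $prerequisite" >&2
    # Load an explicit version instead of relying on the default
    module load "$tool_module" || echo "could not load $tool_module" >&2
else
    echo "module command not found: run this on the cluster" >&2
fi
```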
- List the loaded modules
module list or simply ml
To load GATK/4.3.0.0 now, bioinfo-tools is not required because it is already loaded.
Loading this module also shows that loading a module sometimes results in a message that is helpful for using the module at UPPMAX.
$ ml GATK/4.3.0.0
Note that all versions of GATK starting with 4.0.8.0 use a different wrapper
script (gatk) than previous versions of GATK. You might need to update your
jobs accordingly.
The complete GATK resource bundle is in /sw/data/GATK
See 'module help GATK/4.3.0.0' for information on activating the GATK Conda
environment for using DetermineGermlineContigPloidy and similar other tools.
This message references the command module help GATK/4.3.0.0 for additional help with this module.
All modules have at least a brief help message. Some (such as GATK/4.3.0.0) have more extensive help that guides users in using features of the modules at UPPMAX.
It is not general help for using the tool itself.
- Display a brief module-specific help.
module help <module name> or ml help <module name>
$ ml help GATK/4.3.0.0
-------------- Module Specific Help for "GATK/4.3.0.0" ---------------
GATK - use GATK 4.3.0.0
Version 4.3.0.0
**GATK 4.3.0.0**
Usage:
gatk --help for general options, including how to pass java options
gatk --list to list available tools
gatk ToolName -OPTION1 value1 -OPTION2 value2 ...
to run a specific tool, e.g., HaplotypeCaller, GenotypeGVCFs, ...
For more help getting started, see
https://software.broadinstitute.org/gatk/documentation/article.php?id=9881
...
When we list the modules loaded with ml, we see that GATK/4.3.0.0 is now loaded, as is its prerequisite module java/sun_jdk1.8.0_151.
$ ml
Currently Loaded Modules:
1) uppmax 2) bioinfo-tools 3) samtools/1.17 4) java/sun_jdk1.8.0_151 5) GATK/4.3.0.0
Modules can also be unloaded, which also unloads their prerequisites.
- Unload a module
module unload <module name> or ml -<module name>
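As a small sketch, unloading the GATK module loaded earlier could look like this. The variable name is illustrative, and the guard only keeps the snippet from failing where no module system is installed.

```shell
# Illustrative variable holding the module to unload
module_to_unload="GATK/4.3.0.0"

if command -v module >/dev/null 2>&1; then
    # Unload the module; per the text above, its prerequisites go with it
    module unload "$module_to_unload"
    # Check which modules remain loaded
    module list
else
    echo "module command not found: run this on the cluster" >&2
fi
```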
3. Using modules in an executable script¶
Using modules in an executable script is straightforward: just load the module needed before using the software in that module.
For example, this is a valid bash script:
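A minimal sketch of such a script, assuming the samtools/1.17 module used earlier on this page. The file name samtools_version.sh is made up for illustration. The snippet writes the script with a here-document and makes it executable; it does not run it, so it is safe to try anywhere.

```shell
# Write a small script that loads a module before using its software.
# The quoted 'EOF' keeps the shell from expanding anything inside.
cat > samtools_version.sh << 'EOF'
#!/bin/bash
# Load the prerequisite module, then an explicit version of the tool
module load bioinfo-tools
module load samtools/1.17
# Use the software provided by the module
samtools --version
EOF

# Make the script executable, so it can be started as ./samtools_version.sh
chmod +x samtools_version.sh
```

On the cluster, the script can then be started with ./samtools_version.sh.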
Bigger exercises¶
Warning
- To access bioinformatics tools, load the bioinfo-tools module first.
Hands on: Processing a BAM file to a VCF using GATK, and annotating the variants with snpEff
This workflow uses a pre-made BAM file that contains a subset of reads from a sample from European Nucleotide Archive project PRJEB6463 aligned to human genome build hg38. These reads are from the region chr1:100300000-100800000.
- Copy example BAM file to your working directory.
- Take a quick look at the BAM file. First see if samtools is available.
- If samtools is not found, load bioinfo-tools then samtools/1.17
- Now create an index for the BAM file, and examine the first 10 reads aligned within the BAM file.
- Looks good. Now load the GATK/4.3.0.0 module.
- Make symbolic links to hg38 genome resources already available on UPPMAX. This provides local symbolic links for the hg38 resources genome.fa, genome.fa.fai and genome.dict.
- Create a VCF containing inferred variants. Speed it up by confining the analysis to this region of chr1.
  $ gatk HaplotypeCaller --reference genome.fa --input ERR1252289.subset.bam --intervals chr1:100300000-100800000 --output ERR1252289.subset.vcf
  This produces as its output the files ERR1252289.subset.vcf and ERR1252289.subset.vcf.idx.
- Now use snpEff/5.1 to annotate the variants. Loading snpEff/5.1 results in a change of java prerequisite. Also take a quick look at the help for the module for help with running this tool at UPPMAX.
  $ ml snpEff/5.1
  The following have been reloaded with a version change:
    1) java/sun_jdk1.8.0_151 => java/OpenJDK_12+32
  $ ml help snpEff/5.1
  ------------------- Module Specific Help for "snpEff/5.1" --------------------
  snpEff - use snpEff 5.1
  Version 5.1
  Usage: java -jar $SNPEFF_ROOT/snpEff.jar ...
  Usage: java -jar $SNPEFF_ROOT/SnpSift.jar ...
  along with the desired command and possible java options for memory, etc
  Note that databases must be added by an admin -- request via support@uppmax.uu.se
  See http://snpeff.sourceforge.net/protocol.html for general help
  Every database that is provided by snpEff/5.1 as of this installation is installed.
  This complete list can be generated with
  java -jar $SNPEFF_ROOT/snpEff.jar databases
  Three additional databases have been installed.
  Database name                Description                                   Notes
  -------------                -----------                                   -----
  c_elegans.PRJNA13758.WS283   Caenorhabditis elegans genome version WS283   MtDNA uses Invertebrate_Mitochondrial codon table
  canFam4.0                    Canis familiaris genome version 4.0
  fAlb15.e73                   Ficedula albicollis ENSEMBLE 73 release
  The complete list of locally installed databases is available at $SNPEFF_ROOT/data/databases_list.installed
  To add your own snpEff database, see the guide at http://pcingola.github.io/SnpEff/se_buildingdb/#option-1-building-a-database-from-gtf-files
- Annotate the variants.
- Take a quick look!
- Compress the annotated VCF and index it, using bgzip and tabix provided by the samtools/1.17 module, already loaded.
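Two of the steps above, indexing and inspecting the BAM file and compressing and indexing the annotated VCF, can be sketched as follows. The BAM file name comes from the gatk command above; the name of the annotated VCF is an assumption, so use whatever name you chose in the annotation step. The checks only keep the snippet from failing when a tool or file is missing.

```shell
bam="ERR1252289.subset.bam"
# Hypothetical name for the annotated VCF: replace with your own file name
annotated_vcf="ERR1252289.subset.snpEff.vcf"

if command -v samtools >/dev/null 2>&1 && [ -f "$bam" ]; then
    # Create the index file next to the BAM file
    samtools index "$bam"
    # Show the first 10 aligned reads
    samtools view "$bam" | head -n 10
else
    echo "samtools or $bam not found: skipping the BAM steps" >&2
fi

if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1 \
        && [ -f "$annotated_vcf" ]; then
    # Compress the VCF; this replaces the file by a .gz version
    bgzip "$annotated_vcf"
    # Index the compressed VCF, using the VCF preset
    tabix -p vcf "$annotated_vcf.gz"
else
    echo "bgzip, tabix or $annotated_vcf not found: skipping the VCF steps" >&2
fi
```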
Hands on: Running R within RStudio, use ggplot2 from R_packages/4.1.1
- Load the R_packages/4.1.1 module and the latest RStudio module, and start RStudio with rstudio &.
- Load the ggplot2 R library, provided by R_packages/4.1.1, and produce an example plot.
- Save the plot using ggsave.
Hands on: Loading the conda/latest module
- Load the conda/latest module.
  $ ml conda/latest
  The variable CONDA_ENVS_PATH contains the location of your environments. Set it to your project's environments folder if you have one.
  Otherwise, the default is ~/.conda/envs. Remember to export the variable with export CONDA_ENVS_PATH=/proj/...
  You may run "source conda_init.sh" to initialise your shell to be able to run "conda activate" and "conda deactivate" etc.
  Just remember that this command adds stuff to your shell outside the scope of the module system.
  REMEMBER TO USE 'conda clean -a' once in a while
We want to set the CONDA_ENVS_PATH variable to a directory within our project, rather than use the default, which is in our home directory.
If you do not set this variable, your home directory will easily exceed its quota when creating even a single Conda environment.
This will be covered in more detail in the afternoon.
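As a sketch, setting the variable could look like this. The folder /proj/my_project and the subfolder name conda_envs are hypothetical placeholders: replace them with your own project's environments folder.

```shell
# Hypothetical project folder: replace with your own project's path
project_dir="/proj/my_project"

# Point Conda at a folder inside the project instead of ~/.conda/envs
export CONDA_ENVS_PATH="$project_dir/conda_envs"

# Confirm where new environments will be created
echo "Conda environments will be stored in: $CONDA_ENVS_PATH"
```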
Solutions¶
2a. Verify that the tool cowsay is not available by default¶
Trying to run cowsay gives the error message: cowsay: command not found.
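One way to check this, as a minimal sketch. The message hello is only an example, and the variable cowsay_found is made up for illustration.

```shell
# Try to run the tool; without the module loaded, Bash cannot find it.
# '|| true' only keeps the snippet going when the command is missing.
cowsay hello || true

# Alternative check that does not rely on reading the error message
if command -v cowsay >/dev/null 2>&1; then
    cowsay_found="yes"
else
    cowsay_found="no"
fi
echo "cowsay found: $cowsay_found"
```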
2b. Search for the module providing cowsay¶
You will find the cowsay/3.03 module.
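A sketch of the search, using module spider as recommended earlier on this page. The variable name is illustrative, and the guard only keeps the snippet from failing where no module system is installed.

```shell
# Illustrative variable holding the name to search for
search_term="cowsay"

if command -v module >/dev/null 2>&1; then
    # Search all modules, including those that are not immediately loadable
    module spider "$search_term"
else
    echo "module command not found: run this on the cluster" >&2
fi
```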
2c. Load a specific version of that module¶
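A sketch, assuming the cowsay/3.03 module found in 2b. The variable name is illustrative, and the guard only keeps the snippet from failing where no module system is installed.

```shell
# Illustrative variable holding the module name with an explicit version
cowsay_module="cowsay/3.03"

if command -v module >/dev/null 2>&1; then
    # Load the explicit version, not the default
    module load "$cowsay_module" || echo "could not load $cowsay_module" >&2
else
    echo "module command not found: run this on the cluster" >&2
fi
```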
2d. Verify that the tool cowsay now works¶
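A sketch of the check. The message is arbitrary, and the variable name is made up for illustration.

```shell
# Illustrative message for the cow to say
greeting="hello"

if command -v cowsay >/dev/null 2>&1; then
    # With the module loaded, this draws a cow saying the message
    cowsay "$greeting"
else
    echo "cowsay is still not found: was the module loaded?" >&2
fi
```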
2e. Unload that module¶
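A sketch, assuming the cowsay/3.03 module loaded in 2c. The variable name is illustrative, and the guard only keeps the snippet from failing where no module system is installed.

```shell
# Illustrative variable holding the module loaded in 2c
loaded_module="cowsay/3.03"

if command -v module >/dev/null 2>&1; then
    # Unload the module again
    module unload "$loaded_module"
else
    echo "module command not found: run this on the cluster" >&2
fi
```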
2f. Verify that the tool cowsay is not available anymore¶
Trying to run cowsay again gives the error message: cowsay: command not found.
3a. Create an executable script called cow_says_hello.sh¶
Copy-paste this example text:
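A minimal sketch of such a script, assuming the cowsay/3.03 module found in 2b and an arbitrary message. Here it is written to cow_says_hello.sh with a here-document, which is equivalent to pasting the three lines between the EOF markers into a text editor. The file is also made executable, as the exercise asks.

```shell
# Create the script; the quoted 'EOF' prevents any expansion of its content
cat > cow_says_hello.sh << 'EOF'
#!/bin/bash
module load cowsay/3.03
cowsay hello
EOF

# Make the script executable
chmod +x cow_says_hello.sh
```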
Run:
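A sketch of running the script from the folder where it was created. The checks are only there so that the snippet reports a problem instead of failing when the script is missing or when there is no module system.

```shell
script="./cow_says_hello.sh"

if [ -x "$script" ]; then
    # Start the script; it loads the module itself
    "$script" || echo "the script failed: is the module system available?" >&2
else
    echo "$script not found or not executable" >&2
fi
```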
3b. Find out¶
No: the module is loaded only while the script runs. After the script has finished, the module is not loaded in your own shell.
[richel@sens2023598-bianca ~]$ cowsay hello
-bash: cowsay: command not found
[richel@sens2023598-bianca ~]$ ./cow_says_hello.sh
 ________
< hello >
 --------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
[richel@sens2023598-bianca ~]$ cowsay hello
-bash: cowsay: command not found
Conclusion¶
- Use the module system to use centrally installed software that is available on all nodes
- Include versions when loading modules for reproducibility
- Your own installed software, scripts, Python packages etc. are available from their paths
Video¶
- Solution to the exercises: YouTube, Download (.ogv)