Installation guide¶
There are several minor versions of Roddy, which can be downloaded and installed in the same directory. Minor versions mark changes in the Roddy API. Usually Roddy plugins are only compatible with a specific minor version. Installations for the different versions differ a bit, so we list all versions here.
Premises¶
To install and run Roddy, the following need to be installed on your computer or on the execution hosts. At a minimum you need:
- Java and Groovy
- the default plugin
- the base plugin
- zip/unzip
- bash
- the tool lockfile (usually in the procmail mail-processing-package (v3.22), only needed on job execution hosts)
As Roddy is Linux based, you will be able to find most of these in your OS package manager. For the JDK and Groovy, both of which are required on the host on which you run Roddy, you may want to use SDKMan. The following will get you going:
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install groovy 2.4.13
sdk install java 8u151-oracle
The minimum versions we tested Roddy with are Groovy 2.4.7 and Java 8u121, but lower versions might also work.
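Before you continue, you can ask the shell which of the required commands are still missing. The following is only a small sketch: the function name `missing_tools` is made up for this guide and is not part of Roddy.

```shell
# Hypothetical helper (not part of Roddy): print every command from the
# argument list that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage on the host on which you run Roddy:
#   missing_tools java groovy zip unzip bash
# Usage on a job execution host, where lockfile is needed as well:
#   missing_tools bash lockfile
```

If the function prints nothing, all listed commands were found. It does not check versions, so you still need to compare those with the ones given above.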
The remaining instructions deal with the installation of the Roddy environment (basically a set of directories into which the rest is installed) and of Roddy itself.
Roddy 2.2¶
Will not be supported in the future. Releases are only available for legacy plugins.
Roddy 2.3¶
- Clone the repo and select the desired tag.
- Step two depends on your role.
  - If you intend to use Roddy and do not want to develop plugins or Roddy itself:
    - Download any JRE v1.8.* (OpenJDK and SunJDK were tested). Also download Groovy 2.3.*
    - Open up the dist folder in the Roddy directory.
    - Create a folder named runtimeDevel
    - unzip / untar both archives in runtimeDevel
  - If you want to develop Roddy or Roddy plugins:
    - Download any JDK v1.8.* (OpenJDK and SunJDK were tested). Also download Groovy 2.3.*
    - Open up the dist folder in the Roddy directory.
    - Create a folder named runtimeDevel
    - unzip / untar both archives in runtimeDevel
- Optionally unpack one or more of the release zips in dist/bin/ directory.
Please see Versioning for information about how to mix different versions of Roddy in the same directory.
You will need an application properties file (see Application properties files) with the configuration for accessing your computing environment.
Note that the Roddy 2.4 releases are actually Roddy 3 pre-releases. Please use Roddy 3 instead.
Roddy 3¶
For Roddy version 3, ZIPs are deployed to Github Releases (continuous deployment via Travis). A Roddy installed this way contains all Java library dependencies except the JDK and Groovy, which are both needed during start-up, before Roddy itself is started.
- Release ZIPs for Roddy and the Roddy environment are available via Github Releases. Download the latest release of the RoddyEnv ZIP, unpack it, and change into the Roddy environment directory (e.g. “Roddy”). This “environment” is basically a specific directory structure and a start-up script that allow you to install multiple Roddy versions in parallel.
- After that you can install arbitrary releases of the Roddy ZIP into dist/bin/$major.$minor.$patch directories.
- Finally the default and base plugin repositories need to be cloned into the dist/plugins/ directory.
pushd dist/plugins
git clone https://github.com/TheRoddyWMS/Roddy-Default-Plugin.git DefaultPlugin
git clone https://github.com/TheRoddyWMS/Roddy-Base-Plugin PluginBase
popd
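The second step above, unpacking a Roddy release ZIP into a version directory, could be sketched as follows. This is an illustration under assumptions: the function name `install_roddy_release` is made up, you supply the ZIP file name and version yourself, and the sketch assumes that the archive does not wrap its content in an extra top-level directory (check with `unzip -l` first).

```shell
# Hypothetical helper (not part of Roddy): unpack a downloaded Roddy release
# ZIP into dist/bin/$major.$minor.$patch below the Roddy environment directory.
# Assumption: the archive has no extra top-level directory of its own.
install_roddy_release() {
    zip_file="$1"
    version="$2"
    env_dir="${3:-.}"

    # This sketch only accepts purely numeric major.minor.patch versions.
    case "$version" in
        *[!0-9.]* | .* | *. | *..* | *.*.*.*) version_ok=no ;;
        *.*.*) version_ok=yes ;;
        *) version_ok=no ;;
    esac
    if [ "$version_ok" != yes ]; then
        echo "expected major.minor.patch, got: $version" >&2
        return 1
    fi

    mkdir -p "$env_dir/dist/bin/$version" || return 1
    unzip -q "$zip_file" -d "$env_dir/dist/bin/$version"
}

# Usage from inside the Roddy environment directory
# (file name and version are placeholders):
#   install_roddy_release /path/to/downloaded-roddy-release.zip 1.2.3
```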
You will need an application properties file (see Application properties files) with the configuration for accessing your computing environment.
Versioning¶
The Roddy environment with the top-level “roddy.sh” allows you to co-install multiple Roddy versions. Simply install the different versions of Roddy, e.g. from the release zips, into directories in “dist/bin” following the naming scheme “dist/bin/$major.$minor.$patch”. The desired version can then be selected during Roddy invocations using the “--useRoddyVersion” parameter.
Additionally, Roddy is capable of handling multiple versions of the same workflow plugin. Therefore, if you install specific plugins, such as the ACEseq plugin, you will need specific versions of e.g. the default and base plugins. To find them, first check the “buildinfo.txt” of the plugin of interest to see which plugins and versions it needs, and then proceed in the same way from plugin to plugin recursively.
Specific plugin versions need to be installed in directories named after the scheme $pluginName_$major.$minor.$patch[-$revision] (the revision is optional). Usually you can get specific versions (official releases of plugins) from the Github Releases of the plugin. Alternatively, clone the repository into an appropriately named directory and then check out the tag with the version of interest.
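The naming scheme can be sketched in the shell like this. The function `plugin_dir_name` is a made-up helper, and the plugin name, URL and tag in the usage comment are placeholders. The sketch also assumes that the Git tag is named like the version, which you should check in the plugin repository.

```shell
# Hypothetical helper (not part of Roddy): build the directory name
# $pluginName_$major.$minor.$patch[-$revision] for a plugin installation.
plugin_dir_name() {
    name="$1"
    version="$2"
    revision="${3:-}"
    # The braces matter: "$name_$version" would look up a variable "name_".
    if [ -n "$revision" ]; then
        echo "${name}_${version}-${revision}"
    else
        echo "${name}_${version}"
    fi
}

# Usage with placeholder values:
#   cd /path/to/your/plugins
#   git clone https://example.org/MyWorkflow.git "$(plugin_dir_name MyWorkflow 1.2.3)"
#   git -C "$(plugin_dir_name MyWorkflow 1.2.3)" checkout 1.2.3
```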
In the long run, this manual plugin installation mechanism may be automated.
[Optional] Setup GroovyServ¶
Roddy uses Groovy, but Groovy is a bit slow to start. Roddy 3.0+ therefore supports GroovyServ, which tremendously decreases the startup time of Groovy applications. Roddy will try to download and set it up automatically. If that fails or if you want to set it up yourself, do the following in your Roddy directory:
mkdir -p dist/runtime
cd dist/runtime
# Download the GroovyServ binary zip archive from the GroovyServ download site,
# unzip it and delete the archive afterwards.
unzip groovyserv*.zip
rm groovyserv*.zip
# Last step: add the Groovy and Java binary folders to your PATH environment variable. This
# is e.g. set in your ~/.bashrc file.
That’s it. If you instead want to disable GroovyServ, do the following:
mkdir -p dist/runtime
cd dist/runtime
touch gservforbidden
If you create the file, Roddy will not use GroovyServ.
Note
This setup was tested using GroovyServ 1.1.0!
Test your installation¶
Head over to the Roddy directory and do
./roddy.sh
If everything is properly done, Roddy will print its help screen.
Quick build instructions¶
If you want to build Roddy yourself, clone the repository. The repository already contains the Roddy environment. Change into this directory and use Gradle to build the Roddy JAR. In summary:
git clone https://github.com/eilslabs/Roddy.git
cd Roddy
git checkout develop
pushd dist/plugins
git clone https://github.com/TheRoddyWMS/Roddy-Default-Plugin.git DefaultPlugin
git clone https://github.com/TheRoddyWMS/Roddy-Base-Plugin PluginBase
popd
./gradlew build
The example will build Roddy from the develop branch. If you use this branch, the dependencies BatchEuphoria and RoddyToolLib will automatically be pulled from Github with their development snapshots. On the master branch we fix the version numbers of these two dependencies. Note that the two basic plugins are required for some of the integration tests.
Full developer build instructions¶
If you want to work with a full Roddy installation and its dependencies, we suggest you create a dedicated directory to install everything. Roddy and its dependencies [BatchEuphoria](https://github.com/TheRoddyWMS/BatchEuphoria) and [RoddyToolLib](https://github.com/TheRoddyWMS/RoddyToolLib) use the Gradle build system. Specifically, they use the [composite build feature](https://docs.gradle.org/current/userguide/composite_builds.html) of Gradle. Let’s get your own clones of the BatchEuphoria and RoddyToolLib Git repos and reference them with the --include-build parameter:
mkdir RoddyProject
cd RoddyProject
git clone https://github.com/TheRoddyWMS/RoddyToolLib.git
git clone https://github.com/TheRoddyWMS/BatchEuphoria.git
git clone https://github.com/TheRoddyWMS/Roddy.git
mkdir -p Roddy/dist/plugins
pushd Roddy/dist/plugins
git clone https://github.com/TheRoddyWMS/Roddy-Default-Plugin.git DefaultPlugin
git clone https://github.com/TheRoddyWMS/Roddy-Base-Plugin PluginBase
popd
cd Roddy
./gradlew build --include-build ../RoddyToolLib/ --include-build ../BatchEuphoria/
Via the --include-build options you make sure to use the local “development” installations of the libraries.
Gradle and proxies¶
If you are behind a proxy you should first configure the proxy for Gradle. Create $HOME/.gradle/gradle.properties with the appropriate settings. You can use the following template:
systemProp.http.proxyHost=
systemProp.http.proxyPort=
systemProp.https.proxyHost=
systemProp.https.proxyPort=
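If you prefer to create the file from the shell, a sketch like the following fills in the template. The function name `write_gradle_proxy` is made up, and the host and port in the usage comment are placeholders for the values of your site. Note that the function overwrites an existing file.

```shell
# Hypothetical helper: write the proxy template from above, filled with the
# given host and port, to a Gradle properties file (overwriting that file).
write_gradle_proxy() {
    proxy_host="$1"
    proxy_port="$2"
    target="${3:-$HOME/.gradle/gradle.properties}"
    mkdir -p "$(dirname "$target")" || return 1
    cat > "$target" <<EOF
systemProp.http.proxyHost=$proxy_host
systemProp.http.proxyPort=$proxy_port
systemProp.https.proxyHost=$proxy_host
systemProp.https.proxyPort=$proxy_port
EOF
}

# Usage with placeholder values:
#   write_gradle_proxy proxy.example.org 3128
```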
IntelliJ¶
- Download and activate the Gradle-plugin of IntelliJ, if you have not done so already.
- Open a new project. The project should be an “Empty Project”.
- Clone the RoddyToolLib, BatchEuphoria and Roddy into your new empty project. Also the DefaultPlugin and PluginBase plugins are required for some of the integration tests and should be present for most useful things you can do with Roddy.
cd $yourProjectDirectory
git clone https://github.com/TheRoddyWMS/RoddyToolLib
git clone https://github.com/TheRoddyWMS/BatchEuphoria
git clone https://github.com/TheRoddyWMS/Roddy
mkdir -p Roddy/dist/plugins
pushd Roddy/dist/plugins
git clone https://github.com/TheRoddyWMS/Roddy-Default-Plugin.git DefaultPlugin
git clone https://github.com/TheRoddyWMS/Roddy-Base-Plugin PluginBase
popd
- Import the five source repositories via “File” -> “Project Structure” -> “+” (Module pane). For import select the build.gradle from the specific repository.
- Open the Gradle tasks window by clicking on the Gradle symbol on the task bar. If there is no Gradle symbol in the tool bars of IntelliJ, select “View” -> “Tool Windows” -> “Gradle”.
- Configure the composite Gradle builds by right-clicking on the gradle project.
- Now if you go to the Gradle toolbar and select the build target of Roddy, then RoddyToolLib, BatchEuphoria and Roddy itself will be built with Gradle.
Setting up plugins in the project¶
After these initial steps you can add your Roddy plugins to your project. We usually clone the plugin repositories into a dedicated plugins_R3.0/ directory just beneath the root project directory (the now not so empty project that you initially created). This directory is then used for the usePluginVersion command-line option or in the applicationProperties.ini. The only exceptions are the DefaultPlugin and PluginBase, which need to be in the Roddy/dist/plugins directory.
In IntelliJ, then add the repository to your project as a module, ideally by directly importing the .iml file from the repository. Make sure that the plugin modules depend on the PluginBase, Roddy_main and maybe RoddyToolLib_main modules.
Running Roddy from within IntelliJ¶
To run Roddy with parameters from IntelliJ you need an “Application” configuration with -enableassertions -Xms4m -Xmx50m as VM options, the path to your Roddy/ repository as working dir and de.dkfz.roddy.Roddy as Main class. When debugging plugin code you should use the plugin’s repository root for “Use class path of module”.
Example workflow¶
If you want to try out Roddy, you can download our example workflow. The workflow is wrapped inside a Docker container and you can use it to test some of Roddy’s functionality in a controlled environment. The workflow itself is used for somatic small indel calling. It is based on Platypus and accepts paired control and tumor BAM files. Output files are in VCF format.
Installation¶
Make sure you have a running Docker environment! Open the de.NBI / HD-HuB ownCloud repository.
Download the Docker images:
- The base image for our example: roddybaseimage.tar.gz
- The workflow image itself: roddyplatypus.tar.gz
and import them into your Docker environment.
Also download:
- The workflow dependencies: PlatypusIndelCallingWorkflowDependencies.tar.gz
- The scripts to run the workflow: PlatypusIndelCallingBundle.tar.gz
Unpack the scripts file. The bundle directory will be created. Unpack the dependencies file and move the folder dependenciesPlatypusIndel/ to the bundle directory. Create a working directory and give it open access rights, e.g. with chmod 777.
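The unpacking steps could look like the following sketch. The function name `prepare_example_dirs` is made up. Because the name of the directory that the bundle archive creates is not given here, you pass it in yourself. The Docker image import is deliberately left out.

```shell
# Hypothetical helper: unpack the two archives in the current directory, move
# the dependencies into the bundle directory and create the working directory.
# Arguments: bundle archive, dependencies archive, bundle directory (as
# created by the bundle archive), working directory.
prepare_example_dirs() {
    bundle_archive="$1"
    deps_archive="$2"
    bundle_dir="$3"
    work_dir="$4"
    tar -xzf "$bundle_archive" || return 1
    tar -xzf "$deps_archive" || return 1
    mv dependenciesPlatypusIndel "$bundle_dir"/ || return 1
    mkdir -p "$work_dir" && chmod 777 "$work_dir"
}

# Usage (the last two arguments are placeholders):
#   prepare_example_dirs PlatypusIndelCallingBundle.tar.gz \
#       PlatypusIndelCallingWorkflowDependencies.tar.gz \
#       [BUNDLE_DIR] [PATH_TO_YOUR_WORKING_DIR]
```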
Now you are nearly prepared and only need files which you can analyse. For this example, you will need a control and a tumor bam file plus their index files. The bam files need to be aligned with BWA (we used versions >= 0.7.8) against hs37d5 and duplication marking should be turned on.
Example usage¶
The Docker container uses a slightly simplified Roddy syntax. Head into the extracted bundle directory. There you will find the roddy.sh script.
You can call the script in the following way:
bash roddy.sh (mode) (dataset id) (control bam) (tumor bam) (work directory)
So to just run the example:
bash roddy.sh run TEST [PATH_TO_YOUR_CONTROL] [PATH_TO_YOUR_TUMOR] [PATH_TO_YOUR_WORKING_DIR]
If everything is set up properly, the Roddy Docker container will now start and run the workflow. The workflow will take several hours to finish, so make sure to run it in e.g. a screen session.
Users guide¶
Walkthrough¶
This guide will show you how to set up Roddy so that it starts and is ready to run an analysis for a project. There is a sample NGS workflow available, which will be used in the examples.
For a short overview about Roddy usage navigate to Cheat sheet.
If you do not already have a running installation, please see Installation guide for instructions to install Roddy.
After installing Roddy, please head to the Roddy folder and run the Roddy start script:
bash roddy.sh
If everything is good, Roddy will start and print the help.
Roddy is supposed to be a rapid development and management platform for cluster based workflows.
The currently supported ways of execution are:
- job submission using qsub with PBS or SGE
- monolithic, direct execution of jobs via Roddy
- submission or execution on the local machine or via SSH
To support you with your workflows, Roddy offers you several options:
help
Shows a list of available configuration files in all configured paths.
[...]
--usePluginVersion=(...,...) - Supply a list of used plugins and versions.
Roddy version 2.2.78 build at Fri Sep 18 13:55:26 CEST 2015
Now you can go on and prepare the configuration for your project.
Setup Roddy Configuration¶
You need two types of configurations:
- Application configuration file (by default applicationProperties.ini): An ini-formatted file that configures properties of the Roddy application, the batch processing system (PBS, SGE, etc.), default paths for configurations and plugins, etc.
- An XML configuration file for your project with all parameters of the workflow that you want to use.
Application ini file¶
Roddy uses an ini file to control the application behaviour. The ini file defines several things:
- Which job system you use
- How you connect to the processing system
- Where Roddy shall search for plugins and configuration files
By default, Roddy will use the ini file located at $HOME/.roddy/applicationProperties.ini, but you can select any other file with the --useconfig command-line option.
The ini files are explained in detail in Application properties files. Here you’ll see a brief overview:
[DIRECTORIES]
configurationDirectories=[FOLDER_WITH_CONFIGURATION_FILES]
pluginDirectories=[FOLDER_WITH_PLUGINS]
scratchBaseDirectory=[FOLDER_ON_EXECUTION_HOSTS]
[COMMANDS]
jobManagerClass=de.dkfz.roddy.execution.jobs.direct.synchronousexecution.DirectSynchronousExecutionJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.pbs.PBSJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.sge.SGEJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.slurm.SlurmJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.lsf.rest.LSFRestJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.lsf.LSFJobManager
commandFactoryUpdateInterval=300
commandLogTruncate=80 # Truncate logged commands to this length. If <= 0, then no truncation.
[COMMANDLINE]
executionServiceUser=USERNAME
executionServiceClass=de.dkfz.roddy.execution.io.LocalExecutionService
#executionServiceClass=de.dkfz.roddy.execution.io.SSHExecutionService
executionServiceHost=[YOURHOST]
executionServiceAuth=keyfile
#executionServiceAuth=password
executionServicePasswd=
executionServiceStorePassword=false
executionServiceUseCompression=false
fileSystemInfoProviderClass=de.dkfz.roddy.execution.io.fs.FileSystemInfoProvider
The file is divided into several sections, but this is mainly to keep a better order:
- COMMON is for setting up general things
- DIRECTORIES
- COMMANDS
- COMMANDLINE is to set up the command line interface
We try to keep every possible option in the ini file, so you should basically be able to just select what you need and to fill in the missing parts.
Usually, you just need to change the following settings:
- jobManagerClass - Selects the cluster system backend
- CLI.executionServiceClass - Selects, if you want to access your system via SSH or directly
- CLI.executionServiceAuth - keyfile or password?
- CLI.executionServiceHost - The host, if you select SSH
- CLI.executionServicePasswd - The password for your system, if using SSH and no keyfiles
- CLI.executionServiceStorePassword - If you want to store the password, put in true, however, the password is stored in plain-text!
- scratchBaseDirectory - A path to a preferably fast local storage on the execution hosts. E.g. /local/$USER
You might want to remember or store away the above options for future use, as it’s likely that they won’t change too often. For you, the more important settings might be:
- configurationDirectories - Put in a comma separated list of directories, where you keep your project XML files
- pluginDirectories - Put in a comma separated list of the directories, where your plugins are stored. Note, that the folder dist/plugins in the Roddy base directory, which contains the PluginBase and DefaultPlugin, will always be imported. You do not need to set this one.
You can either copy the content from above or you can also use Roddy to help you with the setup. This will be explained later on.
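Since configurationDirectories and pluginDirectories are comma-separated lists of directories, a typo in one entry is easy to miss. The following sketch looks up a key in the ini file and prints the listed directories that do not exist. Both function names are made up and not part of Roddy. The lookup ignores section names and does not strip trailing comments on a line.

```shell
# Hypothetical helper (not part of Roddy): print the value of key $2 from ini
# file $1. Commented-out lines such as "#key=..." are not matched.
ini_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | head -n 1
}

# Print every directory of the comma-separated value of key $2 in ini file $1
# that does not exist on this machine.
missing_ini_dirs() {
    ini_value "$1" "$2" | tr ',' '\n' | while IFS= read -r dir; do
        [ -z "$dir" ] || [ -d "$dir" ] || echo "$dir"
    done
}

# Usage:
#   missing_ini_dirs "$HOME/.roddy/applicationProperties.ini" pluginDirectories
#   missing_ini_dirs "$HOME/.roddy/applicationProperties.ini" configurationDirectories
```

Keep in mind that this only checks the machine on which you run it. scratchBaseDirectory, for example, refers to the execution hosts.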
Project configuration files¶
All workflow-specific settings are stored in XML files.
The configuration files are multi-level, which means you can
- Import configuration files into other configuration files
- Define several levels of configurations and subconfigurations in one file
<configuration configurationType='project'
               name='TestProject'
               description='A very small project configuration for some workflow tests.'
               imports="baseProject"
               usedresourcessize="m">
    <availableAnalyses>
        <analysis id='testWorkflow' configuration='TestAnalysis' useplugin="DefaultPlugin:develop"/>
        <analysis id='qualityControl' configuration='QualityControlAnalysis' useplugin="QualityControlPlugin:1.0.10"/>
    </availableAnalyses>
    <configurationvalues>
        <cvalue name='inputBaseDirectory' value='$USERHOME/roddyTests/${projectName}/data' type='path'/>
        <cvalue name='outputBaseDirectory' value='$USERHOME/roddyTests/${projectName}/results' type='path'/>
    </configurationvalues>
    <subconfigurations>
        <configuration name="verysmall" usedresourcessize="xs" inheritAnalyses="true" />
    </subconfigurations>
</configuration>
As a user you normally only need to create a project-specific file like the one above. Roddy also offers a command to help you set it up.
Configuration files contain several sections where Roddy lets you define things like configuration values, tools and even filenames. But you probably won’t need that now, so we’ll concentrate on a very basic project configuration like the one above. You can find a detailed guide in XML configuration files. You might concentrate on the configuration values part, as this will be the part which you probably need most.
Uhhh, ok, so what is in the above example?
Good that you ask! First you’ll find a standard XML format containing the configuration header. If it is a project configuration file, then your file must be named with the prefix “projects”; otherwise Roddy will not recognize it as a project configuration. (You could also create other files, e.g. one which contains basic settings for your working environment such as commonly used binaries and reference files.)
<configuration configurationType='project'
name='TestProject'
description='A very small project configuration for some workflow tests.'
imports="baseProject"
usedresourcessize="m">
The header of the configuration must contain the following:
- The configurationType (in this case “project”)
- A configuration name which must not contain “.” or “ ”
It may contain:
- A description
- Imports of other configuration files. imports can hold a comma-separated list of other configuration IDs / names
- A switch for the size of the data you are dealing with. In the analysis configuration every tool can have different levels of resources in memory, CPU, and walltime. This option in the project XML allows you to select a project-wide resource requirement level for the size of the input data expected in the project. The values t, xs, s, m, l, xl are allowed and the default is “l”.
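Two of the header rules above are easy to check before you hand a file to Roddy. The sketch below does this in the shell. Both function names are made up for this guide, and Roddy itself may check more than this.

```shell
# Hypothetical checks (not part of Roddy) for two header attributes.

# A configuration name must not contain "." or " ". An empty name is
# rejected as well, because the name is mandatory.
valid_config_name() {
    case "$1" in
        '' | *.* | *' '*) return 1 ;;
        *) return 0 ;;
    esac
}

# usedresourcessize must be one of t, xs, s, m, l, xl.
valid_resource_size() {
    case "$1" in
        t | xs | s | m | l | xl) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   valid_config_name "TestProject" && echo "name is fine"
#   valid_resource_size "m" || echo "unknown resource size"
```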
Directly after the header, you will find a list of the imported workflows for your project.
<availableAnalyses>
<analysis id='testWorkflow' configuration='TestAnalysis' useplugin="DefaultPlugin:develop"/>
<analysis id='qualityControl' configuration='QualityControlAnalysis' useplugin="QualityControlPlugin:1.0.10"/>
</availableAnalyses>
Each line can enable a workflow / analysis for your project. To make such a line work, you need to set:
- id an arbitrary name that identifies the workflow in your project. This name will be used to call the workflow from the command line.
- configuration to identify the original analysis configuration id that is defined in the analysis XML in the plugin. You can also import an analysis several times with a different id value.
- finally, useplugin is used to select the plugin and the plugin’s version in which the analysis is searched. This parameter is optional.
The corresponding configuration files are automatically searched in your plugins. The active plugins are retrieved from the plugin directories set in your application ini file.
Next comes the part where you set the project’s input and output folders.
<configurationvalues>
<cvalue name='inputBaseDirectory' value='$USERHOME/roddyTests/${projectName}/data' type='path'/>
<cvalue name='outputBaseDirectory' value='$USERHOME/roddyTests/${projectName}/results' type='path'/>
</configurationvalues>
In most cases, you should be done right now.
Analysis-specific configuration¶
Occasionally, you may want to set specific parameters only for selected analyses. In this case you can add subconfigurations:
<subconfigurations>
<configuration name="verysmall" usedresourcessize="xs" inheritAnalyses="true" />
</subconfigurations>
Subconfigurations are defined exactly like the main configuration and can contain the same sections. Each value that you define overrides the value of the parent configuration. Subconfigurations can be nested and affect all tags that are nested within.
Built-in configuration creation / updates¶
Use Roddy to create an initial project configuration¶
Roddy can help you to create an initial project configuration with one command.
bash roddy.sh prepareprojectconfig create [targetprojectfolder] --useRoddyVersion=develop
The command will:
- Create a target folder structure like [targetprojectfolder]/roddyProject/versions/version_[current date]_[current time]
- Copy a default ini file to the target folder [targetprojectfolder]/applicationProperties.ini
- Copy a default project XML to the target folder [targetprojectfolder]/project.xml
You can now adapt both the ini file and the XML file to your needs. Do not forget to add the freshly created folder as a configuration folder in the ini file! Please see the explanation above to decide which settings are appropriate for your system.
To use the ini file, you can call Roddy in the following way:
bash roddy.sh --useconfig=[targetprojectfolder]/applicationProperties.ini
Use Roddy to update an existing project configuration to a new version¶
Sometimes it is helpful to keep several versions of the project configuration files. This ensures that you can always go back to an old version of your config. To support this, you can call Roddy in the following way:
bash roddy.sh prepareprojectconfig update [targetprojectfolder]
Roddy will then search the latest existing project configuration version and create a new folder with a copy in it.
So after you call Roddy, you’ll find e.g.:
- [targetprojectfolder]/roddyProject/versions/version_20150719_111328 and
- [targetprojectfolder]/roddyProject/versions/version_20150925_134527
The new folder will contain a copy of the contents of the old folder. You can call Roddy afterwards with the new ini file.
IMPORTANT: Roddy does not update the configurationDirectories option in the new applicationProperties.ini. As of now, you need to manually adapt the configuration directories in the ini file!
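Because the version folders are named version_[date]_[time], sorting them by name also sorts them by age. A small sketch for finding the newest folder, e.g. before you adapt its ini file by hand, is shown below. The function name `latest_config_version` is made up and not part of Roddy.

```shell
# Hypothetical helper: print the newest version folder below
# [targetprojectfolder]/roddyProject/versions. The timestamp in the folder
# name sorts lexicographically, so the last entry is the latest one.
latest_config_version() {
    ls -d "$1"/roddyProject/versions/version_* 2>/dev/null | sort | tail -n 1
}

# Usage (assuming the ini file sits inside the version folder):
#   grep configurationDirectories \
#       "$(latest_config_version [targetprojectfolder])/applicationProperties.ini"
```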
Check if things are set up properly¶
With configurations of complex workflows, it may become very tedious and error prone to ensure that everything is configured correctly. If you work with multiple projects, the first thing to check is the use of the correct configuration files. To find out, if you did everything right, Roddy offers you several options:
bash roddy.sh showconfigpaths --useconfig=[pathOfIniFile]
This will show you all available configuration files in your configured paths. Note, that this won’t list analysis XML files, as these are loaded in a later stage, where Roddy has knowledge about loaded plugins.
With the following command you can check, whether you set the right paths and if all your files are available:
bash roddy.sh listdatasets [project]@[analysis] --useconfig=[pathOfIniFile]
Note
Roddy supports parsing metadata such as dataset identifiers from paths but additionally has a MetadataTable facility that simplifies metadata input via a table. Some workflows may also be implemented to get the metadata from dedicated configuration values. Therefore, whether this command works may depend on the specific workflow and may require additional command-line parameters or configuration values. Still it can be extremely useful to get a list of all findable datasets.
If everything is properly set and you use the right configuration and analysis, Roddy will be able to search the input and output folders in your project configuration file. It will then display a list of all found datasets. Roddy will search both folders and combine the results, so you will not get duplicates. If you see the list of your datasets, you can now run your analysis, but there are a few more things you can try first.
bash roddy.sh printruntimeconfig [project]@[analysis] [pid] --useconfig=[pathOfIniFile]
If you run a workflow for the first time, it might make sense to check the generated runtime configuration file before you start a process. The above command will do that for the pid set by you. Is everything right? Good, then you can go on and start a process. If not, you need to check your configuration files.
Run a project¶
There is one more thing you can do before starting a process: You can call Roddy with testrun:
bash roddy.sh testrun [project]@[analysis] [pattern]/[ALL] --useconfig=[pathOfIniFile]
testrun does nearly the same thing as run, except that it does not start cluster jobs. Instead, it lists all the jobs that would be executed. Please take a close look at the output for all the jobs. testrun and all the other run commands are triggered with a dataset id pattern, which is explained next.
Roddy selects and lists datasets much like ls does. This means you can use all sorts of wildcards and patterns. Valid patterns are e.g. H063, *-A*, ???3- and so on. But keep in mind that wildcards may already be resolved by the shell (Bash is always good for surprises). testrun will help you find out whether the patterns you use are working. Also note that a plain * won’t work, at least for Bash. If you want to run all datasets, use the dataset selector [ALL].
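To see what the shell does to a pattern before a program ever receives it, you can count the arguments that arrive. The sketch below is plain shell behaviour and has nothing Roddy-specific in it. `count_args` is a made-up helper name.

```shell
# Print how many arguments a command would receive.
count_args() {
    echo "$#"
}

# In a directory that happens to contain the files H063-A1 and H064-A2:
#   count_args *-A*      # prints 2: the shell expanded the pattern first
#   count_args '*-A*'    # prints 1: the quoted pattern arrives literally
#
# Quoting therefore hands the pattern over unchanged, e.g.:
#   bash roddy.sh testrun [project]@[analysis] '*-A*' --useconfig=[pathOfIniFile]
```

Unquoted patterns only get expanded if they match something in your current directory, which is why the problem may show up in one directory and not in another.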
Now let’s look at an example for a job output:
0x789C44FF73F: fastqc [ -l walltime=1000:00:00]
pid : H006-1
PID : H006-1
CONFIG_FILE : [ exDir]/runtimeConfig.sh
ANALYSIS_DIR : /home/heinold/temp/roddyLocalTest/testproject
TOOLSDIR : [ exDir]/analysisTools/qcPipeline
TOOL_ID : fastqc
RAW_SEQ : [ inDir]/control/paired/run120918_SN7001149_0101_AC16PKACXX/sequence/1_B_GCCAAT_L002_R1_complete_filtered.fastq.gz
FILENAME_FASTQC : [outDir]/fastx_qc/control_run120918_SN7001149_0101_AC16PKACXX_1_B_GCCAAT_L002_R1_sequence_fastqc.zip
RODDY_PARENT_JOBS : parameterArray=()
This is the output for a job calling fastqc on a fastq file; to keep things simple, we just named it fastqc. First, there is a fake job id, which is used in test cases. If you call run instead of testrun, this will be replaced by a job identifier produced by your processing backend (PBS, SGE, etc.). The job id is followed by the resource settings specific to your configured processing backend. Here it is the walltime setting for a PBS system. The next lines are the parameters which will be passed to the job. Some of the parameters are set for every job, including pid/PID (“patient id”, this is the “dataset”), CONFIG_FILE or ANALYSIS_DIR. The abbreviations like [exDir] or [inDir] are explained in the header of the testrun output. They are there to make things more readable. Other parameters like e.g. FILENAME_FASTQC are job specific. In this case, there is a fastq file for the job input and a zip file containing the job output. Filenames are based on rules which are normally included in analysis configuration files.
Let’s see, showconfigpaths worked, listdatasets worked, printruntimeconfig worked and also testrun. What’s left? Right: run!
Let’s start and run something.
bash roddy.sh run [project]@[analysis] [pattern]/[ALL] --useconfig=[pathOfIniFile]
Instead of the output of testrun, Roddy will now try and run the jobs on your processing backend. If all jobs fail, you might have the wrong settings. If some fail, there might be problems with the backend. Roddy will also try to tell you what sort of problems there are. But this won’t work in every case. We won’t bother you with the full output now, but something like the following will show up in case of success:
Finally, you started something. Now all you have to do is wait until your process finishes. Roddy again offers you several commands to help you keep track of your progress.
Process tracking, Debugging and Rerunning a process¶
Sometimes, it can be nice to know if a process is still running or if there were faulty jobs and sometimes you just want to restart a process. Roddy has what you need: checkworkflowstatus, testrerun and rerun.
bash roddy.sh checkworkflowstatus [project]@[analysis] [pattern]/[ALL] --useconfig=[pathOfIniFile]
checkworkflowstatus will create a table listing your selection of datasets and their states:
[outDir]: /home/heinold/temp/roddyLocalTest/testproject/rpp
Dataset  State      #  OK  ERR  User      Folder / Message
A100     UNSTARTED  0  0   0              Not executed (or the Roddy log files were deleted).
A200     UNSTARTED  0  0   0              Not executed (or the Roddy log files were deleted).
stds     OK         3  3   0    testuser  /home/testuser/temp/roddyLocalTest/testproject...
The table has several columns:
- Dataset is self-explanatory and shows you which dataset the line is for
- State is the state for the last execution of a dataset
- # is the number of started jobs for a process
- OK is the number of good jobs
- ERR is the number of faulty jobs
- User is the user which started the last process
- Folder / Message is the execution store folder for the process
You can e.g. use the output to grep for states, folders and other things. If there are erroneous jobs, you now have the info to look for those jobs. The next section will show you how to do this. For now, we’ll consider the jobs as failed for technical reasons and show you how to restart them.
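As an example for such filtering, the following sketch prints the IDs of all datasets in a given state. It assumes that the table is written to standard output and that its columns are separated by whitespace as in the example above. The function name `datasets_in_state` is made up and not part of Roddy.

```shell
# Hypothetical helper: read checkworkflowstatus output on stdin and print the
# dataset IDs (first column) whose State (second column) equals $1.
datasets_in_state() {
    awk -v state="$1" '$2 == state { print $1 }'
}

# Usage:
#   bash roddy.sh checkworkflowstatus [project]@[analysis] [pattern]/[ALL] \
#       --useconfig=[pathOfIniFile] | datasets_in_state UNSTARTED
```

For the table shown above, this would print A100 and A200.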
Roddy’s restart / rerun option tries to start only jobs which need to be run. For this, it creates a list of all the output files which it knows and compares these files with the existing files on disk. There are no consistency checks done, so files with a size of zero are also taken into account. If a job has failed, all of its descendants are automatically marked as failed. This is also true when a new job gets started. What the workflow then does is the responsibility of the workflow’s author. Similar to testrun / run, testrerun and rerun will start to process data. However, only necessary jobs will be started.
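The command lines for testrerun and rerun are not spelled out here. The sketch below assumes that they take the same arguments as testrun and run, and only prints the resulting command line so that you can inspect it before running it. `roddy_command` is a made-up helper name.

```shell
# Hypothetical helper: print (not execute) a Roddy command line.
# Assumption: testrerun and rerun take the same arguments as testrun and run.
roddy_command() {
    mode="$1"
    target="$2"    # [project]@[analysis]
    pattern="$3"   # dataset pattern or [ALL]
    ini="$4"       # path of the ini file
    case "$mode" in
        testrun | run | testrerun | rerun) ;;
        *) echo "unsupported mode: $mode" >&2; return 1 ;;
    esac
    printf 'bash roddy.sh %s %s %s --useconfig=%s\n' "$mode" "$target" "$pattern" "$ini"
}

# Usage: look at the dry run first, then at the real rerun.
#   roddy_command testrerun [project]@[analysis] [pattern] [pathOfIniFile]
#   roddy_command rerun [project]@[analysis] [pattern] [pathOfIniFile]
```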
Import list for different workflows:¶
Please consider using only one analysis import per project XML file, if you set configuration variables. Configuration values for different workflows might have the same name, which could lead to misconfigured workflows. If you do not want to create a new file for every analysis, you can still use subconfigurations for the different workflows using the configuration attribute of the analysis tag:
<!-- Roddy 2.2.x -->
<analysis id='snvCalling' configuration='snvCallingAnalysis' useplugin="COWorkflows:1.0.132-4" />
<analysis id='indelCalling' configuration='indelCallingAnalysis' useplugin="COWorkflows:1.0.132-4" />
<analysis id='copyNumberEstimation' configuration='copyNumberEstimationAnalysis' useplugin="CopyNumberEstimationWorkflow:1.0.189" />
<analysis id='delly' configuration='dellyAnalysis' useplugin="DellyWorkflow:0.1.12"/>
<!-- Roddy 2.3.x -->
<analysis id='WES' configuration='exomeAnalysis' useplugin="AlignmentAndQCWorkflows:1.1.39" />
<analysis id='WGS' configuration='qcAnalysis' useplugin="AlignmentAndQCWorkflows:1.1.39" />
<analysis id='postMergeQC' configuration='postMergeQCAnalysis' useplugin="AlignmentAndQCWorkflows:1.1.39"/>
<analysis id='postMergeExomeQC' configuration='postMergeExomeQCAnalysis' useplugin="AlignmentAndQCWorkflows:1.1.39"/>
<!-- Roddy 3 -->
<analysis id='rdw' configuration='snvRecurrenceDetectionAnalysis' useplugin="SNVRecurrenceDetectionWorkflow"/>
<analysis id='WGBS' configuration='bisulfiteCoreAnalysis' useplugin="AlignmentAndQCWorkflows:1.1.39"/>
Cheat sheet¶
This page is for those of you who need to rush in or just need a refresher on Roddy usage. We will mostly list useful commands and that’s it. No big explanations or other things. If you need those, open up the Walkthrough.
Create a new project¶
bash roddy.sh prepareprojectconfig create [targetprojectfolder]
# Open up the applicationProperties.ini. Change:
- The cluster settings
- Add the COProjectConfigurations path which you need.
# Open the XML file. Change:
- The project id in the header
- Add analyses you need (see user guide, last part)
- Add / change values you need (e.g. I/O dir)
Test¶
bash roddy.sh listdatasets [project]@[analysis] --useconfig=[yourinifile]
Testrun / Run¶
bash roddy.sh testrerun [project]@[analysis] [id] --useconfig=[yourinifile]
Command line options¶
Roddy has a wide range of run modes and options which will be explained here. The run modes are basically divided into user options and extended developer options. You can view all options by running Roddy without any parameters.
User options¶
If you do not intend to develop Roddy or Roddy plugins, you can stop reading after this part.
Option | Additional | Description |
---|---|---|
help | | Shows a list of available configuration files in all configured paths. |
printappconfig | [--useconfig={file}] | Prints the currently loaded application properties ini file. |
showconfigpaths | [--useconfig={file}] | Shows a list of available configuration files in all configured paths. |
showfeaturetoggles | | Shows a list of available feature toggles. |
prepareprojectconfig | | Create or update a project xml file and an application properties ini file. |
plugininfo | [--useconfig={file}] | Shows details about the available plugins. |
printpluginreadme | (configuration@analysis) [--useconfig={file}] | Prints the readme file of the currently selected workflow. |
printanalysisxml | (configuration@analysis) [--useconfig={file}] | Prints the analysis xml file of the currently selected workflow. |
validateconfig | (configuration@analysis) [--useconfig={file}] | Tries to find errors in the specified configuration and shows them. |
listworkflows | [filter word] [--shortlist] [--useconfig={file}] | Shows a list of available configurations and analyses. If a filter word is specified, the whole configuration tree is only printed if at least one configuration id in the tree contains the word. |
listdatasets | (configuration@analysis) [--useconfig={file}] | Lists the available datasets for a configuration. |
printruntimeconfig | (configuration@analysis) [--useconfig={file}] [--extendedlist] [--showentrysources] | Basically calls testrun but prints out the converted / prepared runtime configuration script content. --extendedlist shows all stored values (also e.g. tool entries) and works only in combination with --showentrysources. --showentrysources shows the source file of the entry in addition to the value. |
testrun | (configuration@analysis) [pid_0,..,pid_n] [--useconfig={file}] | Displays the current workflow status for the given datasets. |
testrerun | (configuration@analysis) [pid_0,..,pid_n] [--useconfig={file}] | Displays the current workflow status for the given datasets. |
run | (configuration@analysis) [pid_0,..,pid_n] [--waitforjobs] [--useconfig={file}] | Runs a workflow with the configured Jobfactory. Does not check if the workflow is already running on the cluster. |
rerun | (configuration@analysis) [pid_0,..,pid_n] [--waitforjobs] [--useconfig={file}] | Reruns a workflow starting only the parts which did not produce valid files. Does not check if the workflow is already running on the cluster. |
cleanup | (configuration@analysis) [pid_0,..,pid_n] [--useconfig={file}] | Calls a workflow's cleanup method or a setup cleanup script to clean (i.e. remove or set to file size zero) output files. Aborts the running jobs of a workflow for a pid. |
checkworkflowstatus | (configuration@analysis) [pid_0,..,pid_n] [--detailed] [--useconfig={file}] | Shows a generic overview about all datasets for a configuration. If some datasets are selected, a more detailed output is generated. If detailed is set, information about all started jobs and their status is shown. |
Common additional options¶
These modes can be parametrized in various ways. Here is a summary of all options, but note that not all of them make sense in all modes.
Option | Argument | Description |
---|---|---|
--useconfig | {file} | Use {file} as the application configuration. |
--c | {file} | The order is: full path, .roddy folder, Roddy directory. |
--verbositylevel | {1,3,5} | Set how much Roddy will print to the console, 1 is default, 3 is more, 5 is a lot. |
--v | | Set verbosity to 3. |
--vv | | Set verbosity to 5. |
--useiodir | [fileIn],{fileOut} | Use fileIn/fileOut as the base input and output directories for your project. If fileOut is not specified, fileIn is used for that as well. |
--usemetadatatable | {file},[format] | Tell Roddy to use an input table to load metadata and input data and available datasets. The format specifier can be one of: tsv, csv or excel. |
--waitforjobs | | Let Roddy wait for all submitted jobs to finish. |
--disabletrackonlyuserjobs | | By default, Roddy will only track jobs of the current user. The switch tells Roddy to track all jobs. |
--disablestrictfilechecks | | Tell Roddy to ignore missing files. By default, Roddy checks if all necessary files exist. |
--ignoreconfigurationerrors | | Tell Roddy to ignore configuration errors. By default, Roddy will exit if configuration errors are detected. |
--ignorecvalueduplicates | | Tell Roddy to ignore duplicate configuration values within the same configuration value block. By default, Roddy will exit if duplicates are found. |
--forcenativepluginconversion | | Tell Roddy to override any existing converted Native plugin. By default Roddy will prevent this. |
--forcekeepexecutiondirectory | | Tell Roddy to keep execution directories. By default Roddy will delete them, if no jobs were executed in a run. |
--useRoddyVersion | (version no) | Use a specific Roddy version. |
--rv | (version no) | Like --useRoddyVersion. |
--usePluginVersion | (…,…) | Supply a list of used plugins and versions. |
--configurationDirectories | {path},… | Supply a list of configuration directories. |
--pluginDirectories | {path},… | Supply a list of plugin directories. |
Developer options¶
A good way to compile Roddy is to use just
./gradlew build
Roddy also provides means to compile itself (basically using gradlew again) while additionally increasing the build version number. To some extent Roddy can compile and package plugins for you. For these actions the following modes are available:
Option | Additional | Description |
---|---|---|
compile | | Compiles the roddy library / application. |
pack | | Creates a copy of the ‘develop’ version and puts the version number to the file name. |
compileplugin | (plugin ID) [--useconfig={file}] | Compiles a plugin. |
packplugin | (plugin ID) [--useconfig={file}] | Packages the compiled plugin in dist/plugins and creates a version number for it. Please note that you can indeed override contents of a zip file if you do not update / compile the plugin jar! |
Reproduce Roddy Results¶
Reproducibility in bioinformatics is not an easy task. Even keeping everything identical except the CPU may produce slightly different results, so exact reproducibility is unlikely to be achievable at all. Also exact reproducibility is kind of an exaggerated aim when experimental data is concerned, which is always associated with a measurement error.
Generally, to reproduce bioinformatic results you need the following components
- Configuration values (maybe including random seeds)
- Software versions (but you use Conda, or not ;-) ), including workflow versions
- Reference data
- Input data to be analysed
A workflow management system like the Roddy core itself usually plays only a minor role, except if implementation details (in particular bugs) affect any of the aspects above. You should make sure that you have a good description of all these parameters and, ideally, a backup copy of them.
Exact Reproduction¶
The simplest way to (almost) exactly reproduce a Roddy analysis is of course to use the same Roddy call in the same environment. Roddy stores the call used for each analysis in the roddyCall.sh file in the roddyExecutionStore/exec_* directory. Note that the file may not contain correctly quoted/escaped command line parameters, so the call may not be executable exactly as given in the file. The reason is that the --cvalues parameter may contain shell special characters like ‘;’ or ‘!’ that need to be escaped or quoted in the shell if they are to be interpreted as normal characters.
Here is an example:
/path/to/roddy.sh rerun \
config.WGS@alignment \
pid1,pid2 \
--useconfig=/path/to/configs/applicationProperties-analysis-local-lsf.ini \
--usefeaturetoggleconfig=/path/to/configs/featureToggles.ini \
--usePluginVersion=AlignmentAndQCWorkflows:1.2.73-0 \
--configurationDirectories=/path/to/configs \
--useiodir=/path/to/pidDir/,/path/to/outputDir \
--cvalues=fastq_files:/path/to/fastqs/r1.fq.gz;/path/to/fastqs/r2.fq.gz \
--useRoddyVersion=3.3.3
The content of the file is a one-liner, but here it is broken down into arguments for readability. You’ll notice the --cvalues parameter and that it contains a ‘;’ used in the fastq_files configuration value to delimit the read 1 and read 2 FASTQs. The content of the configuration values can be arbitrary and is completely up to the workflow developer. In this case, the semicolon is also a statement terminator in Bash and will be interpreted as such unless quoted or escaped. Thus, the fix here is of course to quote the value of the parameter:
--cvalues='fastq_files:/path/to/fastqs/r1.fq.gz;/path/to/fastqs/r2.fq.gz' \
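To see what the quoting changes, the following sketch prints each argument exactly as the shell hands it to a command. show_args is a hypothetical helper for illustration only; it is not part of Roddy.

```shell
# Hypothetical helper: print every received argument in brackets, one per
# line, to make visible how the shell split the command line.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Single quotes keep the semicolon inside one argument.
show_args --cvalues='fastq_files:/path/to/fastqs/r1.fq.gz;/path/to/fastqs/r2.fq.gz'

# Escaping the semicolon with a backslash has the same effect.
show_args --cvalues=fastq_files:/path/to/fastqs/r1.fq.gz\;/path/to/fastqs/r2.fq.gz
```

Both calls print a single bracketed line containing the complete value. Without quoting or escaping, the shell would end the command at the semicolon and try to run the second FASTQ path as a separate command.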
Same Analysis on Different Input Data¶
If you want to extend your analyses with new input data you can start with an old Roddy call and adapt it. Often used adaptations are
- change the output directory
- change the input directory
- change the applicationProperties.ini: if you intend to change the cluster configuration or configuration file directories
- change configuration values that are metadata of the workflow specifically derived from the analysed sample (e.g. insert size distribution parameters for the sample)
Note that making these changes manually can be quite tedious and error prone. Usually it is best to write a small script – in particular if you have to run your analyses on many new input datasets.
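Such a script could start out like the sketch below, which only prints the calls so that they can be reviewed before anything is submitted. The function name print_roddy_calls and its argument order are assumptions made for this illustration; the options used are the ones from the example call above.

```shell
# Hypothetical helper: print one Roddy call per dataset id (dry run).
# Usage: print_roddy_calls PROJECT@ANALYSIS INI_FILE INPUT_DIR OUTPUT_DIR PID...
# Values are wrapped in single quotes so that shell special characters in
# them survive when the printed line is executed later.
print_roddy_calls() {
    local analysis="$1" ini="$2" in_dir="$3" out_dir="$4" pid
    shift 4
    for pid in "$@"; do
        printf "bash roddy.sh run '%s' '%s' --useconfig='%s' --useiodir='%s,%s'\n" \
            "$analysis" "$pid" "$ini" "$in_dir" "$out_dir"
    done
}

# Prints two calls, one for each dataset id.
print_roddy_calls config.WGS@alignment /path/to/configs/my.ini \
    /path/to/pidDir /path/to/outputDir pid1 pid2
```

Printing instead of executing keeps the sketch side-effect free. Note that the simple single-quote wrapping breaks if a value itself contains a single quote.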
Beware: Multiple Execution Stores¶
Beware that you may have multiple execution store directories if Roddy was run in “rerun” mode, e.g. to complete a failed job. Here things can get really complicated if the different parts of the output data were produced with different versions or configurations.
- Ideally, you should call Roddy explicitly mentioning the plugin version, like in the example above. Due to the plugin dependency and automatic loading you may want to check that all used plugins were identical. You can check the versionInfo.txt file in the execution stores to learn which versions were used during each execution.
- For configuration files the situation is more complex. Currently, the logfiles are only permanently logged in the $HOME/.roddy/logs directory. But the most important resources for configuration values are the .parameter files for each job in the roddyExecutionStore directories.
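A quick way to spot differing versions is to compare the versionInfo.txt files byte by byte. The sketch below assumes that each execution directory matches roddyExecutionStore/exec_* and contains its own versionInfo.txt; same_versions is a hypothetical helper, and it makes no assumption about the content of the files.

```shell
# Hypothetical helper: succeed if all versionInfo.txt files below the given
# execution store directory are byte-identical, otherwise report the first
# file that differs and fail.
same_versions() {
    local store="$1" first="" file
    for file in "$store"/exec_*/versionInfo.txt; do
        [ -e "$file" ] || continue          # skip the unmatched glob pattern
        if [ -z "$first" ]; then
            first="$file"
        elif ! cmp -s "$first" "$file"; then
            echo "differs: $file (compared with $first)"
            return 1
        fi
    done
    return 0
}
```

It could be called as same_versions /path/to/outputDir/pid1/roddyExecutionStore, where the path is a placeholder for your own output layout.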
If you are paranoid, you should always restart your analysis completely if you intend to change any of the above-mentioned factors like configurations. If you know what you are doing, you may decide differently and change parameters even for an individual dataset, but then you should document what you do. It is really easy to mess up the configuration by changing a configuration value in a file. It is better to use versioned configuration files.
Of course similar problems arise if you have multiple samples processed by different workflow runs and you need a homogeneously processed data set.
Configuration topics¶
Application properties files¶
To successfully manage a workflow, Roddy needs to know about several things:
- The Batch system you’re running on.
- The user credentials for e.g. SSH and connection settings.
- The directories for configuration files and plugins.
- And, if you want, some debug settings.
When setting paths and referring to e.g. environment variables like “${USER}”, use braces “${…}”. This avoids the warnings about variables without braces that Roddy generates (since version 3.5) to alert you to possibly unresolved variables.
Let’s have a brief look at it:
[COMMON]
useRoddyVersion=develop # Use the most development version for tests
passEnvironment=false
baseEnvironmentScript=[ENVIRONMENT_FILE]
[DIRECTORIES]
configurationDirectories=[FOLDER_WITH_CONFIGURATION_FILES]
pluginDirectories=[FOLDER_WITH_PLUGINS]
scratchBaseDirectory=[FOLDER_ON_EXECUTION_HOSTS]
[JOB_PROCESSING]
jobManagerClass=de.dkfz.roddy.execution.jobs.direct.synchronousexecution.DirectSynchronousExecutionJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.pbs.PBSJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.sge.SGEJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.slurm.SlurmJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.lsf.rest.LSFRestJobManager
commandFactoryUpdateInterval=300
commandLogTruncate=80 # Truncate logged commands to this length. If <= 0, then no truncation.
[COMMANDLINE]
executionServiceUser=USERNAME
executionServiceClass=de.dkfz.roddy.execution.io.LocalExecutionService
#executionServiceClass=de.dkfz.roddy.execution.io.SSHExecutionService
executionServiceHost=[YOURHOST]
executionServiceAuth=keyfile
#executionServiceKeyfileLocation=[keyfile path] # use $HOME/.ssh/id_rsa by default
#executionServiceAuth=password
executionServicePasswd=
executionServiceStorePassword=false
executionServiceUseCompression=false
fileSystemInfoProviderClass=de.dkfz.roddy.execution.io.fs.FileSystemInfoProvider
The file is divided into several sections, but this is mainly to keep it orderly; you can set the file up as you like. Briefly explained:
- COMMON is for setting up general things
- DIRECTORIES
- COMMANDS
- COMMANDLINE is to set up the command line interface
We try to keep every possible option in the ini file, so you should basically be able to just select what you need and to fill in the missing parts.
Usually, you just need to change the following settings:
- jobManagerClass - Selects the cluster system backend
- CLI.executionServiceClass - Selects, if you want to access your system via SSH or directly
- CLI.executionServiceAuth - keyfile or password?
- CLI.executionServiceHost - The host, if you select SSH
- CLI.executionServicePasswd - The password for your system, if using SSH and no keyfiles
- CLI.executionServiceStorePassword - If you want to store the password, put in true, however, the password is stored in plain-text!
By default, the environment of the submission host is not passed to the execution hosts, to ensure a defined environment for maximum reproducibility. The submission host is the host on which job submission commands like qsub or bsub are executed, which is not necessarily the system on which Roddy is executed (!). If you want to pass the local environment, you can set passEnvironment to true. The baseEnvironmentScript variable can be used to ensure that e.g. /etc/profile is sourced, because this is not necessarily the case, depending on whether your scheduling system uses interactive or login shells to execute jobs. Alternatively, you may source such base environments in your ~/.profile or ~/.bashrc.
You may want to remember or store away the above options for future use, as it is likely that they will not change too often. For you, the more important settings might be:
- configurationDirectories - Put in a comma separated list of directories, where you keep your project XML files
- pluginDirectories - Put in a comma separated list of the directories, where your plugins are stored. Note, that the folder dist/plugins in the Roddy base directory, which contains the PluginBase and DefaultPlugin, will always be imported. You do not need to set this one.
You can either copy the content from above or you can also use Roddy to help you with the setup. This will be explained later on.
Configuration files¶
Roddy currently supports two different types of configuration files:
- XML based, which allows the use of all configuration features
- Bash based, which only allows a reduced set of configuration features
Normally, Roddy workflows and projects are configured with XML files. This document will give you all the details you need to know about those special files. Don’t be afraid of messing up things in configuration files. Roddy checks at least a part (not everything) of the files when they get loaded and will inform you about structural errors as well as possible.
Types of files¶
Roddy configuration files exist in three flavours:
- Project configuration files
- Workflow or analysis configuration files
- Generic configuration files.
All file types may contain the same content type though analysis configuration files will normally look different than e.g. project configuration files. The main difference between the different types is their position in the configuration inheritance tree, their filename and their header.
Filenames¶
Roddy imposes some filename conventions to identify XML files when they are loaded from disk:
- Project configuration files look like projects*[yourfilename]*.xml
- Workflow configuration files use the pattern analysis*[yourfilename]*.xml
Common configuration files do not use any pattern. You can name them like you want, except for the above patterns.
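The naming convention can be expressed as a small classifier. This is only a sketch: config_file_type is a hypothetical helper, and it assumes that the patterns above boil down to the filename prefixes projects and analysis combined with the .xml extension.

```shell
# Hypothetical helper: classify an XML configuration file by its filename,
# assuming the prefixes "projects" and "analysis" mark the special types.
config_file_type() {
    case "$(basename "$1")" in
        projects*.xml) echo project ;;
        analysis*.xml) echo analysis ;;
        *.xml)         echo generic ;;
        *)             echo unknown ;;
    esac
}

config_file_type /path/to/configs/projectsTestProject.xml
config_file_type /path/to/configs/analysisQualityControl.xml
config_file_type /path/to/configs/cofilenames.xml
```

The three example paths are made up for the demonstration and are classified as project, analysis and generic respectively.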
Inheritance structure¶
Configurations and configuration files can be linked in several ways:
- Subconfigurations extend their parent configuration(s)
- Configuration files can import other configurations. This is only possible at the top level of a configuration file; a subconfiguration cannot do this
- Analysis configuration files can be imported as an analysis import by a project configuration or subconfiguration
- An analysis can be imported by a project but not vice-versa
Bash configuration files¶
Bash configuration files are, compared to XML files, very lightweight. They offer only a subset of configuration options (namely configuration values and analysis imports) and are ideally used for small project or generic configurations.
#name aConfig
#imports anotherConfig
#description aConfig
#usedresourcessize m
#analysis A,aAnalysis,TestPlugin:develop
#analysis B,bAnalysis,TestPlugin:develop
#analysis C,aAnalysis,TestPlugin:develop
outputBaseDirectory=/data/michael/temp/roddyLocalTest/testproject/rpp
UNZIPTOOL=gunzip
ZIPTOOL_OPTIONS="-c"
sampleDirectory=/data/michael/temp/roddyLocalTest/testproject/vbp/A100/${sample}/${SEQUENCER_PROTOCOL}*
As you can see in the example, a Bash configuration needs a header and a body.
#name aConfig
#imports anotherConfig
#description aConfig
#usedresourcessize m
#analysis A,aAnalysis,TestPlugin:develop
#analysis B,bAnalysis,TestPlugin:develop
#analysis C,aAnalysis,TestPlugin:develop
The header must contain the name of the configuration and may contain imports, a description, the usedresourcessize attribute and several analysis tags. The analysis tags need to be set like [id],[analysis config id],[plugin name]:[plugin version]. Please see XML configuration files for a detailed description of the tags and attributes.
After the header comes the configuration values section.
outputBaseDirectory=/data/michael/temp/roddyLocalTest/testproject/rpp
UNZIPTOOL=gunzip
ZIPTOOL_OPTIONS="-c"
sampleDirectory=/data/michael/temp/roddyLocalTest/testproject/vbp/A100/${sample}/${SEQUENCER_PROTOCOL}*
The syntax for configuration values is the regular Bash syntax for variables. Of course, you can also use comments.
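Because the header consists of plain comment lines, it can be inspected with standard tools. The following sketch extracts header entries with sed; bash_config_header is a hypothetical helper and not how Roddy itself parses these files.

```shell
# Hypothetical helper: print the value(s) of one header tag, e.g. "name" or
# "analysis", from a Bash configuration file.
# Usage: bash_config_header TAG FILE
bash_config_header() {
    local tag="$1" file="$2"
    # Match lines like "#analysis A,aAnalysis,TestPlugin:develop" and strip
    # the leading "#tag" plus the whitespace that follows it.
    sed -n "s/^#${tag}[[:space:]][[:space:]]*//p" "$file"
}
```

For the example file above, bash_config_header name would print aConfig, and bash_config_header analysis would print the three analysis entries, one per line. The body of the file uses regular Bash variable syntax.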
XML configuration files¶
Structure / Sections¶
Each configuration file follows this pattern:
<configuration name='test' description='Example.' >
<availableAnalyses />
<configurationvalues />
<processingTools />
<filenames />
<enumerations />
<subconfigurations />
</configuration>
However, keep in mind that not every section makes sense for every type of XML file. E.g. availableAnalyses only makes sense in project XML files, whereas filenames and processing tools will most likely only be used within analysis XML files.
Header¶
Different file types use different XML headers. This is necessary to set the different behaviours of those file types. Furthermore, for analysis configuration files, headers differ for the various workflow types. Let’s start with generic configuration files.
Brawl based workflows¶
<configuration configurationType='analysis'
name='dellyAnalysisBrawl' description='An example Brawl analysis.'
class='de.dkfz.roddy.core.Analysis' brawlWorkflow='BrawlTest'
brawlBaseWorkflow='WorkflowUsingMergedBams'
imports='commonCOWorkflowsSettings' listOfUsedTools='script1,script2'
usedToolFolders='scripts,tools'>
Script based workflows¶
<configuration configurationType='analysis'
name='testAnalysisNative'
description='A test analysis invoking a native pbs workflow.'
class='de.dkfz.roddy.core.Analysis'
workflowClass='de.dkfz.roddy.knowledge.nativeworkflows.NativeWorkflow'
listOfUsedTools="nativeWorkflow"
usedToolFolders="roddyTests"
nativeWorkflowTool="nativeWorkflow"
targetCommandFactory="de.dkfz.roddy.execution.jobs.cluster.pbs.PBSCommandFactory">
Java based workflows¶
<configuration configurationType='analysis'
name='testAnalysis' description='A test analysis for local and remote roddy workflow tests.'
class='de.dkfz.roddy.core.Analysis'
workflowClass='de.dkfz.roddy.knowledge.examples.SimpleWorkflow'
listOfUsedTools="testScript,testScriptExitBad,testFileWithChildren"
usedToolFolders="devel"
cleanupScript="cleanupScript">
Project configurations¶
<configuration configurationType='project' name='coWorkflowsTestProject'
description='A test project for the purity estimation analysis.' imports="coBaseProject"
usedresourcessize="s">
Generic / common configurations¶
Generic configuration files keep a minimal header, which might even just contain the name. That’s it.
<configuration name='cofilenames' description='This file contains patterns for filename generation and default configured paths for our computational oncology file structure.' >
Configuration values¶
Usually you will change configuration values. When Roddy executes a workflow, a shell script will be created where all the configuration values are stored. This script can then be imported by workflow scripts.
Configuration values are embedded in a configuration values section like:
<configurationvalues>
<cvalue name='analysisMethodNameOnInput' value='testAnalysis' type='string'/>
<cvalue name='analysisMethodNameOnOutput' value='testAnalysis' type='string'/>
<cvalue name="testAOutputDirectory" value="testfiles" type="path"/>
<!--<cvalue name="valuec" value="${valuea}"/>-->
<!--<cvalue name="valuea" value="${valueb}"/>-->
<!--<cvalue name="valueb" value="${valuea}"/>-->
<cvalue name="testOutputDirectory" value="${outputAnalysisBaseDirectory}/testfiles" type="path"/>
<cvalue name="testInnerOutputDirectory" value="${testOutputDirectory}/testfilesw2"/>
</configurationvalues>
The configuration value itself is defined as a cvalue element. Each element can have several attributes:
- name - The attribute is used to identify the value both in Roddy and in the job scripts.
- description - If you want to describe a value, do it with this attribute.
- value - The actual value is stored here. You can set dependencies to other values by enclosing the referenced value like ${targetValue}. Roddy will evaluate the dependency as soon as it is necessary.
- type - There exist several types for configuration values. The default type is string. Note that the selection of the type will influence how variables are interpreted and evaluated / converted.
  - string accepts any value.
  - int will accept integer values only. E.g. 1, 2, 3 or 4.
  - float will accept valid Java / Groovy float values. E.g. 1.2f or 1.2.
  - double will accept valid Java / Groovy double values. E.g. 1.2 or 1.2E-3.
  - boolean will evaluate true (preferred), y, j, t and 1 to true and false (preferred), n, f and 0 to false.
  - path indicates that the variable is a path or a part of a path.
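The accepted spellings for boolean values can be summarised in a short sketch. to_boolean is a hypothetical helper, not Roddy's own conversion code, and it assumes that the spellings are matched exactly as listed above (lower case).

```shell
# Hypothetical helper: normalise a boolean cvalue to "true" or "false",
# following the list of accepted spellings above. Fails for anything else.
to_boolean() {
    case "$1" in
        true|y|j|t|1)  echo true ;;
        false|n|f|0)   echo false ;;
        *)             echo "not a boolean: $1" >&2; return 1 ;;
    esac
}

to_boolean y
to_boolean 0
```

The two calls print true and false. Whether Roddy also accepts other spellings, such as upper case, is not covered by this sketch.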
Nice to know - Order of evaluation in Roddy¶
Configuration values can be cast to their type by calling any of the methods toBoolean(), toInt(), toFloat(), toDouble(), toString() and toFile().
In contrast to the other methods, toString() and toFile() are quite complex methods which will resolve referenced variables. Moreover, toFile() also has a particular order in which configuration objects are used to achieve this:
- The first used configuration is normally the project configuration. So every referenced value stored in there will be evaluated.
- The second used configuration is the analysis configuration and replaces USERNAME, USERGROUP, USERHOME, projectName and any other value attached to the analysis.
- The last used configuration is the one from the dataset, and replaces pid, PID, dataSet and DATASET.
Special values¶
For future releases of Roddy and also for better readability of XML files, Roddy offers “special” variables like:
Run flags which look like runPostProcessing, runFlagstats, runScript
and
Binaries which look like BWA_BINARY, MBUFFER_BINARY, PYTHON_BINARY and so on.
Run flags are always considered to be boolean and are e.g. used in Brawl based workflows. Binary variables are supposed to be checked on workflow validation and startup in future versions. If you want to exchange a binary quickly or set a fixed binary for your scripts, it is also wise to store everything in configuration values.
Tool entries and filename patterns¶
Note
Because of the importance and complexity of both entry types, they are covered in their own section Tools and filenames.
These sections are started like this:
<processingTools>
<tool name='compressionDetection' value='determineFileCompressor.sh' basepath='roddyTools'/>
<tool name='createLockFiles' value='createLockFiles.sh' basepath='roddyTools'/>
<tool name='streamBuffer' value='streamBuffer.sh' basepath='roddyTools'/>
<tool name='wrapinScript' value='wrapInScript.sh' basepath='roddyTools'/>
<tool name='nativeWorkflowScriptWrapper' value='nativeWorkflowScriptWrapper.sh' basepath='roddyTools'/>
</processingTools>
<filenames package='de.dkfz.roddy.knowledge.examples' filestagesbase='de.dkfz.roddy.knowledge.examples.SimpleFileStage'>
<filename class='SimpleTestTextFile' onMethod='test1' pattern='${testOutputDirectory}/test_method_1.txt'/>
<filename class='SimpleTestTextFile' onMethod='test2' pattern='${outputAnalysisBaseDirectory}/${testAOutputDirectory}/test_method_2.txt'/>
<filename class='SimpleTestTextFile' onMethod='test3' pattern='${testInnerOutputDirectory}/test_method_3.txt'/>
<filename class='FileWithChildren' onMethod='SimpleTestTextFile.testFWChildren' pattern='${testOutputDirectory}/filewithchildren.txt'/>
<filename class='SimpleTestTextFile' onMethod='SimpleTestTextFile.testFWChildren' pattern='${testOutputDirectory}/test_method_child0.txt'/>
<filename class='SimpleTestTextFile' onMethod='SimpleTestTextFile.testFWChildren' selectiontag="file1" pattern='${testOutputDirectory}/test_method_child1.txt'/>
</filenames>
They contain a list and resource definitions for included workflow tools and patterns to create filenames based on different rules.
Tool entry names are automatically converted to configuration variables. For this to work, you need to set the tool id in camel case notation: camelCase. If this is done, Roddy will convert the id e.g. to TOOL_CAMEL_CASE. For the above example, you’d get TOOL_COMPRESSION_DETECTION out of compressionDetection and e.g. TOOL_WRAPIN_SCRIPT, TOOL_CREATE_LOCK_FILES, TOOL_STREAM_BUFFER and finally TOOL_NATIVE_WORKFLOW_SCRIPT_WRAPPER.
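The naming rule can be reproduced with a short sketch that inserts an underscore before each capital letter and then converts everything to upper case. tool_variable_name is a hypothetical helper that only illustrates the rule; it is not the conversion code used by Roddy.

```shell
# Hypothetical helper: derive the configuration variable name from a tool id
# written in camel case, e.g. compressionDetection -> TOOL_COMPRESSION_DETECTION.
tool_variable_name() {
    printf 'TOOL_%s\n' "$(printf '%s' "$1" \
        | sed 's/\([A-Z]\)/_\1/g' \
        | tr '[:lower:]' '[:upper:]')"
}

# Tool ids taken from the processingTools example above.
for tool in compressionDetection createLockFiles streamBuffer wrapinScript nativeWorkflowScriptWrapper; do
    tool_variable_name "$tool"
done
```

Note that wrapinScript yields TOOL_WRAPIN_SCRIPT, because only capital letters mark word boundaries.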
Here comes a list taken from an old config file. It was taken over as is and not reworked. However, many of the possibilities for filename patterns are listed here:
<!-- Filenames are always stored in the pid's output folder -->
<!-- Different variables can be used:
- ${sourcefile}, use the name and the path of the file from which the new name is derived
- ${sourcefileAtomic}, use the atomic name of which the file is derived
- ${sourcefileAtomicPrefix,delimiter=".."}, use the atomic name's prefix (without file-ending like .txt/.paired.bam...
of which the file is derived, set the delimiter option to define the delimiter default is "_"
the delimiter has to be placed inside "" as this is used to find the delimiter!
- ${sourcepath}, use the path in which the source file is stored
- ${outputbasepath}, use the output path of the pid
- ${[nameofdir]OutputDirectory}
NOTICE: If you use options for a variable you are NOT allowed to use ","! It is used to recognize options.
- ${pid}
- ${sample}
- ${run}
- ${lane}
- ${laneindex}
- You can put in configuration values to do this use:
${cvalue,name=[name of the value],default=".."} where default is optional.
- ${fileStageID} use the id String of the file's stage to build up the name.
-->
<!-- A filename can be derived from another file, use derivedFrom='shortClassName/longClassName'
A filename can also be specified for a level, use fileStage='PID/SAMPLE/RUN/LANE/INDEXEDLANE', refer to BaseFile.FileStage
A filename can be specified for all levels, the name is then build up with the ${fileStageID} value
A filename can be created using the file's called method's name
A filename can be created using the used tool's name
-->
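To get a feeling for how such a pattern turns into a filename, the sketch below substitutes two of the listed variables. expand_pattern is a hypothetical helper that handles only ${pid} and ${sample}; Roddy's own pattern resolution supports far more and is not reproduced here.

```shell
# Hypothetical helper: replace ${pid} and ${sample} in a filename pattern.
# Other variables are left untouched. The values must not contain "|" or "&",
# because they are inserted into a sed replacement.
expand_pattern() {
    local pattern="$1" pid="$2" sample="$3"
    printf '%s\n' "$pattern" \
        | sed -e "s|[\$]{pid}|$pid|g" -e "s|[\$]{sample}|$sample|g"
}

# The pattern is single-quoted so that the shell does not expand it first.
expand_pattern '${outputbasepath}/${pid}/${sample}_merged.bam' A100 tumor
```

The call prints the pattern with A100 and tumor filled in, while ${outputbasepath} stays as it is.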
Special: Autofilenames and Autofiletypes¶
Just to mention it (it is also covered in detail in the full guide), Roddy supports some sort of autofilenames and types. This means, if you just want to get things running, you can specify a tool with input and output files. If no filename patterns and file classes exist, Roddy will take care of this for you. However, the autofilenames are not the nicest things to have, so you should go on and create rules, if needed.
Enumerations¶
Enumerations are there to specify data types and validators for configuration values.
<enumeration name='cvalueType' description='various types of configuration values' extends="">
<value id='path' valueTag="de.dkfz.roddy.config.validation.FileSystemValidator" description="Value type is a file system path (fully or with wildcards like ~, *)"/>
<value id='bashArray' valueTag="de.dkfz.roddy.config.validation.BashValidator" description="A bash array."/>
<value id='boolean' valueTag="de.dkfz.roddy.config.validation.DefaultValidator" description="A boolean value containing true or false."/>
<value id='integer' valueTag="de.dkfz.roddy.config.validation.DefaultValidator" description="A positive or negative integer value."/>
<value id='float' valueTag="de.dkfz.roddy.config.validation.DefaultValidator" description="A single precision floating point value."/>
<value id='double' valueTag="de.dkfz.roddy.config.validation.DefaultValidator" description="A double precision floating point value."/>
<value id='string' valueTag="de.dkfz.roddy.config.validation.DefaultValidator" description="The default type if no type is set. The value will be stored unchecked."/>
</enumeration>
Looking at the default configuration value type configuration, you can see e.g. that path objects are validated with the FileSystemValidator class.
Tools and filenames¶
The whole workflow structure in Roddy is built around files and filenames. Files are used to create dependencies between steps in the workflow and files also enable Roddy to rerun a workflow based on created files.
As Roddy strictly separates code and configuration, filenames are configured. Of course you are allowed to make exceptions for e.g. initial files but the standard is to create rules for filenames.
So how do you tie things up?
Filename patterns are used to define a single or a range of names for a file class.
File classes are used as input and output parameters for tool entries. Filename patterns are automatically applied to output files!
Tool entries tell Roddy how a script or a binary is called: which files and parameters go in, which files come out, and which resources will be used by jobs running this tool.
A complex tool entry will be shown at the end of this document.
Note
In our experience, it works well to create a workflow and its tools step by step:
- You create a tool entry, define an initial resource set and i/o parameters.
- Integrate the call into your workflow.
- Set up filename patterns for the tool's output files.
- Test the new tool with testrun and testrerun.
- Repeat the steps for the next tool.
Occasionally it might still be wise to remove the output data and test the whole workflow again.
Important
Remember that Roddy does not feature job monitoring. The job structure, file names and patterns must be well known before the workflow starts!
Tool entries¶
<tool name='testScript' value='testScriptSleep.sh' basepath='roddyTests'>
<resourcesets>
<rset size="l" memory="1" cores="1" nodes="1" walltime="5"/>
</resourcesets>
<input type="file" typeof="SimpleTestTextFile" scriptparameter="FILENAME_IN"/>
<output type="file" typeof="SimpleTestTextFile" scriptparameter="FILENAME_OUT"/>
</tool>
Each tool entry has a header:
<tool name='testScript' value='testScriptSleep.sh' basepath='roddyTests'>
- The value of the name attribute is used to call or manage the tool in a workflow. Before a workflow starts, the names of all tools are converted to configuration values so that you will have easy access to them from your scripts. As explained in the configuration section, a job name will be converted from camel case notation to All caps notation using underscore as the word separator. In addition TOOL_ will be used as a prefix. So the tool name testScript would be named TOOL_TEST_SCRIPT in your job.
- The value of the value attribute holds the script or binary name of the executed file.
- The value of the basepath attribute points to the tools folder in the plugin's analysisTools folder.
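The naming rule for tool names can be sketched in a few lines of Python. This is only an illustration of the camel-case to all-caps conversion described above, not Roddy's actual implementation; the function name `tool_variable_name` is hypothetical, and how Roddy treats digits or runs of capital letters is not specified in this guide.

```python
import re


def tool_variable_name(tool_name: str) -> str:
    """Convert a camel-case tool name into the TOOL_* variable name.

    Assumption: a word boundary is any position where an upper-case letter
    follows a lower-case letter or a digit.
    """
    # Insert an underscore at each assumed word boundary, e.g. testScript -> test_Script
    with_separators = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", tool_name)
    # Upper-case the result and add the TOOL_ prefix
    return "TOOL_" + with_separators.upper()


print(tool_variable_name("testScript"))
```

Under these assumptions, a tool script could look up another tool's path through the resulting variable name.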
Important
You can, but do not have to, add resource sets and input and output parameters to a job. If you omit resource sets, the job will run with default resource settings. They are explained below. If you omit input and output parameters, you need to take care of the job call yourself. Normally, Roddy will take care of this for you. If you create a native workflow, you will lose the rerun feature if you omit the output parameters! Omitting all these parameters might sometimes make sense when you just want to get easy access to a tool in your analysisTools folder.
Resource sets¶
Each tool can have several resource sets.
<rset size="l" memory="1" cores="1" nodes="1" walltime="5"/>
The attribute size can be one of t, xs, s, m, l, xl and allows you to define resource sets for different cases, from extra small to extra large. t is a special case and can be used for test resources.
Currently Roddy (or BatchEuphoria) can be used to request the resources memory, cores, nodes and walltime. You can set values in different formats:
- The default for memory is 1GB. Valid strings for it are for example:
- 1 (which is 1 GB)
- 1m/g/t
- 0.5(m/g/t) which would be 500MB
- The default cores value is 1. Other values are natural numbers in [1; n]
- The default nodes value is 1. Other values are natural numbers in [1; n]
- The default walltime is 1 hour. Other values are for example:
- 00:10:00 which would be 10 minutes.
- 24:00:00 would be aligned to 01:00:00:00 which is one day. All other values will be aligned as well.
- 1h, 1d, 1h50m … or other values in human readable format.
Note
The default size for resource sets used by Roddy is l
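The walltime formats listed above can be illustrated with a small Python sketch that aligns a value to a days:hours:minutes:seconds string. It is not the code used by Roddy or BatchEuphoria; the function name `normalize_walltime` is hypothetical, the four-field output for every input is an assumption, and bare numbers such as walltime="5" are deliberately not handled because their unit is not stated here.

```python
import re


def normalize_walltime(value: str) -> str:
    """Align a walltime string to DD:HH:MM:SS (illustrative only)."""
    if ":" in value:
        # Colon-separated form: HH:MM:SS or DD:HH:MM:SS
        parts = [int(part) for part in value.split(":")]
        if not 3 <= len(parts) <= 4:
            raise ValueError(f"unsupported walltime: {value!r}")
        days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    else:
        # Human-readable form such as 1h, 1d or 1h50m
        tokens = re.findall(r"(\d+)([dhms])", value)
        if not tokens or "".join(n + unit for n, unit in tokens) != value:
            raise ValueError(f"unsupported walltime: {value!r}")
        amounts = {"d": 0, "h": 0, "m": 0, "s": 0}
        for number, unit in tokens:
            amounts[unit] += int(number)
        days, hours, minutes, seconds = (amounts[unit] for unit in "dhms")

    # Convert to seconds and redistribute so that e.g. 24 hours become one day
    total = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    total_minutes, seconds = divmod(total, 60)
    total_hours, minutes = divmod(total_minutes, 60)
    days, hours = divmod(total_hours, 24)
    return f"{days:02d}:{hours:02d}:{minutes:02d}:{seconds:02d}"


print(normalize_walltime("24:00:00"))
```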
Input types¶
A tool can have different input objects:
Values, like strings or numbers:
<input type="string" setby="callingCode" scriptparameter="SAMPLE"/>
- The type attribute tells Roddy, that a string is expected.
- The setby attribute tells Roddy, that the parameter will be set by the developer in the call of the job. Currently only callingCode is valid.
- The scriptparameter value tells Roddy that a parameter with this name is passed to the job.
Single file objects like:
<input type="file" typeof="de.dkfz.b080.co.files.LaneFile" scriptparameter="RAW_SEQUENCE_FILE" />
<input type="file" typeof="BasicBamFile" scriptparameter="RAW_SEQUENCE_FILE" />
- The type attribute tells Roddy that a file object is expected as input.
- The typeof value tells Roddy the expected type of an input value. This check is done within the job call. If the type of the input object does not match, Roddy will fail. You’re allowed to omit the package structure. Roddy will try to find the class in its core code and in the plugin classes. If more than two classes match, Roddy will fail and tell you, that this happened.
Important
You are allowed to put in a non-existent class! If Roddy cannot find the class, it will create a synthetic class during runtime. This way, you can skip code creation and keep your code lean. You are allowed to use this class like any other class. However, you are not able to use the class directly in your Java code.
- Like above, the scriptparameter value tells Roddy that a parameter with this name is passed to the job.
File groups:
File groups are collections of file objects. By default, file groups are designed to store files of the same type.
<input type="filegroup" typeof="de.dkfz.b080.co.files.BamFileGroup" scriptparameter="INPUT_FILES" passas="array"/>
<input type="filegroup" typeof="GenericFileGroup" scriptparameter="INPUT_FILES2" passas="array"/>
- Set the type to filegroup if you want to use it.
- typeof behaves nearly the same as for file input definitions. However, here you need to put in a file group class. If you do not need a specialized or named file group, you can use the GenericFileGroup class.
- TODO: classOfContainedFiles
- The passas attribute defines, how the files in the file group are passed to your job. Allowed values are:
- parameters which will tell Roddy to create a parameter for each file in the group.
- array which will tell Roddy to pass the files as an array in a single string.
- The scriptparameter behaves nearly like the one for files. If you set array, the parameter name will be used as it is. If you set parameters, it will be used as a prefix.
Important
The order of the input parameters matters, when you pass parameters to a job. Roddy will check this and fail, if:
- the number of input parameters does not match
- the type of input parameters does not match
Output types¶
The output of a Roddy job is always a file or a group of files. Moreover, you are only allowed to have one top-level output object in the XML description, but this object might be one which holds other objects like the mentioned file groups.
If your tool does not create output files you can omit those entries. However, it might still be wise to create some sort of checkpoint for the tool so that Roddy's rerun feature will work properly. The syntax for output objects is quite similar to the syntax for input objects, so we will skip explanations for known attributes. Valid output objects are:
Single file objects:
The single output file syntax is the same as for input files. Just change the tag name to output.
<output type="file" typeof="de.dkfz.b080.co.files.BamFile" scriptparameter="FILENAME" />
In addition to the basic parameters, you can also add a filename attribute like:
<output type="file" typeof="BasicBamFile" scriptparameter="FILE_IN" filename="/tmp/somefile_${pid}.txt" />
If you do this, you will create an inline filename pattern of the onScriptParameter type. See the section on filename patterns and rules below for more information.
Also you can tell Roddy to omit the file existence check for workflow reruns by adding the check attribute with its value set to “false”.
<output type="file" typeof="BasicBamFile" scriptparameter="FILE_IN" check="false" />
Files with children:
Files with children are a bit special. They are necessary, if you want to create a file which has some children. The main difference to single files is, that you need to create a class file! Then, for each file you want as a child, you need to create the field and the set / get accessors. We use this feature only in a handful of cases.
<output type="file" typeof="BasicBamFile" scriptparameter="FILENAME">
    <output type="file" variable="indexFile" typeof="BamIndexFile" scriptparameter="FILENAME_INDEX"/>
</output>
The example shows an output entry with one child. You can add more children, if you need.
The variable attribute tells Roddy which field in the parent class is used to store the created child.
Tuples of files:
Tuples of files are the easiest way to create collections of file objects. It does not matter which types the files have.
<output type="tuple">
    <output type="file" typeof="BasicBamFile" scriptparameter="FILENAME_BAM"/>
    <output type="file" typeof="BamIndexFile" scriptparameter="FILENAME_INDEX"/>
</output>
Call in Java code
// Call with output tuple
Tuple2 fileTuple = (Tuple2) call("testScriptWithMultiOut", someFile)
// Access output tuple children
(BasicBamFile) fileTuple.value0
(BamIndexFile) fileTuple.value1
File groups:
Output file groups offer a lot more options than input file groups.
<output type="filegroup" typeof="GenericFileGroup">
    <output type="file" typeof="" scriptparameter="BAM1"/>
    <output type="file" typeof="" scriptparameter="BAM2"/>
    <output type="file" typeof="" scriptparameter="BAM3"/>
</output>
File groups with indices:
<output type="filegroup" passas="array" filename="somefile_${fgindex}.out" />
Filename patterns¶
Filenames in Roddy are rule based. They are defined in the filenames section in your XML file.
<filenames package='de.dkfz.roddy.knowledge.examples' filestagesbase='de.dkfz.roddy.knowledge.examples.SimpleFileStage'>
<filename class='SimpleTestTextFile' onTool='testScript' pattern='${testOutputDirectory}/test_onScript_1.txt'/>
<filename class='SimpleMultiOutFile' onTool="testScriptWithMultiOut" selectiontag="mout1" pattern="${testOutputDirectory}/test_mout_a.txt" />
<filename class='SimpleMultiOutFile' onTool="testScriptWithMultiOut" selectiontag="mout2" pattern="${testOutputDirectory}/test_mout_b.txt" />
<filename class='SimpleMultiOutFile' onTool="testScriptWithMultiOut" selectiontag="mout3" pattern="${testOutputDirectory}/test_mout_c.txt" />
<filename class='SimpleMultiOutFile' onTool="testScriptWithMultiOut" selectiontag="mout4" pattern="${testOutputDirectory}/test_mout_d.txt" />
</filenames>
There are several types of triggers for patterns available. Patterns are always linked to a particular class. By applying the selectiontag attribute to some of the trigger types, you gain a more fine grained control over pattern selection, if you define output objects of the same class multiple times in a tool.
onScriptParameter trigger¶
This trigger type links the pattern to the scriptparameter attribute of an output object. Valid trigger values are:
- [parameter name] - where parameter name is linked to the scriptparameter attribute. The trigger is valid for all tools.
- :[parameter name] - behaves like above.
- [ANY]:[parameter name] - behaves like above. This is the long form and [ANY] is meant to make the syntax more readable.
- [tool id]:[parameter name] - behaves like above, except that tool id restricts the trigger to exactly one tool.
This trigger type will NOT accept the selectiontag attribute.
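The accepted forms of an onScriptParameter trigger value can be sketched as a small Python parser. This is an illustration only, not Roddy's parser; the function name `parse_on_script_parameter` is hypothetical. Rejecting a bracketed tool id other than [ANY] follows the `[AffY]:BAM_INDEX_FILE5` example in this guide, which is marked as an error.

```python
from typing import Optional, Tuple


def parse_on_script_parameter(value: str) -> Tuple[Optional[str], str]:
    """Split a trigger value into (tool id, parameter name).

    A tool id of None means that the trigger is valid for all tools.
    """
    if ":" not in value:
        # "[parameter name]": valid for all tools
        return None, value
    tool_id, parameter = value.split(":", 1)
    if tool_id in ("", "[ANY]"):
        # ":[parameter name]" and "[ANY]:[parameter name]" behave the same way
        return None, parameter
    if tool_id.startswith("["):
        # A bracketed keyword other than [ANY] is treated as an error
        raise ValueError(f"invalid trigger value: {value!r}")
    # "[tool id]:[parameter name]": restricted to exactly one tool
    return tool_id, parameter


print(parse_on_script_parameter("testScript:BAM_INDEX_FILE"))
```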
onMethod trigger¶
This trigger links the pattern to a method name or a class and a method name. Roddy will search all called methods using the current thread's stack trace. The search will stop as soon as the execute method is reached. Valid values are:
- [methodName] - by specifying only a method name, the pattern will be used for any called method with this name.
- [simple class name].[methodName] - this will accept all methods in classes with the given class name. The class package will be ignored.
- [full class name].[methodName] - by setting the class and the package, this pattern will only be applied with a full match.
This trigger type will accept the selectiontag attribute.
onToolID trigger¶
This trigger will link the pattern to a tool call. If this tool is called and outputs a file of the given class then this pattern might be used.
This trigger type will accept the selectiontag attribute.
derivedfrom trigger¶
In some cases the name of a new file depends on the name of a parent file, e.g. a Bam Index file depends on a Bam file like DATASET_TIMESTAMP.merged.bam -> DATASET_TIMESTAMP.merged.bam.bai.
This trigger type will accept the selectiontag attribute.
generic¶
To be done… we hardly use it.
Important
Filename patterns are evaluated in a specific order!
- First by the type
- onScriptParameter -> onMethod -> onToolID -> derivedFrom -> generic
- By the order in the configuration. First come first serve!
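The evaluation order can be pictured as a stable sort: first by trigger type, then by position in the configuration. The following Python sketch only illustrates that ordering; the `Pattern` class and the function name are hypothetical and do not correspond to Roddy classes.

```python
from dataclasses import dataclass
from typing import List

# Trigger types in the order in which they are evaluated
TRIGGER_PRIORITY = ["onScriptParameter", "onMethod", "onToolID", "derivedFrom", "generic"]


@dataclass
class Pattern:
    trigger: str  # one of TRIGGER_PRIORITY
    pattern: str  # the filename pattern itself


def evaluation_order(patterns: List[Pattern]) -> List[Pattern]:
    """Order patterns by trigger type, keeping configuration order within a type."""
    # sorted() is stable, so patterns of the same type keep their configured order
    return sorted(patterns, key=lambda p: TRIGGER_PRIORITY.index(p.trigger))


configured = [
    Pattern("onToolID", "/tmp/onTool"),
    Pattern("derivedFrom", "/tmp/onderivedFile"),
    Pattern("onScriptParameter", "/tmp/onScript"),
    Pattern("onToolID", "/tmp/onTool2"),
]
for entry in evaluation_order(configured):
    print(entry.trigger, entry.pattern)
```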
"<filename class='TestFileWithParent' derivedFrom='TestParentFile' pattern='/tmp/onderivedFile'/>"
"<filename class='TestFileWithParent' derivedFrom='TestParentFile' pattern='/tmp/onderivedFile'/>"
"<filename class='TestFileWithParentArr' derivedFrom='TestParentFile[2]' pattern='/tmp/onderivedFile'/>"
"<filename class='TestFileOnMethod' onMethod='de.dkfz.roddy.knowledge.files.BaseFile.getFilename' pattern='/tmp/onMethod'/>"
"<filename class='TestFileOnMethod' onMethod='BaseFile.getFilename' pattern='/tmp/onMethodwithClassName'/>"
"<filename class='TestFileOnMethod' onMethod='getFilename' pattern='/tmp/onMethod'/>"
"<filename class='TestFileOnTool' onTool='testScript' pattern='/tmp/onTool'/>"
"<filename class='FileWithFileStage' fileStage=\"GENERIC\" pattern='/tmp/filestage'/>"
"<filename class='TestOnScriptParameter' onScriptParameter='testScript:BAM_INDEX_FILE' pattern='/tmp/onScript' />"
"<filename class='TestOnScriptParameter' onScriptParameter='BAM_INDEX_FILE2' pattern='/tmp/onScript' />"
"<filename class='TestOnScriptParameter' onScriptParameter=':BAM_INDEX_FILE3' pattern='/tmp/onScript' />"
"<filename class='TestOnScriptParameter' onScriptParameter='[ANY]:BAM_INDEX_FILE4' pattern='/tmp/onScript' />"
"<filename class='TestOnScriptParameter' onScriptParameter='[AffY]:BAM_INDEX_FILE5' pattern='/tmp/onScript' />" // Error!!
"<filename onScriptParameter='testScript:BAM_INDEX_FILE6' pattern='/tmp/onScript' />"
Automatic filenames¶
Synthetic classes¶
Synthetic classes are a mechanism which allows you to use Roddy's built-in type checking system without the need to create class files. Synthetic classes are automatically created during runtime in the following cases:
A filename pattern requires a specific non-existent class.
A tool i/o parameter needs a specific non-existent class.
Programmatically, if you request Roddy to load a non-existent class with the LibrariesFactory:
LibrariesFactory.getInstance().loadRealOrSyntheticClass(String classOfFileObject, String baseClassOfFileObject)
LibrariesFactory.getInstance().loadRealOrSyntheticClass(String classOfFileObject, Class<FileObject> constructorClass)
LibrariesFactory.getInstance().forceLoadSyntheticClassOrFail(String classOfFileObject, Class<FileObject> constructorClass = BaseFile.class)
LibrariesFactory.getInstance().generateSyntheticFileClassWithParentClass(String syntheticClassName, String constructorClassName, GroovyClassLoader classLoader = null)
or via the ClassLoaderHelper
LibrariesFactory.getInstance().getClassLoaderHelper().loadRealOrSyntheticClass(String classOfFileObject, String baseClassOfFileObject)
LibrariesFactory.getInstance().getClassLoaderHelper().loadRealOrSyntheticClass(String classOfFileObject, Class<FileObject> constructorClass)
LibrariesFactory.getInstance().getClassLoaderHelper().generateSyntheticFileClassWithParentClass(String syntheticClassName, String constructorClassName, GroovyClassLoader classLoader = null)
Example tool entry and filename patterns¶
<a/>
Overriding tool entries¶
Sometimes, the initial specification might not be right for you. In this case, you can always override the existing tool entry. There are basically two ways: override the resource sets only, or redefine the whole tool.
If you want to override the whole tool, just do it. The only thing to remember is that you probably have to match the input and output parameter count, or even the types, and that you have to put the new tool definition at the proper level in your configuration file hierarchy.
<tool name='testScript' value='testScriptSleep.sh' basepath='roddyTests'>
<resourcesets>
<rset size="l" memory="1" cores="1" nodes="1" walltime="5"/>
</resourcesets>
<input type="file" typeof="SimpleTestTextFile" scriptparameter="FILENAME_IN"/>
<output type="file" typeof="SimpleTestTextFile" scriptparameter="FILENAME_OUT"/>
</tool>
Now, if you just need to adapt the resources, you can use the overrideresourcesets="true" attribute.
<tool name='testScript' value='testScriptSleep.sh' basepath='roddyTests' overrideresourcesets="true">
<resourcesets>
<rset size="l" memory="1" cores="1" nodes="1" walltime="5"/>
</resourcesets>
</tool>
The input and output entries will be inherited and your tool will be set up with the new resources. Be aware that all of the old resource entries will be void!
Metadata¶
Workflows need metadata about the processed data to be able to process the data correctly. Currently, there are two ways how metadata can be communicated to Roddy workflows: (1) via the filesystem paths of files, or (2) via a metadata table.
Note that this is provisional information as the plan is to streamline the respective code and define a clean interface to these and probably other metadata sources, such as XMLs, filesystem attributes, dedicated metadata files or databases.
Filesystem-based Metadata¶
The filesystem-based approach makes use of the filename patterns with specific variables matched against existing filesystem objects.
- outputBaseDirectory: The top-level directory beneath which filename-patterns for input and output files are matched or generated, respectively.
- inputBaseDirectory: Often the same as the outputBaseDirectory, but Roddy can also have a separate input directory in which filename patterns are matched
- outputAnalysisBaseDirectory: defaults to outputBaseDirectory/datasetId
- inputAnalysisBaseDirectory: defaults to inputBaseDirectory/datasetId
The following metadata variables are matched or filled into the filename patterns:
- dataSet
- pid (= patient id, synonymous for dataSet)
- projectName
- USERNAME: The user’s username on the submission host (on which the qsub, etc. are executed).
- USERGROUP: The user’s primary group on the submission host.
- USERHOME: The user’s home directory on the submission host.
- DIR_BUNDLED_FILES, DIR_RODDY: Only valid at the beginning of the path. The absolute path to Roddy’s application directory.
- PWD: The execution directory.
Note that Roddy matches variables by the pattern ‘${’ + varname + ‘}’. Variables that contain references to other variables are written into the job parameter file, which is later sourced by the Bash-based wrapper that sets up the environment for the actual tool script. At this stage, variables that are not enclosed by braces, and therefore not considered by Roddy during the ordering of variable assignments in the parameter file, may or may not be bound. So better use braces!
Additional variables can be defined in plugins. Furthermore, all configuration values can be referenced.
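The difference between braced and unbraced variable references can be illustrated with a short Python sketch that fills only references of the form `${varname}`. It is not Roddy's substitution code; the function name `fill_pattern` and the example values are hypothetical, and variables with options (such as `${cvalue,name=..}`) are not handled.

```python
import re
from typing import Dict

# Only references enclosed in braces, i.e. ${varname}, are recognized
BRACED_REFERENCE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def fill_pattern(pattern: str, values: Dict[str, str]) -> str:
    """Replace known ${varname} references; leave everything else untouched."""
    return BRACED_REFERENCE.sub(
        lambda match: values.get(match.group(1), match.group(0)), pattern
    )


values = {"outputBaseDirectory": "/data/project", "pid": "P1", "sample": "tumor"}
# $sample has no braces and is therefore not recognized by this sketch
print(fill_pattern("${outputBaseDirectory}/${pid}/$sample.txt", values))
```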
Table-based Metadata¶
Metadata tables (MDTs) differentiate between internal column names, which are used in the Java/Groovy code to select columns, and external column names, which are used in the input files. External column names can be chosen freely using alphanumeric characters.
The basic version of the metadata table needs two columns to match files against datasets. Therefore the minimal requirement for a metadata table is to have a dataset column with internal identifier “datasetCol” and a file column with internal identifier “fileCol”.
The mapping between internal and external columns is defined in the XML as configuration values. Each internal column name, like “datasetCol”, needs to be defined as configuration value with the value itself representing the external column name. With this mapping actual columns in the input files are mapped to the correct semantics internally and thus the order of columns in MDT inputs can be arbitrary. Please make heavy use of “description” attributes for your configuration values to ensure the exact semantics of the MDT columns is communicated well.
Furthermore, there probably should be a configuration value “metadataTableColumnIDs” that defines a priority for internal column identifiers, with high priority first and lower priority later. The priority allows simple checks on the content of the MDT. Given a set of rows, all higher priority fields need to have identical values. This check is optional and depends on which API the workflow developer has used in their Java code.
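The mapping between internal and external column names can be sketched as follows. This is a minimal Python illustration, not Roddy's table reader: the tab-separated input format, the external column names "PatientID" and "Path", and the function name `read_metadata_table` are all assumptions made for the example.

```python
import csv
import io
from typing import Dict, List


def read_metadata_table(text: str, mapping: Dict[str, str],
                        delimiter: str = "\t") -> List[Dict[str, str]]:
    """Read a table and return rows keyed by internal column names.

    `mapping` maps internal names (e.g. "datasetCol") to external column names.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [external for external in mapping.values() if external not in header]
    if missing:
        raise ValueError(f"missing columns in metadata table: {missing}")
    # Columns are selected by name, so their order in the input does not matter
    return [
        {internal: row[external] for internal, external in mapping.items()}
        for row in reader
    ]


table = "Path\tPatientID\n/data/a.bam\tP1\n/data/b.bam\tP2\n"
mapping = {"datasetCol": "PatientID", "fileCol": "Path"}  # hypothetical external names
for row in read_metadata_table(table, mapping):
    print(row["datasetCol"], row["fileCol"])
```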
Roddy Configuration Values¶
The following variables are used by Roddy itself:
Path Configuration¶
The following variables describe paths and may contain metadata-variable references such as ‘${dataset}’ that are filled in during the identification of input data during start up or filled with metadata values during submission.
- inputBaseDirectory: This option is also set by the --useiodir=$outputBaseDirectory,$inputBaseDirectory command-line option.
- outputBaseDirectory: This option is also set by the --useiodir=$outputBaseDirectory[,$inputBaseDirectory] command-line option. Usually, this directory contains
- one subdirectory per dataset (e.g. patient).
- the “execution cache file” (.roddyExecCache.txt) containing the list of executed runs
- the global “execution store” (.roddyExecutionStore) with the decompressed analysis tools from the plugins
- outputDirectory: Usually set to “$outputBaseDirectory/${dataset}”.
- outputAnalysisBaseDirectory: This is usually a subdirectory of the outputDirectory and contains the actual output of the workflow. Usually, the filename patterns in the plugins use the outputAnalysisBaseDirectory as base directory for the full paths to output files. This directory also contains the “roddyExecutionStore” with the logs of the workflow execution.
Debugging Options¶
- debugWrapInScript: Turn on extended debugging of the wrap in script that sets up the stage for the actual top-level job/tool script.
- disableDebugOptionsForToolscript: Mainly for internal usage. Turn off all debugging for the top-level job/tool script.
The following options set specific debugging options for Bash:
- debugOptionsUsePipefail: set -o pipefail
- debugOptionsUseVerboseOutput: set -v
- debugOptionsUseExecuteOutput: set -x
- debugOptionsUseUndefinedVariableBreak: set -u
- debugOptionsUseExitOnError: set -e
- debugOptionsParseScripts: set -n
- debugOptionsUseExtendedExecuteOutput: This turns on additional debugging in Bash by setting the PS4 variable. Each line starts with information about the executed script, the line number in the script, and the function calls. export PS4='+(${BASH_SOURCE}:${LINENO}): ${FUNCNAME[0]:+${FUNCNAME[0]}():}'
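The relation between the debugging options and the Bash commands listed above can be written down as a simple lookup. The Python sketch below only restates that list; it is not how the wrapper script is implemented, the function name `debug_prelude` is hypothetical, and treating the configuration values as Python booleans is an assumption.

```python
from typing import Dict, List

# Option name -> Bash command, as listed in this section
DEBUG_OPTION_COMMANDS = {
    "debugOptionsUsePipefail": "set -o pipefail",
    "debugOptionsUseVerboseOutput": "set -v",
    "debugOptionsUseExecuteOutput": "set -x",
    "debugOptionsUseUndefinedVariableBreak": "set -u",
    "debugOptionsUseExitOnError": "set -e",
    "debugOptionsParseScripts": "set -n",
}


def debug_prelude(config: Dict[str, bool]) -> List[str]:
    """Return the Bash commands for all debugging options that are switched on."""
    return [
        command
        for option, command in DEBUG_OPTION_COMMANDS.items()
        if config.get(option, False)
    ]


print("\n".join(debug_prelude({"debugOptionsUsePipefail": True,
                               "debugOptionsUseExitOnError": True})))
```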
Exposed to Jobs¶
The following variables are exposed to plugins by default and are set by Roddy to the specific values applicable to the job.
- RODDY_SCRATCH: Affected by baseScratchDirectory applicationProperties.ini option.
- RODDY_JOBID: The job-identifier (PBS_JOBID, LSB_JOBID, etc.)
- RODDY_JOBNAME: The job-name (PBS_JOBNAME, etc.)
- RODDY_QUEUE: The queue in which the job runs (PBS_QUEUE, etc.)
Additionally, all configuration variables are passed through from the command line (--cvalues) and configuration files into the jobs. Depending on the plugin code, additional variables may be set (or overridden) specifically for each job.
Access Rights¶
- processOptionsSetUserGroup
- processOptionsSetUserMask
- processOptionsQueryEnv
- processOptionsQueryID
- outputAccessRightsForDirectories
- outputAllowAccessRightsModification
- outputAccessRights
- outputFileGroup
- outputUMask
Other Variables¶
- usedResourcesSize
Output¶
There is not much to say about the standard output and error of Roddy, except that the information on which jobs are submitted is printed to standard output, while all other information is printed to standard error. This simplifies the parsing of the submission results.
More interesting is the execution metadata. Note that the $outputAnalysisBaseDirectory is configurable and by default the same as $outputBaseDirectory/$dataSet (defined in the DefaultPlugin configuration). For many workflows it is in a subdirectory of $outputBaseDirectory.
$outputAnalysisBaseDirectory/roddyExecutionStore¶
The Roddy execution store contains the most important metadata required for debugging and reproduction. The directory contains one exec_* subdirectory for each Roddy run for the data output directory in which the execution store is located.
Specifically the following files are contained:
runtimeConfig.sh
- In Roddy 2 this file was how job configuration data was provided to the jobs. The file contains Bash code to set up the environment from which the top-level job script is called (this is done in the wrapInScript.sh contained in the DefaultPlugin). Since version 3 this file is not used anymore, but kept (for now) as reference. Be aware that job-specific parameters may be set to wrong values or not at all in this file. Roddy 3 uses the .parameter files.
*.parameter
- Bash script to set up the environment for the top-level job script (since Roddy 3). This file contains all configuration values as used for the job, unless they have been changed in the environment setup script or in the job itself.
versionInfo.txt
- List of plugins and versions used.
jobStateLogFile.txt
- This file is extended by the wrapInScript.sh provided by the DefaultPlugin. Each row consists of four colon-separated columns with (1) the cluster job ID as used by the batch processing system, (2) a status indicator, usually STARTED or the job's exit code, (3) a timestamp in seconds since epoch, and (4) the name of the cluster job. You can convert the timestamp with the date command: date --date="@$secondsSinceEpoch"
.o%J files
- Combined standard output and error of the job as run on the cluster.
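The four-column layout of jobStateLogFile.txt can be read with a few lines of Python. This is a sketch under the column description above, not a tool shipped with Roddy; the names `JobStateEntry` and `parse_job_state_line` and the example row are hypothetical, and it assumes that the job ID and status contain no colon.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class JobStateEntry(NamedTuple):
    job_id: str          # cluster job ID as used by the batch system
    status: str          # usually STARTED or the job's exit code
    timestamp: datetime  # converted from seconds since epoch (UTC)
    job_name: str        # name of the cluster job


def parse_job_state_line(line: str) -> JobStateEntry:
    """Parse one row of jobStateLogFile.txt."""
    # Split at the first three colons only, so the job name is kept whole
    job_id, status, seconds, job_name = line.rstrip("\n").split(":", 3)
    timestamp = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return JobStateEntry(job_id, status, timestamp, job_name)


entry = parse_job_state_line("12345:STARTED:1500000000:r170714_testScript")
print(entry.job_id, entry.status, entry.timestamp.isoformat(), entry.job_name)
```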
$outputBaseDirectory/.roddyExecCache.txt¶
A small CSV file containing information about execution stores, workflow analysis ID, and the user names that ran the analysis.
$outputBaseDirectory/.roddyExecutionStore/¶
This hidden directory contains the scripts from the plugins as transferred from the host on which you executed Roddy. For different versions of your plugins different subdirectories are created, each with a version and timestamp tag. If you develop a workflow and change scripts, multiple such directories for the same plugin version may coexist, each with a different timestamp. The roddyExecutionStore/exec_*/analysisTools directory contains symlinks to the unpacked plugin tools directories that were used for the specific job execution.
$HOME/.roddy/logs¶
More extensive log files of Roddy containing information about the execution of the Roddy core workflow management system, including possible exception stacktraces, standard output and error, etc. After each run the corresponding log file name is reported on the standard error at the end of the execution. Note, however, that because these files can get quite big, only 30 of them are kept in this directory!
Developing Roddy¶
Application structure¶
The overall structure of a Roddy installation is as follows:
/
    roddy.sh                             # Top-level script
    dist/
        bin/
            develop/                     # Optional development version
            $major.$minor.$build/
        plugins/                         # Since Roddy 3
            DefaultPlugin
            PluginBase
        plugins_R$major.$minor/          # Plugin directory for specific Roddy versions. Usually in mixed installations of 2.3 and 2.4
        runtimeDevel/                    # Optional
            groovy-$major.$minor.$build
            jdk, jre, jdk_$major.$minor._$revision
Developers guide¶
Code guidelines¶
Roddy has no specific development or code style (yet). Here, we try to collect topics and settings that we think might be important.
Code Format¶
We are mainly using IntelliJ IDEA and use the default settings for code formatting.
Collections as return types¶
By default, we do not return a copy (neither shallow nor deep) of the Collection object. Be careful not to modify a returned collection unless you intend to change the contents of the object it belongs to.
Keep it clean and simple¶
We do not really enforce rules, but we try to keep things simple and readable.
- If a code block is not readable, try to make a method out of it.
- Reduce size and complexity of methods.
- Your code should be self explanatory. If it is not, try to make it that way.
We know, that we have a lot of issues in our codebase, but we listen to every improvement suggestion and constantly try to improve things.
Development model¶
For development we follow the standard GitHub flow with feature branches getting merged directly back into the master branch. Releasing is simply done by putting a tag on the master branch and letting the continuous integration pipeline (Travis) deploy a release archive to GitHub releases.
Settings for Groovy classes¶
We will not accept Groovy classes without the @CompileStatic annotation. If you are in the rare situation that you need dynamic dispatch on more than the object (this) itself, you can mark the affected methods with @CompileDynamic.
API documentation¶
We are working on improving our API documentation. Currently, the API documentation is not built automatically because of problems of groovydoc with Java lambda expressions.
Roddy versioning scheme¶
We are using semantic versioning.
Roddy version numbers consist of three entries: $major.$minor.$build. The build number is also sometimes called patch number.
The $major entry is used to mark API-breaking changes in the Roddy core functions. Backward compatibility is not guaranteed and Roddy will not execute plugins built with different $major versions.
The $minor entry marks smaller changes which extend the Roddy API. Backward compatibility of Roddy to the plugin should not be affected, such that your old plugins should run with the newer Roddy version.
The combination of $major.$minor can somehow be seen as the API level of Roddy. For a “full API level” the plugin versions of “PluginBase” and “DefaultPlugin” need to be considered as well.
Basically the same versioning convention applies to the plugins, but note that we advise authors to base the plugin versions not on the Roddy core versions, but only on the semantics of the analysis. The details have not yet been fully worked out, but basically this means,
- modified output files warrant a major level increase
- added output files warrant a minor level increase
- bug-fixes warrant a patch-level increase
Bug-fixes must not change the output – otherwise they represent major version bumps. Plugins also support a “revision” that is indicated as a “-number” suffix to the plugin version. The revisions usually contain the bug-fixes. If we have to maintain old plugin versions with just bug-fixes or feature backports for specific projects in production, then we usually release version numbers with an additional “-$revision” suffix. Such revisions will therefore at most correspond to minor-level increases. Furthermore, note that specific plugins may not have followed the semantic versioning convention. In the end, versioning is the responsibility of the plugin maintainer.
Importantly, if Roddy sees multiple plugin directories for the same plugin only differing in the revision number, Roddy may automatically upgrade to the version with the largest revision number! So be sure only to use revisions for semantically equivalent plugin versions (e.g. minor bugfixes). Every change that affects the output of your plugin in a way that, e.g., the results are not comparable with previous versions anymore, should receive at least a build-number increase.
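The revision rule can be illustrated with a small shell sketch. It is not Roddy's actual selection code; the function name latest_revision and the plugin names are made up for illustration, and the sketch only assumes the directory naming PluginName_version[-revision] described in this guide.

```shell
# Hypothetical helper, not part of Roddy: print the directory name with the
# largest revision for one plugin version.
# $1 is the name without revision (e.g. MyPlugin_1.0.111), the remaining
# arguments are candidate directory names.
latest_revision() {
    local plugin_version="$1"; shift
    local best="" best_rev=-1 dir rev
    for dir in "$@"; do
        case "$dir" in
            "$plugin_version")   rev=0 ;;                          # no suffix counts as revision 0
            "$plugin_version"-*) rev="${dir#"$plugin_version"-}" ;;
            *)                   continue ;;                       # other plugin or other version
        esac
        # Ignore suffixes that are not plain numbers.
        case "$rev" in ''|*[!0-9]*) continue ;; esac
        if [ "$rev" -gt "$best_rev" ]; then
            best_rev="$rev"
            best="$dir"
        fi
    done
    printf '%s\n' "$best"
}
```

Revisions are compared numerically, so -10 ranks above -2, and directories of a different version such as MyPlugin_1.0.112-99 are never considered.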
Below you’ll find how things are (or are supposed to be) handled in git.
How to get started¶
Have you already checked out the Installation guide? If not, please do so and do not forget to use the developer settings instead of the user settings.
The first thing you’ll need is a working Java 8+ installation and a Groovy installation (e.g. 2.4.9+).
Repository Structure¶
/
    roddy.sh                               Top-level script
    ./RoddyCore/                           The core project
        buildversion.txt                   Current buildversion
        Java/Groovy sources
    docs/                                  Documentation
    dist/
        bin/
            develop/
            $major.$minor.$build/
        plugins/
            DefaultPlugin
            PluginBase
        plugins_R$major.$minor/            Plugin directory for specific Roddy versions
        runtimeDevel/
            groovy-$major.$minor.$build
            jdk, jre, jdk_$major.$minor._$revision
The runtimeDevel/ directory is only required for Roddy up to version 2.3.
Compiling Roddy¶
The preferred way to build Roddy is via Gradle. Please run
./gradlew build
This will download all necessary dependencies into the dist/bin/develop/lib directory and create the Roddy.jar in dist/bin/develop.
If you want to develop Roddy and additionally want to work on RoddyToolLib or BatchEuphoria, you can clone these libraries into neighbouring directories and execute Gradle with composite build parameters:
./gradlew build --include-build ../RoddyToolLib/ --include-build ../BatchEuphoria/
Note that if you are using a proxy, additional configuration is necessary for Gradle. Please add the following lines with the appropriate values for your environment to the file “~/.gradle/gradle.properties”:
systemProp.http.proxyHost=
systemProp.http.proxyPort=
systemProp.https.proxyHost=
systemProp.https.proxyPort=
Hosts are specified without the “http[s]://” prefix.
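As a sketch, the four entries could be appended with a small shell function. The function name write_gradle_proxy and the host and port in the usage line are placeholders, not values from the Roddy project; the function strips a leading “http://” or “https://” because hosts are given without that prefix.

```shell
# Hypothetical helper: append proxy settings to a gradle.properties file.
# $1 = properties file, $2 = proxy host, $3 = proxy port
write_gradle_proxy() {
    local file="$1" host="$2" port="$3"
    # Hosts are specified without the protocol prefix.
    host="${host#http://}"
    host="${host#https://}"
    mkdir -p "$(dirname "$file")"
    {
        printf 'systemProp.http.proxyHost=%s\n' "$host"
        printf 'systemProp.http.proxyPort=%s\n' "$port"
        printf 'systemProp.https.proxyHost=%s\n' "$host"
        printf 'systemProp.https.proxyPort=%s\n' "$port"
    } >> "$file"
}

# Usage sketch (placeholder values):
#   write_gradle_proxy "$HOME/.gradle/gradle.properties" proxy.example.org 3128
```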
Packing Roddy¶
The packaging of Roddy is done using the Gradle distribution plugin. There are two packaging targets:
./gradlew roddyDistZip roddyEnvironmentDistZip
The distribution zips are put in the “gradleBuild/distribution” directory.
The “roddyEnvironmentDistZip” target will produce a zip with the top-level directory containing the roddy.sh and the essential “dist/bin” subdirectories.
The “roddyDistZip” target produces a release zip that is supposed to be extracted into a directory called “dist/bin/$major.$minor.$build”.
Building the documentation¶
The Sphinx-based documentation is located in the “docs/” directory and is built with
./gradlew sphinx
The output is then produced in “gradleBuild/site” for inspection in the browser.
Further important notes¶
The “roddyDistZip” target will produce a zip with the content of the “dist/bin/develop” directory. For deployment you should unzip it in that directory and copy its content into an appropriately named “dist/bin/” subdirectory, e.g. “develop” for testing purposes or the version number, such as 3.1.0.
Plugin development¶
Plugin developers guide¶
This page should give you an idea of how to start your own Roddy based workflows. We describe how to set up a fresh workflow and how to populate it initially. Depending on the workflow type, you can then go on and dig deeper into the individual manuals. The JVM plugins page will also feature a list of various techniques for working with Roddy (e.g. how to get a file from storage).
If you just need a quickstart or a short refresher, you can read the Workflow development primer.
Select the workflow type¶
Before you create a new workflow, you have to decide, which type of workflow you want to create:
- Java / Groovy or other JVM based plugins. We will call them JVM plugins.
- Brawl
- Bash or other native workflows like e.g. Python or Perl based
and if you want to create a new plugin or extend an existing plugin. Of course, you can have a mix of workflows in a plugin at a later stage.
We are discussing, if we will support CWL based workflows.
Common plugin setup¶

Roddy plugins are normally strictly organized. An exception to this structure are full native plugins. But as these special plugins get converted to the default structure, finally all plugins are organized this way.
The plugins folder name is built up in the following way:
PluginName_1.0.111-1
Note
The standard Roddy versioning scheme also applies to the plugin versioning scheme which is [major].[minor].[build] and extends it by the revision to [major].[minor].[build]-[revision].
where:
- PluginName is the name of the plugin
- _1.0.111 is the version of the plugin. This is not necessarily the same as the entry in the buildversion.txt file. If you omit this entry, the plugin version is ‘develop’ by default!
- -1 is the revision of the plugin. If you only have smaller changes, you can increase the revision number of the new plugin and Roddy is able to select the revised plugin instead of the former revision. You can omit this entry and Roddy will set the revision to -0 internally. Please be aware:
  - The revision is only valid if you set the version! It is not valid for plugins marked as ‘develop’.
  - You are also not allowed to set ‘develop’ as the plugin version!
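The naming rules can be summarised in a short shell sketch. This is only an illustration of the convention described here, not Roddy's parser; the function name parse_plugin_dir is hypothetical, and the sketch assumes that the version starts after the last underscore.

```shell
# Hypothetical helper: split a plugin folder name into "name version revision".
parse_plugin_dir() {
    local dir="$1" name rest version revision
    case "$dir" in
        *_*)
            name="${dir%_*}"     # everything before the last underscore
            rest="${dir##*_}"    # version, optionally followed by -revision
            ;;
        *)
            name="$dir"
            rest="develop"       # no version given: the plugin counts as 'develop'
            ;;
    esac
    case "$rest" in
        *-*) version="${rest%-*}"; revision="${rest##*-}" ;;
        *)   version="$rest"; revision=0 ;;   # omitted revision is treated as 0
    esac
    printf '%s %s %s\n' "$name" "$version" "$revision"
}
```

For PluginName_1.0.111-1 this prints the name PluginName, the version 1.0.111 and the revision 1. For a plain PluginName folder the version is develop; the revision printed in that case has no meaning, since revisions are not valid for develop plugins.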
There are some main components for any plugin and files for the contained workflows.
The buildversion.txt file contains the build number of the plugin. This number will get increased, if you pack or compile the plugin. The file contains exactly two lines:
Major.Minor
Build
e.g.
1.0
182
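A minimal sketch of how the two lines combine into the [major].[minor].[build] form, assuming the first line holds Major.Minor and the second line the build number; read_buildversion is a hypothetical helper, not a Roddy command.

```shell
# Hypothetical helper: print the version stored in a two-line buildversion.txt.
read_buildversion() {
    local file="$1" major_minor="" build=""
    {
        IFS= read -r major_minor   # first line: Major.Minor
        IFS= read -r build         # second line: Build
    } < "$file" || true            # tolerate a missing final newline
    printf '%s.%s\n' "$major_minor" "$build"
}
```

For the example file shown here the function prints 1.0.182.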
The buildinfo.txt file contains information about:
- The Roddy API level, which is e.g. 2.3 or 2.4
- The Java version API level
- The groovy API level
furthermore, it contains information about dependencies to other plugins and compatibility entries.
One example:
dependson=PluginBase:1.2.0
dependson=COWorkflows:1.2.20
JDKVersion=1.8
GroovyVersion=2.4
RoddyAPIVersion=3.0
This plugin depends on other plugins (here PluginBase and COWorkflows) with specific versions. For development, it is possible to set current for the version number. The plugin also depends on JDK version 1.8.*/8.*, Groovy version 2.4.* and the Roddy version 3.0.*. If you do not develop a Java based plugin, you can omit JDKVersion and GroovyVersion.
- The resources directory which contains:
The analysisTools directory, which is populated with several tool folders, e.g.
13:45 $ ll analysisTools/
total 8
... 4096 26. Jun 13:47 roddyNativeTools
... 4096 13. Jul 16:20 roddyTools
The names of the tool folders will be used as the basepath entry for tool entries in your workflow configuration file.
The configurationFiles directory which contains one or more configuration files. Workflow configuration files need the prefix analysis, e.g. analysisTest.xml.
If you use Brawl workflows, you will store your Brawl files inside the folder brawlWorkflows.
- The src folder for e.g. Java classes. Of course, you are free to change this and have the code organized in your own way. We tend to keep it like this.
- The jar file, which is named after the plugin name. The jar file is only needed, if you create Java based workflows.
Important
The build* files and the analysisTools and configurationFiles folders are mandatory! If you do not create them, the plugin will not be loaded by Roddy.
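Before pointing Roddy at a new plugin, the mandatory parts can be checked with a small shell function. This is a sketch under the assumption that the two folders live below resources/ as described in this section; check_plugin_layout is a hypothetical name and not part of roddy.sh.

```shell
# Hypothetical helper: verify that a plugin directory has the mandatory parts.
# Returns 0 if everything is present, 1 otherwise.
check_plugin_layout() {
    local plugin_dir="$1" missing=0 entry
    for entry in buildinfo.txt buildversion.txt; do
        [ -f "$plugin_dir/$entry" ] || { echo "missing file: $entry" >&2; missing=1; }
    done
    for entry in resources/analysisTools resources/configurationFiles; do
        [ -d "$plugin_dir/$entry" ] || { echo "missing directory: $entry" >&2; missing=1; }
    done
    return "$missing"
}
```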
Populating your plugin¶
Now it is time to populate your plugin with files, configuration files and resources. The common settings are explained in this document; plugin-specific settings are explained separately.
As noted before, you need to create at least a plugin folder with a valid name, the buildinfo and the buildversion text files and both subfolders in resources.
Important
JVM workflows offer the highest amount of access to the Roddy API. Roddy’s API concepts will be explained in the description of JVM workflows. However, you are allowed to mix workflow types in a plugin.
Let Roddy help you¶
Call Roddy like this:
bash roddy.sh createnewworkflow PluginID[:dependencyPlugin] [native|brawl:]WorkflowID
- Set PluginID to either an existing or a new Plugin.
- Set dependencyPlugin to a parent plugin
- Select if you want a Java, a native (Bash) or a Brawl workflow
- Finally, set the workflow’s name with WorkflowID
So e.g. create a Java workflow called FirstWorkflow in a plugin called NewPlugin:
bash roddy.sh createnewworkflow NewPlugin FirstWorkflow
or e.g. create a Brawl workflow called SecondWorkflow in another plugin and set it to depend on NewPlugin:
bash roddy.sh createnewworkflow AnotherPlugin:NewPlugin brawl:SecondWorkflow
*Oh I have something new now… but where is it?*
Good question. That totally depends on your application ini file and the plugin directories configured there. So look up the file and take a look into all configured directories.
Workflow development primer¶
Following the instructions on this page, you should be able to set up and run a basic workflow within ten minutes. At the end of this page you’ll find all commands in one code block. This guide assumes that you will be developing for Roddy version 3.0.x and that you will create a JVM based workflow.
1. Setup a plugins folder¶
The plugins folder is the folder, where you will store your (self-created) plugins.
mkdir ~/RoddyPlugins
2. Prepare the plugin folder¶
Now create a folder in which you will store your new plugin.
cd ~/RoddyPlugins
mkdir NewPlugin
cd NewPlugin
3. Create the first files and folders¶
This will create the basic structure which is necessary for your plugin. See the Plugin developers guide for more information about plugin structures.
mkdir -p resources/analysisTools/workflowTools
mkdir -p resources/configurationFiles
echo 0.0 > buildversion.txt
echo 0 >> buildversion.txt
echo "dependson=PluginBase:1.0.29" > buildinfo.txt
echo "dependson=DefaultPlugin:1.0.34" >> buildinfo.txt
echo "RoddyAPIVersion=3.0" >> buildinfo.txt
echo "JDKVersion=1.8" >> buildinfo.txt
echo "GroovyVersion=2.4" >> buildinfo.txt
4. Create the src folder and the initial java package¶
We’ll use our package structure for this example, change it as you need it. You’ll need the src structure, if you want to compile the plugin using Roddy.
mkdir -p src/de/dkfz/roddy/newplugin
cd src/de/dkfz/roddy/newplugin
In this directory, create the file NewPlugin.java and put in the following code.
package de.dkfz.roddy.newplugin;
import de.dkfz.roddy.plugins.BasePlugin;
public class NewPlugin extends BasePlugin {
public static final String CURRENT_VERSION_STRING = "0.0.0";
public static final String CURRENT_VERSION_BUILD_DATE = "NotBuildYet";
@Override
public String getVersionInfo() {
return "Roddy plugin: " + this.getClass().getName() + ", V " + CURRENT_VERSION_STRING + " built at " + CURRENT_VERSION_BUILD_DATE;
}
}
There you are, next step is…
5. Create a workflow class¶
In this directory, create the file NewWorkflow.java and put in the following code.
package de.dkfz.roddy.newplugin;
import de.dkfz.roddy.core.ExecutionContext;
import de.dkfz.roddy.core.Workflow;
public class NewWorkflow extends Workflow {
@Override
public boolean execute(ExecutionContext context) {
return true;
}
}
6. Create your analysis XML file¶
The next step is the creation of your analysis XML file, which will make the workflow available to Roddy. If the XML file is setup properly, you can import the analysis in your project configuration or call it in configuration-free mode.
cd ~/RoddyPlugins/NewPlugin/resources/configurationFiles
<configuration name='newAnalysis' description=''
configurationType='analysis'
class='de.dkfz.roddy.core.Analysis'
workflowClass='de.dkfz.roddy.newplugin.NewWorkflow'
runtimeServiceClass="de.dkfz.roddy.core.RuntimeService"
listOfUsedTools="testScript" usedToolFolders="workflowTools">
<configurationvalues>
<cvalue name="firstValue" value="FillIt" type="string" />
<cvalue name="testOutputDirectory" value="${outputAnalysisBaseDirectory}/testfiles" type="path"/>
</configurationvalues>
<processingTools>
<tool name='testScript' value='testScriptSleep.sh' basepath='workflowTools'>
<resourcesets>
<rset size="l" memory="1" cores="1" nodes="1" walltime="5"/>
</resourcesets>
<input type="file" typeof="SimpleTestTextFile" scriptparameter="FILENAME_IN"/>
<output type="file" typeof="SimpleTestTextFile" scriptparameter="FILENAME_OUT"/>
</tool>
</processingTools>
<filenames package='de.dkfz.roddy.knowledge.examples' filestagesbase='de.dkfz.roddy.knowledge.examples.SimpleFileStage'>
<filename class='SimpleTestTextFile' onTool='testScript' pattern='${testOutputDirectory}/test_onScript_1.txt'/>
</filenames>
</configuration>
There you are. You now have a tool which you can call from your workflow.
7. Extend the workflow¶
Open up the workflow class again and change the execute method so that it calls the tool “testScript”. For that to work, you need to load one SimpleTestTextFile.
public boolean execute(ExecutionContext context) {
SimpleTestTextFile textFile = (SimpleTestTextFile)loadSourceFile("/tmp/someTextFile.txt");
SimpleTestTextFile result = call("testScript", textFile);
return true;
}
Successful Roddy workflows will return true. If you detect an error, you can return false or throw an exception. Only one thing is missing, before you try out your new workflow.
8. Create the first script¶
cd ~/RoddyPlugins/NewPlugin/resources/analysisTools/workflowTools
echo 'sleep 10' > testScriptSleep.sh
echo 'cat $FILENAME_IN > $FILENAME_OUT' >> testScriptSleep.sh
chmod 770 testScriptSleep.sh
9. Create a new properties file for Roddy¶
There is a skeleton application properties file in your Roddy folder. Copy the file [RODDY]/dist/bin/develop/helperScripts/skeletonAppProperties.ini to a location of your choice. Open it and add the folder ~/RoddyPlugins to the pluginDirectories entry. Also change the jobManager class to DirectSynchronousExecutedJobManager. Just comment out the currently active line and uncomment the new jobManager.
10. Last steps¶
The last step you need to do is to compile and run the script.
[RODDY_DIRECTORY]/roddy.sh compileplugin NewPlugin --c=[YOUR_INI_FILE]
If you stuck to the example code, everything should be fine now and you can call it. We’ll use the configuration-free mode here. Therefore we call the testrun mode with the pattern
[RODDY_DIRECTORY]/roddy.sh testrun [PluginName]_[PluginVersion]:[ConfigurationName]Analysis
Note that the “ConfigurationName” is the name attribute in the workflow configuration in the plugin, however without the “Analysis” suffix. The suffix is re-added by Roddy.
Project configuration files are explained in Configuration files. If you use a project configuration file, put it in a directory of your choice (e.g. where you put your ini file from the step before).
[RODDY_DIRECTORY]/roddy.sh testrun NewPlugin_develop:test --c=[YOUR_INI_FILE]
Command code block¶
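The shell commands of steps 1 to 4 and 8 can be collected in one place, for example as the function sketched below. The function name create_plugin_skeleton is hypothetical. The dependency and API entries are written to buildinfo.txt and the version lines to buildversion.txt, matching the description of these two files in the Plugin developers guide. The Java classes, the analysis XML file, the application ini file and the roddy.sh calls of steps 4 to 7, 9 and 10 still have to be added by hand.

```shell
# Hypothetical wrapper around the shell commands of steps 1 to 4 and 8.
# $1 = plugins folder (defaults to ~/RoddyPlugins)
create_plugin_skeleton() {
    local plugins_dir="${1:-$HOME/RoddyPlugins}"
    local plugin_dir="$plugins_dir/NewPlugin"
    local tools_dir="$plugin_dir/resources/analysisTools/workflowTools"

    # Steps 1 to 3: plugin folder, resource folders and build files
    mkdir -p "$tools_dir" "$plugin_dir/resources/configurationFiles"
    printf '%s\n' 0.0 0 > "$plugin_dir/buildversion.txt"
    printf '%s\n' \
        "dependson=PluginBase:1.0.29" \
        "dependson=DefaultPlugin:1.0.34" \
        "RoddyAPIVersion=3.0" \
        "JDKVersion=1.8" \
        "GroovyVersion=2.4" > "$plugin_dir/buildinfo.txt"

    # Step 4: source folder for the Java package
    mkdir -p "$plugin_dir/src/de/dkfz/roddy/newplugin"

    # Step 8: the tool script that is called by the workflow
    printf '%s\n' 'sleep 10' 'cat $FILENAME_IN > $FILENAME_OUT' > "$tools_dir/testScriptSleep.sh"
    chmod 770 "$tools_dir/testScriptSleep.sh"
}

# Usage sketch:
#   create_plugin_skeleton ~/RoddyPlugins
```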
JVM plugins¶
Java or Groovy based plugins are the default plugin type for Roddy, as both provide a lot of checks when the plugin is built, e.g. for variable type errors and misspelled variables. Brawl based workflows will be converted to Groovy workflows during runtime. Here we will focus on the development of a new empty plugin. All you need is the basic setup described in the Plugin developers guide. The code shown here can be found in the TestPluginWithJarFile plugin.
Note
There are some basic and test workflows available in the Roddy distribution folder. You can always take a look at them, if you need some examples.
Initial workflow¶
To start the development, you need to set up a package structure and put in a class which extends the Workflow class and an initial analysis configuration file.
Here comes the Java workflow class:
package de.dkfz.roddy.knowledge.examples;
import de.dkfz.roddy.core.ExecutionContext;
import de.dkfz.roddy.core.Workflow;
class SimpleWorkflow extends Workflow {
    @Override
    public boolean execute(ExecutionContext context) {
        // Nothing to do yet; report success.
        return true;
    }
}
What you can see is a workflow class which overrides the execution method from Workflow. There are other methods which you can override or use:
- checkExecutability - which returns a boolean value and
And here is the initial XML file:
<configuration name='testAnalysis' description=''
configurationType='analysis'
class='de.dkfz.roddy.core.Analysis'
workflowClass='de.dkfz.roddy.knowledge.examples.SimpleWorkflow'
runtimeServiceClass="de.dkfz.roddy.knowledge.examples.SimpleRuntimeService"
listOfUsedTools=""
usedToolFolders="devel"
cleanupScript="cleanupScript">
</configuration>
What you have to do here is to set:
- The name attribute -> This is used as the analysis identifier.
- The workflowClass attribute -> This is the workflow class which we created above.
- And finally the runtimeServiceClass -> This class and its descendants are used to handle file and directory name issues.
That’s it! This workflow could already be run, though it would not produce any files.
Load a source file from storage¶
Before you are able to start a job, you will need to load a file from storage. Roddy does not feature file loading by pattern or wildcards, but you have several other ways to get files from storage. When we say “load” or “get” a file, we mean that we create a file object of the type BaseFile. The actual content of the file is not loaded! The file does not even need to exist! Checking files is done in a separate step.
So which possibilities do you have?
- Construct the file manually (not recommended)
- Call Workflow.getSourceFile or BaseFile.getSourceFile / BaseFile.fromStorage
- Call Workflow.getSourceFilesUsingTool (or ExecutionService.getInstance().executeTool() and do the
package de.dkfz.roddy.knowledge.examples;
import de.dkfz.roddy.core.ExecutionContext;
import de.dkfz.roddy.core.Workflow;
import de.dkfz.roddy.knowledge.files.Tuple4;
/**
*/
public class TestWorkflow extends Workflow {
@Override
public boolean execute(ExecutionContext context) {
SimpleRuntimeService srs = (SimpleRuntimeService) context.getRuntimeService();
SimpleTestTextFile initialTextFile = srs.createInitialTextFile(context);
SimpleTestTextFile textFile1 = initialTextFile.test1();
FileWithChildren fileWithChildren = initialTextFile.testFWChildren();
SimpleTestTextFile textFile2 = textFile1.test2();
SimpleTestTextFile textFile3 = textFile2.test3();
Tuple4 mout = (Tuple4) call("testScriptWithMultiOut", textFile3);
return true;
}
}
Call a tool¶
Tool definition¶
Actual call¶
Now, let’s extend the workflow to call a tool. First we need to get some files from storage with which we can work. Roddy works with explicitly defined dependencies. Job dependencies are automatically created when an output file is used as an input to another job. Initially we do not have any files, so we need to get at least one from storage.
package de.dkfz.roddy.knowledge.examples;
import de.dkfz.roddy.core.ExecutionContext;
import de.dkfz.roddy.core.Workflow;
class SimpleWorkflow extends Workflow {
    // Create the file object for the initial text file and make sure the file exists.
    BaseFile createInitialTextFile(ExecutionContext ec) {
        BaseFile tf = BaseFile.constructSourceFile(
                new File(ec.runtimeService.getOutputAnalysisFolder(ec.getDataSet(), ec.getAnalysis()).getAbsolutePath(),
                        "textBase.txt"),
                ec,
                new SimpleFileStageSettings(ec.getDataSet(), "100", "R001"),
                null)
        if (!FileSystemAccessProvider.getInstance().checkFile(tf.getPath()))
            FileSystemAccessProvider.getInstance().createFileWithDefaultAccessRights(true, tf.getPath(), ec, true)
        return tf
    }
    @Override
    public boolean execute(ExecutionContext context) {
        return true
    }
}
Brawl workflows¶
Brawl is Roddy’s own domain specific language (DSL) for creating workflows. It looks a lot like the DSL of e.g. Snakemake or Nextflow.
Brawl workflows can be part of Plugins with or without Jar files. To create them, you create the folder:
resources/brawlWorkflows
Inside, you can create as many Brawl workflows as you like. Brawl workflow files either have the suffix .groovy OR .brawl, e.g.
resources/brawlWorkflows/TestWorkflow.groovy
or
resources/brawlWorkflows/TestWorkflow.brawl
However, one mistake you can see in the above example is that both workflows have the same name. The workflow identifier is taken directly from the filename itself. So if you want to import the workflow in your project configuration, you’d identify it with “TestWorkflow”.
Structure¶
Brawl workflows are Groovy workflows, so the basic Groovy / Java syntax applies. See the TestWorkflow.groovy, which is located in the Roddy repository.
// Java variables
String variable = "abc"
// "Environment" / Roddy configuration values
cvalue "valueString", "a text", "string"
cvalue "valueInteger", 1
cvalue "valueDouble", 1.0
cvalue "aBooleanValue", true
// Explicit workflow
explicit {
def file = getSourceFile("/tmp", "TextFile")
def a = run "ToolA", file
}
// Tool / Rule section
rule "ToolA", {
input "TextFile", "parameterA"
output "aClass", "parameterB", "/tmp/someoutputfile"
shell """
#!/bin/bash
echo "\$parameterA"
echo "\$parameterB"
touch \$parameterB
"""
}
What you need are: Configuration values, Rules and the explicit Closure / Block.
Configuration values¶
Like in “full” Roddy workflows, there are two types of variables: Java variables, which only apply to the workflow file itself, and cvalue variables, which will be stored in the workflow’s configuration and will therefore be available in the target system environment.
Note
Please note the effects on quoting in Groovy strings! (See below for some more information or look up Groovy docs). Groovy will try to replace variables in Strings, if possible. Sometimes it might be necessary to quote the $ to prevent this!
Rules¶
Rules or Tools are what is called with a run command (see below). They can contain input and output parameters, a shell script OR a file reference and resource options. “rule” and “tool” are the same; you can decide for your preferred identifier. We’ll stick to rule for now. Rules are what will be executed on the target (cluster) system. Although they are configured here, there is quite a strict separation between rule scripts and the workflow side! Every value you need in a script needs to be passed either as a parameter or as a cvalue! Java configuration values are not automatically available, except if they are directly inserted by Groovy when the workflow is read in.
Note
Every rule / tool you register, will be available on the script side! However, their names are translated. E.g. “ToolA” will become the environment variable “TOOL_TOOL_A” on the script side. See Tools and filenames “Tool entries” for more information.
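The translation of the single documented example, “ToolA” to “TOOL_TOOL_A”, can be reproduced with the following shell sketch. The general rule used here (an underscore before each upper-case letter that follows a lower-case letter or digit, then upper-casing and the TOOL_ prefix) is an assumption derived from that one example; Roddy’s real conversion may treat other names differently.

```shell
# Hypothetical helper: guess the environment variable name for a tool name.
# Only the mapping ToolA -> TOOL_TOOL_A is taken from the documentation.
tool_env_name() {
    printf '%s\n' "$1" \
        | sed 's/\([a-z0-9]\)\([A-Z]\)/\1_\2/g' \
        | tr '[:lower:]' '[:upper:]' \
        | sed 's/^/TOOL_/'
}
```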
So which options do you have?
Simple tool registration:
rule "ToolA", "myWorkflowTools/scriptName.sh"
This will tell Roddy that the script scriptName.sh exists in the resources/analysisTools/myWorkflowTools directory.
Tool with inline (Bash / Shell) code
rule "ToolA", {
shell """
#!/bin/bash
echo "\$parameterA"
echo "\$parameterB"
touch \$parameterB
"""
}
Here, Roddy will create a file called ToolA in the resources/analysisTools/inlineScripts directory.
These are the two basic types of tools: inline and external. But what about input and output parameters? Just add them to your definition. Whether the rule is inline or references an external script does not matter. The same also applies to resources.
rule "ToolA", {
file "myWorkflowTools/scriptName.sh"
input "TextFile", "parameterA"
output "aClass", "parameterB", "/tmp/someoutputfile"
}
This will configure ToolA to have one input and one output parameter. The input parameter will be accessible in the script / environment with the variable parameterA (parameterB as well). The input parameter is configured to be of the type TextFile. You can put in what you want and even use the same type for all i/o parameters, but the type will allow Roddy to check for i/o compatibility between tools. The output is of type aClass and will be placed in “/tmp/someoutputfile”. The location of the output file is actually a filename pattern. All filename pattern rules apply, please read the filename pattern section to get more information about this.
Like in XML tool definitions, it is of course possible to have more than one input or output parameter:
rule "ToolA", {
file "myWorkflowTools/scriptName.sh"
input "TextFile", "FileParameterA"
input "TextFile", "FileParameterB"
input "string", "StringParameter"
output "VCFFile", "parameterB", "/tmp/\${StringParameter}_A_vs_B.vcf.gz"
output "TextFile", "parameterB", "/tmp/\${StringParameter}_AnotherFile.txt"
}
Here you have three input and two output values. Multiple output values are always bundled and stored into a tuple object! We’ll see later, how you can access it.
Note
Brawl workflows are Groovy! Therefore please note, that the $ sign needs to be escaped in many cases! The example above uses ${StringParameter} to include the variable StringParameter in the file names. Depending on your requirements, you could also quote the filenames with a single tick ‘ to avoid the escape. However, you would then lose the ability to use Groovy variable values from the top part of the workflow (See configuration in the example workflow above).
Resources
As rules can be submitted to a compute cluster, you should make sure, that they don’t consume too many resources. Therefore, it is possible to configure them:
rule "ToolA", {
file "myWorkflowTools/scriptName.sh"
input "TextFile", "parameterA"
output "aClass", "parameterB", "/tmp/someoutputfile"
walltime "10h"
memory 2.0
cores 5 // Alternatively you can use threads if you like
}
So what’s above: cores, walltime and memory all define resources which might be required by your rule. Please look up the tool entries in the configuration section for more information.
explicit {}¶
Soooo finally, this is the part where you run your workflow. You can use all of Roddy’s capabilities inside this little closure. However, we offer shortcuts and convenience methods which might help you.
Get a file where you know the path
explicit {
def file = getSourceFile("/tmp", "TextFile")
[..]
}
If you know the path of a source file, e.g. because you passed it as a configuration value or it has a fixed position like in the above example, you can call getSourceFile. The second parameter is optional and, like in the examples with the rules, sets the type of the file. The type, again, will be used by Roddy for type checks.
Get one or more source files using a tool
This is different from the previous approach. Using the methods getSourceFileUsingTool or getSourceFilesUsingTool will allow you to run a tool on the target system which will then return a single file or a list of file objects.
explicit {
def file = getSourceFileUsingTool("ToolForSingleFile", "TextFile")
def files = getSourceFilesUsingTool("ToolForMultipleFiles", "TextFile")
// OR
BaseFile file = getSourceFileUsingTool("ToolForSingleFile", "TextFile")
List<BaseFile> files = getSourceFilesUsingTool("ToolForMultipleFiles", "TextFile")
[..]
}
As previously mentioned, we are dealing with Groovy code, so you can always use the def keyword to declare variables. Another way would be to use the BaseFile class or, if you defined a class file somewhere, you can of course use this as well. Keep in mind that the class file has to match the class in the method call!
Important
File loader scripts are special. You need to define the tool rule like described above. Otherwise it won’t work. ALSO, DO NOT USE ANY DEBUG OUTPUT! Roddy will directly create file objects for every line of output of your script!
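A loader tool therefore has to keep its output clean. The sketch below prints one file path per line and nothing else. The function name list_source_files, the *.txt pattern and the LOADER_LOG variable are made up for illustration; sending diagnostics to a separate log file is an assumption about what is safe, since this guide does not say which output streams Roddy reads.

```shell
# Hypothetical loader tool body: one file path per line on stdout, nothing else.
# $1 = directory to scan; LOADER_LOG (optional) = file for diagnostic messages
list_source_files() {
    local dir="$1" log="${LOADER_LOG:-/dev/null}" f
    echo "scanning $dir" >> "$log"     # diagnostics go to the log, never to stdout
    for f in "$dir"/*.txt; do
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```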
Get files which derive from other files
Call a tool
Call a tool with an output filegroup
Get run flags from your configuration
Native plugins¶
Frequently Asked Questions¶
Indicated packet length … too large¶
The SSH library we currently use (sshj) does not work if output is generated during your login on the submission host (on which e.g. the qsub command is executed). Make sure that no output is generated during the login, e.g. from your $HOME/.profile, $HOME/.bashrc, etc. files.
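One way to look for such output is to run a command that prints nothing over SSH and count the bytes that come back. This is a sketch only: output_bytes is a hypothetical helper, the user and host in the usage line are placeholders, and a non-interactive ssh call may not read exactly the same startup files as the connection made by Roddy.

```shell
# Hypothetical helper: count the bytes a command writes to stdout and stderr.
output_bytes() {
    "$@" 2>&1 < /dev/null | wc -c | tr -d ' '
}

# Usage sketch (placeholder user and host); anything other than 0 points to
# output produced by the login scripts:
#   output_bytes ssh user@submission-host true
```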
When something goes wrong and Roddy returns an error code¶
Roddy can exit with a range of exit codes which are:
Exit code | Description |
---|---|
255 | You must call roddy from the right location! This is the folder where roddy.sh resides. |
254 | SystemExitException thrown by GroovyServ. Only occurs when GroovyServ is enabled (not by default!). |
253 | Command line was malformed, check your input and correct mistakes. |
252 | Execution requirements unfulfilled. Seems you are missing some applications, please install them or ask your administrator to do it for you. |
251 | Startup options could not be parsed. Check your command line. |
250 | Cannot find requested feature toggle file. Please make sure, that the required file exists. |
249 | Feature toggle is not known. There are only some available. If you are unsure about this, don’t use them, if not necessary. |
248 | Unknown problem with proxy setup. Check your proxy settings. |
247 | scratchBaseDir is not defined in your application ini file. Please set it so that it matches your cluster settings. |
246 | The wrong job manager class is set in your application ini file. Please correct that. |
245 | Application properties file not found or not loadable. Make sure that the file exists. |
244 | Could not load the requested analysis. Look for typos or take another one. |
243 | Severe configuration errors occurred. Take a detailed look into the error messages and your configuration file. |
242 | Unhandled exception. That is a bad one. Take a look at any error message and get in contact with us. |
241 | Unknown SSH host. Change the hostname, it is possibly wrong. |
240 | SSH setup is not valid. Please follow the instructions and check your application ini file. |
239 | Fatal error during SSH setup. Please contact us in this case. |
100 | Someone uses a wrong exit code somewhere in Roddy. Exit codes should be in class ExitReasons (if possible) and must be in the range [1;255]. |
In any case, we try to provide you with a good explanation of what went wrong and how you can solve it. If you find the messages hard to understand, contact us.
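In wrapper scripts the table can be turned into a lookup, as in the sketch below. The function describe_roddy_exit_code is hypothetical and not part of Roddy; the texts are shortened from the table, and treating 0 as “no error” is the usual shell convention rather than something stated in the table.

```shell
# Hypothetical helper: short description for a Roddy exit code.
describe_roddy_exit_code() {
    case "$1" in
        0)   echo "no error reported" ;;   # assumption: usual shell convention
        255) echo "roddy.sh was called from the wrong location" ;;
        254) echo "SystemExitException thrown by GroovyServ" ;;
        253) echo "malformed command line" ;;
        252) echo "execution requirements unfulfilled" ;;
        251) echo "startup options could not be parsed" ;;
        250) echo "feature toggle file not found" ;;
        249) echo "unknown feature toggle" ;;
        248) echo "problem with proxy setup" ;;
        247) echo "scratchBaseDir not defined in application ini file" ;;
        246) echo "wrong job manager class in application ini file" ;;
        245) echo "application properties file not found or not loadable" ;;
        244) echo "requested analysis could not be loaded" ;;
        243) echo "severe configuration errors" ;;
        242) echo "unhandled exception" ;;
        241) echo "unknown SSH host" ;;
        240) echo "invalid SSH setup" ;;
        239) echo "fatal error during SSH setup" ;;
        100) echo "wrong exit code used inside Roddy" ;;
        *)   echo "exit code $1 is not listed in the table" ;;
    esac
}

# Usage sketch, directly after a roddy.sh call:
#   describe_roddy_exit_code "$?"
```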
The Roddy WMS¶
What is Roddy¶
Roddy is a framework for development and management of script based workflows on a batch processing cluster.
You can find the Roddy source code and its releases on our GitHub project site.
Key Features¶
Roddy has several key features which make it a good choice to be used as a base for workflows:
- Multi-Level configuration system
- Modular application design
- Access to several cluster backends (via BatchEuphoria)
- Different versions of plugins/workflows and the Roddy core application are handled in a single installation
- Various already implemented workflows
- Callable stand-alone or integrable in other applications
- Only a few dependencies and no database for the Roddy core application necessary
- Various execution modes to support users to get their work done faster
The multi-layer configuration system and the handling of plugin versions make Roddy particularly well suited for multi-user, multi-project environments.
Where to start?¶
Take a look at the example workflow package: Example workflow
Do you want to use it to run existing workflows? Then head over to the Users guide
Do you want to develop workflows with it? Open up the Plugin developers guide
Do you want to develop it? See the Developers guide
Do you have questions? Check out the Frequently Asked Questions
License and associated projects¶
Roddy is offered under an MIT based license.
We extracted from Roddy two possibly helpful open source libraries, again under MIT license:
- RoddyToolLib is a Java / Groovy library which provides several tools used in BatchEuphoria and Roddy. See the project description for more information.
- BatchEuphoria is a Java / Groovy library designed to offer easy access to cluster systems. Currently supported are PBS, SGE and LSF Rest