The problem
I recently needed to add support for three optional output files to a Nextflow tuple-based (set) channel. I had a process that writes the variants for a sample as three separate files (SNVs, indels and delins) in a single run, and I wanted each variant type to be handled separately downstream. However, not all variant types would necessarily be present. For example:
process tricksyProcess {
input:
set(
val(sampleName),
file(variantsFile)
) from tricksyInputChannel
val variantTypes
output:
set(
val(sampleName),
file("snvs.vcf"),
file("indels.vcf"),
file("delins.vcf")
) into tricksyOutputChannel
shell:
'''
my-tricksy-process \
--variants "!{variantsFile}" \
--sample-name "!{sampleName}" \
--snv-output "snvs.vcf" \
--indel-output "indels.vcf" \
--delins-output "delins.vcf";
'''
}
Fortunately, Nextflow has an optional output directive. Unfortunately, it operates at the entire-set level, rather than individual items within the set.
There are a few options for handling this:
- Combine output files into single file, and filter/split into component variant types where needed. Since in my case I have very large lists of variants, this would greatly slow down the pipeline and still require a fair amount of extra boilerplate code downstream.
- Use flow control if/else blocks to define different processes in different scenarios. With three different boolean options, I’d need 8 different definitions of my process!
- Always output all variant types and discard/ignore unwanted files downstream. In my case, my-tricksy-process would fail if a required variant type doesn’t appear in the input file, so I’d need to add dummy data to allow it to work, which risks contaminating my results. Also, it incurs significant additional compute time.
- Create dummy output files for unwanted file types. This is by far the easiest option. You would still need control downstream to prevent processing of unwanted variant types, but it’s simple. It isn’t very defensive, however, and my application is for a pipeline for use in a clinical setting. I therefore want to use Nextflow’s error and file checking strengths to ensure all and only expected file types are present. I also don’t want dummy files to get mixed up with legitimate output files when archiving the results, which could lead to confusion or inappropriate use of the files.
- Place each output file in a separate channel and join them together. This requires having a unique key to join items together, but requires no unnecessary compute, involves minimal boilerplate and outputs only the files actually needed.
Chosen solution
As you can probably guess, I opted for the fifth and final option: placing each output file in a separate channel and joining them together.
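Define your process with each independent output in a separate channel. You need to include the unique identifier (here, the sample name) in each channel. Each channel is marked optional when its output file is not expected. For the SNV output, for example (the indel and delins outputs follow the same pattern):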
output:
set(
val(sampleName),
file("snvs.vcf"),
) optional (!variantTypes.contains("snv")) into tricksyOutSnvCh
Next, inside the process’ shell block, we need to dynamically build the list of command arguments:
shell:
'''
file_args=();
if [[ "!{variantTypes.contains("snv")}" == "true" ]]; then
file_args+=(--snv-output snvs.vcf);
fi
if [[ "!{variantTypes.contains("indel")}" == "true" ]]; then
file_args+=(--indel-output indels.vcf);
fi
if [[ "!{variantTypes.contains("delins")}" == "true" ]]; then
file_args+=(--delins-output delins.vcf);
fi
my-tricksy-process \
--variants "!{variantsFile}" \
--sample-name "!{sampleName}" \
${file_args[@]:-}
'''
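The argument-building pattern above can be tried outside Nextflow. The following is a minimal sketch in plain bash, not the pipeline's actual code: build_file_args is a hypothetical helper name, the variant types are passed as plain arguments instead of being interpolated by Nextflow, and the command is only echoed. The `:-` default in the final expansion presumably keeps it safe when the array is empty and the shell treats unset variables as errors.

```shell
#!/usr/bin/env bash

# build_file_args is a hypothetical helper for this sketch only.
# It takes the requested variant types as arguments and fills the
# global array file_args with the matching output options.
build_file_args() {
    local wanted=" $* "   # pad with spaces so each type matches as a whole word
    file_args=()
    if [[ "$wanted" == *" snv "* ]]; then
        file_args+=(--snv-output snvs.vcf)
    fi
    if [[ "$wanted" == *" indel "* ]]; then
        file_args+=(--indel-output indels.vcf)
    fi
    if [[ "$wanted" == *" delins "* ]]; then
        file_args+=(--delins-output delins.vcf)
    fi
}

build_file_args snv delins
# Print the command that would be run rather than running it.
echo my-tricksy-process --sample-name SampleA ${file_args[@]:-}
```

With snv and delins requested, the echoed command contains only the --snv-output and --delins-output options, so the tool is never asked for an indel file.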
After the process definition, build a combined output channel by joining separate channels:
tricksyOutCh =
tricksyOutSnvCh
.join(tricksyOutIndelCh, remainder: true, by: 0)
.join(tricksyOutDelinsCh, remainder: true, by: 0)
.map { entry ->
def item = entry[0]
assert (entry[1] == null) == (! variantTypes.contains("snv")): \
"snv status for ${item} should be correct"
assert (entry[2] == null) == (! variantTypes.contains("indel")): \
"indel status for ${item} should be correct"
assert (entry[3] == null) == (! variantTypes.contains("delins")): \
"delins status for ${item} should be correct"
entry
}
After the join, we use a map closure to assert that each output file is null if and only if that file is not expected. This defends against the previous process outputting files that it’s not supposed to, or the channels being misused in some way.
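The "null if and only if not expected" rule is easy to get backwards, so here is a small bash sketch of the same truth table. It is an illustration only: status_ok is a hypothetical function name, and a missing join slot is represented by the literal string "null", which is an assumption made for this sketch and not how Nextflow passes values around.

```shell
#!/usr/bin/env bash

# status_ok is a hypothetical helper for this sketch only.
# $1: a file name, or the literal string "null" for a missing slot (assumption)
# $2: "true" if this variant type was requested, otherwise "false"
# Succeeds when the file is present exactly when it was requested.
status_ok() {
    local value="$1" expected="$2"
    local present="true"
    if [[ "$value" == "null" ]]; then
        present="false"
    fi
    [[ "$present" == "$expected" ]]
}

# Walk through all four combinations of presence and expectation.
for pair in "snvs.vcf true" "null false" "null true" "snvs.vcf false"; do
    read -r value expected <<< "$pair"
    if status_ok "$value" "$expected"; then
        echo "value=$value expected=$expected: ok"
    else
        echo "value=$value expected=$expected: mismatch"
    fi
done
```

The first two combinations pass and the last two are reported as mismatches, which corresponds to the cases where the assert statements in the map closure would fail.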
Finally, we can use the new output channel in any desired downstream process. While it would be best to use the variantTypes object to decide which input items to use, we can also infer when an input file exists or is null as Nextflow will generate a filename in the pattern input.XXX for a null input object:
input:
set(
val(sampleName),
file(snvFile),
file(indelFile),
file(delinsFile)
) from tricksyOutCh
shell:
'''
use_files=();
for in_file in "!{snvFile}" "!{indelFile}" "!{delinsFile}"; do
if [[ ! "${in_file}" =~ ^input\\.[0-9]*$ ]]; then
use_files+=("${in_file}");
fi
done
...
'''
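The filename filter can also be checked on its own in plain bash. This is a sketch under assumptions: is_placeholder is a hypothetical helper name, and "input.2" is a made-up placeholder name that simply follows the input.XXX pattern described above. Note that the regular expression has a single backslash here; inside the Nextflow shell block it is written with two because the backslash itself has to be escaped in the Groovy string.

```shell
#!/usr/bin/env bash

# is_placeholder is a hypothetical helper for this sketch only.
# Succeeds when the name looks like a generated placeholder (input.<digits>).
is_placeholder() {
    [[ "$1" =~ ^input\.[0-9]*$ ]]
}

use_files=()
# "input.2" stands in for a null input slot (assumed name).
for in_file in "SampleA_snvs.vcf" "input.2" "SampleA_delins.vcf"; do
    if ! is_placeholder "$in_file"; then
        use_files+=("$in_file")
    fi
done

echo "Using files: ${use_files[*]}"
```

Only the two real file names end up in use_files; the placeholder is skipped.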
Working example
Try the above out for yourself with the code below!
Copy this into a file called bin/my-tricksy-process:
#!/bin/bash
# A simple demo script that just generates specific output files.
args=("$@");
nargs="${#args[@]}";
idx=0;
for (( ; idx < nargs; idx ++ )); do
this_arg="${args[$idx]}";
next_arg="${args[((idx + 1))]:-}";
echo "$this_arg $next_arg"
case "${this_arg}" in
--variants)
(( idx ++ ));
variants_file="$next_arg";
;;
--sample-name)
(( idx ++ ));
sample_name="$next_arg";
;;
--snv-output)
(( idx ++ ));
snv_file="$next_arg";
;;
--indel-output)
(( idx ++ ));
indel_file="$next_arg";
;;
--delins-output)
(( idx ++ ));
delins_file="$next_arg";
;;
esac
done
if [[ ! -z "${snv_file}" ]]; then
echo "$sample_name" > "$snv_file";
fi
if [[ ! -z "${indel_file}" ]]; then
echo "$sample_name" > "$indel_file";
fi
if [[ ! -z "${delins_file}" ]]; then
echo "$sample_name" > "$delins_file";
fi
Then save this in bin/workflow.nf:
#!/usr/bin/env nextflow
// The list of variant types to use. This would probably be taken from input
// parameters. Feel free to try this script out with different combinations of
// these values.
variantTypes = ["snv", "indel", "delins"]
// Our dummy input channel.
tricksyInCh =
Channel.from(["SampleA", "SampleB", "SampleC"])
.map { sample -> [sample, new File("${sample}.vcf")] }
// Some shortcuts to the boolean flags for convenience
useSnv = variantTypes.contains("snv")
useIndel = variantTypes.contains("indel")
useDelins = variantTypes.contains("delins")
process tricksyProc {
storeDir 'out'
input:
set(
val(sampleName),
file(variantsFile)
) from tricksyInCh
val useSnv
val useIndel
val useDelins
output:
set(
val(sampleName),
file("${sampleName}_snvs.vcf"),
) optional (!useSnv) into tricksyOutSnvCh
set(
val(sampleName),
file("${sampleName}_indels.vcf"),
) optional (!useIndel) into tricksyOutIndelCh
set(
val(sampleName),
file("${sampleName}_delins.vcf"),
) optional (!useDelins) into tricksyOutDelinsCh
shell:
'''
file_args=();
if [[ "!{useSnv}" == "true" ]]; then
file_args+=(--snv-output !{sampleName}_snvs.vcf);
fi
if [[ "!{useIndel}" == "true" ]]; then
file_args+=(--indel-output !{sampleName}_indels.vcf);
fi
if [[ "!{useDelins}" == "true" ]]; then
file_args+=(--delins-output !{sampleName}_delins.vcf);
fi
my-tricksy-process \
--variants "!{variantsFile}" \
--sample-name "!{sampleName}" \
${file_args[@]:-}
'''
}
tricksyOutCh =
tricksyOutSnvCh
.join(tricksyOutIndelCh, remainder: true, by: 0)
.join(tricksyOutDelinsCh, remainder: true, by: 0)
.map { entry ->
def item = entry[0]
assert (entry[1] == null) == (! useSnv): \
"snv status for ${item} should be correct"
assert (entry[2] == null) == (! useIndel): \
"indel status for ${item} should be correct"
assert (entry[3] == null) == (! useDelins): \
"delins status for ${item} should be correct"
entry
}
process downstreamProc {
storeDir 'out'
input:
set(
val(sampleName),
file(snvFile),
file(indelFile),
file(delinsFile)
) from tricksyOutCh
output:
file("${sampleName}_files.txt")
shell:
'''
use_files=();
for in_file in "!{snvFile}" "!{indelFile}" "!{delinsFile}"; do
if [[ ! "${in_file}" =~ ^input\\.[0-9]*$ ]]; then
use_files+=("${in_file}");
fi
done
echo "Using files: ${use_files[@]}" > "!{sampleName}_files.txt";
'''
}
Then run it like so:
export PATH="$(realpath bin):$PATH"
chmod +x bin/my-tricksy-process bin/workflow.nf
rm -rf .nextflow* work out
bin/workflow.nf
# N E X T F L O W ~ version 19.07.0
# Launching `bin/example.nf` [ridiculous_stallman] - revision: 96c7560721
# executor > local (6)
# [cb/ed62cf] process > tricksyProc (3) [100%] 3 of 3 ✔
# [a0/34825e] process > downstreamProc (3) [100%] 3 of 3 ✔
cat out/*_files.txt
# Using files: SampleA_snvs.vcf SampleA_indels.vcf SampleA_delins.vcf
# Using files: SampleB_snvs.vcf SampleB_indels.vcf SampleB_delins.vcf
# Using files: SampleC_snvs.vcf SampleC_indels.vcf SampleC_delins.vcf
Conclusion
Nextflow is powerful but currently lacks an out-of-the-box feature for controlling which items within a Set Channel should be present. However, with just a little bit of extra code and no wasted compute time, it’s possible to guarantee exactly the correct output files are present and nothing more.