Saturday 31 August 2019

Optional Items in Nextflow Set Channels

The problem

I recently needed to add support for three optional output files to a Nextflow tuple-based (set) channel. In my case, I had a process that outputs the variant files for a sample separately but simultaneously, and I wanted each variant type to be handled separately downstream. That is, it needed to output the SNV, indel and delins variants for a sample as three separate files, but not all variant types would necessarily be present. For example:

process tricksyProcess {
    input:
    set(
        val(sampleName),
        file(variantsFile)
    ) from tricksyInputChannel
    val variantTypes

    output:
    set(
        val(sampleName),
        file("snvs.vcf"),
        file("indels.vcf"),
        file("delins.vcf")
    ) into tricksyOutputChannel

    shell:
    '''
    my-tricksy-process \
        --variants "!{variantsFile}" \
        --sample-name "!{sampleName}" \
        --snv-output "snvs.vcf" \
        --indel-output "indels.vcf" \
        --delins-output "delins.vcf";
    '''
}

Fortunately, Nextflow has an optional output directive. Unfortunately, it operates at the entire-set level, rather than individual items within the set.

There are a few options for handling this:

  1. Combine output files into single file, and filter/split into component variant types where needed. Since in my case I have very large lists of variants, this would greatly slow down the pipeline and still require a fair amount of extra boilerplate code downstream.
  2. Use flow control if/else blocks to define different processes in different scenarios. With three different boolean options, I’d need 8 different definitions of my process!
  3. Always output all variant types and discard/ignore unwanted files downstream. In my case, my-tricksy-process would fail if a required variant type doesn’t appear in the input file, so I’d need to add dummy data to allow it to work, which risks contaminating my results. Also, it incurs significant additional compute time.
  4. Create dummy output files for unwanted file types. This is by far the easiest option. You would still need control downstream to prevent processing of unwanted variant types, but it’s simple. It isn’t very defensive, however, and my application is for a pipeline for use in a clinical setting. I therefore want to use Nextflow’s error and file checking strengths to ensure all and only expected file types are present. I also don’t want dummy files to get mixed up with legitimate output files when archiving the results, which could lead to confusion or inappropriate use of the files.
  5. Place each output file in a separate channel and join them together. This requires having a unique key to join items together, but requires no unnecessary compute, involves minimal boilerplate and outputs only the files actually needed.

Chosen solution

As you can probably guess, I opted for option 5.

Define your process with each independent output in a separate channel. You need to include the unique identifier in each channel. Each channel will be marked optional if the output file is not expected.

    output:
    set(
        val(sampleName),
        file("snvs.vcf")
    ) optional (!variantTypes.contains("snv")) into tricksyOutSnvCh

Next, inside the process’s shell block, we need to dynamically build the list of command arguments:

    shell:
    '''
    file_args=();
    if [[ "!{variantTypes.contains("snv")}" == "true" ]]; then
        file_args+=(--snv-output snvs.vcf);
    fi
    if [[ "!{variantTypes.contains("indel")}" == "true" ]]; then
        file_args+=(--indel-output indels.vcf);
    fi
    if [[ "!{variantTypes.contains("delins")}" == "true" ]]; then
        file_args+=(--delins-output delins.vcf);
    fi

    my-tricksy-process \
        --variants "!{variantsFile}" \
        --sample-name "!{sampleName}" \
        ${file_args[@]:-}
    '''
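
To see how this argument-building idiom behaves outside Nextflow, here is a minimal stand-alone Bash sketch. The `build_args` function is a hypothetical helper for illustration only; its three arguments stand in for the "true"/"false" strings that the `!{...}` expressions are replaced with. I'm assuming the task script runs under `set -u`, which I believe is Nextflow's default, and which is why the `:-` default is used when expanding a possibly empty array.

```shell
#!/bin/bash
set -u

# Hypothetical helper: prints the output flags for the enabled variant types.
# Each argument is the string "true" or "false", mimicking an interpolated
# Groovy boolean.
build_args() {
    local use_snv="$1" use_indel="$2" use_delins="$3"
    local file_args=()
    if [[ "${use_snv}" == "true" ]]; then
        file_args+=(--snv-output snvs.vcf)
    fi
    if [[ "${use_indel}" == "true" ]]; then
        file_args+=(--indel-output indels.vcf)
    fi
    if [[ "${use_delins}" == "true" ]]; then
        file_args+=(--delins-output delins.vcf)
    fi
    # Older Bash versions treat an empty array as unset under 'set -u';
    # the ':-' default keeps the expansion safe in that case.
    echo ${file_args[@]:-}
}

build_args true false true
# prints: --snv-output snvs.vcf --delins-output delins.vcf
build_args false false false
# prints an empty line
```

Only the flags for the requested variant types reach the command line, so the tool never writes a file that the matching output channel does not expect.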

After the process definition, build a combined output channel by joining separate channels:

tricksyOutCh =
    tricksyOutSnvCh
    .join(tricksyOutIndelCh, remainder: true, by: 0)
    .join(tricksyOutDelinsCh, remainder: true, by: 0)
    .map { entry ->
        def item = entry[0]
        assert (entry[1] == null) == (! variantTypes.contains("snv")): \
            "snv status for ${item} should be correct"
        assert (entry[2] == null) == (! variantTypes.contains("indel")): \
            "indel status for ${item} should be correct"
        assert (entry[3] == null) == (! variantTypes.contains("delins")): \
            "delins status for ${item} should be correct"

        entry
    }

After doing the join, we use a map closure to assert that the value of each output file is null if and only if that output file is not expected. This defends against the upstream process outputting files that it is not supposed to, or being misused in some way.

Finally, we can use the new output channel in any desired downstream process. While it would be best to use the variantTypes object to decide which input items to use, we can also infer whether an input file exists or is null, because Nextflow generates a filename matching the pattern input.XXX for a null input object:

    input:
    set(
        val(sampleName),
        file(snvFile),
        file(indelFile),
        file(delinsFile)
    ) from tricksyOutCh

    shell:
    '''
    use_files=();
    for in_file in "!{snvFile}" "!{indelFile}" "!{delinsFile}"; do
        if [[ ! "${in_file}" =~ ^input\\.[0-9]*$ ]]; then
            use_files+=("${in_file}");
        fi
    done

    ...
    '''
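
The filter can be tried on its own with the stand-alone Bash sketch below. The `filter_real_files` function and the sample filenames (including the placeholder `input.2`) are hypothetical and only illustrate the pattern match. In plain Bash a single backslash is enough in the regular expression; the doubled backslash above is needed only because the script is embedded in a Groovy string.

```shell
#!/bin/bash
set -u

# Hypothetical helper: prints only the arguments that do not look like the
# placeholder names (input.XXX) used for null inputs.
filter_real_files() {
    local use_files=()
    local in_file
    for in_file in "$@"; do
        if [[ ! "${in_file}" =~ ^input\.[0-9]*$ ]]; then
            use_files+=("${in_file}")
        fi
    done
    # ':-' guards against an empty array under 'set -u' in older Bash.
    echo ${use_files[@]:-}
}

filter_real_files "SampleA_snvs.vcf" "input.2" "SampleA_delins.vcf"
# prints: SampleA_snvs.vcf SampleA_delins.vcf
```

This relies on real output files never being named like a placeholder, which is one more reason to prefer checking variantTypes directly where possible.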

Working example

Try the above out for yourself with the code below!

Copy this into a file called bin/my-tricksy-process:

#!/bin/bash

# A simple demo script that just generates specific output files.

args=("$@");
nargs="${#args[@]}";
idx=0;
for (( ; idx < nargs; idx ++ )); do
    this_arg="${args[$idx]}";
    next_arg="${args[((idx + 1))]:-}";

    echo "$this_arg $next_arg"
    case "${this_arg}" in
        --variants)
            (( idx ++ ));
            variants_file="$next_arg";
            ;;

        --sample-name)
            (( idx ++ ));
            sample_name="$next_arg";
            ;;

        --snv-output)
            (( idx ++ ));
            snv_file="$next_arg";
            ;;

        --indel-output)
            (( idx ++ ));
            indel_file="$next_arg";
            ;;

        --delins-output)
            (( idx ++ ));
            delins_file="$next_arg";
            ;;

    esac
done

# Write each output file only if its flag was supplied.
if [[ -n "${snv_file:-}" ]]; then
    echo "$sample_name" > "$snv_file";
fi

if [[ -n "${indel_file:-}" ]]; then
    echo "$sample_name" > "$indel_file";
fi

if [[ -n "${delins_file:-}" ]]; then
    echo "$sample_name" > "$delins_file";
fi

Then in bin/workflow.nf save this:

#!/usr/bin/env nextflow

// The list of variant types to use. This would probably be taken from input
// parameters. Feel free to try this script out with different combinations of
// these values.
variantTypes = ["snv", "indel", "delins"]

// Our dummy input channel.
tricksyInCh =
    Channel.from(["SampleA", "SampleB", "SampleC"])
    .map { sample -> [sample, new File("${sample}.vcf")] }

// Some shortcuts to the boolean flags for convenience
useSnv = variantTypes.contains("snv")
useIndel = variantTypes.contains("indel")
useDelins = variantTypes.contains("delins")

process tricksyProc {
    storeDir 'out'

    input:
    set(
        val(sampleName),
        file(variantsFile)
    ) from tricksyInCh
    val useSnv
    val useIndel
    val useDelins

    output:
    set(
        val(sampleName),
        file("${sampleName}_snvs.vcf")
    ) optional (!useSnv) into tricksyOutSnvCh

    set(
        val(sampleName),
        file("${sampleName}_indels.vcf")
    ) optional (!useIndel) into tricksyOutIndelCh

    set(
        val(sampleName),
        file("${sampleName}_delins.vcf")
    ) optional (!useDelins) into tricksyOutDelinsCh

    shell:
    '''
    file_args=();
    if [[ "!{useSnv}" == "true" ]]; then
        file_args+=(--snv-output !{sampleName}_snvs.vcf);
    fi
    if [[ "!{useIndel}" == "true" ]]; then
        file_args+=(--indel-output !{sampleName}_indels.vcf);
    fi
    if [[ "!{useDelins}" == "true" ]]; then
        file_args+=(--delins-output !{sampleName}_delins.vcf);
    fi

    my-tricksy-process \
        --variants "!{variantsFile}" \
        --sample-name "!{sampleName}" \
        ${file_args[@]:-}
    '''
}

tricksyOutCh =
    tricksyOutSnvCh
    .join(tricksyOutIndelCh, remainder: true, by: 0)
    .join(tricksyOutDelinsCh, remainder: true, by: 0)
    .map { entry ->
        def item = entry[0]
        assert (entry[1] == null) == (! useSnv): \
            "snv status for ${item} should be correct"
        assert (entry[2] == null) == (! useIndel): \
            "indel status for ${item} should be correct"
        assert (entry[3] == null) == (! useDelins): \
            "delins status for ${item} should be correct"

        entry
    }

process downstreamProc {
    storeDir 'out'

    input:
    set(
        val(sampleName),
        file(snvFile),
        file(indelFile),
        file(delinsFile)
    ) from tricksyOutCh

    output:
    file("${sampleName}_files.txt")

    shell:
    '''
    use_files=();
    for in_file in "!{snvFile}" "!{indelFile}" "!{delinsFile}"; do
        if [[ ! "${in_file}" =~ ^input\\.[0-9]*$ ]]; then
            use_files+=("${in_file}");
        fi
    done

    echo "Using files: ${use_files[@]}" > "!{sampleName}_files.txt";
    '''
}

Then run it like so:

export PATH="$(realpath bin):$PATH"

chmod +x bin/my-tricksy-process bin/workflow.nf

rm -rf .nextflow* work out

bin/workflow.nf
# N E X T F L O W  ~  version 19.07.0
# Launching `bin/example.nf` [ridiculous_stallman] - revision: 96c7560721
# executor >  local (6)
# [cb/ed62cf] process > tricksyProc (3)    [100%] 3 of 3 ✔
# [a0/34825e] process > downstreamProc (3) [100%] 3 of 3 ✔

cat out/*_files.txt
# Using files: SampleA_snvs.vcf SampleA_indels.vcf SampleA_delins.vcf
# Using files: SampleB_snvs.vcf SampleB_indels.vcf SampleB_delins.vcf
# Using files: SampleC_snvs.vcf SampleC_indels.vcf SampleC_delins.vcf

Conclusion

Nextflow is powerful but currently lacks an out-of-the-box feature for controlling which items within a Set Channel should be present. However, with just a little bit of extra code and no wasted compute time, it’s possible to guarantee exactly the correct output files are present and nothing more.

Wednesday 9 December 2015

Chinese IP Courts Flourishing

Many westerners are of the view that in China it is difficult to protect their innovations and, as a consequence, some businesses and people decide to conduct R&D and to file their patent and trademark applications outside of China. Recent advances should allay these fears, however.

The Chinese government established three specialist IP courts at the end of last year in response to the huge increase in IP-related cases. This great news means that the Chinese legal system will be able to conduct litigation more efficiently. The courts have tried to be more transparent by publishing their cases and information online. Additionally, English translations of cases and documents are being prepared so that international businesses can understand what is happening and how to make use of the courts. These specialized courts are a great development for IP in China, and online publication provides useful tools for both local and international businesses. However, the new courts still have considerable room for improvement when compared to the much more mature IP court systems in the West. If the courts continue to receive support and attention, the transparency will be an enduring boon.

Within China, the majority of IP infringement cases (90%) are between Chinese entities. This indicates that Chinese people and businesses have a keen interest in, and understanding of, IP protection and that they want to use legal methods to protect their innovations. This figure may also suggest that Chinese organizations are investing heavily in R&D that needs to be protected.

Furthermore, in disputes between a foreign entity and a Chinese entity, more than 70% of IP cases were won by the foreign entity. This surprising result might be due to the foreign participants' greater understanding of IP protection, or to their recruiting better representatives who are more experienced in preparing for such cases. However, I feel this result also shows that the courts are not biased by nationalism, as they do not favour Chinese litigants. That said, China needs to greatly increase its institutional capability to handle the new cases; there will be a huge demand for Chinese IP legal experts for a long time to come.

I hope that these figures reassure foreign businesses that they can invest in R&D in China. Businesses will benefit from respecting the cultural differences of the Chinese market, just as China is making great efforts to increase the fairness of its legal system and to make investment hugely profitable.

References


Ward, A.M. Have the Chinese specialist IP courts made the grade? The IPKat
(retrieved 09/12/2015)

Cao, Y. Hearings of IP disputes exceed expectations. China Intellectual Property – China Daily
(retrieved 09/12/2015)

Monday 1 December 2014

How do you make the world develop better habits?

You are waiting for the train after a long day of work and out of the corner of your eye you notice a new poster:

Yang Yeo et al. J.Walter Thompson U.S.A., Inc, Shanghai

What did you see?

Is it the tooth, the Colosseum in Rome, the toothpaste or the foreign text (which I hope you guessed is Chinese)? If you see all of these, then it will not be a huge surprise to discover that this image, which exemplifies expertise, is a very successful advertising campaign for a Chinese toothpaste brand - Maxam.

The creativity does not stop here. If you are a Mandarin speaker, you will quickly notice the pun in the caption. The verb used here is (zhu), meaning ‘decay’, but it is pronounced exactly the same as (zhu), meaning ‘settle down’. This caption emphasises the message that bacterial decay can settle down inside your teeth as if building its own civilization, just as the Roman Empire built Rome. The concept then becomes a popular topic for discussion amongst a wide range of audiences.

Welcome to the world of healthcare advertising, where science and creativity meet to promote a powerful message.

Yang Yeo et al. J.Walter Thompson U.S.A., Inc, Shanghai

In an environment filled with successful and colourful marketing campaigns, healthcare advertising plays an important but serious role in raising the awareness of diseases and their treatments. Successful campaigns should have a long-lasting effect so that the audience start taking care of their health and form beneficial habits. Next to adverts for fabulous holiday destinations, exciting blockbuster movies and sexy new fashion, healthcare advertisements have to grab everybody’s attention and alert all to the importance of taking care of their health. What a challenging task!

The picture above is one of the images produced in a series that was widely distributed as posters throughout train stations and public transport in China. This advertisement aimed to raise awareness of oral hygiene and the importance of brushing regularly. It not only shows the dire consequences of failing to regularly brush teeth, but also emphasizes the strong similarity between the decay caused by bacteria on teeth and the great historic civilizations of mankind. Both take a long time to conquer, but will leave a permanent mark. I would surmise that nobody who sees this poster could fail to appreciate the urgency of taking care of their oral hygiene.

This image is set in a parallel universe of the tooth and it stands out from other adverts that might be seen on public transport. The tooth appears as a static snapshot during the construction of the Colosseum. The individual arches can be seen clearly, and there are people around the façade that betray a large population. The slogan and the toothpaste appear to be moving into view as if to take action against the slow process of decay. The audience will quickly recognise the product that is being advertised. The strong contrast between the colours of the tooth, the toothpaste and the background makes the audience focus on this amazing architecture built in the tooth and the product needed to combat it. The fine details in the model of the Colosseum are breath-taking and remind the audience about the famous old saying “Rome wasn’t built in a day!”

The expertise on display in this image doesn’t stop at the brilliant idea alone. The image depicts a 3D decaying tooth, which was generated using computer models and image manipulation software that required both patience and immense skill to produce the desired effects. The Colosseum model shows very fine details that give it a high-definition look. The incredible details are fascinating and make the audience want to look closer for longer to discover what else they can find in the picture. An advert that makes the audience want to pay attention will no doubt be successful.


Yang Yeo et al. J.Walter Thompson U.S.A., Inc, Shanghai

Indeed, this advertising campaign has gone viral since it came to the public eye. Many people have shared this poster with their friends on social networking websites such as Weibo (the Chinese equivalent of Twitter), Facebook and Pinterest. The advert has been widely distributed within China and around the globe, giving the message considerable longevity. It also earned praise from advertising industry magazines, such as “[The prizes] were well deserved” [1] and “The detail in the ruins sculpted inside, is remarkable and the whole picture is impressive” [2]. Advertising enthusiasts and professionals have studied the techniques so that they can replicate this advert’s success.

The photograph of a 3-D model is innovative and novel in the advertising world. Many prestigious awards have been bestowed upon the campaign [3], including the Cannes Lions 2012 (2 Gold, 2 Silver & 2 Bronze), which is the highest award in the industry.

So do you have the urge to brush your teeth now?

Image credits

Client: Maxam
Agency: JWT (J.Walter Thompson U.S.A., Inc), Shanghai
Chief Creative Officer: Yang Yeo
Executive Creative Officer: Elvis Chau
Creative Officer: Hattie Cheng
Copywriter: Chanfron Zhao
Art Directors: Haoxi Lv, Danny Li
Print Producers: Liza Law, Joseph Yu, Isaac Xu, Chivel Miao
Photographer/Illustrator: Surachai Puthikulangkura at Illusion
Illustrator: Supachai U-Rairat
Producers: Somsak Pairew, Anotai Panmongkol

References

  1. Weekly Creative Inspiration (2013) DDG Mag, District Design Group. http://districtdesigngroup.com/weekly-creative-inspiration-dont-let-germs-settle-down/ (last retrieved 7th June 2014)
  2. Maxam Civilizations (2012) dutch DZINE. http://www.dutchdzine.com/tag/maxam-civilizations/ (last retrieved 7th June 2014)
  3. Civilization-Rome (2014) Welovead. http://www.welovead.com/en/works/details/384DkoqB (last retrieved 7th June 2014)