High-performance computing for microbiome analysis
As part of the Human Heredity and Health in Africa (H3Africa) project, researchers at UCT are investigating the microorganisms that live in the nasal passageways and throats of children. These microbial communities are of interest because they influence the likelihood of a child developing pneumonia and wheezing illness, a disease that is a precursor to asthma.
To determine which types of bacteria are present in the upper airways of children and to estimate their abundance, researchers gather samples and use sequencing to analyse a specific region of the microorganisms’ DNA. Such studies may involve hundreds of samples, millions of DNA sequences and hundreds of types of microorganisms. Large-scale projects of this kind require high-throughput processing, intensive data management and task scheduling. They also require the help of a bioinformatician, someone who applies information technology to biological and medical research.
Gerrit Botha and Dr Katie Lennard – bioinformaticians based at the Computational Biology Division (CBIO) and members of the Pan African Bioinformatics Network for H3Africa (H3ABioNet) – have developed a streamlined process for analysing microbiome samples on the university’s high-performance computing system.
Such data-analysis pipelines, as they are known, take data inputs and guide them through a number of processing steps that have been linked together. The pipeline Botha and Lennard have developed takes sequencing data and the associated metadata and runs them through pre-processing steps, as well as alignment and classification algorithms.
“You can just imagine: there are a lot of steps involved, each of which uses different tools. You have to make sure that the output of one tool is compatible with the input of the next,” says Botha. “Our main aim was to build a package that you can give to researchers and that they can use easily.” The pipeline he co-developed has been made available to UCT researchers on the eResearch high-performance computing cluster, and is used in other fields, including oceanography and immunology.
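The idea of linking steps so that the output of one tool becomes the input of the next can be illustrated with a minimal Python sketch. This is not the pipeline Botha and Lennard built: the function names, the toy reads and the `REFERENCE` lookup are all hypothetical, and an exact-match lookup stands in for the real alignment and classification algorithms.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy reference: marker sequence -> taxon label (illustrative only).
REFERENCE = {
    "ACGTACGT": "taxon_A",
    "TTGACCAA": "taxon_B",
}

def preprocess(reads, min_length=8):
    """Drop reads shorter than min_length and normalise them to upper case."""
    return [(sample, seq.upper()) for sample, seq in reads if len(seq) >= min_length]

def classify(reads):
    """Label each read by exact match against the toy reference."""
    return [(sample, REFERENCE.get(seq, "unclassified")) for sample, seq in reads]

def count_abundance(labelled):
    """Count reads per (sample, taxon) pair."""
    return Counter(labelled)

def run_pipeline(data, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda acc, step: step(acc), steps, data)

# Made-up reads as (sample_id, sequence) pairs.
reads = [
    ("child_01", "acgtacgt"),
    ("child_01", "ACGTACGT"),
    ("child_01", "TTGACCAA"),
    ("child_02", "TTGACCAA"),
    ("child_02", "GGGG"),      # too short, removed in pre-processing
    ("child_02", "CCCCCCCC"),  # no match in the toy reference
]

abundance = run_pipeline(reads, [preprocess, classify, count_abundance])
for (sample, taxon), n in sorted(abundance.items()):
    print(sample, taxon, n)
```

Because each step accepts exactly what the previous step returns, a step can be swapped or updated without touching the others, which is the compatibility concern Botha describes.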
Recognising the need to update the pipeline and improve its efficiency, Botha and other developers tackled the challenge in October 2016 as part of a coding hackathon run by H3ABioNet. The hackathon brought together developers from H3ABioNet, who worked in groups of three or four to create solutions for analysing different types of H3Africa data. One group of developers, including Botha, created a new pipeline for microbiome data which, among other things, makes software updates easier and is more portable: that is, it can be used on other computing clusters and personal computers.
Since the hackathon concluded, Botha has been working to convert the pipeline to run on a different containerisation platform, Singularity, and to test it on UCT’s cluster. The focus of a recent workshop at the eResearch Africa 2017 conference, containerisation involves encapsulating one or more software applications in a container with its own operating environment. This helps to ensure that the software runs reliably across computing environments, and makes applications easy to deploy, upgrade and share. Botha’s work on containerisation is highly innovative and a leading example of research computing practice.
Once Botha has completed this conversion and testing, the new pipeline will be made available to UCT researchers on the high-performance computing cluster. Other institutions and researchers will also be able to run this pipeline on their computing clusters and personal computers.