The core goal of the facility is to provide a research tool that enables South African leadership in data-intensive astronomy and bioinformatics. Advancing this goal involves articulating a strategic science programme and developing the technology solutions required to enable it. This section describes the goals and expected outcomes of the initial strategic science programmes, the data-intensive technology development and the research data management programme.
Data-intensive cloud infrastructure
The Ilifu data-intensive facility focuses on research and development of cloud-based solutions for data-intensive research in the strategic science domains of astronomy and bioinformatics.
In a precursor project, a two-node implementation of a multi-node cloud infrastructure was demonstrated between the University of Cape Town (UCT) and North-West University (NWU). This infrastructure provided cloud compute services and was an early prototype of the technical foundation of the Ilifu facility. The Inter-University Institute for Data-Intensive Astronomy (IDIA) consortium, made up of UCT, the University of Pretoria and the University of the Western Cape, then purchased, further developed and operated a significant compute facility as a second-generation system, which now forms part of the Ilifu computing infrastructure along with a major equipment contribution from the DIRISA grant and from UCT’s Computational Biology Group. The Ilifu partners have further developed scalable systems for cloud-based provisioning of data-centric resources, and have prototyped a tiered, federated cloud infrastructure with consortium partners and external collaborators.
The goals of the project are to:
- Provide a new model for provisioning of data-intensive research infrastructure to researchers.
- Federate cloud systems to create a common eResearch cyberinfrastructure system.
- Demonstrate cloud-based solutions for strategic projects in astronomy and bioinformatics.
Strategic science programmes
The Ilifu project supports strategic data-intensive research in the fields of astronomy and bioinformatics. Ilifu will provide a platform for globally distributed teams of researchers to access, process and visualise the large data sets in the strategic science programmes in each of these disciplines.
Astronomy
The Square Kilometre Array (SKA) is driving the largest big-data challenge of the coming decade. The operation of the South African MeerKAT radio telescope marks the beginning of a big-data revolution in Africa.
MeerKAT will be operated as a national facility for about five years before being incorporated into the SKA. As such, it is a precursor of the SKA itself, and its scientific programmes and data systems are on the pathway to the SKA.
The Ilifu astronomy strategic project focuses on the data flow, processing and scientific analysis of the large data sets from the five imaging MeerKAT Large Survey Projects: MIGHTEE, LADUMA, MHONGOOSE, ThunderKAT and the MeerKAT Fornax Survey. Since, for certain science goals, radio astronomy data provides only part of the scientific data required, this project also involves the development of data systems and tools for analysis of multi-wavelength astronomy data.
The goals of the astronomy science project are to:
- Set up an agile development environment for calibration and imaging workflows to process data from MeerKAT and other SKA precursor and pathfinder projects.
- Develop systems for science-project-based data flow, transport and management between data sources and the Ilifu facility.
- Provide a platform for provision of resources to users for post-processing and analytics of MeerKAT science products.
- Establish eResearch platforms and tools to facilitate user access and collaboration.
- Work with the International Virtual Observatory and the South African Astroinformatics Alliance on metadata standards and database systems for multi-wavelength astronomy analyses.
- Work with national and international collaborators to establish a federated cloud system that brings together distributed infrastructure and data resources into an eResearch platform.
Bioinformatics
Analysis of biological data poses many challenges because of the size and complexity of the data and the diversity of tools and algorithms required to process and interpret it. Three major bioinformatics strategic projects are in progress, led by the University of the Western Cape (UWC), the University of Cape Town (UCT) and Stellenbosch University (SU).
Implementing a platform for tuberculosis surveillance in Africa (UWC)
The use of omics technologies and public health initiatives in Africa has given insights into the dynamics of tuberculosis infection. These approaches ultimately need to inform the roll-out of cost-effective diagnostic technologies and health interventions, yet there are no data analytics platforms in Africa that allow researchers to scale their analyses at the site where data is generated.
This project aims to harness cloud-based and metadata-aware technologies to facilitate distribution of algorithms and storage of omics data for access to and use of data and protocols by researchers in South Africa, Ghana, Uganda and Zimbabwe.
To date we have:
- Implemented scalable Ceph storage for storing biological data.
- Designed a Neo4j database for storing bacterial omics data.
- Provided a scalable database to accommodate biological and clinical information.
Ongoing research:
- To provide a platform that allows rapid analysis of tuberculosis data at data-generating sites using an OpenStack platform.
Implementing an imputation service for the analysis of African human genetic data (UCT)
As part of the Human Heredity and Health in Africa (H3Africa) project, researchers have designed a new genotyping array that is customised for African populations. This is accompanied by a reference panel that is more appropriate for African genetic data than other available panels.
This project uses the Ilifu computing and storage facilities to run imputations for data generated on the new genotyping array using the African reference panel. H3Africa collaborators are able to use this service for their array data. If resources allow, we can then selectively open this up to other users to enable them to do imputation using the reference panel.
The goals of the project are to:
- Develop and implement a single nucleotide polymorphism imputation service using an African reference panel.
- Use this tool for H3Africa projects to analyse genetic data generated by the H3Africa genotyping array.
- Provide the tool as a service for other groups to do imputation using the African reference panel.
Omics computation for precision medicine (SU)
Precision medicine is the customisation of healthcare to individual patients. Omics data, such as genomics and metabolomics data, has great promise for implementing precision medicine and is already being used to predict the outcomes of treatments in pharmacogenomics and cancer treatment.
The current treatment regimen for tuberculosis is a compromise between overtreatment of the subset of cases that are cured quickly and undertreatment of those who are either not cured or have another episode within a year or two after the end of treatment. The long duration of treatment and the side effects of the drugs increase the probability of non-adherence and incomplete treatment, which compounds the problem of drug-resistant strains. It is thus urgent to find treatment modes that are effective but last only as long as necessary.
Molecular or omic techniques can be used to determine drug resistance in much less time than standard culture procedures and can guide adaptation of treatment regimens.
The goals of the project are to:
- Develop computational pipelines using omics data for a pilot project to predict risk for poor or favourable outcomes to tuberculosis treatment.
- Use pipelines on existing data sets of host transcriptomics and metabolomics, and pathogen whole-genome sequences.
- Expand to accommodate additional omics data types.
Research data management
A well-considered research data management programme is the foundation of an open and collaborative culture of data-intensive research and innovation. Both access to and visibility of data and the enhanced reputation of the researcher are identified as key elements of success.
The aims of this project are to engage closely with researchers to identify the needs of users across multiple disciplines, to measure the perceived hurdles in the use of the infrastructure and to ultimately inform an advocacy programme that responds to those needs and raises awareness of the benefits of data sharing and open science in the data-intensive research environment.
The objectives of the project are to:
- Develop policy on data archiving, accessibility and reuse to govern future use of the Ilifu infrastructure.
- Develop guidelines to facilitate the deployment of user-friendly infrastructure tools, interfaces and services.
- Develop an advocacy programme on the benefits of data sharing and the services provided.
- Place postgraduate students and mid-career training delegates with projects supported by the facility in work-integrated learning programmes to implement research data management policies and services.
Working groups comprising representatives from all consortium partners will deal with the following topics:
- Governance, data policy and infrastructure services
- Standards, interoperability, certification and archiving
- Open science, open access, ethics, legal framework and authorisation for reuse
- Advocacy and training