Data-intensive research is changing the way African researchers can work and the impact they can have. It is also opening up new career paths in the field of data science.
By increasing the volume of data that researchers can analyze and work with at any given time, data-intensive technology allows them to make bigger strides in less time in their chosen disciplines. Data scientists assist this process by providing the skills to help researchers and managers first analyze large volumes of data and then use that analysis to make effective decisions. Big data is already making a big difference in fields ranging from banking and social media to healthcare and astronomy.
Data-intensive research, or big data technology, has come to Africa by way of the stars. The establishment of the Square Kilometre Array (SKA) highlighted the continent's need to analyze the extremely high volumes of data that will be generated by the network of telescope dishes ultimately to be placed across remote regions of southern Africa.
The SKA project is an internationally renowned effort to build the world’s largest radio telescope with more than a square kilometer of collecting area. It is one of the largest scientific endeavors in history and drives one of the world’s most significant big data challenges of the coming decade.
Three South Africa-based universities involved in the SKA project—North-West University, the University of Cape Town (UCT) and the University of the Western Cape (UWC)—established a partnership in 2015 to form the Inter-University Institute for Data Intensive Astronomy (IDiA).
IDiA is mobilizing researchers in fields such as astronomy, computer science, statistics and eResearch technologies to create data science capacity for leadership in SKA precursor projects such as MeerKAT, which is scheduled to achieve full operation in early 2018. MeerKAT marks the beginning of a radio big data revolution in Africa. It will be operated as a South African national facility for about five years before it is incorporated into the SKA dish array.
IDiA is also establishing a data-intensive research and training program to develop capacity on the continent to use the data that MeerKAT will deliver. On its own, radio astronomy data is raw; it requires analysis to provide the kinds of answers astronomers and astrophysicists are seeking about the origins of the cosmos. The astronomy project will also involve developing data systems and tools for analyzing multi-wavelength astronomy data.
The SKA is a multinational project involving researchers and data scientists around the world. One of IDiA's projects is therefore to create a data platform that will allow remote teams to access the data: the African Research Cloud (ARC). IDiA will also develop and apply processing algorithms for analyzing the data, so that high volumes of information can be turned into knowledge we can use.
Fuelling collaborations and solutions
The ARC involves collaborators from around the world. Much of this work is also part of a collaboration with SKA partners in the Netherlands to establish an Advanced European Network of E-infrastructure for astronomy.
The ARC is the first stage of a three-phase plan to address specific uses of data-intensive research. One such application is the African Research Cloud Astronomy Demonstration project (ARCADE), which will specifically serve MeerKAT teams.
The MeerKAT large surveys will produce a terrifying deluge of data. Observations are expected to produce almost 100 terabytes of data each day, orders of magnitude more than the conventional volume from a radio telescope. This data will have to be transported, calibrated, imaged, processed and analyzed by dozens of astronomers around the world.
ARCADE therefore focuses on two important aspects of scientific utility: the processing of radio data and large-scale scientific collaboration. It takes a proof-of-concept approach, developing compact, incisive interventions for well-defined technological problem statements.
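To give a feel for what that daily volume means in practice, the short Python sketch below converts it into a sustained data rate and a transfer time. It is only back-of-envelope arithmetic under stated assumptions: decimal units (1 terabyte = 10^12 bytes) and a hypothetical 1 Gbit/s network link chosen purely for comparison. Neither figure comes from the MeerKAT or ARC teams, and the function names are illustrative.

```python
# Back-of-envelope arithmetic for the "almost 100 terabytes a day" figure.
# Assumptions (not from the article): decimal units (1 TB = 10**12 bytes)
# and a hypothetical 1 Gbit/s link used purely for comparison.

BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_bytes_per_s(tb_per_day: float) -> float:
    """Average byte rate needed to keep up with a daily data volume."""
    return tb_per_day * BYTES_PER_TB / SECONDS_PER_DAY


def transfer_days(tb: float, link_gbit_per_s: float) -> float:
    """Days needed to move `tb` terabytes over a link of the given speed."""
    bits = tb * BYTES_PER_TB * 8
    seconds = bits / (link_gbit_per_s * 10**9)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = sustained_rate_bytes_per_s(100)
    print(f"Sustained rate: {rate / 1e9:.2f} GB/s ({rate * 8 / 1e9:.2f} Gbit/s)")
    print(f"One day's data over 1 Gbit/s: {transfer_days(100, 1):.1f} days")
```

Under these assumptions, 100 terabytes a day works out to roughly 1.16 gigabytes every second, and a single day of observations would take about 9.3 days to move over the assumed 1 Gbit/s link. That gap is one way to see why bringing the analysis to the data in a shared cloud can be more practical than shipping copies to every team.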
One such successful intervention involved a large-scale collaborative project, which used a second-year astronomical techniques class at UCT as a test subject. The project focused on practical learning outcomes for the class of 50. Students had to perform a simple yet challenging set of analyses on radio and optical images, which included inspection, statistical analyses, plotting and documentation.
A cloud-based hub was created for the project, and a powerful virtual machine was populated with state-of-the-art software tools that are the contemporary standards in open-source big data initiatives. Students could log on to the ARC via a web browser in a computer lab during a supervised session, but they could equally have completed the exercise anywhere at UCT, on their own laptops and mobile devices.
The power of big data
This successful case study demonstrated the power of big data solutions and the advantages of cloud-based technologies, and resulted in two very important findings.
First, the ARC and IDiA provide an unprecedented opportunity for training and collaboration in scientific analyses. The test-subject students were exposed to critical skills in mathematics, statistics and programming in an immersive and collaborative environment.
They were at liberty to discuss, share and work on their projects in a safe and robust programming environment. This sort of intervention can be deployed at a larger scale, and can provide a training environment for anyone with an internet connection. Additionally, the students experienced a first glimpse of tools and techniques that will provide them with an advantage in their future careers in academic institutions or industry.
Second, this cloud-based intervention showcased a lean, information technology (IT)-on-a-diet approach, while retaining a high degree of technical flexibility. The virtual machine was designed and deployed in a matter of hours, and required only the interaction of a single technical specialist with the scientific researcher. Indeed, one of the aims of ARCADE is to deliver a framework that does not require an IT technical specialist, but is deployable using standard recipes and a few mouse clicks.
In this respect, we are drawing level with commercial solutions that are available only at a financial premium. Our studies will provide easily accessible solutions for smaller, well-defined science projects that can benefit from designs developed for large-scale ones.
A similar project in bioinformatics will help researchers who are investigating, for instance, the relationship between genetics and disease. Their work involves not only dealing with data in large volumes, but also detecting relationships that are highly specific to certain molecules. Big data analysis can do this kind of sifting and identifying work in a relatively short time.
One such strategic project, based at UWC, will implement a platform for tuberculosis surveillance in Africa, to glean insights into the dynamics of tuberculosis infection. Such an approach can ultimately assist in rolling out cost-effective diagnostic technologies and health interventions. The pilot project involves researchers as far afield as Ghana, South Africa, Uganda and Zimbabwe, but the plan is to involve more countries once the pilot project is completed.
Making Africa’s voice heard
A potential breakthrough in malaria medicine demonstrates the kind of difference big data computing can offer to African science. In 2012, researchers at UCT’s Drug Discovery and Development Centre (H3D) identified a molecule that showed great promise of not only becoming part of a single-dose cure for malaria but also possibly blocking transmission of the malaria parasite from person to person through mosquito bites.
The first part of their work on this project, however, took place at Griffith University in Australia, where scientists with big data capacity screened an initial group of about 36,000 small molecules. When those compounds had been narrowed down to several hundred, a team of scientists from H3D took over the project and further explored the antimalarial potential of the various chemotypes (or chemical classes).
The candidate molecule is now in the clinical trial process, with a second, next-generation back-up candidate also identified and expected to enter the same process in due course. Globally, many small molecules have been screened in a similar manner, paving the way for exploring potential new medicines against malaria.
This type of multinational cooperation is part of the modern research landscape around the world. With the development of big data capacity and the ARC, African science will be able to make a more substantial contribution to such partnerships and influence new breakthroughs based on the data gleaned from projects such as the SKA. Big data is opening a new door of opportunity.
Russell Taylor is the director of IDiA and Joint UCT/UWC/SKA chair. Bradley Frank is a lecturer at UWC and a senior researcher at IDiA.
This piece was produced by SciDev.Net’s Sub-Saharan Africa English desk.