Big developments in big data: how astronomy is driving data science in Africa

3 Dec 2015 - 20:30

The Square Kilometre Array (SKA) project will produce data at a rate comparable to that of global internet traffic. But if we don't have the infrastructure and skills to deal with it, the data will go offshore and Africa will lose this stellar science and business opportunity. Three South African universities have joined together to form a new institute to ensure that Africa will be able to meet this challenge, with benefits for the continent that go far beyond astronomy.

"The data revolution is set to be a globally transformative phenomenon – if you don't ride the wave, you're going to be flooded by it," says Professor Russ Taylor, director of the recently launched Inter-University Institute for Data-Intensive Astronomy (IDIA). IDIA, a partnership initiative between the University of Cape Town (UCT), the University of the Western Cape (UWC) and North-West University (NWU), is a flagship project to respond to the immense big-data challenge of the Square Kilometre Array (SKA), a global endeavour to build the world's largest radio telescope – in South Africa. But, says Taylor, the intended scope of IDIA is far wider than just the SKA, or even astronomy: the institute aims to ensure South Africa is ready not only to ride the big-data wave, but to drive it.

Photo: Square Kilometre Array, Karoo, South Africa

Big data refers to the large, complex data sets – created and collected through technology – that are set to affect every aspect of life. Al Gore identified the digital revolution and big data as one of the main drivers of global change in his latest book, The Future: Six Drivers of Global Change. Gore writes that our societies, culture, politics, commerce, educational systems and even ways of relating to one another are all being profoundly reorganised through the growth of digital information.

Even by the standards of this world of big data, however, the SKA poses a particular challenge. Tasked to collect data from deep space dating back to the very start of the universe 13 billion years ago, the SKA will collect around 1.5 exabytes of data a year – that is, roughly one and a half billion gigabytes.
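To put that figure in perspective, the conversion can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes decimal (SI) units, where one exabyte is 10^18 bytes and one gigabyte is 10^9 bytes, and it assumes the data arrives evenly over a 365-day year, which a real telescope would not do. The function names are made up for this example.

```python
# Back-of-envelope check of the "1.5 exabytes a year" figure.
# Assumption: decimal (SI) units, 1 EB = 10**18 bytes and 1 GB = 10**9 bytes.
BYTES_PER_EXABYTE = 10**18
BYTES_PER_GIGABYTE = 10**9
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def exabytes_to_gigabytes(exabytes):
    """Convert a volume in exabytes to gigabytes."""
    return exabytes * BYTES_PER_EXABYTE / BYTES_PER_GIGABYTE


def average_gigabytes_per_second(exabytes_per_year):
    """Average rate if the yearly volume were spread evenly over the year."""
    return exabytes_to_gigabytes(exabytes_per_year) / SECONDS_PER_YEAR


annual_gb = exabytes_to_gigabytes(1.5)
rate_gb_s = average_gigabytes_per_second(1.5)

print(f"Per year:   {annual_gb:,.0f} GB")
print(f"On average: {rate_gb_s:.1f} GB every second")
```

Under these assumptions, 1.5 exabytes a year works out to one and a half billion gigabytes, or a sustained average of a little under 50 gigabytes every second.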

South Africa, as co-host (with Australia) of the SKA, is thus uniquely placed to lead the global response to big data – an opportunity we dare not miss.

Speaking at the launch of IDIA earlier this month, UCT Vice-Chancellor Dr Max Price noted that while we have the geographical advantage of the southern skies, we don't – yet – have an advantage when it comes to analysing the data. "The risk is that we may become servants of scientists around the world, and not a team and a country that can generate its own knowledge and play in the big league."

Preparing for data sharing

It is precisely this situation that IDIA seeks to avoid. The SKA big-data challenge begins with the MeerKAT radio telescope, built some 90 km outside the Northern Cape town of Carnarvon, which will make up about 1% of the total SKA project. MeerKAT will begin generating data in late 2016, for the phase known as 'early science'. That capacity will quadruple the following year. MeerKAT will be a source of astronomy data for IDIA up until the end of the construction of the SKA Phase 1, around 2020. The SKA Phase 1 data flow will be 10 times bigger than that of MeerKAT.

"At IDIA, we are essentially laying the groundwork – in terms of both infrastructure and human resources – to be ready when the SKA turns on," says Taylor.

MeerKAT will not be the only source of astronomy data over the next five years; IDIA will also collect data from other telescopes around the globe, including the Very Large Array in New Mexico (currently the largest radio telescope in the world). This sharing of data from other telescopes is part of the IDIA strategy to build global partnerships and develop the kind of systems required to collect, visualise and analyse the massive amounts of data coming from the SKA.

The real challenge, explains Taylor, is not just to build a big pipe to manage the data, but to store it in a way that enables the global collaboration required for a project of this magnitude. "Teams in Africa, Europe, Asia, Australia, and North America all want to work together on this data. So the issue is not only how to store and manage the data, but how to enable collaboration on a big-data set that nobody can actually have on their desktop," he says. "What this means, in practice, is that we need to build new cyber-infrastructure platforms."

The first of these platforms is the Africa Big Data Research Cloud (ARC), the first phase of which is housed in UCT's cloud-based data centre, launched in the same week as IDIA. Cloud computing, which simply means storing and accessing data and programs over the internet instead of on a local computer or server, is already revolutionising the business and research world. The ARC gives researchers the ability to develop collaborative research environments in which they can share data, computational capabilities and other tools, unimpeded by the restrictions of time and space. UCT's distributed big-data research cloud will be prototyped by the IDIA partner universities in collaboration with South African organisations such as the Centre for High Performance Computing in Cape Town. The ARC is envisioned to grow to include the eight African partner countries on SKA, and a number of SKA partners in Europe. Private companies Dell and Canonical have also joined the ARC as partners. In time, the ARC will service the larger research community, allowing researchers to share data and collaborate on a number of other 'big science' projects, such as bioinformatics.

Developing data scientists

In addition to building infrastructure, IDIA is focused on building the skills needed for the new digital world of big data. "Big data will fundamentally change the way we do science," says Taylor. "It used to be that you could run an experiment with a small amount of data, and the data itself was not the challenge. Today, that's no longer the case – the scientific research and the data research are now too entangled to separate."

The world is witnessing a global shortage of data scientists – a job description that didn't even exist just a decade or two ago. IDIA aims to address this shortage in two ways: firstly, through the recruitment of graduate students and postdoctoral researchers to work on the data challenges of MeerKAT and (in time) the SKA; and secondly, by putting in place programmes to train people in this new specialisation. From 2017, UCT will offer a master's degree in data science, while Sol Plaatje University in the Northern Cape recently created a dedicated undergraduate degree in data science.

"The world has changed so fast ... I believe young people in general are already pretty data-savvy," says Taylor. "Our challenge is to establish a career trajectory for this new kind of specialist, the data scientist, who will hold a multidisciplinary set of skills unlike what we have seen before."

The Harvard Business Review has described 'data scientist' as "the sexiest job of the 21st century". This skill set is sought after in just about every industry the world over, from tourism to marketing to astrophysics. A study by McKinsey projects that by 2018, the United States will face a 50% gap between supply and demand for individuals with strong data-analysis expertise. By offering this data-science speciality, South African universities seek to fill not only a niche created by the SKA, but a global skills shortage.

South Africa stands to gain a great deal from taking full advantage of the SKA and the big-data challenge. A large part of the rationale for this country's comprehensive investment in the SKA project lies in the benefits that will accrue from it, which extend far beyond astronomy. Speaking at the launch of IDIA, Naledi Pandor, Minister for Science and Technology, described the SKA as more than just a science or astronomy project; it's a global infrastructure project – a project to enhance South Africa's skills and technology base across the board.

"There are three elements of development in the SKA," says Taylor. "The first is the development of the technology to build the project; then there is that of the scientific outcomes, and the ownership of these outcomes; and finally, the development of skills that comes from the requirement to utilise such sophisticated equipment."

Such skills are primarily in information and communication technologies, and investment in these skills is a long-term investment, he explains. Over the 50- to 70-year lifespan of the SKA, the growth in technology advancement will be remarkable: for South Africa to reap the rewards of that technological development, we need to engage with it fully.

"At the core of it," says Taylor, "IDIA is about building the capacity to ensure we in Africa are ready to engage in and benefit from one of humanity's most ambitious science projects to date – taking place here, within our borders."

Story by Natalie Simon.

Images courtesy SKA South Africa