Their premise was simple: call-data records show the true nature of social networks and human movement. Understanding social networks and how people really move—as seen from phone movements and calls—could give health officials the ability to predict how a disease will move and where a disease will strike next, and prepare accordingly.
The problem is that call-data records are very hard to get a hold of. The files themselves are huge, there are enormous privacy risks, and the process of making the records safe for distribution is long.
First, the technical basics
Every time you make a phone call from your mobile phone to another mobile phone, the network records the following information (note: this is not a complete list):
- The number from which the call originated
- The number at which the call terminated
- Start time of the call
- Duration of the call
- The ID number of the phone making the call
- The ID number of the SIM card used to make the call
- The code for the antenna used to make the call
On their own, these records are not creepy. Indeed, without them, networks would be unable to connect calls or bill customers. But it is easy to see why operators aren’t rushing to share this information. Even though the data include none of the actual content of a phone call, simply knowing which number is calling which, and from where and when, is usually more than enough to identify people.
So how can network operators use this valuable data for good while also protecting their own interests and those of their customers? A good example can be found in Africa, where Orange, a French mobile phone network with interests across several African countries, has for the second year run its “Data for Development” (D4D) program, which offers researchers a chance to mine call data for clues on development problems.
Steps to safe sharing
After a successful first year in Ivory Coast, Orange this year ran the D4D program in Senegal. The aim of the program is to give researchers and scientists at universities and other research labs access to data in order to find novel ways to aid development in health, agriculture, transport or urban planning, energy, and national statistics.
Orange collected call-data records for the entire country through all of 2013. In its raw form, the data amounted to 1.1 terabytes, or the equivalent of 1,100 hours of streaming from Netflix in standard definition. To anonymize the data, Orange replaced all the various identifiers listed above with a single number identifying the call. The key used to convert the data into this single number was then destroyed.
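To make that step concrete, here is a minimal sketch of one way identifiers could be collapsed into a single number with a throwaway key. It is not Orange’s actual procedure, which the company has not published in this detail. The field names, the choice of HMAC-SHA256, and the decision to keep the start time, duration and antenna code are all assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

# Hypothetical field names; which fields count as identifiers is an assumption.
ID_FIELDS = ("caller", "callee", "imei", "imsi")


def pseudonymize(records, id_fields=ID_FIELDS):
    """Replace the identifying fields of each record with one opaque number."""
    # One-time secret key, generated in memory and never written anywhere.
    key = secrets.token_bytes(32)
    cleaned = []
    for rec in records:
        # Join the identifying values and run them through a keyed hash.
        joined = "|".join(str(rec.get(f, "")) for f in id_fields)
        digest = hmac.new(key, joined.encode("utf-8"), hashlib.sha256).digest()
        token = int.from_bytes(digest[:8], "big")  # keep 64 bits as the number
        # Keep only the non-identifying fields, plus the new token.
        kept = {k: v for k, v in rec.items() if k not in id_fields}
        kept["token"] = token
        cleaned.append(kept)
    # Drop the only reference to the key. A real system would also have to
    # make sure no copy was ever logged or persisted.
    del key
    return cleaned
```

Because the key is gone once the function returns, nobody can recompute the number from a known phone number after the fact, and running the function twice on the same records produces different numbers.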
Despite this first step, the data still weren’t ready for distribution. Instead, the information was broken down into three datasets. The first set was broken up by antenna in one-hour time intervals, making it possible for researchers to see which towers communicated with which at any given time. There are 1,606 Orange antennae in Senegal. Here’s an example of what that data might look like:
In this entirely made-up example, each cell shows the number of calls between any two given antennae between 1pm and 2pm on Jan. 8, 2013. The real dataset would have a table with 1,606 rows and columns for each hour of the year. In order to strip out outliers, such as a remote antenna with only a handful of users, Orange replaced small numbers in any cell with another randomly assigned small number. So in the example above, the four calls between Antennae 3 and 4 would be replaced with another number below 10.
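The two operations described here, counting calls per antenna pair and masking small counts, can be sketched in a few lines. This is an illustration rather than Orange’s code: the record layout and function names are made up, the threshold of 10 is inferred from the example, and leaving empty cells at zero is an assumption.

```python
import random
from collections import Counter


def hourly_matrix(calls, n_antennas):
    """Count the calls between every pair of antennas within one hour."""
    counts = Counter((c["from_antenna"], c["to_antenna"]) for c in calls)
    return [
        [counts.get((i, j), 0) for j in range(n_antennas)]
        for i in range(n_antennas)
    ]


def mask_small_counts(matrix, threshold=10, rng=None):
    """Swap every non-zero count below the threshold for a random small one."""
    if rng is None:
        rng = random.Random()
    return [
        # Assumption: empty cells stay at zero, large counts are untouched.
        [rng.randint(1, threshold - 1) if 0 < v < threshold else v for v in row]
        for row in matrix
    ]
```

Antennas are numbered from zero in this sketch, so the four calls between Antennae 3 and 4 would sit in cell `[2][3]`. After masking, that cell holds some number from 1 to 9, which may or may not still be 4.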
Why is this necessary? “If there is only one call at 3am in a national park, it is very likely that call was made by the guard of the park,” says Nicolas de Cordes, who managed the program at Orange. In other words, simply “anonymizing” data by changing numbers around is never enough. It is too easy to de-anonymize it based on patterns and information that is external to the dataset.
The second dataset was meant to track how people move in Senegal. To do this, Orange divided the antennae by administrative divisions. Looking at the data, researchers can tell when someone moved from one cell to another. But it is not possible to tell whether she simply crossed the street to do so, or traveled hours by car before changing cells.
The third dataset also looked at how people move, but on a more granular level, looking at roads and major routes. In order to enhance privacy, Orange tweaked the data, for example making small changes to the locations of the antennae and the timing of the calls.
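As a rough illustration of what such tweaks could look like, the sketch below nudges each antenna’s coordinates once and shifts each call time by a random amount. The size of the nudges (about a hundredth of a degree, up to ten minutes) and the use of uniform noise are assumptions. The article does not say how large Orange’s changes were or how they were generated.

```python
import random
from datetime import datetime, timedelta


def jitter_antennas(locations, max_offset_deg=0.01, rng=None):
    """Move each antenna by a small random offset, once per antenna."""
    if rng is None:
        rng = random.Random()
    return {
        antenna_id: (
            lat + rng.uniform(-max_offset_deg, max_offset_deg),
            lon + rng.uniform(-max_offset_deg, max_offset_deg),
        )
        for antenna_id, (lat, lon) in locations.items()
    }


def jitter_time(timestamp, max_shift_min=10, rng=None):
    """Shift a call time by a random number of minutes in either direction."""
    if rng is None:
        rng = random.Random()
    shift = rng.uniform(-max_shift_min, max_shift_min)
    return timestamp + timedelta(minutes=shift)
```

Moving each antenna once, rather than once per call, keeps a traveler’s path internally consistent while still blurring the exact position.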
Let there be light
It is only after these steps that the data were deemed safe to release to scientists and researchers, who used the information to prepare 53 submissions to Orange’s D4D challenge.
The winning team showed how mobile data could be used for electricity infrastructure planning. The researchers compared data from Orange with data from Senelec, the local energy utility, and found a correlation between mobile phone use and energy use. Just over half the population of Senegal has access to electricity, while mobile phone penetration is close to 100%.
The advantage of using mobile phone data over traditional methods for planning how and where to extend the electricity grid (satellite imagery is one such way) is that it is possible to see the changes that occur when a town is electrified. There is instantly a greater degree of nightlife. More migrants come into the area. That helps plan for future energy needs rather than just existing ones, says Markus Schläpfer, one of the researchers on the project. Their hope is that Senelec will use the results in planning its grid expansions.
Other submissions to the D4D challenge covered everything from measuring social disparity to better understanding commuting patterns. There were meta-projects, like the one investigating the question of anonymizing datasets, and, yes, one entitled “Modeling Ebola virus diffusion in Senegal using mobile phone datasets and agent-based simulation.” See summaries of all the projects in this PDF.