After a decade of analyzing such datasets, a few counterintuitive truths emerge:
In the age of petabyte-scale data streams, the number 116 million might seem modest. A single high-resolution video uploaded to a social platform generates more bytes. Yet, in the world of Global System for Mobile Communications (GSM) data, 116 million records is not a volume—it is a language. It is the Rosetta Stone of human mobility, the raw pulse of a connected society, and a computational challenge that bridges the gap between a radio signal and a predictive algorithm.
To understand what 116 million GSM data points truly represent, we must strip away the abstraction of "big data" and look at the physics, the mathematics, and the human reality encoded in every handshake between a phone and a tower.
The number "116m" (116 million) refers to the scale of the dataset analyzed. The researchers analyzed 15 months of mobile phone data covering 1.5 million people in a small European country. Throughout the study period, these users generated approximately 116 million distinct spatial points (records) based on cell tower connections.
(Note: While the dataset contained 1.5 million users, the paper is often associated with the number 116 million in database or scaling contexts due to the total volume of location pings processed. If you are referring to a different specific figure involving "116m users," please see the clarification on the Yahoo dataset below.)
How do 116 million records of GSM data end up in one place?
| Tool | Cluster Setup | Time to Aggregate by Cell ID |
|------|--------------|------------------------------|
| Pandas (single node) | 128 GB RAM | Infeasible – out of memory |
| DuckDB | Single node, SSD | ~90–120 seconds |
| Spark | 4 nodes, 16 cores each | ~25 seconds |
| BigQuery | Serverless | ~10 seconds (cost ~$5) |
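The workload being timed in this table, aggregating records by cell ID, is logically just a grouped count. The following is a minimal sketch of that operation in plain Python, not a reproduction of any of the benchmarked setups. The column names (`imsi`, `cell_id`, `timestamp`) and the sample rows are hypothetical. It reads one row at a time, so memory grows with the number of distinct cells rather than the number of records.

```python
import csv
import io
from collections import Counter


def count_events_per_cell(rows, cell_field="cell_id"):
    """Count records per cell ID, consuming the input one row at a time."""
    counts = Counter()
    for row in rows:
        counts[row[cell_field]] += 1
    return counts


# Tiny in-memory stand-in for an exported event file (hypothetical schema).
sample_csv = io.StringIO(
    "imsi,cell_id,timestamp\n"
    "001,A17,2024-01-01T08:00:00\n"
    "002,A17,2024-01-01T08:01:00\n"
    "001,B03,2024-01-01T08:20:00\n"
)

cell_counts = count_events_per_cell(csv.DictReader(sample_csv))
print(cell_counts.most_common(2))
```

For a real export, the `io.StringIO` object would be replaced by an open file handle. A single Python loop over 116 million rows would be slow, which is one reason the engines in the table are used instead.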
| Use Case | Example Query on 116M Records |
|----------|-------------------------------|
| User mobility patterns | Find top 10 routes taken by subscribers over a week. |
| Anomaly detection | Identify SIM boxes (fraud) by detecting >1000 SMS/hour from a single IMSI. |
| Network optimization | Locate cells with >15% handover failure rate. |
| Emergency response | Count unique devices in a disaster zone during a 6-hour window. |
| Population density estimation | Aggregate location updates per cell tower every 15 minutes. |
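As an illustration of the anomaly-detection row, the sketch below applies the ">1000 SMS/hour from a single IMSI" rule to a list of events. The event layout (an IMSI string paired with an ISO timestamp), the function name, and the synthetic data are all assumptions made for the example. It buckets by clock hour, which is a simplification: a sliding 60-minute window would catch bursts that straddle an hour boundary.

```python
from collections import Counter
from datetime import datetime

SMS_PER_HOUR_THRESHOLD = 1000  # threshold taken from the table above


def flag_sim_boxes(sms_events, threshold=SMS_PER_HOUR_THRESHOLD):
    """Return the IMSIs that sent more than `threshold` SMS in any clock hour.

    sms_events: iterable of (imsi, iso_timestamp) tuples (hypothetical layout).
    """
    per_hour = Counter()
    for imsi, ts in sms_events:
        # Truncate the timestamp to the start of its clock hour.
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        per_hour[(imsi, hour)] += 1
    return {imsi for (imsi, _hour), n in per_hour.items() if n > threshold}


# Synthetic data: one IMSI sends 1001 SMS within a single hour, another sends 3.
events = [("imsi-bulk", f"2024-01-01T10:{i % 60:02d}:00") for i in range(1001)]
events += [("imsi-normal", "2024-01-01T10:05:00")] * 3

suspects = flag_sim_boxes(events)
print(suspects)
```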
If you were looking for a paper specifically focusing on a dataset with 116 million users (rather than records), you might be referring to the Yahoo! Webscope dataset (specifically the R6 dataset or similar large-scale recommendation benchmarks).
Recommendation: If you are researching privacy, mobility, or mobile data mining, the de Montjoye paper is the standard reference. You can read it here: Nature Scientific Reports Article 20756.
Generating 116 million location events is not a passive process. Each event consumes Signaling System No. 7 (SS7) or Diameter signaling capacity. A single Location Area Update (LAU) requires:
That is roughly 1.5 kilobytes of signaling over the air and core network. Multiply by 116 million: 174 gigabytes of signaling plane data—not user traffic, just the network saying “I know where you are.” This is the hidden cost of mobility. Without careful dimensioning, 116 million events can collapse a regional MSC.
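The 174 GB figure can be checked with a few lines of arithmetic. This sketch assumes "1.5 kilobytes" means 1,500 bytes and that gigabytes are decimal (10^9 bytes); both are readings of the text above, not values stated by the source.

```python
BYTES_PER_EVENT = 1_500   # "roughly 1.5 kilobytes", read as 1,500 bytes (assumption)
EVENTS = 116_000_000      # 116 million location events

total_bytes = BYTES_PER_EVENT * EVENTS
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"{total_gb:.0f} GB of signaling")  # 174 GB of signaling
```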
Operators engineer for this by: