Geospatial data is data about specific locations on the earth’s surface. It can represent a geographical area as a whole or it can represent an event associated with a geographical area. Analysis of geospatial data is sought after in a few industries. It involves understanding where the data exists from a spatial perspective and why it exists there.
There are two types of geospatial data: vector data and raster data. Raster data is a matrix of cells represented as a grid, mostly representing photographs and satellite imagery. In this post, we focus on vector data, which is represented as geographical coordinates of latitude and longitude as well as lines and polygons (areas) connecting or encompassing them. Vector data has a multitude of use cases in deriving mobility insights. User mobile data is one such component of it, and it’s derived mostly from the geographical position of mobile devices using GPS or app publishers using SDKs or similar integrations. For the purpose of this post, we refer to this data as mobility data.
This is a two-part series. In this first post, we introduce mobility data, its sources, and a typical schema of this data. We then discuss the various use cases and explore how you can use AWS services to clean the data, how machine learning (ML) can aid in this effort, and how you can make ethical use of the data in generating visuals and insights. The second post will be more technical in nature and cover these steps in detail alongside sample code. This post does not have a sample dataset or sample code, rather it covers how to use the data after it’s purchased from a data aggregator.
You can use Amazon SageMaker geospatial capabilities to overlay mobility data on a base map and provide layered visualization to make collaboration easier. The GPU-powered interactive visualizer and Python notebooks provide a seamless way to explore millions of data points in a single window and share insights and results.
Sources and schema
There are few sources of mobility data. Apart from GPS pings and app publishers, other sources are used to augment the dataset, such as Wi-Fi access points, bid stream data obtained via serving ads on mobile devices, and specific hardware transmitters placed by businesses (for example, in physical stores). It’s often difficult for businesses to collect this data themselves, so they may purchase it from data aggregators. Data aggregators collect mobility data from various sources, clean it, add noise, and make the data available on a daily basis for specific geographic regions. Due to the nature of the data itself and because it’s difficult to obtain, the accuracy and quality of this data can vary considerably, and it’s up to the businesses to appraise and verify this by using metrics such as daily active users, total daily pings, and average daily pings per device. The following table shows what a typical schema of a daily data feed sent by data aggregators may look like.
Attribute | Description |
---|---|
Id or MAID | Mobile Advertising ID (MAID) of the device (hashed) |
lat | Latitude of the device |
lng | Longitude of the device |
geohash | Geohash location of the device |
device_type | Operating System of the device = IDFA or GAID |
horizontal_accuracy | Accuracy of horizontal GPS coordinates (in meters) |
timestamp | Timestamp of the event |
ip | IP address |
alt | Altitude of the device (in meters) |
speed | Speed of the device (in meters/second) |
country | ISO two-digit code for the country of origin |
state | Codes representing state |
city | Codes representing city |
zipcode | Zipcode of where Device ID is seen |
carrier | Carrier of the device |
device_manufacturer | Manufacturer of the device |
Use cases
…