This blog post introduces a real-world use case from Internet of Things (IoT) service providers that use Disaster Recovery for AWS IoT to improve the reliability of their IoT platforms.
IoT service providers, especially those running high-reliability businesses, require consistent device connectivity and the seamless transfer of connectivity configurations and workloads to other regions when regional IoT services become unavailable. This blog post describes a customizable solution that enables cross-region transfer for AWS IoT Core and application services that rely on it.
Introduction
Integrating a disaster recovery (DR) solution into an IoT platform has become critical for companies operating in the IoT domain. The inherent complexity of IoT systems, characterized by numerous interconnected devices and vast data streams, amplifies the risks posed by potential disruptions. Because IoT platforms often host critical applications in industries such as healthcare, manufacturing, and autonomous vehicles, even brief downtime or data loss can lead to severe financial losses, compromised customer trust, and regulatory non-compliance. By incorporating disaster recovery capability into your IoT architecture, you can proactively mitigate these risks, maintain business continuity, and reinforce your IoT platform’s reliability against network outages, application unavailability, and unforeseen events.
Solution overview
The architecture of the solution is shown in Figure 1.
Figure 1: The architecture of the reliable IoT solution with DR
Disaster recovery
The solution uses Amazon DynamoDB global tables to synchronize all operations performed against AWS IoT Core in the primary region to the secondary region. AWS Step Functions and an AWS Lambda function in the secondary region replay those operations against AWS IoT Core in the secondary region. The data synchronized across regions for DR is independent of the application and does not need to be maintained by users.
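The replay step in the secondary region could look roughly like the sketch below. It is not the solution's actual code: it assumes the global table stores one item per AWS IoT Core operation, with hypothetical attributes `operation`, `thingName`, and `attributes`, and that the Lambda function receives those items as DynamoDB stream records. `replay_record`, `handler`, and `RecordingClient` are illustrative names; `create_thing` and `delete_thing` are the boto3 AWS IoT client calls.

```python
from typing import Any, Dict, List


class RecordingClient:
    """Stand-in for the boto3 IoT client, used only for a local dry run."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def create_thing(self, **kwargs: Any) -> None:
        self.calls.append(("create_thing", kwargs))

    def delete_thing(self, **kwargs: Any) -> None:
        self.calls.append(("delete_thing", kwargs))


def replay_record(iot_client: Any, record: Dict[str, Any]) -> str:
    """Replay one replicated operation against AWS IoT Core in this region.

    The item layout (operation, thingName, attributes) is an assumption
    made for this sketch, not the solution's real schema.
    """
    if record.get("eventName") not in ("INSERT", "MODIFY"):
        return "skipped"
    image = record["dynamodb"]["NewImage"]
    operation = image["operation"]["S"]
    thing_name = image["thingName"]["S"]
    if operation == "CreateThing":
        raw = image.get("attributes", {}).get("M", {})
        attributes = {key: value["S"] for key, value in raw.items()}
        iot_client.create_thing(
            thingName=thing_name,
            attributePayload={"attributes": attributes},
        )
    elif operation == "DeleteThing":
        iot_client.delete_thing(thingName=thing_name)
    else:
        return "unsupported"
    return "applied"


def handler(event: Dict[str, Any], context: Any, iot_client: Any = None) -> List[str]:
    """Lambda entry point; a client can be injected for testing."""
    if iot_client is None:
        import boto3  # imported lazily so the sketch runs without AWS access

        iot_client = boto3.client("iot")
    return [replay_record(iot_client, r) for r in event.get("Records", [])]
```

Passing a `RecordingClient` to `handler` lets you check which calls would be made before pointing the function at a real account. A production version would also need to cover the other AWS IoT Core operations the platform uses (certificates, policies, and so on) and handle retries.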
Health checks
The solution uses Amazon Route 53 health checks to decide when to launch the fail-over. All the factors below are monitored, and a failure in any one of them can trigger the fail-over process. The factors show the health status of:
As shown by the dotted red lines in Figure 1, the AWS Lambda function used in the Amazon Route 53 health checks calls the APIs and collects their responses across all the AWS accounts included in the architecture. A VPC endpoint for Amazon API Gateway lets the Lambda function invoke the APIs across accounts. For details, refer to "using interface VPC endpoint to access a private API in another AWS account". The Lambda function aggregates the API responses and decides whether to trigger the fail-over process. The decision is passed to Amazon Route 53 via the health check APIs, and Amazon Route 53 performs the fail-over accordingly.
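The aggregation logic in the health check Lambda function can be sketched as follows. This is a minimal illustration under stated assumptions: the function is exposed through an HTTPS endpoint that a Route 53 health check probes, and the component names are hypothetical. Route 53 HTTP(S) health checks treat 2xx and 3xx responses as healthy, so returning 503 marks the primary region as unhealthy.

```python
import json
from typing import Dict


def aggregate_health(status_codes: Dict[str, int]) -> dict:
    """Combine per-component HTTP status codes into one health response.

    `status_codes` maps a component name (hypothetical) to the status code
    returned by its API. Any code outside 200-299 counts as a failure, and
    a single failure is enough to report the region as unhealthy.
    """
    unhealthy = sorted(
        name for name, code in status_codes.items() if not 200 <= code < 300
    )
    # With no responses at all there is nothing to vouch for the region.
    healthy = bool(status_codes) and not unhealthy
    return {
        "statusCode": 200 if healthy else 503,
        "body": json.dumps({"healthy": healthy, "unhealthy": unhealthy}),
    }
```

For example, `aggregate_health({"iot-core": 200, "command-api": 500})` returns a 503 response whose body names `command-api` as the failing component, which helps when investigating why a fail-over was triggered.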
Fail-over process
Amazon Route 53 follows the policies defined in the records to enforce the fail-over, as shown in Figure 2.
Figure 2: The records for fail-over in Amazon Route 53
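A pair of fail-over records like those in Figure 2 could be defined programmatically along the following lines. This is a sketch, not the records used in the solution: the domain name, targets, health check ID, record type, and TTL are all hypothetical placeholders. The dictionary follows the `ChangeBatch` shape accepted by the Route 53 `ChangeResourceRecordSets` API.

```python
from typing import Any, Dict, Optional


def failover_change_batch(
    name: str,
    primary_target: str,
    secondary_target: str,
    health_check_id: str,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build a ChangeBatch with one PRIMARY and one SECONDARY failover record."""

    def record(role: str, target: str, check_id: Optional[str]) -> Dict[str, Any]:
        rrset: Dict[str, Any] = {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        }
        if check_id is not None:
            # Only the primary record is tied to the health check here.
            rrset["HealthCheckId"] = check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Changes": [
            record("PRIMARY", primary_target, health_check_id),
            record("SECONDARY", secondary_target, None),
        ]
    }
```

The result can be passed to `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`. While the health check attached to the primary record is healthy, Route 53 answers with the primary target; when it fails, Route 53 answers with the secondary target.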
As shown in Figure 1, the application services achieve high availability during the fail-over by relying on Lambda functions deployed in both regions, Amazon Simple Storage Service (Amazon S3) multi-region access points, and Amazon DynamoDB global table replication. As shown by the orange lines in Figure 1, the administration consoles publish messages to the command & control services through Amazon Route 53. Once the health check reports a failure, Amazon Route 53 points the API endpoint to the services in the secondary region. As shown by the purple lines in Figure 1, to minimize data loss, the data from the Amazon EventBridge event buses in both regions is ingested into the data visualization. During the fail-over, data that remains in the primary region can still be processed.
Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
The RTO of the architecture mainly depends on the duration of the fail-over. The duration is composed of 4 factors:
The fail-over duration can be shortened by lowering the values of those factors, at the cost of Amazon Route 53 sending requests to the health checks more frequently.
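The trade-off can be illustrated with a rough estimate. The sketch below does not reproduce the four factors referred to above; it only assumes, for illustration, that detection time is driven by three Route 53 settings: the health check request interval (10 or 30 seconds), the failure threshold (1 to 10 consecutive failures), and the TTL of the fail-over records. `estimate_failover_seconds` is a hypothetical helper, and the result ignores any application-level steps.

```python
def estimate_failover_seconds(
    request_interval_s: int,
    failure_threshold: int,
    dns_ttl_s: int,
) -> int:
    """Rough estimate of detection time plus DNS cache expiry, in seconds.

    Assumes a failure is detected after `failure_threshold` consecutive
    failed checks spaced `request_interval_s` apart, and that clients keep
    the old DNS answer for up to `dns_ttl_s` afterwards.
    """
    if request_interval_s not in (10, 30):
        raise ValueError("Route 53 request interval must be 10 or 30 seconds")
    if not 1 <= failure_threshold <= 10:
        raise ValueError("Route 53 failure threshold must be between 1 and 10")
    if dns_ttl_s < 0:
        raise ValueError("TTL must not be negative")
    return request_interval_s * failure_threshold + dns_ttl_s


# Default-style settings versus more aggressive ones (illustrative values).
relaxed = estimate_failover_seconds(30, 3, 60)
aggressive = estimate_failover_seconds(10, 1, 30)
```

With these illustrative values, the estimate drops from 150 seconds to 40 seconds, while the health check endpoint receives requests three times as often.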
The RPO of the architecture can be impacted by the following factors:
Summary
By using the DR architecture introduced in this blog post, IoT service providers can implement disaster recovery within their IoT platforms with less effort. You can help safeguard against potential revenue loss resulting from IoT service interruptions, cultivate customer trust and loyalty, and enhance your IoT platform’s security posture.
Beyond risk mitigation, the adoption of DR bolsters the operational efficiency of IoT businesses by reducing downtime-related costs and minimizing the need for manual interventions during disruptions.
About the author
Shi Yin is a senior IoT consultant from AWS Professional Services, based in California. Shi has worked with many enterprise customers to leverage AWS IoT services to build IoT solutions and platforms, such as Smart Homes, Smart Warehouses, Connected Vehicles, Commercial IoT, and Industrial IoT.