Disaster recovery: how to set up a strategy for Service-Oriented and Event-Driven architectures
July 26, 2024

Developing a disaster recovery plan for a Service-Oriented or Event-Driven Architecture built on microservices is challenging. These architectures typically consist of many components, each often maintaining its own database. The components integrate through request-response APIs or loosely coupled events, which makes it hard to keep data consistent across services when one of them restores a backup.

Planning ahead is essential. Without a comprehensive strategy, crucial steps might be overlooked during an emergency, potentially resulting in data loss. This post covers both Service-Oriented and Event-Driven architectures, highlighting the complexities they share and the additional considerations for Event-Driven architectures.

Challenges in Service-Oriented and Event-Driven Architectures

Both architectures integrate diverse applications and services, each with its own domain, data set, and database, unlike a traditional monolithic system with a single database. With a single database, a backup and restore leaves the data consistent by construction. That strategy does not translate directly to microservices environments, where interdependent services each hold part of the data.

Ensuring data consistency across all applications and services is paramount, especially when one component restores older data. Such a restore affects every other service and application that has read the data that is now lost, or that has started new business processes based on it.

Event-Driven Architecture Enhancements

Event-Driven Architecture (EDA) extends Service-Oriented Architecture with an event hub for asynchronous communication, eliminating the need for direct point-to-point communication. Event hubs like Kafka let services publish and read events without producers needing to know who their consumers are. These hubs have evolved to include data retention and stateful aggregation over event streams, comparable to SQL queries over relational database tables.

Event hubs enable long-term or even infinite data retention, with tiered storage options to manage costs. Retaining data this way makes its security and availability crucial. Kafka, for instance, replicates topic partitions across the brokers in a cluster, so data stays available when individual brokers fail. Replication, however, cannot protect against human errors or bugs that delete data, because the deletion is replicated as well. Guarding against that requires an append-only backup stored outside the cluster. Such a backup also makes it possible to restore specific subsets of data, similar to what database backups provide. Solutions like Kannika offer such functionality.
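
The difference between replication and an append-only backup can be illustrated with a toy model. The sketch below is not Kafka code: `ReplicatedTopic` and `AppendOnlyBackup` are hypothetical in-memory classes that only mimic the behaviour described above, where a deletion reaches every replica while the backup can only grow.

```python
class ReplicatedTopic:
    """Toy stand-in for a topic replicated across brokers (hypothetical, not a Kafka API)."""

    def __init__(self, replica_count=3):
        self.replicas = [[] for _ in range(replica_count)]

    def append(self, record):
        # Every write is copied to all replicas.
        for replica in self.replicas:
            replica.append(record)

    def delete_where(self, predicate):
        # A deletion (human error, bug) is applied to every replica as well.
        for i, replica in enumerate(self.replicas):
            self.replicas[i] = [r for r in replica if not predicate(r)]


class AppendOnlyBackup:
    """Toy backup kept outside the cluster: records can be added, never removed."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def restore(self, predicate=lambda record: True):
        # Restoring a subset is a filter over the retained history.
        return [r for r in self._records if predicate(r)]


topic = ReplicatedTopic()
backup = AppendOnlyBackup()
for record in [{"key": "machine-1", "value": "created"},
               {"key": "machine-2", "value": "created"}]:
    topic.append(record)
    backup.append(record)

# An accidental delete removes the record from every replica...
topic.delete_where(lambda r: r["key"] == "machine-1")
# ...but the backup can still return exactly that subset.
recovered = backup.restore(lambda r: r["key"] == "machine-1")
```

After the delete, no replica holds the `machine-1` record any more, yet `recovered` still contains it.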

Disaster Recovery with Kafka

Consider an Event-Driven Architecture with four applications/services loosely coupled via Kafka as an event hub. Each service/application sends events to its own topic, adhering to the single-writer principle. Because each topic has exactly one producer, it is straightforward to trace which service's events affected which other services.
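
As a rough illustration of the single-writer principle, the sketch below keeps a registry of which service owns which topic and rejects writes from anyone else. The topic names and both helper functions are hypothetical, and the service names are borrowed from the scenario in this post. In a real cluster this rule would more likely be enforced through the broker's access controls than in application code.

```python
# Hypothetical topic names; one owning (writing) service per topic.
TOPIC_OWNERS = {
    "machine-management.events": "machine-management",
    "document-management.events": "document-management",
    "billing.events": "billing",
}


def assert_single_writer(service, topic, owners=TOPIC_OWNERS):
    """Raise if `service` tries to write to a topic it does not own."""
    owner = owners.get(topic)
    if owner is None:
        raise KeyError(f"unknown topic: {topic}")
    if owner != service:
        raise PermissionError(f"{service} may not write to {topic}; owner is {owner}")


def producer_of(topic, owners=TOPIC_OWNERS):
    """With a single writer per topic, the topic name identifies the producer."""
    return owners[topic]
```

During recovery, `producer_of` answers the question "whose events are these?" with a simple lookup, which is what makes impact analysis tractable.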

The key advantage of using an event hub like Kafka in a disaster recovery plan lies in the historical data it stores. After a restore, this data can be reprocessed to bring the restored database and its consumers back in line.

For instance, suppose the Machine Management application loses data at 14:15 and has to restore its 14:00 backup. With a Service-Oriented REST integration, the 15 minutes of changes between 14:00 and 14:15 are gone without any record of what happened, which complicates impact assessment and recovery. With Event-Driven Architecture, the events produced in that window are still in the log, and consumer offsets show how far each consumer has read.
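
The impact assessment for this scenario can be sketched in a few lines. The example below is a simplified in-memory model, not Kafka client code: the `Event` class, both helper functions, the consumer group names, and the example date are assumptions made for illustration, and it assumes a single-partition topic. It relies on one Kafka convention, namely that a group's committed offset is the offset of the next record it will read.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    offset: int          # position in the (single-partition) topic
    timestamp: datetime  # when the event was produced
    key: str


def events_in_gap(log, backup_time, failure_time):
    """Events produced after the backup was taken, up to the moment of data loss."""
    return [e for e in log if backup_time < e.timestamp <= failure_time]


def consumed_by_group(gap_events, committed_offsets):
    """Map each consumer group to the gap events it has already read.

    `committed_offsets` maps group name -> committed offset, i.e. the offset
    of the next record that group will read.
    """
    result = {}
    for group, next_offset in committed_offsets.items():
        seen = [e for e in gap_events if e.offset < next_offset]
        if seen:
            result[group] = seen
    return result


day = datetime(2024, 7, 26)  # arbitrary example date
log = [
    Event(0, day.replace(hour=13, minute=55), "machine-1"),
    Event(1, day.replace(hour=14, minute=5), "machine-2"),
    Event(2, day.replace(hour=14, minute=10), "machine-1"),
    Event(3, day.replace(hour=14, minute=14), "machine-3"),
]
gap = events_in_gap(log, day.replace(hour=14, minute=0), day.replace(hour=14, minute=15))
affected = consumed_by_group(gap, {"billing": 3, "document-management": 1})
```

Here `gap` holds the events at offsets 1 to 3. The `billing` group has already read offsets 1 and 2 and therefore needs attention, while `document-management` has not yet read anything from the gap.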

Three recovery options are now available:

  1. Reconstruct the database: replay the events on top of the restored backup to rebuild the Machine Management database, bringing it back in sync with all applications that consumed events from the 15-minute window.
  2. Compensate events: if event data is insufficient, determine which applications (e.g. Document Management and Billing) processed the events and ensure their databases and processes are corrected. This can be a manual process or automated through compensating events.
  3. Hybrid approach: combine database correction with compensating events for comprehensive recovery.
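
The first two options can be sketched as follows. This is a minimal illustration under strong assumptions: the event types `MachineUpserted`, `MachineRemoved`, and `MachineEventRevoked` are hypothetical, every event is assumed to carry the full state of the machine it describes, and how a consumer reacts to a revocation is left out entirely.

```python
def rebuild_state(restored_state, gap_events):
    """Option 1: replay the gap events, in offset order, on top of the restored backup."""
    state = dict(restored_state)
    for event in sorted(gap_events, key=lambda e: e["offset"]):
        if event["type"] == "MachineUpserted":
            state[event["key"]] = event["payload"]
        elif event["type"] == "MachineRemoved":
            state.pop(event["key"], None)
    return state


def compensating_events(gap_events):
    """Option 2: one revocation per gap event, newest first, for consumers to undo."""
    ordered = sorted(gap_events, key=lambda e: e["offset"], reverse=True)
    return [
        {"type": "MachineEventRevoked", "key": e["key"], "revokes_offset": e["offset"]}
        for e in ordered
    ]


backup_1400 = {"machine-1": {"status": "idle"}}
gap_events = [
    {"offset": 1, "type": "MachineUpserted", "key": "machine-2", "payload": {"status": "idle"}},
    {"offset": 2, "type": "MachineUpserted", "key": "machine-1", "payload": {"status": "running"}},
    {"offset": 3, "type": "MachineRemoved", "key": "machine-2", "payload": None},
]
rebuilt = rebuild_state(backup_1400, gap_events)
revocations = compensating_events(gap_events)
```

A hybrid approach would call `rebuild_state` for the events that carry enough information and fall back to `compensating_events` for the rest.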

Conclusion

Creating and maintaining a disaster recovery plan for Service-Oriented and Event-Driven Architectures demands significant effort. The primary benefit of Event-Driven Architecture is that the retained events can be used to correct databases that had to be restored from an older backup during a disaster. Preparing for this in advance is crucial for minimizing data loss and keeping recovery predictable.
