Track the progress of long-running workflows and retry on failure. Long-running workflows are often composed of multiple steps. Ensure that each step is independent and can be retried to minimize the chance that the entire workflow will need to be rolled back, or that multiple compensating transactions need to be executed. Monitor and manage the progress of long-running workflows by implementing a pattern such as Scheduler Agent Supervisor pattern.

Implement an early warning system that alerts an operator. Identify the key performance indicators of your application's health, such as transient exceptions and remote call latency, and set appropriate threshold values for each of them. Send an alert to operations when the threshold value is reached. Set these thresholds at levels that identify issues before they become critical and require a recovery response.

Implement application logging. Application logs are an important source of diagnostics data. The recommended practices for application logging include:
● Log in production.
● Log events at service boundaries. Include a correlation ID that flows across service boundaries. If a transaction flows through multiple services and one of them fails, the correlation ID will help you pinpoint why the transaction failed.
● Use semantic logging, also known as structured logging. Unstructured logs make it hard to automate the consumption and analysis of the log data, which is needed at cloud scale.
● Use asynchronous logging. With synchronous logging, the logging system might cause the application to fail, as incoming requests are blocked while waiting for log writes.

Implement logging using an asynchronous pattern. If logging operations are synchronous, they might block your application code. Ensure that your logging operations are implemented as asynchronous operations.
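The logging practices above (structured entries, a correlation ID, and asynchronous writes) can be illustrated with a minimal Python sketch. It is only one possible approach, built on the standard library's `QueueHandler` and `QueueListener`. The names `correlation_id`, `JsonFormatter`, `CorrelationFilter`, `build_logger`, and the logger name `orders` are hypothetical and chosen for illustration.

```python
import contextvars
import json
import logging
import logging.handlers
import queue

# Hypothetical name: holds the correlation ID of the request being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object (structured logging)."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", "-"),
        })


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every record (runs in the caller's thread)."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def build_logger(sink_handler, name="orders"):
    """Return (logger, listener). The listener writes to sink_handler on a background thread."""
    log_queue = queue.Queue(-1)  # unbounded queue between the application and the writer
    sink_handler.setFormatter(JsonFormatter())
    listener = logging.handlers.QueueListener(log_queue, sink_handler)
    listener.start()

    queue_handler = logging.handlers.QueueHandler(log_queue)
    queue_handler.addFilter(CorrelationFilter())

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(queue_handler)
    return logger, listener


if __name__ == "__main__":
    logger, listener = build_logger(logging.StreamHandler())
    correlation_id.set("req-123")  # normally taken from an incoming request header
    logger.info("order %s accepted", 42)
    listener.stop()  # drains the queue before shutdown
```

In this sketch the application thread only enqueues the record; the slower write happens on the listener's thread, so a stalled log sink does not block request handling. Calling `listener.stop()` at shutdown drains any records that are still queued.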
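The guidance on tracking long-running workflows and retrying independent steps can also be sketched in a few lines of Python. This is not the Scheduler Agent Supervisor pattern itself, only a simplified illustration of recording per-step progress so that a rerun retries the failed step instead of repeating the whole workflow. The function `run_workflow`, its parameters, and the in-memory `progress` dictionary are assumptions. A real system would persist progress in durable storage.

```python
import time
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_workflow(steps: List[Step], progress: Dict[str, str],
                 max_attempts: int = 3, base_delay: float = 1.0) -> bool:
    """Run steps in order, retrying each one; skip steps already marked done."""
    for name, action in steps:
        if progress.get(name) == "done":
            continue  # completed in an earlier run, so it is not repeated
        for attempt in range(1, max_attempts + 1):
            try:
                action()
                progress[name] = "done"
                break
            except Exception:
                progress[name] = f"failed (attempt {attempt})"
                if attempt < max_attempts:
                    # exponential backoff between attempts
                    time.sleep(base_delay * 2 ** (attempt - 1))
        else:
            # retries exhausted: stop so an operator or supervisor can intervene
            return False
    return True
```

Because each step is assumed to be independent and safe to retry, calling `run_workflow` again with the same `progress` dictionary resumes at the first step that is not marked `done`.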
Module 6: Business Continuity and Resiliency in Azure

Test the Monitoring Systems

Automated failover and fallback systems, and manual visualization of system health and performance by using dashboards, all depend on monitoring and instrumentation functioning correctly. If these elements fail, miss critical information, or report inaccurate data, an operator might not realize that the system is unhealthy or failing.

Plan for and test disaster recovery. Create an accepted, fully tested plan for recovery from any type of failure that may affect system availability. Choose a multi-site disaster recovery architecture for any mission-critical applications. Identify a specific owner of the disaster recovery plan, including automation and testing. Ensure the plan is well documented, and automate the process as much as possible. Establish a backup strategy for all reference and transactional data, and test the restoration of these backups regularly. Train operations staff to execute the plan, and perform regular disaster simulations to validate and improve the plan. If you are using Azure Site Recovery to replicate VMs, create a fully automated recovery plan to fail over the entire application within minutes.
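As a small illustration of testing backup restoration, the following Python sketch records a checksum manifest when a backup is taken, then restores the archive into a scratch directory and compares the result against that manifest. It is a local, file-based stand-in under stated assumptions and does not use Azure Backup or Azure Site Recovery. The function names `digest_tree`, `create_backup`, and `verify_restore` are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def digest_tree(root: Path) -> Dict[str, str]:
    """Map each file's relative path to the SHA-256 digest of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }


def create_backup(source: Path, dest_base: Path) -> Tuple[Path, Dict[str, str]]:
    """Archive source as a zip file and return (archive path, manifest at backup time)."""
    manifest = digest_tree(source)
    archive = shutil.make_archive(str(dest_base), "zip", root_dir=str(source))
    return Path(archive), manifest


def verify_restore(archive: Path, manifest: Dict[str, str]) -> bool:
    """Restore the archive into a scratch directory and compare it with the manifest."""
    with tempfile.TemporaryDirectory() as scratch:
        shutil.unpack_archive(str(archive), scratch)
        return digest_tree(Path(scratch)) == manifest
```

Running a check like `verify_restore` on a schedule exercises the restore path itself, rather than only confirming that a backup file exists.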