Successful operation of a workload is measured by the achievement of business and customer outcomes. Define expected outcomes, determine how success will be measured, and identify the workload and operations metrics that will be used in those calculations to determine if operations are successful. Consider that operational health includes both the health of the workload and the health and success of the operations acting upon the workload (for example, deployment and incident response). Establish baselines from which improvement or degradation of operations will be identified, collect and analyze your metrics, and then validate your understanding of operations success and how it changes over time. Use collected metrics to determine if you are satisfying customer and business needs, and identify areas for improvement.
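As an illustration of establishing a baseline and identifying improvement or degradation against it, the following is a minimal sketch in Python. The function names, the two-standard-deviation tolerance, and the sample deployment lead times are all assumptions made for this example; they are not part of any AWS service or API.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize historical samples as a mean and a sample standard deviation."""
    return {"mean": mean(samples), "stdev": stdev(samples)}


def classify(value, baseline, tolerance=2.0, higher_is_better=False):
    """Label a new observation relative to the baseline.

    Values within `tolerance` standard deviations of the mean are "normal".
    Values outside that band are "improved" or "degraded" depending on
    whether a higher or a lower value is desirable for this metric.
    """
    delta = value - baseline["mean"]
    if abs(delta) <= tolerance * baseline["stdev"]:
        return "normal"
    better = delta > 0 if higher_is_better else delta < 0
    return "improved" if better else "degraded"


# Hypothetical operations metric: deployment lead time in minutes.
history = [30, 32, 29, 31, 33, 30, 28, 31]
baseline = build_baseline(history)

for observed in (31, 45, 20):
    print(observed, classify(observed, baseline))
```

The same comparison can be applied to workload metrics (for example, latency) and to operations metrics (for example, time to resolve an incident). The tolerance is a judgment call that should be revisited as your understanding of operations success changes over time.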
Efficient and effective management of operational events is required to achieve operational excellence. This applies to both planned and unplanned operational events. Use established runbooks for well-understood events, and use playbooks to aid in the resolution of other events. Prioritize responses to events based on their business and customer impact. Ensure that if an alert is raised in response to an event, there is an associated process to be executed, with a specifically identified owner. Define in advance the personnel required to resolve an event and include escalation triggers to engage additional personnel, as it becomes necessary, based on impact (that is, duration, scale, and scope). Identify and engage individuals with the authority to decide on courses of action where there will be a business impact from an event response not previously addressed.
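The prioritization and escalation guidance above can be sketched as follows. This is an illustrative Python example only: the `Event` fields, the threshold values, and the runbook paths are hypothetical, chosen to mirror the three impact dimensions named in the text (duration, scale, and scope) and the requirement that each alert has a process and an identified owner.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    owner: str               # specifically identified owner of the response
    runbook: str             # process to execute when the alert is raised
    duration_min: int        # duration: minutes since the event began
    customers_affected: int  # scale: number of customers impacted
    services_affected: int   # scope: number of services impacted


# Hypothetical escalation triggers, one per impact dimension.
ESCALATION = {"duration_min": 30, "customers_affected": 1000, "services_affected": 3}


def impact_score(event):
    """Count how many impact dimensions have reached their trigger."""
    return sum(getattr(event, field) >= limit for field, limit in ESCALATION.items())


def triage(events):
    """Order events by impact, highest first, and flag those needing escalation."""
    ranked = sorted(events, key=impact_score, reverse=True)
    return [(e.name, e.owner, impact_score(e) > 0) for e in ranked]


events = [
    Event("disk-usage-warning", "storage-oncall", "runbooks/disk-cleanup", 5, 0, 1),
    Event("checkout-errors", "payments-oncall", "runbooks/checkout-rollback", 45, 5000, 2),
]

for name, owner, escalate in triage(events):
    print(name, owner, "escalate" if escalate else "handle at current level")
```

Counting crossed thresholds is only one possible scoring scheme. A weighted score tied to business impact may suit some workloads better.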
Communicate the operational status of workloads through dashboards and notifications that are tailored to the target audience (for example, customer, business, developers, operations) so that they may take appropriate action, so that their expectations are managed, and so that they are informed when normal operations resume.
Determine the root cause of unplanned events and unexpected impacts from planned events. This information will be used to update your procedures to mitigate future occurrence of events. Communicate root cause with affected communities as appropriate.
In AWS, you can generate dashboard views of your metrics collected from workloads and natively from AWS. You can leverage CloudWatch or third-party applications to aggregate and present business, workload, and operations level views of operations activities. AWS provides workload insights through logging capabilities including AWS X-Ray, CloudWatch, CloudTrail, and VPC Flow Logs, enabling the identification of workload issues in support of root cause analysis and remediation.
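To make the idea of combined business and workload views concrete, the sketch below builds a CloudWatch dashboard body as JSON using only the Python standard library. The helper names, the `Example/Business` namespace, the `OrdersPlaced` metric, the load balancer identifier, and the Region are assumptions for illustration. The widget layout follows the general shape of the CloudWatch dashboard body format, which should be checked against the current AWS documentation before use.

```python
import json


def metric_widget(title, metrics, region, x=0, y=0, width=12, height=6,
                  stat="Average", period=300):
    """Return one metric widget as a dictionary in dashboard-body form."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "metrics": metrics,
            "region": region,
            "stat": stat,
            "period": period,
        },
    }


def dashboard_body(widgets):
    """Serialize a list of widgets into a dashboard body JSON string."""
    return json.dumps({"widgets": widgets})


REGION = "us-east-1"  # assumed Region for this example

widgets = [
    # Workload-level view: response time from a load balancer (placeholder ID).
    metric_widget(
        "Workload: target response time",
        [["AWS/ApplicationELB", "TargetResponseTime",
          "LoadBalancer", "app/example/0123456789abcdef"]],
        REGION,
    ),
    # Business-level view: a hypothetical custom metric published by the workload.
    metric_widget(
        "Business: orders placed",
        [["Example/Business", "OrdersPlaced"]],
        REGION, x=12, stat="Sum",
    ),
]

body = dashboard_body(widgets)
print(body)
```

A string built this way could then be supplied as the `DashboardBody` argument to the `put_dashboard` operation of the CloudWatch API (for example, through an AWS SDK), together with a `DashboardName`. That call is omitted here so the example runs without AWS credentials.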
OPS 6: How do you understand the health of your workload?
Define, capture, and analyze workload metrics to gain visibility to workload events so that you can take appropriate action.
OPS 7: How do you understand the health of your operations?
Define, capture, and analyze operations metrics to gain visibility to operations events so that you can take appropriate action.
OPS 8: How do you manage workload and operations events?
Prepare and validate procedures for responding to events to minimize their disruption to your workload.
Routine operations, as well as responses to unplanned events, should be automated. Manual processes for deployments, release management, changes, and rollbacks should be avoided. Releases should not be large batches that are done infrequently. Rollbacks are more difficult in large changes. Failing to have a rollback plan, or the ability to mitigate failure impacts, will prevent continuity of operations. Align metrics to business needs so that responses are effective at maintaining business continuity. One-time decentralized metrics with manual responses will result in greater disruption to operations during unplanned events.