### Slide 1:

![Slide 1](slide_1.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 2:

![Slide 2](slide_2.png)

::: Notes


#### Instructor notes

#### Student notes

This module discusses the following topics:

* Monitoring and maintaining healthy applications and workloads, using the following tools: Amazon CloudWatch, Amazon CloudWatch Logs, and Amazon EventBridge.
* Monitoring AWS infrastructure with the AWS Health Dashboard.
* Monitoring distributed applications, using the following tools: AWS X-Ray and the X-Ray trace map.

:::

### Slide 3:

![Slide 3](slide_3.png)

::: Notes

This slide grounds the monitoring discussion in the Example Corp. scenario: the application is running and leadership wants data-driven visibility and automated alerting. Monitoring is not simply a technical exercise — it's a response to business requirements for reliability and observability. Ask students to think about what they would monitor first for a new application, and what alert they would want to receive before users reported a problem.

#### Instructor notes

#### Student notes

Now that your new application is deployed and running, your manager has asked you to develop a system for monitoring its health. Leadership wants to take a data-driven approach to managing the workloads. You are asked to design a plan to render real-time data on the environment. You must also implement fully automated alert functions so that you can correct issues when and where they arise.

:::

### Slide 4:

![Slide 4](slide_4.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 5:

![Slide 5](slide_5.png)

::: Notes

CloudWatch is the foundational observability service for AWS — it collects, stores, and enables action on metrics and logs from both AWS services and custom sources. Understanding CloudWatch as a metrics repository helps students reason about what they can and cannot monitor by default: AWS services push metrics automatically, but anything at the OS level or application level requires the CloudWatch agent or custom metric publishing. Ask students where they think CloudWatch's default coverage stops and custom configuration must begin.

#### Instructor notes

#### Student notes

**Amazon CloudWatch** : A *metrics repository*. An Amazon Web Services (AWS) service, such as Amazon Elastic Compute Cloud (Amazon EC2), puts metrics into the repository, and you retrieve statistics based on those metrics. If you put your own custom metrics into the repository, you can also retrieve statistics on those metrics.

* **Collect metrics and logs** : Collect metrics and logs from all your AWS resources, applications, and services that run on AWS and on-premises servers.
* **Monitor** : Visualize applications and infrastructure with CloudWatch dashboards. Correlate logs and metrics side by side to troubleshoot issues, and set alarms with CloudWatch alarms.
* **Respond** : Automate responses to operational changes with Amazon EventBridge events and auto scaling.
* **Analyze** : Analyze metrics with up to 1-second granularity, extended data retention (15 months), and real-time analysis with CloudWatch metric math.

Amazon CloudWatch can also serve as a monitoring service for AWS Cloud resources and the applications that you run on AWS. You can use Amazon CloudWatch to collect and track metrics, collect and monitor log files, and set alarms. CloudWatch can monitor AWS resources such as Amazon EC2 instances, Amazon DynamoDB tables, and Amazon Relational Database Service (Amazon RDS) database (DB) instances. The service can also monitor custom metrics generated by your applications and services, and any log files that your applications generate. You can use CloudWatch to gain system-wide visibility into resource utilization, application performance, and operational health.

:::

### Slide 6:

![Slide 6](slide_6.png)

::: Notes

CloudWatch integrates with over 70 AWS services natively, which means monitoring AWS infrastructure requires minimal setup for standard metrics. The real design challenge is connecting the dots: metrics alone don't tell you whether your application is healthy, and logs alone don't show you performance trends. The combination of metrics, logs, dashboards, and automated responses is what makes CloudWatch a complete observability platform rather than just a data collection service.

#### Instructor notes

#### Student notes

**Centralized access to all data** : Modern applications, such as those running on microservices architectures, generate large volumes of data as metrics, logs, and events. With CloudWatch, you can collect, access, and correlate this data on a single platform from across all your AWS resources, applications, and services. CloudWatch also collects from on-premises servers.

**Simple initiation of collecting metrics** : CloudWatch simplifies monitoring your AWS resources and applications. It natively integrates with more than 70 AWS services such as Amazon EC2, DynamoDB, and Amazon Simple Storage Service (Amazon S3).

**Exploration, analysis, and visualization of logs** : You can explore, analyze, and visualize your logs to help you troubleshoot operational problems. With CloudWatch Logs Insights, you pay only for the queries that you run. You can also publish log-based metrics, create alarms, and correlate logs and metrics together in CloudWatch dashboards for complete operational visibility.

**Automated responses** : You can set alarms and automate actions based on predefined thresholds or on machine learning algorithms that identify anomalous behavior in your metrics. You can also use EventBridge events to launch serverless workflows with services such as AWS Lambda, Amazon Simple Notification Service (Amazon SNS), and AWS CloudFormation.

**Real-time and historical data** : To optimize performance and resource utilization, you need a unified operational view, real-time granular data, and historical reference. CloudWatch provides automatic dashboards, data with 1-second granularity, and up to 15 months of metrics storage and retention. You can also perform metric math on your data to derive operational and usage insights. For example, you can aggregate usage across an entire fleet of EC2 instances.

For more information, see Amazon CloudWatch at https://aws.amazon.com/cloudwatch/.

:::

### Slide 7:

![Slide 7](slide_7.png)

::: Notes

CloudWatch's core primitives — metrics, alarms, dashboards, logs, and events — each serve a different observability purpose. Metrics answer 'how is it performing?'; logs answer 'what happened?'; events answer 'what changed?'; dashboards answer 'what does it look like right now?'; and alarms answer 'when should I act?'. The CloudWatch agent extends the service to collect OS-level metrics like memory utilization that are not visible from the hypervisor — a common gap that often surprises operators who assume CloudWatch covers everything.

#### Instructor notes

#### Student notes

CloudWatch provides a powerful mechanism for monitoring the state and use of most of the resources that you are managing under AWS. The following terminology and concepts are central to your understanding and use of Amazon CloudWatch:

* **Metrics** : Metrics are data about the performance of your systems. By default, several services provide metrics for resources at no cost. You can also publish your own application metrics. CloudWatch can load all the metrics in your account for search, graphing, and alarms.
* **Dashboards** : Dashboards provide globally accessible overviews of specific metrics from different Regions.
* **Alarms** : Alarms can be configured to alert when a specific condition is met. An example is whether the CPU utilization of one of your instances exceeds 70%.
* **Amazon CloudWatch Logs** : You can store your logs in this service and inspect their content.
* **Amazon EventBridge Events** : By integrating CloudWatch with EventBridge, you can write rules to indicate which events are of interest to your application and what automated action to take.
* **Amazon CloudWatch agent** : You can collect metrics from servers by installing the CloudWatch agent on the server. You can install the agent on both Amazon EC2 instances and on-premises servers, and on servers running either Linux or Windows Server. If you install the agent on an EC2 instance, the metrics that it collects are in addition to the metrics implemented by default on EC2 instances.

For more information about CloudWatch, see "What Is Amazon CloudWatch?" (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html), "What Is Amazon EventBridge?" (https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html), and "Collect Metrics, Logs, and Traces from Amazon EC2 Instances and On-Premises Servers with the CloudWatch Agent" (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) in the Amazon CloudWatch User Guide. For more information about CloudWatch Logs, see "What Is Amazon CloudWatch Logs?" (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) in the Amazon CloudWatch Logs User Guide.

:::

### Slide 8:

![Slide 8](slide_8.png)

::: Notes

CPU utilization is one of the most commonly used EC2 metrics, but it is also one of the most commonly misinterpreted: high CPU might mean a healthy, busy system or an overloaded, struggling one. Memory utilization is not available from the hypervisor and requires the CloudWatch agent — this is a critical gap for diagnosing application-level issues. Custom metrics and CloudWatch alarms connected to Auto Scaling create a closed-loop system where observed performance automatically drives capacity adjustments.

#### Instructor notes

#### Student notes

CPU utilization is a standard metric available in CloudWatch. However, because memory utilization is not visible from the hypervisor, you can use custom metrics to capture it.

Statistics are metric data aggregations over specified periods of time. CloudWatch provides statistics based on the metric data points provided by your custom data or provided by other AWS services to CloudWatch. The available statistic types are Minimum, Maximum, Sum, Average, and SampleCount. You can use metrics to calculate statistics and then present the data graphically in the CloudWatch console.

You can configure alarm actions to stop, start, or terminate an EC2 instance when certain criteria are met. You can also create alarms that initiate Amazon EC2 Auto Scaling and Amazon SNS actions on your behalf.

CloudWatch monitoring also offers integration with several third-party tools, such as Splunk. For more information, see Splunk Add-on for AWS at https://docs.splunk.com/Documentation/AddOns/released/AWS/CloudWatch. For more information about metrics, alarms, and statistics, see "Amazon CloudWatch Concepts" in the Amazon CloudWatch User Guide (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html).
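
The five statistic types can be illustrated with a quick local computation in plain Python (no AWS calls; the sample values are invented):

```python
# Compute the five CloudWatch statistic types for a set of metric
# data points collected within one period (sample values are invented).
data_points = [44.0, 47.5, 61.3, 70.2, 55.0]

stats = {
    "Minimum": min(data_points),
    "Maximum": max(data_points),
    "Sum": sum(data_points),
    "Average": sum(data_points) / len(data_points),
    "SampleCount": len(data_points),
}

print(stats["Average"])      # 55.6
print(stats["SampleCount"])  # 5
```

CloudWatch performs exactly this kind of aggregation server-side over each period you request, so you retrieve the statistic rather than every raw data point.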

:::

### Slide 9:

![Slide 9](slide_9.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 10:

![Slide 10](slide_10.png)

::: Notes

Namespaces, dimensions, periods, and timestamps are the structural building blocks of CloudWatch metrics — understanding them helps students design monitoring that's both accurate and queryable. Dimensions are especially important: they allow you to slice metrics by instance, Auto Scaling group, or other attributes, enabling comparison across resources. Setting the period too long can mask short-lived performance problems; setting it too short increases costs and may surface noise rather than signal.

#### Instructor notes

#### Student notes

You can view statistical graphs of your published metrics with the AWS Management Console. CloudWatch stores data about a metric as a series of data points. For more information, see "Amazon CloudWatch Concepts" in the Amazon CloudWatch User Guide (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html).

* **Namespaces** : A *namespace* is a container for CloudWatch metrics. Metrics in different namespaces are isolated from each other so that metrics from different applications are not mistakenly aggregated into the same statistics. You can specify a namespace name when you create a metric. The AWS namespaces use the following naming convention: `AWS/service`.
* **Metrics** : *Metrics* are the fundamental concept in CloudWatch. A metric represents a time-ordered set of data points that are published to CloudWatch. Think of a metric as a variable to monitor, and the data points represent the values of that variable over time. AWS services send metrics to CloudWatch. You can send your own custom metrics to CloudWatch.
* **Timestamps** : Each metric data point must be associated with a timestamp. If you do not provide a timestamp, CloudWatch creates a timestamp for you based on the time that the data point was received.
* **Dimensions** : A *dimension* is a name/value pair that uniquely identifies a metric. You can assign up to 10 dimensions to a metric. Every metric has specific characteristics that describe it. You can think of dimensions as categories for those characteristics.
* **Periods** : A *period* is the length of time associated with a specific CloudWatch statistic. Each statistic represents an aggregation of the metrics data collected for a specified period of time. Periods are defined in numbers of seconds. Valid values for a period are 1, 5, 10, 30, or any multiple of 60.
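
These concepts map directly onto the parameters of a statistics query. As a minimal sketch, the following builds (but does not send) the request you might pass to the CloudWatch `GetMetricStatistics` API; the instance ID is a hypothetical placeholder:

```python
from datetime import datetime, timedelta, timezone

# Parameters for a GetMetricStatistics-style query: namespace, metric
# name, dimensions, period, and statistics all appear here.
# The instance ID is a hypothetical placeholder.
end = datetime.now(timezone.utc)
request = {
    "Namespace": "AWS/EC2",                  # AWS/service naming convention
    "MetricName": "CPUUtilization",
    "Dimensions": [                          # name/value pairs identifying the metric
        {"Name": "InstanceId", "Value": "i-0123456789abcdef0"},
    ],
    "StartTime": end - timedelta(hours=1),
    "EndTime": end,
    "Period": 300,                           # seconds; must be 1, 5, 10, 30, or a multiple of 60
    "Statistics": ["Average", "Maximum"],
}

# With AWS credentials configured, you could send it with boto3:
# import boto3
# response = boto3.client("cloudwatch").get_metric_statistics(**request)
```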

:::

### Slide 11:

![Slide 11](slide_11.png)

::: Notes

Standard metrics are automatically published by AWS services at no additional charge, but they cover only what's visible from the AWS management plane — not what's happening inside the operating system or application. EC2 provides hypervisor-visible metrics like CPU and network; metrics like memory, disk I/O at the OS level, and application response times require additional instrumentation. Understanding the difference between what CloudWatch provides by default and what requires the agent or custom metrics is essential for designing comprehensive monitoring.

#### Instructor notes

#### Student notes

**Standard metrics** : Standard metrics are the default metrics sent to CloudWatch by the AWS resources used by your account. A metric is a measurement of some attribute of an AWS resource or service. AWS provides a set of standard metrics for each service that you can configure from the AWS Management Console or by using the AWS Command Line Interface (AWS CLI) or API.

CloudWatch can monitor many characteristics of AWS services, such as the following:

* CPU and network utilization for EC2 instances
* Disk read/write operations for Amazon Elastic Block Store (Amazon EBS) volumes
* Memory and disk activity for Amazon Relational Database Service (Amazon RDS) DB instances

CloudWatch obtains standard Amazon EC2 metrics, such as CPU time, from the hypervisor. You can also monitor metrics on your Amazon S3 storage usage, including total bytes and total number of objects, on a per-bucket basis.

Metrics are grouped first by namespace and then by the various dimension combinations within each namespace. For example, you can view all Amazon EC2 metrics, Amazon EC2 metrics grouped by instance, or Amazon EC2 metrics grouped by Auto Scaling group. Only the AWS services that you're using send metrics to CloudWatch. To view available metrics by namespace, dimension, or metric using the AWS CLI, use the `list-metrics` command. For more information, see "AWS Services That Publish CloudWatch Metrics" in the Amazon CloudWatch User Guide (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-services-cloudwatch-metrics.html).
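
The namespace-then-dimension grouping can be sketched by processing a trimmed, invented sample of what a `list-metrics` call returns (no AWS calls made here):

```python
# A trimmed, invented sample of a ListMetrics-style response;
# group the entries by namespace to mirror the console's view.
sample_response = {
    "Metrics": [
        {"Namespace": "AWS/EC2", "MetricName": "CPUUtilization",
         "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}]},
        {"Namespace": "AWS/EC2", "MetricName": "NetworkIn",
         "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "web-asg"}]},
        {"Namespace": "AWS/S3", "MetricName": "BucketSizeBytes",
         "Dimensions": [{"Name": "BucketName", "Value": "example-bucket"}]},
    ]
}

by_namespace = {}
for metric in sample_response["Metrics"]:
    by_namespace.setdefault(metric["Namespace"], []).append(metric["MetricName"])

print(by_namespace)
# {'AWS/EC2': ['CPUUtilization', 'NetworkIn'], 'AWS/S3': ['BucketSizeBytes']}
```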

:::

### Slide 12:

![Slide 12](slide_12.png)

::: Notes

Basic monitoring provides metrics at 5-minute intervals and is included in the Free Tier; detailed monitoring provides 1-minute granularity at additional cost. For most applications, 5-minute resolution is sufficient for trend analysis, but for Auto Scaling or capacity management scenarios, 1-minute resolution helps you react faster to demand changes. The choice of monitoring resolution involves a cost-versus-responsiveness trade-off that should be made based on the operational requirements of the workload.

#### Instructor notes

#### Student notes

By default, your instance is enabled for basic monitoring with data available automatically in 5-minute periods as part of the AWS Free Tier. You also have the option of enabling detailed monitoring. After you enable detailed monitoring, the Amazon EC2 console displays monitoring graphs with a 1-minute period for the instance. For more information, see "Enable or Turn Off Detailed Monitoring for Your Instances" in the Amazon EC2 User Guide for Linux Instances (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html) and Amazon CloudWatch Pricing (http://aws.amazon.com/cloudwatch/pricing/).
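
The cost-versus-responsiveness trade-off becomes concrete with a quick calculation of how many data points each monitoring mode produces per metric per day:

```python
# Data points generated per metric per day under each monitoring mode.
SECONDS_PER_DAY = 24 * 60 * 60

basic_period = 300     # basic monitoring: one data point every 5 minutes
detailed_period = 60   # detailed monitoring: one data point every minute

basic_points = SECONDS_PER_DAY // basic_period        # 288 points/day
detailed_points = SECONDS_PER_DAY // detailed_period  # 1,440 points/day

print(basic_points, detailed_points)  # 288 1440
```

Detailed monitoring produces five times as many data points, which is what lets alarms and Auto Scaling react within a minute rather than within five.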

:::

### Slide 13:

![Slide 13](slide_13.png)

::: Notes

Custom metrics are necessary whenever you need to monitor application-level behavior that CloudWatch doesn't capture by default, such as request queue depth, error rates, or business-level KPIs. Publishing custom metrics via PutMetricData is simple, but designing which metrics to collect requires thinking clearly about what signals indicate that your application is healthy versus degraded. High-resolution metrics at 1-second granularity are available but increase cost — they should be reserved for scenarios where sub-minute detection of issues is operationally meaningful.

#### Instructor notes

#### Student notes

**Custom metrics** : Custom metrics can be used to capture data that isn't being captured by CloudWatch standard metrics. Using the AWS CLI or API, you can publish a custom metric that is specific to the function of your instance and have it registered in CloudWatch. For example, if you are running an HTTP server on the instance, you could publish a statistic on service memory usage.

`PutMetricData` publishes metric data points to Amazon CloudWatch. CloudWatch associates the data points with the specified metric. If the specified metric does not exist, CloudWatch creates the metric. When CloudWatch creates a metric, it can take up to 15 minutes for the metric to appear. When you publish a high-resolution metric, CloudWatch stores it with a resolution of 1 second. You can push these through `PutMetricData` with a period of 1 second, 5 seconds, 10 seconds, 30 seconds, or any multiple of 60 seconds.

For more information, see the following references:

* "PutMetricData" in the Amazon CloudWatch API Reference (https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html)
* "Publish Custom Metrics" in the Amazon CloudWatch User Guide (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html)
* "High-Resolution Metrics" in the Amazon CloudWatch User Guide (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html#high-resolution-metrics)
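
As a sketch, the parameters for publishing a hypothetical high-resolution custom metric look like the following; the namespace, metric name, and instance ID are invented placeholders, and the request is only built locally, not sent:

```python
from datetime import datetime, timezone

# Parameters for a PutMetricData call publishing a hypothetical
# high-resolution custom metric. StorageResolution=1 stores the
# data at 1-second resolution; 60 is the standard resolution.
request = {
    "Namespace": "ExampleCorp/WebApp",       # custom namespace, not AWS/*
    "MetricData": [
        {
            "MetricName": "ServiceMemoryUsage",
            "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
            "Timestamp": datetime.now(timezone.utc),
            "Value": 512.0,
            "Unit": "Megabytes",
            "StorageResolution": 1,          # 1 = high resolution, 60 = standard
        }
    ],
}

# With AWS credentials configured:
# import boto3
# boto3.client("cloudwatch").put_metric_data(**request)
```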

:::

### Slide 14:

![Slide 14](slide_14.png)

::: Notes

CloudWatch automatically aggregates high-resolution metric data into lower-resolution periods over time, trading granularity for storage efficiency. This means that recent performance data is available at fine granularity, but historical data becomes increasingly coarser. This tiered retention is important to understand when designing monitoring systems: if you need to retain 1-minute granularity for more than 15 days, you must archive the data yourself. For long-term capacity planning and trend analysis, you may need to export metrics to S3 or another storage service.

#### Instructor notes

#### Student notes

Data points that are initially published with a shorter period are aggregated together for long-term storage. For example, if you collect data using a period of 1 minute, the data remains available for 15 days with 1-minute resolution. After 15 days, this data is still available. However, it is aggregated and retrievable only with a resolution of 5 minutes. After 63 days, the data is further aggregated and available with a resolution of 1 hour. To learn more about CloudWatch metric retention, see "GetMetricStatistics" in Amazon CloudWatch API Reference at https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html.
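
The retention schedule described above for 1-minute data can be expressed as a small lookup function (a sketch of that schedule only; sub-minute high-resolution data has additional, shorter tiers not shown):

```python
# Coarsest-available resolution for a metric published at 1-minute
# granularity, as it ages through the retention tiers described above.
def available_resolution(age_days: float) -> str:
    if age_days <= 15:
        return "1 minute"
    if age_days <= 63:
        return "5 minutes"
    if age_days <= 455:          # roughly 15 months
        return "1 hour"
    return "expired"

print(available_resolution(10))   # 1 minute
print(available_resolution(30))   # 5 minutes
print(available_resolution(100))  # 1 hour
```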

:::

### Slide 15:

![Slide 15](slide_15.png)

::: Notes

Anomaly Detection uses ML algorithms to establish a dynamic baseline for a metric and alert when behavior deviates from the expected pattern. This is more sophisticated than static threshold alarms because it adapts to regular patterns like daily or weekly cycles. The key consideration is that Anomaly Detection requires a sufficient history of data to establish a reliable baseline — alerting on a metric before it has enough historical data will generate false positives. It also cannot account for changes in application behavior after a major deployment.

#### Instructor notes

#### Student notes

In the Metrics section of the CloudWatch console, you can enable anomaly detection for a metric from the Actions column. When you enable *anomaly detection* for a metric, CloudWatch applies statistical and machine learning algorithms. The algorithms determine normal baselines and potential anomalies where your resources are operating out of range. You can then create a CloudWatch alarm that sends a notification whenever one of your resources begins to operate in unpredictable ways. To learn more about CloudWatch anomaly detection, see "Using CloudWatch Anomaly Detection" in the Amazon CloudWatch User Guide at https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html.
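
To convey the idea of a learned baseline with an expected range, here is a deliberately simplified toy: a mean plus-or-minus two standard deviations band over invented history. CloudWatch's actual anomaly detection models are far more sophisticated (they adapt to seasonality, for example), so this is an illustration of the concept, not the algorithm:

```python
import statistics

# Toy illustration only: CloudWatch anomaly detection uses its own
# statistical and machine learning models. A mean +/- 2*stddev band
# merely conveys the idea of a "normal" operating range.
history = [50, 52, 48, 51, 49, 53, 50, 47, 52, 51]  # invented baseline data

mean = statistics.mean(history)
stdev = statistics.stdev(history)
band = (mean - 2 * stdev, mean + 2 * stdev)

def is_anomalous(value: float) -> bool:
    """Flag values that fall outside the expected band."""
    return not (band[0] <= value <= band[1])

print(is_anomalous(51))  # False: within the expected band
print(is_anomalous(80))  # True: far outside the band
```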

:::

### Slide 16:

![Slide 16](slide_16.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 17:

![Slide 17](slide_17.png)

::: Notes

CloudWatch dashboards provide a customizable, shared view of metrics and alarms across multiple AWS accounts and Regions. The ability to build cross-Region and cross-account dashboards is particularly valuable in multi-account AWS Organizations environments where different teams own different accounts. Dashboards are a communication tool as well as a monitoring tool — a well-designed operations dashboard reduces the time needed to diagnose incidents by putting the right data in front of the right people.

#### Instructor notes

#### Student notes

Amazon CloudWatch dashboards are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view. You can view even those resources that are spread across different Regions. You can use CloudWatch dashboards to create customized views of the metrics and alarms for your AWS resources. With dashboards, you can do the following:

* Create a single view for selected metrics and alarms to help you assess the health of your resources and applications across one or more Regions. You can select the color used for each metric on each graph so that you can track the same metric across multiple graphs.
* Create dashboards that display graphs and other widgets from multiple AWS accounts and multiple Regions.
* Create an operational playbook that provides guidance for team members during operational events about how to respond to specific incidents.
* Create a common view of critical resource and application measurements that team members can share for faster communication flow during operational events.
* Specify how to retain or modify the period setting of graphs added to this dashboard.

For more information, see "Using Amazon CloudWatch Dashboards" in the Amazon CloudWatch User Guide (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html).

:::

### Slide 18:

![Slide 18](slide_18.png)

::: Notes

Dashboard widgets provide different visual representations of the same underlying data, and choosing the right widget type matters for clarity. Line graphs are appropriate for tracking trends over time; bar charts for comparing categories; number widgets for at-a-glance current state. The Custom widget backed by Lambda is particularly powerful — it can pull in data from any source and render it alongside CloudWatch data, enabling truly integrated operational views that extend beyond what CloudWatch natively captures.

#### Instructor notes

#### Student notes

You can create a dashboard from widgets that draw from metric or log data. Widgets help provide visual representations of your resource activity. Widget types include the following:

* **Explorer** : A single widget with multiple tag-based graphs
* **Line** : Compare metrics over time
* **Stacked area** : Compare the total over time
* **Number** : Instantly view the latest value and trend for a metric
* **Gauge** : See the latest value of a metric within a lower and upper range
* **Bar** : Compare categories of data
* **Pie** : Show percentage or proportional data
* **Custom widget** : Code widgets using Lambda and more
* **Text** : Use free text with markdown formatting
* **Logs table** : Explore results from CloudWatch Logs Insights
* **Alarm status** : Instantly view the status of your alarms in a grid view
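
Dashboards are defined by a JSON body that you can pass to the `PutDashboard` API. The following sketch builds a minimal body with one metric (line-graph) widget and one text widget; the Region, metric, dashboard name, and instance ID are hypothetical, and nothing is sent to AWS here:

```python
import json

# A minimal dashboard body with one line-graph widget and one text
# widget, in the JSON shape expected by the PutDashboard API.
# The region, metric, and instance ID are hypothetical.
dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "view": "timeSeries",
                "region": "us-east-1",
                "metrics": [
                    ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"]
                ],
                "period": 300,
                "stat": "Average",
                "title": "Web tier CPU",
            },
        },
        {
            "type": "text",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "markdown": "# Runbook\nIf CPU stays above 70%, check the Auto Scaling group."
            },
        },
    ]
}

body_json = json.dumps(dashboard_body)

# With AWS credentials configured:
# import boto3
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="ops-overview", DashboardBody=body_json)
```

The text widget here doubles as an embedded operational playbook, matching the dashboard use cases listed on the previous slide.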

:::

### Slide 19:

![Slide 19](slide_19.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 20:

![Slide 20](slide_20.png)

::: Notes

CloudWatch alarms watch a single metric and trigger actions when that metric crosses a threshold for a sustained period. The sustained-period requirement is a deliberate design: it prevents alarms from firing on transient spikes that don't represent a genuine problem. Connecting alarms to SNS, EC2 actions, or Auto Scaling creates automated response pipelines — but each automated action should be reviewed carefully, because an alarm that triggers Auto Scaling based on a noisy metric can create unnecessary costs.

#### Instructor notes

#### Student notes

A metric alarm watches a single CloudWatch metric or the result of a math expression based on CloudWatch metrics. The alarm performs one or more actions based on the value of the metric or expression relative to a threshold over a number of time periods. The action can be an Amazon EC2 action, an auto scaling action, or a notification sent to an Amazon SNS topic. For more information, see "Using Amazon CloudWatch Alarms" in the Amazon CloudWatch User Guide (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html).
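
As a sketch, an alarm like the CPU example from earlier in the module (fire when average CPU utilization exceeds 70% for three consecutive 1-minute periods, then notify an SNS topic) could be described with parameters like these; the SNS topic ARN, account ID, and instance ID are hypothetical placeholders, and the request is built locally rather than sent:

```python
# Parameters for a PutMetricAlarm call: alarm when average CPU
# exceeds 70% for three consecutive 60-second periods, then notify
# an SNS topic. ARN, account ID, and instance ID are hypothetical.
alarm = {
    "AlarmName": "web-cpu-high",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    "Statistic": "Average",
    "Period": 60,
    "EvaluationPeriods": 3,          # require 3 consecutive breaching periods
    "Threshold": 70.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:ops-alerts"],
}

# With AWS credentials configured:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```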

:::

### Slide 21:

![Slide 21](slide_21.png)

::: Notes

The three alarm states — OK, ALARM, and INSUFFICIENT_DATA — represent three distinct situations: normal operation, a threshold breach, and missing data. INSUFFICIENT_DATA is often overlooked but can indicate a monitoring gap or a failed data source. An alarm in ALARM state is not inherently critical; it means a configured threshold was crossed. Understanding the distinction between ALARM as a state name and an actual emergency is important — alarm design should reflect operational priorities, not just technical thresholds.

#### Instructor notes

#### Student notes

An alarm has three possible states:

* `OK` : The metric is within the defined threshold.
* `ALARM` : The metric is outside of the defined threshold.
* `INSUFFICIENT_DATA` : The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state.

`ALARM` is only a name given to the state and does not necessarily signal an emergency condition requiring immediate attention. It means that the monitored metric is equal to, greater than, or less than a specified threshold value. You could, for example, define an alarm that tells you when the CPUCreditBalance for a given T2 instance is running low. You might process this notification programmatically to suspend a CPU-intensive job on the instance until your T2 credit balance is once again full.

`INSUFFICIENT_DATA` can be returned when no data exists for a given metric. An example of this is the depth of an empty Amazon Simple Queue Service (Amazon SQS) queue. This state can also be an indicator that something is wrong in your system.

:::

### Slide 22:

![Slide 22](slide_22.png)

::: Notes

The alarm threshold and evaluation period together control alarm sensitivity. Setting the threshold too low generates false alarms that erode operator confidence; setting it too high means genuine problems go undetected. Requiring multiple consecutive breaches before triggering the alarm reduces noise from transient spikes, but increases the time to detection. Ask students to think about the trade-off: for a production database, would you prefer an alarm that fires quickly on any CPU spike, or one that waits for three consecutive minutes of sustained high usage?

#### Instructor notes

#### Student notes

In the figure, the alarm threshold is set to 3, and the minimum breach is 3 periods. That is, the alarm invokes its action only when the threshold is breached for 3 consecutive periods. In the figure, this happens with the third through fifth time periods, and the alarm's state is set to ALARM. At period six, the value dips below the threshold, and the state reverts to OK. Later, during the ninth time period, the threshold is breached again, but not for the necessary three consecutive periods. Consequently, the alarm's state remains OK. For more information, see "Using Amazon CloudWatch Alarms" in the Amazon CloudWatch User Guide (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html).
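
The evaluation behavior in the figure can be simulated locally. The data series below is invented to mirror the figure's shape: a threshold of 3, a breach during the third through fifth periods that triggers the alarm, a dip back below the threshold, and a later single-period breach that does not:

```python
# Simulate the figure's alarm evaluation: threshold of 3, and the
# alarm fires only after 3 consecutive breaching periods. The data
# series is invented to mirror the figure's shape.
THRESHOLD = 3
EVALUATION_PERIODS = 3

def evaluate(series):
    states, streak, state = [], 0, "OK"
    for value in series:
        if value > THRESHOLD:
            streak += 1
            if streak >= EVALUATION_PERIODS:
                state = "ALARM"
        else:
            streak = 0
            state = "OK"
        states.append(state)
    return states

series = [2, 2, 4, 5, 4, 1, 2, 2, 5, 2]  # breaches periods 3-5, then period 9 only
print(evaluate(series))
# ['OK', 'OK', 'OK', 'OK', 'ALARM', 'OK', 'OK', 'OK', 'OK', 'OK']
```

Note how the single breach in period nine never changes the state: the consecutive-period requirement filters out transient spikes.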

:::

### Slide 23:

![Slide 23](slide_23.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 24:

![Slide 24](slide_24.png)

::: Notes

Effective log analysis requires decisions at each stage: what to capture, how to collect it reliably, and how to extract signal from noise. The configure phase is often underinvested: logs that use inconsistent formats, missing fields, or ambiguous timestamps are difficult to analyze at scale. In a cloud environment where instances are ephemeral, the collect phase is operationally critical — log data on a terminated instance that was never shipped to a central store is data lost forever.

#### Instructor notes

#### Student notes

You can think of the process of log analysis as having three distinct phases:

* **Configure** : In the Configure stage, decide what information you need to capture in your logs and where and how it will be stored. Log information is typically saved in a plaintext format, with distinct fields in an entry separated by a delimiter (for example, spaces, tabs, or commas). At this point, you also must decide on the format of each field. For example, what is the canonical format of dates in your log file? Have you made sure that it is consistent from field to field and across all server instances?
* **Collect** : As discussed in detail in the last module, instances are launched and terminated in a cloud environment. You need a strategy for periodically uploading server log files so that this valuable information is not lost when an instance is eventually terminated.
* **Analyze** : After all the data is collected, it's time to analyze. Using log data gives you greater visibility into the daily health of your systems. It can also provide information about upcoming trends in customer behavior, and insight into how customers are currently using your system. Use analytics tools to report on errors, investigate problem reports, monitor new releases, and measure trends.
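
The payoff of the Configure stage (consistent delimiters and a canonical date format) shows up at analysis time, when entries parse cleanly. A sketch of parsing one space-delimited entry with an ISO 8601 timestamp; the field layout (timestamp, level, status, message) is an invented example format:

```python
from datetime import datetime

# Parse one space-delimited log entry with a canonical ISO 8601
# timestamp. The field layout (timestamp, level, status, message)
# is an invented example format.
line = "2024-05-01T12:34:56+00:00 ERROR 500 payment-service timeout"

timestamp_str, level, status, message = line.split(" ", 3)
entry = {
    "timestamp": datetime.fromisoformat(timestamp_str),
    "level": level,
    "status": int(status),
    "message": message,
}

print(entry["level"], entry["status"])  # ERROR 500
```

If the date format drifted between servers (say, `05/01/2024` on one fleet), this parse would fail or silently misorder events, which is exactly the inconsistency the Configure stage is meant to prevent.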

:::

### Slide 25:

![Slide 25](slide_25.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 26:

![Slide 26](slide_26.png)

::: Notes

CloudWatch Logs consolidates logs from AWS services, EC2 instances, and on-premises servers into a single managed service. The distinction between vended logs (natively published by AWS services) and custom logs (from your applications via the agent or API) matters for pricing — vended logs receive volume discounts. Centralizing all logs in CloudWatch Logs simplifies correlation between infrastructure events and application behavior, which is critical for incident diagnosis in distributed systems.

#### Instructor notes

#### Student notes

With Amazon CloudWatch Logs, you can collect and store logs from your resources, applications, and services in near real time. Logs have two main categories:

* **Logs published by AWS services** : More than 30 AWS services publish logs to CloudWatch. These services include Amazon API Gateway, AWS Lambda, AWS CloudTrail, and many others.
* **Custom logs** : These are logs from your own applications and on-premises resources. You can use AWS Systems Manager to install the CloudWatch agent, or you can use the `PutLogEvents` API action to publish logs.

Vended logs are specific AWS service logs natively published by AWS services on behalf of the customer and available at volume discount pricing. For more information, see Amazon CloudWatch Pricing at https://aws.amazon.com/cloudwatch/pricing/.
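
As a sketch of how custom logs reach CloudWatch Logs programmatically, the request parameters for a `PutLogEvents` call can be assembled as follows. The log group and stream names are hypothetical; with boto3 you would pass these parameters to `logs_client.put_log_events(**params)`:

```python
import time

def build_put_log_events_params(group, stream, messages):
    """Build PutLogEvents parameters; timestamps are milliseconds since epoch,
    as the CloudWatch Logs API requires."""
    now_ms = int(time.time() * 1000)
    return {
        "logGroupName": group,
        "logStreamName": stream,
        "logEvents": [{"timestamp": now_ms, "message": m} for m in messages],
    }

# Hypothetical group/stream names for illustration:
params = build_put_log_events_params(
    "/example-corp/app", "i-0123456789abcdef0", ["user login ok", "cache miss"]
)
# With boto3 (not shown here): boto3.client("logs").put_log_events(**params)
```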

:::

### Slide 27:

![Slide 27](slide_27.png)

::: Notes

Log groups, log streams, and metric filters are the organizational building blocks of CloudWatch Logs. Log groups define retention policies — setting retention too long increases storage costs; setting it too short risks losing data needed for audits or forensics. Metric filters are a powerful bridge between unstructured log data and structured CloudWatch metrics: they extract numerical signals from log events, enabling you to alarm on patterns like error rates or specific log strings without querying raw logs.

#### Instructor notes

#### Student notes

**Log stream** : A log stream is a sequence of log events that share the same source. Each separate source of logs in CloudWatch Logs makes up a separate log stream.

**Log group** : A log group is a group of log streams that share the same retention, monitoring, and access control settings. You can define log groups and specify which streams to put into each group. There is no limit on the number of log streams that can belong to one log group.

**Metric filters** : You can use metric filters to search for and match terms, phrases, or values in your log events. When a metric filter finds one of the terms, phrases, or values in your log events, you can increment the value of a CloudWatch metric.

For more information, see "Creating Metrics from Log Events Using Filters" (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html), "Working with Log Groups and Log Streams" (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html), "Collect Metrics, Logs, and Traces from Amazon EC2 instances and On-Premises Servers with the CloudWatch Agent" (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html), "Understanding data protection policies" (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch-logs-data-protection-policies.html), and "Log anomaly detection" (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/LogsAnomalyDetection.html) in the Amazon CloudWatch Logs User Guide.
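
As a minimal sketch of the metric-filter bridge described above, the parameters for the CloudWatch Logs `PutMetricFilter` operation might look like the following. The log group, filter, and namespace names are hypothetical; with boto3 you would pass these to `logs_client.put_metric_filter(**metric_filter_params)`:

```python
# Count occurrences of the literal term "ERROR" in a (hypothetical) log group
# and publish the count as a custom CloudWatch metric you can alarm on.
metric_filter_params = {
    "logGroupName": "/example-corp/app",
    "filterName": "AppErrorCount",
    "filterPattern": "ERROR",  # simple term match; patterns can also match JSON fields
    "metricTransformations": [
        {
            "metricName": "ErrorCount",
            "metricNamespace": "ExampleCorp/App",
            "metricValue": "1",  # increment the metric by 1 per matching log event
        }
    ],
}
```

A CloudWatch alarm on `ExampleCorp/App ErrorCount` then alerts on error spikes without anyone querying raw logs.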

:::

### Slide 28:

![Slide 28](slide_28.png)

::: Notes

CloudWatch Logs Insights provides interactive querying of log data without requiring you to build or maintain a separate log analytics infrastructure. It uses a purpose-built query language optimized for log analysis and automatically discovers field structure in JSON logs from AWS services. The pay-per-query pricing model means it's cost-effective for ad-hoc investigation, but for high-frequency automated queries, metric filters or log subscriptions may be more cost-efficient alternatives.

#### Instructor notes

#### Student notes

Use the CloudWatch Logs Insights feature to interactively search and analyze your log data in Amazon CloudWatch Logs. You can perform queries to help you respond to operational issues more efficiently and effectively. If an issue occurs, you can use CloudWatch Logs Insights to identify potential causes and validate deployed fixes. CloudWatch Logs Insights includes a purpose-built query language with a few simple but powerful commands. To help you get started, CloudWatch Logs Insights provides sample queries, command descriptions, query autocompletion, and log field discovery. Sample queries are included for several types of AWS service logs. CloudWatch Logs Insights automatically discovers fields in logs from AWS services such as the following:

* Amazon Route 53
* AWS Lambda
* AWS CloudTrail
* Amazon Virtual Private Cloud (Amazon VPC)

CloudWatch Logs Insights also discovers fields in any application or custom log that emits log events as JSON.
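
A typical Logs Insights query is just a string handed to the `StartQuery` API. The sketch below counts events containing "ERROR" in 5-minute bins; with boto3 you would pass it as the `queryString` argument to `logs_client.start_query(...)` along with a log group name and a time range (all hypothetical here):

```python
# A sample CloudWatch Logs Insights query: count log events containing
# "ERROR" in 5-minute bins, suitable for a bar-chart visualization.
QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errorCount by bin(5m)
""".strip()

# With boto3 (not shown): logs_client.start_query(
#     logGroupName="/example-corp/app",  # hypothetical group
#     queryString=QUERY, startTime=..., endTime=...)
```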

:::

### Slide 29:

![Slide 29](slide_29.png)

::: Notes

CloudWatch Logs Insights queries return structured results that can be visualized directly in the console, enabling operators to move from raw log data to a bar chart of error counts in a single step. The stats function enables aggregation, turning log events into time-series data suitable for trend analysis. However, Logs Insights queries run on demand and are not suitable for real-time alerting — for alert-driven workflows, metric filters that continuously evaluate log streams are more appropriate.

#### Instructor notes

#### Student notes

Queries in CloudWatch Logs Insights return one of the following results:

* A set of fields from log events
* A mathematical aggregation
* Another operation performed on log events

You can use visualizations, such as bar charts, line charts, and stacked area charts, to more efficiently identify patterns in your log data. CloudWatch Logs Insights generates visualizations for queries that use the stats function and one or more aggregation functions.

For more information, see "Analyzing Log Data with CloudWatch Logs Insights" (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) and "CloudWatch Logs Insights Query Syntax" (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html) in the Amazon CloudWatch Logs User Guide.

:::

### Slide 30:

![Slide 30](slide_30.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 31:

![Slide 31](slide_31.png)

::: Notes

Amazon EventBridge provides the event bus architecture that connects AWS services, SaaS applications, and custom code through a declarative, rule-based routing model. Unlike polling-based integrations, EventBridge delivers events in near real time without requiring the consumer to know about the source. The ability to route a single event to multiple targets in parallel enables loosely coupled, event-driven architectures where different teams can independently respond to the same operational event.

#### Instructor notes

#### Student notes

Amazon EventBridge is a serverless event bus that simplifies connecting AWS services and custom applications to target event queues and serverless workflows. EventBridge can deliver a stream of real-time data from event sources, such as AWS services or software as a service (SaaS) applications like Zendesk, Datadog, or PagerDuty, and route that data to targets, such as AWS Lambda, to build serverless workflows.

To connect your event sources and your targets, you create event rules. A rule matches incoming events and routes them to targets for processing. A single rule can route to multiple targets, all of which are processed in parallel. Rules are not processed in a particular order, so different parts of an organization can look for and process the events that are of interest to them.

EventBridge builds upon and extends the former CloudWatch Events service. It uses the same service API and endpoint, and the same underlying service infrastructure. EventBridge is the ideal service for building event-driven architectures. For more information, see Amazon EventBridge (https://aws.amazon.com/eventbridge/).
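
A rule's matching logic is a declarative JSON event pattern. As a sketch, the pattern below matches EC2 instance state-change notifications; the rule name is hypothetical, and the boto3 call is shown only as a comment:

```python
import json

# A declarative EventBridge event pattern: match EC2 instance
# state-change notifications from the aws.ec2 event source.
pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
}

# EventBridge's PutRule API accepts the pattern as a JSON string:
event_pattern_json = json.dumps(pattern)
# With boto3 (not shown): boto3.client("events").put_rule(
#     Name="ec2-state-change",  # hypothetical rule name
#     EventPattern=event_pattern_json)
```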

:::

### Slide 32:

![Slide 32](slide_32.png)

::: Notes

EventBridge rule creation requires two decisions: what events to match (the pattern) and what to do with them (the target). The event source and event type together define the pattern; getting the pattern right requires understanding the event schema published by each AWS service. A common mistake is creating rules that are too broad — matching all events from a service rather than the specific event types that require action — which generates unnecessary invocations and can obscure operationally meaningful signals.

#### Instructor notes

#### Student notes

To create a rule from the EventBridge console, you first define an event pattern, which specifies an event source and an event type.

**Event source** : An event indicates a change in your AWS environment. AWS resources can generate events when their state changes. For example, Amazon EC2 generates an event when the state of an EC2 instance changes from pending to running. Amazon EC2 Auto Scaling generates events when it launches or terminates instances. AWS CloudTrail publishes events when you make API calls. You can generate custom application-level events and publish them to EventBridge. You can also set up scheduled events that are generated on a periodic basis.

**Event type** : In addition to telling EventBridge what sources to receive information from, you also need to specify the types of information that you are looking for from those sources. The AWS services that you specify as sources determine the types of event information that they can pass into EventBridge.

:::

### Slide 33:

![Slide 33](slide_33.png)

::: Notes

EventBridge targets define where matched events are delivered and what action is taken. The breadth of available targets — Lambda, Step Functions, SNS, SQS, ECS, Kinesis — enables a wide range of automated response patterns. Lambda functions are often the default choice for custom logic, but for simpler notification patterns, SNS is more appropriate. The example on this slide — notifying on EC2 state changes — is a foundational operational automation pattern that students can immediately apply to improve their operational visibility.

#### Instructor notes

#### Student notes

**Targets** : In addition to specifying an event source, to create an EventBridge rule you also need to specify a target. A *target* is a resource or endpoint that EventBridge sends an *event* to when the event matches the event pattern defined for the rule. A target receives events in JSON format. Common EventBridge targets include the following:

* AWS Lambda functions
* Amazon Kinesis Data Streams
* Amazon Elastic Container Service (Amazon ECS) tasks
* AWS Step Functions state machines
* Amazon SNS topics
* Amazon SQS queues

For a full list of EventBridge targets, see "Targets Available in the EventBridge Console" in the Amazon EventBridge User Guide (https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-targets.html#eb-console-targets).

In the example on this slide and the previous one, we create a rule so that every time an EC2 instance changes to a state of shutting-down, terminated, or stopped, an Amazon SNS notification is sent to the topic of your choice.
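
The slide's example can be sketched end to end as a pattern plus a target. The topic ARN and rule name are hypothetical; with boto3 you would register them via `put_rule` and `put_targets` as shown in the comments:

```python
# Sketch of the slide's example: match instances entering shutting-down,
# terminated, or stopped, and route the event to an SNS topic.
event_pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["shutting-down", "terminated", "stopped"]},
}
targets = [
    {
        "Id": "notify-sysops",
        # Hypothetical topic ARN for illustration:
        "Arn": "arn:aws:sns:us-east-1:111122223333:SysOpsTeamPager",
    }
]
# With boto3 (not shown):
#   events = boto3.client("events")
#   events.put_rule(Name="ec2-stopping", EventPattern=json.dumps(event_pattern))
#   events.put_targets(Rule="ec2-stopping", Targets=targets)
```

The target receives the matched event as JSON, so the SNS notification carries the instance ID and new state.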

:::

### Slide 34:

![Slide 34](slide_34.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 35:

![Slide 35](slide_35.png)

::: Notes

The AWS Health Dashboard provides both global service health visibility and account-specific event notifications, which serve different audiences. The public service health view shows whether AWS regions and services are experiencing disruptions; the account-specific view shows which of your resources are affected and provides troubleshooting guidance. The integration with AWS Organizations lets operations teams see health events across all accounts in one place, which is essential for coordinated incident response in multi-account environments.

#### Instructor notes

#### Student notes

The AWS Health Dashboard is the single place to learn about the availability and operations of AWS services. You can view the overall status of AWS services, and you can sign in to view personalized communications about your AWS account or organization. Your account view provides deeper visibility into resource issues, upcoming changes, and important notifications. The Health Dashboard provides the following benefits:

* Personalized view of service health
* Proactive notifications
* Detailed troubleshooting guidance
* Integration and automation
* Fine-grained access control by using AWS Identity and Access Management (IAM)
* Aggregated health events across AWS Organizations

To learn more, see "Getting Started with Your AWS Health Dashboard -- Your Account Health" in the AWS Health User Guide (https://docs.aws.amazon.com/health/latest/ug/getting-started-health-dashboard.html).

:::

### Slide 36:

![Slide 36](slide_36.png)

::: Notes

Health Dashboard events include both AWS-initiated events (service disruptions, hardware retirement) and scheduled maintenance events that AWS communicates in advance. Scheduled events like instance retirement give you a window to migrate workloads proactively; unscheduled events require reactive response. Public events that don't affect your account still provide useful context about AWS service health in regions or services you depend on indirectly through third-party integrations.

#### Instructor notes

#### Student notes

The types of events found in the event log in AWS Health Dashboard can include both scheduled and unscheduled events. Scheduled events can include network maintenance, power maintenance, and routine hardware retirement. Public events are service events that aren't specific to an AWS account. For example, if Amazon EC2 in an AWS Region incurs an event, AWS Health provides information about the event, even if you don't use services or resources in that Region. Account-specific events list the specific resources affected on the event's Affected resources page. You can view more information about any health event by selecting it from the log display. For more information, see "Event Log" in the AWS Health User Guide (https://docs.aws.amazon.com/health/latest/ug/getting-started-health-dashboard.html#event-log).

:::

### Slide 37:

![Slide 37](slide_37.png)

::: Notes

The event detail view in AWS Health Dashboard surfaces the specific affected resources by resource ID or ARN, which enables targeted response rather than broad precautionary actions. The start and end time fields help you correlate Health events with application performance anomalies in CloudWatch, confirming whether a workload degradation was caused by an AWS infrastructure event. Integrating Health Dashboard with your incident response process ensures you check for AWS-initiated events early in your troubleshooting workflow.

#### Instructor notes

#### Student notes

The *Event details* pane has two tabs. The *Details* tab displays a text description of the event and relevant data about the event: the event name, status, Region and Availability Zone, start time, end time, and category. The *Affected resources* tab displays information about any AWS resources that are affected by the event. The tab displays the resource ID or Amazon Resource Name (ARN), if available or relevant. For more information, see "Event Details" in the AWS Health User Guide (https://docs.aws.amazon.com/health/latest/ug/getting-started-health-dashboard.html#event-details).

:::

### Slide 38:

![Slide 38](slide_38.png)

::: Notes

Integrating AWS Health with EventBridge enables automated responses to AWS-initiated events, such as sending a notification when an EC2 instance is scheduled for retirement or triggering a runbook when a service disruption is detected. Without this integration, operators must manually check the Health Dashboard to discover events that may be affecting their workloads. Automating detection and response to Health events is a maturity indicator in cloud operations — it moves teams from reactive to proactive incident management.

#### Instructor notes

#### Student notes

You can use EventBridge to detect and react to changes for AWS Health events. Then, based on the rules that you create, EventBridge invokes one or more target actions when an event matches the specified values in a rule. Depending on the type of event, you can send notifications, capture event information, take corrective action, initiate events, or take other actions. This example creates a rule so that EventBridge monitors for all Amazon EC2 events, including the event type categories, event codes, and resources.

For more information, see "Monitoring AWS Health Events with Amazon EventBridge" in the AWS Health User Guide (https://docs.aws.amazon.com/health/latest/ug/cloudwatch-events-health.html).
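
An event pattern for the case described above can be sketched as follows. AWS Health events arrive from the `aws.health` source; restricting `detail.service` to EC2 scopes the rule to EC2-related health events (rule wiring and targets are omitted here):

```python
# Sketch: an EventBridge event pattern that matches AWS Health events
# for Amazon EC2, such as scheduled instance retirement notices.
health_pattern = {
    "source": ["aws.health"],
    "detail": {"service": ["EC2"]},
}
# As with any EventBridge rule, you would register this pattern with
# put_rule and attach a target (for example, an SNS topic) with put_targets.
```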

:::

### Slide 39:

![Slide 39](slide_39.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 40:

![Slide 40](slide_40.png)

::: Notes

CloudWatch Resource Health provides a fleet-level view of EC2 instance health, allowing operators to spot patterns across many instances simultaneously. Rather than reviewing each instance's metrics individually, you can filter and sort the fleet by performance dimension, quickly surfacing outliers. This is particularly valuable in Auto Scaling environments where the fleet size changes dynamically — a centralized health view ensures that newly launched instances are visible and monitored without manual dashboard updates.

The slide shows a screenshot of CloudWatch Resource Health, which visualizes the health of EC2 hosts across your applications and provides filter options for sorting EC2 hosts.

#### Instructor notes

#### Student notes

You can use Resource Health in Amazon CloudWatch to automatically discover, manage, and visualize the health and performance of Amazon EC2 hosts across your applications in a single view. Resource Health visualizations can be based on performance dimensions such as CPU Utilization, Memory Utilization, or Status Check. You can sort through hundreds of Amazon EC2 hosts by using the available filter options.

:::

### Slide 41:

![Slide 41](slide_41.png)

::: Notes

Distributed microservice architectures create observability challenges that don't exist in monolithic applications: a single user request may traverse dozens of services, and a failure or latency spike anywhere along the path can affect end-user experience. Traditional metrics and logs from individual services don't naturally show the relationships between services or the end-to-end latency of a request. Distributed tracing solves this by correlating events across service boundaries, giving you a complete view of how a request was processed.

#### Instructor notes

#### Student notes

Distributed applications are applications built using a microservice architecture. A microservice architecture can make it difficult to achieve user latency requirements. In addition, debugging and tracing user interactions adds complexity. Tracing is tracking and gathering data on a user request as it travels through your entire application.

:::

### Slide 42:

![Slide 42](slide_42.png)

::: Notes

AWS X-Ray adds distributed tracing to your application stack, enabling you to see the full path of a request from entry point to all downstream dependencies. The critical insight X-Ray provides is not just whether individual services are healthy, but where in the request chain latency or errors are introduced. This is particularly valuable for identifying bottlenecks in calls to external services, databases, or other microservices that are technically operational but slower than expected.

#### Instructor notes 

#### Student notes

AWS X-Ray is a service that collects data about requests that your application serves. The service provides tools that you can use to view, filter, and gain insights into that data to identify issues and opportunities for optimization. For any traced request to your application, you can review detailed information about the following:

* The request and response
* Calls that your application makes to downstream AWS resources, microservices, databases, and HTTP web APIs

With X-Ray, you can perform the following tasks:

* Analyze and debug the performance of your distributed applications
* View latency distributions and pinpoint performance bottlenecks
* Identify specific user impact across your applications

X-Ray works across different AWS and third-party services. The service is ready to use in production with low latency in real time. For more information, see AWS X-Ray (https://aws.amazon.com/xray/).

:::

### Slide 43:

![Slide 43](slide_43.png)

::: Notes

X-Ray's trace-segment-subsegment hierarchy mirrors the structure of a distributed request: the trace is the complete user journey, segments are the individual service contributions, and subsegments are the calls within a service. This structure enables you to drill down from a slow overall request to the specific service call that introduced the latency. Annotating segments with custom metadata allows you to correlate trace data with business context like user IDs, tenant IDs, or feature flags.

#### Instructor notes

#### Student notes

Everything starts with a trace, which encapsulates the complete transaction that originates with the user. You view the trace from a holistic viewpoint. X-Ray breaks the trace down into segments, which are portions of the trace that correspond to a single service. X-Ray takes the logs coming from different components, which traditionally are disparate, and connects them for you. Subsegments are remote calls or local compute sections within a service. For more information, review "AWS X-Ray Concepts" in the AWS X-Ray Developer Guide (https://docs.aws.amazon.com/xray/latest/devguide/xray-concepts.html).
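
The trace/segment/subsegment hierarchy can be illustrated with a simplified segment document, the JSON structure that X-Ray accepts. All IDs, names, and times below are invented for illustration:

```python
# Simplified X-Ray segment document (illustrative values only).
# A trace groups segments by trace_id; each segment is one service's
# portion of the request, and subsegments record calls made within it.
segment = {
    "trace_id": "1-5759e988-bd862e3fe1be46a994272793",  # shared across the trace
    "id": "6b55dcc497934f1a",            # this segment's ID
    "name": "order-service",             # hypothetical service name
    "start_time": 1718000000.000,
    "end_time": 1718000000.250,
    "subsegments": [
        {
            "id": "0f3c2ad4cde14b89",
            "name": "DynamoDB",          # a downstream call within the service
            "start_time": 1718000000.050,
            "end_time": 1718000000.120,
        }
    ],
}
```

Drilling from a slow trace into its segments, and from a slow segment into its subsegments, is exactly how you locate where latency was introduced.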

:::

### Slide 44:

![Slide 44](slide_44.png)

::: Notes

The X-Ray trace map visualizes the relationships and health between services in your distributed system, using color to surface problems at a glance: red for 5xx errors, yellow for 4xx errors, purple for throttling. This visual representation makes it possible to identify cascading failures — where one unhealthy service causes errors in other services that depend on it. The integration with CloudWatch metrics and logs in the same view enables correlated investigation: you can see the spike in error rate and examine the logs from the affected nodes without switching consoles.

#### Instructor notes

#### Student notes

The X-Ray trace map enhances the observability of your services and applications by correlating CloudWatch metrics, CloudWatch Logs, and AWS X-Ray traces in one place. You can access the trace map from the Amazon CloudWatch console under X-Ray traces, or from the X-Ray console. The CloudWatch console is the recommended interface because it provides the full integrated experience, including metrics, logs, and traces together. This view helps you to more efficiently pinpoint performance bottlenecks and identify impacted users.

A trace map displays your service endpoints and resources as nodes, and highlights the traffic, latency, and errors for each node and its connections. The map uses color to surface issues at a glance: red indicates server faults in the 500 series, yellow indicates client errors in the 400 series, and purple indicates throttling errors. You can choose a node to review detailed insights about the correlated metrics, logs, and traces associated with that part of the service. You can use this capability to investigate problems and their effect on the application.

:::

### Slide 45:

![Slide 45](slide_45.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 46:

![Slide 46](slide_46.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 47:

![Slide 47](slide_47.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 48:

![Slide 48](slide_48.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 49:

![Slide 49](slide_49.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 50:

![Slide 50](slide_50.png)

::: Notes


#### Instructor notes 

#### Student notes

For more information, see "Why Is My CloudWatch Alarm in INSUFFICIENT_DATA State?" in the AWS Knowledge Center (https://aws.amazon.com/premiumsupport/knowledge-center/cloudwatch-alarm-insufficient-data-state/).

:::

### Slide 51:

![Slide 51](slide_51.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 52:

![Slide 52](slide_52.png)

::: Notes

A CloudWatch agent that isn't sending data is a silent monitoring failure: you don't get alerts about missing metrics, so you may not realize your monitoring coverage has a gap until an incident occurs. Troubleshooting agent issues via Systems Manager Run Command or Session Manager avoids the need for direct SSH access and works even when the agent is the problem. Verifying the agent's operational status should be part of your instance configuration validation checklist, not just a reactive troubleshooting step.

#### Instructor notes

#### Student notes

You can log in to the instance to verify that you've configured it correctly. You can also inspect the configuration or review the log files. Systems Manager gives you several options to troubleshoot why a CloudWatch agent isn't running. Use Systems Manager commands to query the status of the agent, or use automation to update the agent on an instance when a new version is needed. For more information, see "Verify That the CloudWatch Agent Is Running" in the Amazon CloudWatch User Guide (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/troubleshooting-CloudWatch-Agent.html#CloudWatch-Agent-troubleshooting-verify-running).
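
A status check via Run Command can be sketched as the following parameters. `AmazonCloudWatch-ManageAgent` is the Systems Manager document that manages the agent; the instance ID is hypothetical, and with boto3 you would pass these to `ssm_client.send_command(**ssm_params)`:

```python
# Sketch: ask Run Command to report the CloudWatch agent's status
# on a (hypothetical) instance, with no SSH access required.
ssm_params = {
    "InstanceIds": ["i-0123456789abcdef0"],
    "DocumentName": "AmazonCloudWatch-ManageAgent",
    "Parameters": {
        "action": ["status"],  # query whether the agent is running
        "mode": ["ec2"],       # target is an EC2 instance, not on-premises
    },
}
# With boto3 (not shown): boto3.client("ssm").send_command(**ssm_params)
```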

:::

### Slide 53:

![Slide 53](slide_53.png)

::: Notes


#### Instructor notes

#### Student notes

:::

### Slide 54:

![Slide 54](slide_54.png)

::: Notes


#### Instructor notes 

#### Student notes

:::

### Slide 55:

![Slide 55](slide_55.png)

::: Notes


#### Instructor notes

#### Student notes

This lab takes you through the following steps to install, configure, and update the application:

* Update the AWS Systems Manager agent with the Systems Manager command document (`AWS-UpdateSSMAgent`).
* Install the CloudWatch agent with the command document (`AWS-ConfigureAWSPackage`).
* Configure the CloudWatch agent.
* Start the CloudWatch agent with Run Command, a capability of AWS Systems Manager.
* Check the status of the CloudWatch agent with Run Command.
* Create the CloudWatch dashboard.
* Subscribe to an Amazon SNS topic (`SysOpsTeamPager`).
* Create CloudWatch alarms for CloudWatch agent status checks.
* Test the CloudWatch alarm by manually changing the state of the CloudWatch agent (Session Manager, a capability of AWS Systems Manager).
* Create a Lambda canary function to monitor a website for a specified response.
* Create the CloudWatch alarm for the Lambda function.
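
The canary step above can be sketched as a small function of the kind a Lambda canary would run: fetch a page and report 1 (healthy) or 0 (unhealthy) depending on whether an expected string appears. The URL, expected text, and 0/1 convention are assumptions for illustration; a real canary would also publish the result as a custom CloudWatch metric for the alarm to watch:

```python
import urllib.request

def check_site(url, expected_text, opener=urllib.request.urlopen):
    """Return 1 if the page body contains expected_text, else 0.

    The 0/1 value maps naturally onto a CloudWatch custom metric that an
    alarm can evaluate. `opener` is injectable so the check can be
    exercised without network access.
    """
    try:
        with opener(url, timeout=5) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except Exception:
        return 0
    return 1 if expected_text in body else 0

def lambda_handler(event, context):
    # Hypothetical target site and expected text. A complete handler would
    # also call cloudwatch.put_metric_data (boto3, not shown) with the result.
    healthy = check_site("https://example.com", "Example Domain")
    return {"healthy": healthy}
```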

:::

### Slide 56:

![Slide 56](slide_56.png)

::: Notes


#### Instructor notes

#### Student notes

:::
