# Hands-On Labs — AWS SysOps Administrator

This directory contains six hands-on labs for the **AWS SysOps Administrator** course. Each lab builds on real-world operational scenarios that a Cloud Operations team might face. The labs progress from foundational auditing and infrastructure management skills through monitoring, backup, and finally a capstone that ties all the threads together.

All files are self-contained HTML pages that open in any modern browser. No internet connection is required for the lab content itself (images are stored locally in `assets/`); only icon fonts are fetched from a CDN.

---

## Lab Overview

| # | Title | Core Services | Duration |
|---|-------|---------------|----------|
| 1 | [Auditing Your AWS Resources with SSM and Config](#lab-1-auditing-your-aws-resources-with-aws-systems-manager-and-aws-config) | Systems Manager, AWS Config | ~60 min |
| 2 | [Infrastructure as Code](#lab-2-infrastructure-as-code) | CloudFormation, Infrastructure Composer | ~75 min |
| 3 | [Operations as Code](#lab-3-operations-as-code) | Systems Manager, CloudWatch Logs | ~60 min |
| 4 | [Monitoring Applications and Infrastructure](#lab-4-monitoring-applications-and-infrastructure) | CloudWatch, SNS, Lambda | ~75 min |
| 5 | [Automating with AWS Backup for Archiving and Recovery](#lab-5-automating-with-aws-backup-for-archiving-and-recovery) | AWS Backup, SNS | ~60 min |
| 6 | [Capstone Lab for CloudOps](#lab-6-capstone-lab-for-cloudops) | CloudFormation, Config, SSM, EventBridge | ~90 min |

---

## Lab 1: Auditing Your AWS Resources with AWS Systems Manager and AWS Config

**File:** [lab-1-auditing-your-aws-resources-with-ssm-and-config.html](lab-1-auditing-your-aws-resources-with-ssm-and-config.html)

### Scenario
The operations team has been asked to prove to the security team that all EC2 instances meet company policy — specific software must be present, and certain configuration items must never drift. The team has no standing SSH access to the servers, so they need a scalable, agent-based approach.

### Goals
- Use **AWS Systems Manager Inventory** to collect software and configuration metadata from managed instances without opening SSH ports.
- Author and apply an **AWS Config rule** to continuously evaluate whether instances meet a compliance policy.
- Understand the relationship between SSM managed instances, the SSM Agent, and IAM instance profiles.

### Main AWS Services
- **AWS Systems Manager** — Inventory, Session Manager, Fleet Manager
- **AWS Config** — managed rules, compliance timeline, resource configuration history

### Task Summary
1. Explore the pre-built environment (VPC, EC2 instances, IAM roles).
2. Enable and configure SSM Inventory for managed instances.
3. Query inventory data and review collected metadata.
4. Create an AWS Config rule and evaluate compliance.
5. Interpret compliance results and remediation options.

### Tips
- Make sure the EC2 instance profile includes the `AmazonSSMManagedInstanceCore` managed policy — without it the instance will not appear as "managed" in SSM.
- Inventory collection is asynchronous. After enabling it, wait 2–3 minutes before querying data; the console may show "No data" initially.
- Config rules evaluate on a schedule (periodic) or on configuration change; understand which trigger type your rule uses before expecting immediate results.
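
As a sketch of what a change-triggered rule can look like, here is a hypothetical CloudFormation snippet using the AWS managed `desired-instance-type` rule. The rule name and the `t3.micro` parameter are placeholders for illustration, not the lab's actual rule:

```yaml
Resources:
  ApprovedInstanceTypeRule:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: approved-instance-types   # placeholder name
      Source:
        Owner: AWS
        # Managed rule that evaluates on configuration change, not on a schedule
        SourceIdentifier: DESIRED_INSTANCE_TYPE
      InputParameters:
        instanceType: t3.micro                  # assumed "approved" type for this sketch
      Scope:
        ComplianceResourceTypes:
          - AWS::EC2::Instance
```

Because this rule is change-triggered, a compliance result should appear shortly after an in-scope instance is created or modified, rather than waiting for a periodic evaluation window.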

### Potential Issues
- **Instance not appearing as SSM-managed**: The SSM Agent must be running and able to reach the SSM endpoints (either via internet gateway or VPC endpoints). Check the IAM role first, then network connectivity.
- **Config rule shows "No results"**: Config must be enabled in the Region before rules work. The lab environment should pre-enable it, but confirm in the Config console.
- **Compliance evaluation lag**: Periodic rules can take up to 24 hours between evaluations by default. The lab uses on-change triggers to avoid this.

---

## Lab 2: Infrastructure as Code

**File:** [lab-2-infrastructure-as-code.html](lab-2-infrastructure-as-code.html)

### Scenario
The team is tired of manually clicking through the console every time a new environment is needed. They want to define a reproducible application stack as a CloudFormation template, deploy it with one command, and be able to detect if anyone makes manual changes that cause configuration drift.

### Goals
- Create and deploy an **AWS CloudFormation** stack from a YAML template.
- Use **AWS Infrastructure Composer** (formerly AWS Application Composer) to visually edit and validate a template.
- Organize deployed resources using **Resource Groups** and **Tag Editor**.
- Detect and review **drift** when a resource is changed outside of CloudFormation.

### Main AWS Services
- **AWS CloudFormation** — stacks, YAML templates, outputs, drift detection
- **AWS Infrastructure Composer** — visual template designer
- **Resource Groups & Tag Editor** — resource organization and tagging
- **Amazon EC2** — AppServer instances provisioned by the stack

### Task Summary
1. Explore an existing CloudFormation template and understand its structure.
2. Use Infrastructure Composer to visualize and modify the template.
3. Deploy the stack (named **Working**) and observe resource creation events.
4. Apply tags and group resources using Resource Groups.
5. Manually change a stack resource and then run drift detection to identify the change.

### Tips
- CloudFormation stack names are immutable after creation. If the lab asks for a specific name like "Working", use it exactly.
- The **Events** tab in the stack detail is your best friend during deployment — it shows the order of resource creation and any failure reasons with error messages.
- Infrastructure Composer validates template syntax as you type; pay attention to the inline error indicators before saving.
- Drift detection is not automatic — you must explicitly trigger it. Results appear in the **Drift status** section of the stack.
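
A minimal template sketch showing the structure the lab explores — Parameters, Resources, Outputs — assuming an Amazon Linux 2023 AMI resolved through SSM (the lab's real template will differ):

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal app-server stack sketch (illustrative only)
Parameters:
  LatestAmiId:
    # Resolves the current Amazon Linux 2023 AMI at deploy time
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64
Resources:
  AppServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: !Ref LatestAmiId
      InstanceType: t3.micro
      Tags:
        - Key: Role
          Value: AppServer
Outputs:
  AppServerInstanceId:
    Description: Instance ID to check after running drift detection
    Value: !Ref AppServer
```

Deployed with a stack name of **Working**, a manual change to the instance (for example, editing its tags in the EC2 console) is exactly the kind of out-of-band modification drift detection will flag.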

### Potential Issues
- **Stack rollback on CREATE_FAILED**: Read the Events tab error message carefully. Common causes are IAM permission boundaries, resource limits (e.g., VPC or EIP limits), or name conflicts with existing resources.
- **Drift detection says "Detection in progress" for a long time**: This is normal for large stacks. For small lab stacks, it should complete in under a minute. Refresh the page if it seems stuck.
- **Infrastructure Composer not saving**: The browser may block popups or downloads needed to save locally. Use the **Save to CloudFormation** option to persist changes directly in the service.

---

## Lab 3: Operations as Code

**File:** [lab-3-operations-as-code.html](lab-3-operations-as-code.html)

### Scenario
After establishing infrastructure-as-code practices, the team now wants to apply the same philosophy to *operational tasks* — things like installing packages, restarting services, and resizing instances — without ever logging into a server. All operational runbooks should be version-controlled, auditable documents.

### Goals
- Create and run an **SSM Command Document** to install and configure Apache on a fleet of WebServer instances.
- Stream command output to **Amazon CloudWatch Logs** for persistent, searchable audit records.
- Create and run an **SSM Automation Document** to perform an instance resize (stop → change type → start).

### Main AWS Services
- **AWS Systems Manager** — Run Command, Command Documents (AWS-RunShellScript), Automation Documents (AWS-ResizeInstance)
- **Amazon CloudWatch Logs** — log groups, log streams for SSM output
- **Amazon EC2** — WebServer instances targeted by documents

### Task Summary
1. Explore the pre-built SSM document library.
2. Create a custom Command Document and run it against a target instance.
3. Configure CloudWatch Logs as the output destination and verify log delivery.
4. Review SSM Run Command history for audit purposes.
5. Execute an SSM Automation to resize an EC2 instance and observe the state-machine execution steps.

### Tips
- When targeting instances in Run Command, prefer **tag-based targeting** (e.g., `Key=tag:Role,Values=WebServer` in the CLI) over instance ID lists; it scales automatically as the fleet grows.
- SSM Automation executions show a step-by-step state machine view; use it to understand which step failed if an execution errors out.
- CloudWatch Logs integration requires that the SSM Agent have permission to write logs — verify the instance role includes `CloudWatchAgentServerPolicy` or equivalent.
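
A minimal Command Document sketch for the Apache task, assuming an Amazon Linux target (the `dnf`/`yum` fallback covers both AL2023 and AL2; your lab's document will likely differ):

```yaml
schemaVersion: "2.2"
description: Install and start Apache on Amazon Linux (illustrative sketch)
mainSteps:
  - action: aws:runShellScript
    name: installApache
    inputs:
      runCommand:
        # dnf on Amazon Linux 2023, yum on Amazon Linux 2
        - dnf install -y httpd || yum install -y httpd
        - systemctl enable --now httpd
```

When run with CloudWatch Logs output enabled, each `runCommand` line's stdout/stderr lands in a log stream per instance, giving you the searchable audit trail the scenario asks for.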

### Potential Issues
- **Command shows "Delivery Timed Out"**: The instance is not managed by SSM (agent not running, no IAM profile, or no network path to SSM endpoints). Check Fleet Manager to confirm managed status.
- **Logs not appearing in CloudWatch**: SSM Command output to CloudWatch is a separate permission from the command itself. Missing `logs:CreateLogStream` or `logs:PutLogEvents` permission is the usual culprit.
- **Automation fails at ResizeInstance step**: The instance must be stopped before the type can be changed. The AWS-ResizeInstance automation handles this automatically, but it needs `ec2:StopInstances` and `ec2:StartInstances` permissions on the automation IAM role.

---

## Lab 4: Monitoring Applications and Infrastructure

**File:** [lab-4-monitoring-applications-and-infrastructure.html](lab-4-monitoring-applications-and-infrastructure.html)

### Scenario
The application is live, but the team is flying blind — they have no visibility into whether the servers are healthy or the app is responding correctly. The security team also wants alerts when CPU pegs at 100%. The team needs a monitoring stack: custom metrics, dashboards, alarms, and synthetic monitoring.

### Goals
- Install and configure the **CloudWatch Agent** on an EC2 instance to collect custom OS metrics (memory, disk).
- Build a **CloudWatch Dashboard** to visualize infrastructure health at a glance.
- Create **CloudWatch Alarms** and route notifications through **Amazon SNS**.
- Deploy a **Lambda canary** to synthetically monitor application availability.

### Main AWS Services
- **Amazon CloudWatch** — agent, custom metrics, dashboards, alarms, Log Insights
- **Amazon SNS** — notification topics and email/SMS subscriptions
- **AWS Lambda** — canary function for synthetic monitoring
- **Amazon EC2** — AppServer instances being monitored

### Task Summary
1. Install and configure the CloudWatch Agent on an EC2 instance via SSM.
2. Verify custom metrics (memory utilization, disk usage) appear in CloudWatch.
3. Create a dashboard with widgets for EC2 health metrics.
4. Create the `AppServerSystemsCheckAlarm` alarm and wire it to an SNS topic.
5. Confirm the SNS email subscription and test alarm triggering.
6. Deploy and configure a Lambda canary to perform HTTP availability checks.

### Tips
- Confirm the SNS email subscription before testing alarms — new email subscriptions start in `PendingConfirmation`, and no notifications are delivered until you click the confirmation link in the email.
- The CloudWatch Agent config file can be generated interactively using `amazon-cloudwatch-agent-config-wizard` or stored as an SSM Parameter and applied via Run Command — both approaches are valid.
- Lambda canary results appear in CloudWatch under **Synthetics** → **Canaries**; allow 2–3 minutes for the first run to complete.
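
A CloudWatch Agent config fragment collecting the custom metrics the lab expects (memory and root-disk usage). This is a sketch of the agent's JSON schema; the lab may store it as an SSM Parameter or write it under `/opt/aws/amazon-cloudwatch-agent/etc/`:

```json
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["/"]
      }
    }
  }
}
```

The `append_dimensions` entry tags every data point with the instance ID, which is what lets dashboard widgets and alarms target a single instance's memory or disk metrics.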

### Potential Issues
- **Custom metrics not appearing**: Confirm the CloudWatch Agent is running (`systemctl status amazon-cloudwatch-agent`) and the config file is valid JSON. Missing IAM permissions (`cloudwatch:PutMetricData`) will also silently drop metrics.
- **Alarm stays in INSUFFICIENT_DATA**: CloudWatch has not yet received enough data points for the metric. Wait until the agent has sent at least one data point; if you set the alarm period below 60 seconds, the metric must be published as a high-resolution metric or the alarm will never receive data.
- **Canary failing immediately**: Check the Lambda execution role — it needs `s3:PutObject` to store canary artifacts and the target URL must be reachable from within the Lambda network context.

---

## Lab 5: Automating with AWS Backup for Archiving and Recovery

**File:** [lab-5-automating-with-aws-backup-for-archiving-and-recovery.html](lab-5-automating-with-aws-backup-for-archiving-and-recovery.html)

### Scenario
A compliance audit reveals that the company has no documented, tested backup policy for EC2 instances. Leadership wants automated, scheduled backups, notifications when jobs complete, and proof that recovery actually works — not just that backups were taken.

### Goals
- Create an **AWS Backup plan** with scheduled backup rules and lifecycle policies (transition to cold storage, retention period).
- Assign EC2 resources to the backup plan using tag-based resource assignment.
- Configure **SNS notifications** for backup job state changes.
- Perform a **point-in-time restore** from a recovery point and verify the recovered instance.

### Main AWS Services
- **AWS Backup** — backup plans, backup vaults, recovery points, resource assignment
- **Amazon SNS** — job completion and failure notifications
- **Amazon EC2** — source and recovered instances
- **Amazon EBS** — snapshots used as recovery points

### Task Summary
1. Create a backup vault to store recovery points.
2. Define a backup plan with a cron-based schedule and retention rules.
3. Assign EC2 resources to the plan using tag selectors.
4. Create an SNS topic and subscribe it to AWS Backup job notifications.
5. Trigger an on-demand backup and observe the recovery point creation.
6. Restore from the recovery point and validate the recovered instance.

### Tips
- **Lifecycle rules** on backup plans control transitions to cold (cheaper) storage and expiration. Even in a lab, set consistent values: AWS Backup requires recovery points to remain in cold storage for at least 90 days, so the retention (delete-after) period must exceed the cold-transition delay by at least 90 days, and a plan that violates this will be rejected.
- Tag-based resource assignment is the recommended production pattern because it is dynamic — newly tagged instances are backed up automatically without modifying the plan.
- On-demand backups initiated manually still use the selected vault and respect any vault lock policies.
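
A backup plan sketch that honors the 90-day rule, in the JSON shape passed to `aws backup create-backup-plan --backup-plan file://plan.json` (the plan name, vault name, and schedule are placeholders):

```json
{
  "BackupPlanName": "lab-daily-plan",
  "Rules": [
    {
      "RuleName": "DailyWithColdTransition",
      "TargetBackupVaultName": "lab-vault",
      "ScheduleExpression": "cron(0 5 ? * * *)",
      "Lifecycle": {
        "MoveToColdStorageAfterDays": 30,
        "DeleteAfterDays": 120
      }
    }
  ]
}
```

Note the relationship between the two lifecycle values: `DeleteAfterDays` (120) is exactly `MoveToColdStorageAfterDays` (30) plus the required 90 days of cold storage; anything smaller fails validation.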

### Potential Issues
- **"No resources found" for resource assignment**: The tags on the EC2 instance must exactly match the tag key/value in the backup plan's resource assignment selector (case-sensitive).
- **SNS notifications not received**: AWS Backup notifications require the SNS topic to have a resource-based policy that allows `backup.amazonaws.com` to publish. The console creates this automatically; manual SDK/CLI creation may miss it.
- **Restore fails with IAM error**: The restore job needs an IAM role with `AWSBackupServiceRolePolicyForRestores`. Confirm the role ARN selected during restore has this policy attached.
- **Recovery point creation is slow**: Backup jobs are asynchronous. A small EBS volume backup typically completes in 5–15 minutes. The console status will change from `RUNNING` to `COMPLETED`.

---

## Lab 6: Capstone Lab for CloudOps

**File:** [lab-6-capstone.html](lab-6-capstone.html)

### Scenario
The team has graduated from individual tools to building a fully integrated CloudOps practice. In this capstone, they face a realistic operational scenario that requires coordinating CloudFormation drift remediation, automated Config rule enforcement, SSM-based remediation runbooks, and event-driven alerting through EventBridge and SNS — just like a real on-call shift.

### Goals
- Detect and investigate **CloudFormation stack drift** introduced by a simulated unauthorized change.
- Use **AWS Config** with **auto-remediation** to enforce a compliance rule automatically.
- Build an **SSM Automation** remediation document that Config can invoke as a corrective action.
- Wire **Amazon EventBridge** rules to trigger SNS notifications for operational events.

### Main AWS Services
- **AWS CloudFormation** — drift detection and remediation
- **AWS Config** — managed rules, auto-remediation, remediation actions linked to SSM
- **AWS Systems Manager** — Automation documents used as Config remediation targets
- **Amazon EventBridge** — event rules for Config compliance change events
- **Amazon SNS** — operational alerts and notifications
- **Amazon EC2** — resources under compliance governance

### Task Summary
1. Review the pre-built environment and identify the resources under compliance governance.
2. Simulate a configuration drift event (manual EC2 change).
3. Run CloudFormation drift detection and identify the drifted resource.
4. Create an AWS Config rule and link an SSM Automation document as the auto-remediation action.
5. Configure an EventBridge rule to capture Config compliance-change events.
6. Verify end-to-end: trigger a compliance violation and confirm auto-remediation fires and SNS notification is received.

### Tips
- This lab requires coordinating multiple AWS services, so reading through all task steps before starting is especially valuable — understanding the end state helps during configuration.
- **EventBridge rules for Config** use the event pattern `{"source": ["aws.config"], "detail-type": ["Config Rules Compliance Change"]}`. Scope it further with the rule name to avoid a flood of events.
- When linking SSM Automation as a Config remediation action, the automation document's assume role must have permissions on the target resources (EC2, SSM) **and** Config must be allowed to pass that role (`iam:PassRole`).
- Test each piece in isolation first: confirm the Config rule evaluates correctly before testing auto-remediation.
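
The event pattern from the tip above, scoped down to a single rule and to non-compliant results only. The field names match the shape of Config compliance-change events; `approved-instance-types` is a placeholder rule name:

```json
{
  "source": ["aws.config"],
  "detail-type": ["Config Rules Compliance Change"],
  "detail": {
    "configRuleName": ["approved-instance-types"],
    "newEvaluationResult": {
      "complianceType": ["NON_COMPLIANT"]
    }
  }
}
```

Scoping on `complianceType` means the rule fires only when a resource *becomes* non-compliant, not again when remediation flips it back — which keeps the SNS alert volume meaningful during the end-to-end test.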

### Potential Issues
- **Auto-remediation triggers but does nothing**: Check the SSM Automation execution history — if the execution role lacks required permissions, the automation will fail silently from Config's perspective.
- **EventBridge rule not firing**: Confirm the rule is **enabled** and the event pattern JSON is valid. Use the **Test event pattern** feature in EventBridge to validate against a sample Config event before relying on live events.
- **Drift detection shows "In sync" after a manual change**: Allow 1–2 minutes for the change to propagate to CloudFormation's configuration tracking, then re-run detection. Drift detection compares the *current* resource state to the *last known stack state*, so the change must already be applied.
- **SNS subscription not confirmed**: Like lab 4, email subscriptions start in `PENDING_CONFIRMATION`. Always confirm the subscription before testing alert flows.

---

## General Lab Tips

- **Console region**: All resources are created in a specific AWS Region. Confirm you are in the correct Region before starting — mismatched Regions are one of the most common sources of "resource not found" errors.
- **IAM permissions**: Lab environments use a restricted IAM user or role. If an action is denied, check whether the task is asking you to use a different service or assume a role before proceeding.
- **Browser console errors**: If a lab page button or diagram does not render, open the browser developer tools (F12). The labs load Font Awesome icons from a CDN, so icons render as solid squares when you are offline; all other content and images are local.
- **Timing**: Many AWS operations are asynchronous. If a resource is not visible immediately after creation, wait 30–60 seconds and refresh rather than re-creating it.
- **Cleanup**: Lab accounts are usually cleaned up automatically, but if you are running in a personal account, delete CloudFormation stacks (which cascade-deletes most resources), backup plans, and Config rules to avoid ongoing charges.
