Feature Specification: Cloud Snapshot Demo Lifecycle
Feature Branch: 008-cloud-snapshot-lifecycle
Created: 2026-02-27
Status: Draft
Input: Add snapshot-based cloud demo lifecycle management to the existing Hetzner Cloud infrastructure, enabling near-instant demo readiness by restoring from pre-built snapshots instead of provisioning from scratch
User Scenarios & Testing (mandatory)
User Story 1 - Warm Start a Demo Cluster from Snapshots (Priority: P1)
As a presenter preparing for a stakeholder meeting, I need to bring up a fully-provisioned CUI demo cluster in under 5 minutes so that I can demonstrate compliance capabilities without a 25-minute cold-start delay.
Why this priority: This is the core value proposition. The entire feature exists to eliminate the provisioning bottleneck that prevents practical demos. Without fast cluster restore, no other functionality in this feature matters.
Independent Test: Can be fully tested by having a snapshot set available (from a previous cold build) and running the warm-start command. Verify all 4 VMs are accessible, all services are running, and demo scenarios execute successfully.
Acceptance Scenarios:
- Given a snapshot set exists from a previous cluster build, When I run the warm-start command, Then 4 VMs are created from snapshots with the same server types as the original cluster (mgmt01: cpx21, others: cpx11)
- Given VMs are created from snapshots, When the warm-start completes, Then all nodes are attached to a private network with the same IP assignments (mgmt01: 10.0.0.10, login01: 10.0.0.20, compute01: 10.0.0.31, compute02: 10.0.0.32)
- Given the cluster is restored, When I check service health, Then FreeIPA, Slurm, Wazuh, NFS, Munge, and chronyd are all running on their respective nodes
- Given the cluster is restored, When I run any existing demo scenario (A, B, C, or D), Then the scenario executes identically to a cold-provisioned cluster
- Given no snapshot set exists, When I run the warm-start command, Then I see a clear message directing me to build a cluster first and create snapshots
User Story 2 - Create Snapshot Set from Running Cluster (Priority: P1)
As a presenter who has just completed a successful cold-build provisioning, I need to snapshot the entire cluster so that future demos can start in minutes instead of waiting for full provisioning.
Why this priority: Equal to warm-start. Without the ability to create snapshots, there is nothing to restore from. This is the "plant the seed" step that enables all future fast starts.
Independent Test: Can be fully tested by running demo-cloud-up.sh to completion, then creating snapshots, and verifying the snapshot set is listed and contains metadata for all 4 VMs.
Acceptance Scenarios:
- Given a running, fully-provisioned cluster, When I run the snapshot command, Then all 4 VMs are snapshotted via the cloud API
- Given snapshots are being created, When the process runs, Then I see progress output showing each VM being snapshotted with its name and status
- Given snapshots are complete, When I list available snapshot sets, Then I see the new set with creation date, VM names, and snapshot identifiers
- Given a successful cold-build via demo-cloud-up.sh, When provisioning completes, Then I am prompted with the option to snapshot the cluster for future fast starts
- Given I have multiple snapshot sets, When I list them, Then they are displayed chronologically with identifying labels
User Story 3 - Health Check a Running Cluster (Priority: P1)
As a presenter about to start a demo, I need to verify that all critical services are operational so that I can confidently begin my presentation without surprises.
Why this priority: A restored cluster is only useful if services actually came back correctly. The health check is the trust layer that confirms readiness. It runs automatically during warm-start but must also be available independently.
Independent Test: Can be tested by running the health check against any running cluster (cold-built or restored) and verifying it produces a clear pass/fail summary.
Acceptance Scenarios:
- Given a running cluster, When I run the health check, Then I see a summary table showing pass/fail status for each service on each node
- Given all services are healthy, When the health check completes, Then it exits with code 0 and displays an all-clear message
- Given one or more services are down, When the health check completes, Then it exits with a non-zero code and clearly identifies which services on which nodes have failed
- Given a warm-start has just completed, When the warm-start process finishes, Then the health check runs automatically as a final verification step
User Story 4 - Graceful Session Wind-Down (Priority: P2)
As a presenter who has finished a demo session, I need to cleanly shut down the cluster with the option to preserve current state before teardown so that demo artifacts are not lost and billing stops promptly.
Why this priority: Important for cost management and data preservation, but secondary to the core warm-start/snapshot workflow. Users can always use the existing demo-cloud-down.sh as a fallback.
Independent Test: Can be tested by running the wind-down command on a running cluster, optionally choosing to snapshot first, and verifying all resources are destroyed and cost summary is displayed.
Acceptance Scenarios:
- Given a running cluster, When I run the wind-down command, Then I am asked whether to snapshot current state before teardown
- Given I choose to snapshot before teardown, When teardown proceeds, Then a snapshot set is created before resources are destroyed
- Given I choose not to snapshot, When teardown proceeds, Then resources are destroyed immediately (with confirmation)
- Given teardown completes, When the process finishes, Then I see session duration and estimated cost for the session
User Story 5 - Manage Snapshot Sets (Priority: P2)
As a user managing cloud costs, I need to list and delete old snapshot sets so that I do not accumulate storage charges for outdated snapshots.
Why this priority: Housekeeping capability that prevents cost creep. Not needed for initial demo workflows but becomes important over time.
Independent Test: Can be tested by creating multiple snapshot sets, listing them, deleting one, and verifying it no longer appears in the list.
Acceptance Scenarios:
- Given multiple snapshot sets exist, When I list them, Then I see each set with creation date, label, and number of snapshots
- Given I identify an old snapshot set, When I delete it, Then all snapshots in the set are removed from the cloud provider
- Given I delete a snapshot set, When I list remaining sets, Then the deleted set no longer appears
Edge Cases
- What happens when a snapshot restore fails partway through (e.g., 2 of 4 VMs created)? System provides cleanup guidance and exits with error, leaving partial resources tagged for identification.
- What happens when the cloud provider's snapshot API is temporarily unavailable? System retries with backoff and reports the specific API error.
- What happens when a restored cluster's services fail to start (e.g., FreeIPA fails after IP reassignment)? Health check catches and reports the failures; system suggests re-creating snapshots from a fresh build.
- What happens when snapshot storage quota is exceeded? System displays quota error and suggests deleting old snapshot sets.
- What happens when the warm-start command is run while a cluster already exists? System blocks the operation and warns the user to tear down the existing cluster first.
- What happens when the private network IP range is already in use by another Hetzner resource? System reports the conflict and suggests teardown of the conflicting resource.
Requirements (mandatory)
Functional Requirements
Snapshot Creation
- FR-001: System MUST snapshot all 4 VMs (mgmt01, login01, compute01, compute02) as an atomic set via the cloud provider's snapshot API. Before creating snapshots, the system MUST stop critical services (FreeIPA, Slurm, Wazuh, Munge) on each node to ensure database and state file consistency, then restart them after snapshot completion
- FR-002: System MUST label each snapshot with a set identifier (format: rcd-demo-YYYYMMDD-NN, where NN is a two-digit sequence number starting at 01, incrementing for multiple sets created on the same day), VM name, node role, and cluster metadata
- FR-003: System MUST store snapshot set metadata (snapshot IDs, creation date, source cluster state, VM-to-snapshot mapping) in a local manifest file for later restore
- FR-004: System MUST verify that the source cluster is fully provisioned and services are running before creating snapshots
- FR-005: System MUST prompt users to create snapshots upon successful completion of a cold-build provisioning
Snapshot Restore (Warm Start)
- FR-006: System MUST create new VMs from the most recent snapshot set, using the same server types as the original cluster
- FR-007: System MUST create a new private network and attach all restored VMs with the same IP assignments as the original cluster (10.0.0.10, 10.0.0.20, 10.0.0.31, 10.0.0.32)
- FR-008: System MUST generate a fresh Ansible inventory file compatible with the existing demo playbook inventory format
- FR-009: System MUST run the health check automatically after restore to verify all services are operational
- FR-010: System MUST block warm-start if an existing cluster is detected
- FR-011: System MUST block warm-start and display guidance if no snapshot sets exist
Health Check
- FR-012: System MUST verify the following services on mgmt01: FreeIPA server, slurmctld, wazuh-manager, NFS exports, munge, chronyd
- FR-013: System MUST verify the following services on login01: FreeIPA client enrollment (sssd.service), munge, wazuh-agent, NFS mount, chronyd
- FR-014: System MUST verify the following services on compute nodes: FreeIPA client enrollment (sssd.service), slurmd, munge, wazuh-agent, NFS mount, chronyd
- FR-015: System MUST output a structured pass/fail summary table showing each node and service status
- FR-016: System MUST exit with non-zero status if any service check fails
- FR-016a: When a service check fails, system MUST attempt one automatic restart of the failed service and re-check before reporting failure. If the service remains down after the restart attempt, report it as failed
Session Wind-Down
- FR-017: System MUST offer to snapshot the current cluster state before teardown
- FR-018: System MUST destroy all cloud resources (VMs, networks, SSH keys) using the same mechanism as the existing teardown
- FR-019: System MUST report session duration and estimated cost upon completion
Snapshot Management
- FR-020: System MUST support listing all available snapshot sets with creation date and label
- FR-021: System MUST support deleting a specific snapshot set by label, removing all associated snapshots from the cloud provider
- FR-022: System MUST confirm before deleting snapshot sets
Integration
- FR-023: System MUST provide Makefile targets: demo-warm, demo-cool, demo-snapshot, demo-health
- FR-024: System MUST work inside the existing Docker container (rcd-demo-infra image) and also natively when required CLI tools are installed locally
- FR-025: System MUST respect existing TTL safety checks when operating on restored clusters
- FR-026: System MUST NOT modify existing demo scenarios or playbooks; restored clusters MUST be compatible with existing demo workflows without changes
Key Entities
- SnapshotSet: A labeled group of VM snapshots representing a complete cluster state; contains set label, creation timestamp, source cluster metadata, and individual snapshot references
- SnapshotManifest: Local file storing snapshot set metadata; maps set labels to cloud snapshot IDs, VM names, server types, and private IP assignments
- ServiceHealthReport: Result of a health check run; contains per-node, per-service pass/fail status and an overall cluster readiness assessment
- DemoSession: Runtime state (not persisted) representing a warm-started cluster instance; tracked via Hetzner server labels and computed on-demand during wind-down (session duration, estimated cost)
Success Criteria (mandatory)
Measurable Outcomes
- SC-001: A cluster restored from snapshots is fully operational (all services healthy, all demo scenarios runnable) in under 5 minutes from command invocation
- SC-002: Snapshot creation for a 4-node cluster completes in under 10 minutes
- SC-003: Health check completes in under 60 seconds and correctly identifies all service failures
- SC-004: All 4 existing demo scenarios (A, B, C, D) execute successfully on a snapshot-restored cluster without any playbook modifications
- SC-005: Session wind-down destroys all resources with zero orphaned cloud resources
- SC-006: Snapshot set listing and deletion operations complete in under 30 seconds
- SC-007: The end-to-end workflow (warm-start, run demo scenario B, wind-down) completes in under 15 minutes total
Scope
In Scope
- Hetzner Cloud snapshot create/restore/delete operations via hcloud CLI
- Local snapshot manifest file management
- Health check script for all critical cluster services
- Warm-start and wind-down scripts
- Integration with existing Makefile, Docker wrapper, and TTL checks
- Prompt to snapshot after successful cold-build
Out of Scope
- Changes to the Vagrant demo lab (separate feature)
- Changes to existing demo scenarios or playbooks
- CI/CD pipeline for automated reproducibility testing (separate feature)
- Multi-region or multi-provider snapshot support
- Incremental or differential snapshots
- Automatic snapshot rotation or expiry policies
- Snapshot transfer between Hetzner Cloud projects
Assumptions
- Users have an active Hetzner Cloud account with snapshot creation permissions
- The hcloud CLI is available (installed locally or in the Docker container)
- Hetzner Cloud snapshot API preserves full disk state including running service configurations
- FreeIPA, Slurm, Wazuh, NFS, and Munge services resume correctly after a VM is restored from snapshot and assigned to the same private IP
- Snapshot storage costs are acceptable to users (Hetzner charges per GB/month for snapshots)
- The Docker container image (rcd-demo-infra) already includes the hcloud CLI
- A cold-build provisioning (demo-cloud-up.sh) has been completed at least once before snapshot workflows can be used
Dependencies
- Spec 007 (Cloud Demo Infrastructure): Provides demo-cloud-up.sh, demo-cloud-down.sh, check-ttl.sh, Terraform configuration, Docker wrapper, and Ansible provisioning playbook that this feature extends
- Spec 006 (Vagrant Demo Lab): Provides demo scenarios (A, B, C, D) and playbooks that must work unchanged on snapshot-restored clusters
- Hetzner Cloud Snapshot API: Required for creating and restoring VM snapshots (external dependency)
- hcloud CLI: Required for snapshot operations (bundled in Docker container)
Clarifications
Session 2026-02-27
- Q: Should VMs be shut down, have services stopped, or be snapshotted live? β A: Stop critical services (FreeIPA, Slurm, Wazuh, Munge) before snapshot to protect database consistency; VMs stay running; services restart after snapshot completion
- Q: How should snapshot set label uniqueness be handled for multiple sets on the same day? β A: Append a two-digit sequence suffix (rcd-demo-YYYYMMDD-01, rcd-demo-YYYYMMDD-02)
- Q: Should the health check attempt automatic remediation of failed services or only report? β A: One automatic restart attempt per failed service, then report if still failing