Feature Specification: HPC-Specific CUI Compliance Roles

Feature Branch: 004-hpc-cui-roles
Created: 2026-02-15
Status: Draft
Dependencies: Specs 001 (Data Models), 002 (Core Ansible Roles), 003 (Compliance Assessment)

Clarifications

Session 2026-02-15

Overview

This specification defines HPC-specific Ansible roles that integrate CUI compliance requirements with research computing operations. Unlike general-purpose servers, HPC environments pose distinct security challenges: batch job scheduling, container execution, high-performance interconnects, parallel filesystems, and researcher workflows that must keep functioning while the system remains compliant.

The roles bridge the gap between security requirements and research computing realities, providing automation that protects CUI data while enabling legitimate scientific work.

User Scenarios & Testing (mandatory)

User Story 1 - Slurm CUI Partition Operations (Priority: P1)

A researcher submits a job to process CUI data on the cluster. The system must verify authorization before execution begins, ensure memory is cleared after job completion, and generate audit evidence that the security controls worked correctly.

Why this priority: Job execution is the core function of an HPC cluster. Without secure job handling, no CUI work can occur. This is the foundational capability all other stories depend on.

Independent Test: Can be fully tested by submitting jobs to the CUI partition with authorized and unauthorized users, verifying prolog blocks unauthorized access, epilog clears memory, and audit logs capture all events.

Acceptance Scenarios:

  1. Given a researcher with valid CUI training and group membership, When they submit a job to the CUI partition, Then the prolog validates their authorization, logs job start with CUI audit tags, and the job executes normally.
  2. Given a researcher whose CUI training has expired, When they submit a job to the CUI partition, Then the prolog rejects the job with a clear message explaining the training requirement.
  3. Given a completed CUI job on a GPU node, When the job ends, Then the epilog clears /dev/shm, /tmp, resets GPU memory, flushes audit logs, and verifies node health before accepting new jobs.
  4. Given a running CUI job, When an auditor requests evidence, Then CUI-specific sacct fields provide job attribution details that integrate with evidence collection.
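The authorization decision in scenarios 1 and 2 can be sketched as a small pure function. This is a minimal illustration, not the role's implementation: the `Researcher` record, the `authorize_cui_job` name, and the message wording are all hypothetical, and how the decision is wired into the Slurm prolog is left to the role.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Researcher:
    """Hypothetical view of a user as the prolog might see it."""
    username: str
    groups: FrozenSet[str]
    training_expires: Optional[date]  # None: no CUI training on record


def authorize_cui_job(user: Researcher, required_group: str,
                      today: date) -> Tuple[bool, str]:
    """Decide whether a job may start on the CUI partition.

    Returns (allowed, message); the message is intended for the user.
    """
    if required_group not in user.groups:
        return False, (f"Job rejected: {user.username} is not a member of "
                       f"'{required_group}'. Ask your PI to request access.")
    if user.training_expires is None:
        return False, ("Job rejected: no CUI training on record. "
                       "Complete CUI training before submitting CUI jobs.")
    if user.training_expires < today:
        return False, ("Job rejected: CUI training expired on "
                       f"{user.training_expires.isoformat()}. "
                       "Renew training to submit CUI jobs.")
    return True, "authorized"
```

Keeping the check free of side effects makes both the authorized and the expired-training paths easy to exercise in the independent test described above.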

User Story 2 - Container Security in CUI Enclave (Priority: P1)

A researcher needs to run containerized scientific software (Python/R environments, simulation codes) on CUI data. The container runtime must enforce signed images, restrict filesystem access to approved paths, block network egress, and log all container activity.

Why this priority: Containers are ubiquitous in research computing. Without container support, researchers cannot use standard scientific workflows, making the enclave impractical.

Independent Test: Can be fully tested by attempting to run signed/unsigned containers, accessing restricted paths, and attempting network connections, verifying each restriction works independently.

Acceptance Scenarios:

  1. Given a researcher with a signed container image, When they execute it in the CUI enclave, Then the container runs with only CUI-approved bind mounts and no outbound network access.
  2. Given a researcher with an unsigned container image, When they attempt to run it, Then execution is blocked with a clear error message explaining signature requirements.
  3. Given a running container, When it attempts to access paths outside approved directories, Then the access is denied and logged.
  4. Given any container execution, When it completes, Then an audit log entry captures the container image, user, execution time, and data paths accessed.
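The three restrictions above (signed image, approved bind mounts, no network egress) could be evaluated together before launch, roughly as follows. The approved roots, the `ContainerRequest` fields, and the function names are assumptions made for illustration. The signature result is taken as an input, since the verification itself belongs to the container runtime.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Tuple

# Hypothetical approved roots; a real list would come from role variables.
APPROVED_ROOTS = (PurePosixPath("/cui/projects"), PurePosixPath("/cui/scratch"))


@dataclass(frozen=True)
class ContainerRequest:
    image: str
    signature_verified: bool      # outcome of the runtime's own signature check
    bind_mounts: Tuple[str, ...]  # host paths requested as bind mounts
    network_enabled: bool


def is_approved(path: str) -> bool:
    """True if the normalized path is an approved root or lies beneath one."""
    # normpath collapses ".." lexically; it does not resolve symlinks.
    p = PurePosixPath(posixpath.normpath(path))
    return any(p == root or root in p.parents for root in APPROVED_ROOTS)


def policy_violations(req: ContainerRequest) -> List[str]:
    """Return human-readable violations; an empty list means launch is allowed."""
    violations = []
    if not req.signature_verified:
        violations.append(
            f"image '{req.image}' is unsigned or its signature could not be verified")
    for mount in req.bind_mounts:
        if not is_approved(mount):
            violations.append(f"bind mount '{mount}' is outside approved CUI paths")
    if req.network_enabled:
        violations.append("outbound network access is not permitted in the CUI enclave")
    return violations
```

Returning every violation at once, rather than stopping at the first, gives the researcher a complete error message and gives the audit log a full record of why a launch was blocked.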

User Story 3 - Parallel Filesystem Security (Priority: P1)

A system administrator needs to manage CUI project directories on the parallel filesystem with proper access controls, monitor file operations for audit purposes, enforce quotas, and sanitize data when projects complete.

Why this priority: CUI data resides on the parallel filesystem. Without proper filesystem controls, data protection cannot be enforced regardless of other security measures.

Independent Test: Can be fully tested by creating project directories, verifying ACLs match FreeIPA groups, triggering changelog events, testing quota enforcement, and running sanitization.

Acceptance Scenarios:

  1. Given a new CUI project, When storage is provisioned, Then a project directory is created with ACLs matching the FreeIPA group, quota enforcement enabled, and changelog monitoring active.
  2. Given a user not in the project's FreeIPA group, When they attempt to access the project directory, Then access is denied by ACLs.
  3. Given file operations in a CUI directory, When an auditor needs evidence, Then changelog monitoring provides a record of file creation, modification, and deletion events.
  4. Given a completed CUI project, When offboarding is triggered, Then data is sanitized according to policy, sanitization is verified, and completion evidence is generated.
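One way to check that a project directory's ACL matches its FreeIPA group is to compare the ACL text against the expected group name. The sketch below assumes POSIX-style short-text entries (`tag:qualifier:perms`), and it assumes a policy in which named user entries and any extra named groups count as drift. Neither assumption comes from this specification, and the parallel filesystem in use may expose ACLs differently.

```python
from typing import Iterable, List


def acl_findings(acl_lines: Iterable[str], project_group: str) -> List[str]:
    """Compare POSIX-style ACL text entries against the expected project group.

    Returns a list of findings; an empty list means no drift was detected.
    """
    findings = []
    project_group_present = False
    for raw in acl_lines:
        # Drop comment headers and trailing "#effective:..." annotations.
        line = raw.split("#", 1)[0].strip()
        # Default (inherited) entries are out of scope for this sketch.
        if not line or line.startswith("default:"):
            continue
        parts = line.split(":")
        if len(parts) != 3:
            findings.append(f"unparseable ACL entry: {line}")
            continue
        tag, qualifier, perms = parts
        if tag == "group" and qualifier:
            if qualifier == project_group:
                project_group_present = True
            else:
                findings.append(f"unexpected group entry: {qualifier}")
        elif tag == "user" and qualifier:
            findings.append(f"unexpected named user entry: {qualifier}")
        elif tag == "other" and perms != "---":
            findings.append(f"'other' has permissions: {perms}")
    if not project_group_present:
        findings.append(f"missing entry for project group: {project_group}")
    return findings
```

The findings list can double as audit evidence: an empty result documents that the directory matched policy at the time of the check.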

User Story 4 - Node Lifecycle Management (Priority: P2)

An HPC administrator provisions new compute nodes, ensures they meet compliance requirements on first boot, validates node health between jobs, and properly decommissions nodes when retired.

Why this priority: Node lifecycle affects compliance posture, but individual nodes can be managed manually at first. Automation improves efficiency but does not block initial operations.

Independent Test: Can be fully tested by PXE booting a new node, verifying compliance scan passes, running health checks between jobs, and executing decommissioning procedures.

Acceptance Scenarios:

  1. Given a new compute node, When it PXE boots, Then it receives the CUI-hardened image and runs an automated compliance scan before joining the cluster.
  2. Given a node that fails the compliance scan, When the scan completes, Then the node is quarantined from production use until the issues are remediated.
  3. Given a node between jobs, When the scheduler checks availability, Then a health check validates the node is ready and compliant.
  4. Given a node being decommissioned, When the process runs, Then media is sanitized per NIST 800-88 guidelines and sanitization is verified and documented.
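The lifecycle in these scenarios can be read as a small state machine. The sketch below is one possible encoding. The state and event names are hypothetical, and it assumes a quarantined node returns to service only after a passing scan.

```python
from enum import Enum


class NodeState(Enum):
    PROVISIONING = "provisioning"
    AVAILABLE = "available"
    QUARANTINED = "quarantined"
    DECOMMISSIONED = "decommissioned"


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (NodeState.PROVISIONING, "scan_passed"): NodeState.AVAILABLE,
    (NodeState.PROVISIONING, "scan_failed"): NodeState.QUARANTINED,
    (NodeState.AVAILABLE, "health_ok"): NodeState.AVAILABLE,
    (NodeState.AVAILABLE, "health_failed"): NodeState.QUARANTINED,
    (NodeState.QUARANTINED, "scan_passed"): NodeState.AVAILABLE,
    (NodeState.QUARANTINED, "scan_failed"): NodeState.QUARANTINED,
    (NodeState.AVAILABLE, "sanitization_verified"): NodeState.DECOMMISSIONED,
    (NodeState.QUARANTINED, "sanitization_verified"): NodeState.DECOMMISSIONED,
}


def next_state(state: NodeState, event: str) -> NodeState:
    """Return the state that follows `event`, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event '{event}' is not valid in state '{state.value}'") from None
```

Because decommissioning is reachable only through `sanitization_verified`, a node cannot be marked retired without the sanitization evidence that scenario 4 requires.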

User Story 5 - Researcher Onboarding/Offboarding (Priority: P2)

A principal investigator (PI) receives a CUI research award and needs their team onboarded to the secure enclave. Later, when the project ends, the team must be offboarded with proper data handling and access revocation.

Why this priority: While critical for operations, initial projects can be onboarded manually. Automation reduces administrative burden and ensures consistency.

Independent Test: Can be fully tested by running onboarding for a test project, verifying all resources are created correctly, then running offboarding and verifying complete cleanup.

Acceptance Scenarios:

  1. Given a new CUI project approval, When onboarding runs, Then a FreeIPA group is created, a Slurm account is configured, a storage directory is provisioned with ACLs, Duo is assigned, and the PI receives a welcome packet with plain-language instructions.
  2. Given a PI receiving the welcome packet, When they read it, Then they understand what their team needs to do (training requirements, access procedures, data handling rules) without technical jargon.
  3. Given a completed CUI project, When offboarding runs, Then all access is revoked, data is archived or sanitized per project requirements, and completion evidence is generated for audit purposes.
  4. Given a completed offboarding, When a team member attempts to access resources, Then all access paths (Slurm, storage, systems) are denied.
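The post-offboarding check across the three access paths could produce a machine-readable evidence record along these lines. The record fields and the `offboarding_evidence` name are assumptions. The probes that determine whether each path is denied are outside this sketch and are passed in as results.

```python
import json
from typing import Dict

REQUIRED_PATHS = ("slurm", "storage", "systems")


def offboarding_evidence(project: str, member: str,
                         access_denied: Dict[str, bool]) -> str:
    """Build a JSON evidence record from per-path probe results.

    `access_denied[path]` is True when a probe confirmed access was refused.
    """
    missing = [p for p in REQUIRED_PATHS if p not in access_denied]
    if missing:
        # Refuse to emit evidence for an incomplete check.
        raise ValueError(f"no probe result for: {', '.join(missing)}")
    still_open = [p for p in REQUIRED_PATHS if not access_denied[p]]
    record = {
        "project": project,
        "member": member,
        "checks": {p: "denied" if access_denied[p] else "ALLOWED"
                   for p in REQUIRED_PATHS},
        "offboarding_complete": not still_open,
    }
    return json.dumps(record, sort_keys=True)
```

Raising on a missing probe result keeps an unchecked path from being reported as revoked.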

User Story 6 - Interconnect Security Documentation (Priority: P3)

A compliance officer needs formal documentation for the InfiniBand RDMA exception within the enclave, demonstrating compensating controls that justify the exception until in-network encryption is available.

Why this priority: Documentation is essential for audits but does not block technical operations. The enclave can operate while documentation is developed in parallel.

Independent Test: Can be fully tested by generating exception documentation, verifying compensating controls are correctly documented, and validating the template produces audit-ready artifacts.

Acceptance Scenarios:

  1. Given the InfiniBand RDMA configuration, When documentation is generated, Then a formal exception document is produced that explains the encryption gap and justifies compensating controls.
  2. Given compensating controls (physical security, boundary encryption, port monitoring), When verification runs, Then each control is validated and evidence is collected.
  3. Given the documentation template, When hardware supports in-network encryption in the future, Then the template can be updated to reflect the new capability.
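A template-driven exception document might be rendered as shown below. The template fields and layout are purely illustrative and do not represent the format this specification will adopt. The example uses Python's standard `string.Template`.

```python
from string import Template
from typing import Dict

# Hypothetical template; the real document layout is not defined here.
EXCEPTION_TEMPLATE = Template(
    "Exception: $title\n"
    "Gap: $gap\n"
    "Compensating controls:\n"
    "$controls\n"
    "Review trigger: $review_trigger\n"
)


def render_exception(title: str, gap: str, controls: Dict[str, bool],
                     review_trigger: str) -> str:
    """Render the exception text, marking each control by verification status."""
    control_lines = "\n".join(
        f"  - {name}: {'verified' if ok else 'NOT VERIFIED'}"
        for name, ok in controls.items()
    )
    return EXCEPTION_TEMPLATE.substitute(
        title=title,
        gap=gap,
        controls=control_lines,
        review_trigger=review_trigger,
    )
```

Keeping the review trigger as a template field means the same template can be re-rendered when hardware support for in-network encryption changes the exception's status.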

Edge Cases

Requirements (mandatory)

Functional Requirements

Slurm CUI Partition (roles/hpc_slurm_cui/)

Container Security (roles/hpc_container_security/)

Parallel Filesystem Security (roles/hpc_storage_security/)

Interconnect Security (roles/hpc_interconnect/)

Node Lifecycle (roles/hpc_node_lifecycle/)

Researcher Onboarding/Offboarding

Documentation Updates

Key Entities

Success Criteria (mandatory)

Measurable Outcomes

Assumptions