---
session_code: "BRK179"
title: "Azure AI Infra updates to power frontier and enterprise workloads"
date: "2025-11-18"
speakers:
  - "Param Shah"
  - "Matt Vegas"

products:
  - "Microsoft Ignite"

---

# Azure AI Infra updates to power frontier and enterprise workloads
**CycleCloud 8.8 and CCWS 1.2: Enhancements in Cluster Management for AI and HPC**

**Overview:**
CycleCloud 8.8 and CycleCloud Workspace for Slurm (CCWS) 1.2 improve cluster management, monitoring, and user access for AI and high-performance computing (HPC) environments.

**Key Features of CycleCloud 8.8:**

1. **Operational Enhancements:**
   - Support for Ubuntu 24.04 LTS and Enterprise Linux 9 provides flexibility for different workload requirements.
   
2. **Node Health Monitoring:**
   - A node health agent runs non-invasive checks while jobs are active, so issues are detected without interrupting work.
   - Invasive checks run only when a node is idle, testing hardware health so the node is ready for new jobs.

3. **Hardware Support:**
   - Support for ARM64 and HPC/AI hardware (e.g., NVIDIA GB200 and GB300 series) treats each rack as a dedicated resource, optimizing resource allocation and avoiding bottlenecks.

4. **Optimized Scheduling:**
   - Topology-aware scheduling in Slurm ensures AI workloads are distributed efficiently across the cluster, enhancing performance and resource utilization.
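
To make the rack-as-a-resource and topology-aware scheduling points more concrete, here is a minimal sketch that renders a Slurm `topology.conf` in which each rack is one block. It assumes Slurm's `topology/block` plugin; the session summary does not say which plugin or tooling CycleCloud uses, and the function name, rack names, node names, and the 18-nodes-per-rack figure are hypothetical values chosen for illustration.

```python
from typing import Dict, List


def render_block_topology(racks: Dict[str, List[str]], base_block_size: int) -> str:
    """Render topology.conf text with one Slurm block per rack.

    `racks` maps a rack name to the hostnames in that rack. All names are
    illustrative; a real cluster would supply its own inventory.
    """
    lines = []
    for rack_name in sorted(racks):
        nodes = racks[rack_name]
        if not nodes:
            raise ValueError(f"rack {rack_name!r} has no nodes")
        # One BlockName line per rack, listing its nodes comma-separated.
        lines.append(f"BlockName={rack_name} Nodes={','.join(nodes)}")
    # BlockSizes gives the base block size the scheduler plans around.
    lines.append(f"BlockSizes={base_block_size}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical inventory: two racks of 18 nodes each (an assumption,
    # not a figure taken from the session).
    racks = {
        f"rack{r}": [f"gb-r{r}-n{n:02d}" for n in range(1, 19)]
        for r in (1, 2)
    }
    print(render_block_topology(racks, base_block_size=18), end="")
```

In a setup like this, `slurm.conf` would also set `TopologyPlugin=topology/block` so the scheduler prefers to keep a job's nodes inside a single block.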

**Key Features of CCWS 1.2:**

1. **Unified Monitoring:**
   - Integration with Azure Managed Grafana provides a centralized view of system health data from both the NVIDIA DCGM and Slurm exporters, aiding quick issue identification.

2. **High Availability:**
   - Availability zones ensure that compute resources are redundant, minimizing downtime and maintaining application availability.

3. **Authentication and Access Control:**
   - Microsoft Entra ID (formerly Azure Active Directory) provides secure, centralized management of user access.

4. **User Interface and Collaboration:**
   - Open OnDemand offers browser-based access to shells, files, and VS Code, facilitating flexible collaboration and management without dedicated desktops.
   - Linux VDI provides GPU-accelerated remote sessions within the cluster, supporting various applications requiring high-performance environments.
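
As an illustration of the kind of signal the unified monitoring view draws on, the sketch below parses Prometheus-style text such as a DCGM exporter emits and flags GPUs above a temperature threshold. CCWS presents this data through Grafana dashboards rather than a script like this one; the sample payload, the `hot_gpus` helper, and the 85 °C threshold are assumptions made for the example.

```python
import re
from typing import List, Tuple

# Illustrative sample in Prometheus text format. DCGM_FI_DEV_GPU_TEMP is the
# DCGM exporter's GPU temperature gauge; the values and hostname are made up.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-1"} 61
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-1"} 88
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-1"} 97
"""

LINE_RE = re.compile(r"^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def hot_gpus(text: str, threshold_c: float = 85.0) -> List[Tuple[str, str, float]]:
    """Return (hostname, gpu index, temperature) for GPUs above threshold_c."""
    hot = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        # Skip comment lines and every metric other than GPU temperature.
        if not match or match.group("name") != "DCGM_FI_DEV_GPU_TEMP":
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        value = float(match.group("value"))
        if value > threshold_c:
            hot.append((labels.get("Hostname", "?"), labels.get("gpu", "?"), value))
    return hot


if __name__ == "__main__":
    for host, gpu, temp in hot_gpus(SAMPLE):
        print(f"{host} gpu {gpu}: {temp:.0f} C exceeds threshold")
```

The same filter-by-metric-and-label pattern is what a dashboard panel or alert rule would express declaratively against the exporter data.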

**Conclusion:**
CycleCloud 8.8 and CCWS 1.2 aim to provide robust tools for managing clusters, ensuring system health, optimizing resource utilization, and offering flexible user interactions. These updates are tailored to meet the needs of AI and HPC workloads, enhancing efficiency and collaboration in research and development environments.
