AI ML GenAI on Data Center-Class GPUs with Red Hat OpenShift

Name: AI ML GenAI on Data Center-Class GPUs with Red Hat OpenShift
Rating: 4.3 (11 reviews)

OpenShift & OpenShift AI on High-Performance GPUs: From Bare-Metal to Production in One Day

Created by Luca Berton

Last updated 8/2025

English

What you'll learn

Stand up a bare-metal data center-grade GPU node, validate firmware & BIOS, and register it in a fresh OpenShift cluster
Install and tune the GPU Operator with Multi-Instance GPU (MIG) profiles for maximum utilisation
Deploy Red Hat OpenShift AI (RHOAI) and run a real Mistral LLM workload with Ollama
Monitor, troubleshoot, upgrade, and scale the platform in production

Course content

5 sections • 11 lectures • 1h 28m total length

Course overview • 6:15
Ready to go from bare-metal hardware to a GPU-powered AI platform in one busy afternoon? In this fast-paced walkthrough you’ll watch us transform a single NVIDIA H100 server—and a small virtualisation host—into a fully fledged Red Hat OpenShift 4.18 cluster running OpenShift AI. We start by inspecting firmware in iDRAC, flip the must-have BIOS toggles, and generate a custom Agent ISO that bootstraps a three-node control plane with zero external provisioning network.
Once the masters are healthy we attach a bare-metal H100 worker, install the NVIDIA GPU Operator, and slice the card into MIG partitions for multi-tenant workloads. You’ll see Ollama spin up with Mistral-7B, curl an inference endpoint, and then watch live GPU metrics flow into Grafana dashboards. Finally, we roll through an OpenShift upgrade—proving the stack can survive real-world maintenance without downtime.
Whether you’re a machine-learning engineer shipping models, a DevOps pro automating infra, or a curious Python developer taking your first steps into AI operations, this video will show you every YAML, command, and troubleshooting trick you need to replicate the build in your own lab. Grab a coffee, open your terminal, and let’s launch!
Infra architecture • 3:51
Ever wondered how all the moving parts of an on-prem AI platform fit together? In this concise visual tour we break down the four-component lab architecture that powers our entire course — from first oc command to blazing-fast GPU inference.
Jump Host – See how a lean RHEL VM becomes the control tower for every OpenShift and iDRAC operation.
NVIDIA H100 Server – Peek inside a Hopper GPU with 80 GB of HBM3 and learn why MIG slicing is the secret to multi-tenant performance.
OpenShift Control Plane – Watch three high-availability masters (one acting as the rendezvous host) bootstrap themselves from a single Agent ISO.
Network Fabric – Understand the API VIP, Ingress VIP, and Layer-4 load balancer that keep north-south and east-west traffic flowing, even when pods fail.
We’ll trace traffic flows, highlight fail-over paths, and show where the local mirror registry plugs into an air-gapped deployment. By the end, you’ll see exactly how these pieces click together to form a scalable, production-ready foundation for GPU-accelerated workloads. If you’re prepping an AI cluster—or just curious how enterprise Kubernetes is wired—this video gives you the blueprint in under ten minutes.
Press play, grab your favourite whiteboard marker, and map your own path to AI-ready infrastructure!
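To make the MIG idea concrete, here is a minimal Python sketch of the slicing arithmetic. It assumes a subset of the MIG profiles NVIDIA documents for an 80 GB H100; the table and the helper name pick_profile are illustrative, so check the numbers against nvidia-smi mig -lgip on your own card before relying on them.

```python
from typing import Optional

# Assumed subset of MIG profiles for an 80 GB H100:
# profile name -> (memory per instance in GB, max instances per GPU).
H100_80GB_PROFILES = {
    "1g.10gb": (10, 7),
    "2g.20gb": (20, 3),
    "3g.40gb": (40, 2),
    "7g.80gb": (80, 1),
}


def pick_profile(tenants: int, min_memory_gb: int) -> Optional[str]:
    """Return the smallest profile that gives every tenant its own slice
    with at least min_memory_gb of GPU memory, or None if nothing fits."""
    candidates = [
        (memory, name)
        for name, (memory, count) in H100_80GB_PROFILES.items()
        if memory >= min_memory_gb and count >= tenants
    ]
    return min(candidates)[1] if candidates else None


print(pick_profile(tenants=3, min_memory_gb=16))  # 2g.20gb
print(pick_profile(tenants=4, min_memory_gb=16))  # None: one card cannot serve this
```

With the GPU Operator, a layout like this is usually applied by labelling the node (for example nvidia.com/mig.config=all-2g.20gb) rather than by calling nvidia-smi by hand.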

Lab setup • 3:55
In this lesson you’ll watch our entire lab come to life—from powering-on the jump host to mapping the network VLANs that tie every component together. We start at the hypervisor console, allocate CPU, RAM, and storage for each VM, then flip over to the bare-metal H100 node for a quick firmware sanity-check. Next, you’ll see how we:
Attach the virtual ISO repository and upload the Agent image.
Configure a dedicated VM network on vSwitch1 and test end-to-end connectivity with ping and dig.
Set static IPs for the API VIP, Ingress VIP, masters, and worker nodes.
Install and start dnsmasq so our cluster names resolve instantly.
Snapshot the environment—giving us a clean rollback point before the actual OpenShift install.
By the end you will be able to
Recreate the entire VM topology on any ESXi or KVM host.
Verify base networking without touching a single OpenShift command.
Capture a “golden snapshot” so you can experiment fearlessly in later lessons.
Prerequisites
A fresh hypervisor, the jump-host ISO, and about 30 GB of free SSD space.
Install VMware ESXi • 10:17
In this lesson we’ll take a raw bare-metal host and turn it into a battle-ready VMware ESXi 8.0 U2 hypervisor—your launchpad for the OpenShift control-plane VMs. You’ll watch every step on screen, from creating bootable media to the first login in the vSphere HTML5 client. Along the way we will:
Validate firmware compatibility with the VMware Hardware Compatibility Guide.
Burn the ESXi ISO to USB (or mount it via iDRAC virtual media) and walk through the purple-screen installer.
Tune critical BIOS settings: UEFI, VT-x/AMD-V, SR-IOV, and PCIe Gen 5 lanes.
Configure the Direct Console UI (DCUI) for static IP, DNS, and NTP.
Activate SSH, upload license keys, and patch to the latest build via Lifecycle Manager.
Create datastore1 on NVMe, build vSwitch1, and carve a dedicated VM Network port-group for OpenShift traffic.
By the end you will be able to
Install ESXi 8.x on virtually any supported server in under 10 minutes.
Apply best-practice security (strong root password, NTP, key-based SSH).
Provision storage and networking exactly as the Agent-based Installer expects.
Capture a clean “hypervisor-baseline” snapshot so you can redeploy VMs without reinstalling ESXi.
Prerequisites
• A blank server with at least one SSD/NVMe drive
• ESXi 8.x ISO + license or evaluation key
• Access to server iDRAC/iLO or a bootable USB port
Install RHEL OS on Jump host • 5:14
In this lesson we turn a blank VM into the jump-host—your one-stop command center for every OpenShift, DNS, and automation task in the course. You’ll see the full, click-by-click walk-through:
Mount the RHEL 9.4 ISO and choose the Minimal Install profile (no GUI bloat).
Partition the disk: 30 GB for /, the rest for /var, XFS for resilience.
Set static networking to 192.168.1.10/24 and hostname jump-01.ocp.lab.local.
Register with Red Hat Subscription Manager, enable BaseOS/AppStream, and run dnf update -y.
Install core packages: podman, git, jq, dnsmasq, httpd-tools, chrony.
Generate an ED25519 SSH key pair for later use in install-config.yaml.
Harden the box: enforce SELinux, open only SSH in firewalld, enable chronyd.
Create the ~/ocp-install workspace and drop in the Red Hat pull secret.
Add bash completion for oc, customise the prompt, and snapshot the VM.
By the end you will be able to
Stand up a lean, secure RHEL jump-host in under ten minutes.
Equip it with every tool—openshift-install, oc, yq, helm—needed for the rest of the lab.
Validate connectivity and DNS resolution before generating the Agent ISO.
Prerequisites
ESXi/KVM VM with 4 vCPU, 8 GB RAM, 50 GB disk
RHEL 9 ISO (or Rocky/CentOS Stream 9 if subscriptions aren’t available)
Internet access for initial dnf update (or local repo if air-gapped)
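Before moving on, it can help to confirm the jump-host really has the tooling this lesson installs. The following is a small Python sketch, not part of the course repository; the tool list is an assumption drawn from the packages and binaries named above, so trim or extend it to match your own build.

```python
import shutil

# Tools named in this lesson; adjust the list to match your own jump-host.
REQUIRED_TOOLS = ["openshift-install", "oc", "podman", "git", "jq", "dig", "ssh-keygen"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that are not found on PATH, preserving input order."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("All tools found" if not missing else "Missing: " + ", ".join(missing))
```

An empty result means every binary resolved on PATH; it does not prove the versions match the cluster release you plan to install.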

Control Plane and OpenShift Agent • 16:39
In this lesson we move from planning to execution: you’ll watch three master VMs boot from the custom agent.x86_64.iso, self-register with the Assisted Service, and converge into a healthy, highly available OpenShift 4.18 control plane. The video captures every screen—BIOS splash to oc get nodes—so you can follow at your own pace.
Key milestones you’ll see on-screen
Mount & power-on master0, master1, master2 in ESXi with the ISO attached.
Live log tail of openshift-install agent wait-for bootstrap-complete, decoding each status: discovering → known → installing → done.
master0 established as the rendezvous host, running the embedded Assisted Service.
Automatic ignition of control-plane services (etcd, kube-apiserver, scheduler), verified via oc get pods in the openshift-etcd, openshift-kube-apiserver, and openshift-kube-scheduler namespaces.
First successful login with the kubeadmin credentials and a tour of the web console.
Final health sweep: ClusterVersion Available, all Operators green, and the control-plane marked Ready.
Snapshotting the masters for a clean rollback point.
By the end you will be able to
Generate and boot an Agent ISO that discovers hosts with zero external provisioning network.
Interpret bootstrap logs to diagnose slow or failed installations.
Validate a brand-new control plane using both CLI (oc) and the web console.
Capture a “post-control-plane” snapshot—your safe reset point before adding GPU and CPU workers.
Prerequisites
Lesson 5 jump-host with openshift-install and oc configured.
Master VMs created from the template (4 CPU / 16 GB / 120 GB each).
DNS, VIPs, and load balancer already in place (covered in Lesson 8).
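The final health sweep can also be scripted. This is a minimal Python sketch, assuming you feed it the parsed output of oc get nodes -o json; the helper name not_ready_nodes and the sample data are hypothetical and hand-written, not captured from the course cluster.

```python
def not_ready_nodes(node_list: dict) -> list:
    """Return the names of nodes whose Ready condition is not "True"."""
    bad = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            bad.append(node["metadata"]["name"])
    return bad


# Hand-written sample in the shape of `oc get nodes -o json` (not real output).
sample = {
    "items": [
        {"metadata": {"name": "master0"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "master1"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "master2"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
}

print(not_ready_nodes(sample))  # ['master2']
```

An empty list is the result you want to see before taking the post-control-plane snapshot.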
Inspect the GPU node via iDRAC • 6:17
Before a single Kubernetes pod can claim GPU time, we need to guarantee the hardware is healthy and perfectly tuned. In this deep-dive you’ll ride along as we log into Dell iDRAC, audit every sensor, and apply the five BIOS tweaks that keep a Hopper-class GPU happy under OpenShift. You’ll see:
System → Overview – verifying service tag, firmware dates, and BIOS 2.13.0+ for PCIe Gen 5 throughput.
Maintenance → Firmware Inventory – confirming H100 firmware ≥ HBM3 95 and queueing updates in a single reboot job.
Configuration → BIOS Settings – enabling UEFI, SR-IOV, and setting System Profile = Performance with Energy-Efficient Turbo.
Dashboard → Sensors – live look at GPU temperature (< 35 °C idle), fan RPM, inlet temp, and input power.
SSH RACADM session – grabbing the GPU UUID, dumping hwinventory, and exporting a SupportAssist log bundle for your records.
Optional toggle: Persistent GPU UUID for future VMware migrations.
By the end you will be able to
Verify and upgrade H100 firmware entirely through iDRAC—no OS boot required.
Apply best-practice BIOS and thermal settings that prevent driver crashes and throttling.
Collect sensor data and log bundles for proactive support or root-cause analysis.
Capture a “pre-cluster” hardware baseline snapshot you can compare against months later.
Prerequisites
Physical access or iDRAC credentials for the NVIDIA H100 server.
Firmware packages downloaded from Dell Support (if upgrades are needed).
SSH client on the jump-host to run racadm.
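A firmware floor such as "BIOS 2.13.0+" is easy to get wrong if versions are compared as plain strings. Here is a short Python sketch of a numeric comparison; it assumes purely numeric dotted versions, and the helper names are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as "2.13.0" into (2, 13, 0)."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed: str, minimum: str = "2.13.0") -> bool:
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("2.13.0"))  # True
print(meets_minimum("2.9.4"))   # False, although "2.9.4" > "2.13.0" as plain strings
```

The same check could be run against whatever version string you read from the iDRAC firmware inventory, provided it follows the dotted numeric form.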
Network Preparation • 5:22
In this lesson, we wire up the arteries that let every OpenShift component talk—cleanly, securely, and with zero surprises at bootstrap. You’ll watch us:
Design the CIDR map – carving a /24 for cluster traffic and reserving IPs for the API VIP (6443), Ingress VIP (443/80), masters, and workers.
Stand-up a Layer-4 HAProxy on the jump-host, forwarding raw TCP for the API and Machine Config Server plus HTTPS/HTTP for Ingress.
Spin up dnsmasq with authoritative A & PTR records—including the wildcard *.apps.<cluster>.<domain>—and validate resolution with dig.
Add a static route on the GPU node to ensure east-west traffic stays on the 10 GbE LAN.
Test port reachability with nc -zv <vip> 6443, confirming the load balancer is live before masters ever boot.
Capture a tcpdump trace to show the first SYN packets hitting all three control-plane nodes—proof that fail-over is instantaneous.
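The address plan and DNS record set above can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the cluster name ocp, the base domain lab.local, the 192.168.1.0/24 range, and each address in PLAN. The emitted lines use dnsmasq's host-record and address options, with address=/apps.<zone>/ standing in for the wildcard.

```python
import ipaddress

CLUSTER = "ocp"             # assumed cluster name
BASE_DOMAIN = "lab.local"   # assumed base domain
NETWORK = ipaddress.ip_network("192.168.1.0/24")

# Hypothetical address plan; replace with your own.
PLAN = {
    "api": "192.168.1.5",
    "ingress": "192.168.1.6",
    "master0": "192.168.1.21",
    "master1": "192.168.1.22",
    "master2": "192.168.1.23",
}


def validate_plan(plan, network):
    """Raise ValueError on duplicate addresses or addresses outside the network."""
    addresses = [ipaddress.ip_address(ip) for ip in plan.values()]
    if len(set(addresses)) != len(addresses):
        raise ValueError("duplicate IP address in plan")
    for name, ip in plan.items():
        if ipaddress.ip_address(ip) not in network:
            raise ValueError(f"{name} ({ip}) is outside {network}")


def dnsmasq_lines(plan, cluster, domain):
    """Render dnsmasq config lines for the API, wildcard apps, and node records."""
    zone = f"{cluster}.{domain}"
    lines = [
        # api and api-int share the API VIP; host-record adds A and PTR records.
        f"host-record=api.{zone},api-int.{zone},{plan['api']}",
        # address=/domain/ip answers for the domain and every name beneath it.
        f"address=/apps.{zone}/{plan['ingress']}",
    ]
    for name, ip in plan.items():
        if name not in ("api", "ingress"):
            lines.append(f"host-record={name}.{zone},{ip}")
    return lines


validate_plan(PLAN, NETWORK)
print("\n".join(dnsmasq_lines(PLAN, CLUSTER, BASE_DOMAIN)))
```

Validating the plan before rendering catches the most common lab mistake, two roles sharing one address, before dnsmasq ever starts.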
By the end you will be able to
Allocate VIPs and node IPs that satisfy the Agent-based Installer’s strict requirements.
Configure dnsmasq (or any DNS engine) with the exact record set OpenShift needs.
Deploy a lightweight HAProxy config that survives reboots and requires no Layer-7 smarts.
Validate the whole stack—DNS, VIP, and ports—in under two minutes with command-line tools.
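The same DNS and port validation can be done without nc or dig. This is a rough Python equivalent using only the standard library; the hostnames in the commented usage lines are assumptions based on an ocp.lab.local zone, and port 22623 is the Machine Config Server port.

```python
import socket


def resolves(name: str) -> bool:
    """Return True if the name resolves to at least one address."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage against an assumed zone (uncomment on your jump-host):
# for host, port in [("api.ocp.lab.local", 6443), ("api.ocp.lab.local", 22623),
#                    ("test.apps.ocp.lab.local", 443), ("test.apps.ocp.lab.local", 80)]:
#     print(host, port, resolves(host), port_open(host, port))
```

A successful connect only shows that the load balancer frontend is listening; until the masters boot, the backends behind it will still be down.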
Prerequisites
Jump-host from Lesson 5 with root access.
IP plan (spreadsheet or whiteboard) for your lab network.
Basic understanding of firewalls; outbound 443 allowed for RHEL updates.
OCP installation from Agent ISO • 15:33
This lesson is the capstone of the installation arc: you’ll watch the custom agent.x86_64.iso spin up our entire cluster, from bootstrap ignition to a fully operational control plane plus one CPU-only worker. Every stage of the Agent-based workflow is captured live on screen, giving you a clear blueprint you can replay in your own lab.
Milestones you’ll see in real time
Boot & Discovery – Each VM mounts the ISO, sends its inventory, and flips from discovering → known in the Assisted UI.
Validation Phase – The installer checks CPU, RAM, disks, and network; see how to fix “pending-input” hosts fast.
Bootstrap Event – master0 (the rendezvous host) launches the temporary control plane; logs scroll inside openshift-install agent wait-for.
Etcd Quorum – Masters pivot to “installing-in-progress,” write RHCOS to disk, and reboot into real control-plane services.
Agent ISO Auto-joins the Worker – Watch the CPU worker enrol itself and register with the kube-API, with no manual CSR needed.
First Console Login – Use the freshly generated kubeadmin password, check ClusterVersion, and tour the Operator status dashboard.
Post-Install Snapshots – Learn exactly when, and why, to snapshot your VMs for an easy rollback point.
By the end you will be able to
Run the Agent ISO from power-on to “Install complete!” with zero manual intervention.
Decode installer log messages, spot bottlenecks, and restart failed hosts safely.
Validate cluster health via both oc CLI and the web console before moving to GPU integration.
Prerequisites
Network, DNS, and HAProxy configured (Lesson 8).
install-config.yaml and agent-config.yaml in ~/ocp-install.
Jump-host with openshift-install and oc binaries.
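The installer invocations behind this lesson can be wrapped in a few helpers. This is a sketch only, assuming openshift-install is on PATH and that the workspace holds install-config.yaml and agent-config.yaml; the helper names are hypothetical. The VMs still have to be booted from the ISO between the create step and the wait steps.

```python
import subprocess

STAGES = ("bootstrap-complete", "install-complete")


def create_image_cmd(workdir: str) -> list:
    """Command that builds the Agent ISO from the configs in workdir."""
    return ["openshift-install", "agent", "create", "image", "--dir", workdir]


def wait_for_cmd(stage: str, workdir: str, log_level: str = "info") -> list:
    """Command that blocks until the given installation stage is reached."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    return ["openshift-install", "agent", "wait-for", stage,
            "--dir", workdir, "--log-level", log_level]


def run(cmd: list) -> None:
    """Run one installer command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)


# Typical order, using the course workspace path as an example:
#   run(create_image_cmd("/root/ocp-install"))    # then boot every VM from the ISO
#   run(wait_for_cmd("bootstrap-complete", "/root/ocp-install"))
#   run(wait_for_cmd("install-complete", "/root/ocp-install"))
```

The installer generally consumes the config files when it builds the image, so keeping a copy of both YAML files outside the workspace is a sensible precaution. On success, the kubeconfig and kubeadmin password land under the auth directory of the workspace.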

Requirements

One server with a high-performance, data center-class GPU—physical or virtualised
A workstation that can SSH into the node and run the "oc" CLI
(Optional) A Red Hat account to pull mirrored images

Description

Unlock the power of enterprise-grade AI in your own data center—step-by-step, from bare-metal to production-ready inference. In this hands-on workshop, you’ll learn how to transform a single high-performance GPU server and a lightweight virtualization host into a fully featured Red Hat OpenShift cluster running OpenShift AI, the GPU Operator, and real LLM workloads (Mistral-7B with Ollama). We skip the theory slides and dive straight into keyboards and terminals—every YAML, every BIOS toggle, every troubleshooting trick captured on video.

What you’ll build

A three-node virtual control plane + one bare-metal GPU worker, deployed via the new Agent-based Installer
GPU Operator with MIG slicing, UUID persistence, and live metrics in Grafana
OpenShift AI (RHOAI, formerly RHODS) with Jupyter and model-serving pipelines
A production-grade load balancer, DNS zone, and HTTPS ingress—no managed cloud needed

Hands-on every step: you’ll inspect firmware through iDRAC, patch BIOS settings, generate a custom Agent ISO, boot the cluster, join the GPU node, and push an LLM endpoint you can curl in under a minute. Along the way, we’ll upgrade OpenShift, monitor GPU temps, and rescue a “Node Not Ready” scenario—because real life happens.
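For readers who prefer Python to curl, the inference call might look like the sketch below. It assumes Ollama's /api/generate endpoint with the mistral model tag and non-streaming output; the base URL is a placeholder for whatever route or service address your deployment exposes, and the function names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str,
                           model: str = "mistral") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str, model: str = "mistral",
             timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_generate_request(base_url, prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Example (placeholder URL; substitute your own route):
#   print(generate("http://ollama.example.internal:11434", "Say hello in one line."))
```

Setting stream to False returns one JSON document instead of a stream of chunks, which keeps the client simple at the cost of waiting for the full completion.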

Who should enroll

DevOps engineers, SREs, and ML practitioners who have access to a data center-grade GPU server and want a repeatable, enterprise-compatible install path. Basic Linux and kubectl skills are assumed; everything else is taught live.

By course end, you’ll have a battle-tested Git repository full of manifests, a private Agent ISO pipeline you can clone for new edge sites, and the confidence to stand up—or scale out—your own GPU-accelerated OpenShift AI platform. Join us and ship your first on-prem LLM workload today.

Who this course is for:

Machine Learning Engineers
DevOps Engineers
Site Reliability Engineers (SREs)
Python Developers Exploring Infrastructure
First Steppers into AI Operations

Course content

Welcome & Lab Setup • 2 lectures • 10min

Lab Setup • 3 lectures • 19min

Bare-Metal installation • 4 lectures • 44min

GPU Enablement in the Cluster • 1 lecture • 7min

Serving Inference • 1 lecture • 8min
