vMetal – run a GPU cloud on bare metal without OpenStack

by teb510·Mar 19, 2026·12 points·1 comment

AI Analysis

●● Solid · Solve My Problem · Niche Gem

Saves neoclouds months of engineering by turning bare metal racks into managed Kubernetes clusters.

Strengths
  • Focus on GPU workloads avoids the general-purpose bloat of OpenStack or VMware.
  • Integrates directly with AI schedulers like Run:ai, Ray, and Slurm out of the box.
  • Automates networking and DNS alongside PXE booting for full lifecycle management.
Weaknesses
  • Request Demo gate prevents hands-on testing, limiting community adoption and feedback.
  • Competes with free, mature open-source alternatives such as MAAS and Tinkerbell.
Target Audience

Neocloud providers, enterprises running large GPU clusters, platform engineers

Similar To

MAAS · Tinkerbell · OpenStack

Post Description

Hi everyone, we're looking for feedback on a new infrastructure project we launched called vMetal. It's a bare metal management platform for GPU clusters that handles machine discovery, PXE booting, and lifecycle management, without the OpenStack complexity. It's built around Kubernetes-native workflows so you can hand it off to teams or drop it into an existing platform. A lot of the infra platforms used for this today were designed 20 years ago (VMware, OpenStack, NVIDIA BCM, MAAS, etc.), while newer tools usually solve only a small piece of the stack. Neither group was built with modern GPU cluster ops in mind. In practice most setups end up stitching things together or building custom provisioning pipelines.

With vMetal we took a different approach: treat physical machines like programmable infrastructure resources. Compared to tools like MAAS or Tinkerbell, vMetal is designed around a few ideas:

- Bare metal lifecycle automation: Automatically discover machines on the network, boot them, install OS images, and reprovision nodes as hardware moves between clusters or workloads. Built on Metal3 and Ironic.
- Built for GPU cluster ops: Supports environments where nodes frequently move between clusters, capacity pools, or tenant workloads.
- Direct Kubernetes integration: Provisioned machines can be attached directly to Kubernetes clusters as nodes or assigned to infrastructure pools.
- Works with Kubernetes multi-tenancy layers: Integrates with vCluster (virtual clusters) and vNode (node-level isolation) so machines can move from bare metal provisioning into multi-tenant Kubernetes environments.

We’ve shared a few other infrastructure projects here before (DevPod, vCluster), and the feedback from HN has been incredibly helpful. Curious how others here are handling bare metal provisioning today: MAAS, Ironic, Metal3, Tinkerbell, something custom?

Open to any feedback, positive or negative.
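
For readers who want a concrete picture of the lifecycle described above (discover, boot, install an OS image, attach to a pool, reprovision when hardware moves), here is a minimal Python sketch. It is not vMetal's API and does not reflect how Metal3 or Ironic name their states; the `State` values, `Machine` class, and `assign`/`reprovision` helpers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class State(Enum):
    # Hypothetical states, loosely following the steps named in the post.
    DISCOVERED = "discovered"    # seen on the network
    BOOTED = "booted"            # network-booted, ready for an OS image
    PROVISIONED = "provisioned"  # OS image installed
    ASSIGNED = "assigned"        # attached to a cluster or capacity pool


# Allowed transitions (an assumption for this sketch, not a real spec).
TRANSITIONS = {
    State.DISCOVERED: {State.BOOTED},
    State.BOOTED: {State.PROVISIONED},
    State.PROVISIONED: {State.ASSIGNED, State.BOOTED},
    State.ASSIGNED: {State.BOOTED},  # moving a node means booting it again
}


@dataclass
class Machine:
    mac: str
    state: State = State.DISCOVERED
    pool: Optional[str] = None
    history: List[State] = field(default_factory=list)

    def advance(self, target: State) -> None:
        """Move to `target`, rejecting transitions the table does not allow."""
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.value} -> {target.value} not allowed")
        self.history.append(self.state)
        self.state = target


def assign(machine: Machine, pool: str) -> None:
    """Attach a provisioned machine to a named pool."""
    machine.advance(State.ASSIGNED)
    machine.pool = pool


def reprovision(machine: Machine, new_pool: str) -> None:
    """Detach the machine, reinstall it, and attach it to a different pool."""
    machine.pool = None
    machine.advance(State.BOOTED)
    machine.advance(State.PROVISIONED)
    assign(machine, new_pool)


if __name__ == "__main__":
    node = Machine(mac="aa:bb:cc:dd:ee:ff")
    node.advance(State.BOOTED)
    node.advance(State.PROVISIONED)
    assign(node, "training")
    reprovision(node, "inference")
    print(node.state.value, node.pool)
```

The explicit transition table is the point of the sketch: a node cannot jump from discovered straight to assigned, and moving it between pools always passes back through boot and install.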

Similar Projects

Infrastructure · ●● Solid

A100s may be $3.20/HR on AWS, vs. $2.40/HR on Vast.ai

Wraps a lot of nasty multi-cloud choreography into a single CLI: parallel provisioning across providers, staging/compressing datasets, and plumbing nodes from different clouds into one Kubernetes cluster with generated Helm templates and Karpenter hooks. The Hugging Face Spaces one-command deploy and built-in telemetry/ML integrations are smart touches, but the page leans heavily on listing integrations. I want concrete guarantees around networking/egress, cost arbitration logic, and auth/billing boundaries before trusting it for production budgets.

Solve My Problem · Niche Gem
Facingsouth
103mo ago