Member of Technical Staff - Training Cluster Engineer
Black Forest Labs
IT
San Francisco, CA, USA
Posted on Sep 18, 2025
Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is currently looking for a strong candidate to join us in developing and maintaining our large GPU training clusters.
Role & Responsibilities
- Design, deploy, and maintain large-scale ML training clusters running SLURM for distributed workload orchestration
- Implement comprehensive node health monitoring systems with automated failure detection and recovery workflows
- Partner with cloud and colocation providers to ensure cluster availability and performance
- Establish and enforce security best practices across the ML infrastructure stack (network, storage, compute)
- Build and maintain developer-facing tools and APIs that streamline ML workflows and improve researcher productivity
- Collaborate directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning
Required Experience
- Production experience managing SLURM clusters at scale, including job scheduling policies, resource allocation, and federation
- Hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments
- Proven track record managingGPU clusters, including driver management and DCGM monitoring
Preferred Qualifications
- Understanding of distributed training patterns, checkpointing strategies, and data pipeline optimization
- Experience with Kubernetes for containerized workloads, particularly for inference or mixed compute environments
- Experience with high-performance interconnects (InfiniBand, RoCE) and NCCL optimization for multi-node training
- Track record of managing 1000+ GPU training runs, with deep understanding of failure modes and recovery patterns
- Familiarity with high-performance storage solutions (VAST, blob storage) and their performance characteristics for ML workloads
- Experience running hybrid training/inference infrastructure with appropriate resource isolation
- Strong scripting skills (Python, Bash) and infrastructure-as-code experience