Member of Technical Staff - Large-Scale Data Infrastructure

Black Forest Labs

Other Engineering, IT
San Francisco, CA, USA
Posted on Dec 5, 2025

What if the ability to continually train better models comes down to one capability: retrieving and processing all of our data?

Our founding team pioneered Latent Diffusion and Stable Diffusion - breakthroughs that made generative AI accessible to millions. Today, our FLUX models power creative tools, design workflows, and products across industries worldwide.

Our FLUX models are best-in-class not only for their capability, but for ease of use in developing production applications. We top public benchmarks and compete at the frontier - and in most instances we're winning.

If you're relentlessly curious and driven by high agency, we want to talk.

With a team of ~50, we move fast and punch above our weight. From our labs in Freiburg - a university town in the Black Forest - and San Francisco, we're building what comes next.

What You'll Pioneer

You'll create the data systems that make frontier research and the largest training runs possible. That means building infrastructure at a scale where billion-image datasets are the norm and video processing pipelines run across thousands of GPUs.

You'll be the person who:

  • Develops and maintains scalable infrastructure to store and retrieve massive image and video datasets - the kind where "large" means billions of assets, not millions (a rough sketch of this kind of shard streaming follows this list)
  • Optimizes data retrieval so that every training run can fully utilize all GPUs
  • Builds tooling to efficiently manage datasets
  • Manages and coordinates data transfers from licensing partners
  • Makes sure we are using our object storage as efficiently as possible
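
For a concrete flavor of the first two bullets above, here is a minimal sketch of shard-based streaming from object storage: one large GET per tar shard instead of millions of per-image requests, with background prefetching so GPU dataloader workers stay fed. The bucket name, shard layout, and boto3 dependency are illustrative assumptions, not a description of our actual stack.

```python
import concurrent.futures as cf
import io
import tarfile

import boto3  # assumed dependency; any S3-compatible client works

BUCKET = "training-data"  # hypothetical bucket name
SHARDS = [f"images/shard-{i:06d}.tar" for i in range(1000)]  # hypothetical layout

s3 = boto3.client("s3")

def fetch_shard(key: str) -> bytes:
    """Download one tar shard; a single large GET amortizes request
    overhead far better than millions of per-image GETs."""
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

def iter_samples(prefetch: int = 8):
    """Yield (name, bytes) samples while the next shards download in the
    background, so the consumer (e.g. a GPU dataloader worker) never stalls."""
    with cf.ThreadPoolExecutor(max_workers=prefetch) as pool:
        futures = [pool.submit(fetch_shard, k) for k in SHARDS[:prefetch]]
        next_idx = prefetch
        while futures:
            blob = futures.pop(0).result()
            if next_idx < len(SHARDS):
                futures.append(pool.submit(fetch_shard, SHARDS[next_idx]))
                next_idx += 1
            with tarfile.open(fileobj=io.BytesIO(blob)) as tar:
                for member in tar.getmembers():
                    if member.isfile():
                        yield member.name, tar.extractfile(member).read()
```

Packing samples into large shards trades random access for sequential throughput, which is usually the right trade for training-time reads.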

Questions We're Wrestling With

  • What formats will give the best dataloading speed while maintaining the needed flexibility to keep building on top of the data?
  • What are the actual bottlenecks and failure cases when retrieving data at scale?
  • How can we identify, prevent, and route around data retrieval failures in individual processes? (one common pattern is sketched below)
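
On that last question, one common pattern is to retry transient failures with exponential backoff and jitter, and only then escalate so the caller can route around the failure. A minimal sketch follows; `fetch`, `key`, and `RetrievalError` are hypothetical names for illustration, not part of any particular client library.

```python
import random
import time

class RetrievalError(Exception):
    """Raised once all retry attempts for a key are exhausted."""

def get_with_retries(fetch, key, attempts=5, base_delay=0.2):
    """Call fetch(key), retrying transient failures with exponential backoff
    plus jitter; after the final attempt, surface the error so the caller
    can route around it (e.g. skip the sample or read from a replica)."""
    for attempt in range(attempts):
        try:
            return fetch(key)
        except Exception as err:  # in practice, catch the client's transient error types
            if attempt == attempts - 1:
                raise RetrievalError(f"giving up on {key}") from err
            # Full jitter keeps thousands of workers from retrying in lockstep.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```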

These questions shape the core of all our research, and they directly determine the efficiency and iteration cycles we can achieve.

Who Thrives Here

You’ve managed large-scale object storage with high retrieval rates in the past. You know the difference between infrastructure that works in theory and infrastructure that works when researchers depend on it.

You likely have:

  • Strong proficiency in Python and experience with various file systems for data-intensive manipulation and analysis
  • Experience building reliable and scalable data loaders
  • Deep knowledge of cloud object storage and the challenges that come with it
  • Hands-on familiarity with object stores such as S3 and Azure Blob Storage, cloud platforms (AWS, GCP, or Azure), and Slurm/HPC environments for distributed data processing
  • Experience creating and managing storage infrastructure at PB scale

What We're Building Toward

We're not just maintaining infrastructure - we're building the computational foundation that determines what research is possible. We are designing systems that will power all future training and data processing. If that sounds more compelling than keeping existing systems running, we should talk.