Senior Software Engineer, Infrastructure
San Francisco Bay Area (Hybrid)
As a Senior Infrastructure Software Engineer, you will focus on automating infrastructure installations and decommissions at scale. You will build tools to constantly improve our scale and speed of deployment. You will nurture a passion for an “automate everything” approach that makes systems failure-resistant and ready-to-scale.
Your work will enable our partners to bring up new data centers for AI and replace servers and networking in existing data centers as quickly and efficiently as possible without impacting running services. You will also review hardware changes, plan deployments, and aggressively execute to expand our network.
The ideal candidate has a passionate curiosity about how the Internet, GPUs, and computers fundamentally work and has a strong knowledge of Linux and AI or GPU hardware. We require strong coding ability in Python, Go, or similar languages.
This is a highly visible position that requires a deep technical understanding of data center infrastructure, physical and logical networking, and Linux, as well as basic project management experience.
Requirements
- 5 years of relevant development experience.
- Intermediate-level software development skills in Python, Go, or similar.
- Linux systems administration experience.
- Strong skills in network services, including REST APIs and HTTP.
- Strong tooling and automation development experience.
- Network fundamentals: DHCP, ARP, subnetting, routing, firewalls, and IPv6.
- Experience with configuration management and infrastructure-as-code systems such as SaltStack, Chef, Puppet, Ansible, or Terraform.
- Experience with continuous / rapid release engineering.
- Experience working in a 24/7/365 service environment.
- Familiarity with day-to-day tasks and projects common in Data Center Operations.
- Excellent understanding of low-level operating system concepts, including multi-threading, memory management, networking and storage, performance, and scale.
- Experience with Kubernetes and containerization, VPNs, AI workloads, and blockchain-based protocols is a plus.
- Deep knowledge of network engineering and protocols used in data center switching and routing, Internet routing, and optical line systems is a plus.
- Knowledge of GPU programming, NCCL, and CUDA is a plus.
- Experience with PyTorch or TensorFlow is a plus.
Responsibilities
- Aggressively seek opportunities to introduce cutting-edge technology and automation solutions that are effective, efficient, and scalable in order to improve our ability to deploy and maintain our global infrastructure.
- Provision, monitor, and maintain hardware, software, and networks in new data centers.
- Perform architecture and research work for decentralized AI workloads.
- Work with vendors to obtain, debug, and maintain the most efficient and effective next-generation hardware and software for Together AI’s workloads.
- Collaborate with Together AI’s partners to make informed decisions about hardware strategy.
- Plan and implement network and server installations, including in the areas of facility power (AC/DC), cooling, security/access, rack layout, and cable management.
- Provide technical leadership and guidance during deployment activities.
- Create and maintain documentation, plans, SOPs, MOPs, etc.
- Communicate your results and updates through blog posts, internal talks, and tickets.
About Together AI
Together AI is a research-driven artificial intelligence company. We contribute leading open-source research, models, and datasets to advance the frontier of AI. Our decentralized cloud services empower developers and researchers at organizations of all sizes to train, fine-tune, and deploy generative AI models. We believe open and transparent AI systems will drive innovation and create the best outcomes for society.
We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work.
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.