VP of Engineering
Lead the design and evolution of the AI cloud platform including GPU orchestration, compute scheduling, networking, storage, and distributed systems. Make critical decisions regarding cloud infrastructure, bare-metal deployments, and platform scalability. Participate personally in architecture reviews and key technical initiatives. Build and scale large GPU clusters supporting customer workloads and design systems for GPU provisioning, scheduling, utilization optimization, and capacity management. Drive platform reliability and performance for AI training and inference workloads, partnering closely with engineering teams on infrastructure requirements for next-generation AI systems. Remain deeply involved in engineering decisions and technical direction, contribute directly to infrastructure design and implementation efforts, review architecture proposals, system designs, and major infrastructure changes, and act as the technical escalation point for complex infrastructure challenges. Establish best practices for Kubernetes, observability, CI/CD, security, and operational excellence. Build SRE and Platform Engineering functions from the ground up. Define reliability standards including SLOs, SLIs, incident response processes, and capacity planning. Drive automation across infrastructure operations. Recruit and develop Infrastructure, Platform, and SRE teams. Build a high-performance engineering culture focused on ownership and execution. Partner with executive leadership on company strategy and infrastructure investments. Manage infrastructure budgets, vendor relationships, and capacity planning.
Systems Research Engineer Intern - GPU Programming (Fall 2026)
Participate in the on-call rotation (PagerDuty) to respond to production incidents. Build and run infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a large number of concurrent users. Build monitoring systems to ensure the highest quality service for customers. Design and implement operational processes such as deployments and upgrades. Debug production issues across all services and levels of the stack. Identify improvements for the product architecture from the perspectives of reliability, performance, and availability. Plan the growth of Together AI's infrastructure.
Frontier Agents Intern (Fall 2026)
As an AI Infrastructure Engineer at Together AI, your responsibilities include participating in the on-call rotation (PagerDuty) to respond to production incidents; building and running infrastructure with Ansible, Terraform, and Kubernetes to enable scaling for a massive number of concurrent users; building monitoring systems to ensure the highest quality service for customers; designing and implementing operational processes such as deployments and upgrades; debugging production issues across all services and levels of the stack; identifying improvements for the product architecture from reliability, performance, and availability perspectives; and planning the growth of Together AI's infrastructure.
IT Systems Engineer
As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, support end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and the resilient cloud infrastructure for international government partners. You will own the production outcome, taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will ensure full-stack integrity by overseeing the end-to-end health of the platform and maintaining a responsive and production-ready environment. You will build automated systems to monitor model performance and data drift across geographically dispersed environments to ensure appropriate reliability. You will manage the technical lifecycle within diverse regulatory frameworks, lead the response for production issues in mission-critical environments, ensure rapid resolution, and build guardrails to prevent recurrence. You will translate deep technical performance metrics into clear insights for senior international government officials. Additionally, you will partner with Engineering and ML teams to incorporate lessons learned in the field into the technical architecture and decisions for future use cases.
Staff Engineer, Distributed Storage and HPC & AI Infrastructure
As an AI Infrastructure Engineer, your responsibilities include participating in an on-call rotation to respond to production incidents; building and running infrastructure using Ansible, Terraform, and Kubernetes to enable scaling for many concurrent users; building monitoring systems to ensure high-quality service; designing and implementing operational processes such as deployments and upgrades; debugging production issues across all services and stack levels; identifying improvements for the product architecture concerning reliability, performance, and availability; and planning the growth of Together AI's infrastructure.
Manager, Infrastructure Strategy & Operations
As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You participate in the on-call rotation (PagerDuty) to respond to production incidents. You build and run infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users. You build monitoring systems to ensure the highest quality service for customers. You design and implement operational processes such as deployments and upgrades. You debug production issues across all services and levels of the stack. You identify improvements for the product architecture from the reliability, performance, and availability perspectives. You plan the growth of Together AI's infrastructure.
Lead/Manager Together Cloud Infrastructure Engineer
As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You participate in on-call rotation to respond to production incidents, build and run infrastructure using Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users, build monitoring systems to ensure the highest quality service for customers, design and implement operational processes such as deployments and upgrades, debug production issues across all services and levels of the stack, identify improvements for product architecture from reliability, performance, and availability perspectives, and plan the growth of Together AI's infrastructure.
Staff Platform Engineer, Voice AI
As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly by participating in on-call rotation to respond to production incidents, building and running infrastructure with Ansible, Terraform, and Kubernetes to enable scaling for a massive number of concurrent users, building monitoring systems to ensure the highest quality service, designing and implementing operational processes such as deployments and upgrades, debugging production issues across all services and levels of the stack, identifying improvements for product architecture from reliability, performance, and availability perspectives, and planning the growth of Together AI's infrastructure.
Infrastructure Design Engineer
As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. Your tasks include participating in an on-call rotation to respond to production incidents, building and running infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users, building monitoring systems to ensure the highest quality service, designing and implementing operational processes such as deployments and upgrades, debugging production issues across all services and levels of the stack, identifying improvements for the product architecture from reliability, performance, and availability perspectives, and planning the growth of Together AI's infrastructure.
Business Development Intern
Lead the team responsible for the AI/ML infrastructure that connects machine learning research with large-scale production. Develop and execute the long-term vision and roadmap for the MLOps team to support ML development and deployment needs across business units, balancing short-term tactical deliveries with long-term architectural transformation. Manage and mentor a team of 6-7+ engineers, allocating resources strategically between support for existing services and key initiatives. Collaborate cross-functionally with leaders in machine learning, data science, product engineering, and infrastructure to identify issues, address bottlenecks, and facilitate the deployment of new solutions. Architect compute and storage pipelines for managing large datasets without data fragmentation or latency. Modernize the inference stack to support AI product growth. Work with Site Reliability Engineering to establish comprehensive system metrics. Conduct build vs. buy assessments and audits to benchmark proprietary tools against commercial and open-source alternatives.
