Research Intern, Inference (Fall 2026)
As an AI Infrastructure Engineer at Together, your responsibilities include participating in an on-call rotation to respond to production incidents, building and running infrastructure using Ansible, Terraform, and Kubernetes to support scaling to a large number of concurrent users, building monitoring systems to ensure high-quality service, designing and implementing operational processes such as deployments and upgrades, debugging production issues across all services and stack levels, identifying improvements for product architecture in terms of reliability, performance, and availability, and planning the growth of Together AI's infrastructure.
AI Builder Intern
The Production AI Ops Lead is responsible for designing and developing the production lifecycle of full-stack AI applications, supporting system reliability, real-time inference observability, sovereign data orchestration, secure software integration, and resilient cloud infrastructure for international government partners. They own the production outcome, taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. They oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components from APIs to UI, maintaining a responsive and production-ready environment. The role involves building automated systems to monitor model performance and data drift across geographically dispersed environments to ensure reliability, managing the technical lifecycle within diverse regulatory frameworks, and leading incident response for production issues in mission-critical environments to ensure rapid resolution and prevent recurrence. The lead also translates technical performance metrics into clear insights for senior international government officials and partners with Engineering and ML teams to influence the technical architecture and decisions of future AI use cases.
Harness Engineer
As a Harness Engineer, you will work on the systems that make AI agents effective, focusing on how they find information, assemble context, verify their operation, and improve over time. Responsibilities include developing retrieval systems such as search, ranking, chunking strategies, and hybrid approaches tailored to the problem; context engineering to assemble the right information for agents working with large, heterogeneous document sets; building infrastructure for continuous evaluation of agent accuracy and regression testing of retrieval quality, implementing and maintaining feedback loops; creating and managing agent pipelines that orchestrate between retrieval, models, and downstream actions; and ensuring these systems scale to operate efficiently across thousands of customer document collections, beyond just demo corpora.
Distributed LLM Inference Engineer
As a Distributed LLM Inference Engineer at Anyscale, you will help build systems and optimizations that push the boundaries of performance for inference at large scale. Responsibilities include iterating quickly with product teams to ship end-to-end solutions for Batch and Online inference at high scale for open-source Ray users and Anyscale customers; working across the stack, integrating Ray Data and the LLM engine to provide optimizations that achieve low-cost solutions for large-scale ML inference; integrating with open-source software like vLLM, working closely with the community to adopt these techniques in Anyscale solutions, and contributing improvements to open source; and following the latest state-of-the-art developments in the open-source and research community, implementing and extending best practices.
AI Deployment Strategist, Enterprise
As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies; oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components; build automated systems to monitor model performance and data drift across geographically dispersed environments; manage the technical lifecycle within diverse regulatory frameworks; lead incident response for production issues in mission-critical environments, ensuring rapid resolution and preventing recurrence; translate deep technical performance metrics into clear insights for senior international government officials; and partner with Engineering and ML teams to ensure lessons learned in the field influence future technical architecture and decisions.
AI Infrastructure Supply Chain Lead
The AI Infrastructure Supply Chain Lead is responsible for translating business requirements into requirements for AI/ML models, preparing data to train and evaluate AI/ML/DL models, building AI/ML/DL models using state-of-the-art algorithms such as transformers, testing and evaluating model quality, publishing models, data sets, and evaluations, deploying models in production by containerizing them, working with customers and internal employees to refine model quality, establishing continuous learning pipelines for models with online or transfer learning, and building and deploying containerized applications on cloud or on-premise environments.
Staff Machine Learning Engineer, Voice AI
As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. Your responsibilities include participating in an on-call rotation to respond to production incidents, building and running infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users, building monitoring systems to ensure the highest quality service for customers, designing and implementing operational processes such as deployments and upgrades, debugging production issues across all services and levels of the stack, identifying improvements for product architecture in reliability, performance, and availability, and planning the growth of Together AI's infrastructure.
AI Systems Engineer, Codex Agents
Design and build the core agent harness and execution loop that lets Codex agents interpret model outputs, use tools, execute code, and complete long-horizon tasks safely. Build sandboxing, isolation, orchestration, state, and workflow infrastructure for agents operating in real development environments. Develop evaluation, experimentation, and debugging systems that distinguish harness issues, model behavior, inference/runtime issues, and product failures. Run ablations across prompts, model-facing interfaces, context construction, tool-use strategies, and harness behavior to improve solve rate, reliability, latency, and cost. Improve observability, profiling, and diagnostics across the agent stack, from backend systems to inference, GPUs, and fleet capacity. Work closely with research to make the harness trainable, measurable, and useful for improving frontier agentic models. Build shared primitives that make Codex faster, safer, more reliable, and easier for other teams and open-source users to build on.
Sr. Revenue Accountant
Participate in an on-call rotation (PagerDuty) to respond to production incidents. Build and run infrastructure using Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users. Build monitoring systems to ensure the highest quality service for customers. Design and implement operational processes such as deployments and upgrades. Debug production issues across all services and levels of the stack. Identify improvements for the product architecture from reliability, performance, and availability perspectives. Plan the growth of Together AI's infrastructure.
Infrastructure Accounting Manager
Participate in an on-call rotation (PagerDuty) to respond to production incidents; build and run infrastructure using Ansible, Terraform, and Kubernetes to enable scaling to a large number of concurrent users; build monitoring systems to ensure high-quality service; design and implement operational processes such as deployments and upgrades; debug production issues across all services and stack levels; identify improvements for product architecture focusing on reliability, performance, and availability; and plan the growth of Together AI's infrastructure.
