Member of Technical Staff - Sandbox Platform
The role involves working on both the underlying sandbox infrastructure that powers training systems and the developer-facing platform for AI workload management. Responsibilities include designing and implementing distributed orchestration infrastructure in Go and Rust, building high-performance networking and coordination components, creating infrastructure automation pipelines with Ansible, managing cloud resources and container orchestration, and implementing scheduling systems for heterogeneous hardware (CPU, GPU, TPU). Additionally, the role requires building intuitive web interfaces for AI workload management and monitoring, developing REST APIs and backend services in Python, creating real-time monitoring and debugging tools, and implementing user-facing features for resource management and job control.
Technical Lead Manager, Physical AI
As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, while supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and the resilient cloud infrastructure required for international government partners. You will own the production outcome by taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will ensure full-stack integrity by overseeing the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment. You will scale the feedback loop by building automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability. You will navigate global compliance by managing the technical lifecycle within diverse regulatory frameworks. You will lead incident command for production issues in mission-critical environments, ensuring rapid resolution and building guardrails to prevent recurrence. You will bridge the gap by translating deep technical performance metrics into clear insights for senior international government officials. Finally, you will drive product evolution by partnering with Engineering and ML teams to ensure lessons from the field directly influence the technical architecture and decisions of future use cases.
Forward Deployed Engineer (Inference & Post-Training)
As an AI Infrastructure Engineer, you are responsible for keeping all user-facing services and production systems running smoothly. Your duties include participating in on-call rotation (PagerDuty) to respond to production incidents, building and running infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users, building monitoring systems to ensure the highest quality service for customers, designing and implementing operational processes such as deployments and upgrades, debugging production issues across all services and levels of the stack, identifying improvements for the product architecture from reliability, performance, and availability perspectives, and planning the growth of Together AI’s infrastructure.
Staff Engineer, Customer Insights
Participate in on-call rotation (PagerDuty) to respond to production incidents. Build and run infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users. Build monitoring systems to ensure the highest quality service for customers. Design and implement operational processes such as deployments and upgrades. Debug production issues across all services and levels of the stack. Identify improvements for the product architecture from reliability, performance, and availability perspectives. Plan the growth of Together AI’s infrastructure.
Full-Chip Physical Design Verification Engineer
Lead and contribute to cross-functional efforts solving complex physical design challenges across IPs, projects, and advanced technology nodes. Develop and enhance RTL-to-GDS methodologies, including floorplanning, synthesis, place and route (P&R), static timing analysis (STA), signoff, and assembly. Architect and deploy AI/ML-driven solutions in production physical design flows to improve engineering efficiency, turnaround time, and quality of results (QoR). Optimize electronic design automation (EDA) tools and custom CAD flows using data-driven and machine learning (ML) techniques, collaborating closely with verification, extraction, timing, design for test (DFT), and EDA vendors.
Member of Technical Staff, Inference (Paris, London)
Build low-latency inference pipelines for on-device deployment to enable real-time next-token and diffusion-based control loops in robotics. Design and optimize distributed inference systems on GPU clusters to increase throughput with large-batch serving and efficient resource utilization. Implement efficient low-level code such as CUDA, Triton, and custom kernels, integrating it seamlessly into high-level frameworks. Optimize workloads for both throughput and latency by managing batching, scheduling, quantization, caching, memory management, and graph compilation. Develop monitoring and debugging tools to ensure reliability, determinism, and rapid diagnosis of regressions across both software stacks.
AI Engineer - Model Performance
The Model Performance Engineer owns the speed, cost, and reliability of the model inference stack, optimizing real systems serving millions of meetings. This includes weighing quantization trade-offs, debugging speculative decoding, managing GPU performance under high concurrency, and improving inference performance through speculative decoding, quantization, serving configuration, GPU selection, batching strategies, cold start mitigation, and adapter swapping. The engineer also builds fine-tuning pipelines so the AI team can fine-tune models efficiently for new tasks on repeatable infrastructure, shortening the path from dataset to deployed model. Additional responsibilities include benchmarking quantization methods, evaluating serving frameworks for inference quality and speed, building fine-tuning pipelines that produce optimized tunes quickly, optimizing GPU spending by selecting appropriate GPU types and managing cost trade-offs, and debugging production inference issues related to serving frameworks and multimodal pipelines.
Senior AI Infrastructure Engineer - Training Platform
As a Production AI Ops Lead at Scale, you will design and develop the production lifecycle of full-stack AI applications while supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies; oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components; build automated systems to monitor model performance and data drift across dispersed environments; manage the technical lifecycle within diverse regulatory frameworks; lead the response for production issues in mission-critical environments, ensuring rapid resolution; translate deep technical performance metrics into clear insights for senior international government officials; and partner with Engineering and ML teams to drive product evolution based on lessons learned in the field.
Mixed-Signal IC Layout Design Engineer
Lead and contribute to cross-functional efforts solving complex physical design challenges across IPs, projects, and advanced technology nodes. Develop and enhance RTL-to-GDS methodologies, including floorplanning, synthesis, P&R, STA, signoff, and assembly. Architect and deploy AI/ML-driven solutions in production flows to improve engineering efficiency, turnaround time, and quality of results (QoR). Optimize EDA tools and custom CAD flows using data-driven and ML-based techniques, collaborating closely with verification, extraction, timing, DFT, and EDA vendors.
AI Factory, Value Engineer
Responsibilities include translating business requirements into requirements for AI/ML models, preparing data to train and evaluate AI/ML/DL models, building AI/ML/DL models using state-of-the-art algorithms, especially transformers, testing and evaluating models, benchmarking quality, publishing models and datasets, deploying models in production by containerizing them, working with customers and internal employees to refine model quality, establishing continuous learning pipelines with online or transfer learning, and building and deploying containerized applications in cloud or on-premise environments.
