Member of Technical Staff - ML Training Systems
The role involves contributing to open-source projects and evolving Modal's infrastructure to train the next generation of language models. It calls for strong engineering skills and experience training production machine learning models.
Staff Software Engineer, ML Infrastructure
Design and build distributed training platforms for LLM and multimodal fine-tuning and post-training at scale. Implement and integrate state-of-the-art training algorithms into production pipelines. Own inference architecture and multi-provider routing, including failover and optimization. Research and implement inference optimizations including quantization, speculative decoding, and batching strategies. Lead initiatives to improve latency and cost efficiency across the training and serving stack. Build evaluation and experimentation infrastructure that enables rapid, reliable iteration. Drive technical direction, mentor engineers, and establish best practices for ML infrastructure.
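The multi-provider routing responsibility above can be sketched minimally. This is an illustrative Python sketch, not any company's actual API: the `Provider` class, its `generate` method, and the provider names are all hypothetical.

```python
class Provider:
    """Hypothetical inference provider; 'healthy' stands in for real health checks."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def generate(self, prompt):
        if not self.healthy:
            raise RuntimeError(f"{self.name} unavailable")
        return f"[{self.name}] response to: {prompt}"


def route_with_failover(providers, prompt):
    """Try providers in priority order; fall through to the next on failure."""
    errors = []
    for provider in providers:
        try:
            return provider.generate(prompt)
        except RuntimeError as exc:
            errors.append(str(exc))
    raise RuntimeError("all providers failed: " + "; ".join(errors))


providers = [Provider("primary", healthy=False), Provider("backup")]
print(route_with_failover(providers, "hello"))  # served by backup
```

Production routers layer retries, latency-aware priority, and circuit breakers on top of this basic fall-through loop.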
Staff Strategic Sourcing Manager (Hardware)
Advance inference efficiency end-to-end by designing and prototyping algorithms, architectures, and scheduling strategies for low-latency, high-throughput inference. Implement and maintain changes in high-performance inference engines including kernel backends, speculative decoding, and quantization. Profile and optimize performance across GPU, networking, and memory layers to improve latency, throughput, and cost.

Design and operate RL and post-training pipelines such as RLHF, RLAIF, GRPO, DPO-style methods, and reward modeling, optimizing algorithms and systems jointly. Make RL and post-training workloads more efficient with inference-aware training loops including async RL rollouts and speculative decoding techniques. Use these pipelines to train, evaluate, and iterate on frontier models on top of the inference stack.

Co-design algorithms and infrastructure to tightly couple objectives, rollout collection, and evaluation with efficient inference, quickly identifying bottlenecks across the training engine, inference engine, data pipeline, and user-facing layers. Run ablations and scale-up experiments to understand trade-offs between model quality, latency, throughput, and cost, feeding insights back into model, RL, and system design. Profile, debug, and optimize inference and post-training services under real production workloads.

Drive roadmap items that require engine modification such as changing kernels, memory layouts, scheduling logic, and APIs. Establish metrics, benchmarks, and experimentation frameworks to validate improvements rigorously. Provide technical leadership including setting technical direction for cross-team efforts at the intersection of inference, RL, and post-training, and mentoring engineers and researchers on full-stack ML systems work and performance engineering.
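Of the DPO-style methods mentioned above, the core loss is compact enough to sketch. This is a minimal scalar version of the standard DPO objective for a single preference pair; the input log-probabilities here are made-up numbers for illustration.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trained policy and a frozen reference model.
    """
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(logits), written stably as log1p(exp(-logits))
    return math.log1p(math.exp(-logits))

# When the policy prefers the chosen response more strongly than the
# reference does, the loss drops below log(2), the uninformative baseline.
loss = dpo_loss(-10.0, -14.0, -12.0, -12.0)
print(round(loss, 4))
```

In a real pipeline these log-probabilities come from batched forward passes, and the loss is averaged over many pairs; the structure is unchanged.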
Principal Engineer, AI Model LifeCycle
The Principal Software Engineer for the Model LifeCycle team is responsible for managing fine-tuning systems for large foundation models including multi-node orchestration, checkpointing, failure recovery, and cost-efficient scaling. They implement and maintain end-to-end training pipelines for Large Language Models and distillation and reinforcement learning pipelines such as preference optimization, policy optimization, and reward modeling. They work on agent execution infrastructure and oversee dataset, model, and experiment management including versioning, lineage, evaluation, and reproducible fine-tuning at scale. The role involves close collaboration with product, business, and platform teams to shape core abstractions and APIs, influence long-term architectural decisions related to training runtimes, scheduling, storage, and model lifecycle management, and contribute to the open-source LLM ecosystem. The position includes significant ownership in designing and building core systems from first principles.
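The checkpointing and failure-recovery responsibility above can be illustrated with a minimal pattern: write checkpoints atomically so a crash never leaves a half-written file, and resume from the last durable step. The JSON state layout, file names, and step counts here are illustrative, not the team's actual format.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write atomically: dump to a temp file, then rename over the target."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic on POSIX: readers never see a partial file

def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
state = load_checkpoint(ckpt)                 # fresh start: step 0
for step in range(state["step"], 5):          # stand-in for the training loop
    state = {"step": step + 1, "loss": 1.0 / (step + 1)}
    save_checkpoint(ckpt, state)

# A restarted process resumes from the last durable step rather than step 0.
resumed = load_checkpoint(ckpt)
print(resumed["step"])  # 5
```

Multi-node jobs add sharded model/optimizer state and a coordination barrier, but the atomic-write-then-rename core is the same.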
Staff Machine Learning Engineer
Define Adaptive's ML strategy including where ML should be applied across products, required infrastructure, and build vs. buy decisions. Design and build production ML systems end-to-end including data pipelines, model training, evaluation frameworks, and inference serving. Establish evaluation methodology to measure model quality, catch regressions, and make data-driven decisions about model changes. Own the strategy for acquiring and formatting necessary data, including labeling, feedback loops, and model improvement over time. Partner with product engineers to integrate ML into the product by writing production code and working within existing codebase. Help build and lead the ML team as scope grows.
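The "catch regressions" responsibility above often reduces to a simple gate in CI: compare a candidate model's metric against the current baseline with a tolerance margin. This sketch is hypothetical; the metric, threshold, and return shape are illustrative choices, not Adaptive's actual framework.

```python
def regression_gate(baseline_acc, candidate_acc, margin=0.01):
    """Reject the candidate if it underperforms the baseline by more than margin."""
    delta = candidate_acc - baseline_acc
    return {"pass": delta >= -margin, "delta": round(delta, 4)}

print(regression_gate(0.872, 0.881))  # improvement: passes
print(regression_gate(0.872, 0.850))  # 2.2-point drop: fails
```

Real evaluation frameworks add statistical significance checks and per-slice breakdowns so a small aggregate gain cannot hide a large regression on one segment.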
Machine Learning Engineer, Distributed Data Systems
Design, build, and maintain data infrastructure systems such as distributed compute, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure while ensuring scalability, reliability, and security. Ensure the data platform can scale by orders of magnitude while remaining reliable and efficient. Partner with researchers to deeply understand requirements and translate them into production-ready systems. Harden, optimize, and maintain critical data infrastructure systems that power multimodal training and evaluation.
Principal Machine Learning Engineer
The role involves building a platform used by Data Scientists and Simulation Engineers to build, train, and deploy Deep Physics Models. The candidate will work on a focused, stream-aligned, cross-functional team that includes back-end, front-end, and design members and is empowered to make its own implementation decisions toward meeting its objectives. Responsibilities include gathering and leveraging domain knowledge from the Data Scientists and Simulation Engineers using the product, and taking ownership of work from implementation to production while ensuring quality, scalability, and observability at every step: testing, containerization, continuous integration and delivery, authentication, authorization, telemetry, and monitoring.
Machine Learning Engineer: ML Infra and Model Optimization
Develop and deploy LLM agent systems within the AI-powered avatar framework. Design and implement scalable, efficient backend systems to support AI applications. Collaborate with AI and NLP experts to integrate LLMs and LLM-based systems and algorithms into the avatar ecosystem. Work with Docker, Kubernetes, and AWS for AI model deployment and scalability. Contribute to code reviews, debugging, and testing to ensure high-quality deliverables. Document work for future reference and improvement.
NPI Engineer
Design, deploy, and maintain Figure's training clusters. Architect and maintain scalable deep learning frameworks for training on massive robot datasets. Work with AI researchers to implement training of new model architectures at large scale. Implement distributed training and parallelization strategies to reduce model development cycles. Implement tooling for data processing, model experimentation, and continuous integration.
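The distributed training strategies mentioned above center on a simple idea: in synchronous data parallelism, each worker computes gradients on its own data shard, the gradients are all-reduced (averaged) across workers, and every replica applies the same update. This toy sketch replaces real workers and NCCL collectives with plain Python loops over a one-parameter least-squares model; the data and learning rate are made up for illustration.

```python
def grad(w, x, y):
    # Gradient of squared error 0.5 * (w*x - y)**2 with respect to w.
    return (w * x - y) * x

def data_parallel_step(w, shards, lr=0.1):
    """One synchronous step: per-shard gradients, all-reduce average, update."""
    per_worker = [sum(grad(w, x, y) for x, y in shard) / len(shard)
                  for shard in shards]          # each "worker" on its shard
    g = sum(per_worker) / len(per_worker)       # all-reduce: average gradients
    return w - lr * g                           # identical update on all replicas

# Two workers, two samples each, drawn from y = 2x; one step moves w toward 2.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = data_parallel_step(0.0, shards)
print(w)  # 1.5
```

Real frameworks overlap the gradient all-reduce with the backward pass and shard optimizer state, but the averaged-gradient step is the invariant being preserved.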
Helix Data Creator
