ML Infrastructure Engineer Jobs

Discover the latest remote and onsite ML Infrastructure Engineer roles at top AI companies. Updated hourly.

Check out 23 new ML Infrastructure Engineer opportunities posted on AI Chopping Block

ML/AI Engineer - Vehicle Intelligence

New
Top rated
42dot
Full-time

Develop AI-powered vehicle intelligence features that understand user intent, trip goals, vehicle state, and system constraints. Apply reinforcement learning, planning, optimization, and data-driven modeling to improve vehicle-level decisions across energy, comfort, charging, routing, and proactive vehicle preparation. Build models using vehicle telemetry, navigation data, user behavior, weather, traffic, cabin conditions, charging patterns, and fleet data. Create personalization models that learn user routines, comfort preferences, driving patterns, charging habits, and trip priorities while preserving privacy and user control. Use simulation, digital twins, and scenario-based testing to train, evaluate, and validate AI behavior before production deployment. Collaborate with autonomous driving and VLA teams to define interfaces for sharing user intent, route objectives, vehicle constraints, energy targets, comfort preferences, and system-level recommendations. Integrate ML models into production vehicle and cloud platforms, considering latency, compute efficiency, reliability, safety, explainability, and over-the-air update readiness. Work cross-functionally with Product, UX, Systems Engineering and Controls.

$220,780 – $311,220 per year (USD)

Sunnyvale or San Francisco, United States
Maybe global
Onsite

Technical Lead Manager - Training Runtime, Data(set) Movement

New
Top rated
OpenAI
Full-time

The Technical Lead Manager will own datasets throughout the training infrastructure and set the direction for how training jobs read data, including APIs, storage contracts, versioning model, benchmarks, debugging tools, and reliability guarantees to make data access consistent across current and future training frameworks. Responsibilities include designing and building a unified dataset read platform for multiple training frameworks; defining dataset APIs, storage-format expectations, registration/versioning, and migration paths to ensure reproducible and maintainable data access; building reliability into the read path such as stateful iteration, caching, fast restart, recovery, and clear operational contracts; developing terminal and web-based visualizers to inspect data late in the pipeline; writing and reviewing production code in core data loading, service, caching, and reliability paths; and partnering with teams working on training frameworks, reinforcement learning, multimodal models, storage, runtime, and cluster infrastructure. Over time, the role will expand to owning broader data movement systems including checkpoint loads/saves and snapshot transfers, working closely with technical leads and infrastructure teams.

$295,000 – $445,000 per year (USD)

San Francisco, United States
Maybe global
Remote

Senior Product Engineer, Growth & Lifecycle Infrastructure - Music & Audio

New
Top rated
Stability AI
Full-time

Lead efforts to drive the design and development of customer-facing multi-modal machine learning inference systems. Work with the Platform and Inference teams on building inference systems for the next generation of models, focusing on optimization, model tuning, and deployment. Partner with leading cloud providers to deliver hosted Stability AI inference solutions. Serve as a strategic thought partner for leaders across the organization on driving business impact through machine learning. Contribute to bringing new Stability models and pipelines into existence. Prototype and productionize inference platform improvements and new features.

Undisclosed

Los Angeles, United States
Maybe global
Hybrid

Staff ML Systems Engineer, Distributed Systems

New
Top rated
FieldAI
Full-time

Design and build scalable distributed machine learning pipelines across data processing, model training, evaluation, and post-processing workflows. Architect distributed execution systems, including parallelization strategies, workload scheduling, resource allocation, and fault tolerance mechanisms. Develop reusable abstractions, frameworks, and libraries that simplify distributed pipeline development. Optimize performance across distributed CPU and GPU environments, improving throughput, utilization, and reliability. Design systems that effectively manage data partitioning, memory utilization, serialization overhead, and compute efficiency. Partner closely with ML engineers, data engineers, and infrastructure teams to productionize research workflows and enable large-scale model development. Establish best practices and engineering standards for distributed machine learning infrastructure. Evaluate and guide decisions around distributed computing frameworks, infrastructure technologies, and system design trade-offs. Improve observability, debugging, monitoring, and operational tooling for distributed systems at scale.

$170,000 – $200,000 per year (USD)

Seattle or Irvine, United States
Maybe global
Onsite

Research Engineer – Evals

New
Top rated
Firecrawl
Full-time

Build from scratch the evaluation systems that measure whether Firecrawl's outputs are effective across scraping, crawling, extracting, and mapping. This includes designing metrics, building pipelines, curating datasets, and integrating evaluations into continuous integration and deployment to catch regressions before release. Design benchmarks that represent the real customer data distribution, including edge cases, and create the collection and labeling systems. Own LLM-as-judge pipelines by designing and validating automated judges for scoring extraction quality, understanding LLM evaluation failure modes, and building human review tooling. Collaborate with research engineers working on models and reinforcement learning to use evaluation metrics as training signals and feedback loops to improve models. Design, run, and communicate fast experiments that test meaningful hypotheses and enable clear decision-making across the team.

$160,000 – $240,000 per year (USD)

San Francisco, United States
Maybe global
Hybrid

Machine Learning Engineer (Singapore)

New
Top rated
Cantina Labs
Full-time

Build and scale systems for ingesting, processing, and delivering large-scale video and multimodal data for model training. Own the full pipeline from raw content to curated, filtered, and training-ready datasets, focusing on speed, reliability, reproducibility, and cost-efficiency. Design and scale distributed data pipelines for preprocessing, dataset generation, and repeated dataset refreshes. Own workflow orchestration, job scheduling, monitoring, and failure recovery for large-scale data processing jobs. Implement and maintain containerized pipeline infrastructure using Kubernetes or equivalent orchestration systems. Optimize cloud-based data storage and movement across providers (AWS, GCS, or Azure) for cost, throughput, and operational efficiency. Define and implement best practices for dataset storage layout, versioning, caching, retention, and access patterns. Design and implement curation pipelines for selection, filtering, and retention of video and image content for model training, including image-text pair datasets. Build and improve VLM-based captioning and metadata generation workflows at scale across video and image data. Develop and apply quality and aesthetic scoring models, CLIP-based semantic filtering, and other signal-extraction approaches for data selection. Build tooling to support deduplication workflows at scale, including near-duplicate and exact deduplication pipelines over large video corpora. Analyze dataset composition, identify quality issues, and iterate on curation logic to improve training outcomes. Define and evolve standards for high-quality, training-ready video data across different training regimes.

Undisclosed

Singapore
Maybe global
Onsite

Machine Learning Intern

New
Top rated
Enterpret
Intern
Full-time

You will design, build, and ship AI-backed features that are reliable in production. Responsibilities include defining the quality bar by designing evaluation rubrics, test plans, and rollout criteria that are measurable and enforced; building with real-world constraints by writing and extending production code, setting up monitoring, and adding tests to catch regressions before users do; owning features end to end, from problem framing and modeling through system design, rollout, and iteration; debugging failures across the stack, including data, infrastructure, model, and prompt logic, and hardening the system with those learnings; designing and implementing systems such as retrieval pipelines, agents, or hybrid patterns depending on problem needs; and working across functions by collaborating with product, infrastructure, and engineers to ship features that stick.

Undisclosed

Bengaluru
Maybe global
Onsite

Member of Technical Staff, Training (Bay Area, Remote)

New
Top rated
Genesis AI
Full-time

Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack, from data pipelines to GPU kernels. Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability, robustness, and high utilization. Implement efficient low-level code (CUDA, cuDNN, Triton, custom kernels) and integrate it seamlessly into high-level training frameworks. Optimize workloads for hardware efficiency including CPU/GPU compute balance, memory management, data throughput, and networking. Develop monitoring and debugging tools for large-scale runs to enable rapid diagnosis of performance regressions and failures.

Undisclosed

San Carlos, United States
Maybe global
Hybrid

Member of Technical Staff, Training (Paris, London)

New
Top rated
Genesis AI
Full-time

Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack, from data pipelines to GPU kernels. Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability, robustness, and high utilization. Implement efficient low-level code (CUDA, cuDNN, Triton, custom kernels) and integrate it seamlessly into high-level training frameworks. Optimize workloads for hardware efficiency including CPU/GPU compute balance, memory management, data throughput, and networking. Develop monitoring and debugging tools for large-scale runs, enabling rapid diagnosis of performance regressions and failures.

Undisclosed

Paris, France
Maybe global
Remote

Member of Engineering (Reinforcement Learning Infrastructure)

New
Top rated
Poolside
Full-time

Keep up with the latest research, and be familiar with the state of the art in LLMs, RL, and code generation. Develop methods for tuning training and inference end-to-end for high throughput. Design data control systems in an RL pipeline that govern what the model sees and when. Debug cases where infrastructure decisions are silently degrading learning dynamics. Build observability tooling that surfaces when a system-level issue is the root cause of a training regression. Help build robust, flexible and scalable RL pipelines. Optimize performance across the stack — networking, memory, compute scheduling, and I/O. Write high-quality, pragmatic code. Work in the team: plan future steps, discuss, and always stay in touch.

Undisclosed

United Kingdom
Maybe global
Remote

Want to see more ML Infrastructure Engineer jobs?

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Have questions about roles, locations, or requirements for ML Infrastructure Engineer jobs?

What does an ML Infrastructure Engineer do?

ML Infrastructure Engineers design, build, and maintain systems that support machine learning workflows from development to production. They create scalable platforms for model training and serving, implement distributed training systems, and develop monitoring solutions to track model performance. These engineers also build data pipelines, optimize ML systems for performance, and implement automated testing and deployment processes while collaborating with data scientists and researchers to productionize ML models.

What skills are required for an ML Infrastructure Engineer?

ML Infrastructure Engineers need strong programming skills in Python and sometimes Go, Rust, or C++. Proficiency with ML frameworks like PyTorch and TensorFlow is essential, alongside expertise in cloud platforms (AWS, GCP), containers (Docker), and orchestration (Kubernetes). They should understand distributed systems, data engineering concepts, and model serving techniques. Experience with infrastructure-as-code tools and monitoring systems rounds out the technical requirements, complemented by problem-solving abilities and collaboration skills.

What qualifications are needed for an ML Infrastructure Engineer role?

Most ML Infrastructure Engineer positions require a Bachelor's or Master's degree in Computer Science or a related field, plus 4-5+ years of experience building production ML systems. Employers typically expect demonstrable experience with cloud platforms, containerization tools, and ML frameworks. A strong understanding of system-level software, machine learning concepts, and resource utilization is necessary. Experience with distributed systems and high-throughput workloads is highly valued, especially for senior positions.

What is the salary range for ML Infrastructure Engineer jobs?

Compensation varies based on factors like location, company size, experience level, and specific technical expertise. Listings on this page that disclose pay range from $160,000 to $445,000 per year (USD). Organizations like Anthropic, Scale AI, Apple, and other technology companies actively hiring for these positions likely offer competitive compensation packages reflecting the specialized nature of ML infrastructure skills and the current market demand.

How long does it take to get hired as an ML Infrastructure Engineer?

Hiring timelines for ML Infrastructure Engineer positions vary by company. The process typically includes technical interviews focused on systems design, ML fundamentals, and programming skills. Given the specialized nature of the role, companies often conduct thorough evaluations of candidates' experience with production ML systems, distributed computing, and relevant technologies. The specialized requirements may extend the hiring process compared to more general engineering roles.

Are ML Infrastructure Engineer jobs in demand?

Yes, ML Infrastructure Engineer jobs show strong demand based on active openings at major companies like DataXight, Scale AI, Anthropic, Apple, and Character.AI. The field is growing particularly in specialized areas such as LLM serving infrastructure, on-device ML optimization, and safety-critical ML systems. These positions are distributed across major tech hubs with opportunities ranging from mid-level to senior roles, reflecting the industry's increasing need for engineers who can build reliable ML systems at scale.