ML Infrastructure Engineer Jobs

Discover the latest remote and onsite ML Infrastructure Engineer roles across top active AI companies. Updated hourly.

Join our AI community Interested in Hiring?

Hiring by

Check out 23 new ML Infrastructure Engineer opportunities posted on AI Chopping Block

View detail

ML/AI Engineer - Vehicle Intelligence

New

Top rated

42dot

–

Full-time

–

Posted

Jun 11, 2026 22:50

Develop AI-powered vehicle intelligence features that understand user intent, trip goals, vehicle state, and system constraints. Apply reinforcement learning, planning, optimization, and data-driven modeling to improve vehicle-level decisions across energy, comfort, charging, routing, and proactive vehicle preparation. Build models using vehicle telemetry, navigation data, user behavior, weather, traffic, cabin conditions, charging patterns, and fleet data. Create personalization models that learn user routines, comfort preferences, driving patterns, charging habits, and trip priorities while preserving privacy and user control. Use simulation, digital twins, and scenario-based testing to train, evaluate, and validate AI behavior before production deployment. Collaborate with autonomous driving and VLA teams to define interfaces for sharing user intent, route objectives, vehicle constraints, energy targets, comfort preferences, and system-level recommendations. Integrate ML models into production vehicle and cloud platforms, considering latency, compute efficiency, reliability, safety, explainability, and over-the-air update readiness. Work cross-functionally with Product, UX, Systems Engineering and Controls.

$220,780 – $311,220

Undisclosed

YEAR

(USD)

Sunnyvale or San Francisco, United States

Maybe global

Onsite

View detail

Technical Lead Manager - Training Runtime, Data(set) Movement

New

Top rated

OpenAI

–

Full-time

–

Posted

Jun 9, 2026 6:07

The Technical Lead Manager will own datasets throughout the training infrastructure and set the direction for how training jobs read data, including APIs, storage contracts, versioning model, benchmarks, debugging tools, and reliability guarantees to make data access consistent across current and future training frameworks. Responsibilities include designing and building a unified dataset read platform for multiple training frameworks; defining dataset APIs, storage-format expectations, registration/versioning, and migration paths to ensure reproducible and maintainable data access; building reliability into the read path such as stateful iteration, caching, fast restart, recovery, and clear operational contracts; developing terminal and web-based visualizers to inspect data late in the pipeline; writing and reviewing production code in core data loading, service, caching, and reliability paths; and partnering with teams working on training frameworks, reinforcement learning, multimodal models, storage, runtime, and cluster infrastructure. Over time, the role will expand to owning broader data movement systems including checkpoint loads/saves and snapshot transfers, working closely with technical leads and infrastructure teams.

$295,000 – $445,000

Undisclosed

YEAR

(USD)

San Francisco, United States

Maybe global

Remote

View detail

Senior Product Engineer, Growth & Lifecycle Infrastructure - Music & Audio

New

Top rated

Stability AI

–

Full-time

–

Posted

Jun 6, 2026 12:58

Lead efforts to drive the design and development of customer-facing multi-modal machine learning inference systems. Work with the Platform and Inference teams on building inference systems for the next generation of models, focusing on optimization, model tuning, and deployment. Partner with leading cloud providers to deliver hosted Stability AI inference solutions. Serve as a strategic thought partner for leaders across the organization on driving business impact through machine learning. Contribute to bringing new Stability models and pipelines into existence. Prototype and productionize inference platform improvements and new features.

Undisclosed

()

Los Angeles, United States

Maybe global

Hybrid

View detail

Staff ML Systems Engineer, Distributed Systems

New

Top rated

FieldAI

–

Full-time

–

Posted

May 30, 2026 23:50

Design and build scalable distributed machine learning pipelines across data processing, model training, evaluation, and post-processing workflows. Architect distributed execution systems, including parallelization strategies, workload scheduling, resource allocation, and fault tolerance mechanisms. Develop reusable abstractions, frameworks, and libraries that simplify distributed pipeline development. Optimize performance across distributed CPU and GPU environments, improving throughput, utilization, and reliability. Design systems that effectively manage data partitioning, memory utilization, serialization overhead, and compute efficiency. Partner closely with ML engineers, data engineers, and infrastructure teams to productionize research workflows and enable large-scale model development. Establish best practices and engineering standards for distributed machine learning infrastructure. Evaluate and guide decisions around distributed computing frameworks, infrastructure technologies, and system design trade-offs. Improve observability, debugging, monitoring, and operational tooling for distributed systems at scale.

$170,000 – $200,000

Undisclosed

YEAR

(USD)

Seattle or Irvine, United States

Maybe global

Onsite

View detail

Research Engineer – Evals

New

Top rated

Firecrawl

–

Full-time

–

Posted

May 13, 2026 7:04

Build the evaluation systems from scratch that measure whether Firecrawl's outputs are effective across scraping, crawling, extracting, and mapping. This includes designing metrics, building pipelines, curating datasets, and integrating evaluations into continuous integration and deployment to catch regressions before release. Design benchmarks that represent real customer data distribution including edge cases, and create the collection and labeling systems. Own LLM-as-judge pipelines by designing and validating automated judges for scoring extraction quality, understanding LLM evaluation failure modes, and building human review tooling. Collaborate with research engineers working on models and reinforcement learning to use evaluation metrics as training signals and feedback loops to improve models. Design, run, and communicate fast experiments that test meaningful hypotheses and enable clear decision-making across the team.

$160,000 – $240,000

Undisclosed

YEAR

(USD)

San Francisco, United States

Maybe global

Hybrid

View detail

Machine Learning Engineer (Singapore)

New

Top rated

Cantina Labs

–

Full-time

–

Posted

May 13, 2026 6:29

Build and scale systems for ingesting, processing, and delivering large-scale video and multimodal data for model training. Own the full pipeline from raw content to curated, filtered, and training-ready datasets focusing on speed, reliability, reproducibility, and cost-efficiency. Design and scale distributed data pipelines for preprocessing, dataset generation, and repeated dataset refreshes. Own workflow orchestration, job scheduling, monitoring, and failure recovery for large-scale data processing jobs. Implement and maintain containerized pipeline infrastructure using Kubernetes or equivalent orchestration systems. Optimize cloud-based data storage and movement across providers (AWS, GCS, or Azure) for cost, throughput, and operational efficiency. Define and implement best practices for dataset storage layout, versioning, caching, retention, and access patterns. Design and implement curation pipelines for selection, filtering, and retention of video and image content for model training including image-text pair datasets. Build and improve VLM-based captioning and metadata generation workflows at scale across video and image data. Develop and apply quality and aesthetic scoring models, CLIP-based semantic filtering, and other signal-extraction approaches for data selection. Build tooling to support deduplication workflows at scale, including near-dedup and exact deduplication pipelines over large video corpora. Analyze dataset composition, identify quality issues, iterate on curation logic to improve training outcomes. Define and evolve standards for high-quality, training-ready video data across different training regimes.

Undisclosed

()

Singapore

Maybe global

Onsite

View detail

Machine Learning Intern

New

Top rated

Enterpret

–

Intern

Full-time

–

Posted

May 4, 2026 19:20

You will design, build, and ship AI-backed features that are reliable in production. Responsibilities include defining the quality bar by designing evaluation rubrics, test plans, and rollout criteria that are measurable and enforced; building with real-world constraints by writing and extending production code, setting up monitoring, and adding tests to catch regressions before users do; owning features end to end from problem framing to modeling, system design to rollout and iteration; debugging failures across the stack including data, infrastructure, model, prompt logic and hardening the system with those learnings; designing and implementing systems such as retrieval pipelines, agents, or hybrid patterns depending on problem needs; and working across functions by collaborating with product, infrastructure, and engineers to ship features that stick.

Undisclosed

()

Bengaluru

Maybe global

Onsite

View detail

Member of Technical Staff, Training (Bay Area, Remote)

New

Top rated

Genesis AI

–

Full-time

–

Posted

May 1, 2026 1:42

Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack from data pipelines to GPU kernels. Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability, robustness, and high utilization. Implement efficient low-level code (CUDA, cuDNN, Triton, custom kernels) and integrate it seamlessly into high-level training frameworks. Optimize workloads for hardware efficiency including CPU/GPU compute balance, memory management, data throughput, and networking. Develop monitoring and debugging tools for large-scale runs to enable rapid diagnosis of performance regressions and failures.

Undisclosed

()

San Carlos, United States

Maybe global

Hybrid

View detail

Member of Technical Staff, Training (Paris, London)

New

Top rated

Genesis AI

–

Full-time

–

Posted

May 1, 2026 1:42

Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack, from data pipelines to GPU kernels. Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability, robustness, and high utilization. Implement efficient low-level code (CUDA, cuDNN, Triton, custom kernels) and integrate it seamlessly into high-level training frameworks. Optimize workloads for hardware efficiency including CPU/GPU compute balance, memory management, data throughput, and networking. Develop monitoring and debugging tools for large-scale runs, enabling rapid diagnosis of performance regressions and failures.

Undisclosed

()

Paris, France

Maybe global

Remote

View detail

Member of Engineering (Reinforcement Learning Infrastructure)

New

Top rated

Poolside

–

Full-time

–

Posted

Apr 28, 2026 3:25

Keep up with the latest research, and be familiar with the state of the art in LLMs, RL, and code generation. Develop methods for tuning training and inference end-to-end for high throughput. Design data control systems in an RL pipeline that govern what the model sees and when. Debug cases where infrastructure decisions are silently degrading learning dynamics. Build observability tooling that surfaces when a system-level issue is the root cause of a training regression. Help build robust, flexible and scalable RL pipelines. Optimize performance across the stack — networking, memory, compute scheduling, and I/O. Write high-quality, pragmatic code. Work in the team: plan future steps, discuss, and always stay in touch.

Undisclosed

()

United Kingdom

Maybe global

Remote

Want to see more ML Infrastructure Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.

Join our community

(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Have questions about roles, locations, or requirements for ML Infrastructure Engineer jobs?

Question text goes here

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat. Aenean faucibus nibh et justo cursus id rutrum lorem imperdiet. Nunc ut sem vitae risus tristique posuere.

[{"question":"What does a ML Infrastructure Engineer do?","answer":"ML Infrastructure Engineers design, build, and maintain systems that support machine learning workflows from development to production. They create scalable platforms for model training and serving, implement distributed training systems, and develop monitoring solutions to track model performance. These engineers also build data pipelines, optimize ML systems for performance, and implement automated testing and deployment processes while collaborating with data scientists and researchers to productionize ML models."},{"question":"What skills are required for ML Infrastructure Engineer?","answer":"ML Infrastructure Engineers need strong programming skills in Python and sometimes Go, Rust, or C++. Proficiency with ML frameworks like PyTorch and TensorFlow is essential, alongside expertise in cloud platforms (AWS, GCP), containers (Docker), and orchestration (Kubernetes). They should understand distributed systems, data engineering concepts, and model serving techniques. Experience with infrastructure-as-code tools and monitoring systems rounds out the technical requirements, complemented by problem-solving abilities and collaboration skills."},{"question":"What qualifications are needed for ML Infrastructure Engineer role?","answer":"Most ML Infrastructure Engineer positions require a Bachelor's or Master's degree in Computer Science or related field, plus 4-5+ years of experience building production ML systems. Employers typically expect demonstrable experience with cloud platforms, containerization tools, and ML frameworks. Strong understanding of system-level software, machine learning concepts, and resource utilization is necessary. Experience with distributed systems and high-throughput workloads is highly valued, especially for senior positions."},{"question":"What is the salary range for ML Infrastructure Engineer job?","answer":"The research provided doesn't specify salary ranges for ML Infrastructure Engineer jobs. Compensation typically varies based on factors like location, company size, experience level, and specific technical expertise. Organizations like Anthropic, Scale AI, Apple, and other technology companies actively hiring for these positions likely offer competitive compensation packages reflecting the specialized nature of ML infrastructure skills and the current market demand."},{"question":"How long does it take to get hired as a ML Infrastructure Engineer?","answer":"The hiring timeline for ML Infrastructure Engineer positions isn't specified in the provided research. The process typically includes technical interviews focused on systems design, ML fundamentals, and programming skills. Given the specialized nature of the role, companies often conduct thorough evaluations of candidates' experience with production ML systems, distributed computing, and relevant technologies. The specialized requirements may extend the hiring process compared to more general engineering roles."},{"question":"Are ML Infrastructure Engineer job in demand?","answer":"Yes, ML Infrastructure Engineer jobs show strong demand based on active openings at major companies like DataXight, Scale AI, Anthropic, Apple, and Character.AI. The field is growing particularly in specialized areas such as LLM serving infrastructure, on-device ML optimization, and safety-critical ML systems. These positions are distributed across major tech hubs with opportunities ranging from mid-level to senior roles, reflecting industry's increasing need for engineers who can build reliable ML systems at scale."}]

Find AI jobs in by countries

AI Jobs in Argentina

AI Jobs in Australia

AI Jobs in Brazil

AI Jobs in Canada

AI Jobs in China

AI Jobs in France

AI Jobs in Germany

AI Jobs in Hong Kong

AI Jobs in India

AI Jobs in Japan

AI Jobs in Mexico

AI Jobs in Poland

AI Jobs in Singapore

AI Jobs in South Korea

AI Jobs in Spain

AI Jobs in Sweden

AI Jobs in Taiwan

AI Jobs in United Kingdom

AI Jobs in United States

Find AI jobs for similar categories

AI Technical Trainer Jobs

Applied ML Engineer Jobs

AI Full-Stack Engineer Jobs

AI MLOps Engineer Jobs

AI Systems Engineer Jobs

AI Applied Data Scientist Jobs

AI Robotics Software Engineer Jobs

AI Developer Educator Jobs

Applied AI Engineer Jobs

AI Applied Research Scientist Jobs

ML Researcher Jobs

AI Backend Engineer Jobs

AI Training Data Specialist Jobs

ML Research Engineer Jobs

AI Infrastructure Engineer Jobs

AI Autonomous Systems Engineer Jobs

AI Platform Engineer Jobs

AI Perception Engineer Jobs

AI Robotics Researcher Jobs