AI MLOps Engineer Jobs

Discover the latest remote and onsite AI MLOps Engineer roles across top active AI companies. Updated hourly.

Check out 27 new AI MLOps Engineer opportunities posted on AI Chopping Block

Technical Lead Manager - Training Runtime, Data(set) Movement

New
Top rated
OpenAI
Full-time

The Technical Lead Manager will own datasets throughout the training infrastructure and set the direction for how training jobs read data, including APIs, storage contracts, versioning model, benchmarks, debugging tools, and reliability guarantees to make data access consistent across current and future training frameworks. Responsibilities include designing and building a unified dataset read platform for multiple training frameworks; defining dataset APIs, storage-format expectations, registration/versioning, and migration paths to ensure reproducible and maintainable data access; building reliability into the read path such as stateful iteration, caching, fast restart, recovery, and clear operational contracts; developing terminal and web-based visualizers to inspect data late in the pipeline; writing and reviewing production code in core data loading, service, caching, and reliability paths; and partnering with teams working on training frameworks, reinforcement learning, multimodal models, storage, runtime, and cluster infrastructure. Over time, the role will expand to owning broader data movement systems including checkpoint loads/saves and snapshot transfers, working closely with technical leads and infrastructure teams.

$295,000 – $445,000 per year (USD)

San Francisco, United States
Maybe global
Remote

Staff Machine Learning Engineer

New
Top rated
Bjak
Full-time

Own end-to-end ML system execution including data pipelines, training workflows, evaluation systems, inference architecture, and deployment. Fine-tune and adapt models using methods such as LoRA, QLoRA, SFT, DPO, and distillation. Architect and operate scalable inference systems managing latency, cost, and reliability. Design and maintain data systems for high-quality synthetic and real-world training data. Implement evaluation pipelines covering performance, robustness, safety, and bias in partnership with research leadership. Own production deployment including GPU optimization, memory efficiency, latency reduction, and scaling policies. Collaborate closely with application engineering to integrate ML systems into backend, mobile, and desktop products. Make pragmatic trade-offs, ship improvements quickly, and learn from real usage. Work under real production constraints including latency, cost, reliability, and safety. Detect, debug, and resolve production issues quickly to minimize user impact. Support and align team members to deliver high-impact ML work with minimal friction. Ensure iterations on models and systems are measurable, safe, and improve user experience over time.

Salary undisclosed

Seoul, South Korea
Maybe global
Remote

Technical Lead, Machine Learning

New
Top rated
Bjak
Full-time

Own end-to-end ML system execution including data pipelines, training workflows, evaluation systems, inference architecture, and deployment. Fine-tune and adapt models using state-of-the-art methods such as LoRA, QLoRA, SFT, DPO, and distillation. Architect and operate scalable inference systems balancing latency, cost, and reliability. Design and maintain data systems for high-quality synthetic and real-world training data. Implement evaluation pipelines covering performance, robustness, safety, and bias in partnership with research leadership. Own production deployment including GPU optimization, memory efficiency, latency reduction, and scaling policies. Collaborate closely with application engineering to integrate ML systems into backend, mobile, and desktop products. Make pragmatic trade-offs and ship improvements quickly while learning from real usage. Work within real production constraints such as latency, cost, reliability, and safety.

Salary undisclosed

Seoul, South Korea
Maybe global
Remote

AI Deployment Strategist, Enterprise

New
Top rated
Scale AI
Full-time

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies, oversee the end-to-end health of the platform ensuring seamless integration between the AI core and all full-stack components, build automated systems to monitor model performance and data drift across geographically dispersed environments, manage the technical lifecycle within diverse regulatory frameworks, lead incident response for production issues in mission-critical environments ensuring rapid resolution and prevention, translate deep technical performance metrics into clear insights for senior international government officials, and partner with Engineering and ML teams to ensure field lessons learned influence future technical architecture and decisions.

Salary undisclosed

San Francisco or New York, United States
Maybe global
Onsite

Member of Engineering (Pre-training / Data Research)

New
Top rated
Poolside
Full-time

Follow the latest research related to Large Language Models (LLMs) and data quality, being familiar with relevant open-source datasets and models. Design and implement complex pipelines to generate large amounts of diverse data while optimizing available resources. Collaborate closely with teams such as Pretraining, Posttraining, Evals, and Product to ensure short feedback loops on the quality of models delivered. Suggest, conduct, and analyze data ablations or training experiments to improve the quality of generated datasets using quantitative insights.

Salary undisclosed

United Kingdom
Maybe global
Remote

Director of Technology & Systems

New
Top rated
Scale AI
Full-time

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, while supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and the resilient cloud infrastructure required for international government partners. You will own the production outcome by taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will ensure full-stack integrity by overseeing the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment. You will build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability. You will manage the technical lifecycle within diverse regulatory frameworks, lead the response for production issues in mission-critical environments ensuring rapid resolution and building guardrails to prevent recurrence. You will translate deep technical performance metrics into clear insights for senior international government officials and partner with Engineering and ML teams to ensure lessons learned in the field directly influence the technical architecture and decisions of future use cases.

Salary undisclosed

San Francisco, United States
Maybe global
Onsite

Machine Learning Engineer (Singapore)

New
Top rated
Cantina Labs
Full-time

Build and scale systems for ingesting, processing, and delivering large-scale video and multimodal data for model training. Own the full pipeline from raw content to curated, filtered, and training-ready datasets, focusing on speed, reliability, reproducibility, and cost-efficiency. Design and scale distributed data pipelines for preprocessing, dataset generation, and repeated dataset refreshes. Own workflow orchestration, job scheduling, monitoring, and failure recovery for large-scale data processing jobs. Implement and maintain containerized pipeline infrastructure using Kubernetes or equivalent orchestration systems. Optimize cloud-based data storage and movement across providers (AWS, GCS, or Azure) for cost, throughput, and operational efficiency. Define and implement best practices for dataset storage layout, versioning, caching, retention, and access patterns. Design and implement curation pipelines for selection, filtering, and retention of video and image content for model training, including image-text pair datasets. Build and improve VLM-based captioning and metadata generation workflows at scale across video and image data. Develop and apply quality and aesthetic scoring models, CLIP-based semantic filtering, and other signal-extraction approaches for data selection. Build tooling to support deduplication workflows at scale, including near-duplicate and exact deduplication pipelines over large video corpora. Analyze dataset composition, identify quality issues, and iterate on curation logic to improve training outcomes. Define and evolve standards for high-quality, training-ready video data across different training regimes.

Salary undisclosed

Singapore
Maybe global
Onsite

Research Engineer, Training & Inference

New
Top rated
Harmonic
Full-time

Maintain and optimize the proprietary reinforcement learning (RL) training and serving infrastructure with total stack ownership, including the Python API to CUDA kernels, to achieve peak performance for foundation model workloads. Maximize throughput of the RL system from data generation to model training utilizing sharded multi-node training and inference algorithms. Optimize the inference stack for high-throughput RL and low-latency large language model (LLM) production traffic by tuning the inference engine, router, scheduler, and custom kernels if necessary. Identify and resolve performance bottlenecks in distributed clusters to ensure optimal throughput and memory efficiency for multi-billion parameter models, balancing memory constraints with compute-heavy training cycles.

$200,000 – $450,000 per year (USD)

Palo Alto, United States
Maybe global
Onsite

Director, Forward Deployed Engineering

New
Top rated
Scale AI
Full-time

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, while supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and the resilient cloud infrastructure required for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components from APIs to UI to maintain a responsive and production-ready environment. You will build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability. You will manage the technical lifecycle within diverse regulatory frameworks. You will lead the response for production issues in mission-critical environments, ensuring rapid resolution and building guardrails to prevent recurrence. You will translate deep technical performance metrics into clear insights for senior international government officials. You will partner with Engineering and ML teams to ensure lessons learned in the field directly influence the technical architecture and decisions of future use cases.

Salary undisclosed

London, United Kingdom
Maybe global
Onsite

Business Development Intern

New
Top rated
PathAI
Full-time

Lead the team responsible for the AI/ML infrastructure that connects machine learning research with large-scale production. Develop and execute the long-term vision and roadmap for the MLOps team to support ML development and deployment needs across business units, balancing short-term tactical deliveries and long-term architectural transformation. Manage and mentor a team of 6-7+ engineers, allocating resources strategically for existing service support and key initiatives. Collaborate cross-functionally with leaders in machine learning, data science, product engineering, and infrastructure to identify issues, address bottlenecks, and facilitate new solution deployment. Architect compute and storage pipelines for managing large datasets without data fragmentation or latency. Modernize inference stack for AI product growth. Work with Site Reliability Engineering to establish comprehensive system metrics. Conduct build vs. buy assessments and audits to benchmark proprietary tools against commercial and open-source alternatives.

$181,500 – $278,300 per year (USD)

Boston
Maybe global
Remote

Want to see more AI MLOps Engineer jobs?


Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Have questions about roles, locations, or requirements for AI MLOps Engineer jobs?


What does an AI MLOps Engineer do?

AI MLOps Engineers design and implement CI/CD pipelines for machine learning models, focusing on deployment, monitoring, and maintenance. They containerize models with Docker, orchestrate them with Kubernetes, implement automated testing frameworks, and build scalable infrastructure for ML workflows. They monitor models for performance degradation and data drift while ensuring security compliance throughout the pipeline. They bridge the gap between data science and production environments by automating model versioning, retraining, and optimization.

What skills are required for an AI MLOps Engineer?

AI MLOps Engineers need strong programming skills in Python and experience with container tooling such as Docker and Kubernetes. Proficiency with cloud platforms (AWS, GCP, Azure) is essential, alongside expertise in CI/CD pipelines, version control, and infrastructure as code. They should understand ML algorithms, model serving patterns, and monitoring systems that track performance metrics. Experience with vector databases, RAG systems, and fine-tuning pipelines for LLMs is increasingly valuable.

What qualifications are needed for an AI MLOps Engineer role?

Most AI MLOps Engineer positions require a bachelor's degree in Computer Science, Data Science, Engineering, or a related field. Employers typically seek candidates with 4+ years of technical engineering experience, particularly in DevOps, software engineering, or data engineering. Demonstrable expertise with ML deployment, containerization, and cloud platforms is crucial. Strong coding skills in Python and other languages, combined with practical experience implementing and maintaining ML systems in production environments, are highly valued.

What is the salary range for an AI MLOps Engineer job?

Compensation varies with location, experience level, company size, and industry. Because the role requires specialized expertise in both ML and DevOps, salaries generally align with other senior technical positions in the AI field. For accurate figures, consult current compensation surveys or job listings for AI MLOps Engineer positions in your target location.

How long does it take to get hired as an AI MLOps Engineer?

Hiring timelines vary significantly depending on the company's urgency, the candidate pool, and the specific technical requirements of the position. The process typically involves technical interviews assessing both ML knowledge and operational skills. Employers commonly require 4+ years of technical experience and specific expertise in ML algorithms, DevOps, and workflow automation, so candidates who meet these qualifications may move through the process more quickly.

Are AI MLOps Engineer jobs in demand?

Demand for AI MLOps Engineers appears to be growing, as evidenced by recruitment at major companies such as Microsoft. As organizations increasingly deploy ML models to production, the need for specialists who can bridge data science and operations has expanded. The role is crucial for companies looking to scale AI initiatives reliably and efficiently. The specialized skill set, combining ML knowledge with DevOps expertise, makes qualified candidates particularly valuable as more businesses implement machine learning in production environments.