Kubernetes AI Jobs

Discover the latest remote and onsite Kubernetes AI roles at top AI companies. Updated hourly.

Check out 301 new Kubernetes AI roles posted on AI Chopping Block.

Software Engineer, AI Voice Agent

New
Top rated
Aircall
Full-time

As a Software Engineer on the AI Voice Agent team, you will work on real-time systems involving live audio such as buffering, streaming, and latency optimization, along with integrating speech providers. You will build and improve conversation intelligence systems, including prompt construction, context management, function calling, and dialogue management to make conversations feel natural. You will develop the action framework to execute configurable API calls, manage success/failure branching, authentication, and runtime execution during calls. You will work on knowledge ingestion, storage, retrieval, memory, and context for the voice agent to improve its performance over time. Additionally, you will collaborate on agent lifecycle tasks such as creation, configuration, testing, and deployment of voice agents and help build evaluation frameworks for model performance, call quality metrics, and call analytics. Participation in on-call rotations is also expected.

$130,000 – $220,000 USD per year

San Francisco, United States
Maybe global
Onsite
Python
TypeScript
Prompt Engineering
Model Evaluation
MLOps

Senior Software Engineer, AI Voice Agent

New
Top rated
Aircall
Full-time

As a Senior Software Engineer on the AI Voice Agent team, you will work on real-time systems involving live audio streaming and latency optimization integrated with speech providers. You will build and improve conversation intelligence systems that manage LLM layers, including prompt construction, context management, function calling, and dialogue management to create natural, actionable phone conversations. You will develop the action framework allowing configurable API calls with branching logic and runtime execution, supporting tasks like data lookup and ticket creation during calls. You'll manage knowledge ingestion, storage, and retrieval to enhance agent memory and learning over time. You will collaborate with designers to enable customers to create, configure, test, and deploy voice agents through intuitive product experiences. Additionally, you will help develop evaluation frameworks, analytics, call quality metrics, and monitoring instrumentation, and participate in on-call rotation duties.

$150,000 – $220,000 USD per year

Maybe global
Python
TypeScript
Prompt Engineering
Model Evaluation
AWS

Staff Software Engineer, Foundations (Managed AI)

New
Top rated
Crusoe
Full-time

As a Staff Software Engineer in the Foundations department, you will lead the design and implementation of highly scalable systems for the Managed AI offerings and drive the Foundations team's long-term technical roadmap to support growth and evolving AI workloads. You will work cross-functionally with Cloud Engineering to align technical goals and solve integration challenges, and lead by example through high-quality code contributions and by mentoring Senior and Staff-level engineers. You will champion reliability, observability, and performance by identifying and resolving systemic bottlenecks, and stay current with AI infrastructure trends so the team uses efficient, powerful tools.

$208,000 – $253,000 USD per year

San Francisco, United States
Maybe global
Onsite
Go
Python
Kubernetes
AWS
GCP

Senior Platform/DevOps Engineer (Kubernetes-Linux)

New
Top rated
Armada
Full-time

Translate business requirements into requirements for AI/ML models; prepare data to train and evaluate AI/ML/DL models; build AI/ML/DL models by applying state-of-the-art algorithms, especially transformers; leverage existing algorithms from academic or industrial research when applicable; test, evaluate, and benchmark AI/ML/DL models and publish the models, data sets, and evaluations; deploy models in production by containerizing them; work with customers and internal employees to refine model quality; establish continuous learning pipelines for models using online or transfer learning; build and deploy containerized applications in cloud or on-premises environments.

$154,560 – $193,200 USD per year

Bellevue, United States
Maybe global
Onsite
Python
Java
C++
Docker
Kubernetes

Software Engineer - Tools & Automation

New
Top rated
Zoox
Full-time

As a Software Engineer and member of the Platform Stability team, you will help build, fine-tune, and maintain a novel AI-powered tool for diagnosing technical issues and identifying root causes. You will collaborate cross-functionally to gather requirements, develop AI/ML and analytical models, and drive data-driven insights as part of a high-performing team. Responsibilities include designing and implementing agentic AI systems with structured interfaces, reasoning loops, and robust error handling; building and maintaining data pipelines, scheduled workflows, and benchmarking infrastructure; developing evaluation and scoring systems to measure and improve model output quality; integrating the platform with internal and external services such as ticketing, messaging, storage, and observability; collaborating with cross-functional teams to translate business requirements into technical AI solutions; and architecting and maintaining production-grade AI solutions with a focus on scalability, reliability, and performance.

$184,000 – $231,000 USD per year

Foster City, United States
Maybe global
Onsite
Python
Prompt Engineering
Model Evaluation
Data Pipelines
MLOps

AI Engineer

New
Top rated
AppZen
Full-time

The AI Engineer will design and develop intelligent agents powered by large language models (LLMs), using tool calling, orchestration frameworks, and advanced context management to enable reasoning, planning, and autonomous decision-making across complex workflows. Responsibilities include working hands-on with modern agentic stacks such as LangGraph and AutoGen, implementing asynchronous and streaming architectures, and ensuring production-grade observability to build scalable, real-world AI systems.

$160,000 – $180,000 USD per year

San Jose, United States
Maybe global
Onsite
Python
Go
LangChain
MLOps
Kubernetes

Mid/Senior/Staff Software Engineer, Agents

New
Top rated
Harvey
Full-time

As a Software Engineer, Agents, you will build systems that make AI agents indispensable to legal professionals by designing environments and actions for agentic professional work, making model selection decisions, managing context windows, creating optimal tools, and developing evaluation harnesses for faster iteration loops to unlock new capabilities. You will partner with customers and product managers to understand legal workflows, design practical evaluations to capture what excellence means, and ship agents that effectively complete tasks. Additionally, you will optimize agent performance through prompt engineering, model selection, tool design, skill writing, context window management, and evaluation harness development. You will work with the model infrastructure team to design and implement infrastructure for low-latency agent execution, including caching strategies, parallel tool calls, or subagent patterns. Improving observability and instrumentation to profile agent behavior, identify bottlenecks, and drive optimization decisions is also part of the role. Staying current on new developments in agentic systems and applying those insights to product development is expected.

$165,000 – $312,000 USD per year

New York, United States
Maybe global
Onsite
Python
Prompt Engineering
Model Evaluation
OpenAI API
Transformers

Forward Deployed AI Engineer

New
Top rated
Latent Labs
Full-time

Drive the end-to-end technical deployment of Latent Labs models into customer environments, ensuring seamless integration with existing scientific and IT infrastructure. Design and build production-grade API integrations, data pipelines and model-serving infrastructure tailored to each customer’s requirements. Work on-site or embedded with pharma and biotech partners to scope technical requirements, troubleshoot issues and deliver solutions. Ensure deployments meet enterprise standards for security, performance and reliability. Serve as the technical point of contact for assigned customers, building trusted relationships with their scientific and engineering teams, including spending time working on-site at international partner locations as needed. Gather and synthesise customer feedback, translating it into actionable insights for product, research and platform teams. Collaborate with internal teams to shape the product roadmap based on real-world deployment learnings. Create technical documentation, integration guides and best-practice resources for customers. Stay on top of the latest developments in ML infrastructure, model serving and cloud-native tooling. Gain a strong working understanding of protein and cell biology as it relates to the product. Participate in knowledge sharing, including organizing and presenting at internal reading groups.

Salary undisclosed

San Francisco, United States
Maybe global
Hybrid
Python
AWS
Docker
Kubernetes
CI/CD

Staff Engineer, G&C (R4763)

New
Top rated
Shield AI
Full-time

As a Guidance and Controls engineer, you will be responsible for creating and maintaining all control and autonomy algorithms within the XBAT code base. This includes algorithm development, unit tests, component tests, flight software qualification, and flight test support. You will also be responsible for helping update and validate the truth models as required.

$180,000 – $280,000 USD per year

Dallas, United States
Maybe global
Onsite
Python
C++
CI/CD
MLOps
Docker

Director, Data Center Operations

New
Top rated
Together AI
Full-time

Responsibilities include advancing inference efficiency end to end by designing and prototyping algorithms, architectures, and scheduling strategies for low-latency, high-throughput inference; implementing and maintaining changes in high-performance inference engines, including kernel backends and speculative decoding; and profiling and optimizing performance across GPU, networking, and memory layers to improve latency, throughput, and cost. The role also unifies inference with RL and post-training: designing and operating RL and post-training pipelines, making those workloads more efficient with inference-aware training loops, and using the pipelines to train, evaluate, and iterate on frontier models. It involves co-designing algorithms and infrastructure so that objectives, rollout collection, and evaluation are tightly coupled to efficient inference, and identifying bottlenecks across the training engine, inference engine, data pipeline, and user-facing layers. You will run ablations and scale-up experiments to understand trade-offs between model quality, latency, throughput, and cost, and feed these insights back into model, RL, and system design. You will own critical systems at production scale by profiling, debugging, and optimizing inference and post-training services under real production workloads, driving roadmap items that require engine modification, and establishing metrics, benchmarks, and experimentation frameworks to validate improvements rigorously. Finally, you will provide technical leadership by setting direction for cross-team efforts at the intersection of inference, RL, and post-training, and by mentoring other engineers and researchers on full-stack ML systems work and performance engineering.

$200,000 – $280,000 USD per year

San Francisco, United States
Maybe global
Onsite
Python
PyTorch
TensorFlow
MLOps
Model Evaluation

Want to see more AI Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Need help with something? Here are our most frequently asked questions.


What are Kubernetes AI jobs?

Kubernetes AI jobs involve orchestrating containerized machine learning applications at scale. Professionals in these roles manage container deployment for AI workloads, distribute computational tasks across nodes for model training, allocate GPU resources efficiently, and automate ML pipelines. They typically work with frameworks like TensorFlow and PyTorch while keeping production AI systems highly available through automated scaling and self-healing.

What roles commonly require Kubernetes skills?

Roles requiring Kubernetes skills include Machine Learning Engineers who deploy models to production, MLOps Engineers working with platforms like Kubeflow, Data Engineers managing processing pipelines, Platform Engineers supporting agentic AI applications, DevOps/SRE professionals handling containerized deployments, and Cloud Architects designing scalable environments. These positions typically involve maintaining infrastructure that supports the complete machine learning lifecycle.

What skills are typically required alongside Kubernetes?

Alongside Kubernetes, employers typically look for container fundamentals (especially Docker), distributed systems knowledge, CI/CD pipeline experience, and cloud platform familiarity. Programming skills are essential for deployment scripts, while experience with ML frameworks like TensorFlow or PyTorch is valuable for AI-specific work. Understanding storage solutions, Kubernetes operators, and automated infrastructure management rounds out the typical skill set.

What experience level do Kubernetes AI jobs usually require?

Kubernetes AI jobs typically require mid- to senior-level experience. Employers look for professionals who understand containerization concepts, have worked with distributed systems, and can manage complex ML workflows. Prior exposure to the cloud environments where Kubernetes runs is important. Candidates should demonstrate practical experience with CI/CD pipelines and familiarity with at least one major ML framework.

What is the salary range for Kubernetes AI jobs?

Kubernetes AI jobs command competitive salaries because they sit at the specialized intersection of container orchestration and machine learning. Compensation varies with experience level, location, and industry; among the listings on this page that disclose pay, posted ranges run from $130,000 to $312,000 USD per year. Roles requiring both strong AI expertise and Kubernetes infrastructure management typically pay a premium over general software engineering positions, reflecting the market value of the combined skill set.

Are Kubernetes AI jobs in demand?

Kubernetes AI jobs are in high demand as organizations increasingly run machine learning workloads in containers. The growth is driven by enterprises scaling their AI operations, edge computing applications, and the need for platform-agnostic infrastructure. Companies seek professionals who can manage the complexity of distributed ML systems, particularly high-availability production environments and automated ML pipelines.

What is the difference between Kubernetes and Docker in AI roles?

Docker builds and runs containers, while Kubernetes orchestrates those containers at scale. In AI roles, Docker is used to package ML applications with their dependencies, while Kubernetes manages deployment across clusters, scales workloads automatically, and allocates GPUs and other resources. Docker provides consistency between environments; Kubernetes adds production capabilities such as load balancing, self-healing, and distributed computing for AI workloads.
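To make the Docker/Kubernetes split from the last answer concrete, here is a minimal sketch, not taken from any listing above, that builds a Kubernetes Deployment manifest for a containerized model server requesting one GPU per pod. It assumes the cluster runs the NVIDIA device plugin, which exposes GPUs as the `nvidia.com/gpu` resource. The deployment name, image URL, and port are hypothetical placeholders. Kubernetes accepts JSON manifests, so the output could be saved to a file and passed to `kubectl apply -f`.

```python
import json


def gpu_inference_deployment(name: str, image: str, replicas: int = 2, gpus_per_pod: int = 1) -> dict:
    """Build a minimal apps/v1 Deployment manifest for a GPU-backed model server.

    The container image (for example, one built with Docker) packages the model and its
    dependencies; Kubernetes keeps `replicas` pods running and schedules each onto a
    node that has enough free GPUs.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template's labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Hypothetical serving port; depends on the model server used.
                            "ports": [{"containerPort": 8080}],
                            "resources": {
                                # GPUs are an extended resource (assumes the NVIDIA device
                                # plugin). They are set in limits; the request defaults to
                                # the same value.
                                "limits": {"nvidia.com/gpu": str(gpus_per_pod)},
                            },
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    manifest = gpu_inference_deployment("model-server", "registry.example.com/model-server:1.0")
    print(json.dumps(manifest, indent=2))
```

Building the manifest as plain data keeps the sketch dependency-free. In practice, teams often write the same structure as YAML or generate it with Helm or an official Kubernetes client library.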