Kubernetes AI Jobs

Discover the latest remote and onsite Kubernetes AI roles across top active AI companies. Updated hourly.


Check out 301 new Kubernetes AI roles posted on AI Chopping Block.

Span - Sr Product Engineer

New, Top rated

Silver.dev – Full-time – Posted Apr 28, 2026 20:21

Work on projects such as developing a product that root causes KTLO work and recommends solutions, building a software catalog that works for monoliths and is user-friendly, and helping protect engineering focus time by systemically solving sources of distraction or mental load with AI.

$100,000 – $140,000 per year (USD)

Argentina – Maybe global – Remote

TypeScript, Python, AWS, Kubernetes, CI/CD

Lead Member of Technical Staff, Inference Infrastructure

New, Top rated

Cohere – Full-time – Posted Apr 28, 2026 16:48

The Lead Member of Technical Staff, Inference Infrastructure, is responsible for providing technical leadership across multiple teams, driving the architecture and strategy for deploying optimized NLP models to production in low latency, high throughput, and high availability environments. They lead the design of customized deployments to meet specific customer needs and mentor engineers to raise the technical standards across the team. The role involves contributing to the development, deployment, and operation of the AI platform delivering large language models through easy-to-use API endpoints, and serving as a key point of contact for customers.

Salary undisclosed

San Francisco, United States – Maybe global – Remote

Golang, C++, Kubernetes, AWS, GCP

Member of Engineering (Reinforcement Learning Infrastructure)

New, Top rated

Poolside – Full-time – Posted Apr 28, 2026 3:25

Keep up with the latest research, and be familiar with the state of the art in LLMs, RL, and code generation. Develop methods for tuning training and inference end-to-end for high throughput. Design data control systems in an RL pipeline that govern what the model sees and when. Debug cases where infrastructure decisions are silently degrading learning dynamics. Build observability tooling that surfaces when a system-level issue is the root cause of a training regression. Help build robust, flexible and scalable RL pipelines. Optimize performance across the stack — networking, memory, compute scheduling, and I/O. Write high-quality, pragmatic code. Work in the team: plan future steps, discuss, and always stay in touch.

Salary undisclosed

United Kingdom – Maybe global – Remote

Python, PyTorch, JAX, Reinforcement Learning, MLOps

Staff Software Engineer, Core Infrastructure

New, Top rated

Harvey – Full-time – Posted Apr 28, 2026 3:18

As a Staff Software Engineer on the Core Infrastructure team at Harvey, your responsibilities include designing and building scalable, fault-tolerant infrastructure systems that power Harvey's AI platform across multiple cloud regions. You will own and evolve the multi-cloud infrastructure (Azure, GCP), including Kubernetes orchestration, networking, and container management. You will lead technical initiatives focused on observability, incident response, and operational excellence, building systems for rapid detection and resolution of issues. Architecting and optimizing distributed systems for reliability, including load balancing, quota management, and failover mechanisms, will be part of your role. You will partner with Product Engineering and Security teams to ensure infrastructure accelerates product development, drive infrastructure-as-code practices using tools like Terraform and Pulumi for reproducible deployments, and mentor engineers through code reviews, design reviews, and technical leadership. Representative projects include designing model proxy architecture for handling inference requests, building distributed rate limiting and quota management systems, architecting multi-region deployment strategies for data residency compliance, developing observability infrastructure with SLA monitoring and cost tracking, and leading CI/CD pipeline evolution to improve velocity and stability.

$236,000 – $290,000 per year (USD)

San Francisco, United States – Maybe global – Onsite

Python, Kubernetes, Terraform, Pulumi

Tokens-as-a-Service (TaaS) Software Engineer

New, Top rated

OpenAI – Full-time – Posted Apr 28, 2026 2:55

Develop systems and tooling to measure, monitor, and improve token throughput across first-party and partner-owned compute environments. Support performance benchmarking, tokenomics analysis, and model porting across heterogeneous infrastructure environments. Build tooling to integrate external or partner infrastructure into OpenAI’s internal compute, observability, and workload management systems. Develop and monitor operational metrics including billing, usage, SLAs, utilization, reliability, and throughput. Identify bottlenecks across hardware, networking, software, and workload enablement that prevent capacity from becoming productive tokens. Partner with compute, infrastructure, networking, finance, and operations teams to translate raw capacity into usable workload-serving capacity. Build dashboards, automation, and reporting systems that provide clear visibility into TaaS capacity, performance, and business outcomes.

$293,000 – $455,000 per year (USD)

San Francisco, United States – Maybe global – Remote

Python, Docker, Kubernetes, CI/CD, AWS

Software Engineer, Compute Infrastructure

New, Top rated

OpenAI – Full-time – Posted Apr 28, 2026 1:01

In this role, you will spin up and scale large Kubernetes clusters, including automating provisioning, bootstrapping, and cluster lifecycle management; build software abstractions that unify multiple clusters and provide a seamless interface to training workloads; own node bring-up from bare metal through firmware upgrades ensuring fast and repeatable deployment at massive scale; improve operational metrics such as reducing cluster restart times and accelerating firmware or OS upgrade cycles; integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure; develop monitoring and observability systems to detect issues early and maintain cluster stability under extreme load; solve real-time operational challenges, diagnose and fix issues quickly, and continuously improve automation, resilience, performance, and uptime across the systems powering frontier AI model training.

$230,000 – $405,000 per year (USD)

San Francisco, United States – Maybe global – Remote

Kubernetes, CI/CD, Docker, AWS, GCP

Software Engineer, Model Serving Infrastructure

New, Top rated

Anyscale – Full-time – Posted Apr 27, 2026 23:14

The role involves contributing to the development of next-generation, high-performance machine learning serving systems. Responsibilities include building infrastructure that powers AI applications, working on problems at the intersection of distributed systems, machine learning, and high-performance computing, and solving fundamental computer science problems impacting AI deployment. Specific projects include implementing asynchronous inference for non-blocking client requests, designing intelligent request routing systems to balance load across thousands of model replicas with strict latency SLAs, building traffic management systems for zero-downtime model updates handling terabytes of inference requests, improving state management for scale from thousands to tens of thousands of replicas, architecting frameworks for multi-model orchestration in complex ML pipelines ensuring end-to-end latency guarantees, and developing observability and debugging tools for distributed ML applications at scale. The work involves writing performance-critical code in Python (with Cython optimizations) and potentially C++, working with distributed systems at scale using Ray Core's actor system, gRPC, and custom networking protocols, extending cloud-native infrastructure such as Kubernetes and service meshes, gaining system-level knowledge of ML/AI frameworks like TensorFlow, PyTorch, JAX, and transformers, and ensuring production reliability with tools like OpenTelemetry, Prometheus, distributed tracing, and chaos engineering to maintain 99.99% uptime. The role also involves leveraging AI coding agents to enhance team productivity while maintaining high code quality standards.

Salary undisclosed

Bengaluru, India – Maybe global – Onsite

Python, C++, TensorFlow, PyTorch, JAX

VP Engineering - London

New, Top rated

H Company – Full-time – Posted Apr 27, 2026 22:40

The VP Engineering is responsible for defining and executing a scalable, defensible technology strategy; building a world-class engineering organization and platform; partnering with the CEO on product direction, investor communication, and long-term vision; and ensuring the successful bridging of frontier AI research with enterprise-grade deployment. Responsibilities include architecting and scaling H's AI platform, making build vs. buy decisions, ensuring performance, reliability, and cost efficiency, establishing technical moats, translating AI capabilities into enterprise-ready products, standardizing bespoke systems, balancing iteration speed with robustness, building and leading engineering teams, scaling organizational structure, implementing quality processes, acting as a key counterpart to the CEO in board and investor discussions, articulating technology and product roadmaps, providing technical due diligence, operating cross-functionally across Research, Product, and Go-to-Market, aligning engineering with customer and revenue goals, and helping define long-term company positioning.

Salary undisclosed

London, United Kingdom – Maybe global – Remote

Python, MLOps, Docker, Kubernetes, AWS

VP Engineering - Paris

New, Top rated

H Company – Full-time – Posted Apr 27, 2026 22:38

The VP Engineering is responsible for defining and executing a scalable, defensible technology strategy, including architecting and scaling the AI platform with a focus on agents, orchestration, model integration, and infrastructure. They make critical build versus buy decisions across the technology stack, ensure performance, reliability, and cost efficiency at scale, and establish durable technical moats in a rapidly evolving AI landscape. They translate cutting-edge AI capabilities into repeatable, enterprise-ready products, standardize systems that are currently bespoke or forward-deployed, and balance speed of iteration with platform robustness and maintainability. They build and lead a high-caliber engineering organization, scaling from a startup structure to multi-layered, high-output teams and implement processes to enable speed without sacrificing quality. The VP Engineering acts as a key counterpart to the CEO in board and investor discussions, clearly articulates the company's technology and product roadmap, and provides credibility and depth in technical due diligence and fundraising contexts. They operate at the intersection of Research, Product, and Go-to-Market, align engineering execution with customer outcomes and revenue growth, and help define the company’s long-term product and platform positioning.

Salary undisclosed

Paris, France – Maybe global – Remote

Python, Docker, Kubernetes, AWS, GCP

Engineering Manager, Cooperative Systems

New, Top rated

OpenAI – Full-time – Posted Apr 27, 2026 22:19

Lead and grow a small team building applied AI systems for internal operations. Design and build AI-powered automation systems in close proximity to customers. Stay hands-on in architecture and implementation across the full stack. Develop evolving systems spanning developer tools, automation platforms, knowledge graphs, and data systems. Deploy systems directly to internal users and close customers to iterate rapidly based on real-world feedback. Engage frequently with scaled workforces to understand needs and validate solutions. Create systems for visibility and learning in hybrid workforces. Partner with product, research, and ops teams daily.

$325,000 – $385,000 per year (USD)

Seattle – Maybe global – Remote

Python, AWS, Docker, Kubernetes, MLOps

Want to see more AI Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.

Join our community

(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Need help with something? Here are our most frequently asked questions.


What are Kubernetes AI jobs?

Kubernetes AI jobs involve orchestrating containerized machine learning applications at scale. Professionals in these roles manage container deployment for AI workloads, distribute computational tasks across nodes for model training, allocate GPU resources efficiently, and automate ML pipelines. They typically work with frameworks like TensorFlow and PyTorch while ensuring high availability for production AI systems through automated scaling and self-healing capabilities.

What roles commonly require Kubernetes skills?

Roles requiring Kubernetes skills include Machine Learning Engineers who deploy models to production, MLOps Engineers working with platforms like Kubeflow, Data Engineers managing processing pipelines, Platform Engineers supporting agentic AI applications, DevOps/SRE professionals handling containerized deployments, and Cloud Architects designing scalable environments. These positions typically involve maintaining infrastructure that supports the complete machine learning lifecycle.

What skills are typically required alongside Kubernetes?

Alongside Kubernetes, employers typically look for container fundamentals (especially Docker), distributed systems knowledge, CI/CD pipeline experience, and cloud platform familiarity. Programming skills are essential for deployment scripts, while experience with ML frameworks like TensorFlow or PyTorch is valuable for AI-specific implementations. Understanding storage solutions, Kubernetes operators, and automated infrastructure management rounds out the typical skill requirements.

What experience level do Kubernetes AI jobs usually require?

Kubernetes AI jobs typically require mid to senior-level experience. Employers look for professionals who understand containerization concepts, have worked with distributed systems, and can manage complex ML workflows. Prior exposure to cloud environments where Kubernetes runs is important. Candidates should demonstrate practical experience with CI/CD pipelines and familiarity with at least one major ML framework.

What is the salary range for Kubernetes AI jobs?

Kubernetes AI jobs command competitive salaries due to the specialized intersection of container orchestration and machine learning skills. Compensation varies based on experience level, location, and specific industry. Roles requiring both strong AI expertise and Kubernetes infrastructure management typically offer premium compensation compared to general software engineering positions, reflecting the high market value of these combined skill sets.

Are Kubernetes AI jobs in demand?

Kubernetes AI jobs are in high demand as organizations increasingly adopt containerized applications for machine learning workloads. The growth is driven by enterprises scaling their AI operations, edge computing applications, and the need for platform-agnostic infrastructure. Companies seek professionals who can manage the complexity of distributed ML systems, particularly for high-availability production environments and automated ML pipelines.

What is the difference between Kubernetes and Docker in AI roles?

Docker creates containerized applications while Kubernetes orchestrates those containers at scale. In AI roles, Docker is used to package ML applications with their dependencies, while Kubernetes manages deployment across clusters, automates scaling during training, and handles resource allocation for GPUs. Docker provides consistency between environments, while Kubernetes adds critical production capabilities like load balancing, self-healing, and distributed computing for AI workloads.

