Kubernetes AI Jobs

Discover the latest remote and onsite Kubernetes AI roles across top active AI companies. Updated hourly.

Check out 301 new Kubernetes AI roles posted on AI Chopping Block

TLM, Embedded Experiences

New
Top rated
OpenAI
Full-time

Lead the technical direction, architecture, and execution of critical Cooperative Systems initiatives. Manage and mentor a team of engineers while maintaining meaningful hands-on technical involvement. Partner closely with stakeholders across Support, Operations, Finance, IT, Sales, Legal, and other functions to identify opportunities for AI-driven improvements. Design and build production systems that leverage large language models and other AI technologies. Drive engineering excellence through strong technical decision-making, code quality, operational rigor, and thoughtful system design. Balance rapid experimentation with long-term platform investments. Establish technical roadmaps and execution plans for projects spanning multiple teams. Coach engineers through technical challenges, career growth, and project execution. Help shape the culture, processes, and engineering practices of a growing organization.

$325,000 – $385,000 per year (USD)

San Francisco, United States
Maybe global
Remote
Python
OpenAI API
Prompt Engineering
MLOps
Docker

Software Engineer, Agents & Automations

New
Top rated
Cohere
Full-time

Design, build, ship, and maintain core capabilities for North’s Agents & Automations platform. Build product and platform features that help users create, run, debug, evaluate, and improve agents and automations. Own features end-to-end, from technical design through implementation, testing, launch, and iteration. Work across the stack, from frontend product surfaces to backend systems, depending on what the product needs. Make practical technical decisions that balance speed, quality, depth, and user impact. Collaborate closely with product, design, modelling, customer-facing teams, and other engineers to define the right outcomes and ship measurable improvements. Use AI actively in your work, while staying intellectually engaged and accountable for the quality and reliability of what you ship.

Undisclosed

London, United Kingdom
Maybe global
Remote
Python
TypeScript
Kubernetes
PostgreSQL
AI

Software Engineer, Knowledge Systems

New
Top rated
Exa
Full-time

As a Software Engineer on Knowledge Systems, you will help build systems that understand what is true about the world by extracting, connecting, retrieving, and reasoning over knowledge from the web and beyond to enable AI agents to answer questions with unprecedented precision and completeness.

$180,000 – $350,000 per year (USD)

San Francisco, United States
Maybe global
Onsite
Python
Data Pipelines
CI/CD
AWS
Kubernetes

VP of Engineering

New
Top rated
Hyperbolic
Full-time

Lead the design and evolution of the AI cloud platform including GPU orchestration, compute scheduling, networking, storage, and distributed systems. Make critical decisions regarding cloud infrastructure, bare-metal deployments, and platform scalability. Participate personally in architecture reviews and key technical initiatives. Build and scale large GPU clusters supporting customer workloads and design systems for GPU provisioning, scheduling, utilization optimization, and capacity management. Drive platform reliability and performance for AI training and inference workloads, partnering closely with engineering teams on infrastructure requirements for next-generation AI systems. Remain deeply involved in engineering decisions and technical direction, contribute directly to infrastructure design and implementation efforts, review architecture proposals, system designs, and major infrastructure changes, and act as the technical escalation point for complex infrastructure challenges. Establish best practices for Kubernetes, observability, CI/CD, security, and operational excellence. Build SRE and Platform Engineering functions from the ground up. Define reliability standards including SLOs, SLIs, incident response processes, and capacity planning. Drive automation across infrastructure operations. Recruit and develop Infrastructure, Platform, and SRE teams. Build a high-performance engineering culture focused on ownership and execution. Partner with executive leadership on company strategy and infrastructure investments. Manage infrastructure budgets, vendor relationships, and capacity planning.

Undisclosed

San Francisco, United States
Maybe global
Remote
Kubernetes
Docker
CI/CD
AWS
GCP

Operations Program Manager (Computer Vision), Public Sector

New
Top rated
Scale AI
Full-time

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, while supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and the resilient cloud infrastructure required for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment. You will build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability. You will manage the technical lifecycle within diverse regulatory frameworks. You will lead the response for production issues in mission-critical environments, ensuring rapid resolution and building guardrails to prevent recurrence. You will translate deep technical performance metrics into clear insights for senior international government officials. You will also partner with Engineering and ML teams to ensure lessons learned in the field influence the technical architecture and decisions of future use cases.

Undisclosed

St. Louis or Washington, United States
Maybe global
Onsite
Python
Kubernetes
Vector Databases
MLOps
CI/CD

Systems Research Engineer Intern - GPU Programming (Fall 2026)

New
Top rated
Together AI
Full-time

Participate in on-call rotation (Pagerduty) to respond to production incidents. Build and run infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a large number of concurrent users. Build monitoring systems to ensure the highest quality service for customers. Design and implement operational processes such as deployments and upgrades. Debug production issues across all services and levels of the stack. Identify improvements for the product architecture from the perspectives of reliability, performance, and availability. Plan the growth of Together AI's infrastructure.

$190,000 – $270,000 per year (USD)

San Francisco, United States
Maybe global
Onsite
Python
Terraform
Kubernetes
Docker
CI/CD

Research Intern, Inference (Fall 2026)

New
Top rated
Together AI
Full-time

As an AI Infrastructure Engineer at Together, the responsibilities include participating in on-call rotation to respond to production incidents, building and running infrastructure using Ansible, Terraform, and Kubernetes to support scaling to a large number of concurrent users, building monitoring systems to ensure high-quality service, designing and implementing operational processes such as deployments and upgrades, debugging production issues across all services and stack levels, identifying improvements for product architecture in terms of reliability, performance, and availability, and planning the growth of Together AI's infrastructure.

$190,000 – $270,000 per year (USD)

Maybe global
Python
Docker
Kubernetes
Terraform
CI/CD

Frontier Agents Intern (Fall 2026)

New
Top rated
Together AI
Full-time

As an AI Infrastructure Engineer at Together AI, the responsibilities include participating in on-call rotation (Pagerduty) to respond to production incidents; building and running infrastructure with Ansible, Terraform, and Kubernetes to enable scaling for a massive number of concurrent users; building monitoring systems to ensure the highest quality service for customers; designing and implementing operational processes such as deployments and upgrades; debugging production issues across all services and levels of the stack; identifying improvements for the product architecture from reliability, performance, and availability perspectives; and planning the growth of Together AI's infrastructure.

$190,000 – $270,000 per year (USD)

San Francisco, United States
Maybe global
Onsite
Kubernetes
Terraform
Ansible
Docker
CI/CD

Agentic Product Analyst

New
Top rated
Netomi
Full-time

As an Agentic Product Analyst at Netomi, you will be responsible for designing, architecting, and deploying large-scale Agentic AI solutions for enterprise customers. This includes leading discovery sessions with customers to understand business processes, identifying automation opportunities, and designing agentic orchestration strategies using Netomi's AI platform. You will build detailed solution blueprints covering workflows, data exchanges, escalation logic, analytics, and agent lifecycle design. Defining end-to-end Agentic AI architectures and working with customer technical teams to map integration dependencies are also key tasks. You will own the creation of integration design documents, support Integration Engineers during implementation, and ensure agent workflows comply with enterprise standards. Collaboration with Product & Engineering to translate requirements into features, serving as product owner during deployment, validating solution behavior with QA, conducting user-experience reviews, training customer teams, and ensuring projects deliver on time with measurable impact also fall under your responsibilities. Additionally, you are expected to act as a trusted advisor to customer stakeholders, present architectural recommendations, drive continuous improvement, and maintain deep expertise in agentic AI, LLMs, workflow orchestration, and enterprise systems.

Undisclosed

Gurugram, India
Maybe global
Onsite
Python
Prompt Engineering
MLOps
Docker
Kubernetes

Senior Backend Engineer- AI Agents (Remote)

New
Top rated
Level AI
Full-time

Design and build scalable backend systems powering AI Agents that operate in real-time enterprise environments. Develop agent orchestration frameworks involving multi-step reasoning, tool usage, and decisioning workflows. Build systems for agent memory, context management, and state persistence across interactions. Architect low-latency inference pipelines integrating Large Language Models, Small Language Models, and external tools/services. Implement evaluation frameworks to measure agent performance, accuracy, and reliability. Enable continuous improvement loops for AI agents in production including feedback, retraining, and deployment. Design and manage event-driven, asynchronous workflows for complex agent tasks. Optimize systems for high throughput, low latency, and cost-efficient inference at scale. Build and maintain robust APIs and service layers (REST/gRPC) for agent capabilities. Partner closely with Applied AI/ML teams to productionize models and agent behaviors. Collaborate with Product and Solutions teams to translate real customer workflows into agentic systems. Drive best practices in observability, monitoring, safety, and guardrails for AI systems. Contribute to architecture decisions for scaling multi-tenant, enterprise-grade AI platforms.

Undisclosed

United States
Maybe global
Remote
Python
Docker
Kubernetes
AWS
GCP

Want to see more AI Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Need help with something? Here are our most frequently asked questions.

What are Kubernetes AI jobs?

Kubernetes AI jobs involve orchestrating containerized machine learning applications at scale. Professionals in these roles manage container deployment for AI workloads, distribute computational tasks across nodes for model training, allocate GPU resources efficiently, and automate ML pipelines. They typically work with frameworks like TensorFlow and PyTorch while ensuring high availability for production AI systems through automated scaling and self-healing capabilities.

What roles commonly require Kubernetes skills?

Roles requiring Kubernetes skills include Machine Learning Engineers who deploy models to production, MLOps Engineers working with platforms like Kubeflow, Data Engineers managing processing pipelines, Platform Engineers supporting agentic AI applications, DevOps/SRE professionals handling containerized deployments, and Cloud Architects designing scalable environments. These positions typically involve maintaining infrastructure that supports the complete machine learning lifecycle.

What skills are typically required alongside Kubernetes?

Alongside Kubernetes, employers typically look for container fundamentals (especially Docker), distributed systems knowledge, CI/CD pipeline experience, and cloud platform familiarity. Programming skills are essential for deployment scripts, while experience with ML frameworks like TensorFlow or PyTorch is valuable for AI-specific implementations. Understanding storage solutions, Kubernetes operators, and automated infrastructure management rounds out the typical skill requirements.

What experience level do Kubernetes AI jobs usually require?

Kubernetes AI jobs typically require mid to senior-level experience. Employers look for professionals who understand containerization concepts, have worked with distributed systems, and can manage complex ML workflows. Prior exposure to cloud environments where Kubernetes runs is important. Candidates should demonstrate practical experience with CI/CD pipelines and familiarity with at least one major ML framework.

What is the salary range for Kubernetes AI jobs?

Kubernetes AI jobs command competitive salaries due to the specialized intersection of container orchestration and machine learning skills. Compensation varies based on experience level, location, and specific industry. Roles requiring both strong AI expertise and Kubernetes infrastructure management typically offer premium compensation compared to general software engineering positions, reflecting the high market value of these combined skill sets.

Are Kubernetes AI jobs in demand?

Kubernetes AI jobs are in high demand as organizations increasingly adopt containerized applications for machine learning workloads. The growth is driven by enterprises scaling their AI operations, edge computing applications, and the need for platform-agnostic infrastructure. Companies seek professionals who can manage the complexity of distributed ML systems, particularly for high-availability production environments and automated ML pipelines.

What is the difference between Kubernetes and Docker in AI roles?

Docker creates containerized applications while Kubernetes orchestrates those containers at scale. In AI roles, Docker is used to package ML applications with their dependencies, while Kubernetes manages deployment across clusters, automates scaling during training, and handles resource allocation for GPUs. Docker provides consistency between environments, while Kubernetes adds critical production capabilities like load balancing, self-healing, and distributed computing for AI workloads.