Kubernetes AI Jobs

Discover the latest remote and onsite Kubernetes AI roles across top active AI companies. Updated hourly.

Check out 301 new Kubernetes AI roles posted on AI Chopping Block.

Full Stack Product Engineer

New
Top rated
Ideogram
Full-time
Posted

As a Full-Stack Product Engineer at Ideogram, you will build products that bring generative AI directly to creators, working across the entire technology stack from designing user experiences to optimizing backend systems that serve millions. Your focus will be on shipping features that users love by combining product intuition, strong ownership, and user empathy. You will design APIs and data models to support evolving product needs, utilize AI-native engineering tools to speed up development, debugging, and understanding of the codebase, and work effectively across frontend and backend systems. You will also be responsible for explaining technical concepts to both technical and non-technical stakeholders, participating in constructive code reviews, collaborating with the team, and taking full responsibility for the outcomes of your work, not just the code.

Undisclosed

Toronto, Canada
Maybe global
Onsite
Python
JavaScript
TypeScript
Kubernetes
Docker

Solutions Architect (APAC)

New
Top rated
LangChain
Full-time
Posted

The Solutions Architect is responsible for designing scalable, highly available infrastructure for AI platform deployments, including compute, storage, networking, security, enterprise integration patterns, Infrastructure as Code (Terraform, Helm), multi-region HA/DR strategies, and CI/CD pipelines. They also design multi-agent systems using different patterns, implement agent logic with modern frameworks (LangChain/LangGraph), create evaluation frameworks, optimize prompts with A/B testing, and guide deployment and operations. Additionally, they lead technical maturity assessments, work directly with enterprise customers to understand requirements and offer recommendations, and collaborate with Engagement Managers and Product/Engineering teams.

Undisclosed

Singapore
Maybe global
Remote
Python
TypeScript
Kubernetes
AWS
GCP

Senior Engineering Manager, Management Plane Systems

New
Top rated
Crusoe
Full-time
Posted

Lead the team responsible for the automation, observability, configuration management, and policy enforcement layer that runs across the entire network fleet. Own the architecture, development, and production operation of the SDN Management Plane, including the automation and observability platform for managing network fleet across all regions. Build and operate CI/CD pipelines for network configuration, including automated testing, policy validation, and push-on-green delivery of network changes. Design and implement software systems that enforce reconciliation between declared and actual network state, detect configuration drift, and trigger automated remediation workflows. Define provisioning and onboarding automation for new nodes, regions, and customer environments. Drive the design of network observability systems such as streaming telemetry, synthetic probing, anomaly detection, and real-time traffic monitoring across GPU clusters. Design and implement self-healing network capabilities using closed-loop automation to detect, diagnose, and resolve network faults without human intervention. Set the technical vision for applying GenAI and machine learning to network operations. Partner with Control Plane and Data Plane teams to ensure software interfaces between layers and collaborate with infrastructure and compute teams to support GPU cluster networking requirements. Act as internal platform owner for network automation and treat engineering teams as customers with real product requirements. Lead, mentor, and grow a team of senior and staff-level software and network automation engineers, set technical standards, review architecture and design decisions, and own team performance and development. Foster a high-ownership engineering culture focused on shipping production software.

$237,000 – $288,000 per year (USD)

San Francisco, United States
Maybe global
Onsite
Python
Go
CI/CD
MLOps
Kubernetes

Manager, Infrastructure Strategy & Operations

New
Top rated
Together AI
Full-time
Posted

As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You participate in the on-call rotation (PagerDuty) to respond to production incidents. You build and run infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users. You build monitoring systems to ensure the highest quality service for customers. You design and implement operational processes such as deployments and upgrades. You debug production issues across all services and levels of the stack. You identify improvements for the product architecture from the reliability, performance, and availability perspectives. You plan the growth of Together AI's infrastructure.

$190,000 – $270,000 per year (USD)

San Francisco
Maybe global
Onsite
Kubernetes
Terraform
Ansible
CI/CD
Docker

Trust Engineer

New
Top rated
Harvey
Full-time
Posted

Own the implementation and optimization of Harvey's compliance automation tooling to automate workflows across compliance programs; design and build a compliance data layer in Snowflake by ingesting signals from infrastructure, security tools, and SaaS platforms to create a real-time view of control health and audit readiness; develop AI agents and automated pipelines for evidence collection, control testing, and continuous monitoring at scale; partner with Engineering and Security to map technical implementations to compliance controls and maintain a living, accurate control inventory; build reporting layers that translate compliance signals into clear narratives on risk posture and certification status for executive and cross-functional audiences.

$220,000 – $330,000 per year (USD)

San Francisco, United States
Maybe global
Remote
Python
SQL
AWS
GCP
Azure

Full Stack Engineer, AI systems

New
Top rated
Bjak
Full-time
Posted

Build end-to-end product features across frontend, backend, and AI integrations; design agent workflows that handle planning, tool use, failure, and recovery across multiple steps; integrate LLMs, memory, and external tools into systems that behave reliably under real-world conditions; design real-time AI interactions with streaming, partial results, and tight latency constraints; improve system reliability, observability, and fallback mechanisms; collaborate closely with ML, backend, and product teams to ship features end-to-end; continuously iterate based on real usage and failure modes.

Undisclosed

Seoul, South Korea
Maybe global
Remote
Python
JavaScript
PyTorch
OpenAI API
Kubernetes

Backend Engineer, AI (Agent Systems)

New
Top rated
Bjak
Full-time
Posted

As a Backend Engineer, AI, you own the inference and orchestration layer that powers every AI interaction in the product. You build and operate backend systems that serve AI-powered features in production, design inference pipelines, orchestration layers, and service boundaries around models. You are responsible for production concerns such as monitoring, logging, alerting, and incident response. Additionally, you optimize latency and throughput across inference, caching, batching, and streaming. Your work enables backend systems to run reliably at scale, handling production AI traffic with low latency and high throughput, ensuring APIs are stable, clear, and support seamless integration with frontend and ML systems. You ensure production incidents are quickly detected, diagnosed, and resolved, minimizing user impact, and continuously improve system performance and reliability through iterative changes based on real usage.

Undisclosed

Seoul, South Korea
Maybe global
Remote
Python
PyTorch
OpenAI API
Kubernetes
Docker

Lead/Manager Together Cloud Infrastructure Engineer

New
Top rated
Together AI
Full-time
Posted

As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You participate in on-call rotation to respond to production incidents, build and run infrastructure using Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users, build monitoring systems to ensure the highest quality service for customers, design and implement operational processes such as deployments and upgrades, debug production issues across all services and levels of the stack, identify improvements for product architecture from reliability, performance, and availability perspectives, and plan the growth of Together AI's infrastructure.

$190,000 – $270,000 per year (USD)

Amsterdam
Maybe global
Onsite
Python
Docker
Kubernetes
AWS
Terraform

Director, Technical Program Manager

New
Top rated
Scale AI
Full-time
Posted

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies, oversee the end-to-end health of the platform ensuring seamless integration between the AI core and all full-stack components from APIs to UI, build automated systems to monitor model performance and data drift across geographically dispersed environments, manage the technical lifecycle within diverse regulatory frameworks, lead incident response for production issues in mission-critical environments ensuring rapid resolution and prevention measures, translate deep technical performance metrics into clear insights for senior international government officials, and partner with Engineering and ML teams to incorporate field lessons into future technical architecture and decisions.

Undisclosed

San Francisco, United States
Maybe global
Onsite
Python
Kubernetes
Vector Databases
MLOps
CI/CD

Software Engineer, Monetization ML Infrastructure

New
Top rated
OpenAI
Full-time
Posted

Design and build the machine learning infrastructure that powers OpenAI's monetization and ads systems. Develop large-scale data pipelines processing impressions, clicks, conversions, advertiser data, marketplace signals, and other inputs used to train and improve ML models. Create scalable model training platforms for ranking, conversion prediction, quality prediction, bidding, targeting, measurement, and optimization workloads. Develop systems to safely and reliably move models from experimentation into production environments. Build and improve real-time inference and serving infrastructure with strict requirements for latency, throughput, reliability, and availability. Design experimentation frameworks enabling A/B testing, holdouts, model comparisons, ramping strategies, and measurement at scale. Improve platform performance by optimizing training efficiency, inference latency, model throughput, infrastructure reliability, and cost effectiveness. Collaborate closely with ML engineers, product engineers, data scientists, and monetization teams to accelerate development and deployment of advertising systems.

$293,000 – $441,000 per year (USD)

San Francisco, United States
Maybe global
Remote
Python
PyTorch
TensorFlow
Data Pipelines
MLOps

Want to see more AI Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Need help with something? Here are our most frequently asked questions.

What are Kubernetes AI jobs?

Kubernetes AI jobs involve orchestrating containerized machine learning applications at scale. Professionals in these roles manage container deployment for AI workloads, distribute computational tasks across nodes for model training, allocate GPU resources efficiently, and automate ML pipelines. They typically work with frameworks like TensorFlow and PyTorch while ensuring high availability for production AI systems through automated scaling and self-healing capabilities.

What roles commonly require Kubernetes skills?

Roles requiring Kubernetes skills include Machine Learning Engineers who deploy models to production, MLOps Engineers working with platforms like Kubeflow, Data Engineers managing processing pipelines, Platform Engineers supporting agentic AI applications, DevOps/SRE professionals handling containerized deployments, and Cloud Architects designing scalable environments. These positions typically involve maintaining infrastructure that supports the complete machine learning lifecycle.

What skills are typically required alongside Kubernetes?

Alongside Kubernetes, employers typically look for container fundamentals (especially Docker), distributed systems knowledge, CI/CD pipeline experience, and cloud platform familiarity. Programming skills are essential for deployment scripts, while experience with ML frameworks like TensorFlow or PyTorch is valuable for AI-specific implementations. Understanding storage solutions, Kubernetes operators, and automated infrastructure management rounds out the typical skill requirements.

What experience level do Kubernetes AI jobs usually require?

Kubernetes AI jobs typically require mid- to senior-level experience. Employers look for professionals who understand containerization concepts, have worked with distributed systems, and can manage complex ML workflows. Prior exposure to cloud environments where Kubernetes runs is important. Candidates should demonstrate practical experience with CI/CD pipelines and familiarity with at least one major ML framework.

What is the salary range for Kubernetes AI jobs?

Kubernetes AI jobs command competitive salaries due to the specialized intersection of container orchestration and machine learning skills. Compensation varies based on experience level, location, and specific industry. Roles requiring both strong AI expertise and Kubernetes infrastructure management typically offer premium compensation compared to general software engineering positions, reflecting the high market value of these combined skill sets.

Are Kubernetes AI jobs in demand?

Kubernetes AI jobs are in high demand as organizations increasingly adopt containerized applications for machine learning workloads. The growth is driven by enterprises scaling their AI operations, edge computing applications, and the need for platform-agnostic infrastructure. Companies seek professionals who can manage the complexity of distributed ML systems, particularly for high-availability production environments and automated ML pipelines.

What is the difference between Kubernetes and Docker in AI roles?

Docker creates containerized applications while Kubernetes orchestrates those containers at scale. In AI roles, Docker is used to package ML applications with their dependencies, while Kubernetes manages deployment across clusters, automates scaling during training, and handles resource allocation for GPUs. Docker provides consistency between environments, while Kubernetes adds critical production capabilities like load balancing, self-healing, and distributed computing for AI workloads.