Kubernetes AI Jobs

Discover the latest remote and onsite Kubernetes AI roles across top active AI companies. Updated hourly.

Check out 301 new Kubernetes AI role opportunities posted on The Homebase

Software Engineer, Architecture, Reliability, & Compute

New
Top rated
Scale AI
Full-time
Posted

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies and oversee the end-to-end health of the platform, ensuring seamless integration between AI core and full-stack components. You will build automated systems to monitor model performance and data drift across geographically dispersed environments, manage the technical lifecycle within diverse regulatory frameworks, and lead response for production issues in mission-critical environments, ensuring rapid resolution and prevention. Finally, you will translate technical performance metrics into clear insights for senior international government officials and partner with Engineering and ML teams to ensure field lessons influence future technical architecture and decisions.
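At its core, the drift-monitoring responsibility described above reduces to comparing live feature distributions against a training-time baseline, per deployment region. A minimal sketch (the region names, sample values, and 3-sigma threshold are illustrative assumptions, not from the listing):

```python
import statistics

def drift_score(baseline, live):
    """Rough drift signal: how many baseline standard deviations
    the live feature mean has shifted from the baseline mean."""
    mu, sigma = statistics.mean(baseline), statistics.stdev(baseline)
    if sigma == 0:
        return 0.0
    return abs(statistics.mean(live) - mu) / sigma

def check_regions(baseline, live_by_region, threshold=3.0):
    """Flag each deployment region whose live feature distribution
    drifted beyond `threshold` standard deviations from baseline."""
    scores = {region: drift_score(baseline, live)
              for region, live in live_by_region.items()}
    return {region: s for region, s in scores.items() if s > threshold}

# Hypothetical baseline from training and live samples from two regions:
baseline = [0.48, 0.52, 0.50, 0.47, 0.53, 0.51, 0.49, 0.50]
live = {"eu-west": [0.49, 0.51, 0.50], "ap-south": [0.91, 0.88, 0.93]}
alerts = check_regions(baseline, live)  # only the shifted region is flagged
```

Production systems would use distribution-level tests (e.g. population stability index) rather than a mean shift, but the alert-per-region shape is the same.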

Undisclosed


San Francisco, St. Louis, New York, or Washington, United States
Maybe global
Onsite
Python
Kubernetes
Docker
MLOps
CI/CD

Engineering Manager, Active Learning

New
Top rated
Deepgram
Full-time
Posted

The Engineering Manager role at Deepgram involves leading the design and implementation of internal data and ML training systems. Responsibilities include recruiting, hiring, training, and supporting top engineering talent to build a world-class team; transforming cross-functional visions into detailed project plans with clarity on commitments, risks, and timelines; defining and owning technical strategy to accelerate ML training pipelines; promoting a strong team engineering culture focused on rigorous engineering standards and continuous improvement; partnering with DataOps and Research teams to design and implement new services, features, or products end to end; and coaching and mentoring engineers to support personal growth while achieving ambitious team goals.

$180,000 – $220,000 per year (USD)

United States
Maybe global
Remote
Python
Docker
Kubernetes
AWS
MLflow

Research Engineer, Machine Learning Systems

New
Top rated
Deepgram
Full-time
Posted

The responsibilities include architecting and managing horizontally scalable systems to accelerate the end-to-end training lifecycle for Speech-to-Text (STT) and Text-to-Speech (TTS) models, focusing on optimized data preparation, high-throughput training pipelines, distributed infrastructure, and automated evaluation tooling. The role also involves designing and implementing internal UIs and tools to make ML systems and workflows accessible and transparent to non-technical stakeholders. Additionally, the position requires overseeing and managing training tooling, job orchestration, experiment tracking, and data storage.
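The experiment-tracking duty above is, at its simplest, an append-only log of run parameters and metrics that can later be queried for the best run. A minimal sketch (the file name, fields, and metric name are assumptions for illustration):

```python
import json
import time
import uuid
from pathlib import Path

LOG = Path("experiments.jsonl")  # hypothetical log location
LOG.unlink(missing_ok=True)      # start fresh for this demo

def log_run(params, metrics):
    """Append one training run's parameters and metrics as a JSON line."""
    record = {"run_id": uuid.uuid4().hex,
              "timestamp": time.time(),
              "params": params,
              "metrics": metrics}
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record["run_id"]

def best_run(metric, maximize=False):
    """Scan the log and return the run with the best value for `metric`."""
    runs = [json.loads(line) for line in LOG.read_text().splitlines()]
    return (max if maximize else min)(runs, key=lambda r: r["metrics"][metric])

log_run({"lr": 3e-4, "batch": 64}, {"wer": 0.11})
log_run({"lr": 1e-4, "batch": 64}, {"wer": 0.09})
best = best_run("wer")  # lowest word-error-rate run wins
```

Tools like MLflow (listed in this role's stack) provide the same record-and-query loop with a UI, artifact storage, and concurrency handling on top.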

$150,000 – $250,000 per year (USD)

United States
Maybe global
Remote
Python
Kubernetes
Docker
MLflow
Scikit-learn

Sr. Engineering Manager, Machine Learning

New
Top rated
AKASA
Full-time
Posted

Lead a talented team of engineers focused on improving AKASA’s machine learning capabilities and delivering cutting-edge products. Supervise and directly contribute across all parts of the LLM stack, including model fine-tuning, inference, evaluation, and deployment. Develop infrastructure and tooling to improve the model development lifecycle. Oversee a high-performing team via hands-on contributions and coaching. Translate business requirements into technical designs that work within constraints such as latency, cost, performance, and uptime. Set the vision and direction for the team and attract top talent to join AKASA. Attend in-office co-working days every Wednesday as part of the local R&D team.

$230,000 – $310,000 per year (USD)

South San Francisco, United States
Maybe global
Hybrid
Python
PyTorch
Kubernetes
RAG
Model Evaluation

Engineering Manager, Go - Assist & Chat

New
Top rated
Grammarly
Full-time
Posted

Own the observability and lifecycle management of AI features across the organization. Build tools and infrastructure to enable teams to develop, monitor, and optimize LLM-powered features. Design and implement closed-loop evaluation pipelines that automatically validate prompt changes. Develop comprehensive metrics and dashboards to track LLM usage including cost per feature, token patterns, and latency. Create systems that tie user feedback to specific prompts and LLM calls. Establish best practices and processes for the full lifecycle of prompts, including development, testing, deployment, and monitoring. Collaborate with engineering teams across the organization to ensure they have the tools and visibility needed to build high-quality AI features.
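The "closed-loop evaluation pipeline that automatically validates prompt changes" described above reduces to running a candidate prompt against golden test cases and gating deployment on the pass rate. A minimal sketch (the render function, cases, and 0.9 threshold are all hypothetical stand-ins, not Grammarly's actual pipeline):

```python
def evaluate_prompt(render, cases, pass_threshold=0.9):
    """Run a prompt variant against golden cases and gate the change
    on the fraction of checks that pass.

    `render` stands in for a real LLM call with the candidate prompt;
    here it is any function mapping an input string to an output string.
    """
    passed = sum(1 for inp, check in cases if check(render(inp)))
    rate = passed / len(cases)
    return {"pass_rate": rate, "approved": rate >= pass_threshold}

# Toy stand-in for "call the model with the candidate prompt":
candidate = lambda text: text.strip().capitalize()

golden_cases = [
    ("  hello world", lambda out: out.startswith("Hello")),
    ("goodbye", lambda out: out == "Goodbye"),
]
report = evaluate_prompt(candidate, golden_cases)
```

Wiring this check into CI so a prompt change cannot merge below the threshold is what closes the loop.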

$103,000 – $174,000 per year (USD)

San Francisco
Maybe global
Onsite
Go
Kubernetes
Google Cloud
Docker
CI/CD

Compliance Program Manager

New
Top rated
Grammarly
Full-time
Posted

Own the observability and lifecycle management of AI features across the organization. Build tools and infrastructure to enable teams to develop, monitor, and optimize LLM-powered features. Design and implement closed-loop evaluation pipelines that automatically validate prompt changes. Develop comprehensive metrics and dashboards to track LLM usage, including cost per feature, token patterns, and latency. Create systems that connect user feedback to specific prompts and LLM calls. Establish best practices and processes for the full lifecycle of prompts: development, testing, deployment, and monitoring. Collaborate with engineering teams across the organization to ensure they have the tools and visibility needed to build high-quality AI features.

$103,000 – $174,000 per year (USD)

Ukraine
Maybe global
Remote
Go
Kubernetes
Google Cloud
MLOps
Observability

Head of Internal Tools Engineering

New
Top rated
Bjak
Full-time
Posted

The Head of Internal Tools Engineering is responsible for owning the end-to-end strategy and roadmap for all internal tools, platforms, and automation, treating internal technology as a product. They make strategic build-vs-buy decisions, map current and next-state process flows, and lead systems transformation for internal teams. They architect and maintain the full engineering lifecycle of internal platforms, build seamless API-first ecosystems integrating various internal systems, ensure system reliability and operational resilience, and design scalable, secure architectures using cloud-native principles and microservices. They lead AI strategy by integrating AI and LLMs into internal workflows and deploying intelligent automation tools. They reduce cognitive load for internal users by providing standardized workflows and self-service capabilities, measure platform success by adoption, satisfaction, and productivity impact, and build, lead, and mentor a high-performing engineering team. They cultivate a collaborative culture, provide technical mentorship, foster psychological safety, partner cross-functionally with leadership across departments, and align internal platform investments with company strategy while demonstrating measurable ROI.

Undisclosed


New York, United States
Maybe global
Remote
Python
AWS
GCP
Azure
Docker

Head of Internal Tools Engineering

New
Top rated
Bjak
Full-time
Posted

The role involves architecting, building, and scaling the internal technology ecosystem to accelerate workforce productivity, eliminate operational friction, and provide a compounding infrastructure advantage by treating internal tools with product rigor and user-centricity. Responsibilities include owning the end-to-end strategy and roadmap for all internal tools, platforms, and automation; making strategic build-vs-buy decisions; mapping current and next-state process flows and leading systems transformation. The role requires architecting and maintaining the full engineering lifecycle of internal platforms, building API-first ecosystems integrating with various business systems, owning system reliability and operational resilience, and designing scalable, secure cloud-native architectures. The role leads AI adoption and automation integration into internal workflows, including deploying intelligent automation tools, evaluating AI-assisted troubleshooting, and driving continuous experimentation with prototypes. The person will reduce cognitive load for internal users by providing golden paths and standardized workflows, ensuring frictionless onboarding, and measuring platform success via adoption rates, user satisfaction, DORA metrics, and productivity impact. Team leadership duties include building, leading, and mentoring engineers and managers, fostering a collaborative culture rooted in ownership, speed, craftsmanship, and psychological safety. The role partners cross-functionally with various company leadership teams to translate business needs into a unified technical vision, aligning internal platform investments with company strategy and demonstrating measurable ROI.

Undisclosed


Beijing, China
Maybe global
Remote
Python
AWS
GCP
Azure
CI/CD

Senior Analytics Engineer

New
Top rated
You.com
Full-time
Posted

Design and develop AI applications primarily in Python. Run evaluations to validate models, package solutions for Kubernetes or AWS, or adapt them to customer on-premises clusters. Lead discovery sessions, guide pilot projects, and ensure successful deployments. Collaborate mostly remotely with occasional on-site workshops. Monitor system performance and reliability. Contribute to the logging, billing, and auth services. Build internal tooling to automate repetitive tasks. Provide feedback on patterns, pain points, and reusable modules to the core product team to influence the future direction of the AI platform.

$165,000 – $200,000 per year (USD)

San Francisco, United States
Maybe global
Hybrid
Python
Kubernetes
AWS
LLM
Machine Learning

Solutions Architect (Dallas)

New
Top rated
LangChain
Full-time
Posted

The Solutions Architect is responsible for designing scalable, highly-available infrastructure for AI platform deployments including compute, storage, networking, security, enterprise integration patterns, Infrastructure as Code (Terraform, Helm), multi-region HA/DR strategies, and CI/CD pipelines. They design multi-agent systems using various patterns, implement agent logic using modern frameworks (langchain/langgraph), design evaluation frameworks, optimize prompts with A/B testing, and guide deployment and operations. The role involves leading technical maturity assessments, working directly with enterprise customers to understand requirements and present recommendations, and partnering with Engagement Managers and Product/Engineering teams.

$170,000 – $190,000 per year (USD)

Dallas, United States
Maybe global
Remote
Python
TypeScript
Kubernetes
AWS
GCP

Want to see more AI Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Need help with something? Here are our most frequently asked questions.


What are Kubernetes AI jobs?

Kubernetes AI jobs involve orchestrating containerized machine learning applications at scale. Professionals in these roles manage container deployment for AI workloads, distribute computational tasks across nodes for model training, allocate GPU resources efficiently, and automate ML pipelines. They typically work with frameworks like TensorFlow and PyTorch while ensuring high availability for production AI systems through automated scaling and self-healing capabilities.

What roles commonly require Kubernetes skills?

Roles requiring Kubernetes skills include Machine Learning Engineers who deploy models to production, MLOps Engineers working with platforms like Kubeflow, Data Engineers managing processing pipelines, Platform Engineers supporting agentic AI applications, DevOps/SRE professionals handling containerized deployments, and Cloud Architects designing scalable environments. These positions typically involve maintaining infrastructure that supports the complete machine learning lifecycle.

What skills are typically required alongside Kubernetes?

Alongside Kubernetes, employers typically look for container fundamentals (especially Docker), distributed systems knowledge, CI/CD pipeline experience, and cloud platform familiarity. Programming skills are essential for deployment scripts, while experience with ML frameworks like TensorFlow or PyTorch is valuable for AI-specific implementations. Understanding storage solutions, Kubernetes operators, and automated infrastructure management rounds out the typical skill requirements.

What experience level do Kubernetes AI jobs usually require?

Kubernetes AI jobs typically require mid to senior-level experience. Employers look for professionals who understand containerization concepts, have worked with distributed systems, and can manage complex ML workflows. Prior exposure to cloud environments where Kubernetes runs is important. Candidates should demonstrate practical experience with CI/CD pipelines and familiarity with at least one major ML framework.

What is the salary range for Kubernetes AI jobs?

Kubernetes AI jobs command competitive salaries due to the specialized intersection of container orchestration and machine learning skills. Compensation varies based on experience level, location, and specific industry. Roles requiring both strong AI expertise and Kubernetes infrastructure management typically offer premium compensation compared to general software engineering positions, reflecting the high market value of these combined skill sets.

Are Kubernetes AI jobs in demand?

Kubernetes AI jobs are in high demand as organizations increasingly adopt containerized applications for machine learning workloads. The growth is driven by enterprises scaling their AI operations, edge computing applications, and the need for platform-agnostic infrastructure. Companies seek professionals who can manage the complexity of distributed ML systems, particularly for high-availability production environments and automated ML pipelines.

What is the difference between Kubernetes and Docker in AI roles?

Docker creates containerized applications while Kubernetes orchestrates those containers at scale. In AI roles, Docker is used to package ML applications with their dependencies, while Kubernetes manages deployment across clusters, automates scaling during training, and handles resource allocation for GPUs. Docker provides consistency between environments, while Kubernetes adds critical production capabilities like load balancing, self-healing, and distributed computing for AI workloads.
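The GPU-allocation point in the Kubernetes-versus-Docker answer can be made concrete: on clusters running the NVIDIA device plugin, a container requests GPUs through the `nvidia.com/gpu` resource in its spec. A sketch that builds such a Job manifest in Python (the job and image names are placeholders):

```python
import json

def training_job(name, image, gpus=1):
    """Build a minimal Kubernetes batch/v1 Job manifest that requests
    GPUs for a containerized training workload."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "resources": {
                            # GPU requests go under limits; the NVIDIA
                            # device plugin exposes this resource name.
                            "limits": {"nvidia.com/gpu": gpus},
                        },
                    }],
                    "restartPolicy": "Never",
                }
            }
        },
    }

manifest = training_job("stt-train", "registry.example/stt:latest", gpus=2)
print(json.dumps(manifest, indent=2))
```

Docker builds and packages `registry.example/stt:latest`; Kubernetes takes a manifest like this and decides which GPU node actually runs it.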