Kubernetes AI Jobs

Discover the latest remote and onsite Kubernetes AI roles across top active AI companies. Updated hourly.

Check out 301 new Kubernetes AI job opportunities posted on The Homebase.

Senior AI Engineer

New
Top rated
Ryz Labs
Contractor
Full-time

The responsibilities include building agent-driven enrollment and parent communication pipelines that scale significantly without proportional headcount growth; creating and managing parallel simulations of students testing curriculum to identify gaps and generate improvements; developing automated culture and community agents for engagement, onboarding, and retention at machine scale; constructing real-time operational dashboards to provide leadership with visibility into various business aspects such as enrollment, academic progress, parent satisfaction, and campus operations; designing AI-first workflows for guides, advisors, and operational staff to reduce administrative burdens and refocus on students; building systems called Brainlifts to capture and compound institutional knowledge over time; and integrating these capabilities into Alpha's broader AI ecosystem including EPHOR, Alpha GPTs, and Fleet/Swarm infrastructure.

Undisclosed

Buenos Aires, Argentina
Maybe global
Remote
Python
Prompt Engineering
OpenAI API
MLOps
Docker

DevOps Engineer, Infrastructure & Security

New
Top rated
Scale AI
Full-time

The role involves taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. Responsibilities include overseeing the end-to-end health of the platform to ensure seamless integration between the AI core and all full-stack components, from APIs to UI, and maintaining a responsive, production-ready environment. The job also requires building automated systems to monitor model performance and data drift across geographically dispersed environments, managing the technical lifecycle within diverse regulatory frameworks, and leading the response to production issues in mission-critical environments, with rapid resolution and prevention of recurrence. Additionally, the role requires translating deep technical performance metrics into clear insights for senior international government officials and partnering with Engineering and ML teams to ensure lessons learned in the field influence the technical architecture and decisions of future use cases.

Undisclosed

San Francisco or New York, United States
Maybe global
Onsite
Kubernetes
Docker
AWS
Vector Databases
MLOps

Field Engineering Manager, Public Sector

New
Top rated
Scale AI
Full-time

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, support end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. Responsibilities include owning the production outcome with full accountability for long-term performance and reliability of AI use cases across international government agencies, ensuring full-stack integrity by overseeing all platform components from APIs to UI for a production-ready environment, building automated systems to monitor model performance and data drift across dispersed environments, managing the technical lifecycle within diverse regulatory frameworks, leading incident response in mission-critical environments with rapid resolution and prevention guardrails, translating technical performance metrics into clear insights for senior government officials, and partnering with engineering and ML teams to influence the technical architecture and decisions for future AI use cases.

Undisclosed

San Francisco or St. Louis or New York or Washington, United States
Maybe global
Onsite
Python
Kubernetes
MLOps
Vector Databases
Prompt Engineering

Full Stack Engineer

New
Top rated
Agent
Full-time

Build and maintain features for the web-based property management platform using TypeScript, React, Node.js, PostgreSQL, and AWS. Contribute to a monorepo architecture, working within two-week sprint cycles to deliver high-quality code. Implement integrations including DocuSign, Plaid, Stripe, and ownership group payout systems. Optimize platform performance and user experience by replacing legacy systems. Build and integrate AI agents using Claude and other AI APIs to automate organizational processes, developing API integrations and custom agents. Collaborate with the CEO on prioritizing automation opportunities. Take ownership of tasks, independently research and implement solutions to challenges, proactively identify and implement improvements, and contribute ideas to platform architecture and development priorities.

$2,800 – $3,500 / month (USD)

Buenos Aires, Argentina
Maybe global
Remote
TypeScript
JavaScript
AWS
CI/CD
Docker

Senior Software Engineer, Agents

New
Top rated
Decagon
Full-time

Design and build AI agents that outperform human agents in managing complex customer interactions and driving customer retention. Identify cross-customer trends that guide the evolution of Decagon’s agent building platform and research efforts. Experiment with and run evaluations on the latest text and voice models, then integrate them at scale with large enterprise-grade customers.

$250,000 – $350,000 / year (USD)

San Francisco, United States
Maybe global
Onsite
Python
JavaScript
TypeScript
Prompt Engineering
Model Evaluation

Member of Technical Staff - ML Engineering

New
Top rated
Talent Labs
Full-time

Deploy, maintain, and optimize production and research compute clusters. Design and implement scalable and efficient ML inference solutions. Develop dynamic and heterogeneous compute solutions for balancing research and production needs. Contribute to productizing model APIs for external use. Develop infrastructure observability and monitoring solutions.

Undisclosed

London, United Kingdom
Maybe global
Remote
Kubernetes
AWS
GCP
Azure
PyTorch

C++ Systems Engineer

New
Top rated
LM Studio
Full-time

Design, build, and optimize the core native runtime powering LM Studio and the C++ libraries powering the app and APIs. Work across runtime, LLM engines, llama.cpp/MLX integrations, build infrastructure, and on-device AI software. Focus on system and library integration by wiring the C++ runtime to GPU backends, vendor SDKs, and operating-system services to support user-facing applications. Implement and harden system-level code involving threading, memory, files, IPC, and scheduling. Integrate platform acceleration paths such as Metal, CUDA, and Vulkan across macOS, Windows, and Linux. Profile, debug, and tune execution paths to ensure fast, dependable local AI and maintainable software. Contribute to the C++ runtime powering LM Studio, extend LLM engine integrations, and build platform-aware performance features for desktop OS. Implement resilient IPC, resource management, and scheduling logic to support concurrent model execution. Improve build, packaging, and release infrastructure for native components. Collaborate with the team to deliver cohesive and recognizable user experiences.

$175,000 – $275,000 / year (USD)

New York City, United States
Maybe global
Onsite
C++
Python
Docker
CI/CD
Kubernetes

Research Engineer – Benchmarking, Evals & Failure Analysis

New
Top rated
Mercor
Full-time

As a Research Engineer at Mercor, you will own benchmarking pipelines, evaluation systems, and failure analysis workflows that directly inform how frontier language models are trained and improved. You will design, implement, and maintain benchmarks and metrics for tool use, agentic behavior, and real-world reasoning, ensuring they scale with training and align with product and research goals. You will build and operate LLM evaluation systems including runs, scoring, dashboards, and reporting to allow tracking and comparison of model performance at scale. You will conduct systematic failure analysis on model outputs, categorize failure modes, quantify their prevalence, and use these insights to influence reward design, data curation, and benchmark design. Additionally, you will create and refine rubrics, automated evaluators, and scoring frameworks that influence training and evaluation decisions, balancing rigor and scalability. You will quantify data usability and quality, guide data generation, augmentation, and curation based on evaluations and failure analysis. Collaboration with AI researchers, applied AI teams, and data producers to align evaluations with training objectives and prioritize important benchmarks and failure analyses is expected. Finally, you will operate with strong ownership in a fast-paced, high-iteration research environment.

$130,000 – $500,000 / year (USD)

San Francisco, United States
Maybe global
Onsite
Python
MLflow
Docker
Kubernetes
AWS

AI Evaluation Engineer

New
Top rated
Ryz Labs
Contractor
Full-time

Design and implement evaluation pipelines to measure the performance and reliability of AI models, and develop automated testing frameworks to assess model outputs at scale. Analyze model performance using both traditional statistical metrics and AI-specific evaluation methods. Evaluate AI systems built on modern architectures such as LLM-based applications and Retrieval-Augmented Generation (RAG). Identify potential issues related to accuracy, hallucinations, bias, safety, and model drift, and conduct adversarial testing to uncover vulnerabilities and ensure safe model behavior. Collaborate with engineering and AI teams to improve prompt design, model outputs, and system performance. Monitor model performance in production, and help define best practices for AI evaluation and observability.

Undisclosed

Argentina
Maybe global
Remote
Python
Prompt Engineering
Model Evaluation
RAG
LangChain

Engineering Leader

New
Top rated
Ema
Full-time

As an Engineering Leader at Ema, you will build and lead a high-performance engineering organization by recruiting, hiring, and developing senior engineers across multiple sub-teams including cloud infrastructure, data platform, ML operations, and developer experience. You will establish engineering standards, a code review culture, on-call expectations, and promote a bias-toward-shipping mentality balanced with production rigor. You will coach and grow senior and staff engineers into technical leaders and manage engineering managers as the organization scales. Your responsibilities include setting the 6–18 month platform roadmap in partnership with engineering teams, making critical architectural decisions such as build versus buy and migration strategies, and driving cross-functional alignment with product, ML/AI research, and go-to-market teams. You will own production health for all platform services, including incident response, postmortems, SLO tracking, and capacity planning. Additionally, you will establish and refine engineering practices to maintain fast shipping without compromising reliability, and participate in executive-level reviews related to infrastructure spend, system health, and engineering velocity.

Undisclosed

Bengaluru, India
Maybe global
Onsite
Go
Python
Kubernetes
AWS
GCP

Want to see more AI Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Need help with something? Here are our most frequently asked questions.

What are Kubernetes AI jobs?

Kubernetes AI jobs involve orchestrating containerized machine learning applications at scale. Professionals in these roles manage container deployment for AI workloads, distribute computational tasks across nodes for model training, allocate GPU resources efficiently, and automate ML pipelines. They typically work with frameworks like TensorFlow and PyTorch while ensuring high availability for production AI systems through automated scaling and self-healing capabilities.

What roles commonly require Kubernetes skills?

Roles requiring Kubernetes skills include Machine Learning Engineers who deploy models to production, MLOps Engineers working with platforms like Kubeflow, Data Engineers managing processing pipelines, Platform Engineers supporting agentic AI applications, DevOps/SRE professionals handling containerized deployments, and Cloud Architects designing scalable environments. These positions typically involve maintaining infrastructure that supports the complete machine learning lifecycle.

What skills are typically required alongside Kubernetes?

Alongside Kubernetes, employers typically look for container fundamentals (especially Docker), distributed systems knowledge, CI/CD pipeline experience, and cloud platform familiarity. Programming skills are essential for deployment scripts, while experience with ML frameworks like TensorFlow or PyTorch is valuable for AI-specific implementations. Understanding storage solutions, Kubernetes operators, and automated infrastructure management rounds out the typical skill requirements.

What experience level do Kubernetes AI jobs usually require?

Kubernetes AI jobs typically require mid to senior-level experience. Employers look for professionals who understand containerization concepts, have worked with distributed systems, and can manage complex ML workflows. Prior exposure to cloud environments where Kubernetes runs is important. Candidates should demonstrate practical experience with CI/CD pipelines and familiarity with at least one major ML framework.

What is the salary range for Kubernetes AI jobs?

Kubernetes AI jobs command competitive salaries due to the specialized intersection of container orchestration and machine learning skills. Compensation varies based on experience level, location, and specific industry. Roles requiring both strong AI expertise and Kubernetes infrastructure management typically offer premium compensation compared to general software engineering positions, reflecting the high market value of these combined skill sets.

Are Kubernetes AI jobs in demand?

Kubernetes AI jobs are in high demand as organizations increasingly adopt containerized applications for machine learning workloads. The growth is driven by enterprises scaling their AI operations, edge computing applications, and the need for platform-agnostic infrastructure. Companies seek professionals who can manage the complexity of distributed ML systems, particularly for high-availability production environments and automated ML pipelines.

What is the difference between Kubernetes and Docker in AI roles?

Docker creates containerized applications while Kubernetes orchestrates those containers at scale. In AI roles, Docker is used to package ML applications with their dependencies, while Kubernetes manages deployment across clusters, automates scaling during training, and handles resource allocation for GPUs. Docker provides consistency between environments, while Kubernetes adds critical production capabilities like load balancing, self-healing, and distributed computing for AI workloads.
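
To make the Docker and Kubernetes split concrete, here is a minimal sketch of the Kubernetes side: a Pod manifest that runs a Docker-built image and asks the scheduler for one GPU. The function name, Pod name, and image URL are hypothetical, and the `nvidia.com/gpu` resource name assumes the cluster runs the NVIDIA device plugin. Other GPU vendors expose different resource names.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest for one container with a GPU limit.

    The image is whatever Docker packaged; Kubernetes only schedules it.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Run the training container once rather than restarting it.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    # Assumption: NVIDIA device plugin is installed, so the
                    # extended resource is named "nvidia.com/gpu".
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical Pod name and image, used only for illustration.
manifest = gpu_pod_manifest("train-job", "registry.example.com/ml/trainer:latest")
print(json.dumps(manifest, indent=2))
```

Kubernetes accepts manifests as JSON as well as YAML, so the printed output could be saved to a file and applied with `kubectl apply -f`. In practice, teams often wrap this in a Job or a higher-level operator rather than a bare Pod.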