AI Infrastructure Engineer Jobs

Discover the latest remote and onsite AI Infrastructure Engineer roles across top active AI companies. Updated hourly.

Check out 16 new AI Infrastructure Engineer opportunities posted on AI Chopping Block

Research Intern, Inference (Fall 2026)

New
Top rated
Together AI
Full-time
Posted

As an AI Infrastructure Engineer at Together, the responsibilities include participating in on-call rotation to respond to production incidents, building and running infrastructure using Ansible, Terraform, and Kubernetes to support scaling to a large number of concurrent users, building monitoring systems to ensure high-quality service, designing and implementing operational processes such as deployments and upgrades, debugging production issues across all services and stack levels, identifying improvements for product architecture in terms of reliability, performance, and availability, and planning the growth of Together AI's infrastructure.

$190,000 – $270,000 per year (USD)

Maybe global

AI Builder Intern

New
Top rated
Scale AI
Full-time
Posted

The Production AI Ops Lead is responsible for designing and developing the production lifecycle of full-stack AI applications, supporting system reliability, real-time inference observability, sovereign data orchestration, secure software integration, and resilient cloud infrastructure for international government partners. They own the production outcome, taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. They oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components from APIs to UI, maintaining a responsive and production-ready environment. The role involves building automated systems to monitor model performance and data drift across geographically dispersed environments to ensure reliability, managing the technical lifecycle within diverse regulatory frameworks, and leading incident response for production issues in mission-critical environments to ensure rapid resolution and prevent recurrence. The lead also translates technical performance metrics into clear insights for senior international government officials and partners with Engineering and ML teams to influence the technical architecture and decisions of future AI use cases.

Undisclosed


San Francisco or New York
Maybe global
Onsite

Harness Engineer

New
Top rated
Nomic AI
Full-time
Posted

As a Harness Engineer, you will work on the systems that make AI agents effective, focusing on how they find information, assemble context, verify their operation, and improve over time. Responsibilities include developing retrieval systems such as search, ranking, chunking strategies, and hybrid approaches tailored to the problem; context engineering to assemble the right information for agents working with large, heterogeneous document sets; building infrastructure for continuous evaluation of agent accuracy and regression testing of retrieval quality, implementing and maintaining feedback loops; creating and managing agent pipelines that orchestrate between retrieval, models, and downstream actions; and ensuring these systems scale to operate efficiently across thousands of customer document collections, beyond just demo corpora.

Undisclosed


New York City, United States
Maybe global
Remote

Distributed LLM Inference Engineer

New
Top rated
Anyscale
Full-time
Posted

As a Distributed LLM Inference Engineer at Anyscale, you will help build systems and optimizations that push the boundaries of performance for inference at large scale. Responsibilities include iterating quickly with product teams to ship end-to-end solutions for Batch and Online inference at high scale for open-source Ray users and Anyscale customers; working across the stack to integrate Ray Data and the LLM engine, providing optimizations that achieve low-cost solutions for large-scale ML inference; integrating with open-source software like vLLM, working closely with the community to adopt these techniques in Anyscale solutions, and contributing improvements back to open source; and following the latest state-of-the-art developments in the open-source and research communities, implementing and extending best practices.

$170,112 – $247,000 per year (USD)

San Francisco
Maybe global
Remote

AI Deployment Strategist, Enterprise

New
Top rated
Scale AI
Full-time
Posted

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies, oversee the end-to-end health of the platform ensuring seamless integration between the AI core and all full-stack components, build automated systems to monitor model performance and data drift across geographically dispersed environments, manage the technical lifecycle within diverse regulatory frameworks, lead incident response for production issues in mission-critical environments ensuring rapid resolution and prevention, translate deep technical performance metrics into clear insights for senior international government officials, and partner with Engineering and ML teams to ensure field lessons learned influence future technical architecture and decisions.

Undisclosed


San Francisco or New York, United States
Maybe global
Onsite

AI Infrastructure Supply Chain Lead

New
Top rated
Armada
Full-time
Posted

The AI Infrastructure Supply Chain Lead is responsible for translating business requirements into requirements for AI/ML models, preparing data to train and evaluate AI/ML/DL models, building AI/ML/DL models using state-of-the-art algorithms such as transformers, testing and evaluating model quality, publishing models, data sets, and evaluations, deploying models in production by containerizing them, working with customers and internal employees to refine model quality, establishing continuous learning pipelines for models with online or transfer learning, and building and deploying containerized applications on cloud or on-premise environments.

$154,560 – $193,200 per year (USD)

Bellevue, United States
Maybe global
Onsite

Staff Machine Learning Engineer, Voice AI

New
Top rated
Together AI
Full-time
Posted

As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. Your responsibilities include participating in an on-call rotation to respond to production incidents, building and running infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users, building monitoring systems to ensure the highest quality service for customers, designing and implementing operational processes such as deployments and upgrades, debugging production issues across all services and levels of the stack, identifying improvements for product architecture in reliability, performance, and availability, and planning the growth of Together AI's infrastructure.

$190,000 – $270,000 per year (USD)

San Francisco
Maybe global
Onsite

AI Systems Engineer, Codex Agents

New
Top rated
OpenAI
Full-time
Posted

Design and build the core agent harness and execution loop that lets Codex agents interpret model outputs, use tools, execute code, and complete long-horizon tasks safely. Build sandboxing, isolation, orchestration, state, and workflow infrastructure for agents operating in real development environments. Develop evaluation, experimentation, and debugging systems that distinguish harness issues, model behavior, inference/runtime issues, and product failures. Run ablations across prompts, model-facing interfaces, context construction, tool-use strategies, and harness behavior to improve solve rate, reliability, latency, and cost. Improve observability, profiling, and diagnostics across the agent stack, from backend systems to inference, GPUs, and fleet capacity. Work closely with research to make the harness trainable, measurable, and useful for improving frontier agentic models. Build shared primitives that make Codex faster, safer, more reliable, and easier for other teams and open-source users to build on.

$230,000 – $385,000 per year (USD)

San Francisco, United States
Maybe global
Onsite

Sr. Revenue Accountant

New
Top rated
Together AI
Full-time
Posted

Participate in on-call rotation (PagerDuty) to respond to production incidents. Build and run infrastructure using Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users. Build monitoring systems to ensure the highest quality service for customers. Design and implement operational processes such as deployments and upgrades. Debug production issues across all services and levels of the stack. Identify improvements for the product architecture from reliability, performance, and availability perspectives. Plan the growth of Together AI's infrastructure.

$190,000 – $270,000 per year (USD)

San Francisco
Maybe global
Onsite

Infrastructure Accounting Manager

New
Top rated
Together AI
Full-time
Posted

Participate in on-call rotation (PagerDuty) to respond to production incidents; build and run infrastructure using Ansible, Terraform, and Kubernetes to enable scaling to a large number of concurrent users; build monitoring systems to ensure high-quality service; design and implement operational processes such as deployments and upgrades; debug production issues across all services and stack levels; identify improvements for product architecture focusing on reliability, performance, and availability; and plan the growth of Together AI's infrastructure.

$190,000 – $270,000 per year (USD)

San Francisco
Maybe global
Onsite

Want to see more AI Infrastructure Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Have questions about roles, locations, or requirements for AI Infrastructure Engineer jobs?


What does an AI Infrastructure Engineer do?

AI Infrastructure Engineers design and build the systems that power machine learning workloads. They optimize performance by resolving bottlenecks, implement scaling solutions through load balancing and redundancy, and deploy cloud infrastructure specifically for AI applications. These specialists build fault-tolerant systems for serving large language models, maintain continuous integration pipelines, and collaborate with AI teams to translate research needs into production-ready infrastructure.

What skills are required for an AI Infrastructure Engineer?

Key skills for this role include proficiency with cloud platforms (AWS SageMaker, Azure ML, Vertex AI), infrastructure as code tools like Terraform, and containerization technologies such as Docker and Kubernetes. Strong programming abilities in Python, Go, or C++ are essential, with CUDA knowledge for GPU optimization. Experience with monitoring tools (Prometheus, Grafana), distributed systems, deep learning frameworks, and Linux/UNIX environments is highly valued in candidates.

What qualifications are needed for an AI Infrastructure Engineer role?

Employers typically require a bachelor's degree in Computer Science, AI, Machine Learning, or a related technical field. Most positions demand 4+ years of experience in cloud infrastructure, large-scale systems, or software engineering with an infrastructure focus. Practical expertise in cloud computing, Linux administration, network architecture, and container technologies is essential. Specialized knowledge of GPU programming, distributed systems, and LLM serving strengthens applications considerably.

What is the salary range for an AI Infrastructure Engineer job?

Compensation typically varies based on location, experience level, company size, and the specific technical skills required. Among the listings on this page that disclose pay, ranges run from $154,560 to $385,000 per year (USD). As this role combines specialized AI knowledge with infrastructure expertise, salaries generally reflect the high demand for professionals who can effectively build and optimize systems for machine learning workloads at scale.

How long does it take to get hired as an AI Infrastructure Engineer?

The length of the hiring process varies by company and often includes technical assessments of cloud architecture knowledge, infrastructure as code experience, and machine learning operations skills. Given the specialized nature of AI infrastructure roles and their typical requirement of 4+ years of relevant experience, candidates should expect thorough evaluation of their technical capabilities and problem-solving abilities.

Are AI Infrastructure Engineer jobs in demand?

Yes, AI Infrastructure Engineer positions show strong demand signals. Major companies like Accenture, Scale AI, and Zoom are actively recruiting for these specialized roles. The increasing deployment of large language models and AI applications across industries creates a consistent need for professionals who can build optimized infrastructure. The specialized skill intersection of cloud platforms, containerization, GPU optimization, and machine learning operations makes qualified candidates particularly valuable in today's job market.