AI Infrastructure Engineer Jobs

Discover the latest remote and onsite AI Infrastructure Engineer roles across top active AI companies. Updated hourly.

Check out 16 new AI Infrastructure Engineer opportunities posted on AI Chopping Block

Member of Technical Staff - Sandbox Platform

New
Top rated
Prime Intellect
Full-time

The role involves working on both the underlying sandbox infrastructure that powers training systems and the developer-facing platform for AI workload management. Responsibilities include designing and implementing distributed orchestration infrastructure in Go and Rust, building high-performance networking and coordination components, creating infrastructure automation pipelines with Ansible, managing cloud resources and container orchestration, and implementing scheduling systems for heterogeneous hardware (CPU, GPU, TPU). Additionally, the role requires building intuitive web interfaces for AI workload management and monitoring, developing REST APIs and backend services in Python, creating real-time monitoring and debugging tools, and implementing user-facing features for resource management and job control.

$150,000 – $300,000 per year (USD)

San Francisco, United States
Maybe global
Hybrid

Technical Lead Manager, Physical AI

New
Top rated
Scale AI
Full-time

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, while supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and the resilient cloud infrastructure required for international government partners. You will own the production outcome by taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will ensure full-stack integrity by overseeing the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment. You will scale the feedback loop by building automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability. You will navigate global compliance by managing the technical lifecycle within diverse regulatory frameworks. You will lead incident command for production issues in mission-critical environments, ensuring rapid resolution and building guardrails to prevent recurrence. You will bridge the gap by translating deep technical performance metrics into clear insights for senior international government officials. Finally, you will drive product evolution by partnering with Engineering and ML teams to ensure lessons from the field directly influence the technical architecture and decisions of future use cases.

Undisclosed

San Francisco
Maybe global
Onsite

Forward Deployed Engineer (Inference & Post-Training)

New
Top rated
Together AI
Full-time

As an AI Infrastructure Engineer, you are responsible for keeping all user-facing services and production systems running smoothly. Your duties include participating in an on-call rotation (PagerDuty) to respond to production incidents, building and running infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users, building monitoring systems to ensure the highest quality service for customers, designing and implementing operational processes such as deployments and upgrades, debugging production issues across all services and levels of the stack, identifying improvements for the product architecture from reliability, performance, and availability perspectives, and planning the growth of Together AI’s infrastructure.

$190,000 – $270,000 per year (USD)

San Francisco
Maybe global
Onsite

Staff Engineer, Customer Insights

New
Top rated
Together AI
Full-time

Participate in an on-call rotation (PagerDuty) to respond to production incidents. Build and run infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users. Build monitoring systems to ensure the highest quality service for customers. Design and implement operational processes such as deployments and upgrades. Debug production issues across all services and levels of the stack. Identify improvements for the product architecture from reliability, performance, and availability perspectives. Plan the growth of Together AI’s infrastructure.

$190,000 – $270,000 per year (USD)

San Francisco
Maybe global
Onsite

Full-Chip Physical Design Verification Engineer

New
Top rated
Tenstorrent
Full-time

Lead and contribute to cross-functional efforts solving complex physical design challenges across IPs, projects, and advanced technology nodes. Develop and enhance RTL-to-GDS methodologies, including floorplanning, synthesis, place and route (P&R), static timing analysis (STA), signoff, and assembly. Architect and deploy AI/ML-driven solutions in production physical design flows to improve engineering efficiency, turnaround time, and quality of results (QoR). Optimize electronic design automation (EDA) tools and custom CAD flows using data-driven and machine learning (ML) techniques, collaborating closely with verification, extraction, timing, design for test (DFT), and EDA vendors.

$100,000 – $500,000 per year (USD)

Austin or Fort Collins or Santa Clara, United States
Maybe global
Hybrid

Member of Technical Staff, Inference (Paris, London)

New
Top rated
Genesis AI
Full-time

Build low-latency inference pipelines for on-device deployment to enable real-time next-token and diffusion-based control loops in robotics. Design and optimize distributed inference systems on GPU clusters to increase throughput with large-batch serving and efficient resource utilization. Implement efficient low-level code such as CUDA, Triton, and custom kernels, integrating it seamlessly into high-level frameworks. Optimize workloads for both throughput and latency by managing batching, scheduling, quantization, caching, memory management, and graph compilation. Develop monitoring and debugging tools to ensure reliability, determinism, and rapid diagnosis of regressions across both software stacks.

Undisclosed

Paris, France
Maybe global
Remote

AI Engineer - Model Performance

New
Top rated
Fathom
Full-time

The Model Performance Engineer owns the speed, cost, and reliability of the model inference stack, optimizing real systems that serve millions of meetings. The work includes weighing quantization trade-offs, debugging speculative decoding, managing GPU performance under high concurrency, and improving inference performance through serving configuration, GPU selection, batching strategies, cold-start mitigation, and adapter swapping. The engineer also builds repeatable fine-tuning pipelines so the AI team can quickly produce optimized tunes for new tasks and move faster from dataset to deployed model. Additional responsibilities include benchmarking quantization methods, evaluating serving frameworks for inference quality and speed, optimizing GPU spend by selecting appropriate GPU types and managing cost trade-offs, and debugging production inference issues related to serving frameworks and multimodal pipelines.

Undisclosed

San Francisco, United States
Maybe global
Remote

Senior AI Infrastructure Engineer - Training Platform

New
Top rated
Scale AI
Full-time

As a Production AI Ops Lead at Scale, you will design and develop the production lifecycle of full-stack AI applications, support end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies, oversee the end-to-end health of the platform ensuring seamless integration between the AI core and all full-stack components, build automated systems to monitor model performance and data drift across dispersed environments, manage the technical lifecycle within diverse regulatory frameworks, lead the response for production issues in mission-critical environments ensuring rapid resolution, translate deep technical performance metrics into clear insights for senior international government officials, and partner with Engineering and ML teams to drive product evolution based on lessons learned in the field.

Undisclosed

San Francisco or Seattle or New York, United States
Maybe global
Onsite

Mixed-Signal IC Layout Design Engineer

New
Top rated
Tenstorrent
Full-time

Lead and contribute to cross-functional efforts solving complex physical design challenges across IPs, projects, and advanced technology nodes. Develop and enhance RTL-to-GDS methodologies, including floorplanning, synthesis, P&R, STA, signoff, and assembly. Architect and deploy AI/ML-driven solutions in production flows to improve engineering efficiency, turnaround time, and quality of results (QoR). Optimize EDA tools and custom CAD flows using data-driven and ML-based techniques, collaborating closely with verification, extraction, timing, DFT, and EDA vendors.

$100,000 – $500,000 per year (USD)

Santa Clara or Austin or Fort Collins, United States
Maybe global
Hybrid

AI Factory, Value Engineer

New
Top rated
Armada
Full-time

Responsibilities include translating business requirements into requirements for AI/ML models, preparing data to train and evaluate AI/ML/DL models, building AI/ML/DL models using state-of-the-art algorithms, especially transformers, testing and evaluating models, benchmarking quality, publishing models and datasets, containerizing models and deploying them to production, working with customers and internal employees to refine model quality, establishing continuous learning pipelines with online or transfer learning, and building and deploying containerized applications in cloud or on-premises environments.

$154,560 – $193,200 per year (USD)

United States
Maybe global
Remote

Want to see more AI Infrastructure Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Have questions about roles, locations, or requirements for AI Infrastructure Engineer jobs?


What does an AI Infrastructure Engineer do?

AI Infrastructure Engineers design and build the systems that power machine learning workloads. They optimize performance by resolving bottlenecks, implement scaling solutions through load balancing and redundancy, and deploy cloud infrastructure specifically for AI applications. These specialists build fault-tolerant systems for serving large language models, maintain continuous integration pipelines, and collaborate with AI teams to translate research needs into production-ready infrastructure.

What skills are required for an AI Infrastructure Engineer?

Key skills for this role include proficiency with cloud platforms (AWS SageMaker, Azure ML, Vertex AI), infrastructure as code tools like Terraform, and containerization technologies such as Docker and Kubernetes. Strong programming abilities in Python, Go, or C++ are essential, with CUDA knowledge for GPU optimization. Experience with monitoring tools (Prometheus, Grafana), distributed systems, deep learning frameworks, and Linux/UNIX environments is highly valued in candidates.

What qualifications are needed for an AI Infrastructure Engineer role?

Employers typically require a bachelor's degree in Computer Science, AI, Machine Learning, or a related technical field. Most positions demand 4+ years of experience in cloud infrastructure, large-scale systems, or software engineering with an infrastructure focus. Practical expertise in cloud computing, Linux administration, network architecture, and container technologies is essential. Specialized knowledge in GPU programming, distributed systems, and LLM serving strengthens applications considerably.

What is the salary range for an AI Infrastructure Engineer job?

Among the listings above that disclose pay, ranges run from $100,000 to $500,000 per year (USD). Compensation typically varies based on location, experience level, company size, and the specific technical skills required. As this role combines specialized AI knowledge with infrastructure expertise, salaries generally reflect the high demand for professionals who can effectively build and optimize systems for machine learning workloads at scale.

How long does it take to get hired as an AI Infrastructure Engineer?

Hiring timelines vary by company. The process often includes technical assessments of cloud architecture knowledge, infrastructure as code experience, and machine learning operations skills. Given the specialized nature of AI infrastructure roles and their typical requirement of 4+ years of relevant experience, candidates should expect thorough evaluation of their technical capabilities and problem-solving abilities.

Are AI Infrastructure Engineer jobs in demand?

Yes, AI Infrastructure Engineer positions show strong demand signals. Major companies like Accenture, Scale AI, and Zoom are actively recruiting for these specialized roles. The increasing deployment of large language models and AI applications across industries creates consistent need for professionals who can build optimized infrastructure. The specialized skill intersection of cloud platforms, containerization, GPU optimization, and machine learning operations makes qualified candidates particularly valuable in today's job market.