AI Infrastructure Engineer Jobs

Discover the latest remote and onsite AI Infrastructure Engineer roles at top AI companies that are actively hiring. Updated hourly.

Check out 13 new AI Infrastructure Engineer opportunities posted on The Homebase

Inference Technical Lead, On-Device Transformers

New
Top rated
OpenAI
Full-time

As a Technical Lead on the Future of Computing Research team, you will evaluate and select silicon platforms such as GPUs, NPUs, and specialized accelerators for on-device and edge deployment of OpenAI models. You will work closely with research teams to co-design model architectures that meet real-world deployment constraints including latency, memory, power, and bandwidth. You will analyze and model system performance, identifying tradeoffs between model design, memory hierarchy, compute throughput, and hardware capabilities. You will partner with hardware vendors and internal infrastructure teams to bring up new accelerators and ensure efficient execution of transformer workloads. Additionally, you will build and lead a team of engineers responsible for implementing the low-level inference stack, including kernel development and runtime systems. You will also mature nascent research capabilities into production-ready ones.
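The latency and memory tradeoffs this role analyzes can be sketched with a back-of-envelope roofline estimate. The hardware and model numbers below are hypothetical, chosen only to show the arithmetic, not any specific accelerator's spec sheet:

```python
# Back-of-envelope roofline estimate for one transformer decode step.
# Autoregressive decode at batch size 1 must stream every weight once per
# token, so it is usually memory-bandwidth bound rather than compute bound.

def decode_step_latency_ms(params_bytes, flops_per_token,
                           mem_bw_gbps, peak_tflops):
    """Return (memory-bound ms, compute-bound ms) for one decoded token."""
    mem_ms = params_bytes / (mem_bw_gbps * 1e9) * 1e3
    compute_ms = flops_per_token / (peak_tflops * 1e12) * 1e3
    return mem_ms, compute_ms

# Hypothetical 7B-parameter model in int8 (1 byte/param, ~2 FLOPs/param/token)
# on an edge accelerator with 100 GB/s DRAM bandwidth and 20 TFLOPS peak.
mem_ms, compute_ms = decode_step_latency_ms(
    params_bytes=7e9, flops_per_token=14e9, mem_bw_gbps=100, peak_tflops=20)
print(f"memory-bound: {mem_ms:.1f} ms/token, compute-bound: {compute_ms:.2f} ms/token")
# → memory-bound: 70.0 ms/token, compute-bound: 0.70 ms/token
```

The large gap between the two bounds is exactly why on-device work leans on quantization and memory-hierarchy co-design rather than raw FLOPs.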

$445,000 / year (USD)

San Francisco, United States
Hybrid

AI & IT Systems Engineer

New
Top rated
Jasper
Full-time

As Jasper undergoes an agentic AI shift, the AI & IT Systems Engineer ensures that IT infrastructure is robust, secure, and fine-tuned for advanced AI workflows, spending 70–80% of the role's time on AI enablement deployments. Responsibilities include modernizing and improving IT systems to support autonomous AI workflows, building scalable automation infrastructure to improve efficiency and reduce manual tasks, and operationalizing AI initiatives using tools like Claude, ChatGPT, and Zapier to create intelligent, cross-platform workflows spanning Google Workspace and Slack. The role also covers managing core IT systems such as identity providers and mobile device management, streamlining identity and access operations using features like Okta Workflows, and providing cross-functional technical support across departments to implement AI enablement projects. Additionally, the engineer manages a broad SaaS ecosystem, including Google Workspace and Linear, and helps develop training resources and playbooks that ease team adoption of new AI tools.

$135,000 – $155,000 / year (USD)

United States
Remote

Customer Support Engineer (Inference), India

New
Top rated
Together AI
Full-time

Advance inference efficiency end-to-end by designing and prototyping algorithms, architectures, and scheduling strategies for low-latency, high-throughput inference. Implement and maintain changes in high-performance inference engines, including kernel backends, speculative decoding, and quantization. Profile and optimize performance across GPU, networking, and memory layers to improve latency, throughput, and cost. Design and operate RL and post-training pipelines, jointly optimizing algorithms and systems to make inference and post-training workloads more efficient. Train, evaluate, and iterate on frontier models using these pipelines. Co-design algorithms and infrastructure to tightly couple training objectives, rollout collection, and evaluation with efficient inference. Identify bottlenecks across the training engine, inference engine, data pipeline, and user-facing layers. Run ablations and scale-up experiments to understand trade-offs between model quality, latency, throughput, and cost, feeding insights back into model, RL, and system design. Profile, debug, and optimize inference and post-training services under real production workloads. Drive roadmap items requiring engine modification such as changing kernels, memory layouts, scheduling logic, and APIs. Establish metrics, benchmarks, and experimentation frameworks to rigorously validate improvements. Provide technical leadership by setting technical direction for cross-team efforts at the intersection of inference, RL, and post-training, and mentoring other engineers and researchers on full-stack ML systems work and performance engineering.
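One of the techniques named above, weight quantization, can be illustrated in miniature. This is a hedged sketch in pure Python using a single per-tensor symmetric scale; real inference engines quantize per-channel over tensors and fuse dequantization into the matmul kernels:

```python
# Minimal sketch of symmetric int8 weight quantization.

def quantize_int8(weights):
    """Map floats to int8 with one symmetric scale; return (ints, scale)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [x * scale for x in q]

w = [0.5, -1.27, 0.03, 1.0]
q, s = quantize_int8(w)
err = max(abs(a - b) for a, b in zip(w, dequantize(q, s)))
print(q)  # → [50, -127, 3, 100]
```

The round-trip error here is tiny because the weights are well spread; in practice outlier values force per-channel scales or mixed-precision handling.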

$200,000 – $280,000 / year (USD)

India
Onsite

Helix AI Engineer, Agentic Systems

New
Top rated
Figure AI
Full-time

Design, deploy, and maintain Figure's training clusters. Architect and maintain scalable deep learning frameworks for training on massive robot datasets. Work together with AI researchers to implement training of new model architectures at a large scale. Implement distributed training and parallelization strategies to reduce model development cycles. Implement tooling for data processing, model experimentation, and continuous integration.
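The data-parallel strategy mentioned above shortens development cycles by sharding each batch across workers and averaging their gradients with an all-reduce. A toy pure-Python sketch of that arithmetic, using a hypothetical 1-D linear model rather than a real framework:

```python
# Toy data-parallel step: each "worker" computes a gradient on its shard,
# then an all-reduce (here simply a mean) synchronizes them.
# Hypothetical model y = w*x with squared-error loss.

def local_gradient(w, shard):
    """d/dw of mean squared error over one worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    """Stand-in for an all-reduce across workers."""
    return sum(grads) / len(grads)

def data_parallel_step(w, shards, lr=0.1):
    grads = [local_gradient(w, s) for s in shards]   # computed in parallel
    return w - lr * all_reduce_mean(grads)           # synchronized update

# Data generated from y = 2x, split across two workers.
shards = [[(1, 2), (2, 4)], [(3, 6), (4, 8)]]
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards)
print(round(w, 3))  # converges toward 2.0
```

In a real cluster the mean is an NCCL all-reduce over GPU interconnects, and overlap of communication with backprop is where most of the engineering effort goes.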

$150,000 – $350,000 / year (USD)

San Jose, United States
Onsite

Manual Quality Assurance Engineer, Web Core Product

New
Top rated
Speechify
Full-time

Work alongside machine learning researchers, engineers, and product managers to bring AI Voices to customers for diverse use cases. Deploy and operate the core ML inference workloads for the AI Voices serving pipeline. Introduce new techniques, tools, and architecture that improve performance, latency, throughput, and efficiency of deployed models. Build tools to identify bottlenecks and sources of instability and design and implement solutions to address the highest priority issues.

$140,000 – $200,000 / year (USD)

Remote

AI Infrastructure Engineer

New
Top rated
42dot
Full-time

Operate and maintain a large-scale GPU cluster consisting of thousands of GPUs across multiple data centers using Kubernetes and Slurm. Monitor and diagnose failures across the GPU hardware and software stacks to ensure high availability and rapid recovery. Develop automation tools and scripts using Python or Shell to streamline repetitive infrastructure management tasks and improve operational efficiency. Manage GPU resource quotas and provide technical support to ML researchers to ensure optimal utilization of computing resources. Participate in the architectural design and performance tuning of distributed training environments for large-scale autonomous driving models.
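As a hedged illustration of the automation work described, the sketch below parses hypothetical `sinfo`-style node-state lines and flags GPU nodes that need attention. A real script would shell out to `sinfo` or query the Kubernetes API instead of reading a hard-coded string:

```python
# Sketch of a cluster-ops automation helper: flag GPU nodes whose state
# is anything other than healthy ('idle' or 'alloc'). The input format
# below is an assumption, modeled loosely on `sinfo` output.

def flag_unhealthy(report):
    """Return node names whose state is not 'idle' or 'alloc'."""
    bad = []
    for line in report.strip().splitlines():
        node, state = line.split()
        if state not in ("idle", "alloc"):
            bad.append(node)
    return bad

sample = """\
gpu-node-001 idle
gpu-node-002 alloc
gpu-node-003 drain
gpu-node-004 down
"""
print(flag_unhealthy(sample))  # → ['gpu-node-003', 'gpu-node-004']
```

Wired into a cron job or operator loop, the same check feeds alerting and automated node drain/recovery, which is the "rapid recovery" half of the role.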

Undisclosed

Pangyo, South Korea
Remote

Director of Governance & Risk Compliance

New
Top rated
Scale AI
Full-time

The role involves designing and developing the production lifecycle of full-stack AI applications and supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. Responsibilities include taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies, and overseeing the end-to-end health of the platform to ensure seamless integration between AI core and full-stack components. The role also covers building automated systems to monitor model performance and data drift across dispersed environments, managing the technical lifecycle within diverse regulatory frameworks, and leading incident response for production issues in mission-critical environments. Finally, it requires translating technical performance metrics into clear insights for senior government officials and partnering with Engineering and ML teams to drive product evolution based on lessons from the field.
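The drift monitoring mentioned above is often implemented with a distribution-comparison statistic such as the Population Stability Index (PSI). A minimal sketch follows; the bin proportions and the 0.2 alert threshold are illustrative assumptions, not Scale AI's actual method:

```python
# Compare a live feature distribution against its training baseline with
# PSI. Inputs are aligned lists of bin proportions (each summing to 1).
import math

def psi(expected, actual):
    """Population Stability Index; larger means more drift."""
    eps = 1e-6  # guard against empty bins
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

baseline = [0.25, 0.25, 0.25, 0.25]   # training-time bin proportions
live     = [0.10, 0.20, 0.30, 0.40]   # proportions seen in production
score = psi(baseline, live)
print(round(score, 3), "drift" if score > 0.2 else "stable")
```

A monitoring system would compute this per feature on a schedule and page an operator when the score crosses the chosen threshold.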

Undisclosed

San Francisco, United States
Onsite

Staff Software Engineer, GPU Infrastructure (HPC)

New
Top rated
Cohere
Full-time

As a Staff Software Engineer, you will build and scale ML-optimized HPC infrastructure by deploying and managing Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads. You will optimize for AI/ML training by collaborating with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, using technologies like RDMA, NCCL, and high-speed interconnects. You will troubleshoot and resolve complex issues by identifying and resolving infrastructure bottlenecks, performance degradation, and system failures to minimize disruption to AI/ML workflows. You will enable researchers with self-service tools by designing intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently. You will drive innovation in ML infrastructure by working closely with AI researchers to understand emerging needs, such as JAX and PyTorch support and distributed training, and translating them into robust, scalable infrastructure solutions. You will champion best practices by advocating for observability, automation, and infrastructure-as-code (IaC) across the organization to ensure systems are maintainable and resilient. Additionally, you will provide mentorship and collaborate through code reviews, documentation, and cross-team efforts to foster a culture of knowledge transfer and engineering excellence.

Undisclosed

Canada
Remote

Mechanical Engineer, Packaging Systems

New
Top rated
Figure AI
Full-time

$150,000 – $350,000 / year (USD)

San Jose, United States
Onsite

Helix Data Creator

New
Top rated
Figure AI
Full-time


$150,000 – $350,000 / year (USD)

Los Angeles
Onsite

Want to see more AI Infrastructure Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free: your best contributions are the price of admission.)

Frequently Asked Questions

Have questions about roles, locations, or requirements for AI Infrastructure Engineer jobs?


What does an AI Infrastructure Engineer do?

AI Infrastructure Engineers design and build the systems that power machine learning workloads. They optimize performance by resolving bottlenecks, implement scaling solutions through load balancing and redundancy, and deploy cloud infrastructure specifically for AI applications. These specialists build fault-tolerant systems for serving large language models, maintain continuous integration pipelines, and collaborate with AI teams to translate research needs into production-ready infrastructure.

What skills are required for an AI Infrastructure Engineer?

Key skills include proficiency with cloud platforms (AWS SageMaker, Azure ML, Vertex AI), infrastructure-as-code tools like Terraform, and containerization technologies such as Docker and Kubernetes. Strong programming abilities in Python, Go, or C++ are essential, with CUDA knowledge for GPU optimization. Experience with monitoring tools (Prometheus, Grafana), distributed systems, deep learning frameworks, and Linux/UNIX environments is highly valued in candidates.

What qualifications are needed for an AI Infrastructure Engineer role?

Employers typically require a bachelor's degree in Computer Science, AI, Machine Learning, or a related technical field. Most positions demand 4+ years of experience in cloud infrastructure, large-scale systems, or software engineering with an infrastructure focus. Practical expertise in cloud computing, Linux administration, network architecture, and container technologies is essential, and specialized knowledge of GPU programming, distributed systems, and LLM serving strengthens applications considerably.

What is the salary range for AI Infrastructure Engineer jobs?

Compensation varies based on location, experience level, company size, and the specific technical skills required. Because this role combines specialized AI knowledge with infrastructure expertise, salaries generally reflect the high demand for professionals who can build and optimize systems for machine learning workloads at scale.

How long does it take to get hired as an AI Infrastructure Engineer?

Hiring timelines vary by company, and the process often includes technical assessments of cloud architecture knowledge, infrastructure-as-code experience, and machine learning operations skills. Given the specialized nature of AI infrastructure roles and their typical requirement of 4+ years of relevant experience, candidates should expect a thorough evaluation of their technical capabilities and problem-solving abilities.

Are AI Infrastructure Engineer jobs in demand?

Yes, AI Infrastructure Engineer positions show strong demand signals. Major companies like Accenture, Scale AI, and Zoom are actively recruiting for these specialized roles, and the increasing deployment of large language models and AI applications across industries creates a consistent need for professionals who can build optimized infrastructure. The intersection of cloud platforms, containerization, GPU optimization, and machine learning operations makes qualified candidates particularly valuable in today's job market.