Member of Technical Staff, Pre-training Systems
Design and operate the distributed infrastructure that trains Magic's long-context models at scale across massive GPU clusters. Scale distributed training using data, tensor, and pipeline parallelism. Optimize communication patterns and gradient synchronization. Improve checkpointing, fault tolerance, and job recovery systems. Profile and eliminate performance bottlenecks across compute, networking, and storage. Improve experiment reproducibility and orchestration workflows. Increase hardware utilization and training throughput. Collaborate with Kernels and Research to align model architecture with systems realities.
Member of Technical Staff, Inference & RL Systems
Design and operate distributed systems that serve models in production and power large-scale post-training workflows. Work on systems that affect inference latency, throughput, and stability, as well as the reliability of reinforcement learning (RL) and post-training loops. Own infrastructure for production inference and large-scale RL iteration, handling KV-cache scaling, memory pressure with long sequences, batching trade-offs, long-horizon trajectory rollouts, and sustained throughput. Design and scale high-performance inference serving systems. Optimize KV-cache management, batching strategies, and scheduling. Improve throughput and latency for long-context workloads. Build and maintain distributed RL and post-training infrastructure. Improve reliability of rollout, evaluation, and reward pipelines. Automate fault detection and recovery for serving and RL systems. Profile and eliminate performance bottlenecks across GPU, networking, and storage layers. Collaborate with Kernels and Research teams to align execution systems with model architecture.
Member of Technical Staff, Tech Lead
The role involves tackling complex problems end-to-end with ownership of parts of the product, making decisions across the LLM pipeline, infrastructure, backend, and UX. The candidate is expected to define the architecture for years to come by making critical decisions on a greenfield stack. The work includes pushing the most advanced AI models to their limits, communicating tradeoffs, problems, and blockers directly, and building a product that works with attention to detail. The job focuses on developing AI-powered research capabilities such as building a research agent, creating a database of millions of humans for precise targeting, advancing realtime video interviews with emotional understanding, building a distributed information mining agent, and developing a customer preference model with synthetic personas to extrapolate new insights.
Staff Software Engineer - Product Fundamentals
The Staff Software Engineer in the Product Fundamentals Group at Multiverse is responsible for architecting systems that allow deployment of AI-powered features at scale, acting as a force multiplier who guides direction across multiple teams, and owning complex cross-functional problems to ensure stability, security, and architectural integrity as AI capabilities scale. The role involves auditing and aligning technical direction with business objectives, leading highly complex AI-related projects with a focus on predictable delivery, operational excellence, and impactful user experience, and defining frameworks to guide architectural strategy across teams while coaching others. The engineer solves ambiguous engineering challenges, makes critical decisions thoughtfully and decisively, drives architectural strategy for major platforms to ensure AI systems are reliable and performant, coordinates broader initiatives spanning multiple work-streams to drive adoption of technical-debt and scalability strategies, and innovates by leveraging emerging AI technologies to solve complex problems and build foundational components for the engineering organization.
Software Engineer, AI Compiler
Work across the full stack with software, systems, and hardware teams to ensure correctness, performance, and deployment readiness for real workloads. Contribute to shaping the long-term compiler architecture and tooling strategy. Design and implement parts of the compiler stack targeting the novel AI accelerator, including front-end lowering, IR transformations, optimization passes, and backend code generation. Build and evolve MLIR/LLVM based infrastructure to support graph lowering, hardware-aware optimizations, and performance-centric code emission. Collaborate closely with hardware architects, microarchitects, and research teams to co-design compiler strategies that align with evolving ISA and hardware constraints. Develop profiling and analysis tools to identify performance bottlenecks, validate generated code, and ensure high throughput/low latency execution of AI workloads. Enable efficient mapping of high-level ML models to hardware by working with model frameworks and graph representations such as ONNX, JAX, and PyTorch. Drive performance tuning strategies including kernel authoring, schedule generation, and hardware-specific optimization passes.
Software Intern
As a Software Engineering Intern at TensorWave, responsibilities include collaborating with senior engineers on features for cloud control plane, orchestration layer, user-facing APIs, or internal tooling; working on automation, monitoring, and observability for GPU clusters (Slurm + Kubernetes-native environments); participating in debugging performance bottlenecks in high-throughput inference or distributed training pipelines; writing clean, well-tested code and participating in code reviews; and learning how bare-metal AI clouds operate at scale, including hardware partitioning, high-speed networking, and storage.
Senior Backend / Systems Engineer (AI) - San Mateo, CA
Design and build extensible backend systems that support flexible configurations for different customers and content types. Develop infrastructure that interfaces cleanly with large language models (LLMs), enabling prompt engineering, context injection, and modular evaluation workflows. Build tooling and platforms that enable fast iteration by AI engineers and analysts, including declarative pipelines, parameterized jobs, and reproducible experiments. Prioritize ease of deployment, integration, and testing, both for internal teams and external partners. Collaborate closely with product, data, and policy teams to translate nuanced safety needs into scalable, maintainable software systems.
Software Engineer - Sensing, Consumer Products
As a Software Engineer on Consumer Products Research, the responsibilities include building and shipping production software for sensing algorithms by translating algorithm prototypes into reliable end-to-end systems, and implementing and owning key parts of the Python shipping pipeline, including integration surfaces, evaluation hooks, and quality/performance guardrails. The role also involves developing embedded/on-device software in an RTOS environment (such as Zephyr) and deploying models to device runtimes and hardware accelerators. Additional responsibilities include optimizing real-time on-device perception loops for stability, latency, power, and memory constraints, creating data collection and instrumentation tooling to bring up new sensing modalities and accelerate iteration from prototype to dataset to model to device, and partnering cross-functionally with algorithms, human data, and firmware/hardware teams to debug, profile, and harden systems against real-world variability.
Senior Software Engineer, ML Core
Design, develop, and deploy custom and off-the-shelf ML libraries and tooling to improve ML development, training, deployment, and on-vehicle model inference latency. Build tooling and establish development best practices to manage and upgrade foundational libraries such as the NVIDIA driver, PyTorch, and TensorRT, improving the ML developer experience and expediting debugging efforts. Collaborate closely with cross-functional teams including applied ML research, high-performance compute, advanced hardware engineering, and data science to define requirements and align on architectural decisions. Work across multiple ML teams within Zoox, supporting in- and off-vehicle ML use cases and coordinating to meet the needs of vehicle and ML teams to reduce the time from ideation to productionization of AI innovations.
Software Engineer, Platform Systems
Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs. Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior. Improve observability, reliability, and performance across OpenAI's training platform. Debug and resolve issues in complex, high-throughput distributed systems. Collaborate with systems, infrastructure, and research teams to evolve platform capabilities. Extend and adapt failure detection systems or tracing systems to support new training paradigms and workloads.
