Staff Software Engineer, Foundations (Managed AI)
The Staff Software Engineer in the Foundations department leads the design and implementation of highly scalable systems for the Managed AI offerings and drives the long-term technical roadmap for the Foundations team to support growth and evolving AI workloads. The role works cross-functionally with Cloud Engineering to align technical goals and solve integration challenges, and leads by example through high-quality code contributions and mentoring of Senior and Staff-level engineers. It also champions reliability, observability, and performance by identifying and resolving systemic bottlenecks, and stays current with AI infrastructure trends so that the team uses efficient and powerful tools.
Director, Engineering, Proactive Offense
Lead and scale Horizon3.ai's Offensive Engineering organization, overseeing teams responsible for exploit development, offensive content, and attack automation within the NodeZero platform. Set clear technical and product direction for how NodeZero identifies, exploits, and validates vulnerabilities across large, complex environments. Partner with Product, Precision Defense, and Platform teams to define and deliver offensive capabilities that influence the roadmap and enhance customer outcomes. Drive execution from proof-of-concept through production to transform cutting-edge attack research into scalable, productized features. Stay hands-on to guide architectural decisions and evaluate exploit and automation approaches, mentoring technical leads in building resilient, modular systems. Build, mentor, and scale diverse teams of software engineers, exploit developers, and offensive researchers, fostering a culture of collaboration, creativity, and engineering excellence that bridges offensive and product software development. Collaborate across engineering, product, and GTM teams to align offensive innovation with business priorities and ensure delivery of impactful capabilities for customers. This role is central to the mission of delivering continuous, autonomous security testing at scale.
Software Engineer, Workload Enablement
Port and validate key inference and training workloads on new platforms/SKUs as they arrive, driving correctness, performance, and stability to an internal readiness bar. Build a suite of benchmarks and stress tests that capture real end-to-end behavior of workloads by exercising all aspects of a system, including CPU, GPU, memory subsystem, frontend, scale-up, and scale-out networking, storage, thermals, and other relevant parts. Conduct deep-dive performance analysis on distributed training and inference focusing on collective performance and tuning, overlap of compute/communication, kernel-level bottlenecks, memory bandwidth, and scheduling effects. Create repeatable test harnesses that run in continuous integration and lab environments producing actionable outputs such as pass/fail, performance scores, and regression detection. Partner with systems and fleet bring-up engineers to ensure the platform is stable, performant, operationally usable, and scalable through containerization, Kubernetes integration, telemetry hooks, and failure triage loops. Work cross-functionally with vendors and internal stakeholders by producing clear bug reports, minimal reproductions, and prioritized issue lists.
Software Engineer, Codex Core Agents
The role involves designing and building execution environments for AI agents, including sandboxing, isolation, and reproducibility. It includes developing systems for agent orchestration across multi-step, tool-using workflows and building infrastructure for running, testing, and debugging code generated by models. Responsibilities also include creating state and memory systems that allow agents to persist context across long-running tasks, optimizing tokens, latency, reliability, and cost across Codex’s production fleet, and supporting model rollouts, capacity planning, and managing tradeoffs between quality, speed, and economics to maintain a fleet of frontier agents at scale. Additionally, the job entails building shared platform capabilities that unblock product teams, partner teams, and open source Codex.
Software Engineer
The Software Engineer in the Defence team will build and extend critical components of client deliverables across diverse software domains, deliver robust technical artefacts in both compiled and non-compiled languages, implement defined engineering patterns and practices tailored for the Defence sector, collaborate closely with Machine Learning Engineering and Data Science teams to integrate and refine technical solutions, apply rigorous software engineering best practices to enhance scalability and quality of codebases, and execute CI/CD processes while managing application deployments on Kubernetes and bare metal environments.
Engineering Manager, Distillation & Detection Platform
Lead a team of software engineers building detection and mitigation systems for frontier model misuse, focusing on model IP protection, distillation detection, and emerging risks from autonomous agents. Set the technical roadmap and execution strategy including prioritization, design, shipping, iteration, and impact measurement. Build production systems such as services, pipelines, tooling, instrumentation, and automation that can scale with frontier model usage. Partner with Research and Product teams to translate evolving model capabilities into scalable tests, signals, and mitigations. Drive strong engineering fundamentals including architecture, reliability, monitoring, performance, and operational excellence. Hire and grow a team across backend, data systems, and applied ML engineering domains. Anticipate and address scalability challenges as agentic workflows advance.
Staff Engineer
The Staff Engineer is responsible for making hard architectural tradeoffs and owning the outcomes, such as choices between Durable Object SQLite and shared Postgres for session state, Cloudflare Workers CPU limits versus longer-running workloads, and single-tenant sandboxes versus multi-tenant pools. They design the system that handles concurrent agent sessions across integrations with consistent state. The role includes defining reliability and observability standards for the team, including SLOs, error budgets, tracing strategies, and incident response patterns. The Staff Engineer reviews every significant pull request to set technical direction without blocking velocity and ships code daily, actively contributing architectural leadership alongside output. They work closely with the CTO on all architectural decisions that significantly impact the system.
Software Engineer I
Develop AI agents and software to automatically diagnose and repair hardware faults across massive NVIDIA and AMD GPU clusters. Create deep-level observability and diagnostic tools to monitor the health of high-density compute systems. Build testing suites using PyTorch and NCCL to ensure systems are ready for production. Develop automation for critical facility systems, including power management and advanced liquid cooling. Take full ownership of tools from initial code to deployment and operational support.
Senior Systems Performance Engineer
The Senior Systems Performance Engineer at Crusoe is responsible for leading the evaluation and establishment of New Product Introduction (NPI) across varied hardware architectures with a focus on Bare Metal and VM environments. They conduct deep-dive performance evaluations and workload characterizations across compute, memory, storage, and networking. They develop sophisticated multi-variable projection models and frameworks to analyze system design options through tradeoffs such as Power and Total Cost of Ownership (TCO). The role involves collaborating with external vendors to drive platform customization and optimize server and AI architectures for maximum performance-per-TCO. They design and implement performance methodologies to scale evaluation processes for large-scale GPU/AI data centers. Additionally, they engage in industry research and contribute technical insights to consortiums and standards committees to influence future hardware roadmaps.
C++ Systems Engineer
Design, build, and optimize the core native runtime powering LM Studio and the C++ libraries powering the app and APIs. Work across runtime, LLM engines, llama.cpp/MLX integrations, build infrastructure, and on-device AI software. Focus on system and library integration by wiring the C++ runtime to GPU backends, vendor SDKs, and operating-system services to support user-facing applications. Implement and harden system-level code involving threading, memory, files, IPC, and scheduling. Integrate platform acceleration paths such as Metal, CUDA, and Vulkan across macOS, Windows, and Linux. Profile, debug, and tune execution paths to ensure fast, dependable local AI and maintainable software. Contribute to the C++ runtime powering LM Studio, extend LLM engine integrations, and build platform-aware performance features for desktop OS. Implement resilient IPC, resource management, and scheduling logic to support concurrent model execution. Improve build, packaging, and release infrastructure for native components. Collaborate with the team to deliver cohesive and recognizable user experiences.
