Senior Systems Performance Engineer
The Senior Systems Performance Engineer at Crusoe leads New Product Introduction (NPI) evaluations across varied hardware architectures, with a focus on bare metal and VM environments. They conduct deep-dive performance evaluations and workload characterizations across compute, memory, storage, and networking, and develop multi-variable projection models and frameworks to analyze system design options against tradeoffs such as power and total cost of ownership (TCO). The role involves collaborating with external vendors to drive platform customization and optimize server and AI architectures for maximum performance per TCO. They design and implement performance methodologies to scale evaluation processes for large-scale GPU/AI data centers, engage in industry research, and contribute technical insights to consortiums and standards committees to influence future hardware roadmaps.
C++ Systems Engineer
Design, build, and optimize the core native runtime behind LM Studio and the C++ libraries that power the app and its APIs. Work across the runtime, LLM engines, llama.cpp/MLX integrations, build infrastructure, and on-device AI software. Focus on system and library integration: wire the C++ runtime to GPU backends, vendor SDKs, and operating-system services to support user-facing applications. Implement and harden system-level code involving threading, memory, files, IPC, and scheduling. Integrate platform acceleration paths such as Metal, CUDA, and Vulkan across macOS, Windows, and Linux. Profile, debug, and tune execution paths to ensure fast, dependable local AI and maintainable software. Extend LLM engine integrations and build platform-aware performance features for desktop operating systems. Implement resilient IPC, resource management, and scheduling logic to support concurrent model execution. Improve build, packaging, and release infrastructure for native components, and collaborate with the team to deliver a cohesive, recognizable user experience.
Engineering Leader
As an Engineering Leader at Ema, you will build and lead a high-performance engineering organization by recruiting, hiring, and developing senior engineers across multiple sub-teams including cloud infrastructure, data platform, ML operations, and developer experience. You will establish engineering standards, a code review culture, on-call expectations, and promote a bias-toward-shipping mentality balanced with production rigor. You will coach and grow senior and staff engineers into technical leaders and manage engineering managers as the organization scales. Your responsibilities include setting the 6–18 month platform roadmap in partnership with engineering teams, making critical architectural decisions such as build versus buy and migration strategies, and driving cross-functional alignment with product, ML/AI research, and go-to-market teams. You will own production health for all platform services, including incident response, postmortems, SLO tracking, and capacity planning. Additionally, you will establish and refine engineering practices to maintain fast shipping without compromising reliability, and participate in executive-level reviews related to infrastructure spend, system health, and engineering velocity.
Software Engineer, Architecture, Reliability, & Compute
As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies, and oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and full-stack components. You will build automated systems to monitor model performance and data drift across geographically dispersed environments, manage the technical lifecycle within diverse regulatory frameworks, and lead response for production issues in mission-critical environments, ensuring rapid resolution and prevention. You will also translate technical performance metrics into clear insights for senior international government officials, and partner with Engineering and ML teams to ensure field lessons influence future technical architecture and decisions.
Head of Internal Tools Engineering
The Head of Internal Tools Engineering is responsible for owning the end-to-end strategy and roadmap for all internal tools, platforms, and automation, treating internal technology as a product. They make strategic build-vs-buy decisions, map current and next-state process flows, and lead systems transformation for internal teams. They architect and maintain the full engineering lifecycle of internal platforms, build seamless API-first ecosystems integrating various internal systems, ensure system reliability and operational resilience, and design scalable, secure architectures using cloud-native principles and microservices. They lead AI strategy by integrating AI and LLMs into internal workflows and deploying intelligent automation tools. They reduce cognitive load for internal users by providing standardized workflows and self-service capabilities, measure platform success by adoption, satisfaction, and productivity impact, and build, lead, and mentor a high-performing engineering team. They cultivate a collaborative culture, provide technical mentorship, foster psychological safety, partner cross-functionally with leadership across departments, and align internal platform investments with company strategy while demonstrating measurable ROI.
Head of Internal Tools Engineering
The role involves architecting, building, and scaling the internal technology ecosystem to accelerate workforce productivity, eliminate operational friction, and provide a compounding infrastructure advantage by treating internal tools with product rigor and user-centricity. Responsibilities include owning the end-to-end strategy and roadmap for all internal tools, platforms, and automation; making strategic build-vs-buy decisions; mapping current and next-state process flows and leading systems transformation. The role requires architecting and maintaining the full engineering lifecycle of internal platforms, building API-first ecosystems integrating with various business systems, owning system reliability and operational resilience, and designing scalable, secure cloud-native architectures. The role leads AI adoption and automation integration into internal workflows, including deploying intelligent automation tools, evaluating AI-assisted troubleshooting, and driving continuous experimentation with prototypes. The person will reduce cognitive load for internal users by providing golden paths and standardized workflows, ensuring frictionless onboarding, and measuring platform success via adoption rates, user satisfaction, DORA metrics, and productivity impact. Team leadership duties include building, leading, and mentoring engineers and managers, fostering a collaborative culture rooted in ownership, speed, craftsmanship, and psychological safety. The role partners cross-functionally with various company leadership teams to translate business needs into a unified technical vision, aligning internal platform investments with company strategy and demonstrating measurable ROI.
AI Software Engineer (Back End)
Build and maintain back-end services that handle model inference and user requests. Design systems to manage requests, sessions, and streaming responses; implement reliability mechanisms such as rate limiting, retries, and graceful failure; and build authentication and access controls for public usage. Design systems for logging, telemetry, and evaluation signals; improve the latency, throughput, and reliability of model serving; integrate new model checkpoints into the production system; and work closely with training and infrastructure engineers to deploy and operate the model. The role involves working inside production systems (logs, traces, performance profiles, and deployment pipelines) to ensure the system stays up, stays fast, and behaves predictably under load.
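The reliability mechanisms this listing names (rate limiting, retries, graceful failure) can be sketched in a few lines. This is a minimal illustration, not any particular serving stack's implementation; the names `TransientError`, `call_with_retries`, and `TokenBucket` are hypothetical.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure (timeout, 503, etc.)."""

def call_with_retries(request_fn, max_attempts=4, base_delay=0.1):
    """Call request_fn, retrying transient failures with exponential
    backoff and full jitter; re-raise after max_attempts so the caller
    can fail gracefully (e.g. return a fallback response)."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter spreads retries out to avoid thundering herds.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))

class TokenBucket:
    """Minimal token-bucket rate limiter: allow() returns False once
    the request budget is exhausted, refilling at rate_per_sec."""
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A production system would layer these per-user and per-model, but the shape of the logic is the same.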
Software Engineer, Inference Platform
Own inference deployments end-to-end including initial configuration, performance tuning, production SLA maintenance, and incident response; drive measurable improvements in throughput, time-to-first-token (TTFT), and cost-per-token across diverse model families and customer workload patterns; build and operate KV cache and scheduling infrastructure to maximize utilization across concurrent requests; implement and validate disaggregated prefill/decode pipelines and Kubernetes-based orchestration supporting them at scale; profile and resolve bottlenecks at compute, memory, and communication layers and instrument deployments for end-to-end observability; partner with customers to translate model architectures, access patterns, and latency requirements into deployment configurations and platform improvements; contribute to the inference platform architecture and roadmap focusing on reducing deployment complexity, improving hardware utilization, and expanding support for new model classes and accelerators; participate in an on-call rotation to maintain production reliability and SLA commitments.
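As a rough illustration of the metrics this role optimizes, here is a sketch of how TTFT, output throughput, and cost-per-token fall out of per-request timestamps. The function name, parameters, and amortization scheme are assumptions for illustration, not the platform's actual accounting.

```python
def serving_metrics(request_start, first_token_at, done_at,
                    output_tokens, gpu_cost_per_hour,
                    concurrent_requests=1):
    """Derive core serving metrics from per-request timestamps (seconds).
    GPU cost is amortized evenly across concurrent requests, a
    simplifying assumption."""
    ttft = first_token_at - request_start            # time-to-first-token
    decode_time = done_at - first_token_at           # streaming phase
    throughput = output_tokens / decode_time         # output tokens/sec
    wall_hours = (done_at - request_start) / 3600
    cost_per_token = (gpu_cost_per_hour * wall_hours) / (
        output_tokens * concurrent_requests)
    return {"ttft_s": ttft,
            "tokens_per_s": throughput,
            "cost_per_token": cost_per_token}
```

The denominator is why batching and KV-cache utilization matter: raising `concurrent_requests` (or `output_tokens` per unit time) directly drives cost-per-token down for fixed hardware spend.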
Training: Process Management Engineer
As a Training Runtime: Process Management Engineer, you will design, build, and maintain software to orchestrate and monitor machine learning workloads on large supercomputers, working primarily with Python and Rust. Your responsibilities include profiling and optimizing the software stack to support computation orchestration at frontier scale, improving reliability, observability, and fault tolerance for long-running jobs, debugging complex distributed systems issues across large clusters, and responding to the changing shapes and needs of the ML systems to enable researchers. The role involves building high-performance asynchronous systems with a strong emphasis on performance, correctness, and scalability, and working on software that ties thousands of computers together as a unified system while promoting a fast debugging and development cycle and relentless optimization for scale, stability, and performance.
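A toy sketch of the fault-tolerance pattern such a process-management layer provides: run a long-lived worker and restart it on failure up to a budget. Names and structure are illustrative; a real orchestrator would restore from checkpoints, relaunch actual processes across a cluster, and record each failure for observability.

```python
import asyncio

async def supervise(worker_fn, max_restarts=3):
    """Run worker_fn to completion, restarting it on failure up to
    max_restarts times, then re-raise so the failure surfaces."""
    restarts = 0
    while True:
        try:
            return await worker_fn()
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise
            # A real system would checkpoint-restore state here and
            # emit a metric/log entry before relaunching.
            await asyncio.sleep(0)  # yield to the event loop first
```

Driving thousands of nodes as one system is largely this loop, generalized: detect the failure fast, bound the blast radius, and get the job running again without human intervention.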
Software Engineer Systems Research Internship, Applied Emerging Talent (Summer 2026)
The intern will investigate challenging systems problems, build practical solutions, and measure their impact to improve Applied Systems by making them more efficient, scalable, and reliable. The work typically involves defining hypotheses about system improvements, instrumenting production systems to gather metrics and analyze data, building or modifying real systems with prototypes or production-quality improvements, running experiments and benchmarks, analyzing results, communicating tradeoffs and recommendations clearly, and publishing the research in technical journals and conferences. Focus areas include distributed systems and storage, compute and scheduling, performance engineering, reliability and observability, networking and data pipelines, and systems for machine learning.
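The measure-then-improve loop described above usually starts with a small benchmark harness. A minimal sketch, with hypothetical names, of timing a function and reporting mean and tail latency:

```python
import statistics
import time

def benchmark(fn, iters=100):
    """Time fn over iters runs; report mean and p99 latency in ms.
    Tail percentiles, not just the mean, are what reveal most
    systems-performance problems."""
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    p99 = samples[min(len(samples) - 1, int(len(samples) * 0.99))]
    return {"mean_ms": statistics.fmean(samples), "p99_ms": p99}
```

Real experiments add warmup runs, confidence intervals, and controlled load, but the hypothesis-measure-analyze cycle is the same.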
