Software Engineer, Platform
As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications and support end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will own the production outcome, taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will ensure full-stack integrity by overseeing the health of the platform and maintaining seamless integration between the AI core and all full-stack components, from APIs to UI. Additionally, you will build automated systems to monitor model performance and data drift across geographically dispersed environments, manage the technical lifecycle within diverse regulatory frameworks, lead the response to production issues in mission-critical environments, translate deep technical performance metrics into clear insights for senior international government officials, and partner with Engineering and ML teams to ensure field lessons influence future technical architecture and decisions.
Relocate to SF: Software Engineer (AI Infra)
Build the platforms that power Pylon's AI features, such as prompt executions and search infrastructure. Improve LLM observability, including online and offline AI evaluations and scorers, and prepare Pylon's AI for future scaling. Enhance the quality and performance of AI features.
Software Engineer, ML Data Infrastructure
The Software Engineer, ML Data Infrastructure will collaborate with engineers to build advanced AI design experiences, tackle complex technical challenges including scaling distributed systems and enabling generative media experiences, build robust data infrastructure at petabyte scale ensuring reliability and performance across multi-modal training pipelines, optimize data processing workflows for high throughput involving distributed systems, TPU infrastructure, and large-scale storage, and partner with research scientists to understand data requirements and translate them into production-grade systems to accelerate model development cycles.
Senior Engineering Manager, Management Plane Systems
Lead the team responsible for the automation, observability, configuration management, and policy enforcement layer that runs across the entire network fleet. Own the architecture, development, and production operation of the SDN Management Plane, including the automation and observability platform for managing the network fleet across all regions. Build and operate CI/CD pipelines for network configuration, including automated testing, policy validation, and push-on-green delivery of network changes. Design and implement software systems that enforce reconciliation between declared and actual network state, detect configuration drift, and trigger automated remediation workflows. Define provisioning and onboarding automation for new nodes, regions, and customer environments. Drive the design of network observability systems such as streaming telemetry, synthetic probing, anomaly detection, and real-time traffic monitoring across GPU clusters. Design and implement self-healing network capabilities using closed-loop automation to detect, diagnose, and resolve network faults without human intervention. Set the technical vision for applying GenAI and machine learning to network operations. Partner with Control Plane and Data Plane teams on the software interfaces between layers, and collaborate with infrastructure and compute teams to support GPU cluster networking requirements. Act as internal platform owner for network automation and treat engineering teams as customers with real product requirements. Lead, mentor, and grow a team of senior and staff-level software and network automation engineers, set technical standards, review architecture and design decisions, and own team performance and development. Foster a high-ownership engineering culture focused on shipping production software.
Software Engineer, Monetization ML Infrastructure
Design and build the machine learning infrastructure that powers OpenAI's monetization and ads systems. Develop large-scale data pipelines processing impressions, clicks, conversions, advertiser data, marketplace signals, and other inputs used to train and improve ML models. Create scalable model training platforms for ranking, conversion prediction, quality prediction, bidding, targeting, measurement, and optimization workloads. Develop systems to safely and reliably move models from experimentation into production environments. Build and improve real-time inference and serving infrastructure with strict requirements for latency, throughput, reliability, and availability. Design experimentation frameworks enabling A/B testing, holdouts, model comparisons, ramping strategies, and measurement at scale. Improve platform performance by optimizing training efficiency, inference latency, model throughput, infrastructure reliability, and cost effectiveness. Collaborate closely with ML engineers, product engineers, data scientists, and monetization teams to accelerate development and deployment of advertising systems.
Client Engineering Lead
As a Staff/Principal-level Technical Lead, you will be responsible for driving the end-to-end technical execution of multiple concurrent enterprise engagements in close partnership with the Project Lead, from technical discovery to production deployment. You will architect and implement secure, highly scalable integrations between the AI platform and clients' existing data pipelines, APIs, and infrastructure. You will lead technical discovery sessions, architecture workshops, and data readiness assessments with customer IT, data, and engineering leadership teams. You will build and customize AI-enabled solutions, scripts, and workflows that address complex business problems identified in the sales process. You will serve as the primary technical liaison and escalation point between customer engineering teams and internal product, engineering, and data science teams to unblock deployments quickly. You will ensure that all deployed solutions meet enterprise-grade standards for performance, security, data privacy, and scalability. You will debug complex integration issues, manage technical risks across overlapping projects, and provide hands-on troubleshooting during implementation. Additionally, you will contribute to the internal codebase by documenting technical blueprints, developing reusable integration components, and providing product feedback based on real-world edge cases.
Senior Software Engineer
Own the complete development lifecycle for spam and scam detection infrastructure including research, proposing solutions, implementation, testing, deployment, production maintenance, and monitoring. Participate in on-call rotation for rapid recognition and resolution of production issues while improving system reliability. Design and build frameworks that enable data scientists to develop, test, and deploy complex scam detection models with access to call data in a privacy-aware and regulation-compliant manner. Make independent implementation decisions while driving collaborative design discussions to improve system quality, maintainability, and cost-effectiveness. Evaluate critical tradeoffs between immediate fixes and durable solutions prioritizing service quality and system resilience. Collaborate with product managers, data scientists, and engineering teams to align technical decisions with business impact and user needs. Recognize and promote engineering patterns, design principles, and architectural decisions across teams to raise quality and execution speed. Influence team operations by pushing back on non-aligned solutions, surfacing issues early in project planning, and reasoning about business impact versus cost.
Staff Software Engineer - Managed Kubernetes
As a Staff Engineer on the Orchestration team, you will drive the technical vision for Lambda's Managed Kubernetes bare-metal platform, including control plane scalability, multi-tenancy, cluster lifecycle management, and high availability. You will integrate and extend NVIDIA's open-source ecosystem, design GPU-aware orchestration systems, and lead the development of services powering managed services. Your responsibilities include informing and helping with networking solutions, such as CNI integration and high-performance fabrics, and storage architecture requirements for AI workloads. You will build the foundation for Managed Slurm on Kubernetes, design higher-level platform services for inference, including model serving infrastructure and autoscaling, and design self-healing systems and automation for incident response and platform resilience. You will lead chaos engineering efforts to validate system behavior under failure conditions and establish operational excellence, including upgrade automation and zero-downtime maintenance. Additionally, you will serve as a technical bridge between Orchestration and other infrastructure teams, drive infrastructure-wide decisions, provide input on bare-metal provisioning, network topology, and storage systems, champion consistency and standardization, and work directly with customers and internal teams to understand deployments and roadmap managed platforms. Your role includes setting technical direction for Kubernetes services, driving reviews and design sessions, mentoring engineers, collaborating cross-functionally, engaging with NVIDIA and open-source communities, representing Lambda externally, and shaping the AIOps vision for automated capacity planning, anomaly detection, and predictive maintenance of cloud infrastructure.
Software Engineer, Productivity - Inference Runtime
The responsibilities include improving systems that ensure inference engine releases are correct, performant, and regression-free by evolving tooling and infrastructure for deploy gate validation; bringing rigor to release, validation, branching, and deployment processes across the inference stack; improving canary, async, and large-scale validation workflows for inference systems; hardening CI, testing, and validation infrastructure so failures are actionable and trustworthy; reducing noisy or flaky failures caused by infrastructure instability, GPU scheduling, or test environment issues; building automation for failure triage, ownership detection, debugging, and escalation; partnering closely with inference teams, research developer productivity, engine acceleration, and infrastructure teams to improve release quality and rollout safety; and reducing developer friction in testing, debugging, and release workflows so engineers can move faster with confidence.
Software Engineer, ML Systems
Build and manage end-to-end machine learning (ML) pipelines including ETL and automated evaluation that support reinforcement learning research. Identify and refactor inefficient research code to enable scalable performance of promising ideas. Establish best practices for versioning, experiment tracking, and continuous integration/continuous deployment (CI/CD) for ML models to ensure reliability. Manage the deployment and scaling of workloads on Kubernetes, and implement tooling and telemetry to monitor agent behavior and training health.
