Senior Engineering Manager, Managed Platform Services
Lead the Command Center Insights & Actions team to build systems that translate raw infrastructure telemetry into human-readable diagnostics and automated remediation workflows. Own and execute a technical roadmap including alerting engines, heuristic development, node health systems, and state machines that trigger proactive maintenance without impacting customer workloads. Explore integration of Large Language Models to build AI solutions within the Command Center product. Drive the Insights & Actions roadmap covering alerting infrastructure, control plane APIs, automated action systems, and telemetry-derived insights such as straggler node detection and GPU profiling. Contribute to strategic roadmaps, refine early product requirements, and collaborate cross-functionally with product, design, and engineering teams. Manage complex multi-engineer projects focused on customer outcomes, and drive technical excellence through process improvements and best practices. Cultivate team growth by coaching and mentoring engineers, setting clear performance expectations, and defining career paths to build a high-performing and sustainable team.
Forward Deployed Engineering Manager
As an Applied Research Engineer at Labelbox, you are responsible for creating frameworks and tools to construct, train, benchmark, and evaluate autonomous agent capabilities. You design agent-focused data programs using supervised fine-tuning (SFT) and reinforcement learning (RL) methodologies. You develop data pipelines from diverse sources such as code repositories, web browsers, and computer systems. You implement and adapt popular open-source agent libraries and benchmarks with proprietary datasets and models. You engage with research teams in frontier AI labs and the wider AI community to understand evolving agent data needs and share best practices. You collaborate closely with frontier AI lab customers to understand their requirements and guide model development. Additionally, you publish research findings in academic journals, conferences, and blog posts.
Senior Product Engineer, Growth & Lifecycle Infrastructure - Music & Audio
Lead efforts to drive the design and development of customer-facing multi-modal machine learning inference systems. Work with the Platform and Inference teams on building inference systems for the next generation of models, focusing on optimization, model tuning, and deployment. Partner with leading cloud providers to deliver hosted Stability AI inference solutions. Serve as a strategic thought partner for leaders across the organization on driving business impact through machine learning. Contribute to bringing new Stability models and pipelines into existence. Prototype and productionize inference platform improvements and new features.
Researcher: Agent Post-Training, API & Power-Users
The role involves improving the capabilities, reliability, and product fit of OpenAI’s agentic models for power users and API developers. Responsibilities include designing and running experiments to enhance model behavior in API and power-user workflows such as function calling, tool use, coding, planning, and long-horizon execution. The role requires building evals, graders, and environments from real developer and power-user workflows, turning observed failures into training data, hypotheses, and improvements. The researcher partners with API developers and power users to identify behavior gaps and translate product signals into post-training interventions. They improve model behavior when composed into systems, ensuring reliable tool use, respect for developer intent, appropriate error handling, clarification when needed, and task coherence. The role also includes owning end-to-end model behavior projects from failure analysis through training, eval design, integration into major model runs, and launch readiness. The researcher develops feedback loops using power-user traces and production-like environments to identify model failures and gaps, and assists in deciding which capabilities, fixes, and integrations are ready for major model runs. Debugging hard model failures by analyzing traces, evals, training data, and product context is also required. The role involves working on early-training and alignment interventions, improving large-scale training and launch machinery, and taking on cross-functional projects that touch model training, product infrastructure, and production agent harnesses, including multi-agent systems and training against production-like environments.
AI Deployment Engineering Manager, Digital Natives
The AI Deployment Engineering Manager leads the AI Deployment Engineering team in the Digital Native segment, focusing on ensuring the safe and effective deployment of Generative AI applications for developers and enterprises. Responsibilities include owning the team's strategy and operating model to align with company objectives and customer needs, and leading, building, and mentoring the team to deliver exceptional customer outcomes, evidenced by production customer applications and increased API adoption. The role involves serving as the technical advocate for customers by synthesizing their needs to guide Research and Applied Product/Engineering roadmaps. The manager acts as the primary technical escalation point during development, maintaining direct communication with executive-level stakeholders and fostering trust. Additionally, the role requires serving as an industry thought leader and championing the safe and innovative application of the technology across various sectors. The manager oversees the entire implementation journey for strategic technology and software customers in the Americas, ensuring seamless platform integration and aligning technical teams to deliver a consistent and exceptional experience throughout the customer lifecycle. Success is measured by live production applications, increased API adoption, and impactful customer stories.
Agentic AI/ML Engineer Intern, Solutions
As an Agentic AI/ML Engineer Intern, you will design and implement agentic workflows with tool use, memory, and orchestration to automate repetitive tasks and answer questions over internal and customer-facing data. You will contribute to AI Ops infrastructure including orchestration, evaluations, and observability, enabling agent-native DevOps to automate engineering and internal operations workflows. You will build and optimize RAG pipelines with vector databases and knowledge graphs to ground agents in the correct context. Additionally, you will set up evaluation pipelines to measure agent quality, reliability, and performance. This role involves prototyping, evaluating, and shipping agent-native solutions to multiply the impact of teams and technology, supporting scaling of customer base and operations without scaling headcount linearly.
Agentic AI/ML Engineer
Design and build agentic workflows that leverage tool use, memory, planning, and orchestration to automate repetitive tasks and enable natural-language access to internal and customer-facing data. Contribute to FieldAI's AI Ops platform by developing agent infrastructure for orchestration, evaluation, observability, and reliability, and apply these capabilities to create agent-native DevOps workflows that automate engineering, support, and operational processes. Develop and optimize retrieval systems, including RAG pipelines, vector databases, and knowledge graph integrations, to provide agents with accurate, relevant, and scalable context. Build evaluation frameworks and automated testing pipelines to measure agent quality, reliability, safety, latency, and business impact, using those insights to continuously improve system performance. Prototype, iterate, and deploy AI-powered tools that improve internal productivity and deliver actionable insights to customers. Partner closely with engineering, product, field operations, and customer-facing teams to identify high-leverage opportunities for automation and agent-driven workflows.
Robotics Software Engineer
The Robotics Software Engineer will help develop and grow the data collection labs, owning the entire integration lifecycle including identifying and sourcing new hardware and collaborating with mechanical and electrical engineers on setup, software integration, and operational deployment. They will develop innovative robot control interfaces suited to a variety of morphologies, environments, and tasks, collaborate closely with research and engineering teams to develop automation tools and machinery that facilitate the evaluation of advanced robotic policies, and lead the design and implementation of data collection, visualization, and quality control processes.
Safety Coordinator / Lab Lead
As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications while supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components from APIs to UI, maintaining a responsive and production-ready environment. You will build automated systems to monitor model performance and data drift across geographically dispersed environments to ensure reliability. You will manage the technical lifecycle within diverse regulatory frameworks and lead the response for production issues in mission-critical environments, ensuring rapid resolution and building guardrails to prevent recurrence. You will translate deep technical performance metrics into clear insights for senior international government officials and partner with Engineering and ML teams to ensure lessons learned influence future technical architecture and decisions.
Technical Program Manager, Platform
As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will own the production outcome by taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will ensure full-stack integrity by overseeing the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components from APIs to UI to maintain a responsive and production-ready environment. You will build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring reliability. You will manage the technical lifecycle within diverse regulatory frameworks and lead the response for production issues in mission-critical environments to ensure rapid resolution and build guardrails to prevent recurrence. You will translate deep technical performance metrics into clear insights for senior international government officials and partner with Engineering and ML teams to ensure lessons learned influence the technical architecture and decisions of future use cases.
