Member of Technical Staff - ML Engineering
Deploy, maintain, and optimize production and research compute clusters. Design and implement scalable and efficient ML inference solutions. Develop dynamic and heterogeneous compute solutions for balancing research and production needs. Contribute to productizing model APIs for external use. Develop infrastructure observability and monitoring solutions.
AI Evaluation Engineer
Design and implement evaluation pipelines to measure the performance and reliability of AI models. Develop automated testing frameworks to assess model outputs at scale. Analyze model performance using both traditional statistical metrics and AI-specific evaluation methods. Evaluate AI systems built on modern architectures such as LLM-based applications and Retrieval-Augmented Generation (RAG). Identify potential issues related to accuracy, hallucinations, bias, safety, and model drift. Conduct adversarial testing to uncover vulnerabilities and ensure safe model behavior. Collaborate with engineering and AI teams to improve prompt design, model outputs, and system performance. Monitor model performance in production, and help define best practices for AI evaluation and observability.
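The evaluation pipelines described above can start very small. Below is a minimal, hypothetical harness (the names and the exact-match metric are illustrative assumptions, not any specific company's stack) that scores model outputs against references and aggregates an accuracy figure:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    model_output: str
    reference: str

def exact_match(output: str, reference: str) -> bool:
    # Normalize whitespace and case before comparing.
    return output.strip().lower() == reference.strip().lower()

def run_eval(cases, metric=exact_match):
    """Score each case with `metric` and aggregate into summary statistics."""
    results = [metric(c.model_output, c.reference) for c in cases]
    accuracy = sum(results) / len(results) if results else 0.0
    return {"n": len(cases), "accuracy": accuracy}
```

In practice the scoring function would be swapped for task-appropriate metrics (semantic similarity, LLM-as-judge, safety classifiers), and the aggregation extended with per-category breakdowns.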
Engineering Manager, Active Learning
The Engineering Manager role at Deepgram involves leading the design and implementation of internal data and ML training systems. Responsibilities include recruiting, hiring, training, and supporting top engineering talent to build a world-class team; transforming cross-functional visions into detailed project plans with clarity on commitments, risks, and timelines; defining and owning technical strategy to accelerate ML training pipelines; promoting a strong team engineering culture focused on rigorous engineering standards and continuous improvement; partnering with DataOps and Research teams to design and implement new services, features, or products end to end; and coaching and mentoring engineers to support personal growth while achieving ambitious team goals.
Senior Machine Learning Engineer
Lead the exploration and application of Large Language Models and Generative AI, focusing on new areas within these fields. Translate the latest research into high-performing systems and models that can enhance user experiences. Help set the team's strategic direction, fostering an environment that encourages innovation and professional growth. Actively engage in all aspects of development including ideation, experimentation, implementation, and deployment. Collaborate with various teams and product managers to develop and implement ML-based solutions, ensuring performance optimization and alignment with broader business goals.
Member of Technical Staff - Research Software Engineer
The role involves bridging the gap between research and production by transforming cutting-edge algorithms into scalable training systems. Responsibilities include designing and optimizing large-scale training loops and data pipelines, implementing state-of-the-art techniques ensuring numerical stability and computational efficiency, building internal tooling for launching, monitoring, and reproducing complex experiments, diagnosing deep bottlenecks across the training stack such as GPU memory issues, communication overhead, and dataloader stalls, and translating research prototypes into reusable, production-grade infrastructure. The engineer will architect and optimize the core training infrastructure including RL training loops, distributed GPU systems, and large-scale data pipelines, working closely with researchers to build reliable, scalable systems.
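One of the bottlenecks named above, dataloader stalls, can be surfaced with a thin timing wrapper around the loader. A minimal sketch (the threshold and interface are assumptions, not a prescribed design):

```python
import time

def timed_batches(loader, stall_threshold_s=0.5):
    """Wrap a data loader, yielding (batch, fetch_seconds, stalled).

    A fetch that exceeds the threshold usually means the input pipeline
    (I/O, decoding, augmentation) is not keeping up with the consumer.
    """
    it = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return
        fetch = time.perf_counter() - start
        yield batch, fetch, fetch > stall_threshold_s
```

A real training stack would feed these timings into metrics/tracing rather than inspecting them inline, but the measurement point is the same: the gap between requesting a batch and receiving it.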
LLM Observability Engineer
Own the observability and lifecycle management of AI features across the organization. Build tools and infrastructure to enable teams to develop, monitor, and optimize LLM-powered features. Design and implement closed-loop evaluation pipelines that automatically validate prompt changes. Develop comprehensive metrics and dashboards to track LLM usage: cost per feature, token patterns, and latency. Create systems that tie user feedback to specific prompts and LLM calls. Establish best practices and processes for the full lifecycle of prompts: development, testing, deployment, and monitoring. Collaborate with engineering teams across the organization to ensure they have the tools and visibility needed to build high-quality AI features.
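The usage metrics described here (cost per feature, token counts, latency) reduce to a small aggregation over call logs. A hedged sketch, assuming a flat per-1K-token price table and a simple log schema (both made up for illustration; real prices depend on the provider and model):

```python
import math
from collections import defaultdict

# Hypothetical per-1K-token prices; real values depend on the provider/model.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}

def summarize_llm_calls(calls):
    """Aggregate raw call logs into per-feature cost/token/latency metrics.

    Each call is a dict with keys: feature, input_tokens, output_tokens,
    latency_ms (an assumed log schema, for illustration only).
    """
    by_feature = defaultdict(list)
    for call in calls:
        by_feature[call["feature"]].append(call)

    summary = {}
    for feature, fc in by_feature.items():
        cost = sum(
            c["input_tokens"] / 1000 * PRICE_PER_1K["input"]
            + c["output_tokens"] / 1000 * PRICE_PER_1K["output"]
            for c in fc
        )
        latencies = sorted(c["latency_ms"] for c in fc)
        # Nearest-rank p95.
        idx = min(len(latencies) - 1, max(0, math.ceil(len(latencies) * 0.95) - 1))
        summary[feature] = {
            "calls": len(fc),
            "cost_usd": round(cost, 6),
            "total_tokens": sum(c["input_tokens"] + c["output_tokens"] for c in fc),
            "p95_latency_ms": latencies[idx],
        }
    return summary
```

The same aggregation, keyed by prompt version instead of feature, is what lets dashboards tie a cost or latency regression back to a specific prompt change.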
Principal AI Ops Architect, IPS
As a Production AI Ops Lead, you will own the production lifecycle of full-stack AI applications, supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies, and oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment. Build automated systems to monitor model performance and data drift across geographically dispersed environments. Manage the technical lifecycle within diverse regulatory frameworks. Lead the response to production issues in mission-critical environments, ensuring rapid resolution and building guardrails to prevent recurrence. Translate deep technical performance metrics into clear insights for senior international government officials, and partner with Engineering and ML teams to ensure field lessons influence the technical architecture and future use cases.
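Automated drift monitoring of the kind mentioned above is often built on a statistic such as the Population Stability Index (PSI), which compares the binned distribution of a feature between a reference window and production. A self-contained sketch (the bin count and thresholds are conventional assumptions, not a mandated design):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index of `actual` against reference `expected`.

    Rule of thumb (an assumption; tune per use case): < 0.1 stable,
    0.1-0.25 moderate drift, > 0.25 significant drift.
    """
    lo, hi = min(expected), max(expected)

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            # Bin by position in the reference range, clamping outliers
            # into the edge bins.
            idx = int((x - lo) / (hi - lo) * bins) if hi > lo else 0
            counts[max(0, min(idx, bins - 1))] += 1
        # Epsilon keeps empty bins out of log(0).
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A monitoring system would compute this per feature on a schedule and alert when the score crosses the chosen threshold.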
Engineering Manager, MLOps
Lead the team responsible for the infrastructure supporting AI/ML Stack, focusing on scalability and efficiency of the Machine Learning Operations platform. Develop and execute the long-term vision and roadmap for the MLOps team to support ML development and deployment across business units, balancing short-term tactical deliveries with long-term architectural transformation. Manage and mentor a team of 6-7+ engineers, allocating resources strategically to support existing services and execute key strategic initiatives. Collaborate cross-functionally with leaders in machine learning, data science, product engineering, and infrastructure to identify pain points, remove bottlenecks, and facilitate new solution deployment. Architect compute and storage pipelines for ML Engineers to manage large datasets and artifacts efficiently. Modernize the AI product inference stack for significant growth in global deployments. Work with Site Reliability Engineering to establish comprehensive system observability metrics. Conduct assessments for technology refresh and benchmark proprietary tools against commercial and open-source alternatives to meet future needs.
Machine Learning Operations Engineer
Optimize orchestration processes to ensure efficient deployment and management of AI models. Implement cost-saving strategies to minimize infrastructure expenses while maximizing performance. Improve throughput to enhance the scalability and responsiveness of AI systems. Collaborate with cross-functional teams to identify bottlenecks and implement solutions to improve workflow efficiency. Ship new features and updates rapidly while maintaining high levels of quality and reliability. Deploy and monitor machine learning models produced by deep learning engineers. Design, deploy, and maintain performant and scalable processes for data acquisition and manipulation to enhance dataset accessibility. Participate actively in the team's software development process, including design reviews, code reviews, and brainstorming sessions. Maintain accurate and updated software development documentation.
AI Software Engineer (Model Training)
Build and maintain systems that support large-scale model training. This includes designing and maintaining distributed training pipelines for large language models, building data ingestion and preprocessing systems for large training datasets, developing tooling for experiment management, checkpointing, and reproducibility, monitoring and debugging long-running training jobs across clusters, improving reliability and observability across the training stack, optimizing training throughput across compute, memory, and data pipelines, working closely with researchers to translate experimental ideas into training runs, and diagnosing failures across infrastructure, training loops, and data pipelines. The work involves digging into code, logs, dashboards, and experiment outputs to make large-scale training reliable.
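The experiment-management and checkpointing tooling mentioned above often starts with something as simple as step-indexed checkpoints carrying a config hash, so a resumed run can verify it matches the original configuration. A minimal sketch (the file naming and JSON layout are assumptions for illustration; real checkpoints would store binary model state, not JSON):

```python
import hashlib
import json
from pathlib import Path

def config_hash(config: dict) -> str:
    """Stable hash of the run config, for reproducibility checks."""
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def save_checkpoint(ckpt_dir: Path, step: int, config: dict, state: dict) -> Path:
    """Write a checkpoint with enough metadata to resume/reproduce the run."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    # Zero-padding makes lexicographic sort match step order.
    path = ckpt_dir / f"step_{step:08d}.json"
    path.write_text(json.dumps(
        {"step": step, "config_hash": config_hash(config), "state": state}))
    return path

def latest_checkpoint(ckpt_dir: Path):
    """Return the highest-step checkpoint, or None if none exist."""
    ckpts = sorted(ckpt_dir.glob("step_*.json"))
    return json.loads(ckpts[-1].read_text()) if ckpts else None
```

On resume, comparing `config_hash` against the current run's config catches the classic failure of restoring a checkpoint into a silently changed configuration.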
