Applied ML Engineer, Data
Build and maintain data pipelines for large video generation models, including data ingestion, parsing, filtering, preprocessing, and dataset curation at scale, using tools such as AWS S3 and DynamoDB. Design and run annotation workflows across platforms such as Amazon Mechanical Turk (MTurk) and Prolific, including task design, quality control, and label validation. Train, evaluate, and improve smaller supporting models used for data filtering, quality assessment, preprocessing, or other parts of the ML pipeline. Partner closely with research and engineering teams to turn experimental workflows into scalable, repeatable systems that support model training and evaluation. Own data quality across the pipeline by identifying bottlenecks, failure modes, and low-quality sources, and continuously improving tooling and processes. Build internal tools and automation that make it easier to prepare datasets, launch annotation jobs, monitor outputs, and support model development end to end. Drive larger pipeline projects from start to finish, such as new dataset creation efforts or upgrades to labeling and preprocessing infrastructure. Work within a Kubernetes-based training infrastructure, ensuring datasets are properly prepared, formatted, and delivered to training clusters. Profile and optimize research model inference scripts used in preprocessing steps, ensuring that model-driven filtering and transformation stages run within practical time and cost constraints when applied to large-scale raw data.
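The filtering and curation stages described above can be sketched as a simple metadata-gating pass. This is a minimal, illustrative sketch: the field names, thresholds, and helper names below are assumptions for the example, not part of any specific production pipeline.

```python
# Minimal sketch of a metadata-based filtering stage for video dataset
# curation, assuming each raw record has already been parsed into a dict.
# Thresholds and field names are illustrative assumptions.

MIN_DURATION_S = 2.0   # drop clips too short to be useful for training
MIN_HEIGHT = 480       # drop low-resolution sources

def passes_filters(record: dict) -> bool:
    """Return True if a video record survives the basic quality gates."""
    if record.get("duration_s", 0.0) < MIN_DURATION_S:
        return False
    if record.get("height", 0) < MIN_HEIGHT:
        return False
    if record.get("corrupt", False):
        return False
    return True

def curate(records: list[dict]) -> list[dict]:
    """Apply all filters and keep only records suitable for training."""
    return [r for r in records if passes_filters(r)]
```

In a real pipeline, each gate would typically be tracked separately so that per-filter drop rates can be monitored over time.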
Machine Learning Engineer
Design, develop, and deploy end-to-end machine learning pipelines, ensuring efficiency in training, validation, and inference. Implement MLOps best practices, including CI/CD for ML models, model versioning, monitoring, and retraining strategies. Optimize ML models using feature engineering, hyperparameter tuning, and scalable inference techniques. Work with structured and unstructured data, leveraging Pandas, NumPy, and SQL for efficient data manipulation. Apply machine learning design patterns to build modular, reusable, and production-ready models. Collaborate with data engineers to develop high-performance data pipelines for training and inference. Deploy and manage models on cloud platforms (AWS, GCP, Azure) with containerization and orchestration tools like Docker and Kubernetes. Maintain model performance by implementing continuous monitoring, bias detection, and explainability techniques.
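The hyperparameter tuning mentioned above can be sketched, at its simplest, as an exhaustive grid search over a parameter grid. The `train_and_score` callable here is a hypothetical stand-in for a real training-and-validation routine, not part of any specific framework.

```python
# Minimal sketch of hyperparameter tuning via grid search, assuming a
# `train_and_score(**params) -> float` callable where higher is better.
from itertools import product

def grid_search(train_and_score, grid: dict):
    """Evaluate every combination in `grid`; return (best_params, best_score)."""
    keys = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice, grid search is often replaced with random or Bayesian search once the grid grows beyond a handful of parameters.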
Lead/Manager Site Reliability Engineering Team (Amsterdam)
Advance inference efficiency end-to-end by designing and prototyping algorithms, architectures, and scheduling strategies for low-latency, high-throughput inference. Implement and maintain changes in high-performance inference engines such as SGLang- or vLLM-style systems and Together's inference stack, including kernel backends, speculative decoding methods like ATLAS, and quantization. Profile and optimize performance across GPU, networking, and memory layers to improve latency, throughput, and cost. Unify inference with RL/post-training by designing and operating RL and post-training pipelines where inference constitutes the majority of the cost, optimizing algorithms and systems jointly. Enhance RL and post-training workloads with inference-aware training loops, including asynchronous RL rollouts and speculative decoding techniques, making large-scale rollout collection and evaluation more efficient. Use these pipelines to train, evaluate, and iterate on cutting-edge models built on the inference stack. Co-design algorithms and infrastructure to tightly couple objectives, rollout collection, and evaluation to efficient inference, and quickly identify bottlenecks across training engines, inference engines, data pipelines, and user-facing layers. Run ablation and scale-up experiments to analyze trade-offs between model quality, latency, throughput, and cost, feeding insights back into model, RL, and system design. Own critical production-scale systems by profiling, debugging, and optimizing inference and post-training services under real production workloads. Lead roadmap initiatives requiring engine modifications, such as changes to kernels, memory layouts, scheduling logic, and APIs. Establish metrics, benchmarks, and experimentation frameworks to rigorously validate improvements.
Provide technical leadership by setting direction for cross-team efforts at the intersection of inference, RL, and post-training, and by mentoring engineers and researchers on full-stack ML systems work and performance engineering.
Senior Machine Learning Engineer, Voice AI
Advance inference efficiency end-to-end by designing and prototyping algorithms, architectures, and scheduling strategies for low-latency, high-throughput inference; implement and maintain changes in high-performance inference engines including kernel backends, speculative decoding, and quantization; profile and optimize performance across GPU, networking, and memory layers to improve latency, throughput, and cost. Unify inference with RL/post-training by designing and operating RL and post-training pipelines, optimizing algorithms and systems where most costs are inference, and making RL and post-training workloads more efficient with inference-aware training loops. Use these pipelines to train, evaluate, and iterate on frontier models on the inference stack. Co-design algorithms and infrastructure to tightly couple objectives, rollout collection, and evaluation to efficient inference, identifying bottlenecks in the training engine, inference engine, data pipeline, and user-facing layers. Run ablations and scale-up experiments to understand trade-offs between model quality, latency, throughput, and cost, feeding insights back into model, RL, and system design. Own critical systems at production scale by profiling, debugging, and optimizing inference and post-training services under real production workloads and driving roadmap items requiring engine modifications including kernels, memory layouts, and scheduling logic. Establish metrics, benchmarks, and experimentation frameworks to rigorously validate improvements. Provide technical leadership by setting technical direction for cross-team efforts at the intersection of inference, RL, and post-training, and mentoring engineers and researchers in full-stack ML systems work and performance engineering.
Member of Technical Staff - Mid-Training Infra
Design, build, and operate large-scale GPU infrastructure for high-throughput model inference and mid-training workloads. Develop systems that power synthetic data generation and reinforcement learning pipelines at scale. Build high-performance inference platforms capable of serving and evaluating models across thousands of GPUs. Optimize throughput, latency, and GPU utilization for large language model inference and rollout workloads. Build infrastructure that supports reinforcement learning pipelines, including large-scale rollout generation, evaluation, and policy improvement loops. Work closely with research teams to support distributed RL workloads and large-scale model evaluation infrastructure. Improve performance of model execution through kernel-level optimization, model parallelism strategies, and GPU runtime improvements. Develop distributed systems that enable large-scale synthetic data generation and RL-driven training workflows. Diagnose and resolve performance bottlenecks across inference runtimes, GPU kernels, networking, and distributed compute systems.
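The throughput optimization for rollout workloads described above often hinges on continuous batching: finished sequences leave the batch immediately and waiting requests fill the freed slots, rather than the whole batch blocking on its longest member. The sketch below is a scheduling simulation only, with illustrative request fields; it does no actual model execution.

```python
# Minimal simulation of continuous batching for LLM serving.
# Each request is (id, num_decode_steps); max_batch is the slot count.
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Simulate step-wise decoding; returns request ids in completion order."""
    waiting = deque(requests)
    running = {}            # id -> remaining decode steps
    finished = []
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < max_batch:
            rid, steps = waiting.popleft()
            running[rid] = steps
        # One decode step for every running sequence.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]          # slot freed immediately
                finished.append(rid)
    return finished
```

Static batching would instead hold all slots until the longest sequence finished, wasting GPU time on padding; the immediate-admission loop above is what keeps utilization high under mixed sequence lengths.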
Member of Technical Staff - Pre-Training Infra
Build and scale distributed training systems that power frontier model pre-training. Work closely with research teams to design and operate large-scale training runs for foundation models. Develop infrastructure that enables efficient training across thousands of GPUs using modern distributed training frameworks. Optimize training throughput, stability, and efficiency for large model training workloads. Collaborate directly with pre-training researchers to translate experimental ideas into scalable, production-ready training systems. Improve performance of distributed training workloads through optimization of communication, memory usage, and GPU utilization. Build and maintain training pipelines that support large-scale datasets, checkpointing, and experiment iteration. Debug and resolve performance bottlenecks across distributed training stacks including model parallelism, GPU communication, and training runtime systems. Contribute to the development of systems that enable rapid experimentation and iteration on new training techniques.
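The communication optimization mentioned above centers on the all-reduce at the heart of data-parallel training: each worker computes gradients on its shard, then all workers average them so every replica applies the identical update. The sketch below shows only the arithmetic, with plain Python lists standing in for tensors; real frameworks overlap this communication with the backward pass.

```python
# Minimal sketch of data-parallel gradient averaging (the effect of an
# all-reduce) followed by one synchronized SGD step.

def allreduce_mean(per_worker_grads: list[list[float]]) -> list[float]:
    """Average gradients element-wise across all workers."""
    n_workers = len(per_worker_grads)
    n_params = len(per_worker_grads[0])
    return [
        sum(g[i] for g in per_worker_grads) / n_workers
        for i in range(n_params)
    ]

def sgd_step(params: list[float], grads: list[float], lr: float) -> list[float]:
    """Apply the same SGD update on every replica, keeping them in sync."""
    return [p - lr * g for p, g in zip(params, grads)]
```

Because every replica sees the same averaged gradient, parameters stay bit-identical across workers without ever shipping the parameters themselves.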
Research Scientist, AI Controls and Monitoring
As a Production AI Ops Lead, you will own the production lifecycle of full-stack AI applications, supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure. You will own production outcomes by taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will ensure full-stack integrity by overseeing the end-to-end health of the platform and the seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment. You will build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the required levels of reliability. You will manage the technical lifecycle within diverse regulatory frameworks and lead the response to production issues in mission-critical environments, ensuring rapid resolution and building guardrails to prevent recurrence. You will translate deep technical performance metrics into clear insights for senior international government officials and partner with Engineering and ML teams to ensure lessons learned influence the technical architecture and decisions of future use cases.
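The data-drift monitoring described above is often implemented with a distribution-distance statistic such as the population stability index (PSI), comparing a live feature's histogram against its training-time reference. This is one common choice, not necessarily the method used here; the 0.2 alert threshold in the test is a commonly quoted "investigate" level, used illustratively.

```python
# Minimal sketch of data-drift detection via the population stability
# index (PSI) between two histograms given as probability vectors over
# the same bins. PSI ~ 0 means no shift; larger values mean more drift.
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)   # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total
```

A monitoring job would compute this per feature on a schedule and page when any feature's PSI crosses the agreed threshold.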
Machine Learning Engineer
As a Machine Learning Engineer at Faculty, you will be responsible for building and deploying production-grade machine learning software, tools, and infrastructure. You will create reusable, scalable solutions that accelerate the delivery of machine learning systems, collaborate with engineers, data scientists, and commercial leads to solve critical client challenges, lead technical scoping and architectural decisions to ensure project feasibility and impact, define and implement standards for deploying machine learning at scale, and act as a technical advisor to customers and partners by translating complex ML concepts for stakeholders.
Member of Technical Staff - ML Engineering
Deploy, maintain, and optimize production and research compute clusters. Design and implement scalable and efficient ML inference solutions. Develop dynamic and heterogeneous compute solutions for balancing research and production needs. Contribute to productizing model APIs for external use. Develop infrastructure observability and monitoring solutions.
AI Evaluation Engineer
Design and implement evaluation pipelines to measure the performance and reliability of AI models, develop automated testing frameworks to assess model outputs at scale, analyze model performance using both traditional statistical metrics and AI-specific evaluation methods, evaluate AI systems built on modern architectures such as LLM-based applications and Retrieval-Augmented Generation (RAG), identify potential issues related to accuracy, hallucinations, bias, safety, and model drift, conduct adversarial testing to uncover vulnerabilities and ensure safe model behavior, collaborate with engineering and AI teams to improve prompt design, model outputs, and system performance, monitor model performance in production, and help define best practices for AI evaluation and observability.
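The automated evaluation pipelines described above can be sketched with two staple metrics for QA-style model outputs: exact match and token-level F1. The normalization rules below (lowercase plus whitespace split) are a deliberate simplification of what production evaluation harnesses do, and the function names are illustrative.

```python
# Minimal sketch of an automated evaluation pass over (prediction,
# reference) pairs, computing exact match and token-level F1.

def normalize(text: str) -> list[str]:
    """Illustratively simple normalization: lowercase + whitespace split."""
    return text.lower().split()

def exact_match(pred: str, ref: str) -> float:
    return float(normalize(pred) == normalize(ref))

def token_f1(pred: str, ref: str) -> float:
    """F1 over shared tokens, counting duplicates at most once each."""
    p, r = normalize(pred), normalize(ref)
    common = 0
    ref_left = list(r)
    for tok in p:
        if tok in ref_left:
            common += 1
            ref_left.remove(tok)
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

def evaluate(pairs: list[tuple[str, str]]) -> dict:
    """Aggregate both metrics over all (prediction, reference) pairs."""
    n = len(pairs)
    return {
        "exact_match": sum(exact_match(p, r) for p, r in pairs) / n,
        "f1": sum(token_f1(p, r) for p, r in pairs) / n,
    }
```

LLM-specific concerns such as hallucination or bias need additional judges (rule-based, retrieval-grounded, or model-based) layered on top of simple reference metrics like these.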
