ML/AI Engineer - Vehicle Intelligence
Develop AI-powered vehicle intelligence features that understand user intent, trip goals, vehicle state, and system constraints. Apply reinforcement learning, planning, optimization, and data-driven modeling to improve vehicle-level decisions across energy, comfort, charging, routing, and proactive vehicle preparation. Build models using vehicle telemetry, navigation data, user behavior, weather, traffic, cabin conditions, charging patterns, and fleet data. Create personalization models that learn user routines, comfort preferences, driving patterns, charging habits, and trip priorities while preserving privacy and user control. Use simulation, digital twins, and scenario-based testing to train, evaluate, and validate AI behavior before production deployment. Collaborate with autonomous driving and VLA teams to define interfaces for sharing user intent, route objectives, vehicle constraints, energy targets, comfort preferences, and system-level recommendations. Integrate ML models into production vehicle and cloud platforms, considering latency, compute efficiency, reliability, safety, explainability, and over-the-air update readiness. Work cross-functionally with Product, UX, Systems Engineering and Controls.
Technical Lead Manager - Training Runtime, Data(set) Movement
The Technical Lead Manager will own datasets throughout the training infrastructure and set the direction for how training jobs read data, including APIs, storage contracts, versioning model, benchmarks, debugging tools, and reliability guarantees to make data access consistent across current and future training frameworks. Responsibilities include designing and building a unified dataset read platform for multiple training frameworks; defining dataset APIs, storage-format expectations, registration/versioning, and migration paths to ensure reproducible and maintainable data access; building reliability into the read path such as stateful iteration, caching, fast restart, recovery, and clear operational contracts; developing terminal and web-based visualizers to inspect data late in the pipeline; writing and reviewing production code in core data loading, service, caching, and reliability paths; and partnering with teams working on training frameworks, reinforcement learning, multimodal models, storage, runtime, and cluster infrastructure. Over time, the role will expand to owning broader data movement systems including checkpoint loads/saves and snapshot transfers, working closely with technical leads and infrastructure teams.
Senior Product Engineer, Growth & Lifecycle Infrastructure - Music & Audio
Lead efforts to drive the design and development of customer-facing multi-modal machine learning inference systems. Work with the Platform and Inference teams on building inference systems for the next generation of models, focusing on optimization, model tuning, and deployment. Partner with leading cloud providers to deliver hosted Stability AI inference solutions. Serve as a strategic thought partner for leaders across the organization on driving business impact through machine learning. Contribute to bringing new Stability models and pipelines into existence. Prototype and productionize inference platform improvements and new features.
Staff ML Systems Engineer, Distributed Systems
Design and build scalable distributed machine learning pipelines across data processing, model training, evaluation, and post-processing workflows. Architect distributed execution systems, including parallelization strategies, workload scheduling, resource allocation, and fault tolerance mechanisms. Develop reusable abstractions, frameworks, and libraries that simplify distributed pipeline development. Optimize performance across distributed CPU and GPU environments, improving throughput, utilization, and reliability. Design systems that effectively manage data partitioning, memory utilization, serialization overhead, and compute efficiency. Partner closely with ML engineers, data engineers, and infrastructure teams to productionize research workflows and enable large-scale model development. Establish best practices and engineering standards for distributed machine learning infrastructure. Evaluate and guide decisions around distributed computing frameworks, infrastructure technologies, and system design trade-offs. Improve observability, debugging, monitoring, and operational tooling for distributed systems at scale.
Research Engineer – Evals
Build, from scratch, the evaluation systems that measure whether Firecrawl's outputs are effective across scraping, crawling, extracting, and mapping. This includes designing metrics, building pipelines, curating datasets, and integrating evaluations into continuous integration and deployment to catch regressions before release. Design benchmarks that represent the real customer data distribution, including edge cases, and create the collection and labeling systems. Own LLM-as-judge pipelines by designing and validating automated judges for scoring extraction quality, understanding LLM evaluation failure modes, and building human review tooling. Collaborate with research engineers working on models and reinforcement learning to use evaluation metrics as training signals and feedback loops to improve models. Design, run, and communicate fast experiments that test meaningful hypotheses and enable clear decision-making across the team.
Machine Learning Engineer (Singapore)
Build and scale systems for ingesting, processing, and delivering large-scale video and multimodal data for model training. Own the full pipeline from raw content to curated, filtered, and training-ready datasets, focusing on speed, reliability, reproducibility, and cost-efficiency. Design and scale distributed data pipelines for preprocessing, dataset generation, and repeated dataset refreshes. Own workflow orchestration, job scheduling, monitoring, and failure recovery for large-scale data processing jobs. Implement and maintain containerized pipeline infrastructure using Kubernetes or equivalent orchestration systems. Optimize cloud-based data storage and movement across providers (AWS, GCS, or Azure) for cost, throughput, and operational efficiency. Define and implement best practices for dataset storage layout, versioning, caching, retention, and access patterns. Design and implement curation pipelines for selection, filtering, and retention of video and image content for model training, including image-text pair datasets. Build and improve VLM-based captioning and metadata generation workflows at scale across video and image data. Develop and apply quality and aesthetic scoring models, CLIP-based semantic filtering, and other signal-extraction approaches for data selection. Build tooling to support deduplication workflows at scale, including near-duplicate and exact deduplication pipelines over large video corpora. Analyze dataset composition, identify quality issues, and iterate on curation logic to improve training outcomes. Define and evolve standards for high-quality, training-ready video data across different training regimes.
Machine Learning Intern
You will design, build, and ship AI-backed features that are reliable in production. Responsibilities include defining the quality bar by designing evaluation rubrics, test plans, and rollout criteria that are measurable and enforced; building with real-world constraints by writing and extending production code, setting up monitoring, and adding tests to catch regressions before users do; owning features end to end, from problem framing and modeling through system design, rollout, and iteration; debugging failures across the stack, including data, infrastructure, model, and prompt logic, and hardening the system with what you learn; designing and implementing systems such as retrieval pipelines, agents, or hybrid patterns depending on problem needs; and working across functions by collaborating with product, infrastructure, and engineering teams to ship features that stick.
Member of Technical Staff, Training (Bay Area, Remote)
Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack from data pipelines to GPU kernels. Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability, robustness, and high utilization. Implement efficient low-level code (CUDA, cuDNN, Triton, custom kernels) and integrate it seamlessly into high-level training frameworks. Optimize workloads for hardware efficiency including CPU/GPU compute balance, memory management, data throughput, and networking. Develop monitoring and debugging tools for large-scale runs to enable rapid diagnosis of performance regressions and failures.
Member of Technical Staff, Training (Paris, London)
Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack, from data pipelines to GPU kernels. Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability, robustness, and high utilization. Implement efficient low-level code (CUDA, cuDNN, Triton, custom kernels) and integrate it seamlessly into high-level training frameworks. Optimize workloads for hardware efficiency including CPU/GPU compute balance, memory management, data throughput, and networking. Develop monitoring and debugging tools for large-scale runs, enabling rapid diagnosis of performance regressions and failures.
Member of Engineering (Reinforcement Learning Infrastructure)
Keep up with the latest research, and be familiar with the state of the art in LLMs, RL, and code generation. Develop methods for tuning training and inference end-to-end for high throughput. Design data control systems in an RL pipeline that govern what the model sees and when. Debug cases where infrastructure decisions are silently degrading learning dynamics. Build observability tooling that surfaces when a system-level issue is the root cause of a training regression. Help build robust, flexible, and scalable RL pipelines. Optimize performance across the stack: networking, memory, compute scheduling, and I/O. Write high-quality, pragmatic code. Work closely with the team: plan next steps, discuss openly, and stay in touch.
