Data Engineer - Foundational
As a Data Engineer on the Foundational team, you will build ETL/ELT pipelines to extract, decode, and store raw Electro-Optical (EO) and Infrared (IR) video from field logs into optimized formats like WebDataset, TFRecords, or Parquet. You will develop algorithms to synchronize EO and IR frames temporally and spatially to provide paired inputs for model training. You will architect storage-to-GPU pipelines to ensure multi-node training clusters maintain over 90% GPU utilization without I/O bottlenecks. Your role includes writing and optimizing distributed data processing jobs using tools such as Apache Spark, Ray, or Apache Beam to process thousands of hours of tactical video logs. You will implement automated quality checks to filter corrupted or blank frames and maintain reproducible training runs through robust versioning and lineage tracking. Additionally, you will assess and implement advanced storage solutions like MinIO and S3 tiering to manage growing datasets while optimizing cost and latency.
Software Data Engineer
Collaborate with Machine Learning and Full-stack engineers and with Science to solve complex document mining challenges and to capture and model additional scientific experiments. Scale data pipelines for rapid and reliable data transfer from research to platform, working with semi-structured and unstructured data. Define and apply best practices for technologies in a cloud-based environment. Architect and maintain robust data pipelines that ingest diverse sources and use large language models (LLMs) for high-fidelity entity extraction into structured formats. Implement evaluation frameworks to monitor extraction models' accuracy, drift, and hallucination rates in production pipelines. Lead or consult on engineering design proposals according to the Platform Stream roadmap. Make independent technical decisions based on business context and team goals. Proactively identify new opportunities and implement project improvements. Respond with urgency to operational issues and own their resolution within your area of responsibility. Challenge the status quo by proposing new technologies or ways of working.
Member of Technical Staff, Pre-training Data
As a Software Engineer on the Pre-training Data team, you will design and operate the systems that define the model’s training corpus at scale. Your work will focus on large-scale data acquisition, processing, filtering, mixture design, and ablation-driven iteration. You will handle infrastructure and experimental loops that decide the data used for training and thereby influence what the model learns. Responsibilities include building and operating large-scale web crawling, scraping, and ingestion pipelines; designing filtering, deduplication, quality controls, and dataset versioning systems; running data ablations across sources, rewrites, mixtures, and long-sequence strategies; optimizing distributed data processing systems for throughput and cost efficiency; improving observability and reliability of large ETL and dataflow jobs; and collaborating with Research and Training Systems teams to ensure corpus design aligns with model behavior.
Senior Data Engineer
Lead the end-to-end design and delivery of scalable, secure, and intelligent data products and solutions that support HackerOne’s transformation into an AI-first organization. Partner across business and engineering teams to identify high-leverage opportunities for automation, integration, and system modernization. Drive the architecture and execution of platform-level capabilities, leveraging AI and modern tooling to reduce manual effort, improve decision-making, and increase system resilience. Provide technical leadership to internal engineers and external development partners, ensuring design quality, operational excellence, and long-term maintainability. Shape and contribute to incident and on-call response strategy, playbooks, and processes, focusing on building systems that fail gracefully and recover quickly. Act as a multiplier to mentor other engineers, advocate for technical excellence, and promote a culture of innovation, curiosity, and continuous improvement. Champion effective change management and enablement, ensuring systems are launched, adopted, understood, and evolved.
Senior AI Platform Engineer (Autonomous Driving)
Set technical strategy and oversee development of a high-scale, reliable data platform to manage, visualize, and serve large-scale datasets for ML model training and validation. Build the data lakehouse for autonomous driving scene datasets, including sensor data, calibration data, and annotation data. Drive the Autonomous Driving Data SDK development, including scene data search, dataset preparation, and dataset loading. Identify and resolve performance bottlenecks in the data processing pipelines, including data processing latency, data search latency, and Test Procedure (TP) coverage. Bootstrap and maintain infrastructure for data platform components such as Data Processing Pipeline, Database, Data Lakehouse, and Data Serving. Collaborate with cross-functional teams, including ML algorithm, ML application, and Cloud Infrastructure teams, to align ML platforms with the overall Autonomous Driving System Architecture.
Senior AI Data Pipeline Engineer
Design and build high-performance, scalable data pipelines to support diverse AI and Machine Learning initiatives across the organization. Architect and implement multi-region data infrastructure to ensure global data availability and seamless synchronization. Develop flexible pipeline architectures that allow for complex branching and logic isolation to support multiple concurrent AI projects. Optimize large-scale data processing workloads using Databricks and Spark to maximize throughput and minimize processing costs. Maintain and evolve the containerized data environment on Kubernetes, ensuring robust and reliable execution of data workloads. Collaborate with AI researchers and platform teams to streamline the flow of high-quality data into training and evaluation pipelines.
Senior Data Engineer
The Senior Data Engineer on the Foundations team will create technical foundations, including infrastructure, tools, and APIs, that enable the entire company to access product data safely and efficiently. Responsibilities include defining schemas for new entities and refactoring existing models for improved performance and clarity; transitioning legacy data scripts into robust, version-controlled services; designing and developing domain-driven services with reusable APIs; creating a universal data layer with APIs and connectors such as Data Warehouse APIs (GraphQL); building features in the Internal Developer Platform (IDP) to simplify AI model deployment and management; building infrastructure for GenAI, such as Vector Databases or Model Context Protocols; automating security and compliance checks to ensure data privacy and safety; replacing manual approval gates with automated checks to maintain speed without compromising safety; and creating a high-fidelity data layer that allows non-technical stakeholders to generate reports without understanding raw tables. The role requires close collaboration with Infrastructure, DevEx, and Security engineers, and with internal Tech, Data Science, ML, and Business teams, to enable self-serve data usage across the company.
Data Engineer | Power
As a Data Engineer, you will build and evolve the data backbone of an AI-first product including document intelligence, time-series IoT data, and agentic AI systems. You will design, implement, and operate data systems across the full lifecycle from raw ingestion to AI-driven outputs used by customers. You will work directly with customers and internal stakeholders to understand problems and translate them into technical solutions, iterating quickly. Responsibilities include building pipelines that support document processing, sensor data, and ML workflows, contributing to feature engineering and model experimentation when needed, and owning systems in production. You will make architectural decisions, improve system reliability over time, and help define best practices as the team and product scale.
Senior Data Engineer, People Analytics
Build and maintain resilient ETL pipelines to centralize data from core HCM and ATS systems into Google Cloud Platform, BigQuery, and other people analytics products. Architect a semantic data layer using dbt to translate raw database schemas into business-friendly logic, enabling non-technical leaders to ask natural language questions and get accurate answers. Leverage AI and LLMs to extract insights from unstructured data and build predictive models for attrition and headcount planning. Design data products that solve operational problems by automating HR workflows, building custom apps for internal mobility, or redesigning organizational structure. Partner with Talent, Finance, and People leaders to translate business questions into data inquiries and consult on analytics possibilities. Design and deploy Sigma workbooks that guide executives through complex narratives and support data-driven action.
ML Systems Engineer (Platform & Biometrics Data Infrastructure)
Build and operate high-throughput pipelines for sensor and event data (batch and streaming) ensuring quality, lineage, and reliability. Create scalable dataset curation and labeling workflows including sampling, slice definitions, weak supervision, gold-set management, and evaluation set integrity. Develop ML platform components such as feature pipelines, training orchestration, model registry, reproducible experiment tracking, and automated evaluation. Implement monitoring and observability for production ML systems covering data drift, performance regression, alerting, and automated failure detection. Standardize schemas and interfaces across studies and product telemetry to enable reusable, consistent analytics and model development. Collaborate cross-functionally with ML engineers, data science, firmware, and backend teams to support new studies and product launches, ensuring data architecture meets evolving research and product needs.
