Navigating Bottlenecks: Infrastructure Lessons from AV ML Systems
Autonomous Vehicle (AV) ML systems demand infrastructure that can handle real-time perception, high-throughput data, and latency-critical workloads. While model optimization gets most of the attention, infrastructure bottlenecks often determine overall system performance. This talk shares lessons from scaling AV ML pipelines using Kubernetes-native tools. We’ll cover orchestration with Dagster, distributed execution via Ray, and dynamic GPU scaling with Kueue and KubeRay. From cloud-based fleet learning to edge-deployed perception, we’ll explore how to balance performance, cost, and developer velocity. If you’re building or maintaining AV ML systems, this session offers practical strategies for moving fast without compromising safety or scalability.