Data Ingestion: The Unsung Hero Of Connected Vehicle Data

Connected vehicle data comes in all shapes and sizes. On one side of the spectrum, you have relatively small amounts of well-structured data like vehicle attributes, repair records, and warranty claims. On the other, you’re dealing with terabytes or even petabytes of raw diagnostic data—sensor readings, faults, and manufacturing logs.
Handling this data in its raw, unprocessed form would be a major pain point for anyone involved in day-to-day analytics. Viaduct’s data ingestion process transforms chaotic data streams into structured, queryable formats, giving engineering and analytics teams a solid foundation to work from.
In today’s post, we’ll take a quick tour through the benefits of data ingestion pipelines, then dive deeper into three specific transformations we perform at Viaduct when ingesting customer data.
Ingestion Pipeline Benefits
It’s tempting to think that with enough data in a warehouse, you’re just a few SQL queries away from insights. And while that might be technically true with enough time and compute power, there are at least three big downsides to skipping a proper ingestion step.
1. Implementation complexity - Without a dedicated ingestion layer, your data fetching and transformation logic gets mixed in with business logic. Even with good engineering practices, this creates a major burden on analysts and turns straightforward business questions (e.g., “How many vehicles reported DTC P1022 in the 30 days leading up to a fuel injector failure?”) into technical deep dives that take tens or hundreds of steps to answer (see the sketch after this list for how such a question looks once the data is properly ingested).
2. Cost - Running the same transformations repeatedly is one of the fastest ways to blow your cloud compute budget—especially at terabyte or petabyte scale. A proper ingestion step lets us do heavy processing once, and trade expensive compute time for cheaper, scalable storage. This is almost always a net win.
3. Speed - Redundant processing isn’t just costly; it’s slow. Long query times limit the data you can analyze and drag down iteration speed. Having information in the right shape can turn a long-running batch process into an immediate query, and it is often the difference between real-time insights and overnight batch jobs. Reshaped data enables analyses that would otherwise be too slow or computationally infeasible, unlocking workflows that were previously out of reach.
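For contrast, here’s a minimal sketch in Python (pandas) of what the DTC question above can look like once the data has already been ingested into clean event tables. The table and column names (dtc_events, repairs, and their fields) are illustrative assumptions, not Viaduct’s actual schema.

```python
import pandas as pd

# Hypothetical ingested tables (names, columns, and values are illustrative).
dtc_events = pd.DataFrame({
    "vehicle_id": ["v1", "v1", "v2"],
    "dtc_code": ["P1022", "P0300", "P1022"],
    "timestamp": pd.to_datetime(["2025-03-20", "2025-02-01", "2025-03-25"]),
})
repairs = pd.DataFrame({
    "vehicle_id": ["v1", "v2"],
    "component": ["fuel_injector", "fuel_injector"],
    "repair_date": pd.to_datetime(["2025-04-10", "2025-04-12"]),
})

# Join P1022 events to fuel injector repairs, keep events from the 30 days
# before each repair, and count distinct vehicles.
merged = dtc_events.merge(
    repairs[repairs["component"] == "fuel_injector"], on="vehicle_id"
)
lead_time = merged["repair_date"] - merged["timestamp"]
hits = merged[
    (merged["dtc_code"] == "P1022")
    & (lead_time >= pd.Timedelta(0))
    & (lead_time <= pd.Timedelta(days=30))
]
print(hits["vehicle_id"].nunique())
```

With fetching, reshaping, and cleaning handled up front, the analyst’s job reduces to a join, a filter, and a count.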
With those challenges in mind, let’s look at three key transformations Viaduct performs as part of our ingestion pipeline.
Data Reshaping
One of the most common transformations when importing data is taking relatively few wide records and converting them into a larger number of narrow rows. For example, a typical record we see in source telematics systems looks like this:
vehicle_id: v123
timestamp: 2025-04-12T12:34:56Z
odometer: 65432
speed: <missing>
torque: <missing>
longitude: 14.505130
latitude: 46.051080
After transformation, we would store this information in three database rows:
vehicle_id: v123, timestamp: 2025-04-12T12:34:56Z, sensor_id: odometer, value: 65432
vehicle_id: v123, timestamp: 2025-04-12T12:34:56Z, sensor_id: longitude, value: 14.505130
vehicle_id: v123, timestamp: 2025-04-12T12:34:56Z, sensor_id: latitude, value: 46.051080
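As a minimal sketch, this wide-to-long reshaping can be expressed in a few lines of Python with pandas; the DataFrame and column names below are illustrative, not Viaduct’s actual pipeline code.

```python
import pandas as pd

# One wide telematics record, as in the example above (schema is illustrative).
wide = pd.DataFrame([{
    "vehicle_id": "v123",
    "timestamp": "2025-04-12T12:34:56Z",
    "odometer": 65432,
    "speed": None,
    "torque": None,
    "longitude": 14.505130,
    "latitude": 46.051080,
}])

# Melt the sensor columns into (sensor_id, value) pairs, then drop missing readings.
narrow = (
    wide.melt(
        id_vars=["vehicle_id", "timestamp"],
        var_name="sensor_id",
        value_name="value",
    )
    .dropna(subset=["value"])
)
print(narrow)  # three rows: odometer, longitude, latitude
```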
The benefits? We eliminate missing values, reduce storage costs, and improve query speed—especially since analytical engines tend to struggle with “holey” datasets.
Transactional systems might stop there, maybe adding an index or two. But for analytical workloads, physical layout matters. We sort sensor readings by sensor_id, then vehicle_id, then timestamp. This allows disk-level optimizations like sequential reads, which drastically speed up aggregations.
Since no single layout works for all queries, we also store reshaped data in multiple sorted formats—trading a bit more storage for much faster access. One OEM, for example, was able to go from calculating a key attribute on just 1,000 VINs per week to over 400,000 VINs per day using this approach.
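As a rough sketch, storing the same readings under more than one sort order might look like the following in pandas (with pyarrow available for Parquet output); the file names and sort orders are illustrative assumptions.

```python
import pandas as pd

# Long-format readings as produced by the reshaping step (values are illustrative).
narrow = pd.DataFrame({
    "vehicle_id": ["v123", "v123", "v123"],
    "timestamp": ["2025-04-12T12:34:56Z"] * 3,
    "sensor_id": ["odometer", "longitude", "latitude"],
    "value": [65432.0, 14.505130, 46.051080],
})

# One copy sorted for per-sensor scans across the whole fleet ...
narrow.sort_values(["sensor_id", "vehicle_id", "timestamp"]).to_parquet(
    "readings_by_sensor.parquet", index=False
)

# ... and a second copy sorted for pulling a single vehicle's full history.
narrow.sort_values(["vehicle_id", "timestamp", "sensor_id"]).to_parquet(
    "readings_by_vehicle.parquet", index=False
)
```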
Data Deduplication
Vehicle data often arrives with duplicates—thanks to retries, noise in the pipeline, or overlapping exports. There are two ways to handle this.
One option is to ignore duplicates during analysis—fine when they’re rare. But you can’t know that without first checking, and counting duplicates is nearly as expensive as removing them.
The better option—and the one we use—is to deduplicate data during ingestion. Once the data is reshaped, this becomes a tractable problem, and the payoff is worth it: a cleaner, more reliable pipeline.
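As a sketch of why reshaping makes this tractable: in the long format, a duplicate is simply a repeated (vehicle_id, timestamp, sensor_id) key, so removing it is a single pass. The pandas example below uses illustrative data, not our production code.

```python
import pandas as pd

# Long-format readings with one exact duplicate (e.g., from a retried upload).
readings = pd.DataFrame({
    "vehicle_id": ["v123", "v123", "v123"],
    "timestamp": ["2025-04-12T12:34:56Z"] * 3,
    "sensor_id": ["odometer", "odometer", "latitude"],
    "value": [65432.0, 65432.0, 46.051080],
})

# Keep the first occurrence of each (vehicle_id, timestamp, sensor_id) key.
deduped = readings.drop_duplicates(
    subset=["vehicle_id", "timestamp", "sensor_id"], keep="first"
)
print(deduped)  # two rows remain
```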
An added bonus of our ingestion logic is a substantial increase in user-friendliness. For example, when we receive data exports from customers rather than pulling data directly from telematics systems, customers don’t have to be precise about what they upload. The “upload a bit too much just in case” mentality is perfectly fine when data deduplication is in place.
Data Enrichment
The final major transformation step is enrichment—merging in external data or deriving features that are critical for analysis. Technically, this could happen later, but doing it at ingestion time saves time, compute, and unnecessary data duplication.
For example, we might extract keywords from technician repair notes or cluster customer complaint symptoms using NLP. On the telematics side, we augment DTCs with “freeze frame” features like odometer and engine hours at the moment of the fault—critical context for downstream queries and our AI algorithms to detect, diagnose, and remediate systematic issues.
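As an illustration of the freeze-frame idea, the sketch below attaches the most recent odometer reading at or before each fault using a pandas as-of join; the table names, columns, and values are assumptions for the example, not Viaduct’s actual schema.

```python
import pandas as pd

# Hypothetical ingested tables: DTC events and odometer readings.
dtcs = pd.DataFrame({
    "vehicle_id": ["v123", "v123"],
    "dtc_code": ["P1022", "P0300"],
    "fault_time": pd.to_datetime(["2025-04-12 12:40:00", "2025-04-15 08:00:00"]),
})
odometer = pd.DataFrame({
    "vehicle_id": ["v123", "v123"],
    "timestamp": pd.to_datetime(["2025-04-12 12:34:56", "2025-04-15 07:55:00"]),
    "odometer": [65432, 65710],
})

# Attach the most recent odometer reading at or before each fault,
# computed once at ingestion time rather than at query time.
enriched = pd.merge_asof(
    dtcs.sort_values("fault_time"),
    odometer.sort_values("timestamp"),
    left_on="fault_time",
    right_on="timestamp",
    by="vehicle_id",
    direction="backward",
)
print(enriched[["vehicle_id", "dtc_code", "fault_time", "odometer"]])
```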
These enrichments could happen at runtime, but doing them once during ingestion often speeds up queries by 10x or more, especially since joins across datasets with billions of data points are time-consuming.
Wrapping Up
Data ingestion isn’t just a preprocessing step—it’s the backbone of any scalable, cost-effective analytics pipeline. It sets the stage for every query, insight, and product decision that follows.
Building an ingestion pipeline takes time. It requires deep familiarity with the source data, the destination model, and the types of queries you plan to run.
But if your goal is rapid iteration, real-time insight, and engineering efficiency, shaping the data at the point of entry is not optional—it’s essential.