Efficiency matters more than ever for data engineers. However, teams often default to full materialization. This involves repeatedly reprocessing the entire dataset, even if only a handful of rows have changed.
Not very efficient, right? That's where incremental updates come in. They're the unsung heroes of data optimization, processing only the data that has changed rather than the whole dataset.
So why doesn't everyone use incremental updates? It's not as easy as computing the last update timestamp and inserting all new records. There are many edge cases that lead to a lot of headaches. There's a bittersweet dance we do with incremental data processing – it's the love-hate relationship that keeps us on our toes.
In this blog, we'll explore the alluring benefits and tough challenges of incremental data processing, and offer some clever solutions to make the most of it.
Regardless of whether or not you use the Narrator platform, you'll walk away with some tangible best practices to get the most out of your incremental data processing.
First, the good stuff...
Let's start with the undeniable allure: incremental updates save us serious compute costs. Imagine it as skillfully inserting new pieces into an intricate puzzle, rather than reconstructing the entire puzzle from scratch. This approach is especially beneficial for those massive tables that get more updates than your favorite social media feed. If we're fully materializing the data on each update, that processing can easily balloon to 50%-80% of your total warehousing bill. Processing only the new data lets us rack up substantial savings with each update.
Here's an example. Suppose you have a table with 100M rows with data from the last 5 years. And you want to update this table every hour.
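To put rough numbers on that example, here's a quick back-of-the-envelope sketch. It assumes rows arrived evenly over the 5 years (real tables rarely behave that nicely) and it only counts rows processed, not the cost of scanning the table to find them. The variable names are ours, purely for illustration.

```python
# Back-of-the-envelope comparison for the 100M-row, hourly-update example.
# Assumption: rows arrived at a uniform rate over the 5 years.
TOTAL_ROWS = 100_000_000
YEARS = 5
HOURS_PER_YEAR = 365 * 24

hours_of_history = YEARS * HOURS_PER_YEAR          # number of hourly slices in the table
new_rows_per_hour = TOTAL_ROWS / hours_of_history  # average rows added per hourly run

full_rows_per_day = TOTAL_ROWS * 24                # full rebuild on every hourly run
incremental_rows_per_day = new_rows_per_hour * 24  # only the new rows on every run

reduction = 1 - incremental_rows_per_day / full_rows_per_day

print(f"Full materialization: {full_rows_per_day:,.0f} rows/day")
print(f"Incremental:          {incremental_rows_per_day:,.0f} rows/day")
print(f"Reduction:            {reduction:.3%}")
```

Under that uniform-arrival assumption, the full rebuild touches about 2.4 billion rows a day, while the incremental run touches only tens of thousands.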
That's a big difference!
And there's another benefit - it's fast. That's because this approach circumvents the redundant cycle of re-processing unchanged data. In the example above we reduced the total processing by ~99%. Those are rows you don't have to compute, which saves you expensive processing time. So it's about saving money and time. But time is money... so yeah.
It's not all sunshine...
Just like any worthwhile endeavor, it takes effort. And incremental processing comes with its own set of complexities that can't be ignored. In this section, we'll highlight some of the challenges that arise with incremental processes and share the solutions we use at Narrator so you can realize the full potential of this powerful approach.
Incremental processing uses timestamps to identify and update data that has changed within a specific time window. This can make dealing with data from different sources, each with distinct update cadences, a tricky problem. A join between two sources on different update schedules may lead to missing records. That's why synchronization, while complex, is necessary to keep the data accurate.
At Narrator, we run a periodic reconciliation job to identify records that have been missed in recent processing. Our incremental jobs are configured to run every 30 minutes by default, and it would be very costly to run this reconciliation job as frequently. Instead, we perform the diff every night or every other night to identify and insert the records that have been missed due to misaligned update schedules.
As a final step, we limit our reconciliation timeframe to the last few months, which further reduces unnecessary re-processing of historical data. In practice, this approach is very effective. And take our word for it - a periodic reconciliation task is MUCH easier than trying to synchronize upstream scheduling.
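Here's a minimal sketch of what a reconciliation pass like this could look like. It isn't Narrator's actual job. The table names (`source_events`, `processed_events`), the `id` and `ts` columns, and the 90-day lookback are all assumptions made up for illustration, and SQLite stands in for your warehouse.

```python
import sqlite3
from datetime import datetime, timedelta

def reconcile(conn, now, lookback_days=90):
    """Insert source rows from the lookback window that are missing downstream."""
    cutoff = (now - timedelta(days=lookback_days)).isoformat()
    cur = conn.execute(
        """
        INSERT INTO processed_events (id, ts)
        SELECT s.id, s.ts
        FROM source_events AS s
        LEFT JOIN processed_events AS p ON p.id = s.id
        WHERE s.ts >= ?       -- only diff the recent window
          AND p.id IS NULL    -- keep rows the incremental runs missed
        """,
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount

# Tiny demo with hypothetical data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_events (id INTEGER PRIMARY KEY, ts TEXT)")
conn.execute("CREATE TABLE processed_events (id INTEGER PRIMARY KEY, ts TEXT)")
conn.executemany(
    "INSERT INTO source_events VALUES (?, ?)",
    [
        (1, "2023-01-15T00:00:00"),  # missed, but outside the window: left alone
        (2, "2024-05-20T00:00:00"),  # recent and already processed
        (3, "2024-05-25T00:00:00"),  # recent and missed: gets backfilled
    ],
)
conn.execute("INSERT INTO processed_events VALUES (2, '2024-05-20T00:00:00')")

inserted = reconcile(conn, now=datetime(2024, 6, 1))
processed_ids = [row[0] for row in conn.execute("SELECT id FROM processed_events ORDER BY id")]
print(inserted, processed_ids)
```

The lookback filter is what keeps this cheap: the diff only compares the recent slice of each table, so the old missed row stays out of scope until you deliberately widen the window.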
Then there's the data that doesn't play well with incremental updates. This is data whose past values keep changing, like funnels with longer conversion windows. Situations like a 3-month sales cycle can throw a wrench in the works. Sometimes, it's back to the drawing board for a full reprocessing.
Example: Facebook Ad Data
Facebook ad data is a good example. The daily ad conversions from Monday's campaign will show 10 conversions at the end of the day, but by Wednesday the conversion count is 12, then 11, then 15. By the end of next week, those Monday campaign numbers will stabilize. These situations are especially common with email engagement, website data, and ad data. And these are typically the BIGGEST datasets in your warehouse, making them the most costly to continually reprocess.
Typically, it's the most recent data that is changing as the days go by, then after a longer period of time it'll stabilize.
We take advantage of the fact that only recent data is changing and selectively configure some incremental processes to fully delete the last 30 days (or longer) of processed data and incrementally re-add it on each update. This saves us from re-processing all the historical data that's already stabilized and saves our compute power for the updates that need it.
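The delete-and-reload pattern might be sketched like this. Again, this is an illustration under assumptions, not our production code: the `raw_ad_stats` and `ad_conversions` tables, their columns, and the helper name are hypothetical, and SQLite stands in for the warehouse.

```python
import sqlite3
from datetime import date, timedelta

def refresh_recent(conn, today, window_days=30):
    """Delete the trailing window from the processed table and reload it from the source."""
    cutoff = (today - timedelta(days=window_days)).isoformat()
    with conn:  # one transaction: the delete and the re-insert succeed or fail together
        conn.execute("DELETE FROM ad_conversions WHERE day >= ?", (cutoff,))
        conn.execute(
            "INSERT INTO ad_conversions (day, conversions) "
            "SELECT day, conversions FROM raw_ad_stats WHERE day >= ?",
            (cutoff,),
        )

# Tiny demo with hypothetical data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_ad_stats (day TEXT PRIMARY KEY, conversions INTEGER)")
conn.execute("CREATE TABLE ad_conversions (day TEXT PRIMARY KEY, conversions INTEGER)")

# The processed table still holds Monday's first reading of 10 conversions...
conn.executemany("INSERT INTO ad_conversions VALUES (?, ?)",
                 [("2024-01-01", 7), ("2024-06-10", 10)])
# ...but the source has since restated that day to 15.
conn.executemany("INSERT INTO raw_ad_stats VALUES (?, ?)",
                 [("2024-01-01", 7), ("2024-06-10", 15)])

refresh_recent(conn, today=date(2024, 6, 14))
rows = conn.execute("SELECT day, conversions FROM ad_conversions ORDER BY day").fetchall()
print(rows)
```

Only the rows inside the trailing window get rewritten. The January row is never touched, which is the whole point: stabilized history stays put.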
Let's talk more about those cost savings. They're not guaranteed. All those shiny ($$) benefits of incremental processing might not be realized if data tables lack proper optimization, such as proper partitioning. Without it, the computational cost of scanning the data prior to insert could be just as much as the cost to fully rebuild the table.
All of the tables we create are partitioned by time in a reasonable way and make proper use of distribution keys. And for large-scale data, we allow users to define parameterized queries that selectively filter and modify the SQL at processing time for optimal performance.
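To give a feel for the idea, here's one way a parameterized query could work. This is a hypothetical sketch, not Narrator's actual syntax: the `$from_ts` and `$to_ts` placeholders, the `orders` table, and the `render_window` helper are all invented for illustration.

```python
from datetime import datetime
from string import Template

# Hypothetical user-defined query with placeholders for the processing window.
ORDERS_QUERY = Template("""
SELECT order_id, customer_id, created_at
FROM orders
WHERE created_at >= '$from_ts'
  AND created_at <  '$to_ts'
""")

def render_window(template, from_ts, to_ts):
    """Fill in the time window for one incremental run."""
    # The values are datetimes produced by the scheduler, never user input,
    # so plain substitution is acceptable for this sketch.
    return template.substitute(from_ts=from_ts.isoformat(), to_ts=to_ts.isoformat())

sql = render_window(ORDERS_QUERY, datetime(2024, 6, 1, 12, 0), datetime(2024, 6, 1, 12, 30))
print(sql)
```

Because the time filter lands directly on the source table, a warehouse that partitions by `created_at` can skip everything outside the window instead of scanning the full table first.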
Some data cleanup tasks can't be performed incrementally. When a user updates their email address, all historical records need to be updated. And clean-up processes, like the removal of internal users and test data, need to be performed across the entirety of your processed data. Not only are these updates costly, they also lead data to drift from the truth if not managed properly.
Addressing complex tasks such as identity resolution involves identifying the specific subset of data impacted by an update and directing our efforts towards that particular segment. We carefully prepare the records that may be influenced by each update, employing smaller temporary staging tables to preprocess the necessary computations. This strategy reduces the scope of datasets that need processing. This is the only way to maintain data accuracy without throwing resources (specifically, compute resources) at the problem.
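As a rough illustration of the staging-table idea, here's a sketch that applies email changes only to the users who actually changed. The `activity` table, the `staged_changes` staging table, and the function name are assumptions for the example, and real identity resolution involves a lot more than one column.

```python
import sqlite3

def apply_email_changes(conn, changes):
    """Update historical rows only for users in `changes`, a list of (user_id, new_email)."""
    # Stage the small set of changes first, so the big table is filtered against it.
    conn.execute("CREATE TEMP TABLE staged_changes (user_id INTEGER PRIMARY KEY, email TEXT)")
    with conn:
        conn.executemany("INSERT INTO staged_changes VALUES (?, ?)", changes)
        cur = conn.execute(
            """
            UPDATE activity
            SET email = (SELECT s.email FROM staged_changes AS s
                         WHERE s.user_id = activity.user_id)
            WHERE user_id IN (SELECT user_id FROM staged_changes)
            """
        )
        updated = cur.rowcount
    conn.execute("DROP TABLE staged_changes")
    return updated

# Tiny demo with hypothetical data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, user_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?)",
    [(1, 1, "old@example.com"), (2, 1, "old@example.com"), (3, 2, "other@example.com")],
)

updated = apply_email_changes(conn, [(1, "new@example.com")])
emails = conn.execute("SELECT user_id, email FROM activity ORDER BY id").fetchall()
print(updated, emails)
```

The staging table holds only the handful of users who changed, so the update touches their history and nobody else's.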
So, there you have it – the ups and downs of incremental updates. It's a cost-saving hero and a tricky partner all in one. But we're in it for the long haul, navigating each challenge along the way. Through sharing our experiences, we're hopeful that you can adopt these strategies and treat your team to some celebratory drinks with the extra funds you'll save.
Cheers to incremental updates! 🍻