Extract data deltas from the data extension
Exporting data deltas from a marketing automation platform like Salesforce Marketing Cloud has been a widely discussed topic. I’ve seen numerous approaches to tackling this challenge. In this post, we’ll explore various methods for extracting data deltas and delve into my personal favorite. Stay tuned to uncover practical strategies that could revolutionize your data management processes, or not.
Extracting data deltas of previous day, week
This scenario, which is easily set up, has the objective of processing only data created in the previous day or week. The automation is configured to run daily or weekly, systematically collecting data from the preceding period, precisely from midnight to midnight. We can configure it effortlessly, ensuring there are no problems with duplicates or missing record gaps. I’ll discuss later how “gap issue” can cause some records to go missing. There’s just one minor issue that might lead you to consider a different implementation approach. This arises if there’s a downtime in the automation that exceeds one day. In such instances, you would need to adjust the deltas to be selected from the current day minus two, instead of one. Another minor concern, especially with SFMC, is that all timestamps are in CST. This discrepancy could result in data being missed or experiencing a two-day delay if the automation is run in a local timezone where it’s already a new day, but in CST, it’s still considered ‘yesterday’. It’s crucial to ensure your automation runs at a time when CST has also transitioned past midnight.
I found that the best solution for this case is an SQL activity utilizing the DATEDIFF
function in MS SQL Server. DATEDIFF
efficiently calculates the difference between two dates, making it ideal for this application.
--DATEDIFF ( datepart , startdate , enddate ) SELECT some_delta FROM sendlog_table sl WHERE DATEDIFF( DAY, sl.Date, SYSDATETIME() ) = 1
When extracting data with dates in different time zones, it’s crucial to adjust for these differences. I found the best approach is to use the DATEDIFF
function in MS SQL Server, combined with time zone conversion functions. This ensures accurate date comparisons.
SELECT some_delta FROM sendlog_table sl WHERE DATEDIFF( DAY, sl.Date AT TIME ZONE 'Central America Standard Time' AT TIME ZONE 'UTC', SYSDATETIME() AT TIME ZONE 'Central America Standard Time' AT TIME ZONE 'UTC' ) = 1
Next, I will demonstrate how to add a configuration file and retrieve the ‘lastRun’ timestamp from it, which will serve as our time restriction.
Extracting data deltas using configuration file
I have seen couple of implementations where the extract automation run based on the record found in configuration and its lastRun date. This SQL could look like:
SELECT some_delta FROM sendlog_table sl INNER JOIN configuration_table_lastRun ct on ct.automationName = 'my_automation_name' WHERE sl.date >= ct.lastRun
Or another way is to use subquery
SELECT some_delta FROM sendlog_table sl WHERE sl.date >= (SELECT TOP 1 lastRun FROM configuration_table_lastRun WHERE automationName = 'my_automation_name')
In summary, the first query is generally preferable due to its simplicity and potentially better performance. In our case, performance won’t be an issue as the table might only consist of a few records.
The last step of the automation will be updating the “lastRun” date in our configuration_table_lastRun data extension. Now, some of you might be thinking about what happens with records that were created during the automation had started and the automation had finished on the very next run?
In scenarios where there is a time gap between our first activity and the last, wherein we save the ‘lastRun’ to the configuration table, we could encounter a “gap issue”. This could potentially lead to the loss of some records. I have experienced gaps ranging from a few minutes to several hours. These occurred when the delta query had to be divided into multiple SQL activities due to the extensive volume of records resulted in activity timeouts.
To better illustrate the “Gap issue” (If you understood you can skip following example):
- Automation Process: Imagine you have an automated system that regularly processes data from a database. This process is scheduled to run at specific intervals – say every hour – and updates the ‘lastRun’ timestamp in a configuration table once it completes.
- The Gap Issue: The gap occurs in the period between the start and the completion of an automation run. Here’s how:
- Start Time: The process begins, let’s say, at 1:00 PM. It notes the ‘lastRun’ timestamp (for example, 12:00 PM from the previous run).
- Data Generation: During the process (from 1:00 PM onwards), new data entries are continuously being added to the database.
- End Time: The process finishes at 1:30 PM and updates the ‘lastRun’ timestamp to 1:30 PM.
- Potential Data Loss: The problem is with the data that was added between 1:00 PM and 1:30 PM. Since the next run of the process will start by looking at the ‘lastRun’ timestamp (now 1:30 PM), it might miss processing the data entered during that 30-minute window.
- Why It Happens:
- Long Processing Time: If the data volume is huge, the process might take a long time, widening this gap.
- Split Queries: In cases where you have to split your SQL queries to manage large datasets, this can further complicate synchronizing the data, increasing the risk of missing records that were added while the previous queries were running.
In summary, the gap in records is a synchronization issue in data processing automation, where records added during an ongoing process might not be captured in the subsequent process due to the timing mismatch.
Is there a way to mitigate our “gap issue”? Indeed, I might have a solution. Let’s examine it next.
Extracting data deltas using configuration file v2
To resolve our gap issue, we need to synchronize our ‘lastRun’ timestamps accurately in two places:
- At the beginning of the automation in our SQL select.
- In the configuration file when the automation finishes.
We will add an additional field to our ‘last run’ data extension, designated to store the actual date when the automation started. This date will be employed both as a boundary setter for our selection process and as the ‘lastRun’ date for the next execution of our automation.
The initial step in our automation process involves updating the ‘actualRun’ field with the current date. This date will then be used in the subsequent stages of our automation to establish the upper boundary for our queries.
SELECT GetDate() as actualRun, automationName FROM configuration_table_lastRun WHERE automationName = 'SendLog_Archive'
In our actual SQL query activity that select delta data we will amend following. You can use either inner join or subquery which ever you like.
SELECT some_delta FROM sendlog_table sl WHERE sl.date >= (SELECT TOP 1 lastRun FROM configuration_table_lastRun WHERE automationName = 'SendLog_Archive') and sl.date < (SELECT TOP 1 actualRun FROM configuration_table_lastRun WHERE automationName = 'SendLog_Archive')
And last but not least we will update our lastRun in configuration_table_lastRun data extension and if needed we can add lastRunT_1 (t-1) timestamp
SELECT (SELECT TOP 1 actualRun FROM configuration_table_lastRun WHERE automationName = 'SendLog_Archive') AS lastRun, automationName, (SELECT TOP 1 lastRun FROM configuration_table_lastRun WHERE automationName = 'SendLog_Archive') AS lastRunT_1 FROM configuration_table_lastRun WHERE automationName = 'SendLog_Archive'
Configuration data extension is capable of storing ‘lastRun’ and ‘actualRun’ timestamps for numerous automations that extract data deltas.