Key-based Incremental Replication
Daton uses this method to replicate only new or updated data from a source.
What is Key-based Incremental Replication?
Key-based incremental replication is one of the methods Daton uses to replicate data.
In this technique, only the data that has been added or updated since the last replication is copied to the destination system. The replication is based on a unique identifier, such as a primary key or a modification timestamp, which is used to identify the rows that have been modified or added since the previous replication.
By only transferring the changed data, key-based incremental replication reduces the amount of network bandwidth and processing power required for replication, making it faster and more efficient than full data replication. It is particularly useful in scenarios where the source database contains a large amount of data and only a small portion of the data changes frequently.
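To make this concrete, consider a hypothetical orders table (the table and column names here are illustrative, not part of Daton) whose updated_at column serves as the replication key:

-- Hypothetical source table; updated_at acts as the Replication Key.
-- order_id | status    | updated_at
-- ---------+-----------+---------------------
--        1 | shipped   | 2024-01-01 09:00:00
--        2 | pending   | 2024-01-02 14:30:00
--        3 | delivered | 2024-01-03 08:15:00

-- If the previous job saved '2024-01-02 14:30:00' as the maximum,
-- an incremental run copies only rows 2 and 3; row 1 is skipped.
SELECT order_id, status, updated_at
FROM sales.orders
WHERE updated_at >= '2024-01-02 14:30:00';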
How does Key-based Incremental Replication work in Daton?
When utilizing Key-based Incremental Replication, Daton employs a Replication Key column in the source table to identify any updated or new data requiring replication.
To replicate a table using Key-based Incremental Replication, Daton performs the following actions:
- During a replication job, Daton stores the highest value found in the table's Replication Key column.
- In the next replication job, Daton compares that saved value to the Replication Key column values in the source.
- All rows in the table with a Replication Key greater than or equal to the saved value are replicated; using greater than or equal to, rather than strictly greater than, ensures that rows sharing the previously saved maximum value are not skipped.
- Daton saves the new maximum value from the table's Replication Key column.
- Daton repeats this process during each subsequent replication job.
The following is an example of this comparison, expressed as a SQL query:
SELECT
    replication_key_column,
    selected_column_1,
    selected_column_2,
    [...]
FROM schema.table
WHERE replication_key_column >= 'last_saved_maximum_value'
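After the qualifying rows are copied, the job records the new bookmark for the next run. A minimal sketch of that step, using the same placeholder names as the query above:

-- The result becomes 'last_saved_maximum_value' for the next job.
SELECT MAX(replication_key_column) AS new_maximum_value
FROM schema.table;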
When do we use Key-based Incremental Replication in Daton?
Key-based Incremental Replication is an efficient alternative for replicating data when Log-based Replication is not supported by a particular integration. This method is suitable if a table has a modification timestamp column that updates whenever the record changes and if records are not hard deleted from the source table.
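As an illustrative sketch (table and column names are hypothetical), the source application must refresh the modification timestamp on every change for it to remain a reliable Replication Key:

-- Every change to a record also touches updated_at; if it did not,
-- the next incremental job would silently miss this row.
UPDATE sales.orders
SET status = 'shipped',
    updated_at = CURRENT_TIMESTAMP
WHERE order_id = 42;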
For SaaS integrations, Daton uses Key-based Incremental Replication whenever possible, and the Replication Method and Replication Key(s) used by specific tables can be found in the Schema section of the integration's documentation.
Limitations of Key-based Incremental Replication
- High update rates: Key-based incremental replication is inefficient for source systems with high update rates, because each job must re-scan the source for rows whose Replication Key exceeds the saved value, and frequently updated rows are copied again on every run.
- Schema changes: If the source system undergoes schema changes, key-based incremental replication may not work correctly. For example, if a new column is added to the source system, the key-based approach may not recognize the new column and may fail to replicate data correctly.
- Data inconsistencies: If there are inconsistencies in the data between the source and target systems, key-based incremental replication may not work as expected. For example, if there are duplicate keys in the source system, key-based replication may not be able to determine which row to replicate (see the detection query after this list).
- Data skew: If the data in the source system is skewed, meaning that some rows change far more often than others, key-based incremental replication may not be efficient, because the replication process spends most of its time comparing and copying the frequently changing rows.
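As an illustration of the duplicate-key problem above, a check like the following (placeholder names again) surfaces Replication Key values that occur more than once in the source:

-- Keys that appear more than once make it ambiguous which row
-- a given Replication Key value refers to during replication.
SELECT replication_key_column, COUNT(*) AS occurrences
FROM schema.table
GROUP BY replication_key_column
HAVING COUNT(*) > 1;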
Overcoming the Limitations of Key-based Incremental Replication
- Using a time-based or log-based replication method for source systems with high update rates.
- Developing a process for handling schema changes in the source system, such as automatically updating the target system schema.
- Implementing data cleaning and normalization techniques to ensure consistency between the source and target systems (a deduplication sketch follows this list).
- Optimizing the replication process to handle data skew, such as using parallel processing or batching updates for frequently changing rows (see the batching sketch after this list).
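For the data-cleaning point, one common normalization step is deduplicating the destination table so that only the latest version of each record survives. A minimal sketch, assuming a warehouse that supports window functions and reusing the hypothetical orders table from earlier:

-- Keep only the most recent version of each record; older duplicates
-- appended by successive incremental runs are filtered out.
SELECT order_id, status, updated_at
FROM (
    SELECT
        order_id,
        status,
        updated_at,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC
        ) AS row_rank
    FROM sales.orders
) ranked
WHERE row_rank = 1;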
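For the data-skew point, the replication query can be split into fixed-size batches ordered by the Replication Key, so a run of heavily updated rows does not turn into a single oversized query. A sketch assuming a SQL dialect that supports LIMIT, with the same hypothetical names:

-- Pull one batch at a time, ordered by the Replication Key;
-- the last updated_at in each batch becomes the cursor for the next batch.
SELECT order_id, status, updated_at
FROM sales.orders
WHERE updated_at >= '2024-01-02 14:30:00'
ORDER BY updated_at
LIMIT 10000;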