De-dupe shape

Introduction

The de-dupe shape handles duplicate records found in incoming payloads. It can be used in three modes:

  • Filter. Filters out duplicated data so only new data continues through the flow.

  • Track. Tracks new data but does not check for duplicated data.

  • Filter & track. Filters out duplicated data and also tracks new data.

A process flow might include a single de-dupe shape set to one of these modes (e.g. filter & track), or multiple de-dupe shapes at different points in a flow, with different behaviours.

A single incoming payload for any process flow shape should not exceed 500MB.

We recommend processing multiple, smaller payloads rather than one single payload (1000 x 0.5MB payloads are more efficient than 1 x 500MB payload!).

For payloads up to 500MB, consider adding a flow control shape to batch data into multiple, smaller payloads. Payloads exceeding 500MB should be batched at source.
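As a rough illustration of the batching idea (outside the platform - in practice the flow control shape, or batching at source, does this for you), the Python sketch below splits a large record list into smaller payloads. The batch size of 500 records is an arbitrary example value:

# Illustrative only: splitting a large list of records into smaller payloads.
# In practice the flow control shape (or batching at source) does this work.
def batch_records(records, batch_size=500):
    """Yield successive batches of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

payloads = list(batch_records([{"id": i} for i in range(10_000)], batch_size=500))
print(len(payloads))  # 20 payloads of 500 records each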

Need to know

  • Tracked de-dupe data is retained for 90 days after it's added to a data pool.

  • Tracked de-dupe data can be interrogated via the tracked data page - by default it's available here for 15 days.

  • The de-dupe shape is not atomic - as such we advise against multiple process flows attempting to update the same data pool at the same time.

  • The de-dupe shape works with incoming payloads from a connection shape, and also from a manual payload, API call, or webhook.

  • JSON and XML payloads are supported.

How it works

The de-dupe shape is configured with a behaviour, a data pool, and a key field:

Behaviour

As noted previously, the de-dupe shape can be used in three modes, which are summarised below.

  • Filter. Remove duplicate data from the incoming payload so only new data continues through the flow. New data is NOT tracked.

  • Track. Log each new key value received in the data pool.

  • Filter & track. Remove duplicate data from the incoming payload AND log each new key value received.
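For readers who find pseudocode helpful, the sketch below models the three behaviours, with the data pool represented as a plain Python set of previously seen key values. This is purely illustrative - it is not how the platform stores tracked data:

# Illustrative model of the three behaviours. The data pool is represented
# here as a plain set of previously seen key values.
def dedupe(records, pool, key, behaviour):
    new_records = [r for r in records if r[key] not in pool]
    if behaviour in ("track", "filter & track"):
        pool.update(r[key] for r in new_records)   # log each new key value
    if behaviour in ("filter", "filter & track"):
        return new_records                         # duplicates removed from payload
    return records                                 # track only: payload passes through unchanged

pool = set()
payload = [{"order_id": 222222}, {"order_id": 333333}]
print(dedupe(payload, pool, "order_id", "filter & track"))  # both records pass; keys logged
print(dedupe(payload, pool, "order_id", "filter & track"))  # [] - both are now duplicates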

Why are there separate filter and track behaviour options?

These options provide flexibility for what happens to new/duplicate data, and when it happens. Let's look at an example below:

Here, we receive an incoming payload from the first connection shape and send it into a de-dupe shape which is configured to filter & track. This means that any duplicate records (based on the key value) will be removed and the key value for any new records is logged in the data pool before the updated payload continues to the next shape.

Often this is fine - but let's take a closer look at our sample process flow. Following the de-dupe shape, we have another four shapes to process before completion. If this run were to fail for any reason, we'd want to re-send the data - but because we've already tracked the new data in this payload, those records wouldn't be sent again (they would be filtered out as duplicates).

To avoid this scenario, we could add TWO de-dupe shapes to our process flow, where:

  • Shape 1 is placed immediately after the first connection shape, with its behaviour set to filter. Any duplicate records are removed right at the start.

  • Shape 2 is placed at the very end of the process flow (after data has been pushed to the final endpoint), with its behaviour set to track. At this point, we know that the data has been pushed successfully, so we can safely log new records.

For example:

The above shows the kind of flexibility offered by the three behaviour modes for the de-dupe shape. The approach described may not be appropriate for every case, but it does illustrate the importance of considering where the de-dupe shape is placed in a process flow, and whether using multiple shapes with different behaviours could be beneficial.
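To sketch the two-shape pattern in the same illustrative style, the snippet below filters at the start and only tracks key values once the push to the endpoint has succeeded. Here, push_to_endpoint is a hypothetical stand-in for the intermediate shapes - it is not a platform function:

# Illustrative pattern: filter at the start, track only after a successful push.
# push_to_endpoint is a hypothetical placeholder for the downstream shapes.
def run_flow(payload, pool, key, push_to_endpoint):
    new_records = [r for r in payload if r[key] not in pool]   # shape 1: filter
    push_to_endpoint(new_records)                              # remaining shapes; may raise
    pool.update(r[key] for r in new_records)                   # shape 2: track
    # If the push fails, nothing is tracked, so re-running the flow
    # re-sends the same records rather than filtering them out as duplicates.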

Data pools

Data pools are created in general settings and are used to organise de-dupe data. Once a data pool has been created it becomes available for selection when configuring a de-dupe shape for a process flow.

When data passes through a de-dupe shape with a track (or filter & track) behaviour, the value of the key field for each new record is logged in the data pool. So, the data pool contains all unique key field values that have passed through the shape.

You can have multiple de-dupe shapes (either in the same process flow or in different process flows) sharing the same data pool. Typically, you would create one data pool for each entity type that you are processing. For example, if you are syncing orders via an 'orders' endpoint and products via a 'products' endpoint, you'd create two data pools - one for orders and another for products.
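Continuing the illustrative set-based model from earlier, one pool per entity type might look like this (the pool names and key values are just examples):

# Illustrative only: one data pool per entity type, each modelled as a set of keys.
pools = {"orders": set(), "products": set()}
pools["orders"].add(222222)       # key values logged by de-dupe shapes handling orders
pools["products"].add("SKU-001")  # key values logged by de-dupe shapes handling products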

Key field

The key field is the data field that should be used to match records. This would typically be some sort of id that uniquely identifies payload records - for example, an order id if you're processing orders, a customer id if you're processing customer data, etc.

How duplicate data is handled

When duplicate data is identified, it is removed from the payload; however, exactly what gets removed depends on the configured key field.

If the configured key field is a top-level field in a simple payload, the entire matching record is removed. However, if the payload structure is more complex and the key field sits within a nested array, duplicates are removed from that array but the parent record remains.

Let's look at a couple of examples.

Example 1: Simple payload & top-level key field
[
    {
        "customerID": 10000201,
        "first_name": "Beyonce",
        "last_name": "Knowles",        
        "item1": "pears",
        "item2": "apples",
        "item3": "oranges",
        "item4": "peaches",
    }
]

In the example above we have a simple payload with single-level customer records in an array - there are no nested arrays. If we were to specify customerID as the de-dupe key field and a match is found, the entire record will be removed.

Let's say that customerID of 10000201 passes through the same de-dupe shape (and therefore the same data pool) twice. In this case, a match would be made and the payload output from the de-dupe shape would be:

[]
Example 2: Complex payload & nested key field
[
    {
        "customerID": 10000201,
        "first_name": "Beyonce",
        "last_name": "Knowles",
        "orders": [
            {
                "customerID": 10000201,
                "order_id": 222222,
                "item1": "pears",
                "item2": "apples",
                "item3": "oranges",
                "item4": "peaches"
            },
            {
                "customerID": 10000201,
                "order_id": 333333,
                "item1": "grapes",
                "item2": "plums",
                "item3": "peaches",
                "item4": "lychees"
            }
        ]
    }
]

In the example above we have a more complex payload with orders in an array. If we were to specify orders.customerID as the de-dupe key field and a match is found, the associated order(s) will be removed but any parent data will remain.

Let's say that customerID of 10000201 passes through the same de-dupe shape (and therefore the same data pool) twice. In this case, a match would be made and the payload output from the de-dupe shape would be:

[
    {
        "customerID": 10000201,
        "first_name": "Beyonce",
        "last_name": "Knowles",
        "orders": []
    }
]
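For completeness, here's a rough Python sketch of how a dotted key such as orders.customerID could be interpreted, mirroring the two examples above. This is an illustrative reading of the behaviour, not the platform's implementation:

# Illustrative sketch of key-field matching. A dotted key such as
# "orders.customerID" is read as: de-dupe the child records inside each
# parent's "orders" array on their "customerID" value.
def filter_payload(payload, key, pool):
    if "." not in key:
        # Top-level key: drop whole records whose key value is already tracked.
        return [r for r in payload if r[key] not in pool]
    parent_field, child_key = key.split(".", 1)
    for record in payload:
        children = record.get(parent_field, [])
        # Nested key: drop matching child records but keep the parent record.
        record[parent_field] = [c for c in children if c[child_key] not in pool]
    return payload

pool = {10000201}  # key value already logged in the data pool
# Example 1: filter_payload(simple_payload, "customerID", pool) returns []
# Example 2: filter_payload(complex_payload, "orders.customerID", pool)
#            returns the parent record with an empty "orders" array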
