De-Dupe shape
The de-dupe shape is used to handle duplicate records found in incoming payloads. It can operate in three behaviour modes:
Filter. Filters out duplicated data so only new data continues through the flow.
Track. Tracks new data but does not check for duplicated data.
Filter & track. Filters out duplicated data and also tracks new data.
A process flow might include a single de-dupe shape set to one of these modes (e.g. filter & track), or multiple de-dupe shapes at different points in a flow, with different behaviours.
Tracked de-dupe data is retained for 90 days after it's added to a data pool.
The de-dupe shape is not atomic, so we advise against multiple process flows attempting to update the same data pool at the same time.
The de-dupe shape works with incoming payloads from a connection shape, and also from a manual payload, API call, or webhook.
JSON and XML payloads are supported.
The de-dupe shape is configured with a behaviour, a data pool, and a key. As noted previously, it can be used in three modes, which are summarised below.
Mode | Summary |
---|---|
Filter | Remove duplicate data from the incoming payload so only new data continues through the flow. New data is NOT tracked. |
Track | Log each new key value received in the data pool. |
Filter & track | Remove duplicate data from the incoming payload AND log each new key value received. |
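The three behaviours can be sketched conceptually in Python, treating the data pool as a set of previously seen key values. This is a simplified stand-in for how the shape behaves, not the product's implementation; the function name and mode strings are illustrative only.

```python
def dedupe(records, pool, key, mode):
    """Conceptual sketch of the three de-dupe behaviours.

    records: list of dicts from the incoming payload
    pool:    set of key values already tracked (stands in for a data pool)
    key:     name of the field used to match records
    mode:    "filter", "track", or "filter_and_track" (illustrative names)
    """
    passed = []
    for record in records:
        value = record[key]
        is_duplicate = value in pool
        if mode in ("track", "filter_and_track") and not is_duplicate:
            pool.add(value)  # log each new key value in the data pool
        if mode in ("filter", "filter_and_track") and is_duplicate:
            continue  # drop the duplicate so only new data continues
        passed.append(record)
    return passed

pool = {"1001"}
orders = [{"id": "1001"}, {"id": "1002"}]

# "filter" removes duplicates but does NOT update the pool
print(dedupe(orders, pool, "id", "filter"))  # [{'id': '1002'}]
print(pool)                                  # {'1001'} (unchanged)

# "filter_and_track" removes duplicates AND logs new key values
print(dedupe(orders, pool, "id", "filter_and_track"))  # [{'id': '1002'}]
print(sorted(pool))                                    # ['1001', '1002']
```

Note that in "track" mode every record passes through unchanged; only the pool is updated.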
Data pools are created in general settings and are used to organise de-dupe data. Once a data pool has been created it becomes available for selection when configuring a de-dupe shape for a process flow.
When data passes through a de-dupe shape set to a tracking behaviour (track, or filter & track), the value of the key field for each new record is logged in the data pool. So the data pool contains every unique key field value that has passed through the shape.
You can have multiple de-dupe shapes (either in the same process flow or in different process flows) sharing the same data pool. Typically, you would create one data pool for each entity type that you are processing. For example, if you are syncing orders via an 'orders' endpoint and products via a 'products' endpoint, you'd create two data pools: one for orders and another for products.
The key field is the data field used to match records. This would typically be some sort of id that uniquely identifies payload records - for example, an order id if you're processing orders, or a customer id if you're processing customer data.

When duplicate data is identified, it is removed from the payload. However, exactly what gets removed depends on the configured key field.
If your given key field is a top-level field for a simple payload, the entire record will be removed. However, if the payload structure is more complex and the key field is within an array, then duplicates will be removed from that array but the parent record will remain.
Let's look at a couple of examples.
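The two cases can be illustrated with small Python snippets over sample JSON-like payloads. The field names (`id`, `sku`, `lines`) and pool contents here are invented for illustration and are not product settings.

```python
# Case 1: simple payload, key field ("id") at the top level.
# The entire duplicate record is removed.
pool = {"1001"}  # key values already tracked in the data pool
orders = [{"id": "1001", "total": 50}, {"id": "1002", "total": 75}]
new_orders = [o for o in orders if o["id"] not in pool]
print(new_orders)  # [{'id': '1002', 'total': 75}]

# Case 2: complex payload, key field ("sku") inside a nested "lines" array.
# Duplicate items are removed from the array, but the parent record remains.
line_pool = {"SKU-A"}
order = {
    "id": "1003",
    "lines": [{"sku": "SKU-A", "qty": 1}, {"sku": "SKU-B", "qty": 2}],
}
order["lines"] = [l for l in order["lines"] if l["sku"] not in line_pool]
print(order)  # parent order kept; only the duplicate SKU-A line removed
```

In the second case the order itself still continues through the flow, just with its duplicated line items stripped out.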