# De-Dupe shape

## Introduction

The `de-dupe` shape handles duplicate records found in incoming payloads. It can be used in three modes:

* `Filter`. Filters out duplicated data so only new data continues through the flow.
* `Track`. Tracks new data but does not check for duplicated data.
* `Filter & track`. Filters out duplicated data and also tracks new data.

A process flow might include a single `de-dupe` shape set to one of these modes (e.g. `filter & track`), or multiple `de-dupe` shapes at different points in a flow, with different behaviours.

{% hint style="info" %}
A single incoming payload for [any process flow shape](https://open.gitbook.com/~site/site_dIV1g/~/revisions/9dTvdvRJIRVZQXuBMMnJ/process-flows/building-process-flows/process-flow-shapes) should not exceed 500MB.

We recommend processing multiple, smaller payloads rather than one single payload (1000 x 0.5MB payloads are more efficient than 1 x 500MB payload!).

For payloads up to 500MB, consider adding a [flow control shape](https://open.gitbook.com/~site/site_dIV1g/~/revisions/9dTvdvRJIRVZQXuBMMnJ/process-flows/building-process-flows/process-flow-shapes/standard-shapes/flow-control-shape) to batch data into multiple, smaller payloads. Payloads exceeding 500MB should be batched at source.
{% endhint %}

## Need to know

* Tracked `de-dupe` data is retained for 90 days after it's added to a data pool.
* Tracked `de-dupe` data can be interrogated via the [tracked data page](https://doc.wearepatchworks.com/product-documentation/process-flows/building-process-flows/process-flow-shapes/standard-shapes/track-data-shape/the-tracked-data-page) - by default it's available here for 15 days.
* The `de-dupe` shape is not atomic - as such we advise against multiple process flows attempting to update the same data pool at the same time.
* The `de-dupe` shape works with incoming payloads from a [connection shape](https://doc.wearepatchworks.com/product-documentation/process-flows/building-process-flows/process-flow-shapes/standard-shapes/connector-shape), and also from a [manual payload](https://doc.wearepatchworks.com/product-documentation/process-flows/building-process-flows/process-flow-shapes/standard-shapes/manual-payload-shape), [API call](https://doc.wearepatchworks.com/product-documentation/developer-hub/patchworks-core-api), or [webhook](https://doc.wearepatchworks.com/product-documentation/process-flows/building-process-flows/process-flow-shapes/standard-shapes/trigger-shape/trigger-shape-webhook).
* JSON and XML payloads are supported.

## How it works

The `de-dupe` shape is configured with a [behaviour](#behaviour), a [data pool](#data-pools), and a [key](#key-field):

<div align="left"><figure><img src="https://2440044887-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FLYNcUBVQwSkOMG6KjZfz%2Fuploads%2FUrjpyX09EtV0Yz4FqV0Q%2Fdedupe%20settings.png?alt=media&#x26;token=aba6421a-5a0a-43e1-a18c-6d152cd038b6" alt="" width="357"><figcaption></figcaption></figure></div>

### Behaviour

As noted previously, the `de-dupe` shape can be used in three modes, which are summarised below.

| Mode           | Summary                                                                                                               |
| -------------- | --------------------------------------------------------------------------------------------------------------------- |
| Filter         | Remove duplicate data from the incoming payload so only new data continues through the flow. New data is NOT tracked. |
| Track          | Log each new key value received in the data pool.                                                                     |
| Filter & track | Remove duplicate data from the incoming payload AND log each new key value received.                                  |
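
The three modes can be thought of as two independent actions - filtering and tracking. As a purely illustrative sketch (the function name, mode strings, and set-based data pool below are assumptions for illustration, not Patchworks internals), the hypothetical Python models the data pool as a set of previously seen key values:

```python
# Hypothetical model of the de-dupe behaviour modes (illustrative only -
# the data pool is modelled as a plain set of previously seen key values).

def run_dedupe(records, pool, key, mode):
    """Return the records allowed through, updating `pool` per the mode."""
    passed = []
    for record in records:
        is_duplicate = record.get(key) in pool
        if mode in ("filter", "filter & track") and is_duplicate:
            continue  # filtering modes drop duplicates from the payload
        if mode in ("track", "filter & track") and not is_duplicate:
            pool.add(record.get(key))  # tracking modes log new key values
        passed.append(record)
    return passed

pool = set()
orders = [{"order_id": 1}, {"order_id": 2}, {"order_id": 1}]

first = run_dedupe(orders, pool, "order_id", "filter & track")
print(first)   # duplicate order_id 1 removed; keys 1 and 2 now tracked
second = run_dedupe(orders, pool, "order_id", "filter & track")
print(second)  # every record is now a known duplicate: []
```

Note how the modes separate cleanly in this model: with `filter` only, nothing is written to the pool (so a later `track` shape can log keys once delivery succeeds), while with `track` only, duplicate records still pass through.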

<details>

<summary><img src="https://2440044887-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FLYNcUBVQwSkOMG6KjZfz%2Fuploads%2FfUVdhc1UgMWObOesLjkS%2Ficon%20eye.png?alt=media&#x26;token=c8b6d3a9-b5a9-4cb0-8aeb-00bd39c00b60" alt="" data-size="line"> Why are there separate filter and track behaviour options?</summary>

These options provide flexibility for what happens to new/duplicate data, and when it happens. Let's look at an example below:

<img src="https://2440044887-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FLYNcUBVQwSkOMG6KjZfz%2Fuploads%2F2aCrWfE1OQsptnp6kaTO%2Fdedupe%20track%20and%20filter%20example.png?alt=media&#x26;token=4bef537a-fa4d-4fb9-9aef-40369e7c646f" alt="" data-size="original">

Here, we receive an incoming payload from the first connection shape and send it into a `de-dupe` shape which is configured to `filter & track`. This means that any duplicate records (based on the `key` value) are removed, and the `key` value for any new records is logged in the data pool before the updated payload continues to the next shape.

Often this is fine - but let's take a closer look at our sample process flow. Following the `de-dupe` shape, we have another four shapes to process before completion. If this run were to fail for any reason, we'd want to re-send the data - but because we've already tracked the new data in this payload, those records wouldn't be sent again (they would be filtered out as duplicates).

To avoid this scenario, we could add TWO `de-dupe` shapes to our process flow, where:

* Shape 1 is placed immediately after the first connection shape, with its behaviour set to `filter`. Any duplicate records are removed right at the start.
* Shape 2 is placed at the very end of the process flow (after data has been pushed to the final endpoint), with its behaviour set to `track`. At this point, we know that the data has been pushed successfully, so we can safely log new records.

For example:

<img src="https://2440044887-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FLYNcUBVQwSkOMG6KjZfz%2Fuploads%2F31aqVRHnELzhOLdX9hBB%2Fdedupe%20example%202.png?alt=media&#x26;token=8c9c8b1f-b08b-447b-a9e0-c36a4195966b" alt="" data-size="original">

The above shows the kind of flexibility offered by the three behaviour modes for the de-dupe shape. The approach described may not be appropriate for every case, but it does illustrate the importance of considering where the de-dupe shape is placed in a process flow, and whether using multiple shapes with different behaviours could be beneficial.

</details>

### Data pools

Data pools are [created in general settings](https://doc.wearepatchworks.com/product-documentation/process-flows/building-process-flows/process-flow-shapes/advanced-shapes/de-dupe-shape/working-with-data-pools) and are used to organise de-dupe data. Once a data pool has been created it becomes available for selection when configuring a `de-dupe` shape for a process flow.

When data passes through a `de-dupe` shape set to `track` or `filter & track` behaviour, the value of the [key field](#key-field) for each new record is logged in the data pool. So, the data pool will contain all unique key field values that have passed through the shape.

You can have multiple `de-dupe` shapes (either in the same process flow or in different process flows) sharing the same data pool. Typically, you would create one data pool for each entity type that you are processing. For example, if you are syncing orders via an 'orders' endpoint and products via a 'products' endpoint, you'd create two data pools - one for orders and another for products.

{% hint style="danger" %}
Tracked de-dupe data is retained for 90 days after it's added to a data pool.
{% endhint %}

### Key field

The `key field` is the data field that should be used to match records. This would typically be some sort of `id` that uniquely identifies payload records - for example, an `order id` if you're processing orders, a `customer id` if you're processing customer data, etc.

## How duplicate data is handled

When duplicate data is identified, it is removed from the payload. However, exactly what gets removed depends on the configured `key field`.

If the key field is a top-level field in a simple payload, the entire record will be removed. However, if the payload structure is more complex and the key field is within an array, then duplicates will be removed from that array but the parent record will remain.

Let's look at a couple of examples.

<details>

<summary><img src="https://2440044887-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FLYNcUBVQwSkOMG6KjZfz%2Fuploads%2FuKTXJEY34CVnak6PxrOz%2Fexample%20icon%202.svg?alt=media&#x26;token=7c8af2c5-9519-4757-bea9-172569a023bd" alt="" data-size="line"> <strong>Example 1: Simple payload &#x26; top-level key field</strong></summary>

{% code lineNumbers="true" %}

```json
[
    {
        "customerID": 10000201,
        "first_name": "Beyonce",
        "last_name": "Knowles",
        "item1": "pears",
        "item2": "apples",
        "item3": "oranges",
        "item4": "peaches"
    }
]
```

{% endcode %}

In the example above we have a simple payload with single-level customer records in an array - there are no nested arrays. If we were to specify `customerID` as the de-dupe `key field` and a match is found, the entire record will be removed.

Let's say that `customerID` of `10000201` passes through the same de-dupe shape (and therefore the same data pool) twice. In this case, a match would be made and the payload output from the de-dupe shape would be:

{% code lineNumbers="true" %}

```json
[]
```

{% endcode %}

</details>

<details>

<summary><img src="https://2440044887-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FLYNcUBVQwSkOMG6KjZfz%2Fuploads%2FuKTXJEY34CVnak6PxrOz%2Fexample%20icon%202.svg?alt=media&#x26;token=7c8af2c5-9519-4757-bea9-172569a023bd" alt="" data-size="line"> <strong>Example 2: Complex payload &#x26; nested key field</strong></summary>

{% code lineNumbers="true" %}

```json
[
    {
        "customerID": 10000201,
        "first_name": "Beyonce",
        "last_name": "Knowles",
        "orders": [
            {
                "customerID": 10000201,
                "order_id": 222222,
                "item1": "pears",
                "item2": "apples",
                "item3": "oranges",
                "item4": "peaches"
            },
            {
                "customerID": 10000201,
                "order_id": 333333,
                "item1": "grapes",
                "item2": "plums",
                "item3": "peaches",
                "item4": "lychees"
            }
        ]
    }
]
```

{% endcode %}

In the example above we have a more complex payload with orders in an array. If we were to specify `orders.customerID` as the de-dupe `key field` and a match is found, the associated order(s) will be removed but any parent data will remain.

Let's say that `customerID` of `10000201` passes through the same de-dupe shape (and therefore the same data pool) twice. In this case, a match would be made and the payload output from the de-dupe shape would be:

{% code lineNumbers="true" %}

```json
[
    {
        "customerID": 10000201,
        "first_name": "Beyonce",
        "last_name": "Knowles",
        "orders": []
    }
]
```

{% endcode %}

</details>
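
The nested behaviour shown in Example 2 can be sketched in the same illustrative way. In the hypothetical Python below, the `filter_nested` helper and its single-level `array.key` path handling are assumptions for illustration, not the shape's actual implementation:

```python
# Hypothetical sketch of nested key-field matching, e.g. "orders.customerID":
# duplicates are removed from the nested array, but the parent record remains.

def filter_nested(records, path, pool):
    array_field, key = path.split(".")  # e.g. "orders" and "customerID"
    for record in records:
        kept = [item for item in record.get(array_field, [])
                if item.get(key) not in pool]
        record[array_field] = kept               # drop matched entries only
        pool.update(item[key] for item in kept)  # track the new key values
    return records

pool = set()
payload = [{
    "customerID": 10000201,
    "first_name": "Beyonce",
    "orders": [{"customerID": 10000201, "order_id": 222222}],
}]

filter_nested(payload, "orders.customerID", pool)  # first pass: order kept
result = filter_nested(payload, "orders.customerID", pool)
print(result)  # second pass: parent remains, "orders" is now empty
```

As in Example 2, the second pass empties the `orders` array while the parent customer record continues through the flow.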

## More information

* [Adding a data pool](https://doc.wearepatchworks.com/product-documentation/process-flows/building-process-flows/process-flow-shapes/advanced-shapes/de-dupe-shape/working-with-data-pools)
* [Adding & configuring a de-dupe shape](https://doc.wearepatchworks.com/product-documentation/process-flows/building-process-flows/process-flow-shapes/advanced-shapes/de-dupe-shape/adding-and-configuring-a-de-dupe-shape)
