Deduplication API (dedupe)

Overview

The Deduplication API provides a mechanism to identify and manage duplicate records in a dataset, thereby optimizing storage usage and improving data quality. Users specify criteria to detect duplicates and control the resulting output to either mark, move, or eliminate duplicate records.

Format

API End point: https://api.versium.com/v2/jobs?operation=dedupe

Http method: POST

Content-Type: application/json

Required Configuration Parameters

Name

Description

Value Type

Notes

input_list_id

Id of an existing list to perform dedupe on.

integer

project_name

The name of the project to store the list in. If a project with this name doesn’t exist, a new one will be created.

string

Required if project_id is not provided

project_id

The id of an existing project to store the list in.

integer

Required if project_name is not provided

match_fields

An array of arrays, each containing a field index followed by options for preprocessing data before matching.

array

Example: [[field_index1, option1, option2, ...], [field_index2, option3, option4, ...]]

sort_fields

An array of arrays, each containing a field index followed by options for field sorting.

array

Example: [[field_index1, option1, option2, ...], [field_index2, option3, option4, ...]]

Optional Configuration Parameters

Name

Description

Value Type

Notes

callback_url

A URL to ping when the job has completed or failed.

string

Match Options

Options for preprocessing data points before matching:

  • case_insensitive or i: Case-insensitive matching.
  • trim or t: Trim whitespace from the beginning and end of the data point.
  • filled or f: Require the field to be non-empty.

Sort Options

Options for determining how fields are sorted:

  • case_insensitive or i: Case-insensitive sorting.
  • trim or t: Trim whitespace.
  • reverse or r: Sort in descending order.
  • numeric or n: Numeric sorting.
  • filled or f: Sort based on whether the field is filled, ignoring case_insensitive and numeric options. Filled fields will come before empty fields by default, but this can be reversed with the “reverse” flag.

Example Request:

Assume the header of the input file is as follows:

Updated Date

First

Last

Phone

Address

City

State

Zip

Busname

Email

The following request will use First, Last, and Address to match records for duplicates with case-insensitive and trimmed comparison. First and Last are required to be non-empty.

Records with a filled Phone will be prioritized first, followed by filled Email, and finally Updated Date reverse order (descending), i.e. most recent records.

POST https://api.versium.com/v2/jobs?operation=dedupe
Content-Type: application/json

input_list_id=194
project_name=MyProject

match_fields=[[1,"i","t","f"],[2,"i","t","f"],[4,"i","t"]]
sort_fields=[[3,"i","t","f"],[9,"i","t","f"],[0,"i","t","r"]] 

Usage Notes

  • field_index for match_fields and sort_fields are 0-based integer indices of the fields such that the left-most field has index 0, then 1, then 2 etc.
  • When identifying duplicates, all match fields must contain identical values for records to be considered duplicates.
  • Specifying non-empty match fields (filled option) helps prevent false uniqueness due to missing data.
  • The first record in a set of duplicates is retained as the original by default. Use sort_fields to control the order and, consequently, which record is preserved.
  • Output lists will have an extra field prepended to the beginning, indicating the record’s original line number in the input file before sorting.

Response Overview

The JSON response will contain meta information and the state_url that can be used to obtain the job status. Once completed the JSON response will include the URL to send subsequent API requests to.

Dispatched response

{
    "versium": {
        "version": "2.0",
        "query_id": "9dc4828868e938d263a77fd1a7a63623",
        "query_time": 0.18544507026672363,
        "results": {
            "id": 194,
            "guid": "932906eb-e63d-4712-b636-a4df4fc0be1e",
            "state": "dispatched",
            "operation": "dedupe",
            "created_at": "2024-04-16T01:30:14.000000Z",
            "finished_at": null,
            "config": {
                "match_fields": [[1,"i","t","f"],[2,"i","t","f"],[4,"i","t"]],
                "sort_fields": [[3,"i","t","f"],[9,"i","t","f"],[0,"i","t","r"]],
            },
            "output_list": {
                "id": 224,
                "download_url": null,
                "num_matches": 0,
                "num_records": 0,
                "project_id": 8,
                "insights": null
            },
            "state_url": "http://api.versium.com/v2/jobs/194",
            "callback_url": null,
            "suppression_input_list_ids": null,
            "est_sec_remaining": null,
            "progress_percentage": 0
        }
    }
}

Processing response:

{
    "versium": {
        "version": "2.0",
        "query_id": "3570e37172475acffd82f1f2fd6a144a",
        "query_time": 0.06275701522827148,
        "results": {
            "id": 194,
            "guid": "932906eb-e63d-4712-b636-a4df4fc0be1e",
            "state": "processing",
            "operation": "dedupe",
            "created_at": "2024-04-16T01:30:14.000000Z",
            "finished_at": null,
            "config": {
                "match_fields": [[1,"i","t","f"],[2,"i","t","f"],[4,"i","t"]],
                "sort_fields": [[3,"i","t","f"],[9,"i","t","f"],[0,"i","t","r"]],
            },
            "output_list": {
                "id": 224,
                "download_url": null,
                "num_matches": 0,
                "num_records": 0,
                "project_id": 8,
                "insights": null
            },
            "state_url": "http://api.versium.com/v2/jobs/194",
            "callback_url": null,
            "suppression_input_list_ids": null,
            "est_sec_remaining": null,
            "progress_percentage": 0
        }
    }
}

Completed job:

Once the job is completed, the response will contain the "state":"Done" string. The response will also contain the following:

  • download URL - the url that can be used to download the generated list in CSV format.
  • number of matches in the list
  • number of records in the list
  • insights - a set of data containing the insights of the generated list
{
    "versium": {
        "version": "2.0",
        "query_id": "e6d1535f504b2ea8be1726431185a9ed",
        "query_time": 0.23850011825561523,
        "results": {
            "id": 194,
            "guid": "932906eb-e63d-4712-b636-a4df4fc0be1e",
            "state": "done",
            "operation": "dedupe",
            "created_at": "2024-04-16T01:30:14.000000Z",
            "finished_at": "2024-04-16 01:30:28",
            "config": {
                "match_fields": [[1,"i","t","f"],[2,"i","t","f"],[4,"i","t"]],
                "sort_fields": [[3,"i","t","f"],[9,"i","t","f"],[0,"i","t","r"]],
            },
            "output_list": {
                "id": 224,
                "download_url": "http://api.versium.com/v2/lists/224/download",
                "num_matches": 71702,
                "num_records": 71702,
                "project_id": 8,
                "insights": null
            },
            "state_url": "http://api.test/v2/jobs/194",
            "callback_url": null,
            "suppression_input_list_ids": null,
            "est_sec_remaining": 0,
            "progress_percentage": 100
        }
    }
}

Examples Scenarios

Below is a table of dummy data that will be used to demonstrate different deduplication scenarios. Rec ID is for record identification purposes only and will not count toward the field indices.

Rec ID

Updated

First

Last

Phone

Address

City

State

Zip

Busname

Email

0

2010-01-01

John

Smith

245-777-8866

123 Main St

Redmond

WA

98038

Microsoft

[email protected]

1

2014-01-01

John

Smith

245-777-8866

 

 

 

98038

Google

[email protected]

2

2020-01-01

John

Smith

248-6080-548

123 Main St

Redmond

 

 

 

[email protected]

3

2021-01-01

John

Smith

 

 

Redmond

WA

98038

 

[email protected]

4

2023-01-01

Jane

Smith

245-777-8866

123 Main St

Redmond

WA

98038

Amazon

[email protected]

5

2021-01-01

Joe

Smith

848-555-1234

211 1st St

Kirkland

WA

98033

 

[email protected]

6

2022-01-01

Joe

Smith

848-555-1234

456 State St

Bellevue

WA

98038

Amazon

 

7

2023-01-01

Joe

Smith

 

456 State St

Bellevue

WA

98038

Amazon

[email protected]

8

2022-01-01

Joe

Schmo

 

 

 

NaN

NaN

Google

[email protected]

9

2023-01-01

Joe

Schmo

989-221-4883

411 Pine St

Seattle

WA

98103

Microsoft

[email protected]

10

2023-02-01

Joe

Schmo

989-222-4883

411 Pine St

Seattle

WA

98103

Microsoft

[email protected]

11

2023-03-01

Joe

Schmo

989-223-4883

411 Pine St

Seattle

WA

98103

Microsoft

[email protected]

12

2023-04-01

Joe

Schmo

 

411 Pine St

Seattle

WA

98103

Microsoft

[email protected]

Scenario 1: Individual First & Last Match

Only consider first and last name when determining if two records are duplicates.

Match on First, Last.

Sort on filled Phone, then filled Email, then most recently Updated

match_fields: [[1], [2]]

sort_fields: [[3, “f”], [9, “f”], [0, “r”]]

Rec ID

Updated

First

Last

Phone

Address

City

State

Zip

Busname

Email

2

2020-01-01

John

Smith

248-6080-548

123 Main St

Redmond

 

 

 

[email protected]

4

2023-01-01

Jane

Smith

245-777-8866

123 Main St

Redmond

WA

98038

Amazon

[email protected]

5

2021-01-01

Joe

Smith

848-555-1234

211 1st St

Kirkland

WA

98033

 

[email protected]

11

2023-03-01

Joe

Schmo

989-223-4883

411 Pine St

Seattle

WA

98103

Microsoft

[email protected]

Scenario 2: Household

Only keep 1 record per household.

Match on Last, Address.

Sort on filled Phone, then filled Email, then most recently Updated

match_fields: [[2], [4]]

sort_fields: [[3, “f”], [9, “f”], [0, “r”]]

Rec ID

Updated

First

Last

Phone

Address

City

State

Zip

Busname

Email

4

2023-01-01

Jane

Smith

245-777-8866

123 Main St

Redmond

WA

98038

Amazon

[email protected]

5

2021-01-01

Joe

Smith

848-555-1234

211 1st St

Kirkland

WA

98033

 

[email protected]

6

2022-01-01

Joe

Smith

848-555-1234

456 State St

Bellevue

WA

98038

Amazon

 

11

2023-03-01

Joe

Schmo

989-223-4883

411 Pine St

Seattle

WA

98103

Microsoft

[email protected]

Scenario 3: Multiphone

Keep each unique phone number for an individual. Allows for duplicate phone numbers as long as they are for a different person, such as a home phone for a household.

Match on First, Last, Phone (Filled)

Sort on most recently Updated

match_fields: [[1], [2], [3, “f”]

sort_fields: [[0, “r”]]

Rec ID

Updated

First

Last

Phone

Address

City

State

Zip

Busname

Email

1

2014-01-01

John

Smith

245-777-8866

 

 

 

98038

Google

[email protected]

2

2020-01-01

John

Smith

248-6080-548

123 Main St

Redmond

 

 

 

[email protected]

4

2023-01-01

Jane

Smith

245-777-8866

123 Main St

Redmond

WA

98038

Amazon

[email protected]

6

2022-01-01

Joe

Smith

848-555-1234

456 State St

Bellevue

WA

98038

Amazon

 

9

2023-01-01

Joe

Schmo

989-221-4883

411 Pine St

Seattle

WA

98103

Microsoft

[email protected]

10

2023-02-01

Joe

Schmo

989-222-4883

411 Pine St

Seattle

WA

98103

Microsoft

[email protected]

11

2023-03-01

Joe

Schmo

989-223-4883

411 Pine St

Seattle

WA

98103

Microsoft

[email protected]