Overview
The Deduplication API provides a mechanism to identify and manage duplicate records in a dataset, thereby optimizing storage usage and improving data quality. Users specify criteria to detect duplicates and control the resulting output to either mark, move, or eliminate duplicate records.
Format
API End point: https://api.versium.com/v2/jobs?operation=dedupe
Http method: POST
Content-Type: application/json
Required Configuration Parameters
Name | Description | Value Type | Notes |
---|---|---|---|
input_list_id | Id of an existing list to perform dedupe on. | integer | |
project_name | The name of the project to store the list in. If a project with this name doesn’t exist, a new one will be created. | string | Required if project_id is not provided |
project_id | The id of an existing project to store the list in. | integer | Required if project_name is not provided |
match_fields | An array of arrays, each containing a field index followed by options for preprocessing data before matching. | array | Example: |
sort_fields | An array of arrays, each containing a field index followed by options for field sorting. | array | Example: |
Optional Configuration Parameters
Name | Description | Value Type | Notes |
---|---|---|---|
callback_url | A URL to ping when the job has completed or failed. | string |
Match Options
Options for preprocessing data points before matching:
case_insensitive
ori
: Case-insensitive matching.trim
ort
: Trim whitespace from the beginning and end of the data point.filled
orf
: Require the field to be non-empty.
Sort Options
Options for determining how fields are sorted:
case_insensitive
ori
: Case-insensitive sorting.trim
ort
: Trim whitespace.reverse
orr
: Sort in descending order.numeric
orn
: Numeric sorting.filled
orf
: Sort based on whether the field is filled, ignoringcase_insensitive
andnumeric
options. Filled fields will come before empty fields by default, but this can be reversed with the “reverse” flag.
Example Request:
Assume the header of the input file is as follows:
Updated Date | First | Last | Phone | Address | City | State | Zip | Busname |
---|
The following request will use First, Last, and Address to match records for duplicates with case-insensitive and trimmed comparison. First and Last are required to be non-empty.
Records with a filled Phone will be prioritized first, followed by filled Email, and finally Updated Date reverse order (descending), i.e. most recent records.
POST https://api.versium.com/v2/jobs?operation=dedupe
Content-Type: application/json
input_list_id=194
project_name=MyProject
match_fields=[[1,"i","t","f"],[2,"i","t","f"],[4,"i","t"]]
sort_fields=[[3,"i","t","f"],[9,"i","t","f"],[0,"i","t","r"]]
Usage Notes
- field_index for
match_fields
andsort_fields
are 0-based integer indices of the fields such that the left-most field has index 0, then 1, then 2 etc. - When identifying duplicates, all match fields must contain identical values for records to be considered duplicates.
- Specifying non-empty match fields (
filled
option) helps prevent false uniqueness due to missing data. - The first record in a set of duplicates is retained as the original by default. Use
sort_fields
to control the order and, consequently, which record is preserved. - Output lists will have an extra field prepended to the beginning, indicating the record’s original line number in the input file before sorting.
Response Overview
The JSON response will contain meta information and the state_url that can be used to obtain the job status. Once completed the JSON response will include the URL to send subsequent API requests to.
Dispatched response
{
"versium": {
"version": "2.0",
"query_id": "9dc4828868e938d263a77fd1a7a63623",
"query_time": 0.18544507026672363,
"results": {
"id": 194,
"guid": "932906eb-e63d-4712-b636-a4df4fc0be1e",
"state": "dispatched",
"operation": "dedupe",
"created_at": "2024-04-16T01:30:14.000000Z",
"finished_at": null,
"config": {
"match_fields": [[1,"i","t","f"],[2,"i","t","f"],[4,"i","t"]],
"sort_fields": [[3,"i","t","f"],[9,"i","t","f"],[0,"i","t","r"]],
},
"output_list": {
"id": 224,
"download_url": null,
"num_matches": 0,
"num_records": 0,
"project_id": 8,
"insights": null
},
"state_url": "http://api.versium.com/v2/jobs/194",
"callback_url": null,
"suppression_input_list_ids": null,
"est_sec_remaining": null,
"progress_percentage": 0
}
}
}
Processing response:
{
"versium": {
"version": "2.0",
"query_id": "3570e37172475acffd82f1f2fd6a144a",
"query_time": 0.06275701522827148,
"results": {
"id": 194,
"guid": "932906eb-e63d-4712-b636-a4df4fc0be1e",
"state": "processing",
"operation": "dedupe",
"created_at": "2024-04-16T01:30:14.000000Z",
"finished_at": null,
"config": {
"match_fields": [[1,"i","t","f"],[2,"i","t","f"],[4,"i","t"]],
"sort_fields": [[3,"i","t","f"],[9,"i","t","f"],[0,"i","t","r"]],
},
"output_list": {
"id": 224,
"download_url": null,
"num_matches": 0,
"num_records": 0,
"project_id": 8,
"insights": null
},
"state_url": "http://api.versium.com/v2/jobs/194",
"callback_url": null,
"suppression_input_list_ids": null,
"est_sec_remaining": null,
"progress_percentage": 0
}
}
}
Completed job:
Once the job is completed, the response will contain the "state":"Done" string. The response will also contain the following:
- download URL - the url that can be used to download the generated list in CSV format.
- number of matches in the list
- number of records in the list
- insights - a set of data containing the insights of the generated list
{
"versium": {
"version": "2.0",
"query_id": "e6d1535f504b2ea8be1726431185a9ed",
"query_time": 0.23850011825561523,
"results": {
"id": 194,
"guid": "932906eb-e63d-4712-b636-a4df4fc0be1e",
"state": "done",
"operation": "dedupe",
"created_at": "2024-04-16T01:30:14.000000Z",
"finished_at": "2024-04-16 01:30:28",
"config": {
"match_fields": [[1,"i","t","f"],[2,"i","t","f"],[4,"i","t"]],
"sort_fields": [[3,"i","t","f"],[9,"i","t","f"],[0,"i","t","r"]],
},
"output_list": {
"id": 224,
"download_url": "http://api.versium.com/v2/lists/224/download",
"num_matches": 71702,
"num_records": 71702,
"project_id": 8,
"insights": null
},
"state_url": "http://api.test/v2/jobs/194",
"callback_url": null,
"suppression_input_list_ids": null,
"est_sec_remaining": 0,
"progress_percentage": 100
}
}
}
Examples Scenarios
Below is a table of dummy data that will be used to demonstrate different deduplication scenarios. Rec ID is for record identification purposes only and will not count toward the field indices.
Rec ID | Updated | First | Last | Phone | Address | City | State | Zip | Busname | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2010-01-01 | John | Smith | 245-777-8866 | 123 Main St | Redmond | WA | 98038 | Microsoft | |
1 | 2014-01-01 | John | Smith | 245-777-8866 |
|
|
| 98038 | ||
2 | 2020-01-01 | John | Smith | 248-6080-548 | 123 Main St | Redmond |
|
|
| |
3 | 2021-01-01 | John | Smith |
|
| Redmond | WA | 98038 |
| |
4 | 2023-01-01 | Jane | Smith | 245-777-8866 | 123 Main St | Redmond | WA | 98038 | Amazon | |
5 | 2021-01-01 | Joe | Smith | 848-555-1234 | 211 1st St | Kirkland | WA | 98033 |
| |
6 | 2022-01-01 | Joe | Smith | 848-555-1234 | 456 State St | Bellevue | WA | 98038 | Amazon |
|
7 | 2023-01-01 | Joe | Smith |
| 456 State St | Bellevue | WA | 98038 | Amazon | |
8 | 2022-01-01 | Joe | Schmo |
|
|
| NaN | NaN | ||
9 | 2023-01-01 | Joe | Schmo | 989-221-4883 | 411 Pine St | Seattle | WA | 98103 | Microsoft | |
10 | 2023-02-01 | Joe | Schmo | 989-222-4883 | 411 Pine St | Seattle | WA | 98103 | Microsoft | |
11 | 2023-03-01 | Joe | Schmo | 989-223-4883 | 411 Pine St | Seattle | WA | 98103 | Microsoft | |
12 | 2023-04-01 | Joe | Schmo |
| 411 Pine St | Seattle | WA | 98103 | Microsoft |
Scenario 1: Individual First & Last Match
Only consider first and last name when determining if two records are duplicates.
Match on First, Last.
Sort on filled Phone, then filled Email, then most recently Updated
match_fields: [[1], [2]]
sort_fields: [[3, “f”], [9, “f”], [0, “r”]]
Rec ID | Updated | First | Last | Phone | Address | City | State | Zip | Busname | |
---|---|---|---|---|---|---|---|---|---|---|
2 | 2020-01-01 | John | Smith | 248-6080-548 | 123 Main St | Redmond |
|
|
| |
4 | 2023-01-01 | Jane | Smith | 245-777-8866 | 123 Main St | Redmond | WA | 98038 | Amazon | |
5 | 2021-01-01 | Joe | Smith | 848-555-1234 | 211 1st St | Kirkland | WA | 98033 |
| |
11 | 2023-03-01 | Joe | Schmo | 989-223-4883 | 411 Pine St | Seattle | WA | 98103 | Microsoft |
Scenario 2: Household
Only keep 1 record per household.
Match on Last, Address.
Sort on filled Phone, then filled Email, then most recently Updated
match_fields: [[2], [4]]
sort_fields: [[3, “f”], [9, “f”], [0, “r”]]
Rec ID | Updated | First | Last | Phone | Address | City | State | Zip | Busname | |
---|---|---|---|---|---|---|---|---|---|---|
4 | 2023-01-01 | Jane | Smith | 245-777-8866 | 123 Main St | Redmond | WA | 98038 | Amazon | |
5 | 2021-01-01 | Joe | Smith | 848-555-1234 | 211 1st St | Kirkland | WA | 98033 |
| |
6 | 2022-01-01 | Joe | Smith | 848-555-1234 | 456 State St | Bellevue | WA | 98038 | Amazon |
|
11 | 2023-03-01 | Joe | Schmo | 989-223-4883 | 411 Pine St | Seattle | WA | 98103 | Microsoft |
Scenario 3: Multiphone
Keep each unique phone number for an individual. Allows for duplicate phone numbers as long as they are for a different person, such as a home phone for a household.
Match on First, Last, Phone (Filled)
Sort on most recently Updated
match_fields: [[1], [2], [3, “f”]
sort_fields: [[0, “r”]]
Rec ID | Updated | First | Last | Phone | Address | City | State | Zip | Busname | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2014-01-01 | John | Smith | 245-777-8866 |
|
|
| 98038 | ||
2 | 2020-01-01 | John | Smith | 248-6080-548 | 123 Main St | Redmond |
|
|
| |
4 | 2023-01-01 | Jane | Smith | 245-777-8866 | 123 Main St | Redmond | WA | 98038 | Amazon | |
6 | 2022-01-01 | Joe | Smith | 848-555-1234 | 456 State St | Bellevue | WA | 98038 | Amazon |
|
9 | 2023-01-01 | Joe | Schmo | 989-221-4883 | 411 Pine St | Seattle | WA | 98103 | Microsoft | |
10 | 2023-02-01 | Joe | Schmo | 989-222-4883 | 411 Pine St | Seattle | WA | 98103 | Microsoft | |
11 | 2023-03-01 | Joe | Schmo | 989-223-4883 | 411 Pine St | Seattle | WA | 98103 | Microsoft |