DataPrep performs a series of data cleansing and manipulation actions on inputs and returns the results of those actions. DataPrep actions can generate new fields, enriching the input records with extra data points. They can also overwrite existing fields, replacing messy inputs with cleansed data. Control over whether a new field is created or an existing field is overwritten lies completely with the end user.
Actions can take multiple inputs and have multiple outputs depending on the particular operation they perform. Actions can be chained together in a sequence such that the output(s) of one action becomes the input(s) of a subsequent action. The API takes the name of an action, input fields, output fields, and sometimes options that change the behavior of the action.
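As a minimal sketch of chaining, the request body below cleans a first name and then uppercases the cleaned value. The payload keys (`inputs`, `actions`, `name`, `inputFields`, `outputFields`, `output`) and the action strings `first` and `strtoupper` come from this document. The field names `FirstName`, `FirstClean`, and `FirstUpper` and the helper name are made up for illustration.

```python
import json


def build_chained_payload(records):
    """Build a long-form body in which the second action consumes the first action's output."""
    return {
        "inputs": records,
        "actions": [
            {
                "name": "first",
                "inputFields": ["FirstName"],
                # Writing to a new field keeps the original input intact.
                "outputFields": ["FirstClean"],
            },
            {
                "name": "strtoupper",
                # Chaining: the input is the field the previous action produced.
                "inputFields": ["FirstClean"],
                "outputFields": ["FirstUpper"],
            },
        ],
        "output": ["FirstName", "FirstClean", "FirstUpper"],
    }


payload = build_chained_payload([{"FirstName": "Jonath4an"}])
print(json.dumps(payload, indent=2))
```

To overwrite the input instead of adding a field, the same name can be used in both `inputFields` and `outputFields`, as the Date Transform example in this document does with `DOB`.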
There are two ways to call the API: a long-form API and a short-form API. Both provide the same functionality, except that the short form accepts only a single input record per request, while the long form allows batching multiple records into a single request.
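The difference between the two forms can be sketched by building one short-form URL per record. The endpoint is the one used in this document's examples. The `name:inputs:outputs:options` layout of the `actions[]` value is inferred from the short-form examples in this section and is an assumption, as are the helper names. Authentication is not covered here, and no request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.versium.com/v2/dataprep"  # endpoint from this document's examples


def action_string(name, inputs, outputs, options=()):
    """Join the parts as name:inputs:outputs[:options] (layout assumed from the examples)."""
    parts = [name, ",".join(inputs), ",".join(outputs)]
    if options:
        parts.append(",".join(str(o) for o in options))
    return ":".join(parts)


def short_form_url(record, actions, output=()):
    """Build a short-form URL for exactly one record."""
    params = [("actions[]", a) for a in actions]
    params += list(record.items())
    if output:
        params.append(("output", ",".join(output)))
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return BASE_URL + "?" + urlencode(params)


records = [
    {"FirstName": "John", "DOB": "01 15 1980"},
    {"FirstName": "Jane", "DOB": "06 24 1990"},
]
action = action_string("datetransform", ["DOB"], ["DOB"], [4])
for record in records:
    # Short form: one request per record; the long form would send both in one body.
    print(short_form_url(record, [action]))
```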
List of DataPrep Actions
The available actions are listed below. The inputs, outputs, and options for each action are included with its definition.
What it does | Cleans this | To this |
---|---|---|
Checks an email field for validity. | | |
Cleans bad characters from a name. | Jonath4an Doe | Jonathan Doe |
Cleans bad characters from a first name. | Jonath4an | Jonathan |
Cleans bad characters from a last name. | Smi#th | Smith |
Cleans bad characters from a business name. | Micro$soft | Microsoft |
Cleans bad characters from a city name. | Seatt?le | Seattle |
Checks for a valid US state and converts it to a 2-letter abbreviation if possible. | Washington | WA |
Checks whether a field looks like a 5-digit US postal code. Does NOT check for full validity (e.g. 00005 passes this check). | 98052 | 98052 (valid) |
Checks whether a field looks like a potential 9-digit US postal code and, if so, splits it into two columns containing the first 5 digits and the other 4. | 90601-1051 | [90601] |
Given a country name field, outputs a country code or blank if it is invalid. | Italy | IT |
Checks a phone field for validity. Outputs the number if valid, or blank otherwise. | +14175763685 | 4175763685 |
Converts a field to all uppercase. | uppercase | UPPERCASE |
Capitalizes the first letter of each word in a field. | washington | Washington |
Standardizes a job role field and ranks it; a lower number indicates a higher rank. | Irrigation Sales Manager | Sales, Agriculture |
Splits a full name into separate columns for prefix, firstname, middlename, lastname, and suffix, all uppercased. | [Chris Angelo Smith] | [CHRIS][ANGELO][SMITH] |
Parses a name field into separate columns for category (individual/business/unknown), prefix, firstname, middlename, lastname, suffix, and business name if applicable. | | |
Attempts to parse a fuzzy location like "Greater Seattle Area" into a meaningful location like "Seattle, WA, US". | [Greater Seattle Area] | [Seattle] [WA] [US] |
Extracts the domain from an email. | | gmail.com |
Checks if an email address appears to be from a free Email Service Provider or not. | | 1 |
Checks if an IP looks like a valid public IP address, outputs blank if the check fails. | 50.242.100.253 | 50.242.100.253 |
Converts a dotted quad IP address into an integer value. | 206.40.146.40 | 3458765352 |
Converts an integer value into a dotted quad IP address. | 3458765352 | 206.40.146.40 |
Generates a probable email for a given first name, last name, and domain. | John, Doe, Versium.com | |
Checks if a name appears to match an email and outputs component match count (how many name components match the email) and a weighted score. Higher scores are better. | Name = Wendi | N2E Matches = 7 |
Attempts to extract a valid year from a field. Will filter out non-numeric characters. | | |
Merges separate year, month, and day fields into a single field. | | |
Attempts to extract a date and format it as YYYYMMDD. | | |
Merges separate date and time fields into a single field with the format YYYY-MM-DD HH:MM:SS. | | |
Extracts the hour from a timestamp field. | | |
Reformats a full datetime field into YYYY-MM-DD HH:MM:SS. | | |
Transforms a date field from one format to another. | | |
Filters out values matching a certain pattern. | | |
Outputs the domain name for an email or url stemmed without the host. | | |
Converts accented characters to non-accented equivalents. | | |
Takes an input and applies a hashing algorithm with an optional salt. | | |
Returns the line type (mobile, landline, etc.) and the name of the carrier. | | |
Takes one or more inputs and tries to extract a postal address and standardize it. Returns Housenumber, Street, Unit, City, State, Zip, Country. | [5550 Newcastle ave #555, Encino, CA, 91316, United States] | [5550] [Newcastle ave] [#555] [Encino] [CA] [91316] [United States] |
Combines and standardizes address components into a single field. | [5550 Newcastle Ave #555] [Encino] [CA] [91316] | [5550 Newcastle Ave #555, Encino, CA, 91316] |
Accepts (US-based) latitude and longitude fields and attempts to output an address for the given location. | [47.62230901][-122.3486291] | [1] [114 REPUBLICAN ST] [SEATTLE] [WA] [98109] [4534] |
Takes a first name, last name, address, and zip as inputs and creates a pseudo-unique identifier for that individual. | | |
Takes a first name, last name, address, and zip as inputs and creates a pseudo-unique identifier for that household. | | |
Takes a Zip5 and Zip4 and returns the congressional district for that area. | | |
Takes the name of a country and outputs the region (Africa, APAC, US/CA, LATAM, etc.). | Italy | EMEA |
Takes in an IP address and outputs IP Country, IP City, IP Zip, IP ISP Name, IP Domain Name, IP Usage Type, Proxy Type, IP Block ID, IP Block Len. | [161.69.123.10] | [US] [NY] [New York City] [DCH] [VPN] |
Formats a phone number into a selected standard format. Only works with 10-digit (or 11-digit with country code) North American phone numbers. | [+1 417.576 3685] | [+1 (417) 576-3685] |
Clean Email
API Action String: email
Checks an email field for validity. Outputs the email on success and blank on failure. Provides light correction (e.g. gmail.co becomes gmail.com).
Input Idx | Input Type |
---|---|
0 | Email |
Output Idx | Output Type |
---|---|
0 | Email |
Option Idx | Option | Values |
---|---|---|
0 | aggressive | 0 = No |
Examples:
Long-form request:
{
"inputs": [
{
"FirstName": "John",
"LastName": "Smith",
"EmailAddr": "[email protected]"
},
{
"FirstName": "Jane",
"LastName": "Williams",
"EmailAddr": "[email protected]"
}
],
"actions": [
{
"name": "email",
"inputFields": [
"EmailAddr"
],
"outputFields": [
"EmailAddrClean"
],
"options": {
"aggressive": 1
}
}
],
"output": [
"FirstName",
"LastName",
"EmailAddrClean",
"EmailAddr"
]
}
Short-form requests (one record per call):
https://api.versium.com/v2/dataprep?actions[]=email:EmailAddr:EmailAddrClean:1&FirstName=John&LastName=Smith&EmailAddr=[email protected]&output=FirstName,LastName,EmailAddrClean,EmailAddr
https://api.versium.com/v2/dataprep?actions[]=email:EmailAddr:EmailAddrClean:1&FirstName=Jane&LastName=Williams&EmailAddr=[email protected]&output=FirstName,LastName,EmailAddrClean,EmailAddr
Response:
{
"versium": {
"version": "2.0",
"match_counts": [],
"num_matches": 0,
"num_results": 1,
"query_id": "0fe4aa159cca6853dd",
"query_time": 0.145,
"results": [
{
"FirstName": "John",
"LastName": "Smith",
"EmailAddrClean": "",
"EmailAddr": "[email protected]"
},
{
"FirstName": "Jane",
"LastName": "Williams",
"EmailAddrClean": "[email protected]",
"EmailAddr": "[email protected]"
}
]
}
}
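A response like the one above can be split into rows that passed and rows that failed the check, since the action outputs blank on failure. The sketch below assumes the `versium` / `results` nesting shown in the example response. The helper name and the sample addresses are invented for illustration.

```python
def split_by_email_validity(response, clean_field="EmailAddrClean"):
    """Partition result rows by whether the cleaned email field is non-blank."""
    results = response["versium"]["results"]
    valid = [row for row in results if row.get(clean_field)]
    invalid = [row for row in results if not row.get(clean_field)]
    return valid, invalid


# Placeholder response with made-up addresses, shaped like the example above.
sample = {
    "versium": {
        "results": [
            {"FirstName": "John", "EmailAddrClean": "", "EmailAddr": "john@invalid"},
            {
                "FirstName": "Jane",
                "EmailAddrClean": "jane@example.com",
                "EmailAddr": "jane@example.com",
            },
        ]
    }
}

valid, invalid = split_by_email_validity(sample)
print([row["FirstName"] for row in valid])    # rows whose email was accepted
print([row["FirstName"] for row in invalid])  # rows whose cleaned email is blank
```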
Clean Name
API Action String: name
Cleans bad characters from a name (only allows alphabetic characters).
Input Idx | Input Type |
---|---|
0 | Fullname |
Output Idx | Output Type |
---|---|
0 | Fullname |
Clean First Name
API Action String: first
Cleans bad characters from a first name (only allows alphabetic characters).
Input Idx | Input Type |
---|---|
0 | First |
Output Idx | Output Type |
---|---|
0 | First |
Clean Last Name
API Action String: last
Cleans bad characters from a last name (only allows alphabetic characters).
Input Idx | Input Type |
---|---|
0 | Last |
Output Idx | Output Type |
---|---|
0 | Last |
Clean Business Name
API Action String: busname
Cleans bad characters from a business name (only allows alphanumeric characters).
Input Idx | Input Type |
---|---|
0 | Business |
Output Idx | Output Type |
---|---|
0 | Business |
Clean City Name
API Action String: city
Cleans bad characters from a city name (only allows alphabetic characters).
Input Idx | Input Type |
---|---|
0 | City |
Output Idx | Output Type |
---|---|
0 | City |
Clean State Name
API Action String: state
Checks for a valid US state and converts it to a 2-letter abbreviation if possible.
Input Idx | Input Type |
---|---|
0 | State |
Output Idx | Output Type |
---|---|
0 | State |
US ZIP5 Check
API Action String: uszip5
Checks whether a field looks like a 5-digit US postal code. Does NOT check for full validity (e.g. 00005 passes this check).
Input Idx | Input Type |
---|---|
0 | Zip |
Output Idx | Output Type |
---|---|
0 | Zip |
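The "looks like" check described here is a format test only. A minimal local sketch, assuming the test is simply "exactly five digits" (the service's exact rule is not documented here, and the function name is hypothetical):

```python
import re


def looks_like_zip5(value):
    """Return True when the value is exactly five digits; no lookup against real ZIP codes."""
    return re.fullmatch(r"\d{5}", value.strip()) is not None


for candidate in ["98052", "00005", "9805", "98052-1234"]:
    print(candidate, looks_like_zip5(candidate))
```

Note that `00005` passes, which mirrors the caveat above: the check says nothing about whether the code is actually assigned.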
US ZIP9 Check & Split
API Action String: uszip9
Checks whether a field looks like a potential 9-digit US postal code and, if so, splits it into two columns containing the first 5 digits and the other 4.
Input Idx | Input Type |
---|---|
0 | Zip |
Output Idx | Output Type |
---|---|
0 | Zip |
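The split can be sketched locally with a regular expression. The accepted separators (hyphen, space, or none) and the blank pair returned on failure are assumptions, not documented behavior of the service. The function name is hypothetical.

```python
import re

# Five digits, an optional hyphen or space, then four digits.
ZIP9_PATTERN = re.compile(r"(\d{5})[-\s]?(\d{4})")


def split_zip9(value):
    """Return (zip5, zip4) for a 9-digit-looking code, or two blanks otherwise."""
    match = ZIP9_PATTERN.fullmatch(value.strip())
    if match is None:
        return ("", "")
    return (match.group(1), match.group(2))


print(split_zip9("90601-1051"))
print(split_zip9("98052"))
```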
Clean Country Code
API Action String: country
Given a country name field, outputs a country code or blank if it is invalid (allows only alphabetic characters).
Input Idx | Input Type |
---|---|
0 | Country |
Output Idx | Output Type |
---|---|
0 | Country |
Clean Phone Number
API Action String: phone
Checks a phone field for validity. Outputs the number if valid, or blank otherwise.
Input Idx | Input Type |
---|---|
0 | Phone |
Output Idx | Output Type |
---|---|
0 | Phone |
Uppercase
API Action String: strtoupper
Converts a field to all uppercase.
Input Idx | Input Type |
---|---|
0 | Any |
Output Idx | Output Type |
---|---|
0 | Generic String |
Capitalize Words
API Action String: ucwords
Capitalizes the first letter of each word in a field.
Input Idx | Input Type |
---|---|
0 | Any |
Output Idx | Output Type |
---|---|
0 | Generic String |
Standardize & Rank Job Role
API Action String: titlerank3
Standardizes a job role field and ranks it; a lower number indicates a higher rank.
Input Idx | Input Type |
---|---|
0 | Title |
Output Idx | Output Type |
---|---|
0 | Title Rank 3 |
1 | Generic String |
Split Full Name
API Action String: splitfullname2
Splits a full name into separate columns for prefix, firstname, middlename, lastname, and suffix, all uppercased.
Input Idx | Input Type |
---|---|
0 | Fullname |
Output Idx | Output Type |
---|---|
0 | Generic String |
1 | First |
2 | Generic String |
3 | Last |
4 | Generic String |
Categorize & Split Name
API Action String: namecatparse
Parses a name field into separate columns for category (individual/business/unknown), prefix, firstname, middlename, lastname, suffix, and business name if applicable.
Input Idx | Input Type |
---|---|
0 | Fullname |
Output Idx | Output Type |
---|---|
0 | Generic String |
1 | Generic String |
2 | First |
3 | Generic String |
4 | Last |
5 | Generic String |
6 | Business |
Examples:
Long-form request:
{
"inputs": [
{
"FullName": "John Smith"
},
{
"FullName": "Jane Williams"
}
],
"actions": [
{
"name": "namecatparse",
"inputFields": [
"FullName"
],
"outputFields": [
"EntityCategory",
"Prefix",
"First",
"Middle",
"Last",
"Suffix",
"BusName"
]
}
],
"output": [
"EntityCategory",
"First",
"Middle",
"Last",
"BusName"
]
}
Short-form requests (one record per call):
https://api.versium.com/v2/dataprep?actions[]=namecatparse:FullName:EntityCategory,Prefix,First,Middle,Last,Suffix,BusName&FullName=John Smith&output=EntityCategory,First,Middle,Last,BusName
https://api.versium.com/v2/dataprep?actions[]=namecatparse:FullName:EntityCategory,Prefix,First,Middle,Last,Suffix,BusName&FullName=Jane Williams&output=EntityCategory,First,Middle,Last,BusName
Response:
{
"versium": {
"version": "2.0",
"match_counts": [],
"num_matches": 0,
"num_results": 1,
"query_id": "0fe4aa159cca6853dd",
"query_time": 0.145,
"results": [
{
"EntityCategory": "Individual",
"First": "John",
"Middle": "",
"Last": "Smith",
"BusName": ""
},
{
"EntityCategory": "Individual",
"First": "Jane",
"Middle": "",
"Last": "Williams",
"BusName": ""
}
]
}
}
Fix Fuzzy Location
API Action String: tlilocmap
Attempts to parse a fuzzy location like "Greater Seattle Area" into a meaningful location like "Seattle, WA, US".
Input Idx | Input Type |
---|---|
0 | Any |
Output Idx | Output Type |
---|---|
0 | Address |
1 | City |
2 | State |
3 | Zip |
4 | Country |
Extract Domain
API Action String: domain
Extracts the domain from an email.
Input Idx | Input Type |
---|---|
0 | Email |
Output Idx | Output Type |
---|---|
0 | Domain |
ESP Email Check
API Action String: isespfree
Checks whether an email address appears to be from a free Email Service Provider (1 = Free ESP, 0 = Private ESP).
Input Idx | Input Type |
---|---|
0 | Email |
Output Idx | Output Type |
---|---|
0 | Generic String |
Public IP Address Check
API Action String: ip
Checks if an IP looks like a valid public IP address, outputs blank if the check fails.
Input Idx | Input Type |
---|---|
0 | Ip |
Output Idx | Output Type |
---|---|
0 | Ip |
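A comparable check can be sketched with Python's standard `ipaddress` module. Treating "public" as `is_global` is an assumption; the service may draw the line differently. The function name is hypothetical.

```python
import ipaddress


def public_ip_or_blank(value):
    """Return the address if it parses and is globally routable, else a blank string."""
    try:
        address = ipaddress.ip_address(value.strip())
    except ValueError:
        # Not a syntactically valid IPv4 or IPv6 address.
        return ""
    return str(address) if address.is_global else ""


for candidate in ["50.242.100.253", "192.168.1.10", "999.1.1.1"]:
    print(repr(candidate), "->", repr(public_ip_or_blank(candidate)))
```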
IP Address to Integer
API Action String: ip2long
Converts a dotted quad IP address into an integer value (e.g. 206.40.146.40 becomes 3458765352).
Input Idx | Input Type |
---|---|
0 | Ip |
Output Idx | Output Type |
---|---|
0 | Generic String |
Integer to IP Address
API Action String: long2ip
Converts an integer value into a dotted quad IP address (e.g. 3458765352 becomes 206.40.146.40).
Input Idx | Input Type |
---|---|
0 | Any |
Output Idx | Output Type |
---|---|
0 | Ip |
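The two IP conversions are inverses of each other: each octet is one byte of a 32-bit integer, so 206.40.146.40 is 206·256³ + 40·256² + 146·256 + 40 = 3458765352. A short sketch using Python's standard `ipaddress` module (the function names are illustrative, not part of the API):

```python
import ipaddress


def ip_to_int(dotted_quad):
    """Dotted quad to 32-bit integer, most significant octet first."""
    return int(ipaddress.IPv4Address(dotted_quad))


def int_to_ip(value):
    """32-bit integer back to dotted quad notation."""
    return str(ipaddress.IPv4Address(value))


def ip_to_int_manual(dotted_quad):
    """Same conversion written out as the byte arithmetic."""
    total = 0
    for octet in dotted_quad.split("."):
        total = total * 256 + int(octet)
    return total


print(ip_to_int("206.40.146.40"))  # 3458765352
print(int_to_ip(3458765352))       # 206.40.146.40
```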
Generate Patterned Email
API Action String: gpe
Generates a probable email for a given first name, last name, and domain.
Input Idx | Input Type |
---|---|
0 | First |
1 | Last |
2 | Domain |
Output Idx | Output Type |
---|---|
0 | Email |
Name-To-Email Check
API Action String: n2echeck
Checks if a name appears to match an email and outputs component match count (how many name components match the email) and a weighted score. Higher scores are better.
Input Idx | Input Type |
---|---|
0 | Email |
1 | First |
2 | Last |
Output Idx | Output Type |
---|---|
0 | Generic String |
1 | Generic String |
Clean/Extract Year
API Action String: year
Attempts to extract a valid year from a field. Will filter out non-numeric characters.
Input Idx | Input Type |
---|---|
0 | Any |
Output Idx | Output Type |
---|---|
0 | Generic String |
Merge Date Fields
API Action String: dobmerge
Merges separate year, month, and day fields into a single field.
Input Idx | Input Type |
---|---|
0 | Any |
1 | Any |
2 | Any |
Output Idx | Output Type |
---|---|
0 | Date |
Date Extract
API Action String: dob
Attempts to extract a date and format it as YYYYMMDD.
Input Idx | Input Type |
---|---|
0 | Any |
Output Idx | Output Type |
---|---|
0 | Date |
DateTime Merge
API Action String: tsmerge
Merges separate date and time fields into a single field with the format YYYY-MM-DD HH:MM:SS.
Input Idx | Input Type |
---|---|
0 | Date |
1 | Time |
Output Idx | Output Type |
---|---|
0 | Datetime |
Time to Hour
API Action String: time2hour
Extracts the hour from a timestamp field.
Input Idx | Input Type |
---|---|
0 | Any |
Output Idx | Output Type |
---|---|
0 | Generic String |
Format DateTime
API Action String: timestamp
Reformats a full datetime field into YYYY-MM-DD HH:MM:SS.
Input Idx | Input Type |
---|---|
0 | Datetime |
Output Idx | Output Type |
---|---|
0 | Datetime |
Date Transform
API Action String: datetransform
Transforms a date field from one format to another.
Input Idx | Input Type |
---|---|
0 | Date |
Output Idx | Output Type |
---|---|
0 | Date |
Option Idx | Option | Value |
---|---|---|
0 | datetransform (Transform Type) | 0 = (MMDDYYYY to YYYYMMDD) |
Examples:
Long-form request:
{
"inputs": [
{
"FirstName": "John",
"LastName": "Smith",
"DOB": "01 15 1980"
},
{
"FirstName": "Jane",
"LastName": "Williams",
"DOB": "06 24 1990"
}
],
"actions": [
{
"name": "datetransform",
"inputFields": [
"DOB"
],
"outputFields": [
"DOB"
],
"options": {
"datetransform": 4
}
}
]
}
Short-form requests (one record per call):
https://api.versium.com/v2/dataprep?actions[]=datetransform:DOB:DOB:4&FirstName=John&LastName=Smith&DOB=01 15 1980
https://api.versium.com/v2/dataprep?actions[]=datetransform:DOB:DOB:4&FirstName=Jane&LastName=Williams&DOB=06 24 1990
Response:
{
"versium": {
"version": "2.0",
"match_counts": [],
"num_matches": 0,
"num_results": 1,
"query_id": "0fe4aa159cca6853dd",
"query_time": 0.145,
"results": [
{
"FirstName": "John",
"LastName": "Smith",
"DOB": "19800115"
},
{
"FirstName": "Jane",
"LastName": "Williams",
"DOB": "19900624"
}
]
}
}
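The transformation in this example (space-separated month, day, year to YYYYMMDD) can be reproduced locally to sanity-check results. This is a sketch with Python's `datetime`, not the service's implementation. The function name is hypothetical, and raising on unparseable input is a choice made here, since the service's failure behavior is not documented in this section.

```python
from datetime import datetime


def mmddyyyy_to_yyyymmdd(value, sep=" "):
    """Parse 'MM DD YYYY' (with the given separator) and emit 'YYYYMMDD'."""
    parsed = datetime.strptime(value, f"%m{sep}%d{sep}%Y")
    return parsed.strftime("%Y%m%d")


print(mmddyyyy_to_yyyymmdd("01 15 1980"))  # 19800115
print(mmddyyyy_to_yyyymmdd("06 24 1990"))  # 19900624
```

Parsing through `datetime` rather than slicing the string also rejects impossible dates such as month 13.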
Filter Values
API Action String: mvzonk
Filters out values matching a certain pattern.
Input Idx | Input Type |
---|---|
0 | Any |
Output Idx | Output Type |
---|---|
0 | Generic String |
Option Idx | Option | Value |
---|---|---|
0 | pattern (Pattern) | email (Is Email) |
Stem Domain
API Action String: stemdomain
Outputs the domain name for an email or url stemmed without the host.
Input Idx | Input Type |
---|---|
0 | Domain |
Output Idx | Output Type |
---|---|
0 | Domain |
Transliterate
API Action String: transliterate
Converts accented characters to non-accented equivalents.
Input Idx | Input Type |
---|---|
0 | Any |
Output Idx | Output Type |
---|---|
0 | Generic String |
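One common way to get this effect locally is Unicode decomposition followed by dropping the combining marks. The sketch below uses Python's standard `unicodedata` module. It is an approximation, not the service's implementation, and the function name is hypothetical. Letters with no canonical decomposition (for example `ø` or `ß`) pass through unchanged here.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("José Muñoz"))  # Jose Munoz
```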
Hashmap
API Action String: hashmap
Takes an input and applies a hashing algorithm with an optional salt.
Input Idx | Input Type |
---|---|
0 | Any |
Output Idx | Output Type |
---|---|
0 | Generic String |
Option Idx | Options | Value |
---|---|---|
0 | algorithm | md5U |
1 | salt | any |
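For comparison with values hashed locally, a salted MD5 can be computed with Python's `hashlib`. Two assumptions are made here and should be verified against real output: that `md5U` means an uppercase hex MD5 digest, and that the salt is prepended to the value. Neither is confirmed by this document, and the function name is hypothetical.

```python
import hashlib


def md5_upper(value, salt=""):
    """Uppercase hex MD5 of salt + value (salt placement is an assumption)."""
    digest = hashlib.md5((salt + value).encode("utf-8")).hexdigest()
    return digest.upper()


print(md5_upper("abc"))               # 900150983CD24FB0D6963F7D28E17F72
print(md5_upper("abc", salt="pepper"))  # differs from the unsalted digest
```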
Phoneline Type
API Action String: linetype-pe
Returns the line type (mobile, landline, etc.) and the name of the carrier.
Input Idx | Input Type |
---|---|
0 | Phone |
Output Idx | Output Type |
---|---|
0 | Line Type |
1 | Generic String |
LibPostal Address Standardizer
API Action String: lpaddrstd
Takes one or more inputs and tries to extract a postal address and standardize it. Returns House number, Street, Unit, City, State, Zip, Country.
Input Idx | Input Type |
---|---|
0 | Full Address |
0 | ... |
Output Idx | Output Type |
---|---|
0 | Generic String |
1 | Generic String |
2 | Generic String |
3 | City |
4 | State |
5 | Zip |
6 | Country |
Combine & Standardize Address
API Action String: addrstd
Combines and standardizes address components into a single field.
Input Idx | Input Type |
---|---|
0 | Address |
1 | City |
2 | State |
3 | Zip |
Output Idx | Output Type |
---|---|
0 | Address |
Option Idx | Option | Value |
---|---|---|
0 | blankifinvalid | 0 = No |
Reverse Geocode
API Action String: reversegeocode
Accepts (US-based) latitude and longitude fields and attempts to output an address for the given location.
Input Idx | Input Type |
---|---|
0 | Any (Latitude) |
1 | Any (Longitude) |
Output Idx | Output Type |
---|---|
0 | Generic String (Precision) |
1 | Address |
2 | City |
3 | State |
4 | Zip |
5 | Zip4 |
Simple Tag Individual
API Action String: stindiv
Takes a first name, last name, address, and zip as inputs and creates a pseudo-unique identifier for that individual.
Input Idx | Input Type |
---|---|
0 | First |
1 | Last |
2 | Address |
3 | Zip |
Output Idx | Output Type |
---|---|
0 | Generic String |
Option Idx | Option | Value |
---|---|---|
0 | length | 1-50 |
Simple Tag Household
API Action String: sthhld
Takes a first name, last name, address, and zip as inputs and creates a pseudo-unique identifier for that household.
Input Idx | Input Type |
---|---|
0 | First |
1 | Last |
2 | Address |
3 | Zip |
Output Idx | Output Type |
---|---|
0 | Generic String |
Option Idx | Option | Value |
---|---|---|
0 | length | 1-50 |
Zip to Congressional District
API Action String: z54tocongrdist
Takes a Zip5 and Zip4 and returns the congressional district for that area.
Input Idx | Input Type |
---|---|
0 | Zip |
1 | Zip4 |
Output Idx | Output Type |
---|---|
0 | Generic String |
Global Region
API Action String: globalregion
Takes the name of a country and outputs the region (Africa, APAC, US/CA, LATAM, etc.).
Input Idx | Input Type |
---|---|
0 | Country |
Output Idx | Output Type |
---|---|
0 | Generic String |
IP To Location
API Action String: ip2loc
Takes in an IP address and outputs IP Country, IP City, IP Zip, IP ISP Name, IP Domain Name, IP Usage Type, Proxy Type, IP Block ID, IP Block Len.
Input Idx | Input Type |
---|---|
0 | IP |
Output Idx | Output Type |
---|---|
0 | Country |
1 | City |
2 | Zip |
3 | Generic String |
4 | Generic String |
5 | Generic String |
6 | Generic String |
7 | Generic String |
8 | Generic String |
Format Phone Number
API Action String: naphonefmt
Formats a phone number into a selected standard format. Only works with 10-digit (or 11-digit with country code) North American phone numbers.
Input Idx | Input Type |
---|---|
0 | Phone |
Output Idx | Output Type |
---|---|
0 | Phone |
Option Idx | Option | Value |
---|---|---|
0 | phoneformat | 0 = (XXX) XXX-XXXX |
1 | includenacountrycode | 0 = No (Always Stripped) |
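The `(XXX) XXX-XXXX` format with and without the country code can be sketched locally as follows. This is an illustration under assumptions: the function and parameter names are hypothetical, only the one format listed above is implemented, and returning blank for anything that is not a 10-digit (or 11-digit with leading 1) number is a guess at the failure behavior.

```python
import re


def format_na_phone(raw, include_country_code=False):
    """Format a North American number as (XXX) XXX-XXXX, optionally prefixed with +1."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip the country code
    if len(digits) != 10:
        return ""  # assumption: blank when the input does not fit
    formatted = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return f"+1 {formatted}" if include_country_code else formatted


print(format_na_phone("+1 417.576 3685", include_country_code=True))  # +1 (417) 576-3685
print(format_na_phone("4175763685"))                                  # (417) 576-3685
```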