find_duplicates¶
Find duplicate rows in a tabular file. You name the columns whose combined value forms the duplicate key; every row that shares its key with at least one other row is returned.
When to use¶
- Data-quality checks: "are there duplicate
emailvalues?" - Validating a supposed primary key.
- De-duplication planning before a
convertor downstream load.
Input schema¶
| Parameter | Type | Required? | Default | Description |
|---|---|---|---|---|
path |
string | yes | (no default) | Path to the file |
key_columns |
string[] | yes | (no default) | Column names whose combined value is the duplicate key |
table |
string | no | (no default) | Specific table for multi-table sources |
limit |
integer | no | server default | Max duplicate rows to return in the response. 0 = unlimited. |
unlimited |
bool | no | false |
Lift the 5,000,000-row file-loader cap so duplicate detection scans every row in the file |
Keys are compared on the cells' string representation, so int(1) and
float(1.0) are not treated as equal.
Response shape¶
{
"key_columns": ["email"],
"duplicate_row_count": 4,
"result": {
"schema": [ { "name": "id", "type": "Int64" }, … ],
"rows": [ [ … ], … ],
"row_count": 4,
"truncated": false,
"total_rows_available": 4,
"cell_truncated": false
}
}
duplicate_row_count is the total number of duplicate rows found;
result carries the rows themselves (subject to the row/cell caps;
see Limits & truncation).
Example call¶
{
"name": "find_duplicates",
"arguments": {
"path": "/tmp/contacts.csv",
"key_columns": ["first_name", "last_name"]
}
}
See also¶
value_frequency: counts of every value, not just the duplicated ones.run_sql:GROUP BY … HAVING count(*) > 1for custom duplicate logic.