value_frequency
Count how often each value appears in one column of a tabular file.
This is a value_counts() equivalent. Results are ordered most-frequent first.
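As a rough mental model, the counting can be sketched in plain Python with `collections.Counter`. This is an illustrative sketch only, not the tool's implementation: the helper name `value_frequency_sketch` and the treatment of `None` as a null are assumptions made for the example.

```python
from collections import Counter

def value_frequency_sketch(values, top_n=None):
    """Count values most-frequent first (hypothetical helper, not the tool's code)."""
    non_null = [v for v in values if v is not None]  # assumption: None marks a null
    counts = Counter(non_null)
    # most_common(None) returns every value; most_common(n) keeps the top n.
    rows = [{"label": str(label), "count": n} for label, n in counts.most_common(top_n)]
    return {
        "nulls": len(values) - len(non_null),
        "total_non_null": len(non_null),
        "unique_count": len(counts),  # counted before top_n shortens rows
        "rows": rows,
    }

result = value_frequency_sketch(["US", "DE", "US", None, "UK", "US", "DE"], top_n=2)
print(result["rows"])
```

Here `rows` holds only the two most frequent values, while `unique_count` still reports all three distinct values.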
When to use
- Inspecting a categorical column's distribution.
- Finding the dominant (or rare) values before filtering.
- Histogramming a numeric column via Sturges binning.
Input schema
| Parameter | Type | Required? | Default | Description |
|---|---|---|---|---|
| `path` | string | yes | (no default) | Path to the file |
| `column` | string | yes | (no default) | Name of the column to count |
| `table` | string | no | (no default) | Specific table for multi-table sources |
| `top_n` | integer | no | (all) | Return only the N most frequent values / bins |
| `bin` | boolean | no | `false` | Group a numeric column into Sturges bins instead of raw values |
| `unlimited` | boolean | no | `false` | Lift the 5,000,000-row file-loader cap so the counts include every row in the file |
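The `bin` parameter refers to Sturges binning. For readers unfamiliar with it, the sketch below applies the textbook form of Sturges' rule, k = ceil(log2(n)) + 1 equal-width bins, and formats each bin as a half-open range. The helper name `sturges_bins`, the edge placement, and the clamping of the maximum into the last bin are assumptions; the tool's exact edges and rounding may differ.

```python
import math

def sturges_bins(values):
    """Return (edges, counts) for equal-width bins sized by Sturges' rule."""
    n = len(values)
    if n == 0:
        return [], []
    k = math.ceil(math.log2(n)) + 1  # Sturges' rule: number of bins
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 when all values are equal
    edges = [lo + i * width for i in range(k + 1)]
    counts = [0] * k
    for v in values:
        # Half-open bins [edge_i, edge_i+1); the maximum is clamped into the last bin.
        i = min(int((v - lo) / width), k - 1)
        counts[i] += 1
    return edges, counts

ages = [18, 21, 22, 25, 29, 33, 41, 58]
edges, counts = sturges_bins(ages)
labels = [f"[{a:.2f}, {b:.2f})" for a, b in zip(edges, edges[1:])]
for label, count in zip(labels, counts):
    print(label, count)
```

With eight values the rule gives four bins, so the sample ages fall into ranges of width 10 starting at 18.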
Response shape
```json
{
  "column_name": "country",
  "binned": false,
  "nulls": 12,
  "total_non_null": 9988,
  "unique_count": 47,
  "rows": [
    { "label": "US", "count": 4831 },
    { "label": "DE", "count": 1190 },
    { "label": "UK", "count": 1042 }
  ]
}
```
When `bin: true` is set on a numeric column, each `label` is a half-open range
such as `[0.00, 5.00)` and `binned` is `true`. `unique_count` counts
distinct values (or bins) across the whole column, even when `top_n`
shortens `rows`.
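Because `unique_count` and `total_non_null` describe the whole column, a caller can tell how much of the distribution a shortened `rows` list leaves out. The sketch below is a minimal illustration that treats the sample response as if it had been returned with `top_n: 3` (an assumption); the helper name `summarize` and its output keys are hypothetical.

```python
import json

response = json.loads("""
{
  "column_name": "country",
  "binned": false,
  "nulls": 12,
  "total_non_null": 9988,
  "unique_count": 47,
  "rows": [
    { "label": "US", "count": 4831 },
    { "label": "DE", "count": 1190 },
    { "label": "UK", "count": 1042 }
  ]
}
""")

def summarize(resp):
    """Compute per-label shares and how much of the column `rows` leaves out."""
    total = resp["total_non_null"]
    shares = {r["label"]: r["count"] / total for r in resp["rows"]}
    listed = sum(r["count"] for r in resp["rows"])
    return {
        "shares": shares,
        # Fewer rows than distinct values means top_n cut the list short.
        "truncated": len(resp["rows"]) < resp["unique_count"],
        "unlisted_count": total - listed,
    }

summary = summarize(response)
print(summary["truncated"], summary["unlisted_count"])
```

Shares are computed against `total_non_null`, so nulls are excluded from the denominator.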
Example calls
```json
{
  "name": "value_frequency",
  "arguments": { "path": "/tmp/users.parquet", "column": "country", "top_n": 10 }
}
```

```json
{
  "name": "value_frequency",
  "arguments": { "path": "/tmp/users.parquet", "column": "age", "bin": true }
}
```
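When calls like these are generated from a script, a small helper can assemble the payload from the parameters in the input schema. The sketch below is one possible approach, not part of the tool: `build_call` is a hypothetical name, the check that `top_n` is positive is an assumption, and how the payload is sent to the tool is outside the scope of this example.

```python
import json

def build_call(path, column, *, table=None, top_n=None, bin=False, unlimited=False):
    """Build a value_frequency call payload (hypothetical helper)."""
    # Assumption: a top_n below 1 is not meaningful, so reject it early.
    if top_n is not None and top_n < 1:
        raise ValueError("top_n must be a positive integer")
    arguments = {"path": path, "column": column}
    # Include optional parameters only when they differ from the documented defaults.
    if table is not None:
        arguments["table"] = table
    if top_n is not None:
        arguments["top_n"] = top_n
    if bin:
        arguments["bin"] = True
    if unlimited:
        arguments["unlimited"] = True
    return {"name": "value_frequency", "arguments": arguments}

call = build_call("/tmp/users.parquet", "age", bin=True)
print(json.dumps(call, indent=2))
```

Omitting parameters left at their defaults keeps the generated payload identical in shape to the hand-written calls.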
See also
- `profile`: stats for every column at once.
- `find_duplicates`: rows sharing key values.
- Value Frequency: the same compute in the GUI.