Tips & Recipes¶

Worked workflows that chain Octa's features together. Most of these are 2-3 steps; pick the one that matches your task and follow along.

Convert a messy CSV to a clean Parquet¶

When you've got a CSV with mixed types, the date inference and type promotion happen automatically. Round-tripping through Parquet makes them stick.

Via the GUI¶

Open the CSV. Octa auto-detects the delimiter (see CSV Quote / Escape if detection is wrong).
Inspect the columns, right-clicking a header → Change Type if a column was inferred wrong. See Date Inference for how date columns get promoted automatically.
(Optionally) Edit a few cells via the Table view, mark some rows with Colour Marking.
File → Save As… → name it clean.parquet.

Octa picks the Parquet writer from the extension, applies your type changes, and writes a properly-typed Parquet file (see Saving).

Via the CLI¶

octa --convert messy.csv clean.parquet

Type inference still applies: CSV column types are inferred at read time and become the Parquet schema. See octa --convert for the full reference.

For larger jobs:

# Run a SQL pre-process first, then convert the result
octa --sql messy.csv -q 'SELECT id, name, CAST(amount AS DOUBLE) AS amount FROM data' -f csv > clean.csv
octa --convert clean.csv clean.parquet
rm clean.csv

Via MCP¶

With Octa registered as an MCP server (octa --mcp), ask your client (Claude Desktop, Claude Code, MCP Inspector, ...) to convert the file and it calls the convert tool:

convert(input="messy.csv", output="clean.parquet")

Type inference applies exactly as it does on the CLI: the CSV column types are inferred at read time and become the Parquet schema. For a SQL pre-process in the same step, use run_sql with a write_to target instead of a separate convert. See MCP server and the MCP setup guide for registering the server and the full tool list.

Open a huge file without blowing memory¶

Octa's streaming readers (Parquet, CSV, TSV) load only the first 1M rows by default. For files with more rows than that:

Settings → Performance → Initial-load row cap: raise it if you regularly work with bigger files (and have RAM to spare).
For one-off analysis without changing the setting, point DuckDB at the file directly outside of Octa, since it streams Parquet / CSV / JSON natively without any cap. The Octa SQL panel and octa --sql both honour the initial-load cap, so neither bypasses it.

Find duplicate rows in a CSV¶

-- Open the file in Octa, switch to the SQL view
-- (Ctrl+J / ToggleSqlPanel), and run:
SELECT *, COUNT(*) AS n
FROM data
GROUP BY ALL
HAVING n > 1
ORDER BY n DESC;

GROUP BY ALL groups by every column; rows with n > 1 are duplicates. For "duplicates by some columns only":

SELECT customer_id, order_date, COUNT(*) AS n
FROM data
GROUP BY customer_id, order_date
HAVING n > 1;

SQL exploratory analysis¶

When you don't know what's in a file yet:

-- 1. What columns + types
DESCRIBE data;

-- 2. Top-N for each categorical column
SELECT region, COUNT(*) AS n FROM data GROUP BY region ORDER BY n DESC LIMIT 10;

-- 3. Distribution of a numeric column
SELECT
  MIN(amount) AS min,
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS p25,
  MEDIAN(amount) AS median,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS p75,
  MAX(amount) AS max,
  AVG(amount) AS mean
FROM data;

-- 4. Find suspicious values
SELECT * FROM data WHERE amount > 1e7 OR amount < 0;

Compare two CSVs¶

For a quick line-by-line text diff, open one CSV → View → Compare with… → pick the other. The Compare view defaults to Text Diff when both sides are text-shaped.

For a row-level comparison that survives column-order differences:

Open CSV A → View → Compare with… → pick CSV B.
Switch the Compare view's sub-mode to Row Hash Diff.
Check the column-picker boxes to include the columns you want to match on (e.g. just id, or customer_id + order_date).
Three buckets appear: Left-only, Right-only, Shared.

Cross-format works the same way: compare CSV vs Parquet, or SQLite vs JSON. The hash sees only CellValue::to_string output.

Compare two database tables¶

Open both tables in two tabs (File → Open Directory… and the folder sidebar make this fast). Then:

Activate tab A, then View → Compare with… and pick the file that holds tab B's table.
The table picker fires; choose tab B's table.
Switch to Row Hash Diff and pick the primary-key column(s).

Same buckets, same semantics.

Extract text from an EPUB¶

EPUBs open in the Reader view by default, but the Table view exposes every paragraph as a row:

chapter	paragraph	text
1	1	Cover
2	1	The Lion, the Witch and the Wardrobe
2	2	# The Lion, the Witch and the Wardrobe
...	...	...

From there:

Search the toolbar for any word to find every paragraph containing it.

SQL (Ctrl+J) for more powerful queries:

-- Count word frequency by chapter
SELECT chapter, COUNT(*) AS paragraphs,
       SUM(LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) + 1) AS approx_words
FROM data
GROUP BY chapter
ORDER BY chapter;

-- Find the longest paragraph
SELECT chapter, paragraph, LENGTH(text) AS chars, text
FROM data
ORDER BY chars DESC
LIMIT 5;

-- Every occurrence of a name
SELECT chapter, paragraph, text
FROM data
WHERE text LIKE '%Aslan%'
ORDER BY chapter, paragraph;

Pipe Octa into other tools¶

The CLI's -f json output is jq-friendly:

# Extract email addresses
octa --sql users.parquet -q 'SELECT email FROM data WHERE active' -f json \
  | jq -r '.[].email'

# Count rows per group
octa --head sales.csv -n 1000 -f json \
  | jq 'group_by(.region) | map({region: .[0].region, n: length})'

# Sample 10 random rows
octa --sql data.csv -q 'SELECT * FROM data USING SAMPLE 10 ROWS' -f json \
  | jq '.[0]'

Octa writes data to stdout, status messages to stderr, so pipes stay clean even on errors.

Quickly inspect an unknown file¶

When someone hands you a file with no documentation:

# 1. Schema (cheap)
octa --schema mystery.dat

# 2. First few rows
octa --head mystery.dat -n 5

# 3. Row count
octa --sql mystery.dat -q 'SELECT count(*) FROM data'

# 4. Open in GUI for visual exploration
octa mystery.dat

See --schema, --head, and --sql for the full reference on each step.

If the extension isn't recognised at all, Octa falls back to the plain-text reader so step 1 still works; you'll get one column called Line with one row per file line.

Convert SQLite to multiple Parquet files¶

Octa's --convert writes the first table from a multi-table source. To export every table:

# (Pseudo-script; adapt for your shell)
for table in users orders products; do
  octa --sql app.sqlite -q "SELECT * FROM ${table}" -f json > /tmp/dump.json
  octa --convert /tmp/dump.json "${table}.json"
done
rm /tmp/dump.json

Or use Octa's GUI: open the SQLite, the table picker shows every table, load one, then File → Save As… as Parquet, and repeat.

Or use DuckDB directly (which is what Octa uses under the hood):

duckdb -c "ATTACH 'app.sqlite' AS db (TYPE SQLITE);
           COPY db.users TO 'users.parquet';
           COPY db.orders TO 'orders.parquet';
           COPY db.products TO 'products.parquet';"

Read-only mode for safety¶

Inspecting a production data file you can't risk modifying? Press F8 to enter read-only mode. Every editing path is disabled (cell edits, structural changes, marks, undo/redo), but you can still:

Search and filter.
Sort columns.
Run SQL queries (mutations don't persist anyway).
Save As to a different file.

The status bar shows [Read-only] while active. Toggle off with F8 again.