
Supported Formats

Octa reads ~25 file formats out of the box. Most are also writable. Unknown extensions fall back to the plain-text reader so you can always open something.

At-a-glance matrix

Format Extensions Read Write
Parquet .parquet
CSV / TSV .csv, .tsv
JSON .json
JSON Lines .jsonl, .ndjson
Excel .xlsx, .xls, .xlsm, .xlsb, .xlm ✅ *
ODS .ods
Arrow IPC / Feather .arrow, .feather
Avro .avro
ORC .orc
HDF5 .h5, .hdf5, .hdf
NetCDF v3 .nc
SQLite .sqlite, .sqlite3, .db ✅ **
DuckDB .duckdb, .ddb ✅ **
GeoPackage .gpkg ✅ **
SAS .sas7bdat
SPSS .sav, .zsav
Stata .dta
R Datasets .rds, .rdata, .rda
DBF / dBase .dbf
XML .xml
TOML .toml
YAML .yaml, .yml
Jupyter notebook .ipynb
Markdown .md, .markdown, .mdown, .mkd
EPUB .epub
GeoJSON .geojson
Archive (zip / tar / tgz) .zip, .tar, .tgz
Fixed-width (FWF) .fwf, .prn
Source code / config .py, .rs, .go, .ts, .js, ... (see below)
Plain text anything else

* Excel writes always produce an .xlsx file, because the writer uses rust_xlsxwriter, which doesn't emit legacy .xls / .xlsm / .xlsb. Save those workbooks as .xlsx to round-trip them through Octa.

** Database writes are diff-based and reject schema changes. See Saving for details.

Caveats and limitations by format

Streaming readers (large files OK)

Parquet, CSV, and TSV all stream. Octa loads the first AppSettings.initial_load_rows (default 5,000,000) rows and continues loading the rest in the background as you scroll. You can change the cap (or tick the Unlimited checkbox to load every row up front) under Settings → Performance. From the CLI, override per-invocation with --rows N|all. From MCP, pass unlimited: true to a tool to lift the cap for that single call. Multi-million-row files open without delay; the bottom of the table fills in as you reach it.

Parquet files written with very many small row groups (more than 32,767, which is common with Spark or streaming ingest pipelines) used to fail the native arrow-parquet reader with Row group ordinal 32768 exceeds i16 max value. Octa now retries those reads through a DuckDB-backed reader automatically, with the same schema and types and no user action required.

Files produced by pandas (DataFrame.to_parquet) embed the row index as an extra column on disk (__index_level_0__ by default, or whatever you passed to set_index). Octa strips those columns on read so the table view shows only the real data columns. Both the Arrow schema metadata's index_columns entries and the default __index_level_0__ name are honoured, including on files written by older pandas releases that didn't emit the metadata block.
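The index-stripping rule can be sketched as follows. This is a minimal illustration, not Octa's implementation: the function name data_columns and the shape of its arguments are assumptions, and it relies only on the documented pandas convention of storing a JSON document under the "pandas" metadata key.

```python
import json

# Default name pandas gives an unnamed index column on disk.
DEFAULT_INDEX_NAME = "__index_level_0__"


def data_columns(column_names, schema_metadata):
    """Return the column names left after dropping pandas index columns.

    `schema_metadata` is a plain dict of the file's key-value metadata
    (hypothetical shape; a real reader would decode it from the schema).
    """
    index_names = {DEFAULT_INDEX_NAME}
    raw = schema_metadata.get("pandas")
    if raw is not None:
        for entry in json.loads(raw).get("index_columns", []):
            # String entries name a column stored on disk. Dict entries
            # describe a RangeIndex, which is not stored as a column.
            if isinstance(entry, str):
                index_names.add(entry)
    return [name for name in column_names if name not in index_names]
```

Seeding the set with the default name is what covers files from older pandas releases that carry no metadata block.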

R datasets

Octa only handles the single data.frame / tibble case for .rds. Workspace files (.rdata / .rda produced by save()) are registered by extension but currently return an error pointing you at saveRDS(), since rds2rust only accepts the X\n magic of single-object RDS, not the RDX2\n workspace envelope.
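To check which kind of file you have before opening it, the two magic prefixes can be tested directly. The sketch below is an assumption-laden illustration (the function name classify_r_file is hypothetical, and it is not how rds2rust works internally). It handles only gzip, the default compression for saveRDS() and save(); bzip2- or xz-compressed files would come back as "unknown".

```python
import gzip


def classify_r_file(data: bytes) -> str:
    """Classify raw file bytes as 'rds', 'workspace', or 'unknown'."""
    # gzip streams start with the bytes 0x1f 0x8b.
    if data[:2] == b"\x1f\x8b":
        # Decompressing everything is wasteful but keeps the sketch short.
        data = gzip.decompress(data)
    if data.startswith(b"RDX2\n"):
        return "workspace"  # written by save(); not readable here
    if data.startswith(b"X\n"):
        return "rds"  # single object written by saveRDS()
    return "unknown"
```

A file classified as "workspace" needs to be re-exported from R with saveRDS() before Octa can read it.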

HDF5

Octa uses a pure-Rust HDF5 parser (no system libhdf5 dependency). Compound datasets (the layout pandas/PyTables write for DataFrames) are decoded field-by-field.

HDF5 1.10+ vs older files

The upstream hdf5-reader 0.2 library misreads compound v1 layouts when members don't start on 8-byte boundaries. HDF5 1.10+ files with compound v3 (the default for h5py libver="latest" and modern pandas) parse correctly. Older pandas / pytables files may surface garbled columns.

NetCDF

Octa supports NetCDF v3 only. NetCDF v4 files are HDF5 under the hood, so open them with the HDF5 reader by renaming the extension (for example, from .nc to .h5).

The reader groups all 1D variables sharing the largest dimension into one table (each variable becomes a column). Multi-dimensional or scalar variables are skipped, with a count surfaced in the file's format label (e.g. "NetCDF (3 multi-D vars skipped)").
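The grouping rule can be sketched like this. It is an illustration under stated assumptions, not Octa's code: plan_netcdf_table and its dict inputs are hypothetical, ties between equally long dimensions go to the first one, and 1D variables on a different dimension are returned separately because the text above does not say how they are treated.

```python
def plan_netcdf_table(dimensions, variables):
    """Decide which variables become columns of the single table.

    dimensions: dict mapping dimension name -> length
    variables:  dict mapping variable name -> tuple of dimension names
    Returns (columns, other_1d, skipped_count).
    """
    largest = max(dimensions, key=dimensions.get)
    columns, other_1d, skipped = [], [], 0
    for name, dims in variables.items():
        if len(dims) != 1:
            skipped += 1  # scalar (0 dims) or multi-dimensional
        elif dims[0] == largest:
            columns.append(name)
        else:
            other_1d.append(name)
    return columns, other_1d, skipped
```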

EPUB

Read-only. Octa converts each chapter's XHTML to Markdown at load time and renders chapter-by-chapter in the EPUB Reader view. The flat Table view is still available with one row per paragraph (chapter, paragraph, text columns), useful for searching the book's text with the filter bar or SQL.

GeoJSON

Read-only. GeoJSON files open by default in the Map view with an OpenStreetMap (OSM) tile background. The Table view is also available with one row per Feature; the geometry is serialised as WKT in a __geometry column, and every property becomes its own column.
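The Feature-to-row flattening can be sketched as below. This is a minimal illustration rather than Octa's reader: the function names are hypothetical, only Point, LineString, and Polygon are covered, and any coordinates beyond x and y are dropped.

```python
def _pairs(coords):
    # Render a list of positions as "x y, x y, ..." (extra ordinates ignored).
    return ", ".join(f"{x} {y}" for x, y, *_ in coords)


def geometry_to_wkt(geometry):
    """Serialise a GeoJSON geometry dict as WKT text."""
    if geometry is None:  # a Feature may legally carry a null geometry
        return None
    kind, coords = geometry["type"], geometry["coordinates"]
    if kind == "Point":
        return f"POINT ({coords[0]} {coords[1]})"
    if kind == "LineString":
        return f"LINESTRING ({_pairs(coords)})"
    if kind == "Polygon":
        rings = ", ".join(f"({_pairs(ring)})" for ring in coords)
        return f"POLYGON ({rings})"
    raise ValueError(f"geometry type not covered by this sketch: {kind}")


def features_to_rows(collection):
    """One dict per Feature: a __geometry column plus one key per property."""
    rows = []
    for feature in collection["features"]:
        row = {"__geometry": geometry_to_wkt(feature.get("geometry"))}
        row.update(feature.get("properties") or {})
        rows.append(row)
    return rows
```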

Archives (zip / tar / tgz)

Read-only. The archive opens as a table listing one row per entry (path, size_bytes, compressed_bytes, mtime, is_dir, type). An action bar above the table extracts the selected entry into a tempfile and opens it as a fresh tab, so any reader Octa supports works on archive contents. See the Archive Viewer page for the full walkthrough.
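For a sense of what the listing contains, here is a sketch that builds comparable rows for a zip file with Python's standard zipfile module. It is an illustration only: list_zip_entries is a hypothetical name, tar and tgz are not handled, and the type column is left out because its exact meaning is not described above.

```python
import io
import zipfile
from datetime import datetime


def list_zip_entries(source):
    """Return one dict per archive entry; `source` is a path or file object."""
    rows = []
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            rows.append(
                {
                    "path": info.filename,
                    "size_bytes": info.file_size,
                    "compressed_bytes": info.compress_size,
                    # date_time is (year, month, day, hour, minute, second)
                    "mtime": datetime(*info.date_time),
                    "is_dir": info.is_dir(),
                }
            )
    return rows
```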

Fixed-width (FWF)

Read-only, best-effort. Fixed-width files have no delimiter: each field sits in a fixed range of character columns, padded with spaces. Octa infers the column boundaries by sampling the leading lines and finding the character positions that are blank in every line (the gaps between fields), and treats the first line as the header (blank header cells become col_1, col_2, ...). All columns are read as text. Detection works best on cleanly aligned exports (typical mainframe / spreadsheet .prn output); a column whose values run together with its neighbour cannot be split. The reader claims .fwf and .prn only (not .txt, which stays plain text).
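The boundary inference described above can be sketched in a few lines. This is a simplified illustration, not Octa's detector: the function names are hypothetical, it treats every line it is given as the sample, and it only recognises the space character as padding.

```python
def infer_fwf_columns(lines):
    """Return (start, end) character spans for each inferred field."""
    sample = [line.rstrip("\n") for line in lines if line.strip()]
    width = max(len(line) for line in sample)
    padded = [line.ljust(width) for line in sample]
    # A position is a gap only if it is blank in every sampled line.
    is_gap = [all(line[i] == " " for line in padded) for i in range(width)]
    spans, start = [], None
    for i, gap in enumerate(is_gap):
        if not gap and start is None:
            start = i  # a field begins
        elif gap and start is not None:
            spans.append((start, i))  # a field ends at the gap
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def read_fwf(lines):
    """Split lines on the inferred spans; the first line is the header."""
    spans = infer_fwf_columns(lines)
    rows = [
        [line[a:b].strip() for a, b in spans] for line in lines if line.strip()
    ]
    header = [name or f"col_{i}" for i, name in enumerate(rows[0], start=1)]
    return header, rows[1:]
```

The sketch also shows why values that run together cannot be split: if no position between two fields is blank in every line, no gap is found and the two fields come out as one span.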

Source code and config files

Octa opens common source-code and configuration files as plain text (one row per line) and syntax-highlights them in the Raw view. Because they are registered formats, they appear in the open dialog's All Supported filter rather than only opening via the catch-all fallback. Recognised extensions include:

  • Python .py, .pyw, .pyi
  • Rust .rs
  • Shell .sh, .bash, .zsh, .fish
  • C / C++ .c, .cpp, .cc, .cxx, .h, .hpp, .hxx
  • Go .go
  • JS / TS / Web .js, .jsx, .mjs, .cjs, .ts, .tsx, .html, .htm, .css, .scss, .sass
  • JVM .java, .kt, .kts, .scala, .groovy
  • Scripting .rb, .php, .pl, .lua, .swift
  • Data science .r, .jl
  • Terraform / HCL .tf, .tfvars, .hcl
  • Misc .tex, .dart, .ex, .exs, and the plain-text / config set (.txt, .log, .ini, .cfg, .conf, .env, ...)

Any other unknown extension still opens through the plain-text reader, so you can always open something.

Wrong or missing file extensions

Octa does not rely on the extension alone. When a file's extension is missing, wrong, or unrecognised, it looks at the content to pick a reader:

  • Magic bytes identify binary formats regardless of name: a Parquet file called export.bin, a SQLite database with no extension, a ZIP-based archive, and so on.
  • Structure probes recognise text formats: a .txt that is actually JSON, or a delimited file whose extension doesn't match.

This works in two places. When opening a file, Octa consults the content sniffer before falling back to plain text. And if the reader chosen from the extension errors (for example a .csv that is really Parquet), Octa retries with the sniffed reader instead of just showing a parse error. The upshot: renamed and mislabelled files usually just open as the right thing.
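The two-stage behaviour (sniff on open, retry on reader failure) can be sketched as follows. This is an illustration under stated assumptions, not Octa's sniffer: the function names, the reader registry, and the very crude text probes are hypothetical. The three magic prefixes themselves are the well-known signatures of Parquet, SQLite, and ZIP files.

```python
import json

# Well-known leading bytes of three binary formats.
MAGIC = [
    (b"PAR1", "parquet"),
    (b"SQLite format 3\x00", "sqlite"),
    (b"PK\x03\x04", "zip"),
]


def sniff(data: bytes):
    """Guess a reader name from content, or None if nothing matches."""
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    if text.lstrip()[:1] in ("{", "["):
        try:
            json.loads(text)
            return "json"
        except json.JSONDecodeError:
            pass
    first_line = text.splitlines()[0] if text else ""
    if "\t" in first_line:
        return "tsv"
    if "," in first_line:
        return "csv"
    return None


def open_with_retry(data, extension_reader, readers):
    """Try the extension's reader first, then the sniffed one on failure.

    readers: dict mapping reader name -> callable taking the raw bytes.
    """
    name = extension_reader or sniff(data) or "text"
    try:
        return name, readers[name](data)
    except Exception:
        sniffed = sniff(data)
        if sniffed and sniffed != name:
            return sniffed, readers[sniffed](data)
        raise
```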

Repairing malformed CSV / TSV files

CSV and TSV files in the wild are often slightly broken: the wrong text encoding, a stray byte-order mark (BOM) at the start, control characters, a delimiter that disagrees with the extension (a .csv that is really tab-separated), or ragged rows with uneven column counts. Octa can offer to clean these up on open.

This is off by default. Turn on Offer repair on malformed files in Settings → File-Specific. With it on, when a CSV/TSV reads but looks malformed, a prompt appears that lists what was detected and shows a preview of the repaired result. You choose:

  • Repair and open re-decodes the text, re-detects the delimiter, and strips stray markers.
  • Open without repair loads the file as-is.
  • Cancel backs out.

The repair only changes what Octa loads into memory; your file on disk is never modified. It applies to CSV/TSV only. See CSV quote / escape for the related quoting and delimiter rules.
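To make the repair steps concrete, here is a rough sketch of an in-memory clean-up pass. It is not Octa's repair logic: repair_csv_text is a hypothetical name, the Latin-1 fallback and the three candidate delimiters are assumptions, and ragged rows are only reported, not fixed.

```python
import codecs
import csv
import io


def repair_csv_text(raw: bytes):
    """Return (delimiter, rows, findings) without touching the file on disk."""
    findings = []
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
        findings.append("byte-order mark")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 accepts every byte sequence.
        text = raw.decode("latin-1")
        findings.append("non-UTF-8 encoding")
    # Drop control characters below U+0020, keeping tab, CR, and LF.
    cleaned = "".join(ch for ch in text if ch in "\t\r\n" or ord(ch) >= 32)
    if cleaned != text:
        findings.append("control characters")
    # Re-detect the delimiter from the first line instead of the extension.
    first_line = cleaned.split("\n", 1)[0]
    delimiter = max(",\t;", key=first_line.count)
    rows = list(csv.reader(io.StringIO(cleaned), delimiter=delimiter))
    if len({len(row) for row in rows}) > 1:
        findings.append("ragged rows")
    return delimiter, rows, findings
```

The findings list plays the role of the "what was detected" summary in the prompt, and the returned rows correspond to the preview.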

Multi-table files

SQLite, DuckDB, and GeoPackage can hold multiple tables. When you open such a file, Octa shows a table picker dialog listing the available tables with row counts and schemas, so you can pick one to load. Single-table databases auto-load without the picker. From the MCP or CLI side, list_tables gives you the same enumeration, and every result-bearing MCP tool accepts a table argument to pick one.

Excel multi-sheet workbooks

Excel workbooks behave differently from databases: Octa treats each worksheet as a table and opens several at once, each in its own tab.

  • If the workbook has up to N sheets, all of them open automatically. N is the Excel sheets to auto-open setting (default 5), which can be changed in Settings → Performance.
  • If it has more than N, a sheet picker appears listing every sheet with the first N pre-checked. Tick the ones you want (Select all / Select none help) and click Open. You can pick any number of sheets, including all of them.
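The two cases reduce to a single threshold check, sketched here with a hypothetical helper (plan_sheet_open and its return shape are illustrative, not part of Octa):

```python
def plan_sheet_open(sheet_names, auto_open_limit=5):
    """Decide whether to open every sheet or show the picker."""
    if len(sheet_names) <= auto_open_limit:
        # Small workbook: every sheet opens in its own tab.
        return {"show_picker": False, "selected": list(sheet_names)}
    # Large workbook: show the picker with the first N sheets pre-checked.
    return {"show_picker": True, "selected": list(sheet_names[:auto_open_limit])}
```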

The first row of each sheet is used as the header row, the same as the single-sheet behaviour.

Format conversion

The CLI's octa --convert IN OUT routes through the same readers / writers as the GUI, so any read+write pair is a valid conversion target:

octa --convert data.csv data.parquet
octa --convert legacy.xlsx tidy.sqlite
octa --convert measurements.dta measurements.json

Read-only formats (SAS, RDS, HDF5, NetCDF, EPUB, GeoJSON, archives) are rejected up-front as conversion targets, so Octa surfaces a clear error rather than silently writing a malformed file.

See also

  • octa --convert is the CLI for round-tripping between any two writable formats.
  • View modes overview covers which view Octa picks for each format.
  • Saving files covers read-only formats and diff-based DB writes.
  • Date inference explains how string columns in text formats get promoted to typed dates on load.