Parquet Format

Use Parquet when you want the best default format for analytics, typed loading, and large datasets.

Choose Parquet when the destination is analysis, not generic interchange.

For most research workflows it is the right choice: the same raw event stream as CSV, stored in a typed, columnar layout that is easier to process at scale.

At A Glance

| Property | Value |
| --- | --- |
| extension | `.parquet` |
| wrapper | no extra `.gz` wrapper |
| layout | columnar |
| raw data or summary | raw event stream |
| timestamps inside rows | UTC, nanosecond precision |
| file day convention | `YYYYMMDD`, grouped by America/New_York market day |
| best fit | Python, Polars, DuckDB, pandas, large reads, typed analytics |

What You Get

Each file contains the same daily raw dataset you can retrieve as CSV, but with a stricter typed schema.

SFTP path example:

```bash
data/parquet/ES/06-25/20250601.parquet
```

Column Schema

| Column | Type | Nullable | Meaning |
| --- | --- | --- | --- |
| level | string | no | `L1` or `L2` |
| mdt | int8 | no | market data type code |
| timestamp | timestamp(ns, UTC) | no | event timestamp |
| operation | int8 | yes | add, update, remove for L2 |
| depth | int8 | yes | price level index for L2 |
| market_maker | string | yes | reserved, currently null |
| price | decimal128(18, 8) | no | exact price value |
| volume | int32 | no | trade size or quote size |

Why Users Choose Parquet

  • typed columns reduce parsing work
  • prices stay exact as decimal128(18,8)
  • timestamps keep nanosecond precision without string parsing
  • analytics engines can read only the columns they need
  • large pulls are usually smaller and faster to work with than equivalent text files

Interpretation Rules

The logical meaning is the same as CSV:

  • L1 rows contain best bid, best ask, trades, and session statistics
  • L2 rows contain order book depth updates
  • mdt uses the same 0 to 9 codes
  • operation and depth are null for L1
  • file name day is America/New_York, while event timestamps are UTC

Practical Advantage Over CSV

If you already know the destination is a research stack, Parquet removes several headaches:

  • no delimiter handling
  • no string-to-number conversion pass
  • no string timestamp parsing pass
  • no floating-point drift from reading price as free-form text

That matters more as the number of files grows.

Minimal Example

```python
import polars as pl

df = pl.read_parquet("20250601.parquet")
trades = df.filter((pl.col("level") == "L1") & (pl.col("mdt") == 2))
```

If you need maximum compatibility with legacy tools, go back to CSV. If you are building an analysis workflow, continue with Python Analysis.

Related Pages