Parquet Format
Use Parquet when you want typed loading, efficient analytics, and large datasets.
Choose Parquet when the destination is analysis, not generic interchange. For most research workflows it is the best default: it carries the same raw event stream as CSV, but stores it in a typed, columnar layout that is easier to process at scale.
At A Glance
| Property | Value |
|---|---|
| extension | .parquet |
| wrapper | no extra .gz wrapper |
| layout | columnar |
| raw data or summary | raw event stream |
| timestamps inside rows | UTC, nanosecond precision |
| file day convention | YYYYMMDD grouped by America/New_York market day |
| best fit | Python, Polars, DuckDB, pandas, large reads, typed analytics |
What You Get
Each file contains the same daily raw dataset you can retrieve as CSV, but with a stricter typed schema.
SFTP path example:
data/parquet/ES/06-25/20250601.parquet
Column Schema
| Column | Type | Nullable | Meaning |
|---|---|---|---|
| level | string | no | L1 or L2 |
| mdt | int8 | no | market data type code |
| timestamp | timestamp(ns, UTC) | no | event timestamp |
| operation | int8 | yes | add, update, remove for L2 |
| depth | int8 | yes | price level index for L2 |
| market_maker | string | yes | reserved, currently null |
| price | decimal128(18,8) | no | exact price value |
| volume | int32 | no | trade size or quote size |
Why Users Choose Parquet
- typed columns reduce parsing work
- prices stay exact as decimal128(18,8)
- timestamps keep nanosecond precision without string parsing
- analytics engines can read only the columns they need
- large pulls are usually smaller and faster to work with than equivalent text files
Interpretation Rules
The logical meaning is the same as CSV:
- L1 rows contain best bid, best ask, trades, and session statistics
- L2 rows contain order book depth updates
- mdt uses the same 0 to 9 codes
- operation and depth are null for L1
- file name day is America/New_York, while event timestamps are UTC
Practical Advantage Over CSV
If you already know the destination is a research stack, Parquet removes several headaches:
- no delimiter handling
- no string-to-number conversion pass
- no string timestamp parsing pass
- no floating-point drift from reading price as free-form text
That matters more as the number of files grows.
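The floating-point drift bullet is easy to demonstrate: once a price has passed through binary floating point, exact decimal arithmetic is gone, which is what the decimal128(18,8) column avoids.

```python
# Classic binary floating-point drift: neither 0.1 nor 0.2 is exactly
# representable, so the sum is not exactly 0.3.
px = float("0.1") + float("0.2")
print(px)  # 0.30000000000000004
```

A decimal128(18,8) column stores prices as scaled integers, so equality checks and tick arithmetic stay exact.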
Minimal Example
```python
import polars as pl

df = pl.read_parquet("20250601.parquet")

# Keep only L1 trade events (mdt code 2); L2 depth rows are excluded.
trades = df.filter((pl.col("level") == "L1") & (pl.col("mdt") == 2))
```
If you need maximum compatibility with legacy tools, go back to CSV. If you are building an analysis workflow, continue with Python Analysis.
