Parquet is a column-based data store / file format (useful for boosting Spark read/write and SQL performance)

                                                                PARQUET FILE FORMAT
Parquet is a column-based store.
Parquet started as a joint project of Cloudera and Twitter engineers.
It is built to support very efficient compression and encoding schemes. Parquet allows the compression scheme to be specified on a per-column level, and is future-proofed to allow more encodings to be added as they are invented and implemented. The format separates the concepts of encoding and compression, allowing Parquet consumers to implement operators that work directly on encoded data without paying a decompression and decoding penalty when possible.
Twitter has started converting some of its major data sources to Parquet in order to take advantage of the compression and deserialization savings.
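For example, per-column compression can be seen directly when writing a Parquet file with the pyarrow library. The sketch below is only an illustration; the table contents, column names, and file name are made up:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # A small in-memory table with two columns of different types.
    table = pa.table({
        "user_id": [1, 2, 3],
        "bio": ["short text", "more text", "even more text"],
    })

    # Parquet stores each column separately, so each column can carry its
    # own codec; pyarrow accepts a dict mapping column name -> codec.
    pq.write_table(
        table,
        "users.parquet",
        compression={"user_id": "snappy", "bio": "gzip"},
    )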

Fusemachines is an AI-based sales company. We used to use JSON as our data store; now we are migrating to the Parquet file format.
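A minimal sketch of what that migration looks like with Spark (the paths here are placeholders, not our real ones):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

    # Read the existing JSON data store (placeholder path).
    df = spark.read.json("/data/sales/events.json")

    # Write the same records back out in Parquet format.
    df.write.mode("overwrite").parquet("/data/sales/events.parquet")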

[Figure: Difference between column-based and row-based tables]

Time and space analysis on Small Datasets (Statistics on my computer)

Query: SELECT C1 FROM table

Id   Format    Read (s)   Write (s)   Size
1    JSON      10         8           1.8 GB
2    CSV       5          13          272 MB
3    Parquet   0.5        1.6         68 MB
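One way to reproduce this kind of measurement is a small PySpark script along the lines below; the paths and the column name C1 are placeholders:

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-benchmark").getOrCreate()

    def time_column_scan(read_fn, path):
        # Read the file, select one column, and force the scan with count().
        start = time.time()
        read_fn(path).select("C1").count()
        return time.time() - start

    print("json   :", time_column_scan(spark.read.json, "/data/table.json"))
    print("csv    :", time_column_scan(
        lambda p: spark.read.option("header", "true").csv(p), "/data/table.csv"))
    print("parquet:", time_column_scan(spark.read.parquet, "/data/table.parquet"))

Parquet wins on the read side mainly because only the C1 column is scanned, while JSON and CSV have to parse every row in full.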

Time and space analysis on Larger Datasets
Subscribe, I will write it later.

Details on the Parquet file format:

Apache Parquet provides the following benefits:[6]
  • Column-wise compression is efficient and saves storage space
  • Compression techniques specific to a type can be applied as the column values tend to be of the same type
  • Queries that fetch specific column values need not read the entire row data, which improves performance
  • Different encoding techniques can be applied to different columns (see the sketch after this list for a way to inspect them)
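A small sketch, assuming pyarrow is installed, that inspects the per-column compression and encodings recorded in an existing Parquet file (the file name is a placeholder):

    import pyarrow.parquet as pq

    # Open only the file's footer metadata, without reading the data pages.
    meta = pq.ParquetFile("users.parquet").metadata

    # Each row group records, per column, the codec and encodings used.
    rg = meta.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(col.path_in_schema, col.compression, col.encodings)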






