Parquet is a column-based data store / file format (useful for boosting Spark read/write and SQL performance)

                                                                PARQUET FILE FORMAT
Parquet is a column-based store.
Parquet started as a joint project of Cloudera and Twitter engineers.
It is built to support very efficient compression and encoding schemes. Parquet allows the compression scheme to be specified on a per-column level, and is future-proofed to allow more encodings to be added as they are invented and implemented. The format separates the concepts of encoding and compression, allowing Parquet consumers to implement operators that work directly on encoded data without paying a decompression and decoding penalty when possible.
Twitter has started converting some of its major data sources to Parquet in order to take advantage of the compression and deserialization savings.
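For example, per-column compression can be seen directly when writing a Parquet file with the pyarrow library. The sketch below is only an illustration; the table contents, column names, and file name are made up:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # A small in-memory table with two columns of different types.
    table = pa.table({
        "user_id": [1, 2, 3],
        "bio": ["short text", "more text", "even more text"],
    })

    # Parquet stores each column separately, so each column can carry its
    # own codec; pyarrow accepts a dict mapping column name -> codec.
    pq.write_table(
        table,
        "users.parquet",
        compression={"user_id": "snappy", "bio": "gzip"},
    )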

Fusemachines is an AI-based sales company. We used to use JSON as our data store; now we are migrating to the Parquet file format.
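A minimal sketch of what that migration looks like with Spark (the paths here are placeholders, not our real ones):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

    # Read the existing JSON data store (placeholder path).
    df = spark.read.json("/data/sales/events.json")

    # Write the same records back out in Parquet format.
    df.write.mode("overwrite").parquet("/data/sales/events.parquet")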

[Figure: Difference between column-based and row-based tables]

Time and space analysis on Small Datasets (Statistics on my computer)

Query: SELECT C1 FROM table

Id   Format    Read (s)   Write (s)   Size
1    JSON      10         8           1.8 GB
2    CSV       5          13          272 MB
3    Parquet   0.5        1.6         68 MB
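One way to reproduce this kind of measurement is a small PySpark script along the lines below; the paths and the column name C1 are placeholders:

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-benchmark").getOrCreate()

    def time_column_scan(read_fn, path):
        # Read the file, select one column, and force the scan with count().
        start = time.time()
        read_fn(path).select("C1").count()
        return time.time() - start

    print("json   :", time_column_scan(spark.read.json, "/data/table.json"))
    print("csv    :", time_column_scan(
        lambda p: spark.read.option("header", "true").csv(p), "/data/table.csv"))
    print("parquet:", time_column_scan(spark.read.parquet, "/data/table.parquet"))

Parquet wins on the read side mainly because only the C1 column is scanned, while JSON and CSV have to parse every row in full.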

Time and space analysis on Larger Datasets
Subscribe, I will write it later.

Details on the Parquet file format:

Apache Parquet provides the following benefits:[6]
  • Column-wise compression is efficient and saves storage space
  • Compression techniques specific to a type can be applied as the column values tend to be of the same type
  • Queries that fetch specific column values need not read the entire row data, which improves performance
  • Different encoding techniques can be applied to different columns (see the sketch after this list for a way to inspect them)
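A small sketch, assuming pyarrow is installed, that inspects the per-column compression and encodings recorded in an existing Parquet file (the file name is a placeholder):

    import pyarrow.parquet as pq

    # Open only the file's footer metadata, without reading the data pages.
    meta = pq.ParquetFile("users.parquet").metadata

    # Each row group records, per column, the codec and encodings used.
    rg = meta.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(col.path_in_schema, col.compression, col.encodings)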






