
DATA FORMAT

A data/file format defines how information is stored in HDFS.

Hadoop does not impose a single default file format; the right choice depends on how the data will be used.

The main performance bottlenecks for applications that use HDFS are the time spent searching for (reading) information and the time spent writing it.

Managing the processing and storage of large volumes of information is complex, which is why choosing a suitable data format matters.

The choice of an appropriate file format can produce the following benefits:

  • Optimum writing time
  • Optimum reading time
  • File divisibility
  • Schema evolution and compression support

Some of the most commonly used formats in the Hadoop ecosystem are:

● Text/CSV: A plain text file or CSV is the most common format both outside and within the Hadoop ecosystem.
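
Because Text/CSV is plain text, it can be read with ordinary tooling; a minimal sketch using Python's standard `csv` module (the sample data here is invented for illustration):

```python
import csv
import io

# A tiny CSV sample; in HDFS this would simply be a plain text file.
raw = "id,name,score\n1,alice,90\n2,bob,85\n"

# DictReader parses each line into a dict keyed by the header row.
# Note that every value comes back as a string: CSV carries no types.
rows = list(csv.DictReader(io.StringIO(raw)))

print(rows[0]["name"])  # first record's name column
print(len(rows))        # number of data rows
```

The absence of types and of an embedded schema is exactly why plain text is the most portable format but rarely the most efficient one.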

● SequenceFile: The SequenceFile format stores data as binary key-value pairs; it supports compression but does not store metadata.
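
To illustrate the key-value idea, here is a conceptual sketch only, not the real SequenceFile layout (which adds a header, sync markers, and optional record/block compression): length-prefixed binary key-value records, written and read back with Python's `struct` module.

```python
import struct

def write_records(records):
    """Serialize (key, value) string pairs as length-prefixed binary."""
    out = bytearray()
    for key, value in records:
        k, v = key.encode("utf-8"), value.encode("utf-8")
        # Two big-endian 4-byte lengths, then the raw key and value bytes.
        out += struct.pack(">II", len(k), len(v)) + k + v
    return bytes(out)

def read_records(data):
    """Parse the binary blob back into (key, value) string pairs."""
    records, offset = [], 0
    while offset < len(data):
        klen, vlen = struct.unpack_from(">II", data, offset)
        offset += 8
        key = data[offset:offset + klen].decode("utf-8")
        offset += klen
        value = data[offset:offset + vlen].decode("utf-8")
        offset += vlen
        records.append((key, value))
    return records

blob = write_records([("k1", "v1"), ("k2", "hello")])
print(read_records(blob))
```

Storing records in binary like this avoids the parsing cost of text, which is the core trade-off SequenceFile makes.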

● Avro: Avro is a row-based storage format. This format embeds the schema of your data in JSON. Avro supports block compression and is splittable, making it a good choice for most cases when using Hadoop.
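
A hypothetical Avro schema for a simple user record, written in Avro's JSON schema notation (the record and field names here are illustrative):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

Because the schema travels with the data, readers can evolve independently of writers, for example by adding optional fields such as the nullable `email` above.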

● Parquet: Parquet is a column-based binary storage format that can store nested data structures. This format is very efficient in terms of disk input/output operations when queries read only a subset of the columns.
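
A minimal sketch of why columnar storage saves I/O, using plain Python lists rather than the actual Parquet encoding: when values are grouped by column, a query that needs only one column touches only that column's data.

```python
# Invented sample records, for illustration only.
rows = [
    {"id": 1, "name": "alice", "score": 90},
    {"id": 2, "name": "bob",   "score": 85},
]

# Row layout: all values of each record stored together (Text, Avro).
row_store = [list(r.values()) for r in rows]

# Columnar layout: all values of each column stored together (Parquet).
column_store = {col: [r[col] for r in rows] for col in rows[0]}

# Reading just "score" touches one contiguous list instead of every row.
scores = column_store["score"]
print(scores)
```

In a real Parquet file the same principle lets the reader skip entire column chunks on disk, and storing similar values together also compresses better.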

● RCFile (Record Columnar File): RCFile is a columnar format that divides data into groups of rows, and within each group stores the data column by column.

● ORC (Optimized Row Columnar): ORC is considered an evolution of the RCFile format; it retains all of its benefits while adding improvements such as better compression and faster queries.
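
In practice, formats like ORC are usually chosen at table-creation time. A hedged Hive example, assuming a hypothetical table named `events` (the table and column names are illustrative):

```sql
-- Hypothetical Hive table stored in ORC; names are illustrative.
CREATE TABLE events (
  event_id   BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
STORED AS ORC;
```

The same `STORED AS` clause accepts other formats discussed above, such as `PARQUET`, `SEQUENCEFILE`, or `TEXTFILE`, which makes it easy to compare formats for a given workload.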
