What is meant by "streaming data access" in HDFS?


Solution 1

Streaming just implies that it can offer you a constant bitrate above a certain threshold when transferring the data, as opposed to having the data arrive in bursts or waves.

If HDFS is laid out for streaming, it will probably still support seek, with a bit of overhead for the caching it needs to keep the stream constant.

Of course, depending on system and network load, your seeks might take a bit longer.
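
To illustrate why a seek in a block-based file is cheap to resolve but can still cost time, here is a minimal sketch that maps a byte offset to a block. It is not HDFS client code: `locate` and the 64 MB `BLOCK_SIZE` are assumptions made up for the example. The point is only that a seek may land in a different block, which may live on a different machine.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size (64 MB), for illustration only


def locate(offset, block_size=BLOCK_SIZE):
    """Map a byte offset in a file to (block index, offset inside that block)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, block_size)


# Seeking to byte 200,000,000 lands in the third block (index 2).
block_index, offset_in_block = locate(200_000_000)
print(block_index, offset_in_block)
```

A stream played start to finish walks the blocks in order, while a seek-heavy client keeps jumping between block indexes.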

Solution 2

HDFS stores data in large blocks, for example 64 MB. The idea is that you want your data laid out sequentially on your hard drive, reducing the number of seeks your hard drive has to do to read data.
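
A rough back-of-the-envelope sketch of why large blocks help. The 10 ms seek time and 100 MB/s transfer rate are assumed round numbers for a spinning disk, not measurements, and `seek_overhead` is a hypothetical helper. It estimates what fraction of the read time goes to seeking if the drive does one seek per block.

```python
def seek_overhead(block_bytes, seek_seconds=0.010, transfer_bytes_per_sec=100e6):
    """Fraction of total read time spent seeking, assuming one seek per block."""
    transfer_seconds = block_bytes / transfer_bytes_per_sec
    return seek_seconds / (seek_seconds + transfer_seconds)


# Compare a small block, a medium block, and a 64 MB block.
for size in (4 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    print(f"{size:>10} bytes per block: {seek_overhead(size):.1%} of time spent seeking")
```

Under these assumptions the seek cost dominates for small blocks and shrinks to a small fraction once blocks reach tens of megabytes.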

In addition, HDFS is a user-space file system with a single central name node that keeps an in-memory directory of where all of the blocks (and their replicas) are stored across the cluster. Files are expected to be large (say 1 GB or more) and are split up into several blocks. To read a file, the client code asks the name node for the list of blocks and then reads the blocks sequentially.
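
The read path described above can be sketched with a toy in-memory model. Everything here is hypothetical (`ToyNameNode`, `read_file`, the block ids, and a plain dict standing in for the machines that hold the blocks); it is not the Hadoop API, just the shape of "ask for the block list, then read the blocks in order".

```python
from typing import Dict, Iterator, List


class ToyNameNode:
    """Hypothetical in-memory directory: file path -> ordered list of block ids."""

    def __init__(self) -> None:
        self.blocks_by_file: Dict[str, List[str]] = {}

    def add_file(self, path: str, block_ids: List[str]) -> None:
        self.blocks_by_file[path] = list(block_ids)

    def get_block_list(self, path: str) -> List[str]:
        return list(self.blocks_by_file[path])


def read_file(name_node: ToyNameNode, block_store: Dict[str, bytes], path: str) -> Iterator[bytes]:
    """Ask the name node for the block list once, then yield blocks in order."""
    for block_id in name_node.get_block_list(path):
        yield block_store[block_id]


# A dict stands in for the storage that holds the actual block contents.
block_store = {"blk_1": b"hello ", "blk_2": b"streaming ", "blk_3": b"world"}
name_node = ToyNameNode()
name_node.add_file("/demo.txt", ["blk_1", "blk_2", "blk_3"])

data = b"".join(read_file(name_node, block_store, "/demo.txt"))
print(data)
```

Note that the name node in this sketch only hands out metadata; the bytes themselves come from the block store.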

The data is "streamed" off the hard drive by maintaining the maximum I/O rate that the drive can sustain for these large blocks of data.


Author: vertti


Updated on September 17, 2022

Comments

  • vertti, over 1 year ago

    According to the HDFS Architecture page, HDFS was designed for "streaming data access". I'm not sure what that means exactly, but I would guess it means an operation like seek is either disabled or has sub-optimal performance. Would this be correct?

    I'm interested in using HDFS for storing audio/video files that need to be streamed to browser clients. Most of the streams will be start to finish, but some could have a high number of seeks.

    Maybe there is another file system that could do this better?

    • Admin, almost 9 years ago
      The streaming access pattern in HDFS is: write once, read any number of times, but don't try to change the contents of the file.