Data block size in HDFS: why 64MB?


Solution 1

What does 64MB block size mean?

The block size is the smallest data unit that a file system can store. If you store a file that's 1 KB or 60 MB, it'll take up one block. Once you cross the 64 MB boundary, you need a second block.

If yes, what is the advantage of doing that?

HDFS is meant to handle large files. Let's say you have a 1,000 MB file. With a 4 KB block size, you'd have to make 256,000 requests to get that file (1 request per block). In HDFS, those requests go across a network and come with a lot of overhead. Each request has to be processed by the Name Node to determine where that block can be found. That's a lot of traffic! If you use 64 MB blocks, the number of requests goes down to 16, significantly reducing the cost of overhead and load on the Name Node.
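
To make the arithmetic concrete, here is a tiny back-of-the-envelope sketch in plain Java. The numbers and names are purely illustrative (this is not a Hadoop API), and they match the example above:

```java
// Back-of-the-envelope: how many block-location lookups a client needs for one
// file at different block sizes. Purely illustrative, not part of any Hadoop API.
public class BlockCountSketch {

    // Ceiling division: number of blocks needed to hold fileBytes.
    static long blocksNeeded(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long fileSize = 1000L * 1024 * 1024;                            // a 1,000 MB file
        System.out.println(blocksNeeded(fileSize, 4L * 1024));          // 256,000 with 4 KB blocks
        System.out.println(blocksNeeded(fileSize, 64L * 1024 * 1024));  // 16 with 64 MB blocks
    }
}
```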

Solution 2

HDFS's design was originally inspired by the design of the Google File System (GFS). Here are the reasons for large block sizes as stated in the original GFS paper (a note on GFS vs. HDFS terminology: chunk = block, chunkserver = datanode, master = namenode):

A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. [...] Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.

Finally, I should point out that the current default size in Apache Hadoop is 128 MB (see dfs.blocksize).
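
If you want to check or override this programmatically, something along these lines should work. It is only a minimal sketch: it assumes a reachable HDFS with the standard org.apache.hadoop client on the classpath, and the path /tmp/example.dat is made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Default block size the cluster will use for new files (dfs.blocksize).
        System.out.println("default block size: " + fs.getDefaultBlockSize(new Path("/")));

        // The block size is a per-file attribute and can be overridden at create time.
        Path p = new Path("/tmp/example.dat");      // hypothetical path
        try (FSDataOutputStream out =
                 fs.create(p, true, 4096, (short) 3, 256L * 1024 * 1024)) {  // 256 MB blocks
            out.writeUTF("hello");
        }
        System.out.println("block size of " + p + ": " + fs.getFileStatus(p).getBlockSize());
    }
}
```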

Solution 3

In HDFS the block size controls the level of replication declustering. The lower the block size, the more evenly your blocks are distributed across the DataNodes; the higher the block size, the less evenly your data is potentially distributed across your cluster.

So what's the point then of choosing a higher block size instead of some low value? While in theory equal distribution of data is a good thing, having too low a block size has some significant drawbacks. The NameNode's capacity is limited, so having a 4 KB block size instead of 128 MB also means having 32,768 times more information to store. MapReduce could also profit from equally distributed data by launching more map tasks on more NodeManagers and more CPU cores, but in practice theoretical benefits will be lost by not being able to perform sequential, buffered reads and because of the latency of each map task.
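
For the MapReduce side of that argument: by default the input split size, and therefore the number of map tasks, is tied to the block size. The sketch below restates the split-size formula used by FileInputFormat in simplified form (not copied from the Hadoop source); the properties mapreduce.input.fileinputformat.split.minsize and split.maxsize are the standard knobs that feed into it:

```java
// Simplified restatement of how FileInputFormat-style split sizing relates to
// the HDFS block size: with default min/max split settings, one split equals
// one block, so one map task is launched per block.
public class SplitSizeSketch {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB HDFS block
        long minSize = 1L;                     // effective default minimum split size
        long maxSize = Long.MAX_VALUE;         // effective default maximum split size
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));  // 134217728, i.e. one block
    }
}
```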

Solution 4

In a normal OS the block size is 4 KB, while in Hadoop it is 64 MB, because that makes it easier to maintain the metadata in the Namenode.

Suppose we had only a 4 KB block size in Hadoop and we tried to load 100 MB of data: we would need 25,600 of those 4 KB blocks, and the Namenode would need to maintain metadata for all of them.

If we use a 64 MB block size instead, the data is loaded into only two blocks (64 MB and 36 MB), so the size of the metadata is decreased.

Conclusion: to reduce the burden on the Namenode, HDFS prefers a 64 MB or 128 MB block size. The default block size is 64 MB in Hadoop 1.0 and 128 MB in Hadoop 2.0.

Solution 5

It has more to do with disk seeks of the HDD (hard disk drive). Over time, disk seek time has not been improving much when compared to disk throughput. So, when the block size is small (which leads to too many blocks), there are too many disk seeks, which is not very efficient. As we move from HDDs to SSDs, disk seek time matters less, since there are no moving parts in an SSD.

Also, if there are too many blocks it will strain the Name Node. Note that the Name Node has to store the entire metadata (data about blocks) in memory. In Apache Hadoop the default block size is 64 MB, and in the Cloudera Hadoop distribution the default is 128 MB.
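
To put a rough number on that strain, here is a back-of-the-envelope sketch. It leans on the commonly quoted rule of thumb of roughly 150 bytes of Name Node heap per block entry, which is an approximation rather than an exact figure:

```java
// Rough estimate of Name Node heap spent on block entries alone, assuming the
// often-quoted ~150 bytes of heap per block (an approximation, not exact).
public class NameNodeMemorySketch {

    static long estimateHeapBytes(long totalDataBytes, long blockBytes, long bytesPerBlockEntry) {
        long blocks = (totalDataBytes + blockBytes - 1) / blockBytes;  // ceiling division
        return blocks * bytesPerBlockEntry;
    }

    public static void main(String[] args) {
        long onePetabyte = 1L << 50;   // 1 PB of raw data in the cluster
        long perEntry = 150L;          // assumed heap bytes per block entry

        System.out.printf("4 KB blocks  : ~%,d MB of heap%n",
                estimateHeapBytes(onePetabyte, 4L << 10, perEntry) / (1L << 20));
        System.out.printf("128 MB blocks: ~%,d MB of heap%n",
                estimateHeapBytes(onePetabyte, 128L << 20, perEntry) / (1L << 20));
    }
}
```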

Comments

  • dykw
    dykw almost 2 years

    The default data block size of HDFS/Hadoop is 64MB. The block size in the disk is generally 4KB.

    What does 64MB block size mean? ->Does it mean that the smallest unit of reading from disk is 64MB?

    If yes, what is the advantage of doing that?-> easy for continuous access of large files in HDFS?

    Can we do the same by using the disk's original 4KB block size?

  • dykw
    dykw over 10 years
Thanks for your answer. Assume the block size is 4KB and a file is stored in contiguous blocks on the disk. Why can't we retrieve a 1000 MB file by using 1 request? I know that maybe HDFS currently doesn't support such an access method, but what is the problem with such an access method?
  • Praveen Sripati
    Praveen Sripati over 10 years
In the case of small files, let's say that you have a bunch of 1k files, and your block size is 4k. That means that each file is wasting 3k, which is not cool. - this is not true in the case of HDFS. Let's say the file is 100MB; then the blocks are 64MB and 36MB. Usually the size of the last block is smaller, unless the file is a multiple of 64MB.
  • dykw
    dykw over 10 years
    so you mean the underlying implementation of a 64MB block read is not broken down into many 4KB block reads from the disk? Does the disk support to read 64MB in 1 read? Please feel free to ask me for clarification if the question is not clear. Thanks.
  • Praveen Sripati
    Praveen Sripati over 10 years
A 64MB HDFS block will be split into multiple 4KB OS file system blocks.
  • dykw
    dykw over 10 years
    if 64MB HDFS block will be split into multiple 4KB blocks, what's the point of using 64MB HDFS block?
  • bstempi
    bstempi over 10 years
To reduce load on the Name Node. Fewer blocks to track = fewer requests and less memory spent tracking blocks.
  • cabad
    cabad over 10 years
    I agree with (1), but not with (2). The framework could (by default) just have each mapper deal with multiple data blocks.
  • bstempi
    bstempi over 10 years
    @dykw HDFS could retrieve a 1000 MB file in one request if the block sizes were at least 1000 MB.
  • sab
    sab over 10 years
    So there is really no advantage of block size being 64 or 128 with regards to sequential access? Since each block may be split into multiple native file system blocks?
  • Rags
    Rags almost 10 years
    There is advantage as these 4k blocks are stored on the disk contiguously which means there is no additional seek between one 4k block to the next
  • user1956609
    user1956609 over 9 years
    Does this mean that, if I store a 1mb file in HDFS with a block size of 64mb, it will take up 64mb of HDFS storage capacity?
  • bstempi
    bstempi over 9 years
    @user1956609 No, a 1Mb file will not take up 64Mb on disk.
  • David Ongaro
    David Ongaro about 9 years
    This answer is just plain wrong. What "block" or "block size" means is dependent on the file system and in the case of HDFS it does not mean the smallest unit it can store, it's the smallest unit the namenode references. And a block is usually stored sequentially on a physical disk, which makes reading and writing a block fast. For small files the block size doesn't matter much, because they will be smaller than the blocksize anyway and stored as a smaller block. So bigger block sizes are generally better but one has to weigh that against the desired amount of data and mapper distribution.
  • bstempi
    bstempi about 9 years
    @DavidOngaro Saying that the block size is the smallest unit that a namenode references is correct...my explanation is a slight oversimplification. I'm not sure why that makes the answer 'just plain wrong,' though.
  • David Ongaro
    David Ongaro about 9 years
@bstempi: At least 3 things are wrong: 1. Most files are not a multiple of the block size, so the last block is actually smaller than the default block size (and yes, it's referenced by the namenode). 2. There is no such thing as the "block size", since the block size is a file attribute and can therefore be different from file to file (there is a "default block size" though, and that is what the question was about). 3. You give wrong advice: if you already have mostly small files, making the block size even smaller would make things worse.
  • bstempi
    bstempi about 9 years
@DavidOngaro I agree with your third point and will update my answer when I'm not on mobile. Your first point is an assumption on your part; I never stated that a partial block takes the space of a full block. The second point is just nitpicking; the vast majority of people don't change this attribute from file to file, so I didn't think it was worth mentioning.
  • David Ongaro
    David Ongaro about 9 years
@bstempi: The second point is not nitpicking: if you change the default block size, it leaves your old files at the old block size unless you migrate them explicitly with distcp, and it's important to understand that. So even if you never set it at a file-individual level, you can end up with files of different block sizes. It also proves the first point: if there is no single block size, the namenode cannot use the "default blocksize" as the smallest addressable unit. No need for making assumptions here.
  • bstempi
    bstempi about 9 years
@DavidOngaro: but that's not the question that was asked. The question was about the default block size. If you take that much issue with my answer, then I suggest you submit an edit or submit your own answer.
  • bstempi
    bstempi about 9 years
Each mapper processes a split, not a block. Furthermore, even if a mapper is assigned a split of N blocks, the end of the split may be a partial record, causing the Record Reader (this is specific to each record reader, but generally true for the ones that come with Hadoop) to read the rest of the record from the next block. The point is that mappers often cross block boundaries.
  • Ayan Biswas
    Ayan Biswas about 9 years
The default block size for Hadoop 1 was 64 MB; for Hadoop 2 it's 128 MB.
  • cabad
    cabad almost 8 years
This was already mentioned in my answer. It would have been preferable to add comments to my answer rather than to post an answer that adds very little to prior answers.
  • Jon Andrews
    Jon Andrews about 7 years
@bstempi, you had mentioned that using 64 MB blocks, the number of requests would be reduced to 16. But the operating system processes data in 4 KB blocks; it won't be processing the data in 64 MB units. How does that work in the background? How can we reduce the seek time with a 64 MB block when the operating system only processes a 4 KB block at a time?
  • David Ongaro
    David Ongaro about 7 years
    @BasilPaul: I now tried to answer your question in aspect 3 of stackoverflow.com/a/43382368/2727750. Basically disk seek time doesn't matter much when you already have a network transfer going on, but TCP throughput matters. So I wouldn't argue that the advantage is the lower number of "requests", but the lower number of persistent TCP connections.
  • Jon Andrews
    Jon Andrews about 7 years
@DavidOngaro, first, regarding disk throughput: you had mentioned that data would be written sequentially to the disk. So if I have 64 MB of data on a slave node, 8 KB blocks of that 64 MB of data would be laid out sequentially on the disk of the slave node to cut down seek time. When the data is written or read, it would be processed in 8 KB blocks? Is that correct?
  • Jon Andrews
    Jon Andrews about 7 years
@Rags, you mean to say that if I have 64 MB of data on the disk of a slave node, 8 KB blocks of that 64 MB of data would be formed in a sequential manner to cut down the extra seek time. Is that correct?
  • Jon Andrews
    Jon Andrews about 7 years
@Rags, if that is the case, how will the file system save the data in a sequential manner? That is not the way data is normally saved on disk; data would be scattered across the disk. How does the sequential layout of data happen on disk? Could you give me clarity on this part? A bit confused.
  • bstempi
    bstempi about 7 years
The goal of using a larger block size is to reduce the number of requests to the name node. So, someone would say, "where can I find this file?" to the name node, which would say, "You can find this 64Mb chunk here, and this one here, etc." The requestor would go to each place that a chunk is stored and say, "hey, give me a copy of that chunk." What HDFS recognizes as a block is not the same as the underlying FS. If the underlying FS uses 4k blocks, then that 64Mb chunk would take roughly 16,000 of them. That's fine because the latency on those is low, unlike the latency to the name node.
  • bstempi
    bstempi about 7 years
To state it differently: the goal is not to reduce the operating system's burden in storing the blocks; it's to reduce how much thinking the name node has to do when it comes to remembering where each piece of the file is. The name node is much slower than a local OS's file system.
  • Rags
    Rags about 7 years
@Basil Paul, that is a very good question. The intent is to get contiguous blocks from the underlying file system. In a production setup HDFS gets its own volumes, so getting contiguous blocks is not an issue. If you mix it with other storage, like MapReduce temp data etc., then the issue arises. How exactly it is managed I am not sure. You may have to open the code and see how it is managed.
  • Dmytro
    Dmytro about 6 years
So, to summarize: minimize the odds that a file can't be stored in a single block? And even if the records are not stored linearly, the whole block is loaded into memory, so after the transfer the seek time disappears? But does that really help if the underlying FS uses 4k blocks? I guess there are fewer requests for blocks, e.g. asking for 64k once over a network is cheaper than asking for 4k 16 times, with 15 extra round trips.
  • Tom Taylor
    Tom Taylor over 5 years
    From "MapReduce could also profit from equally distributed data by launching more map tasks on more NodeManager and more CPU cores" - means map reduce task is applied over huge amount of data?
  • Tom Taylor
    Tom Taylor over 5 years
I couldn't clearly follow you here: "but in practice theoretical benefits will be lost by not being able to perform sequential, buffered reads and because of the latency of each map task". Can you please elaborate on this?
  • Tom Taylor
    Tom Taylor over 5 years
@bstempi: So, how are these chunks handled by the OS? Do we have a seek (not skip) option in HDFS to access a byte position at a random point?
  • Tom Taylor
    Tom Taylor over 5 years
Is there any tool available to visually look at how these chunks are stored in the OS file system on the respective data nodes?
  • Tom Taylor
    Tom Taylor over 5 years
Is it possible to store multiple small files (say, 1 KB each) in a single 64 MB block? If we can store multiple small files in a block, how would the nth file in the block be read - will the file pointer be seeked to that particular nth file's offset location, or will it skip the n-1 preceding files before reading the nth file's content?