## HDFS Concepts

### Blocks

• A block is 128 MB by default.
• HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks.
• The block abstraction brings several benefits to a distributed filesystem:
• A file can be larger than any single disk in the network.
• It simplifies the storage subsystem.
• Blocks fit well with replication, which provides fault tolerance and availability.
• % hdfs fsck / -files -blocks lists the blocks that make up each file in the filesystem.

### Namenodes and Datanodes

• An HDFS cluster has two types of nodes operating in a master−worker pattern:
• the namenode (the master)
• datanodes (the workers)
• Without the namenode, the filesystem cannot be used.
• For this reason, it is important to make the namenode resilient to failure.
• The first way is to back up the files that make up the persistent state of the filesystem.
• It is also possible to run a secondary namenode, which periodically merges the namespace image with the edit log.

### Block Caching

• For frequently accessed files, the blocks may be explicitly cached in the datanodes’ memory, in an off-heap block cache.
• By default, a block is cached in only one datanode’s memory.
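Cache pools and directives are managed with the `hdfs cacheadmin` command. A hedged sketch (the pool and path names here are illustrative, not from the source):

```shell
# Create a cache pool, then tell HDFS to cache a hot directory in it.
% hdfs cacheadmin -addPool hot-pool
% hdfs cacheadmin -addDirective -path /user/hive/warehouse/dim_table -pool hot-pool

# Inspect the directives currently in force.
% hdfs cacheadmin -listDirectives
```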

### HDFS Federation

• Under federation, each namenode manages a portion of the filesystem namespace: one namenode might manage all the files rooted under /user, say, and a second namenode might handle files under /share.
• Each namenode manages a namespace volume (the metadata for its portion of the namespace).
• Each namenode also manages a block pool (the blocks for the files in its namespace).
• Namenodes are independent and do not communicate with one another.
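Federation is configured by declaring multiple nameservices in hdfs-site.xml. A minimal illustrative fragment (hostnames and nameservice IDs are assumptions, not from the source):

```xml
<!-- Two independent nameservices, each backed by its own namenode -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>
```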

### HDFS High Availability

• The new namenode is not able to serve requests until it has
• loaded its namespace image into memory
• replayed its edit log
• received enough block reports from the datanodes to leave safe mode.
• On large clusters with many files and blocks, the time it takes for a namenode to start from cold can be 30 minutes or more.
• Hadoop 2 remedied this situation by adding support for HDFS high availability (HA).
• There is a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.
• There are two choices for the highly available shared storage
• NFS filer
• quorum journal manager (QJM)
• The actual observed failover time is longer in practice (around a minute or so), since the system needs to be conservative in deciding that the active namenode has failed.
• The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller.
• The default implementation uses ZooKeeper to ensure that only one namenode is active.
• The QJM only allows one namenode to write to the edit log at one time
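An HA nameservice backed by the QJM is declared in hdfs-site.xml. A hedged, illustrative fragment (the nameservice ID `mycluster` and the journal-node hostnames are assumptions):

```xml
<!-- One HA nameservice with two namenodes and a three-node QJM quorum -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```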

## The Command-Line Interface

### Basic Filesystem Operations

• copying a file from the local filesystem to HDFS

• copy the file back to the local filesystem and check whether it’s the same

• create a directory first just to see how it is displayed in the listing
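The three operations above can be sketched as follows (file and directory names are illustrative):

```shell
# Copy a local file into HDFS.
% hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt

# Copy it back and verify the round trip with checksums.
% hadoop fs -copyToLocal /user/tom/quangle.txt quangle.copy.txt
% md5sum input/docs/quangle.txt quangle.copy.txt

# Create a directory and see how it appears in a listing.
% hadoop fs -mkdir books
% hadoop fs -ls .
```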

## The Java Interface

Example. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler

There’s a little bit more work required to make Java recognize Hadoop’s hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block.
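A sketch of such a program (the class name and 4 KB buffer size follow common usage; it assumes the Hadoop client libraries are on the classpath):

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {

  static {
    // setURLStreamHandlerFactory may be called only once per JVM,
    // hence the static block.
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      // e.g. hdfs://localhost/user/tom/quangle.txt
      in = new URL(args[0]).openStream();
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```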

Here’s a sample run:

### Reading Data Using the FileSystem API

Example. Displaying files from a Hadoop filesystem on standard output by using the FileSystem directly
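A hedged sketch of reading a file through the FileSystem API directly, with no URL handler needed (class name and argument handling are illustrative):

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    // Pick the FileSystem implementation from the URI scheme (hdfs, file, ...)
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```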

The program runs as follows:

#### FSDataInputStream

Example. Displaying files from a Hadoop filesystem on standard output twice, by using seek()
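FileSystem.open() returns an FSDataInputStream rather than a plain InputStream, which supports random access via seek(). A sketch that prints the file twice by seeking back to the start (illustrative class name):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0); // go back to the start of the file
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```

Note that seek() is a relatively expensive operation on HDFS and is best used sparingly.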

Here’s the result of running it on a small file:

### Writing Data

Example. Copying a local file to a Hadoop filesystem
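A sketch of copying a local file into HDFS via FileSystem.create(), reporting write progress with a Progressable callback (class name and the dot-printing callback are illustrative):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];

    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      public void progress() {
        System.out.print("."); // called as data is written to the pipeline
      }
    });

    // true: close both streams when the copy finishes
    IOUtils.copyBytes(in, out, 4096, true);
  }
}
```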

Typical usage:

### Querying the Filesystem

The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.
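A minimal sketch of retrieving that metadata for one path with getFileStatus() (class name and output labels are assumptions):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowFileStatus {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    FileStatus stat = fs.getFileStatus(new Path(uri));
    System.out.println("length:      " + stat.getLen());
    System.out.println("block size:  " + stat.getBlockSize());
    System.out.println("replication: " + stat.getReplication());
    System.out.println("modified:    " + stat.getModificationTime());
    System.out.println("owner:       " + stat.getOwner());
    System.out.println("permissions: " + stat.getPermission());
    System.out.println("isDirectory: " + stat.isDirectory());
  }
}
```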

#### Listing files

Example. Showing the file statuses for a collection of paths in a Hadoop filesystem
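A sketch along these lines, using listStatus() over a set of paths and FileUtil.stat2Paths() to turn the statuses back into paths (illustrative class name):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ListStatus {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    Path[] paths = new Path[args.length];
    for (int i = 0; i < args.length; i++) {
      paths[i] = new Path(args[i]);
    }

    // One listing call over all the paths; directories are expanded
    // one level, files are listed as themselves.
    FileStatus[] status = fs.listStatus(paths);
    Path[] listedPaths = FileUtil.stat2Paths(status);
    for (Path p : listedPaths) {
      System.out.println(p);
    }
  }
}
```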

We can use this program to find the union of directory listings for a collection of paths:

## Data Flow

### Anatomy of a File Read

Hadoop measures the distance between two nodes by their position in the network topology: 0 for processes on the same node, 2 for different nodes on the same rack, 4 for nodes on different racks in the same data center, and 6 for nodes in different data centers. Mathematically inclined readers will notice that this is an example of a distance metric.

### Anatomy of a File Write

A typical replica pipeline (with the default replication factor of 3): the first replica is placed on the client's node (or a random node if the client runs outside the cluster), the second on a different rack, and the third on the same rack as the second but on a different node.

## Coherency Model

After creating a file, it is visible in the filesystem namespace, as expected:

However, any content written to the file is not guaranteed to be visible, even if the stream is flushed. So, the file appears to have a length of zero:

HDFS provides a way to force all buffers to be flushed to the datanodes via the hflush() method on FSDataOutputStream. After a successful return from hflush(), HDFS guarantees that the data written up to that point in the file has reached all the datanodes in the write pipeline and is visible to all new readers:

Note that hflush() does not guarantee that the datanodes have written the data to disk, only that it’s in the datanodes’ memory (so in the event of a data center power outage, for example, data could be lost). For this stronger guarantee, use hsync() instead.

Closing a file in HDFS performs an implicit hflush(), too:
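The visibility guarantees above can be sketched in code (path and content are illustrative; assumes a default FileSystem pointing at HDFS):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CoherencyDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path(args[0]));

    out.write("content".getBytes("UTF-8"));
    // Without a flush, new readers may see a zero-length file here.

    out.hflush(); // data now visible to new readers, but possibly
                  // only in the datanodes' memory, not on disk
    out.hsync();  // stronger: datanodes have synced the data to disk

    out.close();  // close performs an implicit hflush
  }
}
```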

## Parallel Copying with distcp

One use for distcp is as an efficient replacement for hadoop fs -cp. For example, you can copy one file to another with:

% hadoop distcp file1 file2

You can also copy directories:

% hadoop distcp dir1 dir2
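When the destination already exists, distcp's behavior can be controlled with flags (a brief sketch):

```shell
# -update copies only the files that have changed relative to the destination.
% hadoop distcp -update dir1 dir2

# -overwrite forces a full recopy, overwriting existing destination files.
% hadoop distcp -overwrite dir1 dir2
```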