Hadoop Distributed File System (HDFS)
HDFS is a very large distributed file system, designed to scale to ~10k nodes, 100 million files, and 10 PB of storage.
How does this relate to S3 buckets?
??
It assumes commodity hardware, so node failures are expected and handled in software.
Data coherency
- Write-once-read-many access model (this sidesteps the data coherency problem: blocks never change after being written)
- A client can only append to existing files, never rewrite bytes in place
Logbook style.
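A minimal sketch of that append-only "logbook" contract. This is a toy in-memory class I made up to illustrate the rule, not Hadoop code: you can create, append, and read, but there is deliberately no way to overwrite a byte in place.

```python
# Toy illustration of HDFS's write-once/append-only file model.
class AppendOnlyFile:
    def __init__(self):
        self._chunks: list[bytes] = []

    def append(self, data: bytes):
        """Appending to the end is the only mutation allowed."""
        self._chunks.append(data)

    def read(self) -> bytes:
        return b"".join(self._chunks)

    # Note: no write(offset, data) method exists — in-place updates
    # are forbidden, which is what makes coherency easy.

log = AppendOnlyFile()
log.append(b"event 1\n")
log.append(b"event 2\n")
contents = log.read()  # b"event 1\nevent 2\n"
```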
Files are broken up into blocks (typically 64MB - 128MB block sizes)
- Feels related to how Paging works
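The paging analogy can be made concrete with a sketch of how a file maps onto fixed-size blocks. The block size and the `(offset, length)` representation here are illustrative assumptions, not HDFS internals:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS block size

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) pairs; only the last block may be short."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file becomes three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Like pages, blocks are fixed-size and internal fragmentation only occurs in the last one; unlike pages, HDFS blocks are huge, to amortize seek and metadata costs.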
HDFS Namenode
Manages the filesystem metadata only: the namespace tree and the mapping from files to blocks to datanodes. File data never flows through it.
This feels like the lookup table that we had → Message Broker. No, more like a Connection Broker
Heartbeat messages: each datanode periodically pings the Namenode so it knows which nodes are alive; if heartbeats stop, the node is presumed dead.
The teacher said there is only one Namenode, with a backup Namenode that takes over if it goes down. Wouldn't it be better to deploy a cluster of Namenodes?
HDFS Datanode
A block server
- Stores data in the local file system
- Stores metadata of a block
- Serves data and metadata to clients
Block Placement Policy
The default policy stores 3 replicas across at least 2 racks: the first on the writer's node (if it is a datanode), the second on a node in a different rack, and the third on a different node in that same remote rack.
MapReduce tasks run on the datanodes: the computation is moved to where the data lives.
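The default rack-aware placement can be sketched as follows. The cluster map, node names, and function are all made up for illustration; real HDFS also weighs load and free space when choosing nodes:

```python
import random

# Sketch of HDFS's default placement for 3 replicas:
#   replica 1 → the writer's own node
#   replica 2 → a node on a different rack
#   replica 3 → another node on that same remote rack
def place_replicas(writer: str, racks: dict[str, list[str]]) -> list[str]:
    writer_rack = next(r for r, nodes in racks.items() if writer in nodes)
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second, third = random.sample(racks[remote_rack], 2)
    return [writer, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4", "dn5"]}
replicas = place_replicas("dn1", racks)  # e.g. ["dn1", "dn4", "dn3"]
```

Two racks is the sweet spot: losing one whole rack still leaves a live replica, while keeping two of the three copies rack-local limits cross-rack write traffic.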
Replication
The Namenode detects datanode failures (via missed heartbeats). When failures occur, it:
- Chooses new datanodes for new replicas
- Balances disk usage
- Balances communication traffic to datanodes
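A toy version of that re-replication step. Everything here (the function, the `block_locations` and `load` structures, picking the least-loaded node as a stand-in for "balances disk usage") is an illustrative assumption, not the Namenode's actual algorithm:

```python
# Toy re-replication: when a datanode dies, re-create its lost block
# replicas on live nodes, preferring the least-loaded candidate.
def rereplicate(failed: str,
                block_locations: dict[str, set[str]],
                load: dict[str, int]) -> dict[str, str]:
    """Return {block: new_datanode} for each block that lost a replica."""
    assignments = {}
    for block, nodes in block_locations.items():
        if failed in nodes:
            nodes.discard(failed)
            # Candidates: live nodes not already holding this block.
            candidates = [n for n in load
                          if n != failed and n not in nodes]
            target = min(candidates, key=lambda n: load[n])  # least loaded
            nodes.add(target)
            load[target] += 1
            assignments[block] = target
    return assignments

load = {"dn1": 5, "dn2": 1, "dn3": 3}
locations = {"blk_1": {"dn1", "dn2"}, "blk_2": {"dn3"}}
moved = rereplicate("dn1", locations, load)  # blk_1 gets a new replica
```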