Hadoop Distributed File System (HDFS)

HDFS is a very large distributed file system (10k nodes, 100 million files, 10PB).

How does this relate to S3 buckets?

??

it assumes commodity hardware.

Hardware Implementation

Just do Database Sharding

There’s a data coherency problem.

Data coherency

  • Write-once-ready-many access model
  • Client can only append to existing files

Logbook style.

Files are broken up into blocks (typically 64MB - 128MB block sizes)

  • Feels related to how Paging works

HDFS Namenode

This feels like the lookup table that we had → Message Broker. No, more like a Connection Broker

Heart beat message?

The teacher talked about how there’s only 1 namenode. And then there’s a backup namenode if it goes down. Wouldn’t it be better to deploy a cluster of namenodes?

HDFS Datanode

A block server

  • Stores data in the local file system
  • Stores metadata of a block
  • Serves data and metadata to clients

Block Placement Policy

3 replicas will be stored on at least 2 racks

The things that do mapreduce are datanodes.

Replication

The Namenode detects datanode failures. If there are failures, it:

  • Chooses new datanodes for new replicas
  • Balances disk usage
  • Balances communication traffic to datanodes