[Main Text]
If these files become corrupted, the entire HDFS instance stops functioning. If the free space on a Datanode falls below a certain threshold, the balancing scheme automatically moves data from that Datanode to other Datanodes with spare capacity. The three common failure types are Namenode failures, Datanode failures, and network partitions. Creating all local files in the same directory is not optimal, because the local file system may not be able to efficiently support a huge number of files in a single directory. The Namenode keeps an image of the entire file system namespace and the file-to-block map (Blockmap) in memory. Each block has a specified minimum number of replicas. The current, default replica placement policy described here is a work in progress. This policy evenly distributes replicas in the cluster, which makes it easy to balance load in the event of component failure. HDFS uses a policy known as rack awareness (rack-aware placement) to improve data reliability, availability, and network bandwidth utilization. The replication factor can be specified at file creation time and changed later. The Namenode maintains the file system namespace; any change to the file system namespace or its properties is recorded by the Namenode. The architecture does not preclude running multiple Datanodes on the same machine, although that is rare in practice. The Namenode also determines the mapping of blocks to Datanodes. This facilitates the adoption of HDFS as a platform for applications with large data sets. This assumption simplifies data coherency issues and enables high-throughput data access. To increase data throughput, POSIX semantics have been relaxed in a few key areas.
2. Assumptions and Design Goals
Hardware Failure
Hardware failure is the norm rather than the exception. HDFS has many similarities with existing distributed file systems. One can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features. The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster.
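The rack-aware placement described above can be sketched roughly as follows. This is a simplified illustration, not the HDFS implementation: the function name and the dictionary representation of the cluster topology are hypothetical, and the policy shown (first replica on the writer's node, remaining replicas on two different nodes of one remote rack, so a block spans two racks rather than three) is one reading of the text's description.

```python
import random

def choose_replica_targets(writer_node, nodes_by_rack, replication=3):
    """Toy sketch of rack-aware replica placement (hypothetical, not the
    HDFS API): place the first replica on the writer's own node and the
    remaining replicas on distinct nodes of a single remote rack, so each
    block ends up on two unique racks instead of three."""
    writer_rack = next(r for r, nodes in nodes_by_rack.items()
                       if writer_node in nodes)
    targets = [writer_node]
    remote_racks = [r for r in nodes_by_rack if r != writer_rack]
    if remote_racks and replication > 1:
        rack = random.choice(remote_racks)          # pick one remote rack
        candidates = [n for n in nodes_by_rack[rack] if n not in targets]
        targets += candidates[:replication - 1]     # distinct nodes, same rack
    return targets
```

Because the block lands on only two racks, a reader needs at most two racks' worth of bandwidth, and a write crosses the inter-rack switch only once per remote rack, which matches the bandwidth argument in the text.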
It stores each file as a sequence of blocks. HDFS was originally developed as infrastructure for the Apache Nutch search engine project. The design of HDFS favors batch processing of data over interactive use by users. A single HDFS instance should be able to support tens of millions of files. Moving the computation close to the data is clearly better than moving the data to where the application runs. HDFS exposes a file system namespace and allows user data to be stored in files. HDFS is implemented in Java, so any machine that supports Java can run a Namenode or a Datanode. The hierarchy of the file system namespace is similar to that of most existing file systems: users can create, delete, move, or rename files. For fault tolerance, all blocks of a file are replicated.
Replica Placement: The First Baby Steps
The placement of replicas is critical to HDFS reliability and performance. Through a rack-awareness process, the Namenode determines the rack id that each Datanode belongs to. At the same time, because a block is placed on only two unique racks (rather than three), this policy reduces the aggregate network bandwidth used when reading data. A Namenode in safe mode does not replicate blocks. For example, creating a file in HDFS causes the Namenode to insert a record into the EditLog; likewise, changing the replication factor of a file causes another record to be inserted into the EditLog. The Datanode stores HDFS data as files in its local file system and has no knowledge of HDFS files. A Remote Procedure Call (RPC) abstraction wraps the ClientProtocol and the DatanodeProtocol. The death of a Datanode may cause the replication factor of some blocks to fall below their specified value; the Namenode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. This synchronous updating of multiple copies may reduce the number of namespace transactions per second that the Namenode can support.
Data Integrity
A block of data fetched from a Datanode may arrive corrupted, whether because of faults in the Datanode's storage device, network faults, or buggy software. The Namenode detects this condition by the absence of heartbeats, marks Datanodes without recent heartbeats as dead, and does not forward any new IO requests to them. A client connects to the Namenode through a configurable TCP port and talks to it using the ClientProtocol. This process is called a checkpoint.
6. Persistence of File System Metadata
The Namenode stores the HDFS namespace. If an HDFS cluster spans multiple data centers, a client likewise prefers to read the replica in its local data center. This policy cuts inter-rack data traffic, which improves write performance. Large HDFS instances run on a cluster of computers that spans many racks, and communication between two machines on different racks has to go through switches. Receipt of a heartbeat implies that the Datanode is functioning properly.
5. Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster.
4. The File System Namespace
HDFS supports a traditional hierarchical file organization. The Namenode and Datanode are designed to run on ordinary commodity machines. The Namenode is a central server that manages the file system namespace and regulates client access to files.
"Moving Computation is Cheaper than Moving Data"
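The per-block checksum scheme described above can be illustrated with a short sketch. This is a hedged approximation, not the HDFS client code: the function names and the use of CRC-32 over whole blocks are illustrative assumptions (HDFS computes checksums over smaller fixed-size chunks and stores them in a hidden `.crc` file alongside the data).

```python
import zlib

BLOCK_SIZE = 64 * 1024 * 1024  # illustrative; HDFS block size is configurable

def block_checksums(data, block_size=BLOCK_SIZE):
    """Compute one CRC-32 per block of a file's contents. In HDFS, the
    client writes these into a separate hidden checksum file stored in
    the same HDFS namespace as the data file."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def verify_block(block, expected_crc):
    """On read, the client recomputes the checksum; on a mismatch it
    would discard this copy and fetch the block from another replica."""
    return zlib.crc32(block) == expected_crc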
A computation requested by an application is much more efficient if it is executed near the data it operates on, and this is especially true when the data set is huge. HDFS is therefore tuned to support large file storage. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. HDFS provides high-throughput access to application data and is well suited to applications with large data sets.
Jianghan University Graduation Thesis (Design) Foreign-Language Translation. Source: The Hadoop Distributed File System: Architecture and Design. Chinese translation: Hadoop Distributed File System: Architecture and Design. Name: XXXX. Student ID: 200708202137. April 8, 2013.
English Original: The Hadoop Distributed File System: Architecture and Design. Source:
Introduction
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is
Assumptions and Goals
Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.
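The heartbeat-based failure detection that underpins this "quick, automatic recovery" goal can be sketched as follows. The class and its timeout value are hypothetical illustrations, not Hadoop code: the real Namenode's heartbeat interval and dead-node timeout are configurable, and a real implementation would also re-queue the dead node's blocks for replication.

```python
import time

HEARTBEAT_TIMEOUT = 10 * 60  # seconds; illustrative, configurable in practice

class HeartbeatMonitor:
    """Sketch of Namenode-style failure detection (hypothetical class):
    a Datanode that has not sent a heartbeat within the timeout is marked
    dead, and no new IO requests are routed to it."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now=None):
        """Record a heartbeat; receipt implies the node is functioning."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def dead_nodes(self, now=None):
        """Nodes whose most recent heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}
```

Marking a node dead may drop some blocks below their target replication factor, which is what triggers the re-replication the translated text describes.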
POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.
Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future.
"Moving Computation is Cheaper than Moving Data"
A computation requested by an application is much more efficient if