

Research and Development of Extensible Technology for Parallel ETL Tools

  

...is designed to perform data processing through scripts.

The DISTRIBUTE BY operation; the ORDER BY operation. (Note: this type of JOIN applies in scenarios similar to SQL's IN followed by a subquery; fields of the right-hand table may not appear in the SELECT clause.)

Modifying a table's structure:

ALTER TABLE tableName CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]

3. Processing flow. Figure 4-6: execution flow of the custom MR Jar component.

Integration of parallel ETL tools. Integrating Hive and Pig mainly involves the following issues: the design of functional components, the migration and management of metadata processing, and the parsing of workflows.

Figure 4-3: front-end interface of the custom MR Java component. 2. Back-end implementation. The back end receives the code submitted by the front end and writes it into a specific directory, producing a Java source file.

Deployment architecture. Figure 4-2: deployment architecture of the software system. The system's software architecture is designed as a B/S architecture, first because there is no B/S-based parallel ETL tool on the market, and second because, compared with the C/S architecture, B/S has some notable advantages: it uses the web browser as a unified client and concentrates the core of the system's functionality on the server, which greatly simplifies development, maintenance, and use of the system.

The upper layer controls, according to the dependencies among MapReduce jobs, when each Job is submitted to the Hadoop cluster. After submission, scheduling of the Job is handled by Hadoop's built-in scheduling mechanism, and the Job's execution status is likewise retrieved by the upper layer through the API.

Summary. This chapter studied extensible techniques for parallel ETL in three main areas: extensible component technology, parallel ETL tool integration technology, and extensible optimization-rule technology, and determined feasible schemes for implementing all three.

During optimization, the rule sets are taken in order and used to check the plan; wherever an optimization applies, it is performed, and this repeats until no match remains or the iteration limit is reached (500 by default, but HExecutionEngine sets it to 100 when initializing the LogicalPlanOptimizer; the default is used only if the value is set below 1), after which the next rule set is processed.

Each Transform implementation class depends on an implementation of the Dispatcher interface and an implementation of the GraphWalker interface: the former is responsible for matching optimization rules, the latter for specifying how the logical plan is traversed.

Adjacent Map tasks and Reduce tasks form one MapReduce job.

Kettle provides a component named "Pig Script Executor" to execute a specified Pig Latin script file on a Hadoop cluster. At the same time, existing parallel ETL tools all run from scripts, which makes building and managing ETL flows inconvenient and unintuitive for users; moreover, none of them offers scheduled execution of jobs, so none can truly serve as a complete ETL tool.

The second approach packages the Java project containing the driver and the MapReduce programs into a jar file, places it in a specific directory on a node of the cluster, and then runs the job with Hadoop's jar command. In practice, the jar file can be specified in two ways: with the setJar() method of the JobConf class in the driver program, or through Hadoop's jar command.

This thesis studies how to add a customizable component to a B/S-based parallel ETL tool. The component accepts user-defined MapReduce code, so that when new requirements arise in use, the project does not have to be recompiled, repackaged, and redeployed.

1. Extensible component technology. Commercial ETL tools already offer an extremely rich set of built-in components; even so, they still provide corresponding support for customization. Compared with the B/S architecture, this has the following drawback: users must install a dedicated client and configure it before use, and some of the configuration is relatively complex.

After some investigation and testing, Fel was selected, as both its functionality and its performance met the requirements.

Optimization rules. Even without optimization, for the same ETL requirement and with the same ETL tool, different users will design different ETL schemes, owing to differences in familiarity with the tool, in understanding of the requirement, and so on; the execution efficiency of different schemes can differ considerably.

If the final results are also stored on HDFS in the form of Hive tables, then this style of ETL is the ELT mentioned above: the data are extracted and loaded into Hive, and the transformation is performed last.

Hive and Pig are both open-source parallel ETL tools. They provide scripts with which users design ETL processing flows, and these scripts are parsed into MapReduce tasks that can then be executed on a Hadoop cluster.
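As a rough illustration of the compile-and-load step behind such a customizable MapReduce component, the sketch below uses the JDK's javax.tools API to turn a submitted source string into a loadable class. This is a minimal sketch, not the thesis's actual implementation: the class and method names (CustomCodeCompiler, compileAndLoad, compileAndRun) are illustrative, and a system JDK (not a bare JRE) is assumed so that ToolProvider.getSystemJavaCompiler() is available.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical back-end helper: persists user-submitted code as a Java source
// file and compiles it on the server, so the web tool need not be rebuilt.
public final class CustomCodeCompiler {

    // Writes `source` as <className>.java in a fresh temp directory, compiles
    // it with the system Java compiler, and loads the resulting class.
    public static Class<?> compileAndLoad(String className, String source) {
        try {
            Path workDir = Files.createTempDirectory("etl-custom");
            Path src = workDir.resolve(className + ".java");
            Files.writeString(src, source);

            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // needs a JDK
            if (compiler.run(null, null, null, src.toString()) != 0) {
                throw new IllegalStateException("compilation of " + className + " failed");
            }
            // The .class file lands next to the source; load it from there.
            URLClassLoader loader = new URLClassLoader(new URL[] { workDir.toUri().toURL() });
            return loader.loadClass(className);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Convenience wrapper: compile, then invoke a static no-argument method.
    public static Object compileAndRun(String className, String source, String method) {
        try {
            return compileAndLoad(className, source).getMethod(method).invoke(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

In the real component, the compiled classes would next be packaged into a jar and handed to the cluster by one of the two routes described above: registered via JobConf's setJar() in the driver, or submitted with Hadoop's jar command.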
The MapReduce programming model. MapReduce divides a parallel computation into two interrelated phases, a map phase and a reduce phase, corresponding to a Mapper class and a Reducer class respectively.

In HDFS, a file is split into a number of data blocks stored on different data nodes, and each block additionally has several replicas, likewise stored on different data nodes, so that a single node's disk failure or similar problems cannot endanger the data of the whole cluster. The name node and the data nodes are distinct logical concepts; on an actual physical cluster they may be the same physical node.

As data warehouses have grown more powerful, ETL has gradually evolved an ELT variant: the extracted data are loaded directly into the data warehouse without intermediate processing, the transformation is carried out inside the warehouse, and its results are therefore kept in the warehouse as well. Cleaning can take place during extraction or as part of the transformation; its main purpose is to weed out problematic records.

Thesis structure. This thesis is organized as follows. Chapter 1 is the introduction: it first explains the background and significance of the research, then analyzes the state of the related technologies, and finally presents the author's main work and the structure of the thesis.

The MapReduce computing framework lets developers implement parallel computation conveniently. This thesis aims to provide a scheme for embedding user-defined MapReduce code, so that developers can easily extend the ETL functionality to handle special application scenarios or reuse existing work.

Commercial ETL tools optimize query plans with cost-based optimizers, but initially they, too, all used rule-based optimizers.

Hive is a large-scale data processing tool built on Hadoop. It offers the SQL-like scripting language HQL, provides the common relational operations, expression operations, and built-in functions, and supports user extension in the form of UDFs and embedded custom scripts [3].

Teradata's ETL Automation is operated from the command line and offers only two simple GUIs: one for defining and managing the jobs in ETL Automation and the relationships between them, and one for monitoring the execution status of tasks.

The work in this thesis is based on a parallel data mining platform project in which the author took part. It aims to improve the extensibility of parallel ETL tools so that they can be applied in more scenarios; to integrate the open-source parallel ETL products Hive and Pig under a unified way of building ETL flows by dragging and dropping components; and to design and implement a representation and execution mechanism for optimization rules, so that rules discovered in practice, or borrowed from other products, can easily be added to the system, giving it good extensibility.

Keywords: ETL, extensibility, MapReduce, Hive, optimization rules

RESEARCH AND IMPLEMENTATION OF PARALLEL ETL TOOLS' EXTENSIBLE TECHNOLOGY

ABSTRACT

ETL tools, which are the foundation of data mining and online analytical processing, are used to extract data from distributed heterogeneous data sources and load the results into a data mart or warehouse after cleaning and transformation. ETL tools usually provide some basic operations, such as correlation, summary, and so on, but owing to the diversity of ETL application scenarios and the complexity of operation logic, these common operations often cannot satisfy users' needs, which requires that ETL tools have a certain extensibility to meet various special needs. At the same time, in the era of big data, ETL tools handle huge amounts of data by integrating cloud computing technology. Traditional ETL tools compensate for large-scale data processing by integrating parallel ETL tools such as Hive and Pig, but commercial tools are expensive, and the integration offered by the open-source tools is not sufficient. How to integrate Hive and Pig better, so as to extend functionality, is therefore very important. An ETL workflow, on the other hand, as a logical plan, needs to be optimized according to a series of optimization rules in the process of being parsed into a physical plan. As the optimization rules are not set in stone and new optimization rules are discovered while the ETL tool is in use, the optimization rules need to be highly extensible.

In this paper, based on Hadoop and the B/S mode, we put forward a parallel ETL system and study how to extend it. The main work in this paper includes:

1. Through analysis of the implementation details of the MapReduce parallel computing framework, design and realize two solutions for extending functionality to handle complex requirements by embedding custom MapReduce code in the existing tool.

2. Based on an analysis and summary of the language and grammar characteristics of Hive and Pig scripts, combined with actual application requirements, select a set of basic operations and design functional components for them. Then, by analyzing the dependencies between these operations, design and implement the workflow parsing module, which parses a workflow into a script with the same logic as a manually written script. This way of integrating extends the functionality of the parallel ETL tool while ensuring that the system can still provide a unified graphical user interface.

3. Through analysis of how Hive and Pig implement their optimization mechanisms, design and implement our own mechanism. A rule is designed as a pair of a matching pattern and a corresponding operation, and the machinery for matching rules and walking the plan is isolated and abstracted. With this design, optimization rules can be extended easily.

KEY WORDS: ETL, extensibility, MapReduce, Hive, optimization rule

Contents

Chapter 1: Introduction (research background and significance; related research status; research content and results; thesis structure)
Chapter 2: Related concepts and technologies (ETL; Hadoop; HDFS; MapReduce; parallel ETL; Hive)
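To make the two-phase model described in the MapReduce sections above concrete (a map phase emitting key/value pairs, a shuffle that groups the pairs by key, and a reduce phase aggregating each group), here is a minimal, Hadoop-free word-count sketch in plain Java. The class and method names are this sketch's own, not Hadoop's API, and the shuffle is simulated with an in-memory map rather than a distributed sort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process simulation of the MapReduce programming model:
// map emits (key, value) pairs, shuffle groups them by key,
// reduce aggregates each group into one output value.
public final class MiniMapReduce {

    // Map phase: emit a (word, 1) pair for every word in the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Reduce phase: sum the grouped values belonging to one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Drives the "job": map every line, shuffle (group by key), then reduce.
    public static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffled.forEach((key, values) -> result.put(key, reduce(values)));
        return result;
    }
}
```

In real Hadoop the same roles are played by a Mapper class and a Reducer class, with the framework performing the shuffle between the two phases across the cluster.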