KM404_Course_Guide.pdf - Course Guide: IBM InfoSphere Advanced DataStage Parallel Framework v11.5 (Course code KM404, ERC 1.0), IBM Training

Course Guide
IBM InfoSphere Advanced DataStage Parallel Framework v11.5
Course code KM404 ERC 1.0
IBM Training
April, 2016

NOTICES

This information was developed for products and services offered in the USA. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive, MD-NC119
Armonk, NY 10504-1785
United States of America

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

TRADEMARKS

IBM, the IBM logo, and ibm.com, DataStage and InfoSphere are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at .

Adobe, and the Adobe logo, are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.

Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.

© Copyright International Business Machines Corporation 2016. This document may not be reproduced in whole or in part without the prior written permission of IBM. US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

© Copyright IBM Corp. 2005, 2016 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

Contents

Preface
  Contents
  Course overview
  Document conventions
  Additional training resources
  IBM product help

Introduction to the parallel framework architecture
  Unit objectives
  Why study the parallel architecture?
  What we need to master
  DataStage parallel job documentation
  Key parallel concepts
  Scalable hardware environments
  Drawbacks of traditional batch processing
  Pipeline parallelism
  Partition parallelism
  Partitioning illustration
  DataStage combines partitioning and pipelining
  Job design versus execution
  Defining parallelism
  Configuration file
  Example configuration file
  Generating mock data
  Job design for generating mock data
  Specifying the generating algorithm
  Inside the Lookup stage
  Configuration file displayed in job log
  Checkpoint
  Checkpoint solutions
  Demonstration 1: Introduction to the parallel framework architecture
  Unit summary

Compiling and executing jobs
  Unit objectives
  Parallel job compilation
  More about compiling a Transformer job
  Generated OSH
  Stage to OSH operator mappings
  Generated OSH primer
  Virtual data sets in the OSH
  DataStage GUI versus OSH terminology
  Configuration file
  Processing nodes (partitions)
  Configuration file format
  Primary elements of a configuration file
  Sample configuration file
  Resource pools
  Sorting resource pools
  Another configuration file example
  Constraining operators to specific node pools
  Configuration editor
  Parallel job startup
  Parallel job run time
  Viewing the job Score
  Example job Score
  Job execution: The orchestra metaphor
  Runtime control and data networks
  Parallel data flow
  Monitoring job startup and execution in the log
  Counting the total number of processes
  Peeking at the data stream
  Peeking at the data stream design
  Using Transformer stage variables
  Checkpoint
  Checkpoint solutions
  Demonstration 1: Compile and execute jobs
  Unit summary

Partitioning and collecting data
  Unit objectives
  Partitioning and collecting
  Partitioning and collecting icons
  Partitioners
  Where partitioning is specified
  The Score
  Viewing the Score operators
  Interpreting the Score partitioning
  Score partitioning example
  Partition numbers
  Partitioning methods
  Selecting a partitioning method
  Same partitioning algorithm
  Caution regarding Same partitioning
  Round Robin and Random
  Parallel runtime example
  Entire partitioning
  Hash partitioning
  Unequal distribution example
  Modulus partitioning
  Range partitioning
  Using Range partitioning
  Example partitioning icons
  Auto partitioning
  Preserve Partitioning flag
  Partitioning strategy
  Collecting data
  Collectors
  Specifying the collector method
  Collector methods
  Sort Merge example
  Non-deterministic execution
  Choosing a collector method
  Collector method versus Funnel stage
  Parallel number sequences
  Row Generator sequences of numbers
  Generated numbers
  Transformer example using @INROWNUM
  Transformer example using parallel variables
  Header and detail processing
  Job design
  Inside the Transformer
  Examining the Score
  Difficulties with the design
  Examining the Score
  Generating a header detail data file
  Inside the Column Export stage
  Inside the Funnel stage
  Checkpoint
  Checkpoint solutions
  Demonstration 1: Read data with multiple record format
  Unit summary

Sorting data
  Unit objectives
  Traditional (sequential) sort
  Parallel sort
  Example parallel sort
  Stages that require sorted data
  Parallel sorting methods
  In-stage sorting
  Sort stage
  Stable sorts
  Resorting on sub-groups
  Don't sort (previously grouped)
  Partitioning and sort order
  Global sorting methods
  Inserted tsorts
  Changing inserted ...