In the parallel architecture, the key is in choosing an effective partitioning and collection strategy. Mainly partitioning is intended to divide the larger dataset into smaller (preferably equal sized) datasets. Once data is partitioned the same solution is applied on these smaller datasets at the same time to take the benefit of parallel architectures like Grid computing, SMP, MPP, Clusters, NUMA, etc. to provide fast computation and better throughput (in case job was I/O limited!!). For example, Suppose we have 2000 products and 2000 manufacturer. These two details are in two separate tables. If the requirement says that we need to join these tables and create a new set of records. By brute force, it will take 2000 X 2000/2 = 20,00,000 steps. Now, if we divide this into 4 partitions, it will take 4 * 500 * 2000/2 = 20,00,000 steps. Assuming the number of steps is directly proportional to time taken and we have sufficient computing power to run 4 separate instances of the jobs performing join; we can complete this task in effectively 1/4th time. This is what we call power of partition.
There are mainly 8 different kind of partitioning supported by enterprise edition (I have excluded DB2 at this moment from the list.). Usage of these partition mechanism completely depends on what kind of data distribution you have or you will have. Some of the questions you need to ask yourself are
- Do we need to have the related data together?
- Do we need to have a look at complete dataset at a time?
- Does it matter if we are working on subset of data?
- If a data set is divided into subsets then do you ever need collect and again think about partitioning
If you can answer following questions, then you have already won half of the battle. Now you need to know, the options available in datastage and based on that decide, which one suits you the most. DataStage provides you following partitioning methodologies:
1) Round Robin This is useful when the need is to have equal (almost) sized partitions. DataSets will be divided among different partitions – first record going to first partition (processing node), second record going to second partition (processing node), …till the last processing node … again looping back, as long as we have data to be divided into partitions. Generally DataStage will use this method to start with.
2) Random Similar to Round Robin partitioning, this also gives almost equal sized partitions. However, there is a small overhead of generating of random number to decide, which partition the data will go. So, when do we use random partitioning? Since its behavior is similar to Round Robin and round robin seems to be more efficient than random partitioning, there must be a scenario when Random Partitioning should be useful.
3) Same This method is useful when data stays within same processing node and passed across to different stages without a need to repartition the data. Obviously, this is the fastest partitioning method.
4) Entire In this case dataset is not at all partitioned. Rather complete dataset is sent to all the processing nodes. This is specially useful when the dataset is acting like a lookup table.
5)Hash by fields It is used to group the related records into the same partition (processing node). The idea is that the related records will have similar primary key (one column) and secondary key (can be any number of columns, of course less than total number of columns). Thus hash key calculation will provide same value for them and the records will be grouped in a common partition.
Make sure that you do not choose hash key as fields which contains limited values like Yes/No/Maybe or M/F, etc.
Also, note that hash key calculation for single column key will be faster than the keys containing one primary key and one or more secondary key. One of the example could be to remove duplicate records from the record set. In this case, if the data are not grouped ( or not grouped on the right key) then removing duplicates will be quite challenging and erroneous. However, if the data are grouped and sorted on the correct key then removing duplicate is quite easy.
6) Modulus If the data needs to be grouped based on the understanding that data will contain some information so that modulus (#processing node on which partition executes) will return balanced partition then this method is suitable. Actually, data will be partitioned on one numeric field by calculating modulus against number of partitions. Remember that modulus calculation will be significantly faster than hash key calculation.
7) Range This technique is used to divide the record set into approximately equal sized partitions. Basically, if a key column ( can have more than one partitioning keys) has value in a given range then it goes into specific partition.
If a dataset is unsorted and the next stage needs completely sorted data ( or possibly next stage is performing total sort) then range partitioning is preferably used. In addition to giving equal sized partition, the range partition also make sure that related records (records with same/similar partitioning key) are in the same partition. Thus processing will be balanced!!
To use range partitioning create a range map using Write Range Map Stage. Basically, range map calculates partition boundary for the ranges. So, range intervals may be different as it will depend on the data. Probabilistic splitting technique is used by Write Map Range partitioner to find out the range boundaries. Probabilistic splitting technique mainly consists of following steps:
- To find an approximate splitting vector for the record set. (Remember that dataset will be some file on the hard disk.)
- Compute random subset of the dataset.
- Find an exact splitting vector for the random subset.
- Use this exact splitting vector to split the whole file into equal sized partitioned datasets.
8. Auto DataStage decides which partitioning will be suitable in a given context. If you are not sure about the advantage of choosing new partitioning method for a down stream stage then let datastage decide which partitioning to apply. This is the most commonly seen partitioning method (especially in intermediate stages) in datastage.
Now that you know the available partitioning methods and most probably you might have decided about which one will be applicable for which dataset, lets also try to understand the collection concepts. You may like to collect data in following scenarios:
— When there is a sequential stage in your job, which forces you to collect all the partitioned data.
— When the job execution completes and you want to write the final output into a file/table.
There are mainly 4 collecting methods.
1) Round Robin Exactly opposite to Round Robin partitioning, it reads one record from each partition. If any partition is exhausted then next time that partition is skipped. This is not very popular and not frequently used collection method.
2) Ordered This method collects the data in ordered manner. What it means is that data from first partition is collected first then from the second and so on. If the data is totally sorted in the partitions and partitions themselves are in sorted manner (ordered) then the data collected using this method into a single file will also be sorted. This sorting method in conjunction with range partitioning and sorting will give you the over all sorted file.
Ordered collection is not necessarily used to collect sorted data. All it says that it will collect data in ordered manner!!
3) Sorted Merge Based on one (primary key) or more columns (secondary keys) of already partitioned dataset record, it will read records in a given order.
— Typically used in conjunction with sort stage or sorted data.
— The collection key should be same as the key using which the partitioned dataset have been sorted.
As you must be expecting that data type for the key on which a record is partitioned or collected should not be raw, subrec, tagged or vector. Unlike Ordered collection, it doesn’t ask you to have all the partitions in a given order to be able to provide you a sorted data.
4) Auto This is the fastest collection method and most frequently seen too. DataStage automatically decides when to collect the record. In that process if it detects that collected data needs to be sorted then it will also do that.