Parallel Architecture


In the parallel architecture, the key is choosing an effective partitioning and collection strategy. Partitioning is mainly intended to divide a larger dataset into smaller (preferably equal-sized) datasets. Once the data is partitioned, the same solution is applied to these smaller datasets at the same time, taking advantage of parallel architectures like grid computing, SMP, MPP, clusters, NUMA, etc. to provide fast computation and better throughput (in case the job was I/O limited!!). For example, suppose we have 2000 products and 2000 manufacturers, held in two separate tables, and the requirement says that we need to join these tables to create a new set of records. By brute force, the join will take 2000 x 2000 / 2 = 2,000,000 steps. Now, if we divide the products into 4 partitions, each partition performs 500 x 2000 / 2 = 500,000 steps, so the total is still 4 x 500,000 = 2,000,000 steps. Assuming the number of steps is directly proportional to the time taken, and that we have sufficient computing power to run 4 separate instances of the join, we can complete this task in roughly 1/4th of the time. This is what we call the power of partitioning.
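
To make the arithmetic concrete, here is a tiny Python sketch (not DataStage code; the step-count model and the numbers are just those from the example above):

```python
# Toy cost model from the example: a brute-force join costs rows_a * rows_b / 2 "steps".
products, manufacturers, partitions = 2000, 2000, 4

total_steps = products * manufacturers // 2                          # 2,000,000
steps_per_partition = (products // partitions) * manufacturers // 2  # 500,000

print(total_steps, partitions * steps_per_partition, steps_per_partition)
# Total work is unchanged (4 * 500,000 = 2,000,000), but with 4 nodes running
# the join concurrently the elapsed time drops to roughly a quarter.
```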

There are mainly 8 different kinds of partitioning supported by the Enterprise Edition (I have excluded DB2 from the list for now). Which of these partitioning mechanisms to use depends completely on what kind of data distribution you have, or will have. Some of the questions you need to ask yourself are:

  1. Do we need to have the related data together?
  2. Do we need to look at the complete dataset at a time?
  3. Does it matter if we are working on a subset of the data?
  4. If a dataset is divided into subsets, will we ever need to collect it back and think about partitioning again?

If you can answer these questions, then you have already won half of the battle. Now you need to know the options available in DataStage and, based on that, decide which one suits you the most. DataStage provides the following partitioning methods:
1) Round Robin: This is useful when the need is to have (almost) equal-sized partitions. The dataset is divided among the different partitions – the first record going to the first partition (processing node), the second record going to the second partition (processing node), and so on until the last processing node, then looping back to the first, as long as there is data left to distribute. Generally DataStage will use this method to start with.
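
As a rough illustration – a minimal Python sketch, not the DataStage implementation – round robin simply cycles through the partitions as records arrive:

```python
# Round-robin assignment: record i goes to partition i modulo the partition count.
def round_robin_partition(records, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions

print(round_robin_partition(list(range(10)), 4))
# [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]] -- partition sizes differ by at most one record
```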

2) Random: Similar to Round Robin partitioning, this also gives almost equal-sized partitions. However, there is a small overhead of generating a random number to decide which partition a record will go to. So, when do we use random partitioning? Since its behaviour is similar to Round Robin, and Round Robin seems to be more efficient than random partitioning, there must be a scenario where Random partitioning is the useful one.

3) Same: This method is useful when data stays within the same processing node and is passed across different stages without any need to repartition the data. Obviously, this is the fastest partitioning method.

4) Entire: In this case the dataset is not partitioned at all. Rather, the complete dataset is sent to all the processing nodes. This is especially useful when the dataset is acting as a lookup table.

5) Hash by fields: This is used to group related records into the same partition (processing node). The idea is that related records will have the same primary key (one column) and secondary keys (any number of columns, of course fewer than the total number of columns). Thus the hash key calculation will produce the same value for them, and the records will be grouped into a common partition.
Make sure that you do not choose as the hash key fields which contain only a limited set of values, like Yes/No/Maybe or M/F, etc.
Also, note that the hash key calculation for a single-column key will be faster than for keys containing one primary key and one or more secondary keys. One example could be removing duplicate records from a record set. In this case, if the data is not grouped (or not grouped on the right key), then removing duplicates will be quite challenging and error-prone. However, if the data is grouped and sorted on the correct key, then removing duplicates is quite easy.
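
Conceptually – a minimal Python sketch with made-up column names, not DataStage's internal hash function – the grouping works like this:

```python
# Hash partitioning: records with the same key always land in the same partition,
# so work such as de-duplication can be done locally within each partition.
def hash_partition(records, key_columns, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        key = tuple(record[c] for c in key_columns)
        partitions[hash(key) % num_partitions].append(record)
    return partitions

rows = [{"cust_id": 7, "city": "Pune"},
        {"cust_id": 7, "city": "Pune"},
        {"cust_id": 3, "city": "Delhi"}]
parts = hash_partition(rows, ["cust_id"], num_partitions=4)
# Both cust_id = 7 rows end up in the same partition, so a local de-duplication is safe.
```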

6) Modulus: If the data contains a numeric field such that taking its modulus against the number of partitions returns balanced partitions, then this method is suitable. The data is partitioned on one numeric field by calculating its value modulo the number of partitions. Remember that a modulus calculation will be significantly faster than a hash key calculation.
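
The difference from hash partitioning is that the numeric key value is used directly; a small Python sketch (with a hypothetical field name) of the same idea:

```python
# Modulus partitioning: partition = numeric_key_value % number_of_partitions.
# Balance depends entirely on how evenly the key values are spread.
def modulus_partition(records, numeric_key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[record[numeric_key] % num_partitions].append(record)
    return partitions

rows = [{"order_id": n} for n in range(1, 11)]
print([len(p) for p in modulus_partition(rows, "order_id", 4)])
# [2, 3, 3, 2] -- well balanced here because the order_id values are consecutive
```
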
7) Range: This technique is used to divide the record set into approximately equal-sized partitions. Basically, if a key column (there can be more than one partitioning key) has a value in a given range, then the record goes into the corresponding partition.
If a dataset is unsorted and the next stage needs completely sorted data (or possibly the next stage is performing a total sort), then range partitioning is the preferred choice. In addition to giving equal-sized partitions, range partitioning also makes sure that related records (records with the same/similar partitioning key) are in the same partition. Thus processing will be balanced!!
To use range partitioning, create a range map using the Write Range Map stage. Basically, the range map calculates the partition boundaries for the ranges. The range intervals may therefore be of different widths, as they depend on the data. The Write Range Map partitioner uses a probabilistic splitting technique to find the range boundaries. Probabilistic splitting mainly consists of the following steps (a small sketch follows the list):

  1. The goal is to find an approximate splitting vector for the record set. (Remember that the dataset will be some file on the hard disk.)
  2. Compute a random subset of the dataset.
  3. Find an exact splitting vector for the random subset.
  4. Use this splitting vector (exact for the subset, approximate for the full record set) to split the whole file into roughly equal-sized partitioned datasets.
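
A very small Python sketch of the same idea (the sample size, quantile choice, and field name here are arbitrary; the Write Range Map stage does this at a much larger scale):

```python
import bisect
import random

# Probabilistic splitting: derive approximate range boundaries from a random
# sample, then route every record by where its key falls between the boundaries.
def build_range_map(records, key, num_partitions, sample_size=100):
    sample = sorted(r[key] for r in random.sample(records, min(sample_size, len(records))))
    # Boundaries at the sample's quantiles: num_partitions - 1 cut points.
    return [sample[(i * len(sample)) // num_partitions] for i in range(1, num_partitions)]

def range_partition(records, key, boundaries):
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        partitions[bisect.bisect_right(boundaries, r[key])].append(r)
    return partitions

data = [{"sales_amt": random.randint(0, 10_000)} for _ in range(5_000)]
boundaries = build_range_map(data, "sales_amt", num_partitions=4)
print([len(p) for p in range_partition(data, "sales_amt", boundaries)])
# Roughly 1,250 records per partition, and each partition holds a contiguous key range.
```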

8) Auto: DataStage decides which partitioning method will be suitable in a given context. If you are not sure about the advantage of choosing a new partitioning method for a downstream stage, then let DataStage decide which partitioning to apply. This is the most commonly seen partitioning method (especially on intermediate stages) in DataStage jobs.

Now that you know the available partitioning methods, and most probably have decided which one will be applicable to which dataset, let us also try to understand the collection concepts. You may want to collect data in the following scenarios:
— When there is a sequential stage in your job, which forces you to collect all the partitioned data.
— When the job execution completes and you want to write the final output into a file/table.

There are mainly 4 collecting methods.

1) Round Robin: Exactly opposite to Round Robin partitioning, it reads one record from each partition in turn. If a partition is exhausted, then that partition is skipped on subsequent passes. This is not a very popular or frequently used collection method.

2) Ordered: This method collects the data in an ordered manner. What it means is that data from the first partition is collected first, then from the second, and so on. If the data is totally sorted within the partitions and the partitions themselves are in sorted order, then the data collected into a single file using this method will also be sorted. This collection method, in conjunction with range partitioning and sorting, will give you an overall sorted file.
Ordered collection is not necessarily used to collect sorted data. All it says is that it will collect the data in an ordered manner!!

3) Sorted Merge: Based on one (primary key) or more columns (secondary keys) of the already partitioned dataset, it reads records in the given key order.
— Typically used in conjunction with a Sort stage or already sorted data.
— The collection key should be the same as the key on which the partitioned dataset has been sorted.
As you might expect, the data type of the key on which a record is partitioned or collected should not be raw, subrec, tagged, or vector. Unlike Ordered collection, it doesn't require all the partitions to be in a given order to be able to give you sorted data.
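
The behaviour is essentially a k-way merge of already-sorted partitions; in Python terms (a sketch using heapq.merge, not the DataStage engine):

```python
import heapq

# Sort-merge collection: each partition is already sorted on the collection key,
# so merging them by that key yields one globally sorted stream without a re-sort.
partitions = [
    [("Argentina", 10), ("India", 3)],   # partition 0, sorted on the key
    [("Brazil", 7), ("Norway", 1)],      # partition 1, sorted on the key
    [("Canada", 5), ("Zambia", 2)],      # partition 2, sorted on the key
]
collected = list(heapq.merge(*partitions, key=lambda rec: rec[0]))
print([rec[0] for rec in collected])
# ['Argentina', 'Brazil', 'Canada', 'India', 'Norway', 'Zambia']
```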

4) Auto: This is the fastest collection method and the most frequently seen too. DataStage automatically decides when to collect records. In the process, if it detects that the collected data needs to be sorted, then it will do that as well.

11 comments on "Parallel Architecture"
  1. Ankita says:

    I am getting so much information from your blog which I didn't get even after working for 2 yrs in DataStage. Thanks 🙂

  2. GeneTinsley says:

    Your blog is so informative … ..I just bookmarked you….keep up the good work!!!!

  3. Great site…keep up the good work.

  4. wtcindia says:

    Given a source DataSet, describe the degree of parallelism using auto and same partitioning.
    Auto partitioning can be used to let DataStage decide the most suitable partitioning for a given flow. In this case the degree of parallelism mainly depends on:
    o How many logical processing nodes we have.
    o What the suitable partitioning method is at the given stage.
    o What kind of node pool or resource pool constraints have been put on the given stage.
    There are mainly two factors which decide the partitioning method that DataStage will opt for:
    o How the data has been partitioned by the previous stages.
    o What kind of stage the current stage is – i.e. what kind of data requirement does it have?

    If the current stage doesn't have any specific data setup requirement and it doesn't have any previous stage, then DataStage typically uses Round Robin partitioning to make sure that data is evenly distributed among the different partitions.

    Same partitioning is a little different from auto partitioning. In this case the user is instructing DataStage to use the same partitioning as was used by the previous stages. So, the partitioning method at the current stage is actually determined by the partitioning configuration of the previous stages. This is generally used when you know that keeping the same partitioning will be more effective than allowing DataStage to pick a suitable partitioning, which may cause repartitioning. So, if you are using this partitioning, then data flows inside the same processing node. No redistribution of data occurs!! The degree of parallelism is decided by:
    o The configuration of the previous stages.
    o The constraints on the current stage's nodes. In some cases there is no reason why one should go for Same partitioning at all: suppose the previous stages had data partitioned across 5 processing nodes, while the current stage is constrained to run on only 3 nodes; then using Same partitioning doesn't make sense and the data has to be repartitioned.

  5. wtcindia says:

    What is the purpose and use of resource and node pools?
    Before we discuss what resource pools and node pools are, it is important to understand the meaning of a pool.
    o A pool is a set of nodes or resources which can be used when configuring DataStage jobs to make them run on suitable nodes and use appropriate resources while processing data. A node pool is used to create a set of logical nodes on which a job can run its processes. Basically, all the nodes with similar characteristics are placed into the same node pool. Depending on the hardware, software, and software licenses available, you create a list of node pools. During job design you decide which stage will be suitable on which nodes, and thus you constrain the stage to run on specific nodes. This is needed; otherwise your job will not be able to take full advantage of the available resources.
    o Example – you do not want to run a Sort stage on a machine which doesn't have enough scratch disk space to hold the temporary files.
    o It is very important to declare node pools and use node constraints in such a way that they don't cause any unnecessary repartitioning. A resource pool is mainly used to decide which node will have access to which resources.
    o E.g. if a job needs a lot of I/O operations, then you need to assign high-speed I/O resources to the node on which you plan to run the I/O-intensive processes.
    o Similarly, if a machine has SAS installed on it (in an MPP system) and you have to extract/load data into a SAS system, then you may like to create a SAS_NODEPOOL containing information about these machines.

  6. wtcindia says:

    Given a job design and configuration file, provide estimates of the number of processes generated at runtime.
    The number of processes generated at run time depends on various factors. Moreover, there is no straightforward formula to give the exact number of processes created. Here are the main factors that affect the creation of processes:
    In the ideal case, where all the stages are running on all the processing nodes and there are no constraints on hardware, I/O devices, processing nodes, etc., DataStage will start one process per stage per partition.
    o This definitely sounds like straightforward mathematics. If there are N partitions (i.e. N logical nodes) and M stages in the job, then approximately M * N processes will be created. Remember that DataStage will also start group-leader processes and a process on the conductor node to keep the communication uniform and consistent; so the number of processes could actually be more than M * N (a toy estimate follows this list). The configuration file is the first place where we put the information regarding which nodes a stage can run on and which it cannot. In fact, it is not specified directly inside the configuration file; rather, you declare node pools & resource pools in the configuration file, and thus you have a fair idea about what kind of stages can be run on these pools.
    o You also have the choice of not making a node or resource part of the default pool, so you can run processes selectively on a given node. In this case, in the job design you select the list of nodes on which a stage can run; this actually determines how many processes will be started. Even here there are further constraints: it is quite possible that your job is constrained to run on a specific set of nodes. In that case, the processes will run on the nodes common to the job as well as to the current stage.
    o Ultimately the number of processes depends on how many partitions of the data are being created. That again depends on whether data is being partitioned or collected and what kind of constraints are associated with the current stage. In addition to node pool and resource pool constraints, you can also specify node map constraints. This basically forces the parallel execution to run on the specified nodes. In fact, it is equivalent to creating another node pool (in addition to whatever already exists in the configuration file) containing the set of nodes on which this stage can run. Processes will be started accordingly. Also, note that if a stage is running in sequential mode, then only one process will run for it, and that too on the conductor node.
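
    A back-of-the-envelope calculation for the unconstrained case described above – a hypothetical Python helper, not anything DataStage exposes; the assumption of one group-leader process per node is just that, an assumption:

    ```python
    # Rough estimate: one player process per stage per partition, plus group-leader
    # processes (assumed here to be one per logical node) and a single conductor.
    def estimate_processes(num_stages: int, num_nodes: int) -> int:
        players = num_stages * num_nodes
        leaders = num_nodes        # assumption: one leader process per node
        conductor = 1
        return players + leaders + conductor

    print(estimate_processes(num_stages=6, num_nodes=4))  # 6*4 + 4 + 1 = 29
    ```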

  7. wtcindia says:

    Given a scenario, how to identify the partitioning type and parallel/sequential execution by analyzing a DataStage EE screenshot.
    Each parallel stage in a job can partition or repartition incoming data, or accept same-partitioned data, before it operates on it. There is an icon on the input link to a stage (called link marking) which shows how the stage handles partitioning. While deciding about the parallel or sequential mode of operation at a given stage, this information is really useful. We have the following types of icons:
    o None: shows sequential-to-sequential flow.
    o Auto: shows that DataStage will decide the most suitable partitioning for the stage.
    o Bow Tie: shows that repartitioning has occurred, mainly because the downstream stage has different partitioning needs. Basically parallel-to-parallel flow. Well – it is still parallel to parallel; however, sometimes it is as bad as sequential.
    o Fan Out: shows that data is being partitioned. It means that either this is the start of the job, or before this stage the execution was in sequential mode. Basically sequential-to-parallel flow – if the stage is in the middle of the job flow.
    o Same (box): shows that the next stage is going to use the same partitioning. Basically parallel-to-parallel flow.
    o Fan In: shows that data is being collected at this stage. Basically parallel-to-sequential flow.

  8. wtcindia says:

    Describe input of partitioning and re-partitioning in an MPP/cluster environment.
    Input to partitioning is:
    1) The number of processing nodes available to the stage and the properties specified on the partitioning tab of that particular stage.
    2) The partition type used by the stage.

    Input to repartitioning is:
    1) The number of processing nodes available to the current stage. If the number of processing nodes for the current stage is different from (especially fewer than) that of the previous stages, then DataStage decides to repartition the data.
    2) A different partition type being used by the current stage than the one used by the previous stages.
    3) A data requirement such that data from different partitions needs to be regrouped (even though the same partition type is used!!). E.g. suppose customer data is grouped based on the age of the customers; however, the current stage needs to know about all the customers belonging to the current state (possibly based on state_cd). Now we do need to repartition the data.

    For better performance of your job, you must try to avoid repartitioning as much as possible. Remember that repartitioning involves two steps (which may be unnecessary and taxing), so understanding where, when and why repartitioning occurs in a job is necessary for being able to improve its performance (a small sketch follows the list):
    o First, collect all the data being processed by the different processing nodes.
    o Then partition the data based on the new partitioning rules specified/identified.
    o On MPP or cluster systems, there is the additional overhead of passing the data over the network during the repartitioning.
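
    As a sketch – plain Python using the hypothetical state_cd field from the example above, not DataStage code – repartitioning amounts to collecting every existing partition and then redistributing on the new key, which is exactly why it is expensive:

    ```python
    # Repartitioning sketch: data currently partitioned on customer age has to be
    # collected (step 1) and redistributed on state_cd, the new grouping key (step 2).
    def repartition(old_partitions, new_key, num_partitions):
        collected = [rec for part in old_partitions for rec in part]     # step 1: collect
        new_parts = [[] for _ in range(num_partitions)]
        for rec in collected:                                            # step 2: redistribute
            new_parts[hash(rec[new_key]) % num_partitions].append(rec)
        return new_parts  # on MPP/cluster systems both steps also move data over the network

    old = [[{"age": 25, "state_cd": "KA"}],
           [{"age": 62, "state_cd": "KA"}, {"age": 40, "state_cd": "MH"}]]
    new = repartition(old, "state_cd", num_partitions=2)
    # All "KA" customers now sit in the same partition, regardless of age.
    ```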

  9. wtcindia says:

    What are the differences between the Funnel stage and a collector?

    The Funnel stage copies different datasets into a single output dataset, while a collector collects the different partitions of the same dataset. In fact, a collector works on a single link – divided among different processing nodes – to collect the dataset from its different partitions.
    The Funnel stage can run in parallel as well as sequential mode. In parallel mode the different input datasets to the Funnel stage will be partitioned and processed on different processing nodes; the processed partitions will then be collected and funnelled into the final output dataset. This can be controlled from the partitioning tab of the Funnel stage. In sequential mode, if the Funnel stage is in the middle of the job design, then it first collects all the data from the different partitions and then funnels all the incoming datasets.
    Of course, the metadata for all the incoming inputs to the Funnel stage should be the same for the Funnel stage to be able to work. However, the Funnel stage allows you to specify (through the mapping tab) how output columns are derived from the input columns. This is something a simple collector doesn't do. Remember that the metadata needs to be the same for all the input columns, so only one set of input columns is shown, and that too is read-only.
    The Funnel stage mainly has three modes in which it operates: Continuous Funnel (similar to round robin – reads one record at a time from each link), Sort Funnel, and Sequence Funnel (similar to Ordered collection).

  10. wtcindia says:

    There may be situations where there are differences between the partitioning keys and the sorting (stage) keys. For example, suppose we have customer data containing the following fields:
    o CustomerID
    o CompanyName
    o ContactName
    o ContactAddress
    o City
    o Region
    o PostalCode
    o Country
    o Phone
    o SalesAmt

    Initially, the data was partitioned using CompanyName as the partition key. Some of the stages did the needed processing. However, the next stage needs data sorted on Country (primary) and City (secondary), so that one can focus on a particular market area.
    — This definitely demands that data be collected and repartitioned.
    — The repartitioning should be based on grouping on Country and City.
    Of course, there may be the option of choosing an appropriate partitioning method. But this step forces the job to behave sequentially at this point – the purpose of parallelism is lost!! Is there any alternative?
    The answer is – to some extent yes, but it is not clear how much advantage it will have. Sort the already partitioned data on Country and City as keys (during the sorting, the partitioning type must stay the same; otherwise DataStage will try to repartition the datasets). Now use the sorted merge collection method. Well, the data is still being collected!! The job becomes sequential at that point. However, no further repartitioning will be needed!! Unless there is another mismatch.
