Persistence Storage

Given a scenario, explain the process of importing/exporting data to/from the framework (e.g., sequential file, external source/target).
1. Explain the use of the various file stages (e.g., Sequential File, CFF, File Set, Data Set) and when it is appropriate to use each
2. For USS, define the native file format (e.g., EBCDIC, VSAM)
You can export/import data between different frameworks. However, you must make sure that you provide the appropriate metadata (e.g., column definitions, formatting rules) needed for exporting/importing the data.

Use Of DataSet Stage
1. In parallel jobs, data is moved around in data sets. These carry metadata with them – column definitions and information about the configuration that was in effect when the data set was created.
1. This information is used by DataStage when passing data on to the next stage and when deciding whether the same partitioning can be kept or repartitioning is required. For example, if you have a stage which limits execution to a subset of available nodes, and the data set was created by a stage using all nodes, DataStage can detect that the data will need repartitioning.
2. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs. Persistent data sets are stored in a series of files linked by a control file, so never try to manipulate these files directly with operating-system commands like rm, mv, tr, etc., as doing so will corrupt the control file. If needed, use the Data Set Management utility to manage data sets.
3. Data sets are operating system files.
4. Using data sets wisely can be key to good performance in a set of linked jobs.
5. DataSet stage allows you to read from and write to dataset (.ds) files.

Use of FileSet Stage
1. It allows you to read data from or write data to a file set.
2. It only operates in parallel mode.
3. DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, “.fs”.
4. A file set is really useful when the operating system limits the size of a data file to 2 GB and you need to distribute data among files on different nodes to prevent overruns.
5. The number of files created by a file set depends on:
1. The number of processing nodes in the default node pool.
2. The number of disks in the export or default disk pool connected to each processing node in the default node pool.
3. The size of the partitions of the data set.
6. The amount of data that can be stored in the destination file is limited by:
1. The characteristics of the file system and
2. The amount of free disk space available.
7. Unlike data sets, file sets carry formatting information that describes the format of the files to be read or written.
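The core idea of a file set (spreading one logical data set across many physical data files, each kept under the size limit, with a control file listing the parts) can be sketched as follows. This is a simplified Python illustration, not DataStage's actual mechanism; the function name and ".part" naming are hypothetical:

```python
import os

MAX_FILE_SIZE = 2 * 1024**3  # illustrative 2 GB limit per data file

def write_file_set(records, prefix, max_size=MAX_FILE_SIZE):
    """Write records across numbered data files, starting a new file
    whenever the current one would exceed max_size, and list the data
    files in a '<prefix>.fs' control file (mimicking a file set)."""
    paths, current, used = [], None, 0
    for rec in records:
        if current is None or used + len(rec) > max_size:
            if current:
                current.close()
            path = "%s.part%03d" % (prefix, len(paths))
            paths.append(path)
            current = open(path, "wb")
            used = 0
        current.write(rec)
        used += len(rec)
    if current:
        current.close()
    with open(prefix + ".fs", "w") as fs:  # control file listing the parts
        fs.write("\n".join(paths))
    return paths
```

Here `max_size` stands in for the operating system's file-size limit; DataStage additionally distributes the part files across the disks and processing nodes of the configuration, as described above.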
Sequential File Stage
1. The stage executes in parallel mode if reading/writing to multiple files but executes sequentially if it is only reading/writing one file.
2. You can specify that single files can be read by multiple nodes. This can improve performance on cluster systems.
3. You can specify that a number of readers run on a single node. This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node). Options 2 and 3 are mutually exclusive.
4. Generally used to read/write flat/text files.
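The effect of running multiple readers against a single file (point 3 above) can be sketched as follows: the file is divided into contiguous byte ranges, aligned to record boundaries, and each reader processes one range. A simplified Python illustration, assuming newline-terminated records (the function name is hypothetical, not a DataStage API):

```python
import os

def reader_ranges(path, num_readers):
    """Split a single file into contiguous byte ranges, one per reader.
    Ranges are aligned to line boundaries so no record is split."""
    size = os.path.getsize(path)
    chunk = size // num_readers
    ranges = []
    start = 0
    with open(path, "rb") as f:
        for _ in range(num_readers - 1):
            f.seek(start + chunk)
            f.readline()             # advance to the next record boundary
            end = f.tell()
            ranges.append((start, end))
            start = end
    ranges.append((start, size))     # last reader takes the remainder
    return ranges
```

Each reader can then seek to the start of its range and read records up to its end independently of the others.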

Lookup File Set Stage
1. It allows you to create a lookup file set or reference one for a lookup.
2. The stage can have a single input link or a single output link. The output link must be a reference link.
3. When performing lookups, Lookup File Set stages are used in conjunction with Lookup stages.
4. If you plan to perform lookups on a particular key combination, it is recommended to use this file stage. If another file stage is used for lookup purposes, the lookup becomes sequential.

External Source Stage
1. This stage allows you to read data that is output from one or more source programs.
2. The stage calls the program and passes appropriate arguments.
3. The stage can have a single output link, and a single rejects link.
4. It allows you to perform actions such as interfacing with databases not currently supported by DataStage Enterprise Edition.
5. External Source stages, unlike most other data sources, do not have inherent column definitions, so DataStage cannot always tell whether there are extra columns that need propagating. You can only use runtime column propagation (RCP) on External Source stages if you have used the Schema File property to specify a schema that describes all the columns in the sequential files referenced by the stage. You need to specify the same schema file for any similar stages in the job where you want to propagate columns.

External Target Stage (similar to the External Source stage, but it writes data to one or more target programs)
Complex Flat File Stage
1. Allows you to read or write complex flat files on a mainframe machine. This is intended for use on USS systems.
2. When used as a source, the stage allows you to read data from one or more complex flat files, including MVS datasets with QSAM (Queued Sequential Access Method) and VSAM (Virtual Storage Access Method, a file management system for IBM mainframe systems) files.
3. A complex flat file may contain one or more GROUPs, REDEFINES, OCCURS, or OCCURS DEPENDING ON clauses. Complex Flat File source stages execute in parallel mode when they are used to read multiple files, but you can configure the stage to execute sequentially if it is only reading one file with a single reader.
4. CFF files typically have a hierarchical structure or include legacy data types.
Native File Format For USS (UNIX System Services)
EBCDIC (Extended Binary Coded Decimal Interchange Code) is an 8-bit character encoding used on IBM mainframe operating systems.
ASCII (American Standard Code for Information Interchange) and EBCDIC are both ways of mapping computer codes to characters and numbers, as well as other symbols typically used in writing. Most current computers have a basic storage element of 8 bits, normally called a byte, which can have 256 possible values. 26 of these values are needed for A-Z and another 26 for a-z; 0-9 take up 10 more, and then there are many accented characters and punctuation marks, as well as control codes such as carriage return (CR) and line feed (LF). EBCDIC and ASCII both perform the same task, but they use different values for each symbol. For instance, in ASCII an 'E' is code 69, but in EBCDIC it is 197. Text conversion is therefore very easy; numeric conversion, however, is quite tricky. For example:
1. Text string : It is very simple and portable. A simple mapping can be used to map the string to the code and vice versa.
2. Binary : Binary numbers use the raw bytes in the computer to store numbers. Thus a single byte can store any number from 0 to 255; if two bytes (16 bits) are used, numbers up to 65535 can be stored. The biggest problem with this type of storage is how the bytes are ordered: Little Endian (used by Intel, least significant byte first, i.e. the high byte on the right), Big Endian (used by Motorola, the high byte on the left), or Native Endian. E.g. 260 in Little Endian is 04H 01H, while in Big Endian it is 01H 04H.
3. Packed decimal : In text mode each digit takes a single byte; in packed decimal, each digit takes just 4 bits (a nibble). These nibbles are packed together, and the final nibble represents the sign: C (credit) is +, D (debit) is –, and F is unsigned, i.e. +. The number 260 in packed decimal would be 26H 0FH (or 26H 0CH).
4. Floating point : Floating-point numbers are much harder to describe, but have the advantage that they can represent a very large range of values, including many decimal places. Of course there are some rounding problems as well.
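The binary and packed decimal representations above can be demonstrated concretely in Python (`pack_decimal` is a hypothetical helper written for this illustration, not a DataStage function):

```python
import struct

# Binary: the same 16-bit number 260 (0104H) in both byte orders.
little = struct.pack("<H", 260)   # least significant byte first: 04H 01H
big    = struct.pack(">H", 260)   # most significant byte first:  01H 04H

def pack_decimal(digits, sign="F"):
    """Pack a decimal digit string into packed-decimal bytes:
    one nibble per digit, with the sign nibble (C/D/F) last."""
    nibbles = digits + sign           # e.g. "260" + "F" -> "260F"
    if len(nibbles) % 2:
        nibbles = "0" + nibbles       # pad to a whole number of bytes
    return bytes(int(nibbles[i:i + 2], 16)
                 for i in range(0, len(nibbles), 2))
```

`pack_decimal("260")` produces the bytes 26H 0FH, matching the example above.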
The problem with ASCII/EBCDIC conversion when dealing with records that contain both text and numbers is that numbers must not be converted with the same byte-for-byte mapping that is used for text. The only truly portable way to convert such records is on a per-field basis: text fields are remapped from EBCDIC to ASCII, while numeric fields, packed decimal fields, etc. are converted to ASCII strings. How is this done? The only way to do an EBCDIC to ASCII conversion is with a program that has knowledge of the record layout. With DataStage, the details of each record structure are entered and how each field is to be converted can be set. Files with multiple record structures and multiple fields can then be converted on a field-by-field basis to give exactly the correct type of conversion, which makes this an ideal solution for EBCDIC/ASCII conversion, as all data is retained. Packed decimal fields are normally found in mainframe applications, often COBOL-related.
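A per-field conversion of the kind described above can be sketched in Python using the cp037 codec (one common EBCDIC code page). The record layout here is purely hypothetical: a 5-byte EBCDIC text field followed by a 2-byte packed-decimal amount:

```python
def convert_record(record):
    """Convert one fixed-layout EBCDIC record on a per-field basis."""
    name = record[:5].decode("cp037").rstrip()  # text: byte-for-byte remap
    packed = record[5:7].hex().upper()          # packed decimal: read the
    digits, sign = packed[:-1], packed[-1]      # nibbles, don't remap bytes
    amount = int(digits) * (-1 if sign == "D" else 1)
    return name, amount
```

Note that the text field is remapped through the code page (EBCDIC C5H becomes ASCII 'E'), while the packed-decimal field is decoded from its nibbles; running it through the character mapping instead would destroy the number.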
