A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis [1].
Dimension table:
In data warehousing, a dimension table is one of the set of companion tables to a fact table.
The fact table contains business facts or measures and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables.
The dimension tables contain attributes (or fields) used to constrain and group data when performing data warehousing queries.
Over time, the attributes of a given row in a dimension table may change. For example, the shipping address for a company may change. Kimball refers to this phenomenon as Slowly Changing Dimensions. Strategies for dealing with this kind of change are divided into three categories (see the sketch after this list):
- Type One - Simply overwrite the old value(s).
- Type Two - Add a new row containing the new value(s), and distinguish between the rows using Tuple-versioning techniques.
- Type Three - Add a new attribute to the existing row so that both the old and the new value are kept.
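As a rough illustration of the Type One and Type Two strategies, here is a minimal sketch using SQLite (the project itself uses Oracle; the dim_customer table, its columns and the dates are all hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical customer dimension with tuple-versioning columns for Type Two.
cur.execute("""
    CREATE TABLE dim_customer (
        customer_sk   INTEGER PRIMARY KEY,   -- surrogate key
        customer_id   TEXT,                  -- natural/business key
        ship_address  TEXT,
        valid_from    TEXT,
        valid_to      TEXT,                  -- NULL = still current
        is_current    INTEGER
    )""")
cur.execute("INSERT INTO dim_customer VALUES (1, 'C100', '12 Old St', '2009-01-01', NULL, 1)")

# Type One: simply overwrite the old value; history is lost.
cur.execute("UPDATE dim_customer SET ship_address = '99 New Ave' WHERE customer_id = 'C100'")

# Type Two: expire the current row and add a new row with a new surrogate key.
cur.execute("""UPDATE dim_customer
               SET valid_to = '2010-06-30', is_current = 0
               WHERE customer_id = 'C100' AND is_current = 1""")
cur.execute("""INSERT INTO dim_customer
               VALUES (2, 'C100', '7 Another Rd', '2010-07-01', NULL, 1)""")

for row in cur.execute("SELECT * FROM dim_customer ORDER BY customer_sk"):
    print(row)
```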
Fact table:
In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is often located at the centre of a star schema, surrounded by dimension tables.
Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined by a day, product and store. Other dimensions might be members of this fact table (such as location/region) but these add nothing to the uniqueness of the fact records. These "affiliate dimensions" allow for additional slices of the independent facts but generally provide insights at a higher level of aggregation (a region contains many stores).
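To make the idea of grain concrete, the sketch below declares the "Sales volume by Day by Product by Store" grain as a composite primary key and runs the kind of query a tester might use to confirm that no fact rows violate that grain. SQLite is used purely for illustration, and the table and column names are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# The composite primary key *is* the declared grain: one row per day/product/store.
cur.execute("""
    CREATE TABLE fact_sales (
        date_key     INTEGER,
        product_key  INTEGER,
        store_key    INTEGER,
        region_key   INTEGER,          -- "affiliate" dimension; not part of the grain
        sales_volume REAL,
        PRIMARY KEY (date_key, product_key, store_key)
    )""")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
                [(20100101, 1, 10, 5, 120.0),
                 (20100101, 2, 10, 5,  35.5),
                 (20100102, 1, 11, 6,  80.0)])

# Grain check: any day/product/store combination appearing more than once
# would mean the fact table was loaded at the wrong grain.
dupes = cur.execute("""
    SELECT date_key, product_key, store_key, COUNT(*) AS n
    FROM fact_sales
    GROUP BY date_key, product_key, store_key
    HAVING COUNT(*) > 1""").fetchall()
print("grain violations:", dupes)
```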
The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema.
The facts that the data warehouse helps analyze are classified along different dimensions: the fact tables hold the main data, while the usually smaller dimension tables describe each value of a dimension and can be joined to fact tables as needed.
Dimension tables have a simple primary key, while fact tables have a compound primary key composed of the relevant dimension keys.
It is common for dimension tables to consolidate redundant data and be in second normal form, while fact tables are usually in third normal form because all data depend on either one dimension or all of them, not on combinations of a few dimensions.
The star schema is a way to implement multi-dimensional database (MDDB) functionality using a mainstream relational database: given the typical commitment to relational databases of most organizations, a specialized multidimensional DBMS is likely to be both expensive and inconvenient.
Another reason for using a star schema is its simplicity from the users' point of view: queries are rarely complex, because the only joins and conditions involve a fact table and a single level of dimension tables, without the indirect dependencies on further tables that are possible in a more normalized snowflake schema.
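A minimal star-join sketch (again SQLite for illustration, with hypothetical names): every join goes directly from the fact table to a single level of dimension tables, which is what keeps the SQL simple:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, calendar_month TEXT);
    CREATE TABLE dim_store  (store_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (date_key INTEGER, store_key INTEGER, sales_amount REAL);

    INSERT INTO dim_date   VALUES (20100101, '2010-01'), (20100201, '2010-02');
    INSERT INTO dim_store  VALUES (10, 'East'), (11, 'West');
    INSERT INTO fact_sales VALUES (20100101, 10, 100.0), (20100101, 11, 50.0),
                                  (20100201, 10, 75.0);
""")

# One fact table joined to a single level of dimension tables -- the typical star join.
query = """
    SELECT d.calendar_month, s.region, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date  d ON d.date_key  = f.date_key
    JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY d.calendar_month, s.region
    ORDER BY d.calendar_month, s.region
"""
for row in cur.execute(query):
    print(row)
```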
Online transaction processing, or OLTP, refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. The term is somewhat ambiguous; some understand a "transaction" in the context of computer or database transactions, while others (such as the Transaction Processing Performance Council) define it in terms of business or commercial transactions.[1] OLTP has also been used to refer to processing in which the system responds immediately to user requests. An automated teller machine (ATM) for a bank is an example of a commercial transaction processing application.
Online analytical processing, or OLAP (pronounced /ˈoʊlæp/), is an approach to quickly answer multi-dimensional analytical queries.[1] OLAP is part of the broader category of business intelligence, which also encompasses relational reporting and data mining.[2] The typical applications of OLAP are in business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas. The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing).[3]
Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad-hoc queries with rapid execution times. They borrow aspects of navigational databases and hierarchical databases, which are faster than relational databases for this kind of analytical workload.[4]
The output of an OLAP query is typically displayed in a matrix (or pivot) format. The dimensions form the rows and columns of the matrix; the measures form the values.
At the core of any OLAP system is the concept of an OLAP cube (also called a multidimensional cube or a hypercube). It consists of numeric facts called measures which are categorized by dimensions. The cube metadata is typically created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.
Each measure can be thought of as having a set of labels, or meta-data associated with it. A dimension is what describes these labels; it provides information about the measure.
A simple example would be a cube that contains a store's sales as a measure, and Date/Time as a dimension. Each Sale has a Date/Time label that describes more about that sale.
Any number of dimensions can be added to the structure such as Store, Cashier, or Customer by adding a column to the fact table. This allows an analyst to view the measures along any combination of the dimensions.
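As a small illustration of a cube being sliced into the matrix/pivot output described above, the hypothetical fact records below carry one measure (sales amount) and three dimension labels (Date, Store, Cashier); the sketch aggregates the measure along the Date and Store dimensions:

```python
from collections import defaultdict

# Hypothetical fact records: (date, store, cashier, sales_amount).
facts = [
    ("2010-01", "Store A", "Ann", 100.0),
    ("2010-01", "Store B", "Bob",  50.0),
    ("2010-02", "Store A", "Ann",  75.0),
    ("2010-02", "Store B", "Cid",  25.0),
]

# Aggregate the measure along the chosen pair of dimensions (Date x Store).
matrix = defaultdict(float)
for date, store, _cashier, amount in facts:
    matrix[(date, store)] += amount

dates  = sorted({d for d, _ in matrix})
stores = sorted({s for _, s in matrix})

# Print the measure as a matrix: dimension values on the axes, measures as cells.
print("date   " + "  " + "  ".join(f"{s:>8}" for s in stores))
for d in dates:
    print(d + "  " + "  ".join(f"{matrix.get((d, s), 0.0):8.1f}" for s in stores))
```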
ETL Steps
Datalab sends a space-formatted flat file. The field layout is provided by the business, and the file is generated by Datalab according to that layout and sent to the ETL team.
The testing team should validate the data in the file. We can use Microsoft Access or Microsoft Excel to convert the space-formatted flat file into a readable format and validate the layout and the values in the file.
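The layout check can also be scripted. Below is a minimal sketch that assumes the file is space-delimited and uses a purely hypothetical three-field layout; the real field names and rules come from the field layout document supplied by the business:

```python
# Minimal validation sketch assuming the Datalab file is space-delimited.
# The field names and rules are hypothetical placeholders.
LAYOUT = [
    ("account_id", str.isalnum),                                # alphanumeric
    ("txn_amount", lambda s: s.replace(".", "", 1).isdigit()),  # numeric
    ("txn_date",   lambda s: len(s) == 8 and s.isdigit()),      # YYYYMMDD
]

def validate(path):
    """Return (line number, field, offending value) tuples for every bad field."""
    errors = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            fields = line.split()
            if len(fields) != len(LAYOUT):
                errors.append((lineno, "field_count", len(fields)))
                continue
            for (name, rule), value in zip(LAYOUT, fields):
                if not rule(value):
                    errors.append((lineno, name, value))
    return errors

if __name__ == "__main__":
    # Tiny sample feed so the sketch runs end to end.
    with open("datalab_feed.txt", "w") as fh:
        fh.write("ACC0000001 123.45 20100115\n")   # valid record
        fh.write("AC-0000002 ABC.99 2010-XX\n")    # bad account id, amount and date
    for err in validate("datalab_feed.txt"):
        print(err)
```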
Once the data is validated, it is loaded into the database using Ab Initio graphs.
The testing team should force-start the ETL batch process and monitor it for results. If any graph in the process fails, the tester should navigate to the error logs and validate the errors and rejected files. This is done on the Unix box. The scheduler used with Ab Initio is Autosys, which runs and monitors the batch jobs on the Unix box.
The first step is the GateKeeper process. The code and test scripts for this process are derived from the data model, and the scenarios are validated based on that model.
Next, the data is validated against the business requirements and business rules. Data integrity and the relationships between the tables are also validated; we have to make sure no orphan records are generated in any table.
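A typical orphan-record check looks for child rows whose foreign key has no matching parent. The sketch below uses SQLite for illustration (the project uses Oracle), and the customer/orders tables are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);

    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders   VALUES (10, 1, 50.0), (11, 2, 20.0), (12, 99, 5.0);  -- 99 has no parent
""")

# Orphan check: child rows whose parent key does not exist in the parent table.
orphans = cur.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders o
    LEFT JOIN customer c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()
print("orphan order rows:", orphans)   # expect [(12, 99)]
```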
All the data validation is done using Oracle SQL queries. The tool can be either Toad or SQL Developer for Symphony.
Once the data and data integrity are validated, we test the data through the application, perform complete UI, functionality and regression validations, and capture all the transactions (dispositions) in the database tables. These transactions are validated using Oracle SQL queries.
Once the transactional data is validated, it is extracted, transformed and loaded into the reporting tables using Ab Initio graphs. This is the data warehousing process in the project. It is not a straightforward load: there are business rules and requirements against which it is validated, again using Oracle SQL queries.
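One common way to validate such a transform-and-load with SQL is to reconcile the reporting table against the transactional source with the business rule applied on the source side. The sketch below is illustrative only: SQLite stands in for Oracle, and the tables and the "only APPROVED transactions are reported" rule are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    -- Transactional source
    CREATE TABLE txn (txn_id INTEGER, status TEXT, amount REAL);
    INSERT INTO txn VALUES (1, 'APPROVED', 100.0), (2, 'DECLINED', 40.0), (3, 'APPROVED', 60.0);

    -- Reporting table loaded by the ETL graphs; hypothetical business rule:
    -- only APPROVED transactions flow into reporting.
    CREATE TABLE rpt_txn (txn_id INTEGER, amount REAL);
    INSERT INTO rpt_txn VALUES (1, 100.0), (3, 60.0);
""")

# Reconcile counts and totals between source (with the rule applied) and target.
src = cur.execute("SELECT COUNT(*), SUM(amount) FROM txn WHERE status = 'APPROVED'").fetchone()
tgt = cur.execute("SELECT COUNT(*), SUM(amount) FROM rpt_txn").fetchone()
print("source:", src, "target:", tgt, "match:", src == tgt)
```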
Once the data in the reporting tables is validated, the data is tested in the front end using a BI tool called Cognos. The data in the reporting tables is stored in a multidimensional format, and the schema used for loading and retrieving data is a star schema.
All these validations have a proper test strategy, test plan and test cases, which are loaded into Quality Center and executed. Any issue is logged as a defect in QC and is discussed and validated with the business owners and the development team.
Data Validation Roles and Responsibilities:
- Validate the boundary conditions.
- Validate the field characteristics (alpha, numeric, etc.).
- Not-null validations (constraint and trigger validations).
- Data integrity (relationship validations, parent-child).
- Validate the error records through the error logs in Unix.
- Write joins and Oracle SQL queries in the Oracle DB; we actually join more than 16 tables, create a materialized view, and validate the data against the source files (see the sketch after this list).
- Validate the data mapping, i.e. whether the correct column from the source is mapped to the correct column in the DB.
- Write the test strategy, test plans and test cases for the ETL and data warehousing process.
- Use Cognos (a BI tool) to validate the data in the business intelligence reports.
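For the source-to-target and data-mapping checks above, one simple pattern is to difference the mapped columns in both directions. The sketch below uses SQLite's EXCEPT where Oracle would use MINUS; the staging and target tables, columns and mapping are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    -- Source file staged into a table (e.g. via an ETL load step).
    CREATE TABLE stg_source (cust_id TEXT, cust_name TEXT);
    INSERT INTO stg_source VALUES ('C1', 'Ann'), ('C2', 'Bob');

    -- Target table after the ETL mapping.
    CREATE TABLE tgt_customer (customer_id TEXT, customer_name TEXT);
    INSERT INTO tgt_customer VALUES ('C1', 'Ann'), ('C2', 'Bobby');  -- mapping defect
""")

# Rows present in the source but missing or mapped incorrectly in the target,
# and vice versa. In Oracle the same check would use MINUS instead of EXCEPT.
missing_in_target = cur.execute("""
    SELECT cust_id, cust_name FROM stg_source
    EXCEPT
    SELECT customer_id, customer_name FROM tgt_customer
""").fetchall()
extra_in_target = cur.execute("""
    SELECT customer_id, customer_name FROM tgt_customer
    EXCEPT
    SELECT cust_id, cust_name FROM stg_source
""").fetchall()
print("in source but not target:", missing_in_target)
print("in target but not source:", extra_in_target)
```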