Wednesday, August 1, 2012

SQL Interview Questions

Basic SQL Interview questions and answers:



SOURCE: http://www.techinterviews.com/interview-questions/database/


QUESITON:Name few Data types in database?
ANSWER:
Max. columns in a table is 255. Max.

Char size is 255,

Long is 64K and

Number is 38 digits.Cannot Query on a long column.Char,

Varchar2 Max. size is 2000 and default is 1 byte.

Number(p,s) p is precision range 1 to 38, s is scale -84 to 127.

Long Character data of variable length upto 2GB.

Date Range from Jan 4712 BC to Dec 4712 AD.

Raw Stores Binary data (Graphics Image and Digitized Sound).

Max. is 255 bytes.



QUESTION: WHAT IS TABLE CONSTRAINTS, NAME FEW TABLE CONSTRAINTS.

ANSWER:

not null - value in a column must not be NULL

unique - value(s) in specified column(s) must be unique for each row in a table

primary key - value(s) in specified column(s) must be unique for each row in a table and not be NULL; normally each table in a database should have a primary key - it is used to identify individual records

foreign key - value(s) in specified column(s) must reference an existing record in another table (via it's primary key or some other unique constraint)

check - an expression is specified, which must evaluate to true for constraint to be satisfiedA constraint allows you to apply simple referential integrity checks to a table. There are four primary types of constraints that are currently supported by SQL Server:



PRIMARY/UNIQUE - enforces uniqueness of a particular table column.

DEFAULT - specifies a default value for a column in case an insert operation does not provide one. FOREIGN KEY - validates that every value in a column exists in a column of another table.

CHECK - checks that every value stored in a column is in some specified list. Each type of constraint performs a specific type of action. Default is not a constraint.

NOT NULL is one more constraint which does not allow values in the specific column to be null. And also it the only constraint which is not a table level constraint.



QUESTION: WHAT IS DATA INTEGRITY? HOW DO YOU ENSURE DATA INTEGRITY IS ENFORCED IN A TABLE?

ANSWER: NEED TO VALIDATE PROPER CONSTRAINT LIKE PRIMARY KEY, FORIGEN KEY IS DEFINED FOR THE TABLE.



QUESTION: WHAT IS referential integrity mean?

ANSWER: PRIMARY KEY, FORIGEN KEY, COMPOSITE KEY, SURROGATE KEY, CANDIDATE KEY.



QUESTION: NAME FEW SQL CLAUSES:

Where

HAVING

IN

ANY

TOP

ROWNUM



QUESTION: WHAT IS AGGREGATE FUNCTION, Name few aggregate function?

COUNT

SUM

AVG

MIN

MAX



QUESTION: What are the rules when you want to use aggregate functions?

ANSWER: BASIC RULE IS TO USE "GROUP BY" WHILE USING AGGREGATE FUNCTION IN A QUERY.





QUESTION: What is the difference between where and having clause?

ANSWER: in SQL Where filters data on lowest row level.

Having filters data after group by has been performed so it filters on "groups""Where" is a kind of restiriction statement.

You use where clause to restrict all the data from DB.Where clause is using before result retrieving.

But Having clause is using after retrieving the data.Having clause is a kind of filtering command.



QUESTION: IN WHICH SCENARIO WE WILL USE "IN" OPERATOR. CAN YOU EXPLAIN WHAT IS THE PURPOSE OF IT.



QUESTION: WHAT IS THE DIFFERENCE BETWEEN "IN" AND "EXISTS"



QUESTION: SAMPLE QUERY TO FIND SUM OF SALES FOR EACH DEPARTMENT. LETS SAY TABLE NAME IS SALES_ORDER



QUESTION: SAMPLE QUERY USING WHERE CLAUSE and HAVING CLAUSE?

SELECT department, SUM(sales) as "Total sales" FROM order_details GROUP BY department HAVING SUM(sales) > 1000;

SELECT department, COUNT(*) as "Number of employees" FROM employees WHERE salary > 25000 GROUP BY department HAVING COUNT(*) > 10;



QUESTION: WHAT IS UNION, WHAT IS THE DIFFERENCE BETWEEN UNION ALL



QUESTION: HOW DO YOU COMPARE DATA BETWEEN SOURCE and TARGET

ANSWER: MINUSQUESTIONS: How can I create a table from another table with copying DATA and STATS



ANSWER:CREATE TABLE new_table AS (SELECT * FROM old_table WHERE 1=2) WITH DATA AND STATS;



QUESTION: WHAT IS DIFFERENCE BETWEEN DELETE and TRUNCATE



QUESTION: What type of joins have you used?

ANSWER: Joins knowledge is MUST HAVE. This interview question is quite nice because most people used inner join and (left/right) outer join which is rather mandatory knowledge but those more experienced will also mention cross join and self-join. In SQL Server you can also get full outer join.



QUESTION: How can you combine two tables/views together? For instance one table contains 100 rows and the other one contains 200 rows, have exactly the same fields and you want to show a query with all data (300 rows). This sql interview question can get complicated.

ANSWER: You use UNION operator. You can drill down this question and ask what is the different between UNION and UNION ALL (the first one removes duplicates (not always desirable)… in other words shows only DISTINCT rows….Union ALL just combines so it is also faster). More tricky question are how to sort the view (you use order by at the last query), how to name fields so they appear in query results/view schema (first query field names are used). How to filter groups when you use union using SQL (you would create separate query or use common table expression (CTE) or use unions in from with ().



QUESTION: WHAT IS A DERIVED TABLE?



QUESTION: WHAT IS SUBQUERY, HOW TO WRITE SYNTAX.



QUESTION: How would apply date range filter?

ANSWER: This is tricky question. You can use simple condition >= and <= or similar or use between/and but the trick is to know your exact data type. Sometimes date fields contain time and that is where the query can go wrong so it is recommended to use some date related functions to remove the time issue. In SQL Server common function to do that is datediff function. You also have to be aware of different time zones and server time zone.



QUESTION: What type of wildcards have you used? This is usually one of mandatory sql interview question.

ANSWER: First question is what is a wildcard? Wildcards are special characters that allow matching string without having exact match. In simple word they work like contains or begins with. Wildcard characters are software specific and in SQL Server we have % which represent any groups of characters, _ that represent one character (any) and you also get [] where we can [ab] which means characters with letter a or b in a specific place.



QUESTION: How do you find orphans?

ANSWER: This is more comprehensive SQL and database interview question. First of all we test if the candidate knows what an orphan is. An Orphan is a foreign key value in "child table" which doesn’t exist in primary key column in parent table. To get it you can use left outer join (important: child table on left side) with join condition on primary/foreign key columns and with where clause where primary key is null. Adding distinct or count to select is common practise. In SQL Server you can also you except which will show all unique values from first query that don't exist in second query.



QUESTION: How would you solve the following sql queries using today's date? First day of previous month To return first day of previous month using T-SQL we can use the following statement: SELECT DATEADD(m,-1, Dateadd(d,1-DATEPART(d,getdate()),GETDATE())) AS [First Day of Previous Month]1) Our today's date is 20th of Feb. We substract number of days in the month from our date (20) and we get last day of previous month.2) We add one day to get first day of current month.3) We go one month back so first day of the month stays the same and Feb changes to Jan. First day of current monthLast day of previous monthLast day of current month

ANSWER: These tasks require good grasp of SQL functions but also logical thinking which is one of the primary skills involved in solving sql questions. In this case I provided links to actual answers with code samples. Experienced people should give correct answer almost immediately. People with less experience might need more time or would require some help (Google).



QUESTION: How do you find duplicate record in a table.

ANSWER: SELECT name, email COUNT(name) AS NumOccurName, COUNT(email) as NumOccurEmail FROM users GROUP BY Name, Email HAVING ( COUNT(name) > 1 ) AND ( COUNT(email) > 1)



QUESTION: How do you find 2nd maximum salary of an employee



QUESTION: What is "normalization"? "Denormalization"? Why do you sometimes want to denormalize?

ANSWER: Normalizing data means eliminating redundant information from a table and organizing the data so that future changes to the table are easier. Denormalization means allowing redundancy in a table. The main benefit of denormalization is improved performance with simplified data retrieval and manipulation. This is done by reduction in the number of joins needed for data processing.



QUESTION: What types of index data structures can you have? - An index helps to faster search values in tables. The three most commonly used index-types are: - B-Tree: builds a tree of possible values with a list of row IDs that have the leaf value. Needs a lot of space and is the default index type for most databases. - Bitmap: string of bits for each possible value of the column. Each bit string has one bit for each row. Needs only few space and is very fast.(however, domain of value cannot be large, e.g. SEX(m,f); degree(BS,MS,PHD) - Hash: A hashing algorithm is used to assign a set of characters to represent a text string such as a composite of keys or partial keys, and compresses the underlying data. Takes longer to build and is supported by relatively few databases.



QUESTION: WHAT IS TRIGGER?

ANSWER: Triggers are stored procedures created in order to enforce integrity rules in a database. A trigger is executed every time a data-modification operation occurs (i.e., insert, update or delete). Triggers are executed automatically on occurrence of one of the data-modification operations. A trigger is a database object directly associated with a particular table. It fires whenever a specific statement/type of statement is issued against that table. The types of statements are insert,update,delete and query statements. Basically, trigger is a set of SQL statements A trigger is a solution to the restrictions of a constraint. For instance: 1.A database column cannot carry PSEUDO columns as criteria where a trigger can. 2. A database constraint cannot refer old and new values for a row where a trigger can.WHAT IS INDEX?DIFFERENCE BETWEEN CLUSTERED INDEX / NON CLUSTERED INDEX.



QUESTION: What are the tradeoffs with having indexes?

ANSWER:

1. Faster selects, slower updates.

2. Extra storage space to store indexes. Updates are slower because in addition to updating the table you have to update the index.WHAT IS TABLE PARTITIONS?



QUESTION: Why can a "group by" or "order by" clause be expensive to process?

ANSWER: Processing of "group by" or "order by" clause often requires creation of Temporary tables to process the results of the query. Which depending of the result set can be very expensive.



QUESTION: What is a SQL view?

ANSWER: An output of a query can be stored as a view. View acts like small table which meets our criterion. View is a pre complied SQL query which is used to select data from one or more tables. A view is like a table but it doesn’t physically take any space. View is a good way to present data in a particular format if you use that query quite often.View can also be used to restrict users from accessing the tables directly.



QUESTION: WHAT IS MATERIALISTIC VIEW?













Saturday, April 21, 2012

Strategies for Testing Data Warehouse Applications

Businesses are increasingly focusing on the collection and organization of data for strategic decision-making. The ability to review historical trends and monitor near real-time operational data has become a key competitive advantage.
This article provides practical recommendations for testing extract, transform and load (ETL) applications based on years of experience testing data warehouses in the financial services and consumer retailing areas. Every attempt has been made to keep this article tool-agnostic so as to be applicable to any organization attempting to build or improve on an existing data warehouse.
Testing Goals
There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions. Given the importance of early detection of software defects, let's first review some general goals of testing an ETL application:


Data completeness. Ensures that all expected data is loaded.
Data transformation. Ensures that all data is transformed correctly according to business rules and/or design specifications.
Data quality. Ensures that the ETL application correctly rejects, substitutes default values, corrects or ignores and reports invalid data.
Performance and scalability. Ensures that data loads and queries perform within expected time frames and that the technical architecture is scalable.
Integration testing. Ensures that the ETL process functions well with other upstream and downstream processes.
User-acceptance testing. Ensures the solution meets users' current expectations and anticipates their future expectations.
Regression testing. Ensures existing functionality remains intact each time a new release of code is completed.

Data CompletenessOne of the most basic tests of data completeness is to verify that all expected data loads into the data warehouse. This includes validating that all records, all fields and the full contents of each field are loaded. Strategies to consider include:
Comparing record counts between source data, data loaded to the warehouse and rejected records.
Comparing unique values of key fields between source data and data loaded to the warehouse. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
Utilizing a data profiling tool that shows the range and value distributions of fields in a data set. This can be used during testing and in production to compare source and target data sets and point out any data anomalies from source systems that may be missed even when the data movement is correct.
Populating the full contents of each field to validate that no truncation occurs at any step in the process. For example, if the source data field is a string(30) make sure to test it with 30 characters.
Testing the boundaries of each field to find any database limitations. For example, for a decimal(3) field include values of -99 and 999, and for date fields include the entire range of dates expected. Depending on the type of database and how it is indexed, it is possible that the range of values the database accepts is too small.

Data TransformationValidating that data is transformed correctly based on business rules can be the most complex part of testing an ETL application with significant transformation logic. One typical method is to pick some sample records and "stare and compare" to validate data transformations manually. This can be useful but requires manual testing steps and testers who understand the ETL logic. A combination of automated data profiling and automated data movement validations is a better long-term strategy. Here are some simple automated data movement techniques:
Create a spreadsheet of scenarios of input data and expected results and validate these with the business customer. This is a good requirements elicitation exercise during design and can also be used during testing.
Create test data that includes all scenarios. Elicit the help of an ETL developer to automate the process of populating data sets with the scenario spreadsheet to allow for flexibility because scenarios will change.
Utilize data profiling results to compare range and distribution of values in each field between source and target data.
Validate correct processing of ETL-generated fields such as surrogate keys.
Validate that data types in the warehouse are as specified in the design and/or the data model.
Set up data scenarios that test referential integrity between tables. For example, what happens when the data contains foreign key values not in the parent table?
Validate parent-to-child relationships in the data. Set up data scenarios that test how orphaned child records are handled.

Data QualityFor the purposes of this discussion, data quality is defined as "how the ETL system handles data rejection, substitution, correction and notification without modifying data." To ensure success in testing data quality, include as many data scenarios as possible. Typically, data quality rules are defined during design, for example:
Reject the record if a certain decimal field has nonnumeric data.
Substitute null if a certain decimal field has nonnumeric data.
Validate and correct the state field if necessary based on the ZIP code.
Compare product code to values in a lookup table, and if there is no match load anyway but report to users.
Depending on the data quality rules of the application being tested, scenarios to test might include null key values, duplicate records in source data and invalid data types in fields (e.g., alphabetic characters in a decimal field). Review the detailed test scenarios with business users and technical designers to ensure that all are on the same page. Data quality rules applied to the data will usually be invisible to the users once the application is in production; users will only see what's loaded to the database. For this reason, it is important to ensure that what is done with invalid data is reported to the users. These data quality reports present valuable data that sometimes reveals systematic issues with source data. In some cases, it may be beneficial to populate the "before" data in the database for users to view.

Performance and ScalabilityAs the volume of data in a data warehouse grows, ETL load times can be expected to increase, and performance of queries can be expected to degrade. This can be mitigated by having a solid technical architecture and good ETL design. The aim of the performance testing is to point out any potential weaknesses in the ETL design, such as reading a file multiple times or creating unnecessary intermediate files. The following strategies will help discover performance issues:
Load the database with peak expected production volumes to ensure that this volume of data can be loaded by the ETL process within the agreed-upon window.
Compare these ETL loading times to loads performed with a smaller amount of data to anticipate scalability issues. Compare the ETL processing times component by component to point out any areas of weakness.
Monitor the timing of the reject process and consider how large volumes of rejected data will be handled.
Perform simple and multiple join queries to validate query performance on large database volumes. Work with business users to develop sample queries and acceptable performance criteria for each query.

Integration TestingTypically, system testing only includes testing within the ETL application. The endpoints for system testing are the input and output of the ETL code being tested. Integration testing shows how the application fits into the overall flow of all upstream and downstream applications. When creating integration test scenarios, consider how the overall process can break and focus on touchpoints between applications rather than within one application. Consider how process failures at each step would be handled and how data would be recovered or deleted if necessary.
Most issues found during integration testing are either data related to or resulting from false assumptions about the design of another application. Therefore, it is important to integration test with production-like data. Real production data is ideal, but depending on the contents of the data, there could be privacy or security concerns that require certain fields to be randomized before using it in a test environment. As always, don't forget the importance of good communication between the testing and design teams of all systems involved. To help bridge this communication gap, gather team members from all systems together to formulate test scenarios and discuss what could go wrong in production. Run the overall process from end to end in the same order and with the same dependencies as in production. Integration testing should be a combined effort and not the responsibility solely of the team testing the ETL application.

User-Acceptance TestingThe main reason for building a data warehouse application is to make data available to business users. Users know the data best, and their participation in the testing effort is a key component to the success of a data warehouse implementation. User-acceptance testing (UAT) typically focuses on data loaded to the data warehouse and any views that have been created on top of the tables, not the mechanics of how the ETL application works. Consider the following strategies:
Use data that is either from production or as near to production data as possible. Users typically find issues once they see the "real" data, sometimes leading to design changes.
Test database views comparing view contents to what is expected. It is important that users sign off and clearly understand how the views are created.
Plan for the system test team to support users during UAT. The users will likely have questions about how the data is populated and need to understand details of how the ETL works.
Consider how the users would require the data loaded during UAT and negotiate how often the data will be refreshed.

Regression TestingRegression testing is revalidation of existing functionality with each new release of code. When building test cases, remember that they will likely be executed multiple times as new releases are created due to defect fixes, enhancements or upstream systems changes. Building automation during system testing will make the process of regression testing much smoother. Test cases should be prioritized by risk in order to help determine which need to be rerun for each new release. A simple but effective and efficient strategy to retest basic functionality is to store source data sets and results from successful runs of the code and compare new test results with previous runs. When doing a regression test, it is much quicker to compare results to a previous execution than to do an entire data validation again.
Taking these considerations into account during the design and testing portions of building a data warehouse will ensure that a quality product is produced and prevent costly mistakes from being discovered in production.

Source: http://www.information-management.com/issues/20070601/1086005-1.html?pg=2

Friday, December 16, 2011

Articles on Data Warehouse Testing and Validation Strategy?

As most of my readers know, I typically try to stay away from writing about generic data warehousing practices. I usually like to keep the articles short and sweet. Well, I am breaking both rules this time. As part of a current project, I was asked to help pull together the most recent articles on data warehouse testing and validation strategies. Guess what? These are few and far between. It appears that all the experts want to tell us how to build these things without ever addressing the issue validating its accuracy once it is loaded.
Why? If done properly it is difficult, very difficult. With today’s inexpensive disk storage, it is not uncommon to reach a terabyte anymore. Sure there are the typical database techniques of checking row counts, summary tables and also reviewing the exception logs from the ETL process, but is this enough? It is not for my current client, and it may not be for yours. After many heated discussions, I was forced to rethink my position on validation techniques used in the past.
If you feel that you are happy with the quality of data in your warehouse and are an expert in data warehousing validation, this article is not for you. Data quality, which is big topic right now, really does not address this specific issue. This discussion is not about quality; it is about validation. This article is by no means an all-inclusive recommendation on testing or validation. The purpose is to define the coverage of testing and validation that might be performed. This approach will have to be modified for each engagement but should at least provide a starting point for some.

Enough of the Disclaimers
I have determined that it is best to break the testing and validation process into four well-defined, high-level tasks for a data warehouse engagement: integration testing, system testing, data validation and acceptance testing. Integration testing includes reviewing and accepting the logical data model captured with your data modeling tool (i.e., ERwin or your tool of choice), converting the models to actual physical database tables in the test environment, creating the proper indexes and testing the ETL programs created by your ETL tool. System testing involves increasing the volume of the test data to be loaded, estimating load times and placing data into either a high- volume test area or actually in the production environment. Data validation includes reviewing the ETL mapping encoded in the ETL tool as well as reviewing samples of the data that was loaded into the test environment. Acceptance testing will include completeness of data model to meet the reporting needs for your specific project, reviewing summary table designs, validation of data actually loaded in the production data warehouse environment and a review of the daily upload procedures.
Since the data warehouse will consist of several dimensions and fact tables (if utilizing a star schema), each of the first three testing and validation phases can be executed somewhat in parallel. The initial review of the data elements in the data model should include the identification of data elements in the fact table that are critical for success for your project. During the review, several data elements may have been added as placeholders for future enhancements. These too should be noted so that during the final testing and acceptance procedures null or voided entries in the production system are not noted as failed data elements.
Due to the pure volume of data in an enterprise data warehouse, it is not feasible to validate each row that is loaded when significant transformations are involved. To add to the complexity, loading may not be done all at a single point in time, portioned by year, SSN or other demographic information. Assumptions must be documented by the testing and validation team and accepted by the project sponsor before acceptance testing can begin. These assumptions will state the various levels of testing and validation to be performed and the objectives of each. If possible, data elements that cannot be tested in detail but can be validated through the use of summary tables and counts, acceptable levels of error, if any, must be addressed. If no testing or validation is to occur on certain data elements, an explanation and impact of not testing must be documented.

Levels and Objectives
Integration Testing
The first level of testing and validation begins with the formal acceptance of the logical data model. All further testing and validation must be based on the understanding of each of the data elements in the model and how they will assist the company, specifically in the subject area(s) for your particular increment, meet specific business objectives in the areas of reporting and analysis.
Table names in the logical model should be intuitive, so that a formal definition of the table will not be necessary but still useful. A table is best described by the data elements that reside in it. Each table should be identified by how it is to be used within the warehouse (i.e., fact table, a dimension table or summary table). Primary keys should be identified and placed at the top of each table for quick reference. During the integration testing process, a discussion must take place to validate that the primary key(s) will actually make each row in the table unique usually before loading begins.
Each table will consist of multiple data elements, some of which will be used as primary keys. A description of each data element should be captured in a repository and a physical name should be assigned. Where possible, physical names should closely reflect names in source tables and follow company standards if any exist. Data elements that are created through a transformation or summary process should be clearly identified and calculations for each of these data elements need to be captured and entered into the repository.
The second phase of integration testing is comprised of the verification of the physical model. Some data elements captured in the logical model may reside in the physical tables, but will remain as placeholders for future enhancements. A list of all data elements that will not be loaded must be created and verified. The logical model may also contain several tables that will be collapsed into a single physical table. Based on how the physical tables will be used, a DBA must then determine which data elements should be included in indexes to increase the speed and efficiency of the physical tables. The data model can then be used to create the actual SQL to create the physical tables with the data warehouse test environment.
Actually populating the physical tables utilizing the ETL programs or other database load utilities will be the third element in the integration testing process. As data is moved from source to target, the ETL will identify errors such as data type and size. These errors are logged into an exception report that can be reviewed after the load completes or fails. This process is also useful for verifying the number of rows loaded and to begin identifying potential problems related to data volume and load times.
System Testing
System testing takes the final phase of integration testing to the next level by requiring that the testing team push the limits on the volume to be loaded into the test environment. Large data sets extracted directly from the operational systems can be utilized if possible for the system testing. Final DASD size can be determined, load times can be calculated or validated and proper procedures to put jobs into production can be reviewed. At the completion of the system tests, the JCL required to move the test jobs into production can be approved by the proper organizations and actual load times can be scheduled.
Due to limitations in most test environments of limited direct access storage devices (DASD) and CPU cycles during operational hours, these tests are very useful, but may not be able to identify all of the potential errors during the actual load if the warehouse is being developed on a shared platform. This should however minimize the risk of the initial load significantly.
Data Validation
Reviewing the mapping utilized by the ETL tool is the primary task during data validation. Due to the shear volume of data to be loaded into the fact tables, it is imperative that care is taken that the mapping is validated to verify that the data that you think is being loaded into specific data elements is in fact being sourced from tables that we know contain the information in the operational system.
The ETL tool will create a report that can be viewed as is or imported into an analytical interface with reporting capabilities such as Microsoft Access. If an interface is to be utilized, the name of each table, both logical and physical, as well as each data element, also logical and physical, should be present. Each data element should have a formal description and a mapping back to the source table(s) used to populate it during the ETL process. If any transformation is to occur, this too should be documented. If no transformation is to occur, the sample SQL to capture the data element from the source table may be documented as well, but is not required.
This is an excellent opportunity to get sponsor sign off. An agreement at this time will draw a line in the sand to minimize scope creep and document the understanding at that point in time exactly what is to be developed and how. Changes to the design and transformations will occur if the system is complex at all as the designers and developers of the data warehouse increase their knowledge on how the data elements on the source systems are populated and used in the operational environment.
The next step in data validation is completed by the use of counts. Simple database queries can be run either on an entire table (for most dimensions this is possible) or for subset (i.e., number of activity records for the month of May). If these counts are equal, it can be safely assumed that records were not left out due to an error during the ETL or simple load process. This is further verified by the lack of errors (not necessarily warnings) in the exception reporting by the ETL tool.
Many dimensional tables have an exact duplicate in the operational database schema. Doing a simple table compare can further validate these tables. If a table compare indicates that no difference were found, these tables can be considered 100 percent validated. For additional verification, actual rows from both the operational and data warehouse tables can be printed and listed side by side for comparison.
Randomly selecting rows of data from the test environment and comparing them to data stored within the test data warehouse should give a high level of confidence in the ETL process. A more detailed version of this test is usually conducted in the acceptance testing procedure.
Acceptance Testing
Acceptance testing begins with the final review of the data model. A listing of potential new reports must be created and data requirements to complete these reports must be identified in a process document. mapping of the specific data needs to actual data fields in the physical data model will help ensure that no key elements have been neglected during the design phase. Existing reports may also be identified as key reports to determine if on a future date those too can be redeveloped to take advantage of the data warehouse.
During the evaluation of reporting needs, several summary tables will need to be defined to help minimize the development effort of reports as well to increase the speed of report generation. Summary tables may also be developed to provide certain counts on a monthly basis that can be analyzed via tools such as Microsoft’s Data Analyzer. Actual values from the summary tables can be used to compare to similar summaries from the operational system to help provide a high level of confidence in the values stored within the warehouse. Very seldom do the numbers in the data warehouse match 100 percent with those created in the operational system. This is often the case due to the warehouse being populated once a day or even once a week. It will be up to the individuals assigned to the acceptance team to determine why the values are different and if the difference is significant.
Once the data warehouse is fully populated and the daily load procedures are operational, the final acceptance testing can begin. This should include taking a cross section of specific data elements within each of the fact tables and comparing them directly with facts in the operational system. This can be automated with the use of custom code (COBOL, SAS, C or even simple SQL) if at all possible to maximize the number of rows to be interrogated within the data warehouse. The use of views can also help reduce the complexity of the code that would need to be developed. Due to the volume of data and the amount of CPU cycles to conduct a test of this type, a decision will need to be made if a select few data elements on a large number of rows or if all data elements for a limited number of rows would provide a better sense of valid data. Both methods can be used, but the additional development time to create two procedures of this size must be weighed against the actual value that will be derived. If all previous data validation procedures are producing valid results, maybe only one procedure can be initially created. If this procedure proves no differences, the overall warehouse can be validated with at some level of confidence.
It should be noted that to create a 100 percent validation of the data would require duplicating the entire ETL process utilizing a separate tool or procedure (i.e., all custom-built COBOL code) would be very costly, time-consuming and, quite frankly, impractical. That is the reason you buy an ETL tool in the first place. In addition, the validation process to verify that both the tool and the custom code utilize the same logic would be an extensive exercise in itself. This duplication of effort would also have to be included in all future data warehouse design and development efforts.
Completion Criteria
Since you cannot check every data element, how do you know when testing and validation is complete? Completion of testing has to be determined by the number of defects found and resolved. The test log, which identifies all defects and the status of each, is a good source to use in measuring completion criteria. The true acceptance and determination of completion can only be determined when the sponsor feels comfortable with the results in both sample reporting and the results of the tests previously listed. Without confidence in the numbers, the data warehouse will not be utilized and thus deemed a failure. To this end, it is important to revisit the purpose of the data warehouse. The data warehouse is, by definition, not an operational system. The data stored in the data warehouse is typically for analytical and reporting purposes only. The goals of most data warehousing efforts should be focused on allowing individuals the ability to create new strategies for success and optimize their organization. Neither of these goals requires a 100 percent accuracy rate in the millions of rows of facts stored over several years, but inaccuracies based on incorrect data mapping or invalidated transformations can be worse than not doing anything at all.
Take care in your future testing and validation efforts. This level of validation will add time to the plan and increase costs, but these can be minimized if you work these techniques into the plan early and integrate them into your existing methodology. I believe in the long run that your sponsors and the business users will be glad that you did.
Cody Bateman is one of the founders of Keyleon, a corporation specializing in the development of business intelligence, data warehousing and CRM solutions. Bateman has 19 years of consulting experience in a range of service industries including energy, defense, finance, health care, retail and telecommunications and is the author of the Kenetic Incremental Warehouse Implementation (KIWI) Methodology. He may be reached via e-mail at CBateman@keyleon.com.
For more information on related topics, visit the following channels:

Source:: http://www.information-management.com/infodirect/20021004/5855-1.html

Friday, September 9, 2011

What is Datamart???

Reference From: http://www.learn.geekinterview.com/data-warehouse/data-marts/what-is-data-mart.html

Data Mart is a subset of the data resource, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs. The concept of a data mart can apply to any data whether they are operational data, evaluational data, spatial data, or metadata.

A data mart is a repository of a business organization's data implemented to answer very specific questions for a specific group of data consumers such as organizational divisions of marketing, sales, operations, collections and others. A data mart is typically established as one dimensional model or star schema which is composed of a fact table and multi-dimensional table.
In comparison, a data warehouse is also a repository of organizational data implemented as a single repository serving enterprise wide data across many if not all subject areas. The data warehouse is the authoritative repository at atomic level of all fact and dimensional data.

Despite some arguments on the similarity or difference between a data mart and a data warehouse, many still consider a data mart as specialized version of a data warehouse.

Advantage of Data Mart:
The data mart, like the data warehouse, can also provide a picture of a business organization's data and help the organizational staff in formulating strategies based on the aggregated data and statistical analysis of industry trends and patterns as well as part business experiences.

The most notable difference of a data mart from a data warehouse is that the data mart is created based on a very specific and predefined purpose and need for a grouping of certain data. A data mart is configured such that it makes access to relevant information in a specific area very easy and fast.

Within a single business organization, there can more than one data mart. Each of these data marts is relevant or connected in some way to one or more business units that its design was intended for. The relationship among many data marts within a single company may or may not involve interdependency.

They may be related to other data marts if they were designed using conformed facts of dimensions. If one department has a data mart implementation, that department is considered to be the owner of the data mart and it owns all aspects of the data mart including the software, hardware and the data itself. This can help manage data in a huge company by having a modularization method such that a department should only manipulate and develop its own data as they see if without having to alter data from other department's data marts. Then other departments need data from the data mart owned by a certain department, proper permission should be asked first.

In other data mart implementation where there is strict conformed dimension, some shared dimensions exist such as customers and products and business ownership will no longer apply.

Data marts can be designed with star schema, snowflake schema or starflake schema. The star schema is the most simple of all the styles related to data mart and data warehousing. It consists only of few fact tables.

The snowflake schema is a variation of the star schema and the storage method is of multidimensional nature. The starflake schema is a hybrid mixture of both the star and snowflake schemas.

Data marts are especially useful to make access to specific frequently access data very ease. It can give a collective picture or a certain aspect in the business by a specific group of users. Since data marts are smaller compared to a full data warehouse, response time could be lesser and the cost of implantation could also be less expensive.

Thursday, September 8, 2011

zip/unzip
gzip -d

Find:
grep <> filename.txt > newfilename.txt

to count3 of lines
wc -1 filename.txt


chmod filename 777/755/
Linux Shortcuts and Commands:
Linux Newbie Administrator Guide
by Stan and Peter Klimas
This is a practical selection of the commands we use most often. Press to see the listing of all available command (on your PATH). On my small home system, it says there are 2595 executables on my PATH. Many of these "commands" can be accessed from your favourite GUI front-end (probably KDE or Gnome) by clicking on the right menu or button. They can all be run from the command line. Programs that require GUI have to be run from a terminal opened under a GUI.
Legend:
<> = single special or function key on the keyboard. For example indicates the "control" key.
italic = name of the file or variable you probably want to substitute with your own.
fixed width = in-line Linux commands and filenames.
Notes for the UNIX Clueless:
1. LINUX IS CASE-SENSITIVE. For example: Netscape, NETSCAPE and nEtscape are three different commands. Also my_filE, my_file, and my_FILE are three different files. Your user login name and password are also case sensitive. (This goes with the tradition of UNIX and the "c" programming language being case sensitive.)
2. Filenames can be up to 256 characters long and can contain letters, numbers, "." (dot), "_" (underscore), "-" (dash), plus some other not recommended characters.
3. Files with names starting with "." are normally not shown by the ls (list) or dir commands. Think of these files as "hidden". Use ls -a (list with the option "all") to see these files.
4. "/" is an equivalent to DOS "\" (root directory, meaning the parent of all other directories).
5. Under Linux, all directories appear under a single directory tree (there are no DOS-style drive letters).
6. In a configuration file, a line starting with # is a comment.
7.1 Linux essential shortcuts and sanity commands

Switch to the first text terminal. Under Linux you can have several (6 in standard setup) terminals opened at the same time.
(n=1..6)
Switch to the nth text terminal.
tty
Print the name of the terminal in which you are typing this command.

Switch to the first GUI terminal (if X-windows is running on this terminal).
(n=7..12)
Switch to the nth GUI terminal (if a GUI terminal is running on screen n-1). On default, nothing is running on terminals
8 to 12, but you can run another server there.

(In a text terminal) Autocomplete the command if there is only one option, or else show all the available options.
THIS SHORTCUT IS GREAT! It even works at LILO prompt!

Scroll and edit the command history. Press to execute.

Scroll terminal output up. Work also at the login prompt, so you can scroll through your bootup messages.

Scroll terminal output down.
<+>
(in X-windows) Change to the next X-server resolution (if you set up the X-server to more than one resolution). For multiple resolutions on my standard SVGA card/monitor, I have the following line in the file /etc/X11/XF86Config (the first resolution starts on default, the largest determines the size of the "virtual screen"):
Modes "1024x768" "800x600" "640x480" "512x384" "480x300" "400x300" "1152x864"
<->
(in X-windows) Change to the previous X-server resolution.

(in X-windows) Kill the current X-windows server. Use if the X-windows server crushes and cannot be exited normally.

Shut down the system and reboot. This is the normal shutdown command for a user at the text-mode console. Don't just press the "reset" button for shutdown!
c
Kill the current process (mostly in the text mode for small applications).
d
Log out from the current terminal. See also the next command.
d
Send [End-of-File] to the current process. Don't press it twice else you also log out (see the previous command).
s
Stop the transfer to the terminal.
q
Resume the transfer to the terminal. Try if your terminal mysteriously stops responding.
z
Send the current process to the background.
exit
Logout. I can also use logout for the same effect. (If you have started a second shell, e.g., using bash the second shell will be exited and you will be back in the first shell, not logged out.)
reset
Restore a screwed-up terminal (a terminal showing funny characters) to default setting. Use if you tried to "cat" a binary file. You may not be able to see the command as you type it.

Paste the text which is currently highlighted somewhere else. This is the normal "copy-paste" operation in Linux. (It doesn't work with Netscape and WordPerfect which use the MS Windows-style "copy-paste". It does work in the text terminal if you enabled "gpm" service using "setup".) Best used with a Linux-ready 3-button mouse (Logitech or similar) or else set "3-mouse button emulation").
~
(tilde) My home directory (normally the directory /home/my_login_name). For example, the command cd ~/my_dir will change my working directory to the subdirectory "my_dir" under my home directory. Typing just "cd" alone is an equivalent of the command "cd ~".
.
(dot) Current directory. For example, ./my_program will attempt to execute the file "my_program" located in your current working directory.
..
(two dots) Directory parent to the current one. For example, the command cd .. will change my current working directory one one level up.
7.2 Common Linux commands--system info
pwd
Print working directory, i.e., display the name of my current directory on the screen.
hostname
Print the name of the local host (the machine on which you are working). Use netconf (as root) to change the name of the machine.
whoami
Print my login name.
id username
Print user id (uid) and his/her group id (gid), effective id (if different than the real id) and the supplementary groups.
date
Print or change the operating system date and time. E.g., I could change the date and time to 2000-12-31 23:57 using this command:
date 123123572000
To set the hardware (BIOS) clock from the system (Linux) clock, use the command (as root) setclock
time
Determine the amount of time that it takes for a process to complete + other info. Don't confuse it with the date command. E.g. I can find out how long it takes to display a directory content using:
time ls
who
Determine the users logged on the machine.
rwho -a
(=remote who) Determine all users logged on your network. The rwho service must be enabled for this command to run. If it isn't, run setup as root to enable "rwho".
finger user_name
System info about a user. Try: finger root
last
Show listing of users last logged-in on your system.
history more
Show the last (1000 or so) commands executed from the command line on the current account. The " more" causes the display to stop after each screenful.
uptime
Show the amount of time since the last reboot.
ps
(=print status) List the processes currently run by the current user.
ps axu more
List all the processes currently running, even those without the controlling terminal, together with the name of the user that owns each process.
top
Keep listing the currently running processes, sorted by cpu usage (top users first). In KDE, you can get GUI-based Ktop from "K"menu under "System"-"Task Manager" (or by executing "ktop" in an X-terminal).
uname -a
(= Unix name with option "all") Info on your (local) server. I can also use guname (in X-window terminal) to display the info more nicely.
free
Memory info (in kilobytes).
df -h
(=disk free) Print disk info about all the filesystems (in human-readable form)
du / -bh more
(=disk usage) Print detailed disk usage for each subdirectory starting at the "/" (root) directory (in human legible form).
cat /proc/cpuinfo
Cpu info--it show the content of the file cpuinfo. Note that the files in the /proc directory are not real files--they are hooks to look at information available to the kernel.
cat /proc/interrupts
List the interrupts in use.
cat /proc/version
Linux version and other info
cat /proc/filesystems
Show the types of filesystems currently in use.
cat /etc/printcap
Show the setup of printers.
lsmod
(As root. Use /sbin/lsmod to execute this command when you are a non-root user.) Show the kernel modules currently loaded.
setmore
Show the current user environment.
echo $PATH
Show the content of the environment variable "PATH". This command can be used to show other environment variables as well. Use "set" to see the full environment.
dmesg less
Print kernel messages (the content of the so-called kernel ring buffer). Press "q" to quit "less". Use less /var/log/dmesg to see what "dmesg" dumped into this file right after the last system bootup.

7.3 Basic operations
any_command --help more
Display a brief help on a command (works with most commands). "--help" works similar to DOS "/h" switch. The "more" pipe is needed if the output is longer than one screen.
man topic
Display the contents of the system manual pages (help) on the topic. Try man man first. Press "q" to quit the viewer. The command info topic works similar and may contain more up-to-date information. Manual pages can be hard to read. Try any_command --help for short, easy to digest help on a command. If more info needed, have a look to the directory /usr/doc. To display manual page from a specific section, I may use something like in this example: man 3 exit (this displays an info on the command exit from section 3 of the manual pages).
apropos topic
Give me the list of the commands that have something to to do with my topic.
help command
Display brief info on a bash (shell) build-in command.
ls
List the content of the current directory. Under Linux, the command "dir" is an alias to ls. Many users have "ls" to be an alias to "ls --color".
ls -al more
List the content of the current directory, all files (also those starting with a dot), and in a long form. Pipe the output through the "more" command, so that the display pauses after each screenful.
cd directory
Change directory. Using "cd" without the directory name will take you to your home directory. "cd -" will take you to your previous directory and is a convenient way to toggle between two directories. "cd .." will take you one directory up.
cp source destination
Copy files. E.g., cp /home/stan/existing_file_name . will copy a file to my current working directory. Use the "-r" option (for recursive) to copy the contents of whole directories, e.g. , cp -r my_existing/dir/ ~ will copy a subdirectory under my current working directory to my home directory.
mcopy source destination
Copy a file from/to a DOS filesystem (no mounting necessary). E.g., mcopy a:\autoexec.bat ~/junk . See man mtools for related commands: mdir, mcd, mren, mmove, mdel, mmd, mrd, mformat ....
mv source destination
Move or rename files. The same command is used for moving and renaming files and directories.
ln source destination
Create a hard link called destination to the file called source. The link appears as a copy of the original files, but in reality only one copy of the file is kept, just two (or more) directory entries point to it. Any changes the file are automatically visible throughout. When one directory entry is removed, the other(s) stay(s) intact. The limitation of the hard links are: the files have to be on the same filesystem, hard links to directories or special files are impossible.
ln -s source destination
Create a symbolic (soft) link called "destination" to the file called "source". The symbolic link just specifies a path where to look for the file. In contradistinction to hard links, the source and destination don't not have to tbe on the same filesystem. In comparison to hard links, the drawback of symbolic links are: if the original file is removed, the link is "broken", symbolic links can also create circular references (like circular references in spreadsheets or databases, e.g., "a" points to "b" and "b" points back to "a").
rm files
Remove (delete) files. You must own the file in order to be able to remove it. On many systems, you will be asked or confirmation of deleation, if you don't want this, use the "-f" (=force) option, e.g., rm -f * will remove all files in my current working directory, no questions asked.
mkdir directory
Make a new directory.
rmdir directory
Remove an empty directory.
rm -r files
(recursive remove) Remove files, directories, and their subdirectories. Careful with this command as root--you can easily remove all files on the system with such a command executed on the top of your directory tree, and there is no undelete in Linux (yet). But if you really wanted to do it (reconsider), here is how (as root): rm -rf /*
cat filename more
View the content of a text file called "filename", one page a time. The "" is the "pipe" symbol (on many American keyboards it shares the key with "\") The pipe makes the output stop after each screenful. For long files, it is sometimes convenient to use the commands head and tail that display just the beginning and the end of the file. If you happened to use "cat" a binary file and your terminal displays funny characters afterwards, you can restore it with the command "reset".
less filename
Scroll through a content of a text file. Press q when done. "Less" is roughly equivalent to "more" , the command you know from DOS, although very often "less" is more convenient than "more".
pico filename
Edit a text file using the simple and standard text editor called pico.
pico -w filename
Edit a text file, while disabling the long line wrap. Handy for editing configuration files, e.g. /etc/fstab.
find / -name "filename"
Find the file called "filename" on your filesystem starting the search from the root directory "/". The "filename" may contain wildcards (*,?).
locate filename
Find the file name of which contains the string "filename". Easier and faster than the previous command but depends on a database that normally rebuilds at night.
./program_name
Run an executable in the current directory, which is not on your PATH.
touch filename
Change the date/time stamp of the file filename to the current time. Create an empty file if the file does not exist.
xinit
Start a barebone X-windows server (without a windows manager).
startx
Start an X-windows server and the default windows manager. Works like typing "win" under DOS with Win3.1
startx -- :1
Start another X-windows session on the display 1 (the default is opened on display 0). You can have several GUI terminals running concurrently. Switch between them using , , etc.
xterm
(in X terminal) Run a simple X-windows terminal. Typing exit will close it. There are other, more advanced "virtual" terminals for X-windows. I like the popular ones: konsole and kvt (both come with kde) and gnome-terminal (comes with gnome). If you need something really fancy-looking, try Eterm.
xboing
(in X terminal). Very nice, old-fashioned game. Many small games/programs are probably installed on your system. I also like xboard (chess).
shutdown -h now
(as root) Shut down the system to a halt. Mostly used for a remote shutdown. Use for a shutdown at the console (which can be done by any user).
halt
reboot
(as root, two commands) Halt or reboot the machine. Used for remote shutdown, simpler to type than the previous command.

Network apps
netscape
(in X terminal) Run netscape (requires a separate Netscape installation). The current versions of Netscape (4.x) are known to be big and buggy. They occasionally crash by vanishing (no other harm done). Also, when not connected to the network , Netscape likes to refuse to do anything (looks like it hanged)-it revives when you connect.
netscape -display host:0.0
(in X terminal) Run netscape on the current machine and direct the output to machine named "host" display 0 screen 0. Your current machine must have a permission to display on the machine "host" (typically given by executing the command xhost current_machine_name in the xterminal of the machine host. Other X-windows program can be run remotely the same way.
lynx file.html
View an html file or browse the net from the text mode.
pine
A good text-mode mail reader. Another good and standard one is elm. Your Netscape mail will read the mail from your Internet account. pine will let you read the "local" mail, e.g. the mail your son or a cron process sends to you from a computer on your home network. The command mail could also be used for reading/composing mail, but it would be inconvenient--it is meant to be used in scripts for automation.
elm
A good tex-mode mail reader. See the previous command.
mutt
A really basic but extremally useful and fast mail reader.
mail
A basic operating system tool for e-mail. Look at the previous commands for a better e-mail reader. mail is good if you wanted to send an e-mail from a shell script.
licq
(in X term) An icq "instant messaging" client. Another good one is kxicq. Older distributions don't have an icq client installed, you have to do download one and install it.
talk username1
Talk to another user currently logged on your machine (or use "talk username1@machinename" to talk to a user on a different computer) . To accept the invitation to the conversation, type the command "talk username2". If somebody is trying to talk to you and it disrupts your work, your may use the command "mesg n" to refuse accepting messages. You may want to use "who" or "rwho" to determine the users who are currently logged-in.
mc
Launch the "Midnight Commander" file manager (looks like "Norton Commander" for Linux).
telnet server
Connect to another machine using the TELNET protocol. Use a remote machine name or IP address. You will be prompted for your login name and password--you must have an account on the remote machine to login. Telnet will connect you to another machine and let you operate on it as if you were sitting at its keyboard (almost). Telnet is not very secure--everything you type goes in open text, even your password!
rlogin server
(=remote login) Connect to another machine. The login name/password from your current session is used; if it fails you are prompted for a password.
rsh server
(=remote shell) Yet another way to connect to a remote machine. The login name/password from your current session is used; if it fails you are prompted for a password.
ftp server
Ftp another machine. (There is also ncftp which adds extra features and gftp for GUI .) Ftp is good for copying files to/from a remote machine. Try user "anonymous" if you don't have an account on the remote server. After connection, use "?" to see the list of available ftp commands. The essential ftp command are: ls (see the files on the remote system), ASCII, binary (set the file transfer mode to either text or binary, important that you select the proper one ), get (copy a file from the remote system to the local system), mget (get many files at once), put (copy a file from the local system to the remote system), mput (put many files at once), bye (disconnect). For automation in a script, you may want to use ncftpput and ncftpget, for example:
ncftpput -u my_user_name -p my_password -a remote.host.domain remote_dir *local.html
minicom
Minicom program (looks like "Procomm for Linux").
File (de)compression
tar -zxvf filename.tar.gz
(=tape archiver) Untar a tarred and compressed tarball (*.tar.gz or *.tgz) that you downloaded from the Internet.
tar -xvf filename.tar
Untar a tarred but uncompressed tarball (*.tar).
gunzip filename.gz
Decompress a zipped file (*.gz" or *.z). Use gzip (also zip or compress) if you wanted to compress files to this file format.
bunzip2 filename.bz2
(=big unzip) Decompress a file (*.bz2) zipped with bzip2 compression utility. Used for big files.
unzip filename.zip
Decompress a file (*.zip) zipped with a compression utility compatible with PKZIP for DOS.
unarj e filename.arj
Extract the content of an *.arj archive.
uudecode -o outputfile filename
Decode a file encoded with uuencode. uu-encoded files are typically used for transfer of non-text files in e-mail (uuencode transforms any file into an ASCII file).
7.4 Process control
ps
(=print status) Display the list of currently running processes with their process IDs (PID) numbers. Use ps axu to see all processes currently running on your system (also those of other users or without a controlling terminal), each with the name of the owner. Use "top" to keep listing the processes currently running.
fg PID
Bring a background or stopped process to the foreground.
bg PID
Send the process to the background. Opposite to fg. The same can be accomplished with z. If you have stopped jobs, you have to type exit twice in row to log out.
any_command&
Run any command in the background (the symbol "&" means "run the proceeding command in the background").
batch any_command
Run any command (usually one that is going to take more time) when the system load is low. I can logout, and the process will keep running.
at 17:00
Execute a command at a specified time. You will be prompted for the command(s) to run, until you press d.
kill PID
Force a process shutdown. First determine the PID of the process to kill using ps.
killall program_name
Kill program(s) by name.
xkill
(in an xwindow terminal) Kill a GUI-based program with mouse. (Point with your mouse cursor at the window of the process you want to kill and click.)
lpc
(as root) Check and control the printer(s). Type "?" to see the list of available commands.
lpq
Show the content of the printer queue. Under KDE (X-Windows), you may use GUI-based "Printer Queue" available from "K"menu-Utilities.
lprm job_number
Remove a printing job "job_number" from the queue.
nice program_name
Run program_name adjusting its priority. Since the priority is not specified in this example, it will be adjusted by 10 (the process will run slower), from the default value (usually 0). The lower the number (of "niceness" to other users on the system), the higher the priority. The priority value may be in the range -20 to 19. Only root may specify negative values. Use "top" to display the priorities of the running processes.
renice -1 PID
(as root) Change the priority of a running process to -1. Normal users can only adjust processes they own, and only up from the current value (make them run slower).
c, z, s, and q also belong to this chapter but they were described previously. In short they mean: stop the current command, send the current command to the background, stop the data transfer, resume the data transfer.

7.5 Basic administration commands
printtool
(as root in X-terminal) Configuration tool for your printer(s). Settings go to the file /etc/printcap.
setup
(as root) Configure mouse, soundcard, keyboard, X-windows, system services. There are many distibution-specific configuration utilities, setup is the default on RedHat. Mandrake 7.0 offers very nice DrakConf .
linuxconfig
(as root, either in text or graphical mode). You can access and change hundreds of setting from it. Very powerful--don't change too many things at the same time, and be careful with changing entries you don't understand.
xvidtune
(in X-terminal). Adjust the settings of the graphical display for all resolutions so as to eliminate black bands, shift the display right/left/up/down, etc. (First use the knobs on your monitor to fit your text mode correctly on the screen.) To make the changes permanent, display the frequencies on the screen and transfer them to the setup file /etc/X11/XF86Config.
alias ls="ls --color=tty"
Create an alias for the command "ls" to enhance its format with color. In this example, the alias is also called "ls" and the "color" option is only envoke when the output is done to a terminal (not to files). Put the alias into the file /etc/bashrc if you would like the alias to be always accessible to all users on the system. Type "alias" alone to see the list of aliases on your system.
adduser user_name
Create a new account (you must be root). E.g., adduser barbara Don't forget to set up the password for the new user in the next step. The user home directory is /home/user_name.
useradd user_name
The same as the command " adduser user_name ".
userdel user_name
Remove an account (you must be a root). The user's home directory and the undelivered mail must be dealt with separately (manually because you have to decide what to do with the files).
groupadd group_name
Create a new group on your system. Non-essential but can be handy even on a home machine with a small number of users.
passwd
Change the password on your current account. If you are root, you can change the password for any user using: passwd user_name
chmod perm filename
(=change mode) Change the file access permission for the files you own (unless you are root in which case you can change any file). You can make a file accessible in three modes: read (r), write (w), execute (x) to three classes of users: owner (u), members of the same group as the owner (g), others on the system (o). Check the current access permissions using:
ls -l filename
If the file is accessible to all users in all modes it will show:
rwxrwxrwx
The first triplet shows the file permission for the owner of the file, the second for his/her group, the third for others. A "no" permission is shown as "-".
E.g., this command will add the permission to read the file "junk" to all (=user+group+others):
chmod a+r junk
This command will remove the permission to execute the file junk from others:
chmod o-x junk
Also try here for more info.
You can set the default file permissions for the news files that you create using the command umask (see man umask).
chown new_ownername filename
chgrp new_groupname filename
Change the file owner and group. You should use these two commands after you copy a file for use by somebody else.
su
(=substitute user id) Assume the superuser (=root) identity (you will be prompted for the password). Type "exit" to return you to your previous login. Don't habitually work on your machine as root. The root account is for administration and the su command is to ease your access to the administration account when you require it. You can also use "su" to assume any other user identity, e.g. su barbara will make me "barbara" (password required unless I am a superuser).
kernelcfg
(as root in X terminal). GUI to to add/remove kernel modules. You can do the same from the command line using the command "insmod", but "insmode" is less "newbie-friendly".
lsmod
List currently loaded kernel modules. A module is like a device driver--it provides operating system kernel support for a particular piece of hardware or feature.
modprobe -l more
List all the modules available for your kernel. The available modules are determined by how your Linux kernel was compliled. Every possible module/feature can be compiled on linux as either "hard wired" (fast, non-removable), "module" (maybe slower, but loaded/removable on demand), or "no" (no support for this feature at all).
insmod parport
insmod ppa
(as root) Insert modules into the kernel (a module is roughly an equivalent of a DOS device driver). This example shows how to insert the modules for support of the external parallel port zip drive (it appears to be a problem to get the external zip drive to work in any other way under RH6.0 ).
rmmod module_name
(as root, not essential). Remove the module module_name from the kernel.
setserial /dev/cua0 port 0x03f8 irq 4
(as root) Set a serial port to a non-standard setting. The example here shows the standard setting for the first serial port (cua0 or ttyS0). The standard PC settings for the second serial port (cua1or ttyS1) are: address of i/o port 0x02f8, irq 3. The third serial port (cua2 or ttyS2): 0x03e8, irq 4. The forth serial port (cua3 or ttyS3): 0x02e8, irq 3. Add your setting to /etc/rc.d/rc.local if you want it to be set at the boot time. See man setserial for good a overview.
fdisk
(as root) Linux hard drive partitioning utility (DOS has a utility with the same name).
cd /usr/src/linux-2.0.36
make xconfig
(as root in X terminal). Nice GUI front-end for configuration of the kernel options in preparation for compilation of your customized kernel. (The directory name contains the version of your Linux kernel so you may need to modify the directory name if your Linux kernel version is different than 2.0.36 used in this example. You also need the "Tk" interpreter and the kernel source code installed. ) The alternatives to "make xconfig" are: "make config" (runs a scripts that asks you questions in the text mode) and "make menuconfig" (runs a text-based menu-driven configuration utility). Try: less /usr/doc/HOWTO/Kernel-HOWTO for more information.
After the configuration, you may choose to proceed with kernel compilation of the new kernel by issuing the following commands:
make dep
make zImage
The last command will take some time to complete (maybe 0.5 h, depending on your hardware). It produces the file "zImage", which is your new Linux kernel. Next:
make modules
make modules_install
Read: /usr/doc/HOWTO/Kernel-HOWTO for information on how to install the new kernel. You will probably also find it useful to read "man depmode". Configuration, compilation and installation of a new kernel is not difficult but it CAN lead to problems if you don't know what you are doing.
Compilation of a kernel is a good way to test your hardware, because it involves a massive amount of computing. If your hardware is "flaky", you will most likely receive the "signal 11" error (read the beatiful /usr/doc/FAQ/txt/GCC-SIG11-FAQ). See this for details on kernel upgrade.
depmod -a
(as root) Build the module dependency table for the kernel. This can, for example, be useful after installing and booting a new kernel. Use "modprobe -a" to load the modules.
ldconfig
(as root) Re-create the bindings and the cache for the loader of dynamic libraries ("ld"). You may want to run ldconfig after an installation of new dynamically linked libraries on your system. (It is also re-run every time you boot the computer, so if you reboot you don't have to run it manually.)
mknod /dev/fd0 b 2 0
(=make node, as root) Create a device file. This example shows how to create a device file associated with your first floppy drive and could be useful if you happened to accidentally erase it. The options are: b=block mode device (c=character mode device, p=FIFO device, u=unbuffered character mode device). The two integers specify the major and the minor device number.
fdformat /dev/fd0H1440
mkfs -c -t ext2
(=floppy disk format, two commands, as root) Perform a low-level formatting of a floppy in the first floppy drive (/dev/fd0), high density (1440 kB). Then make a Linux filesystem (-t ext2), checking/marking bad blocks (-c ). Making the files system is an equivalent to the high-level format.
badblocks /dev/fd01440 1440
(as root) Check a high-density floppy for bad blocks and display the results on the screen. The parameter "1440" specifies that 1440 blocks are to be checked. This command does not modify the floppy.
fsck -t ext2 /dev/hda2
(=file system check, as root) Check and repair a filesystem. The example uses the partition hda2, filesystem type ext2.
dd if=/dev/fd0H1440 of=floppy_image
dd if=floppy_image of=/dev/fd0H1440
(two commands, dd="data duplicator") Create an image of a floppy to the file called "floppy_image" in the current directory. Then copy floppy_image (file) to another floppy disk. Works like DOS "DISKCOPY".

Program installation
rpm -ivh filename.rpm
(=RedhatPackageManager, install, verbose, hashes displayed to show progress, as root.) Install a content of RedHat rpm package(s) and print info on what happened. Keep reading if you prefer a GUI installation.
rpm -qpi filename.rpm
(=RedhatPackageManager, query, package, list.) Read the info on the content of a yet uninstalled package filename.rpm.
rpm -qpl filename.rpm
(=RedhatPackageManager, query, package, information.) List the files contained in a yet uninstalled package filename.rpm.
rpm -qf filename
(=RedhatPackageManager, query, file.) Find out the name of the *.rpm package to which the file filename (on your hardrive) belongs.
rpm -e packagename
(=RedhatPackageManager, erase=uninstall.) Uninstall a package pagckagename. Packagname is the same as the beginning of the *.rpm package file but without the dash and version number.
kpackage
gnorpm
glint
(in X terminal, as root if you want to be able to install packages) GUI fronts to the Red Hat Package Manager (rpm). "glint" comes with RH5.2, "gnorpm" with RH6.0, "kpackage" comes with RH6.1 or must be installed separately but is the best of the three. Use any of them to view which software packages are installed on your system and the what not-yet-installed packages are available on your RedHat CD, display the info about the packages, and install them if you want (installation must be done as root).

Accessing drives/partitions
mount
See here for details on mounting drives. Examples are shown in the next commands.
mount -t auto /dev/fd0 /mnt/floppy
(as root) Mount the floppy. The directory /mnt/floppy must exist, be empty and NOT be your current directory.
mount -t auto /dev/cdrom /mnt/cdrom
(as root) Mount the CD. You may need to create/modify the /dev/cdrom file depending where your CDROM is. The directory /mnt/cdrom must exist, be empty and NOT be your current directory.
mount /mnt/floppy
(as user or root) Mount a floppy as user. The file /etc/fstab must be set up to do this. The directory /mnt/floppy must not be your current directory.
mount /mnt/cdrom
(as user or root) Mount a CD as user. The file /etc/fstab must be set up to do this. The directory /mnt/cdrom must not be your current directory.
umount /mnt/floppy
Unmount the floppy. The directory /mnt/floppy must not be your (or anybody else's) current working directory. Depending on your setup, you might not be able to unmount a drive that you didn't mount.

7.6 Network administration tools
netconf
(as root) A very good menu-driven setup of your network.
pingmachine_name
Check if you can contact another machine (give the machine's name or IP), press C when done (it keeps going).
route -n
Show the kernel routing table.
nslookup host_to_find
Query your default domain name server (DNS) for an Internet name (or IP number) host_to_find. This way you can check if your DNS works. You can also find out the name of the host of which you only know the IP number.
traceroute host_to_trace
Have a look how you messages trave to host_to_trace (which is either a host name or IP number).
ipfwadm -F -p m
(for RH5.2, seen next command for RH6.0) Set up the firewall IP forwarding policy to masquerading. (Not very secure but simple.) Purpose: all computers from your home network will appear to the outside world as one very busy machine and, for example, you will be allowed to browse the Internet from all computers at once.
echo 1 > /proc/sys/net/ipv4/ip_forward
ipfwadm-wrapper -F -p deny
ipfwadm-wrapper -F -a m -S xxx.xxx.xxx.0/24 -D 0.0.0.0/0
(three commands, RH6.0). Does the same as the previous command. Substitute the "x"s with digits of your class "C" IP address that you assigned to your home network. See here for more details. In RH6.1, masquarading seems broken to me--I think I will install Mandrake Linux:).
ifconfig
(as root) Display info on the network interfaces currently active (ethernet, ppp, etc). Your first ethernet should show up as eth0, second as eth1, etc, first ppp over modem as ppp0, second as ppp1, etc. The "lo" is the "loopback only" interface which should be always active. Use the options (see ifconfig --help) to configure the interfaces.
ifup interface_name
(/sbin/ifup to it run as a user) Startup a network interface. E.g.:
ifup eth0
ifup ppp0
Users can start up or shutdown the ppp interface only when the right permission was checked during the ppp setup (using netconf ). To start a ppp interface (dial-up connection), I normally use kppp available under kde menu "internet".
ifdown interface_name
(/sbin/ifdown to run it as a user). Shut down the network interface. E.g.: ifdown ppp0 Also, see the previous command.
netstat more
Displays a lot (too much?) information on the status of your network.

Music-related commands
cdplay play 1
Play the first track from a audio CD.
eject
Get a free coffee cup holder :))). (Eject the CD ROM tray).
play my_file.wav
Play a wave file.
mpg123 my_file.mp3
Play an mp3 file.
mpg123 -w my_file.wav my_file.mp3
Create a wave audio file from an mp3 audio file.
knapster
(in X terminal) Start the program to downolad mp3 files that other users of napster have displayed for downloading. Really cool!
cdparanoia -B "1-"
(CD ripper) Read the contents of an audio CD and save it into wavefiles in the current directories, one track per wavefile. The "1-"
means "from track 1 to the last". -B forces putting each track into a separate file.
playmidi my_file.mid
Play a midi file. playmidi -r my_file.mid will display text mode effects on the screen.
sox
(argument not given here) Convert from almost any audio file format to another (but not mp3s). See man sox.

Graphics-related commands
kghostview my_file.ps
Display a postscript file on screen. I can also use the older-looking ghostview or gv for the same end effect.
ps2pdf my_file.ps my_file.pdf
Make a pdf (Adobe portable document format) file from a postscript file.
gimp
(in X terminal) A humble looking but very powerful image processor. Takes some learning to use, but it is great for artists, there is almost nothing you can't do with gimp. Use your mouse right button to get local menus, and learn how to use layers. Save your file in the native gimp file format *.xcf (to preserve layers) and only then flatten it and save as png (or whatever). There is a large user manual /usr/
gphoto
(in X terminal) Powerful photo editor.
giftopnm my_file.giff > my_file.pnm
pnmtopng my_file.pnm > my_file.png
Convert the propriatory giff graphics into a raw, portable pnm file. Then convert the pnm into a png file, which is a newer and better standard for Internet pictures (better technically plus there is no danger of being sued by the owner of giff patents).


QA Guidelines for Data Warehouse Quality Verification

QA Guidelines for Data Warehouse Quality Verification

This document describes testing guidelines and steps for verifying data, ETL processes, and SQL during the construction, unit testing, system and integration testing of an application’s data warehouse operational tables and data mart.
1.) Verify and Maintain the Data Low Level Design (LLD)
A first level of testing and validation begins with the formal acceptance of the logical data model and “low level design” (LLD). All further testing and validation will be based on the understanding of each of the data elements in the model.
Data elements that are created through a transformation or summary process must be clearly identified and calculations for each of these data elements must be clear and easily interpreted.
During the LLD reviews and updates, special consideration should be given to typical modeling scenarios that exist in the project. Examples follow:
1. Verify that many-to-many attribute relationships are clarified and resolved.
2. Verify the types of keys that are used: surrogate keys versus natural keys.
3. Verify that the business analyst / DBA reviewed with ETL architect and developers (application) the lineage and business rules for extracting, transforming, and loading the data warehouse?
4. Verify that all transformation rules, summarization rules, and matching and consolidation rules have clear specifications.
5. Verify that specified transformations, business rules and cleansing specified in LLD and other application logic specs have been coded correctly in ETL, JAVA, and SQL used for data loads.
6. Verify that procedures are documented to monitor and control data extraction, transformation and loading. The procedures should describe how to handle exceptions and program failures.
7. Verify that data consolidation of duplicate or merged data was properly handled.
8. Verify that samplings of domain transformations will be taken to verify they are properly changed.
9. Compare unique values of key fields between source data and data loaded to the warehouse. This is a useful technique that points out a variety of possible data errors without doing a full validation on all fields.
10. Validate that target data types are as specified in the design and/or the data model.
11. Verify how sub-class/super-class attributes depicted?
12. Verify that data field types and formats are specified.
13. Verify that defaults are specified for fields where needed.
14. Verify that processing for invalid field values in source are defined
15. Verify that expected ranges of field contents are specified where known.
16. Verify that keys generated by the “sequence generator” are identified.
17. Verify that slowly changing dimensions are described?

2.) Analyze Source Data Before & After Extraction to Staging

Testers should extract representative data from each source file (before or after extract to staging tables) and confirm that the data is consistent with its definition; QA can discover any anomalies in how the data is represented and write defect reports where necessary. The objective is to discover data that does not meet “data quality factors” as described in specifications. See list below and Table 1.
This verification process will be used for temp tables used in a step process for data transformations, cleaning, etc.
• Verify that the scope of values in each column are within specifications
• Identify unexpected values in each field
• Verify relationships between fields
• Identify frequencies of values in columns and whether these frequencies make sense?

Inputs: Application source data models and low level data design, data dictionaries, data attribute sources.
Outputs: Newly discovered attributes, undefined business rules, data anomalies such as fields used for multiple purposes.
Techniques and Tools: Data extraction software, business rule discovery software, data analysis tools.
Process Description:
1. Extract representative samples of data from each source or staging table.
2. Parse the data for the purpose of profiling.
3. Verify that not-null fields are populated and populated as expected.
4. Structure discovery – Does the data match the corresponding metadata? Do field attributes of the data match expected patterns? Does the data adhere to appropriate uniqueness and null value rules?
5. Data discovery – Are the data values complete, accurate and unambiguous?
6. Relationship discovery – Does the data adhere to specified required key relationships across columns and tables? Are there inferred relationships across columns, tables or databases? Is there redundant data?
7. Verify that all required data from the source was extracted. Verify that extraction process did not extract more or less data source than it should have.
8. Verify or write defects for exceptions and errors discovered during the ETL process.
9. Verify that extraction process did not extract duplicate data from the source (usually this happens in repeatable processes where at point zero we need to extract all data from the source file, but the during the next intervals we only need to capture the modified, and new rows.).
10. Validate that no data truncation occurred during staging.
11. Utilize a data profiling tool or methods that show the range and value distributions of fields in the source data. This is used to identify any data anomalies from source systems that may be missed even when the data movement is correct.
12. Validation & Certification Method: it is sufficient to identify the requirements and count (via SQL) the number of rows that should be extracted from the source systems. The QA team will also count the number of rows in the result / target sets and match the two for validation. The QA team will maintain a set of SQL statements that are automatically run at this stage to validate that no duplicate data have been extracted from the source systems.
Table 1: Data Quality Factors
FACTOR DESCRIPTION EXAMPLE
Data Consistency Issues:
Varying Data Definitions The data type and length for a particular attribute may vary in files or tables though the semantic definition is the same. Account number may be defined as: Number (9) in one field or table and Varchar2(11) in another table
Misuse of Integrity Constraints When referential integrity constraints are misused, foreign key values may be left “dangling” or inadvertently deleted. An account record is missing but dependent records are not deleted.
Nulls Nulls when field defined as “not-null”. The company has been entered as a null value for a business. A report of all companies would not list the business.
Data Completeness Issues:
Missing data Data elements are missing due to a lack of integrity constraints or nulls that are inadvertently not updated. An account date of estimated arrival is null thus impacting an assessment of variances in estimated/actual account data.
Inaccessible Data Inaccessible records due to missing or redundant identifier values. Business numbers are used to identify a customer record. Because uniqueness was not enforced, the business ID (45656) identifies more than one customer.
Missing Integrity Constraints Missing constraints can cause data errors due to nulls, non-uniqueness, or missing relationships. Account records with a business identifier exist in the database but cannot be matched to an existing business.
Data Correctness Issues:
Loss Projection Tables that are joined over non key attributes will produce non existent data that is shown to the user. Lisa Evans works in the LA office in the Accounting department. When a report is generated, it shows her working in IT department.
Incorrect Data Values Data that is misspelled or inaccurately recorded. 123 Maple Street is recorded with a spelling mistake and a street abbreviation (123 Maple St)
Inappropriate Use of Views Data is updated incorrectly through views. A view contains non key attributes from base tables. When the view is used to update the database, null values are entered into the key columns of the base tables.
Disabled Integrity Constraints Null, non unique, or out of range data may be stored when the integrity constraints are disabled. The primary key constraint is disabled during an import function. Data is entered into the existing data with null unique identifiers.
Non-duplication Testing should be conducted to determine if there’s duplication of data where there should not be. Duplicate rows or column data.
Misuse of Integrity Constraints Check whether null or foreign key constraints are inappropriate or too restrictive. Check constraint only allows hard coded values of “C”, “A”, “X”, and “Z”. But a new code “B” cannot be entered.
Data Comprehension Issues:
Data Aggregation Aggregated data is used to represent a set of data elements. One name field is used to store surname, first name, middle initial, and last name (e.g., John, Hanson, Mr.).
Cryptic Object Definitions Database object (e.g., column) has a cryptic, unidentifiable name. Customer table with a column labeled, “c_avd”. There is no documentation as to what the column might contain.
Unknown or Cryptic Data Cryptic data stored as codes, abbreviations, truncated, or with no apparent meaning. Shipping codes used to represent various parts of the customer base (‘01’, ‘02’, ‘03’). No supporting document to explain the meaning of the codes.
Accuracy Data will be matched against business rules. Boundary values (low, high’s) will be identified for relevant fields and compared with expectations.
Completeness Data will be assessed to verify that all required is present. Missing rows will be identified; Null values will be identified in data elements where a value is expected.
Precision Precision testing is conducted to evaluate the level of data not sufficiently precise based on specifications.

3.) Verify Corrected, Cleaned, Source Data in Staging

This step works to improve the quality of existing data in source files or “defects” that meet source specs but must be corrected before load.
Inputs:
 Files or tables (staging) that require cleansing; data definition and business rule documents, data map of source files and fields; business rules, data anomalies discovered in earlier steps of this process.
 Fixes for data defects that will result in data that does not meet specifications for the application DW.
Outputs: Defect reports, cleansed data, rejected or uncorrectable data
Techniques and Tools: Data reengineering, transformation, and cleansing tools, MS Access, Excel filtering.
Process Description: In this step, data with missing values, known errors, and suspect data is corrected. Automated tools may be identified to best to locate, clean / correct large volumes of data.
1. Document the type of data cleansing approach taken for each data type in the repository.
2. Determine how “uncorrectable” or suspect data is processed, rejected, maintained for corrective action. SME’s and stakeholders should be involved in the decision.
3. Review ETL defect reports to assess rejected data excluded from source files or information group targeted for the warehouse.
4. Determine if data not meeting quality rules was accepted.
5. Document in defect reports, records and important fields that cannot be easily corrected.
6. Document records that were corrected and how corrected.
Certification Method: Validation of data cleansing processes could be a tricky proposition, but certainly doable. All data cleansing requirements should be clearly identified. The QA team should learn all data cleansing tools available and their methods. QA should create various conditions as specified in the requirements for the data cleansing tool to support and validate its results. QA will run a volume of real data through each tool to validate accuracy as well as performance.

4.) Verifying Matched and Consolidated Data

There are often ETL processes where data has been consolidated from various files into a single occurrence of records. The cleaned and consolidated data can be assessed to very matched and consolidated data.
Much of the ETL heavy lifting occurs in the transform step where combined data, data with quality issues, updated data, surrogate keys, build aggregates, are processed.
Inputs: Analysis of all files or databases for each entity type
Outputs:
Report of matched, consolidated, related data that is suspect or in error
List of duplicate data records or fields
List of duplicate data suspects.
Techniques and Tools: Data matching techniques or tools; data cleansing software with matching and merging capabilities.
Process Description:
1. Establish match criteria for data. Select attributes to become the basis for possible duplicate occurrences (e.g., names, account numbers).
2. Determine the impact of incorrectly consolidated records. If the negative impact of consolidating two different occurrences such as different customers into a single customer record exists, submit defect reports. The fix should be higher controls to help avoid such consolidations in the future.
3. Determine the matching techniques to be used: Exact character match in two corresponding fields such as wild card match, key words, close match, etc.
4. Compare match criteria for specific record with all other records within a given file to look for intra-file duplicate records.
5. Compare match criteria for a specific record with all records in another file to seek inter-file duplicate records.
6. Evaluate potential matched occurrences to assure they are, in fact, duplicate.
7. Verify that consolidated data into single occurrences is correct.
8. Examine and re-relate data related to old records being consolidated to new occurrence-of-reference record. Validate that no related data was overlooked.

5.) Verify Transformed / Enhanced / Calculated Data to Target Tables

At this stage, base data is being prepared for loading into the Application operational tables and the data mart. This includes converting and formatting cleansed, consolidated data into the new data architecture and possibly enhancing internal operational data with external data licensed from service providers.
The objective is to successfully map the cleaned, corrected and consolidated data into the DW environment.
Inputs: Cleansed, consolidated data; external data from service providers; business rules governing the source data; business rules governing the target DW data; transformation rules governing the transformation process; DW or target data architecture; data map of source data to standardized data.
Output: Transformed, calculated, enhanced data; updated data map of source data to standardized data; data map of source data to target data architecture
Techniques and Tools: Data transformation software; external or online or public databases.
Process Description:
1. Verify that the data warehouse construction team is using the data map of source data to the DW standardized data, verify the mapping.
2. Verify that the data transformation rules and routines are correct.
3. Verify the data transformations to the DW and assure that the processes were performed according to specifications.
4. Verify that data loaded in the operational tables and data mart meets the definition of the data architecture including data types, formats, accuracy, etc.
5. Develop scenarios to be covered in Load Integration Testing
6. Count Validation: Record Count Verification DWH backend/Reporting queries against source and target as an initial check.
7. Dimensional Analysis: Data integrity exists between the various source tables and parent / child relationships.
8. Statistical Analysis: Validation for various calculations.
9. Data Quality Validation: - Check for missing data, negatives and consistency. Field-by-field data verification will be done to check the consistency of source and target data.
10. Granularity: Validate at the lowest granular level possible (lowest in the hierarchy E.g. Country-City-Sector– start with test cases).
11. Dynamic Transformation Rules & Tables: such methods need to be checked continuously to ensure the correct transformation routines are executed. Verify that dynamic mapping tables and dynamic mapping rules provide an easy, documented, and automated way for transforming values from one or more sources into a standard value presented in the DW.
12. Verification Method: The QA team will identify the detailed requirements as they relate to transformation and validate the dynamic transformation rules and tables against DW records. Utilizing SQL and related tools, the team will identify unique values in source data files that are subject to transformation. The QA team identifies the results from the transformation process and validate that such transformation have accurately taken place.

6.) Front-end UI and Report Testing Using Operational Tables and Data Mart

End user reporting is a major component of the Application Project. The report code may run aggregate SQL queries against the data stored in the data mart and/or the operational tables then display results in a suitable format either in a Web browser or on a client application interface. Once the initial view is rendered, the reporting tool interface provides various ways of manipulating the information such as sorting, pivoting, computing subtotals, and adding view filters to slice-and-dice the information further. Special considerations such as those below will be prepared while testing the reports:
1. The ETL process should be complete, the data mart must be populated and data quality testing should be largely completed.
2. The front-end will use a SQL engine which will generate the SQL based on the how the dimension and fact tables are mapped. Additionally, there may be global or report-specific parameters set to handle very large database (VLDB)-related optimization requirements. As such, testing of the front-end will concentrate on validating the SQL generated; this in turn validates the dimensional model and the report specification vis-à-vis the design.
3. Unit testing of the reports will be conducted to verify the layout format per the design mockup, style sheets, prompts and filters, attributes and metrics on the report.
4. Unit testing will be executed both in the desktop and Web environment.
5. System testing of the reports will concentrate on various report manipulation techniques like the drilling, sorting and export functions of the reports in the Web environment.
6. Reports and/or documents need special consideration for testing because they are high visibility reports used by the top analysts and because they have various charts, gauges and data points to provide a visual insight to the performance of the organization in question.
7. There may be some trending reports, or more specifically called comp reports, that compare the performance of an organizational unit over multiple time periods. Testing these reports needs special consideration especially if a fiscal calendar is used instead of an English calendar for time period comparison.
8. For reports containing derived metrics special focus should be paid to any subtotals. The subtotal row should use a "smart-total," i.e., do the aggregation first and then do the division instead of adding up the individual cost per click of each row in the report.
9. Reports with "non-aggregate-able" metrics (e.g., inventory at hand) also need special attention to the subtotal row. It should not, for example, add up the inventory for each week and show the inventory of the month.
10. During unit testing, all data formats will be verified against the standard. For example, metrics with monetary value should show the proper currency symbol, decimal point precision (at least two places) and the appropriate positive or negative. For example, negative numbers should be shown in red and enclosed in braces.
11. During system testing, while testing the drill-down capability of reports, care will be taken to verify that the subtotal at the drill-down report matches with the corresponding row of the summary report. At times, it is desirable to carry the parent attribute to the drill-down report; verify the requirements for this.
12. When testing reports containing conditional metrics, care will be taken to check for "outer join condition;" i.e., nonexistence of one condition is reflected appropriately with the existence of the other condition.
13. Reports with multilevel sorting will get special attention for testing especially if the multilevel sorting includes both attributes and metrics to be sorted.
14. Reports containing metrics at different dimensionality and with percent-to-total metrics and/or cumulative metrics needs will get special attention to check that the subtotals are hierarchy-aware (i.e., they "break" or "re-initialized" at the appropriate levels).

7.) Operational Table and Data Mart: Build Sanity Test
1. Session Completions: All workflow sessions completed successfully using the Log Viewer.
2. Source to Target Counts: This process verifies that the number of records in the source system matches the number of records received, and ultimately processed, into the data warehouse. If Look-up’s are involved in the ETL process, the count between source and target will not match. The ETL Session log and target table counts are compared.
3. Source to Target Data Verification: The process verifies that all source and reference tables have data before running ETLs. We verify that all target tables were truncated before the load unless target tables are updated. This process verifies that the source field threshold is not subject to truncation during the transformation or loading of data.
4. Field to Field Verification: This process verifies the field values from the source system to target. This process ensures that the data mapping from the source system to the target is correct, and that data sent has been loaded accurately.
5. ETL Exception Processing: Exception processing verification looks for serious data errors that would cause system processing failures or data corruption. An Exception report verifying the number and types of errors encountered is produced and reviewed for additional processing and / or reporting to the customer.
There are two types of Exception process:
1. Database Exception:
• Not Null - Source column is null while target is not null
• Reference Key - The records coming from the source data do not have a corresponding parent key in the parent table.
• Unique Key - The record already exists in the target table.
• Check Constraint - CHECK constraints enforce domain integrity by limiting the values that are accepted by a column
2. Business Exception
These are the exceptions thrown based on certain business rules defined for specific data elements or group of data elements
• ETL process utilizes a single Exception Table to capture the exceptions from various ETL sessions and an Error Lookup table which has various error codes and their description.
• We check the Exception process using the Session Log and Exception Table.

8.) Sanity Test: Exit and Suspension Criteria
1. No critical defects unfixed; No more than 3 high severity defects.
2. 80% or more of build functionality can be tested – functionality might fail because of JAVA / report code.
3. Platform performance is such that test team can productively work to schedule
4. Fewer than 15% of build fixes failed