Why? If done properly it is difficult, very difficult. With today’s inexpensive disk storage, it is not uncommon to reach a terabyte anymore. Sure there are the typical database techniques of checking row counts, summary tables and also reviewing the exception logs from the ETL process, but is this enough? It is not for my current client, and it may not be for yours. After many heated discussions, I was forced to rethink my position on validation techniques used in the past.
If you feel that you are happy with the quality of data in your warehouse and are an expert in data warehousing validation, this article is not for you. Data quality, which is big topic right now, really does not address this specific issue. This discussion is not about quality; it is about validation. This article is by no means an all-inclusive recommendation on testing or validation. The purpose is to define the coverage of testing and validation that might be performed. This approach will have to be modified for each engagement but should at least provide a starting point for some.
Enough of the Disclaimers
I have determined that it is best to break the testing and validation process into four well-defined, high-level tasks for a data warehouse engagement: integration testing, system testing, data validation and acceptance testing. Integration testing includes reviewing and accepting the logical data model captured with your data modeling tool (i.e., ERwin or your tool of choice), converting the models to actual physical database tables in the test environment, creating the proper indexes and testing the ETL programs created by your ETL tool. System testing involves increasing the volume of the test data to be loaded, estimating load times and placing data into either a high- volume test area or actually in the production environment. Data validation includes reviewing the ETL mapping encoded in the ETL tool as well as reviewing samples of the data that was loaded into the test environment. Acceptance testing will include completeness of data model to meet the reporting needs for your specific project, reviewing summary table designs, validation of data actually loaded in the production data warehouse environment and a review of the daily upload procedures.
Since the data warehouse will consist of several dimensions and fact tables (if utilizing a star schema), each of the first three testing and validation phases can be executed somewhat in parallel. The initial review of the data elements in the data model should include the identification of data elements in the fact table that are critical for success for your project. During the review, several data elements may have been added as placeholders for future enhancements. These too should be noted so that during the final testing and acceptance procedures null or voided entries in the production system are not noted as failed data elements.
Due to the pure volume of data in an enterprise data warehouse, it is not feasible to validate each row that is loaded when significant transformations are involved. To add to the complexity, loading may not be done all at a single point in time, portioned by year, SSN or other demographic information. Assumptions must be documented by the testing and validation team and accepted by the project sponsor before acceptance testing can begin. These assumptions will state the various levels of testing and validation to be performed and the objectives of each. If possible, data elements that cannot be tested in detail but can be validated through the use of summary tables and counts, acceptable levels of error, if any, must be addressed. If no testing or validation is to occur on certain data elements, an explanation and impact of not testing must be documented.
Levels and Objectives
Integration Testing
The first level of testing and validation begins with the formal acceptance of the logical data model. All further testing and validation must be based on the understanding of each of the data elements in the model and how they will assist the company, specifically in the subject area(s) for your particular increment, meet specific business objectives in the areas of reporting and analysis.
Table names in the logical model should be intuitive, so that a formal definition of the table will not be necessary but still useful. A table is best described by the data elements that reside in it. Each table should be identified by how it is to be used within the warehouse (i.e., fact table, a dimension table or summary table). Primary keys should be identified and placed at the top of each table for quick reference. During the integration testing process, a discussion must take place to validate that the primary key(s) will actually make each row in the table unique usually before loading begins.
Each table will consist of multiple data elements, some of which will be used as primary keys. A description of each data element should be captured in a repository and a physical name should be assigned. Where possible, physical names should closely reflect names in source tables and follow company standards if any exist. Data elements that are created through a transformation or summary process should be clearly identified and calculations for each of these data elements need to be captured and entered into the repository.
The second phase of integration testing is comprised of the verification of the physical model. Some data elements captured in the logical model may reside in the physical tables, but will remain as placeholders for future enhancements. A list of all data elements that will not be loaded must be created and verified. The logical model may also contain several tables that will be collapsed into a single physical table. Based on how the physical tables will be used, a DBA must then determine which data elements should be included in indexes to increase the speed and efficiency of the physical tables. The data model can then be used to create the actual SQL to create the physical tables with the data warehouse test environment.
Actually populating the physical tables utilizing the ETL programs or other database load utilities will be the third element in the integration testing process. As data is moved from source to target, the ETL will identify errors such as data type and size. These errors are logged into an exception report that can be reviewed after the load completes or fails. This process is also useful for verifying the number of rows loaded and to begin identifying potential problems related to data volume and load times.
System Testing
System testing takes the final phase of integration testing to the next level by requiring that the testing team push the limits on the volume to be loaded into the test environment. Large data sets extracted directly from the operational systems can be utilized if possible for the system testing. Final DASD size can be determined, load times can be calculated or validated and proper procedures to put jobs into production can be reviewed. At the completion of the system tests, the JCL required to move the test jobs into production can be approved by the proper organizations and actual load times can be scheduled.
Due to limitations in most test environments of limited direct access storage devices (DASD) and CPU cycles during operational hours, these tests are very useful, but may not be able to identify all of the potential errors during the actual load if the warehouse is being developed on a shared platform. This should however minimize the risk of the initial load significantly.
Data Validation
Reviewing the mapping utilized by the ETL tool is the primary task during data validation. Due to the shear volume of data to be loaded into the fact tables, it is imperative that care is taken that the mapping is validated to verify that the data that you think is being loaded into specific data elements is in fact being sourced from tables that we know contain the information in the operational system.
The ETL tool will create a report that can be viewed as is or imported into an analytical interface with reporting capabilities such as Microsoft Access. If an interface is to be utilized, the name of each table, both logical and physical, as well as each data element, also logical and physical, should be present. Each data element should have a formal description and a mapping back to the source table(s) used to populate it during the ETL process. If any transformation is to occur, this too should be documented. If no transformation is to occur, the sample SQL to capture the data element from the source table may be documented as well, but is not required.
This is an excellent opportunity to get sponsor sign off. An agreement at this time will draw a line in the sand to minimize scope creep and document the understanding at that point in time exactly what is to be developed and how. Changes to the design and transformations will occur if the system is complex at all as the designers and developers of the data warehouse increase their knowledge on how the data elements on the source systems are populated and used in the operational environment.
The next step in data validation is completed by the use of counts. Simple database queries can be run either on an entire table (for most dimensions this is possible) or for subset (i.e., number of activity records for the month of May). If these counts are equal, it can be safely assumed that records were not left out due to an error during the ETL or simple load process. This is further verified by the lack of errors (not necessarily warnings) in the exception reporting by the ETL tool.
Many dimensional tables have an exact duplicate in the operational database schema. Doing a simple table compare can further validate these tables. If a table compare indicates that no difference were found, these tables can be considered 100 percent validated. For additional verification, actual rows from both the operational and data warehouse tables can be printed and listed side by side for comparison.
Randomly selecting rows of data from the test environment and comparing them to data stored within the test data warehouse should give a high level of confidence in the ETL process. A more detailed version of this test is usually conducted in the acceptance testing procedure.
Acceptance Testing
Acceptance testing begins with the final review of the data model. A listing of potential new reports must be created and data requirements to complete these reports must be identified in a process document. mapping of the specific data needs to actual data fields in the physical data model will help ensure that no key elements have been neglected during the design phase. Existing reports may also be identified as key reports to determine if on a future date those too can be redeveloped to take advantage of the data warehouse.
During the evaluation of reporting needs, several summary tables will need to be defined to help minimize the development effort of reports as well to increase the speed of report generation. Summary tables may also be developed to provide certain counts on a monthly basis that can be analyzed via tools such as Microsoft’s Data Analyzer. Actual values from the summary tables can be used to compare to similar summaries from the operational system to help provide a high level of confidence in the values stored within the warehouse. Very seldom do the numbers in the data warehouse match 100 percent with those created in the operational system. This is often the case due to the warehouse being populated once a day or even once a week. It will be up to the individuals assigned to the acceptance team to determine why the values are different and if the difference is significant.
Once the data warehouse is fully populated and the daily load procedures are operational, the final acceptance testing can begin. This should include taking a cross section of specific data elements within each of the fact tables and comparing them directly with facts in the operational system. This can be automated with the use of custom code (COBOL, SAS, C or even simple SQL) if at all possible to maximize the number of rows to be interrogated within the data warehouse. The use of views can also help reduce the complexity of the code that would need to be developed. Due to the volume of data and the amount of CPU cycles to conduct a test of this type, a decision will need to be made if a select few data elements on a large number of rows or if all data elements for a limited number of rows would provide a better sense of valid data. Both methods can be used, but the additional development time to create two procedures of this size must be weighed against the actual value that will be derived. If all previous data validation procedures are producing valid results, maybe only one procedure can be initially created. If this procedure proves no differences, the overall warehouse can be validated with at some level of confidence.
It should be noted that to create a 100 percent validation of the data would require duplicating the entire ETL process utilizing a separate tool or procedure (i.e., all custom-built COBOL code) would be very costly, time-consuming and, quite frankly, impractical. That is the reason you buy an ETL tool in the first place. In addition, the validation process to verify that both the tool and the custom code utilize the same logic would be an extensive exercise in itself. This duplication of effort would also have to be included in all future data warehouse design and development efforts.
Completion Criteria
Since you cannot check every data element, how do you know when testing and validation is complete? Completion of testing has to be determined by the number of defects found and resolved. The test log, which identifies all defects and the status of each, is a good source to use in measuring completion criteria. The true acceptance and determination of completion can only be determined when the sponsor feels comfortable with the results in both sample reporting and the results of the tests previously listed. Without confidence in the numbers, the data warehouse will not be utilized and thus deemed a failure. To this end, it is important to revisit the purpose of the data warehouse. The data warehouse is, by definition, not an operational system. The data stored in the data warehouse is typically for analytical and reporting purposes only. The goals of most data warehousing efforts should be focused on allowing individuals the ability to create new strategies for success and optimize their organization. Neither of these goals requires a 100 percent accuracy rate in the millions of rows of facts stored over several years, but inaccuracies based on incorrect data mapping or invalidated transformations can be worse than not doing anything at all.
Take care in your future testing and validation efforts. This level of validation will add time to the plan and increase costs, but these can be minimized if you work these techniques into the plan early and integrate them into your existing methodology. I believe in the long run that your sponsors and the business users will be glad that you did.
Cody Bateman is one of the founders of Keyleon, a corporation specializing in the development of business intelligence, data warehousing and CRM solutions. Bateman has 19 years of consulting experience in a range of service industries including energy, defense, finance, health care, retail and telecommunications and is the author of the Kenetic Incremental Warehouse Implementation (KIWI) Methodology. He may be reached via e-mail at CBateman@keyleon.com.
For more information on related topics, visit the following channels:
Source:: http://www.information-management.com/infodirect/20021004/5855-1.html
 
No comments:
Post a Comment