Psyche Scribbles: 2011

Friday, December 16, 2011

Articles on Data Warehouse Testing and Validation Strategy?

As most of my readers know, I typically try to stay away from writing about generic data warehousing practices. I usually like to keep the articles short and sweet. Well, I am breaking both rules this time. As part of a current project, I was asked to help pull together the most recent articles on data warehouse testing and validation strategies. Guess what? These are few and far between. It appears that all the experts want to tell us how to build these things without ever addressing the issue validating its accuracy once it is loaded.
Why? If done properly it is difficult, very difficult. With today’s inexpensive disk storage, it is not uncommon to reach a terabyte anymore. Sure there are the typical database techniques of checking row counts, summary tables and also reviewing the exception logs from the ETL process, but is this enough? It is not for my current client, and it may not be for yours. After many heated discussions, I was forced to rethink my position on validation techniques used in the past.
If you feel that you are happy with the quality of data in your warehouse and are an expert in data warehousing validation, this article is not for you. Data quality, which is big topic right now, really does not address this specific issue. This discussion is not about quality; it is about validation. This article is by no means an all-inclusive recommendation on testing or validation. The purpose is to define the coverage of testing and validation that might be performed. This approach will have to be modified for each engagement but should at least provide a starting point for some.

Enough of the Disclaimers
I have determined that it is best to break the testing and validation process into four well-defined, high-level tasks for a data warehouse engagement: integration testing, system testing, data validation and acceptance testing. Integration testing includes reviewing and accepting the logical data model captured with your data modeling tool (i.e., ERwin or your tool of choice), converting the models to actual physical database tables in the test environment, creating the proper indexes and testing the ETL programs created by your ETL tool. System testing involves increasing the volume of the test data to be loaded, estimating load times and placing data into either a high- volume test area or actually in the production environment. Data validation includes reviewing the ETL mapping encoded in the ETL tool as well as reviewing samples of the data that was loaded into the test environment. Acceptance testing will include completeness of data model to meet the reporting needs for your specific project, reviewing summary table designs, validation of data actually loaded in the production data warehouse environment and a review of the daily upload procedures.
Since the data warehouse will consist of several dimensions and fact tables (if utilizing a star schema), each of the first three testing and validation phases can be executed somewhat in parallel. The initial review of the data elements in the data model should include the identification of data elements in the fact table that are critical for success for your project. During the review, several data elements may have been added as placeholders for future enhancements. These too should be noted so that during the final testing and acceptance procedures null or voided entries in the production system are not noted as failed data elements.
Due to the pure volume of data in an enterprise data warehouse, it is not feasible to validate each row that is loaded when significant transformations are involved. To add to the complexity, loading may not be done all at a single point in time, portioned by year, SSN or other demographic information. Assumptions must be documented by the testing and validation team and accepted by the project sponsor before acceptance testing can begin. These assumptions will state the various levels of testing and validation to be performed and the objectives of each. If possible, data elements that cannot be tested in detail but can be validated through the use of summary tables and counts, acceptable levels of error, if any, must be addressed. If no testing or validation is to occur on certain data elements, an explanation and impact of not testing must be documented.

Levels and Objectives
Integration Testing
The first level of testing and validation begins with the formal acceptance of the logical data model. All further testing and validation must be based on the understanding of each of the data elements in the model and how they will assist the company, specifically in the subject area(s) for your particular increment, meet specific business objectives in the areas of reporting and analysis.
Table names in the logical model should be intuitive, so that a formal definition of the table will not be necessary but still useful. A table is best described by the data elements that reside in it. Each table should be identified by how it is to be used within the warehouse (i.e., fact table, a dimension table or summary table). Primary keys should be identified and placed at the top of each table for quick reference. During the integration testing process, a discussion must take place to validate that the primary key(s) will actually make each row in the table unique usually before loading begins.
Each table will consist of multiple data elements, some of which will be used as primary keys. A description of each data element should be captured in a repository and a physical name should be assigned. Where possible, physical names should closely reflect names in source tables and follow company standards if any exist. Data elements that are created through a transformation or summary process should be clearly identified and calculations for each of these data elements need to be captured and entered into the repository.
The second phase of integration testing is comprised of the verification of the physical model. Some data elements captured in the logical model may reside in the physical tables, but will remain as placeholders for future enhancements. A list of all data elements that will not be loaded must be created and verified. The logical model may also contain several tables that will be collapsed into a single physical table. Based on how the physical tables will be used, a DBA must then determine which data elements should be included in indexes to increase the speed and efficiency of the physical tables. The data model can then be used to create the actual SQL to create the physical tables with the data warehouse test environment.
Actually populating the physical tables utilizing the ETL programs or other database load utilities will be the third element in the integration testing process. As data is moved from source to target, the ETL will identify errors such as data type and size. These errors are logged into an exception report that can be reviewed after the load completes or fails. This process is also useful for verifying the number of rows loaded and to begin identifying potential problems related to data volume and load times.
System Testing
System testing takes the final phase of integration testing to the next level by requiring that the testing team push the limits on the volume to be loaded into the test environment. Large data sets extracted directly from the operational systems can be utilized if possible for the system testing. Final DASD size can be determined, load times can be calculated or validated and proper procedures to put jobs into production can be reviewed. At the completion of the system tests, the JCL required to move the test jobs into production can be approved by the proper organizations and actual load times can be scheduled.
Due to limitations in most test environments of limited direct access storage devices (DASD) and CPU cycles during operational hours, these tests are very useful, but may not be able to identify all of the potential errors during the actual load if the warehouse is being developed on a shared platform. This should however minimize the risk of the initial load significantly.
Data Validation
Reviewing the mapping utilized by the ETL tool is the primary task during data validation. Due to the shear volume of data to be loaded into the fact tables, it is imperative that care is taken that the mapping is validated to verify that the data that you think is being loaded into specific data elements is in fact being sourced from tables that we know contain the information in the operational system.
The ETL tool will create a report that can be viewed as is or imported into an analytical interface with reporting capabilities such as Microsoft Access. If an interface is to be utilized, the name of each table, both logical and physical, as well as each data element, also logical and physical, should be present. Each data element should have a formal description and a mapping back to the source table(s) used to populate it during the ETL process. If any transformation is to occur, this too should be documented. If no transformation is to occur, the sample SQL to capture the data element from the source table may be documented as well, but is not required.
This is an excellent opportunity to get sponsor sign off. An agreement at this time will draw a line in the sand to minimize scope creep and document the understanding at that point in time exactly what is to be developed and how. Changes to the design and transformations will occur if the system is complex at all as the designers and developers of the data warehouse increase their knowledge on how the data elements on the source systems are populated and used in the operational environment.
The next step in data validation is completed by the use of counts. Simple database queries can be run either on an entire table (for most dimensions this is possible) or for subset (i.e., number of activity records for the month of May). If these counts are equal, it can be safely assumed that records were not left out due to an error during the ETL or simple load process. This is further verified by the lack of errors (not necessarily warnings) in the exception reporting by the ETL tool.
Many dimensional tables have an exact duplicate in the operational database schema. Doing a simple table compare can further validate these tables. If a table compare indicates that no difference were found, these tables can be considered 100 percent validated. For additional verification, actual rows from both the operational and data warehouse tables can be printed and listed side by side for comparison.
Randomly selecting rows of data from the test environment and comparing them to data stored within the test data warehouse should give a high level of confidence in the ETL process. A more detailed version of this test is usually conducted in the acceptance testing procedure.
Acceptance Testing
Acceptance testing begins with the final review of the data model. A listing of potential new reports must be created and data requirements to complete these reports must be identified in a process document. mapping of the specific data needs to actual data fields in the physical data model will help ensure that no key elements have been neglected during the design phase. Existing reports may also be identified as key reports to determine if on a future date those too can be redeveloped to take advantage of the data warehouse.
During the evaluation of reporting needs, several summary tables will need to be defined to help minimize the development effort of reports as well to increase the speed of report generation. Summary tables may also be developed to provide certain counts on a monthly basis that can be analyzed via tools such as Microsoft’s Data Analyzer. Actual values from the summary tables can be used to compare to similar summaries from the operational system to help provide a high level of confidence in the values stored within the warehouse. Very seldom do the numbers in the data warehouse match 100 percent with those created in the operational system. This is often the case due to the warehouse being populated once a day or even once a week. It will be up to the individuals assigned to the acceptance team to determine why the values are different and if the difference is significant.
Once the data warehouse is fully populated and the daily load procedures are operational, the final acceptance testing can begin. This should include taking a cross section of specific data elements within each of the fact tables and comparing them directly with facts in the operational system. This can be automated with the use of custom code (COBOL, SAS, C or even simple SQL) if at all possible to maximize the number of rows to be interrogated within the data warehouse. The use of views can also help reduce the complexity of the code that would need to be developed. Due to the volume of data and the amount of CPU cycles to conduct a test of this type, a decision will need to be made if a select few data elements on a large number of rows or if all data elements for a limited number of rows would provide a better sense of valid data. Both methods can be used, but the additional development time to create two procedures of this size must be weighed against the actual value that will be derived. If all previous data validation procedures are producing valid results, maybe only one procedure can be initially created. If this procedure proves no differences, the overall warehouse can be validated with at some level of confidence.
It should be noted that to create a 100 percent validation of the data would require duplicating the entire ETL process utilizing a separate tool or procedure (i.e., all custom-built COBOL code) would be very costly, time-consuming and, quite frankly, impractical. That is the reason you buy an ETL tool in the first place. In addition, the validation process to verify that both the tool and the custom code utilize the same logic would be an extensive exercise in itself. This duplication of effort would also have to be included in all future data warehouse design and development efforts.
Completion Criteria
Since you cannot check every data element, how do you know when testing and validation is complete? Completion of testing has to be determined by the number of defects found and resolved. The test log, which identifies all defects and the status of each, is a good source to use in measuring completion criteria. The true acceptance and determination of completion can only be determined when the sponsor feels comfortable with the results in both sample reporting and the results of the tests previously listed. Without confidence in the numbers, the data warehouse will not be utilized and thus deemed a failure. To this end, it is important to revisit the purpose of the data warehouse. The data warehouse is, by definition, not an operational system. The data stored in the data warehouse is typically for analytical and reporting purposes only. The goals of most data warehousing efforts should be focused on allowing individuals the ability to create new strategies for success and optimize their organization. Neither of these goals requires a 100 percent accuracy rate in the millions of rows of facts stored over several years, but inaccuracies based on incorrect data mapping or invalidated transformations can be worse than not doing anything at all.
Take care in your future testing and validation efforts. This level of validation will add time to the plan and increase costs, but these can be minimized if you work these techniques into the plan early and integrate them into your existing methodology. I believe in the long run that your sponsors and the business users will be glad that you did.
Cody Bateman is one of the founders of Keyleon, a corporation specializing in the development of business intelligence, data warehousing and CRM solutions. Bateman has 19 years of consulting experience in a range of service industries including energy, defense, finance, health care, retail and telecommunications and is the author of the Kenetic Incremental Warehouse Implementation (KIWI) Methodology. He may be reached via e-mail at CBateman@keyleon.com.
For more information on related topics, visit the following channels:

Source:: http://www.information-management.com/infodirect/20021004/5855-1.html

Friday, September 9, 2011

What is Datamart???

Reference From: http://www.learn.geekinterview.com/data-warehouse/data-marts/what-is-data-mart.html

Data Mart is a subset of the data resource, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs. The concept of a data mart can apply to any data whether they are operational data, evaluational data, spatial data, or metadata.

A data mart is a repository of a business organization's data implemented to answer very specific questions for a specific group of data consumers such as organizational divisions of marketing, sales, operations, collections and others. A data mart is typically established as one dimensional model or star schema which is composed of a fact table and multi-dimensional table.
In comparison, a data warehouse is also a repository of organizational data implemented as a single repository serving enterprise wide data across many if not all subject areas. The data warehouse is the authoritative repository at atomic level of all fact and dimensional data.

Despite some arguments on the similarity or difference between a data mart and a data warehouse, many still consider a data mart as specialized version of a data warehouse.

Advantage of Data Mart:
The data mart, like the data warehouse, can also provide a picture of a business organization's data and help the organizational staff in formulating strategies based on the aggregated data and statistical analysis of industry trends and patterns as well as part business experiences.

The most notable difference of a data mart from a data warehouse is that the data mart is created based on a very specific and predefined purpose and need for a grouping of certain data. A data mart is configured such that it makes access to relevant information in a specific area very easy and fast.

Within a single business organization, there can more than one data mart. Each of these data marts is relevant or connected in some way to one or more business units that its design was intended for. The relationship among many data marts within a single company may or may not involve interdependency.

They may be related to other data marts if they were designed using conformed facts of dimensions. If one department has a data mart implementation, that department is considered to be the owner of the data mart and it owns all aspects of the data mart including the software, hardware and the data itself. This can help manage data in a huge company by having a modularization method such that a department should only manipulate and develop its own data as they see if without having to alter data from other department's data marts. Then other departments need data from the data mart owned by a certain department, proper permission should be asked first.

In other data mart implementation where there is strict conformed dimension, some shared dimensions exist such as customers and products and business ownership will no longer apply.

Data marts can be designed with star schema, snowflake schema or starflake schema. The star schema is the most simple of all the styles related to data mart and data warehousing. It consists only of few fact tables.

The snowflake schema is a variation of the star schema and the storage method is of multidimensional nature. The starflake schema is a hybrid mixture of both the star and snowflake schemas.

Data marts are especially useful to make access to specific frequently access data very ease. It can give a collective picture or a certain aspect in the business by a specific group of users. Since data marts are smaller compared to a full data warehouse, response time could be lesser and the cost of implantation could also be less expensive.

Thursday, September 8, 2011

zip/unzip
gzip -d

Find:
grep <> filename.txt > newfilename.txt

to count3 of lines
wc -1 filename.txt

chmod filename 777/755/
Linux Shortcuts and Commands:
Linux Newbie Administrator Guide
by Stan and Peter Klimas
This is a practical selection of the commands we use most often. Press to see the listing of all available command (on your PATH). On my small home system, it says there are 2595 executables on my PATH. Many of these "commands" can be accessed from your favourite GUI front-end (probably KDE or Gnome) by clicking on the right menu or button. They can all be run from the command line. Programs that require GUI have to be run from a terminal opened under a GUI.
Legend:
<> = single special or function key on the keyboard. For example indicates the "control" key.
italic = name of the file or variable you probably want to substitute with your own.
fixed width = in-line Linux commands and filenames.
Notes for the UNIX Clueless:
1. LINUX IS CASE-SENSITIVE. For example: Netscape, NETSCAPE and nEtscape are three different commands. Also my_filE, my_file, and my_FILE are three different files. Your user login name and password are also case sensitive. (This goes with the tradition of UNIX and the "c" programming language being case sensitive.)
2. Filenames can be up to 256 characters long and can contain letters, numbers, "." (dot), "_" (underscore), "-" (dash), plus some other not recommended characters.
3. Files with names starting with "." are normally not shown by the ls (list) or dir commands. Think of these files as "hidden". Use ls -a (list with the option "all") to see these files.
4. "/" is an equivalent to DOS "\" (root directory, meaning the parent of all other directories).
5. Under Linux, all directories appear under a single directory tree (there are no DOS-style drive letters).
6. In a configuration file, a line starting with # is a comment.
7.1 Linux essential shortcuts and sanity commands

Switch to the first text terminal. Under Linux you can have several (6 in standard setup) terminals opened at the same time.
(n=1..6)
Switch to the nth text terminal.
tty
Print the name of the terminal in which you are typing this command.

Switch to the first GUI terminal (if X-windows is running on this terminal).
(n=7..12)
Switch to the nth GUI terminal (if a GUI terminal is running on screen n-1). On default, nothing is running on terminals
8 to 12, but you can run another server there.

(In a text terminal) Autocomplete the command if there is only one option, or else show all the available options.
THIS SHORTCUT IS GREAT! It even works at LILO prompt!

Scroll and edit the command history. Press to execute.

Scroll terminal output up. Work also at the login prompt, so you can scroll through your bootup messages.

Scroll terminal output down.
<+>
(in X-windows) Change to the next X-server resolution (if you set up the X-server to more than one resolution). For multiple resolutions on my standard SVGA card/monitor, I have the following line in the file /etc/X11/XF86Config (the first resolution starts on default, the largest determines the size of the "virtual screen"):
Modes "1024x768" "800x600" "640x480" "512x384" "480x300" "400x300" "1152x864"
<->
(in X-windows) Change to the previous X-server resolution.

(in X-windows) Kill the current X-windows server. Use if the X-windows server crushes and cannot be exited normally.

Shut down the system and reboot. This is the normal shutdown command for a user at the text-mode console. Don't just press the "reset" button for shutdown!
c
Kill the current process (mostly in the text mode for small applications).
d
Log out from the current terminal. See also the next command.
d
Send [End-of-File] to the current process. Don't press it twice else you also log out (see the previous command).
s
Stop the transfer to the terminal.
q
Resume the transfer to the terminal. Try if your terminal mysteriously stops responding.
z
Send the current process to the background.
exit
Logout. I can also use logout for the same effect. (If you have started a second shell, e.g., using bash the second shell will be exited and you will be back in the first shell, not logged out.)
reset
Restore a screwed-up terminal (a terminal showing funny characters) to default setting. Use if you tried to "cat" a binary file. You may not be able to see the command as you type it.

Paste the text which is currently highlighted somewhere else. This is the normal "copy-paste" operation in Linux. (It doesn't work with Netscape and WordPerfect which use the MS Windows-style "copy-paste". It does work in the text terminal if you enabled "gpm" service using "setup".) Best used with a Linux-ready 3-button mouse (Logitech or similar) or else set "3-mouse button emulation").
~
(tilde) My home directory (normally the directory /home/my_login_name). For example, the command cd ~/my_dir will change my working directory to the subdirectory "my_dir" under my home directory. Typing just "cd" alone is an equivalent of the command "cd ~".
.
(dot) Current directory. For example, ./my_program will attempt to execute the file "my_program" located in your current working directory.
..
(two dots) Directory parent to the current one. For example, the command cd .. will change my current working directory one one level up.
7.2 Common Linux commands--system info
pwd
Print working directory, i.e., display the name of my current directory on the screen.
hostname
Print the name of the local host (the machine on which you are working). Use netconf (as root) to change the name of the machine.
whoami
Print my login name.
id username
Print user id (uid) and his/her group id (gid), effective id (if different than the real id) and the supplementary groups.
date
Print or change the operating system date and time. E.g., I could change the date and time to 2000-12-31 23:57 using this command:
date 123123572000
To set the hardware (BIOS) clock from the system (Linux) clock, use the command (as root) setclock
time
Determine the amount of time that it takes for a process to complete + other info. Don't confuse it with the date command. E.g. I can find out how long it takes to display a directory content using:
time ls
who
Determine the users logged on the machine.
rwho -a
(=remote who) Determine all users logged on your network. The rwho service must be enabled for this command to run. If it isn't, run setup as root to enable "rwho".
finger user_name
System info about a user. Try: finger root
last
Show listing of users last logged-in on your system.
history more
Show the last (1000 or so) commands executed from the command line on the current account. The " more" causes the display to stop after each screenful.
uptime
Show the amount of time since the last reboot.
ps
(=print status) List the processes currently run by the current user.
ps axu more
List all the processes currently running, even those without the controlling terminal, together with the name of the user that owns each process.
top
Keep listing the currently running processes, sorted by cpu usage (top users first). In KDE, you can get GUI-based Ktop from "K"menu under "System"-"Task Manager" (or by executing "ktop" in an X-terminal).
uname -a
(= Unix name with option "all") Info on your (local) server. I can also use guname (in X-window terminal) to display the info more nicely.
free
Memory info (in kilobytes).
df -h
(=disk free) Print disk info about all the filesystems (in human-readable form)
du / -bh more
(=disk usage) Print detailed disk usage for each subdirectory starting at the "/" (root) directory (in human legible form).
cat /proc/cpuinfo
Cpu info--it show the content of the file cpuinfo. Note that the files in the /proc directory are not real files--they are hooks to look at information available to the kernel.
cat /proc/interrupts
List the interrupts in use.
cat /proc/version
Linux version and other info
cat /proc/filesystems
Show the types of filesystems currently in use.
cat /etc/printcap
Show the setup of printers.
lsmod
(As root. Use /sbin/lsmod to execute this command when you are a non-root user.) Show the kernel modules currently loaded.
setmore
Show the current user environment.
echo $PATH
Show the content of the environment variable "PATH". This command can be used to show other environment variables as well. Use "set" to see the full environment.
dmesg less
Print kernel messages (the content of the so-called kernel ring buffer). Press "q" to quit "less". Use less /var/log/dmesg to see what "dmesg" dumped into this file right after the last system bootup.

7.3 Basic operations
any_command --help more
Display a brief help on a command (works with most commands). "--help" works similar to DOS "/h" switch. The "more" pipe is needed if the output is longer than one screen.
man topic
Display the contents of the system manual pages (help) on the topic. Try man man first. Press "q" to quit the viewer. The command info topic works similar and may contain more up-to-date information. Manual pages can be hard to read. Try any_command --help for short, easy to digest help on a command. If more info needed, have a look to the directory /usr/doc. To display manual page from a specific section, I may use something like in this example: man 3 exit (this displays an info on the command exit from section 3 of the manual pages).
apropos topic
Give me the list of the commands that have something to to do with my topic.
help command
Display brief info on a bash (shell) build-in command.
ls
List the content of the current directory. Under Linux, the command "dir" is an alias to ls. Many users have "ls" to be an alias to "ls --color".
ls -al more
List the content of the current directory, all files (also those starting with a dot), and in a long form. Pipe the output through the "more" command, so that the display pauses after each screenful.
cd directory
Change directory. Using "cd" without the directory name will take you to your home directory. "cd -" will take you to your previous directory and is a convenient way to toggle between two directories. "cd .." will take you one directory up.
cp source destination
Copy files. E.g., cp /home/stan/existing_file_name . will copy a file to my current working directory. Use the "-r" option (for recursive) to copy the contents of whole directories, e.g. , cp -r my_existing/dir/ ~ will copy a subdirectory under my current working directory to my home directory.
mcopy source destination
Copy a file from/to a DOS filesystem (no mounting necessary). E.g., mcopy a:\autoexec.bat ~/junk . See man mtools for related commands: mdir, mcd, mren, mmove, mdel, mmd, mrd, mformat ....
mv source destination
Move or rename files. The same command is used for moving and renaming files and directories.
ln source destination
Create a hard link called destination to the file called source. The link appears as a copy of the original files, but in reality only one copy of the file is kept, just two (or more) directory entries point to it. Any changes the file are automatically visible throughout. When one directory entry is removed, the other(s) stay(s) intact. The limitation of the hard links are: the files have to be on the same filesystem, hard links to directories or special files are impossible.
ln -s source destination
Create a symbolic (soft) link called "destination" to the file called "source". The symbolic link just specifies a path where to look for the file. In contradistinction to hard links, the source and destination don't not have to tbe on the same filesystem. In comparison to hard links, the drawback of symbolic links are: if the original file is removed, the link is "broken", symbolic links can also create circular references (like circular references in spreadsheets or databases, e.g., "a" points to "b" and "b" points back to "a").
rm files
Remove (delete) files. You must own the file in order to be able to remove it. On many systems, you will be asked or confirmation of deleation, if you don't want this, use the "-f" (=force) option, e.g., rm -f * will remove all files in my current working directory, no questions asked.
mkdir directory
Make a new directory.
rmdir directory
Remove an empty directory.
rm -r files
(recursive remove) Remove files, directories, and their subdirectories. Careful with this command as root--you can easily remove all files on the system with such a command executed on the top of your directory tree, and there is no undelete in Linux (yet). But if you really wanted to do it (reconsider), here is how (as root): rm -rf /*
cat filename more
View the content of a text file called "filename", one page a time. The "" is the "pipe" symbol (on many American keyboards it shares the key with "\") The pipe makes the output stop after each screenful. For long files, it is sometimes convenient to use the commands head and tail that display just the beginning and the end of the file. If you happened to use "cat" a binary file and your terminal displays funny characters afterwards, you can restore it with the command "reset".
less filename
Scroll through a content of a text file. Press q when done. "Less" is roughly equivalent to "more" , the command you know from DOS, although very often "less" is more convenient than "more".
pico filename
Edit a text file using the simple and standard text editor called pico.
pico -w filename
Edit a text file, while disabling the long line wrap. Handy for editing configuration files, e.g. /etc/fstab.
find / -name "filename"
Find the file called "filename" on your filesystem starting the search from the root directory "/". The "filename" may contain wildcards (*,?).
locate filename
Find the file name of which contains the string "filename". Easier and faster than the previous command but depends on a database that normally rebuilds at night.
./program_name
Run an executable in the current directory, which is not on your PATH.
touch filename
Change the date/time stamp of the file filename to the current time. Create an empty file if the file does not exist.
xinit
Start a barebone X-windows server (without a windows manager).
startx
Start an X-windows server and the default windows manager. Works like typing "win" under DOS with Win3.1
startx -- :1
Start another X-windows session on the display 1 (the default is opened on display 0). You can have several GUI terminals running concurrently. Switch between them using , , etc.
xterm
(in X terminal) Run a simple X-windows terminal. Typing exit will close it. There are other, more advanced "virtual" terminals for X-windows. I like the popular ones: konsole and kvt (both come with kde) and gnome-terminal (comes with gnome). If you need something really fancy-looking, try Eterm.
xboing
(in X terminal). Very nice, old-fashioned game. Many small games/programs are probably installed on your system. I also like xboard (chess).
shutdown -h now
(as root) Shut down the system to a halt. Mostly used for a remote shutdown. Use for a shutdown at the console (which can be done by any user).
halt
reboot
(as root, two commands) Halt or reboot the machine. Used for remote shutdown, simpler to type than the previous command.

Network apps
netscape
(in X terminal) Run netscape (requires a separate Netscape installation). The current versions of Netscape (4.x) are known to be big and buggy. They occasionally crash by vanishing (no other harm done). Also, when not connected to the network , Netscape likes to refuse to do anything (looks like it hanged)-it revives when you connect.
netscape -display host:0.0
(in X terminal) Run netscape on the current machine and direct the output to machine named "host" display 0 screen 0. Your current machine must have a permission to display on the machine "host" (typically given by executing the command xhost current_machine_name in the xterminal of the machine host. Other X-windows program can be run remotely the same way.
lynx file.html
View an html file or browse the net from the text mode.
pine
A good text-mode mail reader. Another good and standard one is elm. Your Netscape mail will read the mail from your Internet account. pine will let you read the "local" mail, e.g. the mail your son or a cron process sends to you from a computer on your home network. The command mail could also be used for reading/composing mail, but it would be inconvenient--it is meant to be used in scripts for automation.
elm
A good tex-mode mail reader. See the previous command.
mutt
A really basic but extremally useful and fast mail reader.
mail
A basic operating system tool for e-mail. Look at the previous commands for a better e-mail reader. mail is good if you wanted to send an e-mail from a shell script.
licq
(in X term) An icq "instant messaging" client. Another good one is kxicq. Older distributions don't have an icq client installed, you have to do download one and install it.
talk username1
Talk to another user currently logged on your machine (or use "talk username1@machinename" to talk to a user on a different computer) . To accept the invitation to the conversation, type the command "talk username2". If somebody is trying to talk to you and it disrupts your work, your may use the command "mesg n" to refuse accepting messages. You may want to use "who" or "rwho" to determine the users who are currently logged-in.
mc
Launch the "Midnight Commander" file manager (looks like "Norton Commander" for Linux).
telnet server
Connect to another machine using the TELNET protocol. Use a remote machine name or IP address. You will be prompted for your login name and password--you must have an account on the remote machine to login. Telnet will connect you to another machine and let you operate on it as if you were sitting at its keyboard (almost). Telnet is not very secure--everything you type goes in open text, even your password!
rlogin server
(=remote login) Connect to another machine. The login name/password from your current session is used; if it fails you are prompted for a password.
rsh server
(=remote shell) Yet another way to connect to a remote machine. The login name/password from your current session is used; if it fails you are prompted for a password.
ftp server
Ftp another machine. (There is also ncftp which adds extra features and gftp for GUI .) Ftp is good for copying files to/from a remote machine. Try user "anonymous" if you don't have an account on the remote server. After connection, use "?" to see the list of available ftp commands. The essential ftp command are: ls (see the files on the remote system), ASCII, binary (set the file transfer mode to either text or binary, important that you select the proper one ), get (copy a file from the remote system to the local system), mget (get many files at once), put (copy a file from the local system to the remote system), mput (put many files at once), bye (disconnect). For automation in a script, you may want to use ncftpput and ncftpget, for example:
ncftpput -u my_user_name -p my_password -a remote.host.domain remote_dir *local.html
minicom
Minicom program (looks like "Procomm for Linux").
File (de)compression
tar -zxvf filename.tar.gz
(=tape archiver) Untar a tarred and compressed tarball (*.tar.gz or *.tgz) that you downloaded from the Internet.
tar -xvf filename.tar
Untar a tarred but uncompressed tarball (*.tar).
gunzip filename.gz
Decompress a zipped file (*.gz" or *.z). Use gzip (also zip or compress) if you wanted to compress files to this file format.
bunzip2 filename.bz2
(=big unzip) Decompress a file (*.bz2) zipped with bzip2 compression utility. Used for big files.
unzip filename.zip
Decompress a file (*.zip) zipped with a compression utility compatible with PKZIP for DOS.
unarj e filename.arj
Extract the content of an *.arj archive.
uudecode -o outputfile filename
Decode a file encoded with uuencode. uu-encoded files are typically used for transfer of non-text files in e-mail (uuencode transforms any file into an ASCII file).
7.4 Process control
ps
(=print status) Display the list of currently running processes with their process IDs (PID) numbers. Use ps axu to see all processes currently running on your system (also those of other users or without a controlling terminal), each with the name of the owner. Use "top" to keep listing the processes currently running.
fg PID
Bring a background or stopped process to the foreground.
bg PID
Send the process to the background. Opposite to fg. The same can be accomplished with z. If you have stopped jobs, you have to type exit twice in row to log out.
any_command&
Run any command in the background (the symbol "&" means "run the proceeding command in the background").
batch any_command
Run any command (usually one that is going to take more time) when the system load is low. I can logout, and the process will keep running.
at 17:00
Execute a command at a specified time. You will be prompted for the command(s) to run, until you press d.
kill PID
Force a process shutdown. First determine the PID of the process to kill using ps.
killall program_name
Kill program(s) by name.
xkill
(in an xwindow terminal) Kill a GUI-based program with mouse. (Point with your mouse cursor at the window of the process you want to kill and click.)
lpc
(as root) Check and control the printer(s). Type "?" to see the list of available commands.
lpq
Show the content of the printer queue. Under KDE (X-Windows), you may use GUI-based "Printer Queue" available from "K"menu-Utilities.
lprm job_number
Remove a printing job "job_number" from the queue.
nice program_name
Run program_name adjusting its priority. Since the priority is not specified in this example, it will be adjusted by 10 (the process will run slower), from the default value (usually 0). The lower the number (of "niceness" to other users on the system), the higher the priority. The priority value may be in the range -20 to 19. Only root may specify negative values. Use "top" to display the priorities of the running processes.
renice -1 PID
(as root) Change the priority of a running process to -1. Normal users can only adjust processes they own, and only up from the current value (make them run slower).
c, z, s, and q also belong to this chapter but they were described previously. In short they mean: stop the current command, send the current command to the background, stop the data transfer, resume the data transfer.

7.5 Basic administration commands
printtool
(as root in X-terminal) Configuration tool for your printer(s). Settings go to the file /etc/printcap.
setup
(as root) Configure mouse, soundcard, keyboard, X-windows, system services. There are many distibution-specific configuration utilities, setup is the default on RedHat. Mandrake 7.0 offers very nice DrakConf .
linuxconfig
(as root, either in text or graphical mode). You can access and change hundreds of setting from it. Very powerful--don't change too many things at the same time, and be careful with changing entries you don't understand.
xvidtune
(in X-terminal). Adjust the settings of the graphical display for all resolutions so as to eliminate black bands, shift the display right/left/up/down, etc. (First use the knobs on your monitor to fit your text mode correctly on the screen.) To make the changes permanent, display the frequencies on the screen and transfer them to the setup file /etc/X11/XF86Config.
alias ls="ls --color=tty"
Create an alias for the command "ls" to enhance its format with color. In this example, the alias is also called "ls" and the "color" option is only envoke when the output is done to a terminal (not to files). Put the alias into the file /etc/bashrc if you would like the alias to be always accessible to all users on the system. Type "alias" alone to see the list of aliases on your system.
adduser user_name
Create a new account (you must be root). E.g., adduser barbara Don't forget to set up the password for the new user in the next step. The user home directory is /home/user_name.
useradd user_name
The same as the command " adduser user_name ".
userdel user_name
Remove an account (you must be a root). The user's home directory and the undelivered mail must be dealt with separately (manually because you have to decide what to do with the files).
groupadd group_name
Create a new group on your system. Non-essential but can be handy even on a home machine with a small number of users.
passwd
Change the password on your current account. If you are root, you can change the password for any user using: passwd user_name
chmod perm filename
(=change mode) Change the file access permission for the files you own (unless you are root in which case you can change any file). You can make a file accessible in three modes: read (r), write (w), execute (x) to three classes of users: owner (u), members of the same group as the owner (g), others on the system (o). Check the current access permissions using:
ls -l filename
If the file is accessible to all users in all modes it will show:
rwxrwxrwx
The first triplet shows the file permission for the owner of the file, the second for his/her group, the third for others. A "no" permission is shown as "-".
E.g., this command will add the permission to read the file "junk" to all (=user+group+others):
chmod a+r junk
This command will remove the permission to execute the file junk from others:
chmod o-x junk
Also try here for more info.
You can set the default file permissions for the news files that you create using the command umask (see man umask).
chown new_ownername filename
chgrp new_groupname filename
Change the file owner and group. You should use these two commands after you copy a file for use by somebody else.
su
(=substitute user id) Assume the superuser (=root) identity (you will be prompted for the password). Type "exit" to return you to your previous login. Don't habitually work on your machine as root. The root account is for administration and the su command is to ease your access to the administration account when you require it. You can also use "su" to assume any other user identity, e.g. su barbara will make me "barbara" (password required unless I am a superuser).
kernelcfg
(as root in X terminal). GUI to to add/remove kernel modules. You can do the same from the command line using the command "insmod", but "insmode" is less "newbie-friendly".
lsmod
List currently loaded kernel modules. A module is like a device driver--it provides operating system kernel support for a particular piece of hardware or feature.
modprobe -l more
List all the modules available for your kernel. The available modules are determined by how your Linux kernel was compliled. Every possible module/feature can be compiled on linux as either "hard wired" (fast, non-removable), "module" (maybe slower, but loaded/removable on demand), or "no" (no support for this feature at all).
insmod parport
insmod ppa
(as root) Insert modules into the kernel (a module is roughly an equivalent of a DOS device driver). This example shows how to insert the modules for support of the external parallel port zip drive (it appears to be a problem to get the external zip drive to work in any other way under RH6.0 ).
rmmod module_name
(as root, not essential). Remove the module module_name from the kernel.
setserial /dev/cua0 port 0x03f8 irq 4
(as root) Set a serial port to a non-standard setting. The example here shows the standard setting for the first serial port (cua0 or ttyS0). The standard PC settings for the second serial port (cua1or ttyS1) are: address of i/o port 0x02f8, irq 3. The third serial port (cua2 or ttyS2): 0x03e8, irq 4. The forth serial port (cua3 or ttyS3): 0x02e8, irq 3. Add your setting to /etc/rc.d/rc.local if you want it to be set at the boot time. See man setserial for good a overview.
fdisk
(as root) Linux hard drive partitioning utility (DOS has a utility with the same name).
cd /usr/src/linux-2.0.36
make xconfig
(as root in X terminal). Nice GUI front-end for configuration of the kernel options in preparation for compilation of your customized kernel. (The directory name contains the version of your Linux kernel so you may need to modify the directory name if your Linux kernel version is different than 2.0.36 used in this example. You also need the "Tk" interpreter and the kernel source code installed. ) The alternatives to "make xconfig" are: "make config" (runs a scripts that asks you questions in the text mode) and "make menuconfig" (runs a text-based menu-driven configuration utility). Try: less /usr/doc/HOWTO/Kernel-HOWTO for more information.
After the configuration, you may choose to proceed with kernel compilation of the new kernel by issuing the following commands:
make dep
make zImage
The last command will take some time to complete (maybe 0.5 h, depending on your hardware). It produces the file "zImage", which is your new Linux kernel. Next:
make modules
make modules_install
Read: /usr/doc/HOWTO/Kernel-HOWTO for information on how to install the new kernel. You will probably also find it useful to read "man depmode". Configuration, compilation and installation of a new kernel is not difficult but it CAN lead to problems if you don't know what you are doing.
Compilation of a kernel is a good way to test your hardware, because it involves a massive amount of computing. If your hardware is "flaky", you will most likely receive the "signal 11" error (read the beatiful /usr/doc/FAQ/txt/GCC-SIG11-FAQ). See this for details on kernel upgrade.
depmod -a
(as root) Build the module dependency table for the kernel. This can, for example, be useful after installing and booting a new kernel. Use "modprobe -a" to load the modules.
ldconfig
(as root) Re-create the bindings and the cache for the loader of dynamic libraries ("ld"). You may want to run ldconfig after an installation of new dynamically linked libraries on your system. (It is also re-run every time you boot the computer, so if you reboot you don't have to run it manually.)
mknod /dev/fd0 b 2 0
(=make node, as root) Create a device file. This example shows how to create a device file associated with your first floppy drive and could be useful if you happened to accidentally erase it. The options are: b=block mode device (c=character mode device, p=FIFO device, u=unbuffered character mode device). The two integers specify the major and the minor device number.
fdformat /dev/fd0H1440
mkfs -c -t ext2
(=floppy disk format, two commands, as root) Perform a low-level formatting of a floppy in the first floppy drive (/dev/fd0), high density (1440 kB). Then make a Linux filesystem (-t ext2), checking/marking bad blocks (-c ). Making the files system is an equivalent to the high-level format.
badblocks /dev/fd01440 1440
(as root) Check a high-density floppy for bad blocks and display the results on the screen. The parameter "1440" specifies that 1440 blocks are to be checked. This command does not modify the floppy.
fsck -t ext2 /dev/hda2
(=file system check, as root) Check and repair a filesystem. The example uses the partition hda2, filesystem type ext2.
dd if=/dev/fd0H1440 of=floppy_image
dd if=floppy_image of=/dev/fd0H1440
(two commands, dd="data duplicator") Create an image of a floppy to the file called "floppy_image" in the current directory. Then copy floppy_image (file) to another floppy disk. Works like DOS "DISKCOPY".

Program installation
rpm -ivh filename.rpm
(=RedhatPackageManager, install, verbose, hashes displayed to show progress, as root.) Install a content of RedHat rpm package(s) and print info on what happened. Keep reading if you prefer a GUI installation.
rpm -qpi filename.rpm
(=RedhatPackageManager, query, package, list.) Read the info on the content of a yet uninstalled package filename.rpm.
rpm -qpl filename.rpm
(=RedhatPackageManager, query, package, information.) List the files contained in a yet uninstalled package filename.rpm.
rpm -qf filename
(=RedhatPackageManager, query, file.) Find out the name of the *.rpm package to which the file filename (on your hardrive) belongs.
rpm -e packagename
(=RedhatPackageManager, erase=uninstall.) Uninstall a package pagckagename. Packagname is the same as the beginning of the *.rpm package file but without the dash and version number.
kpackage
gnorpm
glint
(in X terminal, as root if you want to be able to install packages) GUI fronts to the Red Hat Package Manager (rpm). "glint" comes with RH5.2, "gnorpm" with RH6.0, "kpackage" comes with RH6.1 or must be installed separately but is the best of the three. Use any of them to view which software packages are installed on your system and the what not-yet-installed packages are available on your RedHat CD, display the info about the packages, and install them if you want (installation must be done as root).

Accessing drives/partitions
mount
See here for details on mounting drives. Examples are shown in the next commands.
mount -t auto /dev/fd0 /mnt/floppy
(as root) Mount the floppy. The directory /mnt/floppy must exist, be empty and NOT be your current directory.
mount -t auto /dev/cdrom /mnt/cdrom
(as root) Mount the CD. You may need to create/modify the /dev/cdrom file depending where your CDROM is. The directory /mnt/cdrom must exist, be empty and NOT be your current directory.
mount /mnt/floppy
(as user or root) Mount a floppy as user. The file /etc/fstab must be set up to do this. The directory /mnt/floppy must not be your current directory.
mount /mnt/cdrom
(as user or root) Mount a CD as user. The file /etc/fstab must be set up to do this. The directory /mnt/cdrom must not be your current directory.
umount /mnt/floppy
Unmount the floppy. The directory /mnt/floppy must not be your (or anybody else's) current working directory. Depending on your setup, you might not be able to unmount a drive that you didn't mount.

7.6 Network administration tools
netconf
(as root) A very good menu-driven setup of your network.
pingmachine_name
Check if you can contact another machine (give the machine's name or IP), press C when done (it keeps going).
route -n
Show the kernel routing table.
nslookup host_to_find
Query your default domain name server (DNS) for an Internet name (or IP number) host_to_find. This way you can check if your DNS works. You can also find out the name of the host of which you only know the IP number.
traceroute host_to_trace
Have a look how you messages trave to host_to_trace (which is either a host name or IP number).
ipfwadm -F -p m
(for RH5.2, seen next command for RH6.0) Set up the firewall IP forwarding policy to masquerading. (Not very secure but simple.) Purpose: all computers from your home network will appear to the outside world as one very busy machine and, for example, you will be allowed to browse the Internet from all computers at once.
echo 1 > /proc/sys/net/ipv4/ip_forward
ipfwadm-wrapper -F -p deny
ipfwadm-wrapper -F -a m -S xxx.xxx.xxx.0/24 -D 0.0.0.0/0
(three commands, RH6.0). Does the same as the previous command. Substitute the "x"s with digits of your class "C" IP address that you assigned to your home network. See here for more details. In RH6.1, masquarading seems broken to me--I think I will install Mandrake Linux:).
ifconfig
(as root) Display info on the network interfaces currently active (ethernet, ppp, etc). Your first ethernet should show up as eth0, second as eth1, etc, first ppp over modem as ppp0, second as ppp1, etc. The "lo" is the "loopback only" interface which should be always active. Use the options (see ifconfig --help) to configure the interfaces.
ifup interface_name
(/sbin/ifup to it run as a user) Startup a network interface. E.g.:
ifup eth0
ifup ppp0
Users can start up or shutdown the ppp interface only when the right permission was checked during the ppp setup (using netconf ). To start a ppp interface (dial-up connection), I normally use kppp available under kde menu "internet".
ifdown interface_name
(/sbin/ifdown to run it as a user). Shut down the network interface. E.g.: ifdown ppp0 Also, see the previous command.
netstat more
Displays a lot (too much?) information on the status of your network.

Music-related commands
cdplay play 1
Play the first track from a audio CD.
eject
Get a free coffee cup holder :))). (Eject the CD ROM tray).
play my_file.wav
Play a wave file.
mpg123 my_file.mp3
Play an mp3 file.
mpg123 -w my_file.wav my_file.mp3
Create a wave audio file from an mp3 audio file.
knapster
(in X terminal) Start the program to downolad mp3 files that other users of napster have displayed for downloading. Really cool!
cdparanoia -B "1-"
(CD ripper) Read the contents of an audio CD and save it into wavefiles in the current directories, one track per wavefile. The "1-"
means "from track 1 to the last". -B forces putting each track into a separate file.
playmidi my_file.mid
Play a midi file. playmidi -r my_file.mid will display text mode effects on the screen.
sox
(argument not given here) Convert from almost any audio file format to another (but not mp3s). See man sox.

Graphics-related commands
kghostview my_file.ps
Display a postscript file on screen. I can also use the older-looking ghostview or gv for the same end effect.
ps2pdf my_file.ps my_file.pdf
Make a pdf (Adobe portable document format) file from a postscript file.
gimp
(in X terminal) A humble looking but very powerful image processor. Takes some learning to use, but it is great for artists, there is almost nothing you can't do with gimp. Use your mouse right button to get local menus, and learn how to use layers. Save your file in the native gimp file format *.xcf (to preserve layers) and only then flatten it and save as png (or whatever). There is a large user manual /usr/
gphoto
(in X terminal) Powerful photo editor.
giftopnm my_file.giff > my_file.pnm
pnmtopng my_file.pnm > my_file.png
Convert the propriatory giff graphics into a raw, portable pnm file. Then convert the pnm into a png file, which is a newer and better standard for Internet pictures (better technically plus there is no danger of being sued by the owner of giff patents).

QA Guidelines for Data Warehouse Quality Verification

QA Guidelines for Data Warehouse Quality Verification

This document describes testing guidelines and steps for verifying data, ETL processes, and SQL during the construction, unit testing, system and integration testing of an application’s data warehouse operational tables and data mart.
1.) Verify and Maintain the Data Low Level Design (LLD)
A first level of testing and validation begins with the formal acceptance of the logical data model and “low level design” (LLD). All further testing and validation will be based on the understanding of each of the data elements in the model.
Data elements that are created through a transformation or summary process must be clearly identified and calculations for each of these data elements must be clear and easily interpreted.
During the LLD reviews and updates, special consideration should be given to typical modeling scenarios that exist in the project. Examples follow:
1. Verify that many-to-many attribute relationships are clarified and resolved.
2. Verify the types of keys that are used: surrogate keys versus natural keys.
3. Verify that the business analyst / DBA reviewed with ETL architect and developers (application) the lineage and business rules for extracting, transforming, and loading the data warehouse?
4. Verify that all transformation rules, summarization rules, and matching and consolidation rules have clear specifications.
5. Verify that specified transformations, business rules and cleansing specified in LLD and other application logic specs have been coded correctly in ETL, JAVA, and SQL used for data loads.
6. Verify that procedures are documented to monitor and control data extraction, transformation and loading. The procedures should describe how to handle exceptions and program failures.
7. Verify that data consolidation of duplicate or merged data was properly handled.
8. Verify that samplings of domain transformations will be taken to verify they are properly changed.
9. Compare unique values of key fields between source data and data loaded to the warehouse. This is a useful technique that points out a variety of possible data errors without doing a full validation on all fields.
10. Validate that target data types are as specified in the design and/or the data model.
11. Verify how sub-class/super-class attributes depicted?
12. Verify that data field types and formats are specified.
13. Verify that defaults are specified for fields where needed.
14. Verify that processing for invalid field values in source are defined
15. Verify that expected ranges of field contents are specified where known.
16. Verify that keys generated by the “sequence generator” are identified.
17. Verify that slowly changing dimensions are described?

2.) Analyze Source Data Before & After Extraction to Staging

Testers should extract representative data from each source file (before or after extract to staging tables) and confirm that the data is consistent with its definition; QA can discover any anomalies in how the data is represented and write defect reports where necessary. The objective is to discover data that does not meet “data quality factors” as described in specifications. See list below and Table 1.
This verification process will be used for temp tables used in a step process for data transformations, cleaning, etc.
• Verify that the scope of values in each column are within specifications
• Identify unexpected values in each field
• Verify relationships between fields
• Identify frequencies of values in columns and whether these frequencies make sense?

Inputs: Application source data models and low level data design, data dictionaries, data attribute sources.
Outputs: Newly discovered attributes, undefined business rules, data anomalies such as fields used for multiple purposes.
Techniques and Tools: Data extraction software, business rule discovery software, data analysis tools.
Process Description:
1. Extract representative samples of data from each source or staging table.
2. Parse the data for the purpose of profiling.
3. Verify that not-null fields are populated and populated as expected.
4. Structure discovery – Does the data match the corresponding metadata? Do field attributes of the data match expected patterns? Does the data adhere to appropriate uniqueness and null value rules?
5. Data discovery – Are the data values complete, accurate and unambiguous?
6. Relationship discovery – Does the data adhere to specified required key relationships across columns and tables? Are there inferred relationships across columns, tables or databases? Is there redundant data?
7. Verify that all required data from the source was extracted. Verify that extraction process did not extract more or less data source than it should have.
8. Verify or write defects for exceptions and errors discovered during the ETL process.
9. Verify that extraction process did not extract duplicate data from the source (usually this happens in repeatable processes where at point zero we need to extract all data from the source file, but the during the next intervals we only need to capture the modified, and new rows.).
10. Validate that no data truncation occurred during staging.
11. Utilize a data profiling tool or methods that show the range and value distributions of fields in the source data. This is used to identify any data anomalies from source systems that may be missed even when the data movement is correct.
12. Validation & Certification Method: it is sufficient to identify the requirements and count (via SQL) the number of rows that should be extracted from the source systems. The QA team will also count the number of rows in the result / target sets and match the two for validation. The QA team will maintain a set of SQL statements that are automatically run at this stage to validate that no duplicate data have been extracted from the source systems.
Table 1: Data Quality Factors
FACTOR DESCRIPTION EXAMPLE
Data Consistency Issues:
Varying Data Definitions The data type and length for a particular attribute may vary in files or tables though the semantic definition is the same. Account number may be defined as: Number (9) in one field or table and Varchar2(11) in another table
Misuse of Integrity Constraints When referential integrity constraints are misused, foreign key values may be left “dangling” or inadvertently deleted. An account record is missing but dependent records are not deleted.
Nulls Nulls when field defined as “not-null”. The company has been entered as a null value for a business. A report of all companies would not list the business.
Data Completeness Issues:
Missing data Data elements are missing due to a lack of integrity constraints or nulls that are inadvertently not updated. An account date of estimated arrival is null thus impacting an assessment of variances in estimated/actual account data.
Inaccessible Data Inaccessible records due to missing or redundant identifier values. Business numbers are used to identify a customer record. Because uniqueness was not enforced, the business ID (45656) identifies more than one customer.
Missing Integrity Constraints Missing constraints can cause data errors due to nulls, non-uniqueness, or missing relationships. Account records with a business identifier exist in the database but cannot be matched to an existing business.
Data Correctness Issues:
Loss Projection Tables that are joined over non key attributes will produce non existent data that is shown to the user. Lisa Evans works in the LA office in the Accounting department. When a report is generated, it shows her working in IT department.
Incorrect Data Values Data that is misspelled or inaccurately recorded. 123 Maple Street is recorded with a spelling mistake and a street abbreviation (123 Maple St)
Inappropriate Use of Views Data is updated incorrectly through views. A view contains non key attributes from base tables. When the view is used to update the database, null values are entered into the key columns of the base tables.
Disabled Integrity Constraints Null, non unique, or out of range data may be stored when the integrity constraints are disabled. The primary key constraint is disabled during an import function. Data is entered into the existing data with null unique identifiers.
Non-duplication Testing should be conducted to determine if there’s duplication of data where there should not be. Duplicate rows or column data.
Misuse of Integrity Constraints Check whether null or foreign key constraints are inappropriate or too restrictive. Check constraint only allows hard coded values of “C”, “A”, “X”, and “Z”. But a new code “B” cannot be entered.
Data Comprehension Issues:
Data Aggregation Aggregated data is used to represent a set of data elements. One name field is used to store surname, first name, middle initial, and last name (e.g., John, Hanson, Mr.).
Cryptic Object Definitions Database object (e.g., column) has a cryptic, unidentifiable name. Customer table with a column labeled, “c_avd”. There is no documentation as to what the column might contain.
Unknown or Cryptic Data Cryptic data stored as codes, abbreviations, truncated, or with no apparent meaning. Shipping codes used to represent various parts of the customer base (‘01’, ‘02’, ‘03’). No supporting document to explain the meaning of the codes.
Accuracy Data will be matched against business rules. Boundary values (low, high’s) will be identified for relevant fields and compared with expectations.
Completeness Data will be assessed to verify that all required is present. Missing rows will be identified; Null values will be identified in data elements where a value is expected.
Precision Precision testing is conducted to evaluate the level of data not sufficiently precise based on specifications.

3.) Verify Corrected, Cleaned, Source Data in Staging

This step works to improve the quality of existing data in source files or “defects” that meet source specs but must be corrected before load.
Inputs:
 Files or tables (staging) that require cleansing; data definition and business rule documents, data map of source files and fields; business rules, data anomalies discovered in earlier steps of this process.
 Fixes for data defects that will result in data that does not meet specifications for the application DW.
Outputs: Defect reports, cleansed data, rejected or uncorrectable data
Techniques and Tools: Data reengineering, transformation, and cleansing tools, MS Access, Excel filtering.
Process Description: In this step, data with missing values, known errors, and suspect data is corrected. Automated tools may be identified to best to locate, clean / correct large volumes of data.
1. Document the type of data cleansing approach taken for each data type in the repository.
2. Determine how “uncorrectable” or suspect data is processed, rejected, maintained for corrective action. SME’s and stakeholders should be involved in the decision.
3. Review ETL defect reports to assess rejected data excluded from source files or information group targeted for the warehouse.
4. Determine if data not meeting quality rules was accepted.
5. Document in defect reports, records and important fields that cannot be easily corrected.
6. Document records that were corrected and how corrected.
Certification Method: Validation of data cleansing processes could be a tricky proposition, but certainly doable. All data cleansing requirements should be clearly identified. The QA team should learn all data cleansing tools available and their methods. QA should create various conditions as specified in the requirements for the data cleansing tool to support and validate its results. QA will run a volume of real data through each tool to validate accuracy as well as performance.

4.) Verifying Matched and Consolidated Data

There are often ETL processes where data has been consolidated from various files into a single occurrence of records. The cleaned and consolidated data can be assessed to very matched and consolidated data.
Much of the ETL heavy lifting occurs in the transform step where combined data, data with quality issues, updated data, surrogate keys, build aggregates, are processed.
Inputs: Analysis of all files or databases for each entity type
Outputs:
Report of matched, consolidated, related data that is suspect or in error
List of duplicate data records or fields
List of duplicate data suspects.
Techniques and Tools: Data matching techniques or tools; data cleansing software with matching and merging capabilities.
Process Description:
1. Establish match criteria for data. Select attributes to become the basis for possible duplicate occurrences (e.g., names, account numbers).
2. Determine the impact of incorrectly consolidated records. If the negative impact of consolidating two different occurrences such as different customers into a single customer record exists, submit defect reports. The fix should be higher controls to help avoid such consolidations in the future.
3. Determine the matching techniques to be used: Exact character match in two corresponding fields such as wild card match, key words, close match, etc.
4. Compare match criteria for specific record with all other records within a given file to look for intra-file duplicate records.
5. Compare match criteria for a specific record with all records in another file to seek inter-file duplicate records.
6. Evaluate potential matched occurrences to assure they are, in fact, duplicate.
7. Verify that consolidated data into single occurrences is correct.
8. Examine and re-relate data related to old records being consolidated to new occurrence-of-reference record. Validate that no related data was overlooked.

5.) Verify Transformed / Enhanced / Calculated Data to Target Tables

At this stage, base data is being prepared for loading into the Application operational tables and the data mart. This includes converting and formatting cleansed, consolidated data into the new data architecture and possibly enhancing internal operational data with external data licensed from service providers.
The objective is to successfully map the cleaned, corrected and consolidated data into the DW environment.
Inputs: Cleansed, consolidated data; external data from service providers; business rules governing the source data; business rules governing the target DW data; transformation rules governing the transformation process; DW or target data architecture; data map of source data to standardized data.
Output: Transformed, calculated, enhanced data; updated data map of source data to standardized data; data map of source data to target data architecture
Techniques and Tools: Data transformation software; external or online or public databases.
Process Description:
1. Verify that the data warehouse construction team is using the data map of source data to the DW standardized data, verify the mapping.
2. Verify that the data transformation rules and routines are correct.
3. Verify the data transformations to the DW and assure that the processes were performed according to specifications.
4. Verify that data loaded in the operational tables and data mart meets the definition of the data architecture including data types, formats, accuracy, etc.
5. Develop scenarios to be covered in Load Integration Testing
6. Count Validation: Record Count Verification DWH backend/Reporting queries against source and target as an initial check.
7. Dimensional Analysis: Data integrity exists between the various source tables and parent / child relationships.
8. Statistical Analysis: Validation for various calculations.
9. Data Quality Validation: - Check for missing data, negatives and consistency. Field-by-field data verification will be done to check the consistency of source and target data.
10. Granularity: Validate at the lowest granular level possible (lowest in the hierarchy E.g. Country-City-Sector– start with test cases).
11. Dynamic Transformation Rules & Tables: such methods need to be checked continuously to ensure the correct transformation routines are executed. Verify that dynamic mapping tables and dynamic mapping rules provide an easy, documented, and automated way for transforming values from one or more sources into a standard value presented in the DW.
12. Verification Method: The QA team will identify the detailed requirements as they relate to transformation and validate the dynamic transformation rules and tables against DW records. Utilizing SQL and related tools, the team will identify unique values in source data files that are subject to transformation. The QA team identifies the results from the transformation process and validate that such transformation have accurately taken place.

6.) Front-end UI and Report Testing Using Operational Tables and Data Mart

End user reporting is a major component of the Application Project. The report code may run aggregate SQL queries against the data stored in the data mart and/or the operational tables then display results in a suitable format either in a Web browser or on a client application interface. Once the initial view is rendered, the reporting tool interface provides various ways of manipulating the information such as sorting, pivoting, computing subtotals, and adding view filters to slice-and-dice the information further. Special considerations such as those below will be prepared while testing the reports:
1. The ETL process should be complete, the data mart must be populated and data quality testing should be largely completed.
2. The front-end will use a SQL engine which will generate the SQL based on the how the dimension and fact tables are mapped. Additionally, there may be global or report-specific parameters set to handle very large database (VLDB)-related optimization requirements. As such, testing of the front-end will concentrate on validating the SQL generated; this in turn validates the dimensional model and the report specification vis-à-vis the design.
3. Unit testing of the reports will be conducted to verify the layout format per the design mockup, style sheets, prompts and filters, attributes and metrics on the report.
4. Unit testing will be executed both in the desktop and Web environment.
5. System testing of the reports will concentrate on various report manipulation techniques like the drilling, sorting and export functions of the reports in the Web environment.
6. Reports and/or documents need special consideration for testing because they are high visibility reports used by the top analysts and because they have various charts, gauges and data points to provide a visual insight to the performance of the organization in question.
7. There may be some trending reports, or more specifically called comp reports, that compare the performance of an organizational unit over multiple time periods. Testing these reports needs special consideration especially if a fiscal calendar is used instead of an English calendar for time period comparison.
8. For reports containing derived metrics special focus should be paid to any subtotals. The subtotal row should use a "smart-total," i.e., do the aggregation first and then do the division instead of adding up the individual cost per click of each row in the report.
9. Reports with "non-aggregate-able" metrics (e.g., inventory at hand) also need special attention to the subtotal row. It should not, for example, add up the inventory for each week and show the inventory of the month.
10. During unit testing, all data formats will be verified against the standard. For example, metrics with monetary value should show the proper currency symbol, decimal point precision (at least two places) and the appropriate positive or negative. For example, negative numbers should be shown in red and enclosed in braces.
11. During system testing, while testing the drill-down capability of reports, care will be taken to verify that the subtotal at the drill-down report matches with the corresponding row of the summary report. At times, it is desirable to carry the parent attribute to the drill-down report; verify the requirements for this.
12. When testing reports containing conditional metrics, care will be taken to check for "outer join condition;" i.e., nonexistence of one condition is reflected appropriately with the existence of the other condition.
13. Reports with multilevel sorting will get special attention for testing especially if the multilevel sorting includes both attributes and metrics to be sorted.
14. Reports containing metrics at different dimensionality and with percent-to-total metrics and/or cumulative metrics needs will get special attention to check that the subtotals are hierarchy-aware (i.e., they "break" or "re-initialized" at the appropriate levels).

7.) Operational Table and Data Mart: Build Sanity Test
1. Session Completions: All workflow sessions completed successfully using the Log Viewer.
2. Source to Target Counts: This process verifies that the number of records in the source system matches the number of records received, and ultimately processed, into the data warehouse. If Look-up’s are involved in the ETL process, the count between source and target will not match. The ETL Session log and target table counts are compared.
3. Source to Target Data Verification: The process verifies that all source and reference tables have data before running ETLs. We verify that all target tables were truncated before the load unless target tables are updated. This process verifies that the source field threshold is not subject to truncation during the transformation or loading of data.
4. Field to Field Verification: This process verifies the field values from the source system to target. This process ensures that the data mapping from the source system to the target is correct, and that data sent has been loaded accurately.
5. ETL Exception Processing: Exception processing verification looks for serious data errors that would cause system processing failures or data corruption. An Exception report verifying the number and types of errors encountered is produced and reviewed for additional processing and / or reporting to the customer.
There are two types of Exception process:
1. Database Exception:
• Not Null - Source column is null while target is not null
• Reference Key - The records coming from the source data do not have a corresponding parent key in the parent table.
• Unique Key - The record already exists in the target table.
• Check Constraint - CHECK constraints enforce domain integrity by limiting the values that are accepted by a column
2. Business Exception
These are the exceptions thrown based on certain business rules defined for specific data elements or group of data elements
• ETL process utilizes a single Exception Table to capture the exceptions from various ETL sessions and an Error Lookup table which has various error codes and their description.
• We check the Exception process using the Session Log and Exception Table.

8.) Sanity Test: Exit and Suspension Criteria
1. No critical defects unfixed; No more than 3 high severity defects.
2. 80% or more of build functionality can be tested – functionality might fail because of JAVA / report code.
3. Platform performance is such that test team can productively work to schedule
4. Fewer than 15% of build fixes failed

ETL Testing Concepts

ETL stands for extract, transform, and load. It can consolidate the scattered data for any
organization while working with different departments. It can very well handle the data
coming from different departments.
For example, a health insurance organization might have information on a customer in
several departments and each department might have that customer's information listed in
a different way. The membership department might list the customer by name, whereas
the claims department might list the customer by number. ETL can bundle all this data
and consolidate it into a uniform presentation, such as for storing in a database or data
warehouse.
ETL can transform not only data from different departments but also data from different
sources altogether. For example, any organization is running its business on different
environments like SAP and Oracle Apps for their businesses. If the higher management
wants to take discussion on their business, they want to make the data integrated and used
it for their reporting purposes. ETL can take these two source system data and make it
integrated in to single format and load it into the tables.

Generally the normal testing steps are:
• Requirements Analysis
• Testing Methodologies
• Test Plans and approach
• Test Cases
• Test Execution
• Verification and Validation
• Reviews and Walkthroughs

The main difference in testing a ETL is that we basically involve the
SQL queries in our test case documents. It is vital to test both the initial loads of the Data
Warehouse from the source i.e. when it gets extracted and then updating it on the target
table i.e. the loading step. In specific cases, where trouble shooting is required, we verify
intermediate steps as well.

A defect or bug detection can be appreciated if and only if it is detected early and is fixed
at the right time without leading to a high cost. So to achieve it, it is very important to set
some basic testing rules. They are:
• No Data losses
• Correct transformation rules
• Data validation
• Regression Testing
• Oneshot/ retrospective testing
• Prospective testing
• View testing
• Sampling
• Post implementation

sql query to find duplicate values

sql query to find duplicate values

"SELECT name, email COUNT(name) AS NumOccurName, COUNT(email) as NumOccurEmail FROM users GROUP BY Name, Email HAVING ( COUNT(name) > 1 ) AND ( COUNT(email) > 1)"

select * from table1 a where rowid>(select min(rowid) from table2 b where a.col1=b.col1 and a.col2=b.col2…)

SELECT * FROM maintable p WHERE ((SELECT COUNT(id) FROM maintable WHERE p.columnwithdupes = maintable.columnwithdupes AND p.id <> maintable.id) > 0)

Delete dup

delete from test
where exists(
select * from to_delete
where to_delete.day = test.day and to_delete.min_id <> test.id
)

difference between where clause and having clause
Why can’t you use a WHERE clause?
A WHERE clause filters the rows before they are grouped together. A HAVING clause filters them after grouping. That’s why you can’t use a WHERE clause in the above query.

What is surrogate key ? where we use it expalin with examples

Surrogate key is a unique identification key, it is like an artificial or alternative key to production key, bz the production key may be alphanumeric or composite key but the surrogate key is always single numeric key.
============================
Surrogate key is the primary key for the Dimensional table.
===============================
surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used for the primary key to the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.

Data warehouses typically use a surrogate, (also known as artificial or identity key), key for the dimension tables primary keys. They can use Infa sequence generator, or Oracle sequence, or SQL Server Identity values for the surrogate key.

It is useful because the natural primary key (i.e. Customer Number in Customer table) can change and this makes updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users) but ,not only can these change, indexing on a numerical value is probably better and you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system and as far as the client is concerned you may display only the AIRPORT_NAME.

2. Adapted from response by Vincent on Thursday, March 13, 2003

Another benefit you can get from surrogate keys (SID) is :

Tracking the SCD - Slowly Changing Dimension.

Let me give you a simple, classical example:

On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would be in your Employee Dimension). This employee has a turnover allocated to him on the Business Unit 'BU1' But on the 2nd of June the Employee 'E1' is muted from Business Unit 'BU1' to Business Unit 'BU2.' All the new turnover have to belong to the new Business Unit 'BU2' but the old one should Belong to the Business Unit 'BU1.'

If you used the natural business key 'E1' for your employee within your datawarehouse everything would be allocated to Business Unit 'BU2' even what actualy belongs to 'BU1.'

If you use surrogate keys, you could create on the 2nd of June a new record for the Employee 'E1' in your Employee Dimension with a new surrogate key.

This way, in your fact table, you have your old data (before 2nd of June) with the SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee 'E1' + 'BU2.'

You could consider Slowly Changing Dimension as an enlargement of your natural key: natural key of the Employee was Employee Code 'E1' but for you it becomes
Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2.' But the difference with the natural key enlargement process, is that you might not have all part of your new key within your fact table, so you might not be able to do the join on the new enlarge key -> so you need another id.
========================
When creating a dimension table in a data warehouse, we generally create the tables witha system generated key to unqiuely identify a row in the dimension. This key is also known as a surrogate key. The surrogate key is used as the primary key in the dimension table. The surrogate key will also be placed in the fact table and a foreign key will be defined between the two tables. When you ultimately join the data it will join just as any other join within the database. should you need any further assistance pls revert to this mail id venkatdba2000@yahoo.com or venkata.veluri@gmail.comRegardsVen (Venkat)
===========================
Surrogate Key a simple concept.

Correct n exact answer for SURROGATE KEY IS BELOW:

Definition of Surrogate Key:

Alternate of Primary Key that allows duplication of datas/records.

Need, Where & Why we use Surrogate Key:

OLTP is of "Normalised Form" whereas OLAP (i.e.) Datawarehouse is of "De-normalised form".

Actually the DWH concept is to maintain the historic datas for analysing. So its should denormalized form.

To be denormalise form duplication should be allowed in DWH. When datas entering the DWH Surrogate key a new column named serial number is introduced to allow duplication in OLAP Systems to maintain historic datas.

You all know one thing a single mobile is used by other person if it is not in use for more than one year. how is it posssible just because of this Surrogate Key.

===========================
I think there already enough statements of what a surrogate key is. From experience, the huge advantage:-

Our warehouse captured data from an existing system which identified SALES by a unique number. We implemented this as the "Natural" or "Business" Key but also generated a unique "Surrogate" key. Seems pointless as the incoming business key is numeric already. But we did it.

Six months into the project, the users announced the feeder system would be replaced, but the two SALES systems would run in parallel for a year. Our existing system had a numeric natural key, the new system had a unique ID of a 20 character field.

We simply added the 20 character field to the existing dimension, and the ETL looked up using the appropriate key (existing system key or new replacement ID). Either way it matched and was converted a source system independent surrogate key.

No fuss, no re-design. It just worked. Surrogate key importance proven.

So there's little room for confusion:-

Dimension: SALESMAN

SALES_ID Number not null /* Primary Key */
LEGACY_SALE_IDENTIFIER Number /* Legacy system business key */
NEW_SALE_IDENTIFIER varchar(20) /* New system business key */
...Other attributes...

The SALES_ID the the "Surrogate key" which is the Foreign Key link to the Fact Table(s). This key is simply a generated sequence whenever a new SALESMAN entry is created.

The design (as already described) can be extended to support type 2 Slowly changing dimensions.
====================================

Monday, March 7, 2011

Oracle SQL Basics

Primary Key:

The PRIMARY KEY constraint uniquely identifies each record in a database table.

Primary keys must contain unique values.

A primary key column cannot contain NULL values.

Each table should have a primary key, and each table can have only one primary key.

Foreign Key:

A FOREIGN KEY in one table points to a PRIMARY KEY in another table.

The SELECT statement is used to select data from a database.

The DISTINCT keyword can be used to return only distinct (different) values.

The WHERE clause is used to extract only those records that fulfill a specified criterion.

The AND operator displays a record if both the first condition and the second condition is true.

The OR operator displays a record if either the first condition or the second condition is true.

The ORDER BY keyword is used to sort the result-set by a specified column.

The ORDER BY keyword sort the records in ascending order by default.

If you want to sort the records in a descending order, you can use the DESC keyword.

The INSERT INTO statement is used to insert a new row in a table.

The UPDATE statement is used to update existing records in a table.

The DELETE statement is used to delete records in a table.

The IN operator allows you to specify multiple values in a WHERE clause.

The BETWEEN operator selects a range of data between two values. The values can be numbers, text, or dates.

The COUNT() function returns the number of rows that matches a specified criteria.

EDIT statement is used to manually edit the values in the tables

ALIAS - With SQL, an alias name can be given to a table or to a column.

Aggregate functions often need an added GROUP BY statement.

--------------------------------------------------------------------------------

The GROUP BY Statement

The GROUP BY statement is used in conjunction with the aggregate functions to group the result-set by one or more columns.

SQL GROUP BY Syntax

SELECT column_name, aggregate_function(column_name)

FROM table_name

WHERE column_name operator value

GROUP BY column_name

The HAVING Clause

The HAVING clause was added to SQL because the WHERE keyword could not be used with aggregate functions.

SQL HAVING Syntax

SELECT column_name, aggregate_function(column_name)

FROM table_name

WHERE column_name operator value

GROUP BY column_name

HAVING aggregate_function(column_name) operator value

MINUS

MINUS returns all rows from the first SELECT that are not also returned by the second SELECT.

INTERSECT

INTERSECT returns only the rows retrieved by both component queries. Compare this with UNION, which returns the rows retrieved by any of the component queries. If UNION acts like 'OR', INTERSECT acts like 'AND'.

Select Query:

SELECT column_name(s)

FROM table_name (OR)

SELECT * FROM table_name

Distinct Query:

SELECT DISTINCT column_name(s) FROM table_name

Where Query:

SELECT column_name(s) FROM table_name WHERE column_name operator value (CONDITION)

Order By:

SELECT column_name(s) FROM table_name ORDER BY column_name(s) ASC|DESC

Update Query:

UPDATE table_name SET column1=value, column2=value2,... WHERE some_column=some_value

Delete Query:

DELETE FROM table_name

WHERE some_column=some_value

Edit Query:

EDIT table_name

IN Query:

SELECT column_name(s) FROM table_name WHERE column_name IN (value1,value2,...)

BETWEEN Query:

SELECT column_name(s) FROM table_name WHERE column_name BETWEEN value1 AND value2

ALIAS Query:

SELECT column_name(s) FROM table_name AS alias_name

COUNT Query:

SELECT COUNT(column_name) FROM table_name (OR)

SELECT COUNT(*) FROM table_name

COMMIT

ROLLBACK

AND/OR Query:

SELECT column_name(s) FROM table_name WHERE column_name operator value AND/OR column_name operator value

SQL JOIN

The JOIN keyword is used in an SQL statement to query data from two or more tables, based on a relationship between certain columns in these tables.

Tables in a database are often related to each other with keys.

A primary key is a column (or a combination of columns) with a unique value for each row. Each primary key value must be unique within the table. The purpose is to bind data together, across tables, without repeating all of the data in every table.

SQL INNER JOIN Keyword

The INNER JOIN keyword return rows when there is at least one match in both tables.

SQL INNER JOIN Syntax

SELECT column_name(s)

FROM table_name1

INNER JOIN table_name2

ON table_name1.column_name=table_name2.column_name

PS: INNER JOIN is the same as JOIN.

SQL LEFT JOIN Keyword

The LEFT JOIN keyword returns all rows from the left table (table_name1), even if there are no matches in the right table (table_name2).

SQL LEFT JOIN Syntax

SELECT column_name(s)

FROM table_name1

LEFT JOIN table_name2

ON table_name1.column_name=table_name2.column_name

PS: In some databases LEFT JOIN is called LEFT OUTER JOIN.

SQL RIGHT JOIN Keyword

The RIGHT JOIN keyword Return all rows from the right table (table_name2), even if there are no matches in the left table (table_name1).

SQL RIGHT JOIN Syntax

SELECT column_name(s)

FROM table_name1

RIGHT JOIN table_name2

ON table_name1.column_name=table_name2.column_name

PS: In some databases RIGHT JOIN is called RIGHT OUTER JOIN.

SQL FULL JOIN Keyword

The FULL JOIN keyword return rows when there is a match in one of the tables.

SQL FULL JOIN Syntax

SELECT column_name(s)

FROM table_name1

FULL JOIN table_name2

ON table_name1.column_name=table_name2.column_name

The SQL UNION Operator

The UNION operator is used to combine the result-set of two or more SELECT statements.

Notice that each SELECT statement within the UNION must have the same number of columns. The columns must also have similar data types. Also, the columns in each SELECT statement must be in the same order.

SQL UNION Syntax

SELECT column_name(s) FROM table_name1

UNION

SELECT column_name(s) FROM table_name2

Note: The UNION operator selects only distinct values by default. To allow duplicate values, use UNION ALL.

SQL UNION ALL Syntax

SELECT column_name(s) FROM table_name1

UNION ALL

SELECT column_name(s) FROM table_name2

PS: The column names in the result-set of a UNION are always equal to the column names in the first SELECT statement in the UNION.

SQL Constraints

Constraints are used to limit the type of data that can go into a table.

Constraints can be specified when a table is created (with the CREATE TABLE statement) or after the table is created (with the ALTER TABLE statement).

We will focus on the following constraints:

NOT NULL

UNIQUE

PRIMARY KEY

FOREIGN KEY

CHECK

DEFAULT

SQL NOT NULL Constraint

The NOT NULL constraint enforces a column to NOT accept NULL values.

The NOT NULL constraint enforces a field to always contain a value. This means that you cannot insert a new record, or update a record without adding a value to this field.

SQL UNIQUE Constraint

The UNIQUE constraint uniquely identifies each record in a database table.

The UNIQUE and PRIMARY KEY constraints both provide a guarantee for uniqueness for a column or set of columns.

A PRIMARY KEY constraint automatically has a UNIQUE constraint defined on it.

Note that you can have many UNIQUE constraints per table, but only one PRIMARY KEY constraint per table.

The CREATE INDEX statement is used to create indexes in tables.

Indexes allow the database application to find data fast; without reading the whole table.

--------------------------------------------------------------------------------

Indexes

An index can be created in a table to find data more quickly and efficiently.

The users cannot see the indexes, they are just used to speed up searches/queries.

Note: Updating a table with indexes takes more time than updating a table without (because the indexes also need an update). So you should only create indexes on columns (and tables) that will be frequently searched against.

SQL CREATE INDEX Syntax

Creates an index on a table. Duplicate values are allowed:

CREATE INDEX index_name

ON table_name (column_name)

SQL CREATE UNIQUE INDEX Syntax

Creates a unique index on a table. Duplicate values are not allowed:

CREATE UNIQUE INDEX index_name

ON table_name (column_name)

Note: The syntax for creating indexes varies amongst different databases. Therefore: Check the syntax for creating indexes in your database.

A view is a virtual table.

This chapter shows how to create, update, and delete a view.