Wednesday, August 1, 2012

SQL Interview Questions

Basic SQL Interview questions and answers:



SOURCE: http://www.techinterviews.com/interview-questions/database/


QUESTION: Name a few data types in a database.

ANSWER: (Oracle data types)

CHAR - fixed-length character data; max. size is 255 and the default is 1 byte.

VARCHAR2 - variable-length character data; max. size is 2000 and the default is 1 byte.

NUMBER(p,s) - p is precision, range 1 to 38; s is scale, range -84 to 127. A NUMBER holds up to 38 digits.

LONG - character data of variable length, up to 2 GB (64 KB in older Oracle versions). You cannot query on a LONG column.

DATE - range from Jan 1, 4712 BC to Dec 31, 4712 AD.

RAW - stores binary data (graphics, images and digitized sound); max. is 255 bytes.

The maximum number of columns in a table is 255.
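For illustration, a hypothetical table using several of these data types might be declared as follows (the table and column names are invented):

CREATE TABLE product_catalog (
    product_id    NUMBER(10),      -- whole number with precision 10
    product_code  CHAR(10),        -- fixed-length character data
    product_name  VARCHAR2(100),   -- variable-length character data
    unit_price    NUMBER(8,2),     -- precision 8, scale 2
    created_date  DATE,            -- date (and time) value
    thumbnail     RAW(255)         -- small binary value
);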



QUESTION: WHAT ARE TABLE CONSTRAINTS? NAME A FEW TABLE CONSTRAINTS.

ANSWER:

not null - value in a column must not be NULL

unique - value(s) in specified column(s) must be unique for each row in a table

primary key - value(s) in specified column(s) must be unique for each row in a table and not be NULL; normally each table in a database should have a primary key - it is used to identify individual records

foreign key - value(s) in specified column(s) must reference an existing record in another table (via its primary key or some other unique constraint)

check - an expression is specified, which must evaluate to true for the constraint to be satisfied.

A constraint allows you to apply simple referential integrity checks to a table. There are four primary types of constraints that are currently supported by SQL Server:

PRIMARY/UNIQUE - enforces uniqueness of a particular table column.

DEFAULT - specifies a default value for a column in case an insert operation does not provide one.

FOREIGN KEY - validates that every value in a column exists in a column of another table.

CHECK - checks that every value stored in a column is in some specified list. Each type of constraint performs a specific type of action. (Strictly speaking, DEFAULT only supplies a value rather than restricting one, which is why some people do not count it as a true constraint.)

NOT NULL is one more constraint; it does not allow NULL values in the specified column. It is also the only constraint that must be defined at the column level rather than at the table level.
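As a rough sketch, a table that uses most of these constraints might be declared like this (generic ANSI-style syntax; the table and column names are invented, and the exact syntax for defaults and inline foreign keys varies by database):

CREATE TABLE employees (
    emp_id    INT          NOT NULL PRIMARY KEY,              -- unique, non-NULL identifier
    email     VARCHAR(100) NOT NULL UNIQUE,                   -- no duplicate e-mail addresses
    dept_id   INT          REFERENCES departments (dept_id),  -- foreign key to a parent table
    salary    DECIMAL(9,2) CHECK (salary > 0),                -- expression must evaluate to true
    hire_date DATE         DEFAULT CURRENT_DATE               -- value used when none is supplied
);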



QUESTION: WHAT IS DATA INTEGRITY? HOW DO YOU ENSURE DATA INTEGRITY IS ENFORCED IN A TABLE?

ANSWER: DATA INTEGRITY MEANS THE DATA IN A TABLE IS ACCURATE AND CONSISTENT. TO ENFORCE IT, VALIDATE THAT THE PROPER CONSTRAINTS, SUCH AS PRIMARY KEY AND FOREIGN KEY, ARE DEFINED FOR THE TABLE.



QUESTION: WHAT DOES referential integrity mean?

ANSWER: REFERENTIAL INTEGRITY MEANS THAT EVERY FOREIGN KEY VALUE IN A CHILD TABLE MUST MATCH AN EXISTING PRIMARY KEY (OR UNIQUE KEY) VALUE IN THE PARENT TABLE. IT IS ENFORCED THROUGH PRIMARY KEY AND FOREIGN KEY CONSTRAINTS; RELATED KEY TYPES INCLUDE COMPOSITE KEY, SURROGATE KEY AND CANDIDATE KEY.



QUESTION: NAME A FEW SQL CLAUSES:

Where

HAVING

IN

ANY

TOP

ROWNUM



QUESTION: WHAT IS AN AGGREGATE FUNCTION? Name a few aggregate functions.

ANSWER: An aggregate function performs a calculation on a set of rows and returns a single value, for example:

COUNT

SUM

AVG

MIN

MAX



QUESTION: What are the rules when you want to use aggregate functions?

ANSWER: THE BASIC RULE IS TO USE "GROUP BY" WHEN USING AN AGGREGATE FUNCTION IN A QUERY: EVERY NON-AGGREGATED COLUMN IN THE SELECT LIST MUST APPEAR IN THE "GROUP BY" CLAUSE, AND CONDITIONS ON AGGREGATES BELONG IN "HAVING" RATHER THAN "WHERE".





QUESTION: What is the difference between the WHERE and HAVING clauses?

ANSWER: In SQL, WHERE filters data at the lowest (row) level.

HAVING filters data after GROUP BY has been performed, so it filters on "groups". "WHERE" is a kind of restriction statement: you use the WHERE clause to restrict the rows coming from the database, and it is applied before the result is retrieved.

The HAVING clause, on the other hand, is applied after the data has been retrieved and grouped; it is a kind of filtering command on the grouped result.



QUESTION: IN WHICH SCENARIO WOULD YOU USE THE "IN" OPERATOR? CAN YOU EXPLAIN ITS PURPOSE?



QUESTION: WHAT IS THE DIFFERENCE BETWEEN "IN" AND "EXISTS"?
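No answer is given in the original post. As a rough sketch, the same filter written with IN and with EXISTS might look like this (the employees and departments tables are hypothetical):

-- IN compares a column against the list of values returned by the subquery
SELECT e.emp_id, e.emp_name
FROM employees e
WHERE e.dept_id IN (SELECT d.dept_id FROM departments d WHERE d.location = 'NY');

-- EXISTS checks, for each row, whether the correlated subquery returns any row
SELECT e.emp_id, e.emp_name
FROM employees e
WHERE EXISTS (SELECT 1 FROM departments d
              WHERE d.dept_id = e.dept_id AND d.location = 'NY');

EXISTS stops at the first matching row and behaves more predictably with NULLs (especially NOT EXISTS versus NOT IN), which is the usual talking point in this comparison.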



QUESTION: WRITE A SAMPLE QUERY TO FIND THE SUM OF SALES FOR EACH DEPARTMENT. LET'S SAY THE TABLE NAME IS SALES_ORDER.
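A minimal sketch, assuming SALES_ORDER has department and sales columns (the column names are not given in the question):

SELECT department, SUM(sales) AS total_sales
FROM sales_order
GROUP BY department;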



QUESTION: SAMPLE QUERY USING WHERE CLAUSE and HAVING CLAUSE?

SELECT department, SUM(sales) as "Total sales" FROM order_details GROUP BY department HAVING SUM(sales) > 1000;

SELECT department, COUNT(*) as "Number of employees" FROM employees WHERE salary > 25000 GROUP BY department HAVING COUNT(*) > 10;



QUESTION: WHAT IS UNION, AND WHAT IS THE DIFFERENCE BETWEEN UNION AND UNION ALL?



QUESTION: HOW DO YOU COMPARE DATA BETWEEN SOURCE and TARGET

ANSWER: MINUS. Run a MINUS (EXCEPT in SQL Server) query in each direction, source MINUS target and target MINUS source; if both return zero rows, the data sets match.
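A minimal sketch, assuming a source table and a target table with the same structure (the names are hypothetical):

-- Rows present in the source but missing or different in the target
SELECT * FROM source_table
MINUS
SELECT * FROM target_table;

-- Rows present in the target but missing or different in the source
SELECT * FROM target_table
MINUS
SELECT * FROM source_table;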



QUESTION: How can I create a table from another table, copying DATA and STATS?

ANSWER: In Teradata, for example: CREATE TABLE new_table AS old_table WITH DATA AND STATS; (The variant with SELECT * FROM old_table WHERE 1=2 copies only the structure, since the WHERE 1=2 condition returns no rows.)



QUESTION: WHAT IS THE DIFFERENCE BETWEEN DELETE AND TRUNCATE?



QUESTION: What type of joins have you used?

ANSWER: Joins knowledge is a MUST HAVE. This interview question is quite nice because most people have used inner join and (left/right) outer join, which is rather mandatory knowledge, but the more experienced will also mention cross join and self-join. In SQL Server you can also get a full outer join.
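A few illustrative sketches, assuming hypothetical employees and departments tables:

-- Inner join: only employees that have a matching department
SELECT e.emp_name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

-- Left outer join: all employees, including those without a department
SELECT e.emp_name, d.dept_name
FROM employees e
LEFT OUTER JOIN departments d ON e.dept_id = d.dept_id;

-- Self-join: each employee together with his or her manager
SELECT e.emp_name, m.emp_name AS manager_name
FROM employees e
JOIN employees m ON e.manager_id = m.emp_id;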



QUESTION: How can you combine two tables/views together? For instance, one table contains 100 rows and the other one contains 200 rows, they have exactly the same fields, and you want to show a query with all the data (300 rows). This SQL interview question can get complicated.

ANSWER: You use the UNION operator. You can drill down into this question and ask what the difference is between UNION and UNION ALL (the first one removes duplicates, which is not always desirable; in other words it shows only DISTINCT rows, while UNION ALL just combines everything, so it is also faster). Trickier questions are how to sort the combined result (you use ORDER BY after the last query) and how fields are named in the query results/view schema (the field names of the first query are used). To filter groups when you use UNION, you would create a separate query, use a common table expression (CTE), or wrap the unioned queries in parentheses in the FROM clause.
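A minimal sketch, assuming two hypothetical tables, current_customers and archived_customers, with identical columns:

-- UNION would remove duplicate rows; UNION ALL keeps them and is faster
SELECT customer_id, customer_name FROM current_customers
UNION ALL
SELECT customer_id, customer_name FROM archived_customers
ORDER BY customer_name;   -- ORDER BY is applied once, after the last query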



QUESTION: WHAT IS A DERIVED TABLE?



QUESTION: WHAT IS A SUBQUERY, AND HOW DO YOU WRITE ONE?
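No answer is given in the post. A basic sketch of a subquery (a query nested inside another query), using a hypothetical employees table:

-- Employees earning more than the overall average salary
SELECT emp_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);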



QUESTION: How would you apply a date range filter?

ANSWER: This is a tricky question. You can use a simple condition with >= and <= (or similar) or use BETWEEN ... AND, but the trick is to know your exact data type. Sometimes date fields contain a time portion, and that is where the query can go wrong, so it is recommended to use date-related functions to remove the time issue. In SQL Server a common way to do that is with the DATEADD/DATEDIFF functions (or by casting to a date-only type). You also have to be aware of different time zones and the server time zone.
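A sketch in T-SQL, assuming a hypothetical orders table whose order_date column may contain a time portion:

-- Half-open range: catches every time of day on 31 Jan 2012 without touching the column
SELECT order_id, order_date
FROM orders
WHERE order_date >= '20120101'
  AND order_date <  '20120201';

-- Alternatively, strip the time portion with the DATEADD/DATEDIFF idiom
SELECT order_id, order_date
FROM orders
WHERE DATEADD(d, DATEDIFF(d, 0, order_date), 0) BETWEEN '20120101' AND '20120131';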



QUESTION: What types of wildcards have you used? This is usually one of the mandatory SQL interview questions.

ANSWER: The first question is: what is a wildcard? Wildcards are special characters that allow matching strings without an exact match. In simple words, they work like "contains" or "begins with". Wildcard characters are software specific; in SQL Server we have %, which represents any group of characters, _ which represents any one character, and also [], where [ab] means a character that is either a or b in a specific position.
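A couple of sketches, assuming hypothetical employees and products tables (the [ ] wildcard is SQL Server specific):

-- Names that start with 'Jo', followed by exactly one character, then anything
SELECT emp_name FROM employees WHERE emp_name LIKE 'Jo_%';

-- Product codes whose first character is A or B
SELECT product_code FROM products WHERE product_code LIKE '[AB]%';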



QUESTION: How do you find orphans?

ANSWER: This is a more comprehensive SQL and database interview question. First of all we test whether the candidate knows what an orphan is. An orphan is a foreign key value in the "child table" which doesn't exist in the primary key column of the parent table. To get it you can use a left outer join (important: child table on the left side) with the join condition on the primary/foreign key columns and a WHERE clause where the parent's primary key is null. Adding DISTINCT or COUNT to the select is common practice. In SQL Server you can also use EXCEPT, which will show all unique values from the first query that don't exist in the second query.
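A sketch of the left-join approach, assuming hypothetical order_items (child) and orders (parent) tables:

-- Child rows whose order_id has no matching parent row
SELECT DISTINCT c.order_id
FROM order_items c
LEFT OUTER JOIN orders p ON c.order_id = p.order_id
WHERE p.order_id IS NULL;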



QUESTION: How would you solve the following SQL queries using today's date: first day of previous month, first day of current month, last day of previous month, last day of current month?

To return the first day of the previous month using T-SQL we can use the following statement:

SELECT DATEADD(m,-1, DATEADD(d, 1-DATEPART(d,GETDATE()), GETDATE())) AS [First Day of Previous Month]

1) Say today's date is the 20th of Feb. We subtract the number of days into the month (20) from our date and get the last day of the previous month.

2) We add one day to get the first day of the current month.

3) We go one month back, so the first day of the month stays the same and Feb changes to Jan.

ANSWER: These tasks require a good grasp of SQL functions but also logical thinking, which is one of the primary skills involved in solving SQL questions. In this case I provided links to actual answers with code samples. Experienced people should give the correct answer almost immediately. People with less experience might need more time or would require some help (Google).
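As a sketch, the remaining tasks can be answered with the classic T-SQL DATEADD/DATEDIFF idiom (one of several possible approaches):

-- First day of current month
SELECT DATEADD(m, DATEDIFF(m, 0, GETDATE()), 0) AS [First Day of Current Month];

-- Last day of previous month (one day before the first day of the current month)
SELECT DATEADD(d, -1, DATEADD(m, DATEDIFF(m, 0, GETDATE()), 0)) AS [Last Day of Previous Month];

-- Last day of current month (one day before the first day of next month)
SELECT DATEADD(d, -1, DATEADD(m, DATEDIFF(m, 0, GETDATE()) + 1, 0)) AS [Last Day of Current Month];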



QUESTION: How do you find duplicate records in a table?

ANSWER: SELECT name, email, COUNT(name) AS NumOccurName, COUNT(email) AS NumOccurEmail FROM users GROUP BY name, email HAVING ( COUNT(name) > 1 ) AND ( COUNT(email) > 1 );



QUESTION: How do you find the 2nd maximum salary of an employee?
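No answer is given in the post. One common sketch uses a subquery against a hypothetical employees table (TOP or ranking functions are alternatives):

-- Highest salary that is lower than the overall maximum
SELECT MAX(salary) AS second_max_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);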



QUESTION: What is "normalization"? "Denormalization"? Why do you sometimes want to denormalize?

ANSWER: Normalizing data means eliminating redundant information from a table and organizing the data so that future changes to the table are easier. Denormalization means allowing redundancy in a table. The main benefit of denormalization is improved performance with simplified data retrieval and manipulation. This is done by reduction in the number of joins needed for data processing.



QUESTION: What types of index data structures can you have?

ANSWER: An index helps to search values in tables faster. The three most commonly used index types are:

B-Tree: builds a tree of possible values with a list of row IDs that have the leaf value. Needs a lot of space and is the default index type for most databases.

Bitmap: a string of bits for each possible value of the column; each bit string has one bit for each row. Needs only a little space and is very fast. (However, the domain of values cannot be large, e.g. SEX(m,f) or degree(BS,MS,PHD).)

Hash: a hashing algorithm is used to assign a set of characters to represent a text string such as a composite of keys or partial keys, and compresses the underlying data. Takes longer to build and is supported by relatively few databases.
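An illustrative sketch (Oracle-style syntax; the table and column names are hypothetical), creating a default B-tree index and a bitmap index:

-- Default B-tree index on a high-cardinality column
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Bitmap index on a low-cardinality column
CREATE BITMAP INDEX idx_employees_sex ON employees (sex);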



QUESTION: WHAT IS A TRIGGER?

ANSWER: Triggers are stored procedures created in order to enforce integrity rules in a database. A trigger is executed automatically every time a data-modification operation occurs (i.e., insert, update or delete). A trigger is a database object directly associated with a particular table; it fires whenever a specific statement, or type of statement, is issued against that table (insert, update or delete). Basically, a trigger is a set of SQL statements. A trigger is also a solution to the restrictions of a constraint. For instance: 1. A constraint cannot use pseudo-columns as criteria, whereas a trigger can. 2. A constraint cannot refer to the old and new values of a row, whereas a trigger can.
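A rough sketch in Oracle-style PL/SQL, assuming hypothetical employees and salary_audit tables, showing the :OLD/:NEW values that a constraint cannot reference:

CREATE OR REPLACE TRIGGER trg_salary_audit
AFTER UPDATE OF salary ON employees
FOR EACH ROW
BEGIN
  -- Record the old and new salary every time it changes
  INSERT INTO salary_audit (emp_id, old_salary, new_salary, changed_on)
  VALUES (:OLD.emp_id, :OLD.salary, :NEW.salary, SYSDATE);
END;
/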



QUESTION: WHAT IS AN INDEX? WHAT IS THE DIFFERENCE BETWEEN A CLUSTERED INDEX AND A NON-CLUSTERED INDEX?



QUESTION: What are the tradeoffs with having indexes?

ANSWER:

1. Faster selects, slower updates.

2. Extra storage space to store indexes. Updates are slower because in addition to updating the table you have to update the index.



QUESTION: WHAT ARE TABLE PARTITIONS?



QUESTION: Why can a "group by" or "order by" clause be expensive to process?

ANSWER: Processing a "group by" or "order by" clause often requires the creation of temporary tables to process the results of the query, which, depending on the size of the result set, can be very expensive.



QUESTION: What is a SQL view?

ANSWER: The output of a query can be stored as a view. A view acts like a small table which meets our criteria. A view is a precompiled SQL query which is used to select data from one or more tables. A view is like a table, but it doesn't physically take any space. A view is a good way to present data in a particular format if you use that query quite often. Views can also be used to restrict users from accessing the tables directly.
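A minimal sketch, reusing the hypothetical employees table from earlier examples:

-- Expose only non-sensitive columns of the employees table
CREATE VIEW v_employee_directory AS
SELECT emp_id, emp_name, dept_id
FROM employees;

-- Users query the view just like a table
SELECT * FROM v_employee_directory WHERE dept_id = 10;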



QUESTION: WHAT IS A MATERIALIZED VIEW?













Saturday, April 21, 2012

Strategies for Testing Data Warehouse Applications

Businesses are increasingly focusing on the collection and organization of data for strategic decision-making. The ability to review historical trends and monitor near real-time operational data has become a key competitive advantage.
This article provides practical recommendations for testing extract, transform and load (ETL) applications based on years of experience testing data warehouses in the financial services and consumer retailing areas. Every attempt has been made to keep this article tool-agnostic so as to be applicable to any organization attempting to build or improve on an existing data warehouse.
Testing Goals
There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions. Given the importance of early detection of software defects, let's first review some general goals of testing an ETL application:


Data completeness. Ensures that all expected data is loaded.
Data transformation. Ensures that all data is transformed correctly according to business rules and/or design specifications.
Data quality. Ensures that the ETL application correctly rejects, substitutes default values, corrects or ignores and reports invalid data.
Performance and scalability. Ensures that data loads and queries perform within expected time frames and that the technical architecture is scalable.
Integration testing. Ensures that the ETL process functions well with other upstream and downstream processes.
User-acceptance testing. Ensures the solution meets users' current expectations and anticipates their future expectations.
Regression testing. Ensures existing functionality remains intact each time a new release of code is completed.

Data Completeness
One of the most basic tests of data completeness is to verify that all expected data loads into the data warehouse. This includes validating that all records, all fields and the full contents of each field are loaded. Strategies to consider include:
Comparing record counts between source data, data loaded to the warehouse and rejected records (see the sketch after this list).
Comparing unique values of key fields between source data and data loaded to the warehouse. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
Utilizing a data profiling tool that shows the range and value distributions of fields in a data set. This can be used during testing and in production to compare source and target data sets and point out any data anomalies from source systems that may be missed even when the data movement is correct.
Populating the full contents of each field to validate that no truncation occurs at any step in the process. For example, if the source data field is a string(30) make sure to test it with 30 characters.
Testing the boundaries of each field to find any database limitations. For example, for a decimal(3) field include values of -99 and 999, and for date fields include the entire range of dates expected. Depending on the type of database and how it is indexed, it is possible that the range of values the database accepts is too small.
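As a rough illustration of the first two strategies (the staging, warehouse and reject table names below are invented; a real comparison would be driven by the mapping document):

-- Record counts: source vs. target vs. rejects
SELECT 'source' AS side, COUNT(*) AS row_count FROM stg_customer
UNION ALL
SELECT 'target', COUNT(*) FROM dw_customer
UNION ALL
SELECT 'reject', COUNT(*) FROM rej_customer;

-- Key values present in the source but missing from the warehouse
SELECT customer_id FROM stg_customer
MINUS
SELECT customer_id FROM dw_customer;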

Data Transformation
Validating that data is transformed correctly based on business rules can be the most complex part of testing an ETL application with significant transformation logic. One typical method is to pick some sample records and "stare and compare" to validate data transformations manually. This can be useful but requires manual testing steps and testers who understand the ETL logic. A combination of automated data profiling and automated data movement validations is a better long-term strategy. Here are some simple automated data movement techniques:
Create a spreadsheet of scenarios of input data and expected results and validate these with the business customer. This is a good requirements elicitation exercise during design and can also be used during testing.
Create test data that includes all scenarios. Elicit the help of an ETL developer to automate the process of populating data sets with the scenario spreadsheet to allow for flexibility because scenarios will change.
Utilize data profiling results to compare range and distribution of values in each field between source and target data.
Validate correct processing of ETL-generated fields such as surrogate keys.
Validate that data types in the warehouse are as specified in the design and/or the data model.
Set up data scenarios that test referential integrity between tables. For example, what happens when the data contains foreign key values not in the parent table?
Validate parent-to-child relationships in the data. Set up data scenarios that test how orphaned child records are handled.

Data Quality
For the purposes of this discussion, data quality is defined as "how the ETL system handles data rejection, substitution, correction and notification without modifying data." To ensure success in testing data quality, include as many data scenarios as possible. Typically, data quality rules are defined during design, for example:
Reject the record if a certain decimal field has nonnumeric data.
Substitute null if a certain decimal field has nonnumeric data.
Validate and correct the state field if necessary based on the ZIP code.
Compare product code to values in a lookup table, and if there is no match load anyway but report to users.
Depending on the data quality rules of the application being tested, scenarios to test might include null key values, duplicate records in source data and invalid data types in fields (e.g., alphabetic characters in a decimal field). Review the detailed test scenarios with business users and technical designers to ensure that all are on the same page. Data quality rules applied to the data will usually be invisible to the users once the application is in production; users will only see what's loaded to the database. For this reason, it is important to ensure that what is done with invalid data is reported to the users. These data quality reports present valuable data that sometimes reveals systematic issues with source data. In some cases, it may be beneficial to populate the "before" data in the database for users to view.

Performance and Scalability
As the volume of data in a data warehouse grows, ETL load times can be expected to increase, and performance of queries can be expected to degrade. This can be mitigated by having a solid technical architecture and good ETL design. The aim of the performance testing is to point out any potential weaknesses in the ETL design, such as reading a file multiple times or creating unnecessary intermediate files. The following strategies will help discover performance issues:
Load the database with peak expected production volumes to ensure that this volume of data can be loaded by the ETL process within the agreed-upon window.
Compare these ETL loading times to loads performed with a smaller amount of data to anticipate scalability issues. Compare the ETL processing times component by component to point out any areas of weakness.
Monitor the timing of the reject process and consider how large volumes of rejected data will be handled.
Perform simple and multiple join queries to validate query performance on large database volumes. Work with business users to develop sample queries and acceptable performance criteria for each query.

Integration Testing
Typically, system testing only includes testing within the ETL application. The endpoints for system testing are the input and output of the ETL code being tested. Integration testing shows how the application fits into the overall flow of all upstream and downstream applications. When creating integration test scenarios, consider how the overall process can break and focus on touchpoints between applications rather than within one application. Consider how process failures at each step would be handled and how data would be recovered or deleted if necessary.
Most issues found during integration testing are either data related or result from false assumptions about the design of another application. Therefore, it is important to integration test with production-like data. Real production data is ideal, but depending on the contents of the data, there could be privacy or security concerns that require certain fields to be randomized before using it in a test environment. As always, don't forget the importance of good communication between the testing and design teams of all systems involved. To help bridge this communication gap, gather team members from all systems together to formulate test scenarios and discuss what could go wrong in production. Run the overall process from end to end in the same order and with the same dependencies as in production. Integration testing should be a combined effort and not the responsibility solely of the team testing the ETL application.

User-Acceptance Testing
The main reason for building a data warehouse application is to make data available to business users. Users know the data best, and their participation in the testing effort is a key component to the success of a data warehouse implementation. User-acceptance testing (UAT) typically focuses on data loaded to the data warehouse and any views that have been created on top of the tables, not the mechanics of how the ETL application works. Consider the following strategies:
Use data that is either from production or as near to production data as possible. Users typically find issues once they see the "real" data, sometimes leading to design changes.
Test database views comparing view contents to what is expected. It is important that users sign off and clearly understand how the views are created.
Plan for the system test team to support users during UAT. The users will likely have questions about how the data is populated and need to understand details of how the ETL works.
Consider how the users would require the data loaded during UAT and negotiate how often the data will be refreshed.

Regression Testing
Regression testing is revalidation of existing functionality with each new release of code. When building test cases, remember that they will likely be executed multiple times as new releases are created due to defect fixes, enhancements or upstream systems changes. Building automation during system testing will make the process of regression testing much smoother. Test cases should be prioritized by risk in order to help determine which need to be rerun for each new release. A simple but effective and efficient strategy to retest basic functionality is to store source data sets and results from successful runs of the code and compare new test results with previous runs. When doing a regression test, it is much quicker to compare results to a previous execution than to do an entire data validation again.
Taking these considerations into account during the design and testing portions of building a data warehouse will ensure that a quality product is produced and prevent costly mistakes from being discovered in production.

Source: http://www.information-management.com/issues/20070601/1086005-1.html?pg=2