Pocket Joshua - SQL Journal - The Dangers of Failing to Understand T-SQL’s DISTINCT and GROUP BY

Monday

May172010

The Dangers of Failing to Understand T-SQL’s DISTINCT and GROUP BY

Monday, May 17, 2010 at 9:59AM

Example of the Difference Between DISTINCT and GROUP BY

The distinct keyword in a T-SQL query to ensure an unique result set is often a quick and easy solution, which is often fraught with peril and performance issues. The key difference between GROUP BY and DISTINCT is when the aggregation or elimination of duplicates occurs during the query processing step. The distinct operation is completed after the non distinct result set has materialized. Thus it can not take advantage of indexes or early reduction of unnecessary data such as the GROUP BY clause can achieve.

The code snippet below is a simple example of when and how the aggregation or elimination of duplicate data occurs. The data isn’t all that important except to realize that it does contain some duplicates in the data of interest.

DECLARE @MyData TABLE(
    SomeValue    VARCHAR(3) NOT NULL
)

INSERT INTO @MyData ( SomeValue ) VALUES ( 'AAA')
INSERT INTO @MyData ( SomeValue ) VALUES ( 'BBB')
INSERT INTO @MyData ( SomeValue ) VALUES ( 'CCC')
INSERT INTO @MyData ( SomeValue ) VALUES ( 'AAA')
INSERT INTO @MyData ( SomeValue ) VALUES ( 'BBB')
INSERT INTO @MyData ( SomeValue ) VALUES ( 'AAA')
INSERT INTO @MyData ( SomeValue ) VALUES ( 'AAA')
INSERT INTO @MyData ( SomeValue ) VALUES ( 'BBB')
INSERT INTO @MyData ( SomeValue ) VALUES ( 'CCC')
INSERT INTO @MyData ( SomeValue ) VALUES ( 'AAA')
INSERT INTO @MyData ( SomeValue ) VALUES ( 'BBB')
INSERT INTO @MyData ( SomeValue ) VALUES ( 'AAA')


SELECT DISTINCT
      ROW_NUMBER() OVER(ORDER BY md.SomeValue) AS rn
    , SomeValue
FROM 
    @MyData AS md    


SELECT 
      ROW_NUMBER() OVER(ORDER BY md.SomeValue) AS rn
    , SomeValue
FROM 
    @MyData AS md    
GROUP BY
      SomeValue

Result Set # 1		Result Set # 2
rn	SomeValue	1	AAA
1	AAA	2	BBB
2	AAA	3	CCC
3	AAA
4	AAA
5	AAA
6	AAA
7	BBB
8	BBB
9	BBB
10	BBB
11	CCC
12	CCC

Looking at the results it should be evident that the first query using the DISTINCT directive didn't apply the de-duplication phase until after the Row Numbers were uniquely generated. Thus there were no records to eliminate due to duplication.

The second result set has the results that may have been expected in that only the unique non-duplicate set of data was returned with a simple row number.

The take-away from this is not that the GROUP BY clause is correct, rather that there are differences between the DISTINCT directive and the GROUP BY clause. They are not always interchangeable from a data perspective. The performance issue is a whole other discussion.

Reader Comments

There are no comments for this journal entry. To create a new comment, use the form below.

Post a New Comment

Enter your information below to add a new comment.

My response is on my own website »

Author:

Author Email (optional):

Author URL (optional):

Post:

↓ | ↑

All HTML will be escaped. Hyperlinks will be created for URLs automatically.