The Dangers of Failing to Understand T-SQL’s DISTINCT and GROUP BY
Example of the Difference Between DISTINCT and GROUP BY
The distinct keyword in a T-SQL query to ensure an unique result set is often a quick and easy solution, which is often fraught with peril and performance issues. The key difference between GROUP BY and DISTINCT is when the aggregation or elimination of duplicates occurs during the query processing step. The distinct operation is completed after the non distinct result set has materialized. Thus it can not take advantage of indexes or early reduction of unnecessary data such as the GROUP BY clause can achieve.
The code snippet below is a simple example of when and how the aggregation or elimination of duplicate data occurs. The data isn’t all that important except to realize that it does contain some duplicates in the data of interest.
DECLARE @MyData TABLE( SomeValue VARCHAR(3) NOT NULL ) INSERT INTO @MyData ( SomeValue ) VALUES ( 'AAA') INSERT INTO @MyData ( SomeValue ) VALUES ( 'BBB') INSERT INTO @MyData ( SomeValue ) VALUES ( 'CCC') INSERT INTO @MyData ( SomeValue ) VALUES ( 'AAA') INSERT INTO @MyData ( SomeValue ) VALUES ( 'BBB') INSERT INTO @MyData ( SomeValue ) VALUES ( 'AAA') INSERT INTO @MyData ( SomeValue ) VALUES ( 'AAA') INSERT INTO @MyData ( SomeValue ) VALUES ( 'BBB') INSERT INTO @MyData ( SomeValue ) VALUES ( 'CCC') INSERT INTO @MyData ( SomeValue ) VALUES ( 'AAA') INSERT INTO @MyData ( SomeValue ) VALUES ( 'BBB') INSERT INTO @MyData ( SomeValue ) VALUES ( 'AAA') SELECT DISTINCT ROW_NUMBER() OVER(ORDER BY md.SomeValue) AS rn , SomeValue FROM @MyData AS md SELECT ROW_NUMBER() OVER(ORDER BY md.SomeValue) AS rn , SomeValue FROM @MyData AS md GROUP BY SomeValue
Result Set # 1 | Result Set # 2 | |||
rn | SomeValue | 1 | AAA | |
1 | AAA | 2 | BBB | |
2 | AAA | 3 | CCC | |
3 | AAA | |||
4 | AAA | |||
5 | AAA | |||
6 | AAA | |||
7 | BBB | |||
8 | BBB | |||
9 | BBB | |||
10 | BBB | |||
11 | CCC | |||
12 | CCC |
Looking at the results it should be evident that the first query using the DISTINCT directive didn't apply the de-duplication phase until after the Row Numbers were uniquely generated. Thus there were no records to eliminate due to duplication.
The second result set has the results that may have been expected in that only the unique non-duplicate set of data was returned with a simple row number.
The take-away from this is not that the GROUP BY clause is correct, rather that there are differences between the DISTINCT directive and the GROUP BY clause. They are not always interchangeable from a data perspective. The performance issue is a whole other discussion.
Reader Comments