distinct

In the course of trying to answer a question for my wife this evening, I ran across an interesting type of query that, though it would seem simple and common, did take me a moment to figure out the best approach.

She had a very large set of (poorly cleaned) data that she needed to know how many of each unique set were present. Essentially, the question boils down to “how do you perform a grouping query while simultaneously removing duplicates?”

So, for example, if you had the following data set and wanted to know how many dogs of each color there are?

kind	color	name
dog	black	rover
dog	black	rover
dog	white	spot

If you were to do the standard GROUP BY:

SELECT
  A,
  B,
  COUNT(C)
FROM
  test
GROUP BY
  A,
  B
ORDER BY
  A,
  B;

against this table, you would get the following:

A     B      COUNT    
----  -----  --------
dog   black  2        
dog   white  1        

However, as discussed above, my wife didn’t want that duplicate dog counted.

Here’s the query I used (in PostgreSQL) to achieve that:

SELECT
  A,
  B,
  COUNT(B)
FROM
  (SELECT DISTINCT A, B, C FROM test) AS t
GROUP BY
  A,
  B
ORDER BY
  A,
  B;

resulting in:

A     B      COUNT    
----  -----  --------
dog   black  1
dog   white  1

Of course, the reason this rarely comes up is because, with good data/models, you really shouldn’t have duplicate rows, but you don’t always control that. :)

Daryl Richter

distinct

You might also enjoy (View all posts)