In the course of trying to answer a question for my wife this evening, I ran across an interesting type of query that, though it seems simple and common, took me a moment to figure out the best approach for.
She had a very large set of (poorly cleaned) data and needed to know how many rows of each unique combination were present. Essentially, the question boils down to “how do you perform a grouping query while simultaneously removing duplicates?”
So, for example, say you had the following data set and wanted to know how many dogs of each color there are:
kind | color | name |
---|---|---|
dog | black | rover |
dog | black | rover |
dog | white | spot |
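To make the examples concrete, here is a sketch of that data set as a PostgreSQL table (the table name `animals` and the column types are my assumptions, not her actual schema):

```sql
-- Hypothetical setup mirroring the data set above.
CREATE TABLE animals (
    kind  text,
    color text,
    name  text
);

INSERT INTO animals (kind, color, name) VALUES
    ('dog', 'black', 'rover'),
    ('dog', 'black', 'rover'),  -- the duplicate row
    ('dog', 'white', 'spot');
```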
If you were to do the standard GROUP BY against this table:
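For example (a sketch of such a query, against the `animals` table assumed above):

```sql
-- A plain GROUP BY counts every row, including the duplicate one.
SELECT kind, color, COUNT(*) AS count
FROM animals
GROUP BY kind, color;
```

Running that, you would get the following: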
kind | color | count |
---|---|---|
dog | black | 2 |
dog | white | 1 |
However, as discussed above, my wife didn’t want that duplicate dog counted.
Here’s the query I used (in PostgreSQL) to achieve that:
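One way to do it (a sketch rather than the exact query, against the assumed `animals` table) is to strip out the duplicate rows with `DISTINCT` in a subquery and then group over the deduplicated result; in this particular case `COUNT(DISTINCT name)` would work as well:

```sql
-- Remove fully duplicated rows first, then group and count.
SELECT kind, color, COUNT(*) AS count
FROM (
    SELECT DISTINCT kind, color, name
    FROM animals
) AS deduped
GROUP BY kind, color;
```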
resulting in:
kind | color | count |
---|---|---|
dog | black | 1 |
dog | white | 1 |
Of course, the reason this rarely comes up is that, with good data/models, you really shouldn’t have duplicate rows in the first place, but you don’t always control that. :)