Note added 15Sep2017
I talked to Lloyd Tabb at the Looker Join conference and asked him how he came up with this strategy: A colleague asked him whether he knew that you could do a sum(distinct . . .)
. Lloyd did not know, but when told, this idea came to him very quickly.
It has been submitted as a patent.
Looker and Symmetric Aggregates (the 1:many "fanout" problem with aggregates)
I was looking at some Lookergenerated SQL and saw this:
SELECT
COALESCE(SUM(customers.visits ), 0) AS "customers.total_visits"
FROM public.customers AS customers
(See http://localhost:9999/explore/fanout2/customers?qid=ek2NHkVfoJ34yRkLHgagFj)
Simple. Then I added an aggregate from a joined view (a "view" in Looker is a thin façade on a table) and I saw this:
SELECT
COALESCE(COALESCE( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(customers.visits ,0)*(1000000*1.0)) AS DECIMAL(65,0))) + ('x'  MD5(customers.id ::varchar))::bit(64)::bigint::DECIMAL(65,0) *18446744073709551616 + ('x'  SUBSTR(MD5(customers.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0) )  SUM(DISTINCT ('x'  MD5(customers.id ::varchar))::bit(64)::bigint::DECIMAL(65,0) *18446744073709551616 + ('x'  SUBSTR(MD5(customers.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0)) ) / (1000000*1.0), 0), 0) AS "customers.total_visits",
COALESCE(COALESCE( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(orders.amount ,0)*(1000000*1.0)) AS DECIMAL(65,0))) + ('x'  MD5(orders.id ::varchar))::bit(64)::bigint::DECIMAL(65,0) *18446744073709551616 + ('x'  SUBSTR(MD5(orders.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0) )  SUM(DISTINCT ('x'  MD5(orders.id ::varchar))::bit(64)::bigint::DECIMAL(65,0) *18446744073709551616 + ('x'  SUBSTR(MD5(orders.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0)) ) / (1000000*1.0), 0), 0) AS "orders.total_amount"
FROM public.customers AS customers
LEFT JOIN public.orders AS orders ON customers.id = orders.customer_id
What are they doing? Looker [rightly, I think] brags about this and calls it "symmetric aggregates" (and here is the business case), but they don't explain how it works. I conducted some searches and it would seem that Looker's strategy here is unique and a genuine innovation.
Sample database
drop table if exists customers;
drop table if exists orders;
create table customers (id int, first_name varchar, last_name varchar, visits int);
create table orders (id int, amount numeric(6,2), customer_id int);
insert into customers (id, first_name, last_name, visits) values (1, 'Amelia', 'Earhart', 2);
insert into customers (id, first_name, last_name, visits) values (2, 'Charles', 'Lindberg', 2);
insert into customers (id, first_name, last_name, visits) values (3, 'Wilbur', 'Wright', 4);
insert into orders (id, amount, customer_id) values (1, 25.00, 1);
insert into orders (id, amount, customer_id) values (2, 50.00, 1);
insert into orders (id, amount, customer_id) values (3, 75.00, 2);
insert into orders (id, amount, customer_id) values (4, 100.00, 3);
How many visits are there?
select sum(visits) from customers;
 returns 8
Yay! What is the average length of a visit?
select avg(visits) from customers;
 returns 2.66
And now let's look at the classic "surprise" for beginning users of SQL: Join in a column, and . . .
select avg(visits) from customers left join orders on orders.customer_id = customers.id;
 returns 2.5
 oops!
Let's add some more columns, remove the aggregate, and look at the data:
select
customers.id as customer_id,
customers.first_name,
customers.last_name,
customers.visits,
orders.id as order_id,
orders.amount
from customers left join orders on orders.customer_id = customers.id
(http://localhost:9999/sql/mx9cgdrvfhwsdc)
That produces:
customer_id  first_name  last_name  visits  order_id  amount 

1  Amelia  Earhart  2  1  25.00 
1  Amelia  Earhart  2  2  50.00 
2  Charles  Lindberg  2  3  75.00 
3  Wilbur  Wright  4  4  100.00 
Clear enough. We have "extra" values in the visits column because of the 1:many ("fanout") relationship between customers and orders.
If we want the counts of customers and the counts of amounts, we can just use
count(distinct ...)
:
select
count(distinct customers.id) as "customers count",
count(distinct orders.id) as "orders count"
from customers left join orders on orders.customer_id = customers.id;
Works great! However, sum
is problematic.
select
sum(customers.visits) as "customers total visits",
sum(orders.amount) as "orders total amount"
from customers left join orders on orders.customer_id = customers.id;
(http://localhost:9999/sql/2brb8hdmwdbhtx)
Result:
customers total visits  orders total amount 

10  250.00 
Clearly this is going to seem "wrong" to people who aren't comfortable with joins. The classic ways to fix this: with a subquery or creative selfjoins or window functions. However, these solutions can introduces as many problems as they solve (subqueries and window functions aren't as easily composable as the Looker solution), especially for an analytic tool where we don't want people to have to worry about SQL.
Quick demo in Looker
(Show the live SQL querywriting.)
How they do it
The technique is to do a sum(distinct something)
where the "something"
for visits is the same for each combination of customers.id
and visits
.
That way when we do sum(distinct something)
there will be only one value
for the combination of customers.id = 1
and visits = 2
(see the table
above that joins customers
and orders
). We can then take this sum and
subtract the "something" applied only to the customer id, and the remainder
will be the total of visits from the customers table. I am going to call the calculation on the
customer id id_offset
.
Another way to think about this: The aggregate is being calculated not against all of the values in the column (which can have duplicates because of the join), but against the distinct primary key from the source table.
Alright, let's try a naive solution. We'll say that something
is
(customers_id + 100) + visits
. In other words,
select
customers.id as customer_id,
customers.first_name,
customers.last_name,
customers.visits,
(customers.id + 100) as id_offset,
(customers.id + 100) + customers.visits as something,
orders.id as order_id,
orders.amount
from customers left join orders on orders.customer_id = customers.id;
(http://localhost:9999/sql/jvctwzbqnkywjr)
Now we can get a correct sum of the visits column like so:
select sum(distinct something)  sum(distinct id_offset) as "customers total visits"
from (
select
customers.id as customer_id,
customers.first_name,
customers.last_name,
customers.visits,
(customers.id + 100) as id_offset,
(customers.id + 100) + customers.visits as something,
orders.id as order_id,
orders.amount
from customers left join orders on orders.customer_id = customers.id
) as orig;
 returns 8 (correct!)
(http://localhost:9999/sql/bbctxqmdvwnv2z)
Right? (103 + 104 + 107)  (101 + 102 + 103) = 8
.
That's the concept in a nutshell. However, the selection of the function
is critical. Here's an example where the column data doesn't work with
the + 100
function: Customer/visit combinations must have a value for something
that is unique.
So if we had two customers like this (id: 1, visits: 2; id: 2, visits: 1),
with three total visits we would get
customer id  visits  id_offset  something 

1  2  101  103 
2  1  102  103 
Here, sum(distinct something)
is 103, but sum(distinct id_offset)
is
203, and we would produce a total for visits of 100. Oops. So we are screwed. We get a bit of improvement with (customers.id * 100):
customer id  visits  id_offset  something 

1  2  100  102 
2  1  200  201 
Now sum(distinct something)
= 303, and sum(dstinct id_offset)
= 300, do
303  300 = 3, and we're good again.
Therefore, generalizing the function would help. Here's how Looker does it:
COALESCE(COALESCE( (
SUM(DISTINCT (CAST(FLOOR(COALESCE(customers.visits ,0)*(1000000*1.0)) AS DECIMAL(65,0)))
+ ('x'  MD5(customers.id::varchar))::bit(64)::bigint::DECIMAL(65,0) * 18446744073709551616
+ ('x'  SUBSTR(MD5(customers.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0) )
 SUM(DISTINCT ('x'  MD5(customers.id::varchar))::bit(64)::bigint::DECIMAL(65,0) * 18446744073709551616
+ ('x'  SUBSTR(MD5(customers.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0)) )
/ (1000000*1.0), 0), 0)
AS "customers.total_visits"
You should be able to see the skeleton of the subtraction of what I have
called the sum(dstinct id_offset)
from the sum(distinct something)
.
The meat of this is in the use of the MD5 function to get a 128 bit hash value. What they are doing is getting a decimal number out of the hash. Here are a few queries that will build up to the actual Looker query and start to make this clear.
select md5(1::varchar);
 the basic hash: c4ca4238a0b923820dcc509a6f75849b
select 'x'  md5(1::varchar);
 makes something that can be cast to a bit(64): xc4ca4238a0b923820dcc509a6f75849b
select ('x'  md5(1::varchar))::bit(64);
 cast to binary: 1100010011001010010000100011100010100000101110010010001110000010
select ('x'  md5(1::varchar))::bit(64)::bigint;
 cast to 8byte bigint: 4266524885998034046
select ('x'  md5(1::varchar))::bit(64)::bigint::decimal(65,0);
 cast to exact precision decimal: 4266524885998034046
Some references to the PostgreSQL documentation: casting; bit string constants expressed as hexadecimal and the leading "x"; bit string type; numeric/decimal type and arbitrary precious arithmetic.
Remember that the 128 bit md5 was cast to an 8byte bigint, which happens to be signed. The range of this int is 9223372036854775808 to +9223372036854775807; there are 18446744073709551616 values in this range. The multiplication is essentially moving the digits of the md5 calculation.
select ('x'  md5(1::varchar))::bit(64)::bigint::decimal(65,0) * 18446744073709551616;
 Expand the space: 78703492656118554855266830188162318336
Now conduct the same computation on just the first 17 characters (I guess this is to avoid md5 collisions):
select ('x'  substr(md5(1::varchar), 17))::bit(64)::bigint::decimal(65,0);
 produces: 994258241967195291
And add these together:
select
('x'  md5(1::varchar))::bit(64)::bigint::decimal(65,0) * 18446744073709551616
+ ('x'  substr(md5(1::varchar), 17))::bit(64)::bigint::decimal(65,0);
 produces: 78703492656118554854272571946195123045
The same idea is done for the subtrahend.
The last piece of the puzzle is the multiplication times 1000000 and division by 1000000. This is done to preserve fractional value in integer arithmetic.
Why additional addition of part of the hash?
As I say, probably they are trying to avoid md5 collisions. Seems to work pretty well without it. Here I'm commenting it out. Same result.
SELECT
(
SUM(DISTINCT
(
CAST(FLOOR(customers.visits * (1000000*1.0)) AS DECIMAL(65,0)))
+ ('x'  MD5(customers.id ::varchar))::bit(64)::bigint::DECIMAL(65,0) *18446744073709551616
 + ('x'  SUBSTR(MD5(customers.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0)
)

SUM(DISTINCT
(
'x'  MD5(customers.id ::varchar))::bit(64)::bigint::DECIMAL(65,0) *18446744073709551616
 + ('x'  SUBSTR(MD5(customers.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0)
)
)
/ (1000000*1.0)
AS "customers.total_visits"
FROM public.customers AS customers
LEFT JOIN public.orders AS orders ON customers.id = orders.customer_id