symmetric

Note added 15-Sep-2017

I talked to Lloyd Tabb at the Looker Join conference and asked him how he came up with this strategy: A colleague asked him whether he knew that you could do a sum(distinct . . .). Lloyd did not know, but when told, this idea came to him very quickly.

It has been submitted as a patent.

Looker and Symmetric Aggregates (the 1:many "fanout" problem with aggregates)

I was looking at some Looker-generated SQL and saw this:

SELECT
	COALESCE(SUM(customers.visits ), 0) AS "customers.total_visits"
FROM public.customers  AS customers

Simple. Then I added an aggregate from a joined view (a "view" in Looker is a thin façade on a table) and I saw this:

SELECT
	COALESCE(COALESCE( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(customers.visits ,0)*(1000000*1.0)) AS DECIMAL(65,0))) + ('x' || MD5(customers.id ::varchar))::bit(64)::bigint::DECIMAL(65,0)  *18446744073709551616 + ('x' || SUBSTR(MD5(customers.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0) ) - SUM(DISTINCT ('x' || MD5(customers.id ::varchar))::bit(64)::bigint::DECIMAL(65,0)  *18446744073709551616 + ('x' || SUBSTR(MD5(customers.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0)) )  / (1000000*1.0), 0), 0) AS "customers.total_visits",
	COALESCE(COALESCE( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(orders.amount ,0)*(1000000*1.0)) AS DECIMAL(65,0))) + ('x' || MD5(orders.id ::varchar))::bit(64)::bigint::DECIMAL(65,0)  *18446744073709551616 + ('x' || SUBSTR(MD5(orders.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0) ) - SUM(DISTINCT ('x' || MD5(orders.id ::varchar))::bit(64)::bigint::DECIMAL(65,0)  *18446744073709551616 + ('x' || SUBSTR(MD5(orders.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0)) )  / (1000000*1.0), 0), 0) AS "orders.total_amount"
FROM public.customers  AS customers
LEFT JOIN public.orders  AS orders ON customers.id = orders.customer_id

What are they doing? Looker [rightly, I think] brags about this and calls it "symmetric aggregates" (and here is the business case), but they don't explain how it works. I conducted some searches and it would seem that Looker's strategy here is unique and a genuine innovation.

Sample database

drop table if exists customers;
drop table if exists orders;

create table customers (id int, first_name varchar, last_name varchar, visits int);
create table orders (id int, amount numeric(6,2), customer_id int);

insert into customers (id, first_name, last_name, visits) values (1, 'Amelia',  'Earhart',  2);
insert into customers (id, first_name, last_name, visits) values (2, 'Charles', 'Lindberg', 2);
insert into customers (id, first_name, last_name, visits) values (3, 'Wilbur',  'Wright',   4);

insert into orders (id, amount, customer_id) values (1,  25.00, 1);
insert into orders (id, amount, customer_id) values (2,  50.00, 1);
insert into orders (id, amount, customer_id) values (3,  75.00, 2);
insert into orders (id, amount, customer_id) values (4, 100.00, 3);

How many visits are there?

select sum(visits) from customers;
-- returns 8

Yay! What is the average length of a visit?

select avg(visits) from customers;
-- returns 2.66

And now let's look at the classic "surprise" for beginning users of SQL: Join in a column, and . . .

select avg(visits) from customers left join orders on orders.customer_id = customers.id;
-- returns 2.5
-- oops!

Let's add some more columns, remove the aggregate, and look at the data:

select
customers.id as customer_id,
customers.first_name,
customers.last_name,
customers.visits,
orders.id as order_id,
orders.amount
from customers left join orders on orders.customer_id = customers.id

(http://localhost:9999/sql/mx9cgdrvfhwsdc)

That produces:

customer_id	first_name	last_name	visits	order_id	amount
1	Amelia	Earhart	2	1	25.00
1	Amelia	Earhart	2	2	50.00
2	Charles	Lindberg	2	3	75.00
3	Wilbur	Wright	4	4	100.00

Clear enough. We have "extra" values in the visits column because of the 1:many ("fanout") relationship between customers and orders.

If we want the counts of customers and the counts of amounts, we can just use count(distinct ...):

select
count(distinct customers.id) as "customers count",
count(distinct orders.id) as "orders count"
from customers left join orders on orders.customer_id = customers.id;

Works great! However, sum is problematic.

select
sum(customers.visits) as "customers total visits",
sum(orders.amount) as "orders total amount"
from customers left join orders on orders.customer_id = customers.id;

(http://localhost:9999/sql/2brb8hdmwdbhtx)

Result:

customers total visits	orders total amount
10	250.00

Clearly this is going to seem "wrong" to people who aren't comfortable with joins. The classic ways to fix this: with a subquery or creative self-joins or window functions. However, these solutions can introduces as many problems as they solve (subqueries and window functions aren't as easily composable as the Looker solution), especially for an analytic tool where we don't want people to have to worry about SQL.

Quick demo in Looker

(Show the live SQL query-writing.)

How they do it

The technique is to do a sum(distinct something) where the "something" for visits is the same for each combination of customers.id and visits. That way when we do sum(distinct something) there will be only one value for the combination of customers.id = 1 and visits = 2 (see the table above that joins customers and orders). We can then take this sum and subtract the "something" applied only to the customer id, and the remainder will be the total of visits from the customers table. I am going to call the calculation on the customer id id_offset.

Another way to think about this: The aggregate is being calculated not against all of the values in the column (which can have duplicates because of the join), but against the distinct primary key from the source table.

Alright, let's try a naive solution. We'll say that something is (customers_id + 100) + visits. In other words,

select
customers.id as customer_id,
customers.first_name,
customers.last_name,
customers.visits,
(customers.id + 100) as id_offset,
(customers.id + 100) + customers.visits as something,
orders.id as order_id,
orders.amount
from customers left join orders on orders.customer_id = customers.id;

(http://localhost:9999/sql/jvctwzbqnkywjr)

Now we can get a correct sum of the visits column like so:

select sum(distinct something) - sum(distinct id_offset) as "customers total visits"
from (
  select
  customers.id as customer_id,
  customers.first_name,
  customers.last_name,
  customers.visits,
  (customers.id + 100) as id_offset,
  (customers.id + 100) + customers.visits as something,
  orders.id as order_id,
  orders.amount
from customers left join orders on orders.customer_id = customers.id
) as orig;
-- returns 8 (correct!)

(http://localhost:9999/sql/bbctxqmdvwnv2z)

Right? (103 + 104 + 107) - (101 + 102 + 103) = 8.

That's the concept in a nutshell. However, the selection of the function is critical. Here's an example where the column data doesn't work with the + 100 function: Customer/visit combinations must have a value for something that is unique. So if we had two customers like this (id: 1, visits: 2; id: 2, visits: 1), with three total visits we would get

customer id	visits	id_offset	something
1	2	101	103
2	1	102	103

Here, sum(distinct something) is 103, but sum(distinct id_offset) is 203, and we would produce a total for visits of 100. Oops. So we are screwed. We get a bit of improvement with (customers.id * 100):

customer id	visits	id_offset	something
1	2	100	102
2	1	200	201

Now sum(distinct something) = 303, and sum(dstinct id_offset) = 300, do 303 - 300 = 3, and we're good again.

Therefore, generalizing the function would help. Here's how Looker does it:

COALESCE(COALESCE( (
  SUM(DISTINCT (CAST(FLOOR(COALESCE(customers.visits ,0)*(1000000*1.0)) AS DECIMAL(65,0)))
  + ('x' || MD5(customers.id::varchar))::bit(64)::bigint::DECIMAL(65,0) * 18446744073709551616
  + ('x' || SUBSTR(MD5(customers.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0) )
  - SUM(DISTINCT ('x' || MD5(customers.id::varchar))::bit(64)::bigint::DECIMAL(65,0) * 18446744073709551616
  + ('x' || SUBSTR(MD5(customers.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0)) )
        / (1000000*1.0), 0), 0)
AS "customers.total_visits"

You should be able to see the skeleton of the subtraction of what I have called the sum(dstinct id_offset) from the sum(distinct something).

The meat of this is in the use of the MD5 function to get a 128 bit hash value. What they are doing is getting a decimal number out of the hash. Here are a few queries that will build up to the actual Looker query and start to make this clear.

select md5(1::varchar);
  -- the basic hash: c4ca4238a0b923820dcc509a6f75849b
select 'x' || md5(1::varchar);
  -- makes something that can be cast to a bit(64): xc4ca4238a0b923820dcc509a6f75849b
select ('x' || md5(1::varchar))::bit(64);
  -- cast to binary: 1100010011001010010000100011100010100000101110010010001110000010
select ('x' || md5(1::varchar))::bit(64)::bigint;
  -- cast to 8-byte bigint: -4266524885998034046
select ('x' || md5(1::varchar))::bit(64)::bigint::decimal(65,0);
  -- cast to exact precision decimal: -4266524885998034046

Some references to the PostgreSQL documentation: casting; bit string constants expressed as hexadecimal and the leading "x"; bit string type; numeric/decimal type and arbitrary precious arithmetic.

Remember that the 128 bit md5 was cast to an 8-byte bigint, which happens to be signed. The range of this int is -9223372036854775808 to +9223372036854775807; there are 18446744073709551616 values in this range. The multiplication is essentially moving the digits of the md5 calculation.

select ('x' || md5(1::varchar))::bit(64)::bigint::decimal(65,0) * 18446744073709551616;
  -- Expand the space: -78703492656118554855266830188162318336

Now conduct the same computation on just the first 17 characters (I guess this is to avoid md5 collisions):

select ('x' || substr(md5(1::varchar), 17))::bit(64)::bigint::decimal(65,0);
-- produces: 994258241967195291

And add these together:

select
  ('x' || md5(1::varchar))::bit(64)::bigint::decimal(65,0) * 18446744073709551616
+ ('x' || substr(md5(1::varchar), 17))::bit(64)::bigint::decimal(65,0);
-- produces: -78703492656118554854272571946195123045

The same idea is done for the subtrahend.

The last piece of the puzzle is the multiplication times 1000000 and division by 1000000. This is done to preserve fractional value in integer arithmetic.

Why additional addition of part of the hash?

As I say, probably they are trying to avoid md5 collisions. Seems to work pretty well without it. Here I'm commenting it out. Same result.

SELECT
(
SUM(DISTINCT
  (
      CAST(FLOOR(customers.visits * (1000000*1.0)) AS DECIMAL(65,0)))
    + ('x' || MD5(customers.id ::varchar))::bit(64)::bigint::DECIMAL(65,0)  *18446744073709551616
--  + ('x' || SUBSTR(MD5(customers.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0)
  )
-
SUM(DISTINCT
  (
      'x' || MD5(customers.id ::varchar))::bit(64)::bigint::DECIMAL(65,0)  *18446744073709551616
-- + ('x' || SUBSTR(MD5(customers.id ::varchar),17))::bit(64)::bigint::DECIMAL(65,0)
  )

)

/ (1000000*1.0)

AS "customers.total_visits"
FROM public.customers  AS customers
LEFT JOIN public.orders  AS orders ON customers.id = orders.customer_id