06 Apr 2009

Building your own Aggregate Functions with PostGreSQL

Nick Burch

Creating custom PostGreSQL aggregate functions for fun and profit

Aggregation

In SQL, aggregate functions are powerful things that act upon a number of rows, and output a single value from them. With PostGreSQL 8.x, it's now possible to easily define your own aggregate functions, to aggreate your data however you need. One use of this is to generate powerful reports very easily.

Let us think of a very simple (if slightly stilted) database:

CREATE SEQUENCE s_types;
CREATE TABLE types (
  id INTEGER NOT NULL DEFAULT NEXTVAL('s_types'),
  name VARCHAR(16) NOT NULL,
  CONSTRAINT pk_types PRIMARY KEY(id)
);
INSERT INTO types (name) VALUES ('coding');
INSERT INTO types (name) VALUES ('design');
INSERT INTO types (name) VALUES ('sysadmin');
CREATE SEQUENCE s_customers;
CREATE TABLE customers (
  id INTEGER NOT NULL DEFAULT NEXTVAL('s_customers'),
  name VARCHAR(255) NOT NULL,
  CONSTRAINT pk_customers PRIMARY KEY(id) 
);
CREATE TABLE sales (
  customer_id INTEGER NOT NULL,
  sale_date timestamp NOT NULL DEFAULT now(),
  type_id INTEGER NOT NULL,
  CONSTRAINT fk_sales_customer FOREIGN KEY ("customer_id")
    REFERENCES "customers" ("id"),
  CONSTRAINT fk_sales_type FOREIGN KEY ("type_id")
    REFERENCES "types" ("id")
);
CREATE INDEX sales_customer ON sales (customer_id);

All goes well with our database for a while. Someone asks for a report of the first and last sale for each customer, and we're able to produce that directly in SQL, with just one result row per customer.

Then they ask for something fancier - they don't just want to know the first and last sale date for every customer, they also want the list of sale types too. The snag is they only want one line per customer, and ideally we don't want to have to do all the heavy lifting work in our own code.

The first and last sale dates are easy:

SELECT id, name, MIN(sale_date) AS first_sale, MAX(sale_date) AS last_sale
FROM customers
INNER JOIN sales ON (id = customer_id)
GROUP BY id,name
 id | name     | first_sale          | last_sale           
----+----------+---------------------+---------------------
1 | Torchbox | 2007-01-03 10:02:11 | 2007-10-06 09:44:21

 

But we also want an aggregate function that'll give us the list of sale types. We have a look through the list of standard postgresql aggregate functions, but we don't find anything that'll effectively turn a sequence of strings into a list or an array. After a quick google, we can't find a list aggregate or array aggregate function anywhere.

Luckily, it's possible to define your own PostGreSQL aggregate functions, without needing to resort to coding them in c (which is possible, by the way, just requires some work). We can ask PostGreSQL to build us a new aggregate function, simply by gluing together normal functions to act on the data as it goes past

For our use case, we want something that takes a sequence of strings (but possibly also numbers for a similar use), and returns us an array of the distinct values at the end. The distinct bit is easy - just do func(DISTINCT val) rather than just func(val) and PostGreSQL will do it for us. For the rest, we'll want a custom aggregate function that starts with a new array, and does an array append with new values. Say, something like:

CREATE AGGREGATE to_array (
  sfunc = array_append,
  basetype = anyelement,
  stype = anyarray,
  initcond = '{}'
)

We can now get our list of customers, their first and last sale, and all their sale types, all with one row per customer, all in sql:

SELECT customers.id,
customers.name,
MIN(sale_date) AS first_sale,
MAX(sale_date) AS last_sale,
to_array(DISTINCT types.name) AS sale_types
FROM customers
INNER JOIN sales ON (customers.id = customer_id)
INNER JOIN types ON (type_id = types.id)
GROUP BY customers.id,customers.name
 id | name     | first_sale          | last_sale           | sale_types       
----+----------+---------------------+---------------------+------------------
1 | Torchbox | 2007-01-03 10:02:11 | 2007-10-06 09:44:21 | {sysadmin,coding}

If we wanted to exclude null entries from going into our array, then we'd use a similar new aggregate function, with a custom function to power it:

CREATE FUNCTION array_append_not_null(anyarray,anyelement)
RETURNS anyarray
  AS 'SELECT CASE WHEN $2 IS NULL THEN $1 ELSE array_append($1,$2) END '
LANGUAGE SQL IMMUTABLE RETURNS NULL ON NULL INPUT;
CREATE AGGREGATE to_array_not_null (
  sfunc = array_append_not_null,
  basetype = anyelement,
  stype = anyarray,
  initcond = '{}'
)

There are lots of other possibilities too, and PostGreSQL provides loads of handy functions to build up your new aggregate with. If we start using our own functions to build up new aggregates, then the possibilities are almost endless!

We like...


typechart

Typechart

Compare the rendering of your favourite fonts on Windows and Mac and grab the CSS while you're there.


Every block

EveryBlock

Civic information, photos and fun. A news feed for your neighbourhood (if you live in one of the 9 US cities currently covered).


dopplr

Dopplr's Personal Annual Report

Super smart visualization of Barack Obama's travels in 2008.