3 Aggregates

8.12

3 Aggregates🔗ℹ

If you are going to follow along, be sure you

> (require plisqin-examples/adventure-works/schema)

3.1 Grouping🔗ℹ

To understand aggregates, you must first understand how group-by changes the shape of the result set. First, consider the following query. It will return every row in the Product table.

(from p Product)

If we filter out some rows using where clauses, every row in the result set still corresponds to exactly one Product.

(from p Product
(where (.> (ListPrice p)
(val 100))))

When we add a group-by clause, the shape of the result set is very different. In the following query, each row of the result set no longer corresponds to a single Product, but to a group of Products sharing the same Color.

> (aw:show-table
   (from p Product
         (group-by (Color p))
         (select (Color p))))
Show TableShow SQL
select p.Color as Color
from Product p
group by p.Color
Color
#<sql-null>
Black
Blue
Grey
Multi
Red
Silver
Silver/Black
White
Yellow

An Illegal Select

What would happen if you were to add (select (ListPrice p)) to the previous example? I call this an illegal select because it doesn’t make sense – there might be 10 yellow Products that all have different List Prices. PostgreSQL and MSSQL will raise an error if you try this. Unfortunately for my demonstration, SQLite tolerates illegal selects. In this example, it would choose a single ListPrice at random from the group. But I recommend that you avoid illegal selects.

Note that (select (Color p)) was not a illegal select. We grouped by Color, so each group will have exactly one Color by definition.

3.2 Traditional Aggregation🔗ℹ

It seems that group-by is mostly useless on its own. Once we add aggregates, things start to make sense. Plisqin provides the following aggregate operations: count, avg, min, max, sum.

If we have Products grouped by Color, it does not make sense to ask "What is the ListPrice of each group of Products?" But we can ask "What is the maximum ListPrice of each group of Products?"

> (aw:show-table
   (from p Product
         (group-by (Color p))
         (select (Color p))
         (select (count p))
         (select (min (ListPrice p)))
         (select (max (ListPrice p)))))
Show TableShow SQL
select p.Color as Color
  , count(*)
  , min(p.ListPrice)
  , max(p.ListPrice)
from Product p
group by p.Color
Color
count(*)
min(p.ListPrice)
max(p.ListPrice)
#<sql-null>
248
0.0
229.49
Black
93
0.0
3374.99
Blue
26
34.99
2384.07
Grey
1
125.0
125.0
Multi
8
8.99
89.99
Red
38
34.99
3578.27
Silver
43
0.0
3399.99
Silver/Black
7
40.49
80.99
White
4
8.99
9.5
Yellow
36
53.99
2384.07

The previous query is an example of traditional aggregation. Soon we will contrast traditional aggregation with another style.

Implicit Group Caveat

You can use aggregates on an ungrouped query. Doing so implicitly puts every item into a single group. The following example is aggregating over the group of all Products.

> (aw:show-table
   (from p Product
         (select (count p))
         (select (min (ListPrice p)))
         (select (max (ListPrice p)))))
Show TableShow SQL
select count(*)
  , min(p.ListPrice)
  , max(p.ListPrice)
from Product p
count(*)
min(p.ListPrice)
max(p.ListPrice)
504
0.0
3578.27

Attempting to add (select (ListPrice p)) to the previous example would be another illegal select. The grouping is implicit, but the reasoning is the same – the group of all Products does not have a single List Price.

3.3 Illegal vs Quasi-Legal Selects🔗ℹ

We previously saw the following example of traditional aggregation:

(from p Product
      (group-by (Color p))
      (select (Color p))
      (select (count p))
      (select (min (ListPrice p)))
      (select (max (ListPrice p))))

What if we wanted to modify this query to group by ProductCategory instead of by Color? Sticking with traditional aggregation, we can write the following query:

> (aw:show-table
   (from p Product
         (group-by (ProductCategoryID p))
         (select (ProductCategoryID p))
         (select (CategoryName p))
         (select (count p))
         (select (min (ListPrice p)))
         (select (max (ListPrice p)))))
Show TableShow SQL
select subcat.ProductCategoryID as ProductCategoryID
  , cat.Name as CategoryName
  , count(*)
  , min(p.ListPrice)
  , max(p.ListPrice)
from Product p
left join ProductSubcategory subcat
   on (p.ProductSubcategoryID is not null and (subcat.ProductSubcategoryID = p.ProductSubcategoryID))
left join ProductCategory cat
   on (subcat.ProductCategoryID is not null and (cat.ProductCategoryID = subcat.ProductCategoryID))
group by subcat.ProductCategoryID
ProductCategoryID
CategoryName
count(*)
min(p.ListPrice)
max(p.ListPrice)
#<sql-null>
#<sql-null>
209
0.0
196.92
1
Bikes
97
539.99
3578.27
2
Components
134
20.24
1431.5
3
Clothing
35
8.99
89.99
4
Accessories
29
2.29
159.0

In the previous query, (select (CategoryName p)) might look like an illegal select, but I am going to say that it is a quasi-legal select instead, and here is why:

We have already grouped by ProductCategoryID.
ProductCategoryID is a unique key of the Category table.
Therefore, for every column C of the Category table, each group contains exactly one value of C.
CategoryName is a column of the Category table.
Therefore, each group contains exactly one CategoryName.
Therefore, "What is the CategoryName of each group?" has an unambiguous answer, and it is not an illegal select.

I think I read that PostgreSQL now accepts some quasi-legal selects, but only when it can prove that they are not illegal, which is probably impossible to do with 100% accuracy.

None of this would be worth mentioning if every database permitted these quasi-legal selects, but they don’t. MSSQL rejects all quasi-legal selects, so we would have to modify our query. Adding (group-by (CategoryName p)) would be acceptable to MSSQL. But we are using SQLite which accepts all quasi-legal selects (and illegal selects too), so the query works and produces a reasonable result.

3.4 Grouped Join Aggregation🔗ℹ

We have seen a few examples of traditional aggregation, but now it is time to introduce grouped join aggregation. First, we will define a grouped join to mean "any join which contains a group-by clause." The following procedure returns a grouped join:

> (define/contract (Products-by-Category cat)
    (-> (instanceof ProductCategory) (instanceof Product))
    (join p Product #:to cat
          (join-type 'left)
          (group-by (ProductCategoryID p))
          (join-on (.= (?? (ProductCategoryID p) /void)
                       (ProductCategoryID cat)))))

When a grouped join is passed into an aggregate, the aggregate recognizes the grouped join and performs the aggregation over that group. This is called grouped join aggregation; here is an example:

According to my normal naming convention, group-of-products would be named productsG for brevity. The longer name is for clarity.

> (aw:show-table
   (from cat ProductCategory
         (select (ProductCategoryID cat))
         (select (CategoryName cat))
         (define group-of-products
           (Products-by-Category cat))
         (select (count group-of-products))
         (select (min (ListPrice group-of-products)))
         (select (max (ListPrice group-of-products)))))
Show TableShow SQL
select cat.ProductCategoryID as ProductCategoryID
  , cat.Name as CategoryName
  , p.__INJECT1
  , p.__INJECT2
  , p.__INJECT3
from ProductCategory cat
left join (
  select subcat.ProductCategoryID as __INJECT0
    , count(*) as __INJECT1
    , min(p.ListPrice) as __INJECT2
    , max(p.ListPrice) as __INJECT3
  from Product p
  left join ProductSubcategory subcat
     on (p.ProductSubcategoryID is not null and (subcat.ProductSubcategoryID = p.ProductSubcategoryID))
  group by subcat.ProductCategoryID
) p
   on (p.__INJECT0 is not null and (p.__INJECT0 = cat.ProductCategoryID))
ProductCategoryID
CategoryName
__INJECT1
__INJECT2
__INJECT3
1
Bikes
97
539.99
3578.27
2
Components
134
20.24
1431.5
3
Clothing
35
8.99
89.99
4
Accessories
29
2.29
159.0

Notice that grouped join aggregation gives us better composability than traditional aggregation thanks to separation of concerns. Specifically, the "grouping" and "aggregating" concerns are now separated. The Products-by-Category procedure says nothing about aggregation; it only encodes the relationship "a ProductCategory has a group of Products." This means we can reuse it in other queries with other aggregations. And each aggregate (such as max) says nothing about grouping, but it will accept any grouped join you give it.

Also notice that we no longer have the quasi-legal select that we did in the traditional aggregation version of this query. Each select is fully legal here. This means that grouped join aggregation is usually easier to use if you are using a database that does not allow quasi-legal selects.

Finally, notice how grouped join aggregation allows us to define an aggregate expression as a function of a single argument. The caller of this function sees MinListPrice as just another property of the ProductCategory table:

(define/contract (MinListPrice cat)
(-> (instanceof ProductCategory) Number?)
(min (ListPrice (Products-by-Category cat))))

3.5 Summary🔗ℹ

Grouped join aggregation is more composable than traditional aggregation, but it is also more limited. That is, grouped join aggregation can always be converted to traditional aggregation, but the reverse it not true. Let’s look at the first example of traditional aggregation again:

; Traditional aggregation, this works:
(from p Product
      (group-by (Color p))
      (select (Color p))
      (select (count p))
      (select (min (ListPrice p)))
      (select (max (ListPrice p))))
; Grouped join aggregation, this does not work
; because there is no Color table:
(from c Color
      (define productsG (Products-by-Color c))
      (select (ColorName c))
      (select (count p))
      (select (min (ListPrice p)))
      (select (max (ListPrice p))))

In the previous example, grouped join aggregation is impossible because the Color table was hypothetical; it does not exist. But we do have a ProductCategory table, which is why both styles are possible in the following example:

; traditional aggregation:
(from p Product
      (group-by (ProductCategoryID p))
      (select (ProductCategoryID p))
      ; A quasi-legal select; this won't work in some databases:
      (select (CategoryName p))
      (select (count p))
      (select (min (ListPrice p)))
      (select (max (ListPrice p))))
; grouped join aggregation:
(from cat ProductCategory
      ; All selects are fully legal here:
      (select (ProductCategoryID cat))
      (select (CategoryName cat))
      (define group-of-products
        (Products-by-Category cat))
      (select (count group-of-products))
      (select (min (ListPrice group-of-products)))
      (select (max (ListPrice group-of-products))))

In the previous example, the two styles will produce different result sets. The traditional aggregation will include one more row representing the group of Products that don’t belong to any ProductCategory. (If every Product belonged to a Category, there would be no difference.) In this situation, I recommend that you choose grouped join aggregation unless you need the following "null Category" to appear in your result set:

> (aw:show-table
   (from p Product
         (group-by (ProductCategoryID p))
         (select (ProductCategoryID p))
         (select (CategoryName p))
         (select (count p))
         (select (min (ListPrice p)))
         (select (max (ListPrice p)))))
Show TableShow SQL
select subcat.ProductCategoryID as ProductCategoryID
  , cat.Name as CategoryName
  , count(*)
  , min(p.ListPrice)
  , max(p.ListPrice)
from Product p
left join ProductSubcategory subcat
   on (p.ProductSubcategoryID is not null and (subcat.ProductSubcategoryID = p.ProductSubcategoryID))
left join ProductCategory cat
   on (subcat.ProductCategoryID is not null and (cat.ProductCategoryID = subcat.ProductCategoryID))
group by subcat.ProductCategoryID
ProductCategoryID
CategoryName
count(*)
min(p.ListPrice)
max(p.ListPrice)
#<sql-null>
#<sql-null>
209
0.0
196.92
1
Bikes
97
539.99
3578.27
2
Components
134
20.24
1431.5
3
Clothing
35
8.99
89.99
4
Accessories
29
2.29
159.0

1	Read Me First
2	Using define-schema
3	Aggregates
4	Nullability
5	Refactoring Recipes
6	Reference
7	Plisqin as a Research Language

3.1	Grouping
3.2	Traditional Aggregation
3.3	Illegal vs Quasi-Legal Selects
3.4	Grouped Join Aggregation
3.5	Summary