SQL partitions (come up with a better title)

SQL partitions (come up with a better title)

I built a feed generator for myself, mostly as a side project to procrastinate from another side project.


General overview

The implementation for getting entries is sorta simple:

But that's the easy part. The general maintainence/operations of the script shouldn't allow me to use up all my disk space. Not that it's likely to happen, but I really don't need to keep data around for longer than I need. I only want to keep SOME_NUMBER of posts per source, and only if they're recently seen.

The SQL query I came up with to prune is:

DELETE FROM posts
WHERE (source, title, link) NOT IN (
    WITH by_source AS (
        SELECT
            row_number() OVER (
                PARTITION BY source
                ORDER BY seen DESC
            ) AS rn,
            source, title, link
        FROM posts
    )
    SELECT source, title, link
    FROM by_source
    WHERE rn <= :max_posts_per_source;
);

Now, I have no fucking clue what that means. I wrote it at ~3am. So, let me try breaking it down (mostly for myself, as a way to retain whatever knowledge I once had at ~3am).


Analysis

First, let's do some basic setup. I'll provide some example files if you'd like to follow along.

We'll need a seeded database:

; sqlite3 state.db
-- Loading resources from /home/user/.sqliterc
SQLite version 3.40.1 2022-12-28 14:03:47
Enter ".help" for usage hints.
sqlite> .read setup.sql
sqlite> .read seed.sql
sqlite> .schema
CREATE TABLE posts (
    source TEXT,
    title TEXT NOT NULL,
    link TEXT,
    seen TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (title, link, source) ON CONFLICT IGNORE,
    CHECK (seen > datetime(CURRENT_TIMESTAMP, '-1 year'))
);
sqlite> select * from posts limit 3;
source    title    link                   seen
--------  -------  ---------------------  -------------------
bheesham  no atom  bheesham/blog/no atom  2023-07-20 00:26:20
simon     goasm    simon/goasm            2023-07-20 00:26:20
axo       oranda   axo/oranda             2023-07-20 00:26:20

Nice. So, the stage is set. To reset we'd just need to re-read the .sql files. Now we can start breaking that prune query up.

What's the explanation of the SELECT bits? Let's start with the easier to understand EXPLAIN QUERY PLAN (docs).

sqlite> EXPLAIN QUERY PLAN WITH by_source AS (
    SELECT
        row_number() OVER (
            PARTITION BY source
            ORDER BY seen DESC
        ) AS rn,
        source, title, link, seen
    FROM posts
)
SELECT source, title, link, seen
FROM by_source
WHERE rn <= 3
ORDER BY source, seen DESC          -- "source, " added in for my sanity
   ...> ;
QUERY PLAN
|--CO-ROUTINE by_source
|  |--CO-ROUTINE (subquery-3)
|  |  |--SCAN posts
|  |  `--USE TEMP B-TREE FOR ORDER BY
|  `--SCAN (subquery-3)
|--SCAN by_source
`--USE TEMP B-TREE FOR ORDER BY

So I guess in plain English, what's happening is:

  1. we create a co-routine, by_source
  2. which creates another co-routine, subquery-3, for the partition
  3. which sorts by order, and I guess here's where the magic happens
  4. then we scan co-routine subquery-3 -- what does this mean though? why are we scanning here? is this to prepare a result set?
  5. and then in the main SELECT we scan by_source, which should filter for
    rn <= 3
  6. which then sorts

The more detailed EXPLAIN (docs) shows:

addr  opcode         p1    p2    p3    p4             p5  comment      
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     105   0                    0   Start at 105
1     InitCoroutine  1     83    2                    0   by_source
2     Null           0     2     0                    0   r[2]=NULL
3     InitCoroutine  4     28    4                    0   (subquery-3)
4     SorterOpen     7     9     0     k(2,B,-B)      0   
5     OpenRead       1     2     0     4              0   root=2 iDb=0; posts
6     Rewind         1     16    0                    0   
7       Column         1     0     7                    0   r[7]= cursor 1 column 0
8       Column         1     1     8                    0   r[8]= cursor 1 column 1
9       Column         1     2     9                    0   r[9]= cursor 1 column 2
10      Column         1     3     10                   0   r[10]=posts.seen
11      Column         1     0     5                    0   r[5]= cursor 1 column 0
12      Column         1     3     6                    0   r[6]=posts.seen
13      MakeRecord     5     6     13                   0   r[13]=mkrec(r[5..10])
14      SorterInsert   7     13    5     6              0   key=r[13]
15    Next           1     7     0                    1   
16    OpenPseudo     8     14    9                    0   9 columns in r[14]
17    SorterSort     7     27    0                    0   
18      SorterData     7     14    8                    0   r[14]=data
19      Column         8     1     12                   0   r[12]= cursor 8 column 1; 
20      Column         8     0     11                   0   r[11]= cursor 8 column 0; 
21      Column         8     5     10                   0   r[10]= cursor 8 column 5; 
22      Column         8     4     9                    0   r[9]= cursor 8 column 4; 
23      Column         8     3     8                    0   r[8]= cursor 8 column 3; 
24      Column         8     2     7                    0   r[7]= cursor 8 column 2; 
25      Yield          4     0     0                    0   
26    SorterNext     7     18    0                    0   
27    EndCoroutine   4     0     0                    0   
28    OpenEphemeral  2     6     0                    0   nColumn=6
29    OpenDup        3     2     0                    0   
30    OpenDup        4     2     0                    0   
31    OpenDup        5     2     0                    0   
32    Null           0     15    15                   0   r[15..15]=NULL
33    Integer        1     16    0                    0   r[16]=1
34    InitCoroutine  4     0     4                    0   
35      Yield          4     61    0                    0   next row of 
36      Copy           7     18    0                    2   r[18]=r[7]
37      Copy           8     19    0                    2   r[19]=r[8]
38      Copy           9     20    0                    2   r[20]=r[9]
39      Copy           10    21    0                    2   r[21]=r[10]
40      Copy           11    22    0                    2   r[22]=r[11]
41      Copy           12    23    0                    2   r[23]=r[12]
42      MakeRecord     18    6     24                   0   r[24]=mkrec(r[18..23])
43      Compare        22    15    1     k(1,B)         0   r[22] <-> r[15]
44      Jump           45    47    45                   0   
45      Gosub          26    62    0                    0   call flush_partition
46      Copy           22    15    0                    0   r[15]=r[22]
47      NewRowid       3     25    0                    0   r[25]=rowid
48      Insert         3     24    25                   0   intkey=r[25] data=r[24]
49      Ne             16    54    25                   0   if r[25]!=r[16] goto 54
50      Null           0     2     0                    0   r[2]=NULL
51      Rewind         2     1     0                    0   
52      Rewind         5     1     0                    0   
53      Goto           0     60    0                    0   
54      AggStep        0     27    2     row_number(0)  0   accum=r[2] step(r[27])
55      Next           5     56    0                    0   
56      AggValue       2     0     3     row_number(0)  0   r[3]=value N=0
57      Gosub          17    74    0                    0   
58      Delete         2     0     0                    2   
59      Next           2     60    0                    0   
60    Goto           0     35    0                    0   
61    Integer        72    26    0                    0   r[26]=72
62    Rewind         3     71    0                    0   
63    AggStep        0     27    2     row_number(0)  0   accum=r[2] step(r[27])
64    Next           5     65    0                    0   
65    AggValue       2     0     3     row_number(0)  0   r[3]=value N=0
66    Gosub          17    74    0                    0   
67    Delete         2     0     0                    2   
68    Next           2     70    0                    0   
69    Goto           0     71    0                    0   
70    Goto           0     65    0                    0   
71    ResetSorter    2     0     0                    0   
72    Return         26    0     0                    0   
73    Goto           0     82    0                    0   
74    Noop           0     0     0                    0   inner-loop subroutine
75    Copy           3     27    0                    0   r[27]=r[3]
76    Column         2     0     28                   0   r[28]= cursor 2 column 0
77    Column         2     1     29                   0   r[29]= cursor 2 column 1
78    Column         2     2     30                   0   r[30]= cursor 2 column 2
79    Column         2     3     31                   0   r[31]= cursor 2 column 3
80    Yield          1     0     0                    0   
81    Return         17    0     0                    0   end inner-loop subroutine
82    EndCoroutine   1     0     0                    0   
83    SorterOpen     9     7     0     k(2,B,-B)      0   
84    InitCoroutine  1     0     2                    0   
85      Yield          1     95    0                    0   next row of by_source
86      Copy           27    32    0                    2   r[32]=r[27]
87      Gt             33    94    32    BINARY-8       80  if r[32]>r[33] goto 94
88      Copy           29    36    0                    2   r[36]=r[29]
89      Copy           30    37    0                    2   r[37]=r[30]
90      Copy           28    34    0                    2   r[34]=r[28]
91      Copy           31    35    0                    2   r[35]=r[31]
92      MakeRecord     34    4     40                   0   r[40]=mkrec(r[34..37])
93      SorterInsert   9     40    34    4              0   key=r[40]
94    Goto           0     85    0                    0   
95    OpenPseudo     10    41    7                    0   7 columns in r[41]
96    SorterSort     9     104   0                    0   
97      SorterData     9     41    10                   0   r[41]=data
98      Column         10    1     39                   0   r[39]=seen
99      Column         10    3     38                   0   r[38]=link
100     Column         10    2     37                   0   r[37]=title
101     Column         10    0     36                   0   r[36]=source
102     ResultRow      36    4     0                    0   output=r[36..39]
103   SorterNext     9     97    0                    0   
104   Halt           0     0     0                    0   
105   Transaction    0     0     1     0              1   usesStmtJournal=0
106   Integer        3     33    0                    0   r[33]=3
107   Goto           0     1     0                    0