I built a feed generator for myself, mostly as a side project to procrastinate from another side project.
The implementation for getting entries is sorta simple:
link
tagsSOME_NUMBER
of entries into a databaseBut that's the easy part. The general maintainence/operations of the script
shouldn't allow me to use up all my disk space. Not that it's likely to happen,
but I really don't need to keep data around for longer than I need. I only want
to keep SOME_NUMBER
of posts per source, and only if they're
recently seen.
The SQL query I came up with to prune is:
DELETE FROM posts WHERE (source, title, link) NOT IN ( WITH by_source AS ( SELECT row_number() OVER ( PARTITION BY source ORDER BY seen DESC ) AS rn, source, title, link FROM posts ) SELECT source, title, link FROM by_source WHERE rn <= :max_posts_per_source; );
Now, I have no fucking clue what that means. I wrote it at ~3am. So, let me try breaking it down (mostly for myself, as a way to retain whatever knowledge I once had at ~3am).
First, let's do some basic setup. I'll provide some example files if you'd like to follow along.
We'll need a seeded database:
; sqlite3 state.db -- Loading resources from /home/user/.sqliterc SQLite version 3.40.1 2022-12-28 14:03:47 Enter ".help" for usage hints. sqlite> .read setup.sql sqlite> .read seed.sql sqlite> .schema CREATE TABLE posts ( source TEXT, title TEXT NOT NULL, link TEXT, seen TEXT DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (title, link, source) ON CONFLICT IGNORE, CHECK (seen > datetime(CURRENT_TIMESTAMP, '-1 year')) ); sqlite> select * from posts limit 3; source title link seen -------- ------- --------------------- ------------------- bheesham no atom bheesham/blog/no atom 2023-07-20 00:26:20 simon goasm simon/goasm 2023-07-20 00:26:20 axo oranda axo/oranda 2023-07-20 00:26:20
Nice. So, the stage is set. To reset we'd just need to re-read
the .sql
files. Now we can start breaking that prune query up.
What's the explanation of the SELECT
bits? Let's start with the
easier to understand EXPLAIN QUERY PLAN
(docs).
sqlite> EXPLAIN QUERY PLAN WITH by_source AS (
SELECT
row_number() OVER (
PARTITION BY source
ORDER BY seen DESC
) AS rn,
source, title, link, seen
FROM posts
)
SELECT source, title, link, seen
FROM by_source
WHERE rn <= 3
ORDER BY source, seen DESC -- "source, " added in for my sanity
...> ;
QUERY PLAN
|--CO-ROUTINE by_source
| |--CO-ROUTINE (subquery-3)
| | |--SCAN posts
| | `--USE TEMP B-TREE FOR ORDER BY
| `--SCAN (subquery-3)
|--SCAN by_source
`--USE TEMP B-TREE FOR ORDER BY
So I guess in plain English, what's happening is:
by_source
subquery-3
, for the
partitionsubquery-3
-- what does this mean
though? why are we scanning here? is this to prepare a result set?SELECT
we scan
by_source
, which should filter for rn <= 3
The more detailed EXPLAIN
(docs) shows:
addr opcode p1 p2 p3 p4 p5 comment ---- ------------- ---- ---- ---- ------------- -- ------------- 0 Init 0 105 0 0 Start at 105 1 InitCoroutine 1 83 2 0 by_source 2 Null 0 2 0 0 r[2]=NULL 3 InitCoroutine 4 28 4 0 (subquery-3) 4 SorterOpen 7 9 0 k(2,B,-B) 0 5 OpenRead 1 2 0 4 0 root=2 iDb=0; posts 6 Rewind 1 16 0 0 7 Column 1 0 7 0 r[7]= cursor 1 column 0 8 Column 1 1 8 0 r[8]= cursor 1 column 1 9 Column 1 2 9 0 r[9]= cursor 1 column 2 10 Column 1 3 10 0 r[10]=posts.seen 11 Column 1 0 5 0 r[5]= cursor 1 column 0 12 Column 1 3 6 0 r[6]=posts.seen 13 MakeRecord 5 6 13 0 r[13]=mkrec(r[5..10]) 14 SorterInsert 7 13 5 6 0 key=r[13] 15 Next 1 7 0 1 16 OpenPseudo 8 14 9 0 9 columns in r[14] 17 SorterSort 7 27 0 0 18 SorterData 7 14 8 0 r[14]=data 19 Column 8 1 12 0 r[12]= cursor 8 column 1; 20 Column 8 0 11 0 r[11]= cursor 8 column 0; 21 Column 8 5 10 0 r[10]= cursor 8 column 5; 22 Column 8 4 9 0 r[9]= cursor 8 column 4; 23 Column 8 3 8 0 r[8]= cursor 8 column 3; 24 Column 8 2 7 0 r[7]= cursor 8 column 2; 25 Yield 4 0 0 0 26 SorterNext 7 18 0 0 27 EndCoroutine 4 0 0 0 28 OpenEphemeral 2 6 0 0 nColumn=6 29 OpenDup 3 2 0 0 30 OpenDup 4 2 0 0 31 OpenDup 5 2 0 0 32 Null 0 15 15 0 r[15..15]=NULL 33 Integer 1 16 0 0 r[16]=1 34 InitCoroutine 4 0 4 0 35 Yield 4 61 0 0 next row of 36 Copy 7 18 0 2 r[18]=r[7] 37 Copy 8 19 0 2 r[19]=r[8] 38 Copy 9 20 0 2 r[20]=r[9] 39 Copy 10 21 0 2 r[21]=r[10] 40 Copy 11 22 0 2 r[22]=r[11] 41 Copy 12 23 0 2 r[23]=r[12] 42 MakeRecord 18 6 24 0 r[24]=mkrec(r[18..23]) 43 Compare 22 15 1 k(1,B) 0 r[22] <-> r[15] 44 Jump 45 47 45 0 45 Gosub 26 62 0 0 call flush_partition 46 Copy 22 15 0 0 r[15]=r[22] 47 NewRowid 3 25 0 0 r[25]=rowid 48 Insert 3 24 25 0 intkey=r[25] data=r[24] 49 Ne 16 54 25 0 if r[25]!=r[16] goto 54 50 Null 0 2 0 0 r[2]=NULL 51 Rewind 2 1 0 0 52 Rewind 5 1 0 0 53 Goto 0 60 0 0 54 AggStep 0 27 2 row_number(0) 0 accum=r[2] step(r[27]) 55 Next 5 56 0 0 56 AggValue 2 0 3 row_number(0) 0 r[3]=value N=0 57 Gosub 17 74 0 0 58 Delete 2 0 0 2 59 Next 2 60 0 0 60 Goto 0 35 0 0 61 Integer 72 26 0 0 r[26]=72 62 Rewind 3 71 0 0 63 AggStep 0 27 2 row_number(0) 0 accum=r[2] step(r[27]) 64 Next 5 65 0 0 65 AggValue 2 0 3 row_number(0) 0 r[3]=value N=0 66 Gosub 17 74 0 0 67 Delete 2 0 0 2 68 Next 2 70 0 0 69 Goto 0 71 0 0 70 Goto 0 65 0 0 71 ResetSorter 2 0 0 0 72 Return 26 0 0 0 73 Goto 0 82 0 0 74 Noop 0 0 0 0 inner-loop subroutine 75 Copy 3 27 0 0 r[27]=r[3] 76 Column 2 0 28 0 r[28]= cursor 2 column 0 77 Column 2 1 29 0 r[29]= cursor 2 column 1 78 Column 2 2 30 0 r[30]= cursor 2 column 2 79 Column 2 3 31 0 r[31]= cursor 2 column 3 80 Yield 1 0 0 0 81 Return 17 0 0 0 end inner-loop subroutine 82 EndCoroutine 1 0 0 0 83 SorterOpen 9 7 0 k(2,B,-B) 0 84 InitCoroutine 1 0 2 0 85 Yield 1 95 0 0 next row of by_source 86 Copy 27 32 0 2 r[32]=r[27] 87 Gt 33 94 32 BINARY-8 80 if r[32]>r[33] goto 94 88 Copy 29 36 0 2 r[36]=r[29] 89 Copy 30 37 0 2 r[37]=r[30] 90 Copy 28 34 0 2 r[34]=r[28] 91 Copy 31 35 0 2 r[35]=r[31] 92 MakeRecord 34 4 40 0 r[40]=mkrec(r[34..37]) 93 SorterInsert 9 40 34 4 0 key=r[40] 94 Goto 0 85 0 0 95 OpenPseudo 10 41 7 0 7 columns in r[41] 96 SorterSort 9 104 0 0 97 SorterData 9 41 10 0 r[41]=data 98 Column 10 1 39 0 r[39]=seen 99 Column 10 3 38 0 r[38]=link 100 Column 10 2 37 0 r[37]=title 101 Column 10 0 36 0 r[36]=source 102 ResultRow 36 4 0 0 output=r[36..39] 103 SorterNext 9 97 0 0 104 Halt 0 0 0 0 105 Transaction 0 0 1 0 1 usesStmtJournal=0 106 Integer 3 33 0 0 r[33]=3 107 Goto 0 1 0 0