For all those little papers scattered across your desk
Day 2 of “Advent of Racket”
Many of the smartest people I know keep an external cortex or exobrain: notes, a personal wiki, or even a blog. Inspired by Cory Doctorow’s “memex method” and a RacketCon question, I’m writing again when the mood strikes—see the uptick in posts since the middle of this year.
The advantage of a memex or external digital cortex is several-fold. The act of setting my thoughts out for an audience helps me to elucidate what otherwise might be a 10-word bullet, filed away and forgotten about. For Cory Doctorow, it keeps information connected in a tangled web that eventually crystallizes or nucleates into a bigger form. (Sound familiar? I’ve written about how my brain often works that way, too.)
If you made it this far, you’re probably wondering what this has to do with scraping sitemaps… As Cory Doctorow writes, “systematically reviewing your older work” is “hugely beneficial.” Looking at the patterns (wrong and right) is a “useful process of introspection” that improves our ability to “spot and avoid” pitfalls.
Doctorow revisits “this day in history” on the major anniversaries:
For more than a decade, I’ve revisited “this day in history” from my own blogging archive, looking back one year, five years, ten years (and then, eventually, 15 years and 20 years). Every day, I roll back my blog archives to this day in years gone past, pull out the most interesting headlines and publish a quick blog post linking back to them.
This structured, daily work of looking back on where I’ve been is more valuable to helping me think about where I’m going than I can say.
This review idea fascinated me. While I don’t have the online tenure that Doctorow does, I do have some old writing. So the idea was born: add a program to my daily routine that shows me that writing.
The program should take a month and day (defaulting to the current) and show me
every post that I’ve written on that day. URLs are sufficient; I can click
them or pipe them to xargs -L1 open. I don’t need to worry about the year, at
least not yet. It would be an easy modification to add later though. Since I
publish an XML sitemap on my blog, we’ll scrape that rather than the raw HTML.
The most up-to-date version of the script will always be in my Dotfiles; here’s a permalink to the version at time of writing.
We start with a stanza to make this executable by the shell but written in
Racket (and we make sure to let Vim know what to do with it, since my
filetype-detection code for Racket doesn’t work with #! lines yet):
#! /usr/bin/env racket
#lang racket
; vim: ft=racket
Now we require a few libraries from the main distribution; that means this program should work with most non-minimal Racket installations without depending on other packages being installed:
(require xml
         xml/path
         net/http-client
         racket/date)
We need to know the month and day to use for our scraping; as mentioned, we’ll default to the current day but optionally parse values out of the command line:
(define now (current-date))
(define-values (month day)
  (command-line
    #:args ([month (~a (date-month now))] [day (~a (date-day now))])
    (unless (string->number month)
      (error "month should be numeric: " month))
    (unless (string->number day)
      (error "day should be numeric: " day))
    (values (~r (string->number month) #:min-width 2 #:pad-string "0")
            (~r (string->number day) #:min-width 2 #:pad-string "0"))))
The duplication is a bit bothersome, but in a ~40-line program I’m not concerned for the moment. It is important that we pad the dates to match my site’s URL format, which uses 2-digit months and days everywhere.
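If the duplication ever does start to bother me, one way to remove it is a small helper that validates and pads in one step. This is only a sketch: pad2 is a hypothetical name that does not appear in the script, and it is slightly stricter than the original check because it rejects inputs like 1.5 as well as non-numeric strings.

```racket
#lang racket

;; Hypothetical helper (not part of the original script): validate a
;; command-line string and zero-pad it to two digits.
(define (pad2 label str)
  (define n (string->number str))
  (unless (exact-nonnegative-integer? n)
    (error (~a label " should be a non-negative integer: " str)))
  (~r n #:min-width 2 #:pad-string "0"))

(pad2 "month" "3")  ; "03"
(pad2 "day" "17")   ; "17"
```

With a helper like this, the final values form would shrink to two calls, one per argument.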
Next, we fire off a request to the sitemap. Notice the lack of error handling: this doesn’t need to be production grade, so we’ll assume the request succeeds.
(define-values (_status _headers response)
  (http-sendrecv "benknoble.github.io" "/sitemap.xml" #:ssl? #t))
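For anyone who does want a little safety here, the first value returned by http-sendrecv is the status line as a byte string, so a cheap check is possible without touching the network code. The sketch below is not part of the script; status-ok? is a hypothetical helper, and it assumes an HTTP/1.x-style status line.

```racket
#lang racket

;; Hypothetical helper: does a status line (bytes, as returned by
;; http-sendrecv) report a 200 response?
(define (status-ok? status-line)
  (regexp-match? #rx#"^HTTP/[0-9.]+ 200( |$)" status-line))

(status-ok? #"HTTP/1.1 200 OK")         ; #t
(status-ok? #"HTTP/1.1 404 Not Found")  ; #f

;; In the script this could guard the parse step, e.g.
;; (unless (status-ok? _status) (error "sitemap request failed:" _status))
```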
Now response is an input port: we can read from it, but we haven’t
materialized a full (byte)string yet. We know it contains an XML document, so
let’s read it as XML, extract the main document, and turn that into an xexpr:
(define doc
  (xml->xexpr (document-element (read-xml response))))
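To get a feel for what that produces, here is the same pipeline run on an in-memory string instead of the network port. The one-entry sitemap and the example.com URL are made up for illustration; a real sitemap has more elements and attributes.

```racket
#lang racket
(require xml)

;; A made-up, single-entry sitemap (example.com is a placeholder).
(define sample
  "<urlset><url><loc>https://example.com/2024/12/02/a/</loc></url></urlset>")

;; Same steps as the script, but reading from a string port.
(define sample-doc
  (xml->xexpr (document-element (read-xml (open-input-string sample)))))

;; An xexpr is a nested list: each element starts with its tag as a
;; symbol, and text content shows up as plain strings.
(first sample-doc)  ; 'urlset
```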
Almost done: we can query the document for the URLs (which happen to be loc
elements) and filter them by our month-day combo:
(define locations
  (se-path*/list '(loc) doc))
(define posts
  (filter-map
    (λ (loc)
      (regexp-match (pregexp (~a ".*" month "/" day ".*")) loc))
    locations))
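One caveat with this pattern: it looks for the month/day text anywhere in the URL, so a path like /2010/11/05/ would also match a query for October 11th, because 2010/11 contains 10/11. A stricter check might look like the sketch below. It assumes post URLs contain /YYYY/MM/DD/ path segments, which is an assumption about the URL layout; on-day? and the example.com URLs are hypothetical.

```racket
#lang racket

;; Assumption: post URLs contain /YYYY/MM/DD/ path segments.
;; Requiring the year segment and the slashes rules out matches that
;; straddle the year and month.
(define (on-day? month day url)
  (regexp-match? (pregexp (~a "/[0-9]{4}/" month "/" day "/")) url))

;; Hypothetical URLs for illustration only.
(define urls
  '("https://example.com/blog/2024/10/11/first/"
    "https://example.com/blog/2010/11/05/second/"))

;; Only the first URL was published on October 11th.
(filter (λ (u) (on-day? "10" "11" u)) urls)
```

This version returns a boolean rather than a match list, so it pairs with filter instead of filter-map.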
Note how useful filter-map is with regexp-match: filter-map discards any #f
results from the mapping function, while regexp-match returns #f for any
inputs that don’t match. At the same time, it transforms each matching input
into a list describing the match.
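A tiny standalone example of that interaction, using made-up strings rather than real URLs:

```racket
#lang racket

;; regexp-match returns a list for "post-12" and "page-7" and #f for
;; "about"; filter-map keeps the lists and drops the #f.
(define matches
  (filter-map (λ (s) (regexp-match #px"[0-9]+" s))
              '("post-12" "about" "page-7")))

matches  ; '(("12") ("7"))
```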
Finally, we display all the (newline-separated) results! We use first to
extract the full original input string because the earlier regexp-match
produces (list full-match sub-group ...); our full-match is the whole string
because we bracket month/day with .* patterns.
(for-each (compose1 displayln first) posts)
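To see the shape of a match result, here is a variation with two parenthesized sub-groups added purely for illustration (the script’s pattern has none, so its match lists contain only the full match). The input string is made up.

```racket
#lang racket

;; The surrounding .* make the full match span the entire input;
;; each parenthesized group adds one more element to the result.
(define m
  (regexp-match #px".*([0-9]{2})/([0-9]{2}).*" "site/12/02/post"))

(first m)  ; "site/12/02/post"
(rest m)   ; '("12" "02")
```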
And that’s a wrap!
In practice, I try to run blog-posts-on (the name of the script) once a day.
Sometimes I forget, so I build up a range of month/day combinations with
something like (Zsh):
print -l -- 11\ {17..22} | xargs -L1 blog-posts-on
That gets me the posts for November 17th through 22nd, for example. If I want to
open them all immediately, I pipe that to xargs -L1 open as mentioned
(substitute xdg-open or equivalent on your operating system).
#! /usr/bin/env racket
#lang racket
; vim: ft=racket
(require xml
         xml/path
         net/http-client
         racket/date)
(define now (current-date))
(define-values (month day)
  (command-line
    #:args ([month (~a (date-month now))] [day (~a (date-day now))])
    (unless (string->number month)
      (error "month should be numeric: " month))
    (unless (string->number day)
      (error "day should be numeric: " day))
    (values (~r (string->number month) #:min-width 2 #:pad-string "0")
            (~r (string->number day) #:min-width 2 #:pad-string "0"))))
(define-values (_status _headers response)
  (http-sendrecv "benknoble.github.io" "/sitemap.xml" #:ssl? #t))
(define doc
  (xml->xexpr (document-element (read-xml response))))
(define locations
  (se-path*/list '(loc) doc))
(define posts
  (filter-map
    (λ (loc)
      (regexp-match (pregexp (~a ".*" month "/" day ".*")) loc))
    locations))
(for-each (compose1 displayln first) posts)