
All of the other options in related functions use strings instead of characters. I’d rather write " " than #\space.

That’s a good enough reason, thanks.

<https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/racket.html|racket is slower than Common Lisp (SBCL)?>

<https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/racket.html|Common Lisp (SBCL) is slower than Racket?>

lol. same link, two different “headlines”: technically the same meaning, but each leads the reader in a different way. :smile:

Just commenting that SBCL being significantly faster than Racket shouldn’t be at all surprising.

Why?

Natively compiled code vs. bytecode interpretation, plus optimizations, esp. wrt mutability in building lists.

@massung those results are for Racket CS, which doesn’t have any bytecode or interpretation overhead

I thought racket cs was compiled?

“incremental native-code compiler”

So, here’s a quick compare of just list building + gc…

;; racket
(define (iota n)
  (for/list ([i n]) i))
(time (for ([i 10000]) (iota 1000)))

;; sbcl
(defun iota (n)
  (loop for i below n collect i))
(time (dotimes (i 10000) (iota 1000)))

Racket runs in 223 ms, of which 118 ms is gc. SBCL runs in 63 ms, of which 44 ms is gc.

assuming racket is compiled (I don’t know that it is, but I can (disassemble iota) in SBCL), it’s just more efficient

Does SBCL have “contracts” or checks to guard against bad inputs?

I wonder how much faster Racket would be if it completely disabled all contracts / checks.

I don’t disagree that SBCL usually generates faster code than Racket for a variety of reasons, but it’s definitely compiled in roughly the same way that SBCL is.

Not sure what you mean by that? Obviously SBCL raises an exception on (+ 5 "hello")

And SBCL checks types at compile-time to throw warnings

Also, if you use in-range in that for/list, it’s about twice as fast.

not twice actually

but that measurement is very variable because you might or might not get a GC pause

I want to build the list intentionally. obviously not building the list is faster

SBCL also might build that list using mutation, which isn’t possible in Racket (which is one reason that it might be faster in less small benchmarks)

> SBCL also might build that list using mutation
Absolutely uses mutation. Why wouldn’t you?? :slightly_smiling_face:

Sam meant:
(define (iota n)
  (for/list ([i (in-range n)]) i))

No, I mean that when measuring this sort of thing, you want to avoid variance based on GC timing. So you usually want to run GC beforehand, and allocate an amount that either consistently requires a collection or consistently doesn’t.
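A minimal sketch of that methodology in Racket (the iteration counts here are just illustrative, not from the thread):

```racket
;; Force a major collection so the timed region starts from a known heap state.
(collect-garbage)
(collect-garbage) ; a second pass cleans up anything freed by the first

;; Now time the allocation-heavy loop; the gc portion that `time` reports
;; should be much more stable across runs.
(time (for ([i (in-range 10000)])
        (for/list ([j (in-range 1000)]) j)))
```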

In either case, there’s no list being built for iterating over n (there’s of course a list being built for collecting the values)

In Racket, because of first-class continuations you can’t build a list with mutation in the same way.

I went looking for a disassemble function :cry:

I can always use an external disassembler

raco pkg install disassemble
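For reference, a quick sketch of using that package at the REPL (assuming the install above succeeded):

```racket
;; After `raco pkg install disassemble`:
(require disassemble)

(define (iota n)
  (for/list ([i (in-range n)]) i))

;; Prints the native code Racket CS generated for `iota`.
(disassemble iota)
```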

I feel like this has devolved into an “us” vs. “them” discussion, which always sucks.
The original comment was just that I fully expected SBCL to be faster than Racket. I also fully expect C/Rust to be faster than SBCL. There are obvious reasons for that: mutability in solutions, better optimizations, likely a better GC (guessing) and man-hours put into the implementation. Common Lisp is also a very fixed spec. The code that exists using it is well known and knowing what to optimize is often more important than what the optimization ends up being.

I’m not bothered about speed. Any performance problems are far more likely due to my code. A compiler can’t fix all my sins.

I also expect SBCL to be faster than Racket, but I think the reasons are often less obvious, and a few of the ones you suggested originally were not correct.

Fair enough.

Thanks guys. Very informative. I feel like I’ve only touched the surface.

As long as it’s reasoned, an “us vs them” discussion can still be quite informative and may even lead to improvements :slightly_smiling_face:

Of course now I want to know why c/rust is faster than sbcl.

@laurent.orseau can you take a look at http://drdr.racket-lang.org/60350/racket/share/pkgs/quickscript-test/drracket.rkt

@spdegabrielle (a) no GC (b) more precise type information (c) fewer safety checks (d) less need for pointers

It’s starting to look like part of the answer is “the build machine’s disk has gone bad,” and it just happens to have started failing the same week as the build-system change.

@massung I’d be interested in a comparison between the SBCL and Racket disassembly of that function (I don’t have SBCL handy)

Without doing a major deep-dive…
The racket disasm is performing a lot of memory accesses. This is on an M1 (ARM), so it’s a shallower pipeline than the x86, but given many sequential instructions there are possible LHS stalls. It looks like register usage is very low, with the stack being utilized quite a bit instead.
The SBCL disasm basically optimizes realizing the loop counter is a fixnum and keeps the loop value in a register. The list building all appears to be inlined as well. It may be inlined in the Racket code, but that’s difficult to discern at a quick glance given the size of the output for the racket disasm.

Again, that’s just a 5-minute glance and probably not worth taking too seriously. :wink:

That disassemble package for racket is nice. :slightly_smiling_face:

Thanks :slightly_smiling_face:

The other thing that’s notable is that the Racket code builds the list in reverse order and then calls reverse.

So if the SBCL code avoids that it’s a significant speedup

It’s hard to know if it’s very accurate, though. For example:
(define (add a b) (+ a b))
(disassemble add)
This produces some pretty nutty assembly that’s difficult to dissect. There’s no obvious ret instruction, so it’s possible it’s outputting a block of code in excess of the actual function body. There’s some code that appears to be type checking (looks like pointer tagging?), but it’s quite a bit for just a simple + operation. More than I’d expect.

There’s no ret because it’s a tail call.

Here’s what I get:
0: 4883fd02 (cmp rbp #x2)
4: 7569 (jnz (+ rip #x69)) ; => 6f
6: 4889f9 (mov rcx rdi)
9: 4c09c1 (or rcx r8)
c: f6c107 (test cl #x7)
f: 750c (jnz (+ rip #xc)) ; => 1d
11: 4c89c5 (mov rbp r8)
14: 4801fd (add rbp rdi)
17: 7004 (jo (+ rip #x4)) ; => 1d
19: 41ff6500 (jmp (mem64+ r13 #x0))
1d: 49836e6801 (sub (mem64+ r14 #x68) #x1) ; <=
22: 753f (jnz (+ rip #x3f)) ; => 63
24: 4d894508 (mov (mem64+ r13 #x8) r8)
28: 49897d10 (mov (mem64+ r13 #x10) rdi)
2c: 4983c518 (add r13 #x18)
30: 488d0d20000000 (lea rcx (mem+ rip #x20)) ; => 57
37: 49894d00 (mov (mem64+ r13 #x0) rcx)
3b: e9c0fbbdf7 (jmp (+ rip #x-8420440)) ; #<code event>
40: 0f1f800000000088 (data)
48: 000000000000008d (data)
50: 01000000000000 (data)
57: 4983ed18 (sub r13 #x18) ; <=
5b: 4d8b4508 (mov r8 (mem64+ r13 #x8))
5f: 498b7d10 (mov rdi (mem64+ r13 #x10))
63: e98853bdf7 (jmp (+ rip #x-842ac78)) ; #<code +> ; <=
68: 0f1f8000000000 (data)
6f: e97cbfbbf7 (jmp (+ rip #x-8444084)) ; #<code doargerr> ; <=
74: 0f1f8000000000 (data)

Here’s the ARM output for comparison:
0: ff0a00f1 (subs xzr x23 #x2)
4: 81100054 (b.ne (+ pc #x210)) ; => 214
8: ef0300aa (orr x15 xzr x0)
c: ee0301aa (orr x14 xzr x1)
10: 1e369cd2 (movz x30 #xe1b0) ; #<code winder-dummy>
14: bef8a0f2 (movk x30 (lsl #x7c5 #x10))
18: 3e00c0f2 (movk x30 (lsl #x1 #x20))
1c: 1e00e0f2 (movk x30 (lsl #x0 #x30))
20: 8d0240f9 (ldr x13 (mem+ x20 (lsl #x0 #x3)))
24: bf011eeb (subs xzr x13 x30)
28: 810e0054 (b.ne (+ pc #x1d0)) ; => 1f8
2c: 7ece40f9 (ldr x30 (mem+ x19 (lsl #x33 #x3)))
30: 6dda40f9 (ldr x13 (mem+ x19 (lsl #x36 #x3)))
34: ccb343f8 (ldur x12 (mem+ x30 #x3b))
38: 9f010deb (subs xzr x12 x13)
3c: 210c0054 (b.ne (+ pc #x184)) ; => 1c0
40: c00080d2 (movz x0 #x6)
44: 1f1800f1 (subs xzr x0 #x6) ; <=
48: 81080054 (b.ne (+ pc #x110)) ; => 158
4c: 4db340f8 (ldur x13 (mem+ x26 #xb))
50: b81e00d1 (sub x24 x21 #x7)
54: b5420091 (add x21 x21 #x10)
58: 7e5240f9 (ldr x30 (mem+ x19 (lsl #x14 #x3)))
5c: df0315eb (subs xzr x30 x21)
60: 03070054 (b.cc (+ pc #xe0)) ; => 140
64: f70318aa (orr x23 xzr x24) ; <=
68: ed7200f8 (stur x13 (mem+ x23 #x7))
6c: fe618ed2 (movz x30 #x730f) ; '#(3-unsaved-editor 3 18 33 7)
70: 9e40abf2 (movk x30 (lsl #x5a04 #x10))
74: 3e00c0f2 (movk x30 (lsl #x1 #x20))
78: 1e00e0f2 (movk x30 (lsl #x0 #x30))
7c: fef200f8 (stur x30 (mem+ x23 #xf))
80: 6dda40f9 (ldr x13 (mem+ x19 (lsl #x36 #x3))) ; <=
84: b81e00d1 (sub x24 x21 #x7)
88: b5420091 (add x21 x21 #x10)
8c: 7e5240f9 (ldr x30 (mem+ x19 (lsl #x14 #x3)))
90: df0315eb (subs xzr x30 x21)
94: a3040054 (b.cc (+ pc #x94)) ; => 128
98: 177300f8 (stur x23 (mem+ x24 #x7)) ; <=
9c: 0df300f8 (stur x13 (mem+ x24 #xf))
a0: 78da00f9 (str x24 (mem+ x19 (lsl #x36 #x3)))
a4: fe010eaa (orr x30 x15 x14)
a8: df0b40f2 (ands xzr x30 #x7)
ac: a1000054 (b.ne (+ pc #x14)) ; => c0
b0: f7010eab (adds x23 x15 x14)
b4: 66000054 (b.vs (+ pc #xc)) ; => c0
b8: 8a0240f8 (ldur x10 (mem+ x20))
bc: 40011fd6 (br x10)
c0: d60600f1 (subs x22 x22 #x1) ; <=
c4: 41020054 (b.ne (+ pc #x48)) ; => 10c
c8: 8f0600f9 (str x15 (mem+ x20 (lsl #x1 #x3)))
cc: 8e0a00f9 (str x14 (mem+ x20 (lsl #x2 #x3)))
d0: 94620091 (add x20 x20 #x18)
d4: 7e010010 (adr x30 (+ pc #x2c)) ; => 100
d8: 9e0200f9 (str x30 (mem+ x20 (lsl #x0 #x3)))
dc: 0a768cd2 (movz x10 #x63b0) ; #<code event>
e0: 0af9a0f2 (movk x10 (lsl #x7c8 #x10))
e4: 2a00c0f2 (movk x10 (lsl #x1 #x20))
e8: 0a00e0f2 (movk x10 (lsl #x0 #x30))
ec: 40011fd6 (br x10)
f0: 3101000000000000 (data)
f8: 8d01000000000000 (data)
100: 946200d1 (sub x20 x20 #x18) ; <=
104: 8f0640f9 (ldr x15 (mem+ x20 (lsl #x1 #x3)))
108: 8e0a40f9 (ldr x14 (mem+ x20 (lsl #x2 #x3)))
10c: e0030faa (orr x0 xzr x15) ; <=
110: e1030eaa (orr x1 xzr x14)
114: 0afa94d2 (movz x10 #xa7d0) ; #<code +>
118: eaf8a0f2 (movk x10 (lsl #x7c7 #x10))
11c: 2a00c0f2 (movk x10 (lsl #x1 #x20))
120: 0a00e0f2 (movk x10 (lsl #x0 #x30))
124: 40011fd6 (br x10)
128: 0a7899d2 (movz x10 #xcbc0) ; #<code get-room> ; <=
12c: aaf8a0f2 (movk x10 (lsl #x7c5 #x10))
130: 2a00c0f2 (movk x10 (lsl #x1 #x20))
134: 0a00e0f2 (movk x10 (lsl #x0 #x30))
138: 40013fd6 (blr x10)
13c: d7ffff17 (b (+ pc #x-a4)) ; => 98
140: 0a7899d2 (movz x10 #xcbc0) ; #<code get-room> ; <=
144: aaf8a0f2 (movk x10 (lsl #x7c5 #x10))
148: 2a00c0f2 (movk x10 (lsl #x1 #x20))
14c: 0a00e0f2 (movk x10 (lsl #x0 #x30))
150: 40013fd6 (blr x10)
154: c4ffff17 (b (+ pc #x-f0)) ; => 64
158: 41b340f8 (ldur x1 (mem+ x26 #xb)) ; <=
15c: e2618ed2 (movz x2 #x730f) ; '#(3-unsaved-editor 3 18 33 7)
160: 8240abf2 (movk x2 (lsl #x5a04 #x10))
164: 2200c0f2 (movk x2 (lsl #x1 #x20))
168: 0200e0f2 (movk x2 (lsl #x0 #x30))
16c: 8f0600f9 (str x15 (mem+ x20 (lsl #x1 #x3)))
170: 8e0a00f9 (str x14 (mem+ x20 (lsl #x2 #x3)))
174: 94620091 (add x20 x20 #x18)
178: 78ef93d2 (movz x24 #x9f7b) ; 'mark-frame-update
17c: 1815a1f2 (movk x24 (lsl #x8a8 #x10))
180: 3800c0f2 (movk x24 (lsl #x1 #x20))
184: 1800e0f2 (movk x24 (lsl #x0 #x30))
188: 1a5340f8 (ldur x26 (mem+ x24 #x5))
18c: 3e010010 (adr x30 (+ pc #x24)) ; => 1b0
190: 9e0200f9 (str x30 (mem+ x20 (lsl #x0 #x3)))
194: 770080d2 (movz x23 #x3)
198: 0ad340f8 (ldur x10 (mem+ x24 #xd))
19c: 40011fd6 (br x10)
1a0: e101000000000000 (data)
1a8: 8f01000000000000 (data)
1b0: 946200d1 (sub x20 x20 #x18) ; <=
1b4: 8f0640f9 (ldr x15 (mem+ x20 (lsl #x1 #x3)))
1b8: 8e0a40f9 (ldr x14 (mem+ x20 (lsl #x2 #x3)))
1bc: b1ffff17 (b (+ pc #x-13c)) ; => 80
1c0: deb343f8 (ldur x30 (mem+ x30 #x3b)) ; <=
1c4: df1b00f1 (subs xzr x30 #x6)
1c8: 01010054 (b.ne (+ pc #x20)) ; => 1e8
1cc: c00080d2 (movz x0 #x6)
1d0: 0a389dd2 (movz x10 #xe9c0) ; #<code reify-1cc>
1d4: aaf8a0f2 (movk x10 (lsl #x7c5 #x10))
1d8: 2a00c0f2 (movk x10 (lsl #x1 #x20))
1dc: 0a00e0f2 (movk x10 (lsl #x0 #x30))
1e0: 40013fd6 (blr x10)
1e4: 98ffff17 (b (+ pc #x-1a0)) ; => 44
1e8: bef140f8 (ldur x30 (mem+ x13 #xf)) ; <=
1ec: 7eda00f9 (str x30 (mem+ x19 (lsl #x36 #x3)))
1f0: a07140f8 (ldur x0 (mem+ x13 #x7))
1f4: 94ffff17 (b (+ pc #x-1b0)) ; => 44
1f8: c00080d2 (movz x0 #x6) ; <=
1fc: 0a389dd2 (movz x10 #xe9c0) ; #<code reify-1cc>
200: aaf8a0f2 (movk x10 (lsl #x7c5 #x10))
204: 2a00c0f2 (movk x10 (lsl #x1 #x20))
208: 0a00e0f2 (movk x10 (lsl #x0 #x30))
20c: 40013fd6 (blr x10)
210: 8dffff17 (b (+ pc #x-1cc)) ; => 44
214: 0a669bd2 (movz x10 #xdb30) ; #<code doargerr> ; <=
218: aaf8a0f2 (movk x10 (lsl #x7c5 #x10))
21c: 2a00c0f2 (movk x10 (lsl #x1 #x20))
220: 0a00e0f2 (movk x10 (lsl #x0 #x30))
224: 40011fd6 (br x10)

Is it possible you have errortrace on?

Can you try at the command line?

Ah, much better

0: ff0a00f1 (subs xzr x23 #x2)
4: 01040054 (b.ne (+ pc #x80)) ; => 84
8: 1e0001aa (orr x30 x0 x1)
c: df0b40f2 (ands xzr x30 #x7)
10: a1000054 (b.ne (+ pc #x14)) ; => 24
14: 170001ab (adds x23 x0 x1)
18: 66000054 (b.vs (+ pc #xc)) ; => 24
1c: 8a0240f8 (ldur x10 (mem+ x20))
20: 40011fd6 (br x10)
24: d60600f1 (subs x22 x22 #x1) ; <=
28: 41020054 (b.ne (+ pc #x48)) ; => 70
2c: 800600f9 (str x0 (mem+ x20 (lsl #x1 #x3)))
30: 810a00f9 (str x1 (mem+ x20 (lsl #x2 #x3)))
34: 94620091 (add x20 x20 #x18)
38: 7e010010 (adr x30 (+ pc #x2c)) ; => 64
3c: 9e0200f9 (str x30 (mem+ x20 (lsl #x0 #x3)))
40: 0a769cd2 (movz x10 #xe3b0) ; #<code event>
44: 4abaa0f2 (movk x10 (lsl #x5d2 #x10))
48: 2a00c0f2 (movk x10 (lsl #x1 #x20))
4c: 0a00e0f2 (movk x10 (lsl #x0 #x30))
50: 40011fd6 (br x10)
54: 9500000000000000 (data)
5c: 8d01000000000000 (data)
64: 946200d1 (sub x20 x20 #x18) ; <=
68: 800640f9 (ldr x0 (mem+ x20 (lsl #x1 #x3)))
6c: 810a40f9 (ldr x1 (mem+ x20 (lsl #x2 #x3)))
70: 0afa84d2 (movz x10 #x27d0) ; #<code +> ; <=
74: 4abaa0f2 (movk x10 (lsl #x5d2 #x10))
78: 2a00c0f2 (movk x10 (lsl #x1 #x20))
7c: 0a00e0f2 (movk x10 (lsl #x0 #x30))
80: 40011fd6 (br x10)
84: 0a668bd2 (movz x10 #x5b30) ; #<code doargerr> ; <=
88: 0abaa0f2 (movk x10 (lsl #x5d0 #x10))
8c: 2a00c0f2 (movk x10 (lsl #x1 #x20))
90: 0a00e0f2 (movk x10 (lsl #x0 #x30))
94: 40011fd6 (br x10)

Yes, that looks like what I see on x64 — it basically inlines fixnum addition, and also has tests for various things to call the regular + or handle errors

+1

Here’s the disassemble of (define (add m n) (unsafe-fx+ m n)):
0: 4883fd02 (cmp rbp #x2)
4: 7508 (jnz (+ rip #x8)) ; => e
6: 4a8d2c07 (lea rbp (mem+ rdi (* r8 #x1)))
a: 41ff6500 (jmp (mem64+ r13 #x0))
e: e9fde473fc (jmp (+ rip #x-38c1b03)) ; #<code doargerr> ; <=
13: 0f1f8000000000 (data)

Although I don’t really understand what’s going on there

Also note that you will get different results if you put the code in a module.

0: 4883fd02 (cmp rbp #x2) ; are there two arguments?
4: 7508 (jnz (+ rip #x8)) ; if not, jump to call doargerr
6: 4a8d2c07 (lea rbp (mem+ rdi (* r8 #x1))) ; add arguments
a: 41ff6500 (jmp (mem64+ r13 #x0)) ; jump to return address
e: e9fde473fc (jmp (+ rip #x-38c1b03)) ; call doargerr
13: 0f1f8000000000 (data) ; info for GC and stack unwinding

Thanks!

it was the mapping between “add arguments” and that lea that I was struggling with
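A sketch of why a plain lea/add works as the addition (assuming the 3-bit low tag that the earlier test cl #x7 check suggests; the helper names here are made up for illustration):

```racket
;; Hypothetical model of the fixnum representation: a fixnum n is stored
;; shifted left, so its 3 low tag bits are all zero.
(define (tag n) (arithmetic-shift n 3))
(define (untag r) (arithmetic-shift r -3))

;; Ordinary machine addition of two tagged values (what the lea computes)
;; yields the correctly tagged sum, with no untagging step needed.
(untag (+ (tag 5) (tag 7))) ; => 12
```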

Safe and generic version:
0: ff0a00f1 (subs xzr x23 #x2) ; are there two arguments?
4: 01040054 (b.ne (+ pc #x80)) ; => 84 ; jump to slow path if not
8: 1e0001aa (orr x30 x0 x1) ; "or" to combine mask bits
c: df0b40f2 (ands xzr x30 #x7) ; zero low bits => both are fixnums
10: a1000054 (b.ne (+ pc #x14)) ; => 24 ; jump to slow path if not
14: 170001ab (adds x23 x0 x1) ; add
18: 66000054 (b.vs (+ pc #xc)) ; => 24 ; jump to slow path on overflow
1c: 8a0240f8 (ldur x10 (mem+ x20)) ; load return address
20: 40011fd6 (br x10) ; return
24: d60600f1 (subs x22 x22 #x1) ; <= ; decr. engine counter
28: 41020054 (b.ne (+ pc #x48)) ; => 70 ; continue if not checking events
2c: 800600f9 (str x0 (mem+ x20 (lsl #x1 #x3))) ; start saving context...
30: 810a00f9 (str x1 (mem+ x20 (lsl #x2 #x3)))
34: 94620091 (add x20 x20 #x18)
38: 7e010010 (adr x30 (+ pc #x2c)) ; => 64
3c: 9e0200f9 (str x30 (mem+ x20 (lsl #x0 #x3)))
40: 0a7684d2 (movz x10 #x23b0) ; load address of event-checking code
44: ca5fa0f2 (movk x10 (lsl #x2fe #x10))
48: 2a00c0f2 (movk x10 (lsl #x1 #x20))
4c: 0a00e0f2 (movk x10 (lsl #x0 #x30))
50: 40011fd6 (br x10) ; jump to event-checking code
54: 9500000000000000 (data) ; data for GC and stack unwind
5c: 8d01000000000000 (data)
64: 946200d1 (sub x20 x20 #x18) ; <= ; return from event-checking code
68: 800640f9 (ldr x0 (mem+ x20 (lsl #x1 #x3)))
6c: 810a40f9 (ldr x1 (mem+ x20 (lsl #x2 #x3)))
70: 0afa8cd2 (movz x10 #x67d0) ; #<code +> ; load address of general `+`
74: aa5fa0f2 (movk x10 (lsl #x2fd #x10))
78: 2a00c0f2 (movk x10 (lsl #x1 #x20))
7c: 0a00e0f2 (movk x10 (lsl #x0 #x30))
80: 40011fd6 (br x10) ; jump to general `+`
84: 0a6693d2 (movz x10 #x9b30) ; load address of bad-arity function
88: 6a5fa0f2 (movk x10 (lsl #x2fb #x10))
8c: 2a00c0f2 (movk x10 (lsl #x1 #x20))
90: 0a00e0f2 (movk x10 (lsl #x0 #x30))
94: 40011fd6 (br x10) ; jump to bad arity

The block there from 24–50 is often a more compact event-detour
pattern, but this longer form tends to be used if the function starts with a short way out.

I preserved that here: https://gist.github.com/samth/688370af4fa74af8b38f644f6b6d915e


Shall we do another export to update the archive, too?

yes

If helpful, the last update was 5th November 2021

I sent you the new zip file