pocmatos
2018-11-23 08:01:53

Does anybody here have any thoughts on creating a racket lang that compiles straight into a gcc plugin/assembler/llvm ir for performance? This would be a subset of racket optimized for performance. How far could one go until hitting hard translation problems?


macocio
2018-11-23 08:04:20

@pocmatos I’ve considered it. The GC is definitely getting in my way atm :neutral_face:, even in incremental mode. Maybe compile to C++ as a backend or something for a first iteration. Unfortunately build times can be pretty high since racket has a long startup time.


macocio
2018-11-23 08:26:06

If we can get incremental builds going using ninja or something it could be nice, especially if we allow all forms of metaprogramming (hello enums with automatic conversion to strings)


pocmatos
2018-11-23 09:25:58

@macocio don’t care so much about build times. runtime is king in my world.


pocmatos
2018-11-23 09:27:28

I had a popcount implemented in Racket which statistical profiling showed to be taking 12% of my runtime. x86_64 and other archs can do it in a single instruction. One FFI call later, I had lowered that to less than 0.2%. Quite surprised that there’s not yet a function doing this in Racket.
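
For context, the C side of an FFI popcount like this can be very small. The following is a minimal sketch, not the code actually used above: it assumes GCC or Clang (`__builtin_popcountll` is a compiler builtin, not standard C), and the function name `fx_popcount` is hypothetical. On x86_64 the builtin typically becomes a single `popcnt` instruction only when the compiler is allowed to emit it, for example with `-mpopcnt` or `-march=native`.

```c
#include <stdint.h>

/* Hypothetical helper name, chosen for illustration only.
   Returns the number of set bits in a 64-bit word. */
int fx_popcount(uint64_t x)
{
    /* GCC/Clang builtin: takes an unsigned long long, returns an int. */
    return __builtin_popcountll(x);
}
```

Compiled into a shared library, a function like this could be bound from Racket through `ffi/unsafe` (`ffi-lib` and `get-ffi-obj`).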



pocmatos
2018-11-23 10:44:38

@soegaard2 true, I noticed that, but I have fixnum? values, whereas that expects an mpz?, and I don’t want to constantly allocate an mpz for the calculation. I was expecting something like fxpopcount, for example. I will create a simple package for these highly optimized functions.


pocmatos
2018-11-23 10:45:12

I never distributed a package online with C code. When I do so, is there a way to get the C code compiled on the target machine so I can use -march=native?


pocmatos
2018-11-23 10:45:37

i.e. compile the C code when the user does raco pkg install ...?


popa.bogdanp
2018-11-23 14:34:26

I’m pretty late to the game here, but it seems all the other benchmarks run multiple parallel processes per server in order to handle the requests, whereas your example only uses one process (thereby one os thread) to handle the load.


popa.bogdanp
2018-11-23 14:36:32

Oh wait, nvm, I see there are two sets of benchmarks: one multiprocess and the other single process, but the Racket benchmark seems to be grouped with the multiprocess ones.



pocmatos
2018-11-23 14:49:09

@soegaard2 woot? How did you find that gem and how come it’s not provided by racket? Curious to do some benchmarking… will come back to you on this.


soegaard2
2018-11-23 14:52:09

It’s used in data/bit-vector, which I wrote. I can’t remember whether I am responsible for the pop count code though. There was a discussion on the mailing list at some point.


soegaard2
2018-11-23 14:59:26

I think it was Ian Johnson: https://gist.github.com/deeglaze/4154642


soegaard2
2018-11-23 15:03:46

Note that the original implementation of bit-vectors used fxvector and the current one uses bytes. I remember there was a change at some point to use 32-bit fixnums on 64-bit machines too. So if the current fxpopcount only handles 32-bit fixnums, look at an older version.
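
To illustrate why the word width matters: a pure bit-twiddling popcount usually hard-codes its masks for one width, so a 32-bit version silently ignores higher bits. The sketch below shows the general technique in C. It is an assumption about the approach, not a transcription of the gist or of data/bit-vector, and the names `popcount32` and `popcount64` are made up for the example.

```c
#include <stdint.h>

/* Portable popcount for a 32-bit word: sums bits in parallel
   within the word, in fields of 2, 4, and then 8 bits. */
unsigned popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}

/* A 64-bit word handled as two 32-bit halves. */
unsigned popcount64(uint64_t x)
{
    return popcount32((uint32_t)x) + popcount32((uint32_t)(x >> 32));
}
```

Widening the masks to 64 bits works just as well; splitting into halves simply reuses the 32-bit routine unchanged.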


pocmatos
2018-11-23 15:17:07

Thanks.


pocmatos
2018-11-23 15:17:56

So the C version is about 2% faster, which is quite surprising given that in C it’s a single cpu instruction. But, of course, the call to C might also introduce some cruft.


soegaard2
2018-11-23 16:58:16

! That’s better than I thought.


krismicinski
2018-11-24 04:08:21

@pocmatos I’d be surprised if nobody looked at an LLVM backend for racket tbh


krismicinski
2018-11-24 04:08:25

but you all would know better than I