
Wow, that is one wicked plot. I sure didn’t expect things to bounce around that much!

it's not supposed to animate, but I think Slack just loads it slowly, giving that effect. I was surprised when I opened it!

this one shows that we are about as accurate and a little bit faster after we add warmup time to rkt.

(Of course a plot of means should have standard error.) Is the rkt advantage more obvious in a plot with more classes (so, a harder inference problem)?


from when I plotted with 20 trials, the only difference was jags having a much larger startup time

I am running 25–10000 right now, let's see

@ccshan here is the script for setting up mallet: https://github.com/rjnw/hakaru-benchmarks/blob/master/testcode/mallet/Makefile. it needs 20_newsgroup in the input folder, which can be generated using https://github.com/rjnw/hakaru-benchmarks/blob/master/input/download-data.sh

also, the jags script for naive bayes: https://github.com/rjnw/hakaru-benchmarks/blob/master/testcode/jagssrc/NaiveBayesModel.R. I ran it, but it didn't finish in 1 hour; execution didn't even reach line 50

okay, for 25–10000 jags does its first sweep at 100 seconds, then around 0.7 seconds each after that; rkt does 20 sweeps in 4–5 seconds. should I still go forward with this and run more, or reduce the dataset?

I am slightly disappointed that the GMM benchmarks so far have not shown that simplification (integrating out latent variables) helps. So if you can find some evidence of that (whether it’s by having more data points or fewer), that would be nice. But it’s probably more urgent for you to describe the experimental setup in writing.

For example, I wonder for the 25–10000 experiment you describe: If you ignore the initialization time, how does jags compare against rkt in terms of accuracy over time?
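To make that concrete, here is a minimal Python sketch of what I mean by ignoring initialization time: shift each engine's cumulative per-sweep timestamps so its first sweep starts at zero, then compare accuracy curves on that shifted axis. (The numbers below are just the rough timings you quoted, not real data.)

```python
import numpy as np

def drop_init(times):
    # Shift cumulative per-sweep timestamps so the first sweep ends at 0.
    # This deliberately hides startup cost, so it should only be used to
    # compare per-sweep convergence, not total time-to-answer.
    times = np.asarray(times, dtype=float)
    return times - times[0]

# Toy numbers from the 25-10000 run: JAGS's first sweep lands at ~100 s,
# then each later sweep takes ~0.7 s.
jags_times = 100.0 + 0.7 * np.arange(20)
print(drop_init(jags_times)[:3])  # [0.  0.7 1.4]
```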


I will run them for longer tonight, but from this one it looks like there might be some advantage in terms of accuracy as well

Yeah… If you divide the shading width by the square root of the number of trials (https://en.m.wikipedia.org/wiki/Standard_error), maybe you can get the shading to not overlap? :thinking_face::sunglasses:
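Concretely, something like this matplotlib sketch (the `acc` array of shape trials × sweeps is a placeholder for however the trial results are actually stored):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_mean_with_se(times, acc, label):
    # acc: (n_trials, n_sweeps) accuracies; shade mean +/- standard error,
    # i.e. the across-trial standard deviation divided by sqrt(n_trials).
    mean = acc.mean(axis=0)
    se = acc.std(axis=0, ddof=1) / np.sqrt(acc.shape[0])
    plt.plot(times, mean, label=label)
    plt.fill_between(times, mean - se, mean + se, alpha=0.3)
```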


I also removed the warmup

how many classes do you think we can go up to for 10000 points? since the runtime depends on the number of points, I will just run the maximum number of classes for 10000 that makes sense

50 maybe?

If you think it might help, sure, I don’t see what’s wrong with 50 classes

well, if the maximum accuracy goes really low, that might not look good either. for 25 classes it is already only 50%

okay it’s not that bad, around 45%

Throwing away warmup seems wrong. We care about total time to get good answers, no? And that means looking at, well, total time and accuracy.

Also, wouldn’t it make sense to also run these tests on manufactured data where we know the ‘ground truth’? That way we’d at least be able to know if we’re actually converging or just floundering around.
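Something along these lines, say (a Python sketch; the function names are mine, and the scoring uses Hungarian matching since cluster labels are only identifiable up to permutation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def make_gmm_data(n_points, n_classes, dim=1, spread=5.0, seed=0):
    # Draw cluster means, then points around them, keeping the true labels.
    rng = np.random.default_rng(seed)
    means = rng.normal(0.0, spread, size=(n_classes, dim))
    labels = rng.integers(0, n_classes, size=n_points)
    points = rng.normal(means[labels], 1.0)
    return points, labels

def accuracy(true_labels, inferred_labels, n_classes):
    # Cluster IDs are arbitrary, so score the best one-to-one relabeling
    # (Hungarian algorithm on the negated confusion matrix).
    conf = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, inferred_labels):
        conf[t, p] += 1
    rows, cols = linear_sum_assignment(-conf)
    return conf[rows, cols].sum() / len(true_labels)
```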

Right now, if we take into account warmup time, we win by an even bigger margin. But we're not taking into account the time it takes to simplify the Hakaru, for example.
In this GMM benchmark, we are using manufactured data where we know the ground truth. That’s how accuracy is computed. You’d know this if you had read the nonexistent experimental setup description write-up.


it might be a good idea to have numbers for our “startup time” including simplification and all
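something like this, maybe (a Python sketch; the two command names are placeholders for whatever the harness actually invokes for simplification and compilation):

```python
import subprocess, time

def timed(cmd):
    # Run a command to completion and return elapsed wall-clock seconds.
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - t0

# Placeholder command names, not real CLI entry points.
simplify_s = timed(["simplify-step", "gmm.hk"])
compile_s = timed(["compile-step", "gmm.simplified.hk"])
print(f"startup: simplify={simplify_s:.1f}s compile={compile_s:.1f}s")
```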