carette
2018-3-6 13:19:00

Wow, that is one wicked plot. I sure didn’t expect things to bounce around that much!


rjnw
2018-3-6 15:23:42

It's not supposed to animate; I think Slack just loads it slowly, giving that effect. I was surprised when I opened it!


rjnw
2018-3-6 15:24:54

This one shows that we are about as accurate, and a little bit faster even after adding warmup time to rkt.


ccshan
2018-3-6 17:09:00

(Of course a plot of means should have standard error.) Is the rkt advantage more obvious in a plot with more classes (so, harder inference problem)?


rjnw
2018-3-6 18:51:15

[uploaded a plot]


rjnw
2018-3-6 18:53:29

When I plotted with 20 trials, the only difference was jags having a much larger startup time.


rjnw
2018-3-6 18:53:48

I am running 25–10000 right now; let's see.


rjnw
2018-3-6 18:56:14

@ccshan here is the script for setting up mallet: https://github.com/rjnw/hakaru-benchmarks/blob/master/testcode/mallet/Makefile. It needs 20_newsgroup in the input folder, which can be generated using https://github.com/rjnw/hakaru-benchmarks/blob/master/input/download-data.sh.


rjnw
2018-3-6 19:02:39

Also, the jags script for naive bayes: https://github.com/rjnw/hakaru-benchmarks/blob/master/testcode/jagssrc/NaiveBayesModel.R. I ran it, but it didn't finish within 1 hr; execution didn't even reach line 50.


rjnw
2018-3-6 19:23:25

Okay, for 25–10000, jags does its first sweep at 100 seconds and after that around 0.7 seconds each; rkt does 20 sweeps in 4–5 seconds. Should I still go forward with this and run more, or reduce the dataset?
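
For scale, a rough back-of-envelope using just those numbers (the per-sweep costs are eyeballed from one run, not measured constants):

```python
# jags pays ~100 s for the first sweep and ~0.7 s per sweep after that;
# rkt does ~20 sweeps in ~4.5 s, i.e. ~0.225 s per sweep.
def jags_total(sweeps):
    return 100.0 + 0.7 * (sweeps - 1)

def rkt_total(sweeps):
    return 4.5 / 20 * sweeps

for n in (20, 200, 2000):
    print(n, "sweeps:", round(jags_total(n), 1), "s (jags) vs",
          round(rkt_total(n), 1), "s (rkt)")
```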


ccshan
2018-3-6 23:57:51

I am slightly disappointed that the GMM benchmarks so far have not shown that simplification (integrating out latent variables) helps. So if you can find some evidence of that (whether it’s by having more data points or fewer), that would be nice. But it’s probably more urgent for you to describe the experimental setup in writing.
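
(To spell out the parenthetical: "integrating out latent variables" means the simplified sampler targets a marginal rather than the full joint, schematically

p(z | x) ∝ ∫ p(x, z, θ) dθ,

so it never has to sample θ explicitly.)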


ccshan
2018-3-7 00:01:57

For example, for the 25–10000 experiment you describe, I wonder: if you ignore the initialization time, how does jags compare against rkt in terms of accuracy over time?
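
One hypothetical way to line that up (`times` and `acc` are placeholder names for per-sweep wall-clock stamps and accuracies, not anything in the repo):

```python
# Shift each system's clock so t = 0 is the end of its first sweep;
# that drops initialization/warmup from the comparison.
def align(times, acc):
    t0 = times[0]
    return [(t - t0, a) for t, a in zip(times, acc)]

# e.g. plot align(jags_times, jags_acc) against align(rkt_times, rkt_acc)
```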


rjnw
2018-3-7 02:35:41

[uploaded a plot]


rjnw
2018-3-7 02:37:10

I will run them for longer tonight, but from this one it looks like there might be some advantage in terms of accuracy as well.


ccshan
2018-3-7 02:40:37

Yeah… If you divide the shading width by the square root of the number of trials (https://en.m.wikipedia.org/wiki/Standard_error), maybe you can get the shading to not overlap? :thinking_face::sunglasses:
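
A minimal matplotlib sketch of what I mean, with `acc` standing in for a trials × sweeps array of per-sweep accuracies:

```python
import numpy as np
import matplotlib.pyplot as plt

acc = np.random.rand(20, 50)  # placeholder for 20 trials x 50 sweeps
mean = acc.mean(axis=0)
# standard error of the mean: sample std divided by sqrt(#trials)
se = acc.std(axis=0, ddof=1) / np.sqrt(acc.shape[0])
sweeps = np.arange(acc.shape[1])
plt.plot(sweeps, mean)
plt.fill_between(sweeps, mean - se, mean + se, alpha=0.3)
plt.xlabel("sweep")
plt.ylabel("accuracy")
plt.show()
```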


rjnw
2018-3-7 02:48:00

[uploaded a plot]


rjnw
2018-3-7 02:48:05

I also removed the warmup


rjnw
2018-3-7 02:49:50

How many classes do you think we can go up to for 10000 points? Since the runtime depends on the number of points, I will just run with the maximum number of classes that makes sense for 10000.


rjnw
2018-3-7 02:50:13

50 maybe?


ccshan
2018-3-7 02:53:58

If you think it might help, sure, I don’t see what’s wrong with 50 classes


rjnw
2018-3-7 02:55:26

Well, if the maximum accuracy goes really low, that might not look good either. For 25 classes it's already only 50%.


rjnw
2018-3-7 02:59:28

Okay, it's not that bad: around 45%.


carette
2018-3-7 03:08:24

Throwing away warmup seems wrong. We care about total time to get good answers, no? And that means looking at, well, total time and accuracy.


carette
2018-3-7 03:10:14

Also, wouldn’t it make sense to also run these tests on manufactured data where we know the ‘ground truth’? That way we’d at least be able to know if we’re actually converging or just floundering around.
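
A toy sketch of what I have in mind, with made-up sizes and a 1-D mixture:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 1000
true_means = rng.normal(0.0, 10.0, size=k)   # ground-truth component means
z = rng.integers(0, k, size=n)               # ground-truth cluster labels
x = rng.normal(true_means[z], 1.0)           # observed data points

# With z known, an inferred labeling z_hat can be scored directly
# (up to a permutation of cluster indices) instead of guessing
# whether the chain has converged.
```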


ccshan
2018-3-7 04:08:16

Right now, if we take into account warm-up time, we win even bigger. But we’re not taking into account the time it takes to simplify the Hakaru, for example.

In this GMM benchmark, we are using manufactured data where we know the ground truth. That's how accuracy is computed. You'd know this if you had read the (nonexistent) write-up describing the experimental setup.


rjnw
2018-3-7 05:17:47

[uploaded a plot]


rjnw
2018-3-7 05:26:53

It might be a good idea to have numbers for our "startup time", including simplification and everything else.
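
Something in this spirit could work (the stage commands below are placeholders, not the real Hakaru invocations):

```python
import subprocess
import time

def timed(cmd):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Placeholder stages; swap in the actual simplify/compile commands.
stages = {"simplify": ["true"], "compile": ["true"]}
startup = sum(timed(cmd) for cmd in stages.values())
print("total startup:", startup, "s")
```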