Basically Daniel’s email made me worry about memory bytes being interpreted incorrectly
I will put in some printfs to see the values of those arrays
I have another issue. LdaLikelihood is always returning 0. https://github.com/rjnw/hakaru-benchmarks/blob/master/runners/hk/LdaGibbs/Likelihood.hs
Don’t use fromProb, which causes underflow (so the 0 you get is actually promising). Use log to get the log likelihood, without underflow.
(Of course I mean the log from our LogFloatPrelude)
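The underflow being described can be reproduced with plain floating point; a minimal Python sketch (illustrative values only, not the Hakaru code) of why the 0 is promising and why log space fixes it:

```python
import math

# A document's likelihood is a product of many small per-word
# probabilities; with a large vocabulary each one is tiny.
probs = [1e-4] * 500  # 500 words, illustrative probability values

# The naive product underflows to 0.0 in double precision --
# the same symptom as converting out of log space with fromProb.
likelihood = 1.0
for p in probs:
    likelihood *= p
print(likelihood)      # 0.0

# Summing logs stays representable: this is the log likelihood.
log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)  # about -4605.17
```

LogFloat-style representations keep the number in log space the whole time, so the product never has to be materialized as a raw double.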
@ccshan is this augur lda code correct?

(ndocs : Int, ntopics : Int, nwords : Int, topics_prior : Vec Real, word_prior : Vec Real, doc : Vec Int) => {
  param theta[d] ~ Dirichlet(topics_prior)
    for d <- 0 until ndocs ;
  param phi[k] ~ Dirichlet(word_prior)
    for k <- 0 until ntopics ;
  param z[d] ~ Categorical(theta[doc[d]])
    for d <- 0 until nwords ;
  data w[d] ~ Categorical(phi[z[d]])
    for d <- 0 until nwords ;
}
I had to change lda to 1D as well
It’s giving me this error:

NameError: Error: [CgConj] | Product, could not match
Pi(t10 <- 0 until ndocs) {
  Dirichlet(theta[t10] ; topics_prior)
} with Pi(t12 <- 0 until nwords) {
  let t14 = doc[t12] in
  let t15 = theta[t14] in
  Categorical(z[t12] ; t15)
}
I found where this error comes from. Does it work better with 2D LDA?
I still haven’t figured out how to do irregular arrays. I will try that again then.
My guess is that, in order for 1D LDA to work, “the normalization rule…where z is a Categorical variable” in the AugurV2 paper needs to be generalized so that z is not necessarily a Categorical variable (such as z[d]) but possibly a bounded Int variable (such as doc[d]). Maybe this rule is implemented under -- == Mixture factoring in RwCore.hs and can be fixed, but I don’t understand that code yet. Meanwhile, if you could show what goes wrong when you try irregular arrays in 2D LDA, we can ask Daniel Huang about that.
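For reference, the conjugacy that match is trying to recognize is ordinary Dirichlet–Categorical counting; a minimal Python sketch (names are illustrative, nothing here is from the AugurV2 internals):

```python
from collections import Counter

def dirichlet_posterior(prior, draws):
    """Posterior Dirichlet parameters after observing Categorical draws.

    Dirichlet(prior) is conjugate to Categorical: the posterior is
    Dirichlet(prior + counts), where counts[k] is how many draws came
    out as k. This is the product-of-factors pattern [CgConj] has to
    match; in the 1D model the draws for theta[j] are scattered across
    the word loop via doc[d] == j, which is where the matcher gives up.
    """
    counts = Counter(draws)
    return [a + counts.get(k, 0) for k, a in enumerate(prior)]

prior = [1.0, 1.0, 1.0]        # symmetric Dirichlet prior
z = [0, 2, 2, 1, 2]            # categorical draws for one topic vector
print(dirichlet_posterior(prior, z))  # [2.0, 2.0, 4.0]
```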
Is this correct for 2D LDA?

(ntopics : Int, ndocs : Int, w_shape : Vec Int, topics_prior : Vec Real, words_prior : Vec Real) => {
  param theta[d] ~ Dirichlet(topics_prior)
    for d <- 0 until ndocs ;
  param phi[k] ~ Dirichlet(words_prior)
    for k <- 0 until ntopics ;
  param z[d, n] ~ Categorical(theta[d])
    for d <- 0 until ndocs, n <- 0 until w_shape[d] ;
  data w[d, n] ~ Categorical(phi[z[d, n]])
    for d <- 0 until ndocs, n <- 0 until w_shape[d] ;
}
This gets terminated by a segfault (address boundary error). w_shape[i] is the number of words in the document at index i, and ndocs is the length of w_shape.
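The doc array the 1D model needs can be derived from w_shape mechanically; a minimal Python sketch of that flattening (illustrative, assuming w_shape holds per-document word counts as just described):

```python
# Flatten a ragged corpus: each word position gets the index of its
# document, so theta[doc[d]] in the 1D model plays the role of
# theta[d] in the 2D model.
def flatten(w_shape):
    doc = []
    for d, n_words in enumerate(w_shape):
        doc.extend([d] * n_words)
    return doc

w_shape = [3, 1, 2]   # 3 documents of unequal length
doc = flatten(w_shape)
print(doc)            # [0, 0, 0, 1, 2, 2]
print(len(doc))       # nwords == sum(w_shape) == 6
```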
Well, I’m just going by augurv2/examples/lda.py but that seems right. Does that example (2D LDA with all documents equal length) work for you? What if you take exactly that code but remove one word from one document? Can you add printf to your generated code to make sure that memory bytes are interpreted as intended, or trace where the segfault happens?
Feel free to send your segfaulting Python code to Daniel Huang and cc me…
okay let me try
this time it’s stuck for 20 minutes with a smaller data set of 200 documents. :confused:
When I run it with the small data from their examples it works, but it doesn’t work with 20newsgroups.
Sent an email to Dan.