proditus's comments

proditus · on Nov 4, 2016

stan and edward dev here. happy to answer any questions.

(shakir's blog posts are amazing; i recommend them all.)

murbard2 · on Nov 4, 2016

Very cool, many questions

1) Why create a project distinct from Stan? Was it the prospect of benefiting from all the work going into TF and focusing solely on the sampling procedures rather than autodiff or GPU integration?

2) Are you implementing NUTS?

3) Any plans to implement parallel tempering?

4) Any plans to handle "tall" data using stochastic estimates of the likelihood?

proditus · on Nov 4, 2016

great questions.

1. you touch upon the right strengths of TF; that was certainly one consideration. edward is designed to address two goals that complement stan. the first is to be a platform for inference research: as such, edward is primarily a tool for machine learning researchers. the second is to support a wider class of models than stan (at the cost of not offering a "works out of the box" solution).

our recent whitepaper explains these goals in a bit more detail:

https://arxiv.org/pdf/1610.09787.pdf

2) no immediate plans. but we have HMC and are looking for volunteers :)

3) same answer as above :) should be relatively easy to implement tempering.

4) this is already in the works! stay tuned!

marmaduke · on Nov 4, 2016

Why TF instead of Theano as PyMC3 has done? Shouldn't it be straightforward to port PyMC3 algos over to TF?

My main gripe with Theano is that OpenCL support is near non-existent, but this is also the case with TF.

murbard2 · on Nov 4, 2016

4) which approach are you using? Generalized Poisson Estimator, or estimating the convexity effect of the exponential by looking at the sample variance of the log likelihood? The former is more pure, the latter may be more practical if ugly.

proditus · on Nov 4, 2016

these are great insights.

our first approach is the simplest: stochastic variational inference. consider a likelihood that factorizes over datapoints. stochastic variational inference then computes stochastic gradients of the variational objective function at each iteration by subsampling a "minibatch" of data at random.
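As a rough illustration of the subsampling step only (not edward's implementation, and leaving out the variational part of the objective), here is a minimal sketch. It assumes a toy model y_i ~ Normal(theta, 1); the function names are made up for the example. The minibatch gradient is rescaled by N / M so that it is unbiased for the full-data gradient.

```python
import random

def full_gradient(data, theta):
    # Gradient of sum_i log Normal(y_i | theta, 1) with respect to theta.
    return sum(y - theta for y in data)

def minibatch_gradient(data, theta, batch_size, rng):
    # Subsample a minibatch uniformly without replacement, then rescale
    # by N / M so the estimate is unbiased for the full-data gradient.
    batch = rng.sample(data, batch_size)
    scale = len(data) / batch_size
    return scale * sum(y - theta for y in batch)

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
theta = 0.0

exact = full_gradient(data, theta)
estimates = [minibatch_gradient(data, theta, 50, rng) for _ in range(2000)]
average = sum(estimates) / len(estimates)
print(exact, average)  # the two numbers should be close
```

Each individual estimate is noisy, but their average tracks the exact gradient, which is what lets a stochastic optimizer use minibatches in place of the full dataset.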

i reckon the techniques you suggest would work as we move forward!

murbard2 · on Nov 4, 2016

Edit: ah never mind, variational inference, got it! I was thinking stochastic HMC!

---

OK, but that will give you an unbiased estimate of the log-likelihood. MCMC and HMC do work with noisy estimators, but they require unbiased estimates of the likelihood itself.

At the very least, you need to do a convexity adjustment by measuring the variance inside your mini batch. Or you can use the Poisson technique which will get you unbiased estimates of exp(x) from unbiased estimates of x (albeit at the cost of introducing a lot of variance).
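A small numerical sketch of the convexity issue, under loudly simplified assumptions: the noise on the log-likelihood estimate is Gaussian with a known standard deviation (in practice it would have to be estimated from the minibatch), and the values below are arbitrary. Exponentiating an unbiased estimate of x overshoots exp(x); subtracting half the variance first removes the bias in this Gaussian case.

```python
import math
import random

rng = random.Random(1)
true_log_lik = -1.0   # exact log-likelihood, known here only for the demo
noise_sd = 0.5        # assumed standard deviation of the estimator's noise
n_draws = 200_000

naive_total = 0.0
adjusted_total = 0.0
for _ in range(n_draws):
    # An unbiased but noisy estimate of the log-likelihood.
    noisy = true_log_lik + rng.gauss(0.0, noise_sd)
    # Exponentiating directly is biased upward (Jensen's inequality).
    naive_total += math.exp(noisy)
    # Subtracting half the variance removes the bias for Gaussian noise.
    adjusted_total += math.exp(noisy - 0.5 * noise_sd ** 2)

naive_mean = naive_total / n_draws
adjusted_mean = adjusted_total / n_draws
target = math.exp(true_log_lik)
print(target, naive_mean, adjusted_mean)
```

The naive average settles near exp(x + s^2 / 2) rather than exp(x), while the adjusted average settles near exp(x). With non-Gaussian noise the adjustment is only approximate.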

proditus · on Nov 7, 2016

great points; yes, the problem becomes considerably more challenging with MCMC!

marmaduke · on Nov 4, 2016

if I can bug you also since you're working with Alp, how does Edward handle ADVI covariance? Is it diagonal or dense or some sparse structure estimated?

proditus · on Nov 4, 2016

you may bug me on this. i work too closely with alp :)

edward does not completely implement advi yet. the piece that is missing is the automated transformation of constrained latent variable spaces to the real coordinate space. however, edward offers much more flexibility in specifying the dependency structure of the posterior approximation. diagonal is, just like in stan, the fastest and easiest. however, introducing structure (e.g. assigning a dense normal to some of the latent variables while assigning a diagonal to others) is much easier in edward.
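To make the "dense for some, diagonal for others" idea concrete, here is a library-free sketch. It does not use edward's API; `structured_cholesky` and `covariance` are hypothetical helpers and the numbers are arbitrary. The approximation is parameterized by a lower-triangular factor L that is dense on one block of latent variables and diagonal on the rest, giving covariance L L^T.

```python
def structured_cholesky(dense_factor, diag_scales):
    """Build a lower-triangular factor L that is dense on the first block
    and diagonal on the second, so that covariance = L L^T."""
    k, m = len(dense_factor), len(diag_scales)
    n = k + m
    L = [[0.0] * n for _ in range(n)]
    for i in range(k):
        for j in range(i + 1):          # keep only the lower triangle
            L[i][j] = dense_factor[i][j]
    for i in range(m):
        L[k + i][k + i] = diag_scales[i]
    return L

def covariance(L):
    # Plain matrix product L L^T.
    n = len(L)
    return [[sum(L[i][t] * L[j][t] for t in range(n)) for j in range(n)]
            for i in range(n)]

# Two correlated latent variables (dense block) plus two independent ones.
L = structured_cholesky([[1.0, 0.0], [0.8, 0.6]], [0.5, 2.0])
cov = covariance(L)
for row in cov:
    print(row)
```

The first two variables end up correlated, while the last two stay independent of everything else. A dense block over k variables costs k(k+1)/2 parameters, against one parameter per variable in the diagonal block.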

marmaduke · on Nov 5, 2016

OK I would be interested in seeing how to do that. Are there any examples or hints on how to start? I worked a lot with time series models (think nonlinear autoregressive), where there's strong short term autocorrelation, and the coercion to diagonal covariance seemed inappropriate.

I also have a naïve question: why not use the graphical structure of the model itself to add structure to the covariance? For example, in an AR model, each time point places a prior on the next time point, so why not assume a banded covariance? More generally, one could use a cutoff on shortest path length (through the model's graph structure) between parameters to decide if they should have nonzero coefficients.
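A sketch of what that cutoff rule could look like, assuming the model graph is given as an undirected adjacency list. `covariance_mask` is a hypothetical helper, not something from Stan or Edward. It runs a breadth-first search from each parameter and marks pairs whose shortest path is at most `cutoff` edges.

```python
from collections import deque

def covariance_mask(neighbors, cutoff):
    """Return mask[i][j] = True when nodes i and j are within `cutoff`
    edges of each other in the (undirected) model graph."""
    n = len(neighbors)
    mask = [[False] * n for _ in range(n)]
    for start in range(n):
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from `start`
            node = queue.popleft()
            if dist[node] == cutoff:
                continue                  # do not expand past the cutoff
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for node in dist:
            mask[start][node] = True
    return mask

# AR(1)-style chain: time point t is linked to t - 1 and t + 1.
T = 5
chain = [[j for j in (t - 1, t + 1) if 0 <= j < T] for t in range(T)]
mask = covariance_mask(chain, cutoff=1)
for row in mask:
    print("".join("x" if entry else "." for entry in row))
```

For the chain with `cutoff=1` the mask is tridiagonal, i.e. the banded structure suggested for the AR case. Raising the cutoff widens the band.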

marmaduke · on Nov 7, 2016

I came across the examples in the repo and commented on

https://github.com/blei-lab/edward/issues/211

so I'll try to do my homework before asking more questions ;)

marmaduke · on Nov 4, 2016

Whoa cool, but the build is failing, tsk tsk.

A big issue I ran into with stan even with advi was scaling to large datasets since it (and Eigen) are single threaded. Would Edward answer all my prayers?

When is Riemannian HMC going to arrive?

proditus · on Nov 4, 2016

i'm assuming you're referring to building edward? installation is a bit of a pain because tensorflow is not on pypi yet.

please take a look here: http://edwardlib.org/troubleshooting

edward should answer some of your prayers :) there's still some time until stan goes parallel/gpu, though there's lots of interest there.

riemannian hmc is likely just around the corner!

marmaduke · on Nov 4, 2016

I was shallow-ly referring to the Travis CI badge on the GitHub page..

I've been working with both Stan & PyMC3 on some large datasets and will definitely try Edward on them.

proditus · on Nov 4, 2016

ah ok. i agree. we're working on that. :)

give it a shot and let us know!

proditus · on Sept 19, 2015

the Stan manual [1] is like a textbook. while it's a bit long (and a fair bit longer than a research paper), i highly encourage you to take a look. it's very informative.

[1] http://mc-stan.org/documentation/

proditus · on Sept 19, 2015

do let me know how it goes if you do try ADVI.

we're also working on making it more robust to initialization and step-sizes. stay tuned.

mjw · on Sept 19, 2015

ADVI is incredibly awesome. Thanks so much for this work. Really nice readable paper on it too.

As someone working with large datasets I think automating variational inference + SGD to work with a broad class of models is really the way forwards and a big force multiplier for machine learning.

proditus · on Sept 19, 2015

hi. i'm one of the Stan devs. (i work on variational inference: ADVI).

happy to answer any questions here.

chmullig · on Sept 19, 2015

Yay Stan! So what's the deal with BBVI? I heard it's integrated in the newest Stan, but can I use it? What for?

proditus · on Sept 19, 2015

glad to see the enthusiasm!

ADVI [1] is a variant of BBVI [2] where we fully leverage all of the amazing things that Stan has to offer (like automatic differentiation and automatic transformations of constrained parameters).
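For readers unfamiliar with the transformation step, here is a hand-written sketch of the idea for a single positive parameter. It is not Stan's code: the Exponential(1) target and the function names are assumptions chosen for illustration. A parameter theta > 0 is mapped to zeta = log(theta) on the real line, and the log-density picks up the log-Jacobian of the inverse map.

```python
import math

def log_density_constrained(theta):
    # Example target on theta > 0: an Exponential(1) density.
    return -theta

def log_density_unconstrained(zeta):
    # Map zeta in R to theta = exp(zeta) > 0 and add the log-Jacobian,
    # log |d theta / d zeta| = zeta.
    theta = math.exp(zeta)
    return log_density_constrained(theta) + zeta

# Sanity check: the transformed density should still integrate to one
# (midpoint rule on a wide interval).
lo, hi, steps = -20.0, 5.0, 25_000
width = (hi - lo) / steps
total = sum(
    math.exp(log_density_unconstrained(lo + (i + 0.5) * width)) * width
    for i in range(steps)
)
print(total)
```

Once every parameter lives on the real line, a normal approximation can be fitted there without worrying about the original constraints.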

you can use ADVI to get an approximation to the Bayesian posterior. the advantage of using ADVI over sampling is that ADVI is typically faster for large models (both in terms of # of parameters and # of data observations). ADVI is also a bit better at handling models with multi-modal posteriors, such as mixture models.

ADVI is currently in cmdStan (but not in RStan or PyStan). we're continuing to make the algorithm more robust.

[1] http://arxiv.org/abs/1506.03431

[2] http://www.jmlr.org/proceedings/papers/v33/ranganath14.pdf

marmaduke · on Sept 19, 2015

hi, funny this came up on HN; I'm just trying to get a handle on fitting a dynamical systems model of neural activity propagation with Stan.

What in general are the disadvantages of variational inference vs full MCMC? Are there in general significant advantages besides the speed up?

proditus · on Sept 19, 2015

that's a really good question. some aspects of VI vs MCMC are areas of active research. so it's tough to respond succinctly, but i'll try.

the key disadvantages of VI (particularly ADVI) are:

1. mean-field variational inference cannot model posterior correlations. so if you expect your model + dataset to give a "skewed" posterior, then mean-field variational inference will have a difficult time describing such a posterior. (it will underestimate marginal variances.)

2. full-rank variational inference can model posterior correlations. but it can become too expensive for big models. there is a lot of great research coming up in this vein, such as [1,2].

3. in either case, the version of variational inference we have in Stan (ADVI) uses a normal approximation in a transformed parameter space. thus, there is an additional mismatch of the shape of the variational posterior to the full MCMC posterior.
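The variance underestimation in point 1 can be checked by hand in the simplest case. The sketch below assumes a two-dimensional Gaussian posterior with correlation 0.9 (an arbitrary choice) and uses the standard result that, for a Gaussian target, the factorized Gaussian minimizing KL(q || p) has variance equal to the reciprocal of the corresponding diagonal entry of the precision matrix.

```python
def invert_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

rho = 0.9
covariance = [[1.0, rho], [rho, 1.0]]   # true posterior covariance
precision = invert_2x2(covariance)

true_marginal_var = covariance[0][0]
# For a Gaussian target, the factorized Gaussian minimizing KL(q || p)
# has variance 1 / precision[i][i] in each coordinate.
mean_field_var = 1.0 / precision[0][0]
print(true_marginal_var, mean_field_var)
```

Here the mean-field variance works out to 1 - rho^2, about 0.19, against a true marginal variance of 1, so the stronger the correlation, the more the approximation shrinks.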

in terms of advantages:

1. variational inference is (in general) a non-convex optimization problem. so it's easy to know when we've converged to a local optimum. convergence in MCMC is a bit more tricky to assess.

2. if your model has a multi-modal posterior, then variational inference will focus on just one of the modes. this is sometimes desirable as MCMC techniques might end up jumping around all of the modes and producing poor samples.

this is just the tip of the iceberg. but i hope it helps!

[1] http://arxiv.org/pdf/1506.03159.pdf

[2] http://arxiv.org/pdf/1502.07685.pdf