\documentclass[preprint]{sigplanconf}

\usepackage{amsmath}
\usepackage{comment}
\excludecomment{hidden}
\usepackage{bm}
\usepackage[all,web]{xy}

% Instruct includegraphics command to treat unknown file extensions as
% *.mps files when running pdflatex.  This allows us to use Metapost for
% figures. (Which we're not actually doing.  Oh, well.)
% \usepackage{graphicx,ifpdf}
% \ifpdf
% \DeclareGraphicsRule{*}{mps}{*}{} 
% \fi

% pdftex wants to come after all other packages.
\usepackage[pdftex,linkbordercolor={0 0 1},bookmarks=true]{hyperref}

%include polycode.fmt 
\bibliographystyle{plain}

% Format variable names with subscripts.
%format v1
%format v2
%format p1
%format p2
%format w1
%format w2
%format d1
%format d2
%format x1

% The real numbers and a generic monad.
\newcommand{\R}{\mathbb{R}}
\newcommand{\M}{\mathbb{M}}

% To work around lhs2TeX's dislike of "%" near || bars.
\newcommand{\percent}{\verb+%+}

% A vector of probabilities.
\newcommand{\Pb}{\mathbf{P}}

% Terms being defined.
\newcommand{\term}[1]{\textit{#1}}

\begin{document}

%\conferenceinfo{Haskell Workshop 2007}{September 30, 2007, Freiburg, Germany.}
%\copyrightyear{2007}
%\copyrightdata{0-00000-000-0/00/0000}
%\authorpermission

\preprintfooter{Draft paper for Hac 07 II in Freiburg.}
\titlebanner{DRAFT}

\title{Build your own probability monads}
% \subtitle{A toolkit for Bayesian filtering and particle filters}

\authorinfo{Eric Kidd} 
           {Dartmouth College}
           {eric.kidd@@dartmouth.edu}

\maketitle

\begin{hidden}

> import Control.Monad
> import Control.Monad.Maybe
> import Control.Monad.Random
> import Control.Monad.Trans
> import Data.List
> import Data.Maybe
> import Data.Monoid
> import Data.Ratio
> import System.Random

\end{hidden}

\begin{abstract}
% SPJ's advice on structuring abstracts, apparently based on advice by
% Kent Beck:
%
% 1. State the problem: "Many papers are badly written and hard to understand."
%
% 2. Say why it's an interesting problem: "This is a pity, because their
%    good ideas may go unappreciated."
%
% 3. Say what your solution achieves: "Following simple guidelines can
%    dramatically improve the quality of your papers."
%
% 4. Say what follows from your solution: "Your work will be used more, and
%    the feedback you get from others will in turn improve your research."
  Probability is often counter-intuitive, and it always involves a great
  deal of math.  This is unfortunate, because many applications in robotics
  and AI increasingly rely on probability theory.  We introduce a modular
  toolkit for constructing probability monads, and show that it can be used
  for everything from discrete distributions to weighted particle
  filtering.  This modular approach allows us to present a single,
  easy-to-use API for working with many kinds of probability distributions.

  Our toolkit combines several existing components (the list monad, the
  |Rand| monad, and the |MaybeT| monad transformer), with a stripped down
  version of |WriterT Prob|, and a new monad for sequential Monte Carlo
  sampling.  Using these components, we show that |MaybeT| can be used to
  implement Bayes' theorem.  We also show how to implement a monad for
  weighted particle filtering.
\end{abstract}

\category{D.3.3}{Programming Languages}
                {Language Constructs and Features}
                [Control structures]
\category{G.3}{Probability and Statistics}
              {}
% And so on.

\terms
Probability, Monads

\keywords
Bayesian filtering, particle filters

\begin{quote}
  \textit{A very senior Microsoft developer who moved to Google told me
    that Google works and thinks at a higher level of abstraction than
    Microsoft. ``Google uses Bayesian filtering the way Microsoft uses the
    if statement,'' he said.} -Joel Spolsky
% http://www.joelonsoftware.com/oldnews/pages/October2005.html
\end{quote}

\section{Introduction}
\label{Intro}
% 1 page.

Probability is notoriously tricky and counter-intuitive.  It's
easy to ignore prior probabilities, confuse $P(A||B)$ with $P(B||A)$, or
get lost while trying to generalize Bayes' theorem.  But we encounter
probabilities more than ever, thanks to recent trends in search algorithms,
robotics and artificial intelligence.

To address these challenges, researchers have built several excellent
programming languages based on probability distribution monads
\cite{ramsey02stochastic, park05probabilistic, erwig06probabilistic}.  Some
of these languages use random sampling; others compute exact results.  But
all of these languages are delightful tools---they make previously subtle
problems intuitive and easy.

But these waters are deeper than a casual glance reveals.  Many problems in
probability theory can be expressed in terms of monads
\cite{lawvere62category, giry81categorical, jones89probabilistic}.  And if
we examine a few such monads, certain repeated patterns become obvious.  In
fact, most probability distribution monads can be built from a small
``toolkit'' of monads and monad transformers.  The same parts are shared by
discrete distributions, random sampling monads, and even particle filters.

In general, the monads built from this toolkit make pleasant programming
languages.  For example, imagine that we have an influenza test with a
30\percent{} false-positive rate, and a 10\percent{} false-negative rate
\cite{cdcInfluenza}.  If we assume that 10\percent{} of the population has
influenza, what is the probability that someone with a positive test result
is actually infected?  Instead of messing around with Bayes' theorem, we
can simply write:

\begin{hidden}

> data Status  = Flu | Healthy
>   deriving (Show,Eq,Ord)
> data Test    = Pos | Neg
>   deriving (Show,Eq,Ord)
>
> percent p v1 v2 = weighted [(p, v1), (100-p, v2)]
> percentWithFlu p = percent p Flu Healthy
> percentPositive p = percent p Pos Neg
>
> fluStatusGivenPositiveTest :: (Dist d, MonadPlus d) => d Status

\end{hidden}

> fluStatusGivenPositiveTest = do
>   fluStatus   <-  percentWithFlu 10
>   testResult  <-  if fluStatus == Flu
>                     then  percentPositive  70
>                     else  percentPositive  10
>   guard (testResult == Pos)
>   return fluStatus

This function will return a probability distribution of 44\percent{} |Flu|
and 56\percent{} |Healthy|.  Because only a small portion of the population
is actually infected, the false positives actually outnumber the true ones.
This monad in this example is similar to other implementations of the
discrete probability monad \cite{erwig06probabilistic}, with the addition
of |guard|, which is used to do implicit Bayesian filtering.

We make the following contributions:

\begin{itemize}

\item We introduce a toolkit for constructing probability distribution
  monads (Section \ref{Toolkit}).  The toolkit consists of three monads
  (the list monad, a Monte Carlo monad, and a sequential Monte Carlo
  monad), and two monad transformers (|PerhapsT| and |MaybeT|).  We use
  this toolkit to recreate an existing probability distribution monad
  (Sections \ref{ToolkitPerhapsT} and \ref{MonoidsAndDiscrete}).

\item We implement Bayes' theorem using the |MaybeT| monad transformer
  (Sections \ref{ToolkitMaybeT} and \ref{BayesAndMaybeT}).  |MaybeT| allows
  us to discard possible outcomes, neatly encapsulating the notion of
  sampling with rejections.

\item We develop several new monads for sequential Monte Carlo sampling,
  also known as particle filtering (Sections \ref{ToolkitMC} and
  \ref{SMC}).  These monads support both rejection filters (using |MaybeT|)
  and weighted filters (using |PerhapsT|).

\end{itemize}

\section{Background}
\label{Background}
% 1 page.

Probability theory poses many challenges for programming language
designers.  We must choose from a bewildering variety of representations
and techniques.  We must work around the limitations of the human
intuition, which is notoriously bad at probability.  And we must keep the
math from obscuring the actual problems of interest.  Solving all these
problems at once is beyond the scope of any existing programming language.
At best, we can hope to find a sweet spot in the design space.

\subsection{Choosing a representation}
\label{BackgroundRepresentations}

% TODO: Still need more citations, but we're getting there.

Probability theory offers a rich variety of problem-solving tools.  At
times, this variety is perhaps too rich.  For example, probability
distributions may be represented as discrete distributions
\cite{erwig06probabilistic}, random sampling functions
\cite{ramsey02stochastic, park05probabilistic, erwig06probabilistic},
measure terms \cite{ramsey02stochastic}, Kalman filters \cite{russel03ai,
  fox05bayesian}, multi-hypothesis Kalman filters \cite{fox05bayesian}, or
particle filters \cite{fox05bayesian, doucet00sequential}.  Similarly,
Bayes' theorem may be implemented using explicit calculations, the rejection
method \cite{ramsey02stochastic}, importance sampling
\cite{ramsey02stochastic}, weighted particle filters, or more sophisticated
techniques.

Of course, each of these choices comes with tradeoffs.  Discrete
distributions offer exact answers, but they require exponential time to
solve certain problems.  Sampling functions can represent arbitrary
distributions, but they can only calculate approximate distributions.
Kalman filters are extremely efficient, but they treat all distributions as
simple Guassians.

To further complicate matters, many of these techniques are traditionally
described in ways that blur the underlying algebraic connections.  For
example, the various implementations of Bayes' theorem have quite a lot in
common.  But those similarities are hard to see if we compare the textbook
formula for Bayes' theorem \cite{russel03ai} with a description of weighted
particle filters \cite{fox05bayesian}.

In an ideal world, we would be able consolidate as many of these techniques
as possible under a single programming interface, making the algebraic
connections obvious.

\subsection{Coping with faulty intuitions}
\label{BackgroundIntuitions}

Probability is often counter-intuitive.  Even physicians, who work with
diagnostic tests on a daily basis, commonly make order-of-magnitude errors
in interpretation.  In an informal study by Eddy, most physicians concluded
that a patient with an apparently benign breast mass but a positive
mammogram had a 75\% chance of cancer.  The actual chance of cancer, as
calculated by Bayes' theorm, was only 7.7\% \cite{eddy82probabilistic}.

Other probability puzzles are similarly misleading.  In a famous {\it
  Parade Magazine} article, Marilyn vos Savant described the ``Monty Hall''
problem, in which contestants must choose from several doors, one of which
hides a prize \cite{savant90monty}.  Most of vos Savant's readers chose the
wrong door, thanks to subtle ambiguities in the problem statement and the
use of inappropriate heuristics \cite{mueser99monty}.

To correctly answer a question about probability, we must first specify the
details.  In particular, we must know the starting conditions, the
protocols followed by any agents in the puzzle, and the exact values we
want to compute.  This kind of precise specification is well-suited to a
programming language.

\subsection{Separating the problem from the math}
\label{BackgroundBayesExample}

Once we've decided how to represent a problem, and specified it precisely,
we still need to do the math.  But the math itself is frequently a
barrier---instead of focusing on the actual problem, we often wind up
trying to remember which version of Bayes' theorem applies in our
particular case.  Even worse, this preoccupation with formulas can blind us
to simpler ways of looking at the problem.

Consider the influenza test in Section \ref{Intro}.  We know that the test
has a 30\% false-positive rate and a 10\% false-negative rate.  If 10\% of
the population has influenza, we know that
\begin{align*}
P(I) &= 0.1 & P(+||I) &= 0.7 & P(+||\neg I) &= 0.1
\end{align*}
\noindent where $P(I)$ is the probability that a patient has influenza,
$P(+||I)$ is the probability of a true positive, and $P(+||\neg I)$ is the
probability of a false positive.  We can plug these numbers into Bayes'
theorem, and calculate $P(I||+)$, the probability that patient has
influenza, given a positive test result.
\[
P(I||+) = \frac{P(+||I)P(I)}{P(+||I)P(I) + P(+||\neg I)P(\neg I)} = \frac{7}{16}
\]

But we can solve this problem more easily using a visual approach.
Assuming we have 100 patients, we can separate them into 10 patients with
influenza, and 90 healthy patients.  If we then represent positive test
results with ``$+$'', our population looks like:
\[
\renewcommand{\latticebody}{
   \ifnum\latticeB=9
     % Do nothing.  This is our spacer line.
   \else
     \ifnum\latticeB=10 % Patients with influenza.
        \ifnum\latticeA<7  \drop@@{+}  \else  \drop@@{o}  \fi
     \else % Healthy patients.
        \ifnum\latticeA<1  \drop@@{+}  \else  \drop@@{o}  \fi
     \fi
  \fi
}
\begin{xy}
*\xybox{0;<4mm,0mm>:<0mm,4mm>::
        ,0,{\xylattice{0}{9}{0}{10}}="P"
        % ,{(-5,9) \ar@@{-} (9.5,9)}
        ,"P"+(-1,10)*+!R{\rm Influenza}
        ,"P"+(-1,8)*+!R{\rm Healthy}}
\end{xy}\]

% TODO: Try to use vulgar fraction below?

\noindent We can see that $7/16$ths of positive test results occur in
patients with influenza.  The ease with which we can read answers off this
diagram represents our ideal; any programming language should be as
straightforward.


\section{The probability monad toolkit}
\label{Toolkit}
% 2 pages.

We are now ready to introduce our toolkit for building probability monads.
Our toolkit relies on two main ideas:

\begin{enumerate}
\item \term{Monads} allow us to split a computation into two parts: the
  main program, and the bookkeeping details \cite{moggi89computational,
    wadler90comprehending}.  In the main program, we describe our problem
  in high-level terms, leaving out most of the math.  But behind the
  scenes, a monad keeps track of the math for us, calculating probabilities
  and applying Bayes' theorem.
\item \term{Monad transformers} allow us to start with base monads, and
  layer on extra features as needed \cite{moggi89abstract, king92combining,
    steele94building, espinosa95semantic}.  Our base monads represent
  simple ideas: lists, sampling functions, and sets of particles.
  Everything else---including the probability calculations, the weights of
  the particles, and the implementation of Bayes' theorem---is supplied by
  one or more monad transformers.
\end{enumerate}

\noindent Using only three base monads and two monad transformers, we are
able to construct a wide range of probability distribution monads (Table
\ref{ToolkitTable}).

\begin{table}
\begin{tabular}{ll}
|[]| & Lists of outcomes \\
|PerhapsT []| & Discrete distributions \\
|MaybeT (PerhapsT [])| & \ldots{}with rejection \\
\\
|MC| & Monte Carlo sampling \\
|MaybeT MC| & \ldots{}with rejection \\
|PerhapsT MC| & \ldots{}with weights \\
\\
|SMC| & Sequential Monte Carlo sampling\\
|MaybeT SMC| & \ldots{}with rejection \\
|PerhapsT SMC| & \ldots{}with weights \\
\end{tabular}
\caption{The probability monad toolkit.}
\label{ToolkitTable}
\end{table}

\begin{table*}
\begin{tabular*}{\textwidth}{@@{\extracolsep\fill}lllll}
& \begin{tabular}{l}|[]| (list monad)\end{tabular}
& \begin{tabular}{l}|PerhapsT []|\end{tabular}
& \begin{tabular}{l}|MaybeT (PerhapsT [])|\end{tabular} & \\ \\

& \begin{tabular}{l}
|(Flu,Pos)| \\
|(Flu,Neg)| \\
|(Healthy,Pos)| \\
|(Healthy,Neg)| \\
\end{tabular}

& \begin{tabular}{r@@{\ }l}
 7\percent & |(Flu,Pos)|     \\
 3\percent & |(Flu,Neg)|     \\
 9\percent & |(Healthy,Pos)| \\
81\percent & |(Healthy,Neg)| \\
\end{tabular}

& \begin{tabular}{r@@{\ }l}
 7\percent & |Just (Flu,Pos)| \\
 3\percent & |Nothing| \\
 9\percent & |Just (Healthy,Pos)| \\
81\percent & |Nothing|
\end{tabular} &

\end{tabular*}
\caption{Building a monad in layers, with example data.}
\label{BuildingBDDist}
\end{table*}

\subsection{The list monad}
\label{ToolkitList}

The list monad allows us to combine elements from several lists, generating
every possible outcome \cite{wadler90comprehending}.  Continuing with our
influenza example, we define two data types:

\begin{spec}
data Status  = Flu  |  Healthy
data Test    = Pos  |  Neg
\end{spec}

\noindent Using Haskell's |do|-notation, we pick one item from each list
and return the result.  In this example, the |<-| symbol should be read as
``pick one item from.''

> outcomes :: [(Status,Test)]
> outcomes = do
>   status  <-  [Flu, Healthy]
>   test    <-  [Pos, Neg]
>   return (status,test)

The type declaration |[(Status,Test)]| is important, because it tells
Haskell to interpret the |do|-body as a computation in the list monad.
When we run this code, it will return a list of every possible outcome:

\begin{quote}
\begin{tabular}{l@@{\qquad}l}
|(Flu,Pos)|     & |(Flu,Neg)| \\
|(Healthy,Pos)| & |(Healthy,Neg)| \\
\end{tabular}
\end{quote}

\noindent Whenever the list monad encounters |<-|, it makes \emph{every}
possible choice, backtracking as necessary.  The list monad also appears in
Haskell and other languages as a \term{list comprehension}, a special
syntax for building lists \cite{wadler90comprehending}:

\begin{spec}
[(status,test) |  status  <-  [Flu, Healthy],
                  test    <-  [Pos, Neg]]
\end{spec}

\noindent For more information on the list monad, see Section \ref{Monads}.

\subsection{The PerhapsT monad transformer}
\label{ToolkitPerhapsT}

By itself, the list monad has no way to keep track of probabilities.  We
can fix this using the |PerhapsT| monad transformer.

\begin{spec}
type DDist = PerhapsT []
\end{spec}

\noindent |PerhapsT| takes an existing monad, and attaches a probability to
each value in the computation.  The probabilities are tracked invisibly in
the background, giving us a discrete probability distribution:

\begin{quote}
\begin{tabular}{r@@{\ }l@@{\qquad}r@@{\ }l}
7\percent & |(Flu,Pos)|     &   3\percent & |(Flu,Neg)| \\
9\percent & |(Healthy,Pos)| &  81\percent & |(Healthy,Neg)| \\
\end{tabular}
\end{quote}

\noindent The code that generates this distribution is similar to our
previous example.  We replace the lists with calls to |weighted|, which
constructs weighted distributions:

> weightedOutcomes :: DDist (Status,Test)
> weightedOutcomes = do
>   status  <-  weighted [(10,Flu), (90,Healthy)]
>   test    <-
>     if (status == Flu)
>       then  weighted [(  70,  Pos), (  30,  Neg)]
>       else  weighted [(  10,  Pos), (  90,  Neg)]
>   return (status,test)

When run, |weightedOutcomes| returns a list of possible results and their
probabilities, as shown in the table above.  Note that the final |DDist| monad is equivalent
to Erwig and Kollmansberger's |Dist| monad \cite{erwig06probabilistic}.
For more information on the |PerhapsT| monad transformer, see Section
\ref{MonoidsAndDiscrete}.

\subsection{The MaybeT monad transformer}
\label{ToolkitMaybeT}

In Section \ref{BackgroundBayesExample}, we implemented Bayes' theorem by
counting up all the patients marked ``$+$'', and ignoring the rest.  We can
get a similar effect using the |MaybeT| monad transformer:

\begin{spec}
type BDDist = MaybeT DDist
\end{spec}

\noindent |MaybeT| takes an existing monad, and replaces each value of type
|a| with either |Just a| or |Nothing| \cite{NewMonads}.  The |Nothing|
values represent failed ``branches'' in the computation, values of no
interest to us.  Filtering for |test == Pos|, we get:

\begin{quote}
\begin{tabular}{r@@{\ }l@@{\qquad}r@@{\ }l}
7\percent & |Just (Flu,Pos)|     &   3\percent & |Nothing| \\
9\percent & |Just (Healthy,Pos)| &  81\percent & |Nothing| \\
\end{tabular}
\end{quote}

\noindent At the end of the computation, we can discard all the |Nothing|
values, and scale the remaining percentages so that they sum to
100\percent.  Compare this example to the diagram in Section
\ref{BackgroundBayesExample}.  Both are implementations of Bayes' theorem
using the rejection method and a normalization factor
\cite{park05probabilistic, russel03ai}.

The |filteredWeightedOutcomes| function is identical to our earlier
|weightedOutcomes|, except for the type declaration and the |guard|
function on the second-to-last line:

\begin{hidden}

> filteredWeightedOutcomes :: BDDist (Status,Test)
> filteredWeightedOutcomes = do
>   status  <-  weighted [(10,Flu), (90,Healthy)]
>   test    <-
>     if (status == Flu)
>       then  weighted [(  70,  Pos), (  30,  Neg)]
>       else  weighted [(  10,  Pos), (  90,  Neg)]
>   guard (test == Pos)
>   return (status,test)

\end{hidden}

\begin{spec}
filteredWeightedOutcomes :: BDDist (Status,Test)
filteredWeightedOutcomes = do
  status  <-  ...
  test    <-  ...
  guard (test == Pos)
  return (status,test)
\end{spec}

\noindent The |guard| function checks to see if |test == Pos|, and if not,
replaces the current branch of the computation with |Nothing|.

See Table \ref{BuildingBDDist} for a step-by-step breakdown of how the
$PerhapsT$ and $MaybeT$ monad transformers are used to construct |BDDist|.
For more information on the |MaybeT| monad transformer, see Section
\ref{BayesAndMaybeT}.

\subsection{The MC and SMC monads}
\label{ToolkitMC}

\term{Monte Carlo algorithms} rely on random sampling to compute
approximate answers \cite{mitzenmacher05probability}.  We implement random
sampling using the |MC| monad, described under various names by previous
researchers \cite{ramsey02stochastic, park05probabilistic,
  erwig06probabilistic}.  The example code for the |MC| monad is identical
to |weightedOutcomes|, except for the type declaration:

\begin{spec}
sampledOutcomes  ::  MC (Status,Test)
sampledOutcomes  =  ...

sample sampledOutcomes 10
\end{spec}

\noindent The |sample| function runs our code repeatedly, collecting the
results in a list.

\term{Sequential Monte Carlo sampling}, also known as \term{particle
  filtering}, differs from ordinary sampling in that all our samples pass
through each step of the computation as a group \cite{fox05bayesian,
  doucet00sequential}.  See Figure \ref{MCvsSMC} for a comparison of the
two approaches.  We can perform sequential Monte Carlo sampling using the
|SMC| monad. % TODO - reword.

\begin{figure}
\begin{center}
%\includegraphics[scale=0.4]{figs.1}

\[\xygraph{
[]{x_1}="X1" : [r]{y_1}="Y1" : [r]{z_1}="Z1" & {x_1}="X1S" & {y_1}="Y1S" & {z_1}="Z1S" \\
  {x_2}="X2" : [r]{y_2}="Y2" : [r]{z_2}="Z2" & {x_2}="X2S" & {y_2}="Y2S" & {z_2}="Z2S" \\
  {x_3}="X3" : [r]{y_3}="Y3" : [r]{z_3}="Z3" & {x_3}="X3S" & {y_3}="Y3S" & {z_3}="Z3S"
% Dotted arrows on left.
"Z1" :@@{-->} "X2"
"Z2" :@@{-->} "X3"
% Arrows on right.
"X2S" :@@3 "Y2S" :@@3 "Z2S"
% Boxes on right.
[] !{"X1S";"X3S"**\frm{-}}
   !{"Y1S";"Y3S"**\frm{-}}
   !{"Z1S";"Z3S"**\frm{-}}
}\]

\end{center}
\label{MCvsSMC}
\caption{Monte Carlo sampling (left) and sequential Monte Carlo sampling
  (right).  In the latter case, we represent our samples with a cloud of
  particles, which travel as a group through each step $x,y,z$ of the
  computation.}
\end{figure}

The |MC| and |SMC| monads can also be combined with our monad transformers
(Table \ref{ToolkitTable}).  For example, we can use |PerhapsT| to
construct a weighted version of the |SMC| monad:

\begin{spec}
type WSMC = PerhapsT SMC
\end{spec}

\noindent Using the |WSMC| monad, we can replace our earlier calls to
|guard| with conditional probabilities.  For example, if we know a
patient's test result is positive, we can write:

> statusGivenPosResult  ::  WSMC Status 
> statusGivenPosResult  =   do
>   status  <-  weighted [(10,Flu), (90,Healthy)]
>   if (status == Flu)
>     then  applyProb 0.7
>     else  applyProb 0.1
>   return status

This is a classic weighted particle filter \cite{fox05bayesian}.  For more
information on the |MC| and |SMC| monads, see Sections \ref{MC} and
\ref{SMC}, respectively.

\begin{spec}
\end{spec}



% -------------------------------------------------------------------------
% Details sections
% 5 pages.

\section{Monads and probability in Haskell}
\label{MonadsAndProbability}

So far, our presentation of probability distribution monads has been
informal.  Now we provide the theory that supports the toolkit, beginning
with a short introduction to monads and monad transformers.

\subsection{Monads}
\label{Monads}

In Haskell, a \term{type class} describes an abstract interface, which may
be implemented by one or more actual types.  The |Monad| type class
specifies two functions that must be defined by every monad |m|
\cite{wadler90comprehending, haskell98}.  It also constrains |m| to be a
type constructor with a single argument:

\begin{spec}
class Monad m where
  return  ::  a -> m a
  (>>=)   ::  m a -> (a -> m b) -> m b
\end{spec}

\noindent The |return| function takes a value of type |a|, and constructs a
new value ``in the monad,'' that is, a new value of type |m a|.  The |>>=|
operator\footnote{Read as ``bind.''} is a bit trickier.  It is best
understood as two operations, |liftM| and |join|:

\begin{spec}
liftM     ::  (Monad m) => (a -> b) -> m a -> m b
join      ::  (Monad m) => m (m a) -> m a

ma >>= f  =   join (liftM f ma)
\end{spec}

\noindent The |liftM| function is analogous to the standard |map| function.
Given a function of type |a -> b|, and value of type |m a|, it reaches
inside |m a|, and replaces |a| with |b|.  Similarly, the |join| function is
analogous to |concat|.  It takes a nested value of type |m (m a)|, and
collapses into a value of type |m a|.

For example, Haskell's standard list type forms a monad.  Note the use of
|concat| and |map| in place of |join| and |liftM|:

\begin{spec}
instance Monad [] where
  return x  =  [x]
  ma >>= f  =  concat (map f ma)
\end{spec}

\noindent Haskell's |do|-notation is syntactic sugar for the |>>=|
operator.  It expands as follows:

\begin{spec}
do  x <- [1,2]
    return (x*3)

[1,2] >>= (\x -> return (x*3))
\end{spec}

\noindent These two expressions are equivalent, and both return |[3,6]|.

\begin{hidden} % Rmoved 
% TODO - Edit more
Monads have their roots in category theory.  The |return| function
corresponds to the arrow $\eta_M\!: A \to \M\,A$, the |join| function to the
arrow $\mu_M\!: \M(\M\,A) \to \M\,A$, and the combination of |m| and |liftM|
to the functor $\M$\footnote{A \term{functor} is essentially a
  single-argument type constructor with an associated version of |map|.}
Thus, we can describe a monad with a triple $(M,\eta_M,\mu_M)$.
\end{hidden}

\subsection{Probabilities and distributions}
\label{Dist}

We represent a probability as a rational number, allowing us to work
exact probabilities whenever possible.

> newtype Prob = Prob Rational
>   deriving (Eq, Ord, Num, Fractional)

The |deriving| clause automatically implements the specified type classes
for us, based on the implementations for |Rational|.

The |Dist| type class specifies the interface that must be implemented by a
distribution type |d|.

> class (Monad d) => Dist d where
>   weighted :: [(Rational, a)] -> d a

The constraint |(Monad d)| requires all distributions to be monads.  The
|Dist| type class requires only one function, |weighted|, which constructs
a weighted distribution from a list of weights and values.

We can define a |uniform| distribution by assigning each value a weight of
$1$:

> uniform :: Dist d => [a] -> d a
> uniform = weighted . map (\v -> (1, v))

For a more realistic |Dist| interface, see earlier papers by Park and
colleagues \cite{park05probabilistic} and Erwig and Kollmansberger
\cite{erwig06probabilistic}.

\subsection{The MC monad: An example}
\label{MC}

We can turn the proposed |Rand| monad \cite{NewMonads} into a probability
distribution by defining a |Dist| instance for it.  This corresponds to the
sampling monads which have appeared in a number of earlier papers
\cite{ramsey02stochastic, park05probabilistic, erwig06probabilistic}.

> type MC = Rand StdGen
>
> instance Dist MC where
>   weighted wvs = fromList (map flipPair wvs)
>     where flipPair (a,b) = (b,a)

We create a type alias |MC|, which refers to |Rand StdGen|.  The parameter
|StdGen| specifies what source of random numbers will be used, and the
|fromList| function makes a weighted selection from a list.  We also define
a |sample| function:

> sample ::  MC a -> Int -> MC [a]
> sample r n = sequence (replicate n r)

The standard |sequence| function converts a value of type |[m a]| into a
value of type |m [a]|.

\subsection{Monad transformers}

In denotational semantics, a \term{monad morphism} is a mapping from one
monad to another \cite{moggi89abstract}.  Monad morphisms take an
underlying monad, and add features such as state, an environment, or
continuations.  Less formally, monad morphisms may be thought of as
type-level functions from one monad to another.

In Haskell, we can achieve the same result using a monad transformer
\cite{liang95monad, jones95functional, shan07monad}.  To define a monad
transformer |ExampleT|, we must implement both |return| and |>>=| in terms
of |m|, the monad to be transformed:

\begin{spec}
instance (Monad m) => Monad (ExampleT m) where
  return    = ...
  ma >>= f  = ...
\end{spec}

\noindent We must also provide an implementation of |lift|, which maps
values from a monad |m| into its transformed counterpart |t m|.

\begin{spec}
class MonadTrans t where
  lift :: Monad m => m a -> t m a
\end{spec}

\noindent For examples of monad transformers, see Sections
\ref{MVT} and \ref{BayesAndMaybeT}.

\section{Monoids and discrete distributions}
\label{MonoidsAndDiscrete}

In this section, we introduce monoids, $M$-sets, and the |MVT| monad
transformer.  Using these pieces, we build the |DDist| distribution
described in Section \ref{ToolkitPerhapsT}.  Special thanks go to Dan
Piponi, for inspiring the treatment of $M$-sets in Section \ref{Msets}, and
to Cale Gibbard, for first noticing the decomposition of |DDist| into
|WriterT Prob []|.

\subsection{Monoids}

A \term{monoid} is a triple $(M,e,\otimes)$, where $M$ is a set, $\otimes$
is an associative binary operation, and $e$ is an identity element such
that $e \otimes x = x \otimes e = x$.  In Haskell, we can represent a
monoid using the built-in |Monoid| type class \cite{jones95functional}:

\begin{spec}
class Monoid a where
  mempty   ::  a
  mappend  ::  a -> a -> a
\end{spec}

\noindent Here, |mempty| is $e$, and |mappend| is $\otimes$.

Probabilities form a monoid $(P,1,\times)$, where $P$ is the set of all
real numbers between 0 and 1 inclusive, and $\times$ is ordinary
multiplication.  In Haskell, we write this as:

> instance Monoid Prob where
>   mempty           =  1
>   p1 `mappend` p2  =  p1 * p2

\subsection{M-sets and the MV monad}
\label{Msets}

An \term{$M$-set} is a set $X$ with a monoid action $(\cdot)$, such that
\begin{align}
e \cdot x           &= x \\
m_1 \cdot (m_2 \cdot x) &= (m_1 \otimes m_2) \cdot x
\end{align}
where $x \in X$, and $m,n \in M$.  A \term{free $M$-set} is any $M$-set
where
\begin{align}
m \cdot x &= x \qquad \text{only if $m = e$.}
\end{align}
For an equivalent definition of $M$-sets, see Kilp and colleagues
\cite[pp. 43--44,68]{klip00monads}.

Given a set $S$, define $P \times S$ to be the set of pairs $(p,s)$ such
that $p \in P$ and $s \in S$.  In other words, the set $P \times S$
corresponds to elements of $S$ annotated with probabilities.  Now define
our monoid action to be $p_1 \cdot (p_2, s) = (p_1p_2,s)$.  We can easily
see that
\begin{align}
1 \cdot (p,s) &= (p,s) \\
p_1 \cdot (p_2 \cdot (p_3,s)) &= (p_1p_2p_3,s) \notag\\
                              &= (p_1p_2) \cdot (p_3,s) \\
p_1 \cdot (p_2,s) &= (p_2,s) \qquad \text{only if $p_1=1$.}
\end{align}
From this, we conclude that $P \times S$ is a free $M$-set.

In Haskell, we can represent an element of an $M$-set using the type |MV|,
which contains a monoid and a value.

> data (Monoid w) => MV w a =
>   MV { mvMonoid :: w, mvValue :: a }

The type |MV Prob a| corresponds to the set $P \times S$ above.
Interestingly, we can make |MV| a monad.

> mapMV f (MV w v) = MV w (f v)
> joinMV (MV w1 (MV w2 v)) = 
>   MV (w1 `mappend` w2) v
>
> instance (Monoid w) => Monad (MV w) where
>   return v = MV mempty v
>   mv >>= f = joinMV (mapMV f mv)

The |return| function corresponds to a map $x \mapsto (e,x)$, and |mempty|
to the monoid identity $e$.  The |joinMV| function corresponds to our
monoid action $(\cdot)$.

Haskell programmers will recognize the |MV| monad as a stripped-down
version of the standard |Writer| monad \cite{jones95functional}.  We omit
the |listen| and |pass| functions, which are useless in this context.  We
also omit the |tell| function, which we replace with |applyProb|:

> class (Monad m) => MonadCondProb m where
>   applyProb :: Prob -> m ()
>
> instance MonadCondProb (MV Prob) where
>   applyProb p = MV p ()

\noindent We can use |applyProb| and |MV| to multiply a set of independent
probabilities together.  For example, if we have an influenza test with a
30\percent{} false negative rate, and we know that 10\percent{} of the
population has influenza, we can calculate the actual percentage people
receiving a false positive:

> flu             =   applyProb 0.1
> falseNegative   =   applyProb 0.3
>
> missedFluCases  ::  MV Prob ()
> missedFluCases  =   do
>   flu
>   falseNegative

The |MV| monad multiplies 0.1 and 0.3 behind the scenes, and returns |MV
0.03 ()|.  This tells us that 3\percent{} of the population will have
influenza \emph{and} incorrectly receive a negative test result.
Conceptually, |missedFluCases| corresponds to a single ``path'' through the
example in Section \ref{ToolkitPerhapsT}, picking out only the case
|(Flu,Neg)|.  Note that |MV Prob| is not a probability distribution monad!
At best, it represents a single element of a probability distribution.

\subsection{The MVT monad transformer}
\label{MVT}

The |MV| monad is only marginally useful by itself.  We need some way to
combine the features of |MV| with other monads.  We do this by defining
|MVT|, a stripped-down version of the |WriterT| monad transformer
\cite{jones95functional}.

> newtype (Monoid w, Monad m) => MVT w m a =
>   MVT { runMVT :: m (MV w a) }
>
> instance  (Monoid w) =>
>           MonadTrans (MVT w) where
>   lift mv = MVT (do  v <- mv
>                      return (MV mempty v)) 

The trick is to take our underlying monad |m a|, and replace every
occurrence of |a| with |MV w a|.  The |runMVT| function simply extracts the
value packed inside our |MVT| wrapper.  The |lift| function will take an
existing value of type |m a|, and promotes it to a value of type |MVT w a|.

The |Monad| instance for |MVT| is closely modelled on that of |MV|.

> instance  (Monoid w, Monad m) =>
>           Monad (MVT w m) where
>   return    =  lift . return
>   ma >>= f  =  MVT bound
>     where  bound = do
>              (MV w1 v1)  <-  runMVT ma
>              (MV w2 v2)  <-  runMVT (f v1)
>              return (MV (w1 `mappend` w2) v2)

We define the |return| function for |MVT| in terms of |lift| and the
underlying monad's |return|.  The bind function is a bit trickier.  We need
|runMVT ma| to get a value of type |m (MV w a)|, and use |<-| to strip off
|m|.  The final two lines perform the same work as |mapMV| and |bindMV|,
respectively, but do so in the context of our underlying monad.

\subsection{The DDist monad}
\label{DDist}

Now we're finally ready to build our discrete distribution monad.  We
define |PerhapsT| to be an alias for |MVT Prob|, and apply it to the
standard list monad.

> type PerhapsT = MVT Prob
>
> type DDist = PerhapsT []

The list monad, as we saw in Section \ref{ToolkitList}, follows every
possible ``branch'' of a computation.  When we apply |PerhapsT| to the list
monad, we associate a probability with each branch.  To split a branch into
sub-branches, we use |weighted|.

> instance Dist DDist where
>   weighted wvs = MVT (map toMV wvs)
>     where toMV (w,v) = MV (Prob (w / total)) v
>           total = sum (map fst wvs)

Note that we use |total| to normalize the weights, forcing them to add up
to 1.  This not only ensures that |weighted| returns a valid probability
distribution, it also guarantees that the weights calculated by |>>=| for
our sub-branches will add up to the weight of our original branch.

The |DDist| monad is a stripped-down version of Erwig and Kollmansberger's
probability monad \cite{erwig06probabilistic}.  The factoring of |DDist|
into |WriterT Prob []| has been independently discovered by several people,
including Cale Gibbard.

% Cale Gibbard on using WriterT blah [] to build a probability monad:
% http://syntaxfree.wordpress.com/2007/02/11/making-a-monad-martin-erwigs-dist/#comment-379

\section{Bayes' theorem and MaybeT}
\label{BayesAndMaybeT}

In Sections \ref{BackgroundBayesExample} and \ref{ToolkitMaybeT}, we used
Bayes' theorem \cite{bayes1763chances} to interpret the result of an
influenza test.  In this section, we show how to implement Bayes' theorem
using the |MaybeT| monad transformer.

\subsection{Bayes' theorem}
\label{BayesTheoremProof}

We adopt the convention that $\Pb(A)$ specifies a vector of probabilities,
one for each possible value of $A$.
\begin{equation}
\Pb(A) = \langle P(A=a_1), \ldots, P(A=a_n) \rangle
\end{equation}
Two such vectors may be multiplied pointwise.
\begin{align}
\Pb(B=b||A)\Pb(A) = \langle &P(B=b||A=a_1)P(A=a_1), \ldots, \notag\\
                            &P(B=b||A=a_n)P(A=a_n) \rangle
\end{align}
Using this notation, we can state Bayes' theorem \cite[p. 479]{russel03ai}
as
\begin{align}
\Pb(A||B=b) &= \frac{\Pb(B=b||A)\Pb(A)}{P(B=b)} \label{bayesB} \\
            &= \frac{\Pb(B=b||A)\Pb(A)}{\sum_i P(B=b||A=a_i)P(A=a_i)}
               \label{bayesSum}
\end{align}

\noindent Now recall the program from Section \ref{ToolkitMaybeT}.

\begin{spec}
filteredWeightedOutcomes :: BDDist (Status,Test)
filteredWeightedOutcomes = do
  status  <-  weighted [(10,Flu), (90,Healthy)]
  test    <-  ...
  guard (test == Pos)
  return (status,test)
\end{spec}

\noindent This returns a list of possible results, and their corresponding
probabilities:

\begin{quote}
\begin{tabular}{r@@{\ }l@@{\qquad}r@@{\ }l}
7\percent & |Just (Flu,Pos)|     &   3\percent & |Nothing| \\
9\percent & |Just (Healthy,Pos)| &  81\percent & |Nothing| \\
\end{tabular}
\end{quote}

\noindent The |Nothing| values represent those branches of our computation
on which |guard (test == Pos)| failed.

We can apply Bayes' theorem to this example as follows.  Let
\begin{equation}
\Pb(A)=\langle P(A=a_1),\ldots,P(A=a_n)\rangle
\end{equation}
where $a_1,\ldots,a_n$ are the results of each branch of our computation.
Let $G=g_1 \wedge \cdots \wedge g_m$, where $g_i$ is true if and only if
our $i$-th guard clause succeeds.  This gives us
\begin{equation}\label{bayesGuard}
\Pb(A||G) = \frac{\Pb(G||A)\Pb(A)}{\sum_i P(G||A=a_i)P(A=a_i)}
\end{equation}
But for each branch $a_i$ of our computation, $G$ is true if and only if
$a_i \neq |Nothing|$.  This gives us
\begin{equation}
P(G||A=a_i) =
  \left\{\begin{array}{l}
           1 \quad \text{if $a_i \neq |Nothing|$} \\
           0 \quad \text{if $a_i = |Nothing|$}
         \end{array}\right.
\end{equation}
allowing us to simplify \eqref{bayesGuard} to
\begin{equation}
\Pb(A||G) = \frac{\Pb(G||A)\Pb(A)}{\sum_{a_i \neq |Nothing|} P(A=a_i)}
\end{equation}
But this is equivalent to discarding all the |Nothing| terms, and
normalizing the remaining probabilities.  Compare this result to the
diagram in Section \ref{BackgroundBayesExample}.

% TODO: Float Bayes diagram so we can refer to it from here?

\subsection{The MaybeT monad transformer}

To build our Bayesian monad, we need an implementation of |MaybeT|
\cite{NewMonads}.  The |MaybeT| monad transformer lifts a computation of
type |m a| to a computation of type |m (Maybe a)|.

\begin{spec}
newtype MaybeT m a =
  MaybeT { runMaybeT :: m (Maybe a) }

instance MonadTrans MaybeT where
  lift x = MaybeT (liftM Just x)
\end{spec}

\noindent Here, |liftM Just| has the type |m a -> m (Maybe a)|, where |m|
is the monad being lifted.

We also need to define new versions of |return| and |>>=| using the
versions defined by |m|.

\begin{spec}
instance (Monad m) => Monad (MaybeT m) where
  return x  =  lift (return x)
  ma >>= f  =  MaybeT (runMaybeT ma >>= f')
    where  f' Nothing   =  return Nothing
           f' (Just x)  =  runMaybeT (f x)
\end{spec}

\noindent There are two key concepts here:

\begin{enumerate}
\item Neither |return| or |>>=| create new |Nothing| values.  This must be
  done using |mzero|, defined in Section \ref{BDDist}.
\item Once a |Nothing| value is inserted into the monad, it will be passed
  along unchanged by |>>=|, suppressing all further calls to any function
  |f|.
\end{enumerate}

\noindent Note that we provide no |MonadPlus| instance for |MaybeT|,
because it has ambiguous semantics.  It could refer to either

\begin{spec}
instance (MonadPlus m) => MonadPlus (MaybeT m)
\end{spec}

\noindent which lifts the semantics of an underlying |MonadPlus| instance,
or

\begin{spec}
instance (Monad m) => MonadPlus (MaybeT m)
\end{spec}

\noindent which provides semantics similar to |MonadPlus Maybe|.
Therefore, we leave the choice up to the user of |MaybeT|.

\subsection{The BDDist monad}
\label{BDDist}

Now we are ready to define |BDDist|, a discrete distribution monad with
support for Bayes' theorem.  We apply |MaybeT| to |DDist|, and lift the
underlying |weighted| function into the new monad.

> type BDDist = MaybeT DDist
>
> instance (Dist d) => Dist (MaybeT d) where
>   weighted wvs = lift (weighted wvs)

The |guard| function\footnote{Earlier versions of this paper used a
  |condition| function.  Thank you to David House for noticing that
  |condition| was identical to |guard|.} is actually supplied by the
standard Haskell library \cite{haskell98}:

\begin{spec}
guard ::  MonadPlus m => Bool -> m ()
guard cond  =  if cond then return () else mzero
\end{spec}

\noindent If |cond| is true, then |guard| returns |()|.  This continues the
current branch of the computation unchanged.  But if |cond| is false,
|guard| returns |mzero|.  This has the effect of injecting |Nothing| into
the computation, killing off the current branch.

To take advantage of |guard|, we need to make |BDDist| an instance of
|MonadPlus|.  Haskell also forces us to supply |mplus|, which we don't need
in this paper.

\begin{hidden}

> instance MonadPlus BDDist where
>   mzero = MaybeT (return Nothing)
>   d1 `mplus` d2
>      | isNothing (bayes d1)  = d2
>      | otherwise             = d1

\end{hidden}

\begin{spec}
instance MonadPlus BDDist where
  mzero = MaybeT (return Nothing)
  d1 `mplus` d2 = ...
\end{spec}

We now have a monad which supports |guard|, and returns values of type
|Maybe a|.  We want to turn that |Maybe a| back into |a|, keeping the
|Just| values and discarding the |Nothing| values.  We can do this using
|catMaybes'|.
 
> catMaybes'  ::  (Monoid w) =>
>                [MV w (Maybe a)] -> [MV w a]
> catMaybes'  =  map (liftM fromJust) .
>                filter (isJust . mvValue)

Now we're ready to implement Bayes' theorem, following the strategy
outlined in Section \ref{BayesTheoremProof}.  We use |catMaybes'| to
discard all the |Nothing| values, and sum the remaining probabilities into
|total|.

> bayes :: BDDist a -> Maybe (DDist a)
> bayes bfd
>     | total == 0  =  Nothing
>     | otherwise   =  Just (weighted (map unpack events))
>   where
>     events  =  catMaybes' (runMVT (runMaybeT bfd))
>     total   =  sum (map mvMonoid events)
>     unpack (MV (Prob p) v) = (p,v)

If |total==0|, then our |guard| conditions have failed on every possible
path, and we return |Nothing|.  Otherwise, we construct a discrete
probability distribution, using |weighted| to perform the actual
normalization.

\subsection{The BMC monad}
\label{BMC}

We can use |MaybeT| to define a second Bayesian monad, this one based on
rejection sampling \cite{park05probabilistic}.  Again, we omit the details
of |mplus|.

\begin{hidden}

> type BMC = MaybeT MC
>
> instance MonadPlus BMC where
>   mzero = MaybeT (return Nothing)
>   d1 `mplus` d2 = MaybeT choose
>     where  choose = do
>              x1 <- runMaybeT d1
>              case x1 of
>                Nothing  ->  runMaybeT d2
>                Just _   ->  return x1

\end{hidden}

\begin{spec}
type BMC = MaybeT MC

instance MonadPlus BMC where
  mzero = MaybeT (return Nothing)
  d1 `mplus` d2 = ...
\end{spec}

\noindent The |BMC| monad returns samples of type |Maybe a|.  Once again,
we want to keep the |Just| values and discard the |Nothing| values.  We can
do that using the |sampleWithRejections| function, which takes $n$ samples
from a distribution, rejects those samples equal to |Nothing|, and returns
the rest.  These remaining samples are the ones that made it by all of our
|guard| conditions.

> sampleWithRejections  ::  BMC a -> Int -> MC [a]
> sampleWithRejections d n =
>   (liftM catMaybes) (sample (runMaybeT d) n)

Note that this function may return far fewer than $n$ samples.  This is
because the distribution |d|, in a worst-case scenario, may never produce
any samples at all.  This will occur with the distribution |guard False|,
and other distributions with impossible-to-satisfy guard conditions.
Enhanced versions of |sampleWithRejections| must take care not to hang in
these circumstances.

\section{Sequential Monte Carlo sampling}
\label{SMC}

Sequential Monte Carlo sampling is used in robot localization, computer
vision, signal processing and econometrics \cite{fox05bayesian,
  doucet00sequential}.  It differs from regular Monte Carlo sampling in
that it represents samples as sets of ``particles,'' the values of which
are updated over time.  In a typical application, each particle might
represent one possible location of a robot.  Initially, the particles are
positioned at random.  As the robot moves, the particles are moved along
with it (perhaps with some random variation, if the speed of the robot is
uncertain).  If a particle winds up in an impossible location, such as
inside a wall, that particle can be discarded and a new particle allocated
elsewhere.  The cloud of particles will eventually converge on one or more
probable locations for the robot.

The major advantage of sequential Monte Carlo sampling over ordinary
sampling is the ability to reallocate particles dynamically.  This allows
the algorithm to focus resources on the most interesting hypotheses,
and therefore reduces the total number of samples required.

\subsection{The SMC monad}

\begin{figure}
%\includegraphics[scale=0.4]{figs.1}

\[
\xygraph{
!{0;<30mm,0mm>:<0mm,10mm>::}
% Column labels.
[]{|SMC a|} & {|SMC (SMC b)|} & {|SMC b|} \\
% Draw "a" and "b" values, with extra padding.
[]*=<10mm,10mm>{a}="a1" & *=<10mm,10mm>{b}="nb1" & *=<10mm,10mm>{b}="b1" \\
[]*=<10mm,10mm>{a}="a2" & *=<10mm,10mm>{b}="nb2" & *=<10mm,10mm>{b}="b2" \\
[]*=<10mm,10mm>{a}="a3" & *=<10mm,10mm>{b}="nb3" & *=<10mm,10mm>{b}="b3" \\
% Labelled arrows.
"a2" :^{|mapSMC f|}_{|f :: a -> SMC b|} "nb2" :^{|joinSMC|} "b2"
% Put big boxes around each column.  The "**" means "box the two previous
% points, and "!{..}" allows us to write raw xypic code.
[] !{"a1";"a3"**\frm{-}}
   !{"nb1";"nb3"**\frm{-}}
   !{"b1";"b3"**\frm{-}}
% Put small boxes around each item in nested "b" column.
[] !{"nb1"*=<5mm,5mm>{}*\frm{-}}
   !{"nb2"*=<5mm,5mm>{}*\frm{-}}
   !{"nb3"*=<5mm,5mm>{}*\frm{-}}
}\]

\label{DiagramMapAndJoinSMC}
\caption{|mapSMC| and |joinSMC|.}
\end{figure}

We define our |SMC| monad in terms of |MC|.  Internally, we represent |SMC|
as a function mapping the desired number of samples to a list of actual
samples.  The list itself must be in the |MC| monad because it can only be
generated using random numbers.

> newtype SMC a =
>   SMC { runSMC :: Int -> MC [a] }
> 
> liftMC :: MC a -> SMC a
> liftMC r = SMC (sample r)

We can lift a distribution from |MC| to |SMC| simply by taking the
requested number of samples.  The |mapSMC| function is a bit more complex,
however.  We define a new sampling function, |mapped|, which first turns a
distribution |d| into actual samples, and then applies |f| to each sample.
We package |mapped| inside our |SMC| monad, where it can be run when
needed.

> mapSMC :: (a -> b) -> SMC a -> SMC b
> mapSMC f d = SMC mapped
>   where mapped n = liftM (map f) (runSMC d n)

The |joinSMC| function is the heart of the |SMC| monad.  We begin with a
cloud of particles, where each particle is itself an entire cloud (Figure
\ref{DiagramMapAndJoinSMC}, middle column).  We want to take a single
particle from each nested cloud, and put those particles into a new,
flattened cloud.

To do this, we define a function |joined|, which takes |n| samples from the
outermost cloud, and stores them in the list |ds|, which has type |[SMC
a]|.  It then takes one sample from each of these clouds using |sample1|,
and stores them in |xss|, which has type |[[a]]|.  All that remains is to
flatten this list using |concat|.

> joinSMC :: SMC (SMC a) -> SMC a
> joinSMC dd = SMC joined
>   where joined n = do
>           ds <- (runSMC dd n)
>           xss <- sequence (map sample1 ds)
>           return (concat xss)
>         sample1 d = runSMC d 1

By taking only one sample from each of our nested clouds of particles, we
effectively propagate each particle into one of its own possible futures.

The remaining code is trivial.

> instance Monad SMC where
>   return = liftMC . return
>   ps >>= f = joinSMC (mapSMC f ps)
> 
> instance Dist SMC where
>   weighted = liftMC . weighted

\subsection{The WSMC monad and weighted particle filtering}
\label{WSMC}

Sequential Monte Carlo sampling is commonly performed using weighted
particles.  This is a form of \term{importance sampling}, where each sample
has an associated weight \cite{doucet00sequential}.  A high-importance
sample is interpreted as more ``likely'' than a low importance sample.
This allows us to get more data out of a given number samples.  Instead of
discarding samples, as we did using |MaybeT| in Section
\ref{BayesAndMaybeT}, we can simply mark those samples as unlikely.

The definition of the |WSMC| monad should be unsurprising at this point.
It relies on the same techniques seen in Section \ref{BayesAndMaybeT}, with
only cosmetic differences.

> type WSMC = PerhapsT SMC
> 
> instance Dist WSMC where
>   weighted wvs = lift (weighted wvs)
>
> runWSMC :: WSMC a -> Int -> MC [MV Prob a]
> runWSMC wps n = runSMC (runMVT wps) n

We do, however, provide two new features.  The |applyProb| function is a
weighted version of |guard|.  It multiplies the current particle's weight
by |p|.  Recall that |p| here is a probability, so we know that $0 \le p
\le 1$.  The actual multiplication is handled for us by |PerhapsT|.

> instance MonadCondProb WSMC where
>   applyProb p = MVT (return (MV p ()))

In practice, |applyProb| is used to apply a new piece of evidence to our
particles.  For example, imagine that we have a robot with a door sensor.
Let $x$ be the current reading of the door sensor, and let $P(x||pos)$ be
the probability of observing $x$ at $pos$.  We want to call |applyProb|
with an argument of $P(x||pos)$, which will apply an appropriate weight to
each of our particles.  For an excellent illustration of this process, see
Fox and colleagues \cite{fox05bayesian}.

As time goes on, however, repeated calls to |applyProb| will leave many of
our particles with weights near zero.  Periodically, we want to replace
these particles with higher-probability ones.  We can do this using the
|resample| function.

% TODO - Figure out why we actually need 'unpack' in this design.  It
% appears in two different places.

> resample :: WSMC a -> WSMC a
> resample d = lift (SMC resampled)
>   where  resampled n = do
>            xs <- runSMC (runMVT d) n
>            sample (weighted (map unpack xs)) n
>          unpack (MV (Prob p) x) = (p,x)

The |resample| function treats our weighted particles as a probability
distribution, and takes |n| new samples.  This process will favor particles
with high weights over particles with low weights, concentrating most of
our particles in places where they'll do the most good.  This particular
implementation of resampling, however, is extremely naive and has
statistical problems.  See Doucet and colleagues for a survey of more
sophisticated approaches \cite{doucet00sequential}.

% \section{Haskell wish list}
% \label{WishList}

% A Set monad
% A MaybeT monad transformer
% A Rand monad
% A MonadZero type class
% A Writer monad without the gunk

% \section{Examples}
% \label{Examples}
% Provisional section; this material may get worked in elsewhere.

% \subsection{Coin Tosses}

% > data Coin = Heads | Tails
% >   deriving (Show, Eq, Ord)
% >
% > toss :: (Dist d) => d Coin
% > toss = uniform [Heads,Tails]
% >
% > tossTwice :: (Dist d) => d [Coin]
% > tossTwice = do
% >   t1 <- toss
% >   t2 <- toss
% >   return (sort [t1, t2])

% \subsection{Monty Hall}

% > montyHall switch = do
% >   let doors = [1,2,3]
% >   doorWithCar <- uniform doors
% >   playerDoor  <- uniform doors
% >   hostDoor    <- uniform (doors \\ [doorWithCar,playerDoor])
% >   let unopenedDoor = head (doors \\ [playerDoor,hostDoor])
% >       finalDoor = if switch then unopenedDoor else playerDoor
% >   return (finalDoor == doorWithCar)

\section{Related work}
% 1-2 pages.

Probability distribution monads are discussed by Lawvere
\cite{lawvere62category} and Giry \cite{giry81categorical}.  Giry's work
focuses on Markov processes and transition probabilities, and provides a
categorical foundation for probability measures.  Jones and Plotkin lay
further mathematical foundations, introducing a probabilistic power domain
and showing that it forms a monad \cite{jones89probabilistic}.  They also
provide a $\lambda$-calculus model for the probability monad, and propose a
probabilistic choice construct serving the same purpose as |weighted|.

Ramsey and Pfeffer provide Haskell interfaces for several probability
distribution monads, including a sampling monad that corresponds roughly to
|MC| \cite{ramsey02stochastic}.  They show that monads offer efficient
implementations of support and sampling, but may be exponentially slower
than other techniques for calculating expectations.  To solve this problem,
they translate the $\lambda$-calculus into measure terms.

Park and colleagues use a sampling monad in a variety of real-world
robotics applications \cite{park05probabilistic}.  Their
$\lambda_{\bigcirc}$-calculus uses sampling functions to represent
binomial, geometric, and Gaussian distributions.  They provide two
implementations of Bayes' theorem (one using the rejection method, the
other using importance sampling), and a primitive for calculating
expectations.  This is the best-developed version of the |MC| monad in the
literature.

Erwig and Kollmansberger have written an excellent Haskell library
supporting both discrete distributions and random sampling, corresponding
to the |DDist| and |MC| monads in this paper \cite{erwig06probabilistic}.
They provide a much richer set of higher-order functions than we do,
including functions for iterated simulation.

Our work differs from these earlier papers in several ways.  We focus on
the underlying structure of the probability monads, showing how to build
them from layers of monad transformers.  We also show that Bayes' theorem
may be expressed naturally using |MaybeT|, removing the need for special
primitives.  And we introduce a new family of monads for sequential Monte
Carlo sampling.

\subsection{Monad morphisms and monad transformers}

The history of monad morphisms and monad transformers has been described in
a bibliography by Chung-chieh Shan \cite{shan07monad}.  Moggi originally
used monad morphisms to build a modular theory of denotational semantics
\cite{moggi89abstract}.  Moggi's theory was adapted for use in functional
programming languages by King and Wadler \cite{king92combining}.  Later
work focused heavily on modular interpreters \cite{wadler92essence,
  steele94building, espinosa95semantic, liang95monad,
  harrison00compilation}.  A typical modular interpreter is based on an
|eval| function in an unspecified monad.  On top of this foundation,
several layers of monad transformers are used to implement state,
continuations, error reporting, and other language features.

In another field, Chung-chieh Shan applied monad morphisms to natural
language semantics.  He used various monad morphisms to construct a modular
theory of interrogatives, focus, intensionality and quantification
\cite{shan01monads}.

Monad transformers are well-supported by the Haskell programming language
\cite{jones95functional}.  For good tutorials, see Grabmueller
\cite{grabmueller06transformers} and Newburn \cite{all-about-monads}.

\subsection{Other representations of probability distributions}

A variety of techniques have been used to represent probability
distributions.  Fox and colleagues provide an excellent survey of Bayesian
filtering and belief representations \cite{fox05bayesian}.  Park and
colleagues also cover much of this ground in their paper on probability
distribution monads \cite{park05probabilistic}.

In robotics applications, probability distributions are often represented
as Kalman filters \cite{russel03ai, fox05bayesian}.  Kalman filters are
highly efficient, but they can only represent simple Gaussian
distributions.  Under certain assumptions, however, Kalman filters are
theoretically optimal \cite{fox05bayesian}.  Unfortunately, our probability
monad tookit does not support Kalman filters, because Gaussian
distributions cannot be defined over arbitrary Haskell data types.

There is also an extensive literature on sequential Monte Carlo sampling
and particle filters.  For a good overview, see Fox and colleagues
\cite{fox05bayesian}.  For a detailed discussion of current techniques, see
Doucet and colleagues \cite{doucet00sequential}.

\subsection{Probabilistic programming languages}

Probabilistic programming languages are an enormously rich area of
research.  We describe a few here, just to give a sense of the sheer
diversity.

Pfeffer's IBAL programming language supports sophisticated probabilistic
inference \cite{pfeffer05ibal}.  IBAL is a functional language, with
constructs similar to those in this paper.  But where our monads directly
calculate probability distributions, IBAL uses a two-phase inference
algorithm to efficiently solve large problems.  Phase 1 analyzes the
program and produces an optimized computation graph.  Phase 2 solves the
computation graph and returns a probability distribution over the possible
outputs.

Allison describes a Haskell library for machine learning and inductive
inference \cite{allison03types, allison06programming}.  Inductive inference
constructs a Bayesian network from real-world data, filling in the
connections automatically.  The major drawback of this approach is the risk
of ``overfitting,'' which occurs when the inference engine constructs a
excessively-detailed model that describes every quirk of the training data.
To avoid this problem, Allison uses Minimum Message Length (MML) to choose
a model minimizing the combined size of the model and the compressed
training data.

A rich variety of probabilistic logic programming languages have also
appeared in the literature.  Some examples include Ng and Subrahmanian
\cite{ng92probabilistic}, and Muggleton \cite{muggleton95stochastic}.


% There's a Kalman Filtering book by Peter Maybeck which looks interesting.

% Category theory paper on monad transformers and algebras?  I think it was
% written by a math grad student to his old CS research group.

% DSELs?

% There is a rich literature on stochastic processes and monads.  Do we
% know enough to mention anything newer than Giry?

\section{Conclusion}
% 1/2 page.

In this paper, we introduced a toolkit for building probability monads.
Many components of the toolkit already existed in standard Haskell, or in
various proposed libraries.  These included the list monad, the Monte Carlo
sampling monad, and the |MaybeT| monad transformer.  Two of our components,
however, were new: The |SMC| monad (Section \ref{SMC}), which performs
sequential Monte Carlo sampling, and the |PerhapsT| monad transformer
(Section \ref{MonoidsAndDiscrete}), which is a stripped-down version of
|WriterT Prob|.

We also demonstrated some novel applications of this toolkit.  These
included an implementation of Bayes' theorem using |MaybeT| (Section
\ref{BayesAndMaybeT}), and a monad which performs weighted particle
filtering (Section \ref{WSMC}).

The most important benefit of the modular toolkit, however, has been the
ease with which we can construct new monads.  This facilitates tinkering
and experimentation, and frequently offers us new perspectives on
well-known techniques.  Many of the monads in this paper were discovered by
asking, ``Hey, what happens if we combine \emph{this} with one of
\emph{these?}''  We leave a great many such questions unanswered, pending
further exploration.

\section*{Acknowledgments}

Dan Piponi provided me with hours of fascinating reading, and introduced me
to the notion of a free $M$-set monad transformer.  Cale Gibbard was the
first to notice the decomposition of Erwig and Kollmansberger's discrete
distribution monad into |WriterT Prob []|.  Nicholas Sinnott-Armstrong
carefully read the first draft of this paper for clarity, and found many
places that needed further explanation.  Thanks also to Don Stuart, Derek
Elkins, David House and other members of the Haskell community who provided
me with advice and suggested new ideas.  Any remaining mistakes are, of
course, my own.

\bibliography{bibliography}

\begin{hidden}

% Show instances.

> instance Show Prob where
>   show (Prob p) = show (numerator p) ++ "/" ++ show (denominator p)
>
> instance (Monoid w, Show w, Show a) => Show (MV w a) where
>   show (MV w a) = "MV " ++ show w ++ " " ++ show a
>
> instance (Show a, Ord a) => Show (DDist a) where
>   show = show . simplify . runMVT
>
> simplify :: (Monoid w, Num w, Ord a) => [MV w a] -> [MV w a]
> simplify = map (foldr1 merge) . groupEvents . sortEvents
>   where sortEvents = sortBy (liftOp compare)
>         groupEvents = groupBy (liftOp (==))
>         liftOp op  (MV _   v1) (MV _   v2)  = op v1 v2
>         merge      (MV w1  v1) (MV w2  _)   = MV (w1 + w2) v1

% Top-level IO methods

> ddist :: DDist a -> DDist a
> ddist d = d

> sampleIO :: MC a -> Int -> IO [a]
> sampleIO r n = evalRandIO (sample r n)
>
> sampleWithRejectionsIO :: BMC a -> Int -> IO [a]
> sampleWithRejectionsIO r n = evalRandIO (sampleWithRejections r n)
>
> runWSMCIO :: WSMC a -> Int -> IO [MV Prob a]
> runWSMCIO wps n = evalRandIO (runWSMC wps n)

> main = return ()

\end{hidden}

\end{document}
