Demystifying Bayesian Statistics and PYMC: A Practical Guide
If you have ever wondered how to make smarter predictions or handle uncertainty in your data, Bayesian statistics and PYMC might be the tools you're looking for. Bayesian statistics is a method of statistical inference that combines prior knowledge with observed data to make predictions or draw conclusions. At its heart is Bayes' Theorem, a formula for updating probabilities as new evidence becomes available. PYMC is a powerful Python library that simplifies implementing Bayesian models to, for example, analyze trends, test hypotheses, or simulate outcomes.
PYMC is centered around a fundamental idea: specifying prior distributions, defining a likelihood function that links your observed data to the model, and calculating posterior distributions through iterative sampling methods.
In this guide, you'll discover the fundamentals of Bayesian statistics, learn how PYMC works, and explore its use cases. By the end, you'll have a new tool to elevate your data analysis game.
Bayes' Theorem is expressed as:
\[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\]
Here's what the components mean:
- P(A|B): The posterior probability. This is what you want to calculate - the probability of event A given that event B has occurred.
- P(B|A): The likelihood. This represents the probability of observing event B if A is true.
- P(A): The prior probability. This is your initial belief about the probability of A, before considering the new evidence.
- P(B): The evidence or marginal likelihood. This normalises the result so that the posterior is a valid probability distribution.
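To make the components concrete, here is a small worked example of Bayes' Theorem applied to a medical test. All the numbers (prevalence, sensitivity, false-positive rate) are hypothetical, chosen only for illustration:

```python
# Worked example of Bayes' Theorem: probability of disease given a positive test.
# Hypothetical numbers: P(disease) = 0.01, P(positive | disease) = 0.95,
# P(positive | no disease) = 0.05.

p_a = 0.01              # prior: P(disease)
p_b_given_a = 0.95      # likelihood: P(positive | disease)
p_b_given_not_a = 0.05  # false-positive rate: P(positive | no disease)

# Evidence P(B) via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(disease | positive) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # ≈ 0.161
```

Despite the test's high sensitivity, the posterior is only about 16%, because the low prior (1% prevalence) dominates: exactly the kind of prior-evidence interplay Bayes' Theorem formalises.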
Linear Regression Made Simple with PYMC
Bayesian modelling may sound complex, but it's surprisingly accessible with PYMC. Let's explore its mechanics step by step with a first example that performs linear regression.
Example Code
```python
import pymc as pm
import numpy as np

if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X = np.linspace(0, 1, 100)
    true_intercept = 1
    true_slope = 2
    y = true_intercept + true_slope * X + np.random.normal(scale=0.5, size=len(X))

    # Define the model
    with pm.Model() as model:
        # Priors for unknown model parameters
        intercept = pm.Normal("intercept", mu=0, sigma=10)
        slope = pm.Normal("slope", mu=0, sigma=10)
        sigma = pm.HalfNormal("sigma", sigma=1)

        # Likelihood (sampling distribution) of observations
        mu = intercept + slope * X
        likelihood = pm.Normal("y", mu=mu, sigma=sigma, observed=y)

        # Inference
        trace = pm.sample(2000, return_inferencedata=True)

    # Summarize the results
    print(pm.summary(trace))
```
The resulting intercept and slope (see the table below) are close to, but not exactly at, the true values (y = 2x + 1). The reason for the difference is the small number of data points (100 in our case, set by the size of X). Increasing the number of data points by a factor of 10 brings the estimates much closer to the true values, but it also increases the runtime by a similar factor. For a simple problem like this, a classical regression fits even large datasets in seconds, making it the more practical choice when uncertainty estimates are not needed.
Parameter | mean | sd | hdi_3% | hdi_97% |
---|---|---|---|---|
intercept | 0.911 | 0.093 | 0.727 | 1.077 |
sigma | 0.462 | 0.034 | 0.402 | 0.530 |
slope | 2.073 | 0.162 | 1.762 | 2.377 |
Explaining the approach step-by-step
- Define the priors. Following the generation of the synthetic data, you define your prior beliefs about the slope and intercept of the linear relationship between X and y, and about the standard deviation of the noise in the data. In our example we centre the priors for slope and intercept around zero (as we do not know the true parameter values a priori), with a large sigma of 10. These are weak priors, so the posterior will depend mostly on the data. For the noise term, the HalfNormal prior with sigma of 1 expresses the belief that most observed y values will fall within roughly ±1 unit of the model's predictions, while still leaving room for larger deviations.
- Specify the likelihood function to evaluate how well the model explains the data, given the current parameter values. The model assumes that the observed y values are drawn from a normal distribution. The observed argument of pm.Normal fixes the distribution to the observations, effectively turning it into a likelihood function that evaluates how well the model parameters explain the observed data. The closer the predicted values (mu = intercept + slope * X) are to the observed values (y), the higher the likelihood.
- Use the observed data to calculate posterior distributions. This step combines the prior distributions with the likelihood, which incorporates the observed data (y), to evaluate how much the prior beliefs should be adjusted. The result is a posterior distribution for each parameter, computed using Markov chain Monte Carlo (MCMC) or another sampling algorithm.
- Perform inference and summarise the results. In the code above, pm.sample draws 2000 posterior samples per chain, from which each parameter's mean, standard deviation, and credible interval are computed.
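The summary statistics reported by pm.summary can be reproduced by hand from raw posterior draws. The sketch below fakes 2000 draws of the slope (with hypothetical values matching the table above) to show what those numbers mean:

```python
import numpy as np

# Sketch: how summary statistics are computed from posterior draws.
# We fake 2000 draws of the slope here (values chosen to resemble the
# table above); pm.summary performs the same kind of computation on the
# real trace.
rng = np.random.default_rng(42)
slope_draws = rng.normal(loc=2.07, scale=0.16, size=2000)

mean = slope_draws.mean()
sd = slope_draws.std()
# 94% equal-tailed interval; pm.summary reports an HDI instead, which is
# very similar for a symmetric posterior like this one.
lo, hi = np.percentile(slope_draws, [3, 97])
print(f"mean={mean:.2f} sd={sd:.2f} 94% interval=({lo:.2f}, {hi:.2f})")
```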
Why Choose Bayesian Statistics and PYMC?
Bayesian statistics cover the main problem types addressed by classical data science (e.g. prediction, classification, clustering, parameter estimation). They allow for flexible incorporation of prior knowledge and handle complex models naturally (e.g., "stacking" different distributions across levels of hierarchical data structures). In addition, Bayesian methods give a full probability distribution for each parameter or prediction, not just point estimates and confidence intervals. This is useful for decision-making under uncertainty.
Classical data science should be used when the primary goal is high predictive accuracy, especially in large datasets or when computational efficiency is more important than quantifying uncertainty. While Bayesian methods are computationally intensive, they are especially advantageous for small datasets, where prior knowledge can help compensate for limited data.
The table below compares the two approaches along these aspects for the linear regression example:
Aspect | Classical Linear Regression | Bayesian Linear Regression |
---|---|---|
Uncertainty | Provides point estimates and confidence intervals. | Provides full posterior distributions, showing richer uncertainty. |
Prior Knowledge | Cannot incorporate prior knowledge. | Incorporates prior knowledge through explicit prior distributions. |
Interpretability | Easy to interpret for non-experts. | Requires understanding of priors and posterior distributions. |
Computational Cost | Fast and efficient, even for large datasets. | Computationally expensive due to sampling or approximation. |
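To make the cost comparison concrete, here is the classical counterpart of the regression example, fitted on the same synthetic data via ordinary least squares (np.polyfit). It runs in milliseconds but returns point estimates only, with no posterior distributions:

```python
import numpy as np

# Classical fit of the same synthetic data as the PYMC example above.
np.random.seed(42)
X = np.linspace(0, 1, 100)
y = 1 + 2 * X + np.random.normal(scale=0.5, size=len(X))

# np.polyfit returns coefficients from the highest degree down
slope, intercept = np.polyfit(X, y, deg=1)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

With weak priors, the Bayesian posterior means should land very close to these least-squares estimates; what the Bayesian fit adds is the full distribution around them.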
As a next step, let's explore a real-world example which assesses the effect of a drug on blood pressure using Bayesian inference.
Modeling Drug Effects on Blood Pressure
The following code checks whether a drug has the expected effect on blood pressure for 100 patients. Note that the code is very similar to the linear regression example.
```python
import pymc as pm
import numpy as np

# Simulated data: assume 100 patients with some effect from the drug
np.random.seed(42)
n_patients = 100
true_beta = 5  # Assume the expected effect is a decrease of 5 units in blood pressure

# loc is the mean, and scale the standard deviation
y_observed = np.random.normal(loc=true_beta, scale=2, size=n_patients)

# Define the Bayesian model
with pm.Model() as model:
    # Prior for the effect of the drug (beta)
    beta = pm.Normal("beta", mu=0, sigma=10)

    # Prior for the noise (sigma)
    sigma = pm.HalfNormal("sigma", sigma=1)

    # Likelihood: observed data
    y = pm.Normal("y", mu=beta, sigma=sigma, observed=y_observed)

    # Posterior sampling using MCMC
    trace = pm.sample(2000, return_inferencedata=True)

# Summarize the posterior distribution
print(pm.summary(trace))
```
Sampling Process: Each iteration of the MCMC sampling process evaluates the model's likelihood and prior at newly proposed parameter values. In the classic Metropolis-Hastings algorithm, a random-walk sampler, proposals come from a fixed proposal distribution centred on the current values (PYMC's default for continuous parameters is the more efficient gradient-based NUTS sampler, but the principle is the same). The decision to move to the proposed state or stay at the current one is based on the Bayesian principle that the posterior is proportional to the likelihood times the prior (Bayes' Theorem).
\[\text{Posterior} \propto \text{Likelihood} \times \text{Prior}\]
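This accept/reject logic can be written out in a few lines. The sketch below is a minimal Metropolis-Hastings sampler for the drug model, working with the unnormalised log-posterior (log-likelihood plus log-prior); to keep it short, sigma is fixed at its true simulation value of 2 rather than sampled:

```python
import numpy as np

# Minimal Metropolis-Hastings sketch for the drug-effect model.
rng = np.random.default_rng(0)
y = rng.normal(loc=5, scale=2, size=100)  # simulated observations

def log_posterior(beta):
    # Unnormalised log-posterior: log Normal(0, 10) prior on beta plus
    # log Normal(beta, 2) likelihood of the data (constants dropped).
    log_prior = -0.5 * (beta / 10) ** 2
    log_lik = -0.5 * np.sum(((y - beta) / 2) ** 2)
    return log_prior + log_lik

beta, samples = 0.0, []
for _ in range(5000):
    # Fixed random-walk proposal centred on the current value
    proposal = beta + rng.normal(scale=0.5)
    # Accept with probability min(1, posterior ratio); compare in log space
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(beta):
        beta = proposal
    samples.append(beta)

# Posterior mean of beta after discarding burn-in; should be close to 5
print(np.mean(samples[1000:]))
```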
The outputs from the drug model show that the estimated effect of the drug (mean of beta = 4.792) is slightly less than the expected effect (5 units), but it falls within the credible interval. This suggests that the model, data, and prior have worked reasonably well to capture the expected drug effect, and the results support the efficacy of the drug in reducing blood pressure.
The estimated level of noise (mean of sigma = 1.805) is close to the standard deviation used in the simulation (2 units), indicating that the model has effectively captured the underlying variability in the blood pressure changes. From a qualitative perspective this means that the blood pressure readings are expected to vary by approximately 1.8 units around the mean effect due to random noise.
The HDI shows the range of plausible values for this noise term, which is important for understanding the variability in treatment responses among patients.
The interval of beta represents the region containing 94% of the posterior probability and suggests that we can be 94% confident that the true effect of the drug on reducing blood pressure lies within this range.
Parameter | Mean | SD | HDI 3% | HDI 97% |
---|---|---|---|---|
beta | 4.792 | 0.183 | 4.458 | 5.142 |
sigma | 1.805 | 0.125 | 1.577 | 2.042 |
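The HDI columns in the table can also be computed directly from posterior draws: the HDI is the narrowest interval containing the stated probability mass. The sketch below uses simulated draws (hypothetical values resembling the beta posterior above) in place of the real trace:

```python
import numpy as np

def hdi(draws, prob=0.94):
    """Highest-density interval: the narrowest interval containing
    `prob` of the draws (assumes a unimodal posterior)."""
    d = np.sort(draws)
    n_in = int(np.ceil(prob * len(d)))
    # Width of every candidate interval spanning n_in consecutive draws
    widths = d[n_in - 1:] - d[: len(d) - n_in + 1]
    i = int(np.argmin(widths))
    return d[i], d[i + n_in - 1]

rng = np.random.default_rng(3)
draws = rng.normal(4.79, 0.18, size=4000)  # stand-in for the beta draws
lo, hi = hdi(draws)
print(lo, hi)  # roughly (4.45, 5.14), in line with the table above
```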
Credible versus Confidence Intervals
Bayesian models offer more than just point estimates; they provide credible intervals, which enable a deeper understanding of potential outcomes. But what does that mean?
A credible interval represents the range within which a parameter lies with a certain probability, given the data and prior knowledge. For instance, a 95% equal-tailed credible interval (between the 2.5th and 97.5th percentiles of the posterior) indicates that there is a 95% probability that the parameter value falls within this range; the HDI reported by PYMC is the narrowest interval containing the stated probability mass instead.
In contrast, frequentist statistics provide a confidence interval (CI) that represents the range within which the true parameter would lie 95% of the time (for a 95% CI) if we were to repeatedly sample new data. It does not convey the probability of the parameter being within the interval for a specific dataset.
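The frequentist coverage claim can be checked with a quick simulation: generate many datasets from a known true mean, build a 95% confidence interval for each, and count how often the interval contains the truth.

```python
import numpy as np

# Simulate repeated sampling: a 95% CI should contain the true mean
# in roughly 95% of repeated experiments.
rng = np.random.default_rng(1)
true_mu, n, trials = 5.0, 100, 2000
covered = 0

for _ in range(trials):
    sample = rng.normal(true_mu, 2, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    covered += lo <= true_mu <= hi

print(covered / trials)  # close to 0.95
```

Note that the 95% statement is a property of the *procedure* across repeated datasets, not a probability statement about the parameter given the one dataset you actually observed.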
To summarize, Bayesian analysis computes the posterior distribution of the parameters based on actual observed data and prior beliefs. This distribution is updated with evidence from the actual data, rather than relying on hypothetical repeated sampling of new data. This approach allows for a more nuanced understanding of uncertainty and variability in the parameter estimates.
Choosing the Right Priors in Bayesian Modelling
Selecting priors is critical in Bayesian modelling. A prior represents your initial beliefs before observing data. But how do you choose the right ones?
Practical Considerations
- Relevance: Base your priors on domain knowledge.
- Precision: Narrow priors focus the model but risk missing true values; wide priors allow flexibility but may slow convergence.
The table below provides an overview of selected distributions and their areas of application:
Distribution | Area of Application | Specific Use Case |
---|---|---|
Normal | Modeling symmetric, continuous variables like heights, test scores | Used in consumer goods for analyzing customer spending patterns |
Exponential | Modeling time until an event occurs, e.g., failure rates | Applied in insurance to model time until claims or failures |
Poisson | Counting events over a fixed interval, like call center arrivals | Used in telecommunications for call arrivals at a call center |
Binomial | Modeling binary outcomes, e.g., success/failure scenarios | Common in clinical trials to model the success rate of a treatment |
Uniform | When all outcomes are equally likely, e.g., random sampling | Used in quality control for simulating equal probability scenarios |
Gamma | Modeling skewed data like rainfall amounts or insurance claims | Employed in environmental science for rainfall modeling |
Beta | Probabilities and proportions, e.g., conversion rates | Used in digital marketing to model click-through rates on ads |
To check whether the priors are a good approximation, you can compare them against the observed data using visualisations such as density plots or prior predictive checks. This step ensures that your priors are reasonable, neither overly restrictive nor uninformative. Priors are typically chosen once, based on prior knowledge or assumptions; however, iterative refinement is possible if the initial results suggest a misalignment with the observed data.
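A prior predictive check for the drug model can be sketched by hand (PYMC also provides pm.sample_prior_predictive for this): draw parameters from the priors, simulate data from them, and see whether the simulated range comfortably covers plausible observations.

```python
import numpy as np

# Hand-rolled prior predictive check for the drug-effect model.
rng = np.random.default_rng(7)
n_draws = 1000

beta_prior = rng.normal(0, 10, size=n_draws)          # Normal(0, 10) prior
sigma_prior = np.abs(rng.normal(0, 1, size=n_draws))  # HalfNormal(1) prior

# One simulated observation per prior draw
y_sim = rng.normal(beta_prior, sigma_prior)

# Wide priors give a wide predictive range; effects near the observed
# ~5 units are well covered, so the priors are not overly restrictive.
print(np.percentile(y_sim, [2.5, 97.5]))
```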
The choice of priors significantly impacts both the precision of the output and the calculation time. Narrow priors may lead to faster convergence but risk missing true values, while overly wide priors can increase computation time and require more data to refine the posterior distribution effectively. Balancing these trade-offs is essential to achieve accurate and efficient results in Bayesian modeling.
Conclusion
I hope this article has conveyed the advantages of adding Bayesian methods to your data science toolkit. This approach doesn't just crunch numbers; it thoughtfully incorporates your prior knowledge and experience into the analysis, offering a rich, nuanced understanding of complex models and the uncertainties involved. As you have seen, Bayesian methods excel where data is scarce or when detailed probability assessments are crucial, making them a robust choice for informed decision-making in uncertain scenarios.
I encourage you to consider these tools for your next project, especially when traditional methods fall short in depth and flexibility. Remember, while Bayesian analysis might require more computational effort, the insights gained are often worth the investment, particularly in contexts with limited data. Don't hesitate to step beyond classical approaches and explore the full potential of Bayesian statistics to elevate your analytical capabilities and make more confident decisions based on comprehensive data analysis.
Additional Resources
For further learning, consider these resources:
- PYMC Documentation: https://docs.pymc.io
- Think Bayes by Allen Downey
Alberto Desiderio is deeply passionate about data analytics, particularly in the contexts of financial investment, sports, and geospatial data. He thrives on projects that blend these domains, uncovering insights that drive smarter financial decisions, optimise athletic performance, or reveal geographic trends.