A Hidden-Markov Model of Stockout Detection
Bayesian HMM for detecting unobserved inventory outages
Code
# simulates stockouts (we're assuming only full stockouts, no partial ones)
generate_data <- function(n_days, lambda_base, stockouts_idx, seed = 12)
{
# use average lambda; no seasonality for now
lambda_day <- rep(lambda_base, n_days) #+ (1:n_days)
# simulate latent sales, what would be realized if
set.seed(seed)
latent_demand_day <- rpois(n_days, lambda_day)
# create stockout indicator
inventory_sufficient <- rep(T, n_days)
valid_stockouts_idx <- stockouts_idx[stockouts_idx < n_days]
inventory_sufficient[valid_stockouts_idx] <- F
# we'll assume that stockouts don't increase sales afterwards
sales <- latent_demand_day * inventory_sufficient
state <- ifelse(inventory_sufficient, 'regular sales', 'stockout')
state %<>% factor(levels = c('stockout', 'regular sales'))
data.frame( day = 1:n_days, sales = sales, state = state ) # stockout = !inventory_sufficient,
}Summary
Bayesian Hidden Markov Model for detecting stockouts in sales data with no explicit inventory information. A synthetic dataset is generated with steady latent demand and known stockout periods to demonstrate the principle. Realized sales are treated as true demand realizations except during stockouts, which structurally force sales to zero and hide actual demand. The model treats sales as emissions from two hidden states - regular sales and stockout - with state transitions governed by a Markov process. Inference is performed in Stan using MAP estimation and the Viterbi algorithm to recover the most likely stockout periods. The result shows how probabilistic state-switching can uncover unobserved stockouts that distort sales data.
Introduction
In this notebook, we tackle a common data problem in analyzing retail data: Realized sales don’t always reflect the actual demand for a product. This is because when a product goes out of stock, the realized sales are artificially forced to zero, while the latent demand remains what it would have been.
This breaks the direct link between what is observed and what you’d want to use in an analysis or a forecast. If all realized sales are treated as if they came directly from the latent demand distribution, true demand will be underestimated.
This is a problem, because without explicit inventory data, the only clue about stockouts is that observed sales deviate from what the latent demand distribution could plausibly be.
Simulating Sales With Stockouts
We generate synthetic sales data with steady demand and no seasonality, including a single stockout period to demonstrate the issue and its solution.
The function generate_data() below simulates:
- Regular sales: The realized sales depend entirely on the latent demand, and are simulated as draws from a Poisson distribution.
- Stockout: Periods with no inventory, causing realized sales to drop to zero.
This gives us a clean example where we know which zeros are stockouts and which are a part of regular sales.
Code
df <- generate_data( n_days = 90, lambda_base = 3,
stockouts_idx = c(30:36, 71:80), #, 100:120, 150:163, 200:218, 250:260)
)The plot shows flatlined realized during the stockout period, despite a constant latent demand distribution.
Code
df %>% ggplot(aes(day, sales, group=1, color = state)) +
geom_line(color="darkgrey") + geom_point() +
theme_bw() + theme(legend.position = "top") +
guides(color = guide_legend(title = NULL)) +
ylab("sales quantity") +
ggtitle("Sales Over Time") +
theme(plot.title = element_text(hjust = 0.5))Detecting Stockouts
So how do we detect stockouts? In order to do that, we first need to determine (i) the number of stockout days, and (ii) identify stretches of zeros the duration of which total approximately that length.
Determining the number of stockout days is relatively straightforward under the relatively plausible assumption that sales quantities are at least approximately follow the negaive binomial distribution: As is visible in the histogram below, the number of \(0\)s is unexpectedly large given the rest of the distribution. It should be smaller than the number of \(1\)s, but it is larger.
Code
df %>% ggplot(aes(sales)) + #, fill = state
geom_histogram(bins = 30) + theme_bw() + theme(legend.position = "top") +
guides(fill = guide_legend(title = NULL)) +
xlab("sales quantity") +
scale_x_continuous(breaks=0:12) +
ggtitle("Distribution of Sales Volume") +
theme(plot.title = element_text(hjust = 0.5))We can use a zero-inflated negative binomial model to estimate a percentage of excess zeroes: One (simplified) way to think about it is that we fit a negative binomial distribution to all the non-zero quantites, and based on this estimated distribution determine the percentage of expected zeroes. The percentage above that are likely to be excess zeroes due to stockouts.
Code
model <- pscl::zeroinfl(sales ~ 1 | 1, data = df, dist = "negbin")
mu_hat <- exp( coef(model)["count_(Intercept)"] )
theta_hat <- model$theta
zi_coef <- coef(model)[["zero_(Intercept)"]]
excess_zero_prob <- plogis(zi_coef)
df_zinb_mass <- data.frame(
n = c(0, 0:9),
p = c(excess_zero_prob, dnbinom(0:9, size = theta_hat, mu = mu_hat)*(1-excess_zero_prob)),
excess_zeros = ifelse(0:10==0, "excess zeroes", "regular sales")
)
p1 <- df_zinb_mass %>% ggplot(aes(x = n, y = p, fill = excess_zeros)) +
geom_bar(stat = "identity", width = 0.3) + theme_bw() + theme(legend.position = "top") +
guides(fill = guide_legend(title = NULL)) +
xlab("sales quantity") +
scale_x_continuous(breaks=0:12)
p2 <- df %>% group_by(sales, state) %>% count() %>%
ggplot(aes(x = sales, y = n, fill = state)) +
geom_bar(stat = "identity", width = 0.3) + theme_bw() + theme(legend.position = "top") +
guides(fill = guide_legend(title = NULL)) +
xlab("sales quantity") +
scale_x_continuous(breaks=0:12)
ggpubr::ggarrange(p1, p2 )However, while this method helps us understand approximately what proportion of days are affected by stockous, it doesn’t help us determine when exactly the stockouts occur. Some of the zeros are absolutely expected, even without a stockout.
A Probabilistic Model of Stockouts
The key to a good stockout detection algorithm is that we need to posit stockouts where zeroes occur rather unexpectedly. This can happen when sales suddenly drop to zero
- during a period with relatively high sales before and after the zero-sales stretch
- when sales have historically been rather high, or when
- when the number of successive zeroes is too long to be a coincidence during regular sales.
To detect such stockouts, we develop a probabilistic model based on the following principles:
- The latent demand follows a specific distribution (e.g., Negative Binomial with \(\mu\) and \(\theta\)) that governs how many units would be sold if the stock were unlimited.
- The parameter \(\mu\) may depend on seasonal factors and may incorporate a trend.
- The realized sales on any given day is either the latent demand, or \(0\), such as during a stockout.
- On any given day, the system is in a particular state (stockout or regular sales) and transitions between states are each associated with specific probabilities.
Details
We model the phenomenon using a Hidden Markov Model with the structure in the diagram below: Each day is spent to be in one of two hidden states:
Regular sales state (\(R\)): realized sales come from the latent demand distribution.
Stockout state (\(S\)): realized sales must be zero. The latent demand distribution is irrelevant because you can’t fulfill any of it.
Transitions between states are governed by a Markov process. The probability of staying in the regular sales state on day \(t+1\), given one was in that state on day \(t\) is \(\pi_{RR}\).
The forward algorithm computes the likelihood of your entire observed series by marginalizing over all possible state sequences, given the emission distributions:
In the regular state, emissions come from the latent demand distribution, a negative binomial with parameters to be estimated.
In the stockout state, the emission is always zero.
The HMM is implemented as a Bayesian model in Stan, and we use MAP estimation to first obtain parameter estimates for the transition probabilites and the negative binomial parameters, and then use the Viterbi algorithm generating the most likely sequence of states generating the observed sequence of sales.
Code
library(cmdstanr)
# Compile the custom Stan model
model <- cmdstan_model("./stockouts_v1.stan")
data_stan <- list(n_days = nrow(df), sales_qty = df$sales )
opt <- model$optimize(
data = data_stan,
seed = 123,
iter = 5000
)
#opt$summary()
gq <- model$generate_quantities(
fitted_params = opt$draws(),
data = data_stan
)
likely_state <- gq$summary()$mean
df$likely_state <- ifelse(likely_state == 2, "regular sales", "stockout")The plot below shows sales sequence generated above, with the red bands indicating the periods that have been identified as stockouts.
Code
datected_stockouts <- df %>%
mutate(prev_likely_state = lag(likely_state, default = "0"),
change = (likely_state != prev_likely_state),
segment_id = cumsum(change)) %>%
filter( likely_state == "stockout" ) %>%
group_by(segment_id) %>%
summarize( day_start = min(day), day_end = max(day) )
df %>% ggplot(aes(day, sales, group=1, color = state)) +
geom_line(color="darkgrey") + geom_point() +
theme_bw() + theme(legend.position = "top") +
guides(color = guide_legend(title = NULL)) +
ylab("sales quantity") +
ggtitle("Sales Over Time") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_rect(
data = datected_stockouts,
inherit.aes = FALSE,
aes( xmin = day_start, xmax = day_end,
ymin = -Inf, ymax = Inf
),
fill = "red",
alpha = 0.2
)Limitations
- The current model does not implement seasonality, yearly or weekly.
- It does not account for the fact that unusual spikes in demand are more likely to be followed by stockouts than below-average sales.