---
title: "Nowcasting a reporting triangle with csfmt_ensemble_v3"
author: "Richard Aubrey White"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Nowcasting a reporting triangle with csfmt_ensemble_v3}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4)
```

`csalert` provides a small, draw-parallel surveillance engine built around three
S3 formats:

- **`csfmt_reporting_triangle_v3`** — the reference-week × reporting-week input
  (who was reported *when*).
- **`csfmt_ensemble_v3`** — `$data` plus, per measure, a matrix of Monte-Carlo
  `$draws` (rows = weeks, columns = simulations). Every analysis stage adds
  columns to the draws, so uncertainty propagates for free.
- a **quantile collapse** of those draws (optionally *healed* into
  `cstidy::csfmt_rts_data_v3` for the usual plots/tables).

The analysis pipeline is: reporting triangle → nowcast → (rate / short-term trend /
MEM thresholds / signal detection) → collapse. Alongside it sits a replay-based
diagnostics harness — backtest → **coverage** (are the intervals honest?),
**revision** (how much will today's number still move?), and **reporting
completion** (how fast does the data actually arrive?). This vignette walks the
whole path on synthetic data.

```{r}
library(data.table)
library(csalert)
```

## A synthetic reporting triangle

Real surveillance data arrives with a delay: a case with reference week `W` may
only be *reported* in week `W`, `W+1`, `W+2`, … We simulate 70 weeks where most of
a week's cases are in within a week or two and a tail dribbles in over a month.

```{r}
weeks   <- cstime::dates_by_isoyearweek$isoyearweek
i0      <- match("2023-01", weeks)
max_delay  <- 5L
n_weeks    <- 70L
delay_prob <- c(0.45, 0.30, 0.15, 0.07, 0.03)     # P(delay = 0, 1, 2, 3, 4 weeks)

set.seed(1)
rows <- lapply(seq_len(n_weeks), function(w) {
  ref_i <- i0 + w - 1L
  n     <- rpois(1, 60 + 25 * sin(2 * pi * w / 52))   # a seasonal signal
  delay <- sample(0:(max_delay - 1L), n, replace = TRUE, prob = delay_prob)
  data.table(isoyearweek_reference = weeks[ref_i], rep_i = ref_i + delay)
})
ll <- rbindlist(rows)
ll[, isoyearweek_reporting := weeks[rep_i]]

# "today" is the last reference week: drop everything not reported yet, so the
# most recent weeks are still incomplete (right-truncated) -- exactly the problem
# a nowcast solves.
as_of_i <- i0 + n_weeks - 1L
ll <- ll[rep_i <= as_of_i]

triangle_long <- ll[, .(numerator = .N),
                    by = .(isoyearweek_reference, isoyearweek_reporting)]
triangle_long[, `:=`(indicator = "example", location = "nation",
                     age = "total", sex = "total")]
head(triangle_long)
```

Wrap it as a `csfmt_reporting_triangle_v3`. The `as_of` boundary (the latest week
we know about) is inferred from the newest reporting week.

```{r}
tri <- csfmt_reporting_triangle_v3(
  triangle_long,
  id_cols       = c("indicator", "location", "age", "sex"),
  reference_col = "isoyearweek_reference",
  reporting_col = "isoyearweek_reporting",
  value_col     = "numerator"
)
attr(tri, "as_of")
```

## Nowcasting

`nowcast_quasipoisson_v1()` regresses each settled week's final total on the counts
reported so far, then simulates the incomplete recent weeks. It returns a
`csfmt_ensemble_v3`: one row per reference week and an `n_sim`-column matrix of
draws of the (completed) weekly total. Settled weeks are point masses at their
observed total; recent weeks carry the completion uncertainty.

```{r}
ens <- nowcast_quasipoisson_v1(tri, max_delay = max_delay, n_sim = 500)
ens
```

Collapse the draws to a quantile summary.

```{r}
q <- ens_collapse(ens, probs = c(0.05, 0.5, 0.95))
tail(q[, .(isoyearweek,
           lo  = numerator_nowcasted_q05x0,
           med = numerator_nowcasted_q50x0,
           hi  = numerator_nowcasted_q95x0)], 8)
```

The recent weeks have a widening interval (fewer reports in), the older weeks are
pinned to their settled total:

```{r}
show <- tail(q, 30)
xs   <- seq_len(nrow(show))
plot(xs, show$numerator_nowcasted_q50x0, type = "n",
     ylim = range(show$numerator_nowcasted_q05x0, show$numerator_nowcasted_q95x0),
     xlab = "week (last 30)", ylab = "weekly total (nowcast)")
polygon(c(xs, rev(xs)),
        c(show$numerator_nowcasted_q05x0, rev(show$numerator_nowcasted_q95x0)),
        col = grDevices::adjustcolor("steelblue", 0.25), border = NA)
lines(xs, show$numerator_nowcasted_q50x0, lwd = 2, col = "steelblue4")
```

## Is the nowcast any good? A replay backtest

The reporting triangle records *when* every count arrived, so we can reconstruct
exactly what was known at any past week and replay the method against it. A
"method" is any function `f(triangle) -> ensemble` with its parameters baked in;
`nowcast_evaluate_v1()` replays it and scores the result into two scale-free
diagnostics per horizon (weeks-back from the as-of week; 0 = the current,
least-observed week):

- **coverage** — does the 90% / 50% interval actually contain the settled truth
  that often? (interval honesty; target 0.90 / 0.50)
- **revision** — how far the published median sits from where it eventually
  settles, as a fraction of the truth: signed (bias), absolute (typical move), a
  5–95% band, and the tail probabilities.

```{r}
method <- function(x) nowcast_quasipoisson_v1(x, max_delay = max_delay, n_sim = 500)
as_of_weeks <- utils::tail(weeks[i0:as_of_i], 30)     # replay the last 30 weeks
nowcast_evaluate_v1(tri, method, max_delay = max_delay,
                    as_of_weeks = as_of_weeks, horizons = 0:3, seed = 1)
```

Pass a *named list* of methods instead of one, and they are replayed on the same
paired draws (shared `seed`) and stacked with a `method` column, so you can race
several engines head-to-head.

## How fast does the data arrive? Reporting completion

`reporting_completion()` reads the delay ECDF off the settled weeks: `pct_wN` is
the pooled % of a reference week's cases reported once it has been observed for `N`
weeks. `period = "month"` / `"year"` slices it in time to expose drift.

```{r}
reporting_completion(tri, max_delay = max_delay)
```

## The naming grammar

Every measure column is a self-describing name built and parsed by the same
grammar, so downstream code never hard-codes strings.

```{r}
csfmt_var("numerator", role = "nowcasted", q = 0.5)   # -> the collapsed median column
csfmt_parse("numerator_nowcasted_q50x0")              # -> back to its components
```

## Where next

- Add a rate with `ens_add_rate()` (numerator vs a nowcasted denominator),
  a short-term trend with `short_term_trend()`, MEM intensity thresholds with
  `mem_thresholds_v1()`, or exceedance detection with `signal_detection_hlm()` —
  each just adds draw columns before the collapse.
- `ens_collapse(heal = TRUE)` returns a `cstidy::csfmt_rts_data_v3` ready for the
  standard plotting/table helpers.
- Race candidate engines head-to-head by passing several methods to
  `nowcast_evaluate_v1()` and comparing their coverage / revision.