Statistics

Statistical modes

Statistics can be computed in numerous ways, which often leads to confusion when attempting to compare statistics when calculated by different sources.

To aid with this, Providentia allows for flexibility in the calculation of statistics, with three different statistical modes:

Temporal|Spatial
Spatial|Temporal
Flattened

The name of each statistical mode relates to how the dimensions of the data are reduced to calculate the statistical metrics, e.g. mean, median etc, going from 2D to 0D.

For the Temporal|Spatial case, aggregation is first performed across time (e.g. taking median across time per station), going to 1D, before calculating the desired statistic across the stations (e.g. Min), going to 0D.

For the Spatial|Temporal case, aggregation is first performed across stations (e.g. taking median across stations per timestep), going to 1D, before calculating the desired statistic across the timesteps (e.g. Max), going to 0D.

For the Flattened case, rather than performing a two-step calculation, the dimensions are flattened to 1D first, by appending the data in one long array, then the desired statistic is directly calculated (e.g. StdDev), going to 0D.

The default statistical mode is Temporal|Spatial, and the default aggregation statistic for the Temporal|Spatial and Spatial|Temporal modes is the Median. For the Flattened mode, there is no aggregation statistic.

The following graphic visually illustrates the dimensional reduction per mode:

statistical-modes-1

Dimensional reduction by plot type

For some plot types the full dimensional reduction is not possible, e.g map. Therefore, they are locked in certain reduction configurations. This is illustrated for some example plot types:

statistical-modes-2

Setting statistical mode / aggregation

The statistical modes and aggregation statistics can be set in the configuration file like so:

statistic_mode = Temporal|Spatial
statistic_aggregation = Median

On the dashboard, in the statistics tab at the top, the mode and aggregation can be selected via the dropdown menus.

NOTE: For the Flattened mode, there is no statistic_aggregation, so setting it will have no effect when using that mode.

Periodic statistics

The periodic plot gives statistical information for grouped data in individual periodic timesteps, e.g. per hour of the day. Thus it can be seen how each visually how statistics change over period cycles e.g. diurnal.

It is often desired however for a way to quantify how the statistics change over said periodic cycle. This can be done in Providentia through periodic statistics, and can be calculated for all available periodic cycles, i.e. diurnal, weekly, monthly.

There are two modes for calculating these periodic statistics:

Independent
Cycle

Independent works by calculating the desired statistic (e.g. r) per periodic timestep (i.e. as seen on the periodic plot), before aggregating across the timesteps (e.g. median).

Cycle works by aggregating the grouped data per periodic timestep (e.g. median), before then calculating the desired statistic across the timesteps (e.g. MB).

The default periodic statistic mode is Independent, and the default periodic statistic aggregation is Median.

As these resultant statistics are all 0D, the plots that these statistics can be used in are all 0D plot types: statsummary, heatmap, table.

Again the periodic statistics mode and aggregation can be set in the configuration file as follows:

periodic_statistic_mode = Independent
periodic_statistic_aggregation = Median

On the dashboard, in the plot options of the statsummary plot, the periodic statistic mode and aggregation can be selected via the dropdown menus. Additionally, periodic statistics can be added to the statsummary plot also via the dropdown menus.

Available statistical metrics

Providentia statistical metrics come in two categories: basic and model bias.

Basic statistics are calculated independently for observational and model datasets. The available ones are:

Statistic	Meaning
Mean	Mean
StdDev	Standard deviation
Var	Variance
Min	Minimum
Max	Maximum
Data%	Data availability
Exceedances	Number of exceedances over the values defined in settings/exceedances.yaml
p1, p5, p10, p25, p50, p75, p90, p95, p99	Percentiles
NStations	Number of stations
NUnique Stations	Number of unique stations
MDA8	Daily maximum 8 hour average

When not using the dashbaord, to get the bias between the calculated observational statistic and model statistics, the plot option bias can be added to the statistic name, e.g. Mean_bias. On the dashboard, the bias is set by using the bias plot option on the relevant plots.

Model bias statistics are calculated between paired observational and model data, and are thus only available when temporal_colocation is active. The available ones are:

Statistic	Meaning
MB	Mean bias
NMB	Normalised mean bias
ME	Mean error
NME	Normalised mean error
MNB	Mean normalised bias
MNE	Mean normalised error
MFB	Mean fractional bias
MFE	Mean fractional error
RMSE	Root mean square error
NRMSE	Normalised root mean square error
COE	Coefficient of efficiency
FAC2	Fraction of experiment values within a factor of two of observed values
IOA	Index of agreement
R	Pearson correlation coefficient
R2	Coefficient of determination
UPA	Unpaired peak accuracy

All available basic and model bias statistics in Providentia are described here:

Basic statistics

Mean

Computes the arithmetic mean of the data.

\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i , \]

where:

\[\begin{split} \begin{aligned} \mu &= \text{arithmetic mean} \\ x &= \text{data} \\ n &= \text{number of values in the data set} \end{aligned} \end{split}\]

Median

Computes the median (middle value) of the data.

\[\begin{split} \tilde{x} = \begin{cases} x_{\frac{n+1}{2}}, & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{\frac{n}{2}} + x_{\frac{n}{2}+1}\right), & \text{if } n \text{ is even} \end{cases} , \end{split}\]

where:

\[\begin{split} \begin{aligned} \tilde{x} &= \text{median of the data} \\ x &= \text{data} \\ n &= \text{number of values in the data set} \end{aligned} \end{split}\]

Min

Computes the minimum value of the data.

\[ x_{\min} = \min(x) , \]

where:

\[\begin{split} \begin{aligned} x_{\min} &= \text{minimum value of the data} \\ x &= \text{data} \end{aligned} \end{split}\]

Max

Computes the maximum value of the data.

\[ x_{\max} = \max(x) , \]

where:

\[\begin{split} \begin{aligned} x_{\max} &= \text{maximum value of the data} \\ x &= \text{data} \end{aligned} \end{split}\]

p1, p5, p10, p25, p75, p90, p95, p99

Computes the value corresponding to a given percentile of the data.

\[ f_p = \frac{p}{100} \cdot (n - 1) + 1 , \]

\[ k = \operatorname*{round}(f_p) , \]

\[ x_p = x_{k} , \]

where:

\[\begin{split} \begin{aligned} x_p &= \text{percentile value at percentile } p \\ k &= \text{nearest integer index in the data set} \\ f_p &= \text{fractional index corresponding to percentile } p \\ x &= \text{data} \\ p &= \text{percentile (0--100)} \\ n &= \text{number of values in the data set} \end{aligned} \end{split}\]

StdDev (Standard Deviation)

Computes the standard deviation of the data.

\[ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2} , \]

where:

\[\begin{split} \begin{aligned} \sigma &= \text{standard deviation of the data} \\ x &= \text{data} \\ \mu &= \text{mean of the data} \\ n &= \text{number of values in the data set} \end{aligned} \end{split}\]

Var (Variance)

Computes the variance of the data.

\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 , \]

where:

\[\begin{split} \begin{aligned} \sigma^2 &= \text{variance of the data} \\ x &= \text{data} \\ \mu &= \text{mean of the data} \\ n &= \text{number of values in the data set} \end{aligned} \end{split}\]

NData

Computes the total number of non-missing values in the data.

\[ n_{\text{valid}} = \sum_{i=1}^{n} \mathbf{1}(x_i \neq \mathrm{NaN}) , \]

where:

\[\begin{split} \begin{aligned} n_{\text{valid}} &= \text{number of non-missing values} \\ x &= \text{data} \\ n &= \text{number of values in the data set} \end{aligned} \end{split}\]

Data%

Computes the fraction of non-missing values in the data, expressed as a percentage.

\[ n_{\text{valid}} = \sum_{i=1}^{n} \mathbf{1}(x_i \neq \mathrm{NaN}) , \]

\[ n_{\text{valid\%}} = \frac{n_{\text{valid}}}{n} \cdot 100 , \]

where:

\[\begin{split} \begin{aligned} n_{\text{valid\%}} &= \text{data availability fraction (percent)} \\ n_{\text{valid}} &= \text{number of non-missing values} \\ x &= \text{data} \\ n &= \text{number of values in the data set} \end{aligned} \end{split}\]

NStations (Number of Stations)

Computes the number of stations with valid (non-missing) data values.

\[ N_{\text{stations}} = \sum_{s=1}^{S} \mathbb{I}\!\left( \sum x_s > 0 \right) , \]

where:

\[\begin{split} \begin{aligned} N_{\text{stations}} &= \text{number of stations with valid data} \\ x &= \text{data} \\ S &= \text{total number of stations} \\ \mathbb{I}(\cdot) &= \text{indicator function (1 if condition is true, 0 otherwise)} \end{aligned} \end{split}\]

NUniqueStations (Number of Unique Stations)

Computes the number of unique stations with at least one valid (non-missing) data value across the time domain.

\[ N_{\text{unique stations}} = \sum_{s=1}^{S} \mathbb{I}\!\left( \sum_{u \in \mathcal{D}} \mathbb{I}\!\left( x_{s,u}\ \text{is valid} \right) > 0 \right) , \]

where:

\[\begin{split} \begin{aligned} N_{\text{unique stations}} &= \text{number of stations with at least one valid data value} \\ x &= \text{data} \\ S &= \text{total number of stations} \\ \mathcal{D} &= \text{full analysis domain (all times and groups, if applicable)} \\ \mathbb{I}(\cdot) &= \text{indicator function (1 if condition is true, 0 otherwise)} \end{aligned} \end{split}\]

Exceedances

Computes the number of data points exceeding a specified threshold.

\[ N_{\text{exceed}} = \sum_{i=1}^{n} \mathbf{1}_{\{x_i > \text{threshold}\}} , \]

where:

\[\begin{split} \begin{aligned} N_{\text{exceed}} &= \text{number of exceedances above the threshold} \\ x &= \text{data} \\ n &= \text{number of values in the data set} \\ \text{threshold} &= \text{value used to define an exceedance} \\ \mathbf{1}_{\{x_i > \text{threshold}\}} &= \begin{cases} 1, & \text{if } x_i > \text{threshold} \\ 0, & \text{otherwise} \end{cases} \end{aligned} \end{split}\]

MDA8 (Daily Maximum 8 Hour Average)

Computes the maximum 8-hour rolling average of the data for each day.

\[ \text{MDA8} = \max \Biggl( \frac{1}{8} \sum_{i=t}^{t+7} x_i \Biggr), \quad t = 1, 2, \dots, 17 , \]

where:

\[\begin{split} \begin{aligned} \text{MDA8} &= \text{daily maximum 8-hour average} \\ x &= \text{data (hourly values)} \\ t &= \text{starting hour index of the 8-hour window (1 to 17)} \\ \sum_{i=t}^{t+7} x_i / 8 &= \text{mean over each 8-hour window} \end{aligned} \end{split}\]

Model bias statistics

MB (Mean Bias), NMB (Normalised Mean Bias)

Computes the average difference between model and observations, with a normalised option expressed as a percentage.

\[ \text{MB} = \frac{1}{n} \sum_{i=1}^{n} (m_i - o_i) , \]

\[ \text{NMB} = \frac{\frac{1}{n} \sum_{i=1}^{n} (m_i - o_i)}{\mu_o} \cdot 100 , \]

where:

\[\begin{split} \begin{aligned} \text{MB} &= \text{mean bias} \\ \text{NMB} &= \text{normalised mean bias (percent)} \\ o &= \text{observational data} \\ m &= \text{model data} \\ \mu_o &= \text{mean of the observational data} \\ n &= \text{number of paired values} \end{aligned} \end{split}\]

MNB (Mean Normalised Bias)

Computes the mean bias between model and observations, normalised by the observed value, expressed as a percentage.

\[ \text{MNB} = \frac{1}{n} \sum_{i=1}^{n} \frac{m_i - o_i}{o_i} \cdot 100 , \]

where:

\[\begin{split} \begin{aligned} \text{MNB} &= \text{mean normalised bias (percent)} \\ o &= \text{observational data} \\ m &= \text{model data} \\ n &= \text{number of paired values} \end{aligned} \end{split}\]

MFB (Mean Fractional Bias)

Computes the average fractional difference between model and observations, expressed as a percentage. Equally weights positive and negative biases and does not treat observations as the absolute “true” value.

Otherwise known as fractional bias (FB) or modified normalised mean bias (MNMB).

\[ \text{MFB} = \frac{1}{n} \sum_{i=1}^{n} \frac{m_i - o_i}{\frac{m_i + o_i}{2}} \cdot 100 , \]

where:

\[\begin{split} \begin{aligned} \text{MFB} &= \text{mean fractional bias (percent)} \\ o &= \text{observational data} \\ m &= \text{model data} \\ n &= \text{number of paired values} \end{aligned} \end{split}\]

ME (Mean Error), NME (Normalised Mean Error)

Computes the average absolute difference between model and observations. Always positive, highlighting biases that may cancel out in MB. Has a normalised option expressed as a percentage (NME).

ME is otherwise known as mean gross error (MGE), mean absolute error (MAE) or mean absolute gross error (MAGE). NME is otherwise known as normalised mean gross error (NMGE) or normalised mean absolute error (NMAE).

\[ \text{ME} = \frac{1}{n} \sum_{i=1}^{n} |m_i - o_i| , \]

\[ \text{NME} = \frac{\frac{1}{n} \sum_{i=1}^{n} |m_i - o_i|}{\mu_o} \cdot 100 , \]

where:

\[\begin{split} \begin{aligned} \text{ME} &= \text{mean error} \\ \text{NME} &= \text{normalised mean error (percent)} \\ o &= \text{observational data} \\ m &= \text{model data} \\ \mu_o &= \text{mean of the observational data} \\ n &= \text{number of paired values} \end{aligned} \end{split}\]

MNE (Mean Normalised Error)

Computes the mean absolute difference between model and observations, normalised by the observed value, expressed as a percentage. Always positive.

Otherwise known as mean normalised absolute error (MNAE).

\[ \text{MNE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|m_i - o_i|}{o_i} \cdot 100 , \]

where:

\[\begin{split} \begin{aligned} \text{MNE} &= \text{mean normalised error (percent)} \\ o &= \text{observational data} \\ m &= \text{model data} \\ n &= \text{number of paired values} \end{aligned} \end{split}\]

MFE (Mean Fractional Error)

Computes the mean absolute fractional difference between model and observations, expressed as a percentage. Always positive.

Otherwise known as fractional error (FE), fractional gross error (FGE) or mean absolute fractional bias (MAFB).

\[ \text{MFE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|m_i - o_i|}{\frac{m_i + o_i}{2}} \cdot 100 , \]

where:

\[\begin{split} \begin{aligned} \text{MFE} &= \text{mean fractional error (percent)} \\ o &= \text{observational data} \\ m &= \text{model data} \\ n &= \text{number of paired values} \end{aligned} \end{split}\]

RMSE (Root Mean Squared Error), NRMSE (Normalised Root Mean Squared Error)

Computes the square root of the mean squared difference between model and observations, with a normalised option expressed as a percentage.

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (m_i - o_i)^2} , \]

\[ \text{NRMSE} = \frac{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (m_i - o_i)^2}}{\mu_o} \cdot 100 , \]

where:

\[\begin{split} \begin{aligned} \text{RMSE} &= \text{root mean squared error} \\ \text{NRMSE} &= \text{normalised root mean squared error (percent)} \\ o &= \text{observational data} \\ m &= \text{model data} \\ \mu_o &= \text{mean of the observational data} \\ n &= \text{number of paired values} \end{aligned} \end{split}\]

r (Pearson Correlation Coefficient), r2 (Coefficient of Determination)

Computes the linear relationship between observations and model predictions, and the proportion of variance explained.

For r, ranges between -1 and +1, with 0 implying no correlation, and -1 or +1 implying an exact linear relationship, either positive (observations and model increase together) or negative (as observations increase, model decreases).

For r2, ranges between 0 and +1, with 0 implying no correlation, and +1 implying an exact linear relationship.

\[ r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{o_i - \mu_o}{\sigma_o} \right) \left( \frac{m_i - \mu_m}{\sigma_m} \right) , \]

\[ R^2 = r^2 , \]

where:

\[\begin{split} \begin{aligned} r &= \text{Pearson correlation coefficient} \\ R^2 &= \text{coefficient of determination} \\ o &= \text{observational data} \\ m &= \text{model data} \\ \mu_o &= \text{mean of the observational data} \\ \mu_m &= \text{mean of the model data} \\ \sigma_o &= \text{standard deviation of the observational data} \\ \sigma_m &= \text{standard deviation of the model data} \\ n &= \text{number of paired values} \end{aligned} \end{split}\]

COE (Coefficient of Efficiency)

Measures the model’s predictive skill relative to the mean of the observations. Ranges from -∞ to 1, where 1 indicates a perfect match, 0 indicates no improvement over the mean, and negative values indicate worse performance than the mean.

\[ \text{COE} = 1 - \frac{\sum |m_i - o_i|}{\sum |o_i - \mu_o|} , \]

where:

\[\begin{split} \begin{aligned} \text{COE} &= \text{coefficient of efficiency} \\ o &= \text{observational data} \\ m &= \text{model data} \\ \mu_o &= \text{mean of the observational data} \end{aligned} \end{split}\]

FAC2

Computes the percentage of model values that lie within a factor of two of the corresponding observations.

\[ \mathrm{FAC2} = \frac{100}{n} \sum_{i=1}^{n} \mathbf{1} \left( 0.5 \le \frac{m_i}{o_i} \le 2.0 \right) , \]

where:

\[\begin{split} \begin{aligned} \mathrm{FAC2} &= \text{fraction of values within a factor of two (percent)} \\ o &= \text{observational data} \\ m &= \text{model data} \\ n &= \text{number of paired values} \\ \mathbf{1}(\cdot) &= \text{indicator function (1 if condition is true, 0 otherwise)} \end{aligned} \end{split}\]

IOA (Index of Agreement)

Quantifies the degree to which model predictions approach observed values, ranging from -1 to +1. Values near +1 indicate better model performance.

\[\begin{split} \text{IOA} = \begin{cases} 1 - \dfrac{\sum |m_i - o_i|}{2 \sum |o_i - \mu_o|}, & \text{if } \sum |m_i - o_i| \le 2 \sum |o_i - \mu_o| \\[2mm] \dfrac{2 \sum |o_i - \mu_o|}{\sum |m_i - o_i|} - 1, & \text{otherwise} \end{cases} , \end{split}\]

where:

\[\begin{split} \begin{aligned} \text{IOA} &= \text{Index of Agreement} \\ o &= \text{observational data} \\ m &= \text{model data} \\ \mu_o &= \text{mean of the observational data} \end{aligned} \end{split}\]

UPA (Unpaired Peak Accuracy)

Computes the relative difference between the maximum model value and maximum observed value, expressed as a percentage.

\[ \text{UPA} = \frac{\max(m) - \max(o)}{\max(o)} \cdot 100 , \]

where:

\[\begin{split} \begin{aligned} \text{UPA} &= \text{unpaired peak accuracy (percent)} \\ o &= \text{observational data} \\ m &= \text{model data} \end{aligned} \end{split}\]