Teaching the NHL Standard

Standardized Penalty Calling for NHL Officials

A Multi-Year Case Study

 August 5, 2024

Minimizing Variability of NHL Officiating

Evaluating Decision Training Impact on League Play

In a previous study, we looked at our decision training for the referees and linesmen of the National Hockey League (NHL). The aim of this training is to accelerate teaching the NHL Standard in penalty-calling. In this study, we examine six years of performance data for NHL officials and assess uCALL’s integration during that time. We measure standardization of penalty-calling by examining the variability of the game-dependent measurements used to evaluate NHL Officials (see raw stats for most recent year here).

The challenge in this analysis boils down to a simple question: How do you evaluate whether an official is good or bad? And, by extension, how does the league evaluate how quickly it is teaching the NHL Standard?

Teaching the NHL Standard: On-Ice Performance vs. uCALL Usage

To answer that question, we looked at raw performance data available on a leading website for evaluating referees and linesmen (Scouting the Refs). This website tracks season-long metrics, such as Goals per Game and Power Plays per Game, for the games a given referee or linesman officiates. Numerous other metrics relate to the impact an official has on a game, but it is hard to identify any one metric that uniquely describes an official as ‘good’ or ‘bad’. So in this case study we take all of them together as a way of characterizing each official and each group of officials. From these characterizations, we measure how much the variability differs between the veteran and the junior groups of officials.

Simultaneously, we evaluate how many junior NHL Officials had trained with uCALL for Officials, our product specially made for training NHL refs and linesmen. Looking at on-ice performance against uCALL usage, we find that the variability of the referees and linesmen has gone down gradually over 5 seasons of uCALL usage. In that time, the NHL Officials have reduced variability in penalty calls by 47%. We also find a strong relationship between the reduction of variability (i.e., standardization) and the increasing number of uCALL alumni promoted to the NHL.

Decreased variability means that penalty-calling under the NHL Standard is becoming more uniform, regardless of the experience of a referee or linesman. We get into the details of this analysis below.

Teaching the NHL Standard, Action Shot 1

On-Ice Officiating: “Scouting the Refs” Data

The “Scouting the Refs” website has yearly data on all NHL Officials’ performances. Each official is evaluated in terms of his/her impact on the game. There is no single metric for ‘good’ or ‘bad’ officiating. As a result, we use all of these metrics as common ground to evaluate uCALL’s utility in teaching the NHL Standard. The metrics used in this evaluation are:

  • Gms – Games worked this season
  • Goals/Gm – Combined goals scored per game
  • PP/Gm – Combined power plays per game
  • Minors/Gm – Minor penalty calls per game
  • Penl/Gm – Penalty calls per game
  • PIM/Gm – Penalty minutes per game
  • % Penl Home – Percent of penalties called on the home team
  • Home Win % – Percentage of home team wins
  • % Gms OT/SO – Percentage of games worked that go to OT/SO

Additional metrics were added in the 2022-23 data (see here). Every one of these metrics, except “Gms,” is a rate that reflects an official’s impact on the games worked. As a result, we have a way to evaluate each official’s impact on a given game.

An example of this data can be seen from the screenshot below for 2021-22:

Making Each Metric Equal To All Others

One thing you notice from the data screenshot above is that some values range from 0 to 1 (0-100%), like “Home Win %” (Percentage of home team wins). Other metrics range from 5 to 8 (like “Goals/Gm”, or Combined goals scored per game) or from 12 to 22 (like “PIM/Gm”, or Penalty minutes per game). Even though one metric may be more important than another for teaching the NHL Standard, we don’t know a priori by how much. For instance, is Combined goals scored per game 5 to 8 times more important than Percentage of home team wins? Or is Penalty minutes per game 12 to 22 times more important? So as a first cut, we normalize everything to the same scale, [0,1].

Furthermore, this analysis is deliberately naïve. In other words, we don’t presume that Minor penalty calls per game is more or less important than Penalty minutes per game, or that Percentage of home team wins is more or less important than Percentage of games worked that go to OT/SO. Every metric has the same level of importance, and we simply express each value as a fraction of that metric’s maximum. For instance, in the screenshot above, Goals/Gm goes from a range of [5.4, 10.5] to [0.51, 1.00] and Minors/Gm goes from a range of [2.3, 4.0] to [0.58, 1.00].

We apply the same procedure to all other rows of all other years of officiating data. Once organized this way, we can proceed to how we characterized variability across the Officials when teaching the NHL Standard.
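The rescaling described above can be sketched in a few lines. This is a minimal illustration, assuming each metric is divided by its column maximum (which is what the quoted ranges imply: 5.4/10.5 ≈ 0.51 and 2.3/4.0 = 0.575). The function name, the row layout, and the choice to leave “Gms” unscaled are assumptions for the example, and the two rows are hypothetical apart from the range endpoints quoted in the text.

```python
def normalize_by_max(rows, skip=("Gms",)):
    """Scale every metric by its column maximum so values fall in (0, 1].

    Columns named in `skip` are passed through unchanged (assumption:
    games worked is used for grouping, not as a rate metric).
    """
    metrics = [key for key in rows[0] if key not in skip]
    maxima = {m: max(row[m] for row in rows) for m in metrics}
    return [
        {key: (value / maxima[key] if key in maxima else value)
         for key, value in row.items()}
        for row in rows
    ]


# Hypothetical rows: only the min/max values come from the ranges in the text.
rows = [
    {"Gms": 70, "Goals/Gm": 5.4, "Minors/Gm": 2.3},
    {"Gms": 12, "Goals/Gm": 10.5, "Minors/Gm": 4.0},
]

normalized = normalize_by_max(rows)
for row in normalized:
    print({key: round(value, 3) for key, value in row.items()})
```

Dividing by the maximum, rather than subtracting the minimum first, keeps the lower end of each range above zero, which matches the [0.51, 1.00] and [0.58, 1.00] ranges reported above.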


Characterizing Standardization of Officiating

Previous studies showed how junior officials’ penalty-calling improved over season-wide use of uCALL. In this part, we describe how we characterized whether those skills transferred to the ice.

Recall from the last section that each referee’s or linesman’s impact on a game is summarized by values in the range [0,1]. These values were mapped onto this range so that no one metric dominates the analysis. We assume that all metrics are equally important, except one: Games worked this season.

Games Worked (Gms) Divides Veteran from Junior Officials

The NHL has a large percentage of officials that are Veterans. Consequently, these officials work most games of the season. Conversely, the Junior Officials can work anywhere from 1 game up to a couple dozen games in a season. A histogram of Games worked shows the breakdown for 2021-22:

As the graph shows, most of the officials sit at the high end of Games worked, and this was true in every season. We used this upper group to define Veteran Officials. Specifically, we defined Veterans as those whose games worked exceeded the mean by more than two standard errors. Anyone at or below that threshold was considered a Junior Official. For instance, in 2021-22, the mean games worked was 58 and the standard error was 3.5. So anyone who worked more than 58 + 2*3.5 = 65 games was defined as a Veteran, and anyone else was a Junior Official.
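The Veteran/Junior split can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used in the study: the standard error is taken as the sample standard deviation divided by the square root of the number of officials, the function names are hypothetical, and the games-worked counts are invented so that the cutoff lands near the 65-game figure quoted above.

```python
import math
import statistics


def veteran_threshold(games_worked):
    """Return mean + 2 * standard error of the games-worked counts."""
    mean = statistics.mean(games_worked)
    # Assumption: standard error = sample standard deviation / sqrt(n).
    sem = statistics.stdev(games_worked) / math.sqrt(len(games_worked))
    return mean + 2 * sem


def split_officials(games_by_official):
    """Split officials into Veterans (above the cutoff) and Juniors (the rest)."""
    cutoff = veteran_threshold(list(games_by_official.values()))
    veterans = {name for name, gms in games_by_official.items() if gms > cutoff}
    juniors = set(games_by_official) - veterans
    return cutoff, veterans, juniors


# Hypothetical season: 30 officials with full workloads, 10 with light ones.
counts = [72] * 15 + [70] * 15 + [10] * 5 + [30] * 5
games_by_official = {f"official_{i}": gms for i, gms in enumerate(counts)}

cutoff, veterans, juniors = split_officials(games_by_official)
print(f"cutoff = {cutoff:.1f} games, "
      f"{len(veterans)} veterans, {len(juniors)} juniors")
```

Because the cutoff is built from the standard error rather than the standard deviation, it sits only slightly above the mean, so the large cluster of full-season officials ends up in the Veteran group.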

Defining Variability Between Veteran and Junior Officials

For each season, we could now group officials by whether they were Juniors or Veterans. In this way, we could define an “average” Veteran or Junior Official by grouping stats above according to this definition. More important for this analysis, this grouping could also help define the variability of this average Veteran/Junior official. We did this by using the standard error about this average (or mean) Veteran or Junior official.

For instance, in 2021-22, the average Veteran Official (more than 65 games worked) had the following metrics, where the values shown in parentheses are the ranges when normalized to a [0,1] scale:

  • Goals/Gm – 5.3±0.4 (0.50±0.04)
  • PP/Gm – 2.5±0.2 (0.71±0.06)
  • Minors/Gm – 2.8±0.2 (0.69±0.05)
  • Penl/Gm – 3.2±0.2 (0.63±0.05)
  • PIM/Gm – 7.6±0.6 (0.49±0.04)
  • % Penl Home – 25±2%
  • Home Win % – 32±3%
  • % Gms OT/SO – 9±1%

And the average Junior Official (65 games worked or fewer) had the following metrics, with [0,1] normalized ranges again in parentheses:

  • Goals/Gm – 6.5±0.4 (0.62±0.03)
  • PP/Gm – 3.0±0.1 (0.85±0.02)
  • Minors/Gm – 3.4±0.1 (0.84±0.03)
  • Penl/Gm – 3.8±0.1 (0.76±0.03)
  • PIM/Gm – 8.9±0.6 (0.57±0.04)
  • % Penl Home – 50±1%
  • Home Win % – 55±2%
  • % Gms OT/SO – 23±4%

With our focus on the variability (via the standard error, after the “±”), we wanted to see how much difference there was between the standard errors of these numbers for the two groups.
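The “mean ± standard error” summaries for each group can be sketched as below. The helper names and the three-official groups are hypothetical, and the standard error is again assumed to be the sample standard deviation divided by the square root of the group size.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a list of numbers."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def summarize(group):
    """Map each metric to its (mean, standard error) across a group of officials."""
    return {metric: mean_and_sem([official[metric] for official in group])
            for metric in group[0]}


# Hypothetical normalized PP/Gm values for three officials per group.
veteran_group = [{"PP/Gm": 0.70}, {"PP/Gm": 0.74}, {"PP/Gm": 0.69}]
junior_group = [{"PP/Gm": 0.84}, {"PP/Gm": 0.86}, {"PP/Gm": 0.85}]

veteran_summary = summarize(veteran_group)
junior_summary = summarize(junior_group)

for label, summary in (("Veteran", veteran_summary), ("Junior", junior_summary)):
    for metric, (mean, sem) in summary.items():
        print(f"{label} {metric}: {mean:.2f} ± {sem:.3f}")
```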

Defining Distance Between Variability of Veteran / Junior Officials

With a column of metrics for Veterans, and another column for Junior Officials, we could now calculate the distance (see more here) between these variabilities. This distance measurement is a proxy for telling us how far the variability for the Veterans is from that of the Juniors. For the 2021-22 season, the distance values for each metric are as follows ([0,1] normalized values shown only):

  • Goals/Gm – 0.116±0.004
  • PP/Gm – 0.148±0.031
  • Minors/Gm – 0.156±0.027
  • Penl/Gm – 0.124±0.021
  • PIM/Gm – 0.082±0.002
  • % Penl Home – 9±2%
  • Home Win % – 10±2%
  • % Gms OT/SO – 4±3%

Finally, we focus here on the standard error values (i.e., after the “±” sign) to quantify the distance of the variability between the Juniors and Veterans. We average these values so that each metric’s distance in variability is weighted equally. For the 2021-22 season, we get a variability difference between the Veterans and the Juniors of 0.019.
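The averaging step can be checked directly from the 2021-22 values listed above. In this sketch the percentage entries are written as fractions (2% as 0.02) so that all eight metrics share the [0,1] scale, which is an assumption about how they were combined; the variable names are illustrative.

```python
# Values after the "±" sign in the 2021-22 distance list above.
distance_sem = {
    "Goals/Gm": 0.004,
    "PP/Gm": 0.031,
    "Minors/Gm": 0.027,
    "Penl/Gm": 0.021,
    "PIM/Gm": 0.002,
    "% Penl Home": 0.02,   # 2% written as a fraction (assumption)
    "Home Win %": 0.02,    # 2%
    "% Gms OT/SO": 0.03,   # 3%
}

# Unweighted mean: every metric contributes equally.
variability_difference = sum(distance_sem.values()) / len(distance_sem)
print(f"{variability_difference:.3f}")  # 0.019
```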


Measuring Standardization Over Seasons

We applied this same method for calculating the distance in variability between Junior and Veteran Officials to all seasons in which uCALL has been used (2018-19 through 2022-23). For baseline purposes, we also went back one season, to before uCALL had even begun development (2017-18).

How Standardization Changed By Season (Measured by Variability)

We applied this method to the six seasons from 2017-18 to 2022-23 for which data on Scouting The Refs was available. We got the following values for differences (distances) in variability (standard errors) between Veteran and Junior Officials:

  • 2017-18: 0.040
  • 2018-19: 0.042
  • 2019-20: 0.025
  • 2020-21: 0.015
  • 2021-22: 0.019
  • 2022-23: 0.022

This trend is clearer in a plot of these values against season, as seen below:

Teaching the NHL Standard and Reducing Variability of Penalty Calls

As the graph shows, the distance in variability peaked in the year before uCALL began development (2017-18) and in the first year of its gradual introduction (2018-19). Over the following seasons, we see the variability decrease (2018-21) and then plateau around a value of 0.020 (2020-23). Over the course of these seasons, the difference in variability between Veterans and Juniors decreased by 47% (from 0.040 to 0.022). A decrease in variability distance like this is equivalent to a greater degree of standardization between Veteran and Junior Officials. We conclude that over the six seasons studied, the NHL Officials have become 47% more standardized in their penalty-calling. This change happened while teaching the NHL Standard to Junior Officials.
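The size of the reduction can be recomputed from the season values listed above. This is a simple arithmetic sketch with illustrative names. With the rounded values shown in the list, the drop from the 2017-18 baseline comes to about 45%, and the drop from the 2018-19 peak to about 48%; the 47% quoted in the text presumably reflects unrounded values, which is an assumption here.

```python
variability_by_season = {
    "2017-18": 0.040,
    "2018-19": 0.042,
    "2019-20": 0.025,
    "2020-21": 0.015,
    "2021-22": 0.019,
    "2022-23": 0.022,
}


def percent_reduction(start, end):
    """Percentage decrease going from `start` to `end`."""
    return 100 * (start - end) / start


latest = variability_by_season["2022-23"]
from_baseline = percent_reduction(variability_by_season["2017-18"], latest)
from_peak = percent_reduction(max(variability_by_season.values()), latest)

print(f"from 2017-18 baseline: {from_baseline:.1f}%")
print(f"from peak season:      {from_peak:.1f}%")
```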

How Fast Standardization Happened in NHL Officiating

We can go one step further to evaluate uCALL’s impact in teaching the NHL Standard to Junior Officials. Recall from other case studies that an increasing number of uCALL alumni have become NHL refs and linesmen. As a result, the Junior Officials work an increasing number of games each season.

We can filter the Scouting The Refs data for uCALL alumni and apply that filter to Games worked (Gms) in that data set. When we do so, we find that the number of games worked by uCALL alumni each season is as follows:

  • 2017-18: 0
  • 2018-19: 3
  • 2019-20: 49
  • 2020-21: 83
  • 2021-22: 306
  • 2022-23: 374

This increasing trend can be more easily seen from the chart below:

Teaching the NHL Standard and Integrating Junior Officials with Promotions

The graph shows the rate at which uCALL alumni are officiating NHL games. Each year, the number of Games worked (Gms) increases from the last. In 2018-19 (the year of uCALL’s initial development), Junior Officials using uCALL worked only 3 Gms. By the 2022-23 season, Junior Officials trained with uCALL worked 374 Gms.

Evaluating uCALL Alumni Impact on NHL Standardization

When we compare the graphs of “uCALL Alumni (Junior Officials) Games Worked” and “Variability Between Veteran and Junior Officials,” we see nearly an inverse relationship. There is a strong anti-correlation between these values (correlation coefficient, -0.62).
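The relationship between the two series can be checked with a Pearson correlation on the season values listed earlier. This sketch assumes a plain Pearson coefficient over the six seasons; the text does not specify how its coefficient was computed. With the rounded values it comes out near -0.60, close to the reported -0.62, and the small gap presumably comes from rounding.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Season values as listed in the text, 2017-18 through 2022-23.
alumni_games = [0, 3, 49, 83, 306, 374]
variability_distance = [0.040, 0.042, 0.025, 0.015, 0.019, 0.022]

r = pearson(alumni_games, variability_distance)
print(f"r = {r:.2f}")
```

With only six data points, a coefficient of this size describes the trend in the sample and should be read alongside the plots rather than on its own.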

We can therefore conclude that uCALL’s introduction to the training of Junior Officials has brought their variability in penalty-calling closer to that of the Veterans over time. A reduction in variability is by definition a greater degree of standardization. In this case, it means that uCALL’s use by Junior Officials and their gradual introduction to the NHL via promotions have together increased standardization of penalty-calling.

Back in 2018, Stephen Walkom, Director of Officiating for the NHL, commented on this new direction in training officials. He said that making a standard of penalty-calling was his goal, and that “[deCervo] likes to call it brain training.” Here, we see the impact of such training on making a more consistent standard of officiating in the NHL. Teaching the NHL Standard is a complex task, involving organization and neuroscience. But here we show its feasibility in a profession centered on arbitrating complex and fast interactions.