## 1. Introduction

One challenge in integrating weather-dependent renewable energy onto the electric grid is the temporal variability of the wind or solar resource. For wind, this variability is amplified by a wind turbine’s nonlinear power curve that translates wind speed into power, creating large variations of wind energy production over short periods of time. Figure 1 (top panel) displays the time series of wind speed measured on a tall tower at 80 m above ground level (AGL), and the resulting wind power (bottom panel) produced by a turbine using a standard International Electrotechnical Commission (IEC) class 2 (International Electrotechnical Commission 2007) turbine power curve (center panel). These data were collected in South Dakota, a region featuring a large amount of wind energy production, where the class 2 turbine is the most common type of wind turbine deployed.

The wind power production in Fig. 1 has extended periods of time with either zero power production (for speeds below the turbine’s cut-in speed, 3 m s^{−1}) or near 100% of its capacity for high speeds (between 13 and 25 m s^{−1}), with frequent jumps between small and large power values. These jumps, or ramp events, can be large because of the wind power increasing approximately as the cube of the wind speed in the middle portion of the turbine’s power curve. Ramp events are important in real-time grid operations. If large changes in wind power production occur, grid operators must keep the grid in balance by making equally large and abrupt changes in conventional energy generation. Large and sudden changes in conventional generation can be costly and problematic (Bradford et al. 2010; Francis 2008), especially if the changes are not forecast accurately, both in terms of their amplitude and timing.
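The nonlinear translation from wind speed to power described above can be illustrated with a minimal Python sketch. The cut-in (3 m s^{−1}), rated (roughly 13 m s^{−1}), and cut-out (25 m s^{−1}) speeds are taken from the text; the cubic segment between cut-in and rated speed is a common idealization, not the exact IEC class 2 tabulated curve:

```python
def power_curve(speed, cut_in=3.0, rated=13.0, cut_out=25.0):
    """Normalized power (0 to 1) from wind speed (m/s) using a simplified
    power curve: zero below cut-in or above cut-out, constant at rated
    power between the rated and cut-out speeds, and an approximately
    cubic increase between cut-in and rated speed."""
    if speed < cut_in or speed > cut_out:
        return 0.0
    if speed >= rated:
        return 1.0
    # Power grows approximately as the cube of the wind speed in the
    # middle portion of the curve, normalized to reach 1.0 at rated speed.
    return (speed**3 - cut_in**3) / (rated**3 - cut_in**3)
```

Because of the cubic segment, a modest change in wind speed near the middle of the curve produces a large change in power, which is precisely why ramp events are so pronounced.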

Standard metrics (e.g., mean absolute error and root-mean-square error) may not be well suited for wind energy forecast evaluation because wind power production can be near constant for considerable periods of time, especially near 100% or 0%; it is the periods of rapid transition between those two states that are most challenging for grid balancing. Therefore, a wind ramp metric can provide a useful statistical measure of the accuracy of the model at forecasting ramp events by weighting the model agreement for these events more than during periods of near-constant power. A ramp metric can be used to compare the skill of two models at forecasting ramp events, or for documenting progress in improving a given model. It could also be used to analyze the climatology of wind ramp events (i.e., seasonal means or interannual variations) from observations or models.

The identification and forecast evaluation of ramp events has similarities and differences with other meteorological phenomena, such as precipitation [rate of change; Hamill (2014)], aviation forecasting [timing; Isaac et al. (2014); Jacobs and Maat (2005); Wong et al. (2013)], air quality indices and severe convection (amplitude), and floods or droughts (direction of change). In addition, the response of a grid operator to a down ramp may be different than for an up ramp as curtailing output for up-ramp events may be easier than quickly bringing on additional generation for down ramps. The combination of rate of change, timing, and the directionality of ramp events makes for a unique forecasting problem. For an agency such as NOAA, the Ramp Tool and Metric (hereafter referred to as RT&M) can be useful for determining how potential changes to research or operational weather prediction models will impact users in the wind energy industry.

Despite the importance of ramp events for renewable energy, there is no commonly accepted definition of a ramp event, nor is a single strict threshold possible, because the threshold at which a ramp becomes important will vary from user to user, and from situation to situation (e.g., Zack et al. 2011). In this study we aim to develop a ramp tool that has the flexibility to be helpful for a variety of users and that can easily be modified or tuned to be valuable in a variety of situations. For models, the RT&M is most applicable as a diagnostic tool, for evaluating the skill of a long time series of forecasts, and was not designed or intended to be used for making real-time decisions.

The RT&M described here has three components. The first is the identification of ramp events in a time series of power data, for which several different methodologies are employed and compared. The second component of the RT&M matches observed ramp events with those predicted by a forecast model. If ramp events are defined such that they are rare events, matching is relatively simple. However, when the definition is relaxed so that ramp events become more frequent, matching events can be more difficult (Wandishin et al. 2014). The final component of the RT&M is a methodology for scoring the ability of a model to forecast ramp events. We develop a scoring metric that accounts for phase, duration, and amplitude errors in the forecast, and differentiates between the impacts of up- and down-ramp events. The particular scoring rules that we use are intended to reflect the perspective of a grid operator; however, the metric itself is flexible so that it could be easily modified to reflect the needs of other users. To test the RT&M, we used the 13-km horizontal resolution National Oceanic and Atmospheric Administration/Earth System Research Laboratory (NOAA/ESRL) Rapid Refresh (RAP) model and tall-tower anemometer observations that were collected during the Wind Forecast Improvement Project (WFIP), which took place in the U.S. Great Plains during 2011–12 (Wilczak et al. 2014, 2015). The RT&M results are applied to data collected over 9 days (7–15 January 2012) from a set of four 80-m-tall towers spanning an ~100 km × 100 km area, and to forecasts from the RAP model at the same locations. The RT&M is available online (http://www.esrl.noaa.gov/psd/products/ramp_tool/).^{1}

This paper is organized as follows. Section 2 presents three different methods for defining ramp events and compares their results. Section 3 discusses the issues related to ramp matching. Forecast scoring and model evaluation procedures are presented for one ramp definition in section 4. The forecast skill score methodology is then extended for a range of ramp definitions in section 5. Section 6 provides a summary and discussion. In the appendix we discuss rules for applying a bonus to the model skill when excess energy production can be reduced through wind plant curtailment.

## 2. Ramp identification

As mentioned, there is no absolute definition of a ramp as the definition will depend on the particular application, and the application will change from user to user (Freedman et al. 2008). Also, users may require multiple definitions of ramps to be operative simultaneously. For example, if a ramp event has a 60% capacity change in generation over a 4-h period, it could inform a utility that a certain type of unit needs to be brought online. However, if within that 4-h period there is an embedded ramp with a 30% of capacity change in only 15 min, then a different type of unit may need to be brought online for that 15-min period within the 4-h duration ramp. A ramp metric must address a matrix of time and amplitude scales simultaneously to give a robust measure of model skill.

The type of power time series used in this tool may differ depending on the user (i.e., for a wind plant operator interested in forecast skill for a single plant or turbine, the time series considered would be the power production from that plant or turbine; for a grid operator concerned with the aggregate power generated by wind and solar over their entire balancing area, the time series to be considered would be the aggregate power). We assume that the time series in either case is the basis upon which operational decisions are made, so there is no reason to filter or modify the time series for the analysis of ramp events. Data compression routines such as the swinging door algorithm (Bristol 1990; Florita et al. 2013; Zhang et al. 2014) reduce a time series to a shorter series of linear segments, ignoring smaller changes in power as noise, so that the filtered time series will not contain the full range of power variations originally present. Also, although we do not apply any bias correction to the model output, it is possible for a user to apply postprocessing techniques before inputting their data into the RT&M.

Three ramp identification methods are presented in this section. Each is tested on 9 days of observations from four South Dakota State University (SDSU) 80-m-tall towers [Faith (FAH): latitude 45.0539°, longitude −102.2630°, altitude 797 m; Long Valley (LVL): latitude 43.4331°, longitude −101.5544°, altitude 944 m; Lowry (LWY): latitude 45.2772°, longitude −99.9861°, altitude 663 m; and Reliance (REL): latitude 43.9681°, longitude −99.5944°, altitude 628 m] and on forecasts for the same locations.

Forecasts were generated by the hourly updated RAP model (http://rapidrefresh.noaa.gov). For comparison purposes the model forecast values at the tower location were determined through a horizontal parabolic interpolation of the 16 model grid points surrounding the tower location, followed by linear vertical interpolation of the model wind profiles to the tower instrument height. The RAP model provided output at 15-min intervals, while the SDSU tower data were available as 10-min averages. Time interpolation is discussed below.

The first step of the ramp identification process is to generate equal-length time series of model forecast and observational wind speed data. For this we take two different approaches that we will refer to as the stitching method and the independent forecast run method.

For the stitching method we create a time series of model forecasts for a particular forecast horizon. First, consider the simple case where the dataset consists of hourly model output, hourly observations, and model forecasts that are initialized on an hourly basis. A time series of forecasts for a fixed forecast horizon (say forecast hour 3) is created by concatenating all 3-h forecasts over a length of time *t*, where *t* is greater than the maximum forecast length (15 h for the RAP), and equal in length to the considered observed time series (here, *t* is 9 days, or 216 h).

The WFIP observed wind speeds have 10-min resolution while the model output has 15-min granularity, and the model initialization cycle is hourly. To make use of the high temporal resolution data for detecting ramps, the process described above is modified by extracting the four sequential 15-min forecasts that begin at a given forecast horizon hour, and then concatenating these groups of four forecasts. This time series of 15-min forecasts of similar forecast horizon values is then linearly interpolated to the 10-min intervals of the observations, and both time series are converted into power using the IEC2 turbine power curve.
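The stitching and time-interpolation steps just described can be sketched as follows. The data layout (each model run as a plain list of 15-min forecast values, with forecast hour *h* beginning at index 4*h*) is a hypothetical simplification of the actual RAP output format, and the function names are illustrative:

```python
def stitch_forecasts(runs, horizon_idx, steps=4):
    """Concatenate, from each model run (a list of 15-min forecast
    values), the `steps` sequential values beginning at `horizon_idx`,
    the first 15-min step of the desired forecast-horizon hour."""
    stitched = []
    for run in runs:
        stitched.extend(run[horizon_idx:horizon_idx + steps])
    return stitched

def to_10min(series_15min):
    """Linearly interpolate a 15-min series onto a 10-min time grid
    (assumes the series has at least two points)."""
    out = []
    t_end = 15 * (len(series_15min) - 1)
    for t in range(0, t_end + 1, 10):
        i = min(t // 15, len(series_15min) - 2)   # bracketing 15-min index
        frac = (t - 15 * i) / 15.0                # position within segment
        out.append((1 - frac) * series_15min[i] + frac * series_15min[i + 1])
    return out
```

With hourly initializations, stitching forecast hour 1 (for example) would use `horizon_idx=4`, taking the four 15-min steps from 1:00 through 1:45 of each run before interpolating the concatenated series to the 10-min observation times.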

For the independent forecast runs method we instead use sets of individual forecast runs, not concatenated, in our case each with a length of 15 h, and compare each individual forecast run against the corresponding observational time series. The advantage of this approach is that it does not have the potential to create artificial ramps through the stitching process, while it has the disadvantage of increasing the relative number of occurrences of ramps that are truncated at the start and end of each forecast run, as well as other disadvantages that will be discussed later.

We include both of these approaches in the RT&M, allowing the user [through the graphical user interface (GUI)] to choose the one that best addresses their analysis needs.

For both the stitching method and the independent forecast runs method, ramps are then identified within the two corresponding model and observational time series using one of the ramp detection methods described below.

### a. Fixed-time interval method

Various methods for defining ramp events have been proposed (Cutler et al. 2007; Greaves et al. 2009; Kamath 2010; Zack et al. 2010; Bossavy et al. 2010; Ferreira et al. 2011), and the methods that we employ include aspects of these earlier studies. Ramps of different sign are recorded separately, so at the beginning two identical time series of logical “no ramp” values are initialized. The first of these time series will record only “up” events and the second only “down” events. The first ramp identification method, referred to as the fixed-time interval method, uses a sliding time window of length WL, over which we measure the change in power. This method tests whether the difference in power between the end and the beginning of the window meets or exceeds the specified power threshold; if it does, the points within the window are flagged as up- or down-ramp points according to the sign of the change, and overlapping flagged windows of the same sign are concatenated into single events.

Each concatenated ramp event is defined by its center time, its duration, and its change in power.

Although appealing because of its simplicity, this method has the possible drawbacks that 1) the selected ramp events may not intuitively look like ramps, since larger values of WL can include extended periods of nearly constant power within the flagged interval, and 2) the duration of an event is tied to the window length, so the precise start and end times of the power change are not identified.

An example of ramp identification using the fixed-time interval method on observed and stitched modeled data (for the model initialization time, hour 0) is displayed in Fig. 3 for the 9 days of aggregate power data from the four SDSU towers.^{2}

Using a ramp definition of a power change greater than 40% over a nominal 2-h period, seven up ramps (shown in red) and two down ramps (in green) are found through the time series of the observations. In contrast, due to the smoothness of the forecast time series of power, the model finds only four up and two down ramps. Since Fig. 3 shows the time series of the aggregate power, the changes in power are smoother than those we would see if we considered the time series of the power for one tower only. For this reason, in this example there are no overlapping points of opposite-signed ramps.^{3}
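The fixed-time interval test can be sketched in a few lines of Python. This is a minimal illustration, with the window length given in samples and the power and threshold in the same (arbitrary) units; the concatenation of overlapping flagged windows into discrete events described in the text is omitted:

```python
def fixed_time_ramps(power, window, threshold):
    """Fixed-time interval method (sketch): slide a window of `window`
    samples over the power series and flag every sample inside any
    window whose start-to-end power change meets `threshold`.
    Up and down ramps are recorded separately, as in the text."""
    n = len(power)
    up = [False] * n
    down = [False] * n
    for i in range(n - window):
        delta = power[i + window] - power[i]
        if delta >= threshold:            # up ramp over this window
            for j in range(i, i + window + 1):
                up[j] = True
        elif delta <= -threshold:         # down ramp over this window
            for j in range(i, i + window + 1):
                down[j] = True
    return up, down
```

Note that a point is flagged whenever any window containing it passes the test, so a single sustained power change yields one contiguous run of flagged points, which the full tool would then concatenate into a single event.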

### b. Minimum–maximum method

The next approach, referred to as the minimum–maximum (min–max) method, avoids the two problems previously noted for the fixed-time interval method. This technique finds the maximum amplitude change in power within the sliding window, that is, the difference between the minimum and maximum power values in the window, and tests whether it meets or exceeds the power threshold.

The initial ramp duration is determined by the times at which the minimum and maximum power occur within the window.

In these cases the ramps can have a duration shorter than the window length WL, since the power extremes need not fall at the ends of the window.

Figure 4 shows the same schematic power time series as in Fig. 2, but ramp events as detected by the min–max method are displayed. Now the identified ramp events look more intuitive, and more ramp events can be found in general as the selection of minimum and maximum power within WL shortens the duration of the event.

The min–max method on the same aggregate of observed and modeled data as in Fig. 3 (not shown) has generally similar results, finding seven up ramps and three down ramps in the time series of the observations and four up and two down ramps in the model simulation.
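The min–max selection step can be sketched as follows. As with the earlier sketch, units are arbitrary and the subsequent concatenation of overlapping events is omitted:

```python
def minmax_ramps(power, window, threshold):
    """Min-max method (sketch): within each sliding window, locate the
    minimum and maximum power; if their difference meets `threshold`,
    record a ramp spanning only the samples between the two extremes,
    signed by their time order. Returns (start, end, sign) tuples."""
    events = []
    for i in range(len(power) - window):
        seg = power[i:i + window + 1]
        lo = min(range(len(seg)), key=seg.__getitem__)   # index of minimum
        hi = max(range(len(seg)), key=seg.__getitem__)   # index of maximum
        if seg[hi] - seg[lo] >= threshold:
            start, end = sorted((i + lo, i + hi))
            sign = 1 if hi > lo else -1   # up ramp if the max follows the min
            events.append((start, end, sign))
    # The full tool would concatenate overlapping same-signed events here.
    return events
```

Because the event spans only the interval between the two power extremes, the identified ramps are shorter and more visually ramp-like than those of the fixed-time interval method, consistent with the discussion above.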

### c. Explicit derivative method

The third method that we consider is referred to as the explicit derivative method. In this context, “explicit” means all points within the window WL are used to define a derivative and therefore the ramp. First, a smoothed time derivative of the power time series is computed using all points within the window; points at which the power change implied by this derivative over the window meets or exceeds the threshold are flagged as ramp points.

Employing the explicit derivative method on the same aggregate of observed and modeled data in Fig. 3 (not shown) yields five up ramps and three down ramps in the time series of the observations and seven up ramps and two down ramps in the model simulation.
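A sketch of the explicit derivative idea follows. The exact smoothing used in the RT&M is not reproduced here; a least-squares slope fit to all points in the window is one common choice of smoothed derivative, and is the assumption made in this illustration:

```python
def slope(values):
    """Least-squares slope of `values` against sample index, using every
    point in the segment (a simple smoothed derivative)."""
    n = len(values)
    mean_t = (n - 1) / 2.0
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den

def derivative_ramps(power, window, threshold):
    """Explicit derivative method (sketch): fit a slope over each sliding
    window and flag the window as a ramp when the implied power change
    across it, slope * window, meets `threshold` in magnitude.
    Returns (start, end, sign) tuples."""
    events = []
    for i in range(len(power) - window):
        s = slope(power[i:i + window + 1])
        if abs(s * window) >= threshold:
            events.append((i, i + window, 1 if s > 0 else -1))
    return events
```

Because every point in the window contributes to the fitted slope, this detector is less sensitive to single-sample noise than the endpoint difference of the fixed-time interval method, which is consistent with the extra temporal smoothing noted later in section 5.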

## 3. Matching of forecast and observed ramps

The next step is to develop a methodology for matching the observed and modeled events. The general philosophy we use is to match events that are closest in time; if multiple events have the same time separation, those with the closest ramp rate are matched. The inputs to the matching algorithm are the sequential lists of ramp events over a time period *t* from the observations and forecasts, each event defined by its duration, center time, and change in power. From these lists a matrix of the differences in center times between every forecast and every observed event is constructed.

The matrix of differences in center times is then searched for the minimum value(s), corresponding to the model ramps that are closest in time to the observed. If this minimum value is larger than WL, then the event is unmatched. Frequently, multiple events with the same minimum timing error will be found, because of a limited number of time shifts possible for the discrete 10-min sample dataset. If more than one minimum exists, all of the minima are evaluated for instances when one model ramp is paired with two equally spaced (preceding and following) observed ramps. If the model value is paired with only one observation, these two events are matched, and then eliminated from any further searching. If the model ramp is paired with two observed events at the current time-shift minima, the choice of which one is matched with the model event is made based on the ramp rates, with the observed event whose ramp rate is closest to that of the model event being selected.
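The matching philosophy above (closest center times first, ramp-rate similarity as the tie-breaker, no matches beyond the window length) can be sketched as a greedy pairing. The dict-based event representation and function name are illustrative assumptions, not the RT&M's internal data structures:

```python
def match_ramps(obs, fcst, window):
    """Greedy matching (sketch): repeatedly pair the observed and
    forecast ramps with the smallest center-time separation, breaking
    ties by the closest ramp rate (amplitude / duration), and leave
    events with no partner within `window` unmatched. Each ramp is a
    dict with keys 'center', 'amplitude', and 'duration'; returns a
    list of (obs_index, fcst_index) pairs."""
    def rate(e):
        return e['amplitude'] / e['duration']

    pairs = []
    free_obs, free_fc = list(range(len(obs))), list(range(len(fcst)))
    while free_obs and free_fc:
        # Smallest timing error wins; ramp-rate similarity breaks ties.
        o, f = min(((o, f) for o in free_obs for f in free_fc),
                   key=lambda p: (abs(obs[p[0]]['center'] - fcst[p[1]]['center']),
                                  abs(rate(obs[p[0]]) - rate(fcst[p[1]]))))
        if abs(obs[o]['center'] - fcst[f]['center']) > window:
            break                      # all remaining events are unmatched
        pairs.append((o, f))
        free_obs.remove(o)
        free_fc.remove(f)
    return pairs
```

Each matched pair is removed from further searching, mirroring the elimination step described in the text; observed or forecast events left over at the end correspond to the unmatched (null) scenarios of section 4.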

## 4. Forecast skill scoring methodology for a single ramp definition

The ramp identification and ramp matching procedures result in time series of matched pairs of forecast and observed ramps. Using this time series of events, a forecast score is determined by comparing the forecast and observed characteristics of each event. The ramp skill score accounts for forecast ramps matched to observed ramps, forecast ramps not matched with observed ramps, and observed ramps not matched with forecast ramps. The skill score incorporates phase, amplitude, and duration errors, and recognizes that up- and down-ramp events can have different impacts on grid operations. Also, the skill score is designed so that a set of random forecasts will have near-zero skill, as we verified by testing the RT&M on a randomly produced forecast. A negative score indicates the model is worse than random, and a positive score indicates the model has skill.

The first step is to classify the different types of ramp scenarios possible (Table 1). This is similar to a 3 × 3 contingency table (Wilks 2006) consisting of up, down, and null events, except that the null/null case is not considered and does not affect the skill score.

Scenario definitions for matched and unmatched ramp events.

In the scoring equation, MaxDistScore_{#} and MinDistScore_{#} represent the limits of the score for, respectively, perfect and completely missed forecasts, taking nominal values of 1, 0, or −1 depending on the scenario.

We develop values for MaxDistScore_{#} and MinDistScore_{#} for two separate cases. The first is a simplified case (presented in this section) in which up/up ramps and down/down ramps are treated equally, and all unmatched events are given zero skill. For this case the value of MinDistScore_{#} will always be zero. The second case (presented in the appendix) is more complex and contains asymmetries, where up/up ramps and down/down ramps are not treated equally and unmatched events can have nonzero skill. For this case the value of MinDistScore_{#} can be different from zero. This case will be used when the user wants to take into account the possibility of curtailing wind production, which can result in forecast errors of different signs having different financial or grid reliability consequences.

The simplified scoring strategy uses the symmetric range of scores in Table 2, with up/up and down/down events having a score between +1 and 0, tending to 0 as the timing error approaches WL; down/up and up/down events having score ranges between 0 and −1; and all missed forecasts having zero skill.

Range of scores possible for all eight event scenarios for the simplified, symmetric case.
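The score ranges in Table 2 can be encoded compactly. In the sketch below, each key is a (forecast, observed) pair with `None` standing for a null event; the naming convention and dictionary layout are illustrative, but the ranges follow the symmetric scoring rules stated in the text:

```python
# Score ranges for the simplified symmetric case (Table 2):
# matched same-sign ramps score in [0, +1], opposite-sign pairs
# in [-1, 0], and any scenario involving a null (a missed or
# spurious ramp) scores exactly 0. Keys are (forecast, observed).
SCORE_RANGE = {
    ('up', 'up'): (0.0, 1.0),        # scenario 1
    ('up', None): (0.0, 0.0),        # scenario 2: forecast up, none observed
    ('up', 'down'): (-1.0, 0.0),     # scenario 3
    (None, 'up'): (0.0, 0.0),        # scenario 5: observed up, no forecast
    (None, 'down'): (0.0, 0.0),      # scenario 4: observed down, no forecast
    ('down', 'up'): (-1.0, 0.0),     # scenario 6
    ('down', None): (0.0, 0.0),      # scenario 7: forecast down, none observed
    ('down', 'down'): (0.0, 1.0),    # scenario 8
}
```

The mapping of scenario numbers to the two null/forecast orderings (scenarios 4 and 5) is an assumption here, since Table 1 itself is not reproduced; the score ranges are what matter for the skill computation.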

The values of MaxDistScore_{#} in (1) are the scores with the maximum distance from zero listed in Table 2. Thus, for an up/up event (scenario 1) and a down/down event (scenario 8), MaxDistScore_{1,8} is +1, while for an up/down event (scenario 3) and a down/up event (scenario 6), MaxDistScore_{3,6} is −1. The forecast timing and amplitude skill parameters scale the score between these limits.

For all scenarios, the timing skill [(2)] falls in the range between 0 and 1, equal to 1 when the forecast and observed center times coincide and tending to 0 as the timing error approaches WL.

For scenarios 1 and 8 the best skill is obtained when the forecast and observed ramps have identical power amplitudes and no timing error, in which case the score reaches MaxDistScore_{1,8} = 1. As the amplitude and timing errors grow, the score decreases toward MinDistScore_{1,8} = 0.

For scenarios 3 and 6 the worst case occurs when the forecast and observed ramps have identical (and opposite) maximum power amplitudes, in which case the score reaches MaxDistScore_{3,6} = −1. As the amplitudes of the ramps in these scenarios become smaller, or as the timing error grows, the score tends toward MinDistScore_{3,6} (in this case equal to 0), since the utility operator has more time to adjust for the wrong forecast.

## 5. Forecast skill scoring: Matrix of skill values

The ramp metric developed above applies to ramps defined by a single power amplitude threshold and window length. Ideally, one would like to know which model is best for a range of power thresholds and window lengths and then to average the model’s skill over this range of values. For this reason, we consider a matrix of ramp skills schematically illustrated in Fig. 6.

For this study four different time windows (30, 60, 120, and 180 min) and five different power thresholds (30%, 40%, 50%, 60%, and 70%) have been chosen.^{4} The ramp matrix has been designed so that each matrix element answers the question, does a ramp event of a particular duration exceed a particular power threshold?

Skill scores as defined in section 4 for each value of power threshold and window length are calculated and placed into each matrix element. Skill scores using extreme ramp definitions (largest power thresholds and shortest window lengths) are placed in the top-left corner of the matrix. Skill scores for more frequently occurring and weaker ramp events (lower power thresholds and longer window lengths) will be placed in the bottom-right corner of the matrix.
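The arrangement of the score matrix can be sketched as follows. Here `score_fn` stands in for the full identification, matching, and scoring chain of section 4 applied to one ramp definition; the threshold and window values are those chosen for this study:

```python
WINDOWS = [30, 60, 120, 180]          # window lengths (min), shortest first
THRESHOLDS = [70, 60, 50, 40, 30]     # power thresholds (% capacity), largest first

def skill_matrix(score_fn):
    """Build the matrix of skill scores: rows run from the highest power
    threshold (most extreme ramps, top) to the lowest, and columns from
    the shortest window (left) to the longest, so the top-left element
    holds the most extreme ramp definition. `score_fn(threshold, window)`
    is any user-supplied callable returning the skill for one definition."""
    return [[score_fn(th, wl) for wl in WINDOWS] for th in THRESHOLDS]
```

With this layout, the extreme-ramp scores land in the top-left corner and the weak, frequent-ramp scores in the bottom-right corner, matching the description in the text.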

### a. Results from 9 days of observation from four SDSU tall towers

To illustrate how the RT&M can be used to measure the skill of a model forecasting ramp events, we applied it to 9 days’ worth of observations from the four SDSU tall towers introduced in section 2 and corresponding forecasts from the RAP model at the same locations. Recognizing that a single 9-day period is insufficient to definitively assess the skill of this model, the intent of this study is limited to describing the RT&M and how it can be used to measure model skill at forecasting ramp events on a real observational dataset. A future study is under way testing the RT&M on a larger observational dataset, comparing different models, and comparing similar models run at different spatial resolutions and with different model output frequencies.

For this exercise we examined the average of the statistical results for each tower site, instead of the aggregate of four towers, although as stated earlier the RT&M provides the option of aggregating the results according to the user needs. We also set the matrix of scores to be symmetric (as in Table 2). First, we present results obtained using the RT&M stitching method to compare the RAP model to observations.

Figure 7 displays the number of occurrences of ramp events found using the min–max identification method, as observed from the tall towers (left panel), and for the RAP model (forecast hour 0, initialization time, center panel; forecast hour 6, right panel), for the previously specified range of ramp power thresholds and window lengths. Relatively few extreme ramp events are found (top-left corner of each panel) while many small-amplitude and long-duration ramps are found (bottom-right corner of each panel). A similar number of events is found at forecast hour 6 compared to the initialization time. In contrast, the number of occurrences in the observations is larger. This is because at 13-km resolution the model has considerably smoother fields than the observed point location power time series.

Skill score matrices for the RAP model simulations using the fixed-time interval ramp definition method are shown in Fig. 8 for the initialization time and forecast hours 3, 6, and 14. The skill is larger for longer window lengths compared to the shorter windows. The skill is greatest at the initialization time, and slowly decays with forecast length. Skill scores for the other methods look qualitatively similar (not shown).

### b. Weighting matrix

Using the methodology presented above, a perfect forecast of a 30% power capacity ramp over 2 h and one of 70% over 30 min may both have forecast skills of 1.0, yet forecasting the larger ramp will be more important and should have more value than the smaller ramp.

In place of averaging the ramp skill scores in all of the score matrix elements equally, a weighting function can be applied before averaging the score matrix that accounts for the fact that the skill scores for the more extreme events will likely have a greater impact on grid operations than the weaker ramps. The weighting matrix that we have used starts with a weight of 1.0 in the top-left corner of the matrix in Fig. 6 (most extreme ramps) and decreases the weight by 10% for each 10% change in the ramp power threshold and each increment in window length.^{5}
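The weighting scheme can be sketched as below. The additive 10% decrease per step away from the top-left corner is one literal reading of the text (a multiplicative factor of 0.9 per step is another plausible reading), so the exact weight values are an assumption:

```python
def weight_matrix(n_thresholds=5, n_windows=4, step=0.1):
    """Weights start at 1.0 for the most extreme ramp definition
    (top-left element) and drop by `step` for each step away from it
    in either power threshold or window length, floored at zero."""
    return [[max(0.0, 1.0 - step * (i + j)) for j in range(n_windows)]
            for i in range(n_thresholds)]

def weighted_average(scores, weights):
    """Weighted mean of the skill-score matrix."""
    total = sum(w * s for srow, wrow in zip(scores, weights)
                for s, w in zip(srow, wrow))
    norm = sum(w for wrow in weights for w in wrow)
    return total / norm
```

Because the weights shrink toward the bottom-right corner, a model that only does well on weak, slow ramps gains little from this average relative to one that also captures the extreme events.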

The average score across the entire matrix is shown in Fig. 9 for the three ramp definition methods, using both an equal weighting of all the matrix elements (top panel), and when using the weighting matrix (bottom panel). The unweighted averaged skill score is greater than the weighted skill score, because the model has less skill at forecasting the most extreme ramps. Although the 9-day period used in this study is not long enough to claim definitive results, we notice that for this exercise the skill score of the RAP model is positive for all forecast hours and methods, decreasing with forecast length. We also note that model forecast skill tends to be greatest when using the explicit derivative method, which may be due to the fact that this method applies more temporal smoothing to the data than does, say, the min–max method. The choice of method will clearly depend on the user’s specific forecast needs.

A breakdown of the ramp events into the eight different scenarios using the three ramp identification methods (not shown) shows the largest number of events by far occurs for the two model null event scenarios (4 and 5). This is most likely due to the fact that observational data have much more temporal variability compared to the model time series. For this reason, as noted in section 2, fewer ramp events are found in the model time series and many observed ramps have no match in the model time series. The next most common events are scenario 1 (up/up), scenario 2 (up/null), scenario 8 (down/down), and scenario 7 (down/null), but with much lower frequencies than scenarios 4 and 5. The least common are scenarios 3 (up/down) and 6 (down/up), the worst scenarios possible.

The ramp skill scores can also be broken down into each scenario category (not shown). The positive contribution to the skill score comes from scenarios 1 and 8, when the forecast accurately predicts up ramps and down ramps when they are observed. Scenarios 3 and 6 (ramp forecasts with the sign opposite to that observed) have a negative contribution that is smaller in magnitude than the positive contribution of scenarios 1 and 8. Scenarios 2, 4, 5, and 7 (null events) have no net effect, as the score for these scenarios is set equal to zero (see Table 2) for this exercise.

Also, the skill of a model at forecasting observed up-ramp events versus down-ramp events can be tested separately with this same RT&M, simply by rerunning it, first using nonzero values only for scenarios 1 and 6 in Table 2 (observed up ramps) and a second time using nonzero values only for scenarios 3 and 8 in Table 2 (observed down ramps). For this dataset this comparison is presented in Fig. 10, where the dashed lines are used for up events and the solid lines represent down ramps. The model has greater skill at forecasting observed up-ramp events compared to down-ramp events. Again the skill of the model at forecasting both up- and down-ramp events decreases with forecast length, but stays positive for all forecast hours.

Finally, in Fig. 11 we present the same results presented in Fig. 9 (bottom) and Fig. 10, but obtained using the RT&M independent forecast runs method to compare the model to the observations. In this case, the observed and model time series are shorter and equal to the length of the model forecast (15 h for the RAP model used in this study). During a period of 9 days in January 2012, there are 216 (24 × 9) such time series that allow comparison of the observed and model data for ramp identification. Ramps are determined and matched between each pair of time series in the same way they were determined and matched using the stitching method presented before, but when we measure the skill of the model for each particular ramp, we add that skill to the matrix of skills for the same forecast hour during which the central time of the model ramp occurs (in the appropriate matrix element, relative to that particular ramp definition). In this way we preserve the information on how the statistics vary as a function of the forecast horizon. A disadvantage of this method is that the beginning and ending forecast hours will suffer from truncated ramps that potentially begin before the start of the forecast cycle or end after forecast hour 15. For instance, there will be no ramps with central time at forecast hour 00:00, so forecast hour 0 skill will only have skill coming from ramps centered at 00:15, 00:30, and 00:45. Moreover, ramps centered at 00:15 will be limited in definition (no ramps with a duration longer than 30 min can be centered there), and analogous restrictions apply at the end of the forecast period.

When this second approach was run on the 9-day dataset, we found that the statistics for forecast hours 2–13 are very similar to those found with the stitching method (cf. Fig. 11 to Fig. 9, bottom panel, and to Fig. 10), indicating that the RT&M is robust. As expected from the considerations above, the skill of the RAP in Fig. 11 (both panels) is lower than what was found before in Fig. 9 (bottom) and Fig. 10 at forecast hours 0, 1, and 14. For this reason, Fig. 11 shows gray areas for the forecast hours that suffer from the deficiencies described above. Nevertheless, both of these approaches are available in the RT&M, and the user can choose the one that best addresses their analysis needs.

## 6. Summary and conclusions

A Ramp Tool and Metric was developed to test the ability of a model to forecast ramp events for wind energy. Power forecasts were evaluated by converting tall-tower observations and model forecast wind speeds to normalized capacity power using a standard IEC2 power curve. Two options are provided to the user to decide how to compare the model to observations.

The RT&M has three components: it identifies wind ramp events, matches forecast and observed ramps, and calculates a skill score for the forecasts.

The skill score incorporates phase, duration, and amplitude errors. Since no single pair of changes in power and time thresholds may be representative, and some users may use different ramp definitions for different situations, the RT&M provides the option to integrate the skill over a range of changes in power and time windows. Although specific RT&M parameter values were used in this study, the tool is flexible and can be modified by users for their purposes. Also, a greater emphasis can be given to the more extreme events using a weighting matrix.

We tested the RT&M on 9 days’ worth of observations from a set of four SDSU tall towers located in South Dakota, and used these data to illustrate how the RT&M can be employed to evaluate the skill of a model (RAP in this example) at forecasting ramp events. This hourly updated model runs operationally over North America at 13-km resolution, and was used during the 2011–12 WFIP campaign. For the RT&M analysis, RAP output at 15-min resolution was used, allowing us to consider short-duration ramps. The RAP model is found to have positive skill decreasing with forecast length, and greater skill at forecasting up-ramp events compared to down-ramp events. We developed and described three different methods for identifying ramp events, but at this stage of the analysis it is not yet clear which one is best.

Since this RT&M is used to evaluate the skill of a model when forecasting ramp events in a time series of power data, in principle it could also be used on a time series of power data generated by solar plants. In this case the values of the time windows and power thresholds should be changed according to the expected behavior of the power data produced by the solar plants. This work is in our future plans. The RT&M is publicly available online (http://www.esrl.noaa.gov/psd/products/ramp_tool/).

## Acknowledgments

This research has been funded by the U.S. Department of Energy under the Wind Forecast Improvement Project (WFIP), Award DE-EE0003080, and by NOAA/Earth System Research Laboratory. The authors wish to acknowledge Joseph Olson from the NOAA/ESRL/GSD group for providing the RAP model outputs, Barb DeLuisi from the NOAA/ESRL/PSD group for maintaining the RT&M website, and three anonymous reviewers for the helpful comments.

## APPENDIX

### Curtailment Bonus

The simplified scoring strategy presented in the main body of the manuscript uses the symmetric range of scores shown in Table 2. However, symmetric scoring does not account for the fact that forecast errors of different signs (underprediction versus overprediction of power) may have different financial consequences. In particular, if the observed wind power is greater than the forecast power, the surplus can be relieved at relatively low cost by curtailing wind energy generation, an option that is not available when the forecast power exceeds the observed power.

Implicit in our curtailment analysis is the assumption that the grid is always balanced at the start of a ramp event, even if the model forecast is already in error at that point. Therefore, for example, a forecast down ramp that has been matched with an observed up ramp will always provide the opportunity for curtailment, even if the actual time series of forecast power remains greater than the observed power during the ramp event. That is, the grid operator will have accounted for the initial error in the forecast, so that the expected amount of power to be produced will be offset from the raw model power forecast. Since the tool only analyzes ramp events, a second assumption that we make to simplify the analysis is that a curtailment bonus will only be applied for those time periods when either an observed or forecast ramp is present.

A feature of the curtailment bonus option is that the user can choose the weight to give to this bonus through a variable called bonusweight that ranges between 0 and 1 (with steps of 0.1), with 0 being equivalent to no bonus and 1 being the maximum. When a user runs the RT&M using the downloadable GUI, he or she can select what value to assign to the bonusweight in the appropriate box.

For the null (unmatched) events, curtailment can occur for scenarios 7 and 4, when the total power observed is greater than the forecast. However, even though there may be value in curtailment for these two null scenarios, they are still missed events, so we wish the model's skill score to remain close to zero. Therefore, for scenarios 7 and 4 we multiply the value of bonusweight by a small factor (0.1), so that the maximum score for these scenarios never exceeds this small value. The scoring strategy for the null scenarios then becomes as shown in Table A1.
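The damping of the bonus for these missed events can be summarized in a short sketch; the function name and argument names are hypothetical, and only the 0.1 damping factor and the [0, 1] range of bonusweight come from the description above.

```python
def null_scenario_bonus(bonusweight, observed_exceeds_forecast):
    """Bonus contribution to the score of an unmatched (null) ramp event.

    bonusweight               : user-chosen weight in [0, 1]
    observed_exceeds_forecast : True for scenarios 7 and 4, where the total
                                observed power exceeds the forecast and
                                curtailment is possible
    Because a null event is still a missed event, the bonus is damped by a
    factor of 0.1 so that the maximum score stays close to zero.
    """
    if not 0.0 <= bonusweight <= 1.0:
        raise ValueError("bonusweight must lie in [0, 1]")
    if observed_exceeds_forecast:
        return 0.1 * bonusweight
    return 0.0
```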

Table A1. Scores for the four possible null scenarios when a bonus is applied to the situations that can be solved by curtailing wind energy generation.

The scoring strategy for the eight scenarios will then become the one presented in Table A2. When the bonus is applied, not only is the range of scores changed, but the formulations are changed as explained below.

Table A2. Range of scores possible for all eight event scenarios for the case with bonuses.

For the nonnull scenarios, bonuses will be applied in three different situations:

- The first situation in which the bonus is applied to the model is scenario 1, but only when the following conditions are all met:
  - the central time of the forecast event is later than the central time of the observed event;
  - the power change of the forecast event is less than the power change of the observed event; and
  - the starting time of the forecast event is earlier than the end time of the observed event, so that we are sure there is a time during the forecast event when the actual wind power available is larger than the forecast and, consequently, curtailment is possible.

  We note that it is possible for these criteria not to be met even though two ramps are matched. This situation is schematically presented in Fig. A1. In this case the gray area marks the time during which the imperfect forecast results in the actual wind power supply available (blue curve) being greater than the forecast power (red curve), and hence greater than the amount required to balance demand; this surplus can be alleviated by curtailing wind energy generation. From this figure we can also see that if the forecast starting time is later than the final time of the observed event, there is no time during which the observations are larger than the forecast, and therefore the bonus cannot be applied.

- The second situation in which the bonus is applied to the model is scenario 8, but only when the following conditions are all met:
  - the central time of the forecast event is earlier than the central time of the observed event;
  - the power change of the forecast event is less (more negative) than the power change of the observed event; and
  - the final time of the forecast event is later than the initial time of the observed event, so that we are sure there is a time during the forecast event when the observations are larger than the forecast and, consequently, curtailment is possible.

  This situation is schematically presented in Fig. A2. In this case the gray area again marks the time during which the imperfect forecast can be alleviated by curtailing wind energy generation. From this figure we can see that if instead the forecast final time is earlier than the initial time of the observed event, there is no time during which the observations are larger than the forecast, and the bonus cannot be applied.

- The third and last situation in which the bonus is applied is scenario 6, but only when one of the following pairs of conditions is met:
  - the central time of the forecast event is earlier than the central time of the observed event, and the final time of the forecast event is later than the initial time of the observed event; or
  - the central time of the forecast event is later than the central time of the observed event, and the initial time of the forecast event is earlier than the end time of the observed event;

  so that we are sure there is a time during the forecast event when the observations are larger than the forecast and, consequently, curtailment is possible.


This situation is schematically presented in Fig. A3. In this case the gray area again marks the time during which the imperfect forecast can be alleviated by curtailing wind energy generation. From this figure we can see that if the above conditions are not met, there is no time during which we can be sure the observed ramp event power is larger than the forecast, and therefore the bonus cannot be applied.
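The common thread in all three situations is a temporal-overlap requirement: there must be some time during the matched pair of events when the observed power can exceed the forecast. A hedged sketch of such an overlap test is given below; the class and field names are hypothetical, and only the overlap logic (a late forecast must start before the observed event ends; an early forecast must end after it starts) comes from the conditions above, omitting the power-change conditions of scenarios 1 and 8.

```python
from dataclasses import dataclass

@dataclass
class RampEvent:
    t_start: float  # event start time (e.g., minutes from run start)
    t_end: float    # event end time

    @property
    def t_center(self):
        return 0.5 * (self.t_start + self.t_end)

def curtailment_possible(forecast, observed):
    """True when the forecast and observed events overlap in time, so that
    there can be a moment when observed power exceeds the forecast and the
    imbalance could be relieved by curtailing wind generation.
    """
    if forecast.t_center >= observed.t_center:
        # Late forecast: it must start before the observed event ends.
        return forecast.t_start < observed.t_end
    # Early forecast: it must end after the observed event starts.
    return forecast.t_end > observed.t_start
```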

To assign the bonuses to the model in the nonnull scenarios, in the instances where curtailment is possible, the score formulations for scenarios 1, 8, and 6 are modified to include the bonusweight term.

Using a value of bonusweight = 1, we repeated the analysis presented in the body of the manuscript, with the stitching method used to compare the model to the observations. The average score across the entire matrix is shown in the two panels of Fig. A7 for the three ramp-definition methods, using an equal weighting of all matrix elements, with bonusweight = 0 in the top panel (equivalent to the top panel of Fig. 9) and bonusweight = 1 in the bottom panel. Because the bonus is applied in circumstances when the imperfect forecast can be relieved by wind curtailment, the score of the model is larger in the bottom panel of Fig. A7 than in the top panel. The skill score of the RAP model remains positive for all forecast hours and for all three methods, and decreases with forecast length.

## REFERENCES

Bossavy, A., R. Girard, and G. Kariniotakis, 2010: Forecasting uncertainty related to ramps of wind power production. *Proc. European Wind Energy Conf. and Exhibition*, Warsaw, Poland, European Wind Energy Association. [Available online at https://hal-mines-paristech.archives-ouvertes.fr/hal-00765885/document.]

Bradford, K. T., R. L. Carpenter, and B. L. Shaw, 2010: Forecasting southern plains wind ramp events using the WRF model at 3-km. Preprints, *Ninth Annual Student Conf.*, Atlanta, GA, Amer. Meteor. Soc., S30. [Available online at https://ams.confex.com/ams/pdfpapers/166661.pdf.]

Bristol, E. H., 1990: Swinging door trending: Adaptive trend recording. *Proc. ISA National Conf.*, New Orleans, LA, ISA, 749–753. [Available online at http://ebristoliclrga.com/PDF/SwDr.pdf.]

Cutler, N., M. Kay, K. Jacka, and T. S. Nielsen, 2007: Detecting, categorizing and forecasting large ramps in wind farm power output using meteorological observations and WPPT. *Wind Energy*, **10**, 453–470, doi:10.1002/we.235.

Ferreira, C., J. Gama, L. Matias, A. Botterud, and J. Wang, 2011: A survey on wind power ramp forecasting. Argonne National Laboratory Rep. ANL/DIS-10-13, 28 pp. [Available online at http://ceeesa.es.anl.gov/pubs/69166.pdf.]

Florita, A., B.-M. Hodge, and K. Orwig, 2013: Identifying wind and solar ramping events. *IEEE Green Technologies Conf.*, Denver, CO, IEEE, NREL/CP-5500-57447. [Available online at http://www.nrel.gov/docs/fy13osti/57447.pdf.]

Francis, N., 2008: Predicting sudden changes in wind power generation. *North American WindPower*, October 2008, North American WindPower, Oxford, CT, 58 pp. [Available online at https://www.awstruepower.com/assets/Predicting-Sudden-Changes-in-Wind-Power-Generation_NAWP_Oct20082.pdf.]

Freedman, J., M. Markus, and R. Penc, 2008: Analysis of west Texas wind plant ramp-up and ramp-down events. AWS Truewind Tech. Rep., Albany, NY, 26 pp. [Available online at http://interchange.puc.state.tx.us/WebApp/Interchange/Documents/33672_1014_580034.PDF.]

Greaves, B., J. Collins, J. Parkes, and A. Tindal, 2009: Temporal forecast uncertainty for ramp events. *Wind Eng.*, **33**, 309–319, doi:10.1260/030952409789685681.

Hamill, T. M., 2014: Performance of operational model precipitation forecast guidance during the 2013 Colorado Front Range floods. *Mon. Wea. Rev.*, **142**, 2609–2618, doi:10.1175/MWR-D-14-00007.1.

International Electrotechnical Commission, 2007: Wind turbines—Part 12-1: Power performance measurements of electricity producing wind turbines. IEC 61400-12-1, 90 pp.

Isaac, G. A., and Coauthors, 2014: The Canadian Airport Nowcasting System (CAN-Now). *Meteor. Appl.*, **21**, 30–49, doi:10.1002/met.1342.

Jacobs, A. J. M., and N. Maat, 2005: Numerical guidance methods for decision support in aviation meteorological forecasting. *Wea. Forecasting*, **20**, 82–100, doi:10.1175/WAF-827.1.

Kamath, C., 2010: Understanding wind ramp events through analysis of historical data. *Proc. Transmission and Distribution Conf. and Exposition*, New Orleans, LA, IEEE Power and Energy Society, doi:10.1109/TDC.2010.5484508.

Wandishin, M. S., G. J. Layne, B. J. Etherton, and M. A. Petty, 2014: Challenges of incorporating the event-based perspective into verification techniques. *Proc. 22nd Conf. on Probability and Statistics in the Atmospheric Sciences*, Atlanta, GA, Amer. Meteor. Soc., 4.4. [Available online at https://ams.confex.com/ams/94Annual/webprogram/Paper240097.html.]

Wilczak, J. M., L. Bianco, J. Olson, I. Djalalova, J. Carley, S. Benjamin, and M. Marquis, 2014: The Wind Forecast Improvement Project (WFIP): A public/private partnership for improving short term wind energy forecasts and quantifying the benefits of utility operations. NOAA Final Tech. Rep. to DOE, Award DE-EE0003080, 159 pp. [Available online at http://energy.gov/sites/prod/files/2014/05/f15/wfipandnoaafinalreport.pdf.]

Wilczak, J. M., and Coauthors, 2015: The Wind Forecast Improvement Project (WFIP): A public–private partnership addressing wind energy forecast needs. *Bull. Amer. Meteor. Soc.*, **96**, 1699–1718, doi:10.1175/BAMS-D-14-00107.1.

Wilks, D. S., 2006: *Statistical Methods in the Atmospheric Sciences: An Introduction*. Academic Press, 627 pp.

Wong, W.-K., C.-S. Lau, and P.-W. Chan, 2013: Aviation Model: A fine-scale numerical weather prediction system for aviation applications at the Hong Kong International Airport. *Adv. Meteor.*, **2013**, 532475, doi:10.1155/2013/532475.

Zack, J. W., S. Young, J. Nocera, J. Aymami, and J. Vidal, 2010: Development and testing of an innovative short-term large wind ramp forecasting system. *Proc. European Wind Energy Conf. and Exhibition*, Warsaw, Poland, European Wind Energy Association.

Zack, J. W., S. Young, and E. J. Natenberg, 2011: Evaluation of wind ramp forecasts from an initial version of a rapid update dynamical-statistical ramp prediction system. *Proc. Second Conf. on Weather, Climate, and the New Energy Economy*, Seattle, WA, Amer. Meteor. Soc., 781. [Available online at https://ams.confex.com/ams/91Annual/webprogram/Paper186686.html.]

Zhang, J., A. Florita, B.-M. Hodge, and J. Freedman, 2014: Ramp forecasting performance from improved short-term wind power forecasting. *Int. Design Engineering Technical Conf./Computers and Information in Engineering Conf.*, Buffalo, NY, American Society of Mechanical Engineers, NREL/CP-5D00-61730. [Available online at http://www.nrel.gov/docs/fy14osti/61730.pdf.]

^{1}

The RT&M is coded in Matlab, and users can download the main code, functions, instructions, and the data used for this study to test the RT&M before running it on their own datasets. When the RT&M is run, a GUI opens and the user can choose among several options, some of which are introduced in the subsequent narrative of the paper; the others are explained in a readme file downloadable with the RT&M. We also created an executable version of the code for users who do not have Matlab; in this case they cannot modify the Matlab code but can still use the GUI. When a user runs the RT&M, he or she can select whether the input data are wind speed or power, according to what type of data are available.

^{2}

When a user runs the RT&M using the downloadable GUI, he or she can choose whether to run the RT&M over individual sites and then average the statistics, or to run the RT&M on the aggregated sites.

^{3}

When a user runs the RT&M using the downloadable GUI, overlapping points will be shown with blue squares.

^{4}

When a user runs the RT&M using the downloadable GUI, he or she could also choose different time windows and power thresholds by modifying the Matlab code, but the minimum possible time window is equal to 2 times the resolution of the data; in our case the minimum time window would be 2 × 10 min = 20 min.

^{5}

When a user runs the RT&M using the downloadable GUI, he or she can choose to run the RT&M averaging the ramp skill scores in all of the score matrix elements equally, or apply the weighting function introduced here to weight the extreme events (or create his or her own weighting matrix, according to his or her needs, by modifying the Matlab code itself).