1. Department of Technical and Engineering, Science and Research Branch, Islamic Azad University, Tehran 1477893855, Iran
2. Department of Water Resources Engineering, Tarbiat Modares University, Tehran 14115-336, Iran
Manuscript received: 2017-04-14
Manuscript revised: 2017-07-13
Manuscript accepted: 2017-08-07
Abstract: The application of numerical weather prediction (NWP) products is increasing dramatically. Existing reports indicate that ensemble predictions have better skill than deterministic forecasts. In this study, numerical ensemble precipitation forecasts in the TIGGE database were evaluated using deterministic, dichotomous (yes/no), and probabilistic techniques over Iran for the period 2008-16. Thirteen rain gauges spread over eight homogeneous precipitation regimes were selected for evaluation. The Inverse Distance Weighting and Kriging methods were adopted for interpolation of the prediction values, downscaled to the stations at lead times of one to three days. To enhance the forecast quality, NWP values were post-processed via Bayesian Model Averaging. The results showed that ECMWF had better scores than other products. However, products of all centers underestimated precipitation in high precipitation regions while overestimating precipitation in other regions. This points to a systematic bias in forecasts and demands application of bias correction techniques. Based on dichotomous evaluation, NCEP did better at most stations, although all centers overpredicted the number of precipitation events. Compared to ECMWF and NCEP, UKMO yielded higher scores in mountainous regions, but performed poorly at other selected stations. Furthermore, the evaluations showed that all centers had better skill in wet than in dry seasons. The quality of post-processed predictions was better than that of the raw predictions. In conclusion, the accuracy of the NWP predictions made by the selected centers could be classified as medium over Iran, while post-processing of predictions is recommended to improve the quality.
Keywords: ensemble forecast, NWP, TIGGE, evaluation, post-processing
2.1. Data
The 50-km prediction products for 2008-16 were extracted from the TIGGE database at ECMWF, with lead times of one, two and three days over Iran. Among the centers in the TIGGE database, three (ECMWF, NCEP and UKMO) were selected. The characteristics of these centers are provided in Table 1. Observed data were extracted for 13 synoptic stations in Iran, spread over eight different regions as classified by Modarres (2006). Table 2 presents the characteristics of the stations.

Modarres (2006) classified eight homogeneous precipitation regimes over Iran by applying Ward's technique to the annual and monthly precipitation of the selected rainfall stations. These eight regimes/clusters cover 90% of the precipitation variance within Iran. The first cluster (G1) is the largest and includes stations in arid and semi-arid regions in central Iran. The second cluster (G2) involves highland margins of G1, while G3 represents the northwestern cold region. The fourth cluster (G4) includes areas along the Persian Gulf coast in southern Iran, while the sixth and eighth clusters (G6, G8) involve areas located along the coast of the Caspian Sea. The major difference between the G6 and G8 regions in the north is the amount of precipitation, which decreases from west to east. The fifth and seventh clusters (G5, G7) encompass regions in the Zagros Mountains, where precipitation in G5 is higher than in G7. The geographic distribution of the cluster regions is shown in Fig. 1. To allow a direct comparison with the spatial variation of precipitation, Fig. 2 displays the average annual precipitation from 1984 to 2014. Interpolation of NWP predicted values to the stations was implemented using the Inverse Distance Weighting (IDW) and Kriging methods.

Figure 2. Mean annual precipitation of Iran.
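The IDW interpolation of gridded forecasts to a station can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function name `idw_interpolate`, the planar (non-geodesic) distance, and the default power of 2 are all assumptions made for illustration.

```python
import math


def idw_interpolate(station, grid_points, power=2.0):
    """Inverse Distance Weighting estimate at a station (illustrative sketch).

    station     : (x, y) coordinates of the rain gauge
    grid_points : list of (x, y, value) tuples for surrounding grid cells
    power       : distance exponent (2.0 is a common choice, assumed here)
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for gx, gy, value in grid_points:
        # Planar distance is an assumption; real applications may use
        # great-circle distance on latitude/longitude coordinates.
        distance = math.hypot(gx - station[0], gy - station[1])
        if distance == 0.0:
            # The station coincides with a grid point: use its value directly.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Four grid cells at equal distance contribute equally to the estimate.
cells = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 1.0, 3.0), (1.0, 1.0, 4.0)]
estimate = idw_interpolate((0.5, 0.5), cells)
```

Closer grid cells receive larger weights, so a larger `power` makes the estimate more local.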
2.2. Evaluation techniques
The evaluations were performed using deterministic, dichotomous (yes/no), and probabilistic approaches. For the deterministic evaluation, four common criteria were adopted: the Pearson correlation coefficient (Pearson's r), root-mean-square error (RMSE), mean absolute error (MAE), and relative root-mean-square error (RRMSE). For the dichotomous evaluation, yes/no binary assessment criteria were used: the probability of detection (POD), false alarm ratio (FAR), bias score (BIAS), and equitable threat score (ETS). Finally, the Brier score (BS), Brier skill score (BSS), continuous ranked probability score (CRPS), and the area under the relative operating characteristic (ROC) curve (ROC.Area) were adopted for the probabilistic evaluation. The formulations of all criteria are given in Table 3, and the contingency table is shown in Table 4.
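As an illustration of the deterministic criteria, the following Python sketch computes Pearson's r, MAE, RMSE and RRMSE for paired observation and forecast series. The exact formulations used in the study are those of Table 3; here RRMSE is assumed to be RMSE divided by the mean observed value, and the function name `deterministic_scores` is hypothetical.

```python
import math


def deterministic_scores(obs, fcst):
    """Return Pearson's r, MAE, RMSE and RRMSE for paired series (sketch)."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_fcst = sum(fcst) / n

    # Pearson correlation coefficient from centered sums.
    cov = sum((o - mean_obs) * (f - mean_fcst) for o, f in zip(obs, fcst))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_fcst = sum((f - mean_fcst) ** 2 for f in fcst)
    r = cov / math.sqrt(var_obs * var_fcst)

    mae = sum(abs(f - o) for o, f in zip(obs, fcst)) / n
    rmse = math.sqrt(sum((f - o) ** 2 for o, f in zip(obs, fcst)) / n)

    # Assumption: RRMSE is the RMSE normalized by the mean observation.
    rrmse = rmse / mean_obs
    return {"r": r, "MAE": mae, "RMSE": rmse, "RRMSE": rrmse}


# A forecast that is always 1 mm too wet: perfect correlation, nonzero error.
scores = deterministic_scores([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```

The example shows why correlation alone is not sufficient: a constant bias leaves r at 1 while MAE and RMSE remain nonzero.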
2.3. BMA
BMA combines predictions from several statistical models with variable weighting coefficients. This method was applied to ensemble predictions by Raftery et al. (2005) to predict air temperature and surface and sea level pressure (Liu and Fan, 2014). The probability distribution function (PDF) of BMA is as follows (Raftery et al., 2005):
\begin{equation} P(y|f_1,\cdots,f_K)=\sum_{k=1}^Kw_kg_k(y|f_k) , \nonumber \end{equation}
where $y$ is the variable being predicted; $g_k(y|f_k)$ is the conditional PDF of $y$ given forecast $f_k$, conditional on $f_k$ being the best member of the ensemble prediction; $w_k$ is the posterior probability that forecast $k$ is the best one (the weights are non-negative and sum to one); and $K$ is the number of models being combined. Since there were large numbers of zero precipitation events, the conditional PDF in this paper was set to a Gamma distribution function, which was selected due to its high skewness. Detailed information regarding the calculation of $g_k(y|f_k)$ and $w_k$ may be found in the literature (e.g. Raftery et al., 2005; Liu and Xie, 2014). This study took advantage of the ensembleBMA package in the R software.
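The weighted-mixture form of the BMA PDF can be illustrated with a short Python sketch. It is a simplification, not the procedure of the ensembleBMA package: the weights are taken as given rather than estimated, each Gamma component is assumed to have mean $f_k$ and a shared variance (a hypothetical parametrization), and the separate treatment of zero precipitation is omitted.

```python
import math


def gamma_pdf(y, shape, scale):
    """Gamma density with the given shape and scale; zero for y <= 0 (sketch)."""
    if y <= 0.0:
        return 0.0
    return (y ** (shape - 1.0) * math.exp(-y / scale)) / (
        math.gamma(shape) * scale ** shape
    )


def bma_pdf(y, forecasts, weights, variance):
    """Evaluate sum_k w_k * g_k(y | f_k) with Gamma components.

    Assumption: component k has mean f_k (> 0) and the shared `variance`,
    so shape = f_k**2 / variance and scale = variance / f_k.
    """
    if min(weights) < 0.0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    density = 0.0
    for f_k, w_k in zip(forecasts, weights):
        shape = f_k ** 2 / variance
        scale = variance / f_k
        density += w_k * gamma_pdf(y, shape, scale)
    return density


def bma_mean(forecasts, weights):
    """Mean of the mixture: the weighted average of the component means."""
    return sum(w_k * f_k for f_k, w_k in zip(forecasts, weights))


# Three hypothetical member forecasts (mm) with illustrative weights.
members = [2.0, 4.0, 6.0]
w = [0.5, 0.3, 0.2]
density_at_3mm = bma_pdf(3.0, members, w, variance=4.0)
mixture_mean = bma_mean(members, w)
```

In practice the weights and distribution parameters are fitted over a training period, which this sketch does not attempt.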
3.1. Total annual QPF evaluation
According to Modarres (2006), the G1 region is the dominant precipitation regime in Iran and has a high coefficient of variation, with low precipitation in a predominantly arid and semi-arid climate. Due to the extent of this region, three stations (Esfahan, Semnan, and Zahedan) were selected. Figure 3 presents the total annual precipitation associated with this region. In most years, all centers overestimated the annual precipitation. ECMWF predicted precipitation better at Semnan and Esfahan than at Zahedan. In contrast, NCEP performed better in predicting the annual precipitation at Zahedan but comparatively poorly at Esfahan and Semnan.

Figure 3. Total observed and predicted annual precipitation (mm yr$^{-1}$) of three centers at rain gauge stations selected in precipitation regions: (a) Esfahan; (b) Mashhad; (c) Semnan; (d) Shahrekord; (e) Zahedan; (f) Tehran; (g) Ahvaz; (h) Bandarabbas; (i) Tabriz; (j) Sanandaj.
In the G2 region, which essentially constitutes mountainous areas upstream of the G1 region, three stations were selected: Mashhad, Shahrekord, and Tehran. Similar to G1, all centers overestimated the annual precipitation for most years at Mashhad and Tehran. At Shahrekord, which receives higher precipitation than the other two stations, UKMO underestimated the precipitation, whereas the other two centers generally overestimated it. Some centers performed differently in the wet seasons than over the whole year. For example, UKMO, which performed better than the other two models at Shahrekord, was the weakest for the wet season. The total precipitation predicted by NCEP over the study period was significantly different from the total observed precipitation at Tehran.

In the G3 region, which encompasses cold regions in northwestern Iran, the station at Tabriz was studied. According to Fig. 3, NCEP predictions were the poorest in all years except 2010 and 2011, while ECMWF achieved better predictions than UKMO and NCEP. In the G4 region, the stations at Ahvaz and Bandar Abbas were selected. Based on Fig. 3, although all three centers overestimated the annual precipitation, UKMO did quite poorly. At Sanandaj station in the G5 region, similar to other regions, all centers overestimated the annual precipitation. Predictions made by UKMO were better than those of ECMWF; moreover, ECMWF made poorer predictions in 2008 and 2009.

In the rainy climate of the G6 region, the station at Babolsar was selected. Based on Fig. 4, the centers overestimated precipitation in some years and underestimated it in others. At Ilam in G7, which generally receives more precipitation than G5, NCEP was the poorest of all the centers, whereas ECMWF's predictions were better than those of UKMO in most years. As shown in Fig. 4, in the G8 region, which receives higher precipitation than the G6 region, ECMWF offered better predictions than the other centers, while NCEP was the poorest, underestimating the precipitation in all years.

Overall, the products of all the centers underestimated the precipitation in the relatively wetter climate regions but overestimated the precipitation in drier climate areas. This implies a systematic bias in forecasts and demands application of bias correction techniques, such as quantile mapping.

Figure 4. Total observed and predicted annual precipitation (mm yr$^{-1}$) of three centers at three rain gauge stations selected in precipitation regions: (a) Babolsar in the G6 region; (b) Ilam in the G7 region; (c) Rasht in the G8 region.
3.2. QPF deterministic evaluation
For the deterministic evaluation, this study adopted four criteria: the correlation coefficient (r), MAE, RMSE, and RRMSE, whose formulations are presented in Table 3. The results are shown in Fig. 5. Because not all examined cases can be displayed, the average performance of the stations in each cluster is presented; the results for each station are given in Table 5.

At Esfahan and Semnan in the G1 region, ECMWF and NCEP yielded the best and poorest scores, respectively. In contrast, at Zahedan, ECMWF was the poorest and NCEP the best predicting center. Overall, in this region, ECMWF was the best and NCEP was the poorest. In the G2 region, based on the correlation coefficient, ECMWF produced the best scores at all three selected stations, while NCEP was the poorest. UKMO performed well at Shahrekord, but was the poorest at Mashhad. In the cold climate of the G3 region, based on all three indicators, ECMWF was the best and NCEP was the poorest of all. In the hot and dry G4 region, NCEP yielded smaller prediction errors than the other centers, while UKMO performed comparatively poorly in terms of the deterministic evaluation scores. In the G5 region, UKMO had the smallest prediction error of the three centers, whereas NCEP performed the poorest. In the rainy G6 region, ECMWF and NCEP had the best and poorest scores, respectively. However, in this region, due to higher precipitation relative to other areas in Iran, all three centers produced large prediction errors. At Ilam in the G7 region, ECMWF's predictions were slightly better than those of UKMO; NCEP was the poorest of all. In G8, based on the correlation coefficient and RMSE, ECMWF was the best and UKMO was the poorest.

Figure 5. Results of the deterministic evaluation of three centers for eight precipitation regions in Iran between observations and forecasts: (a) correlation coefficient; (b) mean absolute error (mm d$^{-1}$); (c) root-mean-square error (mm d$^{-1}$); (d) relative root-mean-square error.
In general, based on deterministic evaluation, ECMWF in most regions of Iran, UKMO in mountainous regions, and NCEP in southern Iran, provided better results compared to other centers. In addition, TIGGE numerical precipitation predictions at Ilam within the G7 region performed best among all examined stations in terms of annual precipitation.
3.3. QPF dichotomous (yes/no) evaluation
This study used four indicators (POD, FAR, ETS and BIAS) for the dichotomous evaluation. The evaluation results are shown in Fig. 6. According to the BIAS criterion, which is the ratio of the number of predicted precipitation events to the number of observed precipitation events, NCEP and UKMO respectively offered the best and poorest predictions of the number of precipitation days. ECMWF showed a smaller BIAS than NCEP in the G3 region. All centers overestimated the number of precipitation days. Based on ETS, which measures the fraction of observed and/or forecast events that were correctly predicted, adjusted for hits expected by chance, NCEP achieved comparatively better scores at all stations, except in the G3 region, while the prediction quality of UKMO was poor. However, the very low ETS values at most stations indicate that the number of precipitation events was predicted with poor accuracy. According to Fig. 6d, POD values are high, which is due to the high BIAS at most stations. Of all centers, UKMO yielded the best POD because of its higher BIAS, while NCEP had the lowest scores. Based on FAR, which represents the fraction of predicted precipitation events that did not occur, UKMO was the poorest and NCEP, in most regions, was better than the other centers. The number of false alarms was quite high in the G1 and G4 regions, most likely due to the rarity of precipitation events in these regions. In conclusion, the number of precipitation events predicted by all three centers was higher than observed, while NCEP had better scores in most regions.

Figure 6. Dichotomous (yes/no) evaluation of three centers for eight precipitation regions in Iran between observations and forecasts: (a) bias score (frequency bias); (b) equitable threat score (Gilbert skill score); (c) false alarm ratio; (d) probability of detection (hit rate).
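The four dichotomous scores can be computed from the counts of a 2x2 contingency table, as in the Python sketch below. The score definitions are the standard ones; the function names and the 0.1 mm rain/no-rain threshold are assumptions for illustration, since the threshold used in the study is not stated here.

```python
def contingency_counts(obs, fcst, threshold=0.1):
    """Count hits, false alarms, misses and correct negatives (sketch).

    Assumption: a precipitation event is a daily value >= threshold (mm).
    """
    hits = false_alarms = misses = correct_negatives = 0
    for o, f in zip(obs, fcst):
        observed, predicted = o >= threshold, f >= threshold
        if predicted and observed:
            hits += 1
        elif predicted and not observed:
            false_alarms += 1
        elif observed:
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives


def dichotomous_scores(hits, false_alarms, misses, correct_negatives):
    """Return POD, FAR, BIAS and ETS from contingency-table counts."""
    total = hits + false_alarms + misses + correct_negatives
    # Hits expected by chance, used to adjust the threat score.
    hits_random = (hits + false_alarms) * (hits + misses) / total
    return {
        "POD": hits / (hits + misses),
        "FAR": false_alarms / (hits + false_alarms),
        "BIAS": (hits + false_alarms) / (hits + misses),
        "ETS": (hits - hits_random)
        / (hits + false_alarms + misses - hits_random),
    }


# Illustrative counts for an over-forecasting model (BIAS > 1).
example = dichotomous_scores(30, 20, 10, 40)
```

With these illustrative counts, BIAS exceeds 1 and POD is fairly high while ETS stays low, the same pattern of over-forecasting described above.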
Figure 7. Results of the probabilistic evaluation of three centers for eight precipitation regions in Iran between observations and forecasts: (a) Brier score; (b) Brier skill score; (c) continuous ranked probability score; (d) area under the relative operating characteristic (ROC) curve.
3.4. QPF probabilistic evaluation
In this section, the gamma PDF was used to represent the QPF distribution. Four common methods (ROC.Area, CRPS, BS and BSS) were used for the probabilistic evaluation, and the results are presented in Fig. 7. BS, which can be decomposed into resolution, uncertainty and reliability terms, measures the mean squared probability error. BSS expresses the skill of the forecast BS relative to a reference BS, which is usually obtained from climatological predictions. CRPS evaluates the accuracy of the probabilistic forecast distribution. The ROC curve measures the ability of the predictions to discriminate between the occurrence and non-occurrence of precipitation. The area under the curve is also an evaluation criterion; values closer to 1.0 represent higher confidence in predictions. Figure 7 shows the average probabilistic evaluations over the eight study years. Based on BS, precipitation at stations in the G4 region was better predicted than that at other selected stations. However, based on BSS, predictions there were poor due to, as previously mentioned, the rarity of precipitation events. In all regions except G1 and G3, NCEP showed better prediction capability than ECMWF based on BSS, whereas UKMO was the poorest based on both BS and BSS. Moreover, based on CRPS, UKMO and ECMWF had higher scores in some regions, while NCEP did poorly compared to the other models. Based on ROC.Area, ECMWF and NCEP yielded the highest and lowest scores, respectively. As a whole, according to the probabilistic evaluations in Table 5, precipitation at Semnan and Zahedan in the G1 region, as well as at Bandar Abbas in G4, was poorly predicted. Mashhad, Zahedan and Ilam had better scores than the other stations. ECMWF and NCEP performed almost the same, while UKMO performed worse on the criteria for the probability of precipitation occurrence/non-occurrence. Summary results are presented in Table 5, showing that ECMWF performed better in all regions. UKMO performed slightly better than NCEP in precipitation prediction.
However, according to the dichotomous evaluation, NCEP performed better in almost all regions and could predict precipitation occurrence/non-occurrence better than the other centers. Figure 8 presents the evaluation results for lead times of one to three days. The results clearly illustrate that the precipitation prediction skill decreases as lead time increases. This reduction is particularly evident in CRPS. According to Fig. 8, region G7 had the best scores, while the poorest precipitation predictions were obtained in G1 and G4. In addition, Fig. 9 compares the performance of the models in the dry and wet seasons. Only the results for the rainy regions G6 and G8 are presented because other regions receive very little precipitation in the dry season. Based on Fig. 9, all models performed better in the wet than in the dry season, and UKMO failed in the G8 region for the dry season. Overall, the results indicate that better numerical prediction performance is expected in regions with high precipitation.

Figure 8. Results of the three prediction centers' assessments for eight precipitation regions with different lead times between observations and forecasts: (a) correlation coefficient; (b) bias score; (c) Brier score; (d) continuous ranked probability score.
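The Brier score and Brier skill score discussed in this section can be sketched in Python as follows. This is an illustrative sketch only: the event probability is assumed to be the fraction of ensemble members at or above a hypothetical 0.1 mm threshold, and the reference forecast is assumed to be the sample climatological frequency of the event.

```python
def ensemble_probability(members, threshold=0.1):
    """Fraction of ensemble members predicting an event (assumed definition)."""
    return sum(1 for m in members if m >= threshold) / len(members)


def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and outcome (0/1)."""
    n = len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / n


def brier_skill_score(probabilities, outcomes):
    """BSS = 1 - BS / BS_ref, with sample climatology as the reference."""
    climatology = sum(outcomes) / len(outcomes)
    bs_ref = brier_score([climatology] * len(outcomes), outcomes)
    return 1.0 - brier_score(probabilities, outcomes) / bs_ref


# Four days of forecast probabilities and observed outcomes (1 = rain).
probs = [0.8, 0.2, 0.6, 0.0]
rain = [1, 0, 1, 0]
bs = brier_score(probs, rain)
bss = brier_skill_score(probs, rain)
```

When events are rare, the climatological reference BS is itself small, so a low BS does not guarantee a high BSS. This is consistent with the contrast between BS and BSS reported for the G4 region.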