2.1. Survey aims, design and scope
We conducted a large international survey to document and understand the expert judgments of the climate modeling community on the relative importance of different model variables in the evaluation of simulation fidelity.
To keep the scope of this study focused, we only considered the evaluation of the annual mean climatology of an atmosphere-only model simulation, with prescribed SST. In addition, participants were asked to assume that their evaluation would be carried out only on the basis of scalar metrics (e.g., RMSE, correlation) characterizing the agreement of the respective model field with observations.
Transient features of climate were intentionally excluded from this study, but are of critical importance in model evaluation, and should be explored in future work. Similarly, coupled climate models have more complex tuning criteria that are not considered here.
We chose to limit the number of variables and criteria under consideration in order to encourage broader participation, and in anticipation of a planned follow-up study (described in more detail in section 4). Briefly, the follow-up study will invite experts to compare and evaluate climate model outputs, and will aim to infer the importance that experts implicitly assign to different aspects of model fidelity in conducting this assessment. To the best of our knowledge, this would be the first attempt to experimentally characterize expert evaluations of climate model fidelity, and so we aim to initially test the approach using a small number of key variables, which will allow for a more controlled study. The relative importance ratings and other input from experts reported in this study will both inform the design of the follow-up study and provide a priori values for Bayesian inference of the weights $w_i$.
The importance of a particular variable in model evaluation will depend on the purpose for which the model will be used. To better constrain the responses, as well as to explore how expert rankings of different model variables might change depending on the scientific objectives, we asked participants to rate the importance of different variables with respect to several different "Science Drivers". A list of the six Science Drivers used in this survey is shown in Table 1. For each Science Driver, participants were presented with a pre-selected list of variables thought to be relevant to that topic, and asked to rate the importance of each variable on a seven-point Likert scale from "Not at all Important" to "Extremely Important". Participants were also invited to provide written feedback identifying any "very important" or "extremely important" variables that they felt had been overlooked; many took the opportunity to provide these comments, summarized in Tables S1-S3 (see Electronic Supplementary Material). This feedback will be used to improve the survey design in the follow-up study.
2.2. Survey recruitment, participation, and data screening
The survey was distributed via several professional mailing lists targeting communities of climate scientists, especially model developers and users, and by directly soliciting input from colleagues through the professional networks of the authors of this paper. Due to privacy restrictions, we are unable to report the identities or geographic locations of survey respondents, but we are confident that they are representative of the climate modeling community. The survey was open from 18 January 2017 to 25 April 2017. Participants who had not completed at least all items on the first Science Driver (N=12), and participants who rated themselves as "not at all experienced" with evaluating model fidelity (N=7) were excluded from analysis. Of the remaining 96 participants, 81 had completed all six Science Drivers.
Our survey respondents were a highly experienced group, with the vast majority of participants rating themselves as either "very familiar" (40.6%) or "extremely familiar" (40.6%) with climate modeling. In addition, a large fraction of our participants had worked in climate modeling for many years, with the majority of participants (62) reporting at least 10 years' experience, and a substantial number of participants (31) reporting at least 20 years' experience with climate modeling. When asked to rate their experience in "evaluating the fidelity of the atmospheric component of global climate model simulations," 37.5% rated themselves as "very experienced," and 20.8% as "moderately experienced" in "tuning/calibrating the atmospheric component of global climate model simulations". An overview of the characteristics of the survey participants is shown in Fig. 1.
Figure 1. Characteristics of survey participants.
2.3. Formal consensus measure: Coefficient of Agreement (A)
To quantify the degree of consensus among our participants, we employ a formal measure of consensus called the coefficient of agreement $A$ (Riffenburgh and Johnstone, 2009), which varies from values near 0 (no agreement; random responses) to a maximum possible value of 1 (complete consensus). Calculated values of $A$ for the two experience groups, and their probability $p$ of being significantly different from each other, are tabulated for all Science Drivers and variables in the Supplementary Tables S4-S6.
The coefficient of agreement is calculated from the observed disagreement $d_{\rm obs}$ and the expected disagreement under the null hypothesis of random responses, $d_{\rm exp}$. Let $r_{\max}$ denote the number of possible options (7 in the Likert scale used here); let $r=1,\ldots,r_{\max}$ denote the possible responses ($r=7$ is "Extremely important", $r=6$ is "Very important", and so on); let $n_r$ denote the number of respondents choosing the $r$th option, and let $r_{\rm med}$ denote the median value of $r$ from all respondents. The observed disagreement is then calculated as \begin{equation} d_{\rm obs}=\sum_{r=1}^{r_{\max}}n_{r}|r_{\rm med}-r| ,\ \ (2) \end{equation} where $|r_{\rm med}-r|$ is the weight for the $r$th choice. The expected disagreement is calculated as \begin{equation} d_{\rm exp}=\dfrac{n}{k}\sum_{r=1}^{r_{\max}}\left|\dfrac{k+1}{2}-r\right| , \ \ (3)\end{equation} where $n=\sum_{r}n_{r}$ is the total number of respondents and $k=r_{\max}$ is the number of response options. The coefficient of agreement $A$ is then calculated as the complement of the ratio of observed to expected disagreement: $$ A=1-\dfrac{d_{\rm obs}}{d_{\rm exp}} . $$ For randomly distributed responses, $d_{\rm obs}$ would be close to $d_{\rm exp}$, and $A$ would be close to zero; while for perfect agreement, $d_{\rm obs}=0$ and $A=1$.
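As an illustration of Eqs. (2) and (3), the following minimal Python sketch computes the coefficient of agreement for a list of Likert responses. The function name `coefficient_of_agreement` and the example responses are hypothetical and are not taken from the survey data; the sketch also assumes that n in Eq. (3) is the total number of respondents and that k equals the number of response options.

```python
from statistics import median


def coefficient_of_agreement(responses, r_max=7):
    """Return A = 1 - d_obs / d_exp for integer Likert responses in 1..r_max."""
    n = len(responses)
    if n == 0:
        raise ValueError("at least one response is required")
    # Median response; for an even number of responses this may be a half value.
    r_med = median(responses)
    # Eq. (2): summing |r_med - r| over respondents equals summing n_r * |r_med - r| over options.
    d_obs = sum(abs(r_med - r) for r in responses)
    # Eq. (3), assuming n = total respondents and k = r_max (responses spread evenly over options).
    d_exp = (n / r_max) * sum(abs((r_max + 1) / 2 - r) for r in range(1, r_max + 1))
    return 1.0 - d_obs / d_exp


# Hypothetical ratings from seven respondents on the seven-point scale.
example = [6, 6, 7, 5, 6, 7, 6]
print(coefficient_of_agreement(example))
```

In this sketch, a group in which every respondent chooses the same option gives A=1, and responses spread evenly over all seven options give A=0, matching the limiting cases described in the text.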
Because the value of A is sensitive to the total number of respondents N, the value of A is not comparable for subgroups of participants with different sizes. We performed additional significance testing to determine whether the degree of consensus was the same, or different, between our "high experience" and "low experience" groups, and/or between two survey drivers.
We test for statistically significant differences between two values of the coefficient of agreement for two groups of responses, $A_1$ and $A_2$, by performing a randomization test with the null hypothesis $H_0$: $A_1=A_2$. To perform this test, we take $l=1,\ldots,100$ random draws, without replacement, from the two groups of survey responses. For each $l$th draw, we calculate the difference in the coefficient of agreement for the two groups, $d_l=|A_{1,l}-A_{2,l}|$. We then calculate the p-value for rejection of the null hypothesis, i.e., the probability that a difference in agreement larger than the observed mean could occur by chance: \begin{equation} p=\dfrac{1}{100}\sum_{l=1}^{100}\left\{ \begin{array}{l@{\quad}l} 1, & d_l>d_{l,{\rm mean}}\\ 0, & d_l\leq d_{l,{\rm mean}} \end{array} \right.,\ \ (4) \end{equation} where $d_{l,{\rm mean}}$ is the mean of all $d_l$.
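The procedure in Eq. (4) can be sketched in Python as follows. This is only one possible reading of the description above: the size of each random draw is not stated in the text, so the `subsample_size` default (half the size of the smaller group, drawn equally from both groups) is an assumption, as are the function names and the example data.

```python
import random
from statistics import median


def agreement(responses, r_max=7):
    """Coefficient of agreement A = 1 - d_obs / d_exp for Likert responses."""
    n = len(responses)
    r_med = median(responses)
    d_obs = sum(abs(r_med - r) for r in responses)
    d_exp = (n / r_max) * sum(abs((r_max + 1) / 2 - r) for r in range(1, r_max + 1))
    return 1.0 - d_obs / d_exp


def randomization_p_value(group1, group2, n_draws=100, subsample_size=None,
                          r_max=7, seed=0):
    """Return (p, diffs) following Eq. (4), using equal-size draws from each group."""
    rng = random.Random(seed)
    if subsample_size is None:
        # Assumption: the draw size is not given in the text; use half the smaller group.
        subsample_size = max(2, min(len(group1), len(group2)) // 2)
    diffs = []
    for _ in range(n_draws):
        # random.sample draws without replacement within each group.
        s1 = rng.sample(group1, subsample_size)
        s2 = rng.sample(group2, subsample_size)
        diffs.append(abs(agreement(s1, r_max) - agreement(s2, r_max)))
    d_mean = sum(diffs) / n_draws
    # Eq. (4): fraction of draws whose difference exceeds the mean difference.
    p = sum(1 for d in diffs if d > d_mean) / n_draws
    return p, diffs


# Hypothetical ratings for two groups of different size.
high = [6, 7, 6, 5, 6, 7, 6, 6, 5, 7, 6, 6]
low = [4, 6, 7, 3, 5, 6, 2, 7, 5, 4, 6, 5, 3, 6, 7, 5, 4, 6]
p, diffs = randomization_p_value(high, low)
print(p)
```

Drawing the same number of responses from each group is a design choice made here so that the two values of A in each draw are computed from equally sized samples.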
3.1. Importance of different variables to climate model fidelity assessments across six Science Drivers
In this section, we discuss expert ratings of variable importance for the six science drivers. In order to understand whether participants' responses differed depending on their degree of expertise, we first divided the participants into two experience groups: those who rated themselves as "very experienced" in evaluating model fidelity were placed into the "high experience" group (N=36); all other participants were placed into the "low experience" group (N=60).
Figure 2. Science Driver 1: distributions of importance ratings, ranked by consensus, as quantified by the coefficient of agreement A, for variables with high expert consensus about their importance.
Figure 3. As in Fig. 2 but for variables with low expert consensus about their importance.
Figure 4. Science Drivers 1-3: mean responses, high and low experience groups, ranked by overall mean response from all participants; color of dots indicates standard deviation of responses.
We emphasize that our "low experience" group consists largely of working climate scientists over the age of 30 (95%), with a median of 10 years of experience in climate modeling. In other words, our "low experience" group mostly consists not of laypersons, students or trainees, but of early-to-mid-career climate scientists with moderate levels of experience in evaluating and tuning climate models. Our "high experience" group consists largely of mid-to-late career scientists: the majority are over the age of 50 (53%), with a median of 20.5 years of experience in climate modeling. Researchers on the development of expertise have argued that roughly 10 years of experience are needed for the development and maturation of expertise (Ericsson, 1996); 86% of our "high experience" group members have 10 years or more of climate modeling experience.
3.1.1. Science Driver 1: How well does the model reproduce the overall features of the Earth's climate?
Our first Science Driver asked respondents to assess the importance of different variables to "the overall features of Earth's climate". We believe that this statement summarizes the primary aim of most experts when calibrating a climate model. However, experts' typical practices are likely to be influenced by factors such as the tools and practices used by their mentors and immediate colleagues, their disciplinary background, and their research interests. Such factors could contribute to differences in judgments of what constitutes a "good" model simulation. The aim of this Science Driver is to understand what experts prioritize when the goal is relatively imprecisely defined as optimizing the "overall features" of climate; these responses can then be contrasted with the more specific questions in the following five Science Drivers.
Figures 2 and 3 show the distribution of responses for each variable in Science Driver 1 for the high and low experience groups. Figure 4 (top) summarizes the mean and standard deviation of importance ratings for all variables in Science Driver 1. Overall, the variables most likely to be identified as "extremely important" were (in ranked order): rain flux (N=31), 2-m air temperature (N=28), longwave cloud forcing (N=22), shortwave cloud forcing (N=21), and sea level pressure (N=20). The complete distributions of responses for all science drivers by experience group, together with statistical summary variables and significance tests, are shown in Tables S1-13.
The distribution and degree of consensus is similar between the two groups, with no statistically significant differences for any variable (see Supplementary Tables S4-S6). This suggests that once an initial level of experience is acquired, additional experience may not lead to significant differences in judgments about model fidelity.
It is instructive to examine which variables are the exceptions to this general rule; these exceptions hint at insights into where and how greater experience matters most in informing the judgments experts make about model fidelity. The distribution of responses of the high experience and low experience group differed for only one item in Science Driver 1, the oceanic surface wind stress (p<0.01); for this variable, the median response of the high and low experience groups was "very important" and "moderately important," respectively. We speculate that the high-experience group may be more sensitive to this variable due to (1) its critical importance to ocean-atmosphere coupling, and (2) awareness of the relatively high-quality observational constraints available from wind scatterometer data.
We also investigated the degree of consensus on the importance of different variables. We observe a clearly higher degree of consensus for some variables, compared to others. Across all participants (high and low experience groups together), there is a comparatively high degree of consensus on the importance of shortwave cloud forcing (A=0.67), longwave cloud forcing (A=0.62), and rain flux (A=0.62). By contrast, there is comparatively little agreement on the importance of oceanic surface wind stress (A=0.39), due to the discrepancy between experience groups on this item, and on the aerosol optical depth (AOD; A=0.42). The data we collected do not allow us to be certain of the reasoning behind importance ratings, but the lack of consensus on AOD importance is perhaps unsurprising in light of the high uncertainty associated with the magnitude of aerosol impacts on climate (Stocker et al., 2013), and recent controversies among climate modelers on the importance of aerosols to climate, or lack thereof (Booth et al., 2012; Stevens, 2013; Seinfeld et al., 2016).
3.1.2. Science Driver 2: How well does the model reproduce features of the global water cycle?
Our second Science Driver included a comparatively limited number of variables related to the global water cycle (Fig. 4: middle). These should be considered in combination with Science Driver 6, which addresses the assessment of simulated clouds using a satellite simulator (Fig. 6).
Figure 5. Science Drivers 4-5: mean responses, high and low experience groups, ranked by overall mean response from all participants; color of dots indicates standard deviation of responses.
While the differences did not pass our criteria for statistical significance, we note a slight tendency for the high experience group to assign higher mean importance ratings to net TOA radiative fluxes and precipitable water amount. We speculate that this might be due to a slightly greater awareness of, and sensitivity to, observational uncertainties among the high experience group, expressed as a higher importance rating for variables with stronger observational constraints from satellite measurements. This interpretation is supported by the comment of one study participant (with 20 years' experience in climate modeling), who observed that "surface LH [latent heating] and SH [sensible heating] are not well constrained from obs[ervations]. While important, that means they aren't much use for tuning."
3.1.3. Science Driver 3: How well does the model simulate Southern Ocean climate?
For Southern Ocean climate, surface interactions that affect ocean-atmosphere coupling, including wind stress, latent heat flux (evaporation) and rain flux, together with shortwave cloud forcing, were identified as among the most important variables by our participants (Fig. 4: bottom).
The high experience group rated rain fluxes as more important (median: "very" important) compared to the low experience group (median: "moderately" important; probability of difference: p=0.02).
It is interesting to compare the responses with Science Driver 1, which included many of the same variables. For instance, for AOD, the low experience group assigned a lower mean importance for overall climate (mean: 4.32; σ: 1.41) than for Southern Ocean climate (mean: 4.04; σ: 1.49); the high experience group assigned a higher mean importance for overall climate (mean: 4.64; σ: 1.16) than for Southern Ocean climate (mean: 4.34; σ: 1.13).
The reasons for this discrepancy are unclear. One possibility is that the high experience group may be more aware that over the Southern Ocean, AOD provides a poor constraint on cloud condensation nuclei (Stier, 2016), and is affected by substantial observational uncertainties, with estimates varying widely between different satellite products.
Figure 6. Science Driver 6: mean responses, high and low experience groups, ranked by overall mean response from all participants; color of dots indicates standard deviation of responses.
3.1.4. Science Driver 4: How well does the model simulate important features of the water cycle in the Amazon watershed?
On Science Driver 4, which addresses the water cycle in the Amazon watershed (Fig. 5: top), participants identified surface sensible and latent heat flux, specific humidity, and rain flux as the most important variables for evaluation. It is possible that the more experienced group is more sensitive to the critical role of land-atmosphere coupling in the Amazonian water cycle. This interpretation would be consistent with the additional variables suggested by our survey participants for this science driver, which also focused on variables critical to land-atmosphere coupling, e.g. "soil moisture", "water recycling ratio", and "plant transpiration" (Supplementary Table S2). While the variables selected for the survey focused largely on mean thermodynamic variables, commenters also mentioned critical features of local dynamics in the Amazon region, such as surface topography and "wind flow over the Andes", "convection", and vertical velocity at 850 hPa.
3.1.5. Science Driver 5: How well does the model simulate important features of the water cycle in the Asian watershed?
For Science Driver 5, focused on the Asian watershed, participants rated rain flux, surface latent heat flux, and net shortwave radiative flux at the surface as the most important variables (Fig. 5: bottom). For variables included in both Science Drivers, the order of variable importance was the same as in the Amazon watershed, but different than in the Southern Ocean; some of these differences will be discussed in section 3.3. Written responses again mentioned soil moisture (3×) and moisture advection (2×) as important variables missing from the list.
3.1.6. Science Driver 6: How well does the model simulate the climate impact of clouds globally?
The final Science Driver addressed the evaluation of cloud properties in the model (Fig. 6) using a satellite simulator, which produces simulated satellite observations and retrievals based on radiative transfer calculations in the model. "Very important" (6) was the most common response for all variables in Science Driver 6 (Supplementary Table S15).
While differences in responses between the two experience groups did not pass our bar for statistical significance, the high experience group selected "extremely important" more frequently than the low experience group for the "high level cloud cover" and "low cloud cover" items, which also had the highest mean importance ratings in this Science Driver.
Five participants indicated that longwave cloud forcing and shortwave cloud forcing should have been included, and one respondent noted "A complete vertical distribution of cloud properties would be even more interesting than "low", "medium" and "high" cloud cover. Cloud particle size and number would also be interesting." Another responded that "cloud fraction is a model convenience but is quite arbitrary."
3.2. Impact of experience on judgments of variable importance
We hypothesized that: (H1) respondents with less experience in climate modeling would differ from more experienced respondents in their judgments of relative variable importance; and (H2) respondents with greater experience in climate modeling would exhibit greater consensus in their judgments of the importance of different variables.
(H1): Using a Chi-squared significance test (details in the Supplementary Material), we find support for differences in assessment of variable importance by high and low experience groups, but only for certain selected variables. Compared to the low experience group, the high experience group rated ocean surface wind stress as more important to evaluation of global climate (Science Driver 1) and rain flux as more important to evaluation of Southern Ocean climate (Science Driver 3).
Some other differences are observable between the two groups (see Supplementary Tables S10-S15), but did not meet our criteria for significance; it is possible that additional differences would emerge if a larger survey population could be attained.
(H2): We find no statistically significant differences in degree of consensus between the high and low experience groups.
Figure 7. Perceived barriers to systematic quantification of model fidelity. Answers were selected from a predetermined list in response to the prompt: "Which one among the following, do you feel, is the biggest barrier towards systematic quantification of model fidelity?"
The lack of large differences in responses between the high and low experience groups suggests that variations in importance ratings are mainly driven by factors that are unrelated to the amount of experience the scientists have. Examples could include the specific subdiscipline of the individual expert, or the practices and research foci that are common in their particular research community or geographic area. This result also suggests that expertise in climate model evaluation may reach a plateau after a certain level of proficiency is attained, with additional experience leading to only incremental changes in expert evaluations and judgments. One possible reason for this is that the process of model evaluation is constantly evolving as updated model versions incorporate additional processes and improvements, new observational datasets become available, and new tools are developed to support the evaluation process. As a result, climate scientists continually need to update their understanding about climate models and their evaluation to reflect the current state-of-the-art. Another possible explanation is that the culture of the climate modeling community may promote an efficient transfer of knowledge, as more experienced scientists offer training and advice to less experienced colleagues and to other research groups, shortening the learning curve of new scientists entering the field.
3.3. Impact of Science Drivers on judgments of variable importance
We expected that survey participants would rate the importance of the same model variables differently depending on the science goals, and indeed this is what we found. In this section, we focus on the ratings from the high experience group, but results from the low experience group are similar.
For instance, rain flux was rated as less important to evaluation of the Southern Ocean (mean: 6.00; σ: 1.12) than to global climate (mean: 6.14; σ: 0.92) or the Asian watershed (mean: 6.32; σ: 1.00), while shortwave and longwave cloud forcing were rated as less important to the Asian watershed (shortwave: mean: 5.48; σ: 0.84; longwave: mean: 5.23; σ: 1.01) than to global climate (shortwave: mean: 5.89; σ: 1.02; longwave: mean: 5.78; σ: 1.02) or Southern Ocean climate (shortwave: mean: 5.63; σ: 0.86; longwave: mean: 5.56; σ: 0.90). Surface wind stress was rated more important in the Southern Ocean (mean: 5.84; σ: 1.30), and less important in the Asian watershed (mean: 5.10; σ: 1.33), compared to its importance to global climate evaluation (mean: 5.81; σ: 1.02). While total cloud liquid water path was rated as equally important in the Southern Ocean (mean: 5.09; σ: 1.10), Amazon watershed (mean: 5.06; σ: 1.29), and Asian watershed (mean: 5.13; σ: 1.13), total cloud ice water path was rated as less important to the evaluation of the model in the Amazon watershed (mean: 4.45; σ: 1.52) and Asian watershed (mean: 4.74; σ: 1.22), compared to the Southern Ocean (mean: 5.03; σ: 1.13).
These differences indicate that experts adjust the importance assigned to different metrics depending on the science question or region they are focusing on. As a result, we recommend that future work focused on understanding or quantifying expert judgments of model fidelity should always be explicit about the scientific goals for which the model under assessment will be evaluated.
3.4. Perceived barriers to systematic quantification of model fidelity
We also explored the community's perceptions about the current obstacles to systematic quantification of model fidelity (Fig. 7). Survey participants identified the lack of robust statistical metrics (28%) and lack of analysis tools (10%) as major barriers, with 17% selecting "all of the above".
Many participants selected the option "Other" and contributed written comments. We grouped these into qualitative categories of responses. The most commonly identified issues related to:
• Lacking or inadequate observational constraints and error estimates for observations (8×);
• Laboriousness of the tuning process (7×); and
• Challenges associated with identifying an appropriate single metric of model fidelity (7×).
On the final point, many of the comments focused on the risk of oversimplifying the analysis and evaluation of models: "Focusing on single metrics over simplifies the analysis too much to be useful. It is often hard to identify good vs. bad because one aspect works while others don't, and different models have different trade offs." "No one metric tells the whole story; this may lead to false confidence in model fidelity." Another commenter noted that "it's very hard to create a single metric that accurately encapsulates subjective judgments of many scientists." Finally, several respondents noted other barriers, including a perceived lack of sufficient expertise in the community, a perception that some widespread practices are inadequate or inappropriate for model evaluation, and a lack of sufficient attention to model sensitivities, as opposed to calibration with respect to present-day mean climate.
Figure 8. Illustration of the concept of overall model fidelity rankings and their sensitivity to expert weights. Consider the pair of models uq1 and uq2, where the overall fidelity of the model is evaluated as a weighted mean of several component scores. If uq1 performs better than uq2 on some component scores, but worse on others, the ranking of these models according to their overall mean fidelity metric will be sensitive to how strongly each component metric is weighted. In this example, the rankings of several models using "naive weights" (unweighted average) are compared to rankings that use importance weights derived from the responses of two different experts in our survey.
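The sensitivity of rankings to weights described in the caption of Fig. 8 can be illustrated with a small Python sketch. All numbers below are hypothetical: the component scores, the variable names, and the two sets of "expert" weights are invented for illustration and are not survey results, and using raw seven-point importance ratings directly as weights is an assumption rather than the weighting scheme of this study.

```python
def overall_fidelity(scores, weights):
    """Weighted mean of component scores; weights need not be normalized."""
    total_weight = sum(weights[v] for v in scores)
    return sum(weights[v] * scores[v] for v in scores) / total_weight


def rank_models(model_scores, weights):
    """Return model names ordered from highest to lowest overall fidelity."""
    return sorted(model_scores,
                  key=lambda m: overall_fidelity(model_scores[m], weights),
                  reverse=True)


# Hypothetical component scores (higher means better agreement with observations).
model_scores = {
    "uq1": {"rain_flux": 0.9, "sw_cloud_forcing": 0.5, "wind_stress": 0.6},
    "uq2": {"rain_flux": 0.6, "sw_cloud_forcing": 0.8, "wind_stress": 0.7},
}

# "Naive" weights give every component the same importance.
naive = {"rain_flux": 1, "sw_cloud_forcing": 1, "wind_stress": 1}
# Hypothetical importance ratings (1-7) from two experts, used directly as weights.
expert_a = {"rain_flux": 7, "sw_cloud_forcing": 4, "wind_stress": 4}
expert_b = {"rain_flux": 4, "sw_cloud_forcing": 7, "wind_stress": 6}

for label, w in [("naive", naive), ("expert A", expert_a), ("expert B", expert_b)]:
    print(label, rank_models(model_scores, w))
```

With these invented numbers, uq1 scores better on rain flux and worse on the other two components, so an expert who weights rain flux heavily ranks uq1 first, while the unweighted average and the second expert rank uq2 first.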