Choice experiments in non-market value analysis: some methodological issues

Purpose – This paper reviews the current literature on theoretical and methodological issues in discrete choice experiments, which have been widely used in non-market value analysis, such as the elicitation of residents' attitudes toward recreation or biodiversity conservation of forests. Design/methodology/approach – We review the literature and attribute the possible biases in choice experiments to theoretical and empirical aspects. In particular, we introduce regret minimization as an alternative to random utility theory and shed light on incentive compatibility, status quo, attribute non-attendance, cognitive load, experimental design, survey methods, estimation strategies and other issues. Findings – Practitioners should pay attention to many issues when carrying out choice experiments in order to avoid possible biases. Many alternatives in theoretical foundations, experimental designs, estimation strategies and even explanations should be taken into account in practice in order to obtain robust results. Originality/value – The paper summarizes the recent developments in methodological and empirical issues of choice experiments and points out the pitfalls and future directions both theoretically and empirically.


Introduction
Choice experiments have been used for a long time to estimate consumer preferences and predict consumer behavior in market valuation studies (Gao and Schroeder, 2009; Louviere and Hensher, 1982; Lusk and Schroeder, 2004) and non-market valuation studies (Boxall et al., 1996; Morey et al., 2002). Forests serve multiple ecological functions, and their non-market values cannot be easily elicited. Choice experiments have been widely applied to study residents' attitudes toward recreation (Saelen and Ericson, 2013; Juutinen et al., 2014), carbon sequestration, biodiversity conservation and ecological services (Baranzini et al., 2012) and other conservation values of forests (Cerda, 2006; Cerda et al., 2014).
A choice experiment is a survey approach designed to elicit consumer preferences based on hypothetical markets. Respondents are required to choose between multiple public or private goods. The researcher expects this choice to occur by trading off the individual attributes of the different goods available, and choosing the good (or alternative) that provides the most utility. This approach to consumer behavior was first developed by Lancaster (1966), who states that the utility from a good is not derived from the good itself, but from its individual attributes. From a series of observed choices, a researcher then tries to infer the latent utility function. Traditionally, McFadden's (1974) random utility approach is used to describe the utility gained from a certain alternative on the basis of the attributes, utility weights for each attribute and a random error term to make the estimation of the utility weights feasible. Finally, the estimated model can be used for welfare estimation or market share predictions.
However, researchers are faced with a number of choices when designing a choice experiment. The initial step in designing a choice experiment is the development of an attribute list. The number and type of attributes (either quantitative or qualitative) critically depend on the decision-making context, and attributes need to be thoroughly tested. For an economic valuation study, it is essential that one attribute capturing the cost of the alternative is included. Next, levels must be assigned to each attribute, where great care must be given to realism and local nuances. Depending on the size of each alternative (in terms of number of attributes), the researcher must decide on how many alternatives to include in a single choice set, and whether an "opt-out"-option should be included. In market good valuation studies, these alternatives usually describe the "would not buy any" option in a choice set. When choosing among non-market goods such as environmental amenities, this option is often considered as a "status quo" option, which simply describes the state that the respondent is currently in.
After the researcher has decided on the number of attributes, levels and alternatives, the next step is to develop an experimental design. Full factorial or orthogonal fractional factorial designs, D-optimal designs and Bayesian designs have been proposed in the literature. When the design has been created and prepared such that the cognitive burden on the respondent is low enough to yield reliable responses, the supporting questions of the questionnaire can be developed. These include socio-demographic questions, but can also target attitudes, behavioral aspects or attribute attendance. Debriefing questions may help investigate possible reliability issues of the model estimates in later stages of the analysis.
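To make the size of a full factorial design concrete, the following minimal sketch enumerates every attribute-level combination. The attribute names and levels are hypothetical and chosen only for illustration; they are not taken from any study reviewed here.

```python
from itertools import product

# Hypothetical attributes and levels for a forest-recreation choice experiment.
attributes = {
    "biodiversity": ["low", "medium", "high"],
    "trail_access": ["none", "limited", "full"],
    "annual_cost": [0, 10, 25, 50],
}

def full_factorial(attrs):
    """Return every combination of attribute levels as a list of dicts."""
    names = list(attrs)
    return [dict(zip(names, combo)) for combo in product(*attrs.values())]

profiles = full_factorial(attributes)
n_profiles = len(profiles)                     # 3 * 3 * 4 profiles
n_pairs = n_profiles * (n_profiles - 1) // 2   # unordered two-alternative choice sets
```

Even this small example yields 36 profiles and 630 possible pairs, which suggests why fractional or efficient designs are usually preferred over presenting the full factorial.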
To elicit preferences, one or several survey modes must be chosen. Examples include mail or online surveys, or mixed-mode approaches. Respondents can be contacted via professional survey companies, or researchers may prefer to draw random samples themselves. Hidden populations might even be contacted via Internet forums or via snowball sampling.
After the sample has been collected, model estimation is the next step. Here, the researcher has to decide which type of model should be estimated. The standard model is the multinomial logit (MNL) model. However, in recent years, a number of other models have been developed which avoid some of the restrictive assumptions of the MNL model, such as the independence of irrelevant alternatives (IIA) assumption or the preference homogeneity assumption. The random parameters logit (RPL) model assumes some form of distribution of the parameters and therefore allows for preference heterogeneity across the sample. Latent class models represent the choices of discretely distributed, "latent" respondent types and computationally separate respondents into different classes. While the traditional underlying model in the analysis of choices has been the random utility model, recent developments have incorporated regret theory into choice models (Chorus et al., 2008). Both models can be estimated using the methods described above, but differences in model interpretation and the underlying decision framework make a closer look interesting and necessary.
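The IIA property of the MNL model can be illustrated with a short numerical sketch. The utilities below are hypothetical values, not estimates from any study; the point is only that the ratio of two choice probabilities does not change when a third alternative is removed.

```python
import math

def mnl_probabilities(utilities):
    """Multinomial logit choice probabilities for deterministic utilities v_j."""
    v_max = max(utilities)  # subtract the maximum for numerical stability
    expv = [math.exp(v - v_max) for v in utilities]
    total = sum(expv)
    return [e / total for e in expv]

# Hypothetical utilities for alternatives A, B and a status quo option.
p_three = mnl_probabilities([1.0, 0.5, 0.0])
# Drop the status quo: under IIA the ratio P(A)/P(B) is unchanged.
p_two = mnl_probabilities([1.0, 0.5])

ratio_three = p_three[0] / p_three[1]
ratio_two = p_two[0] / p_two[1]
```

Both ratios equal exp(1.0 - 0.5), whatever other alternatives are in the choice set. Models such as RPL or latent class relax exactly this restriction.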
Finally, results can be used to estimate the benefits of policies across the target population, or the willingness to pay for new products. Further, preference weights can be used to predict consumer behavior in specific scenarios (see Hensher et al. (2005); Louviere et al. (2000)).
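For a utility function that is linear in attributes and cost, marginal willingness to pay for an attribute is commonly computed as the negative ratio of the attribute coefficient to the cost coefficient. The sketch below assumes such a linear specification; the coefficient values and attribute names are hypothetical.

```python
def marginal_wtp(coefficients, cost_name="cost"):
    """Marginal WTP per attribute: -beta_attribute / beta_cost (linear utility assumed)."""
    beta_cost = coefficients[cost_name]
    if beta_cost == 0:
        raise ValueError("cost coefficient must be non-zero")
    return {
        name: -beta / beta_cost
        for name, beta in coefficients.items()
        if name != cost_name
    }

# Hypothetical MNL estimates; the cost coefficient is negative as expected.
estimates = {"biodiversity": 0.8, "trail_access": 0.3, "cost": -0.04}
wtp = marginal_wtp(estimates)
```

With these illustrative numbers, the ratio gives roughly 20 and 7.5 currency units per unit change in the two attributes.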
While choice experiments have already been used for decades (see e.g. Hanley et al. (1998); Hanley et al. (2001) and Hoyos (2010) for comprehensive reviews of the application to environmental policy choices), the method has developed rapidly in both theory and methodology, with the aim of making it fit better into the framework of economic theory and human decision processes. Therefore, there is a call to summarize these advances. In particular, we focus on, but do not restrict ourselves to, studies regarding the valuation of non-market goods. We compare the findings in the different steps of conducting a choice experiment and conclude with some general recommendations. An overview of the publications discussed in this paper and their main innovations is given in Table 1.

Method
In this paper, we systematically review the literature on important issues in choice experiments. We used the scientific search engines Google Scholar, ScienceDirect and PubMed. Our primary search terms included choice experiment, choice issues, attribute processing, regret theory, random regret model, status quo option and incentive compatibility. Secondary search terms included order effects, experimental design, efficient design, pivot design, endogeneity, welfare effects, willingness to pay, qualitative methods and attribute design.

Theoretical foundations
In this section, we present some theoretical issues regarding the design and analysis of choice experiments. First, we introduce alternative choice rules by discussing the random regret model developed by Chorus et al. (2008). Then, we move on to the issue of incentive compatibility.

Departures from utility theory
The standard model to analyze discrete choice experiments has been McFadden's (1974) random utility model (RUM). In a choice setting, a respondent $i$ is expected to maximize his utility $u_i$, which is composed of a deterministic, observable part $v_i$ and a stochastic, unobservable part $\varepsilon_i$. Assumptions about the distribution of this error term allow estimating the deterministic utility function with some form of binary econometric framework (see Train (2009) for details). However, in recent years, alternative decision models have been suggested to analyze choices, in particular regret theory (Bell, 1982; Fishburn, 1982; Loomes and Sugden, 1982). According to Zeelenberg (1999, p. 326), regret is "the negative, cognitively based emotion that we experience when realizing or imagining that our present situation would have been better had we acted differently". Applications of regret theory in choice modeling include, for example, Chorus et al. (2008), Boeri et al. (2014), Hess and Stathopoulos (2013) and Thiene et al. (2012). A complete overview of the regret model and its econometric application as the Random Regret Model (RRM) is provided by Chorus (2012). In short, instead of maximizing (expected) utility, respondents are expected to minimize their (anticipated) regret from the non-chosen alternatives. As Chorus et al. (2008, p. 15) point out, the two decision paradigms lead to very different outcomes: while utility maximizers prefer alternatives that perform well on most attributes, regret minimizers choose alternatives which perform reasonably well on all attributes. An intuitively appealing form of the regret function (Chorus, 2012, p. 8) is
$$R_i = \sum_{j \neq i} \sum_{m} \max\{0,\ \beta_m (x_{jm} - x_{im})\},$$
where $(x_{jm} - x_{im})$ describes the difference in levels of attribute $m$ between alternatives $i$ and $j$, and $\beta_m$ is interpreted as attribute $m$'s potential contribution to the regret function.
Table 1, attribute processing and ANA:
Hensher (2007) | Chapter in Kanninen (2007) | Theoretical exposition on different attribute processing strategies, influence of complexity on attribute processing
Hensher (2010b) | Chapter in Proceedings of the International Choice Modeling Conference 2010 | Dempster-Shafer belief functions to assess processing strategy, attribute non-attendance, attribute aggregation
Mariel et al. (2012) | Conference

From this definition, the value of the regret function cannot be negative, meaning that if the attribute of the chosen alternative is already better than that of the non-chosen alternative, regret from this attribute is zero. However, this particular functional form has a kink (a discontinuous derivative) at 0, which makes it difficult to estimate. Chorus (2012) therefore proposes to approximate the regret function in the following way and to add an IID random error term $\varepsilon_i$ to form the random regret model:
$$R_i = \sum_{j \neq i} \sum_{m} \ln\left(1 + \exp[\beta_m (x_{jm} - x_{im})]\right), \qquad RR_i = R_i + \varepsilon_i.$$
One advantage of regret minimization is that, compared to the linear specification of the random utility model, the attributes of the regret model are only semi-compensatory, i.e. they do not serve as perfect substitutes. In addition, the model has been fully generalized to the estimation of choices under uncertainty (Chorus et al., 2008). However, difficulties arise in the estimation of welfare effects. While the random utility model is deeply rooted in microeconomic welfare theory, welfare measures based on regret theory are only now being developed (Boeri et al., 2014). In addition, Boeri et al. (2014) and Hess and Stathopoulos (2013) show approaches to estimate the proportions of respondents who are utility maximizers and who are regret minimizers. Monte Carlo simulations by Boeri et al. (2013) indicated that choosing the wrong decision model can lead to significant bias in the estimated parameters. Chorus et al. (2014) review the literature and compare RRM and RUM in 21 studies with regard to (1) model fit, (2) predictive performance and (3) managerial implications. By applying the Ben-Akiva and Swait (1986) test for non-nested models, they look for statistically significant differences in model fit and find that contextual differences matter with regard to which model fits the data better.
In general, they find that for important or difficult decisions, such as which car to buy or which policy to choose, the RRM framework fits best, while decisions in leisure activities or travel choice were best modeled by the RUM. With regard to predictive performance and external validity, the RRM was found to perform significantly better than the RUM, although differences were generally small. Finally, different models were also found to influence managerial implications, for example through differences in elasticities and predicted market shares. Chorus et al. (2014) conclude that the choice between RUM and RRM should be made on the basis of where each model performs better in terms of model fit and predictive power. Alternatively, researchers may opt for a hybrid model, combining utility maximization and random regret minimization either arbitrarily or within a latent class framework (Boeri et al., 2014; Hess and Stathopoulos, 2013).
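The semi-compensatory nature of regret minimization can be illustrated numerically. The sketch below assumes the two regret specifications usually attributed to Chorus (the sum over competing alternatives and attributes of max{0, beta_m (x_jm - x_im)}, and its smooth approximation ln(1 + exp[beta_m (x_jm - x_im)])), together with logit choice probabilities based on negative regret. The attribute values and coefficients are hypothetical.

```python
import math

def linear_utilities(betas, alternatives):
    """Deterministic RUM utilities: sum over attributes of beta_m * x_im."""
    return [sum(b * x for b, x in zip(betas, alt)) for alt in alternatives]

def regret(betas, alternatives, i, smooth=True):
    """Regret of alternative i, summed over all other alternatives and attributes."""
    total = 0.0
    for j, other in enumerate(alternatives):
        if j == i:
            continue
        for b, x_j, x_i in zip(betas, other, alternatives[i]):
            gap = b * (x_j - x_i)
            # smooth=True: ln(1 + exp(gap)); smooth=False: max(0, gap)
            total += math.log1p(math.exp(gap)) if smooth else max(0.0, gap)
    return total

def logit(scores):
    """Logit probabilities for a list of scores (higher score, higher probability)."""
    top = max(scores)
    expv = [math.exp(s - top) for s in scores]
    total = sum(expv)
    return [e / total for e in expv]

# Hypothetical alternatives: two extremes and one compromise, equal attribute weights.
betas = [1.0, 1.0]
alternatives = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]

p_rum = logit(linear_utilities(betas, alternatives))
p_rrm = logit([-regret(betas, alternatives, i) for i in range(len(alternatives))])
kinked = [regret(betas, alternatives, i, smooth=False) for i in range(len(alternatives))]
```

In this example, all three alternatives have the same linear utility, so the RUM assigns equal probabilities. Under both regret forms, the middle alternative generates the least regret and therefore receives the largest choice probability. This is consistent with the observation that regret minimizers favor alternatives performing reasonably well on all attributes.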

Table 1.
Journal of Marketing Research | Control function approach to revealed preference data
Guevara and Ben-Akiva (2010) | Chapter in Proceedings of the International Choice Modeling Conference 2010 | Use control function method and show link between control functions and latent variables
Guevara and Polanco (2013) | Paper presented at the International Choice Modeling Conference 2013 | Use of a multiple indicator solution to correct for endogeneity
Guevara and Ben-Akiva (2012) | Transportation Science | Scale factor correction for models estimated by the control function method
However, no matter which decision rule is applied, valid results critically depend on whether respondents reveal their true preference in a questionnaire, and whether the responses are influenced by the structure and mode of the questions being asked. Recent findings on incentive compatibility and order effects are therefore described in the next sections.

Incentive compatibility
"An allocation mechanism or institution is said to be incentive compatible when its rules provide individuals with incentives to truthfully and fully reveal their preferences" (Harrison, 2007, p. 67).
The difference between hypothetical and actual WTP, known as hypothetical bias, has been the subject of several studies (e.g. Hensher, 2010;Murphy et al., 2005;Yu et al., 2016). While there have been some attempts to explain causes for hypothetical bias, it still lacks a general theory (Murphy et al., 2005). In addition, hypothetical bias can go both ways, depending on the context. For example, Brownstone and Small (2005) and Hensher (2010) found that in transportation research, hypothetical WTP is often lower than actual WTP. On the other hand, when valuing different private or public goods, WTP of the hypothetical scenario has been found to exceed the real WTP when respondents were forced to pay the stated amount for a project (Krawczyk, 2012;Murphy et al., 2005).
In their rigorous theoretical discussion of the incentive compatibility of different choice formats, Carson and Groves (2007) compare single binary choice questions with series of binary and multinomial choice questions. The authors conclude that, for a format to be incentive compatible, close attention has to be paid to the good being valued, the choice context and the payment vehicle. For example, valuing a private good without coercive payment might induce a respondent to overstate his WTP in the hypothetical question if that respondent has at least some probability of gaining positive utility from this good. In the mind of the respondent, overstating WTP in a non-consequential setting might increase the probability of the good being developed. In a non-market good setting, a voluntary payment mechanism might yield similar results. However, if the agency providing the public good can collect the payment coercively, the respondent's incentive to overstate his WTP may be reduced. The statement of true WTP further depends critically on whether the respondent perceives the proposed scenario as plausible (meaning the public good could technically be provided at the proposed cost), and on how the agency will decide which good in the choice set will be provided (either by majority rule or some other mechanism). Vossler et al. (2012) conducted an experiment for the valuation of planted trees along roads and rivers. They used four treatments, of which the first three required a real payment, while the fourth left the consequentiality of the treatment open. Further, they examined how different policy implementation methods influence choice behavior. Vossler et al. (2012) conclude that the notion of consequentiality is far more important than the "real vs. hypothetical" discussion in stated preference applications. Further, their results suggest a 30% increase in WTP for the treatment where no actual payment is defined.

WTP vs WTA
Practitioners have two measures with which to elicit non-market values: willingness to pay (WTP) and willingness to accept (WTA) (Freeman, 2003). Both theory and empirical evidence show a gap between WTP and WTA (Cerda, 2006; Cerda et al., 2014). The gap can be explained by many factors, such as design methods, respondents' inner attitudes, endowment effects and even legal differences (Freeman, 2003). Moreover, the basic assumptions underlying WTA and WTP differ and imply different legal contexts. Freeman (2003) points out that the implicit assumption of WTP is that respondents have to accept all policy changes and have to pay for maintaining the current situation, while WTA assumes that respondents have the right to maintain the current status and are compensated for the policy changes. This assumption has profound implications for experimental design, welfare change and the theoretical explanation of the results.

Decision process and choice
Status quo option and "do not know" responses
Most choice experiments include some type of opt-out or status quo option. In the market good context, this could be a "do not buy" option, while for non-market goods, an "I prefer the current situation" option is applicable in many cases. While this option adds realism to the choice situation, different surveys report "status quo bias" (Samuelson and Zeckhauser, 1988; Zhou et al., 2017) as a possible problem for welfare measurement. This may have various reasons. In the experimental economics literature, status quo bias has been attributed to the endowment effect, preferences for a legitimate alternative, preferences for inaction or a desire to avoid the complexity of a choice task (Boxall et al., 2009). Boxall et al. (2009) found evidence that, as the number of choice tasks increases, respondents are more likely to opt out. In addition, their findings indicate that older respondents choose a status quo option consistently more often than younger respondents. Including variables associated with status quo bias significantly changed the levels and the variance of welfare measures associated with some environmental change.
A way to measure a preference for the status quo is to include an alternative-specific constant for the status quo. Meyerhoff and Liebe (2009) argue that a significantly positive constant for the status quo could be interpreted either as the average effect of all attributes that were not included or as the utility associated with the status quo option, as suggested by Adamowicz et al. (1998). As Meyerhoff and Liebe (2009) demonstrate, the status quo constant can further be interacted with socio-demographic and behavioral characteristics of the respondent. In their choice experiment on sustainable forest management in Lower Saxony, Germany, they found that older, better educated, frequent forest users were less likely, while protest respondents (identified by a number of attitudinal debriefing questions) were more likely, to choose the status quo. They also found some evidence that respondents who perceive the choice task as too complex are more likely to choose the status quo. A similar strategy was applied by Lanz and Provins (2012), who also find significant influences of socio-demographic characteristics on status quo choice in the context of water provision in Switzerland. Both studies incorporate attitudinal questions to separate protest responses based on the credibility of the scenario, aversion toward the payment vehicle and the feeling of being provided with insufficient information. Overall, Lanz and Provins (2012) found that all three indicators of protest behavior significantly increased the probability of opting out, while variables indicating the perception of the survey (interesting, complicated, educational) were not found to be significant. Interestingly, a more in-depth description of the status quo alternative led to a significant reduction in status quo responses, all other things equal.
However, in many choice experiments, researchers have to deal with the issue of serial nonparticipation (Von Haefen et al., 2005). In their words, "one form of serial participation is when a respondent always chooses the status quo option" (Von Haefen et al., 2005, p. 1061). One may argue that the behavioral process guiding serial nonparticipation is different from utility maximization based on attributes, and therefore remove all respondents engaged in serial nonparticipation from the sample. As an alternative, Von Haefen et al. (2005) propose a hurdle model similar to those used to deal with excess zeros in contingent valuation studies. Lanz and Provins (2012) studied serial nonparticipation based on socio-demographics and protest attitudes. They found no statistically significant evidence that protest attitudes influenced serial nonparticipation, while the evidence that satisfaction with the status quo leads to a higher probability of serial nonparticipation was statistically significant. The feeling of having been provided with insufficient information, however, led to a significantly higher probability of serial nonparticipation.
While not as frequently used as status quo options, "do not know" responses can also be introduced into choice sets. Balcombe and Fraser (2011) develop a general framework for the treatment of "do not know" (DK) responses consistent with the nested logit model. Basically, they add to the likelihood function the probability of someone giving a DK response given that some other alternative is actually preferred. In the resulting likelihood function, $\theta_{\cdot|j}$ describes the probability of reporting DK given a preference for alternative $j$, $p_{ij}$ is the standard logit or probit probability, and $\varepsilon_i = 1$ if a preference was reported and zero otherwise. The expression in the first product describes the marginal probability of choosing DK. Balcombe and Fraser (2011) further generalize the model by introducing measures for the similarity between alternatives. They provide three model specifications: one allowing for a constant probability of choosing DK, one where it depends on the similarities between all the alternatives and one where the probability of DK depends on the similarity between the two alternatives that provide the highest utility. Each of these models can be estimated using a specific likelihood function.

Attribute processing
While the standard assumption in choice experiments is that respondents attend to all attributes equally, evidence has shown that respondents often use simplifying strategies when making their choices. Hensher (2007) cites Payne et al. (1992) when summarizing the most important strategies into two broad categories. Attribute-based strategies include elimination by aspects, lexicographic choice and majority of confirming dimensions. Alternative-based approaches include weighted additive, satisficing and equal-weight strategies. These strategies differ in the total amount of processing required, and the degree to which processing is consistent or selective across alternatives or attributes (Payne et al., 1992, p. 115). Hensher (2007) conceptualizes the response to a discrete choice experiment as a two-stage process: first the choice of the attribute processing strategy and second the choice among the offered alternatives conditional on the chosen processing strategy. In an empirical application, he finds that the number of attributes attended to increases as the number of attribute levels declines and the range of an attribute increases. Further, increasing the number of alternatives also significantly increased the number of attributes attended to. Clear evidence was found for choice set simplification through addition of attributes (e.g. different components of travel time). Hensher (2010) further investigates different approaches to attribute processing and incorporates three heuristics into the analysis: common-metric attribute aggregation, common-metric parameter transfer and attribute non-attendance. By estimating a number of mixed logit and latent class models, he shows that various heuristics in attribute processing influence WTP estimates substantially. While it is relatively easy to estimate the impact of a given choice heuristic, it is more difficult to investigate which strategy was actually chosen by the respondent.
Using supporting questions such as "Which attributes did you not attend to?" is convenient. However, Hensher (2010b) proposes to delve deeper into the respondent's psyche and to apply a Dempster-Shafer belief function to investigate the role of attribute processing in choice experiments.

A growing body of literature has started to tackle the issue of attribute non-attendance (ANA). In principle, "stated" and "inferred" methods of detecting ANA can be distinguished (Kravchenko, 2014). A series of studies have used stated methods to investigate the effects of attribute non-attendance, either at the level of the choice sequence (Alemu et al., 2013) or at the level of individual choice tasks (Colombo and Glenk, 2013; Quan et al., 2018). Mariel et al. (2012) compare stated and inferred ANA methods in the context of wind farms in Germany and conclude that stated ANA is not always consistent with inferred ANA. They estimate inferred ANA by the method developed by Hess and Hensher (2010). Therefore, Alemu et al. (2013, p. 341) ask for reasons why a certain attribute was ignored, including (1) the attribute is not important to me, (2) ignoring the attribute made it easier to choose between the alternatives, (3) attribute levels were unrealistically high/low, (4) I do not think the attribute should be weighed against the others and (5) do not know. Alemu et al. (2013) argue that reason one exhibits genuine preferences, while reasons three and four exhibit protest behavior. Ignoring the attribute to make the choice easier was chosen particularly often for attributes with non-market good characteristics. Colombo and Glenk (2013) distinguish between attribute non-attendance and alternative non-attendance in the context of agricultural subsidies of the European Common Agricultural Policy. Based on stated attribute non-attendance after each choice set, they estimate a series of models in which they consider the non-attendance of individual attributes, and find that the benefits of asking debriefing questions after each individual choice would outweigh the additional effort for the respondent. They also consider the possibility that an alternative would not be considered at all due to an unacceptable attribute level.
They conclude that attribute non-attendance is very common and that the inclusion of the additional information leads to better statistical performance in the estimated models. Scarpa et al. (2009) present an empirical framework to estimate the effect of attribute non-attendance based on a latent class approach and a Bayesian approach. In the latent class approach, they divide their sample into several classes with total attendance to all attributes, total non-attendance (by restricting all parameters to zero) and partial non-attendance (by restricting the parameters of an individual attribute to zero). Further, they investigate non-attendance to combinations of attributes, in particular the combinations of cost and non-monetary attributes. Not accounting for ANA is found to overestimate WTP measures compared to models where non-attendance, in particular to the combinations with the cost attribute, is taken into account. In the Bayesian approach, they account for taste heterogeneity among non-zero taste entities. Findings with regard to WTP were similar to the latent class approach; however, the variance of the estimated parameters was comparably higher. Scarpa et al. (2009) conclude that severe attribute non-attendance could be identified from their dataset, and recommend that future research focus on the implications of attribute non-attendance for welfare estimates. In addition, they propose a further investigation into the appropriate supporting questions to better identify attribute non-attending choice strategies.
Hensher et al. (2012) provide a probabilistic model to incorporate attribute non-attendance based on a latent class approach. Their basic idea assumes that each respondent is part of one of $2^K$ classes (with an associated probability), each of which ignores a certain combination of the $K$ attributes. This probability is then multiplied with the conditional probability of choosing the selected alternative. Similar to Scarpa et al. (2009), Hensher et al.'s approach has the advantage that it does not rely on stated information about which attributes were not attended to. Puckett and Hensher (2008) pick up DeShazo and Fermo's (2004) notion of rationally adaptive behavior, "which assumes that decision makers acknowledge that information processing is costly, and hence full attention to the information in a choice task may not be optimal" (Puckett and Hensher, 2008, p. 380). They used two follow-up questions after each choice set to account for possible adaptively rational behavior, including ignored attributes (either all, or for a specific alternative) and the possibility of adding up attributes along a common dimension (e.g. all cost attributes). Puckett and Hensher (2008) also confirm that incorporating attribute aggregation and ignored attributes into model estimation leads to very different WTP estimates compared to the standard bounded rational model.
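The latent class treatment of attribute non-attendance described above can be sketched as a mixture over attendance patterns. This is a simplified illustration under stated assumptions, not the authors' estimation code: class probabilities are taken as given rather than estimated, an MNL kernel is assumed within each class, and all numbers are hypothetical.

```python
import math
from itertools import product

def mnl_prob(betas, alternatives, chosen):
    """MNL probability of the chosen alternative for linear utilities."""
    utilities = [sum(b * x for b, x in zip(betas, alt)) for alt in alternatives]
    v_max = max(utilities)
    expv = [math.exp(v - v_max) for v in utilities]
    return expv[chosen] / sum(expv)

def ana_classes(k):
    """All 2**k attendance patterns over k attributes (1 = attended, 0 = ignored)."""
    return list(product([0, 1], repeat=k))

def ana_choice_probability(betas, alternatives, chosen, class_probs):
    """Class-probability-weighted sum of conditional choice probabilities."""
    total = 0.0
    for mask, pi in class_probs.items():
        # Ignored attributes have their coefficient restricted to zero.
        restricted = [b * m for b, m in zip(betas, mask)]
        total += pi * mnl_prob(restricted, alternatives, chosen)
    return total

# Hypothetical setting: K = 2 attributes (quality, cost), three alternatives.
betas = [0.5, -0.1]
alternatives = [(2.0, 10.0), (1.0, 5.0), (0.0, 0.0)]
class_probs = dict(zip(ana_classes(2), [0.1, 0.2, 0.2, 0.5]))

p_mix = ana_choice_probability(betas, alternatives, 0, class_probs)
p_full = ana_choice_probability(betas, alternatives, 0, {(1, 1): 1.0})
p_none = ana_choice_probability(betas, alternatives, 0, {(0, 0): 1.0})
```

When all probability mass sits on the full-attendance class, the expression collapses to the ordinary MNL probability. When all attributes are ignored, every alternative is equally likely.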

Order effects
The desire to gain additional information from each respondent, and thereby bring down the costs of survey implementation, has led to the inclusion of multiple independent choice sets in a single questionnaire. Standard microeconomic theory conceptualizes these choices as resting on two assumptions: (1) all respondents truthfully answer the questions being asked, and (2) true preferences are stable over the course of a sequence of questions (McNair et al., 2011, p. 556).
With regard to the choice set order, this means that respondents will not be influenced by the order in which choice sets are presented. A number of studies have contested this idea and investigated so-called order effects (Bateman et al., 2008; Carson and Groves, 2007; Day et al., 2012; Day and Pinto Prades, 2010; McNair et al., 2011; Scheufele and Bennett, 2012; Vossler et al., 2012). Day et al. (2012) divide order effects into position-dependent and precedent-dependent order effects. Position-dependent order effects influence the respondent's choice because of the position of a choice set within a series of choice sets. This includes, for example, institutional learning (Scheufele and Bennett, 2012): a respondent might have had undeveloped preferences for the good in question, which he discovers while going through the choice sets. One way of tackling this would be the inclusion of "warm-up" choice sets preceding the series of choice sets (Carson et al., 1994). However, Meyerhoff and Glenk (2013) point out that this strategy might actually induce starting point bias. Using a split sample approach, they compared samples with and without warm-up choice sets. They found significant differences in WTP between the samples and conclude that including warm-up choice sets "might do more harm than good" (Meyerhoff and Glenk, 2013, p. 25). Becoming more familiar with the choice context might also induce strategic learning, where a respondent alters his choice behavior to make a specific strategic goal more likely, without a change in his preferences. Offering a second choice set might also alert the respondent to the possibility of the price being uncertain or open to negotiation (Carson and Groves, 2007). Being associated with the uncertain variance in future income, WTP might decline for subsequent choice sets. Finally, respondents might become fatigued by the number of choice sets and fail to carefully consider all attributes toward the end of the experiment.
They might devolve into satisficing strategies to reduce their cognitive load or show bias for status quo (Day et al., 2012;Giampietri et al., 2016).
In precedent-dependent order effects, the attributes of previous choice sets affect current choices. For example, to achieve low-cost provision of a public good, a respondent might systematically reject all alternatives offered at a price higher than the lowest price observed thus far. McNair et al. (2011, p. 556) describe a similar concept, cost-driven value learning: an alternative is more (less) likely to be chosen if its cost level is low (high) compared to the levels in the previous choice task(s). Undefined preferences might also induce starting point effects, where respondents compare the attributes of the current choice set to those of the first choice set (Day et al., 2012; Ladenburg and Olsen, 2008).
Only a few studies have tried to isolate the various types of order effects in DCEs empirically. Research on order effects was pioneered by studies examining the double-bounded contingent valuation format (e.g. Hanemann et al., 1991; Cameron and Quiggin, 1994), where the evidence suggested that WTP estimates from the first and second valuation questions were not drawn from the same distribution. These findings bear directly on the credibility of WTP estimates from DCEs, in which a series of seemingly independent choice questions is asked. The available empirical evidence suggests that order effects do exist in DCEs. For example, Scheufele and Bennett (2012) found that responses depend significantly on previous levels of the cost attribute. In particular, if the cost level of an alternative is the highest in the series observed so far, respondents are less likely to choose this alternative in a binary choice setting with the status quo and one alternative. Having the minimum cost in the series so far, however, did not significantly increase the choice probability. Similar results were found by Day and Pinto Prades (2010) and Zhou et al. (2017). Further, Scheufele and Bennett (2012) [...]. Day et al. (2012) test a whole series of order effects in their study of preferences for tap water quality improvements. First, they find that the probability of choosing the status quo, regardless of the alternative's attribute levels, is influenced by whether choice sets are revealed sequentially (STP) or in an advanced disclosure (ADV) format. With regard to the presentation mode, they find strong evidence for position dependence in the STP, but not in the ADV mode. As the position dependence of STP converges to the level of ADV toward the end, Day et al. argue that institutional learning (in contrast to preference learning) is likely to occur in STP mode. 
Toward the end of the experiment, the status quo option was significantly more often chosen in STP than in the ADV mode. This can be explained by the theory of loss of credibility as more different combinations at different costs are observed. However, fatigue and the aforementioned income uncertainty hypothesis might also be a reasonable explanation. Day et al. also find evidence of precedent-dependence; however, this is observed in both treatment groups. They use the water quality improvement per cost unit as preference-weighted "deal" measure, and calculate a vector of deal-measures including first task, directly previous choice task, best task and worst task thus far in the series of tasks. While the first deal and the best deal so far significantly shaped preferences of the current choice in both ADV and STP treatment, the worst choice was only significant in the STP treatment, and the directly previous choice was not significant at all. Two explanations for this asymmetry are offered, including either a more cautious perspective of respondents in the STP treatment or strategic misrepresentation of their preferences in the ADV treatment.
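The precedent-dependence variables described above can be illustrated with a minimal sketch. It assumes that each choice task has already been summarized by a single "deal" value (for example, quality improvement per cost unit, with higher values being better deals); the function name and the example values are hypothetical and not taken from Day et al. (2012).

```python
from typing import Dict, List, Optional


def deal_history(deals: List[float]) -> List[Dict[str, Optional[float]]]:
    """For each task t, summarize the deals seen in the tasks before t.

    deals[t] is a hypothetical "deal" value for task t
    (e.g. quality improvement per cost unit; higher = better deal).
    """
    rows = []
    for t in range(len(deals)):
        seen = deals[:t]  # tasks presented before task t
        rows.append({
            "current": deals[t],
            "first": seen[0] if seen else None,
            "previous": seen[-1] if seen else None,
            "best_so_far": max(seen) if seen else None,
            "worst_so_far": min(seen) if seen else None,
        })
    return rows


# Illustrative values only.
deals = [2.0, 0.5, 3.0, 1.0]
history = deal_history(deals)
for row in history:
    print(row)
```

Each row could then enter a choice model as additional explanatory variables next to the attributes of the current task, which is one way to test whether earlier tasks shape current choices.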

Choice set design and attribute selection
The selection of attributes through qualitative processes is one of the key issues when designing a choice experiment. In the words of Louviere et al. (2000), "We cannot overemphasize how important it is to conduct this kind of qualitative, exploratory work to guide subsequent phases of the SP study". Despite this recommendation, the documentation of the process of attribute selection in the literature has been sparse (Elgart et al., 2012). However, recently a number of studies in the medical field have been published describing in detail their method of selecting the attributes and levels in a choice experiment (Abiiro et al., 2014;Coast et al., 2012;Kløjgaard et al., 2012;Michaels-Igbokwe et al., 2014). Coast et al. (2012, p. 731) criticize that with regard to attribute selection, studies hardly report information on sampling, recording, transcription or analytical methods in empirical qualitative studies, or information on search terms, inclusion and exclusion criteria and data extraction methods in literature reviews. Important characteristics of attributes include (1) importance to the respondent for the decision, (2) sufficient distance to the latent construct investigated in the choice experiment (e. g. overall utility should not be an attribute if the researcher is trying to estimate a utility function), (3) single attributes should not have such a large impact that a large number of respondents make no errors in decision making and (4) attributes should not be intrinsic to a person's personality (Coast et al., 2012, p. 734). They review eight papers in the health economics context and make a strong argument for qualitative research approaches in attribute development. They argue that qualitative methods enable the researcher to develop richer and more nuanced attributes than when just taking attributes from the literature or tailored toward a particular policy question. 
Also, qualitative research could help in refining the language in questionnaires so that respondents understand the meaning intended by the researchers. However, they also report some challenges in applying qualitative methods, such as the opportunity costs of acquiring qualitative research skills, and the reluctance of experienced qualitative researchers to boil down complex relationships into simple and easily comprehensible attributes. Finally, Coast et al. (2012) provide a guideline on how attribute development should be reported, including a rationale for the method used to develop the attributes, the type of sampling and information on how interviews were conducted, details of the analysis, a description of the results, and which attributes were problematic and how they were changed or removed from the experiment. Abiiro et al. (2014) pick up Coast et al.'s (2012) suggestions and describe in detail their approach to attribute design in a choice experiment in the context of micro-finance health insurance in Malawi. Like most studies, they start with a literature review and extract the most important attributes. These attribute lists are used to develop a semi-structured guide for a qualitative study including members of the target population. Abiiro et al. (2014) stress that a literature review alone may not capture important attributes specific to the local population. Therefore, community members were led through focus group discussions using open-ended questions, and interviews were conducted with key informants from the health industry. The attributes and their levels were then directly extracted from transcripts using qualitative data analysis, and further narrowed down through additional expert interviews. Attributes were dropped if they overlapped with other attributes, if a clear preference was already visible from the focus group discussions (to avoid dominance) or if they had been identified as less important. 
All the dropped attributes were fixed to a standard level described in the introduction of the choice experiment. Finally, Abiiro et al. (2014) conclude that their qualitative framework could be complemented by basic quantitative methods such as best-worst scaling and nominal group ranking techniques.
Similarly, Michaels-Igbokwe et al. (2014) use a mixture of focus group discussions and key informant interviews to select attributes in their choice experiment on health services for young people in Malawi. However, they first engage in a decision mapping process, in which they explore the possible motivations for young people to require access to health services (e.g. distinguishing those who had used the service before from those who had not, then delving deeper into the reasons why some young adults had not used them). This allowed them to structure their experiment accordingly. Kløjgaard et al. (2012) report their experience of using qualitative processes in designing a choice experiment in the context of degenerative disc diseases of the spine. A literature review to better understand the decision-making context revealed the most relevant patient groups, and also surfaced some instances where patients would regret their decision to have surgery ex post. In addition, Kløjgaard et al. (2012) conducted three days of observational field work in a spine surgical treatment ward, observing the patients' questions, thoughts and motivations and conducting interviews with doctors. After these phases, a first list of attributes and levels was proposed and a preliminary questionnaire was developed. In-depth interviews were conducted with two doctors and three patients, discussing the chosen attributes, which led to the revision of some attributes and levels. In particular, patients were asked whether the attributes should be included, whether some of them were connected, whether the formulation was understandable and whether they felt any dominance among the attributes. Next, level formulation and range were discussed, as well as whether a labeled or unlabeled design should be used. 
In particular, it was found that a labeled (here: surgical vs non-surgical treatment) design might bias a respondent with pre-formed preferences toward the preferred label, without considering the rest of the attributes. Finally, the framing and the total design and layout were discussed with regard to their comprehensibility, length and complexity. Based on these interviews, the attribute list was revised again before conducting a quantitative pilot test.
All of the studies discussed above conclude that qualitative processes are important in attribute design, in particular when shaping the experiment to a certain local context. However, they also point out some difficulties in conducting qualitative research, including the effort and required research skills and difficulty to reduce the wealth of information obtained into a few simple attributes. It would make sense to extend the experience from health economics studies to other fields, such as environmental valuation studies in different cultural contexts.

Experimental design
The experimental design is at the core of the choice experiment. It ensures an unbiased/uncorrelated distribution of attributes and levels among choice sets, and therefore significantly impacts the consistency and efficiency of the estimated parameters. The researcher has to choose which of the many available designs (e.g. randomly drawn designs, orthogonal main effects designs, various efficient designs or full factorial designs) to use for the specific research task. This choice critically depends on the performance of these designs with respect to estimating utility functions. While the classic indicators of design quality were orthogonality (i.e. no attributes correlated with each other) and attribute balance (each level of an attribute occurs exactly the same number of times throughout the design), recent developments have relaxed the orthogonality condition in favor of an efficiency measure. Orthogonal designs for the specific task can usually be found in catalogs, and choice sets can be constructed by randomly pairing alternatives with each other. Other options include cycling through the attribute levels for each alternative or the mix-and-match method described by Johnson et al. (2007).
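The two classic quality indicators can be checked mechanically. The following minimal sketch assumes effects-coded two-level attributes (levels -1 and 1) and hypothetical helper names; it builds a full factorial, extracts a half fraction and compares it with a deliberately confounded subset.

```python
from collections import Counter
from itertools import product
from math import sqrt


def full_factorial(levels):
    """All combinations of the given attribute levels (one row per profile)."""
    return [list(p) for p in product(*levels)]


def is_balanced(design):
    """True if, within each attribute, every level that appears occurs equally often."""
    for column in zip(*design):
        counts = Counter(column)
        if len(set(counts.values())) != 1:
            return False
    return True


def correlation(x, y):
    """Pearson correlation of two attribute columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def max_abs_correlation(design):
    """Largest absolute pairwise correlation between attribute columns."""
    cols = list(zip(*design))
    return max(abs(correlation(cols[i], cols[j]))
               for i in range(len(cols)) for j in range(i + 1, len(cols)))


full = full_factorial([[-1, 1], [-1, 1], [-1, 1]])        # 8 profiles
half = [row for row in full if row[0] * row[1] * row[2] == 1]  # orthogonal half fraction
confounded = [[-1, -1, -1], [-1, -1, 1], [1, 1, -1], [1, 1, 1]]  # attributes 1 and 2 move together

for name, design in [("full", full), ("half", half), ("confounded", confounded)]:
    print(name, is_balanced(design), max_abs_correlation(design))
```

The confounded subset is balanced but not orthogonal, which illustrates why both indicators are usually reported together.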
Efficient designs minimize some measure of the expected estimation error, in most cases the D-error (Huber and Zwerina, 1996). The D-error is calculated from the asymptotic covariance matrix of the parameter estimates (Street and Burgess, 2007). In linear models, the asymptotic covariance matrix is approximated by the inverse of the second derivatives of the estimated function. In nonlinear models, such as the multinomial logit model and its generalizations, this covariance matrix is calculated through the matrix of second derivatives of the log-likelihood function. In particular, the value being minimized is the determinant of the inverse of the negative expected value of the matrix of second derivatives of the log-likelihood function:

$$\Omega(\beta, X) = \left( -E\left[ \frac{\partial^2 \ln L(\beta, X)}{\partial \beta \, \partial \beta'} \right] \right)^{-1}$$

and

$$D\text{-error} = \det\left( \Omega(\beta, X) \right),$$

where $\beta$ denotes the parameter vector and $X$ the design. This expression is fully general and can be applied to a wide range of models, e.g. the multinomial logit, nested logit, random parameters logit, etc. While most designs used in environmental valuation studies have assumed a linear utility function, recent research has shown an increase in the use of non-linear functions to account for the non-linearity of the multinomial logit model or other models used in the analysis. Sándor and Wedel (2001) demonstrate a Bayesian design procedure that assumes a prior distribution of likely parameter values. In particular, they use marketing managers' prior beliefs about the market share of some product to construct the prior parameter distributions, and then construct the experimental design by minimizing the Bayesian $D_B$-error (i.e. the expectation of the aforementioned D-error over the prior distribution of the parameter values). They use a multinomial logit model to investigate and minimize the $D_B$-error. Bliemer et al. (2009) show how to generate efficient designs for analysis with nested logit models and find a significant increase in efficiency compared to standard orthogonal designs. However, this requires a correct specification of the priors. 
By using Bayesian priors, the design can be made less sensitive to incorrect prior specifications. Also, the efficiency of this particular design is sensitive to misspecifications of the nesting structure. Bliemer and Rose (2010) develop a design strategy for experiments analyzed with the random parameters logit model that allows for correlations across observations. Similar to the nested logit design, the random parameters logit designs are also sensitive to misspecification of the prior distributions of the parameters. Several researchers (Bliemer and Rose, 2010; Ferrini and Scarpa, 2007) suggest reviewing the literature, or conducting a pilot study using an orthogonal design, in order to find suitable prior parameters or parameter distributions.
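To make the D-error criterion concrete, the following minimal sketch evaluates it for a multinomial logit model with a linear utility function. It relies on the standard result that the expected information matrix of this model is the sum over choice sets of $\sum_j P_j (x_j - \bar{x})(x_j - \bar{x})'$, with $\bar{x}$ the probability-weighted mean profile. The $1/K$ exponent used below is a common normalization and an assumption here; it does not change which design minimizes the criterion. All function names, the toy designs and the prior draws are hypothetical.

```python
import random
from math import exp


def mnl_probs(choice_set, beta):
    """Multinomial logit probabilities for one choice set (rows = alternatives)."""
    v = [sum(b * x for b, x in zip(beta, alt)) for alt in choice_set]
    top = max(v)
    e = [exp(u - top) for u in v]
    total = sum(e)
    return [x / total for x in e]


def information_matrix(design, beta):
    """Expected Fisher information of the MNL model for a given design."""
    k = len(beta)
    info = [[0.0] * k for _ in range(k)]
    for choice_set in design:
        p = mnl_probs(choice_set, beta)
        xbar = [sum(pj * alt[a] for pj, alt in zip(p, choice_set)) for a in range(k)]
        for pj, alt in zip(p, choice_set):
            d = [alt[a] - xbar[a] for a in range(k)]
            for a in range(k):
                for b in range(k):
                    info[a][b] += pj * d[a] * d[b]
    return info


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            det = -det
        det *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return det


def d_error(design, beta):
    """det(inverse information)^(1/K); infinite if the design cannot identify beta."""
    det_info = determinant(information_matrix(design, beta))
    return float("inf") if det_info <= 0 else det_info ** (-1.0 / len(beta))


def bayesian_d_error(design, prior_draws):
    """Average D-error over draws from an assumed prior distribution."""
    return sum(d_error(design, b) for b in prior_draws) / len(prior_draws)


good = [[[1, 0], [0, 1]], [[1, 1], [0, 0]]]   # two choice sets, two attributes
bad = [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]    # same trade-off repeated twice

rng = random.Random(0)
draws = [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(200)]

print(d_error(good, [0.0, 0.0]))   # design evaluated at zero priors
print(d_error(bad, [0.0, 0.0]))    # parameters not separately identified
print(bayesian_d_error(good, draws))
```

A design search would evaluate this criterion for many candidate designs and keep the one with the lowest value; the search itself is omitted here.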
A number of studies have performed Monte Carlo analyses with regard to this issue (Carlsson and Martinsson, 2003; Ferrini and Scarpa, 2007; Gao et al., 2010; Lusk and Norwood, 2005). In particular, Ferrini and Scarpa (2007) test whether designs using a priori information (including Bayesian designs with weakly informative and informative priors) perform better (in terms of deviations from the true parameters) than standard design approaches (a shifted fractional factorial orthogonal design). They found that the Bayesian design, under "good" prior information, is robust to model misspecification if the sample size is large enough, even more so than the designs without prior information. Also, in general, their shifted orthogonal design was superior to the D-optimal design with prior information. Gao et al. (2010) extended the study by incorporating the attribute information load into their Monte Carlo experiment. Using one continuous and one non-continuous utility function, they estimated the parameters obtained from different design strategies (randomly drawn, orthogonal main effects, minimal D-optimal and random pairings from a full factorial) to find out which design produces the most efficient WTP measures.
Overall, these studies share one finding: increasing the sample size significantly moves the WTP measures toward their true values (Carlsson and Martinsson, 2003; Ferrini and Scarpa, 2007; Gao et al., 2010; Lusk and Norwood, 2005). Also, while Lusk and Norwood (2005) find no statistically significant differences between any of the experimental estimates and the true WTP values, they find that designs that include interactions lead to more precise WTP estimates. According to their findings, larger designs do not necessarily perform better: a main-effects plus two-way interactions orthogonal design containing 243 choice sets did not provide more efficient estimates than a D-efficient two-way interaction design containing 31 choice sets. This has important implications for questionnaire design, as it allows for less complex questionnaires, which could lead to more accurate information. Gao et al. (2010) found that WTP measures have a quadratic relationship with the number of attributes in the choice experiment, and recommend a maximum of three attributes for discrete utility functions. However, they recognize that many aspects of a choice experiment (e.g. statistical efficiency, cognitive burden, budget constraints) have to be traded off against each other.
Apart from purely technical considerations in experimental design creation, some contextual implications should be considered as well. For example, the researcher has to decide whether to use a labeled or a generic design (Johnson et al., 2007). As the name suggests, the labeled design contains information in its label, and therefore requires a different estimation strategy than a purely generic design, where all the information is captured by the attributes. For example, a choice experiment on wine might use different labels to describe production methods (e.g. organic, conventional) in each choice set. The other attributes (taste, price, etc.) might then be analyzed with respect to each production method separately.
Also, different approaches toward the choice of attribute levels have been proposed, specifically in transportation research. In particular, Rose et al. (2008) suggest the use of pivot designs instead of absolute attribute levels. The basic idea of a pivot design is to let the respondent enter his status quo alternative (e.g. the attributes of his regular transportation to work) and then have the alternatives pivot around this base alternative. While this approach allows for more flexibility, Hess and Rose (2009) list a number of cautions when using data from this type of design. By analyzing data from a transport choice experiment, they found correlations in the error terms across the replications of the reference trips, as well as differences in the variance between the reference trip and the hypothetical trips. Also, their findings suggest asymmetric preference formation around the reference attribute levels.
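The basic mechanics of a pivot design can be sketched as follows. The example assumes that hypothetical alternatives are generated as relative deviations from the levels reported by the respondent; the attribute names, the shift percentages and the function name are illustrative assumptions and are not taken from Rose et al. (2008).

```python
from itertools import product


def pivot_alternatives(reference, relative_shifts):
    """Build hypothetical alternatives as relative deviations from a reference alternative.

    reference:        attribute levels reported by the respondent
    relative_shifts:  for each attribute, the assumed proportional changes to apply
    """
    names = list(reference)
    combos = product(*(relative_shifts[name] for name in names))
    return [
        {name: round(reference[name] * (1.0 + shift), 2)
         for name, shift in zip(names, combo)}
        for combo in combos
    ]


# Illustrative reference trip entered by one respondent.
reference_trip = {"travel_time_min": 40.0, "cost": 5.0}
shifts = {"travel_time_min": [-0.25, 0.0, 0.25], "cost": [-0.2, 0.2]}

alternatives = pivot_alternatives(reference_trip, shifts)
for alt in alternatives:
    print(alt)
```

In practice, only a designed subset of these candidate profiles would be shown to each respondent, alongside the reference alternative itself.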
An important consideration is the cognitive burden that the respondent faces when working through the choice tasks. In most cases, the orthogonal or efficient design will still be too large for one respondent to handle. Therefore, designs can be split into multiple blocks by using blocking algorithms that try to keep the within-block orthogonality maximal (Wheeler, 2011). Kessels et al. (2011) introduce a different approach to reducing the cognitive burden, namely the use of partial profiles. They describe a two-stage procedure that generates Bayesian D-optimal designs. In essence, these designs keep some attributes constant over all alternatives, and therefore reduce the cognitive burden on the respondent. They also provide instructions for creating utility-neutral designs, i.e. designs in which the choice probabilities of all alternatives are equal. Kessels et al. (2011) conclude that their designs are about 10-20% less efficient than full profile designs; however, this drawback might be compensated by the lower risk of respondents making non-compensatory decisions (i.e. not attending to all attributes), which is more likely in designs that are more difficult to evaluate.

Survey mode and sampling
The mode of surveying has been found to influence results of stated choice questionnaires. Commonly available modes include face-to-face (f2f) interviews, mail surveys, telephone interviews and online questionnaires (Champ and Welsh, 2007). As Champ and Welsh (2007) and Dillman and Christian (2005) provide an excellent overview of different survey strategies and their pros and cons, we will only briefly discuss the most important pitfalls that can occur when deciding on a survey mode, and discuss recent findings on differences in responses between different survey modes. Research on differences between survey modes has been conducted on contingent valuations (Macmillan et al., 2002;Maguire, 2009;Marta-Pedroso et al., 2007), but less in choice experiments. The exception is Olsen (2009), who conducted a choice experiment on protecting different types of landscape encroachment in Denmark. They compared response characteristics between an online panel and a random mail survey. While they found only a small difference in response rates between the two survey modes, they report a significantly larger number of protest bids in the mail sample. Further, while WTP estimates do not differ significantly, they found more homogeneous preferences in the mail vs the online sample. Regarding socio-demographics, the two samples did not differ significantly.
A key issue in the choice of survey mode is the sampling frame, and whether it can be reached better or worse by a specific mode. For example, a population of elderly might be less likely to have Internet connection at home, and therefore might be excluded from online surveys targeted at the general public. Also, there might be systematic demographic or attitudinal differences between people who use online surveys and people who respond to mail questionnaires, depending on the context of the survey (Marta-Pedroso et al., 2007). While researchers from different fields recommend the f2f method (Arrow et al., 1993), issues have been raised with regard to increased social desirability bias and interviewer effects (Ethier et al., 2000;Leggett et al., 2003;Maguire, 2009), resulting in a higher reported WTP than in self-administered modes.

Estimation strategy
Frequentist inference
In principle, one can analyze a choice experiment according to different decision rules: the classic random utility maximization rule (McFadden, 1974), which is described in Section 3, or random regret theory (Chorus et al., 2008). In the context of random utility maximization, the multinomial logit and probit models are the most basic models. The probability of choice in the multinomial logit model takes the form

$$p(i) = \frac{\exp(V_i)}{\sum_{j} \exp(V_j)},$$

where $V_i$ denotes the deterministic part of the utility of alternative $i$. This model can be straightforwardly estimated by maximum likelihood. Some researchers also include individual specific constants into this functional form and therefore estimate a conditional logit model (McFadden, 1974). However, these models come with the restrictive assumption of independence of irrelevant alternatives (IIA). To allow for correlation between individual alternatives, the nested logit model can be used to group alternatives according to some criteria. This imposes the assumption on the preference structure that the individual first chooses between the groups, and then within the chosen group. For example, a choice set in transportation research might consist of a private car, bus or train. The respondent might first choose between public and private transport, and then (after choosing public transport) choose between bus and train. This can be modeled by imposing a nested structure on the decision rule.
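The multinomial logit probabilities and the IIA property can be illustrated with a minimal sketch. The utility values for car, bus and train are made-up numbers, and the function names are hypothetical.

```python
from math import exp, log


def mnl_probabilities(utilities):
    """Multinomial logit choice probabilities for one choice set."""
    top = max(utilities)                      # subtract the maximum for numerical stability
    weights = [exp(v - top) for v in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(chosen, utilities_per_task):
    """Log-likelihood of observed choices (index of chosen alternative per task)."""
    return sum(log(mnl_probabilities(u)[c])
               for c, u in zip(chosen, utilities_per_task))


# Illustrative deterministic utilities: car, bus, and (optionally) train.
car_bus = mnl_probabilities([1.0, 0.5])
car_bus_train = mnl_probabilities([1.0, 0.5, 0.5])

# IIA: the odds of car versus bus do not change when train is added.
print(car_bus[0] / car_bus[1])
print(car_bus_train[0] / car_bus_train[1])
```

Under IIA the train draws proportionally from car and bus, even though one would expect it to compete mainly with the bus; this is the pattern that the nested logit model relaxes.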
Several models have been proposed to account for preference heterogeneity among individuals, in particular latent class models (Greene and Hensher, 2003), error components logit and random parameters logit (McFadden and Train, 2000). While multinomial and nested logit models only allow for individual preferences (i.e. marginal utilities) that are fixed and equal across the population (except for interactions with individual-specific characteristics), the random parameters logit model allows the researcher to impose some distribution on the parameters. With this specification, model parameters can be positive, negative or (near) zero for some parts of the population, which adds more realism to the model. A fourth class of models is the latent class model. Similar to the random parameters model, parameters are assumed to vary within the population; however, they are discretely distributed. This usually allows the researcher to estimate entire parameter sets for a specified number of "latent classes". All of the mentioned models are well established in the literature and can be estimated using free and open-source software such as R (R Core Team, 2014) or Biogeme (Bierlaire, 2003), or commercial statistical packages such as STATA (© StataCorp) and Nlogit (© Econometric Software, Inc.); see, e.g. Train (2009) for further details on estimation.
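The difference between continuously and discretely distributed parameters can be sketched as follows. The example assumes independent normal distributions for the random parameters and approximates the random parameters logit probability by averaging logit probabilities over random draws; the latent class probability is a share-weighted average over class-specific parameter sets. Function names, attribute values and parameter values are hypothetical.

```python
import random
from math import exp


def mnl_probabilities(alternatives, beta):
    """Logit probabilities for alternatives described by attribute rows."""
    v = [sum(b * x for b, x in zip(beta, alt)) for alt in alternatives]
    top = max(v)
    weights = [exp(u - top) for u in v]
    total = sum(weights)
    return [w / total for w in weights]


def simulated_rpl_probabilities(alternatives, means, sds, draws=500, seed=0):
    """Random parameters logit: average logit probabilities over normal draws."""
    rng = random.Random(seed)
    totals = [0.0] * len(alternatives)
    for _ in range(draws):
        beta = [rng.gauss(m, s) for m, s in zip(means, sds)]
        for i, p in enumerate(mnl_probabilities(alternatives, beta)):
            totals[i] += p
    return [t / draws for t in totals]


def latent_class_probabilities(alternatives, class_betas, class_shares):
    """Latent class logit: share-weighted average over class-specific parameters."""
    totals = [0.0] * len(alternatives)
    for beta, share in zip(class_betas, class_shares):
        for i, p in enumerate(mnl_probabilities(alternatives, beta)):
            totals[i] += share * p
    return totals


# Illustrative choice set: attributes are (cost, quality).
alternatives = [[10.0, 3.0], [6.0, 2.0], [0.0, 0.0]]

fixed = mnl_probabilities(alternatives, [-0.2, 0.5])
rpl = simulated_rpl_probabilities(alternatives, means=[-0.2, 0.5], sds=[0.1, 0.4])
lc = latent_class_probabilities(
    alternatives,
    class_betas=[[-0.4, 0.2], [-0.05, 0.9]],   # a price-sensitive and a quality-oriented class
    class_shares=[0.6, 0.4],
)
print(fixed, rpl, lc)
```

Estimation would maximize the likelihood built from such probabilities over the distribution parameters or class shares; that step is left to the packages named above.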
If the researcher chooses random regret theory as the decision rule, the analysis can be conducted similarly to the above. Chorus (2012) shows that a simple random regret model can be estimated by

$$p(i) = \frac{\exp(-R_i)}{\sum_{j} \exp(-R_j)},$$

where $p(i)$ is the probability of choosing alternative $i$, and $R_i$ is the regret function (see Section 3). While this method is relatively new, packages already exist to estimate regret functions (e.g. Biogeme, Nlogit).
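A minimal sketch of these choice probabilities is given below. Because Section 3 is not reproduced here, the regret function is an assumption: the sketch uses the commonly cited form $R_i = \sum_{j \neq i} \sum_m \ln(1 + \exp(\beta_m (x_{jm} - x_{im})))$. Attribute values, parameters and function names are hypothetical.

```python
from math import exp, log


def regret(i, alternatives, beta):
    """Assumed regret of alternative i: sum over competitors and attributes of
    ln(1 + exp(beta_m * (x_jm - x_im)))."""
    total = 0.0
    for j, other in enumerate(alternatives):
        if j == i:
            continue
        for b, x_other, x_own in zip(beta, other, alternatives[i]):
            total += log(1.0 + exp(b * (x_other - x_own)))
    return total


def rrm_probabilities(alternatives, beta):
    """Choice probabilities as a logit in negative regret."""
    regrets = [regret(i, alternatives, beta) for i in range(len(alternatives))]
    low = min(regrets)                       # shift for numerical stability
    weights = [exp(-(r - low)) for r in regrets]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative choice set: attributes are (cost, quality).
alternatives = [[10.0, 3.0], [10.0, 2.0], [12.0, 2.0]]
beta = [-0.2, 0.5]                           # cost lowers, quality raises attractiveness

probabilities = rrm_probabilities(alternatives, beta)
print(probabilities)
```

Unlike in the multinomial logit model, the probability of one alternative here depends on how it compares with every other alternative on every attribute, so the composition of the choice set matters.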
While the random regret model is not a superior alternative to the random utility model, it allows the dissection of a sample into different decision strategies, and therefore the analysis of the drivers of these strategies (Hess and Stathopoulos, 2013). In the future, it will be interesting to analyze which individuals maximize utility and which minimize their regret. Applications could include comparisons between individuals of different age, gender, sociocultural background or ethnicity.

Bayesian inference
Bayesian estimation adds another way to obtain parameter estimates. As Train (2009) points out, Bayesian statistics provide an alternative view on the nature of parameters. In general, parameters are not seen as fixed, but as following a certain distribution. The researcher starts with some initial guess of this parameter distribution k(θ) and updates his subjective belief of the distribution as more information is obtained. While the asymptotic properties of Bayesian vs maximum likelihood estimation are identical, estimates can differ due to sampling effects. One advantage of Bayesian estimation is that it does not rely on asymptotic assumptions when calculating the variance-covariance matrix of the estimated parameters. However, this comes at the cost of computational intensity, since closed-form solutions of the required distributions are usually not available.

Endogeneity
Endogeneity in econometrics refers to correlation between an explanatory variable and the unobserved error term. Louviere et al. (2005) report several sources of endogeneity in stated preference methods, including the attribute non-attendance mentioned above, social interactions between individuals and strategic behavior. Omitted variable bias can occur if respondents infer values of an omitted attribute from included attributes. Train (2009) describes three methods to deal with endogeneity: the BLP approach, control functions and a maximum likelihood approach. The BLP approach, developed by Berry, Levinsohn and Pakes, separates the estimated utility function into two parts. The first part is a constant for each product in each market, which represents the average utility gained from that product within that market and is the same for all individuals; this part captures the endogenous error term that is correlated with another explanatory variable (e.g. price). The second part of the utility function captures the preferences of individuals and may include socio-demographic characteristics, as well as an i.i.d. error term. The constant term can be estimated in a choice model including a fixed effect for products and markets, and then be further disaggregated using a linear instrumental variable regression, as the error term is still assumed to be correlated with one of the explanatory variables. Walker et al. (2011), for example, apply the BLP method with social influence variables to correct for endogeneity in transport mode choice.
However, in several cases the endogeneity cannot be absorbed by the product-market constant, in particular when the endogeneity occurs at the individual level. An alternative to the BLP method is the control function method (Hausman, 1978; Heckman, 1978; Petrin and Train, 2010), which is set up similarly to simultaneous equation models. The idea of the control function method is to recover the part of the error term that is correlated with an explanatory variable in an instrumental variable regression, and then use the residuals of this equation in the estimation process of the choice model. The control function can be any type of function that describes the conditional mean of the error term in the endogenous choice model. A simple example by Train (2009, p. 335) is set up as follows:

$$U_{nj} = V(y_{nj}, x_{nj}, \beta_n) + \varepsilon_{nj}$$

$$y_{nj} = W(z_{nj}, \gamma) + \mu_{nj}$$

where the utility function $U_{nj}$ depends on the endogenous variable(s) $y_{nj}$, exogenous variables $x_{nj}$ and the estimated marginal utilities $\beta_n$. The endogenous variable $y_{nj}$ is explained by instruments $z_{nj}$, parameters $\gamma$ and the error term $\mu_{nj}$. The system is estimated in two steps. First, the instrumental variables regression explaining $y_{nj}$ is estimated using, for example, OLS, and the residuals are recovered. In the second step, the residuals are used to construct the conditional expectation of the error term $\varepsilon_{nj}$ in the utility function. If the conditional mean of $\varepsilon_{nj}$ is a simple linear function of $\mu$, a parameter for $\mu$ can be estimated by simply adding $\mu$ to the utility function. Depending on the assumptions on the distribution of $\mu$ and $\varepsilon$, e.g. whether the error terms are correlated across alternatives or not, different types of models can be estimated using probit, logit or mixed logit. To incorporate preference heterogeneity, the control function can also be interacted with socio-demographic characteristics (Petrin and Train, 2003).
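The two-step procedure can be sketched on simulated data. The example below assumes a binary choice between one alternative and an opt-out, a single endogenous variable, a single instrument and a linear control function; the data-generating process, parameter values and function names are all hypothetical and chosen only so that the true coefficient is known.

```python
import random
from math import exp, log


def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(n):
            if r != i:
                f = m[r][i] / m[i][i]
                for c in range(i, n + 1):
                    m[r][c] -= f * m[i][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """OLS coefficients; each row of regressors includes a constant."""
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    return solve(xtx, xty)


def fit_logit(rows, choices, iterations=25):
    """Binary logit by Newton-Raphson (maximum likelihood)."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, c in zip(rows, choices):
            v = max(-30.0, min(30.0, sum(b * x for b, x in zip(beta, r))))
            p = 1.0 / (1.0 + exp(-v))
            for a in range(k):
                grad[a] += (c - p) * r[a]
                for b in range(k):
                    info[a][b] += p * (1.0 - p) * r[a] * r[b]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    return beta


# Simulated data: y is endogenous because mu enters both y and the utility.
random.seed(1)
beta_true, lam_true = -1.0, 0.8
z_all, y_all, choices = [], [], []
for _ in range(4000):
    z = random.gauss(0.0, 1.0)                  # instrument
    mu = random.gauss(0.0, 1.0)                 # unobserved factor
    y = z + mu                                  # endogenous variable (e.g. price)
    u = min(max(random.random(), 1e-12), 1.0 - 1e-12)
    e = log(u / (1.0 - u))                      # logistic error
    choices.append(1 if beta_true * y + lam_true * mu + e > 0 else 0)
    z_all.append(z)
    y_all.append(y)

# Step 1: regress the endogenous variable on the instrument, keep the residuals.
g0, g1 = ols([[1.0, z] for z in z_all], y_all)
residuals = [y - g0 - g1 * z for y, z in zip(y_all, z_all)]

# Step 2: add the residuals to the utility function as a control function.
naive = fit_logit([[1.0, y] for y in y_all], choices)
corrected = fit_logit([[1.0, y, r] for y, r in zip(y_all, residuals)], choices)

print("naive coefficient on y:    ", naive[1])
print("corrected coefficient on y:", corrected[1])
```

Standard errors from the second step ignore the sampling error of the first-stage residuals and would need to be corrected (for example by bootstrapping) in an actual application.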
A similar method to the control function approach is the maximum likelihood approach (Train, 2009), sometimes called Full Information Maximum Likelihood (FIML). Rather than estimating the choice equation in two steps, both equations are estimated in one step. This requires the researcher to specify the joint distribution of μ and ε.
Several studies have applied these methods for dealing with endogeneity, either empirically or by using Monte Carlo simulations. One of the classical examples of endogeneity is omitted variable bias. Petrin and Train (2010) apply both the BLP and the control function approach to data on household television services (antenna, cable, cable with added premium, satellite dish). After correcting for endogeneity using the control function approach, they report several parameters switching to the correct signs compared to a model without a control function. Petrin and Train point out that the BLP approach is more difficult to implement than the control function approach, as BLP requires a contraction procedure. Also, because it tries to match predicted market shares to true market shares, it is not consistent in the case of sampling error. Overall, their application finds similar estimated parameters and elasticities for both approaches.
Guevara and Ben-Akiva (2010) use both the two-step control function approach and the FIML approach to investigate the properties of choice models with endogenous variables. Specifically, they show the link between control functions to correct for endogeneity and the use of latent variables (Walker and Ben-Akiva, 2002). Endogeneity often arises because some qualitative attribute (e.g. comfort) is difficult to measure and therefore not correctly specified in the model. In this case, the latent variable can be explained by an additional equation in a structural equation setting. After estimating an IV regression on the endogenous variable, the residuals are recovered and used as an explanatory variable in the latent variable equation, which is integrated out in the final estimation. Alternatively, the instrumental price equation can be integrated into the latent variable equation directly, and the likelihood function estimated in a single step. Using a series of Monte Carlo experiments, Guevara and Ben-Akiva (2010) found that including a latent variable in their estimation of a control function choice model outperformed both the two-stage control-function-only and the simultaneous-equation control-function-only models. Guevara and Polanco (2013) adapt the Multiple Indicator Solution (MIS) method (Wooldridge, 2002) for use in choice models. Arguing that valid instruments are often difficult to find, Guevara and Polanco (2013) use a system of equations in which two indicators, $q_1$ and $q_2$, are each explained by the omitted variable $q$, with error terms $e_{q_1}$ and $e_{q_2}$. Under a set of assumptions on this system, they show that running an auxiliary regression, recovering the residuals $\hat{\varepsilon}$ and adding them to the utility function can correct for endogeneity. Using Monte Carlo simulations, they find that control functions and MIS perform similarly well with regard to parameter efficacy and efficiency. 
However, they find that the MIS method is more robust to mild violations of the underlying assumptions. While the control function method leads to consistent ratios between the estimated parameters, the parameter estimates themselves are inconsistent (Guevara and Ben-Akiva, 2012). In standard logit choice models, the scale parameter of the extreme value distribution is not identifiable and is therefore usually normalized to one. Assuming that the error term of an endogenous choice model is $\varepsilon = v + e$, where $v$ is the part correlated with an unobserved variable and $e$ is i.i.d. extreme value, Guevara and Ben-Akiva (2012) approximate the joint distribution with an extreme value distribution. They propose a correction for the parameters of the control function model that depends on $\mu$, the scale factor of the extreme value distribution, and derive the resulting scale factor under the assumption that $\sigma_e^2 = \pi^2/3$. Since the ratio between the coefficients is still estimated consistently, Guevara and Ben-Akiva (2012) investigate the effect of omitting a correction of the scale factor on model elasticities and forecasting by Monte Carlo simulation. Their results show that omitting an orthogonal variable affects the scale of the parameters in the logit model, but does not significantly affect the ratios between parameters. However, omitting a variable that is correlated with some other explanatory variable affects both the scale and the ratios. Finally, the two-stage control-function method re-establishes a consistent estimate of the ratios, but the parameter scale remains affected. With regard to forecasting properties, similar results are found, with the forecast choice probabilities not being affected by the scale issues. Including the residuals from the first stage of the two-stage control function approach in the utility function significantly improves the forecast, while only adjusting for scale performs poorly.
In addition, Guevara and Ben-Akiva (2012) apply their ideas to real housing market data. Similar to their simulations, they find that the effect of price on choice is underestimated in a model where quality attributes are not accounted for. Also, they find that other effects (which are correlated with price and quality) are underestimated without the correction for endogeneity. They stress the important policy implications of these findings.
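The two-stage control-function logic discussed above can be illustrated with a small Monte Carlo sketch. It is not the procedure of any of the cited papers: the data-generating process, the coefficient values, the instrument `z` and the helper names (`solve`, `ols`, `fit_logit`) are all assumptions made for illustration. Price is correlated with an unobserved quality term, so a naive binary logit distorts the ratio of the attribute and price coefficients, while adding the first-stage residual should bring the ratio back close to its true value of one.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients from the normal equations."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

def fit_logit(X, y, iters=25):
    """Binary logit coefficients by Newton-Raphson, started at zero."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, xi))))
            for i in range(k):
                grad[i] += (yi - p) * xi[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * xi[i] * xi[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    return beta

# Assumed data-generating process (hypothetical values, true WTP = 1.0).
random.seed(0)
BETA_X, BETA_P, BETA_Q = 1.0, -1.0, 1.5
x, z, price, choice = [], [], [], []
for _ in range(4000):
    x_i, z_i, q_i = random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)
    p_i = z_i + q_i + 0.5 * random.gauss(0, 1)  # price rises with unobserved quality q
    v_i = BETA_X * x_i + BETA_P * p_i + BETA_Q * q_i
    y_i = 1 if random.random() < 1.0 / (1.0 + math.exp(-v_i)) else 0
    x.append(x_i); z.append(z_i); price.append(p_i); choice.append(y_i)

# Naive logit: quality is omitted, so price is endogenous.
b_naive = fit_logit([[1.0, a, p] for a, p in zip(x, price)], choice)
wtp_naive = -b_naive[1] / b_naive[2]

# Stage 1: regress price on the instrument and the exogenous attribute.
g = ols([[1.0, zi, a] for zi, a in zip(z, x)], price)
resid = [p - (g[0] + g[1] * zi + g[2] * a) for p, zi, a in zip(price, z, x)]

# Stage 2: add the first-stage residual to the utility function.
b_cf = fit_logit([[1.0, a, p, r] for a, p, r in zip(x, price, resid)], choice)
wtp_cf = -b_cf[1] / b_cf[2]

print(f"true WTP 1.00 | naive {wtp_naive:.2f} | control function {wtp_cf:.2f}")
```

In this sketch only the coefficient ratio is compared, because, as noted above, the control function restores the ratio while the scale of the individual coefficients can still differ from the true values.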

Demographic variables
Several methods exist to account for preference heterogeneity in discrete choice studies. One way of doing so is to include socio-demographic variables in the choice model. However, it has to be kept in mind that only the difference in utility between alternatives matters for choice probabilities, so a variable that does not vary across alternatives drops out unless its effect differs by alternative. If the study uses a labeled design, individual-specific intercepts for the alternatives can be estimated. If, on the other hand, a generic design is used, socio-demographics have to be interacted with some attribute in order to generate meaningful results. This leads to the convenient interpretation of how much a certain consumer group likes or dislikes a certain attribute, and adds flexibility to modeling and market share forecasts.
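The point that only utility differences count can be checked numerically. The following sketch uses a hypothetical generic design with two unlabeled forest programs; the attribute names (`trail_km`, `cost`) and all coefficient values are illustrative assumptions, not estimates from any study. An age term added to every alternative leaves the logit probabilities unchanged, whereas an age-by-attribute interaction changes them.

```python
import math

def mnl_probs(utilities):
    """Multinomial logit choice probabilities (softmax of the utilities)."""
    top = max(utilities)
    expu = [math.exp(u - top) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]

# Hypothetical generic (unlabeled) alternatives and coefficients.
ALTS = [{"trail_km": 10.0, "cost": 20.0}, {"trail_km": 25.0, "cost": 35.0}]
B_TRAIL, B_COST = 0.08, -0.05
B_AGE = 0.30          # age entering every alternative additively
B_TRAIL_AGE = -0.001  # age interacted with the trail attribute

def probs_additive(age):
    """Age shifts all utilities equally, so it cancels out of the probabilities."""
    return mnl_probs([B_TRAIL * a["trail_km"] + B_COST * a["cost"] + B_AGE * age
                      for a in ALTS])

def probs_interacted(age):
    """Age changes the marginal utility of trails, so probabilities depend on age."""
    return mnl_probs([(B_TRAIL + B_TRAIL_AGE * age) * a["trail_km"] + B_COST * a["cost"]
                      for a in ALTS])

for age in (25, 65):
    print(age, [round(p, 3) for p in probs_additive(age)],
          [round(p, 3) for p in probs_interacted(age)])
```

With these assumed values, the additive specification gives the same probabilities for a 25-year-old and a 65-year-old, while the interacted specification predicts that the older respondent is less likely to choose the program with more trails.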

Welfare measures
The standard welfare measure when using choice experiments to predict market shares and welfare changes is the compensating variation (CV). In short, the CV is defined as the amount of income that has to be taken away from an individual in order to make him or her as well off as before a price or policy change (Just et al., 2004). For a policy change, the CV is defined implicitly by $V^0(p^0, q^0, y) = V^1(p^1, q^1, y - CV)$, where the V's are the respective levels of utility, p's are vectors of market goods, q's are vectors of non-market goods and y is income. This can be easily solved numerically using the goal-seek function of a spreadsheet application. For a marginal change in one attribute, the marginal CV is simply calculated as the ratio $-\beta_{attr}/\beta_{price}$, where $\beta_{attr}$ represents the marginal utility obtained from the attribute, and $\beta_{price}$ represents the (negative of the) marginal utility of money. The standard errors of the (marginal) CV can be obtained via bootstrapping (Krinsky and Robb, 1986). In random parameters models, the calculation of the distribution of the CV is more difficult and can be obtained via Cholesky factorization of the variance-covariance matrix. A different approach to estimating the standard error of welfare measures is to estimate the model in willingness-to-pay space instead of preference space (Train and Weeks, 2006). Here, the estimated parameters can be interpreted directly as marginal willingness-to-pay for the attribute in question.
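The two calculations described above can be sketched in a few lines. Everything numerical here is a hypothetical assumption for illustration: the utility function `v`, its coefficients, the income level, and the point estimates and covariance matrix passed to `krinsky_robb_interval`. The first function mimics a spreadsheet goal-seek by bisection; the second draws coefficient pairs from their estimated asymptotic normal distribution, in the spirit of Krinsky and Robb (1986), using a hand-written 2x2 Cholesky factor.

```python
import math
import random

def solve_cv(v, q0, q1, income, tol=1e-9):
    """Find CV such that v(q1, income - CV) = v(q0, income) by bisection.

    Assumes q1 is an improvement over q0, so that 0 <= CV < income.
    """
    target = v(q0, income)
    lo, hi = 0.0, income - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if v(q1, income - mid) > target:
            lo = mid   # still better off than before: take away more income
        else:
            hi = mid
    return 0.5 * (lo + hi)

def marginal_wtp(b_attr, b_price):
    """Marginal CV (willingness-to-pay) as the coefficient ratio -b_attr / b_price."""
    return -b_attr / b_price

def krinsky_robb_interval(b, cov, draws=5000, seed=1):
    """Simulated 95% interval for -b_attr / b_price from correlated normal draws."""
    l11 = math.sqrt(cov[0][0])                 # Cholesky factor of the 2x2 covariance
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 ** 2)
    rng = random.Random(seed)
    sims = []
    for _ in range(draws):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        b_attr = b[0] + l11 * z1
        b_price = b[1] + l21 * z1 + l22 * z2
        sims.append(marginal_wtp(b_attr, b_price))
    sims.sort()
    return sims[int(0.025 * draws)], sims[int(0.975 * draws)]

# Hypothetical utility: linear in the non-market good q, logarithmic in income.
A_Q, B_Y = 0.5, 2.0
def v(q, y):
    return A_Q * q + B_Y * math.log(y)

cv = solve_cv(v, q0=1.0, q1=3.0, income=1000.0)

# Hypothetical estimates: (b_attr, b_price) and their covariance matrix.
estimates = (0.6, -0.04)
covariance = [[0.01, 0.0001], [0.0001, 0.000025]]
wtp = marginal_wtp(*estimates)
wtp_lo, wtp_hi = krinsky_robb_interval(estimates, covariance)
print(f"CV = {cv:.2f}, marginal WTP = {wtp:.2f}, 95% interval = ({wtp_lo:.2f}, {wtp_hi:.2f})")
```

For the assumed log-income utility the bisection result can be checked against the closed form $CV = y\,(1 - e^{-a(q^1 - q^0)/b})$, which is a convenient sanity test before applying the same routine to a utility function without an analytical solution.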

Conclusions and further research
Choice experiments have come a long way since their introduction in transportation research. The ecological multifunctionality of forests has been widely recognized in an age of aggravating environmental pollution and biodiversity loss. The elicitation of non-market forest values with choice experiments has been studied intensively, for example for recreation (Saelen and Ericson, 2013; Juutinenab et al., 2014), carbon sequestration, biodiversity conservation and ecological services (Baranzini et al., 2012; Cerda, 2006; Cerda et al., 2014).

While the method has improved in nearly all its aspects, a large number of issues remain. In this review, we have focused on the theoretical and methodological issues that occur in choice experiments and that are continuously identified and discussed in the literature. From a theoretical point of view, the departure from classical utility theory might be the most important innovation in recent years. It will be interesting to see how this methodological framework is developed further and in what contexts it can and should be applied in the future.
However, issues that are endogenous to the choice process remain, no matter which decision rule is used. First of all, the incentive compatibility of hypothetical choice experiments has been challenged by a number of studies, and more research is required to establish whether valid and reliable welfare measures can be extracted from such studies. The notion of consequentiality (Vossler et al., 2012) provides an interesting starting point for the development of new elicitation techniques and their application. Further, it puts a stronger focus on policy implementation, which might often be neglected in environmental valuation studies. Alternatively, correction mechanisms for hypothetical responses in cases of overstated willingness-to-pay may be developed.
The body of work related to order effects suggests that severe order effects often exist in choice studies, which should not be neglected. While a number of causes for order effects have been discussed above, the key issue on how to systematically deal with these effects is still largely undetermined. For example, a lower WTP toward the end of a series of choice sets might be caused by institutional learning, by fatigue or by preference learning. This means that either the WTP stated in the first choice sets might be biased upward or the WTP from the last choice sets might be biased downward. Further research is required to identify the "true WTP" from possibly biased responses introduced by the design of the questionnaire.
The order effects issue is also tightly connected to attribute processing. While attribute processing in individual choice sets has been well researched, the question arises whether attribute processing strategies change over a series of choice sets. Further, while conceptual models have been developed to deal with attribute non-attendance and other issues, a major issue is still the identification of different processing strategies from the questionnaire. In particular, explicit (stated) and implicit (inferred) methods can be distinguished. In explicit methods, respondents are asked directly which attributes were not attended to. Insights from psychology might help to identify non-attended attributes, and Hensher's (2007) approach to using Dempster-Shafer belief functions could be further developed. Implicit methods, such as Hensher et al.'s (2012) latent class approach, seek to probabilistically separate respondents into classes that do not attend to one or several attributes, so that the contribution of these attributes to utility is zero. Overall, we see that while there is a large number of conceptual approaches to the modeling of attribute processing strategies, there is still a lack of knowledge on how to incorporate these approaches into empirical work and derive measurable conclusions for environmental valuation studies.
On the experimental design frontier, new designs have been developed that incorporate the non-linear nature of choice models into their efficiency measures, in particular the series of designs by Bliemer and Rose (2010), Bliemer et al. (2009) and Street and Burgess (2007). Pivot designs have further increased the flexibility for tailoring experimental designs to the specific (expected) situation. However, this comes at the cost of greater information requirements for picking the right design, and possibly negative consequences if a more sophisticated design is chosen that does not reflect the actual choice situation well. An interesting field of future research might be the optimal design of experiments that incorporate order effects into their optimality measures. This could also be combined with Kessels et al.'s (2011) approach to partial profiles, to generate designs that reduce the cognitive burden while at the same time reducing the probability of order effects.
Estimation methods are already very advanced, and new theoretical extensions (such as the random regret model) are quickly adapted to complex estimation procedures originally developed for random utility maximization. More and more studies, particularly in the revealed preference field, now consider endogeneity issues in their estimation. Sources of endogeneity in stated preference contexts have been identified, but the literature on estimation in this field is still very sparse. However, problems such as order effects or attribute non-attendance are also related to endogeneity when it comes to estimating utility functions. Therefore, more theoretical and empirical work is required on how endogeneity can play a role in stated preference surveys, and what the consequences of endogeneity are when estimating welfare effects.
Overall, even though the method has improved over the recent decades, many issues still remain in practical applications, as well as on a theoretical level. Further research into the handling of these issues is therefore required.