Search results
Abstract
Purpose
The purpose of this paper is to provide an overview of the theoretical background and applications of inverse reinforcement learning (IRL).
Design/methodology/approach
Reinforcement learning (RL) techniques provide a powerful solution for sequential decision-making problems under uncertainty. RL uses an agent equipped with a reward function to find a policy through interactions with a dynamic environment. However, one major assumption of existing RL algorithms is that the reward function, the most succinct representation of the designer's intention, needs to be provided beforehand. In practice, the reward function can be very hard to specify and exhausting to tune for large and complex problems, and this inspires the development of IRL, an extension of RL, which tackles this problem directly by learning the reward function from expert demonstrations. In this paper, the original IRL algorithms and their close variants, as well as their recent advances, are reviewed and compared.
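The core idea, recovering a reward function under which the expert's demonstrated behaviour is optimal, can be sketched in miniature. The chain MDP, one-hot candidate rewards, and brute-force candidate search below are illustrative assumptions for exposition, not an algorithm from the surveyed literature:

```python
# Toy IRL by candidate search on a 3-state chain MDP (illustrative).
# States 0-2, actions: 0 = left, 1 = right. The expert is always
# observed moving right, so the inferred reward should favour state 2.
N_STATES, GAMMA = 3, 0.9

def step(s, a):
    return max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)

def greedy_policy(reward):
    # Value iteration for the candidate reward, then the greedy policy.
    V = [0.0] * N_STATES
    for _ in range(100):
        V = [max(reward[step(s, a)] + GAMMA * V[step(s, a)] for a in (0, 1))
             for s in range(N_STATES)]
    return [max((0, 1), key=lambda a: reward[step(s, a)] + GAMMA * V[step(s, a)])
            for s in range(N_STATES)]

# Expert demonstrations: (state, action) pairs, always moving right.
demos = [(0, 1), (1, 1), (0, 1), (1, 1)]

def irl(demos):
    # Try each one-hot candidate reward; keep the one whose optimal
    # policy agrees most with the expert's observed actions.
    best, best_score = None, -1
    for goal in range(N_STATES):
        reward = [1.0 if s == goal else 0.0 for s in range(N_STATES)]
        pi = greedy_policy(reward)
        score = sum(pi[s] == a for s, a in demos)
        if score > best_score:
            best, best_score = goal, score
    return best

print(irl(demos))  # → 2 (the rightmost state)
```

Real IRL algorithms replace the brute-force search with optimization over a parameterized reward class, but the agreement-with-demonstrations criterion is the same in spirit.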
Findings
This paper can serve as an introductory guide to the fundamental theory and developments of IRL, as well as its applications.
Originality/value
This paper surveys the theories and applications of IRL, a recent development of RL that has not previously been surveyed.
Abstract
Purpose
How can managers optimally distribute rewards among individuals in a job group? While the management literature on compensation has established the need for equitable reimbursements for individuals holding similar positions in a function or group, an objective grounding of rewards allocation has so far escaped scrutiny. This paper aims to address this issue.
Design/methodology/approach
Using an optimization model based on a financial rubric, the portfolio approach allows organizations to envision human capital assets as a set (i.e. a team, group, function), rather than independent contractors. The portfolio can be organized and managed for meeting various organizational objectives (e.g. optimizing returns and instrumental benefits, assessing resource allocations).
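As a rough illustration of the portfolio framing (the abstract does not give the actual optimization model, so the risk-adjusted weighting, names and figures below are invented for exposition), a bonus pool might be allocated across a "portfolio" of employees by a Sharpe-like performance-to-variance score:

```python
# Illustrative sketch only: allocate a bonus pool across employees in
# proportion to mean performance divided by its variance, echoing
# risk-adjusted weighting in financial portfolio theory.
POOL = 100_000.0

employees = {
    "A": {"mean_perf": 8.0, "var_perf": 4.0},
    "B": {"mean_perf": 6.0, "var_perf": 1.0},
    "C": {"mean_perf": 9.0, "var_perf": 9.0},
}

def allocate(pool, employees):
    # Score each employee, then split the pool pro rata.
    scores = {k: v["mean_perf"] / v["var_perf"] for k, v in employees.items()}
    total = sum(scores.values())
    return {k: pool * s / total for k, s in scores.items()}

rewards = allocate(POOL, employees)
```

Note how the consistent performer "B" receives the largest share despite a lower mean, which is exactly the variance-reducing behaviour a portfolio view encourages.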
Findings
This research introduces an innovative portfolio management scheme for employee rewards distribution. Akin to investing in capital assets, organizations invest considerable resources in their human capital. In doing so, organizations, over time, create a portfolio of human capital assets. The findings reduce large variances in rewards distribution while serving employee and management considerations.
Practical implications
The research has tremendous implications for managers who can mitigate serious equitable rewards distribution issues by creating a process that exemplifies rewards distribution using four different rewards allocation scenarios based on varying managerial prerogatives.
Originality/value
This research offers a unique model that addresses a pressing human resource issue with a solution based on a usable and feasible optimization mechanism from financial portfolio theory.
Kesha K. Coker, Deepa Pillai and Siva K. Balasubramanian
Abstract
Purpose
Rewards from sales promotions may be either immediate (e.g. instant savings, coupons, instant rebates) or delayed (e.g. rebates, refunds). The latter type is of interest in this study. The purpose of this paper is to present the hyperbolic discounting framework as an explanation for how consumers delay‐discount rewards, and test whether this holds for both high‐price and low‐price product categories.
Design/methodology/approach
Data were collected by administering two online surveys to respondents. One survey presented choice scenarios between sales promotion formats for a high‐priced product (a laptop, n=154) and the other for a low‐priced product (a cell phone, n=98). Hyperbolic and exponential functions were then fitted to the data.
Findings
The hyperbolic function had a better fit than the exponential function for the low-priced product. However, this effect was not evident in the case of the high-priced product; no significant difference was found between the functions. The rate of discounting was greater for the high-priced product than for the low-priced product. Thus, for low-priced products, rather than discounting a reward rationally, consumers tend to discount the value of the reward at a decreasing rate.
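The two standard discounting forms behind this comparison can be checked numerically; the discount parameter k below is illustrative, not an estimate from the paper's data:

```python
import math

# Standard hyperbolic and exponential delay-discounting functions.
def hyperbolic(amount, delay, k=0.1):
    return amount / (1 + k * delay)

def exponential(amount, delay, k=0.1):
    return amount * math.exp(-k * delay)

A = 100.0
# The hyperbolic function loses value quickly at short delays but more
# slowly at long ones, i.e. discounting at a decreasing rate...
hyp_drop_early = hyperbolic(A, 0) - hyperbolic(A, 1)    # ≈ 9.09
hyp_drop_late = hyperbolic(A, 10) - hyperbolic(A, 11)   # ≈ 2.38
# ...whereas the exponential per-period discount ratio is constant.
exp_ratio_early = exponential(A, 1) / exponential(A, 0)
exp_ratio_late = exponential(A, 11) / exponential(A, 10)
```

This decreasing-rate property is what gives marketers of low-priced products flexibility over longer reward delays.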
Originality/value
This study addresses delay discounting in the context of a typical consumer buying situation. It also addresses the possibility of consumers applying different forms of discounting to products at different price levels and tests for the same. The results are of considerable significance for marketers wishing to offer price discounts to consumers. For low‐priced products, marketers seem to have more flexibility in delaying the reward, since the rate of discounting decreases for longer delay periods. At the same time, the discount rate for high‐priced products is higher than that for low‐priced products, hence delay periods may have a more critical role as discounted values fall steeply with an increase in delay to reward.
Abstract
An information‐like formulation of the human reward function is shown to be in qualitative agreement with some prominent features of human behavior. Individual events are regarded as “symbols” in a communication theory sense, and their reward for a person depends on their frequency of occurrence in his environment.
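One minimal reading of such a formulation (an assumption for illustration; the paper's exact functional form is not given here) treats an event's reward as its Shannon surprisal, so reward falls as an event becomes more frequent in a person's environment:

```python
import math
from collections import Counter

# Treat each event as a "symbol" and reward it by its surprisal,
# -log2 p(event), with p estimated from observed frequencies.
events = ["coffee", "coffee", "coffee", "concert", "coffee", "raise"]
freq = Counter(events)
n = len(events)

def reward(event):
    # Rarer events carry more information-like reward.
    return -math.log2(freq[event] / n)

# reward("coffee") < reward("concert") == reward("raise")
```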
Abstract
Purpose
The two‐armed Bernoulli bandit (TABB) problem is a classical optimization problem where an agent sequentially pulls one of two arms attached to a gambling machine, with each pull resulting either in a reward or a penalty. The reward probabilities of each arm are unknown, and thus one must balance between exploiting existing knowledge about the arms, and obtaining new information. The purpose of this paper is to report research into a completely new family of solution schemes for the TABB problem: the Bayesian learning automaton (BLA) family.
Design/methodology/approach
Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. BLA avoids the problem of computational intractability by not explicitly performing the Bayesian computations. Rather, it is based upon merely counting rewards/penalties, combined with random sampling from a pair of twin Beta distributions. This is intuitively appealing since the Bayesian conjugate prior for a binomial parameter is the Beta distribution.
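The counting-plus-twin-Beta-sampling scheme described above is essentially what is now called Thompson sampling; a minimal sketch, with illustrative reward probabilities:

```python
import random

# Two-armed Bernoulli bandit solved by counting rewards/penalties and
# sampling from twin Beta posteriors (illustrative parameters).
random.seed(0)
TRUE_P = [0.3, 0.7]    # reward probabilities, unknown to the agent
successes = [0, 0]     # reward counts per arm
failures = [0, 0]      # penalty counts per arm
pulls = [0, 0]

for _ in range(5000):
    # Sample an estimate from each arm's Beta(successes+1, failures+1)
    # posterior and pull the arm with the higher sample.
    samples = [random.betavariate(successes[a] + 1, failures[a] + 1)
               for a in (0, 1)]
    arm = samples.index(max(samples))
    pulls[arm] += 1
    if random.random() < TRUE_P[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1
```

No learning-rate parameter appears anywhere, which matches the paper's claim that the scheme needs no external learning speed/accuracy control; after a few thousand rounds the better arm dominates the pull counts.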
Findings
The BLA is proven to be instantaneously self-correcting, and it converges to pulling only the optimal arm with probability as close to unity as desired. Extensive experiments demonstrate that the BLA does not rely on external learning speed/accuracy control. It also outperforms established non-Bayesian top performers for the TABB problem. Finally, the BLA provides superior performance in a distributed application, namely, the Goore game (GG).
Originality/value
The value of this paper is threefold. First, the reported BLA takes advantage of the Bayesian perspective for tackling TABBs, yet avoids the computational complexity inherent in Bayesian approaches. Second, the improved performance offered by the BLA opens up increased accuracy in a number of TABB-related applications, such as the GG. Third, the reported results form the basis for a new avenue of research – even for cases where the reward/penalty distribution is not Bernoulli. Indeed, the paper advocates the use of a Bayesian methodology in conjunction with the corresponding appropriate conjugate prior.
Jacqueline Gottlieb, Manuel Lopes and Pierre-Yves Oudeyer
Abstract
Based on a synthesis of findings from psychology, neuroscience, and machine learning, we propose a unified theory of curiosity as a form of motivated cognition. Curiosity, we propose, comprises a family of mechanisms that range in complexity from simple heuristics based on novelty, salience, or surprise, to drives based on reward and uncertainty reduction and finally, to self-directed metacognitive processes. These mechanisms, we propose, have evolved to allow agents to discover useful regularities in the world – steering them toward niches of maximal learning progress and away from both random and highly familiar tasks. We emphasize that curiosity arises organically in conjunction with cognition and motivation, being generated by cognitive processes and in turn, motivating them. We hope that this view will spur the systematic study of curiosity as an integral aspect of cognition and decision making during development and adulthood.
Abstract
Purpose
English original movies play an important role in English learning and communication. To help users find the movies they need from a large number of English original movies and reviews, this paper proposes an improved deep reinforcement learning algorithm for movie recommendation. Although conventional movie recommendation algorithms have addressed the problem of information overload, they still have limitations in cases of cold start and sparse data.
Design/methodology/approach
To solve the aforementioned problems of conventional movie recommendation algorithms, this paper proposes a recommendation algorithm based on the theory of deep reinforcement learning, which uses the deep deterministic policy gradient (DDPG) algorithm to solve the cold-start and sparse-data problems and uses Item2vec to transform the discrete action space into a continuous one. Meanwhile, a reward function combining cosine distance and Euclidean distance is proposed to ensure that the neural network does not converge prematurely to a local optimum.
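The abstract does not give the exact combination of the two distances, so the equal weighting and the exp(-d) squashing below are assumptions; a sketch of such a combined reward over embedding vectors:

```python
import math

# Hypothetical combined reward between an item embedding and a user
# preference vector: cosine similarity captures direction agreement,
# a squashed Euclidean term adds magnitude-aware closeness.
def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def reward(user_vec, item_vec, alpha=0.5):
    # alpha weights the two terms; 0.5 is an arbitrary choice here.
    return (alpha * cosine_sim(user_vec, item_vec)
            + (1 - alpha) * math.exp(-euclidean(user_vec, item_vec)))

close = reward([1.0, 0.0], [0.9, 0.1])   # item near the user's tastes
far = reward([1.0, 0.0], [-1.0, 0.0])    # item opposed to them
```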
Findings
To verify the feasibility and validity of the proposed algorithm, the state of the art and the proposed algorithm were compared in terms of RMSE, recall rate and accuracy in experiments on the MovieLens English original movie data set. Experimental results show that the proposed algorithm is superior to the conventional algorithms on all of these indicators.
Originality/value
Applying the proposed algorithm to recommend English original movies, the DDPG policy produces better recommendation results and alleviates the impact of cold start and sparse data.
Abstract
We reflect upon the histories of the behavioral science and the neuroscience of motivation, taking note of how these increasingly consilient disciplines inform each other. This volume’s chapters illustrate how the field has moved beyond the study of immediate external rewards to the examination of neural mechanisms underlying varied motivational and appetitive states. Exemplifying this trend, we focus on emerging knowledge about intrinsic motivation, linking it with research on both the play and exploratory behaviors of nonhuman animals. We also speculate about large-scale brain networks related to salience processing as a possibly unique component of human intrinsic motivation. We further review emerging studies on neural correlates of basic psychological needs during decision making that are beginning to shine light on the integrative processes that support autonomous functioning. As with the contributions in this volume, such research reflects the increasing iteration between mechanistic studies and contemporary psychological models of human motivation.
Jonathan Chapman and Clare Kelliher
Abstract
Purpose
Reward research has focussed on level (what individuals are paid) and structure (relationship between different levels of reward). Less emphasis has been given to reward mix decisions, i.e. the relative proportions of each element making up overall reward. This paper seeks to examine the determinants of reward mix.
Design/methodology/approach
Interview-based research with reward consultants, as key organisational observers and participants in reward mix decision making.
Findings
Benchmarking has led to the development of reward mix norms. Organisations are under pressure to conform to these norms, moderated by leadership beliefs, the occurrence of events and the extent to which organisations' change capability can overcome strong institutional forces.
Research limitations/implications
The results question agency theory based explanations of reward mix determination and point towards resource dependence and institutional theory perspectives being more suitable theoretical frameworks.
Practical implications
The model developed allows reward managers to consider how the moderating variables of the dominant mimetic pressure they face could be manipulated to allow greater differentiation of their firm's reward mix.
Originality/value
Academically, the work contributes to a programme of research into reward determination from a constructionist perspective and aims to provide greater theoretical robustness to the subject. Practically, the findings may prompt practitioners to think more consciously about the drivers of their firm's reward mix. Policy makers may use the stronger theoretical base for understanding the determinants of reward mix choices and the extent to which organisational free choice and institutionally determined choice influence final choices in reward policy decision making.
Alagar Rangan, Dimple Thyagarajan and Y Sarada
Abstract
Purpose
The purpose of this paper is to generalize Yeh and Zhang's 2004 random threshold failure model for deteriorating systems.
Design/methodology/approach
An N‐policy was adopted by which the system was replaced after the Nth failure.
Findings
The model was found to have practical applications in warranty cost analysis.
Originality/value
By identifying the occurrence of a shock with the failure of the system, taking the threshold times as the warranty period offered, and redefining a lethal shock (here, system failure) as the occurrence of a shock within a threshold period, the generalized model can be used to study renewing warranty cost analysis.
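Under one simplified reading of this setup (exponential inter-failure times, a fixed renewing warranty period, replacement after the Nth failure; all parameters invented for illustration), the seller's expected warranty cost per replacement cycle can be estimated by Monte Carlo:

```python
import random

# Monte Carlo sketch of an N-policy with a renewing warranty: each
# failure arriving within the warranty period of the previous one
# triggers a claim; the system is replaced after the Nth failure.
random.seed(1)
RATE, WARRANTY, N, CLAIM_COST, RUNS = 0.5, 1.0, 5, 100.0, 2000

def cycle_warranty_cost():
    # Cost accumulated by the seller over one replacement cycle.
    cost = 0.0
    for _ in range(N):
        gap = random.expovariate(RATE)  # time since the last failure
        if gap <= WARRANTY:             # failure inside renewed warranty
            cost += CLAIM_COST
    return cost

avg_cost = sum(cycle_warranty_cost() for _ in range(RUNS)) / RUNS
```

With these (illustrative) parameters each failure falls inside the warranty with probability 1 - e^(-RATE·WARRANTY) ≈ 0.39, so the average cycle cost should sit near N times that fraction of the claim cost.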