Start Date: 10/20/2019
Course Type: Common Course |
Course Link: https://www.coursera.org/learn/complete-reinforcement-learning-system
Explore 1600+ online courses from top universities. Join Coursera today to learn data science, programming, business strategy, and more.Milestone 1: Formalize Word Problem as MDP
Milestone 2: Choosing The Right Algorithm
Milestone 3: Identify Key Performance Parameters
Milestone 4: Implement Your Agent
Article | Example |
---|---|
Reinforcement learning | Successes of reinforcement learning are listed here. |
Reinforcement learning | The basic reinforcement learning model consists of: |
Reinforcement learning | Two components make reinforcement learning powerful: |
Reinforcement learning | There is also a growing interest in real life applications of reinforcement learning. |
Reinforcement learning | A reinforcement learning agent interacts with its environment in discrete time steps. |
Reinforcement learning | Multiagent or Distributed Reinforcement Learning is also a topic of interest in current research. |
Reinforcement learning | In machine learning, the environment is typically formulated as a Markov decision process (MDP), as many reinforcement learning algorithms for this context utilize dynamic programming techniques. The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not need knowledge about the MDP and they target large MDPs where exact methods become infeasible. |
Reinforcement learning | There are multiple applications of reinforcement learning to generate models and train them to play video games, such as Atari games. In these models, reinforcement learning finds the actions with the best reward at each play. This method is a widely used method in combination with deep neural networks to teach computers to play Atari video games. |
Reinforcement learning | Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs. |
Reinforcement learning | Reinforcement learning algorithms such as TD learning are also being investigated as a model for Dopamine-based learning in the brain. In this model, the dopaminergic projections from the substantia nigra to the basal ganglia function as the prediction error. Reinforcement learning has also been used as a part of the model for human skill learning, especially in relation to the interaction between implicit and explicit learning in skill acquisition (the first publication on this application was in 1995-1996, and there have been many follow-up studies). |
Reinforcement learning | Thanks to these two key components, reinforcement learning can be used in large environments in any of the following situations: |
Learning classifier system | Up until the 2000's nearly all learning classifier system methods were developed with reinforcement learning problems in mind. As a result, the term ‘learning classifier system’ was commonly defined as the combination of ‘trial-and-error’ reinforcement learning with the global search of a genetic algorithm. Interest in supervised learning applications, and even unsupervised learning have since broadened the use and definition of this term. |
Reinforcement learning | The first two of these problems could be considered planning problems (since some form of the model is available), while the last one could be considered as a genuine learning problem. However, under a reinforcement learning methodology both planning problems would be converted to machine learning problems. |
Reinforcement learning | The goal of a reinforcement learning agent is to collect as much reward as possible. The agent can choose any action as a function of the history and it can even randomize its action selection. |
Reinforcement learning | Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take "actions" in an "environment" so as to maximize some notion of cumulative "reward". The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics, and genetic algorithms. In the operations research and control literature, the field where reinforcement learning methods are studied is called "approximate dynamic programming". The problem has been studied in the theory of optimal control, though most studies are concerned with the existence of optimal solutions and their characterization, and not with the learning or approximation aspects. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality. |
Reinforcement learning | So far, the discussion was restricted to how policy iteration can be used as a basis of the designing reinforcement learning algorithms. Equally importantly, value iteration can also be used as a starting point, giving rise to the Q-Learning algorithm and its many variants. |
Reinforcement learning | Thus, reinforcement learning is particularly well-suited to problems which include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon, checkers and go (AlphaGo). |
Reinforcement learning | In reinforcement learning methods the expectations are approximated by averaging over samples and one uses function approximation techniques to cope with the need to represent value functions over large state-action spaces. |
Reinforcement | A great many researchers subsequently expanded our understanding of reinforcement and challenged some of Skinner's conclusions. For example, Azrin and Holz defined punishment as a “consequence of behavior that reduces the future probability of that behavior,” and some studies have shown that positive reinforcement and punishment are equally effective in modifying behavior. Research on the effects of positive reinforcement, negative reinforcement and punishment continue today as those concepts are fundamental to learning theory and apply to many practical applications of that theory. |
Learning classifier system | In 1995, Wilson published his landmark paper, "Classifier fitness based on accuracy" in which he introduced the classifier system XCS. XCS took the simplified architecture of ZCS and added an accuracy-based fitness, a niche GA (acting in the action set [A]), an explicit generalization mechanism called "subsumption", and an adaptation of the Q-Learning credit assignment. XCS was popularized by its ability to reach optimal performance while evolving accurate and maximally general classifiers as well as its impressive problem flexibility (able to perform both reinforcement learning and supervised learning) . XCS later became the best known and most studied LCS algorithm and defined a new family of "accuracy-based LCS". ZCS alternatively became synonymous with "strength-based LCS". XCS is also important, because it successfully bridged the gap between LCS and the field of reinforcement learning. Following the success of XCS, LCS were later described as reinforcement learning systems endowed with a generalization capability. Reinforcement learning typically seeks to learn a value function that maps out a complete representation of the state/action space. Similarly, the design of XCS drives it to form an all-inclusive and accurate representation of the problem space (i.e. a "complete map") rather than focusing on high payoff niches in the environment (as was the case with strength-based LCS). Conceptually, complete maps don't only capture what you should do, or what is correct, but also what you shouldn't do, or what's incorrect. Differently, most strength-based LCSs, or exclusively supervised learning LCSs seek a rule set of efficient generalizations in the form of a "best action map" (or a "partial map"). Comparisons between strength vs. accuracy-based fitness and complete vs. best action maps have since been examined in greater detail. |