articles+ search results
418 articles+ results

Janz, David, Litvak, Alexander E., and Szepesvári, Csaba
 Subjects

Statistics - Machine Learning and Computer Science - Machine Learning
 Abstract

We provide the first useful, rigorous analysis of ensemble sampling for the stochastic linear bandit setting. In particular, we show that, under standard assumptions, for a $d$-dimensional stochastic linear bandit with an interaction horizon $T$, ensemble sampling with an ensemble of size $m$ on the order of $d \log T$ incurs regret bounded by order $(d \log T)^{5/2} \sqrt{T}$. Ours is the first result in any structured setting not to require the size of the ensemble to scale linearly with $T$ (which defeats the purpose of ensemble sampling) while obtaining near $\sqrt{T}$-order regret. Ours is also the first result that allows infinite action sets.
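As a rough illustration of the algorithm being analysed, here is a minimal ensemble-sampling sketch for a toy linear bandit. The perturbation scales, ridge regulariser, and action set are illustrative choices, not the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, T = 2, 8, 500                       # dimension, ensemble size, horizon
theta_star = np.array([1.0, 0.2])         # unknown true parameter (toy)
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])

lam = 1.0
V = lam * np.eye(d)                       # shared ridge Gram matrix
b = np.zeros((m, d))                      # per-member perturbed reward sums
prior = rng.normal(size=(m, d))           # one-off perturbation per member

for t in range(T):
    j = rng.integers(m)                   # draw an ensemble member uniformly
    theta_j = np.linalg.solve(V, b[j] + prior[j])
    a = actions[np.argmax(actions @ theta_j)]   # act greedily for that member
    r = float(a @ theta_star) + rng.normal(scale=0.1)
    V += np.outer(a, a)
    # every member sees the observed reward plus its own fresh noise
    b += a * (r + rng.normal(scale=0.1, size=(m, 1)))

theta_hat = np.linalg.solve(V, b.mean(axis=0))   # aggregate estimate
```

The key point the abstract makes is that $m$ can stay as small as order $d \log T$; here a constant $m = 8$ already suffices on this toy instance.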

Janz, David, Liu, Shuai, Ayoub, Alex, and Szepesvári, Csaba
 Subjects

Computer Science - Machine Learning and Statistics - Machine Learning
 Abstract

We introduce exploration via linear loss perturbations (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, EVILL reduces to perturbed-history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. With the data-dependent perturbations we propose, not present in previous PHE-type methods, EVILL is shown to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside of generalised linear bandits where PHE leads to inconsistent estimates, and thus linear regret, while EVILL remains performant. Like PHE, EVILL can be implemented in just a few lines of code.
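For the linear special case, the PHE principle the abstract refers to really is a few lines: refit ridge regression on the rewards plus fresh Gaussian perturbations each round. The perturbation scale and warm start below are illustrative assumptions, not the paper's tuned choices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, noise = 2, 400, 0.1
theta_star = np.array([0.8, -0.3])        # unknown true parameter (toy)
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]])

lam = 1.0
X, y = [], []                             # interaction history
for t in range(T):
    if t < len(actions):                  # play each arm once to warm-start
        a = actions[t]
    else:
        Xa, ya = np.asarray(X), np.asarray(y)
        z = rng.normal(scale=noise, size=len(ya))   # fresh reward perturbations
        G = Xa.T @ Xa + lam * np.eye(d)
        theta_pert = np.linalg.solve(G, Xa.T @ (ya + z))   # perturbed fit
        a = actions[np.argmax(actions @ theta_pert)]       # greedy action
    X.append(a)
    y.append(float(a @ theta_star) + rng.normal(scale=noise))

Xa, ya = np.asarray(X), np.asarray(y)
theta_hat = np.linalg.solve(Xa.T @ Xa + lam * np.eye(d), Xa.T @ ya)
```

Randomness enters only through the perturbed targets `ya + z`, which is exactly what makes PHE so easy to bolt onto an existing regression pipeline.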

Lin, Jihao Andreas, Padhy, Shreyas, Antorán, Javier, Tripp, Austin, Terenin, Alexander, Szepesvári, Csaba, Hernández-Lobato, José Miguel, and Janz, David
 Subjects

Computer Science - Machine Learning and Statistics - Machine Learning
 Abstract

We study the optimisation problem associated with Gaussian process regression using squared loss. The most common approach to this problem is to apply an exact solver, such as conjugate gradient descent, either directly, or to a reduced-order version of the problem. Recently, driven by successes in deep learning, stochastic gradient descent has gained traction as an alternative. In this paper, we show that when done right (by which we mean using specific insights from the optimisation and kernel communities) this approach is highly effective. We thus introduce a particular stochastic dual gradient descent algorithm that may be implemented with a few lines of code using any deep learning framework. We explain our design decisions by illustrating their advantage against alternatives with ablation studies and show that the new method is highly competitive. Our evaluations on standard regression benchmarks and a Bayesian optimisation task set our approach apart from preconditioned conjugate gradients, variational Gaussian process approximations, and a previous version of stochastic gradient descent for Gaussian processes. On a molecular binding affinity prediction task, our method places Gaussian process regression on par in terms of performance with state-of-the-art graph neural networks.
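The dual view is easy to sketch in NumPy. Below, plain randomised coordinate descent stands in for the paper's stochastic dual gradient method (the kernel, sizes, and update rule are illustrative assumptions); both attack the same dual objective whose minimiser gives the GP posterior mean coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, steps = 80, 0.05, 6000

x = np.sort(rng.uniform(-3, 3, n))
y = np.sin(x) + rng.normal(scale=np.sqrt(sigma2), size=n)

K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # RBF kernel matrix
A = K + sigma2 * np.eye(n)                          # regularised kernel system

# Randomised coordinate descent on the dual objective
#   L(alpha) = 1/2 alpha^T A alpha - alpha^T y,
# whose minimiser solves A alpha = y.
alpha = np.zeros(n)
for _ in range(steps):
    i = rng.integers(n)
    # exact minimisation of L along coordinate i (monotone decrease)
    alpha[i] += (y[i] - A[i] @ alpha) / A[i, i]

alpha_exact = np.linalg.solve(A, y)
obj = 0.5 * alpha @ A @ alpha - alpha @ y
obj_exact = 0.5 * alpha_exact @ A @ alpha_exact - alpha_exact @ y
```

Each step touches one row of `A`, so no $O(n^3)$ solve is ever formed; the paper's contribution is, in essence, how to make this kind of stochastic dual iteration fast and robust at scale.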

Weisz, Gellért, György, András, and Szepesvári, Csaba
 Subjects

Computer Science - Machine Learning and Statistics - Machine Learning
 Abstract

We consider online reinforcement learning (RL) in episodic Markov decision processes (MDPs) under the linear $q^\pi$-realizability assumption, where it is assumed that the action-values of all policies can be expressed as linear functions of state-action features. This class is known to be more general than linear MDPs, where the transition kernel and the reward function are assumed to be linear functions of the feature vectors. As our first contribution, we show that the difference between the two classes is the presence of states in linearly $q^\pi$-realizable MDPs where, for any policy, all the actions have approximately equal values; skipping over these states by following an arbitrarily fixed policy in those states transforms the problem into a linear MDP. Based on this observation, we derive a novel (computationally inefficient) learning algorithm for linearly $q^\pi$-realizable MDPs that simultaneously learns which states should be skipped over and runs another learning algorithm on the linear MDP hidden in the problem. The method returns an $\epsilon$-optimal policy after $\text{polylog}(H, d)/\epsilon^2$ interactions with the MDP, where $H$ is the time horizon and $d$ is the dimension of the feature vectors, giving the first polynomial-sample-complexity online RL algorithm for this setting. The results are proved for the misspecified case, where the sample complexity is shown to degrade gracefully with the misspecification error.

Amortila, Philip, Jiang, Nan, and Szepesvári, Csaba
 Subjects

Computer Science - Machine Learning, Computer Science - Artificial Intelligence, and Statistics - Machine Learning
 Abstract

Theoretical guarantees in reinforcement learning (RL) are known to suffer multiplicative blow-up factors with respect to the misspecification error of function approximation. Yet, the nature of such \emph{approximation factors} (especially their optimal form in a given learning problem) is poorly understood. In this paper we study this question in linear off-policy value function estimation, where many open questions remain. We study the approximation factor in a broad spectrum of settings, such as with the weighted $L_2$-norm (where the weighting is the offline state distribution), the $L_\infty$ norm, the presence vs. absence of state aliasing, and full vs. partial coverage of the state space. We establish the optimal asymptotic approximation factors (up to constants) for all of these settings. In particular, our bounds identify two instance-dependent factors for the $L_2(\mu)$ norm and only one for the $L_\infty$ norm, which are shown to dictate the hardness of off-policy evaluation under misspecification.
Comment: Accepted to ICML 2023. The arXiv version contains improved results
6. Context-lumpable stochastic bandits [2023]

Lee, Chung-Wei, Liu, Qinghua, Abbasi-Yadkori, Yasin, Jin, Chi, Lattimore, Tor, and Szepesvári, Csaba
 Subjects

Computer Science - Machine Learning
 Abstract

We consider a contextual bandit problem with $S$ contexts and $A$ actions. In each round $t=1,2,\dots$ the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into $r\le \min\{S,A\}$ groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde O(r(S+A)/\epsilon^2)$ samples with high probability and provide a matching $\widetilde\Omega(r(S+A)/\epsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde O(\sqrt{r^3(S+A)T})$. To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and $\widetilde O(\sqrt{\mathrm{poly}(r)(S+A)T})$ minimax regret in the online setting for this problem. We also show that our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios.

Kitamura, Toshinori, Kozuno, Tadashi, Tang, Yunhao, Vieillard, Nino, Valko, Michal, Yang, Wenhao, Mei, Jincheng, Ménard, Pierre, Azar, Mohammad Gheshlaghi, Munos, Rémi, Pietquin, Olivier, Geist, Matthieu, Szepesvári, Csaba, Kumagai, Wataru, and Matsuo, Yutaka
 Subjects

Computer Science - Machine Learning
 Abstract

Mirror descent value iteration (MDVI), an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL), has served as the basis for recent high-performing practical RL algorithms. However, despite the use of function approximation in practice, the theoretical understanding of MDVI has been limited to tabular Markov decision processes (MDPs). We study MDVI with linear function approximation through its sample complexity required to identify an $\varepsilon$-optimal policy with probability $1-\delta$ under the settings of an infinite-horizon linear MDP, generative model, and G-optimal design. We demonstrate that least-squares regression weighted by the variance of an estimated optimal value function of the next state is crucial to achieving minimax optimality. Based on this observation, we present Variance-Weighted Least-Squares MDVI (VWLS-MDVI), the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs. Furthermore, we propose a practical VWLS algorithm for value-based deep RL, Deep Variance Weighting (DVW). Our experiments demonstrate that DVW improves the performance of popular value-based deep RL algorithms on a set of MinAtar benchmarks.
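The variance-weighted regression at the heart of the method is simple to sketch. Here the per-sample variances are given rather than estimated (in the paper they come from an estimated value function of the next state), and the variance floor and regulariser are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 3
Phi = rng.normal(size=(n, d))             # feature matrix
theta_star = np.array([1.0, -2.0, 0.5])   # ground-truth weights (toy)
# heteroscedastic noise: variance differs per sample, standing in for the
# variance of the estimated next-state value function
var = rng.uniform(0.05, 2.0, size=n)
z = Phi @ theta_star + rng.normal(scale=np.sqrt(var))

def weighted_ls(Phi, z, var, lam=1e-3, floor=1e-2):
    # weight each sample by its inverse (floored) variance
    w = 1.0 / np.maximum(var, floor)
    G = (Phi * w[:, None]).T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(G, (Phi * w[:, None]).T @ z)

theta_w = weighted_ls(Phi, z, var)                                   # weighted
theta_u = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(d), Phi.T @ z) # unweighted
```

Down-weighting high-variance targets is what lets the analysis shave the extra horizon factors that plain least-squares regression incurs.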
Comment: ICML 2023 accepted
8. Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL [2023]

Liu, Qinghua, Weisz, Gellért, György, András, Jin, Chi, and Szepesvári, Csaba
 Subjects

Computer Science - Machine Learning and Statistics - Machine Learning
 Abstract

While policy optimization algorithms have played an important role in the recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited: existing results are either restricted to tabular MDPs or suffer from highly suboptimal sample complexity, especially in online RL, where exploration is necessary. This paper proposes a simple, efficient policy optimization framework, Optimistic NPG, for online RL. Optimistic NPG can be viewed as simply combining the classic natural policy gradient (NPG) algorithm [Kakade, 2001] with optimistic policy evaluation subroutines to encourage exploration. For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient, and learns an $\varepsilon$-optimal policy within $\tilde{O}(d^2/\varepsilon^3)$ samples, making it the first computationally efficient algorithm whose sample complexity has the optimal dimension dependence $\tilde{\Theta}(d^2)$. It also improves over state-of-the-art results for policy optimization algorithms [Zanette et al., 2021] by a factor of $d$. For general function approximation, which subsumes linear MDPs, Optimistic NPG is, to our best knowledge, also the first policy optimization algorithm that achieves polynomial sample complexity for learning near-optimal policies.
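In its simplest single-state (bandit) form, the optimism-plus-NPG template looks as follows. The UCB-style bonus and step size are illustrative stand-ins for the paper's optimistic policy evaluation subroutine, not its actual construction:

```python
import math
import random

random.seed(4)
means = [0.2, 0.5, 0.8]              # true mean rewards; a bandit is a 1-state MDP
K, eta, T = len(means), 0.5, 2000

pi = [1.0 / K] * K                   # softmax policy over actions
counts, sums = [0] * K, [0.0] * K

for t in range(1, T + 1):
    a = random.choices(range(K), weights=pi)[0]
    r = means[a] + random.gauss(0, 0.1)
    counts[a] += 1
    sums[a] += r
    # optimistic evaluation: empirical mean plus an exploration bonus
    q = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
         if counts[i] else 1.0 for i in range(K)]
    # NPG / soft policy iteration step: pi <- pi * exp(eta * q), renormalised
    w = [p * math.exp(eta * qi) for p, qi in zip(pi, q)]
    s = sum(w)
    pi = [x / s for x in w]
```

The multiplicative-weights form of the update is exactly the closed-form NPG step for softmax policies; optimism enters only through the evaluation `q`.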

Kane, Daniel, Liu, Sihan, Lovett, Shachar, Mahajan, Gaurav, Szepesvári, Csaba, and Weisz, Gellért
 Subjects

Computer Science - Machine Learning, Computer Science - Artificial Intelligence, and Computer Science - Computational Complexity
 Abstract

A fundamental question in reinforcement learning theory is: supposing the optimal value functions are linear in given features, can we learn them efficiently? This problem's counterpart in supervised learning, linear regression, can be solved both statistically and computationally efficiently. Therefore, it was quite surprising when a recent work \cite{kane2022computational} showed a computational-statistical gap for linear reinforcement learning: even though there are polynomial sample-complexity algorithms, unless NP = RP, there are no polynomial-time algorithms for this setting. In this work, we build on their result to show a computational lower bound, which is exponential in the feature dimension and horizon, for linear reinforcement learning under the Randomized Exponential Time Hypothesis. To prove this, we build a round-based game where in each round the learner searches for an unknown vector in a unit hypercube. The rewards in this game are chosen such that if the learner achieves large reward, then the learner's actions can be used to simulate solving a variant of 3-SAT, where (a) each variable shows up in a bounded number of clauses and (b) if an instance has no solutions then it also has no solutions that satisfy more than a $(1-\epsilon)$-fraction of clauses. We use standard reductions to show that this 3-SAT variant is approximately as hard as 3-SAT. Finally, we also show a lower bound optimized for horizon dependence that almost matches the best known upper bound of $\exp(\sqrt{H})$.

Tkachuk, Volodymyr, Bakhtiari, Seyed Alireza, Kirschner, Johannes, Jusup, Matej, Bogunovic, Ilija, and Szepesvári, Csaba
 Subjects

Computer Science - Machine Learning, Computer Science - Artificial Intelligence, and Statistics - Machine Learning
 Abstract

A practical challenge in reinforcement learning is combinatorial action spaces, which make planning computationally demanding. For example, in cooperative multi-agent reinforcement learning, a potentially large number of agents jointly optimize a global reward function, which leads to a combinatorial blow-up in the action space with the number of agents. As a minimal requirement, we assume access to an argmax oracle that allows us to efficiently compute the greedy policy for any Q-function in the model class. Building on recent work in planning with local access to a simulator and linear function approximation, we propose efficient algorithms for this setting that lead to polynomial compute and query complexity in all relevant problem parameters. For the special case where the feature decomposition is additive, we further improve the bounds and extend the results to the kernelized setting with an efficient algorithm.
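For the additive special case mentioned at the end, the argmax oracle is just a per-agent maximisation. A toy check against brute force, with hypothetical Q-values:

```python
from itertools import product

# Hypothetical additive decomposition for 3 agents with 4 actions each:
# Q(s, (a1, a2, a3)) = q[0][a1] + q[1][a2] + q[2][a3]
q = [[0.1, 0.7, 0.2, 0.0],
     [0.5, 0.4, 0.9, 0.1],
     [0.3, 0.3, 0.2, 0.8]]

def greedy_additive(q):
    # argmax oracle: decomposes over agents, cost O(sum_i |A_i|)
    return tuple(max(range(len(qi)), key=qi.__getitem__) for qi in q)

def greedy_bruteforce(q):
    # exhaustive search over the joint action space, cost O(prod_i |A_i|)
    joint = product(*(range(len(qi)) for qi in q))
    return max(joint, key=lambda a: sum(qi[ai] for qi, ai in zip(q, a)))

joint_greedy = greedy_additive(q)
```

With $m$ agents and $A$ actions each, the oracle cost drops from $A^m$ joint evaluations to $mA$ per-agent ones, which is what makes planning tractable in this regime.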

Yin, Dong, Thiagarajan, Sridhar, Lazic, Nevena, Rajaraman, Nived, Hao, Botao, and Szepesvari, Csaba
 Subjects

Computer Science - Machine Learning and Computer Science - Artificial Intelligence
 Abstract

The focus of this work is sample-efficient deep reinforcement learning (RL) with a simulator. One useful property of simulators is that it is typically easy to reset the environment to a previously observed state. We propose an algorithmic framework, named uncertainty-first local planning (UFLP), that takes advantage of this property. Concretely, in each data collection iteration, with some probability, our meta-algorithm resets the environment to an observed state which has high uncertainty, instead of sampling according to the initial-state distribution. The agent-environment interaction then proceeds as in the standard online RL setting. We demonstrate that this simple procedure can dramatically improve the sample cost of several baseline RL algorithms on difficult exploration tasks. Notably, with our framework, we can achieve super-human performance on the notoriously hard Atari game Montezuma's Revenge with a simple (distributional) double DQN. Our work can be seen as an efficient approximate implementation of an existing algorithm with theoretical guarantees, which offers an interpretation of the positive empirical results.
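The reset rule itself is a few lines. In this minimal sketch, visit counts stand in for the uncertainty measure and a uniformly random actor stands in for the base RL agent; the chain environment and all constants are toy assumptions:

```python
import random

random.seed(5)

# Toy chain MDP: states 0..N, actions move left/right, goal at state N.
N, T, p_reset = 12, 3000, 0.5
visits = {0: 0}          # observed states with visit counts (uncertainty proxy)
reached_goal = 0

state = 0
for t in range(T):
    # UFLP-style reset: with probability p_reset, restart from the
    # least-visited observed state instead of the initial state.
    if random.random() < p_reset:
        state = min(visits, key=visits.get)
    a = random.choice([0, 1])                      # placeholder base agent
    state = max(0, min(N, state + (1 if a else -1)))
    visits[state] = visits.get(state, 0) + 1
    if state == N:                                 # episode ends at the goal
        reached_goal += 1
        state = 0
```

Resetting to the frontier of rarely visited states is what turns the simulator's cheap-reset property into an exploration mechanism; any uncertainty estimate (ensemble disagreement, pseudo-counts) can replace the raw visit counts.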
Comment: 25 pages, 11 figures

Mei, Jincheng, Chung, Wesley, Thomas, Valentin, Dai, Bo, Szepesvari, Csaba, and Schuurmans, Dale
 Subjects

Computer Science - Machine Learning and Computer Science - Artificial Intelligence
 Abstract

We study the effect of baselines in on-policy stochastic policy gradient optimization, and close the gap between the theory and practice of policy optimization methods. Our first contribution is to show that the \emph{state value} baseline allows on-policy stochastic \emph{natural} policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate, which was not previously known. The analysis relies on two novel findings: the expected progress of the NPG update satisfies a stochastic version of the non-uniform \L{}ojasiewicz (N\L{}) inequality, and with probability 1 the state value baseline prevents the optimal action's probability from vanishing, thus ensuring sufficient exploration. Importantly, these results provide a new understanding of the role of baselines in stochastic policy gradient: by showing that the variance of natural policy gradient estimates remains unbounded with or without a baseline, we find that variance reduction \emph{cannot} explain their utility in this setting. Instead, the analysis reveals that the primary effect of the value baseline is to \textbf{reduce the aggressiveness of the updates} rather than their variance. That is, we demonstrate that a finite variance is \emph{not necessary} for almost sure convergence of stochastic NPG, while controlling update aggressiveness is both necessary and sufficient. Additional experimental results verify these theoretical findings.
Comment: 55 pages; published at NeurIPS 2022
13. Learning Lipschitz Functions by GD-trained Shallow Overparameterized ReLU Neural Networks [2022]

Kuzborskij, Ilja and Szepesvári, Csaba
 Subjects

Computer Science - Machine Learning and Statistics - Machine Learning
 Abstract

We explore the ability of overparameterized shallow ReLU neural networks to learn Lipschitz, non-differentiable, bounded functions with additive noise when trained by Gradient Descent (GD). To avoid the problem that, in the presence of noise, neural networks trained to nearly zero training error are inconsistent on this class, we focus on early-stopped GD, which allows us to show consistency and optimal rates. In particular, we explore this problem from the viewpoint of the Neural Tangent Kernel (NTK) approximation of a GD-trained finite-width neural network. We show that whenever some early stopping rule is guaranteed to give an optimal rate (of excess risk) on the Hilbert space of the kernel induced by the ReLU activation function, the same rule can be used to achieve a minimax optimal rate for learning on the class of considered Lipschitz functions by neural networks. We discuss several data-free and data-dependent practically appealing stopping rules that yield optimal rates.

Zhao, Yao, Stephens, Connor James, Szepesvári, Csaba, and Jun, Kwang-Sung
 Subjects

Computer Science - Machine Learning
 Abstract

Simple regret is a natural and parameter-free performance criterion for pure exploration in multi-armed bandits, yet it is less popular than the probability of missing the best arm or an $\epsilon$-good arm, perhaps due to the lack of easy ways to characterize it. In this paper, we make significant progress on minimizing simple regret in both the data-rich ($T\ge n$) and data-poor ($T \le n$) regimes, where $n$ is the number of arms and $T$ is the number of samples. At its heart is our improved instance-dependent analysis of the well-known Sequential Halving (SH) algorithm, where we bound the probability of returning an arm whose mean reward is not within $\epsilon$ of the best (i.e., not $\epsilon$-good) for \textit{any} choice of $\epsilon>0$, although $\epsilon$ is not an input to SH. Our bound not only leads to an optimal worst-case simple regret bound of $\sqrt{n/T}$ up to logarithmic factors but also essentially matches the instance-dependent lower bound for returning an $\epsilon$-good arm reported by Katz-Samuels and Jamieson (2020). For the more challenging data-poor regime, we propose Bracketing SH (BSH) that enjoys the same improvement even without sampling each arm at least once. Our empirical study shows that BSH outperforms existing methods on real-world tasks.
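For reference, the Sequential Halving routine that the analysis targets can be sketched in a few lines; the Gaussian reward model and the even budget split across rounds are illustrative conventions:

```python
import math
import random

random.seed(6)

def sequential_halving(means, budget):
    """Pure-exploration bandit: repeatedly halve the candidate set,
    splitting the sampling budget evenly across ceil(log2 n) rounds."""
    arms = list(range(len(means)))
    rounds = math.ceil(math.log2(len(arms)))
    for _ in range(rounds):
        pulls = max(1, budget // (len(arms) * rounds))
        est = {a: sum(random.gauss(means[a], 1.0) for _ in range(pulls)) / pulls
               for a in arms}
        arms.sort(key=lambda a: est[a], reverse=True)
        arms = arms[:max(1, len(arms) // 2)]   # keep the better half
    return arms[0]

means = [0.1 * i for i in range(16)]    # best arm is index 15
best = sequential_halving(means, budget=16000)
```

Note that no accuracy parameter $\epsilon$ appears anywhere in the routine, which is precisely why the paper's any-$\epsilon$ guarantee for SH is notable.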
15. Confident Approximate Policy Iteration for Efficient Local Planning in $q^\pi$-realizable MDPs [2022]

Weisz, Gellért, György, András, Kozuno, Tadashi, and Szepesvári, Csaba
 Subjects

Computer Science - Machine Learning, Computer Science - Artificial Intelligence, and Statistics - Machine Learning
 Abstract

We consider approximate dynamic programming in $\gamma$-discounted Markov decision processes and apply it to approximate planning with linear value-function approximation. Our first contribution is a new variant of Approximate Policy Iteration (API), called Confident Approximate Policy Iteration (CAPI), which computes a deterministic stationary policy with an optimal error bound scaling linearly with the product of the effective horizon $H$ and the worst-case approximation error $\epsilon$ of the action-value functions of stationary policies. This improvement over API (whose error scales with $H^2$) comes at the price of an $H$-fold increase in memory cost. Unlike Scherrer and Lesner [2012], who recommended computing a non-stationary policy to achieve a similar improvement (with the same memory overhead), we are able to stick to stationary policies. This allows for our second contribution, the application of CAPI to planning with local access to a simulator and $d$-dimensional linear function approximation. As such, we design a planning algorithm that applies CAPI to obtain a sequence of policies with successively refined accuracies on a dynamically evolving set of states. The algorithm outputs an $\tilde O(\sqrt{d}H\epsilon)$-optimal policy after issuing $\tilde O(dH^4/\epsilon^2)$ queries to the simulator, simultaneously achieving the optimal accuracy bound and the best known query complexity bound, while earlier algorithms in the literature achieve only one of them. This query complexity is shown to be tight in all parameters except $H$. These improvements come at the expense of a mild (polynomial) increase in the memory and computational costs of both the algorithm and its output policy.

Liu, Qinghua, Netrapalli, Praneeth, Szepesvári, Csaba, and Jin, Chi
 Subjects

Computer Science - Machine Learning, Computer Science - Artificial Intelligence, and Statistics - Machine Learning
 Abstract

This paper introduces a simple, efficient learning algorithm for general sequential decision making. The algorithm combines Optimism for exploration with Maximum Likelihood Estimation for model estimation, and is thus named OMLE. We prove that OMLE learns near-optimal policies for an enormously rich class of sequential decision making problems in a polynomial number of samples. This rich class includes not only a majority of known tractable model-based Reinforcement Learning (RL) problems (such as tabular MDPs, factored MDPs, low witness rank problems, tabular weakly-revealing/observable POMDPs and multi-step decodable POMDPs), but also many new challenging RL problems, especially in the partially observable setting, that were not previously known to be tractable. Notably, the new problems addressed by this paper include (1) observable POMDPs with continuous observation and function approximation, where we achieve the first sample complexity that is completely independent of the size of the observation space; (2) well-conditioned low-rank sequential decision making problems (also known as Predictive State Representations (PSRs)), which include and generalize all known tractable POMDP examples under a more intrinsic representation; (3) general sequential decision making problems under the SAIL condition, which unifies our existing understanding of model-based RL in both fully observable and partially observable settings. The SAIL condition is identified by this paper and can be viewed as a natural generalization of Bellman/witness rank to address partial observability. This paper also presents a reward-free variant of the OMLE algorithm, which learns approximate dynamic models that enable the computation of near-optimal policies for all reward functions simultaneously.

Vaswani, Sharan, Yang, Lin F., and Szepesvári, Csaba
 Subjects

Computer Science - Machine Learning and Statistics - Machine Learning
 Abstract

In contrast to the advances in characterizing the sample complexity for solving Markov decision processes (MDPs), the optimal statistical complexity for solving constrained MDPs (CMDPs) remains unknown. We resolve this question by providing minimax upper and lower bounds on the sample complexity for learning near-optimal policies in a discounted CMDP with access to a generative model (simulator). In particular, we design a model-based algorithm that addresses two settings: (i) relaxed feasibility, where small constraint violations are allowed, and (ii) strict feasibility, where the output policy is required to satisfy the constraint. For (i), we prove that our algorithm returns an $\epsilon$-optimal policy with probability $1 - \delta$, by making $\tilde{O}\left(\frac{S A \log(1/\delta)}{(1 - \gamma)^3 \epsilon^2}\right)$ queries to the generative model, thus matching the sample complexity for unconstrained MDPs. For (ii), we show that the algorithm's sample complexity is upper-bounded by $\tilde{O} \left(\frac{S A \, \log(1/\delta)}{(1 - \gamma)^5 \, \epsilon^2 \zeta^2} \right)$, where $\zeta$ is the problem-dependent Slater constant that characterizes the size of the feasible region. Finally, we prove a matching lower bound for the strict feasibility setting, thus obtaining the first near minimax optimal bounds for discounted CMDPs. Our results show that learning CMDPs is as easy as learning MDPs when small constraint violations are allowed, but inherently more difficult when we demand zero constraint violation.
Comment: NeurIPS'22

Yuan, Hui, Ni, Chengzhuo, Wang, Huazheng, Zhang, Xuezhou, Cong, Le, Szepesvári, Csaba, and Wang, Mengdi
 Subjects

Computer Science - Machine Learning and Statistics - Machine Learning
 Abstract

Directed Evolution (DE), a landmark wet-lab method that originated in the 1960s, enables the discovery of novel protein designs via evolving a population of candidate sequences. Recent advances in biotechnology have made it possible to collect high-throughput data, allowing the use of machine learning to map out a protein's sequence-to-function relation. There is a growing interest in machine-learning-assisted DE for accelerating protein optimization. Yet the theoretical understanding of DE, as well as of the use of machine learning in DE, remains limited. In this paper, we connect DE with the bandit learning theory and make a first attempt to study regret minimization in DE. We propose a Thompson-Sampling-guided Directed Evolution (TS-DE) framework for sequence optimization, where the sequence-to-function mapping is unknown and querying a single value is subject to costly and noisy measurements. TS-DE updates a posterior of the function based on collected measurements. It uses a posterior-sampled function estimate to guide the crossover recombination and mutation steps in DE. In the case of a linear model, we show that TS-DE enjoys a Bayesian regret of order $\tilde O(d^{2}\sqrt{MT})$, where $d$ is the feature dimension, $M$ is the population size and $T$ is the number of rounds. This regret bound is nearly optimal, confirming that bandit learning can provably accelerate DE. It may have implications for more general sequence optimization and evolutionary algorithms.
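A caricature of the TS-DE loop in the linear case might look as follows: maintain a Gaussian posterior over the sequence-to-fitness weights, sample from it, and let the sample guide a greedy single-bit mutation of each population member. The binary encoding, noise scale, and mutation rule are all toy assumptions, and crossover is omitted:

```python
import numpy as np

rng = np.random.default_rng(7)
d, M, T, noise = 10, 6, 40, 0.1
theta_star = rng.normal(size=d)           # unknown linear fitness weights

# Bayesian linear model: prior N(0, I), Gaussian likelihood.
P = np.eye(d)                             # posterior precision
b = np.zeros(d)                           # precision-weighted mean statistics

pop = rng.integers(0, 2, size=(M, d)).astype(float)   # candidate sequences
for t in range(T):
    # Thompson sampling: draw one plausible fitness function per round
    theta_s = rng.multivariate_normal(np.linalg.solve(P, b), np.linalg.inv(P))
    for m in range(M):
        gain = (1 - 2 * pop[m]) * theta_s   # fitness change from flipping each bit
        j = np.argmax(gain)
        if gain[j] > 0:
            pop[m, j] = 1 - pop[m, j]       # mutate guided by the sample
        # costly, noisy measurement of the mutated sequence
        y = float(pop[m] @ theta_star) + rng.normal(scale=noise)
        P += np.outer(pop[m], pop[m]) / noise**2
        b += pop[m] * y / noise**2

best_fit = (pop @ theta_star).max()
opt_fit = np.maximum(theta_star, 0).sum()   # fitness of the best binary sequence
```

As measurements accumulate, the posterior concentrates and the sampled directions increasingly agree with the true weights, so the population drifts toward the optimal sequence.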

Liu, Qinghua, Szepesvári, Csaba, and Jin, Chi
 Subjects

Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory, and Statistics - Machine Learning
 Abstract

This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions, which reveal incomplete information about the underlying state of the system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs, weakly revealing POMGs, in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when compared against the optimal maximin policies. To our best knowledge, this work provides the first line of sample-efficient results for learning POMGs.

Kozuno, Tadashi, Yang, Wenhao, Vieillard, Nino, Kitamura, Toshinori, Tang, Yunhao, Mei, Jincheng, Ménard, Pierre, Azar, Mohammad Gheshlaghi, Valko, Michal, Munos, Rémi, Pietquin, Olivier, Geist, Matthieu, and Szepesvári, Csaba
 Subjects

Computer Science - Machine Learning, Computer Science - Artificial Intelligence, and Statistics - Machine Learning
 Abstract

In this work, we consider and analyze the sample complexity of model-free reinforcement learning with a generative model. Particularly, we analyze mirror descent value iteration (MDVI) by Geist et al. (2019) and Vieillard et al. (2020a), which uses the Kullback-Leibler divergence and entropy regularization in its value and policy updates. Our analysis shows that it is nearly minimax-optimal for finding an $\varepsilon$-optimal policy when $\varepsilon$ is sufficiently small. This is the first theoretical result demonstrating that a simple model-free algorithm without variance reduction can be nearly minimax-optimal under the considered setting.
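In tabular form, an MDVI-style iteration alternates a Bellman backup with a multiplicative-weights (KL-regularised) policy update. The random MDP, step size, and iteration count below are illustrative, and this sketch omits the sampling from a generative model that the paper analyses:

```python
import numpy as np

rng = np.random.default_rng(8)
S, A, gamma, eta, iters = 4, 3, 0.9, 1.0, 2000

P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a] on the simplex
R = rng.uniform(size=(S, A))                 # rewards in [0, 1]

pi = np.full((S, A), 1.0 / A)
Q = np.zeros((S, A))
for _ in range(iters):
    V = (pi * Q).sum(axis=1)                 # value of the current policy mix
    Q = R + gamma * P @ V                    # one Bellman backup
    # KL-regularised (mirror descent) policy update: pi <- pi * exp(eta * Q),
    # renormalised; the max-shift is only for numerical stability
    w = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
    pi = w / w.sum(axis=1, keepdims=True)
V_mdvi = (pi * Q).sum(axis=1)

# reference: unregularised value iteration
Vstar = np.zeros(S)
for _ in range(2000):
    Vstar = (R + gamma * P @ Vstar).max(axis=1)
```

The KL term toward the previous policy is what produces the multiplicative form of the update; as iterations accumulate, the policy concentrates on greedy actions and the values approach those of unregularised value iteration.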
Comment: 29 pages, 6 figures