Discover the AI Experts
Nando de Freitas | Researcher at DeepMind
Nige Willson | Speaker
Ria Pratyusha Kalluri | Researcher, MIT
Ifeoma Ozoma | Director, Earthseed
Will Knight | Journalist, Wired
AI Expert Profile
Not Available
The Expert's latest messages:
2024-12-19 05:05:53 Self-supervised reinforcement learning provides a great way to get agents to learn about the world on their own, but this has been notoriously hard to instantiate in the real world. In our new work by @YifeiZhou02 &
2024-12-18 01:24:34 RT @zhiyuan_zhou_: Turns out that the (most) simple and intuitive RL fine-tuning recipe works great: pre-train on the offline dataset with…
2024-12-17 18:00:32 RT @zhiyuan_zhou_: Can we finetune with RL *without retaining offline data*? Typically, algorithms co-train on offline data during finetun…
2024-12-15 20:24:41 I'll be presenting methods that can make our robots think harder at the Multimodal Algorithmic Reasoning workshop at 1:30 pm PT at #NeurIPS2024 (West Hall A) (in one hour) -- come find out about embodied chain of thought and other ways to make our robots think harder!
2024-12-12 16:35:37 RT @vivek_myers: When is interpolation in a learned representation space meaningful? Come to our poster at 4:30 to see how time-contrastive…
2024-12-11 19:42:59 We evaluate this on connecting connectors (including new connectors), moving objects, etc. For more, see the website: https://t.co/wL3IuZG3BB w/ @CharlesXu0124 @qiyang_li @jianlanluo https://t.co/wnUfm9ubqF
2024-12-11 19:42:58 How do we train vision-language-action (VLA) models with RL data? Distilling specialized RL policies into a generalist VLA (e.g., OpenVLA) works wonders for training VLAs to be fast &
2024-12-11 19:36:01 RT @simon_zhai: I will be standing up, talk, and show this work at @NeurIPS tomorrow afternoon (Wed Dec. 11 4:30 - 7:30 p.m. at East Exhibi…
2024-12-11 04:13:43 This turns out to be much better than prior offline ->
2024-12-11 04:13:42 Can we finetune policies from offline RL *without retaining the offline data*? We typically keep the offline data around when finetuning online. Turns out we can avoid retaining and get a much better offline to online algorithm, as discussed in @zhiyuan_zhou_'s new paper: https://t.co/zjYHo80ITD
2024-11-27 05:32:08 RT @sea_snell: Can we predict emergent capabilities in GPT-N+1 using only GPT-N model checkpoints, which have random performance on the ta…
2024-11-25 02:32:50 We can theoretically prove that this leads to a bound on Q-values. We can then apply this method to train Transformer Q-functions for language modeling and dialogue, robotic control, and a variety of LLM and VLM tasks. For more, check out the paper here: https://t.co/mC2aj8bXla https://t.co/APaifKIVBK
2024-11-25 02:32:49 The idea is to directly treat token probabilities themselves as Q-values: i.e., the probability of a token is proportional to the approximate value that will be obtained from producing that token, then acting (near) optimally. Of course this has a problem: probabilities sum to 1,… https://t.co/36NvAjm1Ne https://t.co/jqw4ghh31g
2024-11-25 02:32:48 New paper by Joey Hong shows how we can train LLMs with value-based RL for multi-turn tasks *just by turning probabilities into Q-values*! This provides an algorithm that can be used for LLMs, VLMs, robotics tasks, etc. with one simple loss function. Thread https://t.co/2Oen2sIkTm
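A minimal sketch of the general idea described in the thread above, under my own assumptions (this is not the paper's implementation): read per-token log-probabilities out of a causal LM head and treat them, up to a scale factor, as Q-value estimates for the next token.

```python
# Toy sketch (not the paper's method): treat per-token log-probabilities
# from a language model head as Q-value estimates for the next token.
import torch
import torch.nn.functional as F

def token_q_values(logits: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Map a [batch, vocab] tensor of logits to Q-value estimates.

    Assumption (illustrative): Q(s, token) is taken proportional to the
    token's log-probability, scaled by a temperature-like factor beta.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return beta * log_probs

if __name__ == "__main__":
    dummy_logits = torch.randn(2, 8)       # 2 contexts, a toy vocabulary of 8 tokens
    q = token_q_values(dummy_logits)
    greedy_tokens = q.argmax(dim=-1)       # act (near) optimally w.r.t. the Q estimates
    print(q.shape, greedy_tokens)
```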
2024-11-20 06:14:17 An intriguing new result from @katie_kang_: after training long enough, LLMs will reproduce training examples exactly (not surprising). But how they get there matters: if they first get the right answer (e.g., to a math problem) and only then memorize, they've "learned" the… https://t.co/HR7YP2XMLV https://t.co/J53BmgI31y
2024-11-15 22:33:41 RT @KuanFang: How can we fine-tune VLMs for robotic control without robot data? In KALIE, we leverage diffusion models to synthesize data f…
2024-11-15 17:56:01 RT @Astribot_Inc: Astribot Physical Intelligence @physical_int Together, we’re working toward next-gen AI robot assistants that can lear…
2024-11-15 17:00:56 @JieWang_ZJUI Espresso machine is the final boss
2024-11-15 16:58:15 RT @hausman_k: https://t.co/6E5hVGzbwh @willknight it's not a soft German accent, it's Polish
2024-11-15 15:47:04 This is fun! Really impressed that this task worked so quickly, looking forward to lots more collaborative coffee in the future. And perhaps a few other things
2024-11-10 12:25:37 Lots of memorable quotes from @JitendraMalikCV at CoRL, the most significant one of course is: “I believe that Physical Intelligence is essential to AI” :) I did warn you Jitendra that out of context quotes are fair game. Some liberties taken wrt capitalization.
2024-11-09 13:52:40 RT @vivek_myers: Current robot learning relies heavily on imitating expert demonstrations. This limits new compositional and long horizon b…
2024-11-08 16:27:39 Check out the industry panel at CoRL tomorrow -- @SurajNair_1 will represent PI. https://t.co/L2JOjzpwj8
2024-11-08 14:16:50 RT @NoriakiHirose: Today we have an oral presentation in oral session 5 and a poster in poster session 4 at #CoRL2024 for SELFI, self-improvement ap…
2024-11-08 14:16:23 This should be fun! Post-training is now a thing (not just a thing, a necessity) in large-scale robotic learning. https://t.co/ag34BIv6Ep
2024-11-08 14:15:23 Come check out Kevin's talk tomorrow as well if you want to get the inside scoop about how pi-zero was trained! https://t.co/XSr0ciETCn
2024-11-08 11:57:07 Tmrw(Sat) I'll talk about pi-zero &
2024-11-02 04:37:36 RT @kvablack: It's been 6 months since I slammed the brakes on several PhD research projects to go work at π... super excited to finally…
2024-11-02 03:20:24 @iandanforth Now I have to fold laundry for the robot
2024-11-02 01:52:37 TGIF https://t.co/eQK0KE7j2I
2024-11-02 01:21:58 @Shixo Nope not secret, it’s just pretty standard. We’ll aim to share more over the next few months.
2024-11-01 05:23:29 @iandanforth The struggle is real
2024-11-01 05:23:07 @iamai_eth The first video is completely autonomous. The only command is Michael pushing a button to start it.
2024-10-31 22:50:18 RT @michael_equi: Excited to share what we've been up to in the past 8 months @physical_int! We trained a 3B vision-action-language flow match…
2024-10-31 22:46:56 @sreak1089 Not any deep reason, mostly trying to keep things simple, inexpensive, and easy to work with. Not ideological — definitely getting these models to run on humanoids would be awesome!
2024-10-31 22:45:40 @venuv62 That would require superhuman skills
2024-10-30 21:56:40 This was fun https://t.co/4zRvoghlO0
2024-10-30 19:58:43 My talk from Actuate summit from a couple months ago is now up! https://t.co/fbVGoLdiKv
2024-10-30 04:03:21 A great collaboration w/ @seohong_park @kvfrans @ben_eysenbach Website: https://t.co/Xwq6dek8iX Repository: https://t.co/8Mlyg5bl5F Paper: https://t.co/aGN83x9xi8
2024-10-30 03:50:43 OGBench provides an innovative set of tasks for goal-conditioned offline RL. One of my favorites is the puzzle task (right) in the manipulation suite: even from data of random moves, offline RL can learn to solve the puzzle *entirely from data without any online exploration* https://t.co/mk5r3ee2yY
2024-10-30 03:50:42 Offline goal-conditioned RL could provide a fully unsupervised and fully data-driven way to pretrain broadly capable policies from previously collected data, but progress has been bottlenecked by a lack of complex and compelling tasks. OGBench addresses this. https://t.co/j7wdrBvIDE
2024-10-28 16:15:25 All training/evaluation videos are available at the website: https://t.co/2ENKU7te4C Code: https://t.co/RrV0V1Zahd Paper: https://t.co/87diEdCbFo
2024-10-28 16:15:24 HIL-SERL combines initialization from demos, human interventions, and sample-efficient RL. This allows us to learn some tasks that are very dynamic. https://t.co/LBSsignmbD
2024-10-28 16:15:23 RL in the real world presents some big challenges, but also some really big opportunities. In our new work, HIL-SERL, @CharlesXu0124, Jeffrey Wu, @jianlanluo show that real-world RL can learn a huge range of precise and robust tasks, and perform them much faster than imitation. https://t.co/uGtc5wrZlS
2024-10-24 14:53:26 Pretraining can transform RL, but it might require rethinking how we pretrain with RL on unlabeled data to bootstrap downstream exploration. In our new work, we show how to accomplish this with unsupervised skills and exploration. https://t.co/fmO84uJzev
2024-10-24 03:30:56 My colleague @JacobSteinhardt and his co-founders have a cool new effort -- looking forward to seeing how this evolves! https://t.co/q5qRGtykCH
2024-10-20 03:07:43 A fun collaboration on optimizing sequence models to maximize downstream reward functions for computational design in synthetic biology. https://t.co/1s8AFqxP5A
2024-10-18 15:28:16 RT @BiologyAIDaily: Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design • This study…
2024-10-18 15:11:03 A way to get 128x faster inference with diffusion models by training for shortcuts. Who has time to wait for 128 diffusion steps… https://t.co/s2MIv4syfs
2024-10-18 04:18:11 Combining robotic foundation models (Octo, OpenVLA, etc.) with offline RL trained value functions makes them better! A great thing about value functions is that we can plug them into any policy as a filter on samples, providing a lightweight and general improvement mechanism. https://t.co/ljw0jEPlen
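A rough illustration of the "value function as a filter" idea in the post above: draw several candidate actions from a generalist policy and keep the one a learned Q-function ranks highest. The `policy_sample` and `q_value` callables below are stand-ins, not Octo/OpenVLA or the paper's critic.

```python
# Illustrative sketch: re-rank sampled actions from a generalist policy
# with an offline-RL value function (stand-in functions only).
import numpy as np

def filtered_action(policy_sample, q_value, obs, num_samples: int = 16):
    """Sample candidate actions and return the one with the highest Q-value."""
    candidates = [policy_sample(obs) for _ in range(num_samples)]
    scores = np.array([q_value(obs, a) for a in candidates])
    return candidates[int(np.argmax(scores))]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    policy_sample = lambda obs: rng.normal(size=7)    # fake 7-DoF action sampler
    q_value = lambda obs, a: -np.linalg.norm(a)       # fake learned critic
    obs = np.zeros(10)
    print(filtered_action(policy_sample, q_value, obs))
```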
2024-10-16 22:05:38 RT @SudeepDasari: Excited to share my final PhD project We show how simple, yet elegant changes enable diffusion transformers to learn SO…
2024-10-08 20:06:27 @JonathanScholz2 FWIW I do think that some of your arguments worked insofar as your team is one of the… erm… realest of the London groups.
2024-10-07 06:46:56 Our new paper on using YouTube videos to learn language conditioned navigation is out! By leveraging pretrained models and video data mined from the web, we can get robots to better understand language instructions. https://t.co/viydd3lOg5
2024-10-06 02:34:05 I wrote a little blog post about robotic foundation models (generalist robotic policies): https://t.co/jXPRdT6Rst
2024-09-28 16:58:38 We have some fun new RL results coming soon! https://t.co/NZKtjeKgMu
2024-09-24 03:12:53 RT @jianlanluo: FMB appeared at IJRR today, open access at: https://t.co/1qlbZMjjPK During the review period, we also greatly improved the…
2024-09-19 06:12:16 RT @IEEESpectrum: Robots rely on specialized control policies to dictate their movements and skills. But now there's CrossFormer, a single…
2024-09-14 04:07:00 For more, check out: arXiv: https://t.co/gn5GgX8wBp website: https://t.co/0UpqxMrxQm code: https://t.co/29CvISzdvq
2024-09-14 04:06:59 The idea in PALO is to take a few demonstrations for a new task, and instead of running imitation learning, infer the semantic commands (e.g., "grasp drawer", "pull open", etc) that each stage of the demo corresponds to. Then, the robot "imitates" the demo by executing that… https://t.co/Fa7u2gCTGr https://t.co/ra8vDM8qCo
2024-08-29 05:51:39 RT @seohong_park: Is "offline RL" in offline-to-online RL really necessary? Surprisingly, we find that replacing offline RL with *unsuperv…
2024-08-23 23:53:47 Can one transformer-based policy fly drones, manipulate objects, walk, and drive? @riadoshi21, @HomerWalke, @oier_mees, @SudeepDasari show in their new papers that CrossFormer can do all of these things, with flexible per-embodiment action heads: https://t.co/hVvBMnsE3T https://t.co/GFu8GgUkKR
2024-08-23 04:57:55 RT @HomerWalke: Big shout out to @riadoshi21 who co-led this project with me! Check out her thread here. Also thanks to my other fantastic…
2024-08-23 04:57:50 RT @riadoshi21: Can we train one policy to control a wide range of robots, from drones to quadrupeds, navigators to bimanual manipulators…
2024-08-23 04:57:48 RT @oier_mees: Huge shoutout to the amazing @riadoshi21 and @HomerWalke for leading this project and to collaborators @SudeepDasari @svlevi…
2024-08-06 05:58:23 @nick_rhinehart Congratulations Nick!!!
2024-08-06 05:58:00 RT @nick_rhinehart: I’m happy to share that I’ve started a new position as Assistant Professor at The University of Toronto! I'll lead the…
2024-08-03 02:09:32 RT @shahdhruv_: I “defended” my thesis earlier today — super grateful to @svlevine and everyone at @berkeley_ai for their support through t…
2024-07-31 03:29:15 To learn more, check out: website: https://t.co/lDWpzFAWD9 paper: https://t.co/8lXLBdiNbc code: https://t.co/CjdleKgeEx work led by @pranav_atreya &
2024-07-31 03:29:14 This works very well because the goal-conditioned policy can self-improve without any human supervision, while the VLM and diffusion model leverages Internet-scale pretraining. So every component either improves through self-supervision or benefits from pretraining (or both). https://t.co/IDiue5sD6r
2024-07-21 03:30:22 RT @foxglove: The one and only @svlevine from @physical_int will be talking about Robotic Foundation Models at #Actuate2024 on September 18…
2024-07-21 03:16:16 RL training of diffusion models requires maximizing a reward function, staying within the manifold of valid samples (e.g., valid images), and efficient exploration, all together. Balancing these factors while leveraging pretrained models requires new RL methods. In our new paper… https://t.co/1wkzdDROVF https://t.co/S5Zqmljagx
2024-07-21 03:12:37 Training diffusion models with RL allows optimizing for aesthetically appealing images, prompt alignment, drug function, etc. In a new survey by Masatoshi Uehara w/ @zhaoyl18thu &
2024-07-19 03:52:43 RT @oier_mees: Rockstars @HomerWalke and @its_dibya presenting Octo on behalf of the Octo team! #RSS2024 https://t.co/kENdFMX1Nc
2024-07-17 15:46:54 If you want to learn about open source robotic foundation models check out Octo at RSS 2024!! https://t.co/RqzIYLxOgD
2024-07-17 15:46:28 RT @its_dibya: Giving a talk about Octo at RSS w/ @HomerWalke bright and early tomorrow @ 8:30 AM! Join us &
2024-07-17 05:35:01 RT @fangchenliu_: I cannot make it for #RSS2024 but MOKA will be presented tmrw by my friends @LinfengZhaoZLF and @ChuanWen15 at session 9.…
2024-07-12 04:49:36 This was a really fun collaboration with @MiZawalski, @verityw_, @KarlPertsch, @oier_mees, @chelseabfinn Website: https://t.co/keE4R6dAaa Paper: https://t.co/4VVml43wFm
2024-07-12 04:49:34 While our main experiments use the Bridge v2 setup: https://t.co/g4hcGEglep We also tested on a variety of other embodiments from OXE: https://t.co/4llTvH4Xsy https://t.co/zvsOPnBhp5
2024-07-12 04:49:33 The resulting VLA can even interpret human corrections and interventions, incorporating them as corrections into the embodied chain of thought process! https://t.co/o5pycPXc3M
2024-07-12 04:49:31 The resulting model can solve complex tasks that require multi stage inferences. It can generalize more effectively to novel objects, perform longer tasks, and understand sophisticated instructions. https://t.co/YUE2JSit9C
2024-07-11 04:18:31 Paper: https://t.co/a3Nprj3WlZ Code: https://t.co/4RlrJDDt5A…
2024-07-11 04:17:56 The cool thing is that this satisfies quasimetric properties, obeying triangle inequality, and allowing for "chaining" in offline RL for free by exploiting metric structure, as shown in these didactic examples. https://t.co/4vHm4iwCtX
2024-07-11 04:17:55 This has a geometric interpretation: the teal area represents the cumulative probability over time of reaching g from s, the orange of reaching g from g. The difference between these is the new notion of "successor distance" https://t.co/kmZLXJyPQZ
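My reading of the two posts above, written out as a formula (notation and the exact form are my assumptions, not necessarily the paper's):

```latex
% C(s \to g): cumulative (discounted) probability over time of reaching g from s.
% The "successor distance" is the gap between this quantity at g itself and at s:
d(s, g) \;\approx\; \log C(g \to g) \;-\; \log C(s \to g)
% A log-ratio of this form is what would give the quasimetric (triangle-inequality)
% structure mentioned in the thread.
```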
2024-06-21 16:04:28 RT @aviral_kumar2: New paper: we trained a SOTA (>
2024-06-16 02:35:20 RT @KarlPertsch: Very excited to release OpenVLA today, a 7B parameter open-source vision-language-action model (VLA). SoTA generalist p…
2024-06-14 23:44:30 RT @seohong_park: Most works in offline RL focus on learning better value functions. So value learning is the main bottleneck in offline RL…
2024-06-14 20:24:23 RT @aviral_kumar2: This was an awesome collaboration led by @seohong_park, w/ @kvfrans and @svlevine. @seohong_park also wrote a terrific…
2024-06-14 16:00:14 What can we do with OpenVLA? We can finetune it to downstream tasks, use it for zero-shot control, and extend it for accessible and practical VLA research. VLAs will be an integral part of robotic learning in the future, and accessible tools for research on VLAs are essential
2023-05-22 20:53:47 @natolambert Absolutely. Providing feedback to our models is far too important a job to be left to humans, we should get feedback from something smarter
2023-05-22 20:52:28 @generatorman_ai I think that could make sense. Would need to pick some form (eg MSE) of recon loss, but it does make sense as an objective, essentially rewarding for distance to gt image on each prompt. Not certain what exactly that would do… maybe @michaeljanner has some thoughts on that
2023-05-22 17:59:59 Of course this is not without limitations. We asked the model to optimize for rewards that correctly indicate the *number* of animals in the scene, but instead it just learned to write the number on the image :( clever thing... https://t.co/xxjiq34npT
2023-05-22 17:59:58 We optimized for "animals doing activities" and it does some cool stuff. Interestingly, a lot of the pictures start looking like children's book illustrations -- we think this is because "bear washing dishes" is less likely to be a photograph, more likely in a children's book https://t.co/SjTxCZSQLm
2023-05-22 17:59:57 Quantitatively, this leads to very effective optimization of a wide variety of reward functions, both hand-designed rewards and rewards derived automatically from vision-language models (VLMs). https://t.co/TENhaeESQn
2023-05-22 17:59:56 We figured out how to train diffusion models with RL to generate images aligned with user goals! Our RL method gets ants to play chess and dolphins to ride bikes. Reward from powerful vision-language models (i.e., RL from AI feedback): https://t.co/5Mui7Wb8pB A https://t.co/j24K2IQRhh
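A toy score-function (REINFORCE-style) sketch of "RL on the denoising chain" as described in the thread above; the actual method, estimator, and hyperparameters in the paper differ, and the tensors below are fabricated placeholders.

```python
# Toy REINFORCE-style sketch of RL finetuning over a denoising chain
# (illustrative only; not the paper's algorithm).
import torch

def denoising_reinforce_loss(log_probs: torch.Tensor,
                             rewards: torch.Tensor) -> torch.Tensor:
    """log_probs: [batch, T] log-prob of each denoising step actually taken.
    rewards:   [batch] scalar reward per final image (e.g., from a VLM judge).

    Score-function estimator: increase the likelihood of denoising
    trajectories in proportion to the reward of the sample they produced.
    """
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # simple baseline
    return -(rewards.unsqueeze(1) * log_probs).sum(dim=1).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    fake_log_probs = torch.randn(4, 50, requires_grad=True)  # 4 samples, 50 steps
    fake_rewards = torch.randn(4)
    loss = denoising_reinforce_loss(fake_log_probs, fake_rewards)
    loss.backward()
    print(float(loss))
```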
2023-05-10 01:26:41 RT @mitsuhiko_nm: We've released our code for Cal-QL, along with public @weights_biases logs to make the replication easier! Please try it…
2023-05-04 18:26:32 RT @mitsuhiko_nm: I will be presenting our work Cal-QL, as a Spotlight at Reincarnating RL Workshop at #ICLR2023 in a few hours! (10:50 am…
2023-05-04 02:36:03 Cal-QL is a simple modification to CQL that "calibrates" the value function during offline training making it finetune efficiently online. You can check out the paper here: https://t.co/Mz8fmhqgQm You can also watch a previous talk on YouTube: https://t.co/VluE64Rk64
2023-05-04 02:36:02 If you want to learn about how offline pretraining with CQL can be combined with efficient online finetuning (Cal-QL), check out @mitsuhiko_nm's talk tomorrow at #ICLR2023 (workshops), 1:50 am PT = 10:50 am in Kigali: https://t.co/WRMny8o3Kb Also at RRL workshop. More info https://t.co/9A8fpu4o9J
2023-05-03 17:04:47 I'll be speaking at the Reincarnating RL WS tmrw (4:30 pm CAT = 6:30 am PDT) along w @JosephLim_AI @furongh @annadgoldie @marcfbellemare @DrJimFan @jeffclune Follow along in person &
2023-05-03 03:46:17 ALM (Aligned Latent Models) is an MBRL alg that optimizes a latent space, model, and policy w.r.t. the same variational objective! By @GhugareRaj, @mangahomanga, @ben_eysenbach, @svlevine, @rsalakhu 16:30/7:30 am MH1-2-3-4 #124 https://t.co/k9h1REVlea https://t.co/O5bxe0CCTg https://t.co/Iols6UcmU0
2023-05-03 03:46:16 Neural constraint satisfaction is a method that uses object-based models and enables planning multi-step behaviors By @mmmbchang, @a_li_d, @_kainoa_, @cocosci_lab, @svlevine, @yayitsamyzhang 16:30 local/7:30am PDT at MH1-2-3-4 #78 https://t.co/qZtQREJA4A https://t.co/UwLK35Bmcm https://t.co/wevEIJ1XQv
2023-05-03 03:46:15 This will be presented virtually at 16:30 local time (7:30 am PDT): https://t.co/s2dgxSPLZQ @setlur_amrith, @DonDennis00, @ben_eysenbach, @AdtRaghunathan, @chelseabfinn, @gingsmith, @svlevine paper: https://t.co/CIDxrji6mC
2023-05-03 03:46:14 Next, bitrate DRO trains classifiers robust to distr. shift w/ a simple idea: real-world shifts affect underlying factors of variation (lighting, background, etc), and leveraging this structure by constraining the shifts we are robust to leads to very effective methods. https://t.co/aUyVx7rEXv
2023-05-03 03:46:13 Value-based offline RL has the potential to drastically increase the capabilities of LLMs by allowing them to reason over outcomes in multi-turn dialogues, and I think ILQL is an exciting step in this direction. More commentary in my blog post here: https://t.co/WLn8ugzwgr
2023-05-03 03:46:12 All are in 16:30 session (7:30 am PDT). ILQL is an offline RL method to train LLMs for multi-turn tasks. RL lets us train LMs to do things like ask questions in dialogue: https://t.co/gsgRwEF7oO https://t.co/xrPkESFPYj @sea_snell, @ikostrikov, @mengjiao_yang, @YiSu37328759 https://t.co/oPwlBqhgV2
2023-05-03 03:46:11 Tmrw at #ICLR2023 in Kigali, we'll be presenting: ILQL: a method for training LLMs w/ offline RL! Bitrate-constrained DRO: an information-theoretic DRO method Neural constraint satisfaction for planning with objects Aligned latent models for MBRL w/ latent states Thread https://t.co/IkV4LTej2G
2023-05-01 17:18:18 The idea is to adapt conservative Q-learning (CQL) to learn a Q-function for all confidence levels, representing a bound: Q(s, a, delta) means that the Q-value is bounded by the predicted value with probability 1 - delta. https://t.co/CHD8qrlCpR
2023-05-01 17:18:17 Pessimism is a powerful tool in offline RL: if you don't know what an action does, assume it's bad. But how pessimistic should an agent be? We'll present "Confidence-Conditioned Value Functions" at #ICLR2023 tmrw, arguing that we can learn all pessimism levels at once https://t.co/Tm5ne7SIaz
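An illustrative sketch of what a delta-conditioned Q-function could look like; the architecture and dimensions are my assumptions, not the paper's network.

```python
# Sketch of a confidence-conditioned Q-function Q(s, a, delta)
# (architecture is illustrative, not the paper's exact model).
import torch
import torch.nn as nn

class ConfidenceConditionedQ(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act, delta):
        # delta in (0, 1): desired confidence level for the predicted bound.
        return self.net(torch.cat([obs, act, delta], dim=-1)).squeeze(-1)

if __name__ == "__main__":
    q = ConfidenceConditionedQ(obs_dim=4, act_dim=2)
    obs, act = torch.randn(8, 4), torch.randn(8, 2)
    delta = torch.full((8, 1), 0.1)   # ask for a 90%-confidence bound
    print(q(obs, act, delta).shape)
```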
2023-05-01 17:14:56 Interested in how we can train fast with RL, taking as many gradient steps as possible per sim step? This work will be presented at #ICLR2023 in Kigali, Tue May 2, 11:30 am local time (2:30 am PDT), MH1-2-3-4 #96 https://t.co/yChkpwjjHf
2023-05-01 06:39:45 RT @agarwl_: I'll be presenting our work today at @iclr_conf on scaling offline Q-learning to train a generalist agent for Atari games. Sto…
2023-05-01 00:41:03 Talk: 10 am local/1 am PDT Mon May 1 Poster: 11:30 am local/2:30 am PDT at MH1-2-3-4 #105 For more, you can find the paper and more details on the project website: https://t.co/f0UPDa635I
2023-05-01 00:41:02 We'll be presenting our work on large-scale offline RL that pretrains on 40 Atari games at #ICLR2023. Come learn about how offline RL can pretrain general RL models! (well, general-Atari RL...) w/ @aviral_kumar2, @agrwl, @georgejtucker Long talk @ 10:00 am Mon (1 am PDT) info https://t.co/FXIPeFtDJN
2023-04-30 21:02:30 @haarnoja Congratulations Tuomas (&
2023-04-30 18:35:02 It's often a mystery why we can't just take way more gradient steps in off-policy model-free RL and get it to learn faster. There are a variety of (very reasonable) explanations, but it turns out that overfitting is a pretty good explanation, and suggests some simple fixes. https://t.co/yChkpwjjHf
2023-04-24 17:45:18 IDQL is also quite fast computationally. While it's not as blazing-fast as IQL, the decoupling does mean that it is much faster than other diffusion-based RL methods, since the diffusion model is trained completely separately from the critic with supervised learning.
2023-04-24 17:45:17 This is cool, but it also reveals a problem: the implicit policy is really complicated! Yet regular IQL uses a standard unimodal Gaussian policy to approximate it, which is going to be really bad. So we propose to use a much more powerful policy that can accurately capture pi_imp
2023-04-24 17:45:16 First: what is IQL? IQL <
2023-04-24 17:45:15 We released a new version of implicit Q-learning (IQL) with diffusion-based policies to get even better results with less hyperparameter tuning. For paper&
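For context on the IQL piece that IDQL builds on: the expectile regression loss is the standard ingredient, written here as a self-contained sketch (the tau value and tensor shapes are illustrative).

```python
# The expectile regression loss used in IQL-style value learning.
import torch

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """diff = Q(s, a) - V(s). tau > 0.5 biases V toward an upper expectile,
    which approximates the value of the best in-support action."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff ** 2).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    diff = torch.randn(128)
    print(float(expectile_loss(diff, tau=0.9)))
```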
2023-04-23 23:14:31 "Sim" to real transfer from artificial pickles to real pickles. Awesome to see that Bridge Data helps with generalization to real foodstuffs :) https://t.co/gIHGKxxuye
2023-04-23 00:48:25 A talk that I prepared recently that describes a few recent works on RL finetuning with RL pretraining: https://t.co/WKUcORzhfp Offline RL can pretrain models that work great for online RL. We can develop algorithms for this (e.g., Cal-QL) and models (e.g., ICVF).
2023-04-21 21:29:30 Robotics research https://t.co/UEN4W7YUWw
2023-04-21 21:28:48 RT @kevin_zakka: I had the same reaction seeing this the first time!
2023-04-21 21:27:51 RT @smithlaura1028: I'm very excited to share our super simple system for teaching robots complex tasks using existing controllers Build…
2023-04-21 16:30:02 For full results, see the video here: https://t.co/29uhFS9YTl Website: https://t.co/9K7JWae5xF Arxiv: https://t.co/rmNiowu12S Led by @smithlaura1028, Chase Kew, w/ Tianyu Li, Linda Luu, @xbpeng4, @sehoonha, Jie Tan
2023-04-21 16:30:01 This turns out to work really well as a way to do curriculum learning. E.g., we first train basic jumping with motion imitation, then finetune to maximize jumping performance with buffer initialization from imitation. Similar idea for hind legs walking https://t.co/Ph1KXvKMLI
2023-04-21 16:30:00 The idea in TWiRL is to leverage RLPD (https://t.co/ZdRDLKNhZy) as a transfer learning algorithm, initializing the replay buffer from another task, environment, etc., and then using the super-fast RLPD method to quickly learn a new task leveraging the prior environment/task data. https://t.co/LL7PqY5emi
2023-04-21 16:29:59 We've taught our robot dog new tricks! Our new transfer learning method, TWiRL, makes it possible to train highly agile tasks like jumping and walking on hind legs, and facilitates transfer across tasks and environments: https://t.co/9K7JWae5xF https://t.co/yZYb6H4ovc
2023-04-21 03:55:03 @jmac_ai We used RLPD for the online phase (which is similar to SAC) and IQL for pretraining. Tradeoffs b/w IQL and CQL are complex. IQL is a great very simple way to learn values/distances, but tricky with policy extraction, CQL is more SAC-like. No clear winner, depends on how it's used
2023-04-20 16:43:01 ...and then finetune on top of that general-purpose initialization with very efficient online RL that can learn at real-time speeds by leveraging an effective initialization. This is likely to make RL way more practical, so that robots can learn in 10-20 minutes
2023-04-20 16:43:00 This was a really fun collaboration with @KyleStachowicz, Arjun Bhorkar, @shahdhruv_ , @ikostrikov The methods this is based on are: IQL: https://t.co/pksjjvwaaf RLPD: https://t.co/ZdRDLKNhZy
2023-04-20 16:37:51 We're also releasing code along with a high quality simulation with nice terrain in MuJoCo, so that you can play with FastRLAP yourself! https://t.co/Z07kcJ96Wl arxiv: https://t.co/jmbQ0LUZX9 video: https://t.co/IK8GN5eG0b web: https://t.co/DXwMZ5oB0m https://t.co/L9nKz3ZGdG
2023-04-20 16:37:50 Here is an example lap time progression for an outdoor course. In 10-20 min it matches the demo, by 35 it's approaching expert (human) level. This is one of the harder tracks we tried, others are also easier (lots more examples on the website above) https://t.co/TzQCnP4bpo
2023-04-20 16:37:49 We then use this backbone to kick off online RL training with RLPD, initialized with one or a few slow demos in the target race course. The offline RL trained encoder is frozen, and RLPD then learns to drive fast in just 10-20 minutes. https://t.co/Mvxg9dmyMC
2023-04-20 16:37:48 FastRLAP is based on offline RL pretraining followed by super fast online RL. We first use an IQL-based offline RL method to pretrain a general-purpose navigation backbone using a large dataset from *other* robots driving around. This gives a general "navigational common sense" https://t.co/9lqL1kuzaJ
2023-04-20 16:37:47 Can we use end-to-end RL to learn to race from images in just 10-20 min? FastRLAP builds on RLPD and offline RL pretraining to learn to race both indoors and outdoors in under an hour, matching a human FPV driver (i.e., the first author...): https://t.co/DXwMZ5oB0m Thread: https://t.co/zUf8Moyvlq
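A minimal sketch of the "pretrain offline, freeze the encoder, finetune heads online" recipe described in this thread; module names, sizes, and the optimizer settings are illustrative assumptions, not the FastRLAP implementation.

```python
# Sketch: freeze an offline-pretrained backbone, train only the online heads.
import torch
import torch.nn as nn

encoder = nn.Sequential(                  # stand-in for the offline-pretrained backbone
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 128),
)
for p in encoder.parameters():            # freeze the "navigational common sense" features
    p.requires_grad_(False)

actor = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(128 + 2, 64), nn.ReLU(), nn.Linear(64, 1))

# Only the heads are handed to the online RL optimizer during fast finetuning.
online_params = list(actor.parameters()) + list(critic.parameters())
optimizer = torch.optim.Adam(online_params, lr=3e-4)

obs = torch.randn(32, 64)                 # fake observations
z = encoder(obs)                          # frozen features
action = actor(z)
q = critic(torch.cat([z, action], dim=-1))
print(action.shape, q.shape)
```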
2023-04-18 18:39:43 Paper: https://t.co/3RlP2YFD0M Site &
2023-04-18 18:38:31 ICVFs can learn across morphologies (train on one morphology, finetune to another), and can even pretrain for fast learning of Atari games using YouTube videos of Atari playing (in some cases, video appears to be someone pointing a phone at their screen...) https://t.co/NLkk2JVLlV
2023-04-18 18:37:21 ICVFs (intention-conditioned value functions) learn how effectively a particular intention enables reaching a particular outcome. Roughly this can be thought of as a generalization of goal-conditioned RL, learning not just goals but all tasks in the learned feature space.
2023-04-18 18:35:30 Code, paper, website here: https://t.co/Gax55dfRSE The idea in ICVFs is to do self-supervised RL where we learn a multilinear representation, predicting which state representation will be reached when we attempt a task, for every possible task and every possible outcome. https://t.co/McsUPzgP4H
2023-04-18 18:35:29 If we want to pretrain RL from videos, we might use representation learning methods like VAEs, MAEs, CPC, etc. But can RL *itself* be an effective rep. learning method? Self-supervised RL turns out to be great at this. That's the idea behind ICVFs, by @its_dibya @ChetBhateja https://t.co/C88rANkCp8
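A sketch of a multilinear intention-conditioned value function of the rough form V(s, g, z) ~ phi(s)^T T(z) psi(g), which is my reading of the thread above; layer sizes and the exact parameterization are assumptions.

```python
# Illustrative multilinear value function V(s, g, z) ~ phi(s)^T T(z) psi(g).
import torch
import torch.nn as nn

class ICVF(nn.Module):
    def __init__(self, obs_dim: int, d: int = 64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, d))
        self.psi = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, d))
        self.T = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, d * d))
        self.d = d

    def forward(self, s, g, z):
        phi_s = self.phi(s)                          # state features, [B, d]
        psi_g = self.psi(g)                          # outcome features, [B, d]
        T_z = self.T(z).view(-1, self.d, self.d)     # intention-conditioned map, [B, d, d]
        return torch.einsum('bi,bij,bj->b', phi_s, T_z, psi_g)

if __name__ == "__main__":
    model = ICVF(obs_dim=16)
    s, g, z = torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 16)
    print(model(s, g, z).shape)   # torch.Size([4])
```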
2023-04-18 14:49:45 RT @mitsuhiko_nm: Previously, PTR has demonstrated that combining limited target demos with a large diverse dataset can achieve effective o…
2023-04-18 03:36:55 @engradil123 Yes, we are considering international applications.
2023-04-17 20:53:39 We previously were fortunate to have Kristian Hartikainen <
2023-04-17 20:53:38 Our lab at Berkeley (RAIL) is hiring a research engineer! If you have a BS or MS and are interested, please see the official UC job posting here: https://t.co/UVovWzvk86 If you want to know more, the job posting has the official details, a few remarks:
2023-04-17 17:19:11 Paper website with videos, code, and arxiv link are here: https://t.co/vUFWTc42vH PTR pretrains on the Bridge Dataset: https://t.co/g4hcGEglep Using a variant of CQL, it can acquire representations that enable learning downstream tasks with just 10-20 trials...
2023-04-17 17:19:10 PTR (Pretraining for Robots) now supports online RL finetuning as well as offline RL finetuning! The concept behind PTR is to pretrain with offline RL on a wide range of tasks (from the Bridge Dataset), and then finetune. New results (led by @mitsuhiko_nm) below! Links &
2023-04-17 06:54:29 RT @_akhaliq: Reinforcement Learning from Passive Data via Latent Intentions abs: https://t.co/6yVdYQeyq2 project page: https://t.co/XCwd…
2023-04-13 21:19:43 RT @GoogleAI: Today we discuss a large-scale experiment where we deployed a fleet of #ReinforcementLearning-enabled robots in office buildi…
2023-04-13 20:58:27 RT @hausman_k: And here is a blog post talking about RLS sorting trash in the real world by @svlevine and @AlexHerzog001: https://t.co/wgq…
2023-04-13 01:19:08 RT @_akhaliq: This is wild Deep RL at Scale: Sorting Waste in Office Buildings with a Fleet of Mobile Manipulators a system for deep rein…
2023-04-13 00:30:21 RT @xiao_ted: Reinforcement learning (!) at scale in the real world (!!) for useful robotics tasks (!!!) in multiple "in-the-wild" offices…
2023-04-12 17:58:25 @mattbeane @JoeRobotJones When the policies generalize effectively across different hardware platforms and tasks.
2023-04-12 17:35:15 RT @julianibarz: This was a 3+ year effort, RL in the real world is hard for sure. Any research that helps stabilize training, make hyper…
2023-04-12 16:57:20 This would not have been possible without an awesome team, and it has been a long journey. But the journey is not over. With our recent efforts integrating language into robotic policies (RT-1, SayCan, etc.), there are big things ahead as we get robots out into our offices. https://t.co/28SxL5dM4Z
2023-04-12 16:57:19 There is of course a lot more to it, so you can check out the paper and more animations and info on the project site: https://t.co/qZkeyLg8xO
2023-04-12 16:57:18 But of course real-world practice is the "real deal" where robots will drive up to waste stations with real trash deposited by real people, try to sort it, and collect more experience to improve on the job https://t.co/R40cfWRtn2
2023-04-12 16:57:17 We settled on sorting recyclables because it builds on things we already knew how to approach, but extends it with new open-world challenges: unexpected objects, novel situations, lots of chances to improve through trial and error (like these hard scenes below!) https://t.co/YqQtDOrycN
2023-04-12 16:57:16 When we developed QT-Opt, we could get even better performance at grasping tasks, and later for multi-task robotic learning, with end-to-end deep RL. https://t.co/sfeFyUQkdi https://t.co/rlLTIOyjYn https://t.co/nYLhIVWwpB
2023-04-12 16:57:15 Our research on large-scale robotic learning goes back to around 2015, when we started with the first "arm farm" system to study collective robot learning from experience, initially for robotic grasping. https://t.co/tQ58x1McRy https://t.co/IplSGfg6Cr
2023-04-12 16:57:14 Can deep RL allow robots to improve continually in the real world, while doing an actual job? Today we're releasing our paper on our large-scale deployment of QT-Opt for sorting recyclables: https://t.co/qZkeyLg8xO This has been a long journey, but here is a short summary https://t.co/glhB5OmnMK
2023-04-11 18:54:02 We have big plans to further expand Bridge Data in the future. @HomerWalke and many collaborators and data collectors have been doing a lot to expand this dataset, and we'll be adding language, more autonomous data, and many more tasks in the future.
2023-04-11 18:54:01 We've updated Bridge Data with 33k robot demos, 8.8k autonomous rollouts, 21 environments, and a huge number of tasks! Check out the new Bridge Data website: https://t.co/g4hcGEglep The largest and most diverse public dataset of robot demos is getting bigger and bigger! https://t.co/gkI7mPngQK
2023-04-11 17:34:59 RT @berkeley_ai: Tomorrow at 12 PM PT, the Berkeley AI lecture series continues. @svlevine will present "Reinforcement Learning with Large…
2023-04-10 18:04:21 @jasondeanlee It's just taking it to the next level https://t.co/UVqKcQYl8c
2023-04-04 16:56:59 The key idea behind Koala is to scrape high-quality data from other LLMs (yeah...). As we discuss in the post, this has some interesting implications for how powerful LLMs can be trained on a budget (in terms of weights and compute).
2023-04-04 16:56:58 We've released the Koala into the wild! The Koala is a chatbot finetuned from LLaMA that is specifically optimized for high-quality chat capabilities, using some tricky data sourcing. Our blog post: https://t.co/7gWNiPOQ0T Web demo: https://t.co/TMWpixTZCK
2023-04-03 17:59:07 The reviewer's dilemma: it's like prisoner's dilemma, where if you accept the invitation to review for @NeurIPSConf, everyone will have fewer papers to review, but if you defect, then everyone's papers get reviewed by Reviewer 2. (so you should all accept your reviewer invites!)
2023-04-02 06:04:02 @archit_sharma97 It's really eye-popping to me that an accomplished professor would engage in such short-sighted behavior -- while the research vision is quite clear-eyed, the optics of this kind of thing are murky at best. Btw, where can I get a toad like that?
2023-03-29 22:27:55 RT @ancadianadragan: Offline RL figures out to block you from reaching the tomatoes so you change to onions if that's better, or put a plat…
2023-03-29 17:32:55 Also, to acknowledge the most relevant work that inspired this: Awesome work from Annie Xie, @chelseabfinn &
2023-03-29 17:28:50 Though at the same time, influence itself is not necessarily bad: e.g., an educational agent might influence students to pay more attention or better retain the material, etc. So we should be thoughtful in approaching these problems!
2023-03-29 17:28:49 ...so we should also be thinking carefully about how to *detect* that someone is interacting with an RL agent that is trying to influence or trick them, and also develop rewards and objectives that prevent deceptive and dishonest behavior.
2023-03-29 17:28:48 Large models (like LLMs) could pick up on subtle patterns in human behavior, and offline RL might enable using these patterns for subtle influence. I also discuss this more here: https://t.co/WLn8ugzwgr Of course, this is ripe for abuse...
2023-03-29 17:28:47 This was a fun collaboration with Joey Hong &
2023-03-29 17:28:46 And here is the RL agent. Notice how it puts the plate on the counter and then refuses to help until the human picks up the plate. Once the human is "forced" in this way, they do the right strategy, and the two players can play together more optimally! https://t.co/ZbHuYuDmw3
2023-03-29 17:28:45 Here is a more subtle example. Here, the optimal strategy is for RL agent (green) to pass plate to the human (blue) so the human can plate the soup and deliver it. Naive BC, shown below, doesn't get this and executes a very suboptimal strategy. https://t.co/9N7wqXEvbC
2023-03-29 17:28:44 And here is the offline RL agent. The agent is the green hat, the blue hat is a human (this is a real user study). Notice how green hat blocks blue hat from picking up on the onions -- after blocking them a few times, the human "gets the idea" and makes tomato soup only. https://t.co/j0cQ72kKWs
2023-03-29 17:28:43 Aside from this, it's basically offline RL (with CQL): analyze the data, and figure out how to do better by influencing humans. Here is one example: we change reward to favor tomato soup (humans don't know this), and the agent influences the human to avoid onions! BC baseline: https://t.co/Tm7PH7at3z
2023-03-29 17:28:42 The algorithm has one change from offline RL: add a "state estimator" to infer the "state" of the human's mind, by predicting their future actions and using the latent state as additional state variables. This allows the agent to reason about how it changed a person's mind. https://t.co/F81sfJEk4P
2023-03-29 17:28:41 The idea: get data of humans playing a game (an "overcooked" clone made by @ancadianadragan's group), plug this into offline RL with a few changes (below). Humans might not play well, but if they influence each other *accidentally* RL can figure out how to do it *intentionally*. https://t.co/spMuZvHmBU
2023-03-29 17:28:40 Offline RL can analyze data of human interaction &
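A very rough sketch of the "state estimator" augmentation described in this thread: summarize the human's recent actions with a small recurrent model and append its latent to the agent's state before running offline RL. The architecture here is a guess for illustration, not the paper's model.

```python
# Illustrative sketch: infer a latent summary of the human's behavior and
# append it to the RL agent's state (not the paper's implementation).
import torch
import torch.nn as nn

class HumanStateEstimator(nn.Module):
    def __init__(self, human_act_dim: int, latent_dim: int = 32):
        super().__init__()
        self.rnn = nn.GRU(human_act_dim, latent_dim, batch_first=True)
        self.head = nn.Linear(latent_dim, human_act_dim)  # predict the next human action

    def forward(self, human_action_history):
        _, h = self.rnn(human_action_history)     # h: [1, B, latent]
        latent = h.squeeze(0)
        return latent, self.head(latent)

if __name__ == "__main__":
    est = HumanStateEstimator(human_act_dim=6)
    history = torch.randn(8, 10, 6)               # 8 episodes, 10 past human actions each
    latent, predicted_next = est(history)
    obs = torch.randn(8, 20)
    augmented_state = torch.cat([obs, latent], dim=-1)  # feed this to offline RL (e.g., CQL)
    print(augmented_state.shape)
```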
2023-03-29 16:33:08 @RCalandra @tudresden_de @TactileInternet Congratulations Roberto!! Good luck with your new lab, looking forward to seeing what kinds of cool new research your group produces!
2023-03-27 21:25:32 RT @tonyzzhao: Introducing ALOHA: A Low-cost Open-source Hardware System for Bimanual Teleoperation. After 8 months iterating @stanford a…
2023-03-27 21:02:21 To read about ALOHA (@tonyzzhao's system), see: Website &
2023-03-27 21:01:02 This is also why we opted to use low-cost widowx250 robots for much of our recent work on robotic learning with large datasets (e.g., Bridge Data https://t.co/fGG35iD0IA, PTR https://t.co/vUFWTc42vH, ARIEL https://t.co/Za9qhEwnIw)
2023-03-27 20:59:44 ...much of the innovation in the setup is to simplify aggressively, using cheap hardware, no IK (direct joint control), etc. With the right learning method, the hardware really can be simpler, with more focus on cost and robustness vs. extreme precision.
2023-03-27 20:58:54 It's also worth pointing out that, separately from the awesome robot results, the action chunking scheme in @tonyzzhao's paper is actually quite clever, and seems to work very well. But perhaps most interesting is that the robots don't have to be very fancy to make this work...
2023-03-27 20:52:44 Fine-grained bimanual manipulation with low-cost robots &
2023-03-27 00:22:09 @MStebelev This paper studies somewhat related questions https://t.co/Og9UpOpleA That said, in general offline RL does suffer from the same covariate shift problems in the worst case, so the best we can do is characterize non-worst-case settings.
2023-03-26 00:44:44 RT @ben_eysenbach: I really enjoyed this conversation about some of the work we've done on RL algorithms, as well as the many open problems…
2023-03-25 23:25:11 I recently gave a talk on RL from real-world data at GTC, here is a recording: https://t.co/Ox7u4XxEL6 Covers our recent work on offline RL, pre-training large scalable models for robotic RL, offline RL for goal-directed large language models, and RL-based human influence.
2023-03-23 03:45:55 Throwing out unnecessary bits of an image to improve generalization, robustness, and representation learning in RL. When you focus on something, you get "tunnel vision" and the irrelevant surroundings seem to fade from view. Maybe RL agents should do the same? https://t.co/I4Kr5JRwL8
2023-03-19 04:20:33 @kchonyc I extrapolated that philosophy to all future "service" tasks (running a big grant, running a conference, etc.), and it seems like a really reliable rule of thumb. Sort of the academic version of "let him who is without sin..."
2023-03-19 04:18:49 @kchonyc Reminds me of wise words one of my colleagues said when I joined UCB: "if the teaching coordinator asks you to teach a specific class, you can say no
2023-03-11 07:52:23 RT @mitsuhiko_nm: Offline pre-training &
2023-03-10 22:45:42 RT @aviral_kumar2: Interested in offline RL that improves with limited online interaction rapidly? Check out Cal-QL: a method for pre-train…
2023-03-10 15:51:03 @hdonancio We just use Monte Carlo estimates for this (sum up rewards in the observed training trajectories). This is always possible and always unbiased, though it has non zero variance.
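The Monte Carlo estimate mentioned in this reply is just the (discounted) return-to-go summed along observed trajectories; a small self-contained version:

```python
# Monte Carlo return-to-go along a single observed trajectory.
import numpy as np

def monte_carlo_returns(rewards, gamma: float = 0.99):
    """Discounted return-to-go for every timestep of one trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

if __name__ == "__main__":
    print(monte_carlo_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```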
2023-03-10 04:45:21 Cal-QL was a great collaboration, led by @mitsuhiko_nm, @simon_zhai, @aviral_kumar2, w/ Anikait Singh, Max Sobol Mark, @YiMaTweets, @chelseabfinn Website: https://t.co/iZ1TKkaAqi Arxiv: https://t.co/Mz8fmhqgQm
2023-03-10 04:45:20 I particularly like this result because it shows (1) the high UTD training ideas in RLPD also transfer to other methods
2023-03-10 04:45:19 As a sidenote, the recently proposed RLPD method (w/ @philipjohnball, @ikostrikov, @smithlaura1028) <
2023-03-10 04:45:18 This effectively fixes the problem! Now the Q-function is on the right scale, and online finetuning makes it directly improve, instead of experiencing the "dip." https://t.co/dL2q6vY6yx
2023-03-10 04:45:17 This is very bad both for safety (really bad performance for a while) and learning speed (lots of time wasted for recovery). Fortunately, we can fix this with a very simple 1-line change to CQL!
2023-03-10 04:45:16 This is not an accident: what is happening is that CQL is underestimating during the offline phase, so once it starts getting online data, it rapidly "recalibrates" to the true Q-value magnitudes, and that "traumatic" recalibration temporarily trashes the nice initialization.
2023-03-10 04:45:15 The concept: it's very appealing to pretrain RL with offline data, and then finetune online. But if we do this with regular conservative Q-learning (CQL), we get a "dip" as soon as we start online finetuning, where performance rapidly drops before getting better. https://t.co/6DFTjlwSEr
2023-03-10 04:45:14 Can conservative Q-learning be used to pretrain followed by online finetuning? Turns out that naive offline RL pretraining leads to a "dip" when finetuning online, but we can fix this with a 1-line change! That's the idea in Cal-QL: https://t.co/iZ1TKkaAqi A thread https://t.co/pXt8jgg43N
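A sketch of my understanding of the "one-line change": in the CQL regularizer, clip the Q-values being pushed down so they never fall below a reference value such as the behavior policy's Monte Carlo return. This is illustrative pseudocode over plain tensors, not the released Cal-QL implementation.

```python
# Sketch of a Cal-QL-style "calibration" tweak to a CQL-style regularizer.
from typing import Optional

import torch

def cql_regularizer(q_ood: torch.Tensor,
                    q_data: torch.Tensor,
                    v_reference: Optional[torch.Tensor] = None) -> torch.Tensor:
    """q_ood:       Q(s, a') for sampled/out-of-distribution actions, [B, N]
    q_data:      Q(s, a) for dataset actions, [B]
    v_reference: reference value at s (e.g., a Monte Carlo return estimate), [B]
    """
    if v_reference is not None:
        # The "one-line change": don't push Q-values below the reference value,
        # so the offline Q-function stays calibrated for online finetuning.
        q_ood = torch.maximum(q_ood, v_reference.unsqueeze(-1))
    return (torch.logsumexp(q_ood, dim=-1) - q_data).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    q_ood, q_data, v_ref = torch.randn(4, 10), torch.randn(4), torch.zeros(4)
    print(float(cql_regularizer(q_ood, q_data)))          # uncalibrated term
    print(float(cql_regularizer(q_ood, q_data, v_ref)))   # calibrated variant
```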
2023-03-07 04:52:05 It's interesting to compare PaLM-E to PaLM-SayCan: https://t.co/zssoUzy9ep SayCan "stapled" an LLM to a robot, like reading the manual and then trying to drive the robot. PaLM-E is more like learning to drive the robot from illustrations and other media -- obviously way better.
2023-03-07 04:52:04 This allows PaLM-E to do the usual LLM business, answer questions about images, and actually make plans for robots to carry out tasks in response to complex language commands, commanding low-level skills that can do various kitchen tasks. https://t.co/eIRu8bjRyw
2023-03-07 04:52:03 PaLM-E ("embodied" PaLM) is trained on "multimodal sentences" that consist of images and text, in addition to normal LLM language training data. These multimodal sentences can capture visual QA, robotic planning, and a wide range of other visual and embodied tasks. https://t.co/wHozVM1Pwk
2023-03-07 04:52:02 What if we train a language model on images &
2023-02-23 21:34:57 RT @GoogleAI: Presenting Scaled Q-Learning, a pre-training method for scaled offline #ReinforcementLearning that builds on the conservative…
2023-02-23 21:32:58 A really fun collaboration with @aviral_kumar2, @agarwl_, @younggeng, @georgejtucker Arxiv paper here: https://t.co/SmBOijHdSj This will be presented as a long oral presentation at ICLR 2023.
2023-02-23 21:32:57 But of course the real test is finetuning performance. That works from both offline data on new games and online interaction from new game variants! https://t.co/vVWfw2DNDg
2023-02-23 21:32:56 The idea is very simple: pretrain a large ResNet-based Q-function network with conservative Q-learning (CQL) with several design decisions to ensure it learns at scale. Pretrain on ~40 games with highly suboptimal data, then finetune to new games with offline or online data. https://t.co/j2IwWUIYud
2023-02-23 21:32:55 Pretraining on large datasets is powerful, it enables learning new tasks quickly (e.g., from BERT, LLMs, etc.). Can we do the same for RL, pretrain &
2023-02-23 18:41:31 Cassie learns to jump with deep RL! Really fun collaboration with @ZhongyuLi4, @xbpeng4, @pabbeel, @GlenBerseth, Koushil Sreenath. https://t.co/g5KmuANxzq
2023-02-19 20:38:22 @natolambert Uh oh I should watch what I say when I'm being recorded
2023-02-16 23:13:55 RT @philipjohnball: @svlevine Thanks for the comments everyone! We’ve now updated the manuscript to include the great work by @yus167, and…
2023-02-16 16:57:13 If you want to try super-fast RL from prior data, @philipjohnball, @ikostrikov, @smithlaura1028 have now released the RLPD repo: https://t.co/o8369eQPuZ We've been using this in multiple robotics projects lately (stay tuned!), and RLPD works pretty great across the board. https://t.co/Ari63kw9Pu https://t.co/Su7RMClpWe
2023-02-16 16:37:14 @ThomasW423 This is not the most mainstream opinion, but I think MB and MF (value-based) methods are not that diff. Both are about prediction, and for real-world use, they are slowly converging (eg multi-task value functions =>
2023-02-16 03:39:08 Apparently there is a recording of the talk I gave on offline RL for robots, goal-directed dialogue, and human influence https://t.co/fSTjdXRWsc
2023-02-15 17:59:11 In practice, this works *especially* well with label noise, which really hurts methods that don't get known groups, since they often end up up-weighting mislabeled points rather than coherent difficult groups. https://t.co/Gw4MumnLuL
2023-02-15 17:59:10 In theory, this really does work: the "inductive bias" from having a simple function class determine the groups makes it possible to improve robustness of classifiers under group shifts w/ unknown groups w/o knowing the groups in advance. https://t.co/xxGv0dckv6
2023-02-15 17:59:09 We usually don't know groups, we just have a bunch of images/text &
2023-02-15 17:59:08 How can we learn robust classifiers w/o known groups? In Bitrate-Constrained DRO by @setlur_amrith et al., we propose an adversary should use *simple* functions to discriminate groups. This provides theoretically &
2023-02-15 17:44:48 @archit_sharma97 On a more serious note, I think the issue with "ImageNet moment" is that it's a bit of a category error. "Solving robotics" is like finding a cure for all viruses -- it's just too big. Robotics is by its nature integrative, it typically lacks clean problems like ImageNet or Go.
2023-02-15 17:41:33 @archit_sharma97 At least as a roboticist you can sleep soundly knowing that when the AI apocalypse comes the robots will be pretty clumsy. I still remember my favorite headline about the Google arm farm work: "when the robots take over, they'll be able to grab you successfully 84% of the time."
2023-02-15 07:57:35 @QuantumRamses It was a department colloquium talk he gave at Berkeley around 2017. I’m sure he gave it in a few other places too. About how we should stop calling RL RL, among other things.
2023-02-15 03:33:39 @ylecun @IanOsband @_aidan_clark_ @CsabaSzepesvari I would be quite OK with any of those. We could also just call it cybernetics
2023-02-14 17:53:51 @ylecun @IanOsband @_aidan_clark_ @CsabaSzepesvari We could even go with "learning-based optimization" (LBO) to capture the black-box things that are not really control, like chip design and neural architecture search. Perhaps these things deserve to be put under the same umbrella as there are fascinating technical commonalities.
2023-02-14 17:50:41 @ylecun @IanOsband @_aidan_clark_ @CsabaSzepesvari I would be perfectly happy if we can all agree on "learning control" (LC) as a general term and reserve the "reinforcement" bit for some more narrow special case, but everyone got confused when I tried to make that distinction, so I gave up and just call everything RL now.
2023-02-14 17:48:55 @ylecun @IanOsband @_aidan_clark_ @CsabaSzepesvari For better or worse, in modern ML, "RL" is basically a synonym for "learning-based control." I would actually much prefer the latter term (Vladlen Koltun has a great talk about this too!), but language evolves, and people use "RL" to mean a lot more than it originally meant.
2023-02-14 03:10:13 A fun collaboration studying how we can pretrain a variety of features that can then be adapted to distributional shifts. The key is to learn decision boundaries that are all predictive but orthogonal to each other, then use them as “features.” Works in theory and in practice. https://t.co/WwzpGYsmSR
2023-02-13 22:57:41 @hausman_k All my arguments are RL arguments in (thin) disguise
2023-02-13 19:48:44 @hausman_k That would be cool, though lately we're trying hard to make AI as anthropocentric as possible. Clearly we need less imitation learning and more autonomous learning our future robot overlords will wonder why we tried so hard to get them to imitate such imperfect creatures
2023-02-13 17:50:13 @hausman_k Whether we are "generalists" or not, we are clearly way better at some things than others, which determines how we see the world. Missteps in AI research make this starkly apparent, but how deeply do such biases permeate physics, biology, etc.? Maybe just as much...
2023-02-13 17:48:46 @hausman_k One of the most interesting takeaways from the history of AI is just how deeply "anthropocentric" biases permeate our thinking, from Moravec's paradox to the Bitter Lesson -- so much of what we think we know about the world is colored by the nature of our own "intelligence"...
2023-02-13 05:44:18 @danfei_xu I think so. That's part of why I wanted to talk about RL-based training of LMs in there. It may be that with the right training procedure, LMs would be a lot better at rational decision making than they are now.
2023-02-12 22:52:53 I was invited to give a talk about how AI can lead to emergent solutions. Not something I talk about much, so it was a challenge: a reflection on "The Bitter Lesson", ML with Internet-scale data, and RL. I figured I would record and share w/ everyone: https://t.co/sxpfdcwvgS
2023-02-11 17:40:52 @neu_rips @JesseFarebro Interestingly, what the ablation study in our paper shows is that the most crucial choice is *not* the symmetric sampling, it's actually layer norm. The symmetric sampling helps, but not enormously in our experiments in Sec 5.1; we say that it's not enough in Sec 4.1. https://t.co/P2yAnvaeHu
2023-02-11 17:21:11 @neu_rips @JesseFarebro Of course the particular execution, details, results, etc. in each work are novel, and that's kind of the point -- I think our paper is pretty up front about the fact that the interesting new thing is in the details of how parts are combined to get results, not the basic concept.
2023-02-11 17:16:02 @neu_rips @JesseFarebro These papers all do this: https://t.co/lcJW0d8xqN https://t.co/oncdFIcWyC https://t.co/2eT8wx5EaC This paper has the same buffer balancing trick (see App F3): https://t.co/6fDaq3NALN Probably others do too, Wen told me he first saw it in this paper: https://t.co/C80Lfvu5AW
2023-02-11 17:14:23 @neu_rips @JesseFarebro I agree it would be good to expand our related work to add discussion of this paper, as well as several others! Getting feedback like that is one of the reasons to post pre-prints on arxiv. That said, we're pretty clear that this is not even remotely a new idea ->
2023-02-10 21:14:53 @JesseFarebro Ah, good point! You're right, this is indeed jointly training with offline data, just like DDPGfD, DQfD, etc.
2023-02-10 03:15:15 by @seohong_park Web: https://t.co/QsccXWoeYn arxiv: https://t.co/PaROenizIm https://t.co/ihyX6yj4K0
2023-02-10 03:15:14 This works really well for downstream planning: here we show MPC-based control with the PMA model vs models from prior methods and baselines. The PMA model leads to much more effective plans because it makes much more accurate predictions. https://t.co/GtdmYp3NtF
2023-02-10 03:15:13 This has an elegant probabilistic interpretation: the original MDP is turned into an abstracted predictable MDP (hence the name), and a "low level" policy acts as a decoder that decodes predictable MDP actions into grounded actions. https://t.co/9PDWfMHoBa
2023-02-10 03:15:12 The key idea is to learn an *abstraction* of the original MDP that only permits actions whose outcomes are easy to predict. If the agent is only allowed to select among those actions, then learning a model becomes much easier. https://t.co/eDX5wHjt23
2023-02-10 03:15:11 PMA is an unsupervised pretraining method: we first have an unsupervised phase where we interact with the world and learn a predictable abstraction, and then a *zero-shot* model-based RL phase where we are given a reward, and directly use the model from unsupervised interaction. https://t.co/nM9FExxvcX
2023-02-10 03:15:10 Model-based RL is hard, b/c complex physics are hard to model. What if we restrict agent to only do things that are easy to predict? This makes model-based RL much easier. In PMA, @seohong_park shows how to learn such "predictable abstractions" https://t.co/QsccXWoeYn Thread https://t.co/xmXfrOecsT
2023-02-09 16:33:12 @JesseFarebro That paper appears to provide theoretical analysis of offline RL pretraining with online finetuning, something that has been studied in a number of papers (including the IQL algorithm in the comparisons above). That's an old idea, and in my opinion a good one.
2023-02-09 03:22:31 This is work by @philipjohnball, @ikostrikov, @smithlaura1028 You can read the paper here: https://t.co/ZdRDLKNhZy
2023-02-09 03:22:30 These might seem like details, but they lead to a *huge* boost in training, without the more complex approaches in other offline to online work. Here are mujoco Adroit and D4RL results and comparisons https://t.co/ii9bGzM6uw
2023-02-09 03:22:29 Choice 3: use high UTD and a sample-efficient RL method (typically something with stochastic regularization, like ensembles or dropout). This one is known from prior work (DroQ, REDQ, etc.), but it really makes a big difference here in quickly incorporating prior data.
2023-02-09 03:22:28 There are three main design decisions, which turn out to be critical to improve SAC to be a great RL-with-prior data (RLPD) method: Choice 1: each batch is sampled 50/50 from offline and online data This is an obvious one, but it makes sure that prior data makes an impact https://t.co/dTRigTP79T
2023-02-09 03:22:27 RL w/ prior data is great, b/c offline data can speed up online RL and overcome exploration (e.g., solve the big ant maze task below). It turns out that a great method for this is just a really solid SAC implementation, but details really matter https://t.co/ZdRDLKNhZy https://t.co/JrBbtCQsTf
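A sketch of the design choices called out in this thread (50/50 symmetric batches, a LayerNorm critic as highlighted in the 2023-02-11 replies above, and a high update-to-data ratio); buffer contents and network sizes are placeholders, not the RLPD codebase.

```python
# Sketch of RLPD-style design choices: symmetric batches + LayerNorm critic + high UTD.
import torch
import torch.nn as nn

def symmetric_batch(offline_buffer, online_buffer, batch_size: int = 256):
    """Choice 1: sample half of every batch from offline data, half from online data."""
    half = batch_size // 2
    idx_off = torch.randint(len(offline_buffer), (half,))
    idx_on = torch.randint(len(online_buffer), (half,))
    return torch.cat([offline_buffer[idx_off], online_buffer[idx_on]], dim=0)

# LayerNorm in the critic (flagged as the crucial choice in the replies above).
critic = nn.Sequential(
    nn.Linear(10, 256), nn.LayerNorm(256), nn.ReLU(),
    nn.Linear(256, 256), nn.LayerNorm(256), nn.ReLU(),
    nn.Linear(256, 1),
)

if __name__ == "__main__":
    offline = torch.randn(10_000, 10)   # stand-in transition features
    online = torch.randn(500, 10)
    batch = symmetric_batch(offline, online)
    # Choice 3: high update-to-data ratio -- many gradient steps per environment step.
    for _ in range(20):
        q = critic(batch)
    print(batch.shape, q.shape)
```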
2023-01-21 21:16:54 @shahdhruv_ @natolambert @xiao_ted @mhdempsey @andyzengtweets @eerac @jackyliang42 @hausman_k @vijaycivs I didn't even know we had V100s
2023-01-20 17:34:51 3 of the post authors are RAIL alumni @jendk3r was our RE, who in his brief tenure made everything run very smoothly and got a bunch of research done at the same time @ColinearDevin is now at DeepMind @GlenBerseth is prof at U of Montreal It was awesome working with you all!!
2023-01-20 17:34:50 The idea: run RL (after a bit of pretraining), continually picking up toys and dropping them back down to continue practicing. RL takes a while, but once it's autonomous, it can run for days and keep getting better! Web: https://t.co/DNSnEL7qbX Arxiv: https://t.co/RiPiTF3IJ0 https://t.co/WcEZ1kUUFe
2023-01-20 17:34:49 Some perspectives on autonomous lifelong RL for real-world robotics: https://t.co/tcfIHTOz6W This post, by @jendk3r, Charles Sun, @ColinearDevin &
2023-01-17 00:57:21 RL with language -- see my recent article here: https://t.co/WLn8ugzwgr ILQL (the algorithm I mention in the podcast): https://t.co/ssxXviA5wB Meta result on playing Diplomacy: https://t.co/bhxUJCLRCk
2023-01-17 00:57:20 I had a fun discussion with Sam Charrington for TWIML about machine learning in 2022 and things to watch in 2023. You can check it out below, with the YT link here: https://t.co/zGq3BHJjQE If you would like some references to some of the works I discuss, see links below: https://t.co/9fXBUTiIII
2023-01-12 17:23:33 @hausman_k @icmlconf I thought by "opposite" you were going to say that all papers are *required* to be generated by LLMs. The prompt must be submitted at the time of the abstract deadline. Bonus points for papers that use the same LLM that the generated paper is proposing.
2023-01-12 06:12:05 @r_mohan Should be quite possible to run on other robots (the quadcopter is a few hundred bucks...), it's just easiest to provide a clean wrapper for locobot + ROS as an example, but other robots that can read out images and accept inertial frame waypoints should work.
2023-01-11 19:02:05 We're releasing our code for "driving any robot", so you can also try driving your robot using the general navigation model (GNM): https://t.co/ohs8A3qnAw Code goes with the GNM paper: https://t.co/DWtvOWWd0b Should work for locobot, hopefully convenient to hook up to any robot https://t.co/P1kzajZl4u
2022-12-28 19:51:02 RT @hausman_k: SayCan is available in Roblox now! https://t.co/jaE07jB9vl You can play with an interactive agent supported by GPT-3
2022-12-27 23:01:44 Our new paper analyzes how turning single-task RL into (conditional) multi-task RL can lead to provably efficient learning without explicit exploration bonuses. Nice summary by @simon_zhai (w/ @qiyang_li &
2022-12-21 19:57:06 RT @riadoshi21: Just released our latest work: AVAIL, a method training a robot hand to perform complex, dexterous manipulation on real-wor…
2022-12-21 18:07:52 w/ @imkelvinxu, Zheyuan Hu, Ria Doshi, Aaron Rovinsky, @Vikashplus, @abhishekunique7 https://t.co/NzUtY2vSVx https://t.co/nsxGE6nL3v https://t.co/gSpYuU8cQr
2022-12-21 18:07:51 The experiments cover three tasks. The one shown above is the dish brush task. Another task is to grasp and insert a pipe connector (shown below), and another task involves learning to attach a hook to a fixture. https://t.co/KHVW8T0hXM
2022-12-21 18:07:50 The concept behind our system, AVAIL, is to combine a task graph for autonomous training with VICE for classifier-based rewards and efficient image-based end-to-end RL for control. https://t.co/3iioDf55t1
2022-12-21 18:07:49 We trained a four-finger robot hand to manipulate objects, from images, with learned image-based rewards, entirely in the real world. It can reposition objects, manipulate in-hand, and train autonomously for days. https://t.co/nsxGE6nL3v https://t.co/NzUtY2vSVx A thread: https://t.co/s3jp1JZq0A
2022-12-20 23:05:47 RT @xf1280: If you missed this here is a highlight reel of SayCan presentations and demos at CoRL 2022, enjoy! https://t.co/L8mKVH4Ubv
2022-12-16 23:22:10 The paper is here: https://t.co/KwmujHIZnk Also check out our other recent work on data-driven navigation! GNM: https://t.co/DWtvOWWd0b LM-Nav (also at CoRL!): https://t.co/EVsFOiKGfS
2022-12-16 23:22:09 To learn more, check out the video: https://t.co/DnftBErjd6 @shahdhruv_ will present this work at CoRL, 11 am (NZ time) Sun = 2 pm PST Sat for the oral presentation, 3:45 pm (NZ time) Sun for the poster! w/ @shahdhruv_ @ikostrikov Arjun Bhorkar, Hrish Leen, @nick_rhinehart
2022-12-16 23:22:08 This makes the planner prefer paths that it thinks will satisfy the RL reward function! In practice this method can absorb large amounts of data with post-hoc reward labeling, satisfying user-specified rewards and reaching distant goals, entirely using real data. https://t.co/xT3relwnBR
2022-12-16 23:22:03 Of course, we don't just want to stay on paths/grass/etc., but reach distant goals. It's hard to learn to reach very distant goals end-to-end, so we instead use a topological graph (a "mental map") to plan, with the RL value function as the edge weights. https://t.co/BREA7KgzXP
2022-12-16 23:21:59 The idea: we take a navigation dataset from our prior work, and post-hoc label it with some image classifiers for a few reward functions: staying on paths, staying on grass, and staying in sunlight (in case we need solar power). Then we run IQL offline RL on this. https://t.co/imwvzEwZCS
2022-12-16 23:21:53 Offline RL with large navigation datasets can learn to drive real-world mobile robots while accounting for objectives (staying on grass, on paths, etc.). We'll present ReVIND, our offline RL + graph-based navigational method at CoRL 2022 tomorrow. https://t.co/zA0WVJjHAT Thread: https://t.co/OYEDTXo4wA
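A toy illustration of the planning step described in this thread: build a topological graph over previously seen landmarks and search it with the (negated log) RL value as the edge cost. The node names and value table below are invented; only the graph-search pattern is the point.

```python
# Illustrative sketch: topological graph planning with an RL value function as
# edge weights (the nodes and the value table are hypothetical placeholders).
import math
import networkx as nx

def learned_value(a, b):
    # Stand-in for the offline RL value of traversing from node a to node b;
    # higher means the edge better satisfies the reward (e.g., stays on paths).
    table = {("start", "path"): 0.9, ("path", "lawn"): 0.6, ("lawn", "goal"): 0.8,
             ("path", "goal"): 0.3, ("start", "lawn"): 0.2}
    return table.get((a, b), 0.1)

G = nx.DiGraph()
for a, b in [("start", "path"), ("path", "lawn"), ("lawn", "goal"),
             ("path", "goal"), ("start", "lawn")]:
    # Turn the value into a nonnegative cost so shortest-path search prefers
    # high-value edges.
    G.add_edge(a, b, weight=-math.log(learned_value(a, b)))

plan = nx.shortest_path(G, "start", "goal", weight="weight")
print(plan)  # ['start', 'path', 'lawn', 'goal'] for the toy values above
```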
2022-12-16 23:12:32 RT @xf1280: Today I learned you can do a live demo during the oral session, and it felt great! Kudos to the entire team!
2022-12-16 23:12:07 RT @shahdhruv_: I'll be presenting LM-Nav at the evening poster session today at @corl_conf: 4pm in the poster lobby outside FPAA. Come fin…
2022-12-16 01:41:24 At CoRL 2022, @xf1280, @hausman_k, and @brian_ichter did a live demo of our SayCan system! Running RT-1 policy with PaLM-based LLM to fetch some chips in the middle of the CoRL oral presentation https://t.co/zssoUzy9ep https://t.co/ipoAEuuJvY https://t.co/0sNPBYVzYI
2022-12-14 20:29:18 RT @shahdhruv_: LangRob workshop happening now at #CoRL2022 in ENG building, room 401! Pheedloop and stream for virtual attendees: https:/…
2022-12-14 16:41:39 @JohnBlackburn75 This is trained entirely in the real world (except one experiment that is not shown here that studies what happens when we include simulation)
2022-12-14 03:04:22 And it's now on arxiv :) https://t.co/GqMEmRsYER
2022-12-13 19:04:06 RT @hausman_k: Introducing RT-1, a robotic model that can execute over 700 instructions in the real world at 97% success rate! Generalize…
2022-12-13 19:03:58 RT @xf1280: We are sharing a new manipulation policy that shows great multitask and generalization performance. When combined with SayCan,…
2022-12-13 19:03:06 Also worth mentioning that (finally) this one has open-source code: https://t.co/2W5dd83zJ8 (The other large-scale projects were very difficult to try to open source because of the complexity of the codebase that goes into these kinds of projects)
2022-12-13 18:59:20 This work is a culmination of a long thread of research on ultra-large-scale robotic learning, going back to 2015: arm farm: https://t.co/tQ58x23fTy QT-Opt: https://t.co/p2Lm3f9tEw MT-Opt: https://t.co/R7hNPwThts BC-Z: https://t.co/RQ5F2Y8HTM SayCan: https://t.co/zssoUzPcgp
2022-12-13 18:59:19 One of the experiments that I think is especially interesting involves incorporating data from our previous experiments (QT-Opt -- https://t.co/CptssIFS9R), which used a different robot, and showed that this can enable the EDR robot to actually generalize better! https://t.co/ZbkxM64FSE
2022-12-13 18:59:18 But perhaps even more important is the design of the overall system, including the dataset, which provides a sufficient diversity and breadth of experience to enable the model to generalize to entirely new skills, operate successfully in new kitchens, and sequence long behaviors. https://t.co/NuUmGZFCJP
2022-12-13 18:59:17 This thread by @hausman_k summarizes the design: https://t.co/XmYehq9fgV The model combines a few careful design decisions to efficiently tokenize short histories and output multi-modal action distributions, allowing it to be both extremely scalable and fast to run. https://t.co/SZcLZowTci
2022-12-13 18:59:16 New large-scale robotic manipulation model from our group at Google can handle hundreds of tasks and generalize to new instructions. Key is the right dataset and a Transformer big enough to absorb diverse data but fast enough for real-time control: https://t.co/iMnmxlnDWm >
2022-12-13 17:53:09 We've developed this line of work in a series of papers, culminating in fully learned systems that can drive for multiple kilometers off road, on roads, etc.: https://t.co/lY3FfMLR35 https://t.co/CljpgnJiOG https://t.co/co8FMAVJsC https://t.co/qqNxL7k3zm
2022-12-13 17:53:08 "Mental maps" can be built using these navigational affordances: take landmarks you've seen in the world, and determine which ones connect to which others using learned affordances. These maps are not geometric, more like "to get to the grocery store, I go past the gas station" https://t.co/L0KRi8gfzk
2022-12-13 17:53:07 Learning can give us both of these things. Navigational affordances should be learned from experience -- try to drive over different surfaces, obstacles, etc., and see what happens, learn how the world works, and continually improve. This is much more powerful. https://t.co/woHXy9U4hX
2022-12-13 17:53:06 The idea: navigation is traditionally approached as a kind of geometry problem -- map the world, then navigate to destinations. But real navigation is much more physical
2022-12-13 17:53:05 First, the entire special issue on 3D vision, our full article, and the author manuscript (if you can't access the paywall) are here (I'll get it on arxiv shortly too!). Full issue: https://t.co/Q06GB7RBuT Article: https://t.co/wRrV2TYJZP Manuscript: https://t.co/cnSnAooGby
2022-12-13 17:53:04 Learning can transform how we approach robotic navigation. In our opinion piece/review in Philosophical Transactions, @shahdhruv_ and I discuss some recent work and present an argument that experiential learning is the way to go for navigation: https://t.co/wRrV2TYJZP Short : https://t.co/pNNTcBn6Iz
2022-12-13 07:33:00 @gklambauer Count (# of times (s, a) has been visited, often approximated with density estimators in practice). C.f., count-based exploration. (but we should probably clarify this in the paper, good catch)
2022-12-13 05:58:55 This method does well offline, and the ability to adaptively adjust delta allows for improved online finetuning! w/ Joey Hong &
2022-12-13 05:58:54 There is a relatively elegant way to derive updates for training such a value function using variable levels of pessimism. We can also flip the sign and train *both* upper and lower bounds for all confidence levels. The CQL-algorithm that results can be summarized as: https://t.co/kAATV4iFB0
2022-12-13 05:58:53 The idea is to train a value function that is conditioned on "delta", a confidence level such that the value function is above the predicted value with that level of confidence. We can then train for *all* values of delta (sampled randomly during training). https://t.co/VrwanDFdez
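As a rough illustration of training one value function for *all* confidence levels, here is a quantile-regression-style sketch: the network takes delta as an extra input, and a pinball loss pushes its output toward the corresponding quantile of the return. The actual method is a CQL-style algorithm with a different loss; this only shows the delta-conditioning idea, and every name and dimension below is a toy placeholder.

```python
import torch
import torch.nn as nn

# Input is (state, delta); output is a value estimate at confidence level delta.
value_net = nn.Sequential(nn.Linear(4 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def pinball_loss(pred, target, tau):
    # Standard quantile (pinball) loss: pred converges to the tau-quantile of target.
    err = target - pred
    return torch.mean(torch.maximum(tau * err, (tau - 1.0) * err))

states = torch.randn(256, 4)            # toy states
returns = torch.randn(256, 1) + 1.0     # toy return targets
delta = torch.rand(256, 1)              # sample a confidence level per example
pred = value_net(torch.cat([states, delta], dim=-1))
loss = pinball_loss(pred, returns, tau=1.0 - delta)  # true return exceeds pred w.p. ~delta
opt.zero_grad(); loss.backward(); opt.step()
```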
2022-11-15 17:29:55 New talk discussing some perspectives on RL with real-world data: https://t.co/3KP7bquaRv Discusses how offline RL can be applied to large-scale data-driven learning problems for robots, large language models, and other application domains.
2022-10-31 17:33:16 We evaluate on a wide range of test connectors (using a hold-one-out cross-validation procedure), and find that the method can consistently finetune even to pretty tricky insertion scenarios. https://t.co/F6W7Z8h7zF
2022-10-31 17:33:11 This kind of autonomous online finetuning bootstrapped from offline RL is far more practical than conventional RL, which requires either a huge number of trials or engineering-heavy sim-to-real solutions. I think in the future, all robotic RL systems will use offline pretraining! https://t.co/MIXG7yGFsK
2022-10-31 17:33:08 The finetuning is autonomous, and requires no additional human-provided information: the intuition is that the reward model generalizes better than the policy (it has an easier task to solve), and hence the policy can finetune from offline initialization.
2022-10-31 17:33:07 The method uses the IQL algorithm to train on an offline dataset of 50 connectors, and also learn a reward model and a novel domain-adaptive representation to facilitate generalization. For each new connector, the reward model provides image-based rewards that enable finetuning. https://t.co/X9ocEWkZ4J
2022-10-31 17:33:05 Offline RL initialization on diverse data can make online robotic RL far more practical! In our new paper, we show that this works great for industrial connector insertion, pretraining on dozens of connectors and finetuning autonomously to a new one! A thread: https://t.co/liXSKArTNI
2022-10-24 01:28:40 RT @KuanFang: We will present Planning to Practice (PTP) at IROS 2022 this week. Check out our paper if you're interested in using visual…
2022-10-20 19:16:52 @chrodan Yup, I think in many cases just reward shaping without bonuses is enough, hence bonuses are often not used (to be clear, the "bonuses" in our paper are just reward shaping multiplied by a count-based discount, so that the shaping goes away over time to recover unbiased sol'n)
2022-10-20 02:11:15 RT @abhishekunique7: Excited about our work on understanding the benefits of reward shaping! Reward shaping is critical in a large portion…
2022-10-19 17:24:01 I think this is quite an important and understudied area: so much RL theory is concerned with complexity w/ exploration bonuses, and so much RL practice eschews bonuses in favor of shaped rewards and other "MDP engineering." This paper aims to bridge theory and practice.
2022-10-19 17:24:00 We show formally that reward shaping improves sample efficiency in two ways: (1) with appropriate (freq-based) weighting, reward shaping terms can act as "informed exploration bonuses", biasing toward novel regions that have high shaping rewards while being unbiased in the limit
2022-10-19 17:23:59 In theory RL is intractable w/o exploration bonuses. In practice, we rarely use them. What's up with that? Critical to practical RL is reward shaping, but there is little theory about it. Our new paper analyzes sample complexity w/ shaped rewards: https://t.co/27wWFe9B8m Thread: https://t.co/iQ3dA5rXQi
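A minimal sketch of the count-decayed shaping described above: the shaping term is scaled by a factor that shrinks with the visitation count, so the bonus vanishes in the limit and the unbiased objective is recovered. The 1/N schedule and names here are illustrative, not the paper's exact weighting.

```python
# Toy sketch: shaping multiplied by a count-based decay, so it fades with visits.
from collections import defaultdict

visit_counts = defaultdict(int)

def shaped_reward(s, a, env_reward, shaping):
    """env_reward: true task reward; shaping: hand-designed shaping term F(s, a)."""
    visit_counts[(s, a)] += 1
    decay = 1.0 / visit_counts[(s, a)]       # goes to 0 as the pair is revisited
    return env_reward + decay * shaping

# First visit gets the full shaping bonus, later visits progressively less.
print(shaped_reward("s0", "left", 0.0, 0.5))  # 0.5
print(shaped_reward("s0", "left", 0.0, 0.5))  # 0.25
```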
2022-10-19 03:36:44 Single life reinforcement learning -- when you have to complete the task in one (long) episode at all costs. Fun new paper with @_anniechen_, @archit_sharma97, @chelseabfinn https://t.co/m1p24MM4k0 https://t.co/uFcuYuo6z5
2022-10-18 16:34:57 Large pre-trained models (ViT, BERT, GPT, etc.) are a starting point for many downstream tasks. What would large pre-trained models look like in robotics, and how do we build them? I discuss this in my new article: https://t.co/d8lNPaQoxs Short summary:
2022-10-17 17:50:36 w/ @HiroseNoriaki, @shahdhruv_, Ajay Sridhar arxiv: https://t.co/LNJhXR9BjM website: https://t.co/EWdOhUQeu0 video: https://t.co/KbO9QFJi0S
2022-10-17 17:50:25 This lets us drive a *wider* range of robots than what was seen in the data, resulting in policies that generalize over camera placement, robot size, and other parameters, while avoiding obstacles. Uses similar multi-robot dataset as our recent GNM paper: https://t.co/DWtvOXdg2b https://t.co/eAYm73HB9k
2022-10-17 17:50:12 Experience augmentation (ExAug) uses 3D transformations to augment data from different robots to imagine what other robots would do in similar situations. This allows training policies that generalize across robot configs (size, camera placement): https://t.co/EWdOhUQeu0 Thread: https://t.co/llRj2tY7Ef
2022-10-14 23:58:20 RT @KuanFang: Check out our CoRL 2022 paper on leveraging broad prior data to fine-tune visuomotor policies for multi-stage robotic manipul…
2022-10-14 21:54:20 Besides FLAP, our recent work on large-scale robotic learning includes: PTR (offline RL pretraining to learn from a few demos): https://t.co/vUFWTcl5xH GNM (learning to drive any robot from diverse multi-robot data): https://t.co/Y58VidiSbs
2022-10-14 21:54:12 w/ @KuanFang, Patrick Yin, @ashvinair, @HomerWalke, Gengchen Yan arxiv: https://t.co/6tMygFkm3i website: https://t.co/6tVoynrihW This caps off our week of releasing this latest round of large-scale robotic learning papers! https://t.co/vS8QUGF1O6
2022-10-14 21:53:52 FLAP can enable the robot to perform complex tasks, like moving the pot to the stove and putting a bunny in the pot (mmm...). The method is pretrained on the bridge dataset, which we collected last year for our large scale robotic learning research: https://t.co/fGG35ikRus https://t.co/8nWnp64YX9
2022-10-14 21:53:29 That's where finetuning comes in: the subgoals give the robot a "trail of breadcrumbs" to follow to practice the task end-to-end with online RL, getting better and better. Thus, if the representation &
2022-10-14 21:53:08 The idea: learn a goal-conditioned policy, which in the process recovers a lossy latent space for planning. A multi-step predictive model in this latent space can plan subgoals in a new scene to do a task. Crucially, the robot might not succeed at following those subgoals! https://t.co/BvvIZszlvs
2022-10-14 21:52:46 Pretraining on diverse robot data with offline RL allows planning multi-stage tasks, and then finetuning those tasks with end-to-end exploration. Our new paper, FLAP, describes a method for doing this with real robots! https://t.co/6tMygFkm3i https://t.co/6tVoynrihW Thread: https://t.co/ZiKmO95yJL
2022-10-12 20:00:50 Check out the website: Get the code here to try it yourself: https://t.co/DqDKdkdpi7 (supports multi-GPU &
2022-10-12 20:00:41 PTR also gets better as we add more parameters, with larger networks reaching significantly better performance. We might expect here that with even more data and even more parameters, we might get even better performance! https://t.co/i9LgLn4aYV
2022-10-12 20:00:29 PTR works for a variety of tasks in similar environments as Bridge Data (opening doors, placing objects into bowls). It makes sense -- if you want to learn skills with RL, you should pretrain with (multi-task) RL, learning representations aware of dynamics and causal structure. https://t.co/eYXECCx564
2022-10-12 20:00:09 Critically, the same exact offline RL method is used for pretraining as for finetuning! In experiments, this leads to better performance, learning tasks with as few as 10 demonstrations (and no other data!). Better pretraining than BC, etc.: https://t.co/tuHa0Lrgvt
2022-10-12 19:59:57 PTR (pre-training for robots) is a multi-task adaptation of CQL for multi-task robotic pretraining. We pretrain a policy conditioned on the task index on a multi-task dataset (we use the Bridge Dataset: https://t.co/fGG35ikRus). Then we finetune on a new task.
2022-10-12 19:59:47 How should we pretrain for robotic RL? Turns out the same offline RL methods that learn the skills serve as excellent pretraining. Our latest experiments show that offline RL learns better representations w/ real robots: https://t.co/vUFWTcl5xH https://t.co/9Q9oUdXefT Thread >
2022-10-10 18:30:38 To learn more about GNMs, check out our paper here: https://t.co/7bHqgh0yZ7 Website: https://t.co/Y58Vid1P9s Video: https://t.co/HJHXYRKKeu w/ @shahdhruv_, Ajay Sridhar, Arjun Bhorkar, @HiroseNoriaki
2022-10-04 04:41:30 Great opportunity for anyone who wants to tackle deep and fundamental problems in RL https://t.co/GIFGutS7KD
2022-10-04 03:10:47 @DhruvBatraDB Asking "why not also use X" kind of misses the point, doesn't it? And it's a phrase that anyone who worked on deep learning in the early days probably heard numerous times. There is value in fundamental, general, and broadly applicable learning principles.
2022-10-03 22:33:08 RT @sea_snell: Great to see ILQL making it into the TRLX repo! You can now train ILQL on very large scale language models. See
2022-10-03 16:52:34 Or to summarize more briefly: Perhaps the motto should be: “use data as though the robot is a self driving car, and collect data as though the robot is a child”
2022-10-03 16:16:49 Yet people can learn without simulators, and even without the Internet. Even what we call "imitation" in robotics is different from how people imitate. As for data, driving looks different from robotics b/c we don't yet have lots of robots, but we will: https://t.co/VSp5otL6Ml https://t.co/YslnxBwqCS
2022-09-20 16:41:01 A model-based RL method that learns policies under which models are more accurate (in a learned representation space). This provides a single unified objective for training the model, policy, and representation, such that they work together to maximize reward. https://t.co/HMIR9sYGQV
2022-09-19 17:43:20 This blog post summarizes our ICML 2022 paper: arxiv: https://t.co/STGj2CD3no website: https://t.co/tx1RlPjukm my talk on this: https://t.co/J7qlep6gfm Katie's talk at ICML: https://t.co/2cZzAcPoaT
2022-09-19 17:40:43 Interested in what density models (e.g., EBMs) and Lyapunov functions have in common, and how they can help provide for safe(r) reinforcement learning? @katie_kang_'s new BAIR blog post provides an approachable intro to Lyapunov density models (LDMs): https://t.co/MiHkzw2us0
2022-09-14 01:03:59 RT @ZhongyuLi4: I think the coolest thing of GenLoco is that, if there is a new quadrupedal robot developed, we can just download the pre-t…
2022-09-13 18:09:53 The concept: train on randomized morphologies with a controller conditioned on a temporal window. With enough morphologies, it generalizes to new real-world robots! w/ Gilbert Feng, H. Zhang, @ZhongyuLi4, @xbpeng4, B. Basireddy, L. Yue, Z. Song, L. Yang, Y. Liu, K. Sreenath https://t.co/NO6cS1FdGo
2022-09-13 18:06:12 Can we train a *single* policy that can control many different robots to walk? The idea behind GenLoco is to learn to control many different quadrupeds, including new ones not seen in training. https://t.co/XDYODDXykR Code: https://t.co/Q4cEs5OcQ3 Video: https://t.co/xyHbeD52Ve https://t.co/gYCvVCDsLY
2022-08-31 17:59:34 Article in the New Scientist about @smithlaura1028 &
2022-08-23 16:49:29 But do make sure to keep an eye on the robot and be ready to stop it... we had an attempted robot uprising once that took out a window. I guess now we have the dubious distinction of saying that our building has been damaged by out-of-control autonomous robots.
2022-08-23 16:48:25 And now you can grab the code and train your own A1 to walk! If you have an A1 robot, this should run without any special stuff, except a bit of determination. https://t.co/9Z8nnFkiEH
2022-08-23 14:55:08 @pcastr On the heels of NeurIPS rebuttals, AC’ing, etc. — shows how jaded I’ve become when I thought this was going to be a joke with the punchline “strong reject”
2022-08-22 16:41:48 I wrote an article about how robotics can help us figure out tough problems in machine learning: Self-Improving Robots and the Importance of Data https://t.co/VSp5otL6Ml
2022-08-19 02:24:19 @dav_ell @ikostrikov We'll get the code released shortly so you can see for yourself :) but there is a bit of discussion in the arxiv paper
2022-08-18 03:58:36 BTW, much as I want to say we had some brilliant idea that made this possible, truth is that the key is really just good implementation, so the takeaway is "RL done right works pretty well". Though I am *very* impressed how well @ikostrikov &
2022-08-18 01:14:11 By Laura Smith &
2022-08-18 01:13:53 Here are some examples of training (more videos on the website). Note that the policy on each terrain is different, dealing with dense mulch, soft surfaces, etc. With these training speeds, the robot adapts in real time. https://t.co/b99PDlXFpK
2022-08-18 01:13:34 Careful implementation of actor-critic methods can train very fast if we set up the task properly. We trained the robot entirely in the real world in both indoor and outdoor locations, each time learning policies in ~20 min. https://t.co/tC1nR664hf
2022-08-18 01:13:13 RL is supposed to be slow and inefficient, right? Turns out that carefully implemented model-free RL can learn to walk from scratch in the real world in under 20 minutes! We took our robot for a "walk in the park" to test out our implementation. https://t.co/unBt7nouia thread ->
2022-08-17 05:25:37 RT @hausman_k: We have some exciting updates to SayCan! Together with the updated paper, we're adding new resources to learn more about thi…
2022-08-17 05:01:10 Fun article on Reuters about SayCan: https://t.co/SpSaEM0Iar (that's @xf1280 in the picture) But it's really not *just* a soda-fetching robot, and my other thread about the difficulty of adding skills notwithstanding, it does get new skills added regularly.
2022-08-17 04:33:49 But these are big open problems. What I think is great about SayCan is that it clearly shows that if we can just get enough robot skills, composing them into complex behaviors is something that state-of-the-art LMs can really help us to do. So the work is cut out for us :)
2022-08-17 04:33:39 At Berkeley, we also looked at... learning from automatically proposed goals: https://t.co/Jq03a0M0FP learning by playing two-player games: https://t.co/EtNpMR2dy8 Even learning by translating videos of humans! https://t.co/2nP5zpjQ8O
2022-08-17 04:33:28 There are *a lot* of ways that people have thought about this, including us. For ex, just at Google we studied: learning from human demos: https://t.co/rV1BsZIK0n learning with all possible goals: https://t.co/daYtlhy6sf learning from reward classifiers: https://t.co/R7hNPwThts
2022-08-17 04:33:17 But for all of this to work, we need the skills on which to build this, and getting these at scale is an open problem. SayCan used imitation learning, but automated skill discovery, real-world RL, and mining skills from humans are all important directions to take this further...
2022-08-17 04:33:06 In a sense, the robot represents the "world" to the LM in terms of the value functions of its skills, following a scheme we developed in the VFS paper: https://t.co/yJqJwwCT6r https://t.co/sNneWnCj9J
2022-08-17 04:32:54 Language models provide us with a powerful tool to access a kind of "smart knowledge base," but the challenge here is to parse this knowledge base correctly. SayCan does this with a joint decoding strategy to find skills that are likely under the LM and feasible for the robot. https://t.co/GWTrjN4fzk
2022-08-17 04:32:39 SayCan was a fun project: turns out connecting natural language to a robot works well once the robot has a broad repertoire of skills. Though we still have a lot to do in terms of making it scalable and automatic to equip robots with even broader skill repertoires. A thread: https://t.co/FkNNCP6tZN
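A toy sketch of the joint decoding step mentioned above: score each candidate skill by the product of its likelihood under the language model and its affordance (value) in the current state, and execute the highest-scoring skill. The skills, probabilities, and values below are made up.

```python
# Illustrative sketch: combine LM likelihood with skill affordances/values.
lm_prob = {          # stand-in for p_LM(skill | instruction)
    "pick up the chips": 0.55,
    "go to the counter": 0.30,
    "pick up the sponge": 0.15,
}
affordance = {       # stand-in for the value of each skill in the current state
    "pick up the chips": 0.10,   # chips not reachable yet
    "go to the counter": 0.85,
    "pick up the sponge": 0.40,
}

best_skill = max(lm_prob, key=lambda skill: lm_prob[skill] * affordance[skill])
print(best_skill)    # "go to the counter": likely under the LM *and* feasible now
```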
2022-08-10 04:10:20 Successor features can enable a really simple and effective "meta-IRL" style method that can infer reward functions from just a few trials by using prior experience of solving RL tasks! by @marwaabdulhai &
2022-08-09 15:59:41 RT @xbpeng4: We've released the code for ASE, along with pre-trained models and awesome gladiator motion data from @reallusion!https://t.c…
2022-08-03 17:55:04 A fun project on training quadrupedal robots to score a goal: https://t.co/gYcQxpY7za w/ Yandong Ji, @ZhongyuLi4, Yinan Sun, @xbpeng4, @GlenBerseth, Koushil Sreenath https://t.co/HW7ExqTfvd
2022-07-19 03:56:37 Many of us will also be at the workshops! Hope to see you at #ICML2022
2022-07-19 03:56:13 To enable goal-conditioned policies to make analogies, @phanseneecs, @yayitsamyzhang, @ashvinair will present "Bisimulation Makes Analogies": Thu 21 Jul 4:45 p.m. — 4:50 p.m. EDT, Room 309 https://t.co/kNftFEfL8O https://t.co/68Kvb7ejGs https://t.co/FeB56WRzAa https://t.co/q7w5T2miAa
2022-07-19 03:55:49 We'll describe how unlabeled data can be included in offline RL just by setting rewards to zero (!) (@TianheYu, @aviralkumar2): Thu 21 Jul 4:40 pm - 4:45 pm ET, Room 309 https://t.co/UJAvKEJ9k0 https://t.co/23sEday3ae https://t.co/aMRqaNzq6r
2022-07-19 03:55:28 We'll present a suite of benchmarks for data-driven design (model-based optimization), called Design Bench (@brandontrabucco, @younggeng, @aviral_kumar2): Thu 21 Jul 1:50 p.m. — 1:55 p.m. ET https://t.co/GBsQUTjMoO https://t.co/JQs4K8jn9z https://t.co/6rlzrO3kLe https://t.co/CfeswO2x9G
2022-07-19 03:55:13 To enable rapid learning of new tasks, @Vitchyr will present SMAC, an algorithm that meta-trains from offline data and unlabeled online finetuning: Thu 21 Jul 10:50 a.m. — 10:55 a.m. PDT https://t.co/bwzIQG7Tcy https://t.co/XS9Hn6jF5U https://t.co/mtgZTivlVs https://t.co/XZTQaJsxeI
2022-07-19 03:54:57 We'll present Diffuser (@michaeljanner, @du_yilun), which uses diffusion models for long-horizon model-based RL, in a full-length oral presentation: Wed 7/20 1:45-2:05 pm ET, Room 307, RL track https://t.co/HdRTMNmtme https://t.co/ViPDTvCWfh https://t.co/BKs9EuoP5G https://t.co/ktgIpoBXJh
2022-07-19 03:54:44 Katie Kang (@katie_kang_) will present Lyapunov Density Models, which enable safety by mitigating distributional shift: Tue Jul 19 02:35 PM -- 02:40 PM (EDT) @ Room 309 https://t.co/4sWYHloK6B https://t.co/STGj2CU6po https://t.co/tx1RlPAxmm https://t.co/PMIDb2syXx
2022-07-19 03:54:30 We'll discuss how offline RL policies must be adaptive to be optimal (@its_dibya, Ajay, @pulkitology), and introduce a Bayes-adaptive offline RL method, APE-V: Tue 19 July 2:15 - 2:35 PM EDT @ Room 309 https://t.co/x14skPtkJY https://t.co/3B0MvqY0jp https://t.co/4NfMrxxcf1 https://t.co/y6nRmglpfh
2022-07-19 03:54:02 Students &
2022-07-16 02:07:40 RT @lgraesser3: So excited to share i-S2R in which we train robots to play table tennis cooperatively with humans for up to 340 hit rallie…
2022-07-14 19:35:09 We have several large-scale offline RL robotics papers coming soon, and some already out this year: https://t.co/8Is9AoyTif A number of other groups are also making great progress on this topic, e.g.: https://t.co/mZo782IBp8 https://t.co/P9myggpBLz https://t.co/UCueqbH4sn
2022-07-14 19:34:50 This work is a step toward general-purpose pre-trained robot models that can finetune to new tasks, just like big vision &
2022-07-14 19:34:39 by @HomerWalke, Jonathan Yang, @albertyu101, @aviral_kumar2, Jedrzej Orbik, @avisingh599 Web: https://t.co/Za9qhENqKw Paper: https://t.co/i2AGx7Llaf Video: https://t.co/J45Dzk3lRQ
2022-07-14 19:34:25 Here is an example of some reset-free training. Of course, the new tasks have to have some structural similarity with the prior tasks (in this case, pick and place tasks, ring on peg, open/close drawer etc.). https://t.co/T7yj9YQT7r
2022-07-14 19:33:53 The idea in ARIEL is to use a large prior dataset with different tasks (in this case with scripted collection) to initialize a forward and backward policy that learns to both perform a task and reset it, as shown below. The task can be learned from scratch, or with a few demos. https://t.co/YLHJFaThMf
2022-07-14 19:33:42 Don't Start From Scratch: good advice for ML with big models! Also good advice for robots with reset-free training: https://t.co/Za9qhENqKw ARIEL allows robots to learn a new task with offline RL pretraining + online RL w/ forward and backward policy to automate resets. Thread: https://t.co/3yAkLhfZc4
2022-07-14 01:22:04 See thread by @hausman_k (and website above): https://t.co/hpdHN5Mqrh w/ @wenlong_huang, @xf1280, @xiao_ted, @SirrahChan, @jackyliang42J, @peteflorence, @andyzengtweets, @JonathanTompson, @IMordatch, @YevgenChebotar, @psermanet, Brown, Jackson, Luu, @brian_ichter, @hausman_k
2022-07-13 18:48:02 See thread by @hausman_k (and of course the website above): https://t.co/hpdHN5uP2H w/ @wenlong_huang, @xf1280, @SirrahChan, @jackyliang42J, @peteflorence, @andyzengtweets, @JonathanTompson, @IMordatch, @YevgenChebotar, @psermanet, Brown, Jackson, Luu, @brian_ichter, @hausman_k
2022-07-13 18:46:22 Robots can figure out complex tasks by talking... to themselves. In "Inner Monologue" we show how LLMs guide decision making and perception through a "monologue" with themselves, perception modules, and even dialogue with humans: https://t.co/snpSj8fPeS https://t.co/7Q8ztC9jO7 https://t.co/4Sk4G6xN7m
2022-07-13 07:30:08 @SOURADIPCHAKR18 I only found I’m speaking two days ago, a bit short notice :) I’m just the substitute...
2022-07-13 05:03:36 I'll be delivering this talk tomorrow at the "New Models in Online Decision Making" workshop: https://t.co/7QJ3BkcShy If you're attending the workshop, come watch it live (well, live over Zoom)! 11:00 am CT/9:00 am PT
2022-07-13 05:03:29 Happy to share a talk on Lyapunov density models and epistemic POMDPs for offline RL. Describes how we can use offline data to get safety and rapid online learning. Covers LDM: https://t.co/tx1RlPAxmm And APE-V: https://t.co/4NfMrxxcf1 See the talk here: https://t.co/hf9kZJInxN
2022-07-12 02:32:52 If you want to play with LM-Nav in the browser, check out the colab demo here: https://t.co/GKrOr24hU6 Of course, our colab won't make a robot materialize and drive around, but you can play with the graph and the LLM+LVM components
2022-07-12 02:31:43 LM-Nav uses the ViKiNG Vision-Navigation Model (VNM): https://t.co/oR1bplIvMt This enables image-based navigation from raw images. w/ @shahdhruv_, @blazejosinski, @brian_ichter Paper: https://t.co/ymU9kALQ9G Web: https://t.co/EVsFOj1JhS Video: https://t.co/QRhQDAZydM
2022-07-12 02:30:24 This results in fully end-to-end image-based autonomous navigation directly from user language instructions. All components are large pretrained models, without hand-engineered localization or mapping systems. See example paths below. https://t.co/phhVOgvvSP
2022-07-12 02:30:11 It uses a vision-language model (VLM, CLIP in our case) to figure out which landmarks extracted from the directions by the LLM correspond to which images in the graph, and then queries the VNM to determine the robot controls to navigate to these landmarks. https://t.co/3gOOeFUaUW
2022-07-12 02:29:57 LM-Nav first uses a pretrained language model (LLM) to extract navigational landmarks from the directions. It uses a large pretrained navigation model (VNM) to build a graph from previously seen landmarks, describing what can be reached from what. Then... https://t.co/lCnf7Cqru4
2022-07-12 02:29:48 Can we get robots to follow language directions without any data that has both nav trajectories and language? In LM-Nav, we use large pretrained language models, language-vision models, and (non-lang) navigation models to enable this in zero shot! https://t.co/EVsFOj1JhS Thread: https://t.co/lL76BlNVst
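As a rough sketch of the VLM step in this pipeline: match each LLM-extracted landmark phrase to the graph node whose image embedding is most similar to the phrase's text embedding. The embeddings below are random placeholders standing in for CLIP features, and the landmark/node names are invented.

```python
# Illustrative sketch of landmark-to-node matching with precomputed embeddings.
import numpy as np

rng = np.random.default_rng(0)
landmark_text_emb = {"stop sign": rng.normal(size=512), "picnic table": rng.normal(size=512)}
node_image_emb = {f"node_{i}": rng.normal(size=512) for i in range(20)}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

matches = {
    landmark: max(node_image_emb, key=lambda n: cosine(text_emb, node_image_emb[n]))
    for landmark, text_emb in landmark_text_emb.items()
}
print(matches)  # landmark -> most similar graph node; the VNM then plans through them
```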
2022-07-09 21:24:20 A (non-technical) talk about Moravec's paradox, the state of AI, and why robot butlers are hard to build: https://t.co/EBcoEbivMj
2022-07-07 20:24:33 @michal_pandy You can think of this (roughly) as an analysis of Bayesian RL in the offline setting, with a particular algorithmic instantiation based on ensembles. The trick is that the ensemble still needs to be aware of the belief state (which is updated incrementally).
2022-07-07 17:09:51 Paper: https://t.co/3B0MvqY0jp by @its_dibya + A. Ajay, @pulkitology, me. This builds on the ideas we previously introduced in our work on the epistemic POMDP: https://t.co/u5HaeJPQ29
2022-07-07 17:09:39 We can instantiate this by using an ensemble to track uncertainty about Q-functions, and conditioning the policy on the ensemble weights (i.e., the belief state). Adapting these weights via Bayes filtering updates leads to improved performance! https://t.co/kZl1WbFQUs
2022-07-07 17:09:24 Formally, the finite sample size in offline RL makes any MDP actually a POMDP, where partial observability is due to epistemic uncertainty! The state of this POMDP includes the belief over what the MDP really is, which we can update as we run the policy. https://t.co/UxEkCUvy1C
2022-07-07 17:09:10 Another example: we have sparse data on narrow roads, dense data on big roads. We should try the narrow roads to see if they provide a "shortcut," but if they don't, then we update our policy and go with the safer option. Again, adaptive strategies win over any conservative one. https://t.co/vbyLIPrwGr
2022-07-07 17:08:56 The right answer is to adapt the policy after trying the first door, thus updating the agent's posterior about what kind of MDP it is in -- basically, seeing that the door didn't open gives more information beyond what the agent had from the training set.
2022-07-07 17:08:48 This is *not* the same as regular classification, since it gets multiple guesses. The optimal policy (try doors in order) is *not* optimal in the training data, where it's easy to memorize the classes of each image. Optimal strategy is only apparent if accounting for uncertainty.
2022-07-07 17:08:41 Here is an example: there are 4 doors, the agent gets a (CIFAR) image (lower-left), and can only open the door corresponding to the class of the image. If it knows the label, that's easy. But if it's uncertain? It should "try" doors in order of how likely they are to be right. https://t.co/yGqzvfjzDu
2022-07-07 17:08:26 Offline RL methods have to deal with uncertainty due to fixed-size datasets. Turns out that it is provably better to recover *adaptive* policies rather than a static conservative policy. We study this setting in our new paper: https://t.co/3B0MvqY0jp Thread w/ intuition: https://t.co/HHhkmjoJwv
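A small sketch of the adaptive idea in this thread, assuming an ensemble of hypotheses about the MDP: keep a belief over ensemble members and update it with a Bayes-filter step as transitions come in; the policy would then be conditioned on this belief. The likelihood model below is a made-up placeholder, not the APE-V implementation.

```python
# Toy sketch: Bayes-filter belief update over an ensemble of MDP hypotheses.
import numpy as np

n_members = 5
belief = np.ones(n_members) / n_members          # uniform prior over hypotheses

def transition_likelihood(member, s, a, s_next):
    # Stand-in for how well ensemble member `member` explains the observed
    # transition (e.g., Gaussian likelihood of its predicted next state).
    rng = np.random.default_rng(member)
    return float(np.exp(-np.sum((s_next - rng.normal(size=s.shape)) ** 2) / 10.0))

def bayes_update(belief, s, a, s_next):
    likes = np.array([transition_likelihood(m, s, a, s_next) for m in range(n_members)])
    post = belief * likes
    return post / post.sum()

s, a, s_next = np.zeros(3), 0, np.ones(3) * 0.1
belief = bayes_update(belief, s, a, s_next)
print(belief)   # the adaptive policy would be conditioned on this belief vector
```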
2022-07-07 16:55:12 RT @its_dibya: What does an offline RL policy need to do to be maximally performant? In our new paper, we discovered that being optimal…
2022-07-07 03:29:53 Also very appropriate that @Mvandepanne is receiving the Computer Graphics Achievement Award the same year that his former student gets the Dissertation Award. Very well deserved, Michiel's papers were a big inspiration for me in my own PhD as well.
2022-07-07 03:28:06 Congratulations @xbpeng4 on receiving the SIGGRAPH Outstanding Dissertation Award! Awesome recognition for making virtual characters robustly recover from being pelted with big boxes, small boxes, and medium boxes, and sometimes making them hit back https://t.co/MIIYJ1cpJs https://t.co/NmXlxHkdgO
2022-07-06 16:45:59 If you are applying for PhD and want to create the most awesome animation methods and the most lifelike robots with one of the leading researchers in the field, Jason is starting a new lab at SFU!! https://t.co/IXcQqy39Jr
2022-07-06 03:39:31 A simple fixed point differentiation algorithm that can allow significantly more effective training of models that contain iterative inference procedures (e.g., models for object representations). https://t.co/W8O3qoNLUs
2022-07-01 01:45:17 If you want to check out the talk, you can find it here (along with a recording of the entire workshop!): https://t.co/E7HE1MtgxY Covers our recent work on offline RL + robotics, including a few soon-to-be-released papers! https://t.co/IbdgptCHYv
2022-06-29 18:15:21 If you're at #RSS2022, check out Yanlai Yang's talk on the Bridge Dataset, a dataset of 7k+ demos with 71 tasks in many kitchens + an evaluation of how such a large dataset can boost generalization for new tasks! 11:05 am ET/Thu/Poster 12. More here: https://t.co/JbIbC9X1I1 https://t.co/VmPT9e27nX
2022-06-28 21:00:01 @pcastr It may be that the similarity is more than just superficial, as similar math can explain both to some degree (though I suppose that's true for any two things if interpreted broadly enough): https://t.co/qx4ArX5ZTu
2022-06-26 21:19:38 @truman8hickok Let me see, hopefully it's recorded by the organizers, but if not, I think I can post my practice talk &
2022-06-26 19:32:29 I'll be speaking at #LDOD at #RSS2022 tmrw (Mon 6/27), 9:00am ET. I'll cover a few brand new large-scale robotic learning works! Here is a little preview: New Behaviors From Old Data: How to Enable Robots to Learn New Skills from Diverse, Suboptimal Datasets https://t.co/59wkTUojOv https://t.co/uKcctIGiEs
2022-06-26 18:01:13 RT @shahdhruv_: On Tuesday, I’m stoked to be presenting ViKiNG — which has been nominated for the Best Systems Paper award — at the Long Ta…
2022-06-24 18:19:11 The arxiv posting is finally out! https://t.co/xrPkESnGKb
2022-06-22 19:00:23 @jekbradbury So they're tackling somewhat orthogonal problems, and conceivably could be combined effectively.
2022-06-22 19:00:05 @jekbradbury RLHF deals more with the question of what kind of human feedback to get (preferences, feedback, etc.). In RL, feedback is somehow turned into a reward (see, e.g., Paul Christiano's preferences paper), which is used with some kind of RL method. ILQL is one choice for RL method.
2022-06-22 17:40:46 By @katie_kang_, Paula Gradu, Jason Choi, @michaeljanner, Claire Tomlin, and myself. web: https://t.co/tx1RlPAxmm paper: https://t.co/STGj2CU6po This will appear at #ICML2022!
2022-06-22 17:40:37 Some results -- here, we use model-based RL (MPC, like in PETS) to control the hopper to hop to different locations, with the LDM providing a safety constraint. As we vary the threshold, the hopper stops falling, and if we tighten the constraint too much it stands in place. https://t.co/xxjeuWqna7
2022-06-22 17:40:20 Intuitively, the LDM learns to represent the worst-case future log density we will see if we take a particular action, and then obey the LDM constraint thereafter. Even with approximation error, we can prove that this keeps the system in-distribution, minimizing errors! https://t.co/JH9XP2PjIs
2022-06-22 17:40:01 LDM can be thought of as a "value function" with a funny backup, w/ (log) density as "reward" at the terminal states and using a "min" backup at other states (see equation below, E = -log P is the energy). In special cases, LDM can be Lyapunov, density model, and value function! https://t.co/ZAro9Jex2q
2022-06-22 17:39:37 We can take actions that are high-density now, but lead to (inevitable) low density later, so just like the Lyapunov function needs to take the future into account, so does the Lyapunov dynamics model, integrating future outcomes via a Bellman equation just like in Q-learning. https://t.co/igWYdk8Zkd
2022-06-22 17:38:15 By analogy (which we can make precise!) Lyapunov functions tell us how to stabilize around a point in space (i.e., x=0). What if what we want is to stabilize in high density regions (i.e., p(s) >
2022-06-22 17:38:00 Basic question: if I learn a model (e.g., dynamics model for MPC, value function, BC policy) on data, will that model be accurate when I run it (e.g., to control my robot)? It might be wrong if I go out of distribution, LDMs aim to provide a constraint so they don't do this. https://t.co/38AZmVCiMe
2022-06-22 17:36:48 What do Lyapunov functions, offline RL, and energy based models have in common? Together, they can be used to provide long-horizon guarantees by "stabilizing" a system in high density regions! That's the idea behind Lyapunov Density Models: https://t.co/tx1RlPAxmm A thread: https://t.co/yPlPVIRTam
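A hedged, tabular reading of the backup described above (energy E = -log p as the "reward", a min over actions at the next state, and a max with the current energy); the actual LDM formulation may differ in details, so treat this as intuition only. Everything below (the toy MDP, the random densities) is made up.

```python
# Tabular sketch of an LDM-style backup:
#   G(s, a) <- max(E(s, a), min_{a'} G(s', a'))
# so G tracks the worst-case (lowest) future data density along the trajectory.
import numpy as np

n_states, n_actions = 4, 2
rng = np.random.default_rng(0)
density = rng.uniform(0.05, 1.0, size=(n_states, n_actions))     # toy data density p(s, a)
E = -np.log(density)                                             # energy
next_state = rng.integers(n_states, size=(n_states, n_actions))  # toy deterministic dynamics

G = E.copy()
for _ in range(100):                                             # fixed-point iteration
    G_next_min = G.min(axis=1)[next_state]                       # min over a' at the next state
    G = np.maximum(E, G_next_min)

# Constraining actions to G(s, a) <= threshold keeps the worst-case visited
# density above exp(-threshold), i.e., the system stays in-distribution.
print(G)
```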
2022-06-21 23:51:10 @KevinKaichuang @KathyYWei1 Nothing to see here, just a super-intelligent AI sending messages to its friends by secretly encoding them into proteins.
2022-06-21 18:54:36 Note: we tried to post our paper on arxiv, but arxiv has put it on hold (going into the third week now). Hopefully arxiv posts it soon, but for now it's just hosted on the website above. I guess arxiv mods are a bit swamped these days...
2022-06-21 18:52:58 by @sea_snell, @ikostrikov, @mengjiao_yang, Yi Su paper: https://t.co/Pt2oGnBnfK website: https://t.co/ssxXvihWit code: https://t.co/mrbFVX2Gar You can also check out our prior work on offline RL for dialogue systems (NAACL 2022): https://t.co/hZny7dV0dE https://t.co/eT08Q95v6U
2022-06-21 18:52:49 One cool thing is that, just by changing the reward, we can drastically alter generated dialogue. For example, in the example on the right, the bot (Questioner) asks questions that minimize the probability of getting yes/no answers. https://t.co/WD1FZIPni3
2022-06-21 18:52:26 The result is that we can train large transformer Q-functions finetuned from GPT for arbitrary user-specified rewards, including goal-directed dialogue (visual dialogue), generating low-toxicity comments, etc. It's a general tool that adds rewards to LLMs. https://t.co/NpCB7Dl0w6
2022-06-21 18:52:11 Implicit Q-learning (IQL) provides a particularly convenient method for offline RL for NLP, with a training procedure that is very close to supervised learning, but with the addition of rewards in the loss. Our full method slightly modifies IQL w/ a CQL term and smarter decoding. https://t.co/h6YjSq2gv8
2022-06-21 18:51:58 We might want RL in many places in NLP: goal-directed dialogue, synthesize text that fulfills subjective user criteria, solve word puzzles. But online RL is hard if we need to actively interact with a human (takes forever, annoying). Offline RL can learn from only human data! https://t.co/m0EAVwIyar
2022-06-21 18:51:31 NLP and offline RL are a perfect fit, enabling large language models to be trained to maximize rewards for tasks such as dialogue and text generation. We describe how ILQL can make this easy in our new paper: https://t.co/ssxXvihWit Code: https://t.co/mrbFVX2Gar Thread ->
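For reference, a minimal sketch of the expectile-regression piece of IQL that ILQL builds on (ILQL additionally adds a CQL-style term and modified decoding, which are not shown here). The values below are random placeholders.

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    # Expectile regression used by IQL: asymmetric squared error that pushes V
    # toward an upper expectile of Q over the dataset's actions.
    diff = q_values - v_values
    weight = torch.where(diff > 0, tau, 1.0 - tau)
    return (weight * diff ** 2).mean()

q = torch.randn(256)                       # Q(s, a) for dataset actions (toy values)
v = torch.randn(256, requires_grad=True)   # V(s) predictions (toy values)
loss = expectile_loss(q, v)
loss.backward()
print(float(loss))
```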
2022-06-19 01:20:59 @b_shrir Thanks for pointing that out! Definitely looks relevant, we'll take a closer look.
2022-06-17 23:42:46 This work led by @setlur_amrith, with @ben_eysenbach &
2022-06-17 23:42:36 Decision boundaries look much better too. Here, vertical axis is spurious, horizontal is relevant. Pink line (RCAD) always gets the right boundary, ERM (black) is often confused by the spurious direction. So RCAD can really "unlearn" bad features. https://t.co/aT4PYjEiSa
2022-06-17 23:42:23 It's also possible to prove that this adversarial entropy maximization "unlearns" spurious features, provably leading to better performance. This unlearning of bad features happens in practice! Here we start w/ ERM (red) and switch to RCAD (green), spurious feature wt goes down! https://t.co/xZdncIj0s9
2022-06-17 23:42:11 This leads to improved generalization performance on the test set, and can be readily combined with other methods for improving performance. It works especially well when training data is more limited. https://t.co/AFziLNhIv7
2022-06-17 23:41:58 The idea is to use *very* aggressive adversarial training, generating junk images for which model predicts wrong label, then train the model to minimize its confidence on them. Since we don't need "true" labels for these images, we make *much* bigger steps than std adv training. https://t.co/Q4Db3VhY66
2022-06-17 23:41:49 Deep nets can be overconfident (and wrong) on unfamiliar inputs. What if we directly teach them to be less confident? The idea in RCAD ("Adversarial Unlearning") is to generate images that are hard, and teach the model to be uncertain on them: https://t.co/lJf5aVv3Jr A thread: https://t.co/pHj76WHnUA
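A rough sketch of the recipe in this thread, with a toy model: take a large adversarial step to produce off-manifold "junk" images, then add a term that maximizes the model's predictive entropy on them. The model, data, step size, and weighting below are arbitrary placeholders, not the paper's settings.

```python
# Toy sketch: large-step adversarial examples + entropy maximization on them.
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(16, 3, 32, 32)             # toy image batch
y = torch.randint(0, 10, (16,))
step_size = 1.0                           # much larger than standard adversarial training

x_adv = x.clone().requires_grad_(True)
ce = F.cross_entropy(model(x_adv), y)
grad, = torch.autograd.grad(ce, x_adv)
x_adv = (x_adv + step_size * grad.sign()).detach()   # ascend the loss -> "junk" images

logits_clean, logits_adv = model(x), model(x_adv)
probs_adv = F.softmax(logits_adv, dim=-1)
entropy_adv = -(probs_adv * probs_adv.clamp_min(1e-8).log()).sum(-1).mean()

loss = F.cross_entropy(logits_clean, y) - 0.5 * entropy_adv   # be *un*confident on x_adv
opt.zero_grad(); loss.backward(); opt.step()
```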
2022-06-17 01:17:51 See also Ben's thread on this work here: https://t.co/x2UzoAq8th
2022-06-17 01:17:35 ...then we don't need to worry about separate representation learning for RL, just use task-agnostic RL objectives (like goal reaching) to learn your representations! w/ @ben_eysenbach, @tianjun_zhang, @rsalakhu website: https://t.co/mdyQrG2FMC paper: https://t.co/UtLxMjgMGp
2022-06-17 01:17:26 More generally, I think contrastive RL poses a very interesting question: instead of asking how representation learning can help RL, can we instead ask how RL can help representation learning? If representation learning and RL objectives are cast in the same framework...
2022-06-17 01:17:18 This also leads to a very effective offline goal-conditioned RL method (albeit with a couple modifications), and can outperform state-of-the-art methods on the difficult ant maze tasks when provided with the task goal. https://t.co/8t0PQPDQjn
2022-06-17 01:17:03 This is very simple, and works very well as a goal-conditioned RL method, naturally leads to the (obvious) relabeling strategy, and outperforms a wide range of prior methods when doing goal reaching from images. https://t.co/DEZHOEY7h1
2022-06-17 01:16:51 High-level idea is simple: contrastive learning contrasts positives vs negatives (left). We can do RL by contrastive current state-action tuples with future states, sampled from a discounted future distribution (right). Then condition on the future goal, and pick the max action. https://t.co/QsIu8DpSS6
2022-06-17 01:16:38 Turns out we can do RL directly with contrastive learning, and it leads to good goal-conditioned performance on images without any separate explicit representation learning objective: https://t.co/mdyQrG2FMC A short thread: https://t.co/IboAGJaeHu
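A minimal sketch of the contrastive critic described in this thread: embed (state, action) pairs and future states, and use an InfoNCE loss with in-batch negatives so that the positive for each (s, a) is the future state from its own trajectory. Encoders, dimensions, and data are toy placeholders.

```python
# Toy sketch: contrastive RL critic via InfoNCE with in-batch negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, EMB = 8, 2, 16
sa_encoder = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, EMB))
goal_encoder = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, EMB))

B = 32
s, a = torch.randn(B, STATE_DIM), torch.randn(B, ACTION_DIM)
future_s = torch.randn(B, STATE_DIM)            # future state sampled from the same trajectory

phi = sa_encoder(torch.cat([s, a], dim=-1))     # (B, EMB)
psi = goal_encoder(future_s)                    # (B, EMB)
logits = phi @ psi.T                            # (B, B): diagonal = positives, off-diag = negatives
loss = F.cross_entropy(logits, torch.arange(B))

# At execution time, condition on the desired goal g and pick the action that
# maximizes the critic phi(s, a) . psi(g).
print(float(loss))
```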
2022-06-14 01:53:25 @unsorsodicorda @keirp1 @SheilaMcIlraith Yes, there is also this (though focusing specifically on return conditioned): https://t.co/mE9qZDitQn Seems pretty clear that just return conditioned policies don't generally work. In RvS we argued that goal-conditioning should be better (though still requires alg changes).
2022-06-14 01:22:40 @Caleb_Speak Hmm that’s an interesting thought. We focus specifically on RL tasks, but it would be interesting under which conditions this would cause problems for general supervised systems with feedback — perhaps non trivial because it depends a lot on how they are supervised.
2022-06-13 19:20:28 Read more in Ben's tweet here: https://t.co/TB46OUutxV
2022-06-13 19:20:18 This leads to much better results particularly in stochastic environments, and we hope that this "normalized OCBC" can serve as a foundation for a better and more principled class of supervised learning RL methods! https://t.co/Fnz2r7OkIt
2022-06-13 19:20:06 Fortunately, we can fix this problem with two subtle changes. First, we reweight the data to account for the average policy (i.e., prior prob of actions). Second, we can't relabel tasks for which the policy is too different. With these two changes, we can prove convergence. https://t.co/orzea5yOd5
2022-06-13 19:19:53 In practice, this shows up as a kind of "winner take all" problem, where some tasks (e.g., some goals) get better, while others get worse. That makes sense -- if a goal gets worse, we have only bad data to relabel into that goal, and it will not get much better. https://t.co/TcmZfrRtPe
2022-06-13 19:19:41 ...and in general, the maximization step does *not* improve it beyond the "badness" introduced by the averaging step. This can become a big problem especially in stochastic environments. This is a problem, b/c this is a popular supervised learning alternative to RL
2022-06-13 19:19:30 In our paper (@ben_eysenbach, @0602soumith, @rsalakhu), we show that in general, naive relabeling + supervised learning *doesn't* work. We can interpret it as two steps (analogously to EM): averaging (pooling data from all trials) and training. Averaging makes the policy worse... https://t.co/HtPH5Uloni
2022-06-13 19:19:20 An appealing alternative to classic RL is goal-conditioned BC (GCSL, etc.), or generally outcome-conditioned BC: condition on a future outcome and run supervised learning. Does this work? Turns out the answer, in general, is no: https://t.co/2zJFTt51Cm A thread:
2022-06-13 19:04:07 A talk that I'm preparing on how reinforcement learning can acquire abstractions for planning. Covers some HRL, trajectory transformer, offline RL: https://t.co/LKl3OgAjst This was made for the "Bridging the Gap Between AI Planning and Reinforcement Learning" workshop at ICAPS.
2022-06-05 04:39:28 RT @avisingh599: A ~35minute talk I gave at the ICRA 2022 Behavior Priors in Reinforcement Learning for Robotics Workshophttps://t.co/aNB…
2022-05-26 19:55:24 @yukez @snasiriany @huihan_liu @UTCompSci @texas_robotics Awesome, congratulations @snasiriany, @huihan_liu, @yukez!!
2022-05-26 01:43:52 Finally, a method that will let us communicate with aliens if we don't know their language. Unfortunately, we couldn't find aliens for the study, so we decided to let the algorithm figure out how to control a lunar lander from hand gestures. Video here: https://t.co/PluzJe5vPT https://t.co/7s6r0OlaTh
2022-05-25 23:22:46 @mathtick @NandoDF I think that would be quite neat. The current model is still in discrete time though, so this would be a nontrivial extension — the diffusion here roughly corresponds to steps of plan refinement, rather than time steps.
2022-05-25 02:33:48 w/ @KuanFang Patrick Yin @ashvinair website with videos: https://t.co/EK3eQb0we9 arxiv: https://t.co/8Is9AoQuGP
2022-05-25 02:32:23 PTP builds on a number of prior works on planning with goal-conditioned policies https://t.co/i6wCVW1dK2 https://t.co/Qr3qUYzC0T https://t.co/DBvTUShGGB hie... vision-based RL https://t.co/cHpFdS1fIv https://t.co/Rre1xThTdx For some reason a lot of these are from 2019
2022-05-25 02:28:48 PTP can use planning to significantly improve goal-conditioned policies with online finetuning, and then those goal-conditioned policies can solve multi-stage tasks (here, push an object and then close the drawer) in the real world! https://t.co/swZovHEroP
2022-05-25 02:27:10 The subgoals make it much easier to further finetune the goal-conditioned policy online to get it to really master multi-stage tasks. Intuitively, the affordance model tells the robot what it should do to achieve the goal, and the goal-conditioned policy how to do it. https://t.co/pZB9mEjtBl
2022-05-25 02:25:58 PTP trains a goal-conditioned policy via offline RL, as well as an "affordance model" that predicts possible outcomes the robot can achieve. Then it performs multi-resolution planning, breaking down the path to the goal into finer and finer subgoals. https://t.co/lCApBvnAOS
2022-05-25 02:24:40 Planning to practice (PTP) combines goal-based RL with high-level planning over image subgoals to finetune robot skills. Goal-conditioned RL provides low-level skills, and they can be integrated into high-level skills by planning over "affordances": https://t.co/8Is9AoQuGP ->
2022-05-23 19:19:33 w/ @michaeljanner, @du_yilun, Josh Tenenbaum Paper: https://t.co/ViPDTvUxDR Webpage: https://t.co/UDwLsWBoqa Code: https://t.co/QSLozrOD2Y
2022-05-23 19:18:19 This scales to more complex tasks, like getting a robotic arm to manipulate blocks, and does quite well on offline RL benchmark tasks! https://t.co/10Lavbx3iT
2022-05-23 19:17:51 Why is this good? By modeling the entire trajectory all at once and iteratively refining the whole thing while guiding toward optimality (vs autoregressive generation), we can handle very long horizons, unlike conventional single-step dynamics models that use short horizons https://t.co/FRRkRNtXKt
2022-05-23 19:17:04 The architecture is quite straightforward, with the only trajectory-specific "inductive bias" being temporally local receptive fields at each step (intuition is that each step looks at its neighbors and tries to "straighten out" the trajectory, making it more physical) https://t.co/MAWjMVnTw0
2022-05-23 19:15:42 The model is a diffusion model over trajectories. Generation amounts to producing a physically valid trajectory. Perturbing by adding gradients of Q-values steers the model toward more optimal trajectories. That's basically the method. https://t.co/XiAaYAonwk
2022-05-23 19:14:24 Can we do model-based RL just by treating a trajectory like a huge image, and training a diffusion model to generate trajectories? Diffuser does exactly this, guiding generative diffusion models over trajectories with Q-values! https://t.co/UDwLsWBoqa ->
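A very rough sketch of the guided refinement loop this thread describes: iteratively denoise a whole trajectory while nudging it along the gradient of a return estimate. The denoiser, return model, horizon, and scales below are placeholders, not the released Diffuser code.

```python
# Toy sketch: Q/return-guided iterative refinement of a full trajectory.
import torch

H, D = 32, 6                                    # horizon and per-step (state+action) dim

def denoiser(traj, t):
    # Placeholder for the trained diffusion model's denoising step.
    return traj - 0.01 * torch.randn_like(traj)

def return_estimate(traj):
    # Placeholder differentiable return / Q model over the whole trajectory.
    return -((traj - 1.0) ** 2).sum()

traj = torch.randn(H, D)                        # start from noise
guide_scale = 0.1
for t in reversed(range(50)):
    traj = traj.detach().requires_grad_(True)
    ret = return_estimate(traj)
    grad, = torch.autograd.grad(ret, traj)
    traj = denoiser(traj, t) + guide_scale * grad        # refine + steer toward higher return
    if t > 0:
        traj = traj + 0.01 * torch.randn_like(traj)      # small noise except at the final step

print(traj.shape)  # a full (H, D) trajectory produced by iterative refinement
```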
2022-11-22 18:23:38 This ends up working quite well in practice. DASCO gets good empirical results, and the experiments validate that the additional "auxiliary" generator, which is what leads to support constraints, really does lead to significantly improved performance. https://t.co/YJH2OYojoX
2022-11-22 18:23:37 If we use a regular discriminator, that's a distributional constraint. How do we get a support constraint? We discriminate between data and a mix of *two* actors -- a "good" one and a "bad" one! The aux generator captures the bad actions, so the main actor can keep the good ones. https://t.co/MR6tf0ZRig
2022-11-22 18:23:36 In these settings, ReDS significantly outperforms other offline RL methods. We also evaluate it on a fairly complex set of simulated robotic manipulation tasks. https://t.co/B8SDaiWIao
2022-11-22 18:23:35 Turns out that all that is needed is to slightly modify CQL so that the "negative" term (the one that pushes down Q) is a 50/50 mix of the policy and an "anti-advantage" distribution that picks up on *low* advantage actions (rho in the eq above, which optimizes this objective): https://t.co/SfgMB4TcIt
2022-11-22 18:23:34 The first method, ReDS, modifies conservative Q-learning (CQL). CQL pushes down on Q-values not in the data, which implicitly makes the Q-function conform to the shape of the data distribution. ReDS reweights the terms to only push down on out-of-support actions. https://t.co/akLzTFqjka
2022-11-22 18:23:33 Offline RL requires staying close to the data. Distribution constraints (e.g., KL) can be too pessimistic; we need "support" constraints that allow for any policy inside the support of the data. We developed 2 methods for this: https://t.co/u1UMAKutNL https://t.co/9c5Dwewjy6 Thread: https://t.co/ohmZhWxR6v
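As a rough reading of the ReDS tweets above, the sketch below shows how CQL's push-down term might mix the current policy with an "anti-advantage" reweighting of dataset actions. All networks, shapes, and weights are hypothetical stand-ins, not the authors' implementation.

```python
# Rough sketch of a ReDS-style conservative penalty (illustration only).
import torch
import torch.nn as nn

obs_dim, act_dim, B = 8, 2, 64
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
v_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

def q(s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    return q_net(torch.cat([s, a], dim=-1)).squeeze(-1)

s, a_data = torch.randn(B, obs_dim), torch.randn(B, act_dim)

# Q-values under the current policy's actions.
a_pi = policy(s)
q_pi = q(s, a_pi)

# "Anti-advantage" reweighting: low-advantage dataset actions get more weight.
with torch.no_grad():
    adv = q(s, a_data) - v_net(s).squeeze(-1)
    w_anti = torch.softmax(-adv, dim=0) * B          # normalized anti-advantage weights
q_anti = w_anti * q(s, a_data)

# CQL-style objective: push down on the 50/50 mixture, push up on dataset actions.
# In practice this penalty would be added to the usual Bellman error when training q_net.
push_down = 0.5 * q_pi.mean() + 0.5 * q_anti.mean()
push_up = q(s, a_data).mean()
conservative_penalty = push_down - push_up
```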
2022-11-23 18:27:42 w/ Han Qi, Yi Su, @aviral_kumar2. This will appear at #NeurIPS2022: https://t.co/X90UWhZFwq If you want to learn more about offline MBO, check out our blog post covering our previous MBO method: https://t.co/E0d2ZMURCc
2022-11-23 18:27:41 We evaluate the resulting method on various MBO benchmarks: superconductor design, robot morphology optimization, etc. Averaging over the tasks, our method, IOM (invariant objective models) improves over prior methods, and has appealing offline hparam selection rules. https://t.co/QAwBOO8enU
2022-11-23 18:27:40 So we usually somehow limit mu_opt(x) to be close to mu(x) (e.g., KL constraint, pessimism, etc.). But what if we instead train the *representation* inside f(x) (the model/value function) to be *invariant* to differences between mu(x) and mu_opt(x)? https://t.co/JsYU5q7fOK
2022-11-23 18:27:39 In offline [RL/MBO/bandits], we start with a data distribution ("mu(x)") and find a policy ("mu_opt(x)") that is better (i.e., x ~ mu_opt(x) has higher utility). We estimate f(x) (or V(x) in RL) with a model, but if mu_opt(x) deviates too far from mu(x), the model makes incorrect predictions. https://t.co/eM7TkmXjZu
2022-11-23 18:27:38 Data-driven decision making (e.g., offline RL, offline MBO, bandits from logs) has to deal with distribution shift. But perhaps we can approach this as "domain adaptation" between the data and the optimized policy? In our new paper, we explore this unlikely connection. A thread: https://t.co/IfTYeM4AqR
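The invariance idea in this thread can be illustrated with a simple mean-feature-matching penalty between representations of data samples and optimized samples. This is only a sketch under that assumption, not the IOM objective itself; the encoder, head, and data below are hypothetical placeholders.

```python
# Sketch of training an objective model whose representation is invariant to the
# shift between mu(x) and mu_opt(x) (illustration only; not the IOM implementation).
import torch
import torch.nn as nn

x_dim, B = 16, 128
encoder = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 32))
head = nn.Linear(32, 1)                            # f(x) = head(encoder(x))

x_data = torch.randn(B, x_dim)                     # x ~ mu(x), with observed scores y
y_data = x_data.sum(dim=-1, keepdim=True)
x_opt = torch.randn(B, x_dim) + 2.0                # stand-in samples from mu_opt(x)

z_data, z_opt = encoder(x_data), encoder(x_opt)
pred_loss = ((head(z_data) - y_data) ** 2).mean()  # fit f(x) on the data distribution
invariance = ((z_data.mean(0) - z_opt.mean(0)) ** 2).sum()  # align mu and mu_opt features
loss = pred_loss + 1.0 * invariance
loss.backward()
```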
2022-11-23 05:06:29 Paper: https://t.co/muDIHjvpLC Website: https://t.co/62iojrRFwh
2022-11-23 05:05:59 How can vision-language models supervise robots to help them learn a broader range of spatial relationships, tasks, and concepts? DIAL can significantly improve performance of instruction following policies by augmenting human labels with huge numbers of synthetic instructions. https://t.co/ytcd6Mbzj7 https://t.co/EiB2rb65UA
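A loose sketch of the augmentation loop DIAL describes: score candidate instructions against a robot episode with a vision-language model and keep high-scoring ones as extra training labels. Here `vlm_score`, the candidate list, and the episode array are hypothetical placeholders rather than the actual system.

```python
# Loose sketch of VLM-based instruction relabeling (illustration only; all placeholders).
import numpy as np

def vlm_score(episode_frames: np.ndarray, instruction: str) -> float:
    """Stand-in for a CLIP-like image/text similarity score."""
    rng = np.random.default_rng(abs(hash(instruction)) % (2**32))
    return float(rng.random())

candidate_instructions = [
    "pick up the sponge", "move the can to the left", "push the drawer closed",
    "place the block next to the bowl",
]

episode = np.zeros((8, 64, 64, 3))               # placeholder frames for one episode
scored = sorted(candidate_instructions,
                key=lambda text: vlm_score(episode, text), reverse=True)
synthetic_labels = scored[:2]                    # keep top-scoring relabels as new data
print(synthetic_labels)
```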
2022-11-27 22:34:00 There are additional talks, posters, and workshop talks from RAIL that I'll post about later! Also check out work by Jakub Grudzien (who is rotating with us this semester) + colleagues on discovering policy update rules: https://t.co/dM8KvMLi39 arxiv: https://t.co/jIYv6VcBZ0
2022-11-27 22:33:59 Also at 4:30, Hall J #333, Han Qi, @YiSu37328759 &
2022-11-27 22:33:57 Then at 4:30 pm CST, Hall J #505, @mmmbchang will present his work on implicit differentiation to train object-centric slot attention models. Talk: https://t.co/zG4sFVk8DF Web: https://t.co/XvsCYJYUUE YT video: https://t.co/VeiYJplJgN https://t.co/NjxdilM2Ih
2022-11-27 22:33:56 Also at 11, Hall J #303, @ben_eysenbach &
2022-11-27 22:33:55 Also at 11, Hall J #928, @ben_eysenbach &
2022-11-27 22:33:53 At 11:00 am CST Hall J #610, @ben_eysenbach &
2022-11-27 22:33:52 We'll be presenting a number of our recent papers on new RL algorithms at #NeurIPS2022 on Tuesday 11/29, including contrastive RL, new model-based RL methods, and more. Here is a short summary and some highlights, with talk &