Discover the AI Experts
Nando de Freitas | Researcher at DeepMind |
Nige Willson | Speaker |
Ria Pratyusha Kalluri | Researcher, MIT |
Ifeoma Ozoma | Director, Earthseed |
Will Knight | Journalist, Wired |
AI Expert Profile
Not Available
The Expert's latest messages:
2024-07-16 17:13:20 This is an excellent paper that ties many threads together around scaling models and hyperparameters. https://t.co/pN90kPoxvS
2024-06-07 13:32:51 This was one of the most research-enabling libraries I used at Google. If you want to try out LLM ideas with a simple, clean, JAX codebase, this is for you. https://t.co/HGwAWSKRVJ
2024-04-06 03:27:22 This was a fun project! If you could train an LLM over text arithmetically compressed using a smaller LLM as a probabilistic model of text, it would be really good. Text would be represented with far fewer tokens, and inference would be way faster and cheaper. The hard part is… https://t.co/uNEwq1imky https://t.co/TOJ5LG1YL5
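A minimal sketch of the compression idea above (my own toy, not code from the linked project; a tiny character bigram model stands in for the "smaller LLM", and we only compute the ideal code length an arithmetic coder would approach, -Σ log2 p(symbol | context), rather than producing an actual bitstream):

    import math
    from collections import Counter, defaultdict

    def train_char_bigram(corpus):
        """Laplace-smoothed character bigram model, a stand-in for a small LLM."""
        counts = defaultdict(Counter)
        for prev, c in zip(corpus, corpus[1:]):
            counts[prev][c] += 1
        vocab = sorted(set(corpus))
        def prob(c, prev):
            total = sum(counts[prev].values())
            return (counts[prev][c] + 1) / (total + len(vocab))
        return prob

    def ideal_code_length_bits(text, prob):
        """Bits an ideal arithmetic coder would need for `text` under this model."""
        return sum(-math.log2(prob(c, prev)) for prev, c in zip(text, text[1:]))

    corpus = "the quick brown fox jumps over the lazy dog " * 50
    prob = train_char_bigram(corpus)
    text = "the lazy dog jumps over the quick brown fox"
    print(8 * len(text), "raw bits vs", round(ideal_code_length_bits(text, prob), 1), "model bits")

The better the probabilistic model, the shorter the code; training a larger LLM directly on that compact representation is the speculative part of the tweet.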
2024-04-03 00:59:47 RT @trishume: Here's Claude 3 Haiku running at >
2023-03-24 22:00:44 @DavidDuvenaud This announcement makes me very happy! Thank you for working to make the future better for your children and mine.
2023-03-14 02:20:58 @gwern But look at how unexpectedly clean the plots are! I do think it would be possible to make these definitions more objective -- check bonus section 6 in the blog post for some ideas
2023-03-10 16:03:10 @TechCapo These orderings were subjective judgements of others! I buy this though -- AlphaGo is trained indirectly via value functions, so there's another imperfect link in the chain linking its output to an objective, compared to e.g. a classifier.
2023-03-10 15:52:51 @georgebdavis Re (1) -- my own hypothesis is that evolution had to work *very hard* to make animals intelligent in a way that contributed positively to our fitness function. It's not that coherence can't be achieved, rather that we're going to have to work hard for every bit of coherence.… https://t.co/KUVplHxxks
2023-03-10 15:41:31 @Sheikheddy +100
2023-03-10 15:38:23 @DavidSKrueger Those are all possible! Here's a sketch of another possible low level mechanism: Agents interacting with the world are high dimensional dynamical systems world state → model output / action → new world state Smarter agents are: - more complex dynamical systems (shorter… https://t.co/YDlQmVJEG3
2023-03-10 15:28:20 @catherineols Yes! That is a risk scenario that sounds worryingly plausible to me.
2023-03-10 02:17:03 @Cory29565470 I didn't choose the organizations -- I asked a subject, who didn't know what the experiment was about, to choose them, so I wouldn't be able to bias the results by cherry picking.
2023-03-09 17:46:45 (And stochastically tagging a few people who might be interested. @KatjaGrace @DavidSKrueger @DavidDuvenaud @bucketofkets @EthanJPerez )
2023-03-09 17:00:46 Huge thank you to my generous volunteer subjects (tagging the few cases where I know your twitter handle -- sorry if I missed you!): @dmdohan @jesseengel @thisismyhat @DylanPaiton @neurotalker
2023-03-09 16:36:57 @nabla_theta I completely agree. But under that scenario we will need to work really hard for every scrap of coherent behavior. We won't accidentally get to a paperclip maximizer.
2023-03-09 16:15:53 See the post for details -- including discussion of the many ways these results are speculative and could be improved. This is my second blog post ever -- please continue to be harsh but also constructive! https://t.co/OukfipSkIJ
2023-03-09 16:15:52 The hot mess theory of AI misalignment (+ an experiment!) https://t.co/OukfipSkIJ There are two ways an AI could be misaligned. It could monomaniacally pursue the wrong goal (supercoherence), or it could act in ways that don't pursue any consistent goal (hot mess/incoherent). https://t.co/tdnZP65DTc
2022-12-23 20:23:24 Intuitive extensions to standard notation that make it less ambiguous for common math in machine learning. This should become common practice in ML papers. This could have saved past me cumulative days of confusion (and worse, misinterpretations I probably never discovered). https://t.co/l6wpPT6hTF
2022-11-09 14:46:56 @ErikSchluntz +1. Generalizing/abstracting your example slightly, you're saying changes which increase efficiency in the *typical* case may lead to worse performance in the *average* case, because of an increased risk of catastrophic failure? (A key phrase might be black swan event.)
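A tiny numerical version of that point, with purely hypothetical numbers:

    baseline_outcome = 100.0                        # every case yields 100 before the change
    typical_gain, p_fail, fail_loss = 10.0, 0.01, -5000.0
    expected_after = (1 - p_fail) * (baseline_outcome + typical_gain) + p_fail * fail_loss
    print(expected_after)                           # 58.9: better in the typical case, worse on average

i.e. a change can improve 99% of outcomes and still lower the expectation if it adds a small chance of catastrophe.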
2022-11-09 14:33:02 @athundt @peteflorence @ruha9 of the phenomenon with a moral judgement about the phenomenon, in a way that I think would make technical discussion, including around mitigations, difficult.)
2022-11-09 14:28:43 @athundt @peteflorence @ruha9 Thanks for the connection! I just added these to the list of related concepts. (* While I think these are excellent observations, I wouldn't be comfortable myself using these examples as the primary term for the underlying concept, because they seem to combine a description https://t.co/JQXJo6CFgY
2022-11-08 00:33:22 @RazMarinescu +1 to adapting goals+incentives being key to mitigating this.
2022-11-08 00:29:28 @PaulsonJonathan This is a really good point! If we could somehow observe the world where the listed thing changed, but everything else was held fixed, we might see absolute outcomes get worse. But we don't live in that world, and there are reasons everything changes at once. I will think on this
2022-11-07 14:43:09 @updateless_ This turns out to be really hard to write, because I have so much uncertainty. Predicting the future is hard.
2022-11-07 14:38:25 @sirbayes These are also worries I have! "In a world that will only become more influenced by mathematical intelligence, can we ruin culture through our attempts to perfect it?"
2022-11-07 14:30:57 @DavidSKrueger I hadn't seen that paper. I like that it introduces an ontology -- I think this was missing from how I thought about it. Thank you for the connection.
2022-11-07 04:45:54 RT @boazbaraktcs: 3/7 this should not detract from the general point, that in many cases, as a system, whether algorithmic, individual, or…
2022-11-07 01:29:20 Also @-ing some people I follow (and get a lot of value from) that might find this perspective interesting. @bucketofkets @AmandaAskell @albrgr @DavidSKrueger @KatjaGrace @OwainEvans_UK @sleepinyourhat @jackclarkSF @geoffreyirving @ESYudkowsky
2022-11-07 01:07:15 If there's one thing that AI will bring, it's dramatically greater efficiency across many domains. We should expect that this will cause similarly dramatic harmful unintended consequences, in every domain AI touches, *all at once*. This is going to be a hard period of history.
2022-11-07 01:07:14 The phenomenon of overfitting in machine learning maps onto a class of failures that frequently happen in the broader world: in politics, economics, science, and beyond. Doing too well at targeting a proxy objective can make the thing you actually care about get much, much worse. https://t.co/LNLOg5IBmA
2022-11-07 01:07:12 My first blog post ever! Be harsh, but, you know, constructive. Too much efficiency makes everything worse: overfitting and the strong version of Goodhart's law https://t.co/uR7pL7WNST https://t.co/NaibgX1bRb
2022-11-06 02:43:41 I'm on Mastodon! @jascha@mathstodon.xyz. I will post new content there, before Twitter. I don't like my social+professional interactions being mediated+manipulated by a corporation with very different incentives than me. I'm hoping Mastodon replaces scientific Twitter.
2022-11-02 18:09:39 @ericjang11 @dpkingma I think there is a qualitative difference between the magnitude degree of freedom and other degrees of freedom. That is, I think getting relative magnitudes of activations correct is somehow easier for neural networks than getting the overall norm correct.
2022-11-01 21:19:13 @ericjang11 (though that observation really just moves the why question one step farther up, rather than answering it)
2022-11-01 21:18:30 @ericjang11 This is for the same reason that neural networks are often poorly calibrated. NNs are good at producing a vector that points in the right direction, but bad at getting the magnitude correct. For classification, you just need to get the vector direction right.
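A quick toy illustration of that distinction (my own example, not from the thread): scaling a logit vector changes the softmax confidence, which is where calibration lives, but not the argmax, which is all classification accuracy needs.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.0])
    for scale in [0.5, 1.0, 3.0]:
        p = softmax(scale * logits)
        # predicted class (the "direction") is unchanged; confidence (the "magnitude") varies a lot
        print(scale, int(p.argmax()), round(float(p.max()), 3))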
2022-10-23 19:12:19 I just read this, and got a lot out of it. https://t.co/yUn5EJMz1x
2022-09-27 17:14:18 @TacoCohen +1000 to this.
2022-09-23 03:56:14 RT @BorisHanin: PRINCETON ML THEORY POSTDOC I'm looking for a theory postdoc with background in math, physics, stats, CS. Share widely.…
2022-09-23 01:34:30 One of the largest challenges around learned optimizers is making inner and outer training *stable*. James shows how eigenvalue analysis and careful intervention can produce massive improvements. https://t.co/hcKmrytW4n
2022-09-14 15:38:19 RT @ARomanNovak: Quadratic scaling in the number of pixels is a huge bottleneck of the NNGP/NTK. Very excited about _orders-of-magnitude_ s…
2022-09-14 15:24:57 I'm very excited to help out with the AI Grant program! I know I'm going to learn a lot. Hopefully we can learn a lot together. https://t.co/oDzzmuy1Gi https://t.co/4GhCYkcmwM
2022-08-26 22:47:26 RT @ScienceInsider: BREAKING: White House issues new policy that will require, by 2026, all federally-funded research results to be freely…
2022-08-06 20:41:03 This thread is an excellent read. I don't know that I would characterize the observations as spicy, so much as maybe just worrisome. https://t.co/wnBZUlEpl2
2022-08-06 20:13:18 @jackclarkSF At least half the time, this is because the original authors didn't realize an aspect was actually very important, or didn't realize an insight suggested by their experiments.
2022-07-23 18:10:47 @karpathy (Animal Eyes is an amazing book. Every few pages you'll learn something you want to share with everyone near you. Bruno Olshausen uses it for a great course at Berkeley.)
2022-07-23 18:04:45 @FelixHill84 So I guess -- eventually I think the bitter lesson will apply, but we need to figure out a lot before we can blindly scale the number of interacting large models.
2022-07-23 18:03:07 @FelixHill84 Good Q! I suspect for a while we will design multi-agent systems, then once they're stable we will scale them, then when the agent count is large enough, we will wrap another layer of abstraction on top, and start designing ?societies? of many interacting multi-agent systems.
2022-07-23 04:39:02 I think we will increasingly build systems out of many large models interacting with each other. I think the cascades perspective -- write down a probabilistic graphical model, but with every node a language model -- is the right formalism for describing these systems. https://t.co/oVcHgEu7ad
2022-07-22 03:36:52 RT @sschoenholz: Paper is here with details: https://t.co/kgb8Wvkje5If you don't care about details, the finite-width NTK calculations in…
2022-07-22 03:36:31 RT @ARomanNovak: Will be presenting our work on fast finite-width NTK today at #icml2022 - please come to our talk at 10:55 EDT, or the pos…
2022-07-01 01:48:22 RT @ethansdyer: 1/ Super excited to introduce #Minerva (https://t.co/UI7zV0IXlS). Minerva was trained on math and science found on the web…
2022-07-01 01:47:08 RT @alewkowycz: Very excited to present Minerva: a language model capable of solving mathematical questions using step-by-step natural lan…
2022-06-27 21:55:05 RT @EthanJPerez: We’re announcing the Inverse Scaling Prize: a $100k grand prize + $150k in additional prizes for finding an important task…
2022-06-19 15:36:07 @laurence_ai Noted. We should add a discussion of this to our paper.
2022-06-18 06:33:57 @TheGregYang Good question! You can write the reparameterization in terms of either a feature x feature or data x data kernel, whichever is smaller (see Appendix B). So it's not a problem computationally. Large data/ width ratio will lead to a less smooth reparameterized distribution though.
2022-06-18 01:00:39 RT @hoonkp: Awesome work by @jirimhron and friends at Google: Bayesian parameter posterior of the infinite-width limit! Another concrete e…
2022-06-18 00:06:37 PS -- When I described these results to @TheGregYang a couple months ago, he initially described them as "too good to be true", so you know they have to be good!
2022-06-18 00:06:36 Many, many more details in the paper! My fantasy and hope for this work is that it not only helps us understand neural networks better, but will also help make Bayesian models (without egregious approximations) practical. https://t.co/wmjO5F3ozq
2022-06-18 00:06:35 Even better, because the KL between prior and posterior shrinks with width, MCMC sampling after repriorization grows *more efficient* with width. (almost all current common MCMC samplers instead grow dramatically less efficient with increasing dimensionality) https://t.co/wLhDDPmptO
2022-06-18 00:06:34 MCMC mixes much faster after repriorization (we show >
2022-06-18 00:06:33 We characterize the weight space posterior by defining a data-dependent reparameterization that causes the *posterior* distribution over parameters conditioned on a dataset to converge in KL towards the *prior* distribution over parameters. We call this mapping repriorization. https://t.co/7WSan1HCdJ
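In symbols (my paraphrase of this tweet; the notation, including the KL argument order, is my own rather than the paper's): if p(θ) is the prior over parameters, D the dataset, and φ = r(θ; D) the data-dependent reparameterization, the claim is

    D_{\mathrm{KL}}\big( p(\phi \mid D) \,\|\, p(\phi) \big) \;\to\; 0 \quad \text{as the width} \to \infty,

so after repriorization the posterior over the new parameters looks more and more like the prior.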
2022-06-18 00:06:32 Detour for acknowledgements: @jirimhron deserves the lion's share of credit. He is also job hunting!! Jiri is brilliant and extremely patient, and you should hire him. Thank you also to @ARomanNovak and Jeffrey Pennington, who played crucial roles. More about the result:
2022-06-18 00:06:31 For years I've shown this 2x2 grid in talks on infinite width networks, but with just a big gap in the upper-left. No longer! In https://t.co/NyZaHUsYjC we characterize wide Bayesian neural nets in parameter space. This fills a theory gap, and enables *much* faster MCMC sampling. https://t.co/zTUsGJVIhf
2022-06-17 23:28:44 @TrendingML I asked an internal language model I have access to, and it says it will require 114,720 Tweets. That is my final answer.
2022-06-17 18:04:48 @pde33 @machinaut @realSharonZhou The Brier score submission from the three of you is the cause of an entire section on calibration in the BIG-bench paper. Thank you!
2022-06-15 17:17:04 RT @qlhoest: Thanks @LiamFedus @AJAndreassen @jaschasd @ethansdyer @guygr and team for the incredible work on BigBench !You can find it on…
2022-06-14 03:32:35 RT @james_y_zou: Excited to contribute to bias assessment of large language models in the BIG-bench!
2022-06-13 16:10:33 RT @vedantmisra: BIG Bench is not only a fascinating collection of tasks for LLMs, it's also a shining example of how open and collaborativ…
2022-06-13 05:09:56 RT @geoffreyirving: Whether LLMs are conscious or pass Turing Tests or what precisely a Turing Test means matters much less than whether yo…
2022-06-12 04:22:17 RT @adityagupta2211: Glad to have contributed to such a massive collaborative work! Excited to see DISFL-QA (https://t.co/gwdw5s9ici) and T…
2022-06-12 02:29:09 RT @ivanzhouyq: This is incredible work on LLMs! Reading through this paper, I'm not only amazed by the huge amount of work behind BIG-benc…
2022-06-11 19:55:44 This is a fascinating task, on which the performance of the largest models is still close to chance and not obviously increasing with scale. https://t.co/0l68wt7o8f
2022-06-11 19:34:35 The link to the task is here: https://t.co/kMTrrbkjO3 This is a great task that large models still perform roughly at chance on. https://t.co/r1miq5TlKK
2022-06-11 19:30:55 The implicatures task was one of my favorites!! Silly, but also requires some quite complex skills, possibly including a rich world model and theory of mind. https://t.co/uORNzPxqvZ
2022-06-11 18:38:35 @raphaelmilliere @OwainEvans_UK are to human capabilities for quite a while.
2022-06-11 18:36:21 @raphaelmilliere @OwainEvans_UK Good question! I don't want to hazard a timeline, because that's the sort of thing that gets screenshotted and turned into an embarrassing slide. BIG-bench includes many tasks that language models can't do at all though. I believe it will remain a useful test for how close LMs
2022-06-11 03:27:04 @billmdev We will still have the low and high scores that are part of task metadata for new tasks, which are useful for establishing a reasonable scale. To compare to humans though would be another project, which we don't currently have plans for.
2022-06-11 03:18:14 @tdietterich I think experimental physics has smoothed out all the rough spots for arXiv submissions with long author lists. @ethansdyer pasted all the names into the arXiv form field ... and it just worked.
2022-06-11 03:11:38 @tomgara Great! Now, tell me why them getting worse is expected (or at least funny if you have the right context).
2022-06-11 03:06:24 Owain's task is truthful_qa, which is a great task that targets a specific worrying failure of language models (that they will just make up incorrect things when they don't know the answer). Thank you!! https://t.co/wlhkuxqaXa https://t.co/THVa4Vcrcz
2022-06-11 03:03:42 @billmdev So scoring close to 100 corresponds to doing well.
2022-06-11 03:03:25 @billmdev We hired humans to do almost all the tasks in the benchmark, so we can compare LM performance to human performance. Each task also specified, as part of its metadata, an estimate of what "low" and "high" scores on that task would be. We normalize those to be between 0 and 100.
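A minimal sketch of that normalization (the exact formula and argument names are my assumption, inferred from this thread rather than taken from the BIG-bench code):

    def normalize_score(raw, low, high):
        """Map a raw task score onto 0-100 using the task's low/high metadata estimates."""
        return 100.0 * (raw - low) / (high - low)

    # e.g. a task whose metadata estimates chance performance at 25 and expert performance at 90:
    print(normalize_score(raw=25, low=25, high=90))   # 0.0
    print(normalize_score(raw=90, low=25, high=90))   # 100.0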
2022-06-11 01:12:16 RT @dmdohan: Huge props to the organizers for their leadership in pushing this to completion! Exciting model for large-scale collaboratio…
2022-06-10 21:11:59 RT @andrey_kurenkov: Generally really cool, but I also like this bit - "BIG-bench continues to accept tasks and evaluation results on a rol…
2022-06-10 21:11:42 And the corresponding task is here! https://t.co/Un3voQbmCX Thank you! https://t.co/tI5SPAZQCu
2022-06-10 19:53:34 Here is the task, which is high quality (and somewhat distressing):https://t.co/TlFIH1cgIl https://t.co/NEAJc4uaUx
2022-06-10 19:50:39 Oops -- I just saw that you gave links to your tasks later in a thread. Comment still applies though -- your tasks were excellent!
2022-06-10 19:49:13 Your contributions were great Marie!! To list them for Twitter: https://t.co/voIHsJ0Iy1 https://t.co/0v2HRTfJXC https://t.co/IbaKBUzSK8 (I particularly liked yes_no_black_white) https://t.co/YtkxMEdZX7
2022-06-10 19:42:32 @karpathy Unfortunately, tasks where models show breakthrough performance, and the way in which PaLM performance looks like the start of a sigmoid in terms of log-parameter-count, together mean that I'm still highly uncertain about what the near-future capabilities of large models will be.
2022-06-10 19:40:35 @karpathy My primary (personal) motivation for BIG-bench was that I was drawing straight lines on the plots in the GPT3 paper, and I really wanted to know what the *actual* capabilities of larger models would be.
2022-06-10 19:35:10 RT @karpathy: imo a major AI safety contribution, both in short-term (applications) and long-term (AGI) scope
2022-06-10 19:33:34 @kchonyc @thisismyhat You definitely have to work hard for it not to apply. Self-cite, but even a high dimensional random walk is concentrated in a low dimensional subspace, with energy in different eigenvalues of the iterate covariance falling off like a power law: https://t.co/04I1D8vtJl
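The self-cited claim is easy to check numerically (my own sketch, not code from the linked paper): run PCA on the iterates of a high-dimensional random walk and look at how quickly the eigenvalues of the iterate covariance fall off.

    import numpy as np

    rng = np.random.default_rng(0)
    d, T = 1000, 2000                                      # dimension, number of steps
    walk = np.cumsum(rng.standard_normal((T, d)), axis=0)  # high-dimensional random walk
    eigs = np.linalg.eigvalsh(np.cov(walk, rowvar=False))[::-1]
    print(eigs[:5] / eigs.sum())   # a handful of directions carry most of the variance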
2022-06-10 19:22:05 RT @karpathy: Incredible effort!!
2022-06-10 19:20:32 This was a great task! https://t.co/q8MHePbJUv
2022-06-10 19:15:22 @Suhail We do not, though a blog post is something we should really do. The paper and repository READMEs are hopefully pretty clearly written.
2022-06-10 17:55:14 RT @barret_zoph: It was a pleasure to be part of this effort! Very bullish on the impact this will have for the future of LLMs.Also very…
2022-06-10 17:21:38 @its_ericchu @snehapriscilla This task seems to require both a simple geometric world model, and also to internally perform multiple sequential reasoning steps -- it's great for probing weaknesses of current model architectures!
2022-06-10 16:22:03 @dk_gup @ethansdyer is the answer
2022-06-10 16:20:37 RT @BuzanDilyar: It was an amazing experience collaborating with amazing people @UvA_Amsterdam and contributing to the BIG-bench benchmark.…
2022-06-10 16:18:31 RT @douglas_eck: It indeed takes an army. Lots of interesting new research directions have been uncovered by the BigBench effort!
2022-06-10 15:17:50 @webis_de This is a cool task! Thank you!
2022-06-10 15:16:36 @rodrigfnogueira I think we should be comparing against the top rather than the bottom baseline on that plot. It's true that the trend looks worrying for humans though! (also, that plot is a subset of json tasks, which are generally easier than the programmatic tasks)
2022-06-10 15:12:45 @peppeatta This exists!! Start at one of the links below, and navigate to individual tasks. Performance vs. baseline is at the bottom of every task's readme. https://t.co/4YSK6aLvt4 https://t.co/MkuXP5rVqB
2022-06-09 01:14:13 @stanfordnlp I didn't know anyone was saying otherwise! I think it's a mark of pride to manage a large collaboration (or even a small one). Projects in ML are also just going to keep on getting bigger, and so are author lists.
2022-05-25 05:06:10 RT @GoogleAI: Introducing Imagen, a new text-to-image synthesis model that can generate high-fidelity, photorealistic images from a deep le…
2022-05-24 19:10:38 RT @Chitwan_Saharia: We are thrilled to announce Imagen, a text-to-image model with unprecedented photorealism and deep language understand…
2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.
2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…
2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?
2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.
2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.
2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv
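A back-of-the-envelope version of this reply, with illustrative numbers only (not measurements from the paper):

    grad_ms, adam_ms = 100.0, 1.0                 # hypothetical per-step costs
    velo_ms = 10.0 * adam_ms                      # learned-optimizer update ~10x Adam's
    print(grad_ms + adam_ms, grad_ms + velo_ms)   # 101.0 vs 110.0 ms per step, ~9% slower overall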
2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.
2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX
2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ
2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…
2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…
2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…
2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…
2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt
2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.
2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS
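To make the N^2 concrete (N here is hypothetical):

    N = 10_000           # hypothetical number of meta-training steps
    print(N * N)         # ~1e8 applications of the learned optimizer in one meta-training run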
2022-11-18 04:50:07 If you are training models with <
2022-11-19 19:40:49 @GMartius The optimizer wall time overhead is about 10x the overhead of Adam. For most problems though this is still small compared to the time to compute the gradients. See the left pane in this plot from the appendix: https://t.co/QSZ1qR7NsQ
2022-11-19 19:34:13 @w_t_payne Yes! Or -- we don't address it in this work, but that is another clear target for meta-learning.
2022-11-19 19:32:45 @mauricetpunkt We have, but it doesn't seem to be necessary. We've also tried initializing the learned optimizer by *distilling* another optimizer, like Adam. This works OK, but has never really been pushed. (@Luke_Metz of course may have more to say)
2022-11-21 18:38:37 RT @wtgowers: Note that if X is a finite set and we take all its subsets, then every element of X belongs to exactly half the subsets. Yes…
2022-11-25 01:59:21 @geoffreyirving @sbeckerkahn This just exceeded my mathematical depth. I don't disbelieve you though!
2022-11-24 21:09:10 @geoffreyirving @sbeckerkahn OTOH, rationals are dense in the 2d plane, and Brownian motion has a fractal dimension of 2, so probably a Brownian SDE would hit rational points?
2022-11-21 18:38:37 RT @wtgowers: Note that if X is a finite set and we take all its subsets, then every element of X belongs to exactly half the subsets. Yes…
2022-11-19 19:40:49 @GMartius The optimizer wall time overhead is about 10x the overhead of Adam. For most problems though this is still small compared to the time to compute the gradients. See the left pane in this plot from the appendix: https://t.co/QSZ1qR7NsQ
2022-11-19 19:34:13 @w_t_payne Yes! Or -- we don't address it in this work, but that is another clear target for meta-learning.
2022-11-19 19:32:45 @mauricetpunkt We have, but it doesn't seem to be necessary. We've also tried initializing the learned optimizer by *distilling* another optimizer, like Adam. This works OK, but has never really been pushed. (@Luke_Metz of course may have more to say)
2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.
2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…
2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?
2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.
2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.
2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv
2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.
2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX
2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ
2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…
2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…
2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…
2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…
2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt
2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.
2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS
2022-11-18 04:50:07 If you are training models with <
2022-11-25 01:59:21 @geoffreyirving @sbeckerkahn This just exceeded my mathematical depth. I don't disbelieve you though!
2022-11-24 21:09:10 @geoffreyirving @sbeckerkahn OTOH, rationals are dense in the 2d plane, and Brownian motion has a fractal dimension of 2, so probably a Brownian SDE would hit rational points?
2022-11-21 18:38:37 RT @wtgowers: Note that if X is a finite set and we take all its subsets, then every element of X belongs to exactly half the subsets. Yes…
2022-11-19 19:40:49 @GMartius The optimizer wall time overhead is about 10x the overhead of Adam. For most problems though this is still small compared to the time to compute the gradients. See the left pane in this plot from the appendix: https://t.co/QSZ1qR7NsQ
2022-11-19 19:34:13 @w_t_payne Yes! Or -- we don't address it in this work, but that is another clear target for meta-learning.
2022-11-19 19:32:45 @mauricetpunkt We have, but it doesn't seem to be necessary. We've also tried initializing the learned optimizer by *distilling* another optimizer, like Adam. This works OK, but has never really been pushed. (@Luke_Metz of course may have more to say)
2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.
2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…
2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?
2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.
2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.
2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv
2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.
2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX
2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ
2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…
2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…
2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…
2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…
2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt
2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.
2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS
2022-11-18 04:50:07 If you are training models with <
2022-11-25 01:59:21 @geoffreyirving @sbeckerkahn This just exceeded my mathematical depth. I don't disbelieve you though!
2022-11-24 21:09:10 @geoffreyirving @sbeckerkahn OTOH, rationals are dense in the 2d plane, and Brownian motion has a fractal dimension of 2, so probably a Brownian SDE would hit rational points?
2022-11-21 18:38:37 RT @wtgowers: Note that if X is a finite set and we take all its subsets, then every element of X belongs to exactly half the subsets. Yes…
2022-11-19 19:40:49 @GMartius The optimizer wall time overhead is about 10x the overhead of Adam. For most problems though this is still small compared to the time to compute the gradients. See the left pane in this plot from the appendix: https://t.co/QSZ1qR7NsQ
2022-11-19 19:34:13 @w_t_payne Yes! Or -- we don't address it in this work, but that is another clear target for meta-learning.
2022-11-19 19:32:45 @mauricetpunkt We have, but it doesn't seem to be necessary. We've also tried initializing the learned optimizer by *distilling* another optimizer, like Adam. This works OK, but has never really been pushed. (@Luke_Metz of course may have more to say)
2022-11-18 22:00:11 @yablak Many ML are going to https://t.co/1AiLl2tfTk.
2022-11-18 21:47:49 RT @ada_rob: Here is a real-world example (not in the paper) for T5 Small (~60M params). The VeLO-trained model reaches the same loss as…
2022-11-18 19:37:42 @deliprao @Luke_Metz @jmes_harrison @bucketofkets Nope, JAX only for the moment -- unless you want to make the PyTorch port?
2022-11-18 19:36:41 @AIjedi @Luke_Metz Have we mentioned how great JAX is yet? JAX is pretty great. If some bold stranger wanted to port VeLO to PyTorch though, then that would be amazing.
2022-11-18 19:34:12 @fedetask @jmes_harrison We tested this in the paper for Ant ... it's not great. RL problems are very different than the meta-training distribution. Future work to finetune VeLO to do well on RL tasks, or to include RL problems in meta-training distribution for VeLO 2.
2022-11-18 19:32:21 @DrJimFan @jmes_harrison @Luke_Metz @bucketofkets @poolio @ada_rob @IMordatch @amilmerchant @jekbradbury @naman33k The overhead due to the optimizer is ~10x the overhead of Adam, which is usually small compared to the compute cost of computing the gradients for the model you are applying VeLO to train. See leftmost pane in this plot: https://t.co/359UyP9INv
2022-11-18 19:28:20 @short_spy Nope, not currently. Have I mentioned how great JAX is? Because JAX is really great.
2022-11-18 19:27:45 @TexasBigData https://t.co/C2sKUiOYgX
2022-11-18 14:29:34 @albertzeyer @bucketofkets @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob The model params are published. See https://t.co/G7bLCn2vkZ
2022-11-18 14:01:30 RT @giffmana: My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning,…
2022-11-18 04:59:22 RT @bucketofkets: This was a really fantastic project—the culmination of literal years of work led by .@Luke_Metz. We’re proud to release…
2022-11-18 04:57:18 RT @poolio: Learned optimizers finally work! Swap out Adam for VeLO: a learned optimizer that outperforms human-designed optimizers withn…
2022-11-18 04:57:13 RT @ada_rob: Very excited to have been a small part of this amazing project. More work to be done to make this optimizer the go-to for gi…
2022-11-18 04:50:10 And huge thank yous to collaborators @bucketofkets Amil Merchant @giffmana @jekbradbury @naman33k @poolio @IMordatch @ada_rob as well!!! https://t.co/rKHQ2dInXt
2022-11-18 04:50:09 And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.
2022-11-18 04:50:08 Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2). https://t.co/GwDaEZLWgS
2022-11-18 04:50:07 If you are training models with <
2022-11-25 01:59:21 @geoffreyirving @sbeckerkahn This just exceeded my mathematical depth. I don't disbelieve you though!
2022-11-24 21:09:10 @geoffreyirving @sbeckerkahn OTOH, points with rational coordinates are dense in the 2D plane, and Brownian motion has fractal dimension 2, so probably a Brownian SDE would hit rational points?
2022-11-21 18:38:37 RT @wtgowers: Note that if X is a finite set and we take all its subsets, then every element of X belongs to exactly half the subsets. Yes…