Sketchy: Memory-Efficient Adaptive Regularization with Frequent Directions
Research article, free access.
- Authors:
- Vladimir Feinberg (Google DeepMind)
- Xinyi Chen (Princeton University, Google DeepMind)
- Y. Jennifer Sun (Princeton University, Google DeepMind)
- Rohan Anil (Google DeepMind)
- Elad Hazan (Princeton University, Google DeepMind)
NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems, December 2023, Article No. 3316, pages 75911–75924.
Published: 30 May 2024.
ABSTRACT
Adaptive regularization methods that exploit more than the diagonal entries exhibit state-of-the-art performance on many tasks, but can be prohibitive in terms of memory and running time. We find that the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace that changes throughout training, motivating a low-rank sketching approach. We describe a generic method for reducing the memory and compute requirements of maintaining a matrix preconditioner using the Frequent Directions (FD) sketch. While previous approaches have explored applying FD to second-order optimization, we present a novel analysis that allows efficient interpolation between resource requirements and the degradation in regret guarantees with rank k: in the online convex optimization (OCO) setting over dimension d, we match full-matrix d² memory regret using only dk memory, up to an additive error in the bottom d − k eigenvalues of the gradient covariance. Further, we show extensions of our work to Shampoo, resulting in a method competitive in quality with Shampoo and Adam, yet requiring only sub-linear memory for tracking second moments.
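Since the abstract's core primitive is the Frequent Directions sketch, a short illustration may help. The block below is a minimal NumPy sketch of the generic FD algorithm in the style of Ghashami et al., not the authors' Sketchy/Shampoo implementation; the class and method names (`FrequentDirections`, `update`, `precondition`) are hypothetical. It maintains an ℓ × d matrix B with BᵀB ≈ Σₜ gₜgₜᵀ using O(ℓd) memory, and `precondition` applies a full-matrix-AdaGrad-style step (BᵀB + εI)^(−1/2) g without ever forming the d × d covariance. The standard FD guarantee, ‖AᵀA − BᵀB‖₂ ≤ ‖A − A_k‖²_F / (ℓ − k) for any k < ℓ (where A stacks the gradients as rows and A_k is its best rank-k approximation), is the kind of bound behind the abstract's "additive error in the bottom d − k eigenvalues."

```python
# Minimal Frequent Directions (FD) sketch in NumPy -- an illustrative sketch of
# the generic algorithm, not the authors' Sketchy implementation.
import numpy as np


class FrequentDirections:
    """Maintains B (ell x d) such that B^T B approximates sum_t g_t g_t^T."""

    def __init__(self, dim: int, ell: int):
        self.ell = ell
        self.B = np.zeros((ell, dim))
        self.next_row = 0  # index of the next all-zero row of B

    def update(self, g: np.ndarray) -> None:
        """Insert one gradient as a row, shrinking the sketch when it fills."""
        if self.next_row == self.ell:
            self._shrink()
        self.B[self.next_row] = g
        self.next_row += 1

    def _shrink(self) -> None:
        # SVD of the sketch; NumPy returns singular values in descending order.
        _, s, vt = np.linalg.svd(self.B, full_matrices=False)
        delta = s[-1] ** 2  # smallest squared singular value
        # Subtract delta from all squared singular values; the last row zeroes out.
        self.B = np.sqrt(np.maximum(s**2 - delta, 0.0))[:, None] * vt
        self.next_row = self.ell - 1

    def precondition(self, g: np.ndarray, eps: float = 1e-8) -> np.ndarray:
        """Apply (B^T B + eps*I)^(-1/2) to g without forming the d x d matrix."""
        _, s, vt = np.linalg.svd(self.B, full_matrices=False)
        coeff = vt @ g                                   # component in rowspace(B)
        in_span = vt.T @ (coeff / np.sqrt(s**2 + eps))
        out_of_span = (g - vt.T @ coeff) / np.sqrt(eps)  # orthogonal complement
        return in_span + out_of_span


# Toy usage under the same assumptions: random gradients, d = 100, rank-8 sketch.
rng = np.random.default_rng(0)
fd = FrequentDirections(dim=100, ell=8)
for _ in range(1000):
    g = rng.normal(size=100)
    fd.update(g)
step_direction = fd.precondition(g)  # O(d * ell^2) time, O(d * ell) memory
```

Per-step cost is dominated by the ℓ × d SVD, so tracking second moments takes O(dℓ) memory and O(dℓ²) time instead of the O(d²) memory and O(d³) time of full-matrix adaptive regularization, which is the resource/regret trade-off the abstract quantifies.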
Supplemental Material
3666122.3669438_supp.pdf (426.3 KB), available for download.
Published in
NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems, December 2023 (80772 pages).
Editors: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine
Copyright © 2023 Neural Information Processing Systems Foundation, Inc.
Publisher: Curran Associates Inc., Red Hook, NY, United States
Published: 30 May 2024
Qualifiers: research-article, refereed limited