Joint Econometrics and Statistics Workshop
ScreeNOT: Optimal Singular Value Thresholding and Principal Component Selection in Correlated Noise
Elad Romanov (Stanford)
Tuesday 20 June 2023 12:00 - 13:00
Many of our seminars and public events this year will continue as in person or as hybrid (online and in person) events. Please check our website listings and Twitter feed @STICERD_LSE for updates.
Unless otherwise specified, in-person seminars are open to the public.
Those unable to join the seminars in-person are welcome to participate via zoom if the event is hybrid.
About this event
Principal Component Analysis (PCA) is a fundamental and ubiquitous tool in statistics and data analysis. The bare-bones idea is this. Given a data set of n points y_1, ..., y_n, form their sample covariance S. Eigenvectors corresponding to large eigenvalues--namely directions along which the variation within the data set is large--are usually thought of as "important" or "signal-bearing"; in contrast, weak directions are often interpreted as "noise", and discarded along the proceeding steps of the data analysis pipeline. Principal component (PC) selection is an important methodological question: how large should an eigenvalue be so as to be considered "informative"? Our main deliverable is ScreeNOT: a novel, mathematically-grounded procedure for PC selection. It is intended as a fully algorithmic replacement for the heuristic and somewhat vaguely-defined procedures that practitioners often use--for example the popular "scree test". Towards tackling PC selection systematically, we model the data matrix as a low-rank signal plus noise matrix Y = X + Z; accordingly, PC selection is cast as an estimation problem for the unknown low-rank signal matrix X, with the class of permissible estimators being singular value thresholding rules. We consider a formulation of the problem under the spiked model. This asymptotic setting captures some important qualitative features observed across numerous real-world data sets: most of the singular values of Y are arranged neatly in a "bulk", with very few large outlying singular values exceeding the bulk edge. We propose an adaptive algorithm that, given a data matrix, finds the optimal truncation threshold in a data-driven manner under essentially arbitrary noise conditions: we only require that Z has a compactly supported limiting spectral distribution--which may be a priori unknown. Under the spiked model, our algorithm is shown to have rather strong oracle optimality properties: not only does it attain the best error asymptotically, but it also achieves (w.h.p.) the best error--compared to all alternative thresholds--at finite n.
Econometrics and Statistics seminars are held on Fridays in term time at 12:00-13:00, ONLINE, unless specified otherwise.
Seminar organisers: Dr Tatiana Komarova and Dr Yunxiao Chen.
For further information please contact Lubala Chibwe: firstname.lastname@example.org.
Please use this link to subscribe or unsubscribe to the Econometrics and Statistics seminars mailing list (stats).