Md Taufique Hussain

2papers

2 Papers

1.2DCMar 21
Communication Lower Bounds and Algorithms for Sketching with Random Dense Matrices

Hussam Al Daas, Grey Ballard, Laura Grigori et al.

Sketching is widely used in randomized linear algebra for low-rank matrix approximation, column subset selection, and many other problems, and it has gained significant traction in machine learning applications. However, sketching large matrices often necessitates distributed memory algorithms, where communication overhead becomes a critical bottleneck on modern supercomputing clusters. Despite its growing relevance, distributed-memory parallel strategies for sketching remain largely unexplored. In this work, we establish communication lower bounds for sketching using dense matrices that determine how much data movement is required to perform it in parallel. One important observation of our lower bounds is that no communication is required for a small number of processors. We show that our lower bounds are tight by presenting communication optimal algorithms. Furthermore, we extend our approach to determine communication lower bounds for computations of Nyström approximation where sketching is applied twice. We also introduce novel parallel algorithms whose communication costs are close to the lower bounds. Finally, we implement our algorithms on modern state-of-the-art supercomputing infrastructures which have both CPU- and GPU-equipped systems and demonstrate their parallel scalability.

15.2DCMay 15
High-Performance Star-M SVD for Big Data Compression

Md Taufique Hussain, Grey Ballard, Aditya Devarakonda et al.

In the era of big data, effectively compressing large datasets while performing complex mathematical operations is crucial. Tensor-based decomposition methods have shown superior compression capabilities with minimal loss of accuracy compared to traditional matrix methods. Under the star-M tensor framework, tensors can be decomposed in a matrix-mimetic way, including using the star-M SVD. This tensor SVD has optimality guarantees and has shown exceptional performance on specific types of data, but software implementations have been mostly limited to productivity-oriented languages. In this work, we present our development of a shared-memory parallel, high-performance solution designed to efficiently implement the underlying algorithms. This software will enable optimal compression of extensive scientific datasets, paving the way for enhanced data analysis and insights.