

# Extreme Big Data and Deep Learning Algorithm Platform

This research is supported by JST CREST Grant Numbers JPMJCR1303 and JPMJCR1687.

## **Predicting Statistics of ASGD**

Collaborative work with DENSO CORPORATION and DENSO IT LABORATORY, INC

In large-scale Asynchronous Stochastic Gradient Descent (ASGD), the mini-batch size and gradient staleness tend to be large and unpredictable, which increases the error of the trained DNN.
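As a rough illustration of why staleness is hard to predict, the sketch below simulates a parameter server with asynchronous workers and records, for each gradient, how many updates were applied between the worker's parameter read and the gradient's arrival. The worker count, the exponential compute-time model, and the function name are assumptions made for illustration; this is not the performance model from the referenced paper.

```python
import heapq
import random
from collections import Counter


def simulate_staleness(num_workers=8, num_updates=10_000, seed=0):
    """Return a Counter mapping staleness -> number of gradients."""
    rng = random.Random(seed)
    # Each event is (finish_time, worker_id, version_read_by_worker).
    events = [(rng.expovariate(1.0), w, 0) for w in range(num_workers)]
    heapq.heapify(events)
    version = 0  # number of updates applied to the shared parameters
    histogram = Counter()
    for _ in range(num_updates):
        finish, worker, read_version = heapq.heappop(events)
        # Staleness: updates applied since this worker read the parameters.
        histogram[version - read_version] += 1
        version += 1  # the server applies the (possibly stale) gradient
        # The worker re-reads the parameters and starts its next gradient.
        next_finish = finish + rng.expovariate(1.0)
        heapq.heappush(events, (next_finish, worker, version))
    return histogram


if __name__ == "__main__":
    hist = simulate_staleness()
    total = sum(hist.values())
    mean = sum(s * c for s, c in hist.items()) / total
    print(f"mean staleness: {mean:.2f}, max staleness: {max(hist)}")
```

Even in this idealized setting the staleness is a distribution rather than a constant, which is why a model that accounts for its probability distribution is useful.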

#### Objective function *E*



We propose an empirical performance model for SPRINT, an ASGD deep learning system, that considers the probability distributions of mini-batch size and staleness.

Reference: Yosuke Oyama, Akihiro Nomura, Ikuro Sato, Hiroki Nishimura, Yukimasa Tamatsu, and Satoshi Matsuoka, "Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers", IEEE BigData 2016

**Evaluating Apps on FPGAs** 

FPGAs can now rival CPUs and GPUs in performance and energy efficiency, but they are also known to be hard to program. We compared three high-level programming approaches for FPGAs:

- 30-core many-core system (representative of programmability)
- LegUp High-Level Synthesis (representative of multiple custom accelerators)
- Intel OpenCL for FPGA (representative of deep-pipeline designs)

We evaluated them using the Rodinia Benchmark Suite on a Stratix V FPGA. We improved the memory hierarchy of the many-core and multi-accelerator designs through cache multi-banking.

### Fast SpGEMM on GPU



Findings from the FPGA evaluation (Evaluating Apps on FPGAs):

- Intel OpenCL for FPGA showed the highest average performance.
- LegUp can remain competitive, with good performance for applications with spatial/temporal locality, even without the memory-hierarchy improvement.
- The many-core system offers good programmability, but often does not perform well compared to the other approaches.

Reference: "Evaluating High-Level Design Strategies on FPGAs for High-Performance Computing", A. Podobas, H.R. Zohouri, N. Maruyama, S. Matsuoka, IEEE FPL 2017

### **Unlimited GPU Memory with DRAGON**

We have devised a new Sparse General Matrix-Matrix Multiplication (SpGEMM) algorithm for GPUs. It achieves further speedups and reduces memory usage, so that it can be applied to a wide variety of matrices, by utilizing the GPU's on-chip shared memory and assigning GPU resources appropriately.

**Two-Phase Algorithm**: the 1st phase counts the number of non-zero elements of the output matrix, and the 2nd phase calculates the output matrix  $\rightarrow$  reduces memory usage

- Grouping rows (1, 2, 6)  $\rightarrow$  better utilization of GPU resources
- Two ways of assigning threads  $\rightarrow$  improved load balance
- Hash table in fast shared memory  $\rightarrow$  accelerates the counting part (3) and the calculation part (7)

#### **Double Precision Performance**

[Figure: double-precision performance comparison; legend: CUSP [1], cuSPARSE [2], BHSPARSE [3]]

Steps of the algorithm:

- (1) Count #intermediate products
- (2) Divide the rows into groups by #intermediate products
- (3) Count #non-zero elements
- (4) Set row pointers of output matrix
- (5) Memory allocation of output matrix
- (6) Divide the rows into groups by #non-zero elements
- (7) Compute the output matrix
  - a. Calculate values and column indices on hash table
  - b. Shrink the hash table
  - c. Store to the memory with sorting
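The two-phase structure in the steps above can be sketched serially on the CPU. The following is a minimal illustration, assuming CSR inputs given as `(row_ptr, col_idx, values)` tuples; the function name is hypothetical, Python `set` and `dict` objects stand in for the shared-memory hash tables, and row grouping and thread assignment are omitted. It is not the nsparse implementation.

```python
def spgemm_two_phase(a, b):
    """Multiply CSR matrices a and b; each is (row_ptr, col_idx, values)."""
    a_ptr, a_col, a_val = a
    b_ptr, b_col, b_val = b
    n_rows = len(a_ptr) - 1

    # Phase 1 (count): number of non-zeros in every output row, steps (3)-(4).
    c_ptr = [0] * (n_rows + 1)
    for i in range(n_rows):
        seen = set()  # stands in for the per-row hash table
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j = a_col[k]
            seen.update(b_col[b_ptr[j]:b_ptr[j + 1]])
        c_ptr[i + 1] = c_ptr[i] + len(seen)

    # Step (5): allocate the output once, using the exact counted size.
    c_col = [0] * c_ptr[-1]
    c_val = [0.0] * c_ptr[-1]

    # Phase 2 (compute): accumulate products per row, then store sorted, step (7).
    for i in range(n_rows):
        acc = {}
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j = a_col[k]
            for m in range(b_ptr[j], b_ptr[j + 1]):
                acc[b_col[m]] = acc.get(b_col[m], 0.0) + a_val[k] * b_val[m]
        for offset, col in enumerate(sorted(acc)):
            c_col[c_ptr[i] + offset] = col
            c_val[c_ptr[i] + offset] = acc[col]
    return c_ptr, c_col, c_val


if __name__ == "__main__":
    # A = [[1, 0], [2, 3]] and B = [[0, 4], [5, 0]] in CSR form.
    a = ([0, 1, 3], [0, 0, 1], [1.0, 2.0, 3.0])
    b = ([0, 1, 2], [1, 0], [4.0, 5.0])
    print(spgemm_two_phase(a, b))
```

Because the first phase determines the exact number of non-zeros per row, the output arrays are allocated once at their final size instead of being sized for the worst case, which is where the memory saving comes from.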

Problem sizes are growing beyond GPU memory capacity, and even beyond host memory capacity. We proposed DRAGON, a framework that seamlessly enables all classes of GPU applications to directly access terabytes of data on storage with user-oblivious addressing. DRAGON leverages the GPU hardware page-faulting mechanism and two-level prefetching with direct page-cache access to achieve this, while delivering up to 2.3x speedup compared with using Unified Memory + POSIX IO.
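User-oblivious addressing means the application indexes the data as if it were ordinary memory, while pages are brought in from storage on demand. The CPU-side sketch below illustrates the same idea with Python's standard `mmap` module, where the operating system's page-fault handling plays the role that GPU page faulting plays in DRAGON. It is an analogy only: it does not use the DRAGON API, and the file name and helper functions are hypothetical.

```python
import mmap
import os
import struct
import tempfile

ITEM_SIZE = struct.calcsize("d")  # bytes per double


def write_doubles(path, count):
    """Create a file holding the doubles 0.0, 1.0, ..., count - 1."""
    with open(path, "wb") as f:
        for i in range(count):
            f.write(struct.pack("d", float(i)))


def read_double(mapped, index):
    """Read element `index`; only the page that holds it must be resident."""
    return struct.unpack_from("d", mapped, index * ITEM_SIZE)[0]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")  # hypothetical file name
        write_doubles(path, 1_000_000)
        with open(path, "rb") as f:
            # Map the whole file; no explicit read or copy calls follow.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                print(read_double(mapped, 123_456))
```

The caller never issues explicit read or copy operations for the data it touches, which is the property the paragraph above describes for GPU kernels running under DRAGON.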

[Figures: DRAGON architecture and operation]


Collaborative work with Oak Ridge National Laboratory

**Data transfer methods** 

- **df**: cudaMemcpy + IO
- **hr**: cudaHostRegister + mmap
- **um**: Unified Memory + IO
- **dg**: DRAGON

**DRAGON** is the only existing out-of-core solution with user-oblivious addressing.





[Figure legend: cuSPARSE, BHSPARSE, nsparse]



[1] Dalton et al., "CUSP: Generic parallel algorithms for sparse matrix and graph computations ver.0.5.1"

[2] NVIDIA, "Nvidia cuda sparse matrix library (cuSPARSE)"

[3] Liu et al., "An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data", IPDPS 2014

Code available at: https://github.com/EBD-CREST/nsparse

Reference: Yusuke Nagasaka, Akira Nukada, Satoshi Matsuoka, "High-performance and Memory-saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU", ICPP 2017.



#### **Presentation at SC18 Technical Program:** 11:00-11:30, Wednesday, Room C141/143/149

Reference: Pak Markthub, Mehmet E. Belviranli, Seyong Lee, Jeffrey S. Vetter, and Satoshi Matsuoka, "DRAGON: Breaking GPU Memory Capacity Limits with Direct NVM Access", SC18

### http://www.gsic.titech.ac.jp/sc18