<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="https://www.gsic.titech.ac.jp"  xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>[GSIC] Tokyo Institute of Technology | Global Scientific Information and Computing Center - Grand Challenge (Large Scale GPU applications)</title>
 <link>https://www.gsic.titech.ac.jp/en/taxonomy/term/46</link>
 <description></description>
 <language>en</language>
<item>
 <title>Preliminary report</title>
 <link>https://www.gsic.titech.ac.jp/en/node/652</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;p&gt;A preliminary report is required within one month after a user project completes the TSUBAME grand-challenge program. The project results from the preliminary reports are made public on this web page, as shown below.&lt;/p&gt;
&lt;p&gt;A final report is also required within one year and is made public in &quot;&lt;a href=&quot;&quot;&gt;the list of adopted projects&lt;/a&gt;&quot;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;#h24au&quot;&gt;2012 Autumn&lt;/a&gt;, &lt;a href=&quot;#h24sp&quot;&gt;2012 Spring&lt;/a&gt;, &lt;a href=&quot;#h25sp&quot;&gt;2013 Spring&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;h24au&quot;&gt;2012 Autumn adopted subject&lt;/h2&gt;
&lt;h3 id=&quot;h24au01&quot;&gt; Large-Eddy Simulation of Wind Blowing for a Wide Area of Tokyo with 1-m Resolution by using Lattice Boltzmann Method on TSUBAME 2.0&lt;/h3&gt;
&lt;p&gt;[Person in charge]&lt;br /&gt;
      Takayuki Aoki - Professor, Tokyo Institute of Technology, GSIC&lt;br /&gt;
[Category]&lt;br /&gt;
      Category A (all nodes)&lt;/p&gt;
&lt;p&gt;The many tall buildings and complex structures in large cities make the air flow turbulent. In order to understand the details of the airflow there, it is necessary to carry out large-scale CFD (Computational Fluid Dynamics) simulations. We have developed a CFD code based on the LBM (Lattice Boltzmann Method). Since air flows in large cities are turbulent, with Reynolds numbers of several million, an LES (Large-Eddy Simulation) model has to be introduced into the LBM equation. The dynamic Smagorinsky model is often used; however, it requires an averaging operation over a wide area to determine the model constant. Since this becomes a huge overhead for large-scale computations, we applied the coherent-structure Smagorinsky model, which does not take any spatial averages and is able to determine the model constant locally. The code is written in CUDA, and the GPU kernel function is well tuned to achieve high performance on Fermi-core GPUs. By overlapping the GPU-to-GPU communication with the GPU kernel computation, we improved the performance of the large-scale computation by 30%. Although the LBM computation is essentially memory-bound, we obtained fairly good performance in both strong and weak scalability. We achieved 149 TFLOPS in single precision, which corresponds to 15% of the peak performance of 1,000 GPUs. We used 4,032 GPUs for the computation on a 10,080 × 10,240 × 512 mesh. Through this large-scale computation, detailed winds behind buildings, the so-called &quot;wind street&quot; along major streets, typhoon damage, and other phenomena will be revealed with much higher accuracy than before. An LES computation for a 10 km × 10 km area with 1-m resolution had never been done before anywhere in the world.&lt;/p&gt;
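The communication-computation overlap mentioned above can be sketched in a few lines. This is only an illustrative sketch: it uses a Python thread where the real code uses CUDA streams and MPI, and interior_update, boundary_update, and exchange_halos are hypothetical stand-ins for the project's kernels and transfers.

```python
import threading

def step_with_overlap(interior_update, boundary_update, exchange_halos, grid):
    """One time step in which halo exchange overlaps interior computation.

    All four arguments are hypothetical stand-ins: in the real code, a CUDA
    kernel updates the interior while cudaMemcpy and MPI move the halos.
    """
    # Start the inter-GPU halo exchange in the background.
    t = threading.Thread(target=exchange_halos, args=(grid,))
    t.start()
    # Meanwhile, update all cells that do not need neighbor data.
    interior_update(grid)
    # Wait for fresh halos, then finish the cells on the subdomain edges.
    t.join()
    boundary_update(grid)
```

The point of the pattern is that the exchange cost is hidden whenever the interior kernel takes at least as long as the transfer.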
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/h24au01_fig1.png&quot; align=&quot;center&quot; width=&quot;400&quot; alt=&quot;Result of Wind Blowing in a Wide Area of Tokyo&quot; /&gt;&lt;br /&gt;
      Figure   Result of Wind Blowing in a Wide Area of Tokyo&lt;/p&gt;
&lt;h3 id=&quot;h24au02&quot;&gt;Performance evaluation of large-scale graph processing benchmark, Graph500 with newly introduced graph kernel using highly scalable search method&lt;/h3&gt;
&lt;p&gt;[Person in charge]&lt;br /&gt;
      Toyotaro Suzumura - Visiting Associate Professor,  Tokyo Institute of Technology&lt;br /&gt;
[Category]&lt;br /&gt;
      Category A (all nodes)&lt;/p&gt;
&lt;p&gt;Graph500 is a new benchmark that ranks supercomputers by executing a large-scale graph search problem.&lt;br /&gt;
It performs breadth-first searches (BFS) on large undirected graphs, and the ranking depends on the throughput in TEPS (Traversed Edges Per Second). We investigated optimal parameters for our Graph500 BFS implementation and found that it can perform BFS faster than before. We ran the Graph500 benchmark with the optimal parameters and achieved 431 GTEPS on a large graph of SCALE 35. We also ran our optimized implementation of the Single Source Shortest Path (SSSP) problem, which is a new kernel in the Graph500 benchmark. We evaluated its performance and found that (a) the 2D partitioning method for the adjacency matrix is effective for SSSP as well as for BFS, and (b) the inter-process synchronization cost is large when computing SSSP.&lt;/p&gt;
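The TEPS metric can be illustrated with a minimal serial BFS. This is a sketch of the metric only, not of the distributed 2D-partitioned implementation; bfs_teps is a hypothetical helper, and counting every scanned edge is a simplification of the official Graph500 rules.

```python
from collections import deque

def bfs_teps(adj, source, elapsed_seconds):
    """Run a BFS and report throughput in TEPS, the Graph500 metric.

    adj is an adjacency list of an undirected graph; elapsed_seconds would
    come from timing the search on the real machine.
    """
    parent = {source: source}
    frontier = deque([source])
    traversed_edges = 0
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            traversed_edges += 1          # every scanned edge counts (simplified)
            if w not in parent:
                parent[w] = v
                frontier.append(w)
    return traversed_edges / elapsed_seconds   # TEPS
```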
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/h24au02_fig1.png&quot; align=&quot;center&quot; width=&quot;400&quot; alt=&quot;Performance of Graph500 benchmark on TSUBAME2.0&quot; /&gt;&lt;br /&gt;
      Figure   Performance of Graph500 benchmark on TSUBAME2.0&lt;/p&gt;
&lt;h2 id=&quot;h24sp&quot;&gt;2012 Spring adopted subject&lt;/h2&gt;
&lt;h3 id=&quot;h24sp01&quot;&gt;Performance Evaluation of Large Scale Graph Processing Benchmark, Graph500 with Highly Scalable GPU Implementation&lt;/h3&gt;
&lt;p&gt;[Person in charge]&lt;br /&gt;
      Toyotaro Suzumura - Visiting Associate Professor,  Tokyo Institute of Technology&lt;br /&gt;
[Category]&lt;br /&gt;
      Category A (all nodes)&lt;/p&gt;
&lt;p&gt;Graph500 is a new benchmark that ranks supercomputers by executing a large-scale graph search problem.&lt;br /&gt;
It performs breadth-first searches (BFS) on large undirected graphs, and the ranking depends on the throughput in TEPS (Traversed Edges Per Second). We developed a CPU implementation and a GPU implementation using our new, efficient method for large distributed BFS based on 2D partitioning, and ran both in our grand-challenge benchmark. Our CPU implementation solved BFS for a large-scale graph with 137 billion vertices and 2.2 trillion edges in 10.85 seconds using 1,366 nodes and 2,732 CPUs, which corresponds to 203 GTEPS. Our GPU implementation solved BFS for a graph with 34.4 billion vertices and 550 billion edges in 1.73 seconds using 1,366 nodes and 4,096 GPUs.&lt;/p&gt;
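The 2D partitioning referred to above tiles the adjacency matrix over a grid of processors. The following is a toy sketch of that layout, assuming a simple block distribution; owner_of_edge is a hypothetical helper, not the project's code.

```python
def owner_of_edge(u, v, pr, pc, n):
    """Map edge (u, v) of an n x n adjacency matrix to the processor (r, c)
    owning its block in a pr x pc process grid (2D partitioning).
    """
    rows_per = -(-n // pr)   # ceiling division: rows per block row
    cols_per = -(-n // pc)   # columns per block column
    return (u // rows_per, v // cols_per)
```

Because each BFS frontier expansion then only needs communication along one row and one column of the process grid, the scheme scales better than a 1D vertex partition.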
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/h24sp01_fig1.png&quot; align=&quot;center&quot; width=&quot;400&quot; alt=&quot;Performance of Graph500 benchmark on TSUBAME2.0&quot; /&gt;&lt;br /&gt;
      Figure   Performance of Graph500 benchmark on TSUBAME2.0&lt;/p&gt;
&lt;h3 id=&quot;h24sp02&quot;&gt;Solving extremely large-scale semidefinite optimization problems via parallel computation of the interior-point method&lt;/h3&gt;
&lt;p&gt;[Person in charge]&lt;br /&gt;
      Katsuki Fujisawa - Professor,  Department of Industrial and Systems Engineering Chuo University&lt;br /&gt;
[Category]&lt;br /&gt;
      Category A (all nodes)&lt;/p&gt;
&lt;p&gt;Semidefinite Programming (SDP) is one of the most important problem classes in current optimization research. It covers a wide range of applications such as combinatorial optimization, structural optimization, control theory, economics, quantum chemistry, sensor network location, and data mining. Solving extremely large-scale SDP problems is of significant importance for current and future applications of SDPs. In 1995, Fujisawa et al. started the SDPA Project, aimed at solving large-scale SDP problems with numerical stability and accuracy. SDPA is one of the pioneering codes for solving general SDPs. SDPARA is a parallel version of SDPA for multiple processors with distributed memory, which replaces the two major bottleneck parts of SDPA (the generation of the Schur complement matrix and its Cholesky factorization) with parallel implementations. In this grand challenge, we accelerated these parts by using massively parallel GPUs, which offer much higher computational performance than CPUs. In order to achieve scalable performance with thousands of GPUs, we utilized high-performance BLAS kernels coupled with optimization techniques that overlap computation, PCI-Express communication, and MPI communication (Figure 1). SDPARA has been successfully applied to combinatorial optimization and truss topology optimization; the new version of SDPARA (7.5.0-G), running on the large-scale supercomputer TSUBAME 2.0 at Tokyo Institute of Technology, succeeded in solving the largest SDP problem to date (Figure 2), which has over 1.48 million constraints, setting a new world record. Our implementation also achieved 533 TFlops in double precision for the large-scale Cholesky factorization using 2,720 CPUs and 4,080 GPUs (Figure 3).&lt;/p&gt;
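The Cholesky factorization named above as one of the two bottlenecks can be illustrated with a serial sketch. This is an illustrative toy, not SDPARA's code: in the real solver the trailing-matrix updates become large distributed BLAS-3 calls, which is what the GPUs accelerate.

```python
import math

def cholesky_lower(a):
    """In-place Cholesky of a dense SPD matrix given as a list of lists:
    a[i][j] for i >= j becomes L with A = L L^T (upper part untouched).
    """
    n = len(a)
    for j in range(n):
        # Diagonal entry: subtract the already-computed row of L.
        s = a[j][j] - sum(a[j][k] ** 2 for k in range(j))
        a[j][j] = math.sqrt(s)
        # Column below the diagonal: the GEMM-like trailing work that
        # dominates the flop count and is offloaded to GPUs in SDPARA.
        for i in range(j + 1, n):
            t = a[i][j] - sum(a[i][k] * a[j][k] for k in range(j))
            a[i][j] = t / a[j][j]
    return a
```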
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/h24sp02_fig1.png&quot; align=&quot;center&quot; width=&quot;400&quot; alt=&quot;GPU Computation, PCIe communication and MPI communication are overlapped&quot; /&gt;&lt;br /&gt;
      Figure 1 : GPU Computation, PCIe communication and MPI communication are overlapped&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/h24sp02_fig2.png&quot; align=&quot;center&quot; width=&quot;400&quot; alt=&quot;The largest SDP problem and its block diagonal structure&quot; /&gt;&lt;br /&gt;
      Figure 2 : The largest SDP problem and its block diagonal structure&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/h24sp02_fig3.png&quot; align=&quot;center&quot; width=&quot;400&quot; alt=&quot;Performance of GPU Cholesky factorization on the full TSUBAME2.0 system (up to 1,360nodes, 4,080GPUs)&quot; /&gt;&lt;br /&gt;
      Figure 3 : Performance of GPU Cholesky factorization on the full TSUBAME2.0 system (up to 1,360nodes, 4,080GPUs)&lt;/p&gt;
&lt;h2 id=&quot;h25sp&quot;&gt;2013 Spring adopted subject&lt;/h2&gt;
&lt;h3 id=&quot;h25sp01&quot;&gt;Solving the Schrödinger Equation of Some Organic and Inorganic Molecules with Superparallel Computer TSUBAME&lt;/h3&gt;
&lt;p&gt;[Person in charge]&lt;br /&gt;
      Hiroshi Nakatsuji - Director,  Quantum Chemistry Research Institute&lt;br /&gt;
[Category]&lt;br /&gt;
      Category B&lt;/p&gt;
&lt;p&gt;The present purpose is to solve the Schrödinger equations of general organic molecules with the free complement - local Schrödinger equation (FC-LSE) method. To realize highly accurate Schrödinger-level calculations of general organic molecules, we first implemented the From Atom to Molecule (FATM) method combined with the increasing exchange (iExg) method of antisymmetrization (HN, QCRI Report, June 2010), which has clear order-N characteristics. We could effectively use the TSUBAME supercomputer both for developing the methodology and programs and for the actual calculations. Although the organic molecules treated here were not large, we obtained more accurate wave functions than those of ordinary quantum chemistry methods. We could also perform test calculations on the benzene molecule, whose application was not planned in the original proposal. In the timing tests of the present program, super-parallel efficiency was obtained, but this may indicate the need for tuning on a single processor.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/h25sp01_fig1.png&quot; align=&quot;center&quot; width=&quot;400&quot; alt=&quot;Solving the Schrödinger equations of general organic molecules with the FC-LSE-iExg method&quot; /&gt;&lt;br /&gt;
      Figure  Solving the Schrödinger equations of general organic molecules with the FC-LSE-iExg method&lt;/p&gt;
&lt;h3 id=&quot;h25sp02&quot;&gt;Development and Numerical Evaluation of High-Performance General Solver for Extremely Large-Scale Semidefinite Programming&lt;/h3&gt;
&lt;p&gt;[Person in charge]&lt;br /&gt;
      Katsuki Fujisawa - Professor,  Department of Industrial and Systems Engineering Chuo University&lt;br /&gt;
[Category]&lt;br /&gt;
      Category A (all nodes)&lt;/p&gt;
&lt;p&gt;The semidefinite programming (SDP) problem is one of the most central problems in mathematical optimization. The numerical results of solving large-scale SDP problems can provide useful information and solutions in many research fields. The primal-dual interior-point method (PDIPM) is one of the most powerful algorithms for solving SDP problems, and many research groups have employed it in developing software packages. However, two well-known major bottlenecks, i.e., the generation of the Schur complement matrix (SCM) and its Cholesky factorization, exist in the algorithmic framework of the PDIPM. SDP is relevant to a wide range of fields such as combinatorial optimization, structural optimization, control theory, quantum chemistry, and data mining; however, which part becomes the bottleneck strongly depends on the problem size, the number of constraints, and the sparsity of the problem. We have developed a new version of SDPARA, a parallel implementation on multiple CPUs and GPUs for solving extremely large-scale SDP problems with over a million constraints.&lt;br /&gt;
SDPARA can automatically extract the unique characteristics of an SDP problem and identify the bottleneck. When the generation of the SCM becomes the bottleneck, SDPARA can attain high scalability using a large number of CPU cores together with techniques for processor affinity and memory interleaving (Figures 2 and 3). SDPARA can also perform parallel Cholesky factorization using thousands of GPUs and techniques that overlap computation and communication when an SDP problem has over a million constraints and the Cholesky factorization constitutes the bottleneck. We demonstrate that SDPARA is a high-performance general solver for SDPs in various application fields through numerical experiments on the TSUBAME 2.0 supercomputer; we solved the largest SDP problem to date (which has over 2.33 million constraints) (Figure 1), thereby setting a new world record. Our implementation also achieved 1.018 PFlops in double precision for large-scale Cholesky factorization using 2,720 CPUs and 4,080 GPUs (Figure 4).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/h25sp02_fig1.png&quot; align=&quot;center&quot; width=&quot;400&quot; alt=&quot;The largest SDP problem and its block diagonal structure&quot; /&gt;&lt;br /&gt;
      Figure 1 : The largest SDP problem and its block diagonal structure&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/h25sp02_fig2a.png&quot; align=&quot;center&quot; width=&quot;400&quot; alt=&quot;CPU affinity policies: “scatter” and “compact”&quot; /&gt;&lt;br /&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/h25sp02_fig2b.png&quot; align=&quot;center&quot; width=&quot;400&quot; alt=&quot;CPU affinity policies: “scatter” and “compact”&quot; /&gt;&lt;br /&gt;
      Figure 2 : CPU affinity policies: “scatter” and “compact”&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/h25sp02_fig3.png&quot; align=&quot;center&quot; width=&quot;400&quot; alt=&quot;Scatter-type affinity and memory interleaving for the HP SL390s G7 on TSUBAME2.0&quot; /&gt;&lt;br /&gt;
      Figure 3 : Scatter-type affinity and memory interleaving for the HP SL390s G7 on TSUBAME2.0&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/h25sp02_fig4.png&quot; align=&quot;center&quot; width=&quot;400&quot; alt=&quot;Performance of GPU Cholesky factorization on the full TSUBAME2.0 system (up to 1,360nodes, 4,080GPUs)&quot; /&gt;&lt;br /&gt;
      Figure  4 : Performance of GPU Cholesky factorization on the full TSUBAME2.0 system (up to 1,360nodes, 4,080GPUs)&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;ul class=&quot;links inline&quot;&gt;&lt;li class=&quot;translation_ja first last&quot;&gt;&lt;a href=&quot;/node/651&quot; title=&quot;平成24-25年度 採択課題 実施概要&quot; class=&quot;translation-link&quot; xml:lang=&quot;ja&quot;&gt;日本語&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
 <pubDate>Fri, 24 May 2013 05:01:59 +0000</pubDate>
 <dc:creator>author1</dc:creator>
 <guid isPermaLink="false">652 at https://www.gsic.titech.ac.jp</guid>
</item>
<item>
 <title>Large scale Phase-Field simulation for metal dendritic solidification by multi-GPU computing</title>
 <link>https://www.gsic.titech.ac.jp/en/node/464</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;p&gt;Primary Investigator Takayuki Aoki,    GSIC, Tokyo Institute of Technology&lt;br /&gt;
(taoki[at]gsic.titech.ac.jp)&lt;br /&gt;
Collaborator Takashi Shimokawabe,    Tokyo Institute of Technology&lt;br /&gt;
Collaborator Tomohiro Takaki,    Kyoto Institute of Technology&lt;br /&gt;
Collaborator Akinori Yamanaka,    Tokyo Institute of Technology&lt;/p&gt;
&lt;h3&gt;Background&lt;/h3&gt;
&lt;p&gt;The mechanical properties and performance of metal materials depend on the intrinsic microstructures in these materials. In order to develop engineering materials with the desired properties and to enable design with multifunctional materials, it is essential to predict the microstructural patterns, such as dendritic structures, observed in solidified metals. In materials science and related areas, the phase-field method is widely used as one of the most powerful computational methods for simulating the formation of complex microstructures during solidification and phase transformation of metals and alloys. The phase-field model consists of a relatively large number of complex nonlinear terms compared with other stencil computations, such as advection and diffusion calculations, and therefore requires more computational power than other stencil applications. Due to this heavy computational load, previous attempts known to us were limited in grid size (e.g., 500x500x500) and simulated simple shapes such as a single dendrite. To evaluate more realistic solidification with phase-field simulations, it is essential to perform large-scale computations reaching a 5000x5000x5000 mesh in order to describe multiple dendrites on the typical scales of microstructural patterns.&lt;/p&gt;
&lt;h3&gt;Methodology&lt;/h3&gt;
&lt;p&gt;The phase-field model introduces a continuous order parameter, i.e., a phase-field variable, to describe whether the material is solid or liquid. Solid and liquid are represented as the fixed values of this parameter. Interfaces between solid and liquid are treated as diffuse interfaces, which are given by the localized regions where this parameter changes smoothly between these two fixed values. Thanks to this approach, the phase-field method can describe the locations of the interfaces without introducing the explicit tracking of moving interfaces during microstructure evolution.&lt;br /&gt;
The time integration of the phase-field variable and the solute concentration is carried out with a second-order finite difference scheme in space and a first-order forward Euler finite difference method in time on a three-dimensional regular computational grid. Figure 1 shows that 19 neighbor elements of the phase-field variable and seven neighbor elements of the solute concentration are used to compute the governing equations at a grid point (i, j, k). We need to read 26 elements from memory and write back two updated values to memory for each grid point in one time step.&lt;/p&gt;
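The 7-point part of the update, for the concentration, can be sketched as a forward-Euler diffusion step. This is a pure-Python toy assuming a simple diffusion equation; the project's real kernel is a fused CUDA kernel that also evaluates the 19-point phase-field stencil, and diffusion_step is a hypothetical helper.

```python
def diffusion_step(c, dt, dx, d):
    """One forward-Euler step of a 7-point second-order Laplacian on a
    3D list-of-lists grid c; boundary cells are copied unchanged.
    d is the diffusion coefficient, dt the time step, dx the grid spacing.
    """
    nx, ny, nz = len(c), len(c[0]), len(c[0][0])
    out = [[[c[i][j][k] for k in range(nz)] for j in range(ny)] for i in range(nx)]
    coef = d * dt / (dx * dx)
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                # 6 face neighbors + center: the 7-point stencil from Figure 1.
                lap = (c[i-1][j][k] + c[i+1][j][k] + c[i][j-1][k] + c[i][j+1][k]
                       + c[i][j][k-1] + c[i][j][k+1] - 6.0 * c[i][j][k])
                out[i][j][k] = c[i][j][k] + coef * lap
    return out
```

Reading 7 (or 19) neighbors while writing one value per point is why the computation is memory-bound, as the weak-scaling discussion below assumes.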
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/Fig1-en.png&quot; width=&quot;500&quot; /&gt;&lt;br /&gt;
Figure 1   Spatial access patterns of the neighbor points for the phase field variable and the concentration&lt;/p&gt;
&lt;h3&gt;Description of multi-GPU computing and implementation&lt;/h3&gt;
&lt;p&gt;We decompose the whole computational domain in both the y- and z-directions (2D decomposition) and allocate each subdomain to one GPU. We chose this method because 3D decomposition, which appears better at reducing the communication volume, tends to degrade GPU performance due to the complicated memory access patterns of the data exchanges between GPU and CPU. Similar to conventional multi-CPU implementations with MPI, our multi-GPU implementation requires boundary data exchanges between subdomains. Because a GPU cannot directly access the global memory of other GPUs, the host CPUs are used as bridges for data exchange. For inter-node cases, this data exchange consists of the following three steps: (1) data transfer from GPU to CPU using CUDA APIs such as cudaMemcpy, (2) data exchange between nodes with MPI, and (3) data transfer back from CPU to GPU with CUDA APIs. Based on the above discussion, the first multi-GPU method, named (a) the GPU-only method, was implemented. However, this basic method suffers from the cost of the three-step data transfer described above, whose impact grows as more GPUs are used. In order to improve scalability, it is effective to hide these costs by overlapping communication with computation. We present two methods, both of which adopt overlapping techniques: (b) the Hybrid-YZ method and (c) the Hybrid-Y method. Both methods enable overlapping by treating the boundary regions separately from the inside region, so that both GPUs and CPUs cooperatively participate in the computation to improve the effect of overlapping.&lt;br /&gt;
In our implementation on TSUBAME 2.0, we determined the number of CPU cores used as follows. Since each node has twelve CPU cores (two 6-core Xeon CPUs), we assign four cores to each of the three GPUs. Thus each subdomain is cooperatively computed by a single GPU and four CPU cores. The CPU code is compiled with Intel C++ Compiler 11.1.072, and the multiple cores are exploited using OpenMP.&lt;/p&gt;
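The 2D decomposition described above can be sketched as a rank-to-subdomain mapping. This assumes a simple row-major process grid; subdomain is a hypothetical helper for illustration, not the project's code.

```python
def subdomain(rank, py, pz, ny, nz):
    """Map an MPI rank to its y-z subdomain (y0, y1, z0, z1) in a
    2D decomposition over a py x pz process grid of an ny x nz plane.
    The x-direction is not decomposed, so each GPU holds full x-lines.
    """
    jy, jz = rank % py, rank // py          # position in the process grid
    return ((jy * ny) // py, ((jy + 1) * ny) // py,
            (jz * nz) // pz, ((jz + 1) * nz) // pz)
```

Each rank then exchanges only the faces of its y-z block with its four grid neighbors, which is the boundary data exchange the three-step GPU-CPU-MPI path implements.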
&lt;h3&gt;Results&lt;/h3&gt;
&lt;p&gt;The weak scaling results of our phase-field simulations running on TSUBAME 2.0 are shown in Fig. 2; the computation scales linearly up to 4,000 GPUs, with each GPU handling a domain of 4096x128x128. A snapshot of the dendritic growth is shown in Fig. 3. The performance evaluation successfully demonstrated that strong scalability is improved by our overlapping techniques; with 4,000 GPUs and 16,000 CPU cores, the performance reaches 2.0000045 PFlops for a mesh of 4,096x6,500x10,400, which is, to our knowledge, the first peta-scale result for a real stencil application.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/sites/default/files/weak_scaling.png&quot; width=&quot;500&quot; /&gt;&lt;br /&gt;
Figure 2   Weak Scaling of multi-GPU computation in both single- and double- precision calculation on TSUBAME 2.0.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/Fig3.png&quot; width=&quot;500&quot; /&gt;&lt;br /&gt;
Figure 3   Dendritic growth in the binary alloy solidification with 768x1632x3264 mesh using 1156 GPUs of TSUBAME 2.0.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;ul class=&quot;links inline&quot;&gt;&lt;li class=&quot;translation_ja first last&quot;&gt;&lt;a href=&quot;/node/440&quot; title=&quot;フェーズフィールド法による金属材料デンドライト凝固成長の大規模GPU計算&quot; class=&quot;translation-link&quot; xml:lang=&quot;ja&quot;&gt;日本語&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
 <pubDate>Sun, 28 Aug 2011 10:02:41 +0000</pubDate>
 <dc:creator>webmaster</dc:creator>
 <guid isPermaLink="false">464 at https://www.gsic.titech.ac.jp</guid>
</item>
<item>
 <title>Development of an ultra-fast computing pipeline for metagenome analysis with next-generation DNA sequencers</title>
 <link>https://www.gsic.titech.ac.jp/en/node/454</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;p&gt;Akiyama Laboratory, Department of Computer Science, Tokyo Institute of Technology&lt;br /&gt;
info[at]bi.cs.titech.ac.jp&lt;/p&gt;
&lt;h3&gt;Background&lt;/h3&gt;
&lt;p&gt;Metagenome analysis is the study of the genomes of uncultured microbes obtained directly from microbial communities in their natural habitats. The analysis is useful not only for understanding symbiotic systems but also for monitoring environmental pollution. However, metagenome analysis requires comparing the sequence data obtained from a sequencer with the sequence data of remote homologues in databases, because current databases do not include sequence data for most of the microbes in a sample. For this purpose, the mapping software used in genome analysis of known organisms is insufficient, because comparisons between sequences of remote homologues have to consider mutations, insertions, and deletions. Therefore, sensitive sequence homology searches are required in metagenome analysis. Unfortunately, this process needs a large amount of computation time and is thus a bottleneck in current metagenome analysis based on data from the latest DNA sequencers, generally called next-generation sequencers.&lt;/p&gt;
&lt;h3&gt;Methods&lt;/h3&gt;
&lt;p&gt;We developed a fully automated pipeline for metagenome analysis that can deal with the huge data obtained from a next-generation sequencer in realistic time by using the large computing power of the TSUBAME 2.0 supercomputer.&lt;br /&gt;
In the pipeline, one of two different sequence homology search tools can be selected:&lt;br /&gt;
1) BLASTX, the standard sequence homology search software used in many metagenomic studies&lt;br /&gt;
2) GHOSTM, GPU-based fast sequence homology search software&lt;/p&gt;
&lt;p&gt;GHOSTM is our original sequence homology search program. It is implemented using NVIDIA&#039;s CUDA and is able to search for homologues in a short time by using GPU computing. With the high-sensitivity setting (index length K=3 in the initial search process), it shows search sensitivity comparable to BLAST; even with the high-speed setting (K=4), it has much higher sensitivity than BLAT, a well-known fast homology search program, and its sensitivity is sufficient for metagenome analysis (Fig. 1).&lt;/p&gt;
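The role of the index length K can be illustrated with a toy seed index: shorter seeds yield more candidate hits (higher sensitivity, more work), longer seeds fewer (faster, less sensitive). This is a generic k-mer sketch, not GHOSTM's actual data structure; both functions are hypothetical helpers.

```python
def build_kmer_index(db_seq, k):
    """Index every length-k substring of a database sequence by position."""
    index = {}
    for i in range(len(db_seq) - k + 1):
        index.setdefault(db_seq[i:i+k], []).append(i)
    return index

def seed_hits(query, index, k):
    """Candidate alignment seeds: all (query, database) position pairs
    sharing a k-mer. These seeds would then be extended and scored."""
    hits = []
    for j in range(len(query) - k + 1):
        for pos in index.get(query[j:j+k], []):
            hits.append((j, pos))
    return hits
```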
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/fig1.jpg&quot; width=&quot;300&quot; /&gt;&lt;br /&gt;
Fig. 1. Comparison of the sensitivities of homology search programs&lt;/p&gt;
&lt;h3&gt;Experiments&lt;/h3&gt;
&lt;p&gt;We performed a large-scale metagenome analysis using our pipeline, on data sampled from polluted soils and obtained with a next-generation sequencer. We evaluated the effective performance of the pipeline with both 1) BLASTX on CPUs and 2) GHOSTM on GPUs. &lt;/p&gt;
&lt;p&gt;Original metagenomic data:	224 million DNA reads (75 bp)&lt;br /&gt;
Size after excluding low-quality data:	71 million DNA reads&lt;br /&gt;
Homology search:	71 million DNA reads vs. NCBI nr amino-acid sequence DB (4.2GB)&lt;/p&gt;
&lt;h3&gt;Results&lt;/h3&gt;
&lt;p&gt;File I/O processes, including copying the database and writing the search results, caused a contention problem when we used many computation nodes. We therefore switched to a more sophisticated file transfer scheme in which data are copied simultaneously from the local disk of one node to another in a binary-tree fashion. As a result, the pipeline shows an almost linear speedup with the number of computing cores. When BLASTX is used as the homology search program, the pipeline processes about 24 million reads per hour with 16,008 CPU cores (1,334 computing nodes) (Fig. 2). When GHOSTM (K=4) is used, the pipeline processes about 60 million reads per hour with 2,520 GPUs (840 computing nodes) (Fig. 3).&lt;br /&gt;
These results indicate that the pipeline can process the genome information obtained from a single run of a next-generation sequencer in a few hours. We believe our pipeline will accelerate metagenome analysis with next-generation sequencers.&lt;/p&gt;
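The advantage of the binary-tree copy is easy to quantify: if every node that already holds the database sends one copy per round, the number of holders doubles each round, so distribution takes about log2(N) rounds instead of N sequential copies from one source. A minimal sketch of that count, assuming ideal parallel copies (binary_tree_rounds is a hypothetical helper):

```python
def binary_tree_rounds(n_nodes):
    """Rounds needed to place the database on n_nodes when each node that
    already holds a copy sends to one new node per round (holders double)."""
    holders, rounds = 1, 0
    while n_nodes > holders:
        holders *= 2      # every current holder sends one copy in parallel
        rounds += 1
    return rounds
```

For the 840 nodes used in the GHOSTM run, this gives 10 rounds rather than 839 copies from a single contended source.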
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/fig2.jpg&quot; width=&quot;500&quot; /&gt;&lt;br /&gt;
Fig. 2. Speedup of the BLASTX-based system for the number of CPU cores&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/fig3.jpg&quot; width=&quot;500&quot; /&gt;&lt;br /&gt;
Fig. 3. Speedup of the GHOSTM-based system for the number of GPUs&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;ul class=&quot;links inline&quot;&gt;&lt;li class=&quot;translation_ja first last&quot;&gt;&lt;a href=&quot;/node/442&quot; title=&quot;次世代シーケンサーを用いたメタゲノム解析向けの超高速パイプラインの構築&quot; class=&quot;translation-link&quot; xml:lang=&quot;ja&quot;&gt;日本語&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
 <pubDate>Wed, 27 Jul 2011 09:40:03 +0000</pubDate>
 <dc:creator>webmaster</dc:creator>
 <guid isPermaLink="false">454 at https://www.gsic.titech.ac.jp</guid>
</item>
<item>
 <title>Molecular Dynamics Simulation of a Biomolecule with High Speed, Low Power and Accuracy using GPGPU</title>
 <link>https://www.gsic.titech.ac.jp/en/node/452</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;p&gt;Sekijima Laboratory&lt;br /&gt;
sekijima [at] gsic.titech.ac.jp&lt;br /&gt;&lt;a href=&quot;http://www.bio.gsic.titech.ac.jp&quot;&gt;http://www.bio.gsic.titech.ac.jp&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Background&lt;/h3&gt;
&lt;p&gt;Molecular Dynamics (MD) simulation is a powerful tool for simulating the motion of molecules and obtaining detailed information on systems and phenomena of interest. During the simulation, we iteratively integrate the equations of motion of the whole system numerically; the time series of the positions and momenta of every molecule or atom that results from the simulation is called the trajectory. For this reason, MD is widely used in computational biology, especially for studying the motion of proteins. In our research, we have shown the usefulness of GPU-accelerated MD simulation from three viewpoints: computation speed, energy consumption, and accuracy.&lt;/p&gt;
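The iterative integration that produces a trajectory can be sketched with the velocity Verlet scheme, a standard MD integrator. This is a 1D toy under an arbitrary force function, not the AMBER/PMEMD implementation; velocity_verlet is a hypothetical helper.

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate Newton's equation of motion with velocity Verlet and
    record the trajectory: the time series of (position, momentum).
    Real MD engines do the same over 3N coordinates with a force field.
    """
    traj = []
    f = force(x)
    for _ in range(steps):
        x = x + v * dt + 0.5 * (f / mass) * dt * dt   # position update
        f_new = force(x)                              # force at new position
        v = v + 0.5 * ((f + f_new) / mass) * dt       # velocity update
        f = f_new
        traj.append((x, mass * v))                    # trajectory sample
    return traj
```

Velocity Verlet is chosen in MD because it is time-reversible and conserves energy well over long runs, which matters for NVE-ensemble simulations like the one below.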
&lt;h3&gt;Methodology&lt;/h3&gt;
&lt;ul&gt;&lt;li&gt;Nucleosome (25095 Atoms)
&lt;/li&gt;&lt;li&gt;PMEMD  -SPDP mode for GPU MD from AMBER11
&lt;/li&gt;&lt;li&gt;Implicit Solvent(GB) model
&lt;/li&gt;&lt;li&gt;NVE ensemble
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/nucleosome.png&quot; width=&quot;200&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Result&lt;/h3&gt;
&lt;p&gt;In this study, we performed molecular dynamics simulations using GPUs on the TSUBAME2.0 supercomputer. The performance and results were then compared with those of conventional CPU molecular dynamics in terms of calculation speed, energy consumption, and accuracy. Our experiment showed that using GPUs yields a great improvement in both calculation speed and energy consumption: comparing the result on 8 nodes with 16 GPUs against the result on 8 nodes with 96 CPU cores, we achieved about a 10-fold acceleration and a 75% energy reduction. The results of the GPU molecular dynamics were also consistent with&lt;br /&gt;
those of the CPU molecular dynamics within an acceptable range.&lt;br /&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/result_0.jpg&quot; width=&quot;700&quot; /&gt;&lt;br /&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/Rg.jpg&quot; width=&quot;700&quot; /&gt;&lt;/p&gt;
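The two headline figures combine wall-clock time and average power draw. A minimal sketch of how they are derived, using hypothetical measurements chosen only so they reproduce the reported 10x speedup and 75% energy reduction (the measured values are in the paper cited below):

```python
# Hypothetical wall-clock times and average power draws, chosen only to
# reproduce the two reported metrics; they are NOT the measured data.
gpu_run = {"walltime_s": 1.0e3, "avg_power_w": 4.0e3}  # 8 nodes, 16 GPUs (assumed)
cpu_run = {"walltime_s": 1.0e4, "avg_power_w": 1.6e3}  # 8 nodes, 96 cores (assumed)

def energy_j(run):
    """Energy consumed = average power x wall-clock time (joules)."""
    return run["avg_power_w"] * run["walltime_s"]

speedup = cpu_run["walltime_s"] / gpu_run["walltime_s"]
energy_reduction = 1.0 - energy_j(gpu_run) / energy_j(cpu_run)

print(f"speedup: {speedup:.0f}x, energy reduction: {energy_reduction:.0%}")
```

Note that a faster run can consume less total energy even at a higher instantaneous power draw, because the machine is busy for far less time.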
&lt;h3&gt;Reference&lt;/h3&gt;
&lt;p&gt;Shiqiao Du, Takuro Udagawa, Toshio Endo and Masakazu Sekijima, &quot;Molecular Dynamics Simulation of a Biomolecule with High Speed, Low Power and Accuracy Using GPU-Accelerated TSUBAME2.0 Supercomputer&quot;, &lt;a href=&quot;http://www.apsipa2011.org/&quot; target=&quot;_blank&quot;&gt;APSIPA ASC 2011&lt;/a&gt;, accepted&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Thu, 21 Jul 2011 15:18:32 +0000</pubDate>
 <dc:creator>webmaster</dc:creator>
 <guid isPermaLink="false">452 at https://www.gsic.titech.ac.jp</guid>
</item>
<item>
 <title>High Performance Seismic Simulation with GPGPU</title>
 <link>https://www.gsic.titech.ac.jp/en/node/450</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;p&gt;SC11 Special Awards paper candidate&lt;br /&gt;
Matsuoka-lab, Department of Mathematical and Computing Sciences&lt;br /&gt;&lt;a href=&quot;http://matsu-www.is.titech.ac.jp/en/contact&quot;&gt;http://matsu-www.is.titech.ac.jp/en/contact&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Background / Motivations&lt;/h3&gt;
&lt;p&gt;In extremely parallel computing environments like TSUBAME 2.0, the number of computing nodes is so large that the probability of node failures cannot be neglected. To make progress under such a high failure frequency, it is important to enable the calculation to continue after a node failure.&lt;br /&gt;
To reach this goal, the state of the application has to be saved periodically, which is called checkpointing. Reducing the checkpointing overhead is important for achieving good performance, because its time cost approaches about 25% of the total execution time on current systems.&lt;br /&gt;
In our research, we have improved the checkpoint performance of the GPGPU version of SPECFEM3D, one of the leading applications for simulating seismic wave propagation.&lt;/p&gt;
&lt;h3&gt;Method&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/ftdt.jpg&quot; width=&quot;300&quot; align=&quot;right&quot; /&gt;&lt;br /&gt;
We have implemented an efficient checkpointing facility in the CUDA version of SPECFEM3D. Checkpointing of parallel applications is divided into two parts: 1) saving the memory contents of the parallel application locally while the execution of the application is stopped, and 2) encoding that data across nodes with Reed-Solomon coding, so that checkpointed data are not lost when a node fails.&lt;br /&gt;
GPGPU applications typically use only GPUs for their calculation, leaving some CPU cores idle. The temporal overhead of checkpointing can therefore be hidden by offloading the data encoding and transfer onto those idle CPU cores.&lt;/p&gt;
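As a sketch of the encoding step: the snippet below uses single XOR parity, the simplest erasure code, to show how one node's lost checkpoint chunk can be rebuilt from the surviving chunks. The actual facility (FTI) uses Reed-Solomon coding, of which XOR parity is the one-failure special case; the data here are toy values.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Each "node" holds its own local checkpoint chunk (toy data, equal lengths)
checkpoints = [b"node0-state", b"node1-state", b"node2-state"]

# Parity chunk computed across the group, stored separately from the data
parity = xor_blocks(checkpoints)

# Node 1 fails: its chunk is rebuilt from the survivors plus the parity
recovered = xor_blocks([checkpoints[0], checkpoints[2], parity])
assert recovered == checkpoints[1]
```

This encoding is exactly the work that can run on idle CPU cores while the GPUs continue the simulation.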
&lt;h3&gt;Evaluation&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/weak.jpg&quot; width=&quot;300&quot; align=&quot;right&quot; /&gt;&lt;br /&gt;
We have compared the performance of the SPECFEM3D GPGPU version under the following conditions: 1) without any checkpointing, 2) with checkpointing locally on SSDs, 3) with checkpointing by our method (FTI L1, L2), 4) with checkpointing by our method (FTI L1, L2, L3), and 5) with checkpointing by an existing method (BLCR+Lustre).&lt;br /&gt;
The performance graph shows that our method provides a checkpointing facility with a smaller impact on execution, and keeps better performance even when the same-size problem is solved on many computing nodes (strong scaling).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/strong.jpg&quot; width=&quot;300&quot; align=&quot;right&quot; /&gt;&lt;br /&gt;
In addition, we simulated the seismic motion of the March 11th Tohoku earthquake with SPECFEM3D. We split the simulated area into a 960×960 mesh, provided the slip distribution of the fault planes as the source, and calculated 1,500 seconds of motion along the east-west, north-south, and vertical axes at 70 stations. The simulated displacement agrees with the actual displacement observed after the earthquake.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/result.jpg&quot; width=&quot;600&quot; /&gt;&lt;br /&gt;
Synthetic Seismograms at Hirono station in Fukushima prefecture&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Thu, 21 Jul 2011 14:32:18 +0000</pubDate>
 <dc:creator>webmaster</dc:creator>
 <guid isPermaLink="false">450 at https://www.gsic.titech.ac.jp</guid>
</item>
<item>
 <title>Turbulence Simulation using FMM</title>
 <link>https://www.gsic.titech.ac.jp/en/node/446</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;p&gt;R. Yokota, L.A. Barba, Boston University, yokota[at]bu.edu, labarba[at]bu.edu&lt;br /&gt;
T. Narumi, University of Electro-Communications, narumi[at]cs.uec.ac.jp&lt;br /&gt;
K. Yasuoka, Keio University, yasuoka[at]mech.keio.ac.jp&lt;/p&gt;
&lt;h3&gt;Background and Objectives&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/Picture 1.png&quot; width=&quot;200&quot; align=&quot;right&quot; /&gt;&lt;br /&gt;
Traditional methods for simulating turbulent flows have been the spectral method (based on FFT) and the finite difference method (based on linear solvers). The fast multipole method (FMM) is more amenable to massive parallelism than these algorithms, since its communication pattern is hierarchical and mostly local rather than global. Our goal is to take turbulence simulation to the next level of parallelism by making use of our FMM solver.&lt;/p&gt;
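To illustrate the core idea behind the FMM (this is not the authors' ExaFMM solver), the sketch below compares a direct 1/r potential sum against its lowest-order multipole (monopole) approximation for a well-separated cluster of unit-strength sources; the hierarchical tree traversal and higher-order expansion terms of a real FMM are omitted.

```python
import math
import random

random.seed(0)

# 100 unit-strength sources clustered in a unit cell, far from the target
sources = [(random.uniform(10.0, 11.0),
            random.uniform(10.0, 11.0),
            random.uniform(10.0, 11.0)) for _ in range(100)]
target = (0.0, 0.0, 0.0)

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Direct evaluation of the 1/r potential: O(N) per target, O(N^2) overall
direct = sum(1.0 / dist(p, target) for p in sources)

# Monopole approximation: the cluster's total strength placed at its
# centroid. Because the cell is well separated from the target, the
# relative error is small; the FMM applies this idea recursively on a
# tree, with higher-order terms, to reach O(N) total cost.
n = len(sources)
centroid = tuple(sum(c) / n for c in zip(*sources))
approx = n / dist(centroid, target)

rel_err = abs(approx - direct) / direct
```

Because each cell interaction involves only nearby or coarse-level data, the parallel communication is far more local than the all-to-all exchange an FFT requires.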
&lt;h3&gt;Description of runs &lt;/h3&gt;
&lt;p&gt;The simulation of isotropic turbulence with 2048^3 unknowns is performed with an FMM-based method and an FFT-based method, and their computational efficiency is compared.&lt;br /&gt;
Mesh: 2048^3&lt;br /&gt;
Re_λ: 500&lt;br /&gt;
Library: ExaFMM vs. FFTW&lt;/p&gt;
&lt;h3&gt;Results&lt;/h3&gt;
&lt;p&gt;Both the FMM and FFT based solvers take approximately 30 seconds per time step on 2,048 GPUs.&lt;br /&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/fig2_0.jpg&quot; /&gt;&lt;br /&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/fig3_0.jpg&quot; /&gt;&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;ul class=&quot;links inline&quot;&gt;&lt;li class=&quot;translation_ja first last&quot;&gt;&lt;a href=&quot;/node/443&quot; title=&quot;Turbulence Simulation using FMM&quot; class=&quot;translation-link&quot; xml:lang=&quot;ja&quot;&gt;日本語&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
 <pubDate>Thu, 21 Jul 2011 09:43:35 +0000</pubDate>
 <dc:creator>webmaster</dc:creator>
 <guid isPermaLink="false">446 at https://www.gsic.titech.ac.jp</guid>
</item>
<item>
 <title>Multiphysics Biofluidics Simulation</title>
 <link>https://www.gsic.titech.ac.jp/en/node/445</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;p&gt;Massimo Bernaschi, IAC, National Research Council (CNR), Italy&lt;br /&gt;
Satoshi Matsuoka, GSIC, Tokyo Institute of Technology&lt;/p&gt;
&lt;h3&gt;Background and Overview&lt;/h3&gt;
&lt;p&gt;Gaining deeper insight into blood circulation is a necessary step toward improving our understanding of cardiovascular diseases. However, a detailed, realistic simulation of hemodynamics presents challenging issues both in physical modeling and in high performance computing technology. The model must handle both the motion of fluid within the complex geometry of the vasculature, which changes unsteadily with the heartbeat, and the dynamics of red blood cells (RBCs), white blood cells and other suspended bodies. Previous such coupled simulations were confined to microscale vessels. In order to scale the simulation up to realistic geometries, higher computational power is required. In this work, the authors present a multiscale simulation of human cardiovascular flow using 4,000 GPUs of the Tokyo Tech TSUBAME2.0 supercomputer. The spatial resolution extends from 5 cm down to 10 µm, and the simulation involves a billion fluid nodes and 300 million suspended bodies.&lt;br /&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/kaxiras_mechionna_7k.jpg&quot; width=&quot;200&quot; /&gt;&lt;br /&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/RBC.jpg&quot; width=&quot;200&quot; /&gt;&lt;br /&gt;
Red blood cells as ellipsoidal particles&lt;/p&gt;
&lt;h3&gt;Methodology&lt;/h3&gt;
&lt;p&gt;The multiphysics/multiscale simulation described above is performed with the MUPHY code developed by the authors&#039; group. MUPHY couples the Lattice Boltzmann (LB) method for the fluid flow with a specialized version of Molecular Dynamics (MD) for the suspended bodies. A CUDA version of MUPHY has been implemented for the evaluation on thousands of GPUs. In the fluid computation, the GPUs are the computing elements while the CPUs handle the boundary exchange; overlap of computation and communication is thus achieved.&lt;br /&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/method.jpg&quot; width=&quot;200&quot; /&gt;&lt;br /&gt;
Since our target domain has an irregular structure, partitioning the domain among 4,000 GPUs is challenging; for this purpose, the third-party software PT-SCOTCH, the parallel version of the SCOTCH graph partitioning tool, is used.&lt;/p&gt;
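As a toy stand-in for what a partitioner like PT-SCOTCH does (with far better edge-cut quality and in parallel), the sketch below grows contiguous, roughly equal-sized parts of a graph by breadth-first search; the function names and the small grid example are illustrative only.

```python
from collections import deque

def bfs_partition(adj, num_parts):
    """Split a graph (adjacency lists) into contiguous parts of roughly
    equal size by growing each part with a breadth-first search."""
    n = len(adj)
    target = (n + num_parts - 1) // num_parts
    part = [-1] * n
    next_seed = 0
    for p in range(num_parts):
        # Seed the next part from the first still-unassigned node
        while next_seed < n and part[next_seed] != -1:
            next_seed += 1
        if next_seed == n:
            break
        queue, size = deque([next_seed]), 0
        while queue and size < target:
            v = queue.popleft()
            if part[v] != -1:
                continue
            part[v] = p
            size += 1
            queue.extend(u for u in adj[v] if part[u] == -1)
    # Any leftover nodes (e.g. disconnected pieces) go to the last part
    for v in range(n):
        if part[v] == -1:
            part[v] = num_parts - 1
    return part

def grid_adj(w, h):
    """Adjacency lists of a w x h grid graph (a stand-in for a lattice)."""
    adj = [[] for _ in range(w * h)]
    for y in range(h):
        for x in range(w):
            v = y * w + x
            if x + 1 < w:
                adj[v].append(v + 1); adj[v + 1].append(v)
            if y + 1 < h:
                adj[v].append(v + w); adj[v + w].append(v)
    return adj

# Toy example: a 4x4 grid graph split into 4 parts of 4 nodes each
parts = bfs_partition(grid_adj(4, 4), 4)
```

A real partitioner additionally minimizes the number of cut edges, which directly bounds the GPU-to-GPU boundary traffic per time step.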
&lt;h3&gt;Experimental Results&lt;/h3&gt;
&lt;p&gt;The performance of the multiscale simulations is measured using 4,000 NVIDIA Tesla M2050 GPUs on TSUBAME2.0. Since a single node contains three GPUs, 1,334 nodes are used. The simulations include:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;1 billion lattice sites for the fluid&lt;/li&gt;
&lt;li&gt;300 million RBCs&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Most computations are performed in single precision. The system software used includes the CUDA compiler version 3.2, the Intel Fortran compiler version 11.1, and OpenMPI version 1.4.2.&lt;br /&gt;
As a result, a total performance of about 600 TFlops (=0.6 PFlops) has been achieved, with a parallel efficiency in excess of 90 percent. Among the computation components, the LB component is more than a factor of two faster than the authors previously reported. This performance corresponds to simulating a full heartbeat at microsecond resolution in 8 hours.&lt;br /&gt;&lt;img src=&quot;http://www.gsic.titech.ac.jp/sites/default/files/performance-cap.jpg&quot; width=&quot;500&quot; /&gt;&lt;/p&gt;
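The last claim is easy to sanity-check, assuming a cardiac cycle of about one second (an assumption on our part, not a figure from the paper):

```python
# Assumed figures: one heartbeat lasts ~1 s, simulated at 1 microsecond
# resolution, completed in the reported 8 hours of wall-clock time.
heartbeat_s = 1.0
dt_s = 1.0e-6
steps = round(heartbeat_s / dt_s)   # 1,000,000 time steps per heartbeat

walltime_s = 8.0 * 3600.0           # 8 hours of wall-clock time
per_step_s = walltime_s / steps     # wall-clock time per simulated step

print(f"{steps} steps, {per_step_s * 1e3:.1f} ms per step")
```

So the reported throughput corresponds to roughly 29 ms of wall-clock time per time step across the full billion-node, 300-million-body system.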
&lt;h3&gt;Reference&lt;/h3&gt;
&lt;p&gt;M. Bernaschi, M. Bisson, T. Endo, M. Fatica, S. Matsuoka, S. Melchionna, S. Succi. Petaflop biofluidics simulations on a two million-core system. In Proceedings of IEEE/ACM Supercomputing &#039;11, Seattle, 2011. (To appear)&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;ul class=&quot;links inline&quot;&gt;&lt;li class=&quot;translation_ja first last&quot;&gt;&lt;a href=&quot;/node/441&quot; title=&quot;Multiphysics Biofluidics Simulation&quot; class=&quot;translation-link&quot; xml:lang=&quot;ja&quot;&gt;日本語&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
 <pubDate>Thu, 21 Jul 2011 09:38:06 +0000</pubDate>
 <dc:creator>webmaster</dc:creator>
 <guid isPermaLink="false">445 at https://www.gsic.titech.ac.jp</guid>
</item>
</channel>
</rss>
