SET


General Information

  • Application's name: Simulation of Electron Transport
  • Application's acronym: SET
  • Virtual Research Community: Computational Physics
  • Scientific contact: Todor Gurov, Aneta Karaivanova, (gurov, anet)[@]parallel.bas.bg
  • Technical contact: Emanouil Atanassov, emanouil[@]parallel.bas.bg
  • Developers: Assoc. Prof. Dr. E. Atanassov, Department of Grid Technologies and Applications, IICT-BAS, Bulgaria
  • Web site: http://gta.grid.bas.bg

Short Description

SET uses Monte Carlo methods to solve the integral equations describing electron transport. The methods employ variance reduction techniques to reduce the required CPU time. Billions of simulated trajectories are needed to achieve accurate results. These methods can benefit the simulation of semiconductor devices at the nano-scale as well as other problems in computational electronics.

The advanced variance reduction techniques in this application require fast inter-process communication (using MPI or a similar interface), while the total number of trajectories runs into the billions. Thus a large-scale computational resource with a fast interconnect (a supercomputer or HPC cluster) is required.
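
A minimal sketch (not the production SET code) of how such a trajectory-based Monte Carlo computation is typically distributed over MPI ranks; simulate_trajectory() is a placeholder for the actual transport kernel and the trajectory count is illustrative:

  /* Minimal MPI sketch of trajectory-splitting Monte Carlo (illustrative only). */
  #include <mpi.h>
  #include <stdio.h>

  /* Placeholder for one random-walk estimate; a trivial LCG stands in for the
     real transport kernel and random number generator. */
  static double simulate_trajectory(unsigned long long *state)
  {
      *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
      return (double)(*state >> 11) / 9007199254740992.0;   /* uniform in [0,1) */
  }

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const long long total = 1000000000LL;                 /* total trajectories */
      long long local = total / size + (rank < total % size ? 1 : 0);
      unsigned long long state = 0x9E3779B97F4A7C15ULL + rank;  /* per-rank seed */

      double local_sum = 0.0;
      for (long long i = 0; i < local; i++)
          local_sum += simulate_trajectory(&state);

      double global_sum = 0.0;
      MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
          printf("Monte Carlo estimate: %.6f\n", global_sum / (double)total);

      MPI_Finalize();
      return 0;
  }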

Problems Solved

The application deals with the simulation of semiconductor devices at small scales and aims to provide new insights into the physical phenomena occurring there.

Scientific and Social Impact

The application accurately predicts and models new physical effects occurring at small scales (nanometers and femtoseconds). The reduced physical dimensions of contemporary electronic devices make quantum effects increasingly relevant for modelling device operation.

An improved understanding of these physical effects can enable further improvements in the design of semiconductor devices. Bulgaria has a tradition in the electronics industry, and there are currently efforts to revive these activities.

Collaborations

  • IME-TU, Vienna, Austria
  • RBI, Zagreb, Croatia

Beneficiaries

  • Researchers in the field of semiconductor physics
  • Manufacturers of small scale semiconductor devices

Number of users

11

Development Plan

  • Concept: Done before the project started.
  • Start of alpha stage: Done before the project started.
  • Start of beta stage: M8
  • Start of testing stage: M9
  • Start of deployment stage: M15
  • Start of production stage: M16

Resource Requirements

  • Number of cores required for a single run: From 1 up to 8000
  • Minimum RAM/core required: 100 MB
  • Storage space during a single run: 1 GB
  • Long-term data storage: 10 GB
  • Total core hours required: 3 000 000

Technical Features and HP-SEE Implementation

  • Primary programming language: C
  • Parallel programming paradigm: MPI/OpenMP
  • Main parallel code: MPI
  • Pre/post processing code: Developed in-house
  • Application tools and libraries: SPRNG library, scrambled sequences (see the sketch below)
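
As an illustration of how independent random streams are typically obtained in such MPI codes, the sketch below uses the SPRNG default interface; the generator type, seed, and exact call signatures are assumptions based on SPRNG 2.x and may differ in the version actually used:

  /* Hedged sketch: one independent SPRNG stream per MPI rank
     (assumes the SPRNG 2.x default interface; calls may differ between versions). */
  #include <mpi.h>
  #include <stdio.h>
  #include "sprng.h"

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Stream number 'rank' out of 'size' streams; generator type and seed are examples. */
      int *stream = init_sprng(SPRNG_LFG, rank, size, 1234, SPRNG_DEFAULT);

      double sum = 0.0;
      for (int i = 0; i < 1000; i++)
          sum += sprng(stream);        /* uniform double in [0,1) from this rank's stream */

      printf("rank %d: sample mean = %f\n", rank, sum / 1000.0);

      free_sprng(stream);
      MPI_Finalize();
      return 0;
  }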

Usage Example

Infrastructure Usage

  • Home system: HPCG/BG
    • Applied for access on: 09.2010
    • Access granted on: 09.2010
    • Achieved scalability: 512 cores
  • Accessed production systems:
  1. BG/BG
    • Applied for access on: 10.2010
    • Access granted on: 10.2010
    • Achieved scalability: 4096 cores
  • Porting activities: The application has been successfully ported from the x86-64 InfiniBand cluster (HPCG) to the IBM Blue Gene/P machine. Some of the code needed to be corrected, in particular to take into account that the IBM system is 32-bit and to comply with the IBM compiler's rules.
  • Scalability studies: Tests on 512, 1024, 2048 and 4096 cores on the IBM Blue Gene/P.

Running on Several HP-SEE Centres

  • Benchmarking activities and results: In the initial phase the application was benchmarked and optimized on the HPCG cluster at IICT-BAS. After successful deployment on 512 cores, the second phase of benchmarking was initiated and the application was deployed on the Bulgarian supercomputer (IBM Blue Gene/P). It was tested there and showed good scalability on 512, 1024, 2048 and 4096 cores. The test case uses 10 million trajectories to simulate 180 femtoseconds of evolution.
  • Other issues: Code corrections, especially due to the IBM system being 32-bit.

Achieved Results

The SET application was tested with new random number generators using permutations. The transition density was optimized using a genetic algorithm and acceptance-rejection methods. Initial scientific results for the simulation of electron transport in quantum wires have been obtained.

The numerical results presented in Figure 1 are obtained for zero temperature and GaAs material parameters: the electron effective mass is 0.063, the optical phonon energy is 36 meV, and the optical and static dielectric constants are ε∞ = 10.92 and εs = 12.9. The initial condition at t = 0 is given by a function which is Gaussian in energy, φ(k) = exp(−(b₁k² − b₂)²) with b₁ = 96 and b₂ = 24, scaled so that the peak value is equal to unity.
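
For reference, a small self-contained snippet evaluating this initial condition (dimensionless k and normalization as stated above; not taken from the production code):

  /* Evaluate the Gaussian-in-energy initial condition phi(k) = exp(-(b1*k^2 - b2)^2),
     with b1 = 96 and b2 = 24; the peak value 1 is reached where b1*k^2 = b2. */
  #include <math.h>
  #include <stdio.h>

  static double phi(double k)
  {
      const double b1 = 96.0, b2 = 24.0;
      const double x = b1 * k * k - b2;
      return exp(-x * x);
  }

  int main(void)
  {
      const double k_peak = sqrt(24.0 / 96.0);          /* k = 0.5 */
      printf("phi(%g) = %g\n", k_peak, phi(k_peak));    /* prints 1 at the peak */
      return 0;
  }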

[Images: Zp20k+.png (left), Zp20k-.png (right)]

Figure 1. Solutions |k|f(0, 0, kz, t) versus |k|² [10¹⁴ m⁻²] for evolution time t = 200 fs, in the positive direction of the z-axis (left picture) and in the negative direction of the z-axis (right picture). The electric field is 0, 6, and 12 kV/cm, and the number of random walks per point is 1 million. The result for the Wigner function is shown in Figure 2, where the Wigner function is computed for all 800 × 260 points in the plane z × kz.

[Image: Wigner3.jpg]

Figure 2. The Wigner function solution at t = 180 fs, presented in the plane z × kz. The electric field is 15 kV/cm and the number of Markov chains per point is 1 billion. The timing results for evolution time t = 180 fs and for all 800 × 260 points are shown in Table 1. The number of Markov chain trajectories is 1 billion.

Table 1. The CPU time (seconds) for all 800 × 260 points, the speed-up, and the parallel efficiency.
Blades/Cores                     CPU Time (s)   Speed-up   Parallel Efficiency
1 x 8 = 8                        202300         -          -
4 x 8 = 32                       50659          3.9937     0.99834
8 x 8 = 64                       25423          7.9574     0.99467
16 x 8 = 128                     12735          15.8853    0.99283

Blades/Cores (Hyper-threading)   CPU Time (s)   Speed-up   Parallel Efficiency
1 x 8 x 2 = 16                   148602         -          -
4 x 8 x 2 = 64                   37660          3.94588    0.98647
8 x 8 x 2 = 128                  18957          7.83889    0.97986
16 x 8 x 2 = 256                 9552           15.55716   0.97232

The results shown in Table 2 are obtained on the IBM Blue Gene/P. The solution is again estimated for evolution time t = 180 fs and for all 800 × 260 points, with the number of Markov chain trajectories again being 1 billion. Both sets of timing results demonstrate very good speed-up and parallel efficiency.

Table 2. The CPU time (seconds) for all 800 × 260 points, the speed-up, and the parallel efficiency.
Cores   CPU Time (s)   Speed-up   Parallel Efficiency
1024    23498          -          -
2048    12082          1.9449     0.97245
4096    6091           3.8769     0.96923
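
The speed-up and parallel efficiency values in Tables 1 and 2 follow directly from the measured CPU times: the speed-up is the ratio of the baseline time to the time on N cores, and the efficiency is the speed-up divided by the ratio of core counts. A minimal sketch of the calculation, using the Table 1 timings without hyper-threading as example input (small differences from the published values are due to rounding of the reported times):

  /* Recompute speed-up and parallel efficiency from the Table 1 timings
     (non-hyper-threaded runs); purely illustrative. */
  #include <stdio.h>

  int main(void)
  {
      const int    cores[]  = { 8, 32, 64, 128 };
      const double time_s[] = { 202300.0, 50659.0, 25423.0, 12735.0 };
      const int n = 4;

      for (int i = 1; i < n; i++) {
          double speedup    = time_s[0] / time_s[i];
          double efficiency = speedup / ((double)cores[i] / cores[0]);
          printf("%4d cores: speed-up %.4f, efficiency %.5f\n",
                 cores[i], speedup, efficiency);
      }
      return 0;
  }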

We also implemented our algorithm using CUDA and tested it on our GPU-based resources. The random number generator used was the default CURAND generator from the CUDA SDK. The parallelization of the code with CUDA was achieved without a major rewrite of the code or changes to the program logic. The work is first split into blocks of trajectories to be computed. The master process sends the work to the slave processes, which initialize the respective GPU device, repeatedly execute the respective GPU kernel, and return the results.
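
A heavily simplified CUDA sketch of the per-device part of this scheme; the kernel body, launch configuration, and use of the CURAND device API are illustrative assumptions, not the actual SET implementation:

  /* Illustrative CUDA sketch: each thread accumulates a share of the trajectories
     using the CURAND device API; partial sums are reduced on the host.
     In the real setup an MPI master distributes such blocks of work to GPU workers. */
  #include <cstdio>
  #include <curand_kernel.h>

  __global__ void trajectories_kernel(double *partial, long long per_thread,
                                      unsigned long long seed)
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      curandState state;
      curand_init(seed, tid, 0, &state);           /* independent sub-sequence per thread */

      double sum = 0.0;
      for (long long i = 0; i < per_thread; i++)
          sum += curand_uniform_double(&state);    /* placeholder for one trajectory estimate */
      partial[tid] = sum;
  }

  int main(void)
  {
      const int blocks = 256, threads = 256, n = blocks * threads;
      const long long per_thread = 1000;

      cudaSetDevice(0);                            /* each worker process selects its own card */
      double *d_partial;
      cudaMalloc(&d_partial, n * sizeof(double));

      trajectories_kernel<<<blocks, threads>>>(d_partial, per_thread, 1234ULL);

      double *h_partial = new double[n];
      cudaMemcpy(h_partial, d_partial, n * sizeof(double), cudaMemcpyDeviceToHost);

      double total = 0.0;
      for (int i = 0; i < n; i++) total += h_partial[i];
      std::printf("estimate: %f\n", total / ((double)n * per_thread));

      delete[] h_partial;
      cudaFree(d_partial);
      return 0;
  }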

The same computation as above was performed in 67701 seconds on one NVIDIA M2090 card, which means that one card's performance is comparable to that of 3 blades with hyper-threading turned off. We believe that this result can be improved, because there could be some warp divergence due to logical statements in the code. This issue can be mitigated by changing the way the samples are computed by the threads, so as to ensure that the divergence is limited. We also tested the algorithm running on several GPU cards in parallel. When 6 NVIDIA TESLA M2090 cards in the same server were used to compute 10⁷ trajectories, we obtained about 93% parallel efficiency. For such a relatively small number of trajectories, the main source of inefficiency is the time spent in the cudaSetDevice call at the beginning of the computations.

Publications

  • S. Ivanovska, A. Karaivanova, and N. Manev, Numerical Integration Using Sequences Generating Permutations, LSSC 2011, LNCS, Springer, 2012, Volume 7116/2012, 455-463, DOI: 10.1007/978-3-642-29843-1_51, ISSN: 0302-9743.
  • T. Gurov, S. Ivanovska, A. Karaivanova, N. Manev, Monte Carlo Methods Using a New Class of Congruential Generators, ICT Innovations 2011, Sept. 14-16, 2011, Skopje, LNCS, Springer, 2012, Volume 150/2012, 257-267, DOI: 10.1007/978-3-642-28664-3_24, ISSN: 1867-5662.
  • T. Gurov, E. Atanassov and A. Karaivanova, Monte Carlo Methods for Electron Transport: Scalability Study, Proceedings of ISPDC 2012, IEEE CPS, p.188-194, DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ISPDC.2012.33, ISBN: 978-1-4673-2599-8.
  • E. Atanassov, D. Dimitrov and S. Ivanovska, Efficient Implementation of the Heston Model Using GPGPU, Monte Carlo Methods and Applications, De Gruyter, 2012, 21-28, ISBN: 978-3-11-029358-6, ISSN: 0929-9629.
  • T. Gurov, S. Ivanovska, A. Karaivanova, N. Manev, A Study of a New Class of Congruential Generators for Monte Carlo Methods, Journal Information Technologies and Control, ISSN 1312-2622, (accepted for publication).
  • E. Atanassov, D. Georgiev, N. L. Manev, ECM Integer factorization on GPU Cluster, Jubilee 35th International Convention on Information and Communication Technology, electronics and microelectronics - MIPRO 2012/DC-VIS, 343-346, ISBN: 978-953-233-069-4.
  • E. Atanassov, T. Gurov, A. Karaivanova, Monte Carlo Simulation of Ultrafast Carrier Transport: Scalability Study, Proceedings of HP-SEE UF 2012, 17-19 October 2012, Belgrade, Springer (accepted for publication).
  • E. Atanassov, M. Durchova, “Generation of the Scrambled Halton Sequence Using Accelerators”, Proceeding of the 36th International Convention MIPRO2013/DC-VIS, pp. 197-201, May 2013, ISSN 1847-3938.

Presentations

  • T. Gurov, “Message Oriented Framework with Low Overhead for Efficient Use of HPC Resources”, Special Session “High Performance Monte Carlo Simulation”, during 8th LSSC’11 Conference, June 6-10, Sozopol, Bulgaria.
  • A. Karaivanova, “Monte Carlo Simulations of Electron Transport using a Class of Sequences Generating Permutations”, 3rd AMITANS’11 Conference, June 20-25, Albena, Bulgaria.
  • N. Manev, “Monte Carlo Methods using a New Class of Congruential Generators”, ICT Innovations 2011 conference, 14-16 September, 2011, Skopje, FYROM.
  • E. Atanassov, “Message Oriented Framework with Low Overhead for Efficient Use of HPC Resources”, ICT Innovations 2011 conference, 14-16 September, 2011, Skopje, FYROM.
  • T. Gurov, “Study Scalability of SET Application using The Bulgarian HPC Infrastructure”, the 8th International Conference on Computer Science and Information Technologies – CSIT2011, September 26-30, 2011, Yerevan, Armenia.
  • E. Atanassov, “Stochastic Modeling of Electron Transport on different HPC architectures”, PRACE Workshop on HPC approaches on Life Sciences and Chemistry, 17-18 February, 2012, Sofia, Bulgaria.
  • E. Atanassov, “High-Performance Framework for Advanced Applications”, 2nd workshop on supercomputer applications, 22-24 April, 2012, Bansko, Bulgaria.
  • N. Manev, “ECM Integer factorization on GPU Cluster”, Jubilee 35th International Convention on Information and Communication Technology, electronics and microelectronics (MIPRO2012), 21-25 May 2012, Opatija, Croatia.
  • E. Atanassov, “Efficient Implementation of a Stochastic Electron Transport Simulation Algorithm Using GPGPU Computing”, 4th AMITANS 2012 Conference, June 11-16, 2012, Varna, Bulgaria.
  • E. Atanassov, “Monte Carlo Methods for Electron Transport: Scalability Study”, 11th ISPDC2012 Conference, June 25-29, 2012, Munich, Germany.
  • A. Karaivanova, “Monte Carlo methods for Electron Transport: Scalability Study Using HP-SEE Infrastructure”, invited talk during HP-SEE User Forum 2012, October 17-19, Belgrade, Serbia.

Foreseen Activities

  • Message-oriented frameworks overcome some deployment limitations, such as the lack of a commonly installed Grid middleware. This problem is planned to be addressed in a next step, but it is not an immediate obstacle for production use.
  • The availability of HPC resources enables new research, in which we will investigate the impact of the applied electric field on devices made from different semiconductor materials.