High-Performance Quantum Computing Simulation for the Quantum Geometric Machine Model
Adriano Kurz Maron, Renata Hax Sander Reiser, Maurício Lima Pilla
May 16th, 2013
1 Introduction
   Background & Motivation
   Related Work
   VPE-qGM
2 QC Simulation on the VPE-qGM Using GPU
   PyCUDA Framework
   Constant-Size Source Matrices and Auxiliary Data
   CUDA Kernel
3 Results
   Methodology
   Simulation Time
   Discussion
4 Conclusions
   Main Contributions
   Future Work
   Acknowledgements
Background & Motivation
Quantum Computing Context
• Quantum computers may be exponentially faster than classical ones
• Performance comes from quantum-mechanical phenomena
• Entirely new algorithms are required
Figure: "Quantum hardware": few quantum bits.
Quantum Simulation
• Performs the operations related to the temporal evolution
• Simulates the behavior of quantum algorithms as if they were being executed on quantum hardware
• Quantum simulations are computationally expensive.
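Quantum Computing
1 qubit - Basic quantum transformations act on a state vector, such as |ψ⟩ = α|0⟩ + β|1⟩:
\[
H|\psi\rangle \equiv \frac{1}{\sqrt{2}}
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\times
\begin{pmatrix} \alpha \\ \beta \end{pmatrix}
= \frac{1}{\sqrt{2}}
\begin{pmatrix} \alpha+\beta \\ \alpha-\beta \end{pmatrix}
\]
2 qubits - Quantum transformations obtained from the Kronecker product act on a state vector, such as |φ⟩ = α|00⟩ + β|01⟩ + γ|10⟩ + δ|11⟩:
\[
(H \otimes H)|\phi\rangle =
\left[
\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\otimes
\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\right] |\phi\rangle
\equiv \frac{1}{2}
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1
\end{pmatrix}
\begin{pmatrix} \alpha \\ \beta \\ \gamma \\ \delta \end{pmatrix}
= \frac{1}{2}
\begin{pmatrix}
\alpha+\beta+\gamma+\delta \\
\alpha-\beta+\gamma-\delta \\
\alpha+\beta-\gamma-\delta \\
\alpha-\beta-\gamma+\delta
\end{pmatrix}
\]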
Important: A gate-by-gate implementation is possible and, in this case, more efficient, but it does not apply when generating entangled states.
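The two ways of applying H ⊗ H mentioned above can be compared with a minimal pure-Python sketch. The helper names (`kron`, `matvec`, `apply_one_qubit`) and the sample amplitudes are illustrative assumptions, not part of the VPE-qGM code.

```python
import math

S = 1 / math.sqrt(2)
H = [[S, S],
     [S, -S]]

def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def matvec(mat, vec):
    """Plain matrix-vector product."""
    return [sum(row[j] * vec[j] for j in range(len(vec))) for row in mat]

def apply_one_qubit(gate, state, target, n_qubits):
    """Apply a 2x2 gate to one qubit (qubit 0 is the leftmost bit)."""
    out = list(state)
    stride = 1 << (n_qubits - 1 - target)
    for base in range(len(state)):
        if base & stride:
            continue  # each amplitude pair is handled once, from its 0-side
        a0, a1 = state[base], state[base | stride]
        out[base] = gate[0][0] * a0 + gate[0][1] * a1
        out[base | stride] = gate[1][0] * a0 + gate[1][1] * a1
    return out

# Illustrative (not normalized) amplitudes alpha, beta, gamma, delta.
phi = [0.1, 0.2, 0.3, 0.4]

# Approach 1: build the 4x4 matrix H (x) H, then multiply.
full = matvec(kron(H, H), phi)

# Approach 2: apply H to qubit 0, then to qubit 1, without the 4x4 matrix.
step = apply_one_qubit(H, apply_one_qubit(H, phi, 0, 2), 1, 2)
```

Both routes give the same vector here; the gate-by-gate route never stores the 4 × 4 matrix.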
Motivation of our Work
Sequential simulators...
• have low memory requirements
• have a clever logic for the computation
• provide graphical interfaces
• are limited by the simulation time.
Parallel simulators...
• are less optimized (brute force)
• focus on parallelization techniques
• are mostly limited by memory.
Our Challenge: Best of both worlds
• Merge the approaches of sequential and parallel simulators:
  • Apply an algorithm that exploits the patterns in the definition of quantum transformations to reduce the number of operations
  • Distribute the computation across GPUs (future work)
Related Work
Sequential Simulators
Libquantum
• C library for quantum simulation
• Decoherence support
• Official SPEC CPU2006 benchmark
• GPL v.3 free software.
QuIDDPro
• QuIDDs for transformations and states
• Explores data patterns
• Limited by execution time
• Grover's algorithm with 40 qubits: 8.23 × 10^4 sec. and 0.398 MB RAM.
Parallel Simulators
Massive Parallel QCS
• Distributed simulation
• Fortran 90
• Communication through MPI
• 42-qubit Shor's algorithm on JUGENE.
QCS Using CUDA
• GPU acting as a co-processor
• 1- and 2-qubit universal transformations
• 26-qubit QFT: 95× speedup vs. Libquantum
• Limited by global memory space.
VPE-qGM
VPE-qGM (Visual Programming Environment for the qGM Model)
• Graphical environment for modeling and simulation of quantum algorithms
• Based on the qGM (Quantum Geometric Machine) model
What makes our project relevant?
• A complex environment is being developed
• Support for simulation with GPUs and clusters
• Integrated interfaces for quantum circuits and processes from the qGM model
PyCUDA Framework
PyCUDA
Why PyCUDA?
• Integrated with the Python language; the CUDA kernel is still coded in C
• Data allocation and data copy are simplified
• Dynamic code generation.
Why PyCUDA in our project?
• The VPE-qGM environment is developed in Python
• The computational cost of the host code is low
• Creation of the data to be sent to the GPU is:
  • based on string manipulation
  • easily prototyped in Python.
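Dynamic code generation through string manipulation could look roughly like the sketch below, which fills in the element type and array size of a kernel source string on the host. The template text, the kernel name `apply` and the function `build_kernel_source` are hypothetical; only the identifier `dadosGPU` is taken from these slides.

```python
from string import Template

# Hypothetical kernel skeleton: the constant array is sized at run time by
# editing the source text before it is compiled.
KERNEL_TEMPLATE = Template("""
__device__ __constant__ $dtype dadosGPU[$size];

__global__ void apply(const $dtype *in, $dtype *out)
{
    /* kernel body omitted in this sketch */
}
""")

def build_kernel_source(dtype, size):
    """Return CUDA C source with the element type and array size filled in."""
    return KERNEL_TEMPLATE.substitute(dtype=dtype, size=size)

source = build_kernel_source("float2", 64)
```

On a machine with a CUDA device, a string like `source` is what would be handed to `pycuda.compiler.SourceModule` for compilation.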
Constant-Size Source Matrices and Auxiliary Data
Data Structures
• Data stored as a vector
• Optimized for sparse matrices
• Constant memory is used
• Auxiliary data helps indexing
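The slide does not spell out the layout, so the following is only one possible reading: the source matrices are packed row by row into a single flat vector, and auxiliary offset and size lists locate each element. All names are assumptions, and the sparse-matrix optimization is not modelled.

```python
def flatten_matrices(matrices):
    """Pack square matrices row by row into one flat list.

    Returns (data, offsets, sizes): offsets[k] is the position where
    matrix k starts in data, sizes[k] is its number of rows.
    """
    data, offsets, sizes = [], [], []
    for m in matrices:
        offsets.append(len(data))
        sizes.append(len(m))
        for row in m:
            data.extend(row)
    return data, offsets, sizes

def element(data, offsets, sizes, k, i, j):
    """Read element (i, j) of matrix k from the flat vector."""
    return data[offsets[k] + i * sizes[k] + j]

# Two illustrative matrices: a 2x2 and a 1x1.
data, offsets, sizes = flatten_matrices([[[1, 2], [3, 4]], [[5]]])
```

A flat vector of this kind maps directly onto a fixed-size array such as one held in GPU constant memory.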
CUDA Kernel
Kernel
• Blocks of 256 threads
• Each thread generates 4 amplitudes
• 2^q / 4 threads are necessary (q is the number of qubits)
• The number of threads grows exponentially with q
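The thread count above is simple arithmetic, sketched here for concreteness. The function name `launch_config` and the rounding up of the block count for small q are assumptions of this sketch.

```python
THREADS_PER_BLOCK = 256
AMPLITUDES_PER_THREAD = 4

def launch_config(q):
    """Return (threads, blocks) for a q-qubit state, with q >= 2."""
    amplitudes = 2 ** q
    threads = amplitudes // AMPLITUDES_PER_THREAD
    # Ceiling division, so small states still get one block.
    blocks = -(-threads // THREADS_PER_BLOCK)
    return threads, blocks

threads_20, blocks_20 = launch_config(20)
```

With these figures each full block covers 256 × 4 = 1024 amplitudes, and adding one qubit doubles the number of threads.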
• Each thread covers all elements of one row of each matrix
• The last matrix is accessed by all threads
• Each block generates a partial quantum state of 1024 amplitudes
• Amplitudes are stored in shared memory
Iteration   Operation
1           sMem[0] += 1/2 · 1/2 · 1/2 · 1/2 · 1
2           sMem[0] += 1/2 · 1/2 · 1/2 · 1/2 · 0
3           sMem[0] += 1/2 · 1/2 · 1/2 · 1/2 · 0
4           sMem[0] += 1/2 · 1/2 · 1/2 · 1/2 · 0
5           sMem[1] += 1/2 · 1/2 · 1/2 · 1/2 · 0
...         ...
16          sMem[3] += 1/2 · 1/2 · 1/2 · 1/2 · 0
...         ...
1024        sMem[3] += 1/2 · 1/2 · 1/2 · 1/2 · 0
• A thread stores the partial sum in shared memory
• The input (current) state is accessed through global memory (and that's bad)
• After 1024 iterations, 4 amplitudes of a partial state are obtained
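The accumulation pattern in the table can be imitated sequentially in plain Python: each term is a product of one element from every source matrix times one input amplitude, so the full Kronecker product is never built. This is a sketch, not the actual CUDA kernel; the 4 × 4 factor `HH` (entries ±1/2), the choice of four factors and the function names are assumptions made for illustration.

```python
def kron_element(factors, row, col):
    """Element (row, col) of factors[0] (x) factors[1] (x) ... computed on the fly."""
    value = 1
    for m in reversed(factors):  # the last factor uses the lowest index digits
        n = len(m)
        value *= m[row % n][col % n]
        row //= n
        col //= n
    return value

def amplitude(factors, state, row):
    """Accumulate one output amplitude, as one shared-memory slot would."""
    acc = 0
    for col, amp in enumerate(state):
        acc += kron_element(factors, row, col) * amp
    return acc

# H (x) H as a 4x4 matrix; every entry is +1/2 or -1/2.
HH = [[0.5, 0.5, 0.5, 0.5],
      [0.5, -0.5, 0.5, -0.5],
      [0.5, 0.5, -0.5, -0.5],
      [0.5, -0.5, -0.5, 0.5]]

factors = [HH, HH, HH, HH]       # 8 qubits, 256 amplitudes
state = [1.0] + [0.0] * 255      # input state |00000000>
first = amplitude(factors, state, 0)
```

For this input only the first term of each sum is non-zero (1/2 · 1/2 · 1/2 · 1/2 · 1), which matches iteration 1 of the table.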
Methodology
Scenario and Comparison
• Hadamard transformations up to 20 qubits
• Why? It is the transformation with the highest computing cost in the VPE-qGM
GPU Simulation
• Simulation time obtained from NVIDIA Visual Profiler 5.0 (30 executions)
• Intel Core i7-3770, 8 GB RAM, NVIDIA GT640 (GK104), Ubuntu 12.04 64-bit.
Distributed Simulation
• Average time over 15 simulations of each case
• Intel Core2Quad Q8200 2.33 GHz, 4 GB RAM, Ubuntu 12.04 64-bit.
Simulation Time
Data collected
Distributed Simulation
• Max standard deviation: 1.9% for H^{⊗18}
• H^{⊗19} and H^{⊗20} would require approximately 1 and 5 hours, respectively
Figure: Simulation times for the experiments
Discussion
Analyzing the Data
Speedups
• ≈ 550× vs. a 1-core simulation
• ≈ 85× vs. an 8-core simulation
Main Contributions
Contributions
Main achievements:
• A significant boost in simulation performance was obtained through GPU acceleration
• H^{⊗19} and H^{⊗20} are now supported by the VPE-qGM
Constraints:
• Exponential increase in the simulation time: H^{⊗25} ≈ 1.26 · 10^6 seconds
• Even with more efficient computation, simulation is limited to 25 qubits: ≈ 1 GB RAM would be required for the state vector
Upsides
• Great boost in the simulation
• Only 4 months of work
• Many possible improvements
Downsides
• GPU's potential not fully used
• Outperformed by other simulators
Future Work
Next Challenges and our Ultimate Goal
Future work:
• Improve the kernel: a new algorithm is already in development
• Extend support for controlled quantum transformations
• Optimize the use of all the GPU's resources
• Big obstacle: a suitable approach for efficient storage/representation of the state vector.
How is our project different from other solutions?
• Provides graphical interfaces for the simulation
• Open source code
• Integration between accelerator and optimized algorithm
• Applies optimizations for both transformations and state vector storage
• Will consolidate support for simulation on a cluster with GPUs
Acknowledgements
Acknowledgements:
• FAPERGS Agency
• GreenGrid Project (PRONEX FAPERGS/CNPq)
• CAPES
Contacts
• [email protected]
• [email protected]
• [email protected]
13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2013)
Further questions
Matrix computation is used. Why don't we use an existing package?
• To optimize the execution of controlled gates by identifying unnecessary computation (vectors that do not change any amplitudes)
• Current solutions for the Kronecker product on GPUs apply only to a small number of matrices
• We want dynamic generation of the elements associated with the resulting matrix, avoiding the storage of an exponential number of elements
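A quick count shows why storing the resulting matrix is avoided. The sketch below compares the number of elements in the full 2^q × 2^q matrix with the number held when only the small Kronecker factors are kept; the function names and the one-qubit-per-factor default are assumptions.

```python
def full_matrix_elements(q):
    """Elements in the explicit 2^q x 2^q transformation matrix."""
    return (2 ** q) ** 2

def factor_elements(q, qubits_per_factor=1):
    """Elements stored when only the Kronecker factors are kept.

    Assumes q is a multiple of qubits_per_factor.
    """
    dim = 2 ** qubits_per_factor
    return (q // qubits_per_factor) * dim * dim

full_20 = full_matrix_elements(20)
factors_20 = factor_elements(20)
```

For 20 qubits the explicit matrix has 2^40 elements, while twenty 2 × 2 factors hold 80.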
Data copy using PyCUDA
Data in the constant memory:
• In the kernel: __device__ __constant__ dType dadosGPU[tamanho];
• Get the GPU memory address: endGPU = kernel.get_global('dadosGPU')[0]
• Host ⇒ device data copy: pycuda.driver.memcpy_htod(endGPU, objetoNumPy)
Global memory data:
• readMemory = numpy.array(numpy.zeros(2**q), dtype=numpy.complex64, order='C')
• readMemory_gpu = gpuarray.to_gpu(readMemory)
• writeMemory_gpu = gpuarray.zeros(2**q, dtype=numpy.complex64, order='C')
Condition for such operations:
• All data must be stored as objects of the NumPy library
• Advantage: data movement is simplified