
High-Performance Quantum Computing Simulation for the Quantum Geometric Machine Model

Adriano Kurz Maron
Renata Hax Sander Reiser
Maurício Lima Pilla

May 16th, 2013


Outline

1 Introduction
  Background & Motivation
  Related Work
  VPE-qGM
2 QC Simulation on the VPE-qGM Using GPU
  PyCUDA Framework
  Constant-Size Source Matrices and Auxiliary Data
  CUDA Kernel
3 Results
  Methodology
  Simulation Time
  Discussion
4 Conclusions
  Main Contributions
  Future Work
  Acknowledgements


Introduction

Background & Motivation

Quantum Computing Context
• Quantum computers may be exponentially faster than classical ones
• Performance comes from quantum-mechanical phenomena
• Entirely new algorithms are required

Figure: "Quantum hardware": few quantum bits.

Quantum Simulation
• Performs the operations related to the temporal evolution
• Simulates the behavior of quantum algorithms as if they were being executed on quantum hardware
• Quantum simulation is computationally expensive.



Quantum Computing

1 qubit – Basic quantum transformations act on a state vector, such as |ψ⟩ = α|0⟩ + β|1⟩:

    H|\psi\rangle \equiv \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix} \alpha+\beta \\ \alpha-\beta \end{pmatrix}

2 qubits – Quantum transformations obtained from the Kronecker product act on a state vector, such as |φ⟩ = α|00⟩ + β|01⟩ + γ|10⟩ + δ|11⟩:

    (H \otimes H)|\phi\rangle = \left[\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \otimes \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\right]|\phi\rangle
    \equiv \frac{1}{2}\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \\ \gamma \\ \delta \end{pmatrix}
    = \frac{1}{2}\begin{pmatrix} \alpha+\beta+\gamma+\delta \\ \alpha-\beta+\gamma-\delta \\ \alpha+\beta-\gamma-\delta \\ \alpha-\beta-\gamma+\delta \end{pmatrix}

Important: A gate-by-gate implementation is possible and, in this case, more efficient, but it does not apply when generating entangled states.
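For reference, the 2-qubit example can be checked with a few lines of NumPy. This is an illustration only; as discussed later, the simulator itself never materializes the full Kronecker-product matrix:

    # Minimal NumPy check of the 2-qubit example above (illustration only).
    import numpy as np

    H = (1 / np.sqrt(2)) * np.array([[1, 1], [1, -1]])
    phi = np.array([0.5, 0.5, 0.5, 0.5])   # alpha = beta = gamma = delta = 1/2

    HH = np.kron(H, H)                      # the 4x4 matrix shown on the slide
    print(HH @ phi)                         # ~[1, 0, 0, 0], i.e. the state |00>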


Motivation of our Work

Sequential simulators...
• have low memory requirements
• have a clever logic for the computation
• provide graphical interfaces
• are limited by the simulation time.

Parallel simulators...
• are less optimized (brute force)
• focus on parallelization techniques
• are mostly limited by memory.

Our Challenge: Best of both worlds
• Merge the approaches of sequential and parallel simulators:
  • Apply an algorithm that exploits the patterns in the definition of quantum transformations to reduce the number of operations
  • Distribute the computation across GPUs (future work)


Related Work

Sequential Simulators

Libquantum
• C library for quantum simulation
• Decoherence support
• Official SPEC CPU2006 benchmark
• GPL v.3 free software.

QuIDDPro
• QuIDDs for transformations and states
• Explores data patterns
• Limited by execution time
• Grover's algorithm with 40 qubits: 8.23 × 10^4 sec. and 0.398 MB RAM.

Parallel Simulators

Massive Parallel QCS
• Distributed simulation
• Fortran 90
• Communication through MPI
• 42-qubit Shor's algorithm on JUGENE.

QCS Using CUDA
• GPU acting as a co-processor
• 1- and 2-qubit universal transformations
• 26-qubit QFT: 95× speedup vs. Libquantum
• Limited by global memory space.


VPE-qGM (Visual Programming Environment for the qGM Model)
• Graphical environment for modeling and simulation of quantum algorithms
• Based on the qGM (Quantum Geometric Machine) model

What makes our project relevant?
• A complex environment is being developed
• Support for simulation with GPUs and clusters
• Integrated interfaces for quantum circuits and processes from the qGM model


QC Simulation on the VPE-qGM Using GPU

PyCUDA Framework

Why PyCUDA?
• Integrated with the Python language; the CUDA kernel is still coded in C
• Data allocation and data copy are simplified
• Dynamic code generation.

Why PyCUDA in our project?
• The VPE-qGM environment is developed in Python
• The computational cost of the host code is low
• The data sent to the GPU is built through string manipulation, which is easily prototyped in Python.
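For readers unfamiliar with PyCUDA, a minimal, self-contained sketch of this workflow follows: a kernel compiled from a Python string (dynamic code generation) and simplified host/device copies. The kernel and names are placeholders, not the project's code:

    # Minimal PyCUDA sketch (not the VPE-qGM kernel): compile a kernel from a
    # string, copy a state vector to the GPU, and launch with 256-thread blocks.
    import numpy as np
    import pycuda.autoinit                     # creates a CUDA context
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    q = 3                                      # number of qubits (illustrative)
    kernel_src = """
    __global__ void scale(float2 *state, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {                           // multiply each amplitude by factor
            state[i].x *= factor;
            state[i].y *= factor;
        }
    }
    """
    mod = SourceModule(kernel_src)             # dynamic code generation
    scale = mod.get_function("scale")

    state = np.zeros(2 ** q, dtype=np.complex64)
    state[0] = 1.0                             # |00...0>
    state_gpu = gpuarray.to_gpu(state)         # host -> device copy in one call

    scale(state_gpu.gpudata, np.float32(0.5), np.int32(state.size),
          block=(256, 1, 1), grid=((state.size + 255) // 256, 1))
    print(state_gpu.get()[:4])                 # device -> host copy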


Constant-Size Source Matrices and Auxiliary Data

Data Structures
• Data stored as a vector
• Optimized for sparse matrices
• Constant memory is used
• Auxiliary data helps indexing
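As an illustration of this idea, one possible flattening of the per-qubit gate matrices into small constant-size arrays plus auxiliary index data is sketched below. The layout is assumed for illustration; the slides do not show the exact encoding used by the VPE-qGM:

    # Hypothetical CSR-like flattening of sparse 2x2 gates into flat arrays
    # (layout assumed, not the paper's exact format).
    import numpy as np

    H = (1 / np.sqrt(2)) * np.array([[1, 1], [1, -1]], dtype=np.complex64)
    I = np.eye(2, dtype=np.complex64)

    def flatten_gates(gates):
        """Pack the nonzero entries of each 2x2 gate into flat vectors."""
        values, cols, row_ptr = [], [], [0]
        for g in gates:
            for row in g:
                for c, v in enumerate(row):
                    if v != 0:                 # keep only nonzero entries
                        values.append(v)
                        cols.append(c)
                row_ptr.append(len(values))    # where the next row starts
        return (np.array(values, dtype=np.complex64),
                np.array(cols, dtype=np.int32),
                np.array(row_ptr, dtype=np.int32))

    values, cols, row_ptr = flatten_gates([H, I, H])   # e.g. H (x) I (x) H
    # These small, constant-size arrays are what would go to constant memory.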


CUDA Kernel

• Blocks of 256 threads
• Each thread generates 4 amplitudes
• 2^q / 4 threads are necessary
• The number of threads grows exponentially with the number of qubits

• Each thread covers all elements of one line of each matrix
• The last matrix is accessed by all threads
• Each block generates a partial quantum state of 1024 amplitudes
• Amplitudes are stored in the shared memory
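A small helper illustrating the launch geometry implied by these bullets (function name and defaults are ours, not from the VPE-qGM source):

    # Launch geometry implied above: blocks of 256 threads, each thread
    # producing 4 of the 2^q amplitudes (assumed helper, for illustration).
    def launch_geometry(q, threads_per_block=256, amps_per_thread=4):
        total_threads = (2 ** q) // amps_per_thread          # 2^q / 4 threads
        blocks = (total_threads + threads_per_block - 1) // threads_per_block
        return blocks, threads_per_block

    print(launch_geometry(20))   # -> (1024, 256): 262144 threads for 20 qubits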


Example of the accumulation performed by one thread:

    Iteration | Operation
        1     | sMem[0] += 1/2 · 1/2 · 1/2 · 1/2 · 1
        2     | sMem[0] += 1/2 · 1/2 · 1/2 · 1/2 · 0
        3     | sMem[0] += 1/2 · 1/2 · 1/2 · 1/2 · 0
        4     | sMem[0] += 1/2 · 1/2 · 1/2 · 1/2 · 0
        5     | sMem[1] += 1/2 · 1/2 · 1/2 · 1/2 · 0
       ...    | ...
       16     | sMem[3] += 1/2 · 1/2 · 1/2 · 1/2 · 0
       ...    | ...
      1024    | sMem[3] += 1/2 · 1/2 · 1/2 · 1/2 · 0

• A thread stores the partial sum in shared memory
• The input (current) state is accessed through global memory (and that's bad)
• After 1024 iterations, 4 amplitudes of a partial state are obtained
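To make the scheme concrete, here is a simplified, hypothetical kernel for the special case of H^{⊗q} only: matrix elements are generated on the fly and each thread accumulates its partial sum while reading the input state from global memory. For brevity each thread produces a single amplitude; the actual VPE-qGM kernel handles arbitrary products of one-qubit gates and a different work distribution:

    # Simplified, hypothetical kernel for H^(x)q (illustration, not the VPE-qGM code).
    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    kernel_src = """
    __global__ void hadamard_all(const float2 *in, float2 *out, int q)
    {
        extern __shared__ float2 sMem[];          // partial amplitude per thread
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        int n = 1 << q;
        if (row >= n) return;

        float norm = powf(0.70710678f, (float)q); // (1/sqrt(2))^q
        float2 acc = make_float2(0.f, 0.f);
        for (int col = 0; col < n; ++col) {
            // element (row, col) of H^(x)q is +-norm; sign = parity of row & col
            float sign = (__popc(row & col) & 1) ? -1.f : 1.f;
            float2 amp = in[col];                 // global-memory read (the slow part)
            acc.x += sign * amp.x;
            acc.y += sign * amp.y;
        }
        sMem[threadIdx.x] = make_float2(acc.x * norm, acc.y * norm);
        out[row] = sMem[threadIdx.x];             // write the finished amplitude
    }
    """
    mod = SourceModule(kernel_src)
    hadamard_all = mod.get_function("hadamard_all")

    q = 10
    state = np.zeros(1 << q, dtype=np.complex64)
    state[0] = 1.0
    in_gpu = gpuarray.to_gpu(state)
    out_gpu = gpuarray.zeros(1 << q, np.complex64)
    threads = 256
    hadamard_all(in_gpu.gpudata, out_gpu.gpudata, np.int32(q),
                 block=(threads, 1, 1),
                 grid=(((1 << q) + threads - 1) // threads, 1),
                 shared=threads * 8)              # 8 bytes per float2
    print(out_gpu.get()[:4])                      # each amplitude ~ (1/sqrt(2))^q = 0.03125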


Results

Methodology

Scenario and Comparison
• Hadamard transformations up to 20 qubits
• Why? The transformation with the highest computing cost in the VPE-qGM

GPU Simulation
• Simulation time obtained from the NVIDIA Visual Profiler 5.0 (30 executions)
• Intel Core i7-3770, 8 GB RAM, NVIDIA GT640 (GK104), Ubuntu 12.04 64 bits.

Distributed Simulation
• Time averaged over 15 simulations of each case
• Intel Core2 Quad Q8200 2.33 GHz, 4 GB RAM, Ubuntu 12.04 64 bits.


Simulation Time

Data collected

Distributed Simulation
• Maximum standard deviation: 1.9% for H^{⊗18}
• H^{⊗19} and H^{⊗20} would require approximately 1 and 5 hours, respectively

Figure: Simulation times for the experiments


Discussion

Analyzing the Data

Speedups
• ≈ 550× vs. a 1-core simulation
• ≈ 85× vs. an 8-core simulation


Conclusions

Main Contributions

Main achievements:
• A significant boost in the simulation was obtained through GPU acceleration
• H^{⊗19} and H^{⊗20} are now supported by the VPE-qGM

Constraints:
• Exponential increase in the simulation time: H^{⊗25} would take ≈ 1.26 · 10^6 seconds
• Even with more efficient computation, simulation is limited to 25 qubits – ≈ 1 GB of RAM would be required for the state vector

Upsides
• Great boost in the simulation
• Only 4 months of work
• Many possible improvements

Downsides
• GPU's potential not fully used
• Outperformed by other simulators
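One way to reproduce the ≈ 1 GB figure is with double-precision complex amplitudes and separate input and output vectors; the slides do not state which precision this estimate assumes:

    # Back-of-the-envelope check of the 25-qubit memory bound mentioned above
    # (assumption: double-precision amplitudes, separate read/write vectors).
    q = 25
    bytes_per_amplitude = 16                      # e.g. numpy.complex128
    per_vector = (2 ** q) * bytes_per_amplitude   # bytes for one state vector
    print(per_vector / 2**30, "GiB per state vector")                # 0.5 GiB
    print(2 * per_vector / 2**30, "GiB for input + output vectors")  # ~1 GiB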


Future Work

Next Challenges and our Ultimate Goal

Future work:
• Improve the kernel – a new algorithm is already in development
• Extend support for controlled quantum transformations
• Optimize the use of all of the GPU's resources
• Big obstacle: a suitable approach for efficient storage/representation of the state vector.

How is our project different from other solutions?
• Provides graphical interfaces for the simulation
• Open source code
• Integration between accelerator and optimized algorithm
• Applies optimizations to both transformations and state-vector storage
• Will consolidate the support for simulation in a cluster with GPUs


Acknowledgements

• FAPERGS Agency
• GreenGrid Project (PRONEX FAPERGS/CNPq)
• CAPES

Contacts
• [email protected]
• [email protected]
• [email protected]

13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing – CCGRID 2013


Further Questions

Matrix computation is used. Why don't we use an existing package?
• To optimize the execution of controlled gates by identifying unnecessary computation (vectors that do not change any amplitudes)
• Current solutions for the Kronecker product on GPUs apply to a small number of matrices
• We want dynamic generation of the elements of the resulting matrix, avoiding the storage of an exponential number of elements
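The last point can be illustrated with a short sketch (ours, not the project's code): an element of A_1 ⊗ … ⊗ A_q is the product of one entry from each 2×2 factor, so elements can be generated on demand without storing the 2^q × 2^q matrix:

    # On-demand generation of a Kronecker-product element (illustration only).
    import numpy as np

    def kron_element(factors, i, j):
        """Entry (i, j) of the Kronecker product of the 2x2 matrices `factors`."""
        value = 1.0 + 0.0j
        for k, A in enumerate(factors):
            shift = len(factors) - 1 - k          # bit of i/j selecting row/col in A
            value *= A[(i >> shift) & 1, (j >> shift) & 1]
        return value

    H = (1 / np.sqrt(2)) * np.array([[1, 1], [1, -1]])
    factors = [H, H]
    full = np.kron(H, H)                          # only feasible for small q
    assert np.allclose(full[2, 3], kron_element(factors, 2, 3))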


Data copy using PyCUDA

Data in the constant memory:
• At the kernel:
    __device__ __constant__ dType dadosGPU[tamanho];
• Get the GPU's memory address:
    endGPU = kernel.get_global('dadosGPU')[0]
• host => device data copy:
    pycuda.driver.memcpy_htod(endGPU, objetoNumPy)

Global memory data:
• readMemory = numpy.array(numpy.zeros(2**q), dtype=numpy.complex64, order='C')
• readMemory_gpu = gpuarray.to_gpu(readMemory)
• writeMemory_gpu = gpuarray.zeros(2**q, dtype=numpy.complex64, order='C')

Condition for such operations:
• All data must be stored as objects of the NumPy library
• Advantage: data movement is simplified
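Putting the constant-memory steps together, a runnable toy example follows; the kernel, symbol name, and sizes are placeholders rather than the VPE-qGM source:

    # Toy end-to-end example of the constant-memory copy described above.
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as driver
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __constant__ float dadosGPU[8];                // constant-memory buffer

    __global__ void soma(float *out)
    {
        int i = threadIdx.x;
        out[i] = dadosGPU[i] + 1.0f;               // read from constant memory
    }
    """)
    endGPU = mod.get_global("dadosGPU")[0]         # device address of the symbol
    driver.memcpy_htod(endGPU, np.arange(8, dtype=np.float32))  # host -> constant mem

    out_gpu = gpuarray.zeros(8, dtype=np.float32)
    mod.get_function("soma")(out_gpu.gpudata, block=(8, 1, 1), grid=(1, 1))
    print(out_gpu.get())                           # [1. 2. 3. 4. 5. 6. 7. 8.]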