High-Performance Quantum Computing Simulation for the Quantum Geometric Machine Model

Adriano Kurz Maron, Renata Hax Sander Reiser, Maurício Lima Pilla

May 16th, 2013

1 Introduction
• Background & Motivation
• Related Work
• VPE-qGM

2 QC Simulation on the VPE-qGM Using GPU
• PyCUDA Framework
• Constant-Size Source Matrices and Auxiliary Data
• CUDA Kernel

3 Results
• Methodology
• Simulation Time
• Discussion

4 Conclusions
• Main Contributions
• Future Work
• Acknowledgements



Background & Motivation

Quantum Computing Context
• Quantum computers may be exponentially faster than classical ones
• Performance comes from quantum mechanics phenomena
• Entirely new algorithms are required

Figure: “Quantum hardware”: few quantum bits.

Quantum Simulation
• Performs the operations related to the temporal evolution of the quantum state
• Simulates the behavior of quantum algorithms as if they were being executed on quantum hardware
• Quantum simulations are computationally expensive.

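Quantum Computing

1 qubit: basic quantum transformations act on a state vector, such as $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$

$$
H|\psi\rangle \equiv \frac{1}{\sqrt{2}}
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\times
\begin{pmatrix} \alpha \\ \beta \end{pmatrix}
= \frac{1}{\sqrt{2}}
\begin{pmatrix} \alpha+\beta \\ \alpha-\beta \end{pmatrix}
$$

2 qubits: quantum transformations obtained from the Kronecker product act on a state vector, such as $|\phi\rangle = \alpha|00\rangle + \beta|01\rangle + \gamma|10\rangle + \delta|11\rangle$

$$
(H \otimes H)|\phi\rangle =
\left[
\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\otimes
\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\right] |\phi\rangle
\equiv \frac{1}{2}
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1
\end{pmatrix}
\begin{pmatrix} \alpha \\ \beta \\ \gamma \\ \delta \end{pmatrix}
= \frac{1}{2}
\begin{pmatrix}
\alpha+\beta+\gamma+\delta \\
\alpha-\beta+\gamma-\delta \\
\alpha+\beta-\gamma-\delta \\
\alpha-\beta-\gamma+\delta
\end{pmatrix}
$$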


Important: gate-by-gate implementation is possible and, in this case, more efficient, but it does not apply when generating entangled states.
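As an illustration of the two routes for the 2-qubit Hadamard example, the following NumPy sketch applies H ⊗ H once through the explicit 4×4 Kronecker product and once gate by gate. The function names and the test state are assumptions made for this example; this is not the VPE-qGM implementation.

```python
import numpy as np

# 1-qubit Hadamard matrix, including the 1/sqrt(2) normalization.
H = np.array([[1, 1], [1, -1]], dtype=np.complex64) / np.sqrt(2)

def apply_full_matrix(state):
    """Apply H (x) H by building the 4x4 Kronecker product explicitly."""
    return np.kron(H, H) @ state

def apply_gate_by_gate(state):
    """Apply H to each qubit in turn, never building the 4x4 matrix."""
    psi = state.reshape(2, 2)                # axis 0: first qubit, axis 1: second qubit
    psi = np.einsum("ab,bj->aj", H, psi)     # H acting on the first qubit
    psi = np.einsum("ab,ib->ia", H, psi)     # H acting on the second qubit
    return psi.reshape(4)

# Hypothetical input with alpha = beta = gamma = delta = 1/2.
phi = np.array([0.5, 0.5, 0.5, 0.5], dtype=np.complex64)
full = apply_full_matrix(phi)
stepwise = apply_gate_by_gate(phi)
print(np.allclose(full, stepwise))
```

Both routes give the same amplitudes for this product transformation; the gate-by-gate version only touches 2×2 matrices.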


Motivation of our Work

Sequential simulators...
• have low memory requirements
• have a clever logic for the computation
• provide graphical interfaces
• are limited by the simulation time.

Parallel simulators...
• are less optimized (brute force)
• focus on parallelization techniques
• are mostly limited by memory

Our Challenge: Best of both worlds
• Merge the approaches of sequential and parallel simulators:
  • Apply an algorithm that exploits the patterns in the definition of quantum transformations to reduce the number of operations
  • Distribute the computation across GPUs (future work)


Related Work

Sequential Simulators

Libquantum
• C library for quantum simulation
• Decoherence support
• Official SPEC CPU2006 benchmark
• GPL v.3 free software.

QuIDDPro
• QuIDDs for transformations and states
• Explores data patterns
• Limited by execution time
• Grover’s algorithm with 40 qubits: 8.23 × 10^4 sec. and 0.398 MB RAM.

Parallel Simulators

Massive Parallel QCS
• Distributed simulation
• Fortran 90
• Communication through MPI
• 42-qubit Shor’s algorithm on JUGENE.

QCS Using CUDA
• GPU acting as a co-processor
• 1, 2-qubit universal transformations
• 26-qubit QFT: 95× speedup vs. Libquantum
• Limited by global memory space.


VPE-qGM

VPE-qGM (Visual Programming Environment for the qGM Model)
• Graphical environment for modeling and simulation of quantum algorithms
• Based on the qGM (Quantum Geometric Machine) model

What makes our project relevant?
• A complex environment is being developed
• Support for simulation with GPUs and clusters
• Integrated interfaces for quantum circuits and processes from the qGM model



PyCUDA Framework

PyCUDA

Why PyCUDA?
• Integrated with the Python language; the CUDA kernel is still written in C
• Data allocation and data copy are simplified
• Dynamic code generation.

Why PyCUDA in our project?
• The VPE-qGM environment is developed in Python
• Computational cost of the host code is low
• Creation of the data to be sent to the GPU is...
  • based on string manipulation
  • easily prototyped in Python.
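To make "dynamic code generation" and "string manipulation" concrete, here is a minimal sketch in which the host code fills in a CUDA source template before compilation. The template, the `scale` kernel and the helper name are hypothetical and are not taken from the VPE-qGM; the sketch only builds the source string, so it runs without a GPU.

```python
from string import Template

# Hypothetical CUDA source template; $name, $size and $block_size are
# filled in by the Python host code at run time.
KERNEL_TEMPLATE = Template("""
__device__ __constant__ float2 $name[$size];

__global__ void scale(float2 *state, float factor)
{
    int i = blockIdx.x * $block_size + threadIdx.x;
    state[i].x *= factor;
    state[i].y *= factor;
}
""")

def build_kernel_source(name, size, block_size=256):
    """Return CUDA C source with the array name and sizes substituted in."""
    return KERNEL_TEMPLATE.substitute(name=name, size=size, block_size=block_size)

source = build_kernel_source("dadosGPU", 16)
print(source)

# With PyCUDA and a CUDA-capable GPU available, the string could then be
# compiled at run time:
#   from pycuda.compiler import SourceModule
#   module = SourceModule(source)
```

Because the kernel is just a string until it is compiled, sizes that are only known at run time can be baked into the source as constants.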


Constant-Size Source Matrices and Auxiliary Data

Data Structures
• Data stored as a vector
• Optimized for sparse matrices
• Constant memory is used
• Auxiliary data helps indexing
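The slide does not give the exact layout, so the following is only one possible reading of "data stored as a vector" with "auxiliary data" for indexing: several small square matrices are flattened row by row into a single vector, and two helper arrays (offsets and dimensions) locate any element. All names are assumptions, and the sketch does not attempt the sparse-matrix optimization mentioned above.

```python
import numpy as np

def pack_matrices(matrices):
    """Flatten square matrices (row-major) into one vector plus index tables."""
    offsets, dims, chunks = [], [], []
    position = 0
    for m in matrices:
        m = np.asarray(m, dtype=np.complex64)
        offsets.append(position)          # where this matrix starts in the vector
        dims.append(m.shape[0])           # its dimension (rows == columns)
        chunks.append(m.ravel(order="C"))
        position += m.size
    data = np.concatenate(chunks)
    return data, np.array(offsets, dtype=np.int32), np.array(dims, dtype=np.int32)

def element(data, offsets, dims, k, row, col):
    """Look up element (row, col) of the k-th packed matrix."""
    return data[offsets[k] + row * dims[k] + col]

hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
pauli_x = np.array([[0, 1], [1, 0]])
bigger = np.arange(16).reshape(4, 4)

data, offsets, dims = pack_matrices([hadamard, pauli_x, bigger])
print(element(data, offsets, dims, 2, 3, 1))
```

A flat vector of this kind, converted to a NumPy array, is the sort of object that can be copied to GPU constant memory in one call.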


CUDA Kernel

Kernel
• Blocks of 256 threads
• Each thread generates 4 amplitudes
• 2^q / 4 threads are necessary
• Number of threads grows exponentially
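The launch figures above can be turned into a small calculation. The helper below is a sketch with a hypothetical name; it assumes q ≥ 2 so that at least one thread is needed, and rounds the block count up.

```python
THREADS_PER_BLOCK = 256        # from the slide
AMPLITUDES_PER_THREAD = 4      # from the slide

def launch_configuration(q):
    """Return (threads, blocks) needed for a q-qubit state, assuming q >= 2."""
    if q < 2:
        raise ValueError("this sketch assumes at least 2 qubits")
    amplitudes = 2 ** q
    threads = amplitudes // AMPLITUDES_PER_THREAD
    blocks = -(-threads // THREADS_PER_BLOCK)   # ceiling division
    return threads, blocks

for q in (10, 15, 20):
    threads, blocks = launch_configuration(q)
    print(f"q={q}: {threads} threads in {blocks} block(s)")
```

Each additional qubit doubles the number of threads, which is the exponential growth noted in the last bullet.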



• Each thread covers all elements of one row of each matrix
• The last matrix is accessed by all threads
• Each block generates a partial quantum state of 1024 amplitudes
• Amplitudes are stored in shared memory

Iteration   Operation
1           sMem[0] += 1/2 · 1/2 · 1/2 · 1/2 · 1
2           sMem[0] += 1/2 · 1/2 · 1/2 · 1/2 · 0
3           sMem[0] += 1/2 · 1/2 · 1/2 · 1/2 · 0
4           sMem[0] += 1/2 · 1/2 · 1/2 · 1/2 · 0
5           sMem[1] += 1/2 · 1/2 · 1/2 · 1/2 · 0
...         ...
16          sMem[3] += 1/2 · 1/2 · 1/2 · 1/2 · 0
...         ...
1024        sMem[3] += 1/2 · 1/2 · 1/2 · 1/2 · 0

• A thread stores the partial sum in shared memory
• The input (current) state is accessed through global memory (and that’s bad)
• After 1024 iterations, 4 amplitudes of a partial state are obtained
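The accumulation pattern in the table (a product of one element from each source matrix, multiplied by one input amplitude and added to a partial sum) can be sketched sequentially in NumPy. This is a CPU illustration under the assumption of 2×2 source matrices, not the CUDA kernel itself; the function names are hypothetical, and `acc` only plays the role of the shared-memory partial sum.

```python
import numpy as np

def kron_element(matrices, row, col):
    """Element (row, col) of M_1 (x) ... (x) M_n for 2x2 matrices, computed on demand."""
    n = len(matrices)
    value = 1.0 + 0.0j
    for k, m in enumerate(matrices):
        shift = n - 1 - k                  # first matrix acts on the most significant bit
        r = (row >> shift) & 1
        c = (col >> shift) & 1
        value *= m[r][c]
    return value

def apply_on_the_fly(matrices, state):
    """Compute the new state without ever storing the full 2^n x 2^n matrix."""
    size = len(state)
    new_state = np.zeros(size, dtype=np.complex128)
    for row in range(size):                # one output amplitude per row
        acc = 0.0j                         # partial sum, like sMem[...] in the table
        for col in range(size):            # one iteration per input amplitude
            acc += kron_element(matrices, row, col) * state[col]
        new_state[row] = acc
    return new_state

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
state = np.zeros(16, dtype=np.complex128)
state[0] = 1.0                             # |0000>: only the first term of each sum is nonzero
result = apply_on_the_fly([H, H, H, H], state)
print(result.real)
```

Only the small source matrices and the state vector are kept in memory; every element of the large matrix is regenerated when it is needed.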



Methodology

Scenario and Comparison
• Hadamard transformations up to 20 qubits
• Why? It is the transformation with the highest computing cost in the VPE-qGM

GPU Simulation
• Simulation time obtained from NVIDIA Visual Profiler 5.0 (30 executions)
• Intel Core i7-3770, 8 GB RAM, NVIDIA GT640 (GK104), Ubuntu 12.04 64-bit.

Distributed Simulation
• Average time over 15 simulations of each case
• Intel Core2Quad Q8200 2.33 GHz, 4 GB RAM, Ubuntu 12.04 64-bit.



Simulation Time

Data collected

Distributed Simulation
• Maximum standard deviation: 1.9% for H^{⊗18}
• H^{⊗19} and H^{⊗20} would require approximately 1 and 5 hours, respectively

Figure: Simulation times for the experiments


Discussion

Analyzing the Data

Speedups
• ≈ 550× vs. a 1-core simulation
• ≈ 85× vs. an 8-core simulation



Main Contributions

Contributions

Main achievements:
• A significant boost in simulation performance was obtained through GPU acceleration
• H^{⊗19} and H^{⊗20} are now supported by the VPE-qGM

Constraints:
• Exponential increase in the simulation time: H^{⊗25} ≈ 1.26 · 10^6 seconds
• Even with more efficient computation, simulation is limited to 25 qubits: ≈ 1 GB of RAM would be required for the state vector

Upsides
• Great boost in the simulation
• Only 4 months of work
• Many possible improvements

Downsides
• GPU’s potential not fully used
• Outperformed by other simulators
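The memory constraint on the state vector can be estimated with a short calculation. The slides do not say how the ≈ 1 GB figure was derived, so the parameters below (bytes per amplitude, number of buffers) are assumptions chosen only to show how the requirement scales with the number of qubits.

```python
MIB = 1024 ** 2

def state_vector_bytes(q, bytes_per_amplitude=8, buffers=1):
    """Bytes needed for 2^q complex amplitudes.

    bytes_per_amplitude: 8 for complex64, 16 for complex128 (assumed choices).
    buffers: e.g. 2 if separate read and write vectors are kept (assumption).
    """
    return (2 ** q) * bytes_per_amplitude * buffers

for q in (20, 25, 30):
    single = state_vector_bytes(q) / MIB
    double = state_vector_bytes(q, bytes_per_amplitude=16, buffers=2) / MIB
    print(f"q={q}: {single:.0f} MiB (one complex64 vector), "
          f"{double:.0f} MiB (two complex128 vectors)")
```

Whatever the exact parameters, each extra qubit doubles the requirement, so memory becomes the limiting factor within a few more qubits.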



Future Work

Next Challenges and our Ultimate Goal

Future work:
• Improve the kernel: a new algorithm is already in development
• Extend support for controlled quantum transformations
• Optimize use of all the GPU’s resources
• Big obstacle: a suitable approach for efficient storage/representation of the state vector.

How is our project different from other solutions?
• Provides graphical interfaces for the simulation
• Open source code
• Integration between accelerator and optimized algorithm
• Applies optimizations for both transformations and state vector storage
• Will consolidate support for simulation on a cluster with GPUs



Acknowledgements

Acknowledgements:
• FAPERGS Agency
• GreenGrid Project (PRONEX FAPERGS/CNPq)
• CAPES

Contacts
• [email protected]
• [email protected]
• [email protected]

13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2013)



Further questions

Matrix computation is used. Why don’t we use an existing package?
• To optimize the execution of controlled gates by identifying unnecessary computation (vectors that do not change any amplitudes)
• Current solutions for the Kronecker product on GPUs apply only to a small number of matrices
• We want dynamic generation of the elements associated with the resulting matrix, avoiding the storage of an exponential number of elements




Data copy using PyCUDA

Data in the constant memory:
• At the kernel:
    __device__ __constant__ dType dadosGPU[tamanho];
• Get the GPU memory address:
    endGPU = kernel.get_global('dadosGPU')[0]
• host ⇒ device data copy:
    pycuda.driver.memcpy_htod(endGPU, objetoNumPy)

Global memory data:
• readMemory = numpy.array(numpy.zeros(2**q), dtype=numpy.complex64, order='C')
• readMemory_gpu = gpuarray.to_gpu(readMemory)
• writeMemory_gpu = gpuarray.zeros(2**q, dtype=numpy.complex64, order='C')

Condition for such operations:
• All data must be stored as objects of the NumPy library
• Advantage: data movement is simplified

