C++ and Python support for the CUDA Quantum programming model for heterogeneous quantum-classical workflows

Home Page: https://nvidia.github.io/cuda-quantum/

License: Other


cuda-quantum's Introduction

Welcome to the CUDA-Q repository



The CUDA-Q Platform for hybrid quantum-classical computers enables integration and programming of quantum processing units (QPUs), GPUs, and CPUs in one system. This repository contains the source code for all C++ and Python tools provided by the CUDA-Q toolkit, including the nvq++ compiler, the CUDA-Q runtime, as well as a selection of integrated CPU and GPU backends for rapid application development and testing.

Getting Started

To learn more about how to work with CUDA-Q, please take a look at the CUDA-Q Documentation. The page also contains installation instructions for officially released packages.

If you would like to install the latest iteration under development in this repository and/or add your own modifications, take a look at the latest packages deployed on the GitHub Container Registry. For more information about building CUDA-Q from source, see these instructions.

Contributing

There are many ways in which you can get involved with CUDA-Q. If you are interested in developing quantum applications with CUDA-Q, this repository is a great place to get started! For more information about contributing to the CUDA-Q platform, please take a look at Contributing.md.

License

The code in this repository is licensed under Apache License 2.0.

Contributing a pull request to this repository requires accepting the Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. A CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately. Simply follow the instructions provided by the bot. You will only need to do this once.

Feedback

Please let us know your feedback and ideas for the CUDA-Q platform in the Discussions tab of this repository, or file an issue. To report security concerns or Code of Conduct violations, please reach out to [email protected].

cuda-quantum's People

Contributors

1tnguyen, abhiram6121, actione, amccaskey, annagrin, anthony-santana, bettinaheim, bmhowe23, boschmitt, fabianmcg, gistbatch, ikkoham, jfriel-oqc, jjacobelli, johanneskuhlmann, justinlietz, khalatepradnya, kukushechkin, marwafar, mmvandieren, omar-orca, orclassiq, owen-oqc, poojarao8, sacpis, schweitzpgi, splch, tlubowe, yaraslaut, zohimchandani


cuda-quantum's Issues

[RFC] Finish global control-flow statement support

Overview

Most of the support for C++'s break, continue, and return statements is already present in the compiler, but it is not complete. The first two statements can appear in C++ iterative statements. The last, return, can appear in a function.

Support for these statements is complete to the point that there are high-level operations in the CC dialect and a pass (lower-unwind) that converts these operations to the correct control flow. That control flow will contain paths that may deallocate, destruct, or uncompute both classical and quantum data; at present, these paths only automatically create quantum deallocations.

Passes other than lower-unwind should not rewrite the high-level CC global control-flow operations into more primitive control flow in the function, even as a copy-paste of functionality from the lower-unwind pass; doing so would be unnecessary and poor compiler design.

Known Issues

Deallocation pass

At present the pass to deallocate quantum allocations is naive. It assumes that a quantum kernel is straight-line code. It finds all quantum allocations and inserts deallocations for them at the end of the function. This is incorrect and will not work with the control-flow graph rewrite of the global control-flow statements mentioned above.

It is therefore a requirement that the deallocation pass be able to find only those cases where deallocations are needed and insert the deallocations only where needed.
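The intended behavior can be sketched on a toy instruction list in Python (the op names here are made up for illustration; the real pass operates on Quake/CC IR): a deallocation is emitted for every live allocation on each path reaching an exit, not just once at the end of the function.

```python
def insert_deallocs(ops):
    """Toy dealloc pass: insert 'dealloc qN' before each 'ret' for every
    allocation live at that point (sketch: all allocations seen so far),
    rather than appending deallocs once at the end of the function."""
    out, live = [], []
    for op in ops:
        if op.startswith("alloc "):
            live.append(op.split()[1])
            out.append(op)
        elif op == "ret":
            # deallocate in reverse allocation order before exiting
            out.extend(f"dealloc {q}" for q in reversed(live))
            out.append(op)
        else:
            out.append(op)
    return out

# a kernel with an early return path and a normal return path
body = ["alloc q0", "h q0", "ret", "x q0", "ret"]
print(insert_deallocs(body))
```

Both exits receive a deallocation, which the current straight-line pass would miss.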

Documentation

The documentation should be updated to state that the compiler supports break, continue, and return statements.

Possible Issue: Quake to QTX

It is unknown if the Quake to QTX pass correctly handles complex CFGs with deallocations.

Additional Tasks

Once the deallocation pass is rewritten, the lower-unwind pass needs to be added to the pipeline(s) in the various tools.

End-to-end tests, particularly those that test the new deallocation pass, must be written.

Build issues with cuQuantum SDK v23.03 installed via Ubuntu apt-get

Required prerequisites

  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

When installing cuquantum via apt-get (e.g., sudo apt-get -y install cuquantum) as described here, with the latest cuQuantum SDK v23.03, the install location is different from the previous version:

  • custatevec.h is in /usr/include
  • libcustatevec_static.a is in /usr/lib/x86_64-linux-gnu

This is incompatible with the CMake script of the custatevec runtime backend.

Steps to reproduce the bug

  • Install the latest cuquantum (v23.03) via Ubuntu deb repository (apt-get)
  • Attempt to build the custatevec backend by passing -DCUSTATEVEC_ROOT to cmake (e.g., -DCUSTATEVEC_ROOT=/usr). The build fails with errors like:

No rule to make target '/usr/lib64/libcustatevec_static.a', needed by 'lib/libnvqir-custatevec.so'. Stop.

Expected behavior

The CMake script should be compatible with the latest cuQuantum SDK installed via apt-get.

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA Quantum version: main
  • Python version: 3.8.10
  • C++ compiler: gcc-10
  • Operating system: Ubuntu 20.04

Suggestions

CMake's find_library seems to work fine in this case: it searches the default system library paths, e.g., /usr/lib/x86_64-linux-gnu, for the file.

For example, I guess something like this would work for both the old (/opt/nvidia/cuquantum) and the new installation layouts:

find_library(CUSTATEVEC_STATIC_LIB
  NAMES custatevec_static
  HINTS ${CUSTATEVEC_ROOT}/${CUSTATEVEC_LIBDIR})

[RFC] CUDA Quantum Applications: Chemistry

Background

The framework is at a stable point where we can begin to think about applications that build on top of the CUDA Quantum primitives / programming model. A good first use case for this is in quantum chemistry. We have use cases where we need to compute ground states for prototypical molecules via something like the variational quantum eigensolver. But this could evolve into future algorithms that are not variational.

Problem

We lack support for common data structures that would make it easier for programmers to write quantum chemistry application code.

Proposal

We should create a set of data structures that allow our users to describe a molecular system at a high level and generate the spin_op data type necessary for integration with cudaq::observe(). I propose the following structures:

namespace cudaq {

struct atom {
  const std::string name;
  const std::vector<double> coordinates;
};

class molecular_geometry {
private:
  std::vector<atom> atoms;

public:
  molecular_geometry(std::initializer_list<atom> &&args);
  std::size_t size() const;
  auto begin();
  auto end();
  auto begin() const;
  auto end() const;
};

class one_body_integrals {
private:
  std::unique_ptr<double[]> data;

public:
  std::vector<std::size_t> shape;
  one_body_integrals(const std::vector<std::size_t> &shape);
  double &operator()(std::size_t i, std::size_t j);
  void dump();
};

class two_body_integrals {
private:
  std::unique_ptr<double[]> data;

public:
  std::vector<std::size_t> shape;
  two_body_integrals(const std::vector<std::size_t> &shape);
  double &operator()(std::size_t p, std::size_t q, std::size_t r,
                     std::size_t s);
  void dump();
};

struct molecular_hamiltonian {
  spin_op hamiltonian;
  one_body_integrals one_body;
  two_body_integrals two_body;
  std::size_t n_electrons;
  std::size_t n_orbitals;
  double nuclear_repulsion;
  double hf_energy;
};

molecular_hamiltonian create_molecule(const molecular_geometry &geometry,
                                      const std::string &basis,
                                      int multiplicity, int charge);

} // namespace cudaq

Users would then create molecules in the following manner, with access to all the usual data:

cudaq::molecular_geometry geometry{{"H", {0., 0., 0.}},
                                   {"H", {0., 0., .7474}}};
auto molecule = cudaq::create_molecule(geometry, "sto-3g", 1, 0);

// Print out the cudaq::spin_op and the integrals
molecule.hamiltonian.dump();
molecule.one_body.dump();
molecule.two_body.dump();

// Can get other useful information
auto nElectrons = molecule.n_electrons;
auto nOrbitals = molecule.n_orbitals;
auto nuclearRepulsion = molecule.nuclear_repulsion;
auto hartreeFockEnergy = molecule.hf_energy;
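A Python analogue of the proposed geometry structures can help make the shape of the API concrete. The following is a standalone sketch (the class names mirror the C++ proposal but are hypothetical; it does not correspond to any shipped cudaq Python API):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Atom:
    """Mirror of the proposed cudaq::atom: element name + coordinates."""
    name: str
    coordinates: List[float]

@dataclass
class MolecularGeometry:
    """Mirror of the proposed cudaq::molecular_geometry container."""
    atoms: List[Atom] = field(default_factory=list)

    def __len__(self):
        return len(self.atoms)

    def __iter__(self):
        return iter(self.atoms)

# H2 at the equilibrium bond length used in the example above
geometry = MolecularGeometry([
    Atom("H", [0.0, 0.0, 0.0]),
    Atom("H", [0.0, 0.0, 0.7474]),
])
print(len(geometry), [a.name for a in geometry])
```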

Dependencies

Thus far I've described the public API for creating molecular Hamiltonians in CUDA Quantum C++. An initial implementation should implement create_molecule in a manner that is extensible with respect to the third-party library used to generate the Hamiltonian. A first implementation could use the pybind11 embedded interpreter to invoke pyscf.

We will also need a library that provides a getter / setter for general tensor data with runtime-known shape. For this I propose that we pull in the https://github.com/xtensor-stack/xtensor header-only library as a third-party dependency. It is BSD-3-Clause licensed.
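The runtime-shaped tensor requirement amounts to row-major strided indexing; a minimal Python sketch (illustrative only — the actual proposal is to use xtensor in C++, not anything like this):

```python
class Tensor:
    """Minimal dense row-major tensor with a runtime-known shape,
    sketching the getter/setter the integrals classes need."""
    def __init__(self, shape):
        self.shape = list(shape)
        # row-major strides: last dimension is contiguous
        self.strides = [1] * len(shape)
        for i in range(len(shape) - 2, -1, -1):
            self.strides[i] = self.strides[i + 1] * shape[i + 1]
        n = 1
        for s in shape:
            n *= s
        self.data = [0.0] * n

    def _flat(self, idx):
        return sum(i * s for i, s in zip(idx, self.strides))

    def __getitem__(self, idx):
        return self.data[self._flat(idx)]

    def __setitem__(self, idx, value):
        self.data[self._flat(idx)] = value

t = Tensor((2, 3, 4))
t[1, 2, 3] = 7.5
print(t.strides, t[1, 2, 3])
```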

[RFC] QIR and opaque pointers

Overview

The QIR specification uses typed pointers to opaque struct types, such as %Qubit*. These typed pointers may be in use in target backends and their translators.

As of LLVM 17, typed pointers are no longer supported. All pointers must be "opaque"; that is, they no longer carry any annotation of the pointed-to object's type.

CUDA Quantum will need some resolution on this situation in order to remain current with the tip of LLVM/Clang/MLIR development.

Bug with getting expected value from observe_result with no shots provided

contrived example

import numpy as np
import cudaq

op = cudaq.SpinOperator()
for i in range(2):
    op += cudaq.spin.z(i)
print(op)

kernel = cudaq.make_kernel()
q = kernel.qalloc(2)
kernel.rx(np.pi/2., q[0])
kernel.ry(np.pi, q[1])

result = cudaq.observe(kernel, op)
e = result.expectation_z(cudaq.spin.z(1))
print(e)
assert e == -1 # should at least assert it's close to -1

The above does not work unless you set the shots_count kwarg.
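Independently of the bug, the expected value is easy to check by hand: cudaq.spin.z(1) only involves qubit 1, and after ry(pi) that qubit is in |1>, so the expectation should be -1 regardless of what happens on qubit 0. A pure-Python single-qubit check:

```python
import math

def ry(theta, state):
    """Apply the real rotation ry(theta) to a single-qubit state [a, b]."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    a, b = state
    return [c * a - s * b, s * a + c * b]

q1 = ry(math.pi, [1.0, 0.0])            # |0> -> |1>
exp_z = abs(q1[0])**2 - abs(q1[1])**2   # <Z> = |a|^2 - |b|^2
print(exp_z)                            # approximately -1
```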

Build issues with cuQuantum installed via tarball

Required prerequisites

  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

Tarball-based installation of cuQuantum provides multiple sets of libraries, for CUDA 11 and CUDA 12, placing the cuStateVec libraries under /opt/nvidia/cuquantum/lib/11/ and /opt/nvidia/cuquantum/lib/12/. Currently CMake looks directly under lib/ and fails with the following error:

ninja: error: '/opt/nvidia/cuquantum/lib/libcustatevec_static.a', needed by 'lib/libnvqir-custatevec.so', missing and no known rule to make it

Steps to reproduce the bug

  • Clone and build LLVM 16
  • Download cuQuantum tarball and (conveniently) extract it under /opt/nvidia/cuquantum
  • Run the build script build_cudaq.sh with: LLVM_INSTALL_PREFIX=/path/to/llvm/repository/build_16 bash scripts/build_cudaq.sh

The build will fail and errors can be checked in the logs: /home/hamidelmaazouz/repositories/nvidia/cuda-quantum/build/logs/cmake_error.txt

Expected behavior

The build should account for tarball-based installations of cuQuantum.

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA Quantum version:
  • Python version:
  • C++ compiler:
  • Operating system:

Suggestions

Some options to work around the issue:

  • Option 1: simply move the relevant libraries directly under /opt/nvidia/cuquantum/lib. This solution is hacky, cumbersome, and error-prone: it clutters the tarball installation, and which libraries are relevant depends on which CUDA version is installed on the system.
  • Option 2: Adapt the build script build_cudaq.sh and runtime/nvqir/custatevec/CMakeLists.txt to systematically link against the correct version of cuStateVec depending on which CUDA version is installed on the system.

RuntimeError for MGMN backend

The multi-GPU multi-node backend from the docs here gives the following error:

cudaq.set_qpu('cuquantum_mgmn')

RuntimeError: Invalid qpu name: cuquantum_mgmn

Enable docs build as part of CI

Each CI run should also produce the CUDA Quantum docs as an artifact.
The pipeline should fail if there are any warnings or errors during the docs build.
Note that right now we suppress the warning for missing documentation, so the pipeline will not yet detect this automatically; more updates are needed to re-enable this warning.

[RFC] Merge quantum allocations

Required prerequisites

  • Search the issue tracker to check if your feature has already been mentioned or rejected in other issues.

Describe the feature

Overview

Allocation of qubits and vectors of qubits is done in the bridge when walking the AST. This can result in many different allocations of various sizes.

One way to reduce the number of these allocations is to combine them into one large vector and select qubits out of that vector. Relatedly, allocations should be moved as close to the entry block as possible so that they can be ganged together in this vector. [Not all allocations can be moved to the entry block; in particular, they may have a runtime-computed size.]

Aggregating the allocations may make determining the number of qubits needed, fusing qubits over distinct live ranges, and other optimizations easier or faster.
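The transformation can be pictured as concatenating the allocations into one register and rewriting each use as an offset into it. A small Python sketch of the bookkeeping (illustrative; the real rewrite happens on Quake IR):

```python
def merge_allocations(sizes):
    """Given per-allocation qubit counts, return the merged register size
    and, for each original allocation, its offset into the merged register."""
    offsets, total = [], 0
    for n in sizes:
        offsets.append(total)
        total += n
    return total, offsets

# three separate allocations of sizes 2, 3, and 1 become one 6-qubit register;
# local qubit j of allocation i maps to merged qubit offsets[i] + j
total, offsets = merge_allocations([2, 3, 1])
print(total, offsets)
```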

[RFC] Add for loop capability to kernel_builder

Background

To further enable programmability via the kernel_builder, we should add the ability for one to specify a for loop within the Quake code being constructed.

What would this look like

In C++:

auto [circuit, inSize] = cudaq::make_kernel<std::size_t>();
auto qubits = circuit.qalloc(inSize);
circuit.h(qubits[0]);
circuit.for_loop(0, inSize - 1, [&](auto &index) {
  circuit.x<cudaq::ctrl>(qubits[index], qubits[index + 1]);
});

And in Python:

circuit, inSize = cudaq.make_kernel(int)
qubits = circuit.qalloc(inSize)
circuit.h(qubits[0])
circuit.for_loop(0, inSize - 1, lambda index: circuit.cx(qubits[index], qubits[index + 1]))
counts = cudaq.sample(circuit, 5)
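For a concrete register size, the loop simply unrolls into a GHZ-style gate sequence. A plain-Python sketch of the ops the builder would record (the op tuples are illustrative, not the actual Quake output):

```python
def build_ghz_ops(n):
    """Gate sequence the proposed for_loop would produce for a size-n
    register: H on qubit 0, then a CX chain down the register."""
    ops = [("h", 0)]
    for i in range(n - 1):
        ops.append(("cx", i, i + 1))
    return ops

print(build_ghz_ops(5))
```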

llvm-project missing CMakeLists.txt file error (following Build CUDA Quantum from Source page)

Required prerequisites

  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

Following the "Get LLVM / Clang / MLIR" section of CUDA Quantum Open Beta Installation page, when i run the following Cmake, i get a "project does not appear to contain CMakeLists.txt" error.

Can a CmakeLists.txt file be provided?

Command and error below:

  • /workspace/cuda-quantum/llvm-project/build (master) $ cmake .. -G Ninja -DLLVM_TARGETS_TO_BUILD="host" -DCMAKE_INSTALL_PREFIX=/opt/llvm/ -DLLVM_ENABLE_PROJECTS="clang;mlir" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_INSTALL_UTILS=TRUE
    CMake Warning:
    Ignoring extra path from command line:

    ".."

CMake Error: The source directory "/workspace/cuda-quantum/llvm-project" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.

(see below 'Steps ...' section for more info)

Any help is appreciated!

Steps to reproduce the bug


  • /workspace/cuda-quantum/llvm-project/build (master) $ cmake .. -G Ninja -DLLVM_TARGETS_TO_BUILD="host" -DCMAKE_INSTALL_PREFIX=/opt/llvm/ -DLLVM_ENABLE_PROJECTS="clang;mlir" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_INSTALL_UTILS=TRUE
    CMake Warning:
    Ignoring extra path from command line:

    ".."

CMake Error: The source directory "/workspace/cuda-quantum/llvm-project" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.

other info:

  • ls /workspace/cuda-quantum/llvm-project
    bolt clang-tools-extra CONTRIBUTING.md libc libcxxabi lld llvm-libgcc polly runtimes third-party
    build cmake cross-project-tests libclc libunwind lldb mlir pstl SECURITY.md tpls
    clang compiler-rt flang libcxx LICENSE.TXT llvm openmp README.md src utils

  • prior to the above failed command, I ran these commands and all succeeded:

    git init
    git remote add origin https://github.com/llvm/llvm-project
    git fetch origin --depth=1 c0b45fef155fbe3f17f9a6f99074682c69545488
    git reset --hard FETCH_HEAD

Expected behavior

I expected llvm-project to build successfully, so that I could move on to the next step: "Build CUDA Quantum".

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA Quantum version: cuda-quantum:0.3.0
  • C++ compiler: gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)
  • Operating system: "Ubuntu 22.04.2 LTS"

Suggestions

No response

Support broadcasting

For code snippets like this:

for i in range(data.shape[0]):
    exp_vals = cudaq.observe(kernel, hamiltonian, data[i], shots_count=shots).expectation_z()

observe should automatically broadcast the call over the leading dimension of data rather than requiring a Python for loop, which can be slow.
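At minimum, a broadcast API could behave like the following sketch. The observe function here is a mock standing in for cudaq.observe (a real implementation would batch the executions natively rather than loop):

```python
def observe_broadcast(observe_fn, kernel, hamiltonian, data, **kwargs):
    """Sketch: dispatch one observe call per row of `data` (its leading
    dimension) and collect the expectation values."""
    return [observe_fn(kernel, hamiltonian, row, **kwargs) for row in data]

# mock observe that just returns the sum of its parameters
mock_observe = lambda kernel, ham, params, **kw: sum(params)

data = [[0.1, 0.2], [0.3, 0.4]]
results = observe_broadcast(mock_observe, None, None, data)
print(results)
```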

[RFC] Add the ability to take a cudaq::spin_op to its matrix form

Background

For a lot of problems it would be nice to be able to map a spin_op to its matrix representation. The main use case I foresee is extracting eigenvalues and eigenvectors to compare against results coming from the quantum computer.

Proposal

Add the complex_matrix type. This type should hold the raw data for a matrix with complex elements. It should hide away implementation details for the matrix (so one could use Eigen but not expose Eigen to the public cudaq API).

Add complex_matrix cudaq::spin_op::to_matrix(). This should compute the matrix elements based on the current Pauli string information.
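The computation itself is a sum of Kronecker products of 2x2 Pauli matrices. A dependency-free Python sketch for Z0 + Z1 on two qubits (illustrative of what to_matrix would compute, not of the proposed C++ implementation):

```python
# Pauli matrices as nested lists of numbers
I = [[1, 0], [0, 1]]
Z = [[1, 0], [0, -1]]

def kron(A, B):
    """Kronecker product of two square matrices (nested lists)."""
    n, m = len(A), len(B)
    return [[A[i // m][j // m] * B[i % m][j % m]
             for j in range(n * m)] for i in range(n * m)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Z0 + Z1 on two qubits: Z (x) I + I (x) Z
H = add(kron(Z, I), kron(I, Z))
print([H[k][k] for k in range(4)])  # diagonal: [2, 0, 0, -2]
```

The eigenvalues of this diagonal matrix are exactly the measurable energies, which is the comparison use case described above.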

[RFC] Using operator! on pure-device kernels, etc.

Overview

The negation operator is implemented for many of the quantum primitive operations, but it is not available for the higher-order function cudaq::control or for user-defined pure-device kernel arguments.

intrinsic

  // syntactic sugar for using a negated qubit, q0, as a control.
  x<cudaq::ctrl>(!q0, q1);

At the heart of the issue is that, in full generality, a qubit may be both a control and a target in a quantum kernel. However, operator! should only be allowed on qubits used as controls.

pure-device kernel

  struct Kernel {
    // unknown how this kernel will use q0, q1, and q2.
    template <...>
    void operator()(cudaq::qubit &q0, cudaq::qubit &q1, cudaq::qubit &q2) __qpu__;
  };

In the case of these pure-device kernels, we propose that the syntactic sugar, operator!, not be allowed.

Constrained Autogeneration: cudaq::control

In the case of the cudaq::control high-level function, the user provides a list of "known controls" as part of the syntax. This case resembles the intrinsic case in that some arguments are known to be controls. The rest of the arguments are passed to a kernel and may or may not be controls. The proposal is to allow the operator! syntactic sugar for the qubits that are in the "must be controls" positions of this high-level function.

cudaq::control

  cudaq::control(kernel, /*controls=*/ {q0, !q1, !q2}, /*may not use ! for trailing args*/ q3, q4, q5);
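One common way to realize a negated control is to desugar it into X-conjugation of the control qubit: flip it, apply the controlled operation, and flip it back. A Python sketch of that emission (the op tuples are illustrative, not the actual compiler output):

```python
def controlled_x(ops, controls, target):
    """Emit a multi-controlled X. Each control is a (qubit, positive) pair;
    a negated control (!q) is desugared by conjugating with X gates so the
    operation fires when that qubit is |0>."""
    negated = [q for q, positive in controls if not positive]
    for q in negated:
        ops.append(("x", q))
    ops.append(("cx", tuple(q for q, _ in controls), target))
    for q in negated:
        ops.append(("x", q))
    return ops

# x<cudaq::ctrl>(!q0, q1): fires when q0 is |0>
print(controlled_x([], [(0, False)], 1))
```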

Downstream CMake integration

Required prerequisites

  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

I'm not sure if this behavior is intended or not. CMake downstream integration as described in the docs seems rather wonky: it seems impossible to add compilation flags to nvq++ from CMake, and mixing regular C++ code with CUDA Quantum code also seems to produce issues.

Steps to reproduce the bug

Suppose the following directory structure:
project
├─cudaq_code
│ ├─src
│ └─CMakeLists.txt
├─cpp_code
│ ├─src
│ └─CMakeLists.txt
└─CMakeLists.txt
with
project:CMakeLists.txt:

cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(CudaQuantumSimulator VERSION 0.1.0 LANGUAGES CXX) 
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED on)
set(CMAKE_POSITION_INDEPENDENT_CODE TRUE)
set(BACKEND_NAME MySimulator)

find_package(CUDAQ REQUIRED)
find_package(NVQIR REQUIRED)

add_subdirectory(src)
add_subdirectory(examples)

cpp_code:CMakeLists.txt

nvqir_add_backend(${BACKEND_NAME} MySimulator.cpp)
target_link_libraries(nvqir-${BACKEND_NAME} PRIVATE some::lib) # <- target name gets overwritten

cudaq_code:CMakeLists.txt

set(CUDAQ_COMPILER_FLAGS "${CUDAQ_COMPILER_FLAGS} --qpu MySimulator") # <- this does not work 
set(CUDAQ_EXTRA_COMPILER_FLAGS "${CUDAQ_EXTRA_COMPILER_FLAGS} --qpu MySimulator") # <- this does not work 
add_executable(ghz_exe Example.cpp)
add_dependencies(ghz_exe nvqir-${BACKEND_NAME})

By default, both directories are compiled with nvq++ in this setting, which can result in errors when building regular .cpp files.

Expected behavior

There is a workaround for the compiler-selection issue: one can explicitly set which files are compiled with nvq++ by changing the following
project:CMakeLists.txt:

cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(CudaQuantumSimulator VERSION 0.1.0 LANGUAGES CXX CUDAQ) # <- add language 
...

and adding
cpp_code:CMakeLists.txt

SET_SOURCE_FILES_PROPERTIES(MySimulator.cpp PROPERTIES LANGUAGE CXX)

cudaq_code:CMakeLists.txt

SET_SOURCE_FILES_PROPERTIES(Example.cpp PROPERTIES LANGUAGE CUDAQ)

But compiler definitions are still disregarded by nvq++.
Expected compile command:
$HOME/.cudaq/bin/nvq++ --cmake-host-compiler /usr/bin/c++ -c <src>/cudaq_code/Example.cpp -o cudaq_code/CMakeFiles/ghz_exe.dir/Example.cpp.o --qpu MySimulator
Actual command:
$HOME/.cudaq/bin/nvq++ --cmake-host-compiler /usr/bin/c++ -c <src>/cudaq_code/Example.cpp -o cudaq_code/CMakeFiles/ghz_exe.dir/Example.cpp.o

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA Quantum version: 0.3.0
  • Python version: 3.10.6
  • C++ compiler: clang14.0.0
  • Operating system: Ubuntu 22.04

Suggestions

No response

[RFC] Common Kernels

Background

It would be nice to have a "standard library" of common quantum kernels in both C++ and Python. This set of common kernels will be something that grows over time.

Python

In Python, I envision something like this:

import numpy as np
import cudaq
from cudaq import spin, kernels
circuit = cudaq.make_kernel()
q = circuit.qalloc(2)
# synthesize exp(i theta H)
kernels.trotter(circuit, spin.x(0) * spin.x(1), np.pi/2., q)
print(circuit)
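For a single product term such as X0X1, the synthesized circuit is typically the textbook pattern: change X into the Z basis with Hadamards, entangle with CX, rotate with Rz, then undo. A hedged sketch of the resulting gate list (sign and ordering conventions vary between libraries; this is not the actual kernels.trotter output):

```python
def trotter_xx(theta, q0, q1):
    """Textbook circuit for exp(-i*theta*X0X1): H on both qubits to rotate
    X into the Z basis, a CX ladder with Rz(2*theta) in the middle, then
    undo the basis change. (Sketch only; conventions vary.)"""
    return [("h", q0), ("h", q1),
            ("cx", q0, q1), ("rz", 2 * theta, q1), ("cx", q0, q1),
            ("h", q0), ("h", q1)]

ops = trotter_xx(0.5, 0, 1)
print(ops)
```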

C++

In C++ I envision something like this

#include "cudaq/kernels/so4.h"
int main() {
  auto kernel = [](std::vector<double> params) __qpu__ {
    cudaq::qreg q(2);
    cudaq::so4(params, q);
  };
  // -- or --
  auto [circuit, params] = cudaq::make_kernel<std::vector<double>>();
  auto qubits = circuit.qalloc(2);
  cudaq::so4(circuit, params, qubits); 
  ... 
  circuit(concreteParams);
}

ExecutionManager Needs Updates for Handling Qudits

Background

The CUDA Quantum language specification allows one to program at the general qudit level. Most work so far has been at the qudit<2> == qubit level. This has influenced some of the function names / functionality for the cudaq::ExecutionManager. We need to do a little bit of refactoring to better accommodate backends and simulators that allow general qudit manipulation.

The main thing that needs updating in the ExecutionManager is that the apply method identifies controls and targets only by their integer index (std::size_t). This is fine when we assume qubits are the primary unit of quantum information, but if we want ExecutionManager subtypes for qudit simulations, there is no way to obtain the number of levels of a qudit.

The proposal here is to add a new type, QuditInfo, that encapsulates the qudit index and the number of levels, and to update the ExecutionManager API to use this type for describing control and target qudits.
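A Python sketch of the proposed shape of QuditInfo and the updated apply signature (the names follow the proposal; the tuple encoding returned here is purely illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuditInfo:
    """Sketch of the proposed type: a qudit is identified by its number
    of levels plus its index, so simulators are not restricted to qubits."""
    levels: int
    index: int

def apply(gate, controls, targets):
    """Sketch of the updated ExecutionManager::apply signature, taking
    QuditInfo values rather than bare size_t indices."""
    return (gate,
            [(c.levels, c.index) for c in controls],
            [(t.levels, t.index) for t in targets])

q0, q1 = QuditInfo(3, 0), QuditInfo(3, 1)  # two qutrits
print(apply("csum", [q0], [q1]))
```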

Also, the class has methods named returnQubit and resetQubit. These should be updated to returnQudit and resetQudit.

Testing

To adequately test this, we need to implement an ExecutionManager with a very simple qudit simulation capability. Then we can implement our qudit instruction set (functions that make calls to the current ExecutionManager) and write a CUDA Quantum kernel that invokes it.

Future

In the future we should provide a concrete qudit simulator that implements the ExecutionManager API.

We also need to refactor how the sample_result type works. This is being discussed in the next CUDA Quantum language specification meeting. Basically, instead of the sample_result storing binary results, it needs to be able to store general integer results in order to accommodate cudaq::sample for general qudit systems.

Internal issue 381

__qpu__ void heisenbergU(qoda::qreg<> &q) {
  auto nQubits = q.size();
  ... 
}

struct ctrlHeisenberg {
  void operator()(int nQubits) __qpu__ {
    qoda::qubit ctrl;
    qoda::qreg q(nQubits);
    qoda::control(heisenbergU, {ctrl}, q);
    mz(q);
  }
};

[RFC] Changes to qspan

Background

The qspan type has thus far represented a non-owning sub-view of an existing container of qubits. It is typically used as an argument to a pure-device kernel, as well as the return type of existing qreg / qvector slicing and sub-register extraction. qspan is meant to model std::span, but that type may be unfamiliar to the current quantum computing community, and std::span is meant to provide a mutable (non-owning) view of an existing container, which is not what we want here with qubits.

Proposal

We propose to change the qspan name to qview to better model it as a non-owning view on an existing vector or array of qubits.

Backends

Required prerequisites

  • Search the issue tracker to check if your feature has already been mentioned or rejected in other issues.

Describe the feature

Currently, I have access to the following backends:

cudaq.list_qpus()

['tensornet', 'custatevec', 'dm', 'qpp', 'cuquantum_mgpu', 'custatevec_f32']

The backends that show up aren't all documented in the docs here.

Moreover, we also have cudaq.set_platform('mqpu'), for which the syntax changes from set_qpu to set_platform.

I believe we should have one syntax for all; set_backend makes the most sense, as it covers both GPUs and QPUs.

[Language Specification] Changes to qreg

Background

It has recently been noted that the name qreg does not convey its underlying semantics well. When we talk about registers classically in C++, the term often carries an implicit endianness, so programmers may expect qubits to be ordered in a specific way when using a qreg. Under the hood, qreg (as currently described in the spec) defines an owning container of qubits that can be sized at runtime (std::vector-like) or at compile time (std::array-like). The typical use case leverages qreg with its vector-like semantics, and in that case the ordering of the qubits is the opposite of what one might expect from reading the name qreg.

Proposal (accepted)

We should update the specification with two new types: cudaq::qvector and cudaq::qarray, which will split apart the existing cudaq::qreg type into these two separate types. qvector will provide a runtime-dynamic owning container of qudits and the qarray will provide a compile-time owning container of qudits.

We will retain qreg in the specification document for a short time, but mark it as deprecated. After this short time (before a stable version 1.0), we will remove qreg from the document.

[RFC] Library mode and AOT vs JIT Discussion

After the recent release, it is a good time to revisit some foundational aspects of this effort so that we can continue to evolve and improve the design of the platform.

Here is a recap of a recent team discussion around the roles of ahead-of-time (AOT) and just-in-time (JIT) compilation, and how library-mode fits in on simulation platforms:

One thing I wanted to touch on was the role of library mode going forward. To recap, library mode is the mode of compilation in nvq++ that skips all generation of MLIR and lowering to QIR. Since we have the CUDA Quantum runtime (and this is critical, as it lets us build up a valid Clang AST from which we can generate the MLIR), we can enable a mode of compilation that simply falls back on Clang and produces a binary linked against the existing runtime shared library. Kernels in such a binary do not execute through generated QIR functions; instead, they queue instructions directly in the library via the ExecutionManager abstraction / extension, and those queued instructions are then applied to a user-specified backend simulator.

This mode of compilation / execution is extremely useful, as the programmer can leverage the full extent of C++ within the bounds of the CUDA Quantum language specification. On simulated architectures, it may even be preferable to the AST Bridge / MLIR mode of compilation, as it is a frictionless approach to quantum programming and should work well due to the runtime nature of quantum circuit synthesis. The drawback is that programmers will need the foreknowledge that compilation and execution under this mode may lead to quantum kernel expressions that are not executable on physical architectures. We covered this extensively in our discussion and noted that as hardware improves, our implementations in the AST Bridge can also continue to improve, enabling more expressiveness in CUDA Quantum kernels that ultimately target physical QPUs.

Another related concept here is the differentiation of AOT and JIT compilation in the context of quantum kernel programming. We discussed the likely requirement that some quantum kernels cannot be fully compiled / optimized ahead of time because qubit counts may not be known until runtime. CUDA Quantum is unique in this respect, in that kernel expressions really represent quantum circuit templates, and these templates, just like C++ templates, must ultimately be specialized / synthesized with runtime-known constants. The CUDA Quantum runtime will need some mechanism for online just-in-time compilation that incorporates the runtime constants known at that point and synthesizes optimal code at runtime that can be executed on the target architecture. There will likely also be opportunities for AOT compilation, and we should keep the platform open for these opportunities.

Action items going forward: pull back on the strict requirement for AST Bridge MLIR generation on simulation platforms, defaulting to library mode in this context. This will enable frictionless quantum programming in C++, with the caveat that some programs may not be executable on physical architectures. We will work to provide an AST Bridge-related preprocessing step that can report warnings / errors to the programmer in this regard. Next, we will continue our compiler work (decompositions, optimizations, etc.) usable from the JIT compilation context. We will validate that output QIR targeting physical architectures is fully specification compliant (Base, Adaptive, etc.).

We will move away from the -qpu flag and instead start using -target, which better encapsulates the target processor architecture as well as the required runtime quantum_platform subtype.

[RFC] CircuitSimulator Refactor

Background

The past month has given us a lot of feedback / requirements on the extensible NVQIR CircuitSimulator type. After implementing the MGPU and TensorNet backends, there are a few changes I'd like to propose for the CircuitSimulator type that will make it easier for new simulation contributions later on.

Gate API

CircuitSimulator currently enumerates pure-virtual methods for all operations defined in our MLIR dialects. There really is no need for this, as all simulation backends only require gate matrix data and control and target qubit indices. The first proposal here is to make these non-virtual (i.e. only implemented on CircuitSimulator) so that they need not be implemented by subtypes. But they will still remain for now, so as to minimize changes in the NVQIR driver code. The update will be to provide a protected pure virtual method, applyGate(const GateApplicationTask &), that subtypes will implement to effect evolution of the state in a manner specific to the subtype's simulation strategy. Here, GateApplicationTask will be a struct that contains the matrix, controls, and targets.

We should also add a public method for invoking a custom quantum operation, i.e. one where we only have the matrix data, controls, and targets.
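As a sketch of this design (in Python for brevity; all names here are hypothetical stand-ins for the proposed C++ API), the public gate methods are non-virtual and funnel every invocation, including custom operations, into a single abstract applyGate hook:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class GateApplicationTask:
    matrix: list      # flattened gate matrix data
    controls: list    # control qubit indices
    targets: list     # target qubit indices

class CircuitSimulator(ABC):
    # Public gate methods are non-virtual: subtypes never override them.
    def x(self, target, controls=()):
        # Pauli-X matrix, row-major.
        self._dispatch(GateApplicationTask([0, 1, 1, 0], list(controls), [target]))

    def apply_custom_operation(self, matrix, controls, targets):
        # Custom operation: only matrix data, controls, and targets are needed.
        self._dispatch(GateApplicationTask(list(matrix), list(controls), list(targets)))

    def _dispatch(self, task):
        self.apply_gate(task)

    @abstractmethod
    def apply_gate(self, task):
        """Subtype-specific state evolution."""

class LoggingSimulator(CircuitSimulator):
    """Trivial backend that just records the tasks it receives."""
    def __init__(self):
        self.log = []
    def apply_gate(self, task):
        self.log.append(task)

sim = LoggingSimulator()
sim.x(0)
sim.x(1, controls=[0])
sim.apply_custom_operation([1, 0, 0, 1], [], [0])
```

The point of the single hook is that a new backend only implements apply_gate (plus allocation/measurement), never the per-gate surface.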

Mid-circuit measurement register naming

One bug that has arisen is the fact that the CircuitSimulator was only storing mid-circuit measurement data for circuits that had conditional statements on qubit measurements. This should be an easy thing to fix with an internal private GateApplicationTask queue. Here's the example that currently does not work as expected but will with the introduction of a queue.

auto qubits = entryPoint.qalloc(2);
entryPoint.x(qubits[0]);
entryPoint.mz(qubits[0], "c0");
entryPoint.x(qubits[1]);
entryPoint.mz(qubits[1], "c1");
auto counts = cudaq::sample(entryPoint);

The results here currently do not store the measurement results to c0, c1. By introducing a queue on the CircuitSimulator, and for each quantum operation invocation enqueuing that task, we give ourselves an opportunity to flush the queue at specific points in simulation, like the first mz call above. And at these flush points we are free to sample and persist the results according to the register name given by the programmer.
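A toy model of the proposed queue-and-flush behavior (Python, hypothetical names; qubits are modeled as classical bits so the example stays self-contained):

```python
from collections import deque

class QueueingSimulator:
    """Toy model: qubits are classical bits, X flips, MZ samples.
    Illustrates the proposed gate queue with flush points at measurements."""
    def __init__(self, num_qubits):
        self.state = [0] * num_qubits
        self.queue = deque()      # pending GateApplicationTask-like entries
        self.registers = {}       # register name -> measured bit

    def x(self, qubit):
        self.queue.append(('x', qubit))   # enqueue, do not apply yet

    def mz(self, qubit, register_name):
        self._flush()             # flush point: apply all pending gates
        self.registers[register_name] = self.state[qubit]

    def _flush(self):
        while self.queue:
            op, qubit = self.queue.popleft()
            if op == 'x':
                self.state[qubit] ^= 1

sim = QueueingSimulator(2)
sim.x(0)
sim.mz(0, "c0")   # the queued X is applied before sampling
sim.x(1)
sim.mz(1, "c1")
```

After the two flushes, both named registers hold their measurement results, which is exactly what the current implementation fails to persist.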

Handle One-Time Static Init (e.g. MPI)

Another issue that has arisen is in the case of a simulation strategy that can leverage MPI. In this case, we need to provide some kind of one-time initialization and finalization capability for MPI_Init() and MPI_Finalize() to run. There are a few subtleties here: this can only happen once, and in Python, one could envision someone calling set_qpu(...) multiple times targeting different MPI-enabled backends.

A potential solution to this issue is to have MPI-enabled backends wrap MPI initialization and finalization in conditional statements that check if MPI has been initialized already.
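The guard itself is just an idempotent initialization pattern; a minimal sketch (in Python, with the actual MPI_Initialized()/MPI_Init() calls replaced by a flag, since this only illustrates the control flow):

```python
_mpi_initialized = False

def ensure_mpi_init():
    """Idempotent guard: the real backend would call MPI_Initialized() and,
    only if it returns false, MPI_Init(); here a flag stands in for MPI."""
    global _mpi_initialized
    if _mpi_initialized:
        return False              # already initialized, nothing to do
    _mpi_initialized = True       # real code would call MPI_Init here
    return True

# set_qpu(...) targeting two different MPI-enabled backends in one process:
first_call = ensure_mpi_init()
second_call = ensure_mpi_init()
```

MPI_Finalize() would need a symmetric guard (e.g. registered once via atexit) so repeated backend switches never finalize more than once.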

Proposed CircuitSimulator Structure

Here we show the subtype-pertinent parts of the CircuitSimulator update.

using namespace cudaq;
namespace nvqir {
class CircuitSimulator {
public:
 ... public methods for NVQIR ... 
};

template<typename ScalarType>
class CircuitSimulatorBase : public CircuitSimulator {
  protected:
    struct GateApplicationTask {
      const std::vector<std::complex<ScalarType>> matrix;
      const std::vector<std::size_t> controls;
      const std::vector<std::size_t> targets;
      GateApplicationTask(std::vector<std::complex<ScalarType>> m,
                          std::vector<std::size_t> c,
                          std::vector<std::size_t> t)
        : matrix(std::move(m)), controls(std::move(c)), targets(std::move(t)) {}
    };

    std::queue<GateApplicationTask> gateQueue;

    /// This must be implemented by subclasses to evolve the state
    virtual void applyGate(const GateApplicationTask &task) = 0;

    /// Noise-capable simulators can apply all Kraus channels defined in the
    /// provided noise model on the given operation / qubits
    virtual void applyNoiseChannel(const std::string_view gateName,
                                   const std::vector<std::size_t> &qubits);

    /// Increase the state by 1 qubit
    virtual void addQubitToState() = 0;
    /// Zero-out / clear the current state, takes state back to nQubits = 0
    virtual void resetQubitStateImpl() = 0;

    /// Subclass specific measurement of the specified qubit. 
    virtual bool measureQubit(const std::size_t qubitIdx) = 0;
    
  public:
    /// Public method for invoking a general, custom operation
    void applyCustomOperation(const std::vector<std::complex<ScalarType>> &matrix,
                              const std::vector<std::size_t> &controls,
                              const std::vector<std::size_t> &targets);

    /// Subclasses can implement spin_op observation
    virtual ExecutionResult observe(const cudaq::spin_op &term);

    /// Subclasses can implement state sampling
    virtual ExecutionResult
    sample(const std::vector<std::size_t> &qubitIdxs, const int shots) = 0;
    
    /// Return the name of this simulator
    virtual std::string name() const = 0;

    /// For the Python API, we need the ability to create a
    /// clone of a simulator we currently have a handle on.
    virtual CircuitSimulator *clone() = 0;
};
}

Update C++ standard lib runtime dependency to libstdc++-12

CUDA Quantum has a runtime dependency on the C++ standard library. Right now, this is libstdc++-11-dev.
Outside the distributed Docker image, this will cause issues when, e.g., a newer GCC 12 is present and the standard library resolves to version 12 on the system. While this doesn't fully address the broader issue, it would be good to update this dependency to libstdc++-12.

  • Update the nvq++ compiler to support libstdc++-12
  • Check in CUDA Quantum Docker image and validate it as part of CI
  • Update the runtime dependency in the CUDA Quantum Docker image
  • Add gcc12 to the pipeline and remove gcc11

The broader issue to address is updating CUDA Quantum to use a specific version of the standard library rather than relying on the system setup. This work will be tracked separately and is not covered by this work item.

Fix bug with mid-circuit measurement to a named register

Update the CircuitSimulator to support caching sampling results to a user-specified register name

import cudaq
kernel = cudaq.make_kernel()
qreg = kernel.qalloc(2)
kernel.x(qreg)
kernel.mz(qreg, register_name="test_measurement")
measure_counts = cudaq.sample(kernel, shots_count=10)
measure_counts.register_names() # -> should include test_measurement

Target multiple architectures in parallel

Required prerequisites

  • Search the issue tracker to check if your feature has already been mentioned or rejected in other issues.

Describe the feature

Future workflows will require running on GPUs and QPUs in parallel.

I envision a workflow suggested below to allow error mitigation:

exp_vals = cudaq.observe_n_async(kernel, h, parameters, backend='superconducting_provider')

approx_exp_vals = cudaq.observe_n_async(kernel_clifford_approximation, h, parameters, backend='gpu0')

Both lines should be able to run in parallel via a futures request.
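The futures-based concurrency itself is standard; a minimal Python sketch using concurrent.futures, with a stand-in observe_n function and the backend names taken from the request above (everything else hypothetical), shows the intended shape:

```python
from concurrent.futures import ThreadPoolExecutor

def observe_n(kernel, hamiltonian, parameters, backend):
    # Stand-in for the proposed cudaq.observe_n_async work: one "expectation
    # value" per parameter, tagged with the backend that produced it.
    return [f"{backend}:{p}" for p in parameters]

parameters = [0.1, 0.2]
with ThreadPoolExecutor() as pool:
    # Both submissions run concurrently; each returns a future immediately.
    hw_future = pool.submit(observe_n, "kernel", "h", parameters,
                            "superconducting_provider")
    gpu_future = pool.submit(observe_n, "kernel_clifford_approximation", "h",
                             parameters, "gpu0")
    exp_vals, approx_exp_vals = hw_future.result(), gpu_future.result()
```

The QPU and GPU jobs overlap in time; .result() only synchronizes at the point where both sets of expectation values are needed (e.g. for error mitigation).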

Seg fault after printing the circuit in python

Required prerequisites

  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

After printing the circuit using print(circuit), you get a segmentation fault if you attempt to apply any gates on the qubit register.

Steps to reproduce the bug

import cudaq

circuit = cudaq.make_kernel()
qubitRegister = circuit.qalloc(2)

circuit.x(qubitRegister[0])  
print(circuit)
circuit.x(qubitRegister[0])

Expected behavior

It seems like one should be able to view the circuit and still edit it. It's behaving more like a dump() than a print().

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA Quantum version: A day behind on main (2f820a0)
  • Python version: Python 3.10.6
  • C++ compiler: clang-16
  • Operating system: Ubuntu 22.04.2 LTS

Suggestions

No response

Set up workflow to update documentation automatically

In principle, live publishing only requires setting up a GitHub action that updates the live/docs branch of this repo.
However, there are a couple of things to look into before that, mainly:

  • Update/extend the CI to include generating the documentation.
  • Set up proper versioning of the documentation prior to doing so.

Other considerations:

  • How to push docs fixes for older versions
  • If docs updates are published for the latest main, then docs are simply ahead of officially released packages and no additional steps would be needed for docs publishing as part of package releases. If docs are only published with the package, then this relates e.g. to issue #41.

Add python code formatting to CI

Add the command yapf --style=google -r -i to our CI to automate formatting of all .py files in the repository. This will work similarly to our current clang-format workflow.

Async observe in python causes deallocation bug

When calling observe_async from Python, we have to construct a lambda that gets added to the current QPU execution queue. When this lambda is ultimately executed, it runs on a thread that does not hold the Python GIL. This leads to a bug where a pybind11 type needs to be deallocated but cannot be, because the thread has not acquired the GIL.

[RFC] Decomposition

This is already a lot, but not everything; I'm posting it to kickstart the discussion.

Definition

In quantum compilation, the term "decomposition" is quite overloaded since many techniques and transformations can fit the idea of "breaking down" (transforming) one gate into a sequence of other gates. Hence, it will be necessary for us to pinpoint what we mean by "decomposition" in the scope of the compiler. One tentative definition is the following:

We define "decomposition" as systematically breaking down high-level gates into a series of lower-level ones. A decomposition technique builds this lower-level sequence by applying some construction rule(s). We emphasize "systematic" because it is the characteristic that differentiates decomposition from other transformation techniques. For example, compared to synthesis, decomposition techniques should be faster and produce predictable results, i.e., we know in advance the number of operations and the number of qubits in the result.

Note that this definition still has some loose ends; specifically, it needs more clarity about which gates are considered high-level and low-level. As we will see next, such clarification requires more context.

Role in compilation

While compiling a technology-independent quantum program, high-level operations must often be broken down into lower-level ones. For example, transforming a two-control Z operation into a sequence of CNOTs and Ts:

                                                                  ┌───┐
  ───●────  ──────────────●───────────────────●──────●─────────●──┤ T ├
     │                    │                   │      │         │  └───┘
     │                    │                   │    ┌─┴─┐┌───┐┌─┴─┐┌───┐
  ───●─── = ────●─────────┼─────────●─────────┼────┤ X ├┤ ┴ ├┤ X ├┤ T ├
     │          │         │         │         │    └───┘└───┘└───┘└───┘
   ┌─┴─┐      ┌─┴─┐┌───┐┌─┴─┐┌───┐┌─┴─┐┌───┐┌─┴─┐                 ┌───┐
  ─┤ z ├─   ──┤ X ├┤ ┴ ├┤ X ├┤ T ├┤ X ├┤ ┴ ├┤ X ├─────────────────┤ T ├
   └───┘      └───┘└───┘└───┘└───┘└───┘└───┘└───┘                 └───┘

(Note: ┴ denotes the adjoint of T)

We call the above a decomposition pattern, and applying it to a circuit mechanically breaks down high-level operations into sequences of low-level ones. (This is always the case when breaking down multi-control operations.) However, what is considered high-level (low-level) changes depending on the target device. For example, suppose we want to run a program on different devices. In one of these devices, the CNOT is a primitive operation, while in another, it is not, but CZ is. Hence, for the first device, we need to decompose CZ but not CNOT, while for the second it is the other way around. We could use the following patterns:

CZ decomposition:

  ───●─── = ────────●────────
     │              │
   ┌─┴─┐     ┌───┐┌─┴─┐┌───┐
  ─┤ Z ├─   ─┤ H ├┤ X ├┤ H ├─
   └───┘     └───┘└───┘└───┘

CNOT decomposition:

  ───●─── = ────────●────────
     │              │
   ┌─┴─┐     ┌───┐┌─┴─┐┌───┐
  ─┤ X ├─   ─┤ H ├┤ Z ├┤ H ├─
   └───┘     └───┘└───┘└───┘

Ultimately, the compilation process must implement the program using only the primitive unitary operators supported by the specific quantum device. In this sense, the compiler cannot blindly decompose operations; the process must move towards these primitive operations and, hopefully, not get lost or enter a loop. For example, if we were decomposing a two-control Z operation (CCZ) while targeting the second device, we could first apply the pattern for CCZ and then use the pattern for CZ---meaning that we compose decomposition patterns: by applying one pattern, we create a sequence of gates that might itself be further decomposed. Note that we would enter a loop if we tried to apply both the CZ and CNOT decomposition patterns.
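One way to picture this process: treat each pattern as a rewrite rule and apply rules repeatedly until only device primitives remain, with a depth bound to catch looping pattern sets. A schematic Python sketch (gate names only, no qubit indices; the CCZ expansion below is illustrative, not the exact circuit shown earlier):

```python
# Decomposition patterns as rewrite rules: gate name -> replacement sequence.
# Target: the second device, where CZ is primitive but CNOT is not.
PATTERNS = {
    'ccz': ['cnot', 't', 'cnot', 'tdg', 'cnot', 't', 'cnot', 'tdg', 't'],  # schematic
    'cnot': ['h', 'cz', 'h'],
}
PRIMITIVES = {'h', 't', 'tdg', 'cz'}

def decompose(circuit, patterns, primitives, max_depth=10):
    """Apply patterns until only primitives remain; the depth bound guards
    against looping pattern sets (e.g. CZ->CNOT and CNOT->CZ together)."""
    for _ in range(max_depth):
        if all(gate in primitives for gate in circuit):
            return circuit
        # Rewrite every gate that has a pattern; keep the rest unchanged.
        circuit = [g for gate in circuit for g in patterns.get(gate, [gate])]
    raise RuntimeError("decomposition did not converge (looping patterns?)")

lowered = decompose(['ccz'], PATTERNS, PRIMITIVES)
```

Here CCZ first rewrites into CNOTs and T/T†, and a second pass rewrites each CNOT into H·CZ·H, composing the two patterns exactly as described.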

Before moving on, let's further suppose that H is not a primitive operation in these devices, but Rx is. We would need to transform these Hadamard gates into Rx ones---which, however, is a 1-1 transformation and doesn't fit our decomposition definition because it doesn't "break down" an operation.

Implementation

This RFC discusses two ways of implementing decomposition in our compilation flow. These two ways might not be mutually exclusive alternatives, however, but complementary approaches.

Quake or QTX

First, we should address "where" decomposition must happen in the compilation flow. The compiler provides two dialects to represent quantum programs, Quake and QTX, and both have a similar level of abstraction regarding quantum operations. For example, both can have an X operation with an arbitrary number of controls. The difference lies elsewhere: Quake uses memory (reference) semantics and thus is not an SSA representation, while QTX is---in fact, QTX has a few quirks concerning arrays to keep them in an SSA form. The following example shows this difference:

Quake:

func.func @foo() {
  %q0 = quake.alloca : !quake.qref
  %q1 = quake.alloca : !quake.qref
  quake.h %q0
  quake.x [%q0] %q1
  ...
}

QTX:

qtx.circuit @foo() {
  %w0_0 = qtx.alloca : !qtx.wire
  %w1_0 = qtx.alloca : !qtx.wire
  %w0_1 = qtx.h %w0_0 : !qtx.wire
  %w1_1 = qtx.x [%w0_1] %w1_0 : !qtx.wire
  ...
}

In Quake, the quantum operations do not create new values, so both X and H act on the same qubit reference %q0. If Quake were SSA, this would imply that changing the order of X and H would not affect the result---which is not the case. In QTX, quantum operations return new values for their targets but not for controls; this aligns with the definition of SSA and shows that the order of two operations that share control wires, but not targets, can be changed without affecting the result.

The question remains:

  • Should we decompose in Quake or in QTX? (TLDR: most likely QTX)

Doing decomposition in Quake is more straightforward, as the discussion above demonstrates. Indeed, we would rewrite the operation using a specific decomposition pattern, and we can more or less straightforwardly rely on existing MLIR infrastructure to do it.

Decomposition in QTX is more complicated: a simple pattern such as the "CZ" one wouldn't be problematic since it only uses one target and produces one new value. However, the "CCZ" pattern complicates things: when using this pattern, we will substitute an operation that generates one new value with a sequence of operations that creates three---the controls become targets. This means that when decomposing a CCZ, say %new_t = x [%c0, %c1] %t, we need to change all uses of %c0 and %c1 that come after this operation, but not the ones that come before. (Unfortunately, MLIR does not seem to be built to handle this sort of rewriting, so it requires a bit more work.)

Despite the complications, I think decomposition should still happen in QTX. The reasoning here is that decompositions do not occur in isolation, and we should retain high-level operations until optimization passes are run. These optimization passes are more likely to exist in QTX, since data dependencies are explicit in the program structure there, which gives specific optimizations a great advantage and makes QTX the dialect used for optimization. Therefore, the more likely compilation flow looks like this:

  1. AST to Quake
  2. Quake to QTX
  3. Optimize
  4. Decompose
  5. Optimize more

If we decompose too early, we will hinder the ability to optimize high-level operations, since once they are decomposed, it is less likely that the compiler will be able to completely optimize them out.

Alternatively, we could keep converting between Quake and QTX:

  1. AST to Quake
  2. Quake to QTX
  3. Optimize
  4. QTX to Quake
  5. Decompose
  6. Quake to QTX
  7. Optimize more

To consider this idea, we need to further investigate how costly this conversion is and whether it can scale up. An interesting advantage of doing decomposition in Quake is that we can keep short-circuiting the compiler when optimizations are turned off---since there would be no need to use QTX.

Compiler rewrite patterns or library function

Now we turn to whether we should implement decomposition using compiler rewrite patterns or library functions. I will use the following toy example to guide the discussion:

CUDA Quantum:

void foo() __qpu__ {
  cudaq::qubit c0, c1, t;
  x<cudaq::ctrl>(c0, c1, t);
}

QTX:

qtx.circuit @foo() {
  %c0 = qtx.alloca : !qtx.wire
  %c1 = qtx.alloca : !qtx.wire
  %t = qtx.alloca : !qtx.wire
  %t_0 = x [%c0, %c1] %t : [!qtx.wire, !qtx.wire] !qtx.wire
}

We want to decompose the CCX gate into a sequence of CXs and Ts. The first step could be decomposing CCX into CCZ and H, which would then be followed by the decomposition of CCZ.

  ───●────  ────────●────────
     │              │
     │              │
  ───●─── = ────────●────────
     │              │
   ┌─┴─┐     ┌───┐┌─┴─┐┌───┐
  ─┤ x ├─   ─┤ H ├┤ Z ├┤ H ├─
   └───┘     └───┘└───┘└───┘

Let's look at how a decomposition pass would differ in both cases. For reference, here is how the rewrite pattern and library function would look:

Rewrite pattern:

struct XOpDecomposition : public OpRewritePattern<qtx::XOp> {
  using OpRewritePattern<qtx::XOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(
    qtx::XOp op, PatternRewriter &rewriter) const override {

    Location loc = op->getLoc();
    Value t = op.getTarget();
    t = createOp<qtx::HOp>(rewriter, loc, t);
    t = createOp<qtx::ZOp>(rewriter, loc, op.getControls(), t);
    t = createOp<qtx::HOp>(rewriter, loc, t);
    op.getResult(0).replaceAllUsesWith(t);
    rewriter.eraseOp(op);
    return success();
  }
};

Library function:

void x_decomp(cudaq::qubit &c0,
              cudaq::qubit &c1,
              cudaq::qubit &t) {
  h(t);
  z<cudaq::ctrl>(c0, c1, t);
  h(t);
}

(Note: I am assuming that decomposition is done in QTX. I will try to enumerate the differences if it were done in Quake; for one, the rewrite patterns would be simpler.)

Bird's eye view

The result of using a compiler rewrite pattern is straightforward. The decomposition pass will walk the IR and rewrite, in place, all operations that have a registered decomposition pattern:

qtx.circuit @foo() {
  %c0 = qtx.alloca : !qtx.wire
  %c1 = qtx.alloca : !qtx.wire
  %t = qtx.alloca : !qtx.wire
  %t_0 = h %t : !qtx.wire
  %t_1 = z [%c0, %c1] %t_0 : [!qtx.wire, !qtx.wire] !qtx.wire
  %t_2 = h %t_1 : !qtx.wire
}

With library functions, the decomposition pass will instead walk the IR and substitute all operations with a registered decomposition by a call to the library function that implements the decomposition:

qtx.circuit @x_decomp(%c0: !qtx.wire, %c1: !qtx.wire, %t: !qtx.wire) -> (!qtx.wire, !qtx.wire, !qtx.wire) {
  %t_0 = h %t : !qtx.wire
  %t_1 = z [%c0, %c1] %t_0 : [!qtx.wire, !qtx.wire] !qtx.wire
  %t_2 = h %t_1 : !qtx.wire
  return %c0, %c1, %t_2 : !qtx.wire, !qtx.wire, !qtx.wire
}

qtx.circuit @foo() {
  %c0 = qtx.alloca : !qtx.wire
  %c1 = qtx.alloca : !qtx.wire
  %t = qtx.alloca : !qtx.wire
  %c0_0, %c1_0, %t_0 = qtx.call @x_decomp(%c0, %c1, %t) : ...
}

In this simple example, an extra inlining step is necessary to get us to the same result as the solution based on rewrite patterns.

Details

The discussion until this point served the purpose of painting a broader picture of what decomposition is, its role in the compilation, and a bird's eye view of the two alternative implementations. At this point, these alternatives might seem similar. However, choosing one over the other will have nontrivial implications for the compiler and the compilation process.

I will assume that using the rewrite patterns is the baseline ("natural") way of implementing decomposition. This is because the rewrite patterns and library functions will look similar (minus the MLIR boilerplate, which we can minimize). Hence, from the point of view of someone implementing decomposition patterns, there is little difference (code-wise). Also, we already have decomposition implemented as rewrite patterns, so I presume the implications of this method are understood and will focus on the details of using library functions.

The main disadvantage of the rewrite patterns method is that adding or changing decompositions requires changing the compiler. This disadvantage primarily motivates the use of library functions.

The first couple of questions that arise for using library functions are:

  • Where will the decomposition functions live?
  • How do we inject these functions into the translation unit?

We could place these functions in a header file, say cudaq/decomposition.h, that would be included in the cudaq header, guaranteeing that the decomposition functions will be available to the compiler. The downside here is that the compiler will do the work of parsing and compiling all decomposition functions in all translation units. Also, note that the compiler will need to know the names of these functions, so adding new decomposition patterns would still require changing the compiler unless we pass these names as configuration options.

Once we settle these questions, we turn to what will happen during a compilation that relies on library functions. The careful reader might have noticed that in our previous example, the rewrite pattern handles X operations with an arbitrary number of controls, while the library function is limited to operations with two controls. We can fix this by changing the library function:

void x_decomp(cudaq::qreg &controls, cudaq::qubit &t) {
  h(t);
  z<cudaq::ctrl>(controls, t);
  h(t);
}

This new function complicates decomposition in QTX because the number of controls can only be bounded at decomposition time. Specifically, the compiler won't be able to lower this function to QTX, so before doing decomposition, we will have something like this:

func.func @x_decomp(%controls: !quake.vec<?>, %t: !quake.qref) {
  h %t
  z [%controls] %t
  h %t
}

qtx.circuit @foo() {
  %c0 = qtx.alloca : !qtx.wire
  %c1 = qtx.alloca : !qtx.wire
  %t = qtx.alloca : !qtx.wire
  %t_0 = x [%c0, %c1] %t : [!qtx.wire, !qtx.wire] !qtx.wire
}

Now, the compiler needs to deal with the following problems:

  • The user code uses separate qubits as controls, not a vector (or array in the case of QTX).
  • The compiler must specialize the function for the appropriate number of controls. (This would be less of a problem if decomposition was done using Quake.)

After dealing with these issues, we are still left with the problem that likely burdens the compiler the most. When using a library function, the decomposition process does not straightforwardly generate the sequence of operations that implements the gate; instead, it injects a function that, if executed, would do so. We then rely on the compiler's optimization capabilities to get us this sequence. In our toy example, this was a simple inlining of the function; however, not all decompositions are this simple.

Summary

It seems to me that using rewrite patterns is a better solution. We already have simple decompositions implemented using them, and we understand the technical implications. Furthermore, from a compiler developer perspective, the difference between writing a pattern and a decomposition function seems minimal, so there is little gain. (This assumes all infrastructure required to support using library functions for decomposition is done.) From a user perspective, library functions could enable an easy way to change decompositions without changing the compiler, but what percentage of users would want to do that? And since there is an extra cost for allowing this, are we willing to impose it on all users?

[RFC] Unified IR

Unification and Simplification of the Quantum IRs

Background

Presently, there are two distinct IRs for Quantum gates and their support. The first IR is in a memory SSA form where gate operations are ordered by their memory side-effects. The second IR repeats the semantics of the first IR in terms of the available quantum gates but has threaded values with volatile semantics. Specifically, quantum values may only be read/used once and afterwards are considered mutated/destroyed. Quantum values are thus not SSA-values in this IR (although they appear to be).

Revisiting Fundamentals

Quantum vs. Classical

There is a clarification needed between what is meant by quantum “SSA values” vs. classical SSA values. Specifically, in a classical sense an SSA value is read-only and will never change value; its state cannot change in the subsequent computation. In quantum computing, the “SSA value” is destroyed and unrecoverable when it is used or referenced in any (target) quantum position in subsequent gates. Effectively, this means quantum SSA-values are not, in fact, SSA values but reside in a highly volatile quantum-value space. In our examples below, this means that a !quantum type value is volatile and can only be used once (as a target).

However, a !ref<!quantum> value has no such restriction. The reference is not volatile; the referent is. Exactly like other implicit memory-SSA IRs such as LLVM, the volatility of the underlying quantum state is fully ordered by the effective memory-SSA semantics.

This volatility property of a quantum SSA-value is a novel concept to classical SSA IR principles. Many useful transformations rely on SSA-values not being volatile in this way.

Compiler IR Principles

There is no semantic difference in many of the quantum gate operations between the Quake and QTX dialects. They both share a Common dialect subset of ops that are the same gates, but which use different type systems. There are other distinctive operations, but these are similar in differentiation to LLVM’s getelementptr vs. extract_value distinction. To be precise, the distinction between a memory SSA and a value SSA form.

In retrospect, it is hard to justify having wholly distinct dialects for this minor difference. Compiler IRs, such as LLVM and other MLIR designs, do not have this bifurcation. Instead they allow for the IR to mix memory SSA and value SSA forms freely.

Let’s take a simple increment of a variable as an example.

%1 = load %variable
%2 = add %1, 1
store %2 to %variable

The memory SSA form is made salient with the load and store operations, while being simultaneously represented with the value SSA form of the add operation. There is no cause to separate the SSA representation into two mutually exclusive presentations.

The same should hold true for a quantum dialect. Indeed, the relationship between Quake and QTX amounts to folding the quantum loads and stores into the definition of Quake gates.

quake.gate %q
———————————————————
%q0 = “quantum load” %q
%q1 = qtx.gate %q0
“quantum store” %q1 to %q

This folding is important because of the non-SSA volatile property of quantum "values". Without it, it would be required to load and store each quantum value before and after every single quantum gate. This would at minimum triple the size of the IR and have no practical purpose in reasoning about the IR.

The distinction between the dialects is then reduced to something trivial: the types of the arguments/results of the quantum gate. A particular gate does the same operation whether it has to dereference the quantum values or not.

Recognizing this, it is possible to consider a generalized treatment of mem-to-reg and reg-to-mem that performs load forwarding of any type over irreducible control-flow graphs. Whether the value is a classical type or a quantum type doesn’t change the algorithm. Furthermore, the distinction can be fully accounted for by overloading the gate to accept both memory and value SSA forms by way of type distinctions.
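To make the "same algorithm for any type" point concrete, here is a toy straight-line store-to-load forwarding pass in Python (hypothetical instruction encoding; real mem-to-reg over irreducible control-flow graphs is substantially more involved):

```python
def forward_loads(instructions):
    """Straight-line store-to-load forwarding. A load from a reference whose
    most recent store is known becomes a copy of the stored value; the
    algorithm never inspects whether the value is classical or quantum."""
    last_store = {}   # reference -> most recently stored value
    out = []
    for inst in instructions:
        if inst[0] == 'store':          # ('store', ref, value)
            last_store[inst[1]] = inst[2]
            out.append(inst)
        elif inst[0] == 'load' and inst[1] in last_store:   # ('load', ref, dest)
            # Forward the known value instead of re-reading the reference.
            out.append(('copy', last_store[inst[1]], inst[2]))
        else:
            out.append(inst)
    return out

prog = [('store', 'q', '%q1'), ('load', 'q', '%2')]
optimized = forward_loads(prog)
```

Nothing in the pass depends on the type stored behind the reference, which is exactly the argument for a unified treatment of classical and quantum values.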

gate %qr : (!ref<!quantum>) -> ()

%qv2 = gate %qv1 : (!quantum) -> !quantum

The above example shows us a quantum gate operation that is overloaded (by type) to have both a memory (reference) and register (SSA-like volatile value) semantics. MLIR lets one define operations as cross products of a name and set of types very easily. It’s one of its strengths as an IR framework. Note that the core semantics of the operation, what the operation means, has not changed, only the type of its operands and results (if any).

Mixed modality advantages

In a mixed-modality IR with both reference (memory) and value (register) semantics, one has choices and can freely use either mode or a combination of both. This greatly simplifies reasoning about the IR and eliminates the need to “fully convert” between the semantic levels. Full conversion and its implications thus no longer need be considered or bottleneck the IR.

For example, it is possible to write a function with known-at-runtime characteristics, such as a library routine, that also contains fully optimized circuits (i.e., SCPs with quantum values). Moreover, issues such as “what is a vector of quantum values” can be decomposed into more fundamental concepts: (1) a classical vector of references, along with (2) references that can be dereferenced to modify and use a volatile qubit state.

Another example is that it lets the compiler mix pass-by-value and pass-by-reference quite freely. Switching between one or the other becomes a question of optimization rather than a hard requirement.

func @foo.reference(!ref<!quantum>)
func @foo.value(!quantum) -> !quantum
func @foo.mixed(!quantum, !ref<!quantum>) -> !quantum

All three of the above declarations are legal and have semantics.

[RFC] cudaq::observe_n(...), spin_op observe broadcasting

Background

#25 has brought up a new feature request: the ability to use a single call to cudaq::observe on a set of argument data. For example, given an ansatz that takes a list of float parameters as input, call observe on a 2D array of floats, where each row row_i of that data matrix yields the expected value <psi(row_i) | H | psi(row_i)>.

One thing to note is that this introduces more opportunities for "easy" parallelization on multi-GPU architectures.

Proposal

I would propose we introduce an observe_n function to handle these broadcast observe calls. Since the observe function is variadic and depends on the CUDA Quantum ansatz kernel signature, we should apply this function to a general ArgumentSet type, which encapsulates N vectors, one for each argument in the kernel function (N being the number of arguments). The ith argument set is then formed by taking the ith element of each of those vectors. Here's how I'd do this (writing it here to get feedback / changes):

// ------ Library Code -------
template<typename... Args>
using ArgumentSet = std::tuple<std::vector<Args>...>;

// TODO: Use a concept to validate that all arguments are vector types.
template <typename... Args>
auto make_argset(Args &&...args) {
  return std::make_tuple(std::forward<Args>(args)...);
}

template <typename QuantumKernel, typename... Args>
  requires ObserveCallValid<QuantumKernel, Args...>
std::vector<double> observe_n(QuantumKernel &&kernel, spin_op H,
                                ArgumentSet<Args...> &&params) { ... }

Here's how this would look for user code

// ------ User Code -------
cudaq::spin_op h = 5.907 - 2.1433 * x(0) * x(1) - 2.1433 * y(0) * y(1) +
                     .21829 * z(0) - 6.125 * z(1);
auto params = cudaq::linspace(-M_PI, M_PI, 50);

{
  auto ansatz = [](double theta) __qpu__ {
    cudaq::qreg q(2);
    x(q[0]);
    ry(theta, q[1]);
    x<cudaq::ctrl>(q[1], q[0]);
  };

  auto allExpVals = cudaq::observe_n(ansatz, h, cudaq::make_argset(params));
}

{
  // Demo non-trivial kernel signature. 
  auto ansatz = [](double theta, int size) __qpu__ {
    cudaq::qreg q(size);
    x(q[0]);
    ry(theta, q[1]);
    x<cudaq::ctrl>(q[1], q[0]);
  };

  auto allExpVals = cudaq::observe_n(
            ansatz, h, cudaq::make_argset(params, std::vector(params.size(), 2)));
}

A possible implementation for cudaq::observe_n could look like this

template <typename QuantumKernel, typename... Args>
  requires ObserveCallValid<QuantumKernel, Args...>
std::vector<double> observe_n(QuantumKernel &&kernel, spin_op H,
                                ArgumentSet<Args...> &&params) {
  std::vector<double> expVals;

  // Loop over all sets of arguments, the ith element of each vector 
  // in the ArgumentSet tuple
  for (std::size_t i = 0; i < std::get<0>(params).size(); i++) {
    // Construct the current set of arguments as a new tuple
    // We want a tuple so we can use std::apply with the 
    // existing observe() functions. 
    std::tuple<Args...> currentArgs;
    cudaq::tuple_for_each_with_idx(
        params, [&]<typename IDX_TYPE>(auto &&element, IDX_TYPE &&idx) {
          std::get<IDX_TYPE::value>(currentArgs) = element[i];
        });

    // Call observe with the current set of arguments (provided as a tuple)
    auto expVal = std::apply(
        [&](auto &&...tupleArgs) -> double {
          return observe(kernel, H, tupleArgs...);
        },
        currentArgs);

    // Store the result.
    expVals.push_back(expVal);
  }

  return expVals;
}

Note that with that inner expectation value computation, we are free to use observe_async and assign these tasks to any available virtual QPUs asynchronously and in parallel.

In Python, this should look like this

kernel, params = cudaq.make_kernel(list)
... create the ansatz ... 

params = np.random.uniform(-np.pi, np.pi, size=(50, 20))
expVals = cudaq.observe_n(kernel, H, params) 

Also, we could do similar things with cudaq::sample_n(...).

Python integration with pip

To install the CUDA Quantum Python bindings, you must currently build cudaq with the CUDAQ_ENABLE_PYTHON flag set to true and add the installed .so file to your Python path. We should allow for installation with pip and/or eventually conda.

Example:
pip install cudaq
would install the cudaq Python module and make it accessible via import cudaq.

Tasks

[RFC] Rename cudaq::kernel_builder -> cudaq::kernel

Background

Recently we updated the Python bindings, renaming kernel_builder to Kernel to be more Pythonic. I actually like this a little better. This type in C++ is meant to build Quake function representations, so it is in some sense a kernel builder. But ultimately it is a callable CUDA Quantum kernel, so I think it may be better to just call it cudaq::kernel.

Changing the spin_op.dump() to print out the Pauli string w/o qubit #

Describe the feature

spin_op currently dumps the qubit number each Pauli acts on after the Pauli letter, for instance

X0X1I2I3Y4.....Z13...

If you're trying to parse such a Pauli string, it is difficult to do so in a general manner because the qubit indices jump in place value (e.g., for n = 9, 10, 100 qubits).

get_qubit_count (Python) or n_qubits (C++) already tells you how many qubits you have, so the qubit numbers are not really needed. Additionally, this will make interoperability with other libraries a little easier.

Set up workflow to produce and push CUDA Quantum Docker image

Set up a workflow that produces and pushes the CUDA Quantum Docker image.
That workflow should

  • Run all e2e tests using the CUDA Quantum installation and runtime environment in the image (this may require test updates).
  • Validate all examples that are included in the image and all viable backends.
  • Produce a new version of the docs.

Prerequisites to take care of first:

  • Setup workflow that builds a minimal version of the CUDA Quantum Docker image
  • Set up workflow to create an image with all the dev dependencies for HPC backends
  • Set up a workflow to build a version of the CUDA Quantum Docker image that includes GPU accelerated backends included in this repository
  • Set up a workflow to build a version of the CUDA Quantum Docker image that includes additional backends not included in this repository

To be evaluated:

  • Can we fully automate the publishing on NGC? It would be good if the docs and the image were updated at the same time.
  • Container signing options, e.g. https://github.blog/2021-12-06-safeguard-container-signing-capability-actions/.
  • Produce images for architectures other than x86_64/amd64.
