I am super happy to be part of the team at Qblox, which has been and still is actively working with NVIDIA to promote the NVQLink standard as a way to interconnect heterogeneous systems composed of GPUs, CPUs and QPUs (quantum processors, including the front-end quantum controllers that Qblox is developing).

NVQLink at a glance Link to heading
There’s not much public information yet, but one can expect it to be released very soon, and I will update this memo accordingly. For now, this is what one can find on the internet, especially from Nannod, which I need to credit for the diagram below.

Performance specifications Link to heading
The key “hardware” specifications emphasize the performance of network & computing resources:
Network throughput: Up to 400 Gb/s from the GPU to the QPU.
Network latency: round-trip latency (FPGA → GPU → FPGA) of less than 4.0 microseconds (see the quick sanity check after this list).
GPU HW: Real-time host built on NVIDIA GB200 Grace Blackwell superchips, with a lot of TFLOPS.
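As a quick sanity check on those two numbers (my own back-of-the-envelope arithmetic, not taken from the specification): at 400 Gb/s, a small frame spends only a couple of nanoseconds on the wire, so virtually the entire 4-microsecond budget is consumed by processing in the NIC, GPU and FPGA rather than by the link itself.

```cpp
// Back-of-the-envelope check: wire (serialization) time of a small frame
// versus the 4 us round-trip budget. The 92-byte frame size comes from the
// RoCE proof of concept discussed later in this memo.
#include <cstdio>

int main() {
    constexpr double link_rate_bps = 400e9;  // headline NVQLink throughput, 400 Gb/s
    constexpr double frame_bytes   = 92.0;   // small Ethernet frame carrying a 32-byte payload
    constexpr double wire_time_ns  = frame_bytes * 8.0 / link_rate_bps * 1e9;
    std::printf("wire time: %.2f ns out of a 4000 ns round-trip budget\n", wire_time_ns);
    return 0;
}
```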
Why NVQLink is needed Link to heading
You may ask, “Why is this important?”
It creates an open standard to tightly integrate quantum hardware with accelerated GPU compute.
It enables hybrid workflows: Neural network calibration, quantum error correction (QEC), etc.
It provides an open platform that integrates SW and HW as one, allowing anyone to interact with quantum computers without having to drown in an ocean of knowledge, thanks to CUDA-Q.
NVQLink Specification: Architecture Design Link to heading
I was looking forward to a deep dive once the NVQLink specification became publicly available.
The NVQLink white paper is now available at https://arxiv.org/abs/2510.25213, so let's dive into the details of the proposed architecture:
System Architecture Link to heading
(system diagram adapted from the original picture from the white paper)
As expected, the details match the high-level overview from Nannod. From a component perspective, the NVQLink architecture comprises the Real-time Host (RTH) and the QPU Control System (QSC). These two components are connected by a low-latency, scalable Real-time Interconnect (RTI). The RTH contains traditional HPC compute resources such as CPUs and GPUs. As for the QSC, it typically includes the Pulse Processing Units (PPUs) that control the QPU.
This diagram introduces two important keywords: fn and NI. In the NVQLink mental model ("programming model"), each of the CPUs, GPUs and PPUs (or other specialized ICs/FPGAs) is referred to as a "device", and NVQLink makes it possible to issue remote procedure calls ("callbacks", or fn) into any of those devices. This is a powerful design, with NVQLink acting as the glue for the heterogeneous system. The actual implementation of the fn runtime is highly optimized, to the point where argument marshalling is taken care of; this way, one can ensure latencies of a few microseconds. As for NI, it stands for Network Interface, and it is conceptualized as a "small", optional networking card/interface/cable that all parties can use to form a unified, interconnected system.
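To make the fn idea concrete, here is a purely illustrative sketch of the mental model; the names DeviceRuntime, register_callback and invoke are hypothetical and are not part of NVQLink or CUDA-Q. The point is simply that each device exposes named callbacks and the runtime handles argument marshalling when another device invokes them over the RTI.

```cpp
// Purely illustrative sketch of the "fn" callback idea; not the NVQLink API.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical device-side callback: takes raw marshalled bytes, returns raw bytes.
using Callback = std::function<std::vector<uint8_t>(const std::vector<uint8_t>&)>;

struct DeviceRuntime {
    std::unordered_map<std::string, Callback> table;

    // Register a callback ("fn") that other devices may invoke over the RTI.
    void register_callback(const std::string& name, Callback cb) { table[name] = std::move(cb); }

    // In a real system the request arrives over the interconnect; here we call locally.
    std::vector<uint8_t> invoke(const std::string& name, const std::vector<uint8_t>& args) {
        return table.at(name)(args);
    }
};

int main() {
    DeviceRuntime gpu;  // stand-in for a GPU device runtime
    gpu.register_callback("decode_syndrome", [](const std::vector<uint8_t>& syndrome) {
        // A QEC decoder kernel would run here; we just return a dummy correction byte.
        return std::vector<uint8_t>{static_cast<uint8_t>(syndrome.size() % 2)};
    });
    auto correction = gpu.invoke("decode_syndrome", {1, 0, 1});
    std::cout << "correction: " << int(correction[0]) << "\n";
}
```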
Network Architecture Link to heading
Without any surprise, NVQLink makes use of RDMA and GPUDirect to bypass any unnecessary CPU processing and ensure optimal latency. As written in the specification, "benefiting from these two technologies, only the NIC and GPU are involved during the processing of packets coming from and going to the QSC, without any host involvement".
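To illustrate what "only the NIC and GPU are involved" looks like in practice, here is a minimal sketch, assuming a Linux host with libibverbs and a GPUDirect-RDMA-capable driver (nvidia-peermem or DMA-BUF): a buffer is allocated in GPU memory and registered with the NIC, so incoming RDMA writes land directly in GPU memory without touching host RAM. Error handling is omitted for brevity, and none of this is taken verbatim from the NVQLink specification.

```cpp
// Sketch: register a GPU buffer with the NIC so RDMA writes bypass the host CPU.
// Build with something like: g++ gdr.cpp -lcudart -libverbs (paths depend on your setup).
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <cstdio>

int main() {
    // 1. Allocate the landing buffer directly in GPU memory.
    void* gpu_buf = nullptr;
    cudaMalloc(&gpu_buf, 4096);

    // 2. Open the first RDMA device and create a protection domain.
    ibv_device** devs = ibv_get_device_list(nullptr);
    ibv_context* ctx  = ibv_open_device(devs[0]);
    ibv_pd* pd        = ibv_alloc_pd(ctx);

    // 3. Register the GPU pointer with the NIC. With GPUDirect RDMA enabled, the
    //    NIC can DMA straight into this buffer; no host memory copy is involved.
    ibv_mr* mr = ibv_reg_mr(pd, gpu_buf, 4096,
                            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    std::printf("registered GPU buffer, rkey=0x%x\n", mr ? mr->rkey : 0);

    // Cleanup.
    if (mr) ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFree(gpu_buf);
    return 0;
}
```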
The specification also recommends using the "Unreliable Connection" (UC) RDMA mode, as the latency price paid for RCs (Reliable Connections) may be overkill on a properly engineered network.
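The choice between UC and RC is literally one field in the queue-pair attributes. A minimal libibverbs sketch (my own illustration, continuing from the protection domain created above; not from the specification itself):

```cpp
// Sketch: create an RDMA queue pair in Unreliable Connection (UC) mode.
// UC skips the ACK/retransmission machinery, shaving latency; RC would use IBV_QPT_RC.
#include <infiniband/verbs.h>

ibv_qp* create_uc_qp(ibv_context* ctx, ibv_pd* pd) {
    ibv_cq* cq = ibv_create_cq(ctx, /*cqe=*/64, nullptr, nullptr, 0);

    ibv_qp_init_attr attr = {};
    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.qp_type = IBV_QPT_UC;      // "Unreliable Connection", as the spec recommends
    attr.cap.max_send_wr  = 64;     // outstanding work requests
    attr.cap.max_recv_wr  = 64;
    attr.cap.max_send_sge = 1;      // one scatter/gather element per request
    attr.cap.max_recv_sge = 1;

    return ibv_create_qp(pd, &attr);
}
```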
The specification provides a "proof of concept" for the network architecture, as a means to verify the achievable latency. It uses the Holoscan Sensor Bridge module, which provides a way to send data between an FPGA and a NIC using the RDMA over Converged Ethernet (RoCE) protocol, and also handles the enumeration steps and the control signals. Using this architecture, the spec shows that it is possible to achieve a sub-4-microsecond round-trip latency for a 32-byte RDMA payload, equivalent to a 92-byte Ethernet frame.
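The proof of concept itself relies on the Holoscan Sensor Bridge and FPGA firmware, which I obviously cannot reproduce here, but the general shape of such a latency measurement with plain verbs is the classic write ping-pong: post a small signaled RDMA WRITE, have the peer write a response back, and time the interval. A hedged sketch, assuming the queue pair, registered buffers, and the peer's remote address/rkey were already exchanged out of band:

```cpp
// Sketch: time a small RDMA-write ping-pong, in the spirit of the sub-4 us test.
// qp, the registered local buffer (mr), and remote_addr/rkey are assumed to be
// set up already; the peer "echoes" by writing a byte back into pong_flag.
#include <infiniband/verbs.h>
#include <chrono>
#include <cstdint>

double ping_pong_once_ns(ibv_qp* qp, ibv_cq* cq, ibv_mr* mr, char* ping_buf,
                         volatile char* pong_flag, uint64_t remote_addr, uint32_t rkey) {
    ibv_sge sge = {};
    sge.addr   = reinterpret_cast<uintptr_t>(ping_buf);
    sge.length = 32;                      // 32-byte payload, as in the proof of concept
    sge.lkey   = mr->lkey;

    ibv_send_wr wr = {}, *bad = nullptr;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr; // placeholder: exchanged out of band
    wr.wr.rdma.rkey        = rkey;        // placeholder: exchanged out of band

    *pong_flag = 0;
    auto t0 = std::chrono::steady_clock::now();
    ibv_post_send(qp, &wr, &bad);

    ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin until the write is on the wire */ }
    while (*pong_flag == 0)              { /* spin until the peer's reply lands   */ }

    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count();
}
```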
(diagram adapted from the original picture from the white paper)
CUDA-Q + NVQLink Integrated Architecture (aka programming model) Link to heading
NVQLink distinguishes between slow and fast modalities, and I will specifically examine the impact of the fast-modality architecture in this section (the so-called "High Latency Sensitivity" case). The main difference is that for slow modalities the architecture allows Just-in-Time (JIT) compilation, as well as possible RTH mediation during execution.
For CUDA-Q + NVQLink to work with fast modalities, the specification stipulates that the complete ISA programs must be uploaded to the FPGAs in advance and triggered atomically, with minimal interactive communication with the Real-time Host during execution. JIT is really a cool piece of technology, so it is a pity it cannot be used for fast modalities. However, the specification makes it clear that the FPGAs are allowed to receive dynamic updates from the RTH ("via an instruction queue"), provided that the instruction queue remains non-empty until program termination. That sounds like "just in time" scheduling.
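Here is my reading of the "non-empty instruction queue" requirement, sketched with a toy in-memory FIFO. Everything in this snippet (InstructionFifo, the refill loop) is hypothetical and only meant to show the shape of the constraint: the RTH may keep streaming updates, but it must stay ahead of the FPGA so the queue never drains before the program terminates.

```cpp
// Hypothetical sketch of "just in time" scheduling of dynamic updates.
// InstructionFifo stands in for the FPGA's hardware instruction queue; in a
// real system pushes would be memory-mapped register writes, not a std::deque.
#include <cstdint>
#include <cstdio>
#include <deque>
#include <vector>

struct InstructionFifo {
    std::deque<uint64_t> slots;
    std::size_t capacity = 16;
    std::size_t free_slots() const { return capacity - slots.size(); }
    void push(uint64_t word) { slots.push_back(word); }
};

void stream_updates(InstructionFifo& fifo, const std::vector<uint64_t>& updates) {
    std::size_t next = 0;
    while (next < updates.size()) {
        // Refill eagerly: if the queue ever empties before the program ends,
        // the deterministic pulse-level program on the FPGA would stall.
        while (fifo.free_slots() > 0 && next < updates.size()) {
            fifo.push(updates[next++]);
        }
        // ...the FPGA drains the queue concurrently in a real system...
        if (!fifo.slots.empty()) fifo.slots.pop_front();
    }
}

int main() {
    InstructionFifo fifo;
    std::vector<uint64_t> updates(100, 0xABCD);
    stream_updates(fifo, updates);
    std::printf("streamed %zu updates without letting the queue drain\n", updates.size());
}
```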

Without any surprise, the specification makes it clear that the compiler must perform aggressive ahead-of-time optimizations, and this is definitely something all quantum control stack vendors care deeply about. It is also mentioned that if any callback needs to be executed on the GPU, the CUDA kernel on the GPU must be pre-initialized and actively waiting for events; nothing special here, this is the standard DOCA GPUNetIO workflow.
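The DOCA GPUNetIO API itself is out of scope here, but the underlying pattern is a "persistent" CUDA kernel: launched once, ahead of time, and busy-waiting on a flag until events arrive. Below is a conceptual analogue of mine that uses mapped host memory as the signalling mechanism; in the real workflow the NIC delivers packets straight into GPU memory instead.

```cpp
// Conceptual analogue of a pre-initialized, event-driven kernel (not DOCA GPUNetIO).
// Build with nvcc. A single persistent kernel spins on a flag; the host plays the
// role of the NIC by raising the flag when "work" arrives.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void persistent_worker(volatile int* ready, volatile int* done, int* handled) {
    while (*done == 0) {
        if (*ready == 1) {
            *handled += 1;            // stand-in for decoding / feedback computation
            *ready = 0;               // mark the event as consumed
            __threadfence_system();   // make the update visible to the host
        }
    }
}

int main() {
    int *ready_raw, *done_raw, *handled_raw;
    cudaHostAlloc((void**)&ready_raw,   sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void**)&done_raw,    sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void**)&handled_raw, sizeof(int), cudaHostAllocMapped);
    *ready_raw = 0; *done_raw = 0; *handled_raw = 0;

    volatile int* ready = ready_raw;
    volatile int* done  = done_raw;

    // Launch once, before any event arrives: the kernel is "actively waiting".
    persistent_worker<<<1, 1>>>(ready, done, handled_raw);

    for (int i = 0; i < 3; ++i) {
        *ready = 1;                    // simulate an incoming packet/event
        while (*ready == 1) { }        // wait until the kernel has consumed it
    }
    *done = 1;                         // tell the kernel to exit
    cudaDeviceSynchronize();
    std::printf("events handled on GPU: %d\n", *handled_raw);
    return 0;
}
```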

The CUDA-Q compilation and lowering workflow onto the NVQLink architecture is based on the standard LLVM/MLIR infrastructure (see diagram above). The first step is to parse the CUDA-Q kernels and generate the quantum IR (the Quake MLIR dialect, which can later be lowered to QIR) and the CC (classical compute) intermediate IR, both abstracted at the gate level. A later phase introduces the necessary optimizations and kernel fusion, producing a pulse-level dialect. Then, depending on the modality type, the next lowering phase relies on either RTH or FPGA mediation.
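For reference, this is roughly what the input to that pipeline looks like: a minimal CUDA-Q kernel written against the publicly documented C++ API and compiled with nvq++, which drives the Quake/CC lowering described above. It is only meant to show the gate-level abstraction the first lowering stage starts from.

```cpp
// Minimal CUDA-Q kernel: a Bell pair. nvq++ parses this and emits the Quake
// (quantum) and CC (classical) MLIR dialects before target-specific lowering.
#include <cudaq.h>

struct bell {
    void operator()() __qpu__ {
        cudaq::qvector q(2);         // two qubits
        h(q[0]);                     // Hadamard on qubit 0
        x<cudaq::ctrl>(q[0], q[1]);  // CNOT: entangle the pair
        mz(q);                       // measure both qubits
    }
};

int main() {
    auto counts = cudaq::sample(bell{});
    counts.dump();                   // expect roughly 50/50 "00" and "11"
}
```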
Runtime Architecture Link to heading
This section will cover the software runtime, from heterogeneous/distributed memory addressing to the interaction and synchronisation of the different devices involved in the NVQLink system. To be completed.