XRDS: Crossroads, The ACM Magazine for Students


Quanta Computer Vision

By Varun Sundar and Mohit Gupta


Tags: Computational photography, Emerging optical and photonic technologies, Quantum technologies


Digital cameras have democratized the ability to capture stunning, well-composed photos, putting pocket-friendly form factors in the hands of billions of users. A far cry from the early days of photography, when capturing a single image was an hours-long process involving bulky boxes fitted with light-sensitive metal sheets, today's photography requires only the effortless click of a button. At the heart of this incredible transformation of camera technology is the critical role of computation. Modern cameras utilize powerful image-signal processors that apply sophisticated algorithms, which are crucial for overcoming the physical limitations inherent to small form factors, particularly those of mobile devices. For instance, until a decade ago, capturing high-quality images at night was very difficult: The small sensors in mobile cameras collect only a fraction of the light collected by a professional camera (such as a digital single-lens reflex, or DSLR, camera). Thanks to computational techniques such as burst photography, which composites a sharp image from multiple noisy raw images, nighttime mobile photography is now very much a reality. It is thus fair to say that most modern cameras are "computational cameras," where physical sensor technology and computational techniques work synergistically to create the final images or videos. Beyond consumer photography, the ubiquity of computational cameras has directly impacted several other domains, from scientific imaging to industrial applications, including emerging technologies such as autonomous vehicles, robotics, and augmented reality.

Current digital camera sensors consist of an array of pixels, with each pixel acting as a photon bucket that collects light. These photon buckets require at least a few hundred to a thousand photons to create an image that is not severely degraded by noise or, worse still, entirely dark. But light can be sensed at an even finer granularity: its fundamental discretization, the individual photon. Sensing light at the photon level would, at least in theory, allow us to push the capabilities of computational cameras to their physical limit. Such newfound photon-level capabilities could enable capturing images in scenarios that pose steep challenges for today's cameras. Imagine a moonlit night during a camping trip, where most cameras today can only capture very noisy photos, or an autonomous car exiting a dark tunnel and suddenly contending with strong sunlight that can easily saturate images taken by existing cameras.

Capturing every photon might sound like something out of science fiction, but fortunately, there is an emerging class of sensors called single-photon or quanta sensors whose pixels are sensitive to the arrival of individual photons. Owing to their extreme sensitivity, time resolution, and shrinking cost, size, and power requirements, single-photon devices are driving an imaging revolution. A new generation of devices is emerging, with novel functionalities that were hitherto considered impossible. These include imaging at a trillion frames per second, seeing around corners by bouncing light off walls (thereby turning walls into mirrors), and even zooming in on microscopic biological events at nanosecond timescales in extremely low-light conditions. Figure 1 illustrates, at a high level, the imaging principle of single-photon cameras (and contrasts it with that of conventional camera pixels).

The Single-Photon Revolution

At first glance, single-photon cameras might seem like specialized gadgets meant only for scientific research. But thanks to a hardware revolution that has been unfolding over the past decade, these sensors are becoming increasingly accessible. Their key characteristics (resolution, noise, and pixel quality) have improved dramatically. What used to be largely limited to scientific-grade devices with low-resolution 32x32 pixel arrays has now evolved into megapixel single-photon sensors [1, 2], with industrial powerhouses like Sony and Canon working on even higher-resolution versions [3]. Further, since these sensors can be made using standard manufacturing processes similar to those used for conventional cell-phone cameras (i.e., CMOS technology), they are becoming cheaper and more widespread. In fact, we are already seeing them in consumer devices like the iPhone 13 and newer models, where they are employed to estimate scene depth using the time-of-flight principle.

We believe the emergence of such low-cost, high-resolution single-photon sensors could mark a phase transition in the arc of camera technology, allowing quanta cameras to step outside the realm of scientific laboratories and into the real world, where they could act as all-purpose imaging devices that operate across a wide gamut of illumination scenarios (from very dark, photon-starved scenes to bright scenes with an abundance of light, such as direct sunlight at noon) and a variety of applications (from niche scientific uses to mainstream computer vision and photography).

Quanta Vision: Computer Vision on Photon Streams

Each pixel of a single-photon camera acts as a light teaspoon instead of a light bucket like the pixels of conventional cameras. A single-photon pixel holds one photon at a time and outputs a time series of binary values, with a 1 indicating that at least one photon was detected during an exposure window and a 0 indicating otherwise. Due to the inherent randomness of photon arrivals, these photon detections can be modeled as Bernoulli random variables. Figure 2a shows a sequence of binary frames captured by a single-photon device, or a quanta camera. Bright regions of a scene produce binary ones frequently, while dark regions produce them only rarely.
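
To make this concrete, here is a minimal sketch of the image formation model just described, written in Python/NumPy with illustrative names of our own choosing: Photon arrivals follow a Poisson process, so the probability that a pixel detects at least one photon during an exposure window is 1 - exp(-flux x exposure), and each binary frame is a field of Bernoulli draws.

```python
import numpy as np

def simulate_quanta_frames(flux, num_frames, exposure=1.0, rng=None):
    """Simulate a stack of binary quanta frames from a per-pixel flux map.

    Photon counts in an exposure window are Poisson distributed, so a pixel
    detects *at least one* photon with probability 1 - exp(-flux * exposure).
    Each frame is therefore a field of independent Bernoulli draws.
    """
    rng = rng or np.random.default_rng()
    p_detect = 1.0 - np.exp(-flux * exposure)        # Bernoulli parameter per pixel
    # Draw num_frames independent binary frames: 1 = photon(s) detected, 0 = none.
    return rng.random((num_frames, *flux.shape)) < p_detect

# Example: a bright pixel (flux 2.0) fires far more often than a dark one (0.02).
flux = np.array([[2.0, 0.02]])
frames = simulate_quanta_frames(flux, num_frames=1000)
print(frames.mean(axis=0))   # ~[0.86, 0.02]: detection frequency tracks brightness
```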


To achieve our goal of computer vision with photons, or quanta computer vision, we must design algorithms that can operate on the unconventional raw data captured by single-photon cameras. Current computer vision and machine learning algorithms are designed to extract information from the images and video streams captured by conventional cameras. Applying them to photon streams faces a fundamental challenge: A single quanta image is heavily quantized (1 bit), extremely noisy, and bears little resemblance to the images these algorithms expect. This requires rethinking the entire stack of visual processing algorithms for photon-stream data, from low-level signal processing to mid-level computer vision and high-level machine learning and reasoning.

One potential approach to mitigating the limited information contained in a single quanta image is to stack photon detections across several quanta images, reducing noise and increasing the amount of scene information. For instance, if there is no scene or camera motion, simply summing photon detections across time yields a sharp image. However, we live in a world that is constantly in motion. Summing quanta images naively can therefore introduce motion blur: The more images we stack, the lower the photon noise but the greater the blur. This resembles the familiar noise-blur tradeoff that is fundamental to almost all imaging modalities, including consumer photography: We can capture either a sharp but noisy image or a clean but blurred one. This tradeoff is illustrated in Figures 2b and 2c.
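
Under the Bernoulli model sketched earlier, naive stacking amounts to averaging binary frames and inverting the measurement model; the log-inversion below is one standard way to do this, and the sketch assumes a static scene:

```python
import numpy as np

def naive_flux_estimate(frames, exposure=1.0, eps=1e-6):
    """Estimate per-pixel flux by stacking binary quanta frames over time.

    Inverts the Bernoulli measurement p = 1 - exp(-flux * exposure):
    flux = -ln(1 - p) / exposure, with p estimated as the detection rate.
    Few frames -> a noisy estimate; many frames + motion -> a blurred one.
    """
    p_hat = frames.mean(axis=0)                  # per-pixel detection frequency
    p_hat = np.clip(p_hat, 0.0, 1.0 - eps)       # avoid log(0) at saturated pixels
    return -np.log(1.0 - p_hat) / exposure
```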

It is possible to sidestep the noise-blur tradeoff by compensating for motion in a series of quanta images before extracting scene information for downstream tasks. We take inspiration from burst photography, which has been successfully deployed in modern smartphone cameras: A mobile camera captures a stack of noisy but motion-blur-free images that are subsequently combined using an align-and-merge procedure. Similarly, we can consider burst photography at the level of quanta images, or "quanta burst photography" [4], with the goal of compositing a temporal sequence of quanta images by first aligning them using motion estimation algorithms. As Figure 2d shows, quanta burst photography can produce a sharp and noise-free image of a scene from a stack of quanta images.
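
The sketch below illustrates a drastically simplified version of this align-and-merge idea, assuming purely global, integer-pixel translation between temporal blocks; the actual method [4] estimates finer-grained, spatially varying motion, and the block size and phase-correlation alignment here are illustrative choices of ours.

```python
import numpy as np

def estimate_shift(ref, img):
    """Integer translation aligning img to ref, via FFT phase correlation."""
    F = np.fft.fft2(ref) * np.conj(np.fft.fft2(img))
    corr = np.fft.ifft2(F / (np.abs(F) + 1e-9)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = ref.shape                             # map FFT indices to signed shifts
    return (dy if dy <= h // 2 else dy - h, dx if dx <= w // 2 else dx - w)

def quanta_align_and_merge(frames, block_size=100):
    """Simplified align-and-merge over a stack of binary quanta frames.

    1. Sum short temporal blocks: each block image is noisy but nearly blur-free.
    2. Align every block image to the first one (global translation only).
    3. Merge the aligned blocks into a single low-noise, low-blur image.
    """
    n_blocks = len(frames) // block_size
    blocks = [frames[i * block_size:(i + 1) * block_size].sum(axis=0).astype(float)
              for i in range(n_blocks)]
    ref, merged = blocks[0], np.zeros_like(blocks[0])
    for b in blocks:
        dy, dx = estimate_shift(ref, b)
        merged += np.roll(b, (dy, dx), axis=(0, 1))   # motion-compensated sum
    return merged / n_blocks
```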


Techniques such as quanta burst photography thus provide a first viable path to quanta vision via a two-step method: first reconstruct a high-quality image from single-photon data, then pass it on to downstream computer vision models. We have recently demonstrated that this approach can push computer vision outside the "Goldilocks zone" (where neither illumination levels nor motion extents are drastic enough to cause failures in computer vision algorithms) to extreme imaging scenarios where conventional cameras typically fail, such as those involving rapid motion, low light, or a wide dynamic range. As an example, Figure 3a demonstrates imaging in a high dynamic range scene where the brightest portion of the scene, the bulb's filament, is about 2,000 times brighter than the darkest region. Quanta burst photography produces an image that simultaneously recovers all regions of this scene. Figure 3b shows how quanta vision successfully detects an object at nighttime that is at a distance (and hence occupies a small region of the captured image), which can be quite challenging for existing night-vision cameras [4, 5].

Is Quanta Vision Ready for Prime Time?

Quanta burst photography provides a proof-of-concept demonstration of computer vision with single-photon sensors. However, it remains far from ready for consumer-grade applications, especially where power and latency are at a premium. The major open research challenge is the compute- and time-intensive nature of existing quanta processing techniques, which often require several minutes to construct a single high-fidelity image from a quanta frame sequence. While the computation time can be reduced to some extent with software-engineering optimizations, explicitly reconstructing a conventional-like image using the two-step approach described previously, before carrying out a downstream task, is resource inefficient. Often, a downstream task may not require preserving all the information in the scene. We posit that it may be possible to compute downstream task objectives implicitly and directly from quanta images, without ever reconstructing image representations from photon data. An important consideration for such implicit quanta vision is the fine granularity of single-photon acquisition: Unlike conventional acquisition, where a 100-millisecond capture typically corresponds to a few input frames, single-photon acquisition can feature several thousand quanta images, requiring algorithms that scale to very long input sequences.

The sheer volume of single-photon raw data leads to core computational challenges spanning compute, memory, and bandwidth. Even a modest-resolution (approximately 1-megapixel) single-photon device detecting photons at 100 kHz produces 100 Gbits/sec of data. For context, this readout rate is orders of magnitude greater than that of common data peripherals such as USB 3.0 (which has a maximum bandwidth of about 5 Gbits/sec). Reading out such a massive amount of data also necessitates a power expenditure around three orders of magnitude higher than that of conventional sensors, rendering single-photon imaging unviable for many resource-constrained applications (for example, micro-drones, cell phones, and AR/VR goggles).
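
The back-of-the-envelope arithmetic behind these figures is straightforward:

```python
# Readout rate of a binary single-photon sensor (back-of-the-envelope).
pixels      = 1_000_000     # ~1-megapixel array
frame_rate  = 100_000       # 100 kHz binary frame rate
bits_per_px = 1             # each quanta frame is 1 bit per pixel

readout_gbps = pixels * frame_rate * bits_per_px / 1e9
print(readout_gbps)         # 100.0 Gbit/s, roughly 20x USB 3.0's ~5 Gbit/s
```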


Since data movement is at the heart of these issues, we could try moving compute closer to the single-photon sensor. This task is far from straightforward. Applying existing computer vision algorithms to single-photon data naively, by simply treating photon detections as extremely high-speed videos, can incur prohibitive costs. A lightweight image classifier (e.g., MobileNet-v2, which is geared toward operability on smartphones) consumes about 9.2 GFLOPs, or 9.2 billion floating-point operations, when running on a 1-megapixel image. Running MobileNet-v2 on quanta images on a frame-by-frame basis would require about 920 TFLOPs per second, a compute affordance equivalent to strapping an NVIDIA DGX server to a camera sensor. More fundamentally, shrinking an exorbitant amount of compute within the small confines of an image sensor would require thermal-dissipation systems of prohibitive capacity. Such resource expenditure is quite impractical and can only be justified for very niche scientific or industrial applications.
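
The compute estimate follows directly from the per-frame cost quoted above and the sensor's binary frame rate:

```python
# Naive per-frame inference cost on a quanta stream (back-of-the-envelope).
flops_per_frame = 9.2e9     # MobileNet-v2 on a 1-megapixel image (from the text)
frames_per_sec  = 100_000   # binary frame rate of the sensor

tflops_per_sec = flops_per_frame * frames_per_sec / 1e12
print(tflops_per_sec)       # 920.0 TFLOP/s -- DGX-server territory
```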

To tackle the single-photon data deluge, we began exploring lightweight representations of quanta images that can be viably computed near the sensor, alleviating the need to read out raw photon detections. These representations, which we call photon-cube projections (a sequence of quanta images can be seen as spatiotemporal photon information, or a photon cube [6]), involve simple computational operations (for example, linear functions and spatial shifts) that process single-photon raw data in an online manner. In other words, each photon detection is accessed only once and does not have to be stored in memory [7]. Photon-cube projections can computationally emulate the capabilities of a wide variety of imaging systems after the fact (see Figure 4), including existing conventional cameras and exposure bracketing, as well as unconventional ones such as video compressive sensing, which aims to capture high-speed videos from one or a few encoded captures; biologically inspired event cameras, whose outputs respond to changes in measured brightness; and cameras that emulate sensor motion without any physical movement (thereby functioning as a dolly or trucking camera).
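
As an illustration of this online flavor, the sketch below maintains running sums over several window lengths, emulating an exposure bracket while touching each incoming binary frame exactly once; the function and parameter names are ours for illustration, not the interface of [7], whose projections also include spatial shifts and other linear codes.

```python
import numpy as np

def photon_cube_projections(frame_stream, shape, brackets=(100, 1000, 10000)):
    """Compute simple photon-cube projections in a single online pass.

    Each binary frame is accessed exactly once: running sums over several
    window lengths emulate an exposure bracket without ever storing the
    raw photon cube in memory.
    """
    sums    = {n: np.zeros(shape) for n in brackets}   # running per-bracket sums
    counts  = {n: 0 for n in brackets}
    outputs = {n: [] for n in brackets}

    for frame in frame_stream:                  # frames arrive one at a time
        for n in brackets:
            sums[n] += frame
            counts[n] += 1
            if counts[n] == n:                  # bracket complete: emit an image
                outputs[n].append(sums[n] / n)
                sums[n] = np.zeros(shape)
                counts[n] = 0
    return outputs
```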

Zooming out, the challenges that must be addressed for quanta vision to achieve widespread mainstream adoption are interdisciplinary in nature. Our focus so far has been on low-level signal processing, where we need to design efficient, lightweight algorithms that can process photon detections at real-time speeds near the sensor. For high-level computer vision objectives, such as object detection, semantic segmentation, or image prompting, we require machine learning algorithms that can scale to the fine temporal resolution of quanta image sequences. Such quanta machine learning algorithms will necessitate curating large-scale datasets to facilitate end-to-end learning, and potentially even foundation models that cater not just to one specific downstream task but instead learn universal representations that can be mapped to several tasks of interest.

We anticipate the core enabler for both low-level signal processing and high-level machine vision will be new kinds of computational architectures that can effectively handle single-photon workloads and translate the theoretical complexity of a proposed algorithm into tangible wall-clock numbers. In the recent past, dedicated architectures have emerged for specialized workloads, including tensor processing units for accelerating artificial intelligence computations and neuromorphic hardware that runs sparse matrix multiplications (and other floating-point operations) with minimal overhead. In a similar spirit, we envision photon processing units located close to a single-photon sensor, perhaps right beneath it using advanced 3D-stacking technologies, that can process the massively parallel streams of information the sensor captures. The north star for a photon processing unit would be to accept multiple user inputs, such as the task to be performed, latency tolerances, and compute budgets, and synthesize processor instructions that transform photons into the requisite task outputs.

The advent of digital cameras more than two decades ago enabled sensing light at the level of individual pixels and catalyzed a revolution in photography, image processing, and computer vision. In the subsequent decade, images as a grid of pixel values were the bedrock for advancements in machine learning, artificial intelligence, and robotics. We believe the next revolution in computer vision, robotics, and machine learning could be fueled by access to light at a finer granularity: the level of individual photons.

References

[1] Ulku, A. C. et al. A 512 × 512 SPAD image sensor with integrated gating for widefield FLIM. IEEE Journal of Selected Topics in Quantum Electronics 25, 1 (2019), 1-12. DOI: 10.1109/JSTQE.2018.2867439.

[2] Morimoto, K. et al. Megapixel time-gated SPAD image sensor for 2D and 3D imaging applications. Optica 7, 4 (2020), 346-354.

[3] Morimoto, K. et al. 3.2 megapixel 3D-stacked charge focusing SPAD for low-light imaging and depth sensing. In 2021 IEEE International Electron Devices Meeting (IEDM). San Francisco, 2021, 20.2.1-20.2.4. DOI: 10.1109/IEDM19574.2021.9720605.

[4] Ma, S. et al. Quanta burst photography. ACM Transactions on Graphics 39, 4 (2020), 1-16.

[5] Ma, S. et al. Burst vision using single-photon cameras. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2023, 5375-5385.

[6] Fossum, E. R. The quanta image sensor (QIS): Concepts and challenges. In Imaging and Applied Optics. OSA Technical Digest (CD). Optica Publishing Group, 2011, paper JTuE1. DOI: 10.1364/COSI.2011.JTuE1.

[7] Sundar, V. et al. SoDaCam: Software-defined cameras via single-photon imaging. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023, 8165-8176.

Authors

Varun Sundar is a graduate student at the University of Wisconsin-Madison, pursuing a Ph.D. in computer science. He previously received a bachelor's degree in electrical engineering from the Indian Institute of Technology, Madras in 2020.

Mohit Gupta is an associate professor of computer sciences at the University of Wisconsin-Madison. He received a Ph.D. from the Robotics Institute, Carnegie Mellon University, and was a postdoctoral research scientist at Columbia University. He directs the WISION Lab with research interests in computer vision and computational imaging.

Figures

Figure 1. (a) Unlike a conventional camera whose pixels function as photon buckets, (b) a single-photon sensor pixel is extremely sensitive to light and functions as a photon teaspoon. (c) Single-photon sensitivity enables imaging in a wide range of challenging conditions, including in very dark environments, in the presence of fast motion, or in high dynamic range scenes. (d) These nascent sensors have potential applications in several scientific and industrial domains.

Figure 2. (a) A sequence of quanta images. (b) Summing a few quanta images produces a short exposure, which preserves motion information but is quite noisy. (c) Summing many quanta images can reduce image noise but results in substantial motion blur. (d) Quanta burst photography temporally composites quanta frames and can alleviate the noise-blur tradeoff by producing a low-noise, low-blur, sharp image that additionally captures the dynamic range of the scene.

Figure 3. A demonstration of quanta vision in challenging imaging scenarios. (a) A high dynamic range scene where the brightest regions are about 2,000 times brighter than the darker regions, and (b) detection of objects that occupy a small region of the captured image at nighttime. This task fails with a conventional night-vision camera but can succeed with single-photon sensors. Images courtesy of Ma et al. [4, 5].

Figure 4. Outputs of a single-photon sensor can be computationally transformed to emulate several imaging modalities, thereby rendering it as a "Swiss-army knife" camera.


Copyright is held by the owner/author(s). Publication rights licensed to ACM.
