Learning from Nature's Cameras: Bio-Inspired Computational Imaging
For the past half-billion years, evolution has produced a diversity of eyes and brains that work together to solve visual problems with remarkable efficiency and robustness. By reverse-engineering these systems, we can uncover powerful principles to build the next generation of computational cameras.
Computational imaging expands the limits of what we can see by using the thoughtful co-design of optics, sensors, and algorithms [1]. In many settings, this paradigm allows post-processing to exceed the system's optical capacity, as is the case with super-resolution that can reach beyond the diffraction limit and can even extend down to the atomic scale. In other settings, like optical computing, some of the system's computational burden can be shifted onto the sensing pipeline, harnessing the physical propagation of light to perform computation with no additional energy cost. Powerful and inventive computational imaging approaches have allowed observation of physical phenomena such as shock waves, microorganisms in motion, the insides of unharmed human bodies, and the black hole at the center of our galaxy. In contrast to mainstream computer vision, the subfield of computational imaging is much smaller and marked by the added creativity enabled by the inclusion of detector design. But computational imaging is also the normal way for vision to work.
This is because "vision" is not just a set of problems that exist in computer science, but also a description of pre-existing abilities for perceiving the world. Vision has happened, successfully and with remarkable variety, for more than half a billion years [2]. Fossils from the Cambrian radiation, around 540 million years ago, demonstrate a diversity of sophisticated optical anatomy. Alongside Hallucigenia's small and simple eyes, Opabinia developed five eyes of different sizes on independent stalks. Meanwhile, trilobites' stone-lensed compound eyes varied in structure across species, sometimes including stalks or ridges and sometimes covering nearly the entire visual field. Every one of these was a computational imaging system, because their eyes and brains evolved in cooperation with each other.
Considering these strange creatures, one is tempted to ask: Why are they like that? Why those optical designs? Why grouped together in those sets? Were some of these animals more successful because their vision worked better? What happened to visual signals downstream from the eyes? How did information flow from lenses to retinae to nervous systems to behaviors? What did it cost them and was it worth it? Were any of these sensing systems optimal? Not yet converged? Stuck in local minima? How are they similar to and different from the vision of animals and technologies in use today? The answer to these questions is, mostly, that we do not know. Following the Cambrian period, these particular animals disappeared with little trace in evolutionary history, and over time many new solutions to visual challenges have emerged, including our own eyes and brains. The visual systems alive today are, fortunately, accessible to active inquiry, but much remains unknown.
So what should we do about the fact that vision is happening all around us, all the time, in ways we do not yet understand? A generally useful principle in engineering is that we should learn from the best. In the case of mainstream computer vision, human-level performance for tasks like object recognition was the goal of the field for many years. This is no longer so. Since 2015, deep neural networks have outperformed the humans originally responsible for providing "ground truth" object labels in the ImageNet dataset, and we have ceded a variety of increasingly complex tasks since then. Humans are unusually intelligent and highly visual animals, so many have assumed that the best bio-inspired vision approaches will be human-inspired, and that if human performance is no longer state-of-the-art, then there is no reason to consider eons' worth of variety in visual system evolution. The remainder of this article will focus on why this is not the case, what we stand to gain from correcting this error (and how to do it), and why computational imaging is the right way to approach this wealth of opportunity.
Performance in computer vision typically refers to accuracy on benchmarks, which are datasets associated with specific tasks and shared across a community that competes on per-label error metrics. For image or scene reconstruction, these labels often take the form of the expected output at each pixel, so that, for example, a stereo vision algorithm can claim to be state-of-the-art if it produces the lowest mean squared error across pixels of output depth on at least one publicly available stereo image and depth map dataset. Benchmarks are motivating and save time on data collection and task definition, and have thus enabled rapid progress on many difficult problems, but this approach to evaluating performance is far from complete.
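To make this concrete, the per-pixel metric described above is easy to state in code. The sketch below is a minimal Python illustration (the function name and masking convention are ours, not those of any particular benchmark): a predicted depth map is scored by its mean squared error against ground truth, optionally restricted to pixels where ground truth exists.

```python
import numpy as np

def depth_mse(predicted, ground_truth, valid_mask=None):
    """Mean squared per-pixel error of a predicted depth map.

    Illustrative only: real benchmarks add details such as occlusion
    masks, disparity-versus-depth conventions, and outlier thresholds.
    """
    err = (np.asarray(predicted, dtype=float) - np.asarray(ground_truth, dtype=float)) ** 2
    if valid_mask is not None:
        err = err[valid_mask]  # ignore pixels without ground truth
    return float(err.mean())
```

A method is then "state-of-the-art" on that benchmark if no published competitor achieves a lower score, which is exactly the narrowness this section goes on to question.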
"Vision" is not just a set of problems that exist in computer science, but also a description of preexisting abilities for perceiving the world.
Natural vision systems succeed when they enable survival and reproduction. This is a complex mandate and is unlikely to align with standard computer vision metrics. Instead, behavior-driven sensing should collect and compute the minimal signal needed to respond robustly and correctly to relevant stimuli. This "quick and dirty" approach to vision requires sparse and rough estimates rather than pixel-perfect reconstructions. And quick does matter: a predator recognized too late is functionally never seen. In fact, biological vision provides solutions under tight constraints, not only on compute time, but also on compute power (every calorie is acquired under risk) and sensor compactness (optics and processing must be miniaturized to fit within the body). Animals, particularly those with small brains, commonly exceed the technological state of the art on tightly constrained visual sensing. This is increasingly important as our technology moves in the direction of similarly constrained edge devices, and as we begin to contend with the ecological impact of resource-intensive approaches to artificial intelligence such as large language models and mainstream computer vision networks. Eventually, the fields of robotics and vision may mature to the point that behavioral benchmarks become available, and sensors can be evaluated within the context of a perception-action cycle. In the meantime, we must take care to develop metrics that incentivize useful and responsible steps forward.
An additional advantage of biological vision is its sheer diversity; hundreds of millions of species, over hundreds of millions of years, have each solved a whole suite of visual problems. Living animals can reveal their visual algorithms through behavioral responses, while even long-dead fossils preserve optical anatomy. In the field of computational imaging specifically, where the degrees of freedom enabled by generalizing optical design can prove intractably large, this pre-identification of tried-and-true cameras provides vital clues. It is important to note that while evolution is not guaranteed to have produced globally optimal solutions, brain tissue, and retinal tissue in particular, is metabolically expensive, and the visual system is under harsh selection pressure. It is therefore reasonable to assume that extant, or preserved, visual systems are at least locally optimal for survival-enhancing behaviors in context. This means that if an evolved visual system violates our assumptions about how vision should work, there is a reasonable chance that this indicates a gap in our current knowledge.
The efficiency and robustness of biological vision set attractive goals for computational imaging systems. In order to bring these examples to bear on new technologies, there are two basic starting points: technological goals and biological examples.
In the technology-first approach, which follows the standard engineering process, the first step is to describe a technical niche in terms of task and constraints. Next, one can analogize to an ecological niche and identify organisms that have solved similar problems under similar circumstances. There may be many of these to choose from, particularly for edge devices or low-power applications. Each of these serves as an "answer in the back of the textbook" to provide a possible new mechanism. Unfortunately, many of these systems remain poorly understood, so the underlying principles might require additional research to uncover. This may not serve projects with a tight schedule or a strong profit imperative, but if there is space for basic science, the payoff to both fields may be significant.
In the biology-first approach, the emphasis is on reverse engineering a promising but poorly understood visual system. Here are a few signs of a promising visual system: It can see things that standard cameras cannot (e.g., color bands other than RGB, polarization, single photons); it can do more behaviorally than we'd expect its anatomy to support (e.g., surprising ability to distinguish similar stimuli, good visual performance with a small brain); or it can do less behaviorally than we'd expect (e.g., coarser stimulus discriminations in practice than the eye is capable of). The first of these categories demonstrates a practical value for novel sensor types, the second points to advances in algorithm design, and the third suggests avenues for computationally efficient task-specific systems. Once the visual principle has been uncovered, its applicability may well be broader than the specific setting in which it is discovered.
Both approaches invite a holistic understanding of vision, which can provide powerful and unexpected insights into both principles and applications (see Figure 1). Moving responsively between these domains of contribution enables a principle-application cycle in which progress in each direction inspires and sustains inquiry in the other.
Our work on spider-inspired 3D cameras provides a case study of the second approach. Behavioral biologists discovered that jumping spiders, which pounce on their prey, jump to the wrong distance under an unexpected illumination color (narrowband red as opposed to narrowband green, which is more similar to sunlight). Furthermore, the size of their jump errors corresponds to the chromatic aberration (the change in focus across wavelengths) measured in their lenses. This indicates that a change in focus affects the spider's depth perception, which biologists connected to the computational imaging technique of depth from defocus. This technological connection was exciting but raised more questions than it answered.
Depth from defocus has traditionally been considered a computationally expensive depth cue. In contrast, depth from stereo was considered less computationally expensive. As a result, a great deal of progress has been made in building efficient stereo cameras, including multicamera consumer devices that might be in your pocket right now. The jumping spider suggested our assumptions about the relative costs of defocus and stereo depth estimation might be wrong. Here was an animal with tight computational constraints due to its poppyseed-sized brain, with plenty of eyes to do stereo, that nonetheless appeared to be relying on defocus. Because this contradicted the state of the art in engineering, it pointed to the possibility of an undiscovered shortcut in the math and physics of the depth from defocus problem.
The key to unraveling this mystery came from the anatomy of the spiders' two principal eyes. Behind each lens is a movable tube, at the end of which sits a stack of translucent retinae (see Figure 2a). This means the spider sees multiple versions of the world simultaneously at slightly different focus levels. This is similar to the sweep you see on your phone camera when you tap to bring something into focus, but all at once and all the time. Importantly, the retinae are stacked directly on top of one another, and this violated another assumption about depth from defocus. If defocus is the depth signal and a larger signal is more robust, then the change in defocus should be large to be effective. Instead, the spider seemed to observe a very small change in defocus, effectively a differential one.
Casting the depth from defocus problem as a differential equation solved the mystery. This formulation immediately implies a class of efficient algorithms for recovering depth from small changes in focus. Better yet, a layered retina is not required. The focus change can be generated by any of a variety of physical mechanisms (camera or object motion, lens or sensor motion, optical power of the lens, aperture changes, and any combination of these). This change in approach mirrors the algorithmic options for optic flow. Previous depth from defocus methods, in which blur size was typically doubled, are analogous to identifying and tracking scene points. In contrast, our method is highly similar to differential optic flow algorithms, based on subpixel motion and brightness constancy, which use only a small number of local computations (derivatives and solving a linear system). In fact, the governing equation describing our depth from differential defocus (DFDD) cameras can be seen as a brightness constancy equation, modified to account for defocus blur.
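To make the analogy concrete, here is a minimal sketch under simplifying assumptions of our own (a thin lens with a Gaussian-approximated aperture, time-varying optical power $\mu(t)$, sensor distance $s$, and scene depth $Z$); the notation is illustrative rather than taken from the cited papers. Differential optic flow rests on the brightness constancy equation

$$ I_x u + I_y v + I_t = 0, $$

which balances the temporal derivative $I_t$ against spatial derivatives and the unknown pixel motion $(u, v)$. Under the thin-lens model, blur width varies affinely with inverse depth, so modulating the lens power instead yields a defocus-modified constancy relation of the form

$$ I_t \approx c\,\dot{\mu}\left(\mu - \frac{1}{Z} - \frac{1}{s}\right)\nabla^2 I, $$

where $c$ is a fixed calibration constant set by the aperture size and sensor distance. This can be solved for $1/Z$ at each pixel from one temporal derivative and one Laplacian, with no search over blur sizes or correspondences.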
Hundreds of millions of species, over hundreds of millions of years, have each solved a whole suite of visual problems.
From our differential defocus equations, we have built a family of prototype 3D cameras. The first, shown in Figure 2b, used scene or camera motion with a standard shallow-depth-of-field camera [3]. This approach has the advantage of simple hardware. It can be entirely passive, provided it is mounted on an already-moving platform such as a robot or observes dynamic objects. Depth and 3D velocity are recovered by solving a local linear system relating the spatial and temporal intensity derivatives in a small image patch. Next, we reduced the computational footprint to a per-pixel depth equation by equipping the camera with an electric tunable lens (see Figure 2c) [4]. We oscillated the lens' optical power at its resonant frequency of 100 Hz, syncing our image capture to the peak and trough of the oscillation for a 100 frame-per-second depth sensor. The baseline in-focus distance was also adjusted to track objects of interest, keeping them in focus to extend the effective working range of the system. Our third prototype (see Figure 2d) utilized the same principle but replaced the tunable lens with a custom-designed metalens [5]. This nanophotonic device splits light as though it had passed through a pair of differentially-defocused lenses with the same optical center. This allows us to use the same computation as before, but with optics that are 30 times lighter and consume no power. Finally, we have equipped a camera with both a tunable lens and a motorized iris (see Figure 2e) [6], which allows us to compare a pair of optical derivatives and eliminate spatial derivatives from our reconstruction equation. Because computing spatial derivatives amplifies image noise and relies on a window of pixels, switching to purely optical derivatives improves accuracy even while reducing computational cost.
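To give a rough sense of how small this computation is, the following Python sketch implements the per-pixel recipe implied by the differential relation above for a pair of images taken at nearby lens powers. It is a simplified illustration under the same Gaussian-aperture, thin-lens assumptions, not the released code of any of the prototypes; the calibration constant, smoothing, and confidence threshold are placeholders.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def depth_from_differential_defocus(img_a, img_b, mu_a, mu_b,
                                    sensor_dist, calib_const,
                                    smooth_sigma=2.0, lap_thresh=1e-3):
    """Per-pixel depth from two images at slightly different lens powers.

    Sketch of the differential-defocus recipe: the optical derivative
    dI/dmu is balanced against the image Laplacian, giving inverse depth
    in closed form at each pixel. Illustrative, not the prototypes' code.
    """
    img_a = np.asarray(img_a, dtype=float)
    img_b = np.asarray(img_b, dtype=float)

    dI_dmu = (img_b - img_a) / (mu_b - mu_a)           # optical derivative
    mid = 0.5 * (img_a + img_b)                        # image at the mean power
    lap = laplace(gaussian_filter(mid, smooth_sigma))  # denoised Laplacian
    mu_mid = 0.5 * (mu_a + mu_b)

    # Model: dI/dmu ≈ calib_const * (mu - 1/Z - 1/sensor_dist) * laplacian(I),
    # so    1/Z = mu - 1/sensor_dist - (dI/dmu) / (calib_const * laplacian(I)).
    confident = np.abs(lap) > lap_thresh               # defocus cue needs texture
    inv_depth = np.full(mid.shape, np.nan)
    inv_depth[confident] = (mu_mid - 1.0 / sensor_dist
                            - dI_dmu[confident] / (calib_const * lap[confident]))
    with np.errstate(divide="ignore", invalid="ignore"):
        return 1.0 / inv_depth                         # NaN where unreliable
```

Beyond the Laplacian stencil, each output pixel costs only a handful of arithmetic operations, which is the kind of efficiency the prototypes exploit.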
Each prototype deepened our practical understanding of the design space while revealing novel areas of interest in the mathematics, which in turn led to new camera designs. Each time we iterated through our principle-application cycle, we saw new advantages in efficiency, accuracy, and robustness. Figure 3a shows a characterization of the floating point operations (FLOPs) and working range of a variety of depth-from-defocus algorithms. The spider-inspired approaches, indicated in green, have improved over time to surpass the state of the art on both metrics. Additional advances in our understanding of underlying principles include a novel connection between depth photography and partially coherent phase microscopy (see Figure 3b) [7], and a new characterization of spider eye movements from X-ray video (see Figure 3c).
This case study demonstrates the promise of biology for the field of computational imaging. Note that all of the described advances in both engineering and basic science have resulted from exploring a single behavior that involves roughly an eighth of the retinae of one type of animal. In a world abundant with examples of strange and successful visual systems, we have much left to explore and much to gain from doing so.
Why us? The computational imaging community is uniquely well-suited to translate results in vision science to advances in technology. Our inherently interdisciplinary approach to sensing prepares us to handle the challenges of cross-field communication and address multiple standards of rigor. Our work keeps us in close touch with the full imaging pipeline and aware of the current trade-offs between known methods. This makes us the most likely to recognize gaps between an animal's behavior and what its optical and neural anatomy "should" be able to do, and these gaps are a powerful source of new understanding.
Why now? An exciting array of new tools is available, in both imaging (from novel optical devices to unprecedented increases in data and compute) and biology (from genetic tools to computer-vision-enabled behavioral analyses). Science and engineering are increasingly interdisciplinary. But most importantly, we are coming to realize that the next generation of imaging technologies will face newly restrictive power constraints, with the proliferation of edge devices even as sustainability becomes a pressing responsibility. Nature's cameras have faced these constraints for hundreds of millions of years; it's time to learn from them.
[1] Bhandari, A., Kadambi, A., and Raskar, R. Computational Imaging. MIT Press, 2022.
[2] Land, M. F. and Nilsson, D.-E. Animal Eyes. Oxford University Press, 2012.
[3] Alexander, E. et al. Focal flow: Measuring distance and velocity with defocus and differential motion. In Proceedings of the European Conference on Computer Vision (ECCV '16). Springer, 2016, 667-682.
[4] Guo, Q., Alexander, E., and Zickler, T. Focal track: Depth and accommodation with oscillating lens deformation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV '17). IEEE, 2017, 966-974.
[5] Guo, Q. et al. Compact single-shot metalens depth sensors inspired by eyes of jumping spiders. Proceedings of the National Academy of Sciences 116, 46 (2019), 22959-22965.
[6] Luo, J. et al. Depth from coupled optical differentiation. arXiv preprint arXiv:2409.10725 (2024).
[7] Alexander, E. et al. Depth from defocus as a special case of the transport of intensity equation. In Proceedings of the IEEE International Conference on Computational Photography (ICCP '21). IEEE, 2021, 1-13.
Emma Alexander is an assistant professor of computer science at Northwestern University's McCormick School of Engineering. Her training is in physics (B.S. with distinction, Yale), computer science (B.S. with distinction, Yale; M.S. and Ph.D., Harvard), and vision science (postdoc, University of California, Berkeley).
Figure 1. Understanding vision.
Figure 2. Spider-inspired depth from differential defocus cameras.
Figure 3. Advancing applications and principles.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.