XRDS: Crossroads, The ACM Magazine for Students

The Inverse Problems You Carry in Your Pocket

By Ilya Chugunov


Tags: Computational photography, Mobile devices


One million dollars. The field of magnetic resonance imaging (MRI) contains arguably some of the most interesting multidisciplinary research problems in science, sitting at the intersection of electrical engineering, quantum physics, neuroscience, and developmental biology. But a million dollars is what you'll want to have in the bank if you're looking to finance an MRI machine for your next computational imaging project. Yet it is in the context of MRIs and space telescopes that we often introduce students to inverse imaging problems: recovering an underlying image or signal from indirect, incomplete, or imprecise measurements. For example, an accelerated MRI records a fast, undersampled scan of a patient and reconstructs the full scan via compressed sensing [1], taking advantage of the mathematical properties of sparse signals to narrow down what the missing data must be. Or, to take an example I used for many of my homework assignments, there is the Hubble Space Telescope, which initially suffered from optical aberrations due to a flawed primary mirror, sparking a race for computational methods to make it produce sharp images [2].
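To make the compressed sensing idea concrete, here is a minimal sketch (a toy sparse-recovery problem, not an actual MRI pipeline) of reconstructing a sparse signal from far fewer random linear measurements than unknowns, using iterative soft-thresholding; the problem sizes and measurement matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy compressed-sensing setup: a length-256 signal with only 8 nonzero
# entries, observed through 64 random linear measurements y = A @ x.
n, m, k = 256, 64, 8
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true

# ISTA: a gradient step on the data term, then soft-thresholding, which
# encodes the sparsity prior that makes the undersampled problem solvable.
step = 1.0 / np.linalg.norm(A, 2) ** 2
lam = 0.01
x = np.zeros(n)
for _ in range(500):
    x = x - step * A.T @ (A @ x - y)                          # gradient of 0.5*||Ax - y||^2
    x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)  # shrinkage

print("reconstruction error:", np.linalg.norm(x - x_true))
```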

These are settings where the logistical costs to improve measurements, whether it's scheduling another appointment to re-scan a patient or launching a new mirror into space, outweigh the mathematical and computational costs of extracting every last drop of information from the data you already have.

Unfortunately, this context can also feed into an unhealthy divide in research disciplines, where scientific imaging is the domain of linear algebra and matrix problems, and conventional photography is the domain of black-box machine learning methods and cute cat images. Combined with the relatively limited access students have to scientific devices compared to the abundance of online image collections available to anyone with an internet connection, I would argue this explains at least in part some of the stark differences between the fields of computer vision and computational imaging.

However, if we take a closer look at the origins of those online image collections—not the PNGs that comprise them, but the (mostly cell phone) cameras used to collect them—we can find many of the same imaging questions as we find in their MRI and space telescope counterparts. The pixels recorded in these PNGs are just the final product of photons reflected off object surfaces, focused through a stack of wafer-thin lenses, filtered by color, and converted first to electrons and then to digital readouts through the magic of semiconductor circuits. This is not to mention all the other components in our mobile devices: the micro-electromechanical systems that measure the phone's acceleration, rotation, and surrounding magnetic field; the processing chips that coordinate, compile, and collect the outputs of these sensors; and the displays that give the user realtime access to all of this data. Viewed through this lens, the modern cell phone is really a miniaturized multi-sensor imaging system with an internet connection.

But where are the inverse problems? What are the measurements, and what are we solving for? When we look at how images are used in computer vision, we see they're most often treated as inputs to a downstream task. From this perspective, images need to be mapped to a ground truth output—segmentation maps, depth estimates, object classes—which we must generate through hand annotation, computer simulation, a separate sensor (e.g., a depth camera), or some combination thereof. Even "low-level vision" tasks, such as removing noise from photos, can be framed as a forward process that maps an input noisy image to an output clean image.

To find the inverse problem here, all it takes is a change of perspective. If we treat the noisy image as an output of the imaging process we outlined before, the product of random photons and stray electrons, we can ask what the input, the real-world scene, must look like to produce a noisy image like this. In other words, the photo itself is an indirect, incomplete, or imprecise measurement of the physical properties of the world—properties we want to recover, such as 3D geometry, lighting, reflectance, and material characteristics.
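As a toy illustration of this change of perspective, the sketch below (my own simplified setup, not any production denoiser) treats the noisy image as the output of a forward model, observation = latent scene + sensor noise, and recovers the latent image by gradient descent under a smoothness prior, rather than learning a noisy-to-clean mapping.

```python
import torch

torch.manual_seed(0)
clean = torch.zeros(64, 64)
clean[16:48, 16:48] = 1.0                      # a toy "scene"
noisy = clean + 0.3 * torch.randn_like(clean)  # forward model: additive sensor noise

latent = noisy.clone().requires_grad_(True)    # the unknown scene we solve for
opt = torch.optim.Adam([latent], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    data_term = ((latent - noisy) ** 2).mean()           # explain the measurement
    tv = (latent[1:, :] - latent[:-1, :]).abs().mean() \
       + (latent[:, 1:] - latent[:, :-1]).abs().mean()   # prior: scenes are mostly smooth
    (data_term + 0.15 * tv).backward()
    opt.step()

print("error before:", (noisy - clean).abs().mean().item(),
      "after:", (latent.detach() - clean).abs().mean().item())
```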

This view becomes clearer when we look at the setting of burst photography, sequences of frames collected one after another [3]. We can ask what the scene must have been composed of, and how those elements must have changed over time to produce the captured image burst. We can ask about motion in the scene, whether it's from changes in the camera position or from parts of the scene moving on their own. If we can determine this motion, we can identify points we've observed multiple times and try to infer what the appearance of that point must be in the real world to produce these observations.
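The payoff of finding those repeated observations can be seen in a one-pixel toy example (the numbers are made up for illustration): once frames are aligned, averaging a point's observations across the burst gives a better estimate of its true appearance than any single frame, with noise shrinking roughly with the square root of the frame count.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 0.5                                            # the point's real appearance
observations = true_value + 0.1 * rng.standard_normal(8)    # 8 aligned, noisy frames
print("single-frame error:", abs(observations[0] - true_value))
print("merged-burst error:", abs(observations.mean() - true_value))
```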

Educational Value

While the optical properties of a cell phone camera (see Figure 1) don't exactly match those of a space telescope, or really in any way resemble what goes on inside an MRI machine, you (and the students in your classroom or lab) likely already own a cell phone. With a device in your hands, you can go through the full cycle of collecting and processing data: sleuthing out why some captures produce perfect reconstructions with your model while others fail to give you anything, and searching for a better model for those failure cases. And even if the objective is just to take cute pictures of your cat, a cycle of inquiry and exploration into the data you collect in every way resembles the processes of scientific imaging. The following sections detail the work we have done building models for this kind of inverse scene reconstruction, applied to depth estimation, reflection and occlusion removal, and image stitching, before concluding with a discussion of the potential future of mobile imaging.

Depth Estimation

"Shakes on a Plane: Unsupervised depth estimation from unstabilized photography." Ilya Chugunov, Yuxuan Zhang, and Felix Heide. CVPR 2023. [4]

If you are reading this, you are likely not a tripod. It is also likely that you don't use a tripod for cell phone photography. This means that, once we remove the effects of digital and optical stabilization, there is a surprising amount of camera motion left over from doing "nothing at all." During two seconds of view-finding, even while trying to steady the camera on a subject, your cell phone is likely to drift on the order of a centimeter from where it started. For the purposes of the photo, this camera shake is an undesired effect that can produce motion blur in the image. For the purposes of depth estimation, however, this motion contains rich information about the scene's geometry. Given the high resolution of a modern phone camera sensor, 12 million pixels in an area roughly the size of a pencil eraser, even an object five meters away exhibits several pixels of observable parallax—apparent motion due to the camera's shift in position. If we can estimate the parallax between the images we observe during view-finding, we can recover the 3D shape of objects in the scene, which we can use for a host of downstream tasks such as object segmentation, relighting, and scene editing.
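A quick back-of-the-envelope check of that parallax claim, using assumed ballpark values for the lens focal length and pixel pitch rather than any particular device's specifications:

```python
# Assumed, ballpark phone main-camera values (not a specific device).
focal_length_mm = 5.6      # lens focal length
pixel_pitch_um = 1.2       # sensor pixel pitch
baseline_m = 0.01          # ~1 cm of accidental hand motion
depth_m = 5.0              # object five meters away

focal_length_px = (focal_length_mm * 1e-3) / (pixel_pitch_um * 1e-6)
disparity_px = focal_length_px * baseline_m / depth_m
print(f"~{disparity_px:.1f} px of parallax")   # on the order of several pixels

# Inverting the same relation: a measured disparity gives an estimated depth.
measured_disparity_px = 9.0
print(f"depth ~ {focal_length_px * baseline_m / measured_disparity_px:.2f} m")
```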

Given that we know very little about how the phone actually moved—we can get a good estimate of its rotation from the phone's gyroscope, but the accelerometer is not reliable enough to give us estimates of its translation in space—we must solve for the motion of the phone and the geometry of the world simultaneously. Unfortunately, this means that, like many other problems in inverse imaging, our depth reconstruction task is extremely ill-conditioned. There are many different solutions, different combinations of camera paths and object depths, that would produce identical observed images. This is where we must introduce some assumptions—or constraints—into our model, based on our observations and the physical laws that govern them. First, we assume that the camera path is smooth—i.e., your hand slowly bobs back and forth during the capture and does not teleport. Next, we assume the scene is static: Any motion we see must come from parallax and not from the objects themselves moving. Lastly, we assume the depth is mostly continuous: If a point in the scene is close to the camera, its neighbors are likely also close to the camera. We implement these constraints as mathematical components in our model of the scene: smooth camera motion becomes a smooth polynomial model for the camera's translation, a static scene is modeled with a static camera projection matrix, and continuous depth is modeled as a continuous neural field—a tiny neural network that learns a function to map input coordinates to output values.
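As a rough sketch of how such constraints can become model components (a minimal PyTorch rendition of my own, not the paper's released code), the camera translation below is a low-degree polynomial of time and the depth is a tiny coordinate network; the static projection matrix would simply be held fixed outside these modules.

```python
import torch

class SmoothCameraPath(torch.nn.Module):
    """Camera translation as a smooth polynomial of normalized time:
    the hand can drift, but it cannot teleport."""
    def __init__(self, degree: int = 4):
        super().__init__()
        self.coeffs = torch.nn.Parameter(torch.zeros(degree, 3))

    def forward(self, t_norm: torch.Tensor) -> torch.Tensor:
        # t_norm in [0, 1]  ->  (..., 3) translation
        powers = torch.stack([t_norm ** (i + 1) for i in range(self.coeffs.shape[0])], -1)
        return powers @ self.coeffs

class DepthField(torch.nn.Module):
    """A tiny neural field: normalized pixel coordinates -> positive depth,
    continuous by construction."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(2, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1), torch.nn.Softplus())

    def forward(self, uv: torch.Tensor) -> torch.Tensor:
        return self.net(uv)
```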

With this physically constrained model set up, we feed in our observations, a two-second small-motion video clip, and optimize the model via gradient descent to match them. With no ground truth data, and the model having never observed any data other than the video it is being fit to, we can recover a geometrically accurate reconstruction of the scene, turning what would otherwise be discarded motion data into a 3D model of a tiger (see Figure 2).
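As a toy stand-in for this fitting stage (reusing the DepthField sketch above, and standing in for the paper's full reprojection pipeline), we can synthesize per-pixel disparities from a made-up depth map via the parallax relation, then recover depth by gradient descent on the disparity mismatch alone; no ground-truth depth ever enters the loss.

```python
import torch

depth_field = DepthField()
uv = torch.stack(torch.meshgrid(
    torch.linspace(-1, 1, 32), torch.linspace(-1, 1, 32), indexing="ij"), -1).reshape(-1, 2)

true_depth = 2.5 + uv[:, :1]                         # a toy sloped scene, 1.5-3.5 m away
f_px, baseline = 4500.0, 0.01                        # assumed focal length (px) and motion (m)
observed_disparity = f_px * baseline / true_depth    # what the burst "measures"

opt = torch.optim.Adam(depth_field.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    pred_disparity = f_px * baseline / depth_field(uv)
    loss = (pred_disparity - observed_disparity).abs().mean()   # match observations only
    loss.backward()
    opt.step()

print("mean depth error (m):", (depth_field(uv) - true_depth).abs().mean().item())
```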

Layer Separation

"Neural Spline Fields for Burst Image Fusion and Layer Separation." Ilya Chugunov, David Shustin, Ruyu Yan, Chenyang Lei, and Felix Heide. CVPR 2024. [5]

Another, more implicit assumption in the previous model was that every pixel in the image had a depth at all. Looking again at exactly the same two-second small-motion video data as before, we can identify a different problem and a different model to tackle it. What if a pixel is instead part of an occlusion: a fence blocking your view of some beautiful construction equipment, or a reflection on top of a museum artwork (see Figure 3)? If we move the camera, and the reflection shifts right as the art appears to shift left, we can no longer model this all as a single layer as before—a point on a static depth map can't suddenly split in half. Here we give up on estimating the exact geometry of the scene and instead focus on this layer separation problem, splitting the fence from its background. Our camera model remains exactly the same, but we swap the scene model from a depth map to two stacked image planes. For the foreground plane, we add an alpha channel that determines its opacity, allowing it to model translucent objects like reflections on glass, or transparent areas like the gaps in a fence. All we need to do now is match the motion of the content on these layers to the motion we observe in the video, and we have a perfect candidate model for the scene.
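A minimal sketch of that two-plane scene model (a toy construction of my own, not the paper's implementation): a foreground with an alpha channel composited over a background, where integer shifts stand in for the real, continuous per-layer motion.

```python
import torch

H, W = 64, 64
background = torch.rand(3, H, W)
foreground = torch.rand(3, H, W)
alpha = torch.zeros(1, H, W)
alpha[..., ::8] = 1.0        # thin vertical "fence slats"; the gaps are transparent

def render(frame_idx: int) -> torch.Tensor:
    # Parallax: the near fence shifts more per frame than the distant background.
    fg = torch.roll(foreground, shifts=3 * frame_idx, dims=-1)
    a = torch.roll(alpha, shifts=3 * frame_idx, dims=-1)
    bg = torch.roll(background, shifts=1 * frame_idx, dims=-1)
    return a * fg + (1.0 - a) * bg               # standard over-compositing

frames = [render(i) for i in range(5)]           # the burst our model must explain
```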


Here again, we run into the problem of ill-conditioning. If we try to learn an unconstrained flow volume, letting any point in the planes flow to any other location over time, we quickly run into optimization problems. This model has too many conflicting solutions. Why should it separate the fence onto one plane and the background on another when it can just move whatever color it needs to whatever pixel needs it? What we need again are some good constraints: Points don't jump around, and points that move together belong together. In other words, neighboring points on the same plane should have similar, smooth motion. But how do we enforce this? We can introduce a regularization term, penalizing the model for producing any abrupt changes in the flow of points. This can work, but seems wasteful; why should we let the model learn abrupt changes in points' positions and then negotiate with it to remove them? Intuitively, it would be much more sensible to have the model learn smooth flows from the start.
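For reference, the explicit penalty being argued against here might look like a total-variation-style term on the flow field; this is a hedged sketch of the general pattern, not the paper's exact regularizer.

```python
import torch

def flow_smoothness_penalty(flow: torch.Tensor) -> torch.Tensor:
    """Penalize abrupt changes between neighboring points of an (H, W, 2) flow field."""
    dy = (flow[1:, :, :] - flow[:-1, :, :]).abs().mean()
    dx = (flow[:, 1:, :] - flow[:, :-1, :]).abs().mean()
    return dx + dy

penalty = flow_smoothness_penalty(torch.randn(32, 32, 2))   # added to the loss, weighted
```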

This was exactly the principle behind the design of "neural spline fields," tiny neural networks that map coordinates to spline control points—the points that define a smooth polynomial curve. If the only knobs the model can turn during optimization are the points that define a smooth function, then the only functions it will ever produce are smooth ones. With this neural spline field flow, we can now optimize the model via gradient descent, confident that whatever motion model it learns in the process will be one that adheres to the constraints we've set. By setting good mathematical constraints, we create a model that, having never seen a fence before, can almost perfectly separate one from its background, turning what would otherwise be discarded motion data into a vehicle for layer separation.
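A hedged sketch of the neural spline field idea (the exact spline parameterization and inputs in the paper may differ): a tiny MLP predicts a handful of 2D control points per spatial location, and the flow at time t is read off a smooth curve through those points, so every motion the model can express is smooth by construction.

```python
import torch

class NeuralSplineField(torch.nn.Module):
    def __init__(self, n_ctrl: int = 4, hidden: int = 64):
        super().__init__()
        self.n_ctrl = n_ctrl
        self.net = torch.nn.Sequential(
            torch.nn.Linear(2, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 2 * n_ctrl))

    def forward(self, uv: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # uv: (N, 2) coordinates, t: (N,) times in [0, 1]  ->  (N, 2) flow
        ctrl = self.net(uv).reshape(-1, self.n_ctrl, 2)      # control points per location
        t = t.reshape(-1, 1, 1)
        # de Casteljau evaluation of the Bezier curve at t: smooth in t by construction.
        while ctrl.shape[1] > 1:
            ctrl = (1 - t) * ctrl[:, :-1] + t * ctrl[:, 1:]
        return ctrl[:, 0]

nsf = NeuralSplineField()
flow = nsf(torch.rand(8, 2), torch.rand(8))   # flow for 8 (coordinate, time) queries
```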

Image Stitching

"Neural Light Spheres for Implicit Image Stitching and View Synthesis." Ilya Chugunov Amogh Joshi, Kiran Murthy, Francois Bleibel, and Felix Heide. SIGGRAPH Asia 2024. [6]

Moving from accidental to intentional camera motion, in this task we investigate how to stitch images together from an input panoramic video capture. Where for depth reconstruction and layer separation we had many near-identical observations of a scene, for panorama stitching the camera rotates so quickly that points in the scene might only be observed for a total of three or four frames. Yet the objective remains the same: to find the simplest possible useful model that can explain our observations.

The search for this model, which took the better part of a year, proved similar to a popular story involving three bears. A purely spherical model proves under-parameterized: Pixels stuck to the inside of a globe fail to match the parallax motion we observe as our phone camera travels as much as a meter during data collection. A full light field approach, on the other hand, where the color of the scene depends on both the estimated camera location and the view angle, proves over-parameterized: This model fits our observations but extrapolates horribly when we try to render unobserved parts of the scene.

The final model ended up being a hybrid of these two approaches: a smooth deformation field applied to a sphere model to handle effects such as parallax, combined with a view-dependent color offset to handle occlusions, reflections, and lighting changes. The results exceeded our expectations, as we were able to reconstruct scenes recorded in the dead of night, or with billowing steam clouds, where baseline approaches faltered (see Figure 4). But it is the process I want to emphasize. While many similar scene reconstruction approaches start with the same data, video from a hand-held cell phone, they paradoxically begin by throwing away the information they want to reconstruct. They run their images through structure-from-motion software to recover the rotations and lens characteristics of the device, neglecting the fact that the same device already provides that information—calibrated to a higher precision than their software can ever recover.
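To make the hybrid concrete, here is a rough sketch under my own simplifying assumptions (not the released model): colors live on a sphere indexed by ray direction, a small learned deformation lets content slide on the sphere to absorb parallax, and a view-dependent offset absorbs reflections and lighting changes.

```python
import torch

class NeuralLightSphereSketch(torch.nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        def mlp(d_in: int, d_out: int) -> torch.nn.Module:
            return torch.nn.Sequential(
                torch.nn.Linear(d_in, hidden), torch.nn.ReLU(),
                torch.nn.Linear(hidden, d_out))
        self.deform = mlp(3 + 3, 3)       # (ray direction, camera position) -> direction offset
        self.color = mlp(3, 3)            # deformed direction -> base color on the sphere
        self.view_offset = mlp(3 + 3, 3)  # (deformed direction, ray direction) -> color offset

    def forward(self, ray_dir: torch.Tensor, cam_pos: torch.Tensor) -> torch.Tensor:
        d = torch.nn.functional.normalize(
            ray_dir + 0.1 * self.deform(torch.cat([ray_dir, cam_pos], -1)), dim=-1)
        return self.color(d) + 0.1 * self.view_offset(torch.cat([d, ray_dir], -1))

rgb = NeuralLightSphereSketch()(torch.randn(4, 3), torch.randn(4, 3))   # 4 sample rays
```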


The reason for this isn't spite or ignorance; it's just a lot of engineering work. These device values are not stored in the PNGs or MP4s recorded by the phone's default camera app. They are metadata floating in the aether of the phone's operating system that must be painstakingly extracted into a workable form, often through custom code that doesn't even earn a spot in the supplementary material of the accompanying academic publication. But it is thanks to this engineering work that we can build such a compact model for image stitching. We don't need to add potentially conflicting optimizable parameters for lens distortion when the phone gives us reliable estimates, and its gyroscope gives us an extra set of observations, rotation values, that help us fit the data more reliably.
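As one example of the kind of glue code this paragraph alludes to, the sketch below integrates a log of gyroscope angular velocities into per-frame rotations with SciPy; the log format, units, and sampling rate are assumptions, since every platform exposes this metadata differently.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def integrate_gyro(timestamps_s: np.ndarray, angular_velocity: np.ndarray) -> list:
    """Turn gyro samples (rad/s, device frame) into absolute rotations,
    starting from identity, by composing small rate * dt increments."""
    rotations = [Rotation.identity()]
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        delta = Rotation.from_rotvec(angular_velocity[i] * dt)   # small-angle increment
        rotations.append(delta * rotations[-1])
    return rotations

# Toy usage: a steady 0.5 rad/s yaw for one second ends up ~0.5 rad rotated.
ts = np.linspace(0.0, 1.0, 101)
omega = np.tile([0.0, 0.0, 0.5], (101, 1))
print(integrate_gyro(ts, omega)[-1].as_rotvec())
```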

The Mobile Photography Future

Mobile photography is dead, long live mobile photography.

While in the previous sections I outlined ways in which mobile photography problems parallel the design patterns we see in scientific and computational imaging, mobile photography also deserves a spotlight of its own. The past decade has seen rapid, and sometimes back-and-forth, evolution of the sensors and systems in cell phone devices. We've seen dual-, triple-, and penta-camera configurations, depth ranging systems with solid-state lasers, and even phones with a built-in infrared thermometer. Like transistors in the GPU sector, silicon CMOS sensor technology continues to shrink beyond levels we thought possible, fitting more and more photodiodes into ever-smaller spaces. This has led to the introduction of dual-, quad-, and even octa-pixel sensors into mobile devices—where each pixel is divided into multiple diodes to compare light signals, enabling ultra-fast autofocus capabilities. In combination with our recent research [7] into split-aperture coded imaging with split-pixel sensors (see Figure 5), we can imagine a future where nano-fabricated meta-lenses or diffractive optical elements make their way into consumer electronics: a joint miniaturization and specialization of the sensor and optical hardware for applications such as high-quality ultrawide and telephoto imaging, which is currently severely bottlenecked by the lack of available volume inside mobile devices (the same constraint behind the ever-increasing size of the camera bump on modern cell phones).

As the definition of a "mobile device" evolves, so too will its cameras. The modern cell phone could very well go the way of the fax machine over the next decades, supplanted by augmented or mixed reality devices with unique hardware solutions to unique imaging problems. We already see VR devices experimenting with a variety of active depth sensors, infrared cameras, and eye-tracking systems. I don't know what the future of mobile imaging holds, but it's likely going to involve a growing complexity of signals in search of compact reconstruction models, and it's likely still not going to involve tripods.

References

[1] Lustig, M., Donoho, D., and Pauly, J. M. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine 58, 6 (2007), 1182–1195.

[2] White, R. L. and Hanisch, R. J. (Eds.) The Restoration of HST Images and Spectra II. In Proceedings of a Workshop held at the Space Telescope Science Institute, Baltimore, Maryland. Space Telescope Science Institute, 1993.

[3] Delbracio, M., Kelly, D., Brown, M. S., and Milanfar, P. Mobile computational photography: A tour. Annual Review of Vision Science 7, 1 (2021), 571–604.

[4] Chugunov, I., Zhang, Y., and Heide, F. Shakes on a plane: Unsupervised depth estimation from unstabilized photography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, New York, 2023, 13240–13251.

[5] Chugunov, I., Shustin, D., Yan, R., Lei, C., and Heide, F. Neural spline fields for burst image fusion and layer separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, New York, 2024, 25763–25773.

[6] Chugunov, I., Joshi, A., Murthy, K., Bleibel, F., and Heide, F. Neural light spheres for implicit image stitching and view synthesis. In SIGGRAPH Asia 2024 Conference Papers (SA '24). ACM, New York, 2024.

[7] Shi, Z., Chugunov, I., Bijelic, M., Côté, G., Yeom, J., Fu, Q., Amata, H., Heidrich, W., and Heide, F. Split-aperture 2-in-1 computational cameras. ACM Transactions on Graphics 43, 4 (2024), 1–19.

Author

Ilya Chugunov is a Ph.D. candidate in the Princeton Computational Imaging Lab. He is an NSF graduate research fellow researching neural representations for mobile computational photography. His interests include 3D reconstruction, sensor fusion, and birds.

Figures

Figure 1. The modern cell phone is a portable computational imaging platform with an internet connection and fits as well into a scientific laboratory as it does into your pocket.

Figure 2. From just a tiny bit of natural hand motion, we can estimate high-quality depth maps for these subjects.

Figure 3. By separating what content moves together, we can almost perfectly remove the fence from this construction site, and both reveal the fox behind the glass and the scene behind me as I capture the data.

Figure 4. Implicitly stitching content together, the neural light sphere allows us to re-render our scene with a much larger field of view than we started with.

Figure 5. Split-coded optics like these could be the next step in bringing computational imaging into the mobile domain.


Copyright is held by the owner/author(s). Publication rights licensed to ACM.
