The future of mixed reality is adaptive


By David Lindlbauer


Tags: Interaction devices, Interactive systems and tools, Mixed / augmented reality, Virtual reality


Virtual reality (VR) is here to stay. After decades of innovation in display, sensing, and input technologies, VR devices are readily available, enabling a wide range of applications in entertainment, health, training, and education. VR offers users fully immersive experiences in virtual worlds filled with virtual objects. Designing such worlds involves carefully tailoring all components, including user interface elements, avatars, and game objects, to individual applications. Designed as an alternative world, VR and its components are disconnected from the users' physical environment. In this sense, VR interfaces are an extension of traditional computing devices such as smartphones, tablets, or notebooks. Virtual contents are "hosted" on a display and inherently disconnected from the physical world. This bounded interaction space is tremendously advantageous for content creators such as designers or programmers. When designing applications for 2D displays or VR, we consider factors such as screen resolution, aspect ratio, or whether the device is operated using touch or a mouse. In most situations, we do not have to consider the context, e.g., whether users are in the office or at home, alone or with their friends, or whether the ambient light makes the virtual element we display less legible. We can assume all interface elements are visible, within reach, and readable if they are well designed in the first place.

Mixed reality (MR)a has the potential to fundamentally change this paradigm of how we create user interfaces. If we extrapolate the current developments in display technologies and sensing, I believe in the near future, we will have access to a see-through head-mounted display that is lightweight, unobtrusive, and has a large field of view. It will sense the environment through object detection and semantic segmentation algorithms and display virtual contents that appear to be embedded in the physical world. Imagine the following scenario. You are in your living room wearing a see-through MR display. You see a messenger application floating on top of your coffee table, indicating that you have a message from a friend. You are watching a YouTube video on a virtual 50-inch screen on the other side of the room while a small calendar application floats next to it to gently remind you of your dentist appointment in two hours. Next to it is your to-do app, which shows you should go grocery shopping after your appointment. The applications appear to be fully embedded in the environment, just like other decorative objects, capturing barely any attention and showing just the necessary information.b Suppose you receive a call from a friend who wants to confirm your evening plans. The information you now need is: How do you get to the other end of the city, what are some good restaurants, are there any movies playing, and did you remember to buy a birthday gift? None of the information in your current virtual environment helps you answer any of those questions, and you frantically scramble to bring up a map, restaurant recommender app, movie planner, etc. In such a scenario, your virtual environment didn't help you at all, and you had to manually pull all the information by changing apps.


When we think about adapting MR systems, we should be looking beyond "simple" appearance changes of classical responsive interfaces.


This example highlights one of the dozens of instances we face every day where we switch our context. We change our tasks, environments, social settings, or our "internal" state from being awake after a cup of coffee to being tired at the end of a workday. Those switches can happen gradually or instantaneously. If we access digital information on a smartphone, switching tasks is simple. We open and close the apps we want and are good to go; chances are we do not have many apps open simultaneously, as our screen space is very limited. For spatially distributed apps in MR, however, switching requires more effort. Since these apps can be presented anywhere within your environment, they might be out of reach, occluded by other objects or people in the environment, or not currently visible. Since we switch contexts many times a day, manually changing where and when applications should be displayed is too much of a burden on users. On the other hand, content creators, such as application designers, cannot foresee all the possible contexts users will be in. One simply cannot design a spatial layout for applications that accounts for all possible layouts of a living room, for example, or for all other variations of users' context. This means for MR to be beneficial for users, it needs to be able to sense and adapt to users' current context. This article outlines what I believe will be key factors for the future of adaptive MR interfaces. In essence, we need to sense users' current context as input for adaptive MR systems, and then adapt various interface factors such as the placement, appearance, and level of detail of the virtual interface elements as the output of the system. I will describe examples of systems and methods to create such adaptive MR systems, and what I believe will be some interesting directions to achieve a future where MR is truly beneficial for users.

Input: When Should MR Systems Adapt?

Any adaptive MR system needs to be able to sense users' context. The term context, however, is challenging to define. Dey defines context as follows: "Context is any information that can be used to characterize the situation of an entity. An entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and applications themselves" [1]. While this definition emphasizes that context encompasses many different aspects, it only provides a limited view of what information we actually need to obtain in order to make effective adaptations to an MR system. On one hand, context needs to be very specific: There is a living room with a couch, table, three chairs, TV, etc., and a user who is sitting on the couch writing a research article and does not want to be disturbed, but they are easily distracted. Only when we define situations accurately and specifically are we able to know whether an application is useful. On the other hand, contextual understanding in MR needs to be general because it is impossible to enumerate all possible rooms users will be in, tasks they will perform, or combinations of the two. Those two types of context, specific and general, often conflict with each other. Therefore, in the work of my research group, we typically start with the question: "What information is absolutely necessary to enable our system to adapt correctly?" Rather than focusing on all the information we could gather, this forces us to keep any sensing to a minimum.

Besides specific and general context, we can additionally distinguish between extrinsic and intrinsic contextual information. Extrinsic information includes anything from the furniture arrangement or the number of people in a space to users' current task or their body and hand pose. Extrinsic information is observable through sensors, such as cameras, that are in a space or worn by users. With intrinsic information, I refer to users' internal state, from cognitive and perceptual load to stress or mood. This type of information is inherently ambiguous and challenging to observe. Both extrinsic and intrinsic information are important factors for any adaptive MR system.
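To make this distinction concrete, here is a minimal sketch of how extrinsic and intrinsic information might be bundled into a single context record that an adaptive MR system consumes. All field names are illustrative assumptions, not an actual schema from the systems discussed here.

```python
from dataclasses import dataclass, field


@dataclass
class ExtrinsicContext:
    """Information observable through sensors in the space or worn by users."""
    object_types: list = field(default_factory=list)      # e.g., ["couch", "table", "tv"]
    object_positions: dict = field(default_factory=dict)  # name -> (x, y, z) in meters
    people_count: int = 1
    current_task: str = "idle"                             # e.g., "writing", "watching_video"


@dataclass
class IntrinsicContext:
    """Users' internal state; inherently more ambiguous and harder to observe."""
    cognitive_load: float = 0.5                            # normalized estimate in [0, 1]
    preferences: dict = field(default_factory=dict)        # e.g., prior layouts the user accepted


@dataclass
class UserContext:
    extrinsic: ExtrinsicContext
    intrinsic: IntrinsicContext
```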

In my research group, we rely on both extrinsic and intrinsic information to adapt when, where, and how to display virtual elements in MR. As extrinsic information, we use aspects such as the physical objects in a space, specifically their type and position. In our work on context-aware MR [2], for example, we used the geometric information of a space and users' current perspective to determine whether a virtual element would be visible. Visible objects were then displayed as world-anchored elements (positioned relative to the environment), whereas occluded elements were displayed as view-anchored elements (positioned relative to users' field of view). We then leveraged intrinsic information to further adapt the MR interfaces, as shown in Figure 1. In our work SemanticAdapt [3], we demonstrated an approach that computes the semantic association between the physical objects and the virtual interface elements. A virtual music application, for example, is semantically closer to a (physical) pair of headphones than to a cup. We then exploited this information for optimizing the placement of virtual interface elements. The workflow is illustrated in Figure 2.
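As a sketch of the world-anchored versus view-anchored decision described above (in the spirit of [2], not its actual implementation), the snippet below stubs out the visibility test; a real system would ray cast against the reconstructed room geometry provided by the MR runtime.

```python
from enum import Enum


class Anchoring(Enum):
    WORLD = "world"   # positioned relative to the environment
    VIEW = "view"     # positioned relative to users' field of view


def choose_anchoring(element_pos, user_pos, is_occluded):
    """Return WORLD anchoring when the element's spot is visible to the user,
    and VIEW anchoring when it is occluded by the environment.

    is_occluded: callable(element_pos, user_pos) -> bool, assumed to be backed
    by a ray cast against the room mesh in a real system.
    """
    if is_occluded(element_pos, user_pos):
        # The world-anchored spot is hidden; keep the element accessible
        # by attaching it to the user's view instead.
        return Anchoring.VIEW
    return Anchoring.WORLD


# Usage with a trivial stand-in occlusion test:
never_occluded = lambda element, user: False
print(choose_anchoring((1.0, 0.5, 2.0), (0.0, 1.6, 0.0), never_occluded))  # Anchoring.WORLD
```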

Both approaches also leveraged different types of intrinsic information. SemanticAdapt, for example, took users' preferences into account. Every time the system adapted a layout, it took users' prior layouts into account and used those to "warm start" the optimization procedure. In comparison, our work on context-aware MR took a different, more direct approach. We were interested in how to leverage users' cognitive load to adapt how much information an MR interface would display. We estimated users' cognitive load by calculating the Index of Pupillary Activity (IPA) [4]. The IPA effectively calculates the frequency of meaningful changes in pupil size, which is positively correlated with cognitive load (higher load leads to more changes in pupil size). Our system used this information to decrease the amount of displayed information when we detected a higher cognitive load and showed more information when users' cognitive load was estimated to be lower.
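A minimal sketch of this adaptation rule is shown below: a normalized, IPA-based load estimate is mapped to how many interface elements the system displays. The IPA computation itself (pupil-diameter analysis [4]) is treated as a black box, and the calibration bounds are hypothetical per-user values.

```python
def level_of_detail(ipa_estimate, low, high, min_items=1, max_items=6):
    """Map an IPA-based cognitive load estimate to a number of elements to show.

    low/high: per-user calibration bounds for the IPA value; values outside
    this range are clamped. Higher estimated load yields fewer items.
    """
    span = max(high - low, 1e-9)
    load = min(max((ipa_estimate - low) / span, 0.0), 1.0)  # normalize to [0, 1]
    return round(max_items - load * (max_items - min_items))


# Near the user's calibrated maximum load, only a single element is shown.
assert level_of_detail(0.95, low=0.2, high=1.0) == 1
# At the low end, the full set of elements is shown.
assert level_of_detail(0.2, low=0.2, high=1.0) == 6
```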

In the previous examples, we leveraged the type of objects in a space and room geometry as extrinsic information, and users' preference and estimated cognitive load as intrinsic information. These represent only a small fraction of the contextual information that could be useful for adaptive systems. The selection of input information always has to strike a balance between sensing complexity, privacy, and usefulness, which arguably has to be decided on a per-application basis.

Output: What Should MR Systems Adapt?

The parameters for changing the content and look-and-feel of MR interface elements are plentiful, just as for traditional 2D interfaces. MR systems can take inspiration from responsive websites, for example, where designers modify parameters of the virtual elements' appearance such as the information density based on screen size (desktop versus mobile browser) [5], the color based on ambient light ("dark mode"), the size of individual buttons based on input modality (mouse versus touch), or the interface layout based on users' capabilities [6]. Because MR interfaces can be embedded as 3D objects in the physical world, parameters relating to virtual objects' spatial relationship with the world geometry, with users, and with each other also need to be considered. The optimal position, rotation, and scale of the virtual elements for any MR interface are influenced by many input parameters such as users' position and the world geometry. In our work on context-aware MR, we automatically controlled the visibility, level of detail, and type of placement (world-anchored versus view-anchored). In SemanticAdapt, we controlled the placement and scale of virtual interface elements.
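To summarize these output parameters, here is an illustrative sketch (not taken from the cited systems) of the per-element state an adaptive MR system might decide on: visibility, level of detail, anchoring, and a spatial transform in the chosen reference frame.

```python
from dataclasses import dataclass


@dataclass
class ElementOutput:
    visible: bool = True
    level_of_detail: int = 2                # e.g., 0 = icon only, 2 = full widget
    anchoring: str = "world"                # "world" or "view"
    position: tuple = (0.0, 1.5, 1.0)       # meters, in the chosen reference frame
    rotation_deg: tuple = (0.0, 0.0, 0.0)   # Euler angles
    scale: float = 1.0
```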


The selection of input information always has to strike a balance between sensing complexity, privacy, and usefulness.


Besides appearance and spatial positioning, the representation of virtual objects, and the information they aim to convey, is an important parameter. Representation can refer to a variety of ways to present virtual objects in MR. For example, we can consider whether to display an element with a 2D representation like a virtual window, just like a traditional desktop or smartphone interface. However, some information lends itself very well to a 3D representation, such as 3D models, maps, or historical artifacts in an educational context. In addition to this, representation could also mean we completely modify the way we present information to users. In our work on navigation instructions in MR, for example, we performed two studies to find out which type of representation users prefer in different navigation scenarios [7]. We compared different types of navigation instructions (see Figure 3) such as arrows on the ground, avatars that users needed to follow, path visualizations, and more drastic changes such as desaturating the environment except for where users needed to walk. In a crowdsourcing study and a VR study, we asked users to state their preferences when using the instructions for different tasks such as strolling to a restaurant or rushing to an airport gate. There were clear differences in preferences between casual browsing scenarios and rushed scenarios, with more drastic changes to the environment being appreciated in the latter to avoid distraction. This indicates that when we think about adapting MR systems, we should be looking beyond "simple" appearance changes of classical responsive interfaces, and present users with a wide range of options to find the optimal representation of virtual contents for a given task, environment, social setting, or state of mind.

Methods: How Should MR Systems Adapt?

Users' context changes dozens of times a day, and neither the timing nor the type of change is predictable by content creators. This means that top-down design approaches (i.e., design offline, then deploy) like those used for 2D interfaces or VR will no longer work. Any adaptive interface will fundamentally rely on automatic or semi-automatic methods that control the output parameters of the virtual interface elements. Based on our prior work, I believe future systems will rely on a mix of methods with varying complexity and computational cost.

There exists a plethora of methods that can be used to adapt the various parameters of adaptive interfaces, whether for traditional desktop and mobile UIs, VR, or MR. On the simplest level, we can use heuristics to determine how an adaptive interface should behave. For example, we could define that certain apps automatically become visible once users arrive at a certain location. Such rule-based systems, however, do not scale well, as we have to account for many different situations. Heuristic optimization, as an extension, enables us to combine multiple heuristics into one or more objective functions. We can then use methods such as greedy search to determine how our system should output information. An interesting example of how to optimize the layout of an MR game given the geometry of a room is demonstrated in FLARE [8]. Heuristic optimization can be efficient and allows solving problems that cannot be formulated easily using mathematical functions. However, such methods mostly do not give us any guarantees about the quality of a specific solution. In contrast, exact methods, including combinatorial optimization and linear programming, provide such guarantees and can be solved efficiently using state-of-the-art solvers like Gurobi. By defining an objective function (the goal) and constraints (the rules), these methods can be used to solve assignment problems. In our work, we used integer linear programming to automatically decide which virtual elements to display by maximizing the utility of an interface (the objective) while not exceeding the cognitive load of users (the constraint) [3]. Lastly, there exist many learning-based approaches that use methods such as reinforcement learning to infer which MR label to display based on users' current gaze trajectory [9]. While these methods offer promising generalizability, challenges in data collection, explainability, and computational complexity remain.
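As a toy stand-in for the integer-programming formulation mentioned above, the brute-force search below maximizes the summed utility of displayed elements subject to a cognitive load budget. The actual system in [3] uses an ILP solver and richer constraints; the utilities and load costs here are made-up numbers purely for illustration.

```python
from itertools import combinations


def select_elements(candidates, load_budget):
    """candidates: list of (name, utility, load_cost) tuples.
    Returns the subset with maximum total utility whose total load cost
    stays within load_budget (exhaustive search; fine for a handful of apps)."""
    best_subset, best_utility = (), 0.0
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            utility = sum(u for _, u, _ in subset)
            load = sum(c for _, _, c in subset)
            if load <= load_budget and utility > best_utility:
                best_subset, best_utility = subset, utility
    return [name for name, _, _ in best_subset]


apps = [("messenger", 0.6, 0.3), ("calendar", 0.8, 0.2),
        ("todo", 0.5, 0.2), ("video", 0.9, 0.6)]
print(select_elements(apps, load_budget=0.7))  # -> ['messenger', 'calendar', 'todo']
```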


This means for MR to be beneficial for users, it needs to be able to sense and adapt to users' current context.


All these methods have to perform the challenging task of adapting MR contents in a way that is beneficial for users, while having no guarantees that the input information they receive is correct. While methods to infer extrinsic information, such as pose estimation or object recognition, are of increasingly high quality, intrinsic information such as cognitive load is inherently challenging to work with. Adaptive MR interfaces, therefore, always need to be built with imperfect information in mind and need to work with high levels of uncertainty. Reasoning under this type of uncertainty is one of the main challenges of the overall field of computational interaction, independent of whether the application is in MR, 2D interfaces, or input sensing.

Is the Future of Mixed Reality Adaptive?

Personally, I am excited about the future of mixed reality. The main reason is that even though the topic has seen decades of research, we have the opportunity to build it from the ground up. The hardware and software are not yet ready for end consumers, and with HCI researchers involved in its academic and industrial development, we can shape the future of interaction beyond flat displays. Simply transferring what we know from 2D displays to MR will not be sufficient to convince anyone that allowing the virtual world to permeate the physical environment is a good idea. We need to build the technology in a way that puts an emphasis on being accessible, privacy-aware, explainable, and ultimately beneficial for users. This means that even though we need optimization-based and learning-based methods to adapt to users' ever-changing context, we need to make sure the systems we develop are predictable, avoid distraction, and blend into the physical environment in a way that is superior to current technologies such as phones, tablets, or smartwatches. Most current technologies constantly demand our attention. The goal of adaptive MR is to draw attention away from the virtual world and instead embed virtual elements seamlessly into our physical environment, making them less intrusive and more beneficial for us.

Acknowledgments

This article is shaped by the research we do at the Augmented Perception Lab at Carnegie Mellon University, the work I did with Otmar Hilliges and Anna Feit at ETH Zurich, as well as many conversations on the topic with brilliant students and colleagues from academia and industry. The notion of computational interaction as a framing for the collaborative nature of user interface generation and adaptation is key for many aspects of my thinking. Check out Computational Interaction by Oulasvirta et al. [10] as a good starting point. Thanks to Hyunsung Cho, Catarina Fidalgo, and Yifei Cheng for providing valuable feedback on this article.

References

[1] Dey, A.K. Understanding and using context. Personal and Ubiquitous Computing 5, 1 (2001), 4–7.

[2] Lindlbauer, D., Feit, A.M. and Hilliges, O. Context-aware online adaptation of mixed reality interfaces. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology. ACM, New York, 2019, 147–160.

[3] Cheng, Y., Yan, Y., Yi, X., Shi, Y. and Lindlbauer, D. SemanticAdapt: Optimization-based adaptation of mixed reality layouts leveraging virtual-physical semantic connections. In The 34th Annual ACM Symposium on User Interface Software and Technology. ACM, New York, 2021, 282–297.

[4] Duchowski, A.T., Krejtz, K., Krejtz, I., Biele, C., Niedzielska, A., Kiefer, P., Raubal, M. and Giannopoulos, I. The index of pupillary activity: Measuring cognitive load vis-à-vis task difficulty with pupil oscillation. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, New York, 2018, 1–13.

[5] DiVerdi, S., Hollerer, T. and Schreyer, R. Level of detail interfaces. In Proceedings of the Third IEEE and ACM International Symposium on Mixed and Augmented Reality. ACM, New York, 2004, 300–301.

[6] Gajos, K.Z., Weld, D.S. and Wobbrock, J.O. Automatically generating personalized user interfaces with Supple. Artificial Intelligence 174, 12–13 (2010), 910–950.

[7] Lee, J., Jin, F., Kim, Y. and Lindlbauer, D. User preference for navigation instructions in mixed reality. In 2022 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 2022, 802–811.

[8] Gal, R., Shapira, L., Ofek, E. and Kohli, P. FLARE: Fast layout for augmented reality applications. In 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2014, 207–212.

[9] Gebhardt, C., Hecox, B., van Opheusden, B., Wigdor, D., Hillis, J., Hilliges, O. and Benko, H. Learning cooperative personalized policies from gaze data. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology. ACM, New York, 2019, 197–208.

[10] Oulasvirta, A., Kristensson, P.O., Bi, X. and Howes, A. eds. Computational Interaction. Oxford University Press, 2018.

Author

David Lindlbauer is an assistant professor in the Human-Computer Interaction Institute at Carnegie Mellon University. His research focuses on understanding how humans perceive and interact with digital information, and on building technology that goes beyond flat displays to advance our capabilities when interacting with the virtual world. To achieve this, he creates and studies enabling technologies and computational approaches that control when, where, and how virtual content is displayed to increase the usability of mixed reality interfaces.

Footnotes

a. I use mixed reality as my personal favorite among augmented reality, extended reality, mediated reality, etc.

b. I leave the question of whether we want a deeper integration of digital information into our lives, or whether the separation is the only thing that keeps us sane, to future articles. Done poorly, the door might certainly be open to ad-fueled, attention-grabbing virtual contents. Done right, I would argue digital information might finally take a back seat and not govern such a big portion of our waking moments.

Figures

Figure 1. Screenshots of an adaptive MR system that controls when, where, and how to display virtual user interface elements based on users' cognitive load, task, and environment. Image from [2].

Figure 2. Workflow of SemanticAdapt. The system automatically places virtual elements based on the semantic connection between the physical environment and the virtual contents. Image from [3].

Figure 3. We tested different navigation instructions (avatar, callouts, arrows, and desaturation) for user preference. The results were highly dependent on the scenario users were presented with. Image adapted from [7].


This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright © 2022 ACM, Inc.