Future Technologies
Semantic Digital Twins: Enhancing Performance in Wireless Communication and LLM Inference
This paper presents a novel technique that combines ISAC with LLMs to extract features from sensor data and create a semantic digital twin (SDT).

By Wireless Advanced System Competency Centre, Huawei: Peiyao Chen, Yiqun Ge, Qifan Zhang, Wuxian Shi, Zheyuan Wei
1 Introduction
The development of artificial intelligence (AI), represented by large language models (LLMs), has opened doors to exciting possibilities across various industries such as manufacturing, healthcare, and transportation. From intelligent management of production lines to automated control of transportation systems and precise prediction in medical diagnostics, the applications of LLMs are gradually reshaping our lives and work, offering a more efficient, safer, and healthier future for society. LLMs possess the capability to understand and process contextual information, which enables them to generate coherent and meaningful responses, mimicking human-like communication.
With the advancement of wireless communication technology, 6G wireless systems are evolving toward higher frequencies (including millimeter-wave and even terahertz), wider bandwidths, and larger-scale antenna arrays. On the one hand, communication systems are acquiring capabilities similar to those of sensing systems: they use wide-coverage mobile communication signals to extract distance, angle, material, and other information from radio waves by analyzing direct, reflected, and scattered signals, and thereby perceive the attributes and states of target objects or the environment. On the other hand, sensing technologies, through high-precision positioning, environment reconstruction, etc., provide real-time replication of the physical world, i.e., they construct a parallel digital world known as a "digital twin". The digital twin helps enhance communication performance, for example, through precise beamforming and efficient channel state information (CSI) detection. Integrated sensing and communication (ISAC) is therefore expected to become a trend. Additionally, the combination of LLMs with sensing technologies endows devices such as sensors and cameras with intelligent perception capabilities. They not only identify, detect, and collect vast and diverse data but can also analyze and optimize the data, enabling them to perceive and understand the external environment. The future will witness the integration of ISAC with LLMs, jointly driving the era of 6G "Artificial Intelligence of Things (AIoT)".
In this vision, a significant portion of communication in future systems, particularly in 6G, will take place between sensors, robots, and other intelligent devices. The emergence of LLMs enables intuitive and efficient communication between humans and machines, as well as among machines, and advances the concept of semantic communication in 6G research. Semantic communication improves efficiency by conveying meaning rather than raw data, which reduces the volume of data transmitted. Through LLMs, information from various modalities (such as images, audio, and point clouds) can be extracted and transformed into a common tokenized representation. These discrete tokens, drawn from the LLM vocabulary, encapsulate the semantics of the underlying data regardless of its original modality. This offers exciting possibilities for seamless communication and information exchange among different devices and systems. Additionally, this token-based semantic communication method makes it easier to integrate information into knowledge graphs and other semantic representation frameworks, facilitating decision-making based on a comprehensive understanding of the environment. With context-aware communication, devices can dynamically adjust their behavior based on the surrounding environment and the overall system goals.
To realize the vision of efficient communication and AIoT, the demand for LLMs will extend beyond human users to encompass a vast network of IoT devices. However, performing LLM inference directly on these devices is often impractical due to their limited computational capabilities. Traditional cloud-based solutions introduce significant delays because of the round-trip communication between users and distant data centers, hindering real-time applications that demand instant responses. This is particularly problematic for time-sensitive tasks such as autonomous driving or industrial control, where milliseconds are crucial. This challenge calls for online LLM inference facilitated by advanced wireless systems, particularly at the base station (BS) level in the upcoming 6G era.
Thus, in future wireless communication systems, in addition to managing and controlling wireless communication and providing connectivity and communication services to user devices, BSs will also serve as central hubs for various AI models, each of which is designed for specific functionalities and applications. These pre-trained and validated models are strategically allocated to BSs within the core network, bringing AI capabilities closer to end users. Figure 1 shows such a future system.
This paper contributes to advancing wireless communication and AI inference efficiency by integrating ISAC with LLMs to establish a semantic digital twin (SDT). The rest of the paper is organized as follows. Section 2 introduces the detailed framework of SDT in wireless communication systems, focusing on the integration of ISAC and LLMs. Section 3 explores the applications of SDT in wireless communication and in enhancing AI inference. Section 4 concludes this paper.

Figure 1 Integrating ISAC and LLM in a 6G system
2 Semantic Digital Twin
The concept of the digital twin is revolutionizing our understanding and management of complex systems. Digital twins enable network operators to optimize performance by identifying coverage gaps, mitigating interference, and efficiently allocating resources. As we move toward 6G and beyond, digital twins are becoming essential for creating virtual replicas of physical wireless environments, encompassing everything from BSs and user devices to the surrounding terrain. These digital representations are constantly updated with real-time data, facilitating continuous monitoring, analysis, and prediction of system behavior.
In the vision of efficient communication, the integration of semantic communication and digital twin technology presents a compelling prospect for the future of 6G intelligent systems, namely, a tokenized representation of the real-time physical cellular world. Tokenized representations of sensor data and of wireless-related characteristics, such as CSI and channel quality indicator (CQI), play a crucial role in building an accurate and effective SDT in this context.
2.1 Semantic Sensor Data
The presence of LLMs enables sensors to comprehend specific tasks or objectives assigned to them when processing raw data. LLMs enhance efficiency because they allow sensors to focus attention and processing capabilities on relevant aspects and extract more meaningful information for specific scenes and tasks. This information will contain both semantic concepts related to tasks and additional attributes of objectives, and will be represented by the semantic token \(T^{s}\). For instance, a camera does much more than display basic image pixels. It detects individuals performing specific actions within a particular scene and encodes additional attributes such as their location and movement. Similarly, environmental sensors can convey semantic concepts such as "comfortable", "humid", or "polluted" based on predefined thresholds and environmental models, in addition to reporting temperature values, and provide alerts indicating whether the environment is abnormal.
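As a minimal sketch of the threshold-based mapping described above, the following Python snippet turns raw temperature and humidity readings into a semantic concept plus an abnormality alert. The class name, function name, threshold values, and the "uncomfortable" label are hypothetical and chosen only for illustration; the paper does not specify them.

```python
from dataclasses import dataclass


@dataclass
class SemanticReading:
    concept: str          # semantic label, e.g. "comfortable" or "humid"
    abnormal: bool        # alert flag for an abnormal environment
    temperature_c: float  # raw value kept alongside the semantic label
    humidity_pct: float


def describe_environment(temperature_c, humidity_pct,
                         humid_above=70.0,
                         comfort_range=(18.0, 26.0),
                         abnormal_above=40.0):
    """Map raw readings to a semantic concept using illustrative thresholds."""
    if humidity_pct > humid_above:
        concept = "humid"
    elif comfort_range[0] <= temperature_c <= comfort_range[1]:
        concept = "comfortable"
    else:
        concept = "uncomfortable"  # hypothetical fallback label
    abnormal = temperature_c > abnormal_above
    return SemanticReading(concept, abnormal, temperature_c, humidity_pct)


reading = describe_environment(22.0, 45.0)
print(reading.concept, reading.abnormal)
```

In a real deployment the thresholds would come from the environmental model mentioned above rather than from fixed constants.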
2.2 Tokenized Radio Channel Measurement
Given that the entire communication network functions as a vast sensor, wireless-related characteristics can significantly enhance the perception and comprehension of the physical world. This is achieved by extracting distance, velocity, and angle information from wireless signals. For example, analysis of CSI can reveal the presence of obstacles, identify different types of interference (e.g., co-channel interference or external sources), and detect the movement of objects within the coverage area. These insights can be encoded as tokens \(T^{c}\) such as "obstacle", "interference", or "movement" along with relevant parameters like location, loss, or direction. Furthermore, if BSs possess a comprehensive RF map of the environment and sufficient computational power, they can potentially reconstruct the coarse but complete CSI from these tokenized representations. This reconstruction ability facilitates a more efficient and compact representation of CSI, reducing the amount of data that needs to be transmitted while preserving essential information about the wireless environment.
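To make the idea of a channel token concrete, the sketch below maps a short window of CSI amplitude samples to a coarse event label with an attached loss parameter. This is only an illustrative heuristic, not the method used in the paper: the function name, the variance and loss thresholds, and the "clear" label are assumptions introduced here.

```python
from statistics import pvariance


def csi_to_token(amplitudes_db, reference_db,
                 movement_var=1.0, obstacle_loss_db=10.0):
    """Map a window of CSI amplitudes (dB) to a coarse channel token.

    Hypothetical rule: strong fluctuation suggests movement, and a large
    steady loss relative to a reference level suggests an obstacle.
    """
    mean_db = sum(amplitudes_db) / len(amplitudes_db)
    loss_db = round(reference_db - mean_db, 1)
    if pvariance(amplitudes_db) > movement_var:
        event = "movement"
    elif loss_db > obstacle_loss_db:
        event = "obstacle"
    else:
        event = "clear"  # hypothetical label for an unobstructed channel
    return {"event": event, "loss_db": loss_db}


print(csi_to_token([-75.0, -75.2, -74.8, -75.0], reference_db=-60.0))
```

A token of this form carries far fewer bits than the raw CSI window, which is the compactness argument made above.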
2.3 Semantic Digital Twin Representation
In this paradigm (illustrated in Figure 2), the semantic digital twin becomes a dynamic collection of semantic tokens, continuously updated with information from various sensors and radio channel measurements. Each sequence of tokens, representing a specific aspect or event within the environment, carries not only its inherent semantic meaning but also temporal and spatial context. Every piece of information within the digital twin is tagged with a timestamp and location stamp, creating a three-dimensional representation of the environment that encompasses time, space, and semantics (for event descriptions). This enriched digital twin transcends the role of a passive data collector and becomes an active participant in understanding and interpreting the environment.

Figure 2 Semantic digital twin
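One way to picture this representation is as a store of token sequences, each tagged with a timestamp and a location, that can be queried over a time window and a spatial region. The sketch below is a simplified illustration under that assumption; the class names, field names, and the two-dimensional location are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SDTEntry:
    tokens: tuple     # token sequence describing one event or object
    timestamp: float  # seconds
    location: tuple   # (x, y) in metres


@dataclass
class SemanticDigitalTwin:
    entries: list = field(default_factory=list)

    def update(self, entry):
        """Add a new observation from a sensor or channel measurement."""
        self.entries.append(entry)

    def query(self, t_start, t_end, center, radius):
        """Return entries inside a time window and a circular region."""
        cx, cy = center
        return [
            e for e in self.entries
            if t_start <= e.timestamp <= t_end
            and (e.location[0] - cx) ** 2 + (e.location[1] - cy) ** 2 <= radius ** 2
        ]


twin = SemanticDigitalTwin()
twin.update(SDTEntry(("person", "reading", "book"), 10.0, (1.0, 2.0)))
twin.update(SDTEntry(("obstacle",), 500.0, (30.0, 40.0)))
print(twin.query(0.0, 60.0, center=(0.0, 0.0), radius=5.0))
```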
Such a semantic digital twin is established by BSs. The main challenge during establishment lies in token fusion at each timestamp. We assume that the lengths of tokens \(T^{s}\) and \(T^{c}\) are equal; otherwise, an additional neural network projection is employed to align their lengths. We divide the tokens \(T = \left\{ T^{s},T^{c} \right\}\) into a certain number of clusters using token features, and then fuse the tokens in the same feature cluster, as shown in Figure 3. Note that tokens within the same feature cluster correspond to the same event or object, and the number of fused tokens varies across feature clusters. The clustering method used in this paper is a hybrid feature clustering method over semantic tokens. It comprises two main parts: token KNN, which clusters tokens based on the spatial similarity of their features, and token fusion, which uses large-scale models to account for semantic similarity. During model training, multiple semantic token attributes belonging to the same target or event are aggregated into a cluster using semantic graphs.

Figure 3 Token fusion process
Feature cluster: A variant of the "density peaks clustering based on k-nearest neighbors (DPC-KNN)" algorithm is used to create a feature cluster. Since the cluster centers are distinguished by their higher density compared to neighboring tokens as well as their relatively large distance from tokens with higher densities, both density \(\rho\) and relative distance \(\delta\) should be considered. Given a set of tokens \(T\), let \(\mathrm{NN}_{k}(t_{i})\) be the k-th nearest token to \(t_{i}\) according to semantic similarity. The k-nearest neighbors \(\operatorname{KNN}(t_{i})\) of \(t_{i}\) are defined as:
\(\operatorname{KNN}\left(t_{i}\right)=\left\{t_{j} \in T \left\lvert\, \frac{t_{i} \cdot t_{j}}{\left\|t_{i}\right\|\left\|t_{j}\right\|} \leq \frac{t_{i} \cdot \mathrm{NN}_{k}\left(t_{i}\right)}{\left\|t_{i}\right\|\left\|\mathrm{NN}_{k}\left(t_{i}\right)\right\|}\right.\right\}\) (1)
Then, the local density \(\rho_{i}\) of token \(t_{i}\) is obtained from the mean distance to its k nearest neighbors:
\(\rho_{i}=\exp\left(-\frac{1}{k} \sum_{t_{j}\in \operatorname{KNN}(t_{i})} \frac{t_{i}\cdot t_{j}}{\left\|t_{i}\right\|\left\|t_{j}\right\|}\right)\) (2)
The relative distance \(\delta_{i}\) is the minimum distance from \(t_{i}\) to any token with a higher local density; for the token with the highest density, it is the maximum distance to any other token:
\(\delta_{i}=\begin{cases} \min_{j: \rho_{j}>\rho_{i}} \frac{t_{i} \cdot t_{j}}{\left\|t_{i}\right\|\left\|t_{j}\right\|}, & \text{if } \exists j \text{ s.t. } \rho_{j}>\rho_{i} \\ \max_{j} \frac{t_{i} \cdot t_{j}}{\left\|t_{i}\right\|\left\|t_{j}\right\|}, & \text{otherwise} \end{cases}\) (3)
Let \(s_{i}=\rho_{i}\times \delta _{i}, i\in \left \{ 1,...,|T| \right \} \) denote the token score for each token \(t_{i}\). Then, the cluster centers are determined by selecting the tokens with the highest scores \(s_{i}\), and other tokens are then assigned to the nearest cluster center based on the semantic distances.
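The clustering steps above can be sketched in a few lines of NumPy. This is an illustrative reading of the procedure, not the authors' implementation. In particular, it assumes that the semantic distance between two tokens is the cosine distance (one minus the normalized inner product), so that "nearest" corresponds to smaller values; the function name `dpc_knn` and its parameters are hypothetical.

```python
import numpy as np


def dpc_knn(tokens, k, num_clusters):
    """Cluster token embeddings of shape (n, d) with a DPC-KNN-style rule.

    Assumption: distance(i, j) = 1 - cosine similarity of tokens i and j.
    Returns the indices of the cluster centers and a label per token.
    """
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T          # pairwise cosine distances
    np.fill_diagonal(dist, 0.0)

    # Local density: exp(-mean distance to the k nearest neighbours).
    # Column 0 of the sorted rows is the token itself and is skipped.
    knn_dist = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = np.exp(-knn_dist.mean(axis=1))

    # Relative distance: min distance to any denser token, or the max
    # distance to any token when no denser token exists.
    denser = rho[None, :] > rho[:, None]   # denser[i, j]: rho_j > rho_i
    delta = np.where(denser, dist, np.inf).min(axis=1)
    no_denser = ~denser.any(axis=1)
    delta[no_denser] = dist[no_denser].max(axis=1)

    # Centers are the tokens with the highest score rho * delta.
    score = rho * delta
    centers = np.argsort(-score)[:num_clusters]

    # Every other token joins its nearest center.
    labels = dist[:, centers].argmin(axis=1)
    labels[centers] = np.arange(len(centers))
    return centers, labels


tokens = np.array([[1.0, 0.0], [0.99, 0.05], [0.98, -0.04],
                   [0.0, 1.0], [0.03, 0.99], [-0.06, 0.98]])
centers, labels = dpc_knn(tokens, k=2, num_clusters=2)
print(centers, labels)
```

Here the number of clusters is passed in explicitly; how it is chosen in practice is not stated in the text above.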
Token fusion: A transformer block is applied to each feature cluster to capture the semantic relationships and information interaction between different tokens in the same feature cluster, resulting in fused token clusters \(\tilde{T}_{n}\).
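As a rough illustration of the fusion step, the sketch below applies a single-head self-attention layer with a residual connection to the tokens of each feature cluster. It is a heavily simplified stand-in for a transformer block: layer normalization, the feed-forward sublayer, and multiple heads are omitted, and the projection matrices are random placeholders rather than trained weights. All function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_cluster(cluster_tokens, w_q, w_k, w_v):
    """Self-attention over the tokens of one cluster, plus a residual."""
    q = cluster_tokens @ w_q
    k = cluster_tokens @ w_k
    v = cluster_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (m, m) attention weights
    return cluster_tokens + attn @ v


def fuse_all(tokens, labels, w_q, w_k, w_v):
    """Fuse each feature cluster independently; clusters may differ in size."""
    return {int(c): fuse_cluster(tokens[labels == c], w_q, w_k, w_v)
            for c in np.unique(labels)}


rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))
labels = np.array([0, 0, 1, 1, 1])
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
fused = fuse_all(tokens, labels, w_q, w_k, w_v)
print({c: t.shape for c, t in fused.items()})
```

Because attention is computed only within a cluster, tokens describing different events or objects do not exchange information at this stage.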
For feature clusters at different timestamps, pairing is based on similarity distances: two clusters match only if the similarity distance between their centers is less than a given threshold \(d_{c}\). When making decisions or performing similar tasks, all corresponding feature clusters across time and space are considered, which enhances accuracy and opens up new possibilities for more harmonious interaction between the physical and digital worlds.
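The pairing rule can be sketched as a nearest-center search followed by a threshold test. The snippet below assumes cosine distance between cluster centers and a simple greedy match of each current cluster to its closest previous cluster; both choices, as well as the function names, are illustrative assumptions rather than details given in the paper.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def match_clusters(prev_centers, curr_centers, d_c):
    """Pair each current cluster with its nearest previous cluster.

    A pair is kept only if the distance between centers is below d_c.
    Returns a dict {current index: previous index}.
    """
    matches = {}
    if not prev_centers:
        return matches
    for j, center in enumerate(curr_centers):
        i = min(range(len(prev_centers)),
                key=lambda idx: cosine_distance(prev_centers[idx], center))
        if cosine_distance(prev_centers[i], center) < d_c:
            matches[j] = i
    return matches


prev_centers = [[1.0, 0.0], [0.0, 1.0]]
curr_centers = [[0.1, 1.0], [1.0, 1.0], [1.0, 0.05]]
print(match_clusters(prev_centers, curr_centers, d_c=0.05))
```

Clusters left unmatched, such as the second current cluster in this example, would be treated as new events or objects.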
3 Applications of Semantic Digital Twins
The spatiotemporal SDT plays a crucial role in both wireless communication and LLM inference.
3.1 In Wireless Communication
By analyzing and integrating historical and real-time data, SDT can help optimize resource allocation and signal processing. Specifically, in technologies like beamforming, it precisely locates signal transmission directions to maximize signal reception efficiency.
In traditional beamforming techniques, directional transmission typically relies on the geographical position of devices or specific signal sources. However, through SDT, the system can recognize and understand specific user activities or states, such as identifying a user's posture or behavior while reading a book. This personalized localization transcends rigid geographical boundaries and signal sources, focusing instead on user behavior and needs. Based on this information, the system adjusts the beamforming direction of the antenna array to precisely target specific user devices. Furthermore, the system can quickly respond to changes in user posture or environmental conditions, dynamically adjusting the beam direction to maintain communication continuity and efficiency. These capabilities enhance the flexibility and adaptability of communication systems and significantly improve user experience and service quality.

Figure 4 Demonstration of SDT with a person holding a book
Figure 4 presents a real-time demonstration of the SDT detecting a person holding a book. In our demonstration, multiple types of sensing devices are used, such as cameras and lasers. To align the data collected by these diverse sensing devices, the token fusion method proposed in Section 2.3 is employed, enabling feature extraction and matching of targets and objects across multiple devices. The detection is divided into environmental detection and semantic detection. The former involves detecting static objects that current LLMs can handle effortlessly. The latter refers to understanding and detecting human actions, which requires analyzing and integrating the relative positions and states of the target individual and surrounding objects. In the demonstration, we maintain two queues: one for semantic states S(p), and the other for the relative positions L(p, o) of individuals and objects corresponding to these states, where p refers to the index of detected persons and o denotes the index of detected objects. Subsequently, contrastive learning between semantic states and relative positions is employed to enhance the precision of detecting and understanding human postures. The entire process is depicted in Figure 5.

Figure 5 Process of SDT construction
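The paper does not state which contrastive objective is used between the semantic states S(p) and the relative positions L(p, o). As one plausible illustration, the sketch below computes an InfoNCE-style loss, assuming both queues have already been encoded as embedding vectors and that row i of each array belongs to the same person. The function name, the temperature value, and the embedding inputs are assumptions.

```python
import numpy as np


def info_nce(state_emb, position_emb, temperature=0.1):
    """InfoNCE-style loss between paired state and position embeddings.

    Row i of state_emb and row i of position_emb form a positive pair;
    all other rows in the batch act as negatives.
    """
    s = state_emb / np.linalg.norm(state_emb, axis=1, keepdims=True)
    p = position_emb / np.linalg.norm(position_emb, axis=1, keepdims=True)
    logits = s @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


states = np.eye(3)                   # toy embeddings of S(p)
positions_aligned = np.eye(3)        # matching L(p, o) embeddings
positions_shuffled = np.eye(3)[[1, 2, 0]]
print(info_nce(states, positions_aligned) < info_nce(states, positions_shuffled))
```

Minimizing such a loss pulls the embedding of a semantic state toward the embedding of the relative positions observed with it, and pushes it away from the positions observed with other states.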
3.2 In Enhancing AI Inference
SDT can significantly enhance AI inference in several key aspects:
- Precise visual cropping: Effective performance in visual question answering (VQA) tasks using multimodal LLMs is crucial for applications in medical diagnosis and intelligent transportation. As shown in [1], the size of the visual subject in the question significantly affects model sensitivity. Larger visual subjects tend to improve accuracy in related question answering. Conversely, smaller or blurry details often challenge models, impairing their ability to process subtle visual cues effectively. Therefore, precise image cropping, which enables models to focus on critical visual regions, notably enhances accuracy and efficiency in VQA tasks. Unlike conventional methods (e.g., [1]) that focus on single-image cropping, SDT provides a global view of the environment through token representation, enabling more accurate cropping.
- Context-aware prediction: Current visual LLMs are optimized for single-image tasks and lack temporal memory. Direct training of video LLMs is resource-intensive due to the voluminous nature of video data. Tasks involving actions like "pick up" and "put down" require contextual information for accurate interpretation, which single-frame analysis alone may not provide. Inference tasks, especially those predicting regular actions or scenes, can benefit from supplying SDT's spatiotemporal knowledge to LLMs. For instance, in scenarios where robots navigate crowded areas, SDT offers insights into obstacles, pedestrian movement, and social distancing guidelines, significantly improving the accuracy and effectiveness of robotic actions.
- Effective prompt engineering: SDT can assist in improving and optimizing prompt engineering for LLMs by analyzing and understanding past language data. This refinement enables the inference engine to make more informed and contextually relevant decisions. Consider a scenario where a robot is tasked with retrieving food. If the robot solely relies on its onboard sensors, its capabilities are inherently limited to its immediate surroundings, lacking access to historical environmental information. In such cases, if no visible food items are nearby, the robot might fail to complete its task. However, integrating the SDT's spatiotemporal awareness into the inference process expands the robot's perception beyond its immediate environment. The SDT's collective memory function provides insights into past events and the environment's history, filling knowledge gaps for the robot. For instance, its prompt may contain information about the location of food items in a particular drawer, even if the robot cannot directly observe them. Armed with this background knowledge, the inference engine can guide the robot effectively, enabling it to successfully retrieve the desired food. This example underscores the transformative impact of integrating SDT technology with robotic inference, enhancing robots' intelligence and adaptability in complex environments.
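The food-retrieval example in the last item can be sketched as a prompt builder that prepends recent SDT observations to the task description. This is an illustrative assumption about how SDT memory might be passed to an LLM; the function name, the memory record fields, and the prompt wording are hypothetical.

```python
def build_prompt(task, memory, now, max_age_s=3600.0, max_facts=3):
    """Prepend recent SDT observations to a task prompt.

    memory is a list of dicts with "timestamp" (seconds), "location",
    and "event" keys. Observations older than max_age_s are dropped.
    """
    recent = [m for m in memory if now - m["timestamp"] <= max_age_s]
    recent.sort(key=lambda m: m["timestamp"], reverse=True)  # newest first
    lines = [
        f"- [{m['timestamp']:.0f}s, {m['location']}] {m['event']}"
        for m in recent[:max_facts]
    ]
    context = "\n".join(lines) if lines else "- (no relevant observations)"
    return f"Known environment history:\n{context}\n\nTask: {task}"


memory = [
    {"timestamp": 3900.0, "location": "kitchen drawer",
     "event": "food placed in drawer"},
    {"timestamp": 0.0, "location": "table", "event": "cup moved"},
]
print(build_prompt("Retrieve food for the user.", memory, now=4000.0))
```

With the drawer observation in its prompt, the inference engine can direct the robot to a location that its onboard sensors cannot currently see.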
4 Conclusion
This paper introduces a novel approach integrating ISAC with LLMs to establish an SDT, where semantic tokens represent sensor data from sensing devices and radio channel measurements. These tokens are fused according to their feature clusters. By assimilating historical data, SDT enhances wireless communication performance, particularly in providing precise beamforming and personalized user localization. Additionally, it enhances the precision and efficiency of AI inference tasks through accurate visual cropping, context-aware prediction, and effective prompt engineering. This integrated method holds significant promise for advancing intelligent systems in both domains. Future research can further explore SDT's potential across diverse applications, including autonomous driving, smart manufacturing, and environmental monitoring, achieving comprehensive deployment and advancement of IoT technologies.
[1] Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski, "Visual cropping improves zero-shot question answering of multimodal large language models," arXiv preprint arXiv:2310.16033, 2023.