Are Semantic Communication and Generative AI Effective Solutions for Carrier Pipes?
How can carriers efficiently enlarge pipes from the supply side? How can they efficiently fill the pipes from the demand side without a killer application?
Since the birth of the communications industry, research has largely focused on accurately and effectively transferring data from sender to receiver. Shannon's theory of communication, which is built on information entropy, addresses this syntactic level of communication. But with the development of coding technology, system transmission capacity is now approaching its theoretical limit.
So, how can the industry move forward?
In fact, another level of communication exists: semantic communication [1]. In contrast to syntactic communication, which targets the accurate transmission of entropy (or information), semantic communication can tolerate inaccuracy in the entropy exchange while seeking the accurate exchange of semantic information. Therefore, network capacity can be multiplied by enhancing software capabilities without any revolutionary improvements to hardware capabilities.
Does this mean that semantic communication can serve as a shortcut to enlarging the pipes?
3D video appears to have the potential to fill operators' pipes in the 5G and 5.5G eras. However, 3D content is expensive to produce and requires specialized skills. In the LTE era, a social media influencer could create a short, 60-second video on a phone in just a few minutes. However, the production of a 3D video that is only a few seconds long can keep a professional team busy for an entire day.
Considering this, is it possible for generative AI technology to help evolve content from 2D to 3D and reverse the sluggish growth in demand for carrier pipes?
Semantic communication enlarges pipes while generative AI fills them
A human's five senses can determine the ceiling of interpersonal communication.
Communication has long been evolving in terms of the way it interacts with our senses: from the earliest message sent in Morse code ("What hath God wrought"), to the first voice call ("Mr. Watson, come here, I want to see you"), to the first message sent over the Internet ("LO"). Now, with the move to images, video, and 3D AR/VR, communication will evolve into something that involves every sense.
As the scope of information increases, our need for information accuracy decreases but the required processing power of the brain increases. This is the basis of our analysis of future communication requirements.
When receiving a text message on a mobile phone, each character included in the message is encoded into binary and then decoded back into text to ensure the accuracy of the message. This makes information fairly easy to read, and takes nothing more than a little brain power and a short time to come up with a reply.
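The lossless round trip described above can be sketched in a few lines of Python. This is only an illustration, assuming UTF-8 as the character encoding (the article does not name one); the helper names text_to_bits and bits_to_text are hypothetical.

```python
def text_to_bits(message: str) -> str:
    """Encode text as a string of '0'/'1' characters (UTF-8 is an assumption)."""
    return "".join(f"{byte:08b}" for byte in message.encode("utf-8"))


def bits_to_text(bits: str) -> str:
    """Decode a bit string produced by text_to_bits back into text."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


message = "LO"
bits = text_to_bits(message)
restored = bits_to_text(bits)
print(bits)                 # 8 bits per ASCII character
print(restored == message)  # every bit must survive for the text to match
```

Flipping a single bit in the bit string changes the decoded character, which is why this kind of syntactic transmission has to be exact.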
However, in voice communication, the transmitted voice may be accompanied by some tolerable noise. In this case, it requires more brain power to engage in a real-time conversation and understand the meaning.
Ultimately, video communication mixes together different types of sensory information (e.g., auditory, visual, taste, and tactile), where the accurate transmission of every bit no longer matters. To process such an overwhelming amount of information, people are likely to need AI to record and analyze conversations, and alert them to key details that could easily be missed.
These essential communication requirements, when reflected in the requirements for telecom networks, will diversify networks. These networks will need to handle diverse application scenarios:
First, traditional networks in which information suffers entropy during transmission should be retained, because accurate transmission is still required in some specific scenarios.
Second, the primary mission of networks may shift from transmitting accurate information to transmitting expressive information, as the information to be transmitted evolves from one-dimensional text to two-dimensional images and video, and then to three-dimensional holograms or even higher-dimensional objects. This would reduce requirements for the accuracy of information transmission, meaning networks no longer need to accurately transmit every bit of information. Instead, they need only to accurately transmit semantics. In other words, the dimensions of the transmitted information will be reduced.
Let us consider a typical communication application scenario: 3D video conferencing. Is it really necessary to support a 3D or higher-dimensional communication application with a network that is capable of 10 Gbit/s per user? (10 Gbit/s is 16,000 times the data volume of The Complete Works of William Shakespeare.) In reality, 3D video conferencing requires accurate syntactic transmission only for content equivalent to 2D information, such as voice, whiteboard content, and presentation slides. Details specific to 3D, like participants' facial micro-expressions, movement, and the conference background, need only semantic transmission. Such communication is more efficient and personalized.
Supported by smartphones equipped with HD cameras, the ease of recording, editing, and publishing videos has driven the success of the professional user-generated content (PUGC) model in the mobile broadband era. Huge amounts of videos are produced through this model, uploaded to video platforms like YouTube and TikTok, and then streamed on users' smartphones through carrier pipes. This model has driven high profits for both OTT providers and carriers and benefited the whole industry over the past decade.
With 5G available today and 5.5G on the horizon, carrier pipes will enlarge tenfold, even 100-fold. While capacity on the supply side increases to unprecedented levels, growth on the demand side is becoming an issue. 2D videos generate traffic growth by increasing the number of users, their average daily video viewing times, and video definition. However, these three factors began reaching their limits during the LTE era, and further sustained growth is unrealistic.
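One way to read the three factors is as a simple product: traffic is roughly users times viewing time times bitrate. The sketch below is a back-of-the-envelope model with purely hypothetical numbers (none of them come from carrier data), and the function name daily_traffic_gb is an assumption made for illustration.

```python
def daily_traffic_gb(users: int, hours_per_day: float, bitrate_mbps: float) -> float:
    """Total daily video traffic in gigabytes (decimal units)."""
    seconds = hours_per_day * 3600
    megabits = users * seconds * bitrate_mbps
    return megabits / 8 / 1000  # Mbit -> MB -> GB


# Hypothetical baseline: 1 million users, 1 hour a day, 4 Mbit/s video.
baseline = daily_traffic_gb(1_000_000, 1.0, 4.0)
# Doubling any single factor doubles traffic.
doubled_bitrate = daily_traffic_gb(1_000_000, 1.0, 8.0)
print(baseline, doubled_bitrate / baseline)
```

Because the model is a plain product, traffic stops growing once all three factors saturate, which is the situation the paragraph above describes.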
So, what key actions can be taken to help increase the traffic running in carrier pipes today and in the future?
My view is that the vastly wide carrier pipes will not be filled solely by traffic that addresses human sensory needs, but also by traffic that is triggered and generated by AI. MIT Technology Review lists 'AI that makes images' in the '10 Breakthrough Technologies 2023'. This reflects an upgrade from 1D text to 2D graphics, but it is far from enough. We currently expect a further upgrade in the industry, from text (semantic) to 3D videos. In the first month of 2023, we saw graphics technology maturing and the emergence of a related AI application.
Neural rendering is an innovative technology in computer graphics, first proposed by several DeepMind scientists in a key paper, "Neural Scene Representation and Rendering", published in Science in June 2018 [3]. These same scientists also proposed a machine-learning architecture based on neural rendering: Generative Query Network (GQN). GQN can take images of a scene taken from different viewpoints as input, and then construct (or predict) the appearance of that scene from previously unobserved viewpoints. The paper in question is important for two key reasons:
First, GQN lifts the dimensions of images, meaning AI can complete 3D graphics by using a small number of 2D graphics (sparse samples). AI can also produce a 3D model with nothing more than a surround-shot video clip, complete or incomplete, or even just a few photos.
Second, GQN presents a representation learning approach that does not require manual labeling or domain knowledge, paving the way for machines that can independently learn to comprehend the surrounding world. The scene representation capabilities of GQN involve the process of converting visual sensory data into concise descriptions. If we were to reverse this process, concise descriptions could be used to generate visual sensory data. This is what many of us dream of – the ability to produce video using semantics or text. This demonstrates AI's potential to generate massive amounts of 3D content at low cost.
Based on neural rendering, an implementation method known as neural radiance field (NeRF) quickly developed. NeRF uses a sparse set of input views (i.e., limited images and videos) to optimize a continuous volumetric scene function, thereby synthesizing novel views of complex scenes. This method can be expressed by the following formula:
$$F: (x, y, z, \theta, \phi) \rightarrow (\mathrm{RGB}, \sigma)$$
It is said that every formula included in an article reduces readership by half. However, I hope nobody reading this article feels intimidated by the above formula, because it's actually quite simple. The five parameters on the left represent the spatial position (x, y, z) of the observed point, and the viewing direction (θ, φ). "RGB" on the right represents the observed color, and "σ" represents the volume density at that point (roughly, how opaque it is). "F" in the middle simply represents the correspondence between the two sides, meaning the spatial scene where the observer and the observed object are located, which is described by a neural network. The parameters of such a neural network are the results of training based on input views.
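To make the mapping concrete, here is a toy version of F in plain Python. It is a sketch under strong assumptions: the weights are random rather than trained on input views, the network has a single hidden layer, and the positional encoding and volume rendering steps used by real NeRF implementations are omitted. The names make_layer, dense, and scene_function are hypothetical. Here σ is treated as volume density, as in the NeRF paper.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, inputs):
    """Apply one fully connected layer to a list of inputs."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def scene_function(params, x, y, z, theta, phi):
    """Toy F: (x, y, z, theta, phi) -> ((r, g, b), sigma)."""
    hidden_layer, output_layer = params
    hidden = [max(0.0, h) for h in dense(hidden_layer, [x, y, z, theta, phi])]  # ReLU
    r, g, b, raw_sigma = dense(output_layer, hidden)
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    rgb = (sigmoid(r), sigmoid(g), sigmoid(b))  # colors squashed into (0, 1)
    sigma = max(0.0, raw_sigma)                 # density cannot be negative
    return rgb, sigma


rng = random.Random(0)  # untrained, random parameters (assumption)
params = (make_layer(5, 16, rng), make_layer(16, 4, rng))
rgb, sigma = scene_function(params, 0.1, 0.2, 0.3, 0.5, 1.0)
print(rgb, sigma)
```

In a trained model, the parameters would be fitted so that colors rendered from F match the input photos, and querying F from a new viewing direction would then yield a novel view.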
On January 9, 2023, an app called Luma AI was launched on Apple's App Store. Luma AI is capable of quickly converting ordinary videos and photos taken on mobile phones into 3D models and 3D videos. This app utilizes the NeRF method to create a sub-real-time 3D image/video production tool. This model of generating 3D content using AI at low cost can be referred to as User Generated 3D Content (UG3DC). Ultimately, acquiring a 3D model may be as simple as taking a picture on a phone.
Consider the application of Turbo codes as an example. Because an interleaver exists in the Turbo codec, the Hamming distance of Turbo codes cannot be accurately represented mathematically. However, this does not stop us from widely applying Turbo codes in communications to achieve performance that comes close to the Shannon limit. Semantic communication is now facing the same problem. In contrast to syntactic communication, semantic communication lacks a rigorous mathematical representation. Shannon used a simple logarithmic formula to clearly define information (entropy), and his capacity formula specifies the maximum rate at which information can be transmitted over a syntactic communications channel. Semantic communication, however, lacks a theoretical basis due to two basic problems:
First, semantic information lacks a clear definition.
Second, there is no clear mathematical representation of the maximum channel capacity of semantic communication.
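For contrast, the two syntactic results mentioned above do have compact closed forms: entropy H = -Σ p·log2(p), and, assuming an additive white Gaussian noise channel, capacity C = B·log2(1 + S/N). The sketch below evaluates both; the function names and the example bandwidth and signal-to-noise values are hypothetical.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def channel_capacity_bps(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity C = B * log2(1 + S/N), in bit/s (AWGN assumed)."""
    return bandwidth_hz * math.log2(1 + snr_linear)


coin_entropy = entropy_bits([0.5, 0.5])      # a fair coin carries 1 bit per toss
capacity = channel_capacity_bps(1e6, 15)     # hypothetical: 1 MHz, linear SNR of 15
print(coin_entropy, capacity)
```

No equally simple expression is currently agreed on for the amount of semantic information in a message or for the capacity of a semantic channel, which is the gap the two problems above describe.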
Today, some people want to use AI to help realize the practical application of semantic communication, such as using deep learning to match semantic features. Specifically, deep learning can be used to train a neural network to obtain the semantic feature parameters of the information source, after which the parameters are transferred to the other end of the network where the information can be roughly restored. However, semantic communication has not yet been applied to telecom networks.
Generative AI can pave the way to enormous productivity in the short term, and we can look at Luma AI as an example.
First, it must be acknowledged that Luma AI consumes a large amount of computing power and energy, and that the computing power of terminal devices alone is insufficient for common applications, because rendering is excessively time- and power-consuming. For example, I tried to use a 10-second video clip as an input to locally generate a 3D model on a handheld tablet. This task consumed 30%–35% of the device's battery (or around 2,000 mAh), and the entire modeling and rendering process took over 30 minutes. After running the app for an hour, only 25% battery remained. However, the app can also be run online. When I chose online rendering, the raw data upload took 2 to 3 seconds (9 MB of video for training), and the modeling and rendering took 10 seconds. This was about 180 times faster than local rendering (10 seconds versus more than 30 minutes) and consumed almost no battery.
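The figures in this test can be checked with simple arithmetic. The sketch below takes "over 30 minutes" as exactly 30 minutes and assumes decimal megabytes (1 MB = 8 Mbit); both are simplifying assumptions, and the variable names are illustrative only.

```python
local_seconds = 30 * 60      # local modeling and rendering, taken as 30 minutes
online_seconds = 10          # online modeling and rendering
speedup = local_seconds / online_seconds

upload_megabits = 9 * 8          # 9 MB of video, decimal megabytes assumed
rate_low = upload_megabits / 3   # Mbit/s if the upload took 3 seconds
rate_high = upload_megabits / 2  # Mbit/s if the upload took 2 seconds

print(speedup, rate_low, rate_high)
```

Under these assumptions the speedup is 180 and the implied uplink rate is between 24 and 36 Mbit/s, so the upload adds only a few seconds on top of the 10-second online render.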
Second, Luma AI is about to introduce the ability to generate video based on semantics or textual descriptions. This content production capability will be similar to that of an AI-based text-to-image art generator, except with 3D videos. The era of professional user-generated 3D content (PUG3DC) will truly arrive when such capabilities become strong enough to support the instant generation, one-click generation, or text-based generation of 3D video.
Generally speaking, we cannot expect semantic communication to create profits in the short term, so we should continue to dive deep into syntactic communication. However, generative AI, particularly when using semantics to generate 3D content, is something we can undoubtedly look forward to. We anticipate that AI holds enormous potential in terms of filling carrier pipes, and we have already seen evidence of this.
An explosion in 3D content creation, facilitated by generative AI, might be just around the corner.