Future Technologies
Robots Empowered by AI Foundation Models and the Opportunities for 6G
This paper analyzes the integration of FMs into robotics, explores the potential of 6G technology for robotics, and introduces a prototype 6G robotic system.

By Huawei Wireless Technology Lab: Guangjian Wang, Lingjun Yang, Jimmy Jian, Chandan Roy, Li Pan, Guolong Huang, Hua Cai, Wen Tong
1 Introduction
The global vision for developing robotic technologies demonstrates the crucial importance of integrating artificial intelligence (AI) into robots. In the United States, the 2024 edition of "A Roadmap for US Robotics: Robotics for a Better Tomorrow" by the National Robotics Initiative (NRI) highlights AI as a pivotal force. The roadmap outlines advancements in machine learning (ML), artificial general intelligence (AGI) research, pervasive automation, and the convergence of AI with robotics. It also emphasizes personalized AI, AI ethics, and AI-driven scientific discovery, all aimed at shaping the economy, workforce, and national security.
The European Union's joint "Strategic Research Innovation and Deployment Agenda (SRIDA)" for the AI, data, and robotics partnership underscores a human-centric and trustworthy approach to AI and robotics. This agenda focuses on fostering collaboration among industry, academia, and policymakers to drive research, development, and deployment. It aims to establish Europe as a global leader in AI and robotics by stimulating investment and tackling key challenges, thereby enhancing economic, societal, and environmental outcomes in alignment with European values and rights.
China's "14th Five-Year Plan for Robot Industry Development" emphasizes the need to enhance the intelligence and networking capabilities of robots through the integration of AI, 5G, big data, and cloud computing. This plan ensures the functionality, network, and data security of robotic systems, thereby advancing the nation's technological capabilities and industrial applications.
To achieve their perception capabilities, classic AI robotic systems use deep learning (DL) methods deployed in a controlled environment. Although this approach provides an effective way of learning multiple skills, it not only requires significant training time and extensive engineering effort to set up each task, but also lacks robustness to distribution shifts and generalizability.
While this might be acceptable for a single task, the learning costs and engineering effort can increase exponentially when multiple tasks are performed in real-world experiments, introducing new challenges to the robotics domain.
Building generalizable robotic systems faces several challenges. At the same time, however, a novel field of study has emerged that could help address them. A foundation model (FM) is a large-scale AI model that serves as a versatile, general-purpose basis for various downstream tasks when adapted to specific applications. FMs are pre-trained on internet-scale data, exhibiting superior generalization capabilities and extending the concepts of transfer learning (TL) and model scaling.
They enable robots to autonomously understand and execute tasks from high-level natural language instructions, dynamically decompose complex tasks, and adjust actions based on real-time feedback, minimizing human intervention. Furthermore, situation awareness is enhanced by enabling semantic understanding of the environment using multimodal data from commonly used sensors such as cameras, LiDAR, and microphones.
These advancements shift robots away from rigid, predefined operations and narrowly focused models, moving them towards dynamic, intelligent task execution and environmental understanding, significantly enhancing their autonomy, flexibility, and efficiency.
In this paper, we analyze the current efforts of academia and industry and the future directions they will take in applying FMs to robotics. Furthermore, we analyze the impact of 6G technology on robotics, highlighting future applications, integration with AI FMs, and networking requirements. This paper is structured as follows: Section 2 provides a state-of-the-art (SOTA) analysis of current FMs for robotics. Section 3 provides a brief overview of the standardization efforts by the main entities. Section 4 broadly illustrates the market and research opportunities of 6G and AI applied to robotics. Section 5 introduces our 6G robotic prototype. Finally, Section 6 presents conclusions, remarks, and future research directions.
2 SOTA Foundation Models for Robots
This section provides an overview of the types, roles, and capabilities of FMs specific for the robotic domain. We use terminology that is consistent with the ISO standard 8373:2021 for robots and robotic devices. This international standard is key for ensuring that communication is clear and consistent across different industries, academic fields, and geographic regions involved in robotics.
2.1 FM Enablers for Robotics
The key benefits of FMs for robots are summarized as follows:
- Comprehensive knowledge base: FMs provide robots with extensive, multi-domain knowledge, enabling them to understand and execute a wide range of tasks. This knowledge base allows robots to perform complex operations across various fields, without needing extensive reprogramming for each specific task.
- Natural language understanding: FMs possess strong natural language processing (NLP) abilities, allowing robots to comprehend and interact using human language. This simplifies task instruction and communication, enabling users to provide commands and receive feedback in natural language.
- Multimodal situation awareness: FMs enhance robots' multimodal situation awareness by enabling semantic understanding of their surroundings using various sensors, such as RGB cameras, LiDAR, and microphones. Robots can understand the logical and geometrical connections between objects, assess current situations, interpret events, and predict future occurrences.
- Zero-shot and few-shot learning: FMs excel in zero-shot and few-shot learning, enabling robots to perform tasks with minimal to no task-specific training. This enhances flexibility and adaptability, allowing robots to handle new tasks and environments without needing extensive retraining.
2.2 FM Macro Typologies for Robotics
FMs have the potential to unlock new possibilities in the robotics domain. Among FMs, a subclass of pre-trained models can be utilized to improve various tasks such as perception, prediction, planning, and control:
- Large language models (LLMs): These models would enable robots to understand natural language instructions and potentially respond with natural language.
- Vision transformers (ViTs) or multimodal transformers: These models would be crucial for enabling robots to interpret visual data from their environment through cameras and LiDAR sensors.
- Embodied multimodal language models: This is a broader category that could potentially combine the functionalities of LLMs and ViTs in order to allow robots to understand not only natural language instructions but also the visual context of those instructions.
- Visual generative models (VGMs): Building on the evolution of diffusion models, VGMs trained on massive datasets can create realistic scenarios for robots to virtually practice tasks. This enhances perception, refines movement, and provides diverse training data.
These advancements highlight the potential of using FMs in the field of robotics for the development of models that are more specific to this field rather than just combining existing vision and language models.
2.3 Robotic FMs: Intent Recognition and Visual Reasoning
There has been growing interest recently in transformer-based robotic AI, owing to its strong intent recognition and visual reasoning capabilities. This architecture takes language embeddings and observations as inputs and outputs predicted actions (a minimal sketch of this input/output pattern follows the list below). To achieve long-horizon, robust, and generalizable policies, vision-language-action (VLA) models applied to language-conditioned robotic manipulation (LcRM) have been introduced for visuomotor control. This approach further narrows the gap between robot physics and AI, improving two main aspects:
- High-level planning: A complex language instruction can be converted and divided into a sequence of basic action primitives, which are then executed by low-level controllers. PaLM-E, a combination of PaLM and ViT, consists of up to 562B parameters and serves as a high-level policy for planning and reasoning.
- End-to-end learning: An LLM can be trained to directly generate actions based on instructions and observations. RT-1 and RT-2 are examples of multitask models that tokenize robot inputs and output actions to enable efficient inference at runtime. Such an approach makes real-time control feasible. Similarly, Octo provides training and fine-tuning of generalist robotic policies (GRPs) using transformer-based diffusion methods. Out of the box, Octo supports multiple RGB camera inputs and multi-arm robots, and can be instructed via language commands or goal images. Furthermore, Octo uses a modular attention structure in its transformer backbone. This allows it to be effectively fine-tuned to robot setups with new sensory inputs, action spaces, and morphologies, using only a small target domain dataset and accessible compute budgets.
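To make this input/output pattern concrete, the following minimal sketch, written in PyTorch, shows a toy transformer policy that consumes language-token and image-patch embeddings and emits one discretized token per action dimension. All dimensions, the readout scheme, and the binning are illustrative assumptions, not the actual RT-1, RT-2, or Octo architectures.

```python
import torch
import torch.nn as nn

class TinyVLAPolicy(nn.Module):
    """Toy VLA-style policy: language + vision tokens in, action bins out."""

    def __init__(self, vocab=1000, d=128, n_actions=7, bins=256):
        super().__init__()
        self.lang_embed = nn.Embedding(vocab, d)       # instruction tokens
        self.patch_embed = nn.Linear(768, d)           # flattened image patches
        self.action_queries = nn.Parameter(torch.randn(n_actions, d))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d, bins)          # one bin per action dim

    def forward(self, instr_tokens, patches):
        b = instr_tokens.shape[0]
        x = torch.cat([
            self.lang_embed(instr_tokens),             # (b, L, d)
            self.patch_embed(patches),                 # (b, P, d)
            self.action_queries.expand(b, -1, -1),     # (b, A, d)
        ], dim=1)
        h = self.backbone(x)
        # Read out the action-query positions; predict a discrete bin for each.
        return self.action_head(h[:, -self.action_queries.shape[0]:, :])

policy = TinyVLAPolicy()
instr = torch.randint(0, 1000, (1, 12))   # e.g., tokenized "pick up the cup"
obs = torch.randn(1, 16, 768)             # 16 image patches from the camera
logits = policy(instr, obs)               # (1, 7 action dims, 256 bins)
action = logits.argmax(-1)                # one discretized value per dimension
```

The single forward pass from instruction and observation to action tokens is what makes this family of models suitable for real-time control.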
2.4 Simulation Platforms
Several frameworks have been developed to simulate robots powered by AI-based planning, control algorithms, or both. Two main families of frameworks have been identified as possible platforms on which to base our analysis. A third platform, NVIDIA Isaac Lab, is not considered here because it requires a proprietary commercial license.
RoboCasa is a simulation framework for training robots to perform everyday tasks. Methods are provided to train transformer-based models on a combination of proprioceptive robot data (e.g., joint encoder readings) and images (e.g., from a camera on the robot or in the world).
MuJoCo (Multi-Joint dynamics with Contact) is a physics engine specifically designed for simulating physical systems, particularly robots. MuJoCo's realistic simulations can be used to train FMs for various robotic tasks. Such FMs can learn by interacting with the virtual environment, manipulating virtual objects, and receiving feedback on their actions. This training data can then be transferred to the real robot, allowing it to perform similar tasks in the physical world.
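As an illustration of this learn-by-interaction loop, the following is a minimal sketch using the official mujoco Python bindings; the toy world (a box dropped onto a plane) and the step count are illustrative, not a real training setup.

```python
import mujoco

# A minimal world: one free-falling box above a ground plane.
XML = """
<mujoco>
  <worldbody>
    <geom type="plane" size="1 1 0.1"/>
    <body pos="0 0 0.5">
      <freejoint/>
      <geom type="box" size="0.05 0.05 0.05" mass="0.1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

# Step the physics and read back state -- the same loop a learning agent
# would wrap to collect (observation, action, feedback) tuples.
for _ in range(500):
    mujoco.mj_step(model, data)

print("box height after 500 steps:", data.qpos[2])  # z of the free joint
```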
HABITAT is a high-performance 3D simulation environment designed specifically for training embodied AI agents, such as robots and virtual assistants. HABITAT simulates various sensors (e.g., RGB-D cameras) commonly used in robots, providing FMs with diverse sensory information for perception and decision-making.
3 Use Cases of Robots Empowered by 6G and AI
In the telecommunication industry, R&D and standardization efforts have been exploring the applications of mobile networks in robotics. The 3rd Generation Partnership Project (3GPP) System Aspect 1 (SA1) has studied service robots and identified eight use cases. These include real-time cooperative safety protection, smart communication data collection and fusion using multimodal sensors on multiple robots, and autonomous and teleoperated robots working on mining actuation and delivery. Some technical aspects have been discussed, such as tactile and multimodality communication, integrated sensing and communication (ISAC), metaverse, and high-level communications.
The one6G association aims to evolve, test, and promote next-generation cellular and wireless communication solutions. It envisions that robotic applications will penetrate several application areas and societal sectors. In addition, it has published a series of openly available whitepapers on 6G and robotics, providing in-depth discussions of 6G's enabling functions for robots (e.g., communication, AI/ML, and ISAC). Furthermore, several use cases of robots empowered by 6G are proposed, such as collaborative robots, disaster relief, action planning, industrial robots, and healthcare assistance.
The EU-funded flagship 6G research projects, Hexa-X and Hexa-X-II, have discussed and analyzed various 6G use cases and requirements, focusing on autonomous robots that can communicate with each other, with other machines, and with nearby humans to perform individual tasks that contribute to a common cooperative objective. One such use case is cooperating mobile robots (CMRs).

Figure 1 6G capabilities applied to different control levels of robots
4 Opportunities for 6G
Robot control is commonly divided into four levels: task-level, action-level, primitive-level, and servo-level [24, 25]. With the integration of the AI and sensing capabilities of 6G, robots are poised to achieve an even higher level of intelligence, surpassing traditional task-level control. We envision these enhanced capabilities as part of a new level, named the meta-level. At this level, robots will be able to, in a fully autonomous manner, identify problems, define their tasks, and adapt to dynamic environments based on meta-definitions of their roles, missions, and rules, in addition to possessing real-time situation awareness. Figure 1 illustrates the interoperability between the levels, ISAC functionalities, and native AI infrastructure.
Our vision of how the control levels will be defined for future intelligent robots is as follows (a schematic sketch of the hierarchy follows the list):
- Meta-level control: This level empowers robots to autonomously identify problems, define tasks, and adapt to dynamic environments based on meta-definitions of their roles, missions, and rules, with real-time situation awareness.
- Task-level control: This level defines the overall goals and missions of robots, involving high-level planning, decision-making, and task decomposition. Examples include "Clean the kitchen floor" and "Serve a low-calorie sparkling drink."
- Action-level control: This level converts task-level commands into specific movement sequences, including trajectory planning and path generation. An example is planning a path to navigate from the living room to the kitchen without running over a child's toys.
- Primitive-level control: This level involves direct control of the robot's actuators to follow planned trajectories, generating commands for joint positions, velocities, and forces. An example is controlling the arm to move precisely along a path to pick up an object.
- Servo-level control: This level, the lowest one, focuses on maintaining precise control of actuators through feedback loops. It ensures the execution of commands with high accuracy and stability.
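The following schematic sketch, in Python, expresses the five levels as a call chain from meta-level goal selection down to servo-level feedback. Every function name, the toy task, and the numeric values are hypothetical illustrations of the hierarchy, not a real control stack.

```python
def meta_level(situation: dict) -> str:
    # Autonomously pick a task from the robot's role/mission definitions.
    if situation.get("floor_dirty"):
        return "Clean the kitchen floor"
    return "Idle"

def task_level(task: str) -> list[str]:
    # Decompose the mission into ordered sub-goals.
    return ["navigate(kitchen)", "sweep(floor)", "dock(charger)"]

def action_level(subgoal: str) -> list[tuple[float, float]]:
    # Plan a trajectory (here: a toy list of 2D waypoints).
    return [(0.0, 0.0), (1.5, 0.2), (3.0, 0.4)]

def primitive_level(waypoint: tuple[float, float]) -> dict:
    # Convert a waypoint into actuator setpoints.
    return {"left_wheel_vel": 0.50, "right_wheel_vel": 0.55}

def servo_level(setpoint: dict, measured: dict) -> dict:
    # Simple proportional feedback loop keeping actuators on target.
    kp = 2.0
    return {k: kp * (setpoint[k] - measured.get(k, 0.0)) for k in setpoint}

task = meta_level({"floor_dirty": True})
for subgoal in task_level(task):
    for wp in action_level(subgoal):
        sp = primitive_level(wp)
        cmd = servo_level(sp, measured={"left_wheel_vel": 0.48,
                                        "right_wheel_vel": 0.50})
```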
The new ISAC and Network-for-AI (NET4AI) features, born of the 6G vision and the initial research and standardization efforts, could become important enablers for future robots empowered by AI FMs.
4.1 Native AI as a Service, Accommodating AI Models and Computing Facilities
6G aims to provide AI as a service (AIaaS) enabled by NET4AI, embedding FMs and other specific AI models directly within the network infrastructure. This integration provides several key benefits:
- Low-latency performance: Embedding AI models within the 6G network significantly reduces latency. Processing data close to the source within the radio access network (RAN) and core network (CN) minimizes the need to transmit data to external servers for processing, resulting in faster response times.
- Access to rich data: AI models within the 6G framework have access to a wealth of data from the RAN and CN, as well as from ISAC. Access to the extensive volume of data enables more accurate and context-aware AI decision-making, enhancing the performance of AI-driven applications.
- Enhanced data integration: The seamless integration of sensing and communication in 6G allows AI models to utilize diverse data sources for more robust and holistic analysis. This integration supports advanced applications like real-time environmental monitoring, adaptive robotic control, and dynamic resource management.
Compared to conventional multi-access edge computing (MEC), 6G AIaaS offers improved latency and bandwidth efficiency. It achieves this by embedding AI capabilities directly within the network infrastructure, thereby reducing additional data routing between edge servers and the cellular system. Furthermore, 6G native AI models can access a broader range of data from across the entire network (including ISAC data), leading to more informed AI processing and improved service delivery. The 6G framework also supports dynamic allocation of AI resources in different RAN and CN entities, making AI service deployment more scalable and flexible. In comparison to onboard robotic AI systems, 6G native AI offers significant advantages. Specifically, running AIaaS in the network typically offers better computing performance, and therefore faster system responsiveness, than running AI locally. Offloading intensive AI computations to the network reduces the power consumption and heat dissipation problems associated with onboard processing, extending the operational life of robots and reducing costs. Moreover, given the large amount of data available from the network, 6G native AI models provide more accurate and context-aware decision-making. In conclusion, AIaaS should be able to deploy parts of the "brain" flexibly across local and network nodes depending on the given needs, such as meeting challenging safety requirements.
4.2 ISAC for Robot Comprehensive Situation Awareness
The 3GPP has begun a study of ISAC, recognizing its potential to revolutionize various applications, including robotics. The SA1 has completed its study on ISAC (FS_Sensing), resulting in 32 ISAC use cases detailed in TR22.837 [28] and service requirements specified in TR22.137. These documents consider the comprehensive ISAC approach incorporating sensing based on both 3GPP radio networks and non-3GPP sensors, such as cameras and LiDAR.
The ISAC of the future mobile network is beneficial to robot applications in the following aspects:
- Integrated sensing, communication, and AI in the same standardized network architecture: Integrating sensing, communication, and AI FMs into a unified 6G network architecture offers transformative benefits for future intelligent robots. This approach enhances real-time decision-making and situation awareness by providing robots with immediate access to comprehensive, real-time data.
- Networked sensing for comprehensive situation awareness: Integrating sensing, communication, and AI FMs into a unified 6G network architecture allows robots to achieve comprehensive situation awareness through networked sensing. Instead of relying solely on a robot's onboard sensors, ISAC provides access to a richer array of data from various sensing nodes on the network, including other robots and environmental sensors.
- Integrated sensing and positioning: Mobile robots require positioning capabilities in order to find objects and navigate. ISAC can improve positioning accuracy by fusing the passive sensing and active positioning functions of the mobile network (a minimal fusion sketch follows this list).
- Sensing digital twin (DT) construction: Real-time and accurate sensing data is needed to construct DTs for robots. In the future, ISAC might support the creation of precise and dynamic virtual replicas for effective DTs, improving collaboration among multiple robots.
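As a minimal illustration of the positioning-fusion idea above, the following sketch combines two independent position estimates by inverse-variance weighting; the estimates and variances are made-up numbers, and a real ISAC system would fuse far richer measurements.

```python
def fuse(est_a: float, var_a: float, est_b: float, var_b: float):
    """Inverse-variance weighted fusion of two independent 1D estimates."""
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    fused = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    fused_var = 1.0 / (w_a + w_b)  # always smaller than either input variance
    return fused, fused_var

# Passive radio sensing says x = 4.2 m (variance 0.25); active positioning
# says x = 4.6 m (variance 0.04). The fused estimate leans on the better one.
x, var = fuse(4.2, 0.25, 4.6, 0.04)
print(f"fused x = {x:.2f} m, variance = {var:.3f}")
```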
4.3 Enhancing Future Robots with 6G Communication
The advent of 6G communication will significantly enhance future robots by leveraging hyper reliability, ultra-low latency, advanced quality of service (QoS) provisions, and interworking with robotic software and protocols.
- Hyper reliable and low-latency communication (HRLLC): HRLLC is crucial for industrial applications in which robot control is centralized. 6G provides hyper-reliable and stable communication channels with minimal jitter, ensuring smoother operation and synchronization of robotic systems. This is vital for tasks that require high precision and reliability.
- Advanced QoS framework: 6G introduces advanced QoS frameworks that dynamically allocate network resources based on the specific needs of AI FMs and specialized robotic applications. Through its enhanced data throughput capabilities, 6G enables efficient transmission of AI training data, sensor data, and real-time analytics, supporting complex decision-making processes and learning algorithms.
- New protocols for interworking: 6G's support for seamless interworking with robotic software and communication protocols such as Data Distribution Service (DDS), Open Platform Communications Unified Architecture (OPC UA), Message Queuing Telemetry Transport (MQTT), and Zenoh allows robots to benefit from its capabilities without requiring an extensive redesign of existing systems (a minimal publish example follows this list).
- Real-time closed-loop teleoperation and training: 6G enables real-time closed-loop teleoperation of robots by humans or AI. This is crucial for solving unknown complex tasks as well as training AI models to acquire new skills through imitation learning. Through 6G's robust communication infrastructure, operators can remotely control robots in real time, providing hands-on training that accelerates AI learning and adaptation.
- New business opportunities: The power of AI, coupled with 6G sensing capabilities of both the robot and the network, unlocks new business opportunities for network owners and robot service providers. Real-time robot operations necessitate the integration of sensing, AI, and control functions with low latency and high data throughput to ensure seamless and efficient performance. Depending on the deployment of AIaaS agents, timely integration of sensing data from various sources is essential. Additionally, robotic operators and vendors may also favor resource-intensive services through an integrated mobile network solution that ensures contracts and trust for uninterrupted and reliable operation.
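As a minimal illustration of protocol interworking, the sketch below publishes robot telemetry over MQTT, assuming the paho-mqtt 1.x client library; the broker address, topic, and payload are hypothetical.

```python
import json
import paho.mqtt.client as mqtt

client = mqtt.Client()                         # paho-mqtt 1.x constructor
client.connect("broker.example.local", 1883)   # hypothetical broker address
client.loop_start()                            # background network loop

telemetry = {"robot_id": "melisac-01", "battery": 0.87, "state": "navigating"}
info = client.publish("robots/melisac-01/telemetry",
                      json.dumps(telemetry), qos=1)
info.wait_for_publish()                        # wait for the QoS 1 handshake

client.loop_stop()
client.disconnect()
```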
5 MELISAC — FM-powered Robot for 6G Proof-of-Concept
In this section, we introduce MELISAC (Machine Learning Integrated Sensing and Communication), our proof-of-concept (PoC) compound robot that integrates several advanced technologies, including intelligent robotic control, online robot training, and ISAC.
5.1 Hardware Setup
MELISAC is a dual-arm compound robot consisting of two industrial articulated collaborative robot (cobot) arms (UR5e) and an automated guided vehicle (AGV). The UR5e arms are mounted on an aluminum frame atop the AGV. This configuration enables autonomous navigation and precise object manipulation. For end effectors, MELISAC is equipped with MiaHand, a pair of anthropomorphic robotic hands that allow it to perform tasks in a manner similar to human hands. This capability is particularly beneficial for training AI models that control robots by demonstrating human task execution.
Additionally, an ISAC-capable sub-THz radio system is deployed on the robot, with its antenna mounted either on the body frame or as an end effector. A local computer handles onboard computation for action control and signal processing.

Figure 2 MELISAC in Hannover Messe 2023 and its software architecture
5.2 Software Architecture
In our deployment, sensor data processing and action planning are managed by the local computer, while computationally intensive tasks (e.g., AI inference) are offloaded to edge servers, as illustrated in Figure 2.
- Cobot arms and AMR controllers: These are the native controllers provided by the robot manufacturers. They expose application programming interfaces (APIs) for executing low-level robot functions (e.g., emergency stop, obstacle detection, and kinematics/inverse kinematics).
- Adaptation API: This is an adaptation layer that abstracts low-level control for the high-level controller. It is essential for hardware-agnostic FM-based control functions.
- Human-machine interfaces (HMIs): These are modalities for human-robot interactions, such as speech and gestures.
- Radio frequency (RF) sensing: This refers to the RF system for integrated radio sensing and communication. Radio sensing provides an additional perception layer alongside RGB-D cameras and microphones.
Due to their computation and memory requirements, SOTA FMs need to be deployed on powerful servers located on the edge cloud. Each FM is loaded into an AI agent, which combines the FM with the necessary software stacks. The AI agents interact with each other in a text-based multi-agent system located on the edge cloud. The local computer communicates with the robot components and the AI agents in the edge cloud using ROS2 (a minimal bridging sketch follows the agent list below).
- Chat agent: An AI agent powered by an LLM with a large vocabulary and general knowledge, capable of engaging in conversations with humans on various topics.
- Vision agent: A vision-language FM agent specializing in extracting semantics from video and image inputs, as well as classifying and localizing objects of interest.
- Robotic agent: A robotic-FM agent responsible for high-level planning of robot actions based on inputs from the chat agent (user requests) and vision agent (environmental context).
- Voice agent: An agent that provides real-time speech-to-text and text-to-speech conversion.
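The bridging sketch below shows a ROS2 (rclpy) node on the local computer forwarding a text request to the edge-cloud robotic agent and listening for its plan. The topic names and message flow are illustrative assumptions, not the actual MELISAC interfaces.

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class AgentBridge(Node):
    """Forwards text requests from the robot to the edge-cloud agents."""

    def __init__(self):
        super().__init__('agent_bridge')
        # Hypothetical topic consumed by the robotic agent on the edge cloud.
        self.request_pub = self.create_publisher(
            String, '/agents/robotic/request', 10)
        # Hypothetical topic on which the agent returns its action plan.
        self.plan_sub = self.create_subscription(
            String, '/agents/robotic/plan', self.on_plan, 10)

    def send_request(self, text: str):
        msg = String()
        msg.data = text
        self.request_pub.publish(msg)

    def on_plan(self, msg: String):
        self.get_logger().info(f'Received plan: {msg.data}')


def main():
    rclpy.init()
    node = AgentBridge()
    node.send_request('Serve a low-calorie sparkling drink')
    rclpy.spin(node)


if __name__ == '__main__':
    main()
```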
Robotic FMs may often struggle with unfamiliar tasks in unstructured environments. In such cases, a human operator can step in to demonstrate the task. MELISAC allows a teleoperator to control it over the network, and the resulting teleoperation data is used for training. This human-in-the-loop online training adds an adaptation layer to the pre-trained FM and should run continuously on the cloud.
5.3 Technical Discussions
End-to-end models vs. chain of models: A key question in building FM-controlled robots is whether to use a single end-to-end FM for all input modalities or a pipeline of multiple models. The single-model approach, seen in Octo, RT-1, and RT-2, often generalizes better because all modalities are trained together, and it enables real-time control with a single inference pass. The model-pipelining approach, despite offering flexibility, transparency, and customization, incurs extra inference time and integration complexity. Existing frameworks such as Promptflow and DSPy can help manage these challenges. The choice depends on data availability and hardware suitability. A domain-specific task with confidential data might benefit from pipelining a vision model with language and action models, whereas an end-to-end model trained on large Internet datasets is better suited to general tasks.
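To make the trade-off concrete, the following schematic sketch shows the model-pipelining approach as three chained stages, each one an inference pass; all functions are hypothetical placeholders rather than a real framework API.

```python
def vision_model(image_path: str) -> str:
    # Placeholder: a real VLM would caption the camera image.
    return "a red cup on the table"

def language_planner(instruction: str, scene: str) -> list[str]:
    # Placeholder: a real LLM would decompose the instruction given the scene.
    return ["locate(red cup)", "grasp(red cup)", "move_to(sink)", "release()"]

def action_model(step: str) -> None:
    # Placeholder: a real action model would emit joint-level commands.
    print(f"executing primitive: {step}")

def run_pipeline(image_path: str, instruction: str) -> None:
    scene = vision_model(image_path)             # inference pass 1
    plan = language_planner(instruction, scene)  # inference pass 2
    for step in plan:                            # passes 3..N
        action_model(step)

run_pipeline("camera.jpg", "Put the red cup in the sink")
```

The chained passes are where the extra latency comes from; an end-to-end model collapses all three stages into one forward pass.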
Integration with robot manufacturers' APIs: Currently, manufacturers provide control stacks with high-level APIs for capabilities like simultaneous localization and mapping (SLAM) and movement, while low-level action control remains restricted for safety compliance. Integrating AI into robots requires extended access to sensors and actuators. Given that full replacement of low-level control by FM-based solutions is unlikely, an integration scheme is needed to embed FM functionalities within existing systems. Retrieval-augmented generation (RAG) can help FMs learn control using standard low-level API documentation. A logical transition step is to define common interfaces between high-level functions (potentially FM-based) and low-level APIs, ensuring both safety and functionality. This requires collaboration between manufacturers and FM developers, with standardization of these interfaces being beneficial but not essential.
6 Conclusions and Remarks
Robotic FMs, despite their impressive ability to grasp basic objects and movements, struggle with complex tasks. They lack a nuanced understanding of real-world physics, hindering their ability to perform actions that require subtle manipulation. Furthermore, high precision and dexterity remain out of reach for current robotic FMs. In addition to these physical limitations, FMs need more than basic instructions for complex tasks and cannot learn intricate skills simply by observing. These shortcomings are compounded by slow control frequencies that restrict their ability to operate in real-time, high-speed environments. Even for tasks requiring smooth, precise movements, FMs are not well-suited. On top of all this, training them for entirely new actions without prior examples remains a significant challenge. These limitations, coupled with the lack of reliable and safe robot control systems, highlight the need for significant advancements in robotic FMs.
To adapt to future advancements, it is necessary to augment FMs with task-specific AI models, DT technology, and high-performance computing resources. Integration of specialized AI promises to improve precision and dexterity, while DT technology offers advanced physical simulations and AI training. This fosters a deeper understanding and prediction of physical interactions. The development of intelligent hybrid control systems that incorporate high-level planning from FMs, task-specific AI for specialized skills, and traditional methods for low-level execution will ensure smoother and more efficient operations. Additionally, leveraging advanced computing and programming tools to elevate control frequencies and real-time responsiveness will enable robots to handle dynamic tasks more effectively. This comprehensive approach will significantly enhance robotic autonomy, flexibility, and efficiency, empowering robots to navigate complex real-world scenarios with greater competence. The advent of 6G, with its advanced AI and sensing capabilities, promises to propel robots beyond traditional task-level control, enabling them to operate at a new meta-level. This will empower them with autonomous problem identification, task definition, and adaptation to dynamic environments. By leveraging 6G's ISAC and AIaaS, these robots can identify tasks and solve problems with greater autonomy and efficiency, guided by the meta-definitions of their roles, missions, and rules, in addition to possessing real-time situation awareness.