February 9, 2024 | V. "Juggy" Jagannathan
The history of artificial intelligence (AI) is littered with the notion of agents. What exactly is an agent? I remember attending a workshop on distributed AI in Gloucester, Mass., in 1986. The workshop report gives a comprehensive view of what this concept entails. It even gives a historical perspective as of 1986 (at that time, I was building multi-agent systems at the Boeing Advanced Technology Center).
Back to the question: what exactly is an agent? An agent is a piece of code that is given a particular goal or task to achieve; it plans how to accomplish that task and then executes the plan on behalf of the person requesting it. The more narrowly defined the task, the more specialized the agent; the broader the task, the broader the agent's capabilities. You can also build a cooperative group of agents that coordinate their actions to achieve an overall goal. I even edited a book on this front, exploring architectures that support coordinating distributed agents. These are now sometimes referred to as a “swarm” of agents. Drones come to mind with that term!
The reincarnation of the above concepts is of course driven by the current emergence of LLMs with incredible abilities. This article by Nvidia explains how LLMs can be used to implement the notion of an agent. The agent takes a task from the user, provided as a natural-language statement. A planning module then kicks in, in which the LLM breaks down how to solve the problem. Next comes executing each step of the plan, aggregating the results, and providing the answer back to the user. How a particular action gets executed comes down to selecting a tool or application that takes some input and produces an output.
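To make that loop concrete, here is a minimal sketch of the plan-then-execute pattern in Python. This is not Nvidia's implementation; the llm() function is a placeholder for whatever chat-completion call you use, and the two "tools" are toy stand-ins for real applications.

```python
# Minimal sketch of the plan-then-execute agent loop described above.
# llm() is a placeholder for a real chat-completion call, and the two
# "tools" are toy stand-ins for the applications an agent might invoke.

def llm(prompt: str) -> str:
    """Placeholder for an LLM call; returns canned replies for the demo."""
    if prompt.startswith("Task:"):
        return "It will be sunny and around 45F in Boston today."
    return "1. look_up_weather: Boston\n2. summarize: forecast"

TOOLS = {
    "look_up_weather": lambda arg: f"Forecast for {arg}: sunny, 45F",
    "summarize": lambda arg: f"One-line summary of: {arg}",
}

def run_agent(task: str) -> str:
    # 1. Planning: ask the LLM to break the task into "tool: input" steps.
    plan = llm(f"Break this task into numbered 'tool: input' steps: {task}")
    results = []
    # 2. Execution: select and run a tool for each step of the plan.
    for line in plan.splitlines():
        step = line.split(".", 1)[1].strip()
        tool_name, arg = (s.strip() for s in step.split(":", 1))
        results.append(TOOLS[tool_name](arg))
    # 3. Aggregation: have the LLM compose the final answer from the results.
    return llm(f"Task: {task}\nTool results: {results}\nCompose the final answer.")

print(run_agent("What's the weather in Boston today?"))
```

In a real system the planning prompt, the tool registry and the aggregation step are each far more elaborate, but the control flow is essentially this.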
Hugging Face has blog posts showing how one can use open-source LLMs to implement such agents, and there is a slew of research papers on using open-source LLMs for task decomposition and execution. In one such paper, dubbed “HuggingGPT,” the authors combine ChatGPT with models available on Hugging Face. This allows them to ask things like “can you describe this picture and count how many objects are in the picture?” In a recent paper published in January 2024, the authors build a “Multimodal Large Language Model for Tool Agent Learning.” Combining modalities to execute tasks has become a hot area of research.
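The HuggingGPT idea of an LLM planner farming subtasks out to off-the-shelf models can be approximated in a few lines with the Hugging Face transformers library. The sketch below hard-codes the "plan" for the example question above (describe the picture, then count objects); the default pipeline models are illustrative, not necessarily what the paper used.

```python
# Rough sketch of the HuggingGPT-style pattern using two off-the-shelf
# Hugging Face pipelines. The "plan" is hard-coded to the two subtasks in
# the example question; the default pipeline models are illustrative only.

from transformers import pipeline

captioner = pipeline("image-to-text")    # subtask 1: describe the picture
detector = pipeline("object-detection")  # subtask 2: find objects to count

def describe_and_count(image_path: str) -> str:
    caption = captioner(image_path)[0]["generated_text"].strip()
    objects = detector(image_path)
    return f"{caption} I can detect {len(objects)} objects in the picture."

# Assumes a local image file, e.g.:
# print(describe_and_count("photo.jpg"))
```

In HuggingGPT proper, ChatGPT chooses which Hugging Face models to invoke based on their model-card descriptions rather than a hard-coded plan.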
A new device made waves at the 2024 Consumer Electronics Show (CES): the Rabbit R1. What in the world is this device, and what does it do? The answer harkens back to the notion of agents. The device is built around what the company calls a “Large Action Model,” which takes a verbal command and executes the corresponding task. The steps are essentially what was discussed above; however, there is a unique aspect to this invention. Yes, it is a new invention, but it is also a reimagining of a ubiquitous crutch: the smartphone. The Rabbit R1 is a device, like a smartphone, but explicitly engineered to take voice commands, invoke various underlying applications (executing actions) and present the results on the device screen or with voice. It sits somewhere between an ambient device and a smartphone: no navigating websites or sifting through individual results.
Here are a few examples of what can be done:
It takes human assistance to the logical next level. A companion app, which runs on a laptop, is needed to hook together the services available to you for executing actions. You can’t call an Uber if you don’t already have an Uber account! The developers envision what they have built as a platform for integrating other applications; it’s called rabbit OS (for operating system). Reminiscent of input-output testing tools, it allows you to show it how to use a particular application to carry out a task, and it learns from the demonstration. Show it once, and that application becomes integrated with the others to which you have access.
The company is clearly going after Apple and Google; how successful it will be is another question. Check out the keynote by the company founder. Ambitious, to say the least.
I am always looking for feedback and if you would like me to cover a story, please let me know! Leave me a comment below or ask a question on my blogger profile page.
“Juggy” Jagannathan, PhD, is an AI evangelist with four decades of experience in AI and computer science research.