Virtual human: robots can now understand natural language

20 September 2017

Computer scientists from Brown University have developed a software system that makes robots better at following spoken instructions, however abstract or specific those instructions may be.

The development, which was presented at the Robotics: Science and Systems 2017 conference in Boston, is a step toward robots that are able to more seamlessly communicate with human collaborators.

The research was led by Dilip Arumugam and Siddharth Karamcheti, both undergraduates at Brown when the work was performed. They worked with graduate student Nakul Gopalan and postdoctoral researcher Lawson L.S. Wong (Croucher Fellowship 2015) in the lab of Stefanie Tellex, a professor of computer science at Brown.

“We hope to enable effective human-robot interaction and collaboration using natural language, such as English and Chinese, as the medium. Currently, we rely heavily on programmers to write instruction code for robots, which greatly hinders the widespread use of robots in the workplace and in daily life,” Wong said.

For example, imagine someone in a warehouse working side-by-side with a robotic forklift. The person might say to the robotic partner, “Grab that pallet.” That’s a highly abstract command that implies a number of smaller sub-steps -- lining up the lift, putting the forks underneath and hoisting it up. However, other common commands might be more fine-grained, involving only a single action: “Tilt the forks back a little,” for example.

These different levels of abstraction can cause problems for current robot language models. Most models try to identify cues from the words in the command as well as the sentence structure and then infer a desired action from that language. The inference results then trigger a planning algorithm that attempts to solve the task. But without taking into account the specificity of the instructions, the robot might overplan for simple instructions, or underplan for more abstract instructions that involve more sub-steps. 

We envision a future where people will instruct and interact with robots in the same way that they interact with other people, with natural language as a major component of that interaction

The new system adds a level of sophistication to existing models: in addition to inferring a desired task from the language, it also analyses the language to infer the level of abstraction at which the command was given.
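One rough way to picture this is as two text-classification problems over the same command: one predicting the task, the other predicting how specific the wording is. The sketch below is a hedged illustration of that idea using off-the-shelf scikit-learn components and invented forklift-style commands and labels; it is not the model the Brown team actually built.

```python
# Illustrative sketch (not the authors' model): treat "which task?" and
# "how specific is the command?" as two text-classification problems over
# the same command. The commands and labels below are invented toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

commands = [
    "grab that pallet",
    "pick up the pallet over there",
    "tilt the forks back a little",
    "raise the forks two inches",
]
tasks = ["move_pallet", "move_pallet", "adjust_forks", "adjust_forks"]
levels = ["high", "high", "low", "low"]

# One classifier per question, sharing the same bag-of-words representation.
task_clf = make_pipeline(CountVectorizer(), LogisticRegression())
level_clf = make_pipeline(CountVectorizer(), LogisticRegression())
task_clf.fit(commands, tasks)
level_clf.fit(commands, levels)

cmd = "lift that pallet"
print(task_clf.predict([cmd])[0], level_clf.predict([cmd])[0])
```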

To develop their new model, researchers used Mechanical Turk, Amazon’s crowdsourcing marketplace, and a virtual task domain called Cleanup World. The online domain consists of a few color-coded rooms, a robotic agent and an object that can be manipulated -- in this case, a chair that can be moved from room to room.

Mechanical Turk volunteers watched the robot agent perform a task in the Cleanup World domain -- for example, moving the chair from a red room to an adjacent blue room. The volunteers were then asked what instructions they would have given the robot to get it to perform the task they had just watched, and were given guidance as to the level of specificity their directions should have. The instructions ranged from the high-level (“Take the chair to the blue room”) to the stepwise (“Take five steps north, turn right, take two more steps, get the chair, turn left, turn left, take five steps south”), with a third level of abstraction using terminology somewhere in between.
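Each crowdsourced instruction therefore comes paired with the abstraction level at which it was collected. A hypothetical encoding of such pairs might look like the following; the high- and stepwise-level instructions are the examples quoted above, while the mid-level one is invented, since the article only describes that level loosely.

```python
# Hypothetical (instruction, level) training pairs. The "high" and "low"
# instructions are quoted from the article; the "mid" example is invented.
LABELED_INSTRUCTIONS = [
    ("Take the chair to the blue room", "high"),
    ("Go to the red room, then push the chair north into the blue room", "mid"),
    ("Take five steps north, turn right, take two more steps, get the chair, "
     "turn left, turn left, take five steps south", "low"),
]
```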

The researchers used the volunteers’ spoken instructions to train their system to understand what kinds of words are used in each level of abstraction. From there, the system learned to infer not only a desired action, but also the abstraction level of the command. Knowing both of those things, the system could then trigger its hierarchical planning algorithm to solve the task from the appropriate level.
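Once both the task and the abstraction level are known, the planner can be pointed at an action space of matching granularity, which is what keeps planning fast. The sketch below is a hypothetical dispatch step with made-up action names and levels; a real system would run its hierarchical planning algorithm at the chosen level rather than the placeholder used here.

```python
# Hypothetical dispatch step: match the planner's search space to the inferred
# abstraction level of the command. Action names are invented for illustration.
ACTION_SPACES = {
    "high": ["move_object(object, room)"],                   # one macro-action
    "mid": ["go_to(room)", "push(object, direction)"],       # a few sub-goals
    "low": ["step(north)", "step(south)", "step(east)",
            "step(west)", "turn(left)", "turn(right)"],      # primitive moves
}

def dispatch(task: str, level: str) -> str:
    """Report which search space a planner would use for this command."""
    actions = ACTION_SPACES[level]
    # Placeholder for the actual planning call: searching only the level that
    # matches the command avoids over-planning simple instructions and
    # under-planning abstract ones.
    return f"plan {task!r} over {len(actions)} {level}-level actions"

# Example: values as produced by the inference step sketched earlier.
print(dispatch("move_chair_to_blue_room", "high"))
print(dispatch("move_chair_to_blue_room", "low"))
```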

Having trained their system, the researchers tested it in both the virtual Cleanup World and with an actual Roomba-like robot operating in a physical world similar to the Cleanup World space. They showed that when a robot was able to infer both the task and the specificity of the instructions, it responded to commands in one second 90 percent of the time. In comparison, when no level of specificity was inferred, half of all tasks required 20 or more seconds of planning time.

“Although we have significantly exceeded the previous state of the art, there is still much room for improvement,” said Wong. “People can express the same thing in numerous ways, even ambiguous or literally incorrect ways. Since our system is trained using machine learning, the quality of the training data has to be assured. Multiple iterations of data collection are therefore needed.”

Another issue the team has to deal with is the robustness of the many interacting system components. “The system comprises speech recognition, network communications, state estimation, high-level planning and low-level control. Failure in any of these parts would cause the system as a whole to fail, so significant engineering effort had to be dedicated to ensuring system robustness.”

The team looks forward to enabling robots to learn new tasks and to understand the corresponding instructions. “Imagine teaching a novice to cook. Initially, you may have to specify ‘break the eggs into a bowl, then swirl the contents vigorously in a circular motion’, but later on, this can simply be instructed with ‘beat the eggs’.” Although continual learning of this kind is still beyond the reach of current machine learning and robotics, the development of this new system is a great leap forward.