UR 2026 · Ubiquitous Robots · Osaka, Japan
Robust Assistive Mobile Manipulation via Structured LLM Programs, Confirmation Loops, and Hierarchical Skill Recovery
Trung Bui et al.
Korea Electronics Technology Institute (KETI)
July 2026 · Ritsumeikan University, Osaka
Good afternoon everyone. My name is Trung Bui, from KETI, Korea. Today I will talk about how we make an assistive mobile robot robust, using three ideas: structured LLM programs, confirmation loops, and hierarchical skill recovery.
Why an assistive mobile manipulator?
People with disabilities need help with everyday object tasks — fetch, deliver, tidy
Homes are cluttered, dynamic, unstructured — fixed scripts break
Language is the natural interface: “bring me the coffee can on the table”
LLMs plan well — but a wrong step on a real robot is expensive.
Robustness must come from the system, not the model alone.
Why do we build this robot? People with disabilities need help with everyday tasks. Real homes are cluttered and always changing, so fixed scripts break. Language is the natural interface. Large language models can plan, but one wrong step on a real robot is expensive. So robustness must come from the whole system, not only from the model. This picture is a real view from our robot camera. You can see how cluttered the scene is.
Three mechanisms for robustness
1 · Structured LLM programs — the planner emits typed, schema-constrained skill calls, gated by a zero-token symbolic check before execution
2 · Confirmation loops — every executed step is verified in layers (skill result → symbolic state → VLM), and the operator can inspect & edit the robot’s belief live
3 · Hierarchical skill recovery — failures are repaired at the cheapest possible level: in-skill correction → suffix-only replan → bounded full replan
Together: language flexibility + execution safety + fault tolerance — on one integrated robot.
Our approach has three mechanisms. First, structured LLM programs. The planner does not write free text. It writes typed skill calls, checked by a symbolic gate before execution. Second, confirmation loops. Every step is verified in layers, and a human can watch and edit the robot's belief at any time. Third, hierarchical skill recovery. When something fails, we repair at the cheapest level first. Together, these give language flexibility, execution safety, and fault tolerance.
System at a glance
robot_agent — robot-agnostic runtime: closed planning loop, world state, verifier, FastAPI + WebSocket
SAGE planner (pyplanner) + a single open-weight LLM (on-premise, Ollama)
kcare_robot — 23 skills over ROS2 · VisionServe — off-board GPU perception
Web dashboard — optional multi-user supervision layer
Here is the system at a glance. A user command goes to the brain. The brain understands, plans, acts, verifies, and repairs. The brain is robot-agnostic. It talks to a planner called SAGE, which uses one open-weight language model, running on premise. The robot side has twenty-three skills over ROS2, and a separate GPU vision server. A web dashboard lets people supervise the robot, but the robot also works without it.
1 · Structured LLM programs
Task → sub-goals → typed steps; each step = one skill affordance + arguments
A symbolic gate simulates preconditions & effects — 0 LLM tokens
Invalid step → typed feedback to the model, fixed before the robot moves
The plan is a checkable program, not a story.
task: "bring me the coffee can"
  sub-goal 1: locate the object
    Find(object="coffee can")
  sub-goal 2: fetch and deliver
    MoveTo(place="table")
    Pick(object="coffee can")
    MoveTo(place="user")
    Place(object="coffee can")
gate: Pick rejected — object not found yet → reordered before exec
Pillar one: structured programs. The planner breaks a task into sub-goals, and each sub-goal into typed steps. Each step is one skill with arguments, like Find, MoveTo, Pick. Before the robot moves, a symbolic gate simulates every step: are the preconditions satisfied? This check costs zero LLM tokens. If a step is invalid, the gate sends typed feedback to the model, and the model fixes the plan before execution. So the plan is a checkable program, not a story.
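As a rough illustration of the gate idea, here is a minimal sketch that dry-runs a plan against a symbolic state and returns typed feedback without calling any model. The names `Step`, `State`, `simulate`, `gate`, the error codes, and the precondition rules are all assumptions made for this example; they are not the SAGE implementation.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Step:
    skill: str   # e.g. "Find", "MoveTo", "Pick", "Place"
    args: dict   # typed arguments of the skill call


@dataclass
class State:
    found: set = field(default_factory=set)
    location: Optional[str] = None
    holding: Optional[str] = None


def simulate(step: Step, s: State) -> Optional[str]:
    """Apply one step's effects to s; return an error code if a precondition fails."""
    obj = step.args.get("object")
    if step.skill == "Find":
        s.found.add(obj)
    elif step.skill == "MoveTo":
        if "place" not in step.args:
            return "MISSING_ARGUMENT"
        s.location = step.args["place"]
    elif step.skill == "Pick":
        if obj not in s.found:
            return "OBJECT_NOT_FOUND"      # assumed rule: Pick needs a prior Find
        if s.holding is not None:
            return "GRIPPER_OCCUPIED"
        s.holding = obj
    elif step.skill == "Place":
        if s.holding != obj:
            return "NOT_HOLDING"
        s.holding = None
    else:
        return "UNKNOWN_SKILL"
    return None


def gate(plan, state: Optional[State] = None) -> dict:
    """Dry-run the whole plan on a copy of the state; no LLM call is involved."""
    s = copy.deepcopy(state) if state is not None else State()
    for i, step in enumerate(plan):
        err = simulate(step, s)
        if err is not None:
            return {"ok": False, "index": i, "skill": step.skill, "error": err}
    return {"ok": True}


GOOD_PLAN = [
    Step("Find", {"object": "coffee can"}),
    Step("MoveTo", {"place": "table"}),
    Step("Pick", {"object": "coffee can"}),
    Step("MoveTo", {"place": "user"}),
    Step("Place", {"object": "coffee can"}),
]
BAD_PLAN = [
    Step("MoveTo", {"place": "table"}),
    Step("Pick", {"object": "coffee can"}),   # Pick before Find
    Step("Find", {"object": "coffee can"}),
]
```

In this sketch, `gate(BAD_PLAN)` reports the failing index, the skill name, and an error code. A structured record like this is what could be handed back to the model as typed feedback before the robot moves.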
2 · Confirmation loops — verify every step
Closed loop: perceive → plan → map → act → verify → repair
Layered verifier: skill result → symbolic state → VLM check — cheapest first
Persistent world state (arrived · found · holding + grasp memory) survives across runs
Operator sees every event live and can edit the belief mid-run from the dashboard
Pillar two: confirmation loops. Execution is a closed loop: perceive, plan, map, act, verify, repair. After every action, a layered verifier confirms the step. First the skill's own result. Then a symbolic check of the world state. Then, for critical actions, a vision-language check. Cheapest first. The robot keeps a persistent world state: where it is, what it found, what it is holding. A human operator sees every event live over WebSocket, and can even edit the robot's belief in the middle of a run.
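The cheapest-first ordering can be sketched as follows. This is an illustration under assumptions, not the robot_agent verifier: the function `verify_step`, the `EXPECTED` table, the dictionary layout of the world state, and the choice of `Pick` and `Place` as the critical actions are all hypothetical. Only the world-state fields (arrived, found, holding) and the layer order come from the slide.

```python
from typing import Callable, Optional, Tuple

# Assumed symbolic expectations: what the world state should look like after a step.
EXPECTED = {
    "MoveTo": lambda step, ws: ws.get("arrived") == step["place"],
    "Find": lambda step, ws: step["object"] in ws.get("found", set()),
    "Pick": lambda step, ws: ws.get("holding") == step["object"],
    "Place": lambda step, ws: ws.get("holding") is None,
}

CRITICAL = frozenset({"Pick", "Place"})  # assumption: only these get a VLM check


def verify_step(
    step: dict,
    skill_ok: bool,
    world_state: dict,
    vlm_check: Optional[Callable[[dict], bool]] = None,
) -> Tuple[bool, Optional[str]]:
    """Return (passed, failing_layer). Layers run cheapest first and stop at the first failure."""
    # Layer 1: the skill's own result (free).
    if not skill_ok:
        return False, "skill_result"
    # Layer 2: symbolic check against the world state (no model call).
    expected = EXPECTED.get(step["skill"])
    if expected is not None and not expected(step, world_state):
        return False, "symbolic_state"
    # Layer 3: vision-language check, only for critical actions (most expensive).
    if step["skill"] in CRITICAL and vlm_check is not None:
        if not vlm_check(step):
            return False, "vlm"
    return True, None
```

Because a failure in an early layer returns immediately, the VLM is only consulted when the cheaper layers already agree, and never for non-critical steps such as `MoveTo`.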
3 · Hierarchical skill recovery
Level 1 · inside the skill — wrist-camera fine_move self-corrects the grasp approach
Level 2 · suffix-only repair — on a failed step, regenerate only the remaining steps of the failed sub-goal; the completed prefix is kept
Level 3 · bounded replanning — at most 3 replans per task; failures propagate cleanly, never loop forever
Recover at the cheapest level: 2.4–3.3× fewer LLM calls than whole-plan replanning.
Pillar three: hierarchical recovery. Level one is inside the skill. The wrist camera lets fine-move correct the grasp by itself. Level two is suffix-only repair. If a step fails, we regenerate only the remaining steps of that sub-goal. The finished part of the plan is kept. Level three is bounded replanning: at most three replans per task, so the robot never loops forever. Repairing at the cheapest level needs about two point four to three point three times fewer LLM calls than replanning the whole task.
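A minimal sketch of suffix-only repair with a bounded budget is shown below. The names `run_with_suffix_repair`, `regenerate_suffix`, and `ReplanBudgetExceeded` are hypothetical, and `regenerate_suffix` stands in for the planner call. For brevity the sketch spends the budget of three on suffix repairs; how the real system divides the budget between levels two and three is not stated in this talk.

```python
MAX_REPLANS = 3  # bound taken from the slide; its exact scope is an assumption here


class ReplanBudgetExceeded(RuntimeError):
    """Raised so that a failure propagates cleanly instead of looping forever."""


def run_with_suffix_repair(steps, execute, regenerate_suffix, max_replans=MAX_REPLANS):
    """Execute steps in order; on failure keep the completed prefix and replace only the rest.

    execute(step) -> bool            runs one step on the robot (stubbed in tests)
    regenerate_suffix(done, failed)  returns new remaining steps (one planner call)
    """
    done = []              # completed prefix, never re-planned
    pending = list(steps)  # remaining suffix
    replans = 0
    while pending:
        step = pending[0]
        if execute(step):
            done.append(pending.pop(0))
            continue
        if replans >= max_replans:
            raise ReplanBudgetExceeded(
                f"step {step!r} still failing after {replans} replans"
            )
        replans += 1
        # Only the suffix is regenerated; `done` is passed as context and kept as-is.
        pending = list(regenerate_suffix(done, step))
    return done, replans
```

The point of the design is that each repair costs one planner call over a short suffix, while the finished prefix is neither re-planned nor re-executed.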
Skills, vision & hardware
23 stateless skills → pyconnect agents → ROS2 actuators · open-vocabulary perception (GroundingDINO · GroundedSAM · grasp detection) on an off-board GPU server
This is the robot side. Twenty-three stateless skills: navigation, perception, manipulation, and low-level control. Skills talk to the hardware through pyconnect agents over ROS2. Perception is open-vocabulary: we send an image and a text prompt to a GPU server, and get boxes, masks, and grasp poses back. The head camera finds the object; the wrist camera refines the grasp.
Real run · “pick the phone”
(a) head-camera detection (ph 0.88, LYING) · (b) segmentation + grasp candidate (q0.97) · (c) wrist alignment (w72 mm, +2°) · (d) grasp verified (near=100%)
Now a real run: pick the phone. In frame A, the head camera detects the phone lying on the table. In frame B, the wrist camera segments it and scores a grasp candidate. In frame C, fine-move aligns the gripper: seventy-two millimeters wide, plus two degrees. In frame D, the grasp check passes. Let me play the video. [PLAY — about 30 seconds]
Real run · “pick the coffee can”
Same detect → segment → align → verify sequence on a standing can (co 0.80 · q0.96 · w68 mm · near=98%)
A second run: pick the coffee can. The same sequence — detect, segment, align, verify — on a standing can. Notice the verification numbers on screen: they come from the same layered verifier I showed before. [PLAY]
One platform, three arm-mount modes
KAAIR 6-DOF arm on a vertical lift rail · two-finger + suction gripper · pan-tilt head RGB-D + wrist RGB-D · mobile base (Nav2)
This is our platform. A six degree-of-freedom arm rides a vertical lift rail, so we can mount it left, front, or right, depending on the task. It has a two-finger gripper with suction, a pan-tilt head camera, a wrist camera, and a mobile base with Nav2 navigation.
Inside the planner (SAGE)
Hierarchical decomposition — task → sub-goals → steps (one LLM call each)
Hybrid memory — retrieves few-shot examples from a curated seed set plus its own successful episodes
Symbolic gate + suffix repair — the two mechanisms from pillars 1 & 3
Single open-weight model — the whole stack runs on-premise
2.4–3.3× fewer LLM calls to recover from a failure vs. whole-plan replanning
A quick look inside the planner, SAGE. It decomposes hierarchically: task, sub-goals, steps. It has a hybrid memory: a small curated seed set, plus its own successful episodes, retrieved as few-shot examples. And it uses the symbolic gate and suffix repair you already saw. Everything runs on a single open-weight model, fully on-premise. The key number: recovery needs about two point four to three point three times fewer LLM calls than whole-plan replanning.
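The hybrid memory can be pictured with a small sketch. The talk does not say how SAGE scores similarity, so the token-overlap (Jaccard) measure below is a stand-in chosen for simplicity, and the class `HybridMemory` with its methods is hypothetical.

```python
def tokens(text: str) -> set:
    """Lowercased whitespace tokens; a deliberately crude stand-in for real retrieval."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


class HybridMemory:
    """Curated seed examples plus the planner's own successful episodes."""

    def __init__(self, seeds):
        self.seeds = list(seeds)   # fixed (task, plan) pairs written by hand
        self.episodes = []         # (task, plan) pairs added after successful runs

    def record_success(self, task: str, plan: list) -> None:
        self.episodes.append((task, plan))

    def retrieve(self, task: str, k: int = 2) -> list:
        """Return the k stored examples most similar to the new task."""
        pool = self.seeds + self.episodes
        ranked = sorted(pool, key=lambda ex: jaccard(task, ex[0]), reverse=True)
        return ranked[:k]
```

The retrieved pairs would then be placed in the prompt as few-shot examples, so the examples improve as the robot accumulates successful runs.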
Three ways to drive the same robot
Mode | Entry point | Use case
Web dashboard | HTTP / WebSocket | multi-user supervision, live world-state editing
CLI | kcare_robot skill::inputs | operators, scripting, tests
Python API | kcare_robot.skills.* | researchers, new behaviors
All three reach the same skill registry — identical behavior
New robots inherit the whole stack from a project template; per-site config profiles hot-switch deployments
One practical point. The same robot can be driven three ways: the web dashboard for supervision, the command line for operators, and a Python API for researchers. All three reach the same skill registry, so the behavior is identical. And a new robot can inherit this whole stack from a project template, with per-site configuration profiles.
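Why the behavior is identical across entry points can be shown with a toy registry. Everything in this sketch is hypothetical: the `skill` decorator, the `move_to` example skill, and in particular the `name::key=value,key=value` reading of the `skill::inputs` string, which is a guess at the format and not the documented kcare_robot syntax.

```python
REGISTRY = {}  # single source of truth: skill name -> callable


def skill(name):
    """Decorator that registers a function under a skill name."""
    def register(fn):
        REGISTRY[name] = fn
        return fn
    return register


@skill("move_to")
def move_to(place):
    # A real skill would command the base; this stub only echoes its inputs.
    return {"skill": "move_to", "place": place, "ok": True}


def call(name, **inputs):
    """Python-API style entry point."""
    if name not in REGISTRY:
        raise KeyError(f"unknown skill: {name}")
    return REGISTRY[name](**inputs)


def call_from_cli(spec):
    """CLI-style entry point; parses an ASSUMED 'name::key=value,key=value' string."""
    name, _, raw = spec.partition("::")
    inputs = dict(pair.split("=", 1) for pair in raw.split(",") if pair)
    return call(name, **inputs)
```

Both `call("move_to", place="table")` and `call_from_cli("move_to::place=table")` end in the same registered function, so a dashboard handler built on `call` would behave the same way.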
Takeaways
Structured LLM programs + a zero-token symbolic gate make plans checkable before the robot moves
Confirmation loops — layered verification + live human oversight — catch failures as they happen
Hierarchical recovery repairs at the cheapest level: 2.4–3.3× fewer LLM calls
Demonstrated on a real assistive mobile manipulator, fully on-premise
To conclude. Structured programs plus a symbolic gate make plans checkable before the robot moves. Confirmation loops catch failures as they happen. Hierarchical recovery repairs at the cheapest level, with about two point four to three point three times fewer LLM calls. And all of this runs on a real assistive robot, fully on-premise. Thank you very much. I am happy to take questions.