These notes are a summary of concepts presented in “YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks.”
Bandyopadhyay, S., Bahirwani, V., Aggarwal, L., Guda, B.P., Li, L., & Colaco, A. (2025). YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks.
- Multimodal AI Agents
- AI models capable of interactively assisting users with daily tasks
- Focus on proactive and cooperative engagement
- YET to Intervene (YETI) Multimodal Agent
- Research focus
- Identifying circumstances for proactive intervention
- Purpose
- Assist users in correcting mistakes during task execution
- Key Features of YETI
- Scene understanding signals
- Based on Structural Similarity Index Measure (SSIM) across video frames
- Alignment signal
- Tracks consistency of user actions with expected task actions (e.g., monitoring object count)
- Algorithmic efficiency
- On-the-fly computation reduces overhead while maintaining accuracy
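The scene understanding signal and the on-the-fly computation noted above can be illustrated with a minimal sketch. It assumes grayscale frames given as nested lists of pixel intensities in [0, 255], and it computes a single-window (global) SSIM over the whole frame, which is a simplification: standard SSIM averages the same formula over local windows, and the notes do not say which variant YETI uses. The names `global_ssim` and `ssim_stream` are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional

# Assumption: a frame is a grayscale image stored as rows of intensities in [0, 255].
Frame = List[List[float]]


def global_ssim(a: Frame, b: Frame, dynamic_range: float = 255.0) -> float:
    """Single-window SSIM over two whole frames (simplified, not windowed)."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    if not xs or len(xs) != len(ys):
        raise ValueError("frames must be non-empty and the same size")
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    var_x = sum((x - mu_x) * (x - mu_x) for x in xs) / n
    var_y = sum((y - mu_y) * (y - mu_y) for y in ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    # Standard SSIM stabilising constants: C1 = (0.01 L)^2, C2 = (0.03 L)^2.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_stream(frames: Iterable[Frame]) -> Iterator[float]:
    """Yield SSIM between each frame and its predecessor.

    Only the previous frame is kept in memory, so scores are produced
    on the fly as frames arrive rather than after the whole video is read.
    """
    prev: Optional[Frame] = None
    for frame in frames:
        if prev is not None:
            yield global_ssim(prev, frame)
        prev = frame


# Usage: two identical frames followed by an inverted one.
still = [[0, 0], [255, 255]]
flipped = [[255, 255], [0, 0]]
scores = list(ssim_stream([still, still, flipped]))
```

Here `scores` holds one value per consecutive frame pair: close to 1 for the unchanged pair and much lower for the pair where the scene changes.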
- Proactive Communication in AI Agents
- Intelligence
- Anticipating task developments
- Adaptivity
- Adjusting intervention timing dynamically
- Civility
- Respecting user boundaries and ethical considerations
- Data Collection Framework
- AR devices
- Capture first-person task execution videos and expert guidance
- Interaction Types
- Proactive interactions
- High-level guidance, follow-up instructions, feedback, error correction
- Reactive interactions
- Expert responses to user queries and dialogues
- Signals for Proactive Intervention
- Change in object count
- Tracks user attention and task focus (e.g., pauses during instruction listening)
- Temporal coherence
- SSIM analysis identifies meaningful video frames for intervention
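As a rough illustration of how the two signals above might be turned into candidate frames for intervention, the sketch below flags frames where the detected object count changes and frames where SSIM against the previous frame drops below a threshold. Everything here is an assumption made for illustration: the per-frame object counts are taken to come from some upstream detector, the SSIM scores are taken as already computed, the 0.8 threshold is arbitrary, and combining the signals by union is a guess, since the notes do not state the paper's actual decision rule.

```python
from typing import List, Sequence


def object_count_changes(counts: Sequence[int]) -> List[int]:
    """Indices of frames whose object count differs from the previous frame."""
    return [i for i in range(1, len(counts)) if counts[i] != counts[i - 1]]


def low_ssim_frames(ssim_scores: Sequence[float], threshold: float = 0.8) -> List[int]:
    """Indices of frames that differ strongly from their predecessor.

    ssim_scores[i] is assumed to compare frame i with frame i + 1,
    so the index of the later frame is returned.
    """
    return [i + 1 for i, score in enumerate(ssim_scores) if score < threshold]


def candidate_frames(
    counts: Sequence[int],
    ssim_scores: Sequence[float],
    threshold: float = 0.8,
) -> List[int]:
    """Frames flagged by either signal (union is a hypothetical combination rule)."""
    flagged = set(object_count_changes(counts))
    flagged |= set(low_ssim_frames(ssim_scores, threshold))
    return sorted(flagged)


# Usage: five frames, so four consecutive-pair SSIM scores.
counts = [3, 3, 4, 4, 4]
ssim_scores = [0.95, 0.90, 0.97, 0.60]
candidates = candidate_frames(counts, ssim_scores)
```

With these inputs, frame 2 is flagged by the object-count change and frame 4 by the SSIM drop.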
- Proactive Activities Categorization
- Proactive interaction
- Initiating engagement without user prompts (e.g., high-level instructions)
- Proactive intervention
- Taking steps to alter user behavior (e.g., correcting errors)
- Both
- Follow-up instructions, confirming/correcting actions
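The overlap between the two categories can be made explicit with a small data-structure sketch. The activity labels are paraphrased from the list above. The names `ProactiveCategory`, `CATEGORY_BY_ACTIVITY`, and `is_intervention` are hypothetical and do not come from the paper.

```python
from enum import Flag, auto


class ProactiveCategory(Flag):
    INTERACTION = auto()   # initiating engagement without a user prompt
    INTERVENTION = auto()  # taking steps to alter user behavior
    BOTH = INTERACTION | INTERVENTION


# Illustrative mapping based on the examples in the notes.
CATEGORY_BY_ACTIVITY = {
    "high-level instruction": ProactiveCategory.INTERACTION,
    "error correction": ProactiveCategory.INTERVENTION,
    "follow-up instruction": ProactiveCategory.BOTH,
    "confirming or correcting an action": ProactiveCategory.BOTH,
}


def is_intervention(activity: str) -> bool:
    """True if the activity counts, at least in part, as a proactive intervention."""
    return bool(CATEGORY_BY_ACTIVITY[activity] & ProactiveCategory.INTERVENTION)
```

Using a flag rather than a plain enumeration lets an activity in the "Both" group answer yes to either membership check.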
- Enhanced Sensory Integration
- Additional modalities
- Hand pose, eye gaze, head orientation, IMU readings, depth data
- Benefit
- Comprehensive context understanding for nuanced intervention
- Use Cases
- Assembly tasks
- Examples: Assembling/disassembling furniture and equipment
- Maintenance and repair
- Examples: Fixing a motorcycle, changing mechanical belts
- Consumer electronics
- Examples: Setting up printers, cameras, coffee machines, gaming consoles
- Key Benefits of Proactive Intervention
- Complexity management
- Helps manage tasks that become complex due to mistakes
- Safety considerations
- Guides users to avoid harm during risky tasks
- Bridging expertise gap
- Acts as an instructor for novice users