Research Digest
AI-generated summaries of top robotics and ML papers, powered by Gemma 4 running locally.
Normalizing Trajectory Models
2026-05-12
Normalizing Trajectory Models (NTM) introduces a framework that models the non-Gaussian reverse conditional \(p(x_s | x_t)\) as a conditional normalizing flow, enabling exact likelihood training for high-quality, few-step image generation.
BAMI: Training-Free Bias Mitigation in GUI Grounding
2026-05-10
BAMI introduces a training-free inference framework that mitigates precision and ambiguity biases in GUI grounding through two manipulations: coarse-to-fine focus and candidate selection.
\(π_0\): A Vision-Language-Action Flow Model for General Robot Control
2026-05-10
\(π_0\) is a novel flow matching architecture built on a pre-trained Vision-Language Model (VLM) to create a generalist robot policy capable of performing complex, dexterous tasks via direct prompting or fine-tuning.
World Model for Robot Learning: A Comprehensive Survey
2026-05-09
This survey provides a comprehensive, robot-learning-centered review of world models, detailing their coupling with robot policies, their function as simulators, and their application in robotic video generation.
ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation
2026-05-09
ActCam is a zero-shot, training-free framework that achieves joint control over character motion and camera trajectory in image-conditioned video generation by constructing camera-aligned pose and depth conditioning signals.
Audio-Visual Intelligence in Large Foundation Models
2026-05-07
This survey provides the first comprehensive review of Audio-Visual Intelligence (AVI) within the paradigm of large foundation models, unifying perception, generation, and interaction research under a coherent framework.
AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
2026-05-06
AlbumFill is a training-free framework that uses vision-language reasoning to automatically retrieve identity-consistent reference images from personal photo albums to guide personalized image completion.
Posterior Augmented Flow Matching
2026-05-05
Posterior-Augmented Flow Matching (PAFM) generalizes standard Flow Matching (FM) by replacing single-target supervision with an expectation over an approximate posterior of valid target completions, thereby reducing gradient variance.
HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
2026-05-01
HERMES++ is a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework by leveraging a Bird’s-Eye View (BEV) representation and LLM-enhanced world queries.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
2026-05-01
GR00T N1 is an open Vision-Language-Action (VLA) foundation model for humanoid robots featuring a dual-system architecture that integrates a Vision-Language Model (System 2) for reasoning and a Diffusion Transformer (System 1) for real-time motor action generation.
Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
2026-05-01
GenWildSplat is a feed-forward framework that achieves generalizable 3D scene reconstruction from sparse, unposed images by integrating appearance adaptation and occlusion modeling without requiring per-scene optimization.
Eureka: Human-Level Reward Design via Coding Large Language Models
2026-05-01
EUREKA is a novel reward design algorithm powered by coding Large Language Models (LLMs) that autonomously generates human-level reward functions for complex reinforcement learning tasks through evolutionary optimization.
An Adaptable, Safe, and Portable Robot-Assisted Feeding System
2026-05-01
This paper presents an adaptable, safe, and portable robot-assisted feeding system that enables people with mobility impairments to feed themselves. The system uses multi-modal online learning for bite acquisition and real-time perception for bite transfer.