Research Digest

AI-generated summaries of top robotics and ML papers, powered by Gemma 4 running locally.

Normalizing Trajectory Models

2026-05-12

Normalizing Trajectory Models (NTM) introduces a framework that models the non-Gaussian reverse conditional \(p(x_s | x_t)\) as a conditional normalizing flow, enabling exact likelihood training for high-quality, few-step image generation.
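
As a rough illustration of why a conditional normalizing flow permits exact likelihood training, the generic change-of-variables identity is sketched below. The symbols \(f_\theta\) (an invertible map in \(x_s\), conditioned on \(x_t\)) and \(p_Z\) (a simple base density) are assumed notation for this sketch and are not taken from the paper, whose parameterization may differ.

\[
\log p_\theta(x_s \mid x_t) = \log p_Z\big(f_\theta(x_s; x_t)\big) + \log \left| \det \frac{\partial f_\theta(x_s; x_t)}{\partial x_s} \right|
\]

Both terms on the right can be evaluated exactly when \(f_\theta\) has a tractable Jacobian determinant, so the log-likelihood can be maximized directly rather than through a bound.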

BAMI: Training-Free Bias Mitigation in GUI Grounding

2026-05-10

BAMI introduces a training-free inference framework that mitigates precision and ambiguity biases in GUI grounding by employing coarse-to-fine focus and candidate selection manipulations.

\(π_0\): A Vision-Language-Action Flow Model for General Robot Control

2026-05-10

\(π_0\) is a flow matching architecture built on a pre-trained Vision-Language Model (VLM) that serves as a generalist robot policy, capable of performing complex, dexterous tasks via direct prompting or fine-tuning.

World Model for Robot Learning: A Comprehensive Survey

2026-05-09

This survey provides a comprehensive, robot-learning-centered review of world models, detailing their coupling with robot policies, their function as simulators, and their application in robotic video generation.

ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

2026-05-09

ActCam is a zero-shot, training-free framework that achieves joint control over character motion and camera trajectory in image-conditioned video generation by constructing camera-aligned pose and depth conditioning signals.

Audio-Visual Intelligence in Large Foundation Models

2026-05-07

This survey provides the first comprehensive review of Audio-Visual Intelligence (AVI) within the paradigm of large foundation models, unifying perception, generation, and interaction research under a coherent framework.

AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion

2026-05-06

AlbumFill is a training-free framework that uses vision-language reasoning to automatically retrieve identity-consistent reference images from personal photo albums to guide personalized image completion.

Posterior Augmented Flow Matching

2026-05-05

Posterior-Augmented Flow Matching (PAFM) generalizes standard Flow Matching (FM) by replacing single-target supervision with an expectation over an approximate posterior of valid target completions, thereby reducing gradient variance.
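
To make the variance argument concrete, here is a minimal sketch under stated assumptions: a linear interpolation path \(x_t = (1-t)\,x_0 + t\,x_1\) (one common choice in flow matching), a velocity network \(v_\theta\), and a hypothetical approximate posterior \(q(x_1 \mid x_t)\). The notation and the exact form of the averaged target are illustrative and may not match the paper's estimator. Standard FM regresses onto a single sampled target:

\[
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1} \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2
\]

Under the same path, the minimizer of this loss is the posterior mean of the conditional velocity, \(\mathbb{E}\big[(x_1 - x_t)/(1-t) \mid x_t\big]\). A posterior-augmented variant would therefore regress onto an average over several plausible completions instead of one:

\[
\mathcal{L}_{\mathrm{PA}}(\theta) = \mathbb{E}_{t,\,x_t} \left\| v_\theta(x_t, t) - \mathbb{E}_{x_1' \sim q(\cdot \mid x_t)} \left[ \frac{x_1' - x_t}{1 - t} \right] \right\|^2
\]

When \(q\) is close to the true posterior, both objectives share the same minimizer, but the averaged target fluctuates less from sample to sample, which is the source of the reduced gradient variance.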

HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

2026-05-01

HERMES++ is a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework by leveraging a Bird’s-Eye View (BEV) representation and LLM-enhanced world queries.

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

2026-05-01

GR00T N1 is an open Vision-Language-Action (VLA) foundation model for humanoid robots featuring a dual-system architecture that integrates a Vision-Language Model (System 2) for reasoning and a Diffusion Transformer (System 1) for real-time motor action generation.

Generalizable Sparse-View 3D Reconstruction from Unconstrained Images

2026-05-01

GenWildSplat is a feed-forward framework that achieves generalizable 3D scene reconstruction from sparse, unposed images by integrating appearance adaptation and occlusion modeling without requiring per-scene optimization.

Eureka: Human-Level Reward Design via Coding Large Language Models

2026-05-01

EUREKA is a novel reward design algorithm powered by coding Large Language Models (LLMs) that autonomously generates human-level reward functions for complex reinforcement learning tasks through evolutionary optimization.

An Adaptable, Safe, and Portable Robot-Assisted Feeding System

2026-05-01

This paper presents an adaptable, safe, and portable robot-assisted feeding system that enables people with mobility impairments to feed themselves. It uses multi-modal online learning for bite acquisition and real-time perception for bite transfer.