OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks


This is a companion discussion topic for the original entry at https://arxiv.org/abs/2606.29537