Gunnar Atli Sigurdsson

Multimodal Learning in Video and 3D

Household robots operate in the same space for years. Such robots incrementally build dynamic maps that can be used for tasks requiring remote object localization. In an observed environment, locating an object requires searching among all objects in the environment and comes with various challenges, including privacy. We apply vision-language models to these large 3D search spaces.

Gunnar A Sigurdsson, Jesse Thomason, Gaurav S Sukhatme, Robinson Piramuthu, "RREx-BoT: Remote Referring Expressions with a Bag of Tricks", In IROS, 2023. [pdf] [bibtex] [web]

Brandon Trabucco, Gunnar A Sigurdsson, Robinson Piramuthu, Gaurav S Sukhatme, Ruslan Salakhutdinov, "A Simple Approach for Visual Room Rearrangement: 3D Mapping and Semantic Search", In ICLR, 2022. (AI2THOR Rearrangement Challenge Winner) [pdf] [bibtex]

Vishnu Sashank Dorbala, Gunnar Sigurdsson, Robinson Piramuthu, Jesse Thomason, Gaurav S Sukhatme, "CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation", In CoRLW LangRob, 2022. [pdf] [bibtex]

Shiyuan Huang, Robinson Piramuthu, Shih-Fu Chang, Gunnar A. Sigurdsson, "Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy", In ECCVW, 2022. [pdf] [bibtex]

Visual Grounding in Web-Scale Video Corpora

There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in the native language. Given this shared embedding, we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages. We apply these methods to translate words from English to French, Korean, and Japanese -- all without any parallel corpora and simply by watching many videos of people speaking while doing things.

Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman, "Visual Grounding in Video for Unsupervised Word Translation", In CVPR, 2020. [pdf] [bibtex] [web]
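
The word-mapping step described above can be illustrated with a minimal sketch. It assumes the shared embedding has already been learned and only shows a nearest-neighbor lookup by cosine similarity; it is not the MUVE algorithm itself. The function names and the toy three-dimensional vectors are hypothetical and chosen purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def translate(word, source_emb, target_emb):
    """Return the target-language word closest to `word` in the shared space."""
    query = source_emb[word]
    return max(target_emb, key=lambda t: cosine(query, target_emb[t]))


# Toy, hand-made vectors standing in for embeddings learned from video.
# Real embeddings would be high-dimensional and learned, not written by hand.
english = {
    "knife": [0.9, 0.1, 0.0],
    "water": [0.0, 0.2, 0.9],
    "bowl": [0.2, 0.9, 0.1],
}
french = {
    "couteau": [0.8, 0.2, 0.1],
    "eau": [0.1, 0.1, 0.95],
    "bol": [0.1, 0.95, 0.2],
}

for w in english:
    print(w, "->", translate(w, english, french))
```

With these toy vectors each English word maps to the French word whose vector points in nearly the same direction. In practice the quality of such a lookup depends entirely on how well the two languages are aligned in the shared space.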

Unifying Third and First Person

Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor). Despite this, learning such models for human action recognition has not been achievable due to the lack of data. This paper takes a step in this direction, with the introduction of Charades-Ego, a large-scale dataset of 4000 paired first-person and third-person videos involving 112 people. This enables learning the link between the two perspectives, actor and observer. Thereby, we address one of the biggest bottlenecks facing egocentric vision research, providing a link from first-person to the abundant third-person data on the web. We use this data to learn a joint representation of first and third-person videos, with only weak supervision, and show its effectiveness for transferring knowledge from the third-person to the first-person domain.

Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Karteek Alahari, "Beyond the Camera: Neural Networks in World Coordinates", In ArXiv, 2020. [pdf] [bibtex] [web]

Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari, "Actor and Observer: Joint Modeling of First and Third-Person Videos", In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. (Spotlight Presentation) [pdf] [bibtex] [poster] [code]

Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari, "Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos", In ArXiv, 2018. [pdf] [bibtex] [web]

Hollywood in Homes / Charades

allenai.org/plato/charades/

Computer vision has great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. Because most such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities.

Gunnar A. Sigurdsson, Olga Russakovsky, Abhinav Gupta, "What Actions are Needed for Understanding Human Actions in Videos?", In International Conference on Computer Vision (ICCV), 2017. [pdf] [bibtex] [poster] [code]

Gunnar A. Sigurdsson, Santosh Divvala, Ali Farhadi, Abhinav Gupta, "Asynchronous Temporal Fields for Action Recognition", In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [pdf] [bibtex] [code]

Gunnar A. Sigurdsson, Olga Russakovsky, Ali Farhadi, Ivan Laptev, Abhinav Gupta, "Much Ado About Time: Exhaustive Annotation of Temporal Data", In HCOMP, 2016. (Oral Presentation) [pdf] [bibtex] [web]

Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, Abhinav Gupta, "Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding", In European Conference on Computer Vision, 2016. [pdf] [bibtex] [poster] [web]

Learning Visual Storylines

What does a typical visit to Paris look like? In this work, we show how to automatically learn the temporal aspects, or storylines, of visual concepts from web data. Our novel Skipping Recurrent Neural Network (S-RNN) uses a framework that skips through the images in the photo stream to explore the space of all ordered subsets of the albums.

Gunnar A. Sigurdsson, Xinlei Chen, Abhinav Gupta, "Learning Visual Storylines with Skipping Recurrent Neural Networks", In European Conference on Computer Vision, 2016. [pdf] [bibtex] [poster] [code]

Shape Analysis

Image segmentation algorithms commonly return segmentation masks that represent the shape of objects. In the medical imaging domain in particular, this shape carries information about, for example, the state of the segmented organ. By looking at the shape of an object, in two or three dimensions, it is possible to look for signs of disease.

Gunnar A. Sigurdsson, Zhen Yang, Trac D. Tran, Jerry L. Prince, "Interpretable exemplar-based shape classification using constrained sparse linear models", In Proc. SPIE, vol. 9413, pp. 94130R-94130R-7, 2015. (Oral Presentation) [pdf] [bibtex] [doi] [slides] [code]

Diffusion MRI processing

Using modern diffusion weighted magnetic resonance imaging protocols, the orientations of multiple neuronal fiber tracts within each voxel can be estimated. Further analysis of these populations, including application of fiber tracking and tract segmentation methods, is often hindered by lack of spatial smoothness of the estimated orientations. For example, a single noisy voxel can cause a fiber tracking method to switch tracts in a simple crossing tract geometry. In this work, a generalized spatial smoothing framework that handles multiple orientations as well as their fractional contributions within each voxel is proposed.

Gunnar A. Sigurdsson, Jerry L. Prince, "Smoothing fields of weighted collections with applications to diffusion MRI processing", In Proc. SPIE, vol. 9034, pp. 90342D-90342D-7, 2014. [pdf] [bibtex] [doi] [poster] [code]

Polysomnography Analysis

The Icelandic biomedical company Nox Medical provides solutions for sleep monitoring and diagnostics. With portable sleep monitors, there is an opportunity to measure large populations of people in their own beds. From this data, we are interested in exploring relationships between underlying disease and measurements. We performed statistical analysis on various time-series from the device and looked at the discriminative power of various features for classifying events. Using regression and classification algorithms, such as neural networks, we were able to predict vibration in patients from sounds; the results warrant further study of relationships with disease. Our work was incorporated into the company's software suite, Noxturnal.

Erna S. Arnardottir, Bardur Isleifsson, Jon S. Agustsson, Gunnar A. Sigurdsson, Magdalena O. Sigurgunnarsdottir, Gudjon T. Sigurðarson, Gudmundur Saevarsson, Atli T. Sveinbjarnarson, Sveinbjorn Hoskuldsson, Thorarinn Gislason, "How to measure snoring? A comparison of the microphone, cannula and piezoelectric sensor", In Journal of Sleep Research, 2015. [bibtex] [doi]

Erna Sif Arnardottir, Magdalena Osk Sigurgunnarsdottir, Gunnar Atli Sigurdsson, Gudmundur Saevarsson, Sveinbjorn Hoskuldsson, Thorarinn Gislason, "Snoring - Validation of different objective measurements", In European Respiratory Journal, European Respiratory Society, vol. 42, no. Suppl 57, 2014. [bibtex]

Active Radiator

The immense popularity of wireless communications has left the common frequency bands crowded, prompting researchers to utilize available spectrum at ever higher frequencies. At mm-wave frequencies there is a pronounced need for novel antenna designs that are tightly integrated with their driving circuitry in order to reduce power losses. A radiator concept for 94 GHz CMOS technology was reviewed, scaled up, and redesigned to work at 2.4 GHz on an FR-4 printed circuit board, in the interest of testing the concept. The radiator works in a similar manner to an array of dipoles, and can connect directly to the last amplifier stage without impedance matching, due to load-pull matched input impedances, accomplishing all of its power combining in the air. 3D full-wave electromagnetic field simulations were performed on all transmission line structures. Furthermore, various ways to achieve symmetric power splitting and shorted transmission line stubs with coupled lines were designed and experimented with, in order to achieve acceptable efficiency and radiation pattern of the radiating array. Work with Steven Bowers and Ali Hajimiri at Caltech.

Siddharth Singi, Zhanpeng He, Alvin Pan, Sandip Patel, Gunnar A. Sigurdsson, Robinson Piramuthu, Shuran Song, Matei Ciocarlie, "Decision Making for Human-in-the-loop Robotic Agents via Uncertainty-Aware Reinforcement Learning", In ICRA, 2024. [bibtex]

Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit Bansal, Vicente Ordonez, Gunnar A Sigurdsson, Nanyun Peng, Xin Eric Wang, "FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation", In ArXiv, 2024. [pdf] [bibtex]

Yufei Tian, Anjali Narayan-Chen, Shereen Oraby, Alessandra Cervone, Gunnar Sigurdsson, Chenyang Tao, Wenbo Zhao, Tagyoung Chung, Jing Huang, Nanyun Peng, "Unsupervised melody-to-lyric generation", In ACL, 2023. [pdf] [bibtex]