Gunnar Atli Sigurdsson

Senior Applied Scientist at Amazon AGI

{firstletterofmyfirstname}@{myfirstname}.xyz

LinkedIn   GitHub   Google Scholar   Semantic Scholar

My research focuses on multimodal learning across video, text, 3D, audio, and robotics.

Multimodal Learning in Video and 3D

Household robots operate in the same space for years and can incrementally build dynamic maps that support tasks requiring remote object localization. In such an observed environment, locating an object means searching among every object the robot has ever seen, which raises challenges of scale and privacy. We apply vision-language models to these large 3D search spaces.
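
A minimal sketch of the retrieval step, assuming candidate objects in the map already carry embeddings from a CLIP-style vision-language model; the function names and random features below are placeholders, not the exact RREx-BoT pipeline:

```python
import numpy as np

def locate_object(expression_embedding: np.ndarray,
                  object_embeddings: np.ndarray) -> int:
    """Return the index of the map object best matching a referring expression.

    expression_embedding: (d,) text embedding of e.g. "the mug on the desk"
    object_embeddings:    (n, d) one visual embedding per candidate object
    """
    # Cosine similarity reduces to a dot product after L2 normalization.
    text = expression_embedding / np.linalg.norm(expression_embedding)
    objs = object_embeddings / np.linalg.norm(object_embeddings,
                                              axis=1, keepdims=True)
    scores = objs @ text            # (n,) one similarity per candidate
    return int(np.argmax(scores))   # best-scoring object in the 3D map

# Toy usage: random features stand in for real encoder outputs.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(1000, 512))            # 1000 objects in the map
query = candidates[42] + 0.1 * rng.normal(size=512)  # noisy "match" for object 42
print(locate_object(query, candidates))              # -> 42
```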

, , , , "RREx-BoT: Remote Referring Expressions with a Bag of Tricks", In IROS, 2023. [pdf] [bibtex] [web]
, , , , , "A Simple Approach for Visual Room Rearrangement: 3D Mapping and Semantic Search", In ICLR, 2022. (AI2THOR Rearrengement Challenge Winner) [pdf] [bibtex]
, , , , , "Clip-nav: Using clip for zero-shot vision-and-language navigation", In CoRLW LangRob, 2022. [pdf] [bibtex]
, , , , "Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy", In ECCVW, 2022. [pdf] [bibtex]

Visual Grounding in Web-Scale Video Corpora

There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this shared visual world has the potential to bridge the gap between all of them. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in each native language. Given this shared embedding, we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis of our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages. We apply these methods to translate words from English to French, Korean, and Japanese -- all without any parallel corpora, simply by watching many videos of people speaking while doing things.
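
As a concrete illustration of step (i), here is a minimal sketch of translation-by-retrieval in a shared embedding space. The vocabularies and vectors are hypothetical stand-ins for embeddings learned from narrated video, not the MUVE implementation:

```python
import numpy as np

def translate(word: str,
              src_vocab: dict[str, np.ndarray],
              tgt_vocab: dict[str, np.ndarray]) -> str:
    """Map a source-language word to the nearest target-language word."""
    v = src_vocab[word]
    v = v / np.linalg.norm(v)
    # Nearest neighbor by cosine similarity in the shared visual space.
    best, best_score = None, -np.inf
    for w, u in tgt_vocab.items():
        score = float(v @ (u / np.linalg.norm(u)))
        if score > best_score:
            best, best_score = w, score
    return best

# Toy usage: words for the same visual concept get nearby embeddings.
rng = np.random.default_rng(0)
concept = {c: rng.normal(size=64) for c in ["ball", "cut", "pour"]}
en = {w: concept[w] + 0.05 * rng.normal(size=64) for w in concept}
fr = {t: concept[s] + 0.05 * rng.normal(size=64)
      for s, t in [("ball", "balle"), ("cut", "couper"), ("pour", "verser")]}
print(translate("cut", en, fr))   # -> "couper"
```

In MUVE itself this retrieval is only the starting point; per (ii) above, the shared space also initializes the text-based mapping stage.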

, , , , , , , , "Visual Grounding in Video for Unsupervised Word Translation", In CVPR, 2020. [pdf] [bibtex] [web]

Unifying Third and First Person

Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, seamlessly transferring knowledge between the third-person (observer) and first-person (actor) views. Despite this, learning such models for human action recognition has not been feasible due to a lack of data. This work takes a step in that direction with Charades-Ego, a large-scale dataset of 4000 paired first-person and third-person videos involving 112 people, which enables learning the link between the two perspectives. This addresses one of the biggest bottlenecks facing egocentric vision research: the missing link from first-person data to the abundant third-person data on the web. We use this data to learn a joint representation of first- and third-person videos with only weak supervision, and show its effectiveness for transferring knowledge from the third-person to the first-person domain.
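
A minimal sketch of the weak-supervision idea, assuming clip-level video features are already available; it uses a standard triplet margin loss as a stand-in for the paper's exact objective:

```python
import torch
import torch.nn.functional as F

# Paired first- and third-person clips of the same moment should embed
# nearby, while unrelated clips are pushed apart. A triplet margin loss
# expresses exactly this; the clip encoders are placeholders (any video
# feature extractor producing fixed-size embeddings would fit).

def alignment_loss(third_person: torch.Tensor,
                   first_person_pos: torch.Tensor,
                   first_person_neg: torch.Tensor,
                   margin: float = 0.2) -> torch.Tensor:
    """third_person:     (B, d) third-person clip embeddings
    first_person_pos: (B, d) egocentric clips of the SAME moments
    first_person_neg: (B, d) egocentric clips of OTHER moments"""
    a = F.normalize(third_person, dim=1)
    p = F.normalize(first_person_pos, dim=1)
    n = F.normalize(first_person_neg, dim=1)
    return F.triplet_margin_loss(a, p, n, margin=margin)

# Toy usage: random features stand in for real video encoders.
B, d = 8, 256
loss = alignment_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, d))
print(loss.item())
```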

, , , , "Beyond the Camera: Neural Networks in World Coordinates", In ArXiv, 2020. [pdf] [bibtex] [web]
, , , , , "Actor and Observer: Joint Modeling of First and Third-Person Videos", In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. (Spotlight Presentation) [pdf] [bibtex] [poster] [code]
, , , , , "Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos", In ArXiv, 2018. [pdf] [bibtex] [web]

Hollywood in Homes / Charades

allenai.org/plato/charades/

Computer vision has great potential to help in our daily lives, by searching for lost keys, watering flowers, or reminding us to take a pill. To succeed at such tasks, computer vision methods need to be trained on real and diverse examples of our daily dynamic scenes. Because most such scenes are not particularly exciting, they typically do not appear on YouTube, in movies, or in TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation, from script writing to video recording and annotation. Following this procedure, we collected a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities.

, , , "What Actions are Needed for Understanding Human Actions in Videos?", In International Conference on Computer Vision (ICCV), 2017. [pdf] [bibtex] [poster] [code]
, , , , "Asynchronous Temporal Fields for Action Recognition", In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [pdf] [bibtex] [code]
, , , , , "Much Ado About Time: Exhaustive Annotation of Temporal Data", In HCOMP, 2016. (Oral Presentation) [pdf] [bibtex] [web]
, , , , , , "Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding", In European Conference on Computer Vision, 2016. [pdf] [bibtex] [poster] [web]

Learning Visual Storylines

What does a typical visit to Paris look like? In this work, we show how to automatically learn the temporal aspects, or storylines, of visual concepts from web data. Our novel Skipping Recurrent Neural Network (S-RNN) skips through the images in a photo stream, exploring the space of all ordered subsets of the albums.
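
A rough sketch of the skipping mechanism under simplifying assumptions (random features, a plain GRU cell, sampled jumps); it shows how one forward pass traces an ordered subset of an album, not the S-RNN training objective:

```python
import torch
import torch.nn as nn

class SkippingRNNSketch(nn.Module):
    """At each step the cell chooses how far to jump ahead in the photo
    stream, so a single pass visits one ordered subset of the album."""

    def __init__(self, feat_dim: int, hidden: int, max_skip: int = 5):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden)
        self.skip_head = nn.Linear(hidden, max_skip)  # scores jumps 1..max_skip
        self.hidden = hidden

    def forward(self, album: torch.Tensor) -> list:
        """album: (T, feat_dim) image features in temporal order.
        Returns indices of the visited (ordered) subset."""
        h = album.new_zeros(1, self.hidden)
        t, visited = 0, []
        while t < album.shape[0]:
            visited.append(t)
            h = self.cell(album[t:t + 1], h)
            # Sample the jump size; sampling lets training explore subsets.
            probs = torch.softmax(self.skip_head(h), dim=1)
            t += int(torch.multinomial(probs, 1).item()) + 1
        return visited

# Toy usage: a 20-image album with 64-d features per image.
model = SkippingRNNSketch(feat_dim=64, hidden=32)
print(model(torch.randn(20, 64)))
```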

, , , "Learning Visual Storylines with Skipping Recurrent Neural Networks", In European Conference on Computer Vision, 2016. [pdf] [bibtex] [poster] [code]

Shape Analysis

Image segmentation algorithms commonly return segmentation masks that represent the shape of objects. In the medical imaging domain in particular, this shape carries information about, for example, the state of the segmented organ. By analyzing the shape of an object, in two or three dimensions, it is possible to look for signs of disease.
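
A simplified stand-in for the paper's constrained sparse linear model, using plain nonnegative least squares for the coefficients; all data below is synthetic:

```python
import numpy as np
from scipy.optimize import nnls

# Express a query shape descriptor as a nonnegative combination of
# training exemplars, then assign the class whose exemplars carry the
# most coefficient weight. Nonnegativity already induces sparsity and
# keeps the result interpretable: each nonzero coefficient points at a
# concrete exemplar shape.

def classify_shape(query: np.ndarray,
                   exemplars: np.ndarray,
                   labels: np.ndarray) -> int:
    """query: (d,) descriptor; exemplars: (d, n) columns; labels: (n,)"""
    coeffs, _ = nnls(exemplars, query)          # nonnegative coefficients
    classes = np.unique(labels)
    weight = [coeffs[labels == c].sum() for c in classes]
    return int(classes[int(np.argmax(weight))])

# Toy usage: two classes of 16-d "shape descriptors".
rng = np.random.default_rng(1)
X = np.abs(rng.normal(size=(16, 10)))
y = np.array([0] * 5 + [1] * 5)
query = X[:, 2] + 0.01 * rng.normal(size=16)    # near a class-0 exemplar
print(classify_shape(query, X, y))              # -> 0
```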

, , , , "Interpretable exemplar-based shape classification using constrained sparse linear models", In Proc. SPIE, vol. 9413, no. , pp. 94130R-94130R-7, 2015. (Oral Presentation) [pdf] [bibtex] [doi] [slides] [code]

Diffusion MRI processing

Using modern diffusion-weighted magnetic resonance imaging protocols, the orientations of multiple neuronal fiber tracts within each voxel can be estimated. Further analysis of these populations, including fiber tracking and tract segmentation, is often hindered by a lack of spatial smoothness in the estimated orientations. For example, a single noisy voxel can cause a fiber tracking method to switch tracts in a simple crossing tract geometry. In this work, we propose a generalized spatial smoothing framework that handles multiple orientations, as well as their fractional contributions, within each voxel.
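
A toy sketch of the matching idea at the heart of such smoothing, assuming unit orientation vectors per voxel; the real framework also propagates the fractional weights, which this sketch omits:

```python
import numpy as np

# Each voxel holds several unit orientation vectors. Fiber orientations
# are antipodally symmetric (v and -v describe the same tract), so
# orientations must be matched and sign-aligned across neighboring
# voxels BEFORE averaging, or a crossing geometry blurs into mush.

def smooth_voxel(center: np.ndarray, neighbors: list) -> np.ndarray:
    """center: (k, 3) unit orientations in one voxel.
    neighbors: list of (k_i, 3) orientation sets from adjacent voxels."""
    out = center.copy()
    for i, v in enumerate(center):
        matched = [v]
        for nb in neighbors:
            sims = nb @ v                     # cosine with each candidate
            j = int(np.argmax(np.abs(sims)))  # best match up to sign
            matched.append(np.sign(sims[j]) * nb[j])  # sign-align first
        m = np.mean(matched, axis=0)          # average the matched pairs
        out[i] = m / np.linalg.norm(m)        # re-project to unit sphere
    return out

# Toy usage: a noisy near-vertical orientation surrounded by clean ones.
noisy = np.array([[0.1, 0.0, 1.0]])
noisy /= np.linalg.norm(noisy)
clean = np.array([[0.0, 0.0, 1.0]])
print(smooth_voxel(noisy, [clean, -clean]))   # pulled toward the z-axis
```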

, , "Smoothing fields of weighted collections with applications to diffusion MRI processing", In Proc. SPIE, vol. 9034, no. , pp. 90342D-90342D-7, 2014. [pdf] [bibtex] [doi] [poster] [code]

Polysomnography Analysis

The Icelandic biomedical company Nox Medical provides solutions for sleep monitoring and diagnostics. With portable sleep monitors, there is an opportunity to measure large populations of people in their own beds. From this data, we are interested in exploring relationships between the measurements and underlying disease. We performed statistical analysis on various time series from the device and examined the discriminative power of various features for classifying events. Using regression and classification algorithms, such as neural networks, we were able to predict vibration in patients from sound, warranting further study of relationships with disease. Our work was incorporated into the company's software suite, Noxturnal.
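
A hedged sketch of that classification setup with synthetic data; the features, labels, and model below are illustrative guesses, not Nox Medical's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Summarize each short audio window with simple features and predict
# whether the piezoelectric sensor registers vibration (snoring) in the
# same window. Real features and labels would come from the device;
# random data stands in for them here.

def window_features(audio: np.ndarray, sr: int, win_s: float = 1.0):
    """Split audio into windows; compute [RMS energy, zero-crossing rate]."""
    n = int(sr * win_s)
    wins = audio[: len(audio) // n * n].reshape(-1, n)
    rms = np.sqrt((wins ** 2).mean(axis=1))
    zcr = (np.diff(np.sign(wins), axis=1) != 0).mean(axis=1)
    return np.column_stack([rms, zcr])

rng = np.random.default_rng(2)
audio = rng.normal(size=16000 * 60)        # one minute of fake audio at 16 kHz
X = window_features(audio, sr=16000)
y = rng.integers(0, 2, size=len(X))        # fake piezo vibration labels
clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))
```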

, , , , , , , , , , "How to measure snoring? A comparison of the microphone, cannula and piezoelectric sensor", In Journal of Sleep Research, 2015. [bibtex] [doi]
, , , , , , "Snoring - Validation of different objective measurements", In European Respiratory Journal, European Respiratory Society, vol. 42, no. Suppl 57, 2014. [bibtex]

Active Radiator

The immense popularity of wireless communications has left the common frequency bands crowded, prompting researchers to utilize available spectrum at ever higher frequencies. At mm-wave frequencies there is a pronounced need for novel antenna designs that are tightly integrated with their driving circuitry in order to reduce power losses. A radiator concept for 94 GHz CMOS technology was reviewed, scaled up, and redesigned to work at 2.4 GHz on an FR-4 printed circuit board, in order to test the concept. The radiator works in a similar manner to an array of dipoles and can connect directly to the last amplifier stage without impedance matching, thanks to load-pull matched input impedances, accomplishing all of its power combining in the air. 3D full-wave electromagnetic field simulations were performed on all transmission line structures, and various ways to achieve symmetric power splitting and shorted transmission line stubs with coupled lines were designed and tested, in order to achieve an acceptable efficiency and radiation pattern for the radiating array. This was work with Steven Bowers and Ali Hajimiri at Caltech.

CV available on request.

, , , , , , , , "Decision Making for Human-in-the-loop Robotic Agents via Uncertainty-Aware Reinforcement Learning", In ICRA, 2024. [bibtex]
, , , , , , , , , "FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation", In ArXiv, 2024. [pdf] [bibtex]
, , , , , , , , , , "Unsupervised melody-to-lyric generation", In ACL, 2023. [pdf] [bibtex]
, , , , "RREx-BoT: Remote Referring Expressions with a Bag of Tricks", In IROS, 2023. [pdf] [bibtex] [web]
, , , , , "Characterizing Video Question Answering with Sparsified Inputs", In ArXiv, 2023. [pdf] [bibtex]
, , , , , "A Simple Approach for Visual Room Rearrangement: 3D Mapping and Semantic Search", In ICLR, 2022. (AI2THOR Rearrengement Challenge Winner) [pdf] [bibtex]
, , , , "Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy", In ECCVW, 2022. [pdf] [bibtex]
, , , , , "Clip-nav: Using clip for zero-shot vision-and-language navigation", In CoRLW LangRob, 2022. [pdf] [bibtex]
, , , , , , , , "Visual Grounding in Video for Unsupervised Word Translation", In CVPR, 2020. [pdf] [bibtex] [web]
, , , , "Beyond the Camera: Neural Networks in World Coordinates", In ArXiv, 2020. [pdf] [bibtex] [web]
, , , , , "Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos", In ArXiv, 2018. [pdf] [bibtex] [web]
, , , , , "Actor and Observer: Joint Modeling of First and Third-Person Videos", In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. (Spotlight Presentation) [pdf] [bibtex] [poster] [code]
, , , , "Asynchronous Temporal Fields for Action Recognition", In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [pdf] [bibtex] [code]
, , , "What Actions are Needed for Understanding Human Actions in Videos?", In International Conference on Computer Vision (ICCV), 2017. [pdf] [bibtex] [poster] [code]
, , , , , "Much Ado About Time: Exhaustive Annotation of Temporal Data", In HCOMP, 2016. (Oral Presentation) [pdf] [bibtex] [web]
, , , "Learning Visual Storylines with Skipping Recurrent Neural Networks", In European Conference on Computer Vision, 2016. [pdf] [bibtex] [poster] [code]
, , , , , , "Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding", In European Conference on Computer Vision, 2016. [pdf] [bibtex] [poster] [web]
, , , , "Interpretable exemplar-based shape classification using constrained sparse linear models", In Proc. SPIE, vol. 9413, no. , pp. 94130R-94130R-7, 2015. (Oral Presentation) [pdf] [bibtex] [doi] [slides] [code]
, , , , , , , , , , "How to measure snoring? A comparison of the microphone, cannula and piezoelectric sensor", In Journal of Sleep Research, 2015. [bibtex] [doi]
, , "Smoothing fields of weighted collections with applications to diffusion MRI processing", In Proc. SPIE, vol. 9034, no. , pp. 90342D-90342D-7, 2014. [pdf] [bibtex] [doi] [poster] [code]
, , , , , , "Feasibility of a non-invasive sensor for measuring ICU patient mobility", In Critical Care Medicine, vol. 42, pp. A1389, 2014. (Research Citation Award) [bibtex] [doi]
, , , , , , "Snoring - Validation of different objective measurements", In European Respiratory Journal, European Respiratory Society, vol. 42, no. Suppl 57, 2014. [bibtex]

Please see my GitHub page for released code.