Multimodal Question Answering


This was a research collaboration between Oath (the artist formerly known as Yahoo!) and the Language Technologies Institute at Carnegie Mellon, where I worked in a group headed by Professors Alexander Hauptmann, Robert Frederking, and Eric Nyberg. Below is the research abstract, but I think you’ll find the poster more enjoyable (here). I am a big proponent of interactive machine learning systems, and this project was one of many such research projects I was able to work on during my time at LTI.

We present an integrated multimodal question answering (QA) system and two case studies. The first focuses on Flickr data, where the answer may come from a photo or video's text description, from the actual visual content, or from both modalities. The second is a more general question answering system built on Yahoo! Answers. The front end integrates both services into an Android app that accepts a question in written or spoken form. The question is analyzed by a pipeline running on an external server, which responds with the most likely answers from both modalities. The best candidate answers are presented to the user as a list of cards containing the answer text and any associated multimedia. We show one case where the right answer is contained in the multimedia description, another where the correct answer lies in the content of the multimedia because no description is available, and a final case where both modalities enrich the answer shown to the user. The Yahoo! Answers part of the system follows a similar approach but relies only on text. Finally, we demonstrate our user feedback mechanism, which lets users swipe away incorrect or irrelevant answers and feeds that signal back to improve our models.
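
To make the flow above a bit more concrete, here is a minimal sketch of how the server-side pieces could fit together: two retrievers (one multimodal over Flickr, one text-only over Yahoo! Answers), a merge-and-rank step that produces answer cards, and a hook for swipe feedback. All names here (AnswerCard, retrieve_flickr, retrieve_yahoo_answers, record_swipe_feedback) and the stubbed results are hypothetical illustrations, not the actual system code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AnswerCard:
    """One answer 'card' shown to the user: text plus optional media."""
    answer_text: str
    score: float
    source: str                      # e.g. "flickr" or "yahoo_answers"
    media_url: Optional[str] = None  # photo/video when the answer is multimodal

def retrieve_flickr(question: str) -> List[AnswerCard]:
    """Hypothetical multimodal retriever: matches the question against
    Flickr text descriptions and visual content (stubbed here)."""
    return [AnswerCard("A golden retriever playing in the snow.", 0.82,
                       "flickr", media_url="https://example.com/photo.jpg")]

def retrieve_yahoo_answers(question: str) -> List[AnswerCard]:
    """Hypothetical text-only retriever over Yahoo! Answers (stubbed here)."""
    return [AnswerCard("Golden retrievers grow a thicker coat in winter.", 0.74,
                       "yahoo_answers")]

def answer_question(question: str, top_k: int = 5) -> List[AnswerCard]:
    """Merge candidates from both modalities and rank them by score."""
    candidates = retrieve_flickr(question) + retrieve_yahoo_answers(question)
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]

def record_swipe_feedback(question: str, card: AnswerCard, relevant: bool) -> None:
    """Log a (question, answer, label) triple; swiped-away cards become
    negative examples for later model updates."""
    print(f"feedback: q={question!r} src={card.source} relevant={relevant}")

if __name__ == "__main__":
    q = "What do golden retrievers look like in winter?"
    cards = answer_question(q)
    for card in cards:
        print(f"[{card.score:.2f}] ({card.source}) {card.answer_text}")
    # Simulate the user swiping away the second card as irrelevant.
    record_swipe_feedback(q, cards[1], relevant=False)
```

The real retrievers would of course query indexes over the Flickr and Yahoo! Answers collections rather than return canned results; the point of the sketch is the card-based response format and the swipe feedback loop described in the abstract.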