Multimodal Summarization in Natural Language Conversations
Citation:
Samantha Kotey, 'Multimodal Summarization in Natural Language Conversations', [Thesis], Trinity College Dublin. School of Computer Science & Statistics. Discipline of Computer Science, 2025
Abstract:
Summarization is the task of creating a shorter version of a piece of content that is representative of the original. This process applies not only to written text, but is also essential for verbal exchanges in the form of dialogue. Naturally, humans recount snippets of relevant information from stories and events when engaging in conversations. Subconsciously, these stories are composed and summarized from memory, incorporating linguistic, auditory, and visual elements. For machines to replicate this behaviour, multiple modalities should be integrated into the summarization process. Although substantial progress has been achieved in text summarization, research that leverages additional modalities, particularly in conversation analysis, remains underexplored.
To this end, we address the challenges associated with developing multimodal systems that can automatically summarize human conversations. We explore emerging natural language domains such as podcasting and online video conferencing, where lengthy conversations frequently occur and where applications are in high demand. Automatic summarization would significantly reduce the time required to create trailers and teasers for podcast episodes, or minutes of meetings in a professional context.
The objective of this thesis is to contribute knowledge that advances the field of multimodal dialogue summarization. To achieve this, we investigate the strengths and limitations of using individual modalities, such as text and audio, for generating summaries independently. Specifically, we propose a method to generate long, fine-grained summaries of podcast conversations, demonstrating the effectiveness of text as a single modality. Similarly, we explore the limitations of using audio independently by introducing a method to generate audio clip summaries directly from raw audio embeddings. Following this, we examine the challenges of multimodal integration, focusing in particular on incorporating video into summarization systems. To study the complex interactions between modalities, we create dense utterance-level annotations for an existing multimodal dataset. Additionally, we propose a multimodal transformer architecture that incorporates cost-sensitive learning and gated fusion techniques. This thesis presents a comprehensive overview of the proposed methods and thoroughly evaluates their performance. The implications of the findings are also discussed, along with potential applications within the field.
Sponsor: Irish Research Council (IRC)
Grant Number: GOIPG/2019/2353
Description:
APPROVED
Author: Kotey, Samantha
Advisor:
Harte, Naomi
Publisher:
Trinity College Dublin. School of Computer Science & Statistics. Discipline of Computer Science
Type of material:
Thesis
Availability:
Full text available