Multimodal Summarization in Natural Language Conversations
Citation:
Samantha Kotey, 'Multimodal Summarization in Natural Language Conversations', [Thesis], Trinity College Dublin. School of Computer Science & Statistics. Discipline of Computer Science, 2025
Abstract:
Summarization is the task of creating a shorter version of a piece of content that is representative of the original. This process applies not only to written text, but is also essential for verbal exchanges in the form of dialogue. Naturally, humans recount snippets of relevant information from stories and events when engaging in conversations. Subconsciously, these stories are composed and summarized from memory, incorporating linguistic, auditory, and visual elements. For machines to replicate this behaviour, multiple modalities should be integrated into the summarization process. Although substantial progress has been achieved in text summarization, research that leverages additional modalities, particularly in conversation analysis, remains underexplored.
To this end, we address the challenges associated with developing multimodal systems that can automatically summarize human conversations. We explore emerging natural language domains such as podcasting and online video conferencing, where lengthy conversations frequently occur and where applications are in high demand. Automatic summarization would significantly reduce the time required to create trailers and teasers for podcast episodes, or minutes of meetings in a professional context.
The objective of this thesis is to contribute knowledge that advances the field of multimodal dialogue summarization. To achieve this, we investigate the strengths and limitations of using individual modalities, such as text and audio, for generating summaries independently. Specifically, we propose a method to generate long, fine-grained summaries of podcast conversations, demonstrating the effectiveness of text as a single modality. Similarly, we explore the limitations of using audio independently by introducing a method to generate audio clip summaries directly from raw audio embeddings. Following this, we examine the challenges of multimodal integration, focusing in particular on incorporating video into summarization systems. To study the complex interactions between modalities, we create dense utterance-level annotations for an existing multimodal dataset. Additionally, we propose a multimodal transformer architecture that incorporates cost-sensitive learning and gated fusion techniques. This thesis presents a comprehensive overview of the proposed methods and thoroughly evaluates their performance. The implications of the findings are also discussed, along with potential applications within the field.
Sponsor: Irish Research Council (IRC)
Grant Number: GOIPG/2019/2353
Description:
APPROVED
Author: Kotey, Samantha
Advisor:
Harte, Naomi
Publisher:
Trinity College Dublin. School of Computer Science & Statistics. Discipline of Computer Science
Type of material:
Thesis
Availability:
Full text available