Feature Extraction for Spam Classification

Davy, Michael

dc.contributor.advisor	Cunningham, Padraig
dc.contributor.author	Davy, Michael
dc.date.accessioned	2006-06-16T15:57:26Z
dc.date.available	2006-06-16T15:57:26Z
dc.date.issued	2004-09
dc.date.submitted	2006-06-16T15:57:26Z
dc.description.abstract	E-mail has emerged as one of the primary means of communication in the world today. Its rapid adoption has left it ripe for misuse and abuse, which came in the guise of Unsolicited Commercial E-mail (UCE), otherwise known as spam. For a time spam was considered only a nuisance, but due mainly to the copious amounts being sent it has become a major problem. The volume of spam has reached epidemic proportions, with estimates that up to 80% of all e-mail sent is spam. Spam filtering offers a way to curb the problem: identifying and removing spam from the e-mail delivery system allows end-users to regain a useful means of communication. A lot of research in spam filtering has centred on more sophisticated classifiers. This thesis begins to investigate the impact of applying more sophistication to lower layers of the filtering process, namely extracting information from e-mail. Several types of obfuscation are discussed, which are becoming ever more present in spam in an attempt to confuse and circumvent current filtering processes. The results show that removing certain types of obfuscation improves classification. The main theory under investigation was the impact of pair tokens on the classification process. It is quite reasonable to think that pairs of tokens will offer more value than single tokens alone; for example, "enlarge your" seems to convey more information than either token alone. The results obtained show conclusively that pair tokens offer no value and in fact increase error over three independent data sets.	en
dc.format.extent	550845 bytes
dc.format.mimetype	application/pdf
dc.language.iso	en	en
dc.relation.hasversion	TCD-CS-2005-09.pdf	en
dc.subject	Computer Science	en
dc.title	Feature Extraction for Spam Classification	en
dc.type	Masters (Taught)
dc.type	Master of Science (M.Sc.)
dc.publisher.institution	Trinity College Dublin. Department of Computer Science	en
dc.identifier.uri	http://hdl.handle.net/2262/822

Files in this item

Name: TCD-CS-2005-09.pdf
Size: 537.9Kb
Format: PDF

Name: license.txt
Size: 3.589Kb
Format: Text file

This item appears in the following Collection(s)

Computer Science (Theses and Dissertations)
Computer Science Technical Reports
Trinity College Dublin Theses & Dissertations
