dc.contributor.advisor: Cunningham, Padraig
dc.contributor.author: Davy, Michael
dc.date.accessioned: 2006-06-16T15:57:26Z
dc.date.available: 2006-06-16T15:57:26Z
dc.date.issued: 2004-09
dc.date.submitted: 2006-06-16T15:57:26Z
dc.description.abstract: E-mail has emerged as one of the primary means of communication in the world today, and its rapid adoption has left it ripe for misuse and abuse, most visibly in the form of Unsolicited Commercial E-mail (UCE), otherwise known as spam. For a time spam was considered only a nuisance, but the sheer volume now being sent has made it a major problem: estimates suggest that up to 80% of all e-mail sent is spam. Spam filtering offers a way to curb the problem. Identifying and removing spam from the e-mail delivery system allows end users to regain a useful means of communication. Much research in spam filtering has centred on making the classifiers more sophisticated. This thesis begins to investigate the impact of applying more sophistication to lower layers of the filtering process, namely the extraction of information from e-mail. Several types of obfuscation are discussed, which are increasingly present in spam in an attempt to confuse and circumvent current filtering processes. Removing certain types of obfuscation is shown to improve classification. The main hypothesis under investigation was the impact of pair tokens on the classification process. It is reasonable to expect that pairs of tokens will offer more value than single tokens alone; for example, "enlarge your" seems to convey more information than either token alone. However, the results show conclusively that pair tokens offer no value and in fact increase error on three independent data sets. [en]
dc.format.extent: 550845 bytes
dc.format.mimetype: application/pdf
dc.language.iso: en [en]
dc.relation.hasversion: TCD-CS-2005-09.pdf [en]
dc.subject: Computer Science [en]
dc.title: Feature Extraction for Spam Classification [en]
dc.type: Masters (Taught)
dc.type: Master of Science (M.Sc.)
dc.publisher.institution: Trinity College Dublin. Department of Computer Science [en]
dc.identifier.uri: http://hdl.handle.net/2262/822

