Wednesday, April 4, 2007

Spam Dataset for classification

1. Title:  SPAM E-mail Database

2. Sources:
(a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt
Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
(b) Donor: George Forman (gforman at nospam hpl.hp.com) 650-857-7835
(c) Generated: June-July 1999

3. Past Usage:
(a) Hewlett-Packard Internal-only Technical Report. External forthcoming.
(b) Determine whether a given email is spam or not.
(c) ~7% misclassification error.
False positives (marking good mail as spam) are very undesirable.
If we insist on zero false positives in the training/testing set,
20-25% of the spam passes through the filter.
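
The zero-false-positive trade-off can be illustrated with a minimal sketch. It assumes a classifier that outputs a numeric spam score per message (higher means more spam-like); the function name, the scores, and the labels below are hypothetical toy values, not taken from the dataset or from the original report.

```python
def spam_missed_at_zero_fp(scores, labels):
    """Return (threshold, fraction of spam that passes through).

    scores: hypothetical classifier outputs, higher = more spam-like.
    labels: 1 for spam, 0 for non-spam.
    """
    ham_scores = [s for s, y in zip(scores, labels) if y == 0]
    spam_scores = [s for s, y in zip(scores, labels) if y == 1]
    # Flag a message as spam only if it scores strictly above every
    # non-spam message, so no good mail is ever marked as spam.
    threshold = max(ham_scores)
    missed = sum(1 for s in spam_scores if s <= threshold)
    return threshold, missed / len(spam_scores)


# Toy data (made up for illustration only).
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.3, 0.95, 0.2]
labels = [0, 0, 1, 1, 1, 1, 1, 0]
threshold, miss_rate = spam_missed_at_zero_fp(scores, labels)
```

Raising the threshold to protect good mail necessarily lets lower-scoring spam through, which is the effect described in (c).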

4. Relevant Information:
The "spam" concept is diverse: advertisements for products/web
sites, make-money-fast schemes, chain letters, pornography...
Our collection of spam e-mails came from our postmaster and
individuals who had filed spam. Our collection of non-spam
e-mails came from filed work and personal e-mails, and hence
the word 'george' and the area code '650' are indicators of
non-spam. These are useful when constructing a personalized
spam filter. One would either have to blind such non-spam
indicators or get a very wide collection of non-spam to
generate a general purpose spam filter.
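
Blinding the personalized indicators could look like the sketch below. It assumes each e-mail is represented as a mapping from feature name to value, and that the two indicators are stored under the names word_freq_george and word_freq_650; those names, the helper blind_features, and the sample row are assumptions for illustration.

```python
# Features that reveal the donor's identity rather than "spamminess".
# The names are assumed; adjust them to match the actual column names.
BLINDED = {"word_freq_george", "word_freq_650"}


def blind_features(row, blinded=BLINDED):
    """Return a copy of the feature mapping without the blinded features."""
    return {name: value for name, value in row.items() if name not in blinded}


# Hypothetical feature row for a single e-mail.
row = {
    "word_freq_george": 1.2,
    "word_freq_650": 0.5,
    "word_freq_free": 0.0,
    "char_freq_!": 0.1,
}
general_row = blind_features(row)
```

A filter trained on rows like general_row cannot rely on 'george' or '650', which is one way to move toward a general purpose filter without collecting a wider set of non-spam.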

For background on spam:
Cranor, Lorrie F., LaMacchia, Brian A. Spam!
Communications of the ACM, 41(8):74-83, 1998.
http://www.ics.uci.edu/~mlearn/databases/spambase/
