The dataset contains just two fields:
- text: The text of the email.
- spam: A binary variable: 1 if the email is spam, 0 otherwise.
| text | spam |
| --- | --- |
Subject: naturally irresistible your corporate identity lt is really hard to recollect a company : the market is full of suqgestions and the information isoverwhelminq ; but a good catchy logo , stylish statlonery and outstanding website will make the task much easier . we do not promise that havinq ordered a iogo your company will automaticaily become a world ieader : it isguite ciear that without good products , effective business organization and practicable aim it will be hotat nowadays market ; but we do promise that your marketing efforts will become much more effective . here is the list of clear benefits : creativeness : hand - made , original logos , specially done to reflect your distinctive company image . convenience : logo and stationery are provided in all formats ; easy - to - use content management system letsyou change your website content and even its structure . promptness : you will see logo drafts within three business days . affordability : your marketing break - through shouldn ' t make gaps in your budget . 100 % satisfaction guaranteed : we provide unlimited amount of changes with no extra fees for you to be surethat you will love the result of this collaboration . have a look at our portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ not interested . . . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | 1 |
Subject: the stock trading gunslinger fanny is merrill but muzo not colza attainder and penultimate like esmark perspicuous ramble is segovia not group try slung kansas tanzania yes chameleon or continuant clothesman no libretto is chesapeake but tight not waterway herald and hawthorn like chisel morristown superior is deoxyribonucleic not clockwork try hall incredible mcdougall yes hepburn or einsteinian earmark no sapling is boar but duane not plain palfrey and inflexible like huzzah pepperoni bedtime is nameable not attire try edt chronography optima yes pirogue or diffusion albeit no | 1 |
Subject: re : subscriptions stephanie , please , discontinue credit and renew the two other publications : energy & power risk management and the journal of computational finance . enron north america corp . from : stephanie e taylor 12 / 12 / 2000 01 : 43 pm to : vince j kaminski / hou / ect @ ect cc : subject : subscriptions dear vince , we will be happy to renew your subscription to risk . in addition , the following publications are up for renewal : reg . subscription cost with corp . discount credit $ 1145 . 00 $ 973 . 25 energy & power risk management $ 375 . 00 $ 318 . 75 the journal of computational finance $ 291 . 75 $ 247 . 99 if you wish to renew these , we will also take care of this for you . i would appreciate your responding by december 18 th . please include your company and cost center numbers with your renewal . thank you , stephanie e . taylor esource 713 - 345 - 7928 | 0 |
Subject: re : baylor - enron case study cindy , yes , i shall co - author this paper and i have planted the idea in john martin ' s head . vince from : cindy derecskey @ enron on 10 / 25 / 2000 11 : 38 am to : vince j kaminski / hou / ect @ ect cc : subject : baylor - enron case study vince , i forgot to inquire whether you would also like to be present during the interview process with john martin and ken , jeff and andy ? let me know . . . . thanks , cindy | 0 |
> emails <- read.csv('emails.csv', stringsAsFactors=FALSE)
> dim(emails)
[1] 5728 2
> table(emails$spam)
0 1
4360 1368
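Before modeling, it is worth noting the baseline: a model that always predicts "not spam" would already be right about 76% of the time, which is the bar the classifiers below must beat. A quick sketch of that arithmetic, using the class counts from the table above:

```r
# Baseline accuracy: always predict the majority class (ham, spam = 0).
# Counts come from table(emails$spam): 4360 ham, 1368 spam.
ham  <- 4360
spam <- 1368
baseline <- ham / (ham + spam)
round(baseline, 4)  # 0.7612
```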
> library(tm)
> library(SnowballC)
> corpus <- Corpus(VectorSource(emails$text))
> corpus <- tm_map(corpus, content_transformer(tolower))
> corpus <- tm_map(corpus, removePunctuation)
> stopwords('english')[1:5]
[1] "i" "me" "my" "myself" "we"
> corpus <- tm_map(corpus, removeWords, stopwords('english'))
> corpus <- tm_map(corpus, stemDocument)
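`stemDocument` relies on SnowballC's Porter stemmer, which is why the terms listed later appear in truncated form ("manag", "busi", "compani"). A small illustration of what the stemmer does to individual words (the example words are my own, chosen to match terms that appear in the matrix below):

```r
library(SnowballC)

# Porter stemming collapses inflected forms onto a common stem,
# so "manage", "managing" and "management" all count as one term.
wordStem(c("manage", "managing", "management", "companies"))
# "manag" "manag" "manag" "compani"
```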
> dtm <- DocumentTermMatrix(corpus)
> dtm
A document-term matrix (5728 documents, 28687 terms)
Non-/sparse entries: 481719/163837417
Sparsity : 100%
Maximal term length: 24
Weighting : term frequency (tf)
> spdtm <- removeSparseTerms(dtm, 0.95)
> spdtm
A document-term matrix (5728 documents, 330 terms)
Non-/sparse entries: 213551/1676689
Sparsity : 89%
Maximal term length: 10
Weighting : term frequency (tf)
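The 0.95 argument to `removeSparseTerms` drops every term whose sparsity exceeds 0.95, i.e. it keeps only terms appearing in at least 5% of the 5,728 documents. That threshold is what shrinks the matrix from 28,687 terms to 330. The cutoff works out as:

```r
# A term survives removeSparseTerms(dtm, 0.95) only if it appears
# in at least (1 - 0.95) = 5% of the documents.
n_docs <- 5728
min_docs <- ceiling((1 - 0.95) * n_docs)
min_docs  # 287
```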
> emailsSparse <- as.data.frame(as.matrix(spdtm))
> emailsSparse[1:10, 1:10]
000 2000 2001 713 853 abl access account addit address
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 2 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 1 1 1 0
8 0 0 0 0 0 0 0 0 0 0
9 1 0 1 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
> emailsSparse$spam <- emails$spam
> library(caTools)
> set.seed(123)
> split <- sample.split(emailsSparse$spam, SplitRatio=0.7)
> train <- emailsSparse[split==TRUE, ]
> test <- emailsSparse[split==FALSE, ]
> spamLog <- glm(spam ~ ., data = train, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
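These warnings are the classic signature of (quasi-)complete separation: with 330 predictors and only about 4,010 training emails, some combination of terms perfectly separates spam from ham, so the maximum-likelihood coefficients diverge and fitted probabilities are pushed to numerical 0 or 1. A minimal toy reproduction of the same phenomenon (data invented purely for illustration):

```r
# Complete separation: x perfectly predicts y, so the logistic MLE
# does not exist; glm emits the same warnings seen above and the
# fitted probabilities end up numerically at 0 and 1.
x <- c(1, 2, 3, 10, 11, 12)
y <- c(0, 0, 0, 1, 1, 1)
fit <- suppressWarnings(glm(y ~ x, family = binomial))
range(fitted(fit))  # essentially 0 and 1
```

Despite the warnings, the predicted classes remain usable, which is why the confusion matrix below is still meaningful; penalized fits (e.g. glmnet) are the usual remedy when the coefficients themselves matter.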
> predPercLog.test <- predict(spamLog, newdata = test, type='response')
> predLog.test <- ifelse(predPercLog.test > 0.5, 1, 0)
> table(predLog.test, test$spam)
predLog.test 0 1
0 1257 34
1 51 376
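Beyond raw accuracy, this confusion matrix also gives the two error rates that matter for a spam filter: sensitivity (spam caught) and specificity (ham left alone). Recomputed from the counts above:

```r
# From the logistic regression confusion matrix:
#           actual 0  actual 1
# pred 0        1257        34
# pred 1          51       376
TN <- 1257; FN <- 34; FP <- 51; TP <- 376
sensitivity <- TP / (TP + FN)  # fraction of spam correctly flagged
specificity <- TN / (TN + FP)  # fraction of ham correctly passed
round(c(sensitivity, specificity), 4)  # 0.9171 0.9610
```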
> library(rpart)
> library(rpart.plot)
> spamCART <- rpart(spam ~ ., data = train, method = 'class')
> predPercCART.test <- predict(spamCART, newdata = test)[ , 2]
> predCART.test <- ifelse(predPercCART.test > 0.5, 1, 0)
> table(predCART.test, test$spam)
predCART.test 0 1
0 1228 24
1 80 386
> library(randomForest)
> set.seed(123)
> spamRF <- randomForest(spam ~ ., data = train)
Error in eval(expr, envir, enclos) : object '000' not found
> '000' %in% names(train)
[1] TRUE
> names(train)
[1] "000" "2000" "2001" "713" "853" "abl" "access"
[8] "account" "addit" "address" "allow" "alreadi" "also" "analysi"
[15] "anoth" "applic" "appreci" "approv" "april" "area" "arrang"
[22] "ask" "assist" "associ" "attach" "attend" "avail" "back"
[29] "base" "begin" "believ" "best" "better" "book" "bring"
[36] "busi" "buy" "call" "can" "case" "chang" "check"
[43] "click" "com" "come" "comment" "communic" "compani" "complet"
[50] "confer" "confirm" "contact" "continu" "contract" "copi" "corp"
[57] "corpor" "cost" "cours" "creat" "credit" "crenshaw" "current"
[64] "custom" "data" "date" "day" "deal" "dear" "depart"
[71] "deriv" "design" "detail" "develop" "differ" "direct" "director"
[78] "discuss" "doc" "don" "done" "due" "ect" "edu"
[85] "effect" "effort" "either" "email" "end" "energi" "engin"
[92] "enron" "etc" "even" "event" "expect" "experi" "fax"
[99] "feel" "file" "final" "financ" "financi" "find" "first"
[106] "follow" "form" "forward" "free" "friday" "full" "futur"
[113] "gas" "get" "gibner" "give" "given" "good" "great"
[120] "group" "happi" "hear" "hello" "help" "high" "home"
[127] "hope" "hou" "hour" "houston" "howev" "http" "idea"
[134] "immedi" "import" "includ" "increas" "industri" "info" "inform"
[141] "interest" "intern" "internet" "interview" "invest" "invit" "involv"
[148] "issu" "john" "join" "juli" "just" "kaminski" "keep"
[155] "kevin" "know" "last" "let" "life" "like" "line"
[162] "link" "list" "locat" "london" "long" "look" "lot"
[169] "made" "mail" "make" "manag" "mani" "mark" "market"
[176] "may" "mean" "meet" "member" "mention" "messag" "might"
[183] "model" "monday" "money" "month" "morn" "move" "much"
[190] "name" "need" "net" "new" "next" "note" "now"
[197] "number" "offer" "offic" "one" "onlin" "open" "oper"
[204] "opportun" "option" "order" "origin" "part" "particip" "peopl"
[211] "per" "person" "phone" "place" "plan" "pleas" "point"
[218] "posit" "possibl" "power" "present" "price" "problem" "process"
[225] "product" "program" "project" "provid" "public" "put" "question"
[232] "rate" "read" "real" "realli" "receiv" "recent" "regard"
[239] "relat" "remov" "repli" "report" "request" "requir" "research"
[246] "resourc" "respond" "respons" "result" "resum" "return" "review"
[253] "right" "risk" "robert" "run" "say" "schedul" "school"
[260] "secur" "see" "send" "sent" "servic" "set" "sever"
[267] "shall" "shirley" "short" "sinc" "sincer" "site" "softwar"
[274] "soon" "sorri" "special" "specif" "start" "state" "still"
[281] "stinson" "student" "subject" "success" "suggest" "support" "sure"
[288] "system" "take" "talk" "team" "term" "thank" "thing"
[295] "think" "thought" "thursday" "time" "today" "togeth" "trade"
[302] "tri" "tuesday" "two" "type" "understand" "unit" "univers"
[309] "updat" "use" "valu" "version" "vinc" "visit" "vkamin"
[316] "want" "way" "web" "websit" "wednesday" "week" "well"
[323] "will" "wish" "within" "without" "work" "write" "www"
[330] "year" "spam"
> colnames(train) <- make.names(colnames(train))
> colnames(test) <- make.names(colnames(test))
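`make.names` turns non-syntactic column names like "000" or "2000" into valid R names by prefixing an "X"; randomForest's formula interface is stricter about such names than glm or rpart, which is why those models ran earlier without this fix. For example:

```r
# make.names prefixes "X" to names that start with a digit,
# so the offending term columns become valid formula variables.
make.names(c("000", "2000", "713", "account"))
# "X000" "X2000" "X713" "account"
```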
> spamRF <- randomForest(spam ~ ., data = train)
> predPercRF.test <- predict(spamRF, newdata = test)
> predRF.test <- ifelse(predPercRF.test > 0.5, 1, 0)
> table(predRF.test, test$spam)
predRF.test 0 1
0 1228 24
1 80 386
- Prediction accuracy of the logistic regression model on the test set: 95.05%
- Prediction accuracy of the CART (classification tree) model on the test set: 93.95%
- Prediction accuracy of the random forest model on the test set: 93.95%
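All three models comfortably beat the majority-class baseline of 4360/5728 ≈ 76.1%. As a sanity check, the accuracies follow directly from the confusion matrices above (the CART and random forest tables happen to be identical here):

```r
# Accuracies recomputed from the three test-set confusion matrices.
n_test <- 1718
log_acc  <- (1257 + 376) / n_test   # logistic regression
cart_acc <- (1228 + 386) / n_test   # CART
rf_acc   <- (1228 + 386) / n_test   # random forest
round(c(log_acc, cart_acc, rf_acc) * 100, 2)  # 95.05 93.95 93.95
```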