Abstract
This study sought to evaluate the deterrent impact the Controlling the Assault of Non-Solicited Pornography and Marketing (CAN SPAM) Act has had on email spam rates over time. A sample of 5,490,905 spam emails was collected and aggregated into a monthly time series. Thirteen measures of CAN SPAM Act enforcement were coded from news articles and included in a time-series regression. The results suggest a possible deterrent effect of prosecutions, convictions, and lengthy jail sentences for spammers, but an emboldening effect of short jail sentences. The penalties under the CAN SPAM Act focus on fines more than prison terms. The results find no deterrent effect for fines, as spammers tend to earn a large income from sending spam. The Act might be revised to include prison sentences, especially longer ones to avoid the emboldening effect found. A deterrent impact was found for prosecutions, even though the CAN SPAM Act is under-enforced. Expanding enforcement might also be advisable.
The amount of electronic spam sent has grown since its inception (Gudkova, 2013). Spam can be transmitted through any electronic means of communication (text, forum, chat), but email spam is the most common method (Rao & Reiley, 2012). More than 70% of all emails are spam (Gudkova, 2013). The U.S. Congress enacted the Controlling the Assault of Non-Solicited Pornography and Marketing (CAN SPAM) Act of 2003 in response to this growing problem (CAN SPAM Act, 2005). The law did not illegalize the sending of spam, but instead regulated how spam messages had to be composed and sent.
A year after the Act was passed, the Federal Trade Commission (FTC) released a report evaluating the success of the CAN SPAM Act. The FTC concluded that the CAN SPAM Act had affected the volume of spam sent: spam was no longer increasing and had begun to level off (Majoras, Leary, Harbour, & Leibowitz, 2005). Other commentators were not so positive about the success of the new law, suggesting that spam had in fact increased following the Act (Gross, 2004; Zeller, 2005). In addition, a time-series analysis found no significant difference in spam volume following the passing of the Act (Kigerl, 2009).
Despite the mixture of findings, all assessments to date of the CAN SPAM Act only consider the impact of the legislation to be dichotomous. That is, comparisons are made prior to the CAN SPAM Act's passing and on or after the Act went into enforcement on January 1, 2004, creating a binary measure of the CAN SPAM Act. Such analyses do not capture the true variation and theoretical influence a policy or law is intended to have on the target human behavior. This study seeks to remedy this limitation by including 13 different continuous measures of CAN SPAM Act activity, enforcement, attention, and public attitudes. Other research has found that these measures of law enforcement have a significant impact on some forms of cybercrime (Guitton, 2012; Png & Wang, 2007), so it is questioned whether they also have an influence on spam as the cybercrime outcome.
In addition to prior research having only limited measures of law enforcement, no other research to date has considered variables in addition to the CAN SPAM Act. Specifically, only the CAN SPAM Act is analyzed when assessing spam time series within the United States, ignoring other possible influences on spammer behavior. This study incorporates economic, demographic, and technological time-series control variables into all analyses, intending to capture additional influences on spam volume over time. Equipped with these models, conclusive evidence can be drawn regarding the effectiveness of the CAN SPAM Act to affect spam rates in the United States.
Literature Review
Electronic spam is defined as the transmission of bulk, unsolicited, commercial electronic mail messages to multiple recipients. Spam can be transmitted using a variety of electronic communications, such as through text messages, online commenting systems, social networks, and chat. However, the most common form of spam is transmitted through email. Email is an effective means to reach a previously unheard of number of recipients, and spam works best based on quantity of messages sent, not the quality of those messages.
The amount of spam sent worldwide has been increasing as the reach of the Internet grows and more people become connected to it. Because of the number of recipients that can be reached online, spam is a profitable business. Spam activity had experienced continued growth over the years (Lee, 2005), but has since leveled off and remained at a steady, if still high, rate (Gudkova, 2013). In 2003, just 45% of all email was considered spam (McCain, 2003). Today, it amounts to 72% of all emails sent (Gudkova, 2013). In fact, as much as 69% of all Internet traffic is spam (Lachhwani & Ghose, 2012).
The CAN SPAM Act
The laws against spam that are of interest to this research are those included under the U.S. CAN SPAM Act of 2003. The legislation was passed by Congress in 2003, going into effect on January 1, 2004. The regulations inherent in the bill set requirements that electronic commercial messages must adhere to when sending advertisements to recipients electronically (including email and other electronic means of communication). The bill does not prohibit unsolicited commercial emails, but rather it regulates the way they are sent and the content that is delivered. The messages must be truthful and not fraudulent. The sender must also comply with a recipient’s express request to opt out of all future emails. Violators of the Act are usually fined but can also receive prison time for additional aggravating violations (CAN SPAM Act, 2005).
Prior to the CAN SPAM Act, laws regulating electronic mail were created at the state level. Spam sent or received in one state fell under the jurisdiction of that state. The CAN SPAM Act preempts the majority of state laws that address spam. The laws left to each state to regulate include fraudulent content of electronic mail. State laws that restrict falsified headers or deceptive contents of email still remain under that state’s jurisdiction, should those laws exist locally. Finally, the CAN SPAM Act does not supersede state laws on computer crime in general (CAN SPAM Act, 2005). Although the CAN SPAM Act replaces most state laws regulating spam, the CAN SPAM Act is superseded by all Federal laws relating to obscenity and the sexual exploitation of children.
One of the first and basic rules set forth in the Act forbids the falsification of email headers (CAN SPAM Act, 2005). The headers of an email message include the recipient’s address, the sender’s address, the return or bounce address, and additional routing details contained in the headers. Another header field in an email subject to restriction is the subject header. Senders are not permitted to write subject titles intended to mislead the recipient on what the contents of the message body are before opening the email.
The sender must provide a channel for the recipient to opt out of further advertisements (CAN SPAM Act, 2005). Opt-out is the ability of the recipient to make a request to discontinue receiving spam, and the willingness of the spammer to honor those opt-out requests. The sender’s valid physical postal address must also be included somewhere in the body of the email. Finally, the message contents or subject heading must identify itself as an advertisement.
In addition to the basic requirements with which spam emails must comply, aggravating offenses exist that can triple the maximum penalties under the CAN SPAM Act (CAN SPAM Act, 2005). The first aggravating offense includes acquiring an email list by unethical means, such as address harvesting or dictionary attacks. Additional aggravating offenses include the automated registration of multiple email accounts from which to send spam.
There are three major authorities that are authorized to pursue offenders who commit CAN SPAM Act violations. They include (a) the attorney general for violations within a state, (b) Internet access providers, and (c) the FTC. The FTC is the primary enforcer of the Act, however, and is authorized to further update CAN SPAM Act regulations in light of new and emerging technology that warrant regulatory changes. In addition, the FTC has the authority to bring suit for any violation detailed in the Act, whereas the remaining two authorities (Internet service providers and states) may only enforce a subset of the Act’s provisions.
The CAN SPAM Act and Deterrence Theory
Deterrence is defined as the omission of an act as a response to the perceived risk and fear of punishment for contrary behavior (Gibbs, 1975). The penalties set forth in the CAN SPAM Act are intended to serve as a deterrent against illicit spamming. Deterrence theory was first postulated in the late 1700s by Cesare Beccaria and Jeremy Bentham and stated that the rate of a particular crime varies inversely with the celerity, certainty, and severity of punishment of that crime (Beccaria, 1963; Bentham, 1962).
There are multiple channels through which a punishment can deter subsequent or future offending. General deterrence is the threat of punishment which the public or potential offenders are aware of beforehand and thus resist committing a given crime because of the perception of that possible punishment (Gibbs, 1975). General deterrence is contrasted with specific deterrence, specific deterrence being the deterrent effect of an already enforced punishment against an offender having already committed a first or initial offense. The CAN SPAM Act is amenable to either type of deterrence. Offenders may avoid the spam business altogether or comply with the regulations of the Act when sending spam, or a subsequent conviction under the CAN SPAM Act may end an individual spammer’s career.
Can a cyberattack be deterred with the threat of punishment? Deterrence has been suggested to generalize from application to street crime to application to cyber threats and information security (Kunreuther & Heal, 2003; Png, Tang, & Wang, 2006). The literature linking law enforcement efforts, such as those reported in the news, to reductions in traditional street crime has been tenuous at best (Bailey & Peterson, 1994, 1998; Garofalo, 1981; Peterson & Bailey, 1991), typically finding no deterrent effect. However, there has also been some empirical testing of deterrence theory as applied to cybercrime. Prosecution and law enforcement activity against cybercriminals, such as that reported in the news, has been linked to reductions in hacking incidences (Png & Wang, 2007), cyberattacks on businesses (Guitton, 2012), and distributed denial of service attacks (Hui, Kim, & Wang, 2013).
The reason for the different deterrent results between these two crime types might be that perpetrators of cybercrime are more rational (Guitton, 2012). It seems plausible that a cyberoffender, such as a spammer, would be more likely to read about spam in the news, and thus read about prosecutions of cybercriminals, whereas a more traditional property offender would not. Cybercrime tends to require more technical expertise, and therefore requires more time studying that crime type, so those drawn to such crimes might also be drawn to news reporting of the same crime. However, the possible deterrent effects of prosecuting spamming offenders specifically are not yet certain.
Efficacy and Evaluation of the CAN SPAM Act
The original formulation of the CAN SPAM Act contained plans by the FTC to conduct an evaluation of the Act’s effectiveness following its implementation (Muris, Thompson, Swindle, Leary, & Harbour, 2004). By December 2005, the FTC had completed its report to Congress evaluating the CAN SPAM Act. Based on the findings of the report, the FTC claimed that the rate of spam sent had begun to flatten out, slowing in its noticeable trend upward over time. It was also acknowledged that the amount of spam received in inboxes had been lessened due to better spam filtering technology (Majoras et al., 2005).
The FTC was not the only authority to evaluate the effectiveness of the CAN SPAM Act. Other independent evaluators of the Act's impact consisted mostly of computer security firms and spam filtering technology companies. With regard to spam rates, it appeared that the volume of spam sent had in fact gone up following the passing of the Act. According to Scott Chasin, Chief Technology Officer of the spam and malware filtering company MX Logic, spam had increased (Gross, 2004). According to MessageLabs, another anti-spam and cybersecurity firm, spam had grown by 50% to 80% a year following the passing of the Act (Zeller, 2005). However, neither of these sources indicates the type of statistical model used to support these claims.
A study that utilized a time-series design to determine the impact of the CAN SPAM Act on spam volume found that there was no significant influence on the absolute number of spam messages received (Kigerl, 2009). Spam volume was unaffected following the passing of the Act. The evaluation was based on a data set of spam sent between 1998 and 2008.
There are some limitations to this study that should be mentioned. The measure of the CAN SPAM Act included in all time-series models tested was a dichotomous measure, only indicating an abrupt but permanent impact of the CAN SPAM Act. A binary measure may not be sufficient to capture the true variation and influence that a legal code might have on actual behavior. A continuous measure of CAN SPAM would be more desirable, such as the number of prosecutions under the Act or the amount of media attention given to the Act in the news.
Furthermore, the time-series model used was also not completely sufficient to partial out the impact of the CAN SPAM Act. The model was only a simple time-series regression design, with a single predictor (the CAN SPAM Act). Additional control variables should be included, such as the growth in IT in the United States, or the number of Internet users per capita over time. Including control variables ought to help rule out variation in illicit spamming activity caused by the spread of technology and the Internet, so that we can find the unique effects the CAN SPAM Act might have had.
Current Focus
Data on spam volume sent over a 16-year time span have been collected for this research and tested against a number of time-series metrics capturing law enforcement and other possible deterrents of spammer behavior. A spam sample totaling 5,490,905 email messages transmitted in the United States has been gathered and analyzed. This research found that, contrary to previous literature, the CAN SPAM Act may actually have a deterrent influence on the amount of spam sent.
Method
A data set built from a sample of spam emails was created to investigate the impact the CAN SPAM Act has had on spam volume over time. Two measures of spam volume are included. The first is the absolute count of spam messages sent per month. The second is operationalized as a rate based on the number of Internet users in the United States; that being the amount of spam sent per 100,000 Internet users in the population.
Sample and Data
The sample data, that of spam emails, were taken from publicly available spam archives from which spam emails are collected and stored for subsequent download by researchers. The data were retrieved from the Untroubled Software website (http://untroubled.org/spam) on December 18, 2013. The Untroubled Software website is maintained by Bruce Guenter, and the available spam archives are collected by posting multiple “honeynet” email addresses publicly online for spam crawlers to harvest. The honeynet approach is intended to bait spammers to add a given email address to a spam listserv, with the goal of intentionally receiving spam emails. The nature of the email collection procedure, that of honeynet baiting, suggests that the majority of the emails are in fact spam and not legitimate emails. None of the messages received in the bait accounts were solicited; instead spammers capture the recipient’s email address from a website and proceed to send spam to it. While the CAN SPAM Act does not forbid the unsolicited sending of spam, it does consider address harvesting from online websites to be an aggravating factor (CAN SPAM Act, 2005). Thus, it should be assumed that the majority of emails in the sample are in violation of the CAN SPAM Act.
The data collected include all individual spam emails hosted for download which were received in bait honeypot email accounts between March 1998 and November 2013, totaling 5,490,905 email messages. No emails are excluded that fall between these dates, and the data include the entire population of available emails from the archives. Each email is encoded in an individual text file containing the contents of the spam email, which includes header information, the body of the message including any scripts or HTML, and any file attachments the email contained, converted to a plaintext format stored at the end of the file with an encoding scheme called BASE64.
Procedures
To be analyzed in a time-series design, the spam email sample was coded by the date it was received. Software was written to parse each message in the sample to code the messages on this dimension. The data were saved by the software and then aggregated by month, resulting in 189 months total.
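The aggregation step described above can be sketched in a few lines. This is a minimal illustration in Python, not the software used in the study; the function names and the three sample dates are hypothetical. The second helper counts the calendar months in the collection window (March 1998 through November 2013), which is where the 189 monthly observations come from.

```python
from collections import Counter
from datetime import date

def count_by_month(received_dates):
    """Count messages per (year, month) from an iterable of date objects."""
    return Counter((d.year, d.month) for d in received_dates)

def months_in_span(start, end):
    """Number of calendar months from start to end, inclusive of both."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1

# Hypothetical received dates for three messages.
sample = [date(1998, 3, 5), date(1998, 3, 20), date(1998, 4, 1)]
monthly = count_by_month(sample)

# Collection window stated in the text: March 1998 through November 2013.
span = months_in_span(date(1998, 3, 1), date(2013, 11, 1))
```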
Measures and Variables
There are three sets of measures that are included in subsequent models. The first set includes the two dependent variables of spammer volume (spam count and spam rate). The second set includes independent variables representing CAN SPAM Act activity, including enforcement, CAN SPAM Act attention and public awareness, attitudes toward the CAN SPAM Act, and lack of spammer anonymity due to attribution to the spammer’s identity. Finally, a number of economic and technological time-series predictors are available for inclusion in each model, to serve as control variables.
Dependent variables: Spam volume
The spam emails were aggregated by month and incorporated into a measure capturing the absolute number of spam emails received per month. The spam mining software recorded the date each email was received in the bait email account inbox used to collect the spam data. The date was extracted from the header information of each email.
Emails have a number of dates or time stamps representing a message’s transit over the Internet toward its destined recipient. Each time stamp is associated with an email’s “hop” or a transfer between routers or servers on a network. Each hop appends a new header record to the top of an email message, including information such as the server facilitating the hop, the date the message was received by the server while in transit, any authentication details about the message itself, among other things. Because each hop results in a header appended to the top of the email message headers, the top-most time stamp found in the message’s headers can be assumed to be the most recent hop. Therefore, it can be assumed to be the date the message was received. It is this date that the software will record.
The date received was used and not the date sent, because the date the message was sent would be recorded lower in the message’s headers, which are more likely to be falsified. Email senders can forge the initial headers before an email is sent, as the email sending protocol (Simple Mail Transfer Protocol) does not always authenticate messages prior to forwarding them on (Haskins & Nielsen, 2004). Once sent, the remaining headers appended to the message are more likely to be accurate, as the spammer has less control over the routing servers.
The date was identified by the software via regular expressions (a language for matching search patterns in textual data) written in Java. The software pulls the first date matched from the top of the headers. The pattern matched is any numerical digit one or two characters in length (the day), followed by a three-character string representing the month (“jan,” “feb,” “mar,” etc.), followed by a four-digit number (the year).
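The pattern just described can be illustrated with a short sketch. The study's software was written in Java; the version below uses Python for brevity and is an approximation under stated assumptions, not the original code. The function name, the exact regular expression, and the sample headers are all hypothetical.

```python
import re
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Day (one or two digits), three-letter month, four-digit year, e.g. "5 Mar 2002".
DATE_PATTERN = re.compile(
    r"\b(\d{1,2})\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)

def first_received_date(raw_headers):
    """Return the first date matched from the top of the headers.

    Returns None when no date is present or the matched date is impossible.
    """
    match = DATE_PATTERN.search(raw_headers)
    if match is None:
        return None  # no date anywhere in the headers
    day, month_name, year = match.groups()
    month = MONTHS.index(month_name.lower()) + 1
    try:
        return date(int(year), month, int(day))
    except ValueError:
        return None  # impossible date, such as day 95

# Hypothetical headers: the top-most Received line is the most recent hop.
headers = (
    "Received: from mx.example.net by bait.example.org; Tue, 5 Mar 2002 10:11:12 -0800\n"
    "Received: from forged.example.com; Mon, 1 Jan 1990 00:00:00 +0000\n"
    "Date: Mon, 1 Jan 1990 00:00:00 +0000\n"
)
received = first_received_date(headers)
```

Because `search` scans from the start of the text, the date on the top-most header wins, which matches the reasoning that the top-most time stamp is the least likely to have been forged.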
Some emails contained no dates in them. Other emails contained impossible dates (12/95/2005). These invalid dates totaled 4,766 emails or 0.09% of the sample. The software was also written so as to count the number of IP addresses (IPAs) present in the email headers. This count measure was dichotomized as “1” for the presence of one or more IPAs and “0” for no IPA, and then cross-tabulated with the presence of invalid dates. Of those spam emails with invalid dates, 96.92% also had missing IPAs. Among emails without invalid dates, 0.02% had missing IPAs. This suggests the invalid dates are a product of header forgery so as to conceal the origin of the message.
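The dichotomization and cross tabulation can be sketched as follows. This is an illustrative Python fragment, not the study's software: the IPv4-only pattern is a simplification (it does not check that each octet is at most 255 and ignores IPv6), and the eight records are toy values chosen only to show the calculation, not data from the sample.

```python
import re
from collections import Counter

# Simplified IPv4 pattern (assumption): four dot-separated groups of 1-3 digits.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def has_ip(raw_headers):
    """Dichotomize the IP count: 1 if any address is present, else 0."""
    return 1 if IPV4_PATTERN.search(raw_headers) else 0

def cross_tabulate(records):
    """Count each (invalid_date, has_ip) combination."""
    return Counter(records)

def pct_missing_ip(table, invalid_date):
    """Percentage of emails in one date group that have no IP address."""
    missing = table[(invalid_date, 0)]
    total = missing + table[(invalid_date, 1)]
    return 100.0 * missing / total if total else 0.0

# Toy records as (invalid_date, has_ip) pairs; hypothetical values only.
toy = [(1, 0), (1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 1), (0, 0)]
table = cross_tabulate(toy)
```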
There may thus be systematic differences between emails with and without invalid dates, as those with invalid dates may capture more serious spam offending. However, the portion of emails with invalid dates in such a way was very small, and so emails in the sample with invalid or missing date information were eliminated from the data set. The data were then aggregated by month based on the date received measure. A time-series plot of this measure can be seen in Figure 1.

Figure 1. Plot of spam volume per month time series.
Note the large spike in spam volume during late 2006. It was noted from the spam archives source website that the large and sudden increase in spamming activity was not due to a genuine increase in spam volume. Instead, it was caused by a technical change in how the honeynet bait email client was set up. During the 3 months of August through October of 2006, wild card addresses were accepted by the mail server, allowing misspelled recipient user names in the recipient address field to be successfully delivered anyway. To correct for these three outliers in the data, the 3 months were deleted and replaced with interpolated values based on the trend and contiguous values of the time-series data.
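The replacement of the three outlier months can be illustrated with a simple stand-in. The text says only that the values were interpolated from the trend and contiguous values; the sketch below uses plain linear interpolation between the neighboring months, which is an assumption and may differ from the method actually applied. The function name and the toy series are hypothetical.

```python
def interpolate_gap(series, start, end):
    """Replace series[start..end] with values on a straight line between
    the observation just before the gap and the one just after it.

    Requires start >= 1 and end <= len(series) - 2.
    """
    left, right = series[start - 1], series[end + 1]
    steps = end - start + 2  # number of intervals between the two neighbors
    out = list(series)
    for k, i in enumerate(range(start, end + 1), start=1):
        out[i] = left + (right - left) * k / steps
    return out

# Toy monthly counts with a three-month spike at positions 2-4.
toy_counts = [100, 120, 900, 950, 990, 200, 210]
cleaned = interpolate_gap(toy_counts, 2, 4)
```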
The resulting variable is the count of spam messages sent per month between March 1998 and November 2013. A second measure was created from this variable representing spam rate per month, based on the number of Internet users in the U.S. population. Internet users were used instead of the entire population of the United States because recipients of spam (or senders of spam) have to be Internet users. The creation of the rate required two additional variables: the population of the United States per month and the percentage of the population who are Internet users per month (see “Control Variables: Technological, Economic, and Demographic Predictor” section for details on these measures). Some evidence suggests the predictors of cybercrime outcomes can vary depending on whether cybercrime is measured as an absolute count or as a rate (Kigerl, 2013), at least in the case of digital piracy. Spam rates are measured as the count of spam received per month divided by the number of Internet users in the population per month, multiplied by 100,000. The number of Internet users is computed based on the population size of the United States per month multiplied by the percentage of the population who are Internet users per month.
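The rate calculation described above can be written out directly. The sketch below is illustrative Python; the function name is hypothetical and the input figures are round toy numbers, not values from the data set.

```python
def spam_rate_per_100k(spam_count, population, pct_internet_users):
    """Spam received per 100,000 Internet users in a given month.

    pct_internet_users is a percentage (e.g. 50 for 50%).
    """
    internet_users = population * pct_internet_users / 100.0
    return spam_count * 100_000 / internet_users

# Toy month: 30,000 spam emails, 300 million people, 50% online.
toy_rate = spam_rate_per_100k(30_000, 300_000_000, 50)
```

With these toy inputs there are 150 million Internet users, so the rate works out to 20 spam emails per 100,000 users.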
Both spam volume outcome measures are lagged by 1 month in all subsequent analyses. The data are monthly, and so the shortest possible lag for the monthly data was selected, that being a 1-month lag. It is expected that there would not be much of a delay between spammer behavior and actual spam rates. Actual spam rates tend to drop over the weekend, known as the weekend effect, and also drop over the holidays (Thomason, 2007). This suggests that spammers work less on the weekend, which actually affects their spamming volume. Their spam operations do not appear to be automated so much that their productivity would continue unabated during their time off. Therefore, a shorter lag is selected to measure the impact the CAN SPAM Act has on spam volume.
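One way to read the 1-month lag is that a predictor observed in month t is paired with the spam outcome in month t + 1. That direction is an assumption made here for illustration; the text does not spell it out. A minimal Python sketch, with a hypothetical function name and toy values:

```python
def pair_with_lagged_outcome(predictor, outcome, lag=1):
    """Pair predictor[t] with outcome[t + lag].

    The final `lag` predictor months have no matching outcome and drop out.
    """
    return list(zip(predictor[:len(predictor) - lag], outcome[lag:]))

# Toy series: convictions reported per month and spam counts per month.
toy_convictions = [1, 0, 2]
toy_spam = [10, 20, 30]
pairs = pair_with_lagged_outcome(toy_convictions, toy_spam)
```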
Independent variables: CAN SPAM Act activity
CAN SPAM Act enforcement
The quantity and severity of CAN SPAM Act enforcement and prosecutions highlighted by the media were captured from the news results, such as the number of prosecutions, the number of convictions, and the amount of damages awarded during lawsuits. Prior research has associated similar measures with reductions in malicious hacking attempts at the national level (Png & Wang, 2007). This research will investigate if the same holds for spam crimes.
A LexisNexis search of keywords such as “CAN SPAM” and “CAN SPAM Act” yielded 347 news articles published on the topic of the CAN SPAM Act and spam in general. The articles were coded on a number of dimensions. There are nine time-series measures of CAN SPAM Act enforcement and deterrence from the news article sample coding. They include two measures of damages awarded (the total U.S. dollars awarded per month and the count of articles awarding damages per month). There are also two measures of spammer detentions, including the sum of days that spammers are detained per month, and the count of articles mentioning spammer detentions. The number of spammer arrests per month is also recorded. There are also three count variables representing trials under the CAN SPAM Act: the number of convictions, acquittals, and currently ongoing and unresolved trials. Finally, the percentage of articles relating to the CAN SPAM Act per month is recorded. Not all articles were about the CAN SPAM Act, but were instead about other topics related to spam.
News critical of the CAN SPAM Act
Much of the initial attention the CAN SPAM Act received when it was introduced was not positive about the Act’s effectiveness (Arora, 2006; Grimes, 2007; Lee, 2005; Zeller, 2005). Naturally, this kind of reporting could have the opposite effect of deterrence, emboldening spammers located in the United States. Three time-series measures were constructed to capture attitudes toward the CAN SPAM Act: the percentage of articles positive, negative, and neutral about the CAN SPAM Act. The same was done for author attitudes about spam in general. That is, whether authors are positive, negative, or neutral about society’s ability to fight spam.
Attribution of spammers
The impersonal and anonymous nature of crimes perpetrated in cyberspace, like that of spam, can attenuate some of the deterrent effect a legal punishment might impose. Attribution of cybercriminal identities in the news can reduce some of the feelings of anonymity online. That is, news that mentions the identity of a specific cybercriminal, rather than discussing cybercrime in general. Attribution of cybercriminals at the national level has been associated with fewer cybercrime attacks within those countries (Guitton, 2012). This research seeks to perform the same analysis for spam-related cybercrimes, capturing the percentage of articles per month with and without spammer attribution in the news.
Dichotomous impact of the CAN SPAM Act
A simpler measure of the CAN SPAM Act will also be used: a before-and-after intervention variable indicating the months in which the CAN SPAM Act was in effect as a law and being enforced. The CAN SPAM Act impact measure is coded “0” for any time before January 1, 2004, before the CAN SPAM Act went into effect. Any time on or after this date the measure is coded “1,” true, for CAN SPAM enforcement.
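The coding rule for this intervention variable is simple enough to state as a short function. The sketch below is illustrative Python; the function and constant names are hypothetical, and each month is represented by its first day.

```python
from datetime import date

# Date the Act went into effect, as stated in the text.
EFFECTIVE_DATE = date(2004, 1, 1)

def can_spam_in_effect(month_start):
    """Return 1 for months on or after the effective date, else 0."""
    return 1 if month_start >= EFFECTIVE_DATE else 0

codes = [can_spam_in_effect(d)
         for d in (date(2003, 12, 1), date(2004, 1, 1), date(2013, 11, 1))]
```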
Google CAN SPAM Act search history
Google offers reporting of time-series plots for popular terms searched for using the Google search engine, called Google Trends (http://www.google.com/trends). It is suspected that Google searches for the CAN SPAM Act ought to reflect public awareness of the law. If awareness of the CAN SPAM Act by spammers has an influence on their behavior (e.g., a deterrent influence), then general searches of the CAN SPAM Act ought to be correlated with spammer awareness and perhaps even deterrence. Multiple time series of different CAN SPAM search queries (e.g., “can spam,” “can spam act,” “can-spam act”) were downloaded from the Google Trends service and merged into a single time series representing the count of all searches related to the CAN SPAM Act per month. Searches were limited to only those within the United States.
Control variables: Technological, economic, and demographic predictor
Control variables are also included in all models to help rule out spurious influences other than the independent variables (CAN SPAM Act enforcement and other deterrent measures). There are three groupings of control variables included: technological, economic, and demographic/other time-series predictors. If the results are suggestive that the CAN SPAM Act might be a deterrent, inclusion of the control variables would enhance this certainty. Any observations with missing data points have been interpolated.
The growth in spam is highly likely to also be in part due to the growth of technology in general. That is why these other forms of growth ought to be accounted for. Technological predictors include the number of Internet users per capita (Pew Internet Research, 2013), the number of technology jobs in computer systems and related services (Federal Reserve Economic Data (FRED), 2013a), and the Wilshire Internet Market Index (FRED Economic Data, 2013h), which measures the price and the total returns on investments (the performance) of publicly traded Internet stocks. All three measures were found to not be sufficiently trend stationary (Dickey–Fuller = −2.13, p = .52; Dickey–Fuller = −2.66, p = .3; and Dickey–Fuller = −2.3, p = .45, respectively). Regular differencing resulted in stationarity (Dickey–Fuller = −5.66, p < .01; Dickey–Fuller = −6.76, p < .01; and Dickey–Fuller = −5, p < .01). For details on this methodology, see the Serial Correlation Tests in the “Analytic Plan” section.
Because the environment under which spam is being tested includes the United States, U.S. economic predictors are also important as control measures. Six economic predictors are included, composed of real disposable personal income per capita (FRED Economic Data, 2013e), Gross domestic product (GDP) growth rates (Organisation for Economic Co-Operation and Development [OECD], 2013), the U.S. unemployment rate (FRED Economic Data, 2013c), the percentage of the population with a college degree (FRED Economic Data, 2013b), the Consumer Price Index (CPI; FRED Economic Data, 2013d), and the Financial Stress Index (FRED Economic Data, 2013f). The CPI measures the inflation level and spending power of the average U.S. household to purchase from a fixed list of consumer goods. The financial stress index measures the amount of financial stress in the markets and is built from 18 time-series data sets capturing interest rates, yield spreads, and other indicators. Five of the variables, income, GDP, unemployment, CPI, and the Financial Stress Index, were found to be serially correlated (Dickey–Fuller = −0.71, p = .97; Dickey–Fuller = −2.99, p = .16; Dickey–Fuller = −2.13, p = .52; Dickey–Fuller = −2.49, p = .37; Dickey–Fuller = −2.96, p = .17, respectively). Regular differencing resulted in significant stationarity (Dickey–Fuller = −5.78, p < .01; Dickey–Fuller = −4.23, p < .01; Dickey–Fuller = −8.21, p < .01; Dickey–Fuller = −6.4, p < .01; Dickey–Fuller = −5.81, p < .01).
Three additional demographic and other variables consist of total population size per month (FRED Economic Data, 2013g), younger population aged 15 to 24 (FRED Economic Data, 2013i), and Uniform Crime Reports (UCR) arrest rates per 100,000 individuals in the population (National Archive of Criminal Justice Data [NACJD], 2013). These variables tend to relate to street crime, so it is considered whether they also relate to the cybercrime offense of sending spam. Both general population (Dickey–Fuller = −2.08, p = .54) and younger population (Dickey–Fuller = −0.76, p = .97) measures were found to be serially dependent. Differencing results in significant stationarity for both (Dickey–Fuller = −20.85, p < .01; Dickey–Fuller = −6.54, p < .01).
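The regular differencing applied to the nonstationary control variables replaces each value with its change from the previous month. A minimal Python sketch follows; the function name and the toy series are hypothetical, and the Dickey–Fuller statistic itself is not computed here, since that would require a statistics package.

```python
def difference(series, order=1):
    """Apply regular (first) differencing `order` times.

    Each pass replaces the series with month-to-month changes and
    shortens it by one observation.
    """
    out = list(series)
    for _ in range(order):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out

# Toy series with a steady upward trend.
toy_trend = [2, 5, 8, 11, 14]
toy_diff = difference(toy_trend)
```

In this toy case the trending series becomes a constant series of month-to-month changes, which is the intuition behind differencing away a trend before regression.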
Analytic Plan
Missing monthly values
As mentioned, months represent the unit of analysis. However, some of the control measures contain missing values. Because time-series measures tend to be serially correlated, with each observation dependent on the observations immediately preceding and following it in time, interpolation is appropriate to estimate missing values. Interpolation estimates missing time-series values based on these adjacent and serially correlated observations (Chow & Lin, 1971). The method of interpolation used for the data is the Kalman filter, which breaks a univariate time series into three components: trend, seasonal, and level/noise time series (Brookner, 1998; Pizzinga, 2012). The Kalman filter smooths across these component series to create estimates of missing values and then combines them back into the original series, now without missing cases.
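To illustrate the idea, the sketch below fills gaps with a local level (random walk plus noise) Kalman filter followed by a fixed-interval smoother. It is a simplification under stated assumptions, not the structural model with trend and seasonal components described above: the noise variances `q` and `r`, the diffuse starting variance, and the function name `kalman_fill` are all illustrative choices, and the data are made up.

```python
def kalman_fill(series, q=1.0, r=0.01, initial_variance=1e6):
    """Fill None entries using a local level Kalman filter and RTS smoother.

    Assumed model (illustrative): level[t] = level[t-1] + w, w ~ N(0, q)
                                  obs[t]   = level[t]   + v, v ~ N(0, r)
    """
    observed = [y for y in series if y is not None]
    if not observed:
        raise ValueError("series needs at least one observed value")

    n = len(series)
    pred_mean, pred_var = [0.0] * n, [0.0] * n
    filt_mean, filt_var = [0.0] * n, [0.0] * n

    # Forward pass: predict each month, then update only if it was observed.
    mean, var = float(observed[0]), initial_variance
    for t, y in enumerate(series):
        if t > 0:
            var = var + q  # the level drifts, so uncertainty grows
        pred_mean[t], pred_var[t] = mean, var
        if y is not None:
            gain = var / (var + r)
            mean = mean + gain * (y - mean)
            var = (1.0 - gain) * var
        filt_mean[t], filt_var[t] = mean, var

    # Backward pass (Rauch-Tung-Striebel): let later observations inform gaps.
    smooth = filt_mean[:]
    for t in range(n - 2, -1, -1):
        c = filt_var[t] / pred_var[t + 1]
        smooth[t] = filt_mean[t] + c * (smooth[t + 1] - pred_mean[t + 1])

    # Keep observed values as they are; replace only the missing ones.
    return [smooth[t] if y is None else y for t, y in enumerate(series)]


# Hypothetical monthly control series with one missing month.
filled = kalman_fill([1.0, 2.0, None, 4.0, 5.0])
print(filled)
```

Because the smoother uses observations on both sides of a gap, the estimate for the missing month falls between its neighbors instead of simply carrying the last value forward.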
Multivariate analyses
Multiple generalized least squares (GLS) regression models will be conducted to test the impact the CAN SPAM Act measures have on spam volume outcomes lagged by 1 month, net other economic and technological controls. GLS allows for a non-constant variance among the residuals to be controlled for, as may be the case in time-series models due to serial correlation of the residuals (Judge, Griffiths, Hill, Lütkepohl, & Lee, 1986).
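As a rough illustration of how GLS can account for serially correlated residuals, the sketch below applies quasi-differencing for a single predictor with AR(1) errors (a Cochrane-Orcutt style transformation that drops the first observation). This is a deliberately simpler stand-in for the ARMA error structures used in the study. The value of `rho`, the function names, and the toy data are assumptions. In practice `rho` would be estimated, for example from the OLS residuals.

```python
def ols_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def gls_ar1(x, y, rho):
    """GLS for one predictor, assuming AR(1) errors with a given rho.

    Quasi-differencing (v[t] - rho * v[t-1]) removes the AR(1) dependence,
    after which OLS on the transformed data is appropriate.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    x_star = [x[t] - rho * x[t - 1] for t in range(1, len(x))]
    y_star = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    intercept_star, slope = ols_line(x_star, y_star)
    # The transformed intercept equals intercept * (1 - rho).
    return intercept_star / (1.0 - rho), slope


# Toy data lying exactly on y = 2 + 3x (not study data).
x = [0.0, 1.0, 4.0, 9.0, 16.0]
y = [2.0 + 3.0 * xi for xi in x]
intercept, slope = gls_ar1(x, y, rho=0.5)
print(intercept, slope)
```

With noise-free data the transformation recovers the original line, which serves as a basic check that the quasi-differencing and the intercept rescaling are consistent with each other.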
Predictor variables included in both regression models were selected via backward stepwise regression based on the model Akaike information criterion (AIC). Stepwise regression utilizing AIC proceeds by backward elimination of predictor variables and ends once further elimination would result in a higher AIC score. Predictors that increase the AIC score when eliminated are retained. Both regression models started with 33 independent and control variables, followed by backward elimination to select the strongest model for each measure of spamming outcomes.
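The elimination loop described above can be sketched in Python as follows. The sketch is generic: `aic_of` is a hypothetical callable that would refit the regression on a given subset and return its AIC. Here it is replaced by a toy scoring function with made-up fit gains, and the variable names are placeholders, not the study's 33 predictors.

```python
def backward_eliminate(predictors, aic_of):
    """Backward stepwise selection driven by AIC (lower is better).

    predictors: list of candidate variable names.
    aic_of: callable returning the AIC of a model fit with a given subset.
    """
    current = list(predictors)
    current_aic = aic_of(current)
    while len(current) > 1:
        # Score every model that drops exactly one remaining predictor.
        trials = [(aic_of([p for p in current if p != drop]), drop)
                  for drop in current]
        best_aic, best_drop = min(trials)
        if best_aic >= current_aic:
            break  # no single removal lowers AIC, so stop
        current.remove(best_drop)
        current_aic = best_aic
    return current, current_aic


# Toy stand-in for refitting a regression: each variable improves the fit
# by a made-up amount, and AIC adds a penalty of 2 per parameter.
FIT_GAIN = {"trials": 10.0, "detained": 6.0, "noise_a": 0.5, "noise_b": 0.5}


def toy_aic(subset):
    return 100.0 - sum(FIT_GAIN[p] for p in subset) + 2 * len(subset)


selected, final_aic = backward_eliminate(list(FIT_GAIN), toy_aic)
print(selected, final_aic)
```

In the toy example the two weak variables are dropped because their fit gain is smaller than the penalty of 2 per parameter, while the two strong variables are retained.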
Results
Interrater Reliability Testing: CAN SPAM Independent Variables (IV)
The coded CAN SPAM Act news articles were tested for interrater agreement. Of the 347 LexisNexis articles, 100 were randomly selected and coded a second time by an additional rater. There was sufficient agreement on whether the article related to the CAN SPAM Act or just spam in general (K = .88, p < .001), and on whether author attitudes toward the Act were positive, neutral, or negative (K = .82, p < .001). The agreement on author attitudes toward spam in general (positive, neutral, or negative) was significant (K = .68, p < .001), but slightly lower than the 0.7 cutoff. There was high agreement on trial status (ongoing trial, conviction, acquittal; K = .83, p < .001) and whether the spammer was detained (K = .93, p < .001). There was perfect agreement on whether the spammer was arrested (K = 1, p < .001) and on whether damages were awarded (K = 1, p < .001). Finally, there was substantial agreement on whether the identity of a given spammer was known (attribution; K = .87, p < .001). The coded measures were considered reliable and thus included in subsequent analyses.
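Assuming K above denotes Cohen's kappa for two raters, the statistic compares observed agreement with the agreement expected by chance from each rater's category proportions. The sketch below is a minimal Python illustration with made-up codes. It does not compute the accompanying p values, and the function name is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one category per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal category proportions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


# Made-up codes for four articles (1 = about the Act, 0 = spam in general).
first_rater = [1, 1, 0, 0]
second_rater = [1, 1, 0, 1]
kappa = cohens_kappa(first_rater, second_rater)
print(kappa)
```

In this toy case the raters agree on three of four articles (75%), but half of that agreement is expected by chance, so kappa is well below the raw agreement rate.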
Preliminary Analyses
Time-series regression of spam count per month
A backward stepwise regression using the model AIC was conducted with spam count per month regressed on an initial total of 33 predictor variables. Backward elimination yielded 7 predictor variables to be included in the final time-series model. The selected predictors were included in a linear regression model and the residuals were computed to test for autocorrelation. A Durbin–Watson test indicated significant autocorrelation of the residuals (dw = 0.36, p < .001). A differenced version of the model was not found to have significant autocorrelation (dw = 1.77, p = .088), although the result was marginally significant. The differenced version of the model was therefore used in the final analysis. After visual inspection of the autocorrelation function (ACF) and partial autocorrelation function (PACF) correlograms of the regression (not shown in figures), an autoregressive moving average (ARMA(1,2)) model was identified. A correlogram is a visual plot of the correlation between a time series and itself at a given lag, for successive increments of lags one and up (McDowall, McCleary, Meidinger, & Hay, 1980). The ACF is a simple correlation of a time series with itself at a given lag, while the PACF represents the correlation of a time series with itself at lag k, controlling for all lags in between itself and lag k. While differencing of the data is sufficient to create a trend stationary time-series process, there may still be some degree of serial dependency in the data. The ACF and PACF can reveal such serial dependency and indicate that said processes need to be controlled for in any subsequent regression models. If there are q spikes in an ACF correlogram, a moving average model of order q should be controlled for (ARMA(0,q)). If there is a decay pattern in the ACF, an autoregressive parameter at order p should be controlled for (ARMA(p,0)), p being inversely proportionate to the speed of decay. The reverse interpretation is required of the PACF correlogram, with a spike indicating an autoregressive process and decay patterns indicating a moving average process. A final GLS time-series regression model was run using the 7 selected and differenced predictors; the results are presented in Table 1.
Generalized Least Squares Time-Series Regression of Spam Count per Month, Autoregressive Moving Average (ARMA(1,2)) (n = 189).
Note. R² = 11.47%.
p < .1. *p < .05. **p < .01.
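The residual diagnostics used in this subsection can be illustrated with a short Python sketch that computes the Durbin–Watson statistic and the sample ACF. The residuals are made up, the function names are assumptions, and neither the p values nor the PACF are computed here.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 suggests no first-order autocorrelation."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def acf(series, max_lag):
    """Sample autocorrelations of a series for lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    variance_sum = sum(c ** 2 for c in centered)
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / variance_sum
        for k in range(1, max_lag + 1)
    ]


# Made-up residuals that flip sign every month (strong negative autocorrelation).
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
dw = durbin_watson(residuals)
lags = acf(residuals, max_lag=2)
print(dw, lags)
```

Values of the statistic well below 2, such as the 0.36 reported for the undifferenced model, point to positive autocorrelation, whereas the alternating toy residuals here push it above 2.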
Time-series regression of spam rate per Internet user per month
A backward stepwise regression using the model AIC was conducted with spam rate per Internet user per month regressed on an initial total of 33 predictor variables. Backward elimination yielded 11 predictor variables to be included in the final time-series model. The selected predictors were included in a linear regression model and the residuals were computed to test for autocorrelation. A Durbin–Watson test indicated significant autocorrelation of the residuals (dw = 0.63, p < .001). The data set was differenced, after which the residuals showed no significant autocorrelation (dw = 2.09, p = .825). After inspection of the ACF and PACF correlograms of the regression (not shown in figures), an ARMA(0,1) model was identified. A final GLS time-series regression model was run using the 11 selected and differenced predictors; the results are presented in Table 2.
Generalized Least Squares Time-Series Regression of Spam Rate per Internet User per Month, Autoregressive Moving Average (ARMA(0,1)) (n = 189).
Note. R² = 17.63%. CAN SPAM = Controlling the Assault of Non-Solicited Pornography and Marketing.
p < .1. *p < .05. ***p < .001.
Multivariate Analyses
Time-series regression of spam count per month
Referring to Table 1, percentage of the U.S. population who are Internet users was found to be significant and negative (B = −.023, p = .035), suggesting that more Internet users per capita was associated with less spam sent the following month. This finding is contrary to what might be predicted, as Internet users ought to reflect the number of possible spammers or recipients of spam. Spammers have to be Internet users, and recipients of spam also have to be Internet users.
The population size aged 15 to 24 is positively associated with spam at Lag 1 (B = .028, p = .031). This finding might indicate that the population concentration of youth is related to cybercrime, or at least spam crime, in similar ways as it is related to traditional street crime. However, the population size in general is also predictive of spam (B = .055, p = .032), and has a stronger effect size. This may simply mean that more people also means more spammers, and may have little to do with youth. The outcome is absolute, not a rate, and population size is also absolute; therefore, one can anticipate this association.
The remaining predictors in the model are the CAN SPAM Act and deterrence independent variables. The count of articles that mention an unresolved ongoing CAN SPAM trial per month is associated with less spam sent the following month (B = −.027, p = .025). The count of articles that mention a spammer being convicted is also associated with less spam, but only approached significance (B = −.024, p = .052). Both findings suggest a possible deterrent effect of CAN SPAM trials on spam volume sent per month. However, the count of articles mentioning a spammer being detained per month is positively associated with spam the following month (B = .043, p = .004).
Finally, the percentage of articles that are negative about our ability to combat spam as a proportion of all articles on spam is associated with higher spam volume (B = .013, p = .037), although the effect size is very small. A dejected attitude of authors in the news regarding spam may embolden spammers to continue engaging in their spam sending operations.
Time-series regression of spam rate per Internet user per month
The variables included in Table 2 are similar to those found in the prior model in Table 1. Internet users per capita is still significant and is in the same direction when measuring spam as a rate based on the number of Internet users in the United States (B = −.057, p = .014). More Internet users predict less spam at 1-month follow-up, whether measured as a rate or as an absolute count. Youth population size is also significant, predicting higher spam rates given an increase in younger members of the population (B = .058, p = .042). Absolute population size is no longer significant when predicting a relative spam rate (B = .097, p = .068), although it is marginally significant.
Some of the deterrent variables predicting spam rates are similar to those predicting spam counts. The number of ongoing trials of spammer prosecutions under the CAN SPAM Act is associated with a decrease in spam rates the following month (B = −.079, p = .049). Similar to the spam count model, the number of spammer conviction articles is only marginally significant, but it is also associated with a reduction in spam (B = −.061, p = .089).
The count of articles mentioning spammers being detained under the CAN SPAM Act was selected for inclusion in the spam rate model as well. Unlike the spam count model, however, the sum of days spammers are detained per month was also included in the spam rate model. The sum of days detained variable sums all the days of incarceration mentioned in news articles per month. The count of articles mentioning detention has the same effect for spam rate as it did for spam counts, suggesting more detentions actually increases spam (B = .21, p < .001).
The new variable, the sum of days detained, is in the opposite direction, suggesting longer durations of detention per month actually decrease spam rates the following month (B = −.109, p = .021). That is, holding the number of articles mentioning spammer detentions constant, longer detention durations may have a deterrent effect. It may be that articles mentioning spammer detentions describe spammers being detained for very short times, which might actually have an emboldening effect on spam.
To test this, an alternate spam rate model was conducted excluding the count of spammers detained variable (not shown in tables). The elimination resulted in the sum of days detained variable becoming positive, although it was no longer significant. Only once the number of articles is controlled for does the sum of days detained indicate deterrence, as a high sum of days detained may simply mean many articles with short durations of detention. It should be noted that the same is not exactly true for the spam count model. Adding the sum of days detained to that model does not result in significance (not shown in tables), although the coefficient is still negative. However, eliminating the count of detention articles results in a significant sum of days detained variable that is positive in its impact on spam counts.
Finally, the percentage of articles published per month that describe the CAN SPAM Act is associated with a reduction in spam rates the following month (B = −.048, p = .022). The result might suggest that such articles serve as a deterrent. While only marginally significant, the percentage of articles negative about our ability to fight spam is associated with more spam being sent the next month (B = .028, p = .054), although the effect size is very small.
Discussion
Implications and Policy Recommendations
Prior literature mostly concluded that spam has increased following the CAN SPAM Act (Arora, 2006; Lee, 2005; Zeller, 2005). The FTC concluded that spam leveled off following the passing of the CAN SPAM Act (Majoras et al., 2005). Neither of these conclusions is necessarily incorrect, as spam is definitely higher today than prior to 2004, during the nascent stages of email spam. However, after inspecting a time-series plot of spam over time, it may appear that spam has leveled off, albeit at a higher level, following the CAN SPAM Act. Despite this, causality cannot be drawn from these conclusions, as very few, if any, statistical measures were used to identify any kind of deterrent effect.
The only evaluation of the CAN SPAM Act utilizing adequate statistical methods concluded that a dichotomous measure of the Act had no significant impact on spam volume (Kigerl, 2009). The current study mirrored these findings, as a dichotomous measure of the Act was also included in the analysis. This binary measure was not selected during the backward stepwise selection process, suggesting that including it did not lower the AIC of the spam volume model. The conclusion would be the same as that in Kigerl (2009), suggesting no difference in spam volume following the passing of the Act.
Yet, the present research went beyond a simple dichotomization of the CAN SPAM Act, and included multiple other measures of the Act as well as additional deterrence measures. Regarding these measures, the impact of the CAN SPAM Act on spam volume at first look appears to be mixed. The number of ongoing CAN SPAM trials and convictions, as well as the percentage of articles on the CAN SPAM Act, is associated with a reduction in the amount of spam sent, suggesting a deterrent effect. However, the number of articles mentioning spammers being detained increases spam volume for both spam counts and spam rates. Spammer detentions may capture something not theoretically foreseen prior to coding CAN SPAM news articles.
The reason for such a finding may be explained by the second model, that of spam rates. While the number of articles mentioning spammer detentions increases spam, the sum of days detained for all spammers per month decreases spam, so long as the count of these articles is controlled for. That is, if the number of detention articles remains the same but the total days detained for the month increases, the average days detained per article is higher. More days of incarceration mentioned in articles is what produces a deterrent effect. However, the average number of days detained mentioned in each article tends to be very short, which instead produces an emboldening effect. So, while the effects of the CAN SPAM Act appear to be mixed at first glance in terms of deterrence, it may simply be that spammers are not detained long enough. Future research should investigate this relationship further. A new measure that ought to be considered would be the average number of days incarcerated per article per month, so that the count of articles and the sum of days detained would not both have to be included in the models. Instead, only one variable would be needed.
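The proposed measure, the average number of days detained per article per month, is a ratio of the two existing variables. A minimal Python sketch with made-up monthly values follows. The function name and the choice to return `None` for months with no detention articles are assumptions, since the text does not specify how such months would be handled.

```python
def average_days_per_article(days_detained, detention_articles):
    """Average detention length per article for each month.

    Months with no detention articles yield None, since the average is undefined.
    """
    if len(days_detained) != len(detention_articles):
        raise ValueError("both series must cover the same months")
    return [days / count if count else None
            for days, count in zip(days_detained, detention_articles)]


# Made-up monthly values (not study data).
sum_of_days = [30, 0, 720, 45]
article_counts = [3, 0, 2, 9]
averages = average_days_per_article(sum_of_days, article_counts)
print(averages)
```

In the toy data, the third month has few articles but long sentences, while the fourth has many articles with short detentions, which is the contrast the two separate variables were capturing jointly.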
The finding might also have implications for policy changes. Most of the violations under the CAN SPAM Act are punished via fines and financial damages judged against the spammer. However, none of the measures of fines against spammers were found to be significant predictors of spam volume. Instead, it seems that the number of days detained produces a deterrent effect, so long as the length of incarceration is sufficiently long. It might be beneficial for the CAN SPAM Act to be revised such that penalties focus more on prison sentences rather than fines. The prison sentences should also be sufficiently long to avoid the possible emboldening effect witnessed in the data here. Many spammers have substantial incomes from their illegal spam businesses, hence why fines may do little to deter. However, lengthy prison sentences might be sufficient to make spammers think twice about sending spam illegally.
It is also important to note that a higher number of articles mentioning ongoing trials involving spammers under the CAN SPAM Act, as well as more convictions of spammers, appears to decrease the amount of spam sent the following month. This might suggest that more prosecutions of spammers would be beneficial. As it stands now, the CAN SPAM Act largely goes unenforced, as enforcement is limited to a narrow set of authorities and the Act is conservative as to what is classified as spam. The Act might do well to have its definition of illegal spam broadened or simplified to facilitate easier enforcement. The definition of illegal spam could be simplified to only include unsolicited commercial mail, requiring opt-in. Such a revision might cut down on the annoyances legitimate marketers impose (at the very least), as they would likely comply with the law. As of now, unsolicited spam is not illegal, so long as the spammer abides by specified regulations.
The law might apply to more than just commercial electronic message marketers as well, such as requiring Internet service providers (ISPs) to authenticate all email sent from their networks. Also, the entities authorized to file lawsuits against spammers for damages might be expanded, such as permitting consumers to file suit in addition to the FTC, states, and Internet access providers. Enforcement under the CAN SPAM Act is not frequent, as the FTC is underfunded and ISPs have little profit motive to pursue spammers on their networks, who may not actually cost them much in the way of direct damages (Rutenberg, 2011); consumers and recipients themselves absorb most of the negative effects of spam. Opening the option of lawsuits to everyone who might be affected would likely increase enforcement. This approach would negate the possibility of jail sentences for spammers, but it would at least increase trials and convictions, which also appear to have a deterrent effect.
Limitations
The spam sample and the procedures for extracting the time-series metrics from the sample have some limitations that ought to be mentioned. The sample itself was acquired from only a single web archive, collected by an uploader (http://untroubled.org/spam) who may not have been completely consistent in the process for baiting spam messages during the entire 16 years of data collection. That is, changes might have been made or slowly introduced over time, such as in the number of bait email addresses or the frequency with which addresses were posted on the Internet. While the data ought to reflect genuine spamming activity, there may be fluctuations or systematic changes over time that would not be accounted for by the existing predictors in the model.
The spam sample used might also be skewed toward certain spammers who send more spam than usual to the same recipient’s inbox. For the individual email unit of analysis data set, each observation is not independent of other observations or spam that was received. That is, multiple emails can easily be sent by the same spammer, and many of those emails are likely identical to each other in the types of variables coded by the spam software. So the sample is biased toward spammers who send more spam to the same recipient. This is likely another example where the data are more representative of more serious, unlawful spammers.
Finally, all measures are intended to capture activity within the United States, as the CAN SPAM Act is U.S. legislation. All news articles are limited to those published in the United States, and all control variables are also specific to the United States. However, the dependent measures, those of spam volume, are not limited to the United States, nor should they be. All spam messages were baited from English-speaking websites using a mail server located in the United States. The spam rate measure was standardized based on U.S. Internet users to account for the recipients of spam, as that would be an appropriate measure to parallel the honeynet baiting process used to collect the spam data.
Conclusion
This research is the first of its kind on a number of dimensions. It is the first to use a series of continuous measures capturing the enforcement and awareness of the CAN SPAM Act in its possible influence on the spam volume. It is also the first to incorporate control variables while conducting impact assessments of the legislation on the amount of spam sent, taking into consideration the fact that there are likely many causes of spam. Finally, it uses a second, new measure of spam volume, that of spam rates based on the number of Internet users.
Many sources have attempted to comment on the efficacy of the CAN SPAM Act in decreasing spam. There are some substantial limitations with all of them. This research has addressed these limitations to make an evaluation of the CAN SPAM Act that is more conclusive. Most research on the CAN SPAM Act to date has concluded that the Act has little impact on the amount of spam received. Contrary to prior literature, enforcement of the CAN SPAM Act was found here to be associated with spam volume in a direction suggesting deterrence. The findings are consistent with the Act being related to illegal spam outcomes in the United States; hence, abandonment of the Act is not warranted. Rather, further revisions to the Act that seek to improve enforcement are strongly recommended.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
