Abstract
Cybercriminals operate in obscurity to avoid detection of their illegal deeds. This fact makes studying them more difficult. Many cybercriminals meet in illicit online market places such as carding forums. The forums are often visible, but the actual transactions are carried out in private messages beyond view. However, there is no honor among thieves, and sometimes a carding forum server database will be hacked and leaked to the public. Existing research has been conducted on such leaked databases, but much of it is quantitative, rather than offering any qualitative interpretation into the nature of the forum user base. This research sought to analyze two such leaked carding forum databases by applying hierarchical clustering of 10,714 registered user accounts, grouping users based on 19 variables consisting of comment history style, site engagement activity, and explicit status markers. The results yielded 16 categories of users from four different domains composed of general consumers, location-based consumers, producers, and an “other” category. Following categorization, qualitative analyses were conducted to further shed insight into the nature of the two forums.
Introduction and Literature Review
The global economy has experienced continued growth, and with it, e-commerce has similarly grown and expanded, becoming more streamlined and accessible. A shadow economy has grown in parallel: the greater presence of consumers online with their credit cards at the ready has been met with ever-growing rates of identity fraud (Pascual et al., 2018). The cybercrime marketplace, like most e-commerce, has continued to expand.
This underground shadow economy can be divided into layers of visibility over the Internet. The topmost layer consists of publicly viewable forums and other communities that do not have restrictions on access. The next layer includes invitation-only online communities where a reference from an existing community member is required for entrance. The third layer consists of teams of cybercriminals who work together with a certain, fixed group of collaborators who communicate through chat or other means that do not have a central website or online presence. Finally, the lowest layer includes cybercriminal groups that have a physical location of operation where they work, sometimes in the guise of a legitimate business (Lusthaus, 2019).
Naturally, the online marketplaces where illicit trades are made are the most visible to researchers for study. In the East, these illicit online marketplaces take the form of social networking sites. In China, in particular, Baidu Tieba serves as a social networking service on China’s search engine, occupying 63% of the market share (Yip, 2010). The service offers user-generated message boards that are heavily utilized by identity thieves promoting popular boards such as “Visa,” “Master,” and “card verification value (CVV),” where stolen credit cards can be bought and sold.
In the Western world, illicit online marketplaces are more decentralized on disparate and loosely networked websites taking on one of four different formats. The formats consist of Internet Relay Chat (IRC), carding shops, darknet marketplaces (DNMs), and carding forums (Du et al., 2018).
IRC venues for trafficking stolen credential goods are the most impermanent. IRC is built on a separate Internet protocol and is hosted by an IRC server offering multiple channels dedicated to predefined topics (Allodi et al., 2015). IRC does not maintain an archive of past exchanged messages; instead, communication is live and can only be carried out when two parties are online simultaneously. If a user loses a connection, the prior messages cannot be salvaged. It is also virtually impossible to distinguish legitimate sellers from illegitimate ones.
Carding shops are dedicated websites that facilitate many underground activities and provide high-quality carding services. They serve as a supply chain for carders who wish to sell stolen cards (Benjamin et al., 2015). A large amount of stolen card data are traded and diffused through carding shops. Carding shops tend not to have as much of an open exchange of information as other online communities, as the website administrators are the gatekeepers of information on the site.
DNMs are not indexed by public search engines but instead inhabit the dark web, which can only be accessed via the specialized Tor protocol. Tor implements a tiered layer of encryption between client and server, enhancing the anonymity of user activity on the dark web (Lacey & Salmon, 2015). Purchases are made in Bitcoin or other cryptocurrencies. In addition to stolen credential goods, illegal drugs, counterfeit money, and stolen merchandise can also be ordered. Due to the nature of the dark web, and the less open format of the services therein, DNMs are more difficult to find and have a harder time attracting a large customer base.
The final and most popular venue for trading illicit digital goods is the carding forum. Web forums consist of multiple boards pertaining to specialized topics (e.g., credit card sellers, malware distribution, general discussion), and within each board, users may start a new thread by making a post that other users can read and reply to, or they may reply to existing threads. Users must register a profile with the website in order to comment. Profiles can be used to track user trustworthiness ratings, and each board has one or more moderators who can act as a sort of law enforcement agent, banning disruptive members and setting guidelines.
Carding forums are the most widespread format in the West for exchanging illicit goods. One of the largest efforts to collect and tabulate cybercrime community data identified 102 platforms totaling 43,981,647 records (Du et al., 2018). Among those, 12 were darknet markets, 13 were IRCs, 26 were carding shops, and 51 were carding forums. The carding forums were the most prolific in terms of generating records as well, accounting for 73% of all records stored across the 102 platforms.
One of the possible reasons for the greater success of carding forums might be the offering of greater formal control and coordination, with a hierarchical structure and greater specialization (Yip et al., 2012). Administrators are at the top of the pyramid and are responsible for the overall management of the forum and making decisions. Moderators are responsible for the management of the subforums based on their expertise or geographic location. Reviewers have the duty of testing and vetting goods and services prior to sale. Vendors can also achieve reviewed status and be accorded more market share based on the accumulation of trust.
Carding forums mitigate uncertainty by offering two mechanisms: a sophisticated review system and an exchange service known as escrow, where goods (funds, credit card information) are temporarily held by a trusted third party prior to approval of the transaction (Yip et al., 2013). Both of these services are enforced by a well-defined management hierarchy. Those seeking to purchase goods or services would seek out these trusted members to mitigate uncertainty. Such assurances likely draw in more visitors in a market environment plagued by what are termed “rippers,” or users masquerading as vendors who fraudulently collect payments but never deliver (Holt, 2013). These are scammers who scam the scammers.
The myriad forums available differ from each other as well. Larger forums can have many thousands of members with flatter hierarchies, multiple tiers, and a more diverse portfolio of goods and services. Smaller communities are limited to a couple of hundred users, with a two-tiered hierarchy reminiscent of a gang, and focus on a smaller subset of cybercrime activities (Garg et al., 2015). Individual participants will often have accounts on more than one forum, and their behavior on each forum may differ from the rest (Park et al., 2018). Having multiple accounts across different forums is often necessary as forums tend to have a short life expectancy before they are taken down by law enforcement (Frank et al., 2018).
Most of the users of forums are generalists, not specializing in any single role or capacity. Specifically, most sellers provide more than one type of product or service (Haslebacher et al., 2017). Some of the goods available consist of different tiers of stolen identity information. These consist of “CVVs,” or enough stolen credit card information to make unauthorized online purchases; “fullz,” which include more personally identifying information, such as credit card details, full name, and social security number, or enough information to open up a bank account in another person’s name; and “dumps,” which are complete copies of a victim’s credit card magnetic stripe, enough to clone the card entirely and use it in physical retail shops (Haslebacher et al., 2017).
Crimeware is also sold, as users can purchase malware, spyware, and exploits for vulnerabilities in computer systems that have yet to be patched, which can be used to hack those computer systems (Gaspareniene & Remeikiene, 2015). Zero-day exploits are in high demand; these are exploits for vulnerabilities that not only have no known fix from system providers but are also unknown to all except the attackers themselves (Allodi, 2017). Software companies will also pay for such exploits, as they have a vested interest in patching them. However, such exploits often command a higher price on the black market, so illegal sale of them is tempting (Allodi, 2017).
In addition to crimeware products, crimeware services are also available. Services consist of spamming services where spammers for hire can direct traffic to a buyer’s website; botnet rentals and distributed denial-of-service, where a botnet herder who owns thousands of infected hacked computers floods rival servers with an overwhelming deluge of traffic, taking the server off-line; document forgery services; and cash-out services to assist with laundering money misappropriated via identity theft (Kigerl, 2018; Mikhaylov & Frank, 2016).
Because carding forums consist of publicly accessible websites where offenders can exchange business-related communications, such websites are a wealth of information for conducting qualitative research. Anyone with an Internet connection can observe these exchanges in the wild and document their findings. Additionally, due to the digital nature of these websites and the high volume of posts made by carding forum members, these websites are also amenable to quantitative research as well.
In order to convert user activity on forum web pages into an analytic data set, researchers have developed web crawlers to automatically read through each page and extract relevant details required for each research question. Much of this research is applied, seeking to build tools to combat cybercrime. Among them are tools to predict user signaling of their offerings or illicit desires on such markets (Décary-Hétu & Leppänen, 2016) and also to identify top sellers of credential goods (Benjamin et al., 2015; Li & Chen, 2014).
Other research has sought to apply statistical clustering techniques to the data acquired from carding forums. Clustering approaches are designed to take attributes about a subject as inputs, be that subject a user on a forum or otherwise, and group those subjects together into multiple distinct categories based on shared similarities between the subjects on the inputted attributes. Such approaches have been used to identify expert hackers on a hacking forum and identify their specialties (Abbasi et al., 2014), group key actors (users actively trading illegal goods) into classes based on the type of activity on carding forums (Pastrana et al., 2018), group the malware itself shared on hacking forums into categories (Samtani et al., 2015), group online hacking tutorials based on content (Samtani et al., 2017), and group users together to identify top sellers and high-level offenders (Li, Chen, & Nunamaker, 2016; Li, Yin, & Chen, 2016).
Kigerl (2018) combined clustering approaches with crawler results from three carding forums and applied an additional qualitative analysis of the cluster categories. The results mirrored contributions from prior qualitative analyses of carding forums but also isolated some additional groupings and further granular subcategorization of established categories, among them document forgery services as well as subdivisions of buyers into satisfied customers and unsatisfied customers.
However, information gleaned from carding forums using exclusively a crawler that scans only publicly viewable web pages does not capture the entire picture of these underground communities. Many, if not most, of the illicit exchanges are carried out through each forum’s private messaging (PM) system, exchanges that are not publicly visible. Yet there have been multiple incidents where a carding forum database is hacked or leaked to the public Internet, making all website contents, including the private messages (PMs), available to view.
Researchers have seized on these opportunities where they arise. Contributions to the literature using such data sources include a qualitative analysis of Darkode, an invitation-only forum for both buyers and sellers of crimeware products, credential goods, proxy servers, and other products and services. However, distrust among community members was rife, and the site was eventually taken down by the Federal Bureau of Investigation in 2015 (Dupont et al., 2017).
Quantitative analyses on leaked forum data have also been conducted such as network analysis performed on connections between users via exchanges of PMs (both inbound and outbound) to isolate the topmost connected participants (Motoyama et al., 2011; Yip et al., 2012). Other researchers have examined the differences in subculture between larger and smaller forums—smaller forums were similar to a gang in leadership, whereas larger forums have much more complicated tiered hierarchies (Garg et al., 2015). Overdorf et al. (2018) created a predictive model to attempt to determine when a publicly available exchange between two participants moved behind the scenes to the website’s PM system.
However, clustering approaches combined with a qualitative analysis have not yet been conducted on leaked carding forum databases, only on publicly available and crawled websites. This research seeks to fill this gap by examining the results of two breached carding forum databases. In addition to utilizing leaked content for such purposes, this research also intends to apply two forms of clustering to further refine the results.
The first is a text-based clustering approach, termed topic modeling, to group users together based on the types of public and PMs they send to each other. Following the extraction of these categorical variables describing each user, the variables will be combined with other structured data, such as user location, age, activity, status, and others, into a second clustering model, termed hierarchical agglomerative clustering, to further refine the users into granular categories. Following clustering, qualitative analyses will be performed.
Existing research has analyzed carding forum measures such as those belonging to the social network analysis domain (number of inbound and outbound connections, network centrality, etc.), explicit status markers (vouched status, verified seller status, etc.), implicit status markers (number of posts and received PMs), and textual cluster categories via topic modeling (Décary-Hétu & Leppänen, 2016; Garg et al., 2015; Motoyama et al., 2011; Pastrana et al., 2018). However, to the best of this researcher’s knowledge, topic modeling categories have not been used as a method of feature engineering to subsequently include in a final, structured, clustering technique. Instead, topic model categories have been the end goal unto themselves. Also, this research seeks to include additional novel measures not previously clustered on, pertaining to the use of Internet protocol address (IPA) information such as the number of duplicate IPAs and the continent associated with an IPA. Additionally, IPAs are also coded for whether they appear to be behind an anonymizing source such as an open proxy, VPN, or Tor exit node.
Method
User post and activity data from two carding forum websites were acquired for this research. The data were extracted from two compromised and leaked webserver databases for each of the two carding forums. Tables from each database were pulled and synthesized into a single data set comprised of user post and message history information, user activity data, and fillable form fields pertaining to each user. Two phases of clustering the user base into discrete categories were conducted. The first phase clustered users based on the textual content of their forum posts and PMs. The second further divided users into additional subcategories based on the combined information of both comment cluster labels (latent variables) and the other existing variables (observed variables) pertaining to each user present in the database.
Data and Sample
The data were sourced from two leaked carding forum user databases. Both databases were downloaded from Raidforums, an Internet forum dedicated to sharing and hosting breached and leaked databases. This sample was derived from a post made on November 8, 2017, making two leaked carding forum databases available. The post was accessed from https://raidforums.com/Thread-Carding-Forum-Databases-LEAK on February 14, 2018, and a copy of both databases was subsequently downloaded for this research.
The uploader of the files claimed in the post that the two websites were run by the same administrator and that the two databases were copied in August 2016. The acquired files were two Gzip-compressed MySQL database dumps. The tables from each database were extracted and saved as comma-separated value files for further data cleaning and building.
The two databases belonged to two websites, elitecarders.name and fraud.ws. The extant research relying on leaked carding forum databases tends to reuse some of the same forums that have previously been studied (Garg et al., 2015; Motoyama et al., 2011; Overdorf et al., 2018; Yip et al., 2012). To this researcher’s knowledge, these two databases have never before been used in any academic or nonacademic studies. We do not claim that these two databases are representative of all carding forums, but they provide valuable insight into a yet untapped behind-the-scenes look into these two spaces.
Beginning and end dates for user registrations and user activity on Elitecarders ranged from December 19, 2010, to August 9, 2016. For fraud, the range was from December 18, 2014, to August 11, 2016. Prior to processing and sample selection, Elitecarders had 167,713 individual user registrations, and fraud had 5,386. After eliminating users with zero forum posts or PMs, Elitecarders retained 10,375 users, and fraud retained 575. The loss in sample size indicates most users are “lurkers,” possibly reading posts and downloading shared content but never contributing a post or PM of their own. After further listwise deletion of missing fields, the final sample sizes were 10,255 and 459 for Elitecarders and fraud, respectively.
Measures
Twenty measures were built from the available database tables. The measures were intended to be the dimensions on which to cluster users into distinct categories. The measures consisted of 13 observed measures constructed from user activity and personal data associated with their account and seven latent variables built using topic modeling cluster categories using user comment histories.
Age at join
Registration involves completing a profile that contains a field for the user’s date of birth. There is nothing to stop users from lying about this information, but the field still captures user behavior. Age at join was calculated as the difference between the provided date of birth and the date of registration in years.
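The calculation can be illustrated with a minimal sketch. The function name and the choice to count completed whole years are assumptions made for illustration; the text above states only that the difference was taken in years.

```python
from datetime import date

def age_at_join(date_of_birth: date, registered_on: date) -> int:
    """Whole years between the self-reported birth date and the registration date."""
    years = registered_on.year - date_of_birth.year
    # Subtract one year if the birthday had not yet occurred in the registration year.
    if (registered_on.month, registered_on.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

# Illustrative dates only, not drawn from the data set.
print(age_at_join(date(1990, 6, 15), date(2015, 3, 1)))
```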
Continent
Each user had an assigned IPA. A database of lookup codes to geolocate each IPA was acquired from https://pygeoip.readthedocs.io/en/v0.3.2/index.html# on February 20, 2018. The database was used to assign a nation to each user based on their respective IPA. The country assigned to each user would reflect either the user’s location, the location of the user’s Internet service provider (ISP), or the location of the user’s proxy server or VPN service used to obscure their home location.
Six continents were then coded as the global location of each nation using a lookup file acquired from the following address: https://old.datahub.io/dataset/countries-continents/resource/aa08c34c-57e8-4e15-bd36-969bee26aba5. The continent lookup table was downloaded on October 1, 2015. The continents were dummy coded into the following categories: North America, South America, Europe, Asia, Africa, and Oceania.
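The dummy coding step might look like the following sketch. The lookup dictionary is a hypothetical six-row excerpt standing in for the full country-to-continent file, and the handling of unmatched country codes (all indicators set to zero) is an assumption rather than a documented decision of this study.

```python
CONTINENTS = ["North America", "South America", "Europe", "Asia", "Africa", "Oceania"]

# Hypothetical excerpt of a country-to-continent lookup (ISO 3166 alpha-2 codes).
COUNTRY_TO_CONTINENT = {
    "US": "North America",
    "BR": "South America",
    "DE": "Europe",
    "IN": "Asia",
    "NG": "Africa",
    "AU": "Oceania",
}

def continent_dummies(country_code):
    """Return one 0/1 indicator per continent for a geolocated country code."""
    continent = COUNTRY_TO_CONTINENT.get(country_code)
    # An unmatched code yields all zeros (assumption made for this sketch).
    return {name: int(name == continent) for name in CONTINENTS}

print(continent_dummies("DE"))
```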
Vouched status
Some users had a designated title attached to their profile. Titles such as “Admin,” “Donator,” “Verified,” “VIP,” and “Vouched” were considered explicit status symbols indicating elevated rank within the community. A binary variable representing the presence of any of these titles was used to indicate vouched status.
Banned
Additionally, some users had titles such as “ripper” or “banned.” If these titles were present, banned was coded “1;” otherwise, it was coded “0.”
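Both title-based indicators can be sketched in a few lines. Matching on the exact, case-insensitive title is an assumption; the forum software may have stored titles in other forms, and the function name is hypothetical.

```python
VOUCHED_TITLES = {"admin", "donator", "verified", "vip", "vouched"}
BANNED_TITLES = {"ripper", "banned"}

def status_flags(title):
    """Return (vouched, banned) as 0/1 indicators for a profile title."""
    # Treat a missing title as an empty string so that both flags are 0.
    normalized = (title or "").strip().lower()
    return int(normalized in VOUCHED_TITLES), int(normalized in BANNED_TITLES)

print(status_flags("VIP"), status_flags("Ripper"), status_flags(None))
```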
Friend receipts
In addition to explicit titles, implicit indicators of status were also used. Friend receipts were the count of friend invites the user had received at the time of the database dump. Accepting a friend adds them to the user’s contacts on the forum.
Friend invites
Additionally, the outgoing friend invites a user sent to other users were captured as well. Friend invites represent the count of such invites sent.
Received PMs
PMs are direct electronic communications that users can send to one another and that are not publicly viewable on the forums. Received PMs is the count of PMs received by the user.
Sent PMs
The number of outgoing PMs was also accounted for. Sent PMs represent the count of such messages.
Profile visits
On each forum, users have a profile page that displays that user’s stats and other personal information. Visitors typically view a user’s profile by clicking on their screen name attached to each forum post made by the user. The web forums tracked the number of page views each user’s profile received. Profile visits represent the total number of profile views accumulated by the given user.
Forum posts
The number of forum posts was counted for each user. Forum posts can take the form of a newly created post that begins a separate thread pertaining to a topic that users may participate in but also responses to such original posts. Forum posts are publicly available online unlike PMs.
Active duration
Each forum post and PM has a time stamp associated with it. The number of days between the user’s initial forum registration and the most recent forum post or PM sent was calculated.
Duplicate accounts
Users can register an unlimited number of profiles. Duplicate accounts are the count of duplicate IPAs associated with the given user account. Specifically, it is the count of profiles across each website which have that exact same IPA. Users can change their IPA when registering a new account, so the measure is not without error but rather approximate.
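A minimal sketch of this count follows. The account identifiers are hypothetical, the IP addresses come from the reserved documentation range, and whether the count includes the account itself is an assumption (here it does).

```python
from collections import Counter

def duplicate_account_counts(account_ips):
    """Map each account id to the number of profiles sharing its IP address.

    The count includes the account itself (an assumption); subtract one to
    count only the other profiles registered from the same address.
    """
    profiles_per_ip = Counter(account_ips.values())
    return {account: profiles_per_ip[ip] for account, ip in account_ips.items()}

# Hypothetical accounts pooled across both sites.
accounts = {"user_a": "192.0.2.10", "user_b": "192.0.2.10", "user_c": "192.0.2.77"}
print(duplicate_account_counts(accounts))
```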
Anonymous IPA
An attempt was made to account for the use of anonymous IPAs. The MaxMind service (https://www.maxmind.com) was utilized, which offers a feature called GeoIP2 lookups to identify whether a user’s IPA is behind a VPN, open proxy, or Tor exit node. All IPAs in the data set were checked via the MaxMind service on March 19, 2020, which classified whether an IPA is suspected to be anonymous, operationalized as a binary indicator.
Topic Model Variables
In addition to the 13 observed measures discussed above, seven additional latent variables were created from the 369,320 user forum post and PM comments. The comment histories for each user were concatenated into one single string per user for analysis. Latent Dirichlet allocation was used to cluster user comments into latent categories. Latent Dirichlet allocation in this context is a type of topic modeling, and topic modeling consists of clustering documents together which contain narrative textual content (see Kigerl, 2018, for further reading on this method). Documents are grouped together based on the shared key words present among them. A soft clustering approach was used, meaning users could be assigned to more than one topic.
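To make the idea of soft topic assignment concrete, the sketch below fits latent Dirichlet allocation with a collapsed Gibbs sampler written in plain Python. It is an illustration under stated assumptions, not the software used in this research: the estimator, the hyperparameters alpha and beta, the tokenized toy documents, and the soft_assign threshold are all hypothetical choices.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        row = []
        for word in doc:
            k = rng.randrange(num_topics)
            row.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(row)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w, k = word_id[word], assignments[d][i]
                # Remove the token, then resample its topic given all other tokens.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed topic proportions for each document (each row sums to 1).
    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]

def soft_assign(proportions, threshold=0.25):
    """Indices of every topic whose proportion meets a (hypothetical) threshold."""
    return [t for t, p in enumerate(proportions) if p >= threshold]

# Each "document" is one user's concatenated, tokenized comment history (toy data).
docs = [["thanks", "card", "cvv"], ["proxy", "account", "proxy"], ["thanks", "card"]]
for proportions in lda_gibbs(docs, num_topics=2, iterations=50):
    print(soft_assign(proportions), proportions)
```

Because each user receives a proportion for every topic, a user can clear the threshold on more than one topic, which is what distinguishes soft clustering from a single hard label.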
The form of topic modeling utilized, however, does not select the number of k topics on its own. Instead, fit testing must be performed and compared for each successive candidate k topic size. Two through 35 topic sizes were attempted, and four measures of fit were assessed against each individual model. Each method is an internal cluster model fit metric, assessing the similarity of documents assigned to the same topics and the separateness and distance of each topic from all the alternate topics (Arun et al., 2010; Cao et al., 2009; Deveaud et al., 2014; Griffiths & Steyvers, 2004). Two maximization fit metrics were used: Griffiths and Steyvers (2004) and Deveaud, SanJuan, and Bellot (2014). Maximization means that a higher numerical fit score implies a better fit. Two minimization fit metrics were also included: Cao, Xia, Li, Zhang, and Tang (2009) and Arun, Suresh, Madhavan, and Murthy (2010). For these metrics, lower scores indicate a better fit to the data. Based on a visual plot of each metric, the elbow method was employed (the cusp in a trend line where momentum slows) to determine the appropriate k size. A k size of seven was selected based on this method. Results can be found in the Appendix, in Table A1. Additionally, Table 1 contains the seven topics used for this research and their associated list of top 10 key words per topic.
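The elbow in this research was identified by visual inspection of the plots. As a rough numerical stand-in, the sketch below picks the k at which a fit curve bends most sharply, using the discrete second difference. The function name and the scores are invented for illustration and are not results from this study.

```python
def elbow_k(ks, scores, higher_is_better=False):
    """Pick the k where improvement slows the most (largest discrete curvature)."""
    # Flip maximization metrics so that every series reads as "lower is better".
    values = [-s for s in scores] if higher_is_better else list(scores)
    best_k, best_curvature = None, float("-inf")
    for i in range(1, len(values) - 1):
        # Second difference: large and positive where a steep drop flattens out.
        curvature = values[i - 1] - 2 * values[i] + values[i + 1]
        if curvature > best_curvature:
            best_k, best_curvature = ks[i], curvature
    return best_k

# Made-up minimization scores for candidate topic counts 2 through 8.
ks = [2, 3, 4, 5, 6, 7, 8]
print(elbow_k(ks, [100, 80, 60, 40, 35, 33, 32]))
```

A heuristic of this kind is sensitive to noise in the metric, which is one reason visual inspection across several metrics remains a reasonable choice.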
Table 1. Carding Forum Member Comment Topics.
Topic 1: Satisfied customers
Satisfied customers regularly purchase goods and services from the vendors available. Most of their posts exist solely to say “thanks,” and others ask questions about drops (cash-out service providers who willingly receive illegally purchased goods to forward to the buyer for a fee) or about credit cards. There are relatively few of these users, but they account for most of the posts and probably make up a majority of the customer base; their numbers should suffice because there are very few vendors in these communities.
Topic 2: Free content consumers
The carding forums have free content sections where distributors can provide free samples of stolen credit cards, PayPal accounts, tutorials on hacking or cybertheft, or free malware apps and source code. These users frequently “vouch” and “rep” the providers of the free content, and the providers use this to build a reputation on the site. Most users are of this category, likely because of the lower risks involved.
Topic 3: Generalist market participants
These users dabble in several topics, buy credential goods (like satisfied customers), consume free content, but also sometimes provide free content of their own. Many of their posts are also just casual conversations, not necessarily exclusively pertaining to cybercrime.
Topic 4: Credential goods vendors
Credential goods vendors sell stolen credit card information. These vendors mostly rely on PMs, with few forum posts. Among the top key words identified are “247” and “100,” meaning 24-7 support and 100% verified. The 24-7 support indicates sellers are available around the clock to handle questions and concerns and to facilitate transactions. There is a lot of pressure to maintain their reputation, so good customer service is the key; 100% verified indicates each card is checked and working at the time of sale, as there is a risk some are canceled by the victim or bank before funds can be misappropriated. There are many banned users in this category as well, meaning some are not legitimate vendors, but are rippers instead, taking money from customers but never delivering. This means posts made by legitimate vendors and rippers are difficult to distinguish, hence why there is so much pressure to maintain a good reputation.
Topic 5: Compromised accounts dealers
These vendors sell proxy servers and many different types of hacked online accounts. The compromised accounts include paid porn sites, cyberlockers for downloading pirated content, and Facebook and Twitter accounts, probably for use in spamming.
Topic 6: Free content distributors
Free content distributors provide free content to the free content consumers. Among their posts and offerings are free credit cards, free proxies, remote desktop protocol logins (compromised remote desktop credentials to access a hacked server), botnets, and tutorials. Many of the topic key words for these vendors pertain to location (California, USA), indicating the location where their free proxies reside.
Topic 7: Transient users
Transient users are participants for whom the topic modeling approach failed to designate a label. The most common reason is that these users did not accrue enough comments to justify a category; they tend to register, make a single comment, and then become inactive on the site.
Analytic Plan
The analytic method selected to examine the data was agglomerative hierarchical clustering. Hierarchical clustering, as the name suggests, builds a hierarchy of clustering categories into a pyramid-shaped tree-like structure, termed a dendrogram. There are therefore different hierarchical levels among the cluster categories assigned, meaning every category has a subcategory within it until the bottom of the hierarchy is reached, with every single user being within its own category. The peak of the dendrogram includes the entire sample in one root node category.
In the case of agglomerative hierarchical clustering, which is the specification used in this research, the algorithm begins with each user in their own category, and categories are then merged into higher order categories successively during each iteration, as the dendrogram is built from the base upward. This is opposed to divisive hierarchical clustering, which begins with every user assigned to one category (a top-down approach) and subdivides each category into a lower level during each iteration.
For the agglomerative approach, different linking methods can be utilized. The methods require a distance metric and a decision metric. This research used Euclidean distance as the distance metric, which measures how close two users are in the multidimensional space of their joint attributes, capturing how similar the values of their attributes are. The decision metric selected for choosing which cluster categories to join was the Ward method (Murtagh & Legendre, 2011). The Ward method computes the sum of squared distances within each candidate merged cluster and joins the pair of clusters whose merger yields the smallest increase in within-cluster variance. The sum of squared distances will be low when users in a cluster are tightly concentrated together in one group based on the distance metric.
After constructing the hierarchical cluster into a dendrogram, the optimum number of k cluster categories can be chosen. The number of potential cluster categories can be any number not exceeding the sample size. Size k can be selected by deciding at what height to cut the dendrogram, as there are fewer and fewer supercategories as we traverse higher on the hierarchy.
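The bottom-up procedure and the cut can be sketched together. The naive implementation below merges, at each step, the pair of clusters with the smallest Ward cost (the increase in within-cluster sum of squared Euclidean distances) and stops once k clusters remain, which corresponds to cutting the dendrogram at the height that leaves k branches. It is a teaching sketch with a hypothetical function name and toy points, not the software used in this research, and it is far too slow for a sample of this size.

```python
def ward_clusters(points, k):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters.

    Returns a list of clusters, each a sorted list of row indices into `points`.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(points))]   # every user starts alone
    centroids = [[float(x) for x in p] for p in points]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if clusters a and b merge.
        size_a, size_b = len(clusters[a]), len(clusters[b])
        sq_dist = sum((x - y) ** 2 for x, y in zip(centroids[a], centroids[b]))
        return size_a * size_b / (size_a + size_b) * sq_dist

    while len(clusters) > k:
        # Choose the cheapest pair to merge (a < b always holds).
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: merge_cost(*pair),
        )
        size_a, size_b = len(clusters[a]), len(clusters[b])
        # The merged centroid is the size-weighted mean of the two centroids.
        centroids[a] = [
            (size_a * x + size_b * y) / (size_a + size_b)
            for x, y in zip(centroids[a], centroids[b])
        ]
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b], centroids[b]
    return clusters

# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(ward_clusters(points, k=2))
```

Because Euclidean distance is sensitive to the scale of each variable, how the inputs are scaled before clustering affects which merges are cheapest; the sketch leaves that choice to the caller.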
To select the desired k size, different sizes of k were iterated through, starting at two categories and ascending from there. Each cluster size was tested using three fit metrics: the Dunn index, the Davies–Bouldin index, and the Silhouette plot (Davies & Bouldin, 1979; Dunn, 1973; Rousseeuw, 1987). Each method attempts to quantify cluster fit by measuring both the density of cases within a cluster category and the distance of each category from the rest. To select the optimum k size, plots of each metric over the series of k sizes were inspected, and the elbow method was utilized to arrive at the desired selection, namely, the point at which increasing the k cluster size yields diminishing returns in fit, visible as a noticeable deceleration in successive improvement with additional k sizes.
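Two of these metrics are simple enough to sketch directly. The Dunn index has several variants; the version below assumes the smallest distance between points in different clusters divided by the largest within-cluster diameter, so higher is better. The Davies–Bouldin version uses mean distance to the centroid as cluster scatter, so lower is better. Function names and toy data are illustrative, and the silhouette is omitted for brevity.

```python
from math import dist

def dunn_index(points, clusters):
    """Smallest between-cluster distance over largest within-cluster diameter."""
    separation = min(
        dist(points[i], points[j])
        for a in range(len(clusters)) for b in range(a + 1, len(clusters))
        for i in clusters[a] for j in clusters[b]
    )
    diameter = max(
        dist(points[i], points[j])
        for members in clusters for i in members for j in members
    )
    return separation / diameter  # undefined if every cluster is a single point

def davies_bouldin_index(points, clusters):
    """Mean over clusters of the worst (scatter_a + scatter_b) / centroid distance."""
    dims = len(points[0])
    centroids = [
        [sum(points[i][d] for i in members) / len(members) for d in range(dims)]
        for members in clusters
    ]
    scatter = [
        sum(dist(points[i], centroids[c]) for i in members) / len(members)
        for c, members in enumerate(clusters)
    ]
    worst = [
        max(
            (scatter[a] + scatter[b]) / dist(centroids[a], centroids[b])
            for b in range(len(clusters)) if b != a
        )
        for a in range(len(clusters))
    ]
    return sum(worst) / len(worst)

# Toy data: a sensible partition versus a deliberately poor one.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good, poor = [[0, 1, 2], [3, 4, 5]], [[0, 3], [1, 2, 4, 5]]
print(dunn_index(points, good), dunn_index(points, poor))
print(davies_bouldin_index(points, good), davies_bouldin_index(points, poor))
```

Computing each index for every candidate k and plotting the resulting series gives the curves on which the elbow is judged.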
Results
The combined samples of Elitecarders and fraud yielded 10,714 accounts between the two sites. Table 2 presents descriptive statistics for both the observed and latent variables describing the sample. Three columns of descriptives are presented: one for the combined sample of both sites and one for each of the two sites. The majority of the accounts in the sample were registered at Elitecarders, which accounts for 95.7% of the total sample.
Table 2. Sample Descriptive Statistics.
Note. IPA = Internet protocol address; PMs = private messages.
The average reported age across both sites is 28.81 years, with slightly older reported ages at fraud (33.39) relative to Elitecarders (27.94). Under continent, a plurality of IPAs were geolocated to North America, with Europe as a close second. Both websites were composed of a majority of English-speaking commenters, hence the predominance of North America and Europe.
Regarding explicit markers of status, fraud had many users who were vouched (33.12%), relative to Elitecarders (0.06%). Fraud also had slightly more banned users (8.93% vs. 6.64%). For the implicit markers of status, friend receipts and invites are low across the board (<1). Also, the number of sent PMs substantially exceeds the number of received PMs. This is because many PMs are sent out by the administrator of each site (who is reported to be the same individual), broadcasting regular messages about updates and news on the site in general. Many of these messages are therefore sent to users who never sent a PM or made a post themselves and were hence not selected for inclusion in the sample. Otherwise, sent and received PMs should be similar. Factors such as these warrant a clustering approach to further refine users based on these attributes.
There are substantially more duplicate accounts per user at Elitecarders (an average of 30.18) than at fraud (1.36). Interestingly, there are more anonymous IPAs at fraud, with 2.83% of the users behind a proxy, VPN, or Tor exit node compared to only 0.66% at Elitecarders. Thus, more users reregister accounts at Elitecarders, but fewer seek to obscure their IPAs from scrutiny. This may be because Elitecarders is a larger website and hence has a larger base of less serious users who are not as criminally involved and therefore do not have to take extra precautions against being caught.
According to the latent topic categories, most users from both sites are predominantly free content consumers, followed by transient users who scarcely participate. The vendors and dealers make up the minority but are outnumbered by customers and generalists by enough that profit should be possible. The generalists consist of sellers as well as buyers.
Hierarchical Clustering
Hierarchical agglomerative clustering was applied to the user features of the sample of 10,714 registered accounts across the two sites. Once the cluster dendrogram was constructed, fit testing was conducted for each cluster size from 2 through 35 to isolate the optimum k cluster size. Four fit metrics were plotted for each iteration: the Calinski–Harabasz pseudo F-statistic, the Dunn index, the Davies–Bouldin index, and the Hubert and Levin C-index (Caliński & Harabasz, 1974; Davies & Bouldin, 1979; Dunn, 1973; Hubert & Levin, 1976). The Calinski–Harabasz pseudo F-statistic and the Dunn index are maximization metrics, with higher scores representing better model fit. The Davies–Bouldin index and the C-index are minimization metrics, where lower scores indicate better fit.
Based on the 34 cluster size selections, a k size of 16 was selected as the point at which the fit metrics lose their momentum and additional increases in k size yield diminishing returns. Users were thus assigned a category code indicating membership in one of the 16 categories (category assignment was mutually exclusive, so each user could belong to only one category). The results of the fit testing can be found in the Appendix, Table A2. Separate descriptives were calculated for each cluster centroid, and an examination of each category was conducted.
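As a small worked illustration of diminishing returns, the Python snippet below computes the step-to-step change in one of the four metrics, the Davies–Bouldin index, using the values reported in Table A2 for k = 12 through 18. This is not the author's selection procedure, which relied on visual inspection of all four metrics together, and the other metrics do not all behave the same way around k = 16.

```python
# Davies–Bouldin values copied from Table A2 for k = 12..18 (lower is better).
davies_bouldin = {
    12: 1.214, 13: 1.155, 14: 1.1124, 15: 1.0956,
    16: 1.0823, 17: 1.4574, 18: 1.2125,
}

ks = sorted(davies_bouldin)

# Change in the index when moving from k - 1 to k; negative means improvement.
change = {k: round(davies_bouldin[k] - davies_bouldin[k - 1], 4) for k in ks[1:]}

for k in ks[1:]:
    print(k, change[k])
```

On this metric the improvements shrink with each step up to k = 16, and the index worsens when moving to k = 17.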
Qualitative analyses were then required to label each category according to its contents based on the descriptives as well as the comment history of each user. The 16 classes were further qualitatively grouped into four different domains: general consumers, location-based consumers, producers, and other. A qualitative examination of each follows.
General Consumers
Table 3 depicts the three different archetypes of general carding forum consumers. Significance is reported for each cell, representing t tests of the difference on each variable between users in that cluster category and all users in the other cluster categories. The types of consumers in Table 3 consist of satisfied customers, vouched consumers, and banned consumers. Satisfied customers are labeled as such because 100% of users in this class are satisfied customers according to the latent topic they were assigned. These users likely account for the bulk of the purchases made from the vendors and dealers of the site.
Table 3. Hierarchical Cluster Group Descriptive Statistics: General Consumers.
Note. n = 1,016. IPA = Internet protocol address; PMs = private messages.
*p < .05. **p < .01. ***p < .001.
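The cell-level significance markers can be sketched in Python as a comparison of one cluster against all other users. The data below are randomly generated, the `stars` helper is a hypothetical name, and the use of Welch's unequal-variance t test is an assumption, since the article does not state which variant was used.

```python
import numpy as np
from scipy import stats


def stars(p):
    """Map a p value to the markers used in the table notes."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


rng = np.random.default_rng(42)
# Hypothetical forum-post counts: 50 users in one cluster, 500 in all others.
in_cluster = rng.normal(loc=12.0, scale=3.0, size=50)
all_others = rng.normal(loc=5.0, scale=3.0, size=500)

# Welch's t test (unequal variances) is an assumption made for this sketch.
t_stat, p_value = stats.ttest_ind(in_cluster, all_others, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g} {stars(p_value)}")
```

Repeating this for every variable and every cluster yields one marker per table cell; with many cells, some significant results are expected by chance alone.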
Vouched consumers are 100% vouched, and the majority are from fraud rather than Elitecarders. There appears to be a culture at fraud that requires new users to introduce themselves first and become vouched by another user before participating in the community. Some example introductions from users consist of the following:
"I am Sollaris from Bits. Bits was taken down by a group of people because the owner BIG BOSS scammed too much poor people. I am glad to find this new forum where we all can steal from rich people like the BANK and THE GOVERNMENT."
"what up all could use some work i am so broke it aint funny. i am a looon and i need to get paid"
"Hello. I would like to come to this community. I’m a supplier of tools for spamming and scam page. I’m in some carding forums. vouche me thanks"
Banned consumers are 100% banned. Their age, location, and distribution across the two sites are average. They are mostly free content consumers. After examining their user comment histories, some were banned for rule violations or for asking for free content via PM rather than asking publicly on the forum. Many were banned by the administrator after the admin defrauded them. Many users were banned shortly after making a deal or exchange on ICQ with the admin of Elitecarders. The administrator was assigned to a separate cluster category entitled “Administrator” in the producers domain in Table 4. The administrator’s behavior is discussed further in the Producers section.
Table 4. Hierarchical Cluster Group Descriptive Statistics: Producers.
Note. n = 317. IPA = Internet protocol address; PMs = private messages.
*p < .05. **p < .01. ***p < .001.
Location-Based Consumers
Additional classes of consumers were assigned cluster categories based on the continent associated with their IPA. There are six, and they include North American consumers, South American consumers, European consumers, Asian consumers, African consumers, and Oceania consumers. They are depicted in Table 5. These users are mostly composed of free content consumers and transient users.
Table 5. Hierarchical Cluster Group Descriptive Statistics: Location-Based Consumers.
Note. n = 6,044. IPA = Internet protocol address; PMs = private messages.
*p < .05. **p < .01. ***p < .001.
North American consumers make up the largest of the 16 classes of users on the forums, with 26.28% of accounts being of this type. Ninety-five percent are free content consumers, with very few being transient users relative to the other location-based consumers. They also make an above-average number of forum posts.
South American consumers are almost evenly split between free content consumers and transient users, whereas European consumers are almost 100% free content consumers. Asian consumers tend to report being younger than the average user (26 years at the time of registration). Ten percent of African consumers are generalists, indicating they do more than consume free content. They also have a low number of duplicate accounts. Oceania consumers also have a low number of duplicate accounts, averaging 1.11 per user. This means most have only one account on the site.
Producers
Table 4 contains the four classes of carding forum producers discovered. The classes include the administrator (with a sample size of one), free content distributors, credential goods vendors, and account dealers. The administrator category is the most unusual, as only one account, on the Elitecarders forum, was assigned to this class. This user’s IPA is unknown, but based on the comment histories, the admin appears to have used multiple duplicate accounts to defraud users of the forum. The admin was in the habit of defrauding users and then banning them. The admin was, therefore, a ripper, which is an unusual finding given that the administrators of most carding forums seek to crack down on rippers as much as possible. The admin also has extremely high engagement numbers, sending out 291,246 PMs and making 2,815 forum posts.
How the admin went about scamming users can be gleaned from some of the messages users posted after being scammed and registering a new account:
"Beware admin is a ripper! ripped 100eu from me when I bought one of his stuff from him and he put me on ignore list…every post from ppl in here saying they received something is fake! Either they are lick ass with no clue of what is going on or one of his mates helping him out."
"Hey man ADMIN IS A RIPPER! BEWARE AND SPREAD THIS! cuz that man ripped and ban me from the forum after i sent him 100"
"This son of bitch admin create fake accounts & do like he send $$$ but he RIPS ALL USERS HERE…90% are fake account that he creates…. U have ICQ?"
Table 6. Hierarchical Cluster Group Descriptive Statistics: Other Users.
Note. n = 3,337. IPA = Internet protocol address; PMs = private messages.
*p < .05. **p < .01. ***p < .001.
Sixty-five accounts were classified as free content distributors. These users are slightly more likely to belong to fraud rather than Elitecarders. They are also more likely to be associated with either North America or Europe. They make an average of 15.34 forum posts, but with a standard deviation of 78.45, meaning some of them make quite a few more posts. One hundred percent of these users belong to the free content distributors topic, but among these, 23% are also free content consumers, 14% are satisfied customers, and 22% are also generalists.
Fifty-eight users are credential goods vendors. About a third are associated with North America, another third with Europe, and about 17% with Asia. Twenty-nine percent of these users were banned. From the comment histories, the banned users were deemed to be rippers. Because they were nonetheless grouped with legitimate vendors, the language used by legitimate vendors and rippers appears to be indistinguishable.
Finally, 193 users were account dealers, making up the largest category of producers on the forums. Few of these users were banned, and their geographic distribution reflects the average of the whole site, with about half associated with North America and the other half with Europe. About a third of these users are also free content consumers.
Other Users
Three of the 16 classes of users did not completely belong to the three domains of general consumers, location-based consumers, or producers. The class centroids are represented in Table 6. The users include generalists, high duplicate accounts users, and transient users. All three of these user categories report being slightly younger than the average.
Generalists have a large number of duplicate accounts, with an average of 43.88 associated with other profiles. Of the 635 generalists, 507 have unique IPAs, so the average is skewed upward by a minority of users. A third of the generalists are also free content consumers. They send almost no PMs.
There are 917 accounts classified as high duplicate accounts users, with an average of 249.61 duplicate IPAs. Among the 917 accounts in this class, just five of the IPAs are unique, suggesting the same individual or set of individuals registered hundreds of accounts. Engagement with the site per registration is low: these accounts send almost no PMs and make a little over one post. Their active duration each time is the lowest of any class (5.68 days before becoming inactive). Most of the single posts are requests for free samples or content, after which the account becomes inactive and a new one is registered to repeat the same. None of them are banned. It may be a strategy to be as anonymous as possible.
It could also be the doing of the admin of Elitecarders, as these are exclusively Elitecarders accounts, to create a fake community that attracts new people and provides a sense of authenticity. However, it does not appear these accounts are leaving positive reviews for or vouching for any vendors, which would be a more viable strategy to scam users. So, it could be a different individual than the admin, and the strategy behind registering so many duplicate accounts is still an open question. It should also be noted that none of these IPAs were detected as being anonymous, which probably accounts for why Elitecarders has so many duplicate accounts but so few anonymous ones.
Lastly, 1,785 accounts are classed as transient, meaning they make about one post and become inactive. Unlike the high duplicate accounts class, these users have an active duration of 48.59 days, indicating they take their time after registration before making a single post. These are likely cautious “lurkers,” users who register with the site and spend most of their time reading posts and downloading free content without making any type of post or comment themselves. Eventually, after lurking on the site for a sufficient duration of time, they may make one or two posts before exiting.
Discussion
This study sought to examine the contents of two hacked and leaked carding forum server databases containing user data for 10,714 separate user accounts. Two phases of clustering were applied to group user accounts into meaningful categories. The first phase classified users according to 369,320 forum posts and PMs, resulting in seven latent topics. The second phase further grouped users based on their topic categories as well as observed variables on user forum activity and status, yielding 16 cluster categories consisting of four domains of general consumers, location-based consumers, producers, and other users.
Many of the findings uncovered from the examination were consistent with prior research on carding forums and other cybercrime communities. Consumers make up the majority of accounts that are active on the sites, with the bulk of the activity composed of free content consumers seeking free samples of credential goods, free malware source codes, and tutorials on hacking and identity theft, among others. Actual customers who spend money on the available nonfree content are scarce relative to these consumers, but there are still enough relative to the even fewer vendors and dealers for actual business to be conducted.
However, the preceding analyses uncovered some findings that, to this researcher’s knowledge, have not been adequately addressed in prior research. These discoveries would not have been possible without a full leaked server database. Among the general consumers, there were satisfied customers buying illegal goods and services, vouched consumers predominantly belonging to the fraud forum, and banned consumers. Without access to the private database files, the reason for many of these bans would not be visible.
Specifically, the PMs revealed that the administrator of Elitecarders was banning these users, but not for the reasons typical of other carding communities, where users are banned for rule violations or, in the case of rippers, for defrauding other users. Instead, the roles were reversed: the admin was a ripper, banning users after defrauding them as a precaution against the victims reporting such offenses on the public forums. The administrator was assigned to the producer domain, consisting of the sole administrator, credential goods vendors, account dealers, and free content distributors. Yet, unlike producers discussed in other cybercrime communities, the administrator was also a ripper.
In addition to this new evidence made possible by the examination of PMs, the location-based consumers’ domain was made possible exclusively by the leaked forum data. The result was an additional set of consumer categories based exclusively on the geographic region associated with their IPA. While both websites were composed of predominantly English-language posts, isolated groups of users were found to frequent the communities behind IPAs from across the globe. The proportion of users whose IPA location agrees with their own continent of origin is unknown, but the distribution of such IPAs is not random.
Finally, the “other” category included generalists, high duplicate accounts users, and transient users. The high duplicate account category involved 917 separate account registrations but only five unique IPAs, likely belonging to only one to a few individuals. About half of these users were transient users according to their topic model classified comment histories. However, transient users also belong to a single category of their own. The distinction is that full transient users “lurk” on the site for a little under 2 months before making one or two posts, whereas the duplicate accounts users quickly post before switching to a newly registered account, possibly as a precaution and an effort to remain anonymous.
Limitations
This research was not without limitations that should be mentioned. First, the two websites analyzed were not proportionate to one another based on the active registered accounts available, with the vast majority belonging to Elitecarders (10,255 accounts) compared to fraud (459 accounts). Thus, findings are more heavily skewed toward the activity at Elitecarders. However, both forums were reportedly hosted and run by the same administrator, which would likely result in the two sites being of similar composition and quality.
Also, it cannot be known how independent one registered account was from another, considering the ease with which users can register multiple accounts. An attempt was made to account for such dependency by recording the number of duplicate IPAs associated with each account. However, IPAs can be changed at any time. While analyses were conducted on 10,714 separate registered accounts, only 7,459 of the IPAs across all accounts were unique.
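The duplicate-IPA count can be illustrated with a short Python sketch. The account identifiers and addresses are invented (the addresses come from a range reserved for documentation), and counting an account's own registration in its total is an assumption made so that a value of 1 means no duplicates, in line with the reading of the 1.11 average above.

```python
from collections import Counter

# Hypothetical (account_id, registration_ip) pairs; not from the leaked data.
accounts = [
    ("u1", "203.0.113.5"),
    ("u2", "203.0.113.5"),
    ("u3", "203.0.113.5"),
    ("u4", "203.0.113.9"),
    ("u5", "203.0.113.77"),
]

# How many accounts were registered from each IP address.
ip_counts = Counter(ip for _, ip in accounts)

# Per account: number of accounts sharing its IP, including itself
# (assumption: a value of 1 therefore means no duplicates).
duplicates = {acct: ip_counts[ip] for acct, ip in accounts}

unique_ips = len(ip_counts)
print(duplicates)
print(f"{len(accounts)} accounts, {unique_ips} unique IPs")
```

A user who switches addresses between registrations would be counted as several independent accounts here, which is the limitation noted above.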
In the same vein, the continent assigned to each user should be treated with similar caution. The geographic information belonging to an IPA could represent where the user’s ISP is located but could equally be where the user’s chosen proxy or VPN service is located. Users who are diligent enough to use a proxy or VPN before visiting the site can choose any location on the globe they wish. Many of the proxy vendors discovered in this study and others (Kigerl, 2018) offer many options based on the desired location of the buyer. There is thus demand for proxies in particular geographic locations of the user’s choosing. Despite this, the region of such users still reflects their own choices, and those regions, at least according to the clustering algorithm employed, appear to be meaningful.
Future Research and Conclusion
A rich data set has been constructed containing a minimum of 42 variables for over 10,000 user accounts across two carding forum websites. This analysis was a twofold process of building the analytic data set and conducting a qualitative interpretation of the assigned cluster categories. Additional analyses are still possible with the resulting data.
There are still some unanswered questions in the sample, such as how many transactions were legitimate and how many were fraudulent and what proportion of that fraud was perpetrated by the administrator. In addition to inferential analyses of the current data, additional preprocessing steps could be possible to convert the data into a time series format. Time series or panel data modeling could be conducted to examine both legitimate and illegitimate transactions. Thus, the point in time at which a transaction can occur can be estimated and analyzed and further disaggregated into legitimate and illegitimate transactions.
Time series data itself can also be analyzed via cluster analysis such as by group-based trajectory modeling or even a version of K-means clustering. Users can be grouped according to the trajectory of their behavior after registration, and their pathway over time can be visually plotted for comparison to other user trajectories. The results could further lend insight into the nature of carding forums.
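A rough Python sketch of what K-means clustering of user trajectories could look like is given below. Everything in it is hypothetical: the two behavior patterns, the eight-week window, and the choice of two clusters are invented for illustration, and no such analysis was carried out in this study.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
weeks = np.arange(8)

# Hypothetical weekly post counts for two invented behavior patterns.
early_burst = np.clip(10 - 3 * weeks, 0, None)  # active at first, then silent
slow_ramp = 1.0 * weeks                         # quiet at first, then active

# 30 noisy users per pattern; each row is one user's 8-week trajectory.
trajectories = np.vstack(
    [early_burst + rng.normal(scale=0.5, size=8) for _ in range(30)]
    + [slow_ramp + rng.normal(scale=0.5, size=8) for _ in range(30)]
)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = km.fit_predict(trajectories)

# Each cluster center is the average trajectory of that group.
for g in range(2):
    print(g, np.round(km.cluster_centers_[g], 1))
```

Plotting the cluster centers over time would give the kind of visual comparison of trajectories described above.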
Footnotes
Data Availability
An anonymized version of the data set used for this research is available upon request at
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Software Information
The Python programming language was used to extract the MySQL database tables for analysis. Statistical Package for the Social Sciences (SPSS) in conjunction with the Python for SPSS software extension was used to build the data set from the database tables. The R programming language was used to conduct the clustering and other analyses. All program code and syntax files used in building and analyzing the data source are available upon request.
Appendix
Table A1. Topic Model Fit Testing and K Number Selection.
| Topics | Griffiths (2004), maximization | Deveaud (2014), maximization | Cao (2009), minimization | Arun (2010), minimization |
|---|---|---|---|---|
| 2 | −9,571,280 | 3.2746 | .0286 | 17.4441 |
| 3 | −9,183,735 | 3.2571 | .1565 | 14.7267 |
| 4 | −8,887,965 | 3.454 | .0383 | 12.9298 |
| 5 | −8,747,085 | 3.2898 | .0682 | 10.872 |
| 6 | −8,587,016 | 3.1834 | .0404 | 10.3616 |
| 7 | −8,511,897 | 3.0149 | .0418 | 10.3178 |
| 8 | −8,505,357 | 2.894 | .0758 | 9.6454 |
| 9 | −8,392,902 | 2.8574 | .0682 | 8.6293 |
| 10 | −8,339,492 | 2.7505 | .0605 | 9.0991 |
| 11 | −8,308,987 | 2.6671 | .0574 | 9.6441 |
| 12 | −8,312,357 | 2.548 | .0637 | 10.434 |
| 13 | −8,267,015 | 2.5115 | .0565 | 8.9812 |
| 14 | −8,221,371 | 2.5116 | .0515 | 7.806 |
| 15 | −8,227,289 | 2.3548 | .0525 | 8.3884 |
| 16 | −8,190,032 | 2.3925 | .0512 | 8.4743 |
| 17 | −8,175,707 | 2.3505 | .0542 | 7.2781 |
| 18 | −8,190,324 | 2.2194 | .0517 | 9.3835 |
| 19 | −8,165,510 | 2.2166 | .053 | 7.193 |
| 20 | −8,168,577 | 2.0998 | .0549 | 7.5081 |
| 21 | −8,164,319 | 2.0336 | .0525 | 7.4589 |
| 22 | −8,142,845 | 2.021 | .0396 | 8.4712 |
| 23 | −8,163,510 | 1.9562 | .048 | 8.5916 |
| 24 | −8,154,009 | 1.8269 | .0379 | 8.2921 |
| 25 | −8,132,550 | 1.871 | .0456 | 8.4324 |
| 26 | −8,144,832 | 1.8277 | .0431 | 7.682 |
| 27 | −8,144,297 | 1.7368 | .044 | 8.5522 |
| 28 | −8,140,575 | 1.7066 | .0444 | 8.605 |
| 29 | −8,146,880 | 1.6333 | .0394 | 11.3246 |
| 30 | −8,138,920 | 1.6351 | .0349 | 9.4806 |
| 31 | −8,149,161 | 1.5599 | .0378 | 10.6495 |
| 32 | −8,138,972 | 1.5925 | .0426 | 7.9659 |
| 33 | −8,133,405 | 1.5506 | .0428 | 9.3435 |
| 34 | −8,125,649 | 1.5132 | .0386 | 10.231 |
| 35 | −8,130,955 | 1.4986 | .0397 | 10.2361 |
Table A2. Hierarchical Cluster Fit Testing and K Number Selection.
| K Size | Calinski–Harabasz (maximization) | Dunn Index (maximization) | Davies–Bouldin (minimization) | Hubert–Levin (minimization) |
|---|---|---|---|---|
| 2 | 2,208.889 | 3.4116 | 0.0213 | 0 |
| 3 | 1,483.82 | 0.0135 | 2.7354 | .4106 |
| 4 | 1,250.549 | 0.0136 | 2.3331 | .3621 |
| 5 | 1,156.989 | 0.0136 | 2.233 | .3444 |
| 6 | 1,119.815 | 0.0136 | 2.0344 | .286 |
| 7 | 1,113.258 | 0.0136 | 1.8308 | .2745 |
| 8 | 1,129.783 | 0.0136 | 1.607 | .274 |
| 9 | 1,161.585 | 0.0136 | 1.5005 | .2515 |
| 10 | 1,209.357 | 0.0136 | 1.2796 | .2506 |
| 11 | 1,273.042 | 0.0121 | 1.2977 | .1966 |
| 12 | 1,349.369 | 0.0121 | 1.214 | .1659 |
| 13 | 1,438.783 | 0.0121 | 1.155 | .1581 |
| 14 | 1,536.792 | 0.0121 | 1.1124 | .1294 |
| 15 | 1,658.576 | 0.0121 | 1.0956 | .0977 |
| 16 | 1,779.729 | 0.0121 | 1.0823 | .0823 |
| 17 | 1,902.306 | 0.0075 | 1.4574 | .0318 |
| 18 | 2,036.731 | 0.0093 | 1.2125 | .0305 |
| 19 | 2,150.704 | 0.0093 | 0.9579 | .028 |
| 20 | 2,218.121 | 0.0093 | 0.9534 | .0263 |
| 21 | 2,241.383 | 0.0093 | 0.9444 | .0206 |
| 22 | 2,230.086 | 0.0093 | 0.9439 | .0192 |
| 23 | 2,222.622 | 0.0093 | 0.9607 | .0188 |
| 24 | 2,221.334 | 0.0093 | 0.9509 | .0181 |
| 25 | 2,210.014 | 0.0121 | 0.9507 | .0181 |
| 26 | 2,202.638 | 0.0121 | 0.943 | .0172 |
| 27 | 2,188.074 | 0.0121 | 0.9892 | .0156 |
| 28 | 2,170.66 | 0.0065 | 1.0135 | .0163 |
| 29 | 2,155.55 | 0.0065 | 0.9852 | .0158 |
| 30 | 2,138.93 | 0.0015 | 0.9996 | .0372 |
| 31 | 2,123.609 | 0.0015 | 1.0068 | .0366 |
| 32 | 2,111.247 | 0.0015 | 1.0407 | .034 |
| 33 | 2,098.914 | 0.0015 | 1.0533 | .0357 |
| 34 | 2,085.877 | 0.0015 | 1.0335 | .0348 |
| 35 | 2,072.978 | 0.0015 | 1.0485 | .0331 |
