Abstract
This article describes two class activities that introduce the concept of data mining and very basic data mining analyses. Assessment data suggest that students learned some of the conceptual basics of data mining, understood some of the ethical concerns related to the practice, and were able to perform correlations via the Statistical Package for the Social Sciences (SPSS, Version 20).
Data mining is a popular topic in statistics and has garnered attention beyond analytic and academic circles. Broadly defined, data mining refers to hypothesis-free analysis of archival data (typically “big data,” another statistical term enjoying its moment in the zeitgeist) with the hope of uncovering interesting patterns within the data. At its most basic, it involves running correlations on quantitative data. At its most advanced, it involves specialized software and data analysis techniques that are beyond the purview of a basic psychological statistics class. Data mining has been getting so much attention, in part, because of the massive amounts of data collection made possible by cheap storage and the ease with which data can be collected, such as when shopping (Wall, 2014), browsing the Internet (Brandeisky, 2015), or using our cell phones (Talbot, 2014).
Another reason for its popularity is that data mining holds promise for finding solutions to problems and uncovering hitherto undetected patterns of human behavior using applied data collected in naturalistic settings. For example, data mining is already being used to generate novel solutions to increase college student retention (Zhang, Oussena, Clark, & Kim, 2010) and uncover widespread cheating on standardized exams (Jacob & Levitt, 2003).
Further evidence of the interest in data mining comes in the form of New York Times best-selling books such as Freakonomics (Levitt & Dubner, 2005) and Big Data: A Revolution That Will Transform How We Live, Work, and Think (Mayer-Schönberger & Cukier, 2013). Media coverage of data mining has detailed how corporations collect data about consumers (Singletary, 2014) and how the government is using it to fight crime (Lichtblau, 2007). Data mining has also gained much attention because of fears about intrusive data collection, such as the collection of private health data (Pettypiece & Robertson, 2014), as well as what can happen when data mining does not result in accurate findings (Lazer, Kennedy, King, & Vespignani, 2014). It is such a widespread practice that the government, driven largely by privacy concerns, is now considering legislation to more closely regulate consumer data collection (Federal Trade Commission, 2014).
Data Mining: Teaching Technique
Certainly, the academy has paid attention to the rising interest in data mining. Several universities offer undergraduate courses and graduate degrees and certificates in data mining and closely related fields. As such, much of the pedagogical writing about this topic has been aimed at upper-level courses for statistics majors (Dickey, 2005; King & Satyanarayana, 2013; Satyanarayana, 2013), upper-level courses taught in statistics-related fields (Banks, Dong, Liu, & Mandvikar, 2004), or the training of statistics professionals. However, little has been written about methods to introduce the very basics of data mining to introductory statistics students in order to teach both applied statistics and the ethical dilemmas raised by such data collection.
As such, this article seeks to fill the gap by describing a very basic activity to introduce the topic to novice statistics students. In addition to introducing data mining and using correlations to perform very simple data mining, this article demonstrates a way to integrate ethics into a statistics course.
Data Mining: Teaching Ethics
The third goal of the American Psychological Association’s (APA) Guidelines for the Undergraduate Psychology Major (2013) emphasizes “Ethical and Social Responsibility in a Diverse World.” As such, the insertion of an ethical debate regarding the large-scale collection (and use) of data when informed consent may not be very well informed allows an instructor to integrate this goal of the APA into a statistics course. Certainly, ethics and statistics classes have been integrated previously. Suggestions have already been made to integrate ethics into statistics courses by creating examples using data sets steeped in social justice issues (Lesser, 2007). Other efforts have integrated philosophical teaching on ethics into statistics classes (Lesser & Nordenhaug, 2004). The present article extends these ideas by suggesting that while learning about this widely used statistical practice, students should also reflect on ethical issues related to data mining.
This article presents a learning activity that introduces the basics of data mining and provides students with (a) the opportunity to learn about an increasingly ubiquitous data analysis technique; (b) the practical, ethical implications of data mining; and (c) a novel way to review correlation (the simple statistic at the core of very basic data mining). Study 1 describes a more involved version of the class activity (a homework assignment consisting of a reading and electronic discussion board followed by a class activity), and Study 2 describes a shorter version in which students received a brief introduction to data mining via a radio news story and then completed the same class activity described in Study 1.
Study 1
Method
Participants
Students completed the activities as part of psychological statistics (an introductory course) at a small liberal arts university in the Northeastern United States during the fall 2013 semester. Participants were from two sections of the class taught by the same instructor (n = 49, although not all students completed all portions of the activity).
Procedure
This activity was used toward the end of the correlation/regression section of this course. Students completed a pretest of their knowledge and opinions about data mining. The test contained several yes/no questions (“In statistics class, you have been taught about hypothesis testing prior to running statistics. Does data mining require a hypothesis?” “Are you in favor of government legislation regulating data collection on the Internet?” “Do you think that you understand the basics of conducting and reporting (in APA formatting) a data mining analysis?”). It also asked a multiple-choice question regarding the main form of data analysis used in data mining (the correct answer was correlation) as well as one short-answer question (“In a few sentences, describe some of the ethical concerns regarding data mining”).
Next, students completed a discussion board assignment based on a reading about consumer data mining and privacy issues (Stein, 2011). This discussion board was one of eight completed as part of this course as homework assignments. Students responded to three questions (see Table 1) and responded to a peer’s post.
Study 1: Discussion Board Questions.
Approximately 1 week after the due date for the discussion board, students participated in a class-period-long (55 min) activity about data mining. The exercise started with a National Public Radio (NPR) interview conducted with one of the authors of the book Big Data: A Revolution That Will Transform How We Live, Work, and Think (Inskeep, 2013), in which the author shares several examples of data use by companies such as Google and Target. This was followed by a brief lecture about data mining. This lecture introduced data mining by contrasting it with the hypothesis-driven data analysis that is traditionally taught in an introductory statistics course. To introduce this topic, the author asked the students to list all of the information that their university had gathered about them (standardized test scores, major, etc.) and imagine how these variables may correlate with Grade Point Average (GPA). An additional data mining example was provided, which showed relationships between political ideology, voting patterns, and alcoholic drink of choice (http://nmrpp.com/2014/01/02/politics-of-wine-liquor-brands/). The lecture was brief compared to the next portion of this activity, the data analyses. Students were given an archival data set to mine by conducting correlations on seemingly unrelated survey items. The archival data set was collected via Google Forms in 2011 from a previous section of the class (n = 18, see Table 2 for survey items and correlations). Students used Statistical Package for the Social Sciences (SPSS) at their own workstations but were encouraged to talk with their classmates and professor about expected and unusual findings in the data and to practice interpreting and creating APA-formatted results sections for the data. Students completed the posttest at the end of the class period.
Correlations for Activity Data Set.
Note. n = 17. FB = Facebook.
*Correlation is significant at the .05 level. **Correlation is significant at the .01 level.
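The mining step described in the Procedure (correlating every survey item with every other item and looking for unexpected relationships) can be sketched outside SPSS as well. The following is a minimal illustration in Python, not the procedure the students followed; the variable names and values are hypothetical and are not the article's data set.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is 0).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mine_correlations(data):
    """Return (var_a, var_b, r) for every pair of variables, strongest first."""
    results = [
        (a, b, pearson_r(data[a], data[b]))
        for a, b in combinations(data, 2)
    ]
    return sorted(results, key=lambda row: abs(row[2]), reverse=True)


# Hypothetical survey responses, invented for illustration only.
survey = {
    "hours_sleep": [7, 6, 8, 5, 7, 6, 9, 4],
    "fb_friends": [300, 450, 200, 600, 350, 500, 150, 700],
    "credits": [15, 16, 12, 18, 15, 16, 12, 17],
    "stress": [4, 6, 3, 8, 5, 6, 2, 9],
}

for var_a, var_b, r in mine_correlations(survey):
    print(f"{var_a} x {var_b}: r = {r:+.2f}")
```

The sketch only ranks pairs by the size of r. Judging significance, as SPSS does in its correlation output, additionally requires a p value based on a t distribution with n - 2 degrees of freedom.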
Results and Discussion
Pre–Posttest Data
The pre–posttest consisted of several yes/no questions related to data mining. McNemar’s test was used to study response patterns, in particular, how responses to the yes/no questions shifted from Time 1 to Time 2 (see Table 3 for these tests). These findings suggest that, for the most part, students understood that data mining does not require a hypothesis (n = 36; 14 students who erroneously responded that data mining requires a hypothesis at Time 1 learned that it does not require a hypothesis by Time 2), that they could correctly generate a results section for a data mining correlation (n = 37; 25 students who indicated that they did not feel confident conducting and reporting data mining findings at Time 1 later indicated that they did feel confident performing these tasks at Time 2), and that they were more strongly in favor of government regulation of data gathering (n = 37; 15 students who indicated that they were not in favor of government regulation at Time 1 changed their minds about this topic by Time 2).
McNemar Results for Study 1.
Note. APA = American Psychological Association.
*p < .005. **p < .001.
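For readers unfamiliar with McNemar’s test, it uses only the discordant pairs: respondents who switched from “no” to “yes” (b) and those who switched from “yes” to “no” (c). The sketch below, in Python, shows both the exact binomial version and the chi-square version without a continuity correction. The counts are hypothetical; the article reports switches in one direction only, so the values in Table 3 cannot be reproduced from the text.

```python
from math import comb, erfc, sqrt


def mcnemar_exact(b, c):
    """Two-sided exact (binomial) McNemar p value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no one changed their answer
    # Probability of a split at least as lopsided as min(b, c), under p = .5.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def mcnemar_chi2(b, c):
    """Chi-square McNemar statistic (df = 1, no continuity correction) and p."""
    chi2 = (b - c) ** 2 / (b + c)
    # For df = 1, the upper-tail chi-square probability equals erfc(sqrt(chi2 / 2)).
    p = erfc(sqrt(chi2 / 2))
    return chi2, p


# Hypothetical counts: 12 students switched no -> yes, 2 switched yes -> no.
b, c = 12, 2
print(f"exact p = {mcnemar_exact(b, c):.4f}")
chi2, p = mcnemar_chi2(b, c)
print(f"chi2(1) = {chi2:.2f}, p = {p:.4f}")
```

Students who gave the same answer at both time points do not enter either calculation, which is why the test suits pre–post designs with dichotomous items.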
The pre–post data also contained a multiple-choice question asking students to identify the main form of data analysis used in data mining. At Time 1, only 5 of 41 students who responded to the question could correctly identify correlation as the main method of data analysis for data mining, but at Time 2, all 41 students correctly answered the question.
All students responded to the short-answer question about ethical concerns and were able to articulate at least one concern about data mining, with the most popular response (49.06%) being privacy issues, followed by concerns that data mining can lead to inaccurate conclusions/inferences (20.75%). The discussion board grade data demonstrated student learning (n = 41, M = 97.24%, SD = 8.94%). This indicates that the majority of the students provided thorough and correct responses to the three questions from Table 1.
Overall, the data suggest that completing a discussion board, listening to a news story and a brief lecture about data mining, and completing an in-class activity about the topic lead to gains in understanding and application of the topic. However, is it necessary to have the discussion board? Can student learning occur after one class day dedicated to data analysis? Study 2 sought to answer these questions.
Study 2
Method
Participants
Students completed the activities as part of psychological statistics at a small liberal arts university in the Northeastern United States. Data came from two sections of this course taught during the spring 2014 semester (n = 45, although not all students completed all parts of the assessment). Again, this activity was used toward the end of the correlation/regression unit in the course.
Procedure
The methods used are similar to Study 1 with a few important exceptions: Students neither read the magazine article (Stein, 2011) nor participated in a discussion board. Instead, students listened to the same brief lecture about data mining as well as the radio interview (Inskeep, 2013) described in Study 1. Students then mined the same data set from Study 1 (again, using correlations conducted within SPSS) and were asked to identify one relationship from the data set that they found counterintuitive, create an APA-formatted results section for that item, and submit their results discussion via Angel (a learning management system used by the university where this research was conducted).
At the end of the class period, students completed a brief learning assessment via Angel. The quiz consisted of four multiple-choice questions: (1) “What kind of data analysis is performed in the most basic of data mining?” (2) “Do you have reservations about data mining?” (3) “Data mining requires a research hypothesis.” (4) “Name one of the companies (mentioned in the radio story) that uses data mining.” The final question served as a manipulation check to see if the students could remember a basic detail from the interview. Two students failed the manipulation check. However, their responses to other items in the survey did demonstrate comprehension about the issue of data privacy, so their data were retained.
Results and Discussion
Postactivity assessment suggests that the students understood the basics of data mining. All students correctly identified correlation as the most basic analytic technique for data mining, and 86.7% of students correctly indicated that data mining does not require a research hypothesis. In regard to student attitudes toward data mining, 42.2% of students reported not having reservations about data mining, while 35.6% did and 22.2% indicated having a more ambivalent attitude toward the mass data collection/analysis technique. Students performed well on this task, reflecting that they were paying attention during the lecture portion and appeared to gain knowledge about data mining, even without the more extensive preparation involving the magazine article and discussion board.
General Discussion
Data suggest that both versions of the present activity increase knowledge about data mining and provide a novel way to review correlation. Anecdotally, the students appeared to enjoy the activity. As the instructor walked around the computer lab, she overheard students talking about some of the more surprising findings (for instance, there is not a significant relationship between number of credits being taken and stress, which then allowed for a conversation with students about correlation and truncated data) and fielded questions from students who were busy working on the activity.
Limitations
Of course, this activity is not a comprehensive lesson in advanced data mining techniques. The sample size for the data set was small, and the present exercises neither delved into correction techniques for analyzing multiple correlations at once nor introduced the more advanced software or analyses often used in data mining.
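To give a sense of what such correction techniques involve: mining k survey items produces k(k - 1)/2 correlations, so some will fall below .05 by chance alone. Two common adjustments are the Bonferroni correction and the Holm step-down procedure. The Python sketch below is illustrative only; neither procedure was part of the class activity, and the p values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag each p value as significant if it is at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure; returns flags in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p is tested at alpha / m, the next at alpha / (m - 1), ...
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p values fail too
    return reject


# Hypothetical p values from four correlations.
p_values = [0.001, 0.3, 0.015, 0.02]
print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
```

With these values, Holm retains three correlations while Bonferroni retains one, which illustrates that Holm is never more conservative than Bonferroni at the same alpha.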
Regarding the small sample size, the author finds it helpful to work with this data set because she uses it every year and knows it inside and out. She informs her students that the data were collected from their peers within the last few years, and the questions are relevant to college students. Relevant, applicable, and engaging data have been presented as a way to improve student understanding of statistical procedures (Singer & Willett, 1990) and successfully applied to statistics classes (Thompson, 1994; Thompson & Fisher-Thompson, 2013). However, convenience, familiarity, and applicability are certainly not the only considerations when making pedagogical choices. A larger sample would be valuable to increase the ecological validity of this exercise as well as the reliability of findings. Some free sources for such large data sets are available via the General Social Survey (n.d.), Pew Research (n.d.), and the United Nations Statistical Division (n.d.). These data sets have large n sizes and cover a wide variety of topics. Instructors could also opt to have their students collect their own data sets: They could come up with their own list of survey items and collect the data for analysis.
In response to any criticism related to the depth of this activity: No, this exercise did not teach students any advanced analytic techniques for mining data. However, that was not the focus of these activities. Instead, they focused on (a) introducing data mining to students, (b) the ethical implications of data mining, and (c) providing a novel way to review correlation. Assessment data suggest that the activities succeeded in accomplishing these goals.
Activity Alternatives
In addition to using a different data set, there are other ways to modify these activities. For instance, when introducing data mining, different articles or news stories can be used. A professor could change the focus to data mining and, for example, public health (Greenfieldboyce, 2014; Maron, 2014). The mode of introduction can also vary according to the desires of an instructor. In the present studies, a magazine article was used as a prompt for an electronic discussion board, and the radio story was integrated into a lecture. However, either of these pieces of media could be used as an in-class activity, a homework assignment, or an in-class discussion piece to complement the correlation activity.
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
