Abstract
The era of “big data” has radically altered the way scientific research is conducted and new knowledge is discovered. Indeed, the scientific method is rapidly being complemented and even replaced in some fields by data-driven approaches to knowledge discovery. This paradigm shift is sometimes referred to as the “fourth paradigm” of data-intensive and data-enabled scientific discovery. Interdisciplinary research with a strong emphasis on translational outcomes is becoming the norm in all large-scale scientific endeavors. Yet, graduate education remains largely focused on individual achievement within a single scientific domain, with little training in team-based, interdisciplinary data-oriented approaches designed to translate scientific data into new solutions to today's critical challenges. In this article, we propose a new pedagogy for graduate education: data-centered learning for the domain-data scientist. Our approach is based on four tenets: (1) Graduate training must incorporate interdisciplinary training that couples the domain sciences with data science. (2) Graduate training must prepare students for work in data-enabled research teams. (3) Graduate training must include education in teaming and leadership skills for the data scientist. (4) Graduate training must provide experiential training through academic/industry practicums and internships. We emphasize that this approach is distinct from today's graduate training, which offers training in either data science or a domain science (e.g., biology, sociology, political science, economics, and medicine), but does not integrate the two within a single curriculum designed to prepare the next generation of domain-data scientists. We are in the process of implementing the proposed pedagogy through the development of a new graduate curriculum based on the above four tenets, and we describe herein our strategy, progress, and lessons learned.
While our pedagogy was developed in the context of graduate education, the general approach of data-centered learning can and should be applied to students and professionals at any stage of their education, including at the K-12, undergraduate, graduate, and professional levels. We believe that the time is right to embed data-centered learning within our educational system and, thus, generate the talent required to fully harness the potential of big data.
Introduction
The emergence of data collections and sources with large volume, a variety of data elements, tremendous velocity, and questionable veracity (i.e., “big data”) has ushered in a paradigm shift in the way scientific research is conducted and new knowledge is discovered. 1 Knowledge discovery in science has traditionally been guided by the “scientific method,” whereby scientists invoke a series of orderly steps involving observation, review of the existing knowledge base, hypothesis generation, experimental design and testing, analysis, and dissemination of results. This traditional scientific approach of observation-based hypothesis testing is rapidly being complemented and, in some areas, even replaced by data-driven approaches to knowledge discovery that rely less on hypothesis testing and more on hypothesis generation.
Nearly all scientific domains are moving toward this so-called “fourth paradigm” of data-intensive and data-enabled scientific discovery. 2 The shift involves traditional fields such as astronomy, which have always had access to rich data sources but are now facing unprecedented computational and analytical challenges, as well as emergent fields such as environmental science, which have only recently gained access to the wealth of distributed data available from sensors and satellites. Social sciences such as political science are also recognizing the power of data derived from social media, mobile devices, and “smart cities.” Indeed, the wealth and diversity of data available for scientific research are yet to be fully appreciated. As a case in point, the so-called “Internet of Things” is growing faster than the human population, with an estimated 50 billion Internet-connected devices, objects, and sensors by 2020 (i.e., >6 devices/person worldwide). 3 These data sources will allow scientists to approach knowledge discovery in entirely new ways.
Yet, graduate training remains siloed and largely focused on individual achievement within a single scientific domain, with little (if any) formalized training in team-based, interdisciplinary data-oriented approaches designed to translate scientific data into new solutions to today's critical challenges.
A New Pedagogy for Graduate Education: Data-Centered Learning for the Domain-Data Scientist
Advanced training in data analytics, while originally considered the solution to the challenges of big data, is now recognized as insufficient to meet the needs of today's domain scientist. As recently as 2011, a McKinsey Global Institute report predicted that by 2018, the United States would have a shortfall of 140,000 to 190,000 “deep” analysts and an additional shortfall of 1.5 million general analysts and managers with analytical expertise. 4 That same report predicted that investments in analytic talent and training would produce a >60% increase in operating margins across industry sectors.
However, follow-up reports have since found that analytic skills alone are increasingly insufficient to harness the power of big data, with industry leaders reporting minimal return on investments in analytics, largely because analytic findings are too often not actionable.5,6 In fact, even companies that specialize in the application of advanced analytics report a need for a different type of skilled worker. A 2016 Gartner report indicates that >40% of these organizations state a lack of skilled data scientists as their top challenge. 7
The failure of analytics to guide decision-making is driving a need for a new type of talent. McKinsey argues in favor of “translators,” 8 whereas General Assembly and Burning Glass propose “hybrid” employees. 9 Regardless of terminology, however, as envisioned by both groups, the skilled employee in demand today will “form the links that bind the chain of an effective advanced-analytics capability.” 8 Translators bridge business management and operations with analytic approaches and strategies to render analytic findings into actionable insights. For example, an “analytic consultant” may be someone trained in statistics, but with sufficient knowledge in business operations to improve managerial decision-making.
While businesses may indeed benefit from investments in the training of translators, we believe that there is an even greater opportunity to launch the careers of translators through graduate education. The paradigm shift toward data-intensive and data-enabled scientific discovery, coupled with the failure of advanced training in analytics to fulfill current needs in industry, suggests that a new pedagogy is needed for graduate education: data-centered learning for the domain-data scientist.
Our pedagogical approach of data-centered learning for the domain-data scientist is targeted at the graduate school level and is based on four tenets (Fig. 1):
(1) Graduate training must incorporate interdisciplinary training that couples the domain sciences with data science.
(2) Graduate training must prepare students for work in data-enabled research teams.
(3) Graduate training must include education in teaming and leadership skills for the data scientist.
(4) Graduate training must provide experiential training through academic/industry practicums and internships.

Fig. 1. The proposed pedagogy for graduate education: data-centered learning for the domain-data scientist.
The first tenet of data-centered learning is rooted in the idea that domain scientists should have essential knowledge in data science and that data scientists should have insights into the theoretical and practical data needs of domain scientists; together, these qualities define the domain-data scientist. The second tenet is based on the observation that in today's data-driven world, science is increasingly being carried out by teams whose members are often geographically distributed and work remotely. The third tenet is based on the premise that data-intensive, interdisciplinary team-oriented science necessitates teaching social skills, such as effective communication, that students need to develop professional relationships and work effectively in teams. Finally, our fourth tenet acknowledges the need to provide real-world training and problem solving, for in the absence of training for real-world application of skills, the first three tenets will have little impact on graduate education and student success in the workplace. Below, we expound on each tenet.
Today's domain scientists require training in the science of data, or “data science.” While many definitions of data science have been put forth, including one on Wikipedia (https://en.wikipedia.org/wiki/Data_science), most definitions focus on the theories and practices of the disciplines from which data science draws, including computer science, information science, statistical science, and mathematics, rather than on the application of those skills across scientific domains to translate scientific data into tangible solutions designed to improve the human condition. The National Consortium for Data Science (http://datascienceconsortium.org), a public–private partnership established in August 2013 to address the data challenges of the 21st century, defines data science as the systematic study of the organization and use of digital data to accelerate discovery, improve critical decision-making processes, and enable a data-driven economy. 10
We believe that data science provides the foundation upon which we can develop the tools and approaches required to efficiently, effectively, and securely access, federate, process, and analyze data; however, the application of data science tools and approaches requires fundamental understanding of the scientific domain in which they are applied. Thus, training in data science alone is insufficient. Data scientists must also acquire skills and expertise in the data needs within a specific scientific domain or application area, for example, biology, political science, sociology, economics, business, medicine, or law; likewise, domain scientists must acquire skills and expertise in data science. Without such integrated training, data scientists and domain scientists are equally ill equipped to take advantage of today's opportunities and to accelerate discovery, improve decision-making, and enable a data-driven economy. We believe that a new breed of scientist is needed: the domain-data scientist. The coupling of data science training with domain science training and the creation of a fully integrated curriculum designed to train the next generation of domain-data scientists are foundational components of the proposed pedagogy for data-centered graduate education.
Success in the modern workforce and advancements in today's workplace require team-based work.5,6 Yet, graduate education remains focused on individual achievement rather than team-based accomplishments. While individualized training is important, we believe that interdisciplinary team-based training and research should serve as the medium of data-centered learning. Training that bridges data science with the domain sciences necessitates collaboration among faculty members in disciplines and academic units that have historically remained siloed. To realize our vision, the academic structure itself must be modified to encourage innovation, including new models of education and training and a tenure system that emboldens faculty to collaborate across departments and fields.11–13 Only through an interdisciplinary collaborative approach to training and research will we prepare scientists for the fourth paradigm, with skills to translate research data and analytic results into actionable discoveries across domains and industries.
Interdisciplinary team-based learning and the success of teams mandate that team members are equipped with skills such as communication, adaptability, conflict resolution, ethics, and empathy.14,15 Industries are recognizing this void and have begun adding teaming and leadership skills to the requirements for technical jobs. 16 Google, for example, bases its hiring not on GPAs or degrees, but on skills such as humility that the company believes are required to excel at interdisciplinary data-intensive research and development. 17 We believe that training in teaming and leadership skills is a crucial component of data-centered learning.
While the tenets discussed above are critical to the proposed pedagogy of data-centered learning for the domain-data scientist, we recognize that the exact skill set required to excel in today's data-driven workplace is not known, in part because the opportunities presented by big data are only beginning to emerge. Thus, academic-based training alone is unlikely to yield the highly skilled workforce in demand today. In fact, high-quality training based solely on the first three tenets is nearly impossible to accomplish without real-world experiences in which students apply their skills.
We posit that real-world practicums and internships based on partnerships between academia and industry provide the ultimate mechanism through which today's students will be equipped with the skills they need to gain a competitive edge in the workplace and, in the process, drive businesses forward.18,19 Such programs will also help educators delineate the skill set required across data-driven domains and industries. While the “apprenticeship” model may seem to be a thing of the past, we see a need to integrate hands-on learning into today's graduate education through industry-based practicums and internships. The challenge, however, will be to identify and provide sufficient incentives to drive academic/industry partnerships and establish practicums and internships.
From Pedagogy to Implementation
To ensure that our pedagogy and tenets are widely adopted, formal implementation is necessary. In this regard, we are working to implement our pedagogy through the adoption of the tenets into concrete portions of a new Professional Science Master's (PSM) in Data Science curriculum that we are developing.
Our efforts are being driven largely by a formal campus working group—Data@Carolina (http://data.web.unc.edu/)—formed in the summer of 2014 with faculty representatives from multiple departments, schools, and institutes. As part of the planning process, the working group conducted a comprehensive survey of data science and data science–related graduate and professional degree programs in place at our peer institutions. We identified 19 institutions with relevant programs: Arizona State University, Carnegie Mellon University, Columbia University, Georgia Institute of Technology, Illinois Institute of Technology, Indiana University–Bloomington, Michigan State University, New York University, Northwestern University, Rutgers University, Stanford University, Texas A&M University, University of California Berkeley, University of Connecticut, University of Maryland–University College, University of Tennessee at Knoxville, University of Virginia, and two North Carolina institutions, North Carolina State University and UNC Charlotte. The existing programs vary in their specific focus, with several focused more on data analytics or business intelligence than data science, but seven of the 19 programs are focused specifically on data science. The existing programs also vary in the number of required credit hours (range 27–45) and expected time to completion (range 9–31 months), as well as structure (full-time versus part-time, in-person versus remote versus blended) and tuition and fees. Moreover, six of the identified institutions offer Certificates in Data Science, Business Analytics, or related fields, and three (University of Chicago, University of Washington, and Georgia Institute of Technology) offer 10-week to 3-month paid internships in “Data Science for Social Good.”
The existence of these degree programs, certificate programs, and paid internships provides a clear sign of local and national student demand for training in data science and related fields. Our institution recognized this opportunity, and in January 2016, we received planning funds from our Provost and approval to move forward with the formal application process to establish the Data Science PSM. This is the first step toward a full commitment from our institution to instill education and training in data science across campus and across domains, at both the undergraduate and graduate levels.
The general framework for the new curriculum has been established and incorporates the four tenets into distinct areas of the overall curriculum. Effort was made to differentiate our program from the programs in place at our peer institutions. As currently proposed, the Data Science PSM will be a research-oriented and team-based, 12-month, full-time (30-credit) program. The program will be aimed at recent graduates and professionals who seek training in data science beyond their baccalaureate degree, although many of the courses that will form the Data Science PSM will also be available to undergraduates and other graduate students. The program will provide specialization in multiple domains in which the tools and approaches of data science are in high demand and instrumental to excelling beyond the normal career path; these may include environmental science, social science, public health, pharmacy, public policy, and city and regional planning.
The proposed program will have multiple elements that collectively address the core tenets of our pedagogy:
1. Boot Camp: The Data Science PSM program will begin (during a summer session) with a 7-day Data Science Essentials Boot Camp & Retreat designed to provide remediation coursework and team building. Each year, students will be grouped into teams to ensure close coordination across projects and group activities throughout the yearlong program.
2. Modular Courses: Modular core courses in data science (both required and elective) will be offered as 0.5- to 3-credit-hour courses. Initially, these courses will be modeled using materials derived from existing courses and workshops; eventually, the modular courses will be streamlined and could last 1 week or less. Examples of required courses are the following: Social and Ethical Implication of Big Data Analytics, Data-Driven Modeling and Scientific Computing, and Data Curation and Management. Examples of elective courses are the following: Text Mining, Web Databases, and Network Analysis. Elective courses will also be available for each of the specialty areas.
3. Courses in Teaming and Leadership: Courses in teaming and leadership (0.5- to 1.5-credit hours) will be offered through The Graduate School's newly established portfolio of professional skill courses. Examples are the following: Professional Communication–Writing, Professional Communication–Presenting, and Role of Leadership for Professional Scientists and Potential to Build Effective Teams.
4. Capstone Practicum: The Data Science PSM program will culminate with a term-long capstone practicum in which student teams will work on real-world data challenges in the specialty fields under the mentorship of multimentor teams drawn from the academic and industry (or nonprofit/government) sectors. To realize this, Data@Carolina has partnered with the National Consortium for Data Science. This partnership will enable us to form the industry, nonprofit, and government connections and commitments that are critical to ensure our success as we establish viable opportunities for rewarding academic-industry practicums.
5. Exit Camp: During the last portion of the program (a summer session), students will prepare an oral presentation and written report on the practicum experience.
Of the many lessons learned during the long process of converting our pedagogy into a Data Science PSM curriculum, several stand out. First, we have found that open and regular communication with our institution's senior leadership has been critical to our success, ensuring the necessary approvals (formal and informal) and funding to implement our pedagogy, while facilitating discussion about our efforts across campus and stimulating widespread enthusiasm and support. Second, although we have received planning funds and initial approval to implement the Data Science PSM, uncertainties about timing and funding necessarily remain; yet, the only way to overcome potential hurdles is to proceed with the belief that our efforts will succeed. Third, we have found that significant upfront costs and time commitments are required to implement our pedagogy. While this may not be surprising, we stress that we invested substantial personnel time for planning and paperwork (e.g., meetings, workshops, reports, and applications) well before we were awarded planning funds. In fact, our efforts began at the grassroots level three years before we received planning funds, which brings us to our final lesson learned: promoting change in academic culture and tradition is slow and difficult, but possible with persistence.
While formal implementation of our pedagogy at the institution level has been slow, the mentoring and teaching practices of individual faculty members have been critical in moving our efforts forward. To be specific, we point out that faculty members have significant discretion in how they mentor graduate students. This autonomy has allowed many Data@Carolina faculty members to actively adopt the tenets we espouse in this article. Indeed, in our own mentoring of graduate students, we frequently draw from the core principles and tenets we present by pairing data science students with domain science coadvisors (or vice-versa), developing research projects involving teams of trainees at different levels, encouraging growth in teaming and leadership skills, and empowering students to obtain nonacademic internships.
At the same time, however, the quality of these experiences varies widely with each student's graduate school training and with the mentor's interest in, and availability for, deliberately emphasizing such principles and tenets, let alone providing the support necessary to realize them. As such, we encourage all faculty members to apply the data-centered learning pedagogy described herein to their teaching practices, to the extent possible, at both the graduate and undergraduate levels. Efforts to do so will catalyze interest and enthusiasm for data-centered learning among students and propel the nation to lead the world in training the next generation of domain-data scientists.
Closing Thoughts on Graduate Education for the 21st Century
We are not the first to suggest that graduate education in the United States faces many challenges that are threatening the nation's global competitiveness (see reference 20 for a discussion of specific challenges). However, the pedagogical approach we envision, data-centered learning for the domain-data scientist, aims to radically alter today's graduate education system through the incorporation of interdisciplinary training that couples data science with the domain sciences, emphasizes team-based learning and research, includes training in teaming and leadership skills, and promotes academic/industry practicums and internships.
Implementation of our approach will require significant financial investments, a shift in traditional academic culture and practice, and incentives for partnerships across stakeholders. However, we believe that such investments will reap huge rewards for all stakeholders and will reveal hidden opportunities. We further believe that without such investments, U.S. graduate education will face growing challenges, and U.S. industries will lose their competitive advantage in the use of data to drive innovation in the global marketplace.
We note that while we developed our pedagogy in the context of graduate education, the general approach of data-centered learning can and should be applied to students at any stage of their education, from K-12 to undergraduate education to graduate education and beyond. We believe that data-centered learning must begin in early childhood, as this will be critical to creating a data-literate citizenry capable of navigating and succeeding in today's data-driven world. We therefore suggest that K-12 educators consider adapting the principles and concepts we have presented for use in their classrooms.
The time is right to embed data-centered learning, based on the four defining tenets, within our educational system and, thus, generate the domain-data science talent required to fully realize the potential of big data and, thereby, propel the U.S. economy and improve the health and well-being of our citizens and all people.
Acknowledgments
This work was supported by the authors' affiliated departments and institutes and by the Executive Vice Chancellor and Provost at the University of North Carolina at Chapel Hill.
Author Disclosure Statement
No competing financial interests exist.
