Abstract
Graduates of Master of Public Health programs may lack appropriate skills in data analysis, and would benefit from practice with research data. Datasets that contain health information relevant to student interests and that are of an appropriate size for class use can be difficult to locate. The National Health and Nutrition Examination Survey is a study that collects both survey and health examination information from a national sample every 2 years. I present a sample of this dataset, with examples of how to use it for human health-related questions. Instructions for how to access and create additional, customized datasets are also provided. Instructors may consider investigating this rich data source and providing students with subsets of these data for class assignments, projects and master’s theses.
Introduction
The ability to analyze data is a skill expected of Master of Public Health (MPH) Program graduates, and is described in the accreditation criteria set forth by the Council on Education for Public Health (CEPH; n.d.). However, some studies have found that MPH graduates lack sufficient education and experience in data analysis and statistics. In a 2015 survey of hiring managers at the New York City Health Department (n = 60), 25% of respondents reported their MPH-level employees were lacking in the area of quantitative data analysis, an area identified as being the most important skill for their positions (Hemans-Henry et al., 2016). In another study conducted in 2014, of 567 public health professionals surveyed, 53% reported they needed more training in quantitative research methods (Wilcox et al., 2014). One of the challenges in adequately teaching analytic techniques to students of epidemiology, biostatistics and research methods is locating appropriate datasets for classroom use. Gaining experience in the analysis of health-based data is essential to preparing students for a career in public health. However, datasets that are an appropriate size and that contain information relevant to each student’s interests are often difficult for instructors to locate. Exposure to real world study data has been shown to build students’ research, quantitative and critical thinking skills, as well as improve their reflective practices (Atenas et al., 2015; Atenas & Havemann, 2015a, 2015b; O’Reilly et al., 2022). Therefore, human health study data may help students gain a deeper understanding of how data collection, data management, and data analysis relate to the health of the public, and how their findings can be translated to changes in health care and policy.
The National Health and Nutrition Examination Survey (NHANES, 2023) is a national program of studies that gathers both survey and examination data on a sample of residents of the United States every 2 years. The data contain health information on common health conditions, diseases, and exposures and are available free of charge from the NHANES website. Building assignments that require use of the NHANES data allows students the opportunity to: (1) use real world health data; (2) select topics to study that are interesting to them; and (3) develop a manuscript that may potentially be submitted for publication. Additional benefits include experiencing the process of preparing a manuscript according to journal specifications, potentially responding to reviewer comments and concerns, and revising the manuscript.
Though the data can be downloaded directly from the NHANES website by anyone, several steps are needed to transform them into a usable format. While some students may be able to successfully manage the process, many do not possess the skills necessary to work with data in different formats. Giving students access to a subset of the NHANES data, in a format used in the classroom, will help instructors use class time more efficiently while providing students with a valuable and useful learning experience. The purpose of this article is to describe the use of the NHANES data in a graduate education setting, from the initial phase of data element selection to analyzing the relationship between the exposure and the outcome.
Background
The NHANES data were used in two courses included in the MPH program at a health sciences university. The first combined epidemiology and biostatistics and was designed to introduce data analysis skills early in the program. From this course, information was reviewed for 301 students across 12 semesters between 2018 and 2023. The second course focused on the development of students’ master’s theses in the final semester. Students who had reached the end of their MPH programs were expected to design their own research studies for their theses, using either primary or secondary data. Of the 92 students who were enrolled in this course over seven semesters between 2017 and 2023, the majority chose secondary data, including 52 students who used NHANES. The methods used for introducing and analyzing the NHANES data are described in this article and based on the students in these two courses.
The Data
Data Challenges
The NHANES data contain a rich assortment of variables and health-related topics. Selecting a combination of appropriate variables to include in a subset for class use can present a challenge to the instructor. Two approaches to solving this problem are to either create a subset after student input, or to pre-select a list of variables that can address common health issues. The approach taken may depend on the level, scope and goals of the course, as well as the skills and previous training of the students.
Data Subset Containing Information Chosen by Students
The decision to create a subset of the NHANES data based on student interests can drastically increase the time invested in the process for the instructor, but may be worth the effort if the students receive a more meaningful experience. While the NHANES data can be used for undergraduate educational purposes, this approach may be best reserved for graduate students, especially if they are interested in pursuing publication of their work. There are two methods for producing these subsets. One is to create a separate subset for each student, and the other is to create one master subset that contains information requested by all students. Creating a separate subset for each student is cumbersome and not recommended for classes with more than 10 students. Creating one master subset is more easily managed by the instructor, but can confuse some students. Instructors considering these methods will need to weigh the skills and needs of their students before settling on how to provide the data.
Data Subset Containing Information Chosen by the Instructor
Creating a subset of NHANES data containing a sample of health-related issues requires a substantial initial time investment from the instructor, but the subset may be used in subsequent course offerings without revision. Since NHANES collects information about some of the most common human health conditions (e.g., heart disease, diabetes and cancer) and exposures (e.g., smoking, alcohol, and environmental contaminants), it is possible to select a sampling of variables that would present students with a wide variety of options. Examples of variables chosen by the instructor for classroom use and included in a master subset are shown in Table 1. It should be noted that these variables were chosen after the course had run several times. Depending on the background of the students attending the program, the combination of data elements will vary.
Table 1. Example of Variables Included in the National Health and Nutrition Examination Survey Master Subset Chosen by the Instructor.
An additional challenge in both scenarios is communicating adequately with the students about their interests and identifying variables within the NHANES data that address their research questions. Each student will have their own unique challenges, and each conversation will differ, depending on their interests, experience, and goals. An example of a typical conversation is included in Appendix A.
Variables and Codebooks
The majority of variables available in the NHANES data are categorical variables. However, there are a number of continuous variables, which may confuse students who have not grasped the difference. When students begin to analyze the data using the statistical analysis application taught in their program or course, it is not uncommon for them to use inappropriate commands, which yield confusing output. Many of these errors could be avoided if they had a better understanding of the variables they were trying to analyze. For example, calculating a mean and standard deviation of type of health insurance, a categorical variable (e.g., coded 1–6), produces unusable output. The difference between categorical and continuous variables can be easily demonstrated using age, since it is commonly included in health data. For example, age is often documented as the number of years the person has lived, and generally ranges from 1 to 100. If there are 50 participants being analyzed, then calculating a mean would be useful since age in years is a continuous variable. A second way to measure age is to provide choices that indicate an age range, such as <18 years, 18 to 39 years or 40+ years. In this case, the analysis would include the number and percentage of people who fell into each of the three categories. Another effective method to help students understand the differences between categorical and continuous variables is to provide instructions on how to locate and use the NHANES online codebook, or to create a new codebook specifically for the data being used in the course.
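The contrast between the two variable types can be illustrated with a minimal sketch. The example below is written in Python using only the standard library, which is an assumption made for illustration and not the software used in the courses. The ages and insurance codes are invented values, not NHANES records, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev

# Invented values for illustration only; these are not NHANES records.
ages = [34, 51, 67, 45, 29, 72, 58, 40]        # continuous: age in years
insurance_codes = [1, 1, 3, 2, 1, 6, 3, 1]     # categorical: codes 1 to 6


def summarize_continuous(values):
    """A mean and standard deviation are meaningful for a continuous variable."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


def summarize_categorical(values):
    """Counts and percentages are meaningful for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {
        level: (count, 100 * count / total)
        for level, count in sorted(counts.items())
    }


def age_group(age):
    """Recode age in years into the three ranges described in the text."""
    if age < 18:
        return "<18"
    if age < 40:
        return "18-39"
    return "40+"


age_summary = summarize_continuous(ages)
insurance_summary = summarize_categorical(insurance_codes)
age_group_summary = summarize_categorical([age_group(a) for a in ages])

print(age_summary)
print(insurance_summary)
print(age_group_summary)
```

Once age is recoded into ranges, it is summarized with the same counts and percentages as insurance type, which may help students see that the appropriate summary depends on how a variable is measured and not on the topic it describes.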
Online Codebook
The NHANES website offers an online codebook of all study variables. There are several ways to locate variables of interest, such as by year, variable label or variable description. The codebook provides detailed information about each data element as well as descriptions of each code. Another benefit of using the online codebook is that raw frequencies are shown for each variable. This is helpful for determining how many participants in the study have a specific characteristic, without first downloading the data. For example, it is easily found that in 2015–2016, there were 856 participants who reported they had diabetes.
Course Specific Codebook
Some instructors may find it helpful to create their own codebook to accompany the subset of NHANES data. This may be particularly useful in explaining which variables are categorical and which are continuous. However, the variable descriptions and documentation can be cumbersome to read through and may still confuse some students. The instructor’s decision about which approach to take may again depend on the skill level of the students.
Because students in MPH programs arrive with different levels of skill and comfort with statistical analysis, providing brief, straightforward instructions for analyzing continuous and categorical variables is essential. This can serve both those who have had little experience with statistical software, and those who just need a reminder. Instructions that include screenshots of the output are generally well received. An example of an instruction sheet is provided in Appendix B. In addition, creating a video to accompany the written instructions is particularly appreciated by many students.
Pedagogical Uses
Students of public health or health sciences are often required to conduct a research project, which involves creating a research question, selecting the appropriate variables to analyze and completing an analysis that addresses the research question. This section walks through the process of using NHANES data to complete these steps.
Variable Selection
To help students select appropriate variables to analyze, the instructor may consider requiring a data analysis plan before the analysis begins. Working through the steps of a data analysis plan would require the student to begin to visualize how the selected variables might be related. An example of instructions for creating a data analysis plan is included in Appendix C. Designing empty or dummy tables in which to present the results is a particularly useful exercise for students to gain a better understanding of what they hope to achieve from their analysis. Using the variables presented in Table 1 as an example, if the student proposed the idea of studying the relationship between heart disease and nutrition, they may choose to present this analysis as shown in Table 2. Visually, these blank tables can help clarify what the analysis means, how to assign the appropriate statistical test, and how to interpret the results.
Table 2. Heart Attack Status by Quality of Overall Diet.
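A blank table of this kind can also be generated programmatically, which may be convenient when many students need shells for different variable pairs. The sketch below is written in Python, which is an assumption made for illustration. The function name, the diet quality levels, and the "n (%)" placeholder are hypothetical and are not taken from the article's Table 2.

```python
def table_shell(row_label, row_levels, col_label, col_levels, cell="n (%)"):
    """Return a blank two-way table as text, with one line per row level."""
    header = [row_label] + [f"{col_label}: {level}" for level in col_levels]
    body = [[level] + [cell] * len(col_levels) for level in row_levels]
    rows = [header] + body
    # Pad every column to the width of its longest entry so cells line up.
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    lines = []
    for row in rows:
        padded = [value.ljust(width) for value, width in zip(row, widths)]
        lines.append("  ".join(padded).rstrip())
    return "\n".join(lines)


# Assumed category labels, chosen only to illustrate the layout.
shell = table_shell(
    "Overall diet quality",
    ["Excellent", "Very good", "Good", "Fair", "Poor"],
    "Heart attack",
    ["Yes", "No"],
)
print(shell)
```

Each cell holds only a placeholder, so the student must decide in advance what will be reported there (a count, a percentage, or both) before running any analysis.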
Of the 51 students reviewed for this paper who chose the NHANES data for their master’s thesis, among the most popular outcomes were eye health, mental health, diabetes and oral health. The most popular exposures included physical activity, smoking and diet. A list of selected thesis topics and the associated variables chosen by Master of Public Health students is presented in Table 3. Examples provided show bivariate analyses only, but some students may consider additional variables that could affect the relationship. It is important to discuss the research question of interest and the supporting literature to guide the student through an analysis plan. Students should also be encouraged to explore theoretical and conceptual frameworks to reach a clear understanding of the association they are researching and which variables will address their research question.
Table 3. Selected Student Master’s Thesis Titles.
Note. All appendices are provided with the online version of this article. NHANES = National Health and Nutrition Examination Survey.
Statistical Tests
Depending on the level and previous training of the students, instructors may choose a range of statistical tests and approaches for extracting meaning from the analysis. Since many full data analyses begin with simple tests to examine the overall relationship in the data, such as chi-square and t-tests, it is not unreasonable to include only the basics. For example, in Table 2 a chi-square test may be all that is necessary to determine if the distribution of diet quality is different between those who did and did not have a heart attack. If the student chose to compare number of fast-food meals by heart attack status, then a t-test could be used. The point is that important information can be extracted with these analyses without introducing multivariate or mixed model analyses, keeping the student’s work more focused and, hopefully, easier to understand and follow.
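The arithmetic behind these two tests can be sketched in a few lines. The example below is written in Python with only the standard library, which is an assumption made for illustration. All counts and meal values are invented, diet quality is collapsed to three levels for brevity, and the analysis is unweighted. The p-value helper uses a closed form that holds only when the degrees of freedom are even, and the t statistic shown is Welch's unequal-variance version, not necessarily the variant taught in the courses.

```python
import math
from statistics import mean, variance


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


def chi_square_p_value_even_df(stat, df):
    """Upper-tail p-value; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = stat / 2
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    se = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / se


# Invented counts. Rows: heart attack yes, no. Columns: three diet levels.
counts = [[20, 30, 50], [30, 30, 40]]
stat, df = chi_square_independence(counts)
p_value = chi_square_p_value_even_df(stat, df)

# Invented weekly fast-food meal counts by heart attack status.
t_stat = welch_t([2, 4, 6], [1, 2, 3])

print(stat, df, p_value, t_stat)
```

In practice students would run these tests in their statistical software. Working through the expected counts by hand once may make the software output easier to interpret.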
Getting the Data
Creating a customized dataset requires several steps, beginning with downloading data directly from the NHANES website (www.cdc.gov/nchs/nhanes/). The NHANES data are stored in dozens of small datasets, each containing information about a different category of health-related data. To bring the data into a usable format for student analyses, a sampling of these small datasets (from a single survey cycle) should first be identified according to the variables to be included in the master subset. After the datasets are merged and unneeded variables are dropped, the final dataset can be provided to students in a format appropriate for their chosen statistical analysis software. Instructions and an example of this process are provided in Appendix D.
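The merge-and-drop step described above can be sketched as follows. This is not the procedure from Appendix D; it is a minimal illustration in Python using only the standard library, with small in-memory tables standing in for two downloaded files. The variable names follow NHANES naming (SEQN is the respondent identifier shared across files), but every value is invented and the helper functions are hypothetical. The actual downloads are SAS transport files, which would first need to be read with a statistical package.

```python
import csv
import io

# Invented rows standing in for two NHANES component files.
demographics = [
    {"SEQN": 1001, "RIDAGEYR": 34, "RIAGENDR": 1},
    {"SEQN": 1002, "RIDAGEYR": 67, "RIAGENDR": 2},
    {"SEQN": 1003, "RIDAGEYR": 45, "RIAGENDR": 2},
]
diabetes = [
    {"SEQN": 1001, "DIQ010": 2, "DIQ050": 2},
    {"SEQN": 1002, "DIQ010": 1, "DIQ050": 1},
]


def merge_on_key(left, right, key="SEQN"):
    """Left-join two lists of row dictionaries on a shared identifier."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        combined = dict(row)
        combined.update(right_by_key.get(row[key], {}))
        merged.append(combined)
    return merged


def keep_columns(rows, columns):
    """Keep only the requested columns; values that are absent become None."""
    return [{col: row.get(col) for col in columns} for row in rows]


def to_csv_text(rows, columns):
    """Serialize the rows as CSV text that most statistical software can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


merged = merge_on_key(demographics, diabetes)
subset_columns = ["SEQN", "RIDAGEYR", "DIQ010"]
subset = keep_columns(merged, subset_columns)
csv_text = to_csv_text(subset, subset_columns)
print(csv_text)
```

A left join keeps every participant from the first file even when the second file has no matching record, so missing values stay visible to students and participants are not silently dropped.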
Discussion
Overall, student feedback for using the NHANES dataset was positive. Students appreciated the opportunity to analyze data that were relevant to their interests and experience, as explained in their course evaluations and class discussions.
As the program grew, offering students the option to choose any variables they were interested in became too difficult, since many students required individual attention from the instructor to create their personalized dataset. Creating one large dataset for all students to use solved this issue. However, a single, large dataset should be updated each year to add new variables in anticipation of commonly requested topics. It has been important to listen to student feedback and to note recent topics in public health (e.g., disparities, the COVID-19 pandemic) when considering additional, useful variables.
Students just beginning to learn data analysis techniques earlier in their MPH programs had more difficulty settling on topics for their course research projects. A much smaller dataset than the one offered to students planning their thesis can be easier to digest than a large dataset with hundreds of possibilities. As some students still had difficulty choosing variables, a selection of pre-designed research topics was presented as choices for final projects. This option works well for larger classes, as the amount of feedback needed is greatly reduced. An example of a pre-designed research project is shown in Appendix E.
All students, regardless of their stage in the MPH program, benefited from instruction and regular guidance in statistical programming. Several types of support to assist different learning styles, such as written instructions, step-by-step videos recorded by the instructor, links to useful websites, readings including software-specific guides, and live workshops, were appreciated and commented on in course evaluations and email communication. For example, publicly available guides for using the Stata software are available online through several institutions of higher learning. These guides provide examples of Stata programming language and explanations of output that can be very helpful to students who are just learning analysis skills. There are also useful online guides made available by textbook publishers that are designed to supplement textbook materials. Another resource students found helpful was a set of published articles on research using the NHANES data. This allowed students to review examples of how the data were used in previous studies, and how the results were presented. Information about these helpful resources is listed in Appendix F. Students also expressed high satisfaction with engaging in peer review activities, including reading and commenting on drafts of assigned partners. However, providing structure for required feedback was essential in helping them give constructive criticism to their peers. This structure included comments in five areas: clarity, organization, strengths of the analysis, weaknesses of the analysis, and recommendations for improvement.
Conclusion
The use of national datasets, such as NHANES, can offer benefits to the learning experience of public health students that other datasets cannot. One of the most important advantages is that the data represent real world study information that could be developed for publication. This aspect is particularly valuable for graduate level training. While some students may seek to publish their master’s thesis, others may choose to develop their study further in a doctoral program. However, all students will learn how secondary data can be used, and how an important, national study can help them in their careers as public health professionals.
Supplemental Material
Supplemental material for The National Health and Nutrition Examination Survey as a Tool to Teach Data Analysis to Public Health Students by Virginia G. Briggs in Pedagogy in Health Promotion:
sj-docx-1-php-10.1177_23733799241234870
sj-docx-2-php-10.1177_23733799241234870
sj-docx-3-php-10.1177_23733799241234870
sj-docx-4-php-10.1177_23733799241234870
sj-docx-5-php-10.1177_23733799241234870
sj-docx-6-php-10.1177_23733799241234870
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
