The Effect of Photoperiod on the Mood of Reddit Users

Abstract

Research into the seasonality of mood has long been stymied by a lack of data, in part due to the prohibitive cost of traditional data collection and the tendency for data to be highly localized. Recent work using social media data has evinced the utility of psycholinguistic features in tracking mood and mental illness, but Twitter data, which are nonanonymous and short-form by design, have almost exclusively been the subject of analysis. In this article, we present a novel corpus within this field of study, comments from the social network Reddit, which does not suffer from these potential limitations. We find that although there are no notable changes in mood in the entire population over the course of a year, a small cohort is acutely sensitive to changes in the relative day length (i.e., the relative photoperiod). Our findings corroborate the phase shift hypothesis, which is the prevailing theory for the seasonality of mood. We also demonstrate the viability of the Reddit comments corpus for studies in mood and, more broadly, mental health.

Introduction

Research on the seasonality of mood in the general population is inconclusive: some studies have found negligible seasonality,^1,2 some have found strong seasonality,^3,4 and yet others have found that moderate weather, regardless of season, yields improvements in mood.⁵ It is widely accepted, however, that a small subset of the population suffers from Seasonal Affective Disorder (SAD), a subtype of major depression with a seasonal pattern. In most cases, symptoms begin in the Fall and dissipate in the Spring (i.e., Winter SAD), although a minority experience symptoms during the Spring and Summer (i.e., Summer SAD). A less acute variant is known as subsyndromal SAD (S-SAD), referred to colloquially as “Winter blues.”⁶

The prevalence of SAD varies widely, but is generally greater at more extreme latitudes,⁷ ostensibly because of more pronounced seasonal differences. In Canada, 2–6 percent of the population experience SAD and 15 percent experience S-SAD.⁶ In the United Kingdom, 2 percent experience SAD and 20 percent S-SAD.⁶ The prevailing theory in SAD research is the phase shift hypothesis, which contends that a disruption in the circadian rhythm, triggered by a change in the day length (i.e., the photoperiod), is responsible for the disorder.⁸ Because other climatic variables that change seasonally (e.g., temperature, precipitation) might confound the phase shift effect, we restrict our analysis of seasonality to photoperiod, the variable that the hypothesis treats as causal.

Research in this field has long been stymied by a lack of data. The existing data also tend to be highly localized. This is partly due to traditional methods of data collection, which can be costly, laborious, and reliant on subjects to report symptoms long after experiencing them. Social media, which readily and cheaply offers reams of data, is thus becoming an appealing source, especially since mental health conditions have been found to present significantly in language use.^9,10 Still, much of the work that has leveraged social media has been limited to Twitter,^11,12 which is neither anonymous nor long-form by design. Anonymity has been found to reduce inhibition when discussing mental health online,¹³ and Twitter's limit of 140 characters per tweet prevents users from discussing matters at length.

We circumvent these issues by collecting a novel corpus, specifically comments from the social network Reddit. Although mental health discourse on Reddit has been studied briefly,¹³ the moods of its users have not. Reddit is an anonymous platform by design, on which users can make long-form comments (up to 40,000 characters) in topic-specific forums called subreddits. The anonymity offered by Reddit has already been found effective at facilitating more open dialogue regarding mental health.¹³ In this article, we track the comments made by Reddit users over their entire comment history, across all subreddits, for variations in psycholinguistic features across a multiyear period.

Much of the work that has used social media for psychology research involves building models that can diagnose users with mental illness; Coppersmith et al.¹⁴ sought self-diagnoses of post-traumatic stress disorder on Twitter as a source of labeled data, and de Choudhury et al.¹¹ enlisted the help of Amazon Mechanical Turkers. Given the unreliability of self-diagnoses of SAD and SAD's high comorbidity with other conditions,¹⁵ we do not attempt to build a model or make “armchair diagnoses” of SAD in this article. Instead, we aim to show the viability of Reddit as a corpus for studying mood and analyze how photoperiod affects mood in the general population at a global scale.

Following Coppersmith et al.¹⁶ and Golder and Macy,¹⁷ we use Linguistic Inquiry and Word Count (LIWC), a validated psychometric tool, for tracking psycholinguistic changes in users' comments after its typical preprocessing of the text. Although LIWC is in some ways a rudimentary tool (for instance, it does not distinguish between multiple word senses), it is widely used in this field of research^10,11,16,17 and has successfully been used by many to track various psychological disorders¹⁶ and seasonal changes in mood.¹⁷ It has even been used to create a reliable predictor of depression using social media data.¹¹ After tracking changes in the entire Reddit population, we use Gaussian mixture models (GMMs) to cluster users based on similar responses of mood to changes in daylight. We then use hypothesis testing to find the dimensions along which these clusters of users differ. The key contributions of this article are thus the novelty of the Reddit corpus within this domain of research and the breadth of its psycholinguistic analysis.

Data

Only a subset of the complete dataset of 1.7 billion Reddit comments spanning 8 years (October 2007–May 2015) can be used, as the majority of users cannot be matched to a specific geographic location. The location is needed to estimate the historical photoperiod to the second, using PyEphem,¹⁸ by subtracting the sunrise time from the sunset time for each day. Although no explicit geographic information is available, Reddit does have city-specific subreddits, such as /r/nyc, in which users may post and comment. Users may subscribe to subreddits, but these subscription lists are not public. Instead, we match users to their city of residence based on how exclusively and frequently they comment in a particular city's subreddit, while introducing criteria to mitigate the noise engendered by the activity of nonresidents (e.g., tourists).
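
To make the notion of photoperiod concrete, the sketch below estimates day length from latitude and day of year. It is not the PyEphem computation described above: it substitutes the textbook sunrise equation with an approximate solar declination, so it ignores refraction, elevation, and the equation of time, and is accurate only to within some minutes. The function name and the sample latitudes are illustrative assumptions.

```python
import math

def approx_photoperiod_seconds(latitude_deg: float, day_of_year: int) -> float:
    """Approximate day length in seconds (assumed helper, not PyEphem)."""
    # Approximate solar declination in degrees for the given day of year.
    declination = -23.44 * math.cos(2 * math.pi * (day_of_year + 10) / 365.0)
    lat = math.radians(latitude_deg)
    dec = math.radians(declination)
    # Sunrise equation: cosine of the hour angle at sunrise and sunset.
    cos_hour_angle = -math.tan(lat) * math.tan(dec)
    # Clamp to cover polar day (24 h of light) and polar night (0 h).
    cos_hour_angle = max(-1.0, min(1.0, cos_hour_angle))
    hour_angle = math.degrees(math.acos(cos_hour_angle))
    # The sun covers 15 degrees per hour; daylight spans twice the hour angle.
    return 2.0 * hour_angle / 15.0 * 3600.0

# Illustrative latitudes: roughly Anchorage (61.2 N) versus the equator.
summer_north = approx_photoperiod_seconds(61.2, 172)
winter_north = approx_photoperiod_seconds(61.2, 355)
equator = approx_photoperiod_seconds(0.0, 172)
```

Under this approximation the high-latitude location swings from roughly 5 hours of daylight in late December to roughly 19 hours in late June, whereas the equatorial value stays at 12 hours, which is the latitude dependence the analysis relies on.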

Specifically, we limit our pool of users to those who (i) have only ever commented in one city-specific subreddit and (ii) have made at least 11 comments in that subreddit. This threshold is chosen because it is at the 80th percentile in the distribution of the number of comments in city-specific subreddits made by users meeting criterion (i) above. Selecting too high a threshold (e.g., 25 comments, the 90th percentile) would select for users who are exceptionally active on Reddit. A threshold of 11 yields a large set of 100,045 users in 106 cities across the world. These users' comment histories, across all subreddits, constitute the final dataset.
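
The two selection criteria can be expressed as a small filter. The following is a minimal sketch under assumed inputs: comments are given as (user, subreddit) pairs, and the set of city-specific subreddit names is supplied by the caller. The function and variable names are hypothetical.

```python
from collections import defaultdict

MIN_CITY_COMMENTS = 11  # threshold stated in the text (80th percentile)

def match_users_to_cities(comments, city_subreddits, min_comments=MIN_CITY_COMMENTS):
    """Map each qualifying user to the one city subreddit they comment in.

    comments: iterable of (user, subreddit) pairs, one pair per comment.
    city_subreddits: set of subreddit names treated as city-specific.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for user, subreddit in comments:
        if subreddit in city_subreddits:
            counts[user][subreddit] += 1

    matched = {}
    for user, per_city in counts.items():
        # Criterion (i): activity in exactly one city-specific subreddit.
        if len(per_city) != 1:
            continue
        city, n_comments = next(iter(per_city.items()))
        # Criterion (ii): at least `min_comments` comments in that subreddit.
        if n_comments >= min_comments:
            matched[user] = city
    return matched
```

Comments in non-city subreddits do not affect the match; they only become part of the user's comment history once the user has been assigned a city.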

The majority of users come from the United States (57.20 percent), Canada (20.37 percent), the United Kingdom (5.03 percent), and Australia (10.85 percent); the remaining 6.54 percent are distributed across 25 other countries. Even in the United States, many are from smaller cities such as Anchorage, Alaska (0.16 percent). Although the age and sex of these users cannot be ascertained, on Reddit overall, there are only slightly more males (54 percent) than females (46 percent).¹⁹ Save age (the age of Redditors skews younger²⁰), our dataset appears representative of the broader population.

Methodology

We choose to track 49 LIWC features that capture aspects of mood or behavior, namely the LIWC categories of affect words, social words, cognitive processes, perceptual processes, biological processes, core drives and needs, time orientation, relativity, personal concerns, and summary variables. All except the four summary variables, analytical thinking, clout, authenticity, and emotional tone, are measured as percentages of words in a given text, after preprocessing.⁹ For instance, if a Reddit comment has a “posemo” (positive emotion) score of 4.2, 4.2 percent of the words in the comment have positive affect, according to LIWC. The summary variables result from more complex algorithms and are standardized.
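
The percentage-of-words scoring can be illustrated with a toy example. The word list below is a hypothetical stand-in, since the actual LIWC dictionary is proprietary and far larger, and the tokenizer is a simplification of LIWC's own preprocessing.

```python
import re

# Hypothetical mini-lexicon for illustration; not the LIWC "posemo" category.
TOY_POSEMO = {"good", "great", "happy", "love", "nice"}

def category_percentage(text: str, category_words: set) -> float:
    """Percentage of tokens in `text` that belong to `category_words`."""
    # Simplified tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in category_words)
    return 100.0 * hits / len(tokens)

score = category_percentage("I love this great city", TOY_POSEMO)
```

Here two of the five tokens are in the toy list, so the comment scores 40.0 on this category.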

We first plot the summary variables against language proficiency, the percentage of words in the comment that are in the built-in LIWC dictionary. We find that two of them, analytical thinking (r = −0.301) and authenticity (r = 0.223), are mildly Spearman-correlated with proficiency, ostensibly because language proficiency affects the manner in which users express themselves. Following Golder and Macy,¹⁷ we add a “proficiency threshold” (85 percent), which is the percentage of words in the comment that must be in English. This subsequently reduces these correlations to acceptable levels: analytical thinking (r = −0.155) and authenticity (r = 0.02). This is at the sentence level, not the user level, hence preventing multilingual users from being excluded for making non-English comments.

Since users have varying baseline values, before tracking population-level changes, it is necessary to mean center the data with respect to each user. This makes data from different users more comparable in scale, emphasizing, in Golder and Macy's parlance, within-individual changes over between-individual changes.¹⁷
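
As a minimal sketch of the per-user mean centering, assuming each observation is a (user, value) pair for a single LIWC feature (the function name is hypothetical):

```python
from collections import defaultdict

def mean_center_by_user(rows):
    """Subtract each user's own mean from that user's values.

    rows: list of (user, value) pairs, one per comment, for one feature.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user, value in rows:
        totals[user] += value
        counts[user] += 1
    means = {user: totals[user] / counts[user] for user in totals}
    return [(user, value - means[user]) for user, value in rows]

centered = mean_center_by_user([("a", 2.0), ("a", 4.0), ("b", 10.0)])
```

After centering, every user's values average to zero, so a user who habitually scores high on a feature no longer dominates the pooled data.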

Tracking population-level changes

To detect psycholinguistic changes in the entire population, we plot each mean-centered feature against the photoperiod (in seconds) on the day of the comment's writing. If the population, as a whole, experienced some seasonal mood shift, a moderately strong correlation with the photoperiod would be expected. Like Golder and Macy,¹⁷ we also plot each mean-centered feature against the relative amount of daylight (i.e., the marginal change in the photoperiod from the previous day), since, according to the phase shift hypothesis, it is the change in daylight that triggers changes in mood.⁸ For brevity, the relative amount of daylight is referred to as relative photoperiod hereafter. Bonferroni correction is applied to reduce the chance of a spuriously significant relationship: because we study 98 relationships (49 LIWC features with two types of photoperiod), for a correlation to be significant at the 5 percent level, its p value must be less than 0.00051 (= 0.05/98). Similar conditions apply for alpha values of 0.01 and 0.001.
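
The two quantities introduced here, the relative photoperiod and the Bonferroni-corrected thresholds, are simple to compute. The sketch below uses hypothetical function names and made-up daily values.

```python
def relative_photoperiod(daily_seconds):
    """Day-over-day change in photoperiod; the first day has no predecessor."""
    return [today - yesterday
            for yesterday, today in zip(daily_seconds, daily_seconds[1:])]

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

N_TESTS = 49 * 2  # 49 LIWC features x 2 photoperiod types
thresholds = {alpha: bonferroni_threshold(alpha, N_TESTS)
              for alpha in (0.05, 0.01, 0.001)}

# Made-up photoperiods (seconds) for three consecutive days.
changes = relative_photoperiod([40000, 40130, 40255])
```

With 98 tests, the thresholds come to about 0.00051, 0.00010, and 0.00001 for alpha values of 0.05, 0.01, and 0.001, respectively.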

Clustering individuals based on response similarity

We then cluster individuals based on similar changes in mood in response to changes in photoperiod. To do this, we first construct a 98-dimension “sensitivity vector” for each individual, where each feature is the average change, across all the user's comments, in one of the 49 mean-centered LIWC features due to a unit increase in either the absolute or relative photoperiod. The average change (i.e., the sensitivity) is the regression coefficient from the univariate regression of the mean-centered feature against absolute or relative photoperiod.
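
A sensitivity vector of this kind can be assembled from univariate least-squares slopes. The sketch below is an illustration under assumptions: the input layout, the function names, and the choice to return zero for a constant predictor are not specified in the text.

```python
def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return 0.0  # assumption: a constant predictor yields zero sensitivity
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def sensitivity_vector(features, abs_photoperiod, rel_photoperiod):
    """Build one user's sensitivity vector.

    features: dict mapping a feature name to that user's mean-centered
        values, one per comment.
    abs_photoperiod, rel_photoperiod: per-comment photoperiod values,
        aligned with the feature lists.
    """
    vector = {}
    for name, values in features.items():
        vector[f"{name}-ABS"] = ols_slope(abs_photoperiod, values)
        vector[f"{name}-REL"] = ols_slope(rel_photoperiod, values)
    return vector

example = sensitivity_vector({"posemo": [2.0, 4.0, 6.0]},
                             abs_photoperiod=[1.0, 2.0, 3.0],
                             rel_photoperiod=[3.0, 2.0, 1.0])
```

With all 49 features supplied, the resulting dictionary has 98 entries, one per feature and photoperiod type.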

We construct a GMM with k components and a convergence threshold of 0.001, then fit the model to the newly constructed vectors, and partition the users according to which component assigns the highest posterior probability to their sensitivity vector. Each component in the GMM is a 98-dimensional Gaussian distribution that corresponds to one of the k groups; given the feature vector for a user, the distribution that assigns the highest probability to the vector represents the group to which the user most likely belongs. We determine the value of k (the number of components) using the Dirichlet Process, testing up to k = 25. For the concentration parameter α, values 0.1, 1, 10, and 100 were tested. All combinations of α and k found that the optimal number of clusters is 2. Although k-means clustering has been used similarly in the past,²¹ we use GMM-based “soft” clustering since k-means tends toward equal-sized clusters, and it is not known, a priori, whether the clusters are of similar sizes.
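
The assignment step, in which a user goes to the component with the highest posterior probability, can be sketched in one dimension. This is a deliberate simplification: the model described above is 98-dimensional and fitted with a Dirichlet Process prior, whereas the code below covers only the assignment, for scalar inputs, with made-up mixture parameters that were not fitted to any data.

```python
import math

def posterior_probabilities(x, weights, means, stds):
    """Posterior probability of each Gaussian component given scalar x."""
    # Work in log space so that far-away points do not underflow to 0/0.
    log_joint = [
        math.log(w) - math.log(s) - 0.5 * math.log(2 * math.pi)
        - 0.5 * ((x - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    peak = max(log_joint)
    unnormalized = [math.exp(value - peak) for value in log_joint]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def assign_component(x, weights, means, stds):
    """Index of the component with the highest posterior probability."""
    posterior = posterior_probabilities(x, weights, means, stds)
    return max(range(len(posterior)), key=posterior.__getitem__)

# Made-up parameters: a large, tight component and a small, diffuse one.
WEIGHTS = [0.995, 0.005]
MEANS = [0.0, 0.0]
STDS = [0.01, 1.0]

near_origin = assign_component(0.001, WEIGHTS, MEANS, STDS)
far_from_origin = assign_component(0.5, WEIGHTS, MEANS, STDS)
```

Two components that share a center but differ in spread still separate points cleanly: values near zero go to the tight component and extreme values to the diffuse one, even though the diffuse component has a far smaller weight.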

Using hypothesis testing, we then find the dimensions of the sensitivity vector along which the two groups differ significantly. Welch's t test (two-tailed) is used so as not to assume that the two groups have an equal population variance. The Student's t test (two-tailed) is used to determine whether the mean of a feature across all users in the same group is nonzero. Bonferroni correction is applied in both instances.
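
For reference, Welch's statistic and its degrees of freedom can be computed as follows. This is a sketch with a hypothetical function name; it stops short of a p value, which would require the t distribution's cumulative distribution function from a statistics library.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variances (n - 1 denominator), each scaled by its sample size.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Because the two variances are estimated separately, the degrees of freedom are generally fractional and smaller than the pooled value used by Student's t test.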

Results and Discussion

Population-level changes

Overall, the Reddit population's mood is not tied to the photoperiod. As detailed in Table 1, none of the 49 LIWC features have any notable correlation with the photoperiod (absolute or relative) over the entire population; no correlation was greater than 0.0054 in magnitude. This is consistent with some previous findings,^1,2 including one with Twitter as a corpus.¹⁷

Table 1.

Correlations with Photoperiod

Mean-centered LIWC feature	r (Absolute)	r (Relative)	SD
Achieve	0.0014^***	0.001^***	4.0335
Affiliation	−0.0005^*	0.0014^***	4.8363
Analytic	0.0002	0.0006^***	33.4230
Anger	0.0033^***	0.0003	5.2768
Anx	0.0012^***	0.002^***	1.9671
Authentic	0.0002	−0.0004	38.0346
Bio	0.0038^***	0.0005^**	7.2610
Body	0.0034^***	0.003^***	4.1438
Cause	−0.0003	0.0015^***	4.7578
Certain	0.002^***	0.0007^***	5.4238
Clout	0.0007^***	−0.0004	33.5670
Cogproc	0.0006^**	0.0026^***	11.4387
Death	−0.0001	0.003^***	1.9978
Differ	0.0005^*	0.0018^***	5.6025
Discrep	−0.0005^*	0.0011^***	3.9612
Drives	−0.0016^***	0	9.1001
Family	0.0004	0.0037^***	2.4082
Feel	0.0024^***	0.0033^***	3.3344
Female	−0.0011^***	0.0054^***	2.9500
Focusfuture	0.0014^***	0.0008^***	3.4791
Focuspast	0.0011^***	0.0009^***	6.1617
Focuspresent	0.0006^**	−0.0001	10.4651
Friend	0.0034^***	0.0012^***	2.5689
Health	0.001^***	0.0028^***	2.9067
Hear	0	0.001^***	3.2353
Home	−0.0008^***	0.001^***	1.8156
Ingest	0.0021^***	0.0022^***	3.9323
Insight	−0.0005^*	0.0026^***	4.9278
Leisure	0.0017^***	0.0004	4.6869
Male	0.0018^***	−0.0005^*	4.2277
Money	0.0003	0.0002	3.4081
Motion	0.0043^***	0.001^***	4.0343
Negemo	0.0017^***	−0.0004	7.6075
Percept	0.0023^***	0.0014^***	6.4108
Posemo	0.0002	0.0008^***	12.0657
Power	0.0013^***	0.0006^**	5.0672
Relative	0.0018^***	−0.0006^**	10.7334
Relig	0.0016^***	0.0034^***	3.1996
Reward	0.0015^***	−0.001^***	5.0658
Risk	0.0016^***	−0.0003	2.7622
Sad	0.0023^***	0.0022^***	2.8411
See	0.0024^***	0.0013^***	3.9932
Sexual	0.0028^***	0.0035^***	3.8863
Social	0.0015^***	−0.0006^***	10.5790
Space	0.0027^***	0.0006^***	7.1388
Tentat	0.0011^***	0.0021^***	5.4430
Time	−0.0002	0.0002	6.7957
Tone	−0.0014^***	0.0016^***	38.0097
Work	−0.0004	0.0024^***	4.9450

The correlation of each mean-centered LIWC feature with the absolute and relative photoperiod across all comments by all users. The SD of the mean-centered feature is also provided. The mean of each mean-centered feature across all comments is 0, which is why it is not provided. ^*Denotes significance at the 5 percent level, ^**at the 1 percent level, and ^***at the 0.1 percent level.

LIWC, Linguistic Inquiry and Word Count; SD, standard deviation.

However, deeper analysis reveals several interesting patterns. Let the total sensitivity of a user be the magnitude of their sensitivity vector. The distribution of total sensitivity across all users has a high degree of positive skew. However, the distribution of the logged total sensitivity of all users, plotted in Figure 1, is approximately Gaussian, although it still retains some positive skew. There is also a small group of highly insensitive users, with a logged total sensitivity near −40. It should be noted that while users with a similar total sensitivity have a similar acuity of response to changes in photoperiod, the manner in which they respond is not necessarily similar.
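
The logged total sensitivity is the natural log of a Euclidean norm. A minimal sketch follows, with a hypothetical function name and an assumed convention of returning negative infinity for an all-zero vector:

```python
import math

def logged_total_sensitivity(sensitivity_vector):
    """Natural log of the Euclidean magnitude of a sensitivity vector."""
    magnitude = math.sqrt(sum(v * v for v in sensitivity_vector))
    if magnitude == 0.0:
        return float("-inf")  # assumption: log of a zero magnitude
    return math.log(magnitude)

value = logged_total_sensitivity([3.0, 4.0])  # magnitude 5
```

On this scale, a logged total sensitivity near −40 corresponds to a magnitude on the order of 10⁻¹⁸, that is, regression coefficients that are essentially zero.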

FIG. 1.

The distribution of the logged total sensitivity for all users. A user's logged total sensitivity is the natural log of the magnitude of their sensitivity vector.

Differences between clusters

Clustering users based on similar changes in mood in response to changes in photoperiod yields two distinct clusters. The first cluster comprises 468 users (0.468 percent of the population) that are exceptionally sensitive to the photoperiod; the magnitudes of their sensitivity vectors are exceptionally high, as seen in Figure 2. The second cluster comprises the remaining users, who are insensitive to such changes. Henceforth, we refer to the clusters as sensitive and insensitive, respectively. Sensitive users have a total sensitivity that is on average 171.96 times larger compared with insensitive users, t(467) = 46.82, p < 0.001.

FIG. 2.

The logged total sensitivity of insensitive and sensitive users. A few outliers, with logged total sensitivity below −30, are excluded. A Gaussian mixture model with two components, one representing each group, determines the designation of a user as “sensitive” or “insensitive.”

Because each of the 98 features in the sensitivity vector is the average change in one of the original 49 LIWC categories with respect to either relative or absolute photoperiod, they are hereafter referred to as X-REL or X-ABS, where X is an LIWC category. X-REL_S, for example, is the mean of some relative photoperiod feature X across the entire sensitive cohort; X-ABS_I is the equivalent for the insensitive cohort.

Let the total relative sensitivity (TRS) be the magnitude of a sensitivity vector calculated only using the X-REL features; total absolute sensitivity (TAS) using only the X-ABS features. In Figures 3 and 4, distributions of the logged TRS and TAS are graphed for sensitive and insensitive users. Sensitive users are significantly more sensitive to changes in relative than absolute photoperiod—TRS is on average 21.53 times higher than TAS, t(574) = –41.77, p < 0.001. Insensitive users are also significantly more sensitive to relative photoperiod—for them, TRS is on average 31.97 times higher than TAS, t(103,670) = –41.77, p < 0.001. This is expected, given that the phase shift hypothesis ascribes seasonal changes in mood to changes in the relative—not the absolute—amount of daylight.
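
TAS and TRS are the magnitudes of the two halves of the sensitivity vector. In the sketch below, the vector is assumed to be a dictionary whose keys end in "-ABS" or "-REL"; that layout and the function name are illustrative choices, and the numbers are made up.

```python
import math

def split_sensitivity(vector):
    """Return (TAS, TRS) for a dict keyed by 'feature-ABS' / 'feature-REL'."""
    tas = math.sqrt(sum(v * v for k, v in vector.items() if k.endswith("-ABS")))
    trs = math.sqrt(sum(v * v for k, v in vector.items() if k.endswith("-REL")))
    return tas, trs

tas, trs = split_sensitivity({
    "posemo-ABS": 3.0, "negemo-ABS": 4.0,   # made-up coefficients
    "posemo-REL": 6.0, "negemo-REL": 8.0,
})
```

Since the two halves share no elements, the squared total sensitivity equals the sum of the squared TAS and the squared TRS.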

FIG. 3.

The logged TAS and logged TRS of all sensitive users. A user's TAS is the magnitude of their sensitivity vector, where the magnitude is calculated with the X-ABS elements of the vector (i.e., those with respect to the absolute photoperiod). A user's TRS is the magnitude of their sensitivity vector, where the magnitude is calculated with the X-REL elements of the vector (i.e., those with respect to the relative photoperiod). TAS, total absolute sensitivity; TRS, total relative sensitivity.

FIG. 4.

The logged TAS and logged TRS of all insensitive users. A user's TAS is the magnitude of their sensitivity vector, where the magnitude is calculated with the X-ABS elements of the vector (i.e., those with respect to the absolute photoperiod). A user's TRS is the magnitude of their sensitivity vector, where the magnitude is calculated with the X-REL elements of the vector (i.e., those with respect to the relative photoperiod). TAS, total absolute sensitivity; TRS, total relative sensitivity.

As seen in Table 2, along no dimension in the sensitivity vector are the two groups significantly different. However, as noted above, sensitive users have a significantly higher total sensitivity than insensitive users. In concert, these two results imply that sensitive users have little in common. They do not respond to changes in photoperiod in a similar manner; rather, they are distinct from insensitive users only because of the extreme acuity of their sensitivity. This is depicted in Figure 5, wherein sensitive and insensitive users are plotted along the axes of posemo-REL and negemo-REL. Although insensitive users are far greater in number, they are concentrated near the origin (0, 0), while the sensitive group is far more diffuse, tending to have more extreme values of posemo-REL and negemo-REL. Despite the difference in sparseness, the two groups are distributed similarly about the origin, which is why there exists no significant difference between posemo-REL_S and posemo-REL_I and between negemo-REL_S and negemo-REL_I. Because the total sensitivity captures only the magnitude of sensitivity and not its direction, it captures the difference in sparseness across the two groups.
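
The distinction between a group's mean coefficient and its mean magnitude can be shown with a few made-up numbers. The values below are invented for illustration and are not drawn from the dataset.

```python
from statistics import mean

# Made-up single-feature coefficients for two hypothetical groups.
sensitive = [5.0, -5.0, 3.0, -3.0]
insensitive = [0.02, -0.02, 0.01, -0.01]

def mean_and_mean_magnitude(values):
    """Return (signed mean, mean absolute value) of the coefficients."""
    return mean(values), mean(abs(v) for v in values)

sensitive_summary = mean_and_mean_magnitude(sensitive)
insensitive_summary = mean_and_mean_magnitude(insensitive)
```

Both groups have a signed mean of zero because positive and negative responses cancel, yet the mean magnitude is 4.0 for the first group and 0.015 for the second. A test on the signed means sees no difference, whereas a test on magnitudes does.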

FIG. 5.

Changes in positive (posemo) and negative (negemo) emotion associated with changes in relative photoperiod, for both sensitive and insensitive users. The change is the regression coefficient when the values of posemo and negemo for a user, across all their comments, are regressed against the relative photoperiod.

Table 2.

Differences Between Sensitive and Insensitive Users

X-REL/X-ABS	Sensitive (X-REL_S/X-ABS_S)	Insensitive (X-REL_I/X-ABS_I)
Analytic-ABS	0.2155	−0.0001
Analytic-REL	−0.1757	−0.0004
Clout-ABS	−0.0922	0.0003
Clout-REL	−1.0072	0.0002
Authentic-ABS	0.3931	−0.0003
Authentic-REL	0.1784	−0.0013
Tone-ABS	0.038	0.0003
Tone-REL	−1.2856	0.0006
posemo-ABS	−0.0038	0
posemo-REL	0.13	−0.0002
negemo-ABS	−0.0117	0
negemo-REL	−0.1463	−0.0002
anx-ABS	−0.0002	0
anx-REL	−0.0191	0
anger-ABS	−0.0093	0
anger-REL	−0.1231	0.0002
sad-ABS	−0.0017	0
sad-REL	0.0051	−0.0003
social-ABS	−0.0666	0.0002
social-REL	−0.4371	0.0002
family-ABS	−0.087	0
family-REL	−0.5049	−0.0001
friend-ABS	−0.0347	0
friend-REL	−0.0323	−0.0001
female-ABS	−0.032	0
female-REL	−0.2584	−0.0003
male-ABS	0.0125	0
male-REL	0.0385	0.0004
cogproc-ABS	0.0985	−0.0001
cogproc-REL	0.152	−0.0007
insight-ABS	0.0188	0
insight-REL	0.0591	−0.0001
cause-ABS	0.0194	0.0001
cause-REL	−0.0753	0
discrep-ABS	0.0029	0
discrep-REL	0.0438	0
tentat-ABS	0.0493	0
tentat-REL	−0.0051	−0.0003
certain-ABS	−0.0082	0
certain-REL	0.1665	−0.0002
differ-ABS	0.0542	−0.0001
differ-REL	−0.0328	−0.0004
percept-ABS	−0.0204	0
percept-REL	−0.2402	0.0005
see-ABS	0.0004	0
see-REL	−0.1398	0.0003
hear-ABS	−0.0334	0
hear-REL	0.0921	−0.0001
feel-ABS	0.0021	0
feel-REL	−0.2221	0.0001
bio-ABS	0.0018	−0.0001
bio-REL	−0.1647	0.0004
body-ABS	−0.0018	0
body-REL	0.1548	0.0002
health-ABS	0.0044	0
health-REL	−0.1294	0
sexual-ABS	−0.003	0
sexual-REL	−0.2609	−0.0001
ingest-ABS	0.0145	0
ingest-REL	−0.0348	0.0002
drives-ABS	0.032	0
drives-REL	−0.1282	−0.0007
affiliation-ABS	0.0202	0
affiliation-REL	−0.085	−0.0003
achieve-ABS	0.0143	0
achieve-REL	0.1013	−0.0002
power-ABS	−0.0261	0
power-REL	−0.056	0
reward-ABS	0.0267	0
reward-REL	0.0762	−0.0005
risk-ABS	−0.0037	0
risk-REL	−0.1647	0.0003
focuspast-ABS	0.031	0
focuspast-REL	0.222	−0.0007
focuspresent-ABS	0.0262	0
focuspresent-REL	0.0765	0.0009
focusfuture-ABS	−0.0126	0
focusfuture-REL	0.0023	−0.0001
relativ-ABS	0.058	0
relativ-REL	0.0133	0
motion-ABS	−0.0035	0
motion-REL	0.005	0.0001
space-ABS	0.0054	0
space-REL	−0.284	0.0001
time-ABS	0.038	0
time-REL	0.3342	−0.0003
work-ABS	−0.0011	0.0001
work-REL	0.165	−0.0002
leisure-ABS	0.0037	0
leisure-REL	0.0655	0
home-ABS	−0.0091	0
home-REL	0.1271	0
money-ABS	0.0074	0
money-REL	0.1079	−0.0001
relig-ABS	−0.0009	0
relig-REL	−0.0029	0.0001
death-ABS	0.0034	0
death-REL	−0.3487	−0.0002

The average change in every mean-centered LIWC feature X with respect to either absolute photoperiod (X-ABS) or relative photoperiod (X-REL). X-REL_S is the mean of X-REL across all sensitive users; X-REL_I is the mean of X-REL across all insensitive users. X-ABS_S and X-ABS_I are analogous means for X-ABS. None of the X-ABS or X-REL features are significantly different than 0, as determined by two-tailed Student's t tests. Neither are the differences across them significant, as determined by two-tailed Welch's t tests.

Also evident in Table 2 is the fact that none of the individual X-REL and X-ABS values are significantly different than 0 for either group. Although posemo-REL_S being 0.13 and negemo-REL_S being −0.1463 may give the impression that a marginal increase in relative photoperiod is associated with a marginally more positive mood among sensitive users, neither value, both of which are means across all sensitive users, is significantly different than 0, as determined by a two-tailed Student's t test. The presence of sensitive users whose mood degrades acutely with increases in relative photoperiod balances out the presence of sensitive users whose mood is lifted.

Although it may be possible to partition the sensitive users into smaller subgroups and make diagnoses of Winter SAD, Summer SAD, and S-SAD depending on the nature of their sensitivity to changes in photoperiod, for reasons stated earlier in the article, we make no such attempt. The prevalence of SAD in the United States, United Kingdom, and Canada (2–6 percent),^6,7 from which most users in our dataset originate, is also higher than the proportion of users that are sensitive (0.468 percent). However, the data collection method outlined in the Data section also selects against inactive users, and given that inactivity is a depressive symptom,⁶ seasonally depressed individuals may be underrepresented in our dataset.

It should also be noted that our analysis does not imply causality in itself, which is why we take care to note that increases in absolute and relative photoperiod may be associated with various mood changes, although not necessarily the cause of them. Causality between relative photoperiod and mood, however, is posited in the widely accepted phase shift hypothesis,⁸ and our findings appear to be in accord with that theory.

Conclusion

This work shows that Reddit can be a viable alternative to traditional data sources (and Twitter) for the study of mood and, more broadly, mental health, given appropriate data filtering heuristics.

Unsurprisingly, there is no relationship between photoperiod and the overall Reddit population's mood; however, there is a small cohort that is acutely sensitive to changes in the relative photoperiod. These sensitive users have little in common other than the acuity of their sensitivity: an increase in the relative photoperiod is significantly associated with a strong overall change in mood in all of them, but the exact dimensions along which mood changes vary greatly across sensitive users. Thus a marginal increase in the relative photoperiod does not significantly change the value of any particular LIWC feature more for sensitive users than it does for insensitive users. As expected from the phase shift hypothesis, changes in absolute photoperiod did not spur as much change.

Although these sensitive users displayed characteristics of the various types of SAD, we make no attempt to diagnose them; with a proper validation set, constructed using more direct or traditional methods, it may be possible to do so. The actual exposure of Reddit users to sunlight is assumed to be typical; examining further confounds is the subject of future work. LIWC, although popular, is also a fairly simplistic tool with sometimes unclear implications for user behavior.

Previous works that have found certain language features to be a reliable predictor of depression^11,22 have proposed clinical applications such as an early warning system that monitors depressive patients' social media feeds. Even though we do not attempt to diagnose Reddit users in this article, our results still suggest that such clinical applications would not be nearly as effective for SAD, since the proportion of users that exhibit extreme change in language use is much smaller than the proportion of the population suffering from the disorder. In this way, we have identified a major limitation to studying mental health conditions using social media data, a field of study that has heretofore focused mostly on depression.

Studies done on Twitter regarding the prevalence of depression and other mental illnesses can be replicated using the Reddit corpus, but features unique to Reddit also open up new lines of research. One such possible line of research is the degree to which a user's language depends on the subreddit in which they are posting. For instance, do individuals use more emotional language in smaller subreddits, where users may be more familiar with one another? Although this article evinces the utility of the Reddit corpus, more complex approaches will be needed to maximize its value to the scientific and healthcare communities.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

References

1. Murray, Allen, Trinder. A longitudinal investigation of seasonal variation in mood. Chronobiology International, 2001; 18:875–891.

2. Denissen, Butalid, Penke, et al. The effects of weather on daily mood: a multilevel approach. Emotion, 2008; 8:662.

3. Harmatz, Well, Overtree, et al. Seasonal variation of depression and other moods: a longitudinal approach. Journal of Biological Rhythms, 2000; 15:344–350.

4. Okawa, Shirakawa, Uchiyama, et al. Seasonal variation of mood and behaviour in a healthy middle-aged population in Japan. Acta Psychiatrica Scandinavica, 1996; 94:211–216.

5. Keller, Fredrickson, Ybarra, et al. A warm heart and a clear head: the contingent effects of weather on mood and cognition. Psychological Science, 2005; 16:724–731.

6. Melrose. Seasonal affective disorder: an overview of assessment and treatment approaches. Depression Research and Treatment, 2015; 2015:1–6.

7. Rohan, Roecklein, Haaga DAF. Biological and psychological mechanisms of seasonal affective disorder: a review and integration. Current Psychiatry Reviews, 2009; 5:37–47.

8. Lewy, Rough, Songer, et al. The phase shift hypothesis for the circadian component of winter depression. Dialogues in Clinical Neuroscience, 2007; 9:291.

9. Tausczik, Pennebaker. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 2010; 29:24–54.

10. Ramirez-Esparza, Zimmermann, Brockmeyer, et al. (2008) The psychology of word use in depression forums in English and in Spanish: testing two text analytic approaches. In Proceedings of the 2nd International AAAI Conference on Web and Social Media (ICWSM). Seattle: AAAI.

11. De Choudhury, Counts, Horvitz. (2013) Social media as a measurement tool of depression in populations. In 5th Annual ACM Web Science Conference. Paris: ACM, pp. 47–56.

12. Coppersmith, Dredze, Harman. (2015) From ADHD to SAD: analyzing the language of mental health on Twitter through self-reported diagnoses. In NAACL HLT. Denver: NAACL.

13. De Choudhury, De. (2014) Mental health discourse on Reddit: self-disclosure, social support, and anonymity. In Proceedings of the 8th International AAAI Conference on Web and Social Media (ICWSM). Ann Arbor: AAAI.

14. Coppersmith, Harman, Dredze. (2014) Measuring post traumatic stress disorder in Twitter. In Proceedings of the 8th International AAAI Conference on Web and Social Media (ICWSM). Ann Arbor: AAAI.

15. Terman, Levine, Terman, et al. Chronic fatigue syndrome and seasonal affective disorder: comorbidity, diagnostic overlap, and implications for treatment. The American Journal of Medicine, 1998; 105:115S–124S.

16. Coppersmith, Dredze, Harman. (2014) Quantifying mental health signals in Twitter. In Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality. Baltimore: ACL, pp. 51–60.

17. Golder, Macy. Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science, 2011; 333:1878–1881.

18. Rhodes. (2011) PyEphem: astronomical ephemeris for python. http://rhodesmill.org/pyephem (accessed Aug. 18, 2016).

19. Reddit. (2015) Audience and demographics. https://reddit.zendesk.com/hc/en-us/articles/205183225-Audience-and-Demographics (accessed Aug. 18, 2016).

20. Pew Research Center. (2013) 6% of Online Adults are Reddit Users. www.pewinternet.org/files/old-media/Files/Reports/2013/PIP_reddit_usage_2013.pdf (accessed Aug. 18, 2016).

21. Joshi, Doshi, Patel. Diagnosis of breast cancer using clustering data mining approach. International Journal of Computer Applications, 2014; 101.

22. De Choudhury, Gamon, Counts, et al. (2013) Predicting depression via social media. In Proceedings of the 7th International AAAI Conference on Web and Social Media (ICWSM). Boston: AAAI.