Predicting Sociodemographic Attributes from Mobile Usage Patterns: Applications and Privacy Implications

Abstract

When users interact with their mobile devices, they leave behind unique digital footprints that can be viewed as predictive proxies that reveal an array of users' characteristics, including their demographics. Predicting users' demographics based on mobile usage can provide significant benefits for service providers and users, including improving customer targeting, service personalization, and market research efforts. This study uses machine learning algorithms and mobile usage data from 235 demographically diverse users to examine the accuracy of predicting their sociodemographic attributes (age, gender, income, and education) from mobile usage metadata, filling the gap in the current literature by quantifying the predictive power of each attribute and discussing the practical applications and privacy implications. According to the results, gender can be most accurately predicted (balanced accuracy = 0.862) from mobile usage footprints, whereas predicting users' education level is more challenging (balanced accuracy = 0.719). Moreover, the classification models were able to classify users based on whether their age or income was above or below a certain threshold with acceptable accuracy. The study also presents the practical applications of inferring demographic attributes from mobile usage data and discusses the implications of the findings, such as privacy and discrimination risks, from the perspectives of different stakeholders.

Introduction

Mobile devices have emerged as an indispensable component of contemporary life. A recent survey indicates that, on average, U.S. adults engage with their mobile devices for over two and a half hours daily.¹ A study conducted by the research firm Dscout reveals that an average user interacts with their mobile device 2617 times per day.² In addition, heavy users, representing the top 10%, initiate 132 sessions daily, interact with their devices 5427 times, and dedicate 225 minutes to their phones.² As a result of these interactions, users leave behind large and rich digital footprints in the form of metadata.

Considering how closely users are attached to their mobile devices (physically, cognitively, and emotionally), their interactions with their devices leave behind unique digital footprints that can reveal a wide array of their characteristics,^3,4 including their sociodemographic attributes. The potential to infer users' sociodemographic information from their mobile device usage holds valuable implications from multiple viewpoints. Companies and service providers can immediately apply this information to their marketing efforts and personalize services.⁵ For example, a company can use the sociodemographic attributes of its mobile application users to target advertising, identify underserved market segments, and offer customized services and content to its user base.⁶

However, while users may also benefit from receiving personalized services and product recommendations, this can simultaneously pose a privacy risk to them.^7,8 Many users may not be aware of the potential predictive power of different data attributes that are collected by mobile apps when they grant permissions to those apps during installation. For instance, in a recent study conducted by Pew Research, nearly half of the participants said that they were not comfortable after they were shown how Facebook had used their personal data to predict their interests for ad targeting.⁹ The situation for mobile apps could be even worse, considering that, due to the low granularity of certain app permissions,¹⁰ the descriptions of some app permissions are generic, vague, or misleading.¹¹

Understanding the degree to which various mobile usage metadata attributes can reveal users' demographic information is of great importance for policymakers and regulatory agencies to protect citizens from privacy and discrimination risks. Mobile digital footprints are currently being considered as alternative data sources for evaluating the creditworthiness of the unbanked population in the pursuit of global financial inclusion. However, if these data also have the ability to predict demographic attributes, the credit risk models may lead to indirect discrimination based on age or gender. On the other hand, if mobile usage data prove to be predictive, they can provide a cost-effective way of monitoring and understanding sociodemographic changes at a high spatial resolution and of tracking progress toward goals such as poverty reduction¹² and local economic development, human mobility, and activities.¹³ This could be especially useful in developing countries where census data may be lacking or unreliable.¹⁴

Previous studies have mostly evaluated correlations between certain demographic attributes and select mobile phone usage metrics. However, a thorough evaluation of the predictive power of a diverse set of mobile usage attributes (both individually and when combined) in predicting various demographic characteristics is lacking in the literature. By aggregating a large number of individual data points corresponding to individual mobile usage sessions and analyzing the data collected from 235 Android users, this study investigates how users' age, gender, income, and education can be inferred from their mobile usage patterns.

More specifically, the exact mobile usage patterns of all 235 users over a 14-day period were used as input attributes to predict users' demographic characteristics. The data were gathered across four mobile usage categories: (1) text and talking, (2) social media, (3) web browsing, and (4) media streaming. Regression models were trained to predict a user's exact age and income, whereas binary classification models were utilized to predict users' age group, income group, gender, and education status. The classification groups are further explained in the Results section. The study seeks to answer the following research questions:

RQ1: How do different demographic segments use their mobile devices?

RQ2: How accurately can each of the four considered sociodemographic attributes be predicted from users' mobile usage footprints?

RQ3: Are there demographic segments with distinctive and homogeneous mobile usage patterns, making it easier for algorithms to detect them?

RQ4: What are the contributions of different mobile usage attributes in predicting each of the sociodemographic characteristics considered?

In addition to answering those questions, this article discusses the practical applications and privacy implications of the findings. The rest of this article is organized as follows. The Related Studies section provides an overview of related studies. The methodology of the study is then described in the Methods section. Results are presented and discussed in the Results section and in the Discussion and Implications section, respectively. Finally, the Conclusions section presents the conclusions of the study.

Related Studies

A number of past studies have investigated the differences in patterns of usage of mobile devices based on demographic attributes. However, there are inconsistencies in the obtained results and a lack of comprehensive and conclusive studies in this area. Moreover, most of the previous studies have only examined the differences in the use of mobile devices by users belonging to different demographic groups and did not examine the possibility of predicting those sociodemographic attributes from mobile usage information.

For example, considering the effect of age, by deploying a user segmentation methodology, the study by Petrovčič et al.¹⁵ showed that most old users used their mobile devices primarily for making and receiving calls. However, Busch et al.¹⁶ showed that older adults use their mobile devices for a much broader range of applications. In general, users in the age group 18–24 years were observed to send and receive a substantially higher number of text messages per day.¹⁷ Beyond call and text message patterns, Kim et al.¹⁸ reported statistically significant correlations between age and the use of entertainment applications (correlation = −0.30), news applications (correlation = −0.37), and social networking applications (correlation = −0.28).

The role of gender on the patterns of mobile device usage has also been studied by researchers.^17,19–27 For example, Sarraute et al.²⁰ and Witayangkurn et al.²⁵ found that, compared with women, men tend to make more frequent, but shorter, calls. Women, however, were found to use their mobile devices more frequently and for longer for social networking^21,18,28 and text messaging.^22,28 The study by Gillick et al.²³ suggests that men have a more extensive network of people whom they call but that women call their contacts more frequently (i.e., women have a smaller network with stronger ties).

Mobile usage patterns have also been shown to be correlated with income and wealth.^12,18,19,27 For example, the study by Eagle and Blumenstock¹⁹ identified notable differences between how low-income and high-income population segments use their mobile devices. The results show significantly higher usage in terms of the number of calls, duration of the calls, and extent of social network for users in the highest quantile of the income when compared with those in the lowest quantile. The use of mobile applications was also found to vary based on users' income. By using a customer segmentation approach, Kim and Park²⁹ showed that, for poor users, the primary use of mobile devices was to make and receive calls.

The study by Kim et al.¹⁸ reported positive correlations between income and the use of entertainment, news, and e-commerce applications as well as a negative correlation between income and the time spent on social networking applications. Moreover, the results presented in Blumenstock et al.¹² suggest that mobile usage patterns can be used to predict poverty at both the individual and the aggregate levels. The latter is especially useful for computational social science and development studies in developing countries, where there is a lack of reliable and timely census data but a high penetration of mobile services.³⁰

The literature has also investigated education. Past studies have identified low literacy as a significant barrier to unlocking the potential of smart mobile devices.³¹ Moreover, Sundsoy³² showed that low-literate users could be detected using their mobile usage patterns. The study found that a user's number of text messages and calls were correlated with education. Similar results, at an aggregate level, were reported by Frias-Martinez and Virseda,¹⁴ where the total number of calls was shown to be statistically lower in census districts with lower education levels. In addition, users with higher levels of educational attainment were reported to use their mobile devices more frequently for news, entertainment, and e-commerce applications.¹⁸

While the correlations between some mobile device usage attributes and sociodemographic characteristics of users have been examined in the past literature, there is a lack of a conclusive study that shows how accurately each of those sociodemographic attributes can be predicted from mobile usage data and quantifies the predictive power of each mobile usage attribute in predicting those demographic characteristics. Furthermore, in light of the increasing importance of users' privacy protection requirements,³³ the practical implications of this study's results will be relevant for users, application designers, and policymakers.

Methods

Data set

Data collection was conducted using a survey approved by the institutional review board and through the Amazon Mechanical Turk (MTurk) portal. The advantages and limitations of using MTurk as a survey platform have been documented in previous research.³⁴ In the context of this article, MTurk was deemed an appropriate medium for data collection due to its ability to efficiently gather information from a demographically and geographically diverse group of participants, while preserving anonymity.

Furthermore, as discussed below, using MTurk allowed participants to upload log files containing metadata on their mobile usage attributes, contributing to the quality and reliability of the collected data. To further ensure the validity of the data, attention check questions were incorporated into the survey, and a comprehensive exploratory data analysis was performed to detect any inconsistencies. Survey eligibility criteria were presented at the beginning, which required individuals to be at least 18 years old, be a resident of the United States, own an Android smartphone, and be willing to install two specific Android applications to report their mobile usage attributes.

The first part of the survey captured respondents' sociodemographic information, including age, gender, income, and education. The second part captured information about participants' mobile devices and their usage. As for usage attributes, the average number of daily calls and text messages (inbound and outbound) and the average time spent on talking/texting, social media, web browsing, and media streaming were recorded. To ensure that the mobile usage attributes were accurately captured, participants were asked to install two mobile applications: “StayFree” (ver. 8.5.7) and “Callistics” (ver. 2.6.8) to report their usage statistics over the past 14 days before completion of the survey. Figure 1 shows the snapshots of these two mobile applications.

FIG. 1.

Snapshots of the (a) Callistics and (b) StayFree mobile applications.

StayFree is a manager app that provides information about the time spent on each application and allows historical usage statistics (i.e., minutes spent on each application on each day) to be exported into a file. Participants were asked to extract and upload this file through the MTurk portal. While StayFree captures the total time a user spent on calls, it does not report the number of calls or text messages. Therefore, participants were asked to install Callistics to self-report historical statistics about the number (and duration) of calls and text messages. To ensure the reliability of these self-reported measures, the duration of calls reported by participants was compared against the value extracted from the StayFree log files. In addition, the age, make, and model of participants' mobile devices were captured. The estimated dollar value of mobile devices was subsequently determined using the Kimovil website.^†

The survey received responses from 278 participants. However, 34 participants were excluded as their self-reported call duration did not match the data from the log files. Additionally, nine participants who answered one of the attention check questions incorrectly were removed from the analysis. The final sample consisted of 235 demographically diverse participants (101 women and 134 men), of whom 143 were college-educated and 92 were not. The distribution of participants' age and income is depicted in Figure 2.

FIG. 2.

The empirical distribution of participants' (a) age and (b) annual income.

Machine learning algorithms

To model the complex and nonlinear relationships between mobile usage attributes and sociodemographic characteristics of users, a range of machine learning algorithms were employed in addition to traditional linear models. Each algorithm has its own strengths and weaknesses.³⁵ The “No Free Lunch” theorem in machine learning states that no single algorithm is optimal for all classification or regression problems.³⁶ To account for this, multiple algorithms were used in this study. This helps ensure that the predictive power of the input attributes is not limited by the choice of a single algorithm.

Table 1 displays the machine learning algorithms employed in this study, including the hyperparameters (tuning parameters) for each algorithm and references to previous related studies that have used these algorithms. A brief description of the key features of each algorithm is also provided. Note that, except for the Ordinary Least Squares Regression and Logistic Regression models, which are specifically designed for regression and classification tasks, respectively, the algorithms listed in Table 1 can be applied to both classification and regression modeling. Interested readers can consult the study by Kuhn³⁷ for more information about these algorithms.

Table 1.

List of machine learning algorithms and their tuning parameters

Algorithm	Acronym	Past studies	Tuning parameters	Key features
Classification and Regression Trees	CART	²⁵	Regression Complexity Parameter, cp	Easily handles both numerical and categorical data and captures nonlinear relationships.
Gradient Boosting Machines	GBM	^25,31	Number of trees, n.trees; Depth, interaction.depth; Shrinkage parameter, shrinkage	Robust to outliers, effective with high-dimensional data, and improves accuracy iteratively.
K-Nearest Neighbors	KNN	^25,38	Number of neighbors, k	Intuitive instance-based learning method that adapts easily to local patterns.
Logistic Regression	LR	^17,19,22,25,37	—	Provides interpretable results and serves as a valuable baseline for comparison with more complex algorithms.
Neural Networks (MLP)	NN	^31,38	Weight decay, decay; Number of hidden layer units, size	Models complex nonlinear relationships and handles large amounts of data well.
Ordinary Least Squares Regression	OLS	^13,21	—	Provides interpretable results and serves as a valuable baseline for comparison with more complex algorithms.
Random Forests	RF	^25,31,37,38	Number of variables available for splitting, mtry; Number of trees, n.trees	Handles high-dimensional data, resists overfitting, and captures nonlinear relationships well.
Support Vector Machines—Linear Kernel	SVM-L	^19,22,23,37	Cost parameter, c	Handles high-dimensional data and is capable of feature selection.
SVM—Radial Basis Function Kernel	SVM-RBF	^25,22,31,38	Cost parameter, c; RBF kernel parameter, sigma	Flexible, captures complex relationships, effective with high-dimensional data.

MLP, Multi-Layer Perceptron.

To avoid overfitting, the data were split into two mutually exclusive subsets: a train set (80% of records) and a test set (20% of records).³⁵ Subsequently, using 10-fold cross-validation and the adaptive search algorithm proposed by Kuhn,³⁷ the optimal hyperparameter values were determined using only the train set. The performance of the tuned model was then evaluated on the test set. This process ensures that samples in the test set were not seen by the model during training or parameter tuning, which minimizes the risk of overfitting.³⁵ To ensure that the reported prediction performance was not specific to a single random test subset, the above process was repeated 10 times, each time with new random train and test partitions, and the performance results were averaged.³⁷
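The evaluation protocol described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the function names are hypothetical, the split is a plain (unstratified) random shuffle, and `evaluate` stands in for whatever tuning-and-scoring routine is used.

```python
import random

def train_test_split_indices(n, test_fraction=0.2, rng=None):
    """Shuffle 0..n-1 and return disjoint (train, test) index lists."""
    if rng is None:
        rng = random.Random()
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(train_indices, k=10):
    """Yield (fit, validation) index lists for k-fold CV inside the train set."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation

def repeated_holdout(n, evaluate, repeats=10, seed=0):
    """Average a test-set score over several random train/test partitions.

    `evaluate(train, test)` is a hypothetical callback: it should tune a model
    using `train` only (for example with k_fold_indices) and return the score
    obtained on `test`.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train, test = train_test_split_indices(n, rng=rng)
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

With 235 records and an 80/20 split, each repetition in this sketch holds out 47 records and tunes on the remaining 188, so the test records never influence hyperparameter selection.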

In addition, adjusted R² has been used to report the accuracy of regression models. For classification models, this study used area under the curve (AUC), balanced accuracy, sensitivity, specificity, and F1 score. AUC is a widely used metric for evaluating classification models, and it varies in the range [0, 1], where an AUC value of 1 represents a perfect classification model. Other measures are defined as:

$\text{Sensitivity} = \frac{TP}{TP + FN}$ (1)

$\text{Specificity} = \frac{TN}{TN + FP}$ (2)

$\text{Balanced Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$ (3)

$F_1 = \frac{2TP}{2TP + FP + FN}$ (4)

where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. The F1 score is the harmonic mean of sensitivity and precision and considers both false positives and false negatives. Once a model was constructed, the importance of the independent variables was quantified using the methodology proposed by Kuhn and Johnson,³⁵ with the variable importance values scaled to have a maximum value of 100. Finally, to help improve the performance and stability of the machine learning models, numeric input attributes were standardized by subtracting the mean and dividing by the standard deviation for each feature.
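As a concrete reading of Equations (1)–(4) and of the standardization step, the following sketch computes each measure from confusion-matrix counts and z-scores a single feature. The counts and function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def sensitivity(tp, fn):
    """Equation (1): share of actual positives that were detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Equation (2): share of actual negatives that were detected."""
    return tn / (tn + fp)

def balanced_accuracy(tp, tn, fp, fn):
    """Equation (3): mean of sensitivity and specificity."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))

def f1_score(tp, fp, fn):
    """Equation (4): harmonic mean of precision and sensitivity."""
    return 2 * tp / (2 * tp + fp + fn)

def standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical confusion counts, for illustration only.
tp, tn, fp, fn = 40, 30, 10, 20
print(sensitivity(tp, fn), specificity(tn, fp))
print(balanced_accuracy(tp, tn, fp, fn), f1_score(tp, fp, fn))
```

Balanced accuracy is useful here because several of the classification tasks are imbalanced (for example, few users fall above a high age or income threshold), and plain accuracy can look high simply by favoring the majority class.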

Results

This section first presents the variations of mobile usage patterns based on demographic attributes (i.e., exploratory analysis). Subsequently, the results related to the prediction of demographic attributes from mobile usage (i.e., predictive modeling analysis) are illustrated and discussed.

Variations of mobile usage by demographics

This section presents results that address RQ1 by revealing differences in mobile phone usage among different sociodemographic groups. Table 2 displays the distribution of participants' sociodemographic characteristics and their mobile usage attributes. Two-sample t-tests were performed to assess the discrimination power of each mobile usage attribute with regard to the four demographic attributes (age, income, gender, and education).
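To illustrate the kind of two-sample comparison involved, the sketch below computes a t statistic and approximate degrees of freedom from group summary statistics. It assumes the unequal-variance (Welch) form, which the text does not specify, and the function name is hypothetical. The example contrasts the mean ages of two groups reported in Table 2 (0–3 calls received daily vs. 10+ calls received daily); the text does not state which group comparisons underlie the significance markers, so this pairing is purely illustrative.

```python
from math import sqrt

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Mean age, SD, and n for two Table 2 groups (illustrative pairing).
t, df = welch_t(42.8, 11.9, 104, 33.4, 10.8, 24)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Converting the statistic to a p-value requires the t distribution's cumulative distribution function, which is not part of the Python standard library; a statistics package would normally supply it.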

Table 2.

Distribution of participants' sociodemographic information based on various mobile usage attributes

Attribute	Value	n (%)	Age		Income ($000s)		Gender		Education
Attribute	Value	n (%)	Mean	SD	Mean	SD	Female	Male	College	No college
All participants	—	235 (100)	39.8	11.2	44.3	27.9	42.9%	57.1%	60.9%	39.1%
Average number of calls received daily	0–3 Calls	104 (44.3)	42.8^***	11.9	37.7^***	25.2	46.2%	53.8%	62.5%	37.5%
	4–6 Calls	77 (32.8)	39.3	11.2	47.5	27.6	46.8%	53.2%	61.0%	39.0%
	7–10 Calls	30 (12.8)	34.7^***	9.4	51.7^*	24.4	36.7%	63.3%	53.3%	46.7%
	10+ Calls	24 (10.2)	33.4^***	10.8	52.8	34.5	25.0%^**	75.0%	62.5%	37.5%
Average duration of incoming calls	<1 Minute	35 (14.9)	43.8^*	14.0	40.0	25.7	34.3%	65.7%	57.1%	42.9%
	1–5 Minutes	85 (36.2)	39.3	11.1	43.1	25.0	38.8%	61.2%	68.2%	31.8%
	5–10 Minutes	75 (31.9)	40.2	10.4	50.0^**	32.7	44.0%	56.0%	54.7%	45.3%
	10–15 Minutes	24 (10.2)	37.5	9.9	43.5	22.2	58.3%^*	41.7%	54.2%	45.8%
	15+ Minutes	16 (6.8)	35.5	11.4	33.7^**	21.1	56.2%	43.8%	68.8%	31.2%
Average number of calls made daily	0–3 Calls	153 (65.1)	41.6^***	11.5	40.0^***	25.3	42.5%	57.5%	56.9%^*	43.1%
	4–6 Calls	58 (24.7)	37.3^*	11.4	51.1^**	27.5	44.8%	55.2%	60.3%	39.7%
	6+ Calls	24 (10.2)	32.6^***	9.5	55.5^*	35.2	41.7%	58.3%	87.5%^***	12.5%
Average duration of outgoing calls	<1 Minute	10 (4.3)	47.8^*	13.7	53.1	26.9	20.0%^*	80.0%	50.0%	50.0%
	1–5 Minutes	80 (34.0)	39.9	11.4	44.1	26.6	40.0%	60.0%	73.8%^***	26.2%
	5–10 Minutes	97 (41.3)	39.4	10.9	47.1	27.9	38.1%	61.9%	51.5%^**	48.5%
	10–15 Minutes	28 (11.9)	41.5	11.2	40.7	32.2	67.9%^***	32.1%	50.0%	50.0%
	15+ Minutes	20 (8.5)	35.6^**	9.9	32.5^***	20.4	60.0%	40.0%	55.0%	45.0%
Average number of text messages received daily	0–1 Messages	20 (8.5)	50.0^***	14.7	41.0	30.7	40.0%	60.0%	65.0%	35.0%
	2–5 Messages	67 (28.5)	42.2^**	11.9	38.7^**	26.5	44.8%	55.2%	62.7%	37.3%
	6–10 Messages	44 (18.7)	40.4	10.4	44.4	30.1	43.2%	56.8%	61.4%	38.6%
	11–20 Messages	55 (23.4)	37.3^**	9.2	50.1^**	24.0	41.8%	58.2%	61.8%	38.2%
	20+ Messages	49 (20.9)	34.8^***	7.9	46.8	28.5	42.9%	57.1%	55.1%	44.9%
Average number of text messages sent daily	0–1 Messages	26 (11.1)	48.9^***	14.6	41.0	28.6	42.3%	57.7%	69.2%	30.8%
	2–5 Messages	61 (26.0)	41.6	11.4	38.1^**	26.3	45.9%	54.1%	57.4%	42.6%
	6–10 Messages	46 (19.6)	41.3	10.6	46.8	30.1	39.1%	60.9%	73.9%^**	26.1%
	11–20 Messages	55 (23.4)	36.8^**	9.4	47.2	23.5	41.8%	58.2%	56.4%	43.6%
	20+ Messages	47 (20.0)	34.7^***	7.7	48.3	30.1	44.7%	55.3%	53.2%	46.8%
Number of contacts on device	<30	98 (41.7)	42.9^***	12.5	36.4^***	24.0	51.0%^**	49.0%	52.0%^**	48.0%
	30–60	54 (23.0)	39.3	10.1	43.6	25.1	33.3%^*	66.7%	57.4%	42.6%
	60–100	47 (20.0)	37.3^**	8.1	51.0^*	27.1	42.6%	57.4%	72.3%^**	27.7%
	100–200	23 (9.8)	36.3	11.4	55.9^**	28.4	43.5%	56.5%	78.3%^**	21.7%
	200+	13 (5.5)	34.3^**	11.0	61.9	42.7	23.1%^*	76.9%	69.2%	30.8%
Device value	<$350	63 (26.8)	45.6^***	12.2	36.1^***	23.9	49.2%	50.8%	44.4%^***	55.6%
	$350–$450	55 (23.4)	39.3	8.9	40.1	25.4	34.5%	65.5%	61.8%	38.2%
	$450–$550	33 (14.0)	38.6	12.3	45.7	28.7	45.5%	54.5%	63.6%	36.4%
	$550–$650	37 (15.7)	36.9^**	9.4	52.8^**	26.6	35.1%	64.9%	67.6%	32.4%
	$650+	47 (20.0)	35.9^**	10.7	52.4^**	31.2	48.9%	51.1%	74.5%^**	25.5%
Device age	<6 Months	33 (14.0)	37.2	13.0	46.1	25.9	30.3%^*	69.7%	60.6%	39.4%
	6 Months–1 year	55 (23.4)	40.1	10.2	45.6	29.5	40.0%	60.0%	50.9%	49.1%
	1–2 Years	81 (34.5)	39.1	11.4	46.3	27.2	48.1%	51.9%	60.5%	39.5%
	2–3 Years	42 (17.9)	40.5	11.5	42.6	28.9	40.5%	59.5%	69.0%	31.0%
	3+ Years	24 (10.2)	45.1^**	10.8	35.2^*	24.4	54.2%	45.8%	70.8%	29.2%

The ^***, ^**, and ^* refer to the 1%, 5%, and 10% levels of statistical significance, respectively.

SD, standard deviation.

The results show that the number of incoming and outgoing calls decreases with age and that those receiving three or fewer calls daily had a notably lower annual income and were older on average (42.8 years) compared with those receiving more than 10 calls per day (33.4 years). The majority of those who received more than 10 calls per day were men (75.0%). College-educated users were found to make and receive more calls than non-college-educated users. The average duration of outgoing calls had significant discrimination power with regard to age, income, and education. Women engaged in longer outgoing calls, and excessively long calls were associated with lower income. The results also show positive correlations between income and the number of exchanged text messages and a negative correlation between age and the number of text messages. Participants who received more than 20 messages daily had an average age of 34.8 years, whereas those who received fewer than 2 messages had an average age of 50.0 years.

The number of contacts saved on users' mobile devices, which can serve as a measure of social network size,³⁸ showed statistically significant correlations with age (negative), income (positive), and education (positive). For example, participants with 100–200 saved contacts had an average income roughly 54% higher than those with fewer than 30 saved contacts ($55.9k vs. $36.4k).

The specifications of mobile devices were also linked to users' demographics, supporting previous research.¹⁸ Specifically, device price was negatively correlated with users' age and positively correlated with their income and education level. A positive correlation between device age and user age was also noted, indicating that younger users tend to upgrade their devices more frequently. The results also suggested that men upgrade their devices more often, with 69.7% of users with devices <6 months old being men.

Beyond call and text message patterns, it is known that various demographic segments use their mobile devices differently.¹⁸ Table 3 presents the correlation coefficients (C) among age, income, and the daily time spent in four usage categories: talking and texting, social media, web browsing, and media streaming. Along with the additional nine attributes presented in Table 2, these attributes serve as the input features utilized in this study. The results suggest positive correlations among the different usage categories (C between 0.245 and 0.521). As expected, age is negatively correlated with time spent on the different usage categories, with social media having the strongest negative correlation (C = −0.226). The correlations, however, are notably weaker for income. The observed positive correlation between income and average time spent on media streaming is also not surprising, considering that a sizable portion of streaming content is delivered via paid subscription channels.

Table 3.

Correlations between four key mobile usage attributes (in daily minutes) and age and income

	Mean	SD	1	2	3	4	5	6
1. Talking and texting	34.664	32.345	—
2. Social media	40.386	40.983	0.327	—
3. Web browsing	42.380	41.143	0.245	0.428	—
4. Media streaming	43.031	50.244	0.369	0.283	0.521	—
5. Age	39.821	11.223	−0.208	−0.226	−0.207	−0.201	—
6. Income	44.160	27.943	0.044	−0.074	−0.053	0.157	−0.03	—
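For readers who want to reproduce a table of this kind, the sketch below computes a correlation coefficient between two variables. It assumes the Pearson product-moment coefficient, which the text does not name explicitly, and the sample values are hypothetical rather than taken from the study data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: user ages and daily minutes on social media.
ages = [22, 35, 41, 58, 63]
social_media_minutes = [95, 60, 42, 20, 15]
print(pearson(ages, social_media_minutes))  # negative: usage falls with age
```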

Figure 3a shows the distribution of daily mobile application usage for men and women, with the mean values indicated by vertical dashed lines and p-values from two-sample t-tests of mean differences. On average, women spend more time talking, texting, and using social media. Figure 3b illustrates the impact of education, where college-educated users are shown to spend more time browsing the web and using social media compared with non-college-educated users. A closer examination of the results reveals that excessive mobile usage for talking and texting (over 150 minutes per day) is dominated by non-college-educated users.

FIG. 3.

Distribution of the time spent on four different mobile applications by (a) users' gender and (b) education.

Predicting sociodemographic information from mobile usage data

To examine how accurately each of the four considered sociodemographic attributes can be predicted from users' mobile usage (i.e., RQ2), a range of regression and classification models were trained. Starting with age and income as the first two demographic attributes, Figure 4 presents the average values of adjusted R² for various regression algorithms used to predict users' age (Fig. 4a) and income (Fig. 4b) from their mobile attributes, with the standard deviation shown as error bars around the averages. According to the results, mobile attributes could explain 43.8% of variations in users' age (Gradient Boosting Machines model) and 36.3% of variations in their income (Random Forests model). The analysis of standard deviations does not indicate any unusual patterns for the models considered.
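Adjusted R² penalizes the ordinary R² for the number of predictors, which matters when a modest sample is paired with many input features. The sketch below uses the standard textbook definition; the function names are hypothetical, and the sample size and predictor count passed in are left to the caller because the text does not report the values used in each evaluation.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjust R^2 for model size: 1 - (1 - R^2)(n - 1) / (n - p - 1)."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need n_samples > n_predictors + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical example: R^2 of 0.5 with 101 samples and 10 predictors.
print(adjusted_r_squared(0.5, 101, 10))
```

Because the adjustment grows as the predictor count approaches the sample size, the same raw R² yields a lower adjusted value on a small test set than on a large one.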

FIG. 4.

Comparison of different regression models used to predict (a) users' age and (b) users' income from their mobile usage and device specification data.

The results in Figure 4 suggest that mobile usage attributes alone may not accurately predict users' exact age or income, as a significant percentage of variations remains unexplained in the regression models. However, in many practical scenarios, determining if a user is above a certain age or has income above a set threshold might be sufficient (e.g., for personalized services or targeted ads). Accordingly, regression modeling can be transformed into a classification task. Figure 5 displays the performance of different classification models in predicting whether participants were above or below a certain age threshold, with the best performing algorithm for each threshold highlighted by the bar color. For example, if the age threshold is set at 30 years, the models will predict if the participant is older or younger than 30 years, with the older group being designated as the positive class. Note that as the threshold varies, the classes being defined change, making each threshold a different classification task.
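The conversion from a regression target to a family of binary classification tasks can be sketched as follows. The ages are hypothetical, the helper names are illustrative, and treating values strictly above the threshold as the positive class is an assumption, since the text does not say how values equal to the threshold were handled.

```python
def binarize(values, threshold):
    """Label 1 (positive class) when a value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

def positive_share(labels):
    """Fraction of records in the positive class."""
    return sum(labels) / len(labels)

# Hypothetical ages; each threshold defines a separate classification task.
ages = [19, 24, 31, 38, 45, 52, 61, 67]
for threshold in (30, 40, 50, 60):
    labels = binarize(ages, threshold)
    print(threshold, labels, positive_share(labels))
```

Printing the positive share alongside each threshold makes the changing class balance visible, which is one reason balanced accuracy is reported rather than plain accuracy when thresholds are compared.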

FIG. 5.

Various performance measures of classification models to predict users' age class using their mobile usage attributes. The older group was considered the positive class.

The choice of age threshold depends on the specific application, such as a mobile retailer wanting to target users below 30 for a specific product. By changing the threshold, one can investigate if there is a particular demographic group with unique mobile usage patterns that make them easier to detect by the classification algorithm, which addresses RQ3. As can be seen from Figure 5, the classification accuracy was generally high, especially for identifying users younger than 30 years or older than 60 years (accuracy >0.8). This is perhaps because, compared with middle-aged users, both young and old users had more distinctive mobile usage patterns, which made it easier for the classification algorithms to identify them. Moreover, by comparing specificity and F1 scores, the classification models appeared to provide symmetric performance with respect to false-positive and false-negative misclassification errors.

Similarly, classification models were trained to segment users into two groups of low income and high income. Figure 6 shows the classification performance for different income thresholds. According to the figure, identifying users with an annual income lower than $20k was most accurate (balanced accuracy = 0.76). The accuracy and other performance metrics, however, started to fall as the income threshold increased. This decreasing trend continued until the balanced accuracy bottomed out at 0.617 when the income threshold was $60k.

FIG. 6.

Various performance measures of classification models to predict users' income class using their mobile usage attributes. The high-income group was considered as the positive class.

The classification performance then started to rise but decreased again after the income threshold exceeded $80k. Considering RQ3 again, this trend suggests that low-income users had a more distinctive mobile usage pattern that was detected by the classification models. The same was true for incomes above $60k, where the models identified distinctive mobile usage patterns of users with higher income. The performance decline after $80k, however, suggests that the mobile usage pattern did not change significantly once users' income exceeded $80k.

Coming back to RQ2, Figures 7 and 8 show various performance metrics for predicting users' gender and education as the other two sociodemographic attributes considered in this study. The results suggested a high accuracy when predicting gender (accuracy = 0.871) and a moderate accuracy when predicting education status (accuracy = 0.721).

FIG. 7.

Various performance measures of classification models used to predict users' gender. Female users were considered the positive class.

FIG. 8.

Various performance measures of classification models used to predict users' education. College-educated users were considered the positive class.

Finally, to quantify the contribution of each mobile usage attribute (i.e., RQ4), Figure 9 shows the predictive power of the various mobile attributes for each of the users' sociodemographic characteristics. As shown, the average time spent on social media, the number of contacts on the device, the average daily time spent web browsing, and talking and texting were the most important variables (i.e., relative variable importance [RVI] = 100) for predicting users' age, income, education, and gender, respectively. The time spent on social media was also a strong predictor of gender (RVI = 83.49) and education (RVI = 83.15). Similarly, the number of contacts on the device had significant predictive power for inferring users' education (RVI = 69.26) and gender (RVI = 53.48). Moreover, the results showed the strong discriminative power of the number of outgoing calls in establishing users' income (RVI = 95.17) and education level (RVI = 63.57).

FIG. 9.

The predictive power of various mobile attributes in predicting users' sociodemographic information.
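To make the RVI scale concrete, the sketch below rescales raw importance scores so that the most important attribute receives a value of 100. This is one common convention and is an assumption here: the exact scaling used to produce Figure 9 is not stated in this section. The attribute names and raw scores are hypothetical.

```python
def relative_importance(raw_scores):
    """Rescale raw importance scores so the largest score maps to 100.

    Assumption: scores are non-negative and at least one is positive.
    """
    top = max(raw_scores.values())
    return {name: score / top * 100.0 for name, score in raw_scores.items()}


# Hypothetical raw importance scores, for illustration only.
raw = {
    "social_media_time": 0.42,
    "num_contacts": 0.21,
    "web_browsing_time": 0.105,
}

rvi = relative_importance(raw)
for name, value in sorted(rvi.items(), key=lambda item: -item[1]):
    print(f"{name}: RVI = {value:.2f}")
```

Under this convention, RVI values are comparable across attributes within one prediction task, but not across tasks, because each task is rescaled by its own top-ranked attribute.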

In summary, the results of the study indicate that there were significant variations in mobile usage patterns across the four sociodemographic attributes considered, with age and income being the most notable factors (RQ1). In regard to RQ2, the results suggest that gender can be predicted with a high degree of accuracy (balanced accuracy = 0.862) using mobile usage data, whereas the prediction of users' education level proved to be more challenging (balanced accuracy = 0.719). In addition, while the regression models were not very accurate in determining a user's exact age and income, the classification models achieved acceptable accuracy in classifying users as having an age or income above or below a specific threshold.

As for RQ3, the results revealed that users younger than 30 years and those older than 60 years had more distinct mobile usage patterns, making it easier for the classification models to identify them, a pattern that was also observed for low-income users. Finally, the study revealed that the average time spent on social media, the number of contacts on the device, the average time spent daily web browsing, and the frequency of talking and texting were the most predictive attributes for inferring the user's age, income, education, and gender, respectively (RQ4).

Discussion and Implications

The practical implications of the study's results are examined in this section. It is essential to highlight that, irrespective of the number of mobile user records used for training the machine learning models, once these models are trained and demonstrate sufficient predictive performance, they can be applied to any number of mobile users for predicting their sociodemographic attributes. This renders the framework of this study applicable to an extensive mobile user base. However, this requires the seamless collection of their mobile usage attributes. Practically, there are two primary sources, namely mobile apps and call detail records (CDRs). Mobile apps often request access to various attributes of a user's device and usage patterns in exchange for personalized services or targeted advertising.^6,39 These data can include information such as call logs, device attributes, and settings, as well as calls and usage statistics.

CDRs, maintained by mobile service providers, also contain a wealth of information about a user's phone usage. CDRs typically contain information about both voice calls and SMS messages, as well as data sessions, such as internet usage. The specific information included in a CDR may vary depending on the mobile service provider, but most CDRs now contain detailed information about data usage, including the type of data used (e.g., browsing, streaming, downloading), the amount of data used, and the duration of the data session. Regardless of how mobile usage attributes are collected, their strong predictive power to infer users' demographics carries several significant implications for firms, users, policymakers, and regulatory agencies.

For end users, the results of this study have implications in terms of both the potential benefits of personalized services and product recommendations⁶ and the privacy concerns that arise when using mobile applications. Receiving personalized service and product recommendations offers advantages in terms of optimized resource utilization and a customized user experience, potentially resulting in enhanced customer satisfaction.⁸ However, the ability to predict users' sociodemographic information based on their mobile usage attributes highlights the potential privacy risks associated with collecting this information through app permissions.^7,8,40

To mitigate these risks, users should educate themselves about the privacy policies of the apps they use and the permissions they grant, and actively monitor their privacy settings to ensure that their personal information is protected. Furthermore, users should consider using privacy-focused applications that provide more transparency and control over the collection, storage, and use of their mobile usage data. By being proactive about privacy, users can enjoy the benefits of personalized services and product recommendations, while also ensuring the protection of their sensitive information. Finally, with enhanced prediction accuracy, the inferred sociodemographic attributes of users can be regarded as an additional layer of security for user authentication in the future, especially when combined with other authentication methods.^41–44

Firms can improve their targeted advertising and provide personalized services to mobile users by inferring demographic information from their mobile usage data. However, this also raises concerns about privacy and the potential for users to abandon apps due to the number or sensitivity of requested permissions. According to the literature, the degree of a firm's information transparency, as perceived by users, plays a central role in users' decision-making process when they are faced with a personalization–privacy paradox.⁸

More specifically, greater perceived information transparency has been shown to make users more likely to accept being profiled in return for receiving more personalized service, although the perceived value of the personalized service, the perceived risk resulting from the personal data collected, and users' demographic attributes were also shown to influence the decision.⁷ Some of the key components of consumer privacy protection strategies include creating transparent procedures to inform users which personal data are collected from them, how they are collected, and for what applications they are used.⁴⁵

With respect to mobile applications, however, most current designs are unfortunately characterized by a certain level of ambiguity. For example, the Android permission system has a low granularity for some permissions;¹⁰ consequently, the phrases presented to users during the installation of an app are sometimes generic and ambiguous and do not precisely declare the purpose of the permissions.¹¹ As a result, providing a more transparent declaration of the purpose for requesting different app permissions should become a priority for firms. This is especially true as mobile users become more privacy conscious. For instance, in a survey conducted by Pew Research, 90% of users indicated that having transparent information about how an app would collect and use their personal data was “very” or “somewhat” important to them when deciding to install the app.³³

Policymakers and regulatory agencies have the responsibility of establishing and enforcing policies to safeguard citizens' privacy and prevent discrimination based on demographic information. Considering the high predictive power of mobile usage metadata as shown in this study, these entities should ensure that such data are treated as sensitive personal information. Unfortunately, the regulations for treating CDR information are inconsistent across the world. For example, in the United States, CDR data are not protected under the Fourth Amendment, but there are extensive measures to protect CDR data under the EU General Data Protection Regulation (GDPR).

Moreover, using mobile usage data may lead to direct or indirect discrimination against users based on their sociodemographic attributes. This can be especially problematic when such data are used as inputs in algorithmic decision-making processes. For example, mobile usage metadata is now considered to be a primary alternative data source for credit risk assessment for the unbanked population who do not have a formal credit standing. Prior research has found that mobile usage metadata can be used to infer users' credit risk.³

However, if mobile usage data can be used to simultaneously predict sociodemographic attributes, lenders and regulatory agencies should take active measures to ensure that the integration of such data does not lead to discrimination against consumers based on their demographic attributes. This is especially true as the results of this study suggest that mobile usage metadata has a strong predictive power to identify elderly (Fig. 5) and low-income (Fig. 6) users, who are generally more vulnerable than other groups.

Although this research has produced some useful insights, its limitations should also be taken into account. Most importantly, it should be noted that in this research, the study participants were all residents of the United States. However, past research suggests that cross-cultural factors can influence how users use and interact with their mobile devices.⁴⁶

In addition, the study participants were all Android users, and iOS users were not included in this research. While Android users account for more than two thirds of smartphone users and are demographically much more diverse, there are notable demographic differences between the two groups, especially in terms of their income.⁴⁷ Therefore, a follow-up study is recommended to validate this study's results in other settings and cultures and to include iOS users. Finally, the strong predictive power of aggregated mobile attributes in establishing users' demographics, as shown in this article, calls for further investigations that also consider additional attributes such as location and mobility as well as the time and regularity of usage patterns.

Conclusions

This study demonstrated the strong predictive power of mobile usage attributes in determining users' sociodemographic characteristics, including age, gender, income, and education. The results showed that the accuracy of classifying users based on gender was 0.87, and for education was 0.72. Moreover, while the regression models could not accurately predict a user's exact age and income, the classification models were able to classify users based on whether their age or income was above or below a certain threshold with acceptable accuracy. The predictions were more accurate for younger users under 30 (accuracy = 0.81) or older users over 60 (accuracy = 0.80) for age and for users with an annual income below $20k (accuracy = 0.75) for income.

The most discriminative variables for gender were the average time spent talking and texting, followed by social media use. Female users were found to spend more time on these activities. For age, the most predictive attribute was the average time spent daily on social media, followed by media streaming and text messages. Results showed negative correlations between age and the time spent on these activities. For income, the number of contacts on the mobile device was the most predictive attribute, followed by daily calls made and the device's value. This suggests that a larger social network is linked to higher income. The most influential predictor of education was the average time spent on web browsing, followed by social media use and number of contacts. Results showed that non-college-educated users spend less time on these activities and have fewer contacts.

This study also discussed the practical applications and privacy implications of these findings. The ability to infer users' demographics from mobile attributes allows for targeted advertising and personalized services but can also pose privacy risks. The study highlights the sensitivity of CDRs and the need for protection. Recommendations were made for users, firms, policymakers, and regulatory agencies.

Footnotes

Authors' Contributions

R.R.: Conceptualization (lead), writing—original draft (equal), data collection and numerical analysis (lead). G.X.: Conceptualization (support), writing—original draft (equal), review and editing (equal). I.J.A.: Conceptualization (support), writing—review and editing (equal).

Author Disclosure Statement

No competing financial interests exist.

Funding Information

No funding was received for this article.

References

1. Deng, Kanthawala, Meng, et al. Measuring smartphone usage and task switching with log tracking and self-reports. Mobile Media Commun, 2019; 7(1):3–23.

2. Winnick. Putting a finger on our phone obsession. Dscout 2016. Available from: https://dscout.com/people-nerds/mobile-touches [Last accessed: August 8, 2023].

3. [...], Zhao, Zhou, et al. A new aspect on p2p online lending default prediction using meta-level phone usage data in China. Decis Support Syst, 2018; 111(4):60–71.

4. Ellis, Davidson, Shaw, et al. Do smartphone usage scales predict behavior? Int J Hum Comput Stud, 2019; 130:86–92.

5. Galor, Galor, Penmetsa. The role of user privacy concerns in shaping competition among platforms. Inform Syst Res, 2018; 29(3):698–722.

6. Banovic, Krumm. Warming up to cold start personalization. Proc ACM Interact Mob Wearable Ubiquitous Technol, 2018; 1(4):1–13.

7. Zhu, Ou CXJ, van den Heuvel WJAM, et al. Privacy calculus and its utility for personalization services in e-commerce: An analysis of consumer decision-making. Inform Manag, 2017; 54(4):427–437.

8. Awad, Krishnan. The personalization privacy paradox: An empirical evaluation of information transparency and the willingness to be profiled online for personalization. MIS Q, 2006; 30(1):13–28.

9. Hilton, Rainie. Facebook Algorithms and Personal Data. Technical Report. Pew Research: Washington, DC; 2019.

10. Matsumoto, Kouichi. A proposal for the privacy leakage verification tool for android application developers. In: Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication, ICUIMC’13; 2013; 54; pp. 1–8, New York, USA.

11. Kelley, Cranor, Sadeh. Privacy as part of the app decision-making process. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems; 2013; pp. 3393–3402, New York, NY, USA.

12. Blumenstock, Cadamuro, On. Predicting poverty and wealth from mobile phone metadata. Science, 2015; 350(6264):1073–1076.

13. Bekkerman, Zmirli, Kirkpatrick. Mining the thin air—For understanding of urban society. Big Data, 2019; 7(4):262–275.

14. Frias-Martinez, Virseda. On the relationship between socio-economic factors and cell phone usage. In: Proceedings of the Fifth International Conference on Information and Communication Technologies and Development; 2012; pp. 76–84, New York, NY, USA.

15. Petrovčič, Slavec, Dolničar. The ten shades of silver: Segmentation of older adults in the mobile phone market. Int J Hum Comput Interact, 2018; 34(9):845–860.

16. Busch, Hausvik, Ropstad, et al. Smartphone usage among older adults. Comput Hum Behav, 2021; 121:106783.

17. Ceccucci, Peslak, Kruck, et al. Does gender play a role in text messaging? Issues Inform Syst, 2013; 14(2):186–194.

18. Kim, Briley, Ocepek. Differential innovation of smartphone and application use by sociodemographics and personality. Comput Hum Behav, 2015; 44:141–147.

19. Eagle, Blumenstock. Mobile divides: Gender, socioeconomic status, and mobile phone use in Rwanda. In: Proceedings of the 4th ACM/IEEE International Conference on Information and Communication Technologies and Development; 2010; 6; pp. 1–10, New York, NY, USA.

20. Sarraute, Blanc, Burroni. A study of age and gender seen through mobile phone usage patterns in Mexico. In: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining; 2014; pp. 836–843.

21. Haverila MJ. Behavioral aspects of cell phone usage among youth: An exploratory study. J Int Consum Market, 2012; 24(3):203–220.

22. Ogletree, Fancher, Gill. Gender and texting: Masculinity, femininity, and gender role ideology. Comput Hum Behav, 2014; 37:49–55.

23. Gillick, Blumenstock, Eagle. Who's calling? Demographics of mobile phone use in Rwanda. In: AAAI Spring Symposium: Artificial Intelligence for Development; 2010.

24. Seneviratne, Seneviratne, Mohapatra, et al. Your installed apps reveal your gender and more! SIGMOBILE Mob Comput Commun Rev, 2015; 18(3):55–61.

25. Witayangkurn, Arai, Kanasugi, et al. Understanding user attributes from calling behavior: Exploring call detail records through field observations. In: 12th International Conference on Advances in Mobile Computing and Multimedia; 2016; pp. 95–104, New York, USA.

26. Al-Zuabi, Jafar, Aljoumaa. Predicting customer's gender and age depending on mobile phone data. J Big Data, 2019; 6(1):18.

27. Malmi, Weber. You are what apps you use: Demographic prediction based on user's apps. In: Tenth International AAAI Conference on Web and Social Media; 2016.

28. Kimbrough, Guadagno, Muscanell, et al. Gender differences in mediated communication: Women connect more than do men. Comput Hum Behav, 2013; 29(3):896–900.

29. Kim, Park. Mobile phone purchase and usage behaviours of early adopter groups in Korea. Behav Inform Technol, 2016; 33(7):693–703.

30. Naboulsi, Fiore, Ribot, et al. Large-scale mobile traffic analysis: A survey. IEEE Commun Surv Tutor, 2016; 18(1):124–161.

31. Shah, Sengupta. Designing mobile based computational support for low-literate community health workers. Int J Hum Comput Stud, 2018; 115:1–8.

32. Sundsoy. Can mobile usage predict illiteracy in a developing country? Technical Report arXiv:1607.01337, arXiv, 2016.

33. Atkinson. Apps Permissions in the Google Play Store. Technical Report. Pew Research: Washington, DC; 2015.

34. Aguinis, Villamor, Ramani. MTurk research: Review and recommendations. J Manag, 2021; 47(4):823–837.

35. Kuhn, Johnson. Applied Predictive Modeling, Vol. 26. Springer: New York; 2018.

36. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Comput, 1996; 8(7):1341–1390.

37. Kuhn. Building predictive models in R using the caret package. J Stat Softw, 2008; 28(5):1–26.

38. Chae. Reexamining the relationship between social media and happiness: The effects of various social media platforms on reconceptualized happiness. Telemat Inform, 2018; 35(6):1656–1664.

39. Colot, Baecke, Linden. Toward decision support for telecom external data monetization: A study of the value of network- and personality-based metrics for third-party businesses. Big Data, 2022; 10(2):115–137.

40. Hand DJ. Aspects of data ethics in a changing world: Where are we now? Big Data, 2018; 6(3):176–190.

41. Wang, Wang, He, et al. Birthday, Name and Bifacial-security: Understanding Passwords of Chinese Web Users. In: USENIX Security Symposium; 2019; pp. 1537–1555.

42. Wang, Wang. The emperor's new password creation policies. In: European Symposium on Research in Computer Security. Springer, Cham; 2015; pp. 456–477.

43. Wang, Wang, Cheng, et al. Quantum2FA: Efficient quantum-resistant two-factor authentication scheme for mobile devices. IEEE Trans Depend Secure Comput, 2023; 20(1):193–208.

44. [...], Wang, Morais. Quantum-safe round-optimal password authentication for mobile devices. IEEE Trans Depend Secure Comput, 2022; 19(3):1885–1899.

45. Lim, Woo, Lee, et al. Consumer valuation of personal information in the age of big data. J Assoc Inform Sci Tech, 2018; 69(1):60–71.

46. Chong AYL, Chan FTS, Ooi. Predicting consumer decisions to adopt mobile commerce: Cross country empirical examination between China and Malaysia. Decis Support Syst, 2012; 53(1):34–43.

47. Shaw, Ellis, Kendrick, et al. Predicting smartphone operating system from personality and individual differences. Cyberpsychol Behav Soc Network, 2016; 19(12):727–732.