Abstract
With the advent of social media, our online feeds increasingly consist of short, informal, and unstructured text. Instagram is one of the largest social media platforms, containing both text and images. However, most of the prior research on text processing in social media is focused on analyzing Twitter data, and little attention has been paid to text mining of Instagram data. Moreover, many text mining methods rely on training data annotated manually by humans, which in practice is both difficult and expensive to obtain. In this paper, we present methods for weakly supervised text classification of Instagram text. We analyze a corpus of Instagram posts from the fashion domain and train a deep clothing classifier with weak supervision to classify Instagram posts based on the associated text.
With our experiments, we demonstrate that in the absence of annotated training data, using weak supervision to train models is a viable approach. With weak supervision we were able to label a large dataset in hours, something that would have taken months to do with human annotators. Using the dataset labeled with weak supervision in combination with generative modeling, an
Introduction
Text processing is present in our everyday life and powers several important utilities, such as machine translation, web search, personal assistants, and user recommendations. Today, social media is one of the largest sources of text, and while social media fosters the development of new types of text processing applications, it also brings its own set of challenges due to the informal language.
Text in social media is unstructured and has a more informal and conversational tone than text from conventional media outlets [3]. For instance, text in social media is rich in abbreviations, hashtags, emojis, and misspellings.
Traditional Natural Language Processing (NLP) tools are designed for formal text and are less effective when applied to informal text from social media [28]. This is why recent research efforts have tried to adapt NLP tools to the social media domain [13]. Moreover, methods at the intersection of NLP and machine learning applied to social media have been successful in information extraction [29], classification [19], and conversation modeling [27].
The results of previous work are not sufficient for our purposes for the following reasons: (1) many results rely on access to massive quantities of annotated data, something that is not available in our domain; (2) most of the work is focused on Twitter, with little attention to image-sharing platforms like Instagram; and (3) to the best of our knowledge, no prior assessment of complex, multi-label classification in social media has been made.
Acquisition of annotated data that is accurate and can be used for training text classification models is expensive, especially in a shifting data domain like social media. In this research, we explore the boundaries of text mining methods that can be effective without this type of strong supervision. In particular, we evaluate text classifiers trained with a programmatic type of supervision referred to as weak supervision.
Even if we assume that the main research results from Twitter will be useful in our research on Instagram, we still should take into account several important differences between the two domains. The most prevalent discrepancies are that Instagram is an image-sharing medium while Twitter is a micro-blogging medium, and that Twitter has a character-limit per tweet.

In this paper we investigate methods for training a text classifier without manually labeled data (strong supervision). Instead, we use algorithmic labeling of a large text corpus from the fashion domain on Instagram, and use that dataset to train a deep model to classify Instagram posts into fashion categories.
In this paper, we focus on the task of classifying Instagram posts into clothing categories based on the associated text (Fig. 1). The paper is an extension of a previous conference paper [15]. The work presented here is part of a larger research project that aspires to improve the state of the art in fashion recommendation by employing activities in social media and using data crossing multiple domains in the recommendations [17]. In future work, the text processing methods presented in this paper will be integrated with computer vision models in the project.
Like other consumption-driven industries, the fashion industry has been influenced by the emergence of social media. Social media is progressively getting more attention from fashion brands and retailers as a source for detecting trends, adapting user recommendations, and for marketing purposes [4]. To give an example, the image-sharing platform Instagram has become a popular medium for fashion branding and community engagement [1]. This is why extraction and classification of fashion attributes on Instagram is an important task for several modern applications working with user recommendation and detection of fashion trends.
In addition to hosting images, Instagram contains large volumes of user-generated text. Specifically, an Instagram post can be associated with an image caption written by the author of the post, with comments written by other users, and with "tags" in the image that refer to other users. Despite Instagram being a platform rich in text, little prior work has paid attention to the promising applications of text mining on Instagram. Our case study on Instagram posts in the fashion community revealed that the text often indicates the clothing in the associated image. An example of this is given in Fig. 2. We believe that there is value in the text on Instagram that is currently unutilized. For example, the text on Instagram can be mined and used for predictive modeling and analytics.

An Instagram post from the fashion community.
Our contributions in this paper are: (1) an empirical study of Instagram text; (2) an evaluation of word embeddings trained on Instagram text; and (3) a novel pipeline for multi-label clothing classification of the text associated with Instagram posts using weak supervision and the data programming paradigm [26].
Our empirical study provides one of the few available studies on Instagram text and shows that the text is noisy, that the text distribution exhibits the long-tail phenomenon, and that comment sections on Instagram are often multi-lingual. Moreover, experimental results demonstrate that the FastText algorithm for training word embeddings [5], which can capture the morphology of words, is also suited for the noisy type of text that can be found in social media. Finally, we train a deep text classifier using weak supervision and data programming. The classifier achieves an
The rest of this paper is structured as follows. In Section 2 we describe related work, and in Section 3 we present our approach to the problem. In Section 4 we summarize the experimental setup and Section 5 contains the results from our evaluations as well as our interpretation of the results. Lastly, Section 6 includes our conclusions and suggestions for future research directions.
Our research extends prior work on learning domain specific word embeddings (Section 2.1) and weakly supervised text classification (Section 2.2) working with informal text from social media.
Learning domain specific word embeddings
The practice of constructing word embeddings targeted at a specific domain is a relatively new field of research, as most prior research has focused on constructing generic word embeddings that are not optimized for a specific domain. Prior work that resembles our effort in learning word embeddings for the fashion domain includes: (1) [30], which introduces word embeddings for the construction domain; (2) [23], which compares embeddings specific to the biomedical domain with off-the-shelf embeddings;
The term off-the-shelf embeddings refers to pre-trained embeddings available online, trained on generic text rather than domain-specific text.
Similar to our experiments, (1) used a domain-specific dataset for intrinsic evaluation of embeddings. However, they did not tune the hyperparameters, and their evaluation focused on a single set of off-the-shelf word embeddings. In summary, their results indicate that off-the-shelf embeddings performed comparably to domain-specific embeddings on several tasks.
In (2), an extrinsic evaluation of domain-specific embeddings was made. The evaluation compared embeddings trained on biomedical text with off-the-shelf embeddings on a classification task. Without detailed tuning of hyperparameters, the results indicated that the domain-specific embeddings only gave a modest improvement over the off-the-shelf ones.
In (3) it was found that performance of embeddings can be notably improved by tuning the hyperparameters, rather than sticking to the default values. Moreover, their results indicate that tuning of hyperparameters can be contradictory between intrinsic and extrinsic evaluations.
Our research differs from (1)–(3) by targeting noisy text from social media, rather than newswire text. The work in (4)–(6) is similar to ours in that it uses word embeddings in the social domain, but it differs from our work in other aspects. In (4), word embeddings are trained using a corpus of tweets, but the study lacks a comparison between the embeddings trained on tweets and generic embeddings trained on newswire text. (5) reports an improved accuracy when training embeddings directly on Unicode descriptions of emojis, instead of learning the embeddings on a large collection of tweets. Their motivation is, similar to ours, that off-the-shelf embeddings lack representations for many tokens that are commonplace in social media. Finally, (6) uses hashtags as a supervision signal when training word embeddings. This results in embeddings that are similar based on hashtags, which loses the fine-grained word meanings required for the classification task studied in this paper.
For the task of classifying Instagram text, our research builds primarily on results from supervised machine learning. The success of this paradigm of machine learning has traditionally been coupled to annotated datasets. Notable results in supervised text classification are [18] and [9], both of which differ from our research in that they assume access to a large annotated text corpus for training the classifier.
More recently, research on exploiting unlabeled data for training has received attention. For certain tasks, such as learning word embeddings, completely unsupervised learning is enough. For other tasks, a blend of supervised and unsupervised learning is appropriate. Semi-supervised and weakly-supervised learning are two approaches to learning with a limited amount of supervision, while having access to an abundant amount of unlabeled data.
Semi-supervised learning
In semi-supervised learning, even though it is assumed that a smaller amount of labeled training data are available, the goal is to combine that data with a larger portion of unlabeled data. To train with unlabeled data, semi-supervised learning makes use of assumptions about the data, such as the data distribution. With the right assumptions, semi-supervised learning algorithms are able to relate the unlabeled data with the labeled data to drive the learning process.
Weakly-supervised learning
Weakly supervised learning methods rely on availability of weak-supervision signals and do not assume that any labeled data are available. A weak supervision signal can for instance be in the form of an external API, a crowdworker, or a domain heuristic. As opposed to strong supervision, weak supervision seldom has perfect accuracy or coverage.
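To make the notion of a domain heuristic concrete, the following is a minimal sketch of two keyword-based labeling functions that may abstain. The vote convention (+1, -1, 0), the keyword list, and all function names are assumptions made for illustration; they are not the labeling functions used in this paper.

```python
from typing import Callable, List

# Vote convention assumed for this sketch (not taken from the paper):
# +1 = class present, -1 = class absent, 0 = abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

# Hypothetical keyword list for a single clothing class.
DRESS_KEYWORDS = {"dress", "dresses", "gown"}


def lf_dress_keyword(text: str) -> int:
    """Vote POSITIVE if a dress keyword occurs as a token, otherwise abstain."""
    tokens = {tok.strip("#.,!?").lower() for tok in text.split()}
    return POSITIVE if tokens & DRESS_KEYWORDS else ABSTAIN


def lf_dress_hashtag(text: str) -> int:
    """Vote POSITIVE if any hashtag contains the substring 'dress'."""
    for tok in text.split():
        lowered = tok.lower()
        if lowered.startswith("#") and "dress" in lowered:
            return POSITIVE
    return ABSTAIN


def apply_lfs(lfs: List[Callable[[str], int]], texts: List[str]) -> List[List[int]]:
    """Build the matrix of votes: one row per text, one column per function."""
    return [[lf(text) for lf in lfs] for text in texts]


def coverage(lf: Callable[[str], int], texts: List[str]) -> float:
    """Fraction of texts on which the labeling function does not abstain."""
    votes = [lf(text) for text in texts]
    return sum(v != ABSTAIN for v in votes) / len(votes)
```

Neither function is expected to be fully accurate or to cover every post, which is why several such signals are typically combined.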
Of particular relevance to our research is the data programming paradigm presented in [26], which has achieved promising results on several text classification tasks. Data programming has been applied to binary and multinomial text extraction and classification tasks [25,26] and is currently being used within Google for training various classifiers [2]. To the best of our knowledge, it has been applied neither to multi-label classification tasks nor to social media text.
Methodology
In Section 3.1 we outline how our analysis of the Instagram corpus was performed. Section 3.2 describes our second contribution, which is an evaluation of word embeddings trained on Instagram text. Finally, Section 3.3 presents the pipeline we used to train a deep text classifier using weak supervision. The code for the implementations and the trained embeddings are publicly available.
Of special interest in our study was to elucidate how the Instagram text differs from newswire text, as it affects the choice of processing methods. We analyzed a corpus of Instagram posts by measuring the fraction of online-specific tokens, the number of Out-Of-Vocabulary (OOV) words, the number of languages in the corpus, and the text distribution.
Learning domain-specific word embeddings
Considering the peculiarity of Instagram text compared to newswire text, we have surveyed the benefit of training new word embeddings for the fashion domain on Instagram. We have performed an evaluation of embeddings trained on our corpus of Instagram posts using Word2vec, GloVe, and FastText, with varying hyperparameters. Parameters that were not tuned in the evaluation were kept at their default values, listed in Table 1.
Default parameters used when training word embeddings

A pipeline for weakly supervised text classification of Instagram posts.
To examine the difference between domain-specific word embeddings and generic word embeddings, the embeddings trained on the Instagram corpus were compared with state-of-the-art off-the-shelf embeddings provided by Google, Facebook, and Stanford's NLP group. Specifically, the baselines were: (1)
This section presents a pipeline for weakly supervised text classification to predict clothing items in Instagram posts. The pipeline is visualized in Fig. 3 and includes steps devoted to labeling a dataset with weak supervision (Section 3.3.3), combining weak labels with data programming to produce probabilistic labels (Section 3.3.2), and training a discriminative model using the probabilistic labels (Section 3.3.5).
The classification task
Although multiple classification tasks are of interest in our research, such as brand classification and fabric classification, we focus initially on the clothing item classification problem. This task is a multi-label, multi-class classification problem with 13 classes. The classes are as follows: dresses, coats, blouses & tunics, bags, accessories, skirts, shoes, jumpers & cardigans, jeans, jackets, tights & socks, tops & t-shirts, and trousers & shorts.
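As an illustration of the multi-label setting, a post can be represented by a 13-dimensional multi-hot vector with one entry per class. The sketch below assumes this encoding and an ordering of the classes chosen only for the example; the helper names are hypothetical.

```python
# The 13 clothing classes; the ordering is an assumption of this sketch.
CLASSES = [
    "dresses", "coats", "blouses & tunics", "bags", "accessories", "skirts",
    "shoes", "jumpers & cardigans", "jeans", "jackets", "tights & socks",
    "tops & t-shirts", "trousers & shorts",
]
CLASS_INDEX = {name: i for i, name in enumerate(CLASSES)}


def encode_labels(present):
    """Map a collection of class names to a multi-hot vector of length 13."""
    vector = [0] * len(CLASSES)
    for name in present:
        vector[CLASS_INDEX[name]] = 1
    return vector


def decode_labels(scores, threshold=0.5):
    """Return every class whose score reaches the threshold.

    Unlike single-label classification, any number of classes
    (including none) can be returned for one post.
    """
    return [CLASSES[i] for i, s in enumerate(scores) if s >= threshold]
```

For example, a post showing a dress and a pair of shoes is encoded with ones at the positions of "dresses" and "shoes" and zeros elsewhere.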
Data programming
With the data programming paradigm [26], weak supervision is encoded with labeling functions. A labeling function is a black-box function
Formally, a labeling function
In Λ, an empirical probability
Finally, after estimating α and β, the parameterized generative model is used to engender probabilistic (confidence-weighted) training labels
Weak supervision for fashion attributes in Instagram posts
We used seven labeling functions to label a dataset of 30K Instagram posts with fashion attributes. The purpose of using multiple functions is that we expect that the combination of functions will improve the accuracy of the supervision compared to what each function in isolation would provide. The functions are as follows.
In the original data programming paper, a binary classification scenario is studied and it is assumed that labeling functions are binary [26]. We have extended the data programming paradigm from binary classification to the multi-label setting. To make use of the data programming paradigm for multi-label classification, we model the labeling process with one generative model for each class. With this approach, the combination of generative models is able to represent separate accuracy estimates of the labeling functions for each class.
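The per-class combination can be sketched as follows. This is a simplified illustration, not the implementation used in this paper: it assumes conditionally independent labeling functions with votes in {+1, -1, 0}, a uniform class prior, and accuracies that are already known, whereas in data programming the parameters are estimated from the unlabeled data. Under these assumptions the coverage parameters cancel, and the posterior log-odds are the vote-weighted sum of the log-odds of the accuracies. A majority-vote function is included only for contrast.

```python
import math


def posterior_positive(votes, accuracies):
    """P(class present | votes) for one class.

    votes: list of +1 / -1 / 0 (abstain), one per labeling function.
    accuracies: list of accuracies in the open interval (0, 1).
    Assumes independent labeling functions and a uniform prior.
    """
    log_odds = sum(
        vote * math.log(acc / (1.0 - acc))
        for vote, acc in zip(votes, accuracies)
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


def majority_vote(votes):
    """Unweighted baseline: the sign of the vote sum, 0 on a tie."""
    total = sum(votes)
    return 1 if total > 0 else (-1 if total < 0 else 0)


def probabilistic_labels(votes_per_class, accuracies_per_class):
    """One independent generative model per class.

    Both arguments hold one list per class, so the same labeling
    function can have a different accuracy for each class.
    """
    return [
        posterior_positive(votes, accs)
        for votes, accs in zip(votes_per_class, accuracies_per_class)
    ]
```

With votes `[1, -1]` and accuracies `[0.9, 0.6]`, majority voting yields a tie, while the accuracy-weighted posterior favors the more accurate function.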
Formally, the generative model
First, the labeling functions are applied to the unlabeled data
For the discriminative model, we have used a variant of the Convolutional Neural Network (CNN) model for text classification presented in [18]. This model was chosen as it is established as one of the best performing text classifiers for short texts. However, nearly any model could have been used. The only requirement is that the loss function can be modified to work with probabilistic labels.
The neural network architecture in [18] consists of an embedding input layer, a convolutional layer, and a fully-connected layer of softmax or sigmoid output units. Moreover, the architecture employs max-over-time pooling to detect keywords in the input. The architecture is illustrated in Fig. 4 and defined mathematically below.

CNN for multi-label text classification.
Embedding and convolutional layers Let
The first layer is the embedding layer, that serves as a lookup step, where each word
Next is the convolutional layer that performs two-dimensional convolutions over the sequence of embeddings. The convolutional layer consists of n filter windows
Let
Output and loss layer After the convolutional layer, max-over-time pooling [8] is applied to the feature maps. The max-over-time pooling yields new subsampled features
The original architecture in [18] is designed for the multi-class setting and uses a softmax output layer. We have extended the network to the multi-label setting working with probabilistic labels by switching out the loss function with a noise-aware loss function for multi-label classification. The loss is defined as the cross-entropy over sigmoid outputs with respect to probabilistic labels (Eq. (6)).
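A minimal sketch of such a loss is given below: the cross-entropy between sigmoid outputs and probabilistic labels, computed from logits in a numerically stable form. Averaging over the classes is an assumption of this sketch and may differ from the normalization in Eq. (6); the function name is hypothetical.

```python
import math


def noise_aware_loss(logits, prob_labels):
    """Cross-entropy over sigmoid outputs with probabilistic targets.

    logits: raw network outputs, one per class.
    prob_labels: target probabilities in [0, 1], one per class.

    For a logit z and target p, the per-class term
        -(p * log(sigmoid(z)) + (1 - p) * log(1 - sigmoid(z)))
    is rewritten as max(z, 0) - z * p + log(1 + exp(-|z|))
    to avoid overflow for large |z|.
    """
    total = 0.0
    for z, p in zip(logits, prob_labels):
        total += max(z, 0.0) - z * p + math.log1p(math.exp(-abs(z)))
    return total / len(logits)  # mean over classes (assumption)
```

When every target is exactly 0 or 1, the expression reduces to the ordinary binary cross-entropy, so hard labels are a special case.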
Model analysis There are a few key concepts that characterize the CNN architecture for text classification. Most prevalent is the assumption that a small number of tokens in the input are decisive for classification. This assumption is expressed both with the max-over-time pooling and by using ReLU activations, which have a sparsity effect on the network. Moreover, since all the neurons inside a single filter share weights, each filter can be seen as a feature-learner that looks for a certain feature in the input. As weights are not shared across filters, increasing the number of filters can allow the network to learn to detect more distinct features in the input. The training procedure will cause the filters to learn different features to minimize the loss. How many filters to use depends on the task. If too many filters are used, some filters typically become so-called "dead filters" that never activate and always output zeros.
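The interplay of a single filter, the ReLU activation, and max-over-time pooling can be sketched in plain Python as follows. The function name, the absence of padding, and the handling of inputs shorter than the filter are assumptions of this sketch rather than details of the model used in this paper.

```python
def conv_relu_max_over_time(embeddings, filt, bias=0.0):
    """Apply one filter over a token sequence and pool over time.

    embeddings: list of T token vectors, each of dimension d.
    filt: list of h weight vectors (the filter window), each of dimension d.
    Returns a single feature: the largest ReLU activation over all
    windows of h consecutive tokens, or 0.0 if the text is shorter
    than the filter window.
    """
    h = len(filt)
    feature_map = []
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        activation = bias + sum(
            w * x
            for filt_row, emb_row in zip(filt, window)
            for w, x in zip(filt_row, emb_row)
        )
        feature_map.append(max(0.0, activation))  # ReLU
    return max(feature_map) if feature_map else 0.0
```

Because only the maximum survives the pooling, the output depends on the single best-matching window, which mirrors the assumption that a few tokens decide the classification. A filter whose activations are negative for every window outputs zero, which is the behavior of a dead filter.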
This section outlines the experimental setup that was used to produce the results presented in the following section (Section 5). The experiments include data analysis of an Instagram corpus, a comparison of domain-specific embeddings with pre-trained embeddings, and training of a deep text classifier for clothing classification of Instagram posts.
Data
Experiments were conducted using textual data from the Instagram platform. This section describes the datasets in more detail.
Instagram corpus
The empirical study of Instagram text was conducted on a provided dataset, consisting of Instagram posts from a community of users in the fashion domain. The data are in the form of a corpus consisting of image captions, user comments, and usertags associated with each post. In total, the corpus consists of 143 accounts, 200K posts, 9M comments, and 62M tokens, out of which 2M are unique. The numbers were computed before any pre-processing, except applying the NLTK [21] TweetTokenizer and removing user-handles.
Training dataset
When training classifiers, we used a subset of the Instagram corpus consisting of 30K Instagram posts annotated with weak labels produced by the labeling functions described in Section 3.3.3.
Evaluation dataset
For evaluation purposes, a smaller, manually annotated, dataset of 200 Instagram posts was used. The annotation was a collective work by four participants in our research group. Noteworthy is that the truth labels are based on the image associated with the text. In that sense, the evaluation is unfavorable for the text-based analysis. Since the labels are decided by the image, certain posts can have labels that cannot be inferred from the text alone, degrading the measured performance of the developed text classification models.
Data analysis
The data analysis was conducted on the entire Instagram corpus. To measure the fraction of emojis, hashtags, and user-handles, the NLTK [21] TweetTokenizer was used to tokenize the text, and regular expressions were applied to extract the desired tokens. To quantify the amount of OOV words, two vocabularies were used: the Google-news vocabulary [14] and GNU aspell v0.60.7. Finally,
Word embeddings
To find the best set of embeddings for the Instagram domain, we trained a large set of embeddings on the Instagram corpus using a variety of hyperparameters and algorithms. The embeddings were evaluated using intrinsic evaluation that included a comparison with off-the-shelf embeddings.
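An intrinsic evaluation on a word similarity dataset can be sketched as follows. The sketch assumes that the score is the Spearman rank correlation between cosine similarities and human ratings, which is a common choice for this task but is an assumption here, as is the decision to skip pairs that contain an out-of-vocabulary word. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(embeddings, pairs):
    """pairs: iterable of (word1, word2, human_score).

    Pairs with a word missing from the embeddings are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores)
```

Skipping out-of-vocabulary pairs favors embeddings with small vocabularies, so the number of evaluated pairs should be reported alongside the score.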
Evaluation of word embeddings
Three datasets were used to evaluate trained word embeddings on the word similarity task: (1) WordSim353, introduced by [12], is a dataset consisting of 353 word pairs with accompanying relatedness scores; (2) SimLex-999, presented in [16], is a dataset of 999 word pairs and similarity labels; and (3) FashionSim, an open-source
This section describes the setup used to train and evaluate text classifiers.
Evaluation
Classifiers were evaluated after training by freezing the weights of the models and comparing the models’ predictions to the annotated dataset.
CNN models and baselines
Two classifiers were evaluated. The CNN model described in Section 3.3.5 was trained using the training dataset annotated with weak labels described in Section 4, and where labels had been combined into probabilistic labels using the data programming framework prior to training (
The CNN models were compared against a human benchmark (
Hyperparameters
Limited hyperparameter tuning was done prior to the experiments. We used 128 filter windows of size 3, 4, and 5, and a mini-batch size of 256. Moreover we used a vector dimension in the embedding layer of 300 with randomly initialized embeddings updated as part of training. For regularization we used a dropout keep probability of 0.7 and a
Results and discussion
In this section, we present our experimental results.
What characterizes Instagram as a source of text?
This section presents results from exploratory data analysis of the Instagram corpus.
Lexical noise measurements
Table 2 contains statistics that capture the distinctive properties of the Instagram corpus compared with newswire text. Removing all online-specific tokens (hashtags, user-handles, emojis, URLs) results in an OOV fraction of 0.30 based on the aspell dictionary, which can be compared with the 0.25 obtained by [3] on a Twitter corpus using the same pre-processing and dictionary.
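The OOV measurement can be sketched as follows. This is a simplified stand-in for the procedure described in this paper: it splits on whitespace instead of using the NLTK TweetTokenizer, it detects emojis through the Unicode category "So" (which also matches other symbols), and it takes the vocabulary as a plain Python set. The patterns and function names are assumptions of this sketch.

```python
import re
import unicodedata

# Hashtags, user-handles, and URLs; the exact patterns are assumptions.
ONLINE_TOKEN = re.compile(r"^(#\w+|@[\w.]+|https?://\S+)$")


def is_emoji(token):
    """Approximate emoji detection via the Unicode 'Symbol, other' category."""
    return any(unicodedata.category(ch) == "So" for ch in token)


def is_online_specific(token):
    """True for hashtags, user-handles, URLs, and emoji tokens."""
    return bool(ONLINE_TOKEN.match(token)) or is_emoji(token)


def oov_fraction(text, vocabulary):
    """Fraction of remaining tokens that are missing from the vocabulary.

    Online-specific tokens are removed first. Tokens are lowercased,
    and punctuation attached to a word is not stripped, so such
    tokens count as out-of-vocabulary in this simplified version.
    """
    words = [t.lower() for t in text.split() if not is_online_specific(t)]
    if not words:
        return 0.0
    missing = [w for w in words if w not in vocabulary]
    return len(missing) / len(words)
```

For the text "luv this #ootd @anna 😍 dress" and the vocabulary {"this", "dress"}, three tokens remain after filtering and one of them ("luv") is out of vocabulary.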
Measurements of lexical noise in the corpus
Although all Instagram posts in the corpus are from English accounts, the comment sections are often multi-lingual. Applying
Text distributions
The number of comments associated with Instagram posts varies. Data analysis indicates that the distribution of comments and the amount of text associated with posts exhibit the long-tail phenomenon. The frequencies of the number of comments in Instagram posts roughly follow a power-law relationship (Fig. 5). Some posts have no comments at all, while other posts have a few thousand comments. The mean lengths of captions and comments in the corpus are 29 and 6 tokens, respectively.
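One simple way to check for a power-law relationship is to fit a straight line to the frequencies on a log-log scale, where a power law with exponent a appears as a line with slope -a. The sketch below uses ordinary least squares for this; it is a rough diagnostic under that assumption, not the analysis performed in this paper, and the function name is hypothetical.

```python
import math


def fit_log_log_slope(posts_by_comment_count):
    """Least-squares slope of log(frequency) against log(comment count).

    posts_by_comment_count: dict mapping a number of comments to the
    number of posts with that many comments. Entries with zero comments
    or zero posts are skipped because their logarithm is undefined.
    """
    points = [
        (math.log(k), math.log(f))
        for k, f in posts_by_comment_count.items()
        if k > 0 and f > 0
    ]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return numerator / denominator
```

On synthetic frequencies generated as 1000 * k^(-2), the fitted slope is approximately -2. On real data the sparse tail tends to distort a least-squares fit, so the slope should be read as indicative only.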

The text distribution in the corpus.
In comparison with measurements on Twitter corpora [3], text from Instagram is just as noisy based on our measurements (Table 2). Notable is also the high diversity of languages occurring in the comment sections on Instagram and the short length of comments (mean length measured to be 6 tokens).
The long-tail distribution of text on Instagram can be explained with the follower count of the post author and the preferential attachment theory [31]. As an Instagram post attracts a lot of comments, it will get a larger spread on the Instagram platform. This causes a snowball effect, where a post that already has many comments will be more likely to attract even more comments.
Comparison of word embeddings for the Instagram domain
In this section, word embeddings trained on the Instagram corpus are examined. The experiments include a comparison between Instagram embeddings and off-the-shelf embeddings, as well as hyperparameter tuning of embeddings trained on Instagram text.

Intrinsic evaluation on the word similarity task (p-value

Hyperparameter tuning on the FashionSim evaluation dataset (p-value
Off-the-shelf embeddings outperform the domain-specific embeddings on general evaluation datasets such as SimLex-999 [16] and WordSim353 [12]. However, on the FashionSim evaluation dataset the relationship is reversed (Fig. 6). To exemplify, the embeddings
What are suitable hyperparameters?
It can be observed that FastText and Word2vec are highly dependent on the hyperparameter settings, while GloVe is stable in comparison (Fig. 7). FastText demonstrated the best results on the given task. With FastText, the top accuracy was achieved with Skip-gram and a context window size of 2. A prevalent trend in the results is that CBOW performed better with larger window sizes, as opposed to Skip-gram, which achieved the highest results with smaller context windows. Additionally, a substantial boost in accuracy was observed when increasing the vector dimension from 50 to 100, and then a less significant increase when further raising the dimension up to 300. When the dimension is increased above 300, there is a diminishing return of increased accuracy relative to the increased dimension.
Discussion
When comparing the state-of-the-art algorithms for training word embeddings, FastText embeddings yielded the most accurate semantics on the intrinsic evaluation. FastText explicitly models the morphology of words by incorporating information about subwords in the embeddings, which is useful for languages that are rich in morphology. According to our results, FastText is also suited for noisy text, as can be found in social media. This is not surprising, as social media language can be characterized as containing a large vocabulary with many rare words, where the subword embeddings can enhance generalization between words.
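The effect of subword information on noisy text can be illustrated with character n-grams. The sketch below extracts n-grams of length 3 to 6 from a word wrapped in the boundary markers "<" and ">", as in the FastText model [5], and compares the n-gram sets of two words. The Jaccard overlap is used here only as an illustrative proxy, and the function names are hypothetical; FastText itself sums learned n-gram vectors rather than comparing sets.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Set of character n-grams of the word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)
```

A misspelling such as "dresss" shares many n-grams with "dress" (for example "<dr", "dre", and "res"), so a model built on n-grams can relate the two even if the misspelling is rare, whereas "dress" and "shoes" share no n-grams of these lengths.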
Deep text classification with weak supervision
This section outlines the results from the experiments with deep text classification of Instagram text using weak supervision provided by the labeling functions from Section 3.3.3.
The data programming paradigm versus majority voting
Table 3 compares results from the CNN model trained with weak labels combined through majority voting with results from the same model trained with probabilistic labels obtained with data programming. The data programming approach achieves the best
The average performance from three training runs

Accuracy of labeling functions in generative models.
Fig. 8 visualizes the relative accuracy between labeling functions that was learned by the generative models in
Error analysis
A part of the error is attributable to the disparity between the labels in the test set and the text. As the ground truth is determined based on the image contents of the Instagram post, there is an inherent error when information is lacking in the text. This is also evident from the relatively low human benchmark on the task (0.60

Statistics on the dev and train set during training of the
After training the
Discussion
Considering that not all clothing items can be inferred from the text and that the human benchmark on the task is 0.60, the achieved
When combining the labels by using generative models, rather than majority voting, an increase of six

Heatmap of a sample Instagram text, where a higher heat indicates a larger logit in the trained
In [26] it is assumed that labeling functions are binary. We propose to extend the base model to the multi-label scenario by learning a separate generative model for each class. In our experiments, the relative accuracy of labeling functions differed between classes, strengthening our belief that learning separate generative models for each class is useful.
In this paper we presented the first empirical study of Instagram text that we are aware of. Moreover, we evaluated domain-specific embeddings trained on Instagram text and presented a novel pipeline that utilizes weak supervision to train a deep classifier to recognize fashion clothing based on text from Instagram posts.
With weak supervision, we were able to label a large dataset in hours, something that would have taken months to do with human annotators. The weak supervision signals were combined with the data programming paradigm, which makes for a proof-of-concept of the paradigm in a new domain. Moreover, the original model for binary classification was extended to the multi-label setting by learning a separate generative model for each class.
The results demonstrate that the text on Instagram is just as noisy as has been reported in studies on Twitter text, that the text distribution has a long tail, and that the comment sections on Instagram are multi-lingual. Our experiments also indicate that there is a mismatch between text in social media and off-the-shelf embeddings trained on newswire text. We also confirmed that weak supervision is a viable approach for training deep models with unlabeled data, achieving human-level performance on the classification task. In all measures, combining weak supervision signals with the proposed combination of generative models outperformed a baseline that uses majority voting.
In future work we plan to combine the text mining methods presented in this paper with a model that analyzes the image contents associated with the text in Instagram posts.
