The dynamics and potentials of big data for audience research

Abstract

This article considers the future of audience research in an era of big data. It does so by interrogating the dynamics and potentials of the big data paradigm in an era of user-generated content and commercial exploitation. In this context, it is proposed that the major dynamics of big data are a conjoint application of numerology and alchemy in the information age. On this basis, the potentials of new data techniques are addressed in light of the critical gap between audience data and the audiences themselves.

Keywords

alchemy, audience research, big data, cultural studies, data mining, numerology, predictive analysis, social media

Electronics digested a mix of digital numerology and alchemy, collecting metadata as input to pattern recognition algorithms, breathing life into a machine capable of doing what men and women spent a century trying to do

Carvalko (2016: 121).

Audience research has entered the era of ‘big data’, a paradigm emerging from two decades of innovative and aggressive information management. In the context of this data arcadia, the need to reconsider our epistemological premise may be less apparent than the sudden expansion of the methodological toolkit, but it is no less pressing. With this in mind, my proposition is that we need to consider both the dynamics and the potentials of such techniques. Media dynamics are determined in this instance as the presumptions, imperatives and motives that shape the paradigm itself, along with the interaction of institutional forces at play in the utilisation of audience data. With the ground set by those dynamics, the media potentials of big data procedures emerge from both the applications and the possible outcomes of these techniques. Since audience research is social research, these outcomes have to be understood outside of the computational process, that is, in terms of likely implications for the operators, clients and subjects of big data. Indeed, it is fair to assume that audience researchers will themselves occupy any one, or perhaps all, of these classes in the course of their work. In establishing the dynamics of big data, I make the assertion here that the two primary motivations of this paradigm stem from two longstanding preoccupations of human science, namely numerology and alchemy. Having, I hope, established this claim, I will seek to establish the potentials that are becoming apparent due to the increasing centrality of audience data in academic research.

User-generated content to self-replicating automata

YouTube (as with other instances of YouMedia) is an exemplar of the digital economy precisely because it is a medium without any content of its own. It relies upon its users to supply the value of the service, and to do so freely and without payment (Andrejevic, 2011). User-led systems have become predominant in an era where media production technologies are cheap and media distribution has been ‘liberated’ from both expert intermediaries and costs. With the digital future looking highly retrospective and/or mundane at the level of content, the primary logics of the digital media industries have centred upon capturing the commercial value of the World Wide Web in other ways. The answer to the free content conundrum appears to have been found in the unique properties of the Internet as a medium of record, where every action produces its own data point. At the turn of the millennium, the introduction of user tracking into web browsers and the individual addressability of devices both made it theoretically possible (and perhaps, more critically, permissible) to identify individual users. Subsequently, the ‘Web 2.0’ project was bankrolled by an Internet-advertising boom for two reasons: first, it created a vast body of detailed consumer profiles and click trails and, second, individual users could be more effectively targeted with advertising on the basis of this information. These motivations were reflective of a broader shift in commercial logic, where credit card companies, Internet service providers (ISPs) and online retailers all woke up to the secondary usage potentials of their transaction records.

The commercial value of a network system increases exponentially as the user base expands. In the 2000s, Google’s growing monopoly on basic search facilities left the company uniquely placed to aggregate profiles of users’ viewing habits (Halavais, 2009). This capability allowed Google to aggressively market Internet advertising and rapidly become one of the world’s largest companies (Levy, 2011). At the other end of the scale, user tracking allowed for people to be picked out within that vast space, matching universal reach with individual addressability. The ‘personal’ look and feel of the concomitant digital culture nonetheless rest upon a faux individuation, given the vast bulk of actual usage remains centred upon mass-produced commodities and universal formats. This unprecedented standardisation of YouMedia is itself fundamental, since the underlying economics of Web 2.0 rest upon the real-time application of targeted marketing via automated algorithms. For this to work, it is essential that user-input variables operate in a recognisable series. That is why the prevailing form of ‘identity’ in digital culture has been imposed through the clumsy mash-up of the fan survey, dating ad and curriculum vitae (CV) formats. Facebook, like YouTube, effectively categorises human beings on the basis of the films, music and sport teams that they endorse, while also working hard to establish where we were born, where we work and who we know (Cheney-Lippold, 2011). Since this is a global system, everyone in the world must conform to this (let us face it) ridiculous template of human identity as best they can.

Despite the obvious shortcomings of commercially defined digital profiling, it remains the case that sufficient scale confers numerical value to even the most clumsy survey instrument. In that respect, the global scale at which contemporary social media platforms operate is inconceivably vast (two billion is an abstraction tangible only, perhaps, to a seasoned quant). Nobody has ever had so much data, nor had it in an individuated and relational structure so purposefully designed for correlation. Unlike an earlier era, where market researchers would examine consumers in direct relation to a given product, the vast datasets of the social media era are purposefully intended to automatically correlate past and potential behaviours in relation to all or any products, activities or actions. This constitutes something of a tipping point in the informational economy. It became possible only because processing power increased exponentially (as anticipated by Moore’s law). It became possible only because data storage became inconceivably vast and cheap. It became possible only because a user-led medium of record removed the need to employ millions of data entry clerks to capture the data. With this architecture in place, the engineers of Web 2.0 consciously initialised a chain reaction in the generation of data. The ultimate output of this vast experiment is a mode of informatics where particular logics of correlation can be deployed to create new knowledge from the raw material (see Zafarani et al., 2014). To further draw out the analogy between applied nuclear physics and informational physics, this is the point at which more energy is coming out of the process than is going into it.

The magic system

For Technorati, whose desired crop is readily comparable data, social information must be collected somewhat generically. By firmly anchoring user activities to a set of profiling systems, this vast audience is systematically captured as an informational commodity. The potentials of this commodity are largely determined by what we understand data to be, and this understanding has evolved in correspondence with the evolution of information technology (see Puschmann and Burgess, 2014). In ancient times, data were taken as a priori, as something given (whether that was a place, a thing or a point in time). With the rise of natural sciences, data were reinterpreted as an indexical record of an established fact. As these facts began to proliferate, and were applied to human subjects by social scientists, these records became the primary resource for governance. Emerging from this legacy, the binary logics of the computer revolution have subsequently redefined data as simply a unit of information and, thus, essentially an integer entered into or derived from an algorithmic process. In this context, the collection, analysis or manipulation of data becomes a mathematical exercise, regardless of what we want to do with our YouTube dataset on popular culture. For the purposes of the computational process alone, it is fundamentally irrelevant whether this information being unitised is about people or brightly coloured rocks. Nonetheless, in a dynamic system that relies upon large-scale inputs from its user base, the human contribution to the generative capacity of numerical data becomes highly significant.

For some years now, a whole series of propositions have been made regarding the potentials of directing these large numbers towards the resolution of mathematical problems (Howe, 2009; Shirky, 2008; Surowiecki, 2004). Mass participation through a global interactive system apparently realises one of the major aspirations of computer science: the advent of infinitely regenerative data. Billions of users continuously inputting numerical sequences that can then be used to generate an effectively infinite series of calculations promise a mathematical manifestation of Von Neumann’s (1966) self-replicating automata. Proposals to harvest the value of such large-scale participation for this purpose often have fantastic motivations, but this is not what I would define as numerology (e.g. Kurzweil, 2005). The numerological dimension of big data arises instead from the analysis of numerical trends in those inputs in order to make inferences about the future. Meteorology and stock market trading are established practices of this kind, largely determined by mathematical predictions derived from a comprehensive, but tightly defined, dataset. With the advent of Web 2.0, the contemporary convergence of computer science with the aspirations of advertisers has plotted a rapid extension of predictive analysis across the social domain. Aside from being a ‘magic system’, advertising is also a behavioural science interested in determining future choices of consumption. The promises held out in this regard by the proponents of big data have been sufficiently compelling to draw the interest of not only commercial actors but also national security and paramilitary agencies seeking to gain advantages from divination (McCue, 2006).

In that sense, the political economy of user-generated content is situated firmly within the zeitgeist engendered by the crisis of undue complexity. The exponential increases in the scale and velocity of information in the digital age have been simultaneously configured as cause, evidence and solution to this central dilemma. As the proliferation of data and interconnectivity overwhelm our capacity to conceptualise causality and make judgements regarding action and effect, we are inclined to seek solutions through procedural remedies made possible by the selfsame technologies. Addressing this crisis of calculation at the heart of the global financial system, Arjun Appadurai has drawn our attention to a distinctive ‘spirit of calculation’ in our times. Following Weber, Appadurai (2012) identifies the return of ‘magicality’ as a form of ‘coercive proceduralism’ afflicting financial markets (p. 8). This claim is made precisely on the basis of an algorithmic ‘chartism’ that plots the future through favourable data points and inculcates unthinking faith in ‘mechanical techniques of prediction with no interest in causal or explanatory principles’ (p. 9). The wider consequence of this mode of speculation is that the very notion of calculation becomes ‘a hazy amalgam of optimization, maximization, choice, quantification, prediction and agonistic individualism’ (p. 11). This is, of course, a near perfect description of the structural and expressive logics of the social media. Here too, as we begin to make routine assumptions about human behaviours on the basis of numerical trends, we are firmly engaged in numerology.

Digital waste to virtual money

While quantitative trading may be the nexus of magical speculation in our digital society, not all of the numerological practices associated with predictive analysis are imbued with a clear commercial ethos. Even for those that are, it may be worth establishing a spectrum of motivations amid the widespread collection of user data. Commercial media producers may collect data with the primary intention of aligning audience tastes with future programming, thereby simply extending the obsolete point-sampling methods of TV ratings to a universal scale. Those with engineering experience at social media brands make the case that their primary motivation is to keep improving user features in a market notorious for its fickle customers. Google’s use of user data to drive the technical development of new products extends this engineering logic to a broader applicability. Had these various motivations and processes remained discrete, smaller in scope and fewer in number, it seems unlikely that commercial surveillance would have become the pressing concern that it now is. A wilful opacity regarding the design and purpose of these various exercises has certainly not helped. It is telling, perhaps, that we live in an age where government collection of user data typically receives a far more robust public response than the same actions by, often more capable, commercial operators. Yet, again, there are similar explanations on offer. Twitter monitoring in Australia provides a means for politicians to take the public temperature on policy issues. India’s unique identification scheme (Aadhaar) is intended to reduce corruption and improve welfare systems. The Federal Bureau of Investigation’s (FBI) commissioning of a social media scraping infrastructure is intended to identify risks to public safety and national security.

Nonetheless, regardless of commercial interest or strategic intent, all data-mining schemes that rely on the aggregation and re-purposing of user data share a common intent to create new forms and sources of value. As a consequence, a market for data recycling has been firmly established through the on-sale of user data in many different forms. Those outraged by the privacy implications of what is essentially a world-spanning phone tap operation are commonly reminded of the old adage: ‘If it’s free, then you’re the product’. This was, of course, somewhat true of commercial television but selling slices of your attention span was perhaps more palatable than trawling through your emails (as both Facebook and Google are wont to do). When we are able to perceive the suite of YouMedia in aggregate, we quickly come to understand that it is not the audience, nor its time, but the data being produced about the audience that is often the actual commodity being realised in these transactions (Neef, 2014). Consequently, the users of online services are not simply a group of consumers but instead the intrinsic component of the integrated package of (largely fictitious) commodities that make the digital economy work (Polanyi, 1944; Tapscott and Williams, 2006). This intense mapping of the audience is rendered precisely in the differential gap between the modest book price (assets) and the eye-popping share valuations (capitalisation) of these companies. Their rapid rise makes it clear that the potentials for numerology emerging from the creation of such vast quantities of data inevitably inspire a profit motive (see Davenport, 2014).

In the era of social media, where extremely large volumes of personal data are being created, harvested and cross-referenced for the benefit of these secondary markets, it is immediately obvious that the waste products of everyday social interaction are being subjected to an alchemical transformation. I use the term alchemy because the primary motivation of the data-mining process is to create value through the aggregation of details which in their raw form are largely worthless. A base material is being transformed into gold, or at least mundane data are being transformed into a virtual currency unit. The unavoidable example is Mark Zuckerberg’s Facebook, where the content of the medium is the interpersonal communications of two billion people, as they exchange messages, photos, endorsements and contacts. This is the muck. The brass is provided upfront by institutional clients wanting to insert behavioural prompts into personal profiles or by a posteriori clients investing in the potentials of this vast data collection for numerological divination. For themselves, the users of this form of media certainly derive some social gains in terms of increasing the density of their social networks and their stores of social capital. For the angelic investors, however, the value of these services rests firmly on data gains. In both respects, the essential value of these online ‘products’ resides in the data that they collect, but data gains translate far more readily into financial gains, precisely because numbers tend to be attracted to each other.

In this context, we should never forget that the systematic capture of the Internet’s audience is a direct consequence of a universal media system. That is, the ambitions of Facebook and Google are only viable because we have already accepted the global domination of the relatively small number of IT giants who set and maintain protocols and operating standards for the World Wide Web. The ‘necessary convenience’ of these services consequently allows them to operate as gateways to a vast realm of human experience, thereby aggregating a majority of Internet users into the critical mass which allows for their secondary commercialisation. In this respect, they are closely aligned with the reconfiguration of the socio-technical interface via so-called ‘cloud’ computing, where concentrations of data are explicitly figured as resources to be monopolised. These vast stores of privately owned public data, softly branded as data farms, constitute the consolidation of estates within the digital economy (see Lanier, 2013). However, their commercial motivations go beyond the collection of rents, in that their business models are closely related to the numerological motivations associated with the accumulation of big data (see Schmidt and Cohen, 2014). Targeted advertising, however lucrative, is only a more sophisticated manifestation of America’s rejection of planned production in favour of guided consumption. As the data economy matures, however, it is the convergence with advances in numerology that is spurring the pursuit of new forms of digital alchemy, both of these revived ‘sciences’ being ambitious responses to a superabundance of user data.

Faith in big data

At one level of analysis, we could see the dynamics of big data being set by various forces, from IT companies, security agencies and venture capital down to the timeless purveyors of baldness cures and matchmaking for the lonely. Nonetheless, my argument here is that the dynamics of big data cannot be encapsulated solely by identifying or criticising the various forces of production. The overwhelming motives of user data algorithms are to predict the future and to make money, the very essence of capitalism, but they are not themselves ideological forces. More fundamentally, the primary motivations (the major dynamics if you will) of big data in this form are the conjoined pursuit of numerology and alchemy. In their daily conduct, these are vast technical processes with their own priesthoods and a daily ‘volunteer’ workforce that encompasses a substantial portion of the human species. Big data is unequivocally big, but what does need to be established at this point is the extent to which my claims over the dynamics of big data are supported by the proponents of its potentials. In their influential popular account, Mayer-Schonberger and Cukier (2013) illustrate their reading of the potentials of pervasive data collection through a series of distinctions, with the most central being between ‘digitisation’ and ‘datafication’ and between ‘causality’ and ‘correlation’.

For the first distinction, Mayer-Schonberger and Cukier offer the analogy of Amazon’s Kindle and Google Books. While Amazon has digitised books for the purposes of selling them as virtual books on their Kindle reader, Google has paired its massive book scanning programme with text recognition software to create a vast automated experiment in linguistic translation. In the latter case, the knowledge gained from parsing all the variants between different language editions of various works generates a new algorithmic capacity to provide translation services. Thus, the archive of print becomes adapted for new purposes and new forms of calculation. A new product, Google Translate, is derived from the historical efforts of human translators labouring with no comprehension of the subsequent aggregation, correlation and automation of their knowledge. This is not only a process of digitisation, as with Kindle, but also a process of datafication. Another good example would be Facebook, where the eagerness of users to share their digital photographs online, along with their singular capacity to recognise and list everyone they know, has provided a worldwide test bed for facial recognition software. Every time that we are halted by prompts to identify ‘friends’, we contribute to this systematic endeavour. This, again, is datafication, where the visual knowledge of billions of people becomes harvested for overarching purposes visible only to the architects of social media systems and their clients. In both cases, pre-existing data are being harvested and then re-purposed by an algorithm that creates new forms of value. By my reckoning, therefore, digitisation is a process of record, while datafication is an alchemical procedure.

It is the second distinction, between causality and correlation, that marks the explicit rupture between the epistemologies of science and numerology. Causality is the establishment of a likely reason for an event via a controlled experiment where all variables are accounted for. Such controlled experiments are inherently difficult when it comes to social science because of the complexity of social variables and the knowing nature of the subjects themselves (Benton and Craib, 2011). Nonetheless, the essence of positivism is the desire to know why something happens, to define what is and to establish answers through experiments that can be repeated with consistent results. Essentially, this is how ‘social facts’ are manufactured (be that about audiences or anything else). Correlation, by clear contrast, is about identifying patterns in various forms of information and comparing them to establish trends. A useful example would be the harvesting of Facebook ‘likes’ in order to predict future behaviour in elections and in sexual relationships, which Kosinski et al. (2013) claim to be able to do with 85% accuracy. Without having the slightest idea why people make those choices, a purely numerological correlation predicts behaviours on the basis of data patterns without worrying a jot about causality. As Dale Neef (2014) argues, big data is engendering an ‘analyse everything and see what turns up approach’ which ‘essentially changes the nature of statistics as we know it, because age-old concerns about statistical significance, causation and correlation are all eliminated’ (p. 183). Chris Anderson, that seemingly eternal booster for Silicon Valley, puts it more expansively:

massive amounts of data and applied mathematics replace every tool that might be brought to bear. Out with every theory of human behaviour, from linguistics to sociology. Forget taxonomy, ontology and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. (in boyd and Crawford, 2012: 666)
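The statistical concern that Neef describes as being set aside can be made concrete with a small sketch. The Python below is purely illustrative and is not drawn from any of the sources cited here: the function names `pearson` and `trawl`, the 40 hypothetical ‘user attributes’ and the 20 observations per attribute are all assumptions chosen for the example. It generates variables that are independent noise by construction, correlates every pair and ranks the results, which is roughly what an ‘analyse everything and see what turns up’ procedure amounts to when no causal account is sought.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def trawl(n_variables=40, n_observations=20, seed=0):
    """Correlate every pair of random variables; return (|r|, i, j) sorted by |r|."""
    rng = random.Random(seed)
    # Hypothetical 'user attributes': independent noise with no causal links at all.
    data = [[rng.gauss(0, 1) for _ in range(n_observations)]
            for _ in range(n_variables)]
    results = [(abs(pearson(data[i], data[j])), i, j)
               for i, j in combinations(range(n_variables), 2)]
    return sorted(results, reverse=True)


if __name__ == "__main__":
    ranked = trawl()
    print(f"{len(ranked)} pairs tested")
    # The top of the ranking shows the strongest 'patterns' found in pure noise.
    for r, i, j in ranked[:5]:
        print(f"variable {i} vs variable {j}: |r| = {r:.2f}")
```

With many variables and few observations, some pairs will tend to show sizeable correlations by chance alone, even though nothing connects them. The sketch says nothing about how any actual platform processes its data; it only indicates why the question of significance does not disappear simply because the dataset is large.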

It might be productive, nonetheless, to ask why Silicon Valley entrepreneurs do what they do. Their persistent focus upon explicating the benefits of big data techniques for businesses, managers and consumers invariably posits the creation of new value from existing and/or replenishing stores of data. Thus, the users of Web 2.0 have been recruited, wittingly or otherwise, to a series of universal lab tests exploring the alchemical possibilities of numerological procedure. This has nothing to do with the traditional concerns of academic audience research (self-realisation, cultural bonding or the uncovering of meaning) but everything to do with the quantification of social behaviours for the purposes of extraction. A handful of Internet companies now hold more data on media audiences than all the academic researchers in the world combined. As Eric Schmidt and Jared Cohen (2014) describe it, social media data are essentially a ‘gift’ to ‘governments and companies’ seeking to optimise the present and predict the future (p. 57). While the effective use of Hadoop is far from a simple matter, big data technicians can and do know more than we do about what the vast global audience on the Internet is doing in real-time. Their interest in the activity of digital audiences may be entirely different from ours but it is no less intense. This is necessarily the case, since the intrinsic value of their numerological techniques is realised through behavioural predictions. As such, without the continuing flow of user-generated content, the alchemical processes of Web 2.0 would break down entirely.

Skin in the game

For our part, there is no escaping the implications of big data for audience research as an academic practice. Given the scale of the various big data operations, we have to ask the question of whether we should bother to continue collecting data at all. Much of the discussion at audience research gatherings now seems to centre less upon primary research than it does upon frustrations with what is available through the application programming interfaces (APIs) of the major web portals. In that respect, we have already been reduced to lobbying for greater access to the vast aggregations of big data being generated in digital storehouses across the globe. If we are turned away by the Silicon Valley landlords, and it is a fairly reasonable assumption that we often will be, we may feel compelled instead to find our own ways to collect data on a similarly global scale. It is not beyond us technically by any means, but we have yet to see free search and file-sharing services being offered by consortiums of universities whose primary interest lies in the secondary usage of user data. One obvious reason is that academics labour under ethical constraints that are antithetical to the big data era (boyd and Crawford, 2012). When we collect information on audiences, by whatever means, we have to tell people why we are doing so and for what purpose. In order to mitigate unseen risks to our research participants, we must go to extraordinary lengths to anonymise the data that we collect about individuals.

In practice, universities are still struggling to accommodate the step change in scale, which means that they are no longer the major concentrations of big data in a globalised world. Nonetheless, while the expanding domain of big data has something of an omnipotent air, it remains the case that most of the datafication projects currently in train necessarily suffer from the vast outsourcing operations that make them possible in the first place. Their reliance on a combination of volunteered inputs and the wholesale scraping of content from disparate sources has real implications for the integrity of the data being collected. A shopping history is not a very accurate rendition of an individual’s taste, since the majority of purchases may well be for others. A list of favourite bands or movies is not a very deep reading of cultural orientation. The standardised format of most ‘membership’ profiles limits the nature of information collected, and the gaps in these data tend to be very important. Big it may be, but with the advent of data mining, much of this is stolen data. As you might reasonably expect, it means these are dirty data. In recent years, the more widespread looting of webmail and cloud storage through text searching has been indexing incomprehensible volumes of personal and professional communication. These are by their nature very complex and context-sensitive data, from which our capacity to visualise super-charged word clouds is somewhat underwhelming. This will improve but, more fundamentally, numerological practices cannot account for the knowing qualities of human subjects.

As users become aware of the pervasiveness of data trails, there is now an obvious ‘chilling’ effect, where everyone understands that making too many jihad jokes online will make it difficult to take holidays that require air travel. At a more general level, it is already evident that as our awareness of secondary data processing and the permanency of our user history has started to sink in, people are taking evasive action online (Andrejevic, 2013). This demand is reflected in the rise of counter-mining start-ups like DuckDuckGo, ProtonMail and Tor. Even within the tent, the constant prompting for elaborate inputs of personal information initiates a tendency among users towards wilful misinformation. For those who remain compliant, it is actually difficult to be consistent when responding to innumerable surveys whose questions are preset and whose ultimate purpose is unknown, even to the people designing them. The number of people who faithfully and consistently enter personal information is probably as small as the number who actually read the terms and conditions of agreements when they ‘join up’. Inevitably, the rise of blanket collection and retrospective query as a default approach to knowledge production has provoked legitimate concerns about the secondary uses of even the most uncontentious dataset. When users reasonably suspect that their browsing habits are scrutinised for both commercial and paramilitary purposes, they become naturally more suspicious of the questions put to them by academic researchers. As secondary usages quickly become the primary interests of big data, there can be nothing innocent about the data that we collect for the purposes of audience research.

The ubiquitous pressures of a global information society are clearly affecting the spirit in which data are solicited and offered. Consequently, this places limits upon the integrity of volunteered data. At the same time, by falling back upon the scraping of ‘natural’ information provided in less fraught exchanges, as in social media platforms and discussion boards, we become restricted by the cellular architecture of those systems. For example, there is an enormous difference between the viewpoint of a single user moving through the interlinked pages of a social media system and that of the tiny number of people with access to the bird’s eye view of that system. Collecting data from within the portals themselves is very difficult to do in any systematic sense, and the nature of the information is limited by the format parameters and the consistency of inputs. This is particularly frustrating since the scale of data collection entices us with the notion that the answers are in there somewhere. Notwithstanding this mirage of possibility, audience researchers can become easily lost in big data because the questions we ask still tend to be premised upon the idea that the information ‘out there’ is somehow descriptive of a social reality that we perceive as human beings. This preoccupation with the simulation gap is common to both the interpretative and positivist traditions. In the numerological era that we are now entering, however, the privilege of coding for reality has already been transferred to the designers of algorithms.

Designing culture

As a case study in ‘algorithmic culture’, Hallinan and Striphas (2016) examine Netflix: ‘a cheery mediator of people and movies, one that produces delight (and, of course, profit) by fusing technology and subscriber information in a complex alchemy of audiovisual matchmaking’ (p. 117). This claim to alchemical mastery consciously ‘mystifies the semantic and socio-technical processes by which these connections are made’ (Hallinan and Striphas, 2016). In part, this procedural mystification stems from the fact that the algorithmic engines of the major collectors of user data are deeply proprietary and effectively ‘wired-shut’ from public scrutiny. Netflix famously broke this mould with the Netflix Prize (launched in 2006 and awarded in 2009), which crowd-sourced the re-design of its recommendations system to an army of 50,000 mechanical Turks. As a consequence, a large number of computer scientists became deeply engaged with problems of cultural classification that had largely escaped the attention of computer science. In the process, algorithmic engineers came to ‘speak with unprecedented authority on the subject, suffusing culture with assumptions, agendas and understandings consistent with their discipline’ (Hallinan and Striphas, 2016: 119). The search for a series of indicators for cultural preferences was intensive. It seems that a priori demographic categorisations of users were not useful in themselves, the ratings provided by users were not reliable on their own terms, and a complex set of classifications around the different elements within the movies was required to triangulate the other inputs.

The massive effort required to address a singular instance and channel of cultural consumption brings to light the tremendous challenges involved in producing an orderly mathematical account of cultural practice (Striphas, 2015). Nonetheless, in order for the social world to become susceptible to computation, culture must be accounted for in terms that are applicable to the algorithmic process. There must be suitably defined variables, falsifiable outputs and internal logics capable of optimising for the ‘right answers’. In determining the context, components and import of this undertaking, Hallinan and Striphas (2016) define the notion of ‘algorithmic culture’ in terms of ‘the use of computational processes to sort, classify and hierarchize people, places, objects and ideas, and also the habits of thought, conduct and expression that arise in relationship to those processes’ (p. 119). This, of course, is an overwhelming semantic challenge that neither the interpretative nor the positivist tradition has ever met in full measure. For the positivists, the persistent issue has been the inconsistency of schemes and interpretations when it comes to coding information (Bail, 2014). The interpretative emphasis on context tends to lead to the inconvenient conclusion that culture is manifested in spectrums of associative meanings rather than through the logics of seriality (see Anderson, 1998). In Striphas’ (2015) account, coding schemes are algorisms, that is, classifications that are more or less imperfect or incomplete ciphers for the world. By contrast, algorithms are ‘a set of mathematical procedures whose purpose is to expose some truth or tendency about the world’ (p. 404).

This distinction refers us back to Weber (1968) and his cautionary notes on ideal types (p. 6). More critically, it implies that the previously central issue of signifying cultural phenomena has been reduced to an input problem, one that does not mitigate the magical capacity of the numerological process. Within the procedural frame itself, it is the profusion of ‘outliers’, that is, inconsistent or eccentric values in the various inputs, that produces friction, and this has to be ironed out in order to plot consistent predictions. By the time we have reached that point, the designers of a successful algorithm have predetermined the possibilities for input variables, mathematically corrected disruptive inputs and arrived at an orderly conclusion. In an interactive system, where recommendations for consumption are taken to exert a strong influence on users, this corrective account subsequently instructs cultural practice. This, in turn, reduces the ‘noise’ and optimises machine learning within the algorithm itself. Thus, we have a bootstrapping process of which Douglas Engelbart would have thoroughly approved, which prompts Striphas (2015) to posit that ‘culture is becoming the positive remainder resulting from specific information processing tasks’ (p. 406). I would certainly not go that far, but it does usefully remind us that we are dealing here with social informatics and with a circuit of communication.
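By way of illustration only, the ‘ironing out’ of eccentric inputs described above can be sketched in a few lines of Python. This is not the procedure used by Netflix or by any particular recommender; the function name `clip_outliers`, the threshold `k` and the ratings themselves are hypothetical, chosen simply to show how a mathematical correction pulls a dissenting value back towards the pattern that the algorithm expects.

```python
from statistics import mean, pstdev


def clip_outliers(values, k=1.5):
    """Pull any value lying more than k standard deviations from the mean
    back to that boundary (a simple, assumed form of outlier correction)."""
    mu = mean(values)
    sigma = pstdev(values)
    low, high = mu - k * sigma, mu + k * sigma
    return [min(max(v, low), high) for v in values]


# Hypothetical star ratings from one user; the 1.0 is the 'eccentric' input.
ratings = [4.0, 4.5, 4.0, 5.0, 4.5, 1.0, 4.0, 4.5]
smoothed = clip_outliers(ratings)

# The corrected series is more orderly (less spread) than the original,
# but the dissenting judgement has been partly written out of the record.
print(pstdev(ratings), pstdev(smoothed))
```

The point of the sketch is the trade-off rather than the arithmetic: the series becomes easier to predict from precisely because the most idiosyncratic judgement has been moderated.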

It is perhaps more striking that there is little or no mention of cultural diversity in the various explications of the cultural challenges for big data. There appears to be an implicit assumption that what rings true to a data analyst, programmer or executive in Silicon Valley constitutes the truth everywhere. Equally, the ‘universal’ platforms of the social media age, such as YouTube, make scant concession to the complexities of human cultures in the broader anthropological sense. In procedural terms, Google’s use of datafication to ‘solve’ and therefore ‘leapfrog’ linguistic barriers indicates that cultural difference can be approached as a functional rather than subjective obstacle. Nonetheless, every time big data applications solve such problems by applying one filter or another, they remove a large portion of the information that we have regarded as central to understanding the phenomena of culture. Variables can be constructed to account for different categories of product and for different patterns of consumption in particular market societies, but these algorisms fall well short of an enquiry into a whole way of life (Gomez-Uribe and Hunt, 2015). In that sense, it is not simply taste or language but considerable variations in environmental, ontological and cosmological consciousness that inform culture with a big C. Algorithmic engineers charged with manipulating these variables must first acquire ‘domain knowledge’ of such epistemic complexity that the Netflix challenge becomes akin to re-engineering Pong.

Big over distance

It is worth recognising that audience data is big not only in terms of mass but also when it comes to reach. There are equally important challenges to be faced when it comes to the geography of big data. We should not be inattentive to the complex interplay of imperial infrastructures, multi-national corporations and putatively sovereign states that determines the geopolitics of information (see Aouragh and Chakravartty, 2016). The contours of the digital divide may have changed a good deal over the past decade but the world is not flat when it comes to Internet access. The common modes of ‘collection’ for big audience research are not sufficiently attuned to populations who remain comparatively less enmeshed in the digital mall that has been packaged as Web 2.0. If we are to rely so heavily on the inputs of shopping carts and pop fandom to establish social identities, then it remains difficult to establish how ‘big audience research’ of this kind will capture the perspective of subsistence farmers in Cambodia, for example, even when they have access to social media. For this reason, uneven geographies and demographies are an inevitable consequence of the narrow interests who operate scraping technologies. When it comes to the everyday business of scraping, the Americans and Europeans are the most surveilled people on Earth, but for those who shop less there is less to scrape. As a consequence, the available data are an incredibly skewed sample for any behavioural universe. Therefore, we must always keep in mind that big data is not evenly big, something which becomes readily apparent when we match data inputs to the human geography of the world.
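One rough way to match data inputs to human geography is to compare each region’s share of the collected records with its share of the population. The Python sketch below assumes this simple ratio as a measure of skew; the function name `representation_ratios`, the region labels and all of the figures are invented for illustration and describe no real dataset.

```python
def representation_ratios(records, population):
    """Return, per region, (share of records) / (share of population).
    A value above 1 means the region is over-represented in the data."""
    total_records = sum(records.values())
    total_population = sum(population.values())
    return {
        region: (records.get(region, 0) / total_records)
        / (population[region] / total_population)
        for region in population
    }


# Invented figures: scraped records and population (in millions) per region.
records = {"region_a": 800, "region_b": 150, "region_c": 50}
population = {"region_a": 200, "region_b": 300, "region_c": 500}

ratios = representation_ratios(records, population)
for region, ratio in ratios.items():
    print(region, round(ratio, 2))
```

In this made-up case, the most heavily scraped region is over-represented several times over, while the most populous region barely registers, which is the sense in which big data is not evenly big.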

Even so, when we look at network platforms, it is evident that there are alchemical potentials in the spatial information embedded in many-to-many communication via the World Wide Web. The utility of this visual form of pattern recognition has been demonstrated by the steady rise of social network analysis as a methodology, and ultimately a theory, for understanding society. Beginning as a graphical method for recording kinship relations in anthropology, social network analysis was powerfully transformed by computerisation during the 1970s (Scott, 2012). The extended capacity to quickly plot even the most complex relationships between individuals in a visual form came to be combined with an individualist view of the social world in which social relationships were essentially transactions in the great game of life (Granovetter, 1974; Wellman and Berkowitz, 1988). With the advent of the Web, the distinction between recording networks, exploiting networks and creating networks could not be maintained. Networks beget networks, and their aggregation has been one of the driving logics of the digital revolution. As a medium of communication and of record, the Internet channels our communications in a network form, and it also captures those forms continuously in real time. For audience researchers, this means that online audiences are automatically recorded as a map of terminal locations and information flows. When we are able to access these data, we can see the unique spatial record of every audience configuration occurring online.

To the human eye, these maps are a massive jumble of dots and lines that bear little correspondence to a concept of audience that we can grasp. From a big data perspective, however, these graphical accounts can be correlated through data visualisations that can compare the almost innumerable records of sociability that now exist. Where patterns begin to emerge, a typology of networks begins and from this comes a characterisation of what audience formations on the Internet actually look like. Critically, such an exercise can be scaled to suit a determined enquiry. If we take the example of SoundCloud, here we have a large number of musicians associating from various points in the globe, with their attendant groups of followers, admirers and critics. Each of these clusters is captured in data transfer records and can therefore be mapped out visually, revealing in the process the complexity and density of communication in that cluster, along with the physical location of each member. In this form, then, we can now not only conceptualise interactive audiences but also actually witness their constellations. To take this exercise to the big data level, it becomes necessary to correlate social network records for a vast body of subscribers within this system. In doing so, we would hope to identify the underlying patterns of exchange and association manifested within a worldwide field of popular culture. Social network matching of this kind is a complicated exercise conceptually. It requires careful parameters, sophisticated algorithms and powerful processing (see Roy and Zeng, 2014). More than that, it requires access to high-level data.
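To give a sense of what identifying clusters and their density involves at the smallest possible scale, the following Python sketch groups a handful of follower links into connected clusters and computes the density of each. It is a minimal illustration under stated assumptions, not the method of Roy and Zeng (2014): the account names and links are invented, the functions `find_clusters` and `density` are hypothetical helpers, and density is taken in the standard graph sense of actual links divided by possible links.

```python
from collections import defaultdict


def find_clusters(edges):
    """Group accounts into clusters: sets of accounts connected to one
    another, directly or indirectly, by at least one link."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups


def density(group, edges):
    """Share of possible pairwise links within a cluster that actually exist."""
    n = len(group)
    if n < 2:
        return 0.0
    links = {frozenset(e) for e in edges
             if e[0] in group and e[1] in group and e[0] != e[1]}
    return len(links) / (n * (n - 1) / 2)


# Invented follower links between six hypothetical accounts.
edges = [("ana", "ben"), ("ben", "cy"), ("ana", "cy"),
         ("dee", "eli"), ("eli", "fay")]

groups = find_clusters(edges)
densities = [density(g, edges) for g in groups]
for g, d in zip(groups, densities):
    print(sorted(g), round(d, 2))
```

Scaling this from six accounts to a worldwide body of subscribers is where the careful parameters, sophisticated algorithms and powerful processing come in, and the sketch deliberately leaves out the geographical location attached to each member.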

Nonetheless, while you could not do this kind of work for a 10,000 dollar research grant over the summer recess, it is nothing like the ‘Manhattan project’ of a global ethnography. Approached in this way, geospatial studies of media audiences are not only conceivable but also eminently doable. We can expect, then, that this kind of work will be done more and more in the next few years. As a result, we will be able to learn much about the participatory bias of different cultural forms from linguistic idioms down to punk bands. In the more finely grained studies, we may be able to capture the kind of inter-group dynamics that inspired social network analysis in the first place. If we then proceed to combine the possibilities of mapping new audience formations with the capacity to scrape and sort digital content, then our knowledge of various networks of association can be augmented with the capture of everyday speech within those networks. The everyday integration of interpersonal communication with entertainment implies that we could also map the varied forms of distribution by which cultural works circulate within social networks (whether by purchase, broadcast, gifting or stealing). Does such usage of big data, then, effectively close the loop between the underlying anthropological premise of audience studies and the big data paradigm? Perhaps not quite. If this were to be our aim, we would have to ensure some capacity to move from the bird’s eye view of big data back to the subjective, environmental and embodied domain of culture as a human experience.

The potentials of audience data

The simultaneous evolution of big data applications and new audience formations via the Internet signifies the extent to which we may find ourselves increasingly studying data about audiences, instead of the audiences themselves. What can be more tempting than the capacity to sit at your laptop and drop into rich conversations taking place on the other side of the world? What can be more tempting than the capacity to access a database and get real-time visualisations of audiences for specific forms of content? The clear and obvious danger of such pleasures is that the essentially human concerns of audience studies could be forgotten in the convenient euphoria of big data. Thus, our longstanding concern with meaning might be ill served by entering this goldmine of data deposits without a map of our own. In many respects, it is starting to look like we are coming full circle to a world of ‘mass effects’ research driven by a priori assumptions and tested on sophisticated correlations of dirty data garnered from a reasonably unethical trawling operation. The prevailing emphasis on correlation over causality and on pattern recognition over data integrity may be suited to behavioural studies of shopping with limited aims, but these are shaky foundations for any understanding of human cultures. Indeed, such tendencies are precisely what the proponents of active audience theory have sought to counter for the past 50 years. In the digital aftermath of cultural studies, then, what we may be left with is a radically decontextualised numerology that merely begins with the online consumption of popular culture.

If we took such a pessimistic view, the newly automated programme of audience research would doubtless continue without us. If we proceed uncritically across a range of disciplines, then the various data formats now emerging will become self-referential enquiries that increasingly forget the independent existence of the phenomena that they were originally conceived to study. It is in this fashion, very precisely, that data overtake reality. Thankfully, I do not think we have to take this as given. We do not need to subscribe to the beguiling illusion of autonomous data in order to benefit from the new techniques developed to contend with the abundance of user inputs. Nor do we need to accept the numerological precept that seeks to retire materiality and causality to a pre-digital epoch. A more productive approach would be to direct our attention towards the potentials of digital research tools as we see them. In many respects, the real utility of big data lies not in its sheer bulk but in the fact that it is granular. This means that well-designed studies can operate at several scales and might therefore be able to furnish understandings in areas that remained practically out of reach just a few years back. It is not inconceivable that some alchemy of our own might lead us to more sophisticated understandings of key concepts like interactivity, community, crowd and public. Indeed, staking a claim to big data may well prove critical for understanding the dynamics and potentials of media in the twenty-first century. As long as we remember that audiences are not data, then everything should be fine.

Footnotes

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

References

Anderson (1998) Nationalism, identity and the logic of seriality. In: Anderson (ed.) The Spectre of Comparisons. London: Verso, pp. 29–45.

Andrejevic (2011) Surveillance and alienation in the online economy. Surveillance & Society 8(3): 278–287.

Andrejevic (2013) Infoglut: How Too Much Information is Changing the Way We Think and Know. London and New York: Routledge.

Aouragh and Chakravartty (2016) Infrastructures of empire: towards a critical geopolitics of media and information studies. Media, Culture & Society 38(4): 559–575.

Appadurai (2012) The spirit of calculation. The Cambridge Journal of Anthropology 30(1): 3–17.

Bail (2014) The cultural environment: measuring culture with big data. Theory and Society 43(3): 465–482.

Benton and Craib (2011) The Philosophy of Social Science: The Philosophical Foundations of Social Thought. Houndmills: Palgrave Macmillan.

boyd and Crawford (2012) Critical questions for big data: provocations for a cultural, technological and scholarly phenomenon. Information, Communication & Society 15(5): 662–679.

Carvalko (2016) Self absorption: where will technology lead us? IEEE Consumer Electronics Magazine 5: 120–122.

Cheney-Lippold (2011) A new algorithmic identity: soft biopolitics and the modulation of control. Theory, Culture & Society 28(6): 164–181.

Davenport (2014) Big Data @ Work. Boston, MA: Harvard Business Review Press.

Gomez-Uribe and Hunt (2015) The Netflix recommender system: algorithms, business value, and innovation. ACM Transactions on Management Information Systems 6(4): Article 13, 1–19.

Granovetter (1974) Getting a Job: A Study of Contacts and Careers. Cambridge, MA: Harvard University Press.

Halavais (2009) Search Engine Society. Cambridge: Polity Press.

Hallinan and Striphas (2016) Recommended for you: the Netflix Prize and the production of algorithmic culture. New Media & Society 18(1): 117–137.

Howe (2009) Crowdsourcing: How the Power of the Crowd is Driving the Future of Business. New York: Penguin Random House.

Kosinski, Stillwell and Graepel (2013) Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences 110(15): 5802–5805.

Kurzweil (2005) The Singularity Is Near: When Humans Transcend Biology. New York: Viking Penguin.

Lanier (2013) Who Owns the Future? New York: Simon & Schuster.

Levy (2011) In the Plex: How Google Thinks, Works and Shapes Our Lives. New York: Simon & Schuster.

McCue (2006) Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis. Burlington: Elsevier.

Mayer-Schonberger and Cukier (2013) Big Data: A Revolution That Will Transform How We Live, Work and Think. New York: John Murray Publishers.

Neef (2014) Digital Exhaust: What Everyone Should Know about Big Data, Digitization and Digitally Driven Innovation. Upper Saddle River, NJ: Pearson FT Press.

Polanyi (1944) The Great Transformation: Political and Economic Origins of Our Time. Boston: Beacon Press.

Puschmann and Burgess (2014) Metaphors of big data. International Journal of Communication 8: 1690–1709.

Roy and Zeng (2014) Social Multimedia Signals: A Signal Processing Approach to Social Network Phenomena. Heidelberg: Springer.

Schmidt and Cohen (2014) The New Digital Age: Reshaping the Future of People, Nations and Business. New York: John Murray Publishers.

Scott (2012) Social Network Analysis. London: Sage.

Shirky (2008) Cognitive Surplus: Creativity and Generosity in a Connected Age. New York: Penguin Press.

Striphas (2015) Algorithmic culture. European Journal of Cultural Studies 18(4–5): 395–412.

Surowiecki (2004) The Wisdom of Crowds: Why the Many Are Smarter than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. New York: Doubleday.

Tapscott and Williams (2006) Wikinomics: How Mass Collaboration Changes Everything. New York: Penguin Books.

Von Neumann (1966) Theory of Self-Replicating Automata. Champaign, IL: University of Illinois Press.

Weber (1968) Economy and Society: An Outline of Interpretative Sociology. New York: Bedminster Press.

Wellman and Berkowitz (1988) Social Structures: A Network Approach. Cambridge: Cambridge University Press.

Zafarani, Abbasi and Liu (2014) Social Media Mining: An Introduction. New York: Cambridge University Press.