Abstract
This paper is a short introduction to (Big) Data Science and Intelligence for the RDA educational corner. Its purpose is to motivate a broader discussion of what Big Data is, how it is transforming the future of finance, and what the essential opportunities and concerns are when using it. “Intelligence” in Big Data is used to emphasize that mathematics is an essential part of the algorithmic and statistical approaches we use when searching, estimating or seeking answers to our problems. When we use the power of IT, the mathematical and statistical intelligence embedded in numerous applications and studies seeks to bridge theoretical constructs and their computational realizations. Their integration is a complete system of automatic and learning know-how (which we may call AI, Learning Machines, or any other name). It is now expanded by systemic computing, data analytics and management to do much more with a lot less. However, in the long run, doing more without Intelligence, replacing intentionality by machine rationality, leads to an evolution where choices are no longer made but are instead imposed by data complexity and expert systems that may embed far greater risks than we expect. In this case, without the power of human intelligence and a mathematical (objective) rationality, using Big Data without science is similar to seeking to go from one place to another without a map.
Introduction
Big Data is hammered by the media as a “new”, model-less and intelligent alternative to complex mathematical modelling and statistical analysis. It is appealing because it presumes that “models are not needed” since data expresses “what really is”, rather than what we believe it is or search for. Depreciating the mathematical definition of models, however, may carry far greater risks than presumed.
The newfound power of “data”, big or small, lies in its use of hardware and software (information technologies) that render the treatment of information, from many sources and in huge quantities, far easier and nearly instantaneous. The extraordinary expansion of computer systems (in data assembly, data storage, computational speed and the outgrowth of increasingly complex yet simpler-to-use languages) and their applications has contributed to new methods that allow us to “learn” from data (inferential software) and to generate future scenarios (predictive software). These are based on far more complex and extensive data sets, with packaged software that does not require from its users the demanding efforts that went into its design and construction. For example, Machine Learning, in fact a brand name covering many different approaches to how data alters our “memory”, has also evolved, providing predictive software based on mathematical and statistical tools such as scaled multivariate regression models, mathematical definitions of short and long run memory, adaptive learning, Bayesian learning, wavelets, neuro software, Bayesian networks, etc. Such software may be called “First Generation” intelligent financial software, combining computational and mathematical techniques (based on computational efficiency) with newly found computing and data management capabilities. Current Intelligence software, however, is fast mutating into future generations that are hard to predict. These can lead to unexpected consequences when human intentionality and mathematical logic and efficiency are no longer needed. For example, such software rests on the presumption that data is always part of incomplete information, increased only by what it already implies. As a result, it can only guess what is. The ancient Greeks already perceived that “the likely is unlikely”. This does not negate that models can be used, although they have to be qualified.
In this case, data, its analysis, and efficiency are necessarily mathematically and statistically defined. By the same token, the expansion of data and systems complexity increases the demand for a “control” intelligence, to assure that increasingly complex Big Data Systems “do what they are intended to do”. In 1956, Ashby [3], a Cybernetics scholar, already pointed to “a law of requisite variety”. It claims that a control intelligence has to be greater than its machine (system) intelligence to allow its effective control. Failing such an intelligence, a systemic breakdown ensues.
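The law of requisite variety can be illustrated with a toy regulation problem. The sketch below is not taken from Ashby [3]; it assumes a hypothetical system in which a disturbance d in {0, ..., n-1} meets a regulator response r in {0, ..., k-1} and produces the outcome (d - r) mod n. An exhaustive search over all regulator policies suggests that, in this toy model, the smallest achievable number of distinct outcomes is the ceiling of n/k: the controller can hold the outcome constant only when it commands as much variety as the disturbances it faces.

```python
from itertools import product
from math import ceil


def min_outcome_variety(n_disturbances: int, n_responses: int) -> int:
    """Smallest number of distinct outcomes any regulator policy can achieve.

    Toy model (an assumption made for illustration only):
    outcome = (d - r) mod n, where d is the disturbance and r the response.
    A policy assigns one response r to every disturbance d.
    """
    n, k = n_disturbances, n_responses
    best = n
    # Enumerate every policy: one response per disturbance (k**n policies).
    for policy in product(range(k), repeat=n):
        outcomes = {(d - r) % n for d, r in enumerate(policy)}
        best = min(best, len(outcomes))
    return best


if __name__ == "__main__":
    n = 6
    for k in (1, 2, 3, 6):
        # Columns: regulator variety, best achievable outcome variety, ceil(n/k)
        print(k, min_outcome_variety(n, k), ceil(n / k))
```

The brute-force search is only practical for very small n and k; it is meant to make the statement concrete, not to model a financial control system.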
These elements of Big Data usher in a new perspective and new challenges for the world of finance, both expanded in a global and interconnected world that competes technologically and with far more savvy financial agents. A competition based on a Big Data Intelligence designed by a broader understanding and application of mathematics and data analytics (with ever greater access to scaled-up information and technology) provides the background for an evolving finance and its engineering future.
The approaches and algorithms that may be designed for declared purposes are many and varied. With Big Data and a computing IT backbone, we may define relationships based on what has happened or is happening anywhere or everywhere. Inference mathematics then provides a better appreciation of what the data may teach, and expands and qualifies the breadth of our choices. By integrating statistical learning and predictive algorithms, financial predictions may be improved and future prices defined with greater certainty. Yet such a process can be maintained only to the extent that its intelligence never outpaces the financial intelligence.
Big Data and its associated statistical tools raise two fundamental questions, which we categorize as Statistical Rationality, based on model development, and Artificial Data Intelligence. The first question is based on the underlying fundamentals of Decision Sciences and Statistics (for example, see Andersen et al. [2], Callebaut [5], the Economist (Feb. 25, 2010), Goodman and Wong [14], Hey et al. [20], Krohs and Callebaut [22], McKinsey [23]). Statistical Rationality, for example, presumes that “Hypotheses” express prior estimates or models, relative to which data is used to reject or not reject the underlying Hypotheses. In this sense, both have a prior rationality used to provide a coherent and consistent approach to assessing the available empirical evidence that tests of these hypotheses imply. For example, given time series, models that replicate these series are merely hypotheses, and their statistical treatment just provides estimates, trends and a statistical qualification of their fit. Such approaches are inferential, both for present and future states. Policies are then reached based on optimization problems, searching and replication techniques [1,6–9].
The second question is based on the application and the meaning of “Artificial Intelligence”, built on an algorithmic and analytical logic that seeks recurrent patterns and images in a large set of data. This intelligence is evolutionary in the sense that it may absorb and integrate information from various sources, update its memory and its intelligence through estimates of what it knows (for example, its statistics), and reach conditional decisions based on data and complex models.
“Big Data Intelligence” is then an evolutionary learning that consists of a multiplicity of mathematical, statistical and analytical tools that perform functions such as: replicate, track, update, store, graph, project, decide, etc. Such systems are attracting financial enterprises because of the speed required in trading and investments, and because financial advice can be based on learning investors’ needs and wants. High Frequency Trading algorithms are one example.
Big Data is not new, although its mathematical Intelligence is evolving from simple to far more complex issues, such as how to reconcile individual and collective behaviors, strategic (gaming) conflict resolution, competitive investment strategies, etc. In the 1930s, physicists and mathematicians already sought approaches to define theories of General Systems, applicable to large and complex interactive and communicating systems. Cybernetics and feedback design have led to automatic machines and their control. These were introduced mainly in manufacturing technologies, upended by complex and integrated software, as was the case with Flexible Manufacturing Systems. These developments are now updated to fit service industries such as financial services, marketing, etc. [12,13,16–18,21].
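The “reject or not reject” logic of Statistical Rationality described above can be sketched in a few lines. The example below is a minimal illustration, not a method taken from the cited references: the hypothesis is that a series of returns has zero mean, the test uses a normal approximation (a Student t reference distribution would be more appropriate for short series), and the function name, the data and the 5% level are all assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence


def zero_mean_test(returns: Sequence[float], alpha: float = 0.05) -> dict:
    """Two-sided test of the hypothesis 'the mean return is zero'.

    Uses a normal approximation for the standardized sample mean.
    Requires at least two observations with non-zero sample variance.
    """
    n = len(returns)
    standard_error = stdev(returns) / sqrt(n)
    z = mean(returns) / standard_error
    # Probability, under the hypothesis, of a statistic at least this extreme.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return {"z": z, "p_value": p_value, "reject": p_value < alpha}


if __name__ == "__main__":
    # Hypothetical daily returns, invented for illustration.
    sample = [0.010, 0.012, 0.011, 0.009, 0.010, 0.013, 0.008, 0.011]
    print(zero_mean_test(sample))
```

Note that a hypothesis that is “not rejected” is not thereby confirmed; the data merely fail to contradict it at the chosen level.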
Data intelligence already exists and is varied. John Tukey [29–31] predicted years ago that the future would emphasize the primacy of data (see also David Donoho [11]) and the need to learn from data analytics rather than just “fundamental statistical models”. In Tukey’s words: “For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt …. All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning and gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data” [11].
Louis Guttman in 1944 [18] suggested a scaling approach to large and multidimensional psychology tests that led to the Guttman scale. Sociology and Psychology are often studied using very large quantities of qualitative data, which include a large number of interactions and behavioral patterns that we may define as variables, each variable being defined by its attributes. A scale can then be used to compare one student’s knowledge of mathematics to another’s, or one mental state to another. Guttman’s scaling methodology, based on quantitative and qualitative data, provides such an ordering. Further development at the Bureau of Social Research in Jerusalem improved this approach through a dimensional reduction of multivariate data sets (see also Raveh and Tapiero [25,26]) that apparently provides more information than standard correlation and factor analysis studies. Further, the results were assimilated in the form of a “scalogram”, which gives a visual configuration of the qualitative data. Guttman’s approach was used successfully in investigating morale and other problems in the United States Army by the Research Branch of the Morale Services Division of the Army Service Forces (during WWII). Subsequently it led to the development of numerous applications by the Bureau of Social Research (Jerusalem, Israel) on voting patterns and a broad range of questionnaires accumulated into large data sets.
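To make the idea of a cumulative scale concrete, the following sketch computes a coefficient of reproducibility for a table of yes/no answers: items are ordered from most to least endorsed, each respondent's answers are compared with the ideal cumulative pattern implied by the respondent's total score, and the share of matching cells is reported. This is one common way of counting scale errors and is offered as an assumption for illustration; it is not presented as the procedure of Guttman [18], and the data are invented.

```python
from typing import List


def reproducibility(responses: List[List[int]]) -> float:
    """Coefficient of reproducibility for a 0/1 response matrix.

    Rows are respondents and columns are items. A value of 1.0 means the
    answers form a perfect cumulative (Guttman-type) scale.
    """
    n_rows = len(responses)
    n_items = len(responses[0])
    # Order items from most endorsed ("easiest") to least endorsed.
    totals = [sum(row[j] for row in responses) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: -totals[j])
    errors = 0
    for row in responses:
        score = sum(row)
        # Ideal pattern for this score: the `score` easiest items endorsed.
        ideal = [1 if rank < score else 0 for rank in range(n_items)]
        observed = [row[j] for j in order]
        errors += sum(o != i for o, i in zip(observed, ideal))
    return 1.0 - errors / (n_rows * n_items)


if __name__ == "__main__":
    # Hypothetical answers of five respondents to three items.
    answers = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1]]
    print(round(reproducibility(answers), 3))
```

In this invented table only the last respondent departs from the cumulative pattern, which is why the coefficient falls below one.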
The global production and storage of data have expanded manifold in the past years, together with our ability to store, standardize, mine and use data (and therefore make it valuable).
Drawing on work by Tukey, Cleveland, Chambers [6] and Breiman, Donoho [11] presented a vision of data that is better prepared and displayed. In other words, these authors suggest friendlier data, highlighting its importance and information visually and providing an easier appreciation of what data means. For example, one may define what a data user wants along any number of criteria and issues, and articulate a personalized vision of data. “Data Science” is then defined in terms of six divisions (Donoho [11]):
Data Exploration and Preparation
Data Representation and Transformation
Computing with Data
Data Modeling
Data Visualization and Presentation
Science about Data Science
Further developments led to dynamic graphical representations. From a statistical viewpoint, Bayesian approaches have been used to “update” probability estimates of recorded events, both based on Bayes’ theorem and extended to complex Bayesian Networks based on the algorithmic adaptation of the Bayesian framework. Additional technical tools such as wavelets, neural nets, multivariate and experimental design, simulation programs (producing data when needed) and so on are evolving, fueled by their applications. Further, search engines are applied to the discovery of any matter that would seem, once discovered, to be important. They may be applied to search for deviant traders and collusive trades across national boundaries, and to detect credit default risks, trade opportunities, mispricing of assets, etc. For regulators, they are used to better define factors contributing to systemic risks and to detect suspected non-compliance. In all these models, intelligence is necessarily mathematical, or at least derived from mathematics. For example, behavioral science and finance are based on hypotheses and their statistical tests (all previous works by Amos Tversky and Daniel Kahneman were in fact based on experimental statistics with mathematical designs that detect and differentiate behavioral patterns).
An increasing number of software packages and start-ups are proposing “big-data-black-boxes” that seek financial trends, consumers’ moods, preferences and intents using internet comments assembled globally on financial assets and economic variables. Algorithms and automatic learning machines are then created to seek and interpret images, to detect departures from expected trends, and to trace sentiments integrated into predictive mechanisms for stock market performance. A rising tide of data-driven algorithms based on behavioral rationalities is emerging and engulfing large firms and business interests in an avid and information-dependent finance. Big data thus competes with, and at the same time complements, the traditional statistical and mathematical approaches to data management and analysis. However, its blind use to define, say, a decision, at the expense of depreciating models and computational rationality, may imply risks as well as imbue decisions with a certainty that by definition exists nowhere.
The traditional statistical approach, unlike IT-supported Big Data, is based on fundamental and structured hypotheses emanating from scientific statements or theories to be refuted or not, based on statistical (data) evidence (see Allen [1], Weinberg [32], Callebaut [5,28]). On the one hand, the statistical/scientific approach reveals uncertainty from a given and tested knowledge database. On the other hand, data-driven algorithms are based on a robotized artificial intelligence that has the capacity to treat large data sets to reveal covariations, associations and patterns, based on their search (in some cases improved by built-in learning algorithms) rather than on theory. Thus, while the statistical/scientific approach is an evolutionary and evolving process, based on a cycle to hypothesize, measure, test and confirm-or-not, the data-driven approach is a shortcut, seeking to know even though it may not know why. It is an interpretation of data at any one time with a predictive artificial intelligence that decision makers might not understand, as data revelations can be as complex and as stealthy as artificial intelligence may be to its users. It is, therefore, a statement of a current fact (a presumed certainty) rather than a recognition that all knowledge is partial and embedded in a greater uncertainty.
The British astrophysicist Arthur S. Eddington once wrote “no experiment should be believed until it has been confirmed by theory!” (NY Times, March 26, 2012). In this sense, data-driven measurements can be misleading and reveal just about anything one may choose to discover. Researchers coding the Bible defined an algorithm to identify trends and patterns to predict future world events. To disprove the validity of such an approach and its prophecies, a digitization of Tolstoy’s “War and Peace” and a search engine revealed secret intents as well, which made as much sense as one would be willing to believe. Similarly, other books have predicted political stalemates in the Middle East and their resolution. In this sense, data-driven measurement models can make sense out of nonsense, even though it is in fact nonsense. Similarly, in biology, where big data has already found its footing, Callebaut [5, p. 69] raises important philosophical issues: can data ‘speak for themselves’? He also discusses Carl Woese’s concerns about “a society that permits biology to become an engineering discipline, that allows science to slip into a role changing the living world without trying to understand it”. Big data has therefore its perils, with type I statistical errors (i.e. making a false discovery), enslaving decision makers to rationalities they might not be aware of, and providing a new twist to the misuse of data measurement by seeking a confirmation of one’s certainty. The comparison of the statistical approach with the data analytic approach (in Fig. 1) emphasizes the systemic causal approach of the scientific and statistical versus the algorithmic and data analytic approach (see also Nyamabuu and Tapiero [24]).
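The Bayesian “update” of a probability estimate mentioned above can be illustrated with the simplest conjugate case. The sketch assumes a Beta prior on an unknown default probability and binomial observations of defaults among loans; the function names, the uniform Beta(1, 1) prior and the loan counts are assumptions chosen for the example, not taken from the references.

```python
from typing import Tuple


def update_beta(alpha: float, beta: float, defaults: int, loans: int) -> Tuple[float, float]:
    """Posterior Beta parameters after observing `defaults` among `loans`.

    With a Beta(alpha, beta) prior and a binomial likelihood, Bayes' theorem
    gives a Beta(alpha + defaults, beta + loans - defaults) posterior.
    """
    return alpha + defaults, beta + (loans - defaults)


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


if __name__ == "__main__":
    prior = (1.0, 1.0)  # uniform prior: no initial opinion on the default rate
    posterior = update_beta(*prior, defaults=2, loans=10)
    print(posterior, beta_mean(*posterior))
```

Updating in two batches (1 default in 4 loans, then 1 in 6) yields the same posterior as a single batch of 2 in 10, which is what allows such estimates to be revised as records arrive.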

Fig. 1. Statistical versus data analytic approaches.
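The risk of false discoveries in a purely data-driven search can be made tangible with a small simulation. In the sketch below, which is an illustration under stated assumptions and not a procedure from the cited works, a thousand candidate “signals” of pure noise are correlated with a series of pure-noise “returns”, and each correlation is compared with an approximate 5% two-sided cutoff of 1.96 divided by the square root of the sample size. Roughly one signal in twenty passes the test although none carries any information. The function names, sample sizes and seed are hypothetical.

```python
import random
from math import sqrt
from typing import Sequence


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def count_false_discoveries(n_signals: int = 1000, n_obs: int = 100, seed: int = 0) -> int:
    """Number of pure-noise signals that look 'significant' at about 5%."""
    rng = random.Random(seed)
    returns = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    threshold = 1.96 / sqrt(n_obs)  # large-sample approximation of the cutoff
    hits = 0
    for _ in range(n_signals):
        signal = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        if abs(correlation(signal, returns)) > threshold:
            hits += 1
    return hits


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false discovery among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    print(count_false_discoveries())
    print(familywise_error_rate(0.05, 100))
```

With one hundred independent searches at the 5% level, at least one spurious “discovery” is almost certain, which is the arithmetic behind finding prophecies in any sufficiently long text.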
Practically, “Big Data” and “Data Science” are associated with data banks, cloud computing, computer languages, some statistics, some cybersecurity, some data mining, etc. These are essential short-term developments, an evolution of IT and its applications everywhere. When computers first entered industry, there was also a rush. Anybody knowing Cobol was hired immediately. Fortran was the best that we could hope for. Artificial intelligence was already created in 1958, with expert systems (see for example INRIA’s expert optimal control system, which self-designs software based on the partial differential equations that define it). An application is to be found in my book, Tapiero [27]. Already it was supposed to change everything for medicine, for business, etc. Cobol and Fortran, although used for some time, were replaced by a new “must”, the Lisp language and Lisp machines. They too became obsolete with C and
Donoho [11] recalls and confirms that Data Analysis is not a new concept; it is based on the rationale that pure mathematical and probabilistic concepts were not adequate, and that we may be able to learn directly from data, with “no probabilistic bias”. There are, however, numerous approaches seeking to combine data analytics with statistical analyses to treat ever larger data sets (for example, Guttman, INRIA, my paper with my PhD student Uri Hanani [19], as well as an extensive development of data analytic problems by Edwin Diday at INRIA and many colleagues, such as L. Billard and E. Diday and Esposito [4,10], “From the Statistics of Data to the Statistics of Knowledge: Symbolic Data”, Diday et al. on the SODA software, Groenen, Winsberg, Rodriguez, Diday [15] and numerous other papers).
The broad academic tendency to emphasize empirical analysis at the expense of mathematical integrity is contributing to such a development, yet it ignores “fake data”, “partial data”, etc., which would otherwise be confronted by a rationality or by mathematical theories.
Machine and Deep Learning are now at the core of computer science and statistics, and of financial engineering programs. They provide an opportunity to find even more trends. This kind of approach, in its infancy, is appealing, although in the near future financial managers and scientists may ask “why this trend?”, at which time it may seek a mathematical rationale to be integrated into its own system. Today it has motivated advances in technology, from personalized product recommendations to speech recognition in cell phones, from financial advisors to financial robo-advisers, and from theory to an emphasis on software designers’ intuition and empirical information, naturally based on an extensive (or simulated) personalized database.
Finally, Financial IT and Big Data provide an immense opportunity in data management and data distribution, and in managing financial clients and their strategies. However, without a Big Data Intelligence, they can turn out to be an unwieldy process, victimized by the belief that a larger haystack may help find a needle in that haystack. Yet big data in banks, if tamed, can complement the statistical/scientific approach to measurement by providing an opportunity to reveal new hypotheses and new directions that can set such approaches on a more certain footing.
Financial traders, financial institutions and agents in general use financial data as a primary asset. The Economist reports that the middle class grew by more than 1 billion people between 1990 and 2005, and data transferred globally over the internet has been rising exponentially. Companies and services such as Amazon Web Services, AT&T’s Synaptic Hosting, AppNexus, GoGrid, Rackspace Cloud Hosting, the HP/Yahoo/Intel Cloud Computing Test Bed, and the IBM/Google and MicroStrategy BI Cloud are providing various types of cloud services to ease data storage.
Big data and global finance may be a fit for one another, as both imply and are based on complex and large volumes of data, both internally produced and externally supplied. While technologies from social media companies have shown a way to handle vast amounts of unstructured data, our ability to translate these complex, diverse and dynamic data sources into a workable global and financial information system remains challenging. From a technical and statistical viewpoint, this requires quantitative and data modeling expertise regarding: algorithmic search or data mining; data detection and selection; coding; filtering and reporting to construct a prior time path; data security and quality; and data modeling and dimensional scale reduction.
Increasingly digitalized global financial systems allow automatic context-specific interpretation, aggregation and analysis (e.g., what information is relevant or not to a particular country, market or stock). For example, information and/or knowledge extracted from digital records can make it easier for bank staff to accurately diagnose and detect risky clients in selected global areas. Similarly, digitized data may prevent cybercrimes more effectively and thus contribute to the management of increasingly complex financial networks, e-financial markets and financial retailing. Despite its potential, the use of big data and data analytics for financial information is still immature. Michael de Crespigny, CEO at the Information Security Forum (ISF), points out regarding forensic finance that only half of organizations surveyed by the ISF are using some form of analytics for fraud prevention, forensics and network traffic analysis, while less than 20% are using it to identify information related to subject matter requests, predict hardware failures, ensure data integrity or check data classification. “Few organizations currently recognize the benefits for information security, yet many are already using data analytics to support their core business …” (ComputerWeekly.com, August 1, 2012).
Is big data looking for a pin in a haystack by adding hay? Is big data in the future of global finance (or, vice versa, is finance in the future of Big Data)? Can big data, based on past facts, be reconciled with implied future volatility estimates? Is big data merely another IT data-driven tool to justify what we do, or rather to define what we ought to do? Is big data the end of privacy? Are big data algorithmic search models transparent to those who may want to use them? Is it a means to reveal outcomes sought a priori, or to reveal the unknown and the unexpected? Is big data something new, or the marketing of well-known data analytics tools up-ended with more information and expanded by new computational hardware, all of which are integrated? Is big data a means to increase or to reduce complexity? Is the growth of data and our ability to deal with this data unsustainable?
Technology and financial data
This section is based on joint research with Michael Hayes from Goldman Sachs. Elements of this paper are also reproduced in the forthcoming book (Wiley, 2017), Globalization, Gating and Risk Finance, by U. Nyamabu and C.S. Tapiero.
The evolution of Information Technology (IT), combined with globalization, has led to a transformed data management. Distributed databases and IT networking in a global and data-open environment are gradually changing finance, from both a technological and a data intelligence viewpoint. The latter underlies algorithmic finance, feeding financial industries (banks, investment funds, hedge funds, advisory firms, etc.). Financial technologies and intelligence are not uniformly distributed and, therefore, they have introduced a risk asymmetry for ill-equipped agents and a fierce competition between financial firms that are able to acquire the technology but struggle to make it pay (since financial intelligence is not distributed equally). For these reasons, the future of global finance is both complex and in a continuous search of what it may become.
Prior to 1995 (before distributed computing was introduced), data management was relatively simple. Generally, data was centrally maintained in mainframe computers; financial products were less complex; the velocity of systems change was much slower; and centralized system processing meant data standards were easier to manage and maintain. Since 1995, and amplified by globalization, data management has become increasingly complex due to the financial freedom to trade in far more types of assets, and to incentives that have extended the outreach of banks and their competitive postures. Specific reasons include: the repeal of the Glass-Steagall Act, after which banks were able to trade rather than just lend and manage clients’ moneys; product innovation such as futures contracts and a broad variety of optional products; distributed computing and databases, accessible from many locations, that took hold and transformed data management; and the dilution of data processing standards and the empowerment of business silos.
The Year 2000 (the Y2K syndrome) further challenged computer-aided data management and ushered in changes both in markets and in financial products. Derivatives markets began to explode, reaching a notional size of over 400 trillion dollars. Global capital markets became more interdependent as products were sold globally across jurisdictions. Further, the repeal of the Glass-Steagall Act removed the separation between investment banks and depository banks, enabling risk to flow freely between a bank and its entities and extending the financial outreach of banking institutions to areas where they were previously unable to function.
These resulted in an unprecedented number of mergers and acquisitions; the line between risk and data control became blurred; comfort, instead of objective analysis, became acceptable; and engineering standards were compromised. A “new complexity” invaded finance, augmenting the profits of those who could master such complexity and increasing the risks for those who could not. From a data management viewpoint, distributed computing set in with new opportunities and risks. Some of its consequences included: data management went from being centralized to decentralized, and hundreds if not thousands of database servers were created; businesses rapidly transformed themselves, creating distinct processing silos; federated ownership of data ruled the day, meaning that everyone owned the data, so no one effectively owned it (in other words, owning data was not an advantage); businesses argued “data freedom” was better than “data anarchy”; product creation and experimentation increased at an exponential rate; and governance and data processing became compromised.
These consequences, faced with an ever expanding financial system, created a sense of sovereign uncontrollability and sovereign risks that have prepared the groundwork for instituting, legally, a far greater and far more powerful financial regulation. The 2007–2008 financial crisis has increased the political commitment to rein in the financial system, if not completely, at least wherever it can. Current attempts to reduce the costs of its consequences are also relevant to a data-only finance.
According to the National Research Council (2010, p. 1), the post-Lehman crisis of 2008 and its contagion also pointed out that “rapid change in the financial system driven by innovation and deregulation have altered the mechanisms and pace of financial intermediation to such an extent that regulatory tools, processes and data have fallen behind.” These have produced a regulatory revolution which is in continuous transformation. Its challenges are expanding multi-fold in global finance, where all countries are confronted with their own problems, their own regulation and other countries’ regulations. These elements introduce far greater costs for financial institutions that comply with financial (and often contradictory) regulations.
The complexity of standardizing data and its quality in a global environment has led to an IT approach summarized by a four-letter objective, “A–C–T–S”: Accuracy, Completeness, Timeliness, and Adherence to Standards. There are basically two general approaches:
The first is Enterprise Information Integration (EII), which is a process of information integration that uses data abstraction to provide a unified interface (known as uniform data access) for viewing all the data within an organization and a single set of structures and naming conventions (known as uniform information representation) to represent data.
The second is Data Virtualization (DV), which uses technologies that offer data users a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores.
The main benefits of these approaches are: users get a unified view of the data; data integration happens at the middleware level (data quality improves because standards are promoted back into the database); engineering principles and practices are uniformly applied across the enterprise; encapsulation hides the technical details of the data so users can interact with data in a more natural manner; business intelligence tools can readily be put to use against the data; users can once again trust the data to conform to operating standards; and governance of information becomes effective.
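The unified, abstracted view described above can be sketched with a small middleware layer. The example is a minimal illustration under stated assumptions, not a description of any vendor's EII or DV product: two hypothetical stores (an in-memory SQLite table and a legacy list of records with different field names) are wrapped in adapters that map their fields to one naming convention, and a single view answers queries across both. All class, table and field names are invented for the example.

```python
import sqlite3
from typing import Callable, Dict, Iterator, List


class SqlSource:
    """Adapter for a relational store (here an in-memory SQLite database)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def rows(self) -> Iterator[Dict]:
        cur = self.conn.execute("SELECT client_id, notional_usd FROM positions")
        for client_id, notional in cur:
            # Map store-specific column names to the shared convention.
            yield {"client": client_id, "notional": float(notional)}


class ListSource:
    """Adapter for a legacy store that uses different field names and types."""

    def __init__(self, records: List[Dict]) -> None:
        self.records = records

    def rows(self) -> Iterator[Dict]:
        for rec in self.records:
            yield {"client": rec["cust"], "notional": float(rec["amt"])}


class VirtualView:
    """Single query interface over heterogeneous sources."""

    def __init__(self, *sources) -> None:
        self.sources = sources

    def query(self, predicate: Callable[[Dict], bool] = lambda row: True) -> List[Dict]:
        return [row for src in self.sources for row in src.rows() if predicate(row)]

    def total_notional(self, client: str) -> float:
        return sum(r["notional"] for r in self.query(lambda r: r["client"] == client))


def build_demo_view() -> VirtualView:
    """Assemble the two hypothetical stores used in this illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE positions (client_id TEXT, notional_usd REAL)")
    conn.executemany("INSERT INTO positions VALUES (?, ?)", [("A", 100.0), ("B", 250.0)])
    legacy = [{"cust": "A", "amt": "50"}, {"cust": "C", "amt": "75"}]
    return VirtualView(SqlSource(conn), ListSource(legacy))


if __name__ == "__main__":
    view = build_demo_view()
    print(view.total_notional("A"))
```

The user of the view never sees where a record is stored or how its fields were originally named, which is the encapsulation benefit listed above; the data themselves stay in their original stores.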
Acknowledgements
This introduction has benefited from discussions with Alain Bensoussan and George Papanicolaou, as well as with my students at the NYU Tandon School of Engineering, with whom we have exchanged views on Big Data and the future of finance.
