Abstract
This publication covers highlights of, and interviews from, the Strata Data Conference held in New York on September 11–13, 2018.
Section 1: General Observations
New York's Strata Data Conference “Make Data Work” was presented by O'Reilly and Cloudera. (Congrats to Cloudera on its acquisition of Hortonworks!) The event's goal was to “help you put big data, cutting-edge data science (DS), and new business fundamentals to work.”
This premier data conference occupied the premier real estate of the Javits Center. In contrast to one long-forgotten night, November 8, 2016, for three memorable, beautiful September days the center was bursting at the seams with excited and exciting people full of fresh DS, artificial intelligence (AI), machine learning (ML), analytics, and internet of things (IoT) ideas, tools, services, and products. Day 1 (September 11) was all about training and tutorials, whereas days 2 and 3 (September 12 and 13) were for keynotes and sessions. Clearly, it is impossible to cover even most of what happened at the conference. So, what are my personal highlights of this mega-event?
Let us start with five general observations. First, it is very telling and appropriate that this year the conference changed its name from “Strata Big Data and Hadoop Conference” to just “Strata Data Conference.” Second, the efforts toward gender parity were much greater this year, which was evident on several fronts. For one, and for the first time in my experience as a regular at the conference, there were long lines for both men's and women's restrooms. More seriously, the numbers of keynotes, plenaries, and companies presented by women were the highest of any of these conferences!
Third, the conference's diversity was not limited to gender parity but was also on full display in the breadth and depth of the business applications. They included, for example, Infoworks' agile data engineering software, Timescale's scalable time-series database, RLR Company's Hadoop desktop server, Integris Software's data privacy solution for GDPR (the European Union's new General Data Protection Regulation), and SmartCover's IoT diagnostics and prevention technology for sewer systems (all described in Section 2 below). This year the lion's share of the offerings and products were thoughtful, to the point, and actionable. Altogether, this clearly indicates the diversity and maturity of the DS community at large, and not just of the so-called “data-industrial complex” (Tim Cook, CEO of Apple). 1 My hat goes off to both O'Reilly and Cloudera for being true Friends of Data & Analytics (FDAs)2–4 and for their many years of support of our community!
Fourth, there is another, less obvious, and somewhat subjective remark on the community's maturity. This one comes from my personal observations over more than a decade. When I moved from academia to industry to become one of the first Chief Data Officers (CDOs) worldwide, our community was very small, and the number of abbreviations in use was very small too. Nowadays, including at the Strata Data Conference, this number continues to go through the roof. And I am not only talking about well-known abbreviations, such as AI, DS, GDPR, IoT, and ML, but also about others, such as AGI, DL, FDR, GOBS, NLP, and 5Vs. Finally, for all the abbreviation lovers (guilty as charged), let me mention the two most recent additions: CSP and DIC (the youngest relative of the more than half-century-old MIC). These latter abbreviations are defined at the end of this article.
Fifth, multiple presentations exemplified the conference's goal of translating data into superior business outcomes. One of the best keynotes was presented by Cassie Kozyrkov, Chief Decision Scientist (have you heard of this role before?) at Google Cloud. Cassie brilliantly covered important topics from actionable insights and decision science to the future of data science and type III errors (finding the right answers to the wrong questions). 5 Awareness of type III errors is indispensable to our relentless focus on business outcomes and helps keep us from the generation of BS (GOBS). 6 My interview with Cassie can be found in Section 3 below, and her bio can be found in the Supplementary Data (available online at www.liebertpub.com/big). Still, her bottom line was quite simple: “all those complex mathematical endeavors need to be directed well, otherwise the results may be at best useless and at worst harmful… The world is collecting so much data – our goal should be to make it as useful as possible.” Amen!
Section 2: Conference Highlights
Here is a short, subjective, and heavily narrowed-down write-up of just five individual presentations and/or company products I saw at the conference.
Agile data engineering is an extremely important activity for many of us in the trenches. Amar Arsikere, CEO and Founder of Infoworks, introduced the company's new agile data engineering software. This software automates and accelerates big data analytics projects through the company's Autonomous Data Engine, which has been adopted by some of the largest enterprises in the world. Using a code-free environment, the engine allows organizations to quickly create and manage data pipeline and workflow processes from source to consumption. Customers deploy big data projects to production within days, dramatically increasing analytics agility and shortening time-to-value. And if you want to learn more about simplifying data operations, please visit Infoworks.
At the conference (and 1.5 years after its launch), CEO and Co-Founder Ajay Kulkarni of Timescale announced TimescaleDB 1.0. This release signifies the maturity and enterprise readiness of this open-source time-series database, built on top of PostgreSQL. The database offers the reliability and tooling of a database more than 20 years old, together with a powerful extension framework. TimescaleDB ingests millions of data points per second; scales tables to hundreds of billions of rows and tens of terabytes; and returns quick responses to complex queries. It is architected to manage time-series data and includes many key time-series-specific features, such as automatic space–time partitioning, a hypertable abstraction layer, adaptive chunk sizing, time-series analytics in SQL, geospatial analysis, JSON support, and easy schema management. To my knowledge (please let me know if I am wrong!), it is the only time-series database to scale these workloads while still supporting full SQL.
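To make the idea of space–time partitioning a little more concrete, here is a minimal sketch in plain Python. It is not TimescaleDB code and does not reflect its internals; it only illustrates the general technique of routing each row to a chunk based on a time interval and a hash of a "space" key. The names `CHUNK_INTERVAL`, `NUM_SPACE_PARTITIONS`, `chunk_key`, and `route`, as well as the one-day interval and the four partitions, are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
import zlib

# Hypothetical settings, chosen only for this illustration.
CHUNK_INTERVAL = timedelta(days=1)   # width of each time slice
NUM_SPACE_PARTITIONS = 4             # number of hash partitions on the space key
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def chunk_key(ts, device_id):
    """Return (time_slot, space_slot) identifying the chunk for one row."""
    # Time dimension: which fixed-width interval the timestamp falls into.
    time_slot = (ts - EPOCH) // CHUNK_INTERVAL
    # Space dimension: a stable hash of the device identifier.
    space_slot = zlib.crc32(device_id.encode("utf-8")) % NUM_SPACE_PARTITIONS
    return (time_slot, space_slot)


def route(rows):
    """Group (timestamp, device_id, value) rows by their chunk key."""
    chunks = defaultdict(list)
    for ts, device_id, value in rows:
        chunks[chunk_key(ts, device_id)].append((ts, device_id, value))
    return dict(chunks)
```

In this sketch, two readings from the same device on the same day land in the same chunk, while a reading from the following day lands in a new one. A query restricted to a time range and a device therefore only needs to touch a few chunks, which is the intuition behind presenting many small partitions as one large table.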
At the conference, the Ricker Lyman Robotic Company debuted its first product, Hivecell One, which enables developers to have a cluster on their desktop for working with Hadoop (it is still alive and kicking!). Hivecell is a small stackable server with a six-core 64-bit ARMv8 processor, 256 CUDA GPU cores, and 8 GB of RAM, which brings true linear scalability. You can place another Hivecell on the stack to scale your compute power. The patent-pending Baranovsky connectors pass power and Ethernet through the stack of Hivecells, eliminating the clutter of wires from the developer's desktop. Hivecell has a built-in patent-pending provisioning system that enables developers to install Hadoop on a cluster with a single click of a button. It also supports Mesos, Kubernetes, and Kafka. My interview with CEO and Co-Founder Jeff Ricker is in Section 4 below, and the joint bios of Jeff and his longtime friend, President and Co-Founder Paul Lyman, are in the Supplementary Data.
Data privacy issues, including GDPR, were discussed in several sessions. Integris Software was founded to meet the requirements of GDPR, the new California privacy law, and other privacy obligations. Companies first have to realize that privacy is fundamentally a data issue and has to be an outcome of a comprehensive data protection strategy. Few technology executives actually know what data are sitting on their systems, and relying on the old manual, survey-based collection methods is not realistic in the age of big data. What is needed is data privacy automation, a new field that uses ML to help organizations discover, map, and set policies for their data so they do not have to lock it all down. Integris Software, led by Kristina Bergman, CEO and Co-Founder, is a pioneer in data privacy automation. Integris helps companies build privacy into the design of their modern data architecture, enabling them to protect customer privacy while keeping their data unlocked.
Lastly, Dr. Greg Quist, CEO and Co-Founder of SmartCover Systems, told a fascinating story of how to predict and prevent sewer spills. Unfortunately, for many sewer systems, getting data is a challenge. This is particularly true for utility leaders, who are challenged to maintain an aging infrastructure in the midst of increasingly frequent and severe storm events and, hence, are up many nights. SmartCover helps these utility leaders step out of the dark and, through a robust IoT solution (sensors, satellite communications, analytics, real-time data, and event notifications), lets the sewers update them on how things are going. This results in reduced costs, optimized operations, and the elimination of overflows and spills. Best of all, the clients are now sleeping and letting their sewers do the talking! My interview with Greg is in Section 5 below, and his bio is in the Supplementary Data.
Overall, this year's Strata Data Conference at New York's Javits Center was well organized, amazing, and empowering! Hopefully, these observations, highlights, and interviews will convey some of that to you, the dear readers of the Big Data journal.
Finally, the promised definitions of the additional abbreviations are: AGI (artificial general intelligence), 7 DL (deep learning), 8 FDR (false discovery rate), 9 GOBS (generation of BS), NLP (natural language processing), 10 and 5Vs (five challenges of big data: value, variety, velocity, veracity, and volume). 11 To my knowledge, this article is the first to define two new abbreviations: CSP as “consulting, services, products” and DIC as “data-industrial complex.” Interestingly, in computer science, CSP stands for “constraint satisfaction problem.” 8 Clearly, DIC reminds us of the infamous MIC (military-industrial complex), a term introduced by President Eisenhower in 1961 12 that has enjoyed huge popularity ever since. Do you expect that DIC will gain the same uber-popularity?
This article would not have been possible without the help, assistance, and encouragement of the following individuals: Zoran Obradovic, Sophie Mohin, Maureen Jennings, Amar Arsikere, Andrey and Michael Baranovsky, Kristina Bergman, Tricia Bush, Marjorie Cannon, Mary Eggert, Justin Hahn, Jacob Javits, Benjamin, Evelyne, and Natali Kolker, Cassie Kozyrkov, Ajay Kulkarni, Paul Lyman, Joe Manguno, Lucas Mayer, Cassie McAllister, Jacinda Mein, Vural Ozdemir, Jeff Ricker, Greg Quist, Graham Symmonds, and Jenny Wang.
I am looking forward to next year's Strata Data Conference in New York, again in the Javits Center and again in September! If you have any questions, suggestions, or ideas, please let me know at ekolker@nyu.edu.
Section 3: Interview with Cassie Kozyrkov, Google
Type III error means you should not have been pursuing the problem you are solving in the first place; you should have been doing something else that is more useful. When you go down the wrong rabbit hole with data, at best you are wasting everybody's time and at worst you are doing something harmful.
Forgetting Type III error is a bad mistake for society. We have got so much data. Now we need to really talk about making it useful. Meticulously answering the wrong question is painful for everybody. Let us have a discipline oriented around bringing down Type III error. And that is what is at the beating heart of Decision Intelligence: doing the right thing properly.
Business community at large: data science has a lot of promise, and there is a reason businesses are investing in it. Unfortunately, that investment goes nowhere if businesses are not able to use data science effectively. Decision Intelligence is a way to multiply the impact of data science and make sure that investment pays off.
General public: the world is collecting data and there is a lot of human benefit that is locked in those data. We can unlock that benefit if we have the skills and abilities to make those data useful. Building those bridges is the key to success.
Section 4: Interview with Jeff Ricker, RLR Company
Deliver the revolution in hardware that matches the revolution that has occurred in software. Create the standard building block for edge computing: the personal data center. Build a fog computing ecosystem for adopting new distributed software and sharing compute power securely peer to peer.
There is an immediate, pressing need in the market. There is a significant barrier to learning and developing on these distributed software frameworks. Furthermore, installing and configuring (provisioning) distributed frameworks is extremely difficult. There are hundreds of parameters to be set. Most of the professional services provided by companies such as Cloudera and Hortonworks are just for helping clients with provisioning. The barrier to learning is preventing the supply of developers from meeting the exploding demand for big data, AI, and machine learning expertise.
A personal data center works in the office and the home, just as the personal computer did 30 years ago. Blockchain, fog computing, and personal data centers all work together. Blockchain enables individuals to own their own data and share it peer to peer. Fog computing enables vendors to provision software to the edge, removing the complexity from the user. Personal data centers enable individuals to store the data and run the software that they own and control but can still use online.
Amazon's case is acute, but most production systems follow a similar pattern of having a peak usage that exceeds normal usage. For instance, most of the trading on Wall Street occurs at the open of the market and at the close of the market. Two hours of peak usage, 6 hours of mild usage, and 16 hours of idleness. The pattern is found everywhere.
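The usage pattern described here lends itself to a quick back-of-the-envelope calculation. The sketch below takes the split of 2, 6, and 16 hours from the answer above; the relative load levels (1.0 at peak, 0.3 during mild usage, 0.0 when idle) are illustrative assumptions and not figures from the interview, and the function names are hypothetical.

```python
# Daily usage profile as (hours, relative load) pairs.
# The hour counts come from the interview; the load levels are assumed.
PROFILE = [
    (2, 1.0),   # peak usage
    (6, 0.3),   # mild usage (assumed to be 30% of peak)
    (16, 0.0),  # idle
]


def idle_fraction(profile):
    """Share of the day during which the system carries no load."""
    total_hours = sum(hours for hours, _ in profile)
    idle_hours = sum(hours for hours, load in profile if load == 0.0)
    return idle_hours / total_hours


def average_utilization(profile):
    """Time-weighted average load relative to peak capacity."""
    total_hours = sum(hours for hours, _ in profile)
    return sum(hours * load for hours, load in profile) / total_hours


if __name__ == "__main__":
    print(f"idle fraction: {idle_fraction(PROFILE):.1%}")
    print(f"average utilization: {average_utilization(PROFILE):.1%}")
```

With these numbers, hardware sized for the peak sits idle for two thirds of the day (16 of 24 hours). Under the assumed load levels, average utilization is roughly 16% of peak capacity, which is the kind of spare capacity the answer has in mind.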
With the growth of AI and machine learning, there is a growing need for the ability to buy and sell spare computing power at the edge. It has to be at the edge in a growing number of cases as using the cloud is too slow to be practical. However, sharing compute power at the edge peer to peer is a significant challenge. Trust can be achieved if the hardware is secure to the metal and both parties know that the other is using the same hardware.
Section 5: Interview with Greg Quist, SmartCover
The answers we received from our friends were both shocking and uniform—“What can you guys do about sewer spills? Our sewers overflow, we have strict liability, and we pay fines and get bad press. Can you guys help us out?” So we jumped on the Internet and found no solutions and there were no patents on this at the USPTO. So on February 5, 2005, Hadronex (the official corporate name for SmartCover Systems) was formed.
Starting with a blank sheet of paper, we turned back to our water friends and asked: “What do you want it to do? How do you want it to function? How much would you pay?” The answers we got were: “keep it simple,” “easy to install and service,” “give me the answers I want,” “make it affordable for large numbers,” “no confined space entry,” “keep the sensors out of the water,” and “tell us if the manhole has been opened.” Armed with a specification directly from the industry, David and I set out to build a solution with no preconceived notions. With our own meager funds, working elbow to elbow in David's home workshop, we built a start-up IoT solution before IoT became a buzzword.
So what the customers said translated into, for us, “plug and play solution,” “reliable and dependable sensors,” “dependable two way wireless communications,” and “built-in power.” Fortunately, in 2005, reasonable answers existed for the technology required to make this happen. SmartCover could not have started 10 years earlier. The technology was not there yet. Timing is everything.
Working with our customers and starting literally from scratch, we had our first prototype in the field by May, fixed our problems, had our second prototype in the field by July, fixed those problems, and by November, had our full solution available for the market. David and Greg did engineering, quality control, R&D, finance, customer service, sales, and everything else. So we had designed, tested, fielded, and sold a complete end-to-end IoT solution from a blank sheet of paper in 9 months.
We started sales locally in San Diego County, to be sure our product was reliable, then expanded to Southern California, then to all of California, then nationwide and internationally. At each step, we made sure our solution was getting robust enough to succeed without close babysitting. The customer was always at the center of our focus. We listened and made modifications.
We have taken in two rounds of financing to get gasoline in the company engine. First, from late 2007 to early 2008, we took in private equity to get our sales, customer service, and engineering beefed up. We tripled sales the next year. And in 2016, we took in funding from XPV Water Partners to help us reach the next level. XPV have been excellent partners and only invest in water companies.
Besides best-in-class customer service, examples of our current technical differentiation are as follows: (1) we utilize the Iridium® low-earth-orbit satellite system for our two-way wireless communications, for multiple reasons including ubiquity (Iridium works anywhere in the world) and availability under the most demanding conditions (our system worked flawlessly in the New York City region during Hurricane Sandy, when all other wireless systems were down); (2) we do not require confined space entry for installation or service; and (3) our system provides a real-time intrusion alarm telling our customers their sewer has been breached.
Our sensors are purpose built for the sewer environment—able to withstand corrosive atmosphere, high humidity, dirt, and shock. We typically measure water level, and from that flow, and we are adding new sensors carefully to be sure our customers get the performance expected from SmartCover. For example, next year, we will be bringing an H2S sensor to the market. We can measure other parameters as well, including pressure, temperature, and pH.
Communication is accomplished through the Iridium satellite constellation. Iridium is best used for our application because when the big storms come, such as Hurricane Florence this year, and Hurricane Irma last year, and New York of course remembers Hurricane Sandy, you want to know what is going on in your sewer. Most terrestrial systems such as cell phones fail under high stress. That is why the U.S. military is a big Iridium customer, too.
We perform a great deal of automated analysis on our data, both at the measurement site to ensure quality measurements and on our cloud servers to improve the value of the data. This includes data fusion with other data sources such as NOAA and USGS, giving our customers great visibility to the response of their sewer systems to storms, snow, floods, and tides.
Ultimately, the information we provide to our customers answers the question: “So what do I do now?” Of course, the answer most of the time is “nothing–all's good.” But simply having that assurance that the system is operating well matters, because before the advent of SmartCover, operators were effectively blind to the real-time conditions of their sewers. As Peter Drucker says, “what's measured improves.” We help our customers act both proactively, using predictive methods, and reactively to rapidly changing conditions, often driven by external events such as rain.
It is our goal—and I think we are succeeding—to help our customers save money, reduce operational risk, extend the lifetime of their assets with no increased risk, minimize or avoid spills, and simply do their job better with less hassle and cost.
I mean, the iPhone came out in 2007, 2 years after we started. So SmartCover is 2 years older than the iPhone, and I would say the iPhone has met with broader acceptance than IoT for sewers. But we are seeing the peak of the baby boomers like me start to retire. We call it the “Silver Tsunami”—and the younger generation coming in is less resistant to digital technology, partly due to the iPhone. And rate payers—customers of our customers—are becoming much more picky due to the smart phone in their hands and are expecting higher and better performance from their utilities to whom they pay their bills. I think we are going to see much deeper and broader adoption by the industry within the next 5 to 10 years, and those who are not onboard will slowly disappear.
What we are doing now is making our digital data and predictive analysis as easy to integrate into a variety of standard utility platforms as possible. And we are doing it one step at a time.
References
