This is a transcription of an interview with Abhishek Mehta, managing director for big data and analytics at Bank of America, on SiliconAngle.tv. He discusses the disruptive nature of big data, which is 95% unstructured; the primacy of data in all businesses; how we are entering the second great industrial revolution with the rise of data factories; and how IT developers need to shift their focus from developing the next new algorithm to deciding which problems to solve with the algorithms that already exist.
The interview, at Hadoop World earlier in October 2010, was conducted by Wikibon co-founder David Vellante and SiliconAngle.tv founder John Furrier. Speakers are identified by their initials. This transcription is not a polished article but is intended to be used as research material.
JF: In Bank of America obviously you run people’s money and a lot of confidential information. You deal with big data all the time. So Hadoop is Open Source, the Hadoop movement has arrived. It’s got a lot of momentum. In year 2 it already has seen a collision between the Open Source community & commercialization. So thanks for coming on. And tell us why. What’s your perspective? Share with us your view and BoA’s view of Hadoop.
AM: First of all, I’m a big believer in Open Source. I see Hadoop today the way Linux was 20 years ago. It is in the exact same place. We have seen how disruptive Linux has been. I think Hadoop will be equally disruptive, not just to existing systems, but it will enable you to do things you couldn’t do before. It’s good to be occupying the front seat with it and being the leader in thinking about it. Because I don’t think it is a question of if Hadoop changes the way we do business today but of when it happens. So again a big supporter of Open Source; I think Hadoop will be a massive disruptor. Because be it Bank of America, be it Wal*Mart, be it Verizon, they are all data companies. You don’t push cash around; you move bits & bytes. And we realize that, we want to be good custodians of it, & increase transparency that we have in the bank & in the larger system to drive positive change.
JF: As it is with all kinds of issues, you guys are in the data business because you have all kinds of data. Most people think of online banking as sort of a reality. When online banking first came out people said, “that’s awesome.” Now it’s like you take it for granted, it’s like what new features can you add. So you are in the business of using data. Is the role of data changing your business model? Share with folks out there the business models, because you brought that up. You’re getting more mobility with Web access; you’re learning more about your customers in real time. How does that affect the business model & product you offer?
AM: It’s a great question. I talk about the emerging business model in the context of what I call the “data factory”. I think we are witnessing the second industrial revolution. And it is fueled by data. And it will be bigger than the first industrial revolution, because finally technology has democratized not just the access of data to a plethora of new companies but also the ability to store, mine, clean, analyze, and produce data products that can solve problems that you could not have solved before.
So these data factories are going to emerge as the new drivers of innovation of a massive revolution that will change fundamentally how business models extract value, because data is going to be, is the core asset in a multitude of industries. And the ability to automate the data pipeline and then rapidly find information in it to make decisions that benefit our end-customers, will add value.
JF: How does that change software development? Think about old-school AI for example, which was an academic thing. But what you are really talking about is leveraging quantitative principles with programming that requires reasoning. New kinds of approaches. Are you seeing any new software environments or mindsets out there? Obviously you guys are in that analytic business, you’ve got to think like quant jocks and also think about being a developer.
AM: Absolutely. I think the interesting part is that wherever technology is adopting the core principles of pushing the code to the data rather than the other way around because ???…, you see some massive game changing.
JF: Explain that notion of pushing the code to the data.
AM: The whole concept of big data is big by definition. It’s massive amounts of data. If I have petabytes of data sitting on an infrastructure platform, rather than move the data to where I analyze it, I need the ability that Hadoop does – the ability to push the code to the data — because the pipes are limited, and that creates a bottleneck. Moving a petabyte of data through a 10G network takes time. Moving a few Megabytes of code to a petabyte of data is a massive change. And wherever you see technology being employed across what I call the “data factory stack”, with similar principles – commodity hardware, massively parallel, moving small bits across the network — you see emerging technology that’s truly interesting.
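Illustrative aside (not part of the interview): the bottleneck described above can be put in rough numbers. The sketch below assumes decimal units (1 PB = 10^15 bytes), a fully utilized 10 Gbit/s link with no protocol overhead, and a hypothetical 5 MB code bundle; the function name and the 5 MB figure are assumptions made only for this illustration.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Idealized transfer time: payload size in bits divided by the link rate."""
    return size_bytes * 8 / link_bits_per_second


PETABYTE = 10**15        # decimal petabyte, in bytes (assumption)
MEGABYTE = 10**6         # decimal megabyte, in bytes (assumption)
LINK_10G = 10 * 10**9    # 10 Gbit/s, assumed fully available

# Moving the data to the code versus moving the code to the data.
data_seconds = transfer_seconds(PETABYTE, LINK_10G)
code_seconds = transfer_seconds(5 * MEGABYTE, LINK_10G)  # hypothetical 5 MB job

print(f"1 PB of data: {data_seconds / 86_400:.1f} days")
print(f"5 MB of code: {code_seconds * 1000:.1f} ms")
```

Under these idealized assumptions the petabyte takes about nine days on the wire, while the code arrives in a few milliseconds. Real networks would be slower in both cases, but the gap of many orders of magnitude is the point.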
So in BI, there’s a company called Tableau that is probably the only company in BI that you can use for big data. Similarly massively parallel ….
DV: What is the name of that company again?
AM: Tableau, like a French table. Tableau is a startup in Seattle, started by Pat Hanrahan and a bunch of smart Stanford guys. It seems like every smart company has Stanford behind it. You can blame Tim O’Reilly for that one.
JF: Or MIT, Or Northeastern University. We’re on the East Coast.
AM: But Pat Hanrahan, the founder of the company, completely turned the concept of BI on its head and said, “BI needs to be used by the business user, not the technologist.” So Tableau does that with their own custom language called VizQL, which pushes the code out. Vertica and Aster have similar concepts on the database side. And then you’ve got Hadoop, which is the foundation of any data factory.
I call Hadoop a day-old baby. It’s got a conical head & it’s kind of cute, but it has tremendous potential. But it’s still a day-old baby.
DV: So you talk about storing the data, and then you bring the code to the data so you can process it. Then you’ve got to make it available to the business user, right? And you’re talking about Tableau doing that. That’s a big white space in the industry right now.
AM: Absolutely huge. I think there are two big white spaces in the industry. One is actually building the data factory. So we have the ambitious goal to build the first data factory in financial services. And I think the emergence of the data factory will truly drive the second industrial revolution. Massively change the way we drive value.
Google and Facebook are data factories. They are also new properties. So the first areas where Hadoop has developed are in new properties. Outside of new properties are people like me, who are sitting on large legacy infrastructures, who need a massive incentive to change because no incentive to change is greater than the status quo. Status quo is always the enemy of change.
There is white space around the ability to build a data factory that automates the data pipeline. It doesn’t exist today. You have to build them.
Then secondly to your point, at some point your data analysis is only as good as the story you tell. The BI space around it is very large. And having tried all the BI tools in the current space around it, the only tool that works – and you have 4 billion rows a month, and I want to put them on an actual US map, and I have four years of data, and Tableau can do it in four seconds?!
JF: Where do I sign?
DV: Your traditional analytics is like a snake swallowing a basketball. You’re chasing chips and bringing in more storage and the architecture’s just not working. Now you’ve got this new architecture, this Hadoop thing, that comes in, and it changes the way you think about the problem.
AM: It’s a massive game changer. To look at what is happening today, Google processes a Tbyte an hour. Processes it. I can’t even store it, right? In the human genome project you have to analyze 3 billion base pairs. The first time it was done, it took 10 years. The second time it took three. Today you can do it in a week. So the analogy of the snake, it just doesn’t work any more because the ability not just to take, store, and clean massive amounts of data but also to build models on the population is a massive game changer. Because now as a bank I can think of eliminating fraud. So now I can build a model looking at every incidence of fraud going back five years for every single person, rather than sampling it now, building a model, realizing there is an outlier that breaks the model, and then rebuilding the model. Those days are over.
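Illustrative aside (not part of the interview): a minimal sketch of why rare events such as fraud argue for modeling on the whole population rather than a sample. The population size, the number of fraud cases, and the 1% sampling rate are all hypothetical, and the data is synthetic.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000,000 transactions, 20 of them fraudulent.
N = 1_000_000
fraud_ids = set(random.sample(range(N), 20))

# Full-population pass: every fraud case is available to the model.
seen_in_population = sum(1 for tx in range(N) if tx in fraud_ids)

# 1% random sample: on average only 1 in 100 of the rare cases is drawn,
# so most (often all) of the 20 fraud cases never reach the model.
sample = random.sample(range(N), N // 100)
seen_in_sample = sum(1 for tx in sample if tx in fraud_ids)

print(f"fraud cases seen in full population: {seen_in_population}")
print(f"fraud cases seen in 1% sample:       {seen_in_sample}")
```

With events this rare, the expected number of fraud cases in the sample is well below one, which is the situation described above where an unseen outlier later breaks the model.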
DV: Wow, that is a game changer.
JF: There is no automated pipeline for the data, so people are building their own. So okay, Hadoop’s great, this is cool, people get excited and they build their own solution. How is Hadoop going to change the game on the existing players – EMC, NetApp, HP, Oracle? Oracle in particular is being called out here on stage and also by us. It is the old game of extracting rents out of the marketplace and screwing customers with exorbitant licensing fees if you are running VMware, and these kinds of things are going on.
AM: That’s a $100 billion question.
JF: There’s a lot of money in that. People get killed for that. I’d better watch my car when I start it up if I’m going to continue this Oracle bashing.
AM: The database market is a $100 billion market. That is a great question. I don’t have an answer to it honestly. I think it will be interesting to see. My belief is the current players in the market – I’m not an Oracle basher. Oracle has done some very cool things, just like IBM does and just like the other players in the industry. I think they all see that the future is in a smarter ?? It’s like IBM says, building a smarter planet. Oracle with Exadata is trying to go down that path. Over time you will see that the democratization of data management, led by the Open Source revolution, will make them change the way that their business and products are done.
JF: Only if there are alternatives.
AM: Yes but Linux did not put Oracle or IBM out of business.
JF: It changed their business model a little bit.
DV: They adopted it aggressively and changed their whole business model.
AM: I think we will see the large software vendors embracing Open Source phenomena like Hadoop and building software products in the ecosystem around it. The Open Source community is very powerful, and it can solve a lot of problems. At a certain point an enterprise like ours needs certain critical things to be built in that framework that the Open Source community may or may not do: information security, pipeline management.
JF: What kinds of products are you guys coming out with that can dwarf online banking? Because what online banking did to people was “Wow. This is cool”. Users can touch it. What’s next? What new innovation will enable you guys? What big product do you see coming? Mobility? Security?
AM: In banking in general, and this is just my personal perspective – there are 6 billion mobile phones in the world; in a world of 6.8 billion people. The lowest common denominator in the world today is my mobile phone. So I think a mobile device or as I call it a mobile computer, because it’s really a computer, which truly is accessing this cloud information. For me a phone is really an access point to massive amounts of data in a virtual cloud called the Internet. Over time you’ll see these clouds emerge of multiple data sources. Google is the keeper of all Internet for online data. Similarly you can look at a large financial services company as being a large financial cloud. Facebook is a large social cloud. Telecom companies being a large communications cloud. At some point they are all going to merge in this massive space and the mobile access point, i.e., the voice the customer has, will change a lot of things.
Does mobile banking change the way people do payments today? The answer is yes. What model emerges, what that will be remains to be seen. It’s not as simple as saying that a phone can be an access point for payments. But a lot of different players….
JF: You talk about the legacy. Just as you have legacy infrastructure internally for data you talk about a legacy cloud environment.
DV: We are talking about innovation around how information and data are used, right? It’s a data renaissance.
AM: And hopefully a protocol where people interoperate. I think the last thing we should try to drive in the industry 10-15 years out is closed systems where data sets don’t merge and come together, which brings a whole host of other questions on. What is the protocol for exchanging data? Or the new laws of data – who should own it? So there is a host of things to be done. So hopefully I can lead some of that innovation myself.
JF: So we are seeing some embracing of Hadoop. Is there any advice you would like to share with people out there, with entrepreneurs? A lot of smart people, entrepreneurs, are coming in and digging around, getting dirty and developing applications. Any advice you might want to share with people who might come to you with the next Tableau or opportunity?
AM: If any entrepreneur has already caught onto Hadoop then they are smarter than I am; I don’t think I could offer any advice.
JF: You’d be a buyer.
AM: Hadoop – I think people have already bought into it. This is the next big thing to come and will fundamentally change economic models and disrupt existing businesses like you have never seen before. It is going to happen. Buy into it. Think big. Throw out the assumption that you had before, problems that you thought could not be solved before — eliminating fraud, looking at the spread of disease, trying to fix the traffic system, optimizing the energy grid — can be solved now.
DV: It’s intoxicating.
AM: And think beyond the vet (???) properties because there is a lot of adoption and a lot of help because companies outside the west coast, on the east coast, need to make Hadoop real or to make technology and democratized data real.
JF: And Hadoop has really shown the community aspect of the collaboration. We live in a global economy, a global world where the workforce can be anywhere. And that’s a big game changer with open software.
AM: Absolutely. And I do believe that data quant is the job of the future, that it will be sexier than being a model for Abercrombie.
JF: Big data is sexy. I love it.
AM: And I think that at some point we have to change our education system and retool our work force. America has nothing to worry about. The next generation of the factories sits here.
JF: Data factories.
AM: Data factories. They are going to come here.
JF: The new industrial revolution. Talk about that idea, I really like that, that we are going to see a revolution that will be bigger than anything we’ve seen.
AM: I think the broader industry is slowly realizing that the core asset for their franchises & their business models is their data. Wal*Mart for me is a data company. I am a data company. And we are technology companies. We may not talk that way, but we are technology companies.
I think that as you think about data factories emerging, there are three core concepts around them:
1. You have to believe that your core asset is data.
2. You have to have the ability to automate the data pipeline.
3. You need to know how to monetize it.
If you can monetize a data asset, you are a data factory.
And we see them today. Google, Facebook – they are very well known data factories. Data is their core asset. There are some that are not that well known: Zynga, the biggest gaming company on Facebook. They have 250 million users. I have 80 million consumers. So growth is viral, and data growth is viral. That will change massively. So those are the core principles.
Certain things are lacking today: Data protocols, the rules around data – what I call the new common laws around data – are lacking. How will the larger economy embrace data factories & push them forward? And some major challenges around things like privacy and security have yet to be met. And some very smart thinkers are thinking around those areas. So the issues and the common economic and legal landscape have to change. The factories will come, the principles already exist, the tools already exist. So we will see a massive evolution where the factories will come to America and will drive the next massive wave of innovation.
DV: The future is bright, John.
JF: The Cube is all about acquiring knowledge, and you’re doing great.
AM: The West Coast has been kind to me. I spent six months here earlier this year & the West Coast embraced me very openly because we saw that when we asked the question: If I have to process massive amounts of data, who does it today? They all said in the Valley. So Yahoo, Facebook, Twitter, Zynga have all been very kind to me. My education comes from them.
JF: There’s a camaraderie. That’s a people-centric kind of thing that no one’s talking about. This is a very early emerging trend. It’s still small – you know the principals by name. It’s like Linux back in the day. Is there a camaraderie amongst the big data shops and geeks and developers? Because it is a mixed bag, and they are all playing together in this sandbox.
AM: It’s a very good question. It’s a lesson I’ve had to learn the hard way. I’ll share it: One, the science is tricky, and it’s on steroids. It’s an art. Here’s why: As of today, this year, we have 1.2 zettabytes of information in existence. A zettabyte is 1,000 exabytes, 1 million petabytes. Cisco estimates that by 2013 there will be 70 exabytes of information flowing through the Internet, and 5% of that will be structured. All the rest will be unstructured. Now look at the enterprise databases. It’s all relational, all structured. So most of the information being produced is unstructured. It’s like the Google robotic car. You look inside and there’s no driver. That’s scary. And that is the future.
We have a large call center. And we can improve our service to our customers if we had the ability to understand exactly what our customers were saying to us. So that is one thing. A lot of data, most of it unstructured.
The second thing is sampling is dead. You can now model at the population level, with the whole data set. Which is massively interesting.
And the third thing, the lesson we learned: algorithms are no longer proprietary. How you apply them is. So the competitive advantage is not going to come from writing the next graph algorithm. Because “People You May Know” on LinkedIn, using GPS data for example, as well as risk concentrations, can all be done with a graph algorithm which has already been written. So forget about rewriting it. How you apply it and which problem you apply the algorithm to is now the important thing. Which is a massive game changer because people for the longest time have spent their time thinking about writing an algorithm that nobody has and protecting it.
So to answer your question in a long-winded way, because of those three factors there is a lot of camaraderie emerging in the space because people are realizing that the competitive advantage that historically came from these very smart quants writing algorithms is no longer there. The advantage will be you can use a graph algorithm for one thing, and I can use it for something else, and it’s fine for us to talk about it, because there are these nice pockets of value that I’m not impeding on them.
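Illustrative aside (not part of the interview): a minimal sketch of the idea that one already-written graph routine can serve unrelated problems. The function below simply counts two-step paths; it is not LinkedIn's or any bank's actual method, and both toy graphs and all names in them are hypothetical.

```python
from collections import Counter


def two_hop_counts(graph: dict[str, set[str]], start: str) -> Counter:
    """Count length-two paths from `start` to nodes it is not directly linked to."""
    direct = graph.get(start, set())
    counts: Counter = Counter()
    for neighbor in direct:
        for candidate in graph.get(neighbor, set()):
            if candidate != start and candidate not in direct:
                counts[candidate] += 1
    return counts


# Application 1 (hypothetical social graph): suggest people you may know,
# ranked by the number of mutual connections.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
}
print(two_hop_counts(friends, "ann"))

# Application 2 (hypothetical exposure graph): find indirect counterparties
# that a bank is exposed to through more than one intermediary.
exposures = {
    "bank": {"fund_a", "fund_b"},
    "fund_a": {"bank", "issuer_x"},
    "fund_b": {"bank", "issuer_x", "issuer_y"},
}
print(two_hop_counts(exposures, "bank"))
```

The routine is identical in both calls; only the data and the question asked of it change, which is where the passage above locates the competitive advantage.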
DV: I wanted your take on Open Source. You are seeing a lot of big companies like Oracle now owning Open Source; you’re seeing them use their power to go after Google, for example, with Java Mobile Edition. How do you deal with that? Do you just say the innovation is just so great, damn the torpedoes, we’re going to go forward? Or do you have some processes, & what kind of advice would you give to people who want to go forward but are concerned about the potential legal aspects or the wild, wild west of Open Source?
AM: It’s a great question. I don’t have all the answers and don’t claim to have them. I think the industry needs to take a leadership role, come together, and establish a set of laws for intangible property like data, just like those that exist for physical property. They don’t exist today. I think that’s where – I was talking to the O’Reilly guys about it a few weeks ago. The industry needs to come in and do its own proclamation of digital rights. Rather than wait for someone to come in & do it to us, let’s take the lead. We know the issues, we know the power of the platform and how it could potentially be misused. Let’s take the lead ourselves and be the white knight, the purveyors of good, & establish a rule book that can enable people to leverage technology for solving problems & not in a bad way. I don’t think I know the way to go forward and do it, because there is a lot of creative tension in and around people trying to figure out what space they should employ. It’s like going to Japan & going into the subway system. You’re always wiggling for elbowroom. People are doing that right now, they’re trying to find elbowroom. They will find elbowroom but they will realize it’s okay to be in close quarters with each other. They’re still the same people waiting for the same train to the same destination in the same place. It’s okay to be close together.
DV: That’s great. And I’m hearing, John, that there are tremendous opportunities not just for technology but in legal frameworks as well and in media and other types of frameworks as well such as social networks.
JF: One of the things we were talking about before you came on is we are throwing off video now at such a high scale for us – well, for anyone. No one’s doing this kind of tech programming. We’re throwing off some significant big data. So we have a significant storage back-end problem. So we’re recruiting Cloudera executives to help us fund, or help us find some developers.
DV: And it’s mobile, so it’s not easy for us to just bring in big honking ….
JF: But it doesn’t exist. Automating the inbound data flow transfer. Just for us to transfer from our machines to….
AM: Plus the ability to search it. There was this fascinating prof at MIT that spoke about the ability to tag videos in a semantic fashion. So if I go to YouTube today I can only search by the title. I can’t type “fast red car” and have every video with a fast red car come up, unless the video is titled “Fast, Red Car” or “Red Car”. How’s that going to change?
JF: Yes, how do you search video?
AM: The scariest thing I’ve heard is the Large Hadron Collider in Switzerland. Guess how much data it throws out every second? Forty terabytes a second.
JF: That’s big data!
AM: Holy shit, how do you store that? So there’s so much innovation to come. That’s the fun part. I think we’re witnessing the birth of a business revolution. There’s a lot to be done. Someone taught me this lesson: Go find a business problem to solve. Success will follow, money will follow. That’s the big thing for us at BoA.