Q&A: VP discusses deployment-flexible Vertica v10 and delivering transparent, replicable ML
It’s a wild ride keeping up in the surf of technological change. Some companies miss the wave, while others catch it only to wipe out as it crests. In the world of big data, Vertica has made a habit of catching wave after wave, positioning itself, paddling hard, then hanging ten as it rides the curl.
An early player in big data services, Vertica has smoothly transitioned from one trend to the next, be it massively parallel processing (MPP) architecture, big data with the Hadoop Distributed File System (HDFS), data science and data analytics, or cloud and machine learning. Vertica is currently the only platform that offers disaggregated compute both on-premises and in the cloud, and with the release of Vertica version 10, the platform is taking on the next level of deployment flexibility.
“Vertica is at its core a true engineering culture,” said Joy King (pictured), vice president of Vertica product management and marketing at Micro Focus International PLC. “That means we don’t pretend to know everything that’s coming. But we are committed to embracing the technology trends, the innovations. … We don’t pretend to know it all; we just do it all.”
King spoke with Dave Vellante, host of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the virtual Vertica Big Data Conference. They discussed trends in data and how Vertica is making machine-learning models transparent and replicable. (* Disclosure below.)
[Editor’s note: The following content has been condensed for clarity.]
I’ve said to a number of our guests that Vertica’s always been good at riding the wave. What are the current trends that you see? The big waves that you’re riding right now.
King: Data growth and data silos is trend one. Hadoop is a very reasonably capable elephant, but she can’t be an entire zoo. So, there’s a lot of disappointment in the market but a lot of data in HDFS. You combine that with the explosion of cloud object storage, you’re talking about even more data, but even more data silos.
Trend two is the cloud reality. Cloud brings so many advantages; there are so many opportunities that public cloud computing delivers. But I think we’ve learned enough now to know that there’s also some reality. It’s a little more pricey than we expected, there are some security and privacy concerns, there’s some workloads that can’t go to the cloud, so hybrid and also multicloud deployments are the next trend that are mandatory.
The trend that is maybe the most exciting in terms of changing the world (and we could use a little change right now) is operationalizing machine learning. There’s so much potential in the technology, but it somehow has been stuck, for the most part, in science projects and data science labs, and the time is now to operationalize it.
I think we all know that data analytics, machine learning, none of that delivers real value unless the volume of data is there to be able to truly predict and influence the future. The last seven to 10 years have been, correctly, about collecting the data, getting the data into a common location. And HDFS was well-designed for that. Now the key is, how do we take advantage of all of that data? And now that’s exactly what Vertica is focusing on.
Vertica 10.0 just released. What are the highlights?
King: Vertica in Eon Mode allows workload isolation, meaning allocating the compute resources that different use cases need without allowing them to interfere with other use cases and allowing everybody to access the data. So, it’s a great way to bring the corporate world together but still protect them from each other.
With Vertica 10.0, we are introducing Vertica in Eon Mode for HDFS and Vertica in Eon Mode on Google Cloud. Eon mode for HDFS is a way to apply an ANSI SQL database management platform to HDFS infrastructure and data in HDFS file storage. And that is a great way to leverage the investment that so many companies have made in HDFS. And I think it’s fair to the elephant to treat her well.
You beat a number of the cloud players with the capability for separate compute and storage on-premises and in the cloud. That is a differentiator for Vertica, assuming that you’re giving me that cloud experience, and the licensing, and the pricing capability. Can you explain how Vertica handles licensing and costs?
King: There is no question that the public clouds introduced the separation of compute and storage and these advantages. But they do not have the ability, or the interest, to replicate that on-premise. For Vertica, we were born to be software-only. We don’t charge as a package for the hardware underneath, so we are totally motivated to be independent of that and also to continuously optimize the software to be as efficient as possible.
Vertica offers per node and per terabyte for our customers, depending on their use case. We also offer perpetual licenses for customers who want CAPEX. But we also offer subscription for companies that say, ‘Nope. I have to have OPEX.’ This can certainly cause some complexity for our field organization; we know that it’s all about choice, that everybody in today’s world wants it personalized just for me, and that’s exactly what we’re doing with our pricing and licensing.
So, my takeaway here is optionality and pricing your way. That’s great. Now let’s talk about storage optionality. You’ve got Amazon Web Services Inc., I’m presuming Google LLC now, Pure Storage Inc. is a partner …
King: We support Google object store, Amazon S3 object store, HDFS, Pure Storage FlashBlade, which is an object store on-prem, and we are continuing on this path. Because, ultimately, we know that our customers need the option of having next-generation data center architecture, which is sort of shared or communal storage, so all the data is in one place, but workloads can be managed independently on that data, and that’s exactly what we’re doing.
Let’s talk about applying machine intelligence to the data, the machine learning piece of it. What’s your story there?
King: Quite a few years ago, we began building some in-database, native in-database machine-learning algorithms into Vertica. And the reason we did that was we knew that the architecture of MPP columnar execution would dramatically improve performance. We also knew that a lot of people speak SQL. So, what if we could give access to machine learning in the database via SQL and deliver that kind of performance? That’s the journey we started on.
Then we realized that actually machine learning is a lot more, as everybody knows, than just algorithms. So, we then built in the full end-to-end machine-learning function, from data preparation to model training, model scoring and evaluation, all the way through to full deployment. And all of this SQL-accessible. You speak SQL; you speak to the data. And the other advantage of this approach was, we realized that accuracy was compromised if you down sample.
If you moved a portion of the data from a database to a specialty machine learning platform, you were challenged by accuracy and also what the industry is calling replicability. And that means, if a model makes a decision, like let’s say credit scoring, and that decision is in any way challenged, well, you have to be able to replicate it, to prove that you made the decision correctly.
There was a bit of a blowup in the media not too long ago about a credit-scoring decision that appeared to be gender-biased, but unfortunately, because the model could not be replicated, there was no way to disprove that, and that was not a good thing.
So, all of this is built into Vertica, and with Vertica 10, we’ve taken the next step. Just like with Hadoop, we know that innovation happens within Vertica but also outside of Vertica. We saw that data scientists really love their preferred language, like Python; they love their tools and platforms, like TensorFlow. With Vertica 10, we now integrate even more with Python, which we have supported for a while, but we also integrate with TensorFlow and PMML.
What does that mean? It means that if you build and train a model, external to Vertica, using the machine-learning platform that you like, you can import that model into Vertica and run it on the full end-to-end process but run it on all the data. No more accuracy challenges, MPP columnar execution, so it’s blazing fast. And, if somebody wants to know why a model made a decision, you can replicate that model and you can explain why.
It also brings cultural unification. It unifies the business analyst community, who speak SQL, with the data scientist community, who love their tools, like TensorFlow and Python.
In so much of machine intelligence and artificial intelligence, there’s a black box problem: You can’t replicate the model, and then you run into potential gender bias. Being able to replicate that and open up and make the machine intelligence transparent is very, very important.
King: It really is, and that replicability, as well as accuracy, is critical, because if you’re down sampling and you’re running models on different sets of data, things can get confusing. Doing it in the database or training the model and then importing it into the database for production, that’s what Vertica allows. This is the next step in blazing the ML trail.
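To make the down-sampling point concrete, here is a minimal sketch in plain Python. It is not Vertica code and does not reflect how Vertica trains or scores models; the synthetic data, the `fit_slope` helper and the 1% sample size are all assumptions chosen for illustration. It fits the same simple linear model once on a full data set and then on three different small samples of it.

```python
import random


def fit_slope(points):
    """Ordinary least-squares slope of y on x (one feature, with intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var


# Hypothetical synthetic data: y = 2x plus noise (an assumption, not real data).
rng = random.Random(0)
data = []
for _ in range(10_000):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + rng.gauss(0, 5)))

# Fitting on the full data set is repeatable: same rows in, same model out.
full_fit_a = fit_slope(data)
full_fit_b = fit_slope(data)

# Fitting on different 1% samples gives a different model each time.
sample_fits = [
    fit_slope(random.Random(seed).sample(data, 100)) for seed in (1, 2, 3)
]

print("full data:", full_fit_a, full_fit_b)
print("1% samples:", sample_fits)
```

In this toy setup, the two full-data fits are identical, while each sampled fit lands on a different slope. That is the kind of discrepancy that makes a decision hard to reproduce when the sample used for training is not recorded.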
What are your customers pushing you for, and what are you delivering?
King: The number one thing that our customers are demanding right now is deployment flexibility. What I tell them is, it is impossible to know what you’re going to be commanded to do or what options you might have in the future; the key is not having to choose. And they are very, very committed to that.
I would say that the interest in operationalizing machine learning, but not necessarily forcing the analytics team to hammer the data science team about which tools are the best tools, that’s probably number two.
And then I’d say number three is performance at scale. Look at companies like Uber Technologies Inc. or The Trade Desk Inc. or AT&T Corp. When they say milliseconds, they think that’s slow. When they say petabytes, they’re like, ‘Yeah, that was yesterday.’ So, for Vertica, good enough performance at scale is never good enough. And it’s why we’re constantly building at the core the next-generation execution engine, database designer, optimization engine, all that stuff.
Watch the complete video interview below, and be sure to check out more of SiliconANGLE’s and theCUBE’s coverage of the virtual Vertica Big Data Conference. (* Disclosure: TheCUBE is a paid media partner for the Vertica Big Data Conference. Neither Vertica, the sponsor for theCUBE’s event coverage, nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)