Data and analytics interrupt their programs for generative AI

What a difference a few months makes. If at the beginning of the year you asked anyone outside the artificial intelligence practitioner community about large language models and generative AI, they probably wouldn’t have known what you were talking about.

When we had some first-quarter advance strategy briefings with various data and analytics vendors, the topic never came up. Yet, as ChatGPT went viral, the conversation suddenly changed. Just as with smartphones, YouTube videos and TikTok, enterprises found themselves following the consumer lead in evaluating and adopting new technology. Yep, chalk up another case in the technology world of the tail wagging the dog.

Over the spring conference season, we had the chance to spend time with Databricks, DataStax, IBM, MongoDB, Oracle, SAP, SAS, Snowflake and Teradata, and not surprisingly, we found generative AI, large language models or LLMs, and foundation models or FMs, taking over the conversation. Now that the summer has given us some downtime to reflect, we’ve distilled some of the common themes, and as we put them together, we also noticed some major gaps.

English becomes the default API

If there is a single headline, it is that generative AI is turning English (and, down the pike, other commonly spoken languages) into the most popular application programming interface or programming language. It is becoming possible to type, and likely not too far in the future to talk, to your computer the way Captain Kirk did in “Star Trek.” That’s key to why generative suddenly took over the conversation.

Previous waves of AI, which were mostly centered on machine learning, happened under the hood. Maybe the e-commerce site where you shopped started giving you better recommendations for what to buy, while your streaming service seemed more spot-on in suggesting what to watch next. To the consumer, however, the difference that AI made in the experience was too subtle to notice.

By contrast, the experience of typing a conversational question rather than a keyword search, and getting a conversational answer, was truly transformative, even if the results read like something a high school senior would have typed up. No wonder people started taking notice.

And now enterprises are looking to replicate that experience, with business intelligence queries a natural starting point. Generative AI has the potential to make what we were calling “natural language queries” far less robotic: In place of the keywords or pointers to specific columns that drove machine learning models to turn query and response into prose, it could make the process truly conversational. Emerging services such as ThoughtSpot Sage, Snowflake Document AI or Databricks LakehouseIQ could pick up where Tableau Ask Data left off.
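
To make the pattern concrete, here is a minimal sketch of the natural-language-to-SQL loop that underpins such services, using OpenAI’s Python client as a stand-in for whatever model a vendor embeds. The schema, prompt and model name are illustrative assumptions, not how ThoughtSpot, Snowflake or Databricks actually wire it up.

```python
# Minimal sketch of conversational BI: translate a question into SQL via an LLM.
# Assumes the openai package (v1 client) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# A hypothetical schema the model is told about; real services introspect catalogs.
SCHEMA = "Table sales(order_id INT, region TEXT, amount DECIMAL, order_date DATE)"

def question_to_sql(question: str) -> str:
    """Ask the model to turn a conversational question into a single SQL query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model would do
        messages=[
            {"role": "system",
             "content": "Translate the user's question into one SQL query "
                        f"against this schema. Return only SQL.\n{SCHEMA}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(question_to_sql("Which region had the highest sales last quarter?"))
```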

Then there is the coding side, where generative AI can pick up where traditional autocomplete leaves off. We’ve seen a flurry of announcements from Amazon Web Services (CodeWhisperer), IBM (Watson Code Assistant), Microsoft (GitHub Copilot), Databricks (the English SDK for Apache Spark) and others bringing out services that can do everything from filling in a missing piece of code to generating all of it from a declarative, conversational request. The generative model sorts through a corpus of code and spits out a program, or large parts of one. Many of these services will also scan for bugs, security gaps, bias and privacy issues.
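
For illustration, the same chat-completion pattern can mimic what an assistant does behind the editor: hand the model a stub plus instructions and get back a filled-in body and a review. The stub and prompt below are hypothetical; products such as Copilot or CodeWhisperer integrate this into the development workflow rather than exposing a raw API call.

```python
# Sketch of the code-assistant workflow: complete a stub, then flag issues.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

stub = '''def median(values: list[float]) -> float:
    """Return the median of a non-empty list."""
    # TODO: implement
'''

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Complete this function, then list any bugs or edge cases "
                   "a reviewer should check:\n" + stub,
    }],
)
print(completion.choices[0].message.content)
```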

The bright side is that generative code assistants eliminate enormous legwork (and, potentially, coding backlogs) and offer a ready answer to the perennial developer skills shortage. The flip side is that they raise serious intellectual property questions for code creators who believe generative models will rip off their IP. Either way, don’t underestimate the pent-up demand for these services.

How the generative sausage is made

Vendors also addressed how generative is generated. It starts with strategies around infrastructure (where virtually everybody is lining up to become BFFs with Nvidia) and extends to how and where enterprises seeking their own generative solutions should start.

The next question is what data to use. Most enterprises will not want to rely on models that deliver answers based on the unwashed and highly generalized internet. The models and the corpuses of data must be trustworthy and relevant to the domain. A refrain we heard constantly all spring was about having “your models with your data.” We got a hint of how powerful this theme would be when a quick LinkedIn post we dashed off on the topic last winter drew 60,000-plus hits over the course of a weekend.

Then, of course, there is the need to choose the model. The action will be around prebuilt FMs, because few organizations have the resources (or the desire) to become the next OpenAI and build their own from scratch. Just as the data must be relevant to the domain, so must the model: No single FM will solve all use cases. And so a common refrain was providers offering portfolios of FMs, with support for third-party or bring-your-own models as well.

Given the scale of LLMs, marshaling the data is no trivial task. Content must be chunked into tokens, which for an LLM could be a word or part of a word. But processing raw tokens at LLM scale gets massive. That’s where vector embeddings come in: They condense tokens into numerical representations of meaning, stored as vectors, so that text with similar meanings ends up mathematically close together.
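
As a rough illustration of that chunk-and-embed step, here is a minimal sketch assuming the open-source sentence-transformers library; the model choice and text chunks are illustrative, and production pipelines differ in chunking strategy and model.

```python
# Turn text chunks into embedding vectors and compare their meanings.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used encoder

chunks = [
    "Quarterly revenue grew 12% on strong cloud demand.",
    "Cloud sales drove double-digit revenue growth this quarter.",
    "The office cafeteria now serves espresso.",
]

# Each chunk becomes a fixed-length vector; similar meanings land close together.
vectors = model.encode(chunks, normalize_embeddings=True)

# With normalized vectors, cosine similarity is just a dot product.
# The first two chunks score far higher with each other than with the third.
print(vectors @ vectors.T)
```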

Storage and indexing of vectors was another common theme, and the question has become whether vectors merit their own database or are simply a feature or add-on to existing databases. So yes, we saw upstarts such as Pinecone and Milvus emerge with their own platforms. But we also saw a raft of announcements from AWS, DataStax, Microsoft, MongoDB, Snowflake and others that were mostly about adding vector storage and indexing to existing databases. In the long run, we view vector support as a database feature, not a product, although we expect some providers, such as AWS, to straddle the issue by offering vector support both as a feature in existing databases and as a separate data platform.
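
What “vector support as a feature” boils down to can be shown with a brute-force sketch: store embeddings alongside rows and rank them by similarity at query time. The dimensions and data below are made up, and real databases replace the linear scan with approximate-nearest-neighbor indexes such as HNSW or IVF to make it fast at scale.

```python
# Brute-force nearest-neighbor search over stored embeddings with NumPy.
import numpy as np

rng = np.random.default_rng(0)
stored = rng.normal(size=(10_000, 384))              # pretend row embeddings
stored /= np.linalg.norm(stored, axis=1, keepdims=True)  # normalize for cosine

def top_k(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k stored vectors most similar to the query."""
    query = query / np.linalg.norm(query)
    scores = stored @ query                          # cosine similarity
    return np.argsort(scores)[::-1][:k]              # highest scores first

print(top_k(rng.normal(size=384)))
```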

What’s missing?

Amid all the pronouncements from data and analytics providers on generative AI over the spring, we found a huge omission. Like all AI, generative will require governance, and most providers are, for now, paying only lip service to it.

But generative adds its own unique twists, such as figuring out how to keep a process that looks sentient, but is not, from going off the rails. Answers provided by general-purpose services that scrape the internet, such as ChatGPT, all too often read like generically worded essays lacking footnotes.

Making “classical” machine learning explainable has been enough of a challenge. Now try explaining a generative model that produces its results through a long chain of next-most-likely-word probability computations. The industry is still figuring out how to monitor, mitigate and document issues unique to generative AI, which only start with hallucinations and inconsistent answers.
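
A toy example shows why that chain is so hard to explain: Each output token is just a probability distribution collapsed into one choice, repeated thousands of times per answer. The vocabulary and scores below are invented for illustration.

```python
# One link in the next-most-likely-word chain: logits -> softmax -> a token.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
logits = np.array([2.1, 0.3, 1.7, 0.9, 1.2])  # made-up scores from a "model"

probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax: scores -> probabilities

next_token = vocab[int(np.argmax(probs))]      # greedy pick; sampling adds variety
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```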

Inevitably, LLMs used by enterprises will have to document their sources to generate audit trails, but the models themselves will remain black boxes. Certifying those models will likely be an empirical process, and it will take some time for best practices to crystallize. We’re still at the beginning of this journey.

Tony Baer is principal at dbInsight LLC, which provides an independent view on the database and analytics technology ecosystem. Baer is an industry expert in extending data management practices, governance and advanced analytics to address the desire of enterprises to generate meaningful value from data-driven transformation. This post is excerpted for SiliconANGLE from an in-depth study on the impact of generative AI on the data and analytics landscape. The complete report can be downloaded free of charge here.

Image: Bing Image Creator
