SAS Institute Adapts to the Big Data Era


The original SAS software package, which debuted over 35 years ago, was designed to run on IBM mainframes. A lot has changed in the world of IT since then, and SAS has evolved to keep up.

The latest stage in SAS’s evolution is a re-architecting of its software to run optimally in distributed computing environments. Between Hadoop and next-generation data warehouses, business analytics increasingly takes place against the backdrop of Big Data architectures, and SAS knows that’s where it has to be.

For SAS, the latest journey began around two years ago, according to Paul Kent, Vice President of Platform Research and Development at the Cary, N.C.-based firm. That’s when SAS teamed up with Teradata to provide SAS analytics inside the massively parallel enterprise data warehouse. Since then, it has forged similar partnerships with IBM Netezza, EMC Greenplum and Aster Data (since acquired by Teradata).

The shift to parallel computing was heralded by Google, which pioneered the practice of stringing together many commodity blades to form a single supercomputer, Kent points out. That, in turn, required SAS to rewrite its software and algorithms to run on multiple nodes simultaneously, an effort that is still ongoing. But the impact on users is significant.

Namely, in-database analytics obviates the need to move data between the data warehouse and a separate analytic engine or application, such as SAS. This means users spend less time moving data around and more time analyzing it.
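SAS’s in-database implementation is proprietary, but the underlying idea can be illustrated generically. The sketch below uses SQLite as a stand-in database and a hypothetical `sales` table: the first approach pulls every row out to the client before aggregating, while the second pushes the math into the database so only the result crosses the wire.

```python
import sqlite3

# Stand-in database with a small hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 200.0), ("west", 50.0), ("west", 150.0)],
)

# Approach 1: move the data -- fetch every row, then aggregate client-side.
rows = conn.execute("SELECT amount FROM sales").fetchall()
client_side_total = sum(amount for (amount,) in rows)

# Approach 2: move the math -- the database computes the aggregate,
# so a single number comes back instead of the whole table.
(in_database_total,) = conn.execute("SELECT SUM(amount) FROM sales").fetchone()

print(client_side_total, in_database_total)  # both 500.0
```

With four rows the difference is invisible; with billions of rows, the second approach is what lets users "spend less time moving data around and more time analyzing it."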

For example, one SAS customer, a large national retail company, reduced the time it spent running marketing optimization analytics from one week (170 hours, to be precise) to three minutes or less, Kent said. The retailer can now take an iterative approach to analytics, rather than running just one time-intensive job to support a week’s worth of marketing objectives.


In-database analytics also makes it possible to run analytics on full data sets, rather than samples. Moving large data sets to and from systems is impractical, so admins often end up transferring smaller, more manageable sample data sets for analysis. They sometimes then run analytics on the sample to confirm it is representative of the complete data set before running the actual analysis, Kent said, taking up even more valuable time.

With “the math inside the machine,” as Kent puts it, those steps are no longer necessary.

The ability to run analytics on complete data sets is particularly important when it comes to predicting future events based on historical trends. Take a mortgage lender evaluating risk, for example. If it relies on just sample data from a two-year period during a recession to score applicants, it could miscalculate likely default rates and deny loans to otherwise qualified people during more prosperous economic times.

But the benefits of in-database analytics for users mean new challenges for SAS. Its engineers must understand how best to partition data across clusters of commodity storage for optimum performance, Kent said, and they must make the transition as seamless as possible for customers. Both efforts, he added, are works in progress.
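How SAS partitions data internally is not public, but the general technique is straightforward to sketch. The example below shows hash partitioning with a made-up byte-sum hash and hypothetical customer keys: each row is assigned to a node by hashing its key, so rows for the same customer land on the same node and each node can crunch its shard in parallel.

```python
NUM_NODES = 4

def assign_node(key: str, num_nodes: int = NUM_NODES) -> int:
    # Toy stable hash (sum of the key's bytes), so the same key
    # always maps to the same node across runs.
    return sum(key.encode()) % num_nodes

# Hypothetical rows: (customer key, transaction amount).
rows = [("cust-17", 120.0), ("cust-42", 75.5),
        ("cust-17", 30.0), ("cust-99", 210.0)]

# Distribute the rows into per-node shards.
shards = {n: [] for n in range(NUM_NODES)}
for key, amount in rows:
    shards[assign_node(key)].append((key, amount))
```

Co-locating rows that share a key is the property that matters: per-customer aggregates can then run entirely within one node, with no cross-node shuffling.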

Then there’s Hadoop. SAS has yet to bring its analytic prowess to the open source Big Data framework, but the company is poised to release three Hadoop connectors – one each for Cloudera, Hortonworks and MapR – in the near future, Kent said. SAS also adds its capabilities to other data warehouses, such as ParAccel and HP Vertica, depending on the level of customer interest.

It’s all just part of the latest evolution of SAS, this time for the Big Data Era.