UPDATED 17:06 EDT / FEBRUARY 10 2019

Trust but verify: Machine learning’s magic masks hidden frailties

AI SPECIAL REPORT by Paul Gillin

The idea sounded good in theory: Rather than giving away full-boat scholarships, colleges could optimize their use of scholarship money to attract students willing to pay most of the tuition costs.

So instead of offering a $20,000 scholarship to one needy student, they could divide the same amount into four scholarships of $5,000 each and dangle them in front of wealthier students who might otherwise choose a different school. Luring four paying students instead of one nonpayer would create $240,000 in additional tuition revenue over four years.
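The arithmetic behind that figure can be sketched in a few lines. The $20,000 annual tuition is an assumption inferred from the numbers above (a $20,000 scholarship is treated as covering the full bill); the function and variable names are illustrative only.

```python
# Hypothetical sketch of the financial aid leveraging arithmetic.
# ASSUMPTION: annual tuition equals the $20,000 full scholarship.

def four_year_revenue(num_students, annual_tuition, annual_scholarship, years=4):
    """Net tuition collected from a group of students over `years`."""
    net_per_student = annual_tuition - annual_scholarship
    return num_students * net_per_student * years

ANNUAL_TUITION = 20_000  # assumed, not stated in the article

one_full_ride = four_year_revenue(1, ANNUAL_TUITION, 20_000)  # one nonpayer
four_partial = four_year_revenue(4, ANNUAL_TUITION, 5_000)    # four partial payers

additional_revenue = four_partial - one_full_ride
print(additional_revenue)  # 240000
```

Under that assumption, each of the four students pays $15,000 a year, which adds up to the $240,000 cited.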

The widely used practice, called “financial aid leveraging,” is a perfect application of machine learning, the form of predictive analytics that has taken the business world by storm. But it turned out that the long-term unintended consequence of this leveraging is an imbalance in the student population between economic classes, with wealthier applicants gaining admission at the expense of poorer but equally qualified peers.

Machine learning, a branch of artificial intelligence, applies specialized algorithms to large data sets to uncover factors that influence outcomes but might be invisible to humans because of the sheer quantity of data involved. Researchers are using machine learning to tackle a wide variety of tasks of unimaginable complexity, such as determining harmful drug interactions by correlating millions of patient medication records or identifying new factors that contribute to equipment failure in factories.

Web-scale giants such as Facebook Inc., Google LLC and Microsoft Corp. have stoked the frenzy by releasing robust machine learning frameworks under open-source licenses. Enrollment in machine learning courses at top universities has tripled since 2010. The number of Google searches on the term “machine learning” has surged nearly sevenfold since 2012.

Companies now hawk machine learning as an ingredient in everything from enterprise supply chain management software to kids’ dolls. Yet for all the hype, many people still have only a rudimentary understanding of what machine learning can do – and most important, how it can go wrong.

Questionable outcomes

Cornell’s Passi: “If the system completely matches gut feeling it’s as useless as if it delivers the complete opposite of gut feeling.” (Photo: Cornell University)

Financial aid leveraging is one of several examples of questionable machine-learning outcomes cited by Samir Passi and Solon Barocas of Cornell University in a recent paper about fairness in problem formulation. Misplaced assumptions, failure to agree on desired outcomes and unintentional biases introduced by incomplete training data are just some of the factors that can cause machine learning programs to go off the rails, yielding data that’s useless at best and misleading at worst.

“People often think that bad machine learning systems are equated with bad actors, but I think the more common problem is unintended, undesirable side effects,” Passi said in an interview with SiliconANGLE.

Although there’s no evidence that misguided machine learning algorithms have killed anybody, there is plenty of evidence that they have caused harm. A 2016 ProPublica analysis of the risk assessment algorithms widely used by U.S. law enforcement agencies to predict repeat offenses found that most exhibit a strong bias against African-American defendants, despite the fact that race is not technically a factor in the equation.

That doesn’t surprise Passi. He noted that law enforcement agencies often treat arrests as a proxy for crime. “So they look for areas where most arrests occur and allocate more police resources there,” he said. “Deploying more officers leads to more arrests, which increases the crime rate statistics.”
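The feedback loop Passi describes can be illustrated with a toy simulation. Everything here is hypothetical: two districts with an identical arrest rate per officer, a small initial difference in staffing, and a rule that moves officers toward whichever district logged more arrests. It is a sketch of the dynamic, not a model of any real agency.

```python
# Toy feedback loop: arrests are treated as a proxy for crime.
# ASSUMPTION: both districts have the same underlying rate, so any gap
# in arrest counts comes purely from where officers are deployed.

def simulate_patrols(officers, arrests_per_officer, rounds, shift=5):
    """Reallocate officers toward the district with the most arrests."""
    officers = list(officers)
    history = []
    for _ in range(rounds):
        arrests = [n * arrests_per_officer for n in officers]
        history.append(arrests)
        hot = arrests.index(max(arrests))
        cold = arrests.index(min(arrests))
        moved = min(shift, officers[cold])
        officers[cold] -= moved
        officers[hot] += moved
    return officers, history

final_officers, history = simulate_patrols([55, 45], arrests_per_officer=2, rounds=5)
print(history[0], history[-1], final_officers)
```

In this sketch the first district starts with 55 of 100 officers and ends with 80, and its arrest count grows from 110 to 150 even though nothing about the underlying rate differs between the two.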

In an example that hits closer to home for business executives, Amazon.com Inc. abandoned a machine learning-based recruitment application in 2017 after three years of development when the software showed continual bias toward male candidates despite Amazon’s efforts to compensate. The source of the problem was the data Amazon used to train the application: It was largely composed of resumes from candidates in the male-dominated computer industry. Even after being instructed to ignore candidates’ gender, the algorithm had learned to favor certain terms men commonly use to describe themselves.

A different kind of result

Those examples highlight a dynamic unique to machine learning and other applications of AI: Whereas conventional programs define a strict process for arriving at a reproducible outcome, machine learning algorithms can identify factors in an equation that aren’t explicitly stated. As a result, organizations that want to use these powerful new tools need to pay special attention to data quality, testing and transparent processes.

MapR’s Dunning: “Mix up training data with stuff you know and stuff you don’t know.” (Photo: Open World Forum)

“When you’re learning rather than designing from a spec, you don’t know as much about what the system should be doing so it’s harder to predict the outcome,” said Ted Dunning, chief application architect at MapR Technologies Inc. and author of the 2014 book “Practical Machine Learning: A New Look at Anomaly Detection.”

Those examples aren’t meant to imply that machine learning is inherently untrustworthy or to belittle its enormous value. However, they are a cautionary tale about the risks of taking the recommendations of an artificial intelligence engine at face value without understanding the factors that influence its decisions.

Like most branches of artificial intelligence, machine learning has acquired a kind of black-box mystique that can easily mask some of its inherent frailties. Despite the impressive advances computers have made in tasks like playing chess and piloting driverless automobiles, their algorithms are only as good as the people who built them and the data they’re given.

The upshot: Work on machine learning in coming years is likely to focus on cracking open that black box and devising more robust methods to make sure those algorithms do what they’re supposed to do and avoid collateral damage.

Any organization that’s getting started with machine learning should be aware of the technology’s limitations as well as its power. Experts contacted by SiliconANGLE cited five areas to watch:

1. Define terms

Success means different things to different people. Getting them to agree can be a maddeningly difficult task.

In their problem formulation paper, Passi and Barocas tell the true story of a company that specializes in collecting financial data on people who want car loans but have poor credit ratings. The company sells those people’s names to auto dealers, who have the option of trying to sell them a car. The company wanted to use data science to improve the quality of leads, a goal that seemed simple enough. The hope was that data science would yield diamonds in the rough: buyers with mixed credit histories who were nevertheless good credit risks.

But the project foundered because of disagreement on everything from what constituted a good lead to the definition of a high credit score. The data science team was unable to secure the data needed to match credit ratings to individuals, and because of inconsistent scoring mechanisms it was forced to divide prospective buyers into just two categories.

The result was that the dealers would be limited to two sets of candidates: one deemed a good credit risk and the other not. Candidates in the lower tier would never be considered for financing, effectively destroying the project’s original goal.

The story is emblematic of a problem that can easily frustrate machine learning projects: People in the same company on the same team often have different definitions of success. Often they don’t even know it.

Indico’s Wilde: “If you can’t define the desired state, don’t expect AI to do it for you.” (Photo: LinkedIn)

Tom Wilde, chief executive of Indico Data Solutions Inc., a company that uses machine learning to improve process flows by interpreting unstructured data, recalls working on one project with a financial institution that wanted to automate the analysis of requests for proposals. The task involved evaluating about 40 attributes, which team members assumed were well-understood.

“We found there was about 20 percent consistency in those definitions,” Wilde said. “There’s no way the model could be successful.”

At Couchbase Inc., a customer that wanted to optimize promotions to the biggest potential spenders on pleasure cruises was frustrated by lack of agreement among its suppliers on the most basic data elements.

“We discovered during the definition process that they had seven different definitions of gender – male, female, undecided and several variations,” said Sachin Smotra, Couchbase’s director of product management. “They were working with five different partners that each had their own definitions.”

MapR’s Dunning recalls one project he worked on that was intended to recommend videos. The team chose to build the model based upon the titles users clicked upon the most, but initial results were disappointing. It turned out that “people put terrible titles on videos,” Dunning said. “We changed the data from clicks to 30-second views and the result was a 400 to 500 percent improvement in the value of the recommendations.” A small change in the input variable thus created a huge magnifying effect on result quality.

The lesson in all these examples: Gaining consensus on what will be measured and what data is meaningful is an essential first step, experts said. Otherwise, assumptions will be misguided from the start.

2. Choose the right problem to solve

As a form of predictive analytics, machine learning works best when data from the past can predict future outcomes. That makes it well-suited for applications like anomaly detection in machine log files and predictive maintenance, but a poor candidate for stock market prediction or open-ended questions such as “What is the meaning of life?”

“There are two reasons to use machine learning: Either the amount of data is too much or there are too many input vectors to figure out,” said Daniel Riek, senior director of the artificial intelligence center of excellence at Red Hat Inc. “Choose use cases that lend themselves well to machine learning.”

Experts advise focusing on problems with a limited domain of inputs and possible outcomes. “We see a lot of AI initiatives that begin as discovery projects without any real business outcome in mind,” said Indico’s Wilde. “Almost all of them stall.”

Even when the variables and outcomes are well-defined, predictive models are rarely certain. “It’s probabilistic, not deterministic,” said Seth Dobrin, vice president and chief data officer of analytics at IBM Corp. “You don’t get a defined answer but rather a likelihood.”

A prominent recent example was the 2016 U.S. presidential election. Based upon well-documented demographics and historical voting patterns, most machine learning models predicted Hillary Clinton to win. But the models couldn’t factor in surprise events such as the reopening of an FBI investigation or fake news.

And even without those elements, the best models predicted a Clinton victory with only about 70 percent probability, leaving plenty of margin for an upset. Voters and the news media may have been surprised at the result, but statisticians weren’t.

“The fact that Trump narrowly won when polls had him narrowly trailing was an utterly routine and unremarkable occurrence,” wrote Nate Silver, founder and editor-in-chief of the political and sports analytics site FiveThirtyEight. “The outcome was well within the ‘cone of uncertainty,’ so to speak.”
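To make the point about likelihoods concrete, here is a minimal simulation of what a 70 percent forecast implies. It is not a reconstruction of any election model; the function name, trial count and seed are arbitrary choices for illustration.

```python
import random

def upset_rate(favorite_win_probability, trials=10_000, seed=0):
    """Share of simulated contests in which the favorite loses."""
    rng = random.Random(seed)
    upsets = sum(rng.random() >= favorite_win_probability for _ in range(trials))
    return upsets / trials

rate = upset_rate(0.70)
print(f"Favorite loses in about {rate:.0%} of simulated contests")
```

A forecast of 70 percent is a statement that the underdog should win roughly three times in 10, which is what the simulated share hovers around.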

At best, the outcome of a machine learning process should be taken with a grain of salt. “It’s basic probability derived from your training data that a certain input yields a certain output,” said Red Hat’s Riek. “Then you run through production to see if that result is an acceptable outcome.”

The ability to repeat results is important. When presented with similar but not identical data sets, the machine learning model should return similar results each time it’s run. Continuous validation testing and repetition improve confidence. “If you run the same model 30 times, you should get the same ranking each time,” Dobrin said. But even then, real-world results may differ.
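One way to check that kind of repeatability is to rerun a ranking on resampled versions of the data and count how often the order holds. The sketch below uses a deliberately trivial "model" (rank items by their average outcome) and made-up observations; the item names, the bootstrap resampling and the 30 runs are assumptions chosen to mirror Dobrin's example, not a method he described.

```python
import random

def rank_items(observations):
    """Rank item names by mean observed outcome, best first."""
    means = {name: sum(vals) / len(vals) for name, vals in observations.items()}
    return sorted(means, key=means.get, reverse=True)

def ranking_stability(observations, runs=30, seed=0):
    """Fraction of resampled runs whose ranking matches the baseline."""
    rng = random.Random(seed)
    baseline = rank_items(observations)
    matches = 0
    for _ in range(runs):
        # Bootstrap: similar but not identical data on every run.
        resampled = {name: rng.choices(vals, k=len(vals))
                     for name, vals in observations.items()}
        if rank_items(resampled) == baseline:
            matches += 1
    return matches / runs

# Hypothetical outcomes (1 = success, 0 = failure) for three items.
observations = {
    "A": [1] * 160 + [0] * 40,
    "B": [1] * 100 + [0] * 100,
    "C": [1] * 40 + [0] * 160,
}
stability = ranking_stability(observations)
print(rank_items(observations), stability)
```

A stability figure well below 1.0 would suggest the ranking depends more on the particular sample than on the underlying signal.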

3. Use comprehensive, relevant data

Developers working with transactional systems know the definition of bad data: four digits in a ZIP code field is a problem. With machine learning, the distinctions aren’t as obvious.

In fact, machine learning algorithms have a higher tolerance for “dirty” data because they can learn to identify and discard it over time. “The quality of the data improves with the amount of learning that you do,” said Pradeep Bhanot, director of product marketing at Actian Corp.

And in contrast to traditional data cleansing, which emphasizes scaling down and summarizing data, machine learning algorithms work best with large quantities of raw information and an iterative approach to improvement. “Bigger sample sizes and more iterations give you more accuracy,” Bhanot said.

Because machine learning is probabilistic, outputs are more like judgments than absolute answers. The more data the model has, the better the result should be, and data doesn’t have to be scrubbed and normalized to the degree it is for transaction processing.

“The traditional assumption is that data quality has to be perfect, and that’s not really true if you have a learning system,” said MapR’s Dunning. “Very often learning systems can learn to compensate.” In fact, Dunning recommends injecting some noise into the data to see if the algorithm successfully filters it out. “You will make the system run less well in the short run but much better in the long run,” he said.
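Dunning's suggestion can be tried on a toy problem: deliberately flip some training labels, fit the model on the noisy version, and score it against the clean labels. The threshold "learner," the 10 percent flip rate and the synthetic data below are all assumptions made for illustration.

```python
import random

def fit_threshold(xs, ys):
    """Pick the cut point on x that misclassifies the fewest training rows."""
    candidates = [i / 100 for i in range(101)]
    def errors(t):
        return sum((x > t) != y for x, y in zip(xs, ys))
    return min(candidates, key=errors)

def accuracy(threshold, xs, ys):
    """Share of rows where the threshold rule agrees with the label."""
    return sum((x > threshold) == y for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)
xs = [rng.random() for _ in range(2000)]
clean = [x > 0.5 for x in xs]                      # true rule: x above 0.5
noisy = [(not y) if rng.random() < 0.10 else y     # flip about 10% of labels
         for y in clean]

t_noisy = fit_threshold(xs, noisy)
acc = accuracy(t_noisy, xs, clean)
print(t_noisy, round(acc, 3))
```

If the learned threshold stays close to 0.5 despite the flipped labels, the learner is compensating for the noise rather than memorizing it.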

The bigger potential problem lies in data that doesn’t represent the full domain of the problem. Most data sets are biased, so finding training data that is comprehensive is a critical success factor.

Fortunately, the domain of public data sets is growing. Google has contributed more than 60 to the public domain and many others are available from government and private sources. IBM’s Watson OpenScale and MapR’s Data Science Refinery are examples of an emerging class of tools for ensuring data quality in machine learning deployments.

4. Know likely outcomes

The outputs of machine learning processes should make sense, even if they’re unexpected. If the problem is defined clearly enough and people with domain expertise are involved in evaluating results, no one should be surprised by the outcome.

Red Hat’s Riek: “You need a set of bounded outputs and failure signs.” (Photo: Red Hat)

That doesn’t mean the model should tell you what you already know. Surprises are good if they unearth new insights. The trick is to find the balance between getting results that are obvious and those that are wildly improbable.

“When you get a result that matches your gut feeling, does that make it right?” asked Passi. “At the same time, if the results are counterintuitive, does that make them inherently wrong?”

Practitioners say it’s essential to have domain experts involved in the testing process to set expectations about plausible outcomes. “Envision what success looks like at the end of this process and then work backwards as opposed to picking through a pile of data to look for anything interesting,” said Indico’s Wilde.

It’s also advisable to focus on small problems with limited solution sets, keeping in mind that machine learning is better suited to finding ways to improve existing processes than to inventing new ones. “If you can’t define the desired state, don’t expect AI to do it for you,” Wilde said.

But companies also should choose problems that have potential for improvement so that the model doesn’t simply reinforce existing knowledge. “If the system completely matches gut feeling it’s as useless as if it delivers the complete opposite of gut feeling,” Passi said.

The data used to train the algorithm should be relevant to the desired outcome but not so tightly constrained that outside-the-box solutions don’t emerge. “You need to mix up training data with stuff you know and stuff you don’t know,” Dunning said. “Exploring the gray regions of data makes models better.”

Humans need to remain in the loop as well to avoid confusion between correlation and causation. The fact that two variables appear to correlate on an outcome doesn’t mean they influence it. The volume of ice cream sales and the frequency of drowning deaths correlate, but that doesn’t mean ice cream causes drowning. A more likely causal factor is summer.

Algorithms won’t always be able to tell the difference, so human oversight is needed to spot assumptions that don’t make sense. “A model trained to detect correlations should not be used to make causal inferences,” advises Google’s Responsible AI Practices code.
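The ice cream example can be reproduced with synthetic data. In the sketch below, temperature drives both ice cream sales and drownings, and neither affects the other; all coefficients and noise levels are invented for illustration. Comparing the overall correlation with the correlation inside a narrow temperature band shows how a shared cause can create the appearance of a link.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)
temps = [rng.uniform(0, 35) for _ in range(5000)]        # daily temperature
ice_cream = [10 * t + rng.gauss(0, 20) for t in temps]   # depends on temperature only
drownings = [0.2 * t + rng.gauss(0, 1) for t in temps]   # depends on temperature only

overall = pearson(ice_cream, drownings)

# Hold the common cause roughly constant: days between 20 and 22 degrees.
band = [i for i, t in enumerate(temps) if 20 <= t < 22]
within_band = pearson([ice_cream[i] for i in band],
                      [drownings[i] for i in band])
print(round(overall, 2), round(within_band, 2))
```

The overall correlation is strong, but it largely disappears once temperature is held nearly fixed, which is the kind of sanity check a human reviewer would apply before drawing a causal conclusion.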

5. Watch for hidden bias

The failures of the Amazon candidate screening and law enforcement risk assessment applications lay in biases that humans didn’t anticipate. Pinpointing where those biases lie can be devilishly difficult since few data sets are truly representative of the real world and sources of bias can be subtle.

IBM’s Dobrin related one example of a financial services company whose application for evaluating home mortgage candidates inadvertently factored race into the equation because ZIP codes were included in the training data. Although race wasn’t noted in the source data, the algorithm learned that candidates from certain ZIP codes were higher mortgage risks and so began denying their applications more frequently. “Because the company didn’t understand the hidden bias, they couldn’t predict that would happen,” he said.
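A simple screen for this kind of hidden proxy is to ask how well a supposedly neutral feature predicts the protected attribute on its own. The sketch below compares a guess-the-majority baseline with a guess based on the feature; the ZIP labels, group labels and counts are fabricated placeholders, and the function is an illustrative assumption rather than a method Dobrin described.

```python
from collections import Counter, defaultdict

def proxy_strength(feature_values, protected_values):
    """Return (baseline, proxy) accuracy for guessing the protected attribute.

    baseline: always guess the most common group overall.
    proxy:    guess the most common group within each feature value.
    """
    total = len(protected_values)
    baseline = Counter(protected_values).most_common(1)[0][1] / total
    by_feature = defaultdict(Counter)
    for f, p in zip(feature_values, protected_values):
        by_feature[f][p] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_feature.values())
    return baseline, correct / total

# Hypothetical applicants: (ZIP label, group label).
rows = ([("ZIP_1", "A")] * 9 + [("ZIP_1", "B")] * 1 +
        [("ZIP_2", "A")] * 2 + [("ZIP_2", "B")] * 8 +
        [("ZIP_3", "A")] * 5 + [("ZIP_3", "B")] * 5)
zips = [z for z, _ in rows]
groups = [g for _, g in rows]

baseline, proxy = proxy_strength(zips, groups)
print(round(baseline, 3), round(proxy, 3))
```

A large gap between the two numbers suggests the feature carries information about the protected attribute, so dropping the attribute itself from the training data would not remove its influence.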

All humans have biases but also the mechanisms to control them. Computers don’t have such guard rails, at least not yet, meaning that the outcome of biased data can be amplified. “The same mechanisms that enable you to function in society can lead to horribly bigoted behavior,” Dunning said.

Repeated testing and validation are a core defense. Bias tends to creep into models over time, which means algorithms must be monitored on an ongoing basis against a set of realistic outputs. “You need a set of bounded outputs and failure signs,” Riek said. “You can’t have nuanced output.”
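Riek's "bounded outputs and failure signs" might look something like the following monitoring check. The bounds, the expected mean and the drift tolerance are placeholder values, and the function is a hypothetical sketch rather than a description of any Red Hat tooling.

```python
def failure_signs(scores, lower=0.0, upper=1.0, expected_mean=0.5, max_drift=0.1):
    """Return a list of human-readable warnings for a batch of model scores."""
    if not scores:
        return ["no scores received"]
    signs = []
    # Bounded outputs: every score must fall inside the agreed range.
    out_of_bounds = [s for s in scores if not lower <= s <= upper]
    if out_of_bounds:
        signs.append(f"{len(out_of_bounds)} score(s) outside [{lower}, {upper}]")
    # Failure sign: the batch average has wandered from what was expected.
    mean = sum(scores) / len(scores)
    if abs(mean - expected_mean) > max_drift:
        signs.append(f"mean score {mean:.2f} drifted from expected {expected_mean:.2f}")
    return signs

healthy = failure_signs([0.4, 0.5, 0.6, 0.55])
suspect = failure_signs([0.9, 0.95, 1.2, 0.85])
print(healthy, suspect)
```

Run on every batch of predictions, a check like this gives reviewers a yes-or-no signal to act on rather than a nuanced output to interpret.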

But bias is sometimes necessary, and that’s where the interests of data scientists and their business-side colleagues can come into conflict. Machine learning algorithms excel at reaching optimal solutions, but whether for purposes of compliance, legal defense or altruism, optimal isn’t always best. When considering job candidates or mortgage applicants, for example, the business may want to favor candidates of certain genders or ethnicities.

Carnegie Mellon’s Danks: “You can have a statistically unbiased model that is not morally unbiased.” (Photo: Wright-Patterson Air Force Base)

In those cases, business objectives need to trump algorithms. “You have a lot of people who understand ethical and social impacts but not AI, and you have a lot of people in AI who don’t understand ethical and social impacts,” said David Danks, a professor of philosophy and psychology at Carnegie Mellon University. “People who write code don’t have to be ethicists, but they need to be able to talk to ethicists.”

Danks believes the task of creating machine learning models is too often left to data scientists without the up-front involvement of business stakeholders who will have to live with the results of their models. Data scientists gravitate toward statistical perfection, but that isn’t always desirable. “You can have a statistically unbiased model that is not morally unbiased,” he said.

Collaboration needs to start from the beginning. “Too many AI projects get too far down the road before the business people get pulled in,” said Wilde. “When this happens, it can be very difficult to get a project back on track.”

A question of trust

Our relationship with computers is defined by trust. Years of experience have taught us that, given the same set of inputs, programs will always produce the same results. Machine learning challenges those assumptions.

Outputs may vary based upon permutations in the inference model. Results are qualified by probability. Omissions in source data can create unintentional bias. Correlation can be misinterpreted as causation.

And that’s all OK if those limitations are understood. In the age of smart machines, transparency is more critical than ever, experts advise. “Explainable AI is how to get trustworthy AI,” said CMU’s Danks.

Another wrinkle is that trust is situational. A machine learning model that teaches an autonomous vehicle to avoid hitting a pedestrian needs to be right 100 percent of the time. A recommendation engine on an e-commerce site has more latitude for error.

The key is to understand how the decision is reached and the likelihood that it’s the right one. For now, people are a necessary factor in that equation. For all the talk over the last couple of years about intelligent machines making humans obsolete, today’s technology is only as good as the parameters humans define.

“The idea that intelligent, thinking, conniving beings exist in AI is total, total, total science fiction,” former MIT AI lab chief and iRobot Corp. co-founder Rodney Brooks said in a recent interview published on Medium. “We don’t have anything with the intent of a slug at this point.”

But for many of the problems being addressed by machine learning these days, that’s good enough.

Image: Mike MacKenzie/Flickr
