

This is a two-part series on the use case for, and possible solutions to, supporting support engineers. The series explores one key aspect of this month’s theme of Virtualization: troubleshooting in tech support. As more tasks become automated and support teams deal with an increasing variety of software and customers, virtualizing these efforts can bring a few troubles of its own.
Written by Puneet Pandit, Founder and Chief Executive Officer of Glassbeam
Having done tech support in the past, and now talking to clients and prospects about supporting their support teams, I am always struck by how different the support process is from one company to another. The products being supported are different, the logs for each device are different, what you look for in the logs is different, and in general everything differs from one product’s support team to another. Moreover, how a problem gets troubleshot varies from one person to another, even within the same product. This raises an interesting question in the machine-data-based support space: is support automation even a possibility?
Support automation as a single predefined workflow tool that works for all support groups ranges from complex to impractical. While some aspects of troubleshooting can be standardized, most of it will be product-specific or individual. What support needs are tools that make troubleshooting tasks simpler and faster, tools that can be customized to each individual’s needs, and tools that can be programmed to bubble up all known issues and automate related support processes.
If one takes a step back and looks at troubleshooting as a process across product lines, many common things stand out, irrespective of the product being supported. For example,
One of the first steps in troubleshooting is to look at the log files that contain the error message.
The support engineer might then want to check what happened before and after a particular error message in that log file.
The support engineer might want to see what happened in other files (which represent the other systems/processes of the product) at the time of the error. The events surrounding an event of interest might throw light on what went wrong.
The support engineer might also want to look at the output of specific commands, represented by different sections in the log file. These sections could represent the configuration of the device, the state of the system as a whole, or the state of specific parts of the system.
More often than not, a problem is due to a change in the system’s configuration. The support engineer would want to know what changed and when.
Depending on the type of problem, the support engineer might want to dig into performance or other statistical trends that are being tracked for that system.
Before digging deeper into the logs to solve the issue, the support engineer might also want to check whether this is an isolated event or is prevalent across multiple systems in the field.
A product never works in isolation and is always interconnected with other devices in a stack-like environment. Support issues are often not isolated incidents but depend on other systems in the stack. In such cases, the support engineer needs to analyze across the stack.
The support engineer might want to check whether this is a previously solved problem, so that he doesn’t reinvent the wheel. He could do this by going through previous support cases and/or knowledge base articles that have a solution for this problem.
If this is a performance-related problem, the support engineer would typically collect performance statistics, plot them, and analyze trends. The engineer would also like to know what was going on in the system when performance went down, what other events occurred, which configuration changed, and so on.
What if this is a known bug? The support engineer would have to check the bug database to find out; otherwise, the engineer would waste time rediscovering a known problem.
What if this is a known issue that is not formally documented anywhere? In most organizations, a wealth of information is discussed on internal e-mail distribution lists and never gets documented. So the support engineer might search his inbox to see if he finds anything there.
What about those cheat sheets that each support engineer has built? Those hold details known only to a specific engineer, who has not yet found the time to document them anywhere. The support engineer might check whether the issue at hand matches anything he remembers having seen or solved before.
There could be more steps than those listed above, but you get the picture. While not all issues require detailed troubleshooting, even simple issues or known problems require time spent by support on the process side. For example, even if it is a well-defined, known issue, the support engineer still has to spend time opening a case and updating all the details in the case, including the solution. If it is an RMA (return merchandise authorization), a linked case has to be opened to dispatch a replacement part. If the support team receives hundreds of such cases, a significant amount of time is spent even on known issues.
Support needs tools that help perform the above steps faster and tools that automate them where possible.
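As a rough illustration of what such a tool might do for the first few steps above (finding the error, looking at what happened before and after it, and checking other files around the same time), here is a minimal Python sketch. The log line format, the file names, and the function names `parse_ts`, `error_context`, and `events_near` are all assumptions made for this example; real device logs vary widely and would need their own parsing.

```python
from datetime import datetime, timedelta

# Assumed (hypothetical) line format: "2024-01-01 10:00:05 ERROR disk timeout"
TS_FORMAT = "%Y-%m-%d %H:%M:%S"
TS_LEN = 19  # length of the timestamp prefix in the assumed format


def parse_ts(line):
    """Return the timestamp at the start of a line, or None if there is none."""
    try:
        return datetime.strptime(line[:TS_LEN], TS_FORMAT)
    except ValueError:
        return None


def error_context(lines, needle="ERROR", before=2, after=2):
    """For each line containing `needle`, return (index, surrounding lines)."""
    hits = []
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - before)
            hits.append((i, lines[start:i + after + 1]))
    return hits


def events_near(logs, when, window_seconds=60):
    """Return (file name, line) pairs from all files within the time window."""
    window = timedelta(seconds=window_seconds)
    nearby = []
    for name, lines in logs.items():
        for line in lines:
            ts = parse_ts(line)
            if ts is not None and abs(ts - when) <= window:
                nearby.append((name, line))
    return nearby


# Illustrative data only; file names and messages are invented.
logs = {
    "storage.log": [
        "2024-01-01 10:00:00 INFO volume mounted",
        "2024-01-01 10:00:05 ERROR disk timeout",
        "2024-01-01 10:00:09 INFO retry succeeded",
    ],
    "network.log": [
        "2024-01-01 09:30:00 INFO link up",
        "2024-01-01 10:00:03 WARN packet loss detected",
    ],
}

index, context = error_context(logs["storage.log"], before=1, after=1)[0]
error_time = parse_ts(logs["storage.log"][index])
related = events_near(logs, error_time, window_seconds=10)
```

With the sample data, `related` holds the three storage.log lines plus the packet-loss warning from network.log, which is the kind of cross-file view an engineer would otherwise assemble by hand.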
We have just looked at some of the steps commonly followed during troubleshooting and seen that, even though the specifics differ, the overall approach is similar.
While thinking through solutions to enable support, we need to remember that not all support problems can be treated the same way. Support issues are typically broken down into levels, depending on the complexity of the problem. Most organizations have three or four levels, from L1 up to L3 or L4.
1. Issues that get resolved at the L1/L2 level – These form the majority of support cases in almost all support organizations. While these issues have a lower MTTR (mean time to resolution), their sheer volume means they account for the majority of the support team’s time.
2. Issues that get resolved at the L2/L3 or L3/L4 levels (depending on how organizations have structured their teams) – While these are fewer in number, the MTTR for such problems is a lot higher.
The two types of issues above need to be handled differently. But before we discuss ways of solving them, one of the most important things to remember is that the support organization is always under tremendous pressure to resolve issues in the shortest time possible. A support engineer will always choose the method that gets results in the shortest time and the fewest possible steps, so a solution that does not meet those goals will not succeed.
With the above premise, let’s look at possible solutions to the two types of issues listed above. In this blog we look at L1/L2 issues and leave the rest for the next blog.
For L1/L2 type of issues, what if there was:
A system that can process logs within seconds.
A log vault that gives the support engineer access to the log in the shortest possible time.
A search interface to quickly and easily search the historical logs.
An integrated file viewer and file diff tool that lets a support engineer view the log file of interest and see how it has changed compared to previous files.
A rules and alert engine that can be configured to look for the most common problems (a known-issue repository) and alert the support engineer when issues are found.
An interface that highlights all the known issues in a given log as soon as it is loaded into the system, using the known-issue repository.
An interface to show related cases or known bugs – This would require integration with the case and bug management systems.
An interface that shows the current configuration of the system and the configuration changes of interest – This requires an interface to define what constitutes a configuration and which parts of it need to be tracked for change.
Integration with case management and RMA systems, so that cases can be opened automatically with all the relevant details when the system finds known issues.
While the above list is neither comprehensive nor detailed, it does provide an overview of the interfaces, systems, and processes that can be put in place to make things easier for L1/L2 support.
In the next blog, we will continue this topic and look at systems and processes that can help with L3/L4 types of issues.
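Two of the items listed above, the known-issue rules engine and the tracked configuration changes, lend themselves to a short sketch. The Python below is only an illustration under stated assumptions: the `Rule` structure, the rule IDs and patterns, the configuration keys, and the case text format are all invented for the example and do not describe any particular product or case management system.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """One entry in a hypothetical known-issue repository."""
    issue_id: str
    pattern: str  # regular expression matched against each log line
    advice: str   # suggested action for the engineer or an automated case


def scan(lines, rules):
    """Return one finding per (line, matching rule), in log order."""
    compiled = [(rule, re.compile(rule.pattern)) for rule in rules]
    findings = []
    for line_no, line in enumerate(lines, start=1):
        for rule, regex in compiled:
            if regex.search(line):
                findings.append({
                    "issue_id": rule.issue_id,
                    "line_no": line_no,
                    "line": line,
                    "advice": rule.advice,
                })
    return findings


def draft_case(finding):
    """Text a case-management integration might pre-fill (format is made up)."""
    return (f"[{finding['issue_id']}] line {finding['line_no']}: "
            f"{finding['line']} | suggested action: {finding['advice']}")


def config_changes(old, new, tracked):
    """List (key, old value, new value) for tracked keys whose value differs."""
    return [(key, old.get(key), new.get(key))
            for key in tracked
            if old.get(key) != new.get(key)]


# Illustrative data only; rule IDs, patterns, and config keys are invented.
RULES = [
    Rule("KI-001", r"fan\d+ speed below threshold", "open RMA for fan module"),
    Rule("KI-002", r"license expires in \d+ days", "notify account team"),
]
log = ["boot complete", "fan2 speed below threshold", "license expires in 7 days"]

findings = scan(log, RULES)
cases = [draft_case(f) for f in findings]

previous = {"firmware": "1.2", "mtu": 1500, "hostname": "node-a"}
current = {"firmware": "1.3", "mtu": 1500, "hostname": "node-b"}
changed = config_changes(previous, current, tracked=["firmware", "mtu"])
```

Keeping the rules as data rather than code is a deliberate choice here: it lets support engineers add a newly discovered issue to the repository without touching the scanner. The `tracked` list plays the role of the definition of which configuration items matter, so the untracked `hostname` change is ignored.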
About the Author
Puneet Pandit: Founder and Chief Executive Officer of Glassbeam
Puneet has served as chief executive officer of Glassbeam since its inception in 2009. He is focused on Glassbeam’s mission to disrupt the market for Big Data business applications and machine data analytics.
Glassbeam was incubated inside Orchesys, a professional services firm focused on enterprise storage. Prior to founding Orchesys, Puneet was a senior director at Network Appliance, where he led the database and business applications solutions group focused on driving the $300 million “Oracle on NetApp” market with cross functional initiatives across joint R&D, sales, services and marketing programs. Prior to NetApp, Puneet was a strategic advisor at Ernst & Young and a management consultant at Tata Unisys.
He holds a Bachelor of Science in electrical engineering from Punjab University and an MBA from The University of Chicago.