How poor DevOps culture led to a $465M trading loss for Knight Capital
On August 1, 2012, Knight Capital Americas LLC experienced a significant error in the operation of its automated routing system for equity orders, known as SMARS. While processing 212 small retail orders that Knight had received from its customers, SMARS routed millions of orders into the market over a 45-minute period, and obtained over four million executions in 154 stocks for more than 397 million shares.
By the time Knight realized the error and stopped sending the orders, the company had assumed a net long position in 80 stocks of approximately $3.5 billion and a net short position in 74 stocks of approximately $3.15 billion. Knight ultimately lost over $460 million from these unwanted positions.
The SEC, in its filing, said that in the absence of appropriate controls, the speed with which automated trading systems enter orders into the marketplace can turn an otherwise manageable error into an extreme event with potentially widespread impact.
Poor DevOps strategy
So how did it happen? Was it human error by a trader, or a software glitch in the system? A detailed reading of the SEC filing shows that it came down to a poorly executed deployment.
Automated trading is an increasingly important component of the national market system. To let its customers participate in the Retail Liquidity Program (RLP) at the New York Stock Exchange, which was scheduled to launch on August 1, 2012, Knight made a number of changes to its systems and to the software code that handles orders. These changes included developing and deploying new code in SMARS. SMARS is an automated, high-speed, algorithmic router that sends orders to the market. Its main function is to receive orders from other parts of the trading platform ("parent" orders) and then, depending on available liquidity, send one or more representative ("child") orders to external venues for execution.
When deployed, the new RLP code in SMARS was meant to replace unused code in the relevant part of the router. That unused code had previously supported a function called “Power Peg,” which the company had not used for many years. Despite this, the Power Peg code remained in place and could still be called at the time of the RLP deployment. The new RLP code also reused a flag that had previously activated Power Peg. In 2005, Knight had moved Power Peg's cumulative quantity function, which tracks the shares executed against a parent order, to an earlier point in the SMARS code sequence, but it did not retest the code after the move to determine whether Power Peg would still function correctly if called.
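The danger of reusing a flag can be shown with a small sketch. Everything here is hypothetical: the function names, the order fields, and the flag names are made up for illustration and are not taken from Knight's code. The point is only that the same flag means different things to an updated server and to one that still runs the old build, whereas a freshly named flag is simply ignored by the old build.

```python
def legacy_build(order: dict) -> str:
    """Old server (hypothetical): the flag still activates the retired code path."""
    return "power_peg" if order.get("flag") else "normal"


def updated_build(order: dict) -> str:
    """New server (hypothetical): the same flag now selects the new feature."""
    return "rlp" if order.get("flag") or order.get("rlp_enabled") else "normal"


# Reused flag: the two builds disagree about what the order means.
reused = {"symbol": "XYZ", "flag": True}
reused_result = {"updated": updated_build(reused), "legacy": legacy_build(reused)}

# Fresh flag name: the old build does not recognize it and falls back to normal handling.
fresh = {"symbol": "XYZ", "rlp_enabled": True}
fresh_result = {"updated": updated_build(fresh), "legacy": legacy_build(fresh)}

print(reused_result)
print(fresh_result)
```

With a new flag name, a server that missed the update would have handled the order as an ordinary one instead of waking up dormant code.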
Beginning on July 27, 2012, Knight rolled out the new RLP code in SMARS in stages, placing it on a limited number of servers at a time. During the deployment, one of the technicians did not copy the new code to one of the eight SMARS servers. No second person reviewed the deployment, and no one realized that the Power Peg code had not been removed from the eighth server and that the RLP code had not been added to it.
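A missed server is the kind of mistake an automated post-deployment check can catch. The following is a minimal sketch, not a description of any tooling Knight had: the server names, the build contents, and the helper functions are all assumptions. It fingerprints the code on each server and lists every server that does not match the expected build.

```python
import hashlib


def fingerprint(code: bytes) -> str:
    """Return a SHA-256 digest identifying one build of the router code."""
    return hashlib.sha256(code).hexdigest()


def find_stale_servers(deployed: dict[str, bytes], expected: bytes) -> list[str]:
    """List the servers whose deployed code does not match the expected build."""
    want = fingerprint(expected)
    return sorted(name for name, code in deployed.items() if fingerprint(code) != want)


# Hypothetical fleet: eight servers, one of which still holds the old build.
new_build = b"router build with RLP"
old_build = b"router build with Power Peg"
fleet = {f"smars-{i}": new_build for i in range(1, 8)}
fleet["smars-8"] = old_build

stale = find_stale_servers(fleet, new_build)
print(stale)  # ['smars-8']
```

A deployment script could refuse to enable the new feature until this list is empty, which replaces the missing second reviewer with a mechanical check.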
On August 1, Knight received orders from broker-dealers whose customers were eligible to participate in the RLP. The seven servers that had been deployed correctly processed these orders without any issues. Orders sent to the eighth server, however, triggered the defective Power Peg code, which began sending child orders to trading centers for execution. Because the check on the parent order's executed quantity had been moved to a different stage of the process, the server kept sending child orders without stopping, even after the parent order had been filled. Although one part of the order handling system recognized that the parent orders had been filled, this information was never communicated to SMARS.
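The effect of losing the fill check can be sketched in a few lines. This is a simplified model under stated assumptions, not Knight's routing logic: every child order is assumed to fill completely, the function and parameter names are hypothetical, and `max_children` exists only so that the broken variant terminates in the demo.

```python
def route_parent_order(parent_qty: int, child_qty: int, check_fills: bool,
                       max_children: int = 1000) -> int:
    """Send child orders for one parent order and return the total shares sent.

    check_fills=True models a router that tracks cumulative fills against the
    parent order; check_fills=False models a router that never sees them.
    """
    filled = 0
    sent = 0
    for _ in range(max_children):
        if check_fills and filled >= parent_qty:
            break  # parent order is complete, stop sending children
        qty = child_qty
        if check_fills:
            qty = min(child_qty, parent_qty - filled)  # never exceed the parent
        sent += qty
        filled += qty  # assumption: each child order fills completely
    return sent


with_check = route_parent_order(500, 100, check_fills=True)
without_check = route_parent_order(500, 100, check_fills=False)
print(with_check, without_check)  # 500 100000
```

Without the check, the only thing that stops the loop is the artificial cap; in a live system there is no such cap unless someone builds one.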
The same day, Knight also received RLP-eligible orders that were designated for pre-market trading. As SMARS processed them, an internal Knight system generated automated e-mail messages that referenced SMARS and an error described as “Power Peg disabled.” The system sent 97 of these e-mail messages to a group of Knight personnel before the 9:30 a.m. market open. Messages of this type were not designed as system alerts, and Knight staff generally did not review them.
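Ninety-seven copies of the same message before the open is a signal in itself. As a rough sketch of how repeated messages could be promoted to alerts, assuming a hypothetical inbox of plain message strings and an arbitrary threshold (the message texts below are illustrative):

```python
from collections import Counter


def escalations(messages: list[str], threshold: int = 3) -> list[str]:
    """Return the distinct messages that repeat at least `threshold` times."""
    counts = Counter(messages)
    return sorted(text for text, n in counts.items() if n >= threshold)


# Hypothetical pre-market inbox: one error repeated 97 times plus routine noise.
inbox = ["Power Peg disabled"] * 97 + ["heartbeat ok"] * 2

to_page = escalations(inbox)
print(to_page)  # ['Power Peg disabled']
```

Anything returned by such a filter would go to an on-call person rather than to a mailbox that nobody reads.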
When the incident unfolded on August 1, Knight had no supervisory procedures to guide its personnel when significant issues developed. It relied on its technology team to diagnose and fix the SMARS problem in a live trading environment while the system continued to send millions of child orders. To make matters worse, the technology team removed the new RLP code from the seven servers where it had been deployed correctly, which caused additional incoming parent orders to activate the Power Peg code on those servers as well. The end result: a loss of over $460 million from about 45 minutes of trading.
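One control that limits this kind of damage is an automatic halt once exposure passes a preset limit, so that stopping the system does not depend on people debugging it live. The sketch below is an assumption about what such a guard might look like, not something described in the SEC filing; the class name, the limit, and the order values are hypothetical.

```python
class ExposureGuard:
    """Reject orders once cumulative gross exposure would exceed a fixed limit."""

    def __init__(self, max_gross_exposure: float) -> None:
        self.max_gross_exposure = max_gross_exposure
        self.exposure = 0.0
        self.halted = False

    def allow(self, order_value: float) -> bool:
        """Return True if the order may be sent; trip the halt on a breach."""
        if self.halted:
            return False
        if self.exposure + abs(order_value) > self.max_gross_exposure:
            self.halted = True  # stays halted until a human resets it
            return False
        self.exposure += abs(order_value)
        return True


guard = ExposureGuard(max_gross_exposure=1000.0)
decisions = [guard.allow(v) for v in (400.0, 400.0, 400.0, 100.0)]
print(decisions)  # [True, True, False, False]
```

The last order is small enough to fit under the limit, but it is still rejected because the guard has already tripped. That is deliberate: once something has gone wrong, the safe default is to stay stopped.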
Knight is not the only one
The Knight episode was not only a matter of human error. Deployment scripts that were most likely poor, along with woeful production monitoring, also played a part. Many commenters on Reddit pointed out similar problems with the DevOps and deployment practices at other companies.
One Reddit commenter described a server at an airline, used for ticket sales, that started misbehaving badly. The airline had no warning or logging system, and the issues were not documented anywhere. Once the server was rebooted, the entire booking engine stopped working. It turned out that the server was the only one with a telnet connection to Amadeus, a central airline booking service. This was critical information, but it was not recorded anywhere. It took 90 minutes to get down to the server room and switch it back on manually.
Another user said they worked for a company with no process around code changes. A developer would make a change, test it, and then deploy it themselves whenever they felt like it. Most of these processes ran between midnight and 4 a.m., when no one was in the office.
Yet another developer said their business pushes development to release code weekly or bi-weekly to keep up with competitors, or to edge them out, by adding new features. When bugs are found, the business tells developers to patch it for today and “fix it later.” That later fix will likely never happen, potentially leading to a disastrous error.
The SEC filing, importantly, recommends new human processes to avoid a similar disaster. The software development lifecycle should include processes for all of a company's business-critical systems and applications, including trading systems, finance, risk, and compliance. Risk management controls and supervisory procedures, including those pertaining to the deployment of new software and code, should have a proper QA process that is formalized and documented.