The 6th anniversary of the trading system meltdown at Knight Capital is an opportunity to reflect upon computer system defects, human error, organizational flaws, and the best principles and practices for solution delivery in the information technology industry. In this blog and my upcoming book, Bugs: A Short History of Computer System Failure, I will chronicle some important system failures of the past and discuss ideas for improving system quality in the future. As IT becomes increasingly woven into daily life, the quality of hardware and software affects our commerce, health, infrastructure, military, politics, science, security, and transportation. The Big Idea is that we have no choice but to get better at delivering technology solutions because our lives depend on it.
On August 1, 2012, Knight Capital Group LLC ("Knight"), a leading financial market maker, experienced a major failure in the operation of its automated routing system for US equity orders. Knight originally received 212 small orders from retail customers and then mistakenly streamed thousands of orders per second into the NYSE market over a 45-minute period; it executed over 4 million trades in 154 stocks totaling more than 397 million shares and assumed a net long position in 80 stocks of approximately $3.5 billion as well as a net short position in 74 stocks of approximately $3.15 billion. Knight lost over $460 million from these unwanted positions, and by the next day, its own stock price had dropped by 75% as employees, customers, and competitors scrambled to figure out what to do next. A week later, Knight received a $400 million cash infusion from a group of investors, and by the next summer, it was acquired by a rival, Getco LLC. This essay will discuss the rise and fall of Knight and explain the IT matters that contributed to the system failure.
Founded in 1995 by Kenneth Pasternak and Walter Raquet, Knight Capital Group was a market maker and trade execution provider headquartered in Jersey City, New Jersey, across the Hudson River from Wall Street. The company's bold insight was that the human-centered model of exchange trading was going to be fundamentally transformed by computers. Its primary customers were large broker-dealers, electronic discount brokers, hedge funds, and other institutional investors. Born in the crucible of the IT advances of the 1990s and uplifted by the related growth of the technology-weighted NASDAQ stock market, Knight grew rapidly to become the single largest market maker of stocks listed on the NASDAQ (17%) and NYSE (16%). In July 1998, Knight raised $145 million in capital through its own Initial Public Offering (IPO) with a share price of $14.50 and market capitalization of $725 million. By the end of 1999, Knight's share price had soared above $150, and its market cap had surged to $8 billion. A number of factors contributed to the increase in trading volumes on both the NASDAQ and NYSE markets, including the flood of cash flows into equity-based mutual funds, historic high returns in US equity markets, the increasing number of companies going public, the emergence and market acceptance of electronic discount brokers, and multiple technological innovations such as the Internet, World Wide Web, and Personal Computer that reduced transaction costs.
But there were downs along with the ups. Knight was hit hard by the burst of the dot-com bubble, with NASDAQ trading volumes depressed for months. On April 9, 2001, the SEC then announced Regulation National Market System (RNMS) and mandated that the stock market move to decimal pricing. Academic studies and industry forecasts suggested that investors would save money from narrower spreads at the expense of the market makers. Knight was among the hardest hit by the regulation change, and it struggled for the next year. On January 8, 2002, Knight agreed to pay $1.5 million to settle multiple NASD regulatory violation claims, including failure to honor posted quotes, the improper display of limit orders, and slow, sometimes inaccurate reporting of thousands of trades to the NASD. The regulatory fine was $700,000, and its clients were paid $800,000. The NASD investigation also highlighted the existence of, and executive knowledge of, front-running within Knight, a Wall Street practice in which firms trade for their own accounts after previewing customer order flow, executing their own trades ahead of a customer's order.
Knight needed to make changes and replaced Pasternak with Thomas Joyce, an industry veteran, in May 2002; Joyce soon shifted the firm's business to high-volume market making in other asset classes through acquisitions and organic growth. As it adjusted to the new regulatory environment, Knight recovered its footing and financial success. By 2011, the company was worth $1.5 billion, earned net income of $115 million, employed approximately 1,400 people including over 100 software developers, and had opened other offices in the USA as well as the UK, Switzerland, China, and Singapore. Knight now made markets in US options and European equities, and it also traded currencies and fixed income for its proprietary accounts. It was still dominant in US equity markets and managed an average daily US equity volume of more than 3.3 billion trades worth around $21 billion. As part of the business expansion and renewal strategy, Knight retired older IT systems and built new trading technology such as the Smart Market Access Routing System (SMARS). SMARS was able to execute thousands of orders per second and could compare prices between dozens of different trading venues within fractions of a second.
Some of Knight's biggest customers were the discount brokers and online brokerages such as TD Ameritrade, E*Trade, Scottrade, and Vanguard. Knight also competed for business with financial services giants like Citigroup, UBS, and Citadel. However, these larger competitors could internalize increasingly larger amounts of trading away from the public eye in their own exclusive markets or shared private markets, so-called dark pools. Since 2008, the portion of all stock trades in the US taking place away from public markets has risen from 15% to more than 40%. As of 2018, there are about 40 dark pools and as many as 200 internalizers competing with a dozen public exchanges in the US alone.
In October 2011, the NYSE proposed a dark pool of its own, called the Retail Liquidity Program (RLP). The RLP would create a private market of traders within the NYSE that could anonymously transact shares for fractions of pennies more or less than the displayed bid and offer prices, respectively. The RLP was controversial even within NYSE Euronext, the parent company of the NYSE; its CEO, Duncan Niederauer, had written a public letter in the Financial Times criticizing dark pools for shifting "more and more information... outside the public view and excluded from the price discovery process". During the months of debate, Joyce had not given the RLP much chance of approval, saying in one interview, "Frankly, I don't see how the SEC can possibly OK it". In early July 2012, the NYSE nevertheless received SEC approval of its RLP, and it quickly announced the RLP would go live on August 1, 2012, giving market makers just over 30 days to prepare. The SEC decision benefited large institutional investors, who could now buy or sell large blocks of shares with relative anonymity and without moving the public markets; however, it came again at the expense of market makers. Joyce insisted on participating in the RLP because giving up the order flow without a fight would have further dented profits in Knight's best line of business.
With only a month between the RLP's approval and its go-live, Knight's software development team worked feverishly to make the necessary changes to its trade execution systems, including SMARS, its algorithmic, high-speed order router. A core feature of SMARS receives orders from upstream components in Knight's trading platform ("parent" orders) and then, as needed based on the available liquidity and price, sends one or more representative ("child") orders to downstream, external venues for execution. The new RLP code in SMARS replaced some unused code in the relevant portion of the order router; the old code had been used for an order algorithm called "Power Peg", which Knight had stopped using in 2003. Power Peg was a test program that bought high and sold low; it was specifically designed to move stock prices higher and lower in order to verify the behavior of Knight's other proprietary trading algorithms in a controlled environment. It was not to be used in the live, production environment.
There were grave problems with Power Peg in the current context. First, the Power Peg code remained present and executable at the time of the RLP deployment despite its lack of use. Leaving such "dead code" in place is a bad practice, but a common one in large software systems maintained for years. Second, the new RLP code had repurposed a flag that was formerly used to activate the Power Peg code; the intent was that when the flag was set to "yes", the new RLP component, not Power Peg, would be activated. Such repurposing often creates confusion; here it had no substantial benefit and was a major mistake, as we shall see shortly. Third, there had been substantial code refactorings in SMARS over the years without thorough regression testing. In 2005, Knight changed the cumulative quantity function, which counted the number of shares of the parent order that had been executed and filled in order to decide whether to route another child order. The function was now invoked earlier in the SMARS workflow, which in theory was a good idea to prevent excess system activity. In practice, it was now disconnected from Power Peg, which used to call it directly, so it could no longer throttle the algorithm when orders were filled, and Knight never retested Power Peg after this change.
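To make the role of the cumulative quantity check concrete, here is a minimal sketch in Python. It is not Knight's code: the `ParentOrder` class, the function names, and the assumption that every child order fills completely are all hypothetical, chosen only to show how a router behaves with and without a check against the filled quantity of the parent order.

```python
from dataclasses import dataclass


@dataclass
class ParentOrder:
    """A hypothetical parent order; field names are illustrative only."""
    symbol: str
    quantity: int      # shares the customer wants executed
    filled: int = 0    # cumulative quantity confirmed as executed

    @property
    def remaining(self) -> int:
        return max(self.quantity - self.filled, 0)


def route_with_throttle(parent: ParentOrder, child_size: int, max_children: int) -> int:
    """Send child orders only while the parent still has unfilled shares."""
    sent = 0
    while sent < max_children and parent.remaining > 0:
        size = min(child_size, parent.remaining)
        parent.filled += size  # assumption: every child order fills completely
        sent += 1
    return sent


def route_without_throttle(parent: ParentOrder, child_size: int, max_children: int) -> int:
    """The same loop with the cumulative quantity check missing."""
    sent = 0
    while sent < max_children:  # nothing stops routing once the parent is filled
        parent.filled += child_size
        sent += 1
    return sent


if __name__ == "__main__":
    good = ParentOrder("XYZ", quantity=1_000)
    bad = ParentOrder("XYZ", quantity=1_000)
    print(route_with_throttle(good, child_size=100, max_children=10_000), good.filled)
    print(route_without_throttle(bad, child_size=100, max_children=10_000), bad.filled)
```

In this toy run, the throttled router stops after 10 child orders, while the unthrottled one keeps sending until it hits the artificial `max_children` cap, overfilling a 1,000-share order by a factor of 1,000. A real router has no such cap unless someone builds one.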
In the week before go-live, a Knight engineer manually deployed the new RLP code in SMARS to its eight servers. However, the engineer made a mistake and did not copy the new code to one of the servers. Knight did not have a second engineer review the deployment, nor was there an automated system to alert anyone to the discrepancy. Knight also had no written procedures requiring a supervisory review, all facts we shall return to later. On August 1 at 8:01 AM ET, an internal system called BNET generated 97 email messages that referenced SMARS and identified an error described as "Power Peg disabled". These obscure, internal messages were sent to Knight personnel, but their channel was not designated for high-priority alerts, and the staff generally did not review them in real time. They were the proverbial smoke from the smoldering code and deployment bits about to burn, and a missed opportunity to identify and fix the DevOps issue prior to market open.
At 9:30 AM ET, Knight began receiving RLP orders from broker-dealers, and SMARS distributed the incoming work to its servers. The seven servers that had the new RLP code processed the orders correctly. On the eighth server, however, the repurposed flag activated the defective Power Peg code, and the orders sent there soon triggered the fault line of a financial tectonic plate. This server began to continuously send child orders for each incoming parent order without regard to the number of confirmed executions Knight had already received from other trading venues. The results were immediately catastrophic. For the 212 incoming parent orders processed by the defective Power Peg code, SMARS sent thousands of child orders per second that would buy high and sell low, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes. For 75 of these stocks, Knight's executions jostled prices more than 5% and comprised more than 20% of trading volume; for 37 stocks, prices lurched more than 10% and Knight's executions constituted more than 50% of trading volume.
Nanex, LLC Market Data on US Equity Volumes from August 1, 2012
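The deployment gap described above, one server out of eight left on old code with nothing to flag it, is the kind of discrepancy an automated check can catch before market open. The following Python sketch is illustrative only: the server names, the byte strings standing in for build artifacts, and the function names are assumptions, and a real check would fetch digests from the servers rather than hold the artifacts in memory.

```python
import hashlib
from typing import Dict, List


def fingerprint(artifact: bytes) -> str:
    """Return a SHA-256 digest that identifies a deployed build."""
    return hashlib.sha256(artifact).hexdigest()


def find_mismatched_servers(deployed: Dict[str, bytes], expected: bytes) -> List[str]:
    """List the servers whose deployed artifact differs from the expected release."""
    want = fingerprint(expected)
    return sorted(name for name, blob in deployed.items() if fingerprint(blob) != want)


if __name__ == "__main__":
    # Hypothetical stand-ins for the new and old builds.
    new_build = b"router-build-with-rlp"
    old_build = b"router-build-with-power-peg"

    fleet = {f"server-{i}": new_build for i in range(1, 8)}
    fleet["server-8"] = old_build  # the server that was missed

    stale = find_mismatched_servers(fleet, new_build)
    if stale:
        print("Do not open for trading; stale servers:", stale)
```

Comparing digests rather than version strings is a deliberate choice in this sketch: a version label can be edited by hand, while a digest changes whenever the deployed bits do.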
Following the Flash Crash of May 6, 2010, in which the DJIA lost about 1,000 points in minutes, the SEC announced several new rules to regulate securities trading. First, circuit breakers were required to stop trading if the market experienced what was labeled as "significant price fluctuations" of more than 10% during a 5-minute period. Second, the SEC required more specific conditions governing the cancellation of trades. For events involving between five and twenty stocks, trades could be cancelled if they were at least 10% away from the "reference price", the last sale before pricing was disrupted; for events involving more than twenty stocks, trades could be cancelled if they deviated more than 30% from the reference price. Third, Securities Exchange Act Rule 15c3-5, 17 C.F.R. 240.15c3-5 (the "Rule"), went into effect, requiring the exchanges and broker-dealers to implement risk management controls to ensure the integrity of their systems as well as executive review and certification of the controls. Since the Flash Crash rules were designed for price swings, not trading volume, they did not kick in and stop trading, because few of the stocks traded by Knight on that fateful day exceeded the 10% price change threshold.
By 9:34 AM, NYSE computer analysts noticed that market volumes were double the normal level and traced the volume spike back to Knight. Niederauer tried calling Joyce, but Joyce was still at home recovering from knee surgery. The NYSE then alerted Knight's chief information officer, who gathered the firm's top IT people; most trading shops would have flipped a kill switch in their algorithms or would have simply shut down systems. However, Knight had no documented procedures for incident response, another fact we shall return to later. So it continued to fumble in the dark for another 20 minutes, deciding next that the problem was the new code. Because the "old" version allegedly worked, Knight reverted to the old code still running on the eighth server and reinstalled it on the others. As it turned out, this was the worst possible decision, because all eight servers now had the defective Power Peg code activated by the repurposed flag and executing without a throttle. It was not until 9:58 AM that Knight engineers identified the root cause and shut down SMARS on all the servers, but the damage had been done. Knight had executed over 4 million trades in 154 stocks totaling more than 397 million shares; it assumed a net long position in 80 stocks of approximately $3.5 billion as well as a net short position in 74 stocks of approximately $3.15 billion. Under the post-Flash Crash rules enforced by the NYSE, most of the trades were within the 10% price band, so they would stand and could not be cancelled. Joyce called the then SEC chairwoman, Mary Schapiro, for help reversing the trades, but to no avail; she demurred and deflected the matter back to the NYSE. Knight's stock plunged by 33% that day, and the mark-to-market loss for its trades amounted to more than $460 million. News on Wall Street travels fast; other market participants could smell the blood in the water.
Announcements from TD Ameritrade and other customers in the ensuing days that they would continue to do business with Knight did calm matters somewhat, but the company simply did not have enough cash to cover and settle its position liability. Over the weekend, on August 5, Knight raised around $400 million from several investors led by the investment bank Jefferies. The financing terms were 267 million convertible preferred shares priced at $1.50 with a 2% dividend yield; if converted, these shares could give the new investors control of 70% of the company. Knight also agreed to three new board members: Martin Brand from Blackstone, Matthew Nimetz of General Atlantic, and Fred Tomczyk of TD Ameritrade. The deal was a severe blow to Knight's shareholders, but better than the alternative of bankruptcy. The board met during the winter months of 2012 to assess takeover offers, and in December, it agreed to be acquired by a rival, Getco LLC, for $3.70 per share, a sizable premium to what the earlier investors had paid to keep the company afloat. Once the merger with Getco LLC was completed in the summer of 2013, the merged company was renamed KCG Holdings, and Joyce resigned.
The SEC published a detailed report of its investigation into Knight's system failure, and it contains several lessons useful for IT professionals and business leaders.
- The People, Leadership, and Values of an organization are its foundation and foremost success factors. The report highlighted the "defective" CEO certification of Knight's risk management and technology controls in March 2012. SEC Rule 15c3-5 requires broker-dealers to "appropriately control the risks associated with market access, so as not to jeopardize their own financial condition, that of other market participants, the integrity of trading on the securities markets, and the stability of the financial system." Subsection (b) of the Rule then specifies that these companies must "establish, document, and maintain a system of risk management controls and supervisory procedures reasonably designed to manage the financial, regulatory, and other risks" of having market access. Subsection (c) of the Rule also requires brokers or dealers to have systematic financial risk management controls and supervisory procedures that are reasonably designed to prevent the entry of erroneous orders that exceed pre-set credit and capital thresholds in the aggregate for each customer and the broker or dealer. Finally, subsection (e) of the Rule requires that the CEO review and certify that the controls and procedures comply with subsections (b) and (c) of the Rule. The SEC cited multiple violations of the Rule, including no controls to prevent erroneous entry of orders, no pre-set capital thresholds for the firm in the aggregate, no technology controls and supervisory procedures for software deployment, no written incident response procedures, and an inadequate written description of its enterprise risk management controls. Interestingly, the wording of the company certification itself was rather odd, stating only that Knight had "processes" in place to comply with the Rule, but not that the controls and procedures themselves actually complied with it. In the end, the CEO certification was the proverbial lipstick on a pig, and it was as worthless as Knight's risk and IT governance controls. Another side effect of the August 1 event was that many of the millions of orders that SMARS sent were naked short sale orders in which Knight neither marked the order as a short sale nor borrowed the underlying security, violating Rules 200(g) and 203(b), respectively, of Regulation SHO. Knight was also fined $12 million payable to the US Treasury for its short sale violations. Ultimately, executives must demand compliance with the law and expect excellence in business operations. Furthermore, the Board must do its own due diligence and verify that the company is appropriately managing its enterprise risks.
- Operator errors during complex software deployments are all too common. Knight could have prevented the failure or minimized the damage with a variety of DevOps controls, including a simple peer review of code and deployment, clear written procedures for software deployment and incident response, a visual dashboard to verify the version of software deployment units, automated scripts in TeamCity, Octopus Deploy, bash, Python, or PowerShell (pick your favorite language or tool) to consistently deploy the software across the servers, or a re-architecture of the SMARS system to use Docker containers and an orchestration system like Kubernetes or Swarm to keep the deployed version of live software consistent, with the added benefit of scalability.
- Poor time and project management was another reason Knight failed to deliver the RLP solution. Knight's IT project managers and CIO should have pushed back on the hyper-aggressive delivery schedule and countered its business leaders with an alternative phased schedule instead of the Big Bang (pun intended). Thirty days to implement, test, and deploy major changes to an algorithmic trading system that is used to make markets worth billions of dollars daily is impulsive, naive, and reckless.
- Risk management is a vital capability for a modern organization, especially for financial services companies. The SEC's report concluded: "Although automated technology brings benefits to investors, including increased execution speed and some decreased costs, automated trading also amplifies certain risks. As market participants increasingly rely on computers to make order routing and execution decisions, it is essential that compliance and risk management functions at brokers or dealers keep pace... Good technology risk management practices include quality assurance, continuous improvement, controlled user acceptance testing, process measuring, management and control, regular and rigorous review for compliance with applicable rules and regulations, an independent audit process, technology governance that prevents software malfunctions, system errors and failures, service outages, and when such issues arise, a prompt, effective, and risk-mitigating response." While Knight had order controls in other systems, it did not compare orders exiting SMARS with those that entered it. Knight's primary risk monitoring tool, known as "PMON", was a post-execution position monitoring system. At the opening of the market, senior Knight personnel observed a large volume of positions in a special account called 33 that temporarily held multiple types of positions, including positions resulting from executions that Knight received back from markets and that its systems could not match to the unfilled quantity of a parent order. There was a $2 million gross limit on the 33 account, but it was not linked to any automated controls concerning Knight's overall financial exposure. Furthermore, PMON relied entirely on human monitoring, did not generate automated alerts, and did not highlight account exposures that had exceeded a limit. Moreover, Knight had no circuit breakers, which are a standard pattern and practice for financial services companies.
- Design, implementation, and DevOps details of your system components matter. Thou shalt not run dead code; prune dead code and use version control systems to track the changes. Thou shalt not repurpose configuration flags; activate new features with new flags. Thou shalt automate thy deployments. Peer review of code and deployment artifacts should raise questions that prevent and resolve such faults before they reach a live environment.
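The risk-management lesson above turns on a limit that existed but was not wired to any automated control. As a rough illustration of what such wiring might look like, the Python sketch below checks every order against a gross exposure limit before it is sent and latches a kill switch on the first breach. The `ExposureGate` class, its method names, and the dollar figures in the usage note are hypothetical; the $2 million value simply echoes the account limit mentioned in the SEC report.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ExposureGate:
    """Hypothetical pre-trade gate that blocks orders once gross exposure would exceed a limit."""
    gross_limit: float
    positions: Dict[str, float] = field(default_factory=dict)  # signed dollar value per symbol
    halted: bool = False

    def gross_exposure(self) -> float:
        """Sum of absolute position values, long and short alike."""
        return sum(abs(value) for value in self.positions.values())

    def submit(self, symbol: str, signed_value: float) -> bool:
        """Accept the order if gross exposure stays within the limit; otherwise halt."""
        if self.halted:
            return False
        projected = dict(self.positions)
        projected[symbol] = projected.get(symbol, 0.0) + signed_value
        if sum(abs(value) for value in projected.values()) > self.gross_limit:
            self.halted = True  # kill switch: stays tripped until a human resets it
            return False
        self.positions = projected
        return True


if __name__ == "__main__":
    gate = ExposureGate(gross_limit=2_000_000)
    print(gate.submit("AAA", 1_500_000))   # within the limit
    print(gate.submit("BBB", -1_000_000))  # would breach the limit, so the gate halts
    print(gate.submit("AAA", -100_000))    # rejected because the gate is halted
```

The gate measures gross rather than net exposure on purpose: Knight's long and short positions roughly offset each other in net terms, yet both had to be unwound at a loss. Latching the halt, instead of re-evaluating on every order, reflects the idea that a breach should force a human decision before trading resumes.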
KCG Holdings was eventually acquired by another market-making rival, Virtu LLC, in July 2017 for $1.4 billion. The silver lining to the story was that Knight was not too big to fail, and the market handled the failure with a relatively organized rescue without the help of taxpayers. However, a dark cloud remains; market data suggest that 70% of US equity trading is now executed by high-frequency trading firms, and one can only wonder when, not if, the next Flash Crash will occur.