High Performance Trading/Algo Speed with Wombat
Design and Implementation Guide
January 24, 2008

Americas Headquarters
Cisco Systems, Inc.
170 West Tasman Drive
San Jose, CA 95134-1706
USA
http://www.cisco.
ALL DESIGNS, SPECIFICATIONS, STATEMENTS, INFORMATION, AND RECOMMENDATIONS (COLLECTIVELY, "DESIGNS") IN THIS MANUAL ARE PRESENTED "AS IS," WITH ALL FAULTS. CISCO AND ITS SUPPLIERS DISCLAIM ALL WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OR ARISING FROM A COURSE OF DEALING, USAGE, OR TRADE PRACTICE.
Contents

Introduction 1-1
  Target Audience 1-3
  Target Market 1-4
  Automated Trading Benefits 1-4
Automated Trading Architecture 1-6
Concept Features 1-10
Tested Components 1-11
  Servers 1-11
  Networking 1-12
  Operating System 1-12
Test Implementation Framework 1-13
  Testing Topology 1-13
Testing 1-14
  Methodology 1-14
    Test Setup 1-14
    Procedures 1-15
  Data Observations 1-16
    Time Synchronization 1-16
    Limitations 1-16
Testing Results 1-17
  Mean Latency 1-17
  Latency Dispersion 1-22
Appendix A—Device Configuration
Configure Ethernet Attributes of Leaf Switches 1-49
Configure Ethernet Attributes of Core Switches 1-49
Validate the Ethernet Management Network 1-50
Set Up SE Tools on an Ethernet-attached Host 1-50
Perform a Switch Chassis Inspection 1-50
Perform a Physical Inspection 1-50
(Optional) Record Leaf Switches and Hosts 1-50
Disable Uplinks on Leaf Switches 1-51
Install Host-Side Drivers and Configure IP Addresses to InfiniBand Ports on Hosts
Troubleshoot "Bring Up" Pod 1-53
High Performance Trading/Algo Speed with Wombat
Design and Implementation Guide

Introduction

Automated trading and new regulatory demands are the two main forces behind change in the financial markets. Firms are trying to maintain their competitive edge by constantly changing their trading strategies and increasing the speed of trading. New financial products, business models, and trading tools all demand extremely fast response times.
memory read/writes, as well as kernel bypass. Applications that specifically support Message Passing Interface (MPI) or OpenFabrics messaging transports can effectively achieve latencies of less than 10 microseconds. The InfiniBand fabric can also be integrated seamlessly with existing Ethernet networks by using SFS 3000 Series switches, which eliminates any interoperability concerns.
Figure 1 shows the general automated trading solution environment.
Target Market

Financial services firms require vast amounts of computing power to run their business. They use homegrown applications or, increasingly, ISV applications to do these computations. However, these applications were originally built for SMP machines or hard-coded clusters of a certain size. The net result is that computations take hours or days, while the business needs them to happen in seconds or minutes.
Regulatory changes such as Reg NMS generate more quote, order, and cancel/replace messages as equity firms adapt to more electronic business processes. The subpenny pricing rule also increases demands on the supporting infrastructure. MiFID, which goes into effect in Europe next year, is expected to lead to higher data volumes as well, since investment banks that internalize trades will be required to publish their pre-trade quotes electronically.
Automated Trading Architecture

Figure 2 shows the general automated trading network architecture.
Figure 3 shows a high-level view of the automated trading architecture.

Figure 3 Automated Trading—High Level Architecture
(Figure: financial information providers such as Reuters, Bloomberg, and Thomson, together with exchanges, markets, and liquidity venues such as Nasdaq, Arca, and INET.)
Figure 4 shows the “buy” side of the transactional system architecture.
Figure 5 shows the “sell” side of the transactional system architecture.
Concept Features

We are proposing a services-oriented framework for building the next-generation trading architecture. This approach provides a conceptual framework and an implementation path based on modularization and minimization of interdependencies.
Table 1 Service Descriptions and Technologies (continued)

• Storage services: Virtualization of storage hardware (VSANs), data replication, remote backup, and file virtualization
• Trading resilience and mobility: Local and site load balancing and high-availability campus networks
• Wide area application services: Acceleration of applications over a WAN connection for traders residing off campus
• Thin client service: De-coupling of the computing resources fro
Networking

Table 3 Networking Components

• Ethernet switch: Cisco Catalyst 6509, 1 GbE, Supervisor 720, 6748 line cards
• Ethernet NIC: embedded Broadcom 5708 GbE NIC
• Network interface configuration: none
• InfiniBand switch: Cisco SFS 7000 (single data rate)
• InfiniBand HCA: Cisco Cheetah DDR HCA in a PCIe 8x slot (running at SDR)

Operating System

Table 4 Operating System

• Version: RHEL 4.4 64-bit
• Kernel: 2.6.9-42.
Table 5 Application Software (continued)

• Client affinities and process priorities: Not set
• Playback data: OPRA data recorded on 2 April 07

Test Implementation Framework

Testing Topology

Figure 7 Testing Topology
(Figure: the testing topology; a filer for central control at 172.29.213.80, the mcpub playback host, the OPRA feed handler, and mamaperf clients #1 through #4, connected over GigE subnets 1.2.10.x and 1.2.11.x and the InfiniBand subnet 1.2.20.x.)
Testing

Methodology

Test Setup

Six servers are arranged as in Figure 8. The components consist of recorded OPRA (Options Price Reporting Authority) data from April 2, 2007; the Wombat playback mechanism, mcpub, which uses the papareplay library; the Wombat OPRA feed handler (OPRA FH); and Wombat’s preferred performance measurement client, mamaperf. mcpub replays data from previously captured OPRA files. The goal of the papareplay library is to mimic the original timing of the market data.
Figure 8 Test Setup
(Figure: canned OPRA data is replayed by three mcpub instances over a switched Ethernet LAN to three OPRA FH feed handler instances, which publish over Cisco InfiniBand or switched Ethernet via 29West LBM (IP multicast) to twelve mamaperf clients.)

Procedures

A remote shell script was used to ensure that the timing of each run was as consistent with other runs as possible. The script first started the OPRA FH instances.
Next, the mcpub instances were instructed to begin playback at 1x recorded rate and to increase to 4x recorded rate at around 20 seconds after market open. This resulted in four times as much data being played back over the remainder of the run.

Data Observations

The Wombat software provides for in-line time-stamping at several points through the data path.
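The guide does not show how these timestamps are reduced to the statistics that follow. As a minimal sketch, assuming a hypothetical two-column log (epoch-second receive time and per-update latency in microseconds) distilled from the in-line timestamps, which is not the actual Wombat log format, the overall mean latency can be computed like this:

#!/bin/sh
# Mean-latency sketch over an assumed two-column log
# (epoch_seconds latency_usec); not the actual Wombat format.
awk '{ sum += $2; n++ }
     END { printf "mean latency: %.1f usec over %d updates\n", sum / n, n }' client.log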
Testing Results

Mean Latency

The sections below tabulate the mean latencies for test runs at varying rates of playback data. Multiple runs were performed to ensure the consistency of results.

Playback at 1x Recorded Rate

When mcpub/papareplay was programmed to play back at the same rate at which the data had been recorded, the average update rate across all runs was 10.5 Kups (thousands of updates per second) in aggregate for the three channels, with a 10-second peak of approximately 15.4 Kups.
Figure 9 shows a histogram of the mean latency observations for both Cisco IB and Ethernet.

Figure 9 Transport Mean Latency at 1x Recorded Rate

Because the update rate varied over time, it is important to check whether there is a correlation between update rate and latency. Live market data does not flow at a steady rate; it has peaks, troughs, and bursts of updates over small and large intervals. Yet many latency-sensitive applications value predictability of latency.
Figure 10 Latency vs Update Rate at 1x Recorded Rate

Consistent latency in the face of differences in 10-second update rates is one illustration of the ability of the DAL/IB solution to improve predictability.
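As one way to perform that check, the same assumed two-column log can be bucketed into 10-second windows, printing each window’s update rate next to its mean latency and mirroring the comparison plotted in Figure 10:

#!/bin/sh
# Rate-vs-latency sketch: bucket the assumed (epoch_seconds latency_usec)
# log into 10-second windows and report rate alongside mean latency.
awk '
{
    b = int($1 / 10)            # 10-second window index
    count[b]++
    sum[b] += $2
}
END {
    print "window  rate(ups)  mean_latency(usec)"
    for (b in count)            # window order is not guaranteed
        printf "%6d  %9.1f  %18.1f\n", b, count[b] / 10, sum[b] / count[b]
}' client.log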
Playback at 4x Recorded Rate

When mcpub/papareplay was programmed to play back four times faster than the recording, the average update rate across all runs was 41 Kups in aggregate for the three channels, with a 10-second peak of approximately 75 Kups, which corresponds to a 10-second peak of 603 Kups for a full OPRA feed. Table 7 shows the latency results. The mean latency across all runs was 50 µsec for DAL/IB and 240 µsec for UDP/Ethernet. Ethernet latency was lower than at the 1x recorded rate.
Figure 11 plots a histogram of the mean latency observations for both Cisco IB and Ethernet.
Figure 12 plots the updates per second (per OPRA line/client) against mean latency. Unlike at 1x recorded rate, the Ethernet latencies do not fall appreciably with update rate. This is consistent with the previous hypothesis that the inverse relationship at 1x was due to batching and flushing; at the rates experienced in the 4x scenario, the buffering and flushing may have been saturated.
Standard Deviation

Table 8 and Table 9 show the standard deviation of latency for each client during the runs, for the 1x playback rate and the 4x playback rate, respectively.

Table 8 Standard Deviation for 1x Playback Rate
Standard deviation of latency (milliseconds)

1x Recorded Rate (Line 13, Line 15, Line 17)

            InfiniBand               Ethernet
Client      Run 1   Run 2   Run 3    Run 1   Run 2   Run 3
1           0.085   0.114   0.088    0.647   0.651   0.653
2           0.081   0.057   0.087    0.652   0.651   0.647
3           0.045   0.066   0.081    0.648   0.
Table 9 Standard Deviation for 4x Playback Rate
Standard deviation of latency (milliseconds)

4x Recorded Rate

Line 13     InfiniBand               Ethernet
Client      Run 1   Run 2   Run 3    Run 1   Run 2   Run 3
1           0.123   0.124   0.140    0.428   0.429   0.426
2           0.131   0.097   0.109    0.416   0.413   0.415
3           0.075   0.116   0.088    0.415   0.414   0.424
4           0.075   0.088   0.082    0.406   0.412   0.411

Line 15
1           0.116   0.124   0.140    0.467   0.457   0.458
2           0.151   0.112   0.109    0.453   0.441   0.440
3           0.083   0.096   0.088    0.448   0.451   0.447
4           0.067   0.086   0.
Figure 13 and Figure 14 show histograms of the standard deviations at the 1x playback rate and at the 4x playback rate. In both charts, the InfiniBand data shows a very narrow distribution of standard deviations, that is, low latency dispersion. The very low standard deviations indicate that the InfiniBand solution exhibits good predictability and low jitter.
Figure 14 Standard Deviation of Latency Histogram at 4x Recorded Rate
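Standard deviations like those tabulated above can be reproduced from per-client logs with the same assumed two-column format; a minimal dispersion sketch:

#!/bin/sh
# Dispersion sketch: per-client standard deviation of latency from the
# assumed (epoch_seconds latency_usec) log format.
for LOG in "$@"; do
    awk -v name="$LOG" '
    { n++; sum += $2; sumsq += $2 * $2 }
    END {
        mean = sum / n
        sd = sqrt(sumsq / n - mean * mean)   # population standard deviation
        printf "%s: mean=%.1f usec stddev=%.1f usec (n=%d)\n", name, mean, sd, n
    }' "$LOG"
done

Running it as, for example, ./dispersion.sh client*.log prints one summary line per client log.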
Max Latencies

Figure 15 and Figure 16 show the distribution of maximum latencies at 1x recorded rate and at 4x recorded rate. At 1x recorded rate, the mean of the maximum latencies is 1.4 milliseconds for DAL/Cisco InfiniBand and 5.7 milliseconds for UDP/Ethernet. At 4x recorded rate, the means of the two networks move closer: 4.9 milliseconds on DAL/Cisco InfiniBand and 6.9 milliseconds with UDP/Ethernet.
Figure 16 Max Latency Histogram at 4x Recorded Rate

Appendix A—Device Configuration

This section provides sample configurations for the two devices used in the solution: the Cisco Catalyst 6500 switch and the SFS 7000.

Catalyst Switch Configuration

en-6509-1#show version
Cisco Internetwork Operating System Software
IOS (tm) s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-VM), Version 12.2(18)SXF10, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.
!
!
fabric buffer-reserve queue
port-channel load-balance src-dst-mac
diagnostic cns publish cisco.cns.device.diag_results
diagnostic cns subscribe cisco.cns.device.
 no ip address
 shutdown
!
interface GigabitEthernet1/9
 no ip address
 shutdown
!
interface GigabitEthernet1/10
 no ip address
 shutdown
!
interface GigabitEthernet1/11
 no ip address
 shutdown
!
interface GigabitEthernet1/12
 no ip address
 shutdown
!
interface GigabitEthernet1/13
 no ip address
 shutdown
!
interface GigabitEthernet1/14
 no ip address
 shutdown
!
interface GigabitEthernet1/15
 no ip address
 shutdown
!
interface GigabitEthernet1/16
 no ip address
 shutdown
!
interface Gigab
 no ip address
 shutdown
!
interface GigabitEthernet1/25
 switchport
 switchport access vlan 10
 switchport mode access
 no ip address
!
interface GigabitEthernet1/26
 switchport
 switchport access vlan 10
 switchport mode access
 no ip address
!
interface GigabitEthernet1/27
 switchport
 switchport access vlan 10
 switchport mode access
 no ip address
!
interface GigabitEthernet1/28
 switchport
 switchport access vlan 10
 switchport mode access
 no ip address
!
interface GigabitEthernet1/29
 switchport
 switchport access vlan 14
 switchport mode access
 no ip address
!
interface GigabitEthernet1/36
 switchport
 switchport access vlan 14
 switchport mode access
 no ip address
!
interface GigabitEthernet1/37
 switchport
 switchport access vlan 14
 switchport mode access
 no ip address
!
interface GigabitEthernet1/38
 switchport
 switchport access vlan 14
 switchport mode access
 no ip address
!
interface GigabitEthernet1/39
 switchport
 switchport access vlan 302
 switchport mode acc
 switchport mode access
 no ip address
!
interface GigabitEthernet1/46
 switchport
 switchport access vlan 16
 switchport mode access
 no ip address
!
interface GigabitEthernet1/47
 switchport
 switchport access vlan 16
 switchport mode access
 no ip address
!
interface GigabitEthernet1/48
 switchport
 switchport access vlan 16
 switchport mode access
 no ip address
!
interface GigabitEthernet2/1
 no ip address
 shutdown
!
interface GigabitEthernet2/2
 no ip address
 shutdown
!
interface Giga
!
interface GigabitEthernet2/12
 no ip address
 shutdown
!
interface GigabitEthernet2/13
 no ip address
 shutdown
!
interface GigabitEthernet2/14
 no ip address
 shutdown
!
interface GigabitEthernet2/15
 no ip address
 shutdown
!
interface GigabitEthernet2/16
 no ip address
 shutdown
!
interface GigabitEthernet2/17
 no ip address
 shutdown
!
interface GigabitEthernet2/18
 no ip address
 shutdown
!
interface GigabitEthernet2/19
 no ip address
 shutdown
!
interface GigabitEthernet2/20
 no ip a
!
interface GigabitEthernet2/27
 switchport
 switchport access vlan 11
 switchport mode access
 no ip address
!
interface GigabitEthernet2/28
 switchport
 switchport access vlan 11
 switchport mode access
 no ip address
!
interface GigabitEthernet2/29
 switchport
 switchport access vlan 14
 switchport mode access
 no ip address
!
interface GigabitEthernet2/30
 switchport
 switchport access vlan 14
 switchport mode access
 no ip address
!
interface GigabitEthernet2/31
 switchport
 switchport a
!
interface GigabitEthernet2/38
 switchport
 switchport access vlan 10
 switchport mode access
 no ip address
!
interface GigabitEthernet2/39
 switchport
 switchport access vlan 302
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 302
 switchport mode access
 no ip address
 channel-group 1 mode on
!
interface GigabitEthernet2/40
 switchport
 switchport access vlan 302
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 302
 switchport mode access
 no ip add
 switchport access vlan 11
 switchport mode access
 no ip address
!
interface GigabitEthernet2/48
 switchport
 switchport access vlan 14
 switchport mode access
 no ip address
!
interface GigabitEthernet5/1
 no ip address
 shutdown
!
interface GigabitEthernet5/2
 no ip address
 shutdown
!
interface TenGigabitEthernet7/1
 switchport
 switchport access vlan 50
 switchport mode access
 no ip address
!
interface TenGigabitEthernet7/2
 switchport
 switchport access vlan 50
 switchport mode access
 no ip address
!
interface Vlan1
 no ip address
!
interface Vlan10
 ip address 1.2.10.1 255.255.255.0
!
interface Vlan11
 ip address 1.2.11.1 255.255.255.0
!
interface Vlan12
 no ip address
 shutdown
!
interface Vlan14
 ip address 1.2.12.1 255.255.255.0
!
interface Vlan15
 no ip address
 shutdown
!
interface Vlan16
 ip address 1.2.16.1 255.255.255.0
!
interface Vlan20
 ip address 1.2.20.1 255.255.255.0
!
interface Vlan30
 ip address 1.2.30.1 255.255.255.
 transport input lat pad udptn telnet rlogin mop ssh nasi acercon
!
exception core-file
!
no cns aaa enable
end

SFS 7000 Configuration (Core)

svbu-hs-ts120-8> show version
================================================================================
System Version Information
================================================================================
system-version : SFS-7000D TopspinOS 2.10.0-ALPHA releng #323 04/16/2007 23:28:29
contact : tac@cisco.
 speed 4x-sdr
!
interface ib 8
 speed 4x-sdr
!
interface ib 9
 speed 4x-sdr
!
interface ib 10
 speed 4x-sdr
!
interface ib 13
 speed 4x-sdr
!
interface ib 14
 speed 4x-sdr
!
interface ib 15
 speed 4x-sdr
!
interface ib 16
 speed 4x-sdr
!
interface ib 17
 speed 4x-sdr
!
interface ib 18
 speed 4x-sdr
!
interface ib 19
 speed 4x-sdr
!
interface ib 20
 speed 4x-sdr
!
interface ib 21
 speed 4x-sdr
!
interface ib 22
 speed 4x-sdr
!
interface ib 23
 speed 4x-sdr
!
interface ib 24
 speed 4x-sdr
!
!
hostname "svbu-
Appendix B—Building and Configuring Switches

Definitions

Table 10 Definition of Key Terms

• Blocking: Blocking topologies do not provide a 1:1 ratio of paths in and paths out; in a blocking topology, traffic may potentially contend for paths. For example, a leaf switch with 16 host-facing ports but only 8 uplinks is 2:1 oversubscribed and therefore blocking.
• Non-blocking: Non-blocking topologies provide, for each path into a switch or network, an equal path out. Non-blocking topologies avoid oversubscription.
Note Focus on your out-of-band (Ethernet) network first. Verify that all of your hosts and switches are available on the out-of-band network before you bring up the InfiniBand network. Do not try to bring up the cluster using the in-band IPoIB management interfaces.

• Break any given cluster into segments, or “pods.” Bringing up a “pod” means bringing up all hosts connected to a leaf switch that is not logically connected to any core switches.
Table 11 Planning Requirements (continued)

Where do I put my core switches?
Core switches should reside in racks that contain no hosts and at most one additional core switch. Racks for core switches should have side panels removed. Rack space immediately to the left and immediately to the right of all core switches should be vacant, because cables will feed out from the core switches into this space.
Where do I put my leaf switches?
Leaf switches typically reside in the same racks as hosts. The number of leaf switches per rack and the placement of each switch in the rack depend on the blocking factor of the fabric. Refer to Definitions, page 42 for details on blocking/subscription. Cisco provides two common rack configuration models.
Note Dashed borders in Figure 18 delineate pods. Remember, do not rack anything yet! Just keep in mind where it is all going. This section is about planning, not executing.

How many HCAs do I need?
You need at least one HCA per host. You can install one-port HCAs or two-port HCAs.

Where do I put my hosts?
Ideally, 32 to 36 hosts reside in each non-core rack, along with 2 or 3 leaf switches.
How will I identify my switches and hosts?
You should create naming conventions that address the following components:
• Rack names
• Host names
• Switch names
• Rack in which a given host resides
• Rack in which a given switch resides
If the organization for which you are installing the cluster has already established naming conventions, defer to the existing rules.
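The conventions themselves are up to you. As an illustrative sketch only, the rack-2 pattern used by the examples later in this appendix (hosts R2H01 through R2H16 served by leaf switch R2S101) can be generated rather than typed by hand:

#!/bin/sh
# Naming-convention sketch: emit host names for rack 2 following the
# R2H<nn> pattern used in this appendix; the pattern is one example,
# not a required convention.
for i in $(seq -w 1 16); do
    echo "R2H$i"
done > R2S101          # file named after the leaf switch serving the rack
echo "wrote $(wc -l < R2S101) host names to R2S101"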
Install Interface Cards in the Hosts

Install your HCA(s) in your hosts. For detailed instructions, refer to the installation guides that arrive with your HCAs. Install your NIC(s) in your hosts (if necessary).

Rack and Cable All Hardware

Rack and cable all hardware as follows:

Step 1 Mount your switches, hosts, and any other chassis in your racks according to the plan that you developed in The Very First Thing That You Do: Plan, page 43.
Configure Ethernet Attributes of Leaf Switches

Enter the following series of commands on each leaf switch:

Step 1 Log in to the switch:
  Login: super
  Password: xxxxx
Step 2 Enter Privileged Exec mode:
  switch> enable
Step 3 Enter Global Configuration mode:
  switch# configure terminal
Step 4 Configure a device name from your naming conventions (the CLI prompt will not immediately reflect the name change):
  switch(config)# hostname R2S101
Validate the Ethernet Management Network

Bring up the Ethernet management network according to the plan that you developed in The Very First Thing That You Do: Plan, page 43.

• Set up Ethernet IP addresses on all switches and hosts.
• Verify logical connectivity to all switches and hosts (a sample connectivity sweep appears below).

Set Up SE Tools on an Ethernet-attached Host

• Install expect software.
• Install Perl, TCL, and Python.
• Collect tools from SVBU.
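One way to verify logical connectivity to all switches and hosts is a simple ping sweep from the management host. The following is a minimal sketch; the hosts.txt file name (one management name or address per line) is an assumption for illustration.

#!/bin/sh
# Connectivity sweep over the Ethernet management network.
# Assumes hosts.txt lists one switch or host per line, resolvable
# via DNS or /etc/hosts; the file name is an assumption.
while read -r name; do
    if ping -c 1 -W 2 "$name" > /dev/null 2>&1; then
        echo "OK   $name"
    else
        echo "FAIL $name"
    fi
done < hosts.txt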
The text that follows is an example file named R2S101:

R2H01
R2H02
R2H03
R2H04
R2H05
R2H06
R2H07
R2H08
R2H09
R2H10
R2H11
R2H12
R2H13
R2H14
R2H15
R2H16

Disable Uplinks on Leaf Switches

Access each leaf switch through the Ethernet management network and disable the uplinks to the core switches.

Step 1 Log in to the switch:
  Login: super
  Password: xxxxx
Step 2 Enter Privileged Exec mode:
  R2S101> enable
Install Drivers from an ISO on NFS

Step 1 Log in to your host:
  host login: user-id
  Password: password
Step 2 Navigate to the ISO on your file system:
  host:~ # cd path/image
Step 3 Mount the ISO:
  host:/path/image # mount -o loop cisco.iso /mnt
Step 4 Navigate to the ISO:
  host:/path/image # cd /mnt
Step 5 Enter the tsinstall command:
  host:/mnt # ./tsinstall
Step 6 Reboot your host:
  host:~ # reboot
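Repeating these steps by hand on every host in a pod is error-prone. The following is a minimal fan-out sketch using the per-leaf host file shown earlier; the NFS path to the ISO and root SSH access are assumptions, and only the mount/tsinstall/reboot sequence comes from the procedure above.

#!/bin/sh
# Fan the driver install out to every host listed in a leaf switch's
# host file (for example, the R2S101 file shown earlier). The ISO_DIR
# path and root SSH access are assumptions for illustration.
HOSTFILE=${1:?usage: $0 hostfile}
ISO_DIR=/nfs/images     # assumed NFS directory containing cisco.iso

while read -r host; do
    echo "=== installing on $host ==="
    ssh "root@$host" "mount -o loop $ISO_DIR/cisco.iso /mnt &&
        cd /mnt && ./tsinstall && cd / && umount /mnt &&
        reboot" || echo "install FAILED on $host"
done < "$HOSTFILE"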
Troubleshoot “Bring Up” Pod

Embedded SM

Step 1 Log in to the switch:
  Login: super
  Password: xxxxx
Step 2 Enter Privileged Exec mode:
  R2S101> enable
Step 3 Configure unlimited output from show commands:
  R2S101# terminal length 0
Step 4 Enter Global Configuration mode:
  R2S101# configure terminal
Step 5 Configure the trace level for log tracking:
  R2S101(config)# trace app 26 mod 10 level terse flow
Error: Configuration caused by some ports in INIT state

Course of action:
1. Look for “Failed discover node test, node 00:05:ad:00:00:02:22:d0, port_num= 14, error code 1” in the log.
Note The message provides you with the device GUID (in this case, 00:05:ad:00:00:02:22:d0) and port number (14).
2. Match the GUID to its SFS chassis and identify the chassis type.

Error: SM OUT_OF_SERVICE trap for GID=0xfe800000000000000005ad00000348e1
SM

The SM can monitor thresholds on the error counters, and the SM can notify you. Refer to the relevant SM documentation for details (the Cisco SFS CLI guide or the Cisco High-Performance SM User Guide).

Manual

Two scripts are available: reset_counters, which clears the port counters throughout the network, and get_counters, which collects the counters from throughout the network.
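The scripts themselves are not reproduced in this guide. As a minimal sketch of the get_counters idea, assuming a switches.txt list of management addresses and with COUNTER_CMD as a placeholder (the actual counter-display command depends on the installed OS and is an assumption here):

#!/bin/sh
# get_counters-style sweep: run a counter-display command on every
# switch and append the output to a timestamped report. switches.txt
# and COUNTER_CMD are placeholders, not the actual Cisco scripts.
SWITCHES=switches.txt
COUNTER_CMD="show interface ib all"   # assumption; substitute the real command
REPORT="counters-$(date +%Y%m%d-%H%M%S).txt"

while read -r sw; do
    {
        echo "===== $sw ====="
        ssh "super@$sw" "$COUNTER_CMD"
    } >> "$REPORT"
done < "$SWITCHES"
echo "counters written to $REPORT"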
• Fabric module
• Return material authorization (RMA) the offending device

This means that there are bad links and the port is turned off.

Q. In the case of a 120: is it the port on the switch, the port on the HCA, or the cable?
A. Swap the switch port. Swap the HCA port.

Q. How do I tell what’s bad?
A. Review the problem symptoms (from Step 15). Re-enable the port that you shut down in Step 15. Verify that the problem recurs.
5. Internal link problem (Cisco 7008; similar steps apply to the Cisco 7012 and 7024)

Try to identify the bad FRU:
• node card
• core card
• backplane (almost never)

Figure out the failed ports. (Use the show diag fru error command at the switch CLI.) Identify the cards that create the failed connection. Begin by swapping out the relevant node card with another node card (do not introduce an outside card).