Saturday, October 11, 2014

Basic statistics of NANEX data

In this blog I will show basic statistics over the NANEX trades. This is really just warmup for the more interesting analysis to follow.  The trades data is in the relation nanex_trades on both grokit and fgrokit. The schema of the relation is:


AttributeType
TypeCategory
SymbolFactor
MsOfDayInteger
DateDate
ListedExgCategory
ReportingExgCategory
PriceFloat
TradeConditionCategory
ConditionFlagsCategory
BATECodeCategory
SizeInteger

As a reminder from the previous post, I am running experiments on grokit and fgrokit, two servers with 64 cores, 512GB of main memory and 24 disks. The difference is that fgrokit has 16 SDD disks (out of the 24 total). SDD disks have no seek time, thus jumping over columns does not significantly affect read speed.

Number of trades by type

The following program computes the number of trades by type. Line 4 accesses the data, 5-9 specifies a grouping on Type with a Count() aggregation and line 11 pushes the result to the front end. The result of the query is:

Typecnt
Equity56,831,509,648
Bond75,955

Basically, the data contains 56.8 billion equities and 76 thousand bonds. All other types were filtered out at data loading time. The performance is the following:

ServerTimeTuple SpeedRead SpeedCpu Load
grokit122.1 s520 MT/s520 MB/s20.5/64
fgrokit43.7 s1600 MT/s1600 MB/s63.5/64

Interestingly, grokit is almost 3 times slower than fgrokit. In fact, as you will see later, fgrokit is reading data far below the max rate. fgrokit seems to be on the border of being CPU bound. grokit is far below (20 out of 64 cores to process). Notice that fgrokit processes data and an insane 1.6 billion tuples/second. The 520 million tuples/second achieved by grokit is impressive as well. This is of course possible only with skinny columns like Type (1 byte)

It is worth mentioning that GrokIt framework has a significant overhead to run the queries.  The actual tuple processing time for fgrokit is 57GB/1.6GB = 35.6s out of 43.7s total execution time (7s overhead).

Number of trades by date

To compute number of trades by date we can use: This is similar to the previous query but the grouping is by date. The performance results we obtain are:

ServerTimeTuple SpeedRead SpeedCpu Load
grokit239.5 s250 MT/s1000 MB/s11/64
fgrokit67.9 s950 MT/s3800 MB/s43.5/64

This time fgrokit reads data at 3.8GB/s that allows it to pump and process 950 million tuples/second. grokit only achieves 1/4 of that speed but used 1/4 th of the CPUs. One column queries like this are a problem for grokit server since the disk heads have to jump a lot (hence the smallish 1.0GB/s speed).
To see how the query behaves with more columns, we extend the query like this: The performance results are:

ServerTimeTuple SpeedRead SpeedCpu Load
grokit762 s80 MT/s1200 MB/s11/64
fgrokit227.6 s250 MT/s4000 MB/s37.8/64

The read speed of both grokit and fgrokit is slightly higher. Even though 3 columns are accessed (Date, Size, Price), fgrokit still maintains a 250 million tuples/second processing rate and it is not CPU bound. The top-10 results sorted by number of trades is:

DatecountavgPriceavgSizeavgCosttotalVoltotal
2008-Oct-1082.91 M$30.30 273.32$6,329.04 22.66 B$524.76 B
2011-Aug-0874.73 M$37.85 297$7,735.54 22.19 B$578.04 B
2008-Sep-1873.79 M$37.69 302.69$8,742.44 22.34 B$645.08 B
2011-Aug-0970.11 M$37.49 291.92$8,019.87 20.47 B$562.25 B
2011-Aug-0569.22 M$38.62 285.11$7,890.37 19.74 B$546.21 B
2011-Aug-1067.08 M$36.92 289.68$7,408.78 19.43 B$496.97 B
2010-May-0766.29 M$35.20 398.98$7,787.90 26.45 B$516.24 B
2008-Oct-0866.03 M$33.74 270.21$7,105.08 17.84 B$469.15 B
2010-May-0665.95 M$35.25 450.53$8,190.20 29.71 B$540.13 B
2008-Oct-1664.72 M$31.25 266.13$6,566.62 17.22 B$425.00 B

and sorted by the total volume is:

DatecountavgPriceavgSizeavgCosttotalVoltotal
2008-Sep-1873.79 M$37.69 302.69$8,742.44 22.34 B$645.08 B
2011-Aug-0874.73 M$37.85 297$7,735.54 22.19 B$578.04 B
2011-Aug-0970.11 M$37.49 291.92$8,019.87 20.47 B$562.25 B
2008-Sep-1951.57 M$40.18 360.29$10,760.68 18.58 B$554.90 B
2011-Aug-0569.22 M$38.62 285.11$7,890.37 19.74 B$546.21 B
2010-May-0665.95 M$35.25 450.53$8,190.20 29.71 B$540.13 B
2008-Sep-1761.55 M$38.49 319.88$8,735.88 19.69 B$537.71 B
2008-Jan-2348.51 M$42.19 340.27$11,024.05 16.51 B$534.74 B
2008-Oct-1082.91 M$30.30 273.32$6,329.04 22.66 B$524.76 B
2008-Sep-1662.94 M$37.94 323.74$8,217.81 20.38 B$517.21 B

Interestingly September 18, 2008 is the highest total traded day but the 3rd highest in terms of number of transactions. In terms of total volume, it is 97th, way down the list. This clearly requires further investigation.

Number of trades by Symbol

This time we need to group by symbol instead of date. Symbol is represented as  factor with integer support and uses the same space as Date.  The query is:

The running times are fgrokit: 238.6s and grokit: 711.8s. These times are about the same as the Date experiments. The jump from 2300 to 60,000 groups does not seem to have any influence (everything is still CPU bound but the CPU usage is about the same).


The result consists of 60758 rows, one for each of the symbols that appear in the data. Notice that in the query we still used View() since GrokIt Visualizer has no trouble dealing with results of this size (I'll have a separate blob post on this issue). The top 10 results by number of transactions are:

SymbolcountavgPriceavgSizeavgCosttotalVoltotal
SPY839.53 M$123.17 480.62$59,947.39 403.49 B$50.33 T
BAC420.32 M$16.31 732.49$10,080.15 307.88 B$4.24 T
C347.03 M$20.83 1.14 K$10,628.76 396.42 B$3.69 T
MSFT316.31 M$27.60 463.16$12,705.05 146.50 B$4.02 T
IWM310.02 M$73.01 446.18$33,008.40 138.33 B$10.23 T
AAPL294.54 M$236.23 193.74$41,417.26 57.06 B$12.20 T
INTC288.38 M$21.35 484.98$10,425.26 139.86 B$3.01 T
JPM279.98 M$39.23 266.06$10,491.78 74.49 B$2.94 T
WFC268.17 M$28.41 288.39$8,280.18 77.34 B$2.22 T
CSCO266.36 M$21.22 474.71$10,079.06 126.44 B$2.68 T

The set of symbols that make it to top 10 is not surprising. The number of individual transactions is. Most of this information is not usually reported by financial data analysts and aggregates are hard to come by. We only needed 4 minutes on fgrokit to tabulate this information.

No comments:

Post a Comment