Adventures in Large Finance Data Processing: Basic statistics of NANEX data

In this blog I will show basic statistics over the NANEX trades. This is really just warmup for the more interesting analysis to follow. The trades data is in the relation nanex_trades on both grokit and fgrokit. The schema of the relation is:

Attribute	Type
Type	Category
Symbol	Factor
MsOfDay	Integer
Date	Date
ListedExg	Category
ReportingExg	Category
Price	Float
TradeCondition	Category
ConditionFlags	Category
BATECode	Category
Size	Integer

As a reminder from the previous post, I am running experiments on grokit and fgrokit, two servers with 64 cores, 512GB of main memory and 24 disks. The difference is that fgrokit has 16 SDD disks (out of the 24 total). SDD disks have no seek time, thus jumping over columns does not significantly affect read speed.

Number of trades by type

The following program computes the number of trades by type. Line 4 accesses the data, 5-9 specifies a grouping on Type with a Count() aggregation and line 11 pushes the result to the front end. The result of the query is:

Type	cnt
Equity	56,831,509,648
Bond	75,955

Basically, the data contains 56.8 billion equities and 76 thousand bonds. All other types were filtered out at data loading time. The performance is the following:

Server	Time	Tuple Speed	Read Speed	Cpu Load
grokit	122.1 s	520 MT/s	520 MB/s	20.5/64
fgrokit	43.7 s	1600 MT/s	1600 MB/s	63.5/64

Interestingly, grokit is almost 3 times slower than fgrokit. In fact, as you will see later, fgrokit is reading data far below the max rate. fgrokit seems to be on the border of being CPU bound. grokit is far below (20 out of 64 cores to process). Notice that fgrokit processes data and an insane 1.6 billion tuples/second. The 520 million tuples/second achieved by grokit is impressive as well. This is of course possible only with skinny columns like Type (1 byte)

It is worth mentioning that GrokIt framework has a significant overhead to run the queries. The actual tuple processing time for fgrokit is 57GB/1.6GB = 35.6s out of 43.7s total execution time (7s overhead).

Number of trades by date

To compute number of trades by date we can use: This is similar to the previous query but the grouping is by date. The performance results we obtain are:

Server	Time	Tuple Speed	Read Speed	Cpu Load
grokit	239.5 s	250 MT/s	1000 MB/s	11/64
fgrokit	67.9 s	950 MT/s	3800 MB/s	43.5/64

This time fgrokit reads data at 3.8GB/s that allows it to pump and process 950 million tuples/second. grokit only achieves 1/4 of that speed but used 1/4 th of the CPUs. One column queries like this are a problem for grokit server since the disk heads have to jump a lot (hence the smallish 1.0GB/s speed).
To see how the query behaves with more columns, we extend the query like this: The performance results are:

Server	Time	Tuple Speed	Read Speed	Cpu Load
grokit	762 s	80 MT/s	1200 MB/s	11/64
fgrokit	227.6 s	250 MT/s	4000 MB/s	37.8/64

The read speed of both grokit and fgrokit is slightly higher. Even though 3 columns are accessed (Date, Size, Price), fgrokit still maintains a 250 million tuples/second processing rate and it is not CPU bound. The top-10 results sorted by number of trades is:

Date	count	avgPrice	avgSize	avgCost	totalVol	total
2008-Oct-10	82.91 M	$30.30	273.32	$6,329.04	22.66 B	$524.76 B
2011-Aug-08	74.73 M	$37.85	297	$7,735.54	22.19 B	$578.04 B
2008-Sep-18	73.79 M	$37.69	302.69	$8,742.44	22.34 B	$645.08 B
2011-Aug-09	70.11 M	$37.49	291.92	$8,019.87	20.47 B	$562.25 B
2011-Aug-05	69.22 M	$38.62	285.11	$7,890.37	19.74 B	$546.21 B
2011-Aug-10	67.08 M	$36.92	289.68	$7,408.78	19.43 B	$496.97 B
2010-May-07	66.29 M	$35.20	398.98	$7,787.90	26.45 B	$516.24 B
2008-Oct-08	66.03 M	$33.74	270.21	$7,105.08	17.84 B	$469.15 B
2010-May-06	65.95 M	$35.25	450.53	$8,190.20	29.71 B	$540.13 B
2008-Oct-16	64.72 M	$31.25	266.13	$6,566.62	17.22 B	$425.00 B

and sorted by the total volume is:

Date	count	avgPrice	avgSize	avgCost	totalVol	total
2008-Sep-18	73.79 M	$37.69	302.69	$8,742.44	22.34 B	$645.08 B
2011-Aug-08	74.73 M	$37.85	297	$7,735.54	22.19 B	$578.04 B
2011-Aug-09	70.11 M	$37.49	291.92	$8,019.87	20.47 B	$562.25 B
2008-Sep-19	51.57 M	$40.18	360.29	$10,760.68	18.58 B	$554.90 B
2011-Aug-05	69.22 M	$38.62	285.11	$7,890.37	19.74 B	$546.21 B
2010-May-06	65.95 M	$35.25	450.53	$8,190.20	29.71 B	$540.13 B
2008-Sep-17	61.55 M	$38.49	319.88	$8,735.88	19.69 B	$537.71 B
2008-Jan-23	48.51 M	$42.19	340.27	$11,024.05	16.51 B	$534.74 B
2008-Oct-10	82.91 M	$30.30	273.32	$6,329.04	22.66 B	$524.76 B
2008-Sep-16	62.94 M	$37.94	323.74	$8,217.81	20.38 B	$517.21 B

Interestingly September 18, 2008 is the highest total traded day but the 3rd highest in terms of number of transactions. In terms of total volume, it is 97th, way down the list. This clearly requires further investigation.

Number of trades by Symbol

This time we need to group by symbol instead of date. Symbol is represented as factor with integer support and uses the same space as Date. The query is:

The running times are fgrokit: 238.6s and grokit: 711.8s. These times are about the same as the Date experiments. The jump from 2300 to 60,000 groups does not seem to have any influence (everything is still CPU bound but the CPU usage is about the same).

The result consists of 60758 rows, one for each of the symbols that appear in the data. Notice that in the query we still used View() since GrokIt Visualizer has no trouble dealing with results of this size (I'll have a separate blob post on this issue). The top 10 results by number of transactions are:

Symbol	count	avgPrice	avgSize	avgCost	totalVol	total
SPY	839.53 M	$123.17	480.62	$59,947.39	403.49 B	$50.33 T
BAC	420.32 M	$16.31	732.49	$10,080.15	307.88 B	$4.24 T
C	347.03 M	$20.83	1.14 K	$10,628.76	396.42 B	$3.69 T
MSFT	316.31 M	$27.60	463.16	$12,705.05	146.50 B	$4.02 T
IWM	310.02 M	$73.01	446.18	$33,008.40	138.33 B	$10.23 T
AAPL	294.54 M	$236.23	193.74	$41,417.26	57.06 B	$12.20 T
INTC	288.38 M	$21.35	484.98	$10,425.26	139.86 B	$3.01 T
JPM	279.98 M	$39.23	266.06	$10,491.78	74.49 B	$2.94 T
WFC	268.17 M	$28.41	288.39	$8,280.18	77.34 B	$2.22 T
CSCO	266.36 M	$21.22	474.71	$10,079.06	126.44 B	$2.68 T

The set of symbols that make it to top 10 is not surprising. The number of individual transactions is. Most of this information is not usually reported by financial data analysts and aggregates are hard to come by. We only needed 4 minutes on fgrokit to tabulate this information.

Adventures in Large Finance Data Processing

Saturday, October 11, 2014

Basic statistics of NANEX data

Number of trades by type

Number of trades by date

Number of trades by Symbol

No comments:

Post a Comment