Thursday, October 23, 2014

Fine-grained analysis of stocks

In this blog post I will investigate the sub-second behavior of stocks. The NANEX data has a 25 millisecond clock resolution -- this is the finest time granularity we can examine in the data.

The code is virtually identical to the trades-by-second query in the previous post. What was left unexplained there is the presence of the mysterious Segmenter function. The Segmenter deserves a blog post of its own but, in principle, it allows the GroupBy to scale to a large number of groups. In this case, there are 1,881,910,846 groups that are further processed by the OrderBy clause and reduced to 200,000 tuples. The reduction is needed to be able to visualize the result, but GrokIt itself has no trouble dealing with results of this size.
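
The GrokIt code itself appears as a screenshot in the original post and is not reproduced here. Purely as a sketch of the logic (and emphatically not GrokIt code), here is an equivalent in plain R with dplyr, assuming the trades were available in a data frame named nanex_trades (which, at 56.8 billion rows, they could not be in practice); the Segmenter has no analogue in the sketch since it is a GrokIt-internal scaling mechanism:

  library(dplyr)

  # Sketch only: count trades per (Date, 25 ms time stamp) and keep the
  # 200,000 busiest intervals. In GrokIt this is a Segmenter-wrapped GroupBy
  # followed by an OrderBy; here it is plain dplyr, for illustration.
  top_25ms <- nanex_trades %>%
    group_by(Date, MsOfDay) %>%                # MsOfDay already has 25 ms resolution
    summarise(cnt = n(), .groups = "drop") %>%
    slice_max(cnt, n = 200000)                 # keep the 200,000 largest counts
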
This time, instead of average performance numbers, I am showing the actual performance graphs that GrokIt displays. The execution on fgrokit produces the following performance graphs:


A number of interesting things can be noticed during the execution:
  1. The execution is chaotic in terms of both CPU usage and tuple throughput. This is a harsh query that requires a large amount of memory allocation and traffic, which results in large fluctuations throughout the execution. 
  2. GrokIt cannot keep up with the IO (i.e. the disk is too fast) in this query. There are a large number of chunk drops, as can be seen from the "Chunks Dropped" graph. Chunks are the unit of storage and processing in GrokIt and contain 2 million tuples.
  3. GrokIt does not read data at full speed (4 GB/s on fgrokit). This is due to the large interference between the chaotic memory access of the query processing and the IO.
  4. Despite the obvious struggling, fgrokit finishes the query in 233s. In this time, 56.8 billion tuples were grouped into 1.8 billion groups and the top 200,000 tuples were then extracted. 
  5. There is probably room for improvement for GrokIt since the execution looks wasteful (too many chunk drops). 
In terms of the actual result, the outcome is fascinating. The top 10 tuples are:

Date        | Millisecond | cnt
2009-Nov-23 | 35,998,000  | 14,740
2010-Jan-26 | 35,998,950  | 14,715
2010-May-07 | 44,417,850  | 12,377
2009-Dec-01 | 36,000,750  | 12,066
2010-May-05 | 34,319,350  | 11,698
2010-Feb-08 | 56,486,475  | 11,682
2010-Jan-29 | 56,661,350  | 11,656
2009-Nov-04 | 53,561,975  | 10,735
2010-Jan-29 | 53,907,275  | 10,591
2010-May-07 | 45,911,500  | 9,226

The top entry has 14,740 trades in a single 25ms interval. If sustained for a full second, this would correspond to 589,600 trades/second, almost 6 times larger than the largest per-second rate we saw in the previous blog post. Consulting the per-second results again, for November 23, 2009 at second 35,998 (the second into which the top 25ms interval falls), the total number of trades is 38,512 and its rank among the top seconds is 539. This means that almost 40% of the trades in this second happened in a single 25ms interval -- this reaffirms the idea of flash trading, possible only with algorithmic trading. Looking at the second-highest 25ms interval, the total trades in the corresponding second are 19,195, ranked 9,573 among the top seconds. In this case, about 75% of all trading in the second happens in this one 25ms interval.
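
For reference, the arithmetic behind these numbers is simply:

  # there are 1000 / 25 = 40 intervals of 25 ms in a second
  14740 * 40      # 589,600 trades/s if the top 25 ms rate were sustained
  14740 / 38512   # ~0.38: share of that second's trades in one 25 ms interval
  14715 / 19195   # ~0.77: same share for the second-ranked 25 ms interval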

Thursday, October 16, 2014

Most active trading seconds

In this post I investigate the hottest (i.e. highest number of trades) trading seconds in the 9.5-year history covered by the NANEX data, in the hope of discovering interesting facts.

Let's start with the basic query that finds the 200,000 most active (i.e. with most executed trades) seconds. In GrokIt, the query is simply:


In the above query, MsOfDay %/% 1000 converts milliseconds into seconds (%/% is the integer division operator). The final division by 3600 converts the second into an hour-as-a-double so the result is easier to read. The query is similar to the queries in the previous blog post, but the number of groups is much larger: 87,872,904 (determined by running a query that ends with a Count instead of an OrderBy).
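
The GrokIt query above is shown as a screenshot in the original post. As a rough rendering of the same logic in plain R/dplyr -- an illustration only, assuming the trades fit in a data frame nanex_trades, which they do not at this scale:

  library(dplyr)

  # Count trades per (Date, second of day), expose the second as a decimal
  # hour, and keep the 200,000 most active seconds.
  top_seconds <- nanex_trades %>%
    group_by(Date, Hour = (MsOfDay %/% 1000) / 3600) %>%  # second -> hour-as-a-double
    summarise(cnt = n(), .groups = "drop") %>%
    slice_max(cnt, n = 200000)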

The running time for the query on the two machines is:

Server  | Time    | Tuple Speed | Read Speed | CPU Load
grokit  | 387.6 s | 150 MT/s    | 1200 MB/s  | 14/64
fgrokit | 127.8 s | 500 MT/s    | 4000 MB/s  | 50/64

This is in line with the results we have seen in previous blog posts. Notice that on fgrokit, the query takes just 2 minutes.

The top 10 rows in the result are:

Date        | Hour (dec) | cnt
2014-Mar-21 | 15.75      | 106,063
2011-Sep-01 | 10         | 99,914
2014-Mar-21 | 15.83      | 95,796
2014-Mar-21 | 15.92      | 90,809
2014-May-30 | 15.75      | 89,582
2013-Dec-20 | 15.92      | 87,102
2013-Dec-20 | 15.75      | 85,679
2011-Nov-29 | 10         | 82,537
2012-Jun-22 | 15.92      | 82,002
2012-Jun-22 | 15.75      | 81,267

The first 10 rows already reveal two interesting facts about this data. First, the most active second contains more than 100,000 trades. This gives a sense of what the peak load on financial trading systems might be; it is hard to guess this number, yet the query readily provides the answer. Second, the hottest seconds seem to be clustered around 10 AM and 3:45-4 PM.

To shed more light on this, we plot the distribution of the hottest 1,000 seconds against the hour of the day in which they occur:



We can immediately see the peak at 10 AM and the very large peak at 4 PM. This strongly suggests algorithmic trading behavior: acquire positions at the beginning of the day and clear them before the day ends.

Just for comparison, here is how the graph looks when all 200,000 hottest seconds are included:


In fact, 80,000 of the 200,000 hottest seconds fall between 3:30 and 4 PM. That is by far the busiest part of the trading day.

Just as a point of comparison, here is the distribution of the number of trades as a function of the time of day:

This immediately raises the question of what happens at the millisecond level. We'll see that in a future blog post.

Saturday, October 11, 2014

Basic statistics of NANEX data

In this post I will show basic statistics over the NANEX trades. This is really just a warmup for the more interesting analysis to follow. The trades data is in the relation nanex_trades on both grokit and fgrokit. The schema of the relation is:


Attribute      | Type
Type           | Category
Symbol         | Factor
MsOfDay        | Integer
Date           | Date
ListedExg      | Category
ReportingExg   | Category
Price          | Float
TradeCondition | Category
ConditionFlags | Category
BATECode       | Category
Size           | Integer

As a reminder from the previous post, I am running experiments on grokit and fgrokit, two servers with 64 cores, 512GB of main memory and 24 disks each. The difference is that fgrokit has 16 SSDs (out of the 24 disks total). SSDs have no seek time, so jumping between columns does not significantly affect read speed.

Number of trades by type

The following program computes the number of trades by type. Line 4 accesses the data, lines 5-9 specify a grouping on Type with a Count() aggregation, and line 11 pushes the result to the front end.
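
The GrokIt program itself is a screenshot in the original post (the line numbers above refer to it). As a sketch of the logic only -- not the actual GrokIt code -- the same computation in plain R/dplyr, assuming an in-memory data frame nanex_trades, would be:

  library(dplyr)

  # Count the number of trades for each Type (Equity vs. Bond).
  trades_by_type <- nanex_trades %>%
    group_by(Type) %>%
    summarise(cnt = n())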

The result of the query is:

Type   | cnt
Equity | 56,831,509,648
Bond   | 75,955

Basically, the data contains 56.8 billion equity trades and 76 thousand bond trades. All other trade types were filtered out at data loading time. The performance is the following:

Server  | Time    | Tuple Speed | Read Speed | CPU Load
grokit  | 122.1 s | 520 MT/s    | 520 MB/s   | 20.5/64
fgrokit | 43.7 s  | 1600 MT/s   | 1600 MB/s  | 63.5/64

Interestingly, grokit is almost 3 times slower than fgrokit. In fact, as you will see later, fgrokit is reading data far below its maximum rate. fgrokit seems to be on the border of being CPU bound; grokit is far below that (using only about 20 of its 64 cores for processing). Notice that fgrokit processes data at an insane 1.6 billion tuples/second. The 520 million tuples/second achieved by grokit is impressive as well. This is of course possible only with skinny columns like Type (1 byte).

It is worth mentioning that the GrokIt framework has a noticeable overhead for running queries. The actual tuple processing time on fgrokit is 57 GB / 1.6 GB/s = 35.6 s out of the 43.7 s total execution time (roughly 8 s of overhead).

Number of trades by date

To compute the number of trades by date, we use a query similar to the previous one, but grouping by Date instead of Type.
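
As before, the GrokIt code is a screenshot in the original post; a dplyr sketch of the logic, under the same assumptions as above, changes only the grouping attribute:

  library(dplyr)

  # Count the number of trades for each trading day.
  trades_by_date <- nanex_trades %>%
    group_by(Date) %>%
    summarise(cnt = n())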

The performance results we obtain are:

Server  | Time    | Tuple Speed | Read Speed | CPU Load
grokit  | 239.5 s | 250 MT/s    | 1000 MB/s  | 11/64
fgrokit | 67.9 s  | 950 MT/s    | 3800 MB/s  | 43.5/64

This time fgrokit reads data at 3.8 GB/s, which allows it to pump through and process 950 million tuples/second. grokit achieves only about a quarter of that speed, but it also uses only about a quarter of the CPU. One-column queries like this are a problem for the grokit server, since the disk heads have to jump around a lot (hence the modest 1.0 GB/s read speed).
To see how the query behaves with more columns, we extend it to compute several per-date aggregates over Price and Size.
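
The extended GrokIt query is again a screenshot in the original post. A dplyr sketch of what it computes is below; the exact aggregate definitions are my guesses from the result column names, not the original code:

  library(dplyr)

  # Per-day aggregates over Price and Size (aggregate definitions guessed
  # from the result column names).
  trades_by_date_ext <- nanex_trades %>%
    group_by(Date) %>%
    summarise(count    = n(),
              avgPrice = mean(Price),
              avgSize  = mean(Size),
              avgCost  = mean(Price * Size),     # average dollar value per trade
              totalVol = sum(Size),              # total shares traded
              total    = sum(Price * Size)) %>%  # total dollar value traded
    arrange(desc(count))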

The performance results are:

Server  | Time    | Tuple Speed | Read Speed | CPU Load
grokit  | 762 s   | 80 MT/s     | 1200 MB/s  | 11/64
fgrokit | 227.6 s | 250 MT/s    | 4000 MB/s  | 37.8/64

The read speed of both grokit and fgrokit is slightly higher. Even though 3 columns are accessed (Date, Size, Price), fgrokit still maintains a 250 million tuples/second processing rate and is not CPU bound. The top 10 results sorted by number of trades are:

Date        | count   | avgPrice | avgSize | avgCost   | totalVol | total
2008-Oct-10 | 82.91 M | $30.30   | 273.32  | $6,329.04 | 22.66 B  | $524.76 B
2011-Aug-08 | 74.73 M | $37.85   | 297     | $7,735.54 | 22.19 B  | $578.04 B
2008-Sep-18 | 73.79 M | $37.69   | 302.69  | $8,742.44 | 22.34 B  | $645.08 B
2011-Aug-09 | 70.11 M | $37.49   | 291.92  | $8,019.87 | 20.47 B  | $562.25 B
2011-Aug-05 | 69.22 M | $38.62   | 285.11  | $7,890.37 | 19.74 B  | $546.21 B
2011-Aug-10 | 67.08 M | $36.92   | 289.68  | $7,408.78 | 19.43 B  | $496.97 B
2010-May-07 | 66.29 M | $35.20   | 398.98  | $7,787.90 | 26.45 B  | $516.24 B
2008-Oct-08 | 66.03 M | $33.74   | 270.21  | $7,105.08 | 17.84 B  | $469.15 B
2010-May-06 | 65.95 M | $35.25   | 450.53  | $8,190.20 | 29.71 B  | $540.13 B
2008-Oct-16 | 64.72 M | $31.25   | 266.13  | $6,566.62 | 17.22 B  | $425.00 B

and sorted by the total value traded (the total column):

Date        | count   | avgPrice | avgSize | avgCost    | totalVol | total
2008-Sep-18 | 73.79 M | $37.69   | 302.69  | $8,742.44  | 22.34 B  | $645.08 B
2011-Aug-08 | 74.73 M | $37.85   | 297     | $7,735.54  | 22.19 B  | $578.04 B
2011-Aug-09 | 70.11 M | $37.49   | 291.92  | $8,019.87  | 20.47 B  | $562.25 B
2008-Sep-19 | 51.57 M | $40.18   | 360.29  | $10,760.68 | 18.58 B  | $554.90 B
2011-Aug-05 | 69.22 M | $38.62   | 285.11  | $7,890.37  | 19.74 B  | $546.21 B
2010-May-06 | 65.95 M | $35.25   | 450.53  | $8,190.20  | 29.71 B  | $540.13 B
2008-Sep-17 | 61.55 M | $38.49   | 319.88  | $8,735.88  | 19.69 B  | $537.71 B
2008-Jan-23 | 48.51 M | $42.19   | 340.27  | $11,024.05 | 16.51 B  | $534.74 B
2008-Oct-10 | 82.91 M | $30.30   | 273.32  | $6,329.04  | 22.66 B  | $524.76 B
2008-Sep-16 | 62.94 M | $37.94   | 323.74  | $8,217.81  | 20.38 B  | $517.21 B

Interestingly, September 18, 2008 has the highest total value traded but is only the 3rd highest in terms of number of transactions. In terms of total share volume, it is 97th, way down the list. This clearly requires further investigation.

Number of trades by Symbol

This time we group by Symbol instead of Date. Symbol is represented as a factor with integer support and uses the same space as Date.
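
The GrokIt query (a screenshot in the original post) is, judging from the result columns, the extended by-date query with Symbol replacing Date in the GroupBy. A dplyr sketch of that logic, under the same assumptions as before:

  library(dplyr)

  # Same aggregates as the by-date query, grouped by Symbol instead.
  trades_by_symbol <- nanex_trades %>%
    group_by(Symbol) %>%
    summarise(count    = n(),
              avgPrice = mean(Price),
              avgSize  = mean(Size),
              avgCost  = mean(Price * Size),
              totalVol = sum(Size),
              total    = sum(Price * Size))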

The running times are 238.6 s on fgrokit and 711.8 s on grokit. These times are about the same as in the Date experiments. The jump from roughly 2,300 to 60,000 groups does not seem to have any influence (the bottleneck does not change and the CPU usage is about the same).


The result consists of 60,758 rows, one for each symbol that appears in the data. Notice that the query still uses View(), since the GrokIt Visualizer has no trouble dealing with results of this size (I'll have a separate blog post on this issue). The top 10 results by number of transactions are:

Symbol | count    | avgPrice | avgSize | avgCost    | totalVol | total
SPY    | 839.53 M | $123.17  | 480.62  | $59,947.39 | 403.49 B | $50.33 T
BAC    | 420.32 M | $16.31   | 732.49  | $10,080.15 | 307.88 B | $4.24 T
C      | 347.03 M | $20.83   | 1.14 K  | $10,628.76 | 396.42 B | $3.69 T
MSFT   | 316.31 M | $27.60   | 463.16  | $12,705.05 | 146.50 B | $4.02 T
IWM    | 310.02 M | $73.01   | 446.18  | $33,008.40 | 138.33 B | $10.23 T
AAPL   | 294.54 M | $236.23  | 193.74  | $41,417.26 | 57.06 B  | $12.20 T
INTC   | 288.38 M | $21.35   | 484.98  | $10,425.26 | 139.86 B | $3.01 T
JPM    | 279.98 M | $39.23   | 266.06  | $10,491.78 | 74.49 B  | $2.94 T
WFC    | 268.17 M | $28.41   | 288.39  | $8,280.18  | 77.34 B  | $2.22 T
CSCO   | 266.36 M | $21.22   | 474.71  | $10,079.06 | 126.44 B | $2.68 T

The set of symbols that make it into the top 10 is not surprising; the number of individual transactions is. Most of this information is not usually reported by financial data analysts, and aggregates like these are hard to come by. We needed only 4 minutes on fgrokit to tabulate this information.

Friday, October 10, 2014

Blog Desiderata and Outlook

First, welcome to my first blog post. I am Alin Dobra, an Associate Professor in the Computer and Information Science and Engineering (CISE) Department at the University of Florida. I have many interests (approximate query processing, probabilistic network analysis, data mining, etc.) but this blog will focus on large data processing, specifically analysis of NANEX financial data.


The data

NANEX is a financial data feed integrator based in Chicago that provides both live and historical feeds from all major financial exchanges. My department bought historical data for the 3 major US exchanges (NYSE, NASDAQ and AMEX) covering 9.5 years (Jan 1, 2005 - May 31, 2014) at individual trade and quote granularity. The data comes in a proprietary NANEX format and is about 4TB in size (at an estimated 90% compression ratio). This is some of the largest financial data available, with 56.8 billion trades and an estimated 1 trillion quotes. Since both trades and quotes have a 25ms time stamp, the data provides unique insight into the inner workings of the stock market, especially in its potential to shed light on millisecond trading.


The systems

Both the NANEX data and the systems I will use in this blog were purchased as part of the performance funding provided by the State of Florida. The systems are available to students in the CISE department to get hands-on experience with large data processing (as part of large data classes and extra work). Two machines are dedicated to processing this data: grokit.cise.ufl.edu and fgrokit.cise.ufl.edu. Each of the machines has:
  1. 4 AMD Opteron 6376 processors with 16 cores each for a total of 64 cores
  2. 512 GB main memory
  3. 24 disks
  4. Infiniband and Gigabit networking
The fgrokit machine has 16 1TB Samsung EVO SSDs and 8 4TB WD spindle drives; the grokit machine has 24 4TB WD drives. The SSDs give fgrokit up to 3x the performance, as we will see in subsequent posts.


The software

Starting in 2009, together with Chris Jermaine (then at UF, now at Rice), I developed DataPath, a high performance database system. Since 2013, DataPath has been further developed by Tera Insights, a startup that I co-founded, and it is now known as GrokIt. GrokIt has R-language bindings and a web interface that allows query editing, execution monitoring and result visualization. Since GrokIt is so much more convenient to use, it has been made available to CISE students and faculty for large data processing. 


The purpose

This blog has two main purposes. The main goal is to find interesting things about the financial markets, with a particular focus on aggregate knowledge rather than point knowledge. I hope that the findings will encourage researchers in general, and students in the CISE department who have access to the data, to look for more. A secondary goal is to provide a large number of examples of how analysis can be performed on the data using Tera Insights' GrokIt, examples that can be used as a starting point for further exploration. The examples will be accompanied by performance numbers and notes on how to write efficient code. 

One of the points I hope I will repeatedly prove is that, with the right tools, large data is not in the exclusive realm of large clusters with thousands of cores and drives. All the queries and data analytics I will report will run on a single machine (either grokit or fgrokit). 


For the CISE students

If you have access to grokit or fgrokit, feel free to validate the code in these blog posts and use it as a starting point. Please post comments with your experience and questions. Tera Insights is hosting a forum at forum.terainsights.net where you can ask questions about how to use GrokIt if you run into trouble.