The data
NANEX is a financial data feed integrator based in Chicago that provides both live and historical feeds from all major financial exchanges. My department bought historical data for the 3 major US exchanges (NYSE, NASDAQ and AMEX) covering 9.5 years (Jan 1, 2005 - May 31, 2014) at individual trade and quote granularity. The data comes in a NANEX proprietary format and is about 4TB in size (at an estimated 90% compression ratio). This is one of the largest financial data sets available, with 56.8 billion trades and an estimated 1 trillion quotes. Since both trades and quotes carry a 25ms time stamp, the data provides unique insight into the inner workings of the stock market, especially in its potential to shed light on millisecond trading.
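To make the granularity concrete, here is a minimal sketch, in R (the language GrokIt exposes through its bindings), of the kind of per-trade record the feed decodes into. The column names are my own illustration, not the actual NANEX schema:

```r
# Toy illustration of a decoded trade record; field names are hypothetical,
# not the actual NANEX proprietary format.
trades <- data.frame(
  symbol   = c("AAPL", "AAPL"),
  time_ms  = c(34200025, 34200050),  # ms since midnight; 25 ms resolution
  exchange = c("NASDAQ", "NASDAQ"),
  price    = c(32.15, 32.16),        # trade price in USD
  size     = c(100, 200)             # shares traded
)
```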
The systems
Both the NANEX data and the systems I will use in this blog post were purchased with performance funding provided by the State of Florida. The systems are available to students in the CISE department to get hands-on experience with large data processing (as part of large data classes and extra work). Two machines are dedicated to processing this data: grokit.cise.ufl.edu and fgrokit.cise.ufl.edu. Each machine has:
- 4 AMD Opteron 6376 processors with 16 cores each for a total of 64 cores
- 512 GB main memory
- 24 disks
- InfiniBand and Gigabit networking
The software
Starting in 2009, together with Chris Jermaine (then at UF, now at Rice), I developed DataPath, a high-performance database system. Since 2013, DataPath has been further developed by Tera Insights, a startup that I co-founded, and it is now known as GrokIt. GrokIt has R-language bindings and a web interface that allows query editing, monitoring, and result visualization. Since GrokIt is much more convenient to use, it has been made available to CISE students and faculty for large data processing.
The purpose
This blog will have two main purposes. The main goal is to find interesting things about the financial markets, with a particular focus on aggregate knowledge rather than point knowledge. I hope that the findings will encourage researchers in general, and students in the CISE department who have access to the data, to look for more. A secondary goal is to provide a large number of examples of how analysis can be performed on the data using Tera Insights' GrokIt, examples that can be used as a starting point for further exploration. The examples will be accompanied by performance numbers and notes on how to code efficiently.
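As a taste of the kind of aggregate question I have in mind, here is a toy example in plain base R: a per-symbol volume-weighted average price (VWAP) over a small in-memory sample. The real queries in later posts will go through GrokIt's R bindings and run over the full data; this is only an illustration of the shape of the analysis, not the GrokIt API.

```r
# Illustrative only: per-symbol VWAP on a toy in-memory sample.
trades <- data.frame(
  symbol = c("AAPL", "AAPL", "IBM"),
  price  = c(32.15, 32.16, 95.40),   # USD
  size   = c(100, 200, 300)          # shares
)
vwap_by_symbol <- sapply(split(trades, trades$symbol),
                         function(t) sum(t$price * t$size) / sum(t$size))
print(vwap_by_symbol)
```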
One of the points I hope to prove repeatedly is that, with the right tools, large data is not the exclusive realm of large clusters with thousands of cores and drives. All the queries and data analytics I report will run on a single machine (either grokit or fgrokit).
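A quick back-of-envelope calculation suggests why a single machine suffices. Assuming each of the 24 disks sustains roughly 100MB/s of sequential reads (my assumption, not a measured figure), a full scan of the 4TB compressed data set takes on the order of half an hour:

```r
# Back-of-envelope scan time; the per-disk throughput is an assumption.
disks        <- 24
mb_per_sec   <- 100                    # assumed sequential read rate per disk
data_mb      <- 4 * 1e6                # ~4 TB of compressed data, in MB
scan_minutes <- data_mb / (disks * mb_per_sec) / 60
print(scan_minutes)                    # roughly 28 minutes per full pass
```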
For the CISE students
If you have access to grokit or fgrokit, feel free to validate the code in these blog posts and use it as a starting point. Please post comments with your experience and questions. Tera Insights hosts a forum at forum.terainsights.net where you can ask questions about how to use GrokIt if you run into trouble.