Friday, October 10, 2014

Blog Desiderata and Outlook

First, welcome to my first blog post. I am Alin Dobra, an Associate Professor in the Computer and Information Science and Engineering (CISE) Department at the University of Florida. I have many interests (approximate query processing, probabilistic network analysis, data mining, etc.), but this blog will focus on large data processing, specifically analysis of NANEX financial data.


The data

NANEX is a financial data feed integrator based in Chicago that provides both live and historical feeds from all major financial exchanges. My department bought historical data on the 3 major US exchanges (NYSE, NASDAQ and AMEX) covering 9.5 years (Jan 1, 2005 - May 31, 2014) at individual trade and quote granularity. The data comes in a NANEX proprietary format and is about 4TB in size (at an estimated 90% compression ratio). This is some of the largest financial data available, with 56.8 billion trades and an estimated 1 trillion quotes. Since both trades and quotes carry 25ms timestamps, the data provides unique insight into the inner workings of the stock market, especially in its potential to shed light on millisecond trading.
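To put these figures in perspective, here is a back-of-envelope sketch based only on the numbers quoted above. It assumes roughly 252 trading days per year, and reads "90% compression ratio" as the compressed data being 10% of its original size; both are my assumptions, not NANEX specifications.

```python
# Back-of-envelope estimates from the figures quoted above.
# Assumptions: ~252 trading days/year; "90% compression" means the
# compressed data occupies 10% of its uncompressed size.

compressed_tb = 4            # on-disk size of the NANEX dump, in TB
compression = 0.90           # estimated compression ratio
trades = 56.8e9              # individual trades in the data set
years = 9.5                  # Jan 1, 2005 - May 31, 2014
trading_days = 252 * years   # assumed ~252 trading days per year

uncompressed_tb = compressed_tb / (1 - compression)
trades_per_day = trades / trading_days

print(f"~{uncompressed_tb:.0f} TB uncompressed")
print(f"~{trades_per_day / 1e6:.0f} million trades per trading day")
```

Under these assumptions the raw feed would be on the order of 40 TB uncompressed, averaging over 20 million trades per trading day, which is why specialized tooling matters here.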


The systems

Both the NANEX data and the systems I will use in this blog were purchased as part of the performance funding provided by the State of Florida. The systems are available to students in the CISE department to get hands-on experience with large data processing (as part of large data classes and extra work). Two machines are dedicated to processing this data: grokit.cise.ufl.edu and fgrokit.cise.ufl.edu. Each machine has:
  1. 4 AMD Opteron 6376 processors with 16 cores each for a total of 64 cores
  2. 512 GB main memory
  3. 24 disks
  4. Infiniband and Gigabit networking
The fgrokit machine has sixteen 1TB Samsung EVO SSDs and eight 4TB WD spindle drives; the grokit machine has twenty-four 4TB WD drives. As we will see in subsequent posts, the SSDs on fgrokit provide up to 3X the performance.


The software

Starting in 2009, together with Chris Jermaine (then at UF, now at Rice), I developed DataPath, a high-performance database system. Since 2013, DataPath has been further developed by Tera Insights, a startup that I co-founded, and is now known as GrokIt. GrokIt has R-language bindings and a web interface that supports editing, monitoring and result visualization. Since GrokIt is so much more convenient to use, it has been made available to CISE students and faculty for large data processing.


The purpose

This blog has two main purposes. The primary goal is to find interesting things about the financial markets, with a particular focus on aggregate knowledge rather than point knowledge. I hope the findings will encourage researchers in general, and students in the CISE department who have access to the data, to look for more. A secondary goal is to provide a large number of examples of how analysis can be performed on the data using Tera Insights's GrokIt, examples that can serve as a starting point for further exploration. The examples will be accompanied by performance numbers and notes on how to write efficient code.

One of the points I hope to prove repeatedly is that, with the right tools, large data is not the exclusive realm of large clusters with thousands of cores and drives. All the queries and data analytics I will report will run on a single machine (either grokit or fgrokit).


For the CISE students

If you have access to grokit or fgrokit, feel free to validate the code in these blog posts and use it as a starting point. Please post comments with your experience and questions. Tera Insights hosts a forum at forum.terainsights.net where you can ask questions about using GrokIt if you run into trouble.
