| Attribute | Type |
|---|---|
| Type | Category |
| Symbol | Factor |
| MsOfDay | Integer |
| Date | Date |
| ListedExg | Category |
| ReportingExg | Category |
| Price | Float |
| TradeCondition | Category |
| ConditionFlags | Category |
| BATECode | Category |
| Size | Integer |
As a reminder from the previous post, I am running experiments on grokit and fgrokit, two servers with 64 cores, 512 GB of main memory and 24 disks. The difference is that 16 of fgrokit's 24 disks are SSDs. SSDs have no seek time, so jumping over columns does not significantly affect read speed.
Number of trades by type
The following program computes the number of trades by type.
Line 4 accesses the data, lines 5-9 specify a grouping on Type with a Count() aggregation, and line 11 pushes the result to the front end. The result of the query is:
| Type | cnt |
|---|---|
| Equity | 56,831,509,648 |
| Bond | 75,955 |
In short, the data contains 56.8 billion equity trades and about 76 thousand bond trades. All other types were filtered out at data loading time. The performance is the following:
| Server | Time | Tuple Speed | Read Speed | Cpu Load |
|---|---|---|---|---|
| grokit | 122.1 s | 520 MT/s | 520 MB/s | 20.5/64 |
| fgrokit | 43.7 s | 1600 MT/s | 1600 MB/s | 63.5/64 |
Interestingly, grokit is almost 3 times slower than fgrokit. In fact, as you will see later, fgrokit is reading data far below its maximum rate; it seems to be on the border of being CPU bound (63.5 of 64 cores busy). grokit is far from CPU bound, using only about 20 of its 64 cores. Notice that fgrokit processes data at an insane 1.6 billion tuples/second. The 520 million tuples/second achieved by grokit is impressive as well. This is, of course, possible only with skinny columns like Type (1 byte).
It is worth mentioning that the GrokIt framework has a significant overhead to run queries. The actual tuple processing time for fgrokit is 57 GB / 1.6 GB/s ≈ 35.6 s out of 43.7 s total execution time, which leaves roughly 8 s of overhead.
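The overhead estimate can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch using the figures quoted in the text; the helper name `split_runtime` is made up for illustration and is not part of GrokIt. With these inputs the remainder comes out to roughly 8 s.

```python
# Figures quoted in the text for the Type query on fgrokit.
data_read_gb = 57.0      # Type is 1 byte per tuple, so ~57 billion tuples is ~57 GB
read_speed_gb_s = 1.6    # sustained read speed reported for fgrokit
total_time_s = 43.7      # total execution time reported for fgrokit


def split_runtime(data_gb, speed_gb_s, total_s):
    """Return (tuple processing seconds, remaining overhead seconds)."""
    processing = data_gb / speed_gb_s
    return processing, total_s - processing


processing_s, overhead_s = split_runtime(data_read_gb, read_speed_gb_s, total_time_s)
print(f"processing: {processing_s:.1f} s, overhead: {overhead_s:.1f} s")
```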
Number of trades by date
To compute the number of trades by date, we can use:
This is similar to the previous query but the grouping is by date. The performance results we obtain are:
| Server | Time | Tuple Speed | Read Speed | Cpu Load |
|---|---|---|---|---|
| grokit | 239.5 s | 250 MT/s | 1000 MB/s | 11/64 |
| fgrokit | 67.9 s | 950 MT/s | 3800 MB/s | 43.5/64 |
This time fgrokit reads data at 3.8 GB/s, which allows it to pump and process 950 million tuples/second. grokit achieves only about 1/4 of that speed, but it also uses only about 1/4 of the CPUs (11 cores versus 43.5). Single-column queries like this are a problem for the grokit server since the disk heads have to jump a lot (hence the smallish 1.0 GB/s read speed).
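Tuple speed and read speed are tied together by the width of the columns being scanned. The sketch below divides one by the other for the two single-column queries so far, using only the numbers in the tables above. The 1-byte width of Type is stated in the text; the 4-byte width of Date is what the ratio suggests, not something the source states directly.

```python
# (server, query) -> (tuple speed in MT/s, read speed in MB/s), copied from the tables.
measurements = {
    ("grokit", "by Type"): (520, 520),
    ("fgrokit", "by Type"): (1600, 1600),
    ("grokit", "by Date"): (250, 1000),
    ("fgrokit", "by Date"): (950, 3800),
}


def bytes_per_tuple(tuple_speed_mt_s, read_speed_mb_s):
    """MB/s divided by MT/s gives the bytes read per tuple."""
    return read_speed_mb_s / tuple_speed_mt_s


widths = {key: bytes_per_tuple(*speeds) for key, speeds in measurements.items()}
for (server, query), width in widths.items():
    print(f"{server:8s} {query:8s} {width:.1f} bytes/tuple")
```

Both servers give the same width for the same query, which is what one would expect if the ratio only reflects the column layout and not the hardware.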
To see how the query behaves with more columns, we extend it to also aggregate over Size and Price. The performance results are:
| Server | Time | Tuple Speed | Read Speed | Cpu Load |
|---|---|---|---|---|
| grokit | 762 s | 80 MT/s | 1200 MB/s | 11/64 |
| fgrokit | 227.6 s | 250 MT/s | 4000 MB/s | 37.8/64 |
The read speed of both grokit and fgrokit is slightly higher. Even though 3 columns are accessed (Date, Size, Price), fgrokit still maintains a processing rate of 250 million tuples/second, and it is not CPU bound. The top 10 results sorted by number of trades are:
| Date | count | avgPrice | avgSize | avgCost | totalVol | total |
|---|---|---|---|---|---|---|
| 2008-Oct-10 | 82.91 M | $30.30 | 273.32 | $6,329.04 | 22.66 B | $524.76 B |
| 2011-Aug-08 | 74.73 M | $37.85 | 297 | $7,735.54 | 22.19 B | $578.04 B |
| 2008-Sep-18 | 73.79 M | $37.69 | 302.69 | $8,742.44 | 22.34 B | $645.08 B |
| 2011-Aug-09 | 70.11 M | $37.49 | 291.92 | $8,019.87 | 20.47 B | $562.25 B |
| 2011-Aug-05 | 69.22 M | $38.62 | 285.11 | $7,890.37 | 19.74 B | $546.21 B |
| 2011-Aug-10 | 67.08 M | $36.92 | 289.68 | $7,408.78 | 19.43 B | $496.97 B |
| 2010-May-07 | 66.29 M | $35.20 | 398.98 | $7,787.90 | 26.45 B | $516.24 B |
| 2008-Oct-08 | 66.03 M | $33.74 | 270.21 | $7,105.08 | 17.84 B | $469.15 B |
| 2010-May-06 | 65.95 M | $35.25 | 450.53 | $8,190.20 | 29.71 B | $540.13 B |
| 2008-Oct-16 | 64.72 M | $31.25 | 266.13 | $6,566.62 | 17.22 B | $425.00 B |
and sorted by total traded value (the total column) are:
| Date | count | avgPrice | avgSize | avgCost | totalVol | total |
|---|---|---|---|---|---|---|
| 2008-Sep-18 | 73.79 M | $37.69 | 302.69 | $8,742.44 | 22.34 B | $645.08 B |
| 2011-Aug-08 | 74.73 M | $37.85 | 297 | $7,735.54 | 22.19 B | $578.04 B |
| 2011-Aug-09 | 70.11 M | $37.49 | 291.92 | $8,019.87 | 20.47 B | $562.25 B |
| 2008-Sep-19 | 51.57 M | $40.18 | 360.29 | $10,760.68 | 18.58 B | $554.90 B |
| 2011-Aug-05 | 69.22 M | $38.62 | 285.11 | $7,890.37 | 19.74 B | $546.21 B |
| 2010-May-06 | 65.95 M | $35.25 | 450.53 | $8,190.20 | 29.71 B | $540.13 B |
| 2008-Sep-17 | 61.55 M | $38.49 | 319.88 | $8,735.88 | 19.69 B | $537.71 B |
| 2008-Jan-23 | 48.51 M | $42.19 | 340.27 | $11,024.05 | 16.51 B | $534.74 B |
| 2008-Oct-10 | 82.91 M | $30.30 | 273.32 | $6,329.04 | 22.66 B | $524.76 B |
| 2008-Sep-16 | 62.94 M | $37.94 | 323.74 | $8,217.81 | 20.38 B | $517.21 B |
Interestingly, September 18, 2008 is the day with the highest total traded value but only the 3rd highest in terms of number of transactions. In terms of total volume, it is 97th, way down the list. This clearly requires further investigation.
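For readers who want to see what the per-date aggregates amount to, here is a small plain-Python sketch. It is not GrokIt query syntax, and the three sample trades are made up. The column definitions are inferred from the numbers in the tables (avgSize × count ≈ totalVol and avgCost × count ≈ total), so treat them as an assumption rather than the exact aggregates used in the original query.

```python
from collections import defaultdict

# Made-up trades as (date, price, size); not taken from the real data set.
trades = [
    ("2008-09-18", 10.0, 100),
    ("2008-09-18", 20.0, 300),
    ("2008-10-10", 5.0, 200),
]


def summarize(rows):
    """Group trades by date and compute the columns shown in the tables."""
    acc = defaultdict(lambda: {"count": 0, "price_sum": 0.0, "totalVol": 0, "total": 0.0})
    for date, price, size in rows:
        a = acc[date]
        a["count"] += 1
        a["price_sum"] += price
        a["totalVol"] += size          # number of shares traded
        a["total"] += price * size     # dollar value traded
    out = {}
    for date, a in acc.items():
        n = a["count"]
        out[date] = {
            "count": n,
            "avgPrice": a["price_sum"] / n,
            "avgSize": a["totalVol"] / n,
            "avgCost": a["total"] / n,
            "totalVol": a["totalVol"],
            "total": a["total"],
        }
    return out


summary = summarize(trades)
# The two rankings used in the tables: by number of trades and by dollar value.
by_count = sorted(summary, key=lambda d: summary[d]["count"], reverse=True)
by_total = sorted(summary, key=lambda d: summary[d]["total"], reverse=True)
print(by_count, by_total)
```

A single pass with running sums is enough for all six columns, which fits the observation that the three-column query only needs to read Date, Size and Price.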
Number of trades by Symbol
This time we need to group by symbol instead of date. Symbol is represented as a factor with integer support and uses the same space as Date. The query is:
The running times are 238.6 s on fgrokit and 711.8 s on grokit. These times are about the same as in the Date experiments. The jump from 2,300 to 60,000 groups does not seem to have any influence (everything is still CPU bound but the CPU usage is about the same).
The result consists of 60,758 rows, one for each of the symbols that appear in the data. Notice that in the query we still used View() since the GrokIt Visualizer has no trouble dealing with results of this size (I'll have a separate blog post on this issue). The top 10 results by number of transactions are:
| Symbol | count | avgPrice | avgSize | avgCost | totalVol | total |
|---|---|---|---|---|---|---|
| SPY | 839.53 M | $123.17 | 480.62 | $59,947.39 | 403.49 B | $50.33 T |
| BAC | 420.32 M | $16.31 | 732.49 | $10,080.15 | 307.88 B | $4.24 T |
| C | 347.03 M | $20.83 | 1.14 K | $10,628.76 | 396.42 B | $3.69 T |
| MSFT | 316.31 M | $27.60 | 463.16 | $12,705.05 | 146.50 B | $4.02 T |
| IWM | 310.02 M | $73.01 | 446.18 | $33,008.40 | 138.33 B | $10.23 T |
| AAPL | 294.54 M | $236.23 | 193.74 | $41,417.26 | 57.06 B | $12.20 T |
| INTC | 288.38 M | $21.35 | 484.98 | $10,425.26 | 139.86 B | $3.01 T |
| JPM | 279.98 M | $39.23 | 266.06 | $10,491.78 | 74.49 B | $2.94 T |
| WFC | 268.17 M | $28.41 | 288.39 | $8,280.18 | 77.34 B | $2.22 T |
| CSCO | 266.36 M | $21.22 | 474.71 | $10,079.06 | 126.44 B | $2.68 T |
The set of symbols that make it to top 10 is not surprising. The number of individual transactions is. Most of this information is not usually reported by financial data analysts and aggregates are hard to come by. We only needed 4 minutes on fgrokit to tabulate this information.