When I started researching Oracle OLAP, I looked into a number of websites and blogs to see if any benchmark load-performance figures for Oracle OLAP in 10g were available, and I ended up without any luck. To be honest, we were a little worried about performance, specifically load performance, before the production deployment, because our performance environment was not big enough for full-scale performance testing with live-like data volumes.
I would not claim that the example I am showing here is the best possible implementation; there are a number of areas which can be improved, and we are still working on them. However, I believe the stats presented here will give at least some idea to OLAP enthusiasts and newbies.
Here are some details on the hardware and environment:
Hardware Model : x86_64
Operating system version : Red Hat Linux 5.3
Oracle Database Version : 10.2.0.4
Oracle Parameter settings related to OLAP:
----------------------------------------------------
SGA_TARGET = 4 GB
SGA_MAX_SIZE = 4 GB
PGA_AGGREGATE_TARGET = 1 GB
sessions = 280
parallel_max_servers = 235
PGA memory management is set to auto mode, so OLAP_PAGE_POOL_SIZE is automatically managed by Oracle.
Tablespace Information
Temp tablespace: I created a separate 10 GB temporary tablespace specifically for OLAP use.
Undo tablespace: 30 GB
The total number of dimensions in the data model is 15, of which 5 are common to two subject areas. None of the measures is defined over the entire set of 15 dimensions, so we decided to build two separate cubes for the specific subject areas, one with 10 dimensions and the other with 8. Here I am providing details only on the bigger cube, the one with 10 dimensions.
The distinct member count of each dimension is given below (just a select distinct on each of these dimensions):
ACTIVITY CATEGORY -- 8
ACTIVITY PRIORITY -- 6
ACTIVITY STATUS -- 9
ACTIVITY TYPE -- 115
DATE DIMENSION -- 1462
HOUR DIMENSION -- 25
MFU DIMENSION -- 4
PRODUCT DIMENSION -- 96596
QUEUE DIMENSION -- 8
USER DIMENSION -- 20000
I have not yet added up the levels in each of these; even so, the total number of cells comes to around 112,255,618,037,760,000,000 (~112,255 thousand trillion). That's a huge number.
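That cell count is just the product of the dimension cardinalities listed above, which a quick script can verify:

```python
import math

# Distinct member counts for the 10 dimensions, taken from the list above
member_counts = [8, 6, 9, 115, 1462, 25, 4, 96596, 8, 20000]

# Total logical cells = product of all dimension cardinalities
total_cells = math.prod(member_counts)

print(f"{total_cells:,}")  # 112,255,618,037,760,000,000
```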
One point worth noting: date and hour are basically parts of the same time dimension, and some of the activity dimensions (category, priority, status) could probably have been clubbed into a single dimension. When I suggested that, the design and development phase was already over and the deployment window was very near, so I had no option but to go with it. Not a great situation from an architect's point of view, but there I was; we all face problems due to various factors outside our control.
Coming back to our sizing calculation: we used the NUMBER datatype, and almost all measure values are in the range of 1 to 1,000,000 and mostly positive, so assuming 3 bytes on average, the total size of the cube would be ~306,287,669 TB if all cells were populated (we will discuss that later). We cannot even think of having that much storage for a single cube, and I have not summed up the individual levels yet.
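The same back-of-the-envelope arithmetic in code, under the stated 3-bytes-per-cell assumption:

```python
total_cells = 112_255_618_037_760_000_000   # cell count computed earlier
bytes_per_cell = 3                          # assumed average for small positive NUMBER values

size_bytes = total_cells * bytes_per_cell
size_tb = size_bytes // 2**40               # binary terabytes, rounded down

print(f"~{size_tb:,} TB")  # ~306,287,669 TB
```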
All cells are never populated in reality; in our case it was a tiny fraction of the total cell count. Let's now have a look at our fact table to see for which dimension members we actually have data. I used simple queries like the one demonstrated below for each of the dimensions:
select count(distinct a.product_key)
from dm_activity_fact a, dm_product_dimension_v b
where a.product_key = b.prod_key;
And the counts are much lower, as expected:
ACTIVITY CATEGORY -- 8
ACTIVITY PRIORITY -- 6
ACTIVITY STATUS -- 9
ACTIVITY TYPE -- 115
DATE DIMENSION -- 277
HOUR DIMENSION -- 21
MFU DIMENSION -- 4
PRODUCT DIMENSION -- 75
QUEUE DIMENSION -- 7
USER DIMENSION -- 377
The count of these cells is now 228,792,242,952,000, and the cube size using the same process is ~624 TB.
A good reduction (about 0.0002% of the earlier predicted size), but still a long way to go; 624 TB is also big for a single cube, and we do not even have half of that storage in our data center.
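The populated-cell count, the ~624 TB figure, and the reduction ratio can all be re-derived the same way:

```python
import math

# Distinct member counts actually present in the fact table, per dimension
populated = [8, 6, 9, 115, 277, 21, 4, 75, 7, 377]

populated_cells = math.prod(populated)
print(f"{populated_cells:,}")                  # 228,792,242,952,000

# Same 3-bytes-per-cell assumption as before
size_tb = populated_cells * 3 / 2**40
print(f"~{size_tb:,.0f} TB")                   # ~624 TB

# Fraction of the fully-populated cube's cell count
total_cells = 112_255_618_037_760_000_000
print(f"{populated_cells / total_cells:.6%}")  # 0.000204%
```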
Let's have a detailed look into the data now. A single activity (a record in the fact table) can have only one product associated with it; similarly, it can have only one particular category, priority, or status. This by itself does not reduce the cell count of the cube, but consider the fact that there may be a number of products with very few activities, and the chances are even lower that those activities are related to all the other dimensions. Effectively, most measures, when defined over all the dimensions, will have very few cells populated. This is one type of sparsity in the data, and it is handled very effectively in Oracle.
When the dimensions are marked as sparse in AWM, Oracle creates an internal structure known as a "composite". It basically maintains a list of all the sparse dimension value combinations where there is data. The empty combinations are not kept in physical storage, greatly reducing the storage need. Whenever a new non-empty combination arrives, the composite gets updated with the new combination. This is my interpretation of how it works; it is not clearly written in the Oracle documents, but conceptually it has some similarity to the way an Oracle B-Tree index is maintained.
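My mental model of a composite can be sketched as a tiny Python analogy. This is purely illustrative and not how Oracle stores composites internally; the point is simply that only the non-empty sparse-dimension combinations occupy storage:

```python
class CompositeSketch:
    """Illustrative analogy only: keep just the non-empty sparse combinations."""

    def __init__(self):
        # (product, date, queue, ...) tuple -> measure value
        self.cells = {}

    def set_cell(self, combination, value):
        # A brand-new non-empty combination simply extends the structure
        self.cells[combination] = value

    def get_cell(self, combination):
        # Empty combinations cost no storage; reads just come back empty
        return self.cells.get(combination)


cube = CompositeSketch()
cube.set_cell(("PROD_1", "2009-01-05", "QUEUE_A"), 42)

print(cube.get_cell(("PROD_1", "2009-01-05", "QUEUE_A")))  # 42
print(cube.get_cell(("PROD_2", "2009-01-05", "QUEUE_A")))  # None
print(len(cube.cells))  # 1 -> only the populated combination is stored
```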
We defined all our dimensions as sparse, including time, and we also used compression; this is sometimes referred to as compressed composites. However, compression in this case is not similar to binary (RMAN-type) or symbolic compression (the popular WinZip type); it is explained quite nicely in the Oracle OLAP DML Reference:
"In some cases, when you aggregate data in a variable dimensioned by a composite defined with one or more hierarchical dimension, one parent node may have only one descendant node — and so on all the way up to the top level. When a variable has a good deal of this type of sparsity, use a compressed composite as the dimension of the variable. Dimensioning this type of variable with a compressed composite creates the smallest possible variable, composite, and composite index—much smaller than if you dimension a variable with a b-tree or hash composite."
So now you can realise the power of Oracle OLAP's sparsity handling mechanism: the total on-disk size of the cube is only 1 GB. Yes, that's only 1 GB.
OK, so what about the load performance then? We did both a full refresh and, after that, an incremental refresh of the cube. The full refresh took around 40 minutes to finish, and the incremental refresh (for one week's data only) took 5 minutes. Almost all the report queries finished in sub-seconds, as expected from a standard OLAP implementation.