Monday, March 25, 2013

On the way to Multiple Stream/Relation Join

Few months back I was given the responsibility to Join multiple relations based on the equality of certain fields.
My research shows that database engines like MySQL and Oracles perform the join of multiple relations by joining two relations at a time. This method did not seem appropriate to me. I needed to join all the relations simultaneously.

I have been following Hash Join Algorithm. Most the papers show that with the has join algorithm, I need to have all the tuples from all the relations complete. However, my data source is stream type. I will not have complete data set in the beginning. I will be having the data tuples in continuous manner and also at different time I will get tuples of different relation. This was a bit challenge to me.

With some tweaks to Hash Join Algorithm, I managed to tackle with the incomplete data set at beginning.

Working for more than 5 months, I experimented with lots of alternatives. Now I stand with a working application that can theoretically join any number of relations. However as lots of articles say, Hash Join is both CPU and memory intensive. And therefore, practically it is difficult to join any number of relations.

Then in past few weeks I have working to fit the problem for 3 streams/relations.

The performance yet depends on the query type and join conditions.

At one point, I could join 400K X 400K X 400K tuples in around 35 seconds with result of around 12000.
Of course if there a large join results the performance could yet degrade.

I am still on my way to improve the performance.