added a comment -
26/Sep/08 12:58 PM The issues stems from the fact that the code assumes N tables will always be connected through N-1 connections. However, intermediate tables may really be only to connect tables that the user is actually interested in which changes the problem to needing > N-1 connections.
I started with a crazy model that incorporated 15 tables. Several of them were outer joins to the same table. I ran into the getShortestPath bug and I created a new solution. It is enclosed entirely withing the org.pentaho.pms.mql.graph package. It uses a basic AC3-like arc consistency algorithm to allow find the best path quickly and efficiently. I then continued to add support to allow the user to choose between one of five different techniques for generating a path:
1) All - use all tables and joins
2) Any Relevant - use all joins that lead to any used table regardless if there is more than one path
3) First - returns the first path that only contains that contains no redundant joins between tables
4) Shortest - returns a path connecting every required table with the smallest number of joins
5) Best Score - returns a path that chooses all the joins that will create the smallest score based on the estimated size of a table
After this I went on and included some code from the trunk to use outer joins and discovered some bugs in the DefaultSQLDialect.java class when making assumptions about how to build inner and outer joins.
Then I added another property to the BusinessTable that is a where clause filter which gets added any time the table is used. This was necessary for me because I needed to import the same table multiple times with a different selection of data in each table.
I also enhanced to the PMSFormula.java class to make the resolving of columns in formulas more consistent. I was finding that the class used one method during validation and another during actual code generation so I changed it to resolve as follows:
* 1) <BUSINESS TABLE ID>.<BUSINESS COLUMN ID>
* 2) <BUSINESS CATEGORY ID>.<BUSINESS COLUMN ID>
* 3) <PHYSICAL TABLE>.<PHYSICAL COLUMN>
* 4) <PHYSICAL TABLE ID>.<PHYSICAL COLUMN ID>
* 5) <BUSINESS COLUMN ID>
* 6) <PHYSICAL COLUMN>
* 7) <PHYSICAL COLUMN ID>
I would hope that Pentaho will use all of these enhancements, but at the very least it should look at the getShortestPath changes and the DefaultSQLDialect changes. The get shortest path changes will work for the latest code in the 1.7 branch and the trunk, but the other files are more appropriate for the trunk given the use of outer joins.
-Nick