Given 2 arrays which represent list of HTML elements in old HTML and new HTML respectively, I need to find the diff between the 2 arrays.
What I am doing currently:
- Create a 2D min cost matrix using dynamic programming using Levenshtein distance
- Traverse the matrix from bottom right to top left and find the path
But this does not always results in a optimal path (mostly because I am taking locally optimal decisions that prove to be sub optimal globally). So, I want to know what is the way to approach this?
I tried dijkstra’s algorithm but then the path it produces is the shortest path (sum of all the costs) but then it may or may not be true.
Here is how my code looks like:
creation of matrix
row = old.length+1;
column = current.length+1;
diff = []
function createMatrix() {
matrix = new Array(row);
for (var i = 0; i < row; i++) {
matrix[i] = new Array(column);
}
//fill the 0th row and column
for (var i = 0; i < row; i++) {
matrix[i][0] = i;
}
for (var j = 0; j < column; j += 1) {
matrix[0][j] = j;
}
//calculate min operations
for (var i = 1; i < row; i++) {
for (var j = 1; j < column; j++) {
var a = old[i - offset];
var b = current[j - offset];
var subDistance = (a.hashVal == b.hashVal) ? 0 : 10e6;
var costReplace = matrix[i - 1][j - 1] + subDistance * a.elementCount + subDistance * b.elementCount;
var costRemoved = matrix[i - 1][j] + a.elementCount;
var costInserted = matrix[i][j - 1] + b.elementCount;
matrix[i][j] = Math.min(costReplace, costRemoved, costInserted);
}
}
}
function that finds the path to create the diff:
function traceBack(){
var i,j;
for(i=row-1,j = column-1;i>0&&j>0;){
var top = matrix[i-1][j];
var topLeft = matrix[i-1][j-1];
var left = matrix[i][j-1];
if(topLeft== matrix[i][j] && topLeft <= top && topLeft <=left ){
current[j-offset].status = 'unchanged';
diff.unshift(current[j-offset]);
i--;
j--;
}
else if(top < matrix[i][j] ){
(old[i-offset]).status = 'removed';
diff.unshift(old[i-offset]);
i--;
}
else if (left < matrix[i][j] ){
(current[j-offset]).status ='inserted';
diff.unshift(current[j-offset]);
j--;
}
}
while(i>0){
(old[i-offset]).status ='removed';
diff.unshift(old[i-offset]);
i--;
}
while(j>0){
(current[j-offset]).status = 'inserted';
diff.unshift(current[j-offset]);
j--;
}
}
Few points:
- Input HTML strings are transformed to DOM object using htmlparser library
- Some basic pre-processing is done on the DOM objects
- Objects are flattened into a single list (children chain, flattened out)
- createMatrix is called ..
