Parse Irregular table

Notice this complex table. Many of the row header cells on the left, such as “finite forms”, “past”, and “future”, have a “rowSpan” greater than 1.

If I wanted to parse a simple table that has no rowSpan or colSpan attributes, I would simply iterate over table.rows, use each cell's index within its row as its column index, and determine which categories the cells belong to based on that.
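For illustration, here is a minimal sketch of that simple case. It assumes the first row holds the column categories and the first cell of every other row holds the row category; the name `parseSimpleTable` is made up for this example. It only reads `rows`, `cells`, and `textContent`, which an `HTMLTableElement` provides.

```javascript
// Sketch for a table WITHOUT rowSpan/colSpan.
// Assumption: row 0 is the column header row, and cell 0 of each
// remaining row is that row's category.
function parseSimpleTable(table) {
  // Turn the table into a plain 2D array of trimmed cell texts.
  const rows = Array.from(table.rows, (row) =>
    Array.from(row.cells, (cell) => cell.textContent.trim())
  );
  const [header, ...body] = rows;
  const result = [];
  body.forEach((cells, r) => {
    // Without spans, a cell's index in its row is its column index.
    cells.slice(1).forEach((content, i) => {
      result.push({
        content,
        row: r + 1,       // 1-based, counting inner cells only
        column: i + 1,
        rowCategories: [cells[0]],
        columnCategories: [header[i + 1]],
      });
    });
  });
  return result;
}
```

This breaks as soon as a cell spans several rows or columns, because the index of a cell in `row.cells` no longer matches its visual column.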

The rowSpan and colSpan attributes of this table make things more complex. If I get table.rows, the row at the beginning of “past” contains “past”, then “first”, then the next word (in this case דַּשְׁתִּי‎).

If that pattern repeated for the rows after it, that would be fine, because I could determine which category and sub-category each cell's row belonged to. In this case, the word דַּשְׁתִּי‎ (and דַּשְׁנוּ‎ on the same row) would fall under the row categories “past” and “first”, and each would have its own column categories: plural [m, f] for דַּשְׁנוּ‎ and singular [m, f] for דַּשְׁתִּי‎.

However, with rowSpan set to 3 on the “past” cell, the next row (the one under “first”) starts with “second” as the first element of its cells array, even though the “past” cell really precedes it in the HTML (and in the way I want to parse the table in JavaScript).
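One way to picture the problem is to expand the table into a rectangular grid in which every spanned slot points back at the cell that covers it. The following is only a sketch under stated assumptions: `expandGrid` is a hypothetical name, and a `rowSpan` of 0 (which in HTML means "span to the end of the row group") is treated as 1 for simplicity.

```javascript
// Expand rowSpan/colSpan into a 2D grid of cell references.
// Works on anything shaped like an HTMLTableElement: `rows`, `cells`,
// and optional `rowSpan` / `colSpan` on each cell.
function expandGrid(table) {
  const grid = [];
  Array.from(table.rows).forEach((row, r) => {
    grid[r] = grid[r] || [];
    let c = 0;
    Array.from(row.cells).forEach((cell) => {
      // Skip slots already claimed by a cell spanning down from above.
      while (grid[r][c] !== undefined) c++;
      const rowSpan = cell.rowSpan || 1; // simplification: 0 becomes 1
      const colSpan = cell.colSpan || 1;
      // Claim every slot this cell covers.
      for (let dr = 0; dr < rowSpan; dr++) {
        grid[r + dr] = grid[r + dr] || [];
        for (let dc = 0; dc < colSpan; dc++) {
          grid[r + dr][c + dc] = cell;
        }
      }
      c += colSpan;
    });
  });
  return grid;
}
```

With such a grid, the row that starts with “second” in table.rows would have the “past” cell at index 0 and “second” at index 1, so every row has the same shape.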

To simplify the question:

I need to parse a complex table, in JavaScript, which may contain rowSpan and colSpan attributes.

My goal is to get the inner cells of the table and their corresponding row and column categories. In this case there are two categories each for rows and columns (skipping the “non-finite forms” row at the top). I want the result as an array of objects, where each object represents an inner cell and contains meta information about the row categories and column categories that cell belongs to.

In our example of the word דַּשְׁתִּי‎, the result should be some kind of JavaScript object (in an array) like this:

content: דַּשְׁתִּי‎

row: 1 (assuming we start counting from the first inner word after categories start)

column: 1

rowCategories: [past, first]

columnCategories: [singular, [m,f]]

For the next word, the result would be almost the same except the columnCategories would be [plural, [m, f]].

And so too for all of the other inner cells.

Ideally, I am looking for a function that takes arguments specifying how many row and column categories there are (in this case, 2 each), so that it can dynamically parse a table with any number of them.
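To make the desired signature concrete, here is a rough sketch of such a function. It is not a definitive implementation: the name `parseTable` and its parameters `rowCategoryCount` and `colCategoryCount` are hypothetical, and it assumes the first `colCategoryCount` rows are column headers and the first `rowCategoryCount` columns are row headers. Any extra rows, such as “non-finite forms”, would have to be removed beforehand.

```javascript
function parseTable(table, rowCategoryCount, colCategoryCount) {
  // Step 1: expand rowSpan/colSpan into a grid of cell references.
  const grid = [];
  Array.from(table.rows).forEach((row, r) => {
    grid[r] = grid[r] || [];
    let c = 0;
    Array.from(row.cells).forEach((cell) => {
      while (grid[r][c] !== undefined) c++; // slot taken by a span from above
      const rowSpan = cell.rowSpan || 1;    // simplification: 0 becomes 1
      const colSpan = cell.colSpan || 1;
      for (let dr = 0; dr < rowSpan; dr++) {
        grid[r + dr] = grid[r + dr] || [];
        for (let dc = 0; dc < colSpan; dc++) grid[r + dr][c + dc] = cell;
      }
      c += colSpan;
    });
  });

  const text = (cell) => cell.textContent.trim();
  // Distinct header cells only, so a header spanning two levels is listed once.
  const labels = (cells) => [...new Set(cells.filter(Boolean))].map(text);

  // Step 2: walk the inner region and look up headers by grid position.
  const seen = new Set(); // emit an inner cell once even if it spans slots
  const result = [];
  for (let r = colCategoryCount; r < grid.length; r++) {
    for (let c = rowCategoryCount; c < grid[r].length; c++) {
      const cell = grid[r][c];
      if (!cell || seen.has(cell)) continue;
      seen.add(cell);
      result.push({
        content: text(cell),
        row: r - colCategoryCount + 1,    // 1-based among inner cells
        column: c - rowCategoryCount + 1,
        rowCategories: labels(grid[r].slice(0, rowCategoryCount)),
        columnCategories: labels(
          grid.slice(0, colCategoryCount).map((headerRow) => headerRow[c])
        ),
      });
    }
  }
  return result;
}
```

In a browser this would be called as something like `parseTable(document.querySelector('table'), 2, 2)`. The categories come back as the header cells' text, so a sub-category such as [m, f] would appear as a string rather than a nested array.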

I have been trying to figure this out, but I can't work out how to parse the table so that the row categories and column categories are represented properly and can be cross-referenced, given the rowSpan and colSpan attributes.