Fixes to backend to support matmul
Includes enough backend fixes to support matmul.
Interestingly, this did not require adding a pass that coalesces forks or sequentializes outer forks. Technically, matmul still generates invalid schedule IR: the reduce loop turns into a reduction variable that does not dominate its use in the initialization of a reduction variable. When this is lowered to LLVM IR, however, the reduction variable gets pulled into a phi at the top of the loop, so the resulting LLVM IR is actually valid. This is a nice coincidence, but in general it won't work once we try to lower to parallel fork-joins. So fork coalescing and sequentialization at the Hercules IR level will still be necessary in the long term.
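For readers less familiar with the loop shape being discussed, here is a minimal sketch in plain Rust. It is not Hercules IR and not the backend's actual output. The function name `matmul`, the row-major layout, and the parameter names are assumptions made for illustration only. It shows a sequential matmul loop nest in which the `k` loop carries a reduction variable, which is the kind of loop-carried value that ends up as a phi at the loop header once the code is in SSA form.

```rust
/// Illustrative row-major matmul: `a` is n x m, `b` is m x p, result is n x p.
/// Hypothetical helper, not part of the Hercules code base.
fn matmul(a: &[f32], b: &[f32], n: usize, m: usize, p: usize) -> Vec<f32> {
    assert_eq!(a.len(), n * m);
    assert_eq!(b.len(), m * p);
    let mut c = vec![0.0f32; n * p];
    // The i and j loops have no loop-carried dependence between iterations.
    for i in 0..n {
        for j in 0..p {
            // Reduction variable for the k loop. In SSA form, a value like
            // this is typically represented as a phi at the loop header that
            // merges the initial value with the previous iteration's value.
            let mut acc = 0.0f32;
            for k in 0..m {
                acc += a[i * m + k] * b[k * p + j];
            }
            c[i * p + j] = acc;
        }
    }
    c
}
```

In a sequential lowering like this one, the accumulator's initial value is available at the top of the `k` loop on every entry. A parallel fork-join lowering would not have a single loop header to play that role, which is consistent with the caveat above.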