This looks like a very interesting paper that takes the rare approach of actually trying to understand what all the cool new language models are doing at a fundamental level.
Does anyone with more knowledge of the relevant mathematics (group theory and so on) care to chime in?
This paper is a very good advertisement for Krohn-Rhodes theory, which shows how automata decompose into simpler automata. I think it's a somewhat obscure topic within math (among people who aren't semigroup theorists), so I was happy to be exposed to it.
It's a bit shocking that they got Transformers to actually learn the theoretical low-depth algorithms for simulating automata. Looking closer at their results, though, the parts I would intuitively expect to be hard to learn (parity, for example) turn out to be fairly brittle.
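For anyone wondering what "low depth" means here, this is a toy sketch of the general idea, not the paper's construction and not a Krohn-Rhodes decomposition. Running an automaton is the same as composing one transition function per input symbol, and composition is associative, so the functions can be combined pairwise in a tree of logarithmic depth instead of one step at a time. The parity automaton and all the names below (`transition`, `compose`, `run_sequential`, `run_tree`) are my own choices for illustration.

```python
# Parity automaton: states {0, 1}. Reading a 1 flips the state, reading a 0 keeps it.
# A transition function is stored as a tuple t, where t[s] is the next state from s.
IDENTITY = (0, 1)
FLIP = (1, 0)


def transition(symbol):
    """Transition function induced by a single input symbol."""
    return FLIP if symbol == 1 else IDENTITY


def compose(f, g):
    """Transition function for 'apply f first, then g'."""
    return tuple(g[f[s]] for s in range(len(f)))


def run_sequential(bits, start=0):
    """Step through the input one symbol at a time (depth grows linearly)."""
    state = start
    for b in bits:
        state = transition(b)[state]
    return state


def run_tree(bits, start=0):
    """Combine adjacent transition functions pairwise, layer by layer.

    Returns (final_state, number_of_layers). Order is preserved, so this
    also works for automata whose transition functions do not commute.
    """
    layer = [transition(b) for b in bits]
    if not layer:
        return start, 0
    depth = 0
    while len(layer) > 1:
        nxt = [compose(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            nxt.append(layer[-1])  # odd element out is carried up unchanged
        layer = nxt
        depth += 1
    return layer[0][start], depth
```

Both functions agree on the final state, but `run_tree` needs only about log2(n) layers for n symbols, and every composition within a layer is independent of the others. That is the kind of parallel shortcut a fixed-depth model could in principle exploit. Whether it is learned robustly is a separate question.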