LLM Architecture Scales

A commonly held belief is that reasoning is locked behind larger models: more layers and wider layers supposedly let a model track more relationships in text. But this explanation feels hand-wavy. Take multi-hop reasoning, where answering a question requires chaining several facts (for instance, "Who is the mother of the author of Hamlet?" is a 2-hop question). If a 2-hop problem needs 10 layers, does a 4-hop problem need 20? If the reasoning problem is more complex, do you need wider layers? As far as I know, none of these questions have good answers yet. The sketch below makes the hop-counting intuition concrete.
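To see why depth might track hop count, here is a minimal Python sketch. It is a toy analogy, not a transformer: the `relations` dictionary and the entity names are invented for illustration, and the assumption that one layer composes roughly one hop is exactly the unproven intuition being questioned above.

```python
# Toy analogy: model each reasoning "hop" as one dictionary lookup,
# standing in for one layer's worth of information composition.
# The relation chain below is illustrative, not a real dataset.
relations = {
    "Hamlet": "Shakespeare",      # work -> author   (hop 1)
    "Shakespeare": "Mary Arden",  # author -> mother (hop 2)
}

def resolve(entity: str, hops: int) -> str:
    """Follow the relation chain `hops` times; each hop is one sequential step."""
    for _ in range(hops):
        entity = relations[entity]
    return entity

# A 2-hop query needs two sequential lookups; a 4-hop query would need four.
# If each lookup maps to a layer, depth scales linearly with hop count,
# which is the (unverified) assumption behind "bigger models reason better".
print(resolve("Hamlet", hops=2))  # -> "Mary Arden"
```

The lookups are inherently sequential because each one depends on the previous result, which is why, under this analogy, hops could not be parallelized away by simply making layers wider.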
