Why AI language models choke on too much text
Attention works by comparing each token in the prompt to every other token, which means the total computing power required for attention grows quadratically with the total number of tokens. Suppose a 10-token prompt requires 414,720 attention operations. Then:

- Processing a 100-token prompt will require 45.6 million attention operations.
- Processing a 1,000-token prompt will require 4.6 billion attention operations.
- Processing a 10,000-token prompt will require 460 billion attention operations.
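As a sanity check on these figures, here is a minimal sketch of the arithmetic. It assumes a GPT-3-scale model with 96 layers and 96 attention heads (an assumption, not stated in the excerpt) and counts one "operation" per pair of tokens compared, per head, per layer; with causal attention, a prompt of n tokens produces n(n-1)/2 such pairs.

```python
# A minimal sketch of the arithmetic behind the numbers above.
# Assumptions (not from the excerpt): a GPT-3-scale model with
# 96 layers and 96 attention heads, counting one "attention
# operation" per (token, earlier-token) pair, per head, per layer.

LAYERS = 96
HEADS = 96

def attention_ops(n_tokens: int) -> int:
    """Total attention operations for a prompt of n_tokens tokens."""
    # With causal attention, each token attends to every earlier token,
    # giving n * (n - 1) / 2 pairs in total.
    pairs = n_tokens * (n_tokens - 1) // 2
    return LAYERS * HEADS * pairs

for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} tokens -> {attention_ops(n):,} attention operations")

# Output:
#     10 tokens -> 414,720 attention operations
#    100 tokens -> 45,619,200 attention operations
#   1000 tokens -> 4,603,392,000 attention operations
#  10000 tokens -> 460,753,920,000 attention operations
```

Because the comparison count grows as n(n-1)/2, every tenfold increase in prompt length multiplies the attention work by roughly a hundredfold, which is why long prompts become expensive so quickly.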