Calculating KV Cache Memory Cost for an LLM
Let's work through the KV cache size calculation step by step.

For each token, the model caches one key vector and one value vector per attention head in every transformer layer. Each vector has as many elements as the head dimension, which is the hidden size divided by the number of attention heads.
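Putting that together, the per-token KV cache cost under standard multi-head attention is:
= 2 (key + value) * layers * heads per layer * head dimension * bytes per element
= 2 * layers * hidden size * bytes per element (since heads * head dimension = hidden size)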
Take Mistral 7B as an example:
Mistral 7B has 32 layers, 32 attention heads per layer, and a hidden size of 4096, giving a head dimension of 128 (4096 / 32). The per-token KV cache memory can then be calculated as:
Size of one vector = head dimension = 128 elements
No. of key vectors per token = no. of value vectors per token = layers * attention heads
= 32 * 32
= 1024 vectors
Total no. of vectors per token = key vectors + value vectors
= 1024 + 1024
= 2048 vectors
Say the model runs at FP16 precision (i.e. 2 bytes per element),
then the key and value vectors in the KV cache per token will cost:
= size of one vector * total no. of vectors per token * bytes per element
= 128 * 2048 * 2 bytes
= 524,288 bytes = 512 KiB
So, storing the key and value vectors for one token in the KV cache of Mistral 7B costs 512 KiB at FP16 (16-bit floating-point) precision.
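The same arithmetic as a minimal Python sketch (the variable names are mine; the values are the Mistral 7B numbers from above):

```python
# Per-token KV cache size, assuming standard multi-head attention
# (one key and one value vector per head, per layer).
n_layers = 32        # transformer layers
n_heads = 32         # attention heads per layer
hidden_size = 4096   # model hidden size
bytes_per_elem = 2   # FP16 = 2 bytes per element

head_dim = hidden_size // n_heads           # 4096 / 32 = 128
vectors_per_token = 2 * n_layers * n_heads  # keys + values = 2048
bytes_per_token = vectors_per_token * head_dim * bytes_per_elem

print(bytes_per_token)         # 524288 bytes
print(bytes_per_token / 1024)  # 512.0 KiB
```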
For a sequence length of 1K tokens, which is very common, the size of the KV cache will be:
= 1000 * 512 KiB = 500 MiB.
So even though Mistral 7B is a small model, its KV cache needs 500 MiB of memory for a context length of just 1K tokens, and the cost grows linearly with sequence length and batch size.
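To see how this scales, here is a small, hypothetical helper (name and defaults are mine, using the Mistral 7B values assumed above) that extends the per-token arithmetic to a whole sequence:

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes for one sequence, assuming standard
    multi-head attention (one K and one V vector per head, per layer)."""
    return n_tokens * 2 * n_layers * n_heads * head_dim * bytes_per_elem

print(kv_cache_bytes(1000) / 2**20)    # 500.0 MiB for a 1K-token context
print(kv_cache_bytes(32_000) / 2**30)  # ~15.6 GiB for a 32K-token context
```

One caveat: Mistral 7B actually uses grouped-query attention with 8 key/value heads rather than 32, which shrinks the real cache by 4x (to 128 KiB per token, or 125 MiB for 1K tokens); the calculation above assumes standard multi-head attention for simplicity.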