
To find the absolute minimum you just multiply the number of parameters by the bits per parameter, then divide by 8 if you want bytes. For example, 8 billion parameters at 4 bits each means "at least 4 billion bytes". For a back-of-the-napkin estimate, add ~20% overhead to that (it really depends on your context setup and a few other things, but that's a good swag to start with) and then add whatever memory the base operating system is going to be using in the background.
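As a rough sketch of that arithmetic (the function name and the `os_bytes` parameter are made up for illustration, and the 20% figure is just the rule of thumb above, not a measured value):

```python
def estimate_memory_bytes(n_params, bits_per_param, overhead=0.20, os_bytes=0):
    """Back-of-the-napkin memory estimate for running a model.

    overhead is the ~20% rule of thumb from the comment above; the real
    number depends on context length and runtime, so treat it as a guess.
    os_bytes is whatever the operating system already uses in the background.
    """
    weight_bytes = n_params * bits_per_param / 8  # bits -> bytes
    return weight_bytes * (1 + overhead) + os_bytes


# 8 billion parameters at 4 bits each
floor = estimate_memory_bytes(8e9, 4, overhead=0.0)  # weights only
padded = estimate_memory_bytes(8e9, 4)               # with ~20% overhead
print(f"weights only: {floor / 1e9:.1f} GB, with overhead: {padded / 1e9:.1f} GB")
```

Pass the operating system's own usage as `os_bytes` if you want a single total for the machine.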

Extra tidbits to keep in mind:

- A bits-per-parameter higher than the model was trained at adds nothing (other than compatibility on certain accelerators), but a bits-per-parameter lower than the model was trained at degrades the quality.

- Different models may be trained at different bits-per-parameter. E.g. the 671 billion parameter DeepSeek R1 (full) was trained at fp8, while the 405 billion parameter Llama 3.1 was trained and released at a higher parameter width, so "full quality" benchmark results for DeepSeek R1 require less memory than Llama 3.1 even though R1 has more total parameters.

- Lower quantizations will tend to run proportionally faster if you are memory bandwidth bound, and that can be a reason to lower the quality even if you can fit the larger version of a model into memory (such as in this demonstration).
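To put rough numbers on the last two points, here is a small sketch. The 16-bit width for Llama 3.1 is an assumption (the comment only says "higher"), the 100 GB/s bandwidth is a made-up figure, and the speed bound assumes a dense model where every weight is read once per generated token, so it is a ceiling rather than a prediction.

```python
def weight_bytes(n_params, bits_per_param):
    """Minimum bytes needed just to hold the weights."""
    return n_params * bits_per_param / 8


def max_tokens_per_second(model_bytes, bandwidth_bytes_per_s):
    """Upper bound when memory-bandwidth bound: assumes each generated
    token requires reading all model_bytes from memory once."""
    return bandwidth_bytes_per_s / model_bytes


# "Full quality" weight footprint: more parameters, but fewer bits each.
r1_bytes = weight_bytes(671e9, 8)      # DeepSeek R1 at fp8
llama_bytes = weight_bytes(405e9, 16)  # Llama 3.1 405B, ASSUMED 16-bit
print(f"R1: {r1_bytes / 1e9:.0f} GB, Llama 3.1 405B: {llama_bytes / 1e9:.0f} GB")

# Same 8B dense model at two widths on a hypothetical 100 GB/s machine.
bandwidth = 100e9  # hypothetical figure, bytes per second
for bits in (16, 4):
    rate = max_tokens_per_second(weight_bytes(8e9, bits), bandwidth)
    print(f"{bits}-bit: at most {rate:.2f} tokens/s")
```

Under those assumptions, going from 16 bits to 4 bits cuts the bytes read per token by 4x, which is where the "proportionally faster" comes from.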



Thank you. So F16 would be 16 bits per weight, and F32 would be 32? Next question, if you don't mind: what are the tradeoffs in choosing between a model with more parameters quantized to smaller values vs. a full-precision model with fewer parameters? My current understanding is to prefer smaller quantized models over larger full-precision ones.



