The test loss is an estimate based on the Cerebras-GPT scaling law. It supports a relative comparison of the loss the models would reach if trained from scratch with the same hyperparameters and dataset, so it may not match the figures published in the original papers.
The test loss serves as an estimation of the pre-training performance. Downstream performance may vary, even when the models achieve the same test loss.
Loss is a very sensitive metric. A 5% change in loss is a large difference, roughly what doubling the model size typically yields.
The pre-populated models supply only the parameter count and dataset size. Other hyperparameters are not taken into account.
The loss curve shown is a simulation that assumes an ideal learning rate at specific FLOP levels. In a real training run the curve may be steeper, but it should ultimately reach the same final loss.
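As a rough illustration of how an estimate like this can be derived from only a parameter count and a dataset size, here is a minimal sketch. It assumes a compute-based power law of the form loss(C) = (C / C_REF)^(-ALPHA) + L_INF and the common rule of thumb that training compute is about 6 * parameters * tokens. The constants `C_REF`, `ALPHA`, and `L_INF` and all function names are hypothetical placeholders, not the published Cerebras-GPT fit and not necessarily what produces the numbers described above.

```python
# Hypothetical placeholder constants for a power law of the form
#   loss(C) = (C / C_REF) ** (-ALPHA) + L_INF
# These are NOT the published Cerebras-GPT values; substitute the real fit.
C_REF = 1.0e22   # reference compute in FLOPs (placeholder)
ALPHA = 0.07     # power-law exponent (placeholder)
L_INF = 0.5      # irreducible loss term (placeholder)


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


def estimated_loss(n_params: float, n_tokens: float,
                   c_ref: float = C_REF, alpha: float = ALPHA,
                   l_inf: float = L_INF) -> float:
    """Estimate test loss from model size and dataset size alone."""
    flops = training_flops(n_params, n_tokens)
    return (flops / c_ref) ** (-alpha) + l_inf


def relative_change(loss_a: float, loss_b: float) -> float:
    """Fractional change in loss going from model A to model B."""
    return (loss_b - loss_a) / loss_a


if __name__ == "__main__":
    # Two hypothetical models; the second has twice the parameters and tokens.
    small = estimated_loss(1.3e9, 26e9)
    large = estimated_loss(2.6e9, 52e9)
    print(f"small: {small:.4f}  large: {large:.4f}")
    print(f"relative change: {relative_change(small, large):+.2%}")
```

Because only the parameter count and token count enter the formula, two models with the same compute get the same estimate regardless of their other hyperparameters, which is why the result is best read as a relative comparison rather than a prediction of published numbers.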