From what I understand, the key contribution lies in its cost efficiency during training. However, is this referring to the whole training (including pre-training) phase, or just the reinforcement learning stage?
Additionally, it seems that the cost savings primarily come from how the reward signal is computed rather than from a learned reward model. The paper mentions two examples: for math problems, the model's output is checked against a fixed reference answer to produce the reward, and for LeetCode problems, they rely on a compiler to verify the generated code.
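If I understand this correctly, such a rule-based reward could be sketched roughly as follows. This is only my illustration of the idea, not the paper's implementation: the function names are made up, and I use Python's built-in `compile` as a stand-in for the compiler check (the paper's actual verifier presumably also runs test cases):

```python
def math_reward(model_answer: str, reference_answer: str) -> float:
    """Reward 1.0 if the model's final answer matches the fixed reference.
    A real verifier would likely normalize formats (fractions, LaTeX, etc.)."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(source: str) -> float:
    """Reward 1.0 if the submitted program at least compiles.
    Here a Python syntax check stands in for invoking a compiler."""
    try:
        compile(source, "<submission>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0
```

The appeal, as I read it, is that neither check requires a separate learned reward model, which is where the cost savings would come from.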
However, these two examples cover only a narrow set of problem types with automatically verifiable answers. Not all logical challenges fall under math or coding. Can a model trained mainly on math and coding problems generalize well to other types of logical reasoning tasks?