Do LLMs Learn Transferable Skills?

September, 2024

When an LLM is trained to be good at one task, will it also get better at other tasks? For example, if an LLM is trained to be good at formal logic, will it also be good at making informal logical arguments?

The most obvious way to investigate these questions would be to train several models on different training sets and test whether models trained for tasks in one domain can transfer skills to another. However, this would involve training many models, and would realistically mean training relatively small ones. Instead, I investigated these questions observationally by studying the population of large models that have been released to the public on platforms like HuggingFace.

Eval Correlations

The core question here is how performance on one task relates to performance on other tasks. For its LLM leaderboard, HuggingFace runs about 50 evaluations ("evals") on a library of over 1000 models, which makes it possible to study how performance on one eval relates to performance on another.

Comparing Pairs of Evals

The simplest way to visualize the relations between evals is with scatter plots, where each point represents a model, and the axes represent a pair of evals. Below are scatter plots showing the relationships between a few randomly selected evals.

Scatter plots showing the relationships between randomly selected pairs of evals. Each point represents an LLM, and the axes represent performance on two randomly chosen evals.
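As a rough sketch, plots like these can be produced with pandas and matplotlib. The snippet below assumes a hypothetical CSV file, leaderboard_scores.csv, with one row per model and one numeric column per eval; the real column names and file format will differ.

```python
# Sketch: scatter plots for randomly chosen pairs of evals.
# Assumes a hypothetical file "leaderboard_scores.csv" with one row per model
# and one numeric column per eval score.
import random

import matplotlib.pyplot as plt
import pandas as pd

scores = pd.read_csv("leaderboard_scores.csv", index_col=0)  # rows: models, columns: evals

random.seed(0)
pairs = [tuple(random.sample(list(scores.columns), 2)) for _ in range(8)]

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, (x_eval, y_eval) in zip(axes.flat, pairs):
    ax.scatter(scores[x_eval], scores[y_eval], s=10, alpha=0.5)  # one point per model
    ax.set_xlabel(x_eval)
    ax.set_ylabel(y_eval)
fig.tight_layout()
plt.show()
```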

Several of these show a generally positive correlation between evals (such as subfigures E, F, and G). Some show more complicated relationships: subfigure H suggests that being very good at "Salient Translation Error Detection" is helpful for "Temporal Sequences", but that being merely decent makes little difference. Still other pairs of evals show no clear relationship at all (such as subfigure A).

Correlation Matrices

Looking at scatter plots can give us a sense of the sorts of relationships that exist, but there are about 2000 pairs of evals, so this is not a feasible way to get the full picture.

I computed the correlation coefficient for each pair of evals, which essentially reduces each scatter plot to a single number. Almost every pair of evals is positively correlated. This could mean that models are good at transferring skills from one domain to another. However, it could also mean that everybody is using similar general-purpose training sets, so that eval performance depends more on overall model quality (and factors like parameter count) than on any particular focus on an eval's subject area.

Histogram showing the distribution of correlation coefficients between all pairs of evals in the HuggingFace dataset.
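The correlation matrix and the histogram above take only a few lines to compute; this sketch reuses the same hypothetical leaderboard_scores.csv as before.

```python
# Sketch: pairwise Pearson correlations between evals, plus a histogram of the
# coefficients. Uses the same hypothetical CSV as the scatter-plot snippet.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

scores = pd.read_csv("leaderboard_scores.csv", index_col=0)
corr = scores.corr()  # correlation between every pair of eval columns

# Keep only the upper triangle (excluding the diagonal) so each pair counts once.
coeffs = corr.values[np.triu_indices_from(corr.values, k=1)]

plt.hist(coeffs, bins=40)
plt.xlabel("Correlation coefficient")
plt.ylabel("Number of eval pairs")
plt.show()
```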

Looking at the full correlation matrix reveals more structure. In particular, there are large blocks on the diagonal, showing that many evals within the same family are strongly positively correlated. For example, models that are good at Prealgebra in the MATH eval family are also good at Geometry. However, there does not appear to be any especially strong correlation between evals from the MATH family and the more mathematical evals from the BBH family (such as the logical deduction evals, Boolean Expressions, or perhaps Geometric Shapes). That might mean that models are bad at transferring math skills across even moderately distant domains. (It could also mean that the training data is contaminated: models trained on one eval from the MATH family were more likely to be trained on all of them, making performance across the MATH evals correlated.)

Correlation matrix of evals from the HuggingFace dataset.
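To make the block structure visible, the evals need to be ordered so that members of the same family sit next to each other. The sketch below assumes, purely for illustration, that eval names share a family prefix (e.g. "MATH_prealgebra", "BBH_boolean_expressions"), so sorting the columns lexically groups each family into a contiguous block.

```python
# Sketch: correlation-matrix heatmap with evals grouped by family.
# Assumes eval column names carry a family prefix so lexical sorting groups them.
import matplotlib.pyplot as plt
import pandas as pd

scores = pd.read_csv("leaderboard_scores.csv", index_col=0)
ordered = sorted(scores.columns)  # family prefix keeps families adjacent
corr = scores[ordered].corr()

fig, ax = plt.subplots(figsize=(10, 10))
im = ax.imshow(corr.values, vmin=-1, vmax=1, cmap="RdBu_r")
ax.set_xticks(range(len(ordered)))
ax.set_xticklabels(ordered, rotation=90, fontsize=6)
ax.set_yticks(range(len(ordered)))
ax.set_yticklabels(ordered, fontsize=6)
fig.colorbar(im, ax=ax, label="Correlation")
fig.tight_layout()
plt.show()
```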

All of the analysis so far has been done with data from HuggingFace's LLM Leaderboard. However, that data covers only models uploaded to HuggingFace, which excludes the biggest commercial models such as GPT-4o, Claude, and PaLM.

Thankfully, Stanford HELM runs evals on a set of about 75 models, including many of the best commercial ones. The correlation matrix below shows the results using data from HELM.

Correlation matrix of evals from the HELM dataset.

As with the HuggingFace data, most evals are somewhat correlated, but there are blocks of especially strong correlation. Again, the math evals are strongly correlated with one another, as are the translation evals. However, the LegalBench evals (which are designed to test legal reasoning) do not show any noticeable block structure. Perhaps skills transfer less readily between the tested legal domains than between the math domains, or perhaps the legal domains simply share less of the required knowledge.

Also, the Abstract Algebra eval from the MMLU family is not especially correlated with any of the other math evals. It would be surprising if it were the only math topic that gained nothing from skills learned in other math topics, which suggests that the block structure might be the result of something other than skill transfer.
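One way to put a number on the block structure, for either the HuggingFace or the HELM scores, is to compare the average correlation within each eval family to the average correlation with evals outside it. This sketch uses the same hypothetical CSV and family-prefix naming convention as above.

```python
# Sketch: compare within-family vs. cross-family average correlations.
# Assumes the same hypothetical CSV and family-prefix naming as above.
import numpy as np
import pandas as pd

scores = pd.read_csv("leaderboard_scores.csv", index_col=0)
corr = scores.corr()
family = {col: col.split("_")[0] for col in corr.columns}

for fam in sorted(set(family.values())):
    members = [c for c in corr.columns if family[c] == fam]
    others = [c for c in corr.columns if family[c] != fam]
    if len(members) < 2 or not others:
        continue
    within = corr.loc[members, members].values
    within_mean = within[np.triu_indices_from(within, k=1)].mean()
    cross_mean = corr.loc[members, others].values.mean()
    print(f"{fam}: within-family {within_mean:.2f}, cross-family {cross_mean:.2f}")
```

A within-family average that is much larger than the cross-family average is consistent with the blocks visible in the heatmaps.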

Conclusion

A model's performance on one eval is generally correlated with its performance on all other evals. Evals from the same family show especially strong correlation with one another, but not with out-of-family evals that target similar skills.

My interpretation is that the general correlation between all evals is the result of some common factor (like parameter count) that predicts model performance on everything. The block structure is the best evidence for skill transfer, but the lack of skill transfer between eval families suggests that something else might be going on (like training set contamination).

Code

Code for this project is available in the EvalCorrelation repository on GitHub.