A data scientist's best LLM friend
It's not the one you might think, with an example from Kaggle
What large language model (LLM) do you think a data scientist reaches for when they need to solve a problem? Maybe GPT-4, for its incredible performance. Or GPT-3.5 Turbo, because it’s cheaper and good enough for their use case. If they prefer to host their own, maybe they go with Mistral 7B, which reportedly performs as well as Llama 2 13B, another popular open-source model.
Nope. None of those. The LLM that most data scientists reach for again and again for language processing use cases is not any of these buzzy GenAI models. It’s BERT, a.k.a. Bidirectional Encoder Representations from Transformers, introduced in 2018 by researchers at Google.
BERT-style models vs GPT-style models
BERT is an encoder-only model: it transforms language into a contextual numeric representation (encoding) but cannot generate text itself. GPT models, by contrast, are decoder-only: they build a contextual understanding of the input through self-attention and then generate coherent output one token at a time.
Why is BERT so popular? Because so many data science use cases neither require generating text nor benefit from it. Encoding text into a numeric representation, on the other hand, is extremely useful across a wide variety of natural language processing tasks.
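To make that concrete, here’s a minimal sketch of using BERT as an encoder with the Hugging Face transformers library (the checkpoint and the mean-pooling step are illustrative choices on my part, not the only way to do it):

```python
# Minimal sketch: encode sentences into fixed-size vectors with BERT.
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The earnings report beat expectations.",
             "The layoffs were announced this morning."]

inputs = tokenizer(sentences, padding=True, truncation=True,
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # torch.Size([2, 768])
```

Each sentence comes out as a 768-dimensional vector that downstream models can treat like any other numeric feature.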
Some of the key tasks in natural language processing that an encoder-only model can help with include:
Topic assignment. For example, automatically categorizing news articles by topic.
Normalizing entity names. For example, mapping a free-text job title to a standardized title from a taxonomy.
Named entity recognition. Finding proper nouns, such as the names of people or organizations, in text.
Sentiment analysis. Determining whether some text expresses primarily positive, negative, or neutral sentiment.
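Take sentiment analysis, the last item above: with a fine-tuned encoder model it’s a few lines of code. This sketch uses the Hugging Face pipeline API, which downloads a default DistilBERT checkpoint fine-tuned for binary sentiment; treat it as an illustration, not a production setup:

```python
from transformers import pipeline

# With no model specified, the sentiment-analysis pipeline downloads a
# default encoder checkpoint (DistilBERT fine-tuned on SST-2).
classifier = pipeline("sentiment-analysis")

print(classifier("The support team resolved my issue quickly."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
print(classifier("The app crashes every time I open it."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
```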
Evaluating student summaries
A good example of the differing roles that GenAI language models play in a project compared to encoder-style models like BERT can be found in the winning solution to this Kaggle competition aimed at automatically evaluating student summaries.
In this competition, participants built models to automatically assign a wording and content score to a student summary of some text, such as Aristotle’s essay on tragedy.
The competition organizers only provided four different texts for summary, with student summary examples just for those four texts. This made the task quite challenging, as the model the competitors built needed to generalize to hundreds of unseen texts.
The winning competitor used an LLM (likely ChatGPT) to generate additional training data: new prompts (a topic plus the topic text to be summarized) as well as simulated student summaries of varying quality for each prompt.
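As a rough sketch of what that generation step might look like (the prompt wording, model choice, and OpenAI client usage here are my illustration, not the winner’s actual code):

```python
# Illustrative sketch of generating synthetic training data with an LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_summary(topic_text: str, quality: str) -> str:
    """Ask the model for a simulated student summary of a given quality."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You write summaries the way a middle-school student would."},
            {"role": "user",
             "content": f"Write a {quality}-quality student summary of this text:\n\n{topic_text}"},
        ],
    )
    return response.choices[0].message.content

# Hypothetical topic texts; the real pipeline generated these as well.
prompt_texts = ["Aristotle's essay on the elements of tragedy..."]
synthetic = [generate_summary(text, q)
             for text in prompt_texts
             for q in ("low", "medium", "high")]
```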
Then they used an approach called meta pseudo labels to assign content and wording scores to the generated training data, using the provided training data as examples. Meta pseudo labels is a semi-supervised learning method that relies on two neural networks, a teacher and a student, to infer labels for unlabeled data from a smaller set of labeled data.
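The following is a highly simplified PyTorch sketch of that teacher-student loop, using toy linear models and a classification loss for readability; the real method updates the teacher via a first-order approximation of a meta-gradient, which step 5 below only gestures at:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the real teacher and student networks.
teacher = torch.nn.Linear(10, 2)
student = torch.nn.Linear(10, 2)
t_opt = torch.optim.SGD(teacher.parameters(), lr=0.01)
s_opt = torch.optim.SGD(student.parameters(), lr=0.01)

def train_step(x_unlabeled, x_labeled, y_labeled):
    # 1. Teacher produces pseudo labels for the unlabeled batch.
    pseudo = teacher(x_unlabeled).argmax(dim=1).detach()

    # 2. Measure the student on labeled data before the update.
    with torch.no_grad():
        loss_before = F.cross_entropy(student(x_labeled), y_labeled)

    # 3. Student takes a gradient step on the pseudo-labeled batch.
    s_loss = F.cross_entropy(student(x_unlabeled), pseudo)
    s_opt.zero_grad(); s_loss.backward(); s_opt.step()

    # 4. Did the pseudo labels help? Positive h means they did.
    with torch.no_grad():
        loss_after = F.cross_entropy(student(x_labeled), y_labeled)
    h = (loss_before - loss_after).item()

    # 5. Reinforce (or discourage) the teacher's pseudo labels in
    #    proportion to how much they improved the student (a crude
    #    stand-in for the paper's meta-gradient update).
    t_loss = h * F.cross_entropy(teacher(x_unlabeled), pseudo)
    t_opt.zero_grad(); t_loss.backward(); t_opt.step()

train_step(torch.randn(8, 10), torch.randn(4, 10), torch.randint(0, 2, (4,)))
```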
Finally, the winner built a supervised machine learning model to predict content and wording scores. This supervised model used DeBERTa v3 base, a variant of BERT, along with LightGBM, a gradient-boosting framework that uses tree-based ensembles for supervised learning.
The winner’s supervised learning setup thus combined an encoder-only model with a tree-based ensemble method. Note that no GPT-style GenAI model was involved in the final predictive modeling.
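To illustrate the general pattern of pairing an encoder with gradient-boosted trees, here is a simplified reconstruction (not the winner’s actual pipeline, and the toy dataset is a placeholder):

```python
# Encode each summary with DeBERTa v3 base, then regress the content
# score with LightGBM. Simplified illustration, not the winning code.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from lightgbm import LGBMRegressor

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
encoder = AutoModel.from_pretrained("microsoft/deberta-v3-base")

def embed(texts):
    """Mean-pooled DeBERTa embeddings, one vector per text."""
    inputs = tokenizer(texts, padding=True, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Placeholder training data; a real run needs many more rows.
summaries = ["Aristotle says tragedy imitates a serious, complete action...",
             "tragedy is sad plays"]
content_scores = np.array([1.2, -0.8])

model = LGBMRegressor(n_estimators=100)
model.fit(embed(summaries), content_scores)
```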
The GenAI model was needed only because the competition organizer, CommonLit, didn’t provide all the training data it had: just four texts for summary, along with labeled student examples. In fact, CommonLit had hundreds of different prompt texts with graded student summaries, and these were used to evaluate the submissions. Had a data scientist been working with CommonLit, with access to the entire dataset, they would have had little need for GenAI while developing the automatic scoring capability.
But GenAI models are so easy and powerful
Yes, they are, and they may someday put data scientists out of a job. At some point we may be able to simply ask them to do the tasks we build complicated semi-supervised learning pipelines for today. Getting a model to perform tasks it wasn’t explicitly trained on is called zero-shot generalization.
In recruiting, the domain where I have my day job, we could simply give a GenAI model a set of resumes and a job description and ask it to rank the resumes by fit and explain its ranking.
We can’t make that work today. Why?
We’re likely to get biased recommendations: having been trained on the whole internet, GenAI models tend to reproduce its biases. All other things being equal, GPT-4 would probably pick a man for a software engineering position and a woman for a nursing position.
Most hiring outcomes aren’t available to GenAI models. They can’t tell who was hired for a particular position or how that person performed once on the job, so they can’t do a good job of reproducing the outcomes humans produced (which may be a good thing!).
In other domains as well, the key to building a useful AI capability is having a labeled dataset you can train a supervised model on. CommonLit already has the data to train automated student summary grading; it just didn’t share all of it with the competition participants (to make the competition more interesting, I suppose). This, in fact, is the key differentiator for companies that want to build useful AI/ML capabilities: proprietary, detailed, labeled outcome data for their particular domain.
In the future, fine-tuning GenAI models with our own labeled data to complete a specific task (supervised fine-tuning) will surely become more feasible and more common. BERT can be fine-tuned this way, but so can decoder models such as GPT-4.
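For BERT-style models, this kind of supervised fine-tuning is already routine. Here’s a minimal sketch using the Hugging Face Trainer; the tiny in-memory dataset stands in for real labeled data:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Placeholder labeled data; swap in your own domain's outcome labels.
data = Dataset.from_dict({
    "text": ["great product", "terrible service"],
    "label": [1, 0],
}).map(lambda ex: tokenizer(ex["text"], truncation=True,
                            padding="max_length", max_length=32))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```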
GenAI isn’t the only thing we need
I wrote earlier about how GenAI has become synonymous with artificial intelligence in the popular imagination. Maybe that’s because ChatGPT and similar GenAI-based capabilities seem like the closest thing to artificial intelligence we’ve ever seen.
But in the AI/ML space, there are much more interesting and convenient ways to build usefully intelligent features—for now. One of those is BERT, my best LLM friend.