Changing Landscape

Aug 15, 2024

Changing landscape in Data Science with the advent of LLMs

With all the hype around AI and LLMs and the fast-paced changes in conversational and vision-based AI, it is not just chatbots conversing with humans in natural language, or images created from prompts: the traditional machine learning and deep learning world is changing too. AI agents equipped with Python and its tooling can create models, or even ensembles, from just a prompt and input data. This is fast becoming the way data scientists and developers create boilerplate code, first iterations of models, test suites, and vulnerability checks, on top of all the copilots now available to improve productivity in everyday work. AI agents equipped with RAG (retrieval-augmented generation) can extract information in a conversational format; ReAct (reasoning and acting) frameworks are becoming excellent at chaining multiple agents with different capabilities to take the next best action; and CoT/ToT (chain-of-thought and tree-of-thought) prompting combines an LLM's internal knowledge with external information to reach an answer.

Let’s talk about NLP-based text classification in this discussion. Just a few years ago, transformer-based models were the pinnacle of the NLP world for building state-of-the-art custom text classifiers. But that comes at a cost: extensively labelled datasets, model training, and retraining whenever there is data drift or new labels to include. This too is changing fast with the advent of LLMs, which are helping us get from 0 to 1, and from 1 to N, at immense speed in text classification problem solving. Smaller open-source 7B LLMs are now armed with enough reasoning capability to classify text with just a few examples (zero/few shot), in a more contextual way, without needing extensive labelled training data, and they adapt quickly to changing data. The LLM-based approach also lets us include new labels or adapt to data drift much faster than transformers, and it reduces the overall complexity of solutions. For example, a transformer model that has never seen a certain context will always misclassify it, whereas a prompt change of as little as a few words lets an LLM solution adapt to this kind of drift. In one dataset we needed to identify a client's health condition, but some inputs mentioned a relative's health rather than the client's; a transformer that had never seen this variation would always misclassify it, while the LLM adapted as soon as we added an instruction to consider only the client's condition and ignore everything else, with no need to include more examples or augment the data to reach a sufficient sample size.
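The health-condition example above can be sketched as a few-shot prompt plus a strict output parser. This is a minimal illustration, not a production recipe: `LABELS`, the example texts, and the prompt wording are all made up for this sketch, and the actual call to a hosted 7B model is left out (you would send `build_prompt(...)` to whatever inference API you use and pass the raw answer to `parse_label`).

```python
# Sketch of a zero/few-shot classification prompt for a small open-source LLM.
# The labels and examples below are hypothetical, for illustration only.

LABELS = ["health_condition", "no_health_condition"]

FEW_SHOT_EXAMPLES = [
    ("Client reports chronic back pain.", "health_condition"),
    ("Client's mother has diabetes.", "no_health_condition"),  # relative, not client
]

def build_prompt(text: str) -> str:
    """Assemble instruction + few-shot examples + the input to classify."""
    lines = [
        "Classify whether the CLIENT has a health condition.",
        "Consider only the client's own condition; ignore relatives.",
        f"Answer with exactly one label from: {', '.join(LABELS)}.",
        "",
    ]
    for example_text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Text: {example_text}\nLabel: {label}")
    lines.append(f"Text: {text}\nLabel:")
    return "\n".join(lines)

def parse_label(raw_output: str) -> str:
    """Map the model's free-form answer onto the allowed label set."""
    cleaned = raw_output.strip().lower()
    # Check longer labels first so "no_health_condition" is not
    # shadowed by its substring "health_condition".
    for label in sorted(LABELS, key=len, reverse=True):
        if label in cleaned:
            return label
    return "unparsed"  # flag for review instead of guessing
```

Note how the drift fix described above is a single instruction line ("ignore relatives") rather than a retraining exercise.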

Additionally, let’s assume we have 10 different bespoke transformer models for a variety of text classification tasks. If we can find one 7B LLM that works across all 10 use cases, with a different prompt and few-shot examples for each, we only need to host one model; inference for each use case is just a change of prompt in the inference code (if engineered so), instead of hosting 10 separate bespoke models, cutting costs by over 70%. If the data is sensitive, privately hosted open-source models become essential to safeguard it, rather than managed services such as OpenAI's models.
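The "one model, many prompts" setup can be engineered as a simple prompt registry that routes each use case to its own template while every request hits the same hosted model. The use-case names, templates, and the `call_llm` stub below are all hypothetical; `call_llm` stands in for a real inference call to your shared model.

```python
# One hosted model, many use cases: route each request to its own prompt
# template. The use cases and templates here are illustrative only.

PROMPTS = {
    "sentiment": "Classify the sentiment (positive/negative) of: {text}\nLabel:",
    "topic": "Classify the topic (sports/finance/health) of: {text}\nLabel:",
}

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real call to the single shared 7B model.
    return "positive"

def classify(use_case: str, text: str) -> str:
    """Select the prompt for this use case and query the one shared model."""
    template = PROMPTS[use_case]  # raises KeyError for unknown use cases
    return call_llm(template.format(text=text)).strip()
```

Adding an eleventh use case then means adding one dictionary entry, not deploying an eleventh model.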

There are ways to handle complex multi-label classification when a single prompt does not work for a particular use case: it can be broken down into binary classification problems, with one prompt identifying each label and all inferences joined during post-processing. Keep in mind, however, that this increases overall latency and cost, since each data point now requires multiple calls to the LLM.
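The binary decomposition can be sketched as a loop over labels, one yes/no prompt per label, joined afterwards. Everything here is illustrative: the labels are invented, and `call_llm` is a naive keyword stub standing in for a real model call (which is also why this sketch makes one call per label per data point, the latency/cost trade-off noted above).

```python
# Multi-label classification decomposed into per-label yes/no prompts.
# Labels and the keyword-based stub are hypothetical, for illustration.

LABELS = ["urgent", "billing", "technical"]

BINARY_PROMPT = (
    "Does the following text relate to '{label}'? Answer yes or no.\n"
    "Text: {text}\nAnswer:"
)

def call_llm(prompt: str) -> str:
    # Placeholder: crude keyword match so the sketch runs end to end.
    label = prompt.split("'")[1]
    text = prompt.split("Text: ")[1]
    return "yes" if label in text.lower() else "no"

def classify_multilabel(text: str) -> list[str]:
    """One binary LLM call per label, joined in post-processing."""
    found = []
    for label in LABELS:
        answer = call_llm(BINARY_PROMPT.format(label=label, text=text))
        if answer.strip().lower().startswith("yes"):
            found.append(label)
    return found
```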

Even if we do not want to use LLMs in production, they can be a great tool for pseudo-labeling: the generated labels can be fed into other deep learning models, expediting overall development time. Taking this a step further, pseudo-labeling can help companies with use cases that require fine-tuning an LLM, or training one from scratch, which demands a lot of data. Many companies are using LLM distillation: where a larger model does a wonderful job at text classification but a smaller one does not, the labels produced by the larger model are used to fine-tune, or transfer knowledge to, a much smaller model, reducing the overall cost of the solution.
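One way to make pseudo-labels trustworthy enough to train a smaller model on is to keep only labels the LLM produces consistently. This is a minimal sketch of that idea, under the assumption that you sample the LLM several times per input; `llm_label` is a deterministic placeholder for those sampled calls, and the agreement rule is one simple choice among many (confidence thresholds and human spot checks are common alternatives).

```python
# Pseudo-labeling sketch: an LLM labels unlabeled text, and only labels
# with unanimous agreement across repeated calls are kept as training
# data for a smaller model. `llm_label` is a hypothetical placeholder.

def llm_label(text: str) -> str:
    # Placeholder: a real implementation would sample the LLM here.
    return "positive" if "good" in text.lower() else "negative"

def pseudo_label(texts: list[str], n_samples: int = 3) -> list[tuple[str, str]]:
    """Keep a (text, label) pair only if repeated LLM calls agree."""
    dataset = []
    for text in texts:
        votes = [llm_label(text) for _ in range(n_samples)]
        if len(set(votes)) == 1:  # unanimous across all samples
            dataset.append((text, votes[0]))
    return dataset
```

The resulting dataset can then feed fine-tuning of a smaller model, which is the distillation pattern described above.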

Things to keep in mind when working with LLMs:

  • Cost of development and of inference in production
  • Latency of answers in real-time inference use cases, since larger models generally mean higher latency
  • Privacy concerns over PII data; consider securely hosted open-source models for sensitive information
  • Choosing an LLM with enough context length for your application; a larger parameter count does not always mean better performance
  • A plan to restrict hallucinations and ensure consistent results, since LLMs can be creative and instructions must be followed according to your prompt guidelines to make the solution reliable
  • A plan to counter bias deeply rooted in the LLM
  • A plan to compress prompts if you are using an LLM with a smaller context window and your few-shot examples are too large
  • A plan to constantly adapt to the changing world of LLMs
  • Legal liabilities and explainability
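The hallucination and consistency points above are often handled with a simple output guardrail: constrain the model's answer to an allowed label set, retry a few times, and fall back deterministically instead of trusting free text. This is a sketch of that pattern; the label set, retry count, and the `call_llm` stub are all assumptions made for illustration.

```python
# Guardrail sketch: force the LLM's answer into an allowed label set,
# retrying a few times, with a deterministic fallback for review.
# `call_llm` and the label set are hypothetical placeholders.

ALLOWED = {"approve", "reject", "review"}

def call_llm(prompt: str) -> str:
    return "Approve"  # placeholder for a real inference call

def constrained_classify(prompt: str, max_retries: int = 3) -> str:
    """Retry until the model emits a valid label; else route to humans."""
    for _ in range(max_retries):
        answer = call_llm(prompt).strip().lower()
        if answer in ALLOWED:
            return answer
    return "review"  # safe default instead of trusting free-form output
```

Structured-output features (JSON mode, constrained decoding) offered by many inference stacks serve the same purpose more strictly, where available.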

Author

Abhishek Ranjan
Assistant Vice President – AI
Linkedin

