Too Important to go Unsupervised – Towards Generative Aggregative Labeling (GAL)
Most CX teams have a general idea of what their customers contact support for: returns, late delivery, reservation changes, refunds, and so on. What they really want to know is why – that there has been a 30% increase in reports of packages marked as delivered but not received in the Southwest region, or an 80% increase in complaints about a coupon code error when booking a reservation after a recent IT deployment.
To know the Unknown, you obviously have to ask your data.
Motivation
If you asked a Data Scientist in 2020 to categorize the contact reasons and underlying fine-grained root causes of support conversations, they would undoubtedly turn to unsupervised clustering. It is a very reasonable approach when you have no labeled data and no predefined classes. Evaluating the results, however, is a nightmare in technical terms.
There are standard metrics, such as the Silhouette Coefficient or the Calinski-Harabasz Index, that can be used to compare clustering algorithms and hyper-parameters, but these only tell you how well your clusters are separated. No metric can determine whether the categories are useful, whether they capture meaningful distinctions or just similarities in phrasing, or whether you've landed on the right level of granularity (i.e., whether they're too generic or too specific). These questions can only be addressed through manual human review.
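For reference, here is a minimal sketch of how those separation metrics are computed with scikit-learn; the random embeddings and the choice of KMeans are illustrative assumptions, not part of any real pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Stand-in embeddings and an arbitrary KMeans setup, purely for illustration.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 384))
cluster_ids = KMeans(n_clusters=10, n_init="auto", random_state=0).fit_predict(embeddings)

# Both scores only measure geometric separation of the clusters;
# neither can tell you whether the resulting categories are useful.
print("silhouette:", silhouette_score(embeddings, cluster_ids))
print("calinski-harabasz:", calinski_harabasz_score(embeddings, cluster_ids))
```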
If you spend enough time manually reviewing and evaluating the results of an unsupervised approach, it raises the question: why are you not creating labels and using supervised classification? The case for creating your own labels is strengthened by the need to incorporate feedback from business stakeholders and to build categories around a specific company's internal processes.
Before the wide availability of open-source LLMs, the challenge was always the cold start problem. How do you get a sense of the potential categories across a large volume of data, and how do you present that data so a human can make opinionated, informed decisions about what a useful classification would be?
Solution
Generative Aggregative Labeling (GAL) is our LLM-powered method that replaces unsupervised clustering with human-in-the-loop category creation. While an LLM drives the label-creation process, the resulting production model is a simple, lightweight classifier of your choosing (e.g., a multi-layer neural network).
Figure 1. Generated and aggregated category labels
The GAL process leverages the incredible ability of LLMs to place similar sentences into a meaningful group that can be described by a single brief category label. It works around the fact that LLMs have a limited context length and therefore cannot process thousands of sentences at once: by iteratively looping through small batches of data, it builds a growing list of category labels (Figure 1). After each batch, the newly generated labels are consolidated into the list from prior batches, and redundant labels are removed. Once the final list of categories is settled, a final LLM call is made (in batch) to assign the best label to each input sentence (Figure 2).
Figure 2. Label input sentences in batch
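Putting both passes together, here is a rough sketch of the loop. `call_llm` is a hypothetical stand-in for whatever chat-completion client you use, and the prompts, batch size, and response parsing are all illustrative assumptions:

```python
from typing import List

BATCH_SIZE = 50  # illustrative; tune to your LLM's context window

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to your LLM, return its reply."""
    raise NotImplementedError

def generate_labels(sentences: List[str]) -> List[str]:
    """First pass: iteratively grow and consolidate the category list."""
    labels: List[str] = []
    for start in range(0, len(sentences), BATCH_SIZE):
        batch = sentences[start:start + BATCH_SIZE]
        prompt = (
            "Group these support sentences and give each group one brief "
            "category label. Reuse an existing label where possible.\n"
            f"Existing labels: {', '.join(labels) or 'none yet'}\n\n"
            + "\n".join(batch)
            + "\n\nReturn one label per line."
        )
        proposed = [l.strip() for l in call_llm(prompt).splitlines() if l.strip()]
        # Consolidate: keep only genuinely new labels, dropping redundant ones.
        labels.extend(l for l in proposed if l not in labels)
    return labels

def assign_labels(sentences: List[str], labels: List[str]) -> List[str]:
    """Second pass: batch-assign the best label to every input sentence."""
    assigned: List[str] = []
    for start in range(0, len(sentences), BATCH_SIZE):
        batch = sentences[start:start + BATCH_SIZE]
        prompt = (
            f"Allowed labels: {', '.join(labels)}\n"
            "For each sentence below, reply with the single best label, "
            "one per line:\n\n" + "\n".join(batch)
        )
        assigned.extend(l.strip() for l in call_llm(prompt).splitlines() if l.strip())
    return assigned
```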
Despite trillions of parameters, even the best general-purpose LLMs make mistakes, and no amount of prompt engineering will perfectly capture the distinctions you want. The categories selected by the automated GAL process need to be reviewed by a human, who decides if and when some should be merged (combined), exploded (broken apart), or renamed. This stage also lets stakeholders provide feedback after reviewing the initial groupings. In this way, you can think of GAL as a magnificent information funnel that keeps humans at the helm of decision-making.
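As an illustration, the outcome of that review can be captured as a simple mapping from generated labels to their reviewed replacements; every label name here is made up, and `None` marks a category sent back to be exploded:

```python
from typing import Optional

# Hypothetical review decisions: renames and merges map old label to
# new label; None flags a too-broad category to explode and re-label.
review_map = {
    "Coupon Not Working": "Coupon Code Error",   # rename
    "Promo Code Error": "Coupon Code Error",     # merge into existing label
    "Delivery Issue": None,                      # explode: re-run GAL on these rows
}

def apply_review(label: str) -> Optional[str]:
    """Return the reviewed label, or None if the row needs re-labeling."""
    return review_map.get(label, label)
```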
All good models still require effort. The final output of GAL is a list of sentences and their corresponding labels. To train a classification model, the sentences are embedded with a static embedding model, and those embeddings are used to train a classifier. A wide variety of embedding models are available on HuggingFace and can be tested for their performance on your given dataset. To productionize a well-behaved model, there is still Data Science to be done.
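A minimal sketch of that step, assuming sentence-transformers for the frozen embedder and scikit-learn for the classifier; the model name, the MLP settings, and the toy data standing in for GAL's output are all assumptions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the (sentence, label) pairs that GAL produces,
# repeated so the split and fit below have enough rows to run.
sentences = [
    "package marked delivered but not received",
    "my order never arrived",
    "coupon code error at checkout",
    "promo code will not apply",
] * 25
labels = [
    "Delivery Not Received", "Delivery Not Received",
    "Coupon Code Error", "Coupon Code Error",
] * 25

# Embed once with a frozen sentence-embedding model (assumed choice;
# test alternatives from HuggingFace on your own data).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(sentences)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```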
General LLMs, which have been trained on a wide variety of Q&A and NLP tasks, are not good at consistently assigning the correct label to a sentence (fun fact: the quality drops off as the input tokens grow). Assigned labels need to be reviewed and adjusted, and colliding categories need to be identified and dealt with (merge, explode, rename). The best custom models are created by iteratively training a model, examining the prediction scores and categories of the training data, adjusting labels, and retraining. Rinse and repeat, bringing in and augmenting data as needed. Data Science is still at the heart of a robust and reliable model.
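Continuing the training sketch above, one concrete way to drive that loop is to score the labeled sentences with the freshly trained classifier and review the lowest-confidence rows first (the count of 20 is an arbitrary assumption):

```python
import numpy as np

# Score every labeled sentence with the classifier trained above and
# surface the least-confident rows; these often reveal colliding
# categories worth merging, exploding, or renaming before retraining.
confidence = clf.predict_proba(X).max(axis=1)
for idx in np.argsort(confidence)[:20]:
    print(f"{confidence[idx]:.2f}  {labels[idx]}  <-  {sentences[idx]}")
```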
Conclusion
In the next article, I will discuss why we use a two-step approach to Entity Extraction and Classification, and how we make the entire process easier and more scalable by leveraging fine-tuned LLMs with Summarized Entity Extraction (SEE).