Apr 25, 2023
At PayPal, as customer champions, we believe in leaving no stone unturned to delight our customers. Even when millions of contacts are established by customers day after day, either with bots or with our customer support agents, we strive to listen to the conversations deeply and from a place of empathy. By listening to what is said and what is left unsaid in those conversations, we re-imagine and reinvent our offerings to cater to our customers’ needs.
Conversations usually involve many speakers, speaker-level context, and inter-speaker dependency, rendering sentiment analysis of a conversation complex.
This blog explores challenges, methodologies, and datasets around conversation sentiments and how PayPal analyses sentiments in our customer support conversations.
Business Use Case
Sentiment analysis is a popular Natural Language Processing (NLP) use case. It is widely used on product reviews and comments, and it is beginning to be applied to conversations and dialogues as well.
Conversational Sentiment Analysis helps to detect the polarity and emotion of speakers based on an ongoing interaction. Knowing how a customer feels during a conversation has multiple use cases in both offline and online modes.
Online:
Offline:
Challenges
Sentiment in a conversation is more complex than in a movie or product review. A movie or product review carries all the information required to identify sentiment in a single block of text. A conversation comprises two or more speakers, and the sentiment of each speaker (or of the whole conversation) depends on what each one is saying and in what context. Unlike a review, the sentiment of a conversation keeps changing as each speaker contributes. Hence, finding the sentiment of a conversation requires considering what each speaker is saying as well as the dynamic changes that occur due to contributions from each speaker.
There are three important aspects which need to be considered and accounted for while building a model to identify sentiment of a conversation:
A typical conversation has two or more speakers with each one providing one or more utterances. An utterance is a single block of information provided by a speaker before the next speaker begins.
Sentiment in a conversation depends on three important aspects:
In Figure 1, the sentiment of responses provided by the bot depends on the response provided by the user. It also depends on the overall context of the conversation.
Another example to highlight the significance of context:
Agent: Are you still getting the error?
Customer: No, I do not get it now.
Without context, the customer’s response may be classified as negative, even though it indicates that the error is resolved.
Sentiment can be understood by focusing on context, temporal aspect of utterances, and dependency on speakers. Model architectures to predict sentiment must consider these aspects.
Model Architectures
This section focuses on model architectures to predict conversational sentiment — it starts off with a simple “without context” architecture, then discusses “contextual” models, and ends with a model for asynchronous conversation.
As with most modern Natural Language Processing architectures, the models discussed in this section have the following layers:
With the above background, let us jump into various approaches to solving conversational sentiment analysis.
Non-Contextual
The non-contextual model, as displayed in Figure 2, is a simple architecture to predict the sentiment of each utterance and then aggregate them to arrive at speaker-level sentiment.
Each utterance “u” is passed through a BERT-based encoder to get a vector representation “v”, which is passed through a softmax layer to get the sentiment score of each utterance. To arrive at speaker-level sentiment, we use aggregation logic like:
Key points about this approach:
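The non-contextual pipeline described above can be sketched roughly as follows. This is a minimal illustration, not PayPal's implementation: the per-utterance logits stand in for the output of a BERT-based encoder and classification head, and the two aggregation rules shown (`mean` and `last`) as well as the names `LABELS` and `speaker_sentiment` are assumptions chosen for the example.

```python
import math
from collections import defaultdict

# Hypothetical label order for the three-way sentiment classifier.
LABELS = ("negative", "neutral", "positive")


def softmax(logits):
    """Turn raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def speaker_sentiment(utterances, strategy="mean"):
    """Aggregate utterance-level scores into one label per speaker.

    utterances: list of (speaker, logits) in conversation order, where
    logits would come from the encoder plus classification head.
    strategy: "mean" averages probabilities over a speaker's utterances,
    "last" keeps only the speaker's final utterance (both are assumed rules).
    """
    per_speaker = defaultdict(list)
    for speaker, logits in utterances:
        per_speaker[speaker].append(softmax(logits))

    result = {}
    for speaker, probs in per_speaker.items():
        if strategy == "last":
            agg = probs[-1]
        else:
            agg = [sum(col) / len(probs) for col in zip(*probs)]
        result[speaker] = LABELS[agg.index(max(agg))]
    return result


conversation = [
    ("customer", [3.0, 0.1, -1.0]),  # strongly negative opening
    ("agent", [0.0, 1.5, 0.2]),      # neutral reply
    ("customer", [-1.0, 0.2, 1.0]),  # mildly positive closing
]
print(speaker_sentiment(conversation, "mean"))
print(speaker_sentiment(conversation, "last"))
```

Because each utterance is scored in isolation, the two rules can disagree for the same speaker, which is one reason a deterministic aggregation step is a limitation of this approach.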
Contextual
Contextual models have the following key attributes:
These models start with BERT-based vector representation just like non-contextual models. In non-contextual models, we utilized specific aggregation strategies to arrive at the final sentiment of the speaker. But in contextual models, we want to encapsulate the following conversation properties and not just rely upon deterministic aggregation:
To include the above three contexts, the model needs more than just text encoding. This is where techniques like Graph Networks and COMET are used (discussed in detail below).
DialogueRNN
DialogueRNN is an Attentive RNN (Recurrent Neural Network) model which defines speaker-level encoding using speaker state, previous utterance context, and previous utterance emotion.
It considers individual speakers by focusing on three aspects:
• Speaker
• Context of preceding utterance
• Sentiment of preceding utterance
The final emotion of an utterance is determined by these states:
• Speaker state
• Global state
• Emotion representation
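To make the interplay of the three states more concrete, here is a toy sketch of how they might depend on one another. It is deliberately simplified: the published DialogueRNN uses GRU cells and attention over past global states, whereas this sketch replaces both with a fixed blend (`blend`, an assumed helper). The function name `track_states` and the blending rate are also hypothetical, so treat this as an illustration of the data flow only.

```python
def blend(state, update, rate=0.5):
    """Stand-in for a learned recurrent cell: move state toward update."""
    return [(1 - rate) * s + rate * u for s, u in zip(state, update)]


def track_states(utterances, dim):
    """Walk through a conversation and update the three states.

    utterances: list of (speaker, vector) in order; each vector has length dim
    and would normally be the encoded utterance.
    Returns the emotion representation after each utterance.
    """
    global_state = [0.0] * dim
    emotion_state = [0.0] * dim
    speaker_states = {}
    history = []

    for speaker, vec in utterances:
        prev_speaker = speaker_states.get(speaker, [0.0] * dim)

        # Global state: conversation context, driven by the utterance and
        # the previous state of whoever is speaking.
        global_state = blend(
            global_state, [v + p for v, p in zip(vec, prev_speaker)]
        )

        # Speaker state: driven by the utterance and the global context.
        speaker_states[speaker] = blend(
            prev_speaker, [v + g for v, g in zip(vec, global_state)]
        )

        # Emotion representation: driven by the speaker state and the
        # emotion representation of the preceding utterance.
        emotion_state = blend(emotion_state, speaker_states[speaker])
        history.append((speaker, list(emotion_state)))

    return history


history = track_states(
    [("customer", [1.0, 0.0]), ("agent", [0.0, 1.0])], dim=2
)
```

In a real model the emotion representation would be fed to a classifier to predict the label of each utterance.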
DialogueGCN
The DialogueGCN approach uses a GCN (Graph Convolutional Network) to establish the relationship between utterance, speaker, and listener by forming a directed graph over the utterances while keeping track of their order.
DialogueGCN represents each conversation as a graph
For an edge between utterance 1 and 2, we can identify:
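The graph construction can be sketched as follows, assuming each utterance is a node, edges connect utterances within a fixed window, and each edge is typed by the two speakers involved plus the temporal direction. The function name `build_dialogue_graph`, the window sizes, and the tuple layout are hypothetical choices for this illustration; the graph convolution itself is not shown.

```python
def build_dialogue_graph(speakers, window_past=2, window_future=2):
    """Build typed, directed edges between utterances of one conversation.

    speakers: list where speakers[i] is the speaker of utterance i.
    Returns (source, target, relation) tuples, where relation is
    (source speaker, target speaker, direction) and direction says whether
    the target comes before ("past") or after ("future") the source.
    """
    edges = []
    n = len(speakers)
    for i in range(n):
        first = max(0, i - window_past)
        last = min(n - 1, i + window_future)
        for j in range(first, last + 1):
            if j < i:
                direction = "past"
            elif j > i:
                direction = "future"
            else:
                direction = "self"  # self-loop keeps the node's own features
            edges.append((i, j, (speakers[i], speakers[j], direction)))
    return edges


edges = build_dialogue_graph(
    ["customer", "agent", "customer"], window_past=1, window_future=1
)
for edge in edges:
    print(edge)
```

Typing edges by speaker pair and direction lets a relational graph layer learn different transformations for, say, an agent reacting to a customer versus a customer following up on their own earlier utterance.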
COSMIC
COSMIC is one of the more recent approaches to emotion detection in conversations. It uses a generative model (COMET) to produce commonsense knowledge, which is then used to create speaker-level encodings.
Common sense feature extraction (COMET – Transformer based knowledge graph) helps in finding:
Sentiment Analysis on Asynchronous Conversations
At PayPal, we handle asynchronous conversations – conversations that span bots and multiple agents over various segments. Sentiment analysis of such conversations is challenging because it involves multiple agents, and each segment can carry a different sentiment and influence the next interaction in its own way.
To model the above conversation structure, we are working on a comprehensive model architecture which can combine both structural information and contextual information. A hierarchical attention layer is used to aggregate sentiments and representations to arrive at an overall, conversation-level representation.
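As a rough illustration of hierarchical attention, the sketch below pools utterance vectors into one vector per segment and then pools segment vectors into a single conversation vector. It is not the architecture under development: the names `attention_pool` and `conversation_representation` are hypothetical, and the query vectors, which would be learned parameters in a real model, are passed in as fixed values.

```python
import math


def attention_pool(vectors, query):
    """Weighted average of vectors; weights are softmax(dot(vector, query))."""
    scores = [sum(v_d * q_d for v_d, q_d in zip(v, query)) for v in vectors]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def conversation_representation(segments, utterance_query, segment_query):
    """Two-level pooling: utterances -> segment, then segments -> conversation.

    segments: list of segments, each a list of utterance vectors.
    """
    segment_vectors = [attention_pool(seg, utterance_query) for seg in segments]
    return attention_pool(segment_vectors, segment_query)


# Two segments (for example, a bot segment and an agent segment).
segments = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 4.0]],
]
# With all-zero queries the attention weights are uniform, so each level
# reduces to a plain average.
overall = conversation_representation(segments, [0.0, 0.0], [0.0, 0.0])
```

With learned queries, the weights would let the model emphasise the utterances and segments that matter most for the overall sentiment.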
As we labelled data and built models at PayPal, we encountered some curious challenges that led us to reexamine our thought process. Our use case needed labelled data, and to obtain it we employed methods such as Active Learning, Weak Supervision, and Human Annotation. We observed considerable variation among human labelers in the sentiment they assigned to an utterance or conversation. The same utterance was perceived in multiple ways by different labelers, and the subjectivity of the process made the exercise even more fascinating. We tried to bridge the differences by using voting-based labels, which made us wonder how model predictions shift depending on who observes them. Such nuances make our projects meaningful at PayPal as we consciously create inclusive products and democratize financial services.
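Voting-based labelling can be as simple as the following sketch. The strict-majority rule and the choice to return `None` when labelers do not agree are assumptions made for this example, not a description of the exact rule used at PayPal.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by a strict majority of labelers, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


print(majority_label(["negative", "negative", "neutral"]))
print(majority_label(["positive", "negative"]))
```

Utterances that come back as `None` are the ones on which labelers disagree most, so they are natural candidates for a further review round.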
Appendix
Popular Data Sets
To build high-quality models, high-quality labelled data is essential. Custom labelled data is a luxury that is not always available, so open-source data sets become quite important.
The following are some popular conversation data sets which have been used in research publications and which we found interesting at PayPal as well.
IEMOCAP
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal, and multi-speaker database. It contains about 12 hours of audio-visual data, annotated by multiple annotators with categorical labels such as anger, happiness, and sadness. (Link to the data set)
MELD
13,000 utterances from 1,433 dialogues from the TV series Friends. Each utterance is annotated with emotion and sentiment labels, and encompasses audio, visual and textual modalities. (Link to the data set)
DailyDialog
Human written multi-turn dialog data set. Represents daily communication. Manually labelled with communication intention and emotional information. (Link to the data set)
EmoryNLP
Data set based on the popular TV show called Friends. Transcripts for all 10 seasons of the show as well as manual and crowdsourced annotation for sub parts of the show are provided. (Link to the data set)
ScenarioSA
Manually labelled 2,214 multi-speaker English conversations collected from various websites that provide online communication services. (Link to the data set)