Topic Analysis is a proprietary feature, available only through Signal AI, that allows you to discover and understand the connections between organisations and topics.
At the core of this feature, Signal AI employs a scoring method to measure the association between topics and entities. Here we describe the technical details of how this scoring works. First, though, we motivate our approach by describing the requirements of our users and the challenges they face with traditional approaches.
The Signal AI platform enables users to make better strategic decisions. PR executives are time-poor, yet need to identify quickly the insights that support their strategic decision making. To do so, these professionals often have to sift through large amounts of unstructured information to stay truly up to date.
At Signal AI, we have a catalogue of over 500 topics and entities with different levels of granularity ranging from very broad topics like education, sport and regulation to very niche topics such as Business Intelligence, or War for talent.
If a PR executive is interested in the topics that underlie their brand's media coverage and ranks the topics by their volume in this coverage, they will always get broad topics that are generally very common (e.g. Education for a university).
Traditionally, the PR industry relies on Share of Voice (SoV) to measure the effectiveness of PR work. In particular, PR professionals count the articles that mention their brand in relation to a topic of interest. They can then use this number to compare themselves against peers and competitors.
While this approach can be highly effective for comparing similar-size businesses against each other, it may be unfair to smaller organizations. Smaller companies will naturally get less coverage in the media. Consider a situation where an organization ‘Cool Green Startup’ is mentioned only 10 times in the media and all 10 mentions are on the topic of ‘Sustainability Research’. A large corporate ‘Big Oil’ may have 20 mentions on the same topic, so SoV will favor this corporate, even though Big Oil has thousands of mentions in the media overall, most of them about pollution.
Moreover, there is a lack of understanding around how big a SoV really is. Because SoV is often calculated against either selected competitors or the entire topic itself, you might look like you have a high SoV when, in the grand scheme of things, you are a speck of dust for this topic (and vice versa). SoV on its own does not provide perspective on the true weight of a company within a topic.
In summary, share of voice tells only part of the story: it is an absolute measure, whereas association is ultimately a relative concept.
We wanted to develop a measure of association between entities (organisations) and topics that addresses the aforementioned issues for discovery and measurement. To do so, we identified three characteristics that such a measure should take into account:
1. The size of the entity
The measure of association should consider how much coverage overall the entity has. If two entities have the same number of articles about a topic and one has more coverage overall, the smaller entity should have a higher association to the topic.
2. The size of the topic
The measure of association should consider how broad or narrow the topic is. If an entity has the same number of articles about two different topics, it should be considered more associated with the narrower topic.
3. Media bias
The measure of association should take into consideration the probability of an entity being mentioned with a topic. The media is always biased towards popular things (big brands and celebrities). Therefore, if a big brand A is mentioned with a broad topic X a certain number of times, say 10, and a small brand B also has 10 mentions with a niche topic Y, we should consider the association between B and Y to be higher than the association between A and X.
We use an information distance metric to measure association between entities and topics. We chose this metric because it has the three characteristics we identified above.
This distance metric behaves as follows:
In order to show how this measure differs from SoV and other association measures, we will give examples from real data we observed in our platform.
But first, let’s get into a bit more mathematical detail:
With the help of a Venn diagram, we can illustrate how the Normalized Google Distance (NGD) and the other measures of association work.
For a given entity and topic within a certain time period, the diagram considers three sets of documents: all documents mentioning the topic (T in total); all documents that are highly salient about the entity, ignoring passing mentions (E in total); and the documents where the topic is co-mentioned with the entity (C in total). The total number of documents in our platform during the considered time period is N.
Given this notation, the SoV of an entity within a topic can be calculated as follows:
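Reading SoV as the fraction of the topic's documents in which the entity is also mentioned, the notation above suggests the following form. This is an assumed reconstruction, chosen because it is consistent with the SoV figures quoted in the examples further on:

```latex
\[
  \mathrm{SoV} = \frac{C}{T}
\]
```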
One association measure could be the co-mentions ratio (Jaccard index). This measure works by estimating the ratio between the number of co-mentions (intersection) and the total number of documents about the entity and the topic (union).
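In the same notation, the union of the two document sets has T + E - C members, since the C co-mentions would otherwise be counted twice. Under that reading, the co-mentions ratio can be written as:

```latex
\[
  J = \frac{C}{T + E - C}
\]
```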
The higher the co-mentions ratio, the higher the association between the entity and the topic. The co-mentions ratio is bounded between 0 (no co-mentions) and 1 (complete overlap).
Now we introduce the Normalized Google Distance (NGD). We first give a simplified formula for NGD.
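A simplified form consistent with the surrounding description can be sketched in terms of the counts T, E, C and N. The numerator follows the text (it vanishes when max(T, E) = C). The denominator is an assumption that mirrors the structure of the standard NGD definition:

```latex
\[
  \mathrm{NGD}_{\text{simplified}} = \frac{\max(T, E) - C}{N - \min(T, E)}
\]
```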
The distance is 0 when there is complete overlap (see Figure), since then max(T, E) = C. NGD has no upper bound. The smaller the distance, the higher the association, and vice versa.
The actual formula is very similar to the simplified one, except that it uses the logarithms of the counts instead of the raw counts.
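For reference, the standard definition of the Normalized Google Distance, written with the document counts used here, is shown below. Mapping T, E, C and N onto the usual term frequencies is our assumption about how the definition is applied:

```latex
\[
  \mathrm{NGD} = \frac{\max(\log T, \log E) - \log C}{\log N - \min(\log T, \log E)}
\]
```

Because the formula is a ratio of log differences, the base of the logarithm does not affect the result. When C = 0 the distance is infinite, which is why NGD has no upper bound.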
What is interesting about NGD, as opposed to SoV and Jaccard, is that it takes into account the number of documents mentioning the topic and the entity individually, as well as their size relative to the entire document collection. This is why it has the three characteristics we identified earlier for a good association measure (size of topic, size of entity and media bias).
We use this for our association measure. To make it behave like a measure of association rather than a distance (i.e. larger values mean higher association), we scale the distance between 0 and 1 using min-max scaling and then invert it.
Now, let us use this math to calculate these various association measures for the organisation Booking.com and the topic Package Holidays during August 2022.
The following shows how we do that step-by-step.
It should be noted that min-max scaling is performed with min = 0 and max = 1.37. As explained above, NGD is unbounded. However, we observe in our data that NGD < 1.37 in 99.9% of cases, so we clip any larger value to 1.37 and treat this as the maximum.
The absolute values are not meaningful in themselves, but they become meaningful when we compare the association of Booking.com with Package Holidays against other organisations or other topics.
Let’s do that by comparing the association of Booking.com with Package Holidays against that of Airbnb.
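The scaling step can be sketched in a few lines of Python. The function name `association_from_distance` is hypothetical. Only the bounds (0 and 1.37) and the clip-then-invert order come from the text:

```python
MAX_NGD = 1.37  # clipping threshold quoted in the text


def association_from_distance(distance: float, max_distance: float = MAX_NGD) -> float:
    """Map an NGD distance to a 0..1 score where higher means more associated."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    clipped = min(distance, max_distance)   # values above the maximum are clipped
    scaled = clipped / max_distance         # min-max scaling with min = 0
    return 1.0 - scaled                     # invert: small distance -> high score


print(association_from_distance(0.0))   # complete overlap
print(association_from_distance(5.0))   # far beyond the clipping threshold
```

A distance of 0 maps to the highest score and anything at or above 1.37 maps to the lowest, so outliers cannot stretch the scale.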
As a fun exercise, we will leave it to you to calculate the SoV, Jaccard and the Signal AI association measure using the step-by-step guide above and see whether you arrive at the same numbers.
From the table we can observe that SoV and Jaccard favour Airbnb and deem it more associated, as it gets higher scores than Booking.com (0.0015 > 0.0008 for SoV and 0.0009 > 0.0008 for Jaccard). However, the Signal AI association measure favours Booking.com (0.53 > 0.44), because it takes the size of the entity's coverage into account. Overall, Booking.com has much lower coverage than Airbnb (497 vs. 6529), so it is reasonable to deem it more associated with Package Holidays, although it has slightly fewer co-mentions (9 vs. 16).
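The comparison can be sketched in Python as follows. The entity sizes (497 and 6529) and co-mention counts (9 and 16) come from the text. The topic size `T` and collection size `N` are hypothetical placeholders chosen only for illustration, and the helper names are ours. The NGD formula is the standard logarithmic definition, which we assume matches the one used in the platform:

```python
import math

MAX_NGD = 1.37  # clipping threshold quoted in the text


def share_of_voice(c: int, t: int) -> float:
    return c / t


def jaccard(c: int, t: int, e: int) -> float:
    return c / (t + e - c)


def ngd(c: int, t: int, e: int, n: int) -> float:
    if c == 0:
        return math.inf  # no co-mentions: unbounded distance
    log_t, log_e = math.log(t), math.log(e)
    return (max(log_t, log_e) - math.log(c)) / (math.log(n) - min(log_t, log_e))


def association(c: int, t: int, e: int, n: int) -> float:
    # Clip to MAX_NGD, scale to 0..1, then invert so higher = more associated.
    return 1.0 - min(ngd(c, t, e, n), MAX_NGD) / MAX_NGD


T = 11_000        # hypothetical size of the Package Holidays topic
N = 32_000_000    # hypothetical total number of documents in the period

# E (entity coverage) and C (co-mentions) as quoted in the text.
ENTITIES = {
    "Booking.com": {"E": 497, "C": 9},
    "Airbnb": {"E": 6529, "C": 16},
}

scores = {
    name: {
        "sov": share_of_voice(v["C"], T),
        "jaccard": jaccard(v["C"], T, v["E"]),
        "association": association(v["C"], T, v["E"], N),
    }
    for name, v in ENTITIES.items()
}

for name, s in scores.items():
    print(f"{name}: SoV={s['sov']:.4f} Jaccard={s['jaccard']:.4f} "
          f"association={s['association']:.2f}")
```

With these placeholder values, SoV and Jaccard rank Airbnb higher while the NGD-based score ranks Booking.com higher, which is the pattern described above.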
Here we provide another example comparing the association of Microsoft with Video Games to its association with Future Market.
If we rely on volume only, Microsoft has more articles on Future Market than on Video Games (7710 vs. 7586). However, the topic of Future Market is much larger, with nearly 3 million articles compared to 360k articles in total on Video Games. As before, we leave computing the various measures to you as a fun exercise. In this case, SoV, Jaccard and our association measure all deem Video Games more associated with Microsoft than Future Market, despite Future Market's higher number of co-mentions. This is reasonable, since Video Games is a more niche topic with much lower coverage overall.
Finally, to show how our measure can handle media bias, let’s look at the association of Twitter (the company) with Cryptocurrencies and the association of Booking.com with Competition Policy.
Both Jaccard and SoV deem the association between Twitter and Cryptocurrencies higher than that between Booking.com and Competition Policy. Indeed, there are far more co-mention articles for the first association than for the second (493 vs. 6). However, our association measure deems the second one higher. Twitter and Cryptocurrencies are both very popular in the media, so they have a higher probability of being mentioned together overall. Our association measure, based on NGD, takes this into account and adjusts the measurement accordingly.