But what does it MEAN? Augmenting AI with Named Entity Recognition

By Ted Callahan • May 21, 2024

It’s one thing to cram a bunch of data into an AI and ask it to summarize or ideate, but it’s another to answer meaningful questions about the data. Augmentation is a practice that enriches data to give an AI additional context, allowing it to answer more meaningful questions.

At Artium we used Named Entity Recognition to enrich transcripts of our All Hands meetings so we could ask an AI about specific clients discussed at the meetings.

Why we did it

A prior experiment to create an AI knowledge bot from transcripts of our meetings identified a significant shortcoming: even when one of our clients was explicitly mentioned by name in a meeting, our AI didn’t understand that an iconic M&E brand we’ll call “Tasty Energy Co” was a client. As a result, we couldn’t get answers to questions like:

What clients has Artium worked with?

Well, clients are pretty important to Artium, so that just won’t do! We wondered:

  1. Could we augment our transcript data to teach our AI about clients?

  2. Could we use Named Entity Recognition to augment our data programmatically?

TURNS OUT, WE CAN!

How we did it

The basic flow of data: prepare transcripts from Zoom recordings, extract entities with NER, annotate the transcripts with the extracted entities, then vectorize the annotated text for retrieval.

Once transcripts of our Zoom recordings had been prepared, we used Google’s google.cloud.language_v2 Python client to perform entity extraction.

The analyze_entities method returns a structure of entities identified by category. We focused on the “ORGANIZATION” category as a way to identify clients:
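A minimal sketch of this step, assuming the standard language_v2 client (the ARTIUM_LOWERCASE_REFERENCE_FORMATS values here are illustrative; our real list was longer):

```python
from google.cloud import language_v2

# Illustrative subset; the real list also covered transcription misspellings.
ARTIUM_LOWERCASE_REFERENCE_FORMATS = {"artium", "artium.ai", "artisan", "artisans"}

def extract_organizations(transcript_text: str) -> set[str]:
    """Return ORGANIZATION entities in the text, minus known Artium self-references."""
    client = language_v2.LanguageServiceClient()
    document = language_v2.Document(
        content=transcript_text,
        type_=language_v2.Document.Type.PLAIN_TEXT,
    )
    response = client.analyze_entities(request={"document": document})
    return {
        entity.name
        for entity in response.entities
        if entity.type_ == language_v2.Entity.Type.ORGANIZATION
        and entity.name.lower() not in ARTIUM_LOWERCASE_REFERENCE_FORMATS
    }
```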

Taking a closer look at the variable ARTIUM_LOWERCASE_REFERENCE_FORMATS reveals a pitfall we stumbled across: false positives. The audio transcripts often contained alternate spellings of the same entity, and the language_v2 library would return very generic ORGANIZATION entities like “team” and “company” alongside real company names. As a first pass, we filtered out known Artium strings and left everything else in.

Once the entities were extracted, we added them to a section of the transcript as follows:
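A sketch of the annotation step, using the first format we tried (annotate_chunk is an illustrative helper name):

```python
def annotate_chunk(chunk_text: str, clients: set[str]) -> str:
    """Append the first-pass annotation format to a transcript chunk."""
    if not clients:
        return chunk_text
    mention = ", ".join(sorted(clients))
    return f"{chunk_text} This part of the transcript mentions clients {mention}."
```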

As a result, the following sample of a transcript:

"During my first year as an Artisan working with Artium, I worked on an engagements for Tasty Energy Co, among others."

Would be augmented as:

"During my first year as an Artisan working with Artium, I worked on an engagements for Tasty Energy Co, among others. This part of the transcript mentions clients Tasty Energy Co."

The final piece of the puzzle was to vectorize the augmented transcripts for use by the AI. We stored vectors computed from the augmented text, but had the database’s query responses return only the original, un-annotated text to the AI.
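A sketch of the pattern, with Chroma standing in as the vector database (any store that accepts caller-supplied embeddings alongside separate document text works the same way):

```python
import chromadb
from chromadb.utils import embedding_functions

embed = embedding_functions.DefaultEmbeddingFunction()
collection = chromadb.Client().get_or_create_collection("all_hands_transcripts")

original = ("During my first year as an Artisan working with Artium, "
            "I worked on an engagement for Tasty Energy Co, among others.")
augmented = original + " This part of the transcript mentions clients Tasty Energy Co."

collection.add(
    ids=["chunk-001"],
    embeddings=embed([augmented]),  # similarity search runs against the annotated text
    documents=[original],           # query results hand the AI only the original text
)

results = collection.query(
    query_embeddings=embed(["What clients has Artium worked with?"]),
    n_results=1,
)
print(results["documents"][0][0])  # the original, un-annotated transcript text
```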

So how did it work?

Not at all! Initially we saw no improvement in the AI’s ability to identify clients in transcript data. We decided to iterate on the annotation, changing “this part of the transcript” to a simpler format:

"During my first year as an Artisan working with Artium, I worked on Tasty Energy Co, among others. Tasty Energy Co is an Artium client."

Ultimately this was successful! We were able to ask our AI knowledge bot about client mentions in our meetings.

What did we learn?
  1. Entity extraction casts a very broad net, returning entities that need to be filtered down to the relevant domain. While we want to extract “clients”, we’re actually extracting anything the model thinks is an “ORGANIZATION”, which leads to a lot of false positives (e.g. “team”, “llm”, “company”). In the future, we would need to figure out how to filter down to relevant entities; a simple stoplist, sketched after this list, would be one starting point.

  2. The language of the annotation matters. In one format, our AI was not able to glean useful context; in another, it worked really well. Our takeaway is that simpler, more definitive language seems to work better. For example, instead of saying “this thing talks about this thing,” just say “this is a thing.” That seemed to help with both the embeddings and the AI’s ability to understand the context.
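As a starting point for the first learning, a simple second-pass stoplist might look like this (the stoplist contents are illustrative):

```python
# Hypothetical second-pass filter: drop generic ORGANIZATION hits before annotating.
GENERIC_ORGANIZATIONS = {"team", "company", "llm", "org", "group", "client"}

def plausible_clients(org_names: set[str]) -> set[str]:
    """Keep only organization names that look like real company names."""
    return {
        name for name in org_names
        if name.lower() not in GENERIC_ORGANIZATIONS and len(name) > 2
    }
```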

Additional Resources
  1. Google Cloud Language V2 API

Looking to get more out of your AI apps by adding Named Entity Recognition into the mix?