
As I was browsing the blogosphere for what is new, I came across the snippet "Knowledge Graphs and LLMs are a match made in Heaven," and it piqued my interest. So here goes a deep dive into what this sentiment actually means and some related themes.

First, the CliffsNotes version of what this is all about. Large language models (LLMs) are not known to be entirely factually correct, and their knowledge of events and circumstances can be spotty. They certainly don't have access to your organizational data and cannot meaningfully respond to queries about it. So, how do you fix the problem? Supply the model with the correct knowledge for the context you are interested in.

There is, of course, more to this than meets the eye.

Knowledge graphs

Knowledge graphs (KG) have been around for a long while (more than half a century), as this book on the topic attests. At its core is the notion that there are entities (referred to as nodes), which could be anything – a person, a place, an event, a document, a URL, etc. Relationships between entities (referred to as edges) are typically encoded in natural language. Here is a formal definition: "A KG is a directed labeled graph in which domain-specific meanings are associated with nodes and edges."
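To make that definition concrete, here is a minimal sketch in Python of a KG as a list of directed, labeled edges, where each edge is a (subject, relationship, object) triple. The facts here are invented examples:

```python
# A knowledge graph as a directed labeled graph: each edge is a
# (subject, relationship, object) triple. The facts are invented examples.
kg = [
    ("Aristotle", "born_in", "Stagira"),
    ("Aristotle", "student_of", "Plato"),
    ("Stagira", "located_in", "Greece"),
]

def edges_from(node: str) -> list[tuple[str, str]]:
    """Return (relationship, object) pairs for edges leaving `node`."""
    return [(rel, obj) for (subj, rel, obj) in kg if subj == node]

print(edges_from("Aristotle"))
# [('born_in', 'Stagira'), ('student_of', 'Plato')]
```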

Wikidata is an example of a publicly available knowledge graph that uses a specific form for representing knowledge: the Resource Description Framework (RDF). Here is a good blog about RDF. RDF organizes knowledge as a collection of triples: subject, predicate and object, where the predicate links the subject to the object. So, if you want to know about Aristotle, a query to this datastore will return all triples associated with that entity.
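As a concrete illustration (my own, not from any of the linked posts), here is a sketch of querying Wikidata's public SPARQL endpoint for one such triple; wd:Q868 (Aristotle) and wdt:P19 (place of birth) are real Wikidata identifiers:

```python
import requests

# Ask Wikidata's public SPARQL endpoint for Aristotle's birthplace.
# wd:Q868 is Aristotle and wdt:P19 is the "place of birth" property.
query = """
SELECT ?birthplaceLabel WHERE {
  wd:Q868 wdt:P19 ?birthplace .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "kg-llm-demo/0.1"},  # Wikidata asks clients to identify themselves
)
for row in resp.json()["results"]["bindings"]:
    print(row["birthplaceLabel"]["value"])  # -> "Stagira"
```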

So how does this datastore help? A recent paper explains how it can be combined with ChatGPT. Basically, if you input a query such as, "What is the birthplace of Aristotle?" (an example from the paper), ChatGPT is likely to get it wrong. If you ask for evidence for the response, ChatGPT will also generate the wrong URLs. Instead, by parsing the query first, you can identify the entities in the user question – in this case, Aristotle – query the Wikidata RDF store, get its response, add it to the context for ChatGPT and voila: ChatGPT utilizes that context and provides correct responses. That is the fundamental approach behind combining external knowledge sources with ChatGPT (or any LLM). In this LinkedIn post, Cambridge Semantics CTO Sean Martin highlights an illuminating demo video which shows a model generating graphs and charts to visually organize information – a sign of what is to come on this front.
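A minimal sketch of that flow might look like the following. Note that extract_entity and lookup_fact are hypothetical placeholders; a real system would use named-entity recognition and a live Wikidata query (as in the previous snippet), and the final call to the LLM is left out:

```python
# A sketch of the retrieval-augmented flow described above.
# extract_entity and lookup_fact are hypothetical placeholders; a real
# system would use named-entity recognition and a live Wikidata query.

def extract_entity(question: str) -> str:
    """Toy entity 'parser' -- hard-coded for this one example."""
    return "Aristotle"

def lookup_fact(entity: str) -> str:
    """Stand-in for fetching this entity's triples from Wikidata."""
    return "Aristotle (Q868) -- place of birth: Stagira."

def build_prompt(question: str) -> str:
    facts = lookup_fact(extract_entity(question))
    # The retrieved facts become extra context for the model.
    return (
        "Answer the question using only the facts below.\n"
        f"Facts: {facts}\n"
        f"Question: {question}"
    )

# Final step (not shown): send this prompt to ChatGPT or any other LLM.
print(build_prompt("What is the birthplace of Aristotle?"))
```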

Composable pipelines

Turns out RDF triples are not the only way to incorporate external knowledge to augment chat conversations. This tutorial by Pinecone shows how to incorporate prompt engineering and composable pipelines into chat conversations with LLMs using code. Another example of this is a recent arXiv submission titled "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face."

What is the approach these folks are promoting? It is actually quite simple. You have a query which needs an answer. Ask ChatGPT to come up with a plan to solve the problem. Determine whether any of the plan's steps can take advantage of an external tool. If so, use that tool to solve the problem. Feed the results to the next step of the plan. Combine everything as context to get the final response from ChatGPT. The knowledge graph approach discussed above is simply a variation on this theme.
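Here is a toy sketch of that plan-then-tools loop. The tool registry and the hard-coded plan are invented stand-ins; in HuggingGPT, ChatGPT itself produces the plan and the tools are Hugging Face models:

```python
# A toy sketch of the plan-then-tools loop described above. The tool
# registry and the hard-coded "plan" are invented stand-ins; in
# HuggingGPT, ChatGPT produces the plan and the tools are HF models.

TOOLS = {
    # Each tool maps an argument string to a result string.
    "kg_lookup": lambda entity: f"Wikidata: the birthplace of {entity} is Stagira.",
}

def make_plan(query: str) -> list[tuple[str, str]]:
    """Stand-in for asking the LLM to decompose the query into tool calls."""
    return [("kg_lookup", "Aristotle")]

def run(query: str) -> str:
    context = []
    for tool_name, tool_arg in make_plan(query):
        context.append(TOOLS[tool_name](tool_arg))  # execute each planned step
    # Final step (not shown): send the accumulated context plus the
    # original query back to ChatGPT for the answer.
    return "\n".join(context) + f"\nQuestion: {query}"

print(run("What is the birthplace of Aristotle?"))
```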

Vector database

One last approach to round out all the different ways of addressing the limitations of LLMs is the use of a vector database. What exactly is a vector database? Again, our friends at Pinecone have the answer for this (and a commercial solution). A vector database is simply a database of embeddings. An embedding of, say, a sentence is a vector representation of that sentence. These sentence embeddings were made popular by BERT almost five years ago. Given a query, the query is first converted to an embedding and compared against every vector in the database. The ones that are most similar (using a metric called cosine similarity) are returned as responses to that query. Essentially, the approach to utilizing a vector database with LLMs is as follows (a toy sketch of the indexing and retrieval steps comes after the list):

  • Index your organizational data and create vector embeddings for it. 
  • A query is posed to the LLM. 
  • The LLM's action plan involves querying the org data. 
  • The vector database passes the top results (based on cosine similarity) back to the LLM. 
  • The LLM uses that data as context to formulate a response.
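In the sketch below, random vectors stand in for real embeddings (which would come from a BERT-style model), and a plain array stands in for a vector database such as Pinecone:

```python
import numpy as np

# Random vectors stand in for real embeddings, and a plain array
# stands in for a vector database such as Pinecone.

rng = np.random.default_rng(0)
docs = ["Q3 sales report", "HR onboarding guide", "Q3 revenue summary"]
doc_vecs = rng.normal(size=(len(docs), 8))           # pretend document embeddings
query_vec = doc_vecs[2] + 0.1 * rng.normal(size=8)   # a query "near" the third doc

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The most similar stored vectors are returned to the LLM as context.
scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
best = int(np.argmax(scores))
print(docs[best], round(scores[best], 3))
```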

Concluding comments

Overcoming the limitations of LLMs is now its own industry. ChatGPT and its ilk are basically going to be the natural language (or speech) interface to everything. Venture funds are pouring in and the applications are unbounded. Buckle up.

I am always looking for feedback and if you would like me to cover a story, please let me know! Leave me a comment below or ask a question on my blogger profile page.

“Juggy” Jagannathan, PhD, is an AI evangelist with four decades of experience in AI and computer science research.