SemGraph (short for “Semantic Graph”) is a technology that gives Large Language Models a bird’s-eye view of a codebase by mapping code entities and their relationships as a graph and letting an LLM query that graph. This gives LLMs a deeper, more unified perspective on how everything fits together. With this semantic approach, an LLM can understand high-level architecture across multiple languages and answer broader questions, such as how the entire payment flow works across different layers, without manually sifting through endless files.
The Graph Service, parsers, and chatbot can run locally on a developer’s machine or on a server hosting the codebase.
Other known approaches to enhancing LLM code understanding require
loading code into context windows or creating vector representations
through embeddings. Both are expensive in terms of token
consumption and computational resources.
SemGraph offers foundational advantages through its graph-based
approach. The graph provides a language-agnostic representation of a
codebase, which can be deterministically queried by an LLM. This
allows LLMs to extract specific information about the codebase
efficiently, improving both accuracy and performance.
The Graph Service maintains a graph of nodes and edges. Each node
represents a codebase entity, such as a class or a function. Edges
capture relationships between these entities, for example “Owns”
or “Invokes.”
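The node/edge model described above can be sketched as follows. This is a minimal illustration, not SemGraph’s actual schema: the field names (`id`, `kind`, `file`) and the example identifiers are assumptions.

```python
from dataclasses import dataclass

# Hypothetical sketch of the node/edge model; actual SemGraph
# field names and identifier formats may differ.
@dataclass(frozen=True)
class Node:
    id: str    # e.g. a fully qualified entity name
    kind: str  # e.g. "class", "method", "function"
    file: str  # source file the entity was parsed from

@dataclass(frozen=True)
class Edge:
    src: str   # id of the owning/calling node
    dst: str   # id of the owned/invoked node
    kind: str  # e.g. "Owns", "Invokes"

invoice = Node("Invoice", "class", "src/Invoice.php")
total = Node("Invoice::total", "method", "src/Invoice.php")
owns = Edge(invoice.id, total.id, "Owns")
```

A “class Owns method” pair like this is the smallest meaningful subgraph a parser might emit.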
The Graph Service updates the graph whenever it detects changes
in the code, without the lengthy tokenization or embedding steps
often associated with LLM-based approaches. This means an LLM can
access fresh structural data as soon as the code changes.
The types of nodes and edges are determined by the language. In
object-oriented languages, for instance, the language parser can
define “class” or “method” nodes, along with “inherits” or
“implements” edges. For TypeScript, the parser can include a
“Type” node type.
A codebase can have multiple associated parsers, defined in a
SemGraph configuration file and mapped to specific file
extensions. The Graph Service monitors the codebase for changes
and, when changes are detected, invokes the appropriate
parser(s) for the updated files.
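The extension-to-parser mapping might look like the sketch below. The configuration format, key names, and parser commands here are hypothetical; the document does not specify SemGraph’s actual file format.

```python
from pathlib import Path
from typing import Optional

# Hypothetical shape of a SemGraph configuration mapping file
# extensions to parser commands; the real format may differ.
config = {
    "parsers": [
        {"command": "php parser/parse.php", "extensions": [".php"]},
        {"command": "node parser/parse.js", "extensions": [".ts", ".vue"]},
    ]
}

def parser_for(path: str) -> Optional[str]:
    """Return the parser command registered for the file's extension."""
    ext = Path(path).suffix
    for entry in config["parsers"]:
        if ext in entry["extensions"]:
            return entry["command"]
    return None  # no parser registered; file is ignored
```

On a change event, the service would call `parser_for` per modified file and skip files with no registered parser.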
Parsers generate a subgraph for the changed entities. The Graph
Service then merges this subgraph into the main graph, creating,
updating, or deleting nodes and edges as necessary to keep the
graph in sync with the latest codebase state.
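One simple way to implement such a merge is to treat the changed file as the unit of replacement: drop everything previously parsed from that file, then insert the fresh subgraph. This is an illustrative sketch under that assumption, not SemGraph’s actual merge algorithm.

```python
# Illustrative per-file merge, assuming each node records the file it
# was parsed from. Edges touching a stale node are dropped too; the
# fresh subgraph is expected to re-declare any that still exist.
def merge_subgraph(graph, subgraph, changed_file):
    stale = {nid for nid, n in graph["nodes"].items()
             if n["file"] == changed_file}
    graph["nodes"] = {nid: n for nid, n in graph["nodes"].items()
                      if nid not in stale}
    graph["edges"] = [e for e in graph["edges"]
                      if e["src"] not in stale and e["dst"] not in stale]
    # Insert the freshly parsed entities and relationships.
    graph["nodes"].update(subgraph["nodes"])
    graph["edges"].extend(subgraph["edges"])
    return graph
```

Deleting a file then reduces to merging an empty subgraph for it, which removes its nodes and edges.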
The Graph Service can be queried through a TCP/IP API. One key
query is “describe,” which tells the LLM what kinds of entities
and edges exist in the codebase. Another essential feature is the
ability to find paths between nodes, showing how different parts
of the codebase are connected.
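Path finding over such a graph can be sketched as a breadth-first search over the edge list. This illustrates the idea behind the query; it is not necessarily the algorithm the Graph Service uses.

```python
from collections import deque

def find_path(edges, start, goal):
    """Breadth-first search for a path of node ids from start to goal.

    `edges` is a list of (src, dst, kind) tuples, traversed src -> dst.
    Returns the list of node ids on the path, or None if unreachable.
    """
    adjacency = {}
    for src, dst, _kind in edges:
        adjacency.setdefault(src, []).append(dst)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical payment-flow fragment (names are illustrative).
edges = [
    ("CheckoutController", "PaymentService", "Invokes"),
    ("PaymentService", "StripeGateway", "Invokes"),
]
```

An LLM asking “how does the controller reach the gateway?” would receive the intermediate hops, which is exactly the structural context that is hard to recover from raw files.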
A parser is a separate program invoked by the Graph Service. Its
purpose is to convert code entities and relationships into a graph
representation. Typically, a parser can be written in the same
language it handles (e.g., PHP for PHP, TypeScript for
TypeScript), making use of third-party libraries that generate
ASTs and simplify parser development.
Once the AST is generated, the parser traverses it and creates
corresponding graph nodes and edges. These nodes and edges are
returned as JSON and then used by the Graph Service as a subgraph.
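A parser’s JSON output might look like the payload below. The key names and overall shape are assumptions for illustration; SemGraph’s actual wire format is not specified here.

```python
import json

# Hypothetical JSON subgraph a parser might return to the Graph
# Service after parsing one file; schema and keys are assumptions.
subgraph_json = """
{
  "nodes": [
    {"id": "Invoice", "kind": "class", "file": "src/Invoice.php"},
    {"id": "Invoice::total", "kind": "method", "file": "src/Invoice.php"}
  ],
  "edges": [
    {"src": "Invoice", "dst": "Invoice::total", "kind": "Owns"}
  ]
}
"""
subgraph = json.loads(subgraph_json)
```

Because the payload is plain JSON, a parser can be written in any language that can serialize it, which is what lets PHP and TypeScript parsers feed the same service.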
The Chatbot receives prompts from the user and forwards them to the LLM. It includes a set of tools that the LLM can use to query the Graph Service through LLM function calls. The sample chatbot is implemented with Python on the server side and TypeScript/Vue on the client side, illustrating how SemGraph can be integrated into an application. The same principles apply to any other application that needs to leverage SemGraph for code analysis or similar tasks.
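A Graph Service query can be exposed to the LLM as a function-calling tool. The sketch below uses an OpenAI-style tool schema; the tool names and parameters are illustrative assumptions, not SemGraph’s actual API.

```python
# Hedged sketch: OpenAI-style tool definitions wrapping two of the
# Graph Service queries described above. Names/params are hypothetical.
describe_tool = {
    "type": "function",
    "function": {
        "name": "describe_graph",
        "description": "Return the node and edge types present in the codebase graph.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}

find_path_tool = {
    "type": "function",
    "function": {
        "name": "find_path",
        "description": "Find a path between two entities in the codebase graph.",
        "parameters": {
            "type": "object",
            "properties": {
                "from_id": {"type": "string", "description": "Start node id"},
                "to_id": {"type": "string", "description": "Target node id"},
            },
            "required": ["from_id", "to_id"],
        },
    },
}
```

When the LLM emits a call to one of these tools, the chatbot forwards it to the Graph Service over its TCP/IP API and returns the result as the tool response.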
Functional alpha stage: most of the required functionality is
implemented and backed by comprehensive unit tests covering the
core operations, providing a stable foundation for further
development and iterative refinement.
Future potential: there is
likely a broader range of query types that could enhance LLM
understanding of the codebase; these should be identified and
implemented. There is also room to further optimize both
graph merging and querying operations to improve overall
performance.
Working prototype: the
included PHP and TypeScript/Vue parsers effectively demonstrate
the approach for different languages, showing how language-native
parsers can be used to generate subgraphs compatible with the
Graph Service.
Future potential: custom nodes
and edges, such as API endpoints, currently have to be defined by
developers. Detecting them in the ASTs of popular frameworks like
Laravel should not be difficult. Perhaps an LLM could assist with
this: an endpoint can be identified through static code analysis,
and the corresponding piece of code can then be sent to the LLM
for further analysis.
A working foundation that integrates a remote LLM API with the Graph Service endpoints. The backend is designed with modularity in mind, allowing it to be ported to other frameworks or extended with additional capabilities. It defines the tools the LLM uses to query the Graph Service, which can be further refined or expanded as needed.
The demo prototype illustrates how the technology can be embedded into a product, featuring graph visualization and the ability to output streamed Markdown and code blocks.
During development, I tested various LLMs and found that OpenAI's
The
current implementation relies on system instructions and tool
descriptions. Fine-tuning function calling could further improve
the tool's accuracy.
I see great potential in features that SemGraph and the chat interface could implement to make AI assistance more valuable for developers. Some ideas: