From GraphRAG to Surfacing the Structure: Lessons Learned and a New Path Forward
We developed a GraphRAG pipeline and benchmarked its performance. In this blog post, we share the key challenges we encountered and the different approaches we explored, highlighting which ones worked and which didn’t. Finally, we discuss our takeaways and outline the next steps in our journey.
For a more general introduction to GraphRAG, please refer to our first blog post on GraphRAG. This post assumes you understand how RAG works in general, what chunks are, and what role they play in RAG. Everything else is recapped as we go.
Introduction: An Example of GraphRAG

A potential application of GraphRAG can be found in digital retail, particularly in generating personalized user recommendations. The graph-based retrieval would use a knowledge graph that represents the store's available products and their relationships to attributes such as color, brand, or category. User searches can be translated into structured queries that extract a relevant subgraph containing the most suitable products. Importantly, this subgraph not only includes the recommended items but also encodes the reasoning behind their selection. By following the edges within the subgraph, we can understand why specific products were retrieved.
All of this information—the products, along with the connections that led to them—can be passed to an LLM to generate a user recommendation that includes a comprehensive list of products and a clear explanation of why each one was selected.
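To make the retail example concrete, here is a minimal sketch of attribute-based subgraph retrieval. It is purely illustrative: the product names, relation names, and the recommend helper are assumptions made for this example, and a real system would run a graph query against a graph database instead of filtering an in-memory list.

```python
from collections import defaultdict

# Toy knowledge graph as (subject, relation, object) triples.
# All product names, relations, and attribute values are invented for illustration.
TRIPLES = [
    ("TrailRunner X", "HAS_COLOR", "green"),
    ("TrailRunner X", "IN_CATEGORY", "jogging shoes"),
    ("TrailRunner X", "PRODUCED_BY", "eco-friendly"),
    ("CityWalker", "HAS_COLOR", "green"),
    ("CityWalker", "IN_CATEGORY", "sneakers"),
    ("RoadStar", "HAS_COLOR", "red"),
    ("RoadStar", "IN_CATEGORY", "jogging shoes"),
    ("RoadStar", "PRODUCED_BY", "eco-friendly"),
]


def recommend(triples, wanted):
    """Return products that match every wanted (relation, value) pair,
    together with the edges that justify each match."""
    edges_by_product = defaultdict(set)
    for subject, relation, obj in triples:
        edges_by_product[subject].add((relation, obj))

    results = {}
    for product, edges in edges_by_product.items():
        if wanted <= edges:  # every requested attribute is attached to the product
            # The matching edges form the explanatory subgraph for this product.
            results[product] = sorted((product, r, o) for r, o in wanted)
    return results


query = {
    ("HAS_COLOR", "green"),
    ("IN_CATEGORY", "jogging shoes"),
    ("PRODUCED_BY", "eco-friendly"),
}
subgraph = recommend(TRIPLES, query)
for product, reasons in subgraph.items():
    print(product, "because", reasons)
```

The returned edges are what would be handed to the LLM alongside the product itself, so the generated recommendation can state why each item was selected.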
At The Heart of GraphRAG: The Knowledge Graph
In the example above, we illustrated how a knowledge graph can be used to retrieve relevant information to answer a question. However, this left out a crucial aspect of GraphRAG: building the graph itself. Before any GraphRAG can be used, its underlying knowledge graph must first be constructed. This graph is the core component: It holds the knowledge that powers the system. Generally, building the graph is the most challenging part of developing a GraphRAG pipeline.
GraphRAG for Unique
Usually, as in the digital retail example, graph-based retrieval systems are designed to solve a specific, well-defined problem. In contrast, our goal is to implement a GraphRAG within our AI chatbot, which must be capable of responding to any user query. Complicating matters further, each client brings unique data and use cases, yet we lack the resources to build custom solutions for every individual case. Our experience has also shown that we cannot rely on clients to handle this process themselves. This is why we need a fully automated approach that creates a knowledge graph tailored to each new environment and dataset and then reliably answers questions using the knowledge from that graph.
QuestionRAG: Graph Creation Driven by Real User Questions
To circumvent these issues, we propose QuestionRAG - a strategy driven by user questions. User questions provide targeted insight into which entities and relationships matter most, enabling relevant and meaningful graph creation.
This approach follows a structured process:
Collect Questions: Gather the questions that users have asked our chatbot.
Cluster Questions: Group similar questions into semantic clusters.
Ontology Extraction: For each cluster and relevant document context, let an LLM identify essential entity types and relationships that define how the knowledge graph should be structured (ontology).
Ontology Aggregation: Consolidate and refine these outputs into a coherent, targeted ontology.
Knowledge Graph Construction: Leverage the ontology using the Neo4j extractor to build a knowledge graph on top of the dataset.
The advantage of this method lies in its adaptability. QuestionRAG continuously aligns with real user needs, creating a focused, relevant ontology that evolves naturally as questions emerge.
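The clustering and aggregation steps above can be sketched as follows. This is a simplified illustration, not our pipeline: word overlap stands in for semantic embeddings, the per-cluster ontologies are hard-coded where an LLM would normally produce them, and the function names, the threshold, and the min_support parameter are all assumptions made for this example.

```python
from collections import Counter


def tokens(question):
    """Lowercased word set; a crude stand-in for a semantic embedding."""
    return {w.strip("?.,!").lower() for w in question.split()} - {""}


def jaccard(a, b):
    """Overlap between two word sets, between 0.0 and 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_questions(questions, threshold=0.3):
    """Greedy clustering: add a question to the first cluster whose first
    member is similar enough, otherwise open a new cluster."""
    clusters = []
    for q in questions:
        for cluster in clusters:
            if jaccard(tokens(q), tokens(cluster[0])) >= threshold:
                cluster.append(q)
                break
        else:
            clusters.append([q])
    return clusters


def aggregate_ontologies(partial_ontologies, min_support=1):
    """Merge per-cluster ontologies, keeping the types proposed by at least
    `min_support` clusters."""
    entity_counts, relation_counts = Counter(), Counter()
    for onto in partial_ontologies:
        entity_counts.update(set(onto["entities"]))
        relation_counts.update(set(onto["relations"]))
    return {
        "entities": sorted(e for e, n in entity_counts.items() if n >= min_support),
        "relations": sorted(r for r, n in relation_counts.items() if n >= min_support),
    }


questions = [
    "Who is the chief compliance officer?",
    "Who is the chief executive officer?",
    "Which offices does the company have?",
]
clusters = cluster_questions(questions)

# Hard-coded stand-ins for what an LLM might return for each cluster.
partials = [
    {"entities": ["Person", "Role"], "relations": [("Person", "HAS_ROLE", "Role")]},
    {"entities": ["Company", "Location"], "relations": [("Company", "HAS_OFFICE_IN", "Location")]},
]
ontology = aggregate_ontologies(partials)
print(clusters)
print(ontology)
```

Raising min_support would be one way to keep the aggregated ontology focused on types that several question clusters agree on.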

How the ontology relates to the knowledge graph
The Benchmarking
We built a GraphRAG pipeline using the QuestionRAG approach and benchmarked its performance. For the dataset, we selected a medium-sized company due diligence report comprising 9 documents and a total of 315 pages.
Using 140 representative questions, we generated the ontology through the QuestionRAG process. These questions were mapped to an ontology comprising 194 entity types and 257 relationship types. Based on this, we extracted a knowledge graph containing 2,000 entities and 4,732 relationships.

Screenshot of the knowledge graph in the Neo4j console
One key realization during the development of our GraphRAG pipeline was that conventional graph-based retrieval methods could not be directly applied to our automatically generated knowledge graph. The graph lacked sufficient structure and precision to support traditional querying techniques effectively.
As a result, we transitioned to a new retrieval method we call surfacing the structure. While this approach still leverages the knowledge graph to enhance search capabilities, it does not follow the standard GraphRAG methodology. From this point onward, we will focus exclusively on surfacing the structure, and later in the text, we will explain how it differs from conventional GraphRAG and why the usual methods proved unsuitable in our case.
One of our key findings was that surfacing the structure performs on par with our standard RAG approach when answering general questions not specifically designed for graph-based retrieval. However, we identified two scenarios where it outperforms conventional RAG:
Handling nested queries
Listing large sets of instances
Nested Search
A nested search refers to a multi-part question that requires retrieving information about an entity that is not explicitly mentioned in the query. For example, consider the question: “What are the responsibilities of the executive who used to work for Swissquote?” This query involves two steps: first identifying the executive in question, and then determining their responsibilities.
Our benchmarking results showed that surfacing the structure outperforms conventional RAG in handling such queries. Specifically, it retrieved more relevant information in 33% of nested search cases.
Listing Many Instances
We categorized a question as listing many instances if it requested a list containing more than six entries from the dataset. In this category, surfacing the structure returned more comprehensive results for 60% of the questions, while our conventional RAG approach produced more complete lists in 20% of the cases.
Why GraphRAG Failed
GraphRAG typically refers to approaches like Text2Cypher or graph hopping. In this section, we explain why both methods proved ineffective when applied to our automatically generated knowledge graphs, and why we ultimately decided to stop pursuing workarounds to make them viable in this context.

Representation of how chunks integrate in knowledge graph
There are two main approaches to graph hopping, and both fail for different reasons. Fundamentally, graph hopping begins with a set of seed chunks retrieved using conventional methods, then expands the search by performing one or more hops around these chunks within the graph.
In our knowledge graph, each chunk from the dataset is represented as a node. During graph extraction, each entity is linked to all chunks in which it appears. Additionally, a “next” relationship connects chunks that follow one another consecutively within the documents.
The first graph hopping method involves moving directly from chunk to chunk using these “next” relationships. Starting from the seed chunks, this approach includes their preceding and following chunks as additional context. Although straightforward to implement, our experiments showed that adding these surrounding chunks did not improve RAG performance. Consequently, we dismissed this method.
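As a rough illustration of this first method, the sketch below assumes the chunks of a single document are numbered in reading order, so that following a “next” relationship is the same as moving to the neighboring index. The function name and the hops parameter are assumptions made for this example.

```python
def expand_with_neighbors(seed_ids, num_chunks, hops=1):
    """Add the chunks that precede and follow each seed chunk.

    Chunks are assumed to be numbered 0..num_chunks-1 in document order,
    so following a "next" edge means moving to index + 1."""
    selected = set()
    for seed in seed_ids:
        low = max(0, seed - hops)
        high = min(num_chunks - 1, seed + hops)
        selected.update(range(low, high + 1))
    return sorted(selected)


# Seeds 0 and 7 in a 10-chunk document pull in chunks 1, 6, and 8 as extra context.
context_ids = expand_with_neighbors([0, 7], num_chunks=10)
print(context_ids)
```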
The second approach to graph hopping involves using entities to expand the search. One hop around a seed chunk works as follows: starting from the chunk, we first identify all entities extracted from it, then include all other chunks connected to any of these entities. The primary reason this fails is the sheer volume of connections generated by just one hop.
Specifically, the top 10% most connected chunks each link to over 150 other chunks. To put this in perspective, GPT-4o’s context window can hold roughly 100 chunks. A single highly connected chunk from this top segment among the seeds is therefore enough to exceed the model’s capacity. We concluded that this method of graph hopping fails because connectivity within the graph is too high.
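The fan-out problem can be reproduced with a small synthetic example. The sketch below implements one entity-based hop over a made-up mapping from chunks to entities; the entity names and helper functions are assumptions, and the budget of 100 chunks is simply the rough figure quoted above.

```python
from collections import defaultdict

CONTEXT_BUDGET = 100  # rough number of chunks that fit in the context window


def build_entity_index(chunk_entities):
    """Invert chunk -> entities into entity -> chunks."""
    entity_chunks = defaultdict(set)
    for chunk_id, entities in chunk_entities.items():
        for entity in entities:
            entity_chunks[entity].add(chunk_id)
    return entity_chunks


def entity_hop(seed_ids, chunk_entities):
    """One hop: from each seed chunk to its entities, then to every chunk
    that mentions any of those entities."""
    entity_chunks = build_entity_index(chunk_entities)
    reached = set(seed_ids)
    for seed in seed_ids:
        for entity in chunk_entities.get(seed, ()):
            reached |= entity_chunks[entity]
    return reached


# Synthetic data: seed chunk 0 mentions a hub entity that appears in 150 other chunks.
chunk_entities = {0: {"Acme AG", "Zurich"}}
for i in range(1, 151):
    chunk_entities[i] = {"Acme AG"}
chunk_entities[151] = {"Geneva"}  # unrelated chunk, never reached

reached = entity_hop([0], chunk_entities)
print(len(reached), "chunks reached; over budget:", len(reached) > CONTEXT_BUDGET)
```

Even with a single seed chunk, one hop already returns more chunks than the assumed budget allows, before any ranking or filtering could be applied.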
Text2Cypher
Text2Cypher is a retrieval method that translates natural language questions into graph queries. These queries traverse the knowledge graph to retrieve relevant results. Then, not only the results but the whole path through the graph (a subgraph) is passed to the LLM to generate the final answer. This allows the LLM to not only provide the correct response but also to explain the reasoning behind it.
This approach is illustrated in the user recommendation example from the introduction: a query for “Green jogging shoes from sustainable production” is translated by Text2Cypher into a graph query using entry points such as the color “green,” the category “sports,” and the production method “eco-friendly.” The query then traverses the graph edges to locate shoe entities connected to these criteria. This yields relevant results together with a clear account of how they were found.
Text2Cypher is a powerful retrieval method, especially when paired with rich knowledge graphs and large datasets. It enables precise navigation of vast graphs and can extract information with surgical accuracy. However, this precision also represents its main limitation: the knowledge graph must meet the high standards of exactness required by Text2Cypher. As we found, our automatically generated graphs fall short of this level of precision.
Let’s illustrate this with an example. Text2Cypher frequently fails at the straightforward question “Who works in Zurich?” This isn’t because the graph lacks the necessary nodes, nor because the Text2Cypher translator misinterprets the question. Instead, the issue stems from the wide scope and uneven coverage of the knowledge graph.
In this case, the correct query would have been simple: use Zurich as an entry point and traverse to employees via a “works at” relationship. However, because our knowledge graph also included teams as nodes, Text2Cypher generated an alternative query that was still logically valid. It started at Zurich, searched for teams located there, and then attempted to retrieve individuals associated with those teams.
So why didn’t this second query return the right results? The problem lies in the structure of the dataset. Only a fraction of the employees were linked to teams, and only some teams had a location assigned to them. Critically, in our case, no team was connected to Zurich at all. As a result, the query returned nothing.
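The two competing translations can be illustrated on a toy graph. Everything in the sketch below is invented for illustration: the people, the teams, the node labels, and the relationship types are assumptions, and the Python functions merely mimic what the two Cypher queries would traverse.

```python
# Toy graph as (subject, relation, object) triples; all names are invented.
TRIPLES = [
    ("Anna", "WORKS_AT", "Zurich"),
    ("Ben", "WORKS_AT", "Zurich"),
    ("Cara", "WORKS_AT", "Geneva"),
    ("Anna", "MEMBER_OF", "Compliance Team"),  # this team has no location
    ("Cara", "MEMBER_OF", "Sales Team"),
    ("Sales Team", "LOCATED_IN", "Geneva"),    # no team is linked to Zurich
]

# Two logically valid Cypher translations of "Who works in Zurich?"
# (hypothetical labels and relationship types).
DIRECT_QUERY = (
    "MATCH (p:Person)-[:WORKS_AT]->(:Location {name: 'Zurich'}) RETURN p.name"
)
VIA_TEAM_QUERY = (
    "MATCH (p:Person)-[:MEMBER_OF]->(:Team)-[:LOCATED_IN]->"
    "(:Location {name: 'Zurich'}) RETURN p.name"
)


def sources(triples, relation, target):
    """All subjects connected to `target` via `relation`."""
    return {s for s, r, o in triples if r == relation and o == target}


def works_in_direct(triples, city):
    """Mimics DIRECT_QUERY: person -> location."""
    return sources(triples, "WORKS_AT", city)


def works_in_via_team(triples, city):
    """Mimics VIA_TEAM_QUERY: person -> team -> location."""
    people = set()
    for team in sources(triples, "LOCATED_IN", city):
        people |= sources(triples, "MEMBER_OF", team)
    return people


print(sorted(works_in_direct(TRIPLES, "Zurich")))
print(sorted(works_in_via_team(TRIPLES, "Zurich")))
```

Both queries are reasonable readings of the question, but only the first follows a path that is actually populated in the data. Nothing in the schema alone tells the translator which one to pick.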
This highlights a fundamental limitation: Text2Cypher has no way of knowing which parts of the graph align with the dataset and are sufficiently populated to support a meaningful query. Working with unstructured data the way we do will always result in paths that return only partial or no answers, and in some cases, there might not even exist a single path that returns the right results.
We held multiple meetings with Neo4j - particularly with Dr. Estelle Scifo, Senior Engineer - where we explored potential solutions to this problem. Together, we concluded that the issue was not rooted in a technical misimplementation, but rather in the inherent challenges of working with unstructured data and the broad scope that the knowledge graph is expected to cover.
One approach we experimented with was manually simplifying the ontology to build a more constrained and focused knowledge graph. For example, we removed certain entity types, such as “team”, to reduce ambiguity. However, this either led to negligible improvements in Text2Cypher’s performance, or the graph was reduced so much that it no longer functioned as a general-purpose RAG system. Instead, it became a highly tailored solution designed to answer only a small set of predefined questions, which is precisely the opposite of our goal.
Rethinking GraphRAG: Surfacing the Structure
Although traditional GraphRAG methods fail when applied to our automatically generated knowledge graph, the information within the graph remains highly valuable. Conventional retrieval systems are effectively blind to structure. They operate identically across completely different datasets. Vector search, for instance, would vectorize the search query the same way for any dataset and then use it to retrieve chunks based on cosine similarity. While this approach works for many types of questions, it often fails when true understanding is needed to locate the correct information. In such cases, the knowledge graph might be the missing layer that allows the RAG system to “see” beyond surface-level similarity.
Integrating the knowledge graph into the retrieval process is what we refer to as surfacing the structure. This technique reveals the entities and their relationships buried within the dataset and brings them into play during search. We achieve this by constructing a graph skeleton—a simplified, triplet-based representation of the full graph, formatted as a CSV. Because the knowledge graph is significantly smaller than the original dataset, we can include the entire skeleton within the context window during LLM inference.
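A minimal sketch of how such a skeleton could be serialized is shown below. The column names, the deduplication and sorting, and the example triplets are assumptions made for illustration; the actual skeleton format may differ in its details.

```python
import csv
import io


def build_skeleton(triples):
    """Serialize (subject, relation, object) triplets as CSV text,
    dropping duplicates and sorting rows for a stable output."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subject", "relation", "object"])
    for row in sorted(set(triples)):
        writer.writerow(row)
    return buffer.getvalue()


# Invented example triplets; the duplicate row is removed by build_skeleton.
triples = [
    ("Ben Example", "works at", "Zurich"),
    ("Anna Example", "works at", "Zurich"),
    ("Anna Example", "has role", "Analyst"),
    ("Ben Example", "works at", "Zurich"),
]
skeleton = build_skeleton(triples)
print(skeleton)
```

Using the csv module instead of joining strings by hand keeps the output valid when an entity name itself contains a comma or a quote.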
The graph skeleton is used in two critical stages:
Query rewriting – It helps the LLM reformulate the user query to improve retrieval performance.
Answer generation – It provides the LLM with structural context to craft a more informed and accurate response.
Updating the Search Query with the Graph Skeleton
By using the graph skeleton to refine the search query, certain complex queries—such as nested searches—can be partially or fully resolved before the retrieval step. Take, for example, the nested search question from our benchmarking section: “What are the responsibilities of the executive who used to work for Swissquote?”
Instead of directly retrieving chunks based on this question, we first prompt an LLM to reformulate the query using the structural information in the graph skeleton. This allows the system to break down the nested question, identify the executive in question, and construct a more precise query for the actual retrieval step, leading to improved results.

Extract from graph skeleton (Name changed)
With access to the graph skeleton - including the relevant portion shown in the diagram above - the LLM can leverage the structural context to accurately identify the individual referred to in the question. In our implementation, it reformulated the original query into:
“What are Simon Zeller’s responsibilities, the executive who used to work for Swissquote Bank SA and is currently the Chief Compliance Officer at Unigestion?”
By resolving the nested component of the question in advance, the subsequent retrieval using conventional methods becomes more focused and targets the exact information needed. The structure has been surfaced.
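The rewriting step can be sketched as a prompt builder plus a thin wrapper around a model call. The prompt wording, the function names, and the relation names in the skeleton excerpt are assumptions made for this example, and the llm argument is any callable that maps a prompt string to a completion string. A stub stands in for the model here, so the sketch runs without an API key.

```python
# Skeleton excerpt with assumed relation names, based on the example in the text.
SKELETON = (
    "subject,relation,object\n"
    "Simon Zeller,formerly worked at,Swissquote Bank SA\n"
    "Simon Zeller,has role,Chief Compliance Officer\n"
    "Simon Zeller,works at,Unigestion\n"
)


def build_rewrite_prompt(skeleton_csv, question):
    """Combine the graph skeleton and the user question into one prompt."""
    return (
        "You are given a knowledge graph as CSV triplets "
        "(subject,relation,object).\n"
        "Rewrite the user question so that entities it only describes "
        "indirectly are named explicitly. Return only the rewritten question.\n\n"
        f"Graph skeleton:\n{skeleton_csv}\n"
        f"User question: {question}"
    )


def rewrite_query(skeleton_csv, question, llm):
    """`llm` is any callable mapping a prompt string to a completion string."""
    rewritten = llm(build_rewrite_prompt(skeleton_csv, question)).strip()
    return rewritten or question  # fall back to the original question


def fake_llm(prompt):
    # Stand-in for a real model call; returns the rewrite reported in the text.
    return (
        "What are Simon Zeller's responsibilities, the executive who used to work "
        "for Swissquote Bank SA and is currently the Chief Compliance Officer at Unigestion?"
    )


question = "What are the responsibilities of the executive who used to work for Swissquote?"
rewritten = rewrite_query(SKELETON, question, fake_llm)
print(rewritten)
```

The rewritten question is then passed to the regular retrieval step in place of the original one.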
Graph Skeleton As Context For Answering
In addition to enhancing query rewriting, the graph skeleton can also be passed along with the retrieved chunks as context to the answering LLM. That way, it has a full account of all entities that appear in the data and knows how they are connected. This structural context allows the LLM to better understand the dataset as a whole and, therefore, also the context given by the chunks.
Consider the previously discussed question that Text2Cypher failed at: “Who works in Zurich?” The skeleton supplies the LLM with the structure of the data, including an account of all employees. This enables it to better recognize employees in the chunks, and ultimately leads to improved results.
In fact, in this particular case, the answer is already present in the skeleton. All employees have a “works at” connection to the respective location, serving the answer on a silver platter. Without this structural context, an LLM would have to infer the answer from the unstructured chunk context without any guidance. This becomes increasingly error-prone as the number of entries grows.
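To illustrate both points, the sketch below first reads the answer straight out of a skeleton and then assembles an answering prompt that combines the skeleton with the retrieved chunks. The names, the “works at” rows, the prompt wording, and the helper functions are assumptions made for this example.

```python
import csv
import io

# Invented skeleton excerpt in subject,relation,object form.
SKELETON = (
    "subject,relation,object\n"
    "Anna Example,works at,Zurich\n"
    "Ben Example,works at,Zurich\n"
    "Cara Example,works at,Geneva\n"
)


def employees_at(skeleton_csv, location, relation="works at"):
    """List the subjects linked to `location` via `relation` in the skeleton."""
    reader = csv.DictReader(io.StringIO(skeleton_csv))
    return sorted(
        row["subject"]
        for row in reader
        if row["relation"] == relation and row["object"] == location
    )


def build_answer_prompt(skeleton_csv, chunks, question):
    """Pass the skeleton alongside the retrieved chunks as answering context."""
    chunk_text = "\n\n".join(
        f"[chunk {i}] {text}" for i, text in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using the retrieved chunks. The graph skeleton "
        "lists known entities and relations as CSV triplets.\n\n"
        f"Graph skeleton:\n{skeleton_csv}\n"
        f"Retrieved chunks:\n{chunk_text}\n\n"
        f"Question: {question}"
    )


zurich_staff = employees_at(SKELETON, "Zurich")
answer_prompt = build_answer_prompt(
    SKELETON,
    ["Anna Example joined the Zurich office as an analyst."],
    "Who works in Zurich?",
)
print(zurich_staff)
```

In practice the LLM performs this lookup implicitly while reading the skeleton; the explicit filter only shows that the information is already there.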
Conclusion
Our research has demonstrated that traditional GraphRAG approaches are not viable in a generic setting. Neither Text2Cypher nor graph hopping offered a reliable solution that would justify the resources required to implement a GraphRAG.
However, in the process of exploring these methods, we discovered alternative graph-based techniques that, while they cannot be classified as GraphRAG in the usual sense, offer significant value. These approaches move retrieval-augmented generation toward a more reasoning-driven paradigm. We refer to this concept as surfacing the structure - a method that leverages the knowledge graph to enhance both query understanding and answer generation to let the RAG see beyond just the surface and into the depths of connected data.
Outlook
We have decided to conclude our efforts toward building a GraphRAG and consider this project complete. One of the key takeaways, however, is the insight that surfacing the structure can meaningfully enhance retrieval systems and opens the door to more knowledge-based, reasoning-oriented approaches.
While promising, this approach remains experimental. At present, no active projects are continuing work on surfacing the structure within our team, but the concept holds potential for future exploration and development.