GraphRAG

One idea makes this click:

It's not about the graph.
It's about what you pre-compute from it.

First, the problem

Standard RAG finds chunks similar to your question:

Query: "What did Sarah say about the budget?"

Chunks in the index (the two that mention Sarah and the budget score highest):

Sarah mentioned the budget needs revision...

The project timeline spans 6 months...

Sarah said the budget was too tight for Q3...

Team capacity is at 80% utilization...

Answer: Sarah said the budget needs revision and was too tight for Q3.

✓ Works great for specific questions.
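
A minimal sketch of that similarity step, assuming a toy bag-of-words cosine similarity in place of a real embedding model. The function names (`bag_of_words`, `cosine`, `retrieve`) are illustrative and not taken from any library:

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; a toy stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Sarah mentioned the budget needs revision...",
    "The project timeline spans 6 months...",
    "Sarah said the budget was too tight for Q3...",
    "Team capacity is at 80% utilization...",
]
top = retrieve("What did Sarah say about the budget?", chunks)
```

Here `top` holds the two chunks about Sarah and the budget, because they share the most words with the query.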

But what about this?

"What are the main themes in the dataset?"

Vector search finds chunks similar to the query.
But no single chunk contains "the main themes."

This isn't a retrieval problem. It's a summarization problem.

The obvious idea

"What if we built a knowledge graph?"

Extract entities → Extract relationships → Build graph → Traverse at query time

Many systems do this: KAPING, G-Retriever, LlamaIndex, LangChain...

But does it solve our problem?

Try it yourself

Picture a knowledge graph with four node types (Person, Org, Event, Place). Can you answer:
"What are the main themes?"

You can see connections. But "main themes"? You'd have to visit every node.

The problem with graph traversal

✗ To answer "what are the themes?", you'd need to traverse the entire graph

✗ That's O(n) work per query, which doesn't scale to millions of entities

✗ Even if you did, you'd get a list of facts, not a summary

Knowledge graphs help with local questions.
They don't help with global questions.
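
To make the contrast concrete, here is a small sketch assuming the graph is stored as (subject, relation, object) triples. The triples and helper names are made up for illustration. A local question reads one entity's edges, while the global question has to visit every node and still ends up with a list of facts:

```python
from collections import defaultdict

# Hypothetical triples, as an entity/relationship extraction step might emit them.
triples = [
    ("Dr. Chen", "leads", "Nexus Labs"),
    ("Dr. Chen", "collaborates with", "Miller"),
    ("Miller", "works at", "FDA"),
    ("Dr. Webb", "works at", "Stanford"),
]

def build_graph(triples):
    """Adjacency list: each fact is stored on both of its endpoints."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        fact = f"{subj} {rel} {obj}"
        graph[subj].append(fact)
        graph[obj].append(fact)
    return graph

def local_facts(graph, entity):
    """Local question: look only at one entity's edges."""
    return graph.get(entity, [])

def all_facts(graph):
    """Global question by traversal: visit every node, collect every fact once."""
    return list(dict.fromkeys(f for facts in graph.values() for f in facts))

graph = build_graph(triples)
about_miller = local_facts(graph, "Miller")
everything = all_facts(graph)
```

`about_miller` is a short, targeted list. `everything` is the whole graph restated as facts, with no synthesis.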

The GraphRAG insight

From the paper:

"In contrast with related work that exploits the structured retrieval and traversal affordances of graph indexes, we focus on a previously unexplored quality of graphs in this context: their inherent modularity and the ability of community detection algorithms to partition graphs into modular communities of closely-related nodes."
— Section 1, Introduction

The innovation isn't the graph itself.
It's using communities to organize summaries.

The key: Community Summaries

1. Build a knowledge graph (entities + relationships)

2. Use the Leiden algorithm to find communities (clusters)

3. Pre-generate summaries for each community

Step 3 is the magic. It happens at index time, not query time.
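
A minimal sketch of step 3, assuming community detection has already assigned each entity to a community. The assignment below is hand-written, not real Leiden output, and `summarize` is a placeholder where a real system would call an LLM:

```python
from collections import defaultdict

# Hypothetical extracted facts.
facts = [
    ("Dr. Chen", "leads", "Nexus Labs"),
    ("Miller", "works at", "FDA"),
    ("Dr. Webb", "works at", "Stanford"),
]

# Hand-written stand-in for the output of a community detection step.
community_of = {
    "Dr. Chen": "A", "Nexus Labs": "A", "Miller": "A", "FDA": "A",
    "Dr. Webb": "B", "Stanford": "B",
}

def summarize(fact_lines):
    """Placeholder for an LLM call; here it only joins the facts."""
    return "; ".join(fact_lines)

def build_summary_index(facts, community_of):
    """Index time: group facts by community, then summarize each group once."""
    grouped = defaultdict(list)
    for subj, rel, obj in facts:
        grouped[community_of[subj]].append(f"{subj} {rel} {obj}")
    return {cid: summarize(lines) for cid, lines in grouped.items()}

summary_index = build_summary_index(facts, community_of)
```

At query time only `summary_index` is read, so no summarization work is repeated per query.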

See the difference

Compare the raw graph facts with what community summaries add:

• Dr. Chen leads Nexus Labs

• Dr. Chen collaborates with Miller

• Miller works at FDA

• Dr. Chen conducted Trial 2024

• Nexus Labs is in Seattle

• Dr. Webb works at Stanford

• Dr. Webb is rival of Dr. Chen

...just facts. No synthesis.

Community: Nexus Research Network

Centers on Dr. Chen's Alzheimer's research at Nexus Labs, including FDA collaboration and the 2024 clinical trial.

Community: Stanford Competition

Dr. Webb's competing research at Stanford, presenting alternative findings.

✓ Now you can instantly see the themes.

This is Query-Focused Summarization

From the paper:

"RAG fails on global questions directed at an entire text corpus, such as 'What are the main themes in the dataset?', since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task."
— Abstract

The graph is just scaffolding. The summaries are the product.

The hierarchy matters

Communities form a hierarchy, from a few broad communities down to many narrow ones:

C0: fewest communities, broadest scope

Alzheimer's Research Network

Chen, Nexus Labs, FDA collaboration, trials

Academic Competitors

Webb, Stanford, alternative approaches

C1: sub-communities of C0

Clinical Trial Operations

Chen, Trial 2024, patient outcomes

Regulatory & Safety

FDA, Miller, approval process

Nexus Labs Research

Seattle operations, funding

Stanford Methodology

Webb's competing findings

C2: sub-communities of C1, most granular

Trial 2024 Results

Patient Side Effects

FDA Review Timeline

Safety Protocols

Nexus Funding Sources

Seattle Lab Equipment

Webb's Publications

At C0: 2 summaries, 97% fewer tokens than full text
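
One way to picture the hierarchy is as a mapping from level to the communities at that level. The sketch below uses a hypothetical `hierarchy` dict holding only the community titles from the example above (not full summaries) and a crude whitespace word count as a token estimate, just to show that coarser levels put less text in front of the model:

```python
# Hypothetical layout: level name -> community titles at that level.
hierarchy = {
    "C0": ["Alzheimer's Research Network", "Academic Competitors"],
    "C1": [
        "Clinical Trial Operations", "Regulatory & Safety",
        "Nexus Labs Research", "Stanford Methodology",
    ],
    "C2": [
        "Trial 2024 Results", "Patient Side Effects", "FDA Review Timeline",
        "Safety Protocols", "Nexus Funding Sources", "Seattle Lab Equipment",
        "Webb's Publications",
    ],
}

def summaries_at(level):
    """Return the texts that would be sent to the model at the chosen level."""
    return hierarchy[level]

def rough_tokens(texts):
    """Very rough token estimate: whitespace-separated words."""
    return sum(len(t.split()) for t in texts)

# Number of communities and rough token count per level.
sizes = {level: (len(summaries_at(level)), rough_tokens(summaries_at(level)))
         for level in hierarchy}
```

Choosing a level is then a tradeoff: C0 is cheapest and broadest, while deeper levels cost more tokens and carry more detail.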

Global Search: Map-Reduce

For "what are the main themes?", use all community summaries:

Query:

"What are the main themes in the dataset?"

MAP: Answer the query from each community summary

Nexus Research → "Alzheimer's drug development, clinical trials..."

Stanford Competition → "Academic rivalry, alternative approaches..."

REDUCE: Synthesize top points

Main themes: (1) Alzheimer's drug research and trials, (2) Regulatory oversight, (3) Academic competition between institutions.
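
The map and reduce steps can be sketched as follows. This assumes each map call returns a partial answer plus a helpfulness score. `toy_answer` is a stand-in for an LLM call, and its length-based score is purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

community_summaries = {
    "Nexus Research Network": (
        "Centers on Dr. Chen's Alzheimer's research at Nexus Labs, "
        "including FDA collaboration and the 2024 clinical trial."
    ),
    "Stanford Competition": (
        "Dr. Webb's competing research at Stanford, presenting alternative findings."
    ),
}

def toy_answer(query, summary):
    """Stand-in for an LLM: returns (partial answer, helpfulness score)."""
    return summary, len(summary.split())

def global_search(query, summaries, answer_fn, top_k=2):
    def ask(item):
        name, summary = item
        point, score = answer_fn(query, summary)
        return name, point, score

    # MAP: query every community summary independently, in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(ask, summaries.items()))

    # REDUCE: keep the highest-scoring points. A real system would pass
    # these to one more LLM call to write the final synthesized answer.
    ranked = sorted(partials, key=lambda p: p[2], reverse=True)
    return [f"{name}: {point}" for name, point, _ in ranked[:top_k]]

top_points = global_search(
    "What are the main themes in the dataset?", community_summaries, toy_answer
)
```

Because each map call depends only on one summary, the work parallelizes cleanly, and `top_k` bounds how much text reaches the reduce step.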

Local Search: Entity-centric

For specific questions, start with entities, then traverse:

1. Find entity: Dr. Chen

2. Traverse: Nexus Labs, Miller, Trial 2024...

3. Get community summary for context

4. Get original text chunks

Answer: Dr. Chen led Alzheimer's research at Nexus Labs, conducting the 2024 clinical trial in collaboration with Miller from the FDA.
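
The four steps can be sketched as a context-assembly function. Everything here is hypothetical: plain dicts stand in for the graph, the community assignment, the summary index, and the chunk store, and the chunk texts are invented for the example:

```python
def local_search_context(entity, neighbors, community_of, summaries, chunks_by_entity):
    # 1. Find the entity.
    if entity not in neighbors:
        return None
    # 2. Traverse to directly connected entities.
    related = neighbors[entity]
    # 3. Get the community summary for context.
    summary = summaries[community_of[entity]]
    # 4. Get the original text chunks for the entity and its neighbors.
    chunks = []
    for name in [entity, *related]:
        for chunk in chunks_by_entity.get(name, []):
            if chunk not in chunks:
                chunks.append(chunk)
    return {"entity": entity, "related": related, "summary": summary, "chunks": chunks}

neighbors = {"Dr. Chen": ["Nexus Labs", "Miller", "Trial 2024"]}
community_of = {"Dr. Chen": "Nexus Research Network"}
summaries = {
    "Nexus Research Network": "Centers on Dr. Chen's Alzheimer's research at Nexus Labs.",
}
chunks_by_entity = {
    "Dr. Chen": ["Dr. Chen leads Nexus Labs."],
    "Nexus Labs": ["Dr. Chen leads Nexus Labs.", "Nexus Labs is in Seattle."],
    "Miller": ["Miller works at FDA."],
}

context = local_search_context(
    "Dr. Chen", neighbors, community_of, summaries, chunks_by_entity
)
```

The returned `context` is what would be handed to the model to write an answer like the one above.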

The essence of GraphRAG

✗ What GraphRAG is NOT:

Just using a knowledge graph for retrieval (many systems do this)

✓ What GraphRAG IS:

Using graph modularity to partition data, then pre-computing summaries at each level to enable Query-Focused Summarization at scale.

Why this works

Summaries are pre-computed

Pay the cost once at index time, not every query

Hierarchy gives you flexibility

Broad themes (C0) or specific details (C2, C3)

Map-reduce scales

Query communities in parallel, aggregate results

The tradeoffs

Comprehensiveness: 72% wins

Diversity: 62% wins

Query tokens (C0): 97% fewer

Directness: Naive RAG wins

The full picture

1. Chunk documents (~600 tokens)

2. Extract entities + relationships

3. Build knowledge graph

4. Detect communities (Leiden)

5. Pre-generate summaries ← THE KEY

6. Query: Local (entity) or Global (map-reduce)
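
Step 1 is the easiest to show in code. This is a sketch that treats whitespace-separated words as a rough proxy for tokens (a real pipeline would use the model's tokenizer). The `overlap` parameter and its default are assumptions added for illustration:

```python
def chunk_words(text, size=600, overlap=100):
    """Split text into chunks of about `size` words, each sharing `overlap`
    words with the previous chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# Example: a 1300-word document made of the numbers 0..1299.
document = " ".join(str(i) for i in range(1300))
chunks = chunk_words(document)
```

Each chunk then goes through entity and relationship extraction in step 2.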

Remember this

GraphRAG = pre-computed hierarchical summaries
using graphs as scaffolding for
Query-Focused Summarization

The graph enables the summaries.
The summaries enable global understanding.