Hypothetical Question Search ​

The Hypothetical Question Search strategy matches user questions against pre-generated hypothetical question embeddings instead of chunk text embeddings. Because question-to-question similarity is typically higher than question-to-text similarity, this usually yields better retrieval for Q&A workloads.

Status: βœ… Available Now

Overview ​

Standard vector search embeds the user's question and compares it against chunk text embeddings. The problem: questions and their answers often live in different regions of the embedding space. Hypothetical Question Search solves this by:

  1. Pre-generating questions that each chunk can answer (during indexing)
  2. Embedding those questions into a dedicated questions vector index
  3. Searching question-to-question instead of question-to-text at query time
  4. Deduplicating matched questions back to their parent chunks

Based on the Hypothetical Question Retriever pattern from the Neo4j GraphRAG Pattern Catalog.

Installation ​

bash
pnpm add @graphrag-sdk/lexical

Quick Start ​

typescript
import { createGraph, query } from 'graphrag-sdk';
import {
  lexicalGraph,
  chunkEntityGeneration,
  hypotheticalQuestionGeneration,
  hypotheticalQuestionSearch,
} from '@graphrag-sdk/lexical';
import { inMemoryGraph, inMemoryVector, inMemoryKV } from '@graphrag-sdk/in-memory-storage';
import { openai } from '@ai-sdk/openai';

const graph = createGraph({
  graph: lexicalGraph({
    graphStoreFactory: inMemoryGraph,
    vectorStoreFactory: inMemoryVector,
    kvStoreFactory: inMemoryKV,
    pipeline: [
      chunkEntityGeneration(),
      hypotheticalQuestionGeneration({ questionsPerChunk: 3 }),
    ],
  }),
  model: openai('gpt-4o-mini'),
  embedding: openai.embedding('text-embedding-3-small'),
  namespace: 'docs',
});

await graph.insert([
  'TechCorp implements responsible AI through bias audits and model cards.',
  'Sarah Chen leads the AI ethics committee at TechCorp.',
  'The ethics committee publishes quarterly transparency reports.',
]);

// Search matches question embeddings, not chunk text
const { text } = await query({
  graph,
  search: hypotheticalQuestionSearch({ topK: 10 }),
  prompt: 'How does TechCorp approach responsible AI?',
});

console.log(text);

Prerequisites ​

The hypotheticalQuestionGeneration() processor must be in the pipeline. It creates:

  • Question embeddings in the questions vector index (each with { question, chunkId } metadata)
  • Question nodes in the graph store (nodeType: 'question', with chunkId)
  • HAS_QUESTION edges from question nodes to parent chunk nodes

Configuration ​

hypotheticalQuestionSearch(options) ​

typescript
interface HypotheticalQuestionSearchOptions {
  topK?: number;        // default: 10
  maxTokens?: number;   // default: 12000
  model?: LanguageModel;
  embedding?: EmbeddingModel;
}

Parameters:

| Option | Default | Description |
| --- | --- | --- |
| `topK` | `10` | Number of top matching questions from vector search |
| `maxTokens` | `12000` | Context window budget for assembled chunks |

topK ​

Controls how many pre-generated questions are retrieved:

typescript
// Focused β€” fewer questions, faster
hypotheticalQuestionSearch({ topK: 5 })

// Broad β€” more questions, better recall
hypotheticalQuestionSearch({ topK: 20 })

Multiple matched questions may point to the same parent chunk. The strategy deduplicates by chunkId, keeping the highest similarity score per chunk.
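
The deduplication step can be sketched as a pure function (the `QuestionMatch` shape here is illustrative, not the SDK's internal type):

```typescript
// Hypothetical shape of a vector-search hit on the questions index.
interface QuestionMatch {
  question: string;
  chunkId: string;
  score: number; // similarity score
}

// Collapse matched questions to unique parent chunks,
// keeping the highest score seen for each chunkId.
function dedupeByChunk(matches: QuestionMatch[]): Map<string, number> {
  const best = new Map<string, number>();
  for (const m of matches) {
    const prev = best.get(m.chunkId);
    if (prev === undefined || m.score > prev) best.set(m.chunkId, m.score);
  }
  return best;
}
```

With `topK: 10` and 3 questions per chunk, it is common for the 10 matches to collapse to only 4–5 unique chunks.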

maxTokens ​

Controls the total token budget for the assembled context:

typescript
// Smaller context window
hypotheticalQuestionSearch({ maxTokens: 8000 })

// Larger context window
hypotheticalQuestionSearch({ maxTokens: 16000 })
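
Context assembly can be sketched as a greedy, score-ordered fill against the token budget (illustrative only; the real strategy's tokenizer and assembly details may differ):

```typescript
// Hypothetical chunk shape after deduplication and KV lookup.
interface ScoredChunk {
  text: string;
  score: number;
}

// Greedy context assembly: take chunks in descending score order
// until the token budget is exhausted. Uses a rough 4-chars-per-token
// estimate; a real implementation would use an actual tokenizer.
function assembleContext(chunks: ScoredChunk[], maxTokens: number): string {
  const estimateTokens = (s: string) => Math.ceil(s.length / 4);
  const sorted = [...chunks].sort((a, b) => b.score - a.score);
  const parts: string[] = [];
  let used = 0;
  for (const c of sorted) {
    const cost = estimateTokens(c.text);
    if (used + cost > maxTokens) break;
    parts.push(c.text);
    used += cost;
  }
  return parts.join('\n\n');
}
```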

How It Works ​

1. Indexing (Pre-generation) ​

Input: Chunks from document processing
  ↓
For each chunk, LLM generates N hypothetical questions
  ↓
Embed each question
  ↓
Store in `questions` vector index with { question, chunkId } metadata
  ↓
Create Question nodes + HAS_QUESTION edges in graph
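
The pre-generation step above can be sketched as follows, with the LLM and embedder stubbed out (function names and shapes here are illustrative, not the SDK's internals):

```typescript
// Stub signatures standing in for a language model and an embedding model.
type Embed = (text: string) => Promise<number[]>;
type GenerateQuestions = (chunkText: string, n: number) => Promise<string[]>;

// Shape of an entry written to the `questions` vector index.
interface QuestionIndexEntry {
  embedding: number[];
  metadata: { question: string; chunkId: string };
}

// For one chunk: generate N hypothetical questions, embed each,
// and produce index entries that point back to the parent chunk.
async function indexChunk(
  chunkId: string,
  chunkText: string,
  questionsPerChunk: number,
  generate: GenerateQuestions,
  embed: Embed,
): Promise<QuestionIndexEntry[]> {
  const questions = await generate(chunkText, questionsPerChunk);
  return Promise.all(
    questions.map(async (question) => ({
      embedding: await embed(question),
      metadata: { question, chunkId },
    })),
  );
}
```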

2. Query Processing ​

Input: User question
  ↓
Embed question
  ↓
Vector search on `questions` index β†’ Top K matching questions
  ↓
Deduplicate by chunkId (keep highest score per chunk)
  ↓
Load parent chunk text from KV store
  ↓
Assemble context (ordered by score, truncated to maxTokens)
  ↓
Send context + question to LLM
  ↓
Result: Answer

Why It Works ​

Consider this chunk: "The Leiden algorithm detects communities by optimizing modularity through iterative node movement between partitions."

A user might ask: "How do you find clusters in a graph?"

The cosine similarity between the question and chunk text is low β€” different vocabulary, different framing. But if we pre-generate the hypothetical question "How does the Leiden algorithm detect communities?", the question-to-question similarity is much higher.
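
The intuition can be made concrete with a toy cosine-similarity check. The vectors below are made up for illustration, not real embeddings:

```typescript
// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Toy 3-dimensional "embeddings" (illustrative values only):
const userQuestion = [0.9, 0.1, 0.2];   // "How do you find clusters in a graph?"
const chunkText = [0.2, 0.8, 0.5];      // the Leiden chunk text
const hypoQuestion = [0.85, 0.2, 0.25]; // pre-generated hypothetical question

// Question-to-question similarity exceeds question-to-text similarity.
console.log(cosine(userQuestion, hypoQuestion) > cosine(userQuestion, chunkText)); // true
```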

Graph Structure ​

Question ──HAS_QUESTION──▢ Chunk ──PART_OF──▢ Document

Question node:

  • nodeType: 'question'
  • chunkId: string (reference to parent chunk)

Question vector index entry:

  • embedding: number[] (question embedding)
  • metadata: { question: string, chunkId: string }

Matches the Neo4j retrieval query pattern:

cypher
MATCH (node)<-[:HAS_QUESTION]-(chunk)
WITH chunk, max(score) AS score
RETURN chunk.text AS text, score

Usage Examples ​

Basic Usage ​

typescript
const { text } = await query({
  graph,
  search: hypotheticalQuestionSearch(),
  prompt: 'What compliance requirements does TechCorp follow?',
});

Generating More Questions Per Chunk ​

More questions per chunk increases recall but costs more during indexing:

typescript
const graph = createGraph({
  graph: lexicalGraph({
    graphStoreFactory: inMemoryGraph,
    vectorStoreFactory: inMemoryVector,
    kvStoreFactory: inMemoryKV,
    pipeline: [
      chunkEntityGeneration(),
      hypotheticalQuestionGeneration({ questionsPerChunk: 5 }), // More questions
    ],
  }),
  model: openai('gpt-4o-mini'),
  embedding: openai.embedding('text-embedding-3-small'),
  namespace: 'docs',
});

You can use different search strategies for different queries:

typescript
// Use hypothetical question search for Q&A
const qaResult = await query({
  graph,
  search: hypotheticalQuestionSearch({ topK: 10 }),
  prompt: 'What is TechCorp?',
});

// Use entity enhanced search for relationship queries
const relResult = await query({
  graph,
  search: entityEnhancedSearch({ topK: 5, maxDepth: 2 }),
  prompt: 'How are TechCorp and Sarah Chen related?',
});

Return Metadata ​

The RetrievalResult.metadata includes:

typescript
metadata: {
  matchedQuestions: number;  // questions found by vector search
  uniqueChunks: number;     // deduplicated parent chunks
}

Error Handling ​

  • If the questions vector index doesn't exist (processor wasn't run), returns FAIL_RESPONSE with { matchedQuestions: 0, uniqueChunks: 0 }
  • If no matching questions are found, returns FAIL_RESPONSE context

When to Use ​

βœ… Good For ​

  • Q&A retrieval β€” When users ask questions and the answers are in chunk text
  • Low question-to-answer similarity β€” When questions use different vocabulary than the source text
  • FAQ-style knowledge bases β€” Documents that naturally answer specific questions
  • Technical documentation β€” Where users ask about concepts described with different terminology

❌ Not Ideal For ​

  • Relationship queries β€” Use entityEnhancedSearch() instead
  • Global/thematic questions β€” Use globalCommunitySearch() instead
  • Cost-sensitive indexing β€” Pre-generating questions adds LLM calls during indexing
  • Frequently updated corpora β€” Questions must be regenerated when chunks change

Comparison with Other Search Strategies ​

| Strategy | Searches against | Deduplication | Best for |
| --- | --- | --- | --- |
| `naiveSearch()` | Chunk text embeddings | None | Simple fact lookups |
| `hypotheticalQuestionSearch()` | Pre-generated question embeddings | By parent chunk (max score) | Q&A where Q↔A similarity is low |
| `entityEnhancedSearch()` | Chunk embeddings → entity graph | None | Enriching with entity relationships |
| `similarChunkTraversalSearch()` | Chunk embeddings + graph traversal | By visited node | Connected context across documents |
| `localCommunitySearch()` | Entity embeddings | None | Community-level summaries |
| `globalCommunitySearch()` | Community reports | None | High-level corpus questions |

Performance Characteristics ​

Indexing Cost ​

| Questions per chunk | LLM calls | Embedding calls |
| --- | --- | --- |
| 3 (default) | 1 per chunk | 3 per chunk |
| 5 | 1 per chunk | 5 per chunk |
| 10 | 1 per chunk | 10 per chunk |

Query Speed ​

| Operation | Complexity |
| --- | --- |
| Vector search on questions index | O(log N) |
| Deduplication | O(K) where K = topK |
| Chunk retrieval from KV | O(unique chunks) |

Source Code ​

View the implementation in the @graphrag-sdk/lexical package source.

Released under the Elastic License 2.0. Made with ❀️ by Narek.