Entity Resolution: Using Vector Similarity to Fix Duplicate Entities in GraphRAG

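Because GraphRAG extracts entities one document at a time, the same real-world entity can end up as several nodes (for example "TSMC" and "台積電", the company's Chinese name). This post shows how to implement Entity Resolution with Voyage AI embeddings and LanceDB vector search, and, as a bonus, how jieba was integrated into global search.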

The Problem: Duplicate Entities

Entity extraction in GraphRAG is usually done one document at a time:

graph LR
    Doc1[Document A] -->|fed in alone| LLM
    Doc2[Document B] -->|fed in alone| LLM
    Doc3[Document C] -->|fed in alone| LLM

    LLM -->|entities from A| Graph[(Knowledge graph)]
    LLM -->|entities from B| Graph
    LLM -->|entities from C| Graph

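This causes a serious problem:

Extracted from document A	Extracted from document B	Problem
"TSMC"	"台積電"	Same company, two nodes
"Steve Jobs"	"Jobs"	Same person, two nodes
"Python"	"Python programming"	Same skill, two nodes

The graph becomes fragmented, and query quality drops sharply.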

The Fix: Vector Similarity

Core idea: compare semantic similarity with embeddings

We already have Voyage AI embedding infrastructure in place, so we can use it directly:

# Test similarity
from career_kb.rag.embedding import embed_document
import numpy as np

e1 = np.array(embed_document("TSMC"))
e2 = np.array(embed_document("台積電"))

cosine_sim = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
print(f"Cosine similarity: {cosine_sim:.4f}")
# Output: 0.9555 ✓ highly similar

Compare that with a pair that should not be merged:

e3 = np.array(embed_document("Python"))
e4 = np.array(embed_document("Java"))

cosine_sim = np.dot(e3, e4) / (np.linalg.norm(e3) * np.linalg.norm(e4))
# Output: 0.65 ✗ not similar enough
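
The comparison above can be wrapped in a small helper. What follows is a minimal sketch using only the standard library; the names `cosine_similarity` and `should_merge` and the toy 2-dimensional vectors are illustrative assumptions, not part of career_kb. Only the 0.92 threshold comes from this post.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def should_merge(a: Sequence[float], b: Sequence[float], threshold: float = 0.92) -> bool:
    """Treat two entities as the same when their embeddings are close enough."""
    return cosine_similarity(a, b) >= threshold


# Toy vectors standing in for real embeddings (hypothetical values).
anchor = [1.0, 0.0]
close = [0.96, 0.28]   # cosine with anchor is about 0.96
distant = [0.6, 0.8]   # cosine with anchor is about 0.60

print(should_merge(anchor, close))    # above the 0.92 threshold
print(should_merge(anchor, distant))  # below the 0.92 threshold
```

With real embeddings the vectors would come from `embed_document`, and the threshold would need tuning against known duplicate and non-duplicate pairs.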

Architecture

┌─────────────────────────────────────────────────────────────┐
│                  Entity Resolution Flow                      │
└─────────────────────────────────────────────────────────────┘
          New Entity: "台積電" (from LLM extraction)
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Step 1: Compute Embedding                                   │
│  embedding = embed_document("台積電")  # 1024-dim vector     │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Step 2: Search Existing Entities (LanceDB)                  │
│  results = graph_entities.search(embedding).limit(1)         │
│  → Found: "TSMC" with distance 0.089 (similarity 0.955)      │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│  Step 3: Threshold Check                                     │
│  if similarity >= 0.92:                                      │
│      return existing "TSMC" node  ← merge!                   │
│  else:                                                       │
│      create new "台積電" node                                 │
└─────────────────────────────────────────────────────────────┘
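
Step 2 reports a distance of 0.089 next to a similarity of 0.955. The two numbers are linked by a simple identity if we assume the embeddings are unit-length and the reported distance is the squared L2 distance: d² = 2 · (1 − cos). The sketch below checks that identity numerically; the function names are hypothetical and the code uses only the standard library.

```python
import math
from typing import Sequence


def similarity_to_sq_l2(cos_sim: float) -> float:
    """Squared L2 distance between two unit vectors with the given cosine."""
    return 2.0 * (1.0 - cos_sim)


def sq_l2_to_similarity(sq_dist: float) -> float:
    """Inverse mapping: cosine similarity from squared L2 distance."""
    return 1.0 - sq_dist / 2.0


def sq_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance, computed directly."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two unit vectors separated by an angle of 0.3 radians.
u = (1.0, 0.0)
v = (math.cos(0.3), math.sin(0.3))
cos_uv = sum(x * y for x, y in zip(u, v))

# The direct distance and the identity agree up to floating-point error.
assert abs(sq_l2(u, v) - similarity_to_sq_l2(cos_uv)) < 1e-12

print(round(similarity_to_sq_l2(0.9555), 3))  # about 0.089, as in Step 2
print(round(similarity_to_sq_l2(0.92), 3))    # about 0.16, the bound for a 0.92 threshold
```

If the embeddings were not normalized, or the table used a different metric, this conversion would not hold and the threshold would have to be expressed differently.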

Implementation

New module: entity_resolution.py

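# py-kb/src/career_kb/graph/entity_resolution.py
import uuid
from dataclasses import dataclass
from typing import Optional

from career_kb.db.lancedb import get_or_create_entity_table
from career_kb.rag.embedding import embed_document


@dataclass
class ResolvedEntity:
    # Fields inferred from how ResolvedEntity is constructed in resolve().
    id: str
    name: str
    is_new: bool


class EntityIndex:
    """Entity embedding index for deduplication."""

    def __init__(self, similarity_threshold: float = 0.92):
        self.threshold = similarity_threshold
        self._table = None  # Lazy-load LanceDB table

    @property
    def table(self):
        # Open the table on first use (assumed wiring; see lancedb.py).
        if self._table is None:
            self._table = get_or_create_entity_table()
        return self._table

    def add_entity(self, name: str, entity_type: str, entity_id: str) -> None:
        """Embed the name and store it (a minimal assumed implementation)."""
        self.table.add([{
            "id": entity_id,
            "name": name,
            "entity_type": entity_type,
            "vector": embed_document(name),
        }])

    def find_similar(self, name: str, entity_type: str) -> Optional[dict]:
        """Find the nearest existing entity, if it is similar enough.

        Note: entity_type is not used to filter the search here.
        """
        embedding = embed_document(name)

        # LanceDB's default metric is L2, and `_distance` is the squared L2
        # distance. For unit-length embeddings, d^2 = 2 * (1 - cosine), so the
        # cosine threshold converts to this distance bound.
        max_distance = 2 * (1 - self.threshold)

        results = self.table.search(embedding).limit(1).to_list()

        if results and results[0]["_distance"] <= max_distance:
            return results[0]  # Found a similar entity
        return None

    def resolve(self, name: str, entity_type: str) -> ResolvedEntity:
        """Resolve entity to canonical form."""
        similar = self.find_similar(name, entity_type)

        if similar:
            # Reuse the existing entity
            return ResolvedEntity(id=similar["id"], name=similar["name"], is_new=False)

        # Otherwise create a new entity
        entity_id = str(uuid.uuid4())
        self.add_entity(name, entity_type, entity_id)
        return ResolvedEntity(id=entity_id, name=name, is_new=True)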
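
The resolve flow can be tried without LanceDB or an API key. The following is a rough sketch under stated assumptions: `InMemoryEntityIndex`, `Resolved`, and `TOY_EMBEDDINGS` are hypothetical names, the vectors are hand-written unit vectors rather than real embeddings, and a brute-force scan stands in for the vector search.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hand-made unit vectors standing in for real embeddings (illustrative only).
TOY_EMBEDDINGS = {
    "TSMC": (1.0, 0.0),
    "台積電": (0.96, 0.28),  # cosine of about 0.96 with "TSMC"
    "Java": (0.6, 0.8),      # cosine of about 0.60 with "TSMC"
}


@dataclass
class Resolved:
    id: str
    name: str
    is_new: bool


class InMemoryEntityIndex:
    """Brute-force stand-in for the LanceDB-backed index."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self._rows: List[Tuple[str, str, Tuple[float, ...]]] = []

    def add_entity(self, name: str, entity_id: str) -> None:
        self._rows.append((entity_id, name, TOY_EMBEDDINGS[name]))

    def _best_match(self, vector) -> Tuple[Optional[Tuple], float]:
        best, best_sim = None, -1.0
        for row in self._rows:
            # For unit vectors the dot product equals the cosine similarity.
            sim = sum(x * y for x, y in zip(vector, row[2]))
            if sim > best_sim:
                best, best_sim = row, sim
        return best, best_sim

    def resolve(self, name: str) -> Resolved:
        match, sim = self._best_match(TOY_EMBEDDINGS[name])
        if match is not None and sim >= self.threshold:
            return Resolved(id=match[0], name=match[1], is_new=False)
        entity_id = str(uuid.uuid4())
        self.add_entity(name, entity_id)
        return Resolved(id=entity_id, name=name, is_new=True)


idx = InMemoryEntityIndex(threshold=0.92)
idx.add_entity("TSMC", "node-tsmc")
merged = idx.resolve("台積電")  # close to "TSMC", so it reuses that node
fresh = idx.resolve("Java")     # far from "TSMC", so a new node is created
print(merged.name, merged.is_new)
print(fresh.name, fresh.is_new)
```

Note that the first name seen becomes the canonical one: later variants are mapped onto it, so insertion order decides which spelling survives.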

LanceDB table schema

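# lancedb.py
import pyarrow as pa
from lancedb.table import Table


def get_or_create_entity_table() -> Table:
    """Get or create the graph_entities table used for entity resolution."""
    schema = pa.schema([
        ("id", pa.string()),
        ("name", pa.string()),
        ("entity_type", pa.string()),
        ("vector", pa.list_(pa.float32(), 1024)),  # fixed-size 1024-dim vector
    ])
    # `db` is the LanceDB connection, assumed to be defined elsewhere in this module.
    # exist_ok=True opens the table if it already exists instead of raising.
    return db.create_table("graph_entities", schema=schema, exist_ok=True)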

CLI integration

# New --similarity-threshold option
uv run career-kb graph build --similarity-threshold 0.92

# Output:
# Loaded 93 nodes from skill-graph.json
# Entity resolution enabled (threshold=0.92)
# Processing 12 material files...
# ✓ Extracted 45 unique entities
# ✓ Found 23 relations
# ✓ Merged 5 duplicate entities via vector similarity  ← new!

Verification

Testing known duplicate entities:

from career_kb.graph.entity_resolution import EntityIndex

idx = EntityIndex(similarity_threshold=0.92)

# Test 1: TSMC variants
idx.add_entity('TSMC', 'company', 'node-tsmc')
resolved = idx.resolve('台積電', 'company')
print(f'TSMC ← 台積電: {resolved.name} (merged: {not resolved.is_new})')
# ✓ TSMC ← 台積電: TSMC (merged: True)

# Test 2: Python variants
idx.add_entity('Python', 'skill', 'node-python')
resolved = idx.resolve('Python programming', 'skill')
# ✓ Python ← Python programming: Python (merged: True)

# Test 3: should not merge
resolved = idx.resolve('Java', 'skill')
# ✓ Java: Java (merged: False), a new node is created as expected

Bonus: jieba Chinese Word Segmentation

While we were at it, jieba was also integrated into global search, where it is used for query expansion:

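# bm25.py
import jieba


def expand_query_with_jieba(query: str) -> str:
    """Expand a query with jieba tokens for better full-text search (FTS).

    Example: "機器學習專案" → "機器學習專案 機器學習 專案"
    """
    # is_chinese() is assumed to be a helper defined elsewhere in this module.
    if not is_chinese(query):
        return query

    tokens = list(jieba.cut(query))
    # Drop single-character tokens, which add noise to the FTS query.
    tokens = [t for t in tokens if len(t) > 1]

    # Keep the original query first, then append the segmented tokens.
    return f"{query} {' '.join(tokens)}"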
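
When the segmenter returns the whole query as a single token, the expansion above simply repeats the query. One possible variant, sketched here as an assumption rather than the project's code, skips tokens equal to the query and removes duplicates. The names `contains_cjk` and `expand_query` are hypothetical, and the tokenizer is passed in as a parameter so the example runs without jieba installed; in practice one would pass `jieba.lcut`.

```python
import re
from typing import Callable, Iterable, List

# Basic CJK Unified Ideographs block; a simplification that ignores extensions.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def contains_cjk(text: str) -> bool:
    """True if the text has at least one CJK ideograph."""
    return bool(_CJK.search(text))


def expand_query(query: str, tokenize: Callable[[str], Iterable[str]]) -> str:
    """Append distinct multi-character tokens to the query, skipping repeats."""
    if not contains_cjk(query):
        return query  # leave non-Chinese queries untouched

    seen = {query}          # never re-append the query itself
    extra: List[str] = []
    for token in tokenize(query):
        token = token.strip()
        if len(token) > 1 and token not in seen:
            seen.add(token)
            extra.append(token)
    return " ".join([query, *extra])


# Fake tokenizers that mimic possible segmenter outputs (not real jieba results).
split_tokens = lambda q: ["機器學習", "專案"]
single_token = lambda q: [q]

print(expand_query("機器學習專案", split_tokens))
print(expand_query("後端工程師", single_token))
print(expand_query("Docker Kubernetes", split_tokens))
```

Whether repeating the query actually hurts depends on how the FTS engine scores duplicate terms, so this is a tidiness choice more than a correctness fix.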

Integration into hybrid_search_materials:

def hybrid_search_materials(query, limit=10, use_jieba=True):
    # Expand query with jieba for Chinese
    if use_jieba:
        fts_query = expand_query_with_jieba(query)
    else:
        fts_query = query
    
    # LanceDB FTS search with expanded query
    fts_results = table.search(fts_query, query_type="fts").limit(limit).to_list()

Verification:

機器學習專案 → 機器學習專案 機器學習 專案
後端工程師 → 後端工程師 後端工程師 (jieba returns the query as a single token, so it is repeated)
Docker Kubernetes → Docker Kubernetes (English is not expanded)

A Note on Seed Retrieval Performance

A common concern: will it be slow to walk every graph node and compute an embedding on each query?

No. The architecture avoids that:

def seed_retrieval(query: str, top_k: int = 5) -> list[dict]:
    from career_kb.db.lancedb import search_materials
    results = search_materials(query, limit=top_k)  # query LanceDB directly
    return results

def search_materials(query: str, limit: int = 5) -> list[dict]:
    query_vector = embed_query(query)  # the only embedding computed per query
    return table.search(query_vector).limit(limit).to_list()  # vector search

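Item	Description
Material embeddings	Precomputed at `ingest` time and stored in LanceDB
At query time	Only one query embedding is computed
Search	LanceDB finds similar vectors quickly with ANN search (once a vector index is built; without one it scans all rows)

The "N nodes × N API calls" problem never arises.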
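
The cost argument can be made concrete by counting embedding calls. The sketch below is illustrative only: `CountingEmbedder` and `TinyVectorStore` are hypothetical names, the "embedding" is a toy function rather than a model, and a brute-force scan stands in for LanceDB.

```python
from typing import List, Tuple


class CountingEmbedder:
    """Fake embedder that records how many times it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, text: str) -> Tuple[float, float]:
        self.calls += 1
        # Toy features, not a real embedding: length and a character checksum.
        return (float(len(text)), float(sum(map(ord, text)) % 97))


class TinyVectorStore:
    """Stores precomputed vectors; a query embeds only the query text."""

    def __init__(self, embed: CountingEmbedder) -> None:
        self.embed = embed
        self.rows: List[Tuple[str, Tuple[float, float]]] = []

    def ingest(self, texts: List[str]) -> None:
        # One embedding call per document, paid once at ingest time.
        for text in texts:
            self.rows.append((text, self.embed(text)))

    def search(self, query: str, limit: int = 5) -> List[str]:
        q = self.embed(query)  # the single embedding call at query time
        ranked = sorted(
            self.rows,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(q, row[1])),
        )
        return [text for text, _ in ranked[:limit]]


embed = CountingEmbedder()
store = TinyVectorStore(embed)
store.ingest(["doc one", "doc two", "doc three"])
calls_after_ingest = embed.calls

hits = store.search("doc", limit=2)
calls_for_query = embed.calls - calls_after_ingest

print(calls_after_ingest)  # one call per ingested document
print(calls_for_query)     # one call for the query, regardless of store size
```

The stored vectors are compared numerically at query time, so the number of embedding API calls per query stays constant as the collection grows.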

Module layout

py-kb/src/career_kb/
├── graph/
│   ├── entity_resolution.py   # [NEW] vector-similarity entity resolution
│   ├── entity_extractor.py    # LLM entity extraction
│   ├── knowledge_graph.py     # graph wrapper
│   └── hybrid_retrieval.py    # hybrid retrieval
├── rag/
│   ├── bm25.py                # [UPDATED] adds expand_query_with_jieba
│   └── embedding.py           # Voyage AI embedding
└── db/
    └── lancedb.py             # [UPDATED] adds the graph_entities table

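Summary

Feature	Technique	Effect
Entity Resolution	Voyage AI + LanceDB	TSMC ↔ 台積電 merged automatically
Chinese query expansion	jieba segmentation	機器學習專案 → 機器學習專案 機器學習 專案
Threshold control	CLI option	`--similarity-threshold 0.92`

Key insight: the existing embedding infrastructure (Voyage AI + LanceDB) can be reused directly for Entity Resolution, with no need to bring in a dedicated Entity Linking model.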

Career Knowledge Base is a local-first résumé knowledge base built with Python, LanceDB, NetworkX, and LangChain.