CtxFST CH9 - The Pipeline's Missing Link: Generating Dynamic Entity Profiles Without Polluting the Schema

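In earlier CtxFST tutorials we settled on one principle: the Schema defines only the skeleton, and similarity is computed.

But developers implementing GraphRAG hands-on often get stuck on a very practical question: "To compute an embedding, I need some piece of text that represents this Entity, don't I?"

The most intuitive move is to add a representation or description field directly to the YAML in the Markdown file: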

# ❌ A dangerous instinct: hard-coding derived data into the core Schema
entities:
  - id: entity:fastapi
    name: FastAPI
    type: framework
    representation: "FastAPI is a Python web framework for building APIs..."
    mentioned_chunks: ["chunk:python-api", "chunk:docker-deploy"]

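Early in development this looks convenient, because you can read the field and compute a vector straight away. In the CtxFST project, however, we strongly oppose writing it into the v1.1 core Schema.

Why? This article explains how CtxFST keeps its architecture clean: Derived Pipeline Artifacts.

Why Is Writing the Representation into the Schema an Architectural Disaster?

If you put the text used for embeddings, or the reverse links, into the formal SKILL.md specification, you quickly run into three serious problems: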

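It is a derived representation, not a source fact:
An Entity's Name and Type are facts. Its Representation (the profile text) and Mentioned Chunks (where it has been mentioned) are inferred from the article content.
The profile goes stale as soon as an article changes (Stale State):
Suppose you add a passage about using FastAPI with PostgreSQL (chunk:fastapi-pg) to another article. If mentioned_chunks lives in the YAML, you have to go back and edit the YAML by hand; if you don't, the stored state no longer matches the content.
It couples the format standard to the vectorization strategy:
Different downstream scenarios may need different profile text. Some want keyword-enriched text, some want an LLM-generated narrative, and some only need the stacked Chunk Context. Hard-coding one choice into the Schema takes that flexibility away from the whole standard.

👉 Conclusion:

The core Schema (ctxfst-spec.md) should only ever hold the Canonical Identity.
Any information that changes dynamically with article context must be handled in a downstream derived layer.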

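The Recommended Solution: the Derived-Layer Script `build_entity_profiles.py`

To meet this need, we added a dedicated downstream tool to the official CtxFST workflow: skill-chunk-md/scripts/build_entity_profiles.py.

It sits as an intermediate step after the JSON export and before the Graph is built.

The script's job is narrowly scoped:

Read the stable chunks.json.
For each Entity, look up every Linked Chunk that points to it.
Dynamically aggregate the Context, Tags, and Content excerpts of those Chunks.
Write out an entity-profiles.json built specifically for embedding.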
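
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `build_entity_profiles.py`. The `chunks.json` layout assumed here (top-level `entities` and `chunks` lists, each chunk carrying `id`, `context`, `tags`, and an `entities` list of entity IDs) and the helper name `build_profiles` are assumptions made for the example.

```python
import json
from collections import defaultdict


def build_profiles(doc: dict) -> dict:
    """Derive entity profiles from an exported chunk document.

    Assumed (hypothetical) input layout:
      {"entities": [{"id", "name", "type"}, ...],
       "chunks":   [{"id", "context", "tags", "entities": [entity ids]}, ...]}
    """
    # Reverse lookup: entity id -> chunks that link to it.
    linked = defaultdict(list)
    for chunk in doc.get("chunks", []):
        for entity_id in dict.fromkeys(chunk.get("entities", [])):
            linked[entity_id].append(chunk)

    names = {e["id"]: e["name"] for e in doc.get("entities", [])}
    profiles = []
    for entity in doc.get("entities", []):
        chunks = linked[entity["id"]]
        lines = [f"name: {entity['name']}", f"type: {entity['type']}"]

        # Aggregate values from every linked chunk, de-duplicated while
        # preserving first-seen order.
        tags = list(dict.fromkeys(t for c in chunks for t in c.get("tags", [])))
        contexts = list(dict.fromkeys(c["context"] for c in chunks if c.get("context")))
        related = list(dict.fromkeys(
            names[eid]
            for c in chunks
            for eid in c.get("entities", [])
            if eid != entity["id"] and eid in names
        ))
        if tags:
            lines.append("tags: " + ", ".join(tags))
        if contexts:
            lines.append("context: " + " ".join(contexts))
        if related:
            lines.append("related: " + ", ".join(related))

        profiles.append({
            "id": entity["id"],
            "name": entity["name"],
            "type": entity["type"],
            "mentioned_chunks": [c["id"] for c in chunks],
            "representation": "\n".join(lines),
        })
    return {"entities": profiles}


if __name__ == "__main__":
    # Tiny made-up input, only to show the shape of the output.
    sample = {
        "entities": [{"id": "entity:fastapi", "name": "FastAPI", "type": "framework"}],
        "chunks": [{"id": "chunk:python-api", "context": "Building REST APIs",
                    "tags": ["Backend"], "entities": ["entity:fastapi"]}],
    }
    print(json.dumps(build_profiles(sample), indent=2))
```

Because `mentioned_chunks` is recomputed from the chunks on every run, adding a new chunk that references an entity updates its profile without anyone touching the source YAML.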

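What Does a Run Look Like? What Goes into the Representation?

If we run the script, the FastAPI node, which originally carried only a thin layer of id and name, is automatically enriched into the following: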

{
  "entities": [
    {
      "id": "entity:fastapi",
      "name": "FastAPI",
      "type": "framework",
      "mentioned_chunks": ["skill:python"],
      "representation": "name: FastAPI\ntype: framework\ncontext: Python programming skills focusing on data pipelines and REST APIs using FastAPI and Pandas\nrelated: Python Pandas"
    }
  ]
}

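Notice what happened? The system automatically generated a rich representation field (roughly 200-500 tokens) and a mentioned_chunks reverse-index array.

This representation is what gets fed to the Embedding Backend (for example text-embedding-3-small or bge-large-zh). It makes high-quality input because it captures the real Usage Context of this skill in your own notes, rather than a canned Wikipedia definition.

Parser Implementation Tip for Developers: Prefix Tokens

The generated representation text deliberately uses structured Prefix Tokens. If you want to write your own parser to rebuild or clean these strings, match on the prefixes below rather than relying on a fixed line order: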

Prefix	Meaning and source	Example
`name:`	The Entity Name, verbatim	`name: FastAPI`
`type:`	The type attribute defined by the Schema	`type: framework`
`aliases:`	Aliases, if any are defined	`aliases: fastapi-web`
`tags:`	The set of tags used by linked chunks	`tags: Backend, API`
`context:`	Aggregated core descriptions of linked chunks	`context: Python programming skills focusing on...`
`content:`	Excerpts from the body of linked chunks	`content: We implemented a fast web hook using...`
`related:`	Sibling entities that co-occur in the same chunk	`related: Python, Pandas, PostgreSQL`

(💡 Note: this format is not part of the core CtxFST specification (ctxfst-spec.md); it is purely a Derived Pipeline Artifact.)
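
A prefix-driven parser along these lines might look like the sketch below. The function name `parse_representation` is hypothetical, and two behaviors are assumptions rather than documented rules: list-valued fields (`aliases`, `tags`, `related`) are treated as comma-separated, as in the table examples, and any line that does not start with a known prefix is folded into the previous field.

```python
KNOWN_PREFIXES = ("name", "type", "aliases", "tags", "context", "content", "related")
LIST_PREFIXES = ("aliases", "tags", "related")  # assumed to be comma-separated


def parse_representation(text: str) -> dict:
    """Parse a representation string into a dict keyed by prefix.

    Lines are matched by prefix, never by position. A non-empty line without
    a known prefix is appended to the most recent field.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        prefix, sep, value = line.partition(":")
        key = prefix.strip().lower()
        if sep and key in KNOWN_PREFIXES:
            current = key
            fields[current] = value.strip()
        elif current is not None and line.strip():
            fields[current] += " " + line.strip()

    # Split list-valued fields into clean lists.
    for key in LIST_PREFIXES:
        if key in fields:
            fields[key] = [item.strip() for item in fields[key].split(",") if item.strip()]
    return fields
```

Fields that are absent from the string are simply missing from the result, so callers can use `fields.get("aliases", [])` instead of assuming every prefix is present.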

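The Result: A Pipeline Workflow with Three Separated Layers

With this small Profiles Builder in place, CtxFST's layers are cleanly separated:

Core layer (ctxfst-spec.md): handles only Entity identity (declaring who exists) and the one-way ownership of Chunks (which entities a passage belongs to).
Profile layer (entity-profiles.json): handles derived profiles, condensing Chunk history into the strings used for embedding and building the bidirectional links.
Relation layer (entity-graph.json): handles finding neighbors, building the graph Edges from the Cosine Similarity computed on the profile layer.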
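
To make the relation layer concrete, here is a minimal sketch of turning profile embeddings into edges with cosine similarity. It is not the actual `build_entity_graph.py`: the function names, the `0.8` threshold, and the `source`/`target`/`weight` edge fields are all assumptions chosen for illustration, and the vectors are expected to come from whichever Embedding Backend you use.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(embeddings: dict, threshold: float = 0.8) -> list:
    """Return one undirected edge per entity pair whose similarity meets the threshold.

    `embeddings` maps entity id -> embedding vector (a list of floats).
    """
    edges = []
    # Sort by id so the output order is deterministic.
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append({"source": id_a, "target": id_b, "weight": round(score, 4)})
    return edges
```

Comparing every pair is quadratic in the number of entities, which is fine for a personal knowledge base; a larger corpus would likely want an approximate nearest-neighbor index instead.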

As CLI commands, the pipeline runs end to end like this:

# 1. Convert the knowledge-base Markdown into a stable contract
python3 scripts/export_to_lancedb.py docs/ --output chunks.json

# 2. Generate each Entity's dynamic context profile
python3 scripts/build_entity_profiles.py chunks.json --output entity-profiles.json

# 3. Compute the relationship graph from the profiles
python3 scripts/build_entity_graph.py entity-profiles.json --output entity-graph.json

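This decoupled design means that whether you swap in a different large model, change the text summarization strategy, or add multimodal diagrams, you never have to modify the hundreds of hand-written Markdown files at the source. That is what a clean architecture looks like in practice.

📌 CtxFST open-source project: github.com/ctxfst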