# Azure OpenAI **embed-3** → Postgres + pgvector Ingestion Guide (Bible Corpus)
**Goal**: Create a production-ready Python script that ingests the full Bible (Markdown source) into **Postgres** with **pgvector** and **full-text** metadata, using **Azure OpenAI `embed-3`** embeddings. The vectors will power a consumer chat assistant (Q&A and conversations about the Bible) and a backend agent that generates custom prayers.
> Sample corpus used here: Romanian *Biblia Fidela* (Markdown). The structure contains books, chapters, verses (e.g., *Geneza 1:1…*) and a TOC in the file.
---
## 0) Architecture at a glance
- **Input**: Bible in Markdown (`*.md`) → parser → normalized records: *(book, chapter, verse, text, lang=ro)*
- **Embedding**: Azure OpenAI **embed-3** (prefer `text-embedding-3-large`, 3072-D). Batch inputs to cut cost/latency.
- **Storage**: Postgres with:
  - `pgvector` column `embedding vector(3072)`
  - `tsvector` column for hybrid lexical search (Romanian or English config as needed)
  - metadata columns for fast filtering (book, chapter, verse, testament, translation, language)
- **Indexes**: `ivfflat` over `embedding`, GIN over `tsv` (and btree over metadata)
- **Retrieval**:
  - dense vector kNN
  - hybrid: combine kNN score with a `ts_rank` lexical score
  - windowed context stitching (neighbor verses) for chat
- **Consumers**:
  - Chat assistant: answer + cite (book:chapter:verse).
  - Prayer agent: compose prompts from retrieved passages & user intents.
---
## 1) Prerequisites
### Postgres + pgvector
```bash
# Install Postgres (Ubuntu); pgvector itself may need the PGDG apt repo or a source build
sudo apt-get update && sudo apt-get install -y postgresql postgresql-contrib
# Enable the extension (as a superuser) in the target database:
psql -d bible -c "CREATE EXTENSION IF NOT EXISTS vector;"
```
### Python deps
```bash
python -m venv .venv && source .venv/bin/activate
pip install psycopg[binary] pgvector pydantic python-dotenv httpx tqdm rapidfuzz
```
> `httpx` for HTTP (async-capable), `pgvector` adapter, `rapidfuzz` for optional dedup or heuristic joins, `tqdm` for progress.
### Azure OpenAI
- Create an **Embeddings** deployment for **`text-embedding-3-large`** (or `-small` if cost-sensitive). Name it (e.g.) `embeddings`.
- Collect:
  - `AZURE_OPENAI_ENDPOINT=https://<your>.openai.azure.com/`
  - `AZURE_OPENAI_API_KEY=...`
  - `AZURE_OPENAI_API_VERSION=2024-05-01-preview` *(or your current stable)*
  - `AZURE_OPENAI_EMBED_DEPLOYMENT=embeddings` *(your deployment name)*
Create `.env`:
```env
DATABASE_URL=postgresql://user:pass@localhost:5432/bible
AZURE_OPENAI_ENDPOINT=https://YOUR_RESOURCE.openai.azure.com/
AZURE_OPENAI_API_KEY=YOUR_KEY
AZURE_OPENAI_API_VERSION=2024-05-01-preview
AZURE_OPENAI_EMBED_DEPLOYMENT=embeddings
EMBED_DIMS=3072
BIBLE_MD_PATH=./Biblia-Fidela-limba-romana.md
LANG_CODE=ro
TRANSLATION_CODE=FIDELA
```
---
## 2) Database schema
```sql
-- One-time setup in your database
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS bible_passages (
  id          BIGSERIAL PRIMARY KEY,
  testament   TEXT NOT NULL,                -- 'OT' or 'NT'
  book        TEXT NOT NULL,
  chapter     INT  NOT NULL,
  verse       INT  NOT NULL,
  -- Explicit ::text casts keep the generation expression immutable
  ref         TEXT GENERATED ALWAYS AS (book || ' ' || chapter::text || ':' || verse::text) STORED,
  lang        TEXT NOT NULL DEFAULT 'ro',
  translation TEXT NOT NULL DEFAULT 'FIDELA',
  text_raw    TEXT NOT NULL,                -- exact verse text
  text_norm   TEXT NOT NULL,                -- normalized/cleaned text (embedding input)
  tsv         tsvector,
  embedding   vector(3072),                 -- 1536 if using embed-3-small
  created_at  TIMESTAMPTZ DEFAULT now(),
  updated_at  TIMESTAMPTZ DEFAULT now()
);

-- Uniqueness by canonical reference within translation/language
CREATE UNIQUE INDEX IF NOT EXISTS ux_ref_lang
  ON bible_passages (translation, lang, book, chapter, verse);

-- Full-text index (choose config: 'romanian' if installed, else fall back to 'simple' or 'english')
CREATE INDEX IF NOT EXISTS idx_tsv ON bible_passages USING GIN (tsv);

-- Vector index: create after the table is populated (see below).
-- Run ANALYZE first and verify kNN plans with EXPLAIN ANALYZE.
```
After loading, build the IVFFLAT index (the table must be populated first). Note: pgvector's `ivfflat` and `hnsw` indexes currently support `vector` columns only up to 2,000 dimensions, so a `vector(3072)` column cannot be indexed directly; either request fewer dimensions from the embed-3 API (`dimensions` parameter), store `halfvec(3072)` (pgvector ≥ 0.7), or accept exact scans.
```sql
-- Example: around 31k verses ⇒ lists ≈ 100–200 is reasonable; tune per EXPLAIN ANALYZE
CREATE INDEX IF NOT EXISTS idx_vec_ivfflat
ON bible_passages USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 200);
```
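At query time, IVFFLAT recall depends on how many lists are probed; a common starting point (an assumption to tune per workload, not a fixed rule):

```sql
-- Probe more lists for better recall at higher latency (default is 1)
SET ivfflat.probes = 10;
```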
Trigger to keep `updated_at` fresh:
```sql
CREATE OR REPLACE FUNCTION touch_updated_at() RETURNS TRIGGER AS $$
BEGIN NEW.updated_at = now(); RETURN NEW; END; $$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS trg_bible_updated ON bible_passages;
CREATE TRIGGER trg_bible_updated BEFORE UPDATE ON bible_passages
FOR EACH ROW EXECUTE PROCEDURE touch_updated_at();
```
---
## 3) Parsing & Chunking strategy (large, high-quality)
**Why verse-level?** It's the canonical granular unit for Bible Q&A.
**Context stitching**: during retrieval, fetch neighbor verses (±N) to maintain narrative continuity.
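Neighbor stitching can be sketched in plain Python (a minimal sketch; the `stitch_window` name and the verse-tuple shape are illustrative, not part of the schema):

```python
from typing import List, Tuple

# A verse is (chapter, verse, text); assume verses are already sorted canonically.
Verse = Tuple[int, int, str]

def stitch_window(verses: List[Verse], hit_index: int, n: int = 2) -> List[Verse]:
    """Return the hit verse plus up to n neighbors on each side, clamped to bounds."""
    lo = max(0, hit_index - n)
    hi = min(len(verses), hit_index + n + 1)
    return verses[lo:hi]
```

In practice the same windowing is done in SQL by filtering on `(book, chapter)` and a verse range around the hit.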
**Normalization** steps (for `text_norm`):
- Strip verse numbers and side notes if present in raw lines.
- Collapse whitespace, unify quotes, remove page headers/footers and TOC artifacts.
- Preserve punctuation; avoid stemming before embeddings.
- Lowercasing is optional (OpenAI embeddings are case-robust).
**Testament/book detection**: from the headings and TOC present in the Markdown; detect Book → Chapter → Verse boundaries via regex.
Example regex heuristics (tune to your file):
- Book headers: `^(?P<book>[A-ZĂÂÎȘȚ].+?)\s*$` (bounded by known canon order)
- Chapter headers: `^Capitolul\s+(?P<ch>\d+)`, matched case-insensitively (the source mixes casings such as `CApitoLuL`)
- Verse lines: `^(?P<verse>\d+)\s+(.+)$`
> The provided Markdown clearly shows book order (e.g., *Geneza*, *Exodul*, …; NT: *Matei*, *Marcu*, …) and verse lines like “**1** LA început…”.
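The heuristics above can be exercised on a tiny sample (a sketch; the sample lines are invented for illustration and use plain ASCII, not text from the actual file):

```python
import re

# Scoped inline flag (?i:...) makes only the keyword case-insensitive (Python 3.6+)
CH_RE = re.compile(r"^(?i:Capitolul)\s+(?P<ch>\d+)\b")
VERSE_RE = re.compile(r"^(?P<v>\d+)\s+(?P<body>.+)$")

sample = [
    "CApitoLuL 1",
    "1 La inceput a creat Dumnezeu cerul si pamantul.",
    "2 Si pamantul era fara forma si gol.",
]

chapter = None
parsed = []
for line in sample:
    m_ch = CH_RE.match(line)
    if m_ch:
        chapter = int(m_ch.group("ch"))
        continue
    m_v = VERSE_RE.match(line)
    if m_v and chapter:
        parsed.append((chapter, int(m_v.group("v")), m_v.group("body")))
```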
---
## 4) Python ingestion script
> **Save as** `ingest_bible_pgvector.py`
```python
import os
import re
import asyncio
from dataclasses import dataclass
from pathlib import Path

import httpx
import psycopg
from dotenv import load_dotenv

load_dotenv()

AZ_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT", "").rstrip("/")
AZ_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZ_API_VER = os.getenv("AZURE_OPENAI_API_VERSION", "2024-05-01-preview")
AZ_DEPLOYMENT = os.getenv("AZURE_OPENAI_EMBED_DEPLOYMENT", "embeddings")
EMBED_DIMS = int(os.getenv("EMBED_DIMS", "3072"))  # must match the vector(n) column
DB_URL = os.getenv("DATABASE_URL")
BIBLE_MD_PATH = os.getenv("BIBLE_MD_PATH")
LANG_CODE = os.getenv("LANG_CODE", "ro")
TRANSLATION = os.getenv("TRANSLATION_CODE", "FIDELA")

assert AZ_ENDPOINT and AZ_API_KEY and DB_URL and BIBLE_MD_PATH, "Missing required env vars"

EMBED_URL = f"{AZ_ENDPOINT}/openai/deployments/{AZ_DEPLOYMENT}/embeddings?api-version={AZ_API_VER}"

BOOKS_OT = [
    "Geneza","Exodul","Leviticul","Numeri","Deuteronom","Iosua","Judecători","Rut",
    "1 Samuel","2 Samuel","1 Imparati","2 Imparati","1 Cronici","2 Cronici","Ezra","Neemia","Estera",
    "Iov","Psalmii","Proverbe","Eclesiastul","Cântarea Cântărilor","Isaia","Ieremia","Plângerile",
    "Ezechiel","Daniel","Osea","Ioel","Amos","Obadia","Iona","Mica","Naum","Habacuc","Țefania","Hagai","Zaharia","Maleahi"
]
BOOKS_NT = [
    "Matei","Marcu","Luca","Ioan","Faptele Apostolilor","Romani","1 Corinteni","2 Corinteni",
    "Galateni","Efeseni","Filipeni","Coloseni","1 Tesaloniceni","2 Tesaloniceni","1 Timotei","2 Timotei",
    "Titus","Filimon","Evrei","Iacov","1 Petru","2 Petru","1 Ioan","2 Ioan","3 Ioan","Iuda","Revelaţia"
]
BOOK_CANON = {b: ("OT" if b in BOOKS_OT else "NT") for b in BOOKS_OT + BOOKS_NT}


@dataclass
class Verse:
    testament: str
    book: str
    chapter: int
    verse: int
    text_raw: str
    text_norm: str


def normalize_text(s: str) -> str:
    s = s.replace("\u00a0", " ")  # non-breaking space -> regular space
    s = re.sub(r"\s+", " ", s.strip())
    return s


BOOK_RE = re.compile(r"^(?P<book>[A-ZĂÂÎȘȚ][^\n]+?)\s*$")
CH_RE = re.compile(r"^(?i:Capitolul)\s+(?P<ch>\d+)\b")  # (?i:) covers casings like 'CApitoLuL'
VERSE_RE = re.compile(r"^(?P<v>\d+)\s+(?P<body>.+)$")


def parse_bible_md(md_text: str):
    cur_book, cur_ch, testament = None, None, None
    for line in md_text.splitlines():
        line = line.rstrip()
        # Book detection: only headers matching the known canon switch books
        m_book = BOOK_RE.match(line)
        if m_book:
            bname = m_book.group("book").strip()
            if bname in BOOK_CANON:
                cur_book = bname
                testament = BOOK_CANON[bname]
                cur_ch = None
                continue
        m_ch = CH_RE.match(line)
        if m_ch and cur_book:
            cur_ch = int(m_ch.group("ch"))
            continue
        m_v = VERSE_RE.match(line)
        if m_v and cur_book and cur_ch:
            body = m_v.group("body").strip()
            yield {
                "testament": testament, "book": cur_book, "chapter": cur_ch,
                "verse": int(m_v.group("v")),
                "text_raw": body, "text_norm": normalize_text(body),
            }


async def embed_batch(client: httpx.AsyncClient, inputs: list[str]) -> list[list[float]]:
    payload = {"input": inputs}
    headers = {"api-key": AZ_API_KEY, "Content-Type": "application/json"}
    for attempt in range(6):
        try:
            r = await client.post(EMBED_URL, headers=headers, json=payload, timeout=60)
        except httpx.HTTPError:
            await asyncio.sleep(2 ** attempt + 0.1 * attempt)
            continue
        if r.status_code == 200:
            data = r.json()
            # The API may return items out of order; restore input order by index
            ordered = sorted(data["data"], key=lambda x: x["index"])
            return [d["embedding"] for d in ordered]
        if r.status_code in (429, 500, 503):  # retryable: back off and try again
            await asyncio.sleep(2 ** attempt + 0.1 * attempt)
            continue
        raise RuntimeError(f"Embedding error {r.status_code}: {r.text}")  # not retryable
    raise RuntimeError("Failed to embed after retries")


UPSERT_SQL = """
INSERT INTO bible_passages (testament, book, chapter, verse, lang, translation, text_raw, text_norm, tsv, embedding)
VALUES (%(testament)s, %(book)s, %(chapter)s, %(verse)s, %(lang)s, %(translation)s, %(text_raw)s, %(text_norm)s,
        to_tsvector(COALESCE(%(ts_lang)s, 'simple')::regconfig, %(text_norm)s), %(embedding)s::vector)
ON CONFLICT (translation, lang, book, chapter, verse) DO UPDATE
SET text_raw = EXCLUDED.text_raw,
    text_norm = EXCLUDED.text_norm,
    tsv = EXCLUDED.tsv,
    embedding = EXCLUDED.embedding,
    updated_at = now();
"""


async def main():
    md_text = Path(BIBLE_MD_PATH).read_text(encoding="utf-8", errors="ignore")
    verses = list(parse_bible_md(md_text))
    print(f"Parsed verses: {len(verses)}")
    batch_size = 128
    async with httpx.AsyncClient() as client:
        # psycopg connections are synchronous context managers; keep them out of `async with`
        with psycopg.connect(DB_URL, autocommit=False) as conn, conn.cursor() as cur:
            for i in range(0, len(verses), batch_size):
                batch = verses[i:i + batch_size]
                embs = await embed_batch(client, [v["text_norm"] for v in batch])
                rows = [
                    {
                        **v,
                        "lang": LANG_CODE,
                        "translation": TRANSLATION,
                        "ts_lang": "romanian",  # set to 'simple' if the romanian config is absent
                        # Pass the vector as its text literal and cast in SQL;
                        # avoids depending on a client-side pgvector adapter.
                        "embedding": "[" + ",".join(map(str, e)) + "]",
                    }
                    for v, e in zip(batch, embs)
                ]
                cur.executemany(UPSERT_SQL, rows)
                conn.commit()
                print(f"Upserted {len(rows)} ({i + len(rows)}/{len(verses)})")
    print("Done. Run ANALYZE, then build the IVFFLAT index.")


if __name__ == "__main__":
    asyncio.run(main())
```
**Notes**
- If the `romanian` text search config is unavailable, set `ts_lang='simple'`.
- For `embed-3-small`, set `EMBED_DIMS=1536` and change the column type to `vector(1536)`.
- `text-embedding-3` models also accept a `dimensions` request parameter to return shortened vectors, which helps stay under pgvector's 2,000-dimension index limit.
---
## 5) Post-ingestion steps
```sql
VACUUM ANALYZE bible_passages;
CREATE INDEX IF NOT EXISTS idx_vec_ivfflat
ON bible_passages USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 200);
CREATE INDEX IF NOT EXISTS idx_book_ch ON bible_passages (book, chapter);
```
---
## 6) Retrieval patterns
### A) Pure vector kNN (cosine)
```sql
SELECT ref, book, chapter, verse, text_raw,
1 - (embedding <=> $1) AS cosine_sim
FROM bible_passages
ORDER BY embedding <=> $1
LIMIT $2;
```
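`<=>` is pgvector's cosine-distance operator, so `1 - (embedding <=> $1)` is ordinary cosine similarity. A quick sanity check of that identity in plain Python (no database needed):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity = dot(a, b) / (|a| * |b|); pgvector's <=> returns 1 minus this."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```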
### B) Hybrid lexical + vector (weighted)
```sql
WITH v AS (
SELECT id, 1 - (embedding <=> $1) AS vsim
FROM bible_passages
ORDER BY embedding <=> $1
LIMIT 100
),
l AS (
SELECT id, ts_rank(tsv, $2) AS lrank
FROM bible_passages
WHERE tsv @@ $2
)
SELECT bp.ref, bp.book, bp.chapter, bp.verse, bp.text_raw,
COALESCE(v.vsim, 0) * 0.7 + COALESCE(l.lrank, 0) * 0.3 AS score
FROM bible_passages bp
LEFT JOIN v ON v.id = bp.id
LEFT JOIN l ON l.id = bp.id
WHERE v.id IS NOT NULL OR l.id IS NOT NULL  -- keep only candidates from either branch
ORDER BY score DESC
LIMIT 20;
```
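The 0.7/0.3 fusion can be prototyped offline when tuning weights (a sketch; `hybrid_score` and the `vsim`/`lrank` dicts, standing in for the two CTEs, are illustrative):

```python
def hybrid_score(vsim: dict, lrank: dict, w_vec: float = 0.7, w_lex: float = 0.3):
    """Fuse per-id vector similarity and lexical rank, mirroring the SQL COALESCE(...) * w sum."""
    ids = set(vsim) | set(lrank)
    scored = {i: vsim.get(i, 0.0) * w_vec + lrank.get(i, 0.0) * w_lex for i in ids}
    # Highest combined score first, like ORDER BY score DESC
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```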
---
## 7) Chat & Prayer agent tips
- **Answer grounding**: always cite `ref` (e.g., *Ioan 3:16*).
- **Multilingual output**: keep quotes in Romanian; explain in the user's language.
- **Prayer agent**: constrain tone & doctrine; inject retrieved verses as anchors.
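Prompt composition for the prayer agent might look like this (a sketch; the function name and template wording are illustrative, with retrieved `(ref, text)` pairs as the anchors):

```python
def compose_prayer_prompt(intent: str, passages: list[tuple[str, str]]) -> str:
    """Build a grounded prompt from the user's intent and retrieved (ref, text) passages."""
    anchors = "\n".join(f"- {ref}: {text}" for ref, text in passages)
    return (
        "Compose a short prayer grounded ONLY in the passages below.\n"
        f"User intent: {intent}\n"
        f"Passages:\n{anchors}\n"
        "Cite each passage you draw on by its reference."
    )
```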
---
## 8) Ops
- Idempotent `UPSERT`.
- Backoff on 429/5xx.
- Consider keeping both `embed-3-large` and `-small` columns when migrating.
---
## 9) License & attribution
This guide references the structure of the *Biblia Fidela* Markdown for ingestion demonstration.