Azure OpenAI embed-3 → Postgres + pgvector Ingestion Guide (Bible Corpus)

Goal: Create a production-ready Python script that ingests the full Bible (Markdown source) into Postgres with pgvector and full-text metadata, using Azure OpenAI embed-3 embeddings. The vectors will power a consumer chat assistant (Q&A and conversations about the Bible) and a backend agent that generates custom prayers.

Sample corpus used here: Romanian Biblia Fidela (Markdown). Structure contains books, chapters, verses (e.g., Geneza 1:1…) and a TOC in the file.


0) Architecture at a glance

  • Input: Bible in Markdown (*.md) → parser → normalized records: (book, chapter, verse, text, lang=ro)
  • Embedding: Azure OpenAI embed-3 (prefer text-embedding-3-large, 3,072 dimensions). Batch inputs to cut cost and latency.
  • Storage: Postgres with:
    • pgvector column embedding vector(3072)
    • tsvector column for hybrid lexical search (Romanian or English config as needed)
    • metadata columns for fast filtering (book, chapter, verse, testament, translation, language)
  • Indexes: ivfflat over embedding, GIN over tsv (and btree over metadata)
  • Retrieval:
    • Dense vector kNN
    • Hybrid: combine kNN score + BM25/tsvector
    • Windowed context stitching (neighbor verses) for chat
  • Consumers:
    • Chat assistant: answer + cite (book:chapter:verse).
    • Prayer agent: prompt composition with retrieved passages & user intents.

1) Prerequisites

Postgres + pgvector

# Install pgvector (on Ubuntu)
sudo apt-get update && sudo apt-get install -y postgresql postgresql-contrib
# In psql as superuser:
CREATE EXTENSION IF NOT EXISTS vector;

Python deps

python -m venv .venv && source .venv/bin/activate
pip install psycopg[binary] pgvector pydantic python-dotenv httpx tqdm rapidfuzz

httpx for HTTP (async-capable), the pgvector adapter, rapidfuzz for optional dedup or heuristic joins, tqdm for progress.

Azure OpenAI

  • Create an Embeddings deployment for text-embedding-3-large (or -small if cost-sensitive). Name it, e.g., embeddings.
  • Collect:
    • AZURE_OPENAI_ENDPOINT=https://<your>.openai.azure.com/
    • AZURE_OPENAI_API_KEY=...
    • AZURE_OPENAI_API_VERSION=2024-05-01-preview (or your current stable)
    • AZURE_OPENAI_EMBED_DEPLOYMENT=embeddings (your deployment name)

Create .env:

DATABASE_URL=postgresql://user:pass@localhost:5432/bible
AZURE_OPENAI_ENDPOINT=https://YOUR_RESOURCE.openai.azure.com/
AZURE_OPENAI_API_KEY=YOUR_KEY
AZURE_OPENAI_API_VERSION=2024-05-01-preview
AZURE_OPENAI_EMBED_DEPLOYMENT=embeddings
EMBED_DIMS=3072
BIBLE_MD_PATH=./Biblia-Fidela-limba-romana.md
LANG_CODE=ro
TRANSLATION_CODE=FIDELA

2) Database schema

-- One-time setup in your database
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS bible_passages (
  id               BIGSERIAL PRIMARY KEY,
  testament        TEXT NOT NULL,           -- 'OT' or 'NT'
  book             TEXT NOT NULL,
  chapter          INT  NOT NULL,
  verse            INT  NOT NULL,
  ref              TEXT GENERATED ALWAYS AS (book || ' ' || chapter || ':' || verse) STORED,
  lang             TEXT NOT NULL DEFAULT 'ro',
  translation      TEXT NOT NULL DEFAULT 'FIDELA',
  text_raw         TEXT NOT NULL,           -- exact verse text
  text_norm        TEXT NOT NULL,           -- normalized/cleaned text (embedding input)
  tsv              tsvector,
  embedding        vector(3072),            -- 1536 if using embed-3-small
  created_at       TIMESTAMPTZ DEFAULT now(),
  updated_at       TIMESTAMPTZ DEFAULT now()
);

-- Uniqueness by canonical reference within translation/language
CREATE UNIQUE INDEX IF NOT EXISTS ux_ref_lang ON bible_passages (translation, lang, book, chapter, verse);

-- Full-text index (choose config; Romanian available if installed via ISPELL; else use 'simple' or 'english')
-- If you have pg_catalog.romanian, use that. Else fallback to 'simple' but keep lexemes.
CREATE INDEX IF NOT EXISTS idx_tsv ON bible_passages USING GIN (tsv);

-- Vector index (choose 'lists' to match data size; create it after populating the table).
-- For small data an exact (no-index) scan is fine; use IVFFLAT at scale.
-- Run ANALYZE after loading; to verify kNN plans you can temporarily SET enable_seqscan = off.

After loading, build the IVFFLAT index (the table must be populated first):

-- Example: ~31k verses ⇒ lists ≈ 100–200 is reasonable; tune per EXPLAIN ANALYZE.
-- NOTE: pgvector's ivfflat (and hnsw) indexes support at most 2,000 dimensions.
-- For 3072-dim embeddings, either request fewer dimensions from the API
-- (embed-3 supports a 'dimensions' parameter) or use halfvec (pgvector >= 0.7).
CREATE INDEX IF NOT EXISTS idx_vec_ivfflat
ON bible_passages USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 200);

Trigger to keep updated_at fresh:

CREATE OR REPLACE FUNCTION touch_updated_at() RETURNS TRIGGER AS $$
BEGIN NEW.updated_at = now(); RETURN NEW; END; $$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS trg_bible_updated ON bible_passages;
CREATE TRIGGER trg_bible_updated BEFORE UPDATE ON bible_passages
FOR EACH ROW EXECUTE PROCEDURE touch_updated_at();

3) Parsing & chunking strategy (large, high-quality)

Why verse-level? It's the canonical granular unit for Bible Q&A.
Context-stitching: during retrieval, fetch neighboring verses (±N) to maintain narrative continuity.
Normalization steps (for text_norm):

  • Strip verse numbers and side notes if present in raw lines.
  • Collapse whitespace, unify quotes, remove page headers/footers and TOC artifacts.
  • Preserve punctuation; avoid stemming before embeddings.
  • Lowercasing optional (OpenAI embeddings are case-robust).

Testament/book detection: From headings and TOC present in the Markdown; detect Book → Chapter → Verse boundaries via regex.
Example regex heuristics (tune to your file):

  • Book headers: ^(?P<book>[A-ZĂÂÎȘȚ].+?)\s*$ (bounded by known canon order)
  • Chapter headers: ^Capitolul\s+(?P<ch>\d+) or ^CApitoLuL\s+(?P<ch>\d+) (case variations)
  • Verse lines: ^(?P<verse>\d+)\s+(.+)$

The provided Markdown clearly shows book order (e.g., Geneza, Exodul, …; NT: Matei, Marcu, …) and verse lines like “1 LA început…”.
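As a quick sanity check, the chapter and verse heuristics can be exercised against a synthetic snippet (the lines below are illustrative ASCII stand-ins, not quotes from the file):

```python
import re

CH_RE = re.compile(r"^(?i:Capitolul)\s+(?P<ch>\d+)\b")
VERSE_RE = re.compile(r"^(?P<v>\d+)\s+(?P<body>.+)$")
KNOWN_BOOKS = {"Geneza"}  # a one-book subset of the canon list, for the demo

sample = [
    "Geneza",
    "Capitolul 1",
    "1 Sample text for verse one.",
    "2 Sample text for verse two.",
]

book, chapter, verses = None, None, []
for line in sample:
    if line in KNOWN_BOOKS:          # book header (validated against the canon)
        book, chapter = line, None
        continue
    m = CH_RE.match(line)
    if m and book:                   # chapter header
        chapter = int(m.group("ch"))
        continue
    m = VERSE_RE.match(line)
    if m and book and chapter:       # verse line: number + body
        verses.append((book, chapter, int(m.group("v")), m.group("body")))

print(verses[0][:3])  # → ('Geneza', 1, 1)
```

Validating book names against the canon list is what keeps the permissive book regex from misfiring on ordinary capitalized lines.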


4) Python ingestion script

Save as ingest_bible_pgvector.py

import os, re, json, asyncio
from dataclasses import dataclass
from pathlib import Path
from dotenv import load_dotenv
import httpx
import psycopg

load_dotenv()

AZ_ENDPOINT   = os.getenv("AZURE_OPENAI_ENDPOINT", "").rstrip("/")
AZ_API_KEY    = os.getenv("AZURE_OPENAI_API_KEY")
AZ_API_VER    = os.getenv("AZURE_OPENAI_API_VERSION", "2024-05-01-preview")
AZ_DEPLOYMENT = os.getenv("AZURE_OPENAI_EMBED_DEPLOYMENT", "embeddings")
EMBED_DIMS    = int(os.getenv("EMBED_DIMS", "3072"))
DB_URL        = os.getenv("DATABASE_URL")
BIBLE_MD_PATH = os.getenv("BIBLE_MD_PATH")
LANG_CODE     = os.getenv("LANG_CODE", "ro")
TRANSLATION   = os.getenv("TRANSLATION_CODE", "FIDELA")

assert AZ_ENDPOINT and AZ_API_KEY and DB_URL and BIBLE_MD_PATH, "Missing required env vars"

EMBED_URL = f"{AZ_ENDPOINT}/openai/deployments/{AZ_DEPLOYMENT}/embeddings?api-version={AZ_API_VER}"

BOOKS_OT = [
  "Geneza","Exodul","Leviticul","Numeri","Deuteronom","Iosua","Judecători","Rut",
  "1 Samuel","2 Samuel","1 Imparati","2 Imparati","1 Cronici","2 Cronici","Ezra","Neemia","Estera",
  "Iov","Psalmii","Proverbe","Eclesiastul","Cântarea Cântărilor","Isaia","Ieremia","Plângerile",
  "Ezechiel","Daniel","Osea","Ioel","Amos","Obadia","Iona","Mica","Naum","Habacuc","Țefania","Hagai","Zaharia","Maleahi"
]
BOOKS_NT = [
  "Matei","Marcu","Luca","Ioan","Faptele Apostolilor","Romani","1 Corinteni","2 Corinteni",
  "Galateni","Efeseni","Filipeni","Coloseni","1 Tesaloniceni","2 Tesaloniceni","1 Timotei","2 Timotei",
  "Titus","Filimon","Evrei","Iacov","1 Petru","2 Petru","1 Ioan","2 Ioan","3 Ioan","Iuda","Revelaţia"
]

BOOK_CANON = {b:("OT" if b in BOOKS_OT else "NT") for b in BOOKS_OT + BOOKS_NT}

@dataclass
class Verse:
    testament: str
    book: str
    chapter: int
    verse: int
    text_raw: str
    text_norm: str

def normalize_text(s: str) -> str:
    # Collapse all whitespace runs into single spaces.
    return re.sub(r"\s+", " ", s.strip())

BOOK_RE   = re.compile(r"^(?P<book>[A-ZĂÂÎȘȚ][^\n]+?)\s*$")
CH_RE     = re.compile(r"^(?i:Capitolul)\s+(?P<ch>\d+)\b")  # (?i:) already covers case variants like CApitoLuL
VERSE_RE  = re.compile(r"^(?P<v>\d+)\s+(?P<body>.+)$")

def parse_bible_md(md_text: str):
    cur_book, cur_ch = None, None
    testament = None
    for line in md_text.splitlines():
        line = line.rstrip()

        # Book detection
        m_book = BOOK_RE.match(line)
        if m_book:
            bname = m_book.group("book").strip()
            if bname in BOOK_CANON:
                cur_book = bname
                testament = BOOK_CANON[bname]
                cur_ch = None
                continue

        m_ch = CH_RE.match(line)
        if m_ch and cur_book:
            cur_ch = int(m_ch.group("ch"))
            continue

        m_v = VERSE_RE.match(line)
        if m_v and cur_book and cur_ch:
            vnum = int(m_v.group("v"))
            body = m_v.group("body").strip()
            raw = body
            norm = normalize_text(body)
            yield {
                "testament": testament, "book": cur_book, "chapter": cur_ch, "verse": vnum,
                "text_raw": raw, "text_norm": norm
            }

async def embed_batch(client, inputs):
    payload = {"input": inputs}
    headers = {"api-key": AZ_API_KEY, "Content-Type": "application/json"}
    for attempt in range(6):
        try:
            r = await client.post(EMBED_URL, headers=headers, json=payload, timeout=60)
        except httpx.HTTPError:
            # Network-level failure: back off and retry.
            await asyncio.sleep(2 ** attempt + 0.1 * attempt)
            continue
        if r.status_code == 200:
            data = r.json()
            # Items are not guaranteed to arrive in input order; sort by index.
            ordered = sorted(data["data"], key=lambda x: x["index"])
            return [d["embedding"] for d in ordered]
        if r.status_code in (429, 500, 503):
            # Throttled or transient server error: back off and retry.
            await asyncio.sleep(2 ** attempt + 0.1 * attempt)
            continue
        # Non-retryable error: fail fast instead of swallowing it in the retry loop.
        raise RuntimeError(f"Embedding error {r.status_code}: {r.text}")
    raise RuntimeError("Failed to embed after retries")

UPSERT_SQL = """
INSERT INTO bible_passages (testament, book, chapter, verse, lang, translation, text_raw, text_norm, tsv, embedding)
VALUES (%(testament)s, %(book)s, %(chapter)s, %(verse)s, %(lang)s, %(translation)s, %(text_raw)s, %(text_norm)s,
        to_tsvector(COALESCE(%(ts_lang)s,'simple')::regconfig, %(text_norm)s), %(embedding)s::vector)
ON CONFLICT (translation, lang, book, chapter, verse) DO UPDATE
SET text_raw=EXCLUDED.text_raw,
    text_norm=EXCLUDED.text_norm,
    tsv=EXCLUDED.tsv,
    embedding=EXCLUDED.embedding,
    updated_at=now();
"""

async def main():
    md_text = Path(BIBLE_MD_PATH).read_text(encoding="utf-8", errors="ignore")
    verses = list(parse_bible_md(md_text))
    print(f"Parsed verses: {len(verses)}")

    batch_size = 128
    async with httpx.AsyncClient() as client, psycopg.connect(DB_URL, autocommit=False) as conn:
        with conn.cursor() as cur:
            for i in range(0, len(verses), batch_size):
                batch = verses[i:i+batch_size]
                inputs = [v["text_norm"] for v in batch]
                embs = await embed_batch(client, inputs)
                rows = []
                for v, e in zip(batch, embs):
                    rows.append({
                        **v,
                        "lang": LANG_CODE,
                        "translation": TRANSLATION,
                        "ts_lang": "romanian",  # use 'simple' if the romanian config is unavailable
                        "embedding": e
                    })
                cur.executemany(UPSERT_SQL, rows)
                conn.commit()
                print(f"Upserted {i+len(rows)}/{len(verses)} verses")
    print("Done. Build IVFFLAT index after ANALYZE.")

if __name__ == "__main__":
    asyncio.run(main())

Notes

  • If romanian text search config is unavailable, set ts_lang='simple'.
  • For embed-3-small, set EMBED_DIMS=1536 and change column type to vector(1536).
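Because psycopg does not adapt plain Python lists to the vector type by itself, one portable option is to send each embedding as a pgvector text literal and cast it with ::vector in SQL. A minimal formatter sketch (the name and rounding are ours, not part of any library):

```python
def to_pgvector(values, precision=7):
    """Render floats as a pgvector text literal, e.g. '[0.25,-1,3.5]'."""
    return "[" + ",".join(f"{v:.{precision}g}" for v in values) + "]"

print(to_pgvector([0.25, -1.0, 3.5]))  # → [0.25,-1,3.5]
```

The resulting string binds as an ordinary text parameter, e.g. `VALUES (%(embedding)s::vector)`; alternatively, the pgvector Python package ships a psycopg adapter (`register_vector`).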

5) Post-ingestion steps

VACUUM ANALYZE bible_passages;
CREATE INDEX IF NOT EXISTS idx_vec_ivfflat
ON bible_passages USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 200);
CREATE INDEX IF NOT EXISTS idx_book_ch ON bible_passages (book, chapter);
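For choosing lists, the pgvector README suggests roughly rows / 1000 up to about 1M rows and sqrt(rows) beyond that. A small helper (ivfflat_lists is our name) to compute a starting point you can then tune with EXPLAIN ANALYZE:

```python
import math

def ivfflat_lists(n_rows: int) -> int:
    """Starting value for ivfflat's 'lists', per pgvector's rule of thumb."""
    if n_rows <= 1_000_000:
        return max(1, n_rows // 1000)
    return round(math.sqrt(n_rows))

print(ivfflat_lists(31_000))     # → 31
print(ivfflat_lists(4_000_000))  # → 2000
```

The guide's lists = 200 is a defensible alternative for ~31k rows (more, smaller lists); either choice deserves benchmarking on real queries.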

6) Retrieval patterns

A) Pure vector kNN (cosine)

SELECT ref, book, chapter, verse, text_raw,
       1 - (embedding <=> $1) AS cosine_sim
FROM bible_passages
ORDER BY embedding <=> $1
LIMIT $2;
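For intuition, `<=>` is cosine distance, so `1 - (embedding <=> $1)` is cosine similarity. A stdlib-only Python equivalent of what the operator computes:

```python
import math

def cosine_distance(a, b):
    """What pgvector's <=> computes: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # → 1.0 (orthogonal vectors)
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # → 0.0 (identical direction)
```

Ordering by the raw distance (as the query does) and ordering by descending similarity are equivalent; computing the similarity in the SELECT list is only for display.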

B) Hybrid lexical + vector (weighted)

WITH v AS (
  SELECT id, 1 - (embedding <=> $1) AS vsim
  FROM bible_passages
  ORDER BY embedding <=> $1
  LIMIT 100
),
l AS (
  SELECT id, ts_rank(tsv, $2) AS lrank
  FROM bible_passages
  WHERE tsv @@ $2
)
SELECT bp.ref, bp.book, bp.chapter, bp.verse, bp.text_raw,
       COALESCE(v.vsim, 0) * 0.7 + COALESCE(l.lrank, 0) * 0.3 AS score
FROM bible_passages bp
LEFT JOIN v ON v.id = bp.id
LEFT JOIN l ON l.id = bp.id
ORDER BY score DESC
LIMIT 20;
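The same 0.7/0.3 blend can be prototyped client-side when the dense and lexical result sets come back from separate queries (the ids and scores below are made up for illustration):

```python
def hybrid_scores(vector_hits, lexical_hits, w_vec=0.7, w_lex=0.3):
    """Merge {id: score} dicts from dense and lexical retrieval, best first."""
    ids = set(vector_hits) | set(lexical_hits)
    merged = {
        i: w_vec * vector_hits.get(i, 0.0) + w_lex * lexical_hits.get(i, 0.0)
        for i in ids
    }
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

vec = {101: 0.92, 102: 0.85}       # cosine similarities from kNN
lex = {102: 0.40, 103: 0.75}       # ts_rank scores from full-text search
print(hybrid_scores(vec, lex)[0][0])  # → 102 (strong in both signals)
```

Note that ts_rank and cosine similarity live on different scales; normalizing each score list before blending usually makes the weights more meaningful.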

7) Chat & Prayer agent tips

  • Answer grounding: always cite ref (e.g., Ioan 3:16).
  • Multilingual output: keep quotes in Romanian; explain in the user's language.
  • Prayer agent: constrain tone & doctrine; inject retrieved verses as anchors.
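The neighbor-window stitching mentioned in the architecture section can run after retrieval: fetch ±N verses around each hit within the same chapter and join them into one context block (the function name and placeholder texts below are ours):

```python
def stitch_window(verses_by_num, hit_verse, n=1):
    """Join verses hit-n .. hit+n from a {verse_number: text} map of one chapter."""
    window = range(hit_verse - n, hit_verse + n + 1)
    return " ".join(verses_by_num[v] for v in window if v in verses_by_num)

chapter = {1: "verse one.", 2: "verse two.", 3: "verse three."}
print(stitch_window(chapter, 2, n=1))  # → verse one. verse two. verse three.
print(stitch_window(chapter, 1, n=1))  # → verse one. verse two.
```

In SQL terms this maps to a `WHERE book = ... AND chapter = ... AND verse BETWEEN hit - n AND hit + n` fetch per hit, which the (book, chapter) btree index serves cheaply.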

8) Ops

  • Idempotent UPSERT.
  • Backoff on 429/5xx.
  • Consider keeping both embed-3-large and -small columns when migrating.

9) License & attribution

This guide references the structure of the Biblia Fidela Markdown for ingestion demonstration.