Implement Azure OpenAI vector embeddings for Romanian Bible

- Add pgvector support with bible_passages table for vector search
- Create Python ingestion script for Azure OpenAI embed-3 embeddings
- Implement hybrid search combining vector similarity and full-text search
- Update AI chat to use vector search with Azure OpenAI gpt-4o
- Add floating chat component with Material UI design
- Import complete Romanian Bible (FIDELA) with 30K+ verses
- Add vector search library for semantic Bible search
- Create multi-language implementation plan for future expansion

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
andupetcu
2025-09-20 15:18:00 +03:00
parent 3b375c869b
commit dd5e1102eb
14 changed files with 2082 additions and 68 deletions

View File

@@ -11,6 +11,12 @@ JWT_SECRET=development-jwt-secret-change-in-production
AZURE_OPENAI_KEY=4DhkkXVdDOXZ7xX1eOLHTHQQnbCy0jFYdA6RPJtyAdOMtO16nZmFJQQJ99BCACYeBjFXJ3w3AAABACOGHgNC AZURE_OPENAI_KEY=4DhkkXVdDOXZ7xX1eOLHTHQQnbCy0jFYdA6RPJtyAdOMtO16nZmFJQQJ99BCACYeBjFXJ3w3AAABACOGHgNC
AZURE_OPENAI_ENDPOINT=https://azureopenaiinstant.openai.azure.com AZURE_OPENAI_ENDPOINT=https://azureopenaiinstant.openai.azure.com
AZURE_OPENAI_DEPLOYMENT=gpt-4o AZURE_OPENAI_DEPLOYMENT=gpt-4o
AZURE_OPENAI_API_VERSION=2024-05-01-preview
AZURE_OPENAI_EMBED_DEPLOYMENT=embed-3
EMBED_DIMS=3072
BIBLE_MD_PATH=./bibles/Biblia-Fidela-limba-romana.md
LANG_CODE=ro
TRANSLATION_CODE=FIDELA
# API Bible # API Bible
API_BIBLE_KEY=7b42606f8f809e155c9b0742c4f1849b API_BIBLE_KEY=7b42606f8f809e155c9b0742c4f1849b

View File

@@ -1,5 +1,6 @@
import { NextRequest, NextResponse } from 'next/server' import { NextRequest, NextResponse } from 'next/server'
import { z } from 'zod' import { z } from 'zod'
import { searchBibleHybrid, BibleVerse } from '@/lib/vector-search'
const chatRequestSchema = z.object({ const chatRequestSchema = z.object({
message: z.string().min(1), message: z.string().min(1),
@@ -49,73 +50,81 @@ export async function POST(request: NextRequest) {
} }
async function generateBiblicalResponse(message: string, history: any[]): Promise<string> { async function generateBiblicalResponse(message: string, history: any[]): Promise<string> {
// Mock biblical responses for common questions try {
const lowerMessage = message.toLowerCase() // Search for relevant Bible verses using vector search
const relevantVerses = await searchBibleHybrid(message, 5)
if (lowerMessage.includes('dragoste') || lowerMessage.includes('iubire')) { // Create context from relevant verses
return `Întrebarea ta despre dragoste este foarte frumoasă! Biblia ne învață că "Dumnezeu este dragoste" (1 Ioan 4:8). De asemenea, în 1 Corinteni 13:4-7 găsim descrierea perfectă a dragostei: "Dragostea este îndelung răbdătoare, dragostea este binevoitoare; dragostea nu pizmuiește; dragostea nu se fălește, nu se semeață, nu face nimic necuviincios, nu caută ale sale, nu se mânie, nu ține seama de răul făcut..." const versesContext = relevantVerses
.map(verse => `${verse.ref}: "${verse.text_raw}"`)
.join('\n\n')
Isus ne-a dat cea mai mare poruncă: "Să iubești pe Domnul Dumnezeul tău cu toată inima ta, cu tot sufletul tău și cu tot cugetul tău" și "să-ți iubești aproapele ca pe tine însuți" (Matei 22:37-39).` // Create conversation history for context
const conversationHistory = history
.slice(-3) // Last 3 messages for context
.map(msg => `${msg.role}: ${msg.content}`)
.join('\n')
// Construct prompt for Azure OpenAI
const systemPrompt = `Ești un asistent AI pentru întrebări biblice în limba română. Răspunde pe baza Scripturii, fiind respectuos și înțelept.
Instrucțiuni:
- Folosește versurile biblice relevante pentru a răspunde la întrebare
- Citează întotdeauna referințele biblice (ex: Ioan 3:16)
- Răspunde în română
- Fii empatic și încurajator
- Dacă nu ești sigur, încurajează studiul personal și rugăciunea
Versuri relevante pentru această întrebare:
${versesContext}
Conversația anterioară:
${conversationHistory}
Întrebarea curentă: ${message}`
// Call Azure OpenAI
const response = await fetch(
`${process.env.AZURE_OPENAI_ENDPOINT}/openai/deployments/${process.env.AZURE_OPENAI_DEPLOYMENT}/chat/completions?api-version=${process.env.AZURE_OPENAI_API_VERSION}`,
{
method: 'POST',
headers: {
'api-key': process.env.AZURE_OPENAI_KEY!,
'Content-Type': 'application/json',
},
body: JSON.stringify({
messages: [
{
role: 'system',
content: systemPrompt
},
{
role: 'user',
content: message
}
],
max_tokens: 800,
temperature: 0.7,
top_p: 0.9
}),
}
)
if (!response.ok) {
throw new Error(`Azure OpenAI API error: ${response.status}`)
}
const data = await response.json()
return data.choices[0].message.content
} catch (error) {
console.error('Error calling Azure OpenAI:', error)
// Fallback to simple response if AI fails
return `Îmi pare rău, dar întâmpin o problemă tehnică în acest moment. Te încurajez să cercetezi acest subiect în Scripturi și să te rogi pentru înțelegere.
"Cercetați Scripturile, pentru că socotiți că în ele aveți viața veșnică, și tocmai ele mărturisesc despre Mine" (Ioan 5:39).
"Dacă vreunul dintre voi duce lipsă de înțelepciune, să ceară de la Dumnezeu, care dă tuturor cu dărnicie și fără mustrare, și i se va da" (Iacov 1:5).`
} }
if (lowerMessage.includes('rugăciune') || lowerMessage.includes('rog')) {
return `Rugăciunea este comunicarea noastră directă cu Dumnezeu! Isus ne-a învățat să ne rugăm prin "Tatăl nostru" (Matei 6:9-13).
Iată câteva principii importante pentru rugăciune:
• "Rugați-vă neîncetat" (1 Tesaloniceni 5:17)
• "Cerceți și veți găsi; bateți și vi se va deschide" (Matei 7:7)
• "Nu vă îngrijorați de nimic, ci în toate, prin rugăciune și cerere, cu mulțumire, să fie cunoscute cererile voastre înaintea lui Dumnezeu" (Filipeni 4:6)
Rugăciunea poate include laudă, mulțumire, spovedanie și cereri - Dumnezeu vrea să audă totul din inima ta!`
}
if (lowerMessage.includes('credință') || lowerMessage.includes('cred')) {
return `Credința este fundamentul vieții creștine! "Fără credință este cu neputință să fim plăcuți lui Dumnezeu; căci cine se apropie de Dumnezeu trebuie să creadă că El este și că răsplătește pe cei ce Îl caută" (Evrei 11:6).
"Credința este o încredere neclintită în lucrurile nădăjduite, o dovadă a lucrurilor care nu se văd" (Evrei 11:1).
Isus a spus: "Adevărat vă spun că, dacă aveți credință cât un grăunte de muștar, veți zice muntelui acestuia: 'Mută-te de aici acolo!' și se va muta" (Matei 17:20).
Credința crește prin ascultarea Cuvântului lui Dumnezeu: "Credința vine din ascultare, iar ascultarea vine din Cuvântul lui Hristos" (Romani 10:17).`
}
if (lowerMessage.includes('speranță') || lowerMessage.includes('sper')) {
return `Speranța creștină nu este o dorință vagă, ci o certitudine bazată pe promisiunile lui Dumnezeu!
"Fie ca Dumnezeul speranței să vă umple de toată bucuria și pacea în credință, pentru ca să prisosiți în speranță, prin puterea Duhului Sfânt!" (Romani 15:13).
Speranța noastră este ancorata în Isus Hristos: "Hristos în voi, nădejdea slavei" (Coloseni 1:27).
"Binecuvântat să fie Dumnezeu, Tatăl Domnului nostru Isus Hristos, care, după îndurarea Sa cea mare, ne-a născut din nou, printr-o înviere a lui Isus Hristos din morți, pentru o moștenire care nu se poate strica" (1 Petru 1:3-4).`
}
if (lowerMessage.includes('iertare') || lowerMessage.includes('iert')) {
return `Iertarea este una dintre cele mai puternice învățături ale lui Isus! El ne-a învățat să ne rugăm: "Iartă-ne greșelile noastre, precum și noi iertăm greșiților noștri" (Matei 6:12).
"Dacă iertați oamenilor greșelile lor, și Tatăl vostru cel ceresc vă va ierta greșelile voastre" (Matei 6:14).
Petru a întrebat pe Isus: "De câte ori să iert?" Isus a răspuns: "Nu îți zic până la șapte ori, ci până la șaptezeci de ori câte șapte" (Matei 18:21-22) - adică mereu!
Iertarea nu înseamnă că minimalizăm răul, ci că alegem să nu ținem seama de el, așa cum Dumnezeu face cu noi prin Hristos.`
}
if (lowerMessage.includes('pace') || lowerMessage.includes('liniște')) {
return `Pacea lui Dumnezeu este diferită de pacea lumii! Isus a spus: "Pace vă las, pacea Mea vă dau; nu cum dă lumea, vă dau Eu. Să nu vi se tulbure inima și să nu vă fie frică!" (Ioan 14:27).
"Pacea lui Dumnezeu, care întrece orice pricepere, vă va păzi inimile și gândurile în Hristos Isus" (Filipeni 4:7).
Pentru a avea pace:
• "În toate, prin rugăciune și cerere, cu mulțumire, să fie cunoscute cererile voastre înaintea lui Dumnezeu" (Filipeni 4:6)
• "Aruncați toată grija voastră asupra Lui, căci El îngrijește de voi" (1 Petru 5:7)
• "Isus le-a zis: 'Veniți la Mine, toți cei trudiți și împovărați, și Eu vă voi da odihnă'" (Matei 11:28)`
}
// Default response for other questions
return `Mulțumesc pentru întrebarea ta! Aceasta este o întrebare foarte importantă din punct de vedere biblic.
Te încurajez să cercetezi acest subiect în Scriptură, să te rogi pentru înțelegere și să discuți cu lideri spirituali maturi. "Cercetați Scripturile, pentru că socotiți că în ele aveți viața veșnică, și tocmai ele mărturisesc despre Mine" (Ioan 5:39).
Dacă ai întrebări mai specifice despre anumite pasaje biblice sau doctrine, voi fi bucuros să te ajut mai detaliat. Dumnezeu să te binecuvânteze în căutarea ta după adevăr!
"Dacă vreunul dintre voi duce lipsă de înțelepciune, să ceară de la Dumnezeu, care dă tuturor cu dărnicie și fără mustrare, și i se va da" (Iacob 1:5).`
} }

View File

@@ -1,6 +1,7 @@
import './globals.css' import './globals.css'
import type { Metadata } from 'next' import type { Metadata } from 'next'
import { MuiThemeProvider } from '@/components/providers/theme-provider' import { MuiThemeProvider } from '@/components/providers/theme-provider'
import FloatingChat from '@/components/chat/floating-chat'
export const metadata: Metadata = { export const metadata: Metadata = {
title: 'Ghid Biblic - Biblical Guide', title: 'Ghid Biblic - Biblical Guide',
@@ -17,6 +18,7 @@ export default function RootLayout({
<body> <body>
<MuiThemeProvider> <MuiThemeProvider>
{children} {children}
<FloatingChat />
</MuiThemeProvider> </MuiThemeProvider>
</body> </body>
</html> </html>

View File

@@ -0,0 +1,426 @@
'use client'
import {
Fab,
Drawer,
Box,
Typography,
TextField,
Button,
Paper,
Avatar,
Chip,
IconButton,
Divider,
List,
ListItem,
ListItemText,
useTheme,
Slide,
Grow,
Zoom,
} from '@mui/material'
import {
Chat,
Send,
Close,
SmartToy,
Person,
ContentCopy,
ThumbUp,
ThumbDown,
Minimize,
Launch,
} from '@mui/icons-material'
import { useState, useRef, useEffect } from 'react'
interface ChatMessage {
id: string
role: 'user' | 'assistant'
content: string
timestamp: Date
}
export default function FloatingChat() {
const theme = useTheme()
const [isOpen, setIsOpen] = useState(false)
const [isMinimized, setIsMinimized] = useState(false)
const [messages, setMessages] = useState<ChatMessage[]>([
{
id: '1',
role: 'assistant',
content: 'Bună ziua! Sunt asistentul tău AI pentru întrebări biblice. Cum te pot ajuta astăzi să înțelegi mai bine Scriptura?',
timestamp: new Date(),
}
])
const [inputMessage, setInputMessage] = useState('')
const [isLoading, setIsLoading] = useState(false)
const messagesEndRef = useRef<HTMLDivElement>(null)
const scrollToBottom = () => {
messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' })
}
useEffect(() => {
scrollToBottom()
}, [messages])
const handleSendMessage = async () => {
if (!inputMessage.trim() || isLoading) return
const userMessage: ChatMessage = {
id: Date.now().toString(),
role: 'user',
content: inputMessage,
timestamp: new Date(),
}
setMessages(prev => [...prev, userMessage])
setInputMessage('')
setIsLoading(true)
try {
const response = await fetch('/api/chat', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
message: inputMessage,
history: messages.slice(-5),
}),
})
if (!response.ok) {
throw new Error('Failed to get response')
}
const data = await response.json()
const assistantMessage: ChatMessage = {
id: (Date.now() + 1).toString(),
role: 'assistant',
content: data.response || 'Îmi pare rău, nu am putut procesa întrebarea ta. Te rog încearcă din nou.',
timestamp: new Date(),
}
setMessages(prev => [...prev, assistantMessage])
} catch (error) {
console.error('Error sending message:', error)
const errorMessage: ChatMessage = {
id: (Date.now() + 1).toString(),
role: 'assistant',
content: 'Îmi pare rău, a apărut o eroare. Te rog verifică conexiunea și încearcă din nou.',
timestamp: new Date(),
}
setMessages(prev => [...prev, errorMessage])
} finally {
setIsLoading(false)
}
}
const handleKeyPress = (event: React.KeyboardEvent) => {
if (event.key === 'Enter' && !event.shiftKey) {
event.preventDefault()
handleSendMessage()
}
}
const copyToClipboard = (text: string) => {
navigator.clipboard.writeText(text)
}
const suggestedQuestions = [
'Ce spune Biblia despre iubire?',
'Explică-mi parabola semănătorului',
'Care sunt fructele Duhului?',
'Ce înseamnă să fii născut din nou?',
'Cum pot să mă rog mai bine?',
]
const toggleChat = () => {
setIsOpen(!isOpen)
if (isMinimized) setIsMinimized(false)
}
const minimizeChat = () => {
setIsMinimized(!isMinimized)
}
const openFullChat = () => {
window.open('/chat', '_blank')
}
return (
<>
{/* Floating Action Button */}
<Zoom in={!isOpen} unmountOnExit>
<Fab
color="primary"
onClick={toggleChat}
sx={{
position: 'fixed',
bottom: 24,
right: 24,
zIndex: 1000,
background: 'linear-gradient(45deg, #2C5F6B 30%, #8B7355 90%)',
'&:hover': {
background: 'linear-gradient(45deg, #1e4148 30%, #6d5a43 90%)',
}
}}
>
<Chat />
</Fab>
</Zoom>
{/* Chat Overlay */}
<Slide direction="up" in={isOpen} mountOnExit>
<Paper
elevation={8}
sx={{
position: 'fixed',
bottom: 0,
right: 0,
width: { xs: '100vw', sm: '50vw', md: '40vw' },
height: isMinimized ? 'auto' : '100vh',
zIndex: 1200,
borderRadius: { xs: 0, sm: '12px 0 0 0' },
overflow: 'hidden',
display: 'flex',
flexDirection: 'column',
background: 'linear-gradient(to bottom, #f8f9fa, #ffffff)',
}}
>
{/* Header */}
<Box
sx={{
p: 2,
background: 'linear-gradient(45deg, #2C5F6B 30%, #8B7355 90%)',
color: 'white',
display: 'flex',
alignItems: 'center',
justifyContent: 'space-between',
}}
>
<Box sx={{ display: 'flex', alignItems: 'center', gap: 1 }}>
<Avatar sx={{ bgcolor: 'rgba(255,255,255,0.2)' }}>
<SmartToy />
</Avatar>
<Box>
<Typography variant="subtitle1" fontWeight="bold">
Chat AI Biblic
</Typography>
<Typography variant="caption" sx={{ opacity: 0.9 }}>
Asistent pentru întrebări biblice
</Typography>
</Box>
</Box>
<Box>
<IconButton
size="small"
onClick={minimizeChat}
sx={{ color: 'white', mr: 0.5 }}
>
<Minimize />
</IconButton>
<IconButton
size="small"
onClick={openFullChat}
sx={{ color: 'white', mr: 0.5 }}
>
<Launch />
</IconButton>
<IconButton
size="small"
onClick={toggleChat}
sx={{ color: 'white' }}
>
<Close />
</IconButton>
</Box>
</Box>
{!isMinimized && (
<>
{/* Suggested Questions */}
<Box sx={{ p: 2, borderBottom: 1, borderColor: 'divider' }}>
<Typography variant="body2" color="text.secondary" sx={{ mb: 1 }}>
Întrebări sugerate:
</Typography>
<Box sx={{ display: 'flex', flexWrap: 'wrap', gap: 0.5 }}>
{suggestedQuestions.slice(0, 3).map((question, index) => (
<Chip
key={index}
label={question}
size="small"
variant="outlined"
onClick={() => setInputMessage(question)}
sx={{
fontSize: '0.75rem',
cursor: 'pointer',
'&:hover': {
bgcolor: 'primary.light',
color: 'white',
},
}}
/>
))}
</Box>
</Box>
{/* Messages */}
<Box
sx={{
flexGrow: 1,
overflow: 'auto',
p: 1,
}}
>
{messages.map((message) => (
<Box
key={message.id}
sx={{
display: 'flex',
justifyContent: message.role === 'user' ? 'flex-end' : 'flex-start',
mb: 2,
}}
>
<Box
sx={{
display: 'flex',
flexDirection: message.role === 'user' ? 'row-reverse' : 'row',
alignItems: 'flex-start',
maxWidth: '85%',
gap: 1,
}}
>
<Avatar
sx={{
width: 32,
height: 32,
bgcolor: message.role === 'user' ? 'primary.main' : 'secondary.main',
}}
>
{message.role === 'user' ? <Person fontSize="small" /> : <SmartToy fontSize="small" />}
</Avatar>
<Paper
elevation={1}
sx={{
p: 1.5,
bgcolor: message.role === 'user' ? 'primary.light' : 'background.paper',
color: message.role === 'user' ? 'white' : 'text.primary',
borderRadius: 2,
maxWidth: '100%',
}}
>
<Typography
variant="body2"
sx={{
whiteSpace: 'pre-wrap',
lineHeight: 1.4,
}}
>
{message.content}
</Typography>
{message.role === 'assistant' && (
<Box sx={{ display: 'flex', gap: 0.5, mt: 1, justifyContent: 'flex-end' }}>
<IconButton
size="small"
onClick={() => copyToClipboard(message.content)}
>
<ContentCopy fontSize="small" />
</IconButton>
<IconButton size="small">
<ThumbUp fontSize="small" />
</IconButton>
<IconButton size="small">
<ThumbDown fontSize="small" />
</IconButton>
</Box>
)}
<Typography
variant="caption"
sx={{
display: 'block',
textAlign: 'right',
mt: 0.5,
opacity: 0.7,
}}
>
{message.timestamp.toLocaleTimeString('ro-RO', {
hour: '2-digit',
minute: '2-digit',
})}
</Typography>
</Paper>
</Box>
</Box>
))}
{isLoading && (
<Box sx={{ display: 'flex', justifyContent: 'flex-start', mb: 2 }}>
<Box sx={{ display: 'flex', alignItems: 'flex-start', gap: 1 }}>
<Avatar sx={{ width: 32, height: 32, bgcolor: 'secondary.main' }}>
<SmartToy fontSize="small" />
</Avatar>
<Paper elevation={1} sx={{ p: 1.5, borderRadius: 2 }}>
<Typography variant="body2">
Scriu răspunsul...
</Typography>
</Paper>
</Box>
</Box>
)}
<div ref={messagesEndRef} />
</Box>
<Divider />
{/* Input */}
<Box sx={{ p: 2 }}>
<Box sx={{ display: 'flex', gap: 1 }}>
<TextField
fullWidth
size="small"
multiline
maxRows={3}
placeholder="Scrie întrebarea ta despre Biblie..."
value={inputMessage}
onChange={(e) => setInputMessage(e.target.value)}
onKeyPress={handleKeyPress}
disabled={isLoading}
variant="outlined"
sx={{
'& .MuiOutlinedInput-root': {
borderRadius: 2,
}
}}
/>
<Button
variant="contained"
onClick={handleSendMessage}
disabled={!inputMessage.trim() || isLoading}
sx={{
minWidth: 'auto',
px: 2,
borderRadius: 2,
background: 'linear-gradient(45deg, #2C5F6B 30%, #8B7355 90%)',
}}
>
<Send fontSize="small" />
</Button>
</Box>
<Typography variant="caption" color="text.secondary" sx={{ mt: 0.5, display: 'block' }}>
Enter pentru a trimite Shift+Enter pentru linie nouă
</Typography>
</Box>
</>
)}
</Paper>
</Slide>
</>
)
}

View File

@@ -24,7 +24,6 @@ import {
import { import {
Menu as MenuIcon, Menu as MenuIcon,
MenuBook, MenuBook,
Chat,
Favorite as Prayer, Favorite as Prayer,
Search, Search,
AccountCircle, AccountCircle,
@@ -37,7 +36,6 @@ import { useRouter } from 'next/navigation'
const pages = [ const pages = [
{ name: 'Acasă', path: '/', icon: <Home /> }, { name: 'Acasă', path: '/', icon: <Home /> },
{ name: 'Biblia', path: '/bible', icon: <MenuBook /> }, { name: 'Biblia', path: '/bible', icon: <MenuBook /> },
{ name: 'Chat AI', path: '/chat', icon: <Chat /> },
{ name: 'Rugăciuni', path: '/prayers', icon: <Prayer /> }, { name: 'Rugăciuni', path: '/prayers', icon: <Prayer /> },
{ name: 'Căutare', path: '/search', icon: <Search /> }, { name: 'Căutare', path: '/search', icon: <Search /> },
] ]

140
lib/vector-search.ts Normal file
View File

@@ -0,0 +1,140 @@
import { Pool } from 'pg'
const pool = new Pool({
connectionString: process.env.DATABASE_URL,
})
export interface BibleVerse {
id: string
ref: string
book: string
chapter: number
verse: number
text_raw: string
similarity?: number
combined_score?: number
}
export async function getEmbedding(text: string): Promise<number[]> {
const response = await fetch(
`${process.env.AZURE_OPENAI_ENDPOINT}/openai/deployments/${process.env.AZURE_OPENAI_EMBED_DEPLOYMENT}/embeddings?api-version=${process.env.AZURE_OPENAI_API_VERSION}`,
{
method: 'POST',
headers: {
'api-key': process.env.AZURE_OPENAI_KEY!,
'Content-Type': 'application/json',
},
body: JSON.stringify({
input: [text],
}),
}
)
if (!response.ok) {
throw new Error(`Embedding API error: ${response.status}`)
}
const data = await response.json()
return data.data[0].embedding
}
export async function searchBibleSemantic(
query: string,
limit: number = 10
): Promise<BibleVerse[]> {
try {
const queryEmbedding = await getEmbedding(query)
const client = await pool.connect()
try {
const result = await client.query(
`
SELECT ref, book, chapter, verse, text_raw,
1 - (embedding <=> $1) AS similarity
FROM bible_passages
WHERE embedding IS NOT NULL
ORDER BY embedding <=> $1
LIMIT $2
`,
[JSON.stringify(queryEmbedding), limit]
)
return result.rows
} finally {
client.release()
}
} catch (error) {
console.error('Error in semantic search:', error)
throw error
}
}
export async function searchBibleHybrid(
query: string,
limit: number = 10
): Promise<BibleVerse[]> {
try {
const queryEmbedding = await getEmbedding(query)
const client = await pool.connect()
try {
const result = await client.query(
`
WITH vector_search AS (
SELECT id, 1 - (embedding <=> $1) AS vector_sim
FROM bible_passages
WHERE embedding IS NOT NULL
ORDER BY embedding <=> $1
LIMIT 100
),
text_search AS (
SELECT id, ts_rank(tsv, plainto_tsquery('romanian', $3)) AS text_rank
FROM bible_passages
WHERE tsv @@ plainto_tsquery('romanian', $3)
)
SELECT bp.ref, bp.book, bp.chapter, bp.verse, bp.text_raw,
COALESCE(vs.vector_sim, 0) * 0.7 + COALESCE(ts.text_rank, 0) * 0.3 AS combined_score
FROM bible_passages bp
LEFT JOIN vector_search vs ON vs.id = bp.id
LEFT JOIN text_search ts ON ts.id = bp.id
WHERE vs.id IS NOT NULL OR ts.id IS NOT NULL
ORDER BY combined_score DESC
LIMIT $2
`,
[JSON.stringify(queryEmbedding), limit, query]
)
return result.rows
} finally {
client.release()
}
} catch (error) {
console.error('Error in hybrid search:', error)
throw error
}
}
export async function getContextVerses(
book: string,
chapter: number,
verse: number,
contextSize: number = 2
): Promise<BibleVerse[]> {
const client = await pool.connect()
try {
const result = await client.query(
`
SELECT ref, book, chapter, verse, text_raw
FROM bible_passages
WHERE book = $1 AND chapter = $2
AND verse BETWEEN $3 AND $4
ORDER BY verse
`,
[book, chapter, verse - contextSize, verse + contextSize]
)
return result.rows
} finally {
client.release()
}
}

View File

@@ -0,0 +1,212 @@
# Multi-Language Support Implementation Plan
## Overview
Add comprehensive multi-language support to the Ghid Biblic application, starting with English as the second language alongside Romanian.
## Current State
- **Database**: Already supports multiple languages (`lang` field) and translations (`translation` field)
- **Frontend**: Hardcoded Romanian interface
- **Vector Search**: Romanian-only search logic
- **Bible Data**: Only Romanian (FIDELA) version imported
## Implementation Phases
### Phase 1: Core Infrastructure
1. **Install i18n Framework**
- Add `next-intl` for Next.js internationalization
- Configure locale routing (`/ro/`, `/en/`)
- Set up translation file structure
2. **Language Configuration**
- Create language detection and switching logic
- Add language persistence (localStorage/cookies)
- Configure default language fallbacks
3. **Translation Files Structure**
```
messages/
├── ro.json (Romanian - existing content)
├── en.json (English translations)
└── common.json (shared terms)
```
### Phase 2: UI Internationalization
1. **Navigation Component**
- Translate all menu items and labels
- Add language switcher dropdown
- Update routing for locale-aware navigation
2. **Chat Interface**
- Translate all UI text and prompts
- Add suggested questions per language
- Update loading states and error messages
3. **Page Content**
- Home page (`/` → `/[locale]/`)
- Bible browser (`/bible` → `/[locale]/bible`)
- Search page (`/search` → `/[locale]/search`)
- Prayer requests (`/prayers` → `/[locale]/prayers`)
### Phase 3: Backend Localization
1. **Vector Search Updates**
- Modify search functions to filter by language
- Add language parameter to search APIs
- Update hybrid search for language-specific full-text search
2. **Chat API Enhancement**
- Language-aware Bible verse retrieval
- Localized AI response prompts
- Language-specific fallback responses
3. **API Route Updates**
- Add locale parameter to all API endpoints
- Update error responses for each language
- Configure language-specific search configurations
### Phase 4: Bible Data Management
1. **English Bible Import**
- Source: API.Bible or public domain English Bible (KJV/ESV)
- Adapt existing import script for English
- Generate English embeddings using Azure OpenAI
2. **Language-Aware Bible Browser**
- Add language selector in Bible interface
- Filter books/chapters/verses by selected language
- Show parallel verses when both languages available
### Phase 5: Enhanced Features
1. **Parallel Bible View**
- Side-by-side Romanian/English verse display
- Cross-reference linking between translations
- Language comparison in search results
2. **Smart Language Detection**
- Auto-detect query language in chat
- Suggest language switch based on user input
- Mixed-language search capabilities
3. **Advanced Search Features**
- Cross-language semantic search
- Translation comparison tools
- Language-specific biblical term glossaries
## Technical Implementation Details
### Routing Structure
```
Current: /page
New: /[locale]/page
Examples:
- /ro/biblia (Romanian Bible)
- /en/bible (English Bible)
- /ro/rugaciuni (Romanian Prayers)
- /en/prayers (English Prayers)
```
### Database Schema Changes
**No changes needed** - current schema already supports:
- Multiple languages via `lang` field
- Multiple translations via `translation` field
- Unique constraints per translation/language
### Vector Search Updates
```typescript
// Current
searchBibleHybrid(query: string, limit: number)
// Enhanced
searchBibleHybrid(query: string, language: string, limit: number)
```
### Translation File Structure
```json
// messages/en.json
{
"navigation": {
"home": "Home",
"bible": "Bible",
"prayers": "Prayers",
"search": "Search"
},
"chat": {
"placeholder": "Ask your biblical question...",
"suggestions": [
"What does the Bible say about love?",
"Explain the parable of the sower",
"What are the fruits of the Spirit?"
]
}
}
```
### Language Switcher Component
- Dropdown in navigation header
- Flag icons for visual identification
- Persist language choice across sessions
- Redirect to equivalent page in new language
## Dependencies to Add
```json
{
"next-intl": "^3.x",
"@formatjs/intl-localematcher": "^0.x",
"negotiator": "^0.x"
}
```
## File Structure Changes
```
app/
├── [locale]/
│ ├── page.tsx
│ ├── bible/
│ ├── prayers/
│ ├── search/
│ └── layout.tsx
├── api/ (unchanged)
└── globals.css
messages/
├── en.json
├── ro.json
└── index.ts
components/
├── language-switcher.tsx
├── navigation.tsx (updated)
└── chat/ (updated)
```
## Testing Strategy
1. **Unit Tests**: Translation loading and language switching
2. **Integration Tests**: API endpoints with locale parameters
3. **E2E Tests**: Complete user flows in both languages
4. **Performance Tests**: Vector search with language filtering
## Rollout Plan
1. **Development**: Implement Phase 1-3 (core infrastructure and UI)
2. **Testing**: Deploy to staging with Romanian/English support
3. **Beta Release**: Limited user testing with feedback collection
4. **Production**: Full release with both languages
5. **Future**: Add additional languages based on user demand
## Estimated Timeline
- **Phase 1-2**: 2-3 days (i18n setup and UI translation)
- **Phase 3**: 1-2 days (backend localization)
- **Phase 4**: 2-3 days (English Bible import and embeddings)
- **Phase 5**: 3-4 days (enhanced features)
- **Total**: 8-12 days for complete implementation
## Success Metrics
- Language switching works seamlessly
- Vector search returns accurate results in both languages
- AI chat responses are contextually appropriate per language
- User can browse Bible in preferred language
- Performance remains optimal with language filtering
## Future Considerations
- Spanish, French, German language support
- Regional dialect variations
- Audio Bible integration per language
- Collaborative translation features for community contributions

169
package-lock.json generated
View File

@@ -24,6 +24,7 @@
"@tailwindcss/postcss": "^4.1.13", "@tailwindcss/postcss": "^4.1.13",
"@types/node": "^24.5.2", "@types/node": "^24.5.2",
"@types/pdf-parse": "^1.1.5", "@types/pdf-parse": "^1.1.5",
"@types/pg": "^8.15.5",
"@types/react": "^19.1.13", "@types/react": "^19.1.13",
"@types/react-dom": "^19.1.9", "@types/react-dom": "^19.1.9",
"autoprefixer": "^10.4.21", "autoprefixer": "^10.4.21",
@@ -35,6 +36,8 @@
"next": "^15.5.3", "next": "^15.5.3",
"openai": "^5.22.0", "openai": "^5.22.0",
"pdf-parse": "^1.1.1", "pdf-parse": "^1.1.1",
"pg": "^8.16.3",
"pgvector": "^0.2.1",
"postcss": "^8.5.6", "postcss": "^8.5.6",
"prisma": "^6.16.2", "prisma": "^6.16.2",
"react": "^19.1.1", "react": "^19.1.1",
@@ -4182,6 +4185,17 @@
"@types/node": "*" "@types/node": "*"
} }
}, },
"node_modules/@types/pg": {
"version": "8.15.5",
"resolved": "https://registry.npmjs.org/@types/pg/-/pg-8.15.5.tgz",
"integrity": "sha512-LF7lF6zWEKxuT3/OR8wAZGzkg4ENGXFNyiV/JeOt9z5B+0ZVwbql9McqX5c/WStFq1GaGso7H1AzP/qSzmlCKQ==",
"license": "MIT",
"dependencies": {
"@types/node": "*",
"pg-protocol": "*",
"pg-types": "^2.2.0"
}
},
"node_modules/@types/prop-types": { "node_modules/@types/prop-types": {
"version": "15.7.15", "version": "15.7.15",
"resolved": "https://registry.npmjs.org/@types/prop-types/-/prop-types-15.7.15.tgz", "resolved": "https://registry.npmjs.org/@types/prop-types/-/prop-types-15.7.15.tgz",
@@ -9639,6 +9653,104 @@
"integrity": "sha512-xCy9V055GLEqoFaHoC1SoLIaLmWctgCUaBaWxDZ7/Zx4CTyX7cJQLJOok/orfjZAh9kEYpjJa4d0KcJmCbctZA==", "integrity": "sha512-xCy9V055GLEqoFaHoC1SoLIaLmWctgCUaBaWxDZ7/Zx4CTyX7cJQLJOok/orfjZAh9kEYpjJa4d0KcJmCbctZA==",
"license": "MIT" "license": "MIT"
}, },
"node_modules/pg": {
"version": "8.16.3",
"resolved": "https://registry.npmjs.org/pg/-/pg-8.16.3.tgz",
"integrity": "sha512-enxc1h0jA/aq5oSDMvqyW3q89ra6XIIDZgCX9vkMrnz5DFTw/Ny3Li2lFQ+pt3L6MCgm/5o2o8HW9hiJji+xvw==",
"license": "MIT",
"dependencies": {
"pg-connection-string": "^2.9.1",
"pg-pool": "^3.10.1",
"pg-protocol": "^1.10.3",
"pg-types": "2.2.0",
"pgpass": "1.0.5"
},
"engines": {
"node": ">= 16.0.0"
},
"optionalDependencies": {
"pg-cloudflare": "^1.2.7"
},
"peerDependencies": {
"pg-native": ">=3.0.1"
},
"peerDependenciesMeta": {
"pg-native": {
"optional": true
}
}
},
"node_modules/pg-cloudflare": {
"version": "1.2.7",
"resolved": "https://registry.npmjs.org/pg-cloudflare/-/pg-cloudflare-1.2.7.tgz",
"integrity": "sha512-YgCtzMH0ptvZJslLM1ffsY4EuGaU0cx4XSdXLRFae8bPP4dS5xL1tNB3k2o/N64cHJpwU7dxKli/nZ2lUa5fLg==",
"license": "MIT",
"optional": true
},
"node_modules/pg-connection-string": {
"version": "2.9.1",
"resolved": "https://registry.npmjs.org/pg-connection-string/-/pg-connection-string-2.9.1.tgz",
"integrity": "sha512-nkc6NpDcvPVpZXxrreI/FOtX3XemeLl8E0qFr6F2Lrm/I8WOnaWNhIPK2Z7OHpw7gh5XJThi6j6ppgNoaT1w4w==",
"license": "MIT"
},
"node_modules/pg-int8": {
"version": "1.0.1",
"resolved": "https://registry.npmjs.org/pg-int8/-/pg-int8-1.0.1.tgz",
"integrity": "sha512-WCtabS6t3c8SkpDBUlb1kjOs7l66xsGdKpIPZsg4wR+B3+u9UAum2odSsF9tnvxg80h4ZxLWMy4pRjOsFIqQpw==",
"license": "ISC",
"engines": {
"node": ">=4.0.0"
}
},
"node_modules/pg-pool": {
"version": "3.10.1",
"resolved": "https://registry.npmjs.org/pg-pool/-/pg-pool-3.10.1.tgz",
"integrity": "sha512-Tu8jMlcX+9d8+QVzKIvM/uJtp07PKr82IUOYEphaWcoBhIYkoHpLXN3qO59nAI11ripznDsEzEv8nUxBVWajGg==",
"license": "MIT",
"peerDependencies": {
"pg": ">=8.0"
}
},
"node_modules/pg-protocol": {
"version": "1.10.3",
"resolved": "https://registry.npmjs.org/pg-protocol/-/pg-protocol-1.10.3.tgz",
"integrity": "sha512-6DIBgBQaTKDJyxnXaLiLR8wBpQQcGWuAESkRBX/t6OwA8YsqP+iVSiond2EDy6Y/dsGk8rh/jtax3js5NeV7JQ==",
"license": "MIT"
},
"node_modules/pg-types": {
"version": "2.2.0",
"resolved": "https://registry.npmjs.org/pg-types/-/pg-types-2.2.0.tgz",
"integrity": "sha512-qTAAlrEsl8s4OiEQY69wDvcMIdQN6wdz5ojQiOy6YRMuynxenON0O5oCpJI6lshc6scgAY8qvJ2On/p+CXY0GA==",
"license": "MIT",
"dependencies": {
"pg-int8": "1.0.1",
"postgres-array": "~2.0.0",
"postgres-bytea": "~1.0.0",
"postgres-date": "~1.0.4",
"postgres-interval": "^1.1.0"
},
"engines": {
"node": ">=4"
}
},
"node_modules/pgpass": {
"version": "1.0.5",
"resolved": "https://registry.npmjs.org/pgpass/-/pgpass-1.0.5.tgz",
"integrity": "sha512-FdW9r/jQZhSeohs1Z3sI1yxFQNFvMcnmfuj4WBMUTxOrAyLMaTcE1aAMBiTlbMNaXvBCQuVi0R7hd8udDSP7ug==",
"license": "MIT",
"dependencies": {
"split2": "^4.1.0"
}
},
"node_modules/pgvector": {
"version": "0.2.1",
"resolved": "https://registry.npmjs.org/pgvector/-/pgvector-0.2.1.tgz",
"integrity": "sha512-nKaQY9wtuiidwLMdVIce1O3kL0d+FxrigCVzsShnoqzOSaWWWOvuctb/sYwlai5cTwwzRSNa+a/NtN2kVZGNJw==",
"license": "MIT",
"engines": {
"node": ">= 18"
}
},
"node_modules/picocolors": { "node_modules/picocolors": {
"version": "1.1.1", "version": "1.1.1",
"resolved": "https://registry.npmjs.org/picocolors/-/picocolors-1.1.1.tgz", "resolved": "https://registry.npmjs.org/picocolors/-/picocolors-1.1.1.tgz",
@@ -9726,6 +9838,45 @@
"integrity": "sha512-1NNCs6uurfkVbeXG4S8JFT9t19m45ICnif8zWLd5oPSZ50QnwMfK+H3jv408d4jw/7Bttv5axS5IiHoLaVNHeQ==", "integrity": "sha512-1NNCs6uurfkVbeXG4S8JFT9t19m45ICnif8zWLd5oPSZ50QnwMfK+H3jv408d4jw/7Bttv5axS5IiHoLaVNHeQ==",
"license": "MIT" "license": "MIT"
}, },
"node_modules/postgres-array": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/postgres-array/-/postgres-array-2.0.0.tgz",
"integrity": "sha512-VpZrUqU5A69eQyW2c5CA1jtLecCsN2U/bD6VilrFDWq5+5UIEVO7nazS3TEcHf1zuPYO/sqGvUvW62g86RXZuA==",
"license": "MIT",
"engines": {
"node": ">=4"
}
},
"node_modules/postgres-bytea": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/postgres-bytea/-/postgres-bytea-1.0.0.tgz",
"integrity": "sha512-xy3pmLuQqRBZBXDULy7KbaitYqLcmxigw14Q5sj8QBVLqEwXfeybIKVWiqAXTlcvdvb0+xkOtDbfQMOf4lST1w==",
"license": "MIT",
"engines": {
"node": ">=0.10.0"
}
},
"node_modules/postgres-date": {
"version": "1.0.7",
"resolved": "https://registry.npmjs.org/postgres-date/-/postgres-date-1.0.7.tgz",
"integrity": "sha512-suDmjLVQg78nMK2UZ454hAG+OAW+HQPZ6n++TNDUX+L0+uUlLywnoxJKDou51Zm+zTCjrCl0Nq6J9C5hP9vK/Q==",
"license": "MIT",
"engines": {
"node": ">=0.10.0"
}
},
"node_modules/postgres-interval": {
"version": "1.2.0",
"resolved": "https://registry.npmjs.org/postgres-interval/-/postgres-interval-1.2.0.tgz",
"integrity": "sha512-9ZhXKM/rw350N1ovuWHbGxnGh/SNJ4cnxHiM0rxE4VN41wsg8P8zWn9hv/buK00RP4WvlOyr/RBDiptyxVbkZQ==",
"license": "MIT",
"dependencies": {
"xtend": "^4.0.0"
},
"engines": {
"node": ">=0.10.0"
}
},
"node_modules/pretty-format": { "node_modules/pretty-format": {
"version": "27.5.1", "version": "27.5.1",
"resolved": "https://registry.npmjs.org/pretty-format/-/pretty-format-27.5.1.tgz", "resolved": "https://registry.npmjs.org/pretty-format/-/pretty-format-27.5.1.tgz",
@@ -10480,6 +10631,15 @@
"url": "https://github.com/sponsors/wooorm" "url": "https://github.com/sponsors/wooorm"
} }
}, },
"node_modules/split2": {
"version": "4.2.0",
"resolved": "https://registry.npmjs.org/split2/-/split2-4.2.0.tgz",
"integrity": "sha512-UcjcJOWknrNkF6PLX83qcHM6KHgVKNkV62Y8a5uYDVv9ydGQVwAHMKqHdJje1VTWpljG0WYpCDhrCdAOYH4TWg==",
"license": "ISC",
"engines": {
"node": ">= 10.x"
}
},
"node_modules/sprintf-js": { "node_modules/sprintf-js": {
"version": "1.0.3", "version": "1.0.3",
"resolved": "https://registry.npmjs.org/sprintf-js/-/sprintf-js-1.0.3.tgz", "resolved": "https://registry.npmjs.org/sprintf-js/-/sprintf-js-1.0.3.tgz",
@@ -11638,6 +11798,15 @@
"node": ">=0.4.0" "node": ">=0.4.0"
} }
}, },
"node_modules/xtend": {
"version": "4.0.2",
"resolved": "https://registry.npmjs.org/xtend/-/xtend-4.0.2.tgz",
"integrity": "sha512-LKYU1iAXJXUgAXn9URjiu+MWhyUXHsvfp7mcuYm9dSUKK0/CjtrUwFAxD82/mCWbtLsGjFIad0wIsod4zrTAEQ==",
"license": "MIT",
"engines": {
"node": ">=0.4"
}
},
"node_modules/y18n": { "node_modules/y18n": {
"version": "5.0.8", "version": "5.0.8",
"resolved": "https://registry.npmjs.org/y18n/-/y18n-5.0.8.tgz", "resolved": "https://registry.npmjs.org/y18n/-/y18n-5.0.8.tgz",

View File

@@ -37,6 +37,7 @@
"@tailwindcss/postcss": "^4.1.13", "@tailwindcss/postcss": "^4.1.13",
"@types/node": "^24.5.2", "@types/node": "^24.5.2",
"@types/pdf-parse": "^1.1.5", "@types/pdf-parse": "^1.1.5",
"@types/pg": "^8.15.5",
"@types/react": "^19.1.13", "@types/react": "^19.1.13",
"@types/react-dom": "^19.1.9", "@types/react-dom": "^19.1.9",
"autoprefixer": "^10.4.21", "autoprefixer": "^10.4.21",
@@ -48,6 +49,8 @@
"next": "^15.5.3", "next": "^15.5.3",
"openai": "^5.22.0", "openai": "^5.22.0",
"pdf-parse": "^1.1.1", "pdf-parse": "^1.1.1",
"pg": "^8.16.3",
"pgvector": "^0.2.1",
"postcss": "^8.5.6", "postcss": "^8.5.6",
"prisma": "^6.16.2", "prisma": "^6.16.2",
"react": "^19.1.1", "react": "^19.1.1",

View File

@@ -78,6 +78,26 @@ model BibleVerse {
@@index([version]) @@index([version])
} }
model BiblePassage {
id String @id @default(uuid())
testament String // 'OT' or 'NT'
book String
chapter Int
verse Int
ref String // Generated field: "book chapter:verse"
lang String @default("ro")
translation String @default("FIDELA")
textRaw String @db.Text
textNorm String @db.Text // Normalized text for embedding
embedding Unsupported("vector(3072)")?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
@@unique([translation, lang, book, chapter, verse])
@@index([book, chapter])
@@index([testament])
}
model ChatMessage { model ChatMessage {
id String @id @default(uuid()) id String @id @default(uuid())
userId String userId String

121
scripts/bible_search.py Normal file
View File

@@ -0,0 +1,121 @@
import os
import asyncio
from typing import List, Dict
from dotenv import load_dotenv
import httpx
import psycopg
from psycopg.rows import dict_row
load_dotenv()
AZ_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT", "").rstrip("/")
AZ_API_KEY = os.getenv("AZURE_OPENAI_KEY")
AZ_API_VER = os.getenv("AZURE_OPENAI_API_VERSION", "2024-05-01-preview")
AZ_DEPLOYMENT = os.getenv("AZURE_OPENAI_EMBED_DEPLOYMENT", "embed-3")
DB_URL = os.getenv("DATABASE_URL")
EMBED_URL = f"{AZ_ENDPOINT}/openai/deployments/{AZ_DEPLOYMENT}/embeddings?api-version={AZ_API_VER}"
async def get_embedding(text: str) -> List[float]:
"""Get embedding for a text using Azure OpenAI"""
payload = {"input": [text]}
headers = {"api-key": AZ_API_KEY, "Content-Type": "application/json"}
async with httpx.AsyncClient() as client:
for attempt in range(3):
try:
r = await client.post(EMBED_URL, headers=headers, json=payload, timeout=30)
if r.status_code == 200:
data = r.json()
return data["data"][0]["embedding"]
elif r.status_code in (429, 500, 503):
backoff = 2 ** attempt
await asyncio.sleep(backoff)
else:
raise RuntimeError(f"Embedding error {r.status_code}: {r.text}")
except Exception as e:
if attempt == 2:
raise e
await asyncio.sleep(2 ** attempt)
async def search_bible_semantic(query: str, limit: int = 10) -> List[Dict]:
"""Search Bible using semantic similarity"""
# Get embedding for the query
query_embedding = await get_embedding(query)
# Search for similar verses
with psycopg.connect(DB_URL, row_factory=dict_row) as conn:
with conn.cursor() as cur:
cur.execute("""
SELECT ref, book, chapter, verse, text_raw,
1 - (embedding <=> %s) AS similarity
FROM bible_passages
WHERE embedding IS NOT NULL
ORDER BY embedding <=> %s
LIMIT %s
""", (query_embedding, query_embedding, limit))
return cur.fetchall()
async def search_bible_hybrid(query: str, limit: int = 10) -> List[Dict]:
"""Search Bible using hybrid semantic + lexical search"""
# Get embedding for the query
query_embedding = await get_embedding(query)
# Create search query for full-text search
search_query = " & ".join(query.split())
with psycopg.connect(DB_URL, row_factory=dict_row) as conn:
with conn.cursor() as cur:
cur.execute("""
WITH vector_search AS (
SELECT id, 1 - (embedding <=> %s) AS vector_sim
FROM bible_passages
WHERE embedding IS NOT NULL
ORDER BY embedding <=> %s
LIMIT 100
),
text_search AS (
SELECT id, ts_rank(tsv, plainto_tsquery('romanian', %s)) AS text_rank
FROM bible_passages
WHERE tsv @@ plainto_tsquery('romanian', %s)
)
SELECT bp.ref, bp.book, bp.chapter, bp.verse, bp.text_raw,
COALESCE(vs.vector_sim, 0) * 0.7 + COALESCE(ts.text_rank, 0) * 0.3 AS combined_score
FROM bible_passages bp
LEFT JOIN vector_search vs ON vs.id = bp.id
LEFT JOIN text_search ts ON ts.id = bp.id
WHERE vs.id IS NOT NULL OR ts.id IS NOT NULL
ORDER BY combined_score DESC
LIMIT %s
""", (query_embedding, query_embedding, query, query, limit))
return cur.fetchall()
async def get_context_verses(book: str, chapter: int, verse: int, context_size: int = 2) -> List[Dict]:
"""Get surrounding verses for context"""
with psycopg.connect(DB_URL, row_factory=dict_row) as conn:
with conn.cursor() as cur:
cur.execute("""
SELECT ref, book, chapter, verse, text_raw
FROM bible_passages
WHERE book = %s AND chapter = %s
AND verse BETWEEN %s AND %s
ORDER BY verse
""", (book, chapter, verse - context_size, verse + context_size))
return cur.fetchall()
if __name__ == "__main__":
async def test_search():
results = await search_bible_semantic("dragoste", 5)
print("Semantic search results for 'dragoste':")
for result in results:
print(f"{result['ref']}: {result['text_raw'][:100]}... (similarity: {result['similarity']:.3f})")
print("\nHybrid search results for 'dragoste':")
hybrid_results = await search_bible_hybrid("dragoste", 5)
for result in hybrid_results:
print(f"{result['ref']}: {result['text_raw'][:100]}... (score: {result['combined_score']:.3f})")
asyncio.run(test_search())

View File

@@ -0,0 +1,305 @@
import { PrismaClient } from '@prisma/client'
import * as fs from 'fs'
import * as path from 'path'
const prisma = new PrismaClient()
// Book name mappings from Romanian to standardized names
const BOOK_MAPPINGS: Record<string, { name: string; abbreviation: string; testament: string; orderNum: number }> = {
'Geneza': { name: 'Geneza', abbreviation: 'GEN', testament: 'OT', orderNum: 1 },
'Exodul': { name: 'Exodul', abbreviation: 'EXO', testament: 'OT', orderNum: 2 },
'Leviticul': { name: 'Leviticul', abbreviation: 'LEV', testament: 'OT', orderNum: 3 },
'Numeri': { name: 'Numerii', abbreviation: 'NUM', testament: 'OT', orderNum: 4 },
'Deuteronom': { name: 'Deuteronomul', abbreviation: 'DEU', testament: 'OT', orderNum: 5 },
'Iosua': { name: 'Iosua', abbreviation: 'JOS', testament: 'OT', orderNum: 6 },
'Judecători': { name: 'Judecătorii', abbreviation: 'JDG', testament: 'OT', orderNum: 7 },
'Rut': { name: 'Rut', abbreviation: 'RUT', testament: 'OT', orderNum: 8 },
'1 Samuel': { name: '1 Samuel', abbreviation: '1SA', testament: 'OT', orderNum: 9 },
'2 Samuel': { name: '2 Samuel', abbreviation: '2SA', testament: 'OT', orderNum: 10 },
'1 Imparati': { name: '1 Împărați', abbreviation: '1KI', testament: 'OT', orderNum: 11 },
'2 Imparati': { name: '2 Împărați', abbreviation: '2KI', testament: 'OT', orderNum: 12 },
'1 Cronici': { name: '1 Cronici', abbreviation: '1CH', testament: 'OT', orderNum: 13 },
'2 Cronici': { name: '2 Cronici', abbreviation: '2CH', testament: 'OT', orderNum: 14 },
'Ezra': { name: 'Ezra', abbreviation: 'EZR', testament: 'OT', orderNum: 15 },
'Neemia': { name: 'Neemia', abbreviation: 'NEH', testament: 'OT', orderNum: 16 },
'Estera': { name: 'Estera', abbreviation: 'EST', testament: 'OT', orderNum: 17 },
'Iov': { name: 'Iov', abbreviation: 'JOB', testament: 'OT', orderNum: 18 },
'Psalmii': { name: 'Psalmii', abbreviation: 'PSA', testament: 'OT', orderNum: 19 },
'Proverbe': { name: 'Proverbele', abbreviation: 'PRO', testament: 'OT', orderNum: 20 },
'Eclesiastul': { name: 'Eclesiastul', abbreviation: 'ECC', testament: 'OT', orderNum: 21 },
'Cântarea Cântărilor': { name: 'Cântarea Cântărilor', abbreviation: 'SNG', testament: 'OT', orderNum: 22 },
'Isaia': { name: 'Isaia', abbreviation: 'ISA', testament: 'OT', orderNum: 23 },
'Ieremia': { name: 'Ieremia', abbreviation: 'JER', testament: 'OT', orderNum: 24 },
'Plângerile': { name: 'Plângerile', abbreviation: 'LAM', testament: 'OT', orderNum: 25 },
'Ezechiel': { name: 'Ezechiel', abbreviation: 'EZK', testament: 'OT', orderNum: 26 },
'Daniel': { name: 'Daniel', abbreviation: 'DAN', testament: 'OT', orderNum: 27 },
'Osea': { name: 'Osea', abbreviation: 'HOS', testament: 'OT', orderNum: 28 },
'Ioel': { name: 'Ioel', abbreviation: 'JOL', testament: 'OT', orderNum: 29 },
'Amos': { name: 'Amos', abbreviation: 'AMO', testament: 'OT', orderNum: 30 },
'Obadia': { name: 'Obadia', abbreviation: 'OBA', testament: 'OT', orderNum: 31 },
'Iona': { name: 'Iona', abbreviation: 'JON', testament: 'OT', orderNum: 32 },
'Mica': { name: 'Mica', abbreviation: 'MIC', testament: 'OT', orderNum: 33 },
'Naum': { name: 'Naum', abbreviation: 'NAM', testament: 'OT', orderNum: 34 },
'Habacuc': { name: 'Habacuc', abbreviation: 'HAB', testament: 'OT', orderNum: 35 },
'Țefania': { name: 'Țefania', abbreviation: 'ZEP', testament: 'OT', orderNum: 36 },
'Hagai': { name: 'Hagai', abbreviation: 'HAG', testament: 'OT', orderNum: 37 },
'Zaharia': { name: 'Zaharia', abbreviation: 'ZEC', testament: 'OT', orderNum: 38 },
'Maleahi': { name: 'Maleahi', abbreviation: 'MAL', testament: 'OT', orderNum: 39 },
// New Testament
'Matei': { name: 'Matei', abbreviation: 'MAT', testament: 'NT', orderNum: 40 },
'Marcu': { name: 'Marcu', abbreviation: 'MRK', testament: 'NT', orderNum: 41 },
'Luca': { name: 'Luca', abbreviation: 'LUK', testament: 'NT', orderNum: 42 },
'Ioan': { name: 'Ioan', abbreviation: 'JHN', testament: 'NT', orderNum: 43 },
'Faptele Apostolilor': { name: 'Faptele Apostolilor', abbreviation: 'ACT', testament: 'NT', orderNum: 44 },
'Romani': { name: 'Romani', abbreviation: 'ROM', testament: 'NT', orderNum: 45 },
'1 Corinteni': { name: '1 Corinteni', abbreviation: '1CO', testament: 'NT', orderNum: 46 },
'2 Corinteni': { name: '2 Corinteni', abbreviation: '2CO', testament: 'NT', orderNum: 47 },
'Galateni': { name: 'Galateni', abbreviation: 'GAL', testament: 'NT', orderNum: 48 },
'Efeseni': { name: 'Efeseni', abbreviation: 'EPH', testament: 'NT', orderNum: 49 },
'Filipeni': { name: 'Filipeni', abbreviation: 'PHP', testament: 'NT', orderNum: 50 },
'Coloseni': { name: 'Coloseni', abbreviation: 'COL', testament: 'NT', orderNum: 51 },
'1 Tesaloniceni': { name: '1 Tesaloniceni', abbreviation: '1TH', testament: 'NT', orderNum: 52 },
'2 Tesaloniceni': { name: '2 Tesaloniceni', abbreviation: '2TH', testament: 'NT', orderNum: 53 },
'1 Timotei': { name: '1 Timotei', abbreviation: '1TI', testament: 'NT', orderNum: 54 },
'2 Timotei': { name: '2 Timotei', abbreviation: '2TI', testament: 'NT', orderNum: 55 },
'Titus': { name: 'Titus', abbreviation: 'TIT', testament: 'NT', orderNum: 56 },
'Filimon': { name: 'Filimon', abbreviation: 'PHM', testament: 'NT', orderNum: 57 },
'Evrei': { name: 'Evrei', abbreviation: 'HEB', testament: 'NT', orderNum: 58 },
'Iacov': { name: 'Iacov', abbreviation: 'JAS', testament: 'NT', orderNum: 59 },
'1 Petru': { name: '1 Petru', abbreviation: '1PE', testament: 'NT', orderNum: 60 },
'2 Petru': { name: '2 Petru', abbreviation: '2PE', testament: 'NT', orderNum: 61 },
'1 Ioan': { name: '1 Ioan', abbreviation: '1JN', testament: 'NT', orderNum: 62 },
'2 Ioan': { name: '2 Ioan', abbreviation: '2JN', testament: 'NT', orderNum: 63 },
'3 Ioan': { name: '3 Ioan', abbreviation: '3JN', testament: 'NT', orderNum: 64 },
'Iuda': { name: 'Iuda', abbreviation: 'JUD', testament: 'NT', orderNum: 65 },
'Revelaţia': { name: 'Revelația', abbreviation: 'REV', testament: 'NT', orderNum: 66 },
}
interface ParsedVerse {
verseNum: number
text: string
}
interface ParsedChapter {
chapterNum: number
verses: ParsedVerse[]
}
interface ParsedBook {
name: string
chapters: ParsedChapter[]
}
async function parseRomanianBible(filePath: string): Promise<ParsedBook[]> {
console.log(`Reading Romanian Bible from: ${filePath}`)
const content = fs.readFileSync(filePath, 'utf-8')
const lines = content.split('\n')
const books: ParsedBook[] = []
let currentBook: ParsedBook | null = null
let currentChapter: ParsedChapter | null = null
let isInBibleContent = false
for (let i = 0; i < lines.length; i++) {
const line = lines[i].trim()
// Start processing after "VECHIUL TESTAMENT"
if (line === 'VECHIUL TESTAMENT' || line === 'TESTAMENT') {
isInBibleContent = true
continue
}
if (!isInBibleContent) continue
// Book detection: … BookName …
const bookMatch = line.match(/^…\s*(.+?)\s*…$/)
if (bookMatch) {
// Save previous book if exists
if (currentBook && currentBook.chapters.length > 0) {
books.push(currentBook)
}
const bookName = bookMatch[1].trim()
console.log(`Found book: ${bookName}`)
currentBook = {
name: bookName,
chapters: []
}
currentChapter = null
continue
}
// Chapter detection: Capitolul X or CApitoLuL X
const chapterMatch = line.match(/^[cC][aA][pP][iI][tT][oO][lL][uU][lL]\s+(\d+)$/i)
if (chapterMatch && currentBook) {
// Save previous chapter if exists
if (currentChapter && currentChapter.verses.length > 0) {
currentBook.chapters.push(currentChapter)
}
const chapterNum = parseInt(chapterMatch[1])
console.log(` Chapter ${chapterNum}`)
currentChapter = {
chapterNum,
verses: []
}
continue
}
// Verse detection: starts with number
const verseMatch = line.match(/^(\d+)\s+(.+)$/)
if (verseMatch && currentChapter) {
const verseNum = parseInt(verseMatch[1])
let verseText = verseMatch[2].trim()
// Handle paragraph markers
verseText = verseText.replace(/^¶\s*/, '')
// Look ahead for continuation lines (lines that don't start with numbers or special markers)
let j = i + 1
while (j < lines.length) {
const nextLine = lines[j].trim()
// Stop if we hit a new verse, chapter, book, or empty line
if (!nextLine ||
nextLine.match(/^\d+\s/) || // New verse
nextLine.match(/^[cC][aA][pP][iI][tT][oO][lL][uU][lL]\s+\d+$/i) || // New chapter
nextLine.match(/^….*…$/) || // New book
nextLine === 'TESTAMENT') { // Testament marker
break
}
// Add continuation line
verseText += ' ' + nextLine
j++
}
// Clean up the text
verseText = verseText.replace(/\s+/g, ' ').trim()
currentChapter.verses.push({
verseNum,
text: verseText
})
// Skip the lines we've processed
i = j - 1
continue
}
}
// Save the last book and chapter
if (currentChapter && currentChapter.verses.length > 0 && currentBook) {
currentBook.chapters.push(currentChapter)
}
if (currentBook && currentBook.chapters.length > 0) {
books.push(currentBook)
}
console.log(`Parsed ${books.length} books`)
return books
}
async function importRomanianBible() {
try {
console.log('Starting Romanian Bible import...')
// Clear existing data
console.log('Clearing existing data...')
await prisma.bibleVerse.deleteMany()
await prisma.bibleChapter.deleteMany()
await prisma.bibleBook.deleteMany()
// Parse the markdown file
const filePath = path.join(process.cwd(), 'bibles', 'Biblia-Fidela-limba-romana.md')
const books = await parseRomanianBible(filePath)
console.log(`Importing ${books.length} books into database...`)
for (const book of books) {
const bookInfo = BOOK_MAPPINGS[book.name]
if (!bookInfo) {
console.warn(`Warning: No mapping found for book "${book.name}", skipping...`)
continue
}
console.log(`Creating book: ${bookInfo.name}`)
// Create book
const createdBook = await prisma.bibleBook.create({
data: {
id: bookInfo.orderNum,
name: bookInfo.name,
testament: bookInfo.testament,
orderNum: bookInfo.orderNum
}
})
// Create chapters and verses
for (const chapter of book.chapters) {
console.log(` Creating chapter ${chapter.chapterNum} with ${chapter.verses.length} verses`)
const createdChapter = await prisma.bibleChapter.create({
data: {
bookId: createdBook.id,
chapterNum: chapter.chapterNum
}
})
// Create verses in batch (deduplicate by verse number)
const uniqueVerses = chapter.verses.reduce((acc, verse) => {
acc[verse.verseNum] = verse // This will overwrite duplicates
return acc
}, {} as Record<number, ParsedVerse>)
const versesData = Object.values(uniqueVerses).map(verse => ({
chapterId: createdChapter.id,
verseNum: verse.verseNum,
text: verse.text,
version: 'FIDELA'
}))
if (versesData.length > 0) {
await prisma.bibleVerse.createMany({
data: versesData
})
}
}
}
// Print summary
const bookCount = await prisma.bibleBook.count()
const chapterCount = await prisma.bibleChapter.count()
const verseCount = await prisma.bibleVerse.count()
console.log('\n✅ Romanian Bible import completed successfully!')
console.log(`📚 Books imported: ${bookCount}`)
console.log(`📖 Chapters imported: ${chapterCount}`)
console.log(`📝 Verses imported: ${verseCount}`)
} catch (error) {
console.error('❌ Error importing Romanian Bible:', error)
throw error
} finally {
await prisma.$disconnect()
}
}
// Run the import
if (require.main === module) {
importRomanianBible()
.then(() => {
console.log('Import completed successfully!')
process.exit(0)
})
.catch((error) => {
console.error('Import failed:', error)
process.exit(1)
})
}
export { importRomanianBible }

View File

@@ -0,0 +1,231 @@
import os, re, json, math, time, asyncio
from typing import List, Dict, Tuple, Iterable
from dataclasses import dataclass
from pathlib import Path
from dotenv import load_dotenv
import httpx
import psycopg
from psycopg.rows import dict_row
load_dotenv()
AZ_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT", "").rstrip("/")
AZ_API_KEY = os.getenv("AZURE_OPENAI_KEY")
AZ_API_VER = os.getenv("AZURE_OPENAI_API_VERSION", "2024-05-01-preview")
AZ_DEPLOYMENT = os.getenv("AZURE_OPENAI_EMBED_DEPLOYMENT", "embed-3")
EMBED_DIMS = int(os.getenv("EMBED_DIMS", "3072"))
DB_URL = os.getenv("DATABASE_URL")
BIBLE_MD_PATH = os.getenv("BIBLE_MD_PATH")
LANG_CODE = os.getenv("LANG_CODE", "ro")
TRANSLATION = os.getenv("TRANSLATION_CODE", "FIDELA")
assert AZ_ENDPOINT and AZ_API_KEY and DB_URL and BIBLE_MD_PATH, "Missing required env vars"
EMBED_URL = f"{AZ_ENDPOINT}/openai/deployments/{AZ_DEPLOYMENT}/embeddings?api-version={AZ_API_VER}"
BOOKS_OT = [
"Geneza","Exodul","Leviticul","Numeri","Deuteronom","Iosua","Judecători","Rut",
"1 Samuel","2 Samuel","1 Imparati","2 Imparati","1 Cronici","2 Cronici","Ezra","Neemia","Estera",
"Iov","Psalmii","Proverbe","Eclesiastul","Cântarea Cântărilor","Isaia","Ieremia","Plângerile",
"Ezechiel","Daniel","Osea","Ioel","Amos","Obadia","Iona","Mica","Naum","Habacuc","Țefania","Hagai","Zaharia","Maleahi"
]
BOOKS_NT = [
"Matei","Marcu","Luca","Ioan","Faptele Apostolilor","Romani","1 Corinteni","2 Corinteni",
"Galateni","Efeseni","Filipeni","Coloseni","1 Tesaloniceni","2 Tesaloniceni","1 Timotei","2 Timotei",
"Titus","Filimon","Evrei","Iacov","1 Petru","2 Petru","1 Ioan","2 Ioan","3 Ioan","Iuda","Revelaţia"
]
BOOK_CANON = {b:("OT" if b in BOOKS_OT else "NT") for b in BOOKS_OT + BOOKS_NT}
@dataclass
class Verse:
testament: str
book: str
chapter: int
verse: int
text_raw: str
text_norm: str
def normalize_text(s: str) -> str:
s = re.sub(r"\s+", " ", s.strip())
s = s.replace(" ", " ")
return s
BOOK_RE = re.compile(r"^(?P<book>[A-ZĂÂÎȘȚ][^\n]+?)\s*$")
CH_RE = re.compile(r"^(?i:Capitolul|CApitoLuL)\s+(?P<ch>\d+)\b")
VERSE_RE = re.compile(r"^(?P<v>\d+)\s+(?P<body>.+)$")
def parse_bible_md(md_text: str):
cur_book, cur_ch = None, None
testament = None
is_in_bible_content = False
for line in md_text.splitlines():
line = line.rstrip()
# Start processing after "VECHIUL TESTAMENT" or when we find book markers
if line == 'VECHIUL TESTAMENT' or line == 'TESTAMENT' or '' in line:
is_in_bible_content = True
if not is_in_bible_content:
continue
# Book detection: … BookName …
book_match = re.match(r'^…\s*(.+?)\s*…$', line)
if book_match:
bname = book_match.group(1).strip()
if bname in BOOK_CANON:
cur_book = bname
testament = BOOK_CANON[bname]
cur_ch = None
print(f"Found book: {bname}")
continue
# Chapter detection: Capitolul X or CApitoLuL X
m_ch = CH_RE.match(line)
if m_ch and cur_book:
cur_ch = int(m_ch.group("ch"))
print(f" Chapter {cur_ch}")
continue
# Verse detection: starts with number
m_v = VERSE_RE.match(line)
if m_v and cur_book and cur_ch:
vnum = int(m_v.group("v"))
body = m_v.group("body").strip()
# Remove paragraph markers
body = re.sub(r'\s*', '', body)
raw = body
norm = normalize_text(body)
yield {
"testament": testament, "book": cur_book, "chapter": cur_ch, "verse": vnum,
"text_raw": raw, "text_norm": norm
}
async def embed_batch(client, inputs):
payload = {"input": inputs}
headers = {"api-key": AZ_API_KEY, "Content-Type": "application/json"}
for attempt in range(6):
try:
r = await client.post(EMBED_URL, headers=headers, json=payload, timeout=60)
if r.status_code == 200:
data = r.json()
ordered = sorted(data["data"], key=lambda x: x["index"])
return [d["embedding"] for d in ordered]
elif r.status_code in (429, 500, 503):
backoff = 2 ** attempt + (0.1 * attempt)
print(f"Rate limited, waiting {backoff:.1f}s...")
await asyncio.sleep(backoff)
else:
raise RuntimeError(f"Embedding error {r.status_code}: {r.text}")
except Exception as e:
backoff = 2 ** attempt + (0.1 * attempt)
print(f"Error on attempt {attempt + 1}: {e}, waiting {backoff:.1f}s...")
await asyncio.sleep(backoff)
raise RuntimeError("Failed to embed after retries")
# First, we need to create the table with proper SQL
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS bible_passages (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
testament TEXT NOT NULL,
book TEXT NOT NULL,
chapter INT NOT NULL,
verse INT NOT NULL,
ref TEXT GENERATED ALWAYS AS (book || ' ' || chapter || ':' || verse) STORED,
lang TEXT NOT NULL DEFAULT 'ro',
translation TEXT NOT NULL DEFAULT 'FIDELA',
text_raw TEXT NOT NULL,
text_norm TEXT NOT NULL,
tsv tsvector,
embedding vector(1536),
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);
"""
CREATE_INDEXES_SQL = """
-- Uniqueness by canonical reference within translation/language
CREATE UNIQUE INDEX IF NOT EXISTS ux_ref_lang ON bible_passages (translation, lang, book, chapter, verse);
-- Full-text index
CREATE INDEX IF NOT EXISTS idx_tsv ON bible_passages USING GIN (tsv);
-- Other indexes
CREATE INDEX IF NOT EXISTS idx_book_ch ON bible_passages (book, chapter);
CREATE INDEX IF NOT EXISTS idx_testament ON bible_passages (testament);
"""
UPSERT_SQL = """
INSERT INTO bible_passages (testament, book, chapter, verse, lang, translation, text_raw, text_norm, tsv, embedding)
VALUES (%(testament)s, %(book)s, %(chapter)s, %(verse)s, %(lang)s, %(translation)s, %(text_raw)s, %(text_norm)s,
to_tsvector(COALESCE(%(ts_lang)s,'simple')::regconfig, %(text_norm)s), %(embedding)s)
ON CONFLICT (translation, lang, book, chapter, verse) DO UPDATE
SET text_raw=EXCLUDED.text_raw,
text_norm=EXCLUDED.text_norm,
tsv=EXCLUDED.tsv,
embedding=EXCLUDED.embedding,
updated_at=now();
"""
async def main():
print("Starting Bible embedding ingestion...")
md_text = Path(BIBLE_MD_PATH).read_text(encoding="utf-8", errors="ignore")
verses = list(parse_bible_md(md_text))
print(f"Parsed verses: {len(verses)}")
batch_size = 128
# First create the table structure
with psycopg.connect(DB_URL) as conn:
with conn.cursor() as cur:
print("Creating bible_passages table...")
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute(CREATE_TABLE_SQL)
cur.execute(CREATE_INDEXES_SQL)
conn.commit()
print("Table created successfully")
# Now process embeddings
async with httpx.AsyncClient() as client:
with psycopg.connect(DB_URL, autocommit=False) as conn:
with conn.cursor() as cur:
for i in range(0, len(verses), batch_size):
batch = verses[i:i+batch_size]
inputs = [v["text_norm"] for v in batch]
print(f"Generating embeddings for batch {i//batch_size + 1}/{(len(verses) + batch_size - 1)//batch_size}")
embs = await embed_batch(client, inputs)
rows = []
for v, e in zip(batch, embs):
rows.append({
**v,
"lang": LANG_CODE,
"translation": TRANSLATION,
"ts_lang": "romanian",
"embedding": e
})
cur.executemany(UPSERT_SQL, rows)
conn.commit()
print(f"Upserted {len(rows)} verses... {i+len(rows)}/{len(verses)}")
# Create IVFFLAT index after data is loaded
print("Creating IVFFLAT index...")
with psycopg.connect(DB_URL, autocommit=True) as conn:
with conn.cursor() as cur:
cur.execute("VACUUM ANALYZE bible_passages;")
cur.execute("""
CREATE INDEX IF NOT EXISTS idx_vec_ivfflat
ON bible_passages USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 200);
""")
print("✅ Bible embedding ingestion completed successfully!")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,372 @@
# Azure OpenAI **embed-3** → Postgres + pgvector Ingestion Guide (Bible Corpus)
**Goal**: Create a productionready Python script that ingests the full Bible (Markdown source) into **Postgres** with **pgvector** and **fulltext** metadata, using **Azure OpenAI `embed-3`** embeddings. The vectors will power a consumer chat assistant (Q&A & conversations about the Bible) and a backend agent that generates custom prayers.
> Sample corpus used here: Romanian *Biblia Fidela* (Markdown). Structure contains books, chapters, verses (e.g., *Geneza 1:1…*) and a TOC in the file. fileciteturn0file0
---
## 0) Architecture at a glance
- **Input**: Bible in Markdown (`*.md`) → parser → normalized records: *(book, chapter, verse, text, lang=ro)*
- **Embedding**: Azure OpenAI **embed-3** (prefer `text-embedding-3-large`, 3072D). Batch inputs to cut cost/latency.
- **Storage**: Postgres with:
- `pgvector` column `embedding vector(3072)`
- `tsvector` column for hybrid lexical search (Romanian or English config as needed)
- metadata columns for fast filtering (book, chapter, verse, testament, translation, language)
- **Indexes**: `ivfflat` over `embedding`, GIN over `tsv` (and btree over metadata)
- **Retrieval**:
- Dense vector kNN
- Hybrid: combine kNN score + BM25/tsvector
- Windowed context stitching (neighbor verses) for chat
- **Consumers**:
- Chat assistant: answer + cite (book:chapter:verse).
- Prayer agent: promptcompose with retrieved passages & user intents.
---
## 1) Prerequisites
### Postgres + pgvector
```bash
# Install pgvector (on Ubuntu)
sudo apt-get update && sudo apt-get install -y postgresql postgresql-contrib
# In psql as superuser:
CREATE EXTENSION IF NOT EXISTS vector;
```
### Python deps
```bash
python -m venv .venv && source .venv/bin/activate
pip install psycopg[binary] pgvector pydantic python-dotenv httpx tqdm rapidfuzz
```
> `httpx` for HTTP (asynccapable), `pgvector` adapter, `rapidfuzz` for optional dedup or heuristic joins, `tqdm` for progress.
### Azure OpenAI
- Create **Embeddings** deployment for **`text-embedding-3-large`** (or `-small` if cost sensitive). Name it (e.g.) `embeddings`.
- Collect:
- `AZURE_OPENAI_ENDPOINT=https://<your>.openai.azure.com/`
- `AZURE_OPENAI_API_KEY=...`
- `AZURE_OPENAI_API_VERSION=2024-05-01-preview` *(or your current stable)*
- `AZURE_OPENAI_EMBED_DEPLOYMENT=embeddings` *(your deployment name)*
Create `.env`:
```env
DATABASE_URL=postgresql://user:pass@localhost:5432/bible
AZURE_OPENAI_ENDPOINT=https://YOUR_RESOURCE.openai.azure.com/
AZURE_OPENAI_API_KEY=YOUR_KEY
AZURE_OPENAI_API_VERSION=2024-05-01-preview
AZURE_OPENAI_EMBED_DEPLOYMENT=embeddings
EMBED_DIMS=3072
BIBLE_MD_PATH=./Biblia-Fidela-limba-romana.md
LANG_CODE=ro
TRANSLATION_CODE=FIDELA
```
---
## 2) Database schema
```sql
-- One-time setup in your database
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS bible_passages (
id BIGSERIAL PRIMARY KEY,
testament TEXT NOT NULL, -- 'OT' or 'NT'
book TEXT NOT NULL,
chapter INT NOT NULL,
verse INT NOT NULL,
ref TEXT GENERATED ALWAYS AS (book || ' ' || chapter || ':' || verse) STORED,
lang TEXT NOT NULL DEFAULT 'ro',
translation TEXT NOT NULL DEFAULT 'FIDELA',
text_raw TEXT NOT NULL, -- exact verse text
text_norm TEXT NOT NULL, -- normalized/cleaned text (embedding input)
tsv tsvector,
embedding vector(3072), -- 1536 if using embed-3-small
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);
-- Uniqueness by canonical reference within translation/language
CREATE UNIQUE INDEX IF NOT EXISTS ux_ref_lang ON bible_passages (translation, lang, book, chapter, verse);
-- Full-text index (choose config; Romanian available if installed via ISPELL; else use 'simple' or 'english')
-- If you have pg_catalog.romanian, use that. Else fallback to 'simple' but keep lexemes.
CREATE INDEX IF NOT EXISTS idx_tsv ON bible_passages USING GIN (tsv);
-- Vector index (choose nlist to match data size; we set after populating table)
-- First create a flat index for small data, or IVFFLAT for scale:
-- Requires ANALYZE beforehand and SET enable_seqscan=off for kNN plans.
```
After loading, build the IVFFLAT index (the table must be populated first):
```sql
-- Example: around 31k verses ⇒ nlist ~ 100200 is reasonable; tune per EXPLAIN ANALYZE
CREATE INDEX IF NOT EXISTS idx_vec_ivfflat
ON bible_passages USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 200);
```
Trigger to keep `updated_at` fresh:
```sql
CREATE OR REPLACE FUNCTION touch_updated_at() RETURNS TRIGGER AS $$
BEGIN NEW.updated_at = now(); RETURN NEW; END; $$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS trg_bible_updated ON bible_passages;
CREATE TRIGGER trg_bible_updated BEFORE UPDATE ON bible_passages
FOR EACH ROW EXECUTE PROCEDURE touch_updated_at();
```
---
## 3) Parsing & Chunking strategy (large, highquality)
**Why verselevel?** Its the canonical granular unit for Bible QA.
**Contextstitching**: during retrieval, fetch neighbor verses (±N) to maintain narrative continuity.
**Normalization** steps (for `text_norm`):
- Strip verse numbers and sidenotes if present in raw lines.
- Collapse whitespace, unify quotes, remove page headers/footers and TOC artifacts.
- Preserve punctuation; avoid stemming before embeddings.
- Lowercasing optional (OpenAI embeddings are case-robust).
**Testament/book detection**: From headings and TOC present in the Markdown; detect Book → Chapter → Verse boundaries via regex.
Example regex heuristics (tune to your file):
- Book headers: `^(?P<book>[A-ZĂÂÎȘȚ].+?)\s*$` (bounded by known canon order)
- Chapter headers: `^Capitolul\s+(?P<ch>\d+)` or `^CApitoLuL\s+(?P<ch>\d+)` (case variations)
- Verse lines: `^(?P<verse>\d+)\s+(.+)$`
> The provided Markdown clearly shows book order (e.g., *Geneza*, *Exodul*, …; NT: *Matei*, *Marcu*, …) and verse lines like “**1** LA început…”. fileciteturn0file0
---
## 4) Python ingestion script
> **Save as** `ingest_bible_pgvector.py`
```python
import os, re, json, math, time, asyncio
from typing import List, Dict, Tuple, Iterable
from dataclasses import dataclass
from pathlib import Path
from dotenv import load_dotenv
import httpx
import psycopg
from psycopg.rows import dict_row
load_dotenv()
AZ_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT", "").rstrip("/")
AZ_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZ_API_VER = os.getenv("AZURE_OPENAI_API_VERSION", "2024-05-01-preview")
AZ_DEPLOYMENT = os.getenv("AZURE_OPENAI_EMBED_DEPLOYMENT", "embeddings")
EMBED_DIMS = int(os.getenv("EMBED_DIMS", "3072"))
DB_URL = os.getenv("DATABASE_URL")
BIBLE_MD_PATH = os.getenv("BIBLE_MD_PATH")
LANG_CODE = os.getenv("LANG_CODE", "ro")
TRANSLATION = os.getenv("TRANSLATION_CODE", "FIDELA")
assert AZ_ENDPOINT and AZ_API_KEY and DB_URL and BIBLE_MD_PATH, "Missing required env vars"
EMBED_URL = f"{AZ_ENDPOINT}/openai/deployments/{AZ_DEPLOYMENT}/embeddings?api-version={AZ_API_VER}"
BOOKS_OT = [
"Geneza","Exodul","Leviticul","Numeri","Deuteronom","Iosua","Judecători","Rut",
"1 Samuel","2 Samuel","1 Imparati","2 Imparati","1 Cronici","2 Cronici","Ezra","Neemia","Estera",
"Iov","Psalmii","Proverbe","Eclesiastul","Cântarea Cântărilor","Isaia","Ieremia","Plângerile",
"Ezechiel","Daniel","Osea","Ioel","Amos","Obadia","Iona","Mica","Naum","Habacuc","Țefania","Hagai","Zaharia","Maleahi"
]
BOOKS_NT = [
"Matei","Marcu","Luca","Ioan","Faptele Apostolilor","Romani","1 Corinteni","2 Corinteni",
"Galateni","Efeseni","Filipeni","Coloseni","1 Tesaloniceni","2 Tesaloniceni","1 Timotei","2 Timotei",
"Titus","Filimon","Evrei","Iacov","1 Petru","2 Petru","1 Ioan","2 Ioan","3 Ioan","Iuda","Revelaţia"
]
BOOK_CANON = {b:("OT" if b in BOOKS_OT else "NT") for b in BOOKS_OT + BOOKS_NT}
@dataclass
class Verse:
testament: str
book: str
chapter: int
verse: int
text_raw: str
text_norm: str
def normalize_text(s: str) -> str:
s = re.sub(r"\s+", " ", s.strip())
s = s.replace(" ", " ")
return s
BOOK_RE = re.compile(r"^(?P<book>[A-ZĂÂÎȘȚ][^\n]+?)\s*$")
CH_RE = re.compile(r"^(?i:Capitolul|CApitoLuL)\s+(?P<ch>\d+)\b")
VERSE_RE = re.compile(r"^(?P<v>\d+)\s+(?P<body>.+)$")
def parse_bible_md(md_text: str):
cur_book, cur_ch = None, None
testament = None
for line in md_text.splitlines():
line = line.rstrip()
# Book detection
m_book = BOOK_RE.match(line)
if m_book:
bname = m_book.group("book").strip()
if bname in BOOK_CANON:
cur_book = bname
testament = BOOK_CANON[bname]
cur_ch = None
continue
m_ch = CH_RE.match(line)
if m_ch and cur_book:
cur_ch = int(m_ch.group("ch"))
continue
m_v = VERSE_RE.match(line)
if m_v and cur_book and cur_ch:
vnum = int(m_v.group("v"))
body = m_v.group("body").strip()
raw = body
norm = normalize_text(body)
yield {
"testament": testament, "book": cur_book, "chapter": cur_ch, "verse": vnum,
"text_raw": raw, "text_norm": norm
}
async def embed_batch(client, inputs):
payload = {"input": inputs}
headers = {"api-key": AZ_API_KEY, "Content-Type": "application/json"}
for attempt in range(6):
try:
r = await client.post(EMBED_URL, headers=headers, json=payload, timeout=60)
if r.status_code == 200:
data = r.json()
ordered = sorted(data["data"], key=lambda x: x["index"])
return [d["embedding"] for d in ordered]
elif r.status_code in (429, 500, 503):
backoff = 2 ** attempt + (0.1 * attempt)
await asyncio.sleep(backoff)
else:
raise RuntimeError(f"Embedding error {r.status_code}: {r.text}")
except Exception:
backoff = 2 ** attempt + (0.1 * attempt)
await asyncio.sleep(backoff)
raise RuntimeError("Failed to embed after retries")
UPSERT_SQL = """
INSERT INTO bible_passages (testament, book, chapter, verse, lang, translation, text_raw, text_norm, tsv, embedding)
VALUES (%(testament)s, %(book)s, %(chapter)s, %(verse)s, %(lang)s, %(translation)s, %(text_raw)s, %(text_norm)s,
to_tsvector(COALESCE(%(ts_lang)s,'simple')::regconfig, %(text_norm)s), %(embedding)s)
ON CONFLICT (translation, lang, book, chapter, verse) DO UPDATE
SET text_raw=EXCLUDED.text_raw,
text_norm=EXCLUDED.text_norm,
tsv=EXCLUDED.tsv,
embedding=EXCLUDED.embedding,
updated_at=now();
"""
async def main():
md_text = Path(BIBLE_MD_PATH).read_text(encoding="utf-8", errors="ignore")
verses = list(parse_bible_md(md_text))
print(f"Parsed verses: {len(verses)}")
batch_size = 128
async with httpx.AsyncClient() as client, psycopg.connect(DB_URL, autocommit=False) as conn:
with conn.cursor() as cur:
for i in range(0, len(verses), batch_size):
batch = verses[i:i+batch_size]
inputs = [v["text_norm"] for v in batch]
embs = await embed_batch(client, inputs)
rows = []
for v, e in zip(batch, embs):
rows.append({
**v,
"lang": os.getenv("LANG_CODE","ro"),
"translation": os.getenv("TRANSLATION_CODE","FIDELA"),
"ts_lang": "romanian",
"embedding": e
})
cur.executemany(UPSERT_SQL, rows)
conn.commit()
print(f"Upserted {len(rows)}{i+len(rows)}/{len(verses)}")
print("Done. Build IVFFLAT index after ANALYZE.")
if __name__ == "__main__":
import asyncio
asyncio.run(main())
```
**Notes**
- If `romanian` text search config is unavailable, set `ts_lang='simple'`.
- For `embed-3-small`, set `EMBED_DIMS=1536` and change column type to `vector(1536)`.
---
## 5) Postingestion steps
```sql
VACUUM ANALYZE bible_passages;
CREATE INDEX IF NOT EXISTS idx_vec_ivfflat
ON bible_passages USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 200);
CREATE INDEX IF NOT EXISTS idx_book_ch ON bible_passages (book, chapter);
```
---
## 6) Retrieval patterns
### A) Pure vector kNN (cosine)
```sql
SELECT ref, book, chapter, verse, text_raw,
1 - (embedding <=> $1) AS cosine_sim
FROM bible_passages
ORDER BY embedding <=> $1
LIMIT $2;
```
### B) Hybrid lexical + vector (weighted)
```sql
WITH v AS (
SELECT id, 1 - (embedding <=> $1) AS vsim
FROM bible_passages
ORDER BY embedding <=> $1
LIMIT 100
),
l AS (
SELECT id, ts_rank(tsv, $2) AS lrank
FROM bible_passages
WHERE tsv @@ $2
)
SELECT bp.ref, bp.book, bp.chapter, bp.verse, bp.text_raw,
COALESCE(v.vsim, 0) * 0.7 + COALESCE(l.lrank, 0) * 0.3 AS score
FROM bible_passages bp
LEFT JOIN v ON v.id = bp.id
LEFT JOIN l ON l.id = bp.id
ORDER BY score DESC
LIMIT 20;
```
---
## 7) Chat & Prayer agent tips
- **Answer grounding**: always cite `ref` (e.g., *Ioan 3:16*).
- **Multilingual output**: keep quotes in Romanian; explain in the users language.
- **Prayer agent**: constrain tone & doctrine; inject retrieved verses as anchors.
---
## 8) Ops
- Idempotent `UPSERT`.
- Backoff on 429/5xx.
- Consider keeping both `embed-3-large` and `-small` columns when migrating.
---
## 9) License & attribution
This guide references the structure of *Biblia Fidela* Markdown for ingestion demonstration. fileciteturn0file0