AI Concepts (the building blocks)¶
Definitions for the primitives that the RAG, vector, and LLM-provider pages keep referring back to. If you only know these terms by reputation, this is the page that pins them down.
Embeddings¶
What they are. A model maps a chunk of text (or image, audio, …) to a fixed-length array of floating-point numbers — typically 384, 768, 1024, 1536, or 3072 dimensions. Two pieces of content that "mean the same thing" land near each other in that high-dimensional space.
Why they matter. Once content is embedded, you can do similarity search (cosine, dot-product, or Euclidean), clustering, classification, and recommendation — without fine-tuning a model.
Embedding models in this portfolio:
- text-embedding-3-small (OpenAI, 1536 dim) — default across 17+ apps. Cheap (~$0.02 / M tokens), good enough for most retrieval.
- text-embedding-3-large (OpenAI, 3072 dim) — used by GoGreen-DOC-AI. Higher recall on dense legal/document text. ~2× the storage cost, ~6.5× the API cost.
- HuggingFace sentence-transformers — every app has the SDK installed via the unified ai-providers service, but most use it as a fallback path rather than the primary embedder.
- TF-IDF (Automotive-Repair-Diagnosis-AI, PolyMarketAI) — not learned embeddings, but the same vector-of-numbers shape; computed locally with scikit-learn or a custom tokenizer.
- BM25 keyword vectors (WP-Plugin) — also a sparse "vector" in concept; no neural model involved.
Dimension matters. Storage cost = dimensions × 4 bytes × num_vectors. 3072-dim vectors are 4× the storage of 768-dim. Index build time scales similarly.
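A minimal sketch of producing embeddings and estimating their storage footprint, assuming the official openai Python client; the helper name embed is illustrative, not a portfolio API.

```python
# Minimal embedding sketch (assumes the official `openai` Python client;
# model name from the list above, helper name is illustrative).
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Return one 1536-dim float vector per input text."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

vectors = embed(["how do I cancel my subscription?",
                 "Subscription cancellation procedure"])

# Storage rule of thumb from above: dimensions × 4 bytes × num_vectors (float32)
print(vectors.shape[1] * 4 * len(vectors), "bytes")
```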
Cosine similarity (and friends)¶
What it is. Three common distance metrics over embedding vectors:
- Cosine similarity — the angle between two vectors, ignoring magnitude. The default for text embeddings because OpenAI/HuggingFace embeddings are roughly L2-normalized to magnitude 1, which makes cosine and dot-product equivalent.
- Euclidean / L2 — straight-line distance. Sensitive to magnitude, less common for normalized text embeddings.
- Inner product / dot product — fastest if vectors are already normalized; equivalent to cosine in that case.
In pgvector: <=> is cosine distance, <-> is L2 distance, <#> is negative inner product. Most portfolio apps use <=>.
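The three metrics in a few lines of numpy; the final assertion illustrates the claim above that cosine and dot-product coincide once vectors have unit length.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot(a, b):
    return float(np.dot(a, b))

def l2(a, b):
    return float(np.linalg.norm(a - b))

a = np.random.randn(1536); a /= np.linalg.norm(a)   # unit-normalized, like OpenAI embeddings
b = np.random.randn(1536); b /= np.linalg.norm(b)

assert abs(cosine(a, b) - dot(a, b)) < 1e-9          # identical when both vectors have norm 1
```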
Chunking¶
What it is. Documents are too long for an embedding model's context window (8K tokens for OpenAI v3 embeddings) and would dilute meaning anyway, so we split them into ~500–1000-token chunks before embedding.
Strategies seen in the portfolio:
- Fixed-size character split — simplest, used by many small apps. Bad at sentence boundaries.
- Recursive character text split (LangChain RecursiveCharacterTextSplitter) — used in GoGreen-DOC-AI, Automotive-Repair-Diagnosis-AI. Splits on paragraph → sentence → word in priority order, respecting boundaries.
- Sentence-window — embed a single sentence but retrieve a window around it. Good for citation precision; not directly named in any app's CLAUDE.md but used implicitly.
- Markdown-aware — split on # headings before falling back to length. Useful for long-form docs.
- Code-aware — split on function/class boundaries. Not used; no code-RAG app in the portfolio.
- Token-count — split by tokenizer count rather than characters. More predictable cost; used in apps that batch-embed.
- Overlap — most apps use 10–20% overlap between adjacent chunks so a fact at a boundary isn't lost.
Tradeoffs.
- Bigger chunks: more context per retrieval, fewer chunks total, easier on the vector DB. But higher noise — irrelevant prose drowns the signal.
- Smaller chunks: precise, but you may need to retrieve more of them, and a single answer may be split across chunks.
- The right answer is usually 500–1000 tokens with 100–200 overlap; tune per corpus.
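A configuration sketch for the recursive splitter, assuming the langchain-text-splitters package; the sizes just restate the rule of thumb above, and the file path is illustrative.

```python
# Assumes the `langchain-text-splitters` package; sizes restate the rule of thumb above.
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_document_text = open("manual.md").read()   # illustrative path: any long document

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # characters here; token-count splitting is more predictable for cost
    chunk_overlap=150,      # ~15% overlap so a fact at a boundary isn't lost
    separators=["\n\n", "\n", ". ", " ", ""],   # paragraph → line → sentence → word
)
chunks = splitter.split_text(long_document_text)
```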
BM25 (Best Match 25)¶
What it is. A sparse, keyword-based ranking function used by Lucene, Elasticsearch, Solr, and Postgres tsvector. Improvement over TF-IDF: adds saturation (the 10th occurrence of a word adds less than the 2nd) and document-length normalization.
Where it shows up:
- WP-Plugin uses it as the primary retriever (no embeddings).
- AI-Wordpress_Business-Directory, GoGreen-DOC-AI, OpenSentinel use it alongside dense embeddings in hybrid search.
Why it still matters in 2026. Embeddings can't match exact tokens — a SKU like "PRT-2024-A" doesn't have a meaningful embedding. BM25 catches those.
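A from-scratch sketch of the scoring function, mainly to show where saturation (k1) and length normalization (b) enter; production apps use Lucene/Elasticsearch/tsvector rather than anything like this.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query. k1 controls term-frequency
    saturation; b controls document-length normalization."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))          # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["brake", "pad", "replacement"], ["prt-2024-a", "part", "number"]]
print(bm25_scores(["prt-2024-a"], docs))   # the exact-token match BM25 catches and embeddings miss
```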
TF-IDF (Term Frequency × Inverse Document Frequency)¶
What it is. Older sparse retrieval — weight each term by how often it appears in this doc and how rare it is in the corpus. BM25 is a better default; TF-IDF is still useful when you want a fully local retriever with no neural model.
Used in: Automotive-Repair-Diagnosis-AI (in-memory), PolyMarketAI (scikit-learn).
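The scikit-learn version of a fully local sparse retriever; the corpus here is illustrative, not either app's actual data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

corpus = ["engine misfire on cold start", "replace brake pads", "coolant leak near radiator"]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)            # sparse TF-IDF vectors, L2-normalized

query_vec = vectorizer.transform(["cold start misfire"])
scores = linear_kernel(query_vec, doc_matrix).ravel()    # equals cosine similarity for normalized rows
print(scores.argsort()[::-1])                            # doc indices, best match first
```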
Hybrid search¶
What it is. Run dense vector search and sparse keyword search (BM25/TF-IDF) in parallel, then merge.
Merging strategies:
- Reciprocal Rank Fusion (RRF) — score(d) = Σ 1/(k + rank_i(d)) over each retriever. Hyperparameter k ≈ 60 by default. Robust, no per-corpus tuning.
- Weighted linear combination — α × dense_score + (1-α) × sparse_score. Needs score normalization first.
- Learned fusion — train a small model to pick. Overkill for portfolio scale.
Used in (hybrid is explicit): AI-Wordpress_Business-Directory, GoGreen-DOC-AI, OpenSentinel.
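A minimal RRF sketch; k=60 is the conventional default mentioned above, and the doc ids are illustrative.

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over retrievers of 1 / (k + rank(d))."""
    scores = defaultdict(float)
    for ranking in ranked_lists:                 # each ranking is a list of doc ids, best first
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits  = ["doc7", "doc2", "doc9"]           # from vector search
sparse_hits = ["doc2", "doc4", "doc7"]           # from BM25
print(rrf_merge([dense_hits, sparse_hits]))      # doc2 and doc7 rise to the top
```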
Reranking¶
What it is. A second-pass scoring step. Stage 1 (the embedding retrieval) returns 50–100 candidates cheaply; stage 2 reranks them using a more expensive but more accurate model that reads the (query, doc) pair together.
Reranker types:
- Cross-encoder — a transformer that takes [CLS] query [SEP] doc and outputs a relevance score. Used by FamilyChat.
- Cohere Rerank — managed API. Not used in this portfolio (Claude rerank is preferred).
- BGE-reranker / bge-reranker-v2 — open-source cross-encoders from BAAI. Available via HuggingFace.
- LLM-as-reranker — prompt the LLM to score each candidate. Ecom-Sales / GogreenSellerAI use Claude this way. High quality, expensive, scales to ~20 candidates max.
- ColBERT — late-interaction; not used.
- Metadata reranker — non-neural. Score a document based on heuristics (recency, source-type weight, user prefs). Boomer_AI, Recruiting_AI use this layer on top of cosine.
Used in (apps with explicit reranking): Boomer_AI, Ecom-Sales, FamilyChat, GoGreen-DOC-AI, GoGreen-SmartForms, GoGreenSourcingAI, NaggingWifeAI, OpenSentinel, Recruiting_AI, Salon-Digital-Assistant, Voting_NewAndImproved.
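A stage-2 sketch using a sentence-transformers cross-encoder; the model name is a common public checkpoint assumed for illustration, not necessarily what FamilyChat ships.

```python
# Stage 2 of retrieve-then-rerank, assuming the `sentence-transformers` package.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # illustrative model choice

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, doc) pair jointly and keep the best top_k."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# `candidates` would be the 50-100 chunks returned by the stage-1 vector search
```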
HyDE (Hypothetical Document Embeddings)¶
What it is. Instead of embedding the user's question, ask the LLM to generate a hypothetical answer and embed that. Then look for documents that match the hypothetical answer's embedding.
Why it works. A question and a real answer are written in different registers ("how do I cancel?" vs "Subscription cancellation procedure"). Two answers — one hypothetical and one real — are written in the same register, so they cluster better.
Used in: the most widely used retrieval pattern in the portfolio. See RAG-Patterns.md for the full app list (~17 apps).
Cost. One additional small-LLM call per query (gpt-4o-mini or Claude Haiku is plenty). Latency hit is ~300 ms, often cached.
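A HyDE sketch, assuming the openai client; vector_search stands in for whatever vector store the app uses.

```python
# HyDE sketch: embed a hypothetical answer instead of the question.
# Assumes the `openai` client; `vector_search` is a stand-in for the app's store.
from openai import OpenAI

client = OpenAI()

def hyde_retrieve(question: str, vector_search, k: int = 5):
    draft = client.chat.completions.create(
        model="gpt-4o-mini",                       # a small model is plenty here
        messages=[{"role": "user",
                   "content": f"Write a short passage that would answer: {question}"}],
    ).choices[0].message.content

    emb = client.embeddings.create(model="text-embedding-3-small",
                                   input=[draft]).data[0].embedding
    return vector_search(emb, k)                   # search with the answer-shaped vector
```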
Query rewriting / expansion¶
What it is. Reword or expand the user's query before retrieval.
Variants:
- Multi-query — generate N paraphrases, retrieve for each, dedupe and merge results.
- Step-back — rewrite the specific question into a more general one, then retrieve.
- Hypothetical example — generate examples of what the answer would look like (related to HyDE).
- Standalone-question rewrite — for chat apps; rewrite "what about the second one?" into a context-free query using history.
Used in: GoGreenMarketing (HyDE + query rewriting), OpenSentinel (one of nine techniques), implicit in most chat apps.
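A multi-query sketch under the same assumptions as the HyDE example; client is the same openai client, rrf_merge is the fusion function from the hybrid-search section, and vector_search again stands in for the store (here it returns a ranked list of doc ids).

```python
def multi_query_retrieve(question: str, vector_search, n: int = 3, k: int = 10):
    """Generate n paraphrases, retrieve for each, then fuse the rankings."""
    prompt = f"Rewrite the following question in {n} different ways, one per line:\n{question}"
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    paraphrases = [line.strip() for line in reply.splitlines() if line.strip()][:n]
    rankings = []
    for q in [question] + paraphrases:
        emb = client.embeddings.create(model="text-embedding-3-small",
                                       input=[q]).data[0].embedding
        rankings.append(vector_search(emb, k))
    return rrf_merge(rankings)       # dedupe + merge: a doc found by several paraphrases wins
```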
Tool use / function calling¶
What it is. Modern LLMs (Claude, GPT-4o, Gemini) can be given a list of tool schemas (JSON schema). The model returns a structured tool_use block instead of text; the runtime executes the tool and feeds the result back. Loop until the model returns a text answer.
Vendor names for the same idea:
- OpenAI: "function calling" (deprecated name) → "tool use."
- Anthropic Claude: "tool use."
- Google Gemini: "function calling."
- All compatible at the schema level.
Used in: Tutor_AI-Docker (10 tools, 3-iteration cap), AI-Wordpress_Business-Directory (8 tools), Recruiting_AI, OpenSentinel, GoGreen-Workflow-Hub (the Workflow-Hub is a giant tool registry).
Iteration cap. Always cap the loop. Tutor_AI's max_iterations=3 is a sensible default for chat. Without it the model can loop indefinitely on a malformed schema.
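A capped tool-use loop, assuming the anthropic Python client; the weather tool, the model id, and the run_tool dispatcher are illustrative, not Tutor_AI's actual tools.

```python
# Tool-use loop with an iteration cap, assuming the `anthropic` Python client.
import anthropic

client = anthropic.Anthropic()
tools = [{
    "name": "get_weather",                        # illustrative tool, not from Tutor_AI
    "description": "Current weather for a city",
    "input_schema": {"type": "object",
                     "properties": {"city": {"type": "string"}},
                     "required": ["city"]},
}]

messages = [{"role": "user", "content": "What's the weather in Lisbon?"}]
for _ in range(3):                                # max_iterations=3, as in Tutor_AI
    resp = client.messages.create(model="claude-sonnet-4-20250514",   # model id illustrative
                                  max_tokens=1024, tools=tools, messages=messages)
    if resp.stop_reason != "tool_use":
        break                                     # model answered in plain text; done
    messages.append({"role": "assistant", "content": resp.content})
    results = [{"type": "tool_result", "tool_use_id": b.id,
                "content": run_tool(b.name, b.input)}    # run_tool(): your own dispatcher
               for b in resp.content if b.type == "tool_use"]
    messages.append({"role": "user", "content": results})
```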
Agents¶
What they are. Loosely defined: an LLM that decides what to do next (which tool to call, what to retrieve, when to stop) instead of running a fixed pipeline. Concretely in this portfolio: a "while not done: call the next tool" loop wrapped around an LLM with a tool schema.
Frameworks present:
- LangChain — Python; used in GoGreen-DOC-AI, GoGreenMarketing, Automotive-Repair-Diagnosis-AI.
- Vercel AI SDK — TypeScript; used in GoGreen-Workflow-Hub. Streaming-first, tool-use-native.
- Custom ReAct loop — Tutor_AI, OpenSentinel, AI-Wordpress_Business-Directory roll their own.
LangSmith is the LangChain observability layer (traces, eval). Used in GoGreen-SmartForms.
MCP (Model Context Protocol)¶
What it is. Anthropic's open spec (released 2024) for connecting LLMs to tools and data sources via a standardized client-server protocol. An "MCP server" exposes tools, resources, and prompts; an MCP client (the LLM host) consumes them. Like LSP for AI agents.
Used in:
- OpenSentinel — runs an MCP client to consume tools from external MCP servers (Notion, GitHub, Matrix, etc.).
- WP-Plugin — exposes MCP-compatible endpoints so external agents can drive the WordPress site.
Why it matters. Replaces app-specific glue code. If a new tool exposes an MCP server, every MCP-aware client gets it for free.
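A minimal MCP server sketch, assuming the official mcp Python SDK's FastMCP helper; the tool is illustrative and unrelated to WP-Plugin's actual endpoints.

```python
# Minimal MCP server sketch, assuming the official `mcp` Python SDK (FastMCP).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()   # serves the MCP protocol over stdio; any MCP-aware client can now call word_count
```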
Adjacent retrieval ideas (not currently used)¶
ColBERT¶
Late-interaction retrieval — embed each token (not each document) and compute the maxsim across token pairs. State-of-the-art recall but storage-heavy. Not used in portfolio.
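For reference, the late-interaction score in a few lines of numpy; purely illustrative, since nothing in the portfolio uses it.

```python
import numpy as np

def maxsim(query_tok_embs: np.ndarray, doc_tok_embs: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token, take its best-matching doc token,
    then sum those maxima. Shapes: (q_len, dim) and (d_len, dim)."""
    sim = query_tok_embs @ doc_tok_embs.T        # all token-pair dot products
    return float(sim.max(axis=1).sum())          # best doc token per query token, summed
```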
SPLADE¶
Learned sparse retrieval — neural BM25-replacement. Better-than-BM25 quality, embeddable in regular search engines. Not used.
Matryoshka embeddings¶
A single model produces nested embeddings (you can truncate text-embedding-3-large from 3072 → 1024 → 256 with graceful quality degradation). OpenAI v3 embeddings support this; the portfolio doesn't currently exploit it.
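If the portfolio ever exploited it, the usage would be just truncation plus renormalization, sketched here with an illustrative helper:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int = 256) -> np.ndarray:
    """Matryoshka usage: keep the first `dims` components, then re-normalize
    so cosine / dot-product comparisons still behave."""
    small = vec[:dims]
    return small / np.linalg.norm(small)

# e.g. shrink a 3072-dim text-embedding-3-large vector to 256 dims (12x less storage)
```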
Long-context attention (no RAG)¶
Some 2025–2026 models advertise 200K to 1M-token contexts. For small corpora, "just stuff the whole document" can outperform poorly-tuned RAG. See the "long-context" entry in RAG-Patterns.md.
Self-RAG / Corrective-RAG / Adaptive-RAG (the academic family)¶
Variants where the LLM evaluates its own retrieval and decides to retry / refine / abstain. The portfolio's "Adaptive 3-Tier" (FamilyChat) is a simpler real-world cousin.