
Vector Databases & Vector Stores

A vector database stores high-dimensional embedding vectors and supports approximate-nearest-neighbor (ANN) search. Embeddings come from LLM-Providers.md; the math is in AI-Concepts.md.

The portfolio strongly favors pgvector — Postgres + an extension — because it collapses the vector store and the transactional store into one piece of infrastructure. The exceptions are deliberate.

Portfolio map

Vector store                         App count                       Reason for the choice
pgvector (Postgres extension)        16 apps                         Default. One DB to operate. ACID + ANN in the same engine.
Qdrant                               2 apps                          Higher-end ANN, payload filters, on-disk indexes for large corpora.
MySQL JSON as vector                 3 apps                          Apps already on MySQL (WordPress / legacy PHP). Avoids a second DB.
SQLite JSON as vector                4 apps                          Single-binary deploy, no external DB. The Sell-Me family.
In-memory                            2 apps                          Small corpus, fits in RAM, restart cost acceptable.
Neo4j vector                         0 apps                          Neo4j present but used as a graph, not a vector store.
Pinecone                             0 apps (1 configured, unused)   Considered; dropped in favor of pgvector.
Weaviate / Chroma / Milvus / FAISS   0 apps                          Not used; covered below for comparison.

pgvector

What it is. A PostgreSQL extension that adds the vector(N) column type; the <-> (L2), <=> (cosine), and <#> (negative inner product) distance operators; and IVFFlat / HNSW index types. Search runs as a regular SQL ORDER BY embedding <=> $1 LIMIT 10.

Why we use it. One database for everything. Joins between vector search results and relational metadata happen in the same query — SELECT chunks.* FROM chunks JOIN documents ON documents.id = chunks.document_id WHERE documents.tenant_id = $1 ORDER BY chunks.embedding <=> $2 LIMIT 10. Backups, transactions, point-in-time recovery, RLS — all the Postgres machinery you already trust.
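
The pattern can be sketched end to end. The schema, table, and helper below are illustrative, not lifted from any portfolio app; the SQL strings would be executed through any Postgres driver (psycopg, asyncpg) against a database with the extension installed.

```python
# Hypothetical pgvector schema and query; names are illustrative.
# The SQL is the substance -- run it with any Postgres driver.

SCHEMA = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id          bigserial PRIMARY KEY,
    document_id bigint NOT NULL,
    body        text NOT NULL,
    embedding   vector(1536)  -- text-embedding-3-small dimension
);
-- HNSW over cosine distance; an IVFFlat index would also use
-- vector_cosine_ops but needs periodic retraining as the table grows.
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is cosine distance: 0 means identical direction, smaller is closer.
TOP_K = """
SELECT id, body, embedding <=> $1 AS distance
FROM chunks
ORDER BY embedding <=> $1
LIMIT $2
"""

def to_pgvector(vec):
    """Format a Python list as a pgvector text literal: '[0.1,0.2,...]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"
```

The query vector is passed as a text literal (via to_pgvector) and cast by the driver or an explicit ::vector.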

Used in:
- Automotive-Repair-Diagnosis-AI
- Boomer_AI-Docker
- Ecom-Sales
- FamilyChat
- GoGreen-Workflow-Hub
- GoGreenPaperlessInitiative
- GoGreenSourcingAI
- GogreenSellerAI
- MangyDogCoffee
- MyPollingApp-Docker
- NaggingWifeAI
- OpenSentinel
- Realestate-all-docker (sort-of — uses MySQL JSON, see below)
- Recruiting_AI-Docker
- Sales_AI_App
- TimeSheetAI
- Tutor_AI-Docker

Tradeoffs.
- Index choice. IVFFlat is fast to build but needs retraining as data grows. HNSW is the modern default — better recall, larger memory footprint.
- Scale ceiling. Single-instance Postgres handles ~5M vectors comfortably; beyond ~50M you start fighting it. Most portfolio apps are <100K vectors — nowhere near the limit.
- No native filter-aware ANN. You can WHERE tenant_id = $1 before the ORDER BY, but the planner can't always combine the filter with the ANN index — sometimes the query degenerates to a scan.


Qdrant

What it is. An open-source dedicated vector database written in Rust. HTTP/gRPC API. Strong payload filtering (filters baked into the ANN traversal, not bolted on after), HNSW indexing, on-disk vector storage (the corpus doesn't need to fit in RAM), good multi-tenancy via collections.

Why some apps picked it over pgvector.
- Faster filter-aware ANN at scale.
- On-disk vectors — important for GoGreen-DOC-AI's 3072-dim text-embedding-3-large (each float32 vector is 12 KB, so 1M vectors = 12 GB).
- Decoupled from the transactional DB — the document service can scale separately.
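
A filter-aware search against Qdrant's REST API (POST /collections/<name>/points/search) looks roughly like this. The tenant_id payload field is hypothetical; real apps would send the body with any HTTP client or the qdrant-client SDK.

```python
# Build the JSON body for a Qdrant filtered ANN search. The point is
# that the filter is part of the search request itself, so it rides
# along inside the HNSW traversal rather than being applied to the
# result set after the fact.
import json

def search_body(query_vec, tenant_id, limit=10):
    return {
        "vector": query_vec,
        "limit": limit,
        "with_payload": True,
        "filter": {
            "must": [
                {"key": "tenant_id", "match": {"value": tenant_id}}
            ]
        },
    }

payload = json.dumps(search_body([0.1, 0.2, 0.3], tenant_id=42))
```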

Used in:
- GoGreen-AI-Concierge — Crawlee → batch embed → Qdrant. Citation tracking on responses.
- GoGreen-DOC-AI — 3-tier Graph RAG. Qdrant is the vector tier; LangChain wraps it; entity graph sits on top.

Tradeoffs. Second piece of infrastructure to deploy and back up. Worth it when corpus size or filter complexity actually justifies it.


MySQL JSON as vector

What it is. Embeddings stored as a JSON array in a LONGTEXT/JSON column, similarity computed in application code or via SQL math. No real ANN — it's a brute-force scan with cosine similarity.
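
The scan itself is a few lines in any language. The portfolio apps do it in PHP; this is a Python sketch of the same math, with a hypothetical row shape standing in for the fetched table.

```python
# Brute-force cosine similarity over embeddings stored as JSON text,
# as they come back from a SELECT. Row shape is hypothetical.
import json, math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_search(rows, query_vec, k=3):
    """rows: (id, json_embedding_text) pairs.
    O(n) over the whole table -- the tradeoff called out below."""
    scored = [(row_id, cosine(json.loads(emb), query_vec))
              for row_id, emb in rows]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]

rows = [(1, "[1.0, 0.0]"), (2, "[0.0, 1.0]"), (3, "[0.7, 0.7]")]
# Row 3 points in the same direction as the query, so it ranks first.
top = brute_force_search(rows, [1.0, 1.0], k=2)
```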

Why we use it. Some apps are stuck on MySQL (WordPress core, legacy PHP) and adding a second store wasn't worth it for the corpus size.

Used in:
- AI-Wordpress_Business-Directory — WordPress lives in MySQL, so embeddings live there too. Cosine similarity in PHP after SELECT *.
- Realestate-all-docker — Laravel + MySQL. 11 RAG types over property data.
- Voting_NewAndImproved — MariaDB, embeddings column. 5 source types, HyDE-augmented retrieval.

Tradeoffs. O(n) every query. Fine up to ~10K vectors; awful past that. The right move when adding a vector DB would double your operational footprint for marginal benefit.


SQLite JSON as vector

What it is. Same idea as MySQL JSON, but for the embedded SQLite use case: ship the app as a single binary or container with a file-backed DB. Cosine similarity is computed in JavaScript.
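
The storage shape can be shown with Python's stdlib sqlite3 (the Sell-Me apps do the scoring step in JavaScript; table and column names here are illustrative):

```python
# Embeddings as JSON text in a SQLite column, scored in app code.
import json, math, sqlite3

conn = sqlite3.connect(":memory:")  # file-backed in the real apps
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, body TEXT, embedding TEXT)")
docs = [("red cars", [0.9, 0.1]), ("blue boats", [0.1, 0.9])]
for body, vec in docs:
    conn.execute("INSERT INTO chunks (body, embedding) VALUES (?, ?)",
                 (body, json.dumps(vec)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def search(query_vec, k=1):
    # Full scan: every row is deserialized and scored in-process.
    rows = conn.execute("SELECT body, embedding FROM chunks").fetchall()
    scored = sorted(rows,
                    key=lambda r: cosine(json.loads(r[1]), query_vec),
                    reverse=True)
    return [body for body, _ in scored[:k]]
```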

Used in (the Sell-Me family of HyDE apps):
- SCO-Digital-Assistant — SQLite JSON + JS cosine.
- SellMeACar-Docker — same pattern.
- SellMe_PRT-Docker — same pattern.
- PolyMarketAI — SQLite + scikit-learn TF-IDF (not embeddings, but the same storage shape).

Tradeoffs. Lowest operational complexity in the portfolio — no Postgres, no Redis required to run the assistant. Gets slow past a few thousand chunks.


In-memory vectors

What it is. Embeddings live in a Python list / JS Array, similarity computed at query time. Nothing persists across restarts (or you re-embed on boot from a source-of-truth store).
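
A minimal sketch of that shape — a plain list of (id, vector) pairs rebuilt on boot, with heapq picking the top-k at query time. The class is hypothetical, not taken from any app.

```python
# In-memory vector "store": no persistence, no index, just a list.
import heapq, math

class InMemoryIndex:
    def __init__(self):
        self.items = []  # (doc_id, vector) -- rebuilt on every boot

    def add(self, doc_id, vec):
        self.items.append((doc_id, vec))

    def top_k(self, query, k=5):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(x * x for x in b)))
        # heapq.nlargest avoids sorting the whole list for small k.
        return heapq.nlargest(k, ((cos(vec, query), doc_id)
                                  for doc_id, vec in self.items))

idx = InMemoryIndex()
idx.add("a", [1.0, 0.0])
idx.add("b", [0.0, 1.0])
```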

Used in:
- Automotive-Repair-Diagnosis-AI — TF-IDF in memory.
- GoGreen-SmartForms — HyDE over in-memory embeddings; persisted upstream.
- GoGreenMarketing — HyDE in memory.

Tradeoffs. Cold-start cost (re-embed everything on boot). RAM cost for the vectors themselves. Zero infrastructure cost.


Pinecone (configured, not used)

What it is. Managed cloud vector database. The original "vector DB as a service." Strong on horizontal scale, weak on running locally.

Status in portfolio.
- Automotive-Repair-Diagnosis-AI has Pinecone configured in .env, but the team chose pgvector + LangChain instead — the Pinecone connection code was never wired up. Marked [~] (config-only) in the Integrations Audit.

Why it didn't win. Cost (per-month vector index pricing), the ops cost of an external service that can't be backed up alongside Postgres, and pgvector being good enough.


Comparison terms (not currently used in the portfolio)

These are included so the wiki can answer "why did we pick X over Y."

Weaviate

What it is. Open-source vector DB written in Go. Built-in vectorizers (the DB itself can call OpenAI/Cohere to embed), GraphQL API, hybrid search out of the box. Schema-first.

Why we don't use it. Built-in vectorizers add lock-in we don't need; we already centralize embedding in app code. GraphQL was friction for teams that wanted plain HTTP/SQL.

Chroma

What it is. Embedded-friendly Python-first vector DB. Easiest "import chromadb; collection.add(...)" experience. Originally LangChain's go-to backing store.

Why we don't use it. When the same app already runs Postgres, pgvector beats it on ops. Chroma shines for notebooks and prototypes; the portfolio is past that stage.

Milvus

What it is. Heavy-duty distributed vector DB, originally from Zilliz. GPU acceleration, billion-scale corpora, multiple index types (IVF, HNSW, ANNOY, DiskANN).

Why we don't use it. Wildly oversized for any portfolio corpus. Operational footprint is enterprise-scale.

FAISS

What it is. Facebook AI Similarity Search. A library (not a server) for ANN — you load it in-process, build an index in RAM/disk, query it. The technology underlying many of the above.

Why we don't use it directly. No persistence layer, no auth, no remoting. Use it inside a service if you've outgrown pgvector but don't want a managed product.

Redis Stack vector search

What it is. Redis Stack adds FT.SEARCH and vector indexing. The same Redis instance you cache with can host an HNSW index.

Why we don't use it. Plain redis:7 is what's running across the portfolio (no Redis Stack module). Adding it would be a deliberate upgrade — pgvector has been the path of least resistance.

Elasticsearch / OpenSearch dense_vector

What it is. ES has had a dense_vector field type since 7.x. Native hybrid search (BM25 + vector, fused with RRF) is its strength.

Why we don't use it. No app currently runs Elasticsearch. The closest portfolio analog is OpenSentinel's pgvector + BM25 hybrid implemented in app code.
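
Reciprocal Rank Fusion — the RRF step mentioned above — is also how a pgvector + BM25 hybrid can merge its two ranked lists in app code. A sketch, with k=60 as the commonly used smoothing constant:

```python
# Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank)
# per document; documents near the top of either list win.
from collections import defaultdict

def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists, best first.
    Returns doc ids ordered by fused score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits  = ["d1", "d3", "d2"]   # keyword ranking
vector_hits = ["d3", "d2", "d1"]  # ANN ranking
print(rrf([bm25_hits, vector_hits]))  # → ['d3', 'd1', 'd2']
```

d3 wins because it sits near the top of both lists, even though neither ranker put it first and second respectively in the same order as the others.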

MongoDB Atlas Vector Search

What it is. MongoDB's vector index, available on Atlas (managed Mongo). HNSW, payload filters via Mongo's normal query language.

Why we don't use it. No Mongo in the portfolio.

LanceDB

What it is. Embedded columnar vector DB written in Rust. Apache Arrow native. Good for ML notebooks and edge devices.

Why we don't use it. Newer, less mature; SQLite + JSON has been "good enough" at the small end.

Vespa, Qdrant Cloud, Vald, MyScale

Mentioned for completeness — not evaluated, not used.


How embeddings are produced

See LLM-Providers.md for the OpenAI embedding tier (the portfolio default is text-embedding-3-small, 1536 dimensions; GoGreen-DOC-AI uses text-embedding-3-large, 3072 dimensions). HuggingFace local embedding usage is documented there too.

Decision tree

Already running Postgres?
├── Yes → pgvector (default).
│         └── Corpus > ~10M vectors OR heavy filter-aware ANN?
│             └── Yes → Add Qdrant alongside.
└── No
    ├── Single binary / desktop app? → SQLite JSON + JS cosine.
    ├── WordPress / legacy MySQL?    → MySQL JSON column.
    └── Greenfield, dedicated vector DB wanted? → Qdrant (self-host) or Pinecone (managed).