Project dossier
Engram
Self-hostable AI memory layer with proxy injection, pgvector retrieval, and MCP access.
What it solves
Overview
Engram sits between applications and LLM providers, retrieves relevant memories from PostgreSQL with pgvector, injects them into prompts, forwards the request, and stores durable memories after the response.
Interview focus: explain the OpenAI-compatible proxy contract, per-user memory isolation, vector(384) embeddings, retrieval and dedup thresholds, retrieval logs, background extraction, cached-auth fallback, MCP tools, provider abstraction, and why the dashboard talks to the API through server-side service credentials.
Target audience
System design
Architecture
Engram is a four-service Docker Compose stack: FastAPI proxy, PostgreSQL 16 with pgvector, a TypeScript MCP server, and a Next.js dashboard protected by Clerk. The source schema includes users with max_memories_injected, retrieval_threshold, and dedup_threshold settings; memories with confidence, access_count, and source conversation IDs; retrieval_logs for auditability; conversations with extraction_status; and user_api_keys for named key management.
Architecture diagram
Proxy API
Intercepts OpenAI-compatible chat requests, authenticates users, retrieves memory, and forwards enriched messages.
Memory store
Stores users, API key hashes, memory text, metadata, and vector(384) embeddings.
Agent tool layer
Exposes memory reads and writes to AI coding agents through Model Context Protocol tools.
Developer console
Lets users inspect, search, and manage memories through a Next.js dashboard with Clerk sessions.
Provider adapter layer
Provider services translate Engram's proxy request into OpenAI, Ollama, or Gemini style calls while preserving a consistent proxy response shape.
Audit and observability layer
Retrieval logs and conversation records make it possible to inspect which memories were injected, what scores were used, and whether extraction completed.
Implementation surface
Tech stack
OpenAI-compatible proxy, user API, and memory extraction orchestrator.
Primary durable store for users, memory records, and metadata.
Similarity search over vector(384) embeddings.
Agent-facing tools for memory lookup and capture.
Dashboard for memory inspection and service administration.
Self-hosted orchestration for API, database, MCP, and dashboard services.
Authentication and session management for the dashboard.
Async PostgreSQL access for proxy-time retrieval, user lookup, and memory writes.
Runtime validation for MCP tool inputs such as search query, limit, and threshold.
Approximate vector index over pgvector embeddings for faster memory retrieval.
Isolate OpenAI, Ollama, and Gemini request differences behind one proxy pipeline.
Operational flow
How it works
A client sends a chat request to Engram instead of directly to the model provider. Engram authenticates the user, retrieves semantically similar memories, injects them into the prompt, forwards the request, and stores durable memories from the completed conversation.
Register a user
The API creates a user and returns a one-time API key while storing only its hash.
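A minimal sketch of the "return once, store only the hash" pattern follows. The function names are hypothetical, SHA-256 is an assumed hash choice, and only the engram_live_ prefix comes from the example response in this dossier.

```python
import hashlib
import hmac
import secrets

# Prefix taken from the dossier's example response; the rest of the format is assumed.
KEY_PREFIX = "engram_live_"


def hash_api_key(plaintext: str) -> str:
    # SHA-256 is an assumption. Random high-entropy keys do not need a slow password hash.
    return hashlib.sha256(plaintext.encode("utf-8")).hexdigest()


def issue_api_key() -> tuple[str, str]:
    """Return (plaintext_key, key_hash). Only the hash should be persisted."""
    plaintext = KEY_PREFIX + secrets.token_urlsafe(32)
    return plaintext, hash_api_key(plaintext)


def verify_api_key(presented: str, stored_hash: str) -> bool:
    # Constant-time comparison avoids leaking how much of the hash matched.
    return hmac.compare_digest(hash_api_key(presented), stored_hash)
```

The plaintext key is shown to the caller exactly once; a later request is checked by hashing the presented key and comparing it with the stored hash.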
Receive chat traffic
The client calls /v1/chat with X-Engram-Key, X-Engram-User-ID, and provider headers.
Retrieve memory
The API embeds the current conversation and queries pgvector for the user's memories whose similarity score meets retrieval_threshold, returning at most max_memories_injected results.
Vector search lets 'prefers concise answers' match future requests about response style even when words differ.
Inject context and forward
Relevant memories are added to the message context before the request is forwarded to the configured LLM provider.
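The injection step could look roughly like the sketch below. It assumes memories are prepended as a single system message; the dossier only says they are added to the message context, so the placement, the wording of the preamble, and the function name are all illustrative.

```python
def inject_memories(
    messages: list[dict],
    memories: list[str],
    max_memories: int,
) -> list[dict]:
    """Return a new message list with up to max_memories memories prepended."""
    if not memories or max_memories <= 0:
        # Nothing to inject: forward the conversation unchanged.
        return list(messages)
    selected = memories[:max_memories]  # memories are assumed to arrive ranked best-first
    block = "Relevant memories about this user:\n" + "\n".join(
        f"- {memory}" for memory in selected
    )
    # Build a new list so the caller's original messages are not mutated.
    return [{"role": "system", "content": block}, *messages]
```

Capping the list with max_memories keeps the injected context small, which matches the per-user max_memories_injected setting described in the schema.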
Extract durable memories
After the response, an extraction pass identifies stable preferences, decisions, and facts for future retrieval.
Log retrieval evidence
The API records query text, query embedding, retrieved memory IDs, retrieved scores, and conversation ID for later audit.
This lets an interview answer move beyond 'we used RAG' into exactly how retrieval behavior can be debugged.
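As an illustration, a retrieval log entry might be assembled like this. The field names mirror the items listed above, but the class, the function, and the created_at field are assumptions rather than the project's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class RetrievalLog:
    user_id: str
    conversation_id: str
    query_text: str
    query_embedding: list[float]
    retrieved_memory_ids: list[str]
    retrieved_scores: list[float]
    # Timestamp is an assumed extra field, useful for "recent logs" queries.
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def build_retrieval_log(
    user_id: str,
    conversation_id: str,
    query_text: str,
    query_embedding: list[float],
    hits: list[tuple[str, float]],
) -> RetrievalLog:
    """hits is a ranked list of (memory_id, score) pairs from the vector search."""
    return RetrievalLog(
        user_id=user_id,
        conversation_id=conversation_id,
        query_text=query_text,
        query_embedding=list(query_embedding),
        retrieved_memory_ids=[memory_id for memory_id, _ in hits],
        retrieved_scores=[score for _, score in hits],
    )
```

Keeping IDs and scores in parallel, in ranked order, makes it possible to see afterwards which memory barely cleared the threshold.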
Deduplicate extracted memory
New candidate memories are compared against existing memories using a high dedup threshold before they are inserted.
Deduplication prevents a memory store from filling with repeated versions of the same preference.
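A small in-memory sketch of the dedup check follows. It assumes cosine similarity is the comparison metric and that a candidate is skipped when any existing memory scores at or above dedup_threshold; in Engram the comparison would run against stored embeddings, and the helper names here are hypothetical.

```python
import math
from typing import Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_new_memories(
    candidates: list[tuple[str, list[float]]],
    existing_embeddings: list[list[float]],
    dedup_threshold: float,
) -> list[str]:
    """Return the candidate texts that are not near-duplicates of known memories."""
    known = list(existing_embeddings)
    accepted: list[str] = []
    for text, embedding in candidates:
        if any(_cosine(embedding, other) >= dedup_threshold for other in known):
            continue  # too close to something already stored or just accepted
        accepted.append(text)
        known.append(embedding)  # later candidates are also checked against this one
    return accepted
```

Checking later candidates against ones accepted earlier in the same batch prevents a single conversation from inserting two phrasings of the same fact.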
Fall back during database outages
If the database pool is unavailable, the proxy can use cached API-key auth for passthrough behavior instead of failing every request.
That fallback preserves provider access but intentionally cannot retrieve or persist fresh durable memory.
Sequence diagram
Concept depth
Key concepts
Embedding models convert text into vectors where related meanings land near each other. This enables retrieval by intent rather than exact keyword overlap.
In Engram: memories are stored as vector(384) embeddings so the proxy can retrieve relevant personal context for a new chat turn.
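To make "near each other" concrete, here is cosine similarity computed by hand. The toy vectors in the usage note are two-dimensional for readability; real embeddings in Engram would have 384 dimensions, and whether Engram uses cosine specifically is an assumption.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 is same direction, 0.0 is orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)
```

For example, cosine_similarity([1.0, 0.0], [1.0, 0.0]) is 1.0, while cosine_similarity([1.0, 0.0], [0.0, 1.0]) is 0.0.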
Implementation evidence
Code highlights
Memory retrieval before provider call
The proxy enriches requests by retrieving user-specific memories before forwarding to the LLM provider.
Retrieval happens before provider forwarding, so the model sees relevant durable context.
Extraction is scheduled after response generation to avoid slowing down the user-facing path.
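The ordering described here can be sketched with plain asyncio. The retrieve, forward, and extract callables are hypothetical stand-ins for the real services, and asyncio.create_task is used as an assumed scheduling mechanism; the project may use a framework-level background task instead.

```python
import asyncio
from typing import Awaitable, Callable

Message = dict[str, str]


async def handle_chat(
    messages: list[Message],
    retrieve: Callable[[list[Message]], Awaitable[list[str]]],
    forward: Callable[[list[Message]], Awaitable[dict]],
    extract: Callable[[list[Message], dict], Awaitable[None]],
    background: set,
) -> dict:
    # 1. Retrieval runs first so the provider sees durable context.
    memories = await retrieve(messages)
    enriched = list(messages)
    if memories:
        note = "Relevant memories:\n" + "\n".join(f"- {m}" for m in memories)
        enriched.insert(0, {"role": "system", "content": note})
    # 2. The enriched request is forwarded to the provider.
    response = await forward(enriched)
    # 3. Extraction is scheduled, not awaited, so the caller gets the response immediately.
    task = asyncio.create_task(extract(messages, response))
    background.add(task)  # keep a reference so the task is not garbage-collected
    task.add_done_callback(background.discard)
    return response
```

Extraction receives the original messages rather than the enriched ones, so injected memories are not re-extracted as if the user had said them.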
pgvector similarity query
Memory lookup ranks rows by distance between the stored embedding and the conversation embedding.
The vector operator returns nearest memories by semantic distance.
Filtering by user_id keeps personal memory isolated per account.
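A query of this shape might look like the sketch below. The table and column names follow the dossier's schema description but are otherwise assumed, as is the use of pgvector's cosine-distance operator (<=>) and $n placeholders in the style of asyncpg.

```python
from typing import Sequence

EMBEDDING_DIM = 384  # matches the vector(384) column described in the dossier

# <=> is pgvector's cosine distance, so 1 - distance gives a similarity score.
SEARCH_SQL = """
SELECT id, content, 1 - (embedding <=> $1::vector) AS score
FROM memories
WHERE user_id = $2
  AND 1 - (embedding <=> $1::vector) >= $3
ORDER BY embedding <=> $1::vector
LIMIT $4
"""


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding as pgvector's text form, e.g. '[0.1,0.2,...]'."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(embedding)}")
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_search_args(
    embedding: Sequence[float], user_id: str, threshold: float, limit: int
) -> tuple:
    # Argument order matches the $1..$4 placeholders in SEARCH_SQL.
    return (to_vector_literal(embedding), user_id, threshold, limit)
```

With an asyncpg connection this would be executed as rows = await conn.fetch(SEARCH_SQL, *build_search_args(...)). The user_id predicate is what enforces per-account isolation, and the threshold and limit correspond to the per-user retrieval_threshold and max_memories_injected settings.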
Proxy auth fallback path
The proxy distinguishes database-backed memory mode from cached-auth passthrough mode when PostgreSQL is temporarily unavailable.
The fallback keeps provider passthrough alive but cannot perform durable retrieval or extraction.
The external user ID is still checked so one key cannot impersonate another user.
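The two modes could be distinguished along these lines. This is a sketch under assumptions: db_lookup is a hypothetical callable that raises ConnectionError when the pool is unavailable, the cache is a plain dict from key hash to user ID, and the mode names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class AuthError(Exception):
    """Raised when the key is unknown or does not belong to the claimed user."""


@dataclass(frozen=True)
class AuthResult:
    user_id: str
    mode: str  # "memory" (database available) or "passthrough" (cached auth only)


def authenticate(
    key_hash: str,
    external_user_id: str,
    db_lookup: Callable[[str], Optional[str]],
    cache: dict[str, str],
) -> AuthResult:
    try:
        owner = db_lookup(key_hash)
        mode = "memory"
        if owner is not None:
            cache[key_hash] = owner  # remember known keys for a later outage
    except ConnectionError:
        owner = cache.get(key_hash)  # only previously seen keys can pass
        mode = "passthrough"
    # The user ID check applies in both modes, so a key cannot act for another user.
    if owner is None or owner != external_user_id:
        raise AuthError("invalid key or user mismatch")
    return AuthResult(user_id=owner, mode=mode)
```

A caller would skip retrieval and extraction whenever mode is "passthrough", which keeps the degraded behavior explicit.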
MCP search input validation
Agent-facing memory tools validate arguments before calling the Engram API.
Tool input validation protects the API from malformed agent calls.
The tool exposes threshold control while keeping safe numeric bounds.
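The bounds checking can be illustrated as follows. The dossier states that the MCP server is TypeScript and validates with Zod; this Python version only mirrors the idea for consistency with the other sketches, and the defaults and the MAX_LIMIT value are hypothetical.

```python
from dataclasses import dataclass

MAX_LIMIT = 50  # hypothetical upper bound on results per search


@dataclass(frozen=True)
class SearchArgs:
    query: str
    limit: int
    threshold: float


def parse_search_args(raw: dict) -> SearchArgs:
    """Validate raw tool arguments before they reach the Engram API."""
    query = raw.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    limit = raw.get("limit", 5)  # default is an assumption
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be an integer between 1 and {MAX_LIMIT}")
    threshold = raw.get("threshold", 0.5)  # default is an assumption
    if (
        isinstance(threshold, bool)
        or not isinstance(threshold, (int, float))
        or not 0.0 <= threshold <= 1.0
    ):
        raise ValueError("threshold must be a number between 0 and 1")
    return SearchArgs(query=query.strip(), limit=limit, threshold=float(threshold))
```

Rejecting bad input at the tool boundary means an agent gets a clear error instead of triggering an unbounded or malformed search.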
Contracts
API design
Base URL: http://localhost:8000
/users
Creates an Engram user and returns the one-time API key.
Request: { "email": "ayush@example.com" }
Response: { "userId": "usr_9v2", "apiKey": "engram_live_..." }
/v1/chat
OpenAI-compatible chat endpoint with memory retrieval and post-response extraction.
Request: { "model": "gpt-4.1-mini", "messages": [{ "role": "user", "content": "Use my style preferences" }] }
Response: { "id": "chatcmpl_...", "choices": [{ "message": { "role": "assistant", "content": "..." } }] }
/memories/search
Searches durable memories for dashboard and MCP clients.
Response: { "results": [{ "content": "Prefers concise engineering summaries", "score": 0.82 }] }
/v1/chat
Accepts X-Engram-Key, X-Engram-User-ID, provider selection, and disable-injection or disable-extraction headers for proxy-time control.
Request: { "model": "gpt-4.1-mini", "messages": [{ "role": "user", "content": "Remember my preference" }] }
Response: { "choices": [{ "message": { "content": "..." } }], "headers": { "X-Engram-Memories-Injected": "3" } }
/retrieval-logs
Returns recent retrieval evidence for a user, including retrieved memory IDs and similarity scores.
/memories/capture
Captures a conversation or manual memory through dashboard or MCP flows and schedules extraction or direct insert.
/users/{userId}/config
Updates per-user memory settings such as max injected memories, retrieval threshold, and deduplication threshold.
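A client call to the chat endpoint could be assembled as in the sketch below, using only the standard library. The POST method is an assumption based on the OpenAI-compatible contract, and the provider-selection header is omitted because its name is not given here; the function name is hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str,
    api_key: str,
    user_id: str,
    model: str,
    user_message: str,
) -> urllib.request.Request:
    """Build (but do not send) a chat request for the Engram proxy."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Engram-Key": api_key,      # the one-time key returned at registration
            "X-Engram-User-ID": user_id,  # must match the key's owner
        },
        method="POST",  # assumed; the dossier does not list HTTP methods
    )
```

Against a running stack, urllib.request.urlopen(build_chat_request("http://localhost:8000", key, user_id, "gpt-4.1-mini", "Use my style preferences")) would send the request.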
State model
Database design
Data relationship diagram
users
Application users who own isolated memory collections.
api_keys
One-way hashes of user API keys with creation and revocation metadata.
memories
Durable memory content, metadata, and vector(384) embeddings.
retrieval_logs
Audit table storing the natural-language query, query embedding, retrieved memory IDs, retrieved scores, and linked conversation.
conversations
Raw exchange and extraction status for post-response memory processing.
user_api_keys
Named API key hashes with last-used metadata, separate from the primary user row.
Architecture decisions
Trade-offs
Memory architecture
Proxy service over client-side SDK only
A proxy can support any OpenAI-compatible client and centralize retrieval, injection, extraction, and audit behavior.
Vector storage
pgvector over Pinecone or Weaviate
Per-user memory volumes fit well inside PostgreSQL, and one database simplifies self-hosting, backup, and migrations.
Service shape
Docker Compose over Kubernetes
The target user is a self-hosting developer. Compose keeps the four services understandable and easy to run locally.
Extraction timing
Background extraction task over blocking extraction before response
Users care about chat latency. Background extraction keeps the response path fast while still recording the conversation for durable memory processing.
Database outage behavior
Cached-auth passthrough over failing all proxy requests
Provider access can continue for known keys even when retrieval is unavailable, but the degraded mode is explicit and does not claim to inject memory.
User-configurable memory controls
Per-user thresholds over one global retrieval setting
Some users want aggressive recall and others want precision. Storing max_memories_injected, retrieval_threshold, and dedup_threshold on the user model makes that behavior tunable.
Lessons learned
Challenges and solutions
Problem
Automatic memory extraction can store noisy or transient facts if it is too eager.
Solution: Treat durable memories as stable preferences, decisions, or repeated facts and keep extraction separate from request forwarding.
Lesson: Memory systems need precision and user inspectability, not just aggressive capture.
Problem
Dashboard and API need secure service-to-service communication without leaking credentials to the browser.
Solution: Use an ENGRAM_SERVICE_KEY for internal dashboard API calls and keep user API keys server-side.
Lesson: Self-hostable systems still need production-grade boundaries between browser, dashboard server, and API.
Problem
Injected memories can make a model worse if irrelevant context crosses the retrieval threshold.
Solution: Limit the number of injected memories, expose per-user thresholds, log retrieval scores, and keep dashboard review possible.
Lesson: RAG quality needs observability and controls, not just embeddings.
Problem
Self-hosted users may lose database connectivity but still expect model calls to work.
Solution: Add a cached-auth passthrough path that skips retrieval and extraction while preserving provider calls for known users.
Lesson: Graceful degradation should clearly reduce features instead of hiding failure.
Problem
MCP tools are called by agents and can receive malformed or overly broad inputs.
Solution: Validate tool arguments with Zod, bound limits and thresholds, and route errors through structured tool results.
Lesson: Agent-facing tools need the same input discipline as public APIs.
Runbook
Requirements and future work
Requirements
- Docker and Docker Compose for the four-service stack.
- PostgreSQL 16 with the pgvector extension.
- OpenAI API key or another compatible LLM provider key.
- Clerk account and ENGRAM_SERVICE_KEY for dashboard authentication.
- Database schema must enable the vector extension and uuid-ossp before tables are created.
- Users need configured max_memories_injected, retrieval_threshold, and dedup_threshold defaults.
- MCP server requires ENGRAM_API_URL, ENGRAM_USER_ID, and an Engram API key to call memory tools.
- Dashboard service routes must keep ENGRAM_SERVICE_KEY server-side and never expose it to browser bundles.
Future improvements
- Add per-memory confidence and source attribution.
- Add a memory review queue where extracted candidates require approval before becoming durable memories.
- Display retrieval logs next to chat traces so users can see exactly which memories influenced an answer.
- Support multiple embedding providers, with migration tools for re-embedding all memories when vector dimensions or providers change.
- Add per-memory TTL and scope controls for temporary project facts versus long-term preferences.
Active recall
Interview Q&A
Why use a proxy instead of asking each app to manage memory?
Why is pgvector enough for Engram's first version?
How does Engram protect API keys?
What happens in Engram when PostgreSQL is unavailable?
Why store retrieval logs?
How do retrieval_threshold and dedup_threshold differ?
Why should memory extraction be asynchronous?
What security boundary exists between the dashboard and API?
What is the risk of storing memories automatically?
Why keep vector search inside PostgreSQL?
How does the MCP server fit into the architecture?