AI infrastructure · active

Project dossier

Engram

Self-hostable AI memory layer with proxy injection, pgvector retrieval, and MCP access.

What it solves

Overview

Engram sits between applications and LLM providers, retrieves relevant memories from PostgreSQL with pgvector, injects them into prompts, forwards the request, and stores durable memories after the response.

Interview focus: explain the OpenAI-compatible proxy contract, per-user memory isolation, vector(384) embeddings, retrieval and dedup thresholds, retrieval logs, background extraction, cached-auth fallback, MCP tools, provider abstraction, and why the dashboard talks to the API through server-side service credentials.

Target audience

  • AI application developers who need persistent personalized context.
  • Teams that want self-hosted memory instead of a managed black box.
  • Agent builders using MCP-enabled coding tools.

System design

Architecture

Engram is a four-service Docker Compose stack: FastAPI proxy, PostgreSQL 16 with pgvector, a TypeScript MCP server, and a Next.js dashboard protected by Clerk. The source schema includes users with max_memories_injected, retrieval_threshold, and dedup_threshold settings; memories with confidence, access_count, and source conversation IDs; retrieval_logs for auditability; conversations with extraction_status; and user_api_keys for named key management.

Architecture diagram


Proxy API

Intercepts OpenAI-compatible chat requests, authenticates users, retrieves memory, and forwards enriched messages.

FastAPI · Python · OpenAI-compatible HTTP

Memory store

Stores users, API key hashes, memory text, metadata, and vector(384) embeddings.

PostgreSQL 16 · pgvector

Agent tool layer

Exposes memory reads and writes to AI coding agents through Model Context Protocol tools.

TypeScript · MCP SDK

Developer console

Lets users inspect, search, and manage memories through a Next.js dashboard with Clerk sessions.

Next.js · Clerk · TypeScript

Provider adapter layer

Provider services translate Engram's proxy request into OpenAI-, Ollama-, or Gemini-style calls while preserving a consistent proxy response shape.

OpenAI adapter · Ollama adapter · Gemini adapter

Audit and observability layer

Retrieval logs and conversation records make it possible to inspect which memories were injected, what scores were used, and whether extraction completed.

retrieval_logs · conversations · JSONB

Implementation surface

Tech stack

FastAPI (Backend)

OpenAI-compatible proxy, user API, and memory extraction orchestrator.

PostgreSQL 16 (Database)

Primary durable store for users, memory records, and metadata.

pgvector (Database)

Similarity search over vector(384) embeddings.

TypeScript MCP (Backend)

Agent-facing tools for memory lookup and capture.

Next.js (Frontend)

Dashboard for memory inspection and service administration.

Docker Compose (DevOps)

Self-hosted orchestration for API, database, MCP, and dashboard services.

Clerk (Library)

Authentication and session management for the dashboard.

asyncpg (Database)

Async PostgreSQL access for proxy-time retrieval, user lookup, and memory writes.

Zod (Library)

Runtime validation for MCP tool inputs such as search query, limit, and threshold.

IVFFlat index (Database)

Approximate vector index over pgvector embeddings for faster memory retrieval.

Provider adapters (Backend)

Isolate OpenAI, Ollama, and Gemini request differences behind one proxy pipeline.

Operational flow

How it works

A client sends a chat request to Engram instead of directly to the model provider. Engram authenticates the user, retrieves semantically similar memories, injects them into the prompt, forwards the request, and stores durable memories from the completed conversation.

1. Register a user

The API creates a user and returns a one-time API key while storing only its hash.
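One way to picture the "return once, store only the hash" behavior is the sketch below. It is a minimal illustration, not Engram's actual code: the function names are hypothetical, SHA-256 is an assumed hash choice (the dossier only says a hash is stored), and the `engram_live_` prefix is taken from the example response in the API section.

```python
import hashlib
import hmac
import secrets


def issue_api_key(prefix: str = "engram_live_") -> tuple:
    """Return (plaintext_key, key_hash). Only key_hash should be persisted."""
    # The plaintext is shown to the user once and never written to the database.
    plaintext = prefix + secrets.token_urlsafe(32)
    # Assumption: SHA-256 hex digest; the source does not name the hash function.
    key_hash = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, key_hash


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare it to the stored hash in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)
```

Because only the digest is stored, a leaked database row cannot be replayed as a working key, and a lost key has to be reissued rather than recovered.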

2. Receive chat traffic

The client calls /v1/chat with X-Engram-Key, X-Engram-User-ID, and provider headers.

3. Retrieve memory

The API embeds the current conversation and queries pgvector for that user's memories whose similarity meets the user's retrieval_threshold, returning at most max_memories_injected results.

Vector search lets 'prefers concise answers' match future requests about response style even when words differ.
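The interaction between the similarity threshold and the injection cap can be sketched as follows. This assumes the database has already returned candidates with a cosine-similarity score; `ScoredMemory` and `select_memories` are hypothetical names, while `retrieval_threshold` and `max_memories_injected` are the per-user settings named in the architecture section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredMemory:
    memory_id: str
    content: str
    score: float  # cosine similarity; higher means closer in meaning


def select_memories(candidates, retrieval_threshold, max_memories_injected):
    """Keep candidates at or above the threshold, best first, capped at the limit."""
    eligible = [m for m in candidates if m.score >= retrieval_threshold]
    eligible.sort(key=lambda m: m.score, reverse=True)
    return eligible[: max(0, max_memories_injected)]
```

Raising the threshold trades recall for precision; lowering the cap bounds how much prompt space memories can consume even when many pass the threshold.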

4. Inject context and forward

Relevant memories are added to the message context before the request is forwarded to the configured LLM provider.
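A minimal sketch of the injection step, assuming memories are delivered as an extra system message placed after any system messages the client already sent. The dossier only says memories are "added to the message context", so the placement, the wording of the memory block, and the function name are all assumptions.

```python
def inject_memories(messages, memories):
    """Return a new message list with a memory block inserted; input is not mutated."""
    if not memories:
        return list(messages)
    block = "Relevant memories about this user:\n" + "\n".join(
        f"- {text}" for text in memories
    )
    memory_message = {"role": "system", "content": block}
    # Keep the client's own leading system messages first so their instructions
    # still open the conversation.
    idx = 0
    while idx < len(messages) and messages[idx].get("role") == "system":
        idx += 1
    return [*messages[:idx], memory_message, *messages[idx:]]
```

Returning a new list keeps the original request intact, which is useful if the raw exchange is later stored for extraction.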

5. Extract durable memories

After the response, an extraction pass identifies stable preferences, decisions, and facts for future retrieval.

6. Log retrieval evidence

The API records query text, query embedding, retrieved memory IDs, retrieved scores, and conversation ID for later audit.

This lets an interview answer move beyond 'we used RAG' into exactly how retrieval behavior can be debugged.
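The shape of one retrieval log entry might look like the sketch below. The field names follow the retrieval_logs columns listed in the database design; the dataclass, the helper name, and the choice to keep IDs and scores as parallel lists are illustrative assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RetrievalLog:
    user_id: str
    query: str
    query_embedding: list
    retrieved_memory_ids: list
    retrieved_scores: list
    conversation_id: Optional[str] = None


def build_retrieval_log(user_id, query, query_embedding, hits, conversation_id=None):
    """Build a log record from (memory_id, score) pairs, preserving rank order."""
    return RetrievalLog(
        user_id=user_id,
        query=query,
        query_embedding=list(query_embedding),
        retrieved_memory_ids=[memory_id for memory_id, _ in hits],
        retrieved_scores=[score for _, score in hits],
        conversation_id=conversation_id,
    )


# asdict(record) yields a plain dict that can be serialized or inserted as a row.
```

Keeping the scores alongside the IDs is what makes it possible to answer "why was this memory injected?" after the fact, for example by comparing a logged score to the user's threshold at the time.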

7. Deduplicate extracted memory

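New candidate memories are compared against the user's existing memories before insertion; a candidate whose similarity to an existing memory meets the user's dedup_threshold, which is set high so that only near-identical memories match, is treated as a duplicate and skipped.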

Deduplication prevents a memory store from filling with repeated versions of the same preference.
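The dedup check can be sketched with plain cosine similarity. This is an in-memory illustration under assumptions: the real comparison presumably runs in pgvector, the function names are hypothetical, and the 0.95 threshold used in the usage note is an example value, not a documented default.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(candidate, existing_embeddings, dedup_threshold):
    """True if the candidate is at least dedup_threshold similar to any stored memory."""
    return any(
        cosine_similarity(candidate, stored) >= dedup_threshold
        for stored in existing_embeddings
    )
```

With `dedup_threshold=0.95`, a reworded copy of an existing preference would typically be rejected while a merely related memory is still inserted. This is the opposite role from retrieval_threshold, which decides what is relevant enough to inject.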

8. Fall back during database outages

If the database pool is unavailable, the proxy can use cached API-key auth for passthrough behavior instead of failing every request.

That fallback preserves provider access but intentionally cannot retrieve or persist fresh durable memory.

Sequence diagram


Concept depth

Key concepts

Embedding models convert text into vectors where related meanings land near each other. This enables retrieval by intent rather than exact keyword overlap.

In Engram: memories are stored as vector(384) embeddings so the proxy can retrieve relevant personal context for a new chat turn.


Implementation evidence

Code highlights

Memory retrieval before provider call

The proxy enriches requests by retrieving user-specific memories before forwarding to the LLM provider.


Retrieval happens before provider forwarding, so the model sees relevant durable context.

Extraction is scheduled after response generation to avoid slowing down the user-facing path.
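The ordering described here (retrieve, forward, then extract off the response path) can be sketched with plain asyncio. The `retrieve`, `forward`, and `extract` callables are hypothetical stand-ins for Engram's services, and `asyncio.create_task` is an assumed scheduling mechanism; a FastAPI service could equally use its own background-task facility.

```python
import asyncio


async def handle_chat(user_id, messages, retrieve, forward, extract, background_tasks):
    """Retrieve and inject before the provider call; schedule extraction afterwards."""
    memories = await retrieve(user_id, messages)
    enriched = list(messages)
    if memories:
        block = "Relevant memories:\n" + "\n".join(f"- {m}" for m in memories)
        enriched.insert(0, {"role": "system", "content": block})

    response = await forward(enriched)

    # Extraction is scheduled, not awaited, so it cannot delay the response.
    task = asyncio.create_task(extract(user_id, messages, response))
    # Hold a reference until completion so the task is not garbage-collected.
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return response
```

The caller owns `background_tasks` (a set), which also gives shutdown code something to await so in-flight extractions are not dropped.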

pgvector similarity query

Memory lookup ranks rows by distance between the stored embedding and the conversation embedding.


The vector operator returns nearest memories by semantic distance.

Filtering by user_id keeps personal memory isolated per account.
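A query of this kind might be assembled as in the sketch below. It is not Engram's actual SQL: the column names come from the memories table in the database design, `<=>` is pgvector's cosine-distance operator (so `1 - distance` is cosine similarity), the `$n` placeholders follow asyncpg's style, and passing the embedding as a text literal cast to `vector` is one assumed way to bind it without a driver-level codec.

```python
EMBEDDING_DIM = 384  # matches the vector(384) column described above


def build_memory_search(user_id, query_embedding, retrieval_threshold, limit):
    """Return (sql, params) for a per-user cosine-similarity search."""
    if len(query_embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(query_embedding)}")
    # pgvector accepts a bracketed, comma-separated text form for vectors.
    vector_literal = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    sql = """
        SELECT id, content, 1 - (embedding <=> $2::text::vector) AS score
        FROM memories
        WHERE user_id = $1
          AND 1 - (embedding <=> $2::text::vector) >= $3
        ORDER BY embedding <=> $2::text::vector
        LIMIT $4
    """
    return sql, (user_id, vector_literal, retrieval_threshold, limit)
```

With asyncpg, the result could be executed as `await conn.fetch(sql, *params)`. The `user_id = $1` predicate is the isolation boundary: no similarity score, however high, can surface another account's rows.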

Proxy auth fallback path

The proxy distinguishes database-backed memory mode from cached-auth passthrough mode when PostgreSQL is temporarily unavailable.


The fallback keeps provider passthrough alive but cannot perform durable retrieval or extraction.

The external user ID is still checked so one key cannot impersonate another user.
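The two modes can be sketched as a single authentication function. Everything named here is hypothetical: `DatabaseUnavailable`, `AuthResult`, the `db_lookup` callable, and the dict-based cache stand in for whatever the proxy actually uses; only the behavior (cached auth on outage, user ID still checked, degraded mode made explicit) comes from the description above.

```python
from dataclasses import dataclass


class DatabaseUnavailable(Exception):
    """Raised by the lookup when the PostgreSQL pool cannot be reached."""


@dataclass(frozen=True)
class AuthResult:
    user_id: str
    mode: str  # "memory" (full pipeline) or "passthrough" (no retrieval or extraction)


def authenticate(key_hash, external_user_id, db_lookup, auth_cache):
    """Resolve the key owner from the database, or from the cache during an outage."""
    try:
        owner = db_lookup(key_hash)
        mode = "memory"
        if owner is not None:
            auth_cache[key_hash] = owner  # refresh the cache while the database is healthy
    except DatabaseUnavailable:
        owner = auth_cache.get(key_hash)
        mode = "passthrough"
    # The claimed user ID must match the key's owner in both modes.
    if owner is None or owner != external_user_id:
        return None
    return AuthResult(user_id=owner, mode=mode)
```

Downstream code can branch on `mode` to skip retrieval and extraction, which keeps the degraded path visible instead of silently pretending memory was injected.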

MCP search input validation

Agent-facing memory tools validate arguments before calling the Engram API.


Tool input validation protects the API from malformed agent calls.

The tool exposes threshold control while keeping safe numeric bounds.
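The MCP server itself is TypeScript and validates with Zod; the sketch below expresses the same kind of bounded-input check in Python purely for illustration. The field names (query, limit, threshold) follow the tech stack description, while the defaults and bounds (limit 1 to 50, threshold 0 to 1) are assumed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchArgs:
    query: str
    limit: int = 5
    threshold: float = 0.7


def parse_search_args(raw):
    """Validate raw tool arguments and return typed, bounded values or raise ValueError."""
    query = raw.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")

    limit = raw.get("limit", 5)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or not 1 <= limit <= 50:
        raise ValueError("limit must be an integer between 1 and 50")

    threshold = raw.get("threshold", 0.7)
    if (
        isinstance(threshold, bool)
        or not isinstance(threshold, (int, float))
        or not 0.0 <= threshold <= 1.0
    ):
        raise ValueError("threshold must be a number between 0 and 1")

    return SearchArgs(query=query.strip(), limit=limit, threshold=float(threshold))
```

Rejecting bad input at the tool boundary means an agent that sends `limit: 100000` gets a structured error rather than triggering an unbounded search against the API.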

Contracts

API design

Base URL: http://localhost:8000

POST /users

Creates an Engram user and returns the one-time API key.

Request: { "email": "ayush@example.com" }
Response: { "userId": "usr_9v2", "apiKey": "engram_live_..." }

POST /v1/chat

OpenAI-compatible chat endpoint with memory retrieval and post-response extraction.

Request: { "model": "gpt-4.1-mini", "messages": [{ "role": "user", "content": "Use my style preferences" }] }
Response: { "id": "chatcmpl_...", "choices": [{ "message": { "role": "assistant", "content": "..." } }] }

GET /memories/search

Searches durable memories for dashboard and MCP clients.

Response: { "results": [{ "content": "Prefers concise engineering summaries", "score": 0.82 }] }

POST /v1/chat

Accepts X-Engram-Key, X-Engram-User-ID, provider selection, and disable-injection or disable-extraction headers for proxy-time control.

Request: { "model": "gpt-4.1-mini", "messages": [{ "role": "user", "content": "Remember my preference" }] }
Response: { "choices": [{ "message": { "content": "..." } }], "headers": { "X-Engram-Memories-Injected": "3" } }

GET /retrieval-logs
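From the client side, a call to the proxy might be built as in the sketch below, using only the standard library. The path, base URL, and the X-Engram-Key and X-Engram-User-ID headers come from this section; `X-Engram-Provider` is a hypothetical name for the provider-selection header, which the dossier does not spell out, and `build_chat_request` is an illustrative helper.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, user_id, messages,
                       model="gpt-4.1-mini", provider="openai"):
    """Build (but do not send) a POST /v1/chat request for the Engram proxy."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Engram-Key": api_key,
            "X-Engram-User-ID": user_id,
            "X-Engram-Provider": provider,  # assumed header name, not confirmed by the source
        },
    )
```

Sending it is then `urllib.request.urlopen(request)`. If the proxy reports the injection count as a response header, as the example response above suggests, it would be readable from the returned object's headers.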

Returns recent retrieval evidence for a user, including retrieved memory IDs and similarity scores.

POST /memories/capture

Captures a conversation or manual memory through dashboard or MCP flows and schedules extraction or direct insert.

PATCH /users/{userId}/config

Updates per-user memory settings such as max injected memories, retrieval threshold, and deduplication threshold.

State model

Database design

PostgreSQL 16 with pgvector

Data relationship diagram


users

Application users who own isolated memory collections.

id, email, created_at

api_keys

One-way hashes of user API keys with creation and revocation metadata.

id, user_id, key_hash, created_at, revoked_at

memories

Durable memory content, metadata, and vector(384) embeddings.

id, user_id, content, metadata, embedding vector(384), created_at

retrieval_logs

Audit table storing the natural-language query, query embedding, retrieved memory IDs, retrieved scores, and linked conversation.

id, user_id, query, query_embedding vector(384), retrieved_memory_ids, retrieved_scores, conversation_id

conversations

Raw exchange and extraction status for post-response memory processing.

id, user_id, extraction_status, memories_extracted, raw_exchange, created_at

user_api_keys

Named API key hashes with last-used metadata, separate from the primary user row.

id, user_id, api_key_hash, name, created_at, last_used_at

Architecture decisions

Trade-offs

Memory architecture

Proxy service over Client-side SDK only

A proxy can support any OpenAI-compatible client and centralize retrieval, injection, extraction, and audit behavior.

Vector storage

pgvector over Pinecone or Weaviate

Per-user memory volumes fit well inside PostgreSQL, and one database simplifies self-hosting, backup, and migrations.

Service shape

Docker Compose over Kubernetes

The target user is a self-hosting developer. Compose keeps the four services understandable and easy to run locally.

Extraction timing

Background extraction task over Blocking extraction before response

Users care about chat latency. Background extraction keeps the response path fast while still recording the conversation for durable memory processing.

Database outage behavior

Cached-auth passthrough over Failing all proxy requests

Provider access can continue for known keys even when retrieval is unavailable, but the degraded mode is explicit and does not claim to inject memory.

User-configurable memory controls

Per-user thresholds over One global retrieval setting

Some users want aggressive recall and others want precision. Storing max_memories_injected, retrieval_threshold, and dedup_threshold on the user model makes that behavior tunable.

Lessons learned

Challenges and solutions

Problem

Automatic memory extraction can store noisy or transient facts if it is too eager.

Solution: Treat durable memories as stable preferences, decisions, or repeated facts and keep extraction separate from request forwarding.

Lesson: Memory systems need precision and user inspectability, not just aggressive capture.

Problem

Dashboard and API need secure service-to-service communication without leaking credentials to the browser.

Solution: Use an ENGRAM_SERVICE_KEY for internal dashboard API calls and keep user API keys server-side.

Lesson: Self-hostable systems still need production-grade boundaries between browser, dashboard server, and API.

Problem

Injected memories can make a model worse if irrelevant context crosses the retrieval threshold.

Solution: Limit the number of injected memories, expose per-user thresholds, log retrieval scores, and keep dashboard review possible.

Lesson: RAG quality needs observability and controls, not just embeddings.

Problem

Self-hosted users may lose database connectivity but still expect model calls to work.

Solution: Add a cached-auth passthrough path that skips retrieval and extraction while preserving provider calls for known users.

Lesson: Graceful degradation should clearly reduce features instead of hiding failure.

Problem

MCP tools are called by agents and can receive malformed or overly broad inputs.

Solution: Validate tool arguments with Zod, bound limits and thresholds, and route errors through structured tool results.

Lesson: Agent-facing tools need the same input discipline as public APIs.

Runbook

Requirements and future work

Requirements

  • Docker and Docker Compose for the four-service stack.
  • PostgreSQL 16 with the pgvector extension.
  • OpenAI API key or another compatible LLM provider key.
  • Clerk account and ENGRAM_SERVICE_KEY for dashboard authentication.
  • Database schema must enable the vector extension and uuid-ossp before tables are created.
  • Users need configured max_memories_injected, retrieval_threshold, and dedup_threshold defaults.
  • MCP server requires ENGRAM_API_URL, ENGRAM_USER_ID, and an Engram API key to call memory tools.
  • Dashboard service routes must keep ENGRAM_SERVICE_KEY server-side and never expose it to browser bundles.

Future improvements

  • Add per-memory confidence and source attribution.
  • Add a memory review queue where extracted candidates require approval before becoming durable memories.
  • Support multiple embedding providers, with migration tools for re-embedding all memories when vector dimensions or providers change.
  • Display retrieval logs next to chat traces so users can see exactly which memories influenced an answer.
  • Add per-memory TTL and scope controls for temporary project facts versus long-term preferences.

Active recall

Interview Q&A

Architecture · Medium

Why use a proxy instead of asking each app to manage memory?

Tradeoffs · Medium

Why is pgvector enough for Engram's first version?

Concepts · Easy

How does Engram protect API keys?

Architecture · Hard

What happens in Engram when PostgreSQL is unavailable?

Concepts · Medium

Why store retrieval logs?

Concepts · Medium

How do retrieval_threshold and dedup_threshold differ?

Tradeoffs · Medium

Why should memory extraction be asynchronous?

Architecture · Hard

What security boundary exists between the dashboard and API?

Behavioral · Hard

What is the risk of storing memories automatically?

Tradeoffs · Medium

Why keep vector search inside PostgreSQL?

Architecture · Medium

How does the MCP server fit into the architecture?