1. **Ingestion pipeline** (sketch below)
   - Capture every new post / edit / media event.
   - Preprocess: clean the text, detect language, and extract metadata (timestamp, subreddit, post ID, reply-to, upvotes).
   - Media routing: send images to a vision pipeline, videos to a frame extractor → video model, and audio to a speech pipeline.

2. **Short-term context cache (hot memory)**
   - Keep the most recent N posts / conversation turns directly in the LLM prompt (sliding window), e.g., the last 2–5 posts or the last X minutes of activity.
   - This is cheap and low-latency.

3. **Vector DB (semantic memory)** (sketch below)
   - Create embeddings for each item (post, comment, image caption, condensed transcript).
   - Store vectors plus metadata in a vector DB (FAISS / Milvus / Pinecone / Weaviate).
   - Use semantic retrieval (k-nearest neighbors) to fetch relevant history for a given query.

4. **Memory condensation / hierarchical store** (sketch below)
   - Run a periodic condensing job (part of the infrastructure, not something you wait on interactively) that:
     - Clusters older posts by topic/time.
     - Summarizes each cluster into a compact, human-readable memory (e.g., 50–200 tokens).
     - Replaces many raw items with one condensed memory entry (but keeps the originals for full audit).
     - Keeps the vector DB populated with both raw recent vectors and long-term condensed vectors.

5. **Versioning + edits** (sketch below)
   - Store one canonical entity per post ID. When a post is edited, replace the text, update its embedding, and optionally keep a diff log.
   - Keep a last-modified timestamp (and optionally a small delta log) so retrieval can prioritize the latest content.
   - For the LLM, retrieval returns the canonical (latest) content; if you want to show an edit history, return both.

6. **Retriever + RAG** (sketch below)
   - On each query, retrieve: the hot cache (recent posts), the k nearest semantic memories (raw + condensed), and precomputed topical summaries.
   - Feed these into the LLM in a structured prompt: short-term context first, high-relevance raw items next, condensed long-term memories after.

7. **Multimodal perception (vision / audio)** (audio sketch below)
   - Vision: SAM-style segmentation + CLIP embeddings + a dedicated classifier (or fine-tuned ViT) for fine-detail queries. For video, use frame sampling plus a temporal model (I3D, SlowFast, or transformer-based video encoders) to produce embeddings or structured facts.
   - Audio: VAD → transcription (Whisper or wav2vec2 variants) → embed the transcript; also extract audio embeddings for speaker/tonal cues if needed.
   - Store all modality embeddings in the same vector DB with a modality tag.

8. **Neural compression & learned selectors** (sketch below)
   - Use an LLM to create semantic pointers: short summaries that preserve more signal than naive truncation.
   - Keep high-value items uncompressed (high karma, many replies); compress low-value ones.
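To make the ingestion step concrete, here is a minimal sketch of the preprocess-and-route stage. The `IngestEvent` fields, the handler names, and the `langdetect` dependency are illustrative assumptions, not requirements from the outline above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

from langdetect import detect  # assumed language-detection dependency


@dataclass
class IngestEvent:
    post_id: str
    subreddit: str
    reply_to: str | None
    upvotes: int
    media_type: str          # "text", "image", "video", or "audio"
    payload: str             # raw text or a media URI
    timestamp: str = ""
    language: str = ""


def preprocess(event: IngestEvent) -> IngestEvent:
    """Clean text, detect language, and stamp metadata."""
    event.payload = event.payload.strip()
    event.timestamp = event.timestamp or datetime.now(timezone.utc).isoformat()
    if event.media_type == "text":
        event.language = detect(event.payload)
    return event


def route(event: IngestEvent) -> str:
    """Send each media type to the matching downstream pipeline (names are placeholders)."""
    handlers = {
        "text": "text_pipeline",
        "image": "vision_pipeline",
        "video": "frame_extractor",   # extracted frames then go to a video model
        "audio": "speech_pipeline",
    }
    return handlers[event.media_type]
```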
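For the semantic-memory step, a minimal sketch of embedding items and running k-nearest-neighbor retrieval with FAISS and sentence-transformers. The embedding model, the vector dimension, and the in-memory metadata list are assumptions made for illustration; any vector DB from the list above would work the same way.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative choice of embedder
index = faiss.IndexFlatIP(384)                    # inner product on normalized vectors = cosine
metadata: list[dict] = []                         # parallel store for post metadata


def add_item(text: str, meta: dict) -> None:
    """Embed one item (post, comment, caption, transcript) and store vector + metadata."""
    vec = model.encode([text], normalize_embeddings=True)
    index.add(np.asarray(vec, dtype="float32"))
    metadata.append({**meta, "text": text})


def retrieve(query: str, k: int = 5) -> list[dict]:
    """k-nearest-neighbor semantic retrieval for a query."""
    q = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [metadata[i] for i in ids[0] if i != -1]
```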
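A sketch of what the condensation job could look like, assuming embeddings for the older items are already available: cluster, summarize each cluster, and emit one condensed entry per cluster while keeping the originals for audit. `summarize_with_llm` is a placeholder for whatever LLM call you use; the cluster count is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans


def summarize_with_llm(texts: list[str]) -> str:
    """Placeholder: ask your LLM for a compact 50-200 token summary of these texts."""
    raise NotImplementedError


def condense(old_items: list[dict], embeddings: np.ndarray, n_clusters: int = 8) -> list[dict]:
    """Cluster older posts by topic and produce one condensed memory per cluster."""
    k = min(n_clusters, len(old_items))
    labels = KMeans(n_clusters=k, n_init="auto").fit_predict(embeddings)
    condensed = []
    for c in range(k):
        members = [item for item, lab in zip(old_items, labels) if lab == c]
        if not members:
            continue
        condensed.append({
            "kind": "condensed",
            "summary": summarize_with_llm([m["text"] for m in members]),
            "source_ids": [m["post_id"] for m in members],   # originals stay stored for audit
        })
    return condensed
```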
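A sketch of the versioning step: one canonical record per post ID, re-embedded on edit, with a last-modified timestamp and a small unified-diff delta log. The field names and the in-memory dict are illustrative; `embed` stands in for whatever embedding function feeds your vector DB.

```python
import difflib
from datetime import datetime, timezone

canonical: dict[str, dict] = {}   # post_id -> latest (canonical) record


def upsert_post(post_id: str, text: str, embed) -> None:
    """Store the canonical text; on edit, update the embedding and append a diff entry."""
    now = datetime.now(timezone.utc).isoformat()
    prev = canonical.get(post_id)
    record = {
        "post_id": post_id,
        "text": text,
        "embedding": embed(text),       # recompute so retrieval sees the latest content
        "last_modified": now,
        "diffs": prev["diffs"] if prev else [],
    }
    if prev and prev["text"] != text:
        delta = "\n".join(difflib.unified_diff(
            prev["text"].splitlines(), text.splitlines(), lineterm=""))
        record["diffs"].append({"at": now, "delta": delta})
    canonical[post_id] = record
```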
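One way to assemble the retrieved pieces into the structured prompt described in the RAG step, in the stated order: short-term context first, high-relevance raw items next, condensed long-term memories and topical summaries after. The section labels are arbitrary.

```python
def build_prompt(query: str,
                 hot_cache: list[str],
                 raw_hits: list[str],
                 condensed: list[str],
                 summaries: list[str]) -> str:
    """Order the context: recent turns, raw retrieved posts, condensed memories, summaries."""
    parts = [
        "## Recent conversation",
        *hot_cache,
        "## Relevant raw posts",
        *raw_hits,
        "## Long-term memories (condensed)",
        *condensed,
        "## Topical summaries",
        *summaries,
        "## Question",
        query,
    ]
    return "\n".join(parts)
```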
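For the audio branch, a sketch that skips the VAD step and uses the openai-whisper package for transcription before handing the transcript to the indexer. The model size and the `index_fn` callback (e.g., `add_item` from the vector DB sketch) are illustrative assumptions.

```python
import whisper

asr = whisper.load_model("base")   # model size is an arbitrary choice


def ingest_audio(path: str, meta: dict, index_fn) -> None:
    """Transcribe an audio file, then index the transcript with a modality tag.

    `index_fn` is whatever stores (text, metadata) in the shared vector DB.
    """
    result = asr.transcribe(path)                         # returns a dict with "text" and "segments"
    index_fn(result["text"], {**meta, "modality": "audio"})
```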
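Finally, a sketch of the value-based selector from the compression step, deciding which items stay uncompressed; the karma and reply thresholds are made-up numbers.

```python
def should_compress(item: dict, karma_cutoff: int = 50, reply_cutoff: int = 10) -> bool:
    """Compress low-value items; keep high-karma or heavily-discussed posts raw."""
    return item.get("upvotes", 0) < karma_cutoff and item.get("reply_count", 0) < reply_cutoff
```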