Last year, I set out to create a "digital twin" by training an LLM on a corpus of my own data and instructing it to engage in the first person as me. To be clear, I’m not interested in creating an avatar that looks/sounds like me. Personally, I find that idea creepy, and I have little interest in engaging with other people’s virtual likenesses. My interest is in personal representation: my essence as a guide. I’ve explored this idea in the past with simple decision trees, but they were highly constrained and limited.
My "Digital Twin"
My first attempt used a product called Selfie from Vana. The promise of Selfie is an LLM that’s trained and runs “device-side” (similar to Apple's promise), which keeps your personal data away from the prying eyes of big tech. I wasn't happy with the results: it was too limited and too resource-intensive. Last month I tried again with Anthropic's Claude 3.5 Sonnet. Of all the foundation models, I trust Anthropic the most with my data, as they've put privacy front/center in their policies. That said, I made a concerted effort to decouple my directly identifiable data (PII) from my training data to limit downstream exposure. I got close to an experience I was happy with, but hit a wall with the context window size and the reliability of their JSON rendering.
So I tried again with OpenAI, the company I love to hate but that keeps giving me what I want. And I've come much closer to what I was looking for: a deep and fairly accurate first-person representation of myself, my work, my writing, my interest in music, and many other aspects of my life. Having played with some other "digital twins," I wanted more than just a chat UI. So I built a mixed-media chat UI (going back to my roots in multimedia and hypertext).
I’ll share a follow-up post with all the details about how I built it. In the meantime, here are a few insights:
- The current training corpus is just over 200,000 characters, or 28,000 words.
- The entire training corpus sits inside the instructions of an OpenAI Assistant. The instructions consume about half of the available context window, and consumption grows over the course of a session (see the setup sketch after this list).
- The best results have come from structuring the training data in Markdown with a deeply nested hierarchy (I’ll share the structure in my follow-up).
- I’m using 4o-mini and averaging ~75k tokens per query, at a cost of about $0.015/query (in/out). So far that works out to an average of about $0.10 per session (rough arithmetic below).
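
Here's a minimal sketch of that setup, using the OpenAI Python SDK and the Assistants API. The filename, assistant name, and prompt wording are placeholders rather than my actual implementation; the point is simply that the whole Markdown corpus rides along as the assistant's instructions.

```python
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Load the Markdown corpus and pass it as the assistant's instructions,
# so the full corpus is in the context window on every run.
corpus = Path("twin_corpus.md").read_text()  # placeholder filename

assistant = client.beta.assistants.create(
    name="Digital Twin",
    model="gpt-4o-mini",
    instructions=(
        "Respond in the first person as the author, grounded only in the "
        "corpus below.\n\n" + corpus
    ),
)

# Each visitor session maps to a thread; each question is a message plus a run.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What kind of music are you into?",
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
if run.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)  # newest message is first
```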
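
And the back-of-the-envelope math behind those cost figures, assuming GPT-4o mini's list pricing ($0.15 per 1M input tokens, $0.60 per 1M output tokens); the token split and the queries-per-session count here are illustrative guesses, not measurements:

```python
# Rough per-query cost at assumed GPT-4o mini list pricing.
INPUT_USD_PER_TOKEN = 0.15 / 1_000_000
OUTPUT_USD_PER_TOKEN = 0.60 / 1_000_000

# Illustrative split of the ~75k tokens: mostly input, a short reply.
input_tokens, output_tokens = 74_000, 1_000

per_query = (
    input_tokens * INPUT_USD_PER_TOKEN + output_tokens * OUTPUT_USD_PER_TOKEN
)
print(f"~${per_query:.3f} per query")                   # ~$0.012, ballpark of $0.015
print(f"~${per_query * 7:.2f} for a 7-query session")   # ~$0.08, ballpark of $0.10
```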
It's far from perfect; in fact, it's still deeply flawed. But I think it's pretty interesting. I invite you to give it a spin. If you know me, dig into my life a bit and let me know if you feel it's an accurate representation.