Our “Clone Brain” architecture allows you to create a digital representation of your mind—reflecting your knowledge, tone, ways of thinking, and even the purpose that drives your conversations. (For example, a leadership coach might direct their clone to mentor emerging managers, while a consultant might want their clone to focus on sales strategy and client onboarding.)
Until now, many of our improvements have come from intuition, first principles, and a very basic testing suite. We want to increase the fidelity of each Clone Brain so that it captures its owner’s unique style, knowledge, and conversational aims while still reasoning well in new situations. To do that, we need rigorous measurements and interpretability tools that turn “it feels right” into “we have the metrics and benchmarks to prove it.”
Enter the Research Engineer – Evals & Interpretability. You’ll develop frameworks that quantify how well each digital clone mirrors the authenticity and expertise of its human counterpart, while also building the tooling to open the black box and figure out why the clone behaves the way it does. If you’re curious about cognitive science, neural network interpretability, and the essence of what makes a human mind unique—this role has your name on it.
What You Will Work On
- Frontier Eval Systems & Metrics
  - Design, implement, and manage robust evaluation frameworks that measure how faithfully a clone reflects its owner’s tone, style, purpose, and reasoning (a minimal sketch of what such a harness might look like follows this list).
  - Develop automated tests and analysis pipelines to compare new models and architectures, ensuring we’re always improving the fidelity of our Clone Brain.
- Interpretability & Debugging
  - Build interpretability tools that shine a light on the internal workings of our clone models, from attention heads to knowledge-graph structures (a toy attribution example also follows this list).
  - Investigate model behaviors and anomalies, surfacing insights that guide algorithmic improvements and mitigate unexpected outcomes.
- Collaboration & Deployment
  - Work closely with our AI, product, and engineering teams to integrate your evaluation suites into production workflows.
  - Contribute to real-time feedback loops that help experts refine their clone’s knowledge and style with confidence.
- Infrastructure & Tooling
  - Develop the technical infrastructure for large-scale experimentation and analysis, ensuring that interpretability and eval frameworks can scale across thousands of clones.
  - Help define our data schemas, retrieval strategies, and model instrumentation in collaboration with data and infra engineers.
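To make the eval responsibility concrete, here is a minimal sketch of what one fidelity-eval harness could look like. Everything in it is an illustrative assumption rather than our actual stack: the `EvalCase` shape, the `clone_reply` callable, and the token-overlap `judge_style_match` placeholder (in practice the judge would be an LLM-based or learned scorer, and the cases would come from an owner’s real conversations).

```python
"""Minimal, hypothetical sketch of a clone-fidelity eval harness."""
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class EvalCase:
    prompt: str        # a question the owner has actually answered
    owner_answer: str  # the owner's reference answer

def judge_style_match(owner_answer: str, clone_answer: str) -> float:
    """Placeholder judge: in practice, an LLM-as-judge or learned similarity
    model returning a score in [0, 1]. Trivial token overlap is used here
    purely so the sketch runs."""
    owner_tokens = set(owner_answer.lower().split())
    clone_tokens = set(clone_answer.lower().split())
    if not owner_tokens or not clone_tokens:
        return 0.0
    return len(owner_tokens & clone_tokens) / len(owner_tokens | clone_tokens)

def run_eval(clone_reply: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Score each case and aggregate into a single fidelity metric."""
    scores = [judge_style_match(c.owner_answer, clone_reply(c.prompt)) for c in cases]
    return {"per_case": scores, "mean_fidelity": mean(scores)}

if __name__ == "__main__":
    cases = [EvalCase("How do I onboard a new client?",
                      "Start with a discovery call, then send a one-page plan.")]
    fake_clone = lambda prompt: "Start with a discovery call and share a short plan."
    print(run_eval(fake_clone, cases))
```

The point is the shape of the loop: curated reference cases, a pluggable judge, and an aggregate score that can be tracked across model and architecture changes.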
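On the interpretability side, one of the simplest building blocks is input attribution. The toy example below uses gradient × input on a stand-in PyTorch model; it is only meant to illustrate the kind of instrumentation this role involves, not how our clone models are actually structured.

```python
import torch
import torch.nn as nn

# Stand-in model; a real clone model would be a transformer, not a tiny MLP.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(1, 8, requires_grad=True)

# Gradient x input attribution: roughly, which input features drive the output?
model(x).sum().backward()
attribution = (x.grad * x).detach().squeeze()
print(attribution)  # one attribution score per input feature
```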
Preferred Abilities
- Hands-On Research Experience: A track record of designing experiments and running them end-to-end—whether in AI, ML, or another scientific domain.
- LLM Familiarity: Experience evaluating or fine-tuning large language models, with an emphasis on measuring alignment, style transfer, or interpretability.
- Python Proficiency: Strong coding skills to build robust pipelines and experiment frameworks.
- Evals & Benchmarking: Familiarity with common language model benchmarks and an eagerness to develop new ones.
- Interpretability Fundamentals: Knowledge of mechanistic interpretability, feature attribution, or circuit-level analysis is a huge plus.
- Infrastructure & Tools: Comfort with containers, scaling experiments on clusters, and building internal tools.
- Experimental Mindset: Ability to pivot quickly when an approach doesn’t pan out, and a relentless drive to find creative solutions to open-ended questions.
Why You Might Like This Role
- Evals for AI are still frontier research: how to do them correctly remains an open question. People who thrive in this role are excited by that challenge and by the opportunity to work at the forefront of the field.
- High level of ownership and impact on product, technical architecture, and company culture.
- Opportunity to define the future of digital cloning, ultimately enabling digital immortality and one-on-one mentorship for the masses.