Interpretable Machine Learning

But aren’t humans’ interpretations of themselves also abstractions? We don’t actually know why we do what we do; we only have an interpretation.

We tell ourselves stories.

Resources:

  • https://transformer-circuits.pub/ Anthropic’s Transformer Circuits threads. These are really good if you want to build a better intuition for what transformers are actually doing internally.
    • What if you just had an API to visualize these models? Something like model.explain() that returns attribution stats for a given input. There should be a systematic way to do interpretability based on some inputs (a rough sketch of what that could look like is below).
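A minimal sketch of what a model.explain()-style helper could look like, assuming a Hugging Face sequence-classification model. The explain name is hypothetical (no such method exists on these models), the checkpoint is just an example, and the scores are plain gradient-times-input saliency on the embeddings, one of the simplest attribution baselines rather than anything from the circuits work.

```python
# Hypothetical explain() helper: per-token saliency via gradient x input.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def explain(model, tokenizer, text):
    """Return (token, saliency) pairs for the model's top predicted class."""
    model.eval()
    enc = tokenizer(text, return_tensors="pt")
    # Embed the tokens ourselves so we can take gradients w.r.t. the embeddings.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])
    # Backprop the top-class logit down to the embeddings.
    out.logits[0, out.logits[0].argmax()].backward()
    # Gradient-times-input, summed over the embedding dimension.
    saliency = (embeds.grad * embeds).sum(dim=-1).abs().squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, saliency.tolist()))

# Example checkpoint (assumption): a small sentiment classifier.
name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
for token, score in explain(model, tokenizer, "The movie was surprisingly good."):
    print(f"{token:>15}  {score:.4f}")
```

A real model.explain() would presumably bundle several such views (saliency, attention maps, learned features) behind one call; this just shows the shape such an API could take.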