LLM Evaluation
First investigated this at the Etched hackathon.
How do we evaluate the performance of an LLM?
An open question: could you do hyperparameter tuning against these eval types to find the optimal weights? That could be interesting, as could inventing a new eval type.
The common evals:
- https://github.com/JonathanChavezTamales/LLMStats
- MMLU, GPQA (the benchmarks DeepSeek was evaluated on)
- Perplexity (sketch below)
- HumanEval (pass@k sketch below)
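
Perplexity is the exponential of the average per-token negative log-likelihood, so lower is better. A minimal sketch using Hugging Face `transformers`; the `gpt2` checkpoint and the test sentence are placeholder assumptions, not part of the notes above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy
    # loss over the sequence; perplexity is exp of that mean.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"perplexity: {perplexity.item():.2f}")
```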
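
HumanEval results are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The HumanEval paper gives an unbiased estimator, 1 - C(n-c, k)/C(n, k), where n is the number of samples generated and c the number that pass; a sketch of the numerically stable form:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.
    n: total samples generated per problem
    c: number of samples that pass the unit tests
    k: evaluation budget
    Computes 1 - C(n-c, k) / C(n, k) as a stable running product.
    """
    if n - c < k:
        # Too few failures to fill k slots, so at least one sample passes.
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, `pass_at_k(200, 50, 1)` returns 0.25, matching the raw pass rate c/n at k = 1; larger k gives credit for getting at least one success within the budget.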