LLM Evaluation
First investigated this at the Etched hackathon.
How do we evaluate the performance of an LLM?
An open question: could you do hyperparameter tuning against these eval types to find the optimal weights? That could be interesting, as could inventing a new eval type.
The common evals:
- https://github.com/JonathanChavezTamales/LLMStats
- MMLU, GPQA (the benchmarks DeepSeek was evaluated on)
- Perplexity (sketch below)
- HumanEval (pass@k sketch below)
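
Perplexity is the exponential of the average per-token negative log-likelihood, so lower is better. A minimal sketch using Hugging Face `transformers`; the `gpt2` checkpoint and the test sentence are placeholder assumptions, not part of the notes above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy
    # loss over the sequence; perplexity is exp of that mean.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"perplexity: {perplexity.item():.2f}")
```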
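
HumanEval results are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The HumanEval paper gives an unbiased estimator, 1 - C(n-c, k)/C(n, k), where n is the number of samples generated and c the number that pass; a sketch of the numerically stable form:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.
    n: total samples generated per problem
    c: number of samples that pass the unit tests
    k: evaluation budget
    Computes 1 - C(n-c, k) / C(n, k) as a stable running product.
    """
    if n - c < k:
        # Too few failures to fill k slots, so at least one sample passes.
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, `pass_at_k(200, 50, 1)` returns 0.25, matching the raw pass rate c/n at k = 1; larger k gives credit for getting at least one success within the budget.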