🛠️ Steven Gong


Mar 30, 2025, 1 min read

LLM Evaluation

I first looked into this at the Etched hackathon.

How do we evaluate the performance of an LLM?

It could also be interesting to do hyperparameter tuning over the types of the optimal weights.

Inventing a new type

The common evals:

  • https://github.com/JonathanChavezTamales/LLMStats
  • MMLU, GPQA (this is what DeepSeek was evaluated on)
  • Perplexity
  • HumanEval
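
Two of the evals above reduce to short formulas, so here is a rough sketch of how I understand them. Perplexity is the exponential of the mean negative log-likelihood per token, and HumanEval is usually reported as pass@k, for which the commonly used unbiased estimate is 1 - C(n-c, k) / C(n, k) given n generated samples of which c pass the tests. The function names and inputs below are my own assumptions for illustration (natural-log token probabilities, a pre-counted number of passing samples), not the API of any particular eval harness.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Assumes the log-probs were already extracted from the model;
    this helper is hypothetical and does not call any model.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that passed, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token
    print(perplexity([math.log(0.25)] * 4))
    # 10 samples, 3 of them correct, budget of 1
    print(pass_at_k(n=10, c=3, k=1))
```

A model that is uniformly unsure between 4 tokens comes out at a perplexity of about 4, and with 3 passing samples out of 10, pass@1 is about 0.3. For a full benchmark, pass@k would be averaged over all problems.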

