A Survey on Evaluation of Large Language Models
- mirglobalacademy
- Oct 31, 2025
- 4 min read

1. Evaluation Matrices / Dimensions
When you evaluate an LLM’s output (for example in QA, summarisation, dialogue, code generation), it’s useful to use a matrix of criteria rather than a single “correct/incorrect” label. Typical dimensions include:
| Dimension | What it means | Why it matters |
| --- | --- | --- |
| Correctness / Factual Accuracy | Are the claims true (or consistent with the ground truth)? | Without factual accuracy, the value of the output is undermined. |
| Completeness | Does the output cover all required parts of the task (no missing pieces)? | An answer can be factually correct but incomplete. |
| Relevance / Task Alignment | Does the response follow the prompt/instruction and stay on topic? | Models may go off-topic or ignore parts of the prompt. |
| Logical Consistency / Reasoning Quality | If reasoning is required, are the steps valid and consistent? | For reasoning tasks (math, logic), the chain matters. |
| Fluency / Style / Readability | Is the output clear, grammatically correct, and easy to understand? | Even correct content can be unreadable or ambiguous. |
| Originality / Appropriateness (for creative tasks) | For open-ended generation: is the content novel, interesting, and appropriate? | Many valid outputs exist; this dimension captures qualitative aspects. |
| Robustness / Safety / Bias | Does the model avoid hallucinations, biased statements, and harmful content? | Especially important in production or sensitive domains. |
| Efficiency / Cost (optional) | How much compute, time, or model effort was used to produce the output? | For practical deployments, you may care about resource usage. |
You can imagine each output being scored on each dimension (e.g., on a 0–5 scale), with the scores aggregated in whatever way suits your use case.
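As a concrete illustration, here is a minimal sketch of scoring one output on several dimensions and aggregating with use-case-specific weights. The dimension names and weight values are illustrative assumptions, not a prescribed standard:

```python
def aggregate_scores(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores (each on a 0-5 scale)."""
    total_weight = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total_weight

# Hypothetical scores for one output, weighted toward correctness.
scores = {"correctness": 5, "completeness": 4, "relevance": 5, "fluency": 3}
weights = {"correctness": 0.4, "completeness": 0.3, "relevance": 0.2, "fluency": 0.1}

overall = aggregate_scores(scores, weights)  # weighted mean, roughly 4.5
```

How you set the weights is itself a design decision: a legal-QA deployment might weight correctness far above fluency, while a creative-writing assistant might do the opposite.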
2. Key Evaluation Metrics
Here are commonly used quantitative metrics (especially for automatic/algorithmic evaluation) — and when each is most appropriate.
2.1 Traditional reference‐based metrics
Good when you have a ground‐truth “reference” answer.
Accuracy / Exact Match (EM): Proportion of outputs that exactly match the reference.
Precision / Recall / F1: Common in structured tasks or classification.
BLEU: Measures n‐gram overlap between generated text and reference translations/summaries.
ROUGE: Recall‐based overlap metric (commonly used in summarisation).
METEOR: Combines precision & recall + synonym/stemming matching.
BERTScore / embedding‐based similarity: Computes similarity of embeddings between output and reference (hence more semantic) rather than just exact token overlap.
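Two of the simplest metrics above, exact match and token-level F1, can be sketched in a few lines. Note that this uses naive lowercasing and whitespace tokenization; real evaluation scripts typically also normalise punctuation and articles:

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> bool:
    """EM: true only if the normalised strings are identical."""
    return prediction.strip().lower() == reference.strip().lower()

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

F1 is more forgiving than EM: a prediction that contains the reference plus extra words still earns partial credit through recall.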
2.2 Reference‐free / model‐based / novel metrics
Because many LLM tasks are open‐ended (with multiple valid answers), traditional reference‐based metrics may fail.
LLM‐as‐a‐Judge / “Generative” evaluation: Use a strong LLM (or the same LLM) to judge outputs according to criteria. For example, the G‑Eval framework uses chain‐of‐thought prompting and a form‐filling paradigm to score outputs.
Hallucination Score / Faithfulness Metrics: Quantify how much the output deviates from the source or introduces ungrounded content.
Model Utilization Index (MUI): A newer metric that estimates how much of the model’s internal capacity or mechanism was engaged to produce the output, connecting evaluation to interpretability (see Cao et al., 2025 in Section 4).
2.3 Human / Qualitative Metrics
When automatic metrics aren’t enough (especially in open‐ended tasks), human judgement is still gold.
Likert scale ratings: E.g., 1–5 for correctness, clarity, helpfulness.
Ranking / Preference judgments: Ask annotators to pick the better output among several.
Detailed error annotation: Identify specific kinds of errors (fact‐error, missing‐info, logic‐error, etc.).
User feedback / real‐world impact: In production systems, the ultimate metric may be user satisfaction, engagement, or the rate of erroneous behaviour.
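Aggregating these human judgments is straightforward; here is a sketch using mean Likert ratings per dimension plus a pairwise win rate for preference judgments (the data shapes are illustrative):

```python
from statistics import mean

# Three annotators' 1-5 Likert ratings, per dimension.
ratings = {
    "correctness": [5, 4, 5],
    "clarity": [3, 4, 4],
}
mean_ratings = {dim: mean(vals) for dim, vals in ratings.items()}

# Which of two outputs ("A" or "B") each annotator preferred.
preferences = ["A", "A", "B", "A", "B"]
win_rate_a = preferences.count("A") / len(preferences)  # 0.6
```

With multiple annotators it is also worth reporting inter-annotator agreement (e.g., Cohen's kappa), since low agreement signals that the rating rubric itself is ambiguous.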
2.4 Aggregate / Monitoring Metrics
For ongoing monitoring of production LLMs:
Task completion rate: % of tasks the model completes successfully.
Latency / throughput / cost per output: For practical deployments.
Drift / degradation metrics: Change in performance over time.
Bias / fairness metrics: Difference in performance across demographics, or frequency of harmful content.
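These monitoring metrics reduce to simple aggregations over a request log. A sketch, with an assumed record schema and a crude nearest-rank percentile for latency:

```python
# Hypothetical production request log.
records = [
    {"completed": True, "latency_ms": 320},
    {"completed": True, "latency_ms": 450},
    {"completed": False, "latency_ms": 1800},
    {"completed": True, "latency_ms": 280},
]

# Task completion rate: fraction of successfully completed requests.
completion_rate = sum(r["completed"] for r in records) / len(records)  # 0.75

# p95 latency via a simple nearest-rank estimate on the sorted values.
latencies = sorted(r["latency_ms"] for r in records)
p95_latency = latencies[int(0.95 * (len(latencies) - 1))]
```

Tracked over time (per day or per model version), the same quantities double as drift metrics: a falling completion rate or rising tail latency after a model update is an early warning sign.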
3. Putting It Together: Evaluation Workflow
Here’s a practical evaluation workflow you can follow:
Define metrics: Choose which dimensions (from Section 1) matter for your use case.
Select appropriate measurement method:
If you have reference answers → use traditional metrics.
If open‐ended → use human evaluation and/or LLM‐as‐judge.
In production → include monitoring metrics (completion rate, latency, bias).
Collect data: Create a test set (with diverse cases), include edge cases, adversarial ones.
Run automated metrics: For each output, compute the chosen metrics (BLEU, ROUGE, etc.).
Run human/LLM‐judge evaluation: For a subset or full set depending on cost.
Analyse failures: Examine where the model fails (missing info, logic errors, hallucination, etc.).
Set thresholds / benchmarks: Decide what performance is “good enough” for deployment.
Monitor over time: As you update models or data drift occurs, keep tracking.
Iterate: Use findings to improve prompts, retrieval, fine‐tuning, or error‐handling.
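The steps above can be tied together in a minimal evaluation harness: run a model function over a test set, score every output with each chosen metric, and summarise. The `model` and metric functions here are placeholders you would swap for your own components:

```python
def evaluate(model, test_set, metrics):
    """Run `model` over `test_set` and average each metric.

    test_set: list of {"input": ..., "reference": ...} dicts.
    metrics:  dict of metric name -> fn(output, reference) -> float.
    """
    results = {name: [] for name in metrics}
    for example in test_set:
        output = model(example["input"])
        for name, fn in metrics.items():
            results[name].append(fn(output, example["reference"]))
    return {name: sum(vals) / len(vals) for name, vals in results.items()}

# Usage with a trivial echo "model" and an exact-match metric:
summary = evaluate(
    model=lambda x: x,
    test_set=[{"input": "a", "reference": "a"},
              {"input": "b", "reference": "c"}],
    metrics={"exact_match": lambda out, ref: float(out == ref)},
)
# summary["exact_match"] == 0.5
```

Keeping the per-example scores (not just the averages) makes the failure-analysis step much easier, since you can sort by score and inspect the worst cases first.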
4. Important Recent Papers
Here are some key recent academic papers that discuss LLM evaluation metrics, frameworks and challenges:
A Survey on Evaluation of Large Language Models (Chang et al., 2023) — covers “what to evaluate, where to evaluate, how to evaluate” for LLMs. (arXiv)
G‑Eval: NLG Evaluation using GPT‑4 with Better Human Alignment (Liu et al., 2023) — introduces a framework using LLMs as evaluators. (arXiv)
Leveraging Large Language Models for NLG Evaluation: Advances and Challenges (Li et al., 2024) — an overview of LLM‐based evaluation methods and challenges. (arXiv)
Revisiting LLM Evaluation through Mechanism Interpretability: a New Metric and Model Utility Law (Cao et al., 2025) — proposes the MUI metric and discusses evaluation in terms of mechanism interpretability. (arXiv)
Multi‑Layered Evaluation Using a Fusion of Metrics and LLMs as Judges (Rahnamoun & Shamsfard, 2025) — explores combining lexical, semantic, and LLM‐judge metrics. (ACL Anthology)

