Top 25 LLMs System Design Interview Questions
- mirglobalacademy
- Nov 20, 2025
- 3 min read
🧠 Chapter 1:
The Tokenizer Trap in Domain-Specific LLM Training
🎯 The Interview Question
Alright, imagine you're in an interview. The interviewer leans forward and says:
“We’re training a new LLM for the legal and medical domains. Can we just use a standard LLaMA 3 tokenizer? What’s the risk… and how would you fix it?”
Most folks answer with nonchalance (casual disregard), but this question hides a deep pitfall (hidden danger).
🚫 The Common Wrong Answer
“Well, it’s not ideal, but the tokenizer will just break up unknown words like ‘aneurysm’ into subwords. The model can learn those combinations during fine-tuning.”
Sounds plausible (seemingly reasonable), right?
Wrong. That’s a deceptively inadequate response. Here's why:
The tokenizer doesn’t know the esoteric (intended for a small, specialized group) terms in law or medicine.
So, it splits them into many tiny pieces—called subwords—like aneurysm → ane + ur + ysm.
Now, your model needs to process way more tokens per input, ballooning (expanding rapidly) the input length.
Self-attention in Transformers scales as O(n²) in sequence length—so more tokens means quadratically more compute.
If fragmentation makes your inputs 4x longer, attention cost can jump 16x, your effective context window shrinks, and your model learns slower.
Ouch.
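To see the fragmentation problem concretely, here’s a toy greedy longest-match tokenizer. The vocabularies are made up for illustration (real tokenizers are trained, not hand-written), but the effect is the same one described above:

```python
def tokenize(word, vocab):
    """Greedy longest-match subword segmentation (simplified WordPiece-style)."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

# Hypothetical vocabularies: the generic one never saw medical text...
generic_vocab = {"an", "eur", "ysm", "gli", "ob", "last", "oma", "the"}
# ...while the domain-trained one learned whole terms as single tokens.
domain_vocab = generic_vocab | {"aneurysm", "glioblastoma"}

print(tokenize("aneurysm", generic_vocab))  # → ['an', 'eur', 'ysm']
print(tokenize("aneurysm", domain_vocab))   # → ['aneurysm']
```

With the generic vocabulary the word costs 3 tokens; with the domain vocabulary it costs 1—and since attention is O(n²), a 3x reduction in tokens is roughly a 9x reduction in attention compute for that span.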
✅ How It Actually Works
Here’s the savvy (well-informed) fix:
👉 Train a custom tokenizer using your legal and medical dataset from scratch.
Why?
Because a domain-specific tokenizer compresses tokens better.
Instead of breaking “glioblastoma” into 5 fragments, your custom tokenizer might learn to treat it as a single token.
That means:
Shorter sequences
Lower memory usage
Faster training
Better accuracy on domain-specific tasks
In tech-speak: you’re minimizing fertility (the average number of tokens per word), which directly reduces the computational burden of the attention layers.
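Here’s a minimal sketch of what “training a tokenizer from scratch” actually means: the core BPE merge loop in pure Python. The tiny corpus and merge count are hypothetical; in practice you’d use a library such as Hugging Face `tokenizers` on gigabytes of domain text.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Minimal BPE training: repeatedly merge the most frequent adjacent
    symbol pair seen in the corpus. Real training is this idea at scale."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))  # count adjacent symbol pairs
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]
        merges.append((a, b))
        for w in words:
            merge_pair(w, a, b)
    return merges

def merge_pair(symbols, a, b):
    """Replace every adjacent (a, b) in symbols with the merged token a+b."""
    i = 0
    while i < len(symbols) - 1:
        if symbols[i] == a and symbols[i + 1] == b:
            symbols[i:i + 2] = [a + b]
        else:
            i += 1

def apply_merges(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        merge_pair(symbols, a, b)
    return symbols

# Hypothetical tiny "domain corpus": frequent terms get merged into one token.
corpus = ["aneurysm"] * 50 + ["the", "law", "of", "torts"] * 5
merges = train_bpe(corpus, num_merges=7)
print(apply_merges("aneurysm", merges))  # → ['aneurysm']
```

Because “aneurysm” dominates this corpus, its character pairs win every merge round, and after seven merges the whole term is a single token—exactly the compression a domain-specific tokenizer buys you.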
📄 The Key Paper
Title: Tokenizer Choice For LLM Training: Negligible or Crucial?
Authors: Mehdi Ali et al., 2024
Link: Read the paper
Key insight: Generic tokenizers are deleterious (harmful) in specialized domains. They inflate sequence lengths, cost more, and hurt performance.
💡 Big Takeaway
Don’t let a generic tokenizer sabotage (undermine) your domain-specific model.
👉 Build a tokenizer that speaks the language of your domain.
That one change could save you 10x in compute—and possibly your job.
⚡️ Chapter 2:
Speculative Decoding for Lossless Inference Acceleration
🎯 The Interview Question
You’re in a system design interview. The product lead walks in and says:
“We need a 2x speedup on our LLaMA 3 70B model. But no lossy tricks like quantization or pruning. Can we still go faster?”
Tricky, right?
This question is all about understanding asymmetry (lack of equality between parts) inside the Transformer architecture.
🚫 The Common Wrong Answer
“Let’s optimize our batching strategy—something like vLLM’s PagedAttention, which handles memory super efficiently.”
That’s a decent guess... but fundamentally misguided (based on a faulty understanding).
Why?
Because batching improves throughput (how much you can do overall), but not latency (how fast one user sees a response).
So, for one user waiting on a single output—it still takes forever.
✅ How It Actually Works
Here’s the clever trick: Speculative Decoding 🧠💡
What’s the idea?
Use a small, fast model (called the “draft model”) to guess several tokens ahead. Then, let the big, slow model verify those guesses all at once.
This taps into an inherent asymmetry: generating tokens one at a time is slow and memory-bandwidth-bound, while verifying many tokens in a single forward pass is fast and compute-bound.
Think of it like: the junior dev (small model) writes a rough draft, and the senior engineer (large model) just reviews and approves chunks. Much faster!
🔍 Mechanism in Action
The draft model speculates, say, 5 tokens: “The patient has a...”
The target model checks those 5 all at once.
If they match: 🎉 done!
If not: it rolls back to the mismatch and continues generation as usual.
The process relies on two ingredients:
Rejection Sampling: each draft token is accepted or rejected based on the ratio of the target and draft models’ probabilities, which provably preserves the target model’s output distribution—in effect, the longest agreeing prefix is kept.
Parallel Verification: all speculated tokens are scored in a single forward pass of the large model.
This radically reduces the number of slow, memory-bound steps.
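Here’s a runnable sketch of the loop above, with deterministic toy functions standing in for the draft and target models. It implements the greedy variant (accept the longest matching prefix, then take the target’s correction); the list comprehension stands in for one batched forward pass, and real systems use rejection sampling over probabilities to stay lossless under sampled decoding.

```python
def greedy_decode(next_fn, prompt, n_tokens):
    """Reference: plain autoregressive decoding, one slow step per token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(next_fn(out))
    return out[len(prompt):]

def speculative_decode(target_next, draft_next, prompt, n_tokens, k=5):
    """Draft model proposes k tokens; target verifies them in one batch."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. The cheap draft model speculates k tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            ctx.append(draft_next(ctx))
            draft.append(ctx[-1])
        # 2. The target scores every position; in a real system this list
        #    comprehension is ONE parallel forward pass of the big model.
        verified = [target_next(out + draft[:i]) for i in range(k)]
        # 3. Accept the longest matching prefix...
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(draft[:n_ok])
        # 4. ...and on a mismatch, keep the target's own token instead.
        if n_ok < k:
            out.append(verified[n_ok])
    return out[len(prompt):len(prompt) + n_tokens]

# Toy stand-ins: a deterministic "big model" and a draft that usually agrees.
def target_next(ctx):
    return (sum(ctx) * 31 + 7) % 11

def draft_next(ctx):
    guess = target_next(ctx)
    return guess if len(ctx) % 4 != 0 else (guess + 1) % 11

# Lossless: the output is identical to decoding with the target alone.
print(speculative_decode(target_next, draft_next, [1, 2], 10)
      == greedy_decode(target_next, [1, 2], 10))  # → True
```

The speedup comes from step 2: verifying k tokens costs roughly one forward pass of the big model instead of k, so the more often the draft agrees, the fewer slow steps you pay for.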
📄 The Key Paper
Title: Fast Inference from Transformers via Speculative Decoding
Authors: Leviathan et al., 2022
Link: arXiv:2211.17192
Big Idea: You can achieve 2–3x inference speedups without altering your model’s output or doing any lossy approximation.
A truly elegant (pleasingly effective and simple) solution.
💡 Big Takeaway
If your goal is speed without sacrifice, speculative decoding comes close to a panacea (remedy for all difficulties)—as long as the draft model agrees with the target often enough to pay for its own overhead.
Lossless? ✅
Faster? ✅
No need to retrain your model? ✅✅✅