
Translation Quality Management: How to Measure and Improve Translation Output

A practical framework for measuring translation quality. Covers MQM error typology, automated quality checks, review workflows, and how to build quality into your translation pipeline instead of bolting it on afterward.

Thomas van Leer · Content Manager, Langbly · February 18, 2026 · 10 min read

Most translation quality problems don't start with bad translators. They start with no clear definition of what "good" means. Without measurement, quality discussions turn into opinions. The project manager thinks the translations are fine. The in-country reviewer says they're terrible. Neither can point to specific, repeatable criteria.

Translation quality management (TQM) gives you a framework to define, measure, and improve translation output systematically. It replaces gut feelings with data. This guide covers the practical side: what to measure, how to measure it, and how to build quality into your workflow instead of catching problems after the fact.

What "quality" actually means

Quality isn't a single score. A translation can be linguistically perfect but completely wrong for the audience. A technical manual translated into casual conversational tone is accurate but inappropriate. A marketing headline translated literally is technically correct but sounds robotic.

Quality has two dimensions that matter:

  • Adequacy: Does the translation convey the same meaning as the source? Are there omissions, additions, or distortions?
  • Fluency: Does the translation read naturally in the target language? Would a native speaker notice it was translated?

For most software localization, adequacy is non-negotiable. Fluency matters more for user-facing content and less for internal documentation. A perfectly fluent translation that changes the meaning is worse than an awkward translation that preserves it.

MQM: the industry standard framework

The Multidimensional Quality Metrics (MQM) framework is the closest thing the translation industry has to a universal quality standard. It was developed by DFKI as part of the EU-funded QTLaunchPad project, and it gives you a structured way to categorize errors.

MQM breaks errors into categories:

Accuracy errors

  • Mistranslation: The translation says something different from the source
  • Omission: Content from the source is missing in the translation
  • Addition: The translation includes information not present in the source
  • Untranslated: Source language text left in the target without translation

Fluency errors

  • Grammar: Verb agreement, word order, case errors
  • Spelling: Typos, incorrect accents, wrong characters
  • Punctuation: Wrong or missing punctuation marks
  • Register: Wrong level of formality (formal when informal is needed, or vice versa)

Terminology errors

  • Inconsistent terminology: Using different translations for the same term
  • Wrong terminology: Using an incorrect term for the domain

Style errors

  • Awkward phrasing: Grammatically correct but unnatural
  • Inconsistent style: Mixing formal and informal tone within the same document

Each error gets a severity level: critical, major, or minor. Critical errors change meaning or break functionality. Major errors are noticeable and confusing. Minor errors are cosmetic.

How to score translations

The simplest useful approach is error-based scoring. Review a sample of translated content, count errors by category and severity, and calculate a score.

A common formula:

Score = 100 - (weighted errors / word count * 100)

Where weights are: Critical = 5, Major = 3, Minor = 1

Example: A 1,000-word document with 2 critical errors, 3 major errors, and 5 minor errors:

Score = 100 - ((2*5 + 3*3 + 5*1) / 1000 * 100) = 100 - (24/1000 * 100) = 100 - 2.4 = 97.6
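The formula and the worked example above can be sketched as a small helper. This is an illustrative implementation, not a standard library; the weights and the function name are taken from or invented for this article.

```python
# Severity weights from the error-based scoring formula above.
WEIGHTS = {"critical": 5, "major": 3, "minor": 1}

def quality_score(errors: dict, word_count: int) -> float:
    """Return a 0-100 quality score from per-severity error counts."""
    weighted = sum(WEIGHTS[severity] * count for severity, count in errors.items())
    return 100 - (weighted / word_count * 100)

# The worked example: 1,000 words, 2 critical, 3 major, 5 minor errors.
score = quality_score({"critical": 2, "major": 3, "minor": 5}, 1000)
print(round(score, 2))  # 97.6
```

Keeping the weights in one place makes it easy to recalibrate them later, for example if your team decides critical errors should weigh more heavily than 5.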

Typical pass/fail thresholds:

  • 98+: Publication quality, suitable for marketing and legal content
  • 95-98: Good quality, suitable for most software UI and documentation
  • 90-95: Acceptable for internal content, needs improvement for user-facing
  • Below 90: Needs significant rework

These numbers only work if you sample consistently. Reviewing 10% of translated content is a common benchmark. For critical content (legal, medical, safety), review 100%.

Automated quality checks

Not everything needs human review. A surprising number of quality issues can be caught automatically before any reviewer sees the translation.

What automated checks can catch

  • Placeholder integrity: Did all {variables} and %s placeholders survive translation? This is the most common machine translation error and the easiest to detect.
  • Number consistency: Does the translation contain the same numbers as the source? A price of "$49.99" shouldn't become "$49.00" in translation.
  • Tag integrity: Are HTML/XML tags properly opened and closed? Does the tag count match?
  • Length ratio: Is the translation suspiciously short or long compared to the source? A German translation that's shorter than the English source is likely missing content.
  • Glossary compliance: Do approved terms appear in the translation? Did any blacklisted terms slip through?
  • Duplicate detection: Are identical source strings translated identically? Inconsistency here is a quality smell.
  • Encoding issues: Are special characters (accents, CJK characters, RTL markers) rendered correctly?
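The first few checks in the list above are simple enough to sketch in a few lines. This is a minimal illustration, assuming curly-brace and printf-style placeholders; the regexes and the length-ratio thresholds are assumptions you would tune for your own content.

```python
import re

# Matches {variable} and %s / %d style placeholders (an assumed convention).
PLACEHOLDER_RE = re.compile(r"\{[^}]*\}|%[sd]")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def qa_issues(source: str, target: str) -> list:
    """Return a list of structural QA issues found in a translation."""
    issues = []
    if sorted(PLACEHOLDER_RE.findall(source)) != sorted(PLACEHOLDER_RE.findall(target)):
        issues.append("placeholder mismatch")
    if sorted(NUMBER_RE.findall(source)) != sorted(NUMBER_RE.findall(target)):
        issues.append("number mismatch")
    ratio = len(target) / max(len(source), 1)
    if not 0.5 <= ratio <= 3.0:  # thresholds are a rough heuristic
        issues.append(f"suspicious length ratio {ratio:.2f}")
    return issues

print(qa_issues("Hello {name}, you owe $49.99", "Hallo {name}, du schuldest $49.00"))
# ['number mismatch']
```

The same pattern extends to tag counting and glossary lookups; each check is cheap enough to run on every string, every time.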

Tools for automated QA

Most translation management systems include built-in QA checks. Crowdin and Phrase have the most comprehensive suites. If you're building a custom pipeline, libraries like ICU MessageFormat validators can check placeholder integrity programmatically.

Run automated checks as a gate in your CI/CD pipeline. If a translated string file fails placeholder validation, block the deployment. This single check prevents the most common category of localization bugs.

Building a review workflow

Automated checks catch structural errors. Human review catches everything else: wrong meaning, unnatural phrasing, incorrect terminology, tone issues. The question is how to structure review efficiently.

Three-tier review model

This model works well for teams scaling from a few languages to many:

  1. Automated QA: Run checks on every translated string. Block errors automatically. No human time needed.
  2. Sample review: A bilingual reviewer checks 10-20% of translated content for each batch. Focuses on accuracy and fluency. Provides a quality score.
  3. In-context review: A native speaker reviews translations in the actual product UI. Catches issues that look fine in a spreadsheet but fail in context (truncation, wrong register, confusing navigation).

For most software teams, tier 1 runs on every commit. Tier 2 runs on each release. Tier 3 runs quarterly or when entering a new market.

Reviewer selection

The best reviewers are native speakers of the target language who also use your product. They catch both linguistic issues and product-specific problems. Professional linguistic reviewers miss product context. Product managers who speak the language miss linguistic subtleties. You need both perspectives, ideally in one person.

If you can't find that person, pair a linguistic reviewer with a product-savvy reviewer. The linguistic reviewer checks accuracy and fluency. The product reviewer checks that translated strings make sense in the product context.

Quality metrics to track over time

One-time quality checks tell you where you are. Tracking metrics over time tells you if you're getting better.

Useful metrics

  • Error rate per 1,000 words: Track by language, content type, and source (machine translation vs. human). This is your primary quality indicator.
  • Post-edit distance: How much do reviewers change machine-translated output? Measured as edit distance (Levenshtein) or percentage of words changed. Lower is better.
  • Time to review: How long does a reviewer spend per 1,000 words? Decreasing review time with stable quality means the quality of the translations arriving for review is improving.
  • Localization bug rate: How many production bugs are localization-related? Track per release and per language.
  • Glossary hit rate: What percentage of glossary terms are translated correctly on the first pass? Increasing hit rate means your translation engine and translators are learning your terminology.
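Post-edit distance from the list above can be approximated with the standard library. This is a rough character-level sketch using `difflib`; production tooling often uses word-level Levenshtein distance instead, and the example sentences are invented.

```python
import difflib

def post_edit_distance(mt_output: str, reviewed: str) -> float:
    """Return 0.0 (untouched) .. 1.0 (fully rewritten)."""
    return 1.0 - difflib.SequenceMatcher(None, mt_output, reviewed).ratio()

mt = "The user can to save the document"
human = "The user can save the document"
print(f"{post_edit_distance(mt, human):.2f}")
```

Averaged per language and per release, this single number gives you an early signal of which language pairs need more (or less) human attention.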

Feedback loops

Quality data is useless if it doesn't feed back into the process. When reviewers correct errors, those corrections should:

  • Update the translation memory so the same error doesn't repeat
  • Update the glossary if terminology was wrong
  • Inform the translation engine (through feedback or fine-tuning) to improve future output
  • Update your style guide if the error reveals a missing guideline

Machine translation quality

Machine translation quality has improved dramatically. Context-aware translation engines produce output that's close to human quality for straightforward content. But quality varies by language pair and content type.

General patterns:

  • High-resource language pairs (EN-DE, EN-FR, EN-ES, EN-JA): Machine translation quality is high. Review focuses on terminology and style, not accuracy.
  • Medium-resource pairs (EN-PL, EN-TH, EN-VI): Good but inconsistent. More accuracy errors, especially with domain-specific content.
  • Low-resource pairs (EN-UR, EN-SW, EN-MY): Significant quality gaps. Heavy human review needed.

The practical implication: adjust your review intensity by language pair. German translations might need 10% sample review. Burmese translations might need 50% or more.

When comparing translation engines, test with your actual content, not generic benchmarks. A translation API that scores well on news articles might struggle with your product's technical vocabulary. Langbly's approach of using context-aware translation tends to produce better results for software content because it understands the domain, not just the words. And at $1.99-$3.80 per million characters, running quality comparisons is cheap. Follow the quickstart guide to test it against your current engine with a representative sample.

Practical checklist

If you're starting from zero, here's the order to build quality management:

  1. Week 1: Set up automated QA checks (placeholder validation, number consistency, tag integrity). Block deployments that fail.
  2. Week 2: Create a terminology glossary with approved translations for your top 50-100 terms.
  3. Week 3: Establish a sample review process. Pick one reviewer per language. Define pass/fail criteria.
  4. Month 2: Start tracking error rates. Build a simple dashboard showing errors per language per release.
  5. Month 3: Implement feedback loops. Reviewer corrections update TM and glossary automatically.
  6. Ongoing: Quarterly in-context review for each major language.

This progression takes you from "we hope translations are okay" to "we know translations meet our standards" in about three months.


Higher quality, lower cost

Langbly produces context-aware translations that need less human editing. Start with 500K free characters and see the difference.