ParliaBench — UK Parliamentary Speech Generation
Inference demo for five LLMs fine-tuned on ParlaMint-GB with QLoRA
Koniaris, Tsipi & Tsanakas · arXiv:2511.08247 · NTUA 2025
About ParliaBench
ParliaBench is a benchmark and evaluation framework for LLM-generated UK parliamentary speeches,
combining a curated dataset, multi-dimensional evaluation metrics, and five domain-specific fine-tuned models.
Paper: arXiv:2511.08247
Dataset
Constructed from the UK subset of the ParlaMint corpus, 2015–2022. Four-step pipeline: XML parsing → metadata alignment → content filtering → EuroVoc thematic classification.
Corpus Statistics
| Statistic | Value |
|---|---|
| Total speeches | 447,778 |
| Unique speakers | 1,901 |
| Political affiliations | 11 |
| Total words | ~99.94 million |
| Mean speech length | 223 words |
| Median speech length | 99 words |
| P10 — min threshold | 43 words |
| P90 — max threshold | 635 words |
| EuroVoc topic domains | 21 |
| Temporal coverage | 2015–2022 |
Political Parties in Dataset
| Party | Orientation | Speeches | Speakers | Share |
|---|---|---|---|---|
| Conservative | Centre-right | 263,513 | 792 | 58.9% |
| Labour | Centre-left | 108,831 | 592 | 24.3% |
| Scottish National Party | Centre-left | 23,562 | 67 | 5.3% |
| Liberal Democrats | Centre / centre-left | 23,517 | 168 | 5.3% |
| Crossbench | Non-partisan | 11,878 | 215 | 2.7% |
| Democratic Unionist Party | Right | 6,610 | 15 | 1.5% |
| Independent | Non-partisan | 2,783 | 45 | 0.6% |
| Plaid Cymru | Centre-left to left | 2,229 | 7 | 0.5% |
| Green Party | Left | 1,992 | 3 | 0.4% |
| Non-Affiliated | Non-partisan | 1,713 | 60 | 0.4% |
| Bishops | Non-partisan | 1,150 | 41 | 0.3% |
Bishops, Crossbench, and Non-Affiliated are formal parliamentary affiliations. Affiliations with fewer than 1,000 speeches were excluded.
Models
Five LLMs fine-tuned with QLoRA via the Unsloth framework:
| Model | Base (Unsloth 4-bit) | HF Repository |
|---|---|---|
| Llama-3.1-8B | unsloth/Meta-Llama-3.1-8B-bnb-4bit | argyrotsipi/parliabench-unsloth-llama-3.1-8b |
| Gemma-2-9B | unsloth/gemma-2-9b-bnb-4bit | argyrotsipi/parliabench-unsloth-gemma-2-9b |
| Mistral-7B | unsloth/mistral-7b-v0.3-bnb-4bit | argyrotsipi/parliabench-unsloth-mistral-7b-v0.3 |
| Qwen2-7B | unsloth/Qwen2-7B-bnb-4bit | argyrotsipi/parliabench-unsloth-qwen-2-7b |
| Yi-1.5-6B | unsloth/Yi-1.5-6B-bnb-4bit | argyrotsipi/parliabench-unsloth-yi-1.5-6b |
QLoRA Configuration
| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 16 |
| Target modules | q, k, v, o, gate, up, down |
| Dropout | 0 |
| Batch size | 64 |
| Learning rate | 2e-4 |
| Optimizer | AdamW (fused) |
| Max steps | 11,194 (~2 epochs) |
| Warmup steps | 336 |
| Max sequence length | 1,024 |
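For reference, the table above maps onto configuration objects roughly as follows. This is a sketch using peft/transformers-style argument names (the names are assumptions for illustration; the authors' actual Unsloth training script is not reproduced here, and the target modules are written out in the common `*_proj` form):

```python
# Hedged sketch: QLoRA hyperparameters from the table, expressed as
# peft/transformers-style keyword arguments (names are illustrative).
lora_config = {
    "r": 16,                      # LoRA rank
    "lora_alpha": 16,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "lora_dropout": 0.0,
}

training_config = {
    "per_device_train_batch_size": 64,
    "learning_rate": 2e-4,
    "optim": "adamw_torch_fused",
    "max_steps": 11_194,          # ~2 epochs over the training split
    "warmup_steps": 336,
    "max_seq_length": 1_024,
}
```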
Generation Parameters
| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-p | 0.85 |
| Repetition penalty | 1.2 |
| Max new tokens | 850 |
| Min words (P10) | 43 |
| Max words (P90) | 635 |
| Sampling | Nucleus (top-p) |
| Max regen attempts | 3 |
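The sampling parameters and regeneration policy above can be sketched as a small retry loop. Here `generate` and `is_valid` are hypothetical stand-ins for the model's nucleus-sampling call and the 9-step validation pipeline:

```python
# Sampling parameters from the table above.
GEN_PARAMS = {"temperature": 0.7, "top_p": 0.85,
              "repetition_penalty": 1.2, "max_new_tokens": 850}

def generate_with_retries(generate, is_valid, max_attempts=3):
    """Call `generate` until `is_valid` accepts the output.

    `generate` stands in for the model's nucleus-sampling call (using
    GEN_PARAMS) and `is_valid` for the validation pipeline; returns
    None if all attempts fail.
    """
    for _ in range(max_attempts):
        speech = generate()
        if is_valid(speech):
            return speech
    return None  # caller decides how to handle persistent failures
```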
Prompt Architecture
System prompt — training (no word count):
You are a seasoned UK parliamentary member. Use proper British parliamentary language appropriate for the specified House. The speech should reflect the political orientation and typical positions of the specified party on the given topic.
System prompt — generation (includes word count target):
You are a seasoned UK parliamentary member. Generate a coherent speech of
{min_words}-{max_words} words in standard English (no Unicode artifacts, no special
characters). Use proper British parliamentary language appropriate for the specified
House. The speech should reflect the political orientation and typical positions of the
specified party on the given topic.
Context string (pipe-separated, injected as user turn):
EUROVOC TOPIC: {topic} | SECTION: {section} | PARTY: {party} | POLITICAL ORIENTATION: {orientation} | HOUSE: {house}
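Assembling the context string is straightforward; a minimal helper (an illustration, not the authors' code) might look like:

```python
def build_context(topic, section, party, orientation, house):
    """Assemble the pipe-separated context string injected as the user turn."""
    return (f"EUROVOC TOPIC: {topic} | SECTION: {section} | PARTY: {party} | "
            f"POLITICAL ORIENTATION: {orientation} | HOUSE: {house}")
```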
Model-specific chat templates
Mistral
<s>[INST] {SYSTEM_PROMPT}
Context: {context}
Instruction: {instruction} [/INST] {response}</s>
Llama 3.1
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{SYSTEM_PROMPT}<|eot_id|><|start_header_id|>user<|end_header_id|>
Context: {context}
Instruction: {instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{response}<|eot_id|>
Gemma 2
<bos><start_of_turn>user
{SYSTEM_PROMPT}
Context: {context}
Instruction: {instruction}<end_of_turn>
<start_of_turn>model
{response}<end_of_turn>
Qwen2 & Yi-1.5 (ChatML)
<|im_start|>system
{SYSTEM_PROMPT}<|im_end|>
<|im_start|>user
Context: {context}
Instruction: {instruction}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
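As an illustration, the ChatML layout used by Qwen2 and Yi-1.5 can be rendered by a small formatter. This is a sketch; in practice the tokenizer's built-in chat template would normally be used:

```python
def format_chatml(system_prompt, context, instruction):
    """Render the ChatML prompt up to the assistant header; the model's
    generation is appended after the trailing newline."""
    return (f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
            f"<|im_start|>user\nContext: {context}\n"
            f"Instruction: {instruction}<|im_end|>\n"
            f"<|im_start|>assistant\n")
```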
Speech Validation
Every generated speech passes a 9-step validation pipeline; invalid speeches are automatically regenerated up to 3 times. Baseline models exhibited higher failure rates, suggesting that fine-tuning directly improved output quality.
| # | Check | Detail |
|---|---|---|
| 1 | Template leakage | 27 markers: role tokens (user, assistant), context labels (Context:, Instruction:), special tokens ([INST], im_start, etc.) |
| 2 | Unicode corruption | 14 corruption patterns + 11 forbidden script ranges (CJK, Cyrillic, Arabic, Thai, technical symbols) |
| 3 | Language detection | spaCy en_core_web_sm with 85% confidence threshold on texts ≥ 30 characters |
| 4 | Repetition | Same word > 3× consecutive; sequences of 3–7 words repeated > 3×; > 5 ordinal counting words |
| 5 | Semantic relevance | Cosine similarity < 0.08 via all-MiniLM-L6-v2 against “UK parliamentary debate about {section} on {topic}” |
| 6 | Length | Valid word count: 43–635 words (P10–P90 of training corpus) |
| 7 | Concatenation | Rejects if ≥ 4 parliamentary opening phrases (My Lords, Mr Speaker …) suggesting multiple speeches joined |
| 8 | Corrupted endings | Nonsensical suffixes (e.g. ▍▍▍, });) |
| 9 | Refusal patterns | AI refusal phrases (I cannot generate, I'm sorry but I cannot …) |
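A few of the checks above can be sketched in plain Python. These are simplified illustrations with truncated marker lists, not the full 27-marker / 14-pattern implementation:

```python
MIN_WORDS, MAX_WORDS = 43, 635            # P10/P90 of the training corpus

# Illustrative subsets only; the real pipeline uses 27 leakage markers
# and more refusal patterns.
LEAK_MARKERS = ["[INST]", "im_start", "Context:", "Instruction:"]
REFUSALS = ["i cannot generate", "i'm sorry but i cannot"]

def check_length(speech):
    return MIN_WORDS <= len(speech.split()) <= MAX_WORDS

def check_no_leakage(speech):
    return not any(marker in speech for marker in LEAK_MARKERS)

def check_no_refusal(speech):
    return not any(phrase in speech.lower() for phrase in REFUSALS)

def check_no_repetition(speech, max_run=3):
    """Reject speeches where one word repeats more than max_run times in a row."""
    words = speech.lower().split()
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return False
    return True
```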
Results
27,560 fully evaluated speeches — baseline (B) vs fine-tuned (F) across all 14 metrics.
↑/↓ mark statistically significant increases or decreases relative to the baseline. Lower is better for PPL and Self-BLEU; higher is better for all other metrics. Columns group into linguistic quality (PPL–J_Conc), semantic coherence (GRUEN–J_Rel), and political authenticity (PSA–J_Qual).
| Model | PPL↓ | Dist-N↑ | Self-BLEU↓ | J_Coh↑ | J_Conc↑ | GRUEN↑ | BERTScore↑ | MoverScore↑ | J_Rel↑ | PSA↑ | Party Align↑ | J_Auth↑ | J_PolApp↑ | J_Qual↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B (B) | 60.854 | 0.988 | 0.006 | 7.041 | 5.935 | 0.539 | 0.803 | 0.505 | 5.465 | 0.399 | 0.504 | 4.403 | 6.177 | 4.791 |
| Llama 3.1 8B (F) | 31.724 ↓ | 0.974 ↓ | 0.018 | 7.915 ↑ | 7.129 ↑ | 0.508 ↓ | 0.820 ↑ | 0.511 ↑ | 6.139 ↑ | 0.487 ↑ | 0.576 ↑ | 6.106 ↑ | 7.277 ↑ | 5.399 ↑ |
| Gemma 2 9B (B) | 89.784 | 0.992 | 0.008 | 7.788 | 4.784 | 0.526 | 0.804 | 0.508 | 5.782 | 0.444 | 0.543 | 3.837 | 6.498 | 4.442 |
| Gemma 2 9B (F) | 101.578 ↑ | 0.990 | 0.010 | 7.507 ↓ | 5.006 | 0.501 | 0.804 | 0.510 ↑ | 5.529 | 0.498 ↑ | 0.590 | 4.209 ↑ | 7.293 ↑ | 4.950 ↑ |
| Mistral 7B v0.3 (B) | 31.280 | 0.966 | 0.008 | 6.598 | 6.899 | 0.555 | 0.810 | 0.505 | 5.418 | 0.418 | 0.521 | 4.237 | 5.617 | 4.179 |
| Mistral 7B v0.3 (F) | 29.562 ↓ | 0.972 ↑ | 0.016 | 7.961 ↑ | 8.962 ↑ | 0.552 | 0.825 ↑ | 0.508 | 5.681 ↑ | 0.437 ↑ | 0.507 ↓ | 3.983 ↓ | 6.382 ↑ | 3.727 ↓ |
| Qwen2 7B (B) | 44.927 | 0.981 | 0.020 | 7.911 | 5.928 | 0.488 | 0.803 | 0.508 | 6.904 | 0.444 | 0.560 | 6.565 | 7.291 | 6.348 |
| Qwen2 7B (F) | 36.090 ↓ | 0.982 | 0.017 ↓ | 8.060 ↑ | 7.625 ↑ | 0.539 ↑ | 0.821 ↑ | 0.512 ↑ | 6.009 ↓ | 0.488 ↑ | 0.572 | 5.731 ↓ | 7.138 | 5.014 ↓ |
| Yi 6B (B) | 82.100 | 0.990 | 0.006 | 6.741 | 4.303 | 0.563 | 0.799 | 0.505 | 4.490 | 0.343 | 0.423 | 2.981 | 5.385 | 3.083 |
| Yi 6B (F) | 42.893 ↓ | 0.987 | 0.016 | 8.043 ↑ | 6.856 ↑ | 0.537 | 0.817 ↑ | 0.511 ↑ | 5.984 ↑ | 0.493 ↑ | 0.582 ↑ | 6.102 ↑ | 7.326 ↑ | 5.392 ↑ |
Arrows mark changes significant at p < 0.05 after Bonferroni correction. PSA and Party Align are on a 0–1 scale; J scores on a 1–10 scale; GRUEN, BERTScore, and MoverScore on 0–1; PPL is unbounded.
Political Spectrum & Party Alignment
These are the two novel embedding-based metrics introduced by ParliaBench to measure political authenticity, a dimension not captured by conventional NLP metrics.
Political Spectrum Alignment (PSA)
Measures how well a generated speech's ideological positioning matches the intended party orientation on the left–right spectrum.
1. Build centroid embeddings for each orientation (Far-left → Far-right) from real ParlaMint-GB speeches
2. Embed the generated speech with all-mpnet-base-v2
3. Find the closest orientation centroid via cosine similarity
4. Score = sim(speech, closest_centroid) × max(0, 1 − Δφ/12), where Δφ = |expected_orientation − closest_orientation|
5. Range 0→1; perfect alignment approaches 1
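Assuming embeddings are plain vectors and orientations are indexed on the 13-point scale, the PSA computation can be sketched as:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def psa_score(speech_emb, centroids, expected_idx):
    """Political Spectrum Alignment, following the steps above.

    `centroids` maps an orientation index on the 13-point left-right
    scale to that orientation's centroid embedding. The similarity to
    the closest centroid is down-weighted by the ideological distance
    between the closest and the expected orientation.
    """
    closest_idx, sim = max(
        ((idx, cosine(speech_emb, c)) for idx, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    dphi = abs(expected_idx - closest_idx)
    return sim * max(0.0, 1.0 - dphi / 12)
```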
Party Alignment
Measures how closely a generated speech matches the linguistic style and rhetoric of the specified party, independent of spectrum position.
1. Build a centroid embedding per party from all real speeches in that party's training data
2. Embed the generated speech with all-mpnet-base-v2
3. Score = cosine_similarity(speech, expected_party_centroid)
4. Range 0→1; captures party-specific vocabulary, rhetorical style, and framing beyond ideological position alone
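Party Alignment reduces to a single cosine similarity against the expected party's centroid (again a sketch over plain vectors; embeddings are assumed to come from all-mpnet-base-v2):

```python
import math

def party_alignment(speech_emb, party_centroids, expected_party):
    """Cosine similarity between the speech embedding and the expected
    party's centroid embedding."""
    c = party_centroids[expected_party]
    dot = sum(a * b for a, b in zip(speech_emb, c))
    norm = (math.sqrt(sum(a * a for a in speech_emb))
            * math.sqrt(sum(b * b for b in c)))
    return dot / norm
```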
Evaluation Framework
27,560 fully evaluated speeches across three assessment dimensions:
Linguistic quality
- Perplexity (PPL) — text naturalness via GPT-2 (↓ better)
- Distinct-N — lexical diversity via unique n-gram ratios (↑ better)
- Self-BLEU — intra-model diversity; lower = more varied outputs (↓ better)
- J_Coh / J_Conc — LLM-as-a-Judge coherence and conciseness (1–10 scale)
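As an example of the diversity metrics, one common way to compute Distinct-N pools n-grams over all generated speeches (the paper's exact variant may differ):

```python
def distinct_n(texts, n=2):
    """Lexical diversity: unique n-grams divided by total n-grams,
    pooled across a set of generated speeches."""
    total, unique = 0, set()
    for text in texts:
        words = text.split()
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```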
Semantic coherence
- GRUEN — grammaticality and semantic coherence (↑ better)
- BERTScore — semantic similarity via RoBERTa-large F1 (↑ better)
- MoverScore — Earth Mover's Distance over contextual embeddings (↑ better)
- J_Rel — LLM-as-a-Judge relevance to prompt (1–10 scale)
Political authenticity (novel metrics)
- Political Spectrum Alignment (PSA) — cosine similarity to orientation centroids weighted by ideological distance on a 13-point left–right scale
- Party Alignment — cosine similarity to party-specific embedding centroids
- J_Auth / J_PolApp / J_Qual — LLM-as-a-Judge authenticity, political appropriateness, overall quality (1–10 scale)
LLM judge: FlowJudge-v0.1 (3.8B, Phi-3.5-mini architecture) — architecturally independent from all evaluated models.
Citation
@article{koniaris2025parliabench,
title = {ParliaBench: An Evaluation and Benchmarking Framework for
LLM-Generated Parliamentary Speech},
author = {Koniaris, Marios and Tsipi, Argyro and Tsanakas, Panayiotis},
journal = {arXiv preprint arXiv:2511.08247},
year = {2025},
url = {https://arxiv.org/abs/2511.08247}
}
National Technical University of Athens · School of Electrical and Computer Engineering
Generated Speech Examples
Representative outputs from the ParliaBench evaluation set — one per model, comparing baseline and fine-tuned performance.
Example metadata: Model Gemma-2-9B (baseline) · Party Scottish National Party (Centre-left) · Topic ENERGY · Section Domestic Renewable Energy · House of Commons · 56 words
Configuration
Fine-tuned = QLoRA adapter on Unsloth base; Baseline = raw 4-bit base model
The prompt panel shows the exact input fed to the model (including chat template tokens) — useful for reproducibility.
ParliaBench Demo · NTUA 2025 · argyrotsipi on HF · Train dataset · Generated dataset