ParliaBench — UK Parliamentary Speech Generation

Inference demo for five LLMs fine-tuned on ParlaMint-GB with QLoRA
Koniaris, Tsipi & Tsanakas · arXiv:2511.08247 · NTUA 2025

About ParliaBench

ParliaBench is a benchmark and evaluation framework for LLM-generated UK parliamentary speeches, combining a curated dataset, multi-dimensional evaluation metrics, and five domain-specific fine-tuned models.
Paper: arXiv:2511.08247

Dataset

Constructed from the UK subset of the ParlaMint corpus, 2015–2022. Four-step pipeline: XML parsing → metadata alignment → content filtering → EuroVoc thematic classification.

Corpus Statistics

| Statistic | Value |
|---|---|
| Total speeches | 447,778 |
| Unique speakers | 1,901 |
| Political affiliations | 11 |
| Total words | ~99.94 million |
| Mean speech length | 223 words |
| Median speech length | 99 words |
| P10 (min threshold) | 43 words |
| P90 (max threshold) | 635 words |
| EuroVoc topic domains | 21 |
| Temporal coverage | 2015–2022 |

Political Parties in Dataset

| Party | Orientation | Speeches | Speakers | Share |
|---|---|---|---|---|
| Conservative | Centre-right | 263,513 | 792 | 58.9% |
| Labour | Centre-left | 108,831 | 592 | 24.3% |
| Scottish National Party | Centre-left | 23,562 | 67 | 5.3% |
| Liberal Democrats | Centre / centre-left | 23,517 | 168 | 5.3% |
| Crossbench | Non-partisan | 11,878 | 215 | 2.7% |
| Democratic Unionist Party | Right | 6,610 | 15 | 1.5% |
| Independent | Non-partisan | 2,783 | 45 | 0.6% |
| Plaid Cymru | Centre-left to left | 2,229 | 7 | 0.5% |
| Green Party | Left | 1,992 | 3 | 0.4% |
| Non-Affiliated | Non-partisan | 1,713 | 60 | 0.4% |
| Bishops | Non-partisan | 1,150 | 41 | 0.3% |

Bishops, Crossbench, and Non-Affiliated are formal parliamentary affiliations rather than political parties. Inclusion threshold: at least 1,000 speeches per affiliation.

Models

Five LLMs fine-tuned with QLoRA via the Unsloth framework:

| Model | Base (Unsloth 4-bit) | HF Repository |
|---|---|---|
| Llama-3.1-8B | unsloth/Meta-Llama-3.1-8B-bnb-4bit | argyrotsipi/parliabench-unsloth-llama-3.1-8b |
| Gemma-2-9B | unsloth/gemma-2-9b-bnb-4bit | argyrotsipi/parliabench-unsloth-gemma-2-9b |
| Mistral-7B | unsloth/mistral-7b-v0.3-bnb-4bit | argyrotsipi/parliabench-unsloth-mistral-7b-v0.3 |
| Qwen2-7B | unsloth/Qwen2-7B-bnb-4bit | argyrotsipi/parliabench-unsloth-qwen-2-7b |
| Yi-1.5-6B | unsloth/Yi-1.5-6B-bnb-4bit | argyrotsipi/parliabench-unsloth-yi-1.5-6b |

QLoRA Configuration

| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 16 |
| Target modules | q, k, v, o, gate, up, down |
| Dropout | 0 |
| Batch size | 64 |
| Learning rate | 2e-4 |
| Optimizer | AdamW (fused) |
| Max steps | 11,194 (~2 epochs) |
| Warmup steps | 336 |
| Max sequence length | 1,024 |
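The hyperparameters above can be collected into plain config dicts, shown here as an illustrative sketch: the key names follow common `peft.LoraConfig` / `transformers.TrainingArguments` conventions and are assumptions, not the authors' exact training script.

```python
# Hypothetical sketch of the QLoRA setup, mirroring the table above.
# Key names follow peft.LoraConfig / transformers.TrainingArguments
# conventions; the paper's actual script may differ.

lora_config = {
    "r": 16,                     # LoRA rank
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "target_modules": [          # attention + MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

training_config = {
    "per_device_train_batch_size": 64,
    "learning_rate": 2e-4,
    "optim": "adamw_torch_fused",
    "max_steps": 11_194,         # ~2 epochs over the corpus
    "warmup_steps": 336,
    "max_seq_length": 1_024,
}
```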

Generation Parameters

| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-p | 0.85 |
| Repetition penalty | 1.2 |
| Max new tokens | 850 |
| Min words (P10) | 43 |
| Max words (P90) | 635 |
| Sampling | Nucleus (top-p) |
| Max regen attempts | 3 |
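The parameters above imply a sample-validate-regenerate loop. A minimal sketch, assuming a `generate_speech` callable standing in for the real model call and a word-count check standing in for the full validator:

```python
# Sketch of the generate -> validate -> regenerate loop implied by the
# table above. generate_speech() is a hypothetical stand-in for the
# model call; is_valid() stands in for the full 9-step validator.

GEN_KWARGS = {
    "do_sample": True,           # nucleus (top-p) sampling
    "temperature": 0.7,
    "top_p": 0.85,
    "repetition_penalty": 1.2,
    "max_new_tokens": 850,
}
MIN_WORDS, MAX_WORDS, MAX_ATTEMPTS = 43, 635, 3

def length_ok(text: str) -> bool:
    """Length check: word count inside the P10-P90 band of the corpus."""
    return MIN_WORDS <= len(text.split()) <= MAX_WORDS

def generate_with_retries(generate_speech, is_valid=length_ok):
    """Return the first valid speech, or None after MAX_ATTEMPTS tries."""
    for _ in range(MAX_ATTEMPTS):
        speech = generate_speech(**GEN_KWARGS)
        if is_valid(speech):
            return speech
    return None

# Toy usage: a fake generator that always emits a 50-word speech.
fake = lambda **kw: " ".join(["word"] * 50)
result = generate_with_retries(fake)
```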

Prompt Architecture

System prompt — training (no word count):

You are a seasoned UK parliamentary member. Use proper British parliamentary language
appropriate for the specified House. The speech should reflect the political orientation
and typical positions of the specified party on the given topic.

System prompt — generation (includes word count target):

You are a seasoned UK parliamentary member. Generate a coherent speech of
{min_words}-{max_words} words in standard English (no Unicode artifacts, no special
characters). Use proper British parliamentary language appropriate for the specified
House. The speech should reflect the political orientation and typical positions of the
specified party on the given topic.

Context string (pipe-separated, injected as user turn):

EUROVOC TOPIC: {topic} | SECTION: {section} | PARTY: {party} | POLITICAL ORIENTATION: {orientation} | HOUSE: {house}
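The context string can be assembled with a simple f-string; the field names come from the template above, while the helper name itself is illustrative.

```python
def build_context(topic, section, party, orientation, house):
    """Assemble the pipe-separated context string injected as the user turn."""
    return (
        f"EUROVOC TOPIC: {topic} | SECTION: {section} | PARTY: {party} | "
        f"POLITICAL ORIENTATION: {orientation} | HOUSE: {house}"
    )

# Example values are hypothetical; any EuroVoc topic / party / house works.
ctx = build_context("environment", "Environment Bill", "Labour",
                    "Centre-left", "House of Commons")
```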

Model-specific chat templates

Mistral

<s>[INST] {SYSTEM_PROMPT}
Context: {context}
Instruction: {instruction} [/INST] {response}</s>

Llama 3.1

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{SYSTEM_PROMPT}<|eot_id|><|start_header_id|>user<|end_header_id|>
Context: {context}
Instruction: {instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{response}<|eot_id|>

Gemma 2

<bos><start_of_turn>user
{SYSTEM_PROMPT}
Context: {context}
Instruction: {instruction}<end_of_turn>
<start_of_turn>model
{response}<end_of_turn>

Qwen2 & Yi-1.5 (ChatML)

<|im_start|>system
{SYSTEM_PROMPT}<|im_end|>
<|im_start|>user
Context: {context}
Instruction: {instruction}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
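For the ChatML models (Qwen2, Yi-1.5), the inference prompt above can be rendered as a plain string, leaving the assistant turn open for generation. In practice the tokenizer's `apply_chat_template` does the equivalent; this manual sketch just makes the structure explicit.

```python
def chatml_prompt(system_prompt, context, instruction):
    """Render the ChatML inference prompt with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\nContext: {context}\n"
        f"Instruction: {instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = chatml_prompt("You are a seasoned UK parliamentary member.",
                       "EUROVOC TOPIC: environment | ...",
                       "Generate the speech.")
```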

Speech Validation

Every generated speech passes through a 9-step validation pipeline; invalid speeches are automatically regenerated up to 3 times. Baseline models exhibited higher failure rates, indicating that fine-tuning directly improved output validity.

| # | Check | Detail |
|---|---|---|
| 1 | Template leakage | 27 markers: role tokens (user, assistant), context labels (Context:, Instruction:), special tokens ([INST], im_start, etc.) |
| 2 | Unicode corruption | 14 corruption patterns + 11 forbidden script ranges (CJK, Cyrillic, Arabic, Thai, technical symbols) |
| 3 | Language detection | spaCy en_core_web_sm with 85% confidence threshold on texts ≥ 30 characters |
| 4 | Repetition | Same word > 3× consecutively; 3–7-word sequences repeated > 3×; > 5 ordinal counting words |
| 5 | Semantic relevance | Reject if cosine similarity < 0.08 (all-MiniLM-L6-v2) against "UK parliamentary debate about {section} on {topic}" |
| 6 | Length | Valid word count: 43–635 words (P10–P90 of training corpus) |
| 7 | Concatenation | Reject if ≥ 4 parliamentary opening phrases (My Lords, Mr Speaker, …), suggesting multiple speeches joined |
| 8 | Corrupted endings | Nonsensical suffixes (e.g. ▍▍▍, });) |
| 9 | Refusal patterns | AI refusal phrases (I cannot generate, I'm sorry but I cannot, …) |
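One clause of check #4 (a single word repeated more than 3 times consecutively) can be sketched as a pure-Python pass over the tokenised speech. The threshold comes from the table; the function itself is an assumption about how the check is implemented.

```python
import re

def has_excessive_repetition(text: str, max_run: int = 3) -> bool:
    """Flag a speech if any single word appears more than max_run
    times in a row (one clause of validation check #4)."""
    words = re.findall(r"[a-z']+", text.lower())
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False
```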

Results

27,560 fully evaluated speeches — baseline (B) vs fine-tuned (F) across all 14 metrics.

In the table, ↑/↓ beside a fine-tuned value marks a statistically significant change; whether that change is an improvement or a regression depends on the metric's direction. Better is: PPL ↓ · Self-BLEU ↓ · all others ↑.

Columns 1–5 measure linguistic quality, 6–9 semantic coherence, 10–14 political authenticity.

| Model | PPL↓ | Dist-N↑ | Self-BLEU↓ | J_Coh↑ | J_Conc↑ | GRUEN↑ | BERTScore↑ | MoverScore↑ | J_Rel↑ | PSA↑ | Party Align↑ | J_Auth↑ | J_PolApp↑ | J_Qual↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B (B) | 60.854 | 0.988 | 0.006 | 7.041 | 5.935 | 0.539 | 0.803 | 0.505 | 5.465 | 0.399 | 0.504 | 4.403 | 6.177 | 4.791 |
| Llama 3.1 8B (F) | 31.724 ↓ | 0.974 ↓ | 0.018 | 7.915 ↑ | 7.129 ↑ | 0.508 ↓ | 0.820 ↑ | 0.511 ↑ | 6.139 ↑ | 0.487 ↑ | 0.576 ↑ | 6.106 ↑ | 7.277 ↑ | 5.399 ↑ |
| Gemma 2 9B (B) | 89.784 | 0.992 | 0.008 | 7.788 | 4.784 | 0.526 | 0.804 | 0.508 | 5.782 | 0.444 | 0.543 | 3.837 | 6.498 | 4.442 |
| Gemma 2 9B (F) | 101.578 ↑ | 0.990 | 0.010 | 7.507 ↓ | 5.006 | 0.501 | 0.804 | 0.510 ↑ | 5.529 | 0.498 ↑ | 0.590 | 4.209 ↑ | 7.293 ↑ | 4.950 ↑ |
| Mistral 7B v0.3 (B) | 31.280 | 0.966 | 0.008 | 6.598 | 6.899 | 0.555 | 0.810 | 0.505 | 5.418 | 0.418 | 0.521 | 4.237 | 5.617 | 4.179 |
| Mistral 7B v0.3 (F) | 29.562 ↓ | 0.972 ↑ | 0.016 | 7.961 ↑ | 8.962 ↑ | 0.552 | 0.825 ↑ | 0.508 | 5.681 ↑ | 0.437 ↑ | 0.507 ↓ | 3.983 ↓ | 6.382 ↑ | 3.727 ↓ |
| Qwen2 7B (B) | 44.927 | 0.981 | 0.020 | 7.911 | 5.928 | 0.488 | 0.803 | 0.508 | 6.904 | 0.444 | 0.560 | 6.565 | 7.291 | 6.348 |
| Qwen2 7B (F) | 36.090 ↓ | 0.982 | 0.017 ↓ | 8.060 ↑ | 7.625 ↑ | 0.539 ↑ | 0.821 ↑ | 0.512 ↑ | 6.009 ↓ | 0.488 ↑ | 0.572 | 5.731 ↓ | 7.138 | 5.014 ↓ |
| Yi 6B (B) | 82.100 | 0.990 | 0.006 | 6.741 | 4.303 | 0.563 | 0.799 | 0.505 | 4.490 | 0.343 | 0.423 | 2.981 | 5.385 | 3.083 |
| Yi 6B (F) | 42.893 ↓ | 0.987 | 0.016 | 8.043 ↑ | 6.856 ↑ | 0.537 | 0.817 ↑ | 0.511 ↑ | 5.984 ↑ | 0.493 ↑ | 0.582 ↑ | 6.102 ↑ | 7.326 ↑ | 5.392 ↑ |

↑/↓ p < 0.05 after Bonferroni correction. PSA and Party Align on 0–1 scale; J scores on 0–10 scale; GRUEN/BERT/Mover on 0–1; PPL raw.


Political Spectrum & Party Alignment

These are the two novel embedding-based metrics introduced by ParliaBench to measure political authenticity, a dimension that conventional NLP metrics do not capture.

Political Spectrum Alignment (PSA)

Measures how well a generated speech's ideological positioning matches the intended party orientation on the left–right spectrum.

Spectrum illustration (figure): parties arranged from Far-left (−6) through Centre (0) to Far-right (+6) — Green · Labour/SNP · LibDems · Conservative · DUP — with ▼ marking where the generated speech lands.
How it's computed:
1. Build centroid embeddings for each orientation (Far-left → Far-right) from real ParlaMint-GB speeches
2. Embed the generated speech with all-mpnet-base-v2
3. Find the closest orientation centroid via cosine similarity
4. Score = sim(speech, closest_centroid) × max(0, 1 − Δφ/12)
   where Δφ = |expected_orientation − closest_orientation|
5. Range 0→1; perfect alignment approaches 1
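The scoring in steps 3–5 can be sketched in plain Python. The toy 2-D centroids below are stand-ins for the real all-mpnet-base-v2 orientation centroids, and the helper names are hypothetical; the real metric uses 13 orientation centroids, which is why the distance penalty divides by 12.

```python
from math import sqrt

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def psa_score(speech_emb, centroids, expected_idx):
    """Steps 3-4: nearest-centroid similarity, damped by the distance
    between expected and closest orientation (max distance 12 on the
    13-point left-right scale)."""
    sims = [cos(speech_emb, c) for c in centroids]
    closest = max(range(len(sims)), key=sims.__getitem__)
    delta = abs(expected_idx - closest)
    return sims[closest] * max(0.0, 1.0 - delta / 12)

# Toy 2-D "embeddings": three orientation centroids left / centre / right.
centroids = [(1.0, 0.0), (0.7, 0.7), (0.0, 1.0)]
speech = (0.9, 0.1)                       # clearly closest to centroids[0]
aligned = psa_score(speech, centroids, expected_idx=0)
```

A mismatched expectation (e.g. `expected_idx=2` for the same speech) yields a lower score, which is exactly the damping behaviour of step 4.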

Party Alignment

Measures how closely a generated speech matches the linguistic style and rhetoric of the specified party, independent of spectrum position.

Illustration (figure): circles mark party centroid embeddings (Con, Lab, SNP, LibD); the generated speech (expected: Labour) scores sim = 0.61 against the Labour centroid.
How it's computed:
1. Build a centroid embedding per party from all real speeches in that party's training data
2. Embed the generated speech with all-mpnet-base-v2
3. Score = cosine_similarity(speech, expected_party_centroid)
4. Range 0→1; captures party-specific vocabulary, rhetorical style, and framing beyond ideological position alone
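The steps above reduce to a single cosine similarity against the expected party's centroid. A minimal sketch with toy 2-D vectors standing in for the real all-mpnet-base-v2 centroids (helper names are hypothetical):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def party_alignment(speech_emb, party_centroids, expected_party):
    """Score = cosine similarity to the expected party's centroid only;
    unlike PSA, no nearest-centroid search and no distance penalty."""
    return cosine(speech_emb, party_centroids[expected_party])

# Toy 2-D centroids standing in for real per-party embeddings.
centroids = {"Conservative": (1.0, 0.0), "Labour": (0.0, 1.0)}
score = party_alignment((0.2, 0.9), centroids, "Labour")
```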

Key finding: both metrics successfully discriminate their target dimensions (p < 0.001). All five fine-tuned models showed statistically significant improvements in PSA (effect sizes d = 0.14–1.05), validating that fine-tuning genuinely improves ideological alignment, not just surface fluency. In the results table above, Gemma 2 posts the highest fine-tuned PSA (0.498) and Party Alignment (0.590).

Evaluation Framework

27,560 fully evaluated speeches across three assessment dimensions:

Linguistic quality

  • Perplexity (PPL) — text naturalness via GPT-2 (↓ better)
  • Distinct-N — lexical diversity via unique n-gram ratios (↑ better)
  • Self-BLEU — intra-model diversity; lower = more varied outputs (↓ better)
  • J_Coh / J_Conc — LLM-as-a-Judge coherence and conciseness (1–10 scale)
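Distinct-N is simple enough to sketch directly. This follows the standard definition (unique n-grams divided by total n-grams); the exact variant used in the benchmark is an assumption.

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Unique n-gram ratio: 1.0 means every n-gram occurs exactly once."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```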

Semantic coherence

  • GRUEN — grammaticality and semantic coherence (↑ better)
  • BERTScore — semantic similarity via RoBERTa-large F1 (↑ better)
  • MoverScore — Earth Mover's Distance over contextual embeddings (↑ better)
  • J_Rel — LLM-as-a-Judge relevance to prompt (1–10 scale)

Political authenticity (novel metrics)

  • Political Spectrum Alignment (PSA) — cosine similarity to orientation centroids weighted by ideological distance on a 13-point left–right scale
  • Party Alignment — cosine similarity to party-specific embedding centroids
  • J_Auth / J_PolApp / J_Qual — LLM-as-a-Judge authenticity, political appropriateness, overall quality (1–10 scale)

LLM judge: FlowJudge-v0.1 (3.8B, Phi-3.5-mini architecture) — architecturally independent from all evaluated models.


Citation

@article{koniaris2025parliabench,
  title   = {ParliaBench: An Evaluation and Benchmarking Framework for
             LLM-Generated Parliamentary Speech},
  author  = {Koniaris, Marios and Tsipi, Argyro and Tsanakas, Panayiotis},
  journal = {arXiv preprint arXiv:2511.08247},
  year    = {2025},
  url     = {https://arxiv.org/abs/2511.08247}
}

National Technical University of Athens · School of Electrical and Computer Engineering

