ParliaBench — UK Parliamentary Speech Generation

Inference demo for five LLMs fine-tuned on ParlaMint-GB with QLoRA
Koniaris, Tsipi & Tsanakas · arXiv:2511.08247 · NTUA 2025

About ParliaBench

ParliaBench is a benchmark and evaluation framework for LLM-generated UK parliamentary speeches, combining a curated dataset, multi-dimensional evaluation metrics, and five domain-specific fine-tuned models.
Paper: arXiv:2511.08247

Dataset

Constructed from the UK subset of the ParlaMint corpus, 2015–2022. Four-step pipeline: XML parsing → metadata alignment → content filtering → EuroVoc thematic classification.

Corpus Statistics

| Statistic | Value |
|---|---|
| Total speeches | 447,778 |
| Unique speakers | 1,901 |
| Political affiliations | 11 |
| Total words | ~99.94 million |
| Mean speech length | 223 words |
| Median speech length | 99 words |
| P10 (min threshold) | 43 words |
| P90 (max threshold) | 635 words |
| EuroVoc topic domains | 21 |
| Temporal coverage | 2015–2022 |

Political Parties in Dataset

| Party | Orientation | Speeches | Speakers | Share |
|---|---|---|---|---|
| Conservative | Centre-right | 263,513 | 792 | 58.9% |
| Labour | Centre-left | 108,831 | 592 | 24.3% |
| Scottish National Party | Centre-left | 23,562 | 67 | 5.3% |
| Liberal Democrats | Centre / centre-left | 23,517 | 168 | 5.3% |
| Crossbench | Non-partisan | 11,878 | 215 | 2.7% |
| Democratic Unionist Party | Right | 6,610 | 15 | 1.5% |
| Independent | Non-partisan | 2,783 | 45 | 0.6% |
| Plaid Cymru | Centre-left to left | 2,229 | 7 | 0.5% |
| Green Party | Left | 1,992 | 3 | 0.4% |
| Non-Affiliated | Non-partisan | 1,713 | 60 | 0.4% |
| Bishops | Non-partisan | 1,150 | 41 | 0.3% |

Bishops, Crossbench, and Non-Affiliated are formal parliamentary affiliations rather than political parties. Inclusion threshold: at least 1,000 speeches per affiliation.

Models

Five LLMs fine-tuned with QLoRA via the Unsloth framework:

| Model | Base (Unsloth 4-bit) | HF Repository |
|---|---|---|
| Llama-3.1-8B | unsloth/Meta-Llama-3.1-8B-bnb-4bit | argyrotsipi/parliabench-unsloth-llama-3.1-8b |
| Gemma-2-9B | unsloth/gemma-2-9b-bnb-4bit | argyrotsipi/parliabench-unsloth-gemma-2-9b |
| Mistral-7B | unsloth/mistral-7b-v0.3-bnb-4bit | argyrotsipi/parliabench-unsloth-mistral-7b-v0.3 |
| Qwen2-7B | unsloth/Qwen2-7B-bnb-4bit | argyrotsipi/parliabench-unsloth-qwen-2-7b |
| Yi-1.5-6B | unsloth/Yi-1.5-6B-bnb-4bit | argyrotsipi/parliabench-unsloth-yi-1.5-6b |

QLoRA Configuration

| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 16 |
| Target modules | q, k, v, o, gate, up, down |
| Dropout | 0 |
| Batch size | 64 |
| Learning rate | 2e-4 |
| Optimizer | AdamW (fused) |
| Max steps | 11,194 (~2 epochs) |
| Warmup steps | 336 |
| Max sequence length | 1,024 |
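The hyperparameters above can be collected into plain config dicts, shown here as an illustrative sketch: the key names follow common `peft.LoraConfig` / `transformers.TrainingArguments` conventions and are assumptions, not the authors' exact training script.

```python
# Hypothetical sketch of the QLoRA setup, mirroring the table above.
# Key names follow peft.LoraConfig / transformers.TrainingArguments
# conventions; the paper's actual script may differ.

lora_config = {
    "r": 16,                     # LoRA rank
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "target_modules": [          # attention + MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

training_config = {
    "per_device_train_batch_size": 64,
    "learning_rate": 2e-4,
    "optim": "adamw_torch_fused",
    "max_steps": 11_194,         # ~2 epochs over the corpus
    "warmup_steps": 336,
    "max_seq_length": 1_024,
}
```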

Generation Parameters

| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-p | 0.85 |
| Repetition penalty | 1.2 |
| Max new tokens | 850 |
| Min words (P10) | 43 |
| Max words (P90) | 635 |
| Sampling | Nucleus (top-p) |
| Max regen attempts | 3 |
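The parameters above imply a sample-validate-regenerate loop. A minimal sketch, assuming a `generate_speech` callable standing in for the real model call and a word-count check standing in for the full validator:

```python
# Sketch of the generate -> validate -> regenerate loop implied by the
# table above. generate_speech() is a hypothetical stand-in for the
# model call; is_valid() stands in for the full 9-step validator.

GEN_KWARGS = {
    "do_sample": True,           # nucleus (top-p) sampling
    "temperature": 0.7,
    "top_p": 0.85,
    "repetition_penalty": 1.2,
    "max_new_tokens": 850,
}
MIN_WORDS, MAX_WORDS, MAX_ATTEMPTS = 43, 635, 3

def length_ok(text: str) -> bool:
    """Length check: word count inside the P10-P90 band of the corpus."""
    return MIN_WORDS <= len(text.split()) <= MAX_WORDS

def generate_with_retries(generate_speech, is_valid=length_ok):
    """Return the first valid speech, or None after MAX_ATTEMPTS tries."""
    for _ in range(MAX_ATTEMPTS):
        speech = generate_speech(**GEN_KWARGS)
        if is_valid(speech):
            return speech
    return None

# Toy usage: a fake generator that always emits a 50-word speech.
fake = lambda **kw: " ".join(["word"] * 50)
result = generate_with_retries(fake)
```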

Prompt Architecture

System prompt — training (no word count):

You are a seasoned UK parliamentary member. Use proper British parliamentary language
appropriate for the specified House. The speech should reflect the political orientation
and typical positions of the specified party on the given topic.

System prompt — generation (includes word count target):

You are a seasoned UK parliamentary member. Generate a coherent speech of
{min_words}-{max_words} words in standard English (no Unicode artifacts, no special
characters). Use proper British parliamentary language appropriate for the specified
House. The speech should reflect the political orientation and typical positions of the
specified party on the given topic.

Context string (pipe-separated, injected as user turn):

EUROVOC TOPIC: {topic} | SECTION: {section} | PARTY: {party} | POLITICAL ORIENTATION: {orientation} | HOUSE: {house}
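The context string can be assembled with a simple f-string; the field names come from the template above, while the helper name itself is illustrative.

```python
def build_context(topic, section, party, orientation, house):
    """Assemble the pipe-separated context string injected as the user turn."""
    return (
        f"EUROVOC TOPIC: {topic} | SECTION: {section} | PARTY: {party} | "
        f"POLITICAL ORIENTATION: {orientation} | HOUSE: {house}"
    )

# Example values are hypothetical; any EuroVoc topic / party / house works.
ctx = build_context("environment", "Environment Bill", "Labour",
                    "Centre-left", "House of Commons")
```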

Model-specific chat templates

Mistral

<s>[INST] {SYSTEM_PROMPT}
Context: {context}
Instruction: {instruction} [/INST] {response}</s>

Llama 3.1

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{SYSTEM_PROMPT}<|eot_id|><|start_header_id|>user<|end_header_id|>
Context: {context}
Instruction: {instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{response}<|eot_id|>

Gemma 2

<bos><start_of_turn>user
{SYSTEM_PROMPT}
Context: {context}
Instruction: {instruction}<end_of_turn>
<start_of_turn>model
{response}<end_of_turn>

Qwen2 & Yi-1.5 (ChatML)

<|im_start|>system
{SYSTEM_PROMPT}<|im_end|>
<|im_start|>user
Context: {context}
Instruction: {instruction}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
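For the ChatML models (Qwen2, Yi-1.5), the inference prompt above can be rendered as a plain string, leaving the assistant turn open for generation. In practice the tokenizer's `apply_chat_template` does the equivalent; this manual sketch just makes the structure explicit.

```python
def chatml_prompt(system_prompt, context, instruction):
    """Render the ChatML inference prompt with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\nContext: {context}\n"
        f"Instruction: {instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = chatml_prompt("You are a seasoned UK parliamentary member.",
                       "EUROVOC TOPIC: environment | ...",
                       "Generate the speech.")
```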

Speech Validation

Every generated speech passes through a 9-step validation pipeline; invalid speeches are automatically regenerated up to 3 times. Baseline models exhibited higher failure rates, indicating that fine-tuning directly improved output validity.

| # | Check | Detail |
|---|---|---|
| 1 | Template leakage | 27 markers: role tokens (user, assistant), context labels (Context:, Instruction:), special tokens ([INST], im_start, etc.) |
| 2 | Unicode corruption | 14 corruption patterns + 11 forbidden script ranges (CJK, Cyrillic, Arabic, Thai, technical symbols) |
| 3 | Language detection | spaCy en_core_web_sm with 85% confidence threshold on texts ≥ 30 characters |
| 4 | Repetition | Same word > 3× consecutively; 3–7-word sequences repeated > 3×; > 5 ordinal counting words |
| 5 | Semantic relevance | Reject if cosine similarity < 0.08 (all-MiniLM-L6-v2) against "UK parliamentary debate about {section} on {topic}" |
| 6 | Length | Valid word count: 43–635 words (P10–P90 of training corpus) |
| 7 | Concatenation | Reject if ≥ 4 parliamentary opening phrases (My Lords, Mr Speaker, …), suggesting multiple speeches joined |
| 8 | Corrupted endings | Nonsensical suffixes (e.g. ▍▍▍, });) |
| 9 | Refusal patterns | AI refusal phrases (I cannot generate, I'm sorry but I cannot, …) |
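One clause of check #4 (a single word repeated more than 3 times consecutively) can be sketched as a pure-Python pass over the tokenised speech. The threshold comes from the table; the function itself is an assumption about how the check is implemented.

```python
import re

def has_excessive_repetition(text: str, max_run: int = 3) -> bool:
    """Flag a speech if any single word appears more than max_run
    times in a row (one clause of validation check #4)."""
    words = re.findall(r"[a-z']+", text.lower())
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False
```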

Results

27,560 fully evaluated speeches — baseline (B) vs fine-tuned (F) across all 14 metrics.

In the table, ↑/↓ beside a fine-tuned value marks a statistically significant change; whether that change is an improvement or a regression depends on the metric's direction. Better is: PPL ↓ · Self-BLEU ↓ · all others ↑.

Columns 1–5 measure linguistic quality, 6–9 semantic coherence, 10–14 political authenticity.

| Model | PPL↓ | Dist-N↑ | Self-BLEU↓ | J_Coh↑ | J_Conc↑ | GRUEN↑ | BERTScore↑ | MoverScore↑ | J_Rel↑ | PSA↑ | Party Align↑ | J_Auth↑ | J_PolApp↑ | J_Qual↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B (B) | 60.854 | 0.988 | 0.006 | 7.041 | 5.935 | 0.539 | 0.803 | 0.505 | 5.465 | 0.399 | 0.504 | 4.403 | 6.177 | 4.791 |
| Llama 3.1 8B (F) | 31.724 ↓ | 0.974 ↓ | 0.018 | 7.915 ↑ | 7.129 ↑ | 0.508 ↓ | 0.820 ↑ | 0.511 ↑ | 6.139 ↑ | 0.487 ↑ | 0.576 ↑ | 6.106 ↑ | 7.277 ↑ | 5.399 ↑ |
| Gemma 2 9B (B) | 89.784 | 0.992 | 0.008 | 7.788 | 4.784 | 0.526 | 0.804 | 0.508 | 5.782 | 0.444 | 0.543 | 3.837 | 6.498 | 4.442 |
| Gemma 2 9B (F) | 101.578 ↑ | 0.990 | 0.010 | 7.507 ↓ | 5.006 | 0.501 | 0.804 | 0.510 ↑ | 5.529 | 0.498 ↑ | 0.590 | 4.209 ↑ | 7.293 ↑ | 4.950 ↑ |
| Mistral 7B v0.3 (B) | 31.280 | 0.966 | 0.008 | 6.598 | 6.899 | 0.555 | 0.810 | 0.505 | 5.418 | 0.418 | 0.521 | 4.237 | 5.617 | 4.179 |
| Mistral 7B v0.3 (F) | 29.562 ↓ | 0.972 ↑ | 0.016 | 7.961 ↑ | 8.962 ↑ | 0.552 | 0.825 ↑ | 0.508 | 5.681 ↑ | 0.437 ↑ | 0.507 ↓ | 3.983 ↓ | 6.382 ↑ | 3.727 ↓ |
| Qwen2 7B (B) | 44.927 | 0.981 | 0.020 | 7.911 | 5.928 | 0.488 | 0.803 | 0.508 | 6.904 | 0.444 | 0.560 | 6.565 | 7.291 | 6.348 |
| Qwen2 7B (F) | 36.090 ↓ | 0.982 | 0.017 ↓ | 8.060 ↑ | 7.625 ↑ | 0.539 ↑ | 0.821 ↑ | 0.512 ↑ | 6.009 ↓ | 0.488 ↑ | 0.572 | 5.731 ↓ | 7.138 | 5.014 ↓ |
| Yi 6B (B) | 82.100 | 0.990 | 0.006 | 6.741 | 4.303 | 0.563 | 0.799 | 0.505 | 4.490 | 0.343 | 0.423 | 2.981 | 5.385 | 3.083 |
| Yi 6B (F) | 42.893 ↓ | 0.987 | 0.016 | 8.043 ↑ | 6.856 ↑ | 0.537 | 0.817 ↑ | 0.511 ↑ | 5.984 ↑ | 0.493 ↑ | 0.582 ↑ | 6.102 ↑ | 7.326 ↑ | 5.392 ↑ |

↑/↓ p < 0.05 after Bonferroni correction. PSA and Party Align on 0–1 scale; J scores on 0–10 scale; GRUEN/BERT/Mover on 0–1; PPL raw.


Political Spectrum & Party Alignment

These are the two novel embedding-based metrics introduced by ParliaBench to measure political authenticity, a dimension that conventional NLP metrics do not capture.

Political Spectrum Alignment (PSA)

Measures how well a generated speech's ideological positioning matches the intended party orientation on the left–right spectrum.

Spectrum illustration (figure): parties arranged from Far-left (−6) through Centre (0) to Far-right (+6) — Green · Labour/SNP · LibDems · Conservative · DUP — with ▼ marking where the generated speech lands.
How it's computed:
1. Build centroid embeddings for each orientation (Far-left → Far-right) from real ParlaMint-GB speeches
2. Embed the generated speech with all-mpnet-base-v2
3. Find the closest orientation centroid via cosine similarity
4. Score = sim(speech, closest_centroid) × max(0, 1 − Δφ/12)
   where Δφ = |expected_orientation − closest_orientation|
5. Range 0→1; perfect alignment approaches 1
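The scoring in steps 3–5 can be sketched in plain Python. The toy 2-D centroids below are stand-ins for the real all-mpnet-base-v2 orientation centroids, and the helper names are hypothetical; the real metric uses 13 orientation centroids, which is why the distance penalty divides by 12.

```python
from math import sqrt

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def psa_score(speech_emb, centroids, expected_idx):
    """Steps 3-4: nearest-centroid similarity, damped by the distance
    between expected and closest orientation (max distance 12 on the
    13-point left-right scale)."""
    sims = [cos(speech_emb, c) for c in centroids]
    closest = max(range(len(sims)), key=sims.__getitem__)
    delta = abs(expected_idx - closest)
    return sims[closest] * max(0.0, 1.0 - delta / 12)

# Toy 2-D "embeddings": three orientation centroids left / centre / right.
centroids = [(1.0, 0.0), (0.7, 0.7), (0.0, 1.0)]
speech = (0.9, 0.1)                       # clearly closest to centroids[0]
aligned = psa_score(speech, centroids, expected_idx=0)
```

A mismatched expectation (e.g. `expected_idx=2` for the same speech) yields a lower score, which is exactly the damping behaviour of step 4.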

Party Alignment

Measures how closely a generated speech matches the linguistic style and rhetoric of the specified party, independent of spectrum position.

Illustration (figure): circles mark party centroid embeddings (Con, Lab, SNP, LibD); the generated speech (expected: Labour) scores sim = 0.61 against the Labour centroid.
How it's computed:
1. Build a centroid embedding per party from all real speeches in that party's training data
2. Embed the generated speech with all-mpnet-base-v2
3. Score = cosine_similarity(speech, expected_party_centroid)
4. Range 0→1; captures party-specific vocabulary, rhetorical style, and framing beyond ideological position alone
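The steps above reduce to a single cosine similarity against the expected party's centroid. A minimal sketch with toy 2-D vectors standing in for the real all-mpnet-base-v2 centroids (helper names are hypothetical):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def party_alignment(speech_emb, party_centroids, expected_party):
    """Score = cosine similarity to the expected party's centroid only;
    unlike PSA, no nearest-centroid search and no distance penalty."""
    return cosine(speech_emb, party_centroids[expected_party])

# Toy 2-D centroids standing in for real per-party embeddings.
centroids = {"Conservative": (1.0, 0.0), "Labour": (0.0, 1.0)}
score = party_alignment((0.2, 0.9), centroids, "Labour")
```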

Key finding: both metrics successfully discriminate their target dimensions (p < 0.001). All five fine-tuned models showed statistically significant improvements in PSA (effect sizes d = 0.14–1.05), validating that fine-tuning genuinely improves ideological alignment, not just surface fluency. In the results table above, Gemma 2 posts the highest fine-tuned PSA (0.498) and Party Alignment (0.590).

Evaluation Framework

27,560 fully evaluated speeches across three assessment dimensions:

Linguistic quality

  • Perplexity (PPL) — text naturalness via GPT-2 (↓ better)
  • Distinct-N — lexical diversity via unique n-gram ratios (↑ better)
  • Self-BLEU — intra-model diversity; lower = more varied outputs (↓ better)
  • J_Coh / J_Conc — LLM-as-a-Judge coherence and conciseness (1–10 scale)
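Distinct-N is simple enough to sketch directly. This follows the standard definition (unique n-grams divided by total n-grams); the exact variant used in the benchmark is an assumption.

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Unique n-gram ratio: 1.0 means every n-gram occurs exactly once."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```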

Semantic coherence

  • GRUEN — grammaticality and semantic coherence (↑ better)
  • BERTScore — semantic similarity via RoBERTa-large F1 (↑ better)
  • MoverScore — Earth Mover's Distance over contextual embeddings (↑ better)
  • J_Rel — LLM-as-a-Judge relevance to prompt (1–10 scale)

Political authenticity (novel metrics)

  • Political Spectrum Alignment (PSA) — cosine similarity to orientation centroids weighted by ideological distance on a 13-point left–right scale
  • Party Alignment — cosine similarity to party-specific embedding centroids
  • J_Auth / J_PolApp / J_Qual — LLM-as-a-Judge authenticity, political appropriateness, overall quality (1–10 scale)

LLM judge: FlowJudge-v0.1 (3.8B, Phi-3.5-mini architecture) — architecturally independent from all evaluated models.


Citation

@article{koniaris2025parliabench,
  title   = {ParliaBench: An Evaluation and Benchmarking Framework for
             LLM-Generated Parliamentary Speech},
  author  = {Koniaris, Marios and Tsipi, Argyro and Tsanakas, Panayiotis},
  journal = {arXiv preprint arXiv:2511.08247},
  year    = {2025},
  url     = {https://arxiv.org/abs/2511.08247}
}

National Technical University of Athens · School of Electrical and Computer Engineering

