I fine-tuned OpenAI’s Whisper-large-v3-turbo on ~8.6 hours of Nigerian Pidgin audio using LoRA. The result: 21.37% Word Error Rate on the held-out test set — an 8.2 percentage-point absolute improvement (28% relative) over the strongest published Pidgin ASR baseline. The model, dataset, training scripts, and a live browser demo are all open source.
- 🤗 Model: michaelodafe/whisper-pidgin-v1
- 📦 Dataset: michaelodafe/pidgin-asr-combined
- 🎤 Live demo: Hugging Face Space
- 💻 Code: github.com/michaelodafe/Naija-Pidgin-Whisper
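If you'd rather call it from Python than the browser, here's a minimal sketch of loading the adapter with 🤗 Transformers + PEFT. It assumes the Hub repo is a standard PEFT adapter; check the model card for the exact layout.

```python
# Minimal sketch: attach the LoRA adapter to the base Whisper model.
# Assumes michaelodafe/whisper-pidgin-v1 is a standard PEFT adapter repo.
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
model = PeftModel.from_pretrained(base, "michaelodafe/whisper-pidgin-v1")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")

def transcribe(waveform_16k):
    """Transcribe a 16 kHz mono waveform (a NumPy float array)."""
    inputs = processor(waveform_16k, sampling_rate=16_000, return_tensors="pt")
    ids = model.generate(input_features=inputs.input_features)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```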
The problem
Most speech-to-text models you’ve ever used — from Siri to Google Voice Typing to Otter.ai — were trained on English, Mandarin, and a handful of other major languages. Nigerian Pidgin English is spoken by an estimated 75+ million people across West Africa, but no production-grade open-source STT model existed for it.
That gap matters. A keyboard, dictation app, or transcription tool built for Nigerian users still asks them to switch to Standard English to use voice input. Pidgin in, English out, losing the language we actually speak.
I wanted to fix that, and to do it openly: model, data, training recipe, deployment path — all public, all reproducible.
What I built
A complete pipeline, end to end:
- Curated a Nigerian Pidgin training corpus by combining publicly available Pidgin ASR sources into a unified train/validation/test split with a consistent schema (a unification sketch follows this list).
- LoRA fine-tuned Whisper-large-v3-turbo on a single free Kaggle T4 GPU in under 4 hours. The trained adapter is just 26 MB.
- Built a real-time inference path using `faster-whisper` (CTranslate2 int8 quantization) with Silero VAD for utterance segmentation. Runs in ~200–600 ms per utterance on a Mac CPU. (A sketch of this path follows the list.)
- Shipped a deployment recipe: a Hugging Face Inference Endpoint handler + a free Gradio demo Space, so anyone can try the model in their browser without installing anything.
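For the curation step, the unification logic looks roughly like this. The source names below are placeholders; the real source list and per-source column mappings are in the repo.

```python
# Sketch: pull each public Pidgin ASR source into one (audio, text) schema
# at 16 kHz, then split. "source_one"/"source_two" are placeholder names.
from datasets import Audio, concatenate_datasets, load_dataset

SOURCES = ["source_one", "source_two"]  # placeholders; real list is in the repo

parts = []
for name in SOURCES:
    ds = load_dataset(name, split="train")
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # resample on read
    parts.append(ds.select_columns(["audio", "text"]))

combined = concatenate_datasets(parts).shuffle(seed=42)
splits = combined.train_test_split(test_size=0.2, seed=42)  # then carve out val
```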
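And the real-time path itself, as a sketch. `./whisper-pidgin-ct2` is a placeholder for the merged, int8-converted model directory (the conversion step is shown under "How it works").

```python
# Sketch: serve the CTranslate2 int8 export with faster-whisper, using its
# built-in Silero VAD filter to segment utterances.
from faster_whisper import WhisperModel

model = WhisperModel("./whisper-pidgin-ct2", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "sample.wav",
    language="en",      # Whisper has no Pidgin language tag; English is closest
    vad_filter=True,    # Silero VAD trims silence between utterances
    beam_size=5,
)
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```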
Results
| Metric | Pidgin Whisper v1 | Wav2Vec2-XLSR-53 baseline |
|---|---|---|
| Test WER | 21.37% | 29.6% |
| Test CER | 9.90% | — |
Trained on 5.41 h, validated on 1.37 h, and evaluated on 1.78 h of held-out test data. Test WER (21.37%) actually came in slightly below validation WER (21.96%), suggesting clean generalization rather than overfitting.
Concretely: of every 5 words spoken, the model gets ~4 right — and the errors are mostly Pidgin orthographic variants (hin ↔ him) and proper nouns rather than misheard semantics.
How it works
- Base model: Whisper-large-v3-turbo (809M parameters) — best ratio of accuracy, speed, and fine-tunability for the constraints.
- Fine-tune: LoRA with `r=32`, `alpha=64` on the attention `q_proj` and `v_proj` projections. ~3M trainable parameters. (Config sketch below.)
- Effective batch: 16 (4 per device × 4 grad-accum), LR 1e-4, fp16, 5 epochs.
- Inference: the trained adapter is merged into the base, exported to CTranslate2 int8, and served via `faster-whisper` (4–6× faster than Hugging Face `transformers` on CPU). (Export sketch below.)
- Decode-time enhancements: an `initial_prompt` containing common Nigerian proper nouns + Pidgin function words biases the decoder toward correct vocabulary; a postprocess pass strips punctuation the labels don't use and merges digit groups. (Decode sketch below.)
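For reference, here's roughly what that setup looks like in PEFT. The `r`, `alpha`, target modules, and training hyperparameters match the numbers above; the dropout value and output path are illustrative.

```python
# Sketch of the LoRA configuration and training arguments described above.
from transformers import (
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")

lora = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    lora_dropout=0.05,                    # illustrative; not specified above
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()        # ~3M trainable out of 809M

args = Seq2SeqTrainingArguments(
    output_dir="whisper-pidgin-lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,        # effective batch of 16
    learning_rate=1e-4,
    num_train_epochs=5,
    fp16=True,
)
```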
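The merge-and-export step continues from that sketch (`model` and `processor` from above; paths are placeholders):

```python
# Fold the LoRA deltas into the base weights and save a plain checkpoint,
# then quantize with CTranslate2's converter.
merged = model.merge_and_unload()
merged.save_pretrained("whisper-pidgin-merged")
processor.save_pretrained("whisper-pidgin-merged")

# Shell step (the converter ships with ctranslate2):
#   ct2-transformers-converter --model whisper-pidgin-merged \
#       --output_dir whisper-pidgin-ct2 --quantization int8
```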
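And the decode-time layer, approximately. The prompt here is a made-up excerpt and the regexes are simplified; the real 90-token prompt and postprocess rules live in the repo.

```python
# Sketch: bias decoding with an initial prompt, then normalize the output.
import re
from faster_whisper import WhisperModel

PROMPT = "Lagos, Abuja, naira, abeg, wetin, wahala, dey, una, sabi"  # illustrative

fw = WhisperModel("./whisper-pidgin-ct2", device="cpu", compute_type="int8")
segments, _ = fw.transcribe("sample.wav", vad_filter=True, initial_prompt=PROMPT)
raw = " ".join(seg.text.strip() for seg in segments)

def postprocess(text: str) -> str:
    text = re.sub(r"[,.;:!?\"]", "", text)              # labels carry no punctuation
    text = re.sub(r"(?<=\d)[ ,](?=\d{3}\b)", "", text)  # "1 000" / "1,000" -> "1000"
    return " ".join(text.split())

print(postprocess(raw))
```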
The full design notes — including every dead end I hit — are in the project documentation on GitHub.
What I learned
- Audit your data before you train on it. I burned an afternoon pulling and uploading a 600 MB "Nigerian web" corpus before checking a sample; it turned out only 0.09% of it was actually Pidgin. The rest was Standard English. A 30-second `grep` for Pidgin function words would have caught it (a sketch follows this list).
- Beam search rescoring fails on confident models. I trained a Pidgin-specific KenLM language model intending to do n-best rescoring of Whisper's outputs. Beautiful idea on paper. In practice, my LoRA-fine-tuned Whisper returned the identical string in all 8 beams; the model is so confident that there's nothing for the LM to rerank. To actually use a Pidgin LM, I'd need token-level shallow fusion, not n-best rescoring.
- Decode-time tricks are underrated. A 90-token initial prompt with proper nouns + a 20-line postprocess regex bought me ~1–2 pp of WER improvement at near-zero engineering and inference cost. That's better ROI than most of the bigger architectural ideas I tried.
- Open-sourcing it forces clarity. Writing a model card and dataset card for public consumption made me re-examine a lot of decisions I’d made on autopilot.
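For the curious, the audit I should have run looks something like this; the marker list is illustrative and deliberately small.

```python
# Quick corpus audit: estimate what fraction of lines are actually Pidgin
# by counting hits on common Pidgin function words.
import re

PIDGIN_MARKERS = {"abeg", "dem", "dey", "pikin", "sabi", "una", "wahala", "wetin"}

def looks_like_pidgin(line: str, min_hits: int = 2) -> bool:
    words = set(re.findall(r"[a-z']+", line.lower()))
    return len(words & PIDGIN_MARKERS) >= min_hits

with open("corpus.txt", encoding="utf-8") as f:
    lines = [l for l in f if l.strip()]

hits = sum(looks_like_pidgin(l) for l in lines)
total = len(lines) or 1
print(f"{hits}/{total} lines ({hits / total:.2%}) look like Pidgin")
```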
What’s next
The model is shipped, but it’s a v1. Roughly in order of leverage:
- More data. `timniel/Pidgin_ASR_Dataset_Combined` would add ~12 h of new Pidgin audio (~2× the current data), targeting ~17–19% WER.
- Mobile keyboard product. The end goal: a free iPhone + Android keyboard that lets Nigerians type by speaking Pidgin, monetized through rewarded video ads. Cloud inference now, on-device distilled model later.
- Token-level LM fusion. Properly integrate the KenLM I trained at the decoder-logits level, not as post-hoc rescoring. Worth an estimated ~1–3 pp of additional WER improvement for the right amount of work.
- Distill for on-device. Whisper-tiny or Moonshine fine-tuned on Pidgin, quantized to int4, deployed via Core ML / TFLite. The endgame for a private, offline, latency-free keyboard.
Tech stack
PyTorch · Transformers · PEFT · faster-whisper · CTranslate2 · Silero VAD · KenLM · Kaggle (training compute) · Hugging Face (model + dataset + Space hosting) · Gradio (demo UI).
If you want to try it, the demo Space is the fastest path. If you want to read how I built it, the GitHub repo and its documentation.md have the unvarnished story.
