I fine-tuned OpenAI’s Whisper-large-v3-turbo on ~8.6 hours of Nigerian Pidgin audio using LoRA. The result: 21.37% Word Error Rate on the held-out test set — an 8.2 percentage-point absolute improvement (28% relative) over the strongest published Pidgin ASR baseline. The model, dataset, training scripts, and a live browser demo are all open source.
- 🤗 Model: michaelodafe/whisper-pidgin-v1
- 📦 Dataset: michaelodafe/pidgin-asr-combined
- 🎤 Live demo: Hugging Face Space
- 💻 Code: github.com/michaelodafe/Naija-Pidgin-Whisper
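If you'd rather call it from Python than the browser, here's a minimal sketch of loading the adapter with 🤗 Transformers + PEFT. It assumes the Hub repo is a standard PEFT adapter; check the model card for the exact layout.

```python
# Minimal sketch: attach the LoRA adapter to the base Whisper model.
# Assumes michaelodafe/whisper-pidgin-v1 is a standard PEFT adapter repo.
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
model = PeftModel.from_pretrained(base, "michaelodafe/whisper-pidgin-v1")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")

def transcribe(waveform_16k):
    """Transcribe a 16 kHz mono waveform (a NumPy float array)."""
    inputs = processor(waveform_16k, sampling_rate=16_000, return_tensors="pt")
    ids = model.generate(input_features=inputs.input_features)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```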
The problem
Most speech-to-text models you’ve ever used — from Siri to Google Voice Typing to Otter.ai — were trained on English, Mandarin, and a handful of other major languages. Nigerian Pidgin English is spoken by an estimated 75+ million people across West Africa, but no production-grade open-source STT model existed for it.
That gap matters. A keyboard, dictation app, or transcription tool built for Nigerian users still asks them to switch to Standard English to use voice input. Pidgin in, English out, losing the language we actually speak.
I wanted to fix that, and to do it openly: model, data, training recipe, deployment path — all public, all reproducible.
What I built
A complete pipeline, end to end:
- Curated a Nigerian Pidgin training corpus by combining publicly available Pidgin ASR sources into a unified train/validation/test split with a consistent schema (a unification sketch follows this list).
- LoRA fine-tuned Whisper-large-v3-turbo on a single free Kaggle T4 GPU in under 4 hours. The trained adapter is just 26 MB.
- Built a real-time inference path using `faster-whisper` (CTranslate2 int8 quantization) with Silero VAD for utterance segmentation. Runs in ~200–600 ms per utterance on a Mac CPU. (A sketch of this path follows the list.)
- Shipped a deployment recipe: a Hugging Face Inference Endpoint handler + a free Gradio demo Space, so anyone can try the model in their browser without installing anything.
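For the curation step, the unification logic looks roughly like this. The source names below are placeholders; the real source list and per-source column mappings are in the repo.

```python
# Sketch: pull each public Pidgin ASR source into one (audio, text) schema
# at 16 kHz, then split. "source_one"/"source_two" are placeholder names.
from datasets import Audio, concatenate_datasets, load_dataset

SOURCES = ["source_one", "source_two"]  # placeholders; real list is in the repo

parts = []
for name in SOURCES:
    ds = load_dataset(name, split="train")
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # resample on read
    parts.append(ds.select_columns(["audio", "text"]))

combined = concatenate_datasets(parts).shuffle(seed=42)
splits = combined.train_test_split(test_size=0.2, seed=42)  # then carve out val
```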
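And the real-time path itself, as a sketch. `./whisper-pidgin-ct2` is a placeholder for the merged, int8-converted model directory (the conversion step is shown under "How it works").

```python
# Sketch: serve the CTranslate2 int8 export with faster-whisper, using its
# built-in Silero VAD filter to segment utterances.
from faster_whisper import WhisperModel

model = WhisperModel("./whisper-pidgin-ct2", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "sample.wav",
    language="en",      # Whisper has no Pidgin language tag; English is closest
    vad_filter=True,    # Silero VAD trims silence between utterances
    beam_size=5,
)
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```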
Results
| Metric | Pidgin Whisper v1 | Wav2Vec2-XLSR-53 baseline |
|---|---|---|
| Test WER | 21.37% | 29.6% |
| Test CER | 9.90% | — |
Trained on 5.41 h, validated on 1.37 h, and evaluated on 1.78 h of held-out test data. Test WER (21.37%) actually came in slightly below validation WER (21.96%), suggesting clean generalization rather than overfitting.
Concretely: of every 5 words spoken, the model gets ~4 right — and the errors are mostly Pidgin orthographic variants (hin ↔ him) and proper nouns rather than misheard semantics.
How it works
- Base model: Whisper-large-v3-turbo (809M parameters) — best ratio of accuracy, speed, and fine-tunability for the constraints.
- Fine-tune: LoRA with `r=32`, `alpha=64` on the attention `q_proj` and `v_proj` projections. ~3M trainable parameters. (Config sketch below.)
- Effective batch: 16 (4 per device × 4 grad-accum), LR 1e-4, fp16, 5 epochs.
- Inference: the trained adapter is merged into the base, exported to CTranslate2 int8, and served via `faster-whisper` (4–6× faster than Hugging Face `transformers` on CPU). (Export sketch below.)
- Decode-time enhancements: an `initial_prompt` containing common Nigerian proper nouns + Pidgin function words biases the decoder toward correct vocabulary; a postprocess pass strips punctuation the labels don't use and merges digit groups. (Decode sketch below.)
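For reference, here's roughly what that setup looks like in PEFT. The `r`, `alpha`, target modules, and training hyperparameters match the numbers above; the dropout value and output path are illustrative.

```python
# Sketch of the LoRA configuration and training arguments described above.
from transformers import (
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")

lora = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    lora_dropout=0.05,                    # illustrative; not specified above
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()        # ~3M trainable out of 809M

args = Seq2SeqTrainingArguments(
    output_dir="whisper-pidgin-lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,        # effective batch of 16
    learning_rate=1e-4,
    num_train_epochs=5,
    fp16=True,
)
```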
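The merge-and-export step continues from that sketch (`model` and `processor` from above; paths are placeholders):

```python
# Fold the LoRA deltas into the base weights and save a plain checkpoint,
# then quantize with CTranslate2's converter.
merged = model.merge_and_unload()
merged.save_pretrained("whisper-pidgin-merged")
processor.save_pretrained("whisper-pidgin-merged")

# Shell step (the converter ships with ctranslate2):
#   ct2-transformers-converter --model whisper-pidgin-merged \
#       --output_dir whisper-pidgin-ct2 --quantization int8
```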
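And the decode-time layer, approximately. The prompt here is a made-up excerpt and the regexes are simplified; the real 90-token prompt and postprocess rules live in the repo.

```python
# Sketch: bias decoding with an initial prompt, then normalize the output.
import re
from faster_whisper import WhisperModel

PROMPT = "Lagos, Abuja, naira, abeg, wetin, wahala, dey, una, sabi"  # illustrative

fw = WhisperModel("./whisper-pidgin-ct2", device="cpu", compute_type="int8")
segments, _ = fw.transcribe("sample.wav", vad_filter=True, initial_prompt=PROMPT)
raw = " ".join(seg.text.strip() for seg in segments)

def postprocess(text: str) -> str:
    text = re.sub(r"[,.;:!?\"]", "", text)              # labels carry no punctuation
    text = re.sub(r"(?<=\d)[ ,](?=\d{3}\b)", "", text)  # "1 000" / "1,000" -> "1000"
    return " ".join(text.split())

print(postprocess(raw))
```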
The full design notes — including every dead end I hit — are in the project documentation on GitHub.
What I learned
- Audit your data before you train on it. I burned an afternoon pulling and uploading a 600 MB "Nigerian web" corpus before checking a sample; it turned out only 0.09% of it was actually Pidgin. The rest was Standard English. A 30-second `grep` for Pidgin function words would have caught it (a sketch follows this list).
- Beam search rescoring fails on confident models. I trained a Pidgin-specific KenLM language model intending to do n-best rescoring of Whisper's outputs. Beautiful idea on paper. In practice, my LoRA-fine-tuned Whisper returned the identical string in all 8 beams; the model is so confident that there's nothing for the LM to rerank. To actually use a Pidgin LM, I'd need token-level shallow fusion, not n-best rescoring.
- Decode-time tricks are underrated. A 90-token initial prompt with proper nouns + a 20-line postprocess regex bought me ~1–2 pp of WER improvement at near-zero engineering and inference cost. That's better ROI than most of the bigger architectural ideas I tried.
- Open-sourcing it forces clarity. Writing a model card and dataset card for public consumption made me re-examine a lot of decisions I’d made on autopilot.
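For the curious, the audit I should have run looks something like this; the marker list is illustrative and deliberately small.

```python
# Quick corpus audit: estimate what fraction of lines are actually Pidgin
# by counting hits on common Pidgin function words.
import re

PIDGIN_MARKERS = {"abeg", "dem", "dey", "pikin", "sabi", "una", "wahala", "wetin"}

def looks_like_pidgin(line: str, min_hits: int = 2) -> bool:
    words = set(re.findall(r"[a-z']+", line.lower()))
    return len(words & PIDGIN_MARKERS) >= min_hits

with open("corpus.txt", encoding="utf-8") as f:
    lines = [l for l in f if l.strip()]

hits = sum(looks_like_pidgin(l) for l in lines)
total = len(lines) or 1
print(f"{hits}/{total} lines ({hits / total:.2%}) look like Pidgin")
```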
What’s next
The model is shipped, but it’s a v1. Roughly in order of leverage:
- More data. `timniel/Pidgin_ASR_Dataset_Combined` would add ~12 h of new Pidgin audio (~2× the current data), targeting ~17–19% WER.
- Mobile keyboard product. The end goal: a free iPhone + Android keyboard that lets Nigerians type by speaking Pidgin, monetized through rewarded video ads. Cloud inference now, on-device distilled model later.
- Token-level LM fusion. Properly integrate the KenLM I trained at the decoder-logits level, not as post-hoc rescoring. Worth an estimated ~1–3 pp of additional WER improvement for the right amount of work.
- Distill for on-device. Whisper-tiny or Moonshine fine-tuned on Pidgin, quantized to int4, deployed via Core ML / TFLite. The endgame for a private, offline, latency-free keyboard.
Tech stack
PyTorch · Transformers · PEFT · faster-whisper · CTranslate2 · Silero VAD · KenLM · Kaggle (training compute) · Hugging Face (model + dataset + Space hosting) · Gradio (demo UI).
If you want to try it, the demo Space is the fastest path. If you want to read how I built it, the GitHub repo and its documentation.md have the unvarnished story.
