Best Nigerian Pidgin English Speech-to-Text Model Available Today – Open Sourced

I fine-tuned OpenAI’s Whisper-large-v3-turbo on ~8.6 hours of Nigerian Pidgin audio using LoRA. The result: 21.37% Word Error Rate on the held-out test set — an 8.2 percentage-point absolute improvement (28% relative) over the strongest published Pidgin ASR baseline. The model, dataset, training scripts, and a live browser demo are all open source.


The problem

Most speech-to-text models you’ve ever used — from Siri to Google Voice Typing to Otter.ai — were trained on English, Mandarin, and a handful of other major languages. Nigerian Pidgin English is spoken by an estimated 75+ million people across West Africa, but no production-grade open-source STT model existed for it.

That gap matters. A keyboard, dictation app, or transcription tool built for Nigerian users still asks them to switch to Standard English to use voice input. Pidgin in, English out: the language we actually speak gets lost along the way.

I wanted to fix that, and to do it openly: model, data, training recipe, deployment path — all public, all reproducible.

What I built

A complete pipeline, end to end:

  1. Curated a Nigerian Pidgin training corpus by combining publicly available Pidgin ASR sources into a unified train/validation/test split with a consistent schema.
  2. LoRA fine-tuned Whisper-large-v3-turbo on a single free Kaggle T4 GPU in under 4 hours. The trained adapter is just 26 MB.
  3. Built a real-time inference path using faster-whisper (CTranslate2 int8 quantization) with Silero VAD for utterance segmentation. Runs in ~200–600 ms per utterance on a Mac CPU.
  4. Shipped a deployment recipe — a Hugging Face Inference Endpoint handler + a free Gradio demo Space, so anyone can try the model in their browser without installing anything.
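To make step 1 concrete, here's a minimal sketch of the kind of schema normalization and deterministic splitting involved. The field names, source labels, and split percentages are illustrative assumptions, not the exact recipe from the repo.

```python
# Sketch of step 1: mapping heterogeneous source metadata onto one unified
# schema, then assigning a deterministic split by hashing the audio ID.
# Field names ("path"/"file", "transcript"/"sentence") and the 70/15/15
# split are hypothetical; the real corpora use their own conventions.
import hashlib

def normalize(record: dict, source: str) -> dict:
    # Map whichever field names this source uses onto the unified schema.
    return {
        "audio_path": record.get("path") or record.get("file"),
        "text": (record.get("transcript") or record.get("sentence", "")).strip(),
        "source": source,
    }

def split_of(audio_path: str) -> str:
    # Hash-based split: stable across reruns, so no test leakage when
    # new sources are merged in later.
    bucket = int(hashlib.md5(audio_path.encode()).hexdigest(), 16) % 100
    if bucket < 70:
        return "train"
    if bucket < 85:
        return "validation"
    return "test"

row = normalize({"file": "clip_001.wav", "sentence": " how you dey "}, "source_a")
row["split"] = split_of(row["audio_path"])
print(row)
```

Hashing the file ID (rather than shuffling) means the same utterance always lands in the same split, even as sources are added or re-downloaded.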

Results

Metric      Pidgin Whisper v1   Wav2Vec2-XLSR-53 baseline
Test WER    21.37%              29.6%
Test CER    9.90%               n/a

Trained on 5.41 h, validated on 1.37 h, evaluated on 1.78 h of held-out test data. Test WER was actually slightly better than validation WER (21.96%), suggesting clean generalization with no overfitting.

Concretely: of every 5 words spoken, the model gets ~4 right, and the errors are mostly Pidgin orthographic variants (e.g. hin vs. him) and proper nouns rather than misheard semantics.
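For intuition, WER is just word-level edit distance divided by reference length. A pure-Python toy (no ASR dependencies) shows how one substitution in a five-word utterance comes out to 20%, roughly this model's error rate:

```python
# Toy WER: Levenshtein distance over words, normalized by reference length.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edits to turn first i ref words into first j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / len(r)

# One substituted word in five -> 20% WER.
print(wer("how you dey my guy", "how you they my guy"))  # 0.2
```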

How it works

  • Base model: Whisper-large-v3-turbo (809M parameters) — best ratio of accuracy, speed, and fine-tunability for the constraints.
  • Fine-tune: LoRA r=32, alpha=64 on attention q_proj and v_proj. ~3M trainable parameters.
  • Effective batch: 16 (4 per device × 4 grad-accum), LR 1e-4, fp16, 5 epochs.
  • Inference: the trained adapter is merged into the base model, exported to CTranslate2 int8, and served via faster-whisper (4–6× faster than Hugging Face Transformers on CPU).
  • Decode-time enhancements: an initial_prompt containing common Nigerian proper nouns + Pidgin function words biases the decoder toward correct vocabulary; a postprocess pass strips punctuation the labels don’t use and merges digit groups.
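A minimal sketch of what such a postprocess pass can look like. The exact regexes in the repo may differ; these rules are assumptions for illustration:

```python
import re

# Illustrative postprocess: lowercase, strip punctuation the reference labels
# don't use, merge digit groups (e.g. "1 000" -> "1000"), collapse whitespace.
PUNCT = re.compile(r"[.,!?;:'\"]")
DIGIT_GAP = re.compile(r"(?<=\d)[ ,](?=\d)")  # separator flanked by digits

def postprocess(text: str) -> str:
    text = PUNCT.sub("", text.lower())
    text = DIGIT_GAP.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

print(postprocess("Wetin dey happen, my guy? Na 1 000 naira!"))
# wetin dey happen my guy na 1000 naira
```

Normalizing the hypothesis to match the label conventions this way removes spurious WER hits that aren't really recognition errors.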

The full design notes — including every dead end I hit — are in the project documentation on GitHub.

What I learned

  • Audit your data before you train on it. I burned an afternoon pulling and uploading a 600 MB “Nigerian web” corpus before checking a sample — turned out only 0.09% of it was actually Pidgin. The rest was Standard English. A 30-second grep for Pidgin function words would have caught it.
  • Beam search rescoring fails on confident models. I trained a Pidgin-specific KenLM language model intending to do n-best rescoring of Whisper’s outputs. Beautiful idea on paper. In practice, my LoRA-finetuned Whisper returned the identical string in all 8 beams — the model is so confident that there’s nothing for the LM to rerank. To actually use a Pidgin LM, I’d need token-level shallow fusion, not n-best rescoring.
  • Decode-time tricks are underrated. A 90-token initial prompt with proper nouns + a 20-line postprocess regex bought me ~1–2 pp WER for zero engineering cost and zero inference cost. That’s better ROI than most of the bigger architectural ideas I tried.
  • Open-sourcing it forces clarity. Writing a model card and dataset card for public consumption made me re-examine a lot of decisions I’d made on autopilot.
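That 30-second check from the first lesson is easy to sketch: count the fraction of lines containing common Pidgin function words. The marker list and sample below are illustrative assumptions, not the repo's actual filter.

```python
import re

# Rough corpus audit: what fraction of lines look like Pidgin at all?
# The marker word list is a hypothetical starting point, not exhaustive.
PIDGIN_MARKERS = re.compile(r"\b(dey|wetin|abeg|wahala|sabi|chop)\b", re.I)

def pidgin_fraction(lines) -> float:
    hits = sum(1 for line in lines if PIDGIN_MARKERS.search(line))
    return hits / max(len(lines), 1)

sample = [
    "Wetin dey happen for market today",
    "The government announced a new policy yesterday",
    "Abeg no vex, I dey come",
]
print(pidgin_fraction(sample))  # 2 of 3 lines match
```

Running something like this on a random sample before uploading would have flagged the 0.09% corpus in seconds.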

What’s next

The model is shipped, but it’s a v1. Roughly in order of leverage:

  • More data. timniel/Pidgin_ASR_Dataset_Combined would add ~12 h of new Pidgin audio (~2× current data). Would target ~17–19% WER.
  • Mobile keyboard product. The end goal: a free iPhone + Android keyboard that lets Nigerians type by speaking Pidgin, monetized through rewarded video ads. Cloud inference now, on-device distilled model later.
  • Token-level LM fusion. Properly integrate the KenLM I trained at the decoder-logits level, not as post-hoc rescoring. Likely worth another ~1–3 pp of WER reduction for the right amount of work.
  • Distill for on-device. Whisper-tiny or Moonshine fine-tuned on Pidgin, quantized to int4, deployed via Core ML / TFLite. The endgame for a private, offline, latency-free keyboard.

Tech stack

PyTorch · Transformers · PEFT · faster-whisper · CTranslate2 · Silero VAD · KenLM · Kaggle (training compute) · Hugging Face (model + dataset + Space hosting) · Gradio (demo UI).


If you want to try it, the demo Space is the fastest path. If you want to read how I built it, the GitHub repo and its documentation.md have the unvarnished story.
