What Read-Along Books Actually Are
Children's picture books on iPad. Language-learning materials with synced audio. Accessibility-focused books that read themselves while highlighting current text. These are all built using ePub3's media overlays specification (officially "Media Overlays 3.0").
A media overlay is a SMIL (Synchronized Multimedia Integration Language) file that pairs each text element in the book with a time range (a clip) in an audio file. While the audio plays, the reading system highlights the corresponding text. When the user taps text, playback jumps to that clip.
The format has been around since 2011, but production tools are rare. This post covers the authoring workflow, validation, and platform-specific quirks for Apple Books, Google Play Books, and Adobe Digital Editions.
For converting other ebook formats, see our ebook converter.
ePub3 Structure
An ePub3 is a ZIP archive containing:
- mimetype (must be the first entry in the archive, stored uncompressed; identifies the file as an ePub)
- META-INF/container.xml (entry point)
- OEBPS/content.opf (manifest)
- OEBPS/nav.xhtml (navigation/table of contents)
- OEBPS/chapter1.xhtml (XHTML content)
- OEBPS/chapter1.smil (SMIL synchronization)
- OEBPS/audio/chapter1.mp3 (audio)
The SMIL file is what makes it a media-overlay book.
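The packaging rule for mimetype is easy to get wrong with a generic zip tool. Here is a minimal sketch using Python's standard zipfile module; the function name build_epub and the directory layout are assumptions for illustration, not part of any ePub toolchain.

```python
import zipfile
from pathlib import Path


def build_epub(source_dir: str, output_path: str) -> None:
    """Zip source_dir into an ePub, writing mimetype first and uncompressed."""
    root = Path(source_dir)
    with zipfile.ZipFile(output_path, "w") as archive:
        # The mimetype entry has to be first and must not be compressed.
        archive.writestr(
            zipfile.ZipInfo("mimetype"),
            "application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        for path in sorted(root.rglob("*")):
            relative = path.relative_to(root).as_posix()
            # Skip directories and the mimetype file we already wrote.
            if not path.is_file() or relative == "mimetype":
                continue
            archive.write(path, relative, compress_type=zipfile.ZIP_DEFLATED)
```

Calling build_epub("book", "book.epub") on an unpacked book directory should give an archive that EpubCheck accepts at the container level, assuming the contents themselves are valid.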
SMIL Synchronization Format
A simplified SMIL file:
<?xml version="1.0" encoding="UTF-8"?>
<smil xmlns="http://www.w3.org/ns/SMIL"
xmlns:epub="http://www.idpf.org/2007/ops"
version="3.0">
<body>
<par id="par1">
<text src="chapter1.xhtml#sentence1"/>
<audio src="audio/chapter1.mp3" clipBegin="0s" clipEnd="3.5s"/>
</par>
<par id="par2">
<text src="chapter1.xhtml#sentence2"/>
<audio src="audio/chapter1.mp3" clipBegin="3.5s" clipEnd="7.2s"/>
</par>
...
</body>
</smil>
Each <par> (parallel) element pairs a text element (identified by ID in the XHTML) with an audio segment (start and end times).
The XHTML chapter has corresponding IDs:
<p id="sentence1">Once upon a time, there was a fox.</p>
<p id="sentence2">The fox was very clever.</p>
The SMIL says "when the audio plays from 0 to 3.5 seconds, sentence1 should be highlighted."
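If the timestamps already exist as data, the SMIL can be generated rather than typed. The sketch below uses only the standard library; build_smil and its clips argument (a list of element ID, start, and end in seconds) are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"


def build_smil(text_href, audio_href, clips):
    """Return a SMIL document string; clips is [(element_id, begin_s, end_s), ...]."""
    ET.register_namespace("", SMIL_NS)  # serialize SMIL as the default namespace
    smil = ET.Element(f"{{{SMIL_NS}}}smil", {"version": "3.0"})
    body = ET.SubElement(smil, f"{{{SMIL_NS}}}body")
    for index, (element_id, begin, end) in enumerate(clips, start=1):
        if end <= begin:
            raise ValueError(f"clip for {element_id!r} ends before it begins")
        par = ET.SubElement(body, f"{{{SMIL_NS}}}par", {"id": f"par{index}"})
        # Text target: the XHTML file plus the element's ID as a fragment.
        ET.SubElement(par, f"{{{SMIL_NS}}}text", {"src": f"{text_href}#{element_id}"})
        # Audio clip: start and end offsets in seconds.
        ET.SubElement(
            par,
            f"{{{SMIL_NS}}}audio",
            {"src": audio_href, "clipBegin": f"{begin:.3f}s", "clipEnd": f"{end:.3f}s"},
        )
    return ET.tostring(smil, encoding="unicode")
```

For the two sentences above, build_smil("chapter1.xhtml", "audio/chapter1.mp3", [("sentence1", 0, 3.5), ("sentence2", 3.5, 7.2)]) yields the same two <par> elements, with times written to millisecond precision.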
Authoring Workflow
The synchronization data can be created several ways:
Manual: open the audio in Audacity, identify sentence boundaries by ear, write timestamps into SMIL by hand. Slow but precise.
Forced alignment: tools like Aeneas, Gentle, or DSAlign automatically align text and audio. Output SMIL or JSON. Very fast but accuracy varies with audio quality.
Author tools: dedicated ePub3 authoring tools (Sigil, Calibre with plugins, BookBaker) provide UI for syncing. Best for complex layouts.
For most production work, forced alignment followed by manual review is the practical choice. Tools like Aeneas process an entire chapter in seconds and produce SMIL files that are roughly 90% correct but still need cleanup.
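# Aeneas command-line example (argument order: audio, text, config, output)
python -m aeneas.tools.execute_task \
chapter1.mp3 \
chapter1.txt \
"task_language=eng|is_text_type=plain|os_task_file_format=smil|os_task_file_smil_audio_ref=audio/chapter1.mp3|os_task_file_smil_page_ref=chapter1.xhtml" \
chapter1.smil
The output SMIL pairs each line of the text file with a timestamp range. With plain text input, Aeneas assigns its own fragment IDs (f000001, f000002, and so on), so the IDs in the XHTML have to be made to match.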
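Manual review goes faster when the likely misalignments are flagged first. The sketch below assumes the same alignment was also exported with os_task_file_format=json, and that the JSON has a top-level "fragments" list whose entries carry "id", "begin", "end" (as strings), and "lines". The thresholds are illustrative guesses, not values recommended by Aeneas.

```python
def flag_suspect_fragments(sync_map, min_seconds=0.3, max_chars_per_second=30.0):
    """Return [(fragment_id, reason), ...] for clips that look misaligned."""
    suspects = []
    for fragment in sync_map["fragments"]:
        duration = float(fragment["end"]) - float(fragment["begin"])
        text = " ".join(fragment.get("lines", []))
        if duration <= 0 or duration < min_seconds:
            # Near-zero clips usually mean the aligner lost its place.
            suspects.append((fragment["id"], "clip too short"))
        elif len(text) / duration > max_chars_per_second:
            # More text than anyone could read aloud in this time.
            suspects.append((fragment["id"], "too much text for clip length"))
    return suspects
```

Load the exported file with json.load and pass the resulting dict; an empty list means nothing tripped the heuristics, not that the alignment is verified.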
Linking SMIL to ePub Content
In content.opf manifest, declare the SMIL files:
<manifest>
<item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml" media-overlay="chapter1-smil"/>
<item id="chapter1-smil" href="chapter1.smil" media-type="application/smil+xml"/>
<item id="chapter1-audio" href="audio/chapter1.mp3" media-type="audio/mpeg"/>
</manifest>
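The media-overlay="chapter1-smil" attribute on the chapter item tells readers that synchronization data exists. The package metadata must also declare a media:duration for each SMIL file and for the book as a whole; EpubCheck reports an error when these are missing.
For each chapter, repeat this pattern. Some books synchronize the whole book to one audio file; some have per-chapter audio.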
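Repeating the pattern by hand invites dangling references. A small check like the one below can catch them before EpubCheck does; it assumes content.opf uses the standard OPF namespace, and the function name broken_overlay_refs is made up for this sketch.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def broken_overlay_refs(opf_xml: str) -> list[str]:
    """List manifest items whose media-overlay attribute points nowhere useful."""
    root = ET.fromstring(opf_xml)
    items = {
        item.get("id"): item
        for item in root.findall(".//opf:manifest/opf:item", OPF_NS)
    }
    problems = []
    for item_id, item in items.items():
        overlay_id = item.get("media-overlay")
        if overlay_id is None:
            continue  # this item has no overlay; nothing to check
        target = items.get(overlay_id)
        if target is None:
            problems.append(f"{item_id}: overlay '{overlay_id}' is not in the manifest")
        elif target.get("media-type") != "application/smil+xml":
            problems.append(f"{item_id}: overlay '{overlay_id}' is not a SMIL item")
    return problems
```

An empty result only means the IDs line up; it does not check that the SMIL files themselves exist or parse.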
Audio Format Choices
| Format | Compatibility | Notes |
|---|---|---|
| MP3 | Universal | Default choice for ePub3 |
| AAC (.m4a) | Most readers | Apple Books prefers this |
| Opus | Modern readers | Smaller files |
| FLAC | Some readers | Larger files |
For Apple Books: AAC at 64-128 kbps mono produces clean audio. For Google Play Books: MP3 at 64-128 kbps. For mixed delivery: MP3 has the broadest support.
For details on audio format trade-offs, see M4B Audiobook Chapters.
Validation
Before delivery, validate with EpubCheck:
java -jar epubcheck.jar input.epub
EpubCheck verifies:
- ePub3 spec compliance
- Manifest completeness
- Media overlay references are valid
- SMIL syntax is correct
- Referenced audio files exist in the package and are declared in the manifest
- XHTML well-formed
A passing EpubCheck doesn't guarantee good user experience but catches structural errors that platforms reject.
For Apple Books specifically, additional validation:
- Use Apple Books' own validator (BookBaker has integration)
- Test on iPad and iPhone
- Verify highlighting is per-sentence, not per-paragraph for children's books
Reader Support Matrix
| Reader | Media overlays | Accessibility integration |
|---|---|---|
| Apple Books (iOS, macOS) | Yes (full) | VoiceOver compatible |
| Google Play Books | Limited | TalkBack compatible |
| Adobe Digital Editions | Yes | Limited |
| Calibre Reader | Limited | Limited |
| Kobo Reader | Yes (newer firmware) | Limited |
| Amazon Kindle | No (KFX format different) | n/a |
| Thorium Reader | Yes | Strong accessibility |
| Vivlio (formerly Bookari) | Yes | Limited |
Apple Books has the most polished media overlay implementation. Other readers vary widely in implementation quality.
For Kindle delivery, the workflow is different: KCC (Kindle Comic Creator) doesn't support media overlays. Audio Kindle books use a different mechanism through ACX.
Granularity Decisions
How small a text unit should be synchronized?
| Granularity | Pros | Cons |
|---|---|---|
| Per-paragraph | Easy to author | Less precise; less educational value |
| Per-sentence | Standard for children's books | More authoring time |
| Per-word | Maximum precision | Very labor-intensive |
| Per-syllable | Karaoke-style | Specialty use only |
For most read-along books: per-sentence. For language learning: per-word for the target language sections, per-sentence for context. For early readers: per-word with visual cues.
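Per-sentence sync needs an ID on every sentence in the XHTML. As a rough sketch, the function below splits a paragraph on sentence-ending punctuation and wraps each sentence in a span; the splitter is deliberately naive (it will break on abbreviations like "Mr."), and the name wrap_sentences and the ID prefix are assumptions.

```python
import re
from html import escape

# Split after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def wrap_sentences(paragraph: str, prefix: str = "sentence", start: int = 1) -> str:
    """Wrap each sentence of a plain-text paragraph in <span id="..."> inside a <p>."""
    sentences = [s for s in _SENTENCE_END.split(paragraph.strip()) if s]
    spans = [
        f'<span id="{prefix}{number}">{escape(sentence)}</span>'
        for number, sentence in enumerate(sentences, start)
    ]
    return "<p>" + " ".join(spans) + "</p>"
```

The same idea extends to per-word granularity by splitting on whitespace instead, at the cost of many more IDs to keep in step with the SMIL.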
File Size Reality
For a 50-page picture book with audio:
| Component | Size |
|---|---|
| Text (XHTML) | 50-200 KB |
| Images (compressed JPG) | 5-15 MB |
| Audio (MP3 64 kbps mono) | 8-15 MB (for ~15-30 min) |
| SMIL files | 50-200 KB |
| ePub package total | 15-30 MB |
For longer audiobooks integrated as ePub3: the audio dominates file size. Expect 50-150 MB for a full audiobook with synchronized text.
Pro Tip: Compress audio aggressively for read-along books. 64 kbps mono MP3 is sufficient for narration and saves bandwidth on mobile delivery. Higher bitrate doesn't improve perceived quality for spoken word.
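The audio figures follow directly from bitrate times duration. A quick sketch of the arithmetic, assuming constant-bitrate audio and ignoring container overhead (audio_size_mb is a name invented for this example):

```python
def audio_size_mb(bitrate_kbps: float, minutes: float) -> float:
    """Approximate size in megabytes (10**6 bytes) of constant-bitrate audio."""
    total_bits = bitrate_kbps * 1000 * minutes * 60
    return total_bits / 8 / 1_000_000
```

At 64 kbps, 30 minutes of narration comes to about 14.4 MB, and doubling the bitrate to 128 kbps doubles that to about 28.8 MB.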
Common Issues
Highlight doesn't follow audio: SMIL timestamps wrong. Re-run forced alignment, manually verify a few key transitions.
Apple Books shows "no audio available": SMIL or audio files missing from manifest. Verify with EpubCheck.
Tap-to-jump doesn't work: text element IDs in XHTML don't match SMIL references. Verify each <par>'s text src attribute exists in the chapter XHTML.
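Audio plays but text doesn't highlight: highlighting CSS is missing. Reading systems apply the class named by the media:active-class property in the content.opf metadata to the element being read. Many readers supply default styling; if yours doesn't, declare that property and define the class in the book's stylesheet.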
File rejected by Apple Books: media overlay structure doesn't match Apple's strict requirements. Use Apple Books' own validation tools.
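The ID check for the tap-to-jump issue above can be automated. This sketch compares every <text src> in a SMIL file against the IDs present in the chapter files; it assumes the XHTML is well-formed XML without named entities such as &nbsp;, and the function name and the href-to-source dict are choices made for this example.

```python
import xml.etree.ElementTree as ET

NS = {"smil": "http://www.w3.org/ns/SMIL"}


def missing_text_targets(smil_xml: str, xhtml_by_href: dict[str, str]) -> list[str]:
    """Return the text src values whose file or fragment ID cannot be found."""
    # Collect every id attribute in each chapter, keyed by the chapter's href.
    ids_by_href = {
        href: {el.get("id") for el in ET.fromstring(doc).iter() if el.get("id")}
        for href, doc in xhtml_by_href.items()
    }
    missing = []
    for text in ET.fromstring(smil_xml).iterfind(".//smil:text", NS):
        src = text.get("src", "")
        href, _, fragment = src.partition("#")
        if fragment not in ids_by_href.get(href, set()):
            missing.append(src)
    return missing
```

Anything it returns is a <par> that will fail to highlight or respond to taps, so fix those references before re-running EpubCheck.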
Frequently Asked Questions
Can I add media overlays to an existing ePub3?
Yes. Add the SMIL and audio files, declare them in the manifest, and add the media-overlay attribute to each chapter item. Re-zip the ePub with mimetype as the first, uncompressed entry, then validate.
Do media overlays work in Calibre?
Limited. Calibre's reader plays the audio and displays the text, but synchronization quality varies. Test on actual delivery readers.
Can I have multiple voices/narrators?
Yes. Each <par> can reference a different audio source. For dialogue, alternate audio sources between speakers.
What about ASL or sign-language video for accessibility?
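Media overlays can drive video too. Rather than swapping <audio> for <video> in the SMIL, point the <par>'s <text> element at a <video> embedded in the XHTML and omit the <audio> child. Reader support for this is uneven, so test on target devices. ASL versions of children's books use this for deaf accessibility.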
How long does it take to author a 30-page book?
With Aeneas forced alignment: 1-2 days for synchronization, more for careful review. Without automation: 1-2 weeks. The audio recording itself is separate (typically 1-3 hours per 30 pages).
Can I author ePub3 with media overlays in InDesign?
InDesign's ePub3 export doesn't include media overlays directly. Export the ePub from InDesign, then use Sigil or a programmatic tool to add SMIL synchronization.
Bottom Line
For ePub3 with media overlays: use Aeneas or similar forced alignment for initial synchronization, validate with EpubCheck, target per-sentence granularity for children's books, deliver MP3 64 kbps mono audio. Apple Books has the strongest reader support. Our ebook converter handles ePub-to-other-format conversion if your delivery target needs PDF or KFX.



