Supertonic TTS: Lightning-Fast On-Device Text-to-Speech System

Supertonic TTS is a fast, on-device text-to-speech system designed to run with minimal computational overhead. It runs entirely on your device using ONNX Runtime, so no text or audio leaves the device and speech generation incurs no network latency.

What is Supertonic TTS?

Supertonic TTS converts text to speech directly on your device without requiring cloud services or internet connectivity. The system generates speech at speeds up to 167 times faster than real-time on consumer hardware, making it suitable for real-time applications. With only 66 million parameters, Supertonic TTS maintains a small footprint while delivering natural-sounding speech output.

The system handles complex text inputs naturally, processing numbers, dates, currency, abbreviations, and technical expressions without requiring pre-processing. This capability makes it easy to integrate into applications that need to speak various types of content.

Key Features

Fast Performance: Generates speech up to 167 times faster than real-time on consumer hardware
Small Model Size: Only 66 million parameters for efficient deployment
Complete Privacy: All processing happens on-device with no cloud connections
Natural Text Handling: Processes complex text inputs without special formatting
Configurable Settings: Adjust inference steps and parameters to match your needs
Cross-Platform Support: Works with Python, Node.js, Java, C++, C#, Go, Swift, iOS, Rust, and Flutter
No Network Latency: Speech is generated locally, so there are no network round trips

Technical Architecture

Supertonic TTS uses a speech autoencoder and a flow-matching-based text-to-latent module for efficient speech generation. The system employs Length-Aware Rotary Position Embedding (LARoPE) to improve text-speech alignment in its cross-attention layers. Training uses a self-purification technique that keeps the model robust to noisy or unreliable labels.
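
To make the role of the inference step count more concrete, here is a minimal sketch of how a flow-matching sampler is commonly run: a velocity field is integrated from t = 0 to t = 1 in a fixed number of Euler steps. This is a generic illustration, not Supertonic's actual sampler. The names euler_sample, toy_velocity, and TARGET are hypothetical, and the toy velocity field stands in for the trained text-to-latent network.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    dt = 1.0 / num_steps
    x = list(x0)
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0 inside the loop
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Assumption: a toy stand-in for a trained network. It describes a
# straight-line flow from the starting point toward a fixed target latent.
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [(target_i - xi) / (1.0 - t) for target_i, xi in zip(TARGET, x)]


latent = euler_sample(toy_velocity, x0=[0.0, 0.0, 0.0], num_steps=5)
print(latent)
```

Because the toy field is a straight line, any step count lands on the target. With a learned velocity field, fewer steps generally run faster at some cost in accuracy, which is the trade-off a configurable step count exposes.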

The system is built on ONNX Runtime for cross-platform inference, with CPU-optimized processing and optional GPU acceleration. Browser support is provided through onnxruntime-web for client-side inference. Generated audio is written as 16-bit WAV files.
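
As an illustration of the 16-bit WAV output format, the sketch below converts floating-point samples to 16-bit PCM and writes them with Python's standard wave module. It is not taken from the Supertonic sample code. The helper names float_to_pcm16 and write_wav16 are hypothetical, the 44100 Hz sample rate is an assumption, and a short sine tone stands in for synthesized speech.

```python
import io
import math
import struct
import wave


def float_to_pcm16(samples):
    """Clip floats to [-1.0, 1.0] and scale them to signed 16-bit integers."""
    return [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]


def write_wav16(target, samples, sample_rate):
    """Write mono float samples as 16-bit PCM WAV to a path or binary file object."""
    pcm = float_to_pcm16(samples)
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16 bits
        wav.setframerate(sample_rate)
        # Little-endian signed shorts, as the WAV format expects
        wav.writeframes(struct.pack("<%dh" % len(pcm), *pcm))


SAMPLE_RATE = 44100  # assumption: use whatever rate your model actually produces

# Placeholder audio: 0.1 s of a 440 Hz tone instead of real model output
tone = [
    0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 10)
]

buffer = io.BytesIO()  # pass a file path instead to write to disk
write_wav16(buffer, tone, SAMPLE_RATE)
```

Clipping before scaling prevents out-of-range samples from overflowing the 16-bit integer range when the data is packed.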

Performance

Throughput scales with the available hardware. On an M4 Pro using the CPU, the system generates 912 to 1263 characters per second. With WebGPU acceleration on the M4 Pro, this rises to 996 to 2509 characters per second. On high-end hardware like the RTX 4090, the system reaches 2615 to 12164 characters per second.
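
To put these throughput figures in perspective, the short calculation below estimates how long a text of a given length would take to synthesize at each end of the reported ranges. The function name synthesis_seconds and the 1000-character input length are illustrative assumptions. The throughput ranges are the ones quoted above.

```python
def synthesis_seconds(num_chars, chars_per_second):
    """Estimated synthesis time for num_chars at a given throughput."""
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    return num_chars / chars_per_second


# (low, high) characters per second, as reported above
THROUGHPUT = {
    "M4 Pro (CPU)": (912, 1263),
    "M4 Pro (WebGPU)": (996, 2509),
    "RTX 4090": (2615, 12164),
}

TEXT_LENGTH = 1000  # hypothetical input of 1000 characters

for hardware, (low, high) in THROUGHPUT.items():
    slowest = synthesis_seconds(TEXT_LENGTH, low)
    fastest = synthesis_seconds(TEXT_LENGTH, high)
    print(f"{hardware}: {fastest:.2f} s to {slowest:.2f} s")
```

On the M4 Pro CPU, for example, a 1000-character text works out to roughly 0.8 to 1.1 seconds of synthesis time.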

The real-time factor (RTF), the ratio of generation time to the duration of the generated audio, is as low as 0.005 on RTX 4090, 0.012 on M4 Pro CPU, and 0.006 on M4 Pro with WebGPU. An RTF below 1 means audio is generated faster than it plays back, and values this small mean generation takes only a tiny fraction of the playback time.
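
The relationship between the real-time factor and the "times faster than real-time" figure is simple arithmetic: the speedup is the reciprocal of the RTF. The sketch below works this out for the three values quoted above. The function names are illustrative and not part of any Supertonic API.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    return generation_seconds / audio_seconds


def speedup_from_rtf(rtf):
    """How many times faster than playback the generation runs."""
    return 1.0 / rtf


# Best-case RTF values as reported above
RTF = {
    "RTX 4090": 0.005,
    "M4 Pro (CPU)": 0.012,
    "M4 Pro (WebGPU)": 0.006,
}

for hardware, rtf in RTF.items():
    print(f"{hardware}: RTF {rtf} -> about {speedup_from_rtf(rtf):.0f}x real time")
```

An RTF of 0.006 therefore corresponds to roughly 167 times faster than real time, 0.005 to about 200 times, and 0.012 to about 83 times.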

Use Cases

Supertonic TTS is suitable for various applications including accessibility features, interactive applications, educational software, productivity tools, mobile applications, embedded systems, gaming applications, content creation tools, assistive technology, and navigation systems.

License

The sample code is released under the MIT License. The accompanying model is released under the OpenRAIL-M License. The model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project.