rmvpe_onnx.model¶
RMVPE model and feature extraction (ONNX Runtime backend).
This module implements a pure NumPy/SciPy mel spectrogram frontend and an ONNX Runtime-based RMVPE pitch estimator.
MelSpectrogram computes log-mel spectrograms without PyTorch. RMVPE performs F0 estimation using a pre-trained ONNX model.
References
RVC Project (MIT License), code: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion Copyright (c) 2023 liujing04, 源文雨, Ftps
VoiceConversionWebUI (MIT License), model: https://huggingface.co/lj1995/VoiceConversionWebUI Copyright (c) 2022 lj1995
Classes¶
MelSpectrogram – Pure NumPy/SciPy mel spectrogram — no PyTorch dependency.
RMVPE – RMVPE pitch estimator using ONNX Runtime.
- class rmvpe_onnx.model.MelSpectrogram(n_mel_channels=128, sampling_rate=16000, win_length=1024, hop_length=160, n_fft=None, mel_fmin=30, mel_fmax=8000, clamp=1e-05)¶
Bases: object
Pure NumPy/SciPy mel spectrogram — no PyTorch dependency.
- __call__(audio, keyshift=0, speed=1, center=True)¶
Compute log-mel spectrogram from audio.
- Parameters:
audio (np.ndarray [shape=(N,)]) – Mono audio signal.
keyshift (float, optional) – Pitch shift in semitones. Affects FFT size and frequency scaling. Default is 0 (no shift).
speed (float, optional) – Time-stretch factor. Affects hop length. Default is 1 (no change).
center (bool, optional) – If True, pad the signal so frames are centered. Default is True.
- Returns:
Log-mel spectrogram.
- Return type:
np.ndarray [shape=(n_mels, T)]
Notes
Uses a Hann window and strided STFT implementation.
Output is natural log of mel energies with lower bound clipping.
Frequency resolution may change when keyshift is applied.
Examples
>>> import numpy as np
>>> from rmvpe_onnx import MelSpectrogram
>>> sr = 16000
>>> t = np.linspace(0, 1, sr, endpoint=False)
>>> audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)
>>> mel = MelSpectrogram()
>>> spec = mel(audio)
>>> spec.shape[0]
128
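The Hann-window, strided STFT mentioned in the Notes can be sketched in pure NumPy. This is an illustrative reimplementation, not the actual rmvpe_onnx frontend; the frame_signal helper is hypothetical, with defaults mirroring the documented win_length=1024 and hop_length=160:

```python
import numpy as np

def frame_signal(audio, win_length=1024, hop_length=160):
    """Frame a 1-D signal into overlapping windows via stride tricks,
    then apply a Hann window and an FFT per frame.

    Illustrative sketch only -- not part of the rmvpe_onnx API.
    """
    n_frames = 1 + (len(audio) - win_length) // hop_length
    stride = audio.strides[0]
    # Overlapping frames as a zero-copy view of the input buffer.
    frames = np.lib.stride_tricks.as_strided(
        audio,
        shape=(n_frames, win_length),
        strides=(hop_length * stride, stride),
    )
    window = np.hanning(win_length)  # Hann window, as stated in the Notes
    return np.fft.rfft(frames * window, axis=1)

audio = np.random.default_rng(0).standard_normal(16000).astype(np.float32)
spec = frame_signal(audio)
print(spec.shape)  # (94, 513): 1 + (16000 - 1024) // 160 frames, 1024 // 2 + 1 bins
```

The real frontend additionally maps the magnitude spectrogram through a mel filterbank and a clamped natural log; this sketch stops at the complex STFT.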
- __init__(n_mel_channels=128, sampling_rate=16000, win_length=1024, hop_length=160, n_fft=None, mel_fmin=30, mel_fmax=8000, clamp=1e-05)¶
- class rmvpe_onnx.model.RMVPE(model_path=None, device=None)¶
Bases: object
RMVPE pitch estimator using ONNX Runtime.
This class estimates fundamental frequency (F0) from audio using a pre-trained RMVPE ONNX model. It provides a lightweight alternative to PyTorch-based implementations and follows a similar interface to crepe.predict().
- Parameters:
model_path (str or Path or None, optional) – Path to rmvpe.onnx.
- None: use the default model path.
- If the file does not exist, it will be downloaded automatically.
- Custom paths and filenames are supported.
device (str or None, optional) – Execution device for ONNX Runtime. Supported values include: 'cpu', 'cuda', 'cuda:1', 'dml', 'rocm', 'coreml', 'tensorrt', 'openvino'.
- None: automatically select the best available provider.
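As an illustration of how these device strings could map onto ONNX Runtime execution providers, here is a minimal sketch. The resolve_providers helper and its fallback behavior are assumptions, not the library's actual selection logic; the provider names themselves are standard ONNX Runtime identifiers:

```python
# Hypothetical mapping from device strings to ONNX Runtime execution
# providers. A sketch of how RMVPE(device=...) might be resolved; the
# actual rmvpe_onnx logic may differ.
PROVIDERS = {
    "cpu": "CPUExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "dml": "DmlExecutionProvider",
    "rocm": "ROCMExecutionProvider",
    "coreml": "CoreMLExecutionProvider",
    "tensorrt": "TensorrtExecutionProvider",
    "openvino": "OpenVINOExecutionProvider",
}

def resolve_providers(device):
    """Turn 'cuda:1'-style strings into a provider list with a CPU fallback."""
    if device is None:
        return ["CPUExecutionProvider"]  # stand-in for auto-selection
    name, _, index = device.partition(":")
    provider = PROVIDERS[name]
    if provider == "CPUExecutionProvider":
        return [provider]
    if index:  # e.g. 'cuda:1' pins a specific device id
        provider = (provider, {"device_id": int(index)})
    return [provider, "CPUExecutionProvider"]

print(resolve_providers("cuda:1"))
# [('CUDAExecutionProvider', {'device_id': 1}), 'CPUExecutionProvider']
```

A list in this shape is what onnxruntime.InferenceSession(model_path, providers=...) accepts.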
Notes
Uses a NumPy-based mel spectrogram frontend (no PyTorch dependency).
Audio is internally resampled to 16 kHz and downmixed to mono.
Frame hop is 160 samples (~10 ms at 16 kHz).
The model is loaded via ensure_model() and cached locally.
Examples
>>> import soundfile as sf
>>> from rmvpe_onnx import RMVPE
>>> audio, sr = sf.read("assets/example.wav")
>>> rmvpe = RMVPE()
>>> time, frequency, confidence, activation = rmvpe.predict(audio, sr)
>>> len(time) == len(frequency) == len(confidence)
True
- __init__(model_path=None, device=None)¶
- predict(audio, sr)¶
Estimate fundamental frequency (F0) from audio.
This method follows a similar interface to crepe.predict(), returning time, frequency, confidence, and activation.
- Parameters:
audio (np.ndarray [shape=(N,) or (N, C)]) – Audio samples. Multichannel audio will be downmixed to mono. Expected dtype is float-like; values are typically in [-1, 1].
sr (int) – Sample rate of the input audio. Audio will be resampled to 16 kHz internally if needed.
- Return type:
tuple[ndarray, ndarray, ndarray, ndarray]
- Returns:
time (np.ndarray [shape=(T,)]) – Timestamps in seconds for each frame (~10 ms resolution).
frequency (np.ndarray [shape=(T,)]) – Estimated pitch in Hz.
confidence (np.ndarray [shape=(T,)]) – Voicing confidence in the range [0, 1].
activation (np.ndarray [shape=(T, 360)]) – Raw salience over pitch bins (~20-cent resolution).
Notes
Internally resamples audio to 16 kHz.
Uses a hop length of 160 samples (~10 ms per frame).
Pitch is computed via local averaging in the log-frequency domain.
Unvoiced frames may have low confidence and unstable frequency.
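The "local averaging in the log-frequency domain" step can be sketched as CREPE-style decoding of the (T, 360) activation: take a small window around each frame's argmax bin, average the bin centers (in cents) weighted by salience, and convert cents to Hz. The decode_f0 helper, the window radius, and the cents offset follow the CREPE convention and are assumptions about rmvpe_onnx internals:

```python
import numpy as np

# 360 bin centers spaced 20 cents apart; the offset below is the CREPE
# convention and is assumed, not confirmed, for rmvpe_onnx.
CENTS = 20 * np.arange(360) + 1997.3794084419306

def decode_f0(activation, radius=4):
    """Salience-weighted average of cents in a window around each
    frame's argmax bin, then cents -> Hz. Illustrative sketch only."""
    pad = np.pad(activation, ((0, 0), (radius, radius)))
    cents_pad = np.pad(CENTS, radius)  # zero cents beyond the bin range
    centers = activation.argmax(axis=1) + radius  # argmax in padded coords
    cents = np.array([
        np.dot(pad[t, c - radius:c + radius + 1],
               cents_pad[c - radius:c + radius + 1])
        / pad[t, c - radius:c + radius + 1].sum()
        for t, c in enumerate(centers)
    ])
    return 10 * 2 ** (cents / 1200)  # cents relative to 10 Hz -> Hz

act = np.zeros((1, 360), dtype=np.float32)
act[0, 180] = 1.0  # one-hot salience at bin 180
print(decode_f0(act))  # 10 * 2 ** (CENTS[180] / 1200), about 253.6 Hz
```

Real decoding also handles unvoiced frames (e.g. via a salience threshold), which is why low-confidence frames can carry unstable frequency values.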
Examples
>>> import soundfile as sf
>>> from rmvpe_onnx import RMVPE
>>> audio, sr = sf.read("assets/example.wav")
>>> rmvpe = RMVPE()
>>> time, frequency, confidence, activation = rmvpe.predict(audio, sr)
>>> len(time) == len(frequency) == len(confidence)
True