rmvpe_onnx.model

RMVPE model and feature extraction (ONNX Runtime backend).

This module implements a pure NumPy/SciPy mel spectrogram frontend and an ONNX Runtime-based RMVPE pitch estimator.

  • MelSpectrogram computes log-mel spectrograms without PyTorch.

  • RMVPE performs F0 estimation using a pre-trained ONNX model.

Classes

MelSpectrogram

Pure NumPy/SciPy mel spectrogram — no PyTorch dependency.

RMVPE

RMVPE pitch estimator using ONNX Runtime.

class rmvpe_onnx.model.MelSpectrogram(n_mel_channels=128, sampling_rate=16000, win_length=1024, hop_length=160, n_fft=None, mel_fmin=30, mel_fmax=8000, clamp=1e-05)

Bases: object

Pure NumPy/SciPy mel spectrogram — no PyTorch dependency.

__call__(audio, keyshift=0, speed=1, center=True)

Compute log-mel spectrogram from audio.

Parameters:
  • audio (np.ndarray [shape=(N,)]) – Mono audio signal.

  • keyshift (float, optional) – Pitch shift in semitones. Affects FFT size and frequency scaling. Default is 0 (no shift).

  • speed (float, optional) – Time-stretch factor. Affects hop length. Default is 1 (no change).

  • center (bool, optional) – If True, pad the signal so frames are centered. Default is True.

Returns:

Log-mel spectrogram.

Return type:

np.ndarray [shape=(n_mels, T)]

Notes

  • Uses a Hann window and a strided STFT implementation.

  • Output is natural log of mel energies with lower bound clipping.

  • Frequency resolution may change when keyshift is applied.

Examples

>>> import numpy as np
>>> from rmvpe_onnx import MelSpectrogram
>>> sr = 16000
>>> t = np.linspace(0, 1, sr, endpoint=False)
>>> audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)
>>> mel = MelSpectrogram()
>>> spec = mel(audio)
>>> spec.shape[0]
128
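The log-mel frontend can be approximated in plain NumPy/SciPy. The sketch below is illustrative rather than the module's actual implementation: it mirrors the constructor defaults (128 mels, 16 kHz, 1024-point Hann window, hop 160, 30–8000 Hz, clamp 1e-05), and the helper names (`log_mel_sketch`, `_hz_to_mel`, `_mel_to_hz`) are hypothetical.

```python
import numpy as np
from scipy.signal import get_window

def _hz_to_mel(f):
    # HTK-style mel scale.
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def _mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

def log_mel_sketch(audio, sr=16000, n_fft=1024, hop=160,
                   n_mels=128, fmin=30.0, fmax=8000.0, clamp=1e-5):
    # Center-pad so frames are centered on their timestamps.
    pad = n_fft // 2
    x = np.pad(audio, (pad, pad), mode="reflect")

    # Strided framing + Hann window + real FFT -> power spectrogram.
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * get_window("hann", n_fft, fftbins=True)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2  # (T, n_fft//2+1)

    # Triangular mel filterbank, built in pure NumPy.
    mel_pts = np.linspace(_hz_to_mel(fmin), _hz_to_mel(fmax), n_mels + 2)
    hz_pts = _mel_to_hz(mel_pts)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rise = (fft_freqs - lo) / (ctr - lo)
        fall = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.clip(np.minimum(rise, fall), 0.0, None)

    # Mel energies, clamped from below, then natural log -> (n_mels, T).
    mel = fb @ power.T
    return np.log(np.clip(mel, clamp, None))
```

For one second of 16 kHz audio this yields 101 centered frames, matching a hop of 160 samples.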
__init__(n_mel_channels=128, sampling_rate=16000, win_length=1024, hop_length=160, n_fft=None, mel_fmin=30, mel_fmax=8000, clamp=1e-05)
class rmvpe_onnx.model.RMVPE(model_path=None, device=None)

Bases: object

RMVPE pitch estimator using ONNX Runtime.

This class estimates fundamental frequency (F0) from audio using a pre-trained RMVPE ONNX model. It provides a lightweight alternative to PyTorch-based implementations and follows a similar interface to crepe.predict().

Parameters:
  • model_path (str or Path or None, optional) –

    Path to rmvpe.onnx.

    • None: use the default model path

    • If the file does not exist, it will be downloaded automatically

    • Custom paths and filenames are supported

  • device (str or None, optional) –

    Execution device for ONNX Runtime.

    Supported values include: 'cpu', 'cuda', 'cuda:1', 'dml', 'rocm', 'coreml', 'tensorrt', 'openvino'.

    • None: automatically select the best available provider

Notes

  • Uses a NumPy-based mel spectrogram frontend (no PyTorch dependency).

  • Audio is internally resampled to 16 kHz and downmixed to mono.

  • Frame hop is 160 samples (10 ms at 16 kHz).

  • The model is loaded via ensure_model() and cached locally.

Examples

>>> import soundfile as sf
>>> from rmvpe_onnx import RMVPE
>>> audio, sr = sf.read("assets/example.wav")  
>>> rmvpe = RMVPE()  
>>> time, frequency, confidence, activation = rmvpe.predict(audio, sr)  
>>> len(time) == len(frequency) == len(confidence)  
True
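The device strings above map naturally onto ONNX Runtime execution provider lists. The sketch below is a hypothetical illustration of such a mapping, not the class's actual selection logic; the `providers_for` name and the fallback ordering are assumptions, while the provider names themselves are standard ONNX Runtime identifiers.

```python
# Illustrative device-string -> ONNX Runtime provider mapping.
# Each accelerated device falls back to CPU if unavailable.
_PROVIDERS = {
    "cpu": ["CPUExecutionProvider"],
    "cuda": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    "dml": ["DmlExecutionProvider", "CPUExecutionProvider"],
    "rocm": ["ROCMExecutionProvider", "CPUExecutionProvider"],
    "coreml": ["CoreMLExecutionProvider", "CPUExecutionProvider"],
    "tensorrt": ["TensorrtExecutionProvider", "CUDAExecutionProvider",
                 "CPUExecutionProvider"],
    "openvino": ["OpenVINOExecutionProvider", "CPUExecutionProvider"],
}

def providers_for(device):
    """Return a provider list for a device string like 'cuda' or 'cuda:1'.

    A ':N' suffix selects a device index, passed as the provider's
    device_id option. None defers to ONNX Runtime's own defaults.
    """
    if device is None:
        return None
    name, _, index = device.partition(":")
    providers = _PROVIDERS[name.lower()]
    if index:
        # Attach a device_id to the first (accelerated) provider.
        head = (providers[0], {"device_id": int(index)})
        return [head] + providers[1:]
    return list(providers)
```

A list like this could be handed to `onnxruntime.InferenceSession(..., providers=...)`.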
__init__(model_path=None, device=None)
predict(audio, sr)

Estimate fundamental frequency (F0) from audio.

This method follows a similar interface to crepe.predict(), returning time, frequency, confidence, and activation.

Parameters:
  • audio (np.ndarray [shape=(N,) or (N, C)]) – Audio samples. Multichannel audio will be downmixed to mono. Expected dtype is float-like; values are typically in [-1, 1].

  • sr (int) – Sample rate of the input audio. Audio will be resampled to 16 kHz internally if needed.

Return type:

tuple[ndarray, ndarray, ndarray, ndarray]

Returns:

  • time (np.ndarray [shape=(T,)]) – Timestamps in seconds for each frame (~10 ms resolution).

  • frequency (np.ndarray [shape=(T,)]) – Estimated pitch in Hz.

  • confidence (np.ndarray [shape=(T,)]) – Voicing confidence in the range [0, 1].

  • activation (np.ndarray [shape=(T, 360)]) – Raw salience over pitch bins (~20-cent resolution).

Notes

  • Internally resamples audio to 16 kHz.

  • Uses a hop length of 160 samples (10 ms per frame).

  • Pitch is computed via local averaging in the log-frequency domain.

  • Unvoiced frames may have low confidence and unstable frequency.
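The "local averaging in the log-frequency domain" step can be sketched with a CREPE-style decoder over the 360 salience bins (~20 cents apart). The bin-to-cents constants below are borrowed from CREPE and assumed here; the module's own decoder may use different constants or a different neighborhood width.

```python
import numpy as np

# CREPE-style bin -> cents mapping: 360 bins, 20 cents apart, starting
# near C1 (~32.70 Hz). The offset constant is an assumption borrowed
# from CREPE, not confirmed by this module's documentation.
CENTS_PER_BIN = 20.0
CENTS_OFFSET = 1997.3794084376191
_cents = CENTS_PER_BIN * np.arange(360) + CENTS_OFFSET

def decode_f0(activation):
    """Decode (T, 360) salience into per-frame F0 via local averaging."""
    n_frames = activation.shape[0]
    f0 = np.zeros(n_frames)
    centers = activation.argmax(axis=1)
    for t, c in enumerate(centers):
        lo, hi = max(0, c - 4), min(360, c + 5)      # 9-bin neighborhood
        w = activation[t, lo:hi]
        cents = (w * _cents[lo:hi]).sum() / w.sum()  # salience-weighted mean
        f0[t] = 10.0 * 2.0 ** (cents / 1200.0)       # cents -> Hz
    return f0
```

Averaging in cents rather than Hz gives sub-bin resolution while keeping the estimate local to the salience peak.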

Examples

>>> import soundfile as sf
>>> from rmvpe_onnx import RMVPE
>>> audio, sr = sf.read("assets/example.wav")  
>>> rmvpe = RMVPE()  
>>> time, frequency, confidence, activation = rmvpe.predict(audio, sr)  
>>> len(time) == len(frequency) == len(confidence)  
True
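Because unvoiced frames may carry unstable frequency values, a common post-processing step is to mask frames whose confidence falls below a threshold. The helper below is a sketch; the 0.5 default is an illustrative choice, not a value prescribed by the model.

```python
import numpy as np

def mask_unvoiced(frequency, confidence, threshold=0.5):
    """Set frequency to NaN where voicing confidence is below threshold.

    The 0.5 default is illustrative; tune it per recording and task.
    """
    f0 = frequency.astype(float).copy()
    f0[confidence < threshold] = np.nan
    return f0
```

Applied after `rmvpe.predict(audio, sr)`, e.g. `voiced_f0 = mask_unvoiced(frequency, confidence)`, this leaves only frames the model considers voiced.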