rmvpe_onnx.model

RMVPE model and feature extraction (ONNX Runtime backend).

This module implements a pure NumPy/SciPy mel spectrogram frontend and an ONNX Runtime-based RMVPE pitch estimator.

  • MelSpectrogram computes log-mel spectrograms without PyTorch.

  • RMVPE performs F0 estimation using a pre-trained ONNX model.

Classes

MelSpectrogram

Pure NumPy/SciPy mel spectrogram — no PyTorch dependency.

RMVPE

RMVPE pitch estimator using ONNX Runtime.

class rmvpe_onnx.model.MelSpectrogram(n_mel_channels=128, sampling_rate=16000, win_length=1024, hop_length=160, n_fft=None, mel_fmin=30, mel_fmax=8000, clamp=1e-05)

Bases: object

Pure NumPy/SciPy mel spectrogram — no PyTorch dependency.

__call__(audio, keyshift=0, speed=1, center=True)

Compute log-mel spectrogram from audio.

Parameters:
  • audio (np.ndarray [shape=(N,)]) – Mono audio signal.

  • keyshift (float, optional) – Pitch shift in semitones. Affects FFT size and frequency scaling. Default is 0 (no shift).

  • speed (float, optional) – Time-stretch factor. Affects hop length. Default is 1 (no change).

  • center (bool, optional) – If True, pad the signal so frames are centered. Default is True.

Returns:

Log-mel spectrogram.

Return type:

np.ndarray [shape=(n_mels, T)]

Notes

  • Uses a Hann window and a strided STFT implementation.

  • Output is natural log of mel energies with lower bound clipping.

  • Frequency resolution may change when keyshift is applied.

Examples

>>> import numpy as np
>>> from rmvpe_onnx import MelSpectrogram
>>> sr = 16000
>>> t = np.linspace(0, 1, sr, endpoint=False)
>>> audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)
>>> mel = MelSpectrogram()
>>> spec = mel(audio)
>>> spec.shape[0]
128
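The log-mel frontend can be approximated in plain NumPy/SciPy. The sketch below is illustrative rather than the module's actual implementation: it mirrors the constructor defaults (128 mels, 16 kHz, 1024-point Hann window, hop 160, 30–8000 Hz, clamp 1e-05), and the helper names (`log_mel_sketch`, `_hz_to_mel`, `_mel_to_hz`) are hypothetical.

```python
import numpy as np
from scipy.signal import get_window

def _hz_to_mel(f):
    # HTK-style mel scale.
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def _mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

def log_mel_sketch(audio, sr=16000, n_fft=1024, hop=160,
                   n_mels=128, fmin=30.0, fmax=8000.0, clamp=1e-5):
    # Center-pad so frames are centered on their timestamps.
    pad = n_fft // 2
    x = np.pad(audio, (pad, pad), mode="reflect")

    # Strided framing + Hann window + real FFT -> power spectrogram.
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * get_window("hann", n_fft, fftbins=True)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2  # (T, n_fft//2+1)

    # Triangular mel filterbank, built in pure NumPy.
    mel_pts = np.linspace(_hz_to_mel(fmin), _hz_to_mel(fmax), n_mels + 2)
    hz_pts = _mel_to_hz(mel_pts)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rise = (fft_freqs - lo) / (ctr - lo)
        fall = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.clip(np.minimum(rise, fall), 0.0, None)

    # Mel energies, clamped from below, then natural log -> (n_mels, T).
    mel = fb @ power.T
    return np.log(np.clip(mel, clamp, None))
```

For one second of 16 kHz audio this yields 101 centered frames, matching a hop of 160 samples.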
__init__(n_mel_channels=128, sampling_rate=16000, win_length=1024, hop_length=160, n_fft=None, mel_fmin=30, mel_fmax=8000, clamp=1e-05)
class rmvpe_onnx.model.RMVPE(model_path=None, device=None)

Bases: object

RMVPE pitch estimator using ONNX Runtime.

This class estimates fundamental frequency (F0) from audio using a pre-trained RMVPE ONNX model. It provides a lightweight alternative to PyTorch-based implementations and follows a similar interface to crepe.predict().

Parameters:
  • model_path (str or Path or None, optional) –

    Path to rmvpe.onnx.

    • None: use the default model path

    • If the file does not exist, it will be downloaded automatically

    • Custom paths and filenames are supported

  • device (str or None, optional) –

    Execution device for ONNX Runtime.

    Supported values include: 'cpu', 'cuda', 'cuda:1', 'dml', 'rocm', 'coreml', 'tensorrt', 'openvino'.

    • None: automatically select the best available provider

Notes

  • Uses a NumPy-based mel spectrogram frontend (no PyTorch dependency).

  • Audio is internally resampled to 16 kHz and downmixed to mono.

  • Frame hop is 160 samples (10 ms at 16 kHz).

  • The model is loaded via ensure_model() and cached locally.

Examples

>>> import soundfile as sf
>>> from rmvpe_onnx import RMVPE
>>> audio, sr = sf.read("assets/example.wav")  
>>> rmvpe = RMVPE()  
>>> time, frequency, confidence, activation = rmvpe.predict(audio, sr)  
>>> len(time) == len(frequency) == len(confidence)  
True
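The device strings above map naturally onto ONNX Runtime execution provider lists. The sketch below is a hypothetical illustration of such a mapping, not the class's actual selection logic; the `providers_for` name and the fallback ordering are assumptions, while the provider names themselves are standard ONNX Runtime identifiers.

```python
# Illustrative device-string -> ONNX Runtime provider mapping.
# Each accelerated device falls back to CPU if unavailable.
_PROVIDERS = {
    "cpu": ["CPUExecutionProvider"],
    "cuda": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    "dml": ["DmlExecutionProvider", "CPUExecutionProvider"],
    "rocm": ["ROCMExecutionProvider", "CPUExecutionProvider"],
    "coreml": ["CoreMLExecutionProvider", "CPUExecutionProvider"],
    "tensorrt": ["TensorrtExecutionProvider", "CUDAExecutionProvider",
                 "CPUExecutionProvider"],
    "openvino": ["OpenVINOExecutionProvider", "CPUExecutionProvider"],
}

def providers_for(device):
    """Return a provider list for a device string like 'cuda' or 'cuda:1'.

    A ':N' suffix selects a device index, passed as the provider's
    device_id option. None defers to ONNX Runtime's own defaults.
    """
    if device is None:
        return None
    name, _, index = device.partition(":")
    providers = _PROVIDERS[name.lower()]
    if index:
        # Attach a device_id to the first (accelerated) provider.
        head = (providers[0], {"device_id": int(index)})
        return [head] + providers[1:]
    return list(providers)
```

A list like this could be handed to `onnxruntime.InferenceSession(..., providers=...)`.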
__init__(model_path=None, device=None)
predict(audio, sr)

Estimate fundamental frequency (F0) from audio.

This method follows a similar interface to crepe.predict(), returning time, frequency, confidence, and activation.

Parameters:
  • audio (np.ndarray [shape=(N,) or (N, C)]) – Audio samples. Multichannel audio will be downmixed to mono. Expected dtype is float-like; values are typically in [-1, 1].

  • sr (int) – Sample rate of the input audio. Audio will be resampled to 16 kHz internally if needed.

Return type:

tuple[ndarray, ndarray, ndarray, ndarray]

Returns:

  • time (np.ndarray [shape=(T,)]) – Timestamps in seconds for each frame (~10 ms resolution).

  • frequency (np.ndarray [shape=(T,)]) – Estimated pitch in Hz.

  • confidence (np.ndarray [shape=(T,)]) – Voicing confidence in the range [0, 1].

  • activation (np.ndarray [shape=(T, 360)]) – Raw salience over pitch bins (~20-cent resolution).

Notes

  • Internally resamples audio to 16 kHz.

  • Uses a hop length of 160 samples (10 ms per frame).

  • Pitch is computed via local averaging in the log-frequency domain.

  • Unvoiced frames may have low confidence and unstable frequency.
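The "local averaging in the log-frequency domain" step can be sketched with a CREPE-style decoder over the 360 salience bins (~20 cents apart). The bin-to-cents constants below are borrowed from CREPE and assumed here; the module's own decoder may use different constants or a different neighborhood width.

```python
import numpy as np

# CREPE-style bin -> cents mapping: 360 bins, 20 cents apart, starting
# near C1 (~32.70 Hz). The offset constant is an assumption borrowed
# from CREPE, not confirmed by this module's documentation.
CENTS_PER_BIN = 20.0
CENTS_OFFSET = 1997.3794084376191
_cents = CENTS_PER_BIN * np.arange(360) + CENTS_OFFSET

def decode_f0(activation):
    """Decode (T, 360) salience into per-frame F0 via local averaging."""
    n_frames = activation.shape[0]
    f0 = np.zeros(n_frames)
    centers = activation.argmax(axis=1)
    for t, c in enumerate(centers):
        lo, hi = max(0, c - 4), min(360, c + 5)      # 9-bin neighborhood
        w = activation[t, lo:hi]
        cents = (w * _cents[lo:hi]).sum() / w.sum()  # salience-weighted mean
        f0[t] = 10.0 * 2.0 ** (cents / 1200.0)       # cents -> Hz
    return f0
```

Averaging in cents rather than Hz gives sub-bin resolution while keeping the estimate local to the salience peak.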

Examples

>>> import soundfile as sf
>>> from rmvpe_onnx import RMVPE
>>> audio, sr = sf.read("assets/example.wav")  
>>> rmvpe = RMVPE()  
>>> time, frequency, confidence, activation = rmvpe.predict(audio, sr)  
>>> len(time) == len(frequency) == len(confidence)  
True
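Because unvoiced frames may carry unstable frequency values, a common post-processing step is to mask frames whose confidence falls below a threshold. The helper below is a sketch; the 0.5 default is an illustrative choice, not a value prescribed by the model.

```python
import numpy as np

def mask_unvoiced(frequency, confidence, threshold=0.5):
    """Set frequency to NaN where voicing confidence is below threshold.

    The 0.5 default is illustrative; tune it per recording and task.
    """
    f0 = frequency.astype(float).copy()
    f0[confidence < threshold] = np.nan
    return f0
```

Applied after `rmvpe.predict(audio, sr)`, e.g. `voiced_f0 = mask_unvoiced(frequency, confidence)`, this leaves only frames the model considers voiced.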