vocalpy.segment.ava
vocalpy.segment.ava(sound: Sound, nperseg: int = 1024, noverlap: int = 512, min_freq: int = 30000.0, max_freq: int = 110000.0, spect_min_val: float | None = None, spect_max_val: float | None = None, thresh_lowest: float = 0.1, thresh_min: float = 0.2, thresh_max: float = 0.3, min_dur: float = 0.03, max_dur: float = 0.2, min_isi_dur: float | None = None, use_softmax_amp: bool = True, temperature: float = 0.5, smoothing_timescale: float = 0.007, scale: bool = True, scale_val: int | float = 32768, scale_dtype: npt.DTypeLike = numpy.int16) -> Segments
Find segments in audio, using the algorithm from the ava package. Segments audio by generating a spectrogram from it, summing power across frequencies, and then thresholding this summed spectral power as if it were an amplitude trace.
The spectral power is segmented with three thresholds, thresh_lowest, thresh_min, and thresh_max, where thresh_lowest <= thresh_min <= thresh_max. The segmenting algorithm works as follows: first, detect all local maxima that exceed thresh_max. Then, for each local maximum, find onsets and offsets. An offset is detected wherever a local maximum is followed by a subsequent local minimum in the summed spectral power that is less than thresh_min, or wherever the power falls below thresh_lowest. Onsets are located in the same way, by looking for a preceding local minimum less than thresh_min, or any preceding value less than thresh_lowest.
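To make the three-threshold search concrete, the sketch below shows one way to find onsets and offsets on a 1-D summed-power trace. This is an illustration only, not this function's actual implementation; the helper toy_onsets_offsets and its inputs are hypothetical.

import numpy as np
from scipy.signal import argrelextrema

def toy_onsets_offsets(power, thresh_lowest, thresh_min, thresh_max):
    """Illustrative three-threshold onset/offset search on a 1-D power trace."""
    peaks = argrelextrema(power, np.greater)[0]
    peaks = peaks[power[peaks] > thresh_max]        # local maxima above thresh_max
    minima = set(argrelextrema(power, np.less)[0])  # indices of all local minima
    onsets, offsets = [], []
    for peak in peaks:
        onset = offset = None
        # Walk left from the peak: the onset is the first preceding local
        # minimum below thresh_min, or any value below thresh_lowest.
        for i in range(peak - 1, -1, -1):
            if power[i] < thresh_lowest or (i in minima and power[i] < thresh_min):
                onset = i
                break
        # Walk right from the peak: the offset is found the same way.
        for i in range(peak + 1, power.size):
            if power[i] < thresh_lowest or (i in minima and power[i] < thresh_min):
                offset = i
                break
        if onset is not None and offset is not None:
            onsets.append(onset)
            offsets.append(offset)
    return onsets, offsets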
Parameters:
sound : vocalpy.Sound
Sound loaded from an audio file.
nperseg : int
Number of samples per segment for the Short-Time Fourier Transform. Default is 1024.
noverlap : int
Number of samples to overlap per segment for the Short-Time Fourier Transform. Default is 512.
min_freq : int
Minimum frequency. The spectrogram is "cropped" below this frequency (instead of, e.g., bandpass filtering). Default is 30e3.
max_freq : int
Maximum frequency. The spectrogram is "cropped" above this frequency (instead of, e.g., bandpass filtering). Default is 110e3.
spect_min_val : float, optional
Expected minimum value of the spectrogram after transforming to the log of the magnitude. Used for min-max scaling: \((s - s_{min}) / (s_{max} - s_{min})\), where spect_min_val is \(s_{min}\); see the min-max scaling sketch after this parameter list. Default is None, in which case the minimum value of the spectrogram is used.
spect_max_val : float, optional
Expected maximum value of the spectrogram after transforming to the log of the magnitude. Used for min-max scaling: \((s - s_{min}) / (s_{max} - s_{min})\), where spect_max_val is \(s_{max}\). Default is None, in which case the maximum value of the spectrogram is used.
thresh_max : float
Threshold used to find local maxima.
thresh_min : float
Threshold used to find local minima, in relation to local maxima. Used to find onsets and offsets of segments.
thresh_lowest : float
Lowest threshold used to find onsets and offsets of segments.
min_dur : float
Minimum duration of a segment, in seconds.
max_dur : float
Maximum duration of a segment, in seconds.
min_isi_dur : float, optional
Minimum duration of inter-segment intervals, in seconds. If specified, any inter-segment intervals shorter than this value will be removed, and the adjacent segments merged. Default is None.
use_softmax_amp : bool
If True, compute the summed spectral power from the spectrogram with a softmax operation on each column; see the softmax sketch after this parameter list. Default is True.
temperature : float
Temperature for softmax. Only used if use_softmax_amp is True.
smoothing_timescale : float
Timescale to use when smoothing the summed spectral power with a Gaussian filter. The window size will be dt - smoothing_timescale / samplerate, where dt is the size of a time bin in the spectrogram.
scale : bool
If True, scale sound.data; see the audio-scaling sketch after this parameter list. Default is True. This is needed to replicate the behavior of ava, which assumes the audio data is loaded as 16-bit integers. Since the default for vocalpy.Sound is to load sounds with a numpy dtype of float64, this function defaults to multiplying sound.data by 2**15 and then casting to the int16 dtype. This replicates the behavior of the ava function, given data with dtype float64. If you have loaded a sound with a dtype of int16, set this to False.
scale_val : int or float
Value to multiply sound.data by, to scale the data. Default is 2**15. Only used if scale is True. This is needed to replicate the behavior of ava, which assumes the audio data is loaded as 16-bit integers.
scale_dtype : numpy.dtype
numpy dtype to cast sound.data to, after scaling. Default is np.int16. Only used if scale is True. This is needed to replicate the behavior of ava, which assumes the audio data is loaded as 16-bit integers.
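As referenced in the spect_min_val and spect_max_val descriptions above, the min-max scaling of the log-magnitude spectrogram is a single expression. The snippet below is a hedged illustration, not this function's internal code; the spect array is random placeholder data.

import numpy as np

# Placeholder log-magnitude spectrogram; real values would come from an STFT.
spect = np.log(np.random.rand(128, 512) + 1e-12)

# Min-max scaling: (s - s_min) / (s_max - s_min).
# When spect_min_val / spect_max_val are None, the spectrogram's own
# extrema serve as s_min and s_max, as described above.
s_min, s_max = spect.min(), spect.max()
spect_scaled = (spect - s_min) / (s_max - s_min)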
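As referenced in the use_softmax_amp description above, here is one plausible reading of the softmax operation on each spectrogram column, written as a sketch. The exact operation ava uses may differ; softmax_amplitude and its weighting scheme are assumptions for illustration.

import numpy as np

def softmax_amplitude(spect, temperature=0.5):
    # Weight each column (time bin) by its own softmax across frequencies,
    # then sum across frequencies. This form is assumed, for illustration.
    weights = np.exp(spect / temperature)
    weights /= weights.sum(axis=0, keepdims=True)
    return (spect * weights).sum(axis=0)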
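Finally, as referenced in the scale description above, the scale, scale_val, and scale_dtype parameters together amount to the transform below. This sketch assumes audio loaded as float64 in the range [-1.0, 1.0], as vocalpy.Sound does by default; the data array is a stand-in.

import numpy as np

# Stand-in for sound.data: one channel of float64 audio in [-1.0, 1.0].
data = np.random.uniform(-1.0, 1.0, size=(1, 44100))

# With the defaults scale=True, scale_val=2**15, and scale_dtype=np.int16,
# the data is rescaled to emulate audio loaded as 16-bit integers.
data_scaled = (data * 2**15).astype(np.int16)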
Returns:
segments : vocalpy.Segments
Instance of vocalpy.Segments representing the segments found.
Notes
Code is adapted from [2]. Default parameters are taken from the example script in the pearsonlab/autoencoded-vocal-analysis repository. Note that the example script suggests tuning these parameters using functionality built into ava that we do not replicate here.
Versions of this algorithm were also used to segment rodent vocalizations in [4] (see code in [5]) and [6] (see code in [7]).
References
[1] Goffinet, J., Brudner, S., Mooney, R., & Pearson, J. (2021). Low-dimensional learned feature spaces quantify individual and group differences in vocal repertoires. eLife, 10:e67855. https://doi.org/10.7554/eLife.67855
[3] Goffinet, J., Brudner, S., Mooney, R., & Pearson, J. (2021). Data from: Low-dimensional learned feature spaces quantify individual and group differences in vocal repertoires. Duke Research Data Repository. https://doi.org/10.7924/r4gq6zn8w
[4] Jourjine, N., Woolfolk, M. L., Sanguinetti-Scheck, J. I., Sabatini, J. E., McFadden, S., Lindholm, A. K., & Hoekstra, H. E. (2023). Two pup vocalization types are genetically and functionally separable in deer mice. Current Biology. https://doi.org/10.1016/j.cub.2023.02.045
[6] Peterson, R. E., Choudhri, A., Mitelut, C., Tanelus, A., Capo-Battaglia, A., Williams, A. H., Schneider, D. M., & Sanes, D. H. (2023). Unsupervised discovery of family specific vocal usage in the Mongolian gerbil. bioRxiv.
Examples
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> import vocalpy as voc
>>> jourjineetal2023paths = voc.example('jourjine-et-al-2023')
>>> wav_path = jourjineetal2023paths[0]
>>> sound = voc.Sound.read(wav_path)
>>> params = {**voc.segment.ava.JOURJINEETAL2023}
>>> del params['min_isi_dur']
>>> segments = voc.segment.ava(sound, **params)
>>> rows = 3; cols = 4
>>> fig, ax_arr = plt.subplots(rows, cols)
>>> start_inds, stop_inds = segments.start_inds, segments.stop_inds
>>> ax_to_use = ax_arr.ravel()[:start_inds.shape[0]]
>>> for start_ind, stop_ind, ax in zip(start_inds, stop_inds, ax_to_use):
...     data = sound.data[:, start_ind:stop_ind]
...     newsound = voc.Sound(data=data, samplerate=sound.samplerate)
...     spect = voc.spectrogram(newsound)
...     ax.pcolormesh(spect.times, spect.frequencies, np.squeeze(spect.data))
>>> for ax in ax_arr.ravel()[:start_inds.shape[0]]:
...     ax.set_axis_off()
>>> for ax in ax_arr.ravel()[start_inds.shape[0]:]:
...     ax.remove()