vocalpy.segment.ava

vocalpy.segment.ava(sound: Sound, nperseg: int = 1024, noverlap: int = 512, min_freq: int = 30000.0, max_freq: int = 110000.0, spect_min_val: float | None = None, spect_max_val: float | None = None, thresh_lowest: float = 0.1, thresh_min: float = 0.2, thresh_max: float = 0.3, min_dur: float = 0.03, max_dur: float = 0.2, min_isi_dur: float | None = None, use_softmax_amp: bool = True, temperature: float = 0.5, smoothing_timescale: float = 0.007, scale: bool = True, scale_val: int | float = 32768, scale_dtype: npt.DTypeLike = <class 'numpy.int16'>) → Segments

Find segments in audio, using algorithm from ava package.

Segments audio by generating a spectrogram from it, summing power across frequencies, and then thresholding this summed spectral power as if it were an amplitude trace.

The spectral power is segmented with three thresholds, thresh_lowest, thresh_min, and thresh_max, where thresh_lowest <= thresh_min <= thresh_max. The segmenting algorithm works as follows: first, detect all local maxima that exceed thresh_max. Then, for each local maximum, find the onset and offset. An offset is detected wherever a local maximum is followed by a local minimum in the summed spectral power that is less than thresh_min, or by any value less than thresh_lowest. Onsets are located in the same way, by looking for a preceding local minimum less than thresh_min, or any preceding value less than thresh_lowest.
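To make the three-threshold rule concrete, below is a minimal sketch of the onset/offset search, assuming a smoothed, summed spectral power trace power with time bins of duration dt. It illustrates the rule as described above rather than reproducing vocalpy's implementation; the function name is hypothetical, and de-duplication of overlapping segments and the min_dur/max_dur filtering are omitted.

import numpy as np

def three_threshold_segments(power, dt, thresh_lowest, thresh_min, thresh_max):
    # Sketch only: segments are seeded at local maxima above thresh_max;
    # onsets/offsets are the nearest preceding/following local minima
    # below thresh_min, or any value below thresh_lowest.
    onsets, offsets = [], []
    n = len(power)
    for i in range(1, n - 1):
        if not (power[i] > thresh_max
                and power[i] >= power[i - 1] and power[i] >= power[i + 1]):
            continue
        # walk left from the peak to find the onset
        on = i
        while on > 0:
            local_min = power[on] <= power[on - 1] and power[on] <= power[on + 1]
            if (local_min and power[on] < thresh_min) or power[on] < thresh_lowest:
                break
            on -= 1
        # walk right from the peak to find the offset
        off = i
        while off < n - 1:
            local_min = power[off] <= power[off - 1] and power[off] <= power[off + 1]
            if (local_min and power[off] < thresh_min) or power[off] < thresh_lowest:
                break
            off += 1
        onsets.append(on * dt)
        offsets.append(off * dt)
    return np.array(onsets), np.array(offsets)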

Parameters:
sound : vocalpy.Sound

Sound loaded from an audio file.

nperseg : int

Number of samples per segment for Short-Time Fourier Transform. Default is 1024.

noverlap : int

Number of samples to overlap per segment for Short-Time Fourier Transform. Default is 512.

min_freq : int

Minimum frequency. Spectrogram is “cropped” below this frequency (instead of, e.g., bandpass filtering). Default is 30e3.

max_freq : int

Maximum frequency. Spectrogram is “cropped” above this frequency (instead of, e.g., bandpass filtering). Default is 110e3.

spect_min_val : float, optional

Expected minimum value of spectrogram after transforming to the log of the magnitude. Used for a min-max scaling: \((s - s_{min}) / (s_{max} - s_{min})\), where spect_min_val is \(s_{min}\). Default is None, in which case the minimum value of the spectrogram is used.

spect_max_val : float, optional

Expected maximum value of spectrogram after transforming to the log of the magnitude. Used for a min-max scaling: \((s - s_{min}) / (s_{max} - s_{min})\), where spect_max_val is \(s_{max}\). Default is None, in which case the maximum value of the spectrogram is used. (See the sketch following this parameter list.)

thresh_max : float

Threshold used to find local maxima.

thresh_min : float

Threshold used to find the local minima that precede and follow local maxima; used to find onsets and offsets of segments.

thresh_lowest : float

Lowest threshold used to find onsets and offsets of segments.

min_dur : float

Minimum duration of a segment, in seconds.

max_dur : float

Maximum duration of a segment, in seconds.

min_isi_dur : float, optional

Minimum duration of inter-segment intervals, in seconds. If specified, any inter-segment intervals shorter than this value will be removed, and the adjacent segments merged. Default is None.

use_softmax_amp : bool

If True, compute summed spectral power from spectrogram with a softmax operation on each column. Default is True.

temperature : float

Temperature for softmax. Only used if use_softmax_amp is True.

smoothing_timescale : float

Timescale, in seconds, used when smoothing the summed spectral power with a Gaussian filter. The width of the filter is determined by smoothing_timescale relative to dt, the duration of a time bin in the spectrogram. Default is 0.007.

scale : bool

If True, scale sound.data. Default is True. This is needed to replicate the behavior of ava, which assumes the audio data is loaded as 16-bit integers. Since the default for vocalpy.Sound is to load sounds with a numpy dtype of float64, this function defaults to multiplying sound.data by 2**15 and then casting to the int16 dtype, replicating ava's behavior for data with dtype float64. If you have loaded a sound with a dtype of int16, set this to False. (See the sketch following this parameter list.)

scale_val : int or float

Value to multiply sound.data by, to scale the data. Default is 2**15. Only used if scale is True. This is needed to replicate the behavior of ava, which assumes the audio data is loaded as 16-bit integers.

scale_dtype : numpy.dtype

numpy dtype to cast sound.data to, after scaling. Default is np.int16. Only used if scale is True. This is needed to replicate the behavior of ava, which assumes the audio data is loaded as 16-bit integers.
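The sketch below strings together the pre-processing these parameters control: scaling the audio as if it were 16-bit integers, computing a spectrogram cropped to [min_freq, max_freq], taking the log of the magnitude, min-max scaling, and reducing to a summed (optionally softmax-weighted) power trace that is then smoothed. It is an approximation written against this docstring, not vocalpy's code; the epsilon guarding log(0), the clipping to [0, 1], the exact form of the softmax weighting, and the Gaussian filter width are assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import stft

def summed_spectral_power(data, samplerate, nperseg=1024, noverlap=512,
                          min_freq=30e3, max_freq=110e3,
                          spect_min_val=None, spect_max_val=None,
                          use_softmax_amp=True, temperature=0.5,
                          smoothing_timescale=0.007,
                          scale=True, scale_val=32768, scale_dtype=np.int16):
    if scale:
        # replicate ava's assumption that audio is 16-bit integers
        data = (np.asarray(data) * scale_val).astype(scale_dtype)
    f, t, s = stft(data, fs=samplerate, nperseg=nperseg, noverlap=noverlap)
    # "crop" the spectrogram to [min_freq, max_freq] instead of filtering
    keep = (f >= min_freq) & (f <= max_freq)
    s = s[keep]
    # log of the magnitude; the epsilon guarding log(0) is an assumption
    log_s = np.log(np.abs(s) + 1e-12)
    # min-max scaling (s - s_min) / (s_max - s_min), falling back to the
    # observed extrema when spect_min_val / spect_max_val are None
    s_min = spect_min_val if spect_min_val is not None else log_s.min()
    s_max = spect_max_val if spect_max_val is not None else log_s.max()
    log_s = np.clip((log_s - s_min) / (s_max - s_min), 0.0, 1.0)
    if use_softmax_amp:
        # softmax-weighted sum over frequencies per column (an assumption
        # about the exact form of the softmax operation)
        w = np.exp(log_s / temperature)
        w /= w.sum(axis=0, keepdims=True)
        amps = (w * log_s).sum(axis=0)
    else:
        amps = log_s.sum(axis=0)
    # smooth with a Gaussian filter; the width is set by smoothing_timescale
    # relative to the spectrogram time-bin duration dt
    dt = t[1] - t[0]
    return gaussian_filter1d(amps, smoothing_timescale / dt)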

Returns:
segments : vocalpy.Segments

Instance of vocalpy.Segments representing the segments found.

Notes

Code is adapted from [2]. Default parameters are taken from the example script in the pearsonlab/autoencoded-vocal-analysis repository. Note that the example script suggests tuning these parameters using functionality built into ava that we do not replicate here.

Versions of this algorithm were also used to segment rodent vocalizations in [4] (see code in [5]) and [6] (see code in [7]).

References

[1]

Goffinet, J., Brudner, S., Mooney, R., & Pearson, J. (2021). Low-dimensional learned feature spaces quantify individual and group differences in vocal repertoires. eLife, 10:e67855. https://doi.org/10.7554/eLife.67855

[2]

Code for [1]: https://github.com/pearsonlab/autoencoded-vocal-analysis

[3]

Goffinet, J., Brudner, S., Mooney, R., & Pearson, J. (2021). Data from: Low-dimensional learned feature spaces quantify individual and group differences in vocal repertoires. Duke Research Data Repository. https://doi.org/10.7924/r4gq6zn8w

[4]

Jourjine, N., Woolfolk, M. L., Sanguinetti-Scheck, J. I., Sabatini, J. E., McFadden, S., Lindholm, A. K., & Hoekstra, H. E. (2023). Two pup vocalization types are genetically and functionally separable in deer mice. Current Biology. https://doi.org/10.1016/j.cub.2023.02.045

[5]

Code for [4]: https://github.com/nickjourjine/peromyscus-pup-vocal-evolution

[6]

Peterson, R. E., Choudhri, A., Mitelut, C., Tanelus, A., Capo-Battaglia, A., Williams, A. H., Schneider, D. M., & Sanes, D. H. (2023). Unsupervised discovery of family specific vocal usage in the Mongolian gerbil. bioRxiv.

[7]

Code for [6]: https://github.com/ralphpeterson/gerbil-vocal-dialects

Examples

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> import vocalpy as voc
>>> jourjineetal2023paths = voc.example('jourjine-et-al-2023')
>>> wav_path = jourjineetal2023paths[0]
>>> sound = voc.Sound.read(wav_path)
>>> # use default parameters from Jourjine et al. 2023, without a minimum inter-segment interval
>>> params = {**voc.segment.ava.JOURJINEETAL2023}
>>> del params['min_isi_dur']
>>> segments = voc.segment.ava(sound, **params)
>>> # plot a spectrogram of each segment in its own panel
>>> rows, cols = 3, 4
>>> fig, ax_arr = plt.subplots(rows, cols)
>>> start_inds, stop_inds = segments.start_inds, segments.stop_inds
>>> ax_to_use = ax_arr.ravel()[:start_inds.shape[0]]
>>> for start_ind, stop_ind, ax in zip(start_inds, stop_inds, ax_to_use):
...     data = sound.data[:, start_ind:stop_ind]
...     segment_sound = voc.Sound(data=data, samplerate=sound.samplerate)
...     spect = voc.spectrogram(segment_sound)
...     ax.pcolormesh(spect.times, spect.frequencies, np.squeeze(spect.data))
>>> # hide the axes of panels that show a segment, and remove the panels left over
>>> for ax in ax_arr.ravel()[:start_inds.shape[0]]:
...     ax.set_axis_off()
>>> for ax in ax_arr.ravel()[start_inds.shape[0]:]:
...     ax.remove()