How to use VocalPy with UMAP and HDBSCAN for dimensionality reduction and clustering#
It is becoming more and more common for researchers studying acoustic communication to apply dimensionality reduction methods to their data, and then cluster the data once it is embedded in that lower dimensional space. Many researchers use the UMAP method for dimensionality reduction, via the Python library that implements it, and the HDBSCAN method to cluster, as implemented in the HDBSCAN library. You can install these by running pip install umap-learn and pip install hdbscan, respectively. Note that an implementation of HDBSCAN is now provided by scikit-learn, but here we show using the HDBSCAN package. Our understanding is that as of this writing, there are cases where using the original package may be more computationally efficient (see this issue, but that is beyond the scope of this how-to.
Here we provide a brief walkthrough of how you would use VocalPy to work with your data, and prepare it for dimensionality reduction with UMAP and clustering with HDBSCAN. We will use a (very tiny!) sub-set of the dataset that accompanied the paper “Two pup vocalization types are genetically and functionally separable in deer mice”. Our goal is to replicate the analysis in that paper, that used dimensionality reduction and clustering to show that the calls that deer mice pups emit when socially isolated form two distinct clusters. These finding supports the idea that there are two separate categories of calls: ultrasonic vocalizations and lower-frequency “cries”.
For a more detailed tutorial on applying UMAP to animal sounds, please see this paper from Mara Thomas and co-authors. Material here is adapted in part from this project shared under CC-BY-4.0 license, that accompanied the paper from Thomas, et al. We encourage you to also read Tim Sainburg’s earlier work demonstrating how UMAP can be used with animal sounds. Some code is also adapted from the repository that reproduces the paper results. Thank you to Nick Jourjine for helping us pick a subset of their data that is a good size for a tutorial. The original material for this how-to comes from a bootcamp at the Neural Mechanism of Acoustic Communciation 2024 Graduate Research Seminar organized by Nick and Diana Liao.
import pathlib
import librosa
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import umap
import hdbscan
import vocalpy as voc
Load mouse pup call data#
We start by loading the example data, that is given to us as vocalpy.Sound and vocalpy.Segments instances.
twopup = voc.example("twopup")
Downloading file 'jourjine-et-al-2023.tar.gz' from 'doi:10.5281/zenodo.10685639/jourjine-et-al-2023.tar.gz' to '/home/docs/.cache/vocalpy'.
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
Cell In[2], line 1
----> 1 twopup = voc.example("twopup")
File ~/checkouts/readthedocs.org/user_builds/vocalpy/envs/stable/lib/python3.13/site-packages/vocalpy/examples/_examples.py:398, in example(name, return_path)
392 raise ValueError(
393 f"No example data found with name: {name}. "
394 "To see the names of all example data, call `vocalpy.examples.show()`"
395 )
397 example_: Example = REGISTRY[name]
--> 398 return example_.load(return_path=return_path)
File ~/checkouts/readthedocs.org/user_builds/vocalpy/envs/stable/lib/python3.13/site-packages/vocalpy/examples/_examples.py:264, in Example.load(self, return_path)
259 raise ConnectionError(
260 "Unable to connect to registry to download example dataset. "
261 "This may be due to an issue with an internet connection."
262 ) from e
263 if self.filename.endswith(".tar.gz"):
--> 264 path = POOCH.fetch(self.filename, processor=pooch.Untar())
265 elif self.filename.endswith(".zip"):
266 path = POOCH.fetch(self.filename, processor=pooch.Unzip())
File ~/checkouts/readthedocs.org/user_builds/vocalpy/envs/stable/lib/python3.13/site-packages/pooch/core.py:589, in Pooch.fetch(self, fname, processor, downloader, progressbar)
586 if downloader is None:
587 downloader = choose_downloader(url, progressbar=progressbar)
--> 589 stream_download(
590 url,
591 full_path,
592 known_hash,
593 downloader,
594 pooch=self,
595 retry_if_failed=self.retry_if_failed,
596 )
598 if processor is not None:
599 return processor(str(full_path), action, self)
File ~/checkouts/readthedocs.org/user_builds/vocalpy/envs/stable/lib/python3.13/site-packages/pooch/core.py:807, in stream_download(url, fname, known_hash, downloader, pooch, retry_if_failed)
803 try:
804 # Stream the file to a temporary so that we can safely check its
805 # hash before overwriting the original.
806 with temporary_file(path=str(fname.parent)) as tmp:
--> 807 downloader(url, tmp, pooch)
808 hash_matches(tmp, known_hash, strict=True, source=str(fname.name))
809 shutil.move(tmp, str(fname))
File ~/checkouts/readthedocs.org/user_builds/vocalpy/envs/stable/lib/python3.13/site-packages/pooch/downloaders.py:627, in DOIDownloader.__call__(self, url, output_file, pooch)
623 # Instantiate the downloader object
624 downloader = HTTPDownloader(
625 progressbar=self.progressbar, chunk_size=self.chunk_size, **self.kwargs
626 )
--> 627 downloader(download_url, output_file, pooch)
File ~/checkouts/readthedocs.org/user_builds/vocalpy/envs/stable/lib/python3.13/site-packages/pooch/downloaders.py:240, in HTTPDownloader.__call__(self, url, output_file, pooch, check_only)
238 progress = self.progressbar
239 progress.total = total
--> 240 for chunk in content:
241 if chunk:
242 output_file.write(chunk)
File ~/checkouts/readthedocs.org/user_builds/vocalpy/envs/stable/lib/python3.13/site-packages/requests/models.py:820, in Response.iter_content.<locals>.generate()
818 if hasattr(self.raw, "stream"):
819 try:
--> 820 yield from self.raw.stream(chunk_size, decode_content=True)
821 except ProtocolError as e:
822 raise ChunkedEncodingError(e)
File ~/checkouts/readthedocs.org/user_builds/vocalpy/envs/stable/lib/python3.13/site-packages/urllib3/response.py:1253, in HTTPResponse.stream(self, amt, decode_content)
1247 else:
1248 while (
1249 not is_fp_closed(self._fp)
1250 or len(self._decoded_buffer) > 0
1251 or (self._decoder and self._decoder.has_unconsumed_tail)
1252 ):
-> 1253 data = self.read(amt=amt, decode_content=decode_content)
1255 if data:
1256 yield data
File ~/checkouts/readthedocs.org/user_builds/vocalpy/envs/stable/lib/python3.13/site-packages/urllib3/response.py:1108, in HTTPResponse.read(self, amt, decode_content, cache_content)
1105 if len(self._decoded_buffer) >= amt:
1106 return self._decoded_buffer.get(amt)
-> 1108 data = self._raw_read(amt)
1110 flush_decoder = amt is None or (amt != 0 and not data)
1112 if (
1113 not data
1114 and len(self._decoded_buffer) == 0
1115 and not (self._decoder and self._decoder.has_unconsumed_tail)
1116 ):
File ~/checkouts/readthedocs.org/user_builds/vocalpy/envs/stable/lib/python3.13/site-packages/urllib3/response.py:1024, in HTTPResponse._raw_read(self, amt, read1)
1021 fp_closed = getattr(self._fp, "closed", False)
1023 with self._error_catcher():
-> 1024 data = self._fp_read(amt, read1=read1) if not fp_closed else b""
1025 if amt is not None and amt != 0 and not data:
1026 # Platform-specific: Buggy versions of Python.
1027 # Close the connection when no data is returned
(...) 1032 # not properly close the connection in all cases. There is
1033 # no harm in redundantly calling close.
1034 self._fp.close()
File ~/checkouts/readthedocs.org/user_builds/vocalpy/envs/stable/lib/python3.13/site-packages/urllib3/response.py:1007, in HTTPResponse._fp_read(self, amt, read1)
1004 return self._fp.read1(amt) if amt is not None else self._fp.read1()
1005 else:
1006 # StringIO doesn't like amt=None
-> 1007 return self._fp.read(amt) if amt is not None else self._fp.read()
File ~/.asdf/installs/python/3.13.3/lib/python3.13/http/client.py:479, in HTTPResponse.read(self, amt)
476 if self.length is not None and amt > self.length:
477 # clip the read to the "end of response"
478 amt = self.length
--> 479 s = self.fp.read(amt)
480 if not s and amt:
481 # Ideally, we would raise IncompleteRead if the content-length
482 # wasn't satisfied, but it might break compatibility.
483 self._close_conn()
File ~/.asdf/installs/python/3.13.3/lib/python3.13/socket.py:719, in SocketIO.readinto(self, b)
717 raise OSError("cannot read from timed out object")
718 try:
--> 719 return self._sock.recv_into(b)
720 except timeout:
721 self._timeout_occurred = True
File ~/.asdf/installs/python/3.13.3/lib/python3.13/ssl.py:1304, in SSLSocket.recv_into(self, buffer, nbytes, flags)
1300 if flags != 0:
1301 raise ValueError(
1302 "non-zero flags not allowed in calls to recv_into() on %s" %
1303 self.__class__)
-> 1304 return self.read(nbytes, buffer)
1305 else:
1306 return super().recv_into(buffer, nbytes, flags)
File ~/.asdf/installs/python/3.13.3/lib/python3.13/ssl.py:1138, in SSLSocket.read(self, len, buffer)
1136 try:
1137 if buffer is not None:
-> 1138 return self._sslobj.read(len, buffer)
1139 else:
1140 return self._sslobj.read(len)
KeyboardInterrupt:
We can see that vocalpy.example() gives us back a vocalpy.examples.ExampleData instance with two attributes, sound and segments.
twopup
We use the segment() method of the Sound class with the vocalpy.Segments to get a list of Sound instances, one for each segment in the sound.
sound, segments = twopup.sound, twopup.segments
all_sounds = sound.segment(segments)
Typically in research code you would have done that by looping through start and stop times of segments, e.g., that you load from a csv file, that is the output of some function that segments for you. Then you would need to write more code to use the start and stop times of segments that you get back from the function to get clips of audio, one for each segmnt. The two lines we wrote above are just a concise way of doing the same thing.
Traditionally, segmentation has been done with signal processing methods, that can work fairly well in acoustically isolated environments. For example, this audio was segmented with the vocalpy.segment.ava() function – you can replicate the results of the paper with the parameters vocalpy.segment.JOURJINEETAL2023.
If you click on Segments, you can read the API documentation, and see examples of how to save and load segmentation with this class. Loading segmentation from tiny files lets you avoid saving a bunch of separate files of audio “clips”, one for each segment. Such clips can take up a ton of space on your computer’s storage; instead, we make it easier for you to load the sound and its segmentation when you need them, and then make those clips on the fly.
Now we have a list of sounds, one for each call that we found by segmenting.
len(all_sounds)
We write a function to give us back Mel spectrograms, so we can parallelize processing with VocalPy.
def melspectrogram(
sound: voc.Sound, n_mels: int=50, window: str = "hann",
n_fft: int = 512, hop_length: int = 128, fmin=5000, fmax=125000
) -> voc.Spectrogram:
S = librosa.feature.melspectrogram(y=sound.data,
sr=sound.samplerate,
n_mels=n_mels ,
fmax=fmax,
fmin=fmin,
n_fft=n_fft,
hop_length=hop_length,
window=window,
win_length=n_fft)
S = librosa.power_to_db(S, ref=np.max)
t = librosa.frames_to_time(frames=np.arange(S.shape[-1]), sr=sound.samplerate, hop_length=hop_length)
f = librosa.mel_frequencies(n_mels=n_mels, fmin=fmin, fmax=fmax)
return voc.Spectrogram(data=S[0, ...], frequencies=f, times=t)
Now we will use the SpectrogramMaker class to make a Mel spectrogram for each call.
As with other pipeline classes, the SpectrogramMaker takes a callback that it will call once for every sound we give it.
callback = melspectrogram
spect_maker = voc.SpectrogramMaker(callback)
all_spects = spect_maker.make(all_sounds, parallelize=True)
Not surprisingly, the make() method gives us back a list of Spectrogram instances.
print(all_spects[0])
In this case, we just need the underlying data, the numpy arrays.
We replace the list of Spectrogram instances with a list of numpy arrays, at the same time throwing away the channel dimension.
# We use the same variable name, letting the Python garbage collector take care of the list of Spectrograms
# that we don't need hanging around in memory anymore.
# We also use a list comprehension to avoid a loop.
all_spects = [
spect.data[0, ...] # get rid of the channel dimension by indexing
for spect in all_spects
]
This should give us a two-dimensional array, with dimensions (frequency, time)
all_spects[0].shape
Let’s visualize a random subset, just to inspect our data.
import random
fig, ax_arr = plt.subplots(4, 6)
ax_arr = ax_arr.ravel()
for ax, spect in zip(ax_arr, random.sample(all_spects, len(all_spects))):
ax.pcolormesh(spect)
ax.set_axis_off()
Now we need to know the maximum number of time bins in any of the spectrograms, so we can pad all the spectrograms to the same size.
max_width = np.max([
spect.data.shape[1] for spect in all_spects
])
We also use the minimum value to pad, so we don’t add some very large value in the padding, that could impact the UMAP calculation.
min_val = np.min([spect.min() for spect in all_spects])
Once we have both of those values we can pad all the spectrograms.
We write a helper function to do so, with those values as the defaults.
def pad_spect(spect, max_width=max_width, constant_values=min_val):
pad_width = max_width - spect.shape[1]
# pad with half the width needed on both sides
left_pad = pad_width // 2
right_pad = pad_width - left_pad
return np.pad(
spect, ((0, 0), (left_pad, right_pad)), mode='constant', constant_values=min_val,
)
And then apply it to all our arrays. We use the same “trick” of using the same variable name to point to a new list of arrays, that are the old ones with padding added.
all_spects = [
pad_spect(spect, max_width)
for spect in all_spects
]
Let’s make sure that worked.
all(
[spect.shape[1] == max_width for spect in all_spects]
)
And again we visualize a random sample just to double check things are working as expected.
import random
fig, ax_arr = plt.subplots(4, 6)
ax_arr = ax_arr.ravel()
for ax, spect in zip(ax_arr, random.sample(all_spects, len(all_spects))):
ax.pcolormesh(spect)
ax.set_axis_off()
Finally, we will turn each spectrogram into a single 1-dimensional vector. We do this because the UMAP algorithm considers data points to be vectors in some high-dimensional space. We could map our spectrograms to 1-dimensional vectors in some other way (such as a variational autoencoder), but for our purposes it is good enough to simply “flatten” the spectrogram matrix with numpy.
data = np.array([
spect.flatten() for spect in all_spects
])
Dimensionality reduction with UMAP#
At last we are ready to reduce the dimensionality of our data with UMAP. Using its scikit-learn-like API, we will make an instance of the umap.UMAP class, and then fit our data.
reducer = umap.UMAP(
n_components=2, min_dist=0.25, n_neighbors=15, verbose=True
)
We call umap.UMAP.fit_transform() so we get back our data embedded in the lower dimensional space.
embedding = reducer.fit_transform(data)
Now we can visualize this data.
plt.scatter(embedding[:, 0], embedding[:, 1])
Looks pretty well separated to me!
Clustering with HDBSCAN#
Last but not least, we cluster with the HDBSCAN algorithm.
clusterer = hdbscan.HDBSCAN(min_cluster_size=100, allow_single_cluster=True)
Notice here we fit the embedded data that we got back from UMAP.
clusterer.fit(embedding)
We plot using the “labels” to color the data points. The labels are the integer class representing the cluster that each data point has been assigned to. The clusterer tracks these as a labels_ attribute for the data we fit it to.
y = clusterer.labels_
import matplotlib as mpl
cmap = mpl.colormaps['tab20'].resampled(np.unique(y).shape[0])
c = cmap.colors[y]
We visualize the data points colored by label, to see how well they segregate into distinct clusters.
plt.scatter(embedding[:, 0], embedding[:, 1], c=c);
By eye, it looks like they cluster quite well!
Of course, typically the quality of clustering is quantified with metrics such as the Silhoutte score.
Now you have seen the basic steps to using VocalPy to prepare your data for use with UMAP and HDBSCAN.
For more detail, we again encourage you to read the Thomas et al. paper referenced above and Sainburg et al. 2020’s work applying UMAP to a wide variety of animal vocalizations, as well as Sainburg McInnes and Gentner’s 2021 paper on parametric UMAP, and more recent work such as that described in this commentary from Kate Snyder and Nicole Creanza.