These are examples supporting our ICASSP 2023 paper.

Reference:
S. Kindt, J. Thienpondt, and N. Madhu, "Exploiting speaker embeddings for improved microphone clustering and speech separation in ad-hoc microphone arrays," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, IEEE, 2023.

Goal

We have ad-hoc distributed microphones and two concurrent speakers placed in a room. We want to group the microphones into three clusters: one for each speaker and one background cluster. We compare Mod-MFCC based clustering features with our proposed speaker embedding features. Both are clustered using fuzzy C-means clustering. With these clusters, we perform speech separation by first generating a rough estimate via masking. This estimate is then used to estimate the relative delays, which are compensated for during delay-and-sum beamforming (DSB).
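To make the clustering step concrete, here is a minimal NumPy sketch of fuzzy C-means applied to one feature vector per microphone. It is an illustration only, not the implementation used in the paper: the function name `fuzzy_c_means`, the fuzzifier `m = 2`, and the random toy features (standing in for Mod-MFCC features or speaker embeddings) are all assumptions.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=3, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal fuzzy C-means sketch. X has shape (n_mics, n_features)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Cluster centers: membership-weighted means of the features.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every microphone to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        # Standard membership update: u_ij proportional to d_ij^(-2/(m-1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers

# Toy data (assumption): 15 microphones, 8-dimensional features, three groups.
rng = np.random.default_rng(1)
group_means = np.array([[0.0] * 8, [5.0] * 8, [-5.0] * 8])
features = np.vstack([mu + 0.3 * rng.standard_normal((5, 8)) for mu in group_means])

memberships, centers = fuzzy_c_means(features)
labels = memberships.argmax(axis=1)  # hard assignment from soft memberships
print(labels)
```

Unlike hard k-means, every microphone receives a soft membership value for each cluster, so a threshold on `memberships` could be used instead of `argmax` when only the most confident microphones of a cluster are wanted.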

Setup

For the examples, we simulated different room sizes and reverberation times. Two sources are placed in the room. Then 15 microphones are placed randomly, with the constraint that at least 3 microphones are within the critical distance of each source. An example room and clustering are plotted below: the sources are the green crosses, the critical distances are indicated by the green circles, and the microphones are indicated by black dots. In the cluster plot, microphones with a darker purple color belong to the same cluster, while light blue dots are microphones not belonging to that cluster. Note that the critical distance depends on the room size and the reverberation time of the room.

Room
SpVer Clusters image
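The dependence of the critical distance on room size and reverberation time can be illustrated with the common Sabine-based rule of thumb d_c ≈ 0.057 · sqrt(Q · V / T60), with V the room volume in cubic meters, T60 in seconds, and Q the source directivity factor (1 for an omnidirectional source). The sketch below assumes this approximation; the room dimensions, the positions, and the helper names `critical_distance` and `mics_within` are hypothetical and not taken from the simulations.

```python
import math

def critical_distance(volume_m3, t60_s, directivity=1.0):
    """Approximate critical distance in meters (Sabine-based rule of thumb)."""
    return 0.057 * math.sqrt(directivity * volume_m3 / t60_s)

def mics_within(source_xy, mic_positions, radius):
    """Count microphones whose distance to the source is at most `radius`."""
    return sum(math.dist(source_xy, mic) <= radius for mic in mic_positions)

# Hypothetical 6 m x 5 m x 3 m room, with two of the reverberation times
# used in the examples.
volume = 6.0 * 5.0 * 3.0
d_low = critical_distance(volume, 0.2)
d_normal = critical_distance(volume, 0.5)
print(f"T60 = 0.2 s: {d_low:.2f} m, T60 = 0.5 s: {d_normal:.2f} m")

# Hypothetical source and microphone positions (x, y) in meters.
source = (2.0, 2.5)
mics = [(2.3, 2.5), (2.0, 3.0), (1.5, 2.2), (5.0, 4.0), (0.5, 0.5)]
# Placement constraint: at least 3 microphones inside the critical distance.
print(mics_within(source, mics, d_normal) >= 3)
```

For a fixed room, a longer reverberation time shrinks the critical distance, so fewer random placements satisfy the constraint of at least 3 nearby microphones per source.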

Evaluation

To compare the Mod-MFCC based features with the speaker embeddings, we show the following:
  1. The resulting clusters
  2. Audio of the chosen reference microphone signals
  3. Audio of the masked reference microphone signals, which will be used to time-align the signals
  4. Audio of the DSB output for the clusters
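The time alignment and DSB steps can be sketched as follows. This is a simplified illustration and not the paper's implementation: delays are restricted to integer samples, they are found from the peak of the cross-correlation with a reference signal (which here stands in for the masked reference microphone signal), and the function names `estimate_delay` and `delay_and_sum` as well as the toy signals are assumptions.

```python
import numpy as np

def estimate_delay(reference, signal, max_lag):
    """Integer lag (in samples) by which `signal` trails `reference`."""
    # Full cross-correlation; for equal-length inputs, zero lag sits at
    # index len(reference) - 1.
    xcorr = np.correlate(signal, reference, mode="full")
    zero = len(reference) - 1
    window = xcorr[zero - max_lag: zero + max_lag + 1]
    lags = np.arange(-max_lag, max_lag + 1)
    return int(lags[np.argmax(window)])

def delay_and_sum(signals, delays):
    """Compensate each delay and average the aligned signals."""
    # np.roll shifts circularly, which is acceptable for this toy example;
    # a real system would zero-pad or use fractional-delay filtering.
    aligned = [np.roll(x, -d) for x, d in zip(signals, delays)]
    return np.mean(aligned, axis=0)

# Toy cluster (assumption): one source observed by three microphones with
# different delays and independent sensor noise.
rng = np.random.default_rng(0)
source = rng.standard_normal(4000)
true_delays = [0, 7, -3]
mics = [np.roll(source, d) + 0.1 * rng.standard_normal(source.size)
        for d in true_delays]

reference = mics[0]
delays = [estimate_delay(reference, x, max_lag=20) for x in mics]
output = delay_and_sum(mics, delays)
print(delays)
```

After alignment the source component adds coherently while the uncorrelated noise is averaged down, which is why the DSB output of a cluster should sound cleaner than the single reference microphone.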

Cases

We have 5 examples with different room sizes and reverberation times, in which the sources are placed either farther apart or closer together.
The first two examples have sources that are relatively far from each other:
  1. with low reverberation time of 0.2 seconds
  2. with normal reverberation time of 0.5 seconds
The last three examples are more challenging, with sources that are closer to each other:
  1. with modest reverberation time of 0.3 seconds
  2. with normal reverberation time of 0.5 seconds
  3. with overlapping critical distances in modestly reverberant conditions (reverberation time of 0.3 seconds)