SpeaDS: Speaker Diarization System. This service detects "who speaks when" within an audio recording, without prior information on the speakers.


SpeaDS segments the input audio stream according to the speakers appearing over time. Each segment is labeled with a speaker index and an estimated gender (M: male, F: female). Speaker names are neither known nor estimated.


In this service, the audio stream is analyzed using four main steps.

  • First, the complete audio stream is divided into breath groups (or "pseudo-sentences") separated by short silences, and these segments are grouped using agglomerative clustering. Consecutive segments that belong to the same group are merged. In this step, the audio content is represented by a sequence of mel-frequency cepstral coefficients (MFCCs). The segmentation is obtained with hidden Markov models and Viterbi decoding, using the HTK speech recognition toolkit with external phoneme class models; the clustering relies on Gaussian modeling and the squared Kullback-Leibler divergence (KL2).
  • Second, all non-speech segments (i.e., those containing only music or silence) are discarded. The music/speech detection relies on models used in SAMuSA.
  • Third, agglomerative clustering is performed on the resulting segmentation, and consecutive segments belonging to the same group are again merged [1]. This time, the MFCCs are enriched with their derivatives and log energy, and segments and segment clusters are modeled using Gaussian mixture models (GMMs). The criterion for merging clusters can be BIC-based or Euclidean-based, and the stopping criterion is tuned automatically.
  • Finally, the audio is segmented and labeled according to the appearance of male and female voices, using hidden Markov models and Viterbi decoding (HTK) with pre-trained gender models (male/female). Each cluster from the diarization above is then split according to the estimated gender of its elements.
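
To make the clustering distance of the first step more concrete, here is a minimal Python sketch. It assumes that KL2 refers to the symmetrized Kullback-Leibler divergence KL(a||b) + KL(b||a), which is its usual meaning in the diarization literature, and that each segment is modeled by a single diagonal-covariance Gaussian. The exact formula and covariance type used by SpeaDS are not documented here, and the function names are hypothetical.

```python
def kl2_diag_gaussians(mean_a, var_a, mean_b, var_b):
    """Symmetrized KL divergence, KL(a||b) + KL(b||a), between two
    diagonal-covariance Gaussians given per-dimension means and variances.

    Assumption: this is what "KL2" denotes; SpeaDS may differ in detail.
    """
    if not (len(mean_a) == len(var_a) == len(mean_b) == len(var_b)):
        raise ValueError("all inputs must have the same dimension")
    total = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        if va <= 0 or vb <= 0:
            raise ValueError("variances must be strictly positive")
        diff_sq = (ma - mb) ** 2
        # The log-variance terms of the two directed divergences cancel out,
        # leaving only variance ratios and a scaled squared mean difference.
        total += 0.5 * (va / vb + vb / va - 2.0 + diff_sq * (1.0 / va + 1.0 / vb))
    return total


def may_share_cluster(kl2_value, threshold=11.2):
    """Hypothetical merge test: two segments may share a cluster when their
    KL2 divergence does not exceed the threshold (11.2 is the documented
    default of the -t option)."""
    return kl2_value <= threshold
```

For example, two one-dimensional Gaussians with unit variance and means 0 and 1 give `kl2_diag_gaussians([0.0], [1.0], [1.0], [1.0]) == 1.0`, well below the default threshold.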

[1] Ben, Betser, Bimbot and Gravier, "Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs", In Proc. of International Conference on Spoken Language Processing (Interspeech), 2004.
[2] SPro through Inria GForge
[3] Audioseg through Inria GForge
(Version 1.0 of this web service includes SPro v5.0 and Audioseg v1.2.2)

File formats:

SpeaDS takes the audio stream from audio or video files as input, and outputs the speaker segmentation in raw text and JSON formats. A JSON file produced by other multimedia services from A||go can be provided as input; it is then completed with this segmentation.

  • inputs:
    • audio file: many formats are supported (wav, mp3, ogg, flac, mp4...), as the input is converted to a 16-bit, 16 kHz mono wav file using ffmpeg.
    • JSON file (optional): uploading the JSON output "<audio_file_name>.json" from A||go's multimedia web services causes it to be updated with SpeaDS's results under the 'speads' tag, along with metadata from the audio stream.
  • outputs:
    • text file: "<input_file_name>_speads.txt" is made of three columns of the format:
      each line describing a single segment, I being the speaker index, and G being the estimated gender of the speaker, either male (M) or female (F)
    • JSON file with the following format:
      "sampling_rate":"<frequency> Hz",
      "nb_channels":"<n> channels",
      "bit_rate":"<bit_rate> kb/s",
      "annotation_type":"speaker segments",
      with <start_time> and <end_time> in seconds, I being the speaker index, and G being the estimated gender of the speaker, either male (M) or female (F).
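
The conversion applied to audio inputs can be approximated locally, for instance to check that a file decodes correctly before uploading it. The sketch below builds an ffmpeg command producing a 16-bit, 16 kHz mono wav file. The exact options used by the service are not documented, so the ones chosen here are assumptions, and the function names are hypothetical.

```python
import subprocess


def build_ffmpeg_command(src_path, dst_path):
    """Return an ffmpeg argument list converting src_path into a
    16-bit PCM, 16 kHz, mono wav file (assumed equivalent of the
    service's preprocessing, not its actual command)."""
    return [
        "ffmpeg",
        "-i", src_path,        # input audio or video file
        "-vn",                 # ignore any video stream
        "-ac", "1",            # downmix to one channel (mono)
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "pcm_s16le",   # signed 16-bit little-endian PCM
        dst_path,
    ]


def convert_to_wav(src_path, dst_path):
    """Run the conversion; requires ffmpeg to be installed and on PATH."""
    subprocess.run(build_ffmpeg_command(src_path, dst_path), check=True)
```

Calling `convert_to_wav("interview.mp4", "interview.wav")` would then write the converted file next to the original (both file names are placeholders).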


Parameters:

  • -s <value> is the minimal duration of a silence used to obtain the initial low-level segmentation of the audio signal into breath groups (pseudo-sentences); it therefore controls how fine or coarse this segmentation is. It is set to 0.35 s by default. Smaller values divide the audio stream into finer segments, whereas larger values produce coarser segments. Note that finer segmentations increase the computation time.
  • -t <value> is the maximal KL2 divergence between two segments of a single cluster in the first agglomerative clustering. It is set to 11.2 by default. Choosing a higher value decreases the number of clusters obtained.
  • -r <value> is a parameter influencing the computation of the GMMs in the second clustering step, which are derived from a universal background model obtained on all the speech segments of the audio file. It takes positive values and is set to 4 by default.
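
As an illustration, the options above can be assembled into a single parameter string. This sketch assumes that the string is what gets passed in the job[param] field of the REST API, which this page does not state explicitly; the function and argument names are hypothetical.

```python
def build_param_string(min_silence=None, max_kl2=None, gmm_r=None):
    """Assemble the -s / -t / -r options into one string.

    Options left as None are omitted, so the service defaults
    (0.35 s, 11.2 and 4 respectively) apply.
    """
    parts = []
    if min_silence is not None:
        parts += ["-s", format(min_silence, "g")]
    if max_kl2 is not None:
        parts += ["-t", format(max_kl2, "g")]
    if gmm_r is not None:
        # The documentation states that -r takes positive values.
        if gmm_r <= 0:
            raise ValueError("-r must be positive")
        parts += ["-r", format(gmm_r, "g")]
    return " ".join(parts)
```

For example, `build_param_string(min_silence=0.5)` yields `"-s 0.5"`, which asks for a coarser initial segmentation than the default.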

Credits and license:

SpeaDS was developed by Gabriel Sargent and Guillaume Gravier at IRISA/Inria Rennes Bretagne Atlantique. It can be released and supplied under license on a case-by-case basis. SPro was developed by Guillaume Gravier. Audioseg was developed by Mathieu Ben, Michaël Betser and Guillaume Gravier.


17/08/2017: Version 1.0

How to use our REST API:

Remember to check your private token in your account first. You can find more details in our documentation tab.

The ID of this app is 99.

This curl command creates a job and returns its URL, along with the average execution time.

Files and/or dataset are optional; remove them if they are not needed.
curl -H 'Authorization: Token token=<your_private_token>' -X POST \
  -F 'job[webapp_id]=99' \
  -F 'job[param]=' \
  -F 'job[queue]=standard' \
  -F 'files[0]=@test.txt' \
  -F 'files[1]=@test2.csv' \
  -F 'job[file_url]=<my_file_url>' \
  -F 'job[dataset]=<my_dataset_name>' \
  https://allgo.inria.fr/api/v1/jobs

Then, check your job to get the URLs of its output files with:

curl -H 'Authorization: Token token=<your_private_token>' -X GET https://allgo.inria.fr/api/v1/jobs/<job_id>
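
The same status check can be scripted. The following Python sketch mirrors the GET command above using only the standard library. It assumes that the endpoint answers with a JSON document, which is not stated on this page; the function names are hypothetical.

```python
import json
import urllib.request

API_ROOT = "https://allgo.inria.fr/api/v1"


def build_job_status_request(job_id, token):
    """Build (without sending) the GET request for a job, using the same
    URL and Authorization header as the curl command."""
    return urllib.request.Request(
        f"{API_ROOT}/jobs/{job_id}",
        headers={"Authorization": f"Token token={token}"},
        method="GET",
    )


def fetch_job(job_id, token, timeout=30):
    """Send the request and decode the body, assuming it is JSON."""
    request = build_job_status_request(job_id, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes it possible to inspect the URL and headers without contacting the server.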