Otis!: Topic segmentation of texts and speech transcripts.

Otis! is a service that performs topic segmentation of texts and automatic transcripts. It supports English and French.

Overview:

Otis! agglomerates contiguous textual segments into topic segments according to an enhanced lexical cohesion criterion and a dynamic programming approach. It implements the approach of Utiyama and Isahara [1], extended with a language model interpolation on word counts [2]. The length of the topic segments can be controlled by tuning two parameters, s and p.
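To make the dynamic programming approach more concrete, here is a minimal, illustrative Python sketch of an Utiyama-and-Isahara-style segmenter. This is not the Otis! implementation: the cost function, the smoothing and the exact roles given to s and p below are simplifying assumptions.

    import math

    def segment(sentences, s=1.0, p=0.0):
        """Toy Utiyama-Isahara-style topic segmentation by dynamic programming.

        sentences -- list of sentences, each a list of lemmatised, filtered words
        s -- prior scale factor (> 0); larger s favors longer topic segments
        p -- insertion penalty (<= 0); smaller p favors longer topic segments
        Returns the sentence indices at which the topic segments start.
        """
        n = len(sentences)
        total = sum(len(snt) for snt in sentences)          # total word count
        vocab = len({w for snt in sentences for w in snt})  # vocabulary size

        def cost(i, j):
            # Negative log-likelihood of sentences[i:j] under a Laplace-smoothed
            # unigram model estimated incrementally on the segment itself, plus
            # a per-segment prior scaled by s and the insertion penalty p.
            counts, length, c = {}, 0, 0.0
            for snt in sentences[i:j]:
                for w in snt:
                    f = counts.get(w, 0)
                    c -= math.log((f + 1.0) / (length + vocab))
                    counts[w] = f + 1
                    length += 1
            return c + s * math.log(total) - p

        # best[j]: minimal cost of segmenting the first j sentences;
        # back[j]: start index of the last segment in that optimal segmentation.
        best = [0.0] + [math.inf] * n
        back = [0] * (n + 1)
        for j in range(1, n + 1):
            for i in range(j):
                candidate = best[i] + cost(i, j)
                if candidate < best[j]:
                    best[j], back[j] = candidate, i

        # Backtrack to recover the segment start positions.
        starts, j = [], n
        while j > 0:
            j = back[j]
            starts.append(j)
        return sorted(starts)

Since s scales a per-segment prior and p is subtracted once per segment, increasing s or decreasing p raises the cost of every boundary, so the optimum uses fewer, longer topic segments; this mirrors the behaviour of the -s and -p parameters described below.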

This service includes a preprocessing step of the input text, which estimates the part of speech (POS) and the lemma of each word, and performs a POS-based word filtering. This preprocessing relies mostly on TreeTagger.

[1] Masao Utiyama and Hitoshi Isahara, "A Statistical Model for Domain-Independent Text Segmentation", in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 491–498, 2001.
[2] Camille Guinaudeau, Guillaume Gravier and Pascale Sébillot, "Enhancing Lexical Cohesion Measure with Confidence Measures, Semantic Relations and Language Model Interpolation for Multimedia Spoken Content Topic Segmentation", Computer Speech and Language, 26(2), pp. 90–104, 2012.

File formats:

  • input: Otis! takes transcripts/texts in raw text, XML-based SSD or Vecsys VOX format as input, and optionally a JSON file.
    • text file (.txt):
      <segment>
      <segment>
      <segment>
      ...
      
      where <segment> can refer to a word, a phrase or a paragraph of text, depending on the scale of analysis considered. In the current version of this service, the length of a segment cannot exceed 256 words. It is assumed that each punctuation mark is followed by a space, and that only "classic" apostrophes are taken into account; e.g. "l'Europe" is treated as a sequence of two words: "l'" and "Europe".
    • IRISA's XML Structured Spoken Document format (SSD 1.0):
      <?xml version="1.0" encoding="ISO-8859-1"?>
      <ssdoc src="<name_of_document>" version="1.0">
      <ts start="<segment1_start_time>" end="<segment1_end_time>">
      <w str="<word1>" start="<word1_start_time>" end="<word1_end_time>" conf="<confidence_score>" pos="<type_of_word>" lem="<lemma>"/>
      <w str="<word2>" start="<word2_start_time>" end="<word2_end_time>" conf="<confidence_score>" pos="<type_of_word>" lem="<lemma>"/>
      ...
      </ts>
      <ts start="<segment2_start_time>" end="<segment2_end_time>">
      <w str="<word1>" start="<word1_start_time>" end="<word1_end_time>" conf="<confidence_score>" pos="<type_of_word>" lem="<lemma>"/>
      <w str="<word2>" start="<word2_start_time>" end="<word2_end_time>" conf="<confidence_score>" pos="<type_of_word>" lem="<lemma>"/>
      ...
      </ts>
      </ssdoc>
      
      where <ts> and </ts> mark out the segments to process, and each <w> element describes one word of the segment. All start and end times are in seconds.
    • VOX format: Otis! supports Vecsys's VOX format (and returns an SSD file as output).
    • JSON file (optional): if the JSON output "<input_file_name>.json" from A||go's multimedia web services is uploaded, it is updated with Otis!'s results under the 'otis' label, alongside the words of the input textual stream (cf. the 'general_info' and 'text' labels).
  • output: Otis! produces either raw text or SSD files, depending on the format used as input.
    • text file (.txt):
      <segment>
      <segment>
      <---------- segment boundary ---------->
      <segment>
      ...
      
      with "<---------- segment boundary ---------->" being the boundary between topically-consistent groups of segments.
    • SSD: the output file is a copy of the input in which the topic segmentation has been added, delimited by the <segment-collection> and </segment-collection> tags:
      <?xml version="1.0" encoding="ISO-8859-1"?>
      <ssdoc src="<name_of_document>" version="1.0">
      <segment-collection type="topic">
      <topic-segment start="<topic1_start_time>" end="<topic1_end_time>" score="" nwords=""/>
      <topic-segment start="<topic2_start_time>" end="<topic2_end_time>" score="" nwords=""/>
      ...
      </segment-collection>
      <ts start="<segment1_start_time>" end="<segment1_end_time>">
      <w str="<word1>" start="<word1_start_time>" end="<word1_end_time>" conf="<confidence_score>" pos="<type_of_word>" lem="<lemma>"/>
      ...
      </ts>
      <ts start="<segment2_start_time>" end="<segment2_end_time>">
      <w str="<word1>" start="<word1_start_time>" end="<word1_end_time>" conf="<confidence_score>" pos="<type_of_word>" lem="<lemma>"/>
      ...
      </ts>
      </ssdoc>
      
      where <topic_k_start_time> and <topic_k_end_time> are the start and end times of the k-th topic segment, in seconds.
    • JSON file with the following format:
      {
      "general_info":{
      "src":""<input_file_name>"",
      "text":{
      "duration":"00:00:00",
      "start":0,
      "time_unit":"word positions",
      "words":[
      "<first_word_of_the_input_text>",
      "<second_word_of_the_input_text>",
      ...
      ]
      }
      },
      "otis":{
      "annotation_type":"topic segments",
      "system":"otis",
      "parameters":"<input_parameters>",
      "modality":"text",
      "time_unit":"word positions",
      "events":[
      {
      "start":<start_position>,
      "end":<end_position>
      },
      {
      "start":<start_position>,
      "end":<end_position>
      },
      ...
      {
      "start":<start_position>,
      "end":<end_position>
      }
      ]
      }
      }
      
      each element of the "events" list being a particular topic segment (a sketch of how to read this file follows this list). Note: when an SSD file is given as input, the time instants of the segments are also reported: the "time_unit" labels are associated with the string "word positions and seconds" instead of "word positions", and two labels, "start_time" and "end_time", are added to each element of the "events" list, giving the start and end times of the related topic segment in seconds.
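As a quick illustration of this structure, here is a minimal Python sketch that loads such a JSON result and prints the words of each topic segment. The file name is only an example, and the start/end positions are assumed to be inclusive indices into the "words" list.

    import json

    # Load an Otis! JSON result structured as described above.
    with open("example1.txt.json", encoding="utf-8") as f:
        result = json.load(f)

    words = result["general_info"]["text"]["words"]
    for k, event in enumerate(result["otis"]["events"], start=1):
        start, end = event["start"], event["end"]  # word positions
        # Assumption: positions are inclusive indices into the word list.
        print("Topic %d: %s" % (k, " ".join(words[start:end + 1])))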

Parameters:

In addition to the choice between English and French, two parameters can be adjusted to prevent over-segmentation issues:

  • -l <language>: selects the language of the text. Supported languages: 'en' (English) and 'fr' (French). If unspecified, this parameter is set to 'fr'.
  • -s: prior scale factor between the lexical cohesion and the length of the output topic segments. s takes positive values and is set to 1 if unspecified. Increasing s favors larger topic segments.
  • -p: insertion penalty. p plays the same role as s and provides a way to fine-tune the balance between the segments' lexical cohesion and their length. p takes negative values only and is set to 0 if unspecified. Decreasing p favors larger topic segments.
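For instance, an illustrative parameter string such as the following (passed through job[param] in the REST API described below) selects English and biases the segmenter towards longer topic segments:

    -l en -s 2 -p -0.5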

Credits and license:

Otis! is the online version of IRINTS: Irisa News Topic Segmenter. IRINTS was developed by G. Gravier and C. Guinaudeau at IRISA/Inria Rennes. It is the property of CNRS (DI 03033-01) and Inria and can be supplied under license on a case-by-case basis. This piece of software relies on the TreeTagger software developed by Helmut Schmid and on the publicly available libxml2 library.

Example input:

example1.txt
Fin de partie juridique pour Facebook
Le réseau social américain affirmait n’avoir aucun compte à rendre aux tribunaux français en matière de litiges
La biosphère terrestre est de plus en plus marquée par l'empreinte de l'Homme
il devient de plus en plus difficile d'y trouver des espaces purement naturels


example2.ssd

Example output:

example1_otis.txt
Fin de partie juridique pour Facebook
Le réseau social américain affirmait n’avoir aucun compte à rendre aux tribunaux français en matière de litiges
<---------- segment boundary ---------->
La biosphère terrestre est de plus en plus marquée par l'empreinte de l'Homme
il devient de plus en plus difficile d'y trouver des espaces purement naturels
example2_otis.ssd

22/08/2017: Version 1.0

How to use our REST API:

First, remember to check your private token in your account. You can find more details in our documentation tab.

The id of this app is 4.

The following curl command creates a job and returns your job URL, together with the average execution time:

# files and/or dataset are optional; remove them if not wanted
curl -H 'Authorization: Token token=<your_private_token>' -X POST \
     -F job[webapp_id]=4 \
     -F job[param]="" \
     -F job[queue]=standard \
     -F files[0]=@test.txt \
     -F files[1]=@test2.csv \
     -F job[file_url]=<my_file_url> \
     -F job[dataset]=<my_dataset_name> \
     https://allgo.inria.fr/api/v1/jobs

Then, check your job to get the URLs of the result files with:

curl -H 'Authorization: Token token=<your_private_token>' -X GET https://allgo.inria.fr/api/v1/jobs/<job_id>
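The same two calls can also be scripted, for instance with Python's requests library. The sketch below assumes the endpoints above; the field names of the JSON response, such as "id", are assumptions to adapt to the actual reply.

    import requests

    API = "https://allgo.inria.fr/api/v1/jobs"
    HEADERS = {"Authorization": "Token token=<your_private_token>"}

    # Create a job for app id 4 (Otis!), mirroring the curl example above.
    with open("test.txt", "rb") as f:
        response = requests.post(
            API,
            headers=HEADERS,
            data={"job[webapp_id]": "4",
                  "job[param]": "-l en",
                  "job[queue]": "standard"},
            files={"files[0]": f},
        )
    response.raise_for_status()
    job = response.json()

    # Check the job to get the URLs of the result files.
    # "id" is an assumed field name; adapt it to the actual response.
    status = requests.get("%s/%s" % (API, job["id"]), headers=HEADERS).json()
    print(status)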