Tags:
text, topic segmentation, transcript, linkmedia, multimedia
Owner:
gabriel.sargent@irisa.fr
Otis!: Topic segmentation of texts and speech transcripts.
Otis! is a service that performs topic segmentation of texts and automatic speech transcripts. It supports English and French.
Otis! agglomerates contiguous textual segments into topic segments using an enhanced lexical cohesion criterion and a dynamic programming approach. It implements the approach of Utiyama and Isahara [1] and adds a language model interpolation on word counts [2]. The length of the topic segments can be controlled by tuning two parameters, s and p.
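To make the dynamic programming idea concrete, here is a minimal Python sketch of a Utiyama-and-Isahara-style segmenter. It is not the Otis!/IRINTS implementation (in particular it omits the language model interpolation of [2]); the vocab_size and penalty constants are illustrative stand-ins for the kind of control that, per the description above, s and p provide.

import math
from collections import Counter

def segment_cost(words, vocab_size=10000):
    # Negative log-probability of the words under a Laplace-smoothed
    # unigram model estimated incrementally on the segment itself:
    # the more a segment reuses its own vocabulary, the cheaper it is.
    counts, n, cost = Counter(), 0, 0.0
    for w in words:
        cost -= math.log((counts[w] + 1) / (n + vocab_size))
        counts[w] += 1
        n += 1
    return cost

def segment(units, penalty=50.0):
    # units: list of elementary segments, each a list of (filtered) words.
    # best[i] is the minimal cost of segmenting the first i units;
    # the per-segment penalty discourages over-segmentation.
    n = len(units)
    best = [0.0] + [math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            words = [w for u in units[j:i] for w in u]
            c = best[j] + segment_cost(words) + penalty
            if c < best[i]:
                best[i], back[i] = c, j
    boundaries, i = [], n   # backtrack to recover the boundaries
    while i > 0:
        boundaries.append(i)
        i = back[i]
    return sorted(boundaries)

Raising penalty yields fewer, longer topic segments; lowering it makes the segmentation finer.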
This service includes a preprocessing step which estimates the part of speech (POS) and the lemma of each word of the input text and performs a POS-based word filtering. This preprocessing relies mostly on TreeTagger.
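As an illustration of what such POS-based filtering amounts to, the sketch below keeps only content words (nouns, lexical verbs, adjectives) and replaces each by its lemma. This is only an assumed reconstruction: the tag names follow TreeTagger's English tagset, and the (word, pos, lemma) triples are mocked here where they would normally come from a TreeTagger run.

CONTENT_TAG_PREFIXES = ("NN", "VV", "JJ")  # nouns, lexical verbs, adjectives

def filter_tokens(tagged):
    # tagged: list of (word, pos, lemma) triples from a POS tagger.
    return [lemma for word, pos, lemma in tagged
            if pos.startswith(CONTENT_TAG_PREFIXES)]

print(filter_tokens([("The", "DT", "the"),
                     ("networks", "NNS", "network"),
                     ("were", "VBD", "be"),
                     ("segmented", "VVN", "segment")]))
# -> ['network', 'segment']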
[1] Masao Utiyama and Hitoshi Isahara, "A Statistical Model for Domain-Independent Text Segmentation", in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 491–498, 2001.
[2] Camille Guinaudeau, Guillaume Gravier and Pascale Sébillot, "Enhancing Lexical Cohesion Measure with Confidence Measures, Semantic Relations and Language Model Interpolation for Multimedia Spoken Content Topic Segmentation", in Computer Speech and Language, Elsevier, 26(2), pp. 90-104, 2012.
<segment> <segment> <segment> ...

where <segment> can refer to a word, a phrase or a paragraph of text, depending on the scale of analysis considered. In the current version of this service, the length of a segment cannot exceed 256 words. It is assumed that each punctuation mark is followed by a space, and only "classic" apostrophes are taken into account, e.g. "l'Europe" is considered as a sequence of two words: "l'" and "Europe".
<?xml version="1.0" encoding="ISO-8859-1"?>
<ssdoc src="<name_of_document>" version="1.0">
  <ts start="<segment1_start_time>" end="<segment1_end_time>">
    <w str="<word1>" start="<word1_start_time>" end="<word1_end_time>" conf="<confidence_score>" pos="<type_of_word>" lem="<lemma>"/>
    <w str="<word2>" start="<word2_start_time>" end="<word2_end_time>" conf="<confidence_score>" pos="<type_of_word>" lem="<lemma>"/>
    ...
  </ts>
  <ts start="<segment2_start_time>" end="<segment2_end_time>">
    <w str="<word1>" start="<word1_start_time>" end="<word1_end_time>" conf="<confidence_score>" pos="<type_of_word>" lem="<lemma>"/>
    <w str="<word2>" start="<word2_start_time>" end="<word2_end_time>" conf="<confidence_score>" pos="<type_of_word>" lem="<lemma>"/>
    ...
  </ts>
</ssdoc>

where <ts> and </ts> mark out the segments to process.
<segment> <segment>
<---------- segment boundary ---------->
<segment> ...

with "<---------- segment boundary ---------->" being the boundary between topically consistent groups of segments.
<?xml version="1.0" encoding="ISO-8859-1"?>
<ssdoc src="<name_of_document>" version="1.0">
  <segment-collection type="topic">
    <topic-segment start="<topic_1_start_time>" end="<topic_1_end_time>" score=" " nwords=" "/>
    <topic-segment start="<topic_2_start_time>" end="<topic_2_end_time>" score=" " nwords=" "/>
    ...
  </segment-collection>
  <ts start="<segment1_start_time>" end="<segment1_end_time>">
    <w str="<word1>" start="<word1_start_time>" end="<word1_end_time>" conf="<confidence_score>" pos="<type_of_word>" lem="<lemma>"/>
    ...
  </ts>
  <ts start="<segment2_start_time>" end="<segment2_end_time>">
    <w str="<word1>" start="<word1_start_time>" end="<word1_end_time>" conf="<confidence_score>" pos="<type_of_word>" lem="<lemma>"/>
    ...
  </ts>
</ssdoc>

where <topic_k_start_time> and <topic_k_end_time> are the start and end times of the k-th topic segment in seconds.
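The topic segments can then be read back from such a file with a few lines of Python; this is only a sketch, and "result.ssd" is a hypothetical name for a downloaded result file.

import xml.etree.ElementTree as ET

tree = ET.parse("result.ssd")   # hypothetical result file name
for seg in tree.getroot().iter("topic-segment"):
    # print the start/end times, word count and score of each topic segment
    print("topic from", seg.get("start"), "to", seg.get("end"),
          "seconds,", seg.get("nwords"), "words, score", seg.get("score"))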
{ "general_info":{ "src":""<input_file_name>"", "text":{ "duration":"00:00:00", "start":0, "time_unit":"word positions", "words":[ "<first_word_of_the_input_text>", "<second_word_of_the_input_text>", ... ] } }, "otis":{ "annotation_type":"topic segments", "system":"otis", "parameters":"<input_parameters>", "modality":"text", "time_unit":"word positions", "events":[ { "start":<start_position>, "end":<end_position> }, { "start":<start_position>, "end":<end_position> }, ... { "start":<start_position>, "end":<end_position> } ] } }each element of the "events" list being a particular topic segment. Note: when a SSD file is given in input, the time instants of the segments are also reported: the "time_unit" labels are associated to the string "word positions and seconds" instead of "word positions", and two labels are added to each element of the "events" list: "start_time" and "end_time" associated to the times of the beginning and the end of the related topic segment in seconds.
In addition to the choice of language (English or French), the two parameters s and p mentioned above can be adjusted to prevent over-segmentation issues.
Otis! is the online version of IRINTS (Irisa News Topic Segmenter). IRINTS was developed by G. Gravier and C. Guinaudeau at IRISA/Inria Rennes. It is the property of CNRS (DI 03033-01) and Inria and can be supplied under license on a case-by-case basis. The software relies on the TreeTagger tool developed by Helmut Schmid and on the publicly available libxml2 library.
Input:
Fin de partie juridique pour Facebook Le réseau social américain affirmait n’avoir aucun compte à rendre aux tribunaux français en matière de litiges La biosphère terrestre est de plus en plus marquée par l'empreinte de l'Homme il devient de plus en plus difficile d'y trouver des espaces purement naturels
Output:
Fin de partie juridique pour Facebook Le réseau social américain affirmait n’avoir aucun compte à rendre aux tribunaux français en matière de litiges <---------- segment boundary ----------> La biosphère terrestre est de plus en plus marquée par l'empreinte de l'Homme il devient de plus en plus difficile d'y trouver des espaces purement naturels
22/08/2017: Version 1.0.
The id of this app is 4.
The following curl command creates a job and returns your job URL, along with the average execution time.
The files and/or dataset fields are optional; remove them if not wanted:

curl -H 'Authorization: Token token=<your_private_token>' -X POST -F job[webapp_id]=4 -F job[param]="" -F job[queue]=standard -F files[0]=@test.txt -F files[1]=@test2.csv -F job[file_url]=<my_file_url> -F job[dataset]=<my_dataset_name> https://allgo.inria.fr/api/v1/jobs
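For reference, here is a sketch of the same submission with Python's requests library, under the same assumptions as the curl command above (your private token, app id 4, optional files):

import requests

headers = {"Authorization": "Token token=<your_private_token>"}
data = {"job[webapp_id]": "4", "job[param]": "", "job[queue]": "standard"}
files = {"files[0]": open("test.txt", "rb")}   # optional, as in the curl command

response = requests.post("https://allgo.inria.fr/api/v1/jobs",
                         headers=headers, data=data, files=files)
print(response.json())   # job url and average execution time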
Then check your job to get the URLs of its result files with:
curl -H 'Authorization: Token token=<your_private_token>' -X GET https://allgo.inria.fr/api/v1/jobs/<job_id>
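Or equivalently with requests (again a sketch; substitute your token and job id for the placeholders):

import requests

response = requests.get(
    "https://allgo.inria.fr/api/v1/jobs/<job_id>",
    headers={"Authorization": "Token token=<your_private_token>"})
print(response.json())   # contains the urls of the result files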