irisa-text-normalizer

Text normalisation tools from the IRISA lab (https://github.com/glecorve/irisa-text-normalizer)

Synopsis

The tools provided here are split into 3 steps:

1. Tokenisation (adding blanks around punctuation marks, dealing with special cases like URLs, etc.)
2. Generic normalisation (leading to homogeneous texts where (almost) no information has been lost and where tags have been added for some entities)
3. Specific normalisation (projection of the generic texts into specific forms)

Supported languages:

English
French

Commands

LANGUAGE="en" # (or "fr")

Tokenisation

perl bin/$LANGUAGE/basic-tokenizer.pl examples/$LANGUAGE/text.raw > examples/$LANGUAGE/text.tokenized.txt

Generic normalisation

perl bin/$LANGUAGE/start-generic-normalisation.pl examples/$LANGUAGE/text.tokenized.txt > examples/$LANGUAGE/text.norm.step1.txt
# <-- Here you may wish to run some extra tool -->
perl bin/$LANGUAGE/end-generic-normalisation.pl examples/$LANGUAGE/text.norm.step1.txt > examples/$LANGUAGE/text.norm.step2.txt

or simply:

bash bin/$LANGUAGE/generic-normalisation.sh text-normalisation/examples/$LANGUAGE/text.tokenized.txt

Two examples of specific normalisation

perl bin/$LANGUAGE/specific-normalisation.pl cfg/asr.cfg examples/$LANGUAGE/text.norm.step2.txt > examples/$LANGUAGE/text.asr.txt
perl bin/$LANGUAGE/specific-normalisation.pl cfg/tts.cfg examples/$LANGUAGE/text.norm.step2.txt > examples/$LANGUAGE/text.tts.txt
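The three steps can also be chained in a single shell function. The following is only a sketch: the function name `normalise_file`, the `PERL` override and the use of a temporary directory for intermediate files are assumptions made for illustration and are not part of the toolkit. It uses only the four Perl scripts named above and expects to be run from the root of the repository.

```shell
# Sketch: run tokenisation, generic normalisation and one specific
# normalisation in sequence. normalise_file and PERL are hypothetical names.
PERL="${PERL:-perl}"

# usage: normalise_file LANGUAGE INPUT CONFIG OUTPUT
normalise_file() {
    local lang="$1" input="$2" cfg="$3" output="$4"
    local bin="bin/$lang"
    local tmp
    tmp="$(mktemp -d)" || return 1

    # Each step reads the file written by the previous one;
    # the chain stops at the first step that fails.
    "$PERL" "$bin/basic-tokenizer.pl" "$input" > "$tmp/tokenized" &&
    "$PERL" "$bin/start-generic-normalisation.pl" "$tmp/tokenized" > "$tmp/step1" &&
    "$PERL" "$bin/end-generic-normalisation.pl" "$tmp/step1" > "$tmp/step2" &&
    "$PERL" "$bin/specific-normalisation.pl" "$cfg" "$tmp/step2" > "$output"
    local rc=$?

    # Remove the intermediate files and report the status of the chain.
    rm -rf "$tmp"
    return "$rc"
}
```

For example, `normalise_file fr examples/fr/text.raw cfg/asr.cfg examples/fr/text.asr.txt` would write the ASR form of the French sample, assuming the example files are named as in the commands above.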

Create your own configuration for specific normalisation

perl bin/$LANGUAGE/specific-normalisation.pl -h

Input:

text.raw.txt

Ces activités concernent en particulier la compression d'images et de vidéos, tant en recherche fondamentale qu'une activité en standardisation auprès des organismes internationaux (l'ISO pour JPEG et MPEG4). D'autres travaux de Pr. J-M Grumpf et M. Plep concernent l'analyse de i-scènes.

L'IRISA a fêté en octobre 2005 (le 21/10/05, au XXI siècle), à 08h40, ses 30 ans et comprend environ 530 personnes, dont environ 400 chercheurs, dont 180 doctorants et 90 ingénieurs, techniciens, administratifs, etc.

Le centre est très lié à l'UFR Informatique et Electronique de Rennes-I, qui est à moins de 200m.

Output:

text.asr.txt

CES ACTIVITÉS CONCERNENT EN PARTICULIER LA COMPRESSION D' IMAGES ET DE VIDÉOS TANT EN RECHERCHE FONDAMENTALE QU' UNE ACTIVITÉ EN STANDARDISATION AUPRÈS DES ORGANISMES INTERNATIONAUX L' ISO POUR J. PEG ET M. PEG QUATRE D' AUTRES TRAVAUX DE PROFESSEUR J. M. GRUMPF ET MONSIEUR PLEP CONCERNENT L' ANALYSE DE I. SCÈNES

L' IRISA A FÊTÉ EN OCTOBRE DEUX MILLE CINQ LE VINGT ET UN OCTOBRE DEUX MILLE CINQ AU VINGT ET UN SIÈCLE À NEUF HEURES MOINS VINGT SES TRENTE ANS ET COMPREND ENVIRON CINQ CENT TRENTE PERSONNES DONT ENVIRON QUATRE CENTS CHERCHEURS DONT CENT QUATRE VINGTS DOCTORANTS ET QUATRE VINGT DIX INGÉNIEURS TECHNICIENS ADMINISTRATIFS ET CAETERA

LE CENTRE EST TRÈS LIÉ À L' U. F. R. INFORMATIQUE ET ÉLECTRONIQUE DE RENNES UN QUI EST À MOINS DE DEUX CENTS MÈTRES

30/08/2017: Version 1.0, initial version

How to use our REST API:

Remember to check your private token in your account first. You can find more details in our documentation tab.

The app ID is: 164

This curl command creates a job and returns your job URL, along with the average execution time.

The files and/or dataset fields are optional; remove them if you do not need them.

curl -H 'Authorization: Token token=<your_private_token>' -X POST \
  -F 'job[webapp_id]=164' \
  -F 'job[param]=' \
  -F 'job[queue]=standard' \
  -F 'files[0]=@test.txt' \
  -F 'files[1]=@test2.csv' \
  -F 'job[file_url]=<my_file_url>' \
  -F 'job[dataset]=<my_dataset_name>' \
  https://allgo.inria.fr/api/v1/jobs

Then check your job to get the URLs of its files with:

curl -H 'Authorization: Token token=<your_private_token>' -X GET https://allgo.inria.fr/api/v1/jobs/<job_id>
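
The two calls can be wrapped in small shell functions so that the token and the app ID are written only once. This is a sketch under stated assumptions: the function names `allgo_submit` and `allgo_status` and the `ALLGO_TOKEN` and `ALLGO_API` variables are hypothetical, and the optional `job[file_url]` and `job[dataset]` fields are left out. The format of the responses is not described here, so the functions simply print whatever the server returns.

```shell
# Sketch: wrappers around the two curl calls shown above.
# ALLGO_TOKEN, ALLGO_API, allgo_submit and allgo_status are hypothetical names.
ALLGO_API="${ALLGO_API:-https://allgo.inria.fr/api/v1/jobs}"

# usage: allgo_submit [FILE...]
allgo_submit() {
    local i=0 f
    # Turn each file argument into a "-F files[i]=@file" pair: append the
    # pair to the argument list, then drop the original file name.
    for f in "$@"; do
        set -- "$@" -F "files[$i]=@$f"
        shift
        i=$((i + 1))
    done
    curl -H "Authorization: Token token=$ALLGO_TOKEN" -X POST \
        -F "job[webapp_id]=164" \
        -F "job[param]=" \
        -F "job[queue]=standard" \
        "$@" "$ALLGO_API"
}

# usage: allgo_status JOB_ID
allgo_status() {
    curl -H "Authorization: Token token=$ALLGO_TOKEN" -X GET "$ALLGO_API/$1"
}
```

With `ALLGO_TOKEN` set to your private token, `allgo_submit test.txt test2.csv` creates the job and `allgo_status <job_id>` queries it.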