etaloncorpuscreator

Command-line package for automatically creating a Russian-language audio corpus from YouTube audio tracks and subtitles, using forced alignment with sphinx3.

skipped 305ee562 uploading files · by Daniil Grebenkin

About

The etaloncorpuscreator package automatically creates a Russian-language audio corpus from YouTube playlists: it downloads each video's audio and subtitles, builds "sound-text" pairs, runs forced alignment, and saves the new corpus and varieties.
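The "sound-text" pairing step can be illustrated with a minimal sketch. It is not the package's actual code: the function name `pair_audio_with_subtitles` is hypothetical, and it assumes that a downloaded audio file and its subtitle file share the same base name, which the package may or may not guarantee.

```python
from pathlib import Path


def pair_audio_with_subtitles(audio_dir, subs_dir):
    """Return (audio_path, subtitle_path) pairs for files sharing a base name.

    Assumption: "video1.wav" in audio_dir belongs with "video1.<ext>" in
    subs_dir. Files without a partner are silently skipped.
    """
    # Index subtitle files by their name without the extension.
    subtitles = {p.stem: p for p in Path(subs_dir).iterdir() if p.is_file()}

    pairs = []
    for audio in sorted(Path(audio_dir).iterdir()):
        if audio.is_file() and audio.stem in subtitles:
            pairs.append((audio, subtitles[audio.stem]))
    return pairs
```

Each pair returned this way would then be a candidate input for forced alignment.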

Installing

Installation requires Python 3.6 or later, a Linux OS, and sphinx3 on your local machine.
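A quick way to verify these prerequisites before installing is sketched below. The helper `check_prerequisites` is hypothetical and not part of the package. It also assumes that the sphinx3 installation exposes the `sphinx3_align` executable on `PATH`; pass a different binary name if your setup differs.

```python
import platform
import shutil
import sys


def check_prerequisites(aligner_binary="sphinx3_align"):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python 3.6 or later is required")
    if platform.system() != "Linux":
        problems.append("a Linux OS is required")
    # Assumption: sphinx3 is considered installed if its aligner is on PATH.
    if shutil.which(aligner_binary) is None:
        problems.append("{} was not found on PATH".format(aligner_binary))
    return problems
```

Calling `check_prerequisites()` and printing each returned string gives a short checklist of what still needs to be set up.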

Start

Before running etaloncorpuscreator, prepare directories for audio tracks, subtitles, and results. Also create a playlists.txt file with the playlist links, one link per line.
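The preparation step could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the directory names `Audio`, `Subs`, `Results`, and `Alignment` are borrowed from the example at the end of this README, and the playlist links are whatever you pass in.

```python
from pathlib import Path

# Directory names taken from the README example; any names would work.
DEFAULT_DIRS = ("Audio", "Subs", "Results", "Alignment")


def prepare_workspace(root, playlist_links, dir_names=DEFAULT_DIRS):
    """Create the working directories and write playlists.txt, one link per line."""
    root = Path(root)
    for name in dir_names:
        (root / name).mkdir(parents=True, exist_ok=True)
    playlists_file = root / "playlists.txt"
    playlists_file.write_text("\n".join(playlist_links) + "\n", encoding="utf-8")
    return playlists_file


def read_playlist_links(playlists_file):
    """Read the links back, ignoring blank lines and surrounding whitespace."""
    text = Path(playlists_file).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, `prepare_workspace(".", links)` with a list of your own playlist URLs creates the four directories and `playlists.txt` in the current directory.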

Arguments

All arguments are required.

  1. -p URL_list

Path to the text file with playlist links.

  2. -a directory_audio

Directory where audio tracks are downloaded.

  3. -s directory_subtitles

Directory where subtitles are downloaded.

  4. -r directory_results

Directory for the audio results.

  5. -am sphinx_model_path

Path to your acoustic model.

  6. -dict dictionary_path

Path to your dictionary.

  7. -dict_f dictionary_filler_path

Path to your filler dictionary.

  8. -ar directory_alignment_results

Directory for the alignment results.
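For readers who want to see how such an interface behaves, the eight options above can be mirrored with Python's standard `argparse` module. This is only an illustrative sketch: the package's real parser is not shown in this README, and the `dest` attribute names used here are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the eight required options listed above."""
    parser = argparse.ArgumentParser(prog="eccr")
    # (flag, attribute name, help text); attribute names are illustrative.
    options = [
        ("-p", "url_list", "path to the text file with playlist links"),
        ("-a", "directory_audio", "directory for downloaded audio tracks"),
        ("-s", "directory_subtitles", "directory for downloaded subtitles"),
        ("-r", "directory_results", "directory for the audio results"),
        ("-am", "sphinx_model_path", "path to the acoustic model"),
        ("-dict", "dictionary_path", "path to the dictionary"),
        ("-dict_f", "dictionary_filler_path", "path to the filler dictionary"),
        ("-ar", "directory_alignment_results", "directory for alignment results"),
    ]
    for flag, dest, help_text in options:
        # required=True makes argparse exit with an error if the flag is missing.
        parser.add_argument(flag, dest=dest, required=True, help=help_text)
    return parser
```

Because every option is declared with `required=True`, omitting any of them makes the parser print a usage message and exit.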

Usage

eccr -p URL_list -a directory_audio -s directory_subtitles -r directory_results -am sphinx_model_path -dict dictionary_path -dict_f dictionary_filler_path -ar directory_alignment_results

Example

eccr -p playlists.txt -a Audio -s Subs -r Results -am ./voxforge_ru_sphinx/model_parameters/voxforge_ru.cd_cont_200 -dict ./voxforge_ru_sphinx/voxforge_ru.dic -dict_f ./voxforge_ru_sphinx/voxforge_ru.filler -ar Alignment
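If you prefer to launch the tool from a Python script, the same command line can be assembled programmatically. The helpers below are a hypothetical sketch, not part of the package; they only build and format the argument list and do not run anything.

```python
import shlex


def build_eccr_command(url_list, audio_dir, subs_dir, results_dir,
                       model_path, dictionary_path, filler_path, alignment_dir):
    """Return the eccr invocation as an argument list suitable for subprocess."""
    return [
        "eccr",
        "-p", url_list,
        "-a", audio_dir,
        "-s", subs_dir,
        "-r", results_dir,
        "-am", model_path,
        "-dict", dictionary_path,
        "-dict_f", filler_path,
        "-ar", alignment_dir,
    ]


def format_command(argv):
    """Render the argument list as a shell-safe string for logging."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

The returned list can be passed to `subprocess.run(argv, check=True)` once `eccr` is installed, and `format_command` reproduces the example line above when given the same paths.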