About
The etaloncorpuscreator package automatically creates a Russian-language audio corpus from YouTube playlists. It downloads each video's audio and subtitles, builds sound-text pairs, performs forced alignment, and saves the new corpus and its variants.
Installing
Installation requires Python 3.6 or later, a Linux operating system, and sphinx3 on your local machine.
Start
To run etaloncorpuscreator, you should prepare directories for audio tracks, subtitles, and results. You also need to create playlists.txt containing the playlist links, one link per line.
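The preparation step can also be scripted. The following is only a minimal sketch, not part of the package: the helper name prepare_workspace is hypothetical, and the directory names (Audio, Subs, Results, Alignment) are simply the ones used in the Example section.

```python
from pathlib import Path


def prepare_workspace(root, playlist_urls,
                      dirs=("Audio", "Subs", "Results", "Alignment")):
    """Create the working directories and write playlists.txt.

    Hypothetical helper: directory names are taken from the Example
    section and can be changed freely.
    """
    root = Path(root)
    for name in dirs:
        # exist_ok=True makes the call safe to repeat
        (root / name).mkdir(parents=True, exist_ok=True)

    playlist_file = root / "playlists.txt"
    # One playlist link per line, as the program expects
    playlist_file.write_text("\n".join(playlist_urls) + "\n", encoding="utf-8")
    return playlist_file
```

Calling prepare_workspace(".", [...]) with your own playlist links produces the layout that the -p, -a, -s, -r and -ar arguments can then point to.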
Arguments
All of the following arguments are required.
- -p URL_list
Path to the playlists .txt file.
- -a directory_audio
Path to download audiotracks.
- -s directory_subtitles
Path to download subtitles.
- -r directory_results
Path for audio results.
- -am sphinx_model_path
Your acoustic model path.
- -dict dictionary_path
Your dictionary path.
- -dict_f dictionary_filler_path
Your filler dictionary path.
- -ar directory_alignment_results
Path for alignment results.
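Since every argument is required, it can help to check the paths before starting a long download. The sketch below is an illustration only and is not how the package itself validates input: the function check_arguments is hypothetical, and the "file" or "dir" kind assigned to each flag is an assumption (for example, that the acoustic model is a directory and that the alignment results directory already exists).

```python
from pathlib import Path

# Flags are copied from the argument list above.
# The expected kind of each path is an assumption, not documented behaviour.
EXPECTED = {
    "-p": "file",
    "-a": "dir",
    "-s": "dir",
    "-r": "dir",
    "-am": "dir",
    "-dict": "file",
    "-dict_f": "file",
    "-ar": "dir",
}


def check_arguments(values):
    """Return a list of problems; an empty list means all paths look usable."""
    problems = []
    for flag, kind in EXPECTED.items():
        if flag not in values:
            problems.append(f"{flag}: missing (all arguments are required)")
            continue
        path = Path(values[flag])
        exists = path.is_file() if kind == "file" else path.is_dir()
        if not exists:
            problems.append(f"{flag}: {path} is not an existing {kind}")
    return problems
```

Pass a dictionary that maps each flag to the path you intend to use, and fix whatever the returned list reports before running the program.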
Usage
eccr -p URL_list -a directory_audio -s directory_subtitles -r directory_results -am sphinx_model_path -dict dictionary_path -dict_f dictionary_filler_path -ar directory_alignment_results
Example
eccr -p playlists.txt -a Audio -s Subs -r Results -am ./voxforge_ru_sphinx/model_parameters/voxforge_ru.cd_cont_200 -dict ./voxforge_ru_sphinx/voxforge_ru.dic -dict_f ./voxforge_ru_sphinx/voxforge_ru.filler -ar Alignment
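If you prefer to start the program from a Python script, the same command can be assembled as an argument list. This is a sketch under assumptions: build_eccr_command is a hypothetical helper, and actually running the result assumes that eccr is installed and available on your PATH.

```python
import shlex


def build_eccr_command(playlists, audio, subs, results,
                       acoustic_model, dictionary, filler, alignment):
    """Assemble the eccr argument list in the order shown in the Usage section."""
    return [
        "eccr",
        "-p", playlists,
        "-a", audio,
        "-s", subs,
        "-r", results,
        "-am", acoustic_model,
        "-dict", dictionary,
        "-dict_f", filler,
        "-ar", alignment,
    ]


# Values taken from the example above
cmd = build_eccr_command(
    playlists="playlists.txt",
    audio="Audio",
    subs="Subs",
    results="Results",
    acoustic_model="./voxforge_ru_sphinx/model_parameters/voxforge_ru.cd_cont_200",
    dictionary="./voxforge_ru_sphinx/voxforge_ru.dic",
    filler="./voxforge_ru_sphinx/voxforge_ru.filler",
    alignment="Alignment",
)

# Shell-quoted form, useful for logging what will be executed
command_line = " ".join(shlex.quote(part) for part in cmd)
print(command_line)
```

To execute it, pass the list to subprocess.run(cmd, check=True); using a list rather than a single string avoids quoting problems when a path contains spaces.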