About

The etaloncorpuscreator package automatically creates a Russian-language audio corpus from YouTube playlists: it downloads each video's audio track and subtitles, builds "sound-text" pairs, performs forced alignment, and saves the new corpus and its varieties.

Installing

Installation requires Python 3.6 or later, Linux, and sphinx3 on your local machine.

Start

Before running etaloncorpuscreator, prepare directories for audio tracks, subtitles, and results. Also create a playlists.txt file containing the playlist links, one link per line.
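The preparation step can be scripted. The following is a minimal sketch, not part of the package: the helper name `prepare_workspace` is hypothetical, and the directory names (Audio, Subs, Results, Alignment) are simply the ones used in the Example section, so any other names would work as long as they match the arguments passed to eccr.

```python
from pathlib import Path


def prepare_workspace(root, playlist_urls):
    """Create the working directories and write playlists.txt under `root`.

    Hypothetical helper: the directory names follow the README example
    and are an assumption, not something the package enforces.
    """
    root = Path(root)
    # One directory each for audio tracks, subtitles, results, alignment results.
    for name in ("Audio", "Subs", "Results", "Alignment"):
        (root / name).mkdir(parents=True, exist_ok=True)

    # Every playlist link goes on its own line.
    playlist_file = root / "playlists.txt"
    playlist_file.write_text(
        "".join(url + "\n" for url in playlist_urls), encoding="utf-8"
    )
    return playlist_file
```

Calling `prepare_workspace(".", [...])` with your own playlist links from the directory where you intend to run eccr produces a layout that matches the paths in the Example section.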

Arguments

All arguments are required.

  1. -p URL_list

Playlists txt-file path.

  2. -a directory_audio

Path to download audiotracks.

  3. -s directory_subtitles

Path to download subtitles.

  4. -r directory_results

Path for audio results.

  5. -am sphinx_model_path

Your acoustic model path.

  6. -dict dictionary_path

Your dictionary path.

  7. -dict_f dictionary_filler_path

Your dictionary filler path.

  8. -ar directory_alignment_results

Path for alignment results.
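To make the argument list concrete, here is a sketch of how the eight flags could be declared with Python's standard argparse module. It is an illustration only: the package's real parser may be written differently, and the function name `build_parser` and the attribute names on the parsed result are assumptions derived from the placeholder names above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the eight documented eccr flags."""
    parser = argparse.ArgumentParser(prog="eccr")
    # (flag, placeholder name, help text), taken from the list above.
    options = [
        ("-p", "URL_list", "playlists txt-file path"),
        ("-a", "directory_audio", "path to download audiotracks"),
        ("-s", "directory_subtitles", "path to download subtitles"),
        ("-r", "directory_results", "path for audio results"),
        ("-am", "sphinx_model_path", "acoustic model path"),
        ("-dict", "dictionary_path", "dictionary path"),
        ("-dict_f", "dictionary_filler_path", "dictionary filler path"),
        ("-ar", "directory_alignment_results", "path for alignment results"),
    ]
    for flag, name, help_text in options:
        # required=True reflects that every argument must be supplied.
        parser.add_argument(
            flag, dest=name.lower(), metavar=name, required=True, help=help_text
        )
    return parser


# Parse the command line from the Example section (without the program name).
example = (
    "-p playlists.txt -a Audio -s Subs -r Results "
    "-am ./voxforge_ru_sphinx/model_parameters/voxforge_ru.cd_cont_200 "
    "-dict ./voxforge_ru_sphinx/voxforge_ru.dic "
    "-dict_f ./voxforge_ru_sphinx/voxforge_ru.filler -ar Alignment"
).split()
args = build_parser().parse_args(example)
print(args.url_list, args.directory_alignment_results)
```

In this sketch, omitting any flag makes argparse exit with an error message naming the missing arguments, which matches the requirement that all of them be given.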

Usage

eccr -p URL_list -a directory_audio -s directory_subtitles -r directory_results -am sphinx_model_path -dict dictionary_path -dict_f dictionary_filler_path -ar directory_alignment_results

Example

eccr -p playlists.txt -a Audio -s Subs -r Results -am ./voxforge_ru_sphinx/model_parameters/voxforge_ru.cd_cont_200 -dict ./voxforge_ru_sphinx/voxforge_ru.dic -dict_f ./voxforge_ru_sphinx/voxforge_ru.filler -ar Alignment