This file details the procedure to build a new Multisyn voice.


Prerequisites
=============

The speech tools must be compiled with Python wrappers included.
This requires compilation with shared libraries: set SHARED=2 in
speech_tools/config/config.

There is a potential bug in the speech_tools makefiles that causes
the wrappers to be built incorrectly. We suggest you apply the
following fix before compiling the speech tools:

1) Edit speech_tools/Makefile and remove the reference to the directory
"wrappers"
                                                                               
2) Edit speech_tools/wrappers/wrappers.mak and change the last few lines
to:
                                                                               
ifeq ($(DIRNAME),.)
    EXTRA_BUILD_DIRS := $(EXTRA_BUILD_DIRS) wrappers
endif







Initial setup
=============

Edit multisyn_build.sh and ensure the following variables are set
appropriately to point to the places where speech_tools, festival and
festvox are installed:

ESTDIR
FESTIVAL 
FESTVOXDIR

and:
LD_LIBRARY_PATH=$ESTDIR/lib

Now source multisyn_build.sh to configure your shell with these
variables.

  .  ../multisyn_build/multisyn_build.sh
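For reference, the variable settings in multisyn_build.sh amount to
something like the following sketch (the install paths below are
hypothetical; substitute the real locations on your system):

```shell
# Hypothetical install locations -- substitute the real paths on
# your system before sourcing.
export ESTDIR=/usr/local/speech_tools
export FESTIVAL=/usr/local/festival/bin/festival
export FESTVOXDIR=/usr/local/festvox
# Make the speech_tools shared libraries findable at run time.
export LD_LIBRARY_PATH=$ESTDIR/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```

Sourcing the script (as above) exports these into your current shell.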


The following instructions assume that you are working in a voice 
directory called my_voice_data and the build tools are in ../multisyn_build

Run the setup script in your voice directory to create a number of
subdirectories.

  ../multisyn_build/bin/setup



Splitting sound files
=====================

The next few paragraphs describe how to semi-automatically split
recordings containing multiple sentences separated by a beep. If you
are not using this procedure, use whatever method you like, as long
as you end up with individual sentences in separate sound files and a
utts.data file which lists each sound file with its corresponding
text.


For the automatic method, create session and session/session_wav
subdirectories of my_voice_data for the session waveforms.  These are
the long files containing multiple utterances.  Each session file is
assumed to be a series of utterances separated by a half-second tone
at 7kHz (resources/tone.wav can be mixed into your recordings for
this purpose). The tone should occur directly after each correct
sentence. New sentences and restarts should always be preceded by a
short pause; this way restarts will be filtered out automatically.

In the session subdirectory, run:

  ../../multisyn_build/bin/process_session_wave_files session_wav/*.d

This assumes xwaves format files ending with a .d extension. If you
have .wav files, for instance, you will need to change the file
extension in the script accordingly.

This may take a while, but in the end you will have individual wav
files in the session/classify_wav subdirectory.

Check by hand for restarts that have crept in; we suggest checking
any file larger than about 500k (depending on the length of your
sentences). Fix as appropriate: chop off the beginnings of files
which contain restarts, and manually fix any files which are just
wrong.
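One quick way to flag candidate restarts is to list unusually large
split files. A sketch (the mkdir is only there to make the example
self-contained and is harmless if the directory already exists; the
500k threshold is a rough heuristic):

```shell
# List split files larger than 500k; these often contain a restart
# that was not chopped out. Adjust the threshold for your sentences.
mkdir -p session/classify_wav
find session/classify_wav -name '*.wav' -size +500k
```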

Realistically you want to listen to each individual file for
problems; fixing them now (and there will be quite a few) will save
you time later.

Next a festvox utts.data file needs to be created in the
my_voice_data directory. The required format is like this:

( arctic_a0001 "Author of the danger trail, Philip steels, etc" )
( arctic_a0002 "Not at this particular case, Tom, apologized Whittemore." )
( arctic_a0003 "For the twentieth time that evening the two men shook hands." )
( arctic_a0004 "Lord, but I'm glad to see you again, Phil." )

The format is "(" followed by a filename root followed by the text for
that sentence, followed by ")" each on separate lines. This text when
converted to a phone sequence by Festival should match (as closely as
possible) the phone sequence of the speech. With this in mind you
should ensure all words are in your lexicon (if you are using one),
and it is best to write numbers and dates out in full as they were
spoken (e.g. "the ninth of May" rather than "9 May").

You need to create this file whichever waveform recording/splitting
method you are using.
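It is worth sanity-checking the shape of utts.data before going
further. A sketch, using a sample file written inline for
illustration (in practice, point the grep at your real utts.data):

```shell
# Write a small sample utts.data (illustration only).
cat > utts.data <<'EOF'
( arctic_a0001 "Author of the danger trail, Philip steels, etc" )
( arctic_a0002 "Not at this particular case, Tom, apologized Whittemore." )
EOF

# Count lines that do not match the '( name "text" )' shape.
bad=$(grep -v '^( [A-Za-z0-9_]* ".*" )$' utts.data | wc -l)
echo "malformed lines: $bad"
```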

Labelling the data
------------------

The next stage is to generate the segmentation labelling by forced
alignment with HTK. (You could hand label the data at this stage
instead if you wish)

Before you proceed with alignment you should double check that each
sound file matches the text in the utts.data file. Any extra material
in these files will severely affect the performance of the resulting
voice. You can of course always come back to this point in the
process and rebuild the voice, but regenerating all of the required
files is a tedious process.
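A simple cross-check that wav files and utts.data entries match
one-to-one can be sketched as follows (sample data is created inline
so the example is self-contained; run the two loops against your real
wav/ directory and utts.data):

```shell
# Illustrative sample data only.
mkdir -p wav
touch wav/arctic_a0001.wav wav/arctic_a0002.wav
printf '( arctic_a0001 "First sentence." )\n( arctic_a0002 "Second sentence." )\n' > utts.data

status=ok
# Every wav file should have a text entry...
for f in wav/*.wav; do
    b=$(basename "$f" .wav)
    grep -q "^( $b " utts.data || { echo "no text for: $b"; status=bad; }
done
# ...and every entry should have a wav file.
for b in $(sed 's/^( \([^ ]*\).*/\1/' utts.data); do
    [ -f "wav/$b.wav" ] || { echo "no wav for: $b"; status=bad; }
done
echo "check: $status"
```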

Make the directory structure for forced alignment.

../multisyn_build/bin/setup_alignment

Create the files phone_list and phone_substitutions in the alignment
subdirectory. The phone_list file needs to contain a list of the
phones in your phoneset, with the following additions: if `X' is a
stop or affricate, a label `X_cl' should be included to label the
closure portion; `sp' (short pause) needs to be added for inter-word
pauses; and `sil' needs to be added for silence.

The phone_substitutions file contains a list of possible
substitutions that the aligner can make. We generally restrict this
to vowel reduction. The rule "aa ax" says `aa' can be labelled as
`ax' (schwa in the cmu lexicon phoneset).

There are example phone_list and phone_substitutions files for
various lexicons in the resources subdirectory.
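As a sketch, the two files can be assembled along these lines (the
phone sets below are hypothetical; substitute your own phoneset
inventory):

```shell
# Hypothetical phone inventory -- replace with your phoneset.
base_phones="aa ae ax iy uw s z f v m n l r"
stops_affricates="p b t d k g ch jh"

{
  for p in $base_phones $stops_affricates; do echo "$p"; done
  # Closure labels for stops and affricates.
  for p in $stops_affricates; do echo "${p}_cl"; done
  echo sp    # inter-word short pause
  echo sil   # silence
} > phone_list

# Hypothetical vowel-reduction rules: "aa ax" allows `aa' to be
# relabelled as `ax' by the aligner.
printf 'aa ax\nae ax\n' > phone_substitutions
```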

Generate the initial segmentation for doing forced alignment.
The last argument is which lexicon to use. If you are using a
non-standard one you will need to add it to the build_unitsel.scm
script: add an extra section to the cond statement in the
setup_phoneset_and_lexicon function. The script currently supports
cmu, oed (not distributed) and unilex variants.

  $FESTIVAL ../multisyn_build/scm/build_unitsel.scm
    (make_initial_phone_labs "utts.data" "utts.mlf" 'unilex-rpx)

                        [ utts.data - see above
                          utts.mlf  - HTK master label file (output)
                          'unilex-rpx - lexicon to use (see above)   ]

It is suggested that the above step is carried out with the default
voice being a diphone voice in the language being developed. Using a
multisyn voice will add multisyn style pauses into the label files,
which are currently not supported.


Generate MFCCs for alignment:

	../multisyn_build/bin/make_mfccs alignment wav/*.wav

                       [ alignment - alignment directory (created)
		         wav/*.wav - recorded speech files          ] 

Actually do the alignment:
(See note about patching HTK in doc/DOCUMENT)

	cd alignment

	../../multisyn_build/bin/make_mfcc_list ../mfcc ../utts.data train.scp
	../../multisyn_build/bin/do_alignment .


Split mlf alignment file:
   cd ..
  ../multisyn_build/bin/break_mlf alignment/aligned.3.mlf lab

			[ aligned.3.mlf   - final aligned labels 
			  lab             - directory to put labels in]


Generate pitchmarks:

make_pm_wave will generate pitchmarks from a wavefile, whereas
make_pm_eeg will generate them from an eeg signal. Whichever script
you use, you may need to alter some of the parameters to make it work
well. Look for the PM_ARGS line in the script. The min and max values
are the minimum and maximum times between pitch marks (1/f); set
these as appropriate for your speaker.
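As a rough guide, min and max can be derived from the speaker's
expected f0 range; the 80-250Hz range used below is illustrative
only:

```shell
# Pitchmark spacing bounds are 1/f for the extremes of the speaker's
# f0 range. These range values are assumptions -- measure your own.
F0_MAX=250   # Hz, top of the speaker's range
F0_MIN=80    # Hz, bottom of the speaker's range
min=$(awk -v f=$F0_MAX 'BEGIN{printf "%.5f", 1/f}')
max=$(awk -v f=$F0_MIN 'BEGIN{printf "%.5f", 1/f}')
echo "pitchmark spacing: min=$min max=$max seconds"
```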


mkdir pm
../multisyn_build/bin/make_pm_wave -[mf] pm wav/*.wav

                          [  -m   - select default male parameters
                             -f   -  select default female parameters ]


../multisyn_build/bin/make_pm_fix pm/*.pm

				  [ pm  - output directory ]

It is worth taking some time to get reasonable pitchmarks. The
make_pm_lab script in festvox can convert pm files to readable label files.

Calculate information for power factor normalisation of waveforms.

 $FESTVOXDIR/src/general/find_powerfactors lab/*.lab

                              [ lab/*.lab label files from alignment ]

[   An aside....
[
[   Ideally all of the above labelling steps should probably be done with
[   normalised waveforms. However as correct labelling is needed to
[   normalise them, that is not possible. If you want to normalise
[   your waveforms, do the following and then repeat the above
[   steps with the new waveform directory.
[
[     mkdir wav_fn
[       ../multisyn_build/bin/make_wav_powernorm wav_fn wav/*.wav
[
[                                [ wav_fn - output directory ]
[
[   Now go and regenerate mfccs and relabel using wav_fn!                                  
[


Generate utterances. This uses the pitchmarks, and marks segments as
bad if they have no pitchmarks. Remember to use the same lexicon as
for generating the initial labels for alignment.


  $FESTIVAL ../multisyn_build/scm/build_unitsel.scm
    (build_utts "utts.data" 'unilex-rpx)

			    [ utts.data   - input file
                              unilex-rpx  - lexicon      ]


Generate segment duration info. This looks at the distribution of
segment durations and marks outliers as such in the utterances.

  ../multisyn_build/bin/phone_lengths dur lab/*.lab

  $FESTIVAL ../multisyn_build/scm/build_unitsel.scm
    (add_duration_info_utts "utts.data" "dur/durations")

Generate f0 pitch track contours. You will probably have to fine-tune
the parameters that the make_f0 script uses.

  ../multisyn_build/bin/make_f0 -[mf] wav/*.wav

                          [  -m   - select default male parameters
                             -f   -  select default female parameters ]

Generate normalised coefficients for use in join cost:

 ../multisyn_build/bin/make_norm_join_cost_coefs coef f0 mfcc/*.mfcc

Optionally, strip unused time frames from the join cost coefficient
track files; this makes them much smaller, and hence easier to
distribute and faster to load:

  ../multisyn_build/bin/strip_join_cost_coefs coef coef_stripped utt/*.utt		

Generate LPC coefficients:

 $FESTVOXDIR/src/general/make_lpc wav/*.wav   




Pause Module:
-------------

Pauses in multisyn voices are treated as segments (to allow the
inclusion of breath sounds, etc.). This means that a new voice
requires a pause module.

Building Pauses.

The easiest way to make a pause model is to use one from another
voice; it is only silence, after all. Otherwise, here is how to make
your own.

Record (or generate, e.g. with dd if=/dev/zero) some silence.
Create a label file with a series of pause labels (B_150 means a
150ms pause) with silence at each end (e.g. sil B_150 sil).
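Such a label file might look like the following sketch, assuming the
xwaves-style label format used for the other labels (header
terminated by a "#" line, then one line per segment: end time, colour
field, label). All times and the filename are illustrative:

```shell
# Write a minimal pause label file: 0.5s of leading silence, a 150ms
# pause segment, then trailing silence. Times/filename are examples.
mkdir -p lab
cat > lab/pause_001.lab <<'EOF'
separator ;
nfields 1
#
    0.500 26 sil
    0.650 26 B_150
    1.150 26 sil
EOF
```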

Create pitchmarks, join cost coefficients and lpc files as for real speech.

Then build the pause utterances:

  $FESTIVAL ../multisyn_build/scm/build_unitsel.scm
    (build_pauses_utts "utts.pauses" 'unilex-rpx)

If you are just using silence, the waveform can be used in place of
the residual, and the pitchmarks in place of the lpc coefficients.


Defining a new voice:
-------------------- 

We illustrate this by example. Suppose your voice is called:

cstr_edi_awb_arctic_multisyn

cstr         - institution making the voice
edi          - lexicon (e.g. edi = unilex-edi, us = cmu, uk = oald)
awb_arctic   - voice name (arctic is added here as this is borrowed data)
multisyn     - this is a multisyn unit selection voice



Create a new directory in festival/lib/voices-multisyn/english with the name
of your voice.

    mkdir cstr_edi_awb_arctic_multisyn
    cd cstr_edi_awb_arctic_multisyn

Make subdirectories for actual files.

    mkdir festvox awb awb/coef awb/lpc awb/utt

Now copy the following files:

     Source File(s)                     Copy to Directory
     --------------                     -----------------
     utts.data                          awb
     lpc/*                              awb/lpc
     coef/* (or coef_stripped/*)        awb/coef
     utt/*.utt                          awb/utt

You probably need to steal a pause module from somewhere, so copy the
corresponding files for the pause module as well (including
utts.pauses).

Now create a voice definition file for your voice. Your best bet is
to copy the awb file festvox/cstr_edi_awb_arctic_multisyn.scm and
edit it as appropriate.

If you need to build a voice with multiple lexicons, make sure the
variables which store the paths to data are unique in each case.




