Installation Instructions.
==========================

What you need
-------------

A reasonably up-to-date Linux environment.

Software:
  Festival LLSTI pre-release package
  Festival Unit Selection Voice data package
  Festival Unit Selection Voice build tools package
  Festvox package (from www.festvox.org)
  HTK 3.x (or your own forced alignment tools)
  Perl and Python 2 (sorry)
  SWIG 1.3.19
  gcc 2.96 or 3.3.x

Resources:
  A Festival format phoneset definition for your language.
  A Festival format lexicon and/or grapheme-to-phoneme (letter-to-sound)
  rules for your language.


Download and unpack the above packages, then compile speech_tools and
Festival (see below).

To run the English Unit Selection voice you will need a copy of our
Unisyn lexicon.  This currently has a non-commercial use only licence,
which shouldn't be a problem, as you won't need it for other
languages.

Compiling speech_tools and Festival
===================================


Generate speech_tools/config/config (`./configure' or `make info' will
do this), then edit it as appropriate.

 To use gcc296 under RedHat 9:
    Find the `COMPILER =' line and set it to gcc296.
    Add the following right at the end of the file:
      CC = gcc296
      CXX = g++296

You will also need to switch on the wrappers module and set the Python
paths.  Make sure you have the version of SWIG mentioned above.

Optionally, set DEBUG and/or SHARED=2.
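
A sketch of the relevant config edits (the wrappers and Python
variable names below are illustrative assumptions, not the real
switches -- check the comments in speech_tools/config/config itself
for the actual names):

  COMPILER = gcc296             # only if using gcc296 as above
  INCLUDE_MODULES += WRAPPERS   # hypothetical name for the wrappers switch
  PYTHON_INCLUDE = /usr/include/python2.2  # hypothetical Python 2 path
  DEBUG = 1                     # optional
  SHARED = 2                    # optional
  CC = gcc296
  CXX = g++296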


For both speech_tools and festival:
 make info
 make depend
 make
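
Assuming the two trees are unpacked side by side, the whole build
looks something like this (speech_tools must be built before
festival):

  cd speech_tools
  ./configure       # generates config/config; edit as described above
  make info
  make depend
  make
  cd ../festival
  ./configure
  make info
  make depend
  make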

Running the English voice
=========================

Start Festival and type:
  
  (voice_cstr_nina_multisyn)

Then synthesise in the normal way:

  (SayText "hello world!")

Beam pruning levels can be altered for search and selection
respectively with the following:

  (du_voice.set_pruning_beam currentMultiSynVoice 0.25)
  (du_voice.set_ob_pruning_beam currentMultiSynVoice 0.25)

A value of -1 switches pruning off; otherwise the beam width value is
a threshold below the best score.  The default level for both is 0.25.
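
For example, to switch off pruning in the search entirely (slower,
but exhaustive):

  (du_voice.set_pruning_beam currentMultiSynVoice -1)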





Important notes.
================

Diphone selection backoff procedure
-----------------------------------

A list of phone backoffs can be specified. In English this is done on
a per-lexicon basis.

  (du_voice.setDiphoneBackoff currentMultiSynVoice backoff_rules)

The list specifies phone substitutions to try if a diphone can't be
found. A rule of the form (n! n) uses an `n' in place of an `n!', and
would substitute the diphone n_k for n!_k if the latter was not
available.
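
A minimal sketch of setting this up (the backoff_rules list here is
illustrative; use substitutions appropriate to your own phoneset):

  (set! backoff_rules '((n! n)))
  (du_voice.setDiphoneBackoff currentMultiSynVoice backoff_rules)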
  



Things not yet implemented or not fully implemented
---------------------------------------------------

1) Pitch modification
      There are problems with this which we are in the process of
      fixing.

2) Pitch/spectral smoothing
      The implementation of this is not yet complete.




How to build a Unit Selection Voice (v0.01 very rough first draft)
==================================================================


A word of warning
-----------------

The set-up is similar to using Festvox, but is largely untested, as
we've not yet started to build our second voice. So there may be
things here which are too specific to our first voice-building
experience. Some paths are currently hard-wired, and you'll just have
to change them as you come across problems.

Text selection
==============
[...]


Recording your data
===================

Read the Festvox documentation on recording data.
The quality of your recordings is very important; the following
should be considered minimum requirements:

A good (ideally professional) speaker, with a clear and consistent
voice which stays that way after they have been speaking constantly
for two hours.

A good recording set-up: a recording studio that is as acoustically
damped as possible.


We are not going to tell you exactly how to record your data; instead
we will tell you what you need to end up with, and suggest a couple of
ways to get there.

You need a file called `utts.data' which looks something like this:

(file_001 "The cat sat on the mat.")
(file_002 "The rain in Spain falls mainly on the plain.")

Each line consists of an open parenthesis `(' followed by the root of
a filename, followed by a text string in double quotes, followed by a
close parenthesis `)'.  This file defines the database. The filename
root is used for all the additional files, and the text should be that
from which Festival can derive the correct phoneme sequence for that
utterance.

Accompanying utts.data you need a directory called `wav/' which
contains a set of files (`file_001.wav', `file_002.wav', ...) matching
the names specified in utts.data.  Each file should contain just the
stated utterance, with a small amount (about a second will do) of
silence preceding and following the speech.


The first way to record your data is to follow the Festvox procedure
for recording, with or without synthetic prompting. This will nicely
give you data in the required form.

However, we feel that the recording process is made more natural by
using the method described below.

We have recorded `session' files of blocks of about 100-200 sentences
at a time, read from a printed script. At the end of each correctly
read sentence the technician mixes a beep into the recording. These
session files are then automatically split into the final wave files
by searching for the beep to signify the end of an utterance, and then
looking for a long pause to signify the beginning of an utterance.
This method isn't perfect and requires some manual intervention, but
works well for us.

If you are going to use this method we strongly advise the following:
1) Prepare your script in advance and don't change it, or if you do,
   make sure you have a clear record of the order you recorded things
   in.
2) Number the sentences in your script, and number the pages of your
   script. Ring-bind your script if at all possible.
3) To aid lining up the script with the recordings, a recording
   technician should mark the following on their copy of the script as
   recording proceeds:
   i) Restarts, where the subject makes a mistake and restarts. (If
      the silence between the bad example and the restart is short,
      you will need to manually cut out the bad bit at the beginning.)
   ii) Repetitions, which have beeps after each version; you will need
       to decide which to keep, or duplicate the text in the script
       which aligns with the audio files.


Building a Voice
================
The semi-automatic voice building procedure depends highly upon being
able to do the text processing stage of synthesis for your
language/dialect in Festival. This means that you will have to
implement phoneset, lexicon and phrasing before building your voice.

The file NOTES contains our notes on what to run. It should give you
an idea of command line arguments etc.

Automatically labelling the data
================================

The process described here uses HTK. Sphinx could be used as an
alternative, or you could do it by hand.

The general philosophy with the HTK scripts that automate the
labelling is to do things more times than is probably strictly
necessary, as taking longer is better than not converging properly.

The HTK alignment process involves a number of steps.
1) Setup
  i) Generate MFCCs for the speech.
  ii) Generate an initial master label file (MLF)
  iii) Run the setup script

2) Alignment
  i) Run the alignment script

3) Post process
  i) Split the final master label file into individual label files



Setup
-----
This is reasonably straightforward. The MFCCs are the input used to
train the models, the MLF is the known segment sequence, and the setup
script creates directories, config files and prototype models.

The MLF is generated by Festival by synthesising the text supplied in
utts.data as far as a segment stream. The only special things this
does are to substitute all silence segments with the name "sil" and to
add an "sp" segment at the end of each word. "sp" is a short pause
model, used to model optional inter-word silence. Currently these are
dealt with by Festival when it builds the utterance structures later;
we may change this at some point and remove them at an earlier stage.
Also, stops and affricates are split in two, to model the closures and
bursts separately.
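
For illustration, the generated MLF might begin something like this
for file_001 (the phone names are from an English set, and the
closure/burst naming is only illustrative; the setup script determines
the exact conventions):

  #!MLF!#
  "*/file_001.lab"
  sil
  dh
  ax
  sp
  k_cl
  k
  ae
  t_cl
  t
  sp
  ...
  sil
  .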

You currently have to supply the following files yourself:

  phone_list      - a file which lists your phoneset, one phone per
                    line. It is recommended that you don't use unix
                    wild cards in your phone names (e.g. ?*!)

  phone_substitutions     - a list of phone substitutions that the
                            alignment is allowed to make (see below)
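
As an illustration (the phones shown are from an English set, and the
one-substitution-per-line layout of phone_substitutions is an
assumption -- check the alignment scripts for the exact format they
expect), phone_list might begin:

  aa
  ae
  ah
  ...

and phone_substitutions might allow full vowels to reduce to schwa:

  aa ax
  ae ax
  ...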


Alignment
---------

If all goes well, this just works. If it doesn't, you have to work out
why. The scripts produce quite meaningful error messages, but a
working knowledge of HTK helps. The most frequent problems are missing
files. The worst problem we have seen is what we assume to be a
numerical error resulting in a transition matrix row not totalling 1.
This is brought about by converting log probabilities back to
probabilities in some circumstances, particularly when the suggested
phone sequence is quite different from the speech data. If it
occurs with the silence model, then there may be too much silence at
the ends of your utterances. Adding an extra silence label at the start
and end of each utterance sometimes fixes this. We have also seen this
problem with schwa. The only fix we can suggest here is to turn the
HTK error into a warning. To do this edit HModel.c and negate the
numeric error value in the following line:
 HError(7031,"PutTransMat: Row %d of transition mat sum = %f\n",i,rSum);
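
In HTK a negative error number makes HError report a warning and
carry on rather than abort, so the edited line becomes:

 HError(-7031,"PutTransMat: Row %d of transition mat sum = %f\n",i,rSum);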

The phone sequence generated from utts.data needs to be a pretty good
match to what is said, and utts.data should be tweaked to rectify
this. You may also need to add/adjust lexical entries to reflect what
the speaker actually says. HTK will sometimes tell you if you get a
really bad alignment for an utterance. Adding commas, and spelling out
numbers and dates, can often fix problems.

The alignment process allows substitution of some phones for others.
Vowel reduction is the obvious substitution. We have found that
labelling what the speaker actually said is better than labelling
what you expected them to say. (If they said a schwa when you
expected a full vowel, it may sound weird if you use it as if it were
a full vowel.)

You can put whatever substitutions you like in the phone_substitutions
file, but you will have to alter the script that builds the utterances
to deal with your substitutions.

Post alignment processing
-------------------------

This just splits the final MLF up into individual label files.  These
are still a bit messy, as they contain things like zero-length short
pauses.

Building the Utterance Structures
=================================

This should be thought of as the linguistic specification of your
speech database.

Utterances are built by synthesising the text to a segment stream as
before, and then `unifying' this with the output from the alignment
process. This process needs to deal with a number of mismatches that
can occur.

* Stop and affricate closures
  These are concatenated and the end time of the closure is used as
  the diphone join point for this type of segment.
  
* Silences
  Various things are done to assimilate silence names, keep silence
  segments short, and allow insertions and deletions of silence.

* Vowel Reduction
  Decide what to do where the aligner has decided vowel reduction has
  occurred. (What is actually done here is likely to change at some
  point: once backing off when missing diphones are found is possible,
  allowing the aligner the final say is probably appropriate.)


The multisyn unit selection engine
==================================

The multisyn unit selection engine works as a direct replacement for the
diphone engine. Which engine actually gets called depends on how a
particular voice is configured (the file
lib/voices-multisyn/LANG/VOICENAME/festvox/VOICENAME.scm defines
this).

The linguistic processing that occurs is usually identical in each
case (although some of it is currently ignored for multisyn, and is
switched off.)

In some respects the details of the engine itself are unimportant: it
just finds the best candidate sequence that matches a prescribed
target. It is the voice definition and the target and join costs that
are actually important.


Target Cost and Join Cost
=========================

Target cost and join cost are set in the voice configuration.  A
default target cost and a default join cost are implemented in C++ for
speed, and are automatically specified.

The target cost can be replaced with a user-supplied C++ or scheme
target cost (scheme is about 5 times slower than its C++ equivalent).

Target Cost
===========

The default C++ target cost is a C++ class derived from a base class.
Other C++ target costs could also be derived, although there is
currently no interface to switch between different C++ implementations
on the fly.

For development purposes target costs can be implemented in scheme.
A scheme implementation of the default C++ target cost is provided as an example.
(see: festival/lib/multisyn/target_cost.scm)

The default target cost is the weighted, normalised sum of a series of
components, which consider the following: stress, syllable position,
word position, phrase position, part of speech, left and right
phonetic context, and `bad_duration'. There is also a component which
uses the alignment likelihood score, but this is currently not used.

`Bad_duration' is a feature which is set by the utterance building
process and suggests a segment should not be used. Part of speech
assumes that a POS tagger has been run. If either the POS feature or
the bad_duration feature is missing it would not be the end of the
world, and the target cost would still work. However, its worst score
would be less than the theoretical maximum of 1, which would
effectively give more weight to the join cost. If it is known that
certain features are not being used, the weight for the corresponding
target cost component should probably be set to 0.

The target cost associated with a voice can be changed with the function:
  (du_voice.setTargetCost currentMultiSynVoice tc)
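
As a minimal sketch, a toy scheme target cost might penalise only a
stress mismatch between target and candidate (the function below is
purely illustrative; see festival/lib/multisyn/target_cost.scm for the
real interface and the full set of components):

  ;; Return 1 if the stress of the syllables containing the target
  ;; and candidate segments differ, 0 otherwise.
  (define (my_target_cost targ cand)
    (if (string-equal (item.feat targ "R:SylStructure.parent.stress")
                      (item.feat cand "R:SylStructure.parent.stress"))
        0
        1))

  (du_voice.setTargetCost currentMultiSynVoice my_target_cost)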

Join Cost
=========

This is currently a fixed C++ implementation. Other implementations will follow.

Configuring a new voice
=======================

Multisyn voice configurations are kept in festival/lib/voices-multisyn,
as their loading procedure is slightly different from that of diphone
voices.

The major difference is that a multisyn voice can have multiple
configurations using different `modules'.



TODO
====


Other things that we are likely to do before a full public release of
this code and a voice:

* Synthesis API

  - There is currently a lot of reliance on global scheme parameters. We
    want all of these to become parameters local to an individual voice.

  - The flow of control determining which modules are called and when
    is global; this too will become specific to a voice. We would like
    this to be in C++ so that it can be wrapped in other languages as
    well as scheme.

* Voice database

  Synthesis is currently done directly from the raw data files. A
  packing mechanism similar to that used for diphone voices will
  probably be used in the future.

