Input Side

Speech Synthesis Independent Protocol

Speech Synthesis Independent Protocol (SSIP) is intended to provide a device-independent layer between applications and speech synthesizers.

SSIP will be an application protocol over TCP/IP. However, there is no reason to constrain SSIP to this architecture only. SSIP should be designed to allow its transmission within any higher-level protocol, such as HTTP.

SSIP should be composed of two TCP connections -- one control connection for transmitting commands and one data connection for the text to be spoken.

SSIP control connection

The control connection will be based on a classical line-based command protocol, similar to FTP or HTTP. The control connection sets the session-global speech properties, which can, however, be changed locally for a message inside its data.
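
As an illustration, an exchange on the control connection might look like the following (the command names and reply codes here are hypothetical; they only demonstrate the line-based request/reply style, not a decided command set):

    C: SET LANGUAGE en
    S: 200 OK
    C: SET PRIORITY 2
    S: 200 OK
    C: SPEAK
    S: 200 OK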

SSIP data connection

The data connection will transfer the raw text to be spoken, as well as inline commands for local speech properties. XML seems to be a good solution for the syntax of these commands (ideally some standard, such as SABLE). Other formats may also be considered.
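
For illustration, a message on the data connection could combine plain text with such inline commands along these lines (a hypothetical fragment loosely modeled on SABLE markup; the actual element names are not fixed by this document):

    <SABLE>
      Your mail has arrived.
      <RATE SPEED="-20%">This sentence is spoken more slowly.</RATE>
    </SABLE>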

Message Control Commands

message

Add a message to the queue.

stop

Stop speaking and empty all message queues.

pause

Stop speaking until continue is received.

continue

Continue paused speech.

cancel

Throw away the currently spoken message and continue with the next message in the queue. This is a way to skip long messages that you do not want to hear to the end for some reason.
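
A minimal sketch of how a client might issue these commands over the control connection follows; the wire syntax, the single-line replies and the localhost:9876 address are assumptions of this sketch, not part of the protocol:

    import socket

    HOST, PORT = "localhost", 9876  # hypothetical Speech Daemon address

    with socket.create_connection((HOST, PORT)) as control:
        replies = control.makefile("r", encoding="utf-8")

        def send_command(command):
            # One command per line; the server is assumed to answer
            # each command with a single status line.
            control.sendall(command.encode("utf-8") + b"\r\n")
            return replies.readline().strip()

        send_command("PAUSE")     # stop speaking until CONTINUE
        send_command("CONTINUE")  # resume paused speech
        send_command("CANCEL")    # throw away the current message
        send_command("STOP")      # stop speaking, empty all queues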

Message Priority System

The possibility to distinguish between several message priority levels seems to be essential. Each message sent by a client to the speech server should have a priority level assigned.

We propose a system of three priority levels. Every message will either contain explicit level information, or the default value will be assumed. There is a separate message queue for each level. The behavior is as follows:

level 1

These messages are spoken immediately as they arrive at the server and are never interrupted. They should be as short as possible, because they block the output of all other messages. When several concurrent messages are received by the server, they are queued and spoken in the order they came. When a new level 1 message arrives while a lower level message is being spoken, the lower level message is canceled and removed from the queue (removed messages are stored in the history, as described in the section called Message History).

level 2

Second level messages are spoken when there is no level 1 message queued. Several level 2 messages are spoken in the order they are received (queued, but in their own queue). This is the default level for messages without explicit level information.

level 3

Third level messages are only spoken when there are no messages of any higher level queued. So they will never be spoken if the output device they are directed to is busy at the moment they arrive. But even if the message is not spoken, it is still copied to the history, as described in the section called Message History.

To make things more complicated, these queues are managed separately for each output device; see the section called Output Device Selection.
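
The queueing rules above can be summarized in code. The following Python sketch is one possible reading of them; the data structures and function names are ours, not part of the protocol:

    from collections import deque

    queues = {1: deque(), 2: deque(), 3: deque()}
    history = []  # every message is logged here in order of arrival

    def enqueue(level, text, device_busy=False):
        history.append(text)  # copied to history regardless of priority
        if level == 1:
            # a level 1 message cancels lower level messages in the queues
            queues[2].clear()
            queues[3].clear()
            queues[1].append(text)
        elif level == 2:
            queues[2].append(text)
        elif level == 3 and not device_busy:
            # level 3 messages arriving while the device is busy are
            # never spoken; they survive only in the history
            queues[3].append(text)

    def next_message():
        # higher levels always win; within one level, FIFO order
        for level in (1, 2, 3):
            if queues[level]:
                return level, queues[level].popleft()
        return None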

Synthesis control

SSIP should provide the following basic primitives to control the way in which the synthesizer handles the input text:

Language selection

Various synthesizers provide different sets of languages they are able to speak. We must be able to receive a request to set a particular language (using an ISO language code) and reply whether the language is supported.

Punctuation mode

Punctuation mode describes the way in which the synthesizer handles non-alphanumeric characters. Most synthesizers support several punctuation modes. We will support a reasonable superset of those modes, which may be emulated in the device driver when not supported by the hardware.

Prosody

The prosody setting allows us to distinguish punctuation characters in spoken text, as we are used to in normal speech. This means the way we pronounce text containing a question mark, comma, period, etc.

Speed

Speech speed is supported by all synthesizers, but the values and their ranges differ. Each output module is responsible for setting the speed to the value best corresponding to the current setting. This may be a bit difficult, because there is no exact scale (see the sketch after this list).

Pitch

Pitch is the voice frequency. We face similar problems here as with the Speed setting.

Voice type

Most synthesizers provide several voice types, such as male, female, child, etc. The set again differs from device to device.

Spelling

Spelling mode is provided by nearly all devices and is also easy to emulate in the output module.

Capital letters recognition

This is again a widely supported feature. However, it would be desirable to support it internally, using the sound icons feature (the section called Sound Icons), but this requires a good possibility of synchronization, which is not available with all devices (as discussed in the section called Synchronization).
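
Since speed and pitch ranges differ between devices and there is no exact scale, one plausible approach is for each output module to map an abstract, device-independent scale onto its own native range. A minimal sketch; the -100..100 scale is our assumption, not anything fixed by this document:

    def map_setting(value, device_min, device_max):
        """Map an abstract setting in -100..100 (0 = middle of the
        device's range) onto the device's native parameter range."""
        value = max(-100, min(100, value))
        midpoint = (device_min + device_max) / 2
        half_span = (device_max - device_min) / 2
        return midpoint + (value / 100) * half_span

    # Example: a synthesizer whose speed runs from 80 to 450 words/min.
    print(map_setting(0, 80, 450))   # -> 265.0 (middle of the range)
    print(map_setting(50, 80, 450))  # -> 357.5 (halfway to the maximum)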

Session Management

One Speech Daemon session is based on two TCP connections, as described in the section called Speech Synthesis Independent Protocol. These connections persist during the whole session. The session is closed by closing the control connection.

Speech Daemon is responsible for tracking the global properties of each session and for switching the context properly when switching between sessions.
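
A rough sketch of what this per-session context tracking could look like on the server side; the class, the property names and the synthesizer interface are all illustrative assumptions:

    DEFAULTS = {"language": "en", "speed": 0, "pitch": 0, "voice": "male"}

    class Session:
        def __init__(self):
            # each session carries its own copy of the global properties
            self.properties = dict(DEFAULTS)

    def switch_to(session, synthesizer):
        # reapply this session's properties before speaking on its behalf
        for name, value in session.properties.items():
            synthesizer.set(name, value)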

Synchronization

A speaking application may need to synchronize its behavior with the speech output. For this purpose we want to make it possible to insert synchronization marks into the spoken text. The idea is as follows: the client inserts a mark, carrying a parameter, at the desired place in the text; when the synthesizer reaches that place, the server sends a notification containing the parameter back to the client over the control connection.

What we called a parameter above may be a simple text string.

This method has several problematic aspects.

First, there are some devices which do not support backward communication, so they cannot inform the output driver at the right time. It is possible to predict the timing of the speech in software, but that does not seem to be a reliable solution.

Another drawback is that the client must keep the connection open for the whole duration of the speech and listen for server messages. However, this follows from the rules of socket communication, which still seems to be the best choice for other reasons.
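
In code, the client side of this scheme might look roughly as follows; the marker element, the "MARK <parameter>" notification line and the port numbers are all assumptions of this sketch:

    import socket

    # Hypothetical addresses of the control and data connections.
    control = socket.create_connection(("localhost", 9876))
    data = socket.create_connection(("localhost", 9877))

    # Embed a mark (here a SABLE-like MARKER element) in the text
    # sent over the data connection.
    data.sendall(b'Chapter one. <MARKER MARK="chapter-1-done"/> Chapter two.')

    # Listen on the control connection until the server reports that
    # the synthesizer has reached the mark.
    for line in control.makefile("r", encoding="utf-8"):
        if line.strip() == "MARK chapter-1-done":
            print("synthesizer finished chapter one")
            break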

Message History

All messages will be copied to the history in the order they are received, regardless of priority.

Messages should be grouped by their originating client, but any client will be able to browse the history of all clients (in addition to browsing its own messages).

The rationale behind allowing a client to browse the messages of all clients is as follows: you work with one client (e.g. Emacs) while messages come from other clients (e.g. a cron script notifying you about new mail). If the Emacs client supports browsing the message history, you can then check all previous new mail notifications from within Emacs.
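
The history can be pictured as a single ordered log tagged by client name; a minimal sketch, with all names being our own illustration:

    history = []  # (client_name, message) pairs, in order of arrival

    def log(client_name, message):
        history.append((client_name, message))

    def browse(client_name=None):
        # A client may browse its own messages, or pass None to
        # browse the complete history of all clients.
        return [m for c, m in history if client_name in (None, c)]

    log("cron-mail", "You have new mail.")
    log("emacs", "Compilation finished.")
    print(browse("cron-mail"))  # -> ['You have new mail.']
    print(browse())             # -> both messages, in order of arrival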

Sound Icons

Output Device Selection

A client must identify itself by a name, which allows Speech Daemon to select the appropriate output device for that client, as discussed in the section called Multiple Output Modules. The client should not deal with the explicit selection of the output device, but it may use different identification names (on several connections) for different kinds of messages, to enable server-side redirection.
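
This server-side redirection can be pictured as a simple table mapping identification names to output modules; a sketch in which every name is hypothetical:

    # Maps client identification names to output modules; clients with
    # unknown names fall back to a default device.
    routing_table = {
        "emacs-speech": "software-synth",
        "emacs-sounds": "sound-card",
        "screen-reader": "hardware-synth",
    }

    def output_device_for(client_name):
        return routing_table.get(client_name, "software-synth")

    print(output_device_for("screen-reader"))  # -> hardware-synth
    print(output_device_for("unknown"))        # -> software-synth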