Rules files are the foundation of this method of transcription. I believe the method is flexible enough to cover the common 80% of cases, and possibly more. UK braille seems to be built on exceptions, so the software must be able to handle exceptions. First, a brief outline of the way the rules are used.
The rules file is loaded into memory at startup by parsing the
XML file appropriate to the language in use. I've used
ENrules.xml for the English rules file. Rules are
grouped by the first letter of the input field. Each rule has a rule number,
either read from the rules file or calculated as the rules are read in
from disk. The number itself is not important, and no implementation should
come to depend on it: as experience grows, more rules will be
added and rule numbers will change. XSLT will help with renumbering of
rules if needed. To speed up access, a hashtable is
generated, keyed on the first letter of the first rule in each
group. The rules themselves are read into an array of Rule objects, which
is the primary access structure at runtime.
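The loading step above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the class name RuleIndexSketch, the cut-down Rule class and the method firstRuleFor() are all assumptions, and the XML parsing is replaced by a plain list of input strings in file order.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: an array of rules plus a first-letter index into it. */
final class RuleIndexSketch {

    /** Cut-down stand-in for the real Rule class (assumed field names). */
    static final class Rule {
        final String input;
        final int ruleNum; // calculated while loading; nothing should depend on it

        Rule(String input, int ruleNum) {
            this.input = input;
            this.ruleNum = ruleNum;
        }
    }

    private final Rule[] rules;
    private final Map<Character, Integer> groupStart = new HashMap<>();

    /** Takes the rule inputs in file order, i.e. already grouped by first letter. */
    RuleIndexSketch(List<String> inputsInFileOrder) {
        rules = new Rule[inputsInFileOrder.size()];
        for (int i = 0; i < rules.length; i++) {
            rules[i] = new Rule(inputsInFileOrder.get(i), i);
            char key = Character.toUpperCase(rules[i].input.charAt(0));
            // Only the first rule seen for each letter is recorded.
            groupStart.putIfAbsent(key, i);
        }
    }

    /** Index of the first rule in the group for c, or -1 if there is no group. */
    int firstRuleFor(char c) {
        Integer start = groupStart.get(Character.toUpperCase(c));
        return start == null ? -1 : start;
    }

    Rule rule(int index) {
        return rules[index];
    }
}
```

The -1 return value is one way of signalling a character with no group at all; how the real code reports that case is not shown here.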
When attempting to match input being transcribed, a call to
getRuleNumber() is made. First, the first rule in the group is found
using the hashtable. The code then iterates through the rules array
from that point, attempting a match. Each rule is tried in turn, repeating a fixed
sequence. Understanding this sequence is the key to adding rules correctly.
The input text being matched is split into two parts: a two-character leftContext and the main text (normally a word terminated by whitespace or punctuation). The first match is made against the leftContext, using a regular expression (regex) obtained from the lftContext field of the rule. The match must extend to the end of the leftContext. The Java regex method used is find(). Hence the code reads
patt1 = Pattern.compile(rules[ruleNumber].LEFTCONTEXT + "$",
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
...
// (1) Check for match on left context
m = patt1.matcher(prevString);
if (m.find()) {
Note that such matches are case insensitive and Unicode aware, as are all rules (please use UTF-8 or UTF-16 encoding). find() produces a match irrespective of position; the $ ensures that the match runs up to the end of the leftContext. If this match succeeds, a second match is sought. This uses a Pattern built from the rule's input field followed by its rtContext field. The text used is the source text being transcribed.
regex = "^" + rules[ruleNumber].INPUT
+ rules[ruleNumber].RIGHTCONTEXT; // ignore case
patt2 = Pattern.compile(regex, Pattern.CASE_INSENSITIVE
| Pattern.UNICODE_CASE);
....
if (m.find()) {
//(2) Check for match on input text, as far as possible
m = patt2.matcher(text);
if (m.lookingAt()) {
// then we have found the rule
term = true; // exit the loop, match found
}
This time the match must begin at the start of the input, so the lookingAt() method is used, which requires a match from the start of the input string. A match on both of these regexes produces a match for the given rule. The exception is grade one processing: see the code below the matching in getRuleNumber(), which is necessary to ensure no contractions take place for grade one.
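Put together, the two-stage match might look like the sketch below. It is a simplified, self-contained illustration: the class MatchSketch and its cut-down Rule are hypothetical, the grade one handling is left out, and the loop relies on the last rule of each group always matching, so it simply runs to the end of the array.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the two-stage rule match (grade one handling omitted). */
final class MatchSketch {
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    /** Cut-down stand-in for the real Rule class (assumed field names). */
    static final class Rule {
        final String lftContext;
        final String input;
        final String rtContext;

        Rule(String lftContext, String input, String rtContext) {
            this.lftContext = lftContext;
            this.input = input;
            this.rtContext = rtContext;
        }
    }

    static boolean matches(Rule r, String prevString, String text) {
        // (1) Left context: find() anywhere, but "$" pins the match to the end.
        Pattern left = Pattern.compile(r.lftContext + "$", FLAGS);
        if (!left.matcher(prevString).find()) {
            return false;
        }
        // (2) Input plus right context: lookingAt() anchors at the start of text.
        Pattern main = Pattern.compile(r.input + r.rtContext, FLAGS);
        return main.matcher(text).lookingAt();
    }

    /** Index of the first matching rule at or after groupStart, or -1 if none. */
    static int getRuleNumber(Rule[] rules, int groupStart, String prevString, String text) {
        for (int i = groupStart; i < rules.length; i++) {
            if (matches(rules[i], prevString, text)) {
                return i;
            }
        }
        return -1;
    }
}
```

Compiling the patterns on every call keeps the sketch short; caching the compiled Pattern objects per rule would be the obvious optimisation.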
This sequence means that the ordering of rules is important. The general principle is that longer matches precede shorter ones, with the bare letter as the very last rule in a group. Take this extract from the ENrules.xml file, for instance:
<rule>
<lftContext>[\s"]</lftContext>
<input>AND WITH</input>
<rtContext>\s</rtContext>
<output>&amp;</output>
<inputShift>4</inputShift>
<ruleNum>100</ruleNum>
</rule>
<rule>
<lftContext>.</lftContext>
<input>AND</input>
<rtContext>.</rtContext>
<output>&amp;</output>
<inputShift>3</inputShift>
<ruleNum>101</ruleNum>
</rule>
....
<rule>
<lftContext>.</lftContext>
<input>ASTHMA</input>
<rtContext>.</rtContext>
<output>AS?MA</output>
<inputShift>6</inputShift>
<ruleNum>105</ruleNum>
</rule>
<rule>
<lftContext>[A-Z]</lftContext>
<input>ATION</input>
<rtContext>.</rtContext>
<output>,N</output>
<inputShift>5</inputShift>
<comment>// Final Groupsign</comment>
<ruleNum>106</ruleNum>
</rule>
<rule>
<lftContext>[-0-9]</lftContext>
<input>A</input>
<rtContext>[^-A-Z]</rtContext>
<output>;A</output>
<inputShift>1</inputShift>
<ruleNum>107</ruleNum>
</rule>
<rule>
<lftContext>[0-9]</lftContext>
<input>A</input>
<rtContext>.</rtContext>
<output>;A</output>
<inputShift>1</inputShift>
<ruleNum>108</ruleNum>
</rule>
<rule>
<lftContext>.</lftContext>
<input>A</input>
<rtContext>.</rtContext>
<output>A</output>
<inputShift>1</inputShift>
<ruleNum>109</ruleNum>
</rule>
Note that the final rule must match on the bare letter A, or the engine will fail, having tried the last rule in the group without success. I have found it necessary to add a test case for each new rule added.
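One kind of check that may help when adding rules is a rough test for ordering mistakes within a group. The sketch below is a heuristic only, and every name in it is hypothetical: it flags a rule whose input begins with the input of an earlier rule that has catch-all contexts ("." on both sides), since the earlier rule would normally match first. It treats the input fields as plain text, not as regexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical heuristic check for rules that are probably never reached. */
final class OrderingCheckSketch {

    /** Cut-down stand-in for the real Rule class (assumed field names). */
    static final class Rule {
        final String lftContext;
        final String input;
        final String rtContext;

        Rule(String lftContext, String input, String rtContext) {
            this.lftContext = lftContext;
            this.input = input;
            this.rtContext = rtContext;
        }
    }

    /** Returns one message per rule that an earlier catch-all rule probably shadows. */
    static List<String> findShadowed(List<Rule> group) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < group.size(); i++) {
            Rule earlier = group.get(i);
            boolean catchAll = ".".equals(earlier.lftContext) && ".".equals(earlier.rtContext);
            if (!catchAll) {
                continue; // a restrictive context may legitimately let later rules through
            }
            String prefix = earlier.input.toUpperCase(Locale.ROOT);
            for (int j = i + 1; j < group.size(); j++) {
                String later = group.get(j).input.toUpperCase(Locale.ROOT);
                if (later.startsWith(prefix)) {
                    problems.add(group.get(j).input + " is probably shadowed by " + earlier.input);
                }
            }
        }
        return problems;
    }
}
```

For the extract above, the order AND WITH, AND, A produces no warnings, while placing the bare A rule ahead of AND would produce one.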
Having found a match, the processing sequence is to emit the rule's output field in place of the matched text, then move the working point onwards from the start of the match by the inputShift length. The process is then repeated (this is in the translate() method of the TextToBraille class).
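The loop just described could be sketched as below. This is not the real translate() method: the class TranslateSketch is hypothetical, the group lookup is replaced by a linear scan over all rules, and two assumptions are made about the edges of the input, namely that the left context is padded with two spaces at the start and that one trailing space is appended so that a right context of "." can match at the end.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the match / emit / shift loop. */
final class TranslateSketch {
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    /** Cut-down stand-in for the real Rule class (assumed field names). */
    static final class Rule {
        final String lftContext, input, rtContext, output;
        final int inputShift;

        Rule(String lftContext, String input, String rtContext, String output, int inputShift) {
            this.lftContext = lftContext;
            this.input = input;
            this.rtContext = rtContext;
            this.output = output;
            this.inputShift = inputShift;
        }
    }

    /** First rule whose left context and input + right context both match, else null. */
    static Rule firstMatch(Rule[] rules, String prev, String text) {
        for (Rule r : rules) {
            boolean leftOk = Pattern.compile(r.lftContext + "$", FLAGS).matcher(prev).find();
            if (leftOk && Pattern.compile(r.input + r.rtContext, FLAGS).matcher(text).lookingAt()) {
                return r;
            }
        }
        return null;
    }

    static String translate(Rule[] rules, String source) {
        StringBuilder out = new StringBuilder();
        int pos = 0; // the working point
        while (pos < source.length()) {
            // Assumption: pad with spaces so there are always two context characters.
            String padded = "  " + source.substring(0, pos);
            String prev = padded.substring(padded.length() - 2);
            // Assumption: a trailing space stands in for the end of the input.
            String text = source.substring(pos) + " ";
            Rule r = firstMatch(rules, prev, text);
            if (r == null) {
                throw new IllegalStateException("No rule matches at offset " + pos);
            }
            out.append(r.output);  // emit the rule's output
            pos += r.inputShift;   // step on by inputShift, not by the match length
        }
        return out.toString();
    }
}
```

Stepping by inputShift instead of the match length is what lets a rule such as AND WITH emit the sign for AND and leave WITH for the next pass. A rule with an inputShift of zero would make this loop spin forever, so a real implementation would want to guard against that.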
To assist the generation of tables for other languages, I've included a suite of rules in a file named XXrules.xml in the rules directory of the source. Apart from the punctuation, this may be used to start testing with a simple transcription in which the output matches the input.
Until I resolved the table bugs I kept rolling off the end of the table without finding a match. Easy enough to do. Just input a character for which there is no first rule in any group. This bothered me to the point where I've done something about it. I'm not sure it's rock solid but it seems to be a working solution until I get something better. [Could you help?]
Proposed solution (partially coded).
Using the hashtable, I find that the first character of my input string isn't in the table. Move on. I've created another HashMap of what I'm currently calling special characters, for example é, which is seen as special in English (I agree it may not be special in your alphabet). This map, held in the locale-specific braille file (e.g. ENBrailleTable.java), maps from é to its braille equivalent string. A small inner class is then returned containing the number of characters to move on within the source document and the output string to produce for this character. This is used to add to the output and to step on within the input. All this is an update to the translate() method within the TextToBraille class. My final proposal is pretty desperate. The aim is not to fall over unless absolutely necessary: log an error message, output some character indicating an error in the output file, and move on. So far I'm unable to find a better solution.
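As a rough illustration of this fallback chain, here is a sketch. The names FallbackSketch, Step, addSpecial() and ERROR_MARKER are all hypothetical, as is the marker string itself, and the braille string for any special character has to be supplied by the locale table; none is assumed here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

/** Hypothetical sketch of the fallback used when no rule group exists for a character. */
final class FallbackSketch {
    private static final Logger LOG = Logger.getLogger(FallbackSketch.class.getName());

    /** Assumed placeholder written to the output when nothing else applies. */
    static final String ERROR_MARKER = "{ERR}";

    /** Small result holder: how far to step on, and what to emit. */
    static final class Step {
        final int advance;
        final String output;

        Step(int advance, String output) {
            this.advance = advance;
            this.output = output;
        }
    }

    private final Map<Character, String> specialCharacters = new HashMap<>();

    /** Registers a special character; in the text this map lives in the locale braille table. */
    void addSpecial(char c, String braille) {
        specialCharacters.put(c, braille);
    }

    /** Called only after the rule lookup has found no group for c. */
    Step fallback(char c) {
        String braille = specialCharacters.get(c);
        if (braille != null) {
            return new Step(1, braille);
        }
        // Last resort: log, emit a visible marker, and keep going.
        LOG.warning("No rule or special character entry for U+" + String.format("%04X", (int) c));
        return new Step(1, ERROR_MARKER);
    }
}
```

Returning a Step in both cases means the caller needs no special handling for the error path: it appends the output and advances exactly as it would for a normal rule.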