11 people have responded so far (Fri Aug 15 11:14:20 1997):
Table 1 shows how much each version of the NIST/UNIPEN database has been used. The latest training data version (train_r01_v07) has been used the most, as has the most recent development test set version (devtest_r01_v02). Note that the word "release" is ambiguous and is avoided here. A new release will contain new and fresh data, whereas until today we have only seen versions. Once the benchmark process ('the RECMARK factory') starts to run, we will need new releases with uncontaminated data. Please concentrate your work on the most recent versions, which will be of higher quality than older versions. The success of this WWW-form based query has stimulated us to think of a similar form-based approach for reporting database errors. In the near future, annotation errors may be reported on a WWW form and will be automatically forwarded to NIST.
| Version | not seen | just looked at | used |
|---|---|---|---|
| train_r01_v01: | 5 | 4 | 2 |
| train_r01_v02: | 6 | 4 | 1 |
| train_r01_v03: | 7 | 4 | 0 |
| train_r01_v04: | 6 | 3 | 2 |
| train_r01_v05: | 5 | 3 | 3 |
| train_r01_v06: | 6 | 3 | 2 |
| devtest_r01_v01: | 8 | 3 | 0 |
| train_r01_v07: | 4 | 0 | 7 |
| devtest_r01_v02: | 4 | 3 | 4 |
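
The counts in Table 1 are small enough to check by hand, but a short tally makes the comparison explicit. The following is only a sketch in Python, with the rows copied from Table 1; the names `USAGE` and `most_used` are illustrative and not part of any UNIPEN tool.

```python
# Rows copied from Table 1: version -> (not seen, just looked at, used).
USAGE = {
    "train_r01_v01": (5, 4, 2),
    "train_r01_v02": (6, 4, 1),
    "train_r01_v03": (7, 4, 0),
    "train_r01_v04": (6, 3, 2),
    "train_r01_v05": (5, 3, 3),
    "train_r01_v06": (6, 3, 2),
    "devtest_r01_v01": (8, 3, 0),
    "train_r01_v07": (4, 0, 7),
    "devtest_r01_v02": (4, 3, 4),
}


def most_used(prefix):
    """Return the version starting with `prefix` that has the highest 'used' count."""
    candidates = {v: counts[2] for v, counts in USAGE.items() if v.startswith(prefix)}
    return max(candidates, key=candidates.get)


# Every row should account for all 11 respondents.
all_rows_complete = all(sum(counts) == 11 for counts in USAGE.values())

print(all_rows_complete, most_used("train"), most_used("devtest"))
```

Each row adds up to the 11 respondents, and the most used versions are the most recent ones in both categories.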
Table 2 displays the distribution of interest in participating in the different benchmark subtests among the respondents. First of all, note that N is too small to draw statistically reliable conclusions. As an example, with N=11 and five categories, a count of 7 vs 2 starts to be significant at alpha=.05. Columns 3-7 span all respondents (sum=11); column 8 was optional. There is a cluster of interest around the isolated digits and letters (tests 1a, 1b, 1c) and isolated mixed-style words (test 6). As some may recall from the Colchester meeting, the most interest was focused on the isolated words. One respondent said they were out if the first benchmark only involved isolated characters. 'Willing to try' yields a maximum for test 6, which also does not have a single 'never' rating. Second in line are 1a (isolated digits) and 2 (isolated characters, mixed case). A promising observation is that more groups are ready to start within a few months than a year ago! All in all, the results are better than expected.
| Benchmark subtest | | ready now! | within 6 months | within a year | later | never | willing to try |
|---|---|---|---|---|---|---|---|
| 1a | isolated digits | 5 | 2 | 0 | 1 | 3 | 6 |
| 1b | isolated upper case | 4 | 3 | 1 | 1 | 2 | 0 |
| 1c | isolated lower case | 5 | 2 | 1 | 1 | 2 | 0 |
| 1d | isolated symbols (punctuations etc.) | 1 | 2 | 0 | 2 | 6 | 4 |
| 2 | isolated characters, mixed case | 3 | 3 | 0 | 2 | 3 | 6 |
| 3 | isolated characters in the context of words or texts | 3 | 2 | 1 | 1 | 4 | 5 |
| 4 | isolated printed words, not mixed with digits and symbols | 2 | 1 | 1 | 4 | 3 | 4 |
| 5 | isolated printed words, full character set | 1 | 2 | 1 | 3 | 4 | 4 |
| 6 | isolated cursive or mixed-style words (without digits and symbols) | 4 | 2 | 3 | 2 | 0 | 8 |
| 7 | isolated words, any style, full character set | 1 | 2 | 2 | 4 | 2 | 4 |
| 8 | text: (minimally two words of) free text, full character set | 0 | 1 | 2 | 5 | 3 | 1 |
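
The observations drawn from Table 2 can be re-checked mechanically. What follows is a minimal sketch, assuming the rows are transcribed correctly from Table 2; the variable names (`INTEREST`, `soon`, and so on) are made up for this illustration.

```python
# Rows copied from Table 2:
# test -> (ready now, within 6 months, within a year, later, never, willing to try)
INTEREST = {
    "1a": (5, 2, 0, 1, 3, 6),
    "1b": (4, 3, 1, 1, 2, 0),
    "1c": (5, 2, 1, 1, 2, 0),
    "1d": (1, 2, 0, 2, 6, 4),
    "2": (3, 3, 0, 2, 3, 6),
    "3": (3, 2, 1, 1, 4, 5),
    "4": (2, 1, 1, 4, 3, 4),
    "5": (1, 2, 1, 3, 4, 4),
    "6": (4, 2, 3, 2, 0, 8),
    "7": (1, 2, 2, 4, 2, 4),
    "8": (0, 1, 2, 5, 3, 1),
}
N_RESPONDENTS = 11

# The five timing columns are mutually exclusive, so each row must sum to N.
row_sums_ok = all(sum(row[:5]) == N_RESPONDENTS for row in INTEREST.values())

# Tests that no respondent ruled out ('never' == 0).
no_never = [test for test, row in INTEREST.items() if row[4] == 0]

# Tests ranked by the optional 'willing to try' column, highest first
# (ties keep their table order because the sort is stable).
by_willing = sorted(INTEREST, key=lambda test: INTEREST[test][5], reverse=True)

# Respondents who are ready now or within 6 months, per test.
soon = {test: row[0] + row[1] for test, row in INTEREST.items()}

print(row_sums_ok, no_never, by_willing[:3], soon)
```

Test 6 is the only one without a 'never' rating and leads the 'willing to try' column, followed by 1a and 2, which agrees with the reading given above.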

| Stage | Name | Action | Date |
|---|---|---|---|
| 1 | Registration | Participant registration deadline | [date] |
| 2 | Release | First actual test set is released by NIST on | [date] |
| 3 | Run | Participants send *.REC and *.RES files by E-mail to NIST, deadline | [date] |
| 4 | Results | Broadcast of results to participants | [date] |

The whole process can be remembered by its four R's: Registration, Release, Run and Results. The duration of stage 3 (Run) is a point of discussion. If it is too long, results will be unreliable due to tweaking of recognizer parameters. If it is too short, results will be biased because things were done in a hurry and trivial bugs may have had major effects on the recognition rate. A *.REC file describes the recognizer, the data (training set and test set identification), etc. A *.RES file describes the results of the recognition, mainly in the form of hypotheses and their confidence values. Formats are described in the unipen.def file by Isabelle Guyon (ftp://ftp.nici.kun.nl/pub/UNIPEN/definition/unipen.def).
Examples of the *.REC and *.RES files are given at
Table 4 gives two fictitious examples of what a UNIPEN RECMARK profile for your recognizer may look like. The full profile is a reminder of where your recognizer is located among the levels and types of on-line handwriting recognizers. Again, as stated on other occasions, test 8 is the ultimate test, because it is the most application-oriented test in this battery, including both text segmentation and classification. The two companies mentioned in Table 4 do not exist, nor do their recognizers, and the reported recognition rates are picked from a Fahrenheit thermometer:
| Benchmark subtest | | Perf (%) |
|---|---|---|
| 1a | isolated digits | 99.5 |
| 1b | isolated upper case | 97.1 |
| 1c | isolated lower case | 96.2 |
| 1d | isolated symbols (punctuations etc.) | N/D |
| 2 | isolated characters, mixed case | 76.8 |
| 3 | isolated characters in the context of words or texts | N/D |
| 4 | isolated printed words, not mixed with digits and symbols | N/A |
| 5 | isolated printed words, full character set | N/A |
| 6 | isolated cursive or mixed-style words (without digits and symbols) | N/A |
| 7 | isolated words, any style, full character set | N/A |
| 8 | text: (minimally two words of) free text, full character set | N/A |

N/A: Not Applicable. This test is fundamentally not applicable to this recognizer, which was not designed for this input category.

N/D: Not Done. This test can be performed, but this has not been done yet.

| Benchmark subtest | | Perf (%) |
|---|---|---|
| 1a | isolated digits | N/D |
| 1b | isolated upper case | N/D |
| 1c | isolated lower case | N/D |
| 1d | isolated symbols (punctuations etc.) | N/D |
| 2 | isolated characters, mixed case | 81.1 |
| 3 | isolated characters in the context of words or texts | 69.7 |
| 4 | isolated printed words, not mixed with digits and symbols | N/D |
| 5 | isolated printed words, full character set | N/D |
| 6 | isolated cursive or mixed-style words (without digits and symbols) | 85.0 |
| 7 | isolated words, any style, full character set | N/D |
| 8 | text: (minimally two words of) free text, full character set | N/A |

N/A: Not Applicable. This test is fundamentally not applicable to this recognizer, which was not designed for this input category.

N/D: Not Done. This test can be performed, but this has not been done yet.
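
For anyone who wants to keep such a profile in a program, the distinction between N/A and N/D is worth preserving instead of collapsing both into a missing value. Below is one possible representation in Python; it is an assumption for illustration only, it is not the *.REC or *.RES format, and the names `Status`, `PROFILE`, `reported` and `pending` are hypothetical.

```python
from enum import Enum


class Status(Enum):
    NOT_APPLICABLE = "N/A"  # recognizer was not designed for this input category
    NOT_DONE = "N/D"        # test could be performed but has not been run yet


# First fictitious profile from Table 4: subtest -> recognition rate (%) or Status.
PROFILE = {
    "1a": 99.5,
    "1b": 97.1,
    "1c": 96.2,
    "1d": Status.NOT_DONE,
    "2": 76.8,
    "3": Status.NOT_DONE,
    "4": Status.NOT_APPLICABLE,
    "5": Status.NOT_APPLICABLE,
    "6": Status.NOT_APPLICABLE,
    "7": Status.NOT_APPLICABLE,
    "8": Status.NOT_APPLICABLE,
}


def reported(profile):
    """Return only the subtests that have a numeric recognition rate."""
    return {test: rate for test, rate in profile.items() if not isinstance(rate, Status)}


def pending(profile):
    """Return the subtests that could still be run (marked N/D)."""
    return [test for test, rate in profile.items() if rate is Status.NOT_DONE]


print(reported(PROFILE))
print(pending(PROFILE))
```

Keeping the two states apart means a later report can list the N/D tests as open work without suggesting that the N/A tests were skipped.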
First, there will be a pilot benchmark, starting in December 1997.
Then, in early 1998, there will be an official benchmark.
The first choice would be Test 6 (isolated words), but this is a point of discussion (see above).
So here is the proposed schedule for the pilot (A) and the first real benchmark (B):
(A) Pilot benchmark:
- participant registration deadline: Monday Nov. 24, 1997
- devtest (sic!) subset selection is released by NIST on: Monday Dec. 01, 1997
- participants send *.REC by E-mail to NIST, deadline: Wednesday Feb. 11, 1998
- broadcast of results to participants: Wednesday Feb. 18, 1998

Notes:

(B) First real benchmark:
- participant registration deadline: Monday Mar. 23, 1998
- first actual test set is released by NIST on: Monday Mar. 30, 1998
- participants send *.REC by E-mail to NIST, deadline: Monday May 04, 1998
- broadcast of results to participants: Monday May 11, 1998
Results can be presented at a dedicated UNIPEN workshop. There, it will also be decided what will be published.
Isabelle Guyon has volunteered to assist in developing the protocol (thanks Isabelle!).
So what do you think? Is it feasible? Note the delays (2 months for the recognizer runs may be a little long, but hey, we're just starting. For the pilot it is justifiable. For the actual benchmark a tighter 1 month 'recognizer-run time slot' may be more trustworthy). And for Stan Janet, there is the period between Feb. 18 and Mar. 30 for the actual test-set preparation.
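
To put numbers on the delays mentioned above, the intervals can be computed directly from the proposed dates. This is just a sketch using Python's standard `datetime` module; the dictionary keys and the helper `run_days` are names chosen for this example.

```python
from datetime import date

# Dates copied from the proposed schedules for the pilot (A) and the benchmark (B).
PILOT = {
    "release": date(1997, 12, 1),
    "run_deadline": date(1998, 2, 11),
    "results": date(1998, 2, 18),
}
BENCHMARK = {
    "release": date(1998, 3, 30),
    "run_deadline": date(1998, 5, 4),
    "results": date(1998, 5, 11),
}


def run_days(schedule):
    """Number of days participants have between data release and the run deadline."""
    return (schedule["run_deadline"] - schedule["release"]).days


# Time available for test-set preparation between pilot results and benchmark release.
prep_days = (BENCHMARK["release"] - PILOT["results"]).days

print(run_days(PILOT), run_days(BENCHMARK), prep_days)
```

With these dates the pilot run slot is 72 days, the benchmark run slot is 35 days, and 40 days remain for test-set preparation.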
One problem is the proposed anonymity during the pilot: the NIST procedures favor openness and do not endorse it.