Results of the UNIPEN INQUIRY '97

Lambert Schomaker & Louis Vuurpijl (NICI, The Netherlands)

1. Introduction

This document describes the results of the UNIPEN survey, which was conducted for ICDAR'97 in Ulm. In two E-mail broadcasts, UNIPEN members were asked to fill in a WWW form about their current interests and the state of their work in UNIPEN.


2. Respondents

Eleven people have responded so far (as of Fri Aug 15 11:14:20 1997):

  1. (jerrico@max.corp.mot.com) on Tuesday, July 29, 1997 at 16:44:49
  2. (parizeau@gel.ulaval.ca) on Tuesday, July 29, 1997 at 17:03:02
  3. (manke@ira.uka.de) on Wednesday, July 30, 1997 at 11:32:28
  4. (vuurpijl@nici.kun.nl) on Thursday, July 31, 1997 at 15:26:00
  5. (Bernadette.Dorizzi@int-evry.fr) on Friday, August 8, 1997 at 11:44:28
  6. (jianhu@att.research.com) on Friday, August 8, 1997 at 14:59:23
  7. (dolfing@pfa.research.philips.com) on Friday, August 8, 1997 at 17:31:08
  8. (dostr@ic.ac.uk) on Saturday, August 9, 1997 at 00:45:49
  9. (gerd.maderlechner@mchp.siemens.de) on Wednesday, August 13, 1997 at 10:58:36
  10. (jan@dtro.e-technik.th-darmstadt.de) on Wednesday, August 13, 1997 at 17:27:04
  11. (giovanni@lexicus.mot.com) on Wednesday, August 13, 1997 at 18:58:17


3. Database Usage

Table 1 shows how often each version of the NIST/UNIPEN database has been used. The latest training data version (train_r01_v07) has been used the most, followed by the most recent development test set version (devtest_r01_v02). Note that the word "release" is ambiguous and is avoided here. A new release will contain new and fresh data, whereas until today we have only seen versions. When the benchmark process ('the RECMARK factory') starts to run, we will need new releases with uncontaminated data. Please concentrate your work on the most recent versions, which will be of a higher quality than older versions.

The success of this WWW-form based query has stimulated us to think of a similar form-based approach for the reporting of database errors. In the near future, annotation errors may be reported on a WWW form and will be automatically forwarded to NIST.




Table 1. Usage of the UNIPEN database

Version            not seen   just looked at   used
train_r01_v01             5                4      2
train_r01_v02             6                4      1
train_r01_v03             7                4      0
train_r01_v04             6                3      2
train_r01_v05             5                3      3
train_r01_v06             6                3      2
devtest_r01_v01           8                3      0
train_r01_v07             4                0      7
devtest_r01_v02           4                3      4
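
The counts in Table 1 can be checked with a few lines of Python. This is only an illustrative sketch: the dictionary layout and the helper name `most_used` are assumptions made here, and the numbers are copied from Table 1.

```python
# Counts copied from Table 1: (not seen, just looked at, used) per version.
USAGE = {
    "train_r01_v01": (5, 4, 2),
    "train_r01_v02": (6, 4, 1),
    "train_r01_v03": (7, 4, 0),
    "train_r01_v04": (6, 3, 2),
    "train_r01_v05": (5, 3, 3),
    "train_r01_v06": (6, 3, 2),
    "devtest_r01_v01": (8, 3, 0),
    "train_r01_v07": (4, 0, 7),
    "devtest_r01_v02": (4, 3, 4),
}

N_RESPONDENTS = 11


def most_used(usage, prefix):
    """Return the version whose name starts with `prefix` and has the highest 'used' count."""
    candidates = {v: counts for v, counts in usage.items() if v.startswith(prefix)}
    return max(candidates, key=lambda v: candidates[v][2])


# Every row should account for all eleven respondents.
for version, counts in USAGE.items():
    assert sum(counts) == N_RESPONDENTS, version

print(most_used(USAGE, "train"))    # train_r01_v07
print(most_used(USAGE, "devtest"))  # devtest_r01_v02
```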

4. Readiness for benchmark participation

Table 2 displays the distribution of interest among the respondents in participating in the different benchmark subtests. First of all, note that N is too small to draw statistically reliable conclusions. As an example, with N=11 and five categories, a count of 7 vs 2 starts to be significant at alpha=.05. Columns 3-7 span all respondents (sum=11); column 8 was optional. There is a cluster of interest around the isolated digits and letters (tests 1a, 1b, 1c) and isolated mixed-style words (test 6). As some may recall from the Colchester meeting, most interest was focused on the isolated words. One respondent said they would not participate if the first benchmark only involved isolated characters. 'Willing to try' yields a maximum for test 6, which also does not have a single 'never' rating. Second in line are 1a (isolated digits) and 2 (isolated characters, mixed case). A promising observation is that more groups are ready to start within a few months than a year ago! All in all, the results are better than expected.

Table 2. Readiness for participation in a benchmark process

Benchmark subtest                                                        ready  within 6  within                willing
                                                                          now!    months  a year  later  never   to try
1a  isolated digits                                                          5         2       0      1      3        6
1b  isolated upper case                                                      4         3       1      1      2        0
1c  isolated lower case                                                      5         2       1      1      2        0
1d  isolated symbols (punctuations etc.)                                     1         2       0      2      6        4
2   isolated characters, mixed case                                          3         3       0      2      3        6
3   isolated characters in the context of words or texts                     3         2       1      1      4        5
4   isolated printed words, not mixed with digits and symbols                2         1       1      4      3        4
5   isolated printed words, full character set                               1         2       1      3      4        4
6   isolated cursive or mixed-style words (without digits and symbols)       4         2       3      2      0        8
7   isolated words, any style, full character set                            1         2       2      4      2        4
8   text: (minimally two words of) free text, full character set             0         1       2      5      3        1
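
The observations drawn from Table 2 can be reproduced from the raw counts. The sketch below is an illustration only: the variable and function names are assumptions introduced here, and the numbers are copied from Table 2. It ranks the subtests by 'willing to try', lists the subtests without a 'never' rating, and adds up the groups that are ready now or within six months.

```python
# Counts copied from Table 2. Columns: ready now, within 6 months,
# within a year, later, never, willing to try (optional question).
READINESS = {
    "1a": (5, 2, 0, 1, 3, 6),
    "1b": (4, 3, 1, 1, 2, 0),
    "1c": (5, 2, 1, 1, 2, 0),
    "1d": (1, 2, 0, 2, 6, 4),
    "2": (3, 3, 0, 2, 3, 6),
    "3": (3, 2, 1, 1, 4, 5),
    "4": (2, 1, 1, 4, 3, 4),
    "5": (1, 2, 1, 3, 4, 4),
    "6": (4, 2, 3, 2, 0, 8),
    "7": (1, 2, 2, 4, 2, 4),
    "8": (0, 1, 2, 5, 3, 1),
}


def ready_within_six_months(row):
    """Sum of the 'ready now' and 'within 6 months' counts."""
    return row[0] + row[1]


# The five readiness categories must cover all eleven respondents.
for test, row in READINESS.items():
    assert sum(row[:5]) == 11, test

# Subtests ordered by 'willing to try' (sorted() is stable, so ties keep table order).
by_willing = sorted(READINESS, key=lambda t: READINESS[t][5], reverse=True)

# Subtests that nobody rated 'never'.
never_free = [t for t, row in READINESS.items() if row[4] == 0]

soon = {t: ready_within_six_months(row) for t, row in READINESS.items()}

print(by_willing[:3])  # ['6', '1a', '2']
print(never_free)      # ['6']
print(soon)
```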



Point of discussion

There are a number of options for the realisation of the first (pilot) benchmark.
  1. Test 6 (isolated mixed/cursive words) (Leaves character recognizers out)
  2. Tests 1a, 1b, 1c (isolated characters) (Leaves word recognizers out)
  3. Both (Which is more work for Stan Janet at NIST)


5. How does a benchmark work?

These are the stages in a UNIPEN benchmark (Table 3):

Table 3. Stages in a UNIPEN benchmark round

  1  Registration  Participant registration deadline [date]
  2  Release       First actual test set is released by NIST on [date]
  3  Run           Participants send *.REC and *.RES files by E-mail to NIST, deadline [date]
  4  Results       Broadcast of results to participants [date]

The whole process can be remembered by its four "R"s: Registration, Release, Run and Results. The duration of stage 3 (Run) is a point of discussion. If it is too long, results will be unreliable due to tweaking of recognizer parameters. If stage 3 is too short, results will be biased because things were done in a hurry and trivial bugs may have had major effects on the recognition rate. A *.REC file describes the recognizer, the data (training set and test set identification) etc. A *.RES file describes the results of the recognition, mainly in the form of hypotheses and their confidence values. Formats are described in the unipen.def file by Isabelle Guyon (ftp://ftp.nici.kun.nl/pub/UNIPEN/definition/unipen.def).

Examples of the *.REC and *.RES files are given at

Table 4 gives two fictitious examples of what a UNIPEN RECMARK profile for your recognizer may look like. The full profile is a reminder of where your recognizer is located among the levels and types of on-line handwriting recognizers. Again, as stated on other occasions, test 8 is the ultimate test, because it is the most application-oriented test in this battery, including both text segmentation and classification. The two companies mentioned in Table 4 do not exist, nor do their recognizers, and the reported recognition rates are picked from a Fahrenheit thermometer:


Table 4. Two fictitious examples of a UNIPEN RECMARK profile for a given recognizer


UNIPEN RECMARKS
Recognizer: CleverChar version 54.76b
Company/Institute: Omniclass Ltd.
Date: 12th of December 1997

Benchmark subtest                                                       Perf (%)
1a  isolated digits                                                         99.5
1b  isolated upper case                                                     97.1
1c  isolated lower case                                                     96.2
1d  isolated symbols (punctuations etc.)                                     N/D
2   isolated characters, mixed case                                         76.8
3   isolated characters in the context of words or texts                     N/D
4   isolated printed words, not mixed with digits and symbols                N/A
5   isolated printed words, full character set                               N/A
6   isolated cursive or mixed-style words (without digits and symbols)       N/A
7   isolated words, any style, full character set                            N/A
8   text: (minimally two words of) free text, full character set             N/A

N/A: Not Applicable: This test is fundamentally not
applicable to this recognizer, which was not designed
for this input category.
N/D: Not Done: This test can be performed, but this
has not been done as yet.




UNIPEN RECMARKS
Recognizer: OmniWord version 543.1
Company/Institute: SpiralView S.p.a.
Date: 10th of December 1997

Benchmark subtest                                                       Perf (%)
1a  isolated digits                                                          N/D
1b  isolated upper case                                                      N/D
1c  isolated lower case                                                      N/D
1d  isolated symbols (punctuations etc.)                                     N/D
2   isolated characters, mixed case                                         81.1
3   isolated characters in the context of words or texts                    69.7
4   isolated printed words, not mixed with digits and symbols                N/D
5   isolated printed words, full character set                               N/D
6   isolated cursive or mixed-style words (without digits and symbols)      85.0
7   isolated words, any style, full character set                            N/D
8   text: (minimally two words of) free text, full character set             N/A

N/A: Not Applicable: This test is fundamentally not
applicable to this recognizer, which was not designed
for this input category.
N/D: Not Done: This test can be performed, but this
has not been done as yet.
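
A RECMARK profile of this kind could be held in a small data structure that keeps the two "no score" cases apart. The sketch below is an assumption made for illustration, not a format defined by UNIPEN: the constant names, the `summarize` helper and the dictionary layout are all hypothetical, and the values are the fictitious ones from Table 4.

```python
from typing import Dict, Union

NOT_APPLICABLE = "N/A"  # recognizer was not designed for this input category
NOT_DONE = "N/D"        # test could be performed but has not been done yet

Score = Union[float, str]

# Fictitious profiles copied from Table 4 (subtest id -> recognition rate in %).
CLEVERCHAR: Dict[str, Score] = {
    "1a": 99.5, "1b": 97.1, "1c": 96.2, "1d": NOT_DONE,
    "2": 76.8, "3": NOT_DONE,
    "4": NOT_APPLICABLE, "5": NOT_APPLICABLE, "6": NOT_APPLICABLE,
    "7": NOT_APPLICABLE, "8": NOT_APPLICABLE,
}

OMNIWORD: Dict[str, Score] = {
    "1a": NOT_DONE, "1b": NOT_DONE, "1c": NOT_DONE, "1d": NOT_DONE,
    "2": 81.1, "3": 69.7,
    "4": NOT_DONE, "5": NOT_DONE, "6": 85.0, "7": NOT_DONE,
    "8": NOT_APPLICABLE,
}


def summarize(profile: Dict[str, Score]) -> Dict[str, int]:
    """Count how many subtests were measured, not done, or not applicable."""
    values = list(profile.values())
    return {
        "measured": sum(1 for s in values if isinstance(s, float)),
        "not_done": sum(1 for s in values if s == NOT_DONE),
        "not_applicable": sum(1 for s in values if s == NOT_APPLICABLE),
    }


print(summarize(CLEVERCHAR))  # {'measured': 4, 'not_done': 2, 'not_applicable': 5}
print(summarize(OMNIWORD))    # {'measured': 3, 'not_done': 7, 'not_applicable': 1}
```

Keeping N/A and N/D as distinct markers, rather than a single missing value, preserves the difference between a test that cannot apply and one that is still open.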




6. Planning Proposal

The following planning schedule was proposed to Stan Janet of NIST and to Isabelle Guyon. Stan had no problems with it.

First, there will be a pilot benchmark, starting in December 1997.

Then, early 1998, there will be an official benchmark.

The first choice would be Test 6 (isolated words), but this is a point of discussion (see above).

So here is the proposed planning for the pilot (A) and the first real benchmark (B):

A) "UNIPEN pilot benchmark (test 6: isolated words, cursive, mixed)"

  - participant registration deadline..............Monday Nov. 24, 1997
  - devtest (sic!) subset selection is released 
    by NIST on ....................................Monday Dec. 01, 1997
  - participants send *.REC and *.RES files by
    E-mail to NIST, deadline ......................Monday Feb. 11, 1998
  - broadcast of results to participants...........Monday Feb. 18, 1998

Notes:

  • Using the devtest prevents spoiling of the real test set data on this procedural pilot. Figures may be optimistic if some have used the devtest data for training, or if inadvertent indirect training has occurred, but it is mainly the procedure and protocol that are being evaluated. The annotation of the devtest data selection is checked thoroughly (labels must be ok!).

  • Participants get an alias in the reports on the statistics because anonymity will encourage participation. Participants will be able to detect bugs without embarrassment.

  • Reports are sent back by Stan Janet/NIST only to those who have sent a .REC/.RES report.








B) "1st UNIPEN benchmark (test 6: isolated words, cursive, mixed)"

  - participant registration deadline..............Monday Mar. 23, 1998
  - first actual test set is released 
    by NIST on ....................................Monday Mar. 30, 1998
  - participants send *.REC and *.RES files by
    E-mail to NIST, deadline ......................Monday May 04, 1998
  - broadcast of results to participants...........Monday May 11, 1998

Results can be presented at a dedicated UNIPEN workshop. There, it will also be decided what is being published.

Isabelle Guyon has volunteered to assist in developing the protocol (thanks Isabelle!).

So what do you think? Is it feasible? Note the delays: two months for the recognizer runs may be a little long, but hey, we're just starting. For the pilot it is justifiable; for the actual benchmark, a tighter one-month 'recognizer-run time slot' may be more trustworthy. And for Stan Janet, there is the period between Feb. 18 and Mar. 30 for the actual test-set preparation.
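
The length of each time slot follows directly from the proposed dates. The sketch below is a rough illustration using Python's standard `datetime` module; the dictionary keys and helper names are assumptions made here, and the dates are copied from schedules A and B.

```python
from datetime import date

# Dates copied from the proposed planning for the pilot (A) and the
# first benchmark (B). Keys follow the four stages of a benchmark round.
PILOT = {
    "registration": date(1997, 11, 24),
    "release": date(1997, 12, 1),
    "run": date(1998, 2, 11),
    "results": date(1998, 2, 18),
}
BENCHMARK = {
    "registration": date(1998, 3, 23),
    "release": date(1998, 3, 30),
    "run": date(1998, 5, 4),
    "results": date(1998, 5, 11),
}

STAGES = ("registration", "release", "run", "results")


def is_chronological(schedule):
    """True if the four stage dates are strictly increasing."""
    ordered = [schedule[stage] for stage in STAGES]
    return all(earlier < later for earlier, later in zip(ordered, ordered[1:]))


def run_window_days(schedule):
    """Days between the data release and the deadline for sending results."""
    return (schedule["run"] - schedule["release"]).days


assert is_chronological(PILOT) and is_chronological(BENCHMARK)

print(run_window_days(PILOT))      # 72
print(run_window_days(BENCHMARK))  # 35

# Time available for preparing the actual test set between the two rounds.
print((BENCHMARK["release"] - PILOT["results"]).days)  # 40
```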

One problem is the proposed anonymity during the pilot: it is not endorsed by the NIST procedures, which favor openness.


7. Summary

The results of the UNIPEN inquiry 1997 indicate that UNIPEN really is alive. On the basis of the reactions received, a pilot benchmark event (end of 1997) and a first official benchmark (first half of 1998) are being proposed. The pilot benchmark will not consume valuable test data, since it will be based on a random subset of the development test set. Points of discussion are the choice of test(s) and aspects of anonymity.


Individual responses to the survey
Individual extended comments to the survey


(End of document) LS'1997