UNIPEN Database Release Overview and Publication Guidelines


The following UNIPEN data sets have been released by NIST:

Date               Release             Remark
Wed, 09 Aug 1995   "train_r01_v01"     Initial release
Wed, 20 Sep 1995   "train_r01_v02"     Supersedes previous
Tue, 14 Nov 1995   "train_r01_v03"     Supersedes previous
Fri, 26 Jan 1996   "train_r01_v04"     Supersedes previous
Mon, 15 Jul 1996   "train_r01_v05"     Supersedes previous
Tue, 01 Oct 1996   "train_r01_v06"     Supersedes previous
                   "devtest_r01_v01"   First development test release
Fri, 25 Oct 1996   "train_r01_v07"     Supersedes previous
                   "devtest_r01_v02"   Supersedes previous devtest


UNIPEN Publication Policy

I - Reference

Members are required to state the UNIPEN release version in their publications and are strongly urged to use the latest release.
    Reference example: 

        "As a training set, we used UNIPEN [xx] Train-R01/V07, 
         benchmark ... (see III, below), subsets ..... 
         As a test set, we used UNIPEN DevTest-R01/V02, 
         benchmark ..., subsets .... 
         To the raw UNIPEN data, the following pre-processing 
         was applied: ...."
             .
             .
             .

        [xx] Guyon, I., Schomaker, L., Plamondon, R., 
             Liberman, M. & Janet, S. (1994). 
             UNIPEN project of on-line data exchange and recognizer 
             benchmarks, Proceedings of the 12th International
             Conference on Pattern Recognition, ICPR'94, 
             pp. 29-33, Jerusalem, Israel, October 1994. IAPR-IEEE.
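
The release identifiers in the table above ("train_r01_v07") and the labels used in the reference example ("Train-R01/V07") differ only in notation. As a minimal sketch, assuming that identifiers always follow the pattern seen in the release table, the conversion could be automated as follows. The function name citation_label and the SET_NAMES mapping are hypothetical and not part of any UNIPEN tool.

```python
import re

# Display names as they appear in the reference example; other set
# kinds are not covered by this sketch.
SET_NAMES = {"train": "Train", "devtest": "DevTest"}


def citation_label(release_id: str) -> str:
    """Convert e.g. 'train_r01_v07' to 'Train-R01/V07' (hypothetical helper)."""
    # Assumed identifier pattern: <kind>_r<release>_v<version>
    match = re.fullmatch(r"(train|devtest)_r(\d+)_v(\d+)", release_id)
    if match is None:
        raise ValueError(f"unrecognized release identifier: {release_id!r}")
    kind, release, version = match.groups()
    return f"{SET_NAMES[kind]}-R{release}/V{version}"


print(citation_label("train_r01_v07"))    # Train-R01/V07
print(citation_label("devtest_r01_v02"))  # DevTest-R01/V02
```

Rejecting unknown identifiers with an error, rather than guessing a label, keeps a mistyped release name from ending up in a publication.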

II - Which data?

Results may be reported only for training sets and development sets. Never use development sets for training purposes! Using test sets (still to be released at the next benchmark event) as training sets is likewise considered unethical.

Note that repeated use of test sets poses a problem: iterating over a particular training / test set pair during development can be considered indirect training! Even if a development set is not formally used for training, it is well known that all parameter adjustments, code improvements, etc., made in response to its results are a form of training, regardless of the type of pattern recognition algorithm used. It is therefore good practice to report in publications the effort spent on iterated testing.
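
One way to make the effort spent in iterated testing reportable is to count every evaluation run per training / test set pair. The following is only a sketch under that assumption; the class EvaluationLog and its methods are hypothetical and are not prescribed by the UNIPEN policy.

```python
from collections import Counter


class EvaluationLog:
    """Counts how often each training/test set pair was evaluated (hypothetical helper)."""

    def __init__(self) -> None:
        self._runs = Counter()

    def record(self, train_set: str, test_set: str) -> None:
        # Call once per evaluation, including runs made only for tuning.
        self._runs[(train_set, test_set)] += 1

    def summary(self) -> dict:
        # Maps (train_set, test_set) to the number of recorded evaluations.
        return dict(self._runs)


log = EvaluationLog()
log.record("Train-R01/V07", "DevTest-R01/V02")
log.record("Train-R01/V07", "DevTest-R01/V02")
print(log.summary())
```

The resulting counts can be quoted in a publication alongside the release versions.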

III - Benchmark (i.e., database subset) overview

Benchmark   Description
1a          isolated digits
1b          isolated upper case
1c          isolated lower case
1d          isolated symbols (punctuations etc.)
2           isolated characters, mixed case
3           isolated characters in the context of words or texts
4           isolated printed words, not mixed with digits and symbols
5           isolated printed words, full character set
6           isolated cursive or mixed-style words (without digits and symbols)
7           isolated words, any style, full character set
8           text: (minimally two words of) free text, full character set

Note that only Benchmark #8 is a realistic, application-oriented test, because the recognizer must also solve the word segmentation problem. No manual word segmentation is allowed in test Benchmark #8.


Lambert Schomaker, January 1997