UNIPEN SCRAWL #4,
Public Interim Report,
16 October 1995

- > - > - > - > - > - > - > - < - < - < - < - < - < - < - < - < -

- > UNIPEN project of data exchange and recognizer benchmarks < -

- > - > - > - > - > - > - > - < - < - < - < - < - < - < - < - < -

                                                 #
                                                #
                                               # OOOOOOOOOOOOOOOOO
                                         OOOOO# OOOOOOOOOOOOOOO
                                     OOOOOOOO#OOOOOOOOOOOOOOOOOO
                                    OOOO    # OOOOOOOOOOOOOO
                                   OOO     #   OOOOOOOOOOOOOO
                                   OOO    #   OOOOOO
                                   OOO   #    OOOO
                                    OOO #   OOOO
                                       #  OOO
                                      #
  UNIPEN SCRAWL #4                   /
  Public Interim Report, October 1995


               Lambert Schomaker
               Isabelle Guyon
               Stan Janet

Introduction

It looks as if it has been quiet on the UNIPEN front lately, but this appearance is deceptive. At this very moment, dozens of researchers all over the world are struggling to process the first batches of UNIPEN training data. In this fourth edition of UNIPEN SCRAWLS, we will address a number of topics: the history and schedule of the project, the benchmark, tools for UNIPEN, some statistics on the donated data, and the institutions involved.

The intended audience consists of Scrib-L subscribers interested in handwriting recognition, UNIPEN members who want to be kept informed, as well as anyone interested in on-line handwriting recognition. Enjoy our enthusiasm,

                        Lambert Schomaker, 16 Oct. 1995

(P.S. This page looks better in Netscape than in Mosaic...sorry...)


The UNIPEN project in time

At the initiative of Technical Committee 11 of the International Association for Pattern Recognition (IAPR), the UNIPEN project was started to stimulate research and development in On-Line Handwriting Recognition (e.g., for pen computers and pen communicators). UNIPEN provides a platform for data exchange at the Linguistic Data Consortium (LDC) and is organizing a worldwide benchmark under the control of the US National Institute of Standards and Technology (NIST).

Before 1993 All sorts of precursors, including projects by Colin Higgins
            (Univ. of Nottingham), Bob Whitrow (NTU), Frans Maarse (NICI),
            the Active Book Company (later EO), Hans-Leo Teulings (NICI),
            the European Esprit projects P295, P419 and Papyrus and 
            numerous other projects, companies and institutes where databases
            of >20 writers, >100 words per writer were collected or comparisons
            between recognition algorithms were performed. However, these
            endeavours suffered from one small disadvantage:
            they were local (in a lot of possible dimensions).

May '93   - Start of UNIPEN: definition of the database format (Isabelle Guyon)
    
Mar '94   - UNIPEN advertised its existence on several electronic mailing
            lists, resulting in nearly 200 subscriptions to the UNIPEN
            newsletter. Public software tools are developed at NICI and AT&T.
    
Oct '94   - Deadline of the open & public call for data.
            A total of 5 million characters of on-line (western)
            handwriting has been donated by 40 institutions, 
            both from industry and university. 

            At this point in time, we enter the closed stage in UNIPEN!!!
            Here, 'closed' means that further data donation and thus
            participation in the first benchmark are no longer possible.
     
   1995   - Data accumulation, cleaning and organisation
            by NIST (Stan Janet)
    
Aug '95   - Release of training sets to donators only

Oct '95   - Release of a so-called 'development test set' so
            people can actually run tests on about half as much data
            as the actual final test set will contain.
    .
    .
End 1996  - A closed benchmark of recognizers, organized by NIST,
    .       by the end of 1996, for original donators of UNIPEN data.
    .       The final date cannot be set because it is not known yet 
    .       how long it will take the members to be fully prepared for
    .       test set reception and processing. Then, finally, the 
    .       members will receive test sets for which the recognition
    .       results must be reported in a predefined format within
    .       two weeks after receiving the test sets.
    .

  1997    - A second benchmark round, on other test sets (held back at NIST).
    .       An equal amount of data as for the first round will be 
    .       used for this second benchmark in early 1997. 
    .
 (future) - Open distribution of UNIPEN on-line handwriting data by 
    .       the LDC. No date has been fixed, as yet. Most likely, this 
    .       will be in the form of a CD-ROM, at a reasonable price.
    .      
 (plans)  - Periodical UNIPEN database updates and calls for new test sets.
    .       Examples are on-line recordings of other scripts such as
    .       Arabic, Hangul, Hebrew, Kanji and others. Other examples include
    .       pen gestures, diagram entry sequences, and maybe even some
    .       signatures. Finally, there will be a growing need for data
    .       entered in real-life applications, during their use
    .       'in the field'.
    .


The Benchmark

The benchmark is concerned with writer independent recognition of sentences, isolated words and isolated characters of any writing style (handprinted and/or cursive). Although UNIPEN will provide, in the future, data for various alphabets, this particular benchmark is limited to letters and symbols from an English computer keyboard. Currently, training set samples are being distributed to donators of UNIPEN.

In February 1995, at a UNIPEN Workshop in Holmdel NJ, USA, details of the benchmarking procedure were determined in a lively discussion. A suite of tests was proposed, such that a given recognizer can be assessed on the basis of its performance profile over these tests. The result of a single test is either a performance number, the status 'Not Applicable', or 'Test not performed'. As an example, a recognizer may be designed for isolated uppercase letters, so that for this recognizer the category of digits is 'N/A (not applicable)'. However, this is just one simple example: much more detailed aspects of benchmarking were discussed at this workshop, such as the statistical reliability of recognition rates as a function of sample size.
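
As a rough illustration of the last point, here is a minimal sketch (ours, not the procedure agreed at the workshop) of how the uncertainty of a measured recognition rate shrinks with sample size. It computes a Wilson score interval under the assumption that test samples are independent trials; the function name wilson_interval and the default z = 1.96 (about 95% confidence) are illustrative choices.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate confidence interval for a recognition rate.

    Assumes each of the `total` test samples is an independent trial.
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    )
    return centre - half, centre + half


# The same 90% recognition rate, measured on a small and a large test set.
for n in (100, 10000):
    low, high = wilson_interval(9 * n // 10, n)
    print(n, round(low, 3), round(high, 3))
```

Characters written by the same writer are correlated, so the real uncertainty is probably larger than such an interval suggests.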

At the ICDAR'95 conference in Montreal, several members of UNIPEN stated that a one-year period between receiving training set data and the actual benchmark seems realistic. Although the UNIPEN file format is standardized, there are sufficient differences at the signal level, such that members will need time to write signal adaptation software (resampling, rescaling etc.), apart from the recognizer training or updating.


Tools for UNIPEN

There is a growing need for UNIPEN tools. Some software exists already (under ftp://ftp.nici.kun.nl/pub/UNIPEN/tools), including parsers and browsers, but this is not sufficient. In what follows, we will try to elucidate this issue. One can make a distinction between three layers:

Layer 1:   ASCII data: i.e., the standard UNIPEN file format

Layer 2:   SIGNAL: i.e., the assumptions with respect to the
           data-as-a-signal. Problems like equidistant sampling in time
           vs equidistant sampling in space play a role here.

Layer 3:   APPLICATION: i.e., the recognizer or browser which processes
           or graphically visualizes the signal.

At this point in time, a lot of work has been done in the area of Layer 1, some work has been done in Layer 3, but Layer 2 has been largely neglected. However, a lot of the work produced by the UNIPEN members in the next few months will be focused on exactly this: 'Obtaining a SIGNAL which is compatible with the local algorithms'. Most probably, the wheel will be reinvented at several sites, simultaneously.

===> If you are involved in UNIPEN and think you have software which could be used by others: Please share the knowledge whenever possible!! <===

Not only will this make things easier for others, but also the comparability of the benchmark results will be better if the same preprocessing is used within a given approach.

Example 1.

Suppose, for instance, you convert the on-line vectors to a 2-D pixel map in order to apply 'off-line' recognition algorithms. The line generator algorithm you use in this setup (basic incremental, midpoint line, or other) is evidently of paramount importance, as well as details such as line width in pixels and the brush shape used.
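
To make the choices in this example concrete, here is a minimal sketch of one possible conversion, assuming integer pixel coordinates (i.e., after rescaling and rounding), a Bresenham-style line generator, a line width of one pixel and no brush shape. The names bresenham and rasterize are ours; any of these choices could differ at your site, which is exactly why they need to be documented.

```python
def bresenham(x0, y0, x1, y1):
    """Return the integer pixels on the line from (x0, y0) to (x1, y1)."""
    pixels = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        pixels.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return pixels


def rasterize(strokes, width, height):
    """Draw pen-down strokes into a width x height bitmap of 0/1 values.

    strokes: list of strokes, each a list of integer (x, y) points.
    Pixels outside the bitmap are silently clipped.
    """
    bitmap = [[0] * width for _ in range(height)]
    for stroke in strokes:
        # A single-point stroke is paired with itself so the dot is drawn.
        for (xa, ya), (xb, yb) in zip(stroke, stroke[1:] or stroke):
            for x, y in bresenham(xa, ya, xb, yb):
                if 0 <= x < width and 0 <= y < height:
                    bitmap[y][x] = 1
    return bitmap


# One diagonal stroke and one isolated dot on an 8 x 8 grid.
for row in rasterize([[(0, 0), (5, 5)], [(7, 0)]], 8, 8):
    print("".join("#" if v else "." for v in row))
```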

Example 2.

Spatial resampling: how did you solve the singularities at points of high curvature? What thresholds are used?
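
For reference, a minimal sketch of plain spatial resampling, assuming linear interpolation along the pen path and a fixed step size. The function name resample_spatial is ours. Note that this simple version does nothing special at points of high curvature: a sharp corner lying between two output points is cut off, which is one way the problem mentioned above shows up.

```python
import math


def resample_spatial(points, step):
    """Resample one stroke at equal distances measured along its path.

    points: list of (x, y) tuples of one pen-down stroke.
    step:   spacing between output points, in the same unit as x and y.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if not points:
        return []
    out = [points[0]]
    carried = 0.0  # path length travelled since the last output point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0.0:
            continue  # repeated sample, pen standing still
        pos = step - carried  # distance into this segment of the next output
        while pos <= seg:
            t = pos / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            pos += step
        carried = seg - (pos - step)
    return out


# An L-shaped stroke of path length 7, resampled at unit spacing.
print(resample_spatial([(0, 0), (3, 0), (3, 4)], 1.0))
```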

Example 3.

Temporal resampling: The time stamps T in a UNIPEN .COORD X Y T data set are less accurate than your virtual sample rate requires. What kind of interpolation method did you use?
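
One simple answer to this question is linear interpolation onto a regular time grid, sketched below. This is an illustration only: the function name resample_temporal is ours, and the sketch assumes strictly increasing time stamps, so coarse stamps with repeated values would first have to be repaired by some other (documented!) method.

```python
def resample_temporal(samples, dt):
    """Linearly interpolate (x, y, t) samples onto a regular time grid.

    samples: list of (x, y, t) tuples with strictly increasing t.
    dt:      output sampling interval, in the same unit as t.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    if any(b[2] <= a[2] for a, b in zip(samples, samples[1:])):
        raise ValueError("time stamps must be strictly increasing")
    n = len(samples)
    if n < 2:
        return list(samples)
    t_start, t_end = samples[0][2], samples[-1][2]
    out = []
    i = 0  # index of the segment [samples[i], samples[i + 1]] in use
    k = 0  # index of the output sample
    while t_start + k * dt <= t_end:
        t = t_start + k * dt
        while i < n - 2 and samples[i + 1][2] < t:
            i += 1
        xa, ya, ta = samples[i]
        xb, yb, tb = samples[i + 1]
        w = (t - ta) / (tb - ta)
        out.append((xa + w * (xb - xa), ya + w * (yb - ya), t))
        k += 1
    return out


# Three samples spaced 10 time units apart, resampled every 5 units.
print(resample_temporal([(0, 0, 0), (10, 0, 10), (10, 20, 20)], 5))
```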

Example 4.

What did you do to prevent boundary (i.e., run-in/run-out) problems in digital filtering (smoothing)?
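
One common remedy, sketched here under our own assumptions, is to mirror the signal at both ends so that the filter window always has samples to work on. The function name smooth_reflect and the choice of a simple moving average are illustrative; the filter would be applied to the X and Y series of a stroke separately.

```python
def smooth_reflect(values, half_width):
    """Moving-average filter that mirrors the signal at both ends.

    Mirroring supplies the samples the window needs beyond the stroke
    boundaries, instead of padding with zeros or shortening the output.
    """
    if half_width < 0:
        raise ValueError("half_width must be non-negative")
    n = len(values)
    if n == 0:
        return []

    def at(i):
        # Reflect out-of-range indices about the first and last sample,
        # without repeating the edge sample itself.
        if n == 1:
            return values[0]
        period = 2 * (n - 1)
        i %= period
        return values[i] if i < n else values[period - i]

    window = 2 * half_width + 1
    return [
        sum(at(i + k) for k in range(-half_width, half_width + 1)) / window
        for i in range(n)
    ]


# The output has the same length as the input.
print(smooth_reflect([0, 1, 2, 3, 4], 1))
```

Mirroring flattens a slope at the stroke ends somewhat; point reflection (odd extension) is an alternative that preserves the trend. Whichever is used, it is worth documenting.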

We may conclude that the sharing of software at the SIGNAL level (Layer 2) is highly desirable. The very least one can do is to document the conversion process thoroughly. Instead of having about 40 signal conversion methods in UNIPEN, we may try to identify a much smaller set of methods, as is done superficially in examples 1-3 above. As a consequence, we will then be able to document, e.g., that preprocessing method PMn was used in obtaining the benchmark results of institution X on test set K.

We are eager to hear about the experiences of UNIPEN members, and of people using the example data, in processing UNIPEN data.

                                                                      (L.S.)


Some Statistics

Below is a table, compiled by Stan Janet of NIST, giving the number of characters per donator and their distribution over the benchmark tasks. The idea is that a given recognizer obtains a performance profile for a number of categories (where some categories may be Not Applicable, of course).

UNIPEN Benchmark Tasks Synopsis:


Below are the character counts after partitioning (these are the data of 2733 writers in total!). Non-members of UNIPEN: be patient, this data will become available in due time. This is just to give you an idea of the amounts involved:


Character counts


     In all                      By benchmark task
ID   samples  1a   1b    1c     1d    2     3     4     5     6     7     8

apa    61k    3k   10k   38k    9k   61k   61k    0     0     0     0    61k 
apb   109k   15k   28k   60k    7k  109k  109k    0     0     0     0   109k 
apd   112k    0     0     0     0     0     0     0     0    92k  112k  112k 
ape    56k    0     0     0     0     0     0     0     0    41k   56k   56k 
app   115k    7k   20k   70k   18k  115k  115k    0     0     0     0   115k 
ata   376k    0     0     0     0     0     0     0     0   178k  187k  376k 
bbb    35k    0     0     0     0     0     0     0     0     0     0    35k 
bbc    34k    0     0     0     0     0     0     0     0     0     0    34k 
bbd   621k    0     0     0     0     0     0     0     0     0     0   621k 
ceb    50k   <1k    2k   45k    2k   50k   50k    0     0    47k   50k   50k 
cec   162k    0     0     0     0     0     0     0     0   152k  162k  162k 
ced    49k    8k   17k   16k    7k   49k   49k    0     0     0     0    49k 
cee   205k    0     0     0     0     0     0     0     0   207k  207k    1k 
cef    57k    0     0     0     0     0     0     0     0    43k   57k   57k 
hpp   303k    0     0     0     0     0     0     0     0   278k  302k  258k 
pap    74k    0     0     0     0     0     0     0     0    72k   74k    0  
rim    16k    0     0     0     0     0     0     0     0    16k   16k    0  
scr    38k    0     0     0     0     0     0     0     0     0     0    38k 
sie    65k    0     0     4k    0     4k    0     0     0    60k   60k    0  
sta   490k    0     0     0     0     0     0     0     0   484k  490k   25k 
tos    33k    4k    9k    9k   11k   33k    0     0     0     0     0     0  
ugi    17k    0     0     0     0     0     0     0     0    17k   17k    0  
uqb    26k    4k   12k    0    10k   26k    0     0     0     0     0     0  
abm    19k    0     0     0     0     0     0     0     0    18k   18k   <1k 
aga   133k    3k    7k    7k    1k   18k    0     0     0     0     0   115k 
anj    42k    0     0     0     0     0     0     0     0    42k   42k    0  
apc    65k    0     0     0     0     0     0     0     0    63k   65k   65k 
art    19k    1k    5k   12k    1k   19k   19k    0     0    14k   19k   19k 
att    71k    0     0     0     0     0     0     0     0    25k   35k   71k 
atu    13k    0     0     0     0     0     0     0     0     0     0    13k 
bba    54k    0     0     0     0     0     0     0     0     0     0    54k 
cea    14k   <1k    2k   11k   <1k   14k   14k    0     0    13k   14k   14k 
dar    16k    0     0     0     0     0     0     0     0    16k   16k   16k 
gmd    22k    6k    0    13k    4k   22k    0     0     0     0     0     0  
hpb   134k    0     0     0     0     0     0     0     0    42k   47k  134k 
huj    14k    0     0     0     0     0     0     0     0    14k   14k    0  
ibm   107k    9k   24k   24k   11k   68k    0     0     0    38k   38k    0  
imp    28k    1k    3k    3k    4k   12k    0     0     0    17k   17k    0  
imt    14k    0     0     0     0     0     0     0     0    14k   14k    0  
int    84k    0     0     0     0     0     0     0     0    84k   84k    0  
kai    66k    0    10k   44k   10k   65k   47k    0     0    20k   48k    0  
kar   115k    0     0     0     0     0     0     0     0   113k  115k    0  
lav    13k    0     0     6k    0     6k    0     0     0     7k    7k    0  
lex   217k    0     0     0     0     0     0     0     0   143k  200k  217k 
lou    55k   <1k   <1k   <1k   <1k   <1k    0     0     0    52k   54k    0  
mot    18k    0     0    18k    0    18k    0     0     0     0     0     0  
nic   349k    0     0     0     0     0     0     0     0   349k  349k    0  
not    42k    0     0     0     0     0     0     0     0    42k   42k    0  
phi   122k    0    <1k    0     0    <1k    0     0     0   110k  110k   12k 
pri    26k   <1k    1k    1k    1k    4k    0     0     0     6k    6k   16k 
syn    39k   24k    5k    5k    4k   39k    0     0     0     0     0     0  
val    24k    4k    9k    9k    2k   24k    0     0     0     0     0     0  

  5036605 Characters in total!!!

Thanks to Stan Janet for providing this overview. (The database is so huge that it takes 8 hours just to count the stuff. Of course, the format is not optimized for speed.)


Institutions involved in UNIPEN


bbn Bolt Beranek and Newman Inc. (MA)
    John Makhoul, Han Shu
    Email: makhoul@bbn.com,hshu@bbn.com
    70 Fawcett Street, Cambridge, MA 02138

sta Stanford University (CA)
    Dave Reynolds
    Email: der@hplb.hpl.hp.com
    Filton Road, Stoke Gifford, Bristol BS12 6QZ, UK

app Apple Computer Inc. (CA)
    Richard F. Lyon, Rus Maxham
    Email: lyon@apple.com,rus@apple.com
    Apple Computer MS 301-3M, One Infinite Loop, Cupertino, CA 95014

att AT&T Bell Labs (CA)
    Isabelle Guyon
    Email: isabelle@research.att.com
    50 Fremont St. 40th Floor, San Francisco, CA 94105

aga AT&T Global Information Systems (GA)
    Wesley G. Hunter
    Email: Wesley.Hunter@AtlantaGA.NCR.com
    500 Tech Parkway, Atlanta, GA 30313

anj AT&T Bell Labs (NJ)
    Jianying Hu
    Email: jianhu@research.att.com
    AT&T Bell Labs, Room 2D-404, Murray Hill, NJ 07974-2070

hpb Hewlett-Packard Laboratories, Bristol (UK)
    Dave Reynolds
    Email: der@hplb.hpl.hp.com
    Filton Road, Stoke Gifford, Bristol BS12 6QZ, UK

abm AB&M GmbH (Germany)
    Michael J. Boldt
    Email: b@abm.de
    Haid-und-Neu-Str. 7, 76131 Karlsruhe, Germany

art Advanced Recognition Technologies Ltd. (Israel)
    Michael Tseitlin
    Email: art@actcom.co.il
    43 Brodezky St., P.O.B. 39918, Tel Aviv, 61398, Israel

atu Aachen Technical University (Germany)
    Christiane Schmidt and Walter Ruetten
    Email: schmidt@techinfo.rwth-aachen.de,walter@ghpc8.ihf.rwth-aachen.de
    Lehrstuhl fur Technische Informatik, Ahornstr. 55, D-52074 Aachen

ced CEDAR, SUNY at Buffalo (NY)
    Rohini K. Srihari
    Email: rohini@cedar.buffalo.edu
    UB Commons, SUNY at Buffalo, Buffalo, NY 14228-2567

dar TH-Darmstadt, Institut fur Datentechnik (Germany)
    Jan Sendler
    Email: jan@peel.dtro.e-technik.th-darmstadt.de
    Merckstr. 25 D-64283 Darmstadt FRG

gmd German Natl. Research Center for Computer Science (GMD) (Germany)
    Ashutosh Malaviya
    Email: malaviya@gmd.de
    Schloss Birlinghoven, 53757 St. Augustin, Germany

huj The Hebrew University, Inst. of Computer Science (Israel)
    Yoram Singer
    Email: singer@cs.huji.ac.il
    Institute of Computer Science, The Hebrew University, 
    Givat Ram, Jerusalem 91904, Israel

ibm IBM T.J. Watson Research Center (NY)
    Michael P. Frank and Andrew W. Senior
    Email: mpf@watson.ibm.com,aws@watson.ibm.com
    30 Saw Mill River Road, Hawthorne, NY 10532

imp Imperial College, Dept. of Elec. Eng. (UK)
    Dominic J. Ostrowski
    Email: d.ostrowski@ic.ac.uk
    Exhibition Rd, London SW7 2B8, England

imt Impending Technologies (CA)
    John Brookes
    Email: jbrookes@ccnet.com
    Suite 2020, 2140 Shattuck, Berkeley, CA 94704-1210

int Institut National des Telecommunications (France)
    Bernadette Dorizzi
    Email: Bernadette.Dorizzi@int-evry.fr
    9 Rue Charles Fourier, 91011 Evry, France

kai Korea Advanced Inst. of Sci. and Tech.
    Kwon Jae-Ook
    Email: jokwon@gorai.kaist.ac.kr
    AI Lab, Computer Science Dept., 373-1 Ku-Sung-Dong, 
    Yu-Sung-Gu, Taejon, Korea

kar University of Karlsruhe (Germany)
    Stefan Manke
    Email: manke@ira.uka.de
    Computer Science Dept., 76128 Karlsruhe, Germany

lav Laval University, Dept. of Elec. Eng. (Canada)
    Marc Parizeau
    Email: parizeau@gel.ulaval.ca
    Ste-Foy, Quebec, G1K 7P4, Canada

lex Lexicus Corp., A Motorola Company (CA)
    Ronjon Nag and Liyang Zhou
    Email: ronjonn@lexicus.com,liyang@lanthanum.lexicus.com
    490 California Ave, Palo Alto, CA 94306

lou Universite Catholique de Louvain (Belgium)
    Jean Luc Voz and Jean Didier Legat
    Email: voz@dice.ucl.ac.be,jdl@dice.ucl.ac.be
    Place du Levant 3, B 1348 Louvain la Neuve, Belgium

mot Motorola New Enterprises (IL)
    Jim Errico
    Email: cje003@email.corp.mot.com
    1501 Woodfield Road, Suite 208 North, Schaumburg, IL 60173

nic Nijmegen Institute for Cognition and Information, 
    Nijmegen University (The Netherlands)
    Dr. Lambert R.B. Schomaker
    Email: schomaker@nici.kun.nl
    P.O. Box 9104, 6500 HE Nijmegen, The Netherlands

not The Nottingham Trent University, Dept. of Computing (UK)
    Paul Anderson
    Email: pda@doc.ntu.ac.uk
    Burton Street, Nottingham NG1 4BU, England

pap Papyrus Associates (France)
    Brian Mottershead
    Email: 100115.3211@compuserve.com
    Place Sophie Lafitte, 06560 Sophia Antipolis, France

phi Philips Research Laboratories (The Netherlands)
    J.G.A. Dolfing and Philippe Gentric
    Email: dolfing@prl.philips.nl,gentric@trantor.lep-philips.fr
    Building WY 2.19,Prof. Holstlaan 44, NL-5656 AA Eindhoven, The Netherlands

pri Princeton University (NJ)
    Eric Sven Ristad
    Email: ristad@princeton.edu
    35 Olden St., Princeton, NJ 08544

rim Rimon Technologies (Israel)
    Haim Weissman
    Email: F67361@BARILAN.BITNET
    12 Hefetz Mordchai St., Petach-Tikva, Israel 49493

scr Scribe-Tek (TX)
    William Weideman
    Email: weideman@connect.net
    P.O. Box 13064, Arlington, TX 76094 (or 3503 Hastings Dr., 
    Arlington, TX 76013)

sie Siemens AG, Corporate Research and Development (Germany)
    Gerd Maderlechner, Volkmar Pflug, Brigitte Wirtz
    Email: gm@zfe.siemens.de,Volkmar.Pflug@zfe.siemens.de,
           Brigitte.Wirtz@zfe.siemens.de
    Otto-Hahn-Ring 6, D-81739, Munchen, Germany

syn Synaptics Inc. (CA)
    John Platt, Nada Matic
    Email: platt@synaptics.com,nada@synaptics.com
    2698 Orchard Parkway, San Jose, CA 95134

tos Toshiba, Multimedia Engineering Lab., AI & Human Interface Dept. (Japan)
    Akinori Kawamura, Yojiro Tonouchi
    Email: kawamura@cru.mmlab.toshiba.co.jp,tono@cru.mmlab.toshiba.co.jp
    70, Yanagi-cho, Saiwai-ku, Kawasaki 210, Japan

ugi University of Genoa - DIST (Italy) and Pentech Associates
    Luigi Barberis
    Email: jjd@dist.dist.unige.it
    Via Opera Pia 13, 16145 Genova, Italy

uqb Universite du Quebec a Trois-Rivieres and Ecole Polytechnique (Canada)
    Fathallah Nouboud and Rejean Plamondon
    Email: nouboud@uqtr.uquebec.ca,ha03@music.mus.polymtl.ca
    Dept. Math-Info, C.P. 500 Trois-Rivieres, QC, Canada G9A5H7

val University of Valladolid, Dept. of Systems Eng. and Control (Spain)
    Yannis Dimitriadis
    Email: yannis@eis.uva.es
    Dept. of Systems Engineering and Control, School of Industrial 
    Engineering, Paseo del Cauce S/N, 47011, Valladolid, Spain


Internet hooks


Isabelle Guyon
AT&T Bell Laboratories,
Room 4g324
Holmdel, NJ 07733, USA
Phone: +1 (908) 949 3220 / Fax: +1 (908) 949 7722
E-mail: isabelle@research.att.com

Stan Janet
National Institute of Standards and Technology
Bldg. 225 Rm. A-216
Gaithersburg, MD 20899, USA
Phone: +1 (301) 975-2919 / Fax: +1 (301) 840-1357
Email: stan@magi.nist.gov

Lambert Schomaker                                           /
NICI, Nijmegen Institute for Cognition and Information    #/########
University of Nijmegen, P.O.Box 9104                    ##/########
6500 HE Nijmegen, The Netherlands                      # / #######
Phone: +31 24 3616029 / Fax: +31 24 3616066            #/####
E-mail: schomaker@nici.kun.nl                          /

