- > - > - > - > - > - > - > - < - < - < - < - < - < - < - < - < -
- > UNIPEN project of data exchange and recognizer benchmarks < -
- > - > - > - > - > - > - > - < - < - < - < - < - < - < - < - < -
UNIPEN SCRAWL #4
Public Interim Report, October 1995
Lambert Schomaker
Isabelle Guyon
Stan Janet
The intended audience consists of Scrib-L subscribers interested in handwriting recognition, UNIPEN members who want to be kept informed, as well as anyone interested in on-line handwriting recognition. Enjoy our enthusiasm,
Lambert Schomaker, 16 Oct. 1995
Before 1993 - All sorts of precursors, including projects by Colin Higgins
(Univ. of Nottingham), Bob Whitrow (NTU), Frans Maarse (NICI),
the Active Book Company (later EO), Hans-Leo Teulings (NICI),
the European Esprit projects P295, P419 and Papyrus and
numerous other projects, companies and institutes where databases
of >20 writers, >100 words per writer were collected or comparisons
between recognition algorithms were performed. However, these
endeavours suffered from one small disadvantage:
they were local (in a lot of possible dimensions).
May '93 - Start of UNIPEN: definition of the database format (Isabelle Guyon)
Mar '94 - UNIPEN advertized its existence on several electronic mailing
lists, resulting in nearly 200 subscriptions to the UNIPEN
newsletter. Public software tools are developed at NICI and AT&T.
Oct '94 - Deadline of the open & public call for data.
A total of 5 million characters of on-line (western)
handwriting has been donated by 40 institutions,
both from industry and university.
At this point in time, we enter the closed stage in UNIPEN!!!
Here, 'closed' means that further data donation and thus
participation in the first benchmark are no longer possible.
1995 - Data accumulation, cleaning and organisation
by the NIST (Stan Janet)
Aug '95 - Release of training sets to donators only
Oct '95 - Release of a so-called 'development test set' so
people can actually run tests on about half as much data
as the actual final test set will contain.
.
.
End 1996 - A closed benchmark of recognizers, organized by NIST,
. by the end of 1996, for original donators of UNIPEN data.
. The final date cannot be set because it is not known yet
. how long it will take the members to be fully prepared for
. test set reception and processing. Then, finally, the
. members will receive test sets for which the recognition
. results must be reported in a predefined format within
. two weeks after receiving the test sets.
.
1997 - A second benchmark round, on other test sets (held back at NIST).
. An equal amount of data as for the first round will be
. used for this second benchmark in early 1997.
.
(future) - Open distribution of UNIPEN on-line handwriting data by
. the LDC. No date has been fixed, as yet. Most likely, this
. will be in the form of CD-ROM, against a reasonable price.
.
(plans) - Periodical UNIPEN database updates and calls for new test sets.
. Examples are on-line recordings of other scripts such as
. Arabic, Hangul, Hebrew, Kanji and others. Other examples include
. pen gestures, diagram entry sequences, and maybe even some
. signatures. Finally, there will be a growing need for data
. as was entered in real-life applications, during their use
. 'in the field'.
.
The benchmark is concerned with writer-independent recognition of sentences, isolated words and isolated characters of any writing style (handprinted and/or cursive). Although UNIPEN will provide, in the future, data for various alphabets, this particular benchmark is limited to letters and symbols from an English computer keyboard. Currently, training set samples are being distributed to donators of UNIPEN.
In February 1995, at a UNIPEN Workshop in Holmdel NJ, USA, details of the benchmarking procedure were determined in a lively discussion. A suite of tests was proposed, such that a given recognizer can be assessed on the basis of its performance profile over these tests. The result of a single test is either a performance number, the status 'Not Applicable', or the status 'Test not performed'. As an example, a recognizer may be designed for isolated uppercase letters only, so that for this recognizer the category of digits is 'N/A (not applicable)'. However, this is just one simple example: much more detailed aspects of benchmarking were discussed at this workshop, such as the statistical reliability of recognition rates as a function of sample size.
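To give an idea of how sample size affects the reliability of a recognition rate, here is a minimal sketch. It assumes that each test sample is an independent trial, so that the number of correct recognitions is binomially distributed. The Wilson score interval used here is our own choice for illustration, not necessarily the method discussed at the workshop; the function name `wilson_interval` and the example numbers are hypothetical.

```python
import math


def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a recognition rate (z=1.96 gives about 95%)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    denom = 1.0 + z * z / n_total
    centre = (p + z * z / (2.0 * n_total)) / denom
    half = z * math.sqrt(
        p * (1.0 - p) / n_total + z * z / (4.0 * n_total * n_total)
    ) / denom
    return centre - half, centre + half


# Same observed rate (90%) on hypothetical test sets of increasing size.
for n in (100, 1000, 10000):
    low, high = wilson_interval((9 * n) // 10, n)
    print(f"n={n:6d}  interval=[{low:.3f}, {high:.3f}]")
```

The interval narrows as the number of test samples grows, which is why a performance number on a small test category should be read with more caution than one on a large category.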
At the ICDAR'95 conference in Montreal, several members of UNIPEN stated that a one year period between receiving training set data and the actual benchmark seems realistic. Although the UNIPEN file format is standardized, there are sufficient differences at the signal level, such that members will need time to write signal adaptation software (resampling, rescaling etc.), apart from the recognizer training or updating.
Layer 1: ASCII data: i.e., the standard UNIPEN file format
Layer 2: SIGNAL: i.e., the assumptions with respect to the
data-as-a-signal. Problems like equidistant sampling in time
vs equidistant sampling in space play a role here.
Layer 3: APPLICATION: i.e., the recognizer or browser which processes
or graphically visualizes the signal.
At this point in time, a lot of work has been done in the area of Layer 1, some work has been done in Layer 3, but Layer 2 has been largely neglected. However, a lot of the work produced by the UNIPEN members in the next few months will be focused on exactly this: 'Obtaining a SIGNAL which is compatible with the local algorithms'. Most probably, the wheel will be reinvented on several sites, simultaneously.
Sharing the resulting signal-conversion software would not only make things easier for others: the comparability of the benchmark results will also be better if the same preprocessing is used within a given approach.
Suppose, for instance, you convert the on-line vectors to a 2-D pixel map in order to apply 'off-line' recognition algorithms. The line generator algorithm you use in this setup (basic incremental, midpoint line, or other) is evidently of paramount importance, as well as details such as line width in pixels and the brush shape used.
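As an illustration of how many choices hide in this conversion, here is a minimal sketch that draws pen-down strokes into a binary pixel map. It assumes an integer all-octant Bresenham line generator, a line width of one pixel and no brush shape; these are our assumptions, and each of them is exactly the kind of detail that needs documenting. The names `bresenham` and `rasterize` are hypothetical.

```python
def bresenham(x0, y0, x1, y1):
    """Return the integer pixels on the line from (x0, y0) to (x1, y1)."""
    pixels = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        pixels.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:  # step in x
            err += dy
            x0 += sx
        if e2 <= dx:  # step in y
            err += dx
            y0 += sy
    return pixels


def rasterize(strokes, width, height):
    """Draw strokes (lists of integer (x, y) points) into a 0/1 pixel map."""
    image = [[0] * width for _ in range(height)]
    for stroke in strokes:
        if len(stroke) == 1:
            segments = [(stroke[0], stroke[0])]  # a single dot
        else:
            segments = zip(stroke, stroke[1:])
        for (xa, ya), (xb, yb) in segments:
            for x, y in bresenham(xa, ya, xb, yb):
                if 0 <= x < width and 0 <= y < height:  # clip to the map
                    image[y][x] = 1
    return image


image = rasterize([[(0, 0), (3, 3)], [(0, 3), (3, 3)]], 4, 4)
for row in image:
    print("".join("#" if v else "." for v in row))
```

Pen-up movements are simply left out here by passing each pen-down stroke as a separate list; a different line generator or a wider brush would give a different pixel map for the same on-line data.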
Spatial resampling: how did you solve the singularities at points of high curvature? What thresholds are used?
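To make the question concrete, here is a minimal sketch of one possible spatial resampling scheme. It assumes that distance is measured along the recorded polyline and that new points are placed by linear interpolation at a fixed path-length `step`; it uses no curvature threshold at all. The function name `resample_spatial` is hypothetical.

```python
import math


def resample_spatial(points, step):
    """Resample a stroke so successive points are `step` apart along the path."""
    if step <= 0:
        raise ValueError("step must be positive")
    if not points:
        return []
    out = [points[0]]
    travelled = 0.0  # path length covered since the last emitted point
    for (xa, ya), (xb, yb) in zip(points, points[1:]):
        seg = math.hypot(xb - xa, yb - ya)
        if seg == 0.0:
            continue  # duplicate sample: the pen did not move
        pos = 0.0  # distance already consumed along this segment
        while travelled + (seg - pos) >= step:
            pos += step - travelled
            f = pos / seg
            out.append((xa + f * (xb - xa), ya + f * (yb - ya)))
            travelled = 0.0
        travelled += seg - pos
    return out


print(resample_spatial([(0, 0), (10, 0)], 2.5))
print(resample_spatial([(0, 0), (3, 0), (3, 4)], 5))
```

In the second call the corner at (3, 0) disappears from the output: the two resulting points are 5 units apart along the path but closer together in a straight line. This is the singularity problem at points of high curvature in its simplest form.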
Temporal resampling: The time stamps T in a UNIPEN .COORD X Y T data set are less accurate than your virtual sample rate requires. What kind of interpolation method did you use?
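One simple answer is linear interpolation onto a uniform time grid, sketched below. This is an assumption for illustration only: it takes the samples as (x, y, t) tuples with non-decreasing time stamps, and it does nothing to repair inaccurate time stamps. The function name `resample_temporal` is hypothetical.

```python
def resample_temporal(samples, dt):
    """Linearly interpolate (x, y, t) samples onto a time grid with spacing dt."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    if not samples:
        return []
    if len(samples) == 1:
        return [samples[0]]
    t0, t_end = samples[0][2], samples[-1][2]
    out = []
    i = 0  # index of the left end of the current segment
    k = 0  # index on the uniform grid
    while t0 + k * dt <= t_end:
        t = t0 + k * dt  # computed from k to avoid accumulating rounding error
        while i < len(samples) - 2 and samples[i + 1][2] < t:
            i += 1
        xa, ya, ta = samples[i]
        xb, yb, tb = samples[i + 1]
        f = 0.0 if tb == ta else (t - ta) / (tb - ta)
        out.append((xa + f * (xb - xa), ya + f * (yb - ya), t))
        k += 1
    return out


recorded = [(0, 0, 0), (10, 0, 10), (10, 20, 30)]
for x, y, t in resample_temporal(recorded, 5):
    print(t, x, y)
```

Other interpolation methods (splines, for instance) will give different velocity profiles from the same recorded data, which is why the method used should be stated along with the benchmark results.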
What did you do to prevent boundary (i.e., run-in/run-out) problems in digital filtering (smoothing)?
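As one possible answer, here is a minimal sketch that smooths a coordinate sequence with a moving average and mirrors the signal at both ends, so that the output has the same length as the input. Mirroring is our assumption here; truncating the window or repeating the end sample are other options, and each gives different run-in/run-out behaviour. The names `reflect_index` and `smooth_reflect` are hypothetical.

```python
def reflect_index(i, n):
    """Map any integer index onto 0..n-1 by mirroring at both ends."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    j = i % period
    return j if j < n else period - j


def smooth_reflect(values, half_width):
    """Moving average of width 2*half_width+1 with mirrored boundaries."""
    if half_width < 0:
        raise ValueError("half_width must be non-negative")
    n = len(values)
    window = 2 * half_width + 1
    return [
        sum(values[reflect_index(i + k, n)]
            for k in range(-half_width, half_width + 1)) / window
        for i in range(n)
    ]


x = [0.0, 1.0, 2.0, 3.0, 4.0]
print(smooth_reflect(x, 1))
```

Note that with mirroring the first and last values of a straight ramp are pulled inwards, while the interior samples are unchanged; this is the kind of boundary effect that is worth documenting.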
We may conclude that the sharing of software at the SIGNAL level (Layer 2) is highly desirable. The very least one can do is to document the conversion process thoroughly. Instead of having about 40 signal conversion methods in UNIPEN, we may try to identify a much smaller set of methods, as is done superficially in examples 1-3 above. As a consequence, we will then be able to document, e.g., that preprocessing method PMn was used in obtaining the benchmark results of institution X on test set K.
We are eager to hear about the experiences of UNIPEN members, and of people using the example data, in the processing of UNIPEN data.
(L.S.)
Below is a table, compiled by Stan Janet of NIST, with the number of characters per donator, and as distributed over the benchmark tasks. The idea is that a given recognizer obtains a performance profile for a number of categories (where some categories may be Not Applicable, of course).
Below are the character counts after partitioning (these are the data of 2733 writers in total!). Non-members of UNIPEN: be patient, this data will become available in due time. This is just to give you an idea of the amounts involved:
      In all  By benchmark task
ID   samples    1a    1b    1c    1d     2     3     4     5     6     7     8
apa      61k    3k   10k   38k    9k   61k   61k     0     0     0     0   61k
apb     109k   15k   28k   60k    7k  109k  109k     0     0     0     0  109k
apd     112k     0     0     0     0     0     0     0     0   92k  112k  112k
ape      56k     0     0     0     0     0     0     0     0   41k   56k   56k
app     115k    7k   20k   70k   18k  115k  115k     0     0     0     0  115k
ata     376k     0     0     0     0     0     0     0     0  178k  187k  376k
bbb      35k     0     0     0     0     0     0     0     0     0     0   35k
bbc      34k     0     0     0     0     0     0     0     0     0     0   34k
bbd     621k     0     0     0     0     0     0     0     0     0     0  621k
ceb      50k   <1k    2k   45k    2k   50k   50k     0     0   47k   50k   50k
cec     162k     0     0     0     0     0     0     0     0  152k  162k  162k
ced      49k    8k   17k   16k    7k   49k   49k     0     0     0     0   49k
cee     205k     0     0     0     0     0     0     0     0  207k  207k    1k
cef      57k     0     0     0     0     0     0     0     0   43k   57k   57k
hpp     303k     0     0     0     0     0     0     0     0  278k  302k  258k
pap      74k     0     0     0     0     0     0     0     0   72k   74k     0
rim      16k     0     0     0     0     0     0     0     0   16k   16k     0
scr      38k     0     0     0     0     0     0     0     0     0     0   38k
sie      65k     0     0    4k     0    4k     0     0     0   60k   60k     0
sta     490k     0     0     0     0     0     0     0     0  484k  490k   25k
tos      33k    4k    9k    9k   11k   33k     0     0     0     0     0     0
ugi      17k     0     0     0     0     0     0     0     0   17k   17k     0
uqb      26k    4k   12k     0   10k   26k     0     0     0     0     0     0
abm      19k     0     0     0     0     0     0     0     0   18k   18k   <1k
aga     133k    3k    7k    7k    1k   18k     0     0     0     0     0  115k
anj      42k     0     0     0     0     0     0     0     0   42k   42k     0
apc      65k     0     0     0     0     0     0     0     0   63k   65k   65k
art      19k    1k    5k   12k    1k   19k   19k     0     0   14k   19k   19k
att      71k     0     0     0     0     0     0     0     0   25k   35k   71k
atu      13k     0     0     0     0     0     0     0     0     0     0   13k
bba      54k     0     0     0     0     0     0     0     0     0     0   54k
cea      14k   <1k    2k   11k   <1k   14k   14k     0     0   13k   14k   14k
dar      16k     0     0     0     0     0     0     0     0   16k   16k   16k
gmd      22k    6k     0   13k    4k   22k     0     0     0     0     0     0
hpb     134k     0     0     0     0     0     0     0     0   42k   47k  134k
huj      14k     0     0     0     0     0     0     0     0   14k   14k     0
ibm     107k    9k   24k   24k   11k   68k     0     0     0   38k   38k     0
imp      28k    1k    3k    3k    4k   12k     0     0     0   17k   17k     0
imt      14k     0     0     0     0     0     0     0     0   14k   14k     0
int      84k     0     0     0     0     0     0     0     0   84k   84k     0
kai      66k     0   10k   44k   10k   65k   47k     0     0   20k   48k     0
kar     115k     0     0     0     0     0     0     0     0  113k  115k     0
lav      13k     0     0    6k     0    6k     0     0     0    7k    7k     0
lex     217k     0     0     0     0     0     0     0     0  143k  200k  217k
lou      55k   <1k   <1k   <1k   <1k   <1k     0     0     0   52k   54k     0
mot      18k     0     0   18k     0   18k     0     0     0     0     0     0
nic     349k     0     0     0     0     0     0     0     0  349k  349k     0
not      42k     0     0     0     0     0     0     0     0   42k   42k     0
phi     122k     0   <1k     0     0   <1k     0     0     0  110k  110k   12k
pri      26k   <1k    1k    1k    1k    4k     0     0     0    6k    6k   16k
syn      39k   24k    5k    5k    4k   39k     0     0     0     0     0     0
val      24k    4k    9k    9k    2k   24k     0     0     0     0     0     0
5036605 Characters in total!!!
Thanks to Stan Janet for providing this overview. (The database is so huge that it takes 8 hours just to count the stuff. Of course, the format is not optimized for speed.)
bbn Bolt Beranek and Newman Inc. (MA)
John Makhoul,Han Shu
Email: makhoul@bbn.com,hshu@bbn.com
70 Fawcett Street, Cambridge, MA 02138
sta Stanford University (CA)
Dave Reynolds
Email: der@hplb.hpl.hp.com
Filton Road, Stoke Gifford, Bristol BS12 6Qz, UK
app Apple Computer Inc. (CA)
Richard F. Lyon,Rus Maxham
Email: lyon@apple.com,rus@apple.com
Apple Computer MS 301-3M, One Infinite Loop, Cupertino, CA 95014
att AT&T Bell Labs (CA)
Isabelle Guyon
Email: isabelle@research.att.com
50 Fremont St. 40th Floor, San Francisco, CA 94105
aga AT&T Global Information Systems (GA)
Wesley G. Hunter
Email: Wesley.Hunter@AtlantaGA.NCR.com
500 Tech Parkway, Atlanta, GA 30313
anj AT&T Bell Labs (NJ)
Jianying Hu
Email: jianhu@research.att.com
AT&T Bell Labs, Room 2D-404, Murray Hill, NJ 07974-2070
hpb Hewlett-Packard Laboratories, Bristol (UK)
Dave Reynolds
Email: der@hplb.hpl.hp.com
Filton Road, Stoke Gifford, Bristol BS12 6Qz, UK
abm AB&M GmbH (Germany)
Michael J. Boldt
Email: b@abm.de
Haid-und-Neu-Str. 7, 76131 Karlsruhe, Germany
art Advanced Recognition Technologies Ltd. (Israel)
Michael Tseitlin
Email: art@actcom.co.il
43 Brodezky St., P.O.B. 39918, Tel Aviv, 61398, Israel
atu Aachen Technical University (Germany)
Christiane Schmidt and Walter Ruetten
Email: schmidt@techinfo.rwth-aachen.de,walter@ghpc8.ihf.rwth-aachen.de
Lehrstuhl fur Technische Informatik, Ahornstr. 55, D-52074 Aachen
ced CEDAR, SUNY at Buffalo (NY)
Rohini K. Srihari
Email: rohini@cedar.buffalo.edu
UB Commons, SUNY at Buffalo, Buffalo, NY 14228-2567
dar TH-Darmstadt, Institut fur Datentechnik (Germany)
Jan Sendler
Email: jan@peel.dtro.e-technik.th-darmstadt.de
Merckstr. 25 D-64283 Darmstadt FRG
gmd German Natl. Research Center for Computer Science (GMD) (Germany)
Ashutosh Malaviya
Email: malaviya@gmd.de
Schloss Birlinghoven, 53757 St. Augustin, Germany
huj The Hebrew University, Inst. of Computer Science (Israel)
Yoram Singer
Email: singer@cs.huji.ac.il
Institute of Computer Science, The Hebrew University,
Givat Ram, Jerusalem 91904, Israel
ibm IBM T.J. Watson Research Center (NY)
Michael P. Frank and Andrew W. Senior
Email: mpf@watson.ibm.com,aws@watson.ibm.com
30 Saw Mill River Road, Hawthorne, NY 10532
imp Imperial College, Dept. of Elec. Eng. (UK)
Dominic J. Ostrowski
Email: d.ostrowski@ic.ac.uk
Exhibition Rd, London SW7 2B8, England
imt Impending Technologies (CA)
John Brookes
Email: jbrookes@ccnet.com
Suite 2020, 2140 Shattuck, Berkeley, CA 94704-1210
int Institut National des Telecommunications (France)
Bernadette Dorizzi
Email: Bernadette.Dorizzi@int-evry.fr
9 Rue Charles Fourier, 91011 Evry, France
kai Korea Advanced Inst. of Sci. and Tech.
Kwon Jae-Ook
Email: jokwon@gorai.kaist.ac.kr
AI Lab, Computer Science Dept., 373-1 Ku-Sung-Dong,
Yu-Sung-Gu, Taejon, Korea
kar University of Karlsruhe (Germany)
Stefan Manke
Email: manke@ira.uka.de
Computer Science Dept., 76128 Karlsruhe, Germany
lav Laval University, Dept. of Elec. Eng. (Canada)
Marc Parizeau
Email: parizeau@gel.ulaval.ca
Ste-Foy, Quebec, G1K 7P4, Canada
lex Lexicus Corp., A Motorola Company (CA)
Ronjon Nag and Liyang Zhou
Email: ronjonn@lexicus.com,liyang@lanthanum.lexicus.com
490 California Ave, Palo Alto, CA 94306
lou Universite Catholique de Louvain (Belgium)
Jean Luc Voz and Jean Didier Legat
Email: voz@dice.ucl.ac.be,jdl@dice.ucl.ac.be
Place du Levant 3, B 1348 Louvain la Neuve, Belgium
mot Motorola New Enterprises (IL)
Jim Errico
Email: cje003@email.corp.mot.com
1501 Woodfield Road, Suite 208 North, Schaumburg, IL 60173
nic Nijmegen Institute for Cognition and Information,
Nijmegen University (The Netherlands)
Dr. Lambert R.B. Schomaker
Email: schomaker@nici.kun.nl
P.O. Box 9104, 6500 HE Nijmegen, The Netherlands
not The Nottingham Trent University, Dept. of Computing (UK)
Paul Anderson
Email: pda@doc.ntu.ac.uk
Burton Street, Nottingham NG1 4BU, England
pap Papyrus Associates (France)
Brian Mottershead
Email: 100115.3211@compuserve.com
Place Sophie Lafitte, 06560 Sophia Antipolis, France
phi Philips Research Laboratories (The Netherlands)
J.G.A. Dolfing and Philippe Gentric
Email: dolfing@prl.philips.nl,gentric@trantor.lep-philips.fr
Building WY 2.19,Prof. Holstlaan 44, NL-5656 AA Eindhoven, The Netherlands
pri Princeton University (NJ)
Eric Sven Ristad
Email: ristad@princeton.edu
35 Olden St., Princeton, NJ 08544
rim Rimon Technologies (Israel)
Haim Weissman
Email: F67361@BARILAN.BITNET
12 Hefetz Mordchai St., Petach-Tikva, Israel 49493
scr Scribe-Tek (TX)
William Weideman
Email: weideman@connect.net
P.O. Box 13064, Arlington, TX 76094 (or 3503 Hastings Dr.,
Arlington, TX 76013)
sie Siemens AG, Corporate Research and Development (Germany)
Gerd Maderlechner,Volkmar Pflug,Brigitte Wirtz
Email: gm@zfe.siemens.de,Volkmar.Pflug@zfe.siemens.de,
Brigitte.Wirtz@zfe.siemens.de
Otto-Hahn-Ring 6, D-81739, Munchen, Germany
syn Synaptics Inc. (CA)
John Platt, Nada Matic
Email: platt@synaptics.com,nada@synaptics.com
2698 Orchard Parkway, San Jose, CA 95134
tos Toshiba, Multimedia Engineering Lab., AI & Human Interface Dept. (Japan)
Akinori Kawamura, Yojiro Tonouchi
Email: kawamura@cru.mmlab.toshiba.co.jp,tono@cru.mmlab.toshiba.co.jp
70, Yanagi-cho, Saiwai-ku, Kawasaki 210, Japan
ugi University of Genoa - DIST (Italy) and Pentech Associates
Luigi Barberis
Email: jjd@dist.dist.unige.it
Via Opera Pia 13, 16145 Genova, Italy
uqb Universite du Quebec a Trois-Rivieres and Ecole Polytechnique (Canada)
Fathallah Nouboud and Rejean Plamondon
Email: nouboud@uqtr.uquebec.ca,ha03@music.mus.polymtl.ca
Dept. Math-Info, C.P. 500 Trois-Rivieres, QC, Canada G9A5H7
val University of Valladolid, Dept. of Systems Eng. and Control (Spain)
Yannis Dimitriadis
Email: yannis@eis.uva.es
Dept. of Systems Engineering and Control, School of Industrial
Engineering, Paseo del Cauce S/N, 47011, Valladolid, Spain
Stan Janet
National Institute of Standards and Technology
Bldg. 225 Rm. A-216
Gaithersburg, MD 20899, USA
Phone: +1 (301) 975-2919 / Fax: +1 (301) 840-1357
Email: stan@magi.nist.gov
Lambert Schomaker
NICI, Nijmegen Institute for Cognition and Information
University of Nijmegen, P.O.Box 9104
6500 HE Nijmegen, The Netherlands
Phone: +31 24 3616029 / Fax: +31 24 3616066
E-mail: schomaker@nici.kun.nl