(OBSOLETE, original text)

>->->->->->->->-<-<-<-<-<-<-<-<-<-
-> First UNIPEN Benchmark of On-line Handwriting Recognizers <-
-> Organized by NIST <-
>->->->->->->->-<-<-<-<-<-<-<-<-<-

June 1994

Content:

CALL FOR DATA
Registration form
1 Historical Perspective
2 Purpose of the Benchmark
3 Who Should Participate and Why?
4 How to Format the Data?
5 Protocol of Data Donation

Appendices:
A FTP-site
B UNIPEN 1.0 format
C Data Skeletons
D Tips for collecting data
E IAPR
F LDC
G NIST

* CALL FOR DATA (please post)

At the initiative of Technical Committee 11 of the International Association for Pattern Recognition (IAPR), the UNIPEN project was started to stimulate research and development in on-line handwriting recognition (e.g. for pen computers and pen communicators). UNIPEN provides a platform for data exchange at the Linguistic Data Consortium (LDC) and is organizing a worldwide benchmark this year under the control of the US National Institute of Standards and Technology (NIST).

The benchmark is concerned with writer-independent recognition of sentences, isolated words and isolated characters of any writing style (handprinted and/or cursive). Although UNIPEN will provide, in the future, data for various alphabets, this particular benchmark is limited to letters and symbols from an English computer keyboard. The data will be donated by the participants.

* Conditions of participation

Participation in the benchmark is open to any individual or institution who provides a sample of handwriting in the UNIPEN format which contains at least 12,000 characters. The data must be of acceptable quality and donated by October 1st, 1994. The database of donated samples will be available for free to the data donators.
Registration material can be obtained by sending email to stan@magi.ncsl.nist.gov or via ftp:

   ftp ftp.cis.upenn.edu
   Name: anonymous
   Password: [use your email address]
   ftp> cd pub/UNIPEN-pub/documents
   ftp> get call-for-data.ps
   ftp> quit

* Organizing committee

Isabelle Guyon, AT&T Bell Laboratories, USA
Lambert Schomaker, Nijmegen Institute for Cognition and Information, The Netherlands
Stan Janet, National Institute of Standards and Technology, USA
Mark Liberman, Linguistic Data Consortium, University of Pennsylvania, USA
Rejean Plamondon, IAPR, TC11, Ecole Polytechnique de Montreal, Canada

* Registration Form

Mail or fax (no email) to:

   Stan Janet
   UNIPEN benchmark
   National Institute of Standards and Technology
   Building 225, room A-216
   Gaithersburg, MD 20899
   Phone: +1 (301) 975-2916
   Fax: +1 (301) 840-1357
   Email: stan@magi.ncsl.nist.gov

Participants must register to be listed on the benchmark mailing list and to get access to the FTP site at LDC.

-------------------------------- CUT HERE ----------------------------------

Participation in the benchmark implies the acceptance of the conditions of data donation and the benchmark protocol. The organizing committee reserves the right to make changes to the benchmark protocol after receiving the data.

The database, which will be constituted both from the samples donated by the participants and from the data collected by LDC, will be made available for free to the data donators. However, NIST reserves the right to refuse data of insufficient quality. Data donated and accepted may or may not be used for the benchmark. Donated and accepted data give the right, but not the obligation, to participate in the benchmark. Participants can withdraw from the benchmark at any time before they release their recognition results to NIST during the final test. Data which has been donated will not be given back.
Participation in the benchmark implies the donation of a database containing at least one homogeneous set of 12,000 characters (or the equivalent in words or sentences), segmented and labeled at the ``text'', ``word'' or ``character'' level, formatted in the UNIPEN format, and deposited at the FTP site at LDC before October 1st, 1994. Donations of several data sets or of larger data sets are encouraged.

After the benchmark, the training and test data will be distributed by the LDC gratis to its members, and to others on payment of a fee. The benchmark test data will become public domain and will be distributed by anonymous ftp. The results of the benchmark will be published in the open literature.

Name:
Affiliation:
Address:
Phone:
Fax:
Email:
Approximate quantity of data that will be donated (in number of characters):
Date and signature:

--------------------------------- CUT HERE -----------------------------------------

1. Historical Perspective
   ----------------------

On-line handwriting recognition addresses the problem of recognizing handwriting from data collected with a sensitive pad which provides discretized pen trajectory information. In contrast to other pattern recognition fields, such as speech recognition and optical character recognition, on-line handwriting recognition has seen no significant progress in the past few years toward making large corpora of training and test data publicly available, and no open competitions have been organized.

The first impulse to UNIPEN was given at the 11th IAPR-IEEE International Conference on Pattern Recognition, in September 1992, by a group of experts, Technical Committee 11 of the IAPR, chaired by Professor Rejean Plamondon. Information on the International Association for Pattern Recognition (IAPR) and Technical Committee 11 can be found in appendix E.
Two IAPR delegates (Isabelle Guyon and Lambert Schomaker) were designated to explore the possibility of creating large databases for on-line handwriting recognition research and development. A small working group was constituted to get the project started.

In May 1993, a nucleus of experts in on-line handwriting recognition (Tetsu Fujisaki (IBM), Ronjon Nag (Lexicus), Sandy Benett (GO/EO), Dick Lyons (Apple), Yves Chauvin (NetID), Dave Reynolds and Dan Flickinger (HP), Isabelle Guyon (AT&T) and Lambert Schomaker (NICI)) laid the foundations of UNIPEN. It was proposed that a common data format be designed to facilitate data exchange. It was decided that contacts would be made with the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST) to get the data distributed and to arbitrate benchmarks.

In summer 1993, the UNIPEN format was designed, incorporating features of the internal formats of several institutions, including IBM, Apple (Tap), Microsoft, Slate (Jot), HP, AT&T, NICI, GO and CIC. The format was then tested independently by the members of the working group. A second round of tests was organized in autumn 1993 to check the changes and additions to the format. In particular, the benchmark protocol was tested. The resulting format was used internally at AT&T and NICI (the home institutions of Isabelle Guyon and Lambert Schomaker) to collect data and benchmark recognizers. In parallel, a set of tools to parse the format and to browse the data was developed at AT&T and NICI.

In January 1994, the negotiations with LDC and NIST resulted in the organization of the first UNIPEN benchmark. There is now an FTP site at LDC where data and programs can be exchanged, and NIST is supervising the benchmark. For information about LDC and NIST, see appendices F and G. For information about the FTP site, see appendix A.
In March 1994, UNIPEN advertised its existence on several electronic mailing lists, resulting in nearly 200 subscriptions to the UNIPEN newsletter.

In June 1994, the instructions for participation in the first UNIPEN benchmark, limited to the Latin alphabet, are released. October 1st, 1994 is the deadline for submitting data. The benchmark will take place in 1995. The activities of UNIPEN will expand in the future according to the needs and desires of the participants.

2. Purpose of the Benchmark
   ------------------------

The three key features of the UNIPEN benchmark are that:
- the test data will be donated by the participants,
- the tests of the recognizers will be run by the participants,
- the statistics on the results will be published at a workshop.

The nature and structure of the benchmark were chosen to serve several goals:
- Constitute a sizable, quality database which will be available for free to all the data donators and will later (when the benchmark is over) be distributed by LDC.
- Evaluate the state of the art in on-line handwriting recognition by testing many recognizers under the same conditions on several tasks of varying difficulty.
- Bring together researchers and developers in on-line handwriting recognition from academia and industry to promote exchanges and collaborations.

The benchmark was also designed to satisfy a number of constraints such as material constraints (rely on voluntary work, use a small budget, get started rapidly):

Data.
----
We believe that data diversity is an indispensable feature which can be obtained only via the donation of data coming from a large variety of sources. Thanks to data donation, we expect to obtain samples of various writing styles, handprint and cursive, isolated characters, isolated words, phrases, sentences, paragraphs, etc. from a large diversity of vocabularies. This will prevent our test set from being biased towards a specific kind of data.
If you plan to collect data specifically for the UNIPEN benchmark, please read appendix D. Relying on data donations rather than collecting our own data is not justified by material constraints. We estimated that the cost of collecting a ``raw'' database of 100,000 words is roughly $24000, broken into: 5 data collection stations (PC + disk + pad) at $3000 plus 500 hours of data donators at $15 per hour and 100 hours of data collection supervisor at $15 per hour. This estimate does not include software development, manual segmentation and data cleaning costs. To ensure that a minimum amount of training data will be available to all participants, LDC is financing the collection of a training database of approximately 100,000 words. The downside of the data donation paradigm is that there is a risk of constituting an heterogeneous database, each part being too small to be usable by itself. To remedy this problem and ensure some data homogeneity, this first benchmark is limited to the Latin alphabet (plus numbers and some symbols). For further details, see section 4. We anticipate that other data collections and benchmarks along similar lines will be organized in the future for other more complicated tasks and other alphabets. Test. ---- The test will be broken into several sections with various levels of difficulty. This will give to everyone the opportunity to achieve excellence on at least part of the test. Note that achieving excellence on your own data sets will not be considered a primary achievement. Average scores will be computed separately on your own data sets and on those offered by others. Material constraints prevented us from opting for labor intensive solutions such as running recognizers tests ``on-site'' under the supervision of a benchmark officer, or porting all the recognizers to the same benchmark platform. 
The last solution would have been anyways impractical to benchmark commercial recognizers which are bound to particular hardware platforms and which source code is not available. Instead, we decided that the tests will be run by the designers of the recognizers themselves. NIST will provide them with unlabeled data. They will return to NIST the recognition results. NIST will compute the recognition performance. We rely on the integrity of the participants not to attempt labeling the data themselves in order to tune their recognizers by using the test data. To encourage fair competition, a development test set (with labels) will be available in advance. During the development period, participants will have access by email to the data scoring machine which will return statistics on results formatted in the UNIPEN format. Once the real benchmark test set is released, participants will be under time pressure to produce the results. The date of the final test has not been fixed yet. Participants will have about one year to work between the release of the training data and development test set and the final test. Workshop. -------- The results of the benchmark will be announced by NIST at the workshop organized at the end of the final test. The results will not be anonymous, to avoid technical and legal complications. Participants will have the choice to withdraw any time before the final test is run, in which case their results will not be published. The number of participants to the workshop will be limited. Would that become necessary, priority will be given to the donators of large data sets and to those who are willing to publish their results. Otherwise, the workshop will be open to all data donators. 3. Who Should Participate and Why? 
------------------------------ You may find many reasons to enter the UNIPEN benchmark, that is, to donate data and eventually to participate in the test, for example: * You are curious to know what the state-of-the-art is on your own data set. * You need more training and test data and you would like to use this opportunity to exchange data. * You are looking for new challenging tasks. * You think that you will do well in the competition which may secure some funding to pursue your research or will be good advertisement to sell your products. * You want to use the competition to make contacts with other researchers and identify solutions to your problems. People frequently ask what kind of data they should give. Is it to the donator's advantage to: * give a data set that he is very familiar with, hoping to perform well at least on that set? - Not necessarily: the results on the donator's own data sets will be scored independently. * give data that is unreasonably hard? - Not necessarily since everybody will then perform poorly and it will not be possible to evaluate the relative strength of the recognizers. * give data that is extremely easy? - Chances are that everybody will do well on easy tasks. However, for practical purposes, it may be important to know what the state-of-the-art is, even on easy tasks (e.g. the recognition of isolated digits). * give data that his own recognizer cannot handle? - Possibly: that way he will know how far other groups have already gone on that particular task. * give nonsensical data? (e.g. characters that have all the same labels)? - This is not recommended at all. Compensation for large data donations: ------------------------------------- To encourage donations of large quantities of data, LDC will remunerate donations of labeled databases of more than 50,000 characters with a compensation. Donators of unsegmented data (i.e. segmentation at the TEXT level only) will receive $2 per 100 characters. 
Donators of data which was machine segmented on the basis of bounding boxes or combs (i.e. segmentation at the WORD or CHARACTER level) will receive $5 per 100 characters. Donators of connected handwriting (run-on or cursive) which was accurately ``manually'' segmented at the character level will receive $10 per 100 characters. Donators of data which was completely ``manually'' reviewed after data collection, sample by sample, by at least two people, to verify the labels, the quality of the samples and the segmentation, will get an additional $5 per 100 characters. This amount does not intend to cover all the data collection costs, but will help donators to pursue their data collection effort. LDC has allocated $25,000 to be distributed on a first come first serve basis. LDC reserves the right of refusing data of too poor quality. 4. How to Format the Data? ---------------------- All the data must be formatted in the UNIPEN format. This is a one time effort on your side. Afterwards, you will be able to access large quantities of data all written in the same format. The format is very easy to produce. It takes on the average 6 hours to a novice to read the documentation and write a data conversion program. The UNIPEN format was designed to be a universal format for encoding on-line handwriting data. It is versatile beyond the needs of the benchmark. To simplify your task and that of NIST, we designed some skeletons of data samples that we ask you to refer to (see appendix C). We provide also the full UNIPEN documentation for reference (see appendix B). 
Here is a list of the important restrictions of the UNIPEN format which apply to the benchmark data:

Alphabet:
--------
The data samples should only contain symbols from the following alphabet:

.ALPHABET "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
          "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
          "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
          "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
          "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
          "~" "!" "@" "#" "$" "%" "^" "&" "*" "(" ")" "-" "+" "="
          "|" "\\" "/" "{" "}" "?" "[" "]" "\"" ":" "<" ">" ","
          "." ";" "'" "`" "_" " "

This alphabet contains all the ASCII symbols available from an English keyboard and space. Data donators may use a subset of all allowed symbols.

Recall that \ is the escape character. Backslash should be written \\ and double-quote should be written \". The label "\@" (not to be confused with "@") has been adopted to encode ligatures.

Bug: The AWK parser does not accept labels consisting of a single space " ", but it will accept "\ ". However, spaces in between characters do not need to be preceded by a backslash.

Segments:
--------
Only three kinds of segments are allowed:

-> CHARACTER - any single character from the alphabet.
   Examples: "A", "b", "0", " ", "@".
-> WORD - sequence of 1 or more non-space characters from the alphabet.
   Examples: "A", "b", "0", "The", "AC/DC", "99", "88.5%", "(C)", "{a\|b}".
-> TEXT - zero or more characters from the alphabet, no leading or trailing blank characters, no occurrences of two or more adjacent blanks.
   Examples: "", "A", "that nothing more happened,", "Where is it?".

Lexicons:
--------
Lexicons may contain words, lines, phrases or sentences. Only one lexicon per database is allowed. Some entries in the lexicon may be absent from the data labels and vice versa.
If the lexicon is a superset of the labels, it means that you are defining a ``limited vocabulary'' task, but that only a few examples drawn from that vocabulary are in the test set that you are providing. If the lexicon is a subset of the labels, it means that you want to test how well the recognizer will do on words that are not in the lexicon. The goal of providing a lexicon should be, in either case, to give some linguistic help to the recognizer, ** NOT to confuse it **.

Examples:

Completely unsegmented text:
.LEXICON "Once upon a time, there was a young and pretty frog, that lived in a pound nearby a castle, etc."

Text segmented into sentences:
.LEXICON "Once upon a time," "there was a young" "and pretty frog," "that lived in a pound" "nearby a castle," "etc."

Word lexicon:
.LEXICON "a" "and" "frog" "etc."

Files:
-----
The directory tree shown in the example of appendix C should be respected. The name "org" will be replaced by a three letter identification which will be given to you upon registration.

All included files should reside in the include directory. No path should be indicated in .INCLUDE statements. No file should be included in another include file.

All file names should conform to the DOS 8.3 file name limitation, use one of the extensions .dat, .doc or .lex, and be written in lowercase.

All files should contain only characters from the ``alphabet'' defined above plus tabs (octal 011) and new lines (octal 012). Any other character will be deleted. After the deletion process, the last line of each file must be terminated by a new line. No single line should contain more than 1023 characters, including the terminating new line.

Each data file should contain only data from a single writer. Indicate at the top of each file .WRITER_ID, which gives a unique identifier to each writer. The data donators are responsible for maintaining proper book-keeping of writer identifiers.

5.
Protocol of Data Donation
-------------------------

These are the steps to be followed to donate data:

1- Registration (mandatory): Fill out the registration form and return it by mail or fax to Stan Janet at NIST. You will receive a personal three letter identification (e.g. ``org'') and a password.

2- Data sanity check: Retrieve data visualization tools (browsers) and parsers from the directory pub/UNIPEN/tools (see appendix A). Verify that the syntax of your data passes the tests.

3- Data submission: The preferred data submission protocol is to send by ftp a tarred and compressed version of your directory ``org'':

   tar -cvf org.tar org      (the file org.tar is created)
   compress org.tar          (the file org.tar.Z is created)
   ftp ftp.cis.upenn.edu
   Name: anonymous
   Password: [use your email address]
   ftp> quote site group [use the group name provided by Stan Janet]
   ftp> quote site gpass [use the password provided by Stan Janet]
   ftp> cd pub/UNIPEN/repository/locked
   ftp> binary
   ftp> send org.tar.Z
   ftp> quit

IMPORTANT: deposit data in pub/UNIPEN/repository/locked/ only. This directory is ``write-only'', which protects your data from being read by other participants. Only Stan Janet has the privilege of reading from that directory. He will keep your data confidential until the benchmark takes place.

Donators are also allowed to send in data by mail on 3.5 inch DOS-formatted floppy disks. Floppies should be mailed to Stan Janet.

4- Acknowledgment: An acknowledgment of receipt of your data, diagnostics about the format and statistics about the data will be emailed back to you.

You are encouraged to submit at least small subsets of your data early if you want to obtain sufficient help from us to debug your format.
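
As a complement to the parsers mentioned in step 2, the file restrictions of section 4 (lowercase 8.3 file names with the extensions .dat, .doc or .lex, lines of at most 1023 characters including the new line, and a terminating new line) can be checked before the directory is packed. The Python sketch below is only an illustration and is not one of the UNIPEN tools: the function name check_file is hypothetical, and the set of characters accepted in the base name (lowercase letters, digits, underscore, hyphen) is an assumption, since the text above only says ``8.3 DOS''.

```python
import re

# Assumption: base name of 1 to 8 lowercase letters, digits, '_' or '-',
# followed by one of the three extensions named in section 4.
NAME_RE = re.compile(r"^[a-z0-9_\-]{1,8}\.(dat|doc|lex)$")

MAX_LINE = 1023  # characters per line, terminating new line included


def check_file(name, content):
    """Return a list of problems found in one file (empty list if none)."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("file name is not a lowercase 8.3 name "
                        "ending in .dat, .doc or .lex")
    if content and not content.endswith("\n"):
        problems.append("last line is not terminated by a new line")
    for number, line in enumerate(content.split("\n"), start=1):
        # The +1 accounts for the terminating new line of each line.
        if len(line) + 1 > MAX_LINE:
            problems.append("line %d is longer than %d characters"
                            % (number, MAX_LINE))
    return problems


if __name__ == "__main__":
    print(check_file("abc00001.dat", ".VERSION 1.0\n"))
    print(check_file("TooLongName.txt", "x" * 1023))
```

Such a check does not replace the syntax checkers at the FTP site; it only catches file-level problems early.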
***************************************************************************
***************************** Appendices **********************************
***************************************************************************

A- FTP-site
   --------

The anonymous FTP site at LDC gives access only to the general UNIPEN documentation:

   ftp ftp.cis.upenn.edu
   Name: anonymous
   Password: [use your email address]
   ftp> cd pub/UNIPEN-pub/documents
   ftp> get call-for-data.ps      (this file)
   ftp> cd ../definition
   ftp> get unipen.def            (format definition)
   ftp> quit

The complete UNIPEN site is accessible only to registered benchmark participants. The password needed to extend your privileges will be delivered to you upon registration.

   ftp ftp.cis.upenn.edu
   Name: anonymous
   Password: [use your email address]
   ftp> quote site group [use the group name provided by Stan Janet]
   ftp> quote site gpass [use the password provided by Stan Janet]
   ftp> cd pub/UNIPEN
   ftp> get README
   ftp> quit

   ------------------ pub/UNIPEN/ -------------------------------
   |         |            |            |          |             |
 README  definition/  databases/    tools/   repository/   documents/
                      |   |   |               |       |
               training/  |  test/         locked/  open/
                      examples/

UNIPEN/
definition/........Latest release of the UNIPEN format
databases/.........Location of the real data (largely empty at the moment)
   training/.......Data to be used for training
   test/...........Data to be used for testing exclusively
   examples/.......Data to show how UNIPEN files may look
tools/.............Programs and libraries for browsing through UNIPEN files
   awk-parser/.....A syntax checker written in AWK
   c-parser/.......The same syntax checker, in C
   c-browser/......A browser in C for various Unix/X platforms
   lisp-browser/...A browser in Lisp for SUN Sparc
   uplib/..........A library of C routines to build UNIPEN applications
repository/........Newly contributed data:
   locked/.........to deposit your benchmark data, accessible only by NIST
   open/...........to deposit your tools, papers, and data which are not confidential
documents/.........Publications and documentation text files

B- UNIPEN 1.0 format
   -----------------

The UNIPEN format makes it possible to encode, in ASCII, handwriting data samples obtained with a sensitive pad which provides discretized pen trajectory information. It is designed for on-line handwriting recognition research and development applications. In contrast with binary formats, such as Jot, the UNIPEN format is not efficient for data storage or for real time data transmission and is not designed to handle ink manipulation applications involving colors, image rotations, rescaling, etc. However, it has data annotation capabilities to encode information about recording conditions, writers, segmentation, data layout, data quality, labeling and recognition results.

Our design efforts focus on making the format:
* human intelligible without documentation (keywords are explicit English words)
* easily machine readable
* compact (few keywords)
* complete (enough keywords)
* expandable

Format syntax
-------------

The format is a series of instructions, each consisting of a keyword followed by arguments:
-> Keywords are reserved words starting with a dot in the first column of a line.
-> Arguments are strings or numbers, separated indifferently by spaces, tabulations or new-lines. The arguments relative to a given keyword start after that keyword and end with the appearance of the next keyword or the end of file.

All variables are global: the value of a declared variable holds until the next similar declaration. Hence, simultaneous multiple definitions of the same variable are not possible.

In the format definition, almost everything is optional and the order of the instructions is flexible, so that simple data sets can be described in a simple way. However, for the present benchmark, we urge you to follow the data skeletons provided in appendix C so as to obtain a database which is well documented. Many instructions, otherwise optional, have been made mandatory for the benchmark.
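
The syntax rule above (a keyword is a dot in the first column, and its arguments run until the next keyword or the end of file) can be illustrated with a short Python sketch. This is not the AWK or C parser distributed at the FTP site; the function name split_instructions is hypothetical, and the pattern used to recognize keywords (a dot followed by uppercase letters, digits or underscores) is an assumption inferred from keywords such as .PEN_DOWN and .X_POINTS_PER_INCH. The sketch keeps the arguments of each instruction as one string and does not interpret labels or numbers.

```python
import re

# Assumption: keywords look like .VERSION, .PEN_DOWN, .X_POINTS_PER_INCH
KEYWORD_RE = re.compile(r"^\.[A-Z][A-Z0-9_]*")


def split_instructions(text):
    """Split UNIPEN-style text into (keyword, argument_text) pairs."""
    instructions = []
    keyword = None
    chunks = []
    for line in text.split("\n"):
        match = KEYWORD_RE.match(line)
        if match:
            # A dot in the first column closes the previous instruction.
            if keyword is not None:
                instructions.append((keyword, " ".join(chunks)))
            keyword = match.group(0)
            rest = line[match.end():].strip()
            chunks = [rest] if rest else []
        elif keyword is not None and line.strip():
            # Continuation line: more arguments for the current keyword.
            chunks.append(line.strip())
    if keyword is not None:
        instructions.append((keyword, " ".join(chunks)))
    return instructions


if __name__ == "__main__":
    sample = ".VERSION 1.0\n.COORD X Y\n.PEN_DOWN\n 10 20\n 11 22\n.PEN_UP\n"
    for keyword, arguments in split_instructions(sample):
        print(keyword, "->", repr(arguments))
```

Because variables are global, a reader built on top of such a splitter would simply overwrite the stored value each time the same keyword is declared again.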
Databases written in the UNIPEN format may optionally be organized in different files and directories, but all the data can also be concatenated into a single file. For the benchmark, we ask you to respect the directory organization described in appendix C.

Syntax checkers are provided at the FTP site of LDC in directories:
   pub/UNIPEN/tools/awk-parser
and
   pub/UNIPEN/tools/c-parser

Format definition
-----------------

The format definition can be retrieved at the FTP site of LDC from:
   pub/UNIPEN/definition/unipen.def

We exclude here keywords which have been defined to encode recognition results and are not relevant for the data donation phase of the benchmark.

UNIPEN 1.0 format, Copyright (c) 1994, Isabelle Guyon, AT&T Bell Laboratories

* Data types
  ----------

[N] Integer or decimal number represented by digits separated by a dot; may start with a sign; no commas allowed.
[S] String: any combination of keyboard ASCII symbols, except space, new-line, tabulations and words starting with a dot in the first column.
[F] Free text: a succession of strings separated by space, new-line and tab.
[R] Reserved string: a string which has a special meaning for the UNIPEN format, as defined in the reserved string glossary.
[L] Label: a string enclosed between double quotes which may contain spaces, new-lines or tabulations, all counted as spaces; the escape character is backslash; inside a label, double quotes should be replaced by \", backslash by \\, tabulations by \t and new-lines by \n.

Bug: The AWK parser does not accept labels consisting of a single space `` '', but it will accept ``\ ''. Spaces in between characters do not need to be preceded by a backslash.

In the format definition, we use the following conventions:
[.] Repeat the last type until a new type is indicated.
[+] Repeat all preceding types any number of times.

* Comments and declarations
  -------------------------

.COMMENT [F] Comments for human reading, to be ignored by the machine parser.
.INCLUDE [S] Name of a file to be included as header. * MANDATORY * at the top of every data file to indicate the documentation file and the lexicon file. ** String argument should not contain any path, the file is assumed to come from the "include" directory. ** No include file should contain another include file. .VERSION [N] Version number of the format (current version 1.0). * MANDATORY * at the top of every single file. * Data documentation ------------------ .DATA_SOURCE [S] Name of institution or person where the data came from. * MANDATORY * in the documentation file. .DATA_ID [S] Name of this database. * MANDATORY * in the documentation file. .DATA_CONTACT [F] Where to reach the person responsible to answer questions about the database. * MANDATORY * in the documentation file. .DATA_INFO [F] Nature and structure of the data. * MANDATORY * in the documentation file. .SETUP [F] Data collection recording conditions. * MANDATORY * in the documentation file. .PAD [F] Data collection device. * MANDATORY * in the documentation file. .COORD [R] [.] Declaration of the coordinates used in .PEN_DOWN and .PEN_UP components, a subset of: X, Y, T, P, Z, B, RHO, THETA, PHI, including at least X and Y. * MANDATORY * in the documentation file. .HIERARCHY [S] [.] Declaration of segmentation hierarchy used by .SEGMENT. Examples of arguments may be: TEXT, WORD, CHARACTER, STROKE, etc. The hierarchy for this benchmark is: .HIERARCHY TEXT WORD CHARACTER. * MANDATORY * in the documentation file. .ALPHABET [L] [.] Declaration of all characters used in data labels. In this version of the UNIPEN format, characters are restricted to English keyboard ASCII. * MANDATORY * in the documentation file. .ALPHABET_FREQ [N] [.] Natural frequencies of characters in the data (need not sum to one). * Lexicon ------- The following instructions are mandatory only if a lexicon is declared. One may choose not to declare a lexicon. The data must still be labeled, of course. 
.LEXICON_SOURCE [S] Name of institution or person where the lexicon came from. * MANDATORY * in the lexicon file. .LEXICON_ID [S] Name of the lexicon. * MANDATORY * in the lexicon file. .LEXICON_CONTACT [F] Where to reach the person responsible to answer questions about the lexicon. * MANDATORY * in the lexicon file. .LEXICON_INFO [F] Informations about the lexicon. * MANDATORY * in the lexicon file. .LEXICON [L] [.] Representative set of class labels found in the database, generally at the word or character segmentation level. * MANDATORY * in the lexicon file. .LEXICON_FREQ [N] [.] Frequencies of lexical entries defined by .LEXICON. Lexical frequencies characterize the distribution from which data samples were drawn at random. Therefore, the number of times a lexical entry appears in the database should be approximately proportional to the lexical frequencies. Normalizing such that all numbers sum to one is not necessary. * Data layout and unit system --------------------------- In the UNIPEN data files, the data layout is converted to a canonical form consisting of a simple bounding box optionally containing some guide lines. The bounding box may either be minimal or can be the entire pad. However, it is preferable to frame samples in a reasonable size bounding box to facilitate the task of the data visualization tools. ^ Y | X_DIM |<-----------------------------> |------------------------------+ | | ^ | | | | | | | Bounding box | | Y_DIM | | | | | | | | v O -----------------------------+-------> X Imposed coordinate system. ^ Y |------------------------------+ | | | | | | | Baseline | |- - - - - - - - - - - - - - - + | ^ | | | H_LINE | | v | O -----------------------------+--> X To indicate a baseline, use .H_LINE with one argument. 
^ Y |----------------------------------+ | Multiple baselines | |- - - - - - - - - - - - - - - - - + | ^ | | | | | | | |- - - - - - - - | - - - - - - - - + | ^ | | | | | H_LINE's | | | | | |- - - - - -|- - | - - - - - - - - + | ^ | | | | | | | | | v v v | O ---------------------------------+--> X To indicate multiple baselines, use .H_LINE with several arguments (3 in this example). ^ Y Boxes or combs |------+------+------+------+------+ |<---> | | | | | |<----------->| V_LINES's | | |<------------------>| | | |<-------------------------------->| |- - - | - - -|- - - | - - -|- - - + ^ | | | | | | | H_LINE O -----+------+------+------+------+-v---------> X To indicate vertical guide lines, use .V_LINE. To make a comb, a combination of baseline and several vertical guide lines can be specified. In this example, use .H_LINE with one argument and .V_LINE with 4 arguments. Because the UNIPEN data layout description is rudimentary, we recommend that you provide an engineering drawing of the data collection form in PostScript(tm) format. .X_DIM [N] Width of the bounding box, in pixels (using the resolution of the input device, not that of the display). * MANDATORY * in the documentation file. .Y_DIM [N] Height of the bounding box, in pixels. * MANDATORY * in the documentation file. .H_LINE [N] [.] Distance in pixels between the bottom of the bounding box and horizontal guidelines, such as a baseline. * MANDATORY * in the documentation file if one or several baselines are used. .V_LINE [N] [.] Distance in pixels between the left edge of the bounding box and vertical guidelines. * MANDATORY * in the documentation file if combs or vertical guidelines are used. .X_POINTS_PER_INCH [N] x resolution of the data collection device (1 inch ~ 2.5 cm). This is the resolution of the sensor, NOT that of the display. * MANDATORY * in the documentation file (or use .X_POINTS_PER_MM). .Y_POINTS_PER_INCH [N] y resolution of the data collection device. 
* MANDATORY * in the documentation file (or use .Y_POINTS_PER_MM). .Z_POINTS_PER_INCH [N] z (altitude) resolution of the data collection device. * MANDATORY * in the documentation file (or use .Z_POINTS_PER_MM). .X_POINTS_PER_MM [N] x resolution of the data collection device (in SI units). * MANDATORY * in the documentation file (or use .X_POINTS_PER_INCH). .Y_POINTS_PER_MM [N] y resolution of the data collection device. * MANDATORY * in the documentation file (or use .Y_POINTS_PER_INCH). .Z_POINTS_PER_MM [N] z (altitude) resolution of the data collection device. * MANDATORY * in the documentation file (or use .Z_POINTS_PER_INCH). .POINTS_PER_GRAM [N] Pressure resolution of the data collection device. * MANDATORY * in the documentation file if pressure coordinates are used. .POINTS_PER_SECOND [N] Sampling rate, * MANDATORY * in the documentation file if time stamps are not provided. * Pen trajectory -------------- Pen trajectory information is encoded with the .PEN_DOWN or .PEN_UP instructions, with repeated sequences of the coordinates declared with .COORD. We avoid using the term "stroke" which may refer to a unit of handwriting movement, with respect to a specific generation model. We use instead the term "component" for any continuous trace of the pen recorded by the digitizer. A pen-down component is a trace recorded when the pen is in contact with the surface of the digitizer. A pen-up component is a trace recorded when the pen is at proximity of the digitizer without touching it. Components are not necessarily delimited by pen-lifts and may or may not coincide with strokes. Thus, the instruction .PEN_DOWN does not indicate necessarily that the pen was lifted before. A sure pen-lift is indicated by a .PEN_UP (an empty component, if no data is recorded when the pen is in the up position). 
Since it is always preferable to provide data that is as raw as possible, if the data collection device does not provide pen-up/pen-down information, do not try to infer it; instead, add pressure or time stamp coordinates. Two successive .PEN_DOWN components not separated by a .PEN_UP indicate a finer segmentation that is not based on pen-lifts (for cursive writing, for instance). See also the notation [A[:M]]-[B[:N]],[C] used in .SEGMENT. The instruction .DT specifies the elapsed time between two components when only XY coordinates are recorded and when the sampling rate is regular. .PEN_DOWN [N] [.] Pen down component: repeated sequences of coordinates as defined by .COORD, * MANDATORY *. .PEN_UP [N] [.] Pen up component: same as .PEN_DOWN, but with the pen not touching the pad surface. * MANDATORY * only if the pen-lift information is available from the pad sensors. .DT [N] Elapsed time measured when pen coordinates are elided because the pen was immobile or out of proximity of the pad sensors. ** Very highly recommended. * Data annotations ---------------- It is important to be able to distinguish between writer dependent and writer independent applications. Thus one must keep track of the identity of the writers across training set and test set. For reasons of confidentiality, it is not always possible to use the full name of the writer as identification. Use the keyword .WRITER_ID to provide a unique writer identification. We suggest using first name and last name initials and a four-digit phone extension number (e.g. IG_7523). The person who is working at the data source and identified by .DATA_SOURCE is responsible for the bookkeeping of the writer identifications. Note that an ID consisting of a machine name and a date is not adequate since the same writer may come back to donate data at a later point in time. It is very highly recommended to provide all possible information about the writer, even if it is not mandatory.
.DATE [N] [N] [N] Date stamp: month, day, year. .STYLE [R] PRINTED, CURSIVE or MIXED. * MANDATORY *. .WRITER_ID [S] Unique writer identification. * MANDATORY *. .COUNTRY [S] Country of origin. .HAND [R] Writer hand: L for left, R for right. .AGE [N] Writer age, in years. .SEX [R] Writer sex: M for male, F for female. .SKILL [R] Skill of writer, familiarity with input device: BAD, OK or GOOD. .WRITER_INFO [F] Misc. information about writer. * Data sets and implicit component numbering ------------------------------------------ The database is divided into one or several data sets. Within a set, components starting either with .PEN_DOWN or .PEN_UP are implicitly numbered, starting from zero. Empty components are NOT counted. For this benchmark, ** sets will coincide with data files **, each one containing data from one writer. The instruction .START_SET is not needed. There will be one single component numbering for each file, starting from zero. .START_SET [S] Start a new set; the argument is the set name. The component counter is reset to zero. In the absence of .START_SET, the component counter is automatically reset to zero at the beginning of each file and the set name is the file name. .START_BOX [.] Instruction to the data visualisation program (browser) to clear the data bounding box. * Segmentation and labeling ------------------------- Component numbers are used by the .SEGMENT instruction to delineate sentences, words, characters, etc. with a [N]-[N],[N] expression. To indicate a finer segmentation containing broken components, one can use the more complex notation [A[:M]]-[B[:N]],[C] (see reserved string glossary). In addition to the hierarchical segmentation declared in .HIERARCHY, undeclared segmentation information may be introduced to encode various kinds of break points such as chapters, pages, blocks or lines (e.g. .SEGMENT PAGE 203-621). .SEGMENT [S] [R] [R] [L] Type of segment, its delineation, quality and label.
.SEGMENT is * MANDATORY * for the benchmark: it is needed to provide the transcripts or labels. -> First argument: type of segment, such as the ones declared in .HIERARCHY (e.g. TEXT, WORD, CHARACTER, etc.). -> Second argument: segment delineation by a [A[:M]]-[B[:N]],[C] expression (see reserved string glossary). Components are numbered, starting from zero, in order of appearance in the current set (which coincides for the benchmark with the entire data file of one writer). Empty components are NOT counted. -> Third argument: quality level, BAD for illegible, OK for regular, GOOD for superior, ? for unknown. -> Fourth argument: transcript or label (sentence, word, character, etc.) * Reserved string glossary ------------------------ [N]-[N],[N] Compact representation of a list of numbers, used by .LEXICON_SET, .SEGMENT, .REC_TIME, .REC_LABELS, .REC_SCORES. The list: 2, 3, 4, 5, 15, 9, 50, 51, 52, 53, 54, 55 would be represented as: 2-5,15,9,50-55. Commas allow non-contiguous numbers (useful for segmentation of delayed strokes, i dots and t bars). NO SPACES are allowed in the notation. [A[:M]]-[B[:N]],[C] More flexible representation to delineate segments of data by breaking components. A, B and C are component numbers; M and N are point numbers in the component. Both components and points are 0-based. M defaults to zero and N to the last point in the component. Thus the [N]-[N],[N] expressions are special cases of a [A[:M]]-[B[:N]],[C] expression. The example 1:40-3,5,6:0-6:12 delineates component 1 from point 40 to the end, all of components 2, 3 and 5, and component 6 from the beginning to point 12. NO SPACES are allowed in the notation. X X position of the pen on the pad surface, in units of X given by .X_POINTS_PER_INCH Y Y position of the pen on the pad surface, in units of Y given by .Y_POINTS_PER_INCH T Time in MILLISECONDS. P Pressure in units of P given by .POINTS_PER_GRAM.
Preferably use a linearized pressure in 1000 units per gram of force and calibrate the zero as the threshold at which the pen reaches the pad surface. Negative pressures can account for remaining non-linearities and hysteresis. Z Altitude above the pad surface, in units of Z given by .Z_POINTS_PER_INCH. BUTTON Barrel button states: 0, 1, ... RHO Rotational angle of the stylus, measured in degrees from some nominal orientation of the stylus (e.g. barrel button on top). The angle increases with clockwise rotation as seen from the rear end of the stylus. THETA XY angle of the stylus, measured in degrees, increasing from the X axis in the counter-clockwise direction. PHI Z angle of the stylus, increasing from the pad surface, in the positive Z direction. L Left handed. R Right handed. M Male. F Female. BAD Unskilled writer, illegible writing. OK Average quality writing, unambiguously legible. GOOD Superior quality writing, most recognizers should get it. ? Unknown. PRINTED Printed handwriting style. CURSIVE Cursive handwriting style. MIXED Mixed printed and cursive handwriting style. C- Data Skeletons -------------- The following data skeletons designed for the UNIPEN benchmark can be retrieved at the FTP site of LDC from directory: pub/UNIPEN/databases/examples/org/ In the example directory tree, replace ``org'' with the three-letter organization identification you received upon registration. Should you donate several databases, create sub-directories orgX, where X is a number between 0 and 9. Each database should be homogeneous and contain at least 12,000 characters, or the equivalent in words or sentences. Within a sub-directory orgX, each file must contain data from only one writer.
org/ | --------------------------------------------- | | | | | | README documents/ include/ org0/ org1/ org2/ documents: A directory containing relevant documents, including engineering drawings of the data collection forms, papers or technical reports published about the database or performance results using the database. All documents should be in PostScript(tm) format. include: A directory which should contain all include files. org0: Example of isolated character data set. org1: Example of isolated word data set. org2: Example of sentence data set. * File: org/include/org0.doc -------------------------- .VERSION 1.0 .COMMENT ###################################################################### ### SKELETON OF DOCUMENTATION FILE FOR ISOLATED CHARACTER DATABASE ### ###################################################################### organization_name is the full name of your organization. org0 is the identifier of this database, where org is the 3 letter identifier of your organization which was assigned to you upon registration. org0 is also the name of this file which is used in .INCLUDE. The version number is 1.0, please do not change it and conform to the UNIPEN 1.0 documentation. .DATA_SOURCE organization_name .DATA_ID org0 .COMMENT Use the .ALPHABET specified (the same for all the databases of the benchmark). It should be a superset of the character labels that you are using. Optionally use .ALPHABET_FREQ to specify character frequencies by replacing the 1's by appropriate values. .ALPHABET "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" "~" "!" "@" "#" "$" "%" "^" "&" "*" "(" ")" "-" "+" "=" "|" "\\" "/" "{" "}" "?" "[" "]" "\"" ":" "<" ">" "," "." 
";" "'" "`" "_" " " .ALPHABET_FREQ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 .COMMENT Use the .HIERARCHY CHARACTER for isolated characters as specified. This means that only the .SEGMENT type CHARACTER is allowed in data files which include this file. Character labels may be any label from .ALPHABET. .HIERARCHY CHARACTER .COMMENT ##################### ### DOCUMENTATION ### ##################### Please fill up the blanks and add any other relevant information. .DATA_CONTACT Name: Affiliation: Address: Phone: Fax: Email: .SETUP Site: Time: Writer population: Writer motivation: paid/volunteer/forced Writer physical position: Instructions given to the writer: Duration of one session: Recognizer feedback: yes/no Form layout: .DATA_INFO Set type: training or test ^^^^^^^^ Alphabet: English keyboard alphanumerics and symbols from the ASCII set Lexicon: Label distribution: Quantity: Quality: cleaning/no cleaning Number of Writer(s): Writing Style: printed/cursive/mixed Segmentation: .PAD Machine name: Brand: Type: Serial Nr.: Sensor: Pen: Driver: Sampling mode: Sampling rate: Resolution: Accuracy: Width: Height: Display: Inking: .COMMENT ##################### ### DATA LAYOUT ### ##################### Please replace numbers by adequate values. .X_DIM and .Y_DIM define the bounding box. .H_LINE is the baseline which may be optionally specified. All characters must fit in the same bounding box. The UNIPEN coordinate system must be strictly observed (0 at the lower left corner, X axis pointing to the right, Y axis pointing to the top). .X_DIM 75 .Y_DIM 160 .H_LINE 75 .COMMENT ##################### ### UNIT SYSTEM ### ##################### Please replace 000 by adequate values. .X_POINTS_PER_INCH and .Y_POINTS_PER_INCH may be replaced by .X_POINTS_PER_MM and .Y_POINTS_PER_MM. 
.COORD must have at least X and Y, but may also include P (pressure) and/or T (time), the altitude above the pad Z or even angle information RHO, THETA, PHI, if available (see UNIPEN documentation). If you use pressure information, conform to the UNIPEN documentation and use .POINTS_PER_GRAM to specify the resolution. .X_POINTS_PER_INCH 000 .Y_POINTS_PER_INCH 000 .POINTS_PER_SECOND 000 .COORD X Y * File: org/include/org1.doc -------------------------- This file is similar to org/include/org0.doc, except that org0 is replaced by org1 and .HIERARCHY CHARACTER by: .HIERARCHY WORD .COMMENT Use .HIERARCHY WORD for isolated words that are not segmented into characters and .HIERARCHY WORD CHARACTER if segments of both .SEGMENT types WORD and CHARACTER appear in the data file in which this file is included. Words may contain any characters from .ALPHABET except white space. * File: org/include/org2.doc -------------------------- This file is also similar to org/include/org0.doc, except that org0 is replaced by org2 and .HIERARCHY CHARACTER by: .HIERARCHY TEXT WORD .COMMENT Use .HIERARCHY TEXT for unsegmented text. Use .HIERARCHY TEXT WORD if both .SEGMENT types TEXT and WORD appear in the data file in which this file is included. Use .HIERARCHY TEXT WORD CHARACTER if all three types are used. The text labels may contain any characters from .ALPHABET. If there are restrictions on the character set, please specify them with .ALPHABET_SET in the data file directly. Notice also that multiple baselines are defined in that file with: .H_LINE 450 1900 2300 * File: org/include/phrase.lex ---------------------------- .VERSION 1.0 .COMMENT ####################################### ### SKELETON OF UNSEGMENTED LEXICON ### ####################################### We leave the option to provide an unsegmented lexicon, e.g. a list of sentences or phrases from which the data was drawn, or an entire text.
It is also possible to provide a segmented lexicon (a list of words) for a text database. organization-name is the full name of your organization. phrase is here the name of that lexicon (also the name of the file) used to identify the lexicon in .INCLUDE. .LEXICON_SOURCE organization-name .LEXICON_ID phrase .COMMENT Please fill up the blanks and add any other relevant information .LEXICON_CONTACT Name: Affiliation: Address: Phone: Fax: Email: .LEXICON_INFO Source: Purpose: Language: Label distribution: .COMMENT Please provide all labels between "". Do not forget that the label quote = \" and the label backslash = \\. .LEXICON "Thomas N. Zinfindale, 6877 Greensburg" "Drive, Phoenix, MN 34132. Hanna Q." "Allen, 9887 County Drive, Phoenix, ME" "39153. Bruce C. Prescott, 8138 South" "Court, Columbus, UT 56029. Doris O." "Davies, 8152 Bluedale Drive," "Knoxville, SC 12277. Sally Y." "Hinricks, 6837 Lake Drive, Boise, RI" "81516. Hank I. Samuals, 2685 Lake" "Avenue, Madison, WA 03915. Frank O." "Allen, 6216 Bluedale Street, Topeka," "South Terrace, Boise, KY 29445." * File: org/include/word.lex -------------------------- .VERSION 1.0 .COMMENT ################################ ### SKELETON OF WORD LEXICON ### ################################ organization-name is the full name of your organization. word is here the name of that lexicon (also the name of the file) used to identify the lexicon in .INCLUDE. .LEXICON_SOURCE organization-name .LEXICON_ID word .COMMENT Please fill up the blanks and add any other relevant information .LEXICON_CONTACT Name: Affiliation: Address: Phone: Fax: Email: .LEXICON_INFO Source: Purpose: Language: Label distribution: .COMMENT Please provide all labels between "". Do not forget that the label quote = \" and the label backslash = \\. Optionally use .LEXICON_FREQ to provide word frequency information, as computed for instance from a corpus of text. 
.LEXICON "the" "of" "to" "and" "a" "in" "said" "that" "for" "is" "was" "on" "with" "by" "as" "he" "it" .LEXICON_FREQ 11000000 5824125 4958348 4683121 4167577 4053021 1924978 1889885 1796491 1557596 1496299 1362125 1137819 1060304 977732 965713 920947 * File: org/org0/zarbi.dat ------------------------ .VERSION 1.0 .COMMENT ############################################################# ### SKELETON OF DATA FILE FOR ISOLATED CHARACTER DATABASE ### ############################################################# Include the documentation file (mandatory). Included files reside in the include/ directory and shall come with no path. No file should be included in an included file. zarbi is the writer identifier, also the name of this file. There should be one file for each writer. .INCLUDE org0.doc .WRITER_ID zarbi .COMMENT Replace zeros or question marks by appropriate values (see UNIPEN documentation). All these entries are optional. .DATE 00 00 00 .HAND ? .SKILL ? .COUNTRY ? .AGE 00 .SEX ? .COMMENT Complete .WRITER_INFO with any other information that has not yet been itemized. .WRITER_INFO First name: Last name: Native language: Other languages: Place of Birth: Where educated: Place of Residence: Computer experience: Occupation: Vision: .COMMENT These are examples of characters borrowed from samples of the IBM database. Only X and Y coordinates were used: no time coordinate (therefore a regular sampling rate is assumed), no .DT, no pressure coordinate and no pen up strokes, no angle information. Consult the UNIPEN documentation to encode more complex examples. .SEGMENT CHARACTER 0 OK "P" .PEN_DOWN 23 114 22 116 22 118 etc. 23 97 27 93 32 90 .PEN_UP .SEGMENT CHARACTER 1 OK "e" .PEN_DOWN 19 88 18 89 17 89 etc. 57 78 59 77 59 76 .PEN_UP .SEGMENT CHARACTER 2-3 OK "j" .PEN_DOWN 34 100 33 100 33 100 etc. 9 53 10 53 13 54 .PEN_UP .PEN_DOWN 27 116 27 116 27 116 etc.
32 108 34 106 34 105 .PEN_UP * File: org/org1/tobito.dat ------------------------- .VERSION 1.0 .COMMENT ############################################### ### SKELETON OF DATA FILE FOR WORD DATABASE ### ############################################### Include the documentation file (mandatory), and the lexicon (optional). Included files reside in the include/ directory and shall come with no path. No file should be included in an included file. tobito is the writer identifier, also the name of this file. There should be one file for each writer. Replace .STYLE PRINTED by the appropriate information (PRINTED, CURSIVE or MIXED). .INCLUDE org1.doc .INCLUDE word.lex .WRITER_ID tobito .STYLE PRINTED .COMMENT Replace zeros or question marks by appropriate values (see UNIPEN documentation). All these entries are optional .DATE 00 00 00 .HAND ? .SKILL ? .COUNTRY ? .AGE 00 .SEX ? .COMMENT Complete .WRITER_INFO with any other information that has not yet been itemized. .WRITER_INFO First name: Last name: Native language: Other languages: Place of Birth: Where educated: Place of Residence: Computer experience: Occupation: Vision: .COMMENT These are examples of characters borrowed from samples of the HP database. Only X and Y coordinates were used: no time coordinate (therefore a regular sampling rate is assumed), no pressure coordinate and no pen up strokes, no angle information. Consult the UNIPEN documentation to encode more complex examples. .SEGMENT WORD 0-11 ? "Tattersall" .PEN_DOWN 123 122 123 122 123 122 etc. 117 4 117 4 120 5 .PEN_UP .DT 100 .PEN_DOWN 297 151 297 151 296 151 etc. 31 79 17 75 10 72 .PEN_UP .DT 100 .PEN_DOWN 240 53 233 56 224 56 etc. 240 11 243 5 247 0 .PEN_UP .DT 100 .PEN_DOWN 307 109 308 100 311 81 etc. etc. 
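The delineation 0-11 in the .SEGMENT WORD instruction above uses the compact [N]-[N],[N] notation described in the reserved string glossary. As a rough illustration only, the following Python sketch expands such an expression into the list of component numbers it denotes. It is not part of the UNIPEN tools: the function name expand_component_list is hypothetical, and the more flexible [A[:M]]-[B[:N]],[C] form with point numbers is deliberately not handled.

```python
def expand_component_list(expr):
    """Expand a compact [N]-[N],[N] expression into a list of integers.

    Hypothetical helper for illustration; it handles only plain component
    numbers and ranges, not the A:M point-number form.
    """
    # The format documentation states that no spaces are allowed.
    if " " in expr:
        raise ValueError("no spaces are allowed in the notation")
    numbers = []
    for item in expr.split(","):
        first, dash, last = item.partition("-")
        if dash:
            # A range such as 2-5 is inclusive at both ends.
            numbers.extend(range(int(first), int(last) + 1))
        else:
            numbers.append(int(first))
    return numbers


if __name__ == "__main__":
    # Example taken from the reserved string glossary.
    print(expand_component_list("2-5,15,9,50-55"))
```

The order of the items is preserved rather than sorted, since the glossary example itself lists 15 before 9.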
* File: org/org2/camaru.dat ------------------------- .VERSION 1.0 .COMMENT ############################################### ### SKELETON OF DATA FILE FOR TEXT DATABASE ### ############################################### Include the documentation file (mandatory), and the lexicon (optional). Included files reside in the include/ directory and shall come with no path. No file should be included in an included file. camaru is the writer identifier, also the name of this file. There should be one file for each writer. Replace .STYLE MIXED by the appropriate information (PRINTED, CURSIVE or MIXED). .INCLUDE org2.doc .INCLUDE phrase.lex .WRITER_ID camaru .STYLE MIXED .COMMENT Replace zeros or question marks by appropriate values (see UNIPEN documentation). All these entries are optional .DATE 00 00 00 .HAND ? .SKILL ? .COUNTRY ? .AGE 00 .SEX ? .COMMENT Complete .WRITER_INFO with any other information that has not yet been itemized. .WRITER_INFO First name: Last name: Native language: Other languages: Place of Birth: Where educated: Place of Residence: Computer experience: Occupation: Vision: .COMMENT These are examples of characters borrowed from samples of the ATT database. Only X and Y coordinates were used: no time coordinate (therefore a regular sampling rate is assumed), no pressure coordinate and no pen up strokes, no angle information. Consult the UNIPEN documentation to encode more complex examples. In this example, .START_BOX is used, but this is optional. Also, .SEGMENT PAGE provides additional optional segmentation information. The only segments that will be taken into account for the benchmark are TEXT, WORD or CHARACTER (or a selection of those, as precised in .HIERARCHY). .START_BOX .SEGMENT PAGE 0-58 .SEGMENT TEXT 0-29 ? "that nothing more happened ," .SEGMENT WORD 0-6 ? "that" .PEN_DOWN 707 2417 707 2424 706 2429 etc. 590 2319 590 2319 590 2319 .PEN_UP .DT 151 .PEN_DOWN 588 2377 586 2377 586 2377 etc. etc. .PEN_UP .SEGMENT WORD 7-15 ? 
"nothing" .PEN_DOWN 1778 2366 1778 2366 1776 2367 etc. etc. .PEN_UP .SEGMENT WORD 16-19 ? "more" .PEN_DOWN 3048 2355 3046 2357 3045 2358 etc. etc. .PEN_UP .SEGMENT WORD 20-28 ? "happened" .PEN_DOWN 417 2019 417 2019 417 2021 etc. etc. .PEN_UP .SEGMENT TEXT 30-58 ? "that nothing more happened ," .PEN_DOWN 347 525 347 532 347 537 etc. etc. .PEN_UP .START_BOX .SEGMENT PAGE 59-102 .SEGMENT TEXT 59-80 ? "she decided on going" .SEGMENT WORD 59-61 ? "she" .PEN_DOWN 325 2328 325 2329 325 2330 etc. etc. D- Tips for collecting data ------------------------ If you are planning to collect data for the UNIPEN benchmark, we invite you to read these recommendations which might save you a lot of trouble and ensure a better quality of the data: * Golden rules ------------ -> Record all informations about recording conditions, hardware, software and writer background at the time of the data collection. Annotations made a posteriori have higher chances of being wrong or incomplete. -> Collect data rich in information: with the highest possible spacial resolution and sampling rate, with as little preprocessing or filtering as possible, with time stamps, pressure, ``a proximity'' strokes if available. -> Keep the instructions to data donators simple. Before starting the data collection, test your setup on ``naive'' subjects that have not been part of the design team. -> Before collecting large amounts of data, test your entire system by collecting a small sample. Run it through the UNIPEN parser and visualize some examples with a UNIPEN data visualization program (browser). Such tools are available to you at the FTP site of LDC. * Hardware -------- Use the best possible hardware you can afford. For instance, use pads which display the "electronic ink" rather than opaque tablets. 
The Wacom PL 100-V (integrated digitizer and LCD display) is among the best pads available: - display:VGA 640x480 16 grey-level backlit STN LCD - digitizer: .05 mm/point maximum resolution, 192x144 mm active area - maximum sample rate: 205 points per second - communications: unit comes with a PC-AT bus i/f card that handles video and serial i/o to the tablet. - approximate price: $1500.00 - delivery time: in stock generally - order from: WACOM CORP Vancouver, Washington, USA (1-800-922-6635) We also recommend the KVTS-5 VideoTablet: - display: VGA 640x480 64 grey-level backlit STN LCD - digitizer: .025 mm/point maximum resolution, 196x148 mm active area - maximum sample rate: 135 points per second - price: $1495.00 - delivery time: 2-3 weeks - order from: KURTA Phoenix, Arizona, USA (1-800-445-8782) Avoid Calcomp 2300, 2500 opaque tablets (random noise and high tilt sensitivity) and cheap (< $500) tablets intended for drawing rather than for handwriting input. In general, beware of large screen format capacitive/resistive "membrane" digitizer devices. Users tend to rest parts of the hand/fingers on the writing surface at times, which can lead to spurious information in the tablet data stream. Electro-magnetic digitizers (ie., Wacom/Kurta) are a better choice. Some digitizer units use pens that require batteries to power them. As the batteries discharge, pen performance can become erratic without warning... Moral: keep an eye on the batteries used by the pen device. * Driver ------ There is a choice of commercial drivers which come with the notepad computer or the operating system. A widely distributed software is Microsoft Windows for Pen Computing V1.0. We recommend to: - use the WFP SDK function GetPenHwData for data closest to the tablet: maximum sample rate, no unwanted preprocessing. - avoid using coordinates generated by WM MOUSE MOVE events. 
It is better to collect the original ``raw'' data: it is always possible to revert a posteriori to "mouse-type" spatial sampling. You may also want to write your own driver to read points coming from the pad. In this case, make sure that you are not dropping points or losing track of the time stamps. It is indeed quite possible to produce drivers which are better than the ones provided commercially! Note that design considerations for some tablet drivers originate from the crosshair puck used in CAD, where the requirements are different from handwriting (speed is less important in CAD). General driver requirements: - Create an interrupt-based driver, do not use polling! - There are three types of basic functions: INIT SETTINGS, (send characters to the tablet to select data format, sampling mode, sampling rate, resolution etc.) START SAMPLING, (send command to start the sampling) STOP SAMPLING. (send command to stop the sampling) Allow for the checking and controlling of at least the sampling rate parameter. Make sure the storing of coordinates has a far higher priority than the graphical inking, but redesign your driver if there is a noticeable inking delay. Some writers get more upset than others about inking delay. The inking should be (almost) immediate. Try a faster computer (486 instead of 386, Quadra instead of MAC II) if it does not work. - Use the highest possible sampling rate (it is always possible to perform resampling at a lower rate in software), and not less than 100 points per second. - Reduce as much as possible, or eliminate completely, smoothing and low-pass filters: otherwise the details of the dynamics of the pen may be irremediably lost. It has been observed with a commercial driver that, for fast writers, a capital B and a capital D are indistinguishable. Also, prevent the use of dehooking or other preprocessing heuristics. This functionality is part of a recognition system, not of the driver. - Document the features of your driver in .PAD.
In particular, indicate the host machine, the operating system, the estimated current load, the sampling mode (interrupt driven or polling), the number and size of the buffers on the signal path from the tablet to the disk. * Annotations ----------- The data should be annotated with informations about recording conditions, hardware, software, and writer background (see UNIPEN documentation and examples for details). Informations should be recorded at the time of the data collection. Annotations a posteriori have a higher chance of being wrong or incomplete. Design a computer survey form which will be filled at each data collection session. Record the date, the machine and pad serial numbers (useful for tracking hardware failures). Examples of good/bad annotations (thanks to all who provided examples): ---------------------------------------------------------------------------- .PAD Machine name: WACOM PL100-V Brand: Wacom Type: PL-100V Serial Nr.: 2JPJB0044 Sensor: Electromagnetic, wireless pen Pen: Untethered Pen, tip switch and side button. Tip switch used for PenUp/PenDown determination. Driver: Custom-made routine, calling GetPenHwData, under MS Windows for Pen Computing V1.0 (PENWIN.DLL), Sampling mode: Equidistant in time, during PenUp & PenDown (continuous sampling) Sample rate: 100 [Hz] Resolution: 0.02 [mm/unit] Accuracy: 0.1 [mm] Width: 192 [mm] Height: 143 [mm] Display: Wacom PL-100V, backlit LCD screen (VGA, 640x480) for the writer and a 15" CRT monitor for the supervisor. The writer only looked at the PL-100V Inking: Graphically emulated; black on white; 1 pixel thick. ---------------------------------------------------------------------------- .PAD sub fields and their values as in received files. 
Machine name: Wacom SD-510C Machine name: Scriptel Penwriter Machine name: Wacom HD-648A Machine name: IBM ThinkPad 700T Machine name: Calcomp 2500 Machine name: NCR 3125 pen computer Machine name: Philips Advanced Interactive Display (PAID) prototype "notebook" Machine name: CIC-MacHandwriter (CalComp digitizer c/to a Macintosh IICI) Note: a distinction between brand/type/serial number is necessary (see above). ---------------------------------------------------------------------------- Display: Apple monitor Matrix 640 x 480 points black ink on white background Display: backlit liquid crystal display (LCD) Matrix 640 x 480 points VGA standard, black on white Display: translucent type blue mode (STN) with backlight Matrix 640 x 480 points Display: LCD with backlight Matrix 640x480 points, VGA compatible Display: VAXstation 3100 15" monochrome screen for inspection Display: lcd display VGA resolution (640x480) black ink on white background Display: 640x480 (VGA) 16 gray levels Display: Macintosh's screen. Resolution: 72 pixels per inch. Ink: black on white; 1 pixel thick. Note: physical size, resolution and inking information are needed. That's why there is an Inking: subfield proposed (see above). ---------------------------------------------------------------------------- Pen: Untethered pen with tip switch Pen: Tethered pen with side button Pen: Untethered pen with side button Pen: Standard non-stroke with tip and barrel switches Pen: Ballpoint refill pen with home-made axial pen force transducer Note: also tell what determines your PenUp/PenDown (tip switch or side button etc) ---------------------------------------------------------------------------- Sensor: Electromagnetic give and receive action Sensor: Philips confidential Sensor: Electromagnetic resonance sensor (Sensor: Resistive Sensor: Capacitive) Note: also tell whether the pen is wireless. 
---------------------------------------------------------------------------- Driver: Home made (this is too vague) Driver: MS-mouse-like (better, but not too informative) Driver: UCL-made MAC driver and interface. Time-equidistant sampling between .PEN_DOWN and .PEN_UP. (still better) Driver: subfield is very important and says even more about the data than the Machine name:, Brand: or Type: subfields. ---------------------------------------------------------------------------- * Data donators ------------- Try to balance the writer population left/right handed and female/male and collect samples from many different writers. Since many factors that are not itemized by special UNIPEN keywords may influence the handwriting quality, keep a detailed description of the recording conditions in .SETUP and .WRITER_INFO, including: - the writer's familiarity with the data collection device - whether the writer was paid or volunteered - the duration of the data collection session - the position of the writer (standing, sitting at a desk) - vision impairments - the writer's familiarity with the language In most data collections, you will want the "built-in" recognizer to be ``turned off''. Writers do adapt quite fast to the recognizer and keeping the recognizer active will bias the data. If this is your intention, please clearly indicate it in the .SETUP. The instructions to data donators should be short and simple: it should be "obvious" what to do. Before starting to use your data collection software, test it on naive potential donators who have not been part of the design team. Avoid imposing counter intuitive constraints on the writer which are likely to introduce errors in the data (e.g. avoid "Do not try to correct any mistakes", "Do not wrap text from one line to the next"). The software should be robust with respect to donators' mistakes by either recovering from them or warning and asking to rewrite (e.g. if the writer goes out of the bounding box). 

Let data donators get familiar with the device before starting to record.

REMINDER: You are responsible for the book-keeping of writer
identifications. Each writer should have a unique .WRITER_ID so that it is
possible to track whether data from the same writer is both in the training
set and in the test set. An ID consisting of a machine name and a date is
not adequate since the same writer may come back to donate data at a later
point in time.

* Data layout to facilitate segmentation
  --------------------------------------

Every effort should be made to minimize or eliminate labeling errors at the
source, in order to avoid the painful data cleaning process. The challenge
is to make sure that the right ink is attributed to the right segment.

If your pad is equipped with a display, make good use of it:
- Keep the page simple with as few buttons, text, boxes as possible.
  Provide only a help button which opens an explanation screen and one
  clear button for each writing area, so that the writer can selectively
  erase and rewrite bad entries.
- Give feedback to the writer. For instance, give warnings when the pen
  goes out of the bounding box (beep + flashing light). Then erase whatever
  was in that box to force the writer to rewrite within the bounding box.

General recommendations:
- Clearly separate the writing boxes by wide dark margins.
- Preferably always use a baseline in each writing field.
- Make sure that there is enough space in each writing box to write the
  entire truth label string.
- Make sure that the geometry of the guidelines or combs is comfortable for
  the writer.
- Each data item corresponding to one truth label string should have its
  own writing box. The label should be associated without ambiguity to that
  box (e.g. in a corner of the box itself).
- Ambiguous truth labels should be explained (e.g. ' = quote and , = comma)
- Collect the ink from the entire screen.
  When the page is totally finished, proceed to segment it to attribute
  strokes to their legal writing area. This should take care of the problem
  of delayed corrections.
- Keep a record in UNIPEN format of the layout geometry, including the
  position of baselines, boxes and combs.
- Keep an engineering drawing of the layout geometry with dimensions
  expressed in the same units as the UNIPEN records.

* Truth labels
  ------------

Report what your intention was in choosing the truth labels in .DATA_INFO.
Here are general guidelines which may be violated depending on the target
application:
- Select truth labels that cover the entire character set and try to
  balance character frequencies.
- Select a variety of sources which present different difficulties (e.g.
  literature texts, advertisements, scientific abstracts, computer
  programs, computer manual pages, daily news, addresses).
- Do not always repeat the same set of words/sentences for each writer but
  rather draw fresh words/sentences to get more coverage.
- If you collect isolated characters, do not always present them in the
  same order.

* Data cleaning
  -------------

If, in spite of all your efforts, your database still contains errors,
proceed to clean it with CAUTION:
- It is harmful to lose the original data labeling. It is better to store
  several sets of .SEGMENT entries for the same data file. To do so, if
  "zarbi.dat" is your data file, create several files "zarbi.sg0",
  "zarbi.sg1", etc. containing the same header information as the data
  file, but no data, just the segmentation information.
- Always keep the original labels in a "zarbi.sg0" file.
- Do not eliminate bad patterns; mark them with BAD in the quality field of
  .SEGMENT.
- If you use several judges to clean the data, keep separate records of
  their decisions by creating separate "zarbi.sg" files.
- Document how you cleaned the data in .DATA_INFO.
  In particular, say if you used your own recognizer to help you clean the
  data by pointing out suspicious patterns.

E- IAPR
   ----

IAPR is an international association of non-profit scientific or
professional organizations (being national, multinational or international
in scope) concerned with pattern recognition, computer vision and image
processing in a broad sense.

The major event in the IAPR calendar is the biennial International
Conference, which attracts scientists, researchers and practitioners in the
field from all over the world; here they are able to listen, to learn, to
educate and to exchange ideas with their colleagues.

IAPR has established a number of Technical Committees (working groups with
only a single level) to promote technical specialities and to further the
objectives of the Association. Each committee has a distinct purpose and
scope and is under the responsibility of a chairman appointed by the IAPR
President. These committees organize regular or special meetings and
workshops as well as collaborative activities with other organizations.

The goal of Technical Committee 11 (Chairman: Professor Rejean Plamondon)
is to promote the exchange of information on research projects dealing with
text processing using both static and dynamic methods for printed or
handwritten document understanding. The major areas of research of concern
to this committee are: on-line and off-line character recognition, document
processing and understanding, cursive script recognition, automatic
signature verification, modeling of handwriting generation and perception.

Professor Rejean Plamondon
Ecole Polytechnique de Montreal
C.P. 6079, Succ. "A"
Montreal, QC H3C 3A7, Canada
Phone: +1 (514) 340-4539
Fax: +1 (514) 340-4147
Email: ha03@music.mus.polymtl.ca

F- LDC
   ---

The Linguistic Data Consortium (LDC) was founded in 1992 to provide a new
mechanism for large scale development and widespread sharing of resources
for research in linguistic technologies.
Based at the University of Pennsylvania, the LDC is a broadly based
consortium that now includes about 65 companies, universities, and
government agencies. An initial grant of $5 million from ARPA amplifies the
effect of contributions (both of money and data) from this broad membership
base, so that there is guaranteed to be far more data than any member could
afford to produce individually. In addition to distributing
previously-created databases, and funding or co-funding the development of
new ones, the LDC has helped researchers in several countries to publish
and distribute databases that would not otherwise have been released.

Dr. Mark Liberman (myl@sansom.ling.upenn.edu), the director of LDC, is
Trustee Professor of Phonetics at the University of Pennsylvania. Dr. John
Godfrey (jgodfrey@unagi.cis.upenn.edu), the Executive Director for the LDC,
is on loan to the LDC from Texas Instruments, where he was Project Manager
and Principal Investigator for the Speech Corpus Collection Contract.

The LDC now offers 20 databases, comprising 67 CD-ROMs. These databases
include well known speech corpora (DARPA Resource Management, TIMIT,
SWITCHBOARD, Air Traffic Control) and text corpora (AP-news, Brown corpus,
Wall Street Journal, Canadian Parliament Debates, various dictionaries).
Another 40 CD-ROMs are in production, and over 100 more are in development
or planning.

Linguistic Data Consortium
Room 441, Williams Hall
University of Pennsylvania
Philadelphia, PA 19104-6305
Phone: +1 (215) 898-0464
Fax: +1 (215) 573-2175
Email: ldc@linc.cis.upenn.edu

G- NIST
   ----

The US National Institute of Standards and Technology (NIST) is an agency
of the Department of Commerce established by Congress "to assist industry
in the development of technology". The Institute's Computer Systems
Laboratory (CSL) contributes toward that goal through cooperation with
industry with a focus on data, test methods, measurements, and standards.
With regard to recognition systems, NIST is interested in the evaluations
of applications which (1) have broad user applicability in a specific
domain of application, (2) have specific evaluation criteria which can be
developed and are useful to both users and system developers, and (3) have
a large number of system developers willing to participate in the
evaluations at their own expense.

Stan Janet of the CSL's Image Recognition Group will coordinate and perform
UNIPEN scoring. This will be done using software written for recognition
system evaluation and utilized internally by the IRS and in two OCR
conferences for the Bureau of the Census.

Stan Janet
National Institute of Standards and Technology
Bldg. 225 Rm. A-216
Gaithersburg, MD 20899
Phone: +1 (301) 975-2919
Fax: +1 (301) 840-1357
Email: stan@magi.ncsl.nist.gov

* Acknowledgements

We would like to thank all the people who participated in the earlier
phases of the UNIPEN project and those who helped prepare this document,
and in particular:
Gerben Abbink (NICI), Tina Allen (AT&T-GIS), Sandy Benett (Apple),
Thomas Breuel (IDIAP), Hans Dolfing (Philips), Nick Flann (Utah State
Univ.), Dan Flickinger (HP), Mark Foster (LDC), Tetsu Fujisaki (IBM),
Tom Fuller (Washington Univ.), Jon Geist (NIST), Philippe Gentric
(Philips), John Godfrey (LDC), Don Henderson (AT&T Bell Labs),
Elizabeth Hodas (LDC), Jonathan Hull (CEDAR), Wes Hunter (AT&T-GIS),
Kwon Jae-Ook (Korea AIST), Leonid Kitainik (Paragraph Intl.), Shinn Lee
(Intel), Richard Lyon (Apple), John Makhoul (BBN), George Mills (Apple),
Caroline Moran (Microsoft), Ronjon Nag (Lexicus), Craig Nohl (AT&T Bell
Labs), John Ostrem (CIC), Alexander Pashintsev (Paragraph Intl.),
Robert Powalka (NTU), Dave Reynolds (HP), Sung Rhee (Microsoft),
Eric Ristad (Princeton Univ.), Nasser Sherkat (NTU), Bill Stafford
(Apple), Hans-Leo Teulings (Arizona State Univ.), Bob Vollmer (IBM),
William Weideman (Scribe-Tek), Charles Wilson (NIST), Larry Yaeger
(Apple).