From stan@magi.nist.GOV Thu Aug 10 03:43 MET 1995
Subject: UNIPEN: initial release of training data is ready
To: UNIPEN participants
Organization: National Institute of Standards and Technology, Gaithersburg, MD

Documentation for UNIPEN Training Data Release "train_r01_v01"

1. INTRODUCTION
===============

The initial release of UNIPEN training data is ready for downloading
from the anonymous FTP site, ftp.cis.upenn.edu.

This release consists of one-third of the samples from one-fifth of
the writers from 5 of the 50+ donated data sets (from almost 40
different sources). The rest of the training data will follow as soon
as I know exactly how much is needed for all of the partitioned test
sets to be statistically significant, and after a few miscellaneous
loose ends are tied up (e.g. six participants either owe missing
information or are fixing at least some of their data).

This release contains segments of each of the three types, CHARACTER,
WORD, and TEXT, from at least two different sources. The data sets
comprising this release are:

    atu cea not syn val

Due to time and disk space constraints (it takes several minutes
simply to use "grep" to search for pattern matches in the larger data
sets), the other clean data sets are still in the queue for now.

To avoid confusion when the data in this release is supplemented and
possibly revised, you should be prepared to delete it. It would
therefore not be a good idea to copy this first release all over your
organization.

Feedback is welcome, and prompt reporting of any mistakes is highly
encouraged. Questions are welcome too, as I'm sure there will be
several items that need to be cleared up.

2. ORGANIZATION
===============

The files are stored in a tar-format archive file. When restored from
the archive, the top-level directory will be `train_r01_v01'. It
contains directories `include' and `data'.

As an example, these are the files derived from those contributed by
CEDAR (from their first data set "cea") that had samples randomly
selected at this point for the training set:

    include/cea
    include/cea/data
    include/cea/data/Asundi.dat
    include/cea/data/Kaeslingk.dat
    include/cea/data/Ramaswamy.dat
    include/cea/data/Toepfer.dat
    include/cea/ced0.doc
    include/cea/ced0.lex
    data/1a/cea
    data/1a/cea/Asundi.dat
    data/1a/cea/Ramaswamy.dat
    data/1b/cea
    data/1b/cea/Asundi.dat
    data/1b/cea/Kaeslingk.dat
    data/1b/cea/Ramaswamy.dat
    data/1b/cea/Toepfer.dat
    data/1c/cea
    data/1c/cea/Asundi.dat
    data/1c/cea/Kaeslingk.dat
    data/1c/cea/Ramaswamy.dat
    data/1c/cea/Toepfer.dat
    data/1d/cea
    data/1d/cea/Asundi.dat
    data/1d/cea/Kaeslingk.dat
    data/1d/cea/Ramaswamy.dat
    data/1d/cea/Toepfer.dat
    data/2/cea
    data/2/cea/Asundi.dat
    data/2/cea/Kaeslingk.dat
    data/2/cea/Ramaswamy.dat
    data/2/cea/Toepfer.dat
    data/3/cea
    data/3/cea/Asundi.dat
    data/3/cea/Kaeslingk.dat
    data/3/cea/Ramaswamy.dat
    data/3/cea/Toepfer.dat
    data/6/cea
    data/6/cea/Asundi.dat
    data/6/cea/Kaeslingk.dat
    data/6/cea/Ramaswamy.dat
    data/6/cea/Toepfer.dat
    data/7/cea
    data/7/cea/Asundi.dat
    data/7/cea/Kaeslingk.dat
    data/7/cea/Ramaswamy.dat
    data/7/cea/Toepfer.dat
    data/8/cea
    data/8/cea/Asundi.dat
    data/8/cea/Kaeslingk.dat
    data/8/cea/Ramaswamy.dat
    data/8/cea/Toepfer.dat

2.1 Include (".INCLUDE") Files

The subdirectories in `include' are named with the three-character ID
of the data set from which the files came. Any files in that directory
are the original documentation files that were referenced by .INCLUDE
lines. Also in that directory is a subdirectory "data". Its structure
mirrors that of the original data files, and it holds the files that
contain the pen data (everything except the .SEGMENT lines) for the
handwriting samples that were partitioned into at least one training
set task file.
Components that are parts of samples that were not selected to be in
this release of the training set have been replaced with a single
coordinate pair "0 0" so that there was no need to revise the original
segmentation and so that there is no unlabeled pen data in the
.INCLUDE files. The vocabulary files for tasks 3-8 will be created
after all the data has been partitioned.

An example data file (include/cea/data/Asundi.dat) is:

    .WRITER_ID ra3198
    .STYLE MIXED
    .SKILL OK
    [lines deleted]
    .H_LINE 137
    .PEN_DOWN
    0 0
    .PEN_UP
    0 0
    [lines deleted]
    .PEN_UP
    .PEN_DOWN
    85 379
    83 381
    83 382
    82 382
    ...

2.2 Data (".dat") Files

The subdirectories under "data" are organized by task number: 1a-1d
and then 2-8. The data files under each task subdirectory contain only
four types of lines:

    1. The .VERSION identifier
    2. The .INCLUDE references to any documentation files that were
       originally present
    3. One .INCLUDE reference to a file containing the original pen
       data (everything except .SEGMENT lines)
    4. The .SEGMENT lines from the original data file that were part
       of a sample randomly picked to be in the training data for the
       given writer and that met the criteria for the task

The .SEGMENT information was used to group components into samples.
For the training set, each selected sample's .SEGMENT lines were put
in files for all tasks where it met the criteria. For example, all
uppercase characters in the context of words or text appear in tasks
1b, 2 and 3.

All include files that are referenced in data files are relative to
the directory "include" in the top-level directory.

An example data file (data/8/cea/Asundi.dat) is:

    .VERSION 1.0
    .INCLUDE cea/ced0.doc
    .INCLUDE cea/ced0.lex
    .INCLUDE cea/data/Asundi.dat
    .SEGMENT TEXT 39-85 ? "EZ-SOFT has a legitimate case for sueing ."
    .SEGMENT TEXT 183-213 ? "The intellectual property rules are very confusing ."
    .SEGMENT TEXT 278-318 ? "Disputing Clause 7 would probably be justified ."
    .SEGMENT TEXT 319-357 ? "Advertising without prior consent is illegal ."

2.3 Expected Differences from Test Sets

There will be several differences between the training set and the
test sets. First, the .SEGMENT lines will obviously not have the truth
strings. In addition, the sources of the data will not be apparent --
either the directories named with the IDs will be mapped to something
else, or the files in each task won't be organized into subdirectories
based on their source data set at all. Also, not all samples in the
test sets will appear in more than one task even when that is possible
(e.g. not all digits will appear in tasks 1b and 2) -- the training
set is organized that way to maximize the amount of data to train on
for each task. Finally, the test sets will be organized further into
writer-independent and writer-dependent tests if enough data exists.

3. DOWNLOADING
==============

To download the tar-format archive file:

    $ ftp ftp.cis.upenn.edu
    Name: anonymous
    Password:
    ftp> quote site group
    ftp> quote site gpass
    ftp> cd pub/UNIPEN/repository/locked/Outgoing
    ftp> binary
    ftp> get train_r01_v01_ .tar
    ftp> quit

where and were the strings given to your organization for submitting
your UNIPEN data and allowing access to the `UNIPEN' subdirectory on
the FTP site (in addition to the publicly open `UNIPEN-pub').

[Note: you will not be able to list the directory `locked', but
downloading the file does not require read permissions for the
directory.]

Files for the following groups will be made available once the
problems that I have reported to them have been fixed:

    app par phi prp tos

That's the only way I can think of that (1) is fair to everyone whose
data was either correct all along or has been fixed by this point, and
that (2) leaves an incentive to help tie up some of the loose ends
that I'm having to deal with.

The tar archive is 8.7 MB. There are also files with extensions
".tar.Z" (3.1 MB) and ".tar.gz" (2.4 MB), which are UNIX-compressed
and GNU gzip-compressed tar archives respectively. Please download one
of those instead of ".tar" if you have the decompression software,
especially if your network path to the FTP site is long or has been
subject to timeouts or lost connections. I know several sites had
trouble during the data donation process.

While the final amount of training data is not yet known, I expect it
to be close to one gigabyte (uncompressed), so those with very limited
disk space should start thinking now about how best to handle that
much data.

-- Stan Janet, NIST
stan@magi.nist.GOV
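
[Editorial sketch] The four line types described in Section 2.2 of the
message above can be read with a few lines of Python. This is a
minimal sketch, not part of the UNIPEN distribution: it handles only
the simple ".SEGMENT <type> <first>-<last> <flag> "<label>"" shape
shown in the example, and the names parse_task_file, Segment and flag
are hypothetical. The message does not say what the "?" field means,
so it is kept as an opaque string.

```python
import re
from dataclasses import dataclass

# Assumed shape, taken from the example in the message:
#   .SEGMENT TEXT 39-85 ? "some label"
SEGMENT_RE = re.compile(r'^\.SEGMENT\s+(\S+)\s+(\d+)-(\d+)\s+(\S+)\s+"(.*)"$')


@dataclass
class Segment:
    kind: str    # CHARACTER, WORD or TEXT
    first: int   # first component number of the sample
    last: int    # last component number of the sample
    flag: str    # the "?" field; its meaning is not stated in the message
    label: str   # truth string


def parse_task_file(text):
    """Return (version, includes, segments) for one task .dat file."""
    version = None
    includes = []
    segments = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(".VERSION"):
            version = line.split(None, 1)[1]
        elif line.startswith(".INCLUDE"):
            # Paths are relative to the top-level "include" directory.
            includes.append(line.split(None, 1)[1])
        elif line.startswith(".SEGMENT"):
            match = SEGMENT_RE.match(line)
            if match is None:
                raise ValueError("unsupported .SEGMENT line: " + line)
            kind, first, last, flag, label = match.groups()
            segments.append(Segment(kind, int(first), int(last), flag, label))
    return version, includes, segments
```

Segment ranges that use a richer component syntax than "first-last"
would raise ValueError here rather than being silently misread.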
From stan@magi.nist.GOV Mon Sep 11 05:37 MET 1995
From: Stan Janet
Subject: Next release of UNIPEN training data coming soon
To: unipen-donators@magi.nist.GOV
Organization: National Institute of Standards and Technology, Gaithersburg, MD

You can expect directions for downloading the second release of UNIPEN
training data either tomorrow or Tuesday. The data has been
partitioned out and just needs to be checked for any obvious problems.
This release will contain samples from all donators.

-- Stan
From stan@magi.nist.GOV Thu Sep 21 01:48 MET 1995
From: Stan Janet
Subject: UNIPEN: second release of training data is ready
To: unipen-donators@magi.nist.GOV
Organization: National Institute of Standards and Technology, Gaithersburg, MD

Documentation for UNIPEN Training Data Release "train_r01_v02"

1. INTRODUCTION
===============

The second release of UNIPEN training data is ready for downloading
from the anonymous FTP site, ftp.cis.upenn.edu.

This release supersedes the previous release train_r01_v01. (Please
delete your copies of that to avoid confusion.) It consists of
approximately one-third of the samples from one-fifth of the writers
from one data set from each of 29 different sources.

This release consists of data from these 29 data sets:

    abm anj art atu cea gmd huj imp int kar lex mot not pri val
    aga apc att bba dar hpb ibm imt kai lav lou nic phi syn

Partitioning has uncovered what appear to be minor problems with data
sets from seven participants:

    pap rim scr sie tos ugi uqb

The problems are described in Section 4 below. I need another day or
so to investigate them further, so those participants do not need to
do anything to their data yet (I may be able to fix the problems very
quickly), although they are encouraged to investigate and let me know
if my diagnosis appears to be correct. But they will not be able to
download this release of training data yet -- files for those
participants will be made available once the problems in their data
sets have been resolved. That's the only way I can think of that (1)
is fair to everyone whose data was either correct all along or has
been fixed by this point, and that (2) leaves an incentive to help me
tie up some of the loose ends that are left to be dealt with.

More training data releases will be coming as soon as the data sets
with problems are fixed, when statistics needed for the benchmark have
been calculated, etc.

Feedback and questions are welcome, and prompt reporting of any
mistakes is highly encouraged.

2. ORGANIZATION
===============

[Note: Section 2 below is identical to Section 2 in the documentation
for the previous release.]

The files are stored in a tar-format archive file. When restored from
the archive, the top-level directory will be `train_r01_v02'. It
contains directories `include' and `data'.

As an example, these are the files derived from those contributed by
CEDAR (from their first data set "cea") that had samples randomly
selected at this point for the training set:

    include/cea
    include/cea/data
    include/cea/data/Asundi.dat
    include/cea/data/Kaeslingk.dat
    include/cea/data/Ramaswamy.dat
    include/cea/data/Toepfer.dat
    include/cea/ced0.doc
    include/cea/ced0.lex
    data/1a/cea
    data/1a/cea/Asundi.dat
    data/1a/cea/Ramaswamy.dat
    data/1b/cea
    data/1b/cea/Asundi.dat
    data/1b/cea/Kaeslingk.dat
    data/1b/cea/Ramaswamy.dat
    data/1b/cea/Toepfer.dat
    data/1c/cea
    data/1c/cea/Asundi.dat
    data/1c/cea/Kaeslingk.dat
    data/1c/cea/Ramaswamy.dat
    data/1c/cea/Toepfer.dat
    data/1d/cea
    data/1d/cea/Asundi.dat
    data/1d/cea/Kaeslingk.dat
    data/1d/cea/Ramaswamy.dat
    data/1d/cea/Toepfer.dat
    data/2/cea
    data/2/cea/Asundi.dat
    data/2/cea/Kaeslingk.dat
    data/2/cea/Ramaswamy.dat
    data/2/cea/Toepfer.dat
    data/3/cea
    data/3/cea/Asundi.dat
    data/3/cea/Kaeslingk.dat
    data/3/cea/Ramaswamy.dat
    data/3/cea/Toepfer.dat
    data/6/cea
    data/6/cea/Asundi.dat
    data/6/cea/Kaeslingk.dat
    data/6/cea/Ramaswamy.dat
    data/6/cea/Toepfer.dat
    data/7/cea
    data/7/cea/Asundi.dat
    data/7/cea/Kaeslingk.dat
    data/7/cea/Ramaswamy.dat
    data/7/cea/Toepfer.dat
    data/8/cea
    data/8/cea/Asundi.dat
    data/8/cea/Kaeslingk.dat
    data/8/cea/Ramaswamy.dat
    data/8/cea/Toepfer.dat

2.1 Include (".INCLUDE") Files

The subdirectories in `include' are named with the three-character ID
of the data set from which the files came. Any files in that directory
are the original documentation files that were referenced by .INCLUDE
lines. Also in that directory is a subdirectory "data".
Its structure mirrors that of the original data files, and it holds
the files that contain the pen data (everything except the .SEGMENT
lines) for the handwriting samples that were partitioned into at least
one training set task file. Components that are parts of samples that
were not selected to be in this release of the training set have been
replaced with a single coordinate pair "0 0" so that there was no need
to revise the original segmentation and so that there is no unlabeled
pen data in the .INCLUDE files. The vocabulary files for tasks 3-8
will be created after all the data has been partitioned.

An example data file (include/cea/data/Asundi.dat) is:

    .WRITER_ID ra3198
    .STYLE MIXED
    .SKILL OK
    [lines deleted]
    .H_LINE 137
    .PEN_DOWN
    0 0
    .PEN_UP
    0 0
    [lines deleted]
    .PEN_UP
    .PEN_DOWN
    85 379
    83 381
    83 382
    82 382
    ...

2.2 Data (".dat") Files

The subdirectories under "data" are organized by task number: 1a-1d
and then 2-8. The data files under each task subdirectory contain only
four types of lines:

    1. The .VERSION identifier
    2. The .INCLUDE references to any documentation files that were
       originally present
    3. One .INCLUDE reference to a file containing the original pen
       data (everything except .SEGMENT lines)
    4. The .SEGMENT lines from the original data file that were part
       of a sample randomly picked to be in the training data for the
       given writer and that met the criteria for the task

The .SEGMENT information was used to group components into samples.
For the training set, each selected sample's .SEGMENT lines were put
in files for all tasks where it met the criteria. For example, all
uppercase characters in the context of words or text appear in tasks
1b, 2 and 3.

All include files that are referenced in data files are relative to
the directory "include" in the top-level directory.

An example data file (data/8/cea/Asundi.dat) is:

    .VERSION 1.0
    .INCLUDE cea/ced0.doc
    .INCLUDE cea/ced0.lex
    .INCLUDE cea/data/Asundi.dat
    .SEGMENT TEXT 39-85 ? "EZ-SOFT has a legitimate case for sueing ."
    .SEGMENT TEXT 183-213 ? "The intellectual property rules are very confusing ."
    .SEGMENT TEXT 278-318 ? "Disputing Clause 7 would probably be justified ."
    .SEGMENT TEXT 319-357 ? "Advertising without prior consent is illegal ."

2.3 Expected Differences from Test Sets

There will be several differences between the training set and the
test sets. First, the .SEGMENT lines will obviously not have the truth
strings. In addition, the sources of the data will not be apparent --
either the directories named with the IDs will be mapped to something
else, or the files in each task won't be organized into subdirectories
based on their source data set at all. Also, not all samples in the
test sets will appear in more than one task even when that is possible
(e.g. not all digits will appear in tasks 1b and 2) -- the training
set is organized that way to maximize the amount of data to train on
for each task. Finally, the test sets will be organized further into
writer-independent and writer-dependent tests if enough data exists.

3. DOWNLOADING
==============

To download the tar-format archive file:

    $ ftp ftp.cis.upenn.edu
    Name: anonymous
    Password:
    ftp> quote site group
    ftp> quote site gpass
    ftp> cd pub/UNIPEN/repository/locked/Outgoing
    ftp> binary
    ftp> get train_r01_v02_ .tar
    ftp> quit

where and were the strings given to your organization for submitting
your UNIPEN data and allowing access to the `UNIPEN' subdirectory on
the FTP site (in addition to the publicly open `UNIPEN-pub').

[Note: you will not be able to list the directory `locked', but
downloading the file does not require read permissions for the
directory.]

The tar archive is now 73 MB. There are also files with extensions
".tar.Z" (23 MB) and ".tar.gz" (20 MB), which are UNIX-compressed and
GNU gzip-compressed tar archives respectively. Please download one of
those instead of ".tar" if you have the decompression software,
especially if your network path to the FTP site is long or has been
subject to timeouts or lost connections.

I expect some sites will not be able to FTP files this big. If that's
the case for you, send me email and we can try to work something out.

To check that FTP transfers completed successfully, file sizes should
match those below:

    71833600 train_r01_v02_*.tar
    23028481 train_r01_v02_*.tar.Z
    19877276 train_r01_v02_*.tar.gz

Also, checksums calculated using /bin/sum under SunOS 4.1.3 are:

    42982 70150 train_r01_v02_*.tar
    43940 22489 train_r01_v02_*.tar.Z
    29437 19412 train_r01_v02_*.tar.gz

Checksums calculated using /usr/5bin/sum under SunOS 4.1.3 (which is
the same as /bin/sum under SGI IRIX 5.2) are:

    45570 140300 train_r01_v02_*.tar
    24892  44978 train_r01_v02_*.tar.Z
    30983  38823 train_r01_v02_*.tar.gz

While the final amount of training data is not yet known, I expect it
to be close to one gigabyte (uncompressed), so those with very limited
disk space should start thinking now about how best to handle that
much data.

4. PROBLEMS
===========

    rim  Component counter appears to have been reset for each sample
         in each file. As a result, segmentation is always 0-#.

    scr  Three of the 25 files that ended up being partitioned out of
         this data set had labeling errors (st000009.dat,
         st000034.dat, and dann.dat).

    pap  Component counter appears to have not been reset between
         files.

    sie  Samples are upside down. Coordinate system will just need to
         be changed.

    tos  Segmentation is wrong. It appears that components with the
         pen up were not counted.

    ugi  Segmentation is wrong. It appears that components with the
         pen up were not counted.

    uqb  Some components appear to have been deleted with segmentation
         not being revised, e.g. segmentation in one file begins 0-1,
         then 2-3, then 8-9.

-- Stan Janet, NIST
stan@magi.nist.GOV
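
[Editorial sketch] For readers without a SunOS "sum" command, the two
checksum styles quoted in the message above can be approximated in
Python. This is a sketch under stated assumptions, not a verified
reimplementation: it assumes /bin/sum used the classic BSD 16-bit
rotating checksum counted in 1024-byte blocks, and /usr/5bin/sum the
System V checksum counted in 512-byte blocks. The block counts in the
message are consistent with that (71833600 / 1024 = 70150 and
71833600 / 512 = 140300), but the checksum values themselves have not
been reproduced here. The function names are hypothetical.

```python
def read_chunks(path, chunk_size=65536):
    """Yield the contents of a file as a sequence of bytes chunks."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def bsd_sum(chunks):
    """Return (checksum, blocks) in the assumed BSD style.

    The 16-bit checksum is rotated right by one bit before each byte
    is added; blocks are 1024 bytes, rounded up.
    """
    checksum = 0
    size = 0
    for chunk in chunks:
        size += len(chunk)
        for byte in chunk:
            checksum = (checksum >> 1) | ((checksum & 1) << 15)
            checksum = (checksum + byte) & 0xFFFF
    return checksum, (size + 1023) // 1024


def sysv_sum(chunks):
    """Return (checksum, blocks) in the assumed System V style.

    All bytes are added modulo 2**32, then the total is folded to
    16 bits; blocks are 512 bytes, rounded up.
    """
    total = 0
    size = 0
    for chunk in chunks:
        size += len(chunk)
        total = (total + sum(chunk)) & 0xFFFFFFFF
    folded = (total & 0xFFFF) + (total >> 16)
    checksum = (folded & 0xFFFF) + (folded >> 16)
    return checksum, (size + 511) // 512
```

A downloaded archive would be checked with something like
bsd_sum(read_chunks(path)) and the result compared with the pair
listed for that file. Comparing the plain file size is a cheaper first
check.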
From stan@magi.nist.GOV Tue Nov 14 00:23 MET 1995
From: Stan Janet
Subject: Third release of UNIPEN training data
To: UNIPEN participants

Documentation for UNIPEN Training Data Release "train_r01_v03"

1. INTRODUCTION
===============

The third release of UNIPEN training data is ready for downloading
from the NIST anonymous FTP server, sequoyah.ncsl.nist.gov. I am now
using a local server because that avoids limitations on disk space and
because the files can be copied to it so much more quickly.

This release supersedes the previous release, train_r01_v02. (Please
delete your copies of that to avoid confusion.) It consists of
approximately one-third of the samples from one-fifth of the writers
from one data set from each of *all 36* different sources. These are
the data sets represented in the release:

    abm anj art atu cea gmd huj imp int kar lex mot not pri val
    aga apc att bba dar hpb ibm imt kai lav lou nic phi syn
    pap rim scr sie tos ugi uqb

This release contains the data from the train_r01_v02 release plus
data from the 7 data sets that were withheld from it due to errors
that I have since fixed:

    pap rim scr sie tos ugi uqb

Participants that have downloaded train_r01_v02 can download a file
containing only the added data. Everyone else should download the
complete train_r01_v03 release. Directions for downloading are in
Section 3.

Several participants have reported errors that they have found in
files from the previous release. This release does not include any
fixes to those problems, which I am now investigating. Depending on
the severity of the problems, the next training data releases may be
coming fairly soon.

Feedback and questions are welcome, and prompt reporting of any
mistakes is highly encouraged.

2. FILE ORGANIZATION
====================

[Note: Section 2 below is identical to Section 2 in the documentation
for the previous releases.]

The files are stored in a tar-format archive file. When restored from
the archive, the top-level directory will be `train_r01_v03'. It
contains directories `include' and `data'.

As an example, these are the files derived from those contributed by
CEDAR (from their first data set "cea") that had samples randomly
selected at this point for the training set:

    include/cea
    include/cea/data
    include/cea/data/Asundi.dat
    include/cea/data/Kaeslingk.dat
    include/cea/data/Ramaswamy.dat
    include/cea/data/Toepfer.dat
    include/cea/ced0.doc
    include/cea/ced0.lex
    data/1a/cea
    data/1a/cea/Asundi.dat
    data/1a/cea/Ramaswamy.dat
    data/1b/cea
    data/1b/cea/Asundi.dat
    data/1b/cea/Kaeslingk.dat
    data/1b/cea/Ramaswamy.dat
    data/1b/cea/Toepfer.dat
    data/1c/cea
    data/1c/cea/Asundi.dat
    data/1c/cea/Kaeslingk.dat
    data/1c/cea/Ramaswamy.dat
    data/1c/cea/Toepfer.dat
    data/1d/cea
    data/1d/cea/Asundi.dat
    data/1d/cea/Kaeslingk.dat
    data/1d/cea/Ramaswamy.dat
    data/1d/cea/Toepfer.dat
    data/2/cea
    data/2/cea/Asundi.dat
    data/2/cea/Kaeslingk.dat
    data/2/cea/Ramaswamy.dat
    data/2/cea/Toepfer.dat
    data/3/cea
    data/3/cea/Asundi.dat
    data/3/cea/Kaeslingk.dat
    data/3/cea/Ramaswamy.dat
    data/3/cea/Toepfer.dat
    data/6/cea
    data/6/cea/Asundi.dat
    data/6/cea/Kaeslingk.dat
    data/6/cea/Ramaswamy.dat
    data/6/cea/Toepfer.dat
    data/7/cea
    data/7/cea/Asundi.dat
    data/7/cea/Kaeslingk.dat
    data/7/cea/Ramaswamy.dat
    data/7/cea/Toepfer.dat
    data/8/cea
    data/8/cea/Asundi.dat
    data/8/cea/Kaeslingk.dat
    data/8/cea/Ramaswamy.dat
    data/8/cea/Toepfer.dat

2.1 Include (".INCLUDE") Files

The subdirectories in `include' are named with the three-character ID
of the data set from which the files came. Any files in that directory
are the original documentation files that were referenced by .INCLUDE
lines. Also in that directory is a subdirectory "data". Its structure
mirrors that of the original data files, and it holds the files that
contain the pen data (everything except the .SEGMENT lines) for the
handwriting samples that were partitioned into at least one training
set task file.
Components that are parts of samples that were not selected to be in
this release of the training set have been replaced with a single
coordinate pair "0 0" so that there was no need to revise the original
segmentation and so that there is no unlabeled pen data in the
.INCLUDE files. The vocabulary files for tasks 3-8 will be created
after all the data has been partitioned.

An example data file (include/cea/data/Asundi.dat) is:

    .WRITER_ID ra3198
    .STYLE MIXED
    .SKILL OK
    [lines deleted]
    .H_LINE 137
    .PEN_DOWN
    0 0
    .PEN_UP
    0 0
    [lines deleted]
    .PEN_UP
    .PEN_DOWN
    85 379
    83 381
    83 382
    82 382
    ...

2.2 Data (".dat") Files

The subdirectories under "data" are organized by task number: 1a-1d
and then 2-8. The data files under each task subdirectory contain only
four types of lines:

    1. The .VERSION identifier
    2. The .INCLUDE references to any documentation files that were
       originally present
    3. One .INCLUDE reference to a file containing the original pen
       data (everything except .SEGMENT lines)
    4. The .SEGMENT lines from the original data file that were part
       of a sample randomly picked to be in the training data for the
       given writer and that met the criteria for the task

The .SEGMENT information was used to group components into samples.
For the training set, each selected sample's .SEGMENT lines were put
in files for all tasks where it met the criteria. For example, all
uppercase characters in the context of words or text appear in tasks
1b, 2 and 3.

All include files that are referenced in data files are relative to
the directory "include" in the top-level directory.

An example data file (data/8/cea/Asundi.dat) is:

    .VERSION 1.0
    .INCLUDE cea/ced0.doc
    .INCLUDE cea/ced0.lex
    .INCLUDE cea/data/Asundi.dat
    .SEGMENT TEXT 39-85 ? "EZ-SOFT has a legitimate case for sueing ."
    .SEGMENT TEXT 183-213 ? "The intellectual property rules are very confusing ."
    .SEGMENT TEXT 278-318 ? "Disputing Clause 7 would probably be justified ."
    .SEGMENT TEXT 319-357 ? "Advertising without prior consent is illegal ."

2.3 Expected Differences from Test Sets

There will be several differences between the training set and the
test sets. First, the .SEGMENT lines will obviously not have the truth
strings. In addition, the sources of the data will not be apparent --
either the directories named with the IDs will be mapped to something
else, or the files in each task won't be organized into subdirectories
based on their source data set at all. Also, not all samples in the
test sets will appear in more than one task even when that is possible
(e.g. not all digits will appear in tasks 1b and 2) -- the training
set is organized that way to maximize the amount of data to train on
for each task. Finally, the test sets will be organized further into
writer-independent and writer-dependent tests if enough data exists.

3. DOWNLOADING
==============

The complete tar archive is now 87 MB. There are also files with the
extensions ".tar.Z" (28 MB) and ".tar.gz" (23 MB), which are
UNIX-compressed and GNU gzip-compressed versions of the tar archive
respectively. The corresponding sizes for the "diffs" archives
(containing only the new data) are 15/4/3 MB.

If possible, please download the data in compressed form. If you
experience timeouts or broken network connections, you can download
the GNU gzip-compressed archives in 23 pieces (4 for the "diffs") of
1 MB or less, then concatenate them back together locally. Only the
GNU gzip-compressed archives are available in split form, so those
without that software will have to download it from the GNU FTP server
prep.ai.mit.edu (in pub/gnu/gzip-1.2.4.tar) and compile it. For
MS-DOS, a binary pub/gnu/gzip-1.2.4.msdos.exe is there, and that
probably also does decompression on that platform.

To download a file from the FTP server:

    $ ftp sequoyah.ncsl.nist.gov
    Name: anonymous
    Password:
    ftp> binary
    ftp> cd outgoing/unipen/train_r01_v03
    ftp> get
    ftp> quit

[Note: you will not be able to list the directories `outgoing' or
`unipen', but downloading the file does not require read permissions
for the directory.]

To download the split files, for example from split.complete, you
should replace the "cd" and "get" commands above with:

    ftp> cd outgoing/unipen/train_r01_v03/split.complete
    ftp> prompt
    ftp> mget *

A complete list of files available for downloading is below, followed
by checksums generated by the "/bin/sum" command under SunOS 4.1.2,
and checksums generated by the "/usr/5bin/sum" command under SunOS
4.1.2:

Listing (outgoing/unipen/train_r01_v03):

    -rw-r--r-- 1 root 87214080 Nov 13 19:19 train_r01_v03.tar
    -rw-r--r-- 1 root 27509615 Nov 13 19:19 train_r01_v03.tar.Z
    -rw-r--r-- 1 root 22995935 Nov 13 19:35 train_r01_v03.tar.gz
    -rw-r--r-- 1 root 15380480 Nov 13 19:19 train_r01_v03_diffs.tar
    -rw-r--r-- 1 root  4477767 Nov 13 19:19 train_r01_v03_diffs.tar.Z
    -rw-r--r-- 1 root  3102560 Nov 13 20:33 train_r01_v03_diffs.tar.gz

Listing (outgoing/unipen/train_r01_v03/split.complete):

    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.00
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.01
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.02
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.03
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.04
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.05
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.06
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.07
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.08
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.09
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.10
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.11
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.12
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.13
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.14
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.15
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.16
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.17
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.18
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.19
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.20
    -rw-r--r-- 1 root 1000000 Nov 13 18:37 train_r01_v03.tar.gz.21
    -rw-r--r-- 1 root  995935 Nov 13 18:37 train_r01_v03.tar.gz.22

Listing (outgoing/unipen/train_r01_v03/split.diffs):

    -rw-r--r-- 1 root 1000000 Nov 13 18:27 train_r01_v03_diffs.tar.gz.00
    -rw-r--r-- 1 root 1000000 Nov 13 18:27 train_r01_v03_diffs.tar.gz.01
    -rw-r--r-- 1 root 1000000 Nov 13 18:27 train_r01_v03_diffs.tar.gz.02
    -rw-r--r-- 1 root  102560 Nov 13 18:27 train_r01_v03_diffs.tar.gz.03

Checksums (/bin/sum):

    42721 85170 train_r01_v03.tar
    00176 26865 train_r01_v03.tar.Z
    50100 22457 train_r01_v03.tar.gz
    17909 15020 train_r01_v03_diffs.tar
    25379  4373 train_r01_v03_diffs.tar.Z
    38732  3030 train_r01_v03_diffs.tar.gz
    43419   977 split.complete/train_r01_v03.tar.gz.00
    57870   977 split.complete/train_r01_v03.tar.gz.01
    38885   977 split.complete/train_r01_v03.tar.gz.02
    06542   977 split.complete/train_r01_v03.tar.gz.03
    60169   977 split.complete/train_r01_v03.tar.gz.04
    33689   977 split.complete/train_r01_v03.tar.gz.05
    15509   977 split.complete/train_r01_v03.tar.gz.06
    48845   977 split.complete/train_r01_v03.tar.gz.07
    52500   977 split.complete/train_r01_v03.tar.gz.08
    16049   977 split.complete/train_r01_v03.tar.gz.09
    52256   977 split.complete/train_r01_v03.tar.gz.10
    52212   977 split.complete/train_r01_v03.tar.gz.11
    20048   977 split.complete/train_r01_v03.tar.gz.12
    39429   977 split.complete/train_r01_v03.tar.gz.13
    25421   977 split.complete/train_r01_v03.tar.gz.14
    55602   977 split.complete/train_r01_v03.tar.gz.15
    05502   977 split.complete/train_r01_v03.tar.gz.16
    51131   977 split.complete/train_r01_v03.tar.gz.17
    25708   977 split.complete/train_r01_v03.tar.gz.18
    24727   977 split.complete/train_r01_v03.tar.gz.19
    52742   977 split.complete/train_r01_v03.tar.gz.20
    07845   977 split.complete/train_r01_v03.tar.gz.21
    45579   973 split.complete/train_r01_v03.tar.gz.22
    32900   977 split.diffs/train_r01_v03_diffs.tar.gz.00
    23655   977 split.diffs/train_r01_v03_diffs.tar.gz.01
    47396   977 split.diffs/train_r01_v03_diffs.tar.gz.02
    23287   101 split.diffs/train_r01_v03_diffs.tar.gz.03

Checksums (/usr/5bin/sum):

    18099 170340 train_r01_v03.tar
    28411  53730 train_r01_v03.tar.Z
    1289   44914 train_r01_v03.tar.gz
    13779  30040 train_r01_v03_diffs.tar
    30562   8746 train_r01_v03_diffs.tar.Z
    26865   6060 train_r01_v03_diffs.tar.gz
    10021   1954 split.complete/train_r01_v03.tar.gz.00
    35136   1954 split.complete/train_r01_v03.tar.gz.01
    7503    1954 split.complete/train_r01_v03.tar.gz.02
    55483   1954 split.complete/train_r01_v03.tar.gz.03
    27695   1954 split.complete/train_r01_v03.tar.gz.04
    34855   1954 split.complete/train_r01_v03.tar.gz.05
    59493   1954 split.complete/train_r01_v03.tar.gz.06
    61166   1954 split.complete/train_r01_v03.tar.gz.07
    60993   1954 split.complete/train_r01_v03.tar.gz.08
    7378    1954 split.complete/train_r01_v03.tar.gz.09
    36997   1954 split.complete/train_r01_v03.tar.gz.10
    28398   1954 split.complete/train_r01_v03.tar.gz.11
    10705   1954 split.complete/train_r01_v03.tar.gz.12
    54241   1954 split.complete/train_r01_v03.tar.gz.13
    10171   1954 split.complete/train_r01_v03.tar.gz.14
    58406   1954 split.complete/train_r01_v03.tar.gz.15
    34412   1954 split.complete/train_r01_v03.tar.gz.16
    46807   1954 split.complete/train_r01_v03.tar.gz.17
    4243    1954 split.complete/train_r01_v03.tar.gz.18
    33336   1954 split.complete/train_r01_v03.tar.gz.19
    53538   1954 split.complete/train_r01_v03.tar.gz.20
    6363    1954 split.complete/train_r01_v03.tar.gz.21
    50369   1946 split.complete/train_r01_v03.tar.gz.22
    7559    1954 split.diffs/train_r01_v03_diffs.tar.gz.00
    63489   1954 split.diffs/train_r01_v03_diffs.tar.gz.01
    51155   1954 split.diffs/train_r01_v03_diffs.tar.gz.02
    35732    201 split.diffs/train_r01_v03_diffs.tar.gz.03

Please check that downloaded files' sizes and/or checksums match those
above to verify that the FTP transfers completed without errors.

While the final amount of training data is not yet known, I expect it
to be close to one gigabyte (uncompressed), so those with very limited
disk space should start thinking now about how best to handle that
much data.

I am hoping the furlough of US government employees will not occur,
will not affect NIST, or will be short. However, if I am slow in
responding to email in the next few days, that will be the likely
reason.

-- Stan Janet, NIST
stan@magi.nist.GOV
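
[Editorial sketch] The "download in pieces, then concatenate" step
described in the message above can be done with "cat" on UNIX, or with
a short Python script where that is not available. This is a minimal
sketch: the function names are hypothetical, it assumes the pieces sit
in one local directory under their original names with two-digit
suffixes (.00, .01, ...), and the expected sizes are the byte counts
copied from the directory listing in the message.

```python
import glob
import os
import shutil

# Byte counts of the reassembled archives, from the listing above.
EXPECTED_SIZES = {
    "train_r01_v03.tar.gz": 22995935,
    "train_r01_v03_diffs.tar.gz": 3102560,
}


def join_pieces(prefix, output_path):
    """Concatenate prefix.00, prefix.01, ... into output_path.

    Pieces are taken in sorted order, which is numeric order for
    zero-padded two-digit suffixes. Returns the number of bytes
    written.
    """
    pattern = glob.escape(prefix) + ".[0-9][0-9]"
    pieces = sorted(glob.glob(pattern))
    if not pieces:
        raise FileNotFoundError("no pieces match " + pattern)
    total = 0
    with open(output_path, "wb") as out:
        for piece in pieces:
            with open(piece, "rb") as src:
                shutil.copyfileobj(src, out)
            total += os.path.getsize(piece)
    return total


def size_is_expected(path):
    """True if the file's size matches the listed size for its name."""
    expected = EXPECTED_SIZES.get(os.path.basename(path))
    return expected is not None and os.path.getsize(path) == expected
```

For example, join_pieces("train_r01_v03.tar.gz",
"train_r01_v03.tar.gz") followed by
size_is_expected("train_r01_v03.tar.gz") gives a quick check that no
piece is missing or truncated before running gzip on the result.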
From stan@magi.nist.GOV Fri Jan 26 09:36 MET 1996
From: Stan Janet
Subject: Fourth release of UNIPEN training data To: UNIPEN participants Organization: National Institute of Standards and Technology, Gaithersburg, MD Documentation for UNIPEN Training Data Release "train_r01_v04" 1. INTRODUCTION =============== The fourth release of UNIPEN training data is ready for downloading From the NIST anonymous FTP server, sequoyah.ncsl.nist.gov. I am now using a local server because that avoids limitations on disk space and because the files can be copied to it so much more quickly. This release supersedes the previous release, train_r01_v03. (Please delete your copies of that to avoid confusion.) It consists of approximately one-third of the samples from one-fifth of the writers from all 52 data sets (from 36 different sources). These data sets are present and unchanged from the previous release: abm anj art atu cea gmd huj imp int kar lex mot not pri val aga apc att bba dar hpb ibm imt kai lav lou nic phi syn pap rim scr sie tos ugi uqb New data sets in this release are: apa apb apd ape app ata bbb bbc bbd ceb cec ced cee cef hpp sta Directions for downloading are in Section 3. Feedback and questions are welcomed, and promptly reporting any mistakes is highly encouraged. Several participants have reported errors that they have found in files from the previous release, e.g. the BBN data is upside down (but included because the effort required to modify all my scripts to skip that processing is not worthwhile). This release does not include any fixes to those problems, which I will turn my attention to next. Depending on the severity of the problems, a release of repaired training data may be coming fairly soon. 2. FILE ORGANIZATION ==================== [Note: Section 2 below is identical to Section 2 in the documentation for the previous releases.] The files are stored in a tar-format archive file. When restored from the archive, the top-level directory will be `train_r01_v04'. It contains directories `include' and `data'. 
As an example, these are the files derived from those contributed by
CEDAR (from their first data set "cea") that had samples randomly
selected, at this point, for the training set:

    include/cea
    include/cea/data
    include/cea/data/Asundi.dat
    include/cea/data/Kaeslingk.dat
    include/cea/data/Ramaswamy.dat
    include/cea/data/Toepfer.dat
    include/cea/ced0.doc
    include/cea/ced0.lex
    data/1a/cea
    data/1a/cea/Asundi.dat
    data/1a/cea/Ramaswamy.dat
    data/1b/cea
    data/1b/cea/Asundi.dat
    data/1b/cea/Kaeslingk.dat
    data/1b/cea/Ramaswamy.dat
    data/1b/cea/Toepfer.dat
    data/1c/cea
    data/1c/cea/Asundi.dat
    data/1c/cea/Kaeslingk.dat
    data/1c/cea/Ramaswamy.dat
    data/1c/cea/Toepfer.dat
    data/1d/cea
    data/1d/cea/Asundi.dat
    data/1d/cea/Kaeslingk.dat
    data/1d/cea/Ramaswamy.dat
    data/1d/cea/Toepfer.dat
    data/2/cea
    data/2/cea/Asundi.dat
    data/2/cea/Kaeslingk.dat
    data/2/cea/Ramaswamy.dat
    data/2/cea/Toepfer.dat
    data/3/cea
    data/3/cea/Asundi.dat
    data/3/cea/Kaeslingk.dat
    data/3/cea/Ramaswamy.dat
    data/3/cea/Toepfer.dat
    data/6/cea
    data/6/cea/Asundi.dat
    data/6/cea/Kaeslingk.dat
    data/6/cea/Ramaswamy.dat
    data/6/cea/Toepfer.dat
    data/7/cea
    data/7/cea/Asundi.dat
    data/7/cea/Kaeslingk.dat
    data/7/cea/Ramaswamy.dat
    data/7/cea/Toepfer.dat
    data/8/cea
    data/8/cea/Asundi.dat
    data/8/cea/Kaeslingk.dat
    data/8/cea/Ramaswamy.dat
    data/8/cea/Toepfer.dat

2.1 Include (".INCLUDE") Files

The subdirectories in `include' are named with the three-character ID of
the data set from which the files came. Any files in that directory are
the original documentation files that were referenced by .INCLUDE lines.
Also in that directory is a subdirectory "data". Its structure mirrors
that of the original data files, and it holds the files that contain the
pen data (everything except the .SEGMENT lines) for the handwriting
samples that were partitioned into at least one training set task file.
Components that are parts of samples that were not selected to be in this release of the training set have been replaced with a single coordinate pair "0 0" so that there was no need to revise the original segmentation and so that there is no unlabeled pen data in the .INCLUDE files. The vocabulary files for tasks 3-8 will be created after all the data has been partitioned. An example data file (include/cea/data/Asundi.dat) is: .WRITER_ID ra3198 .STYLE MIXED .SKILL OK [lines deleted] .H_LINE 137 .PEN_DOWN 0 0 .PEN_UP 0 0 [lines deleted] .PEN_UP .PEN_DOWN 85 379 83 381 83 382 82 382 ... 2.2 Data (".dat") Files The subdirectories under "data" are organized by task number: 1a-1d and then 2-8. The data files under each task subdirectory contain only four types of lines: 1. The .VERSION identifier 2. The .INCLUDE references to any documentation files that were originally present 3. One .INCLUDE reference to a file containing the original pen data (everything except .SEGMENT lines) 4. The .SEGMENT lines from the original data file that were part of a sample randomly picked to be in the training data for the given writer and that met the criteria for the task The .SEGMENT information was used to group components into samples. For the training set, Each selected sample's .SEGMENT's were put in files for all tasks where it met the criteria. For example, all uppercase characters in the context of words or text appear in tasks 1b, 2 and 3. All include files that are referenced in data files are relative to the directory "include" in the top-level directory. An example data file (data/8/cea/Asundi.dat) is: .VERSION 1.0 .INCLUDE cea/ced0.doc .INCLUDE cea/ced0.lex .INCLUDE cea/data/Asundi.dat .SEGMENT TEXT 39-85 ? "EZ-SOFT has a legitimate case for sueing ." .SEGMENT TEXT 183-213 ? "The intellectual property rules are very confusing ." .SEGMENT TEXT 278-318 ? "Disputing Clause 7 would probably be justified ." .SEGMENT TEXT 319-357 ? 
"Advertising without prior consent is illegal ." 2.3 Expected Differences from Test Sets There will be several differences between the training set and the test sets. First, the .SEGMENT lines will obviously not have the truth strings. In addition, the sources of the data will not be apparent -- either the directories named with the ID's will be mapped to something else, or the files in each task won't be organized into subdirectories based on their source data set at all. Also, not all samples in the test sets will appear in more than one task even when that is possible (i.e. not all digits will appear in tasks 1b and 2) -- the training set is organized that way to maximize the amount of data to train on for each task. Finally, the test sets will be organized further into writer-independent and writer-dependent tests if enough data exists. 3. DOWNLOADING ============== The complete tar archive is now 178 Mb. Although just 16 more data sets were added in this release, they tended to be large and that accounts for the increase from 87 Mb. There are also files with the extensions ".tar.Z" (59 Mb) and ".tar.gz" (49 Mb) which are UNIX- compressed and GNU gzip- compressed versions of the tar archive respectively. If possible, please download the data in compressed form. If you experience timeouts or broken network connections, you can download the GNU gzip-compressed archives in pieces of 1 Mb or less, then concatenate them back together locally. Those pieces will not be available until Wednesday, Jan. 31, as I will have to do some data cleanup to free up enough disk space. Only the GNU gzip-compressed archives will be available split into 1 Mb pieces, so those without that software will have to download it from the GNU FTP server prep.ai.mit.edu (in pub/gnu/gzip-1.2.4.tar) and compile it. For MS-DOS, a binary pub/gnu/gzip-1.2.4.msdos.exe is there, and that probably also does decompression under DOS. 
To download a file from the FTP server: $ ftp sequoyah.ncsl.nist.gov Name: anonymous Password: ftp> binary ftp> cd outgoing/unipen/train_r01_v04 ftp> get ftp> quit [Note: you will not be able to list the directories `outgoing' or `unipen', but downloading the file does not require read permissions for the directory.] A complete list of files available for downloading is below, followed by checksums. Listing (outgoing/unipen/train_r01_v04): $ ls -l -rw-r--r-- 1 root 177602560 Jan 26 02:28 train_r01_v04.tar -rw-r--r-- 1 root 58877175 Jan 26 02:56 train_r01_v04.tar.Z -rw-r--r-- 1 root 49270444 Jan 26 02:59 train_r01_v04.tar.gz Checksums (as generated by /bin/sum on SGI's, /usr/5bin/sum under SunOS 4.1.x): $ /usr/5bin/sum * 24928 346880 train_r01_v04.tar 8421 114995 train_r01_v04.tar.Z 14095 96232 train_r01_v04.tar.gz Please check that downloaded files' sizes and/or checksums match those above to verify that the FTP transfers completed without errors. While the final amount of training data is not yet known, I expect it to be close to one gigabyte (uncompressed), so those with very limited disk space should start thinking now about how best to handle that much data. For those of you outside or in the US who might not be sure if NIST was affected by the three-week US government furloughs due to the budget impasse: as part of the Dept. of Commerce, we were. I had originally hoped to get this data out last month. Hopefully this work will not be affected again. -- Stan Janet, NIST stan@magi.nist.GOV
From stan@magi.nist.GOV Mon Jul 15 07:04:19 1996
From: Stan Janet
Subject: Fifth release of UNIPEN training data is available
To: unipen-donators@magi.nist.GOV
Organization: National Institute of Standards and Technology, Gaithersburg, MD

Documentation for UNIPEN Training Data Release "train_r01_v05"

1. INTRODUCTION
===============

The fifth release of UNIPEN training data is ready for downloading from
the NIST anonymous FTP server, sequoyah.ncsl.nist.gov. I am now using a
local server because of the extra disk space available there.

This release supersedes the previous release, train_r01_v04. (Please
delete your copies of that to avoid confusion.) It consists of
approximately one-third of the samples from one-fifth of the writers
from all 54 data sets (from 38 different sources). Directions for
downloading are in Section 3.

New data sets in this release are:

    par
    pcl

These data sets have been corrected in various ways from the previous
release:

    abm
    aga
    apc, apd, ape
    att
    bba, bbb, bbc, bbd
    lex
    phi
    pri
    scr

A summary of the corrections to the above data sets is in the Appendix
of this email. Thanks to the following participants for reporting many
of the problems: Hans Dolfing, Kwon Jae-Ook, Stefan Manke, Han Shu, and
anyone else who I may have forgotten. If anyone else is aware of
problems still remaining, please forward a list of them to me.

The two new sets are a result of data sets being fixed that I had
previously been forced to take out because I was unable to reach the
original contact points via email. If there are people in your
organization not receiving this UNIPEN announcement that should be,
please forward their names to me so that I have multiple contact points.
Between 5 and 10% of my bulk mailings bounce back as undeliverable for
various reasons.

Now that it appears that most data problems have been fixed,
preliminary versions of the ultimate training data set and development
test set will be coming along soon.
I'm curious to know if anyone has used the training data released so far
as training or even test data for internal development, and if so, how
close they have come to being able to make use of foreign data
seamlessly. Any insight you give me will help greatly in planning
further release dates.

2. FILE ORGANIZATION
====================

[Note: Section 2 below is identical to Section 2 in the documentation
for the previous releases.]

The files are stored in a tar-format archive file. When restored from
the archive, the top-level directory will be `train_r01_v05'. It
contains directories `include' and `data'.

As an example, these are the files derived from those contributed by
CEDAR (from their first data set "cea") that had samples that have been
randomly selected for the training set:

    include/cea
    include/cea/data
    include/cea/data/Asundi.dat
    include/cea/data/Kaeslingk.dat
    include/cea/data/Ramaswamy.dat
    include/cea/data/Toepfer.dat
    include/cea/ced0.doc
    include/cea/ced0.lex
    data/1a/cea
    data/1a/cea/Asundi.dat
    data/1a/cea/Ramaswamy.dat
    data/1b/cea
    data/1b/cea/Asundi.dat
    data/1b/cea/Kaeslingk.dat
    data/1b/cea/Ramaswamy.dat
    data/1b/cea/Toepfer.dat
    data/1c/cea
    data/1c/cea/Asundi.dat
    data/1c/cea/Kaeslingk.dat
    data/1c/cea/Ramaswamy.dat
    data/1c/cea/Toepfer.dat
    data/1d/cea
    data/1d/cea/Asundi.dat
    data/1d/cea/Kaeslingk.dat
    data/1d/cea/Ramaswamy.dat
    data/1d/cea/Toepfer.dat
    data/2/cea
    data/2/cea/Asundi.dat
    data/2/cea/Kaeslingk.dat
    data/2/cea/Ramaswamy.dat
    data/2/cea/Toepfer.dat
    data/3/cea
    data/3/cea/Asundi.dat
    data/3/cea/Kaeslingk.dat
    data/3/cea/Ramaswamy.dat
    data/3/cea/Toepfer.dat
    data/6/cea
    data/6/cea/Asundi.dat
    data/6/cea/Kaeslingk.dat
    data/6/cea/Ramaswamy.dat
    data/6/cea/Toepfer.dat
    data/7/cea
    data/7/cea/Asundi.dat
    data/7/cea/Kaeslingk.dat
    data/7/cea/Ramaswamy.dat
    data/7/cea/Toepfer.dat
    data/8/cea
    data/8/cea/Asundi.dat
    data/8/cea/Kaeslingk.dat
    data/8/cea/Ramaswamy.dat
    data/8/cea/Toepfer.dat

2.1 Include (".INCLUDE") Files

The subdirectories in `include' are named with the three-character
ID of the data set from which the files came. Any files in that
directory are the original documentation files that were referenced by
.INCLUDE lines. Also in that directory is a subdirectory "data". Its
structure mirrors that of the original data files, and it holds the
files that contain the pen data (everything except the .SEGMENT lines)
for the handwriting samples that were partitioned into at least one
training set task file.

Components that are parts of samples that were not selected to be in
this release of the training set have been replaced with a single
coordinate pair "0 0" so that there was no need to revise the original
segmentation and so that there is no unlabeled pen data in the .INCLUDE
files.

The vocabulary files for tasks 3-8 will be created after all the data
has been partitioned.

An example data file (include/cea/data/Asundi.dat) is:

    .WRITER_ID ra3198
    .STYLE MIXED
    .SKILL OK
    [lines deleted]
    .H_LINE 137
    .PEN_DOWN
    0 0
    .PEN_UP
    0 0
    [lines deleted]
    .PEN_UP
    .PEN_DOWN
    85 379
    83 381
    83 382
    82 382
    ...

2.2 Data (".dat") Files

The subdirectories under "data" are organized by task number: 1a-1d and
then 2-8. The data files under each task subdirectory contain only four
types of lines:

    1. The .VERSION identifier
    2. The .INCLUDE references to any documentation files that were
       originally present
    3. One .INCLUDE reference to a file containing the original pen data
       (everything except .SEGMENT lines)
    4. The .SEGMENT lines from the original data file that were part of
       a sample randomly picked to be in the training data for the given
       writer and that met the criteria for the task

The .SEGMENT information was used to group components into samples. For
the training set, each selected sample's .SEGMENT's were put in files
for all tasks where it met the criteria. For example, all uppercase
characters in the context of words or text appear in tasks 1b, 2 and 3.
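The four line types listed above lend themselves to a very small reader. What follows is only a sketch, not part of the release: the names `TaskFile`, `parse_range` and `parse_task_file` are hypothetical, and it assumes that every delineation is either a single component number ("6") or a simple range ("39-85"), which is all that appears in the examples in this documentation. The field after the delineation (shown as "?" in the examples) is kept as an opaque string.

```python
import re
from dataclasses import dataclass, field

# .SEGMENT <type> <delineation> [<flag>] ["<label>"]
# The flag and label are optional because lines such as
# ".SEGMENT PAGE 13-22" carry neither.
SEGMENT_RE = re.compile(
    r'^\.SEGMENT\s+(\S+)\s+(\S+)(?:\s+(\S+))?(?:\s+"(.*)")?\s*$'
)


@dataclass
class TaskFile:
    """Hypothetical container for the contents of one task data file."""
    version: str = ""
    includes: list = field(default_factory=list)
    segments: list = field(default_factory=list)  # (type, (first, last), flag, label)


def parse_range(text):
    """Turn "39-85" into (39, 85) and "6" into (6, 6).

    Assumption: no more elaborate delineation syntax is present.
    """
    first, _, last = text.partition("-")
    return int(first), int(last or first)


def parse_task_file(text):
    """Collect the .VERSION, .INCLUDE and .SEGMENT lines of a task file."""
    result = TaskFile()
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(".VERSION"):
            result.version = line.split(None, 1)[1]
        elif line.startswith(".INCLUDE"):
            result.includes.append(line.split(None, 1)[1])
        elif line.startswith(".SEGMENT"):
            match = SEGMENT_RE.match(line)
            if match:
                kind, delineation, flag, label = match.groups()
                result.segments.append(
                    (kind, parse_range(delineation), flag, label)
                )
    return result


SAMPLE = """.VERSION 1.0
.INCLUDE cea/ced0.doc
.INCLUDE cea/data/Asundi.dat
.SEGMENT TEXT 39-85 ? "EZ-SOFT has a legitimate case for sueing ."
.SEGMENT PAGE 13-22
"""

if __name__ == "__main__":
    parsed = parse_task_file(SAMPLE)
    print(parsed.version, parsed.includes)
    for segment in parsed.segments:
        print(segment)
```

The .INCLUDE paths are stored exactly as written; resolving them against the `include' directory is left to the caller.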
All include files that are referenced in data files are relative to the
directory "include" in the top-level directory.

An example data file (data/8/cea/Asundi.dat) is:

    .VERSION 1.0
    .INCLUDE cea/ced0.doc
    .INCLUDE cea/ced0.lex
    .INCLUDE cea/data/Asundi.dat
    .SEGMENT TEXT 39-85 ? "EZ-SOFT has a legitimate case for sueing ."
    .SEGMENT TEXT 183-213 ? "The intellectual property rules are very confusing ."
    .SEGMENT TEXT 278-318 ? "Disputing Clause 7 would probably be justified ."
    .SEGMENT TEXT 319-357 ? "Advertising without prior consent is illegal ."

2.3 Expected Differences from Test Sets

There will be several differences between the training set and the test
sets. First, the .SEGMENT lines will obviously not have the truth
strings. In addition, the sources of the data will not be apparent --
either the directories named with the ID's will be mapped to something
else, or the files in each task won't be organized into subdirectories
based on their source data set at all. Also, not all samples in the test
sets will appear in more than one task even when that is possible (i.e.
not all digits will appear in tasks 1b and 2) -- the training set is
organized that way to maximize the amount of data to train on for each
task. Finally, the test sets will be organized further into
writer-independent and writer-dependent tests if enough data exists.

3. DOWNLOADING
==============

The complete tar archive is just over 180 Mb, and it is now only
available in compressed form. The files train_r01_v05.tar.Z (60 Mb) and
train_r01_v05.tar.gz (50 Mb) are the UNIX-compressed and GNU
gzip-compressed versions of the tar archive respectively.

If you experience timeouts or broken network connections during FTP
sessions, you can download the GNU gzip-compressed archives in pieces of
1 Mb or less, then concatenate them back together locally. The files are
in the "split" subdirectory. The UNIX-compressed tar archive is not
available in 1 Mb pieces.
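For those who would rather script the reassembly, the concatenation step can be sketched as below. This is an illustration only: the function names `join_pieces` and `blocks_512` are hypothetical, and the piece naming (the archive name followed by a two-digit suffix such as ".00", ".01", ...) is an assumption carried over from how the pieces of an earlier release were named; check the actual names in the "split" subdirectory. The block count helper assumes that the second column printed by /usr/5bin/sum is the file size in 512-byte blocks, rounded up, which is consistent with the sizes and checksums listed in this announcement.

```python
import os
import shutil
from pathlib import Path


def join_pieces(piece_dir, archive_name, dest):
    """Concatenate <archive_name>.NN pieces, in suffix order, into dest.

    Returns the size of the reassembled file in bytes so that it can be
    compared against the published listing.
    """
    # Assumption: pieces carry a zero-padded two-digit numeric suffix,
    # so a plain lexicographic sort puts them in the right order.
    pieces = sorted(Path(piece_dir).glob(archive_name + ".[0-9][0-9]"))
    if not pieces:
        raise FileNotFoundError("no pieces found for " + archive_name)
    with open(dest, "wb") as out:
        for piece in pieces:
            with open(piece, "rb") as src:
                shutil.copyfileobj(src, out)
    return os.path.getsize(dest)


def blocks_512(nbytes):
    """Number of 512-byte blocks needed to hold nbytes (rounded up)."""
    return -(-nbytes // 512)


if __name__ == "__main__":
    size = join_pieces("split", "train_r01_v05.tar.gz", "train_r01_v05.tar.gz")
    print(size, blocks_512(size))
```

Comparing the returned size and block count with the listing is a quick first check; it does not replace running the checksum program itself.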
Those without GNU gunzip can download gzip from the GNU FTP server
prep.ai.mit.edu (in pub/gnu/gzip-1.2.4.tar) and compile it. For MS-DOS
systems, a binary pub/gnu/gzip-1.2.4.msdos.exe is there, and I assume
that does decompression under MS-DOS, although I have not checked that.

To download the training data from our FTP server:

    $ ftp sequoyah.ncsl.nist.gov
    Name: anonymous
    Password:
    ftp> binary
    ftp> cd outgoing/unipen/train_r01_v05

and then either:

    ftp> get train_r01_v05.tar.gz
    ftp> quit

or:

    ftp> get train_r01_v05.tar.Z
    ftp> quit

or:

    ftp> cd split
    ftp> prompt
    ftp> mget *
    ftp> quit

[Note: you will not be able to list the directories `outgoing' or
`unipen', but downloading the file does not require read permissions for
the directory.]

A complete list of files available for downloading is below, followed by
checksums.

Listing (outgoing/unipen/train_r01_v05):

    % ls -lg train_*
    -rw-r--r-- 1 root 60330167 Jul 11 20:29 train_r01_v05.tar.Z
    -rw-r--r-- 1 root 50457943 Jul 11 19:58 train_r01_v05.tar.gz

Checksums (as generated by /bin/sum on SunOS 4.1.1):

    % /bin/sum train_*
    05462 58917 train_r01_v05.tar.Z
    31471 49276 train_r01_v05.tar.gz

Checksums (as generated by /usr/5bin/sum on SunOS 4.1.1):

    % /usr/5bin/sum train_*
    5337 117833 train_r01_v05.tar.Z
    10731 98551 train_r01_v05.tar.gz

Please check that the downloaded files' sizes and checksums match those
above to verify that the FTP transfers completed without error.

While the final amount of training data is not yet known, I expect it to
be close to one gigabyte, uncompressed.

-- Stan Janet, NIST
stan@magi.nist.GOV

*******************************************************************************

APPENDIX: DATA CORRECTIONS (since previous release)
==========================

Most of this is from my logs, but some of the details are from memory.
If the donators need more information in order to correct their local
databases, feel free to contact me.
abm

  * labels were all uppercase, but samples were
    uppercase-first-character lowercase-rest with a few exceptions; I
    converted the labels to that format, then manually converted the
    exceptions' labels to the proper case
  * sm0.dat had an empty sample with a bad segment delineation range
    "554-553"; deleted
  * js0.dat had a sample that was too poor to use (just a squiggle
    really); commented out the .SEGMENT line

aga

  * 40 of the 45 files cause upread to exit due to too few components;
    I deleted .SEGMENT's and their components that were not complete
  * 2 of the other 5 files had bad ends, which I truncated starting with
    the last .START_BOX/.SEGMENT

apc, apd, ape

  * 59 of the "apc" files were missing some components, e.g. in file
    apc00/app0049.dat, there were 20 components present, but the line:

        .SEGMENT TEXT 0-22 ? "Hose Hunters I Identify"

    indicates 23 should be there. The same problem was present in 299
    files from set "apd" and 178 files from set "ape".
  * There were also some labeling errors. For example
    apc/apc22/2273.dat has:

        .SEGMENT TEXT 0-40 ? "Send me a memo so I can look into some things"
        .SEGMENT WORD 0-3 ? "Send"
        .SEGMENT WORD 4-5 ? "me"
        .SEGMENT WORD 6 ? "a"
        .SEGMENT WORD 6-9 ? "memo"
        .SEGMENT WORD 10-11 ? "so"
        .SEGMENT WORD 12-14 ? "I"
        .SEGMENT WORD 12-14 ? "can"
        .SEGMENT WORD 15-19 ? "look"
        .SEGMENT WORD 20-24 ? "into"
        .SEGMENT WORD 25-28 ? "some"
        .SEGMENT WORD 29-36 ? "things"

    and there are no WORD's "a" or "I" in the handwriting sample. The
    object labeled "a" is actually the "m" in "memo", and the "I" is
    actually a copy of the word "can". This labeling overlap problem
    occurs in all the files with too few components, and only those
    files. The bad files were removed.
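Donators who want to look for the same two symptoms in their local copies could start from something like the sketch below. It is not the procedure that was used to find the problems above; the helper names `missing_components` and `find_overlaps` are hypothetical, and segments are assumed to have already been reduced to (first, last, label) tuples with simple numeric ranges. The WORD list is the one quoted above from apc/apc22/2273.dat.

```python
def missing_components(last_index, components_present):
    """How many components a 0-based delineation expects beyond those present.

    Example from the text: ".SEGMENT TEXT 0-22" implies 23 components,
    so a file holding only 20 is short by 3.
    """
    return max(0, last_index + 1 - components_present)


def find_overlaps(segments):
    """Return pairs of same-level segments that claim a common component.

    segments: iterable of (first, last, label) tuples, all of one
    hierarchy level (e.g. all WORD). WORDs of one TEXT are expected to be
    disjoint, so any pair returned here deserves a closer look.
    """
    ordered = sorted(segments)
    overlaps = []
    for i, (first_a, last_a, label_a) in enumerate(ordered):
        for first_b, last_b, label_b in ordered[i + 1:]:
            if first_b > last_a:
                break  # sorted by start, so nothing later can overlap
            overlaps.append((label_a, label_b))
    return overlaps


WORDS_2273 = [
    (0, 3, "Send"), (4, 5, "me"), (6, 6, "a"), (6, 9, "memo"),
    (10, 11, "so"), (12, 14, "I"), (12, 14, "can"), (15, 19, "look"),
    (20, 24, "into"), (25, 28, "some"), (29, 36, "things"),
]

if __name__ == "__main__":
    print(missing_components(22, 20))
    print(find_overlaps(WORDS_2273))
```

On the quoted file the overlap check reports the pairs ("a", "memo") and ("I", "can"), which are exactly the two mislabeled objects described above.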
att * These files had first .SEGMENT lines not starting at 0, so I added dummy components: data/att0/a0411314.dat .SEGMENT PAGE 13-22 data/att0/a0411454.dat .SEGMENT PAGE 59-72 data/att0/a0412414.dat .SEGMENT PAGE 44-57 data/att0/a0512534.dat .SEGMENT PAGE 106-172 data/att0/a0513084.dat .SEGMENT PAGE 113-150 data/att0/a0513134.dat .SEGMENT PAGE 182-237 data/att0/a1412404.dat .SEGMENT PAGE 114-169 * Files missing one or more .PEN_DOWN's after a .SEGMENT (which I added): a1412404.dat a0513134.dat a0512534.dat (2) * File a0513134.dat also had a bad label (" {" was not present, so I removed it from the label): .SEGMENT TEXT 182-214 ? "\" Ready ! \\ n \" ) ; for ( ; ) {" * Files with component gap (and # of missing components): a0512534.dat 150 a0412414.dat 29 a0411454.dat 14 * File data/a0412594.dat had unclosed final component, which I deleted because the image was missing the trailing comma in the label * File data/a0413194.dat had an extra component, and due to misnumbering of components, characters from one .SEGMENT were appearing in others. I fixed by manually renumbering where needed. Then I commented out the .DT lines because I no longer trusted them. Then, the writing at end of file was garbage so I commented it out. * File data/22412463.dat had two extra components, and trailing left-paren in the first "if ( argc ) if ( argv )" was in the next segment. 
This file also had an unclosed component which I commented out * These files had too few components (couldn't be viewed with the browser), so dummy components "0 0" were added into gaps components gaps file actual labels size labels ================================================================================ a0411384.dat 49 | 0 63 15 10-15 31-41 a0412134.dat 153 | 0 162 10 0-20 31-50 a0412304.dat 98 | 0 135 38 0-17 18-35 and 36-51 90-117 a0412464.dat 148 | 0 171 24 37-51 76-87 a0511154.dat 75 | 0 107 33 0-35 69-107 a0511284.dat 118 | 0 135 18 79-93 112-135 a1411174.dat 161 | 0 189 29 70-87 117-146 a1511484.dat 92 | 0 111 20 0-19 40-75 a1511574.dat 513 | 0 1068 556 38-71 111-160 and 379-419 448-482 710-766 969-994 1055-1068 a1512314.dat 61 | 0 84 24 16-31 57-84 [had repeated coordinates 530 1815] a1512444.dat 289 | 0 337 49 44-110 160-187 a1512534.dat 108 | 0 130 23 24-81 105-130 a1513104.dat 182 | 0 272 91 65-87 124-152 184-209 234-272 a1911264.dat 81 | 0 95 15 56-64 80-95 a1912024.dat 307 | 0 397 91 60-100 177-214 and 252-278 295-305 [had an unclosed component after .SEGMENT 196 ";" which became 196-197] a1912094.dat 208 | 0 333 126 44-69 149-173 and 250-266 314-333 a1912314.dat 139 | 0 173 35 103-123 159-173 bba, bbb, bbc, bbd * The coordinate system was wrong, resulting in samples being upside-down. 
* Two bbc files had been truncated, so I removed them: bbc/data/bcpr/bcpr138.dat bbc/data/mecr/mecr147.dat * A few bbd files were corrupt -- sequences of negative y-values in components that seemed to be extras, so I deleted those components: jesr/jesr170.dat (0-3 --> 0-2) rgbr/rgbr294.dat (0-68 --> 0-65) rgbr/rgbr486.dat (0-32 --> 0-31) slbr/slbr679.dat (0-57 --> 0-56) lex * Labeling errors: .PEN_UP components were not counted in .SEGMENT delineation, and empty components were counted _sometimes_ ; fixed by renumbering phi * For some files, the first sample was from another writer, so I commented that .SEGMENT out phi0: all files except the first, hh1.dat phi1: 28 of 44 files phi2: all files except the first, se1.dat * File data/phi2/se1.dat .SEGMENT #78 was truncated as "pinning" so I commented it out * File data/phi0/mm2.dat .SEGMENT #37 "decided" was truncated as "d" so I commented it out pri * All files ended with components, not a .PEN_UP. (The Call for Data was not clear on this, and it turns out that requiring a pen lift at the end of a sample makes sense and avoids some ambiguity when multiple samples are present in a single file.) For pri0 and pri1, I simply appended a .PEN_UP. For pri2 and pri3, I changed trailing the .START_BOX to .PEN_UP, except for 000.0.dat which had no trailing .START_BOX so I just appended a .PEN_UP * 6 files with TEXT .SEGMENT's had a trailing blank in the prompt string, which I deleted * File data/pri3/163.0.dat had illegal escape sequence in .SEGMENT: "... the 0\00B0 meridian ..." so I removed the file scr * One file had a corrupt line which was fixed based on its context: scr/data/d1/ww000052.dat:51.:?8 2964 [changed to "518 2964"] * Three files had garbage lines: scr/data/d1/ww000172.dat: ????{?h??|?U" scr/data/d2/ww000145.dat:S?V|?5?V|?s?V|eg?V}?{z?|?Z?|?2?}?d?{?e??|?r??}?{???|? ??}?f??{?g??|y??|?9??}?{?h??|?1??}#{?c??|Rl??|?C??|?I??|?1??|,??|?(??|`'??}{'??|?,??|?l" scr/data/d2/ww000146.dat:9??|?7??}?{"
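Corrupt and garbage lines like the ones in the "scr" files can be caught mechanically before they reach a recognizer. The following is a rough sketch under stated assumptions, not a description of how the lines above were found: the name `suspicious_lines` is hypothetical, keyword lines are assumed to start with a period followed by an uppercase name, and coordinate lines are assumed to hold two or more whitespace-separated integers. Files with multi-line keyword arguments (long comments or lexicons, for example) would need a more careful check, since their continuation lines would be flagged too.

```python
import re

# A keyword line: ".PEN_DOWN", ".SEGMENT TEXT ...", ".H_LINE 137", ...
KEYWORD_RE = re.compile(r"^\.[A-Z][A-Z0-9_]*(\s|$)")
# A coordinate line: two or more integers, negative values allowed.
COORD_RE = re.compile(r"^\s*-?\d+(\s+-?\d+)+\s*$")


def suspicious_lines(text):
    """Return (line_number, line) for lines that are neither keywords
    nor integer coordinates. Blank lines are ignored."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        if KEYWORD_RE.match(line) or COORD_RE.match(line):
            continue
        flagged.append((number, line))
    return flagged


EXAMPLE = "\n".join([
    ".PEN_DOWN",
    "85 379",
    "51.:?8 2964",      # the corrupt line quoted above for ww000052.dat
    "83 381",
    ".PEN_UP",
    '????{?h??|?U"',    # a garbage line of the kind quoted above
])

if __name__ == "__main__":
    for number, line in suspicious_lines(EXAMPLE):
        print(number, repr(line))
```

A check like this only locates the damage; deciding whether a line can be repaired from context, as with "518 2964", still takes a human look.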
From: Stan Janet
Subject: UNIPEN update
Resent-to: schomake@plinux2.psych.kun.nl, vuurpijl@kunhp1.psych.kun.nl
To: unipen-donators@magi.nist.GOV
Resent-message-id: <01I8VKQG3UBO9LWW8A@PSYCH.KUN.NL>
Message-id: <9608300409.AA02133@magi.nist.gov>
Organization: National Institute of Standards and Technology, Gaithersburg, MD
Dear UNIPEN contributor,
I thought it would be a good time for me to update you on the state of
the benchmark. I believe the data is now more than 99% correct, and
therefore now is clearly the time to partition out development and
benchmark test sets. I don't see that taking more than a few days to
code, and a day or two to execute, so I'm confident you can expect a
development test set next week. I hope that, by now, every participant
can process the data contributed by every other, or is close to being
able to do so.
In the interest of setting a schedule for the release of benchmark test
data, the publishing of UNIPEN data on CD-ROM, a conference to review
benchmark results, etc., I need your feedback. What month would you be
ready to process, for example, a test set consisting of characters
only? Assuming a few months are needed to complete all benchmark testing
and a few more for results to be compiled, when (starting next Spring)
would you be able to attend a conference? It would be held at NIST, in
Gaithersburg, MD, and needs to be planned several months in advance.
Thanks to all the people who have reported a very few remaining errors
in the latest training set. Anyone who knows of others should let me know as
soon as possible. Due to time constraints, errors found after September
will have to be dealt with by either removing the files or moving them
to a shadow hierarchy where they could potentially be fixed later by
the contributor.
- Stan Janet
stan@magi.nist.gov
schomaker@nici.kun.nl