3rd Korea-Japan Joint Workshop on Pattern Recognition (KJPR2008), Seoul, Korea, pp. 99-100, November 2008.
Character Pattern Retrieval to Support Reading Mokkans

Akihito Kitadai†, Masaki Nakagawa†, Hajime Baba†† and Akihiro Watanabe††
†Tokyo University of Agriculture and Technology, Naka-cho, Koganei, Tokyo, Japan
E-mail: [email protected]
††Nara National Research Institute for Cultural Properties, Nijo-cho, Nara, Japan
E-mail: [email protected]

Extended Abstract

This paper presents a character pattern retrieval method to support reading damaged character patterns on mokkans. “Mokkan” is a generic name for a kind of Japanese historical document: a wooden tablet bearing handwritten characters. Over 180,000 mokkans have been found in and around the Heijo Palace site, the capital of Japan from A.D. 710 to 784. Since almost all of the mokkans have been excavated from under the ground, many of them have stained or broken parts (Figure 1). For that reason, we frequently find damaged character patterns on the mokkans, and reading them is difficult even for archaeologists and historians. We have therefore proposed a basic method of character pattern retrieval to support reading mokkans (Figure 2) [1]. This paper extends the method for practical use.

A retrieval key consists of black, white and gray pixels. The black and white pixels form a damaged character pattern image, in which the black pixels are the ink parts. The gray pixels mark the stained or broken parts inferred by the user; hereafter, we call the zone of gray pixels the “gray-zone.” Archaeologists and historians can use an electronic pen or a mouse to add gray-zones to the damaged character patterns.
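To make the key representation concrete, the following is a minimal sketch in Python/NumPy. The integer labels (0 = white, 1 = black, 2 = gray) and the function name make_key are illustrative assumptions, not part of the proposed system.

```python
import numpy as np

# Hypothetical pixel labels (an assumption, not from the paper):
# 0 = white (background), 1 = black (ink), 2 = gray (damaged).
WHITE, BLACK, GRAY = 0, 1, 2

def make_key(binary_image: np.ndarray, gray_mask: np.ndarray) -> np.ndarray:
    """Combine a binarized character image with a user-drawn damage mask."""
    key = np.where(binary_image > 0, BLACK, WHITE).astype(np.uint8)
    key[gray_mask] = GRAY          # the user's gray-zone overrides ink/background
    return key

# Example: a 4x4 pattern whose right half is marked as damaged.
img = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]])
mask = np.zeros((4, 4), dtype=bool)
mask[:, 2:] = True                 # right half inferred as broken
print(make_key(img, mask))
```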
Figure 1. Mokkans excavated from the Heijo Palace site.

Figure 2. Process of character pattern retrieval (key with gray-zone → non-linear normalization → feature extraction and similarity evaluation → retrieval results).
In the non-linear normalization step, our method creates two images from the key: one changes the gray pixels to black, and the other changes them to white. For each image, a transforming function of non-linear normalization is obtained; we employ the line density equalization method, which has shown good results in handwritten Kanji recognition [2]. The transforming function for the key, Tkey, is given by formula (1), where Tblack is the transforming function obtained from the black image and Twhite is the one obtained from the white image.
Tkey = w × Tblack + (1 − w) × Twhite,  0 ≤ w ≤ 1.    (1)
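As an illustration of formula (1), the sketch below combines the two transforming functions, assuming each is represented as a 1-D monotone coordinate mapping produced by a non-linear normalization routine such as line density equalization [2]; the array representation and the example values are assumptions for illustration only.

```python
import numpy as np

def combine_transforms(t_black: np.ndarray, t_white: np.ndarray, w: float) -> np.ndarray:
    """Formula (1): Tkey = w * Tblack + (1 - w) * Twhite, 0 <= w <= 1.

    t_black, t_white: 1-D monotone coordinate mappings obtained by
    non-linear normalization of the key with its gray pixels treated
    as black and as white, respectively.
    """
    assert 0.0 <= w <= 1.0
    return w * t_black + (1.0 - w) * t_white

# Toy 8-point mappings (illustrative values, not real line-density output).
t_black = np.array([0.0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.0])
t_white = np.array([0.0, 0.5, 1.5, 3.0, 4.0, 5.5, 6.5, 7.0])
t_key = combine_transforms(t_black, t_white, w=0.5)
print(t_key)
```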
In the feature extraction step, we extract 4-directional features from the black and white pixels of the normalized key [3]. After the extraction, the feature averaged over those pixels is assigned to every gray pixel. We then create a 256-dimensional feature vector (8 × 8 cells × 4 directions).

We now describe the experiments. We employed 2,108 character patterns from the mokkans (Figure 3), covering 309 categories frequently found on mokkans. We also employed 10 mask patterns (Figure 4). When we use gray as the color of the mask patterns, we obtain quasi-damaged character patterns (q-DCPs) with gray-zones; in contrast, white mask patterns produce q-DCPs without gray-zones (Figure 5).
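The following is a simplified sketch of the feature extraction step described above. The gradient-based 4-direction decomposition is a stand-in assumption for the multiresolution features of [3]; the gray-pixel averaging and the 8 × 8 cells × 4 directions pooling follow the description in the text.

```python
import numpy as np

def directional_feature_vector(key: np.ndarray) -> np.ndarray:
    """Sketch: per-pixel 4-directional features, gray pixels filled with the
    average feature, then pooled into 8x8 cells -> 256 dimensions.

    key: 2-D array of {0 (white), 1 (black), 2 (gray)}.
    """
    h, w = key.shape
    ink = (key == 1).astype(float)
    gray = (key == 2)

    # Crude 4-direction decomposition from the ink gradient
    # (an assumed stand-in for the feature extraction of [3]).
    gy, gx = np.gradient(ink)
    angle = np.arctan2(gy, gx)                       # [-pi, pi]
    bins = ((angle + np.pi) / (np.pi / 2)).astype(int) % 4
    mag = np.hypot(gx, gy)
    feat = np.zeros((h, w, 4))
    for d in range(4):
        feat[..., d] = np.where(bins == d, mag, 0.0)

    # Assign the average feature of the non-gray pixels to every gray pixel.
    if gray.any() and (~gray).any():
        feat[gray] = feat[~gray].mean(axis=0)

    # Pool into 8x8 cells: 8 * 8 * 4 = 256-dimensional feature vector.
    cells = np.zeros((8, 8, 4))
    ys = np.linspace(0, h, 9, dtype=int)
    xs = np.linspace(0, w, 9, dtype=int)
    for i in range(8):
        for j in range(8):
            cells[i, j] = feat[ys[i]:ys[i+1], xs[j]:xs[j+1]].sum(axis=(0, 1))
    return cells.ravel()                             # shape (256,)
```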
Figure 3. Character patterns of mokkans.

Figure 4. Mask patterns.

Figure 5. Process to create a q-DCP: a mask pattern (gray or white) is applied to a non-masked character pattern.
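A minimal sketch of the q-DCP creation process of Figure 5, reusing the hypothetical pixel labels from the earlier sketch: the region covered by a mask pattern is painted either gray or white.

```python
import numpy as np

WHITE, BLACK, GRAY = 0, 1, 2   # same hypothetical labels as above

def make_qdcp(pattern: np.ndarray, mask: np.ndarray, with_gray_zone: bool) -> np.ndarray:
    """Apply a mask pattern to a character pattern.

    pattern: 2-D array of {0 (white), 1 (black)}, a non-masked character.
    mask:    2-D boolean array, True where the mask pattern covers the character.
    with_gray_zone=True  -> masked pixels become gray  (q-DCP with gray-zone).
    with_gray_zone=False -> masked pixels become white (q-DCP without gray-zone).
    """
    qdcp = pattern.astype(np.uint8).copy()
    qdcp[mask] = GRAY if with_gray_zone else WHITE
    return qdcp
```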
Table 1 shows the cumulative rates of character pattern retrieval within the 3rd, 5th and 10th candidates. Since we employed the leave-one-out method for every character pattern, the total number of trials was 2,108 for the non-masked character patterns and 21,080 (= 2,108 × 10) for the q-DCPs. We conducted two experiments for q-DCPs with gray-zones: one fixed the value of w to 0.5, and the other chose the optimal value of w among 0/0.25/0.5/0.75/1 in each trial. From these results, we consider that our method with gray-zones provides good support for reading the mokkans.

Table 1. Cumulative rates of character pattern retrieval within the 3rd/5th/10th candidates.

        Non-masked               q-DCPs                   q-DCPs with gray-zone    q-DCPs with gray-zone
        character patterns       without gray-zone        (w = 0.5)                (w: optimized)
3rd     66.2% (1,396/2,108)      30.9% (6,523/21,080)     45.8% (9,648/21,080)     57.9% (12,200/21,080)
5th     71.1% (1,498/2,108)      35.9% (7,565/21,080)     51.1% (10,774/21,080)    63.0% (13,269/21,080)
10th    76.0% (1,601/2,108)      43.4% (9,139/21,080)     58.6% (12,359/21,080)    69.6% (14,668/21,080)
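For reference, the cumulative rates in Table 1 can be computed as below. The data structures are hypothetical; for the rightmost column of Table 1, the ranked candidate list of the best w among {0, 0.25, 0.5, 0.75, 1} would be used for each trial.

```python
def cumulative_rate(ranked_labels, true_labels, k):
    """Share of trials whose correct category appears within the top-k
    retrieval results (the cumulative rates reported in Table 1)."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_labels, true_labels))
    return hits / len(true_labels)

# Hypothetical example with three retrieval trials.
ranked = [["魚", "鯛", "鯖"], ["鯖", "魚", "鯛"], ["鯛", "鯖", "魚"]]
truth = ["魚", "鯛", "鯛"]
print(cumulative_rate(ranked, truth, k=1))   # 2/3: trials 1 and 3 hit at rank 1
print(cumulative_rate(ranked, truth, k=3))   # 3/3: all correct categories within top 3
```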
Keywords: Character pattern retrieval, Mokkan, Archaeology.
Acknowledgement This work was supported by Grant-in-Aid for Scientific Research (S)-20222002 and Grant-in-Aid for Young Scientists (B)-19720202.
References
[1] M. Nakagawa, K. Saito, A. Kitadai, J. Tokuno, H. Baba and A. Watanabe, “Damaged character pattern recognition on wooden tablets excavated from the Heijyo palace site,” Proc. 10th IWFHR, La Baule, France, vol. I, pp. 533-538, 2006.
[2] H. Yamada, K. Yamamoto and T. Saito, “A Nonlinear Normalization Method for Handprinted Kanji Character Recognition --- Line Density Equalization ---,” Proc. 9th ICPR, Rome, Italy, pp. 172-175, 1988.
[3] C.-L. Liu, Y.-J. Liu and R.-W. Dai, “Multiresolution statistical and structural feature extraction for handwritten numeral recognition,” Pre-Proc. 5th IWFHR, Colchester, England, pp. 61-66, 1996.