OHSCR: Benchmarks Dataset for Offline Handwritten Sindhi Character Recognition

Authors

  • Jakhro Abdul Naveed, Department of Information Technology, Shaheed Benazir Bhutto University, Naushahro Feroze Campus, Sindh, Pakistan
  • Mudasar Ahmed Soomro, Department of Information Technology, Shaheed Benazir Bhutto University, Naushahro Feroze Campus, Sindh, Pakistan
  • Leezna Saleem, College Education Department Karachi, Govt. of Sindh
  • Muhammad Khalid Shaikh, Department of Information Technology, University of Sindh, Khan Bhadur Syed Allahndo Shah, Naushahro Feroze Campus, Sindh, Pakistan

Keywords:

Benchmark Dataset, Handwritten Character Recognition, Pattern Recognition, Machine Learning, Sindhi Language

Abstract

This work presents a dataset for offline handwritten Sindhi character recognition. It contains 7,800 character images in total, organized into categories and written by 150 writers of various ages, genders, and professional backgrounds. Each writer wrote the 52 Sindhi characters on a specially designed form, which gives 150 × 52 = 7,800 images. All of the written samples were scanned with a high-quality scanner. Each handwritten character was then cropped from the collected forms, and the cropped images were saved in ‘.png’ format. Several preprocessing techniques were applied to remove noise from the scanned images and produce a clean dataset. The dataset is divided into a training set and a test set, with 80% for training and 20% for testing. The dataset will be made publicly available for the benefit of the Sindhi research community. It can be used to build and test handwritten character recognition systems for the Sindhi language and to support writer identification. The authors describe it as the first openly accessible dataset for handwritten Sindhi character research.
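The dataset arithmetic and the 80/20 split described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the split was drawn, so the writer-level hold-out used here is an assumption, as are the names `build_index` and `split_by_writer` and the integer writer and character IDs; none of them come from the paper.

```python
import random

NUM_WRITERS = 150      # stated in the abstract
NUM_CHARACTERS = 52    # stated in the abstract
TRAIN_FRACTION = 0.8   # stated in the abstract (80% training, 20% testing)


def build_index(num_writers=NUM_WRITERS, num_characters=NUM_CHARACTERS):
    """Return one (writer_id, char_id) pair per cropped character image.

    The integer IDs are hypothetical placeholders for the real file names.
    """
    return [(w, c) for w in range(num_writers) for c in range(num_characters)]


def split_by_writer(index, train_fraction=TRAIN_FRACTION, seed=0):
    """Split samples so that no writer appears in both sets.

    ASSUMPTION: the paper does not specify the split strategy; holding out
    whole writers is one common choice for handwriting data.
    """
    writers = sorted({w for w, _ in index})
    rng = random.Random(seed)
    rng.shuffle(writers)
    n_train = round(len(writers) * train_fraction)
    train_writers = set(writers[:n_train])
    train = [s for s in index if s[0] in train_writers]
    test = [s for s in index if s[0] not in train_writers]
    return train, test


index = build_index()
train_set, test_set = split_by_writer(index)
print(len(index), len(train_set), len(test_set))  # 7800 6240 1560
```

Holding out whole writers keeps a model from being tested on handwriting styles it has already seen. A per-image random split would give the same 6,240 / 1,560 counts without that property.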

Published

2024-05-02

How to Cite

Abdul Naveed, J., Soomro, M. A., Saleem, L., & Shaikh, M. K. (2024). OHSCR: Benchmarks Dataset for Offline Handwritten Sindhi Character Recognition. Sir Syed University Research Journal of Engineering & Technology, 14(1), 55–61. Retrieved from https://sirsyeduniversity.edu.pk/ssurj/rj/index.php/ssurj/article/view/618