Developing OCR for ancient scripts like Tamizhi (Tamil-Brahmi) and Kurdish historical texts is uniquely challenging due to character complexity, noise in source materials, and the lack of specialized datasets. Recent research using AI models such as LSTM, CNN, and fine-tuned Tesseract systems shows promising results, with Tamizhi OCR achieving over 91% accuracy. While no Kurdish-specific OCR exists yet, leveraging pre-trained Arabic models offers a practical pathway. These findings highlight the importance of tailored datasets, advanced machine learning techniques, and ongoing research in preserving and digitizing historical documents.

Building OCR Systems for Tamizhi and Kurdish Historical Documents

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

  2. Related work and 2.1 Arabic/Persian

    2.2 Chinese/Japanese and 2.3 Coptic

    2.4 Greek

    2.5 Latin

    2.6 Tamizhi

  3. Method and 3.1 Data Collection

    3.2 Data Preparation and 3.3 Preprocessing

    3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

  4. Experiments, Results, and Discussion and 4.1 Processed Data

    4.2 Dataset and 4.3 Experiments

    4.4 Results and Evaluation

    4.5 Discussion

  5. Conclusion

    5.1 Challenges and Limitations

    Online Resources, Acknowledgments, and References

2.6 Tamizhi

According to Munivel and Enigo (2022), digitizing ancient documents typically involves OCR. OCR for Tamizhi documents is difficult, however, because many characters share similar shapes and structures and differ only in subtle ways. The Tamizhi script, also known as Tamil-Brahmi, is one of the oldest scripts in India and the precursor to numerous modern Indian scripts. The abundance of combined characters compounds the difficulty: a character can consist of a single vowel, a single consonant, or a combination of both. The authors describe an OCR system designed specifically for printed Tamizhi documents, which aims to perform well despite poor document quality, noise, and diverse input formats. They report an accuracy of 91.12% for printed text, a promising result for recognizing Tamizhi characters.

To summarize, as of the time of writing, the literature reports no effort to develop OCR specifically for historical Kurdish documents. Nor is any accessible dataset available for training OCR systems to extract text from such documents. This significantly restricts our options in selecting the most suitable approach for our study.

To develop OCR systems tailored to historical documents, researchers have employed techniques such as SVM, LSTM, and CNN. The reported results vary, reaching a maximum of 99.7% CLA. This variability can be attributed to several factors: the quality of the dataset used, the methodology followed in developing the OCR system, and the intrinsic complexity of the documents being processed.
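
Accuracy figures of this kind are commonly derived from the edit distance between the OCR output and a ground-truth transcription. The following is a minimal sketch of such a character-level measure, assuming accuracy is defined as 1 minus the character error rate. The reviewed studies may define their metrics differently, and the function names `edit_distance` and `char_accuracy` are illustrative, not taken from any of them.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(ref: str, hyp: str) -> float:
    """Character-level accuracy, assumed here to be 1 - CER, clipped at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    cer = edit_distance(ref, hyp) / len(ref)
    return max(0.0, 1.0 - cer)


if __name__ == "__main__":
    # One substituted character out of four.
    print(char_accuracy("abcd", "abxd"))
```

Because the distance is computed over Unicode code points, a combined character that is encoded as several code points counts as several units; whether that matches a given study's notion of a "character" is a further assumption.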

The studies reviewed in this chapter used both proprietary datasets created by the researchers themselves and publicly available datasets. These datasets include TWDB, HWDB, GT4HistOCR, Stockholm Archive, Dunhuang data, Tripitaka, TKH, MTH, and Kana-PRMU. The literature shows ongoing efforts to improve OCR techniques for different kinds of historical documents.

Based on our review, LSTM is a widely adopted approach for developing OCR systems with acceptable accuracy. We therefore used the latest version of Tesseract, which integrates LSTM, in our project. We also found pre-trained models that can be fine-tuned on our dataset. Given the similarities between the Kurdish and Arabic scripts, we chose an Arabic pre-trained model as our base model.
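
As an illustration of fine-tuning from an Arabic base model, the sketch below assembles the two Tesseract command lines typically involved: `combine_tessdata -e` to extract the LSTM network from a `.traineddata` file, and `lstmtraining --continue_from` to resume training on new data. The file paths, the output name `kurdish`, and the iteration count are hypothetical placeholders, not values from this study. The script only builds and prints the commands; it does not run them.

```python
import shlex
from typing import List


def extract_lstm_cmd(base_traineddata: str, lstm_out: str) -> List[str]:
    """Command that extracts the LSTM network from a .traineddata file."""
    return ["combine_tessdata", "-e", base_traineddata, lstm_out]


def finetune_cmd(base_traineddata: str, base_lstm: str, train_list: str,
                 model_output: str, max_iterations: int = 400) -> List[str]:
    """Command that continues LSTM training from an existing base model."""
    return [
        "lstmtraining",
        "--continue_from", base_lstm,         # network extracted above
        "--traineddata", base_traineddata,    # base model's traineddata
        "--train_listfile", train_list,       # list of .lstmf training files
        "--model_output", model_output,       # prefix for checkpoints
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    # All paths below are hypothetical placeholders.
    steps = [
        extract_lstm_cmd("tessdata_best/ara.traineddata", "work/ara.lstm"),
        finetune_cmd("tessdata_best/ara.traineddata", "work/ara.lstm",
                     "work/kurdish.training_files.txt", "work/kurdish"),
    ]
    for cmd in steps:
        print(shlex.join(cmd))
```

Building the commands as argument lists keeps them easy to inspect before handing them to a process runner, and avoids shell-quoting problems with paths that contain spaces.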

:::info Authors:

(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region - Iraq (blnd.yaseen@ukh.edu.krd);

(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region - Iraq (hosseinh@ukh.edu.krd).

:::


:::info This paper is available on arXiv under the Attribution-NonCommercial-NoDerivs 4.0 International license.

:::
