A growing, open, and community-driven corpus for the Mon language.
Built with care for language preservation, research, and future technologies.
Mon Corpus Collection is a curated set of Unicode-encoded Mon-language text, created to support:
- 📚 Linguistic research
- 🤖 Natural Language Processing (NLP)
- 🧠 Machine learning & AI experiments
- 🏛️ Digital humanities & cultural preservation
- 🌏 Open-access language resources
This repository exists to make Mon language data freely available for anyone who wants to learn, analyze, build, or experiment — no barriers, no gatekeeping.
- ✅ Clean Mon text in Unicode
- ✅ Ready to use in NLP pipelines
- ✅ Suitable for tokenization, training, and analysis
- ✅ Expandable and community-friendly structure
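As a rough sketch of how the text might be prepared before it enters a pipeline, the Python snippet below applies NFC normalization and measures how much of a line falls in the main Myanmar Unicode block (U+1000 to U+109F), where the Mon letters are encoded. The function names and the sample string are illustrative assumptions, not part of this repository.

```python
import unicodedata

# Main Myanmar block (U+1000..U+109F), which includes the Mon-specific letters.
MYANMAR_BLOCK = range(0x1000, 0x10A0)


def normalize_line(line: str) -> str:
    """Apply NFC normalization and trim surrounding whitespace."""
    return unicodedata.normalize("NFC", line).strip()


def myanmar_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Myanmar block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(ord(c) in MYANMAR_BLOCK for c in chars) / len(chars)


if __name__ == "__main__":
    # Hypothetical sample line; replace with lines read from a corpus file.
    sample = normalize_line("  \u1019\u1014\u103a  ")
    print(sample, myanmar_ratio(sample))
```

A low ratio on a given line can be a hint that it contains mixed-script or non-Mon content worth reviewing before training or tokenization.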
Whether you’re:
- building a tokenizer 🧱
- training a language model 🤓
- doing academic research 📖
- or preserving the Mon language digitally 🌾
—you’re welcome here.
Language is living.
This project treats the Mon language not as static data, but as a living archive — something to be used, remixed, studied, and carried forward by future generations.
Open data.
Open culture.
Open futures.
🆓 Free to use
🆓 Free to modify
🆓 Free to redistribute
You may use this corpus for any purpose — academic, commercial, experimental, or personal.
(Attribution is appreciated, but not required 💛)
Contributions are very welcome!
You can help by:
- Adding new Mon text sources
- Cleaning or normalizing data
- Improving documentation
- Sharing this resource with others
If you care about the Mon language, you belong here.
Janakh Pon
Htaw Mon
Low-resource languages deserve high-quality digital infrastructure.
By collecting and sharing Mon language data openly, this project helps ensure Mon is:
- represented in modern technology
- accessible to researchers worldwide
- preserved beyond physical archives
If you use this corpus in your work, research, or project, we’d love to hear about it.
Let’s build the future of the Mon language together 🌾✨