@_esilas, do you know of any existing Swahili dictionaries with parts of speech (noun, verb, etc.) that we can convert to an NLP++ .dict file?
Yes, there is a Swahili dictionary known as Kamusi, and it comes as a hard copy. The dictionary contains all possible Swahili words with a detailed explanation of each.
That is a hard copy, though. What we need to do is get it into digital form.
Here are a few comprehensive Swahili word lists available on GitHub:
- All Swahili Words Dictionary: A repository containing a text file with a collection of Swahili words.
- Swahili Wordlist: A word list that includes Swahili words among other language resources.
- Kamusi Project: A repository offering JSON and CSV data for a Swahili dictionary with over 16,600 words, including meanings, synonyms, and conjugations.
Do any of these have parts of speech?
Parts of speech according to the Swahili dictionary “Kamusi”:
English | Swahili |
---|---|
Noun (cat) | Nomino (paka) |
Pronoun (you) | Kiwakilishi cha nafsi (wewe) |
Verb (eat) | Kitenzi (kula) |
Adjective (big) | Kivumishi (kubwa) |
Adverb (quickly) | Kielezi (haraka) |
Preposition (from) | Kihusishi (kutoka) |
Conjunction (but) | Kiunganishi (ila) |
Interjection (wow!) | Kiingizi (lo!) |
In the GitHub repository Kalebu/kamusi, I was able to identify the eight parts of speech above, but some words still do not have a complete entry. For instance, lo! (“tamko la kuonyesha mshangao, furaha au hofu”, kefle!), roughly “an utterance expressing surprise, joy, or fear”, is an interjection, but the entry does not say which part of speech it belongs to.
In the GitHub repository odolezal/wordlists, the listed Swahili words come with no explanation of what they mean or which part of speech each word belongs to.
I’m thinking we could use an LLM to look up the part of speech of each word in the list, asking the question in English.
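A rough Python sketch of that idea, to make the discussion concrete. Everything named here is hypothetical: `ask_llm` stands in for whatever model client ends up being used, the short tags (`n`, `v`, ...) are illustrative, and the `word pos=tag` line shape is an assumption about the target .dict layout that should be checked against the NLP++ documentation.

```python
from typing import Callable, Iterable, List, Optional

# English part-of-speech names mapped to short tags.
# The tag spellings are assumptions for illustration only.
POS_TAGS = {
    "noun": "n",
    "pronoun": "pro",
    "verb": "v",
    "adjective": "adj",
    "adverb": "adv",
    "preposition": "prep",
    "conjunction": "conj",
    "interjection": "interj",
}


def build_prompt(word: str) -> str:
    """Ask, in English, for a single part-of-speech name for one Swahili word."""
    options = ", ".join(POS_TAGS)
    return (
        f"What is the part of speech of the Swahili word '{word}'? "
        f"Answer with exactly one of: {options}."
    )


def normalize_answer(answer: str) -> Optional[str]:
    """Map a free-text reply to a tag, or None if it is not recognized."""
    cleaned = answer.strip().strip(".!").lower()
    return POS_TAGS.get(cleaned)


def tag_words(words: Iterable[str], ask_llm: Callable[[str], str]) -> List[str]:
    """Return 'word pos=tag' lines (assumed layout); unrecognized replies are skipped."""
    lines = []
    for word in words:
        tag = normalize_answer(ask_llm(build_prompt(word)))
        if tag is not None:
            lines.append(f"{word} pos={tag}")
    return lines
```

Passing the model call in as a function keeps the sketch independent of any particular LLM service. Words whose reply cannot be mapped to one of the eight names are left out, so they can be reviewed by a Swahili speaker instead of being guessed.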
I think that’s a brilliant proposal, and I would like to know when we could start working on it.
Thank you!
Let me know if you need any help or have questions…