Dictionary Update


Just some rambling about word-game dictionary choice.

The previous method of creating common-word dictionaries of various levels had a few flaws.  Using just frequencies without lemma data meant that "test" might be included as a common word while "tests", which should be just as recognizable, was excluded because it doesn't show up as frequently.  The previous update sort of fixed that by checking the dictionary for various suffixes like "-s", "-ed", etc., and a potential update I was working on would use ECDICT's and NGSL's lemma data to be more accurate.  That still isn't perfect, however.  English has some homonyms where one word is common (e.g. "guy"; noun, meaning a person) but another word with the same spelling and a different part of speech is uncommon ("guy"; verb, meaning to steady with rope or chain).  So naive use of the lemma data would include inflections of the uncommon word ("guyed").
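
To make the two failure modes concrete, here is a minimal sketch in Python.  It is not the actual build script: the function name, the frequency counts, the threshold and the lemma table are all made up for illustration, and the lemma table only stands in for the kind of data ECDICT or NGSL provide.

```python
def build_common_words(frequencies, lemma_forms, min_count):
    """Return (frequency-only set, set expanded with every inflection of each lemma).

    All names and numbers here are hypothetical; this only illustrates the idea.
    """
    # Frequency-only: keep words whose count clears the threshold.
    common = {word for word, count in frequencies.items() if count >= min_count}

    # Naive lemma expansion: add every listed inflection of each common word,
    # ignoring part of speech entirely.
    expanded = set(common)
    for lemma in common:
        expanded.update(lemma_forms.get(lemma, ()))
    return common, expanded


# Made-up counts: "tests" is rarer than "test", "guyed" is very rare.
frequencies = {"test": 900, "tests": 40, "guy": 800, "guyed": 1}

# Made-up lemma table with no part-of-speech information.
lemma_forms = {
    "test": ["tests", "tested", "testing"],
    "guy": ["guys", "guyed", "guying"],
}

common, expanded = build_common_words(frequencies, lemma_forms, min_count=100)
```

With these inputs the frequency-only set misses "tests", while the expanded set picks it up but also pulls in "guyed", because nothing ties the inflection to the rare verb rather than the common noun.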

Additionally, the Wiktionary data quality was not good.  Proper names were included and, as with any internet-editable document, it ended up with quite a few slurs hidden in it.

In Mere Anagram, the shortcomings are more apparent.  Getting a rare word in Compoundle is less of an issue: as long as it looks like an English word, the clues will point you to it.  But with anagrams you're required to find every possible word from scratch, and if the dictionary is too large you end up guessing anything plausible.
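
A rough sketch of why dictionary size matters so much for anagrams, again in Python.  This is not Mere Anagram's actual code, and the rules assumed here (each letter used at most once, minimum length of three) are guesses for the sake of the example.  The word sets are tiny hand-picked samples.

```python
from collections import Counter


def words_from_letters(letters, dictionary, min_length=3):
    """List every dictionary word that can be spelled from the given letters.

    Assumes each letter may be used at most as often as it appears in `letters`.
    """
    available = Counter(letters)
    found = []
    for word in dictionary:
        if len(word) < min_length:
            continue
        needed = Counter(word)
        # A word fits if no letter is needed more often than it is available.
        if all(available[ch] >= n for ch, n in needed.items()):
            found.append(word)
    return sorted(found)


# Hand-picked samples: a small list of everyday words, and the same list
# with a few rarer words added.
small = {"tea", "eat", "ate", "seat", "east"}
large = small | {"eta", "etas", "seta", "sate"}

small_answers = words_from_letters("seat", small)
large_answers = words_from_letters("seat", large)
```

The same four letters go from five required answers to nine once the rarer words are allowed, and the player has to find every one of them.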

To solve this, I'm now taking advantage of SCOWL's configuration options.  The dictionary is designed with various levels, so that a limited (but still usable) dictionary can be created for memory-constrained applications.  This has the side effect of removing obscure or rarely used words from the dictionary.  The new solution appears to be working well in my testing, and retains about the same dictionary size as the frequency-list method.
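
The level idea can be sketched like this, in Python.  It assumes the words have already been grouped by level number in memory; how SCOWL is actually read and configured is left out, and the level assigned to each sample word below is invented for illustration rather than taken from SCOWL.

```python
def build_dictionary(words_by_level, max_level):
    """Merge every word list whose level is at or below max_level."""
    selected = set()
    for level, words in words_by_level.items():
        if level <= max_level:
            selected.update(words)
    return selected


# Invented level assignments, only to show the cutoff behaviour.
words_by_level = {
    10: {"test", "guy"},
    35: {"tests", "anagram"},
    80: {"guyed"},
}

compact = build_dictionary(words_by_level, max_level=35)
```

Lowering `max_level` shrinks the dictionary and drops the obscure words together, without needing a separate frequency list or lemma table.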
