When I started this project, the word-guessing game ScriptBlade, I did not have a full grasp of the big question staring me right in the face. I would have to decide, or agree with and justify the decisions already made by someone else, on what a word actually is. After all, if my app is to tell a user that their word is either correct or incorrect, then the application needs a clear definition of what a word is. Unfortunately, as I thought about this and learned more about the subject, I realized that this is one of those questions that does not have a clear-cut answer. At its core it is subjective and might even be seen by some as philosophical.
At first I thought that I could simply use an open source dictionary as the source of truth, but as we were testing this solution we noticed that a lot of words that the testers understood as words were not in the dictionary. This complaint came from various testers from different countries, some of them native English speakers and some not. As I looked into this issue further it became apparent that different dictionaries don't even agree with each other on whether a word is actually a word or not! To anyone familiar with the subject this is, of course, a very fundamental insight, but the extent to which it happens surprised me a lot. After all, we do speak the same language, don't we? How can we disagree on the definition of one of its most basic units?
Well, it turns out that even if we do speak the same language, our vocabularies are vastly different, and it all depends on our perspective. That perspective is shaped by multiple factors:
I'm sure that examples in each of these categories could be found, but to illustrate my point I would like to provide a couple of examples of words used in professions or sciences which are not part of most general purpose dictionaries and likely not fully understandable to the people outside of that domain.
And I would like to devote a special section to Chemistry, because it has a systematic nomenclature that can generate extremely long, highly technical compound names, which often seem incomprehensible to non-specialists. In simplified terms, molecules are represented by chemical formulae, and these formulae can be translated into standardized names following established naming conventions. Here are a couple of examples:
I tried to google the name of the above-mentioned protein, but ran into a problem.
At this point we can say that this is absurd. Dictionary writers usually disregard such names as "verbal formulae" rather than English words.[2] But if we look at general purpose dictionaries, they do contain some "verbal formulae" or their parts, for example:
So it does seem inconsistent, and the most satisfying solution to this problem would be to include or exclude words based on the category they belong to instead of their length or how commonly they are used, as long as they do get some use. But it is impossible to list all molecules as words in a dictionary, because most of them are not even known [3] and likely never will be. I don't think they should be excluded either, though, because some of those molecule names seep into everyday language, such as monoxide, dioxide (as in carbon dioxide) and so on. It's one thing not to have extremely esoteric words in your dictionary, but if very common words are missing, then that affects far more people.
In academia the debate on "what is a word" encompasses many more areas than what is necessary for written word-based game development. It's impossible to cover everything in a paragraph and my understanding of this is pretty basic, but some points of contention are:
Thankfully, for the purposes of developing a written word-based game, some of these questions are irrelevant.
The remaining points are worth considering and are certainly interesting to think about.
Do boundaries of pronounced and written words qualify or disqualify them as words?
This seems like the most obvious and the most important part of the definition. A word should have either spaces and punctuation around it or should be separated by pauses in speech. If we have to separate speech and sentences into smaller units, then it is convenient to pick an objective place to do it. However, the counter argument to this line of reasoning is that some words in speech can be pronounced differently than the way they are written - without pauses. This would depend on your dialect, but there are examples of such phrases which in time have actually merged into a single word and are now almost always written and spoken together.
Words should hold meaning, after all, without meaning they would simply be strings of letters. However, different people construe meaning as different things. Meaning could be understood simply as purpose. A random string of letters does not hold any purpose in language, but even the smallest components of language, such as articles and interjections do have purpose. Although articles do not have any meaning if used by themselves, they can change the meaning of a sentence.
Even though the words that make up these components are separated by spaces, it could be argued that these expressions always go together and hold meaning or represent an idea in much the same way as a single word would. Replacing any component of such a structure would destroy its intended meaning completely, and that signifies a sort of rigidity that only words have. A sentence does not usually act in this way, while a fixed expression always does.
While looking into various articles, video essays and papers about "What is a word?" I had an idea. If we look at a language as a very complex system made up of various smaller components and then compare it to other such systems, could we find similar problems?
"So, no, the entire history of mathematics has not ascribed a single logical meaning to the word "number," so that we can distinguish what is and isn't a number"[4]
"My own answer, perhaps somewhat appropriately, is that the meaning of variable does, in fact, vary according to its context (e.g., across different classes such as Velleman's example of proofs versus calculus, and within a single class such as Usiskin's approaches to algebra). Rather than trying to pin down a single definition, my own feeling is that we should explore these different meanings and challenge students to articulate the ramifications of adopting different conceptions at different times."[5]
Music has a lot of terminology that is ambiguous. It is excellent for the purposes which it is meant for, to communicate about musical concepts with others, to help us think about the musical system in a more efficient, convenient and clear way. But a lot of the terms in music are highly subjective.
Some musical questions, similar to the question "What is a word?", in that they cannot be answered objectively and straightforwardly, could be:
These subjective and context dependent terms are used to communicate all the time. They are fluid and changing, but that doesn't mean they are not useful.
Some would say that because of their malleability they are even more useful. You run into problems when you take a subjective term like that and try to think about it in a cold, robotic way. Or, in my case, you make a game around it and try to build a rigid and objective system on a fluid foundation.
As we can see, there isn't going to be a straightforward answer. If we can't find a concrete and objective definition, then it is a good time to reflect on why we asked the original question in the first place. I am making a writing-based word-guessing game and I will be considering things relevant to my case, such as:
This is a core mechanic of the game I'm developing and it is important for me to get it right, but because of the above-mentioned factors and the subjective nature of the problem I have to make decisions that might seem arbitrary.
The best solution that I can come up with is to give users the ability to upload their own dictionaries. This way the responsibility of defining a subjective concept rests with the subject itself. It will also allow users to upload dictionaries of other languages than English, upload collections of city or country names, collections of medical terms, lists of names for people, Latin words used in law and so on. Not only is it a good solution for the initial problem, it opens the doors to many more modes of play.
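To make the idea concrete, here is a minimal sketch of how an uploaded dictionary could decide what counts as a word. It is not ScriptBlade's actual implementation: the one-entry-per-line format, the case-insensitive matching and the names `load_word_list` and `is_valid_guess` are all assumptions made for illustration.

```python
from typing import Iterable


def load_word_list(lines: Iterable[str]) -> frozenset[str]:
    """Build a lookup set from one-entry-per-line text (assumed upload format)."""
    words = set()
    for line in lines:
        word = line.strip().casefold()  # assumption: matching ignores case
        if word:  # skip blank lines
            words.add(word)
    return frozenset(words)


def is_valid_guess(guess: str, dictionary: frozenset[str]) -> bool:
    """A guess counts as a word only if the active dictionary contains it."""
    return guess.strip().casefold() in dictionary


# Hypothetical user upload: a list of city names instead of English words.
uploaded = ["Vilnius", "Oslo", "", "  Lima  "]
cities = load_word_list(uploaded)

print(is_valid_guess("oslo", cities))  # a city from the uploaded list
print(is_valid_guess("word", cities))  # an English word, but not in this list
```

With this shape the game logic never needs its own definition of a word. Swapping the set swaps the definition, which is what makes city names, medical terms or another language possible as play modes.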
For user convenience a simple dictionary will be provided. Like all dictionaries, it will have arbitrary exclusions, missing words, and words that you would, from your perspective, swear are not actually words. But all of that is a trade-off that I have to make. If using such a dictionary to determine what is a word does not sound satisfying to you, then you will have the option of sourcing your own dictionary.
And in the end, that will free up development resources to build a better experience overall: adding more layers of strategy, making communication easier and more convenient, creating more art and ensuring a well-functioning, stable app.
ScriptBlade will provide a dictionary by default. The properties of this dictionary are as follows:
This excludes hyphenated compound adjectives, compound nouns, some numbers and maybe some other compounds. However, it brings the definition of a word more in line with what a layman would expect it to be. Words are made out of letters, not punctuation.
Another problem with compounds is that you can make up meaningful words on the fly. A dictionary might contain the definition for "light-colored", but fewer dictionaries will contain the definition for "dark-colored" and none for "blue-colored" or "red-colored". These compounds do have clearly understandable meaning, but not one a program can easily identify.
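A rule like "words are made out of letters, not punctuation" is easy to express in code. The sketch below is only an illustration under my own assumptions: that "letters" means the 26 unaccented English letters, and that the hypothetical helper `keep_letter_only_words` runs once over the source word list.

```python
import re

# Assumption: "letters" means the 26 unaccented English letters only.
LETTERS_ONLY = re.compile(r"[a-z]+")


def keep_letter_only_words(candidates):
    """Keep entries made purely of letters; drop hyphenated or spaced compounds."""
    return [w for w in candidates if LETTERS_ONLY.fullmatch(w.lower())]


sample = ["light", "light-colored", "ice cream", "twenty-one", "dioxide", "o'clock"]
kept = keep_letter_only_words(sample)
print(kept)  # ['light', 'dioxide']
```

The filter cannot judge whether "blue-colored" is meaningful. It simply sidesteps the question by rejecting anything that contains a hyphen, space or apostrophe.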
The source dictionary is a general purpose dictionary, which means that words from very specialized contexts will be excluded. The dictionary might still contain some words from such contexts, because they enter the common vocabulary over time. The choice of which words are included or excluded may seem arbitrary or outdated. But because sourcing, creating and maintaining such a dictionary is such an enormous task, a more complete one simply cannot be built right now.
Any proper nouns, such as names of people, pets, countries, cities, municipalities, music, movies and so on, are excluded. If they are not excluded as a whole, there are bound to be situations where a person enters their own name or the name of their home city and the system tells them that it is not valid. It is not possible to source a list of every single possible name for people, because most countries allow you to name your child whatever you want or even to change your name later. Some countries also allow symbols from foreign languages to be included in the name. Including such names would require expanding the alphabet, and there are likely more names in the world than words in the English language.
It is possible that you will encounter words that are proper nouns as well as common nouns. Don't get confused: they are included because of their common noun meaning, not because of their proper noun meaning. An example of such a case is "batman". You likely know this word as the name of a superhero; however, it is also the name of multiple different cities all over the world. But the reason it is included in most dictionaries is that it means "an orderly (noun) of a British military officer".
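One way such a rule could be applied is sketched below. It relies on a strong assumption that may not hold for every word list: that the source capitalizes proper nouns and lists common nouns in lowercase, so "Batman" and "batman" appear as separate entries. The function name `drop_proper_nouns` is hypothetical.

```python
def drop_proper_nouns(entries):
    """Keep only entries that the source lists entirely in lowercase.

    Assumption: the source capitalizes proper nouns, so "Batman" (the city or
    the superhero) and "batman" (the military orderly) are separate entries.
    """
    return {e for e in entries if e.islower()}


source = ["Batman", "batman", "Paris", "orderly"]
playable = drop_proper_nouns(source)
print(sorted(playable))  # ['batman', 'orderly']
```

Under that assumption "batman" survives because of its lowercase entry, while "Paris" is dropped because it only ever appears capitalized.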