What is a word for the purposes of written word-based game development?

When I started this project, the word-guessing game ScriptBlade, I did not fully grasp the big question staring me right in the face: I would have to decide what a word actually is, or else agree with and justify the decisions already made by someone else. After all, if my app is to tell a user that their word is correct or incorrect, then the application needs a clear definition of what a word is. Unfortunately, as I thought about this more and learned more about the subject, I realized that this is one of those questions that does not have a clear-cut answer. At its core it is subjective and might even be seen by some as philosophical.

At first I thought that I could simply use an open source dictionary as the source of truth, but while testing this solution we noticed that a lot of words that the testers considered to be words were not in the dictionary. This complaint came from testers in different countries, some of them native English speakers and some not. As I looked into the issue further, it became apparent that different dictionaries don't even agree with each other on whether a word is actually a word! To anyone familiar with the subject this is, of course, a very fundamental insight, but the extent to which it happens surprised me a lot. After all, we do speak the same language, don't we? How can we disagree on the definition of one of its most basic units?

Well, it turns out that even if we speak the same language, our vocabularies are vastly different, and it all depends on our perspective. That perspective is shaped by multiple factors:

  • Lived experience
  • Age
  • Country of origin
  • Education
  • Profession
  • Religion
  • Dialects
  • And an innumerable quantity of others

I'm sure that examples could be found in each of these categories, but to illustrate my point I would like to provide a few examples of words used in professions or sciences that are not part of most general purpose dictionaries and are likely not fully understandable to people outside of that domain.

  • Computing
    • Transpilation
    • Minification (in programming this holds a different meaning than "minimization", which is similar but is not used to refer to the same concept)
    • Statelessness
    • Memoization
    • Containerization
  • Biology
    • Caenorhabditis
    • Cyclooxygenase
  • Physics
    • Electroluminescence
    • Quasicrystallography
  • Astrology
    • Synastry
    • Astrocartography

And I would like to devote a special section to Chemistry, because it has a systematic nomenclature that can generate extremely long, highly technical compound names, which often seem incomprehensible to non-specialists. In simplified terms, molecules are represented by chemical formulae, and these formulae can be translated into standardized names following established naming conventions. Here are a couple of examples:

  • Dodecamethylcyclohexasiloxane
  • Hexachlorocyclohexane
  • Octamethylcycl
  • And the world's longest "word", the name of the largest protein, which takes three hours to pronounce! [1]

I tried to google the name of the above-mentioned protein, but ran into a problem.

Google 413 error

Well, at this point we can say that this is absurd. Dictionary writers usually disregard such names as "verbal formulae" rather than English words.[2] But general purpose dictionaries do contain some "verbal formulae" or their parts, for example:

  • Monoxide
  • Tetrahydrofuran
  • Thiopental

So it does seem inconsistent. The most satisfying solution to this problem would be to include or exclude words based on the category they belong to rather than their length or how commonly they are used, as long as they do get some use. But it is impossible to list all molecules as words in a dictionary, because most of them are not even known [3] and likely never will be. I don't think they should be excluded either, though, because some of those molecule names seep into everyday language, such as monoxide and dioxide (as in carbon dioxide). It's one thing not to have extremely esoteric words in your dictionary, but if very common words are missing, that affects far more people.

In academia

In academia the debate on "what is a word" encompasses many more areas than are necessary for written word-based game development. It's impossible to cover everything in a paragraph, and my understanding of this is pretty basic, but some points of contention are:

  • Do boundaries of pronounced and written words qualify or disqualify them as words?
  • Do they have to convey meaning if used by themselves?
  • Can morphological forms of a word be considered different words?
  • Idiomatic/Fixed expressions
  • Should rules for "what is a word" be different for different languages?
  • Does the way we think about words differ from the way we write or say them?

Thankfully for the purposes of developing a written word-based game some of these questions are irrelevant.

  • Whether morphological forms of a word can be considered different words is not important for this purpose: it doesn't matter whether the forms are different words, only that they are words.
  • "Does the way we think about words differ from the way we write or say them?" could be important. One could imagine a system based on this idea that would allow various expressions or dialects to be added to the source of truth for what a word is. However, that goal is simply too ambitious and the results of such an endeavor are uncertain. More on that later.

The remaining points are worth considering and are certainly interesting to think about.

Do boundaries of pronounced and written words qualify or disqualify them as words?

This seems like the most obvious and the most important part of the definition. A word should either have spaces or punctuation around it in writing or be separated by pauses in speech. If we have to split speech and sentences into smaller units, then it is convenient to pick an objective place to do it. However, the counterargument to this line of reasoning is that some words are run together in speech, without pauses, even though they are written separately. This depends on your dialect, but there are examples of such phrases which over time have actually merged into a single word and are now almost always written and spoken together.

  • Insofar ("in so far")
  • Albeit ("all be it")
  • Breakfast ("break fast")
  • Today ("to day")

Do they have to convey meaning if used by themselves?

Words should hold meaning. After all, without meaning they would simply be strings of letters. However, different people construe meaning as different things. Meaning could be understood simply as purpose. A random string of letters does not serve any purpose in language, but even the smallest components of language, such as articles and interjections, do have a purpose. Although articles do not have any meaning if used by themselves, they can change the meaning of a sentence.

Idiomatic/Fixed expressions

Even though the words that make up these expressions are separated by spaces, it could be argued that they always go together and hold meaning or represent an idea in much the same way as a single word would. Replacing any component of such an expression would destroy its intended meaning completely, and that signifies a sort of rigidity that only words have. A sentence does not usually act in this way, while a fixed expression always does.

Parallels

While looking into various articles, video essays and papers about "What is a word?" I had an idea. If we look at language as a very complex system made up of various smaller components and then compare it to other such systems, could we find similar problems?

In math

"What exactly is a number?"

"So, no, the entire history of mathematics has not ascribed a single logical meaning to the word "number," so that we can distinguish what is and isn't a number"[4]

"What is a variable?"

"My own answer, perhaps somewhat appropriately, is that the meaning of variable does, in fact, vary according to its context (e.g., across different classes such as Velleman's example of proofs versus calculus, and within a single class such as Usiskin's approaches to algebra). Rather than trying to pin down a single definition, my own feeling is that we should explore these different meanings and challenge students to articulate the ramifications of adopting different conceptions at different times."[5]

In music

Music has a lot of ambiguous terminology. It is excellent for its intended purposes: communicating about musical concepts with others and helping us think about the musical system in a more efficient, convenient and clear way. But a lot of the terms in music are highly subjective.

Some musical questions, similar to the question "What is a word?", in that they cannot be answered objectively and straightforwardly, could be:

  • What is a melody?
  • What is a song?
  • What is a motif?
  • What is timbre?

In the culinary discipline

  • What is a vegetable?
  • What is a spice?

In geography / political science

  • What is a country?
  • What is a city?

These subjective and context dependent terms are used to communicate all the time. They are fluid and changing, but that doesn't mean they are not useful.

Some would say that because of their malleability they are even more useful. You run into problems when you take a subjective term like that and try to think about it in a cold, robotic way. Or, in my case, you make a game around it and try to build a rigid and objective system on a fluid foundation.

Let's look at it pragmatically

As we can see, there isn't going to be a straightforward answer. If we can't find a concrete and objective definition, then it is a good time to reflect on why we asked the original question in the first place. I am making a writing-based word-guessing game, so I will be considering things relevant to my case, such as:

  • What definition feels good for most people?
  • What definition is the most accessible?
  • What implementation is possible?
  • Considering that resources devoted to this could be spent elsewhere, what solution is worth pursuing?

This is a core mechanic of the game I'm developing and it is important for me to get it right, but because of the above-mentioned factors and the subjective nature of the problem I have to make decisions that might seem arbitrary.

The best solution that I can come up with is to give users the ability to upload their own dictionaries. This way the responsibility of defining a subjective concept rests with the subject itself. It will also allow users to upload dictionaries of languages other than English, collections of city or country names, collections of medical terms, lists of names for people, Latin words used in law and so on. Not only is it a good solution for the initial problem, it also opens the door to many more modes of play.
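As a minimal sketch of this idea, the snippet below treats an uploaded word list as the only source of truth for what counts as a word. It assumes a plain-text upload with one entry per line. The names `load_word_list` and `is_valid_guess`, and the file format itself, are hypothetical and chosen for illustration only; they do not describe ScriptBlade's actual implementation.

```python
import io
from typing import TextIO


def load_word_list(source: TextIO) -> frozenset[str]:
    """Read one entry per line; skip blank lines and fold case."""
    return frozenset(
        line.strip().casefold() for line in source if line.strip()
    )


def is_valid_guess(guess: str, words: frozenset[str]) -> bool:
    """A guess counts as a word only if the active word list contains it."""
    return guess.strip().casefold() in words


# Hypothetical user upload: a small list of computing terms.
uploaded = io.StringIO("Memoization\nTranspilation\n\nminification\n")
custom_words = load_word_list(uploaded)
```

With this list active, `is_valid_guess("memoization", custom_words)` would accept the guess, while an everyday word such as "breakfast" would be rejected, because the uploaded list alone decides what a word is.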

For user convenience a simple dictionary will be provided. Like all dictionaries, it will have arbitrary exclusions, missing words, and words that you would, from your perspective, swear are not actually words. All of that is a trade-off that I have to make. But if using such a dictionary to determine what is a word does not sound satisfying to you, then you will have the option of sourcing your own dictionary.

And in the end, that will free up development resources to build a better experience overall: adding more layers of strategy, making communication easier and more convenient, creating more art and ensuring a well-functioning, stable app.

The simple provided dictionary

ScriptBlade will provide a dictionary by default. The properties of this dictionary are as follows:

The words in the dictionary are composed solely of the 26 letters of the English alphabet. No numbers, no hyphens, no spaces, no punctuation.

This excludes hyphenated compound adjectives, compound nouns, some numbers and maybe some other compounds. However, it brings the definition of a word more in line with what a layman would expect it to be. Words are made out of letters, not punctuation.

Another problem with compounds is that you can make up meaningful words on the fly. A dictionary might contain a definition for "light-colored", but fewer dictionaries will contain a definition for "dark-colored" and none for "blue-colored" or "red-colored". These compounds do have a clearly understandable meaning, but not one a program can easily identify.
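The letters-only rule described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the names `is_letters_only` and `filter_entries` are hypothetical, and the choice to lowercase every entry that is kept is my assumption, not something the rule itself requires.

```python
import re

# Exactly the 26 unaccented English letters, in upper or lower case.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")


def is_letters_only(entry: str) -> bool:
    """True if the entry has no digits, hyphens, spaces or punctuation."""
    return LETTERS_ONLY.fullmatch(entry) is not None


def filter_entries(entries: list[str]) -> list[str]:
    """Keep only letters-only entries, lowercased (an assumed convention)."""
    return [entry.lower() for entry in entries if is_letters_only(entry)]


candidates = ["breakfast", "light-colored", "ice cream", "4th", "naïve", "Today"]
kept = filter_entries(candidates)
```

Here only "breakfast" and "today" survive. The hyphenated compound, the entry with a space, the one with a digit and the one with an accented letter are all dropped.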

Very esoteric words will not be present in the source dictionary

The source dictionary is a general purpose dictionary, which means that words from very specialized contexts will be excluded. The dictionary might contain some words from such contexts, because they enter the common vocabulary over time. The choice of which words are included or excluded may seem arbitrary or outdated. But sourcing, creating and maintaining a better dictionary is such an enormous task that it simply cannot be done right now.

No proper nouns

Any proper nouns such as names of people, pets, countries, cities, municipalities, music, movies and so on are excluded. If they are not excluded as a whole, there are bound to be situations where a person enters their own name or the name of their home city and the system tells them that it is not valid. It is not possible to source a list of every single possible name for people, because most countries allow you to name your child whatever you want or even to change your name later. Some countries also allow symbols from foreign languages to be included in a name. Including such names would require expanding the alphabet, and there are likely more names in the world than words in the English language.

It is possible that you will encounter words that are proper nouns as well as common nouns. Don't get confused: they are included because of their common noun meaning, not because of their proper noun meaning. An example of such a case is "batman". You likely know this word as the name of a superhero; it is also the name of multiple cities around the world. But the reason it is included in most dictionaries is that it means "an orderly (noun) of a British military officer".


Refs: