When we started developing LanGeek, one of our main ideas was the “smart look-up” feature. The concept was simple: users could click on a word and instantly see its meaning. But building this seemingly straightforward feature turned out to be much more challenging than expected. It felt like taking on the Labors of Hercules.
Labor One: Building a Comprehensive Glossary
The first step was creating a library of words. While free resources like WordNet exist, they fell short of our needs. Missing information and numerous errors made them unsuitable.
Having prior experience in lexicography, we decided to build our own glossary from scratch. This has been a monumental task. Four years in, with tens of thousands of entries, we’re still refining and expanding it. This ongoing effort forms the foundation of our smart look-up feature.
Labor Two: Tokenization
Tokenization involves breaking sentences into meaningful units or tokens. While splitting text at spaces may seem like an easy solution, it’s inadequate for language learners. For example, “high school” would incorrectly become two tokens.
We addressed tokenization in two parts:
- Inseparable Compound Words: Words that always appear together, like “high school,” were grouped into single tokens.
- Flexible Compound Words: Structures like phrasal verbs, where words can separate, required more effort. For example:
  - I put my jacket on.
  - I put my jacket on the table.
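In the first sentence, “put” and “on” form the phrasal verb “put on” and should be treated as a single token, even though the object separates them. In the second, “on” is a preposition introducing “the table,” so “put” stands on its own. We created rules to distinguish such cases accurately.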
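To make the two cases concrete, here is a minimal Python sketch. It is not LanGeek's production tokenizer: the tiny lexicons, the function names, and the "adjacent or clause-final particle" rule are all assumptions chosen for illustration, and real rules would need part-of-speech information and far more coverage.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
COMPOUNDS = {("high", "school")}
PHRASAL_VERBS = {("put", "on")}


def words_of(sentence):
    """Lowercase the sentence and split it into plain words."""
    return re.findall(r"[a-z']+", sentence.lower())


def merge_compounds(words):
    """Join adjacent words that form a known inseparable compound."""
    tokens, i = [], 0
    while i < len(words):
        if tuple(words[i:i + 2]) in COMPOUNDS:
            tokens.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def find_phrasal_verb(words, max_gap=3):
    """Return (verb_index, particle_index) for a phrasal verb, else None.

    Assumed rule of thumb: the particle belongs to the verb only when it
    directly follows the verb or is the last word of the sentence.
    Otherwise it is treated as a preposition starting a new phrase.
    """
    for i, verb in enumerate(words):
        last = min(i + 2 + max_gap, len(words))
        for j in range(i + 1, last):
            if (verb, words[j]) not in PHRASAL_VERBS:
                continue
            adjacent = j == i + 1
            sentence_final = j == len(words) - 1
            if adjacent or sentence_final:
                return (i, j)
    return None


print(merge_compounds(words_of("She goes to high school.")))
print(find_phrasal_verb(words_of("I put my jacket on.")))
print(find_phrasal_verb(words_of("I put my jacket on the table.")))
```

With these toy lexicons, the first jacket sentence yields the verb and particle positions, while the second yields no match because “on” is followed by more words. A heuristic this crude will misfire on many real sentences, which is why this step took real effort.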
Labor Three: Recognizing Figurative Structures
Another challenge was identifying figurative structures, like idioms and proverbs, where the meaning isn’t literal. For instance:
- We often take electricity for granted.
Here, “take for granted” has a specific meaning unrelated to the individual words.
We developed a system to identify and mark such structures by analyzing their static and dynamic parts. For example, in “take for granted,” the words “take” and “for granted” are static, while the dynamic part is the noun phrase between them (“electricity” in the sentence above), which can vary. We applied this method to thousands of idioms, collocations, and proverbs, making the feature invaluable for learners.
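One simple way to picture the static and dynamic parts is a pattern with a slot. The Python sketch below is an assumption-laden illustration, not our actual system: the `IDIOMS` table, the `find_idioms` function, the hand-listed verb forms, and the limit of four words in the slot are all hypothetical.

```python
import re

# Hypothetical pattern table. Static parts are written literally;
# the named group "slot" captures the dynamic part (here, one to
# four words standing in for a noun phrase).
IDIOMS = {
    "take sth for granted": (
        r"\b(?:take|takes|took|taken|taking)\s+"
        r"(?P<slot>(?:\w+\s+){1,4}?)"
        r"for\s+granted\b"
    ),
}


def find_idioms(sentence):
    """Return (idiom_name, slot_text) pairs for every idiom found."""
    hits = []
    for name, pattern in IDIOMS.items():
        match = re.search(pattern, sentence, flags=re.IGNORECASE)
        if match:
            hits.append((name, match.group("slot").strip()))
    return hits


print(find_idioms("We often take electricity for granted."))
print(find_idioms("She took her old friends for granted."))
print(find_idioms("I take the bus to work."))
```

The first two sentences match with “electricity” and “her old friends” filling the slot, and the third does not match at all. A regular expression cannot tell whether the slot really holds a noun phrase, so a fuller approach would check it against a parser or part-of-speech tags.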
Labor Four: Word Sense Disambiguation (WSD)
WSD is one of the most complex aspects of a smart look-up system. Consider the sentence:
- I used the iron to press my clothes and remove all the wrinkles.
If you click “iron,” most systems would show its most common meaning: a metal used to make tools. While statistically likely, this meaning is wrong in context, where “iron” is the appliance used to press clothes.
Some platforms solve this by listing all possible meanings and leaving the user to decide, but this approach shifts the burden onto learners.
To solve this, a system needs to understand the sentence’s context and determine the correct word sense. This is what WSD does, but implementing it is extremely challenging. Generative AI engines can handle WSD, but they are slow and costly for real-time use.
At LanGeek, we’re building our own WSD system to achieve both accuracy and speed. It’s still a work in progress, but it’s a vital step toward making smart look-up truly smart.
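To show what choosing a sense from context can look like, here is a minimal Python sketch of a simplified Lesk-style approach: pick the sense whose definition shares the most words with the sentence. This is a classic textbook baseline, not the system we are building. The `SENSES` inventory, its glosses, the stopword list, and the function names are all made up for the example.

```python
import re

# Hypothetical sense inventory, most frequent sense listed first.
SENSES = {
    "iron": [
        ("metal", "a strong hard metal used to make steel tools and machines"),
        ("appliance", "a heated appliance used to press clothes and remove wrinkles"),
        ("golf club", "a golf club with a metal head"),
    ],
}

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "to", "and", "of", "i", "my", "with", "used", "all"}


def content_words(text):
    """Return the set of lowercase words in text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(word, sentence):
    """Pick the sense whose gloss overlaps most with the sentence.

    Falls back to the first (most frequent) sense when nothing overlaps.
    """
    context = content_words(sentence) - {word}
    best_label, best_score = SENSES[word][0][0], 0
    for label, gloss in SENSES[word]:
        score = len(context & content_words(gloss))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


print(disambiguate("iron", "I used the iron to press my clothes and remove all the wrinkles."))
print(disambiguate("iron", "The bridge is made of iron."))
```

For the clothes sentence, the overlap on “press,” “clothes,” “remove,” and “wrinkles” selects the appliance sense; for the bridge sentence nothing overlaps, so the sketch falls back to the most frequent sense. Word overlap is fast but brittle, since it fails whenever the sentence and the definition use different vocabulary, which is part of why accurate real-time WSD is hard.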
Beyond the Surface
What seems like a simple click-to-see feature hides a mountain of complexity. From crafting a robust lexicon to tackling tokenization, recognizing figurative language, and implementing WSD, each step has required innovation, persistence, and a deep understanding of language.
Despite the challenges, we remain committed to pushing the limits of language learning technology. Smart look-up is more than just a feature; it reflects LanGeek’s mission to empower learners with tools that genuinely make a difference.