Working on B-amooz (our language learning platform), we always dreamed of teaching language the ideal way. Of course, there is no single view of what an ideal language learning platform is. Some look at it from a motivational perspective and plan to maximize motivation as much as possible (usually through gamification). We looked at it from a technical point of view: we tried to break a language apart into small particles of concept and syntax and teach learners each one at the right time.
Challenges we faced
The first challenge we had was to separate the pieces that make up a sentence and identify them correctly. One might assume this can be done by simply taking all the letters between two spaces or punctuation marks; it does the job, but with a very low accuracy rate, which is almost unacceptable (many modern platforms still do it this way). After picking out each piece (which is most likely a word), we have to identify it. Again, the simplest way is to look at the spelling and compare it against a database to see which word it is. We used this simple method in B-amooz, but the result was truly unacceptable. To give you an example:
The game was very challenging.
I was challenging him to a duel.
With the simple mechanism explained above, these two words are assumed to be the same. But one is an adjective and the other is a verb. Errors like this are far too frequent, especially in a language like English.
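The ambiguity can be sketched in a few lines. This is the "simple method" described above, not B-amooz's actual code: split on whitespace, strip punctuation, and look words up by spelling alone.

```python
import re

def naive_tokens(sentence):
    # Split on whitespace and strip punctuation -- the "simple method".
    return [re.sub(r"[^\w']", "", w).lower() for w in sentence.split()]

a = naive_tokens("The game was very challenging.")
b = naive_tokens("I was challenging him to a duel.")

# Both sentences yield the identical token "challenging", so a
# spelling-based dictionary lookup cannot tell the adjective
# apart from the verb form.
print("challenging" in a and "challenging" in b)  # True
```

Any lookup keyed only on the string "challenging" returns one dictionary entry for two grammatically different words.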
At a more advanced level, it is essential for the learner to get familiar with phrasal verbs, collocations, idioms, proverbs, and so on, which are almost impossible to handle with the simple method (unless the compound structure has no dynamic part and is always used in exactly the same form, which is very unlikely).
The lexical solution
Using NLP makes identifying words a lot easier: with tokenization, lemmatization, and part-of-speech tagging, we could identify words with far higher accuracy. However, because most NLP libraries are built with speech recognition in mind, they don't do a great job at POS tagging (at least from an educational point of view). So we had to build a correctional engine to fix bad tags. The current engine identifies words with more than 90% accuracy, which is good but not ideal.
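The post does not describe the correctional engine's actual rules, but the idea can be sketched as a rule-based pass over an upstream tagger's output. The rule below is a hypothetical example: an "-ing" form mistagged as an adjective right after a form of "be" and before an object pronoun is re-tagged as a verb, as in "I was challenging him".

```python
# Hypothetical correctional pass over (word, tag) pairs produced by
# an upstream POS tagger. A real engine would hold many such rules.

BE = {"am", "is", "are", "was", "were", "be", "been", "being"}
OBJ_PRON = {"me", "you", "him", "her", "it", "us", "them"}

def correct_tags(tagged):
    out = list(tagged)
    for i, (word, tag) in enumerate(out):
        # "-ing" form tagged ADJ, preceded by "be", followed by an
        # object pronoun -> progressive verb, not adjective.
        if (tag == "ADJ" and word.endswith("ing")
                and i > 0 and out[i - 1][0] in BE
                and i + 1 < len(out) and out[i + 1][0] in OBJ_PRON):
            out[i] = (word, "VERB")
    return out

tagged = [("i", "PRON"), ("was", "AUX"), ("challenging", "ADJ"),
          ("him", "PRON"), ("to", "ADP"), ("a", "DET"), ("duel", "NOUN")]
print(correct_tags(tagged)[2])  # ('challenging', 'VERB')
```

In "The game was very challenging", the intervening adverb "very" blocks the rule, so the adjective tag survives.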
After tokenizing and identifying each token, it is time to glue everything back together to identify compound structures like phrasal verbs or idioms. Because there might be a dynamic element between the parts of a compound structure, we need to specify the dynamic part and then identify the structure.
To give you an example, in order to identify the idiom “to take something for granted”, we first need to identify the fully dynamic part (“something”) and specify its type; in this case, “something” can be a noun, a noun phrase, or a pronoun. We also have to specify which part can inflect, which in this case is “take”. All the other words are static and will not change.
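The three kinds of parts (inflectable, dynamic, static) can be sketched as a pattern matcher over tokens. This is an illustrative simplification, not the platform's implementation: the inflectable head is matched by lemma, the dynamic slot is reduced to a single noun or pronoun (multi-word noun phrases are omitted), and static words are matched on their surface form.

```python
# A pattern is a list of constraints over (surface, lemma, pos) tokens.
# "LEMMA:x" matches any inflection of x; "SLOT" matches the dynamic
# part (simplified to one noun/pronoun); plain strings are static words.

PATTERN = ["LEMMA:take", "SLOT", "for", "granted"]

def matches(pattern, tokens, start):
    for j, p in enumerate(pattern):
        surface, lemma, pos = tokens[start + j]
        if p.startswith("LEMMA:"):
            if lemma != p[len("LEMMA:"):]:
                return False
        elif p == "SLOT":
            if pos not in ("NOUN", "PRON"):
                return False
        elif surface != p:
            return False
    return True

def find_idiom(pattern, tokens):
    # Return the start index of the first match, or -1.
    for i in range(len(tokens) - len(pattern) + 1):
        if matches(pattern, tokens, i):
            return i
    return -1

sent = [("she", "she", "PRON"), ("took", "take", "VERB"),
        ("it", "it", "PRON"), ("for", "for", "ADP"),
        ("granted", "grant", "VERB")]
print(find_idiom(PATTERN, sent))  # 1
```

Because "took" is matched through its lemma, the same pattern also catches "takes", "taking", and "taken ... for granted".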
We soon realized that it is impossible to identify grammatical structures by simple tokenization, and this was not something we could ignore in the learning process. The ideal learning platform should be able to correctly identify each grammatical structure and, if needed, lead you to a lesson on how that structure works.
Even the most sophisticated NLP libraries do not analyze and identify grammatical structure, so we designed an in-house syntactic tagger to identify and tag the grammatical structures used in each clause. The result was more than satisfactory, with over 95% accuracy in identifying grammatical tenses and moods. When it comes to grammar, what seems easy, like the subject or object of a sentence, is often the most complicated to identify. The good part was that we could reuse some of the data from our lexical analysis in the syntactic process.
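The in-house tagger itself is not described, but the core idea of tagging tense from a clause can be sketched as rules over auxiliary-plus-verb-form shapes (using Penn Treebank tags like VBG for "-ing" forms and VBN for past participles). The rule table here is a tiny illustrative subset, not the real rule set.

```python
# Minimal rule table mapping verb-group shapes to tense labels; a
# real syntactic tagger would cover far more patterns plus moods.
RULES = [
    (("was|were", "VBG"), "past continuous"),
    (("has|have", "VBN"), "present perfect"),
    (("will", "VB"), "simple future"),
]

def tag_tense(tokens):
    # tokens: list of (surface, penn_tag) pairs for one clause.
    for i in range(len(tokens) - 1):
        for (aux, form), label in RULES:
            if tokens[i][0] in aux.split("|") and tokens[i + 1][1] == form:
                return label
    return "unknown"

print(tag_tense([("i", "PRP"), ("was", "VBD"),
                 ("challenging", "VBG"), ("him", "PRP")]))
# past continuous
```

Note how this reuses the lexical layer's output: "was challenging" is only recognizable as past continuous once "challenging" has been correctly tagged as a VBG verb form rather than an adjective.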
The best way to learn a language is to start with the most common and easiest words and structures and work your way up. It sounds easy enough, but designing a platform around it is not. There are ways to identify the most common words in each language (usually not very accurate ones), but there are no lists of the most common idioms or even phrasal verbs. No worries there; we are already making one.
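One way such a list could be built, once the lexical layer produces lemma and POS data, is to count verb-plus-particle pairs across a tagged corpus. This is a hedged sketch with a toy particle set, not the method actually used.

```python
from collections import Counter

# Toy particle inventory; a real list would be far longer.
PARTICLES = {"up", "down", "out", "off", "on", "in", "away", "back"}

def phrasal_verb_counts(tagged_corpus):
    # tagged_corpus: list of sentences, each a list of (lemma, pos)
    # pairs. Counting lemmas means "gave up" and "gives up" both
    # count toward ("give", "up").
    counts = Counter()
    for sent in tagged_corpus:
        for (lemma, pos), (nxt, _) in zip(sent, sent[1:]):
            if pos == "VERB" and nxt in PARTICLES:
                counts[(lemma, nxt)] += 1
    return counts

corpus = [
    [("she", "PRON"), ("give", "VERB"), ("up", "ADP")],
    [("they", "PRON"), ("give", "VERB"), ("up", "ADP")],
    [("he", "PRON"), ("look", "VERB"), ("out", "ADP")],
]
print(phrasal_verb_counts(corpus).most_common(1))  # [(('give', 'up'), 2)]
```

Separable uses ("give it up") would need the slot-aware pattern matching described earlier; adjacency alone undercounts them.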
The problem with learning a language based only on the most common words is that some of these words are difficult to learn and/or are used in difficult structures. For example, “would” is one of the most common English words, but most standard English-as-L2 methods teach it at the A2 level because it is not an easy word to use. Different analyses can be run on a word to estimate its difficulty, but an important factor is how easy or difficult learners find it. That data can be collected from the way users learn words: how many times they look them up, or how many times they forget them.
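A difficulty score along these lines could blend corpus rank with the learner signals just mentioned. The weights and scales below are purely illustrative assumptions, not the values used on the platform.

```python
# Hypothetical scoring: blend corpus rarity with observed learner
# difficulty (lookup and forget rates, each normalized to [0, 1]).
# All weights here are illustrative assumptions.

def difficulty(corpus_rank, lookup_rate, forget_rate,
               total_words=10000):
    rarity = corpus_rank / total_words        # rarer -> harder
    learner = 0.5 * lookup_rate + 0.5 * forget_rate
    return 0.4 * rarity + 0.6 * learner      # score in [0, 1]

# "would": extremely common (rank 50 of 10000), yet learners look it
# up and forget it often, so the learner signals dominate the score.
print(round(difficulty(50, 0.7, 0.6), 3))  # 0.392
```

This captures the "would" case: frequency alone would rank it trivially easy, while learner data pushes it toward a later (A2-style) slot in the curriculum.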
Combining these two factors can produce a system in which each word or grammatical structure is taught at the right time and in the most efficient way, and that is what we planned to do working on LanGeek.