[This entry has been edited for clarity. An example given discussing the similarity of words in French and English was incorrect. The following sentence has been removed: “The next question addressed by Bhattacharya was the ambiguity that may arise in languages with similar origins, for example in French ‘magazine’ actually means shop while in English, well it is a magazine.”]
Today is June 14th, so I am 14 days into summer school; 7 more days left, and we are all already feeling saddened by the idea of leaving Kharagpur soon. In India, an IIT is a dream for 90% of the 12th graders who join IIT coaching classes. The competition is high, so not everyone gets in. I’m one of those who didn’t get in. So when I saw there was an ACM Summer School opportunity at the largest and oldest IIT in India, obviously I grabbed it. By sheer luck, I was selected to actually attend the school. Over the course of 21 days, we have been tasked with learning about machine learning and natural language processing.
From day one, literally everything in class has been a new word, and I feel like I am learning a new language altogether. It is as if I am a program in training.
The speaker on the first day was Pushpak Bhattacharya, currently the director of IIT Patna and the Vijay and Sita Vashee Chair Professor in the Department of Computer Science and Engineering at the Indian Institute of Technology Bombay, Powai, Mumbai. His lecture introduced me to the enduring relevance of part-of-speech (POS) tagging. In layman’s terms, this is a process that attaches a tag to each word, in order to give the word a meaning relative to the sentence it appears in. The machine maintains a tag set, and how the tags are allocated is decided by a set of rules or learned over time. How they are learned over time, well, that’s the part that intrigues me the most.
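To make the idea concrete, here is a toy sketch of my own (not a method from the lecture): a tiny tagger that looks each word up in a hand-made tag set and falls back on one crude hand-written rule when the word is unknown. The lexicon and the tag names are all invented for illustration.

```python
# Hypothetical miniature tag set, invented purely for illustration.
TAG_SET = {
    "the": "DET",
    "dog": "NOUN",
    "barks": "VERB",
    "loudly": "ADV",
}

def tag(sentence):
    """Attach a tag to each word, using the tag set plus fallback rules."""
    tagged = []
    for word in sentence.lower().split():
        if word in TAG_SET:
            tagged.append((word, TAG_SET[word]))
        elif word.endswith("ly"):          # one crude hand-written rule
            tagged.append((word, "ADV"))
        else:
            tagged.append((word, "NOUN"))  # default guess for unknown words
    return tagged

print(tag("The dog barks loudly"))
# → [('the', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB'), ('loudly', 'ADV')]
```

Real taggers replace the lookup table and suffix rule with probabilities learned from tagged text, which is exactly where the models below come in.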
A closer look at the variety of languages in this world shows us that each language has two main features. The first is the family it comes from: French, for example, grew out of Latin, while English is a Germanic language that has borrowed heavily from Latin and French. The languages of India are likewise broadly classified into families, such as the Indo-Aryan (“Indic”) and Dravidian languages. The second main feature of a language is whether it is a living language or a dead one. While English is a constantly growing language, alive with changes, Sanskrit is classified as a dead language. The major problems in NLP arise mainly from the constantly changing grammar, meaning, and usage of a language that is alive, especially a globally accepted language like English and its increasing use on social media in association with other languages.
The next question addressed by Bhattacharya was the ambiguity that may arise in languages with similar origins. In French, for example, “location” actually means rental, as in holiday rentals, while in English it is a word used to talk about a particular place. Similarly, there is variation in the meaning of a word, or morphology of a word, when it moves from a noun form to a verb form and so on. For example, almost all words in English have both a noun form and a verb form: read becomes to read, a scream becomes to scream, and so on. Bhattacharya established the importance of context in determining the true meaning of a word in a given sentence and exposed us to some methods of establishing it. POS tagging cannot rely entirely on grammar rules, as these are not intuitively understood by a machine the way they are by human beings. So applying a Hidden Markov Model (HMM) or a Conditional Random Field (CRF) model is bound to give more accuracy. Which one to use, and whether your algorithm should be rule-based or entirely based on a neural network, is decided first and foremost by the problem statement, and then by its readability and your data set. Generally, purely neural networks are hard to interpret and may lead to more errors, hence the choice is often an HMM approach, a middle ground between the other two.
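Here is a minimal sketch of how an HMM tagger uses context, with tiny hand-picked probabilities (invented for illustration, not a trained model). Viterbi decoding picks the most likely tag sequence, so an ambiguous word like “scream” comes out as a noun after “a” even though it could be a verb elsewhere.

```python
tags = ["NOUN", "VERB", "DET"]

# Hypothetical probabilities, chosen only to make the example work.
start = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans = {  # P(next tag | current tag)
    "DET":  {"NOUN": 0.9, "VERB": 0.05, "DET": 0.05},
    "NOUN": {"VERB": 0.6, "NOUN": 0.2,  "DET": 0.2},
    "VERB": {"NOUN": 0.3, "DET": 0.5,   "VERB": 0.2},
}
emit = {   # P(word | tag); "scream" is ambiguous between NOUN and VERB
    "DET":  {"a": 0.5, "the": 0.5},
    "NOUN": {"scream": 0.3, "girl": 0.4, "cat": 0.3},
    "VERB": {"scream": 0.5, "heard": 0.5},
}

def viterbi(words):
    """Return the most probable tag sequence for the given words."""
    # best[t] = (probability, tag sequence ending in tag t)
    best = {t: (start[t] * emit[t].get(words[0], 0.0), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            p, prev = max(
                (best[s][0] * trans[s][t] * emit[t].get(w, 0.0), s)
                for s in tags
            )
            new[t] = (p, best[prev][1] + [t])
        best = new
    return max(best.values())[1]

print(viterbi(["a", "scream"]))
# → ['DET', 'NOUN']  (the context "a" pushes "scream" toward NOUN)
print(viterbi(["the", "girl", "heard", "a", "scream"]))
# → ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']
```

In a real system these probabilities are estimated from a tagged corpus rather than written by hand, but the decoding step works the same way.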
We also discussed several language translation issues. One of the very basic and logical approaches that I personally really liked (keeping aside the probabilistic math, the HMM, and the CRF approaches) is simply analyzing a set of parallel sentences to assist translation between two languages:
- (English): Four cats; cats from France.
- (French): Quatre chats; chats de France.
By just comparing the two sentences we can tell that “cats” is related to “chats” more than to any other word in the two sentences. You know how it goes: easy does it.
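That comparison can be sketched in a few lines (my own rough illustration, not the method from the lecture): count how often each English/French word pair co-occurs across the parallel sentences, and the pair seen together most often is the likely translation.

```python
from collections import Counter

# The two parallel sentence pairs from above.
parallel = [
    ("four cats", "quatre chats"),
    ("cats from france", "chats de france"),
]

# Count every English/French word pairing across the parallel sentences.
pair_counts = Counter()
for english, french in parallel:
    for e in english.split():
        for f in french.split():
            pair_counts[(e, f)] += 1

# "cats" co-occurs with "chats" in both pairs, more than with any other word.
print(pair_counts[("cats", "chats")])   # → 2
best = max(["quatre", "chats", "de", "france"],
           key=lambda f: pair_counts[("cats", f)])
print(best)                             # → chats
```

With only two sentence pairs the counts are tiny, but the same co-occurrence idea, scaled up and made probabilistic, underlies classic word-alignment models in statistical machine translation.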
If we Google “Chats de France translate to English,” it gives “Cats France” as a word-by-word translation. When you type “Cats from France,” it will give you “Chats de France.” The varying nature of “de” in the French language is more than enough to confuse your translator.
“The girl of Britain” and “The girl from Britain” both translate to “La fille de Grande Bretagne,” even though the first sentence doesn’t make a lot of grammatical sense. Examples like these, and many more, are representative of the subtle and intriguing problems faced by scientists working on natural language processing today.