Find translations of a given word in the corpus e.g. by machine learning, word2vec, text mining-CodePudding

I am using this thread to get some ideas and find some possibilities.

I have about 1000 sermons and their translations into another language. The lengths of the sermons are different. These are religious sermon texts. Because of the domain (religious), there are a lot of words that can be used in different ways based on the context. The same word can become a different meaning.

Is there a way, where I can get "programmatically" the translations of a given word in the aim language?

x1 -> [y2,z2,a2,b2,c2]
where x is the word in the language 1
and the returned array contains translations in the language 2

This would be the best case. Maybe this could be possible by training a translation model by using domain data, but I don't have a lot of data.

Could it be possible by using word2vec? By creating a vector space of both texts (language 1 and language 2) and by using a transformation matrix would it be possible to put the semantical meanings together?

Do you know other ways or have other ideas? Is there maybe such works already and what is these kinds of research called? I was not able to find something like this. I hope you guys have some ideas on how this could be reached.

The general purpose is "to create a tool" for researchers in this specific domain, that can be used to analyse sermons translation quality. If you have another ideas how the quality of a translation (semantically) can be analysed, I would be very thankful.

CodePudding user response：

To get the translation for a specific word in a sentence, you can use what’s called word alignment.

To get the quality of the translation, you can use what’s called quality estimation.

machinetranslate.org/quality-estimation

CodePudding user response：

A solution based on word vectors (FastText vectors are typically better than Word2Vec) is certainly possible. The task that you are looking for is bilingual dictionary induction. The most frequently used tool for that is VecMap that can align two embeddings spaces from two languages. It either uses a small seed dictionary to align all the words or it even can work in a completely unsupervised fashion.

Another solution is doing word alignment, i.e., statistically aligning words in the translations. Then you can get a dictionary based on the frequencies of how often the words are mapped to each other (note there might be problems when the languages differ morphologically). In this case, you can easily show examples of how the translations are used in sentences. If the languages you are interested in are covered by the XLM-R model, I recommend using SimAlign (a neural solution). If not, you can use Eflomal (a statistical solution).