
# How Machines Understand Words

Stefan Libiseller
August 14th, 2019 · 3 min read

Machine learning is, very simplified, the transformation of input numbers by learned numbers. This means that if words are to be the input of a machine learning model, they somehow have to be translated into numbers first.

One possibility is to encode each character and then feed that sequence to the ML model. While this is possible in theory, it is very hard for the ML model to pick up on the task you want it to perform. This is because, before it can focus on the task, it has to learn the complex mechanisms behind language and the meaning of words. It's the equivalent of taking a quiz in a language you don't speak.
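As a quick illustration of character-level encoding, each character can simply be mapped to its Unicode code point (a toy sketch, not how any particular model does it):

```python
# Encode text character by character: each character becomes its
# Unicode code point. A model fed this raw sequence would first have
# to learn language itself before it could solve any task.
def encode_chars(text):
    return [ord(c) for c in text]

print(encode_chars("dog"))  # [100, 111, 103]
```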

Encoding words as a whole is a much better idea, as it opens up the possibility of adding pre-trained knowledge of language and meaning to that encoding. This words-to-numbers translation is called a word embedding.

## Word Embedding

In a word embedding each word gets assigned a word vector which encodes its meaning relative to all other words. These vectors typically have between 50 and 300 dimensions (independent numbers). A common vocabulary size for word embeddings is about two million words.
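Conceptually, a word embedding is a lookup table from words to vectors. Here is a toy sketch (the words and 3-dimensional vectors below are made up for illustration, not fastText's real 300-dimensional ones):

```python
# A toy word embedding: each word maps to a small vector of floats.
# Real embeddings like fastText use 300 dimensions and a vocabulary
# of about two million words.
embedding = {
    "dog":   [0.8, 0.1, 0.3],
    "puppy": [0.7, 0.2, 0.4],
    "car":   [0.1, 0.9, 0.6],
}

def encode(sentence):
    # translate a sentence into a sequence of word vectors
    return [embedding[word] for word in sentence.split()]

print(encode("dog puppy"))
```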

To show you more about what word embeddings can do, I am using fastText, a pre-trained word embedding by Facebook that is publicly available. You can download the cc.en.300.vec word vector file here.

### Take a Look at a Word Vector

This Python code loads the fastText word vector for "dog".

```python
import torchtext

# load fastText vectors
fasttext = torchtext.vocab.Vectors('cc.en.300.vec')

def get_vector(word):
    assert word in fasttext.stoi, f'*{word}* is not in vocab!'
    return fasttext.vectors[fasttext.stoi[word]]

get_vector('dog')

# output
# ---------------------------------------------------------------------
[ 0.1680, -0.0013,  0.0162,  0.2779, -0.1062,  0.0366,  0.2043,  0.0642,
 -0.0115,  0.0582, -0.2252, -0.2130, -0.0762, -0.0495,  0.0449,  0.2431,
  ...
 -0.1585,  0.3027,  0.0942,  0.1540]
```

As you can see, the word 'dog' translates to an array of 300 float values, which a machine learning model can now process. Every word included in the embedding translates to such an array of 300 pre-trained numbers.

Often, the word embedding is the first layer in a deep learning model. This has the advantage that the word vectors themselves, just like the weights and biases of the neurons, can be further optimized during the training process.
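In PyTorch, such a trainable embedding layer can be sketched as follows (the matrix here is random purely for illustration; in practice you would load the fastText matrix):

```python
import torch
import torch.nn as nn

# tiny stand-in for a pretrained embedding matrix: 5 words, 4 dimensions
# (fastText would be roughly 2 million x 300)
pretrained = torch.randn(5, 4)

# freeze=False keeps the word vectors trainable, so they are fine-tuned
# together with the model's other weights during training
embedding_layer = nn.Embedding.from_pretrained(pretrained, freeze=False)

word_ids = torch.tensor([0, 3])       # vocabulary indices of two words
vectors = embedding_layer(word_ids)   # shape: (2, 4)
```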

### Related Words

Good word embeddings have properties that make it easier to process text. The most important one is that related words are close to each other.

The code below finds the closest n words by comparing all word vectors to the input word vector and calculating the distance between them. This is similar to how you would find the distance between two points in a 2D plane, just with 300 dimensions. It then sorts the words by distance and prints the n closest ones.

```python
import torch

def pretty_print(word_distance_list):
    for word, distance in word_distance_list:
        print(f"{word:{25}} {distance:{10}.{7}}")

def closest_words(word, n=3):
    # accept either a word or a vector directly
    vector = word if torch.is_tensor(word) else get_vector(word)
    distances = [(w, torch.dist(vector, get_vector(w)).item())
                 for w in fasttext.itos]
    return sorted(distances, key=lambda w: w[1])[:n]

pretty_print(closest_words('dog'))

# output
# ---------------------------------------------------------------------
dog                              0.0
dogs                        1.289913
puppy                       1.546394
```

As you can see, dog, dogs and puppy are close together in the 300-dimensional vector space. This property of word embeddings lets the model understand synonyms and gives it the power to process meaning.

### Analogy

Somewhat surprisingly, word vectors are also able to solve analogy problems. This works by applying simple math operations to the vectors. For example: subtract dog from puppy, which leaves you with a fictional 'baby' vector. Add cat to this vector and you should get kitty.

$$\text{puppy} - \text{dog} = \text{kitty} - \text{cat}$$

$$\text{puppy} - \text{dog} + \text{cat} = \text{kitty}$$

Sounds crazy? It somewhat is. But it works!

Here is the implementation in Python code:

```python
def x_to_y_like_a_to(x, y, a, n=3):
    print(f"{x} is to {y} like {a} to...")
    b = get_vector(y) - get_vector(x) + get_vector(a)
    possible_words = closest_words(b, n=n+3)
    # drop the input words themselves from the candidates
    solution = [word for word in possible_words
                if word[0] not in [x, y, a]][:n]
    pretty_print(solution)

x_to_y_like_a_to('dog', 'puppy', 'cat')

# output
# ---------------------------------------------------------------------
# dog is to puppy like cat is to...
kitty                     1.40337
kitten                    1.40879
kittens                   1.53741
```

As you can see, the vectors in the word embedding were chosen in such a way that these math operations are possible. There are many examples where such analogies work. Full disclosure: They don't work for every example.

Let's take a look at whether fastText has a sense of tenses:

```python
x_to_y_like_a_to('walk', 'walked', 'swim')

# output
# ---------------------------------------------------------------------
# walk is to walked like swim is to...
swam                      1.21056
swimming                  1.31170
swimmers                  1.32381
```

Yes, it does. And it even works for irregular verbs.

```python
x_to_y_like_a_to('France', 'Paris', 'Germany')

# output
# ---------------------------------------------------------------------
# France is to Paris like Germany is to...
Berlin                    0.68127
Munich                    0.71688
Frankfurt                 0.78771
```

Turns out you can learn a lot about the world if you read (or process) the internet.

### Out of Vocabulary Words

But what about words that are not in the vocabulary? No vector means no way to process that word. What now?

Often, unknown words are simply replaced by an <unk> token, which translates to a vector of all zeros. As long as the number of unknown words is small and the task isn't too complex, the model will still be able to give the correct prediction. For more complicated tasks, such as translation, it is not possible to simply ignore an unknown word.
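A minimal sketch of this fallback, using a plain dict in place of the real embedding (get_vector_safe and the toy vocabulary are made up for illustration):

```python
DIM = 300
UNK = [0.0] * DIM  # the <unk> vector: all zeros

def get_vector_safe(word, embedding):
    # return the word's vector, or the <unk> vector if it is unknown
    return embedding.get(word, UNK)

toy_vocab = {"dog": [0.1] * DIM}
print(get_vector_safe("blorptastic", toy_vocab) == UNK)  # True
```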

## Byte Pair Encoding

Byte pair encoding (BPE) solves this problem by working with frequent letter combinations instead. It dismantles words into predefined subword units and translates those into pre-trained vectors. For example, the word ending ing_ (the underscore represents the trailing space) would be translated into its own 'participle vector'. Including the space in the encoding makes it possible to distinguish between ing as a subword in the middle of a word and ing_ as a word ending.

The model learns to interpret sequences of word chunks rather than whole words. This technique gives it more flexibility to process unknown words while keeping the advantages of word-based embeddings over character-based processing.

The name byte pair encoding comes from the method, traditionally a compression algorithm, that is used to determine which letters should be combined into a single token (word part).

## Conclusion

For machines to process words, they must first be converted to numbers. Word embeddings translate the meaning of words into a dense vector representation. In this representation, words with the same meaning are close to each other. A problem with word embeddings is unknown words. For this reason, recent publications often use byte pair encoding, which vectorizes subwords instead of entire words.

Popular word embeddings:

- GloVe by Stanford
- spaCy by Explosion AI

Further resources:

- TorchText - text processing for PyTorch
- pytorch-sentiment-analysis - tutorials on getting started with PyTorch and TorchText

I create custom AI models for small to medium-sized companies who want to make their products stand out with deep learning. If you are interested in collaborating, send me an email or talk to me directly in a free video call!
