
Master your language

October 26, 2023 The Editorial Board - Teamware Solutions

 

We cannot emphasise this enough: master your language. We remember sprinkling our earlier articles with the same advice, positioned then in the context of emerging language-based ways to solve a problem quickly. Let us usher in the proverbial keyword: ChatGPT. The word ChatGPT is a specific reference to one project. Since its release, many adjectives have been attached to it: intelligent, artificial, generative, job gobbler and so on. Enough has been said in praise of its capabilities, and enough said to scare us about how many job titles it could make obsolete. We are not here to echo either. Please chat with ChatGPT to form your own view of it. We are here to ponder the foundational pillars of ChatGPT. For the rest of this article, we will refer to the technology rather than the brand, i.e., Generative AI.
The foundational pillars of Generative AI are –
1. Information representation
2. Searching through the piles of that representation
3. Constructing a response in natural language.
These topics have individually been areas of research in Computer Science for decades. So what changed, and why is there so much attention now? The way we see it, there are parallels to Moore's law: in the past decade we could do far more with a CPU than we could have a decade earlier. Add to that the reach of social media and influencers, and the plush investments showered on bleeding-edge tech companies, and you have the present moment.
As developers, it is for us to stay grounded and to understand and build on the fundamentals. The application of a technology can only be as good as our understanding of it. By application of technology, we mean our solutions to customers. Many customers want to pilot something with Generative AI, and an exciting start often drifts into a dull phase of stagnation if not steered well by architects and developers.
Information representation
Generative AI did not manifest out of thin air. It is built on written language. Let us spin you off-axis a bit: have you encountered Generative AI solutions in languages like Tamil, Hindi, German, Sanskrit, Arabic or Portuguese? If not, why not? We will leave that as a question and hope you give it some of your cognitive bandwidth at some point.

Words, no matter which language they are written in, are symbols whose meaning is well understood when exchanged between two humans. To accomplish something similar for computers, we need to transform these symbols. Such a transformation of symbols is called encoding. For any form of intelligence, encoding alone is not enough; it must also capture the context around each symbol. For all practical purposes, the context is the sequence of the symbols. Like this article! It is a sequence of words (symbols) that together convey something to you. One can argue that context is more than sequence; take a moment's pause and think about it. Words (symbols) mean something; that explains the topic of this article. The sequence of words conveys the idea we as authors want to communicate to you, and this idea, for practical purposes, constitutes the context.

To capture the context, we need to capture the sequence along with the encoding. The challenge is that there are infinitely many sequences, so how do we store them all? That is where probability comes in. In a language with structure, certain words co-appear with higher likelihood, per the grammar of that language and the way people actually use it. In the field of Computer Science, capturing this likelihood of co-occurrence is called word embedding.
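If you would like to see what capturing co-occurrence means in the smallest possible setting, here is a sketch in plain Python over a hypothetical two-sentence corpus (the corpus and window size are our own illustrative choices, not anything a framework prescribes):

```python
from collections import Counter

# A toy corpus: each sentence is a sequence of word symbols.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

# Count co-occurrences of word pairs within a +/-2 word window.
window = 2
pair_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            pair_counts[frozenset((w, words[j]))] += 1

# "sat" and "on" co-occur in both sentences, so their count is highest
# among pairs that appear in only one pattern of usage.
print(pair_counts[frozenset(("sat", "on"))])
```

A real embedding goes further and turns such statistics into dense vectors, but the raw material is exactly this kind of co-occurrence evidence.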

A developer soaked in writing code, in any programming language, will by now have assigned a data structure to this. Does an array come to mind? If so: an array of what? Needless to say, the CPU is conversant with numbers, so how about an array of floats?
If you jump in right away and fire up your laptop to write that encoding and embedding, you might soon hit a wall of challenges. Frameworks often help address such challenges; we will use the TensorFlow framework in this article.
The one line
More often than not, the use of a framework makes the code look rather straightforward to read. It takes away the complexity involved in getting things done. Something as complex as word embedding becomes this with the framework –
sentencesDenseVector = tf.keras.layers.Embedding(800, 8, input_length=20)
This line by itself is not sufficient. Come on! A framework can only do so much to help you. This boots up the embedding layer, but you need a few more things around it to get a result. We can take it for a spin right away by supplying a 1D vector of token ids to the layer –
sentencesDenseVector(tf.constant([3, 6, 9])).numpy()
The outcome of this will be a dense vector that does not resemble the input supplied.
array([[0.123456789, 0.987654321, 0.456789012, 0.789012345, 0.012345678, 0.345678901,
0.678901234, 0.901234567],[ -0.456789012, 0.789012345, -0.012345678, 0.345678901, -0.678901234,
0.901234567, -0.123456789, 0.987654321], [0.123456789, -0.987654321, -0.456789012, 0.789012345,
0.012345678, -0.345678901, 0.678901234, -0.901234567]], dtype=float32)
You will notice that this multidimensional array is not sparse, i.e., it is not dominated by a single repeated value (mostly zeros), which is what would happen if only an encoding like one-hot encoding were used. The array is dense, and its last dimension is the output dimension used while creating the embedding layer, i.e., 8.
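To see why the embedding output is dense while a one-hot encoding is sparse, consider this plain-Python sketch. The table of small random floats below stands in for an untrained embedding layer; it is our own illustration of the idea, not TensorFlow internals:

```python
import random

random.seed(0)

vocab_size, embed_dim = 800, 8

# A one-hot encoding of token id 3 is sparse: 799 zeros and a single one.
one_hot = [0.0] * vocab_size
one_hot[3] = 1.0

# An embedding layer is essentially a lookup table of dense rows, here
# initialised with small random floats as an untrained layer would be.
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

# Multiplying the one-hot vector by the table simply selects row 3,
# which is why an embedding lookup and a matrix product agree.
selected = [sum(one_hot[i] * table[i][d] for i in range(vocab_size))
            for d in range(embed_dim)]

assert selected == table[3]
print(len(selected))  # 8: one dense value per output dimension
```

This is also why training the layer amounts to nudging the rows of that table until co-occurring words end up with similar rows.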
The first parameter is the input dimension, where 800 represents the size of the vocabulary supplied to the layer. In other words, we expect the input data we will use to train the word embedding to contain 800 different words. This is something you can determine during exploratory analysis. Remember not to use the exact number, as things can grow; use a number larger than the count of unique words in your vocabulary. There are finer decision points too, such as whether to include stop words in this count or skip them. It all depends on the application.
The last parameter, input_length, is the length of the input supplied to the layer. We are technically incorrect in configuring 20 and supplying only 3 values. You can avoid this conundrum by leaving the parameter out altogether, so that the layer adapts to the size of the input. When you are working with sentences, it makes sense to tune this number to the average number of words in a sentence.
Applying this to real text involves some more work than declaring a constant. Text is prone to punctuation, spacing errors and other quirks that must be cleaned up before application code can process it. Since we are using TensorFlow, we can get that done quickly with a simple library call.
vectorizedText = tf.keras.layers.TextVectorization(max_tokens=800, output_mode='int', output_sequence_length=20)
In your neural network architecture, the vectorization happens before the embedding. In the layer we just built, you can additionally set the standardize parameter, which takes care of fixing spacing and punctuation and any other sanitisation you deem relevant. The max_tokens parameter corresponds to our guesstimate of the vocabulary size, and output_sequence_length corresponds to the context we introduced a while ago. In your experiments, it is important to tune these hyperparameters; it certainly will not be one size fits all.
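As a rough, plain-Python analogue of what such a vectorization layer does in 'int' mode, consider the sketch below. The toy vocabulary and the helper name vectorize are ours for illustration; in TensorFlow the vocabulary is learnt by calling adapt() on your corpus:

```python
import re

# Hypothetical toy vocabulary; index 0 is padding, index 1 is
# out-of-vocabulary, mirroring TextVectorization's conventions.
vocab = {"": 0, "[UNK]": 1, "the": 2, "cat": 3, "sat": 4, "on": 5, "mat": 6}
output_sequence_length = 20

def vectorize(text):
    # Standardise: lowercase and strip punctuation, roughly what the
    # default standardize option does.
    tokens = re.sub(r"[^\w\s]", "", text.lower()).split()
    ids = [vocab.get(t, 1) for t in tokens]         # 1 = unknown word
    ids = ids[:output_sequence_length]              # truncate long input...
    return ids + [0] * (output_sequence_length - len(ids))  # ...pad short

print(vectorize("The cat sat on the mat!"))
```

The output is always a fixed-length vector of integer token ids, which is exactly the shape the embedding layer expects as input.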

These layers will be configured in the neural network as –
model = Sequential([vectorizedText, sentencesDenseVector, GlobalAveragePooling1D(), Dense(8, activation='relu'), Dense(1)])
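Of these layers, GlobalAveragePooling1D deserves a moment of attention: it averages over the sequence axis, collapsing the (sequence length, embedding dimension) output of the embedding into a single vector per sentence. A plain-Python sketch of that collapse, with made-up numbers:

```python
# A hypothetical embedded sentence: 3 tokens, each an 8-dimensional vector.
embedded = [
    [0.1] * 8,
    [0.3] * 8,
    [0.5] * 8,
]

# GlobalAveragePooling1D averages over the sequence axis, collapsing
# (sequence_length, embed_dim) into a single (embed_dim,) vector.
pooled = [sum(row[d] for row in embedded) / len(embedded)
          for d in range(8)]

print(pooled)  # eight values, each the mean of 0.1, 0.3 and 0.5
```

The Dense layers that follow then operate on this fixed-size sentence vector regardless of how many words the sentence had.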
One key point you must remember: whatever embedding model you use to capture your data, use the same model to transform your input after commissioning, and the same model again to perform a search later. As you can see, there is not much involved in using word embeddings. Below the surface of the exposed API, however, a lot goes on. Often, word embeddings are something researchers or open-source enthusiasts share with each other; Word2Vec and GloVe come to mind. True to the title of this article, you must now understand that you need to master the language you choose to write and communicate your thoughts in. Incorrect sentences and poor spelling will lead to wrong embeddings, causing errors to creep into your models. Embedding is one way to make Generative AI work in languages other than English.
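To make the "same model for capture and search" point concrete, here is a sketch of a similarity search over hypothetical embedding vectors. Cosine similarity is one common choice of comparison; the four-dimensional vectors below are made up for illustration and would, in practice, all come from one and the same embedding model:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical document embeddings, all produced by the SAME model.
documents = {
    "doc_a": [0.9, 0.1, 0.0, 0.2],
    "doc_b": [0.1, 0.8, 0.3, 0.0],
}

# The query must be embedded with that same model before comparing.
query = [0.8, 0.2, 0.1, 0.1]

best = max(documents, key=lambda name: cosine_similarity(query, documents[name]))
print(best)  # doc_a points in nearly the same direction as the query
```

If the query were embedded with a different model, the coordinates would mean different things and the comparison would be meaningless, which is precisely why the key point above matters.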
