A book is more than what it is! – Part II

August 12, 2022 The Editorial Board- Teamware Solutions

– Exploratory analysis with Python

 

Recap

For readers who have landed here directly from a search engine or a forum, let us set the context quickly. We want to see whether a book author’s writing pattern can be expressed using Part of Speech (PoS) tags. Our underlying goal in this exercise is also to explore how data science experiments can be structured and organised so that real-world projects can scale.

On that journey, in Part I we set up the initial folder structure and added the code to read a page from the PDF file.

Setting up a feedback mechanism

Next, we print this text to the console. Two options came to mind. The simplest is to add a print statement to the function itself. The other is to have a separate script that does the printing. In languages like C# or Java, a Main method would do that for us. For now, let us do it inside the function itself.

 

   # `pdf` (the PDF library alias) and `path` are carried over from Part I.
   def read_first_page():
       print("Reading first page =>")
       file = pdf.open(path)
       first_page = file.pages[0]
       print(first_page.extract_text())

 

 

Now let us take this a step further and extract the text of the entire document.

 

   def read_entire_document():
       print("Reading entire document =>")
       file = pdf.open(path)
       # Collect the text of each page, one list entry per page.
       text = []
       for page in file.pages:
           text.append(page.extract_text())
       print(text)

 

 

You must have noticed the challenge. There are two functions, but we need only one entry point for the application. That sways us towards the other approach we mentioned earlier: a separate script. So, let us refactor the code. Many would call this fixing the code; in data science, as in any other kind of development, we urge you to think of it as refactoring. Come to think of it, we developers refactor code all day long.

The refactoring here is simple: we replace the last line of both functions so that they return the text instead of printing it, like this:

 

   def read_first_page():
       ...
       return text

   def read_entire_document():
       ...
       return text

 

 

Small things, like keeping the name of the returned value the same across functions, can make this kind of exploratory coding a lot easier.

Now we introduce the conductor for choosing the mode of exploration we want. Let us call the conductor Experiments.py. Isn’t it apt?
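To make the refactoring concrete, here is a minimal sketch of both readers returning a value named text. It rests on assumptions of ours: the functions receive the list of pages as a parameter instead of opening the file themselves, and FakePage is a hypothetical stand-in that lets the sketch run without a PDF. The `or ""` guard is there in case extract_text() hands back None for a page with no extractable text.

```python
def read_first_page(pages):
    """Return the text of the first page only."""
    text = pages[0].extract_text() or ""
    return text


def read_entire_document(pages):
    """Return the text of every page, one list entry per page."""
    text = [page.extract_text() or "" for page in pages]
    return text


class FakePage:
    """Hypothetical stand-in for a PDF page, used only for this demo."""

    def __init__(self, content):
        self._content = content

    def extract_text(self):
        return self._content


# Demo with two made-up pages instead of a real PDF.
demo_pages = [FakePage("Chapter one."), FakePage("Chapter two.")]
print(read_first_page(demo_pages))
print(read_entire_document(demo_pages))
```

Because both functions return their result under the same name, the calling script can swap one reader for the other without further changes.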

 

   # Experiments.py
   import basic_file_read as document_reader

   print("Host for the experiments that you want to be done here…")

   # Pick the reader to run; swap in read_entire_document() as needed.
   content = document_reader.read_first_page()
   print(content)
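If the conductor is to choose the mode of exploration, one possible shape is a small lookup table from a mode name to the function that runs it. The sketch below is our illustration, not part of the project code: EXPERIMENTS, run_experiment and the mode names are hypothetical, and the two readers are replaced by placeholders that return canned text so the sketch runs on its own.

```python
def read_first_page():
    # Placeholder for the reader in basic_file_read; returns canned text.
    return "text of the first page"


def read_entire_document():
    # Placeholder for the reader in basic_file_read; returns canned text.
    return ["text of the first page", "text of the second page"]


# Maps a mode name to the function that runs that experiment.
EXPERIMENTS = {
    "first-page": read_first_page,
    "entire-document": read_entire_document,
}


def run_experiment(mode):
    """Look up an experiment by name, run it and return its result."""
    try:
        experiment = EXPERIMENTS[mode]
    except KeyError:
        known = ", ".join(sorted(EXPERIMENTS))
        raise ValueError(f"Unknown mode {mode!r}; choose one of: {known}") from None
    return experiment()


content = run_experiment("first-page")
print(content)
```

The mode name could just as well come from a command-line argument; it is hard-coded here to keep the sketch short.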

 

 

This gives us one place to compose the sequence of experiments. We can call as many as we want. The next challenge comes when we want to switch to a different set of experiments: we then lose the current one. Keeping all of them together in this file also clutters it. We propose using Git and tagged commits to manage such check-ins. Though it may feel like being Inspector Gadget without the manual at first, it becomes easier with practice.

We could instead opt for a best-in-class IDE with a workbench. That works as well. But starting from the basics always helps the roots go deeper.

Coming back to the task at hand: we have a conductor for the experiments and a module that reads the PDF file as raw text.

Next, we take up inferring the Part of Speech tags from this text. For that we need a library: NLTK. It is a basic library, but it will suffice for now. If you have worked with it before, you will know that it gives PoS tags as abbreviations drawn straight from the Penn Treebank. Let us first build a summary of them so that we need not memorise them. We made it a dictionary, as the set of tags is finite.

 

   # english_pos_tags.py

   def tag_descriptions():
       # Penn Treebank PoS tags mapped to short descriptions.
       return {
           'CC': 'Coordinating conjunction',
           'CD': 'Cardinal number',
           'DT': 'Determiner',
           'EX': 'Existential there',
           'FW': 'Foreign word',
           'IN': 'Preposition or subordinating conjunction',
           'JJ': 'Adjective',
           'JJR': 'Adjective comparative',
           'JJS': 'Adjective superlative',
           'LS': 'List item marker',
           'MD': 'Modal',
           'NN': 'Noun singular or mass',
           'NNS': 'Noun plural',
           'NNP': 'Proper noun singular',
           'NNPS': 'Proper noun plural',
           'PDT': 'Predeterminer',
           'POS': 'Possessive ending',
           'PRP': 'Personal pronoun',
           'PRP$': 'Possessive pronoun',
           'RB': 'Adverb',
           'RBR': 'Adverb comparative',
           'RBS': 'Adverb superlative',
           'RP': 'Particle',
           'SYM': 'Symbol',
           'TO': 'to',
           'UH': 'Interjection',
           'VB': 'Verb base form',
           'VBD': 'Verb past tense',
           'VBG': 'Verb gerund or present participle',
           'VBN': 'Verb past participle',
           'VBP': 'Verb non-3rd person singular present',
           'VBZ': 'Verb 3rd person singular present',
           'WDT': 'Wh-determiner',
           'WP': 'Wh-pronoun',
           'WP$': 'Possessive wh-pronoun',
           'WRB': 'Wh-adverb'
       }

 

 

We will consume this function after generating the PoS tags in our next dispatch.
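As a small preview of how the dictionary might be consumed, the sketch below pairs each tagged word with its description. The describe helper is a hypothetical name of ours, the dictionary is trimmed to four entries to keep the sketch short, and the (word, tag) pairs are written by hand in the list-of-tuples shape that nltk.pos_tag returns; no tagger is run here.

```python
def tag_descriptions():
    # Trimmed copy of the dictionary, for illustration only.
    return {
        "DT": "Determiner",
        "JJ": "Adjective",
        "NN": "Noun singular or mass",
        "VBZ": "Verb 3rd person singular present",
    }


def describe(tagged_tokens):
    """Attach a readable description to each (word, tag) pair."""
    descriptions = tag_descriptions()
    return [
        (word, tag, descriptions.get(tag, "Unknown tag"))
        for word, tag in tagged_tokens
    ]


# Hand-written sample; a real run would take these pairs from the tagger.
sample = [("The", "DT"), ("book", "NN"), ("is", "VBZ"), ("old", "JJ"), (".", ".")]
for word, tag, description in describe(sample):
    print(f"{word:<6}{tag:<5}{description}")
```

Punctuation gets tags of its own (such as "." for a full stop) that are not in our dictionary, so the lookup falls back to a default instead of raising a KeyError.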
