
A book is more than what it is! (Part III)

September 23, 2022 – The Editorial Board, Teamware Solutions

 

Exploratory analysis with Python

Recap

We have so far read a page from the book we are interested in. We have also set up a utility function to map each PoS tag to a suitable description. If you have landed here directly from the conduits of the internet: we aspire to see whether there is a writing pattern in literary works that can be expressed using Part of Speech (PoS) tags.
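For readers joining here, the utility from the previous part can be sketched roughly as follows; the dictionary holds only a few illustrative Penn Treebank entries, not the full tag set, and the names are our own:

```python
# A sketch of the utility from Part II: map an nltk PoS tag to a
# human-readable description. Only a few illustrative entries are
# listed here, not the full Penn Treebank tag set.
TAG_DESCRIPTIONS = {
    "NN": "noun, singular",
    "NNS": "noun, plural",
    "VB": "verb, base form",
    "DT": "determiner",
    "JJ": "adjective",
}

def describe_tag(tag):
    return TAG_DESCRIPTIONS.get(tag, "unknown tag")

print(describe_tag("NN"))  # noun, singular
```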

We now move towards tokenizing the page. We talked about representing each word as a node, its PoS tag as a label, and the relationship between nodes as positional occurrence in a sentence.

Reading text

We have so far built a scaffolding function that looks like this –

    def read_entire_document(path):
        print("Reading entire document =>")
        file = pdf.open(path)
        text = []
        for page in file.pages:
            text.append(page.extract_text())
        print(text)

 

To tokenize the document we can either write our own tokenizer or use an NLP library to do it for us. We choose the latter. For doing so we need to read the content of the document line by line, and we do this by introducing a split –

    pageContent = page.extract_text()
    lines = pageContent.split("\n")

 

The whole function will then look like this –

    def read_entire_document(path):
        print("Reading the entire document =>")
        file = pdf.open(path)
        text = []
        for page in file.pages:
            pageContent = page.extract_text()
            content = pageContent.split("\n")
            text += [line.strip() for line in content if line != "\n" and line.strip() != ""]
        return text
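The list comprehension doing the filtering can be exercised on its own; the sample lines below are illustrative:

```python
# The same filtering used in read_entire_document: strip each line
# and drop the ones that are empty or consist only of a newline.
content = ["First line  ", "\n", "", "  Second line", "   "]
cleaned = [line.strip() for line in content if line != "\n" and line.strip() != ""]
print(cleaned)  # ['First line', 'Second line']
```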

 

There are a few variations here that need to be highlighted. We are taking the appearance of a newline to indicate a line. This makes sense when you are looking at the document; it does not, however, make sense when you are trying to interpret the text in a literary sense. From that point of view we would have to split the text into sentences using the period character, which separates sentences in English. That, however, is not going to be straightforward: we would have to handle edge cases like "i.e." appearing in the text, since such fragments are not exactly sentences or lines.
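To give a flavour of what handling such edge cases involves, here is a minimal sketch of a naive sentence splitter that masks a handful of known abbreviations before splitting. The abbreviation list and the <DOT> marker are illustrative choices of ours; a real tokenizer (for instance nltk's sent_tokenize) handles far more cases:

```python
import re

# A handful of abbreviations to protect; a real list would be much
# longer. Longer forms come first so "Mrs." is masked before "Mr.".
ABBREVIATIONS = ("i.e.", "e.g.", "Mrs.", "Mr.", "Dr.")

def split_sentences(text):
    # Temporarily mask the periods inside known abbreviations so the
    # split below does not treat them as sentence boundaries.
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", "<DOT>"))
    # Split after sentence-ending punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.replace("<DOT>", ".").strip() for p in parts if p.strip()]

print(split_sentences("Mr. Smith arrived, i.e. he finally came. We were glad."))
# ['Mr. Smith arrived, i.e. he finally came.', 'We were glad.']
```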

Line vs Page and PoS

The other point worth consideration is whether PoS tagging should be done per sentence or for the whole content across pages. Let us see whether the answer to this question impacts the performance of the data parsing, or whether it affects the theory we set out to validate. The quickest way to find out is to conduct a quick experiment.

 

We initially do this in the command-line interpreter. On second thought, this can be an experiment in itself. We name the experiment after the theory it validates, namely PoS_Difference_LineVsFile –

   

from nltk import pos_tag, word_tokenize
import pdfplumber as pdf
from basic_file_read import read_entire_document

print("QUESTION – Does PoS differ when derived from a statement or the entire text?")

pdf_text = [page.extract_text() for page in pdf.open("Sample.pdf").pages]

print("We join the text of individual pages with a space and get the PoS tags in the same line")
full_text_pos = pos_tag(word_tokenize(" ".join(pdf_text)))

print("We now bring in the PoS for individual sentences")
lines = read_entire_document("Sample.pdf")
tagged_lines = [pos_tag(word_tokenize(line)) for line in lines]
flattened_pos = [item for p1 in tagged_lines for item in p1]

print("Let us now compare the two PoS sets.")
print(set(full_text_pos) ^ set(flattened_pos))

 

If the above code prints anything other than an empty set on the console, then there are differences. For any literary work that you take, we are fairly confident you will always see differences printed after the above experiment.
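The ^ operator on sets is the symmetric difference: the elements present in exactly one of the two sets. A tiny illustration with made-up (word, tag) pairs:

```python
# Symmetric difference: pairs present in exactly one of the two sets.
line_pos = {("book", "NN"), ("run", "VB")}
full_pos = {("book", "VB"), ("run", "VB")}
print(line_pos ^ full_pos)  # the two differing taggings of 'book'
```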

That is because Part of Speech tagging assigns a tag in the context of the word's usage. When we supply just a line, the context is different from when we supply the entire text in a single go.

The outcome of the experiment is that we will use the entire text, and we will copy the code which we already have for that purpose.

Re-organising code

We will move that code to a script of its own. Let us call it nlp/tag_text.py –

    from nltk import pos_tag, word_tokenize

    def tag_words(corpus):
        return pos_tag(word_tokenize(corpus))

Needless to say, the __init__.py will export tag_words for use in the main set of experiments.
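A minimal nlp/__init__.py for that purpose could read (a sketch; adapt to your package layout):

```python
# nlp/__init__.py — re-export the tagger for callers of the package.
from .tag_text import tag_words

__all__ = ["tag_words"]
```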

Storing in graph storage

We have come this far without storing data anywhere. As in our introductory article, we intend to capture the data with PoS tags in graph storage. We will use Cosmos DB as our graph storage. Before we create a new folder or script, let us take a look at our folder structure –

 

(-)— experiments
|——__init__.py
|——basic_file_read.py
|——PoS_Difference_LineVsFile
(-)— data
|——tale_of_two_cities.pdf
|——count_of_monte_cristo.pdf
(-)— nlp
|——__init__.py
|——tag_text.py
|
|— Orchestrator.py

Now let us create a new folder that deals with storage and name it storage. Rather than an experiment, this one puts the data into graph storage in an agreed-upon structure; thus it would not be appropriate to call it an experiment. Let us take another look at our folder structure –

(-)— experiments
|——__init__.py
|——basic_file_read.py
|——PoS_Difference_LineVsFile
(-)— data
|——tale_of_two_cities.pdf
|——count_of_monte_cristo.pdf
(-)— nlp
|——__init__.py
|——tag_text.py
(-)— storage
|——__init__.py
|——tagged_words_to_cosmosdb.py
|
|— Orchestrator.py

Here is an outline of what we must do next:

1. We must create a node for each word.
   a. The plausible question is: what happens when we encounter repetition?
2. Each PoS tag must be stored as a label in the graph storage.
3. Nodes/vertices must be connected using edges which have a number property that indicates the connectedness of two consecutive words. The number must reflect the position of the two words.
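Before touching Cosmos DB itself, the three points above can be sketched in plain Python to pin down the structure we intend to store; the function and property names here are our own illustrations, not the eventual storage code. Note how setdefault collapses repeated words into a single node, which is one plausible answer to the repetition question:

```python
def build_graph(tagged_words):
    # nodes: one entry per distinct word, the PoS tag as its label.
    # edges: consecutive words linked with a position number that
    # records where the pair occurs in the sentence.
    nodes = {}
    edges = []
    for position, (word, tag) in enumerate(tagged_words):
        nodes.setdefault(word, tag)  # repeated words collapse here
        if position > 0:
            prev_word = tagged_words[position - 1][0]
            edges.append((prev_word, word, position))
    return nodes, edges

tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]
nodes, edges = build_graph(tagged)
print(nodes)
print(edges)
```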

With these considerations in mind, we will focus our entire effort on storing the data in Cosmos DB in the next dispatch. Till then, happy analysing.

 
