A book is more than what it is! (Part III)
Exploratory analysis with Python
Recap
So far we have read a page from the book we are interested in. We have also set up a utility function that maps a PoS tag to a suitable description. If you have landed here directly from the conduits of the internet: we want to see whether literary works show a writing pattern that can be expressed using Part of Speech (PoS) tags.
We now move on to tokenizing the page. Earlier we talked about representing each word as a node, with its PoS tag as the label and the relationship between nodes being their positional occurrence in a sentence.
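To make the node-and-edge idea concrete, here is a minimal sketch of how one tagged sentence could be turned into nodes and positional edges in plain Python. The names Node, Edge, sentence_to_graph and the "followed_by" relation are our own assumptions for illustration; they are not the structure we will eventually store, and the tags are written by hand rather than produced by a tagger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    position: int   # index of the word within the sentence
    word: str
    label: str      # the PoS tag acts as the node label


@dataclass(frozen=True)
class Edge:
    source: int     # position of the preceding word
    target: int     # position of the word that follows it
    relation: str   # hypothetical relationship name


def sentence_to_graph(tagged_sentence, relation="followed_by"):
    """Turn [(word, tag), ...] into word nodes and positional edges."""
    nodes = [Node(i, word, tag) for i, (word, tag) in enumerate(tagged_sentence)]
    # Link every word to the word that comes right after it.
    edges = [Edge(i, i + 1, relation) for i in range(len(nodes) - 1)]
    return nodes, edges


# Hand-written tags for illustration; a real tagger may assign different ones.
tagged = [("It", "PRP"), ("was", "VBD"), ("the", "DT"), ("best", "JJS")]
nodes, edges = sentence_to_graph(tagged)
print(nodes)
print(edges)
```

A sentence of n words gives n nodes and n - 1 edges, so the order of the words can be recovered by walking the edges.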
Reading text
So far we have built the scaffolding of a function that looks like this:
import pdfplumber as pdf

def read_entire_document(path):
    print("Reading entire document =>")
    file = pdf.open(path)
    text = []
    # Collect the extracted text of every page
    for page in file.pages:
        text.append(page.extract_text())
    print(text)
To tokenize the document we can either write the tokenizer ourselves or use an NLP library to do it. We choose the latter. To do so we need to read the content of the document line by line, which we do by introducing a split –
pageContent = page.extract_text()
lines = pageContent.split("\n")
The whole function will then look like this –
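import pdfplumber as pdf

def read_entire_document(path):
    print("Reading the entire document =>")
    text = []
    # The context manager closes the PDF once we are done reading
    with pdf.open(path) as file:
        for page in file.pages:
            # extract_text() may return nothing for a page without text
            pageContent = page.extract_text() or ""
            content = pageContent.split("\n")
            # Keep only non-empty lines, trimmed of surrounding whitespace
            text += [line.strip() for line in content if line.strip() != ""]
    return text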
A few aspects of this need to be highlighted. We are taking the appearance of a newline to indicate a line. This makes sense when you are looking at the document. It does not make sense, however, when you are trying to interpret the text in a literary sense. From that point of view we would have to split the text into sentences using the period character, which separates sentences in English. That is not going to be straightforward, though. We will have to handle fringe cases such as “i.e.” appearing in the text. The fragments produced by splitting at such periods are not exactly sentences or lines.
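To show what handling such fringe cases might involve, here is a rough sketch of a sentence splitter that ignores periods belonging to known abbreviations. The function name split_sentences and the ABBREVIATIONS list are assumptions made for this illustration, the sample text is made up, and the approach is deliberately naive; it is not what we use later in this series.

```python
import re

# Assumption: a small, hand-picked abbreviation list; extend it for your text.
ABBREVIATIONS = {"i.e.", "e.g.", "mr.", "mrs.", "dr.", "st."}


def split_sentences(text):
    """Split at ., ! or ? followed by whitespace, skipping known abbreviations."""
    sentences = []
    start = 0
    # A candidate boundary is end punctuation followed by whitespace.
    for match in re.finditer(r"[.!?]+\s+", text):
        candidate = text[start:match.end()].strip()
        # Look at the last word before the boundary, ignoring opening brackets or quotes.
        last_word = candidate.split()[-1].lower().lstrip("([\"'")
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(candidate)
        start = match.end()
    # Whatever is left after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Made-up sample text for illustration.
sample = ("It was a good year, i.e. a year of plenty. "
          "It was also a hard year. Mr. Lorry waited.")
for sentence in split_sentences(sample):
    print(sentence)
```

The sketch still gets some cases wrong, for example an abbreviation that genuinely ends a sentence. NLTK ships a sent_tokenize function that handles many of these cases, so a hand-rolled splitter like this one is mainly useful for seeing why the problem is not trivial.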
Line vs Page and PoS
The other point worth considering is whether PoS tagging should be done per sentence or for the whole content across pages. Does the answer affect only the performance of the data parsing, or does it affect the theory we set out to test? The quickest way to find out is to conduct an experiment.
We initially did this in the command line interpreter. On second thought, it deserves to be an experiment of its own. We name the experiment after the theory it validates: PoS_Difference_LineVsFile
from nltk import pos_tag, word_tokenize
import pdfplumber as pdf
from basic_file_read import read_entire_document

# NLTK needs its tokenizer and tagger data downloaded once (via nltk.download)

print("QUESTION – Does PoS differ when derived from a statement or the entire text?")

# extract_text() may return nothing for a page without text
pdf_text = [page.extract_text() or "" for page in pdf.open("Sample.pdf").pages]

print("We join the text of individual pages with a space and get the PoS tags in the same line")

full_text_pos = pos_tag(word_tokenize(" ".join(pdf_text)))

print("We now bring in the PoS for individual sentences")

lines = read_entire_document("Sample.pdf")

tagged_lines = [pos_tag(word_tokenize(line)) for line in lines]

# Flatten the per-line lists into a single list of (word, tag) pairs
flattened_pos = [item for p1 in tagged_lines for item in p1]

print("Let us now compare the two PoS sets.")

# Symmetric difference: pairs that appear in only one of the two sets
print(set(full_text_pos) ^ set(flattened_pos))
If the set printed by the above code is not empty, there are differences. For any literary work you take, we are fairly confident that the printed set will not be empty.
That is because Part of Speech tagging assigns a tag based on the context in which the word is used. When we supply just a line, the context is different from when we supply the entire text in a single go.
The outcome of the experiment is that we will tag the entire text, and we will reuse the code we already have for that purpose.
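One caveat about comparing with sets: a set records only whether a (word, tag) pair occurs at all, not how often. The sketch below, which assumes a hypothetical helper named tag_differences and uses hand-written tags rather than real tagger output, compares occurrence counts with collections.Counter so that repeated words are not hidden.

```python
from collections import Counter


def tag_differences(tags_a, tags_b):
    """Return the (word, tag) pairs that occur more often in one tagging than the other."""
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    more_in_a = count_a - count_b   # Counter subtraction keeps only positive counts
    more_in_b = count_b - count_a
    return more_in_a, more_in_b


# Hand-written tags for illustration only, not real tagger output.
full_text = [("the", "DT"), ("run", "NN"), ("the", "DT"), ("run", "NN")]
per_line = [("the", "DT"), ("run", "NN"), ("the", "DT"), ("run", "VB")]

# The set comparison reports only the pair that is entirely new.
print(set(full_text) ^ set(per_line))

# The count comparison also shows that ("run", "NN") occurs one time fewer per line.
more_in_full, more_in_lines = tag_differences(full_text, per_line)
print(more_in_full)
print(more_in_lines)
```

For the question we ask in this experiment the set comparison is enough, since any non-empty result already shows that the two approaches disagree.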
Re-organising code
We will move that code to a script of its own. Let us call it nlp/tag_text.py
from nltk import pos_tag, word_tokenize

def tag_words(corpus):
    # Tokenize the whole text, then tag each token with its part of speech
    return pos_tag(word_tokenize(corpus))
Needless to say, the package's __init__.py will export tag_words for use in the main set of experiments.
Storing in graph storage
We have come this far without storing data anywhere. As in our introductory article, we intend to capture the data with PoS tags in graph storage. We will use Cosmos DB as our graph storage. Before we create a new folder or script, let us take a look at our folder structure –
(-) experiments
 |-- __init__.py
 |-- basic_file_read.py
 |-- PoS_Difference_LineVsFile
(-) data
 |-- tale_of_two_cities.pdf
 |-- count_of_monte_cristo.pdf
(-) nlp
 |-- __init__.py
 |-- tag_text.py
Orchestrator.py
Now let us create a new folder that deals with storage and call it storage. Rather than running an experiment, this code puts the data into graph storage in an agreed-upon structure, so it would not be appropriate to call it an experiment. Let us take another look at our folder structure –
(-) experiments
 |-- __init__.py
 |-- basic_file_read.py
 |-- PoS_Difference_LineVsFile
(-) data
 |-- tale_of_two_cities.pdf
 |-- count_of_monte_cristo.pdf
(-) nlp
 |-- __init__.py
 |-- tag_text.py
(-) storage
 |-- __init__.py
 |-- tagged_words_to_cosmosdb.py
Orchestrator.py
Here is an outline of what we must do next;
With these considerations in mind, we will focus our entire effort on storing data in Cosmos DB in the next dispatch. Till then, happy analysing.