
A book is more than what it is!

July 20, 2022 The Editorial Board- Teamware Solutions

– Using Python to analyze a book

 

The big picture

An avid reader goes through a book cover to cover and soaks in the experience. By book we mean fiction, for now. How would a data scientist look at a book? As a bag of words, of course. You must have seen that coming from miles away, but it is true. We took off with this idea and started wondering: is there a correlation between an author and their style of writing? It was a fun exercise, and it also gave us a good hard look at how to chase exploratory work in data science. We share our journey with you here. It is a joy ride with thoughts, some code and some flaws. One thing we can assert: by the last paragraph of this dispatch you, our reader, will have a feel for how exploratory work is pursued under real uncertainty. You will also meet points of view that differ from your own, or that resonate with what you were pondering earlier.

Let us dive in.

Writing style

Think about it: what is a writing style? Limiting ourselves to the language this article is written in, English, it is the usage of words in a particular manner. So what is that particular manner? Drop the word "particular" and ask in what manner an author writes: isn't it the choice of how to sequence words by part of speech, such as noun, verb, adjective and so on? English has a set of rules, called grammar, that defines how these parts of speech must be arranged. Still, creative freedom lets authors arrange them in peculiar ways. We took flight with that definition of manner. We wanted to establish whether there is a strong correlation between an author and such arrangements of words across their publications and books.
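To make "sequencing words by part of speech" concrete, here is a minimal sketch that counts adjacent part-of-speech pairs in a sentence. It is only an illustration under assumptions: the tags are assigned by hand rather than by a real tagger, and the function name pos_bigrams and the tag names are ours, not something from the experiment itself.

```python
from collections import Counter

# Hand-tagged tokens as (word, part of speech). The tags are assigned by hand
# for illustration; a real pipeline would get them from a POS tagger.
tagged_sentence = [
    ("It", "PRON"), ("was", "VERB"), ("the", "DET"),
    ("best", "ADJ"), ("of", "ADP"), ("times", "NOUN"),
]


def pos_bigrams(tagged):
    """Count adjacent part-of-speech pairs in one tagged sentence."""
    tags = [tag for _, tag in tagged]
    # zip(tags, tags[1:]) pairs every tag with the tag that follows it.
    return Counter(zip(tags, tags[1:]))


profile = pos_bigrams(tagged_sentence)
for pair, count in profile.items():
    print(pair, count)
```

Summed over a whole book, counts like these give one crude fingerprint of how an author arranges parts of speech, which is the kind of pattern we want to compare across publications.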

Facets

There are multiple facets to this exploration. First we want to collect publications by an author, then we need to prepare a model for each book's content, and then we need to find these patterns in parts of speech. One thing at a time: we picked the data model first. Whatever we do in the other facets will become easier or harder depending on the data model. We are also aware that early decisions may turn out not to be the right ones, so we adapt in flight. Whatever we lay down now must not be something we are so in love with that we are reluctant to change it as we discover new facets or challenges en route.

Data model

The best data model for this case, we thought, is a graph data structure. We also believed we would be better off using a graph database to convert the textual data into an analytical structure. In our model, words are vertices and a sentence forms the edges between them. The sequence in which words occur in the sentence, we decided, is best placed as a property on the edge.

Next, going by our initial thought, we also needed the part of speech to be represented, so we put it in the label, because we anticipate querying on the label more than on the individual word itself. That gave us a model of word vertices labelled by part of speech, joined by sentence edges that carry the word order.
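The model described above can be sketched in plain Python before committing to any graph database. This is a minimal in-memory sketch, not the schema we ended up with: the class names Vertex, Edge and BookGraph, the sentence_id field, and the choice to link each word to the next word in its sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Vertex:
    word: str
    label: str  # part of speech, kept as the label so it is easy to query


@dataclass(frozen=True)
class Edge:
    source: Vertex
    target: Vertex
    sentence_id: int  # which sentence this edge belongs to (assumed field)
    sequence: int     # position of the source word within that sentence


@dataclass
class BookGraph:
    vertices: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_sentence(self, sentence_id, tagged_words):
        """Add one sentence given as (word, part_of_speech) pairs."""
        nodes = [Vertex(word.lower(), pos) for word, pos in tagged_words]
        self.vertices.update(nodes)
        # Link each word to the one that follows it in the sentence.
        for sequence, (left, right) in enumerate(zip(nodes, nodes[1:])):
            self.edges.append(Edge(left, right, sentence_id, sequence))

    def by_label(self, label):
        """Return the sorted words that carry the given part-of-speech label."""
        return sorted(v.word for v in self.vertices if v.label == label)


graph = BookGraph()
graph.add_sentence(0, [("The", "DET"), ("book", "NOUN"), ("sings", "VERB")])
print(graph.by_label("NOUN"))
```

Keeping the part of speech on the vertex label means a query such as by_label never has to look at the words themselves, which is the reason we gave for putting it there.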

Populate the data model

A data model is great, but now the rubber needs to meet the road: we needed to load data into this model. Our data is a PDF file, the entire book. Before getting to the specifics of loading data, we realized we needed a sustainable folder structure, one that becomes second nature so that we do not have to work out each time why a file is located where it is. So we devised this:

 

experiments/
|-- __init__.py
|-- basic_file_read.py
|-- […]
data/
|-- tale_of_two_cities.pdf
|-- count_of_monte_cristo.pdf
|-- […]

 

The folder structure is self-explanatory, though this is not all of it. The pattern is that every experiment is persisted as an individual script file, and data is collected in one place. These folders are treated as packages in Python, so each has an __init__.py that holds code. That file looks like this:

from .basic_file_read import read_document
from .basic_file_read import read_first_page
from .basic_file_read import read_file_per_line
from .basic_file_read import save_textas_csv
from .exploratory_analysis import statistical_summary

Fairly quickly we needed another folder, called Analysis, where we could store any intermediate data file that we intend to inspect. We will be doing that quite often, won't we?
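As an illustration of the kind of intermediate file that could land in such a folder, here is a minimal sketch that writes extracted lines of text to a CSV file. The function name save_lines_as_csv, its signature and the two-column layout are assumptions for this example; it is not the save_textas_csv helper imported above, whose body is not shown in this dispatch.

```python
import csv
from pathlib import Path


def save_lines_as_csv(lines, target):
    """Write each line of text as one numbered CSV row and return the path."""
    target = Path(target)
    # Create the folder (for example an analysis folder) if it is missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["line_number", "text"])
        for number, text in enumerate(lines, start=1):
            writer.writerow([number, text])
    return target
```

A call such as save_lines_as_csv(page_text.splitlines(), "analysis/first_page.csv") would then leave a file that can be opened in any spreadsheet tool for a quick look; both the variable page_text and the file name are placeholders.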

In case you are wondering why we did not launch a Jupyter notebook server and run a single notebook: that is very much possible. So is this approach, i.e. you cook up individual script files and stitch them together via a main runtime file. We happened to pick the latter. Before we wrap up, let us get the PDF reading out of the way, as it is the more trivial part:

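import pdfplumber as pdf

def read_first_page(path):
    """Return the text of the first page of the PDF at path."""
    print("ReadFirstPage =>")
    # Open with a context manager so the file is closed afterwards.
    with pdf.open(path) as file:
        first_page = file.pages[0]
        return first_page.extract_text()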

 

 

The code above relies on the pdfplumber library. We will continue from here in the next dispatch, where we will also take a look at reorganising the code structure a bit for convenience.
