
A book is more than what it is (Part IV)

September 30, 2022 · The Editorial Board, Teamware Solutions

Exploratory analysis with Python

Recap

We have so far read the text out of PDF files and run PoS tagging on it. Along the way we also experimented with whether PoS tags differ when generated from the entire corpus versus from a section of the text. The outcome: there is a difference, and it is more reliable to generate PoS tags from the entire corpus. In this dispatch we begin the journey of loading this data into Cosmos DB.

Principle

So far we have read and stored data in the memory of the process. A common risk with this approach is that we must re-run the entire process to inspect even a small aspect of the processing. This quickly grows painful in the early stages of experimentation: there will be many iterations, and re-running the whole process after every few of them becomes counterproductive. It is therefore prudent to create an exit ramp in experimental processes. The exit ramp is essentially saving the data to disk.

But wait, are we not already doing that with Cosmos DB? Yes, certainly. But you will agree that the cost of writing code to save data to CSV is far cheaper than persisting it to Cosmos DB. Cosmos DB is a great destination for analysing the data, but it is not a good low-cost destination for intermediate storage.

So the principle we want to highlight is: create enough exit ramps in your operations to aid analysis, and to let the entire experimentation process be resumed midway, multiple times.

Creating the exit ramp

Let us start with the simplest, most basic approach to writing CSV, namely –

 

print("column1 value, column2 value, column3 value\n")

 

Yes, it is that simple. But there are many small things that will look like rocks if not managed well, e.g. a comma appearing inside the text itself. We would have to escape such fields with double quotes to stay CSV-safe. But we can take an alternate route: use a library that is known for such processing and is commonly used, namely pandas. We install it with –
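Before reaching for pandas, it is worth seeing how the standard library already handles the comma problem. A minimal sketch using Python's built-in csv module, which quotes fields containing commas automatically (the sample text here is illustrative):

```python
import csv
import io

# A field containing a comma would break naive print-based CSV writing.
rows = [["word", "tag"], ["Hello, world", "UH"]]

# csv.writer applies minimal quoting: only fields that need it get quoted.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)

# The embedded comma is wrapped in double quotes, keeping the CSV valid.
print(buffer.getvalue())
```

With pandas we get the same safety plus the data-frame tooling on top, which is why we take that route here.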

   

pip install pandas

 

Using pandas, we first store the data in a data frame and then call the pandas function that saves the data safely to CSV.

Next, we decide where to place the exit ramp. It is prudent to place it after the PoS tags are generated. Based on what we need to store in Cosmos DB, the CSV needs these columns –

Word as in text | POS Tag
data            | data

To get there, we need to create a pandas data frame with these two columns. Let us save this script in the same location as the client for storing data in Cosmos DB, and call it storage\export_data_csv.py

import pandas as pd

def export_to_csv(tagged_data, csv_file_path):
    # tagged_data is a dict mapping each column name to a list of values
    data_df = pd.DataFrame(tagged_data)
    # index=False keeps the row index out of the file
    data_df.to_csv(csv_file_path, index=False)
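A quick way to sanity-check this helper is to round-trip a tiny sample in memory, since to_csv accepts any file-like object. The sample words below are illustrative, not from the article's corpus:

```python
import io
import pandas as pd

# A tiny sample shaped like the real tagged_data dictionary.
tagged_data = {"word": ["It", "was", "the", "best"],
               "tag": ["PRP", "VBD", "DT", "JJS"]}

data_df = pd.DataFrame(tagged_data)

# to_csv accepts a file-like object, so we can round-trip without touching disk.
buffer = io.StringIO()
data_df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
```

If the restored frame matches the original shape and columns, the exit ramp is working.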

 

At this moment it is imperative that we peek into Orchestrator.py, the file which runs the entire experimentation process. In that pipeline we invoke this script to mount an exit ramp –

from .basic_file_read import read_entire_document
from .nlp import tag_words
from .storage import export_to_csv

# Raw strings keep the Windows path separators from being read as escapes
filepath = r"data\tale_of_two_cities.pdf"

print("We read the pdf file as text")
text_content = read_entire_document(filepath)

print("We next tag the words using nltk library")
tagged_words = tag_words(" ".join(text_content))

print("We wrangle the tagged words to a format based on which pandas data frame can be created")
tagged_data = {"word": [word for word, tag in tagged_words], "tag": [tag for word, tag in tagged_words]}

print("Now that we have wrangled the format we next save the outcome to disk as csv file")
export_to_csv(tagged_data, r".\output\tagged_words_data.csv")

That is it. Once we have run this process end to end, the data is on disk, and we can import a single step in the console interpreter and run it explicitly instead of running the entire process every time. That too is simple: assuming we have run the process and have the CSV file, and our script for storing data in Cosmos DB starts from a CSV file, we can run that one step in the console like this –

>>> from .storage import save_to_cosmos as sc
>>> sc.persist(r".\output\tagged_words_data.csv")

You might know about processing pipelines in sklearn. The approach we have taken through this multi-part series unfolds what the world was like before such libraries. It is important that a budding data scientist deals with this raw plumbing; until such an exercise is done, learning the libraries alone might leave you a master of syntactic sugar at worst and a master of the library at best.
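To make the contrast concrete, the raw plumbing amounts to something like this hand-rolled pipeline: an ordered list of named steps, each consuming the previous step's output. The step names and toy tagger below are illustrative, not from the article's code:

```python
# A pipeline is just an ordered list of (name, function) steps,
# where each function takes the previous step's output.
def run_pipeline(steps, data):
    for name, step in steps:
        print(f"Running step: {name}")
        data = step(data)
    return data

# Toy stand-ins for the real read/tag/wrangle steps.
steps = [
    ("split", lambda text: text.split()),
    ("tag", lambda words: [(w, "NN") for w in words]),  # dummy tagger
    ("wrangle", lambda pairs: {"word": [w for w, t in pairs],
                               "tag": [t for w, t in pairs]}),
]

result = run_pipeline(steps, "a tale of two cities")
```

Libraries like sklearn wrap exactly this chaining pattern, plus fit/transform semantics, behind their Pipeline class.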

Connecting to Cosmos DB

With that out of the way, we now need to connect to Cosmos DB. There are a few important settings that one must remember never to put in code. If you have done cloud-based development, you will know that secrets are not persisted in code. The approach adopted in such situations is to store the sensitive information in the compute ecosystem's environment variables. The compute ecosystem can range from a developer workstation to a VM in the cloud, a container, or simply a serverless runtime like Azure Functions or AWS Lambda.
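A hedged sketch of reading those settings defensively: the variable name follows the article's PYEXP_ prefix, the fail-fast helper and the example endpoint value are our additions:

```python
import os

def get_required_setting(name):
    """Read a secret from the environment; fail fast with a clear message."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Normally the variable is set outside the code, e.g. in the shell or the
# cloud runtime's configuration; the setdefault here is only for the demo.
os.environ.setdefault("PYEXP_COSMOS_ENDPOINT", "wss://example.gremlin.cosmos.azure.com:443/")
endpoint = get_required_setting("PYEXP_COSMOS_ENDPOINT")
```

Failing fast with a named error beats a cryptic connection failure deep inside the driver when a variable is missing.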

There is plenty of documentation on Azure on how to connect to Cosmos DB, and there are a few prerequisites, such as installing a Gremlin driver. Let us fast-forward through those straightforward operations and assume you have installed it. Then we start with something like this –

from gremlin_python.driver import client, serializer

from gremlin_python.driver.protocol import GremlinServerError

Most of the time, articles skip highlighting the imports, leaving them as unnecessary or too trivial a detail. We believe those details are very important for anyone to follow along; we need to know where each component comes from, don't we? Anyway, resuming our journey to connect to Cosmos DB, we continue like this –

from gremlin_python.driver import client, serializer
from gremlin_python.driver.protocol import GremlinServerError
import os
import pandas as pd

_Endpoint = os.environ.get("PYEXP_COSMOS_ENDPOINT")
_Database = os.environ.get("PYEXP_COSMOS_DATABASE")
_Key = os.environ.get("PYEXP_COSMOS_KEY")

def test_connect():
    # Cosmos DB's Gremlin API expects the GraphSON v2 serializer
    gremlin_client = client.Client(_Endpoint, "g", username=_Database, password=_Key, message_serializer=serializer.GraphSONSerializersV2d0())

When you execute this function and it does not raise an error, we should be good. Next, we have to load the data and push it to Cosmos DB. Let us create a new function in the same script where we did the testing –

def save_data_to_cosmos(csv_file_path):
    tagged_data = pd.read_csv(csv_file_path)
    for row in tagged_data.itertuples():
        # TODO: Add gremlin code to insert nodes
        # TODO: Add gremlin code to insert edges
        pass
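As a reminder of what itertuples yields before we fill in those TODOs, here is a self-contained sketch; the two-column frame mirrors the shape of our CSV, with illustrative values:

```python
import pandas as pd

tagged_frame = pd.DataFrame({"word": ["It", "was"], "tag": ["PRP", "VBD"]})

# itertuples yields one namedtuple per row; columns become attributes,
# so each row's values are available as row.word and row.tag.
pairs = [(row.word, row.tag) for row in tagged_frame.itertuples()]
```

Those per-row attributes are what the Gremlin insert statements will interpolate into their queries.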

Now, the Gremlin code requires a bit of attention and is vastly different from the Python code. We will pause here and tackle the Gremlin code in the next dispatch.
