
Doing discrete predictions in Data Science

February 28, 2020 The Editorial Board - Teamware Solutions

Image Courtesy – pxhere.com
Algorithms – Using python language
Much of the professional work we do (creating software for businesses) involves attributes or properties that take their value from a finite universe of possible values: a country, the payment options that can be offered, the subscription options offered by a website, the status of an exam or test, the number of road accidents in a locality, the number of siblings an individual has, and so on. These are discrete variables; those whose values are labels from a fixed set are known as categorical variables.
In the last dispatch we had a tête-à-tête with the regression algorithm. Regression algorithms work well for continuous variables, which are abundantly observed in nature and sometimes in our professional work, e.g. the earnings of a business. In this dispatch we look at an algorithm that does the same for discrete variables.
Logistic Regression is about estimating the parameters of a simple logistic model. A logistic model is used to model binary classes, i.e. binary dependent variables. It can be visualized as the graph below:

Figure 1 – Visualization of Logistic (Regression) Model
The x-axis represents the independent variable and the y-axis represents the dependent variable, i.e. the probability that the outcome of interest occurs. The x-axis does not always need to cover both positive and negative values.
Logistic regression estimates the probability of the dependent variable; by itself it does not classify. An implementation can nevertheless be used for classification by interpreting the probability values it emits, typically against a threshold.
In this graph the two classes fall to the left and right of the y-axis: probability values above 0.5 indicate class B, and probability values below 0.5 indicate class A.
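As a minimal sketch of the curve and the 0.5 cut-off described above, the logistic function can be written in a few lines of Python. The function names `logistic` and `classify`, the labels 'A' and 'B', and the sample x values are illustrative assumptions, not part of any library.

```python
import math

def logistic(x):
    """Standard logistic (sigmoid) function: maps any real x to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def classify(x, threshold=0.5):
    """Turn the probability into a class label; 'A' and 'B' are illustrative labels."""
    return "B" if logistic(x) > threshold else "A"

# Print the probability and the resulting class for a few sample inputs.
for x in (-4, -1, 0, 1, 4):
    print(x, round(logistic(x), 3), classify(x))
```

The threshold is kept as a parameter because 0.5 is only the conventional choice; a different cut-off can be used when one kind of mistake is costlier than the other.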
The interesting question now is what happens when we apply this to use cases with more than two classes. Taking payment options from the examples above, there could be VISA/Master card, Maestro card, Diners card and AMEX card. If we want to determine which of these options a customer will opt for depending on the total value of the cart, we still use logistic regression, but in its multinomial form: Multinomial Logistic Regression.
Multinomial Logistic Regression is an extension of Logistic Regression. It can be realized either as a single linear model that computes a weighted sum of the individual predictors for each class, or as a set of independent binary logistic regressions, one per class, whose output values indicate how applicable each class is.
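The first realization mentioned above, a linear score per class turned into probabilities, is commonly done with the softmax function. The sketch below assumes a single predictor (the cart value) and uses made-up intercepts and slopes for each card type; the names `weights` and `card_probabilities` are hypothetical and nothing here is fitted to real data.

```python
import math

def softmax(scores):
    """Convert per-class linear scores into probabilities that sum to 1."""
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical (intercept, slope) pairs per card type; not fitted to any real data.
weights = {
    "VISA/Master": (1.0, 0.000),
    "Maestro": (0.5, 0.001),
    "Diners": (-1.0, 0.004),
    "AMEX": (-2.0, 0.006),
}

def card_probabilities(cart_value):
    """Return a probability per card type for the given cart value."""
    labels = list(weights)
    scores = [weights[k][0] + weights[k][1] * cart_value for k in labels]
    return dict(zip(labels, softmax(scores)))

probs = card_probabilities(cart_value=800)
print(max(probs, key=probs.get), probs)
```

The predicted class is simply the one with the highest probability, which plays the role the 0.5 cut-off plays in the binary case.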
Before we progress further, let us remove one prickly issue that bothers us when we put this in perspective with Linear Regression. Linear regression is about deriving the numerical parameters in the equation of a line that explains the observed relationship between the independent and dependent variables.
Logistic regression, however, does not seem to describe any such equation between the dependent and independent variables. So the question stands its ground: is it not fair to call it something other than a regression algorithm?
In defence of still calling it a regression algorithm: logistic regression still determines the relationship between the independent and dependent variables. Instead of being visualized geometrically as a line, the relationship is expressed through the log-odds of the probability p, with logarithm base b, by the formula:
$$l = \log_b\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
Regression comes in when we rearrange this equation into the form below and estimate the coefficients β_i of the relationship from the data:
$$p = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}$$
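The step from the log-odds form to the probability form is a short rearrangement. Using the same symbols as above, with l standing for the linear part, it can be sketched as:

```latex
\begin{aligned}
l = \log_b\!\left(\frac{p}{1-p}\right)
  &\;\Longrightarrow\; \frac{p}{1-p} = b^{\,l} \\
  &\;\Longrightarrow\; p = b^{\,l}\,(1-p) \\
  &\;\Longrightarrow\; p\,\bigl(1 + b^{\,l}\bigr) = b^{\,l} \\
  &\;\Longrightarrow\; p = \frac{b^{\,l}}{1 + b^{\,l}} = \frac{1}{1 + b^{-l}},
  \qquad l = \beta_0 + \beta_1 x_1 + \beta_2 x_2 .
\end{aligned}
```

In practice the base b is usually taken to be e, which gives the familiar sigmoid curve.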
Let us now take a sample problem and apply this math (i.e. the logistic algorithm) to experience its capabilities.
Implementation in Python
Let us pick a scenario where we want to determine whether a student aspiring to study management is likely to be picked for an interview at a prestigious overseas university. The independent variables available for this model are the candidate's GMAT score, GPA and prior work experience.
A peek into the data looks like this:

Figure 2 – Sample data of top 6 records
This is just a peek; assume the entire dataset is available as an Excel file on disk. As the next step, we load the Python packages needed for this modelling. We will require:
pandas – for modelling dataset as data frames
sklearn – for the actual logistic regression algorithm implementation
seaborn – for visualization of the accuracy of the modelling activity
The following import statements are required to move forward:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn

We load the data from the Excel file using the following line:

labelledStudentInterview = pd.read_excel(r'C:\dataScience\project\studentInterviewInvitew\labelledDate.xlsx')

The r prefix makes the path a raw string, so backslashes are kept as they are instead of being treated as escape characters. read_excel already returns a data frame; the next line keeps only the columns we need for the model:

dfLabelledStudentInterview = pd.DataFrame(labelledStudentInterview, columns=['gmat', 'gpa', 'work_exp', 'invited'])

Let us segregate the independent and dependent variables

x = dfLabelledStudentInterview[['gmat', 'gpa', 'work_exp']]
Y = dfLabelledStudentInterview['invited']

Next, we split the data into training and test sets:

x_train, x_test, Y_train, Y_test = train_test_split(x, Y, test_size=0.25, random_state=0)

Here 75% of the data is used for training and 25% for testing. Fixing random_state makes the split reproducible from run to run.
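To make the split concrete, here is a rough pure-Python sketch of what a shuffled 75/25 split does. It illustrates the idea only and is not how `train_test_split` is implemented; the helper name `split_rows` and the row values are made up for the example.

```python
import random

def split_rows(rows, test_size=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)     # fixed seed gives the same split every run
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(40))                         # stand-in for 40 labelled records
train, test = split_rows(rows)
print(len(train), len(test))
```

Shuffling before cutting matters when the file is ordered in some way, for instance with all invited candidates listed first; otherwise the test set could end up containing a single class.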
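Now the stage is set for training the logistic regression on the training data set:

# Create the model with scikit-learn's default settings
algorithm = LogisticRegression()
# Learn the coefficients from the training data
algorithm.fit(x_train, Y_train)
# Predict the class (invited or not) for each test record
Y_pred = algorithm.predict(x_test)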
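To give a feel for what fitting involves, the sketch below estimates the coefficients of a logistic model by plain batch gradient descent on a tiny made-up dataset. It is an illustration under simplifying assumptions: scikit-learn uses different solvers and applies regularization by default, so its coefficients will generally differ. The helpers `fit_logistic` and `predict_row`, and the (gpa, work_exp) rows, are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, epochs=5000):
    """Estimate [beta_0, beta_1, ...] by batch gradient descent on the log loss."""
    betas = [0.0] * (len(X[0]) + 1)              # betas[0] is the intercept
    for _ in range(epochs):
        grads = [0.0] * len(betas)
        for row, target in zip(X, y):
            z = betas[0] + sum(b * v for b, v in zip(betas[1:], row))
            error = sigmoid(z) - target           # derivative of the log loss w.r.t. z
            grads[0] += error
            for j, v in enumerate(row):
                grads[j + 1] += error * v
        # Step against the average gradient over all rows.
        betas = [b - learning_rate * g / len(X) for b, g in zip(betas, grads)]
    return betas

def predict_row(betas, row):
    """Return 1 when the modelled probability is at least 0.5, otherwise 0."""
    z = betas[0] + sum(b * v for b, v in zip(betas[1:], row))
    return 1 if sigmoid(z) >= 0.5 else 0

# Hypothetical (gpa, work_exp) rows with invite labels; not the article's data.
X = [[2.0, 1.0], [2.5, 2.0], [3.0, 1.0], [3.2, 4.0], [3.7, 5.0], [3.9, 6.0]]
y = [0, 0, 0, 1, 1, 1]
betas = fit_logistic(X, y)
print([predict_row(betas, row) for row in X])
```

The features here are on a small scale on purpose. With raw GMAT scores in the hundreds, a simple gradient descent like this would need feature scaling or a much smaller learning rate to behave well.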

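Next, we evaluate how well the trained model predicts the test set, using a confusion matrix:

import matplotlib.pyplot as plt
confusion_matrix = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
plt.show()  # needed when running as a script; notebooks render the plot automatically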

This displays a heatmap similar to the one below:

Figure 3 – Confusion matrix for algorithm
Accuracy of the algorithm is computed as follows:
Accuracy = (True Positive + True Negative) / Total
This yields a value of 0.8, i.e. 80% accuracy for the current training. The same number can be obtained in code as:

print('accuracy = ', metrics.accuracy_score(Y_test, Y_pred))
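The accuracy formula can also be worked through by hand. The sketch below counts the four cells of a confusion matrix from two lists of labels; the labels are made up for illustration (1 = invited, 0 = not invited) and are not the article's test data.

```python
from collections import Counter

# Hypothetical labels for ten test candidates (1 = invited, 0 = not invited).
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

counts = Counter(zip(actual, predicted))        # keys are (actual, predicted) pairs
true_positive  = counts[(1, 1)]
true_negative  = counts[(0, 0)]
false_positive = counts[(0, 1)]
false_negative = counts[(1, 0)]

# Accuracy = (True Positive + True Negative) / Total
accuracy = (true_positive + true_negative) / len(actual)
print(true_positive, true_negative, false_positive, false_negative, accuracy)
```

Accuracy treats both kinds of mistakes alike; when the classes are unbalanced, the individual cells of the confusion matrix tell more than the single number.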

Let us go one step further and see how the trained model can be saved, so that it can be reused later to predict invites:

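import pickle

modelFileName = 'studentInvitePrediction.pck'

# Write the trained model to disk; the with-block closes the file afterwards
with open(modelFileName, 'wb') as modelFile:
    pickle.dump(algorithm, modelFile)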

We can load this file later to predict whether a candidate with given scores is likely to be invited:

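import pickle
import pandas as pd
modelFileName = 'studentInvitePrediction.pck'
with open(modelFileName, 'rb') as modelFile:
    model = pickle.load(modelFile)
# One-row data frame with the same feature columns that were used for training
newStudent = pd.DataFrame({'gmat': [590], 'gpa': [2], 'work_exp': [3]}, columns=['gmat', 'gpa', 'work_exp'])
studentInvitePossibility = model.predict(newStudent)
print(studentInvitePossibility)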

For this candidate the code above prints a prediction of 0, i.e. no invite is predicted.
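A predicted class of 0 hides how close the candidate was to the cut-off. The sketch below stores a set of coefficients with pickle, loads them back and computes the probability itself from the logistic formula. The coefficient values and the helper `invite_probability` are made up for illustration and do not come from the model trained above. With a fitted scikit-learn model, the `predict_proba` method serves this purpose.

```python
import math
import pickle

# Hypothetical coefficients standing in for a fitted model; not the article's results.
saved_model = {"intercept": -20.0, "gmat": 0.02, "gpa": 1.5, "work_exp": 0.6}

blob = pickle.dumps(saved_model)                 # same format pickle.dump writes to a file
loaded = pickle.loads(blob)

def invite_probability(model, gmat, gpa, work_exp):
    """Apply p = 1 / (1 + e^-(b0 + b1*x1 + ...)) with the stored coefficients."""
    z = (model["intercept"] + model["gmat"] * gmat
         + model["gpa"] * gpa + model["work_exp"] * work_exp)
    return 1.0 / (1.0 + math.exp(-z))

p = invite_probability(loaded, gmat=590, gpa=2, work_exp=3)
predicted_class = 1 if p >= 0.5 else 0
print(round(p, 3), predicted_class)
```

Note that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.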
