Dataframe transformation
Pandas in Python
Dataframe is the foundation stone of data wrangling. It is the humble two-dimensional representation of
data that interests you. Oftentimes the aesthetics that appeal to humans come on the way to data
representation. This causes the many lines of code that deal with transforming data from one way of
representation to another. Since such transformations deal with aesthetics often data engineers overlook
them. It often proves fatal for budding data engineers. Today, we will look at one such ask without the data
filled in the dataframe to keep it simple.
We were piloting an IoT project where the field gateway will collect data and send us a de-normalised
record. Let us remind ourselves that IoT gateway will always send data in a de-normalised form or in better
words the form that consumes the least energy available. Transformations or looking etc. are the activities
that low-energy devices which connect to the IoT gateway shy away from. They prefer to use the energy to
get the reading and shipping that over waves to the gateway. Without wearing you down on energy let us
just say that the IoT project was to monitor the water level in catchment areas. The data sent over by the
gateway would look like this –
Metric SampleNature Measure
Rainfall Simple1 X
Inflow Simple1 Y
SWT Simple1 Z
Evaporation Simple1 A
Rainfall StormWaterFlow B
Inflow StormWaterFlow C
Loss StormWaterFlow D
RelatedSample Secondary Simple1
Loss Secondary E
Inflow Secondary F
Outflow Secondary G
As you can see values are repeating themselves in the column called SampleNature. This data was
received in a comma-separated values format. Processing them to do the analytics was straightforward.
One of the frequent things we needed was to filter the records and transform the Metric to column headers.
It would look like this –
hydrologyData = pd.read_csv(‘raw_data_20221215.csv’)
hydrologyData[hydrologyData[‘SampleNature’] == ‘Simple1’].pivot(index = ‘SampleNature’, columns =
‘Metric’)[[‘Value’]]
hydrologyData
We first load the data from the CSV file and then filter the records that have SampleNature as Simple1 by
doing
hydrologyData[hydrologyData[‘SampleNature’] == ‘Simple1’]
We then pipe the outcome of this operation to do pivoting. It is the most common operation one would do in
Excel. With the pandas as the framework for data wrangling, we do that using the statement –
.pivot(index = ‘SampleNature’, columns = ‘Metric’)[[‘Value’]]
The first parameter to the call indicates what should be the row index and the second parameter which is
called columns indicate what should be the column headers as we want. The last square bracketed term
indicates the values that should be copied over from the source data frame. Thus, we would get a data
frame that looks like this –
Value
Metric Evaporation Inflow Rainfall SWT
SampleNature
Simple1 A Y X Z
This is great as far as the count is just 1 record. But in the real world, this would flow over and if we stick to
the above transformation, we would end up getting an error stating the index is duplicated. It is natural for
the framework to throw that error since it does not know how the records are constituted. i.e., how many
rows in the source translate to a row? If we skip stating the index we will get a rather unpleasant
representation –
Value
Metric Evaporation Inflow Rainfall SWT
0 A NaN NaN NaN
1 NaN Y NaN NaN
2 NaN NaN X NaN
3 NaN NaN NaN Z
We hid a fact and it is time we reveal it to keep things simple and not go into heuristics for determining the
row index. We had a timestamp column which will be the same only for a few metrics on the source
dataset. We used that column as the index –
hydrologyData[hydrologyData[‘SampleNature’] == ‘Simple1’].pivot(index = ‘timestamp’, columns =
‘Metric’)[[‘Value’]]
Now we did not get an error instead we have a beautiful-looking dataset that could be printed directly to a
report –
Value
Metric Evaporation Inflow Rainfall SWT
Timestamp
2022-12-
14T00:05:00
A Y X Z
2022-12-
14T00:10:00
A1 Y1 X1 Z1
Needless to say, we masked the values here but you got the drift, isn’t it? One of the hallmarks of a data
engineer is to be able to do such transformation as if it is second nature. Some other tools that could be
used in place of the pivot are the functions like mask, fillna, MultiIndex and many more. Data engineer must
always be creative and find the path that costs them the least and documents the intent without having to
write comments near the code. Much of the creativity that a data engineer emanates will be from the
context in which such transformations are needed. We leave you with a puzzle on how the aforesaid
transformation could be achieved by the combination of mask, fillna, MultiIndex.
Recent post
Archives
- November 2024
- October 2024
- September 2024
- August 2024
- July 2024
- June 2024
- October 2023
- June 2023
- March 2023
- February 2023
- January 2023
- December 2022
- November 2022
- October 2022
- September 2022
- August 2022
- July 2022
- June 2022
- May 2022
- April 2022
- March 2022
- February 2022
- January 2022
- December 2021
- November 2021
- October 2021
- September 2021
- August 2021
- July 2021
- June 2021
- May 2021
- April 2021
- January 2021
- December 2020
- October 2020
- August 2020
- June 2020
- May 2020
- April 2020
- March 2020
- February 2020
- January 2020
- December 2019
- November 2019
- October 2019
- September 2019
- August 2019
- July 2019
- June 2019
- May 2019
- April 2019
- March 2019
- February 2019
- January 2019