Assortment of thoughts – Part 2
On data processing – AWS Data Pipeline and .NET
Round 1 of reasoning by both the packs was interesting. It made us think deeply about the choices and introspect what each choice has to offer. Moving swiftly to Round 2, we re-convened on a wet Thursday morning with our own tea cups and coffee mugs.
Let us replay the goal here to keep our perspective intact – offer a solution for how the transformation processing will be handled.
Round 2
We had our pack of experts with an ETL background demonstrate how they would build the transformation activities using AWS Data Pipeline. They started by voicing over a background on the important entities of AWS Data Pipeline.
Pipeline Definition – a JSON file which describes the activities involved in the pipeline. This was the concept the team pitched heavily in Round 1, since it enables IaC and makes defining the sequence of activities nice and easy.
Activities – a finite, allowed set of activity types that can be included in a Data Pipeline. These include CopyActivity, EmrActivity, SqlActivity, ShellCommandActivity and many more.
Resources – the compute instances on which the pipeline runs. The pipeline definition is interpreted and translated into action on these resources, which are limited to EC2 instances and EMR clusters.
Actions – fire-and-forget steps which let external observers know if something has gone wrong with the pipeline execution. Terminate and SnsAlarm allow notification on such events.
I am sure there are a few more, but for the activities we are supposed to perform, this much of the concepts will do. There is also the concept of a Task Runner on a resource, but that is for more specialized scenarios where resources are located on-premises or we intend to customize the flow logic.
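To make those entities concrete, here is a hypothetical minimal skeleton showing how they fit together – an activity that runs on a resource and raises an action on failure. All the names, the command, and the topic ARN here are our own placeholders, not from the team's definition:

```json
{
  "objects": [
    {
      "id": "TransformRecords",
      "name": "TransformRecords",
      "type": "ShellCommandActivity",
      "command": "echo transforming...",
      "runsOn": { "ref": "ComputeEc2" },
      "onFail": { "ref": "FailureAlarm" }
    },
    {
      "id": "ComputeEc2",
      "name": "ComputeEc2",
      "type": "Ec2Resource",
      "terminateAfter": "120 Minutes"
    },
    {
      "id": "FailureAlarm",
      "name": "FailureAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
      "role": "DataPipelineDefaultRole",
      "subject": "Pipeline failed",
      "message": "Transformation activity failed."
    }
  ]
}
```

The `runsOn` and `onFail` references are how the definition wires an activity to its resource and its action.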
Let us begin with a few assumptions,
{
  "record": "bbd948b5-a097-4415-9992-05849c76eac6",
  "lat": 12.9972222,
  "long": 80.2569444,
  "duration_stop": 300000,
  "time_of_day": 1611379865000,
  "day": 20210105,
  "week_of_day": "Tuesday",
  "day_time_print": "05-January-2021 11:01:05 AM",
  "temperature": 23,
  "temp_unit": "Celsius"
}
Though temperature is displayed as part of the record, it is filled by looking up a service with the lat and long parameters. Similarly, the property "day_time_print" will be computed based on "day" and "time_of_day".
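The post does not show the transformation code itself, but as a small sketch (assuming GNU date is available on the compute instance), the "day_time_print" computation from the epoch-millisecond "time_of_day" could look like this. Note the sample record's own print presumably used a local timezone, so the UTC rendering below differs from it:

```shell
# Sketch: derive a printable timestamp from the epoch-millisecond "time_of_day".
# Assumes GNU date; rendered in UTC here, unlike the sample record's local print.
epoch_ms=1611379865000              # "time_of_day" from the sample record
epoch_s=$((epoch_ms / 1000))        # drop the milliseconds
LC_ALL=C date -u -d "@${epoch_s}" '+%d-%B-%Y %I:%M:%S %p'
# prints 23-January-2021 05:31:05 AM
```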
{"record": …, "lat": …, …}
{"record": …, "lat": …, …}
{"record": …, "lat": …, …}
{"record": …, "lat": …, …}
{"record": …, "lat": …, …}
With these assumptions in place, the transformations will take the following shape –
The pack of ETL experts then started defining the pipeline. The following is the pipeline definition they created to start with.
{
  "objects": […],
  "parameters": […],
  "values": {…}
}
Here, we did not fill in the dots. That is how the team started, and they immediately tested their client configuration for connecting to AWS by running the following command in the CLI –
aws datapipeline create-pipeline --name "drive-stop-location-processing" --unique-id "dp-dslp-d86e9710"
This yielded a response like this –
{
  "pipelineId": "df-08602529RTPD0169MTB"
}
Here the pipeline container (not a container of the Docker kind; the word is used in its plain English sense) is created to hold the definition. The following command pushes the pipeline definition to it –
aws datapipeline put-pipeline-definition --pipeline-id "df-08602529RTPD0169MTB" --pipeline-definition "file://C:/Users/Document/drive-stop-location-pipeline.json"
This one did not run that well – after all, the JSON is not well formed, is it? We noticed the error –
Expecting value: line 2 column 15 (char 16)
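Incidentally, that message is the standard error of Python's JSON parser (which the AWS CLI is built on), typically triggered by smart quotes or a missing comma. A quick local well-formedness check – assuming python3 on the workstation – saves a round trip to AWS:

```shell
# Validate the definition file locally before calling put-pipeline-definition.
# python3 -m json.tool exits non-zero on smart quotes, missing commas, etc.
if python3 -m json.tool drive-stop-location-pipeline.json > /dev/null; then
  echo "definition is well-formed JSON"
else
  echo "fix the JSON before pushing it to AWS" >&2
fi
```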
Next, we have to define the compute environment on which we will perform the operations. Since the record count is sizeable, but not so large that we would need superior powers, we will use a medium-sized EC2 instance.
As we do that, we want to highlight that this is where we could specialize if a special kind of EC2 instance were required, since we can also specify the AMI ID for the instance. The AMI ID could be one from the marketplace, or one that a business created with specialized tools installed. We will stick to the default and define the resource as follows –
{
  "objects": [
    {
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "instanceType": "t2.medium",
      "name": "ComputeEc2",
      "id": "ComputeEc2",
      "type": "Ec2Resource",
      "terminateAfter": "120 Minutes"
    }
  ],
  "parameters": […],
  "values": {…}
}
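If a specialized AMI were needed, the same resource object could carry an imageId field – a sketch only, with a placeholder AMI ID, and the chosen AMI would still need to satisfy the prerequisites Task Runner expects:

```json
{
  "resourceRole": "DataPipelineDefaultResourceRole",
  "role": "DataPipelineDefaultRole",
  "instanceType": "t2.medium",
  "imageId": "ami-0123456789abcdef0",
  "name": "ComputeEc2",
  "id": "ComputeEc2",
  "type": "Ec2Resource",
  "terminateAfter": "120 Minutes"
}
```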
There are a few small details to be highlighted. Let us get those out of the way.
While the pack was narrating this, two things popped into our heads –
The team steadily progressed, highlighting the next two important elements in the definition, namely the roles.
We will part now; next time we will talk about how the definition shaped up further, the corresponding shell code, and the next curve ball that the ETL pack threw at their audience. Till then, happy coding.
Picture courtesy- Px Here
June 9, 2021