Assortment of thoughts – Part 2
On data processing – AWS Data Pipeline and .NET
Round 1 of reasoning by both the packs was interesting. It made us think deeply about the choices and introspect what each choice has to offer. Moving swiftly to Round 2, we re-convened on a wet Thursday morning with our own tea cups and coffee mugs.
Let us replay the goal here to keep our perspective intact – offer a solution for how the transformation processing will be handled.
Round 2
We had our pack of experts with an ETL background demonstrate how they would build the transformation activities using AWS Data Pipeline. They started by voicing over a background on the important entities of AWS Data Pipeline.
Pipeline Definition – a JSON file which describes the activities involved in the pipeline. This was the concept the team pitched heavily in Round 1, since it enables IaC and makes defining the sequence of activities nice and easy.
Activities – a finite, allowed set of activity types that can be included in a Data Pipeline. These include CopyActivity, EmrActivity, SqlActivity, ShellCommandActivity and many more.
Resources – the compute instances on which the pipeline runs. The pipeline definition is interpreted and translated into action on these resources, which are limited to EC2 instances and EMR clusters.
Actions – fire-and-forget steps which let external observers know if something has gone wrong with the pipeline execution. Terminate and SnsAlarm allow notification on such events.
I am sure there are a few more, but for the activities we are supposed to perform, this much of the concepts will do. There is also the concept of a Task Runner on a resource, but that is for more specialized scenarios where resources are located on-premises or we intend to customize the flow logic.
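To make those entities concrete, here is a hypothetical minimal skeleton showing how they fit together – an activity that runs on a resource and raises an action on failure. All the names, the command, and the topic ARN here are our own placeholders, not from the team's definition:

```json
{
  "objects": [
    {
      "id": "TransformRecords",
      "name": "TransformRecords",
      "type": "ShellCommandActivity",
      "command": "echo transforming...",
      "runsOn": { "ref": "ComputeEc2" },
      "onFail": { "ref": "FailureAlarm" }
    },
    {
      "id": "ComputeEc2",
      "name": "ComputeEc2",
      "type": "Ec2Resource",
      "terminateAfter": "120 Minutes"
    },
    {
      "id": "FailureAlarm",
      "name": "FailureAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
      "role": "DataPipelineDefaultRole",
      "subject": "Pipeline failed",
      "message": "Transformation activity failed."
    }
  ]
}
```

The `runsOn` and `onFail` references are how the definition wires an activity to its resource and its action.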
Let us begin with a few assumptions,
{
  "record": "bbd948b5-a097-4415-9992-05849c76eac6",
  "lat": 12.9972222,
  "long": 80.2569444,
  "duration_stop": 300000,
  "time_of_day": 1611379865000,
  "day": 20210105,
  "week_of_day": "Tuesday",
  "day_time_print": "05-January-2021 11:01:05 AM",
  "temperature": 23,
  "temp_unit": "Celsius"
}
Though temperature is displayed as part of the record, it is filled by looking up a service with the lat and long parameters. Similarly, the property "day_time_print" will be computed based on "day" and "time_of_day".
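The post does not show the transformation code itself, but as a small sketch (assuming GNU date is available on the compute instance), the "day_time_print" computation from the epoch-millisecond "time_of_day" could look like this. Note the sample record's own print presumably used a local timezone, so the UTC rendering below differs from it:

```shell
# Sketch: derive a printable timestamp from the epoch-millisecond "time_of_day".
# Assumes GNU date; rendered in UTC here, unlike the sample record's local print.
epoch_ms=1611379865000              # "time_of_day" from the sample record
epoch_s=$((epoch_ms / 1000))        # drop the milliseconds
LC_ALL=C date -u -d "@${epoch_s}" '+%d-%B-%Y %I:%M:%S %p'
# prints 23-January-2021 05:31:05 AM
```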
{"record": …, "lat": …, …}
{"record": …, "lat": …, …}
{"record": …, "lat": …, …}
{"record": …, "lat": …, …}
{"record": …, "lat": …, …}
With these assumptions in place, the transformations will take the following shape –
The pack of ETL experts then started defining the pipeline. The following is the pipeline definition they created to start with.
{
  "objects": […],
  "parameters": […],
  "values": {…}
}
Here, we did not fill in the dots. That is how the team started, and they immediately tested their client configuration for connecting to AWS by running the following command in the CLI –
aws datapipeline create-pipeline --name "drive-stop-location-processing" --unique-id "dp-dslp-d86e9710"
This yielded a response like this –
{
  "pipelineId": "df-08602529RTPD0169MTB"
}
Here the pipeline container (not a container of the Docker kind; the word is used in its plain English sense) is created to hold the definition. The following command pushes the pipeline definition to it –
aws datapipeline put-pipeline-definition --pipeline-id "df-08602529RTPD0169MTB" --pipeline-definition "file://C:/Users/Document/drive-stop-location-pipeline.json"
This one did not run that well – after all, the JSON is not well formed, is it? We noticed the error –
Expecting value: line 2 column 15 (char 16)
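Incidentally, that message is the standard error of Python's JSON parser (which the AWS CLI is built on), typically triggered by smart quotes or a missing comma. A quick local well-formedness check – assuming python3 on the workstation – saves a round trip to AWS:

```shell
# Validate the definition file locally before calling put-pipeline-definition.
# python3 -m json.tool exits non-zero on smart quotes, missing commas, etc.
if python3 -m json.tool drive-stop-location-pipeline.json > /dev/null; then
  echo "definition is well-formed JSON"
else
  echo "fix the JSON before pushing it to AWS" >&2
fi
```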
Next, we have to define the compute environment on which we will perform the operations. Since the record count is sizeable, but not so large that we would need superior powers, we will use a medium-sized EC2 instance.
As we do that, we want to highlight that this is where we could specialize if a special kind of EC2 instance were required, since we can also specify the AMI ID for the instance. The AMI ID could be one from the marketplace, or one that a business created with specialized tools installed. We will stick to the default and define the resource as follows –
{
  "objects": [
    {
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "instanceType": "t2.medium",
      "name": "ComputeEc2",
      "id": "ComputeEc2",
      "type": "Ec2Resource",
      "terminateAfter": "120 Minutes"
    }
  ],
  "parameters": […],
  "values": {…}
}
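If a specialized AMI were needed, the same resource object could carry an imageId field – a sketch only, with a placeholder AMI ID, and the chosen AMI would still need to satisfy the prerequisites Task Runner expects:

```json
{
  "resourceRole": "DataPipelineDefaultResourceRole",
  "role": "DataPipelineDefaultRole",
  "instanceType": "t2.medium",
  "imageId": "ami-0123456789abcdef0",
  "name": "ComputeEc2",
  "id": "ComputeEc2",
  "type": "Ec2Resource",
  "terminateAfter": "120 Minutes"
}
```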
There are a few small details to be highlighted. Let us get those out of the way.
While the pack was narrating this, two things popped into our heads –
The team steadily progressed, highlighting the next two important elements in the definition, namely the roles.
We will part now; next time we will talk about how the definition shaped up further, the corresponding shell code, and the next curve ball that the ETL pack threw at their audience. Till then, happy coding.
Picture courtesy- Px Here
June 9, 2021