Getting Started
ODIF Product Overview
Olive Data Ingestion Framework (ODIF) is an ingestion tool that connects to configurable sources and sinks to accelerate data ingestion. It is built with a cloud-agnostic approach, requires no pre-installed cluster, and can be deployed with a minimal resource footprint. Enterprises do not have to set up a Hadoop cluster; with elastic compute, the compute power can be sized to the source data size. It provides a user-friendly web interface for data source registration, job configuration, job runs, and monitoring.
Key Features
- One homogeneous codebase for all sources and sinks.
- No pre-installation of a Hadoop cluster.
- Minimal Resource Footprint.
- Cloud-agnostic ingestion engine.
- Compute configuration decided based on ingestion data size.
Launch ODIF
Launching on AWS
ODIF is an AMI-based tool for AWS. This page describes how to launch and set up the application from the AMI.
- Locate ODIF on the marketplace.
- Choose an instance type from the supported instance types. t2.xlarge is preferred. Minimum: 4 vCPU and 16 GiB.
- Configure instance details.
Points to consider:
- VPC and Subnet:
- The VPC and subnet can be selected as required or left as default.
- Auto-assign Public IP:
- Set it to Disable; a selected EIP can be allocated later.
- IAM Role:
- Create a new IAM role for the EC2 instance with access policies. Refer to IAM Role under Additional Information for creating a new role.
All other options can be left as default or set as required.
- Add Storage
Add EBS storage as required (minimum 32 GB). EBS encryption uses the default aws/ebs key.
- Add Tags
Tags can be added as per organization/individual policies.
- Security Group
Create a new security group in which the inbound rules for ODIF can be configured. ODIF needs to communicate between EMR and EC2, so at minimum the user IP and the EMR SG need to be added to the EC2 SG.
Outbound can be left as default.
For creating the SG, refer to Security Group under Additional Information.
Note: The EMR SG needs to be added here. It can be created in another window, or the ODIF EC2 instance can be launched first and the EMR SG then created and added to the ODIF EC2 SG as an inbound rule.
- EC2 Key Pair
Generate a new key pair for the EC2 instance and download it to access the instance.
- Review and Launch Instance.
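The sizing guidance in the instance type step can be checked with a few lines of Python. This is a minimal sketch: the `SPECS` table is a small illustrative subset of published EC2 instance sizes, not ODIF's list of supported instance types, and the function name is hypothetical.

```python
from dataclasses import dataclass

# Minimum sizing stated in this guide.
MIN_VCPU = 4
MIN_MEMORY_GIB = 16


@dataclass(frozen=True)
class InstanceSpec:
    name: str
    vcpu: int
    memory_gib: float


# Published sizes for a few common EC2 types (illustrative subset only,
# NOT the list of instance types ODIF supports).
SPECS = {
    "t2.medium": InstanceSpec("t2.medium", 2, 4),
    "t2.large": InstanceSpec("t2.large", 2, 8),
    "t2.xlarge": InstanceSpec("t2.xlarge", 4, 16),
    "m5.xlarge": InstanceSpec("m5.xlarge", 4, 16),
}


def meets_minimum(instance_type: str) -> bool:
    """Return True if the instance type satisfies the guide's minimum sizing."""
    spec = SPECS[instance_type]
    return spec.vcpu >= MIN_VCPU and spec.memory_gib >= MIN_MEMORY_GIB
```

For example, `meets_minimum("t2.xlarge")` is true, while `meets_minimum("t2.large")` is false because it has only 2 vCPU and 8 GiB.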
Additional Information
IAM Role
This IAM role manages credentials for the application that is going to run on the EC2 instance. Three main IAM roles are required; the process for creating them is described below:
- IAM for EC2
- From the Create Role screen, select EC2 and continue to attach a policy to the new role.
- Attach Policy
- AmazonElasticMapReduceFullAccess: to access EMR, S3, and EC2.
- AssumeRole: to manage dynamic launch of the EMR cluster.
- Add tags as per organization/individual policy.
- Review and create the role. This role will be used to launch the EC2 instance from the AMI.
- EMR Default Role
It allows EMR to call the EC2 service on your behalf. If EMR_DefaultRole already exists, use it; otherwise create a new one.
- From the Create Role screen, select EMR and continue to attach a policy to the new role.
- It will have a default policy attached.
- Continue to add tags as per organization/individual policy.
- Review and create the role as EMR_DefaultRole.
- EMR_EC2_DefaultRole
It allows the EC2 instances in the EMR cluster to call AWS services on your behalf. If EMR_EC2_DefaultRole already exists, use it; otherwise create a new one.
- From the Create Role screen, select EMR and continue to attach a policy to the new role.
- It will have a default policy attached.
- Continue to add tags as per organization/individual policy.
- Review and create the role as EMR_EC2_DefaultRole.
Security Group
Security groups control incoming and outgoing traffic. ODIF needs the following security groups:
- EC2 Instance Security Group
- Inbound
- All TCP for user/org-subnet IP address.
- All TCP for EMR Security Group.
- SSH for User IP
- EMR Security Group
- Inbound
- All TCP for EC2 security group
- All TCP for EMR master-slave security group.
- If EMR has different security groups for master and slave, the EC2 security group must be added as an inbound rule to both. This guide's example uses a single SG for master and slave.
Note: It is good to have a single security group for both master and slave.
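The inbound rules listed above can be written down as data before creating them in the console. The sketch below builds them in the permission shape used by the EC2 API (`IpProtocol`, `FromPort`, `ToPort`, `IpRanges`, `UserIdGroupPairs`). The function names, the CIDR ranges, and the security group IDs are placeholders for illustration, and nothing here calls AWS.

```python
def all_tcp_from_cidr(cidr: str) -> dict:
    """All TCP ports, allowed from an IP range."""
    return {"IpProtocol": "tcp", "FromPort": 0, "ToPort": 65535,
            "IpRanges": [{"CidrIp": cidr}]}


def all_tcp_from_group(group_id: str) -> dict:
    """All TCP ports, allowed from members of another security group."""
    return {"IpProtocol": "tcp", "FromPort": 0, "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": group_id}]}


def ssh_from_cidr(cidr: str) -> dict:
    """SSH (port 22), allowed from an IP range."""
    return {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
            "IpRanges": [{"CidrIp": cidr}]}


def ec2_inbound(org_cidr: str, user_cidr: str, emr_sg_id: str) -> list:
    """Inbound rules for the ODIF EC2 instance security group."""
    return [
        all_tcp_from_cidr(org_cidr),    # user/org-subnet IP address
        all_tcp_from_group(emr_sg_id),  # EMR security group
        ssh_from_cidr(user_cidr),       # SSH for the user IP
    ]


def emr_inbound(ec2_sg_id: str, emr_sg_id: str) -> list:
    """Inbound rules for a single EMR master/slave security group."""
    return [
        all_tcp_from_group(ec2_sg_id),  # ODIF EC2 security group
        all_tcp_from_group(emr_sg_id),  # EMR master-slave security group
    ]


# Placeholder values; replace with your own ranges and group IDs.
ec2_rules = ec2_inbound("203.0.113.0/24", "203.0.113.10/32", "sg-0123456789abcdef0")
emr_rules = emr_inbound("sg-0fedcba9876543210", "sg-0123456789abcdef0")
```

If master and slave use separate security groups, the same `emr_inbound` rules would be prepared once per group.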
EIP Allocation
An Elastic IP address is a public IPv4 address, reachable from the internet, that can be associated with an instance. As a best practice, this EIP can be mapped in /etc/hosts for easy access to the machine.
- Select Elastic IPs in the EC2 console and choose to allocate a new address.
- Select Amazon’s pool of IPv4 addresses (default option) and confirm the allocation.
- Select the allocated IP and, from Actions, select Associate Elastic IP.
- Select the resource type and instance, choosing the launched instance to associate with the EIP.
- Confirm to associate the EIP with the instance.
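The /etc/hosts mapping mentioned above is a single line of the form "address, whitespace, hostname". A minimal sketch that formats such a line follows; the hostname `odif` and the address are placeholders, and the function name is hypothetical.

```python
import ipaddress


def hosts_entry(eip: str, hostname: str) -> str:
    """Format one /etc/hosts line mapping the Elastic IP to a hostname."""
    # Raises ValueError if the string is not a valid IPv4 address.
    ipaddress.IPv4Address(eip)
    return f"{eip}\t{hostname}"


# Placeholder address and hostname; append the result to /etc/hosts manually.
entry = hosts_entry("203.0.113.25", "odif")
```

With such an entry in place, the machine can be reached by name instead of by the raw EIP.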
Assume Role
The AssumeRole policy is used to access AWS resources through the Security Token Service (STS). This policy needs to be attached to the EC2 role in order to launch dynamic EMR clusters. The steps to create and attach the policy are:
- Choose to create a new policy.
- Select STS as the service.
- Select AssumeRole under Action.
- Select Specific for the role resource and add the ARN of the role (the role created under IAM Role above).
- Continue to the next step, provide a name for the policy, and create it.
- Attach this policy to the role created for the EC2 instance.
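The policy produced by the steps above corresponds to a small IAM policy document. The sketch below generates that JSON; the account ID and role name in the ARN are placeholders, and the function name is hypothetical. The actual ARN is the one added in the console step.

```python
import json


def assume_role_policy(role_arn: str) -> dict:
    """Build an IAM policy document allowing sts:AssumeRole on one role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": role_arn,  # the specific role ARN from the console step
            }
        ],
    }


# Placeholder ARN; substitute the ARN of your own role.
policy = assume_role_policy("arn:aws:iam::123456789012:role/ExampleRole")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the JSON tab of the policy editor as an alternative to the visual editor.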
Setup ODIF
Login to ODIF
Once the application is set up, open it at http://<EIP>:8081. The application runs on port 8081, and EIP refers to the IP address reserved for the instance.
Login with Credentials:
Login for RabbitMQ: http://<EIP>:15672
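As a quick check after launch, the two URLs above can be derived from the EIP and the ports probed from the user's machine. This is a sketch only: the function names are hypothetical, the address is a placeholder, and a closed port may simply mean the security group does not yet allow the caller's IP.

```python
import socket

ODIF_PORT = 8081        # ODIF web application
RABBITMQ_PORT = 15672   # RabbitMQ login page


def service_urls(eip: str) -> dict:
    """Return the ODIF and RabbitMQ URLs for a given Elastic IP."""
    return {
        "odif": f"http://{eip}:{ODIF_PORT}",
        "rabbitmq": f"http://{eip}:{RABBITMQ_PORT}",
    }


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Try a TCP connection; return False on refusal, timeout, or other OS error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Placeholder EIP; replace with the address allocated to the instance.
urls = service_urls("203.0.113.25")
```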
ODIF Screen Details
Compute Details
The ODIF user is an admin user, used to set up the basic configuration. After logging in as admin, Compute Details is the first screen. ODIF launches resources dynamically based on the input source size, so some details are needed to set up dynamic compute.
Select the compute type (AWS by default).
Field | Description |
Compute Type | Default AWS. |
EC2 Role | IAM Role for EC2 |
EMR Role | EMR Default Role |
EC2 Key Pair | Key Pair created at the time of Instance Launch. |
AWS Region | Region where compute resource will be launched. |
S3 Bucket Name | Existing S3 Bucket to hold ODIF assets and EMR logs. |
EC2 Instance Profile | Instance profile ARN for EMR_EC2_DefaultRole. Note: It needs to be Instance profile ARN not Role ARN. |
EMR Slave Security Group | EMR Slave Security Group. |
EMR Master Security Group | EMR Master Security Group. |
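Before filling in the screen, the values can be sanity-checked offline. The sketch below checks that every field from the table has a value and that the EC2 Instance Profile is an instance profile ARN rather than a role ARN. The dictionary keys mirror the table's field names; the function itself is hypothetical and is not part of ODIF.

```python
import re

# Field names taken from the Compute Details table.
REQUIRED_FIELDS = [
    "Compute Type", "EC2 Role", "EMR Role", "EC2 Key Pair", "AWS Region",
    "S3 Bucket Name", "EC2 Instance Profile",
    "EMR Slave Security Group", "EMR Master Security Group",
]

# Instance profile ARNs look like arn:aws:iam::<account-id>:instance-profile/<name>;
# role ARNs use ":role/" instead.
INSTANCE_PROFILE_ARN = re.compile(r"^arn:aws[\w-]*:iam::\d{12}:instance-profile/.+$")


def validate_compute_details(details: dict) -> list:
    """Return a list of problems found; an empty list means the values look sane."""
    problems = [f"missing: {name}" for name in REQUIRED_FIELDS if not details.get(name)]
    profile = details.get("EC2 Instance Profile", "")
    if profile and not INSTANCE_PROFILE_ARN.match(profile):
        problems.append("EC2 Instance Profile must be an instance profile ARN, not a role ARN")
    return problems
```

Passing a role ARN such as `arn:aws:iam::123456789012:role/EMR_EC2_DefaultRole` for the instance profile field yields one reported problem, which is the mistake the table's note warns about.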
Connectors
- The Connector screen is for creating and modifying connectors. Once a connector is created, it can be used as a source or sink.
Once the details are filled in, connectivity with the source/sink is validated, and after successful validation the connector can be submitted.
Job Configuration
This screen helps configure a job, where a link between a source and a sink can be set up. For a MySQL connector, database and table information can be provided. It provides the following features:
- Load multiple tables from one database in one job. Use
- Load tables from different databases.
- A user-specific query can also be provided in the Query section.
Job Run
This screen submits the job that transfers data from the source to the sink. Use as default as
option is for other static on-premises Hadoop cluster.