Kiran Sai T

Manassas

Summary

9+ years of professional IT experience in analyzing requirements and designing and building highly distributed, mission-critical products and applications.

Highly dedicated and results-oriented Hadoop and Big Data professional with over 7 years of strong end-to-end experience in Hadoop development, with varying levels of expertise in different Big Data environment projects and technologies like MapReduce, YARN, HDFS, Apache Cassandra, HBase, Oozie, Hive, Sqoop, Pig, ZooKeeper, and Flume.

In-depth knowledge of HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce programming.

Extensive experience working with Master Data Management (MDM) and the applications used for MDM.

Efficient in all phases of the development lifecycle, including Data Cleansing, Data Conversion, Data Profiling, Data Mapping, Performance Tuning, and System Testing.

Expertise in converting MapReduce programs into Spark transformations using Spark RDDs.

Expertise in Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, and Spark MLlib.

Configured Spark streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala.

Experience in implementing real-time event processing and analytics using stream-processing frameworks like Spark Streaming.

Experience using Kafka and Kafka brokers with Spark contexts to process live streaming data with the help of RDDs.

Good knowledge of Amazon AWS concepts, like EMR and EC2 web services, which provide fast and efficient processing of Big Data.

Experience with all flavors of Hadoop distributions, including Cloudera, Hortonworks, MapR, and Apache.

Experience in installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera (5.x) distributions, and on Amazon Web Services (AWS).

Expertise in implementing Spark Scala applications using higher-order functions for both batch and interactive analysis requirements.

Extensive experience working with Spark tools like RDD transformations, Spark MLlib, and Spark SQL.

Hands-on experience in writing Hadoop jobs for analyzing data using Hive QL (queries), Pig Latin (data flow language), and custom MapReduce programs in Java.

Experienced in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing, and internal/external tables.

Proficient in normalization and de-normalization techniques in relational and dimensional database environments, and have done normalizations up to 3NF.

Good understanding of the Ralph Kimball (dimensional) and Bill Inmon (relational) modeling methodologies.

Extensive experience in collecting and storing stream data, such as log data, in HDFS using Apache Flume.

Experienced in using Pig scripts to perform transformations, event joins, filters, and some pre-aggregations before storing the data onto HDFS.

Involved in creating custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HQL (HiveQL).

Skilled with Python, Bash/Shell, PowerShell, Ruby, Perl, YAML, and Groovy. Developed Shell and Python scripts used to automate day-to-day administrative tasks and automated the build and release process.

Good experience with NoSQL databases like HBase, MongoDB, and Cassandra.

Experience using Cassandra CQL with Java APIs to retrieve data from Cassandra tables.

Hands-on experience in querying and analyzing data from Cassandra for quick searching, sorting, and grouping through CQL.

Experience working with MongoDB for distributed storage and processing.

Good knowledge of and experience in extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.

Worked on importing data into HBase using the HBase Shell and the HBase Client API.

Experience in designing and developing tables in HBase and storing aggregated data from Hive tables.

Experience with the Oozie Workflow Engine in running workflow jobs with actions that run Java MapReduce and Pig jobs.

Great hands-on experience with PySpark for using Spark libraries by using Python scripting for data analysis.

Implemented data science algorithms, like shift detection in critical data points, using Spark, doubling the performance.

Extensive experience working with various Hadoop distributions, such as enterprise versions of Cloudera (CDH4/CDH5) and Hortonworks, and good knowledge of the MapR distribution, IBM BigInsights, and Amazon EMR (Elastic MapReduce).

Experience in designing and developing the POC in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.

Developed automated processes for flattening the upstream data from Cassandra, which is in JSON format. Used Hive UDFs to flatten the JSON data.

Expertise in developing responsive front-end components with JavaScript, JSP, HTML, XHTML, Servlets, Ajax, and AngularJS.

Experience as a Java Developer in web/intranet, client/server technologies using Java, J2EE, Servlets, JSP, JSF, EJB, JDBC, and SQL.

Good knowledge of working with scheduling jobs in Hadoop using FIFO, Fair Scheduler, and Capacity Scheduler.

Experienced in designing both time-driven and data-driven automated workflows using Oozie and Zookeeper.

Experience in writing stored procedures and complex SQL queries using relational databases, like Oracle, SQL Server, and MySQL.

Experience in extraction, transformation, and loading (ETL) of data from multiple sources, such as flat files, XML files, and databases.

Supported various reporting teams and had experience with the data visualization tool, Tableau.

Implemented data quality in the ETL tool Talend and have good knowledge in data warehousing and ETL tools like IBM DataStage, Informatica, and Talend.

Experienced in and in-depth knowledge of cloud integration with AWS using Elastic MapReduce (EMR), Simple Storage Service (S3), EC2, and Redshift, as well as Microsoft Azure.

A detailed understanding of the Software Development Life Cycle (SDLC) and strong knowledge of project implementation methodologies, like Waterfall and Agile.

Overview

10 years of professional experience

Work History

Sr. Data Engineer

Citizens Bank
New York
01.2023 - Current
  • Responsible for building scalable, distributed data solutions using Hadoop.
  • Experience in creating a Kafka producer and a Kafka consumer for Spark streaming, which gets data from different learning systems of the patients.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala.
  • Developed various Java objects (POJOs) as part of persistence classes for ORM mapping with databases.
  • Developed the HBase data model on top of HDFS data to perform real-time analytics using the Java API.
  • Implemented data ingestion systems by creating Kafka brokers, Java producers, consumers, and custom encoders.
  • Used Spark Streaming to divide streaming data into batches as an input to the Spark engine for batch processing.
  • Evaluated the performance of Apache Spark in analyzing genomic data.
  • Performed advanced procedures, like text analytics and processing, using the in-memory computing capabilities of Spark.
  • Experienced in AWS Cloud Services such as IAM, EC2, S3, AMI, VPC, Auto-Scaling, Security Groups, Route 53, ELB, EBS, RDS, SNS, SQS, CloudWatch, CloudFormation, CloudFront, Snowball, and Glacier.
  • Used AWS S3 buckets to store the file, injected the files into Snowflake tables using Snowpipe, and ran deltas using data pipelines.
  • Worked on complex SnowSQL and Python queries in Snowflake.
  • Worked on optimizing volumes with EC2 instances, created multiple VPC instances, deployed those applications on AWS using Elastic Beanstalk, and implemented and set up Route 53 for AWS web instances.
  • Configured AWS Identity and Access Management (IAM) groups and users for improved login authentication. Provided policies to groups using the policy generator and set different permissions based on the requirements, along with providing Amazon Resource Names (ARNs).
  • Worked on all data management activities on the project, data sources, and data migration.
  • Worked with data compliance and data governance teams to maintain data models, metadata, and data dictionaries, defining source fields and their definitions.
  • Deployed microservice onboarding tools leveraging Python and Jenkins, allowing for easy creation and maintenance of build jobs, as well as Kubernetes deployments and services.
  • Worked with AWS CloudFormation templates, Terraform, and Ansible to render templates, and with Murano and Heat Orchestration templates in an OpenStack environment.
  • Developed REST APIs using Python with the Flask framework and completed the integration of various data sources, including RDBMS, shell scripting, spreadsheets, and text files.
  • Generated Java APIs for retrieval and analysis on NoSQL databases, such as HBase and Cassandra.
  • Wrote HBase client programs in Java and web services.
  • Implemented Agile methodology for building an internal application.
  • Developed AI and machine learning algorithms, such as classification, regression, and deep learning, using Python.
  • Conducted statistical analysis on healthcare data using Python and various tools.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and other services of the AWS family.
  • Developed Python, Ant, and UNIX shell scripts to automate the deployment process.
  • Experience in working with JFrog Artifactory to deploy artifacts, and used shell scripts: Bash, Python, and PowerShell for automating tasks in Linux and Windows environments.
  • Selected appropriate AWS services to design and deploy an application based on given requirements.
  • Created concurrent access for Hive tables with shared and exclusive locking that can be enabled in Hive with the help of Zookeeper implementation in the cluster. Designed and implemented a test environment on AWS.
  • Stored and loaded data from HDFS to Amazon S3 and backed up the namespace data into NFS.
  • Worked closely with EC2 infrastructure teams to troubleshoot complex issues.
  • Worked with AWS Cloud and created EMR clusters with Spark for analyzing raw data processing and accessing data from S3 buckets.
  • Involved in installing EMR clusters on AWS.
  • Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
  • Designed the NiFi/HBase pipeline to collect the processed customer data into HBase tables.
  • Applied transformation rules on top of DataFrames.
  • Worked with different file formats, like TextFile, Avro, ORC, and Parquet, for Hive querying and processing.
  • Developed Hive UDFs and UDAFs for rating aggregation.
  • Developed a Java client API for CRUD and analytical operations by building a RESTful server and exposing data from NoSQL databases like Cassandra via the REST protocol.
  • Created Hive tables and involved in data loading and writing Hive UDFs.
  • Experience in managing and reviewing Hadoop log files.
  • Involved in Core Java concepts like Collections, Multi-Threading and Serialization.
  • Worked extensively with Sqoop to move data from DB2 and Teradata to HDFS.
  • Collected the log data from web servers and integrated it into HDFS using Kafka.
  • Provided ad-hoc queries and data metrics to business users using Hive and Impala.
  • Worked on various performance optimizations, such as using distributed cache for small datasets, partitioning, bucketing in Hive, map-side joins, etc.
  • Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
  • Responsible for coding the business logic using Python, J2EE/Full Stack technologies, and Core Java concepts.
  • Utilized Spark Core, Spark Streaming, and Spark SQL API for faster processing of data instead of using MapReduce in Java.
  • Defined the reference architecture for Big Data Hadoop to maintain structured and unstructured data within the enterprise.
  • Led the efforts to develop and deliver the data architecture plan and data models for the multiple data warehouses and data marts attached to the Data Lake project.
  • Implemented a variety of AWS computing and networking services to meet the needs of applications.
  • Created Talend jobs to copy the files from one server to another and utilized Talend FTP components.
  • Created Talend jobs to load data into various Oracle tables. Utilized Oracle stored procedures and wrote some Java code to capture global map variables and use them in the jobs. Used ETL methodologies and best practices to create Talend ETL jobs. Followed and enhanced programming and naming standards.
  • Developed Talend jobs to populate the claims data into the data warehouse, using a star schema.

Data Engineer

Walgreens
04.2019 - 09.2022
  • Extensively migrated the existing architecture to Spark Streaming to process the live streaming data.
  • Responsible for Spark Core configuration based on the type of input source.
  • Executed Spark code using Scala for Spark Streaming/SQL for faster processing of data.
  • Performed SQL joins among Hive tables to get input for Spark batch process.
  • Developed PySpark code to mimic the transformations performed in the on-premise environment.
  • Analyzed the SQL scripts and designed solutions to implement them using PySpark. Created new custom columns, depending on the use case, while ingesting the data into the Hadoop lake using PySpark.
  • Worked with the Data Governance, Data Quality, and Metadata Management team to understand the project.
  • Implemented a data governance and data quality framework, including data cataloging, data lineage, MDM, a business glossary, data stewardship, and operational metadata.
  • Created S3 buckets and managed policies for S3 buckets, using them for storage, backup, and archiving in AWS, and worked on AWS Lambda, which runs the code in response to events.
  • Assisted application teams in creating complex IAM policies for administration within AWS, and maintained DNS records using Route 53. Used Amazon Route 53 to manage DNS zones and assign public DNS names to Elastic Load Balancer IPs.
  • Reviewed source feeds, delivery mechanisms (messages/SFTP, etc.), frequency, and full/partial loads; identified customer PII using fuzzy logic; and merged customer information from multiple sources into one MDM record, maintaining it with the help of a data steward.
  • Performed data management and data governance activities: lineage, data quality rules, thresholds, alerts, etc.
  • Evaluated tools using Gartner's Magic Quadrants, functionality metrics, and product scores—Informatica MDM, Ataccama, IBM, Semarchy, Contentserv, etc.
  • Developed a data pipeline using Flume, Sqoop, Pig, and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Automated AWS components like EC2 instances, security groups, ELB, RDS, and IAM through AWS CloudFormation templates.
  • Created Snowflake warehouses, databases, and tables; designed data pipelines to a Snowflake stage; and used SnowSQL to transform and load data into the Snowflake warehouse.
  • Created a Redshift data warehouse for another client who preferred Redshift to Snowflake. Created tables and applied distribution keys, sort keys, a vacuum strategy, etc.
  • Analyzed the Cassandra database and compared it with other open-source NoSQL databases to determine which one better suits the current requirements.
  • Integrated Cassandra as a distributed, persistent metadata store to provide metadata resolution for network entities on the network.
  • Developed multiple MapReduce jobs in Java to clean datasets.
  • Implemented Spark using Scala and used PySpark with Python for faster testing and processing of data.
  • Designed multiple Python packages that were used within a large ETL process to load 2 TB of data from an existing Oracle database into a new PostgreSQL cluster.
  • Wrote ETL jobs to read from web APIs using REST and HTTP calls, and loaded them into HDFS using Java.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
  • Loaded data from the Linux file system to HDFS, and vice versa.
  • Developed UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation queries and wrote the results back into OLTP systems through Sqoop.
  • Managed AWS EC2 instances utilizing Auto Scaling, Elastic Load Balancing, and Glacier for our QA and UAT environments, as well as infrastructure servers for Git.
  • Migrated an existing on-premises application to AWS.
  • Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions and data warehouse tools for reporting and data analysis.
  • Implemented advanced procedures, like text analytics and processing, using the in-memory computing capabilities of Apache Spark written in Scala.
  • Extensively used Informatica client tools like Designer, Workflow Manager, Workflow Monitor, Repository Manager, and server tools: Informatica Server and Repository Server.
  • Installed and monitored Hadoop ecosystem tools on multiple operating systems, such as Ubuntu and CentOS.
  • Developed Scala scripts using both DataFrames, SQL, Datasets, and RDDs/MapReduce in Spark for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
  • Extensively used Zookeeper as a job scheduler for Spark jobs.
  • Extended Hive and Pig core functionality by writing custom UDFs.
  • Designed, wrote, and maintained systems in Python scripting for administering GIT, by using Jenkins as a full-cycle continuous delivery tool involving package creation, distribution, and deployment onto Tomcat application servers via shell scripts embedded into Jenkins jobs.
  • Involved in writing shell scripts to automate WebSphere admin tasks, application-specific syncs, backups, and other schedulers.
  • Experience in moving data in and out of Windows Azure SQL Databases and Blob Storage.
  • Experience in designing Kafka for a multi-data center cluster and monitoring it.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).
  • Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and ZooKeeper.
  • Experience in Kafka and Spark integration for real-time data processing.
  • Developed Kafka producer and consumer components for real-time data processing.
  • Hands-on experience in setting up Kafka MirrorMaker for data replication across clusters.
  • Experience in configuring, designing, implementing, and monitoring Kafka clusters and connectors.
  • Involved in loading data from the UNIX file system to HDFS using shell scripting.
  • Hands-on experience in Linux shell scripting.
  • Imported and exported data into HDFS from the Oracle database using NiFi.
  • Started using Apache NiFi to copy the data from the local file system to HDFS.
  • Worked on the NiFi data pipeline to process a large set of data and configured lookups for data validation and integrity.
  • Worked with different file formats, like JSON, AVRO, and Parquet.
  • Experienced in using Apache Hue and Ambari to manage and monitor the Hadoop clusters.
  • Experienced in using version control systems like SVN, GIT, build tool Maven, and continuous integration tool Jenkins.
  • Worked on REST APIs in Java 7 to support internationalization, and on apps to help our buyer team visualize and set portfolio performance targets.
  • Used Mockito to develop test cases for Java bean components and test them through the testing framework.
  • Good experience in using relational databases: Oracle, SQL Server, and PostgreSQL.
  • Worked on developing a middle-tier environment using SSIS, Python, and Java in a J2EE/full stack environment.
  • Worked with Agile and Scrum software development frameworks for managing product development.
  • Used Ambari to monitor node health and the status of jobs in Hadoop clusters.
  • Implemented Kerberos for strong authentication to provide data security.
  • Involved in creating Hive tables, loading data, and analyzing data using Hive queries.
  • Experience in creating dashboards and generating reports using Tableau by connecting to tables in Hive and HBase.
  • Created Sqoop jobs to populate data present in relational databases to Hive tables.
  • Experience in importing and exporting data using Sqoop from HDFS, Hive, HBase, to Relational Database Systems, and vice versa. Skilled in data migration ecosystem.
  • Performed Oracle SQL tuning using explain plans.
  • Manipulated, serialized, and modeled data in multiple formats, like JSON and XML.
  • Involved in setting up MapReduce 1 and MapReduce 2.
  • Prepared Avro schema files for generating Hive tables.
  • Used Impala connectivity from the User Interface (UI) and queried the results using ImpalaQL.
  • Worked on the physical transformations of the data model, which involved creating tables, indexes, joins, views, and partitions.
  • Involved in analysis, design, system architecture design, process interface design, and documentation.
  • Used Jira for bug tracking and Bitbucket to check in and check out code changes.
  • Involved in Cassandra data modeling to create keyspaces and tables in a multi-data center DSE Cassandra DB.
  • Utilized Agile and Scrum methodology to help manage and organize a team of developers, with regular code review sessions.
  • Re-engineered n-tiered architecture involving technologies like EJB, XML, and Java into distributed applications.
  • Loaded and transformed large sets of structured and semi-structured data using Hive and Impala with Elasticsearch.
  • Worked closely with different business teams to gather requirements, prepare functional and technical documents, and run the UAT process for creating data quality rules in Cosmos and Cosmos Streams.

Data Engineer

Mindtree
04.2018 - 04.2019
  • Installed and configured Hadoop MapReduce, HDFS, and developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Migrated existing SQL queries to HiveQL queries to move to a big data analytical platform.
  • Integrated the Cassandra file system with Hadoop using MapReduce to perform analytics on Cassandra data.
  • Installed and configured a Cassandra DSE multi-node, multi-data center cluster.
  • Created business logic using servlets, session beans, and deployed them on WebLogic Server.
  • Wrote complex SQL queries and stored procedures.
  • Developed the XML Schema and Amazon Web Services for data maintenance and structures.
  • Worked on analyzing and writing Hadoop MapReduce jobs using the Java API, Pig, and Hive.
  • Selected the appropriate AWS service based on data, compute, and system requirements.
  • Implemented Shell, Perl, and Python scripts for release and build automation, then modified and automated scripts to suit the requirements.
  • Designed and implemented a 24-node Cassandra cluster for a single-point inventory application.
  • Analyzed the performance of the Cassandra cluster using nodetool tpstats and cfstats for thread and latency analysis.
  • Implemented real-time analytics on Cassandra data using the Thrift API.
  • Responsible for managing data coming from different sources.
  • Supported MapReduce programs that are running on the cluster.
  • Involved in loading data from the UNIX file system to HDFS.
  • Worked on installing the cluster, commissioning, and decommissioning of the data node, name node recovery, capacity planning, and slots configuration.
  • Loaded and transformed large sets of data into HDFS using Hadoop fs commands.
  • Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
  • Implemented UDFs, UDAFs in Java and Python for Hive to process the data that can't be performed using Hive's inbuilt functions.
  • Did various performance optimizations, such as using distributed cache for small datasets, partitioning and bucketing in Hive, and performing map-side joins, etc.
  • Worked on importing and exporting data from Oracle and DB2 into HDFS and HIVE using Sqoop for analysis, visualization, and to generate reports.
  • Involved in writing optimized Pig Script, along with developing and testing Pig Latin scripts.
  • Supported setting up and updating configurations for implementing scripts with Pig and Sqoop.
  • Designed the logical and physical data modeling, and wrote DML scripts for the Oracle 9i database.
  • Used the Hibernate ORM framework with the Spring framework for data persistence.
  • Wrote test cases in JUnit for the unit testing of classes.
  • Involved in developing templates and screens in HTML and JavaScript.

Java Developer

E CAPS computers India Pvt ltd
05.2015 - 04.2018
  • Experience in coding Servlets on the server side, which get requests from the client and process them by interacting with the Oracle database.
  • Coded Java servlets to control and maintain the session state, and handle user requests.
  • GUI development using HTML forms and frames, and validating the data with JavaScript.
  • Used JDBC to connect to the backend database, and developed stored procedures.
  • Developed code to handle web requests involving Request Handlers, Business Objects, and Data Access Objects.
  • Created JSP pages, including the use of JSP custom tags, other methods of JavaBean presentation, and all HTML and graphically oriented aspects of the site's user interface.
  • Used XML for mapping the pages and classes, and to transfer data universally among different data sources.
  • Worked on unit testing and documentation.
  • Hands-on experience in the J2EE framework, Struts.
  • Implemented Spring Model-View-Controller (MVC) architecture-based presentation using the JSF framework. Extensively used Core Java APIs in developing the business logic.
  • Designed and developed agile applications, lightweight solutions, and integrated applications by using and integrating different frameworks, like Struts and Spring.
  • Involved in all the phases of the Software Development Life Cycle (SDLC), including analysis, designing, coding, testing, and deployment of the application.
  • Developed Class Diagrams, Sequence Diagrams, and State Diagrams using Rational Rose.
  • Developed a user interface using JSP, JSP Tag libraries, JSTL, HTML, CSS, and JavaScript to simplify the complexities of the application.
  • Adapted various design patterns, like Business Delegate, Data Access Objects, and MVC. Used the Spring framework to implement MVC architecture.
  • Implemented layout management using Struts Tiles Framework.
  • Used the Struts validation framework in the presentation layer.
  • Used the Core Spring framework for dependency injection.
  • Developed JPA mapping to the database tables to access the data from the Oracle database.
  • Created JUnit test case design logic and implementation throughout the application.
  • Extensively used ClearCase for version controlling.

Education

Master of Science -

SUNY College At New Paltz
New Paltz, NY

Timeline

Sr. Data Engineer

Citizens Bank
01.2023 - Current

Data Engineer

Walgreens
04.2019 - 09.2022

Data Engineer

Mindtree
04.2018 - 04.2019

Java Developer

E CAPS computers India Pvt ltd
05.2015 - 04.2018

Master of Science -

SUNY College At New Paltz