9+ years of professional IT experience in analyzing requirements and designing and building highly distributed, mission-critical products and applications.
Highly dedicated and results-oriented Hadoop and Big Data professional with over 7 years of strong end-to-end experience in Hadoop development, with varying levels of expertise in different Big Data environment projects and technologies like MapReduce, YARN, HDFS, Apache Cassandra, HBase, Oozie, Hive, Sqoop, Pig, ZooKeeper, and Flume.
In-depth knowledge of HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce programming.
Extensive experience working on Master Data Management (MDM) and the applications used for MDM.
Efficient in all phases of the development lifecycle, including Data Cleansing, Data Conversion, Data Profiling, Data Mapping, Performance Tuning, and System Testing.
Expertise in converting MapReduce programs into Spark transformations using Spark RDDs.
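A minimal sketch of such a conversion, shown here on the classic word count; the HDFS paths are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object WordCountRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCountRdd").getOrCreate()
    val sc = spark.sparkContext

    // MapReduce's map phase becomes flatMap/map; its reduce phase becomes reduceByKey.
    val counts = sc.textFile("hdfs:///data/input")     // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/wordcounts")   // hypothetical output path
    spark.stop()
  }
}
```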
Expertise in Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, and Spark MLlib.
Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala.
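A hedged sketch of that pipeline using the Kafka 0.10 direct-stream integration; the broker address, topic, consumer group, and output path are hypothetical:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaToHdfs")
    val ssc = new StreamingContext(conf, Seconds(10))  // 10-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",           // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "hdfs-sink",                       // hypothetical consumer group
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("events"), kafkaParams))

    // Persist each micro-batch of message values to HDFS as text files.
    stream.map(_.value()).saveAsTextFiles("hdfs:///streams/events/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}
```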
Experience in implementing real-time event processing and analytics using streaming frameworks such as Spark Streaming.
Experience in using Kafka brokers with a Spark Streaming context to process live streaming data as RDDs.
Good knowledge of AWS concepts, such as the EMR and EC2 web services, which provide fast and efficient processing of Big Data.
Experience with all flavors of Hadoop distributions, including Cloudera, Hortonworks, MapR, and Apache.
Experience in installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera (CDH 5.x) distributions, including on Amazon Web Services (AWS).
Expertise in implementing Spark Scala applications using higher-order functions for both batch and interactive analysis requirements.
Extensive experience working with Spark tools like RDD transformations, Spark MLlib, and Spark SQL.
Hands-on experience in writing Hadoop jobs for analyzing data using HiveQL (queries), Pig Latin (data flow language), and custom MapReduce programs in Java.
Experienced in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing, and internal/external tables.
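A brief sketch of the kind of HiveQL involved, run here through Spark's Hive support; the table names, columns, and HDFS location are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object HiveTablesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveTablesSketch")
      .enableHiveSupport()  // route HiveQL through the Hive metastore
      .getOrCreate()

    // External, partitioned table over raw files already sitting in HDFS.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS clicks (user_id BIGINT, url STRING)
        |PARTITIONED BY (dt STRING)
        |STORED AS ORC
        |LOCATION 'hdfs:///warehouse/clicks'""".stripMargin)

    // Join against an internal dimension table, with partition pruning on dt.
    spark.sql(
      """SELECT c.url, u.country
        |FROM clicks c
        |JOIN users u ON c.user_id = u.user_id
        |WHERE c.dt = '2020-01-01'""".stripMargin).show()
  }
}
```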
Proficient in normalization and de-normalization techniques in relational and dimensional database environments, with normalization performed up to 3NF.
Good understanding of the Ralph Kimball (dimensional) and Bill Inmon (relational) modeling methodologies.
Extensive experience in collecting and storing stream data, such as log data, in HDFS using Apache Flume.
Experienced in using Pig scripts to perform transformations, event joins, filters, and some pre-aggregations before storing the data onto HDFS.
Experience in creating custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HiveQL; a sketch follows.
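A minimal sketch of such a custom Hive UDF, written here in Scala; the class name, function name, and registration snippet are hypothetical:

```scala
import org.apache.hadoop.hive.ql.exec.UDF

// Hypothetical Hive UDF that normalizes free-text fields. After packaging
// into a JAR, it would be registered from HiveQL with, e.g.:
//   ADD JAR /path/to/udfs.jar;
//   CREATE TEMPORARY FUNCTION normalize_text AS 'NormalizeTextUdf';
//   SELECT normalize_text(comment) FROM feedback;
class NormalizeTextUdf extends UDF {
  // Hive resolves evaluate() by reflection; null stays null.
  def evaluate(s: String): String =
    if (s == null) null else s.trim.toLowerCase.replaceAll("\\s+", " ")
}
```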
Skilled with Python, Bash/Shell, PowerShell, Ruby, Perl, YAML, and Groovy. Developed Shell and Python scripts to automate day-to-day administrative tasks and the build and release process.
Good experience with NoSQL databases like HBase, MongoDB, and Cassandra.
Experience using Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
Hands-on experience in querying and analyzing data from Cassandra for quick searching, sorting, and grouping through CQL.
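A short sketch of that kind of access, using the DataStax Java driver from Scala; the contact point, keyspace, table, and partition key are hypothetical:

```scala
import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

object CassandraCqlSketch {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder()
      .addContactPoint("127.0.0.1")              // hypothetical contact point
      .build()
    val session = cluster.connect("analytics")   // hypothetical keyspace

    // Parameterized CQL query scoped to a single partition key.
    val rs = session.execute(
      "SELECT event_time, payload FROM events WHERE device_id = ? LIMIT 10",
      "device-42")

    for (row <- rs.asScala)
      println(s"${row.getTimestamp("event_time")} -> ${row.getString("payload")}")

    cluster.close()
  }
}
```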
Experience working with MongoDB for distributed storage and processing.
Good knowledge of and experience in extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.
Worked on importing data into HBase using the HBase Shell and the HBase Client API.
Experience in designing and developing tables in HBase and storing aggregated data from Hive tables.
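A minimal sketch of the HBase Client API usage mentioned above; the table, column family, row key, and values are hypothetical:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseClientSketch {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()   // reads hbase-site.xml from the classpath
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("metrics")) // hypothetical table

    // Write one aggregated value into column family "d".
    val put = new Put(Bytes.toBytes("host1#2020-01-01"))
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("avg_cpu"), Bytes.toBytes("0.42"))
    table.put(put)

    // Read it back by row key.
    val result = table.get(new Get(Bytes.toBytes("host1#2020-01-01")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("avg_cpu"))))

    table.close()
    connection.close()
  }
}
```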
Experience with the Oozie Workflow Engine in running workflow jobs with actions that run Java MapReduce and Pig jobs.
Strong hands-on experience with PySpark, using Spark libraries via Python scripting for data analysis.
Implemented data science algorithms, such as shift detection in critical data points, using Spark, doubling performance.
Extensive experience working with various Hadoop distributions, such as the enterprise versions of Cloudera (CDH4/CDH5) and Hortonworks, with good knowledge of the MapR distribution, IBM BigInsights, and Amazon EMR (Elastic MapReduce).
Experience in designing and developing the POC in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
Developed automated processes for flattening the upstream data from Cassandra, which is in JSON format. Used Hive UDFs to flatten the JSON data.
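A hedged sketch of that flattening step, using the built-in get_json_object UDF; the table names, column, and JSON paths are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object FlattenJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FlattenJsonSketch")
      .enableHiveSupport()
      .getOrCreate()

    // raw_events(doc STRING) holds one JSON document per row, as exported
    // upstream; get_json_object pulls nested fields into flat columns.
    spark.sql(
      """SELECT get_json_object(doc, '$.user.id')    AS user_id,
        |       get_json_object(doc, '$.event.type') AS event_type,
        |       get_json_object(doc, '$.event.ts')   AS event_ts
        |FROM raw_events""".stripMargin)
      .write.saveAsTable("flat_events")
  }
}
```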
Expertise in developing responsive front-end components with JavaScript, JSP, HTML, XHTML, Servlets, Ajax, and AngularJS.
Experience as a Java Developer in web/intranet, client/server technologies using Java, J2EE, Servlets, JSP, JSF, EJB, JDBC, and SQL.
Good knowledge of working with scheduling jobs in Hadoop using FIFO, Fair Scheduler, and Capacity Scheduler.
Experienced in designing both time-driven and data-driven automated workflows using Oozie and ZooKeeper.
Experience in writing stored procedures and complex SQL queries against relational databases such as Oracle, SQL Server, and MySQL.
Experience in extraction, transformation, and loading (ETL) of data from multiple sources, such as flat files, XML files, and databases.
Supported various reporting teams; experienced with the data visualization tool Tableau.
Implemented data quality in the ETL tool Talend, and have good knowledge of data warehousing and ETL tools such as IBM DataStage, Informatica, and Talend.
Experienced in, with in-depth knowledge of, cloud integration with AWS using Elastic MapReduce (EMR), Simple Storage Service (S3), EC2, and Redshift, as well as Microsoft Azure.
A detailed understanding of the Software Development Life Cycle (SDLC) and strong knowledge of project implementation methodologies, like Waterfall and Agile.
Analyzed the Cassandra database and compared it with other open-source NoSQL databases to determine which best suited the current requirements.
Implemented Spark using Scala, and used PySpark using Python for faster testing and processing of data.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
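A small sketch of one such conversion, rewriting a GROUP BY average as RDD transformations; the input path and schema are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object HiveToRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("HiveToRddSketch").getOrCreate()
    val sc = spark.sparkContext

    // Hive equivalent:
    //   SELECT dept, AVG(salary) FROM employees GROUP BY dept;
    // Assumes CSV rows of the form "name,dept,salary" at a hypothetical path.
    val rows = sc.textFile("hdfs:///data/employees.csv").map(_.split(","))

    val avgByDept = rows
      .map(cols => (cols(1), (cols(2).toDouble, 1)))   // (dept, (salary, count))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }

    avgByDept.collect().foreach { case (dept, avg) => println(s"$dept\t$avg") }
    spark.stop()
  }
}
```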
Applied various design patterns, such as Business Delegate, Data Access Object, and MVC, and used the Spring framework to implement the MVC architecture.
Used the Struts validation framework in the presentation layer.