
Explore how Apache Hive enables SQL-like analytics on Hadoop without Java, using HiveQL to query data stored in HDFS; learn about the metastore, data types, and creating databases.
Master Hive database basics by creating, using, showing, and dropping databases, then create an employee table with ROW FORMAT DELIMITED fields terminated by tab and load data from the local file system or HDFS.
Learn to load data into Hive using the load data command, with local and HDFS options, overwrite to avoid duplicates, and use alter table to rename, add, or drop columns.
Learn to alter Hive tables by replacing columns, changing data types, and adding a city column, using show create table, show tables, and drop table, plus external versus managed and temporary tables.
Understand external tables in Hive, when to use them to preserve data, how to create them with the external keyword, and the basics of embedded, local, and remote metastores.
Demonstrates INSERT OVERWRITE into a partitioned table named emp_prtn, loading only records with a 2011 year of joining, and explains static versus dynamic partitioning when loading data from the employee table.
Master Hive joins, including shuffle join, map join, bucket map join, sort merge bucket join, and skew join, with optimization strategies and Hive properties to boost performance on large data.
Explains how skewed keys are handled in Hive joins: rows with skewed keys are routed to a map-side join against an in-memory hash table while the remaining data is joined in a reducer, and map-side joins are used for small data.
Explore how a SerDe in Hive handles JSON data, the serializer and deserializer roles, and how data is serialized and deserialized when storing to and retrieving from Hive tables.
Explore Hive user defined functions (UDF and UDAF) by building a UDF in a Maven project, packaging it as a JAR, uploading it to Hive, and invoking a temporary function in queries.
Learn how a Hive UDAF computes maximum values using the init, iterate, terminatePartial, merge, and terminate methods, demonstrated with two files and the Math.max approach.
Explore Hive as a SQL-like tool over HDFS that translates queries to MapReduce for batch processing. Hive stores metadata, not the data itself; practice creating databases with if not exists.
Mastering Hive teaches conditional statements like if and case, and functions such as isnull, coalesce, and nvl, with examples and string tools like split, substring, and instr.
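The null-handling functions named above can be illustrated outside Hive. The following is a minimal Python sketch of the semantics only, not Hive's implementation; the row and its column names are hypothetical.

```python
from typing import Any, Optional


def coalesce(*values: Any) -> Optional[Any]:
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


def nvl(value: Any, default: Any) -> Any:
    """Return value unless it is None, in which case return default."""
    return default if value is None else value


# Hypothetical row: bonus is missing, commission is present.
row = {"name": "asha", "bonus": None, "commission": 250}
print(nvl(row["bonus"], 0))
print(coalesce(row["bonus"], row["commission"], 0))
```

Here nvl is the two-argument special case, while coalesce accepts any number of fallbacks.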
Master Hive explode and lateral view to transform array elements into rows, launch a MapReduce job, join with other columns, and extract map keys and values.
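To make the array-to-rows idea concrete, here is a small Python sketch of what a lateral view over explode produces. It mimics the result shape only; the function name and sample columns are assumptions for illustration.

```python
from typing import Dict, Iterator, List, Tuple


def lateral_view_explode(rows: List[Dict], array_col: str) -> Iterator[Tuple]:
    """Emit one output row per array element, repeating the other columns."""
    for row in rows:
        # Keep every column except the array itself, in insertion order.
        others = tuple(v for k, v in row.items() if k != array_col)
        for element in row[array_col]:
            yield others + (element,)


# Hypothetical input: one array-typed column named "skills".
employees = [
    {"name": "asha", "skills": ["hive", "pig"]},
    {"name": "ravi", "skills": []},
]
for out in lateral_view_explode(employees, "skills"):
    print(out)
```

A row whose array is empty yields no output rows in this sketch, which matches a plain lateral view; keeping such rows would need the outer variant.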
Explore how Hive joins combine tables using equality conditions, covering left, right, and full outer joins, with two- and three-table joins and memory management via streaming versus buffering.
Use map join to load small tables into memory for a mapper-only join, avoiding the shuffle and reduce phases. Configure bucket map join with both tables bucketed on the same column and with the same number of buckets.
Explore Hive alter commands to modify table schemas and structures. Rename tables, change column names and types, add or replace columns, and set properties.
Master table sampling, a bucketing feature that draws samples from across partitions, unlike limit, using bucket one out of four or percent, memory, or rows parameters.
Hive archiving transfers older data to less frequently used storage to reduce NameNode load in HDFS, with archive and unarchive commands and notes that space is not salvaged.
Explore the rank function, dense rank function, and row number function and how they rank data within partitions, handle ties, and produce top-n results using partition by and order by.
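How the three functions differ on ties is easiest to see side by side. The sketch below, in Python, computes all three numberings for one partition ordered descending; the salary values are made up for illustration.

```python
from typing import List, Tuple


def rank_partition(values: List[int]) -> List[Tuple[int, int, int, int]]:
    """Return (value, row_number, rank, dense_rank) with values ordered descending."""
    ordered = sorted(values, reverse=True)
    result = []
    rank = 0
    dense = 0
    previous = None
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number  # rank skips ahead by the number of tied rows
            dense += 1         # dense rank always advances by exactly one
            previous = value
        result.append((value, row_number, rank, dense))
    return result


# Hypothetical salaries within a single department partition.
for row in rank_partition([300, 500, 500, 200]):
    print(row)
```

Filtering on any of the three columns being at most n gives a top-n result, with the choice of function deciding how ties are counted.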
Learn how to create, alter, and drop compact and bitmap indexes on a table, compare performance, and determine when indexing helps or hinders queries in large datasets.
Learn how Hive views act as virtual tables defined by a SQL query, selecting columns and filtering rows. See how later schema changes to the base table are not reflected in the view, and how to create views.
Configure and use Hive variables in queries with hivevar and hiveconf, run scripts via source, and execute Unix and HDFS commands from the Hive shell.
Explore how Hive parallelism enables executing independent stages in parallel to speed up joins of partial data from two tables, noting the built-in parallel property hive.exec.parallel and potential deadlocks.
Master Hive table properties to control data loading, distinguishing active from passive. Use skip.header.line.count and skip.footer.line.count to omit header and footer lines during load.
Learn to use Hive's null format property to treat empty fields as null values by setting serialization.null.format during table creation and validating with queries.
Demonstrate setting table properties for RC and ORC formats, including compression codecs (zlib, snappy, bzip2, gzip) and options like stripe size, row index stride, and bloom filters.
Explore Hive purge commands: differentiate drop vs truncate, distinguish internal vs external tables, and understand purge behavior that permanently deletes data (no trash) since version 0.14.
Explore how to capture data changes in Hive using slowly changing dimensions, detailing type zero, type one, and type two approaches for preserving history.
Demonstrate a Hive-based slowly changing dimension (SCD) example using a full outer join of two tables and CDC codes to label new, updated, or unchanged records.
Create a Hive table book_details with title, author, country, company, price, and year using an XML row format and XPath mappings. Load local books.xml and verify two rows via select.
Learn to prevent dropping and offline querying in Hive by enabling no drop and offline protections on tables and partitions, and then removing those protections with the disable commands.
Learn how to configure Hive with the .hiverc file, manage session versus global settings, and enable headers and map join, then explore cartesian products or cross joins.
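The full outer join with change codes can be sketched in Python as follows. This is an illustration under assumptions: the single-letter codes are hypothetical, and the "D" code for keys missing from the new snapshot is an addition not mentioned in the lecture summary.

```python
from typing import Dict, Optional


def cdc_codes(old: Dict[int, tuple], new: Dict[int, tuple]) -> Dict[int, str]:
    """Label each key after a full outer join of the old and new snapshots."""
    codes = {}
    for key in sorted(old.keys() | new.keys()):  # union of keys = full outer join
        before: Optional[tuple] = old.get(key)
        after: Optional[tuple] = new.get(key)
        if before is None:
            codes[key] = "I"  # only in new: new record
        elif after is None:
            codes[key] = "D"  # only in old: assumed "deleted" code
        elif before != after:
            codes[key] = "U"  # in both, attributes differ: update
        else:
            codes[key] = "N"  # in both, identical: no change
    return codes


# Hypothetical snapshots keyed by customer id.
old_rows = {1: ("asha", "pune"), 2: ("ravi", "delhi"), 3: ("meena", "goa")}
new_rows = {1: ("asha", "pune"), 2: ("ravi", "mumbai"), 4: ("kiran", "agra")}
print(cdc_codes(old_rows, new_rows))
```

In a type two approach, the "U" rows would typically close the old version and insert a new one rather than overwrite in place.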
Discover how Hive links metadata to a single data file to support multiple tables, including cases with fewer, equal, or more columns than the file, where extra columns become null.
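The fewer, equal, or more columns cases can be pictured with a short Python sketch of schema-on-read. It models the behaviour described above only, and the sample line and delimiter are assumptions.

```python
from typing import List, Optional


def read_row(line: str, column_count: int, delimiter: str = ",") -> List[Optional[str]]:
    """Project one line of the file onto a table that declares column_count columns."""
    fields: List[Optional[str]] = list(line.split(delimiter))
    fields = fields[:column_count]                    # table has fewer columns: extras are ignored
    fields += [None] * (column_count - len(fields))   # table has more columns: pad with nulls
    return fields


line = "1,asha,pune"           # hypothetical file row with three fields
print(read_row(line, 2))       # table with fewer columns than the file
print(read_row(line, 3))       # table with the same number of columns
print(read_row(line, 5))       # table with more columns than the file
```

The same file is never rewritten; only the table definition changes how each line is interpreted.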
Master Hive file merging to reduce small files by configuring hive.merge settings, and learn the rlike function for substring and regex matching.
Explore Hive configurations: speculative execution launches duplicate attempts of slow tasks on other nodes, with separate map and reduce options; enable bucketing and hive.auto.convert.join for map joins on small tables.
Learn how Hive compression reduces storage and speeds data transfer in a Hadoop cluster by compressing input and map output, with codecs such as gzip, bzip2, lzma2, and snappy.
Explore the three Hive modes: embedded, local, and remote, and learn their use cases. Understand how Metastore, Derby database, and JDBC shape performance and parallel sessions.
Learn to enable compression in Hive to reduce storage and network traffic across input, map output, and reducer output, using gzip, bzip2, lzo, or snappy.
Explore Hive's embedded, local, and remote modes, detailing metastore and Derby database setups, how to switch modes, and when to use each for small-scale to production environments.
Compare internal and external tables in Hive, showing that internal tables are controlled by Hive and lose data and metadata when dropped, while external tables retain data in HDFS.
Explore Hive managed tables by creating a sample database and employee table, loading comma-delimited data, and querying with SELECT * FROM employee in a safe mode Hive environment.
Explore creating and managing external tables in Hive. Load data into employee details, view with show tables and show create table, and understand drop table behavior on the warehouse entry.
Explains how Hive partitions optimize queries by data segregation and reducing full-table scans, highlighting static and dynamic partitions and their use cases.
Learn dynamic partitions in Hive to avoid defining each partition statically and to limit table scans. Implement them by creating a temp table, loading data, setting hive.exec.dynamic.partition.mode to nonstrict, then inserting into the partitioned table.
Explore common Hive file formats, including text, RC, and ORC, and learn practical steps to create tables, load csv data, and work with multiple formats.
Explore how Hive joins combine common data from the employee and department tables using inner join, left outer join, right outer join, and full outer join on the department number.
Learn to perform inner, left outer, right outer, and full outer joins in Hive, create and query join tables, and understand how join semantics affect results.
Mastering Hive shows how to create normal and complex views to read only selected columns from existing tables, avoiding storage waste and improving performance.
Explore Hive's data types, comparing primitive and complex types. Learn how arrays, maps, and structs store data, with a practical example of loading an array table from a csv.
Explore Hive complex data types, focusing on maps as key-value structures, and learn to create a map table, load map.csv, and query results.
Explore Hive data types with practical map datatype examples, including creating a map-based table for pay details, loading csv data, and querying by map keys like salary.
Explore Hive's struct data type and load CSV data into a table with a nested address field, then query address.street and other struct elements.
Learn to create and run Hive scripts by writing a sample .ql file, creating databases and tables, loading data, and executing scripts with hive -f and parameterized configs.
Create a Hive user defined function in Java to compute bonuses, package it as bonus.jar, and register it as a temporary function to add a bonus column to the employee table.
Learn the two types of tables in Hive, external and managed, how dropping or converting between them works, and how to create a Hive tutorial database with a managed employee table.
Load data into Hive tables, convert between managed and external tables, and describe table formats. Explore why external tables protect data and remain accessible after deletions, with performance benefits.
Create an HBase employee table with cf1 for personal data and cf2 for working info. Insert rows with name, gender, age, job, and salary, then scan the table using the scan command.
Learn to create an HBase-managed Hive table and use HBase to manage Hive tables, addressing Hive and HBase disadvantages such as no updates, space wasted on null values, and inserts not being possible.
Explore Hive as a data warehouse on Hadoop, enabling OLAP-style data summarization and analysis through HiveQL, with a metastore, various data formats, and MapReduce execution on Hadoop.
Explore Hive data types from primitive boolean, int, bigint, float, double, and string to complex struct, map, and array, with implicit conversions. Understand partitions and buckets for faster queries.
Explore Hive database and table commands, including create, show, describe, alter, and drop databases; manage tables with create, show, describe, and cascade drops, including managed and external tables.
Master Hive table creation and database modification by adding properties, and define columns such as name (string), salary (float), subordinates (array<string>), deductions (map type), and address (struct).
Discover how to manage tables in Hive by showing tables, describing and describe extended, and altering tables, including rename, add or drop columns, drop tables, with managed and external tables.
Learn the difference between managed and external tables in Hive. Dropping a managed table deletes data and metadata, while dropping an external table preserves the data and removes only metadata.
Explore partitioning and bucketing in Hive to optimize queries. Learn to create and manage partitions across internal (managed) and external tables with partition commands in a telecom data case study.
Learn how to create and manage partitions with alter table, rename and drop partitions, and apply bucketing using hashing and clustered by id into 50 buckets for telecom data analysis.
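Bucketing by hashing can be sketched in a few lines of Python. This assumes the commonly described scheme in which the bucket is the non-negative hash modulo the bucket count, and that an integer id hashes to itself; the function name is hypothetical and the exact hash Hive applies may differ by version and column type.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def bucket_for(hash_code: int, num_buckets: int = 50) -> int:
    """Map a hash code to a bucket index in the range [0, num_buckets)."""
    # Masking with 0x7FFFFFFF clears the sign bit so the result is never negative.
    return (hash_code & 0x7FFFFFFF) % num_buckets


def assign_buckets(ids: Iterable[int], num_buckets: int = 50) -> Dict[int, List[int]]:
    """Group ids by bucket, treating each integer id as its own hash (an assumption)."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for record_id in ids:
        buckets[bucket_for(record_id, num_buckets)].append(record_id)
    return dict(buckets)


# Hypothetical customer ids from the telecom data set.
print(assign_buckets([123, 173, 50, 7]))
```

Because equal ids always land in the same bucket, two tables bucketed the same way on the join key can be joined bucket by bucket.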
Construct and modify Hive tables with partitioning, defining columns like unique customer id, sequence number, status, and activation and deactivation dates, and manage the table lifecycle by dropping and recreating tables with partitions.
Create Hive tables for telecom data, focusing on the control services table with call_id, cs_sequence_number, change_of_service, status, and bill_date, using tab-delimited text storage.
Create the contract all and customer all tables with if not exists, define bigint, string, float, and int columns. Partition by ko user last mod and tm code.
Create and configure a Hadoop driver class with a main method, set the job, mapper, reducer, map output, and input/output paths, then build and run the jar.
Run a Hadoop MapReduce job, monitor map and reduce progress, verify the output with hadoop fs, and inspect the part-r output with sample product complaint counts.
Learn to compute location-wise counts of complaints and compare them with product-level totals using Hive, and prepare a grouped list of complaints by location.
Filter counts by user-specified location in Hive, taking a third-argument input, comparing with locations like Delhi or Mumbai, and writing location-specific results to the output folder.
Group complaints by location using MapReduce, count delayed responses, and flag not on time versus on time to produce a location-based delay report, and explore future Hive and Pig implementations.
Analyze bookmarking website data in the social media industry by moving data from an RDBMS to HDFS with Sqoop. Process XML with MapReduce, Pig, and Hive to analyze reviews and location.
Process bookmarking site data from RDBMS to HDFS using Sqoop, convert flat files to XML, then analyze with MapReduce, Pig, and Hive, including custom data types and aggregation.
Show how to move data from RDBMS to HDFS with Sqoop, converting flat files to XML via an ETL tool, using a MySQL DB demonstration for MapReduce, Pig, and Hive.
Learn to import data from an RDBMS to HDFS using Sqoop. Specify a target directory, verify the schema, and run a MapReduce job on the resulting flat file.
Export the MapReduce JAR, deploy it to the Hadoop cluster, and run a mapper-driven XML processing job with the driver, noting there is no reducer, to generate a flat output file.
Transform the XML into a flat, star-delimited file and use a mapper to tally positive, average, and negative reviews for each book based on comma-separated feedback.
Analyze book reviews with a Hadoop MapReduce job that counts positive, negative, and average comments by converting input to lowercase and matching keywords such as awesome and bad.
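The keyword-matching step can be sketched in plain Python, independent of the MapReduce plumbing. The keyword sets contain only the two words named above; treating everything else as "average" and checking positive before negative are assumptions of this sketch.

```python
from collections import Counter
from typing import Iterable

POSITIVE = {"awesome"}  # keyword named in the summary; a real list would be longer
NEGATIVE = {"bad"}      # keyword named in the summary; a real list would be longer


def classify(comment: str) -> str:
    """Lowercase the comment, strip simple punctuation, and match keywords."""
    words = {word.strip(".,!?") for word in comment.lower().split()}
    if words & POSITIVE:
        return "positive"
    if words & NEGATIVE:
        return "negative"
    return "average"  # assumed default when no keyword matches


def count_reviews(comments: Iterable[str]) -> Counter:
    """Play the reducer's role: tally the class emitted for each comment."""
    return Counter(classify(comment) for comment in comments)


# Hypothetical comments for one book.
print(count_reviews(["AWESOME read!", "Bad ending.", "It was okay", "Awesome"]))
```

In the MapReduce version, classify would run inside the mapper with the book as the key, and the counting would happen per key.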
Demonstrates a MapReduce reduce workflow to count bookmarks by location, aggregating values per key and producing a location-keyed output.
Analyze book data by author with a MapReduce workflow, counting how many books each author has written, from an XML to flat-file input and using author as the key.
Analyze book performance by loading an XML file with Piggybank's XMLLoader, extracting book name, author, review, and location, and applying a UDF for positive, negative, and average reviews.
Register and use the Piggybank JAR to process XML with Pig and the XMLLoader. Employ TOKENIZE and FLATTEN to derive book id, category, and locations stored in HDFS.
Explore how to analyze a Pig output in Hive to handle location data, using Hive's complex data types like arrays, maps, and structs.
Explore processing Pig output in Hive by converting a pipe-delimited location array and loading it into a table in the book analysis database, with collection items terminated by pipe.
Load Hive data with append or overwrite, explode array-typed locations into separate rows, and count by location for simple analysis. Convert complex types to simple types for Hadoop analytics.
Explore the production workflow of big data processing with HDFS, MapReduce, and Hive, showing how JSON data moves from the local file system to HDFS, from MapReduce to Hive, and on to reports.
Explore how to transform demographic data into meaningful insights using Hadoop, data science, and data analytics. The session demonstrates use cases for policy decisions, tax analysis, and population trends.
Explore big data basics with IBM's four V framework: volume, velocity, variety, and veracity. Understand data growth toward 40 zettabytes and MapReduce's role in distributed, fault-tolerant processing of XML and JSON.
Explore big data processing of sensor data, addressing data veracity and uncertainty, with MapReduce architecture, Hadoop HDFS and YARN, and the mapper, reducer, and driver workflow for scalable analytics.
Convert a JSON input file to a simple comma-delimited flat file with MapReduce in Hadoop, using a JSON library or a custom parser.
Convert JSON to plain text, then use MapReduce to count males and females in the Philippines by aggregating the gender field from a comma-delimited input.
Generate the count of females over 45 who are widowed or divorced using a MapReduce workflow, filtering by age, education, and marital status, with a mapper and reducer.
Examine a MapReduce workflow to compute total income and tax data, fix data types for decimals, and generate government-ready tax filer insights.
Run a MapReduce job to count people aged 18 and above with non-zero income, using a mapper and reducer to project next year’s income tax from HDFS data.
Develop a MapReduce workflow to count records where age is under 18 and income > 0, enabling government reporting on child labor using age and income data.
Compare MapReduce, Pig, and Hive to reveal abstraction differences: MapReduce is low-level and code-heavy, Pig uses Pig Latin for data flow, while Hive offers SQL-like declarative queries.
Compare Pig, MapReduce, and Hive for data processing and analysis, focusing on execution, performance, joins, and how each supports structured vs unstructured data.
Learn how to process sensor data in JSON format on HDFS with Pig instead of MapReduce, using Pig Latin to analyze data and store results in Hive for reporting.
Create and export a pig function jar, register it in the pig grunt shell, and apply a generic function to analyze gender, age, and marital status on HDFS.
Learn how Pig enables data flow processing for big data use cases, loading JSON data into HDFS with PigStorage and running the Grunt shell to compare Pig, MapReduce, and Hive.
Explore data flow with Pig by loading data, registering a JAR with a user defined function, and extracting values for a given key using FOREACH and GENERATE.
Learn a Pig Latin use case to select females over 45 who are divorced or widowed, using data cleaning with trim, UDFs, and filters before grouping.
Advance Pig use cases by filtering S3 data with trim and equals to for female, divorced, or widow, then dump or store results to hdfs and prepare for udf integration.
Explore Hive as a data analytics tool built on Hadoop, using Hive query language similar to SQL to load data into HDFS, create tables, and generate reports through MapReduce-backed queries.
Explore Hive features, the Hive metastore, and HiveQL for analytics on HDFS; learn internal and external tables and how JSON data is processed with built-in functions.
Explore JSON use cases in Hive by creating a demo database, a tbl_json table with a single json_data string column, and loading input text data with a star delimiter.
Explore loading JSON data into Hive tables, troubleshoot missing files, and use Unix and Hadoop commands to manage JSON data in a Hive workflow.
Use Hive's get_json_object to extract gender, age, and marital status from JSON data and count by gender. Filter females over 45 who are divorced or widowed using JSON extraction.
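The extract, count, and filter steps described above can be mirrored in Python with the standard json module. This is a sketch of the logic rather than of get_json_object itself; the field names gender, age, and marital_status and the sample records are assumptions.

```python
import json
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Record = Tuple[Optional[str], Optional[int], Optional[str]]


def extract(line: str) -> Record:
    """Pull the three fields of interest out of one JSON string column value."""
    record = json.loads(line)
    age = record.get("age")
    return (
        record.get("gender"),
        int(age) if age is not None else None,  # age may arrive as a string
        record.get("marital_status"),
    )


def count_by_gender(lines: Iterable[str]) -> Counter:
    """Equivalent of grouping by the extracted gender and counting."""
    return Counter(extract(line)[0] for line in lines)


def females_over_45_divorced_or_widowed(lines: Iterable[str]) -> List[Record]:
    """Keep rows matching the filter described in the lecture summary."""
    matches = []
    for line in lines:
        gender, age, status = extract(line)
        if gender == "female" and age is not None and age > 45 and status in {"divorced", "widowed"}:
            matches.append((gender, age, status))
    return matches


# Hypothetical rows of the json_data column.
rows = [
    '{"gender": "female", "age": "52", "marital_status": "widowed"}',
    '{"gender": "male", "age": 60, "marital_status": "divorced"}',
    '{"gender": "female", "age": 30, "marital_status": "divorced"}',
]
print(count_by_gender(rows))
print(females_over_45_divorced_or_widowed(rows))
```

The explicit int conversion stands in for the cast that a string-returning extraction function would need before a numeric comparison.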
Explore end-to-end big data processing of JSON data, comparing MapReduce, Pig, and Hive, and deriving meaningful insights for government decision making from raw data.
Students will gain a comprehensive understanding of Hive, from the fundamentals to advanced topics. They will learn how to create and manage Hive databases, perform data loading and manipulation, execute complex queries, and use Hive's powerful features for data partitioning, bucketing, and indexing. Additionally, students will explore practical case studies and projects, applying their knowledge to real-world scenarios such as telecom industry analysis, customer complaint analysis, social media analysis, and sensor data analysis.
Section 1: Hive - Beginners
In this section, students will be introduced to Hive, an essential tool for managing and querying large datasets stored in Hadoop. They will learn the basics of Hive, including how to create databases, load data, and manipulate tables. Topics such as external tables, the Hive Metastore, and partitions will be covered, along with practical examples of creating partition tables, using dynamic partitions, and performing Hive joins. Students will also explore the concept of Hive UDFs (User Defined Functions) and how to implement them.
Section 2: Hive - Advanced
Building on the foundational knowledge, this section delves into advanced Hive concepts. Students will learn about internal and external tables, inserting data, and various Hive functions. The section covers advanced partitioning techniques, bucketing, table sampling, and indexing. Practical demonstrations include creating views, using Hive variables, and understanding Hive architecture. Students will also explore Hive's parallelism capabilities, table properties, and how to manage and compress files in Hive.
Section 3: Project 1 - HBase Managed Hive Tables
This section focuses on integrating Hive with HBase, a distributed database. Students will learn how to create and manage Hive tables, both managed and external, and understand the nuances of static and dynamic partitions. They will gain hands-on experience in creating joins, views, and indexes, and explore complex data types in Hive. The section culminates in practical implementation projects involving Hive and HBase, showcasing real-world applications and use cases.
Section 4: Project 2 - Case Study on Telecom Industry using Hive
Students will apply their Hive knowledge to a case study in the telecom industry. This project involves working with simple and complex data types, creating and managing tables, and using partitions and bucketing to organize data. Students will learn how to perform various data operations, understand table control services, and create contract tables. This hands-on project provides valuable insights into how Hive can be used for industry-specific data analysis.
Section 5: Project 3 - Customer Complaints Analysis using Hive - MapReduce
In this section, students will analyze customer complaints data using Hive and MapReduce. They will learn how to create driver files, process data from specific locations, and group complaints by location. This project highlights the power of Hive and MapReduce for handling large datasets and provides practical experience in data processing and analysis.
Section 6: Project 4 - Social Media Analysis using Hive/Pig/MapReduce/Sqoop
This section explores the integration of Hive with other big data tools like Pig, MapReduce, and Sqoop for social media analysis. Students will learn how to process and analyze social media data, perform data transfers from an RDBMS to HDFS, and execute MapReduce programs. The project includes practical exercises in processing XML files, analyzing book reviews and performance, and working with complex datasets using Hive and Pig.
Section 7: Project 5 - Sensor Data Analysis using Hive/Pig
The final section focuses on sensor data analysis using Hive and Pig. Students will learn the basics of big data and MapReduce, and how to convert JSON files into text format. They will perform various data analysis tasks, including calculating ratios, generating reports, and processing data using Pig functions. This project provides comprehensive hands-on experience in processing and analyzing sensor data, showcasing the practical applications of Hive and Pig in real-world scenarios.
Conclusion
This course provides a complete journey from understanding the basics of Hive to mastering advanced big data analysis techniques. Through a combination of theoretical knowledge and practical projects, students will gain the skills needed to manage, analyze, and derive insights from large datasets using Hive. Whether you're an aspiring data engineer, a data analyst, or a tech entrepreneur, this course will equip you with the tools and knowledge to excel in the world of big data.