
Create and manipulate Python lists, add items with extend and append, sort values, index from zero, slice lists, and build new lists from selected elements.
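A minimal sketch of the list operations named above. The list name and the values are made up for illustration.

```python
# Create a list, then add items to it
prices = [30, 10, 20]
prices.append(50)           # append adds a single item to the end
prices.extend([40, 60])     # extend adds every item from another list

prices.sort()               # sorts the list in place, ascending
print(prices)               # [10, 20, 30, 40, 50, 60]

first = prices[0]           # indexing starts at zero
middle = prices[1:4]        # slice: items at indexes 1, 2 and 3
large = [p for p in prices if p >= 40]   # new list from selected elements
print(first, middle, large)
```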
Learn to use for and while loops in Python, iterate lists and ranges, apply if conditions, print results, and define and call functions that return values or append to lists.
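The ideas in this lesson can be sketched in a few lines. The function names and numbers below are hypothetical, chosen only to show a for loop over a range, an if condition, a function that returns a value, a function that appends to a list, and a while loop.

```python
def square(n):
    """Return the square of n."""
    return n * n


def collect_even_squares(numbers, results):
    """Append the square of every even number to the results list."""
    for n in numbers:
        if n % 2 == 0:
            results.append(square(n))


results = []
collect_even_squares(range(1, 7), results)   # range(1, 7) yields 1..6
print(results)                               # [4, 16, 36]

# A while loop repeats until its condition becomes false
count = 3
while count > 0:
    print(count)
    count -= 1
```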
Explore your dataset in PySpark by viewing top rows with head, retrieving the first and last records with first and tail, and inspecting data types with dtypes. Then summarize data with describe and list columns with df.columns, setting up data cleaning in the next tutorial.
Clean and transform a PySpark data frame by replacing nulls with unknown, creating a clean country column, and dropping the original column; set proper data types and remove dollar signs.
Use the filter function to remove rows with null item type, compare data before and after, and plan to replace nulls or convert the string column to float.
Upload your CSV data file to the notebook before analysis, and run all to initialize the session; re-upload and run all after any disconnection to avoid errors.
Apache Spark is one of the most powerful tools used in big data analysis because:
· It can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
· It can run real-time and near-real-time data analysis.
· It can handle large-scale data.
· It can be run using simple code in the Python programming language.
You can use simple commands in Python and SQL to analyze big data that cannot be imported into relational database engines, or is difficult to import. This combination of Spark, Python, and SQL creates a powerful work environment for analyzing big data more easily and quickly.
In this course, you will learn what Spark is, how it runs, and how data is stored in the Spark work environment. You will learn how to configure a Python programming environment to run Spark code, and how to perform data analysis on real big data. You will also learn to import big data files into Python, to clean and transform data for analysis, to conduct business analysis using several Spark functions, and to create SQL queries inside PySpark. Finally, you will learn how to interpret the results from a business perspective.