This study guide was created from lecture videos and is meant to help you gain an understanding of how Hadoop and big data processing work.
Hadoop is open source and has been around for over 10 years. Current innovation with Hadoop is happening in the cloud, and Hadoop is being applied to IoT device data and genomics data.
Hadoop is made up of three components: Storage (HDFS), Compute (MapReduce, Spark) and Management (YARN, Mesos).
Apache Spark performs computation in the memory of the worker nodes, which allows for quicker processing of Hadoop jobs.
HDFS is the distributed file system included in the Hadoop core distribution and is commonly used for batch and extract, transform, load (ETL) jobs.
Cloud-based object stores such as Amazon S3 or Google Cloud Storage are commonly used as data lakes. These options are typically cheaper than running HDFS locally or in the cloud.
Cloudera and Databricks are the current leading commercial distributions of Hadoop.
Public cloud distributions are Google Cloud Dataproc, AWS EMR and Microsoft Azure HDInsight.
As you add Hadoop libraries into your production environment, ensure you are matching the proper versions so your code runs.
Hadoop on GCP
Sign up for GCP -> Log in to the GCP Console -> In the left side menu, Big Data: Dataproc -> Create cluster to create a managed Hadoop cluster -> name the cluster and fill out additional information -> click Create -> you can then create a Job; the types of Dataproc jobs are Hadoop, Spark, PySpark, Hive, SparkSQL and Pig -> Submit Job
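The same cluster-and-job flow can also be scripted with the gcloud CLI; the cluster name, region, and job file below are placeholder assumptions, not values from the lecture:

```shell
# Hypothetical names and region; requires the gcloud CLI and a GCP project.
gcloud dataproc clusters create my-cluster --region=us-central1
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster --region=us-central1
# Delete the cluster when the batch job is done to stop billing.
gcloud dataproc clusters delete my-cluster --region=us-central1
```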
Preemptible worker nodes are available in Google Cloud Dataproc instances and are used to speed up batch jobs, as these worker nodes are much cheaper.
Apache Spark was created by the team that founded Databricks, which offers a commercial Spark distribution. Databricks comes with notebook environments (similar to Jupyter notebooks) for data scientists.
Setting up an IDE for Apache Spark in Databricks
Download your preferred IDE for Python, Scala or Java (the programming languages used to program against the Spark API) -> go into Databricks and sign up for Community Edition (Google Chrome works best for Databricks)
Adding Hadoop Libraries to databricks (example library is Apache Avro)
Add Hadoop libraries to Databricks by going into the Databricks workspace -> Shared -> Create: Library -> Source: Maven Coordinate -> Coordinate: avro -> click Search Spark Packages -> switch to Maven Central once in Search Packages -> type in avro-tools 1.8.1 -> Create Library
Setting up clusters in databricks
Databricks UI -> Clusters -> Create Cluster -> select Spark version and go with default settings -> Create Cluster
Loading data into tables in databricks
Databricks UI -> Tables -> Create Table -> select Data Source such as a csv or from a Cloud Bucket -> name table -> Create Table
Hadoop Batch Processing
Use MapReduce for Batch Processing.
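The map, shuffle, reduce flow behind MapReduce batch jobs can be sketched in plain Python; this is a conceptual wordcount only, not Hadoop API code:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts["the"] == 3, counts["fox"] == 2
```

In a real Hadoop job, the map and reduce phases run on many worker nodes and the shuffle moves data between them over the network.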
YARN is used for scaling when moving from development to production. YARN is scalable and fault tolerant, and it handles memory scheduling in a way that is familiar from earlier Hadoop schedulers.
Apache Mesos is a newer alternative to YARN, though it was not designed specifically for Hadoop workloads. Apache Mesos was designed as a cluster operating system framework and handles both job scheduling and container group management.
YARN vs Spark Standalone
Hadoop Use Cases:
Stream Data Processing Use Cases
Streaming ETL is used to take data coming in, process it and respond to it
Data enrichment is taking data streaming in and combining it with other data sources
Trigger event detection is used for services such as credit card fraud detection
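A minimal sketch of trigger event detection on a transaction stream, assuming a made-up rule (flag a card when too many charges arrive within a short window); the names and thresholds are illustrative only:

```python
from collections import deque, defaultdict

WINDOW_SECONDS = 60          # assumed time window
MAX_CHARGES_IN_WINDOW = 3    # assumed threshold

recent = defaultdict(deque)  # card_id -> timestamps of recent charges

def process_transaction(card_id, timestamp):
    """Return True if this transaction should trigger an alert."""
    window = recent[card_id]
    window.append(timestamp)
    # Drop charges that fell out of the time window.
    while window and timestamp - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_CHARGES_IN_WINDOW

# Four charges on one card within 30 seconds: the last one triggers.
alerts = [process_transaction("card-1", t) for t in (0, 10, 20, 30)]
```

A production system would run this logic inside a stream processor such as Spark Streaming or Storm rather than a single Python process.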
Machine Learning Use Cases
Sentiment Analysis, which is most easily explained through the Twitter sentiment example.
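A toy lexicon-based scorer in the spirit of the Twitter example; real systems use trained models, and the word lists here are illustrative assumptions:

```python
# Assumed tiny sentiment lexicons for demonstration only.
POSITIVE = {"love", "great", "happy", "good"}
NEGATIVE = {"hate", "bad", "awful", "sad"}

def sentiment(tweet):
    """Score a tweet by counting positive vs. negative words."""
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

At scale, the same per-tweet function would be applied as a Spark transformation across a stream of tweets.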
Fog Computing Use Cases
IoT device messaging is used to get information from devices and the edge layer so the system can respond quickly
Type of Big Data Streaming
Apache Spark is an open source cluster computing framework that uses in-memory objects for distributed computing. Apache Spark can be over 100 times faster than MapReduce. The objects that Spark uses are called Resilient Distributed Datasets, or RDDs. Spark has many utility libraries.
Apache Spark Libraries
Learn more about Spark by visiting Apache.
Spark Data Interfaces
Resilient Distributed Datasets (RDDs) are a low-level interface to a sequence of data objects in memory.
DataFrames are a distributed collection of row types and are similar to the data frames you find in Python (pandas).
Datasets are new for Spark 2.0 and are distributed collections that combine DataFrames and RDD functionality.
Spark Programming Languages
Python is the easiest, Scala is functional and scalable, Java is for the enterprise and R is geared toward statistics and machine learning.
In Spark 1.x, the entry-point objects are SparkContext (sc) and SQLContext (sqlContext).
In Spark 2.0, these are unified into a single SparkSession object called spark.
Connect Spark with your created cluster in Databricks- Databricks UI -> Create new cluster -> Create New Notebook -> in the notebook, type sc, spark or sqlContext and click execute
Visit Databricks to setup your account.
Workspace is the container for stored information.
Tables will provide the tables that have been created.
Clusters will show the clusters created.
Within Clusters: Spark UI, you can see the Jobs, Storage, Environment and Executors tabs.
Executors will show the RDD blocks, which is where your data is stored for Spark.
Working with Spark notebooks in Databricks
Databricks UI -> Workspace -> Users -> select notebook
To add markdown language, use %md This is a markdown note example.
You can add notebooks into dashboards. You can see revision history and link to github.
Import and export Spark notebooks in databricks
File types to export- DBC Archive is a Databricks archive; Source File is the file exported in the language used (.py for Python); iPython notebook; and HTML, which is a static web page. Exporting as a Source File and importing into a code editor is a good option.
To import a file, in the databricks UI select Workspace: Import and drag the file to import.
Example wordcount in databricks Spark using Scala
Databricks Spark Data Sources
Transformations and Actions
Spark Transformation operations do not show results without a Spark Action operation. Examples of Spark Transformations are select, distinct, groupBy, sum, orderBy, filter and limit.
Examples of Spark Action operations are show, count, collect and save. Take is a Spark action that selects a certain number of elements from the beginning of a dataframe.
Pipelining in Spark means that chained Transformations are not executed until an Action is called, which lets Spark combine and optimize them as one job.
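The transformation/action split is a form of lazy evaluation. As a rough analogy (plain Python, not the Spark API), generators behave the same way: chained "transformations" do no work until a terminal "action" consumes them:

```python
data = range(10)

# "Transformations": generator expressions are lazy, nothing runs yet.
filtered = (x for x in data if x % 2 == 0)   # analogous to filter
mapped = (x * x for x in filtered)           # analogous to select/map

# "Action": consuming the pipeline finally executes both steps in one
# pass, much like Spark pipelining transformations behind an action.
result = sum(mapped)   # 0 + 4 + 16 + 36 + 64 = 120
```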
A DAG will show you a visualization of the detailed process of what is occurring, the overhead, how long it is taking, how many executors there are, etc. View the specific Spark Job to see the DAG.
Using Spark SQL in databricks- Workspace -> Import Notebook -> read the csv data into Spark using spark.read -> display data with display -> use createOrReplaceTempView to register the imported csv data as a Spark SQL table -> you can then use SQL with %sql
The easiest way to use SparkR is to make a local R data frame and turn it into a SparkR DataFrame. Use the createDataFrame method to do this.
Spark ML with SparkR in databricks
Workspace -> Import Notebook -> Create a cluster -> load (csv) data with spark.read -> prepare data for model by cleaning it with appropriate Spark commands
Example of Pyspark Spark ML Model (Linear Regression) in databricks
Spark with GraphX
GraphX is a library for working with graph data and is an API that runs on top of Spark. It is used for graph-parallel computation.
Spark with ADAM is used for genomics.
Streaming handles data coming in either through batch ingesting or stream ingesting. Amazon Kinesis and Google Cloud Pub/Sub are examples of proprietary solutions.
Once data is ingested, processing can be done in batches with MapReduce or as a real-time stream with Apache Spark or Apache Storm.
Using PySpark for Batch Processing in databricks- Databricks UI -> create a notebook -> verify you have spark by typing in spark in the notebook -> load data -> *example code
Using PySpark for Stream Processing in databricks- Databricks UI -> create a notebook -> verify you have spark by typing in spark in the notebook -> load data -> *example code
Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub and Microsoft Azure Service Bus.
Streaming with Google Cloud Pub/Sub
Go to Google Cloud to setup an account and get started.
Login to GCP Console -> Create New Project -> Launch Cloud Shell -> Create a Topic, which is a named resource to which you send messages, in the gcloud shell -> Create a subscription, which is a named resource representing the stream of messages from a topic to subscribers (message retention is configured on the subscription)
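The topic and subscription steps can be done from Cloud Shell with the gcloud CLI; the topic and subscription names below are placeholders:

```shell
# Hypothetical names; run in Cloud Shell after selecting a project.
gcloud pubsub topics create my-topic
gcloud pubsub subscriptions create my-sub --topic=my-topic
# Publish a test message and pull it back from the subscription.
gcloud pubsub topics publish my-topic --message="hello"
gcloud pubsub subscriptions pull my-sub --auto-ack
```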
Apache Kafka is used for massive streaming ingest of data. It runs as a cluster of servers, and the server cluster stores the record streams in topics. Each record has a key, a value and a timestamp.
The four APIs used with Kafka are the Producer API, used to publish to topics; the Consumer API, used to subscribe to topics; the Streams API, used to control the input and output of streams; and the Connector API, used to connect to your systems such as a database.
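Kafka's data model (keyed, timestamped records appended to topics and read in order) can be modeled in a few lines of plain Python; this is a conceptual sketch, not the real Kafka client API:

```python
import time
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a Kafka broker: append-only record
    lists per topic, read back in order from an offset."""

    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, key, value):
        # Each record carries a key, a value and a timestamp.
        self.topics[topic].append((key, value, time.time()))

    def consume(self, topic, offset=0):
        # Consumers read records in order starting from an offset.
        return self.topics[topic][offset:]

broker = ToyBroker()
broker.produce("payments", key="card-1", value=19.99)
broker.produce("payments", key="card-2", value=5.00)
records = broker.consume("payments")
```

The real broker adds partitioning, replication and durable storage on top of this basic append-and-read-in-order model.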
Kafka features include Pub/Sub, in-order messages, streaming and is fault-tolerant.
An alternative streaming processor to Apache Spark. Apache Heron is the successor to Apache Storm. Storm is a real-time stream processor with a record-at-a-time ingest pipeline.
Apache Storm core concepts are Topologies: real-time application logic, Streams: unbounded sequences of tuples, Spouts: sources of streams in a topology, Bolts: which perform processing on stream data and Stream grouping: how a stream is partitioned across a topology.
Apache Storm is not micro batches but true streaming.
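The spout/bolt flow can be sketched in plain Python generators, with each tuple flowing through the pipeline one at a time rather than in micro batches; the names here are illustrative, not the Storm API:

```python
def sentence_spout():
    """Spout: the source of the stream, emitting one tuple at a time."""
    for sentence in ["storm processes streams", "one tuple at a time"]:
        yield sentence

def split_bolt(stream):
    """Bolt: split each sentence tuple into word tuples."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Bolt: keep a running count per word as tuples arrive."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
```

In real Storm, each spout and bolt runs as distributed tasks and stream groupings decide which bolt instance receives each tuple.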
Apache Kafka vs. Apache Storm
Apache Kafka is used to bring in the stream of data while Apache Storm is used for processing that stream.
You can use Apache Kafka instead of Google Cloud Pub/Sub if you need messages in order.
You can use Apache Beam, the open source programming model that Google Cloud Dataflow runs, if you need portable, highly scalable pipelines.