This is a guide created from lecture videos to help you gain an understanding of how to become a data scientist.
Data scientists are one part of a larger data science team.
What is a data scientist? You’re a data scientist if you work with data in a scientific way. As the need for data scientists increases, a standardized skill set will emerge for companies to know who they are hiring.
Ensure you focus on the science and not the data. Data scientists use an empirical approach.
The scientific method will make you a data scientist, not the knowledge of tools. Common data science tools fall into three categories: storing, scrubbing, and analyzing.
Storing tools: spreadsheets and databases such as Hadoop, Cassandra, and PostgreSQL
Scrubbing tools: make data easier to work with; text editors, scripting, programming
Analyzing tools: statistical packages such as R, SPSS, and Python used to analyze the data; also used for visualization
Big Data- datasets that are so large they will not fit into a normal database management system.
Hadoop uses a distributed file system to store data on different servers. The group of servers is called a Hadoop cluster. The cluster splits tasks so you can run applications. Two of the most common processing engines used with Hadoop are MapReduce, which works in batches, and Apache Spark, which works in real time.
Once you collect the data, you scrub it to clean it, usually with a script. Most of your time is spent scrubbing data. Then you use R or Python to analyze and present the data.
Uncover insights and create knowledge
Data science does not focus on narrowly defined objectives; it is exploratory and is about gaining business knowledge.
Example questions that a data scientist would ask:
What do we know about our customer?
How can we deliver a better product?
Why are we better than our competitors?
What would happen if we left the market?
Data scientists are skeptical and exploratory!
Questions are often seen as a barrier to moving forward.
Visit kdnuggets.com for information on data science basics.
Modern databases started from IBM IMS, which was essentially a computer file with columns and rows. It was difficult to use because it resembled one large spreadsheet. SQL and relational databases, which divide data into tables, were then created. A map that shows the relationships between tables is called a schema. The problem with relational databases is that you need to know the framework and design up front, and it takes effort to change them once you start collecting data. A virtual table is called a view.
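For example, a view over a relational table might look like this in SQL (a minimal sketch; the customers table and its columns are made up for illustration):

    -- virtual table exposing only active customers from a hypothetical customers table
    CREATE VIEW active_customers AS
    SELECT customer_id, customer_name, country
    FROM customers
    WHERE is_active = 1;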
Get data into warehouses using ETL
An Enterprise Data Warehouse is a database focused on analyzing data. Traditional databases are used for OLTP (online transaction processing), which happens in real time, while a data warehouse uses OLAP (online analytical processing) on historical data. ETL refers to operations that extract data from one data source, transform it, and load it into another location. Hadoop is looking to replace data warehouses.
Relational databases rely on schemas, require planning, and are not flexible. The harder your database works, the slower the website will be. NoSQL does not rely on the relational model; everything is in one transaction, with no separate tables or relationships. A NoSQL database needs to be non-relational, schema-less, cluster-friendly, and open source. You will not have to normalize with a NoSQL database. An aggregate is where all the data is held in a NoSQL database; aggregates are records saved in a transaction, and they are easier to synchronize. HBase, an open-source NoSQL database, is built on top of Hadoop.
Big Data Problems
If your data has too much volume, variety, velocity, and veracity, then you have a big data problem. You need petabytes of data for big data. Autonomous cars are an example of a big data problem.
3 Types of Data
Structured data is simple, follows a specific format, and is cheap and flexible. Each row has to follow the same structure, which is called a data model. A data model describes the structure of the individual fields/rows. Relational databases are best used for structured data.
Unstructured data is data like pictures, video, or audio, as it has no defined structure. Use NoSQL for unstructured data.
Semi-structured data has structure, but it depends on the source. An example is email. To work with semi-structured data, you would use XML or JSON to exchange it over web services. Semi-structured data is structured data with flexible field names.
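As a rough illustration of working with semi-structured data in the tools used later in this guide, Hive can pull named fields out of a JSON string with its get_json_object function; the raw_events table and the JSON layout below are assumptions for the example:

    -- payload holds a JSON string such as {"user":"a@b.com","subject":"hello"}
    SELECT get_json_object(payload, '$.user')    AS user_email,
           get_json_object(payload, '$.subject') AS subject
    FROM raw_events;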
Big Data Garbage
People argue to keep all data as it is hard to know what you may need later, it is cheap to store and it is time consuming to delete.
People also argue that it is difficult to find interesting results if the data is garbage, which is known as data noise.
Applying Statistical Analysis
Statistics are tools to tell a story. Start out with descriptive statistics to describe what is going on with the data.
Understand probability, which is a key part in predictive analytics.
Use correlation, which is the degree to which two things are related. It is measured between -1 and 1. If it is 1, the two are strongly correlated, and if it is 0, there is no relationship. There is positive and negative correlation; the closer to 1 or -1, the stronger the relationship.
Correlation doesn’t imply causation
Do not create false relationships, which are known as spurious correlations.
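To see correlation in practice with the tools in this guide, Hive has a built-in corr() aggregate that returns the Pearson coefficient; the monthly_sales table and its columns are hypothetical:

    -- correlation between two numeric columns; the result falls between -1 and 1
    SELECT corr(ad_spend, sales_total) AS spend_sales_corr
    FROM monthly_sales;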
Combine techniques for predictive analytics
Predictive analytics uses insights from data analysis to predict the future. It is a type of data science that makes the results actionable, and it is closely related to data science and big data.
The more data that you have, the more powerful and accurate your predictive analytics will be.
Invest more in your data analytics teams rather than in the hardware/software/tools. A good data science team will be messy. Ensure your team asks great questions.
Data science is the process of making data useful. A strong foundation in data engineering is essential when learning how to become a data scientist.
Overview of Data Science System
Data providers- real-time events, which are streamed, and data at rest, which is batched -> Data Hub- the enterprise data repository, where all of the company's data ends up -> the data then goes to either the data scientist, the data analyst, or the applications that use it through an API.
The data hub is made up of a back office and a front office. The back office consists of ‘staging’, which is a copy of the data kept for later analysis -> then ‘cleansing’, which is where you clean the data -> the last phase is ‘conforming’, which involves lookups to get ID fields and sets the data up for the analyst to use -> the data is then added to the ‘front office’ database, which is where schemas are built.
Star Schema Components
A star schema is also known as a data mart. It is a data mart design pattern with fact and dimension tables, and together these tables resemble a star.
Characteristics of Facts for Star Schemas- facts are the subject of analysis, which can be sales, website visitors, or online orders; facts are typically numerical, such as orders; facts are stored at the lowest level of detail, or granularity, known as the atomic grain (for example, the browser someone is using or at what time); there are multiple types, such as event logs; and facts can be additive or semi-additive.
Characteristics of Dimensions for Star Schemas- dimensions are the context of analysis, an example being ‘sales by country’; dimensions are most often textual; dimensions are slowly changing; they contain mostly attributes of the data; and they support slice and dice, which means using dimension values to filter the data.
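A minimal HiveQL sketch of the pattern, using a client dimension and a sales fact like the ones built later in this guide (the exact column names are assumptions):

    -- dimension table: textual context of analysis with slowly changing attributes
    CREATE TABLE dim_client (
      ClientID   STRING,
      ClientName STRING,
      City       STRING,
      State      STRING,
      ZipCode    STRING
    );

    -- fact table: numeric subject of analysis stored at the atomic grain
    CREATE TABLE fact_sales (
      OrderID     STRING,
      OrderDate   DATE,
      ClientID    STRING,
      ProductKey  STRING,
      Quantity    INT,
      SalesAmount DOUBLE
    );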
Data Engineering Responsibilities
Data Ops- tasks such as the infrastructure and the availability and performance of the system.
Data Prep- Staging, Cleansing, Conforming, Delivering
Data Interfaces- API, Query Tool Compatibility and how data is being accessed.
Visit Medium to get a better understanding of big data engineering basics.
Stage- Data Profiles, Row Counts, Quality Checks -> Cleanse- Error Logging, Notifications, Rollback -> Conform- Go/No-Go Check, SMS Notifications, Quality Checks -> Deliver- Automated Rollback. It is crucial that you understand this process when learning the data pipeline on your journey toward becoming a data scientist.
Strategically add these checks along the way, as needed, based on their importance for the data engineering job.
Environment Setup Example
Install the VirtualBox VM manager software -> download the sandbox image from Cloudera and import it -> go into the VirtualBox settings, under Shared Folders, and add the downloaded files you have from Cloudera. Ensure the share is set to auto-mount and made permanent. You should then be able to access the files in the virtual machine. -> you then need to give permission to access these files within Cloudera; within the terminal, run sudo gedit /etc/group to add permissions. -> launch the Hue UI within Cloudera -> go into the Hive query editor.
Loading and Profiling Data- create the staging environment -> upload the sales data -> do a simple row count audit. Go into the command line -> find the folder for the data in the data directory and unzip it -> once you find the CSV files you want to import, load the data into Hadoop using ‘hadoop fs -put * /’ in the terminal -> to create the staging environment, create a database using a SQL command within Hive -> create tables in the Hive staging database -> create an audit database for the auditing piece, then an audit log table (used when you transfer data from the staging environment to the final destination)
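A sketch of what those HiveQL steps might look like; the database, table, column names, and file location are assumptions rather than the exact ones from the course:

    -- staging database with an external table over the CSV files loaded via hadoop fs -put
    CREATE DATABASE IF NOT EXISTS staging;

    CREATE EXTERNAL TABLE staging.sales_raw (
      OrderID     STRING,
      OrderDate   STRING,
      ClientID    STRING,
      ProductKey  STRING,
      Quantity    STRING,
      SalesAmount STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/sales';

    -- audit database and log table used when moving data out of staging
    CREATE DATABASE IF NOT EXISTS audit_db;

    CREATE TABLE audit_db.job_log (
      JobName  STRING,
      StepName STRING,
      RowCnt   BIGINT,
      RunTime  STRING
    );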
Data Quality Test
To be used once you have data inside a staging environment.
First identify blank values -> find missing locations -> check for invalid dates. Use SQL SELECT statements, recording the results in your audit table, to perform each data quality check.
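Hedged examples of those checks, reusing the hypothetical staging table from the sketch above:

    -- count blank values in a key field
    SELECT COUNT(*) AS blank_order_ids
    FROM staging.sales_raw
    WHERE OrderID = '';

    -- the same pattern covers missing locations, e.g. a blank City or State column if the feed has one

    -- count rows whose OrderDate does not parse as a valid date
    SELECT COUNT(*) AS invalid_dates
    FROM staging.sales_raw
    WHERE to_date(OrderDate) IS NULL;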
Use once data is loaded into staging and you have performed data quality testing.
Adding Data Types- read in data from the staging table for the fact table -> convert numeric values to numeric types such as int or double -> convert date values into date types.
Create the table in the Hue interface with HiveQL -> pull data from the staging table and use the SQL AS statement -> check the table to ensure the data is properly set up -> do an audit check and a row count to confirm how many rows from the staging table made it.
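One way this could look in HiveQL, continuing with the assumed names from the staging sketch:

    -- build a typed cleansed table with CREATE TABLE ... AS SELECT and CAST
    CREATE TABLE cleansed_sales AS
    SELECT OrderID,
           to_date(OrderDate)          AS OrderDate,
           ClientID,
           ProductKey,
           CAST(Quantity AS INT)       AS Quantity,
           CAST(SalesAmount AS DOUBLE) AS SalesAmount
    FROM staging.sales_raw;

    -- audit how many rows made it across
    INSERT INTO TABLE audit_db.job_log
    SELECT 'sales_load', 'add_types', COUNT(*), CAST(current_timestamp() AS STRING)
    FROM cleansed_sales;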
Handling Missing Values- filter rows with missing key fields in your table -> replace invalid dates with 9999-12-31 -> replace missing (product) keys with #Unknown. Filter rows with a missing ID (orderID, for example) using a SQL statement that includes where orderID <> ‘ ’; -> replace the dates with a SQL statement including case when OrderDate = ‘ ‘ then ‘9999-12-31’ else OrderDate end as OrderDate -> to replace blank product keys, use a SQL CASE statement: case when ProductKey = ‘ ‘ then ‘#Unknown’ else ProductKey end as ProductKey
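Put together, those CASE statements might look roughly like this against the staging table (a sketch with the same assumed names as before):

    SELECT OrderID,
           CASE WHEN OrderDate = ''  THEN '9999-12-31' ELSE OrderDate  END AS OrderDate,
           CASE WHEN ProductKey = '' THEN '#Unknown'   ELSE ProductKey END AS ProductKey,
           ClientID,
           Quantity,
           SalesAmount
    FROM staging.sales_raw
    WHERE OrderID <> '';   -- drop rows missing the key field entirely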
Verifying Addresses- create a zipcode lookup table, look for invalid city/state combos, replace missing city/state based on the zip. Go into the terminal and use a ‘hadoop fs -mkdir’ command to create a directory for the zipcode data -> find where the zipcode CSV file is within your files -> add SQL code to create the zipcode table
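A rough HiveQL version of the zipcode lookup; the file location, the zipcode_lookup table, and the client_staging table with its address columns are all assumptions:

    -- external table over the uploaded zipcode CSV
    CREATE EXTERNAL TABLE zipcode_lookup (
      ZipCode STRING,
      City    STRING,
      State   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/zipcodes';

    -- fill in a missing city/state from the lookup based on the zip
    SELECT c.ClientID,
           COALESCE(NULLIF(c.City, ''),  z.City)  AS City,
           COALESCE(NULLIF(c.State, ''), z.State) AS State
    FROM client_staging c
    LEFT JOIN zipcode_lookup z ON c.ZipCode = z.ZipCode;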
Performing Master Data Lookups- create the client staging table, create the cleansed clients table, load the cleansed clients table, update ClientID in Cleansed_Sales. Copy the CSV into the Hadoop environment from the terminal -> create the cleansed clients table in the Hadoop Hue interface.
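And a sketch of the master data lookup. Since older Hive versions cannot UPDATE rows in place, this version rebuilds the result into a new conformed table instead; every name here is an assumption except Cleansed_Sales and ClientID, which the guide mentions:

    -- cleansed clients table loaded from the client staging table
    CREATE TABLE cleansed_clients AS
    SELECT ClientID, ClientName, City, State, ZipCode
    FROM client_staging;

    -- look up ClientID against the client master list; unmatched rows get a placeholder key
    CREATE TABLE conformed_sales AS
    SELECT s.OrderID, s.OrderDate, s.ProductKey, s.Quantity, s.SalesAmount,
           COALESCE(c.ClientID, '#Unknown') AS ClientID
    FROM Cleansed_Sales s
    LEFT JOIN cleansed_clients c ON s.ClientID = c.ClientID;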
Inferred Members- when new dimension values appear in the incoming data before they exist in the lookup table. Use the ‘inferred member support’ concept to address this issue.
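A hedged sketch of inferred member support: insert a placeholder row into the dimension for any key that shows up in the incoming data before it exists there (table and column names assumed, reusing the earlier dim_client sketch):

    -- add any ClientID seen in the sales feed but missing from the client dimension
    INSERT INTO TABLE dim_client
    SELECT DISTINCT s.ClientID,
           'Inferred member',
           CAST(NULL AS STRING),
           CAST(NULL AS STRING),
           CAST(NULL AS STRING)
    FROM conformed_sales s
    LEFT JOIN dim_client d ON s.ClientID = d.ClientID
    WHERE d.ClientID IS NULL;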
Visit O’Reilly to get best practices when building a data pipeline.
Delivering Analytical Data Sets
Loading the star schema- insert/update new dimensions, insert new facts.
Loading Dimension Tables- create the client dimension table, load the client table. First, create the client dimension table in the Hadoop Hue interface using a SQL CREATE TABLE statement, then load the data from your cleansed table using a SQL INSERT OVERWRITE TABLE statement.
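Continuing the earlier sketches, the dimension load might look like this in HiveQL (names assumed):

    -- create the client dimension if it does not exist yet
    CREATE TABLE IF NOT EXISTS dim_client (
      ClientID   STRING,
      ClientName STRING,
      City       STRING,
      State      STRING,
      ZipCode    STRING
    );

    -- load it from the cleansed clients table
    INSERT OVERWRITE TABLE dim_client
    SELECT ClientID, ClientName, City, State, ZipCode
    FROM cleansed_clients;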
Loading fact tables- create the sales fact table, insert data from the cleansed environment, perform a row count. Create the sales fact table in the Hadoop Hue interface using SQL -> insert data from the cleansed environment into production using an INSERT OVERWRITE SQL statement -> do a row count with an INSERT INTO and SELECT SQL statement.
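Roughly, with the same assumed names as the earlier sketches:

    -- create the sales fact table if needed
    CREATE TABLE IF NOT EXISTS fact_sales (
      OrderID     STRING,
      OrderDate   DATE,
      ClientID    STRING,
      ProductKey  STRING,
      Quantity    INT,
      SalesAmount DOUBLE
    );

    -- insert data from the cleansed/conformed environment into production
    INSERT OVERWRITE TABLE fact_sales
    SELECT OrderID, OrderDate, ClientID, ProductKey, Quantity, SalesAmount
    FROM conformed_sales;

    -- row count audit written to the log table
    INSERT INTO TABLE audit_db.job_log
    SELECT 'sales_load', 'deliver', COUNT(*), CAST(current_timestamp() AS STRING)
    FROM fact_sales;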
Creating Views- create a view with calculated measures, combine the client dimension and sales fact into a single view, and create an aggregate table for monthly metrics such as OrderCount, SalesTotal, and AvgOrderSize. Create a view with calculated measures in Hive -> check the view to ensure it is correct -> create a flattened view to combine dimensions -> check the view to ensure it is correct -> create a new aggregate view with monthly metrics by using a SQL statement with ‘create view’, ‘as’, and a subquery.
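In HiveQL, those views might look something like this; the metric names come from the guide, while the table and column names carry over from the earlier assumed sketches:

    -- view with a calculated measure
    CREATE VIEW v_sales AS
    SELECT OrderID, OrderDate, ClientID, ProductKey, Quantity, SalesAmount,
           SalesAmount / Quantity AS UnitPrice
    FROM fact_sales;

    -- flattened view combining the client dimension and the sales fact
    CREATE VIEW v_sales_by_client AS
    SELECT d.ClientName, d.City, d.State, f.*
    FROM fact_sales f
    JOIN dim_client d ON f.ClientID = d.ClientID;

    -- monthly aggregate built with CREATE VIEW ... AS and a subquery
    CREATE VIEW v_monthly_metrics AS
    SELECT YearMonth,
           COUNT(*)                    AS OrderCount,
           SUM(SalesAmount)            AS SalesTotal,
           SUM(SalesAmount) / COUNT(*) AS AvgOrderSize
    FROM (
      SELECT substr(CAST(OrderDate AS STRING), 1, 7) AS YearMonth, SalesAmount
      FROM fact_sales
    ) monthly
    GROUP BY YearMonth;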
Keep learning through various outlets such as online videos, websites, books, and courses to keep progressing toward becoming a data scientist.