How Does Data Mining Work
*This guide was created from lecture videos and is meant to help you gain an understanding of how data mining works.
Data Mining Foundations
Data mining- the selection and analysis of data accumulated during the normal course of doing business, in order to find and confirm previously unknown relationships; deploying the resulting predictive models on new data can produce positive and verifiable outcomes.
Data mining is finding patterns in historical data and using those patterns to create models. Those models are then used to score new data on a regular basis.
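The find-patterns-then-score workflow above can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn is available; the customer data and feature names are invented.

```python
# Minimal sketch: train a model on historical data, then score new data.
# Features are hypothetical: [tenure_months, complaints]; label 1 = churned.
from sklearn.linear_model import LogisticRegression

historical_X = [[12, 0], [3, 5], [10, 1], [2, 6], [11, 0], [1, 7]]
historical_y = [0, 1, 0, 1, 0, 1]

# Find the pattern in the historical snapshot (model training).
model = LogisticRegression().fit(historical_X, historical_y)

# Score new data on a regular basis with the same model.
new_customers = [[9, 1], [2, 8]]
scores = model.predict_proba(new_customers)[:, 1]  # churn propensity per row
```

The same fitted model is reused every time a fresh batch of customers arrives; only the training step is slow.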
Defining the Business Problem for Data Science Projects
Poorly defining the problem is the biggest reason why a data science project will fail.
- Decision: make a specific decision, meet a particular need, and set a measurable goal. Every decision flows from the initial business question you are trying to solve.
- Intervention: create an intervention strategy that is time sensitive, can be applied at the individual level, and has a plan of action.
- Gain: determine what you will gain from the project. A gain does not occur until the project is deployed. What are the tangible benefits, and how will the gain be measured for the business? At the start of the project, estimate the potential benefit of completing it.
The idea of data mining is to take a carefully crafted snapshot (historical data), establish a set of best practices and then insert them into the flow of decision making for a business.
How big is the problem financially?
What is the total size of your problem?
How often does the event occur?
What does the event occurrence cost your business?
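The sizing questions above amount to simple arithmetic. A back-of-the-envelope sketch, using invented numbers for a hypothetical churn problem:

```python
# How big is the problem financially? (All figures are made up.)
customers = 100_000
annual_churn_rate = 0.08        # how often the event occurs
lost_revenue_per_churn = 500    # what one event costs the business

events_per_year = customers * annual_churn_rate
annual_cost = events_per_year * lost_revenue_per_churn
print(f"{events_per_year:.0f} churn events/year, costing ${annual_cost:,.0f}")
```

Comparing this annual cost to the cost of the project gives the potential benefit you should estimate at the start.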
Data Mining Requirements
Data must be crafted and custom-fit to solve the business problem and to let the algorithms perform at their best. 70 to 90 percent of the work is getting the data.
You need to create a customer footprint: who you are creating the project for, and how much historical data is necessary to create an accurate model.
Historical data will be put into a flat file; the algorithms used in predictive analytics are designed to run on flat files.
Understand your Target- you must have labeled (supervised) data, meaning the outcome is known and the result for each record in the data set is labeled.
You need a target variable for the training data; in this example dataset it is 'loyal/churn'.
Data Selection- What is the purpose of using the data? The historical data set that you build should mimic the data that you will be deploying the model on.
Data Integration- use as many sources of data as possible; 6 to 20 sources for a specific project is not unheard of. Combining many data sources can yield surprising insights.
Data Construction- this includes constructive data preparation operations: production of derived attributes, new records, and transformed values for existing attributes. Most variables are created during a project, and they typically turn out to be the most important ones. Examples include created ratios and formulas, or date arithmetic.
Derived attributes and feature engineering are the same thing.
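The two derived-attribute examples above (ratios and date arithmetic) can be sketched directly; the field names and values here are invented.

```python
# Feature engineering sketch: derive new attributes from existing ones.
from datetime import date

record = {
    "total_spend": 1200.0,
    "num_orders": 8,
    "signup_date": date(2021, 3, 15),
    "last_order_date": date(2023, 9, 1),
}

# Ratio: average spend per order.
record["avg_order_value"] = record["total_spend"] / record["num_orders"]

# Date arithmetic: customer tenure in days at the time of the last order.
record["tenure_days"] = (record["last_order_date"] - record["signup_date"]).days
```

Neither derived value exists in the raw data, yet either could end up being one of the most predictive variables in the model.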
Data Mining Algorithms
Data mining datasets DO NOT have to be huge or "big" data. Data mining is used to solve post-computing problems. Data modeling algorithms can take a few hours or run overnight. Scoring needs to be quick, but finding patterns will take time.
A project takes weeks or months, not days: typically 6 to 20 weeks. Problem definition can take some time, and then data collection takes time as well.
Working with Subject Matter Experts
Use subject matter experts to widen the search for which variables to include. Ask the subject matter expert, "Did I leave anything out?" to fill the gaps. The computer can then narrow the search for which variables to use.
Visit KDnuggets for more information on data mining algorithms.
Data Science Problems
Missing Data- you will have it. Some modeling techniques will drop the whole row of data. Is the data null or zero?
Organizational Resistance- data science is the business of organizational change. There may be errors in historical data, and sometimes a model triggers major change, so anticipate concerns.
Degrading Models- monitor and recalibrate the weights/coefficients to ensure they stay accurate over time; most software will let you recalibrate models. Recalibrating values over time allows the algorithms to keep doing their work accurately. At some point you will want to adjust the variables, algorithms, or data modeling technique used.
Data Mining Modeling
Nothing to prove- data mining is not testing hypotheses; it is data driven and exploratory. Validation is not the same as hypothesis testing.
Data mining is not the scientific method, because the data is already provided. Data mining is about presenting the data to a data mining algorithm in a way that lets your business question be asked and answered by the data.
Do not worry about statistical assumptions and rules from statistics, as they may not apply to data mining. This is ultimately how statistics and data mining differ.
Leave variables mixed and combined; do not leave them out. Feature selection (selecting the variables in a data set) is important, but you must be cautious when removing variables.
Data mining algorithms are designed to handle a lot of variables. By taking variables out, you can decrease the accuracy of the model.
Data Modeling Process
Business understanding has the goal of taking the need from a business objective and translating it into the proper data modeling objective.
Data understanding is about fine-tuning the modeling strategy.
-how many rows, how many variables, how much missing data there is.
Data preparation is about preparing the data for the data modeling algorithm.
Data modeling- testing and trying different algorithms on the data.
Data Mining Evaluation
No a priori hypotheses (a hypothesis derived from theory rather than experience).
The same data that was used to uncover a pattern cannot be used to prove the pattern applies to future data.
Data mining involves splitting up the data into the training and testing data sets.
The training data will find the pattern and the testing data will test that pattern (found from the training data)
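The split described above can be sketched with scikit-learn's `train_test_split` (assuming scikit-learn is available; the dataset here is invented):

```python
# Split historical data: training data finds the pattern,
# testing data checks whether the pattern holds on unseen rows.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = [0, 1] * 5

# Hold out 30% of the historical data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```

The held-out rows stand in for the "future data" the model will eventually score, which is why the test set must never be used during training.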
Data Mining Deployment
Your project is not complete until deployment.
Propensity score- a score expressing the likelihood of an outcome when there are only two possibilities.
Metamodeling- technique where groups of models work together: ensembles, models in serial, models in parallel.
*Ensembles have more than one model all generating propensity scores, so that they act together; for example, a model made of 1000 decision trees is called a random forest. Models in serial build off one another: a second model builds on the output of the previous one. Models in parallel are used when you need two different models because each data source is incomplete and they rely on each other.
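The random forest example above can be sketched with scikit-learn (an assumption, as the notes don't name a library; the toy data is invented):

```python
# Ensemble sketch: a random forest is many decision trees whose votes
# combine into a single propensity score.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0], [0, 0], [1, 1]]
y = [0, 1, 0, 1, 0, 1]  # label happens to follow the first feature

# 1000 trees, each trained on a random slice of rows and variables.
forest = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X, y)

# Each tree scores the new record; the forest averages their votes.
propensity = forest.predict_proba([[1, 1]])[0][1]
```

No single tree is very accurate, but averaging a thousand of them typically gives a more stable score than any one model alone.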
Reproducibility- data mining must result in a reproducible series of steps that can be performed on new data. The model must work on new data. Does the data mining model access, integrate, prepare and score new data properly?
Documentation- Write a report at the end of each CRISP-DM process. Examples of what to document: which variables were used in the final model? Where is the data coming from? Where are predictive scores being sent? What kind of training will be necessary for end users?
Data Modeling and CRISP-DM
CRISP-DM is the standard process for data mining. There are 6 major phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
Data mining must have business understanding at each step in the process. The data must be in a form which can be analyzed. Data preparation is crucial to the data mining process.
The process of data mining is not: formulate the problem and the algorithm finds the solution. The data miner must formulate the problem and find the solution; algorithms are only tools to assist you.
To find patterns, start from gaining knowledge and building a strong foundation of the business understanding.
All patterns are subject to change because they reflect a changing world and our changing selves.
Get a better understanding of CRISP-DM by visiting Data Science Central.