Machine Learning in Data Science Process
“Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.”
—Arthur Samuel, 1959
When machine learning is seen as a process, the following definition is insightful: “Machine learning is the process by which a computer can work more accurately as it collects and learns from the data it is given.”
—Mike Roberts
Applications for machine learning in data science
Regression and classification are of primary importance to a data scientist. To achieve these goals, one of the main tools a data scientist uses is machine learning. The uses for regression and automatic classification are wide ranging, such as the following:
■ Finding oil fields, gold mines, or archeological sites based on existing sites (classification and regression)
■ Finding place names or persons in text (classification)
■ Identifying people based on pictures or voice recordings (classification)
■ Recognizing birds based on their whistle (classification)
■ Identifying profitable customers (regression and classification)
■ Proactively identifying car parts that are likely to fail (regression)
■ Identifying tumors and diseases (classification)
■ Predicting the amount of money a person will spend on product X (regression)
■ Predicting the number of eruptions of a volcano in a period (regression)
■ Predicting your company’s yearly revenue (regression)
■ Predicting which team will win the Champions League in soccer (classification)
Occasionally data scientists build a model (an abstraction of reality) that provides insight into the underlying processes of a phenomenon. When the goal of a model isn’t prediction but interpretation, it’s called root cause analysis. Here are a few examples:
■ Understanding and optimizing a business process, such as determining which products add value to a product line
■ Discovering what causes diabetes
■ Determining the causes of traffic jams
This list of machine learning applications can only be seen as an appetizer, because machine learning is ubiquitous within data science. Regression and classification are two important techniques, but the repertoire doesn’t end there; clustering, for example, is another valuable technique. Machine learning can be used throughout the data science process.
Where machine learning is used in the data science process
Although machine learning is mainly linked to the data-modeling step of the data science process, it can be used at almost every step.
The data modeling phase can’t start until you have good-quality raw data you can understand. But even before that, the data preparation phase can benefit from the use of machine learning. An example would be cleansing a list of text strings; machine learning can group similar strings together so it becomes easier to correct spelling errors.
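As a minimal sketch of this idea (with made-up strings and a hand-picked distance threshold), the snippet below lets a scikit-learn clustering algorithm group spelling variants of the same name:

    # Represent each string by its character n-grams, then cluster strings
    # whose n-gram profiles are close; variants of the same name end up in
    # the same cluster and can be mapped to one corrected value.
    from sklearn.cluster import DBSCAN
    from sklearn.feature_extraction.text import TfidfVectorizer

    names = ["New York", "new york", "Neww York", "Boston", "Bostonn", "Chicago"]

    vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(names)
    labels = DBSCAN(eps=0.5, min_samples=1, metric="cosine").fit_predict(vectors)

    for name, label in zip(names, labels):
        print(label, name)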
Machine learning is also useful when exploring data. Algorithms can root out underlying patterns in the data that would be difficult to find with charts alone. Given that machine learning is useful throughout the data science process, it shouldn’t come as a surprise that a considerable number of Python libraries have been developed to make your life a bit easier.
Python tools used in machine learning
Python has an overwhelming number of packages that can be used in a machine learning setting. The Python machine learning ecosystem can be divided into a few main types of packages, described below.
PACKAGES FOR WORKING WITH DATA IN MEMORY
When prototyping, the following packages can get you started by providing advanced functionality in only a few lines of code (a short sketch after this list shows a few of them working together):
■ SciPy is a library for scientific computing that builds on NumPy and is commonly used alongside packages such as matplotlib, Pandas, and SymPy.
■ NumPy gives you access to powerful array functions and linear algebra functions.
■ Matplotlib is a popular 2D plotting package with some 3D functionality.
■ Pandas is a high-performance, but easy-to-use, data-wrangling package. It introduces dataframes to Python, a type of in-memory data table. It’s a concept that should sound familiar to regular users of R.
■ SymPy is a package used for symbolic mathematics and computer algebra.
■ StatsModels is a package for statistical methods and algorithms.
■ Scikit-learn is a library filled with machine learning algorithms.
■ RPy2 allows you to call R functions from within Python. R is a popular open source statistics program.
■ NLTK (Natural Language Toolkit) is a Python toolkit with a focus on text analytics.
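As a minimal sketch (with made-up numbers), the snippet below shows three of these packages working together: NumPy for the array math, Pandas for an in-memory data table, and Matplotlib for a quick plot:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    # A small made-up data table: dataframes are Pandas' in-memory tables.
    df = pd.DataFrame({
        "temperature": [18.2, 21.4, 19.9, 23.1],
        "sales": [220, 310, 260, 380],
    })

    print(df.describe())                                      # Pandas summary statistics
    print(np.corrcoef(df["temperature"], df["sales"])[0, 1])  # NumPy correlation

    df.plot(x="temperature", y="sales", kind="scatter")       # Matplotlib does the drawing
    plt.savefig("temperature_vs_sales.png")

These libraries are good to get started with, but once you make the decision to run a certain Python program at frequent intervals, performance comes into play.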
OPTIMIZING OPERATIONS
Once your application moves into production, the libraries listed here can help you deliver the speed you need. Sometimes this involves connecting to big data infrastructures such as Hadoop and Spark.
■ Numba and NumbaPro—These use just-in-time compilation to speed up applications written in plain Python with a few added annotations (a short sketch follows this list). NumbaPro also allows you to use the power of your graphics processing unit (GPU).
■ PyCUDA—This allows you to write code that will be executed on the GPU instead of your CPU and is therefore ideal for calculation-heavy applications. It works best with problems that lend themselves to being parallelized and need little input compared to the number of required computing cycles. An example is studying the robustness of your predictions by calculating thousands of different outcomes based on a single start state.
■ Cython, or C for Python—This brings the C programming language to Python. C is a lower-level language, so the code is closer to what the computer eventually executes (machine code). The closer code is to bits and bytes, the faster it runs. A computer is also faster when it knows the type of a variable (called static typing). Python wasn’t designed to do this, and Cython helps you to overcome this shortfall.
■ Blaze—This gives you data structures that can be bigger than your computer’s main memory, enabling you to work with large data sets.
■ Dispy and IPCluster—These packages allow you to write code that can be distributed over a cluster of computers.
■ PP—Python is executed as a single process by default. With the help of PP you can parallelize computations on a single machine or over clusters.
■ Pydoop and Hadoopy—These connect Python to Hadoop, a common big data framework.
■ PySpark—This connects Python and Spark, an in-memory big data framework.
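As a minimal sketch of the Numba approach mentioned in the list, the made-up sum-of-squares loop below is compiled just in time by adding a single decorator:

    import numpy as np
    from numba import jit

    @jit(nopython=True)            # the only annotation needed: compile to machine code
    def sum_of_squares(values):
        total = 0.0
        for v in values:           # a plain Python loop, now running at compiled speed
            total += v * v
        return total

    print(sum_of_squares(np.arange(1_000_000, dtype=np.float64)))

Now that you’ve seen an overview of the available libraries, let’s look at the modeling process itself.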
The modeling process
The modeling phase consists of four steps:
1 Feature engineering and model selection
2 Training the model
3 Model validation and selection
4 Applying the trained model to unseen data
Before you find a good model, you’ll probably iterate among the first three steps. The last step isn’t always present because sometimes the goal isn’t prediction but explanation (root cause analysis). For instance, you might want to find out the causes of species’ extinctions but not necessarily predict which one is next in line to leave our planet.
It’s possible to chain or combine multiple techniques. When you chain multiple models, the output of the first model becomes an input for the second model. When you combine multiple models, you train them independently and combine their results. This last technique is also known as ensemble learning.
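As a minimal sketch of the combining variant, the snippet below trains two scikit-learn classifiers independently on the bundled iris data (a stand-in for a real problem) and lets them vote:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Each model is trained on its own; the ensemble combines their votes.
    ensemble = VotingClassifier(estimators=[
        ("logistic", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier()),
    ])
    ensemble.fit(X, y)
    print(ensemble.predict(X[:5]))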
A model consists of constructs of information called features or predictors and a target or response variable. Your model’s goal is to predict the target variable, for example, tomorrow’s high temperature. The variables that help you do this and are (usually) known to you are the features or predictor variables such as today’s temperature, cloud movements, current wind speed, and so on. The best models are those that accurately represent reality, preferably while staying concise and interpretable. To achieve this, feature engineering is the most important and arguably most interesting part of modeling.
Engineering features and selecting a model
When engineering features, you come up with and create the possible predictors for the model. This is one of the most important steps in the process because a model recombines these features to arrive at its predictions.
Often you may need to consult an expert or the appropriate literature to come up with meaningful features. Sometimes the features are simply the variables you find in a data set, as is the case with the data sets provided in our exercises and in most school exercises. In practice you’ll need to find the features yourself; they may be scattered among different data sets.
Often you’ll need to apply a transformation to an input before it becomes a good predictor or to combine multiple inputs. An example of combining multiple inputs would be interaction variables: the impact of either single variable is low, but if both are present their impact becomes immense. This is especially true in chemical and medical environments. For example, although vinegar and bleach are fairly harmless common household products by themselves, mixing them results in poisonous chlorine gas, a gas that killed thousands during World War I.
In medicine, clinical pharmacy is a discipline dedicated to researching the effect of the interaction of medicines. This is an important job, and it doesn’t even have to involve two medicines to produce potentially dangerous results. For example, mixing an antifungal medicine such as Sporanox with grapefruit has serious side effects.
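A minimal sketch of encoding such an interaction as a feature (the columns are made up) is simply the product of the two inputs:

    import pandas as pd

    df = pd.DataFrame({
        "drug_a_dose": [0, 1, 0, 1],
        "drug_b_dose": [0, 0, 1, 1],
    })

    # Neither column alone flags the risky combination; their product does.
    df["a_times_b"] = df["drug_a_dose"] * df["drug_b_dose"]
    print(df)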
Sometimes you have to use modeling techniques to derive features: the output of one model becomes part of another model. This isn’t uncommon, especially in text mining. Documents can first be annotated to classify their content into categories, or you can count the number of geographic places or persons in the text. This counting is often more difficult than it sounds; models are first applied to recognize certain words as a person or a place. All this new information is then poured into the model you want to build.
One of the biggest mistakes in model construction is availability bias: your features are only the ones that you could easily get your hands on, and your model consequently represents this one-sided “truth.” Models suffering from availability bias often fail when they’re validated, because it becomes clear that they’re not a valid representation of the truth.
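Returning to the text-mining example, a minimal sketch of deriving a feature from another model’s output could use NLTK’s pretrained named-entity chunker (the sentence is made up, and the standard NLTK models for tokenizing, tagging, and chunking must be downloaded first with nltk.download()):

    import nltk

    text = "Arthur Samuel studied machine learning at IBM in New York."
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))

    # Count recognized persons and places; the counts become new features.
    counts = {"PERSON": 0, "GPE": 0}          # GPE: geopolitical entity (a place)
    for node in tree:
        if hasattr(node, "label") and node.label() in counts:
            counts[node.label()] += 1
    print(counts)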
Once the initial features are created, a model can be trained on the data.
Training your model
With the right predictors in place and a modeling technique in mind, you can progress to model training. In this phase you present your model with data from which it can learn.
The most common modeling techniques have industry-ready implementations in almost every programming language, including Python. These enable you to train your models by executing a few lines of code. For more state-of-the-art data science techniques, you’ll probably end up doing heavy mathematical calculations and implementing them with modern computer science techniques.
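As a minimal sketch of those few lines of code, the snippet below fits a linear regression with scikit-learn on its bundled diabetes data, a stand-in for a real problem:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression

    X, y = load_diabetes(return_X_y=True)   # features X and target y
    model = LinearRegression().fit(X, y)    # training: the model learns from the data
    print(model.coef_)                      # one learned weight per feature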
Once a model is trained, it’s time to test whether it can be extrapolated to reality: model validation.
Validating a model
Data science has many modeling techniques, and the question is which one is the right one to use. A good model has two properties: it has good predictive power and it generalizes well to data it hasn’t seen. To achieve this you define an error measure (how wrong the model is) and a validation strategy.
Two common error measures in machine learning are the classification error rate for classification problems and the mean squared error for regression problems. The classification error rate is the percentage of observations in the test data set that your model mislabeled; lower is better. The mean squared error measures the average of your squared prediction errors. Squaring the errors has two consequences: first, you can’t cancel out a wrong prediction in one direction with a faulty prediction in the other direction. For example, overestimating future turnover for next month by 5,000 doesn’t cancel out underestimating it by 5,000 for the following month. Second, bigger errors get even more weight than they otherwise would. Small errors remain small or can even shrink (if less than 1), whereas big errors are enlarged and will definitely draw your attention.
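A minimal sketch of both measures, computed with NumPy on made-up predictions:

    import numpy as np

    # Classification error rate: the share of mislabeled observations.
    y_true_class = np.array([1, 0, 1, 1, 0])
    y_pred_class = np.array([1, 1, 1, 0, 0])
    print(np.mean(y_true_class != y_pred_class))   # 2 of 5 wrong -> 0.4

    # Mean squared error: over- and underestimations don't cancel out,
    # and bigger errors weigh more than small ones.
    errors = np.array([5000.0, -5000.0, 10.0])
    print(np.mean(errors ** 2))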
Many validation strategies exist, including the following common ones (the first two are sketched in code after the list):
■ Dividing your data into a training set with X% of the observations and keeping the rest as a holdout data set (a data set that’s never used for model creation)—This is the most common technique.
■ K-folds cross validation—This strategy divides the data set into k parts and uses each part one time as a test data set while using the others as a training data set. This has the advantage that you use all the data available in the data set.
■ Leave-1 out—This approach is the same as k-folds cross validation, but with k equal to the number of observations: you always leave one observation out and train on the rest of the data. It’s used only on small data sets, so it’s more valuable to people evaluating laboratory experiments than to big data analysts.
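A minimal sketch of the first two strategies with scikit-learn, again on the bundled diabetes data:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score, train_test_split

    X, y = load_diabetes(return_X_y=True)

    # Holdout: train on 80% of the observations, validate on the other 20%.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("holdout R^2:", model.score(X_test, y_test))

    # K-fold: every observation is used for validation exactly once.
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    print("5-fold R^2:", cross_val_score(LinearRegression(), X, y, cv=folds))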
Another popular term in machine learning is regularization. When applying regularization, you incur a penalty for every extra variable used to construct the model. With L1 regularization you ask for a model with as few predictors as possible. This is important for the model’s robustness: simple solutions tend to hold true in more situations.
L2 regularization aims to keep the variance between the coefficients of the predictors as small as possible. Overlapping variance between predictors makes it hard to make out the actual impact of each predictor, so keeping their variance from overlapping increases interpretability. To keep it simple: regularization is mainly used to stop a model from using too many features and thus prevent overfitting.
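A minimal sketch of both penalties with scikit-learn: Lasso applies L1 regularization and drives some coefficients to exactly zero, whereas Ridge applies L2 regularization and only shrinks them:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Lasso, Ridge

    X, y = load_diabetes(return_X_y=True)

    print("L1 (Lasso):", Lasso(alpha=1.0).fit(X, y).coef_)   # several coefficients become 0
    print("L2 (Ridge):", Ridge(alpha=1.0).fit(X, y).coef_)   # coefficients are shrunk, not removed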
Validation is extremely important because it determines whether your model works in real-life conditions. To put it bluntly, it’s whether your model is worth a dime. Test your models on data the constructed model has never seen, and make sure this data is a true representation of what it would encounter when applied to fresh observations by other people.
For classification models, instruments like the confusion matrix are golden; embrace them.
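A minimal sketch of a confusion matrix on made-up labels, where the rows are the true classes and the columns the predicted classes:

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1, 1]
    print(confusion_matrix(y_true, y_pred))   # every kind of mistake is visible at a glance

Once you’ve constructed a good model, you can (optionally) use it to predict the future.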
Predicting new observations
If you’ve implemented the first three steps successfully, you now have a performant model that generalizes to unseen data. The process of applying your model to new data is called model scoring. In fact, model scoring is something you implicitly did during validation, only now you don’t know the correct outcome. By now you should trust your model enough to use it for real.
Model scoring involves two steps. First, you prepare a data set that has features exactly as defined by your model. This boils down to repeating the data preparation you did in step one of the modeling process but for a new data set. Then you apply the model on this new data set, and this results in a prediction.
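A minimal sketch of the two scoring steps, reusing the trained model from earlier (the “new” observations are faked by slicing off a few rows of the same data):

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression

    X, y = load_diabetes(return_X_y=True)
    model = LinearRegression().fit(X, y)   # the model trained and validated earlier

    X_new = X[:3]                          # step 1: observations prepared with the same features
    print(model.predict(X_new))            # step 2: apply the model -> predictions

In a production setting this scoring step is typically automated, so the data preparation and the trained model have to stay in sync.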