2020.01.18 (pm): Feature Selection and Dimension Reduction

Feature Engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning. (from: https://en.wikipedia.org/wiki/Feature_engineering)

For example, suppose you are watching a soccer match and want to predict its outcome. Can you predict a win or a loss right at the start of the match? Or can you predict which team will win from the manager's age? Probably not with much accuracy. However, if you look at the number of goals each team's players have scored and their past results, you will be able to predict much more accurately.

As mentioned above, machine learning performance is highly dependent on the amount and quality of the data. So it is very important to choose which features to use.

Dimension Reduction

Dimension reduction is also called feature extraction. If we have three independent variables describing one dependent variable, we may not need all three features. This is because some features can be represented by a combination of other features, and some features may be completely redundant with others.

In other words, a latent space that explains the observations well can be smaller than the actual observation space, and identifying that latent space from samples in the observation space is what dimensionality reduction techniques do. There are two ways to reduce the dimension of the data: feature selection and feature extraction.

Feature Selection

The purpose of feature selection is to select a subset of all features to create a concise feature set. In the soccer example above, if we think that the manager's age does not affect the outcome, we can remove that feature from the full feature set. Feature selection thus removes unnecessary features (variables) from the original data.

Feature selection can be done by the analyst using prior background knowledge, but automatic feature selection methods can also be used. These methods remove some of the features and check whether performance improves, which is the basic behavior of most feature selection algorithms.
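The original notebook does not include code for this, but as a rough sketch of the idea, scikit-learn's RFE (recursive feature elimination) works in this spirit: it repeatedly drops the least important features and refits the model. The estimator and the number of features to keep below are arbitrary choices for illustration.

# import library
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# load data
iris = load_iris()

# recursively drop the least important features until two remain
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(iris.data, iris.target)

# boolean mask of the selected features and the elimination ranking
print(selector.support_)
print(selector.ranking_)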

Feature Extraction

Feature extraction attempts to create new features from combinations of the original features. For example, Principal Component Analysis (PCA) is the most basic and popular dimension reduction technique. It works by finding orthogonal principal axes in the data and projecting all of the data onto those axes. The projection that maps the original data to the projected data is equivalent to creating new features that are linear combinations of the original features.
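As a minimal sketch (not part of the original notebook), PCA in scikit-learn can be applied to the iris data like this; the number of components is an arbitrary choice for illustration.

# import library
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# load data
iris = load_iris()

# project the four original features onto two orthogonal principal axes
pca = PCA(n_components=2)
features_pca = pca.fit_transform(iris.data)

# each new feature is a linear combination of the original features
print(features_pca[0:3])
print(pca.explained_variance_ratio_)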

Practical dimension reduction code using feature selection

Select numerical properties based on variance

# import library
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

# loading data
iris = datasets.load_iris()

# feature and target
features = iris.data
target = iris.target

# making threshold
thresholder = VarianceThreshold(threshold=.5)

# select features with variance higher than the threshold
features_high_variance = thresholder.fit_transform(features)

# check the selected high-variance features
features_high_variance[0:3]
# check the variance of each feature
thresholder.variances_

# import library
from sklearn.preprocessing import StandardScaler

# standardization of the feature array
scaler = StandardScaler()
features_std = scaler.fit_transform(features)

# calculate variance: after standardization every feature has variance 1,
# so variance thresholding should not be applied to standardized data
selector = VarianceThreshold()
selector.fit(features_std).variances_

Select binary properties based on variance

# import library
from sklearn.feature_selection import VarianceThreshold

# making a binary feature array
# feature 0: 80% are class 0
# feature 1: 80% are class 1
# feature 2: 60% are class 0, 40% are class 1
features = [[0, 1, 0],
            [0, 1, 1],
            [0, 1, 0],
            [0, 1, 1],
            [1, 0, 0]]

# select features based on their variance
thresholder = VarianceThreshold(threshold=(.75 * (1 - .75)))
thresholder.fit_transform(features)
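For a binary (Bernoulli) feature the variance is p(1 - p), where p is the proportion of ones, so the threshold of .75 * (1 - .75) removes features where one class covers more than 75% of the samples. A quick numerical check (not part of the original notebook):

# import library
import numpy as np

# proportion of ones in each of the three binary features above
p = np.array(features).mean(axis=0)

# Bernoulli variance p(1 - p) matches the variance computed directly;
# only feature 2 exceeds the 0.1875 threshold and is kept
print(p * (1 - p))
print(np.array(features).var(axis=0))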

Dealing with Correlated Characteristics

# import library
import pandas as pd
import numpy as np

# making feature array with correlated characteristics
features = np.array([[1, 1, 1],
                     [2, 2, 0],
                     [3, 3, 1],
                     [4, 4, 0],
                     [5, 5, 1],
                     [6, 6, 0],
                     [7, 7, 1],
                     [8, 7, 0],
                     [9, 7, 1]])

# Convert array to DataFrame
dataframe = pd.DataFrame(features)

# making correlation matrix
corr_matrix = dataframe.corr().abs()

# select the upper triangle of corr_matrix (k=1 excludes the diagonal)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# find the columns with a correlation coefficient greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# remove the highly correlated feature
dataframe.drop(dataframe.columns[to_drop], axis=1)
# check the correlation matrix
dataframe.corr()
# check the upper triangle used for the drop decision
upper

Deleting features not related to classification

# import library
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif

# load data
iris = load_iris()
features = iris.data
target = iris.target

# convert the features to integers to treat them as count-like categorical data (chi-square expects non-negative counts)
features = features.astype(int)

# select the two features with the largest chi-square statistics
chi2_selector = SelectKBest(chi2, k=2)
features_kbest = chi2_selector.fit_transform(features, target)

# Result
print("Number of original features:", features.shape[1])
print("Number of selected features:", features_kbest.shape[1])

# select the two features with the highest F-values
fvalue_selector = SelectKBest(f_classif, k=2)
features_kbest = fvalue_selector.fit_transform(features, target)

# Result
print("Number of original features:", features.shape[1])
print("Number of selected features:", features_kbest.shape[1])

# import library
from sklearn.feature_selection import SelectPercentile

# select the top 75% of features with the largest F-values
fvalue_selector = SelectPercentile(f_classif, percentile=75)
features_kbest = fvalue_selector.fit_transform(features, target)

# Result
print("Number of original features:", features.shape[1])
print("Number of selected features:", features_kbest.shape[1])

https://github.com/SeongjooHong/machine-learning/blob/master/%ED%8A%B9%EC%84%B1%20%EC%84%A0%ED%83%9D%EC%9D%84%20%EC%82%AC%EC%9A%A9%ED%95%9C%20%EC%B0%A8%EC%9B%90%20%EC%B6%95%EC%86%8C.ipynb
