imooc (慕课网) Machine Learning Notes (4)

Summary


Using grid search to find the best hyperparameters

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

digits = datasets.load_digits()
train_x, test_x, train_y, test_y = train_test_split(digits.data, digits.target)

# Parameter grid to search over
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]

# The classifier to tune
knn_clf = KNeighborsClassifier()

# CV = cross-validation
grid_search = GridSearchCV(knn_clf, param_grid)
%%time
grid_search.fit(train_x, train_y)
# CPU times: user 1min 46s, sys: 212 ms, total: 1min 46s
# Wall time: 1min 46s
# GridSearchCV(cv=None, error_score='raise',
#        estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#                                       metric_params=None, n_jobs=1, n_neighbors=5, p=2,
#                                       weights='uniform'),
#        fit_params=None, iid=True, n_jobs=1,
#        param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, {'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
#        pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
#        scoring=None, verbose=0)


# Inspect the best estimator found
grid_search.best_estimator_
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#                      metric_params=None, n_jobs=1, n_neighbors=4, p=3,
#                      weights='distance')

# Best cross-validation accuracy
grid_search.best_score_
# 0.9881217520415738

# Best parameters
grid_search.best_params_
# {'n_neighbors': 4, 'p': 3, 'weights': 'distance'}
  • sklearn naming convention: an attribute whose name ends with a trailing underscore (e.g. best_score_) is computed from the user's data during fitting, rather than supplied directly by the user.
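With `refit=True` (the default), `GridSearchCV` refits the best parameter combination on the whole training set, so the result can be scored on held-out data directly. A minimal, self-contained sketch (it uses a smaller grid than the one above so it finishes quickly; `random_state=666` is an arbitrary choice, not from the original notes):

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

digits = datasets.load_digits()
train_x, test_x, train_y, test_y = train_test_split(
    digits.data, digits.target, random_state=666)

# A smaller grid than above, just to keep this sketch fast
param_grid = [{'weights': ['uniform', 'distance'], 'n_neighbors': [3, 4, 5]}]

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid)
grid_search.fit(train_x, train_y)

# best_estimator_ has already been refitted on all of train_x
best_knn = grid_search.best_estimator_
print(best_knn.score(test_x, test_y))

# grid_search itself delegates predict/score to best_estimator_
print(grid_search.score(test_x, test_y))
```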

Configuring GridSearchCV for speed-up and logging output

# n_jobs sets how many CPU cores to use (-1 means all available cores);
# verbose controls how much progress information is printed
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
%%time
grid_search.fit(train_x, train_y)
# CPU times: user 216 ms, sys: 54.2 ms, total: 270 ms
# Wall time: 22.8 s


More definitions of distance in the KNN algorithm

Documentation on the various distance metrics
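The `p` hyperparameter searched above comes from the Minkowski distance, which generalizes both Manhattan (p = 1) and Euclidean (p = 2) distance. A quick numpy sketch (the `minkowski` helper is written here for illustration):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance: (sum(|a_i - b_i|^p))^(1/p)."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(minkowski(a, b, 1))  # p=1: Manhattan distance, 7.0
print(minkowski(a, b, 2))  # p=2: Euclidean distance, 5.0
print(minkowski(a, b, 3))  # larger p weights the largest coordinate difference more
```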


Data normalization

  • Problem: the distance between samples is dominated by the feature with the larger numeric scale. In the tumor example, discovery time (in days) dominates the distance even though it only varies by a factor of 2, while tumor size varies by a factor of 5.

  • Solution: map all features onto the same scale.

Min-max normalization

Maps all data into the range [0, 1]. Suitable when features have clear boundaries, e.g. exam scores or pixel values.

import numpy as np
import matplotlib.pyplot as plt

# 50 random 2-D samples in [0, 100)
arr_normal = np.random.randint(0, 100, [50, 2])
arr_normal = np.array(arr_normal, dtype=float)

# Min-max normalization per column: (x - min) / (max - min)
arr_normal[:, 0] = (arr_normal[:, 0] - np.min(arr_normal[:, 0])) / (np.max(arr_normal[:, 0]) - np.min(arr_normal[:, 0]))
arr_normal[:, 1] = (arr_normal[:, 1] - np.min(arr_normal[:, 1])) / (np.max(arr_normal[:, 1]) - np.min(arr_normal[:, 1]))

plt.scatter(arr_normal[:, 0], arr_normal[:, 1])
plt.show()


Mean-variance normalization (standardization)

Maps all data onto a distribution with mean 0 and variance 1. Suitable when there are no clear boundaries and extreme values may be present.

import numpy as np
import matplotlib.pyplot as plt

# 50 random 2-D samples in [0, 100)
arr_standard = np.random.randint(0, 100, [50, 2])
arr_standard = np.array(arr_standard, dtype=float)

# Standardization per column: (x - mean) / std
arr_standard[:, 0] = (arr_standard[:, 0] - np.average(arr_standard[:, 0])) / np.std(arr_standard[:, 0])
arr_standard[:, 1] = (arr_standard[:, 1] - np.average(arr_standard[:, 1])) / np.std(arr_standard[:, 1])

plt.scatter(arr_standard[:, 0], arr_standard[:, 1])
plt.show()


Which mean and variance to use for the test data

When normalizing, the test data must use the mean and variance of the training data.

sklearn wraps this in a dedicated Scaler class for normalizing data
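The idea can be sketched with plain numpy (the data here is random and purely illustrative): the mean and std are computed from the training split only, then applied to both splits.

```python
import numpy as np

rng = np.random.RandomState(0)
x_train = rng.uniform(0, 100, 80)
x_test = rng.uniform(0, 100, 20)

# Statistics come from the TRAINING data only...
mean_train = np.mean(x_train)
std_train = np.std(x_train)

# ...and are applied to both splits: the test set represents unseen data,
# so its own mean/std are unknown when the model is trained.
x_train_scaled = (x_train - mean_train) / std_train
x_test_scaled = (x_test - mean_train) / std_train

print(np.mean(x_train_scaled))  # 0 up to floating-point error, by construction
print(np.mean(x_test_scaled))   # near 0, but generally not exactly 0
```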


Mean-variance normalization with sklearn's StandardScaler

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets

data = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target)

# Initialize the scaler
standardScaler = StandardScaler()
# Fit: compute mean and std from the TRAINING data
standardScaler.fit(x_train)

# Produce the normalized data (the training statistics are applied to both splits)
x_train_scale = standardScaler.transform(x_train)
x_test_scale = standardScaler.transform(x_test)

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(x_train_scale, y_train)
knn_clf.score(x_test_scale, y_test)
# 1

Writing our own class modeled on sklearn's Scaler

import numpy as np

class StandardScaler:

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X: np.ndarray) -> 'StandardScaler':
        """Compute the mean and std of each feature from the training set X."""
        assert X.ndim == 2, "The dimension of X must be 2"

        self.mean_ = np.array([np.mean(X[:, i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:, i]) for i in range(X.shape[1])])

        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        """Standardize X using the mean and std stored by this StandardScaler."""
        assert X.ndim == 2, "The dimension of X must be 2"
        assert self.scale_ is not None and self.mean_ is not None, \
            "must fit before transform"
        assert X.shape[1] == len(self.mean_), \
            "the feature number of X must be equal to mean_ and scale_"

        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            resX[:, col] = (X[:, col] - self.mean_[col]) / self.scale_[col]

        return resX
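A quick check that the hand-rolled scaler behaves as expected (a compact copy of the class is repeated here so the sketch runs standalone; the sample data is made up):

```python
import numpy as np

# Compact standalone copy of the class above
class StandardScaler:
    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

scaler = StandardScaler().fit(X_train)
print(scaler.mean_)              # per-column training means: 2 and 20
print(scaler.transform(X_test))  # test data scaled with the TRAINING statistics
```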

Summary of KNN as a machine learning algorithm

The main workflow is as follows (flowchart in the original notes):

Drawbacks of KNN:

  • Low efficiency: predicting one sample costs O(m*n), where m is the number of training samples and n the number of features.

  • Sensitive to values near the decision boundary (and to outliers).

  • Predictions are not interpretable: the model can only say which training samples the input is closest to, not why.

  • Curse of dimensionality: as the number of dimensions grows, the distance between seemingly adjacent points also grows; in 10,000 dimensions the distance from (0, 0, …, 0) to (1, 1, …, 1) is 100.
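The last point is easy to verify: the Euclidean distance from the origin to the all-ones point in d dimensions is sqrt(d), so seemingly "adjacent" corners of the unit hypercube drift far apart as d grows.

```python
import numpy as np

# Distance from (0, 0, ..., 0) to (1, 1, ..., 1) in d dimensions is sqrt(d)
for d in [2, 100, 10000]:
    dist = np.linalg.norm(np.ones(d) - np.zeros(d))
    print(d, dist)
# In 10000 dimensions the distance is sqrt(10000) = 100.0
```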