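I successfully implemented multiple linear regression on the Iris dataset using only numpy. I want to do the same for the Boston housing dataset, but my model fails to learn and I don't know why.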
import numpy as np
import pandas as pd
# read data and split into test and training sets
data = pd.read_csv('train.csv')
data = (data - data.mean()) / data.std() # normalize data
split_data = np.random.rand(len(data)) < 0.8
train_data = data[split_data].round(5)
test_data = data[~split_data]
# create matrices
input_features_train = train_data.drop(['ID','medv'],axis=1).values
output_feature_train = train_data.medv.values.reshape(-1,1)
ones = np.ones([input_features_train.shape[0],1])
input_features_train = np.concatenate((ones,input_features_train),1)
weight = np.zeros([1,14])
def computeCost(X,y,theta):
    # half the mean squared error
    summed = np.power(((X @ theta.T) - y),2)
    return np.sum(summed) / (2 * len(X))
def gradientDescent(X,y,theta,iters,alpha):
    costs = np.zeros(iters)
    for i in range(iters):
        # batch update: average gradient over all rows
        theta = theta - (alpha / len(X)) * np.sum(X * (X @ theta.T - y),0)
        costs[i] = computeCost(X,y,theta)
    return theta,costs
learning_rate = 0.01
iterations = 100000
weights,cost = gradientDescent(input_features_train,output_feature_train,weight,iterations,learning_rate)
print("Weights: ",weights)
finalCost = computeCost(input_features_train,output_feature_train,weights)
# test
input_features_test = test_data.drop(['ID','medv'],axis=1).values
output_feature_test = test_data.medv.values.reshape(-1,1)
ones = np.ones([input_features_test.shape[0],1])
input_features_test = np.concatenate((ones,input_features_test),1)
def test_data(input_features,output_feature,weights):
    predictions = np.round(np.dot(input_features,weights.T))
    for i in range(len(output_feature)):
        predicted = predictions[i]
        success = predictions[i] == output_feature[i]
        print('For features: ',input_features[i],' housing price should be ',output_feature[i])
        print("Predicted: %f" % predicted)
        print("Is success? ",success)
        print()
test_data(input_features_test,output_feature_test,weights)
predictions = np.round(np.dot(input_features_test,weights.T))
accuracy = (sum(predictions == output_feature_test) / float(len(output_feature_test)) * 100)[0]
print("accuracy of the model is ",accuracy,"% after ",iterations,"iterations")
Sample output:
Weights: [[ 0.01465871 -0.11583742 0.17729105 0.01249782 0.09822299 -0.31249182
0.25208063 -0.00937766 -0.48751822 0.46772537 -0.27637035 -0.1590125
0.12926108 -0.48910136]]
For features: [ 1. -0.44852959 -0.47141352 0.09095532 -0.25240023 0.13793157
0.46506236 0.03105118 -0.62153314 -0.98758424 -0.79769195 1.18594974
0.37563165 -0.40259248] housing price should be [-0.04019949]
Predicted: 0.000000
Is success? [False]
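I even tried 10000000 iterations, but it still fails every test and the accuracy is 0%. On the Iris dataset I managed to get 100% accuracy with this model, so I don't understand why it doesn't work here.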
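For reference, here is a minimal sketch of what the accuracy check in the code above does when the targets are continuous, standardized values. The numbers are made up for illustration (only the first target is taken from the sample output), and the helper names `exact_match_rate` and `mean_squared_error` are hypothetical, not part of the original code. It suggests that rounding and exact equality may report 0% even for predictions that are very close, which is something a distance-based measure would not do.

```python
import numpy as np

def exact_match_rate(y_true, y_pred):
    """Percentage of rounded predictions that equal the target exactly
    (the same comparison the test code above performs)."""
    return float(np.mean(np.round(y_pred) == y_true) * 100)

def mean_squared_error(y_true, y_pred):
    """Average squared distance between predictions and targets."""
    return float(np.mean((y_pred - y_true) ** 2))

# Made-up standardized targets; predictions are off by only 0.01 each.
y_true = np.array([[-0.04019949], [0.51], [-1.23], [0.87]])
y_pred = y_true + 0.01

print("exact-match accuracy (%):", exact_match_rate(y_true, y_pred))
print("mean squared error:", mean_squared_error(y_true, y_pred))
```

On Iris the targets are integer class labels, so rounding can hit them exactly; that may be why the same check behaved differently there.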
I suspect this may have something to do with normalizing the data, because without it I get
RuntimeWarning: overflow encountered in power
summed = np.power(((X @ theta.T) - y),2)
but I don't know why that happens either.
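A small sketch that may help isolate the overflow question: it runs the same update rule on one synthetic feature with a large scale, once raw and once standardized, with the same learning rate of 0.01. The data, the seed, and the helper names (`compute_cost`, `gradient_descent`, `with_bias`) are assumptions for illustration, not the Boston data or the original functions. With the raw feature the cost grows at every step, and with enough iterations such growth would eventually exceed what float64 can hold.

```python
import numpy as np

def compute_cost(X, y, theta):
    # half the mean squared error
    return np.sum((X @ theta.T - y) ** 2) / (2 * len(X))

def gradient_descent(X, y, theta, iters, alpha):
    costs = np.zeros(iters)
    for i in range(iters):
        # same batch update as in the code above
        theta = theta - (alpha / len(X)) * np.sum(X * (X @ theta.T - y), 0)
        costs[i] = compute_cost(X, y, theta)
    return theta, costs

def with_bias(features):
    # prepend a column of ones for the intercept term
    return np.concatenate((np.ones((len(features), 1)), features), 1)

rng = np.random.default_rng(0)
raw = rng.uniform(0, 500, size=(100, 1))          # hypothetical large-scale feature
y = 0.05 * raw + rng.normal(0, 1, size=(100, 1))  # synthetic target

theta0 = np.zeros((1, 2))
# Only 20 iterations, so the diverging run stays within float64 range.
_, costs_raw = gradient_descent(with_bias(raw), y, theta0, 20, 0.01)

scaled = (raw - raw.mean()) / raw.std()
_, costs_scaled = gradient_descent(with_bias(scaled), y, theta0, 20, 0.01)

print("raw feature, first and last cost:   ", costs_raw[0], costs_raw[-1])
print("scaled feature, first and last cost:", costs_scaled[0], costs_scaled[-1])
```

If the real data behaves like this, the warning would point to the step size being too large for the unscaled features rather than to a bug in the cost function itself.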
Could you point me in the right direction? Thanks!