Comments (14)

oraoto avatar oraoto commented on May 4, 2024 2

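The data should be split before feature scaling. You don't need to guarantee that the training and test sets are identical; you can only assume that the test set (or the production data after the model is deployed) is approximately similar, and then apply the same processing to it as to the training data (for example, subtracting the training-set mean).

If you scale the features before splitting, information from the test set's features gets mixed into the training data.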

from 100-days-of-ml-code.

wengJJ avatar wengJJ commented on May 4, 2024

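I answered this question before in the English version, and I still hold the same view: split the data first, then scale the features.
Common feature scaling methods: normalization and standardization.
Normalization = (x - min) / (max - min), which depends on the sample minimum and maximum.
Standardization = (x - μ) / σ, which depends on the sample mean μ and standard deviation σ.
If you scale the whole dataset first and, for example, the maximum used for normalization happens to lie in the test set, then the test data influences the training data. The principle behind splitting into training and test sets is that the two do not affect each other, so that the measured model performance can be trusted.
So, in my opinion, you should still split the data first and then scale the features.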
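
A minimal sketch of the min-max case described above, in plain Python. The toy values and the helper `min_max_scale` are hypothetical and only for illustration; the largest value is placed in the test split on purpose.

```python
def min_max_scale(values, lo, hi):
    """Apply (x - min) / (max - min) using the given min and max."""
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical toy feature: the largest value (100.0) lands in the test split.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [2.5, 100.0]

# Scaling before splitting: min/max come from train and test together,
# so the test maximum squeezes every training value towards 0.
full = train + test
leaky_train = min_max_scale(train, min(full), max(full))

# Splitting before scaling: min/max come from the training split only,
# and the same min/max are reused for the test split.
clean_train = min_max_scale(train, min(train), max(train))
clean_test = min_max_scale(test, min(train), max(train))

print(leaky_train)
print(clean_train)
print(clean_test)
```

With the second approach the scaled test values can fall outside [0, 1]. That is expected: the test set is treated like unseen data.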

zhyongquan avatar zhyongquan commented on May 4, 2024

My understanding is that if there is enough data, the order hardly matters, but when data is limited the effect becomes larger.
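
A rough way to check this intuition, assuming NumPy is available and using synthetic normally distributed data (an assumption, not the dataset from the repo). The helper `mean_gap` is hypothetical; it measures how far the mean of an 80% training split is from the mean of the full sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def mean_gap(n):
    """Absolute gap between the mean of an 80% training split and the mean of all n samples."""
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    train = x[: int(0.8 * n)]
    return abs(train.mean() - x.mean())

gap_small = mean_gap(20)        # small dataset
gap_large = mean_gap(200_000)   # large dataset
print(gap_small, gap_large)
```

For large samples the two means are nearly identical, so scaling before or after the split gives almost the same parameters. Splitting first is still the safer habit, since it costs nothing and also covers the small-data case.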

Huang-Jack avatar Huang-Jack commented on May 4, 2024

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

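A question: why does feature scaling use fit_transform on the training set but transform on the test set?
I tried using fit_transform on the test set as well, and the data came out wrong.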

zhyongquan avatar zhyongquan commented on May 4, 2024

Which file?

Huang-Jack avatar Huang-Jack commented on May 4, 2024

day6 @zhyongquan

zhyongquan avatar zhyongquan commented on May 4, 2024

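@Huang-Jack Someone raised this in the original author's repo and changed both calls to fit_transform. I just tried that and it ran without problems. What error are you getting?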

Huang-Jack avatar Huang-Jack commented on May 4, 2024

I tried changing both to fit_transform and it no longer raises an error, but I don't know what the difference between them is.

zhyongquan avatar zhyongquan commented on May 4, 2024

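Quoted from a comment on this article: https://blog.csdn.net/appleyuchi/article/details/73503282
I think it explains this more clearly than the article itself.
Every transform needs a fit first. For example, to convert data to a distribution with mean 0 and standard deviation 1, you need the mean and the standard deviation. The difference between fit_transform and transform is that the former computes the mean and standard deviation and then transforms, while a bare transform uses parameters computed from earlier data. So if nothing has been fit yet, you cannot call transform directly.
In this example, fit_transform has already been called once, so either fit_transform or transform will run afterwards, because a fit has already happened. Was the author's intent to use the parameters fit on the training set to transform the test set, so that both get the same transformation?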

zhyongquan avatar zhyongquan commented on May 4, 2024

@wengJJ
As for the ordering question this issue is about, did the author take it into account in how transform is used?

Huang-Jack avatar Huang-Jack commented on May 4, 2024

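I tried it: using fit_transform on the test set gives a lower F1 score than using transform on the test set. Perhaps refitting on the test set means the scaling no longer reflects the characteristics of the overall dataset. @zhyongquan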
X_test = sc.transform(X_test)
             precision    recall  f1-score   support

          0       0.89      0.96      0.92        68
          1       0.89      0.75      0.81        32

avg / total       0.89      0.89      0.89       100

X_test = sc.fit_transform(X_test)
             precision    recall  f1-score   support

          0       0.90      0.93      0.91        68
          1       0.83      0.78      0.81        32

avg / total       0.88      0.88      0.88       100

zhyongquan avatar zhyongquan commented on May 4, 2024

So train and test must be transformed with the same fit parameters.

wengJJ avatar wengJJ commented on May 4, 2024

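When scaling features, the order is:
1. fit to obtain the parameter values (think of this as learning the scaling rule)
2. transform to apply the conversion
fit_transform simply runs fit() and then transform(), so every call adopts a new scaling rule.

@Huang-Jack The worse result you saw from using fit_transform on the test set is to be expected, because the test set is then transformed with a new rule learned from the test set itself. With transform, the rule learned from the training set is applied, and the result is better.
@zhyongquan There is still no conflict on the ordering question. The author applies the training set's scaling rule to the test set, so you still have to split the dataset first and then scale the features.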
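
A small sketch of the difference, assuming scikit-learn and NumPy are installed. The toy arrays are hypothetical, and a second scaler (`sc_refit`) is used only so that both sets of parameters can be printed side by side.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, test values shifted relative to train.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[4.0], [6.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # fit: learn mean_/scale_ from train, then transform
train_mean = sc.mean_.copy()

X_test_ok = sc.transform(X_test)            # reuse the training mean and std

sc_refit = StandardScaler()
X_test_refit = sc_refit.fit_transform(X_test)  # learn a new mean and std from the test set

print(train_mean, sc_refit.mean_)   # training rule vs. test-set rule
print(X_test_ok.ravel())            # test values expressed in training units
print(X_test_refit.ravel())         # test values re-centred on themselves
```

Calling `sc.fit_transform(X_test)` on the original scaler would also overwrite the parameters learned from the training set, so the model would see test features on a different scale from the one it was trained on.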

zhyongquan avatar zhyongquan commented on May 4, 2024

Yes, I think this issue has been settled.
