youchounobb / 2018-tencent-ad-competition-baseline
2018 Tencent Advertising Algorithm Competition baseline, online score 0.73
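How do I save an LGBMClassifier after training so that it can be loaded directly the next time it is needed?
Can the preprocessed matrix be saved as well? Re-running the preprocessing every time is too slow.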
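One possible approach, sketched under assumptions: a fitted scikit-learn-style estimator (LGBMClassifier follows that interface) can generally be serialized with Python's pickle module, and a SciPy sparse matrix can be written with scipy.sparse.save_npz. The helper names and file names below are hypothetical, and LogisticRegression stands in for the classifier so that the sketch only needs scikit-learn and SciPy.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression


def save_model(model, path):
    """Serialize a fitted estimator to disk (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load an estimator previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Tiny stand-in for the preprocessed sparse feature matrix and its labels.
train_x = sparse.csr_matrix(np.array([[1, 0, 2], [0, 3, 0], [4, 0, 0], [0, 0, 5]]))
train_y = np.array([0, 1, 0, 1])

# Stand-in estimator; the assumption is that LGBMClassifier pickles the same way.
clf = LogisticRegression().fit(train_x, train_y)

work_dir = tempfile.mkdtemp()
model_path = os.path.join(work_dir, "model.pkl")
matrix_path = os.path.join(work_dir, "train_x.npz")

save_model(clf, model_path)            # persist the trained model
sparse.save_npz(matrix_path, train_x)  # persist the preprocessed matrix

restored_clf = load_model(model_path)
restored_x = sparse.load_npz(matrix_path)
```

On later runs only the two load calls would be needed, which skips both training and preprocessing.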
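The default token_pattern is '(?u)\b\w\w+\b', which appears to ignore single-character tokens such as '1' and '2', so those features are lost. If NaN is filled with '-1', consider setting token_pattern to '(?u)\b(?<!-)\d+\b'.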
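The effect of the two patterns can be checked on a toy column. This is a small sketch, with made-up values, that compares the vocabulary learned by CountVectorizer with its default pattern against the pattern proposed above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy multi-valued feature column; '-1' stands for a filled NaN.
docs = ["1 2 35", "2 7", "-1"]

# Default pattern keeps only tokens with two or more word characters.
default_cv = CountVectorizer()
default_cv.fit(docs)
default_vocab = sorted(default_cv.vocabulary_)

# Proposed pattern keeps every run of digits that is not preceded by '-'.
custom_cv = CountVectorizer(token_pattern=r"(?u)\b(?<!-)\d+\b")
custom_cv.fit(docs)
custom_vocab = sorted(custom_cv.vocabulary_)

print(default_vocab)
print(custom_vocab)
```

With the default pattern only the two-digit id survives, while the proposed pattern keeps the single-digit ids and still skips the '-1' placeholder.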
Could you help with a question? clf.predict reports:
DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use array.size > 0 to check that an array is not empty.
What could be the cause? I am new to this and do not understand it well. Could you give me some pointers?
Traceback (most recent call last):
File "E:/2018-tencent-ad-competition-baseline-master/bryan_baseline_v2.py", line 76, in batch_predict
data[feature] = LabelEncoder().fit_transform(data[feature].apply(int))
File "E:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 2551, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src/inference.pyx", line 1521, in pandas._libs.lib.map_infer
ValueError: invalid literal for int() with base 10: '1 2'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:/2018-tencent-ad-competition-baseline-master/bryan_baseline_v2.py", line 115, in
result.append(batch_predict(slice,i))
File "E:/2018-tencent-ad-competition-baseline-master/bryan_baseline_v2.py", line 78, in batch_predict
data[feature] = LabelEncoder().fit_transform(data[feature])
File "E:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 112, in fit_transform
self.classes_, y = np.unique(y, return_inverse=True)
File "E:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\arraysetops.py", line 210, in unique
return _unique1d(ar, return_index, return_inverse, return_counts)
File "E:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\arraysetops.py", line 274, in _unique1d
perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
TypeError: '<' not supported between instances of 'int' and 'str'
Process finished with exit code 1
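The traceback suggests that int() failed on a multi-valued entry ('1 2') and that the fallback then handed LabelEncoder a column mixing int and str values, which cannot be sorted against each other. A minimal sketch of one workaround, assuming it is acceptable to encode such a column as strings; encode_column is a hypothetical helper, not part of the baseline:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_column(series):
    """Label-encode a column, falling back to string values when int() fails."""
    try:
        values = series.apply(int)
    except ValueError:
        # Entries such as '1 2' cannot be parsed as int; use one uniform type.
        values = series.astype(str)
    return LabelEncoder().fit_transform(values)


# Column that mixes ints and a multi-valued string, as in the traceback.
mixed = pd.Series([1, "1 2", 3, "1 2"], dtype=object)
codes = encode_column(mixed)
print(codes.tolist())
```

Whether a column containing '1 2' belongs among the single-valued features at all is a separate question; it may be a multi-valued feature that was placed in the wrong list.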
Submission failed: the file content format is invalid.
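I got an error when running the step below. What should I do? Thanks.
--You can first use user_feature_tocsv.py to convert the user features into a csv file, so that it can be read directly with pd.read_csv later
Error message:
... ...
11300000
11400000
Traceback (most recent call last):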
File "user_feature_tocsv.py", line 28, in
user_feature=pd.concat([pd.DataFrame('../data/userFeature_' + str(i) + '.csv') for i in range(cnt+1)])
File "user_feature_tocsv.py", line 28, in
user_feature=pd.concat([pd.DataFrame('../data/userFeature_' + str(i) + '.csv') for i in range(cnt+1)])
File "/home/wc/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 404, in init
raise ValueError('DataFrame constructor not properly called!')
ValueError: DataFrame constructor not properly called!
(tfpy35) wc@ubuntu:~/Desktop/Tencent/code$ ^C
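The traceback shows pd.DataFrame being called with a path string, which pandas rejects. Presumably pd.read_csv was intended, although that is an assumption about the script. A minimal sketch of reading the part files back and concatenating them; load_user_feature_parts and its directory argument are hypothetical, and the demo writes two tiny part files to a temporary directory:

```python
import os
import tempfile

import pandas as pd


def load_user_feature_parts(directory, part_count):
    """Read userFeature_0.csv ... userFeature_{part_count-1}.csv and stack them."""
    paths = [
        os.path.join(directory, "userFeature_" + str(i) + ".csv")
        for i in range(part_count)
    ]
    # pd.read_csv parses each file; pd.DataFrame(path) would raise ValueError.
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)


# Demo data: two small part files with made-up contents.
work_dir = tempfile.mkdtemp()
for i in range(2):
    part = pd.DataFrame({"uid": [i * 2, i * 2 + 1], "age": [i, i]})
    part.to_csv(os.path.join(work_dir, "userFeature_" + str(i) + ".csv"), index=False)

user_feature = load_user_feature_parts(work_dir, 2)
print(user_feature.shape)
```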
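I have 8 GB of memory. A comment on the blog says there is a way to split the data, but as a newcomer I do not know how. Is there a function for it?
del removes the variable reference, but memory usage still sits at 99%.
I ran it again and found that memory was fine early on, going up and down, but around 11400000 it became very high. What is the reason?
This is my first time doing this, so any advice is appreciated.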
I wanted to download the data from the link you shared, but the link has expired. Could you share it again? Thanks.
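1. bryan_baseline_v3.py, line 98
In train_x = sparse.hstack((train_x, train_a)),
why is sparse.hstack used rather than df.concat?
2. Lines 66-68
one_hot_feature=['LBS','age','carrier','cons......
vector_feature =['appIdAction','a','....
one_hot_feature needs to be standardized ((x-u)/standard deviation),
and vector_feature needs text feature extraction. So the two groups of features are looped over separately and appended one column at a time.
I am not sure whether this understanding is correct.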
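A small sketch may help with both points, under the assumption (suggested by the list names, not confirmed here) that one_hot_feature columns are one-hot encoded rather than standardized and that vector_feature columns go through CountVectorizer. Both transformers return SciPy sparse matrices by default, not DataFrames, which is why the pieces are joined with sparse.hstack. The column values below are made up:

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Made-up single-valued column (e.g. an age bucket) and multi-valued column.
ages = np.array([[1], [2], [1]])
interests = ["10 20", "20", "30 10"]

# One-hot encoding: one output column per distinct category.
onehot = OneHotEncoder().fit_transform(ages)

# Bag-of-ids encoding: one output column per distinct token.
counts = CountVectorizer().fit_transform(interests)

# Both blocks are sparse, so they are stacked side by side without densifying.
features = sparse.hstack((onehot, counts)).tocsr()

print(sparse.issparse(onehot), sparse.issparse(counts))
print(features.shape)
```

Converting such wide, mostly-zero blocks into DataFrame columns for a pandas concat would store every zero explicitly, which is likely to be far more memory-hungry on data of this size.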
The problem is in the line
slice = train[start:end]
The message shown is: can not do slice indexing on with these indexers [0.0] <type 'float'>
I printed start and end, and both are floats. Is it because my pandas version is too old to do this?
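Slice bounds generally have to be integers, so one likely fix is to convert start and end with int() before slicing. A minimal sketch, assuming the bounds come from a floating-point batch size; the variable names and the toy frame are hypothetical:

```python
import pandas as pd

train = pd.DataFrame({"x": range(10)})

n_batches = 4
batch_size = len(train) / n_batches  # true division gives a float (2.5 here)

# Convert each bound to int so that positional slicing accepts it.
bounds = [
    (int(i * batch_size), int((i + 1) * batch_size))
    for i in range(n_batches)
]
batches = [train[start:end] for start, end in bounds]

print(bounds)
print([len(b) for b in batches])
```

Because consecutive batches share a boundary value, every row lands in exactly one batch even when the batch size is not a whole number.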
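Hello, I have just installed a new 8 GB memory module and am using Ubuntu 16.04 with ulimit -c unlimited enabled, but when running the v3 version the process is still killed after reading about 4000000 lines of user_feature.data. Is there any possible solution? Thank you very much!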
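Why is the data written to csv in separate parts and then also written to a single csv?
Why can userFeature_data still be written to a file when it has already been deleted with del inside the loop and set to empty?
I do not quite understand these two points. Any guidance is appreciated.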
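The general pattern behind this kind of script can be sketched as follows. This is an illustration of a buffer that is flushed to a part file and then reset, not a copy of user_feature_tocsv.py; write_in_parts, flush_part, and the part file names are hypothetical. The point is the ordering: the buffer is written to disk first, and only afterwards is the name rebound to a fresh empty list that collects the next batch.

```python
import os
import tempfile

import pandas as pd


def flush_part(buffer, out_dir, index):
    """Write the buffered records to one part file and return its path."""
    path = os.path.join(out_dir, "part_" + str(index) + ".csv")
    pd.DataFrame(buffer).to_csv(path, index=False)
    return path


def write_in_parts(records, out_dir, rows_per_part):
    """Stream records to disk in fixed-size parts to bound memory use."""
    buffer = []
    paths = []
    for record in records:
        buffer.append(record)
        if len(buffer) == rows_per_part:
            paths.append(flush_part(buffer, out_dir, len(paths)))
            # The old rows are already on disk; start collecting a new batch.
            buffer = []
    if buffer:
        # Leftover rows that did not fill a whole part.
        paths.append(flush_part(buffer, out_dir, len(paths)))
    return paths


records = [{"uid": i, "age": i % 3} for i in range(5)]
part_paths = write_in_parts(records, tempfile.mkdtemp(), rows_per_part=2)
total_rows = sum(len(pd.read_csv(p)) for p in part_paths)
print(len(part_paths), total_rows)
```

Under this reading, the part files exist so that no more than one batch of rows has to sit in memory at a time.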
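For the version marked as usable with 8 GB: with 8 GB of RAM plus 4 GB of virtual memory, a MemoryError occurs after reading about 7.7 million rows of user data. With a subset of the data, 2.5 million rows fill the memory and 5 million rows raise a MemoryError.
For the version marked as usable with 16 GB: on a machine with 32 GB of memory, memory is fully used while reading the user data, and during the final iterations usage is stable at about 28 GB.
These figures come from runs on my own computer and on a lab computer respectively, and are for reference only. Thanks to the author as well.
(I have now started out in data science too.)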
I forgot to register. Could you share the dataset?
Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 2525, in get_loc
return self._engine.get_loc(key)
File "pandas\_libs\index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'aid'
During handling of the above exception, another exception occurred:
Hello. In the two statements on lines 52 and 53 of the code, what does 'creativeSize' mean, where does it come from, and what do these two statements do?
train_x=train[['creativeSize']]
test_x=test[['creativeSize']]
Could you share this data again? Thank you!
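I do not quite understand this part. After all the data is converted to sparse form, only the number of columns increases and the number of rows stays the same. Why can slicing not be used? Is it because the data contains multi-valued features? If all features were single-valued, would slicing work after the conversion to sparse form?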
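One possible explanation, offered as an assumption rather than a statement about the baseline: depending on the SciPy version, sparse.hstack may return a matrix in COO format, and COO matrices have traditionally not supported slicing. If that is the cause, it is unrelated to whether the features are single-valued or multi-valued, and converting the result to CSR makes row slicing available. A minimal sketch with made-up blocks:

```python
import numpy as np
from scipy import sparse

# Two made-up feature blocks with the same number of rows.
left = sparse.csr_matrix(np.arange(12).reshape(4, 3))
right = sparse.csr_matrix(np.ones((4, 2)))

# The storage format of the stacked result can differ between SciPy versions.
stacked = sparse.hstack((left, right))

# CSR supports efficient row slicing regardless of how the result was built.
csr = stacked.tocsr()
first_rows = csr[0:2]

print(first_rows.shape)
```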
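Why do I get this error when I run this baseline directly, without any changes? Thank you very much for your answer!
The error is raised at cv.fit(data[feature]), when the term-frequency vectors are built.
Could you explain in more detail?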