youchounobb / 2018-tencent-ad-competition-baseline

Baseline for the 2018 Tencent Advertising Algorithm Competition, online score 0.73

Python 100.00%

2018-tencent-ad-competition-baseline's Issues

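How do I save an LGBMClassifier model?

How can I save an LGBMClassifier after training so that I can load it directly the next time I need it?
Also, can the preprocessed matrix be saved as well? Rerunning the preprocessing every time is too slow.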
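
One possible approach, as a minimal sketch: pickle the fitted estimator and store the preprocessed sparse matrix with scipy.sparse.save_npz. The helper names save_model and load_model, the file names, and the stand-in dictionary are hypothetical; in the real script the object passed to save_model would be the fitted LGBMClassifier, which is expected to be picklable like other scikit-learn style estimators.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy import sparse


def save_model(model, path):
    """Serialize a fitted estimator to disk (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load an estimator previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)


workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "clf.pkl")        # hypothetical file name
matrix_path = os.path.join(workdir, "train_x.npz")   # hypothetical file name

# Stand-in for the fitted classifier; replace with the real clf.
stand_in_model = {"n_estimators": 100}
save_model(stand_in_model, model_path)
restored_model = load_model(model_path)

# Stand-in for the preprocessed feature matrix.
train_x = sparse.csr_matrix(np.array([[0, 1, 0], [2, 0, 3]]))
sparse.save_npz(matrix_path, train_x)
restored_x = sparse.load_npz(matrix_path)
```

Only load pickle files you created yourself, since unpickling runs arbitrary code.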

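CountVectorizer() needs its token_pattern parameter set

The default token_pattern is '(?u)\b\w\w+\b', which ignores tokens of length 1 such as '1' and '2', so those feature values are lost. If NaN is filled with '-1', consider setting token_pattern to '(?u)\b(?<!-)\d+\b'.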
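
The difference between the two patterns can be checked with the standard re module alone. This is a sketch that assumes CountVectorizer's default tokenizer applies token_pattern with a regex findall; the sample string and the tokenize helper are made up for illustration.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"        # scikit-learn default
SUGGESTED_PATTERN = r"(?u)\b(?<!-)\d+\b"  # pattern proposed in this issue


def tokenize(text, pattern):
    """Return every token that the pattern finds in text."""
    return re.findall(pattern, text)


sample = "1 2 13 -1"  # '-1' stands for a filled-in NaN
default_tokens = tokenize(sample, DEFAULT_PATTERN)      # single digits are dropped
suggested_tokens = tokenize(sample, SUGGESTED_PATTERN)  # digits kept, '-1' skipped

print(default_tokens)
print(suggested_tokens)
```

The same pattern string would then be passed as CountVectorizer(token_pattern=SUGGESTED_PATTERN).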

Getting this warning: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.

A question, please: I get this message when calling clf.predict:
DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.

What could be causing this? I'm new to this and don't understand it well; could you give me some pointers?
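
For context, the message is a warning rather than an error: it is emitted when some code evaluates an empty NumPy array as a boolean, for example in an `if arr:` test. Where that test lives in your own code, the check the message recommends looks like the sketch below; the helper name is_non_empty is hypothetical. If the test lives inside a library, upgrading that library is the more likely fix.

```python
import numpy as np


def is_non_empty(values):
    """Explicit emptiness check, as the warning text recommends."""
    return np.asarray(values).size > 0


empty = np.array([])
filled = np.array([0.2, 0.8])

print(is_non_empty(empty), is_non_empty(filled))
```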

Newbie question: these errors appeared after two hours of running. What went wrong?

Traceback (most recent call last):
File "E:/2018-tencent-ad-competition-baseline-master/bryan_baseline_v2.py", line 76, in batch_predict
data[feature] = LabelEncoder().fit_transform(data[feature].apply(int))
File "E:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 2551, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src/inference.pyx", line 1521, in pandas._libs.lib.map_infer
ValueError: invalid literal for int() with base 10: '1 2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "E:/2018-tencent-ad-competition-baseline-master/bryan_baseline_v2.py", line 115, in
result.append(batch_predict(slice,i))
File "E:/2018-tencent-ad-competition-baseline-master/bryan_baseline_v2.py", line 78, in batch_predict
data[feature] = LabelEncoder().fit_transform(data[feature])
File "E:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 112, in fit_transform
self.classes_, y = np.unique(y, return_inverse=True)
File "E:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\arraysetops.py", line 210, in unique
return _unique1d(ar, return_index, return_inverse, return_counts)
File "E:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\arraysetops.py", line 274, in _unique1d
perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
TypeError: '<' not supported between instances of 'int' and 'str'

Process finished with exit code 1
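
Reading the traceback above: apply(int) fails on the multi-valued entry '1 2', and the fallback then hands LabelEncoder a column that mixes int and str, which cannot be sorted. One possible workaround, sketched under the assumption that the column can be treated as categorical text, is to cast the whole column to str in the fallback. The helper name encode_column and the sample data are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_column(column):
    """Label-encode a column, falling back to string labels if int() fails."""
    try:
        values = column.apply(int)
    except (ValueError, TypeError):
        # Mixed int/str values cannot be compared, so make them all strings.
        values = column.astype(str)
    return pd.Series(LabelEncoder().fit_transform(values), index=column.index)


mixed = pd.Series([1, "1 2", 3, "1 2"])  # same kind of mix as in the traceback
numeric = pd.Series(["3", "1", "3"])

encoded_mixed = encode_column(mixed)
encoded_numeric = encode_column(numeric)
```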

Submission after baseline_v3 training is rejected: submission failed, file content format error

Submission failed: file content format error

Error when converting user features to a CSV file with user_feature_tocsv.py

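I got an error when following this instruction. What should I do? Thanks.
-- You can first use user_feature_tocsv.py to convert the user features into a CSV file, so that it can later be read directly with pd.read_csv

Error output:
... ...
11300000
11400000
Traceback (most recent call last):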
File "user_feature_tocsv.py", line 28, in
user_feature=pd.concat([pd.DataFrame('../data/userFeature_' + str(i) + '.csv') for i in range(cnt+1)])
File "user_feature_tocsv.py", line 28, in
user_feature=pd.concat([pd.DataFrame('../data/userFeature_' + str(i) + '.csv') for i in range(cnt+1)])
File "/home/wc/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 404, in init
raise ValueError('DataFrame constructor not properly called!')
ValueError: DataFrame constructor not properly called!
(tfpy35) wc@ubuntu:~/Desktop/Tencent/code$ ^C
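
The ValueError comes from passing a file path string to the pd.DataFrame constructor, which does not read files. A likely fix, sketched here with made-up chunk files in a temporary directory, is to read each chunk with pd.read_csv before concatenating. The column names and the value of cnt are assumptions for the demo.

```python
import os
import tempfile

import pandas as pd

data_dir = tempfile.mkdtemp()  # stands in for '../data/'
cnt = 1                        # index of the last chunk file (demo value)

# Create two small chunk files so the example is self-contained.
for i in range(cnt + 1):
    chunk = pd.DataFrame({"uid": [2 * i + 1, 2 * i + 2], "age": [3, 4]})
    chunk.to_csv(os.path.join(data_dir, "userFeature_" + str(i) + ".csv"), index=False)

# pd.read_csv loads each file; pd.DataFrame(path) would raise the ValueError above.
user_feature = pd.concat(
    [pd.read_csv(os.path.join(data_dir, "userFeature_" + str(i) + ".csv"))
     for i in range(cnt + 1)],
    ignore_index=True,
)
```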

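Question: when running tocsv, it reaches 11400000 and then hangs. Any pointers?

I have 8 GB of RAM. A comment on the blog says there is a way to split the data, but as a newcomer I'm not sure how. Is there a function for it?
I used del to drop the variable references, but memory usage still sits at 99%.
After rerunning, I found that memory is fine early on, going up and down, but it gets very high around 11400000. What is the cause?
This is my first time trying this; any advice is appreciated.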
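
On the question of a function for splitting the data: once the data is in CSV form, pd.read_csv accepts a chunksize argument and then yields DataFrames of at most that many rows, so only one piece is in memory at a time. A minimal sketch with an in-memory CSV follows; the columns and the chunk size are made up for the demo.

```python
import io

import pandas as pd

csv_text = "uid,age\n1,3\n2,4\n3,3\n4,5\n5,4\n"
rows_per_chunk = 2  # in practice something like 1_000_000

chunk_sizes = []
age_total = 0
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=rows_per_chunk):
    chunk_sizes.append(len(chunk))
    age_total += int(chunk["age"].sum())  # per-chunk work goes here

print(chunk_sizes, age_total)
```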

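The data download link is dead

I wanted to download the data from the link you shared, but the link has expired and the download fails. Could you share it again? Thanks.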

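bryan_baseline_v3.py line 98, train_x = sparse.hstack((train_x, train_a)): why use sparse.hstack rather than df.concat?

1. bryan_baseline_v3.py, line 98:
In train_x = sparse.hstack((train_x, train_a)),
why is sparse.hstack used rather than df.concat?

2. Lines 66-68:
one_hot_feature=['LBS','age','carrier','cons......
vector_feature =['appIdAction','a','....
The features in one_hot_feature need to be standardized ((x - u) / standard deviation),
while those in vector_feature need text feature extraction. So the code loops over the two groups separately and appends the results one column block at a time.
I'm not sure whether this understanding is correct.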
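
A possible reason for sparse.hstack, sketched under the assumption (suggested by the list names) that one_hot_feature columns are one-hot encoded and vector_feature columns go through CountVectorizer: both encoders return SciPy sparse matrices, and sparse.hstack joins them column-wise without converting them to dense DataFrames first. The toy values below are made up.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

age = np.array([[1], [2], [1]])  # one categorical column, three rows
interest = ["1 2", "2 3", "3"]   # one multi-valued column, three rows

# Both encoders return sparse matrices by default.
age_block = OneHotEncoder().fit_transform(age)
interest_block = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(interest)

# Column-wise concatenation that keeps the result sparse.
train_x = sparse.hstack((age_block, interest_block)).tocsr()
print(train_x.shape)
```

With millions of rows and many distinct values per feature, a dense representation of the same matrix would need far more memory.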

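I ran into a problem when running baseline v3

The problem is on this line:
slice = train[start:end]
The message is: can not do slice indexing on with these indexers [0.0] <type 'float'>
I printed start and end, and both are floats. Is this because my pandas version is too old to support this?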
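
Whatever the pandas version, computing the slice bounds as integers sidesteps the error. A minimal sketch using integer arithmetic only; the helper name batch_bounds and the demo sizes are hypothetical.

```python
def batch_bounds(n_rows, n_batches):
    """Yield integer (start, end) pairs that together cover range(n_rows)."""
    if n_rows <= 0 or n_batches <= 0:
        return
    size = -(-n_rows // n_batches)  # ceiling division, result stays an int
    for start in range(0, n_rows, size):
        yield start, min(start + size, n_rows)


bounds = list(batch_bounds(10, 3))
print(bounds)
```

Each pair can then be used as train[start:end]. If start and end are computed elsewhere as floats, wrapping them in int() has the same effect.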

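v3 still runs out of memory: killed after reading 4,000,000 rows

Hello, I have just installed a new 8 GB memory module in my computer. I'm on Ubuntu 16.04 and have set ulimit -c unlimited, but the v3 version is still killed after reading about 4,000,000 rows of user_feature.data. Is there any possible fix? Thank you very much!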

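Hello, I don't quite understand the tocsv script

Why does it write the data to CSV in separate parts and also write the data to a single CSV?
Why can it still write to the file even though userFeature_data has already been deleted with del inside the loop body and reset to empty?
I don't understand these two points; any guidance is appreciated.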
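
A sketch of the pattern the question seems to describe, assuming the script buffers rows, writes each full buffer to its own userFeature_<i>.csv, and then clears the buffer. The function write_in_chunks, its parameters, and the sample rows are hypothetical. The rows are already on disk when the buffer is cleared, so clearing only frees memory, and rebinding the name to an empty list lets the loop keep appending.

```python
import csv
import os
import tempfile


def write_in_chunks(rows, fieldnames, out_dir, chunk_size):
    """Write rows into userFeature_<i>.csv files of at most chunk_size rows."""
    paths = []

    def flush(buffered_rows):
        path = os.path.join(out_dir, "userFeature_%d.csv" % len(paths))
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(buffered_rows)
        paths.append(path)

    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == chunk_size:
            flush(buffer)  # the rows are on disk after this call
            del buffer     # drops the reference so the memory can be reclaimed
            buffer = []    # rebinding the name makes further appends valid
    if buffer:
        flush(buffer)      # leftover rows that did not fill a whole chunk
    return paths


demo_rows = [{"uid": i, "age": i % 3} for i in range(5)]
chunk_paths = write_in_chunks(demo_rows, ["uid", "age"], tempfile.mkdtemp(), 2)
```

The per-chunk files keep peak memory low while converting; combining them afterwards gives one file that later scripts can load in a single call.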

Roughly how long does the v3 version take to finish training?

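Measured memory usage

For the version the author marked as usable with 8 GB (tested with 8 GB RAM plus 4 GB of virtual memory): reading about 7.7 million rows of user data raises MemoryError. With a subset of the data, 2.5 million rows fills the memory and 5 million rows raises MemoryError.
For the version marked as usable with 16 GB (tested with 32 GB RAM): memory is fully occupied while reading the user data, and during the final iterations usage holds steady at around 28 GB.
These numbers come from my own computer and a lab computer respectively and are for reference only. Thanks to the author.
(This humble beginner has now taken the plunge into data science too.)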

Why are the training set and the test set set to be the same in the baseline?

Dataset

I forgot to register. Could you share the dataset?

Newbie question: getting KeyError: 'aid'

Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 2525, in get_loc
return self._engine.get_loc(key)
File "pandas_libs\index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas_libs\hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'aid'

During handling of the above exception, another exception occurred:
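
The traceback is cut off here, so the cause cannot be confirmed. A KeyError on a column name generally means the DataFrame has no column called 'aid', for example because the file was read with the wrong separator. A small diagnostic sketch; the helper missing_columns, the column 'uid', and the sample text are hypothetical.

```python
import io

import pandas as pd


def missing_columns(frame, required):
    """Return the required column names that are absent from frame."""
    return [name for name in required if name not in frame.columns]


text = "aid,uid\n1,10\n2,11\n"

# Wrong separator: the whole header becomes one column named 'aid,uid'.
wrong = pd.read_csv(io.StringIO(text), sep="\t")
right = pd.read_csv(io.StringIO(text))

print(list(wrong.columns), missing_columns(wrong, ["aid", "uid"]))
print(list(right.columns), missing_columns(right, ["aid", "uid"]))
```

Printing the column list right after loading each file usually shows quickly whether the header was parsed as expected.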
