themains / pydomains

Get the kind of content hosted by a domain based on the domain name

Home Page: http://pydomains.readthedocs.io/en/latest/

License: MIT License

Python 1.20% Jupyter Notebook 98.80%
domain-classifier lstm machine-learning dmoz phishtank domains

pydomains's People

Contributors

dependabot[bot], soodoku, suriyan

pydomains's Issues

Error while running pred functions

Hi.
I get the following error while running the pred functions: "AttributeError: module 'tensorflow.python.framework.ops' has no attribute '_TensorLike'"
I'm using the following packages:
pydomains (0.2.0)
Keras (2.3.1)
tensorflow (2.3.0)

Code:

import pandas as pd
from pydomains import *

df = pd.read_csv('D:/Google Drive/Datasets/Comscore 2018/domains_2018.zip')
df = df[1:100]

df_shalla   = pred_shalla(df, domain_names = 'domain_name')
df_toulouse = pred_toulouse(df, domain_names = 'domain_name')

Console Output:


df_toulouse = pred_toulouse(df, domain_names = 'domain_name')
Using cached Toulouse model data from local (E:\WPy64-3771\settings\.pydomains\toulouse_cat_lstm_others_2017.h5)...
Using cached Toulouse vocab data from local (E:\WPy64-3771\settings\.pydomains\toulouse_cat_vocab_others_2017.csv)...
Using cached Toulouse names data from local (E:\WPy64-3771\settings\.pydomains\toulouse_cat_names_others_2017.csv)...
Loading Toulouse model, vocab and names data file...
Traceback (most recent call last):

  File "<ipython-input-11-f19b92001d44>", line 1, in <module>
    df_toulouse = pred_toulouse(df, domain_names = 'domain_name')

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\pydomains\pred_toulouse.py", line 89, in pred_toulouse
    model, vocab, cats = load_model_data(year, latest)

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\pydomains\pred_toulouse.py", line 58, in load_model_data
    model = load_model(model_path)

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\engine\saving.py", line 492, in load_wrapper
    return load_function(*args, **kwargs)

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\engine\saving.py", line 584, in load_model
    model = _deserialize_model(h5dict, custom_objects, compile)

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\engine\saving.py", line 274, in _deserialize_model
    model = model_from_config(model_config, custom_objects=custom_objects)

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\engine\saving.py", line 627, in model_from_config
    return deserialize(config, custom_objects=custom_objects)

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\layers\__init__.py", line 168, in deserialize
    printable_module_name='layer')

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\utils\generic_utils.py", line 147, in deserialize_keras_object
    list(custom_objects.items())))

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\engine\sequential.py", line 302, in from_config
    model.add(layer)

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\engine\sequential.py", line 166, in add
    layer(x)

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\backend\tensorflow_backend.py", line 75, in symbolic_fn_wrapper
    return func(*args, **kwargs)

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\engine\base_layer.py", line 446, in __call__
    self.assert_input_compatibility(inputs)

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\engine\base_layer.py", line 310, in assert_input_compatibility
    K.is_keras_tensor(x)

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\backend\tensorflow_backend.py", line 695, in is_keras_tensor
    if not is_tensor(x):

  File "E:\WPy64-3771\python-3.7.7.amd64\lib\site-packages\keras\backend\tensorflow_backend.py", line 703, in is_tensor
    return isinstance(x, tf_ops._TensorLike) or tf_ops.is_dense_tensor_like(x)

AttributeError: module 'tensorflow.python.framework.ops' has no attribute '_TensorLike'
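
The last frame of the traceback shows Keras 2.3.1 reading `tf_ops._TensorLike`, an attribute that the installed TensorFlow 2.3.0 does not have. The following is a minimal diagnostic sketch, not part of pydomains. It assumes (this page does not confirm it) that the mismatch affects standalone Keras releases before 2.4 combined with TensorFlow 2.3 or later. The helper names are hypothetical, and `importlib.metadata` requires Python 3.8 or newer.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn a version string such as '2.3.0' into (major, minor)."""
    parts = []
    for piece in version.split(".")[:2]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def likely_tensorlike_error(keras_version, tf_version):
    """Guess whether this Keras/TensorFlow pair hits the _TensorLike error.

    Assumption (not confirmed by the issue): standalone Keras < 2.4 still
    reads tf_ops._TensorLike, and TensorFlow >= 2.3 no longer provides it.
    """
    return version_tuple(keras_version) < (2, 4) and version_tuple(tf_version) >= (2, 3)


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The versions reported in this issue:
print(likely_tensorlike_error("2.3.1", "2.3.0"))  # prints True
```

If the assumption holds, installing a matching pair of Keras and TensorFlow versions should avoid the error. Whether pydomains 0.2.0 loads its cached models under other version pairs is not stated in this issue.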

TypeError: expected string or bytes-like object

I have a data frame with URLs in the column 'resolved_url', as shown below. The package works well overall (and works successfully when using DMOZ). However, I keep getting a TypeError when using PhishTank, Shallalist, and Toulouse.

Here are the URLs in the 'resolved_url' column of my data frame. This column does not contain any NaN values.
df['resolved_url']
Out[28]:
0 http://giveaway.amazon.com\__CONNECTIONPOOL_ER...
1 https://twitter.com/tonythehuff/status/9271864...
2 http://giveaway.amazon.com\__CONNECTIONPOOL_ER...
3 http://pcktpro.com_CONNECTIONPOOL_ERROR_
Name: resolved_url, Length: 15486150, dtype: object

Here is the error output:
`
df_toulouse = pred_toulouse(df, domain_names = 'resolved_domain')
Downloading Toulouse model data from the server (toulouse_cat_lstm_others_2017.h5)...
98%|█████████▊| 1600.0/1631.5 [00:00<00:00, 12499.23KB/s]
Downloading Toulouse vocab data from the server (toulouse_cat_vocab_others_2017.csv)...
96%|█████████▌| 64.0/66.625 [00:00<00:00, 8090.28KB/s]
Downloading Toulouse names data from the server (toulouse_cat_names_others_2017.csv)...
100%|█████████▉| 64.0/64.080078125 [00:00<?, ?KB/s]
Loading Toulouse model, vocab and names data file...
Traceback (most recent call last):

File "", line 1, in
df_toulouse = pred_toulouse(df, domain_names = 'resolved_domain')

File "C:\Users\Simon\anaconda3\lib\site-packages\pydomains\pred_toulouse.py", line 100, in pred_toulouse
df[col_domain] = df[domain_names].apply(lambda c: url2domain(c, exclude_subdomains=['www']))

File "C:\Users\Simon\anaconda3\lib\site-packages\pandas\core\series.py", line 4356, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()

File "C:\Users\Simon\anaconda3\lib\site-packages\pandas\core\apply.py", line 1036, in apply
return self.apply_standard()

File "C:\Users\Simon\anaconda3\lib\site-packages\pandas\core\apply.py", line 1092, in apply_standard
mapped = lib.map_infer(

File "pandas\_libs\lib.pyx", line 2859, in pandas._libs.lib.map_infer

File "C:\Users\Simon\anaconda3\lib\site-packages\pydomains\pred_toulouse.py", line 100, in
df[col_domain] = df[domain_names].apply(lambda c: url2domain(c, exclude_subdomains=['www']))

File "C:\Users\Simon\anaconda3\lib\site-packages\pydomains\utils.py", line 66, in url2domain
tld = tldextract.extract(url)

File "C:\Users\Simon\anaconda3\lib\site-packages\tldextract\tldextract.py", line 296, in extract
return TLD_EXTRACTOR(url, include_psl_private_domains=include_psl_private_domains)

File "C:\Users\Simon\anaconda3\lib\site-packages\tldextract\tldextract.py", line 216, in __call__
SCHEME_RE.sub("", url)

TypeError: expected string or bytes-like object
`

I suspect this issue occurs because some of the URLs become NaN after your Toulouse processing, although there are no NaN values in my original URLs. I also looked through the dmoz_cat and pred_toulouse functions, and the code that handles URL strings seems similar, so I don't know why this issue happens only for PhishTank, Shallalist, and Toulouse, and not for DMOZ.

This issue occurs when I use another PC with the current version of your pydomains package installed.
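
The traceback ends in `SCHEME_RE.sub("", url)`, which raises this TypeError whenever `url` is not a string (for example a float NaN or `None`). Note also that the column displayed above is 'resolved_url', while the call passes `domain_names = 'resolved_domain'`. The contents of that second column are not shown here. The sketch below is one way to locate non-string entries before calling a `pred_*` function. The helper name and the sample values are hypothetical.

```python
def non_string_entries(values):
    """Return (position, value) pairs for every entry that is not a str."""
    return [(i, v) for i, v in enumerate(values) if not isinstance(v, str)]


# Hypothetical sample: two strings from the issue plus two non-string values.
sample = [
    "http://pcktpro.com_CONNECTIONPOOL_ERROR_",
    float("nan"),   # what pandas typically stores for a missing value
    None,
    "https://twitter.com/tonythehuff/status/9271864...",
]

bad = non_string_entries(sample)
print([position for position, _ in bad])  # prints [1, 2]
```

Because iterating over a pandas Series yields its values, the same helper can be called on the column actually passed to the function, e.g. `non_string_entries(df['resolved_domain'])`. Equivalently, `df['resolved_domain'].map(type).value_counts()` summarizes which Python types the column contains.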

pydomains as a service

Hi,

Hope you are all well!

I was trying out your interesting repository and was wondering whether it could be exposed as an HTTP service that predicts the category of a domain.

Thanks in advance for any insights or input on that question.

Cheers,
X
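
As a rough illustration of what such a service could look like, here is a minimal sketch built only on Python's standard `http.server` module. It is not part of pydomains. `placeholder_predict`, `build_response`, `PredictHandler`, `serve`, the `?domain=` query parameter, and the port are all hypothetical. The placeholder would have to be replaced by a wrapper around one of the `pred_*` functions, which take a data frame as in the issues above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def placeholder_predict(domains):
    """Hypothetical stand-in for a pydomains call; always answers 'unknown'."""
    return [{"domain": d, "category": "unknown"} for d in domains]


def build_response(path, predict=placeholder_predict):
    """Map a request path such as '/predict?domain=a.com' to (status, payload)."""
    domains = parse_qs(urlparse(path).query).get("domain", [])
    if not domains:
        return 400, {"error": "pass at least one ?domain= parameter"}
    return 200, {"predictions": predict(domains)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, payload = build_response(self.path)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer((host, port), PredictHandler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8000/predict?domain=example.com` would return the placeholder prediction as JSON. Keeping the request handling in `build_response` makes it testable without opening a socket. In a real wrapper the model should be loaded once at startup rather than per request.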
