yoannsculo / jobcatcher

jobcatcher's Introduction

JobCatcher

JobCatcher is a daemon that retrieves job offers from multiple job board feeds and generates custom RSS feeds and HTML reports for you. It is decentralized software meant to run on your own server.

JobCatcher comes with a filter feature, so you can filter company names with blacklists or whitelists.

Think of it as an RSS feed reader with a filter feature.
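As a rough illustration of the blacklist/whitelist idea, here is a minimal sketch in Python. The function name `filter_offers` and the offer dictionaries are hypothetical and not taken from JobCatcher's code; matching is assumed to be case-insensitive on the company name.

```python
def filter_offers(offers, blacklist=(), whitelist=()):
    """Keep offers whose company passes the lists (hypothetical helper).

    A non-empty whitelist keeps only the listed companies; the blacklist
    then removes any company that is still unwanted.
    """
    black = {name.lower() for name in blacklist}
    white = {name.lower() for name in whitelist}
    kept = []
    for offer in offers:
        company = offer["company"].lower()
        if white and company not in white:
            continue
        if company in black:
            continue
        kept.append(offer)
    return kept


# Sample data for illustration only.
offers = [{"company": "AUSY"}, {"company": "Acme"}]
visible = filter_offers(offers, blacklist=["ausy"])
```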

I would like this software to be released under the GPLv2 license, but I need to check whether it is compatible with the dependencies I've chosen.

Work in Progress

The project is under active development and many features remain to be implemented. It is written in Python. This is the first time I have used Python on a non-trivial project, so my code is probably not very pythonic... yet. Feel free to help me, point out mistakes I may have made, or suggest improvements.

Dependencies

python-html2text, python-requests, python-beautifulsoup

Usage (mainly development options for now)

--all              sync the blacklist, fetch the offers and generate reports
--feeds            download all the feeds in the config
--feed=JOBBOARD    download only the feed from JOBBOARD in the config
--pages            download all the pages in the config
--page=JOBBOARD    download only the pages from JOBBOARD in the config
--inserts          insert all pages into offers
--insert=JOBBOARD  insert JOBBOARD pages into offers
--moves            move data to offers
--move=JOBBOARD    move JOBBOARD data to offers
--clean=JOBBOARD   clean offers from JOBBOARD source
--report           generate a full report
--version          output version information and exit

Reports are generated into the local "www" directory.

I start jobcatcher.py -s manually with crontab for now. But this should change soon.

List of supported Job Boards

Apec.fr (France)
Cadreonline (France)
Eures (Europe)
PoleEmploi (France)
Progressive Recruitment (France)
RegionsJob:
  CentreJob (France)
  NordJob (France)
  PacaJob (France)
  RhoneJob (France)
  EstJob (France)
  OuestJob (France)
  SudOuestJob (France)
  ParisJob (France)

TODO

Lolix.org (France)
Linux.com (Int.)
L'eXpress-Board (France)
Remixjobs.com (France)

Installation

Debian, Ubuntu

# Install packages
apt-get update
apt-get install sqlite3 python-pip git 
pip install virtualenv virtualenvwrapper


# Configure virtualenvwrapper
cat << EOF >> ~/.bashrc
export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/Devel
source /usr/local/bin/virtualenvwrapper.sh
EOF
source ~/.bashrc

# Prepare jobcatcher environment
mkvirtualenv --no-site-packages -p /usr/bin/python2.7 jobcatcher
add2virtualenv /opt/JobCatcher

# Install jobcatcher project
cd /opt
git clone -b unstable https://github.com/badele/JobCatcher.git
cd JobCatcher
pip install -r requirements.txt

Usage

Edit config.py, then run:

workon jobcatcher
python jobcatcher.py --all

Help us add new job boards to JobCatcher! :)

Contributors

Yoann Sculo - www.yoannsculo.fr
Bruno Adelé - bruno.adele.im
Yankel Scialom - github

jobcatcher's Issues

Bug in -u function

./jobcatcher.py -u 'http://ec.europa.eu/eures/eures-searchengine/servlet/ShowJvServlet?lg=FR&pesId=62&uniqueJvId=009XWYB&nnImport=false'
Traceback (most recent call last):
  File "./jobcatcher.py", line 647, in <module>
    moduleClass = getattr(module, 'Apec')
AttributeError: 'module' object has no attribute 'Apec'

Dynamic report: empty pages

It seems that when the last dynamic page contains exactly offers_per_page offers, an empty page is added after it.
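One plausible cause (an assumption, since the report code is not shown here) is a page count computed as `total // offers_per_page + 1`, which overshoots by one whenever the total is an exact multiple of the page size. A minimal sketch of a ceiling division that avoids the trailing empty page; `page_count` is a hypothetical name:

```python
def page_count(total_offers, offers_per_page):
    """Number of pages needed, without a trailing empty page."""
    if total_offers == 0:
        return 1  # assumption: an empty report still shows one page
    # Ceiling division: an exact multiple does not add an extra page.
    return (total_offers + offers_per_page - 1) // offers_per_page
```

With 30 offers per page, 60 offers fit on two pages and the 61st offer opens a third.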

Add 'report' keyword to test.py for configs

$ python test.py
ERROR: test_jobcatcher (__main__.TestPackages)
Execute the jobcatcher functions
Traceback (most recent call last):
  File "test.py", line 157, in test_jobcatcher
    jobcatcher.executeall(configs)
  File "/home/yscialom/Projets/JobCatcher/jobcatcher.py", line 530, in executeall
    generatereport(conf)
  File "/home/yscialom/Projets/JobCatcher/jobcatcher.py", line 535, in generatereport
    r.generate()
  File "/home/yscialom/Projets/JobCatcher/jobcatcher.py", line 221, in generate
    self.generateReport(True)
  File "/home/yscialom/Projets/JobCatcher/jobcatcher.py", line 323, in generateReport
    self.header(report)
  File "/home/yscialom/Projets/JobCatcher/jobcatcher.py", line 239, in header
    if self.configs['report']['dynamic']:
KeyError: 'report'
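The traceback shows the report code reading `configs['report']['dynamic']`. Besides adding a `report` entry to the test configuration, the lookup itself could tolerate a missing key. A minimal sketch, assuming the configuration is a plain dictionary; `is_dynamic_report` is a hypothetical helper, and any other keys the `report` entry might need are unknown here:

```python
def is_dynamic_report(configs):
    """Return the report 'dynamic' flag, defaulting to False when absent."""
    return bool(configs.get("report", {}).get("dynamic", False))


# A test configuration that includes the missing entry.
test_configs = {"report": {"dynamic": True}}
```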

--clean option doesn't work anymore

(jobcatcher)yoann@thinkpad:~/dev/apec$ python jobcatcher.py --clean=Apec
Traceback (most recent call last):
  File "jobcatcher.py", line 1241, in <module>
    clean(configs, options.clean)
  File "jobcatcher.py", line 1095, in clean
    utilities.db_checkandcreate(conf)
  File "/home/yoann/dev/apec/utilities.py", line 183, in db_checkandcreate
    if not db_istableexists(configs, 'offers'):
  File "/home/yoann/dev/apec/utilities.py", line 189, in db_istableexists
    conn = lite.connect(configs['database'])
TypeError: 'Config' object has no attribute '__getitem__'

Improve location filter

JobCatcher doesn't work with Python 2.6 and 3.1

Bug seen on Debian

Add a new type of contract: Internship

Finalise the P2P implementation

Store original raw salary information and filtered one in two separate table fields

[Report] Add hour:minute information into pubdate column

It seems we lose information from the original publication date:

offer.date_pub.strftime('%Y-%m-%d %H:%M')

For instance, here we get only '2013-11-21 00:00' for all the offers fetched on 2013-11-21.
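A likely explanation (an assumption, as the parsing code is not shown) is that the publication date is parsed with a date-only format, which silently resets the time to midnight. The sketch below contrasts the two approaches on a made-up RFC 822 style date. It uses `email.utils.parsedate_to_datetime`, which requires Python 3.3 or later, whereas JobCatcher itself targets Python 2.7.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

raw = "Thu, 21 Nov 2013 14:35:00 +0100"  # made-up sample publication date

# Parsing the whole string keeps the hour and minute.
full = parsedate_to_datetime(raw)

# Round-tripping through a date-only format drops them.
date_only = datetime.strptime(full.strftime("%d/%m/%Y"), "%d/%m/%Y")

with_time = full.strftime("%Y-%m-%d %H:%M")
without_time = date_only.strftime("%Y-%m-%d %H:%M")
```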

[Report] Add a small div popup with the original data when we hover over 'NA' fields

It applies on the following fields:

salary
company

Requests connection errors (Connection reset by peer) crash JobCatcher

(jobcatcher)yoann@yoann:~/dev/perso/JobCatcher$ python jobcatcher.py --all
Download http://www.centrejob.com/fr/rss/flux.aspx?&fonction=10
Traceback (most recent call last):
  File "jobcatcher.py", line 879, in <module>
    executeall(configs)
  File "jobcatcher.py", line 650, in executeall
    downloadfeeds(conf)
  File "jobcatcher.py", line 679, in downloadfeeds
    downloadfeed(conf, jobboardname)
  File "jobcatcher.py", line 672, in downloadfeed
    plugin.downloadFeeds(feeds)
  File "/home/yoann/dev/perso/JobCatcher/jobcatcher.py", line 120, in downloadFeeds
    self.downloadFeed(feed, interval, forcedownload)
  File "/home/yoann/dev/perso/JobCatcher/jobcatcher.py", line 113, in downloadFeed
    utilities.downloadFile(feed['url'], datas, saveto, True, interval)
  File "/home/yoann/dev/perso/JobCatcher/utilities.py", line 151, in downloadFile
    r = download(url, datas)
  File "/home/yoann/dev/perso/JobCatcher/utilities.py", line 125, in download
    r = requests.get(url)
  File "/home/yoann/.virtualenvs/jobcatcher/local/lib/python2.7/site-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/home/yoann/.virtualenvs/jobcatcher/local/lib/python2.7/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/yoann/.virtualenvs/jobcatcher/local/lib/python2.7/site-packages/requests/sessions.py", line 361, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/yoann/.virtualenvs/jobcatcher/local/lib/python2.7/site-packages/requests/sessions.py", line 464, in send
    r = adapter.send(request, **kwargs)
  File "/home/yoann/.virtualenvs/jobcatcher/local/lib/python2.7/site-packages/requests/adapters.py", line 356, in send
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.centrejob.com', port=80): Max retries exceeded with url: /fr/rss/flux.aspx?&fonction=10 (Caused by <class 'socket.error'>: [Errno 104] Connection reset by peer)
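A minimal sketch of how a failed download could be recorded instead of aborting the whole run. `download_all` and its `fetch` parameter are hypothetical; in JobCatcher the callable would presumably wrap `requests.get`. The sketch relies on `requests.exceptions.RequestException` deriving from `IOError`, which is `OSError` on Python 3; on Python 2.7 one would catch `IOError` or `requests.exceptions.RequestException` instead.

```python
def download_all(urls, fetch):
    """Fetch every URL, recording failures instead of crashing.

    `fetch` is any callable taking a URL, for example (assumption)
    lambda url: requests.get(url, timeout=30).
    """
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except OSError as exc:
            # Covers socket errors and requests' ConnectionError alike.
            failures[url] = exc
    return results, failures
```

The caller can then log `failures` and carry on with the feeds that did download.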

Store the distance range as a configuration parameter

Profile and optimize the javascript code

Add Lolix.org Jobboard

Add a dynamic pagination system

− yoannsculo: "I have collected 6481 offers in my database. It usually takes a while (a few seconds) to load the entire page. With the JavaScript enabled, Firefox tells me the jQuery scripts are not responding. I don't know how we could solve this. Either we use a paging system or we find another way to lighten the page. If you want, I can send you my database."

yscialom: "Since the final aim of JobCatcher is to display fresh job offers to the user, a database as big as 6481 entries seems a bit overkill. Don't we want to automatically delete old offers (say, older than two months)? In addition to that, I'm willing to add a dynamic pagination system. I'm happy to know, though, that my code isn't working properly for big reports; I'll work on that."
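The automatic deletion suggested above could look roughly like this. It is only a sketch: the `offers` table and `date_pub` field are mentioned elsewhere on this page, but storing `date_pub` as ISO-formatted text is an assumption, as are the function name `purge_old_offers` and the 60-day default standing in for "two months".

```python
import sqlite3
from datetime import datetime, timedelta


def purge_old_offers(conn, max_age_days=60, now=None):
    """Delete offers published more than max_age_days ago.

    Assumes a hypothetical schema: offers(date_pub TEXT, ISO formatted).
    Returns the number of deleted rows.
    """
    now = now or datetime.now()
    cutoff = (now - timedelta(days=max_age_days)).strftime("%Y-%m-%d %H:%M:%S")
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("DELETE FROM offers WHERE date_pub < ?", (cutoff,))
    return cur.rowcount
```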

--flush option doesn't work anymore

(jobcatcher)yoann@thinkpad:~/dev/apec$ python jobcatcher.py --flush
Error column company is not unique:

Filtered report displays blacklisted companies

Maybe linked to #35

I can see AUSY and AKKA TECHNOLOGIES offers even though they are in the blacklist table.

jobcatcher -s crashes when a job provider website is not (x)html compliant?

See attached log:
$ ./jobcatcher.py -s
Fetching CENTREJOB
main date Sat, 12 Oct 2013 00:00:00 +0200
Downloading http://www.centrejob.com/clients/offres_chartees/offre_chartee_modele.aspx?numoffre=77148&de=consultation#xtor=RSS-105217407
Processing 77148
Downloading http://www.centrejob.com/clients/offres_chartees/offre_chartee_modele.aspx?numoffre=77143&de=consultation#xtor=RSS-105217407
Processing 77143
Fetching PROGRESSIVE
Traceback (most recent call last):
  File "./jobcatcher.py", line 220, in <module>
    bot.run()
  File "./jobcatcher.py", line 135, in run
    item.fetch()
  File "/home/yscialom/Etudes/recherche-emploi/JobCatcher/jobboards/Progressive.py", line 103, in fetch
    self.fetch_url(url)
  File "/home/yscialom/Etudes/recherche-emploi/JobCatcher/jobboards/Progressive.py", line 41, in fetch_url
    xmldoc = minidom.parseString( content )
  File "/usr/lib/python2.7/xml/dom/minidom.py", line 1931, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: mismatched tag: line 58, column 6
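The traceback ends with `minidom.parseString` raising `ExpatError` on content that is not well-formed XML. A minimal sketch of skipping such a document rather than crashing; `parse_feed` is a hypothetical wrapper, not the actual `fetch_url` code:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parse_feed(content):
    """Return a DOM document, or None if the content is not well-formed."""
    try:
        return minidom.parseString(content)
    except ExpatError as exc:
        print("Skipping malformed document: %s" % exc)
        return None
```

Returning `None` lets the caller move on to the next job board. A more tolerant HTML parser would be another option for pages that are plain HTML rather than XHTML.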

Save read offers, add filter to only unread offers

The filter by publication date (aka pubdate) will make it possible to show only unread offers. An offer is considered read once it has been displayed for more than x seconds.

Add a French company filtering (like salary and location)

Pagination buttons are floating on the page when we scroll down

Dynamic report does not work on google chrome

Google Chrome (tested version 30) does not accept the following syntax:

attribute: function(arg = default_value) {
   ...
}

It may also refuse any sort of default value for a function argument.

--p2pinit error: undefined getjobboardlist

(jobcatcher)yoann@thinkpad:~/dev/apec$ python jobcatcher.py --p2pinit
Traceback (most recent call last):
  File "jobcatcher.py", line 1307, in <module>
    p.initcache()
  File "jobcatcher.py", line 421, in initcache
    plugins = getjobboardlist(self.configs)
NameError: global name 'getjobboardlist' is not defined

[Report] Display the number of dynamic filtered results

It would be useful to display the total number of results when we use filters in the interface.

We could maybe put it on the left of the pagination buttons.

Report with javascript filtering: salary filter not working properly

The function SalaryFilter.priv_range_from_string is buggy.

Add UI element to page navigation to intimate a third page

Suggestion:

Remove .page files when the content shows an unavailable offer

See f013124

Full report shows fewer offers than the filtered one

The full report page behaves very strangely.

Put a CSS dress&makeup onto all that

− yscialom: "The whole CSS is still to be done. I'd like to merge the refresh/reset buttons into the text bars (to get something like this http://www.textfixer.com/tutorials/html-search-box.php).
I'm not sure I get what you mean about the calendar though :/."

Bug when the column selected for sorting is changed

It only sorts the current page, with no consideration of the rest of the table. A click on a column header should sort the whole table and reset it to its first page.
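The expected behaviour is to sort the full table first and only then cut it into pages. The report itself is JavaScript; the logic is sketched here in Python for brevity, with a hypothetical function name and rows represented as dictionaries:

```python
def sort_and_paginate(rows, key, per_page):
    """Sort the full table, then split it into pages.

    Returns the pages and the index of the page to display, which is
    reset to the first page whenever the sort column changes.
    """
    ordered = sorted(rows, key=lambda row: row[key])
    pages = [ordered[i:i + per_page] for i in range(0, len(ordered), per_page)]
    current_page = 0
    return pages, current_page
```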

Report with javascript filtering: filter NA salaries

Add a checkbox to keep or remove the N/A salaries.

[Report] 'NA' Salary filter seems to filter offers without 'NA' strings

For example, I can see an offer with the 'SMIC' string when I don't enable NA filtering. If I do, I can't find the SMIC offer anymore. The same goes for other strings such as 'Negociable', 'A determiner', 'selon CCN51', ... which should not disappear until I put them in the filter :P
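The intended behaviour can be stated as: only values that really are 'NA' get hidden, and any other free-text salary stays visible. A minimal sketch in Python (the real filter lives in the report's JavaScript); the marker set and function names are assumptions:

```python
# Assumption: the values that count as "no salary information".
NA_MARKERS = {"na", "n/a", ""}


def is_na_salary(raw):
    """True only for missing or explicit NA values, not for free text."""
    return raw is None or raw.strip().lower() in NA_MARKERS


def apply_na_filter(offers, hide_na):
    """Return the offers to display, hiding NA salaries when requested."""
    if not hide_na:
        return list(offers)
    return [offer for offer in offers if not is_na_salary(offer["salary"])]
```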

Use random user agent

To avoid being blocked by job board sites, use a random user agent when downloading pages.
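A minimal sketch of the idea. The user-agent strings below are placeholders for illustration, not real browser strings, and `random_headers` is a hypothetical helper:

```python
import random

# Placeholder values only; real browser user-agent strings would go here.
USER_AGENTS = [
    "ExampleBrowser/1.0 (placeholder)",
    "ExampleBrowser/2.0 (placeholder)",
    "OtherBrowser/3.1 (placeholder)",
]


def random_headers(rng=random):
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}
```

With the requests library, the result can be passed as `requests.get(url, headers=random_headers())`.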

Jobcatcher is blocked when there is an error in a Jobboard class

Traceback (most recent call last):
  File "jobcatcher.py", line 1261, in <module>
    executeall(configs, selecteduser)
  File "jobcatcher.py", line 1000, in executeall
    downloadfeeds(conf, selecteduser)
  File "jobcatcher.py", line 1032, in downloadfeeds
    plugin = utilities.loadJobBoard(jobboardname, conf)
  File "/home/yoann/dev/perso/JobCatcher/utilities.py", line 175, in loadJobBoard
    'jobboards.%s' % jobboardname
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named cadresonline

It should not be blocking.
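The traceback shows the plugin being loaded with `importlib.import_module('jobboards.%s' % jobboardname)`. A minimal sketch of making that step non-blocking by collecting import failures and continuing with the other job boards; `load_job_boards` is a hypothetical name, and only `ImportError` is handled here, so other errors raised inside a job board module would need their own handling:

```python
import importlib


def load_job_boards(names, package="jobboards"):
    """Import each job board module, skipping those that fail to import."""
    loaded, errors = {}, {}
    for name in names:
        try:
            loaded[name] = importlib.import_module("%s.%s" % (package, name))
        except ImportError as exc:
            errors[name] = exc  # report it later, keep going
    return loaded, errors
```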

Refactoring the class, move class to jc directory

License: choose the JobCatcher license and edit any file to be compliant with it

Why not MIT?
What needs to be added to the files to be MIT compliant?

Report with javascript filtering: missing new line after </tr>

Eures - strftime error in analyzePage

Error in unstable (57e3d23) with jobcatcher -a

==================================
Eures
==================================
Download http://ec.europa.eu/eures/eures-searchengine/servlet/./ShowJvServlet?lg=FR&pesId=62&uniqueJvId=004GKRV&nnImport=false 
Download http://ec.europa.eu/eures/eures-searchengine/servlet/./ShowJvServlet?lg=FR&pesId=62&uniqueJvId=004GPYD&nnImport=false 
Download http://ec.europa.eu/eures/eures-searchengine/servlet/./ShowJvServlet?lg=FR&pesId=62&uniqueJvId=007PFFX&nnImport=false 
Download http://ec.europa.eu/eures/eures-searchengine/servlet/./ShowJvServlet?lg=FR&pesId=62&uniqueJvId=007WGKK&nnImport=false 
Download http://ec.europa.eu/eures/eures-searchengine/servlet/./ShowJvServlet?lg=FR&pesId=62&uniqueJvId=008RSZH&nnImport=false 
Download http://ec.europa.eu/eures/eures-searchengine/servlet/./ShowJvServlet?lg=FR&pesId=62&uniqueJvId=008TBSP&nnImport=false 
Download http://ec.europa.eu/eures/eures-searchengine/servlet/./ShowJvServlet?lg=FR&pesId=62&uniqueJvId=009XWRX&nnImport=false 
Download http://ec.europa.eu/eures/eures-searchengine/servlet/./ShowJvServlet?lg=FR&pesId=62&uniqueJvId=009XWRZ&nnImport=false 
Download http://ec.europa.eu/eures/eures-searchengine/servlet/./ShowJvServlet?lg=FR&pesId=62&uniqueJvId=009XWSG&nnImport=false 
Download http://ec.europa.eu/eures/eures-searchengine/servlet/./ShowJvServlet?lg=FR&pesId=62&uniqueJvId=009XWYB&nnImport=false 
Traceback (most recent call last):
  File "./jobcatcher.py", line 638, in <module>
    executeall(configs)
  File "./jobcatcher.py", line 544, in executeall
    pagesinsert(conf)
  File "./jobcatcher.py", line 574, in pagesinsert
    fd.analyzesPages()
  File "./jobcatcher.py", line 79, in analyzesPages
    plugin.analyzePage(content.url, content.page)
  File "/home/yoann/dev/apec/jobboards/Eures.py", line 81, in analyzePage
    "%d/%m/%Y").strftime('%s')
TypeError: must be string, not None
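The error suggests that the date string extracted from the page was `None` when it reached `strptime`. A minimal sketch of a guarded conversion; `parse_pubdate` is a hypothetical helper. It also replaces `strftime('%s')`, which is a platform-specific extension, with `calendar.timegm`, and that choice treats the date as UTC (an assumption).

```python
import calendar
from datetime import datetime


def parse_pubdate(text, fmt="%d/%m/%Y"):
    """Return a Unix timestamp, or None if the date is missing or invalid."""
    if not text:
        return None
    try:
        parsed = datetime.strptime(text.strip(), fmt)
    except ValueError:
        return None
    # Interpret the parsed date as UTC (assumption).
    return calendar.timegm(parsed.timetuple())
```

The caller can then decide what to store when no publication date could be extracted.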

a small openstreetmap square map with a location information
the whole offer information and content
a link to the jobboard

    All 575 offers
    538 filtered offers (93.57%)
    37 blacklisted offers (6.43%)

I have 30 offers / page

All offers -> 17 pages with 17 offers left on the 18th (total of 527 offers)
filtered offers -> 16 pages with 11 offers left on the 17th (total of 491 offers)

$ pwd
.../JobCatcher/www
$ ../jobcatcher --all
Traceback (most recent call last):
  File "../jobcatcher.py", line 905, in <module>
    executeall(configs)
  File "../jobcatcher.py", line 675, in executeall
    initblacklist(conf)
  File "../jobcatcher.py", line 691, in initblacklist
    utilities.blocklist_load(conf)
  File "/home/yscialom/Projets/JobCatcher/utilities.py", line 279, in blocklist_load
    fp = open('blacklist_company.txt', 'r')
IOError: [Errno 2] No such file or directory: 'blacklist_company.txt'
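The traceback above fails because 'blacklist_company.txt' is opened relative to the current working directory (here `www`). A minimal sketch of resolving data files relative to the program's own directory instead; `resource_path` is a hypothetical helper:

```python
import os


def resource_path(filename, base_dir=None):
    """Resolve a data file relative to the program directory, not the cwd."""
    if base_dir is None:
        # Directory containing this source file.
        base_dir = os.path.dirname(os.path.abspath(__file__))
    return os.path.join(base_dir, filename)
```

Opening `resource_path('blacklist_company.txt')` would then work regardless of the directory the program is started from.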
