son-of-gridengine / sge

License: Other

Emacs Lisp 0.01% Roff 0.10% Makefile 1.63% HTML 1.23% C 69.35% C++ 3.51% Shell 1.09% Java 21.49% M4 0.01% Objective-C 0.11% Lex 0.06% Yacc 0.03% Perl 0.60% TeX 0.58% Prolog 0.04% XSLT 0.02% Ruby 0.08% DTrace 0.03% Tcl 0.05%

sge's Introduction

Son of Grid Engine

This GitHub repository was intended as the new home for the Son of GridEngine project, as we believed that the principal maintainer, Dave Love, was out of contact and had stopped maintaining it. As Dave Love has surfaced again, this repository doesn't currently serve much purpose.

This README is very much a TODO.

See source/README.BUILD for information on building and installation; the GUI installer may or may not be available, depending on the platform and packaging. See also other source/README.* files.

sge's Issues

Debian package structure and pull-upstream-changes process

https://salsa.debian.org/hpc-team/gridengine
@0xaf1f, I'd like to understand how that repo relates to this repo and how you pulled in changes.
In that repo, you seem to have the stuff from this repo, but also a debian/ subdirectory that contains a copy of sources/. Is that just a copy of the top-level sources/?

Importing the original repo

Making a separate issue for this.

@Kunzol told me by email:

as written on Github, here is the link to the Git repo I created from the DARCS repo of SGE.

https://gitlab.bfabric.org/schmidt/sge

As far as I can see this is equal to the DARCS repo (diff).

I tried importing his repo and it's the same as this repo where expected.

 README                                                               |  7 -------
 README.md                                                            | 15 +++++++++++++++
 debian/rules                                                         |  0
 ... all other diffs are empty.

Unfortunately that's a problem, because I previously determined that this repo is out of sync with the release tarballs that the Debian people were using, and it has been out of sync for some time, at least since 8.1.3. Basically, the release tags in this repo don't line up with the release tarballs. It seems Dave had two repos. This repo may correspond to his "master" version, but I believe there was another "release" version, or something like that. Do you think you could try the same process for the release version? I think it's more efficient than my process.

Once you have that, I can test whether it lines up with what the Debian people have, and try to figure out whether his master repo had important differences from the release repo that we need to keep.
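A check of this kind (does an unpacked release tarball match a checkout of the corresponding tag?) can be sketched as a file-by-file digest comparison. This is only an illustration: the function names are made up here, and both directories are assumed to have been unpacked or checked out beforehand.

```python
import hashlib
import os


def file_digests(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata so only the project content is compared.
        dirnames[:] = [d for d in dirnames if d not in (".git", "_darcs")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # Compare symlinks by their target, not by what they point to.
                data = os.readlink(path).encode()
            else:
                with open(path, "rb") as handle:
                    data = handle.read()
            digests[os.path.relpath(path, root)] = hashlib.sha256(data).hexdigest()
    return digests


def compare_trees(tarball_dir, checkout_dir):
    """Return (only_in_tarball, only_in_checkout, differing) as sorted lists."""
    left = file_digests(tarball_dir)
    right = file_digests(checkout_dir)
    only_left = sorted(set(left) - set(right))
    only_right = sorted(set(right) - set(left))
    differing = sorted(p for p in set(left) & set(right) if left[p] != right[p])
    return only_left, only_right, differing
```

Three empty lists would mean the tag and the tarball agree. Anything else gives a short list of files to inspect by hand. File modes are not compared, so a mode-only change such as the one on debian/rules above would not show up.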

sge_qmaster segfaulting

Discussed in issue #3, but making a separate issue as it's a separate problem.
The problem @entn-at had. I will try to look at @Kunzol's patch over the weekend.

But note: this may actually be linked in a different way to issue #3 because @entn-at was not able to build from source in the first place! Anyway we should fix the bug.

Importing https://gitlab.com/loveshack/sge.git

Per list discussions here
http://gridengine.org/pipermail/users/2018-May/thread.html
(search for 'Son of GridEngine succession'), I am creating this organization and repo, and attempting to import https://gitlab.com/loveshack/sge.git. Unfortunately, due to GitLab bugs, that repo has errors ('git fsck' fails due to an issue mentioned here https://stackoverflow.com/questions/21971941/invalid-author-committer-line-missing-space-before-email).
I am running the fix suggested on that page. It may end up changing the commit hashes.
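Why a repair like this changes commit hashes can be shown with a small sketch. Git names every object by the SHA-1 of a short header plus the object's bytes, so editing an author or committer line yields a different ID, and every descendant commit changes too because each one records its parent's ID. The code below is an illustration only, not the fix from the linked page; the commit text and the helper names are hypothetical.

```python
import hashlib
import re


def git_object_id(kind, body):
    """SHA-1 object ID as Git computes it: hash of b'<kind> <size>\\0' + body."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def add_space_before_email(commit_body):
    """Insert the missing space in 'author Name<email>' style header lines."""
    # Only the header (everything before the first blank line) is touched,
    # so the commit message itself is left alone.
    header, sep, message = commit_body.partition(b"\n\n")
    fixed = re.sub(
        rb"(?m)^((?:author|committer) [^<\n]*[^ <\n])<", rb"\1 <", header
    )
    return fixed + sep + message


# Hypothetical malformed commit: no space between the name and '<'.
broken = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author Jane Doe<jane@example.org> 1526000000 +0000\n"
    b"committer Jane Doe<jane@example.org> 1526000000 +0000\n"
    b"\n"
    b"Import\n"
)
fixed = add_space_before_email(broken)
print(git_object_id(b"commit", broken))
print(git_object_id(b"commit", fixed))  # differs from the line above
```

Any existing references to the old hashes (tags, issue comments, downstream forks) will therefore no longer match after the rewrite.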

Document the Debian package build process (+other build processes)

@0xaf1f, I am hoping you can help with this. We probably need to do this before we do anything else, because the compilation instructions in this repo are pretty hard to follow.

What I am thinking is that it would be good to have instructions documenting how to build the Debian package, starting from what the base machine is (let's assume people can use cloud services to spin up a particular base machine) and what needs to be installed first. If you get compilation errors, I likely already have patches for those, as I did manage to compile.

More generally, going forward I think we need clear documentation of the build process on different platforms, preferably with scripts that check dependencies and automate that process. The existing tools make it very unclear. I also want to identify the build processes that "matter", and work on those first. I imagine those are:

build Debian package
build RedHat package

After that, we should find out whether it's possible to build this on Mac, BSD, or Windows (those are probably lower priority).
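A dependency check of the kind proposed above could start as small as the sketch below. The tool list is a placeholder chosen for illustration, not a verified list of what the build needs; the real set depends on the platform and on the aimk options used.

```python
import shutil

# Hypothetical tool list; replace with the tools each build process needs.
REQUIRED_TOOLS = ["cc", "make", "csh"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing build tools: " + ", ".join(missing))
else:
    print("All listed build tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests. A per-platform script would extend the list and map each missing tool to the package that provides it.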

Revive this?

@njoly @bodgerer I'm wondering how much you guys know about GridEngine internals, or if you know anyone else who might? Dave Love seems to have disappeared again, and I'm wondering who else might have a deep knowledge of GridEngine and be willing to maintain a GitHub-based version of the project?

hwloc/autogen/config.h error

Dear developers,
I am trying to install SGE on my Linux 18.04 machine.
When I launch ./aimk -no-java -no-jni -no-secure -spool-classic I encounter the following error:

cc -DSGE_ARCH_STRING=\"lx-amd64\" -O2 -Wstrict-prototypes -DLINUXAMD64 -DLINUXAMD64 -D_GNU_SOURCE -DGETHOSTBYNAME_R6 -DGETHOSTBYADDR_R8  -DTARGET_64BIT  -DSGE_PQS_API -DSPOOLING_classic  -DHAVE_HWLOC=1 -DNO_JNI -DCOMPILE_DC -D__SGE_COMPILE_WITH_GETTEXT__  -D__SGE_NO_USERMAPPING__ -I../common -I../libs -I../libs/uti -I../libs/juti -I../libs/gdi -I../libs/japi -I../libs/sgeobj -I../libs/cull -I../libs/comm -I../libs/comm/lists -I../libs/sched -I../libs/evc -I../libs/evm -I../libs/mir -I../daemons/common -I../daemons/qmaster -I../daemons/execd -I../clients/common -I. -fPIC -c ../libs/sgeobj/sge_binding.c
In file included from ../libs/uti/sge_binding_hlp.h:45:0,
                from ../libs/sgeobj/sge_binding.h:43,
                from ../libs/sgeobj/sge_binding.c:39:
../libs/uti/hwloc.h:49:10: fatal error: hwloc/autogen/config.h: No such file or directory
#include <hwloc/autogen/config.h>
         ^~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
../libs/sgeobj/Makefile:184: recipe for target 'sge_binding.o' failed
make: *** [sge_binding.o] Error 1
not done

Could you help me resolve this problem?

Marie
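The error says the compiler could not find hwloc/autogen/config.h in any include directory. That header normally comes with the hwloc development files (on Debian and Ubuntu these are typically in the libhwloc-dev package), so a first step is to check whether it is installed at all. A minimal sketch, assuming the usual Linux include directories; the directory list and function name are illustrative:

```python
import os

# Assumed search path: common system include directories on Linux.
DEFAULT_INCLUDE_DIRS = ["/usr/include", "/usr/local/include"]


def find_header(header, include_dirs):
    """Return the first include directory that contains `header`, or None."""
    for directory in include_dirs:
        if os.path.isfile(os.path.join(directory, header)):
            return directory
    return None


location = find_header("hwloc/autogen/config.h", DEFAULT_INCLUDE_DIRS)
if location is None:
    print("hwloc/autogen/config.h not found; hwloc development files may be missing")
else:
    print("found under " + location)
```

If the header is reported as found but the build still fails, the directory is probably not on the include path that the generated compile command uses.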

Deleting exec-host with jobs in 'dr' state is allowed

See email chain pasted below.
The basic issue, I believe, is that you can run
qconf -de some_host
when there are jobs in state 'dr' on that host. That crashes the gridengine master, and it cannot be restarted; the message in /var/spool/gridengine/qmaster/messages is:
11/10/2018 16:23:27| main|deb8qmaster|C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

I'm not sure which part of the code deals with this; it should probably be fixed.
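Until the qmaster refuses such a deletion itself, an administrator could check for jobs on the host before running qconf -de. The sketch below parses job listing text and reports jobs whose queue instance is on a given host. The sample text is hypothetical and only modeled on the usual column layout of `qstat -u '*'`. Host names are compared exactly, so short and fully qualified names would need normalizing first.

```python
def jobs_on_host(qstat_output, host):
    """Return (job_id, state) pairs for jobs whose queue instance is on `host`."""
    found = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; this skips the header and rule.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        # The queue instance looks like "queue@host"; pending jobs have none.
        for field in fields[5:]:
            queue, at, queue_host = field.partition("@")
            if at and queue_host == host:
                found.append((fields[0], fields[4]))
                break
    return found


# Hypothetical listing used for illustration.
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue          slots ja-task-ID
--------------------------------------------------------------------------------------------------
    231 0.55500 train.sh   dan          dr    11/10/2018 16:00:01 all.q@node1        1
    232 0.55500 train.sh   dan          r     11/10/2018 16:01:12 all.q@node2        1
    233 0.00000 train.sh   dan          qw    11/10/2018 16:05:40                    1
"""

for job_id, state in jobs_on_host(SAMPLE, "node1"):
    print("job " + job_id + " is still on node1 in state " + state)
```

A wrapper script would refuse to call qconf -de whenever this list is non-empty.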

I was able to fix it, although I suspect that my fix may have been disruptive to the jobs.

Firstly, I believe the problem was that gridengine does not handle a deleted job that is on a host that has been deleted, and it dies when it sees it. Presumably the bug is in allowing the host to be deleted in the first place.

Anyway, my fix (after backing up the directory /var/spool/gridengine) was to move the file /var/spool/gridengine/spooldb/sge_job to a temporary location, restart the qmaster, add the host back with qconf -ah, stop the qmaster, restore the old database  /var/spool/gridengine/spooldb/sge_job, and restart the qmaster.
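The sequence above can be written down as an ordered plan for review before anything is run. This sketch only builds and prints the command lines; it executes nothing. The spool paths and the unit name are those quoted in the report, while the stash directory, the `cp -a` backup step, and the function names are assumptions made for illustration.

```python
import shlex

# Paths and unit name as quoted in the report.
SPOOL_DIR = "/var/spool/gridengine"
JOB_DB = SPOOL_DIR + "/spooldb/sge_job"
QMASTER_UNIT = "gridengine-qmaster"


def recovery_plan(host, stash_dir):
    """Return the reported recovery steps as an ordered list of argv lists."""
    stashed_db = stash_dir + "/sge_job"
    return [
        ["cp", "-a", SPOOL_DIR, stash_dir + "/gridengine.backup"],  # back up first
        ["mv", JOB_DB, stashed_db],            # set the job database aside
        ["systemctl", "start", QMASTER_UNIT],  # qmaster starts without the jobs
        # The report says `qconf -ah`; in qconf, -ah registers an administrative
        # host and -ae adds an execution host, so the option needed may differ.
        ["qconf", "-ah", host],
        ["systemctl", "stop", QMASTER_UNIT],
        ["mv", stashed_db, JOB_DB],            # restore the original job database
        ["systemctl", "start", QMASTER_UNIT],
    ]


def render(plan):
    """Format the plan as shell command lines for review; nothing is executed."""
    return [shlex.join(step) for step in plan]


for line in render(recovery_plan("some_host", "/root/stash")):
    print(line)
```

A real script would also confirm after each stop that the sge_qmaster process has exited before touching the database.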

Before doing that whole procedure, to stop the hosts getting confused, I stopped all the gridengine-exec services. That probably wasn't optimal, because clients like qsub and qstat would still have been able to access the queue in the interim, and it definitely would have confused them and killed some processes. Unfortunately I had to do this on short notice and wasn't sure how to use iptables to close off those ports from outside the qmaster while I did the maintenance; that would have been a better solution.

Also, I encountered a hiccup: `systemctl stop gridengine-qmaster` didn't actually work the second time. The process was still running with the old database, so I had to kill it manually and retry.

Anyway this whole episode is making me think more seriously about moving to Univa GridEngine.  I've known for a long time that the free version has a lot of bugs, and I just don't have time to deal with this type of thing.



On Sat, Nov 10, 2018 at 4:49 PM Marshall2, John (SSC/SPC) <[email protected]> wrote:
Hi,

I've never seen this but I would start with:
1) strace qmaster during restart to try to see at which point it is dying (e.g.,
loading a config file)
2) look for any reference to the name of the host you deleted in the spool
area and do some cleanup
3) clean out the jobs spool area

HTH,
John

On Sat, 2018-11-10 at 16:23 -0500, Daniel Povey wrote:
Has anyone found this error, and managed to fix it?
I am in a very difficult situation.
I deleted a host (qconf -de hostname) thinking that the machine no longer existed, but it did exist, and there was a job in 'dr' state there.
After I attempted to force-delete that job (qdel -f job-id), the queue master died with out-of-memory, and now I can't restart qmaster.

So now I don't know how to fix it. Am I just completely lost now?

Dan
