Comments (17)

yaoweibin commented on August 21, 2024

Hi, Kallen,

Thanks for your report.

Can you show me your debug logs? I can't reproduce your problem on my box.

On 2011-10-21 1:48, kallen wrote:

hello yaoweibin,

we're using your upstream module and recently ran into this issue posted to the nginx forum. do you have any insight? as i said in the forum post, i have a pile of strace, lsof, debug logs, etc. but no core file yet.

http://forum.nginx.org/read.php?2,216933

thanks,
kallen

Weibin Yao

from nginx_upstream_check_module.

groknaut commented on August 21, 2024

hello. i put all the various debugging data here:

http://groknaut.net/nginx/

in particular, here's the debug error log:
http://groknaut.net/nginx/nginx.error.log.gz

thanks,
kallen

yaoweibin commented on August 21, 2024

Hi, Kallen,

This may be the problem:

509827 2011/10/19 18:57:30 [emerg] 30403#0: host not found in upstream "app4:1802" in /etc/nginx/upstream.conf:3

If there is any problem in the configuration file, Nginx will not reload; it keeps running with the old configuration.

What does this command show before you reload?

nginx -t

Thanks.

groknaut commented on August 21, 2024

when i encountered the reload problem, i didn't run nginx -t. i have since restarted all the affected nginx proxies and they are currently not having a reload problem. the problem seems to occur after they've been running for a while. i'll incorporate an "nginx -t" into our reload process.

but .. app4's DNS record hasn't changed recently. i don't know if all the affected proxy hosts in that moment had a dns resolution problem. but i'll watch for it.

at this moment, this is what it shows:

10/21 17:18[root@proxy2 ~]# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

10/21 17:18[root@proxy2 ~]# head -4 /etc/nginx/upstream.conf
## Tomcat via HTTP
upstream tomcats_http {
    server app2:1802;
    server app4:1802;

10/21 17:19[root@proxy2 ~]# lynx -dump http://localhost/upstream-status/

                   Nginx http upstream check status

Check upstream server number: 2, shm_name: ngx_http_upstream_check#7

Index  Upstream      Server                    Status  Rise counts  Fall counts  Check type
0      tomcats_http  app2(10.112.249.82:1802)  up      9903         0            http
1      tomcats_http  app4(10.112.231.71:1802)  up      12719        0            http

groknaut commented on August 21, 2024

i now realize you may not have been referring to this as a DNS issue with regard to "509827 2011/10/19 18:57:30 [emerg] 30403#0: host not found in upstream "app4:1802" in /etc/nginx/upstream.conf:3".

what circumstance might that error refer to?

when we do these reloads of nginx, what we're doing is taking nodes in and out of the pool. we edit (using a script) upstream.conf taking app4 in and out of it, reloading on each change.

yaoweibin commented on August 21, 2024

This may be a DNS issue, but I'm not sure. Are you using your own DNS server?

Can you specify the IP addresses directly in the upstream block instead of host names?
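
One way to act on this suggestion without hand-editing the file is to resolve the names once, when the file is generated, and write the addresses into the `server` lines. The following is only a sketch: it assumes simple `server host:port;` lines like those in the upstream.conf shown in this thread, and `pin_upstream_ips` and its `resolve` parameter are hypothetical names, not part of nginx or the check module.

```python
import re
import socket
from typing import Callable

# Matches lines such as "    server app4:1802;" (host:port form only).
SERVER_LINE = re.compile(r"^(\s*server\s+)([A-Za-z0-9_.-]+)(:\d+)(.*)$")


def pin_upstream_ips(conf_text: str,
                     resolve: Callable[[str], str] = socket.gethostbyname) -> str:
    """Return conf_text with each upstream host name replaced by its IPv4 address.

    The original name is kept in a trailing nginx comment so the file stays readable.
    Any resolver error propagates, so a broken lookup never produces a partial file.
    """
    out = []
    for line in conf_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            prefix, host, port, tail = match.groups()
            address = resolve(host)
            if address != host:  # leave lines that already use an address untouched
                line = f"{prefix}{address}{port}{tail}  # was {host}"
        out.append(line)
    return "\n".join(out) + "\n"
```

With addresses in the file, the nginx master no longer needs a DNS lookup at reload time; the trade-off is that the file has to be regenerated whenever a backend's address changes.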

Thanks.

Weibin Yao

groknaut commented on August 21, 2024

we are using our own DNS server. i do not notice there being DNS resolution problems with these upstream nodes. i will have to add some more error checking to our script which runs the nginx configtest and reload.
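
The extra error checking could look roughly like the sketch below. It is not the script used in this thread: `safe_reload`, `run_command`, and the injectable `run` parameter are hypothetical names, and it assumes `nginx` is on the PATH and that `nginx -s reload` is an acceptable way to signal the master process.

```python
import subprocess
from typing import Callable, Sequence, Tuple

# A runner takes an argv list and returns (exit_code, combined_output).
Runner = Callable[[Sequence[str]], Tuple[int, str]]


def run_command(argv: Sequence[str]) -> Tuple[int, str]:
    """Run a command and return its exit code and combined stdout/stderr."""
    proc = subprocess.run(list(argv), stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return proc.returncode, proc.stdout


def safe_reload(run: Runner = run_command) -> bool:
    """Signal a reload only if `nginx -t` exits with status 0.

    Returns True when the reload command itself succeeded, which only means
    the signal was sent, not that the new configuration was applied.
    """
    code, output = run(["nginx", "-t"])
    if code != 0:
        print("config test failed, not reloading:\n" + output)
        return False
    code, output = run(["nginx", "-s", "reload"])
    if code != 0:
        print("reload command failed:\n" + output)
        return False
    return True
```

Checking the exit status rather than the printed text keeps the script independent of the exact wording of the `nginx -t` messages.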

do you have any feedback on why when i attempted a restart in this situation (as seen in the various debugging i supplied), the parent died and the worker continued processing requests. to illustrate the sequence of events:

  1. we altered upstream.conf node membership
  2. we ran nginx reload
  3. we checked http://proxy/upstream-status and the node membership had not changed to reflect the config we set in step 1
  4. we ran nginx restart .. and the parent died leaving an orphan worker behind which still served requests.
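
Step 3 can be automated by comparing the servers listed on the status page with those in upstream.conf. This is a sketch under assumptions: it expects the plain-text row layout seen in the lynx dumps in this thread (`index upstream name(ip:port) status ...`) and `server host:port;` lines in the config; the function names are hypothetical.

```python
import re
from typing import Set

# One status row, e.g. "0 tomcats_http app2(10.112.249.82:1802) up 9903 0 http".
STATUS_ROW = re.compile(
    r"^\s*\d+\s+(\S+)\s+([^\s(]+)\((\d+\.\d+\.\d+\.\d+):(\d+)\)\s+(up|down)\b"
)
# One config line, e.g. "    server app4:1802;".
CONF_SERVER = re.compile(r"^\s*server\s+([^\s:;]+):(\d+)")


def servers_in_status(status_text: str) -> Set[str]:
    """Collect "name:port" for every server row on the status page."""
    found = set()
    for line in status_text.splitlines():
        match = STATUS_ROW.match(line)
        if match:
            found.add(f"{match.group(2)}:{match.group(4)}")
    return found


def servers_in_conf(conf_text: str) -> Set[str]:
    """Collect "name:port" for every `server` line in the config text."""
    found = set()
    for line in conf_text.splitlines():
        match = CONF_SERVER.match(line)
        if match:
            found.add(f"{match.group(1)}:{match.group(2)}")
    return found


def reload_took_effect(conf_text: str, status_text: str) -> bool:
    """True when the running membership matches the membership on disk."""
    return servers_in_conf(conf_text) == servers_in_status(status_text)
```

A reload script could call `reload_took_effect` a few seconds after signalling nginx and raise an alert, rather than attempt a restart, when it returns False.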

yaoweibin commented on August 21, 2024

OK, I have checked through your debug log. You sent SIGHUP twice to reload Nginx, and both reloads failed. Then you sent SIGHUP a third time. According to the log, the master process seems to have disappeared after checking the host name at that point. Then you killed the worker process and started a new Nginx instance. After that, you reloaded Nginx again, and this time it succeeded.

Then I checked your nginx.strace.parent file.

The DNS problem looks like this: the Nginx master process tries to get the IP address from three DNS servers ("10.64.37.51", "10.245.197.162", "10.252.71.163"), but none of them replies before the timeout (30 seconds). The socket connect itself succeeds.
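
Because the master process can block while the resolvers time out, a reload script could check name resolution first with a short deadline of its own. The sketch below is one possible approach, not something taken from nginx or the check module: `unresolvable_hosts`, the 2-second default, and the `resolve` parameter are all assumptions.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Iterable, List


def unresolvable_hosts(hosts: Iterable[str],
                       timeout: float = 2.0,
                       resolve: Callable[[str], str] = socket.gethostbyname) -> List[str]:
    """Return the hosts that fail to resolve, or do not resolve within `timeout` seconds.

    Lookups run in worker threads so that a hanging resolver cannot block the caller.
    The deadline is applied per host, so the worst case is len(hosts) * timeout.
    """
    failed = []
    pool = ThreadPoolExecutor(max_workers=8)
    try:
        futures = {host: pool.submit(resolve, host) for host in hosts}
        for host, future in futures.items():
            try:
                future.result(timeout=timeout)
            except (FutureTimeout, OSError):  # socket.gaierror is an OSError
                failed.append(host)
    finally:
        # Do not wait for lookups that are still hanging.
        pool.shutdown(wait=False)
    return failed
```

If the returned list is non-empty, the script can skip the reload and report the names, instead of sending SIGHUP while the resolvers are unreachable.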

The weirdest thing is that the master process is killed by SIGKILL while trying to resolve the server name. This signal can't be caught. The OS may send it if the process runs into some serious situation.

I'll check the DNS code of Nginx, but I'm not very familiar with that part of the code.

yaoweibin commented on August 21, 2024

I found that Nginx just uses gethostbyname() when starting. I don't know why that happened. Someone had a similar problem: http://www.ruby-forum.com/topic/197833 .

Maybe it's an OS bug.

groknaut commented on August 21, 2024

the behavior described in the above forum post sounds like the behavior i saw. i have placed more error checking into our nginx reload script and will watch for name resolution errors. and maybe consider using IPs instead of node names, but not quite yet.

groknaut commented on August 21, 2024

hello. we hit this issue again tonite. i have more details to provide. in short, reloading nginx is not respecting our update to upstream.conf. prior to reloading nginx, i run the config test, and the config test says "syntax is ok".

can you help? thanks very much in advance.

here we are taking "app1" out of the config file..

11/04 05:39[root@proxy1 ~]# cat /etc/nginx/upstream.conf         
## Tomcat via HTTP
upstream tomcats_http {
    server app3:1802;
    check interval=3000 rise=3 fall=2 timeout=1000 type=http default_down=false;
    check_http_send "GET /monitor/sStatus HTTP/1.0\r\n\r\n";
}

11/04 05:43[root@proxy1 ~]# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

11/04 05:43[root@proxy1 ~]# /etc/init.d/nginx reload
Reloading nginx:                                           [  OK  ]
11/04 05:44[root@proxy1 ~]# lynx --dump http://localhost/upstream-status

                   Nginx http upstream check status

Check upstream server number: 2, shm_name: ngx_http_upstream_check#65

Index  Upstream      Server                    Status  Rise counts  Fall counts  Check type
0      tomcats_http  app1(10.214.49.140:1802)  up      29857        0            http
1      tomcats_http  app3(10.215.41.157:1802)  up      7906         0            http

11/04 05:44[root@proxy1 ~]# host app1
app1.prod.xyabc.com has address 10.214.49.140

11/04 05:44[root@proxy1 ~]# host app3
app3.prod.xyabc.com has address 10.215.41.157

debugging, including debug error log, can be seen here: http://groknaut.net/nginx/

groknaut commented on August 21, 2024

hmm. for some reason i didn't previously read carefully enough your reply in #10 (comment)

i just looked at the parent strace tonite and i do see it querying our nameservers and timing out on each one.

but i do wonder why nginx -t returns ok in spite of that problem. but perhaps it's not your problem :> .. feel free to close and/or impart any wisdom you have.

yaoweibin commented on August 21, 2024

Yes, I know. Your DNS server or network may have been OK when you tested the Nginx configuration file, but it was broken by the time you reloaded.

I'm curious about two things:

  1. Why did your DNS resolver become temporarily unavailable?
  2. Why did the DNS problem cause the program to be killed by signal 9? This doesn't seem to be just an Nginx problem. Maybe it's an OS bug.

Thanks for your detailed trace information.

groknaut commented on August 21, 2024

for 1, i don't know why resolver becomes unavailable. we're running in ec2, running our own bind servers therein. still a mystery.

for 2, do you see program killed by signal 9 in the details provided tonite? or from last week?

yaoweibin commented on August 21, 2024

From last week's nginx.strace.parent:

30403 1319050725.497926 poll([{fd=12, events=POLLIN}], 1, 3000) = 0 (Timeout) <2.998432>
30403 1319050728.496429 gettimeofday({1319050728, 496441}, NULL) = 0 <0.000010>
30403 1319050728.496475 poll([{fd=13, events=POLLOUT}], 1, 0) = 1 ([{fd=13, revents=POLLOUT}]) <0.000010>
30403 1319050728.496525 send(13, "\342\326\1\0\0\1\0\0\0\0\0\0\4app4\5alloy\7saasure\3com\0\0\1\0\1", 40, MSG_NOSIGNAL) = 40 <0.000024>
30403 1319050728.496593 poll([{fd=13, events=POLLIN}], 1, 6000 <unfinished ...>
30403 1319050732.094480 +++ killed by SIGKILL +++

These are the last lines in that file.
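
For readers who want to see what the master process was sending, the payload of the `send()` call above can be decoded as a DNS query. The sketch below copies the bytes from the strace line and parses the 12-byte header and the first question; `parse_dns_query` is a hypothetical helper that handles only an uncompressed query like this one.

```python
import struct

# Payload copied from the send() call in the strace (octal escapes as printed by strace).
QUERY = b"\342\326\1\0\0\1\0\0\0\0\0\0\4app4\5alloy\7saasure\3com\0\0\1\0\1"


def parse_dns_query(packet: bytes) -> dict:
    """Parse the header and first question of a DNS query (no name compression)."""
    # Header: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (six big-endian 16-bit fields).
    ident, flags, qdcount, _an, _ns, _ar = struct.unpack("!6H", packet[:12])
    pos = 12
    labels = []
    while packet[pos] != 0:  # each label is a length byte followed by that many bytes
        length = packet[pos]
        labels.append(packet[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    pos += 1  # skip the zero-length root label
    qtype, qclass = struct.unpack("!2H", packet[pos:pos + 4])
    return {
        "id": ident,
        "recursion_desired": bool(flags & 0x0100),
        "qdcount": qdcount,
        "name": ".".join(labels),
        "qtype": qtype,    # 1 = A record
        "qclass": qclass,  # 1 = IN
    }


if __name__ == "__main__":
    print(parse_dns_query(QUERY))
```

Decoded, the 40 bytes are a single recursive A query for app4.alloy.saasure.com. The timestamps show the process was killed about 3.6 seconds into the 6000 ms poll that was waiting for the answer.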

yaoweibin commented on August 21, 2024

This time the strace file looks fine. Your master process did not disappear suddenly.

groknaut commented on August 21, 2024

i'm not sure what happened last time, but we're watching closely. thanks again for your help.
