
Child process dies, nfs locks not released, webserver hangs...

Update:12-10Source: network consolidation
Hi,
I have Sun ONE 6.1 SP11 on a Solaris 10 LDom.
The server is configured to write its access and error logs to /logs, which is an NFS mount from a separate Solaris 10 box. Logging to an NFS mount is a business requirement.
Sun JWS is configured to run two httpd processes, with the watchdog restarting either one if it fails.
Every now and then, about once a day (it varies), one of the child processes dies with messages like these in the error log (1949 is the wdog PID):
[09/Dec/2009:14:19:06] failure ( 1949): CORE3107: Child process closed admin channel
[09/Dec/2009:14:19:06] fine ( 1949): CORE3061: signal_handler_thread: received signal 18
[09/Dec/2009:14:19:06] fine ( 1949): CORE3049: Primordial process detected child 1950 died: status 37
[09/Dec/2009:14:19:06] fine ( 1949): CORE3050: Is our child, will spawn replacement
[09/Dec/2009:14:19:06] fine ( 1949): CORE3062: Unlinking of /tmp/https-wv2-819e4c2d/.cgistub_1950 returned -1
[09/Dec/2009:14:19:06] fine ( 1949): CORE3047: Server spawned worker process 2011
[09/Dec/2009:14:19:06] fine ( 2011): HTTP5169: User authentication cache entries expire in 120 seconds.
[09/Dec/2009:14:19:06] fine ( 2011): HTTP5170: User authentication cache holds 200 users
[09/Dec/2009:14:19:06] fine ( 2011): HTTP5171: Up to 4 groups are cached for each cached user.
[09/Dec/2009:14:19:06] fine ( 2011): HTTP4207: file cache module initialized (API versions 2 through 2)
[09/Dec/2009:14:19:06] fine ( 2011): HTTP4302: file cache has been initialized
[09/Dec/2009:14:19:06] fine ( 2011): HTTP3066: MaxKeepAliveConnections set to 256
[09/Dec/2009:14:19:06] fine ( 2011): Installed configuration 1
[09/Dec/2009:14:19:06] fine ( 2011): HTTP4193: flex-rotate-init: rotate start time is 0h, 0m
At this point the webserver stops responding. The processes (2 x httpd, 1 x wdog) are still running but do not answer requests. pfiles shows an odd lock on the access log:
21: S_IFREG mode:0777 dev:340,10 ino:34988 uid:111 gid:102 size:0
O_RDWR|O_APPEND|O_CREAT|O_LARGEFILE FD_CLOEXEC
advisory write lock set by system 0x2 process 280
I think this means the new httpd process is waiting for the lock to be released, but the lock is never freed.
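One way to test that guess is to try to take the same kind of advisory lock without blocking. The following is only a rough sketch, not anything the web server itself does: `lock_is_free` is a made-up helper name, and it assumes the lock is an ordinary POSIX fcntl lock. It takes and releases the lock for an instant when the file is free, so point it at a scratch file on the same NFS mount first rather than at the live access log.

```python
import errno
import fcntl
import os


def lock_is_free(path):
    """Return True if an exclusive advisory lock on `path` can be taken right now.

    Hypothetical helper for illustration; assumes POSIX fcntl-style locks.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                # Some other process (or a stale lock) still holds it.
                return False
            raise  # e.g. ENOLCK: the lock service itself is in trouble
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
```

If this returns False for the log file while no live process should be holding it, that would be consistent with a lock that was never released.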
But what I'm really curious about is why the process is dying in the first place. Has anyone seen "status 37" before, or know where I can look it up? I couldn't find any reference on Google to what it might mean...
any help appreciated
cheers
Kristin.

The Best Answer

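On Solaris, signal 37 is SIGLOST, so "status 37" most likely means the child was killed by that signal. I found the following in http://docs.sun.com/app/docs/doc/816-4555/rfsrefer-134?l=ja&a=view :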
In this situation, the SIGLOST signal is posted to the process. The default action for the SIGLOST signal is to terminate the process.
For you to recover from this state, you must restart any applications that had files open at the time of the failure. Note that the following can occur.
- Some processes that did not reopen the file could receive I/O errors.
- Other processes that did reopen the file, or performed the open operation after the recovery failure, are able to access the file without any problems.
Thus, some processes can access a particular file while other processes cannot.
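To see how "status 37" can be read as a signal number, here is a small sketch that decodes it the way a wait(2) status is normally decoded. It assumes the number in the CORE3049 message is the raw wait status, which the log does not confirm. The signal names in the table are the Solaris numbers for the two signals seen in this thread (Linux uses different numbers), and `describe_wait_status` is just an illustrative name.

```python
import os

# Assumption: Solaris signal numbers for the two signals in this thread.
SOLARIS_SIGNALS = {18: "SIGCLD", 37: "SIGLOST"}


def describe_wait_status(status):
    """Describe a raw wait(2) status, naming the signal if it is in the table."""
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        name = SOLARIS_SIGNALS.get(sig, "signal %d" % sig)
        return "killed by %s" % name
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    return "other status %d" % status


print(describe_wait_status(37))  # prints "killed by SIGLOST"
```

Read this way, the "received signal 18" line would be the parent being told that a child died, and status 37 would be the child being terminated by SIGLOST with no core dump.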
Edited by: Arvind_Srinivasan on Dec 10, 2009 12:33 AM