[Casper] Timing logins and application startup

Clinton Blackmore clinton.blackmore at westwind.ab.ca
Fri Feb 13 09:14:59 PST 2009


Thanks for the help!  Replies in teal.

On 13-Feb-09, at 7:58 AM, Thomas Larkin wrote:
> Replies in bold...
>
> Replies inline:
>
> I'm now using the 10.5.6 server admin tools -- does that cause  
> problems with previous versions of the OS (or are problems more  
> likely when using outdated tools)?
>
> Yes, using the wrong version of the server tools can cause issues,  
> especially with Workgroup Manager: it can cause BSD database  
> corruption, which will hose your LDAP.  Ever see a new user generate  
> a negative UID number?

Gee!  I would not have expected that.  Is it possible to tell if the  
BSD database is corrupt or not?

If it is corrupt, is there a way to recover?  (Googling shows me
http://sdb.open-xchange.com/node/29 , and I imagine something similar
might work.  Oh, hey,
http://www.barbariangroup.com/posts/1668-fixing_b0rked_open_directory_ldap_databases
has steps similar to what we took when the first master failed.)  If  
the master's LDAP DB is corrupt, would I expect all the replicas to  
have the same corruption?  Would fixing the master cause the fix to  
replicate?

>>
> Huh.  Well, this is the first year we've used replicas -- previously  
> every site had its own master and was a universe unto itself.
>>
>> We had a small problem.  In the master image the client is bound to  
>> the ODM.  I then have casper policies that change bindings via a  
>> shell script.  Well, for some reason they weren't running and all  
>> 6,000 clients were bound to the ODM and they proceeded to bend over  
>> my Xserve and throw it to its knees.  Everything ran slow.  That  
>> is remedied now, the casper policy is running and all client  
>> machines get bound to the ODR in their building.

Ouch!

>
> The CPU graph on the server at a school called CJHS -- where, in  
> particular, I was having problems -- is at a constant 75% -- which  
> is about 10 times what I would expect.  [I wish I had proper  
> monitoring in place and could go back further than seven days.  It  
> was hovering at a constant 60% almost a week ago, and then jumped up  
> to 75% and remained there.]  Running top on the machine, I see that  
> AFP has gone off the deep end -- using 599.9% of the available CPU  
> time.  Time to reboot that box.  [Only one of our other servers was  
> misbehaving in the same way.]  I had turned on all the AFP logging  
> features on that machine, and now, when they could be useful, the  
> access log starts at Jan 29 and ends on Feb 5th.  It was too  
> verbose, so I have turned off many of the logging features.
>
> How many connections are you seeing on AFP?  I assume that all home  
> folders are on AFP?  Do you do portable home directories?

Looking at the graphs, there are peaks and plateaus.  The last plateau  
(before I rebooted) was at ~70 connections.  The last peak was double  
that.  70 connections is largely accounted for by our two desktop  
labs, which use network home folders.  Our two laptop labs are using  
portable home directories, and may explain the peak.

> ...
> [The log file has] lots and lots of lines like:
>
> Feb 11 13:53:38 CJHS-iMacLab-22 /System/Library/CoreServices/ 
> SystemUIServer.app/Contents/MacOS/SystemUIServer[4176]:  
> FolderManager: Failed looking up user domain root; url='file://localhost/Network/Servers/cjhs.wwsd.net/Volumes/DataHD/CJHSstudents/CJHS_Grade_07/ 
> [full name redacted]/' path=/Network/Servers/cjhs.wwsd.net/Volumes/ 
> DataHD/CJHSstudents/CJHS_Grade_07/[full name redacted]/ err=-120  
> uid=7100 euid=7100
>
> Thanks for your time.  I will see if I am able to get a proper trace  
> of what is going on, especially if I can attribute it to something  
> other than AFP.
>
> Cheers,
> Clinton Blackmore
>
>
> That last line where it can't look up the home folder path, kind of  
> makes me think, DNS issue.  Is your DNS fully resolved both forwards  
> and backwards?  In OS X Server the changeip command is actually what  
> is used to check this, and of course set this.  I have had my share  
> of small DNS issues and they will always come back to bite your leg  
> off.  So, make sure you get your DNS in order.  So, you can ssh into  
> your server and run this command
>
> xs106-a:~ root# changeip -checkhostname
>
> Primary address     = 10.160.3.30
>
> Current HostName    = xs106-a.kckps.org
> DNS HostName        = xs106-a.kckps.org
>
> The names match. There is nothing to change.
>

The results came back as expected ("the names match.  There is nothing  
to change") on our master, former master, and all but two of the  
replicas.  Those two came back with "The DNS hostname is not  
available, please repair DNS and re-run this tool." I'll look into  
that, but problems have been occurring at sites where this is not an  
issue.
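
For spot-checking the two replicas from another machine, here is a minimal
Python sketch of the same forward/reverse comparison.  It is not what
changeip does internally; it only asks the local resolver for the address
of a hostname and then for the name of that address, and compares the two.
The function name check_dns and its forward/reverse parameters are my own
invention for illustration (the parameters exist only so the lookups can be
swapped out when testing).

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Compare forward and reverse DNS for hostname.

    Returns (ok, detail). `forward` maps a name to an IPv4 address string;
    `reverse` maps an address to a (name, aliases, addresses) tuple, the
    same shapes the socket module functions return.
    """
    try:
        address = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup failed for {hostname}: {exc}"

    try:
        reverse_name = reverse(address)[0]
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"reverse lookup failed for {address}: {exc}"

    # Ignore case and a trailing dot when comparing the two names.
    if reverse_name.rstrip(".").lower() != hostname.rstrip(".").lower():
        return False, f"{hostname} -> {address} -> {reverse_name} (mismatch)"

    return True, f"{hostname} -> {address} -> {reverse_name}"
```

Calling check_dns("cjhs.wwsd.net"), for example, returns a True/False flag
plus a one-line description of what was resolved.  The answer depends on
which DNS server the machine running it uses, so it is a rough cross-check
rather than a replacement for running changeip -checkhostname on the server
itself.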

Just trolling through the logs.  On the CJHS school server, the  
Password Service Error Log shows this line quite frequently:

Feb 13 2009 07:40:22    DoSyncWithServerChangeList: "Parent" has a  
transaction ID beyond the current value, resetting to 0.
Feb 13 2009 08:00:44    DoSyncWithServerChangeList: "Parent" has a  
transaction ID beyond the current value, resetting to 0.
Feb 13 2009 08:10:17    DoSyncWithServerChangeList: "Parent" has a  
transaction ID beyond the current value, resetting to 0.
Feb 13 2009 08:20:55    DoSyncWithServerChangeList: "Parent" has a  
transaction ID beyond the current value, resetting to 0.
Feb 13 2009 08:30:28    DoSyncWithServerChangeList: "Parent" has a  
transaction ID beyond the current value, resetting to 0.
Feb 13 2009 09:00:30    DoSyncWithServerChangeList: "Parent" has a  
transaction ID beyond the current value, resetting to 0.
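
To put a number on "quite frequently", something like the following Python
sketch could tally the messages and the gaps between them.  It assumes that
each entry is a single line in the real log file (the wrapping above comes
from the mail client), that every entry starts with a timestamp in the
"Feb 13 2009 07:40:22" form shown above, and that month names are the
English abbreviations.  The names parse_line and summarize are hypothetical
helpers, not part of any Apple tool.

```python
import re
from collections import Counter
from datetime import datetime

# "Feb 13 2009 07:40:22    message text"
LINE_RE = re.compile(r"^(\w{3}\s+\d{1,2}\s+\d{4}\s+\d{2}:\d{2}:\d{2})\s+(.*)$")


def parse_line(line):
    """Return (datetime, message), or None if the line has no leading timestamp."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Collapse runs of spaces so the timestamp matches the strptime format.
    stamp_text = " ".join(match.group(1).split())
    stamp = datetime.strptime(stamp_text, "%b %d %Y %H:%M:%S")
    return stamp, match.group(2).strip()


def summarize(lines):
    """Count distinct messages and list the gaps, in whole seconds, between entries."""
    entries = [entry for entry in map(parse_line, lines) if entry is not None]
    counts = Counter(message for _, message in entries)
    stamps = sorted(stamp for stamp, _ in entries)
    gaps = [int((later - earlier).total_seconds())
            for earlier, later in zip(stamps, stamps[1:])]
    return counts, gaps
```

Feeding it the lines of the log file (for instance via open(path) on
whatever path the Password Service error log lives at on the server) gives a
Counter of messages and a list of intervals, which makes it easier to see
whether the resets line up with a regular replication schedule.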

On our ODM, I see some lines like this in the Directory Services Error  
Log:

2009-02-06 14:20:44 MST - T[0xB05A6000] - dsDoReleaseContinueData -  
PID 0 error -14071 while checking if reference <16777292> is a node
2009-02-11 06:01:37 MST - T[0xB0699000] - dsDoReleaseContinueData -  
PID 0 error -14071 while checking if reference <16777276> is a node

The Kerberos Administration Log shows lots of entries like:

Feb 13 09:56:03 odm.wwsd.net kadmin.local[6683](info): No dictionary  
file specified, continuing without one.
Feb 13 09:56:03 odm.wwsd.net kadmin.local[6683](info): No dictionary  
file specified, continuing without one.

Well, I'm going to continue to look at the logs and see if I see  
anything more.

Cheers,
Clinton

