Minutes of the UKI monthly meeting, 19th July 2006
==================================================

Present: Steve Traylen (chair), Stephen Burke, Greig Cowan (minutes),
Giuseppe Mazza, Olivier van der Aa, Peter Gronbech, Christopher Brew,
Duncan Rand, Mark Nelson, Graeme Stewart, Alessandra Forti, Mona Agarawal,
Gianfranco Sciacca, Alice Fage, William Hay, Peter Love, Brian Davies,
Matt Doidge, Yves Coppens.

Progress and issues with gLite 3.0.x
====================================

gLite upgrade
-------------

Each of the T2 coordinators reported on site progress and experiences with
the gLite 3.0 upgrade.

* London

  IC-HEP required a rebuild of the CE. IC-LeSC have upgraded the LFC and
  CE. There is a general LFC for the London Tier-2.

* SouthGrid

  All sites have been upgraded to gLite for a while. Problems were
  experienced with Java being updated by SL: site-info.def refers to the
  Java location, so the change of path caused a problem. ST has a solution
  using alternative scripts to ensure that the correct location,
  /usr/bin/java, is pointed to. This can be done without reference to any
  LCG middleware. ST will send information to the list.

  CB mentioned that there have been problems enabling the ops VO within
  dCache. RAL-PP is still failing SAM tests. ST suggested mapping yourself
  locally to the ops VO to try to figure out what is happening. (This
  problem was resolved after the meeting.)

* NorthGrid

  All sites have upgraded the CE, MON and WNs, with no complaints. dCache
  was left as it was, since it did not change in gLite 3.0.

  Alessandra does not want to add VOs in by hand because she thinks it is
  stupid: a sysadmin should not be required to fiddle with configuration
  files by hand to add a VO. There should be a script (one that does not
  try to reconfigure the whole node and restart all the services) to do it
  on every node that requires it. "I've been making this point for months
  now. Someone will listen sooner or later."
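[Editor's note] ST's fix for the moving Java path, mentioned under
SouthGrid above, could look something like the following sketch. The
minutes only say that /usr/bin/java should be the stable location pointed
to; the use of the `alternatives` tool, the /usr/java/j2sdk* install
layout, and the priority value are assumptions, not ST's actual script.

```shell
#!/bin/sh
# Hedged sketch: make /usr/bin/java a stable name for whatever JVM the
# latest SL update installed, so site-info.def never has to track
# version-specific paths. Paths below are assumptions.

JVM=$(ls -d /usr/java/j2sdk* 2>/dev/null | tail -1)   # newest installed JDK, if any
if [ -n "$JVM" ] && command -v alternatives >/dev/null 2>&1; then
    # register the JDK under the stable name /usr/bin/java
    alternatives --install /usr/bin/java java "$JVM/bin/java" 100
fi

# site-info.def can then point at the stable prefix instead of a
# versioned directory (variable name assumed):
SID_LINE='JAVA_LOCATION="/usr"'
echo "$SID_LINE"
```

With this in place an SL Java update changes only the alternatives link,
and the line in site-info.def never needs editing.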
  Instead of using a complete YAIM reconfiguration, I asked for the
  commands to be grouped in a config_vo function that is called by
  upper-level functions or as part of the node configuration, BUT can also
  be called standalone when needed, like config_gip. I think it is a more
  than reasonable request.

* ScotGrid

  All sites upgraded. The site at Glasgow was completely reinstalled, and
  a lot was learned in the process.

Rolling upgrades
----------------

3.0.1 happened and nobody really noticed. This will be discussed further
at the WLCG operations meeting next week.

Security Incident
-----------------

ST: Lots of investigation is showing up lots of unconnected compromises.
RAL lost a couple of accounts. The SNO experiment had to change passwords
and private keys. Investigations are ongoing.
PG: How did they get root from an ordinary user? Using a Solaris box?
ST: Not much can be done until the report comes out. In contact with
Romain. It might be best just to always assume that there are compromised
accounts everywhere.
AF: What action should be taken now that an incident has occurred?
ST: The report on the incident will make some recommendations, such as
switching to using LDAP.
AF: People would like general recommendations, like switching off services
on your cluster that are not needed.
ST: Use of crypto cards?

UK participation in SC4
=======================

ST: Has any Atlas data been transferred to Lancaster or Manchester?
GS: No contact from Atlas so far.
BD: Believes that a communication issue within Atlas UK has
hindered/prevented any transfers occurring, along with the LFC problems at
RAL and then the security incident. The plan is still to have Atlas
transfers from RAL to the UK T2s.
ST: Is the situation the same in the other regions?
BD: The Atlas data monitoring page shows that the French T2s are getting
data during the challenge.
ST: Is this due to the French sites having people closer to the centre of
Atlas?
AF: Lack of coordination in this case. No one understood who was supposed
to do what.
GS: Atlas had no one to drive transfers in the UK.
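[Editor's note] Alessandra's request under NorthGrid above, a standalone
config_vo in the spirit of config_gip, might be sketched as below. Only
the function name and the standalone-callable design come from the
minutes; everything inside the function is an assumption about what
per-VO setup would involve, not actual YAIM code.

```shell
#!/bin/sh
# Hedged sketch of a per-VO configuration function that can be called by
# higher-level YAIM functions OR standalone on a single node, without
# reconfiguring the whole node and restarting every service.

config_vo () {
    vo=$1
    # Per-node actions a real implementation would perform, e.g.:
    #   - create pool accounts and the VO group
    #   - add grid-mapfile / groupmapfile entries
    #   - add batch queues and publish the VO in the info provider
    # None of those are done here; this stub just records the intent.
    echo "config_vo: configuring this node for VO '${vo}'"
}

# Standalone invocation on any node that needs the new VO, analogous to
# running config_gip on its own:
config_vo ops
```

The design point is that adding a VO touches only VO-specific state, so
running the function in isolation is safe on an already-configured node.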
We had the SRM endpoints set up.
ST: LHCb are just getting going; nothing much has happened so far.

Job abort rates
===============

OvdA is working on accounting and has a student working with them. They
want the data in APEL. Nothing yet on the abort rates.

Summary from the GridPP dteam meeting 30th June
===============================================

ST: What were the main issues?
SB: Site availability. But we never decided on what it means.
ST: I have been talking to Judit Novak. SAM availability is measured per
VO.
SB: Another issue was joint support at sites within a Tier-2.
ST: Suggests that T2 coordinators collect information from their sites to
see if they have problems with other people getting access to their
machines.
GS: FS have had some access to Grid machines.
ST: Support will be a best-effort kind of area, since site admins will be
busy enough dealing with problems at their own sites.
PG: Yves Coppens is a single hardware support post, and is doing a lot of
work with the Tier-2 anyway.
ST: So distributed support is already happening; maybe we just need to
make it more formal. Need to keep discussing this.
MN: Someone from outside might not quite understand a site's nuances.
There needs to be good communication between both parties about what has
to be done and what solution was used to resolve a problem.

AOB
===

ST: At the CIC-on-duty meeting there was nothing much to report.
CB: Michel Jouvin talked about running on 64-bit on LCG-ROLLOUT.
PG: What about running the BDII on a separate machine from the CE?
ST: This has been done at RAL and was very easy to do. Recommends that
sites look at the Gstat plots: if the CPU count is erratic then you need
to move the BDII to a separate host.
OvdA: Do we understand what is going on?
ST: The information system is harmed by the load.
ST: There was further discussion about site and top-level BDIIs.

There was some discussion in the VRVS chat window about dCache and the
kpwd file in regard to the ops VO problem that CB was experiencing.