Minutes of the UKI monthly meeting, 19th July 2006
==================================================

Present: Steve Traylen (chair), Stephen Burke, Greig Cowan (minutes),
Giuseppe Mazza, Olivier van der Aa, Peter Gronbech, Christopher Brew,
Duncan Rand, Mark Nelson, Graeme Stewart, Alessandra Forti, Mona Agarawal,
Gianfranco Sciacca, Alice Fage, William Hay, Peter Love, Brian Davies,
Matt Doidge, Yves Coppens.

Progress and issues with gLite 3.0.x
====================================

gLite upgrade
-------------

Each of the T2 coordinators reported on site progress and experiences with
the gLite 3.0 upgrade.

* London

  IC-HEP required a rebuild of the CE. IC-LeSC have upgraded the LFC and
  CE. There is a general LFC for the London Tier-2.

* SouthGrid

  All sites have been upgraded to gLite for a while. Problems were
  experienced with Java being updated by SL: site-info.def refers to the
  Java location, so the change of path caused a problem. ST has a solution
  using alternative scripts to ensure that the correct location,
  /usr/bin/java, is pointed to. This can be done without reference to any
  LCG middleware. ST will send information to the list.

  CB mentioned that there have been problems enabling the ops VO within
  dCache. RAL-PP is still failing SAM tests. ST suggested mapping yourself
  locally to the ops VO to try to figure out what is happening. (This
  problem was resolved after the meeting.)

* NorthGrid

  All sites have upgraded the CE, MON and WNs, with no complaints. dCache
  was left as it was, since it did not change in gLite 3.0.

  Alessandra does not want to add VOs in by hand because she thinks it is
  stupid: a sysadmin should not be required to fiddle with configuration
  files by hand to add a VO. There should be a script (one that does not
  try to reconfigure the whole node and restart all the services) to do it
  on every node that requires it. "I've been making this point for months
  now. Someone will listen sooner or later."
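[Editor's note] ST's fix for the moving Java path, mentioned under
SouthGrid above, could look something like the following sketch. The
minutes only say that /usr/bin/java should be the stable location pointed
to; the use of the `alternatives` tool, the /usr/java/j2sdk* install
layout, and the priority value are assumptions, not ST's actual script.

```shell
#!/bin/sh
# Hedged sketch: make /usr/bin/java a stable name for whatever JVM the
# latest SL update installed, so site-info.def never has to track
# version-specific paths. Paths below are assumptions.

JVM=$(ls -d /usr/java/j2sdk* 2>/dev/null | tail -1)   # newest installed JDK, if any
if [ -n "$JVM" ] && command -v alternatives >/dev/null 2>&1; then
    # register the JDK under the stable name /usr/bin/java
    alternatives --install /usr/bin/java java "$JVM/bin/java" 100
fi

# site-info.def can then point at the stable prefix instead of a
# versioned directory (variable name assumed):
SID_LINE='JAVA_LOCATION="/usr"'
echo "$SID_LINE"
```

With this in place an SL Java update changes only the alternatives link,
and the line in site-info.def never needs editing.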
  Instead of using a complete YAIM reconfiguration, I asked for the
  commands to be grouped in a config_vo function that is called by
  upper-level functions or as part of the node configuration, BUT can also
  be called standalone when needed, like config_gip. I think it is a more
  than reasonable request.

* ScotGrid

  All sites upgraded. The site at Glasgow was completely reinstalled, and
  a lot was learned in the process.

Rolling upgrades
----------------

3.0.1 happened and nobody really noticed. This will be discussed further
at the WLCG operations meeting next week.

Security Incident
-----------------

ST: Lots of investigation is showing up lots of unconnected compromises.
RAL lost a couple of accounts. The SNO experiment had to change passwords
and private keys. Investigations are ongoing.
PG: How did they get root from an ordinary user? Using a Solaris box?
ST: Not much can be done until the report comes out. In contact with
Romain. It might be best just to always assume that there are compromised
accounts everywhere.
AF: What action should be taken now that an incident has occurred?
ST: The report on the incident will make some recommendations, such as
switching to using LDAP.
AF: People would like general recommendations, like switching off services
on your cluster that are not needed.
ST: Use of crypto cards?

UK participation in SC4
=======================

ST: Has any Atlas data been transferred to Lancaster or Manchester?
GS: No contact from Atlas so far.
BD: Believes that a communication issue within Atlas UK has
hindered/prevented any transfers occurring, along with the LFC problems at
RAL and then the security incident. The plan is still to have Atlas
transfers from RAL to the UK T2s.
ST: Is the situation the same in the other regions?
BD: The Atlas data monitoring page shows that the French T2s are getting
data during the challenge.
ST: Is this due to the French sites having people closer to the centre of
Atlas?
AF: Lack of coordination in this case. No one understood who was supposed
to do what.
GS: Atlas had no one to drive transfers in the UK.
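[Editor's note] Alessandra's request under NorthGrid above, a standalone
config_vo in the spirit of config_gip, might be sketched as below. Only
the function name and the standalone-callable design come from the
minutes; everything inside the function is an assumption about what
per-VO setup would involve, not actual YAIM code.

```shell
#!/bin/sh
# Hedged sketch of a per-VO configuration function that can be called by
# higher-level YAIM functions OR standalone on a single node, without
# reconfiguring the whole node and restarting every service.

config_vo () {
    vo=$1
    # Per-node actions a real implementation would perform, e.g.:
    #   - create pool accounts and the VO group
    #   - add grid-mapfile / groupmapfile entries
    #   - add batch queues and publish the VO in the info provider
    # None of those are done here; this stub just records the intent.
    echo "config_vo: configuring this node for VO '${vo}'"
}

# Standalone invocation on any node that needs the new VO, analogous to
# running config_gip on its own:
config_vo ops
```

The design point is that adding a VO touches only VO-specific state, so
running the function in isolation is safe on an already-configured node.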
We had the SRM endpoints set up.
ST: LHCb are just getting going; nothing much has happened so far.

Job abort rates
===============

OvdA is working on accounting and has a student working with them. They
want the data in APEL. Nothing yet on the abort rates.

Summary from the GridPP dteam meeting 30th June
===============================================

ST: What were the main issues?
SB: Site availability. But we never decided on what it means.
ST: I have been talking to Judit Novak. SAM availability is measured per
VO.
SB: Another issue was joint support at sites within a Tier-2.
ST: Suggests that T2 coordinators collect information from their sites to
see if they have problems with other people getting access to their
machines.
GS: FS have had some access to Grid machines.
ST: Support will be a best-effort kind of area, since site admins will be
busy enough dealing with problems at their own sites.
PG: Yves Coppens is a single hardware support post, and is doing a lot of
work with the Tier-2 anyway.
ST: So distributed support is already happening; maybe we just need to
make it more formal. Need to keep discussing this.
MN: Someone from outside might not quite understand a site's nuances.
There needs to be good communication between both parties about what has
to be done and what solution was used to resolve a problem.

AOB
===

ST: At the CIC-on-duty meeting there was nothing much to report.
CB: Michel Jouvin talked about running on 64-bit on LCG-ROLLOUT.
PG: What about running the BDII on a separate machine from the CE?
ST: This has been done at RAL and was very easy to do. Recommends that
sites look at the Gstat plots: if the CPU count is erratic then you need
to move the BDII to a separate host.
OvdA: Do we understand what is going on?
ST: The information system is harmed by the load.
ST: There was further discussion about site and top-level BDIIs.

There was some discussion in the VRVS chat window about dCache and the
kpwd file in regard to the ops VO problem that CB was experiencing.