Notes from the UKI Operations meeting – 12th April 2006
*********************************************

Attendees: Jeremy Coles (chair & scribe), Pete Gronbech, Mona Aggarwal, Fraser Spiers, Stephen Burke, Winnie Lacesso, Peter Love, Brian Davies, Grieg Cowan, Tim Barrass, Steve Traylen, Mark Nelson, Matt Doidge, Philippa Strange, Jens Jensen, Simon George, David Colling, Rosario Esposito

Summary for last month
******************
See "more information" attached to the agenda. The main points were:
- Details about the new OPS VO and the background to why and how it was introduced.
- Job loads set to ramp up.
- LCG MoU now signed by PPARC.
- gLite 3.0 RC2 released for pre-release testing, so expect the release in May.
- Review of the Rome GDB meeting discussions: SL4, Quattor, GFAL, security policies and accounting.

On the OPS VO, Steve Traylen mentioned that another reason it was introduced was to make the monitoring VO more acceptable to other grids, such as NGS in the UK and OSG in the US. Grieg Cowan is working on a method to present storage information at the VO level for GridPP.

Transfer tests
**********
Some sites have still not tested their transfer capability. The GridPP milestones for the end of March required all sites to have tested, but we learnt a lot from the tests that did happen. Jeremy and Graeme will rework the milestones (action) and add new ones in areas such as two-way, multiple and sustained transfers. Grieg is coordinating the timetable and tests this week. See the wiki for details: http://wiki.gridpp.ac.uk/wiki/Service_Challenge_Transfer_Tests.

Brian Davies expressed some concern about the FTS setup and mentioned that several areas need work - there are still some simple bugs. For example, the TCP window size (a parameter set in FTS) is a factor of 50 too small for transfers over UKLIGHT. There are workarounds, but that is not an ideal route forward. The SC4 throughput tests stopped further testing of changes. ST: the new FTS is up and running and will be declared operational at some point.
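To illustrate why the TCP window setting matters: the window needed to fill a long-haul link is the bandwidth-delay product. A minimal sketch follows; the 1 Gbit/s bandwidth and 15 ms round-trip time are illustrative assumptions, not measured UKLIGHT figures.

```python
def required_window_bytes(bandwidth_bps: float, rtt_s: float) -> int:
    """Minimum TCP window (bytes) needed to keep the pipe full:
    bandwidth-delay product = bandwidth (bits/s) * RTT (s) / 8."""
    return int(bandwidth_bps * rtt_s / 8)


if __name__ == "__main__":
    bandwidth = 1_000_000_000  # 1 Gbit/s link (assumed figure)
    rtt = 0.015                # 15 ms round trip (assumed figure)
    bdp = required_window_bytes(bandwidth, rtt)
    print(f"Required window: {bdp / 1024:.0f} KiB")
    # A window a factor of 50 too small caps throughput at 1/50 of capacity:
    print(f"Effective throughput at window/50: {bandwidth / 50 / 1e6:.0f} Mbit/s")
```

With these assumed numbers a 50x-undersized window would limit transfers to roughly 20 Mbit/s on a gigabit path, which is consistent with the concern raised above.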
See the wiki page for more details: http://wiki.gridpp.ac.uk/wiki/RAL_Tier1_File_Transfer_Service

Feedback from HEPiX
******************
Martin was unable to attend the meeting, so Jens gave a summary from his perspective. Some specifics:

Jens mentioned that some sites are starting to run SRMv2. Fermilab developers will start to give our storage group access to the newer version to test. Note that the current FTS is not v2 compatible. Grieg added that the future of v2 is still up for debate - input from the experiments is required to define the storage types, which are in need of clarification. Some sites will try pre-staging of files. V2.2 may come out of the discussions.

Scientific Linux: xfs is not supported directly in SL4. Use of 64-bit machines is on the increase - 32-bit software all works, and performance improves. LeSC have a large number of 64-bit machines, as does QMUL, but not RHUL (some thought it did).

Other issues - using du on dCache for getting metric information, and logging and monitoring within dCache - were discussed with one of the dCache developers. The developers are currently looking at the scalability of dCache; there seems to be a limit of 50 requests per second (it runs Java). Additional work is planned, such as VOMS integration.

Security was touched upon and led to a brief discussion on ROC challenges in this area. Storage challenges would be a useful addition - UKI (in particular the storage group) will work on this. For example: see if it is possible to write to pools (circumventing access controls), and have sites then try to identify who wrote or accessed these and normal files.

An IHEPCC member was present and asked about non-HEP use of HEP resources. Jens directed him to GridPP support of biomed as an example. More notes on HEPiX from Jens are here: http://storage.esc.rl.ac.uk/gridpp-storage-hepix.pdf

Jeremy mentioned the CERN desire to still move to SL4 in the autumn - perhaps October.
A limiting factor appears to be the readiness of the middleware (there are other pressures on the developers in the functionality area). SL4 has now been validated at CERN by the LHC experiments.

Latest on VO Boxes
****************
Steve talked about the two-day VO Box meeting last week, which was run to try to remove the need for VO Boxes. The group consisted of about six people, including experiment and site representatives and middleware developers. The highlights can be seen in the minutes of the meeting, but these points were mentioned:

The group looked more closely at the services run within the VO Box. These can be split into Class I and Class II. Class II services require backdoor access at the site and access pnfs file space etc.; if a service only connects to the BDII, GK etc. then it is a Class I service. The LHC VOs run some services as Class II - ALICE packman requires write access to NFS-mounted software, and CMS PhEDEx must have local access, though this may not be required once full FTS integration is provided; CMS are working to reduce their local access needs. VO Boxes are generally required for things that cannot be done with standard middleware, and work is being done to get rid of the Class II services. LHCb and ATLAS only have Class I services, so they only need VO Boxes at the Tier-1 and friendly sites.

A few areas of policy remain to be addressed, e.g. server certificates and the handling of proxies. Most significant for the UK is CMS: commands come back under SRM, covering things not possible in v1. LHCb are running only two VO Boxes in the world; ATLAS have about 10 and use one at RAL. LFCs are not really needed any more either, so this is all good news.

Jeremy asked about any increased requirements from GPBox (which deals with site policies and sits behind the CE). Stephen Burke replied that we have not seen it even in testing yet, so it must be over a year away; we do not really know the deployment implications or requirements.

AOB
****
Mark Nelson of Durham asked about the org code for Durham University.
Jeremy said this had been raised with Robin Middleton and Dave Kelsey. There should be a code, but in the meantime the catch-all PU0 can be used (the last character is a zero) - he will follow up with an email.

Stephen Burke asked about the LCG operations workshop and whether it was definitely the following week. Jeremy confirmed that it was, as this was the only date available around this period. Those going can start booking! 19th-20th June - see http://agenda.cern.ch/fullAgenda.php?ida=a062031

Mark mentioned a series of problems observed with new tickets where the same issue was being addressed under an older but not yet closed ticket. Procedures should stop this raising of new tickets. The 24-hour reminder is useful but can be annoying. One particular problem caused a ticket to go to the developers (before the last OPS meeting, concerning rm failures), but there has been no response. Philippa said there were problems like this when an issue is fixed and then returns, and also when tickets are raised in the local helpdesk but also raised by the COD people.

Mona commented that they see several tickets directly from VOs even though the problem is wider than just their site (e.g. a person not in the gridmap file). This is a tricky one, as users will not try every site to see how widespread a problem is before submitting tickets, so we have to accept that individual tickets will be created in some cases, and if we notice other sites have the same problem we should propagate the solution/issue ourselves. If it is not just a site problem, then the site can reassign the ticket back to the VO or ROC - we need to check that sites know how to do this. Some VOs still come direct to the site - for example geant4. Sites should ask for a ticket to be created, but can deal with the problem from when they first hear about it or when they receive a proper notification. Stephen said the model should be that the TPM assigns the ticket to the ROC, and the ROC then assigns it to the site.
The next meeting is Wednesday 10th May at 10:30 in the Fog VRVS room: http://agenda.cern.ch/fullAgenda.php?ida=a06683. Meeting closed at 11:40.

Actions
******
O-060412-1 Jeremy and Graeme to rework the SC milestones
O-060412-2 Storage group to develop a fuller proposal for SE security tests
O-060412-3 Jeremy to confirm the operations workshop, 19th-20th June - see http://agenda.cern.ch/fullAgenda.php?ida=a062031
O-060412-4 Jeremy/Philippa to follow up on the issue of multiple tickets being assigned for the same problem