**********************************************************************
Minutes of UKI Monthly Operations Meeting - Tuesday 15th November 2005
**********************************************************************

Attendees: Jeremy (scribe+chair), Morag, Brunel - Henry, Jens, Chris Brew, Giuseppe, Steve, Brian, Pete, Yves, Simon, Olivier, Matt, Alessandra, Mona, Stephen, Steve, Mark, Ben + UCL?

*****************
Preparing for SC4
*****************

- SRM: Teleconference tomorrow for storage
- FTS: The SRM interface may require FTS (a DPM client may work with a dCache server)
- Pete - who will need to be contacted?
- SThorn: Can the LFC coexist?
- JC: Yes, probably for sites with <100 WNs - tests are being done in Glasgow. Graeme will write up the experiences in a document.
- CB: What hardware is required?
- JC: Enough to cope with the 1TB transfers and some background production work. At what rate can you produce data to be transferred?
- MN: How much is required?
- JC: This is calculable from the number of worker nodes and supported VOs.
- PG: What is the formula for the storage required? (Action O-051115-1: JC to forward the sizing formula to the list)
- SG: The milestones state explicitly that VO Boxes will be "at" sites. Action O-051115-4: change "at" -> "for" in the milestone document. A VO Box may be at the Tier-1 or central within the Tier-2.

= 2.7.0

- R-GMA have updates they would like released
- The best that can be done now is December
- There may be a release by the 2nd week of December, but we do not know
- Rolling release - changes to client libraries can be picked up automatically, but for configuration changes a new release is needed.
- MN: Durham went 2.4 -> 2.5 etc. Once upgraded, the site is "stable". What if there are problems with patches? For instance, the recent glibc patches would have caused a problem for the site without warning had they arrived via auto-update. At least the test procedure meant that there was some confidence.
- SB: RPM updates are happening automatically now anyway
- ST: Nothing very exciting in the pipeline other than gLite components. It is not obvious that a new release is needed
- JC: The conclusion is that we do not know. The earliest is now mid-December.

= Feedback

- BD: Had major problems with the FTS 1.11 to 1.3 client upgrade. The problems originate from the many dependencies, which are difficult to track. (Action O-051115-2: feedback the FTS upgrade issue to the CERN team)
- BD: Storage list - with the throughput tests a way will be needed to clear out old files
- Action O-051115-3: Dteam to investigate methods to remove old files (one possible approach is sketched after the ATLAS notes below)
- Scheduling tests - some concerns because of connectivity for RHUL and Brunel. Difficult to get basic responses. Site network contacts need to be warned. Action O-051115-5: sites to warn site network contacts of data transfer test schedules
- PG: Pete Clarke has made the point about saturating the network before people will consider upgrades!

= ATLAS (Alessandra)

- Each experiment has an operations team - coordination team
- PL: ATLAS software installation - sites put in a request for installation; the software is installed and the site tagged. The problem is that admins do not know which software versions are required. Local users sometimes update local versions. Action O-051115-8: JC to follow up on how sites are to know which experiment software version to have installed
- Groups and roles are not yet implemented in VOMS. There is competition for resources (and files?). For jobs they will have access control - they want to be able to reorder jobs in the queue.
- SB: No numbers - number of jobs, how long jobs run for, etc.
- Action O-051115-6: AF to follow up on ATLAS numbers (number of jobs, how long jobs run, etc.)
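For action O-051115-3, a minimal cleanup sketch, assuming the SURLs of the old test files have already been collected (e.g. from the transfer logs or the catalogue) into a plain text file, and that lcg-del on a UI accepts SURLs directly with a valid proxy in place. The file name "old-test-surls.txt" and the dteam VO are illustrative assumptions, not an agreed procedure.

  #!/usr/bin/env python
  # Sketch only (action O-051115-3): delete old throughput-test files by
  # removing each SURL listed in a plain text file with lcg-del.
  # Assumptions: a valid grid proxy exists, lcg-del is on the PATH and accepts
  # SURLs directly, and "old-test-surls.txt" (a hypothetical file prepared from
  # the transfer logs or catalogue) holds one SURL per line.

  import subprocess
  import sys

  VO = "dteam"                      # VO assumed for the transfer tests
  SURL_LIST = "old-test-surls.txt"  # hypothetical input file

  def delete_surl(surl):
      # Delete a single replica; returns True if lcg-del exited cleanly.
      return subprocess.call(["lcg-del", "--vo", VO, surl]) == 0

  def main():
      failures = []
      for line in open(SURL_LIST):
          surl = line.strip()
          if not surl or surl.startswith("#"):
              continue                      # skip blanks and comments
          if not delete_surl(surl):
              failures.append(surl)
      if failures:
          sys.stderr.write("Failed to delete %d file(s):\n" % len(failures))
          for surl in failures:
              sys.stderr.write("  %s\n" % surl)
          sys.exit(1)

  if __name__ == "__main__":
      main()

If lcg-del proves unsuitable, an SRM advisory delete or the DPM/dCache administrative tools may be alternatives; the dteam investigation should settle which.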
= ALICE (Pete)

- Difficult to summarise the talk - they use AliRoot and ROOT
- JC: ALICE changed their original spec. for the VO Box; how does it compare with ATLAS?
- ST: The Tier-1 had more concern with ATLAS and could not accept it in its original form
- YC: ALICE members at Birmingham are only concerned with trigger work, not available resources

= CMS (Olivier)

- Tier-2s are used for MC production, analysis and calibration. A T2 must be able to connect to the calibration database.
- Provision of software, servers and local databases is required for the operation of the CMS workload and data management services.
- Tier-2s are aggregated
- Not clear whether a PROOF server is needed at each MC site? Run on AOD and produce summary ROOT files which the user scans to do event selection. What will the other experiments use for the file server?
- SB: They will probably use ROOT in some form.
- OvdA: do they need ROOT servers at each site? Action O-051115-7: follow up on CMS plans for file server requirements in light of the computing model

= LHCb (Fraser)

- DIRAC operates above the LCG middleware. GANGA provides a simplified interface to the grid for end-user analysts
- CPU is expressed as a number of 2.4GHz CPUs
- Does not say anything about the number of Tier-2s

****************************
Regional security challenges
****************************

- Jobs are being run now
- If there are problems replying to the ticket - mail the response to Alessandra or Jeremy
- Overall the responses must come via the ticketing system so they can be tracked
- SG: A few months ago LHCb jobs failed all at once at RHUL. They followed up. Discovered that (using PBS instead of LSF) the records did not match! Could only trace back to LHCb based on the timeframe.
- CB: Unless you have RB access - the job ID is not enough
- SB: The job record is published in R-GMA.
- AF: How many people think they can trace a job if given a time window and the group of the job run? (A minimal tracing sketch for PBS sites is given after the Quick discussion topics section below.)
- CB: That should be fine if there are not too many jobs in the timeframe.
- Others: Need to try.

***********************
Quick discussion topics
***********************

Networking document:
- Some things should not be published.
- SB: The only thing that might be sensitive is the firewall information, but that is public anyway

Training:
- JC: What courses do people need? What about updates?
- Courses: Stephen - they are a trade-off between getting up to date and having out-of-date material.
- SB: But where is the knowledge?
- JJ: We (dteam) should provide some of the expertise, so members ought to go on courses and disseminate the knowledge gained.

Supported VOs:
- CIC portal - good to have more information. But which VOs are active?
- Only 6 sites support PhenoGrid.
- OvdA - Imperial enabled Geant4.
- How much does it take to support a VO if it requires additional software (and what should sites publish so VOs can find them)?
- ST: Generally, if what is required is in SL3 then we should support it. It is not always clear what they need.
- H: X11 is needed by ATLAS
- AF: Question about contact points. The ID cards do not contain the right information.
- AF: Need configuration parameters - VO server, VOMS endpoint. Need the SGM account etc.
- JJ: The UK VOs are empty
- SB: Some are EGEE-approved VOs, then there are widespread ones like D0, and then regional ones

Site stability:
- JC: Pointed to the sustained difference between available CPU and potentially available CPU in the metrics report. Why do we not get better?
- HB: List of named sites. There are so many places to look to see if a site is up and running.
- HB: Brunel fails due to rm failures - a problem at the CERN end. No information about which part timed out.
- JC: EGEE weekly reports will soon require site/Tier-2 coordinator validation - https://cic.in2p3.fr/index.php?id=roc&js_status=2&roc_page=4&subid=roc_report
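Related to AF's question in the security challenge discussion, a minimal sketch of how a PBS/Torque site might trace jobs given a time window and group, using the batch system's accounting logs. The log location, group name and window shown are assumptions for illustration; LSF sites would need the equivalent lsb.acct parsing instead.

  #!/usr/bin/env python
  # Sketch only: list finished batch jobs for a given group within a time
  # window using Torque/PBS accounting records.
  # Assumptions: the standard PBS accounting log location and the semicolon-
  # delimited "E" (job end) record format; GROUP and the window are examples.

  import glob
  import time

  ACCT_DIR = "/var/spool/pbs/server_priv/accounting"  # typical PBS default
  GROUP = "lhcb"                                       # group to trace (example)
  T_START = time.mktime((2005, 11, 1, 0, 0, 0, 0, 0, -1))  # window start
  T_END = time.mktime((2005, 11, 2, 0, 0, 0, 0, 0, -1))    # window end

  def parse_record(line):
      # Split one accounting line into (record type, job id, {key: value}).
      parts = line.rstrip("\n").split(";", 3)
      if len(parts) < 4:
          return None
      fields = {}
      for item in parts[3].split():
          if "=" in item:
              key, value = item.split("=", 1)
              fields[key] = value
      return parts[1], parts[2], fields

  for logfile in sorted(glob.glob(ACCT_DIR + "/*")):
      for line in open(logfile):
          parsed = parse_record(line)
          if not parsed:
              continue
          rec_type, jobid, fields = parsed
          # Only completed-job ("E") records carry the group and end time.
          if rec_type != "E" or fields.get("group") != GROUP:
              continue
          if T_START <= int(fields.get("end", "0")) <= T_END:
              print("%s user=%s start=%s end=%s" % (
                  jobid, fields.get("user", "?"),
                  fields.get("start", "?"), fields.get("end", "?")))

As CB noted, this narrows the search to candidate jobs; mapping a candidate back to a grid job ID still needs the RB or the R-GMA-published job record.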
***
AOB
***

- HB introduced Duncan, the new hardware support person for Brunel and RHUL.

***********
New Actions
***********

O-051115-1 JC to forward the sizing formula to the list
O-051115-2 JC to feedback the FTS upgrade issue to the CERN team, or BD to raise it on the SC list
O-051115-3 Dteam to investigate methods to remove old transfer files
O-051115-4 JC to change "at" -> "for" in the milestone document
O-051115-5 Coordinators - sites to warn site network contacts of data transfer test schedules
O-051115-6 AF to follow up on ATLAS numbers (number of jobs, how long jobs run, etc.)
O-051115-7 Dteam to follow up on CMS plans for file server requirements in light of the computing model
O-051115-8 JC to follow up on how sites are to know which experiment software version to have installed
O-051115-9 JC to follow up on VO published information. Sites would like information on which VOs are active and more data on such things as the VO server, VOMS endpoints, etc.

**********
New Issues
**********