WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (VRVS (Mountain room))


Nick Thackray
Description
VRVS "Mountain" room will be available 15:30 until 18:00 CET
    • 14:00-17:25

      • 16:00
        Feedback on last meeting's minutes 5m
      • 16:05
        Grid-Operator-on-Duty handover 5m
      • From CERN (SE) to Italy (France):
        * Number of tickets opened for sites failing the CA test (most sites have fixed it)
        * A few tickets opened for failing PPS sites
        * One problematic ticket for USCMS-FNAL-WC1; might be an OSG interoperability problem
        * Weekend RM failures caused by an expired LFC host certificate (fixed now)
  • 16:10
    Review of action items 15m
  • 16:25
    Issues to discuss 20m

    Reports were not received from Northern Europe and CERN.

  • Item 1 (The problem was caused by SAME and the CERN firewall; see the 'SFTs jobs' thread in LCG_ROLLOUT started on May 24. Several sites have been affected by this, including at least Birmingham, Oxford, Bristol and the eSc center at Imperial in the UK. At Birmingham, Bristol and Oxford, our short queues had a maximum CPU time of 10 minutes and a wall time of 15 minutes; increasing the wall time and CPU time immediately alleviated the failures. I believe the shorter a site's short queue was, the more jobs failed. Our queues are properly advertised by our information system, and it is the responsibility of the SFT to select the appropriate queue in the JDL! We also have a dteam queue for longer dteam jobs.)
  • Item 2 (Random SFT-RM failures with the 'Not a PNFS File, can't get a PNFS ID' error have occurred previously, as detailed in earlier reports, but were much more frequent this week. It appears that increasing the number of supported VOs, and therefore PNFS databases, to 24 has tipped the balance from occasional failures to almost continuous failures. Further investigation indicated that copying a file to the SE and immediately trying to access or replicate it (as the SFT does) causes this error; waiting 15-20 seconds after the file creation seems to allow the database updates to complete and avoids the problem. Running the GridFTP door on the admin node also solves the problem, at the expense of significantly slower transfers. We are investigating database tuning and/or the possibility of running PNFS on a separate host, hoping that will fix the problem.)
  • Item 3 (I have also marked the sft-lcg-rm test as not relevant, since the failures were due to the CERN SE being full; see 'No space left on device' in the output of the tests. This has also been reported in the 'SFT again' thread on ROLLOUT, where Judit Novak wrote that 'The misleading entries will be removed from the SAME database to avoid false errors appearing in any site metrics.', but this has not been done.)
  • Item 4 (Replica management tests continue to show intermittent failures, apparently as a result of invalid or missing client credentials)
  • Item 5 (Putting a site in CT or JS because of failures not attributable to the site is very counter-productive. Some experiments rely on this information, and it also triggers unnecessary troubleshooting procedures at sites. This is a significant overhead in terms of site administration.)
  • Item 6 (When a ticket is solved, I keep receiving ticket escalation messages when the ticket should actually be closed instead. From a ticket (GGUS-Ticket-ID: #8282) that I submitted about this on May 2, I understood that tickets must first be closed by someone in the UKI ROC before the closed status propagates to GGUS. It seems to take a very long time before tickets are closed once solved. Who is responsible for closing the tickets? Is it the site admin's responsibility? Though I can close tickets for our site, it would seem natural to me that the decision to close a ticket be taken by a third party and not by me.)
  • Item 7 (It is difficult to correlate SFT failures and investigate them properly when the timestamps reported in the CIC report do not correspond to the timestamps in the CERN lcg-sft site listings. For example, the entry in the CIC page 'JS 24-05-2006 05:31' has as its closest match in https://lcg-sft.cern.ch:9443/sft/sitehistory.cgi?site=lcgce01.triumf.ca the entry 'JL 24-05-2006 05:06:02'. There should be an exact entry in the SFT pages for each entry mentioned in the report. Where did CIC get its timestamp?)
  • Item 8 (The CE's overload is strange, because we limited MaxTotalJob per VO but nothing changed. Another strange thing, possibly linked to the CE overload: in http://goc.grid.sinica.edu.tw/gstat/CGG-LCG2/ the number of free CPUs looks correct, but for each VO the number of running jobs is always equal to zero.)
  • Item 9 (How can one publish/get the available gLite WMSs? LCG RBs publish their information in the BDII and all users can get them with '$ lcg-infosites --vo dteam rb'. What about the WMS?)
  • Item 10 (We see that most of the LHC VOs submit lots of jobs to the CE under the "sgm" account. We understood this account was for software installation, so the scheduler gives these jobs higher priority but only allows a few of them to run in parallel. If the jobs submitted through "sgm" accounts are normal production jobs, there is a problem.)
  • 17:05
    gLite 3.0 deployment status 5m
  • 17:10
    Upcoming WLCG Pilot Service / SC4 Activities 10m
    See the agenda of today's WLCG Resource Scheduling Meeting

    As a reminder, the WLCG Pilot service starts Thursday 1st June 2006!

    Speaker: Jamie Shiers + LHC Experiments
  • 17:20
    AOB 5m
    • Plan to put the OPS VO in production 5m
      Speaker: Piotr Nyczyk
    • New times for RC and ROC reports 15m
      The new reporting shift would be:
      - From Saturday till Friday, including the weekend.
      - RC reports will stay open every day except Mondays. The reporting week is closed on Friday.
      - ROC reports will stay open on Mondays from 8:00 till 14:00 (CERN time).
      This will be implemented by 10th June.
      Speaker: Maite Barroso