WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (VRVS (Snow room))

Maite Barroso
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
The VRVS "Snow" room will be available from 15:30 until 18:00 CET.

    actionlist
    minutes
      • 16:00 - 17:25
        WLCG-OSG-EGEE Operations Meeting 28-R-15

        • 16:00
          Feedback on last meeting's minutes 5m
          Minutes
        • 16:05
          EGEE Items 20m
          • Grid-Operator-on-Duty handover 5m
            From CentralEurope ROC (backup: Taiwan) to SouthEast Europe ROC (backup: France ROC)
            Tickets:
            OPENED 33
            1ST MAIL 15
            2ND MAIL 18
            QUARANTINE 7
            SITE OK 50
            UNSOLVABLE

            Notes:
          • Change to ops VO 5m
            First consequences of the change:
            - The CIC dashboard only shows CE failures (based on SFT)
            - RC reports show all service failures (based on SAM)
            Speaker: Maite Barroso
          • Change of RC/ROC weekly report shifts 5m
            Proposal to always report for a "civil" week (Mon-Sun). This would lead to:
            • The week window runs from Monday morning until late Sunday.
            • Sites can edit each day (except during the ROC editing window) and can finish their part on Monday and Tuesday morning.
            • ROCs have from Tuesday noon to Wednesday noon to report.
            • The OCC has Wednesday afternoon and Thursday morning to extract the ROC issues and prepare the meeting.


            This would have some advantages:
            • Sites start their working week by reporting on the past week, which seems more logical than doing it in the middle of the week. If the Monday happens to be a bank holiday, they still have Tuesday morning to finish.
            • ROCs have time to fill in the global report for their region, and the OCC has time to prepare the meeting.
            • Weekly ROC GGUS metrics will then run from Monday to Sunday and be (at last) synchronized with the COD GGUS metrics, which have covered this time window for a long time.
            • We will be able to aggregate reports by calendar/project week, which will simplify the production of any kind of metrics and make those metrics more useful.
          Speaker: Maite Barroso
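The proposed weekly cycle can be sketched in code. This is a minimal illustration only and not part of the CIC portal: the function name `reporting_windows` and the dictionary keys are hypothetical, and the sketch assumes that "morning" ends at 12:00 and "afternoon" starts at 12:00, which the proposal does not state explicitly.

```python
from datetime import date, datetime, time, timedelta


def reporting_windows(any_day: date) -> dict:
    """Return the reported week and the follow-up windows for the week containing any_day."""
    # Monday of the week that contains any_day (weekday() is 0 for Monday).
    monday = any_day - timedelta(days=any_day.weekday())
    sunday = monday + timedelta(days=6)
    # Reporting on a week happens during the following week.
    next_monday = monday + timedelta(days=7)

    def at(day_offset: int, hour: int) -> datetime:
        # Timestamp in the following week: offset 0 is Monday, 1 is Tuesday, and so on.
        return datetime.combine(next_monday + timedelta(days=day_offset), time(hour))

    return {
        "reported_week": (monday, sunday),      # Monday to Sunday
        "sites_finish_by": at(1, 12),           # assumption: Tuesday morning ends at noon
        "roc_window": (at(1, 12), at(2, 12)),   # Tuesday noon to Wednesday noon
        "occ_window": (at(2, 12), at(3, 12)),   # assumption: Wednesday noon to Thursday noon
    }
```

For any date `d`, `reporting_windows(d)["reported_week"]` gives the Monday and Sunday of the week being reported, and the other entries fall in the week that follows.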
      • Update on WMS tests 10m
        Speaker: Andrea Sciabà
        transparencies
      • EGEE issues coming from ROC reports 15m
        All ROC reports were received.

        1. BDII timeouts encountered, most of the time on lcg-cr. Maybe there is a problem with this lcg tool? Other timeouts were also encountered on lcg-* commands: is the information system too loaded? Is all the available information useful? Is there a way to decrease the load from information system requests? (A download from lcg-bdii gives a 17 MB LDIF file.) (France ROC)


        2. The cycle of test job submission in the SFT/SAM framework is apparently 2h for the OPS VO (compared to 3h for dteam as before). Has this been officially announced or broadcast (e.g. to the sites) yet? (DECH ROC)


        3. Uncertified sites: we have put a couple of RCs in "Uncertified" status in the GOCDB because we have problems with them (very long downtime, old release, site mostly unattended). An MoU for the INFN Production Grid should be approved in the next few days and signed by all Italian RCs. Our idea is to use the "Uncertified" status as a "light" temporary suspension of sites that do not respect this MoU. Any objections and comments are welcome. (Italy ROC)


        4. Sometimes SAM monitoring shows one isolated error (1 hour) and the CIC report does not show it. What is the status of this? Should all SAM errors appear in the CIC report, or will single errors not appear? (SWE ROC)


        5. A couple of GGUS tickets have been opened with requests for the PPS report in the CIC portal: to add a "week" button on Friday so that the weekly report can be edited (ggus-12850), and to send an e-mail reminder, as is done for the Production report (ggus-12851). (SWE ROC)


        6. The LFC catalog does not work with user certificates from the Spanish CA (REDIRIS) because they include a dot "." in the user name field (ggus-12389). This is a major problem for us, since many users have certificates from this CA, so we would appreciate an estimate of when this might be solved. (SWE ROC)


        7. In the latest gLite version, the /opt/glite/yaim/libexec/gLite.def file contains a hardcoded value for "cemon.cetype=blah". This parameter should also be allowed to take the value "condor". Hence, it would be useful to have it as a "site-info.def" parameter. (SWE ROC)
        8. The move to OPS is welcome, but the default tests shown on the SAM website need to be updated to reflect this. It's only now that I realise we've been failing some significant tests for quite some time.
          As this is the first time we have been presented with these results, I am marking failures as non-relevant for this week. Looking at the old SFTs we had a near perfect week. But now the OPS/SAM tests have flagged us as failing all week. Clearly this is nonsense as we've been full all week processing CMS and LHCb jobs! (UKI ROC)
  • 16:25
    OSG Items 5m
    • Working with SFTs
  • 16:30
    WLCG Items 35m
    • WLCG SC report and upcoming activities 15m
      Speaker: Harry Renshall
      document
    • As an outcome of the informal discussion that ATLAS, CMS and LHCb had yesterday (September 19th) about the software installation procedure and the SFT infrastructure used for that purpose, there is a strong requirement from all VOs to have the sgm account's priority higher than that of other accounts *everywhere*. It will be up to the sites to decide how to implement that. Furthermore, all VOs (ALICE included) agreed that installing software via SFT would be the best way of proceeding.
    • LHCb requested (and is getting quick answers from several sites) that its own VOMS group-based mapping (without further extra Roles specified) be deployed everywhere. They plan to start with this simple mapping (which should be easy to implement in YAIM):
      "/VO=lhcb/GROUP=/lhcb/sgm" lhcbsgm
      "/VO=lhcb/GROUP=/lhcb/lcgprod" lhcbprd
      "/VO=lhcb/GROUP=/lhcb/Role=NULL/Capability=NULL" .lhcb
      lhcbsgm is the local account for installing software. It could be granted just a single concurrent job on the site, but it should be allowed to have the highest priority.
      The lhcbprod account should be guaranteed a specified share (at LRMS level), and its storage area is group writable. Normal members of the community can just access data in this area.
      This is the desired configuration they would like to have deployed as a first step. Further groups have also been defined within LHCb, but they are not easily deployable with the current YAIM tools.
      Please note that LHCb do not plan to use special Roles inside each group apart from the default one, "user". The latter is just a trick to prevent unwanted privileges based on groups without extra Roles specified. For a VOMS proxy, LCMAPS will always try to acquire as many secondary GIDs as possible. This means that a user who appears in the /lhcb/sgm group will always have "lhsgm" present in the list of GIDs, either as primary or as secondary GID.
    more information
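As an illustration of the mapping LHCb asks for, the sketch below picks a local account from a user's VOMS groups. It is not LCMAPS or YAIM code: the table `GROUP_TO_ACCOUNT`, the function `map_account`, and the rule that the first matching entry in table order wins are assumptions made for this example. The leading dot in `.lhcb` is copied from the mapping above and is assumed here to denote a pool of accounts.

```python
# Hypothetical lookup table mirroring the three mapping lines above.
# Order matters in this sketch: more specific groups come first.
GROUP_TO_ACCOUNT = [
    ("/lhcb/sgm", "lhcbsgm"),      # software installation account
    ("/lhcb/lcgprod", "lhcbprd"),  # production account
    ("/lhcb", ".lhcb"),            # plain VO members (assumed pool account)
]


def map_account(groups):
    """Return the account of the first table entry whose group the user belongs to.

    `groups` is a collection of VOMS group names taken from the user's proxy.
    Returns None when no entry matches.
    """
    for group, account in GROUP_TO_ACCOUNT:
        if group in groups:
            return account
    return None
```

With this ordering, a user who is a member of both /lhcb and /lhcb/sgm is mapped to lhcbsgm, while a user who is only in /lhcb falls through to the `.lhcb` entry.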
  • 17:05
    Review of action items 15m
    pendingactions
  • 17:20
    AOB 5m