WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (VRVS (Twister room))

Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
The VRVS "Twister" room will be available from 15:30 until 18:00 CET.

      • 16:00 - 17:25
        WLCG-OSG-EGEE Operations Meeting 28-R-15

        • 16:00
          Feedback on last meeting's minutes 5m
        • 16:05
          EGEE Items 20m
          • Grid-Operator-on-Duty handover 5m
            From Germany/Switzerland ROC (backup: CERN) to SouthEast Europe ROC (backup: France ROC)
            Tickets:
            OPENED 24
            1ST MAIL 23
            2ND MAIL 16
            QUARANTINE 13
            SITE OK 44
            UNSOLVABLE 1

            Notes:
            1. Quite a few sites are still failing replica tests with timeouts (see also the last handover report). One possible reason seems to be that the configured top-level BDII is too slow, and/or that there are network problems.
            2. SFTs have not been updated on lcg-sft.cern.ch since last Friday (new results can now be found at lcg-sam.cern.ch).
            3. Site specifics
              - NSC-BLUESMOKE "couldn't get the glite-ce to work" (job list match fails).
                No further information about the status. The lcg-ce is OK; perhaps take the glite-ce out of monitoring until it is fixed.
              - USCMS-FNAL-WC1 asked for a (quite) long delay on its tickets (rm problems and ops-vo).
          • EGEE issues coming from ROC reports 15m
            Reports were received from all ROCs.

            1. What are the plans to migrate the gLite software to SLC 4? We would be happy to take advantage of the performance of the new kernel version and its support for modern hardware. (CentralEurope ROC)


            2. A major problem with downtimes: GRIF is a site split across different subsites. When a downtime is added for one CE/node in GRIF, the whole site is affected. This means that no jobs go to the other GRIF CEs, which is not what we would expect when declaring a downtime for this one node only. The only way for GRIF to update nodes without affecting all the other subsites seems to be not to declare the downtime at all, which is not desirable either. Could the tools (GOC DB, SAM-CE and others) be updated so that a partial site downtime is not seen as a global downtime? (France ROC)


            3. SFT/SAM: SFTs are no longer running regularly on https://lcg-sft.cern.ch, and there are additionally some site-specific SFT framework problems (checked for region DECH, see GGUS Ticket #12454). The SAM framework does not yet adequately replace the old SFT framework, as it does not completely represent the production environment. One reason for this might be failures at the sites, but there are also several obstacles due to the migration. The transition phase covering both test frameworks should last long enough for sites to get used to the new SAM framework. (DCG ROC)


            4. Maintaining persistent MW services: There is no recipe yet for shutting down and bringing up persistent middleware services like RB and SE. We suggest that the deployment group come up with a concept of how these services should be maintained in a controlled way, with as little effect as possible on users' jobs. Such a maintenance procedure is an essential part of any middleware component for a production environment. (DECH ROC)


            5. Last week the INFN-GRID Release Team discovered a change in the URLs of the LCG-gLite metapackage lists. The previous URL was http://glite.web.cern.ch/glite/packages/R3.0/$VERSION/doc/rpm_list/ and the current one is http://glite.web.cern.ch/glite/packages/R3.0/deployment/${ge}/${rpm_version}/${ge}-${rpm_version}.rpm.list.txt We ask whether it is possible to be informed of such changes, which concern release internals and development more than users. Do other regions need this? (Italy ROC)


            6. DPM/R-GMA conflict on port 8443: the INFN-GRID Release Team has locally implemented the modifications to yaim needed to change the DPM port. If needed, they can send them to the "Mother-Release" for inclusion. (Italy ROC)


            7. Can the RC report expiry time be extended to include at least some of the weekend if not all? (UK/I ROC)
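The URL change reported in item 5 can be made concrete with a small sketch. The two patterns below are copied from the report; the helper names and the example values passed for `ge` and `rpm_version` are hypothetical and are only there to show how the placeholders expand. The report does not say what `ge` stands for, so no meaning is assumed here.

```python
from string import Template

# Patterns copied verbatim from item 5 of the ROC reports.
OLD_PATTERN = Template(
    "http://glite.web.cern.ch/glite/packages/R3.0/$VERSION/doc/rpm_list/"
)
NEW_PATTERN = Template(
    "http://glite.web.cern.ch/glite/packages/R3.0/deployment/"
    "${ge}/${rpm_version}/${ge}-${rpm_version}.rpm.list.txt"
)


def old_list_url(version: str) -> str:
    """Expand the previous layout: one directory of RPM lists per version."""
    return OLD_PATTERN.substitute(VERSION=version)


def new_list_url(ge: str, rpm_version: str) -> str:
    """Expand the current layout: one list file per (ge, rpm_version) pair."""
    return NEW_PATTERN.substitute(ge=ge, rpm_version=rpm_version)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(old_list_url("3.0.2"))
    print(new_list_url("glite-WN", "3.0.2-1"))
```

Keeping each pattern in a single template means that a mirroring script only needs a one-line change when the layout moves again, which is the kind of change the Italy ROC asks to be told about.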
        • 16:25
          OSG Items 5m
          • FNAL would like to investigate adding open source database support to FTS. To evaluate this properly, we want to check out the FTS client and server code from the CVS repository. In which CVS repository is this code stored?
          • Piotr updated us on the SFT testing status. In the next week we hope to run SFTs on our development site.
        • 16:30
          WLCG Items 35m
          • WLCG SC report and upcoming activities 15m
            Speaker: Harry Renshall
            • Check the status of SARA, which has been in downtime for almost one week.
            • Get request timeouts at CCIN2P3
            • FZK and RAL to be checked by ALICE
            • Central FTS server at CERN not answering
            • Several problems observed this weekend with the ALICE SE service at CERN (central FTD service affected)
          • Changing the day of this meeting 5m
        • 17:05
          Review of action items 15m
        • 17:20
          AOB 5m