Saturday 11-02-06 ==> "How-to" achieve service level at tiers.
Thanks to notes from Maite Barroso, Piotr Nyczyk and Harry Renshall
=========================================================================================
Summary of the discussion
-------------------------------
Stable components with proper monitoring have to be defined and released.

Taiwan is going to hire four people devoted to high-availability issues. American and Canadian Tier1s seem to have 24/7 availability already; European Tier1s state that the on-duty person is to be called only in exceptional circumstances. FNAL reports their use of a call center for operations. TRIUMF, FNAL and BNL consider joining the regular operations run by European tiers.

Operations at the T0-T1 level need further work, so that VOs know how an alert raised by a VO on-call person is handled, e.g. in the middle of the night. VOs are concerned about the response time from Tier1s, in particular with respect to meeting MoU levels, and advise Tier1s to look carefully at service redundancy. Although some direct emergency by-pass contacts from VOs to sites could be established on the basis of daily use cases, for the sake of flexibility, this should not be the rule.

To start with, we will consider procedures and tools similar to the ones presented by Maite Barroso and Piotr Nyczyk. Once the processes and procedures between T0, VOs and the Tier1-specific set-ups have been investigated, the existing procedures will be adapted, from SC4 feedback onwards.

Finally, some use cases for operations from Tier1s emerged; new ones can be sent to helene.cordier@in2p3.fr:
* Tier1-Tier2 interaction not completely overlapping with ROC-site interaction - cf. the ROC-D/CH vs TIER1-FZK use case;
* Some specific service run at Tier1 on behalf of Tier2.

Round of the T1s: RAL, CNAF, SARA, IN2P3, BNL, FNAL, FZK, TRIUMF, ASGC, NDGF, PIC and T0.
===========================================================================================
IN2P3 - we already have a good automatic callout; high-availability services are needed, as no operator on site is foreseen.
RAL - we are looking to improve our callout systems for exceptional cases, so automation is needed.
CNAF - we fixed serious problems within 24 hours during the Christmas holidays. A general grid expert is on call.
FNAL - we have people for 24/7. Operator with modus operandi at an off-site call center, ~$500/month.
FZK - we want to clarify which services really need callout. A 24-hour operator would be a big financial cost.
NDGF - we will be aiming for redundancy and expect most problems to occur during the first year only.
PIC - we will rely on calling experts and on high-availability services.
TRIUMF - being a national lab, they have 24/7 operators.
BNL - we have 2 operators for grid, and otherwise we can hire a company.
SARA - we have 24-hour coverage of network operations.
ASGC - we are going to hire four technicians.
CERN - full cover with escalation layers from operator to sysadmin to expert.
=========================================================================================