WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (VRVS "Saturn" room)

Maite Barroso
Description
The VRVS "Saturn" room will be available from 15:30 until 18:00 CET.
    • 14:00 – 17:25   28-R-15
      • 16:00
        Feedback on last meeting's minutes 5m
      • 16:05
        Grid-Operator-on-Duty handover 5m
      • From SouthEasternEurope (backup: Italy) to CERN (backup: Russia)

      • The CA upgrade went much better than the previous time, thanks to the larger upgrade window.
        We created tickets for sites that do not support the OPS VO, that fail either JL or JS, and whose site name begins with A-I.
        I will pass the list of sites that do not support the OPS VO to the next COD, so that they can continue this task.
        Due to the middleware upgrade in PPS, I extended the deadline on some of the tickets for PPS sites that were due for escalation.
  • 16:10
    SC4 weekly report and upcoming activities 10m
    Speaker: Harry Renshall
  • 16:20
    EGEE broadcast: review of VO targets 5m
    We need to extend the CIC portal broadcast with other classes of email contacts, for example production managers' and data managers' mailing lists. We would rather not go down the route of local experiment contacts, because they might not be the right people.
    Speaker: Jeremy Coles
  • 16:30
    Issues to discuss from reports 25m

    Reports were not received from [put ROCs, sites, VOs here]

  • 1. AP: We are working on accounting with a site that runs Condor 6.7.17. Do other ROCs have experience with this? We will try contacting other Condor sites.
  • 2. CE: Information System instabilities (large sites). The fix given at http://goc.grid.sinica.edu.tw/gocwiki/LCG_Release_Fixes (section "Information System Instabilities") does not seem to improve the situation much for some sites. The fix says to move the site GIIS to another machine. That avoids the "site GIIS down" problem, but the information providers (the MDS instances listening on port 2135) are still on the machine that is overloaded by the PBS server, which results in "CPU count erratic" and "missing services" errors in GStat. We would rather move the PBS server off the CE than move the GIIS. Does this fix work for others?
  • 3. CE: Related to the above: we reported the problem to LHCb (the VO causing the excessive load), and the VO said the load is caused by many jobs *failing* at the site (GGUS ticket 10716). As the only remedy to save the site, they closed the LHCb queue. Is it possible for failing LHCb jobs not to cause such an overload on the CE? We do not see this problem with other VOs. (The site cannot fix the problem itself: when they tried to find its origin, the site had disappeared from the web page provided by the VO, https://webafs3.cern.ch/santinel/cgi-bin/logging_info. There were many successfully completed jobs, so digging through the local logs to find the failing ones is said to be rather hard.)
  • 4. CE: There is an SLC bug that makes the 'id' command from the coreutils package segfault in glibc, so systems fail when starting the bdii init script. Site admins in CE suggested using 'id' from the older RPM version.
  • 5. SEE: As mentioned at the Operations Workshop (19th-20th June, CERN), sites upgrading should avoid draining queues and scheduling downtimes; the latest announcement of the PPS upgrade was along the same lines. Would it be possible to include such advice in each announcement of upgrades to the production service? E.g. a statement that no draining or downtime is needed in order to upgrade to the new release, or, if a downtime is needed, advice to that effect.
  • 6. SWE: Lots of queued jobs on the CE cause a high load on that node, with bad consequences (processes like the GRIS get slower). Since this causes operational problems, we want to ask whether it is acceptable that, until the CE middleware can manage many queued jobs without such a high load, we configure a hard limit on the number of queued jobs per VO or per user at the batch-system level. This might cause problems on the RB, but we need it to keep the CE under control.
  • 7. SWE: The high load on the CE caused the GRIS (slapd on port 2135) to respond very slowly. We tried to move the local information provider on the CE from the GRIS on port 2135 to a BDII on port 2170 that just triggers the GIP and publishes the information under mds-vo-name=resource,o=grid. We would like to hear from IS experts whether this is an acceptable solution.
  • 8. SWE: The ATLAS SRM disk space at PIC is completely full. This is probably SC4 data that can be deleted. We would like clarification on whether we should delete this data periodically or the application does that. If we need to do it, we would like a simple algorithm such as: delete any file older than 48h under the directory "/sc4tier0/", with no need to delete the LFC entries (a sketch of such a cleanup appears after this list).
  • 9. SWE: For the T1s there is a web page from which the SAM availability value per day can be obtained (http://lcg-sam.cern.ch:8080/sqldb/site_avail.xsql). This overall availability is computed as the average of the "1 or 0" hourly availabilities over the day; each hourly value is computed from the per-site availability using an algorithm described in the GOC wiki. It would be very useful for T1 sites to be able to read the hourly availability bit itself from a web page, so that we could interface the local alarm system to it (see the sketch after this list).
  • 10. UKI: Still no answer on the lcg-CE Condor dependency issue; no reply to GGUS ticket #10162 yet. One question for the Operations Meeting: can anyone suggest where to go next with this issue? Reporting it here has not brought any luck so far. Is anyone actually looking after this issue? It seems not.
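  • Regarding item 8: a minimal sketch of the proposed cleanup, assuming the SC4 area is POSIX-mounted at "/sc4tier0/" on the disk server (the mount point and direct filesystem access are assumptions; on a pure SRM interface the deletion would have to go through the SRM client tools instead):

        #!/usr/bin/env python
        # Sketch: delete any file older than 48 hours under /sc4tier0/,
        # as proposed in item 8. LFC entries are deliberately left
        # alone, per the proposal.
        import os
        import time

        SC4_DIR = "/sc4tier0/"   # assumed mount point of the SC4 area
        MAX_AGE = 48 * 3600      # 48 hours, in seconds

        now = time.time()
        for dirpath, dirnames, filenames in os.walk(SC4_DIR):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    age = now - os.stat(path).st_mtime
                except OSError:
                    continue     # file vanished or unreadable; skip it
                if age > MAX_AGE:
                    print("removing " + path)
                    os.remove(path)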
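  • Regarding item 9: a sketch of how a T1 could hook its local alarm system to an hourly availability bit, assuming the requested value were published at a URL similar to the existing site_avail.xsql page. The URL, its "site" parameter and the single-digit reply format below are hypothetical, since the per-hour interface does not exist yet:

        #!/usr/bin/env python
        # Hypothetical poller for an hourly SAM availability bit (item 9).
        # URL and reply format are assumptions: the real site_avail.xsql
        # page currently publishes only per-day averages.
        import urllib

        URL = "http://lcg-sam.cern.ch:8080/sqldb/site_avail_hourly.xsql"  # hypothetical
        SITE = "pic"                                                      # example T1 name

        def hourly_bit(site):
            """Fetch the latest hourly availability bit (assumed "0" or "1")."""
            reply = urllib.urlopen(URL + "?site=" + site).read().strip()
            return int(reply)

        if __name__ == "__main__":
            if hourly_bit(SITE) == 0:
                # Hook the local alarm system in here, e.g. send a mail
                # or raise a Nagios alert.
                print("ALARM: hourly SAM availability bit is 0 for " + SITE)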
  • VO REPORTS

  • ALICE: Continuing the T0-T1 transfers.
    CNAF: not working; problem with one of the disk servers.
    CCIN2P3 and FZK: working fine.
    SARA: problems with the local catalogue.
    RAL: transfers pending.
  • 16:35
    Review of action items 15m
  • 17:20
    AOB 5m