Tier 1 reports

 

ASGC:

Castor2 testbed

  * now upgraded to 2.1.0.4

  * encountered an error with the Castor NS returning "?" instead of the file location. The Castor team is assisting with this issue.

   * We used gdb and valgrind to try to track down the cause and found that the request handler service produced many memory errors when it called the Oracle library (10.2.0.1); a valgrind sketch follows this list.

   * The Castor2 package is compiled against the Oracle 10.2.0.2 library. We have patched Oracle to 10.2.0.2 to rule this out as a source of errors.
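
For reference, a minimal sketch of how one might relaunch the request handler under valgrind to capture these memory errors. This is illustration only: the daemon path and log location below are placeholders, not the actual Castor2 installation layout.

    #!/usr/bin/env python
    """Run a daemon under valgrind to capture memory errors around Oracle calls.

    Sketch only: the daemon path and log directory are placeholders, not the
    actual Castor2 installation layout.
    """
    import subprocess
    import time

    DAEMON = "/usr/local/bin/rh-daemon"      # placeholder path for the request handler
    LOGFILE = "/var/tmp/rh-valgrind-%d.log" % int(time.time())

    cmd = [
        "valgrind",
        "--tool=memcheck",        # explicit, although memcheck is the default tool
        "--leak-check=full",      # report individual leaked blocks at exit
        "--trace-children=yes",   # follow any forked worker processes
        "--error-limit=no",       # keep reporting past valgrind's default error cap
        "--log-file=" + LOGFILE,  # keep the report separate from the daemon's own logs
        DAEMON,
    ]

    print("Running %s under valgrind; report will be written to %s" % (DAEMON, LOGFILE))
    # Run in the foreground so the report is complete when the daemon exits.
    subprocess.call(cmd)

Errors whose stack frames point into the Oracle client library would support the version-mismatch theory above.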

 

T1 Operations

 * Jul 22, 2006

  * 3 CT errors found at the Quanta HPC cluster (since 2006-07-21 13:00:33)

   * the problem arose from one of the WNs having old CA packages; fixed at 2006-07-21 19:00:31 (see the sketch after this list)

 * Jul 21, 2006

  * updating CMSSW to 0_8_1 at ASGC

  * install CMSSW 0_8_0 and 0_8_1 at TW-FTT

 * Jul 20, 2006

  * starting bulk minbias production (1000 events) for CMS heavy ion group at ASGC

 * Jul 19, 2006

  * rebuilt CMS UI and migrated it to a new dual-core server
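
Since stale CA packages on a single WN are easy to miss, here is a rough sketch of a version check across the worker nodes. The lcg-CA meta-package name and the host names are assumptions for illustration; substitute whatever the site actually deploys.

    #!/usr/bin/env python
    """Compare the installed CA package version across worker nodes.

    Sketch only: 'lcg-CA' and the host list below are illustrative assumptions.
    """
    import subprocess

    WORKER_NODES = ["wn001.example.org", "wn002.example.org"]   # hypothetical hosts

    def ca_version(host):
        """Return the installed lcg-CA version on a host, or None if the query fails."""
        cmd = ["ssh", host, "rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}", "lcg-CA"]
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, _ = proc.communicate()
        return out.decode().strip() if proc.returncode == 0 else None

    versions = {}
    for host in WORKER_NODES:
        versions[host] = ca_version(host)

    for host in sorted(versions):
        print("%-30s %s" % (host, versions[host] or "rpm query failed"))

    # A mismatch here is exactly the situation that caused the CT failures above.
    if len(set(v for v in versions.values() if v)) > 1:
        print("WARNING: CA package versions differ across worker nodes")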

 

SC

 * The disk pool crash that occurred two weeks ago may have been related to the DAEMONV3_WRMT parameter in shift.conf for Castor (see the sketch after this list).

  * We believe this is why many dead processes (both rfio and gridftp) accumulated and occupied most of the system resources.

  * The disk pool nodes have been stable so far. We will continue monitoring to verify whether this improves stability.

 * CMS

  * Added the missing CMSPROD pool account in the production Castor instance to allow CMS file transfers to continue.
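
Regarding the DAEMONV3_WRMT item above, a small sketch of how one could report the relevant shift.conf entries on the disk pool nodes. It assumes the usual whitespace-separated 'CATEGORY KEYWORD VALUE' layout with '#' comments; check against the CASTOR documentation before relying on it.

    #!/usr/bin/env python
    """Report whether DAEMONV3_WRMT is set in /etc/shift.conf.

    Sketch under the assumption that shift.conf uses whitespace-separated
    'CATEGORY KEYWORD VALUE' lines with '#' comments.
    """

    CONF = "/etc/shift.conf"

    def find_keyword(path, keyword):
        """Yield (category, value) for every non-comment line whose keyword matches."""
        for line in open(path):
            line = line.split("#", 1)[0].strip()   # drop comments and blank lines
            if not line:
                continue
            fields = line.split()
            if len(fields) >= 3 and fields[1] == keyword:
                yield fields[0], fields[2]

    matches = list(find_keyword(CONF, "DAEMONV3_WRMT"))
    if matches:
        for category, value in matches:
            print("%s DAEMONV3_WRMT %s" % (category, value))
    else:
        print("DAEMONV3_WRMT is not set in %s" % CONF)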

 

CERN:

CERN T0/T1 Report

=================

CASTOR:

- With the agreement of the experiment, we dismantled stagelhcb on Monday.

- 36 BATCH boxes were moved from the ATLAS ITDC cluster to the CMS ITDC cluster, and LSF was configured to let CMS access them. CMS has started tests on the boxes.

- LSF long configuration time: On Sunday evening a specially instrumented mbatchd provided by Platform was deployed to enable them to understand our problems with the long reconfiguration time. A reconfiguration of LSF was done on lxmaster01 on Monday morning (announced previously) while the developers watched things as they happened. This finally helped them understand the root of the problem: when mbatchd reconfigures, it has to count the number of CPUs, which takes a long time for each node, so the size of the problem obviously scales with the number of nodes.

  The vendor provided us with a hotfix mbatchd yesterday, which we deployed. A reconfiguration had become urgent and had been postponed until we got the new mbatchd, so it was triggered yesterday (going ahead with several configuration changes). It seemed to work at first, but at around 12:15pm the system fell over and went into a strange state, with some commands working (like bjobs) while job scheduling did not, and commands like bqueues or bugroup failed. Before attempting to restart mbatchd, its debugging was switched on to allow the vendor to follow up on the incident later. The failure of some b-commands caused a backlog of requests, in particular from the CEs, which additionally flooded the LSF master. We decided to roll back to the previous version of mbatchd and to reverse parts of the configuration changes. After several attempts to restart LSF the system was brought back into production; it was fully back at around 14:00.

  The whole incident was reported to the vendor immediately, and they are investigating. Due to ongoing tests by ATLAS no further steps can be taken until the end of the week. The vendor plans to provide us with yet another mbatchd by the end of the week. We are planning an intervention similar to the one we made last weekend: deploy the new mbatchd provided by Platform on Sunday evening, and do a reconfiguration early Monday morning with vendor developers monitoring what is going on. This intervention will be announced later today.

 

IN2P3:

* T1:

  The dCache SE is currently stopped for maintenance. The ATLAS, CMS and LHCb queues were temporarily removed from our CE to prevent access attempts to this SE. FTS was stopped too.

  CE disk problem on Saturday. ATLAS jobs were certainly lost, and the CE was unavailable.

 

GridKa:

Report for Tier1 GridKa (FZK):

16/7 - 21/7

 

18/7 GridKa experienced a complete power cut after a cooling system failure. Systems were down from 15/7 16:00 to 18/7 12:00. Job submission resumed at 18/7 18:00.

 

* Storage

FTS channels to all 8 Russian (RDIG) ALICE Tier2 sites were set up and tested (a channel-setup sketch follows below).

The last activity has been moved from FTS 1.4 to FTS 1.5.

The FTS 1.4 endpoint (f01-015-111-e.gridka.de) will go offline on 28/7 at 18:00.
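
As a rough illustration of the bookkeeping involved in setting up such channels, a sketch that loops over the Tier2 sites and invokes the FTS channel-management CLI. The service URL, site names, channel naming and the exact glite-transfer-channel-add arguments are assumptions and should be checked against the FTS 1.5 channel-management documentation.

    #!/usr/bin/env python
    """Create FTS channels from GridKa to a list of Tier2 sites.

    Sketch only: the site names, channel naming convention, service URL and
    the glite-transfer-channel-add argument order are assumptions; verify
    them against the local FTS documentation before use.
    """
    import subprocess

    FTS_SERVICE = "https://fts.example.org:8443/glite-data-transfer-fts/services/ChannelManagement"
    SOURCE_SITE = "FZK"                            # hypothetical site name for GridKa
    TIER2_SITES = ["RU-SITE1", "RU-SITE2"]         # placeholder RDIG site names

    for dest in TIER2_SITES:
        channel = "%s-%s" % (SOURCE_SITE, dest)    # assumed naming convention
        cmd = [
            "glite-transfer-channel-add",
            "-s", FTS_SERVICE,                     # channel-management endpoint
            channel, SOURCE_SITE, dest,            # assumed positional arguments
        ]
        print("creating channel %s" % channel)
        rc = subprocess.call(cmd)
        if rc != 0:
            print("channel %s: command exited with %d" % (channel, rc))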

 

* MW

vobox cms pending, vobox lhcb on-line

PPS pending

all lcg/glite central nodes on virtual machines to improve recovery time

update of some information provider functionalities still in progress (GIIS to batch system)

 

* Network

17/7: OPN to CERN production postponed to 31/7

 

INFN:

SC4 weekly report

Low activity because of a cooling problem at the CNAF Computing Center during the last week. The air conditioning equipment was upgraded on Friday; the problem should be solved.

 

[ATLAS and CMS]

CNAF is in scheduled downtime for the Castor-2 migration for ATLAS and CMS. Most ATLAS and CMS effort at INFN was concentrated on testing the new infrastructure. CMS provided a report covering all possible combinations of disk and tape access points in the new Castor-2 setup, in order to give the storage group feedback on real experiment use-cases. The CMS LoadTest activity restarted on Friday, Jul 21st, though at a low rate so as not to overload the system. ATLAS is restarting as well.

 

[LHCb]

For testing of T1_disk-T1_disk channels, R. Santinelli asked all T1s to set up a PIC_DISK-<your_T1> channel, since PIC has put a new disk-only access point (srm-disk.pic.es) into production. Some discussion arose about whether this is actually needed, since a site may be using service discovery and pointing to the BDII rather than to a static services.xml file. No action has been taken yet on the CNAF FTS server.

 

[ALICE]

ALICE is testing the Castor-2 installation at CNAF (already migrated some weeks ago).

 

NIKHEF/SARA:

Report Nikhef 17-24 July

 

* To prevent a (compromised) local user account from exploiting a kernel

bug (/proc environment race condition) and gaining root access, we have

applied a workaround on all hosts in the cluster.

 

* There have been some problems with the MySQL database on the resource broker. The database file had grown to almost 4 GB, leaving no free space for new records. The problem is that old records are not cleaned up regularly, even though a cleanup script has apparently been available for a long time; it is simply not configured in the standard installation. (A cleanup sketch follows at the end of this report.)

 

* We worked on supporting LHCb production users via VOMS.

 

* A second hard disk was installed in the Alice VO box. After powering

on the machine again, the first hard disk suffered from hardware

problems and had to be replaced.

 

* The grid facility has run from Friday morning until Saturday on power

from a local generator. A problem with the regular power supply left the

entire institute without power, but this has not affected the grid

facility.
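
Regarding the resource broker database above: until the standard installation ships a configured cleanup job, something along these lines run from cron would keep the database bounded. The database, table and column names are hypothetical stand-ins, not the real RB schema, and must be checked before running anything like this against a production database.

    #!/usr/bin/env python
    """Purge old records from the resource broker's MySQL database.

    Sketch only: 'lbserver20', 'jobs' and 'submit_time' are hypothetical
    stand-ins for the real schema.
    """
    import MySQLdb   # assumes the MySQL-python bindings are installed

    KEEP_DAYS = 30   # arbitrary retention window for this illustration

    conn = MySQLdb.connect(host="localhost", user="rb_admin",
                           passwd="secret", db="lbserver20")
    cur = conn.cursor()

    # Delete records older than the retention window (hypothetical table/column).
    cur.execute(
        "DELETE FROM jobs WHERE submit_time < DATE_SUB(NOW(), INTERVAL %s DAY)",
        (KEEP_DAYS,))
    print("deleted %d old records" % cur.rowcount)
    conn.commit()

    # Defragment the table; note that a shared InnoDB tablespace file will not
    # shrink on disk even after old rows are removed.
    cur.execute("OPTIMIZE TABLE jobs")
    conn.close()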

 

------------------------

Report SARA tier1 service

 

Due to failing cooling equipment on July 17 and July 21 we had to shut down some services, among which the Tier1 services.

 

PIC:

pic (Tier-1 report)

====

- We had some problems with ce04's gatekeeper. Problem solved.

- We have plenty of R-GMA errors, which seem to be false positives.

- Some problems with the PBS server; jobs got stuck. Problem solved.

- Operational problems with the tape subsystem. It seems tapes sometimes do not get loaded or unloaded correctly, and only manual intervention can fix this. We expect some interruption of the tape service soon due to hardware replacement.

- Disk overloads. Our storage boxes seem to cope poorly when a lot of files are written to one particular machine. Some SFT tests failed due to constantly high load on some disks, which showed up as "timeouts".

- Savannah bug 17738 is still appearing in some of the tests, though the frequency has clearly diminished.

- GridFTP traffic (mainly from ATLAS) fluctuates between 20 and 40 Mb/s. Only on Tuesday did the rate decrease, and that was due to the Tier-0 not providing enough bandwidth for everyone.

 

RAL:

We took pre-emptive action because of expected record temperatures. We had the hottest July day since records began in 1820.

Batch work was draining in the morning and was killed off at around 14:00. 270 jobs were lost.

Storage facilities remained operational.