ATLAS experience with LCG-1 and EDG2 software and testbeds.
-----------------------------------------------------------

The LCG production testbed will be used by ATLAS during the upcoming Data Challenges as a computing Grid prototype for distributed production and analysis. The Grid middleware deployed by the LCG-1 prototype is largely based on the EDG2 releases, so that the early tests could be run interchangeably on the LCG-1 production testbed and the EDG2 applications testbed. Although there are implementation differences, they are largely transparent to end-users. The EDG2 applications testbed tends to be ahead in introducing new services, so it is important for ATLAS users to test such new features on the EDG2 testbed, and provide feedback, before they are propagated to the LCG-1 facilities.

In October 2003, ATLAS users started to get access to the LCG and EDG resources, installed with the latest stable versions of the respective middleware. In addition to the LCG-1 Production Testbed and the EDG2 Applications Testbed, the LCG-1 Certification Testbed was made available for the tests, to validate the middleware and its interaction with the ATLAS software. Different groups conducted different sets of tests, as briefly described below.

1) The tests of the middleware started at the LCG-1 certification testbed on October 13: ATLAS was given time slots from 14:00 to 18:00 every day from October 13 to 17. During those tests:

- 10 input data files (30-50 MB each) were copied and registered directly from CASTOR to all 5 available SEs, using only the LCG-1 Replica Management tool. The only problem encountered was the CASTOR staging time-out, which the tool does not accommodate. A feature request has already been submitted to the EDG developers.

- ATLAS software (v6.0.4) was installed by the privileged ATLAS user in a dedicated shared area, cross-mounted on all the Worker Nodes.
The installation used for the final tests was from RPMs produced for NorduGrid and the US Grid, i.e., having the "install area" already enabled. Although none of the testers had worked with this particular dataset before, and this particular configuration of the ATLAS s/w had never been deployed before, with the help of ATLAS s/w and DC experts we were able to adapt the scripts very quickly.

- While the JDL now makes it possible to specify environment variables (like $ATLAS_ROOT) from outside the job, the input/output data staging still has to be performed by the job itself. For this, the "typical" job scripts of DC1 were wrapped into Replica Management utilities: "getBestFile" for stage-in and "copyAndRegisterFile" for stage-out operations. Although this involves a lot of string editing by users, it is not very complex and worked nicely.

- A test of brokering was attempted, by submitting 40 jobs using the pre-staged files as input data. All the jobs were submitted via the Resource Broker, and the workload distribution correctly followed the data distribution pattern.

- However, even with the smallest possible input data size, the jobs took too long to process a partition in the given time slot; not a single job ran to completion, unless it was forced to process only 5 events or so.

2) Meanwhile, tests at the production LCG-1 testbed were attempted, using the legacy ATLAS s/w: version 3.2.1 from last year's tests was never removed from the EDG automated installation, and is inherited by the LCG.

- An attempt was made to replicate a single input file for full simulation (ca 2 GB) across all the SEs. Replication failed due to credential problems in Moscow, Prague, Krakow and Tokyo. All of those sites except Krakow later had their certificates renewed.

- A full simulation of 25 events proceeded successfully on the rest of the testbed, failing only at FZK and Taipei due to an NFS file access failure.
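As an illustration of the submission pattern described above, a minimal JDL for such a wrapped job might look roughly like the following. This is a sketch only: the wrapper name, paths and the LFN are made up, and the exact attribute set depends on the EDG/LCG release in use.

```
Executable    = "dc1-wrapper.sh";
StdOutput     = "stdout.log";
StdError      = "stderr.log";
InputSandbox  = {"dc1-wrapper.sh"};
OutputSandbox = {"stdout.log", "stderr.log"};
// environment passed to the job from outside, as described above
Environment   = {"ATLAS_ROOT=/opt/atlas/6.0.4"};
// logical file name (illustrative) used by the broker for data-driven matching
InputData     = {"lfn:dc1.002000.test.0001.zebra"};
DataAccessProtocol = {"file", "gridftp"};
```

Inside the wrapper script itself, stage-in and stage-out would then be done with the Replica Management utilities named above: "getBestFile" to fetch the best replica of the input before processing, and "copyAndRegisterFile" to store and register the output afterwards.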
- These tests will become a part of the "standard" LCG test suite, and are planned to be extended to the EDG Applications testbed.

3) From October 2, access to the EDG Applications Testbed was opened, and an extension of the tests above was prepared:

- Several input files (the same as in the LCG-1 Cert. TB test) were copied from CASTOR and replicated over the available SEs, using the Replica Management tool only. Unfortunately, the system was not particularly stable, and it was impossible to test all the SEs.

- Simple tests of data-driven job submission were performed.

- ATLAS s/w installation could not be done system-wide even by a privileged user, and the RPM sets used in the previous LCG-1 Cert. Testbed tests did not satisfy EDG requirements (e.g., no relocation can be done by the EDG/LCG automated installation procedure). Also, the latest ATLAS s/w, which users would like to evaluate, is not officially available as RPMs (RPM is not supported/endorsed by ATLAS). Hence, as a temporary solution, the EDG and LCG-1 testbeds were asked to deploy the 6.0.4 version of the ATLAS RPMs, prepared for earlier EDG tests on a dedicated setup.

- On November 3, EDG asked to suspend the testing, as the testbed had developed multiple problems, most notably with the R-GMA information system.

4) By November 10, version 6.0.4 of the ATLAS software was installed at the LCG-1 production testbed (14 sites), and users were asked to test it. The same reconstruction test was in scope.

- As the testbed advertises 17 Storage Elements, an attempt was made to copy the first 17 partitions of the input dataset from CASTOR and register them in the data management system, using the Replica Management tool only. 11 files were copied successfully. Copies to FNAL and Budapest failed because the sites do not authorise ATLAS users. The copy to Krakow failed due to the still unresolved site credentials issue.
Copies to FZK, Moscow and Prague failed for still unknown reasons, the latter probably due to the firewall configuration in Prague.

- Meanwhile, a full simulation test using release 6.0.4 was made, by processing a single input file stored on an SE in Valencia. The task proceeded successfully, being local to Valencia (as all the components used were local and no input files were stored elsewhere).

- On November 12, LCG asked to suspend the tests because of unspecified problems developed by the testbed.

The experience gained during the past month can thus be summarized as follows:

- The LCG-1 and EDG testbeds present a wealth of resources, most of which can be used by ATLAS in Data Challenges and other tasks.

- It is unclear whether all the resources will eventually authorize ATLAS users, and if not, whether or how soon the VO management tools will be ready to prevent ATLAS jobs or files from landing on a non-ATLAS resource.

- In general, despite the automated installation, any site can get misconfigured, and none of the testbeds appears to have a mechanism for avoiding such sites.

- The middleware is still far from being ready for a production-level task or distributed analysis.

- The user interfaces are non-intuitive, and a lot of scripting has to be done by users in order to port their tasks to the LCG/EDG middleware.

- An experiment software installation mechanism that does not use RPMs was developed with the participation of ATLAS, and was proposed to the LCG only very recently. It is unclear, though, when it will be implemented, and many implementation details are not yet decided.

- As ATLAS relies heavily on CASTOR, a solution or a workaround for the staging timeouts has to be found.

- With the new data management concept, physical file names on the Grid are irrelevant and contain no metadata.
This means that ATLAS file names must be translated into Logical File Names (LFN), and any task that makes use of the ATLAS naming scheme must be preceded by linking or copying files from the Storage Element under the "new old" name derived from the LFN.

- While the LCG has organised an Experiment Integration Support team and the corresponding mailing list, there is no known analogue of the EDG's Integration Team list, where users and sysadmins alike can ask for help or advice, and learn about the latest situation with the testbed.
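The "new old" naming step mentioned above can be sketched in a job script roughly as follows. The LFN and paths are made up for illustration, and the actual stage-in would be performed by the Replica Manager's "getBestFile" beforehand; here the staged replica is simulated with an empty file.

```shell
#!/bin/sh
# Sketch only: the LFN and paths below are illustrative, not real DC1 names.
lfn="lfn:dc1.002000.test.0001.zebra"   # logical name registered on the Grid
staged="$PWD/stagein/replica.dat"      # local copy fetched by the job,
                                       # e.g. via the Replica Manager's getBestFile

mkdir -p "$PWD/stagein"
: > "$staged"                          # stand-in for the staged replica in this sketch

# Re-create the ATLAS file name expected by the DC scripts:
# strip the "lfn:" prefix and link the staged copy under that name.
atlas_name="${lfn#lfn:}"
ln -sf "$staged" "./$atlas_name"       # the job then reads ./dc1.002000.test.0001.zebra
```

Copying instead of linking works the same way, at the cost of extra disk space on the Worker Node.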