ATLAS experience with LCG-1 and EDG2 software and testbeds.
-----------------------------------------------------------

The LCG production testbed will be used by ATLAS during the upcoming Data Challenges as a computing Grid prototype for distributed production and analysis. The Grid middleware deployed by the LCG-1 prototype is largely based on the EDG2 releases, so that the early tests could be run interchangeably on the LCG-1 production testbed and the EDG2 applications testbed. Although there are implementation differences, they are largely transparent to end-users. The EDG2 applications testbed tends to be ahead in introducing new services, so it is important for ATLAS users to test such new features on the EDG2 testbed, and provide feedback, before they are propagated to the LCG-1 facilities.

In October 2003, ATLAS users started to get access to the LCG and EDG resources, installed with the latest stable versions of the respective middleware. In addition to the LCG-1 Production Testbed and the EDG2 Applications Testbed, the LCG-1 Certification Testbed was made available for the tests, to validate the middleware and its interaction with the ATLAS software. Different groups conducted different sets of tests, as briefly described below.

1) The tests of the middleware started at the LCG-1 certification testbed on October 13: ATLAS was given time slots from 14:00 to 18:00 every day from October 13 to 17. During those tests:

- 10 input data files (30-50 MB each) were copied and registered directly from CASTOR to all 5 available SEs, using only the LCG-1 Replica Management tool. The only problem encountered was the CASTOR staging time-out, which the tool does not accommodate. A feature request has already been submitted to the EDG developers.

- ATLAS software (v6.0.4) was installed by the privileged ATLAS user in a dedicated shared area, cross-mounted on all the Worker Nodes.
The installation used for the final tests was from RPMs produced for NorduGrid and the US Grid, i.e., having the "install area" already enabled. Although none of the testers had worked with this particular dataset before, and this particular configuration of the ATLAS s/w had never been deployed before, with the help of ATLAS s/w and DC experts we were able to adapt the scripts very quickly.

- While the JDL now makes it possible to specify environment variables (like $ATLAS_ROOT) from outside the job, the input/output data staging still has to be performed by the job itself. For this, the "typical" job scripts of DC1 were wrapped into Replica Management utilities: "getBestFile" for stage-in and "copyAndRegisterFile" for stage-out operations. Although this involves a lot of string editing by users, it is not very complex and worked nicely.

- A test of brokering was attempted, by submitting 40 jobs using the pre-staged files as input data. All the jobs were submitted via the Resource Broker, and the workload distribution correctly followed the data distribution pattern.

- However, even with the smallest possible input data size, the jobs took too long to process a partition in the given time slot; not a single job ran to completion, unless it was forced to process only 5 events or so.

2) Meanwhile, tests at the production LCG-1 testbed were attempted, using the legacy ATLAS s/w: version 3.2.1 from last year's tests was never removed from the EDG automated installation, and is inherited by the LCG.

- An attempt was made to replicate a single input file for full simulation (ca 2 GB) across all the SEs. Replication failed due to credential problems in Moscow, Prague, Krakow and Tokyo. All of those sites except Krakow later had their certificates renewed.

- A full simulation of 25 events proceeded successfully on the rest of the testbed, failing only at FZK and Taipei due to an NFS file access failure.
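As an illustration of the submission pattern described above, a minimal JDL for such a wrapped job might look roughly like the following. This is a sketch only: the wrapper name, paths and the LFN are made up, and the exact attribute set depends on the EDG/LCG release in use.

```
Executable    = "dc1-wrapper.sh";
StdOutput     = "stdout.log";
StdError      = "stderr.log";
InputSandbox  = {"dc1-wrapper.sh"};
OutputSandbox = {"stdout.log", "stderr.log"};
// environment passed to the job from outside, as described above
Environment   = {"ATLAS_ROOT=/opt/atlas/6.0.4"};
// logical file name (illustrative) used by the broker for data-driven matching
InputData     = {"lfn:dc1.002000.test.0001.zebra"};
DataAccessProtocol = {"file", "gridftp"};
```

Inside the wrapper script itself, stage-in and stage-out would then be done with the Replica Management utilities named above: "getBestFile" to fetch the best replica of the input before processing, and "copyAndRegisterFile" to store and register the output afterwards.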
- These tests will become a part of the "standard" LCG test suite, and are planned to be extended to the EDG Applications testbed.

3) From October 2, access to the EDG Applications Testbed was opened, and an extension of the tests above was prepared:

- Several input files (the same as in the LCG-1 Cert. TB test) were copied from CASTOR and replicated over the available SEs, using the Replica Management tool only. Unfortunately, the system was not particularly stable, and it was impossible to test all the SEs.

- Simple tests of data-driven job submission were performed.

- ATLAS s/w installation could not be done system-wide even by a privileged user, and the RPM sets used in the previous LCG-1 Cert. Testbed tests did not satisfy EDG requirements (e.g., no relocation can be done by the EDG/LCG automated installation procedure). Also, the latest ATLAS s/w, which users would like to evaluate, is not officially available as RPMs (RPM is not supported/endorsed by ATLAS). Hence, as a temporary solution, the EDG and LCG-1 testbeds were asked to deploy the 6.0.4 version of the ATLAS RPMs, prepared for earlier EDG tests on a dedicated setup.

- On November 3, EDG asked to suspend the testing, as the testbed had developed multiple problems, most notably with the R-GMA information system.

4) By November 10, version 6.0.4 of the ATLAS software was installed at the LCG-1 production testbed (14 sites), and users were asked to test it. The same reconstruction test was in scope.

- As the testbed advertises 17 Storage Elements, an attempt was made to copy the first 17 partitions of the input dataset from CASTOR and register them in the data management system, using the Replica Management tool only. 11 files were copied successfully. Copies to FNAL and Budapest failed because the sites do not authorise ATLAS users. The copy to Krakow failed due to the still unresolved site credentials issue.
Copies to FZK, Moscow and Prague failed for still unknown reasons, the latter probably due to the firewall configuration in Prague.

- Meanwhile, a full simulation test using release 6.0.4 was made, by processing a single input file stored on an SE in Valencia. The task proceeded successfully, being local to Valencia (as all the components used were local and no input files were stored elsewhere).

- On November 12, LCG asked to suspend the tests because of unspecified problems developed by the testbed.

The experience gained during the past month can thus be summarized as follows:

- The LCG-1 and EDG testbeds present a wealth of resources, most of which can be used by ATLAS in Data Challenges and other tasks.

- It is unclear whether all the resources will eventually authorize ATLAS users, and if not, whether or how soon the VO management tools will be ready to prevent ATLAS jobs or files from landing on a non-ATLAS resource.

- In general, despite the automated installation, any site can get misconfigured, and none of the testbeds appears to have a mechanism for avoiding such sites.

- The middleware is still far from being ready for a production-level task or distributed analysis.

- The user interfaces are non-intuitive, and a lot of scripting has to be done by users in order to port their tasks to the LCG/EDG middleware.

- An experiment software installation mechanism that does not use RPMs was developed with the participation of ATLAS, and was proposed to the LCG only very recently. It is unclear, though, when it will be implemented, and many implementation details are not yet decided.

- As ATLAS relies heavily on CASTOR, a solution or a workaround for the staging timeouts has to be found.

- With the new data management concept, physical file names on the Grid are irrelevant and contain no metadata.
This means that ATLAS file names must be translated into Logical File Names (LFN), and any task that makes use of the ATLAS naming scheme must be preceded by linking or copying files from the Storage Element under the "new old" name derived from the LFN.

- While the LCG has organised an Experiment Integration Support team and the corresponding mailing list, there is no known analogue of the EDG's Integration Team list, where users and sysadmins alike can ask for help or advice, and learn about the latest situation with the testbed.
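The "new old" naming step mentioned above can be sketched in a job script roughly as follows. The LFN and paths are made up for illustration, and the actual stage-in would be performed by the Replica Manager's "getBestFile" beforehand; here the staged replica is simulated with an empty file.

```shell
#!/bin/sh
# Sketch only: the LFN and paths below are illustrative, not real DC1 names.
lfn="lfn:dc1.002000.test.0001.zebra"   # logical name registered on the Grid
staged="$PWD/stagein/replica.dat"      # local copy fetched by the job,
                                       # e.g. via the Replica Manager's getBestFile

mkdir -p "$PWD/stagein"
: > "$staged"                          # stand-in for the staged replica in this sketch

# Re-create the ATLAS file name expected by the DC scripts:
# strip the "lfn:" prefix and link the staged copy under that name.
atlas_name="${lfn#lfn:}"
ln -sf "$staged" "./$atlas_name"       # the job then reads ./dc1.002000.test.0001.zebra
```

Copying instead of linking works the same way, at the cost of extra disk space on the Worker Node.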