====== WP2: Big Data Science Software ======

The goal of this work package is, first, to improve the basic software of the KCDC data centre. We will then adapt this software to move the data centre to the Big Data environment provided by KIT-SCC and MSU-SINP. Both experiences will finally be used to provide a concept and a standard software package for scientific data centres. The project will thus serve as a general software solution for providing open access to (astroparticle) data, independently of the exact kind of data. The software needs to be built as a modular, flexible framework with good scalability (e.g. to large computing centres). The configuration should be kept simple and should also be possible via a web interface. The entire software will be based solely on Open Source Software (Python, Django, MongoDB, HDF5, ROOT, etc.). This step then enables the installation and publication of a dedicated Data Life Cycle Lab for the astroparticle physics community, which is what this project is aiming for. Dedicated access, storage, and interface software has to be developed.
| + | |||
A brief summary of the requirements, and of the features that should be improved, is compiled in the following: When publishing data, it is not enough to put some ASCII files on a plain webpage. To ensure that the users can actually use the data, extensive documentation on how the data has been obtained is needed. Depending on the kind of data, this is at least a description of the detector and of the reconstruction procedures employed. Since this information rarely changes, the use of static pages is a viable option. Another important aspect is user and access management. For KCDC it was enough to ensure that only registered users get access to the data and to certain parts of the portal, such as the data shop. Therefore, it was sufficient to check whether an account is active and the user is authenticated. While there is already a basic implementation of a permission-based access limitation, a useful categorization of the users into - possibly hierarchical - groups is needed to use it effectively (no administrator should manually manage the privileges of single users).
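The hierarchical groups mentioned above could, for example, be realized by resolving a user's effective permissions through a group tree, so that administrators only ever assign groups, never per-user privileges. A minimal pure-Python sketch of this idea (the class, group, and permission names are illustrative assumptions, not part of the KCDC codebase):

```python
# Sketch of hierarchical, group-based permissions: a user's effective
# permissions are the union of the permissions of all groups on the
# path from the user's groups up to the root of the hierarchy.

class Group:
    def __init__(self, name, permissions=(), parent=None):
        self.name = name
        self.permissions = set(permissions)
        self.parent = parent  # optional parent group in the hierarchy

    def effective_permissions(self):
        # Walk up the hierarchy and collect inherited permissions.
        perms = set(self.permissions)
        if self.parent is not None:
            perms |= self.parent.effective_permissions()
        return perms

class User:
    def __init__(self, name, groups=(), active=True):
        self.name = name
        self.groups = list(groups)
        self.active = active

    def has_permission(self, permission):
        # As in KCDC, an inactive account gets no access at all.
        if not self.active:
            return False
        return any(permission in g.effective_permissions() for g in self.groups)

# Example hierarchy: every data-shop user inherits plain portal access.
registered = Group("registered", {"view_docs"})
shop_users = Group("shop-users", {"use_data_shop"}, parent=registered)

alice = User("alice", groups=[shop_users])
print(alice.has_permission("use_data_shop"))  # True: direct group permission
print(alice.has_permission("view_docs"))      # True: inherited from parent group
```

An administrator then only decides which groups a user belongs to; the permission set follows from the hierarchy.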
The heart of the data centre is the data shop. The design goals have to ensure easy access to a user-defined subset of the whole dataset, a natural way to configure additional detector components or observables without the need to change the codebase, and a clear overview of previous requests, with the possibility to use these selections as templates for future requests. For KCDC, the data is stored in a MongoDB. Its schema-less design allows us to collect all available information of an event, although the available detector components may vary from event to event. In principle, KCDC can use any kind of input format, as these are implemented as plugins. Currently three output formats are supported: HDF5, ROOT, and ASCII. These, too, can be extended by adding additional plugins. The requests are processed using a Celery-based task queue. Simultaneous processing of requests can be achieved by adding more worker processes, which can be distributed among several machines. Together with the possibility to run the MongoDB on a sharded cluster, the processing power can thus be scaled with demand. The use and configuration of a sharded cluster has yet to be studied, however. Not yet included in KCDC, a nice addition would be the possibility to publish results, such as energy spectra, together with the option to select multiple spectra and to compile and download plots. Adding a way to visually select the data would also be possible. In addition, the appropriate hardware has to be installed and commissioned. The described concept and working plan is also valid for the next steps, i.e. to generalize and consolidate the software package of a global astroparticle data centre for public and scientific use.
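The data-shop design described above - schema-less event records, user-defined selections, and pluggable output formats - can be sketched compactly in plain Python. Dictionaries stand in for MongoDB documents here, and the cut syntax and plugin registry are illustrative assumptions, not the actual KCDC interfaces:

```python
# Registry for output-format plugins: new formats (HDF5, ROOT, ...)
# register here without any change to the selection code.
OUTPUT_PLUGINS = {}

def output_plugin(name):
    def register(func):
        OUTPUT_PLUGINS[name] = func
        return func
    return register

@output_plugin("ascii")
def to_ascii(events, fields):
    # Minimal ASCII export: tab-separated values, "nan" for missing fields.
    lines = ["\t".join(fields)]
    for ev in events:
        lines.append("\t".join(str(ev.get(f, "nan")) for f in fields))
    return "\n".join(lines)

def select(events, cuts):
    # MongoDB-style range selection: cuts maps a field to (low, high).
    # An event missing a cut field is excluded from the subset.
    return [ev for ev in events
            if all(f in ev and lo <= ev[f] <= hi
                   for f, (lo, hi) in cuts.items())]

# Schema-less records: not every event has every detector component.
events = [
    {"run": 1, "energy": 8.1, "zenith": 12.0},
    {"run": 1, "energy": 9.4},                 # no zenith reconstruction
    {"run": 2, "energy": 7.2, "zenith": 40.5},
]

subset = select(events, {"energy": (7.5, 10.0), "zenith": (0.0, 30.0)})
print(OUTPUT_PLUGINS["ascii"](subset, ["run", "energy", "zenith"]))
```

A stored request would essentially be the `cuts` dictionary plus the chosen output plugin, which is what makes reusing a previous selection as a template straightforward.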
| + | |||
Specific tasks:
  * Movement of KCDC to a large-scale computing facility and adaptation to the new environment (KIT-SCC, KIT-IKP, MSU-SINP)
  * Optimizing the data bank and the access interfaces (MSU-SINP, ISDCT, KIT-SCC)
  * Development of a distributed system for storing and archiving the data (MSU-SINP, KIT-SCC)
  * Installation of appropriate hardware (KIT-SCC)
  * Installation of the "Data Life Cycle Lab" (KIT-SCC)
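The Celery-based request processing described above scales by adding worker processes, possibly on several machines. The same pattern can be sketched with the standard library alone - a queue of requests consumed by a configurable number of workers; the request payload and function names are invented for illustration:

```python
import queue
import threading

def process_request(request):
    # Placeholder for the real work: query the database for the selected
    # events and convert them to the requested output format.
    return {"user": request["user"], "status": "done"}

def run_workers(requests, n_workers=4):
    # A Celery queue in miniature: throughput grows by raising n_workers
    # (or, in the real system, by adding worker machines).
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                req = tasks.get_nowait()
            except queue.Empty:
                return
            res = process_request(req)
            with lock:
                results.append(res)

    for req in requests:
        tasks.put(req)
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

requests = [{"user": f"user{i}"} for i in range(8)]
print(len(run_workers(requests, n_workers=3)))  # 8: all requests processed
```

In the production setup, Celery replaces the hand-rolled queue and adds brokering, retries, and distribution across hosts; the sketch only illustrates why adding workers scales the request throughput.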
