Kerberos, AFS and Batch Systems

Introduction

Kerberos is a system for securely authenticating users in an unsecure network environment. It was developed in the 1980s at the MIT as part of the famous project Athena. During the 1990s Kerberos V5 was standardized in RFC 1510 and became widely used, especially after Microsoft decided to base Windows 2000 security on it. Within Kerberos, each user has a Ticket Granting Ticket (TGT) which can be used to acquire dedicated service-tickets. These service-tickets finally are used to authenticate a user to that service.

AFS (the "Andrew File System") was started at the Carnegie-Mellon University as a research project. By using a slightly modified Kerberos V4 system they built a secure network filesystem which allows several data-storing servers and complex access control lists. Today AFS is used all over the world, especially by Universities and other research institutions. From the Kerberos point of view, AFS is just a service. AFS itself requires the user to own a valid Kerberos ticket for the service "AFS", this is often called the "AFS-Token".

A Batch System is needed for controlling resource usage of special machines like a supercomputer or some compute servers. Users just define what their jobs need (e.g. the number of CPUs), the Batch System decides if and when the user will get these resources. It then takes care of starting (and probably killing) the job, finally some accounting information about the job is saved for accounting or statistics. There are several commercial and non-commercial Batch Systems available these days, two important open source systems are Torque/OpenPBS and Sun's Grid Engine.

This article is about the problems which arise, when these technologies are used together: Imagine a user submits a job which uses files from his home-directory in AFS. This requires the Batch System to make sure that the job has the users AFS-Token while running or the job would not be able to access the files. Making these files readable by anyone would allow the Batch System to not care about AFS, but this is not an option because of security considerations. Another problem is the limited lifetime of the AFS-Token, it has to be prolonged or renewed to allow the job access AFS continuously.

More information:

Kaserver / Kerberos V4 and AFS-Tokens

The Kaserver is a modified Kerberos V4 server used by the original AFS setup. It has some special features and speaks an additional network protocol ("Rx") used only by AFS utilities. Its possible to use a Kerberos V4 server instead of the Kaserver (Using MIT's Kerberos Server with AFS). The normal lifetime of an AFS-Token is configured per user on the server, usually its about 25 hours. The Token is isolated against the rest of the system in a Process Authentication Group (PAG) called structure which is identified by two unique group IDs.

Renewal of an AFS-Token

One possibility to provide a long lasting job with a token is to regain the token from time to time. The process doing this usually needs some knowledge about the users password though, either by asking him during job submission or by requiring him to store it somewhere.

One tool for this task is the "Password Storage and Retrieval System" (PSR), which uses asymmetric cryptography to securely store the users password encrypted in his AFS space. When a job needs to acquire a new token, the password gets decrypted and is used to simply request a new token. The secret key needed to decrypt the password is only stored on the machine that runs the job. If the user changes his password and updates the encrypted storage too, the new password is automatically used on the next renewal.

Another way to store the password is to use a "SRVTAB" file. Such a file is normally used to store a server key but it can also be used to store the key of a user. The stored key is not the plaintext password but some kind of hash. This way the password is not revealed, but be aware: Concerning Kerberos, the hash can be used just like the password. So when a job needs to acquire a new token, the hash can simply be used. You can find a description of this technology here: UMich: "How to run long lived jobs with AFS" Some quick hints to be used with KTH Kerberos V4: Create the SRVTAB like this: ksrvutil -f mysrvtab -c example.com add where example.com is the name of your AFS cell. Enter your username when prompted for "Name:", the name of your AFS cell in uppercase for "Realm:" and just press enter for "Instance:" and "Version Number:". It will then ask a password twice, enter your normal AFS password. The created file "mysrvtab" can be used like this: kauth -n myname -f mysrvtab bash where myname is your username. kauth will run the given command (here: bash) and repeatedly renew the AFS-Token by using the secret from mysrvtab.

Prolongation of an AFS-Token

A completely different approach to extending the lifetime of AFS-Tokens is to prolong them, extend their lifetime without acquiring a new one. To do this one has to extract the Token from the current environment and decrypt it with the AFS specific Kerberos service key (known only by the Kerberos server and the AFS fileservers). Its now possible to put a new timestamp into the Token, thereby extending its lifetime. After encrypting it with the service key again and putting it back into the users environment, the user has a Token with an extended lifetime. If this process is repeated regularly, the Token never expires.

To my knowledge this way was first gone at CERN in the first half of the 1990s, they created the programs GetToken, SetToken and forge. These programs became the base of CERN's "Authenticated Remote Control" (ARC) system, and some time later Codine and LoadLeveler evolved with support for these tools. (Codine is today known as Sun GridEngine, which still contains support for this method.)

A reimplementation of this method can be found in Mike Bechers OpenPBS module extension.

I did just another reimplementation, which does not rely on OpenAFS but uses only Kerberos and "krbafs", found on any Fedora Core 1 machine.

Kerberos V5 and AFS-Tokens

Kerberos V5 brought some new features, renewable TGT-tickets being the most notable with regard to this article. Such a ticket can be renewed via kinit -R without the need to enter the password again. During a renewal, the kinit command contacts the Kerberos server and asks if the renewal is acceptable. This makes it possible to inhibit further usage of a stolen TGT by e.g. disabling the account on the server. As another security measure, a renewable ticket not only contains the usual (short) lifetime specification, but also features a (long) "renewable lifetime" that declares an upper limit for ticket renewal. Usually the normal lifetime is about a day, while the renewable lifetime can last for months.

Creating AFS-Tokens out of Thin Air

A completely different way to provide jobs with AFS-Tokens is to fake them. This is easy if you know the service key of the AFS service. (Remember: AFS is a Kerberos service, the AFS-Token is just a Kerberos V4 service ticket. Therefore it has a well known structure and is encrypted with the AFS' service key.) An implementation of this method is available as GSSKLOG, a tool which uses the GSS-API to authenticate a client to the server, eventually giving back a faked AFS-Token if authentication succeeded.

Batch Systems: Required Steps for AFS-Integration

A complete solution would be to include full Kerberos and AFS support into the Batch System. But this would probably require considerable changes in network communication and internal structures, so some simpler way would be better.

When a job is submitted (e.g. qsub)

While the job is queued

When the job is started

While the job is running

When the job has finished



To sum it up, a Batch System that wishes to support AFS has - at least - be able to
Not all described mechanisms need all these hooks though.

Links

(C)opyright by Karsten Petersen <kapet@hrz.tu-chemnitz.de> in 2004