Fault-Tolerant SFTP scripting – Retry Failed Transfers Automatically

Fault-Tolerant SFTP scripting

Introduction

The whole of modern networking is built upon an unreliable medium. Routing equipment has free license to discard, corrupt, reorder, or duplicate data which it forwards. The understanding of the IP layer in TCP/IP is that there are no guarantees of accuracy. No IP network can claim to be 100% reliable.

The TCP layer acts as a guardian atop IP, ensuring data that it produces is correct. This is achieved with a number of techniques that sometimes purposely lose data in order to determine network limits. As most might know, TCP provides a connection-based network with guaranteed delivery atop an IP connectionless network that can and does discard traffic at will.

How curious it is that our file transfer tools are not similarly robust in the face of broken TCP connections. The SFTP protocol resembles both its ancestors and peers in that no effort is made to recover from TCP errors that cause connection closure. There are tools to address failed transfers (reget and reput), but these are not triggered automatically in a regenerated TCP session (those requiring this property might normally turn to NFS, but this requires both privilege and architectural configuration). Users and network administrators alike might be rapt with joy should such tools suddenly become pervasive.

What SFTP is able provide is a return status, an integer that signals success when it is the value of zero. It does not return status by default for file transfers, but only does so when called in batch mode. This return status can be captured by a POSIX shell and retried when non-zero. This check can even be done on Windows with Microsoft's port of OpenSSH with the help of Busybox (or even PowerShell, with restricted functionality). The POSIX shell script is deceptively simple, but uncommon. Let's change that.

Failure Detection with the POSIX Shell

The core implementation of SFTP fault tolerance is not particularly large, but batch mode assurance and standard input handling add some length and complexity, as demonstrated below in a Windows environment.