I'm using Apache Camel to transfer files from an input directory to a message broker. The files are written via SFTP. To avoid consuming incomplete files that are still in transit, I've set `readLock=changed` and `readLockCheckInterval=3000`.
As an example, this is how one of my tests looks:
```xml
<route>
  <from uri="file:inbox?readLock=changed&amp;readLockCheckInterval=3000"/>
  <log message="copying ${file:name}"/>
  <to uri="file:outbox"/>
</route>
```
I test this with `(echo line 1; sleep 2; echo line 2) > inbox/test` and the file gets copied faithfully when `readLockCheckInterval=3000`. However, this doesn't scale, because the file component waits three seconds before processing each file. So when I test with
```sh
for n in $(seq 1 100); do (echo line 1; sleep 2; echo line 2) > inbox/$n & done
```
it takes Camel five minutes to move the files from `inbox` to `outbox` (100 files × 3 s each, since the consumer processes them sequentially).
I've read the chapter on parallel processing in the *Camel in Action* book, but the examples focus on parallelizing the processing of lines within a single consumed file. I couldn't find a way to parallelize the consumer itself.
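To illustrate what I mean, the pattern from that chapter looks roughly like the sketch below (my own reconstruction, not the book's exact code): the splitter fans the lines of one consumed file out to a thread pool, but the file consumer feeding it is still a single thread, so the three-second read-lock wait still serializes everything.

```xml
<route>
  <from uri="file:inbox?readLock=changed&amp;readLockCheckInterval=3000"/>
  <!-- parallelProcessing only parallelizes work *within* one consumed file;
       the consumer still polls and lock-checks files one at a time -->
  <split parallelProcessing="true">
    <tokenize token="\n"/>
    <log message="processing line: ${body}"/>
  </split>
</route>
```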
A throughput of around one file per second would be fine in my use case; I just don't like being forced to risk incomplete data to achieve it. The `readLock=changed` setting seems like a hack anyway, but we can't tell the customer to copy then move, so there doesn't seem to be another option.
How can I improve throughput without sacrificing integrity in the face of network delays?