author | Kotresh HR <khiremat@redhat.com> | 2019-05-08 11:26:06 +0530 |
---|---|---|
committer | Sunny Kumar <sunkumar@redhat.com> | 2019-05-13 11:06:41 +0000 |
commit | f0d3690e8916cfb5e10a0df2e9721a0fd079bfce (patch) | |
tree | 1c907b441761cca76f2a10e8e827f6b6fbd353ec /tests/geo-rep.rc | |
parent | e2adc9dc66dc46519007790ecd7dd57642dff0fd (diff) |
geo-rep: Fix sync hang with tarssh
Problem:
Geo-rep sync hangs under heavy workload when tarssh is used
as the sync engine.
Analysis and Root cause:
The tar process was found to be hung.
Further debugging showed that the stderr buffer of the
tar process on the master was full, i.e., 64k.
Once the buffer was copied to a file from /proc/pid/fd/2,
the hang was resolved.
This can happen when files picked up by the tar process
to sync no longer exist on the master. Once the count of such
files reaches around 1k, the stderr buffer fills up.
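The blocking behaviour described above can be reproduced in isolation. The following is only a minimal sketch, not geo-rep code: it assumes a Python 3 interpreter stands in for tar, and the names NOISY_CHILD and stderr_blocks_until_drained are hypothetical. The child writes 1 MiB to stderr, which is assumed to exceed the default pipe capacity (64k on Linux), so it cannot exit until someone reads the pipe.

```python
import subprocess
import sys

# Hypothetical stand-in for tar: writes far more to stderr than a pipe can hold.
NOISY_CHILD = "import sys; sys.stderr.write('x' * (1 << 20))"


def stderr_blocks_until_drained(timeout=2.0):
    """Return (blocked, drained_bytes) for a child whose stderr is not read at first."""
    proc = subprocess.Popen(
        [sys.executable, "-c", NOISY_CHILD],
        stderr=subprocess.PIPE,
    )
    try:
        # Nobody reads stderr here, so the child stalls once the pipe is full.
        proc.wait(timeout=timeout)
        blocked = False
    except subprocess.TimeoutExpired:
        blocked = True
    # Draining the pipe (the equivalent of copying /proc/pid/fd/2 to a file)
    # lets the child finish.
    _, err = proc.communicate()
    return blocked, len(err)


if __name__ == "__main__":
    print(stderr_blocks_until_drained())
```

On a system with the usual pipe size, the wait is expected to time out and the subsequent communicate() call to release the child.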
Fix:
The tar process is executed using Popen with stderr as PIPE.
The final execution is something like below.
tar | ssh <args> root@slave tar --overwrite -xf - -C <path>
The code waited on the ssh process first using communicate() and then on tar.
Note that communicate() reads stdout and stderr. So when the stderr of the tar
process fills up, nobody reads it until the untar via ssh has completed.
But the untar cannot complete while tar is blocked on its stderr, which
leads to a deadlock.
Hence we should wait on both processes in parallel, so that stderr is
read from both of them.
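One way to wait on both ends of such a pipeline in parallel is to drain the first process from a helper thread while the main thread drains the second. This is a hedged sketch of that general technique, not the patch itself: PRODUCER and CONSUMER are hypothetical Python stand-ins for tar and for the ssh/untar side, and _drain and run_pipeline are illustrative names.

```python
import subprocess
import sys
import threading

# Hypothetical stand-in for tar: lots of stderr noise first, then a small payload.
PRODUCER = "import sys; sys.stderr.write('x' * 200000); sys.stdout.write('payload')"
# Hypothetical stand-in for "ssh ... tar -x": reads stdin to EOF, reports the size.
CONSUMER = "import sys; n = len(sys.stdin.read()); sys.stderr.write(str(n))"


def _drain(proc, sink, key):
    """Read stderr of proc until EOF, store it in sink[key], then reap the process."""
    sink[key] = proc.stderr.read()
    proc.wait()


def run_pipeline():
    producer = subprocess.Popen(
        [sys.executable, "-c", PRODUCER],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    consumer = subprocess.Popen(
        [sys.executable, "-c", CONSUMER],
        stdin=producer.stdout,
        stderr=subprocess.PIPE,
    )
    # Allow the producer to receive SIGPIPE if the consumer exits early.
    producer.stdout.close()

    captured = {}
    # Drain the producer's stderr concurrently, so it never blocks on a full
    # pipe while the main thread is busy with the consumer.
    helper = threading.Thread(target=_drain, args=(producer, captured, "producer"))
    helper.start()
    _drain(consumer, captured, "consumer")
    helper.join()
    return producer.returncode, consumer.returncode, captured


if __name__ == "__main__":
    p_rc, c_rc, captured = run_pipeline()
    print(p_rc, c_rc, len(captured["producer"]), captured["consumer"])
```

If the consumer were drained first and the producer only afterwards, this pipeline would be expected to stall in the same way as described above, because the producer blocks on stderr before it ever writes its payload.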
Change-Id: I609c7cc5c07e210c504771115b4d551a2e891adf
fixes: bz#1707728
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Diffstat (limited to 'tests/geo-rep.rc')
-rw-r--r-- | tests/geo-rep.rc | 17 |
1 files changed, 17 insertions, 0 deletions
diff --git a/tests/geo-rep.rc b/tests/geo-rep.rc
index 2035b9fe106..e4f014eb6f8 100644
--- a/tests/geo-rep.rc
+++ b/tests/geo-rep.rc
@@ -101,6 +101,23 @@ function create_data()
     chown 1000:1000 ${master_mnt}/${prefix}_chown_f1_ಸಂತಸ
 }
 
+function create_data_hang()
+{
+    prefix=$1
+    mkdir ${master_mnt}/${prefix}
+    cd ${master_mnt}/${prefix}
+    # ~1k files is required with 1 sync-job and hang happens if
+    # stderr buffer of tar/ssh executed with Popen is full (i.e., 64k).
+    # 64k is hit when ~800 files were not found while syncing data
+    # from master. So around 1k files is required to hit the condition.
+    for i in {1..1000}
+    do
+        echo "test data" > file$i
+        mv -f file$i file
+    done
+    cd -
+}
+
 function chown_file_ok()
 {
     local file_owner=$(stat --format "%u:%g" "$1")