[Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
Sebastien DAUBIGNE
sebastien.daubigne at atosorigin.com
Thu Sep 16 10:10:28 CDT 2010
Sorry, my mistake: the VxVM version is 5.0 GA, not 5.0 MP3.
Tech note 316981 states that dmp_fast_recovery is available in 5.0, but
neither the man page, the administration guide, nor the vxdmpadm command
recognizes it.
I also don't know whether 5.0 GA behaves like dmp_fast_recovery=off or
like dmp_fast_recovery=on.
The note states "In the case of a single path failure, MPxIO does not
notify DMP of the error, therefore, dmp_fast_recovery has no effect.",
so it seems this parameter is not the issue in my case (single path
failure).
Maybe I should try updating to the latest MP3 and setting
dmp_fast_recovery=off.
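
As a side note on the log quoted further down: one rough way to see how
long the I/O stayed in trouble is to measure the time between the first
"disabled path" message and the "giving up" message. The following is
only a minimal Python sketch. The three lines are abbreviated copies of
entries from that log, the year 2010 is an assumption (syslog does not
record it), and the function names are invented for illustration.

```python
from datetime import datetime

# Abbreviated copies of three entries from the quoted syslog excerpt.
LOG = [
    "Sep 1 06:18:54 myserver vxdmp: disabled path 118/0x558",
    "Sep 1 06:19:05 myserver scsi: transport failed: 'timeout': retrying command",
    "Sep 1 06:23:03 myserver scsi: transport failed: 'timeout': giving up",
]


def timestamp(line, year=2010):
    """Parse the leading 'Mon D HH:MM:SS' of a syslog line.

    The year is an assumption, since syslog lines do not carry one.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(lines, start_marker, end_marker):
    """Seconds between the first line containing each marker."""
    start = next(timestamp(l) for l in lines if start_marker in l)
    end = next(timestamp(l) for l in lines if end_marker in l)
    return (end - start).total_seconds()


if __name__ == "__main__":
    print(elapsed_seconds(LOG, "disabled path", "giving up"))
```

On these entries the gap is a little over four minutes, which gives a
feel for how much time any retry setting would have to cover.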
On 16/09/2010 15:40, Sebastien DAUBIGNE wrote:
> Thank you Victor and William, it seems to be a very good lead.
>
> Unfortunately, this tunable does not seem to be supported in the VxVM
> version installed on my system:
>
> > vxdmpadm gettune dmp_fast_recovery
> VxVM vxdmpadm ERROR V-5-1-12015 Incorrect tunable
> vxdmpadm gettune [tunable name]
> Note - Tunable name can be dmp_failed_io_threshold, dmp_retry_count,
> dmp_pathswitch_blks_shift, dmp_queue_depth, dmp_cache_open,
> dmp_daemon_count, dmp_scsi_timeout, dmp_delayq_interval, dmp_path_age,
> or dmp_stat_interval
>
> This is odd, because my version is 5.0 MP3 Solaris SPARC and,
> according to http://seer.entsupport.symantec.com/docs/316981.htm, this
> tunable should be available.
>
> > modinfo | grep -i vx
> 38 7846a000 3800e 288 1 vxdmp (VxVM 5.0-2006-05-11a: DMP Drive)
> 40 784a4000 334c40 289 1 vxio (VxVM 5.0-2006-05-11a I/O driver)
> 42 783ec71d df8 290 1 vxspec (VxVM 5.0-2006-05-11a control/st)
> 296 78cfb0a2 c6b 291 1 vxportal (VxFS 5.0_REV-5.0A55_sol portal )
> 297 78d6c000 1b9d4f 8 1 vxfs (VxFS 5.0_REV-5.0A55_sol SunOS 5)
> 298 78f18000 a270 292 1 fdd (VxQIO 5.0_REV-5.0A55_sol Quick )
>
>
>
>
>
> On 16/09/2010 12:15, Victor Engle wrote:
>> Which version of Veritas? Version 4/2MP2 and version 5.x introduced a
>> feature called DMP fast recovery. It was probably supposed to be
>> called DMP fast fail, but "recovery" sounds better. It is supposed to
>> fail suspect paths more aggressively to speed up failover. But when
>> you only have one VxVM DMP path, as is the case with MPxIO, and
>> fast recovery fails that path, then you're in trouble. In version 5.x,
>> it is possible to disable this feature.
>>
>> Google DMP fast recovery.
>>
>> http://seer.entsupport.symantec.com/docs/307959.htm
>>
>> I can imagine there must have been some internal fights at Symantec
>> between product management and QA to get that feature released.
>>
>> Vic
>>
>>
>>
>>
>>
>> On Thu, Sep 16, 2010 at 6:03 AM, Sebastien DAUBIGNE
>> <sebastien.daubigne at atosorigin.com> wrote:
>>> Dear Vx-addicts,
>>>
>>> We encountered a failover issue with this configuration:
>>>
>>> - Solaris 9 HW 9/05
>>> - SUN SAN (SFS) 4.4.15
>>> - Emulex with SUN generic driver (emlx)
>>> - VxVM 5.0-2006-05-11a
>>>
>>> - storage on HP SAN (XP 24K).
>>>
>>>
>>> Multipathing is managed by MPxIO (not VxDMP) because the SAN team and
>>> HP support imposed the Solaris native solution for multipathing:
>>>
>>> VxVM ==> VxDMP ==> MPxIO ==> FCP ...
>>>
>>> We have 2 paths to the switch, linked to 2 paths to the storage, so the
>>> LUNs have 4 paths, with active/active support.
>>> Failover operation has been tested successfully by offlining each port
>>> successively on the SAN.
>>>
>>> We regularly have transient I/O errors (SCSI timeouts, I/O error
>>> retries with "Unit attention") due to SAN-side issues. Usually these
>>> errors are handled transparently by MPxIO/VxVM without impact on the
>>> applications.
>>>
>>> Now for the incident we encountered:
>>>
>>> One of the SAN ports was reset, which caused some transient I/O
>>> errors.
>>> The other SAN port was OK, so the MPxIO multipathing layer should have
>>> failed the I/O over to the other path without passing the error up to
>>> the VxDMP layer.
>>> For some reason, it did not fail the I/O over before VxVM treated it
>>> as an unrecoverable I/O error, disabling the subdisk and consequently
>>> the filesystem.
>>>
>>> Note the "giving up" message from the SCSI layer at 06:23:03:
>>>
>>> Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
>>> vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode
>>> 288/0x60
>>> Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
>>> vxdmp V-5-0-111 disabled dmpnode 288/0x60
>>> Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
>>> vxdmp V-5-0-112 disabled path 118/0x538 belonging to the dmpnode
>>> 288/0x20
>>> Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
>>> vxdmp V-5-0-112 disabled path 118/0x550 belonging to the dmpnode
>>> 288/0x18
>>> Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
>>> vxdmp V-5-0-111 disabled dmpnode 288/0x20
>>> Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
>>> vxdmp V-5-0-111 disabled dmpnode 288/0x18
>>> Sep 1 06:18:54 myserver scsi: [ID 107833 kern.warning] WARNING:
>>> /scsi_vhci/ssd@g60060e80152777000001277700003794 (ssd165):
>>> Sep 1 06:18:54 myserver SCSI transport failed: reason
>>> 'tran_err': retrying command
>>> Sep 1 06:19:05 myserver scsi: [ID 107833 kern.warning] WARNING:
>>> /scsi_vhci/ssd@g60060e80152777000001277700003794 (ssd165):
>>> Sep 1 06:19:05 myserver SCSI transport failed: reason
>>> 'timeout':
>>> retrying command
>>> Sep 1 06:21:57 myserver scsi: [ID 107833 kern.warning] WARNING:
>>> /scsi_vhci/ssd@g60060e8015277700000127770000376d (ssd168):
>>> Sep 1 06:21:57 myserver SCSI transport failed: reason
>>> 'tran_err': retrying command
>>> Sep 1 06:22:45 myserver scsi: [ID 107833 kern.warning] WARNING:
>>> /scsi_vhci/ssd@g60060e8015277700000127770000376d (ssd168):
>>> Sep 1 06:22:45 myserver SCSI transport failed: reason
>>> 'timeout':
>>> retrying command
>>> Sep 1 06:23:03 myserver scsi: [ID 107833 kern.warning] WARNING:
>>> /scsi_vhci/ssd@g60060e80152777000001277700003787 (ssd166):
>>> Sep 1 06:23:03 myserver SCSI transport failed: reason
>>> 'timeout':
>>> giving up
>>> Sep 1 06:23:03 myserver vxio: [ID 539309 kern.warning] WARNING: VxVM
>>> vxio V-5-3-0 voldmp_errbuf_sio_start: Failed to flush the error buffer
>>> 300ce41c340 on device 0x1200000003a to DMP
>>> Sep 1 06:23:03 myserver vxio: [ID 771159 kern.warning] WARNING: VxVM
>>> vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write error
>>> Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
>>> 1 mesg 037: V-2-37: vx_metaioerr - vx_logbuf_clean -
>>> /dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block
>>> 0/5935
>>> Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
>>> 2 mesg 031: V-2-31: vx_disable - /dev/vx/dsk/mydg/vol1 file system
>>> disabled
>>> Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
>>> 3 mesg 037: V-2-37: vx_metaioerr - vx_inode_iodone -
>>> /dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block
>>> 0/265984
>>>
>>>
>>> It seems VxDMP gets the I/O error at the same time as MPxIO. I thought
>>> MPxIO would conceal the I/O error until failover had occurred, which
>>> is not the case.
>>>
>>> As a workaround, I increased the VxDMP
>>> recoveryoption/fixedretry/retrycount tunable from 5 to 20 to give
>>> MPxIO a chance to fail over before VxDMP gives up, but I still don't
>>> understand why VxVM catches the SCSI errors.
>>>
>>> Any advice?
>>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sebastien DAUBIGNE
>>> Sebastien.daubigne at atosorigin.com - +33(0)5.57.89.31.09
>>> AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
>>>
>>>
>>
>
>
--
Sebastien DAUBIGNE
Sebastien.daubigne at atosorigin.com - +33(0)5.57.89.31.09
AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix