[Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue

Sebastien DAUBIGNE sebastien.daubigne at atosorigin.com
Thu Sep 16 10:10:28 CDT 2010


Sorry, my mistake: the VxVM version is 5.0 GA, not 5.0 MP3.

Note 316981 states that dmp_fast_recovery is available in 5.0, but
neither the man page, the administration guide, nor the vxdmpadm
command recognizes it.
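
To double-check which tunables this vxdmpadm build accepts, the names can be pulled out of the usage text it prints with the V-5-1-12015 error (quoted further down). This is only a rough sketch: the function name supported_tunables is made up for illustration, and it assumes every tunable name starts with "dmp_", as is the case in that output.

```python
import re

# Usage text printed by "vxdmpadm gettune dmp_fast_recovery" on this host,
# copied from the quoted message with the quote markers removed.
USAGE_TEXT = """\
VxVM vxdmpadm ERROR V-5-1-12015  Incorrect tunable
vxdmpadm gettune [tunable name]
Note - Tunable name can be dmp_failed_io_threshold, dmp_retry_count,
dmp_pathswitch_blks_shift, dmp_queue_depth, dmp_cache_open,
dmp_daemon_count, dmp_scsi_timeout, dmp_delayq_interval, dmp_path_age,
or dmp_stat_interval
"""


def supported_tunables(usage_text):
    """Return the set of tunable names listed after 'Tunable name can be'."""
    _, marker, tail = usage_text.partition("Tunable name can be")
    if not marker:
        # The usage note is missing, so nothing can be concluded.
        return set()
    # Assumption: tunable names look like dmp_<lowercase letters/underscores>.
    return set(re.findall(r"\bdmp_[a-z_]+\b", tail))


tunables = supported_tunables(USAGE_TEXT)
print(sorted(tunables))
print("dmp_fast_recovery supported:", "dmp_fast_recovery" in tunables)
```

On this output the check reports that dmp_fast_recovery is not in the accepted list, which matches the error.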

However, I don't know if 5.0 GA behaviour is equivalent to 
dmp_fast_recovery=off or dmp_fast_recovery=on.
The note states "In the case of a single path failure, MPxIO does not 
notify DMP of the error, therefore, dmp_fast_recovery has no effect.", 
hence it seems this parameter is not an issue in my case (single path 
failure).

Maybe I should try upgrading to the latest MP3 and setting
dmp_fast_recovery=off.
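
As an aside, for anyone comparing timings against the retry settings: the gap between the first vxdmp "disabled path" notice and the scsi "giving up" message in the log quoted below can be worked out from the syslog timestamps. A minimal sketch, with the two timestamps copied by hand from that log (the variable and function names are only illustrative):

```python
from datetime import datetime

# Syslog timestamps copied from the log excerpt quoted below.
FIRST_PATH_DISABLED = "Sep  1 06:18:54"  # first vxdmp V-5-0-112 notice
SCSI_GIVING_UP = "Sep  1 06:23:03"       # scsi 'timeout': giving up


def parse_syslog_time(stamp):
    """Parse a syslog timestamp; the missing year defaults to 1900,
    which is harmless when only same-day differences are needed."""
    return datetime.strptime(stamp, "%b %d %H:%M:%S")


elapsed = parse_syslog_time(SCSI_GIVING_UP) - parse_syslog_time(FIRST_PATH_DISABLED)
elapsed_seconds = int(elapsed.total_seconds())
print(elapsed_seconds, "seconds between the two messages")
```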





On 16/09/2010 15:40, Sebastien DAUBIGNE wrote:
>  Thank you Victor and William, it seems to be a very good lead.
>
> Unfortunately, this tunable does not seem to be supported in the VxVM
> version installed on my system:
>
> > vxdmpadm gettune dmp_fast_recovery
> VxVM vxdmpadm ERROR V-5-1-12015  Incorrect tunable
> vxdmpadm gettune [tunable name]
> Note - Tunable name can be dmp_failed_io_threshold, dmp_retry_count, 
> dmp_pathswitch_blks_shift, dmp_queue_depth, dmp_cache_open, 
> dmp_daemon_count, dmp_scsi_timeout, dmp_delayq_interval, dmp_path_age, 
> or dmp_stat_interval
>
> This is odd, because my version is 5.0 MP3 Solaris SPARC, and
> according to http://seer.entsupport.symantec.com/docs/316981.htm this
> tunable should be available.
>
> > modinfo | grep -i vx
>  38 7846a000  3800e 288   1  vxdmp (VxVM 5.0-2006-05-11a: DMP Drive)
>  40 784a4000 334c40 289   1  vxio (VxVM 5.0-2006-05-11a I/O driver)
>  42 783ec71d    df8 290   1  vxspec (VxVM 5.0-2006-05-11a control/st)
> 296 78cfb0a2    c6b 291   1  vxportal (VxFS 5.0_REV-5.0A55_sol portal )
> 297 78d6c000 1b9d4f   8   1  vxfs (VxFS 5.0_REV-5.0A55_sol SunOS 5)
> 298 78f18000   a270 292   1  fdd (VxQIO 5.0_REV-5.0A55_sol Quick )
>
>
>
>
>
> On 16/09/2010 12:15, Victor Engle wrote:
>> Which version of veritas? Version 4/2MP2 and version 5.x introduced a
>> feature called DMP fast recovery. It was probably supposed to be
>> called DMP fast fail but "recovery" sounds better. It is supposed to
>> fail suspect paths more aggressively to speed up failover. But when
>> you only have one vxvm DMP path, as is the case with MPxIO, and
>> fast-recovery fails that path, then you're in trouble. In version 5.x,
>> it is possible to disable this feature.
>>
>> Google DMP fast recovery.
>>
>> http://seer.entsupport.symantec.com/docs/307959.htm
>>
>> I can imagine there must have been some internal fights at symantec
>> between product management and QA to get that feature released.
>>
>> Vic
>>
>>
>>
>>
>>
>> On Thu, Sep 16, 2010 at 6:03 AM, Sebastien DAUBIGNE
>> <sebastien.daubigne at atosorigin.com>  wrote:
>>>   Dear Vx-addicts,
>>>
>>> We encountered a failover issue on this configuration :
>>>
>>> - Solaris 9 HW 9/05
>>> - SUN SAN (SFS) 4.4.15
>>> - Emulex with SUN generic driver (emlx)
>>> - VxVM 5.0-2006-05-11a
>>>
>>> - storage on HP SAN (XP 24K).
>>>
>>>
>>> Multipathing is managed by MPxIO (not VxDMP) because the SAN team and HP
>>> support imposed the Solaris native solution for multipathing:
>>>
>>> VxVM ==>  VxDMP ==>  MPxIO ==>  FCP ...
>>>
>>> We have 2 paths to the switch, linked to 2 paths to the storage, so the
>>> LUNs have 4 paths, with active/active support.
>>> Failover operation has been tested successfully by offlining each port
>>> successively on the SAN.
>>>
>>> We regularly have transient I/O errors (SCSI timeouts, I/O error retries
>>> with "Unit attention") due to SAN-side issues. Usually these errors are
>>> handled transparently by MPxIO/VxVM without impact on the applications.
>>>
>>> Now for the incident we encountered :
>>>
>>> One of the SAN ports was reset, and consequently there were some transient
>>> I/O errors.
>>> The other SAN port was OK, so the MPxIO multipathing layer should have
>>> failed the I/O over to the other path without transmitting the error to the
>>> VxDMP layer.
>>> For some reason, it did not fail over the I/O before VxVM caught it as an
>>> unrecoverable I/O error, disabling the subdisk and consequently the
>>> filesystem.
>>>
>>> Note the "giving up" message from the SCSI layer at 06:23:03:
>>>
>>> Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
>>> vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode 
>>> 288/0x60
>>> Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
>>> vxdmp V-5-0-111 disabled dmpnode 288/0x60
>>> Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
>>> vxdmp V-5-0-112 disabled path 118/0x538 belonging to the dmpnode 
>>> 288/0x20
>>> Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
>>> vxdmp V-5-0-112 disabled path 118/0x550 belonging to the dmpnode 
>>> 288/0x18
>>> Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
>>> vxdmp V-5-0-111 disabled dmpnode 288/0x20
>>> Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
>>> vxdmp V-5-0-111 disabled dmpnode 288/0x18
>>> Sep  1 06:18:54 myserver scsi: [ID 107833 kern.warning] WARNING:
>>> /scsi_vhci/ssd at g60060e80152777000001277700003794 (ssd165):
>>> Sep  1 06:18:54 myserver        SCSI transport failed: reason
>>> 'tran_err': retrying command
>>> Sep  1 06:19:05 myserver scsi: [ID 107833 kern.warning] WARNING:
>>> /scsi_vhci/ssd at g60060e80152777000001277700003794 (ssd165):
>>> Sep  1 06:19:05 myserver        SCSI transport failed: reason 
>>> 'timeout':
>>> retrying command
>>> Sep  1 06:21:57 myserver scsi: [ID 107833 kern.warning] WARNING:
>>> /scsi_vhci/ssd at g60060e8015277700000127770000376d (ssd168):
>>> Sep  1 06:21:57 myserver        SCSI transport failed: reason
>>> 'tran_err': retrying command
>>> Sep  1 06:22:45 myserver scsi: [ID 107833 kern.warning] WARNING:
>>> /scsi_vhci/ssd at g60060e8015277700000127770000376d (ssd168):
>>> Sep  1 06:22:45 myserver        SCSI transport failed: reason 
>>> 'timeout':
>>> retrying command
>>> Sep  1 06:23:03 myserver scsi: [ID 107833 kern.warning] WARNING:
>>> /scsi_vhci/ssd at g60060e80152777000001277700003787 (ssd166):
>>> Sep  1 06:23:03 myserver        SCSI transport failed: reason 
>>> 'timeout':
>>> giving up
>>> Sep  1 06:23:03 myserver vxio: [ID 539309 kern.warning] WARNING: VxVM
>>> vxio V-5-3-0 voldmp_errbuf_sio_start: Failed to flush the error buffer
>>> 300ce41c340 on device 0x1200000003a to DMP
>>> Sep  1 06:23:03 myserver vxio: [ID 771159 kern.warning] WARNING: VxVM
>>> vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write error
>>> Sep  1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
>>> 1 mesg 037: V-2-37: vx_metaioerr - vx_logbuf_clean -
>>> /dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block 
>>> 0/5935
>>> Sep  1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
>>> 2 mesg 031: V-2-31: vx_disable - /dev/vx/dsk/mydg/vol1 file system 
>>> disabled
>>> Sep  1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
>>> 3 mesg 037: V-2-37: vx_metaioerr - vx_inode_iodone -
>>> /dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block
>>> 0/265984
>>>
>>>
>>> It seems VxDMP gets the I/O error at the same time as MPxIO: I thought
>>> MPxIO would have concealed the I/O error until failover had occurred, which
>>> is not the case.
>>>
>>> As a workaround, I increased the VxDMP
>>> recoveryoption/fixedretry/retrycount tunable from 5 to 20 to give MPxIO a
>>> chance to fail over before VxDMP fails, but I still don't understand why
>>> VxVM catches the SCSI errors.
>>>
>>> Any advice ?
>>>
>>> thanks.
>>>
>>>
>>>
>>>
>>>
>>>
>>> -- 
>>> Sebastien DAUBIGNE
>>> Sebastien.daubigne at atosorigin.com  - +33(0)5.57.89.31.09
>>> AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
>>>
>>>
>>
>
>


-- 
Sebastien DAUBIGNE
Sebastien.daubigne at atosorigin.com - +33(0)5.57.89.31.09
AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix


