[Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue

Joshua Fielden Joshua_Fielden at symantec.com
Thu Sep 16 09:50:34 CDT 2010


dmp_fast_recovery is a mechanism by which we bypass the sd/scsi stack and send path inquiry/status CDBs directly from the HBA in order to bypass long SCSI queues and recover paths faster. With a TPD (third-party driver) such as MPxIO, bypassing the stack means we bypass the TPD completely, and interactions such as this can happen. The vxesd (event-source daemon) is another 5.0/MP2 backport addition that's moot in the presence of a TPD.

From your modinfo, you're not actually running MP3. This technote (http://seer.entsupport.symantec.com/docs/327057.htm) isn't exactly your scenario, but looking for partially installed pkgs is a good start to getting your server correctly installed; after that the tuneable should work. Very early 5.0 versions had a differently named tuneable that I can't find in my mail archive ATM.

Cheers,

Jf

-----Original Message-----
From: veritas-vx-bounces at mailman.eng.auburn.edu [mailto:veritas-vx-bounces at mailman.eng.auburn.edu] On Behalf Of Sebastien DAUBIGNE
Sent: Thursday, September 16, 2010 7:41 AM
To: Veritas-vx at mailman.eng.auburn.edu
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue

  Thank you Victor and William, it seems to be a very good lead.

Unfortunately, this tunable does not seem to be supported by the VxVM 
version installed on my system:

 > vxdmpadm gettune dmp_fast_recovery
VxVM vxdmpadm ERROR V-5-1-12015  Incorrect tunable
vxdmpadm gettune [tunable name]
Note - Tunable name can be dmp_failed_io_threshold, dmp_retry_count, 
dmp_pathswitch_blks_shift, dmp_queue_depth, dmp_cache_open, 
dmp_daemon_count, dmp_scsi_timeout, dmp_delayq_interval, dmp_path_age, 
or dmp_stat_interval
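
The error text above already lists the tunables this build accepts, so the check can be done mechanically. A minimal sketch, assuming the output has been captured as a string; the function name `supported_tunables` is made up for illustration and is not part of any Veritas tool:

```python
import re

# Output copied from the vxdmpadm gettune attempt above.
GETTUNE_OUTPUT = """\
VxVM vxdmpadm ERROR V-5-1-12015  Incorrect tunable
vxdmpadm gettune [tunable name]
Note - Tunable name can be dmp_failed_io_threshold, dmp_retry_count,
dmp_pathswitch_blks_shift, dmp_queue_depth, dmp_cache_open,
dmp_daemon_count, dmp_scsi_timeout, dmp_delayq_interval, dmp_path_age,
or dmp_stat_interval
"""


def supported_tunables(error_text):
    """Return the tunable names listed after 'Tunable name can be'."""
    # Hypothetical helper: only looks at the text following the marker phrase.
    _, _, tail = error_text.partition("Tunable name can be")
    return re.findall(r"\bdmp_\w+", tail)


names = supported_tunables(GETTUNE_OUTPUT)
print(len(names), "tunables accepted by this build")
print("dmp_fast_recovery accepted:", "dmp_fast_recovery" in names)
```

On the sample above this finds ten names, and dmp_fast_recovery is not among them, which matches the "Incorrect tunable" error.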

This is odd, because my version is 5.0 MP3 on Solaris SPARC, and according 
to http://seer.entsupport.symantec.com/docs/316981.htm this tunable 
should be available.

 > modinfo | grep -i vx
  38 7846a000  3800e 288   1  vxdmp (VxVM 5.0-2006-05-11a: DMP Drive)
  40 784a4000 334c40 289   1  vxio (VxVM 5.0-2006-05-11a I/O driver)
  42 783ec71d    df8 290   1  vxspec (VxVM 5.0-2006-05-11a control/st)
296 78cfb0a2    c6b 291   1  vxportal (VxFS 5.0_REV-5.0A55_sol portal )
297 78d6c000 1b9d4f   8   1  vxfs (VxFS 5.0_REV-5.0A55_sol SunOS 5)
298 78f18000   a270 292   1  fdd (VxQIO 5.0_REV-5.0A55_sol Quick )
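
To compare what is installed with what is actually loaded, the build strings can be pulled out of the modinfo lines. A rough sketch only: the regular expression assumes the column layout shown above, and `loaded_versions` is a hypothetical helper name. It reports the loaded build string and nothing more; mapping a build string to a maintenance pack level still has to come from the vendor documentation.

```python
import re

# modinfo lines copied from the output above.
MODINFO_SAMPLE = """\
 38 7846a000  3800e 288   1  vxdmp (VxVM 5.0-2006-05-11a: DMP Drive)
 40 784a4000 334c40 289   1  vxio (VxVM 5.0-2006-05-11a I/O driver)
 42 783ec71d    df8 290   1  vxspec (VxVM 5.0-2006-05-11a control/st)
296 78cfb0a2    c6b 291   1  vxportal (VxFS 5.0_REV-5.0A55_sol portal )
297 78d6c000 1b9d4f   8   1  vxfs (VxFS 5.0_REV-5.0A55_sol SunOS 5)
298 78f18000   a270 292   1  fdd (VxQIO 5.0_REV-5.0A55_sol Quick )
"""

# id, loadaddr, size, info, rev, module name, then "(Product build ...".
LINE_RE = re.compile(
    r"^\s*\d+\s+\S+\s+\S+\s+\S+\s+\d+\s+(\S+)\s+\((\S+)\s+([^\s:]+)"
)


def loaded_versions(modinfo_text):
    """Map module name -> (product, build string) for each parsed line."""
    versions = {}
    for line in modinfo_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            module, product, build = match.groups()
            versions[module] = (product, build)
    return versions


versions = loaded_versions(MODINFO_SAMPLE)
vxvm_builds = {build for product, build in versions.values() if product == "VxVM"}
print("Loaded VxVM builds:", sorted(vxvm_builds))
```

All three VxVM modules report the same build, 5.0-2006-05-11a, which is the string the reply at the top of the thread reads as not being MP3.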





On 16/09/2010 12:15, Victor Engle wrote:
> Which version of veritas? Version 4/2MP2 and version 5.x introduced a
> feature called DMP fast recovery. It was probably supposed to be
> called DMP fast fail but "recovery" sounds better. It is supposed to
> fail suspect paths more aggressively to speed up failover. But when
> you only have one vxvm DMP path, as is the case with MPxIO, and
> fast-recovery fails that path, then you're in trouble. In version 5.x,
> it is possible to disable this feature.
>
> Google DMP fast recovery.
>
> http://seer.entsupport.symantec.com/docs/307959.htm
>
> I can imagine there must have been some internal fights at symantec
> between product management and QA to get that feature released.
>
> Vic
>
>
>
>
>
> On Thu, Sep 16, 2010 at 6:03 AM, Sebastien DAUBIGNE
> <sebastien.daubigne at atosorigin.com>  wrote:
>>   Dear Vx-addicts,
>>
>> We encountered a failover issue on this configuration :
>>
>> - Solaris 9 HW 9/05
>> - SUN SAN (SFS) 4.4.15
>> - Emulex with SUN generic driver (emlx)
>> - VxVM 5.0-2006-05-11a
>>
>> - storage on HP SAN (XP 24K).
>>
>>
>> Multipathing is managed by MPxIO (not VxDMP) because the SAN team and HP
>> support imposed the Solaris native solution for multipathing :
>>
>> VxVM ==>  VxDMP ==>  MPxIO ==>  FCP ...
>>
>> We have 2 paths to the switch, linked to 2 paths to the storage, so the
>> LUNs have 4 paths, with active/active support.
>> Failover operation has been tested successfully by offlining each port
>> successively on the SAN.
>>
>> We regularly have transient I/O errors (SCSI timeouts, I/O error retries
>> with "Unit attention") due to SAN-side issues. Usually these errors are
>> transparently managed by MPxIO/VxVM without impact on the applications.
>>
>> Now for the incident we encountered :
>>
>> One of the SAN ports was reset; consequently there were some transient
>> I/O errors.
>> The other SAN port was OK, so the MPxIO multipathing layer should have
>> failed the I/O over to the other path without transmitting the error to the
>> VxDMP layer.
>> For some reason, it did not fail the I/O over before VxVM caught it as an
>> unrecoverable I/O error, disabling the subdisk and consequently the
>> filesystem.
>>
>> Note the "giving up" message from the scsi layer at 06:23:03:
>>
>> Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode 288/0x60
>> Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM vxdmp V-5-0-111 disabled dmpnode 288/0x60
>> Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0x538 belonging to the dmpnode 288/0x20
>> Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0x550 belonging to the dmpnode 288/0x18
>> Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM vxdmp V-5-0-111 disabled dmpnode 288/0x20
>> Sep  1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM vxdmp V-5-0-111 disabled dmpnode 288/0x18
>> Sep  1 06:18:54 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e80152777000001277700003794 (ssd165):
>> Sep  1 06:18:54 myserver        SCSI transport failed: reason 'tran_err': retrying command
>> Sep  1 06:19:05 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e80152777000001277700003794 (ssd165):
>> Sep  1 06:19:05 myserver        SCSI transport failed: reason 'timeout': retrying command
>> Sep  1 06:21:57 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e8015277700000127770000376d (ssd168):
>> Sep  1 06:21:57 myserver        SCSI transport failed: reason 'tran_err': retrying command
>> Sep  1 06:22:45 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e8015277700000127770000376d (ssd168):
>> Sep  1 06:22:45 myserver        SCSI transport failed: reason 'timeout': retrying command
>> Sep  1 06:23:03 myserver scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e80152777000001277700003787 (ssd166):
>> Sep  1 06:23:03 myserver        SCSI transport failed: reason 'timeout': giving up
>> Sep  1 06:23:03 myserver vxio: [ID 539309 kern.warning] WARNING: VxVM vxio V-5-3-0 voldmp_errbuf_sio_start: Failed to flush the error buffer 300ce41c340 on device 0x1200000003a to DMP
>> Sep  1 06:23:03 myserver vxio: [ID 771159 kern.warning] WARNING: VxVM vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write error
>> Sep  1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt 1 mesg 037: V-2-37: vx_metaioerr - vx_logbuf_clean - /dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block 0/5935
>> Sep  1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt 2 mesg 031: V-2-31: vx_disable - /dev/vx/dsk/mydg/vol1 file system disabled
>> Sep  1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt 3 mesg 037: V-2-37: vx_metaioerr - vx_inode_iodone - /dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block 0/265984
>>
>>
>> It seems VxDMP gets the I/O error at the same time as MPxIO: I thought
>> MPxIO would have concealed the I/O error until failover had occurred, which
>> is not the case.
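
The timing observation can be checked directly against the timestamps. A small sketch under stated assumptions: the three lines are copied from the log excerpt with wrapped lines rejoined, all events fall on the same day, and the helper names `seconds_of_day` and `gap_seconds` are invented for this illustration.

```python
import re

# Three lines taken from the log excerpt (wrapped lines rejoined).
FIRST_PATH_DISABLED = (
    "Sep  1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM "
    "vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode 288/0x60"
)
SCSI_GIVING_UP = (
    "Sep  1 06:23:03 myserver        SCSI transport failed: reason 'timeout': "
    "giving up"
)
VXIO_WRITE_ERROR = (
    "Sep  1 06:23:03 myserver vxio: [ID 771159 kern.warning] WARNING: VxVM "
    "vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write error"
)


def seconds_of_day(log_line):
    """Convert the first HH:MM:SS stamp in a syslog line to seconds."""
    match = re.search(r"\b(\d{2}):(\d{2}):(\d{2})\b", log_line)
    if match is None:
        raise ValueError("no timestamp found in: " + log_line)
    hours, minutes, seconds = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def gap_seconds(earlier_line, later_line):
    """Seconds elapsed between two log lines from the same day."""
    return seconds_of_day(later_line) - seconds_of_day(earlier_line)


print("DMP path disabled -> scsi giving up:",
      gap_seconds(FIRST_PATH_DISABLED, SCSI_GIVING_UP), "s")
print("scsi giving up -> vxio write error:",
      gap_seconds(SCSI_GIVING_UP, VXIO_WRITE_ERROR), "s")
```

On these lines, vxdmp disabled its first path 249 seconds before the scsi layer gave up, and vxio reported the write error in the same second as the "giving up" message. That is consistent with the errors reaching VxVM rather than being absorbed by MPxIO, though the timestamps alone do not show which layer acted first within a given second.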
>>
>> As a workaround, I increased the VxDMP
>> recoveryoption/fixedretry/retrycount tunable from 5 to 20 to give MPxIO a
>> chance to fail over before VxDMP fails, but I still don't understand why
>> VxVM catches the SCSI errors.
>>
>> Any advice ?
>>
>> thanks.
>>
>>
>>
>>
>>
>>
>> --
>> Sebastien DAUBIGNE
>> Sebastien.daubigne at atosorigin.com  - +33(0)5.57.89.31.09
>> AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
>>
>> _______________________________________________
>> Veritas-vx maillist  -  Veritas-vx at mailman.eng.auburn.edu
>> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
>>
>


-- 
Sebastien DAUBIGNE
Sebastien.daubigne at atosorigin.com - +33(0)5.57.89.31.09
AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
