zdb: Examining ZFS At Point-Blank Range
01 Nov '08 - 08:13 by benr

ZFS is amazing in its simplicity and beauty, but it is also deceptively complex. You're unlikely to ever be forced to peer behind the veil unless you're in the storage-enthusiast ranks, but as ZFS proliferates, more questions will come up about its internals. We have been given a tool to help us investigate the inner workings, zdb, but it is, somewhat intentionally I think, undocumented. Only two others that I know of have had the courage to talk about it publicly: Max Bruning, who is perhaps the single most authoritative voice on ZFS outside of Sun, and Marcelo Leal.

In this post, we'll look only at the basics of ZDB to establish a baseline for its use. Running "zdb -h" will produce a summary of its syntax.

In its most basic form, zdb poolname, zdb outputs several bits of information about our pool, including:

Cached pool configuration (-C)
Uberblock (-u)
Datasets (-d)
Stats on zdb's own I/O (-s), similar to the first interval of zpool iostat
Thus, zdb testpool is the same as zdb -Cuds testpool. Let's look at the output. The pool we'll be using is actually a 256MB pre-allocated file with a single dataset... as simple as it can come.
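If you want to follow along, a throwaway file-backed pool like this one is easy to build. The commands below are only a sketch of an equivalent setup; the paths and file contents are examples chosen to match what you'll see later in this post:

root@quadra /$ mkdir -p /zdev && mkfile 256m /zdev/disk002     # pre-allocate a 256MB backing file
root@quadra /$ zpool create testpool /zdev/disk002             # file vdevs need the full path
root@quadra /$ zfs create testpool/dataset01
root@quadra /$ echo "This is a test file." > /testpool/dataset01/testfile01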

root@quadra /$ zdb testpool
    version=12
    name='testpool'
    state=0
    txg=182
    pool_guid=1019414024587234776
    hostid=446817667
    hostname='quadra'
    vdev_tree
        type='root'
        id=0
        guid=1019414024587234776
        children[0]
                type='file'
                id=0
                guid=6723707841658505514
                path='/zdev/disk002'
                metaslab_array=23
                metaslab_shift=21
                ashift=9
                asize=263716864
                is_log=0
Uberblock

        magic = 0000000000bab10c
        version = 12
        txg = 184
        guid_sum = 7743121866245740290
        timestamp = 1225486684 UTC = Fri Oct 31 13:58:04 2008

Dataset mos [META], ID 0, cr_txg 4, 87.0K, 49 objects
Dataset testpool/dataset01 [ZPL], ID 30, cr_txg 6, 19.5K, 5 objects
Dataset testpool [ZPL], ID 16, cr_txg 1, 19.0K, 5 objects

                   capacity    operations   bandwidth   ---- errors ----
description      used avail   read write   read write   read write cksum
testpool         139K  250M    638     0   736K     0      0     0     0
  /zdev/disk002  139K  250M    638     0   736K     0      0     0     0
And so we see a variety of useful information, including:

Zpool (On Disk Format) Version Number
State
Host ID & Hostname
GUID (this is that numeric value you use when zpool import doesn't like the name)
Child vdevs that make up the pool
Uberblock magic number (read that hex value as "uba-bloc"; get it, 0bab10c, it's funny!)
Timestamp
List of datasets
Summary of IO stats
So this information is interesting, but frankly not terribly useful if you already have the pool imported. It would likely be of more value if you couldn't, or wouldn't, import the pool, but those cases are rare, and 99% of the time zpool import will tell you what you want to know even if you don't actually import.
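For example, zpool import run with no arguments lists pools that are available for import without actually importing them, and zdb's -e flag tells it to examine a pool that isn't currently imported. A quick sketch (the pool name here is just an example):

root@quadra /$ zpool import            # show importable pools, without importing anything
root@quadra /$ zdb -e testpool         # examine an exported / not-yet-imported pool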

There are three arguments that are really the core ones of interest, but before we get to them, you absolutely must understand something unique about zdb. zdb is like a magnifying glass: at default magnification you can see that it's tissue; turn up the magnification and you see that it has veins; turn it up again and you see how intricate the system is; crank it up one more time and you can see the blood cells themselves. With zdb, each time we repeat an argument we increase the verbosity and thus dig deeper. For instance, zdb -d will list the datasets of a pool, but zdb -dd will output the list of objects within the pool. Thus, when you really zoom in you'll see commands that look really odd, like zdb -ddddddddd. This takes a little practice, so please toy around on a small test pool to get the hang of it.
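To get a quick feel for the effect before we dig into the details, compare these invocations (we'll see the actual output further down):

root@quadra /$ zdb -d testpool        # datasets only
root@quadra /$ zdb -dd testpool       # ...plus the objects inside them
root@quadra /$ zdb -dddd testpool     # ...plus per-object detail (paths, times, modes, ...)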

Now, here are summaries of the three primary arguments you'll use and how things change as you crank up the verbosity (quick example invocations follow the list):

zdb -b pool: Traverses blocks looking for leaks, just as the default form does.
-bb: Outputs a breakdown of space (block) usage for the various ZFS object types.
-bbb: Same as above, but includes a breakdown by DMU/SPA level (L0-L6).
-bbbb: Same as above, but includes one line per object with details about it, including compression, checksum, DVA, object ID, etc.
-bbbbb...: Same as above.
zdb -d dataset: This will output a list of objects within a dataset. More d's means more verbosity:
-d: Outputs the list of datasets, including ID, cr_txg, size, and number of objects.
-dd: Outputs a concise list of objects within the dataset, with object ID, lsize, asize, type, etc.
-ddd: Same as -dd.
-dddd: Outputs the list of datasets and objects in detail, including each object's path (filename), a/c/r/m times, mode, etc.
-ddddd: Same as previous, but includes indirect block addresses (DVAs) as well.
-dddddd...: Same as above.
zdb -R pool:vdev_specifier:offset:size[:flags]: Given a DVA, outputs an object's contents in hex display format. If given the :r flag it will output in raw binary format. This can be used for manual recovery of files.
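Summarized as bare commands, the three forms look something like this (the DVA used with -R here is the one we'll dig out later in this post):

root@quadra /$ zdb -bb testpool                 # block/space breakdown by object type
root@quadra /$ zdb -dd testpool/dataset01       # objects within a single dataset
root@quadra /$ zdb -R testpool:0:11600:200      # dump one block, given vdev:offset:size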
So let's play with the first form above, block traversal. This will sweep the blocks of your pool or dataset, adding up what it finds, and then produce a report of any leakage and of how the space breaks down. This is extremely useful information, but given that it traverses all blocks it's going to take a long time depending on how much data you have. On a home box this might take minutes or a couple of hours; on a large storage subsystem it could take hours or days. Let's look at both -b and -bb for my simple test pool:

root@quadra ~$ zdb -b testpool

Traversing all blocks to verify nothing leaked ...

No leaks (block sum matches space maps exactly)

bp count: 50
bp logical: 464896 avg: 9297
bp physical: 40960 avg: 819 compression: 11.35
bp allocated: 102912 avg: 2058 compression: 4.52
SPA allocated: 102912 used: 0.04%

root@quadra ~$ zdb -bb testpool

Traversing all blocks to verify nothing leaked ...

No leaks (block sum matches space maps exactly)

bp count: 50
bp logical: 464896 avg: 9297
bp physical: 40960 avg: 819 compression: 11.35
bp allocated: 102912 avg: 2058 compression: 4.52
SPA allocated: 102912 used: 0.04%

Blocks  LSIZE  PSIZE  ASIZE    avg   comp  %Total  Type
     3  12.0K  1.50K  4.50K  1.50K   8.00    4.48  deferred free
     1    512    512  1.50K  1.50K   1.00    1.49  object directory
     1    512    512  1.50K  1.50K   1.00    1.49  object array
     1    16K     1K  3.00K  3.00K  16.00    2.99  packed nvlist
     -      -      -      -      -      -       -  packed nvlist size
     1    16K     1K  3.00K  3.00K  16.00    2.99  bplist
     -      -      -      -      -      -       -  bplist header
     -      -      -      -      -      -       -  SPA space map header
     3  12.0K  1.50K  4.50K  1.50K   8.00    4.48  SPA space map
     -      -      -      -      -      -       -  ZIL intent log
    16   256K  18.0K  40.0K  2.50K  14.22   39.80  DMU dnode
     3  3.00K  1.50K  3.50K  1.17K   2.00    3.48  DMU objset
     -      -      -      -      -      -       -  DSL directory
     4     2K     2K  6.00K  1.50K   1.00    5.97  DSL directory child map
     3  1.50K  1.50K  4.50K  1.50K   1.00    4.48  DSL dataset snap map
     4     2K     2K  6.00K  1.50K   1.00    5.97  DSL props
     -      -      -      -      -      -       -  DSL dataset
     -      -      -      -      -      -       -  ZFS znode
     -      -      -      -      -      -       -  ZFS V0 ACL
     1    512    512    512    512   1.00    0.50  ZFS plain file
     3  1.50K  1.50K  3.00K     1K   1.00    2.99  ZFS directory
     2     1K     1K     2K     1K   1.00    1.99  ZFS master node
     2     1K     1K     2K     1K   1.00    1.99  ZFS delete queue
     -      -      -      -      -      -       -  zvol object
     -      -      -      -      -      -       -  zvol prop
     -      -      -      -      -      -       -  other uint8[]
     -      -      -      -      -      -       -  other uint64[]
     -      -      -      -      -      -       -  other ZAP
     -      -      -      -      -      -       -  persistent error log
     1   128K  4.50K  13.5K  13.5K  28.44   13.43  SPA history
     -      -      -      -      -      -       -  SPA history offsets
     -      -      -      -      -      -       -  Pool properties
     -      -      -      -      -      -       -  DSL permissions
     -      -      -      -      -      -       -  ZFS ACL
     -      -      -      -      -      -       -  ZFS SYSACL
     -      -      -      -      -      -       -  FUID table
     -      -      -      -      -      -       -  FUID table size
     1    512    512  1.50K  1.50K   1.00    1.49  DSL dataset next clones
     -      -      -      -      -      -       -  scrub work queue
    50   454K  40.0K   101K  2.01K  11.35  100.00  Total
Here we can see the "zooming in" effect I described earlier. "bp" stands for "block pointer". The most common "Type" you'll see is "ZFS plain file", that is, a normal data file like an image or text file or something... the data you care about.

Moving on to the second form, -d, to output datasets and their objects. This is where introspection really occurs. With a simple -d we can see a recursive list of datasets, but as we turn up the verbosity (-dd) we zoom in on the objects within the dataset, and then get more and more detail about those objects.

root@quadra ~$ zdb -d testpool/dataset01
Dataset testpool/dataset01 [ZPL], ID 30, cr_txg 6, 18.5K, 5 objects

root@quadra ~$ zdb -dd testpool/dataset01
Dataset testpool/dataset01 [ZPL], ID 30, cr_txg 6, 18.5K, 5 objects

Object lvl iblk dblk lsize asize type
0 7 16K 16K 16K 14.0K DMU dnode
1 1 16K 512 512 1K ZFS master node
2 1 16K 512 512 1K ZFS delete queue
3 1 16K 512 512 1K ZFS directory
4 1 16K 512 512 512 ZFS plain file
So let's pause here. We can see the list of objects in my testpool/dataset01 by object ID. This is important because we can use those IDs to dig deeper on an individual object later. But for now, let's zoom in a little bit more (-dddd) on this dataset.

root@quadra ~$ zdb -dddd testpool/dataset01
Dataset testpool/dataset01 [ZPL], ID 30, cr_txg 6, 18.5K, 5 objects, rootbp [L0 DMU objset] 400L/200P DVA[0]=<0:12200:200> DVA[1]=<0:3014c00:200> fletcher4 lzjb LE contiguous birth=8 fill=5 cksum=a525c6edf:45d1513a8c8:ef844ac0e80e:22b9de6164dd69

Object lvl iblk dblk lsize asize type
0 7 16K 16K 16K 14.0K DMU dnode

Object lvl iblk dblk lsize asize type
1 1 16K 512 512 1K ZFS master node
microzap: 512 bytes, 6 entries

casesensitivity = 0
normalization = 0
DELETE_QUEUE = 2
ROOT = 3
VERSION = 3
utf8only = 0

Object lvl iblk dblk lsize asize type
2 1 16K 512 512 1K ZFS delete queue
microzap: 512 bytes, 0 entries

Object lvl iblk dblk lsize asize type
3 1 16K 512 512 1K ZFS directory
264 bonus ZFS znode
path /
uid 0
gid 0
atime Fri Oct 31 12:35:30 2008
mtime Fri Oct 31 12:35:51 2008
ctime Fri Oct 31 12:35:51 2008
crtime Fri Oct 31 12:35:30 2008
gen 6
mode 40755
size 3
parent 3
links 2
xattr 0
rdev 0x0000000000000000
microzap: 512 bytes, 1 entries

testfile01 = 4 (type: Regular File)

Object lvl iblk dblk lsize asize type
4 1 16K 512 512 512 ZFS plain file
264 bonus ZFS znode
path /testfile01
uid 0
gid 0
atime Fri Oct 31 12:35:51 2008
mtime Fri Oct 31 12:35:51 2008
ctime Fri Oct 31 12:35:51 2008
crtime Fri Oct 31 12:35:51 2008
gen 8
mode 100644
size 21
parent 3
links 1
xattr 0
rdev 0x0000000000000000
Now, this output is short because the dataset includes only a single file. In the real world this output will be gigantic and should be redirected to a file. When I did this on the dataset containing my home directory the output file was 750MB... it's a lot of data.
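For anything beyond a toy pool, send it straight to a file (the output path here is just an example):

root@quadra /$ zdb -dddd testpool/dataset01 > /var/tmp/dataset01.zdb 2>&1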

Look specifically at Object 4, a "ZFS plain file". Notice that I can see that file's pathname, uid, gid, a/m/c/crtimes, mode, size, etc. This is where things can get really interesting!

In zdb's third form above (-R) we can actually display the contents of a file, but we need its Data Virtual Address (DVA) and size to do so. To get that information, we can zoom in using -d a little further, but this time just on Object 4:

root@quadra /$ zdb -ddddd testpool/dataset01 4
Dataset testpool/dataset01 [ZPL], ID 30, cr_txg 6, 19.5K, 5 objects, rootbp [L0 DMU objset] 400L/200P DVA[0]=<0:172e000:200> DVA[1]=<0:460e000:200> fletcher4 lzjb LE contiguous birth=168 fill=5 cksum=a280728d9:448b88156d8:eaa0ad340c25:21f1a0a7d45740

Object lvl iblk dblk lsize asize type
4 1 16K 512 512 512 ZFS plain file
264 bonus ZFS znode
path /testfile01
uid 0
gid 0
atime Fri Oct 31 12:35:51 2008
mtime Fri Oct 31 12:35:51 2008
ctime Fri Oct 31 12:35:51 2008
crtime Fri Oct 31 12:35:51 2008
gen 8
mode 100644
size 21
parent 3
links 1
xattr 0
rdev 0x0000000000000000
Indirect blocks:
0 L0 0:11600:200 200L/200P F=1 B=8

segment [0000000000000000, 0000000000000200) size 512
Now, see that "Indirect block" 0? Following L0 (Level 0) is a tuple: "0:11600:200". This is the DVA and size, or more specifically the triple vdev:offset:size. We can use this information to request the block's contents directly.

And so, the -R form can display individual blocks from a device. To do so, we need to know the pool name, the vdev and offset (the DVA), and the size. Given what we did above, we now know all of that, so let's try it:

root@quadra /$ zdb -R testpool:0:11600:200
Found vdev: /zdev/disk002

testpool:0:11600:200
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
000000: 2073692073696854 6620747365742061 This is a test f
000010: 0000000a2e656c69 0000000000000000 ile.............
000020: 0000000000000000 0000000000000000 ................
000030: 0000000000000000 0000000000000000 ................
000040: 0000000000000000 0000000000000000 ................
000050: 0000000000000000 0000000000000000 ................
...
w00t! We can read the file contents!

You'll notice in the zdb usage summary ("zdb -h") that this form accepts flags as well. We can find these in the zdb source. The most interesting is the "r" flag, which, rather than displaying the data as above, dumps it in raw form to stderr.

So why is this useful? Try this on for size:

root@quadra /$ rm /testpool/dataset01/testfile01
root@quadra /$ sync;sync
root@quadra /$ zdb -dd testpool/dataset01
Dataset testpool/dataset01 [ZPL], ID 30, cr_txg 6, 18.0K, 4 objects

Object lvl iblk dblk lsize asize type
0 7 16K 16K 16K 14.0K DMU dnode
1 1 16K 512 512 1K ZFS master node
2 1 16K 512 512 1K ZFS delete queue
3 1 16K 512 512 1K ZFS directory

....... THE FILE IS REALLY GONE! ..........

root@quadra /$ zdb -R testpool:0:11600:200:r 2> /tmp/output
Found vdev: /zdev/disk002
root@quadra /$ ls -lh /tmp/output
-rw-r--r-- 1 root root 512 Nov 1 01:54 /tmp/output
root@quadra /$ cat /tmp/output
This is a test file.
How sweet is that! We delete a file, verify with zdb -dd that it really and truly is gone, and then bring it back out based on its DVA. Super sweet!

Now, before you get overly excited, some things to note... firstly, if you delete a file in the real world you probably don't have its DVA and size already recorded, so you're screwed. Also, notice that the original file was 21 bytes but the "recovered" file is 512... it's been padded, so if you recovered a file and tried using an MD5 hash or something to verify the content, it wouldn't match even though the data was valid. In other words, the best "undelete" option is snapshots... they are quick and easy; use them. Using zdb for file recovery isn't practical.
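For comparison, the snapshot route is one command to protect and one to recover. A minimal sketch (the snapshot name is just an example; the .zfs directory is hidden by default but still reachable by path):

root@quadra /$ zfs snapshot testpool/dataset01@safety
root@quadra /$ rm /testpool/dataset01/testfile01
root@quadra /$ cp /testpool/dataset01/.zfs/snapshot/safety/testfile01 /testpool/dataset01/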

I recently discovered and used this method to deal with a server that suffered extensive corruption as a result of a shitty (Sun Adaptec rebranded STK) RAID controller gone berserk following a routine disk replacement. I had several "corrupt" files that I could not read or reach; if I tried, I'd get a long pause, lots of errors to syslog, and then an "I/O error" return. Hopeless; this is a "restore from backups" situation. Regardless, I wanted to learn from the experience. Here is an example of the result:

[root@server ~]$ ls -l /xxxxxxxxxxxxxx/images/logo.gif
/xxxxxxxxxxxxxx/images/logo.gif: I/O error

[root@server ~]$ zdb -ddddd pool/xxxxx 181359
Dataset pool/xxx [ZPL], ID 221, cr_txg 1281077, 3.76G, 187142 objects, rootbp [L0 DMU objset] 400L/200P DVA[0]=<0:1803024c00:200> DVA[1]=<0:45007ade00:200> fletcher4 lzjb LE contiguous birth=4543000 fill=187142 cksum=8cc6b0fec:3a1b508e8c0:c36726aec831:1be1f0eee0e22c

Object lvl iblk dblk lsize asize type
181359 1 16K 1K 1K 1K ZFS plain file
264 bonus ZFS znode
path /xxxxxxxxxxxxxx/images/logo.gif
atime Wed Aug 27 07:42:17 2008
mtime Wed Apr 16 01:19:06 2008
ctime Thu May 1 00:18:34 2008
crtime Thu May 1 00:18:34 2008
gen 1461218
mode 100644
size 691
parent 181080
links 1
xattr 0
rdev 0x0000000000000000
Indirect blocks:
0 L0 0:b043f0c00:400 400L/400P F=1 B=1461218

segment [0000000000000000, 0000000000000400) size 1K

[root@server ~]$ zdb -R pool:0:b043f0c00:400:r 2> out
Found vdev: /dev/dsk/c0t1d0s0
[root@server ~]$ file out
out: GIF file, v89
Because real data is involved I had to cover up most of the above, but you can see how the methods we learned above were used to gain a positive result. Normal means of accessing the file failed miserably, but using zdb -R I dumped the file out. As a verification I opened the GIF in an image viewer and sure enough it looks perfect!

This is a lot to digest, but this is about as simple a primer to zdb as you're going to find. Hopefully I've given you a solid grasp of the fundamentals so that you can experiment on your own.

Where do you go from here? As noted before, I recommend you now check out the following:

Max Bruning's ZFS On-Disk Format Using mdb and zdb: Video presentation from the OpenSolaris Developer Conference in Prague on June 28, 2008. An absolute must watch for the hardcore ZFS enthusiast. Warning, may cause your head to explode!
Marcelo Leal's 5-part ZFS Internals series. Leal had a lot of courage to post these, and he's doing tremendous work! Read it!
Good luck and happy zdb'ing.... don't tell Sun. 🙂
