Documentation/md-cluster.txt

  The cluster MD is a shared-device RAID for a cluster.
  
  
  1. On-disk format
  
  A separate write-intent bitmap is used for each cluster node.
  The bitmaps record all writes that may have been started on that node
  and may not yet have finished. The on-disk layout is:
  
  0                    4k                     8k                    12k
  -------------------------------------------------------------------
  | idle                | md super            | bm super [0] + bits |
  | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
  | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
  | bm bits [3, contd]  |                     |                     |
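
  As a rough worked example of this layout: the first 4k block is idle space,
  the second holds the md superblock, and each bitmap slot then occupies two
  4k blocks (super + bits, followed by the continued bits). The sketch below
  computes a slot's byte offset on those assumptions; it is illustrative only,
  the driver takes the real offsets from the bitmap superblock rather than
  hard-coding them.

    #include <stdint.h>

    #define BLK_4K            4096ULL
    #define BITMAP_AREA_OFF   (2 * BLK_4K)  /* skip idle block + md super      */
    #define BITMAP_SLOT_SIZE  (2 * BLK_4K)  /* super + bits, then contd bits   */

    static inline uint64_t bitmap_slot_offset(int slot)
    {
            return BITMAP_AREA_OFF + (uint64_t)slot * BITMAP_SLOT_SIZE;
    }
    /* bitmap_slot_offset(0) == 8k, bitmap_slot_offset(1) == 16k, ... */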
  
  During "normal" functioning we assume the filesystem ensures that only one
  node writes to any given block at a time, so a write
  request will
   - set the appropriate bit (if not already set)
   - commit the write to all mirrors
   - schedule the bit to be cleared after a timeout.
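
  A C-flavoured sketch of this write path is given below. All helper names
  are hypothetical stand-ins for the md bitmap and mirror write code, not
  the driver's real functions.

    struct bitmap;      /* this node's write-intent bitmap (opaque here) */

    /* hypothetical helpers standing in for the real md/bitmap code */
    extern int  bitmap_test_bit(struct bitmap *bm, unsigned long sector);
    extern void bitmap_set_bit_sync(struct bitmap *bm, unsigned long sector);
    extern void write_all_mirrors(unsigned long sector, const void *buf,
                                  unsigned long len);
    extern void schedule_bit_clear(struct bitmap *bm, unsigned long sector,
                                   unsigned long timeout_ms);

    void clustered_write(struct bitmap *bm, unsigned long sector,
                         const void *buf, unsigned long len)
    {
            if (!bitmap_test_bit(bm, sector))       /* 1. set bit if not set   */
                    bitmap_set_bit_sync(bm, sector);
            write_all_mirrors(sector, buf, len);    /* 2. commit to all mirrors */
            schedule_bit_clear(bm, sector, 5000);   /* 3. clear after a timeout */
    }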
  
  Reads are just handled normally.  It is up to the filesystem to
  ensure one node doesn't read from a location where another node (or the same
  node) is writing.
  
  
  2. DLM Locks for management
  
  There are two locks for managing the device:
  
  2.1 Bitmap lock resource (bm_lockres)
  
   The bm_lockres protects the individual node bitmaps. They are named in
   the form bitmap001 for node 1, bitmap002 for node 2, and so on. When a
   node joins the cluster, it acquires the lock in PW mode and holds it in
   that mode for as long as it is part of the cluster. The lock resource
   number is based on the slot number returned by the DLM subsystem. Since
   DLM starts node count from one and bitmap slots start from zero, one is
   subtracted from the DLM slot number to arrive at the bitmap slot number.
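
   A small sketch of the naming and slot arithmetic described above
   (illustrative only, not the driver's exact code):

    #include <stdio.h>

    void node_bitmap_names(int dlm_slot)        /* DLM slots start at 1    */
    {
            char lockres_name[16];
            int bitmap_slot = dlm_slot - 1;     /* bitmap slots start at 0 */

            snprintf(lockres_name, sizeof(lockres_name),
                     "bitmap%03d", dlm_slot);   /* e.g. "bitmap001" for slot 1 */
            printf("DLM slot %d -> lock resource %s, bitmap slot %d\n",
                   dlm_slot, lockres_name, bitmap_slot);
    }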
  
  3. Communication
  
  Each node has to communicate with the other nodes when starting or ending
  a resync, and when the metadata superblock is updated.
  
  3.1 Message Types
  
   There are three types of messages which are passed. The third, NEWDISK,
   is used when adding a device and is described in section 5.
  
   3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
     updated, and the node must re-read the md superblock. This is performed
     synchronously.
  
   3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
     so that each node may suspend or resume the region.
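
   As an illustration, a cluster message could carry the fields mentioned in
   this document: the message type, the sender's slot, the resync range, and
   the uuid used by NEWDISK. The struct below is a sketch; the field names
   and sizes are assumptions, not the driver's on-wire format.

    #include <stdint.h>

    enum cluster_msg_type {
            METADATA_UPDATED,   /* re-read the md superblock            */
            RESYNCING,          /* suspend/resume the (lo, hi) range    */
            NEWDISK,            /* a device is being added (section 5)  */
    };

    struct cluster_msg {
            uint32_t type;      /* one of enum cluster_msg_type         */
            uint32_t slot;      /* slot of the sending node             */
            uint64_t lo, hi;    /* suspended range for RESYNCING        */
            uint8_t  uuid[16];  /* device uuid for NEWDISK              */
    };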
  
  3.2 Communication mechanism
  
   The DLM LVB (lock value block) is used to communicate between the nodes
   of the cluster. Three lock resources are used for this purpose:
  
    3.2.1 Token: The resource which protects the entire communication
     system. The node having the token resource is allowed to
     communicate.
  
    3.2.2 Message: The lock resource which carries the data to
     communicate.
  
    3.2.3 Ack: Acquiring this resource means the message has been
     acknowledged by all nodes in the cluster. The BAST of the resource
     is used to inform the receiving nodes that a node wants to communicate.
  
  The algorithm is:
  
   1. receive status
  
     sender                         receiver                   receiver
     ACK:CR                          ACK:CR                     ACK:CR
  
   2. sender get EX of TOKEN
      sender get EX of MESSAGE
      sender                        receiver                 receiver
      TOKEN:EX                       ACK:CR                   ACK:CR
      MESSAGE:EX
      ACK:CR
  
      Sender checks that it still needs to send a message. Messages received
      or other events that happened while waiting for the TOKEN may have made
      this message inappropriate or redundant.
  
   3. sender write LVB.
      sender down-convert MESSAGE from EX to CW
      sender try to get EX of ACK
      [ wait until all receivers have *processed* the MESSAGE ]
  
                                       [ triggered by bast of ACK ]
                                       receiver get CR of MESSAGE
                                       receiver read LVB
                                       receiver processes the message
                                       [ wait finish ]
                                       receiver release ACK
  
     sender                         receiver                   receiver
     TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR
     MESSAGE:CR
     ACK:EX
  
   4. triggered by grant of EX on ACK (indicating all receivers have processed
      message)
      sender down-convert ACK from EX to CR
      sender release MESSAGE
      sender release TOKEN
                                 receiver upconvert to PR of MESSAGE
                                 receiver get CR of ACK
                                 receiver release MESSAGE
  
     sender                      receiver                   receiver
     ACK:CR                       ACK:CR                     ACK:CR
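
   The sender side of this sequence is sketched below, reusing the
   struct cluster_msg sketch from section 3.1. lock_res(), downconvert(),
   unlock_res(), write_lvb() and message_still_needed() are hypothetical
   synchronous helpers, not the DLM's real API; error handling is omitted.

    enum res  { TOKEN, MESSAGE, ACK };
    enum mode { CR, CW, PR, EX };

    void lock_res(enum res r, enum mode m);
    void downconvert(enum res r, enum mode from, enum mode to);
    void unlock_res(enum res r);
    void write_lvb(enum res r, const struct cluster_msg *msg);
    int  message_still_needed(const struct cluster_msg *msg);

    void send_cluster_msg(const struct cluster_msg *msg)
    {
            lock_res(TOKEN, EX);                /* step 2: serialize senders   */
            lock_res(MESSAGE, EX);

            if (!message_still_needed(msg)) {   /* may have become redundant   */
                    unlock_res(MESSAGE);
                    unlock_res(TOKEN);
                    return;
            }

            write_lvb(MESSAGE, msg);            /* step 3: payload in the LVB  */
            downconvert(MESSAGE, EX, CW);       /* receivers can now take CR   */
            lock_res(ACK, EX);                  /* up-convert our CR on ACK;
                                                   granted only once every
                                                   receiver has processed the
                                                   message and dropped its CR  */
            downconvert(ACK, EX, CR);           /* step 4: back to idle state  */
            unlock_res(MESSAGE);
            unlock_res(TOKEN);
    }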
  
  
  4. Handling Failures
  
  4.1 Node Failure
   When a node fails, the DLM informs the cluster with the slot number. The
   node starts a cluster recovery thread. The cluster recovery thread:
  	- acquires the bitmap<number> lock of the failed node
  	- opens the bitmap
  	- reads the bitmap of the failed node
  	- copies the set bitmap to local node
  	- cleans the bitmap of the failed node
  	- releases bitmap<number> lock of the failed node
  	- initiates resync of the bitmap on the current node
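
   A rough sketch of the recovery thread's work is shown below, using
   hypothetical helpers named after the steps above; the real work is done
   by the md bitmap code once the failed node's bitmap lock is acquired.

    struct bitmap;

    /* hypothetical helpers, named after the recovery steps above */
    extern void           acquire_bitmap_lock(int slot);
    extern void           release_bitmap_lock(int slot);
    extern struct bitmap *open_bitmap(int slot);
    extern void           close_bitmap(struct bitmap *bm);
    extern void           copy_set_bits(struct bitmap *dst, struct bitmap *src);
    extern void           clear_bitmap(struct bitmap *bm);
    extern struct bitmap *local_bitmap(void);
    extern void           start_resync(struct bitmap *bm);

    void recover_failed_node(int failed_slot)
    {
            struct bitmap *bm;

            acquire_bitmap_lock(failed_slot);       /* bitmap<number> lock   */
            bm = open_bitmap(failed_slot);
            copy_set_bits(local_bitmap(), bm);      /* merge dirty regions   */
            clear_bitmap(bm);
            close_bitmap(bm);
            release_bitmap_lock(failed_slot);
            start_resync(local_bitmap());           /* resync merged ranges  */
    }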
  
   The resync process is the regular md resync. However, in a clustered
   environment, when a resync is performed, it needs to tell other nodes
   about the areas which are suspended. Before a resync starts, the node
   sends out RESYNC_START with the (lo,hi) range of the area which needs
   to be suspended. Each node maintains a suspend_list, which contains
   the list of ranges which are currently suspended. On receiving
   RESYNC_START, the node adds the range to the suspend_list. Similarly,
   when the node performing the resync finishes, it sends RESYNC_FINISHED
   to the other nodes, and they remove the corresponding entry from
   the suspend_list.
  
   A helper function, should_suspend(), can be used to check whether a
   particular I/O range should be suspended or not.
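
   A minimal sketch of such a check is shown below, assuming each
   suspend_list entry simply records the (lo,hi) range carried by
   RESYNC_START; the real helper also has to be called under the
   appropriate locking.

    struct suspend_info {
            struct suspend_info *next;
            unsigned long long lo, hi;      /* suspended sector range */
    };

    int should_suspend(const struct suspend_info *suspend_list,
                       unsigned long long lo, unsigned long long hi)
    {
            const struct suspend_info *s;

            for (s = suspend_list; s; s = s->next)
                    if (lo <= s->hi && hi >= s->lo)  /* ranges overlap    */
                            return 1;                /* suspend this I/O  */
            return 0;
    }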
  
  4.2 Device Failure
   Device failures are handled and communicated through the metadata update
   routine.
  
  5. Adding a new Device
  To add a new device, it is necessary that all nodes "see" the device that
  is to be added. For this, the following algorithm is used (a userspace
  sketch of step 5 follows the list):
  
      1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
         ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
      2. Node 1 sends NEWDISK with uuid and slot number
      3. Other nodes issue kobject_uevent_env with uuid and slot number
         (Steps 4,5 could be a udev rule)
      4. In userspace, the node searches for the disk, perhaps
         using blkid -t SUB_UUID=""
      5. Other nodes issue either of the following depending on whether the disk
         was found:
         ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
                  disc.number set to slot number)
         ioctl(CLUSTERED_DISK_NACK)
      6. Other nodes drop their lock on no-new-devs (CR) if the device is found
      7. Node 1 attempts an EX lock on no-new-devs
      8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the
         disk as SpareLocal
      9. If node 1 cannot get the no-new-devs lock, it fails the operation and
         sends METADATA_UPDATED
      10. Other nodes learn whether the disk was added or not from the following
          METADATA_UPDATED message.
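
  As referenced above, the sketch below shows the "disk found" branch of
  step 5 from userspace: it fills in mdu_disk_info_t and issues ADD_NEW_DISK
  with MD_DISK_CANDIDATE set, as named in this document. It is a hedged
  example only; whether additional fields must be set, and how
  CLUSTERED_DISK_NACK is issued in the not-found case, is not covered here.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>
    #include <linux/raid/md_u.h>   /* ADD_NEW_DISK, mdu_disk_info_t */
    #include <linux/raid/md_p.h>   /* MD_DISK_CANDIDATE             */

    int add_candidate(const char *md_dev, const char *disk_dev, int slot)
    {
            struct stat st;
            mdu_disk_info_t info;
            int fd, ret;

            if (stat(disk_dev, &st) < 0)
                    return -1;
            fd = open(md_dev, O_RDWR);
            if (fd < 0)
                    return -1;

            memset(&info, 0, sizeof(info));
            info.number = slot;                   /* slot from the NEWDISK message */
            info.major  = major(st.st_rdev);
            info.minor  = minor(st.st_rdev);
            info.state  = 1 << MD_DISK_CANDIDATE; /* mark as a cluster candidate   */

            ret = ioctl(fd, ADD_NEW_DISK, &info);
            close(fd);
            return ret;
    }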