0811.1304/tr.tex
1: \documentclass[10pt,a4paper,titlepage]{article}
2: \usepackage{latex8}
3: \usepackage{times}
4: \usepackage{epsfig, graphicx, subfigure}
5: \usepackage{color}
6: \usepackage{psfrag}
7: \usepackage{amsmath, amsthm, amssymb}
8: \usepackage{latexsym}
9: \usepackage[T1]{fontenc}
10: \usepackage[english]{babel}
11: \usepackage{fancyheadings}
12: \usepackage{algorithm}
13: \usepackage{algorithmic}
14: 
15: 
16: %Changing paper size
17: \usepackage{a4wide}
18: %\usepackage{setspace}
19: \setlength{\textwidth}{18cm}
20: \setlength{\textheight}{23.5cm}
21: %\setlength{\oddsidemargin}{0 cm}
22: \setlength{\oddsidemargin}{-1cm}
23: \setlength{\evensidemargin}{-1cm}
24: \setlength{\topmargin}{-1 cm}
25: 
26: 
27: \pagestyle{empty}
28: 
29: 
30: %%%%% for easy-to-undo modifications %%%%%
31: \newcommand{\remove}[1]{}
32: 
33: %%%%%% theorems, definitions, etc %%%%%%%%%%%%%%%%
34: \newtheorem{theorem}{\bf Theorem}
35: \newtheorem{lemma}{\bf Lemma}
36: \newtheorem{corollary}{\bf Corollary}
37: \newtheorem{notation}{\bf Notation}
38: \newtheorem{definition}{\bf Definition}
39: \newtheorem{claim}{\bf Claim}
40: \newtheorem{remark}{\bf Remark}
41: \newtheorem{require}{\bf Requirement}
42: \newtheorem{property}{\bf Property}
43: %\newenvironment{proof}{\noindent \bf Proof:\rm}{\hspace*{\fill}$\Box$\vspace{1ex}}
44: 
45: \renewcommand{\algorithmicrequire}{\textbf{Input:}}
46: \renewcommand{\algorithmicensure}{\textbf{Output:}}
47: \renewcommand{\algorithmiccomment}[1]{// #1}
48: 
49: %% for easy-to-undo modifications
50: %\newcommand{\remove}[1]{{}}
51: 
52: 
53: %% To get nice lists that will not need too much space.
54: %%
55: \newenvironment{BulletList}
56: {\begin{list}
57: {$\bullet$}
58: {\setlength{\topsep}{1pt}
59: \setlength{\parsep}{0pt}
60: \setlength{\itemsep}{1pt}
61: \setlength{\parskip}{1pt}}}
62: {\end{list}
63: \vspace{0pt}
64: }
65: 
66: %% To get nice enum lists that will not need too much space.
67: %%
68: \newcounter{num}
69: \newenvironment{NumList}
70: {\begin{list}
71: {\arabic{num}. }
72: {\usecounter{num}
73: \setlength{\topsep}{2pt}
74: \setlength{\parsep}{0pt}
75: \setlength{\itemsep}{2pt}
76: \setlength{\parskip}{2pt}}}
77: {\end{list}
78: \vspace{2pt}
79: }
80: 
81: 
82: \newcommand{\squishlist}{
83:    \begin{list}{$\bullet$}
84:     { \setlength{\itemsep}{0pt}      \setlength{\parsep}{3pt}
85:       \setlength{\topsep}{3pt}       \setlength{\partopsep}{0pt}
86:       \setlength{\leftmargin}{1.5em} \setlength{\labelwidth}{1em}
87:       \setlength{\labelsep}{0.5em} } }
88: 
89: \newcommand{\squishlisttwo}{
90:    \begin{list}{$\bullet$}
91:     { \setlength{\itemsep}{0pt}    \setlength{\parsep}{0pt}
92:       \setlength{\topsep}{0pt}     \setlength{\partopsep}{0pt}
93:       \setlength{\leftmargin}{2em} \setlength{\labelwidth}{1.5em}
94:       \setlength{\labelsep}{0.5em} } }
95: 
96: \newcommand{\squishend}{
97:     \end{list}  }
98: 
99: 
100: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
101: %% DOCUMENT START
102: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
103: %\setlength{\textheight}{0.95\textheight}
104: %\setlength{\topmargin}{1.5cm}
105: %\setlength{\textwidth}{14.5cm}
106: %\setlength{\oddsidemargin}{0.8cm}
107: %\setlength{\evensidemargin}{0.8cm}
108: 
109: \begin{document}
110: \onecolumn
111: 
112: \begin{titlepage}
113: \title{
114:   \raisebox{30mm}[0mm][0mm]{\Large
115:     Technical Report no. 2008-69
116:   }
117:   \raisebox{5mm}[0mm][0mm]{
118:     \textbf{
119:       \begin{tabular}{c}
120:       NB-FEB: An Easy-to-Use and Scalable Universal \\
121: 		Synchronization Primitive for Parallel Programming
122:       \end{tabular}
123:     }
124:   }
125: }
126: \author{\raisebox{-15mm}[0mm][0mm]{\textbf{\Large Phuong Hoai Ha}} \and \raisebox{-15mm}[0mm][0mm]{\textbf{\Large Philippas Tsigas\footnote{Department of Computer Science and Engineering, Chalmers University of Technology, SE-412 96 G{\"o}teborg, Sweden.}}} \and \raisebox{-15mm}[0mm][0mm]{\textbf{\Large Otto J. Anshus}}}
127: \date{
128:   \vspace{\stretch{1}}
129:   \enlargethispage{1.1\baselineskip}
130: %  \includegraphics{Logos/ChalmGUtextsvEng.eps} \\
131: %  \vspace{5mm}
132:   {\resizebox*{0.2\columnwidth}{!}{\includegraphics[angle=90]{Logos/UiTLogo.epsi}}} \\
133:   \vspace{12mm}
134:   Department of Computing Science \\
135:   Faculty of Science \\
136:   University of Troms{\o} \\
137:   N-9037 Troms{\o}, Norway \\
138:   \vspace{12mm}
139:    Troms{\o}, October 2008.
140: }    
141: \maketitle
142: \end{titlepage}
143: 
144: \newpage
145: \thispagestyle{empty}
146: \mbox{}
147: \vspace{\stretch{1}}
148: 
149: \noindent
150: {\large
151:   \begin{tabular}{l}
152:  %   {\resizebox*{0.2\columnwidth}{!}{\includegraphics[angle=90]{Logos/UiTLogo.epsi}}} \\[3ex]
153:     Technical Report in Computing Science at \\
154:     University of Troms{\o}
155:     \vspace{3ex} \\
156:     Technical Report no. 2008-69 \\
157:     ISSN: XXXX-XXXX 
158:     \vspace{3ex} \\
159:     Department of Computing Science \\
160:     Faculty of Science \\
161:     University of Troms{\o} \\
162:     N-9037 Troms{\o}, Norway \\
163:     \vspace{3ex} \\
164:     Troms{\o}, Norway, October 2008.
165:   \end{tabular}
166: }
167: 
168: \newpage
169: 
170: \twocolumn
171: 
172: \begin{abstract}
173: This paper addresses the problem of universal synchronization primitives that can support {\em scalable} thread synchronization for large-scale many-core architectures.
174: The universal synchronization primitives that have been deployed widely in conventional architectures are the {\em compare-and-swap} (CAS) and {\em load-linked/store-conditional} (LL/SC) primitives. However, such synchronization primitives are expected to reach their scalability limits in the evolution to many-core architectures with thousands of cores.
175: 
176: We introduce a {\em non-blocking} full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallel programming on many-core architectures. We show that the NB-FEB primitive is {\em universal, scalable, feasible} and {\em convenient} to use. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes ({\em universality}). NB-FEB is {\em combinable}, namely its memory requests to the same memory location can be combined into a single memory request, which mitigates the performance degradation due to synchronization ``hot spots'' ({\em scalability}). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems ({\em feasibility}). 
177: %(e.g. HEP, Tera, MDP, Sparcle, M-Machine and Eldorado). 
178: The original full/empty bit is well known as a {\em special-purpose} primitive for fast producer-consumer synchronization and has been used extensively in that specific application domain. 
179: In this paper, we show that NB-FEB can be deployed easily as a {\em general-purpose} primitive. Using NB-FEB, we construct a non-blocking software transactional memory system called NBFEB-STM, which can be used to handle concurrent threads {\em conveniently}. NBFEB-STM is space efficient: the space complexity of each object updated by $N$ concurrent threads/transactions is $\Theta(N)$, which is optimal. 
180: 
181: 
182: 
183: \remove{ %%% Start of removing abstract
184: The universal synchronization primitives that are deployed widely in conventional architectures are {\em compare-and-swap} (CAS) (e.g. IBM System/370, Sun SPARC, Intel Pentium) and {\em load-linked/store-conditional} (LL/SC) (e.g. MIPS II, DEC Alpha). However, due to their scalability limit, these synchronization primitives are  considered not suitable for future many-core architectures with thousands of cores. 
185: 
186: This paper proposes a non-blocking variant of the full/empty bit operations, called NB-FEB, as promising synchronization operations for parallel programming on may-core architectures. 
187: Particularly, the non-blocking variant of the {store-if-clear-and-set} operation 
188:  will return the value of the variable instead of waiting for the variable conditional flag to be clear. We prove that NB-FEB operations are {\em universal, scalable, implementable} and {\em convenient} to use. They are {\em universal} since they can solve the consensus problem for arbitrary number of processes. They are {\em scalable} since they are {\em combinable}. The combining technique, which has been implemented in the NYU Ultracomputer and IBM RP3 machines, has been shown to be a promising technique for large-scale multiprocessor. They are {\em implementable} since they are a slight variant of the original full/empty bit that has been implemented already in many machines like HEP, Tera, MDP, Sparcle, M-Machine and Eldorado. The original full/empty bit is well-known as a {\em special-purpose} primitive for fast producer-consumer communication and has been deployed in the specific domain of applications. In this paper, we show that NB-FEB, in fact, can be deployed as a {\em general-purpose} primitive. Using NB-FEB, we construct non-blocking software transactional memory.
189: %and non-blocking garbage collection. 
190: The software transactional memory can be used as a wrapper of NB-FEB to handle concurrent threads conveniently. The non-blocking software transactional memory is space efficient: the space complexity of each object accessed is $\Theta(N)$, where $N$ is the number of concurrent threads/transactions. 
191: %The non-blocking garbage collection deploys the sliding view technique, substantially reducing the number of required updates.
192: } %%% End of removing abstract
193: 
194: \end{abstract}
195: 
196: %\category{CR-number}{subcategory}{third-level}
197: 
198: %\terms
199: %term1, term2
200: 
201: %\centerline{
202: {\bf Keywords}: many-core architectures, non-blocking synchronization, full/empty bit, universal, combining, non-blocking software transactional memory, synchronization primitives.%}
203: %, non-blocking garbage collection.
204: 
205: \section{Introduction}
206: 
207: Universal synchronization primitives \cite{Her91} are essential for constructing non-blocking synchronization mechanisms for parallel programming, like non-blocking software transactional memory \cite{FraH07, HarF03, HerLMS03,MarSS05,RieFF06}.  Non-blocking synchronization eliminates the concurrency control problems of mutual exclusion locks, such as priority inversion, deadlock and convoying.
208: As many-core architectures with thousands of cores are expected to be our future chip architectures \cite{Asa06}, universal synchronization primitives that can support scalable thread synchronization for such large-scale architectures are desired.   
209: 
210: %The universal synchronization primitives that are deployed widely in conventional architectures are {\em compare-and-swap} ($CAS$) (e.g. IBM System/370, Sun SPARC, Intel Pentium) and {\em load-linked/store-conditional} ($LL/SC$) (e.g. MIPS II, DEC Alpha). 
211: %However, such synchronization primitives are expected to reach their scalability limits in the evolution of many-core architectures with thousands of cores.
212: %However, such synchronization primitives are not suitable for manycore architectures with hundreds of cores. 
213: 
214: 
215: However, the conventional universal primitives like {\em compare-and-swap} ($CAS$) and {\em load-linked/store-conditional} ($LL/SC$) are expected to reach their scalability limits in the evolution to many-core architectures with thousands of cores.
216: For each shared memory location, the $LL/SC$ implementation conceptually associates a reservation bit with each processor. The reservations are invalidated when the location is modified by any processor. Implementing $LL/SC$ in the memory (without compromising its semantics) limits the scalability of the multiprocessor since the total directory size increases quadratically with the number of processors \cite{MicS95}. Therefore, the $LL/SC$ primitives are built on conventional cache-coherence protocols \cite{MicS95,CulSG98}. 
217: However, experimental studies have shown that the $LL/SC$ primitives are not scalable for multicore architectures \cite{SriRK07}. The conventional cache-coherence protocols are considered inefficient for large-scale many-core architectures \cite{Asa06}. As a result, several emerging multicore architectures like the NVIDIA CUDA \cite{CUDA}, the ClearSpeed CSX \cite{CSX06}, the IBM Cell BE \cite{GscHFHWY06} and the Cyclops-64 \cite{CasCSW02} architectures utilize fast local memory for each processing core rather than coherent data caches. 
218: 
219: For the emerging many-core architectures without coherent data caches, the $CAS$ primitive is not scalable either, since $CAS$ is not {\em combinable} \cite{KruRS88,BleGV08}. Primitives are combinable if their memory requests to the same memory location (arriving at a switch of the processor-to-memory interconnection network) can be combined into a single memory request. Separate replies to the original requests are later created (at the switch) from the reply to the combined request. The combining technique has been implemented in the NYU Ultracomputer \cite{GotGKMRS82} and the IBM RP3 \cite{PfiBGHKMMNW85} machines and has been shown to be a promising technique for large-scale multiprocessors to alleviate the performance degradation due to synchronization ``hot spots''. The {\em single-valued} $CAS_{a}(x,b)$ primitive \cite{BleGV08}, which atomically swaps $b$ into $x$ if $x$ equals $a$, is combinable, but it requires one $CAS_{a}$ instruction for each integer $a$ that can be stored in one memory word (e.g. $2^{64}$ $CAS_{a}$ instructions for 64-bit words). This makes the single-valued $CAS_{a}$ infeasible for hardware implementation.
220: 
221: Another universal primitive called {\em sticky bit} has been suggested in \cite{Plo89}, but it has not been deployed so far due to its usage complexity. To the best of our knowledge, the universal construction using the sticky bit \cite{Plo89} does not prevent a delayed thread, even after being helped, from jamming the sticky bits of a cell that has been re-initialized and reused. 
222: Since the universal construction is built on a doubly-linked list of cells, it is not obvious how an external garbage collector (supported by the underlying system) can help solve the problem. Moreover, the space complexity of the universal construction for an object is as high as $O(N^2 \log N)$, where $N$ is the number of processes.
223:  
224: 
225: This paper suggests a novel synchronization primitive, called NB-FEB, as a promising synchronization primitive for parallel programming on many-core architectures. Four main properties make NB-FEB promising. NB-FEB is:
226: \begin{description}
227: \item[Feasible]: NB-FEB is a {\em non-blocking} variant of the conventional full/empty bit that {\em always returns} the old value of the variable instead of waiting for its conditional flag to be set (or cleared). 
228: This simple modification makes NB-FEB as {\em feasible} as the original (blocking) full/empty bit, which has been implemented in many computer systems like HEP \cite{Smi85}, Tera \cite{AlvCCKPS90}, MDP \cite{DalFKLNNDF92}, Sparcle \cite{AKKLYDP93}, M-Machine \cite{KecDMCCL98} and Eldorado \cite{FeoHKK05}. The space overhead of full/empty bits can be reduced using the {\em synchronization state buffer (SSB)} \cite{ZhuSHG07}.
229: 
230: \item[Universal]: This simple modification, however, significantly increases the synchronization power of full/empty bits, making NB-FEB as powerful as $CAS$ or $LL/SC$. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes, which is the essential property for constructing non-blocking synchronization mechanisms (cf. Section \ref{sec:universality}). 
231: 
232: \item[Scalable]: Like the original full/empty bit, NB-FEB is {\em combinable}: its memory requests to the same memory location can be combined into only one memory request (cf. Section \ref{sec:combinability}). This empowers NB-FEB 
233: with the ability to provide {\em scalable} thread synchronization for large-scale many-core architectures.
234: 
235: \item[Convenient to use]: The original full/empty bit is well known as a {\em special-purpose} primitive for fast producer-consumer synchronization and has been used extensively in that specific application domain. 
236: In this paper, we show that NB-FEB can be deployed easily as a {\em general-purpose} primitive. Using NB-FEB, we construct a non-blocking software transactional memory system called NBFEB-STM, which can be used to handle concurrent threads {\em conveniently}. NBFEB-STM is space efficient: the space complexity of each object updated by $N$ concurrent threads/transactions is $\Theta(N)$, which is optimal (cf. Section \ref{sec:NBFEBSTM}). 
237: \end{description}
238: 
239: %An obstruction-free software transactional memory (STM) \todo{citation} is implemented using only the primitive as a wrapper of the primitive. Programmers use the STM to handle concurrent threads.    
240: 
241: %The space complexity $\Theta(N)$ of the new STM for an object using $TFAS$ primitive is better than the space complexity $O(N^2logN)$ of the universal construction for an object using the sticky bit \cite{Plo89}.
242: 
243: %The combinable memory-block transaction \cite{BleGV08} is a lock-based transaction, not an obstruction-free transaction.
244: 
245: %The $Jam(v)$ operation of the sticky bit \cite{Plo89} is not convenient to use (cf. the universal construction using the sticky bit in \cite{Plo89}). Moreover, the universal construction using the sticky bit does not prevent a slow/sleeping thread (e.g. just before executing $Grab(OldHead)$ in $Append\_Inner$), even after being helped, from jamming the sticky bits (e.g. NotHead and Prev) of a cell (e.g. OldHead) that has been initialized and reused. Note that the system garbage collection (as in Java) is not applicable for the doubly-linked list. 
246: %It must check all cells in memory (cf. FindHead()), resulting in the time complexity linear to the memory space.
247: 
248: 
249: \remove { %%%% Start of removing Jam(v) %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
250: The $Jam(v)$ operation of the sticky byte \cite{Plo89} (cf. Algorithm \ref{alg:Sticky}), is not combinable by definition since, like $CAS$, its states are as many as the values that can be stored in one memory word \cite{KruRS88}  
251: 
252: 
253: \begin{algorithm}[tbh]
254:   \caption{{\sc Jam}($x$: variable, $v$: value)} \label{alg:Sticky}
255: 	\begin{algorithmic}
256: 	\IF{$x = \perp$ or $x=v$}
257: 		\STATE $x \leftarrow v$;
258: 		\RETURN {\bf success};
259: 	\ELSE
260: 		\RETURN {\bf fail};
261: 	\ENDIF
262:   	\end{algorithmic}
263: \end{algorithm}
264: } %%%%%% End of removing Jam(v) %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
265: 
266: %This paper presents a novel synchronization primitive that is
267: 
268: %\begin{algorithm}[tbh]
269: %  \caption{{\sc Clear}($x$: variable)} \label{alg:Clear}
270: %	\begin{algorithmic}
271: %	\STATE $flag_x \leftarrow$ \FALSE;
272: %  	\end{algorithmic}
273: %\end{algorithm}
274: 
275: \remove{ %%%%%% Start to remove LSC %%%%%%%%%%%%%%%%%%%%%
276: \begin{algorithm}[tbh]
277:   \caption{{\sc LSC}($x$: variable): a non-blocking variant of the original Load-if-Set-and-Clear operation, which returns $\perp$ (instead of waiting) if $flag_x$ is false.} \label{alg:LSC}
278: 	\begin{algorithmic}
279: 	\IF{ $flag_x =$ \TRUE}
280: 	\STATE $flag_x \leftarrow$ \FALSE;
281: 	\STATE $tmp \leftarrow x$; \COMMENT{Read the value of $x$}
282: 	\RETURN $tmp$; 
283: 	\ELSE 
284: 	\RETURN $\perp$;
285: 	\ENDIF
286:   	\end{algorithmic}
287: \end{algorithm}
288: } %%%%%% End of removing LSC %%%%%%%%%%%%%%%%%%%%%%%%%%
289: 
290: %The basic read/write operation is assumed not to read/write the flag of a variable.
291: 
292: %The new primitives {\em TFAS} and {\em Clear} are variants of {\em store-if-clear-and-set} and {\em load-and-clear} operations provided by full-empty bits, respectively, where {\em TFAS} is the {\em store-if-clear-and-set} that returns the current value and {\em Clear} is the {\em load-and-clear} that ignores the returned value. 
293: 
294: %Therefore, the paper also promotes the efficient full-empty bits from a special-purpose primitive, which is designed for fast producer-consumer communication, to a general-purpose universal primitive for many-core architectures. 
295: 
296: %Note that the {\em TFAS} is different from the sticky bytes \cite{Plo89} since the initial value of the variable used by {\em TFAS} is arbitrary, facilitating {\em TFAS} usage.
297: 
298: The rest of this paper is organized as follows. 
299: Section \ref{sec:models} presents the shared memory and interconnection network models assumed in this paper. 
300: Section \ref{sec:NBFEB} describes the NB-FEB primitive in detail and proves its universality and combinability properties.
301: Section \ref{sec:NBFEBSTM} presents NBFEB-STM, the obstruction-free multi-versioning STM constructed on the NB-FEB primitive.
302: Section \ref{sec:GarbageCollector} describes a garbage collector that can be used as an external garbage collector for the NBFEB-STM.
303: 
304: %\ref{sec:aiword} and \ref{sec:asvword} present the exact consensus numbers of the first, second and third models, respectively.
305: 
306: \section{Models} \label{sec:models}
307: 
308: As in previous research on the synchronization power of synchronization primitives \cite{Her91}, this paper assumes the linearizable shared memory model \cite{AttW04}. 
309: To support NB-FEB combinability, we assume, as in \cite{KruRS88}, that the processor-to-memory interconnection network is {\em nonovertaking} and that a reply message is sent back on the same path followed by the request message. The intermediate nodes on the communication path from a processor to a global shared memory module (such as switches of a multistage interconnection network or higher memory modules of a multilevel memory hierarchy) can detect requests destined for the same destination and maintain queues of requests. 
310: No memory coherence scheme is assumed. 
311: 
312: 
313: %- Universal: linearizable memory consistency.
314: %- Requirements for combining: Memory hierarchy without coherent data cache (cf. discussion in \cite{BleGV08}) and requirements for combining networks from \cite{KruRS88}
315: %- Correctness criterion for transactional memory: committed instantaneously, aborted, preserving real-time order, precluding inconsistent view \cite{GueK08}
316: %- No nested transaction
317: 
318: %\subsection{LSA-STM}
319: 
320: %- Introduction
321: 
322: %- Terminology from LSA-STM
323: 
324: %\todo{ The basic write that doesn't change the $flag_x$ will makes the set of operations including TFAS hard to combine (cf. \cite{KruRS88}). The full/empty bit has {\em store-and-clear} operation, which is needed for a closed set due to combinability $\Rightarrow$ Avoid using the basic $write$, using {\em store-and-clear} and {\em store-and-set} instead}
325: 
326: \section{NB-FEB Primitives} \label{sec:NBFEB}
327: 
328: The set of NB-FEB primitives consists of four sub-primitives: $TFAS$ (Algorithm \ref{alg:TFAS}), $Load$ (Algorithm \ref{alg:Load}), $SAC$ (Algorithm \ref{alg:SAC}) and $SAS$ (Algorithm \ref{alg:SAS}). The last three primitives are similar to those of the original full/empty bit. Regarding conditional load primitives, a processor can
329: check the flag value, $flag_x$, returned by the unconditional load primitive to determine if it was successful. 
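
As an illustration of the last remark, a non-blocking counterpart of a conditional load could be sketched as follows. The name {\sc TryLoadIfSet} and the use of $\perp$ as a failure indicator are assumptions made only for this sketch; they are not part of the NB-FEB primitive set.

\noindent {\sc TryLoadIfSet}($x$: variable), a hypothetical helper:
\begin{algorithmic}
	\STATE $(r, flag_r) \leftarrow$ {\sc Load}$(x)$;
	\IF{$flag_r =$ \TRUE}
		\RETURN $r$; \COMMENT{the flag was set, so the conditional load succeeds}
	\ELSE
		\RETURN $\perp$; \COMMENT{the flag was clear: report failure instead of waiting}
	\ENDIF
\end{algorithmic}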
330: 
331: 
332: \begin{algorithm}[t]
333:   \caption{{\sc TFAS}($x$: variable, $v$: value): Test-Flag-And-Set, a non-blocking variant of the original Store-if-Clear-and-Set primitive, which {\em always} returns the old value of $x$.} \label{alg:TFAS}
334: 	\begin{algorithmic}
335: 	\STATE $(o, flag_o) \leftarrow (x, flag_x)$; %Need $flag_x$ for combining, the successive load needs to know TFAS sucess or not to get $tmp$ or $v$.
336: 	\IF{$flag_x =$ \FALSE}
337: 	\STATE $(x, flag_x) \leftarrow (v,$ \TRUE $)$;
338: 	\ENDIF
339: 	\RETURN $(o, flag_o)$; %\todo{Check if it needs to return $tmp$, yet ???}
340:   	\end{algorithmic}
341: \end{algorithm}
342: 
343: \begin{algorithm}[t]
344:   \caption{{\sc Load}($x$: variable)} \label{alg:Load}
345: 	\begin{algorithmic}
346: 	\RETURN $(x,flag_x)$;
347:   	\end{algorithmic}
348: \end{algorithm}
349: 
350: 
351: \begin{algorithm}[t!]
352:   \caption{{\sc SAC}($x$: variable, $v$: value): Store-And-Clear} \label{alg:SAC}
353: 	\begin{algorithmic}
354: 	\STATE $(o,flag_o) \leftarrow (x,flag_x)$;
355: 	\STATE $(x,flag_x) \leftarrow (v,$ \FALSE $)$;
356: 	\RETURN $(o,flag_o)$;
357:   	\end{algorithmic}
358: \end{algorithm}
359: 
360: \begin{algorithm}[t!]
361:   \caption{{\sc SAS}($x$: variable, $v$: value): Store-And-Set} \label{alg:SAS}
362: 	\begin{algorithmic}
363: 	\STATE $(o,flag_o) \leftarrow (x,flag_x)$;
364: 	\STATE $(x,flag_x) \leftarrow (v,$ \TRUE $)$;
365: 	\RETURN $(o,flag_o)$;
366:   	\end{algorithmic}
367: \end{algorithm}
368: 
369: 
370: 
371: When the value of $flag_x$ returned is not needed, we just write $r \leftarrow$ {\sc TFAS}$(x,v)$ instead of $(r,flag_r) \leftarrow$ {\sc TFAS}$(x,v)$, where $r$ is $x$'s old value. The same applies to $SAC$ and $SAS$. For $Load$, we just write $r \leftarrow x$ instead of $r \leftarrow$ {\sc Load}$(x)$. 
372: In this paper, the flag value returned is needed only for combining NB-FEB primitives. 
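
To make the semantics of the four primitives concrete, consider the following hypothetical sequence applied to a variable $x$ that initially holds the arbitrarily chosen value $5$ with a clear flag, i.e. $(x, flag_x) = (5,0)$, where $0$ and $1$ denote $false$ and $true$. The values are illustrative only. Each row shows the pair returned by the primitive and the state of $x$ afterwards.

\begin{center}
\begin{tabular}{|l|c|c|} \hline
Primitive & Returned & $(x, flag_x)$ afterwards \\ \hline
{\sc TFAS}$(x,7)$ & $(5,0)$ & $(7,1)$ \\
{\sc TFAS}$(x,9)$ & $(7,1)$ & $(7,1)$ \\
{\sc Load}$(x)$ & $(7,1)$ & $(7,1)$ \\
{\sc SAC}$(x,3)$ & $(7,1)$ & $(3,0)$ \\
{\sc TFAS}$(x,9)$ & $(3,0)$ & $(9,1)$ \\
{\sc SAS}$(x,4)$ & $(9,1)$ & $(4,1)$ \\ \hline
\end{tabular}
\end{center}

The second $TFAS$ leaves $x$ unchanged because the flag is already set, whereas the third one succeeds because the preceding $SAC$ cleared the flag. In every case the caller learns the outcome from the returned pair without waiting.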
373: 
374: \subsection{$TFAS$: A Universal Primitive}\label{sec:universality}
375: 
376: \begin{lemma}
377: (Universality) The {\em test-flag-and-set} primitive (or $TFAS$ for short) is universal.
378: \end{lemma}
379: \begin{proof}
380: We will show that there is a wait-free\footnote{An implementation is {\em wait-free} if it guarantees that any process can complete any operation on the implemented object in a finite number of steps, regardless of the execution speeds of the other processes \cite{Her91,Lamport77}.} consensus algorithm, for an arbitrary number of processes, that uses only the $TFAS$ primitive and registers. 
381: 
382: The wait-free consensus algorithm is shown in Algorithm \ref{alg:TFASConsensus}.
383: Processes share a variable called $Decision$, which is initialized to $\perp$ with a $false$ flag. Each process $p$ proposes its value ($\neq \perp$) called $proposal$ by calling {\sc TFAS\_Consensus}$(proposal)$. 
384: 
385: The {\sc TFAS\_Consensus} procedure is clearly wait-free since it contains no loops. We need to prove that i) the procedure returns the same value to all processes and ii) the value returned is the value proposed by some process. Indeed, the procedure will return the proposal of the first process executing $TFAS$ on the $Decision$ variable to all processes. Let $p$ be a process calling the procedure.
386: \begin{itemize}
387: \item If $p$ is the first process executing $TFAS$ on the $Decision$ variable, since the $Decision$ variable is initialized to $\perp$ with a $false$ flag, 
388: $p$'s $TFAS$ will successfully write $p$'s proposal to $Decision$ and return $\perp$, the previous value of $Decision$. Since the value returned is $\perp$, the procedure returns $p$'s proposal (line \ref{alg:T:proposal}T), the proposal of the first process executing $TFAS$.
389: \item If $p$ is not the first process executing $TFAS$ on the $Decision$ variable, $p$'s $TFAS$ will fail to write $p$'s proposal to $Decision$ since $flag_{Decision}$ has been set to $true$ by the first $TFAS$ on $Decision$. $p$'s $TFAS$ will return the value, called $first$, written by the first $TFAS$. The $first$ value is the proposal of the first process executing $TFAS$ on the $Decision$ variable. Since $first \neq \perp$ (due to the hypothesis that proposals are not $\perp$), the procedure will return $first$ (line \ref{alg:T:first}T).
390: \end{itemize}
391: \end{proof}
392: 
393: 
394: \begin{algorithm}[t]
395:   \caption{{\sc TFAS\_Consensus}($proposal$: value)} \label{alg:TFASConsensus}
396: 	$Decision$: shared variable. The shared variable is initialized to $\perp$ with a clear flag (i.e. $flag_{Decision} =$ {\bf false}).\\
397: 
398:   	\algsetup{linenodelimiter=T:}
399:   	\begin{algorithmic}[1]
400: 	\ENSURE a value agreed upon by all processes.
401: 	\STATE $first \leftarrow$ {\sc TFAS}$(Decision, proposal)$;
402: 	\IF{ $first = \perp$}
403: 		\RETURN $proposal$; \label{alg:T:proposal}
404: 	\ELSE
405: 		\RETURN $first$; \label{alg:T:first}
406: 	\ENDIF 
407:   	\end{algorithmic}
408: \end{algorithm}
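
As a hypothetical run of Algorithm \ref{alg:TFASConsensus}, suppose that three processes $p_1$, $p_2$ and $p_3$ propose the values $a$, $b$ and $c$ (all different from $\perp$) and that their $TFAS$ primitives on $Decision$ are linearized in the order $p_2$, $p_3$, $p_1$. The process names, the proposals and the ordering are chosen only for illustration.

\begin{center}
\begin{tabular}{|c|c|c|c|} \hline
Process & $first$ & $Decision$ afterwards & Value returned \\ \hline
$p_2$ & $\perp$ & $(b, true)$ & $b$ \\
$p_3$ & $b$ & $(b, true)$ & $b$ \\
$p_1$ & $b$ & $(b, true)$ & $b$ \\ \hline
\end{tabular}
\end{center}

All three processes return $b$, the proposal of the process whose $TFAS$ was linearized first, and each of them executes exactly one $TFAS$.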
409: 
410: \subsection{Combinability}\label{sec:combinability}
411: 
412: \begin{lemma}
413: (Combinability) NB-FEB primitives are combinable.
414: \end{lemma}
415: \begin{proof}
416: Figure \ref{tab:Combination} summarizes the combining logic of NB-FEB primitives on a memory location $x$. The first column is the name of the first primitive request and the first row is the name of the successive primitive request. For instance, the cell $[SAS,TFAS]$ is the combining logic of $SAS$ and $TFAS$ in which $SAS$ is followed by $TFAS$.
417: Let $v_1, v_2, r$ and $f_r$ be the value of the first primitive request, the value of the second primitive request, the value returned and the flag returned, respectively. 
418: In each cell, the first line is the combined request, the second is the reply
419: to the first primitive request and the third (and fourth) is the reply
420: to the successive primitive request. The values $0$ and $1$ of $f_r$ in the reply represent $false$ and $true$, respectively.
421: 
422: Consider the cell $[TFAS,TFAS]$ as an example. The cell describes the case where request $TFAS(x,v_1)$ is followed by request $TFAS(x,v_2)$ at a switch of the processor-to-memory interconnection network. The two requests can be combined into a single request $TFAS(x,v_1)$ (line 1), which will be forwarded further to the corresponding memory controller. When receiving
423:  a reply $(r,f_r)$ to the combined request, the switch at which the requests were combined creates separate replies to the two original requests. The reply to the first original request, $TFAS(x,v_1)$, is $(r,f_r)$ (line 2) as if the request was executed by the memory controller. The reply to the successive request, $TFAS(x,v_2)$, depends on whether the combined request $TFAS(x,v_1)$ has successfully updated the memory location $x$. 
424: If $f_r=0$, $TFAS(x,v_1)$ has successfully updated $x$ with its value $v_1$. Therefore, the reply to the successive request $TFAS(x,v_2)$ is $(v_1,1)$ as if the request was executed right after the first request $TFAS(x,v_1)$.
425: If $f_r=1$, $TFAS(x,v_1)$ has failed to update the $x$ variable. Therefore, the reply to the successive request $TFAS(x,v_2)$ is $(r,1)$.
426: 
427: %From a combining point of view, $TFAS$ is a combination of the original {\em load} and {\em store-if-clear-and-set} primitives, where the {\em load} primitive is followed by the {\em store-if-clear-and-set} primitive. Since it has been proven in \cite{KruRS88} that the six primitives of the original full/empty bit, which is {\em load}, {\em store-if-clear-and-set}, {\em load-and-clear}, {\em store-and-set}, {\em store-and-clear} and {\em store-if-clear-and-clear}, are combinable, $TFAS$ is combinable. 
428: 
429: %Note that from combining aspect, $TFAS$ is similar to {\em store-if-clear-and-set} if {\em store} is assumed to return the old value (i.e. {\em swap}) as in \cite{KruRS88}. The difference between the two primitives is the synchronization aspect. Whereas the {\em store-if-clear-and-set} primitive waits for the flag being clear in order to store the new value and return the old value, 
430: \end{proof}
431: 
432: \begin{figure}[t]
433: \centering
434: \begin{tabular}{|l|l|l|l|l|} \hline
435:  $(x,[v_1])$ & \multicolumn{4}{c|}{ The successive primitive with parameters $(x,[v_2])$}\\
436: 			& $Load$ 		& 	$SAC$ 		& $SAS$ 			& $TFAS$ \\ \hline
$Load$ 	&  $Load$ 		& $SAC(v_2)$ 	& $SAS(v_2)$	& $TFAS(v_2)$\\ 
438: 			& $(r,f_r)$ 	& $(r,f_r)$ 	& $(r,f_r)$		& $(r,f_r)$\\
439: 			& $(r,f_r)$ 	& $(r,f_r)$ 	& $(r,f_r)$		& $(r,f_r)$ \\ \hline
440: $SAC$ 	& $SAC(v_1)$ 	& $SAC(v_2)$ 	& $SAS(v_2)$ 	& $SAS(v_2)$\\
441: 			& $(r,f_r)$ 	& $(r,f_r)$ 	& $(r,f_r)$		& $(r,f_r)$\\
442: 			& $(v_1,0)$ 	& $(v_1,0)$ 	& $(v_1,0)$		& $(v_1,0)$ \\ \hline
443: $SAS$ 	& $SAS(v_1)$ 	& $SAC(v_2)$ 	& $SAS(v_2)$ 	& $SAS(v_1)$\\
444: 			& $(r,f_r)$ 	& $(r,f_r)$ 	& $(r,f_r)$		& $(r,f_r)$\\
445: 			& $(v_1,1)$ 	& $(v_1,1)$ 	& $(v_1,1)$		& $(v_1,1)$ \\ \hline
446: $TFAS$ 	& $TFAS(v_1)$ 	& $SAC(v_2)$ 	& $SAS(v_2)$ 	& $TFAS(v_1)$\\
447: 			& $(r,f_r)$ 	& $(r,f_r)$ 	& $(r,f_r)$		& $(r,f_r)$\\
448: 			& Like 5th 		& Like 5th		& Like 5th		& if $f_r$=0: $(v_1,1)$ \\
449: 			& column 		& column			& column  		& else: $(r,1)$ \\ \hline
450: \end{tabular}
451: \caption{The combining logic of NB-FEB primitives on a memory location $x$}\label{tab:Combination}
452: \end{figure}
453: 
454: 
455: 
456: 
457: 
458: 
459: 
460: \section{NBFEB-STM: Obstruction-free Multi-versioning STM} \label{sec:NBFEBSTM}
461: 
Like the previous obstruction-free multi-versioning STM, LSA-STM \cite{RieFF06}, the new software transactional memory, NBFEB-STM, assumes that objects are accessed and modified only within transactions. NBFEB-STM also assumes that there are no nested transactions, namely each thread executes only one transaction at a time. 
463: NBFEB-STM, like other obstruction-free STMs \cite{HerLMS03, MarSS05, RieFF06}, is designed for garbage-collected programming languages (e.g. Java).
464: %Like previous obstruction-free STMs \cite{,MarSS05, RieFF06}, 
465: A variable reclaimed by the garbage collector is assumed to have all bits 0 when it is reused. Note that there are non-blocking garbage collection algorithms that do not require synchronization primitives other than reads and writes while they still guarantee the non-blocking property for application-threads. Such a garbage collection algorithm is presented in Section \ref{sec:GarbageCollector}.
466: 
467: Only two NB-FEB primitives, $TFAS$ and $SAC$, are needed for implementing NBFEB-STM.
468: 
469: \subsection{Challenges and Key Ideas}
470: 
471: Unlike the STMs using $CAS$ \cite{HerLMS03, MarSS05, RieFF06}, NBFEB-STM using $TFAS$ and $SAC$ must handle the problem that $SAC$'s interference with  concurrent $TFAS$es will violate the atomicity semantics expected on variable $x$. Overlapping $TFAS_1$ and $TFAS_2$ both may successfully write their new values to $x$ if $SAC$ interference occurs.
472: 
The key idea is not to use the transactional memory object $TMObj$ \cite{HerLMS03, MarSS05, RieFF06} that needs to switch its pointer frequently to a new locator (when a transaction commits). Such a $TMObj$ would need $SAC$ in order to clear the pointer's flag, allowing the next transaction to switch the pointer. Instead, NBFEB-STM keeps a linked-list of locators for each object and integrates a write-once pointer $next$ into each locator (cf. Figure \ref{fig:NBFEB-STM}). When opening an object $O$ for write, a transaction $T$ tries to append its locator to $O$'s locator-list by changing the $next$ pointer of the head-locator of the list using $TFAS$. Due to the semantics of $TFAS$, only one of the concurrent transactions trying to append their locators succeeds. The other transactions must retry in order to find the new head and then append their locators to the new head. Using the locator-list, each $next$ pointer is changed only once and thus its flag does not need to be cleared during the lifetime of the corresponding locator. This prevents a $SAC$ from interleaving with concurrent $TFAS$es.
474: The $next$ pointer, together with its locator, will be reclaimed by the garbage collector when the  lifetime of its locator is over. The garbage collector ensures that a locator will not be recycled until no thread/transaction has a reference to it.
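
To illustrate the append step with hypothetical names, suppose the head locator $H$ has $H.next = (\perp,0)$ and two transactions $T_1$ and $T_2$ concurrently try to append their locators $L_1$ and $L_2$. One possible interleaving is:
\begin{enumerate}
\item $T_1$ executes $TFAS(H.next, L_1)$, which returns $(\perp,0)$ and changes $H.next$ to $(L_1,1)$. The locator $L_1$ is now the head.
\item $T_2$ executes $TFAS(H.next, L_2)$, which returns $(L_1,1)$ and leaves $H.next$ unchanged since the flag is already set. The append of $L_2$ to $H$ fails.
\item $T_2$ searches for the head again, finds $L_1$ with $L_1.next = (\perp,0)$, and retries its append on $L_1.next$.
\end{enumerate}
In this interleaving $H.next$ is written exactly once, so no $SAC$ on $H.next$ is needed while $H$ is the head.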
475: 
476: Linking locators together creates another challenge on the space complexity of NBFEB-STM. Unlike the STMs using $CAS$, a delayed/halted transaction $T$ in NBFEB-STM may prevent all locators appended after its locator in a locator-list from being reclaimed. As a result, $T$ may make the system run out of memory and thus prevent other transactions from making progress, violating the obstruction-freedom property.
The key idea to solve the space challenge is to break the list of obsolete locators into pieces so that a delayed transaction $T$ prevents only the locator to which $T$ has a direct reference from being reclaimed, as in the STMs using $CAS$. The idea is based on the fact that only the head of $O$'s locator-list is needed for further accesses to the object $O$.
478: 
479: However, breaking the list of an obsolete object $O$ also creates another challenge on finding the head of $O$'s locator-list. Obviously, we cannot use a head pointer as in non-blocking linked-lists since modifying such a pointer requires $CAS$.
480: The key idea is to utilize the fact that there are no nested transactions and thus each thread has at most one {\em active} locator\footnote{An {\em active} locator is a locator that is still in use, opposite to an {\em obsolete} locator.} in each locator list. Therefore, by recording the latest locator of each thread appended to $O$'s locator-list, a transaction can find the head of $O$'s locator list. The solution is elaborated further in Section \ref{sec:Algorithm} and Section \ref{sec:Correctness}. 
481: 
482: Based on the key ideas, we come up with the data structure for a transactional memory object that is illustrated in Figure \ref{fig:NBFEB-STM} and presented in Algorithm \ref{alg:StartSTM}.  
483: 
The transactional memory object in NBFEB-STM is an array of $N$ pairs (pointer, timestamp), where $N$ is the number of concurrent threads/transactions as shown in Figure \ref{fig:NBFEB-STM}. Item $TMObj[i]$ is modified only by thread $t_i$ and can be read by all threads. Pointer $TMObj[i].ptr$ points to the locator called $Loc_i$ corresponding to the latest transaction committed/aborted by thread $t_i$. Timestamp $TMObj[i].ts$ is the commit timestamp of the object referenced by $Loc_i.old$. 
485: After successfully appending its locator $Loc_i$ to  the list by executing $TFAS(head.next,Loc_i)$, $t_i$ will update its own item $TMObj[i]$ with its new locator $Loc_i$.
486: The $TMObj$ array is used to find the head of the list of locators $Loc_1, \cdots, Loc_N$.
487: 
For each locator $Loc_i$, in addition to the fields $Tx$, $old$ and $new$ that reference the corresponding transaction object, the old data object and the new data object, respectively, as in DSTM \cite{HerLMS03}, there are two other fields, $cts$ and $next$. The $cts$ field records the commit timestamp of the object referenced by $old$. The $next$ field is the pointer to the next locator in the locator list. The $next$ pointer is modified by NB-FEB primitives. In Figure \ref{fig:NBFEB-STM}, values $\{0,1\}$ in the $next$ pointer denote the values $\{false,true\}$ of its flag, respectively. The $next$ pointer of the head of the locator list, $Loc_3.next$, has its flag clear (i.e. 0), and the $next$ pointers of previous locators (e.g. $Loc_1.next$, $Loc_2.next$) have their flags set (i.e. 1) since their $next$ pointers were changed. The $next$ pointer of a new locator (e.g. $Loc_4.next$) is initialized to $(\perp,0)$.
489: Due to the garbage collector semantics, all locators $Loc_j$ reachable from the $TMObj$ {\em shared} object by following their $Loc_j.next$ pointers, will not be reclaimed.
490: 
491: For each transaction object $Tx_i$, in addition to fields $status$, $readSet$ and $writeSet$ corresponding to the status, the set of objects opened for read, and the set of objects opened for write, respectively, there is a field $cts$ recording $Tx_i$'s commit timestamp (if $Tx_i$ committed) as in LSA-STM \cite{RieFF06}. 
492: 
493: \begin{figure}[t]
494: \begin{center}
495:     {\scalebox{0.45}{\input{nbFebStm.pstex_t}}
496:     \caption{The data structure of a transactional memory object $TMObj$ in NBFEB-STM with four threads.} \label{fig:NBFEB-STM}}
497: \end{center}
498: \end{figure}
499: 
500: 
501: \subsection{Algorithm} \label{sec:Algorithm}
502: 
A thread $t_i$ starts a transaction $T$ by calling the {\sc StartSTM}$(T)$ procedure (Algorithm \ref{alg:StartSTM}). The procedure sets $T.status$ to $Active$ and clears its flag using $SAC$ (cf. Algorithm \ref{alg:SAC}). The procedure then initializes the lazy snapshot algorithm (LSA) \cite{RieFF06} by calling {\sc LSA\_Start}. NBFEB-STM utilizes LSA to preclude inconsistent views by {\em live} transactions, an essential aspect of transactional memory semantics \cite{GueK08}. The LSA has been shown to be an efficient mechanism to construct consistent snapshots for transactions \cite{RieFF06}. Moreover, the LSA can utilize up to $(N+1)$ versions of a transactional memory object $TMObj$ recorded in $N$ locators of $TMObj$'s locator list.
504: Note that the global counter $CT$ in LSA can be implemented by the {\em fetch-and-increment} primitive \cite{GotGKMRS82}, a combinable (and thus scalable) primitive \cite{KruRS88}. Except for the global counter $CT$, the LSA in NBFEB-STM does not need any strong synchronization primitives other than $TFAS$.  The {\sc Abort}$(T)$ operation in LSA, which is used to abort a transaction $T$, is replaced by $TFAS(T.status, Aborted)$. Note that the $status$ field is the only field of a transaction object $T$ that can be modified by other transactions.
505: 
506: \begin{algorithm}[t]
507:   \caption{{\sc StartSTM}($T$: transaction)} \label{alg:StartSTM}
508: 	$TMObj$: array$[N]$ of $\{ptr, ts\}$. Pointer $TMObj[i].ptr$ points to the locator called $Loc_i$ corresponding to the latest transaction committed/aborted by thread $t_i$. Timestamp $TMObj[i].ts$ is the commit timestamp of the object referenced by $Loc_i.old$. $N$ is the number of concurrent threads/transactions. $TMObj[i]$ is written only by thread $t_i$. \\
509: %\todo{remove the $start$ field}
510: 
511: 	$Locator$: {\bf record} 
512: 		$tx, new, old$: pointer;
		$cts$: timestamp; $next$: pointer with flag;
514: 	{\bf end}. The $cts$ timestamp is the commit timestamp of the old version.\\
515: 
516: 	$Transaction$: {\bf record}
517: 		$status: \{Active, Committed, Aborted\}$;
518: 		$cts$: timestamp; 
519: 	{\bf end}. NBFEB-STM also keeps read/write sets as in LSA-STM, but the sets are omitted from the pseudocode since managing the sets in NBFEB-STM is similar to LSA-STM. \\
520: 
521: 	%$Transaction$: {\bf record} $status$ {\bf end}. \\
522: 
523:   	\algsetup{linenodelimiter=S:}
524:   	\begin{algorithmic}[1]
525: 	\STATE {\sc SAC}$(T.status, Active)$; \COMMENT{Store-and-clear} \label{alg:S:initStatus}
526: 	\STATE {\sc LSA\_Start}$(T)$ \COMMENT{Lazy snapshot algorithm}
527:   	\end{algorithmic}
528: \end{algorithm}
529: 
When a transaction $T$ opens an object $O$ for read, it invokes the {\sc OpenR} procedure (Algorithm \ref{alg:OpenRObject}). The procedure simply calls the {\sc LSA\_Open} procedure of LSA \cite{RieFF06} in the $Read$ mode to get the version of $O$ that maintains a consistent snapshot with the versions of other objects being accessed by $T$.
If no such version of $O$ exists, {\sc LSA\_Open} will abort $T$ and consequently {\sc OpenR} will return $\perp$ (line \ref{alg:R:ReturnNil}R). That means there is a conflicting transaction that makes $T$ unable to maintain a consistent view of all the objects being accessed by $T$. Otherwise, {\sc OpenR} returns the version of $O$ that is selected by LSA. This version is guaranteed by LSA to belong to a consistent view of all the objects being accessed by $T$.
532: Up to $(N+1)$ versions are available for each object $O$ in NBFEB-STM (cf. Lemma \ref{lem:numOfVersions}).
533: Since NBFEB-STM utilizes LSA, read-accesses to an object $O$ are invisible to other transactions and thus do not change $O$'s locator list.
534: 
535: \begin{algorithm}[t]
  \caption{{\sc OpenR}($T$: Transaction; $O_i$: TMObj): Open a transactional object for read} \label{alg:OpenRObject}
537:   \algsetup{linenodelimiter=R:}
538:   \begin{algorithmic}[1]
539: 	\ENSURE	reference to a {\em data} object if succeeds, or $\perp$.
	\STATE {\sc LSA\_Open}$(T, O_i, "Read")$; \COMMENT{LSA's {\sc Open} procedure} \label{alg:R:LSAOpen}
541: 	\IF{$T.status = Aborted$} \label{alg:R:TStatus}
542: 	\RETURN $\perp$; \label{alg:R:ReturnNil}
543: 	\ELSE
544: 	\RETURN the version chosen by {\sc LSA\_Open}; \label{alg:R:ReturnVersion}
545: 	\ENDIF
546:   	\end{algorithmic}
547: \end{algorithm}
548: 
549: %Figure \ref{alg:OpenRObject}: Only the thread owner $t$ of a transaction $Tx$ can change $Tx.status$ to $Committed$. Other transactions change $Tx.status$ to only $Aborted$. $Tx.status$ is the only field of $Tx$  that can be modified by other threads. For each transactional memory object $TMObj$, its read-only transactions are invisible to its write transactions.
550: 
551: 
When a transaction $T$ opens an object $O$ for write, it invokes the {\sc OpenW} procedure (cf. Algorithm \ref{alg:OpenWObject}). The task of the procedure is to append to the head of $O$'s locator list a new locator $L$ whose $Tx$ and $old$ fields reference $T$ and $O$'s latest version, respectively. In order to find $O$'s latest version, 
553: the procedure invokes {\sc FindHead} (cf. Algorithm \ref{alg:FindHead}) to find the current head of $O$'s locator list (line \ref{alg:W:findHead}W). When the head called $H$ is found, the procedure determines $O$'s latest version based on the status of the corresponding transaction $H.Tx$ as in DSTM \cite{HerLMS03}. 
554: If the $H.Tx$ transaction committed, $O$'s latest version is $H.new$ with commit timestamp $H.Tx.cts$ (lines \ref{alg:W:ifCommit}W-\ref{alg:W:TxCts}W). A copy of $O$'s latest version is created and referenced by $L.new$ (line \ref{alg:W:CopyHNew}W) (cf. locators $Loc_2$ and $Loc_3$ in Figure \ref{fig:NBFEB-STM} as $H$ and $L$, respectively, for an illustration).
555: If the $H.Tx$ transaction aborted, $O$'s latest version is $H.old$ with commit timestamp $H.cts$ (lines \ref{alg:W:ifAbort}W-\ref{alg:W:HCts}W) (cf. locators $Loc_1$ and $Loc_2$ in Figure \ref{fig:NBFEB-STM} as $H$ and $L$, respectively, for an illustration). 
556: If the $H.Tx$ transaction is active, {\sc OpenW} consults the contention manager \cite{GeuHP05,SchS05} (line \ref{alg:W:CM}W) to solve the conflict between the $T$ and $H.Tx$ transactions. If $T$ must abort, {\sc OpenW} tries to change $T.status$ to $Aborted$ using $TFAS$ (line \ref{alg:W:AbortT}W) and returns $\perp$. Note that other transactions change $T.status$ only to $Aborted$, and thus if $TFAS$ at line \ref{alg:W:AbortT}W fails, $T.status$ has been changed to $Aborted$ by another transaction.
557: If $H.Tx$ must abort, {\sc OpenW} changes $H.Tx.status$ to $Aborted$ using $TFAS$ (line \ref{alg:W:AbortHTx}W) and checks $H.Tx.status$ again. 
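
The version selection described above can be summarized as follows, where $v_O$ and $ts_O$ are shorthand introduced here only for $O$'s latest version and its commit timestamp:
\[
(v_O,\ ts_O) =
\begin{cases}
(H.new,\ H.Tx.cts) & \text{if } H.Tx.status = Committed,\\
(H.old,\ H.cts) & \text{if } H.Tx.status = Aborted.
\end{cases}
\]
In both cases the new locator $L$ is filled in with $L.old = v_O$, $L.cts = ts_O$ and $L.new$ referencing a copy of $v_O$. If $H.Tx$ is still active, neither case applies until the contention manager has resolved the conflict.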
558: 
559: The latest version of $O$ is then checked to ensure that it, together with the versions of other objects being accessed by $T$, belongs to a consistent view using {\sc LSA\_Open} with "Write" mode (line \ref{alg:W:LSAOpen}W). If it does, {\sc OpenW} tries to append the new locator $L$ to $O$'s locator list by changing the $H.next$ pointer to $L$ (line \ref{alg:W:TFASHead}W). Note that the $H.next$ pointer was initialized to $\perp$ with a clear flag, before $H$ was successfully appended to $O$'s locator list (line \ref{alg:W:initNext}W).
If {\sc OpenW} does not succeed, another locator has been appended as a new head and thus {\sc OpenW} must retry to find the new head (line \ref{alg:W:FindHeadAgain}W). Otherwise, it successfully appends the new locator $L$ as the new head of $O$'s locator list. {\sc OpenW}, which is being executed by a thread $t_i$, then makes $O[i].ptr$ reference $L$ and records $L.cts$ in $O[i].ts$ (line \ref{alg:W:UpdateOi}W). This removes $O$'s reference to the previous locator $oldLoc$ appended by $t_i$, allowing $oldLoc$ to be reclaimed by the garbage collector. Since $oldLoc$ now becomes an obsolete locator, its $next$ pointer is reset (line \ref{alg:W:oldLocNext}W) to break possible chains of obsolete locators reachable by a delayed/halted thread, helping $oldLoc$'s descendant locators in the chains be reclaimed. 
For each item $j$ in the $O$ array such that $O[j].ts < O[i].ts$, the $O[j].ptr$ locator now becomes obsolete in the sense that it no longer keeps $O$'s latest version although it is still referenced by $O[j]$ (since only thread $t_j$ can modify $O[j]$). In order to break the chains of obsolete locators, {\sc OpenW} resets the $next$ pointer of the $O[j].ptr$ locator so that $O[j].ptr$'s descendant locators can be reclaimed by the garbage collector (lines \ref{alg:W:ClearOjLoop}W-\ref{alg:W:ClearOj}W). This chain-breaking mechanism makes the space complexity of an object updated by $N$ concurrent transactions/threads in NBFEB-STM $\Theta(N)$, which is optimal (cf. Theorem \ref{the:spaceComplexity}).
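
As an illustration of the chain-breaking steps, consider a hypothetical object $O$ shared by $N=2$ threads (the locator names $A$, $B$, $D$ and all timestamps are assumptions made for this example). Let $O = [(A,5),(B,8)]$, where $A$ was appended by $t_1$ and $B$ by $t_2$, let $A.next$ have already been reset to $\perp$, and let $B$ be the head with $B.next = (\perp,0)$. Suppose $B.tx$ committed with commit timestamp $12$ and $t_1$ now opens $O$ for write with a new locator $D$, so that $D.cts = 12$:
\begin{enumerate}
\item $TFAS(B.next, D)$ succeeds and $B.next$ becomes $(D,1)$ (line \ref{alg:W:TFASHead}W).
\item $O[1]$ is set to $(D,12)$, so $O$ no longer references $A$ (line \ref{alg:W:UpdateOi}W).
\item $SAC(A.next,\perp)$ resets the $next$ pointer of $t_1$'s old locator (line \ref{alg:W:oldLocNext}W).
\item Since $O[2].ts = 8 < 12 = O[1].ts$, $SAC(B.next,\perp)$ resets $B.next$, so $B$ no longer links to $D$ (lines \ref{alg:W:ClearOjLoop}W-\ref{alg:W:ClearOj}W).
\end{enumerate}
Afterwards $O$ references only $B$ and $D$, and neither links to another locator. A delayed thread that still holds a reference to $A$ or to $B$ keeps only that single locator from being reclaimed, not the locators appended after it.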
562: 
563: \begin{algorithm}[p!]
564:   \caption{{\sc OpenW}($T$: Transaction; $O$: TMObj): Open a transactional memory object for write by a thread $p_i$} \label{alg:OpenWObject}
565:   \algsetup{linenodelimiter=W:}
566:   \begin{algorithmic}[1]
567: 	\ENSURE	reference to a {\em data} object if succeeds, or $\perp$.
568: 	\STATE $newLoc \leftarrow$ new Locator; \label{alg:W:newLoc}
569: 	\WHILE{\TRUE}	 	
570: 	\STATE $head \leftarrow$ {\sc FindHead}$(O)$;  \COMMENT{Find the head of $O$'s list.} \label{alg:W:findHead}
571: %head is the {\tt Locator} whose write-once pointer $next$ hasn't been written at this time, i.e. $(*,0)$} 
	\FOR{$k=0$ to $1$}	
573:  		\IF{ $head.tx.status = Committed$} \label{alg:W:ifCommit}
574: 			\STATE $newLoc.old \leftarrow head.new$; \label{alg:W:HNew} 
575: 			\STATE $newLoc.cts \leftarrow head.tx.cts$; \label{alg:W:TxCts}
576: 			\STATE $newLoc.new \leftarrow$ {\sc Copy}$(head.new)$;\COMMENT{Create a duplicate} \label{alg:W:CopyHNew}
577: 			\STATE {\bf break};
578: 		\ELSIF{ $head.tx.status = Aborted$} \label{alg:W:ifAbort}
579: 			\STATE $newLoc.old \leftarrow head.old$; \label{alg:W:HOld}
580: 			\STATE $newLoc.cts \leftarrow head.cts$; \label{alg:W:HCts}
581: 			\STATE $newLoc.new \leftarrow$ {\sc Copy}$(head.old)$; \label{alg:W:CopyHOld}
582: 			\STATE {\bf break};
583: 		\ELSE 
			\STATE $myProgression \leftarrow$ {\sc CM}$(O, "Write")$\COMMENT{ $head.tx$ is active $\Rightarrow$ Consult the contention manager} \label{alg:W:CM}
585: 			\IF{ $myProgression =$ \FALSE}
586: 	%\STATE $newLoc.tx \leftarrow \perp$; $newLoc.prev \leftarrow \perp$; \COMMENT{For garbage collecting}
587: 				\STATE {\sc TFAS}$(T.status, Aborted)$; \COMMENT{If fails, another has executed this $TFAS$.} \label{alg:W:AbortT}
588: 				\RETURN $\perp$;
589: 			\ELSE  
590: 				\STATE {\sc TFAS}$(head.tx.status, Aborted)$; \label{alg:W:AbortHTx}  
591: 				\STATE {\bf continue}; \COMMENT{ Transaction $head.tx$ has committed/aborted $\Rightarrow$ Check $head.tx.status$ one more time}
592: 			\ENDIF
593: 		\ENDIF \label{alg:W:endIf}
594: 	\ENDFOR
595: 	\STATE $newLoc.tx \leftarrow T$; 
596: 	\STATE {\sc SAC}$(newLoc.next, \perp)$; \COMMENT{Store-and-clear} \label{alg:W:initNext}
597: %	\STATE $newLoc.prev \leftarrow (head,0)$; \COMMENT{For multi-version STM}
598: 	\STATE {\sc LSA\_Open}$(T, O, "Write")$; \COMMENT{LSA's {\sc Open} procedure.} \label{alg:W:LSAOpen}
599: %\COMMENT{{\sc Abort(T)} in {\sc LSA\_Open} is {\sc TFAS}$(T.status, Aborted)$; Available versions are in the locators accessible from $O[]$.} 
600: %and following the $next$ pointers.}
601: 	\IF{$T.status = Aborted$}
602: 	\RETURN $\perp$; \COMMENT{Performance (not correctness): Don't add $newLoc$ to $O$ if $T$ has aborted due to, for instance, {\sc LSA\_Open}.}
603: 	\ENDIF
604: 	\IF{ {\sc TFAS}$(head.next, newLoc) \neq \perp$} \label{alg:W:TFASHead}
605: 		\STATE {\bf continue}; \COMMENT{Another locator has been appended $\Rightarrow$ Find the head again} \label{alg:W:FindHeadAgain}
606: 	\ELSE
		\STATE $oldLoc \leftarrow O[i].ptr$;
608: 		\STATE $O[i] \leftarrow (newLoc, newLoc.cts)$; \COMMENT{Atomic assignment; $p_i$'s old locator is unlinked from $O$.} \label{alg:W:UpdateOi} %This command must is before the next for-loop due to correctness.
609: 		\STATE {\sc SAC}$(oldLoc.next, \perp)$; \COMMENT{$oldLoc$ may be in the chain of a sleeping thread $\Rightarrow$ Stop the chain here} \label{alg:W:oldLocNext}
610: 		\FOR{ each item $L_j$ in $O$ such that $L_j.ts < O[i].ts$} \label{alg:W:ClearOjLoop}
611: 			\STATE {\sc SAC}$(L_j.ptr.next, \perp)$ \COMMENT{Reset the $next$ pointer of the obsolete locator} \label{alg:W:ClearOj}
612: % so that $p_i$'s old locator can be reclaimed by garbage collection.} 
613: 		\ENDFOR
614: 		\RETURN $newLoc.new$;
615: 	\ENDIF
616: 	\ENDWHILE
617: 
618:   	\end{algorithmic}
619: \end{algorithm}
620: 
621: 
622: In order to find the head of $O$'s locator list as in {\sc OpenW}, a transaction invokes the {\sc FindHead}$(O)$ procedure (cf. Algorithm \ref{alg:FindHead}). The procedure atomically reads $O$ into a local array $start$ (line \ref{alg:F:scanO}F). Such a multi-word read operation is supported by emerging multicore architectures like CUDA \cite{CUDA} and Cell BE \cite{GscHFHWY06}. In the contemporary chips of these architectures, a read operation can atomically read 128 bytes. In general, such a multi-word read operation can be implemented as an atomic snapshot using only single-word read and single-word write primitives \cite{AfeADGMS93}. {\sc FindHead} finds the item $start_{latest}$ with the highest timestamp in $start$ and searches for the head from locator $start_{latest}.ptr$ by following the $next$ pointers until it finds a locator $H$ whose $next$ pointer is $\perp$ (lines \ref{alg:F:latest}F-\ref{alg:F:tmpNext}F). Since some locators may become obsolete and their $next$ pointers were reset to $\perp$ by concurrent transactions (lines \ref{alg:W:oldLocNext}W and \ref{alg:W:ClearOj}W in Algorithm \ref{alg:OpenWObject}), {\sc FindHead} needs to check $H$'s commit timestamp against the highest timestamp of $O$ at a moment after $H$ is found (lines \ref{alg:F:scanO2}F-\ref{alg:F:until}F). If $H$'s commit timestamp is greater than or equal to the highest timestamp of $O$, $H$ is the head of $O$'s locator list (cf. Lemma \ref{lem:FindHead}). Otherwise, $H$ is an obsolete locator and {\sc FindHead} must retry (line \ref{alg:F:until}F). The {\sc FindHead} procedure is lock-free, namely it will certainly return the head of $O$'s  locator list after at most $N$ iterations unless a concurrent thread has completed a transaction and subsequently has started a new one, where $N$ is the number of concurrent (updating) threads (cf. Lemma \ref{lem:FindHead_LF}).
623: Note that as soon as a thread obtains $head$ from {\sc FindHead} (line \ref{alg:W:findHead}W of {\sc OpenW}, Algorithm \ref{alg:OpenWObject}), the locator referenced by $head$ will not be reclaimed by the garbage collector until the thread returns from the {\sc OpenW} procedure.
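
The two outcomes of the check at line \ref{alg:F:until}F can be illustrated with a hypothetical object $O$ shared by two threads (locator names and timestamps are assumptions made for this example). Let $O = [(A,5),(B,8)]$ with $B.cts = 8$ and $B.next = (\perp,0)$, i.e. $B$ is the head.
\begin{enumerate}
\item {\em No interference.} {\sc FindHead} reads $start = [(A,5),(B,8)]$, selects $start_{latest} = (B,8)$ and finds $B.next = \perp$. It reads $O$ again, obtains the highest timestamp $8$ and, since $B.cts = 8 \geq 8$, returns $B$.
\item {\em Concurrent append.} After {\sc FindHead} has read $start$ but before it reads $B.next$, thread $t_1$ completes {\sc OpenW} with a new locator $D$, $D.cts = 12$: it appends $D$ to $B$, sets $O[1]$ to $(D,12)$ and resets $B.next$ to $\perp$. {\sc FindHead} then finds $B.next = \perp$, but its second read of $O$ yields the highest timestamp $12$. Since $B.cts = 8 < 12$, $B$ is discarded and the loop repeats. The next iteration starts from $D$, finds $D.next = \perp$ and, if no further locator has been appended, returns $D$ because $D.cts = 12 \geq 12$.
\end{enumerate}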
624: 
625: 
626: \begin{algorithm}[t]
627:   \caption{{\sc FindHead}($O$: TMObj): Find the head of the locator list} \label{alg:FindHead}
628:   \algsetup{linenodelimiter=F:}
629:   \begin{algorithmic}[1]
630: 	\ENSURE reference to the head of the locator list
631: 	\REPEAT
632: 		\STATE $start \leftarrow O$; \COMMENT{Read $O$ to a local array atomically.} \label{alg:F:scanO}
633: %Such a multi-word read is supported by recent multicore chips like CUDA \cite{CUDA} and Cell BE \cite{CellBEArch}. See \cite{AfeADGMS93} for a general solution.
		\STATE Let $start_{latest}$ be the item with highest timestamp; \label{alg:F:latest}
635: 		\STATE $tmp \leftarrow start_{latest}.ptr$; \COMMENT{Find a locator whose $next$ pointer is $\perp$}
636: 		\WHILE{ $tmp.next \neq \perp$} \label{alg:F:while}
637: 			\STATE $tmp \leftarrow tmp.next$; \label{alg:F:tmpNext}
638: 		\ENDWHILE
639: 		\STATE $start' \leftarrow O$; \COMMENT{Check if $tmp$ is the head.} \label{alg:F:scanO2}
		\STATE Let $start'_{latest}$ be the item with highest timestamp;
641: %	\UNTIL{ $start_{latest} = start'_{latest}$}; \label{alg:F:until} % Can improve performance by checking tmp.cts instead of start_{latest}.cts
642: 	\UNTIL{ $tmp.cts \geq start'_{latest}.ts$}; \label{alg:F:until}
643: 	\RETURN $tmp$;
644:   	\end{algorithmic}
645: \end{algorithm}
646: 
647: 
648: When committing, read-only transactions in NBFEB-STM do nothing and always succeed in their commit phase as in LSA-STM \cite{RieFF06}. They can abort only when trying to open an object for read (cf. Algorithm \ref{alg:OpenRObject}). Other transactions $T$, which have opened at least one object for write, invoke the {\sc CommitW} procedure (Algorithm \ref{alg:Commit}). The procedure calls the {\sc LSA\_Commit} procedure to ensure that $T$ still maintains a consistent view of objects being accessed by $T$ (line \ref{alg:C:LSACommit}C). $T$'s commit timestamp is updated with the timestamp returned from {\sc LSA\_Commit} (line \ref{alg:C:TCts}C). Finally, {\sc CommitW} tries to change $T.status$ to $Committed$ (line \ref{alg:C:TFAS}C). $T.status$ will be changed to $Committed$ at this step if it has not been changed to $Aborted$ due to the semantics of $TFAS$.
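
The role of the flag of $T.status$ in the commit step can be traced as follows. The trace assumes the $TFAS$ semantics used throughout, namely that $TFAS$ stores its value and sets the flag only if the flag is clear. After {\sc StartSTM}, $T.status = (Active,0)$.
\begin{enumerate}
\item If no other transaction has aborted $T$, the $TFAS(T.status, Committed)$ at line \ref{alg:C:TFAS}C finds the flag clear and changes $T.status$ to $(Committed,1)$. Any later $TFAS(T.status, Aborted)$ by a conflicting transaction finds the flag set and changes nothing, so $T$ stays committed.
\item If a conflicting transaction has already executed $TFAS(T.status, Aborted)$, then $T.status = (Aborted,1)$. The $TFAS$ at line \ref{alg:C:TFAS}C finds the flag set and leaves $T.status$ unchanged, so $T$ stays aborted.
\end{enumerate}
In both cases exactly one $TFAS$ on $T.status$ takes effect between two consecutive calls of {\sc StartSTM} for $T$.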
649: 
650: \begin{algorithm}[t]
651:   \caption{{\sc CommitW}($T$: Transaction): Try to commit an update transaction $T$ by thread $p_i$} \label{alg:Commit}
652:   \algsetup{linenodelimiter=C:}
653:   \begin{algorithmic}[1]
654: 	\STATE $CT_{T} \leftarrow$ {\sc LSA\_Commit}$(T)$; \COMMENT{Check consistent snapshot. $CT_{T}$ is $T$'s unique commit timestamp from LSA.} \label{alg:C:LSACommit}
655: 	\STATE $T.cts \leftarrow CT_{T}$; \COMMENT{Commit timestamp of $T$ if $T$ manages to commit.} \label{alg:C:TCts}
656: 	\STATE {\sc TFAS}$(T.status, Committed)$; \label{alg:C:TFAS}
657: %	\IF{ {\sc TFAS}$(T_i.status, Committed) = Active$}
658: %		\FORALL{ object $O_j$ in $T_i$'s write-set}
659: %		\STATE $O_j[i].ts \leftarrow CT_{T_i}$; \COMMENT{Help reduce the searching time for the latest version.}
660: %		\ENDFOR
661: %	\ELSE 
662: %	\STATE b \COMMENT{$T$ has been aborted}
663: %	\ENDIF
664:   	\end{algorithmic}
665: \end{algorithm}
666: 
667: %contains a pointer {\tt start} to a {\tt Locator} of the list. If there is only one transaction modifying the object at a time, the pointer points to the head of the list, the newest {\tt Locator}. Otherwise, it may point to one of {\tt Locator}s inserted by concurrent transactions. 
668: 
669: %Need this? No! For the pseudocode simplicity, we assume that $flag_x$ is the least bit of variable $x$, where $true$ and $false$ are represented by 1 and 0, respectively}. 
670: 
671: %At this moment we assume that the garbage collector does not reclaim the objects that are being (directly or indirectly) referenced by a thread (e.g. Java garbage collector). 
672: %An object $I$ is indirectly referenced by a thread $t$ if $p$ access $I$ via another object (e.g. items of a linked list are indirectly referenced by the threads that are keeping only a reference to the head of the list). 
673: %An object that is {\em shared} by $N$ threads is {\em referenced} by the $N$ threads.
674: 
675: %(e.g. {\sc FindHead} procedure). 
676: 
677: %such a garbage collector can be implemented without the need of synchronization primitives
678: 
679: %\todo{Note that $L_j.prev$ pointers may prevent the garbage collector from collecting unused locator $\Rightarrow$ When updating the $O.end$ pointer to locator $L_e$, the thread must set $TFAS(L_e.prev,\perp)$.}
680: 
681: %\todo{Check: with the garbage collector, we may not need $Clear$.}
682: 
683: 
684: 
685: 
686: 
687: 
688: %The $TMObj$ does not keep track of transactions that only read the object. \cite{RieFF06}.
689:  
690: %%%% The following text used for describe challenges %%%%%%
691: %\todo{Challenges: a sleeping transaction Tx that is keeping a reference to an obsolete locator $L_o$ may prevent garbage collector from reclaiming all unused locators due to its pointers $L_o.next$. Unlike LSA-STM where locators are isolated (no link/pointer between 2 locators), the new STM has locators linked together, causing this garbage collection problem. If a transaction $T_s$ falls to sleep after inserting its own $newLocator$ into the list, all locators inserted later cannot be reclaimed}
692: 
693: %Solution: what $T_s$ will do after successfully inserting its own locator $L_s$ is to write to $L_s.new$  and finally call $TFAS(T_s.status, Committed)$, which will certainly fail since other transaction $T_i$ must successfully abort $T_s$ before inserting its locator $L_i$. Therefore, if we can reset $L_s.next$ and $L_s.prev$ to $\perp$, garbage collector can collect other unused locators to which no transaction has a reference $\Rightarrow$ before shifting $O.end$ to the new position/locator, reset $next$ and $prev$ of all locators between old and new positions of $O_i.end$. \todo{Check race-condition + FindHead() starting from a not-yet-reset (but obsolete) locator.}   
694: 
695: %Tip: Compare the new STM with $N$-version LSA-STM $\Rightarrow$ the number $N$ of versions/locators (and thus remote memory references (RMR)) of the new STM due to the pool of locators to be scanned is not worse than LSA-STM. The experimental study in the LSA-STM paper \cite{RieFF06} show that $8$-version LSA-STM is the best on their $8$-core system $\Rightarrow$ It seems that the number of versions should equal the number of cores. 
696: 
697: %Tips: However, we can read $N$ versions from global memory to local memory in one coalesced memory access before scanning for the head $\Rightarrow$ scanning the pool only takes one RMR.
698: 
699: %\todo{Challenges: We must use the system garbage collector, a highest priority thread that never fails, to prevent a sleeping transaction Tx from accessing reused data when waked up. Otherwise, Tx may fall to sleep after getting a pointer to a locator but before setting its mark-bit on that locator.}
700: 
701: %Tip: Each processor/transaction $p$ needs at least $N$ locators in the pool since $N-1$ other processors each may keep a reference to each of $p$'s $N-1$ locators $\Rightarrow$ the pool must have as least $N^2$ locators.
702: 
703: 
704: \subsection{Analysis} \label{sec:Correctness}
705: 
706: In this section, we prove that NBFEB-STM fulfills the three essential aspects of 
707: transactional memory semantics \cite{GueK08}:
708: \begin{description}
709: \item[Instantaneous commit]: Committed transactions must appear as if they executed instantaneously at some unique point in time, and aborted transactions, as if they did not execute at all.
710: \item[Preserving real-time order]: If a transaction $T_i$ commits before a transaction $T_j$ starts, then $T_i$ must appear as if it executed before $T_j$. 
711: Particularly, if a transaction $T_1$ modifies an object $O$ and commits, and then another transaction $T_2$ starts and reads $O$, then $T_2$ must read the value written by $T_1$ and not an older value.
\item[Precluding inconsistent views]: The state (of shared objects) accessed by {\em live} transactions must be consistent.
713: \end{description}
714: 
715: First, we prove some key properties of NBFEB-STM.
716: 
717: \begin{lemma} \label{lem:nextPointer}
718: A locator $L_i$ with timestamp $cts_i$ does not have any links/references to another locator $L_j$ with a lower timestamp $cts_j < cts_i$. %(\todo{in the case of no Prev pointer}) 
719: \end{lemma}
720: \begin{proof}
The $next$ pointer is the only link between locators. 
The $next$ pointer of locator $L_i$ points to a locator $L_j$ only if $L_j.cts$ is not less than $L_i.cts$ (lines \ref{alg:W:TxCts}W and \ref{alg:W:HCts}W, Algorithm \ref{alg:OpenWObject}). Note that for each locator $L_i$, the commit timestamp $L_i.tx.cts$ of its corresponding transaction $L_i.tx$ (if $L_i.tx$ committed) is the commit timestamp of $L_i$'s new data and thus it is always greater than the commit timestamp $L_i.cts$ of $L_i$'s old data. 
723: \end{proof}
724: 
725: \begin{lemma}\label{lem:FindHead}
726: The locator returned by {\sc FindHead}$(O)$ (Algorithm \ref{alg:FindHead}) is the head $H$ of $O$'s locator list at the time-point {\sc FindHead} found $H.next = \perp$ (line \ref{alg:F:while}F).
727: \end{lemma}
728: \begin{proof}
729: Let $L$ be the locator returned by {\sc FindHead}. 
730: Since the $next$ pointer of a new locator is initialized to $\perp$ (line \ref{alg:W:initNext}W, Algorithm \ref{alg:OpenWObject}) before the locator is appended into the list by $TFAS$  (line \ref{alg:W:TFASHead}W), {\sc FindHead} will find a locator $L$ whose $next$ pointer is $\perp$ at a time-point $tp$ (line \ref{alg:F:while}F). The $L$ locator is either the head at that time or a reset locator (due to lines \ref{alg:W:oldLocNext}W and \ref{alg:W:ClearOj}W, Algorithm \ref{alg:OpenWObject}).
731: 
732:  
733: If $L$ is a reset locator, $start'_{latest}.cts > L.cts$ holds (line \ref{alg:F:until}F) since a locator is reset (e.g. $oldLoc$ at line  \ref{alg:W:oldLocNext}W or $L_j$ at line \ref{alg:W:ClearOj}W) only after a locator with a higher timestamp (e.g. $newLoc$) has been written into the $O$ array (line \ref{alg:W:UpdateOi}W). Since {\sc FindHead} atomically reads the $O$ array after it found $L.next = \perp$, it will observe the higher timestamp.
734: This makes {\sc FindHead} retry and discard $L$, a contradiction to the hypothesis that $L$ is returned by {\sc FindHead}. Therefore, the $L$ locator returned by {\sc FindHead} must be the head at the time-point {\sc FindHead} found $L.next = \perp$ (line \ref{alg:F:while}F).
735: \end{proof}
736: 
737: Since a thread must get a result from {\sc FindHead} (line \ref{alg:W:findHead}W) before it can consult the contention manager (line \ref{alg:W:CM}W), {\sc FindHead} must be lock-free (instead of being obstruction-free) in order to guarantee the obstruction-freedom for transactions.
738: 
739: 
740: \begin{lemma}\label{lem:FindHead_LF}
741: {\em (Lock-freedom)} {\sc FindHead}$(O)$ will certainly return the head of $O$'s  locator list after at most $N$ repeat-until iterations unless a concurrent thread has completed a transaction and subsequently has started a new one, where $N$ is the number of concurrent threads updating $O$. 
742: \end{lemma}
743: \begin{proof}
744: From Lemma \ref{lem:FindHead}, any locator returned by {\sc FindHead}$(O)$ is the head of $O$'s locator list. Therefore, we only need to prove that {\sc FindHead}$(O)$ will certainly return a locator after at most $N$ iterations unless a concurrent thread has completed a transaction and subsequently has started a new one.
745: 
746: We prove this by contradiction. Assume that {\sc FindHead}$(O)$ executed by thread $t_i$, does not return after $N$ iterations and no thread has completed its transaction since {\sc FindHead} started.
747: Since each thread $t_j$ updates its own item $O[j]$ only once when opening $O$ for update (line \ref{alg:W:UpdateOi}W, Algorithm \ref{alg:OpenWObject}), at most $(N-1)$ items $j$ of $O, j \neq i,$ have been updated since {\sc FindHead}$(O)$ started.
748: 
749: First we prove that {\sc FindHead}$(O)$ will return in the iteration during which no item of $O$ is updated between the first atomic read (line \ref{alg:F:scanO}F) and the second atomic read of the $O$ array (line \ref{alg:F:scanO2}F). 
750: 
751: Indeed, since each transaction successfully appends its own locator to the head of $O$'s locator list only once when opening $O$ for update (line \ref{alg:W:TFASHead}W), at most $(N-1)$ locators are appended to $O$'s locator list after the first scan. Therefore, {\sc FindHead} will certainly find a locator $L$ such that $L.next = \perp$ (line \ref{alg:F:while}F) in the current repeat-until iteration. Note that for each $next$ pointer, only the first transaction executing $TFAS$ on the pointer manages to append its locator to the pointer.
752: 
753: Since (1) the $next$ pointer of a locator $L_i$ points to a locator $L_j$ only if $L_j.cts \geq L_i.cts$ (cf. Lemma \ref{lem:nextPointer}) and (2) {\sc FindHead} found $L$ by following the $next$ pointers starting from $start_{latest}.ptr$ (lines \ref{alg:F:latest}F-\ref{alg:F:tmpNext}F), we have $L.cts \geq start_{latest}.ptr.cts$. Note that $start_{latest}.ptr.cts = start_{latest}.ts$ (line \ref{alg:W:UpdateOi}W). 
754: Since no item of $O$ is updated between the first scan (line \ref{alg:F:scanO}F) and the second scan of the $O$ array (line \ref{alg:F:scanO2}F), the items with highest timestamp of both scans are the same, i.e. $start_{latest} = start'_{latest}$. Therefore, $L.cts \geq start'_{latest}.ts$ holds (line \ref{alg:F:until}F) and $L$ is returned. 
755: 
756: Since {\sc FindHead} executed by thread $t_i$ does not return after $N$ iterations by hypothesis, it follows that at least $N$ items have been updated since {\sc FindHead} started, contradicting the above argument that at most $(N-1)$ items have been updated since {\sc FindHead} started.
757: \end{proof}
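
The counting step of the proof above can be written out explicitly. The following is only an illustrative restatement under the hypothesis of Lemma \ref{lem:FindHead_LF}; the symbol $u_k$ is notation introduced here and does not appear in the algorithms. Let $u_k$ denote the number of items of $O$ that are updated between the first and the second atomic read of $O$ in the $k$-th repeat-until iteration. An iteration can fail to return only if $u_k \geq 1$, and as long as no thread completes a transaction and starts a new one, the total number of updates is bounded by $N-1$. Hence
\[
\#\{\, k \;:\; \text{iteration } k \text{ does not return} \,\} \;\leq\; \sum_{k} u_k \;\leq\; N-1 \;<\; N ,
\]
so at least one of the first $N$ iterations has $u_k = 0$, and {\sc FindHead} returns in that iteration.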
758: 
759: \remove {
760: Question: 
761: When to update O[i] (by $p_i$)? Successfully insert a new locator or Commit/Abort time? 
762: 
763: - If update occurs at Abort time, O[i] must be writable by other threads since other threads allow to abort $p_i$. 
764: 
765: - If update occurs only at Commit time (but not Abort time), the number of locators of the list would be at least $O(N^2)$ since one committed Tx can abort all other Tx, forcing them to retry. The number of aborted locators in the list is (N-1) + (N-2) + (N-3) + ... + 1 = O(N^2) 
766: 
767: - Therefore, update occurs at a successful insertion of new locator is the best since it keeps the number of locator in the list is at most $2N$: $p_i$'s old locator is removed from the list as soon as the new locator is added.
768: } %%% End of remove
769: 
770: 
771: 
772: 
773: % NOTE: For each object $O$, the state of the transaction of its head-locator can be $Aborted$) due to LSA_Commit(), i.e. no valid range for current timestamp. 
774: 
775: \begin{lemma}
776: {\em (Instantaneous commit)} NBFEB-STM guarantees that committed transactions appear as if they executed instantaneously and aborted transactions appear as if they did not execute at all.  
777: \end{lemma}
778: \begin{proof}
779: Similar to the DSTM \cite{HerLMS03} and LSA-STM \cite{RieFF06}, the NBFEB-STM uses the indirection technique that allows a transaction $T_j$ to  commit its modifications to all objects in its write-set instantaneously by switching its status from $Active$ to $Committed$. Its committed status must no longer be changed. NBFEB-STM uses the $TFAS$ primitive (Algorithm \ref{alg:TFAS}) to achieve the property (line \ref{alg:C:TFAS}C, Algorithm \ref{alg:Commit}). 
780: Since the flag of the $T_j.status$ variable is $false$ (or 0) when the transaction starts (line \ref{alg:S:initStatus}S, Algorithm \ref{alg:StartSTM}), only the first $TFAS$ primitive can change the variable.
781: If $T_j$ manages to change the $T_j.status$ variable to $Committed$, the variable can no longer be changed using $TFAS$ until the transaction object $T_j$ is reclaimed by the garbage collector. Note that even if thread $t_j$ has completed transaction $T_j$ and has started another transaction $T'_j$, the transaction object $T_j$ will not be reclaimed until all the locators keeping a reference to $T_j$ are reclaimable.
782: 
783: Since active transactions $T_j$ make all changes on their own copy $T_j.new$ of a shared object $O$ before their status is changed from $Active$ to either $Aborted$ or $Committed$, aborted transactions do not affect the value of $O$.
784: \end{proof}
785: 
786: The two other correctness criteria for transactional memory are precluding inconsistent views and preserving real-time order \cite{GueK08}. 
787: Since NBFEB-STM uses the lazy snapshot algorithm $LSA$ \cite{RieFF06}, the former will follow if we can prove that the LSA algorithm is integrated correctly into NBFEB-STM. 
788: 
789: 
790: \begin{lemma}\label{lem:LSACorrectness}
791: The versions kept in the $N$ locators $O[j].ptr, 1 \leq j \leq N$, for each object $O$ are enough for checking the validity of a transaction $T$ using the LSA algorithm \cite{RieFF06}, from the correctness point of view.
792: \end{lemma}
793: 
794: \begin{proof}
795: The LSA algorithm requires only the commit timestamp (i.e. $\lfloor O^{CT} \rfloor$ \footnote{Term $\lfloor O^{t} \rfloor$ denotes the time of most recent update of object $O$ performed no later than time $t$ \cite{RieFF06}.}) of the most recent version (i.e. $O^{CT}$ \footnote{Term $O^{t}$ denotes the content/version of object $O$ at time $t$ \cite{RieFF06}.}) of each object $O$ at a timestamp $CT$ when it checks the validity of a transaction $T$.
796: The older versions of $O$ are not required for correctness - they only increase the chance that a suitable object version is available.
797: 
798: We will prove that by atomically reading the $O$ object/array at the timestamp $CT$ to a local variable $V$  as at line \ref{alg:F:scanO}F in Algorithm \ref{alg:FindHead}, LSA will find the commit timestamp $\lfloor O^{CT} \rfloor$. 
799: %The latest version, together with its timestamp, is recorded in one of locators $start[k].ptr$ where $start[k].ts$ is highest among $start[j].ts, 1 \leq j \leq N$.
800: 
801: 
802: %Let $k$ be the items whose timestamp  $O[k].ts$ is highest among $O[j].ts, 1 \leq j \leq N$ at the timestamp $CT$. We will prove, by contradiction, that one of locators $O[k].ptr$ will contain the latest version of $O$.
803: 
804: A new version of $O$ is created and becomes accessible by all transactions when a transaction $T_j$ commits its modification $L_j.new$ (stored in locator $L_j$) to $O$ by changing its status from $Active$ to $Committed$ (line \ref{alg:C:TFAS}C, Algorithm \ref{alg:Commit}). Since every transaction $T_j$ writes its locator $L_j$ to $O[j].ptr$ when opening $O$ for update (line \ref{alg:W:UpdateOi}W, Algorithm \ref{alg:OpenWObject}) (i.e. before committing), at least one of the locators $O[j].ptr, 1 \leq j \leq N$, must contain the most recent version of $O$ at the timestamp $CT$ when $O$ is read to $V$.
805: 
806: Since a transaction $T_j$ updates $O[j]$ with its new locator $L_j$ only after 
807:  successfully appending $L_j$ to the head of $O$'s locator list, at most one of the locators $O[j].ptr, 1 \leq j \leq N,$ is the head of the list at the timestamp $CT$ when the snapshot $V$ of $O$ is taken. Other locators $V[j].ptr$ that are not the head have their transactions committed/aborted before $CT$. Note that as soon as the transaction of a locator has committed/aborted, the locator's versions together with their commit timestamps are no longer changed.
808: If transaction $V[j].ptr.tx$ committed, the version kept in locator $V[j].ptr$ is $V[j].ptr.new$ with commit timestamp $V[j].ptr.tx.cts$, the commit timestamp of the transaction. If transaction $V[j].ptr.tx$ has been aborted or is active, the version is $V[j].ptr.old$ with commit timestamp $V[j].ptr.cts$. The only 
809: possible version with commit timestamp higher than $CT$ is $V[h].ptr.new$ where $V[h].ptr$ was the head at the timestamp $CT$ when $V$ was taken and then transaction $V[h].ptr.tx$ committed. In this case, $V[h].ptr.old$ is the most recent version at $CT$ and its commit timestamp is $V[h].ptr.cts$.
810: 
811: Therefore, by checking the commit timestamps of the versions kept in each locator $V[j].ptr, 1 \leq j \leq N,$ against $CT$, LSA will find the commit timestamp $\lfloor O^{CT} \rfloor$ of the most recent update of object $O$ performed no later than $CT$.
812: 
813: 
814: 
815: %That means the version of $O$, together with its commit timestamp, in a non-head locator $V[j].ptr$ is fixed, i.e. no longer changed from $V[j].ptr.old$ to $V[j].ptr.new$. If transaction $V[i].ptr.Tx$ committed, the version kept in locator $V[j].ptr$ is $V[j].ptr.new$ with commit timestamp $V[j].ptr.Tx.cts$, the commit timestamp of the transaction. If transaction $V[j].ptr.Tx$ aborted, the version is $V[j].ptr.old$ with commit timestamp $V[j].ptr.cts$.  If $V[j].ptr$ is the head and transaction $V[j].ptr.Tx$ is active, the version is $V[j].ptr.old$ with commit timestamp $V[j].ptr.cts$.
816: 
817: %Note that if $p_j$ later updates $O[j].ptr$ with another locator $L'_j$, the version of $O$ stored in $L'_j$ is never older than that stored in $L_j$ (lines \ref{alg:W:HNew}W and \ref{alg:W:HOld}W). This implies that a read-transaction do not need to invoke {\sc FindHead}$(O)$ to find the latest version of $O$.
818: 
819: %Assume that the latest version of $O$ is $O[l].ptr$ and $O[l].ts < O[k].ts$. It follows that $O[l].ptr.old$ is an earlier version than $O[k].ptr.old$ (lines \ref{alg:W:TxCts}W and \ref{alg:W:HCts}W). Therefore, $O[l].ptr$ cannot be the latest version, a contradiction.
820: 
821: %It follows that the new STM can find the latest version of each object in the transaction's read/write sets. 
822: 
823: %Since checking the validity of an transaction using the LSA algorithm needs only the commit timestamp (i.e. $\lfloor O^{CT}_i \rfloor$ in \cite{RieFF06}) of the latest version (i.e. at the current timestamp $CT$) of each object $O$ in the transaction's read/write sets \cite{RieFF06}, the lemma follows. 
824: \end{proof}
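
The case analysis in the proof above can be condensed into a single expression. This is only an illustrative restatement, not part of the algorithm description; the symbol $\mathcal{C}_V$ is notation introduced here, and slots of $V$ that hold no locator are assumed to be skipped. For a snapshot $V$ of $O$, collect the commit timestamps of all versions stored in the locators of $V$:
\[
\mathcal{C}_V \;=\; \{\, V[j].ptr.cts \;:\; 1 \leq j \leq N \,\} \;\cup\; \{\, V[j].ptr.tx.cts \;:\; 1 \leq j \leq N,\; V[j].ptr.tx.status = Committed \,\}.
\]
Following the argument of Lemma \ref{lem:LSACorrectness}, the timestamp sought by LSA is then
\[
\lfloor O^{CT} \rfloor \;=\; \max \{\, c \in \mathcal{C}_V \;:\; c \leq CT \,\}.
\]
In particular, if the transaction of the head locator $V[h].ptr$ commits only after $CT$, its timestamp $V[h].ptr.tx.cts$ exceeds $CT$ and is filtered out, so that $V[h].ptr.cts$ is selected, in agreement with the last case of the proof.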
825: 
826: \begin{lemma} \label{lem:numOfVersions}
827: The number of versions available for each object in NBFEB-STM is up to $(N+1)$, where $N$ is the number of threads.
828: \end{lemma}
829: \begin{proof}
830: For each object $O$, each thread $t_j$ keeps the version of $O$ that it has accessed most recently in locator $O[j].ptr$ (or $L_j$ for short). If $t_j$'s latest transaction $T_j$ committed, then $L_j.old$ is an old version of $O$ with validity range $[L_j.cts, L_j.tx.cts)$ 
831: \footnote{The {\em validity range} of a version $v_i$ of an object $O$ is the interval from the commit time of $v_i$ to the commit time of the next version $v_{i+1}$ of $O$ \cite{RieFF06}.}. 
832: Therefore, if every thread has its latest transaction committed, each object $O$ updated by $N$ threads will have $N$ old versions with validity ranges, in addition to its latest version.  
833: \end{proof}
834: 
835: \begin{lemma}
836: {\em (Consistent view)} NBFEB-STM precludes inconsistent views of shared objects from {\em live} transactions.
837: \end{lemma}
838: \begin{proof}
839: Since the LSA lazy snapshot algorithm is correctly integrated into NBFEB-STM (Lemma \ref{lem:LSACorrectness}), the lemma follows.   
840: \end{proof}
841: 
842: \begin{definition}
843: The {\em value} of a locator $L$ is either $L.new$ if $L.tx.status=Committed$, or $L.old$ otherwise.
844: \end{definition}
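
For reference, the definition above and the commit timestamp associated with a locator's value (as used in the proof of Lemma \ref{lem:LSACorrectness}) can be written side by side. The function names $value(\cdot)$ and $\tau(\cdot)$ are notation introduced here for illustration only:
\[
value(L) \;=\;
\begin{cases}
L.new & \text{if } L.tx.status = Committed,\\
L.old & \text{otherwise,}
\end{cases}
\qquad
\tau(L) \;=\;
\begin{cases}
L.tx.cts & \text{if } L.tx.status = Committed,\\
L.cts & \text{otherwise.}
\end{cases}
\]
Here the ``otherwise'' branch covers both $Active$ and $Aborted$ transactions, for which the old version $L.old$ with timestamp $L.cts$ remains the value of $L$.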
845: 
846: \begin{lemma}\label{lem:LocValue}
847: In each $O$'s locator list, the old value $L'.old$ of a locator $L'$ is not older than the value of its previous locator \footnote{A locator $L$ is a {\em previous} locator of a locator $L'$ if starting from $L$ we can reach $L'$ by following $next$ pointers.} $L$. 
848: % Note that if the is no such a directed path between L and L' (e.g. both L and L' are reset locators), we cannot ensure that $L.old$ is not older than value of $L'$
849: \end{lemma}
850: \begin{proof}
851: Let $L''$ be the locator pointed by $L.next$. Since $L.tx.status$ must be either $Committed$ or $Aborted$ (but not $Active$) before $L''$ is appended to $L.next$ (lines \ref{alg:W:ifCommit}W-\ref{alg:W:endIf}W, Algorithm \ref{alg:OpenWObject}), $L''.old$ is $L$'s value, which is either $L.new$ if $L.tx.status=Committed$ (line \ref{alg:W:HNew}W) or $L.old$ if $L.tx.status=Aborted$ (line \ref{alg:W:HOld}W). That means $L''.old$ is not older than $L$'s value. Arguing inductively for all locators on the directed path from $L$ to $L'$, the lemma follows.
852: \end{proof}
853: 
854: 
855: \begin{lemma}
856: {\em (Real-time order preservation)} NBFEB-STM preserves the real-time order of transactions.
857: \end{lemma}
858: \begin{proof}
859: %\todo{Re-write: OpenR doesn't call FindHead. Use Lemma \ref{lem:LSACorrectness}.}
860: 
861: We need to prove that if a transaction $T_1$ modifies an object $O$ and commits and then another transaction $T_2$ starts and reads $O$, $T_2$ must read the value written by $T_1$ and not an older value \cite{GueK08}. Namely, $T_1$ is the most recent transaction committing its modification to $O$ before $T_2$ reads $O$.
862: 
863: %If have time, add an Algorithm for reading the most recent value of O used in LSA
864: 
865: First we prove that $T_2$ reads the value $v_1$ written by $T_1$ if $T_2$ opens $O$ for read (cf. {\sc OpenR}, Algorithm \ref{alg:OpenRObject}).
866: In the proof of Lemma \ref{lem:LSACorrectness}, we have proven that the value of $O$ read at a timestamp $CT$ by LSA is the most recent value of $O$ at that timestamp. Since $T_1$ is the most recent transaction committing its modification to $O$ before $T_2$ reads $O$, $v_1$ is in the set of available versions of $O$ read by {\sc LSA\_Open} (line \ref{alg:R:LSAOpen}R).  Since $T_1$ commits before $T_2$ starts and reads $O$, the commit timestamp of $v_1$ is less than the upper bound of any validity range $R_{T_2}$\footnote{The {\em validity range} $R_T$ of a transaction $T$ is the time range during which each of the objects accessed by $T$ is valid \cite{RieFF06}.}
867:  chosen by the {\sc LSA\_Open} (i.e. $\lfloor O^{CT} \rfloor \leq T_{max}$ in the terminology used by LSA \cite{RieFF06}). Therefore, the {\sc LSA\_Open} in {\sc OpenR} will return $v_1$, which is subsequently returned by {\sc OpenR} (line \ref{alg:R:ReturnVersion}R).
868: 
869: We now prove that $T_2$ reads the value $v_1$ written by $T_1$ if $T_2$ opens $O$ for update (cf. {\sc OpenW}, Algorithm \ref{alg:OpenWObject}). Particularly, we prove that the $old$ value of $T_2$'s new locator (lines \ref{alg:W:HNew}W and \ref{alg:W:HOld}W) is $v_1$.
870: 
871: Let $p_1$ and $p_2$ be the threads executing $T_1$ and $T_2$, respectively, $L_1$ be the locator containing $T_1$'s modification (in $L_1.new$) that is committed to $O$ and $v_2$ be the value of $O$ read by $T_2$.   
872: The $v_2$ value is the value of the head $H$ of $O$'s locator list returned from {\sc FindHead} executed by $T_2$, which is either $H.new$ if $H.tx.status=Committed$ or $H.old$ otherwise (line \ref{alg:W:HNew}W or \ref{alg:W:HOld}W). 
873: 
874: Since $T_1$ committed before $T_2$ started, $H$ is the head of $O$'s locator list that includes $L_1$ (cf. Lemma \ref{lem:FindHead}). Note that since $T_1$ is the latest transaction committing its modification to $O$,  all locators $L'$ that have ever been reachable from $L_1$ via $next$ pointers, have the most recent timestamp/value (cf. Lemma \ref{lem:LocValue}) and thus will not be reset (lines \ref{alg:W:ClearOjLoop}W-\ref{alg:W:ClearOj}W, Algorithm \ref{alg:OpenWObject}). Since there is a directed path from $L_1$ to $H$ via $next$ pointers, it follows from Lemma \ref{lem:LocValue} that the value of $H$ is not older than that of $L_1$. 
875: 
876: On the other hand, since $T_1$ is the latest transaction committing its modification to $O$ before $T_2$ reads $O$, there is no value of $O$ that is newer than that of $L_1$. Therefore, the value of $H$ is the value of $L_1$. That means $T_2$ reads the $v_1$ value written by $T_1$.
877: 
878: Finally, we need to prove that {\sc LSA\_Open} at line \ref{alg:W:LSAOpen}W accepts $v_1$. Indeed, since $v_1$ is the most recent update of $O$ and $T_1$ commits before $T_2$ starts, the commit timestamp of $v_1$ is less than the upper bound of any validity range $R_{T_2}$ chosen by the {\sc LSA\_Open} (i.e. $\lfloor O^{CT} \rfloor \leq T_{max}$). Therefore, the {\sc LSA\_Open} at line \ref{alg:W:LSAOpen}W accepts $v_1$.
879: \end{proof}
880: 
881: 
882: \begin{lemma}\label{lem:numOfLocators}
883: For each object $O$, there are at most $4N$ locators that cannot be reclaimed by the garbage collector at any time-point, where $N$ is the number of update threads.
884: \end{lemma}
885: \begin{proof}
886: Let $L_i$ be a locator created by a thread $p_i$.
887: A locator $L_i$ cannot be reclaimed by the garbage collector if it is reachable by a thread. In NBFEB-STM, a locator $L_i$ is reachable if it is i) $p_i$'s {\em new} locator $newLoc$, ii) $p_i$'s {\em shared} locator, which is referenced directly by $O[i].ptr$, or iii) one of $p_i$'s {\em old} locators $oldLoc$ that is still reachable by other threads. $p_i$'s shared locator will become one of $p_i$'s old locators if $O[i].ptr$ is updated with $p_i$'s new locator (line \ref{alg:W:UpdateOi}W, Algorithm \ref{alg:OpenWObject}). At that moment, $p_i$'s new locator becomes $p_i$'s shared locator. If there is no thread keeping a direct/indirect reference to $p_i$'s old locators, these locators are ready to be reclaimed (i.e. unreachable) when $p_i$ returns from the {\sc OpenW} procedure.
888: 
889: Let $C^p_i$ and $C^o_i$ be the chains of locators (linked by their $next$ pointers) that cannot be reclaimed due to thread $p_i$ and $O[i]$, respectively. The $C^p_i$ chain starts at the locator that is referenced directly by $p_i$ (not directly by $O$) and ends at either the locator whose $next$ pointer is $\perp$ or the locator whose next locator is referenced directly by another thread or $O$. The $C^o_i$ chain starts at the locator that is referenced directly by $O[i]$ and ends at either the locator whose $next$ pointer is $\perp$ or the locator that is referenced directly by another thread or $O$.
890: Note that there are no two locators whose $next$ pointers point to the same locator $L_j$ since $p_j$ successfully appends $L_j$ into the head of the locator list only once (line \ref{alg:W:TFASHead}W, Algorithm \ref{alg:OpenWObject}). 
891: 
892: At any time, each thread $p_i$ has at most one $C^p_i$ and one $C^o_i$. The $C^p_i$ starts either with $p_i$'s new locator (before assignment $O[i] \leftarrow newLoc$ at line \ref{alg:W:UpdateOi}W, Algorithm \ref{alg:OpenWObject}) or with $p_i$'s old locator (after this assignment). Since $p_i$ has a unique item in the $O$ array, it has at most one $C^o_i$. Therefore, there are at most $2N$ chains.
893: 
894: We will prove that if $p_i$ has three locators participating in chains (of arbitrary threads), at least one of the three locators must be the end-locator of a chain. Indeed, during the execution of the {\sc OpenW} procedure (Algorithm \ref{alg:OpenWObject}), $p_i$ creates only one new locator (line \ref{alg:W:newLoc}W) in addition to its locator $O[i].ptr$, if any. If $p_i$ has three locators that are participating in chains, at least one of them is $p_i$'s old locator $L^o$ resulting from one of $p_i$'s previous executions $E$ of {\sc OpenW}. Since $p_i$ sets the $next$ pointer of its old locator $oldLoc$ to $\perp$ before returning from $E$ (line \ref{alg:W:oldLocNext}W), $L^o$'s $next$ pointer is $\perp$. That means $L^o$ is the end-locator of a chain. 
895: 
896: It then follows that each thread has at most two {\em non-end} locators participating in all the chains. The number of non-end locators in all the chains is therefore at most $2N$. Since there are at most $2N$ chains, there are at most $2N$ {\em end}-locators. Therefore, the total number of locators in all the chains is at most $4N$.
897: \end{proof}
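
The counting argument of the proof above can be summarized in three inequalities. This is merely a compact restatement for the reader's convenience; the $\#\{\cdot\}$ notation is introduced here:
\begin{align*}
\#\{\text{non-end locators}\} &\;\leq\; 2 \cdot N, \\
\#\{\text{end-locators}\} &\;\leq\; \#\{\text{chains}\} \;\leq\; 2N, \\
\#\{\text{locators of } O \text{ that cannot be reclaimed}\} &\;\leq\; 2N + 2N \;=\; 4N .
\end{align*}
For instance, with $N = 8$ update threads, at most $32$ locators per object are kept from reclamation at any time-point, regardless of how many transactions have been executed on the object.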
898: 
899: \begin{theorem} \label{the:spaceComplexity}
900: {\em (Space complexity)} The space complexity of an object updated by $N$ threads in NBFEB-STM is $\Theta(N)$, which is optimal.
901: \end{theorem}
902: \begin{proof}
903: Since each object $O$ in NBFEB-STM is an array of $N$ items (cf. Algorithm \ref{alg:StartSTM}), the space complexity of an object is $\Omega(N)$.
904: 
905: From Lemma \ref{lem:numOfLocators}, for each object $O$ there are at most $4N$ locators that cannot be reclaimed by the garbage collector at any point in time. Since each locator $L$ references at most one transaction object $L.tx$ (cf. Figure \ref{fig:NBFEB-STM}), the space complexity of an object is $O(N)$. 
906: 
907: Due to the {\em instantaneous commit} requirement of transactional memory semantics \cite{GueK08}, when opening an object for update, each thread/transaction in any STM system must create a copy of the original object. Therefore, the space complexity of an object updated by $N$ threads is $\Omega(N)$ for all STM systems. It follows that the space complexity $\Theta(N)$ of an object updated by $N$ threads in NBFEB-STM is optimal.
908: \end{proof}
909: 
910: 
911: \begin{definition}
912: Contention level $CL_{l,t}$ of a memory location $l$ at a timestamp $t$ is the number of requests that need to be executed sequentially on the location by a memory controller (i.e. the number of requests for $l$ buffered at time $t$).
913: \end{definition}
914: 
915: \begin{definition}
916: The contention level of a transaction $T$ that starts at timestamp $s_T$ and ends (i.e. commits or aborts) at timestamp $e_T$ is $\max_{s_T \leq t \leq e_T} CL_{l,t}$, taken over all memory locations $l$ accessed by $T$.
917: \end{definition}
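
Written out with an explicit set of accessed locations, one reading of the definition above is the following; the symbols $CL_T$ and $A_T$ (the set of memory locations accessed by $T$) are notation assumed here and are not used elsewhere in the paper:
\[
CL_T \;=\; \max_{l \in A_T} \;\; \max_{s_T \leq t \leq e_T} CL_{l,t} .
\]
That is, the contention level of a transaction is determined by the most heavily contended location it touches, at the most contended moment of its lifetime.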
918: 
919: \begin{lemma}
920: {\em (Contention reduction)} Transactions using NBFEB-STM have lower contention levels than those using $CAS$-based STMs do.
921: \end{lemma}
922: \begin{proof}
923: 
924: %(\it Sketch) In the RMR time complexity analysis, we focus on {\em update} operation (e.g. TFAS or CAS) to global shared variables like $next$ pointers (or $TMObj$ pointer in LSA-STM) and transactions' $status$. 
925:  
926: 
927: %Wrong: FindHead needs to read O again to check the validity of Head => FindHead need at least 2 RMRs.  
928: 
929: %For each thread/transaction, reading the $O_i[]$ array and locators (to find the head locator (line \ref{alg:W:findHead}) or to check validity (line \ref{alg:W:LSAOpen})) in NBFEB-STM without retry (or reading versions and locator in LSA-STM) can be considered as one RMR since emerging many-core architectures without coherent data cache allow each core to read a huge amount of data in one read-operation. 
930: %For instance, the contemporary 16-core CUDA chip (GeForce 8800 GTX) allow each core to read 128 bytes in one memory transaction from the global shared memory. 
931: 
932: %\todo{Can we specify a memory bank from which garbage collector allocates and reclaims variables?}
933: 
934: ({\it Sketch}) 
935: %Contention level of a transaction is basically determined by synchronization "hot spots", which is modified by synchronization primitives. 
936: Since $CAS$ is not {\em combinable} \cite{KruRS88,BleGV08}, $M$ conflicting $CAS$ primitives on the same synchronization variable, like $TMObj$ pointer or a transaction's $status$ variable in $CAS$-based STMs \cite{HerLMS03, MarSS05, RieFF06}, issue $M$ remote-memory requests to the corresponding memory controller. Since $TFAS$ is combinable, the remote-memory requests from $M$ conflicting $TFAS$ primitives to the same variable, like the $next$ pointer or a transaction's $status$ variable in NBFEB-STM, can be combined into only one request to the corresponding memory controller. Therefore, the combinable primitive significantly reduces the number of requests for each memory location buffered at the memory controller.
937: 
938: %In the worst case of LSA-STM, for each object $O_i$, all $N$ threads first try to change the state of an active transaction $T_j$ using $CAS$ ( $p_j$ is trying to commmit its transaction $T_j$ and $N-1$ other threads are trying to abort $T_j$). The contention level on variable $T_j.ststus$ is $N$. Similarly, The contention level on pointer $TMObj$ is $N$.
939: 
940: %For NBFEB-STM, at any time the contention level on a $next$ pointer or $status$ is $1$ due to combination. 
941: 
942: %\todo{If have time, elaborate on How does the contention level help gain performance}
943: \end{proof}
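
As a rough illustration of the sketch above, consider $M$ transactions that simultaneously issue conflicting primitives on the same location $l$ (e.g. a transaction's $status$ variable) at time $t$. Under the idealized assumption that all conflicting $TFAS$ requests to $l$ are combined before they reach the memory controller, the contention levels compare as
\[
CL_{l,t}^{CAS} \;=\; M
\qquad \text{versus} \qquad
CL_{l,t}^{TFAS} \;=\; 1 ,
\]
where the superscripts are notation introduced here. For example, with $M = 16$ conflicting threads, the memory controller would serve $16$ sequential requests in the $CAS$-based case but only one combined request in the $TFAS$-based case. How closely a real system approaches this bound depends on how much combining the interconnection network actually performs.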
944: 
945: %\todo{Problem: Even if we don't use the garbage collector supported by the system, a preempted Tx, when waked up, may move the $O_i$ pointer to its obsolete locator $L_o$ (1000 year old) $\Rightarrow$ following such a $O_i.start$ pointer cannot find the correct head of the locator list unless we keep all the intermediate locators as in the case using the system's garbage collector $\Rightarrow$ FindHead() must scan the pool of locators to find the head like sticky bit, but try to reduce the number of locators in the pool.
946: 
947: %Solution: using array $Start[N]$ of \{$pointer$, $timestamp$\} where pointer $ptr_i$ point to the latest locator finished by $Tx_i$ at timestamp $ts_i$. Timestamp $ts_i$ is either the timestamp of the version $ptr_i.state$ if $Tx_i$ aborts or the timestamp of the version $ptr_i.tentative$ if $Tx_i$ commits. 
948: %To find "head", scan the array and find the pointer with the highest timestamp. Timestamp can be implemented by FAA, a combinable operation $\Rightarrow$ the new STM is still  scalable.
949: 
950: 
951: %Garbage collection: 
952: 
953: %- For each $Tx_i$, after updating Start[i].ptr to its current location $L_c$, its previous locator $L_p$ is unlinked to $Tx_i$. $L_p$ is ready to reclaim if there is no indirect/direct link to it from a transaction. Any link from $L_p$ doesn't prevent a locator from reclaiming.
954:  
955: %- After updating $Start[i]$ with $\{ptr_i, ts_i\}$, $Tx_i$  resets locators pointed by $Start[j].ptr, j \neq i$ with $Start[j].ts < ts_i$. This removes all obsolete links to $L_p$;
956: 
957: %- Don't need Locator.Prev pointer. This new multi-version STM works on the latest versions of each transactions instead of consecutive versions as in LSA\_STM. }
958: 
959: 
960: %%%%%%%%%% Check model of sticky bit again to see if the paper assume something that can prevent a sleeping thread from using relaimed/re-used data. Then, read the garbage collection paper PLDI'06 & OOPSLA'03
961: 
962: \section{Garbage Collectors} \label{sec:GarbageCollector}
963: 
964: In this section, we present a non-blocking garbage collection algorithm called NB-GC that can be used in the context of NBFEB-STM. The NB-GC algorithm does not require synchronization primitives other than reads and writes, yet it still guarantees the obstruction-freedom property for {\em application threads} (or mutators in the memory management terminology). Obstruction-freedom here means that a halted application thread cannot prevent other application threads from making progress. 
965:  
966: Like previous concurrent garbage collection algorithms for multiprocessors \cite{Appel04, AzaLPP03, BacALRS01, BoeDS91, CheB01, DijLMSS78, DolL93, DolG94, DomKP00, Lamport76, LevP06, SinBWC07, SomDK06, SomK07, Steele75}, the new NB-GC algorithm is a priority-based garbage collection algorithm in which the collector thread is a privileged thread that may suspend and subsequently resume the mutator threads. 
967: The NB-GC algorithm is an improvement of the seminal on-the-fly garbage collector \cite{DijLMSS78, DolG94, DolL93} using the sliding view technique \cite{LevP06} called SV-GC. Unlike the SV-GC algorithm, the NB-GC algorithm allows the collector to suspend a mutator at any point in the mutator's code (even in the reference slot update and object allocation procedures). This prevents a mutator from blocking the collector and consequently from blocking other mutators.
968: 
969: %In this section, we present a on-the-fly garbage collector \cite{DijLMSS78, DolG94, DolL93} that guarantees the obstruction-freedom property for {\em application threads} (or mutators in the memory management terminology). 
970: %Namely, a delayed/preempted application-thread should not affect other application-threads progress via stopping the garbage collector. 
971: %The new garbage collector uses the new $TFAS$ operation to keep progressing when interacting with a mutator that may stop at {\em any} point in the mutator code.
972: 
973: %\todo{Does the collector need to suspend the mutator in the new obstruction-free garbage collector? Yes, since the collector needs to read the mutator's stack & state also. We don't need to make a big improvement here: Nonblocking garbage collection may be future research}
974: 
975: In the concurrent garbage collection model, there are two kinds of threads: application threads (e.g., the mutators) that run user programs (error-prone code), and privileged threads with higher priority (e.g., the collector) that perform system tasks (error-free code). Whereas the application threads can be delayed or preempted arbitrarily, the system threads, when running, will not be preempted by the application threads. NB-GC guarantees obstruction-freedom for {\em application threads}, which usually run users' error-prone code. Namely, a halted application-thread will not prevent other application-threads from making progress by blocking the garbage collector. 
976: The model, in some sense, also covers the non-blocking garbage collection algorithms \cite{HerLMM05, Mic04} that, at first glance, seem not to require privileged threads. In fact, these non-blocking garbage collectors require strong synchronization primitives such as {\em compare-and-swap}, whose atomicity is guaranteed by hardware threads, a kind of privileged thread.
977: 
978: The SV-GC algorithm using the sliding view technique \cite{LevP06} does not need synchronization primitives other than reads and writes. However, it requires that a mutator be suspended only at a safe point; in particular, the mutator must not be stopped during the execution of a reference slot update or a new object allocation. If a mutator $M$ is preempted during such an execution, the collector cannot progress since it cannot suspend $M$. This would prevent the other mutators from making progress due to lack of memory. Therefore, the SV-GC collector does not guarantee obstruction-freedom for mutators and must rely heavily on the scheduler to avoid such a scenario.
979: \footnote{In order to reclaim unreachable cyclic structures of objects, the reference-counting collectors use either a backup tracing collector \cite{AzaLPP03} infrequently or a cycle collector \cite{PazBKPR07}. Both the efficient backup tracing collector \cite{AzaLPP03} and cycle collector \cite{PazBKPR07} use the sliding view technique.}
980: 
981: The basic idea of the sliding view technique in the SV-GC algorithm is as follows. At the beginning of a collection cycle $k$, the collector takes an asynchronous heap snapshot $S_k$ of all (heap) reference slots $s$. By comparing snapshots $S_{k-1}$ and $S_k$, the collector knows which objects have had their reference counters changed during the interval between the two collections. For instance, if in the interval a reference slot $s$ successively holds references to objects $o_0, o_1, \ldots, o_n$, where $(s,o_0)$ is recorded in $S_{k-1}$ and $(s, o_n)$ in $S_k$,  
982: the collector only needs to execute two reference count updates, for $o_0$ and $o_n$: $RC(o_0)--$ and $RC(o_n)++$, instead of $2n$ reference count updates for $o_0$, $o_n$ and the $(n-1)$ intermediate objects $o_i, 1 \leq i \leq (n-1)$: $RC(o_0)--, RC(o_1)++, RC(o_1)--, \ldots, RC(o_n)++$. The main stages of the {\em generic} sliding view algorithm \cite{LevP06} are shown in Algorithm \ref{alg:GenericCollector}. The algorithm is {\em generic} in the sense that it may use any mechanism for obtaining the sliding view.
983: Instead of using an atomic snapshot algorithm \cite{AfeADGMS93} to obtain a consistent view of all heap reference slots, the algorithm uses a much simpler mechanism called {\em snooping} \cite{DijLMSS78} to avoid wrong reference counts that result from an inconsistent view. For instance, if the only reference to an  object $O$ is moving from slot $s_1$ to slot $s_2$ when the view is taken, the view may miss the reference in both $s_1$ (reading after modification) and  $s_2$ (reading before modification). To deal with the problem, the snooping mechanism marks as {\em local} any object that is assigned a new reference in the heap while the view is being read from the heap. The marked objects are left to be collected in the next collection cycle. The reader is referred to \cite{LevP06} for the complete SV-GC algorithm.
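
As a small worked instance of the update-coalescing argument, consider only the contributions of a single slot $s$ that holds $o_0$ when $S_{k-1}$ is taken and is then assigned $o_1, \ldots, o_n$ in turn. We assume here, purely for illustration, that the $o_i$ are pairwise distinct; the notation $\Delta RC(\cdot)$ for the net change of a reference counter over the interval is ours and does not appear in \cite{LevP06}. Each intermediate object gains one reference when it is written into $s$ and loses it when it is overwritten, so the $2n$ individual updates telescope:
\begin{align*}
\Delta RC(o_0) &= -1, \\
\Delta RC(o_i) &= (+1) + (-1) = 0, \qquad 1 \leq i \leq n-1, \\
\Delta RC(o_n) &= +1.
\end{align*}
For $n=3$, for example, the six updates $RC(o_0)--, RC(o_1)++, RC(o_1)--, RC(o_2)++, RC(o_2)--, RC(o_3)++$ have the same net effect as the two updates $RC(o_0)--$ and $RC(o_3)++$.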
984: 
985: \begin{algorithm}[t]
986:   \caption{{\sc GenericCollector}: the main stages of a collection cycle using the sliding view technique} \label{alg:GenericCollector}
987:  %\algsetup{linenodelimiter=Gen:}
988:   \begin{algorithmic}[1]
989: 	\STATE Raise the $Snoop_i$ flag of each mutator;
990: 	\STATE Obtain a sliding view (concurrently with the mutators' computation); \\
991: 	%\COMMENT{cf. {\sc ViewObtain}, Algorithm \ref{alg:ViewObtain}.} \label{alg:Gen:GetView}
992: 	\STATE For each mutator $M_i$: 1) Suspend $M_i$; 2) Turn the $Snoop_i$ flag off; 3) Mark as {\em local} objects $O$ directly reachable from $M_i$'s roots; 4) Resume $M_i$; \\
993: 	%\COMMENT{cf. {\sc StateRead}, Algorithm \ref{alg:StateRead}.} 	\label{alg:Gen:Suspend}
994: 	\STATE Update the reference counter $O.rc$ of each object $O$; \\
995: 	%\COMMENT{cf. {\sc CounterUpdate}, Algorithm \ref{alg:CounterUpdate}.} 	\label{alg:Gen:UpdateRC} 
996: 	\STATE Reclaim each object $O$ that is not marked {\em local} and has $O.rc=0$; for each descendant $D$ of a reclaimed object, $D.rc --$, and $D$ is then checked for reclamation like $O$. This operation continues recursively until no more objects can be reclaimed. 
997:   	\end{algorithmic}
998: \end{algorithm}
999: 
1000: We found that the SV-GC algorithm  \cite{LevP06} can be easily improved to provide obstruction-freedom for mutators using the {\em helping technique} \cite{Barnes93}. Basically, if the collector suspends a mutator during its execution of a reference slot update or object allocation procedure, the collector helps the mutator by completing the procedure on behalf of the mutator and moving the mutator's program counter (PC) to the end of the procedure before resuming the mutator. Note that in the concurrent garbage collection model there is only one collector that can suspend a given mutator and the collector suspends only one mutator at a time. The improved algorithm  provides obstruction-freedom for mutators (or application-threads) by preventing mutators from blocking the collector and consequently from blocking other mutators. It is obstruction-free in the sense that progress is guaranteed for each active mutator regardless of the status of the other mutators.
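
The helping step described above can be sketched, under stated assumptions, as a refinement of the per-mutator stage of Algorithm \ref{alg:GenericCollector}. The procedure name {\sc SuspendAndHelp}, the symbol $P$, and the exact ordering of the steps are illustrative assumptions of this sketch; it is not a specification of NB-GC.

\begin{algorithm}[t]
  \caption{{\sc SuspendAndHelp}: an illustrative sketch of how the collector may help a suspended mutator $M_i$ (names and step ordering are assumptions)} \label{alg:SuspendAndHelpSketch}
  \begin{algorithmic}[1]
	\STATE Suspend $M_i$;
	\IF{$M_i$'s PC is inside a reference slot update or object allocation procedure $P$}
		\STATE Complete the remaining steps of $P$ on behalf of $M_i$;
		\STATE Move $M_i$'s PC to the end of $P$;
	\ENDIF
	\STATE Turn the $Snoop_i$ flag off and mark as {\em local} the objects $O$ directly reachable from $M_i$'s roots;
	\STATE Resume $M_i$;
  \end{algorithmic}
\end{algorithm}

Since only one collector can suspend a given mutator and it suspends one mutator at a time, no other thread executes $P$ for $M_i$ while the collector is helping in this sketch.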
1001: 
1002: 
1003: %\appendix
1004: %\section{Appendix}
1005: 
1006: 
1007: %This is the text of the appendix, if you need one.
1008: \medskip
1009: 
1010: {\bf Acknowledgments} Phuong Ha's and Otto Anshus's work was supported by the Norwegian Research Council (grant numbers 159936/V30 and 155550/420). Philippas Tsigas's work was supported by the Swedish Research Council (VR) (grant number 37252706). 
1011: %Acknowledgments, if needed.
1012: 
1013: % {\scriptsize %\small
1014: %\bibliographystyle{plainnat}
1015: \bibliographystyle{abbrv}
1016: \bibliography{References}
1017: % }
1018: 
1019: \end{document}
1020:  
1021: