0601:cs0601068/paper.tex

1: \documentclass{IEEEtran}

2:

3: \usepackage{bm}

4: \usepackage{psfig, epsfig}

5:

6: \begin{document}

7:

8: \title{Checkbochs: Use Hardware to Check Software}

9: \author{Sorav Bansal\footnote{This work was done with guidance and

10: support from Mendel Rosenblum, Professor of Computer Science, Stanford

11: University}}

12:

13: %\author{Sorav Bansal

14: %(based on work done with Mendel Rosenblum),\\

15: %Computer Systems Lab, Stanford University}

16: %\\

17: %{\em Computer Systems Lab,\\

18: %Stanford University}

19: %\author{

20: %Sorav Bansal, Mendel Rosenblum\\\tiny{\today}

21: %}

22: \twocolumn[\begin{@twocolumnfalse}

23:

24: %{\renewcommand\thefootnote{}\footnotetext{This work was done with guidance

25: %and support from Mendel Rosenblum, Professor of Computer Science,

26: %Stanford University}}

27:

28:

29: \maketitle

30: \thispagestyle{empty}

31: \begin{abstract}

32: In this paper, we present a system called Checkbochs, a machine simulator

33: that checks rules about its guest operating system and

34: applications at the hardware level. The properties to be checked can

35: be implemented as `plugins'

36: in the Checkbochs simulator. Some of the properties that were checked

37: using Checkbochs include null-pointer checks, format-string vulnerabilities,

38: user/kernel pointer checks, and race-conditions. On implementing these

39: checks, we were able to uncover previously-unknown bugs in widely used

40: Linux distributions. We also tested our

41: tools on undergraduate coursework, and found numerous bugs.

42: \end{abstract}

43: \begin{keywords}

44: Software Reliability, Machine Simulation, Type Systems

45: \end{keywords}

46:

47: \end{@twocolumnfalse}]

48:

49: \section{Introduction}

50: The use of software in many mission critical applications has led computer

51: scientists to focus on software reliability, more than ever before.

52: Many ways have been proposed to make software more reliable, secure

53: and bug-free. Broadly, these techniques can be grouped into static

54: compile-time techniques and dynamic run-time techniques.

55:

56: Static compile-time techniques analyze program source code to detect

57: errors. This method,

58: though very effective cannot take advantage of the information available

59: at run-time. Dynamic runtime techniques on the other hand mostly

60: rely on an interpreted environment, or instrumentation at source-code

61: or binary level. For this reason, dynamic runtime techniques have seldom been

62: used to test the correctness of an operating system. Instrumenting the

63: source code or binary of an operating system requires clever engineering

64: effort and deep understanding of the OS internals. Further, binary

65: instrumentation requires thorough understanding of the instruction set

66: architecture. And even when the instrumentation is complete and correct,

67: the execution of instrumented code can change the behavior of the OS

68: indeterministically due to a different interleaving of concurrent threads.

69:

70: In this paper, we present a dynamic runtime checker called Checkbochs which

71: has been implemented using an existing x86 machine simulator called Bochs

72: \cite{bochs}. Since each machine instruction in a simulator is interpreted,

73: it allows us to do complex runtime analyses without the pitfalls associated

74: with instrumentation. By tracking data flow at runtime, and checking for

75: some known properties, we found interesting bugs in two widely used

76: Linux distributions. We also implemented a dynamic data race detector

77: inside the simulator to be able to find data races in operating systems. As

78: a case study, we checked an academic operating system called Pintos which

79: was built as part of undergraduate coursework at Stanford.

80:

81: Some of our contributions are:

82: \begin{itemize}

83: \item Checkbochs allows dynamic runtime analysis by implementing rules

84: to be executed while interpreting instructions in the guest software.

85: \item Checkbochs can identify bugs across all layers of software running

86: on the system. It does not require prior knowledge of the applications or

87: their source code before reporting violations.

88: \item Checkbochs does not change application or OS behavior in any way since

89: all its operations occur at the hardware level. In fact, Checkbochs can be

90: used on an existing disk image.

91: \item Implementing new runtime rules in Checkbochs is easy and straightforward

92: \end{itemize}

93: This paper is organized as follows. Section \ref{sec:dataflow} explains

94: the Checkbochs dataflow analysis framework that can be extended to check

95: various system rules at runtime. Section

96: \ref{sec:eraser} describes the implementation of one such rule -

97: the dynamic data race detector. Section \ref{sec:relwork} discusses

98: the related work, and finally Section \ref{sec:conclusion} concludes.

99:

100: \section{Dynamic Data Flow Analysis}

101: \label{sec:dataflow}

102: In our dynamic data flow analyses, we wish to infer types

103: of machine values and then check for violations of type-properties

104: at runtime. For example, if two file-descriptors are multiplied at

105: any point in the execution of the system, it signals an error.

106:

107: Checbochs provides a framework, which allows users to specify

108: data types and corresponding system rules to be checked during

109: runtime. In this section, we describe the implementation of

110: this framework and also the results obtained by implementing

111: some rules.

112:

113: \subsection{Implementation}

114: To track types of machine values, Checkbochs maintains a {\em shadow

115: machine state}

116: The shadow state consists of the shadow registers, instructions

117: and shadow physical memory. Since the instructions use virtual memory addresses,

118: we also implemented a virtual-memory translation logic in the shadow space.

119:

120: Data flow is monitored by associating all data transfer and arithmetic

121: instructions ({\tt load}, {\tt store}, {\tt move}, {\tt add}, etc.) with

122: their counterparts in shadow space. For example, a {\tt move dest, src}

123: instruction

124: would also cause the contents of {\tt src} to be copied to {\tt dest} in

125: shadow space. Using this data flow framework, we tag data values with

126: their {\em types} in shadow space and analyse their flow across the system.

127: Some of the types that we inferred using Checkbochs are tabulated in

128: Table \ref{tab:types} with their method of inference.

129:

130: \begin{table}[htb]

131: \begin{center}

132: %\begin{tabular}{|c|p{14cm}|}

133: \begin{tabular}{|p{3.5cm}|p{4.5cm}|}

134: \hline

135: Method Of Inference & Types \\

136: \hline

137: \hline

138: Linux System Call Interface & User Pointers, File Descriptors, Strings \\

139: \hline

140: x86 Instruction Operands & I/O address, I/O bytes, NullChecked, Memory Address, Code/Data\\

141: \hline

142: Planted Shadow Instructions in Guest Software & malloc/fopen return value,

143: user pointer safety\\

144: \hline

145: \end{tabular}

146: \label{tab:types}

147: \caption{\small{Type values and their Method of Inference at the Hardware Level}}

148: \end{center}

149: \end{table}

150:

151: Sometimes, the flow of type information is not directly associated with

152: the corresponding data flow. For example, consider the sample code of

153: the {\tt read()} system call in Figure \ref{fig:sysread}.

154: \begin{figure}[htb]

155: \begin{minipage}[]{\columnwidth}

156: {

157: \begin{small}

158: \begin{verbatim}

159:   ssize_t sys_read(...,char *buf,...) {

160:       char *tmp = buf;

161:       if (!access_ok(tmp)) fail();

162:       ...

163:       *buf = 42; /* safe? */

164:   }

165: \end{verbatim}

166: \end{small}

167: }

168: \end{minipage}

169: \caption{\label{fig:sysread}

170: The system call parameter {\tt buf} is an unsafe user pointer. However, since

171: a derived variable {\tt tmp} has been checked for safety, the assignment

172: to {\tt *buf} should be considered legal.

173: }

174: \end{figure}

175: At the entry of the function, {\tt buf} is an {\em unchecked user pointer}. The

176: assignment statement {\tt tmp=buf} changes the type of {\tt tmp} to

177: an unsafe user pointer too. Subsequently, {\tt tmp} goes through the

178: appropriate check and its type is changed to a safe user pointer. However,

179: the result of this check is not reflected on the type of {\tt buf} and the

180: line {\tt *buf=42} will incorrectly flag a warning.

181:

182: We solved this problem by using an extra level of indirection in the shadow

183: space. Instead of holding the type value, the shadow state holds pointers

184: to type objects. Hence, an assignment statement like {\tt tmp=buf} causes

185: the pointer to {\tt buf}'s type object to be copied to {\tt tmp}'s shadow

186: state. Since both {\tt buf} and {\tt tmp} now point to the same type

187: object, any change in {\tt tmp}'s type is reflected in {\tt buf}'s type

188: and vice versa. This extra level of indirection is also illustrated in

189: Figure \ref{fig:indirection}

190: \begin{figure}[htb]

191: \centerline{\epsfig{figure=TypeIdentifier.eps,width=0.6\columnwidth}}

192: \caption{\label{fig:indirection}{\bf Pointers to Type Objects}.

193: Both {\tt tmp} and {\tt buf} point to the same type object, after

194: assignment of {\tt tmp} to {\tt buf}.

195: Hence, any change to the type of {\tt tmp}

196: will also reflect in the type of {\tt buf}.

197: }

198: \end{figure}

199:

200: \subsection{Results}

201: We ran two different disk images on Checkbochs: {\tt debian-3.0r1} and

202: {\tt gentoo-2004.3}. Both these disk images were obtained after freshly

203: installing the latest available distribution CDs by these vendors. Using

204: our dataflow framework, we check for flavors of copyin/copyout bugs, improper

205: handling of {\tt malloc()}, {\tt fopen()} functions and formatstring

206: vulnerabilities. The results obtained by performing these checks are

207: described in detail in the following sections.\\

208: \subsubsection{CopyIn/CopyOut Bugs}

209: A user value passed as a system call argument must be checked through one

210: of the copyin/copyout functions before getting dereferenced. Failure to do

211: so, opens a port of attack, whereby an attacker can crash the kernel,

212: or worse be able to write his own data at a specific kernel address. Another

213: variation of this rule is that a user pointer should never be dereferenced

214: in the kernel with disabled interrupts.

215:

216: Implementing this rule using our data flow framework was straightforward. All

217: system call parameters are tagged as {\em unchecked user values}. Any check

218: on the user value through one of the copyin/copyout functions cause the

219: type of the user value to change from {\em unchecked} to {\em checked user

220: value}. A warning is

221: flagged if an unchecked user value is dereferenced in the kernel or

222: a user value is dereferenced with disabled interrupts.

223:

224: Using these rules, we identified one copyin/copyout bug in the {\tt poll}

225: system call of the linux kernel. In this instance, the user pointer was

226: checked for read access, while a write operation was performed on it. This

227: can be a security flaw on many architectures, including Intel 386.

228: The submitted bug was acknowledged by Alan Cox on the Linux Kernel

229: Mailing List \cite{lkml}.\\

230:

231: \begin{figure*}[t]

232: \begin{center}

233: \begin{tabular}{|c|c|c|c|}

234: \hline

235: Rule & Kernel Bugs & Application Bugs & False Positives \\

236: \hline

237: \hline

238: Pointers/File Descriptors checked against NULL & 6 & 16 & 0 \\

239: \hline

240: CopyIn/CopyOut Bugs & 1 & - & 4 \\

241: \hline

242: Untrusted IO in FormatString & 0 & 0 & 1 \\

243: \hline

244: \end{tabular}

245: \caption{\label{fig:results}

246: {\bf Summary}: The table summarizes the results obtained on implementing

247: certain rules

248: and checking two widely used Linux distributions. Besides these rules,

249: we implemented the race-detection algorithm in Checkbochs which uncovered

250: numerous bugs in undergraduate coursework.

251: }

252: \end{center}

253: \end{figure*}

254:

255:

256: \subsubsection{Improper handling of {\tt malloc()}, {\tt fopen()} return values}

257: Any use of {\tt glibc} functions like {\tt malloc()} and {\tt fopen()} must be

258: accompanied by a null check before they are used. To check this rule,

259: we tagged the return values of these functions until they were checked against

260: {\tt null}. Implementing this rule, we found many instances of violation

261: in the kernel and common applications like {\tt ps}, {\tt grep}, {\tt fsck},

262: and {\tt swapon}. The violations in the kernel were found in the

263: IDE device driver of {\tt linux-2.4.18}. In all, after minimal testing,

264: we found 16 bugs in user-level

265: software and 6 bugs in the kernel. All these bugs were accepted, and

266: subsequently fixed.\\

267:

268: \subsubsection{Formatstring Vulnerabilities}

269: A formatstring vulnerability  \cite{formatstring}

270: is caused due to a design misfeature in the C

271: standard library combined with problematic implementation of variable

272: argument functions. A value from an untrusted source (such as the network)

273: should not be used without proper checks inside the formatstring argument

274: of the {\tt printf} family of functions. Failure to do so, can lead to a

275: complete compromise of security, when combined with other bugs. To test

276: applications, we mounted an NFS partition on the guest system and ran a

277: number of applications on it. Although, we hoped to obtain some bugs in

278: applications running on data in the untrusted NFS partition, we obtained

279: only one false positive in the {\tt sendmail} program. In this case,

280: each character of the string received over the network was carefully

281: checked before using it in the formatstring. Since our rule does not

282: implement a way of finding out whether a value has been appropriately

283: checked for safety, it reports a false warning.\\\\

284:

285: In essence, we believe that machine simulation allows us to track

286: data flow across many different applications, which would have been

287: difficult (if not impossible) to do with instrumentation techniques. We

288: are investigating other uses of this dataflow framework in determining

289: interesting system properties. Recently, dynamic dataflow tracing was

290: used to determine lifetime of sensitive data like passwords \cite{taintbochs}.

291: The work revealed the startling result that sensitive information could

292: live in the system for a long time in places like system caches and device

293: buffers (places which are not in the control of the application vendor).

294:

295: \section{Race Conditions}

296: \label{sec:eraser}

297: Multi-threaded programming is difficult and error-prone. It is easy to make

298: a mistake that produces a data race, yet it can be extremely hard to

299: locate this mistake during debugging. A very effective dynamic data

300: race detection algorithm is the lockset algorithm, first proposed in

301: a tool called Eraser \cite{eraser}. Eraser used binary rewriting techniques

302: to monitor every shared memory reference and verify that consistent locking

303: behavior is observed. The lockset algorithm enforces the simple locking

304: discipline that every shared variable is protected by some lock. Since

305: there is no way of knowing which locks are intended to protect which

306: variables, this protection relation can be inferred from the execution

307: history. Figure \ref{fig:lockset} summarizes the

308: lockset algorithm. More details on the algorithm can be found

309: at \cite{eraser}.

310:

311: We implemented the lockset algorithm in Checkbochs.

312: On testing undergraduate coursework \cite{cs140}, we found numerous

313: bugs in assignments that had received a near-perfect score. We also

314: found one benign race in the base operating system {\tt pintos} \cite{pintos}

315: provided to the students.

316: \begin{figure}[htb]

317: \begin{minipage}[]{\columnwidth}

318: {

319: \begin{small}

320: \hspace{10mm}

321: Let $locks\_held(t)$ be the set of locks held by thread $t$

322:

323: \hspace{10mm}

324: For each $v$, initialize $lockset(v)$ to the set of all locks

325:

326: \hspace{10mm}

327: On each access to $v$ by thread $t$,

328:

329: \hspace{10mm}

330: \ \ set $lockset(v)$ := $lockset(v) \bigcap locks-held(t)$;

331:

332: \hspace{10mm}

333: \ \ if $C(v)$ = $\{\}$, then issue a warning.

334: \end{small}

335: }

336: \end{minipage}

337: \caption{\label{fig:lockset}

338:   {\bf The Lockset Algorithm}

339: }

340: \end{figure}

341:

342: \section{Related Work}

343: \label{sec:relwork}

344: There has been a growing impetus on software reliability and security in

345: recent years. Researchers have considered many ways to perform

346: post-production checks

347: in software.

348:

349: Static compile-time analysis with programmer written compiler-extensions

350: was used to catch around 500 bugs in the linux kernel

351: \cite{metacompilation1}, \cite{metacompilation2}.

352: Using static data flow analysis and domain specific knowledge, many bugs

353: were found in the heavily audited kernel. Ways have also been suggested

354: to automatically detect anomalies as deviant behavior in the source

355: code \cite{deviant}. Most of the bugs checked by static analysis are

356: local to a single file, sometimes even local to a single procedure. This is

357: due to the complexity involved in performing global compile time

358: analysis. This limits the power of static analysis tools to {\em surface

359: bugs}. Our approach, on the other hand, can track data flow across many

360: different software components possibly written by different vendors and

361: can thus target a different variety of errors. However, static analysis

362: has the huge advantage of being able to check all possible code paths, while

363: our execution-driven approach can only check bugs along the path of

364: execution in the system.

365:

366: Recently, model checking was used to find serious file system errors

367: \cite{modelcheckingOSDI04}. Using

368: an abstract model and intelligent reduction of the state space, they could

369: check for errors which would have required an exponential number of

370: search paths through traditional testing. Model checking can check for

371: deeper semantic bugs than possible with static compile-time analysis. We

372: intend to use similar ideas to model check entire system images, thus allowing

373: us to search a larger number of execution paths while performing our

374: shadow machine analysis. One of the obstacles in this direction is the

375: slow speed of machine simulation that makes execution of speculative

376: paths almost infeasible.

377:

378: Shadow machine simulation has been previously used to perform taint analysis

379: to determine

380: the data lifetime of sensitive data \cite{taintbochs}. This work reported a

381: startling observation that sensitive data like passwords and credit card

382: numbers may reside in computer's memory and disk long after the user

383: has logged out. Such leaks occur at caches, I/O buffers, kernel queues,

384: and other places which are not under the control of the application

385: developer. Our work uses a similar taint analysis by marking all bytes received

386: over the network as untrusted and checking if they are used in unwanted

387: ways (eg. formatstring).

388:

389: Recently, \cite{valgrindtaint} used taint-analysis on untrusted data to

390: check for security violations such as buffer overflows and formatstring

391: attacks in applications. By implementing a valgrind

392: skin, they were able to restrict the overhead of their taint-analysis

393: tool to 10-25x. Considering that the computation power is relatively cheap,

394: they suggest using their tool in production runs of the software. This will

395: detect and prevent any online attacks on the system.

396:

397: \section{Conclusion}

398: \label{sec:conclusion}

399: We present a novel technique to finding bugs and security holes in system

400: software. Our technique can check for bugs across all layers

401: of software, from the OS to the application.

402: Our approach has very low false positive and false

403: negative rates. This technique can be especially

404: very useful in expediting the process of discovering bugs during software

405: testing.

406:

407: We conjecture that shadow machine simulation, combined with speculative

408: execution (such

409: as model-checking) can yield a huge number of bugs. While the slow speed

410: of machine simulation is an impediment to this approach, we are considering

411: using virtual machine environments to achieve the same objective.

412:

413: \section{Acknowledgements}

414: This work was done under the able guidance

415: and support from Mendel Rosenblum, Professor of Computer Science,

416: Stanford University. The author would also like to thank Ben Pfaff

417: for very insightful discussions, help with testing Checkbochs on

418: {\tt pintos}, and making Checkbochs available to CS140 students.

419:

420: \bibliographystyle{plain}

421: \bibliography{bibliography}

422: \begin{thebibliography}{66.}

423: \small{

424: \bibitem{metacompilation1} D. Engler, B. Chelf, A. Chou, S. Hallem. Checking

425: System Rules Using System-Specific, Programmer Written Compiler Extensions,

426: Appeared in OSDI 2000

427: \bibitem{metacompilation2} K. Ashcraft, D. Engler. Using Programmer-Written

428: Compiler Extensions to Catch Security Holes,

429: IEEE Security and Privacy 02

430: \bibitem{deviant} D. Engler, D. Chen, S. Hallem, A. Chou, B. Chelf. Bugs

431: as Deviant Behavior: A General Approach to Inferring Errors in System

432: Code. Appeared in SOSP 01

433: \bibitem{modelcheckingOSDI04} J. Yang, P. Twohey, D. Engler, M. Musuvathi.

434: Using Model Checking to Find Serious File System Errors. Appeared in OSDI 04

435: \bibitem{bochs} Bochs: The cross platform IA-32 emulator.

436: {\tt http://bochs.sourceforge.net/}.

437: \bibitem{taintbochs} J. Chow, B. Pfaff, K. Christopher, M. Rosenblum.

438: Understanding Data Lifetime via Whole-System Simulation, USENIX Security

439: Symposium 2004.

440: \bibitem{userKernel} Rob Johnson and David Wagner. ``Finding User/Kernel

441: Pointer bugs with Type Inference,'' 13th USENIX Security Symposium 04

442: \bibitem{formatstring} Tim Newsham. ``Format String Attacks,'' Guardent,

443: Inc. September 2000. {\tt http://www.guardent.com/docs/FormatString.PDF}

444: \bibitem{lkml} Linux Kernel Mailing List. {\tt http://lkml.org}

445: \bibitem{crash} Mission Critical Software's {\tt crash}. http://www.missioncritical.com/support/crash.php

446: \bibitem{gdb} GNU Debugger. http://www.gnu.org/software/gdb/gdb.html

447: \bibitem{cs140} Undergraduate Course CS140. http://cs140.stanford.edu

448: \bibitem{pintos} Pintos: instructional operating system.

449: http://www.stanford.edu/class/cs140/projects/

450: \bibitem{bugs} A list of the bugs reported and fixed using Checkbochs. http://www.stanford.edu/people/sbansal/bugs.html

451: \bibitem{valgrindtaint} J. Newsome and D. Song. Dynamic taint analysis for

452: automatic detection, analysis and signature generation of exploits on

453: commodity software, to appear in Network and Distributed System Security

454: Symposium 05.

455: \bibitem{eraser} S. Savage, M. Burrows, G. Nelson, P. Solbovarro and T.

456: Anderson. Eraser: A dynamic data race detector for multi-threaded programs. In

457: Proceedings of the 19 Sixteenth Symposium on Operating Systems Principles, 1997

458: %\bibitem{englerOSDI00} Engler et al. OSDI 2000

459: %\bibitem{ashcraft02} Engler et al. Finding Security Holes. IEEE S\&P 2002

460: %\bibitem{englerPLDI02} Engler et al. PLDI 2002. XXX

461: %\bibitem{englerOSDI04} Engler et al. OSDI 2004

462: %\bibitem{englerNSDI04} Engler et al. NSDI 2004

463: }

464: \end{thebibliography}

465: \end{document}

466: