Monday, April 23, 2012

Think Complexity book review

Think Complexity, by Allen B. Downey

I've seen the title of this book several times, so when it appeared on the list of books in the O'Reilly blogger review program, I had to grab it.

I'm interested in algorithms, complexity, and programming. This book's table of contents covered all those things, but with a huge caveat: a very low page count (158 pages for the PDF version). Most topics got only a few pages, so how useful could this book be?

Well, the author clarifies that up front: the book is meant to be a supplemental text for an algorithms, intermediate programming, or special topics course. As such, the book does a decent job: it doesn't cover any particular topic in great depth, but it serves as an interesting smorgasbord of problems and algorithms around what Computer Scientists call Complexity. Whether as a quick sampler for students (or for teachers looking for some interesting topics) or as easy reading for more seasoned developers, the book is worth picking up and reading, but be prepared to spend the real time on your own, digging into the details that the author covers only briefly.

A high point for me (as a former educator) is the inclusion of student-written chapters. The author required his students to produce analyses similar to the early chapters, and he chose a few of the better results to include as final chapters (and he encourages others to make submissions to his website, for possible inclusion in future editions). The final chapter (on student cheating) is especially relevant to many students and teachers. Its game-theoretic approach to finding stable systems for managing cheating is an interesting one.

My recommendation: read this review and the author's introduction. If you fit one of the categories he is targeting, then you'll probably find the book interesting.

[Disclaimer: I got this book for free as part of the O'Reilly blogger review program. I was not required to write a positive review. The opinions I have expressed are my own. I am disclosing this in accordance with the Federal Trade Commission’s 16 CFR, Part 255: “Guides Concerning the Use of Endorsements and Testimonials in Advertising.”]

Sunday, March 11, 2012

Machine Learning for Hackers book review

Buy this book.  I got one for free via the O'Reilly review program,
but I'll probably buy a paper copy, just so that I can mark it
up and loan it out to others.

This book is everything it is advertised to be.  It has enough of
a statistics refresher to bring the average hacker up to speed,
and then it dives right in, using R as the language of choice to
cover several common machine learning tasks.

It's not a gentle introduction to R, but code samples are
carefully explained (be prepared to look at R's documentation if
you aren't familiar with R, though).  The book doesn't teach R
programming, but it does cover several useful libraries for
machine learning (including mining textual data).  The authors also
give good presentation advice (e.g., they point out that a little
extra time given to the presentation can make the difference between
an amateurish presentation and a professional one, and they show
the difference).

Two items deserve special note:  first, while the book was in press,
the API used to generate the data for the chapter on analyzing social
graphs was removed, and the authors had to make a decision to either go
with the existing data, or wait and see what new APIs were made available.
The authors chose to provide their sample data and go with their
example rather than wait.  That was a great choice, as developers have
to deal with the real world, where vendors remove and change APIs.  I was
impressed at how they handled that issue.

The second item is not-so-great: the section on the Support Vector Machine
was too cursory.  It read like an editor or reviewer had said "hey, you
should mention SVM," and so the authors added a section.  But that material
was not given the same level of treatment as other contents, and,
as a result, the book stops on a somewhat off note.  A better
choice would have been to simply skip that chapter completely.

Overall, though, this is a great book.  It's hands-on, filled with
useful and interesting examples and advice, and it will get you moving
towards solving your own machine learning problems.

[Disclaimer: I got this book for free as part of the O'Reilly blogger review
program. I was not required to write a positive review. The opinions I have
expressed are my own. I am disclosing this in accordance with the Federal
Trade Commission’s 16 CFR, Part 255: “Guides Concerning the Use of
Endorsements and Testimonials in Advertising.”]

Sunday, February 26, 2012

Test-Driven Infrastructure with Chef

I just finished Test-Driven Infrastructure with Chef, and I was disappointed.  It's a short book (88 pages if you read the paper version), and out of the 7 chapters, only one (chapter 6) contains what I expect from an O'Reilly book.

But it's hard to blame the author: he didn't claim the book is "Learning Chef" or "Programming Chef". 

The book is an appetizer of what it claims to be: coverage of how to do test-driven infrastructure using Chef.  Unfortunately, the author takes up most of the book to explain what he means by "Test-Driven Infrastructure", and the page count is so small that there simply is not enough meat to satisfy hungry readers.  Additionally, readers who are already familiar with Behavior-Driven Development and DevOps won't find much new here.

My suggestion: skip this and wait to see what the author produces in his forthcoming book on Chef itself.

[Disclaimer: I got this book for free as part of the O'Reilly blogger review program. I was not required to write a positive review. The opinions I have expressed are my own. I am disclosing this in accordance with the Federal Trade Commission’s 16 CFR, Part 255: “Guides Concerning the Use of Endorsements and Testimonials in Advertising.”]

Wednesday, July 6, 2011

Book review: The R Cookbook by Paul Teetor

I haven't seen a better introduction to R than Paul Teetor's R Cookbook, published by O'Reilly.  While it follows the familiar O'Reilly cookbook format, it also provides a gentle introduction, with all the necessary information to get started.  As a particularly nice touch for a cookbook, it includes basic statistics and input/output in the early chapters so that the reader doesn't need to wade through (or fearfully skip over) a lot of material before getting to the needed resources.

A common complaint with other R resources is that the novice in
statistics is overwhelmed with statistical terminology.  Teetor
is not trying to provide a statistics textbook, but he includes refreshing
explanations for the underlying statistics.

Some chapters are particular standouts:

Chapter 2: Some Basics.  This chapter is an appetizer of what R can do,
and it's very helpful to get this early.  Aside from the basic usage of R covered in this chapter, section 2.6 (Computing Basic Statistics) provides a quick introduction to performing basic statistics with R.

Chapter 4: Input and Output.  R's input/output support is a bit cumbersome, but the R Cookbook provides examples for many common cases that newcomers need to handle (text files, CSV files, etc.).

Chapter 9: General Statistics.  This is the meat and potatoes of R for many statistical users.  Students in a basic statistics course (or practitioners needing to do most fundamental analyses) will find chapter 9 to be indispensable.

Chapter 10: Graphics provides a nice dessert as visualizing data is often critical to understanding it.  Teetor provides simple, concrete examples that cover many of the common graphics, as well as how to handle their titles, labels, and legends.

As an added bonus, Teetor and O'Reilly provide Chapter 14: Time Series Analysis.  The coverage here goes beyond standard cookbook fare and provides a good starting point for those interested in Time Series Analysis.

Overall, the R Cookbook is the best O'Reilly cookbook I've read since the release of the Perl Cookbook, and it's by far the best introduction to R that I've seen.  It's a must-have for every newcomer to R.

[Disclaimer: I got this book for free as part of the O'Reilly blogger review program. I was not required to write a positive review. The opinions I have expressed are my own. I am disclosing this in accordance with the Federal Trade Commission’s 16 CFR, Part 255: “Guides Concerning the Use of Endorsements and Testimonials in Advertising.”]

Monday, August 20, 2007

OpenAFS, Acopia, and Panasas

AFS (as the Andrew File System, IBM AFS, and, now, OpenAFS) has been around for a long time. That longevity has brought it some distinct advantages: the user base is both broad and deep, and the product is stable. There are lots of competitors to OpenAFS, though, with money to be made in the storage market. Witness the recent acquisition of Acopia by F5 Networks. Another vendor, Panasas, is clearly viewed as a potentially good business (e.g., take a look at their Board of Directors -- venture capitalists would not be on the Board if they did not think the company would be profitable).

Those two companies, Acopia and Panasas, represent two different market segments that have historically been in the sweet spot of AFS usage. AFS is still strong in one of those areas, but it has soured a bit in the other.

Acopia's claim to fame is virtualization: the ability to keep a namespace constant while changing the back ends around. They also do data migrations. They export via NFS or CIFS, so virtually any modern operating system can access data through their systems. This is very nice. The downside, though, is that Acopia is a hardware solution. Lori MacVittie's neat article about her personal NAS notwithstanding, using Acopia's ARX to provide seamless migration for your failed personal NAS just does not make fiscal sense.

AFS provides the same kind of virtualization, but at a different cost. First, no special hardware is needed. The cost comes in complexity: clients have to run the AFS client. Ports exist for many modern operating systems (from AIX to Windows), but installing clients is definitely more expensive than plugging a network-transparent NFS proxy into your network. The other cost is in administration: the ramp-up for AFS is fairly steep. While efforts have been made to help people get started with AFS, there is still a lot of work to be done.

The two key features of AFS that provide this virtualization are the @sys magic (a path component that each client expands to its own system type name), and the separation of filesystems into volumes, with volume metadata managed by database servers. These key pieces let administrators glue together namespaces seamlessly. The stable semantics of volume migration also let administrators migrate data around a site even while users are accessing that data, letting users stay even further from the underlying details of the storage infrastructure.
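To make the indirection concrete, here is a toy Python model of the two ideas: a lookup table standing in for the volume location database, and @sys expansion on the client. Everything in it (the class, the function names, the cell name example.org, the server names fs1 and fs2, and the sysname string) is hypothetical and chosen for illustration; it is not the OpenAFS API or its on-the-wire behavior.

```python
# Toy model only: every name here is hypothetical, not an OpenAFS API.

class VolumeLocationDB:
    """Maps volume names to the file server that currently holds them."""

    def __init__(self):
        self._location = {}

    def place(self, volume, server):
        self._location[volume] = server

    def lookup(self, volume):
        return self._location[volume]

    def migrate(self, volume, new_server):
        # Only the metadata changes; paths that mount the volume do not.
        self._location[volume] = new_server


def expand_sys(path, sysname):
    """Replace each '@sys' path component with the client's system name."""
    return "/".join(sysname if part == "@sys" else part
                    for part in path.split("/"))


def resolve(path, mounts, vldb, sysname):
    """Return (server, volume, expanded_path) for a path in the namespace."""
    expanded = expand_sys(path, sysname)
    # Pick the longest mount point that is a prefix of the expanded path.
    candidates = [m for m in mounts
                  if expanded == m or expanded.startswith(m + "/")]
    if not candidates:
        raise LookupError("no volume mounted for " + expanded)
    volume = mounts[max(candidates, key=len)]
    return vldb.lookup(volume), volume, expanded


if __name__ == "__main__":
    vldb = VolumeLocationDB()
    vldb.place("user.alice", "fs1")
    mounts = {"/afs/example.org/user/alice": "user.alice"}
    path = "/afs/example.org/user/alice/bin/@sys/tool"

    print(resolve(path, mounts, vldb, "amd64_linux26"))
    vldb.migrate("user.alice", "fs2")  # the user-visible path is unchanged
    print(resolve(path, mounts, vldb, "amd64_linux26"))
```

The point of the sketch is that the path the user types never mentions a server. Moving the volume touches one table entry, and two clients with different system names can share one path while landing in different directories.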

Panasas, on the other hand, is a clear winner over AFS in its product niche: high-performance NFS. Like Lustre and several other filesystem products that live in the High-Performance Filesystem niche, Panasas accomplishes this by parallelizing remote filesystem accesses. AFS gets some performance benefits from its caching, but the filesystem accesses are done against a single filesystem. AFS also doesn't really do NFS.
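As a rough illustration of what "parallelizing remote filesystem accesses" means, the Python sketch below stripes a file across several in-memory stand-ins for storage servers and then reads the stripes back concurrently. The stripe size, the function names, and the round-robin layout are assumptions made for the example; this is not how Panasas, Lustre, or AFS is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 4  # bytes per stripe; tiny so the example is easy to follow


def stripe(data, n_servers, stripe_size=STRIPE_SIZE):
    """Distribute data round-robin across n_servers lists of chunks."""
    servers = [[] for _ in range(n_servers)]
    for offset in range(0, len(data), stripe_size):
        index = (offset // stripe_size) % n_servers
        servers[index].append(data[offset:offset + stripe_size])
    return servers


def fetch_all(server_chunks):
    """Stand-in for a network read of every chunk held by one server."""
    return list(server_chunks)


def parallel_read(servers):
    """Fetch from every server concurrently, then interleave the chunks."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        per_server = list(pool.map(fetch_all, servers))
    out = []
    longest = max(len(chunks) for chunks in per_server)
    for row in range(longest):
        for chunks in per_server:
            if row < len(chunks):
                out.append(chunks[row])
    return b"".join(out)


if __name__ == "__main__":
    original = b"The quick brown fox jumps over the lazy dog"
    servers = stripe(original, n_servers=3)
    print(parallel_read(servers) == original)
```

With real servers, each fetch would be limited by a different disk and network link, so the aggregate read rate can exceed what any single file server delivers. A client that reads each file from one server cannot get that benefit.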

So who is buying Panasas? While I have no knowledge of the sales, I can make some fairly educated guesses: organizations with large data sets in NFS (or CIFS) that need greater performance. The large research organizations (high-energy physics labs, TeraGRID research groups, etc.) might be interested in Panasas (except they already have Lustre, with support via ClusterFS). The most direct competitors to Panasas, then, are the NAS appliance vendors. It is interesting that most of the large research organizations have historically been heavy AFS users as well, and many still are. AFS is widely used for the cross-site sharing of data, but it simply doesn't perform well enough to be competitive with NAS appliances running NFS.

My suggestion, then, to the OpenAFS community is to get serious about helping people get started with AFS, and compete with the Acopias of the world. Also, look into improving AFS performance: as AFS is more complex than NFS, there is likely never to be a performance comparison in favor of AFS; however, parallel filesystem accesses have been around for quite a while, and an implementation of it in AFS could be very interesting.