While they say they have only really tested it under Red Hat and Fedora, the binaries work just fine for me on Ubuntu. I have now also installed it on all the machines at Trinity. I still need to do a direct speed test between it and the Intel compiler on my cluster, but I haven't gotten to that yet. The set of optimizations that they list is quite impressive, and from what I have seen so far it certainly isn't significantly slower than Intel on my home machine.
There is one very odd thing that I have noticed on my home machine, though. My rings code uses OpenMP for multithreading, and big simulations typically keep all eight cores of my home machine busy, while smaller simulations often sit at a load of 5-7. I started a large simulation built with x86 Open64 and, as expected, it keeps all eight cores busy. However, smaller simulations that run through steps really quickly sit at loads of only 3-4. The load level is also far less ragged than what I saw with the Intel compiler. At this point I can't say whether this is a good thing or not. At first I worried that the code wasn't being properly parallelized. Now I'm wondering if the Open64 runtime just does a better job of pinning work to specific cores and keeping those cores loaded, instead of letting threads bounce around between cores. That would certainly be a smart thing to do, since it would improve cache performance, but only benchmarking and profiling will tell me for certain.