TensorFlow Dev Environment

Having decided to try creating Ruby bindings for TensorFlow, I needed to understand the scope of the project. TensorFlow is a large, complex library:

───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C++                       4840   1492014   168362    186964  1136688     115178
C Header                  2709    445434    60301    123693   261440      15052
Python                    2320    900325    67204     89500   743621      34562
C                           32      3619      430      1062     2127        255
Swig                        31      4351      527       438     3386          0
...                         ..        ..       ..        ..       ..          .
───────────────────────────────────────────────────────────────────────────────
Total                    11262   3323988   326695    444888  2552405     174047
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $102,068,655
Estimated Schedule Effort 88.975387 months
Estimated People Required 135.886765
───────────────────────────────────────────────────────────────────────────────

In addition, the C API is documented solely in its header files, and its usage only through C++ unit tests and Python code (those unit tests are invaluable!).

As I’ve written previously, tackling such a large code base requires knowing how to read source code and setting up a dev environment. When a C API call crashes, guessing what happened is a useless strategy. On the other hand, having a crash trigger a breakpoint in a debugger is the path to enlightenment.

But sometimes things are easier said than done.

Issue #1 – Create a Debug TensorFlow Library

Since I generally develop on Windows, I started by trying to make a Windows build. After getting Bazel, the TensorFlow build program, set up, the compile failed with various syntax errors.

OK, fine, let’s try macOS. That also failed, and in a more inscrutable way than on Windows (see issue #32998).

Alright, then let’s compile it on Fedora; that will surely work. And it did, but the build on my laptop (ThinkPad Yoga Gen 3, 4-core processor) took about 5 hours and 45 GB of disk space (just about filling up the partition). Now that I had a build, the next step was to set a breakpoint in the library and make sure it gets triggered when making a C API call. And it did, after about 90 seconds! That’s right, 90 seconds. The TensorFlow library is so big, and has so many symbols, that it takes gdb on my laptop 90 seconds to even load. So that’s not going to work. Back to Windows.

Fixing the compile errors, all in the C++ code, was easy. But then the linker stage failed. Once the PDB (a separate symbol file generated by the Microsoft Visual C++ linker) reached around 4 GB, the linker crashed. I wondered if the MSVC linker didn’t support large files and I was out of luck. After a day of reading a lot about the MSVC linker, experimenting, and patching TensorFlow, I finally figured out the right set of linker flags to make MSVC happy.

bazel --output_user_root= build --config opt --linkopt=/OPT:REF --linkopt=/DEBUG:FASTLINK --compilation_mode=dbg //tensorflow/tools/lib_package:libtensorflow

Sweet.

Once again, I tried setting a breakpoint in the library. In Visual Studio, it took a few seconds to trigger the breakpoint (much better), but then getting the call stack and local variable information took about 30 seconds. Not great, but 3x better than Linux. So Windows it is.

I haven’t submitted these patches to TensorFlow since they are mostly just hacks to get a working build.

Issue #2 – MSVC Ruby Build

Since TensorFlow uses MSVC on Windows, I needed an MSVC build of Ruby (because the bindings use libffi and you can’t mix MSVC libraries and GCC libraries). That is a very non-standard Ruby setup; almost everyone using Ruby on Windows uses a GCC build made via MinGW-w64 or, more recently, WSL. Ruby itself builds fine, but many gems do not, because they tend to use or wrap C code that assumes C99 support, which MSVC does not fully provide.
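
To make the libffi point concrete, here is a minimal sketch of calling the TensorFlow C API from Ruby through Fiddle, the libffi wrapper that ships with Ruby. This is not the code from the actual bindings (which may use a different FFI library); the helper name tensorflow_version and the library file names are hypothetical. The only TensorFlow symbol used is TF_Version, which the C API header declares as taking no arguments and returning a C string.

```ruby
require 'fiddle'

# Hypothetical helper (not part of any real binding): returns the version
# string reported by the TensorFlow C API, or nil if the shared library
# or the symbol cannot be loaded.
def tensorflow_version(library_path)
  handle = Fiddle.dlopen(library_path)

  # c_api.h declares: const char* TF_Version(void)
  # The return value is treated as a raw pointer and read as a C string.
  tf_version = Fiddle::Function.new(handle['TF_Version'], [], Fiddle::TYPE_VOIDP)
  tf_version.call.to_s
rescue Fiddle::DLError
  nil
end

# The file name below is an assumption; adjust it to wherever the build
# placed the library (for example a .dll on Windows or a .so on Linux).
puts tensorflow_version('tensorflow.dll') || 'TensorFlow library not found'
```

Because the library is loaded into the Ruby process at runtime, it and the Ruby interpreter need to agree on things like the C runtime, which is why the toolchain mismatch matters here.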

Issue #3 – Google Protobufs

Sure enough, Google’s protobuf Ruby bindings didn’t compile. Since protobufs are used a lot in TensorFlow, I had to fix that. Patch submitted.

Issue #4 – NArray

The next gem that didn’t compile was NArray, the Ruby equivalent of NumPy. Since multidimensional arrays are a key part of machine learning, I had to work around that too. This one turned out to be an issue with NMake. Issue submitted. I also later found another issue with NArray on Windows, see issue #142.

And finally I had a working dev environment. Time to complete – about five days of on and off experimentation. Ouch.
