Having decided to try to create Ruby bindings for TensorFlow, I needed to understand the scope of the project. TensorFlow is a large, complex library:
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C++                       4840   1492014   168362    186964  1136688     115178
C Header                  2709    445434    60301    123693   261440      15052
Python                    2320    900325    67204     89500   743621      34562
C                           32      3619      430      1062     2127        255
Swig                        31      4351      527       438     3386          0
...                         ..        ..       ..        ..       ..          .
───────────────────────────────────────────────────────────────────────────────
Total                    11262   3323988   326695    444888  2552405     174047
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $102,068,655
Estimated Schedule Effort 88.975387 months
Estimated People Required 135.886765
───────────────────────────────────────────────────────────────────────────────
In addition, the C API is documented solely in its header files, and its usage is shown only in C++ unit tests and Python code (those unit tests are invaluable!).
As I’ve written previously, tackling such a large code base requires knowing how to read source code and setting up a dev environment. When a C API call crashes, guessing what happened is a useless strategy. On the other hand, having a crash trigger a breakpoint in a debugger is the path to enlightenment.
But sometimes things are easier said than done.
Issue #1 – Create a Debug TensorFlow Library
Since I generally develop on Windows, I started by trying to make a Windows build. After I set up Bazel, TensorFlow's build tool, the compile failed with various syntax errors.
OK, fine, let’s try macOS. That also failed – and in a more inscrutable way than on Windows – see issue #32998.
Alright, then let’s compile it on Fedora; that will surely work. And it did, but the build on my laptop (ThinkPad Yoga Gen 3, 4-core processor) took about 5 hours and 45 GB of disk space (just about filling up the partition). Now that I had a build, it was time to set a breakpoint in the library and make sure it got triggered when making a C API call. And it did, after about 90 seconds! That’s right, 90 seconds. The TensorFlow library is so big, and has so many symbols, that it takes gdb on my laptop 90 seconds to even load. So that’s not going to work. Back to Windows.
Fixing the compile errors, all in the C++ code, was easy. But then the linker stage failed. Once the PDB (a separate symbol file generated by the Microsoft Visual C++ linker) reached around 4 GB, the linker crashed. I wondered if the MSVC linker didn’t support large files and I was out of luck. After a day of reading a lot about the MSVC linker, experimenting, and patching TensorFlow, I finally figured out the right set of linker flags to make MSVC happy:
bazel --output_user_root= build --config opt --linkopt=/OPT:REF --linkopt=/DEBUG:FASTLINK --compilation_mode=dbg //tensorflow/tools/lib_package:libtensorflow
Sweet.
Once again, I tried setting a breakpoint in the library. In Visual Studio, it took a few seconds to trigger the breakpoint – much better – but then getting the call stack and local variable information took about 30 seconds. Not great, but 3x better than Linux. So Windows it is.
I haven’t submitted these patches to TensorFlow since they are mostly just hacks to get a working build.
Issue #2 – MSVC Ruby Build
Since TensorFlow uses MSVC on Windows, I needed an MSVC build of Ruby (because the bindings use libffi and you can’t mix MSVC libraries and GCC libraries). That is a very non-standard Ruby setup; almost everyone using Ruby on Windows uses a GCC build made via MinGW-w64 or, more recently, WSL. Ruby itself builds fine, but many gems do not, because they tend to use or wrap C code that assumes C99 support, which MSVC does not fully provide.
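To make the libffi point a bit more concrete, here is a minimal sketch of what an FFI-style call from Ruby into the TensorFlow C API can look like. It uses Fiddle, the libffi wrapper that ships with Ruby, purely for illustration; it is not necessarily how the bindings themselves are written. The library filename and the TF_LIB_PATH environment variable are assumptions made for this example, while TF_Version is a function declared in the C API header.

```ruby
require 'fiddle'

# Assumed default for illustration only. The real filename depends on the
# platform and build (for example tensorflow.dll on Windows).
DEFAULT_TF_LIB = ENV.fetch('TF_LIB_PATH', 'libtensorflow.so')

# Returns the TensorFlow version string, or nil if the library or the
# TF_Version symbol cannot be loaded.
def tensorflow_version(lib_path = DEFAULT_TF_LIB)
  # Load the shared library into the running Ruby process.
  handle = Fiddle.dlopen(lib_path)

  # C declaration in c_api.h: const char* TF_Version(void)
  tf_version = Fiddle::Function.new(handle['TF_Version'], [], Fiddle::TYPE_VOIDP)

  # The call returns a Fiddle::Pointer; to_s reads the NUL-terminated string.
  tf_version.call.to_s
rescue Fiddle::DLError
  nil
end

version = tensorflow_version
puts(version ? "TensorFlow #{version}" : "could not load #{DEFAULT_TF_LIB}")
```

A trivial call like this is also a handy first target for a debugger breakpoint, since it exercises library loading without building a graph.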
Issue #3 – Google Protobufs
Sure enough, Google’s protobuf Ruby bindings didn’t compile. Since protobufs are used a lot in TensorFlow, I had to fix that. Patch submitted.
Issue #4 – NArray
The next gem that didn’t compile was NArray, the Ruby equivalent of NumPy. Since multidimensional arrays are a key part of machine learning, I had to work around that too. This one turned out to be an issue with NMake. Issue submitted. I also later found another issue with NArray on Windows, see issue #142.
And finally I had a working dev environment. Time to complete – about five days of on and off experimentation. Ouch.