filtering out gibberish

So, this is a bit of a long story. As background, one of my projects is building a DSpace repository for 3D architectural models. These models are stored in a format unique to the tool the researchers use, which means they are big binary files. There is interesting text data in these files, so a full-text index would help people searching through them find something they might otherwise have missed. I mean, we'll have good metadata for these files, but not all data is surfaced by even the best metadata. The typical approach for DSpace repositories is to convert the data to a format DSpace can index, like a PDF or a CSV file. But these data files are not the kind of thing you just convert. And no text extraction tool exists for this data format. Until today. :-)

I've been thinking about this problem for a few months. I've even run the files through the very handy strings program. Unfortunately, strings is very exuberant and pulls all kinds of gibberish out of a binary file. So... not very useful for a full-text index. I shelved the idea. Until yesterday.
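To see why strings is so exuberant, it helps to look at roughly what it does. The Python sketch below is an approximation, not the real tool: it assumes the common default behavior of keeping any run of four or more printable ASCII bytes, and the sample bytes and the name extract_strings are made up for illustration.

```python
import re

# Assumption: mimic the usual default of `strings`, which keeps any run of
# at least four printable ASCII bytes (0x20 through 0x7e).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data: bytes):
    """Return every printable run found in a blob of binary data."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


# Hypothetical bytes standing in for a slice of a binary model file.
sample = b"\x00\x01Oak beam, 4m span\x00\xff\x10Qz#9\x02ab\x00"
print(extract_strings(sample))
# prints ['Oak beam, 4m span', 'Qz#9']
```

The real text comes through, but so does "Qz#9", which is just four bytes that happen to be printable. Multiply that by a big binary file and you get a lot of gibberish.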

Yesterday, during a meeting, I had to talk about a tool I've wanted to play with for a long time, the Data Science Toolkit. I pulled up the URL so I could talk meaningfully about it, but didn't get to make more than a passing mention. After the meeting, before closing the tab I had open, I noticed the Text to Sentences API for the DSTK. I tinkered with it. It had never occurred to me that you could filter out gibberish and retrieve meaningful text. This was the missing piece! I poked around a bit, looking for code, but ran out of time to investigate further, so I instead shouted out to the void (i.e. I tweeted):

Hey, Digital Humanities peeps: anyone know of a simple command line tool similar to the DSTK's text2sentences API?

--- Hardy Pottinger (@HardyPottinger) April 19, 2017

and, amazingly, I got a response:

@HardyPottinger You just run their ruby file from the command line: there's even some commented out code at the end for command line

--- willismonroe (@willismonroe) April 19, 2017

So, I grabbed a copy of cruftstripper.rb, un-commented the last three lines, and renamed the script "sentences". Then I pointed strings at a data file and piped that gibberish through my new "sentences" script. Voila:

strings big-data-file.3d | sentences

And I got useful data, which I can't share with you because I don't have permission to do so. But you'll have to trust me: it's beautiful.
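Since I can't show the real output, here is a rough Python sketch of the kind of filtering involved. To be clear, this is not cruftstripper.rb's algorithm: the heuristics (a minimum word count, a mostly-letters check, and a vowel check), the thresholds, and the names looks_like_text and filter_lines are all assumptions picked for illustration.

```python
import sys

VOWELS = set("aeiouAEIOU")


def looks_like_text(line, min_words=3, min_letter_ratio=0.7):
    """Heuristic guess at whether a line is prose rather than binary noise."""
    words = line.split()
    if len(words) < min_words:
        return False
    # Most non-space characters in real prose are letters.
    chars = [c for c in line if not c.isspace()]
    letters = sum(c.isalpha() for c in chars)
    if letters / len(chars) < min_letter_ratio:
        return False
    # Most real words contain at least one vowel.
    with_vowel = sum(1 for w in words if any(c in VOWELS for c in w))
    return with_vowel / len(words) >= 0.8


def filter_lines(lines):
    """Yield only the stripped lines that pass the heuristic."""
    for line in lines:
        line = line.strip()
        if looks_like_text(line):
            yield line


def main():
    for line in filter_lines(sys.stdin):
        print(line)


# Pass "-" as the only argument to read from standard input.
if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    main()
```

Saved under a hypothetical name like keep_text.py, it would slot into the same pipeline: strings big-data-file.3d | python3 keep_text.py -. A crude filter like this will drop short labels and keep some junk, so treat the thresholds as knobs to tune against your own files.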

And the really cool thing about this approach is, it works for any mysterious data format a repository might receive.

Now, what I have left to do is wire this mess up into a media filter script for DSpace, which will be pretty easy. The tricky bit will be figuring out a way to do it so that the code can be merged into the DSpace application by default. Because that's what I really want to happen here: give DSpace the ability to produce a full-text index from any file.