Saturday, December 27, 2008

Parsing "real" C code in Ruby using CAST.

CAST, a C parser written in Ruby, is a pretty nice piece of machinery. It has one drawback: there is no preprocessor, and it chokes on the "#" origin-file marks left by CPP. So, as a first step to making it more fun for real-world use, I made it friendlier when it comes to parsing CPP-digested files.

The algorithm (if I may call it that) is dead simple: read the lines from the CPP-digested file and create a "compressed" content, with empty lines and #-lines removed. Some gcc-isms also need gsub()-style replacement before the lines are stuffed into the compressed content. For each line of compressed content, record its line number in the CPP output, and track the pre-CPP filename and line number. When/if CAST barfs, catch the exception and parse the line number from the message. Then look it up in the info recorded earlier and raise your own exception, this time supplying the miscellaneous goodies like the original source file and its line number.
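
The bookkeeping part of this can be sketched in a few lines of Ruby. This is only a minimal illustration, not the wrapper itself: the names `compress`, `LINEMARKER` and `SAMPLE` are made up for the example, the input is hand-written rather than real cpp output, and the gcc-ism cleanup is left out.

```ruby
# Minimal sketch of the "compressed content + line map" idea.
# compress, LINEMARKER and SAMPLE are hypothetical names used for illustration.
LINEMARKER = /^#\s+(\d+)\s+"([^"]+)"/

def compress(lines)
  text = []   # lines that would be handed to the parser
  info = []   # info[i] says where text[i] came from
  file = nil
  lineno = 1
  lines.each do |raw|
    line = raw.strip
    if line =~ LINEMARKER
      # A marker names the file and line number of the line that follows it.
      lineno = $1.to_i
      file = $2
    else
      unless line.empty?
        text << line
        info << { :file => file, :lineno => lineno }
      end
      lineno += 1
    end
  end
  [text, info]
end

SAMPLE = [
  '# 1 "foo.c"',
  'int x;',
  '',
  '# 10 "foo.h"',
  'int y;',
  '# 4 "foo.c"',
  'int z;'
]

text, info = compress(SAMPLE)
# An error reported on line 2 of the compressed text maps back via info[2 - 1].
origin = info[2 - 1]
puts "#{origin[:file]}:#{origin[:lineno]}"   # prints "foo.h:10"
```

Keeping `text` and `info` as parallel arrays means the lookup after a parse error is a single index operation, at the cost of having to remember the off-by-one between line numbers and array indices.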

With the resulting wrapper parser I can parse a lot of my own sources. The cool part is that it catches the places where I have "stretched" gcc's kindness a bit: CAST's parser is stricter than GCC. I'm not sure if it is "pure C99", but it's a good thing to have something like this.

The only tiny gotcha that caught me while coding this little 80-line piece: line number == array index + 1. I named one variable "linenumber" when it was really counting the array index, and indeed shot myself in the foot a bit later. Sometimes I wonder why we don't all start counting from zero, really.
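
To make the gotcha concrete, here is a small sketch (the variable names are illustrative, not the ones from the wrapper). `each_with_index` yields zero-based indices, so anything reported as a line number needs the `+ 1`; alternatively, one-based counting can be requested explicitly with `each.with_index(1)`.

```ruby
lines = ["int a;", "int b;", "int c;"]

# each_with_index counts from zero: this is an array index, not a line number.
by_index = []
lines.each_with_index { |line, idx| by_index << [idx, line] }

# each.with_index(1) starts counting at one, which matches line numbers.
by_lineno = []
lines.each.with_index(1) { |line, lineno| by_lineno << [lineno, line] }

# Going back from a reported line number to the array needs the inverse step.
lineno = 2
puts lines[lineno - 1]   # prints "int b;"
```

Naming the block variable after what it actually counts (`idx` versus `lineno`) is most of the cure.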

Update: here's the Ruby hack that lets CAST parse the preprocessed file without barfing. Again, "real_code.c" is the file *after* the cpp processing. One could indeed write their own preprocessor, but I just reused cpp.

require 'rubygems'
require 'cast'

# "prepare" a line - strip any of the gcc internal stuff from it,
# and any other things we do not understand.
def prepare_line(str)
  line = str.strip
  line = line.gsub(/\(\(__.*\)\)/, "").gsub(/__(attribute|extension)__/, "")
  line = line.gsub(/__const /, "const ")
  line = line.gsub(/__restrict /, "")
  line = line.gsub(/__asm__\s*\(.*\)/, "")
  line
end

# "prepare" the preprocessed output in array "lines"
# taken from file fname (filename used only for error messages)
def prepare_lines(fname, lines)
  # blob that will be parseable by CAST
  real_text = []
  # info about the lines in the blob
  real_info = []
  original_file = fname
  original_lineno = 1
  lines.each_with_index do |line, idx|
    real_line = prepare_line(line)

    if real_line =~ /^#\s+(\d+)\s+"([^"]+)"((?:\s+.*)?)$/
      # origin mark left by cpp: it names the file and line number
      # of the line that FOLLOWS it, so do not increment here
      num = $1.to_i
      fn = $2
      misc = $3 # trailing flags, unused
      original_file = fn
      original_lineno = num
      # puts "== set #{original_file}:#{original_lineno}"
    else
      # any other #-line is something we do not understand
      raise real_line if real_line =~ /^#/
      unless real_line =~ /^$/
        real_text << real_line
        # idx is an array index, hence the +1 to get a line number
        real_info << { :file => original_file, :lineno => original_lineno,
                       :debug_file => fname, :debug_lineno => idx+1 }
        # puts "#{original_file}:#{original_lineno}:#{real_line}"
      end
      original_lineno += 1
    end
  end
  [ real_text, real_info ]
end

def parse_c(fname)
  lines = File.read(fname).split("\n")

  text, info = prepare_lines(fname, lines)
  blobtext = text.join("\n")

  parser = C::Parser.new
  parser.type_names << "__builtin_va_list"
  parser.type_names << "double"
  tree = nil

  begin
    print "Parse start...\n"
    tree = parser.parse(blobtext)
    print "Parse end...\n"
  rescue StandardError => e
    puts "Got exception: #{e.inspect}"
    if e.message =~ /^(\d+):(.*)$/
      errline = $1.to_i
      errmsg = $2
      # line number == array index + 1
      ei = info[errline-1]
      # a few lines of context around the error, handy when debugging
      src = text[[errline-2, 0].max .. errline+1].join("\n")
      # puts "source lines: #{src}"

      raise "Error in #{ei[:file]}:#{ei[:lineno]} (#{ei[:debug_file]}:#{ei[:debug_lineno]}) : #{errmsg}"
    else
      raise "unparseable error message from parser"
    end
  end
  tree
end

tree = parse_c("real_code.c")

# puts tree.to_s
p tree


Etienne Savard said...


I'm pretty new to Ruby and I too am trying to use Cast to parse real-life C files.

I'm wondering how you modified Cast to ignore preprocessor directives?


Etienne Savard

Andrew Yourtchenko said...

hi Etienne,

just to clarify - I've been parsing what's left *after* the CPP preprocessing, so the directives were already preprocessed by the preprocessor ;) It was mostly the markup "this line is line #10 of file foo.c" that I got rid of.

If that's what you are looking for, I'll see if I can still find the code and will update the post with it.

Andrew Yourtchenko said...

Also, for real-world code you'll most certainly encounter gcc-isms, which I simply trashed with a very ugly regex hack; if you are doing anything beyond mind entertainment (which it mostly was in my case), you'd need to handle them somehow.

A pointer you might be interested in is clang: it claims to be gcc-compatible, so it may be useful if you hook it up to Ruby.

I think for my purposes (getting rid of pretty obscene-looking C syntax constructs which are currently done with very ugly macros) I am going to just use Lua with some dynamic code bound to it.

Etienne Savard said...

Hi Andrew,

Thanks for your response.

I have trouble with cast when parsing plain C files. I don't need or plan to parse the source code after the CPP preprocessing. I have to deal with the files without using the compiler (which is MSVC8 in my case).

I get an error from my ruby script when cast encounters the first "#include" in my file. If I remove the preprocessor directives from my file, the parsing works great.

I have around 30 files to parse (all ANSI C). If I could parse the files without removing the preprocessor directives, it would be great. Your code to remove the # from the preprocessor output could work in my case also.



Andrew Yourtchenko said...

hi Etienne,

ok then probably this little hack is not going to be very useful to you - as I was parsing the C code *after* it was handled by the preprocessor.

You will need to write your own preprocessor in case you want to avoid using cpp.
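
Short of a full preprocessor, a crude middle ground for files that already parse once the directives are removed is to blank the directives out while keeping the line count, so that parser line numbers still match the original file. The sketch below is only an illustration under those assumptions: `blank_directives` is a hypothetical helper, it expands no macros and pulls in nothing from #include'd headers, and it will be fooled by a "#" at the start of a line inside a multi-line comment or string.

```ruby
# Hypothetical helper: replace preprocessor directives with empty lines so the
# remaining text keeps its original line numbers. Macros are NOT expanded and
# headers are NOT included, so this is not a real preprocessor.
def blank_directives(source)
  in_directive = false
  source.split("\n", -1).map do |line|
    if in_directive || line =~ /^\s*#/
      # A trailing backslash continues the directive onto the next line.
      in_directive = line.end_with?("\\")
      ""
    else
      line
    end
  end.join("\n")
end

code = <<'EOS'
#include <stdio.h>
#define MAX(a, b) \
  ((a) > (b) ? (a) : (b))
int limit = 10;
EOS

puts blank_directives(code)
```

Type names that would normally come from the skipped headers are unknown to the parser; they can be registered by hand through `parser.type_names`, the same way the script in the post does for `__builtin_va_list`.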