Saturday, December 27, 2008

Parsing "real" C code in Ruby using CAST.

CAST, a C parser written in Ruby, is a pretty nice piece of machinery. It has one drawback: it does not include a preprocessor, and it chokes on the "#" line markers that CPP leaves in its output to record the originating file. So, as a first step toward making it more fun for real-world use, I made it friendlier when parsing CPP-digested files.

The algorithm (if I may call it that) is dead simple: read the lines from the CPP-digested file and build a "compressed" copy with the empty lines and #-lines removed. Some gcc-isms also need gsub()-style replacement before a line goes into the compressed content. For each line of compressed content, record its line number in the CPP output, and track the pre-CPP filename and line number as well. When/if CAST barfs, catch the exception and parse the line number out of the message. Then look it up in the info recorded earlier and raise your own exception, this time supplying goodies such as the original source file and its line number.
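To make the bookkeeping concrete, here is a minimal sketch of the compress-and-map step on a made-up sample of CPP output. The names compress and SAMPLE and the file names main.c and util.h are hypothetical and used only for illustration; the gcc-ism cleanup and the actual CAST call are left out.

```ruby
# Hypothetical sample of cpp output. A marker of the form
#   # N "file" [flags]
# says that the next line comes from line N of that file.
SAMPLE = <<~CPP
  # 1 "main.c"
  # 1 "util.h" 1
  int helper(int x);

  # 2 "main.c" 2
  int main(void)
  {
    return helper(1);
  }
CPP

# Returns [kept_lines, origins]; origins[i] describes kept_lines[i].
def compress(cpp_output)
  kept = []
  origins = []
  file = nil
  lineno = 1
  cpp_output.each_line do |raw|
    line = raw.strip
    if line =~ /^#\s+(\d+)\s+"([^"]+)"/
      # line marker: reset the position, keep nothing
      lineno = $1.to_i
      file = $2
      next
    end
    unless line.empty?
      kept << line
      origins << { file: file, lineno: lineno }
    end
    # blank lines are dropped but still advance the original line number
    lineno += 1
  end
  [kept, origins]
end

kept, origins = compress(SAMPLE)

# Pretend the parser complained about line 3 of the compressed text.
# Parser line numbers are 1-based, so the lookup uses index 3 - 1.
reported = 3
where = origins[reported - 1]
puts "#{where[:file]}:#{where[:lineno]}: #{kept[reported - 1]}"
```

The point of keeping origins as a parallel array is that a line number reported against the compressed text translates to a source location with a single index lookup.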

With the resulting wrapper I can parse a lot of my own sources. The cool part is that it catches the places where I have "stretched" gcc's kindness a bit: CAST's parser is stricter than GCC. I'm not sure whether it is "pure C99", but it's good to have something like this.

The only tiny gotcha that caught me while coding this little 80-line piece: line number == array index + 1. I named one variable "linenumber" when it was really counting the array index, and duly shot myself in the foot a bit later. Sometimes I wonder why we don't all just start counting from zero.
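A small illustration of the off-by-one, as a sketch with made-up data (the variable names are hypothetical): each_with_index counts from 0, while with_index(1) yields 1-based numbers directly, so a variable called lineno can actually hold a line number.

```ruby
lines = ["int a;", "int b;", "int c;"]

# each_with_index counts from 0, so the line number is idx + 1.
by_index = lines.each_with_index.map { |line, idx| [idx + 1, line] }

# each.with_index(1) starts counting at 1, so no correction is needed.
by_lineno = lines.each.with_index(1).map { |line, lineno| [lineno, line] }

# Going the other way: line number N lives at array index N - 1.
second_line = lines[2 - 1]
```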

Update: here's the Ruby hack that keeps CAST from barfing on the preprocessed file. Again, "real_code.c" is the file *after* cpp processing. One could indeed write one's own preprocessor, but I just reused cpp.


require 'rubygems'
require 'cast'


#
# "prepare" a line - strip any of the gcc internal stuff from it,
# and any other things we do not understand.
#

def prepare_line(str)
  line = str.strip
  line = line.gsub(/\(\(__.*\)\)/, "").gsub(/__(attribute|extension)__/, "")
  line = line.gsub(/__const /, "const ")
  line = line.gsub(/__restrict /, "")
  line = line.gsub(/__asm__\s*\(.*\)/, "")
  line
end

#
# "prepare" the preprocessed output in array "lines"
# taken from file fname (filename used only for error messages)
#
def prepare_lines(fname, lines)
  # blob that will be parseable by CAST
  real_text = []
  # info about the lines in the blob
  real_info = []
  original_file = fname
  original_lineno = 1
  lines.each_with_index do |line, idx|
    real_line = prepare_line(line)

    if real_line =~ /^#\s+(\d+)\s+"([^"]+)"((?:\s+.*)?)$/
      # cpp line marker: the *next* line is line $1 of file $2
      # ($3 holds the optional flags, which we ignore)
      original_lineno = $1.to_i
      original_file = $2
      next
    elsif real_line =~ /^#/
      # any other #-line is something we do not understand
      raise "unexpected #-line at #{fname}:#{idx + 1}: #{real_line}"
    end

    unless real_line.empty?
      real_text << real_line
      real_info << { :file => original_file, :lineno => original_lineno,
                     :debug_file => fname, :debug_lineno => idx + 1 }
    end

    # empty lines are dropped from the blob but still count in the original
    original_lineno += 1
  end
  [real_text, real_info]
end

#
# parse the preprocessed file fname with CAST; on a parse error,
# re-raise with the original file name and line number
#
def parse_c(fname)
  lines = File.read(fname).split("\n")

  text, info = prepare_lines(fname, lines)
  blobtext = text.join("\n")

  parser = C::Parser.new
  parser.type_names << "__builtin_va_list"
  parser.type_names << "double"

  begin
    puts "Parse start..."
    tree = parser.parse(blobtext)
    puts "Parse end..."
    tree
  rescue C::ParseError => e
    puts "Got exception: #{e.inspect}"
    if e.message =~ /^(\d+):(.*)$/
      errline = $1.to_i
      errmsg = $2
      # parser line numbers are 1-based, array indexes are 0-based
      ei = info[errline - 1]
      raise "Error in #{ei[:file]}:#{ei[:lineno]} " \
            "(#{ei[:debug_file]}:#{ei[:debug_lineno]}) : #{errmsg}"
    else
      raise "unparseable error message from parser: #{e.message}"
    end
  end
end

tree = parse_c("real_code.c")

# puts tree.to_s
p tree

5 comments:

Etienne Savard said...

Hi,

I'm pretty new to Ruby, and I am also trying to use Cast to parse real-life C files.

I'm wondering how you modified Cast to ignore preprocessor directives?

Thanks!

Etienne Savard

Andrew Yourtchenko said...

hi Etienne,

just to clarify - I've been parsing what's left *after* the CPP preprocessing, so the directives were already preprocessed by the preprocessor ;) It was mostly the markup "this line is line #10 of file foo.c" that I got rid of.

If that's what you are looking for, I'll see if I can still find the code and will update the post with it.

Andrew Yourtchenko said...

Also, in real-world code you'll almost certainly encounter gcc-isms, which I simply trashed with a very ugly regex hack. If you are doing anything beyond mental entertainment (which is mostly what it was in my case), you'd need to handle them somehow.

One pointer you might be interested in is clang (http://clang.llvm.org/): it claims to be gcc-compatible, so it may be useful if you can hook it up to Ruby.

I think for my purposes (getting rid of pretty obscene-looking C syntax constructs which are currently done with very ugly macros) I am going to just use Lua with some dynamic code bound to it.

Etienne Savard said...

Hi Andrew,

Thanks for your response.

I have trouble with cast when parsing a plain C file. I don't need or plan to parse the source code after the CPP preprocessing. I have to deal with the files without using the compiler (which is MSVC8 in my case).

I get an error from my Ruby script when cast encounters the first "#include" in my file. If I remove the preprocessor directives from my file, the parsing works great.

I have around 30 files to parse (all ANSI C). If I could parse the files without removing the preprocessor directives, it would be great. Your code to remove the # from the preprocessor output could work in my case also.

Thanks.

Etienne.

Andrew Yourtchenko said...

hi Etienne,

ok, then this little hack is probably not going to be very useful to you, as I was parsing the C code *after* it was handled by the preprocessor.

You will need to write your own preprocessor in case you want to avoid using cpp.