The algorithm (if I may call it that) is dead simple: read the lines from the CPP-digested file and build a "compressed" copy of it, with the empty lines and the #-lines removed. Some gcc-isms also need a gsub()-style replacement before the lines go into the compressed content. For each line of the compressed content, record its line number in the CPP output, and also track the pre-CPP filename and line number. If CAST barfs, catch the exception and parse the line number out of the message. Then look that line up in the info generated earlier and raise your own exception, this time supplying the extra goodies: the pre-CPP file and its line number.
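A minimal sketch of that bookkeeping on a tiny inline sample, without CAST involved. The helper names `compress` and `translate_error` and the sample lines are made up for illustration, and the "<line>:<message>" error format is only an assumption based on the regexp used in the code further down.

```ruby
# Sketch only: helper names and sample data are hypothetical.

# Returns [compressed_lines, info]; info[i] describes compressed_lines[i].
def compress(lines, fname)
  linemarker = /^#\s+(\d+)\s+"([^"]+)"/
  text = []
  info = []
  file = fname
  lineno = 1
  lines.each do |raw|
    line = raw.strip
    if (m = linemarker.match(line))
      # A marker gives the file and line number of the line that follows it.
      lineno = m[1].to_i
      file = m[2]
      next
    end
    unless line.empty?
      text << line
      info << { file: file, lineno: lineno }
    end
    # Empty lines are dropped from the blob but still count in the original.
    lineno += 1
  end
  [text, info]
end

# Turns a message of the (assumed) form "<line>:<text>" into one that
# points at the pre-CPP location. Unknown formats pass through unchanged.
def translate_error(message, info)
  m = /\A(\d+):\s*(.*)\z/m.match(message)
  return message unless m
  blob_line = m[1].to_i
  return message if blob_line < 1
  origin = info[blob_line - 1] # blob line numbers are 1-based
  return message unless origin
  "#{origin[:file]}:#{origin[:lineno]}: #{m[2]}"
end

sample = [
  '# 1 "foo.c"',
  'int a;',
  '',
  '# 1 "bar.h" 1',
  'int b;',
  '# 4 "foo.c" 2',
  'int c'
]

text, info = compress(sample, "real_code.c")
puts text.inspect
puts translate_error("3: parse error", info)
```

The third line of the compressed blob is `int c`, which the last marker places at line 4 of foo.c, so that is the location the translated message reports.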
With the resulting wrapper parser it seems I can parse a lot of my personally written sources. The cool part is that it catches the places where I have "stretched" gcc's kindness a bit: CAST's parser is stricter than GCC. I'm not sure whether it is "pure C99", but it's good to have something like this.
The only tiny gotcha that caught me while coding this little 80-line piece: line number == array index + 1. I named one variable "linenumber" when it was really counting the array index, and so indeed shot myself in the foot a bit later. Sometimes I wonder why we don't all start counting from zero, really.
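One way to keep the two apart is to make the conversion explicit. The sketch below uses invented helper names (`index_to_lineno`, `lineno_to_index`), so treat it as an illustration rather than part of the original script. The guard matters in Ruby because a negative array index silently counts from the end instead of failing.

```ruby
# Hypothetical helpers: the names are invented for this sketch.

# Array index (0-based) to line number (1-based).
def index_to_lineno(index)
  index + 1
end

# Line number (1-based) to array index (0-based).
def lineno_to_index(lineno)
  # Without this check, line 0 would become index -1, which Ruby
  # treats as "last element" rather than an error.
  raise ArgumentError, "line numbers start at 1" if lineno < 1
  lineno - 1
end

lines = ["int a;", "int b;", "int c;"]

lines.each_with_index do |line, idx|
  puts "#{index_to_lineno(idx)}: #{line}"
end

# Alternatively, let the enumerator start counting at 1.
lines.each.with_index(1) do |line, lineno|
  puts "#{lineno}: #{line}"
end
```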
Update: here's the Ruby hack that allows CAST to not barf on the preprocessed file. Again, "real_code.c" is the file *after* the cpp processing. One could indeed write their own preprocessor, but I just reused cpp.
require 'rubygems'
require 'cast'
#
# "prepare" a line - strip any of the gcc internal stuff from it,
# and any other things we do not understand.
#
def prepare_line(str)
  line = str.strip
  line = line.gsub(/\(\(__.*\)\)/, "").gsub(/__(attribute|extension)__/, "")
  line = line.gsub(/__const /, "const ")
  line = line.gsub(/__restrict /, "")
  line = line.gsub(/__asm__\s*\(.*\)/, "")
  line
end
#
# "prepare" the preprocessed output in array "lines"
# taken from file fname (filename used only for error messages)
#
def prepare_lines(fname, lines)
  # blob that will be parseable by CAST
  real_text = Array.new
  # info about the lines in the blob
  real_info = Array.new
  original_file = fname
  original_lineno = 1
  lines.each_with_index do |line, idx|
    real_line = prepare_line(line)
    if real_line =~ /^#\s+(\d+)\s+"([^"]+)"((?:\s+.*)?)$/
      num = $1.to_i
      fn = $2
      misc = $3
      original_file = fn
      original_lineno = num # minus something ?
      # puts "== set #{original_file}:#{original_lineno}"
      next
    elsif real_line =~ /^#/
      raise real_line
    end
    unless real_line =~ /^$/
      real_text << real_line
      real_info << { :file => original_file, :lineno => original_lineno,
                     :debug_file => fname, :debug_lineno => idx+1 }
      # puts "#{original_file}:#{original_lineno}:#{real_line}"
    end
    original_lineno += 1
  end
  [ real_text, real_info ]
end
def parse_c(fname)
  # File.read closes the file handle, unlike File.open(fname).read
  lines = File.read(fname).split("\n")
  text, info = prepare_lines(fname, lines)
  blobtext = text.join("\n")
  parser = C::Parser.new
  parser.type_names << "__builtin_va_list"
  parser.type_names << "double"
  tree = nil
  begin
    print "Parse start...\n"
    tree = parser.parse(blobtext)
    print "Parse end...\n"
  rescue Exception => e
    puts "Got exception: #{e.inspect}"
    # expect "<line>:<message>", with <line> counted within the blob
    if e.message =~ /^(\d+):(.*)$/
      errline = $1.to_i
      errmsg = $2
      # line number == array index + 1
      ei = errline >= 1 ? info[errline-1] : nil
      # line outside the blob: re-raise the original exception
      raise if ei.nil?
      # clamp the start so a negative index does not wrap around
      src = text[[errline-2, 0].max .. errline+1].join("\n")
      # puts "source lines: #{src}"
      raise "Error in #{ei[:file]}:#{ei[:lineno]} (#{ei[:debug_file]}:#{ei[:debug_lineno]}) : #{errmsg}"
    else
      raise "unparseable error message from parser: #{e.message}"
    end
  end
  tree
end
tree = parse_c("real_code.c")
# puts tree.to_s
p tree
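For completeness, a sketch of how "real_code.c" might be produced in the first place. It only builds the argument list for cpp, using the standard `-I` and `-o` options. The helper name `cpp_command`, the source name foo.c and the include directory are assumptions for illustration, not something taken from the original setup.

```ruby
# Hypothetical helper: builds the argument list for running cpp.
def cpp_command(source, output, include_dirs = [])
  cmd = ["cpp"]
  # Each include directory becomes a separate "-I dir" pair.
  include_dirs.each { |dir| cmd << "-I" << dir }
  cmd << source << "-o" << output
  cmd
end

cmd = cpp_command("foo.c", "real_code.c", ["include"])
puts cmd.join(" ")

# To actually run it (requires cpp on the PATH):
#   system(*cmd) or raise "cpp failed"
```

Passing the arguments as an array to `system` avoids going through the shell, so filenames with spaces need no quoting.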
5 comments:
Hi,
I'm pretty new to Ruby and I am also trying to use Cast to parse real-life C files.
I'm wondering how you modified Cast to ignore preprocessor directives?
Thanks!
Etienne Savard
hi Etienne,
just to clarify - I've been parsing what's left *after* the CPP preprocessing, so the directives were already preprocessed by the preprocessor ;) It was mostly the markup "this line is line #10 of file foo.c" that I got rid of.
If that's what you are looking for, I'll see if I can still find the code and will update the post with it.
Also, for the real-world code, you'll most certainly encounter the gcc-isms - which I simply trashed with a very ugly regex hack, but if you are doing anything beyond the mind entertainment (like it was mostly in my case), you'd need to somehow handle them.
A pointer which you might be interested in would be clang (http://clang.llvm.org/) - it claims to be gcc-compatible, so maybe would be useful if you hack it to Ruby.
I think for my purposes (getting rid of pretty obscene-looking C syntax constructs which are currently done with very ugly macros) I am going to just use Lua with some dynamic code bound to it.
Hi Andrew,
Thanks for your response.
I have trouble with cast when parsing plain C files. I don't need or plan to parse the source code after the CPP preprocessing. I have to deal with the files without using the compiler (which is MSVC8 in my case).
I get an error from my ruby script when cast encounters the first "#include" in my file. If I remove the preprocessor directives from my file, the parsing works great.
I have around 30 files to parse (all ANSI C). If I could parse the files without removing the preprocessor directives it would be great. Your code to remove the # from the preprocessor output could work in my case also.
Thanks.
Etienne.
hi Etienne,
ok then probably this little hack is not going to be very useful to you - as I was parsing the C code *after* it was handled by the preprocessor.
You will need to write your own preprocessor in case you want to avoid using cpp.