Learn Roslyn Now: Part 8 Data Flow Analysis

Writing this blog post has been really painful. It’s been three months since I last published my introduction to the semantic model and I’ve been putting off this post for as long as I could. I started a new series called Learn Roslyn Now Quick Tips, I helped build Source Browser, and I even submitted a small pull request to clean up the analysis APIs. Basically, I’ve done everything but learn and write about these APIs.

The two reasons I’ve struggled to write about AnalyzeControlFlow and AnalyzeDataFlow are:

  1. I’ve struggled to imagine how one would use them in an analyzer or extension.
  2. They’re weird, unintuitive and they frighten me.

I put out a tweet asking how others were using them, and it appears they’re only really used within Microsoft to implement the “Extract Method” functionality. A handful of questions on Stack Overflow have mentioned these APIs, so I’m sure someone out there is putting them to good use.

Data Flow Analysis

This API can be used to inspect how variables are read and written within a given block of code. Perhaps you’d like to make a Visual Studio extension that captures and logs all assignments to a certain variable. You could use the data flow analysis API to find the statements, and a rewriter to log them.

To demonstrate the capabilities of this API, we’ll be looking at a modified piece of code posted on Stack Overflow. I’ve cleaned it up slightly, but it shows a number of interesting behaviors consumers of this API should be aware of.

We can analyze the for-loop in the following code:


using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

var tree = CSharpSyntaxTree.ParseText(@"
public class Sample
{
    public void Foo()
    {
        int[] outerArray = new int[10] { 0, 1, 2, 3, 4, 0, 1, 2, 3, 4 };
        for (int index = 0; index < 10; index++)
        {
            int[] innerArray = new int[10] { 0, 1, 2, 3, 4, 0, 1, 2, 3, 4 };
            index = index + 2;
            outerArray[index - 1] = 5;
        }
    }
}");

// Reference mscorlib so that built-in types such as int resolve.
var Mscorlib = MetadataReference.CreateFromFile(typeof(object).Assembly.Location);
var compilation = CSharpCompilation.Create("MyCompilation",
    syntaxTrees: new[] { tree }, references: new[] { Mscorlib });
var model = compilation.GetSemanticModel(tree);

// Analyze only the for-loop: it is the region we are interested in.
var forStatement = tree.GetRoot().DescendantNodes().OfType<ForStatementSyntax>().Single();
DataFlowAnalysis result = model.AnalyzeDataFlow(forStatement);

At this point we’ve got access to a DataFlowAnalysis object.

Perhaps the most important property on this object is Succeeded. This tells you if the data flow analysis completed successfully. In my experience the API has been pretty good at dealing with semantically invalid code. Neither invocations to missing methods nor use of undeclared variables seemed to trip it up. The documentation notes that if the analyzed region does not span a single expression or statement then analysis is likely to fail.
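
A minimal sketch of guarding on Succeeded before reading anything else might look like the following. The SafeDataFlow class and the TryAnalyze method are hypothetical names made up for illustration; only AnalyzeDataFlow and Succeeded come from Roslyn.

```csharp
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class SafeDataFlow
{
    // Hypothetical helper (not part of Roslyn): returns the analysis only
    // when it completed successfully, and null otherwise.
    public static DataFlowAnalysis TryAnalyze(SemanticModel model, StatementSyntax statement)
    {
        DataFlowAnalysis result = model.AnalyzeDataFlow(statement);

        // Check Succeeded before trusting any of the other properties.
        return result != null && result.Succeeded ? result : null;
    }
}
```

A caller would then test the return value for null before looking at properties such as ReadInside or WrittenInside.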

The DataFlowAnalysis object exposes a pretty rich API for users to consume. It exposes information about unsafe addresses, local variables captured by anonymous methods and much more.

In our case, we’re interested in the following properties: AlwaysAssigned, WrittenInside, WrittenOutside, ReadInside and VariablesDeclared.

To refresh, the code we’ve analyzed is displayed below. The region we’ve declared interest in is the for-loop.


public class Sample
{
    public void Foo()
    {
        int[] outerArray = new int[10] { 0, 1, 2, 3, 4, 0, 1, 2, 3, 4 };
        for (int index = 0; index < 10; index++)
        {
            int[] innerArray = new int[10] { 0, 1, 2, 3, 4, 0, 1, 2, 3, 4 };
            index = index + 2;
            outerArray[index - 1] = 5;
        }
    }
}


The results from analysis are as follows:

AlwaysAssigned: index
index is always assigned because it is initialized in the for-loop’s initializer, which runs unconditionally.

WrittenInside: index, innerArray
Both index and innerArray are clearly written within the loop.

One important point is that outerArray is not written inside the loop. While we’re mutating the array, we’re not changing the reference stored in the outerArray variable, so it does not show up in this list.
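
To make the distinction concrete, here is a hedged variation on the sample (my own illustration, not code from the original Stack Overflow question). The first statement in the loop writes to an element, and the second writes to the variable itself. I would expect only the second kind of statement to put outerArray in WrittenInside.

```csharp
public class Sample
{
    public void Foo()
    {
        int[] outerArray = new int[10];
        for (int index = 0; index < 10; index++)
        {
            // Element write: mutates the array, but the variable still
            // holds the same reference, so outerArray itself is not written.
            outerArray[index] = 5;

            // Variable write: stores a new reference in outerArray.
            // With this line present, outerArray should count as written
            // inside the loop.
            outerArray = new int[10];
        }
    }
}
```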

WrittenOutside: outerArray, this
outerArray is clearly written to outside of the for-loop.

However, it surprised me that this showed up as a parameter symbol within the WrittenOutside list. It appears that this is treated as an implicit parameter of the instance method, which means that it shows up here as well. This appears to be by design, although I suspect most consumers of this API will be surprised, and will likely ignore this value.

ReadInside: index, outerArray
It is clear that the value of index is read within the loop.

It was surprising to me that outerArray is considered to be “read” inside the loop as we’re not reading its value directly. I suppose that technically we must first read the value of outerArray in order to calculate the offset and retrieve the correct address for the given element of the array. So we’re performing a sort of “implicit read” inside the loop here.

VariablesDeclared: index, innerArray
This is fairly straightforward. index is declared within the loop initializer and innerArray within the body of the for-loop.
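
If you want to reproduce the listing above yourself, a small sketch along these lines should do it. DataFlowReport, Describe and Format are hypothetical names I've made up for illustration. The five properties are the real ones on DataFlowAnalysis, and each one is a collection of ISymbol.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis;

public static class DataFlowReport
{
    // Hypothetical helper (not part of Roslyn): formats the five properties
    // discussed above, one "Property: name, name" line per property.
    public static string Describe(DataFlowAnalysis result)
    {
        var lines = new[]
        {
            Format("AlwaysAssigned", result.AlwaysAssigned),
            Format("WrittenInside", result.WrittenInside),
            Format("WrittenOutside", result.WrittenOutside),
            Format("ReadInside", result.ReadInside),
            Format("VariablesDeclared", result.VariablesDeclared),
        };
        return string.Join(Environment.NewLine, lines);
    }

    // Joins the symbol names into a single comma-separated line.
    private static string Format(string label, IEnumerable<ISymbol> symbols)
    {
        return label + ": " + string.Join(", ", symbols.Select(s => s.Name));
    }
}
```

Calling Console.WriteLine(DataFlowReport.Describe(result)) with the result from the earlier snippet prints one line per property.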

Final Thoughts

The general weirdness of the data flow analysis API has long kept me from writing about it. The quirks around this, and around what’s considered a read vs. a write, are pretty off-putting to me. I suspect these kinds of issues will prevent a lot of people from taking advantage of this API, but I could be wrong. It’s difficult to say this early in the game, and I have not seen very much discussion about this API and the above problems.

 

13 thoughts on “Learn Roslyn Now: Part 8 Data Flow Analysis”

  1. Thoughts on this being written inside. This information could be used to determine if the current code is instance code or static code, thus determining if you can suggest making a method static if this is WrittenOutside but not ReadInside or WrittenInside.

  2. From that post, which is really great, how can I tell the difference between a variable that is not used at all and a variable that has been assigned but is not used? How can I check that a variable has been assigned? With ReadInside?

  3. I have this code :
    var methodBody = variableDeclarator.AncestorsAndSelf(false).OfType().First();
    if (methodBody == null)
    return false;

    var model = syntaxNode.SemanticModel;
    var result = model.AnalyzeDataFlow(methodBody);
    The result breaks the code for some reason: System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.
    The thing I don’t get is the test code works fine :

    @"
    class TestClass {
    void TestMethod ()
    {
    int i;
    }
    }";

  4. Nice article, it’s hard to find any Roslyn tutorial on the web.

    However I don’t see anything weird about outerArray being read in the loop – as you mentioned, it has to be implicitly read to access it. If, for example, you were to lazy-load a field, it certainly would be nice to be informed that your field is read and thus initialized.

    1. I think you’re right. I actually ended up needing this behavior in a tool I was working on.

      As with most things I’ve dealt with in Roslyn: It behaves as it should, it’s my understanding that was incomplete. 🙂

  5. “It was surprising to me that outerArray is considered to be “read” inside the loop as we’re not reading its value directly”
    In that case it is only the compiler doing an out-of-bounds check whenever you try to access by index… but it is easy to imagine a situation where some folks would modify some state in the indexed getter of some other variable.

    1. It perhaps is not “reading” any elements of “outerArray”, but it is actually reading its base address. =)

  6. I would like to write the exact same thing. It looks like work has picked up on data flow analysis in Roslyn. Maybe we can talk them into offering a good API for doing this.
