Saturday, June 20, 2015

Business Interrelations as a Graph

We look at a practical application of Graph Theory to take a complex representation of information and distill it to something 'interesting.' And by 'interesting,' I mean: finding a pattern in the sea of information otherwise obscured (sometimes intentionally) by all the overwhelming availability of data.

We'll be working with a graph database created from biz.csv, so to get this started, load that table into neo4j:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///[...]/biz.csv" AS csvLine
MERGE (b1:CORP { name: csvLine.contractor })
MERGE (b2:CORP { name: csvLine.contractee })

MERGE (b1)-[:CONTRACTS]->(b2)

We got to a graph of business interrelations this past week (see, e.g.: http://lpaste.net/5444203916235374592) showing who contracts with whom, and came to a pretty picture like this from the Cypher query:

MATCH (o:OWNER)--(p:PERSON)-[]->(c:CORP) RETURN o,p,c

diagram 1: TMI

That is to say: the sea of information. Yes, we can tell that businesses are conducting commerce, but ... so what? That's all well and good, but let's say that some businesses want to mask how much they are actually making by selling their money to other companies, and then getting it paid back to them. This is not an impossibility, and perhaps it's not all that common, but companies are savvy to watch dogs, too, so it's not (usually) going to be an obvious A -[contracts]-> B -[contracts] -> A relationship.

Not usually, sometimes you have to drill deeper, but if you drill too deeply, you get the above diagram which tells you nothing, because it shares too much information.

(Which you never want to do ... not on the first date, anyway, right?)

But the above, even though noisy, does show some companies contracting with other companies, and then, in turn being contracted by some companies.

So, let's pick one of them. How about company 'YPB,' for example? (Company names are changed to protect the modeled shenanigans)

MATCH (c:CORP)-[:CONTRACTS]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1

diagram 2: tier 1 of YPB

So, in this first tier we see YPB contracting with four other companies. Very nice. Very ... innocuous. Let's push this inquiry to the next tier, is there something interesting happening here?

MATCH (c:CORP)-[:CONTRACTS*1..2]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1

diagram 3: tier 2 of YPB

Nope. Or what is interesting is to see the network of businesses and their relationships (at this point, not interrelationships) begin to extend the reach. You tell your friends, they tell their friends, and soon you have the MCI business-model.

But we're not looking at MCI. We're looking at YPB, which is NOT MCI, I'd like to say for the record.

Okay. Next tier:

MATCH (c:CORP)-[:CONTRACTS*1..3]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1

diagram 4: tier 3 of YPB

Okay, a little more outward growth. Okay. (trans: 'meh') How about the next tier, that is to say: tier 4?

MATCH (c:CORP)-[:CONTRACTS*1..4]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1

diagram 5: tier 4 of YPB

So, we've gone beyond our observation cell, but still we have no loop-back to YPB. Is there none (that is to say: no circular return to YPB)? Let's push it one more time to tier 5 and see if we have a connection.

MATCH (c:CORP)-[:CONTRACTS*1..5]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1


diagram 6: tier 5 with a (nonobvious) cycle of contracts

Bingo. At tier 5, we have a call-back.

But from whom?

Again, we've run into the forest-trees problem in that we see we have a source of YPB, and YPB is the destination, as well, but what is the chain of companies that close this loop. We can't see this well in this diagram, as we have so many companies. So let's zoom into the company that feeds money back to YPB and see if that answers our question.

MATCH (c:CORP)-[:CONTRACTS]->(c1:CORP)-[:CONTRACTS]->(c2:CORP)-[:CONTRACTS]->(c3:CORP)-[:CONTRACTS]->(c4:CORP)-[:CONTRACTS]->(c5:CORP) WHERE c.name='YPB' AND c5.name='GUB' RETURN c, c1, c2, c3, c4, c5

diagram 7: cycle of contracts from YPB

Aha! There we go. By focusing our query the information leaps right out at us. Behold, we're paying Peter, who pays Paul to pay us back, and it's there, plain as day.

Now, lock them up and throw away the key? No. We've just noted a cyclical flow of contracts, but as to the legality of it, that is: whether it is allowed or whether this is fraudulent activity, there are teams of analysts and lawyers who can sink their mandibles into this juicy case.

No, we haven't determined innocence or guilt, but we have done something, and that is: we've revealed an interesting pattern, a cycle of contracts, and we've also identified the parties to these contracts. Bingo. Win.

The problem analysts face today is diagram 1: they have just too much information, and they spend the vast majority of their time weeding out the irrelevant information to winnow down to something that may be interesting. We were presented with the same problem: TMI. But, by using graphs, we were able to see, firstly, that there are some vertices (or 'companies') that have contracts in and contracts out, and, by investigating further, we were able to see a pattern develop that eventually cycled. My entire inquiry lasted minutes of queries and response. Let's be generous and say it took me a half-hour to expose this cycle.

Data analysts working on these kinds of problems are not so fortunate. Working with analysts, I've observed that:

  1. First of all, they never see the graph: all they see are spreadsheets,
  2. Secondly, it takes days to get to even just the spreadsheets of information
  3. Third, where do they go from there? How do they see these patterns? The learning curve for them is prohibitive, making training a bear, and niching their work to just a few brilliant experts and shunting out able-bodied analysts who are more than willing to help, but just don't see the patterns in grids of numbers
With the graph-representation, you can run into the same problems, but 
  1. Training is much easier for those who can work with these visual patterns,
  2. Information can be overloaded, leaving one overwhelmed, but then one can very easily reset to just one vertex and expand from there (simplifying the problem-space). And then, the problem grows in scope when you decide to expand your view, and if you don't like that expanse, it's very easy either to reset or to contract that view.
  3. An analyst can focus on a particular thread or materialize that thread on which to focus, or the analyst can branch from a point or branch to associated sets. If a thread is not yielding interesting results, then they can pursue other, more interesting, areas of the graph, all without losing one's place.
The visual impact of the underlying graph (theory) cannot be over-emphasized: "Boss, we have a cycle of contracts!" an analyst says and presents a spreadsheet requires explanation, checking and verification. That same analysis comes into the boss' office with diagram 7, and the cycle practically leaps off the page and explains itself, that, coupled with the ease and speed of which these cycles are explored and materialized visually makes a compelling case of modeling related data as graphs.

We present, for your consideration: graphs.



Models presented above are derived from various Governmental sources include Census Bureau, Department of Labor, Department of Commerce, and the Small Business Administration.

Graphs calculated in Haskell and stored and visualized in Neo4J