GT Big Data Bootcamp training material

- Understand composition of a graph in Spark GraphX.
- Being able to create a graph.
- Being able to use the built-in graph algorithm.

In this section we begin by creating a graph with patient and diagnostic codes. Later we will show how to run graph algorithms on the the graph you will create.

Spark GraphX abstracts a graph as a concept named **Property Graph**, which means that each edge and vertex is associated with some properties. The `Graph`

class has the following definition

Where `VD`

and `ED`

define property types of each vertex and edge respectively. We can regard `VertexRDD[VD]`

as RDD of `(VertexID, VD)`

tuple and `EdgeRDD[ED]`

as RDD of `(VertexID, VertexID, ED)`

.

Let's create a graph of patients and diagnostic codes. For each patient we can assign its patient id as vertex property, and for each diagnostic code, we will use the code as vertex property. For the edge between patient and diagnostic code, we will use number of times the patient is diagnosed with given disease as edge property.

Let's first define necessary data structure and import

Load patient event data and filter out diagnostic related events only

Let's create patient vertex

In order to use the newly created vetext id, we finally `collect`

all the patient to `VertrexID`

mapping.

Theoretically, collecting RDD to driver is not an efficient practice. One can obtain uniqueness of ID by calculating ID directly with a Hash.

Similar to patient vertex, we can create diagnostic code vertex with

Here we assign vertex id by adding the result of `zipWithIndex`

with an offset obtained from previous patient vertex to avoid ID confliction between patient and diagnostic code.

In order to create edge, we will need to know vertext id of vertices we just created.

We first broadcast patient and diagnostic code to vertext id mapping. Broadcast can avoid unnecessary copy in distributed setting thus will be more effecient. Then we count occurrence of `(patient-id, icd-9-code)`

pairs with `map`

and `reduceByKey`

, finally we translate them to proper `VertexID`

.

We will need to put vertices and edges together to create the graph

Given the graph we created, we can run some basic graph operations.

Connected component can help find disconnected subgraphs. GraphX provides the API to get connected components as below

The return result is a graph and assigned components of original graph is stored as `VertexProperty`

. For example

The first element of the tuple is `VertexID`

identical to original graph. The second element in the tuple is `connected component`

represented by the lowest-numbered `VertexID`

in that component. In above example, five vertices belong to same component.

We can easily get number of connected components using operations on RDD as below.

The property graph abstraction of GraphX is a directed graph. It provides computation of in-dgree, out-degree and total degree. For example, we can get degrees as

GraphX also provides implementation of the famous PageRank algorithm, which can compute the 'importance' of a vertex. The graph we generated above is a bipartite graph and not suitable for PageRank. To gve an example of PageRank, we randomly generate a graph and run fixed iteration of PageRank algorithm on it.

Or, we can run PageRank until converge with tolerance as `0.01`

using `randomGraph.pageRank(0.01)`

Next, we show some how we can ultilize the graph operations to solve some practical problems in the healthcare domain.

Comorbidity is additional disorders co-occuring with primary disease. We know all the case patients have heart failure, we can explore possible comorbidities as below (see comments for more explaination)

We have

And we can check the vertex of index 3129 in original graph is

The 4019 code correponds to Hypertension, which is reasonable.

Given a patient diagnostic graph, we can also find similar patients. One of the most straightforward approach is shortest path on the graph.