apache tinkerpop logo

Getting Started

Apache TinkerPop is an open source Graph Computing Framework. Within itself, TinkerPop represents a large collection of capabilities and technologies and, in its wider ecosystem, an additionally extended world of third-party contributed graph libraries and systems. TinkerPop’s ecosystem can appear complex to newcomers of all experience, especially when glancing at the reference documentation for the first time.

So, where do you get started with TinkerPop? How do you dive in quickly and get productive? Well - Gremlin, the most recognizable citizen of The TinkerPop, is here to help with this thirty minute tutorial. That’s right - in just thirty short minutes, you too can be fit to start building graph applications with TinkerPop. Welcome to The TinkerPop Workout - by Gremlin!

gremlin gym

The First Five Minutes

It is quite possible to learn a lot in just five minutes with TinkerPop, but before doing so, a proper introduction of your trainer is in order. Meet Gremlin!

gremlin-standing

Gremlin helps you navigate the vertices and edges of a graph. He is essentially your query language to graph databases, as SQL is the query language to relational databases. To tell Gremlin how he should "traverse" the graph (i.e. what you want your query to do) you need a way to provide him commands in the language he understands - and, of course, that language is called "Gremlin". For this task, you need one of TinkerPop’s most important tools: The Gremlin Console.

Note
Are you unsure of what a vertex or edge is? That topic is covered in the next section, but please allow the tutorial to get you oriented with the Gremlin Console first, so that you have an understanding of the tool that will help you with your learning experience.

Download the console, unpackage it and start it:

$ unzip apache-tinkerpop-gremlin-console-3.1.5-SNAPSHOT-bin.zip
$ cd apache-tinkerpop-gremlin-console-3.1.5-SNAPSHOT
$ bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin>
Tip
Windows users may use the included bin/gremlin.bat file to start the Gremlin Console.

The Gremlin Console is a REPL environment, which provides a nice way to learn Gremlin as you get immediate feedback for the code that you enter. This eliminates the more complex need to "create a project" to try things out. The console is not just for "getting started" however. You will find yourself using it for a variety of TinkerPop-related activities, such as loading data, administering graphs, working out complex traversals, etc.

To get Gremlin to traverse a graph, you need a Graph instance, which holds the structure and data of the graph. TinkerPop is a graph abstraction layer over different graph databases and different graph processors, so there are many Graph instances you can choose from to instantiate in the console. The best Graph instance to start with however is TinkerGraph. TinkerGraph is a fast, in-memory graph database with a small handful of configuration options, making it a good choice for beginners.

Tip
TinkerGraph is not just a toy for beginners. It is useful in analyzing subgraphs taken from a large graph, working with a small static graph that doesn’t change much, writing unit tests and other use cases where the graph can fit in memory.
Tip
Resist the temptation to "get started" with more complex databases like Titan or to delve into how to get Gremlin Server working properly. Focusing on the basics, presented in this guide, builds a good foundation for all the other things TinkerPop offers.

To make your process even easier, start with one of TinkerPop’s "toy" graphs. These are "small" graphs designed to provide a quick start into querying. It is good to get familiar with them, as almost all TinkerPop documentation is based on them and when you need help and have to come to the mailing list, a failing example put in the context of the toy graphs can usually get you a fast answer to your problem.

For your first graph, use the "Modern" graph which looks like this:

tinkerpop-modern

It can be instantiated in the console this way:

gremlin> graph = TinkerFactory.createModern()
==>tinkergraph[vertices:6 edges:6]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]

The first command creates a Graph instance named graph, which thus provides a reference to the data you want Gremlin to traverse. Unfortunately, just having graph doesn’t provide Gremlin enough context to do his job. You also need something called a TraversalSource, which is generated by the second command. The TraversalSource provides additional information to Gremlin (such as the traversal strategies to apply and the traversal engine to use) which provides him guidance on how to execute his trip around the Graph.

With your TraversalSource g available it is now possible to ask Gremlin to traverse the Graph:

gremlin> g.V() //(1)
==>v[1]
==>v[2]
==>v[3]
==>v[4]
==>v[5]
==>v[6]
gremlin> g.V(1) //(2)
==>v[1]
gremlin> g.V(1).values('name') //(3)
==>marko
gremlin> g.V(1).outE('knows') //(4)
==>e[7][1-knows->2]
==>e[8][1-knows->4]
gremlin> g.V(1).outE('knows').inV().values('name') //(5)
==>vadas
==>josh
gremlin> g.V(1).out('knows').values('name') //(6)
==>vadas
==>josh
gremlin> g.V(1).out('knows').has('age', gt(30)).values('name') //(7)
==>josh
  1. Get all the vertices in the Graph.

  2. Get a vertex with the unique identifier of "1".

  3. Get the value of the name property on vertex with the unique identifier of "1".

  4. Get the edges with the label "knows" for the vertex with the unique identifier of "1".

  5. Get the names of the people that the vertex with the unique identifier of "1" "knows".

  6. Note that when one uses outE().inV() as shown in the previous command, this can be shortened to just out() (similar to inE().inV() and in for incoming edges).

  7. Get the names of the people vertex "1" knows who are over the age of 30.

Important
A Traversal is essentially an Iterator so if you have code like x = g.V(), the x does not contain the results of the g.V() query. Rather, that statement assigns an Iterator value to x. To get your results, you would then need to iterate through x. This understanding is important because in the context of the console typing g.V() instantly returns a value. The console does some magic for you by noticing that g.V() returns an Iterator and then automatically iterates the results. In short, when writing Gremlin outside of the console always remember that you must iterate your Traversal manually in some way for it to do anything.

In this first five minutes with Gremlin, you’ve gotten the Gremlin Console installed, instantiated a Graph and TraversalSource, wrote some traversals and hopefully learned something about TinkerPop in general. You’ve only scratched the surface of what there is to know, but those accomplishments will help enable your understanding of the detailed sections to come.

The Next Fifteen Minutes

In the first five minutes of The TinkerPop Workout - by Gremlin, you learned some basics for traversing graphs. Of course, there wasn’t much discussion about what a graph is. A graph is a collection of vertices (i.e. nodes, dots) and edges (i.e. relationships, lines), where a vertex is an entity which represents some domain object (e.g. a person, a place, etc.) and an edge represents the relationship between two vertices.

modern-edge-1-to-3-1

The diagram above shows a graph with two vertices, one with a unique identifier of "1" and another with a unique identifier of "3". There is an edge connecting the two with a unique identifier of "9". It is important to consider that the edge has a direction which goes out from vertex "1" and in to vertex "3'.

Important
Most TinkerPop implementations do not allow for identifier assignment. They will rather assign their own identifiers and ignore assigned identifiers that you attempt to assign to them.

A graph with elements that just have identifiers does not make for much of a database. To give some meaning to this basic structure, vertices and edges can each be given labels to categorize them.

modern-edge-1-to-3-2

You can now see that a vertex "1" is a "person" and vertex "3" is a "software" vertex. They are joined by a "created" edge which allows you to see that a "person created software". The "label" and the "id" are reserved attributes of vertices and edges, but you can add your own arbitrary properties as well:

modern-edge-1-to-3-3

This model is referred to as a property graph and it provides a flexible and intuitive way in which to model your data.

Creating a Graph

As intuitive as it is to you, it is perhaps more intuitive to Gremlin himself, as vertices, edges and properties make up the very elements of his existence. It is indeed helpful to think of our friend, Gremlin, moving about a graph when developing traversals, as picturing his position as the traverser helps orient where you need him to go next. Let’s use the two vertex, one edge graph we’ve been discussing above as an example. First, you need to create this graph:

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> v1 = graph.addVertex(id, 1, label, "person", "name", "marko", "age", 29)
==>v[1]
gremlin> v2 = graph.addVertex(id, 3, label, "software", "name", "lop", "lang", "java")
==>v[3]
gremlin> v1.addEdge("created", v2, id, 9, "weight", 0.4)
==>e[9][1-created->3]

There are a number of important things to consider in the above code. First, recall that id and label are "reserved" for special usage in TinkerPop and are members of the enum, T. Those "keys" supplied to the creation method are statically imported to the console, which allows you to access them without having to specify their owning enum. Think of it as a shorthand form that enables a more fluid code style. You would normally refer to them as T.id and T.label. Without that static importing you would instead have to write:

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> v1 = graph.addVertex(T.id, 1, T.label, "person", "name", "marko", "age", 29)
==>v[1]
gremlin> v2 = graph.addVertex(T.id, 3, T.label, "software", "name", "lop", "lang", "java")
==>v[3]
gremlin> v1.addEdge("created", v2, id, 9, "weight", 0.4)
==>e[9][1-created->3]
Note
The fully qualified name for T is org.apache.tinkerpop.gremlin.structure.T. Another important static import that is often seen in Gremlin comes from org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__, which allows for the creation of anonymous traversals.

Second, don’t forget that you are working with TinkerGraph which allows for identifier assignment. That is not the case with most graph databases.

Finally, the label for an Edge is required and is thus part of the method signature of addEdge(). It is the first parameter supplied, followed by the Vertex to which v1 should be connected. Therefore, this usage of addEdge is creating an edge that goes out of v1 and into v2 with a label of "created".

Graph Traversal - Staying Simple

Now that Gremlin knows where the graph data is, you can ask him to get you some data from it by doing a traversal, which you can think of as executing some process over the structure of the graph. We can form our question in English and then translate it to Gremlin. For this initial example, let’s ask Gremlin: "What software has Marko created?"

To answer this question, we would want Gremlin to:

  1. Find "marko" in the graph

  2. Walk along the "created" edges to "software" vertices

  3. Select the "name" property of the "software" vertices

The English-based steps above largely translate to Gremlin’s position in the graph and to the steps we need to take to ask him to answer our question. By stringing these steps together, we form a Traversal or the sequence of programmatic steps Gremlin needs to perform in order to get you an answer.

Let’s start with finding "marko". This operation is a filtering step as it searches the full set of vertices to match those that have the "name" property value of "marko". This can be done with the has() step as follows:

gremlin> g.V().has('name','marko')
==>v[1]
Note
The variable g is the TraversalSource, which was introduced in the "The First Five Minutes". The TraversalSource is created with graph.traversal() and is the object used to spawn new traversals.

We can picture this traversal in our little graph with Gremlin sitting on vertex "1".

modern-edge-1-to-3-1-gremlin

When Gremlin is on a vertex or an edge, he has access to all the properties that are available to that element.

Important
The above query iterates all the vertices in the graph to get its answer. That’s fine for our little example, but for multi-million or billion edge graphs that is a big problem. To solve this problem, you should look to use indices. TinkerPop does not provide an abstraction for index management. You should consult the documentation of the graph you have chosen and utilize its native API to create indices which will then speed up these types of lookups. Your traversals will remain unchanged however, as the indices will be used transparently at execution time.

Now that Gremlin has found "marko", he can now consider the next step in the traversal where we ask him to "walk" along "created" edges to "software" vertices. As described earlier, edges have direction, so we have to tell Gremlin what direction to follow. In this case, we want him to traverse on outgoing edges from the "marko" vertex. For this, we use the outE step.

gremlin> g.V().has('name','marko').outE('created')
==>e[9][1-created->3]

At this point, you can picture Gremlin moving from the "marko" vertex to the "created" edge.

modern-edge-1-to-3-2-gremlin

To get to the vertex on the other end of the edge, you need to tell Gremlin to move from the edge to the incoming vertex with inV().

gremlin> g.V().has('name','marko').outE('created').inV()
==>v[3]

You can now picture Gremlin on the "software" vertex as follows:

modern-edge-1-to-3-3-gremlin

As you are not asking Gremlin to do anything with the properties of the "created" edge, you can simplify the statement above with:

gremlin> g.V().has('name','marko').out('created')
==>v[3]

modern-edge-1-to-3-4-gremlin

Finally, now that Gremlin has reached the "software that Marko created", he has access to the properties of the "software" vertex and you can therefore ask Gremlin to extract the value of the "name" property as follows:

gremlin> g.V().has('name','marko').out('created').values('name')
==>lop

You should now be able to see the connection Gremlin has to the structure of the graph and how Gremlin maneuvers from vertices to edges and so on. Your ability to string together steps to ask Gremlin to do more complex things, depends on your understanding of these basic concepts.

Graph Traversal - Increasing Complexity

Armed with the knowledge from the previous section, let’s ask Gremlin to perform some more difficult traversal tasks. There’s not much more that can be done with the "baby" graph we had, so let’s return to the "modern" toy graph from the "five minutes section". Recall that you can create this Graph and establish a TraversalSource with:

gremlin> graph = TinkerFactory.createModern()
==>tinkergraph[vertices:6 edges:6]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]

Earlier we’d used the has() step to tell Gremlin how to find the "marko" vertex. Let’s look at some other ways to use has(). What if we wanted Gremlin to find the "age" values of both "vadas" and "marko"? In this case we could use the within comparator with has() as follows:

gremlin> g.V().has('name',within('vadas','marko')).values('age')
==>29
==>27

It is worth noting that within is statically imported from P to the Gremlin Console (much like T is, as described earlier).

Note
The fully qualified name for P is org.apache.tinkerpop.gremlin.process.traversal.P.

If we wanted to ask Gremlin the average age of "vadas" and "marko" we could use the mean() step as follows:

gremlin> g.V().has('name',within('vadas','marko')).values('age').mean()
==>28.0

Another method of filtering is seen in the use of the where step. We know how to find the "software" that "marko" created:

gremlin> g.V().has('name','marko').out('created')
==>v[3]

gremlin-on-software-vertex Let’s extend on that query to try to learn who "marko" collaborates with when it comes to the software he created. In other words, let’s try to answer the question of: "Who are the people that marko develops software with?" To do that, we should first picture Gremlin where we left him in the previous query. He was standing on the "software" vertex. To find out who "created" that "software", we need to have Gremlin traverse back in along the "created" edges to find the "person" vertices tied to it.

gremlin> g.V().has('name','marko').out('created').in('created').values('name')
==>marko
==>josh
==>peter

So that’s nice, we can see that "peter", "josh" and "marko" are both responsible for creating "lop". Of course, we already know about the involvement of "marko" and it seems strange to say that "marko" collaborates with himself, so excluding "marko" from the results seems logical. The following traversal handles that exclusion:

gremlin> g.V().has('name','marko').as('exclude').out('created').in('created').where(neq('exclude')).values('name')
==>josh
==>peter

We made two additions to the traversal to make it exclude "marko" from the results. First, we added the as() step. The as() step is not really a "step", but a "step modulator" - something that adds features to a step or the traversal. Here, the as('exclude') labels the has() step with the name "exclude" and all values that pass through that step are held in that label for later use. In this case, the "marko" vertex is the only vertex to pass through that point, so it is held in "exclude".

The other addition that was made was the where() step which is a filter step like has(). The where() is positioned after the in() step that has "person" vertices, which means that the where() filter is occurring on the list of "marko" collaborators. The where() specifies that the "person" vertices passing through it should not equal (i.e. neq()) the contents of the "exclude" label. As it just contains the "marko" vertex, the where() filters out the "marko" that we get when we traverse back in on the "created" edges.

You will find many uses of as(). Here it is in combination with select:

gremlin> g.V().as('a').out().as('b').out().as('c').select('a','b','c')
==>[a:v[1], b:v[4], c:v[5]]
==>[a:v[1], b:v[4], c:v[3]]

In the above example, we tell Gremlin to iterate through all vertices and traverse out twice from each. Gremlin will label each vertex in that path with "a", "b" and "c", respectively. We can then use select to extract the contents of that label.

Another common but important step is the group() step and its related step modulator called by(). If we wanted to ask Gremlin to group all the vertices in the graph by their vertex label we could do:

gremlin> g.V().group().by(label)
==>[software:[v[3], v[5]], person:[v[1], v[2], v[4], v[6]]]

The use of by() here provides the mechanism by which to do the grouping. In this case, we’ve asked Gremlin to use the label (which, again, is an automatic static import from T in the console). We can’t really tell much about our distribution though because we just have vertex unique identifiers as output. To make that nicer we could ask Gremlin to get us the value of the "name" property from those vertices, by supplying another by() modulator to group() to transform the values.

gremlin> g.V().group().by(label).by('name')
==>[software:[lop, ripple], person:[marko, vadas, josh, peter]]

In this section, you have learned a bit more about what property graphs are and how Gremlin interacts with them. You also learned how to envision Gremlin moving about a graph and how to use some of the more complex, but commonly utilized traversal steps. You are now ready to think about TinkerPop in terms of its wider applicability to graph computing.

The Final Ten Minutes

In these final ten minutes of The TinkerPop Workout - by Gremlin we’ll look at TinkerPop from a higher level and introduce different features of the stack in order to orient you with what it offers. In this way, you can identify areas of interest and dig into the details from there.

Why TinkerPop?

provider-integration The goal of TinkerPop, as a Graph Computing Framework, is to make it easy for developers to create graph applications by providing APIs and tools that simplify their endeavors. One of the fundamental aspects to what TinkerPop offers in this area lies in the fact that TinkerPop is an abstraction layer over different graph databases and different graph processors. As an abstraction layer, TinkerPop provides a way to avoid vendor lock-in to a specific database or processor. This capability provides immense value to developers who are thus afforded options in their architecture and development because:

  • They can try different implementations using the same code to decide which is best for their environment.

  • They can grow into a particular implementation if they so desire - start with a graph that is designed to scale within a single machine and then later switch to a graph that is designed to scale horizontally.

  • They can feel more confident in graph technology choices, as advances in the state of different provider implementations are behind TinkerPop APIs, which open the possibility to switch providers with limited impact.

TinkerPop has always had the vision of being an abstraction over different graph databases. That much is not new and dates back to TinkerPop 1.x. It is in TinkerPop 3.x however that we see the introduction of the notion that TinkerPop is also an abstraction over different graph processors like Spark and Giraph. The scope of this tutorial does not permit it to delve into "graph processors", but the short story is that the same Gremlin statement we wrote in the examples above can be executed to run in distributed fashion over Spark or Hadoop. The changes required to the code to do this are not in the traversal itself, but in the definition of the TraversalSource. You can again see why we encourage, graph operations to be executed through that class as opposed to just using Graph. You can read more about these features in this section on hadoop-gremlin.

Tip
To maintain an abstraction over Graph creation use GraphFactory.open() to construct new instances. See the documentation for individual Graph implementations to learn about the configuration options to provide.

Loading Data

gremlin-to-the-7 There are many strategies for getting data into your graph. As you are just getting started, let’s look at the more simple methods aimed at "smaller" graphs. A "small" graph, in this context, is one that has less than ten million edges. The most direct way to load this data is to write a Groovy script that can be executed in the Gremlin Console, a tool that you should be well familiar with at this point. For our example, let’s use the Wikipedia Vote Network data set which contains 7,115 vertices and 103,689 edges.

$ curl -L -O http://snap.stanford.edu/data/wiki-Vote.txt.gz
$ gunzip wiki-Vote.txt.gz

The data is contained in a tab-delimited structure where vertices are Wikipedia users and edges from one user to another implies a "vote" relationship. Here is the script to parse the file and generate the Graph instance using TinkerGraph:

graph = TinkerGraph.open()
graph.createIndex('userId', Vertex.class) (1)

g = graph.traversal()

getOrCreate = { id ->
  g.V().has('userId', id).tryNext().orElseGet{ g.addV('userId', id).next() }
}

new File('wiki-Vote.txt').eachLine {
  if (!it.startsWith("#")){
    (fromVertex, toVertex) = it.split('\t').collect(getOrCreate) (2)
    fromVertex.addEdge('votesFor', toVertex)
  }
}
  1. To ensure fast lookups of vertices, we need an index. The createIndex() method is a method native to TinkerGraph. Please consult your graph databases documentation for their index creation approaches.

  2. We are iterating each line of the wiki-Vote.txt file and this line splits the line on the delimiter, then uses some neat Groovy syntax to apply the getOrCreate() function to each of the two userId fields encountered in the line and stores those vertices in the fromVertex and toVertex variables respectively.

Note
While this is a tab-delimited structure, this same pattern can be applied to any data source you require and Groovy tends to have nice libraries that can help making working with data quite enjoyable.
Warning
Take care if using a Graph implementation that supports transactions. As TinkerGraph does not, there is no need to commit(). If your Graph does support transactions, intermediate commits during load will need to be applied.

To load larger data sets you should read about the BulkLoaderVertexProgram (BLVP), which provides a generalized method for loading graphs of virtually any size.

Gremlin Server

gremlin-server-protocol Gremlin Server provides a way to remotely execute Gremlin scripts against one or more Graph instances hosted within it. It does this by exposing different endpoints, such as REST and websockets, which allow a request containing a Gremlin script to be processed with results returned.

$ curl -L -O https://www.apache.org/dist/tinkerpop/3.1.5-SNAPSHOT/apache-tinkerpop-gremlin-server-3.1.5-SNAPSHOT-bin.zip
$ unzip apache-tinkerpop-gremlin-server-3.1.5-SNAPSHOT-bin.zip
$ cd apache-tinkerpop-gremlin-server-3.1.5-SNAPSHOT
$ bin/gremlin-server.sh conf/gremlin-server-modern.yaml
[INFO] GremlinServer -
         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----

[INFO] GremlinServer - Configuring Gremlin Server from conf/gremlin-server-rest-modern.yaml
...
[INFO] GremlinServer$1 - Channel started at port 8182.
$ curl -X POST -d "{\"gremlin\":\"g.V(x).out().values('name')\", \"language\":\"gremlin-groovy\", \"bindings\":{\"x\":1}}" "http://localhost:8182"
{
    "requestId": "abe3be05-1e86-481a-85e0-c59ad8a37c6b",
    "status": {
        "message": "",
        "code": 200,
        "attributes": {}
    },
    "result": {
        "data": [
            "lop",
            "vadas",
            "josh"
        ],
        "meta": {}
    }
}
Important
Take careful note of the use of "bindings" in the arguments on the request. These are variables that are applied to the script on execution and is essentially a way to parameterize your scripts. This "parameterization" is critical to performance. Whenever possible, parameterize your queries.

As mentioned earlier, Gremlin Server can also be configured with a websockets endpoint. This endpoint has an embedded subprotocol that allow a compliant driver to communicate with it. TinkerPop supplies a reference driver written in Java, but there are drivers developed by third-parties for other languages such as Python, Javascript, etc. Gremlin Server therefore represents the method by which non-JVM languages can interact with TinkerPop.

Conclusion

…and that is the end of The TinkerPop Workout - by Gremlin. You are hopefully feeling more confident in your TinkerPop skills and have a good overview of what the stack has to offer, as well as some entry points to further research within the reference documentation. Welcome to The TinkerPop!