[MUSIC]>>Okay. Well, let’s go
ahead and get started. I want to welcome everybody to
our next edition of Data Club. This is our first one in the new year.
My name is Eric Olson. I’m on the core OS or
basically the DEP, Developer Platform team within Cosine, which is part of Azure. If you haven’t been to Data Club before, just a reminder of what we’re here for. Basically, these are training
and sharing sessions for data, I’ll say Data Science Club. It also includes
data engineering related work. There’s a number of
different formats we use, which you’ve probably seen before
if you’ve been here before. But this is our place for
collaborative learning, our chance to share with each
other and actually learn from what other people are doing and the types of
problems we’re working on. These are basically
training sessions or information sharing sessions to help grow the capability of the organization
or, a lot of times, just to make people aware of what’s
out there if you’re not aware. It also gives you the opportunity
to build connections with others in the data community who might
be working on similar problems and who could help, essentially, so it’s a networking opportunity. So, our first rule at Data Club
is to talk about Data Club, and to participate and learn together. That’s how we leverage our
learning to get better. Then if you see
something you like here, be sure to share it with
those that couldn’t attend, or remind them there’s
a recording they can get afterwards, and
they can check it out. Then finally, looking ahead we have these sessions
about every two weeks. So, our next session
is on February 7th. Tentatively, right now
we’re going to be talking about Azure Ignite impacts, so some of the data analysis
around Ignite. Then on the 21st, we’ll be revisiting the business
intelligence graph, there are not this big, to get some updates of what’s
going on there. So, we’re our commercial space. If you want to keep up with
information about the sessions, you can join the Data
Club alias on ID Web. You’ll get all the announcements
and all the meeting requests. Then if you missed anything here
today, what I’m talking about, then don’t worry you can always go to aka.ms/dataclub and you can get
all the links for the slides, the talks and the videos. Then finally, after
this session I will send out a very short survey
because we are data folks, we want to know what you think. So, we have a feedback survey. It takes maybe
five minutes to fill out. So, you can give us feedback on what you like, what you didn’t like, what you’d like to
see, things like that. With that, I will now turn it over
to David to talk about Neo4j.>>Hi everybody.>>Hi.>>Thanks for taking the time today. I love talking about graphs. Let’s do some of that.
Today, we’re going to talk about an introduction to
Neo4j and Graph Databases. Since this is a data science group, I do want to towards
the end get to talk about graph algorithms and some
of the data science applications. But because graphs represent such a different way of
thinking about data, we need to go through
some basics first. This is who I am. I’m a partner
solution architect at Neo4j. I have a pretty fun job there
because about 50 percent of it is business and
about 50 percent of it is technical. I do a lot of integrations between Neo4j and some of
our strategic partners. Microsoft and Azure
being one of those. But I get to play a lot and
I get to play with a lot of new fun stuff from all over the
industry and that makes me happy. You can get me on Twitter. Send me an email if you have
any question after this session. A couple of things that I want
to cover in this session, go over an introduction
into graphs and Neo4j and what the
relationship between them is, talk a little bit
about why people are using graphs and when
you would use graphs, versus some of the
alternatives that you have. We’re going to talk
about the underlying property graph data model, because you can’t really
do rigorous analysis or data science unless you
understand what’s underneath. We’ll talk about the
Cypher query language and about how we query
and manipulate graphs. Talk about some data import, and then how we develop applications. We’re going to have
some demos in here too. So as we go, we will have one demo from Ambros that’s
about Microsoft specific stuff and then I’ll show you
some other demos to give you a kind of a flavor of
the tool set around this. We’ll also talk about graph algorithms and data
science applications. So, if you are one of those
technologists who doesn’t want to just listen to slides and high level theoretical stuff,
take note of this. If you want to follow
along and actually use the software while we’re talking, try out things that I’m talking about on the slides,
this is how you do it. We are in the Azure marketplace. So you can search
for Neo4j Enterprise and launch a single instance. Version 3.5.1 is the latest
that’s on the Azure catalog. Or if you know absolutely
nothing about Neo4j, and this is your very first
experience, go to neo4jsandbox.com. It will allow you to launch any
of the sandboxes that you see on the right-hand side and gets
you to a GUI really quickly, so you don’t have to think about what port do I go to,
any of that stuff. Sandbox is the fastest way but
there is the Azure way too. On Sandbox, you’re
going to get a free temporary instance that only lasts for about two days, I think, and it
starts preloaded with data, so you don’t need to know
how to load or query or do really anything
with an empty database. So, with that out of the way, let’s start talking a little bit
about graphs and the basics. So why graphs? Ankit was saying at
the beginning of the session he got this sticker at
our conference, GraphConnect. Graphs are everywhere. We truly do see that. Basically, the entire world is a graph and everything is connected. You have people, places, and events that are thoroughly
connected with one another. Companies and markets,
countries, history, politics, there are so many examples
of these things. It’s very natural and visual for human beings to connect things
in their mind as graphs, rather than to use some of
the other formalisms that we’ve become accustomed to as computer
scientists and as technologists. We learn how to think about
things in terms of relations, and sets, and JSON documents
and things like that. But that’s an abstraction
that we’ve mapped more reasonable understanding onto, not how our brains work. Our brains tend to
think of these things as connected nodes and edges. Whether it is flights going all over the globe or
any other use cases. So, before we get into
the bits about Neo4j, let’s talk a little bit about
the business side of what people are using
Neo4j and graphs for. Our business is primarily
focused on the Global 1000, and we have a
heavy presence in retail, finance, and a number
of other sectors. To give you three concrete examples of how we’re transforming
large enterprises: we do real-time promotion
recommendations for a lot of these big retailers. So, whenever you see that somebody has had these record
Cyber Monday sales, part of that is consumers buying more online
and part of it is that the retailers are
consistently getting smarter about how to do
product recommendations. A lot of that product
recommendation stuff is driven by Neo4j behind the scenes. In that, we can model the products
and who is buying what as a graph and then we can create
social recommendations for users. You might like what your
friends have purchased, and any number of other recommendation
approaches using Neo4j. Marriott uses us for
real-time pricing. With 300 million pricing
operations per day. One of the other things that
Marriott has found, and we can talk about this a little bit more
in the architecture parts, is that we tend to require a whole lot
less hardware, and when we get to index-free adjacency and talk about how the database
works a little bit, you’ll understand why that is. Folks are frequently finding
that they can replace large fleets of relational
database clusters with fewer instances of
the Neo4j graph database when they’re really focused
on real-time relationships. We also work with large postal services for handling
package routing in real time. So, if you think of
the traveling salesman problem or a network of roads as
being a graph of sorts, package routing is full of
shortest path type queries. I need to get a package
from point A to point B going through the fewest number
of logistics hops in the middle. So, that’s what you would think of as a fundamentally graphy problem. Just to kind of go through a number
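A minimal sketch of the "fewest hops" routing question described here, written in plain Python rather than against a Neo4j instance. The network, the hub names, and the function `fewest_hops` are hypothetical and are not from the talk; breadth-first search is one standard way to find a fewest-hop path in an unweighted graph, and nothing here claims to be how Neo4j or the postal customers implement it.

```python
from collections import deque

# Hypothetical logistics network: each location lists the locations it ships to directly.
ROUTES = {
    "A": ["hub1", "hub2"],
    "hub1": ["hub3"],
    "hub2": ["B"],
    "hub3": ["B"],
    "B": [],
}


def fewest_hops(routes, start, goal):
    """Breadth-first search: return one path with the fewest hops, or None."""
    queue = deque([[start]])  # each queue entry is a full path from start
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in routes.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal is not reachable from start


print(fewest_hops(ROUTES, "A", "B"))
```

Because the search expands the network one hop at a time, the first path that reaches the destination is guaranteed to use the fewest hops.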
of other use cases we cover. In two categories,
internal applications. We have a lot of folks using
us for master data management and so they would take
the metadata from all of their systems and then put it into Neo4j and draw a lot of
correspondences back and forth to say this field in this system is equal to
this field in this other system, a metadata catalog if you will. We get used for network
and IT operations, where you have to understand the topology of a network and
how all of your infrastructure relates and this is used for things
like critical path analysis. So, which router if I knocked it down would knock the whole
data center offline? That’s an example of
another kind of graph use case. And fraud detection. So, a lot of our financial clients
will put transaction data into the graph and then we’ll ask
maybe I sent 5,000 here, 5,000 there, 5,000
to a third location. But if all three of those recipients of the funds are controlled by the same party in aggregate I’ve
transferred 15,000 to somebody. So, it gets used to sniff
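The three-transfers example can be worked through in a few lines of Python. This is only an illustrative sketch: the account names, the controlling-party names, and the extra small transfer are made up, and a real deployment would express this as a graph query rather than two dictionaries.

```python
from collections import defaultdict

# Hypothetical transaction data: (sender, recipient account, amount).
TRANSFERS = [
    ("me", "acct1", 5000),
    ("me", "acct2", 5000),
    ("me", "acct3", 5000),
    ("me", "acct4", 200),
]

# Hypothetical "controlled by" relationships: account -> controlling party.
CONTROLLED_BY = {
    "acct1": "party_x",
    "acct2": "party_x",
    "acct3": "party_x",
    "acct4": "party_y",
}


def totals_by_controller(transfers, controlled_by):
    """Sum transfers per (sender, controlling party) instead of per account."""
    totals = defaultdict(int)
    for sender, account, amount in transfers:
        totals[(sender, controlled_by[account])] += amount
    return dict(totals)


print(totals_by_controller(TRANSFERS, CONTROLLED_BY))
```

Looked at per account, no transfer exceeds 5,000; following the "controlled by" relationship first is what makes the 15,000 total visible.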
out financial fraud. In customer facing applications, real-time recommendations
particularly products for retailers, graph-based search and identity
and access management. So, that’s what graphs get
used for. What are they? In much the same way as a relational
database has a set of tables, rows, columns, and schema, let’s talk a little bit about
how graphs are structured. Fundamentally, it
boils down to nodes, relationships, properties, and
labels which we’ll go through. A node is simply an object in the
graph and it can be labelled. A label is like a semantic category, much as you might have an entity in an entity relationship diagram. So, in this particular graph, we’ve got persons
and we’ve got a car. Relationships relate nodes
by type and direction. Relationships in Neo4j
are always directed. You may traverse them
undirected if you want, but they always fundamentally
have a direction. So, a relationship always has a type. So, if you are traversing
a relationship, you can segment that out and say
that you only want to traverse certain kinds of
relationships not just any way that these
two nodes are connected. So, in this particular case, we can tell from
this really simple graph that these two people
love each other, they live with one another. One owns the car but
the other drives it regularly. So, properties are
basically key value maps that get associated with
nodes and relationships, they can go on either. So, by adding properties
on top of these nodes, we realized that it’s Dan and Anne who love each other
and live with one another. Dan has a Twitter handle and when we look at this
relationship drives, Dan drives the car. We can put metadata on
the relationship and assert that that’s
only been since 2011. So, basically you can think of a node and a relationship
as a property container, where properties are
simple key-value maps. Yes sir.>>Is the [inaudible] relationship
is the same as edges, right?>>Yes. So, in the graph world sometimes people refer
to vertexes and edges. We tend to talk about
nodes and relationships, because we find that language
is more accessible. But in the math literature you’ll see vertexes and edges and we’re
talking about the same thing.>>So for the relationships, so there are two persons. Are the relationships
only between nodes or you have some sort
of like ontology on top of person then the relationship is between person as entity
and the car as entity.>>Okay. So, there’s a couple of different ways
of going with that question. The question for those
who didn’t hear it is, is there some sort of
an ontology on the top that specifies what kind of
relationships you can have? We’re going to talk
about constraints. You can assert constraints that certain kinds of nodes must have
certain kinds of relationships. However, those constraints
are optional. Neo4j does not have
an ontology layered on top of it and those schema
constraints are optional. Furthermore, the
constraint may assert that you have to have a relationship
but you cannot for example, assert that there could never be a drives relationship
between two persons. Okay, that would be
a different constraint. Does that answer
your question for now? We’re going to be talking
more about constraints. Any other questions before we go on? Okay. So, just a quick summary here. Nodes are entities with
complex value types, relationships connect them
and structure the domain. Properties are basically
these key value pairs. They tend to express metadata
about your nodes and your logical entities in your domain and labels
group nodes by role. So, usually, we think of labels
as the entities in an ERD. Nodes you would think
of as the instances, the rows in a table. Relationships can be
thought of as joins, which we’ll go into
much more depth on that. Yes.>>How are labels
different from properties? It seems like it’s
a somewhat similar concept.>>That is true. So,
the question for those who if you can’t hear it online is how do labels differ from properties? So, labels are
an optimized indexed way of scanning to a particular subset. So, you actually could
do it either way. You could have nodes with no label at all and then you could
have a property that says, let’s call it type and then say
a node with type equals person. All right. When we get to how
the Cypher query language works, labels are a lot more intuitive
to use in terms of structuring your domain and they’re also more performant underlying in terms of
how the database is implemented. But you could do it either way. If you came onto our community
forums and asked that question, we would probably say
you can do it both ways but please use labels
for lots of reasons. Okay. One of the really
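To make the model concrete, here is a small Python sketch of nodes, relationships, properties, and labels, using the Dan and Anne example from the talk. The class names, the `by_label` dictionary, and the relationship type spellings are assumptions made for illustration; `by_label` is only a toy stand-in for the idea that labels give the database a direct way to reach a subset of nodes, and it says nothing about how Neo4j stores labels internally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set = field(default_factory=set)   # e.g. {"Person"}
    props: dict = field(default_factory=dict)  # key-value properties


@dataclass
class Rel:
    start: Node   # relationships always have a direction: start -> end
    type: str     # and always have a type
    end: Node
    props: dict = field(default_factory=dict)  # relationships carry properties too


class Graph:
    def __init__(self):
        self.nodes = []
        self.rels = []
        self.by_label = {}  # label -> nodes; toy stand-in for a label lookup

    def add_node(self, *labels, **props):
        node = Node(set(labels), dict(props))
        self.nodes.append(node)
        for label in labels:
            self.by_label.setdefault(label, []).append(node)
        return node

    def relate(self, start, rel_type, end, **props):
        rel = Rel(start, rel_type, end, dict(props))
        self.rels.append(rel)
        return rel


g = Graph()
dan = g.add_node("Person", name="Dan")
anne = g.add_node("Person", name="Anne")
car = g.add_node("Car", type="Car")  # also tagged with a 'type' property for comparison
g.relate(dan, "LOVES", anne)
g.relate(anne, "LOVES", dan)
g.relate(dan, "LIVES_WITH", anne)
g.relate(anne, "OWNS", car)
g.relate(dan, "DRIVES", car, since=2011)

# Two ways to find a category of nodes, as discussed in the Q&A:
people_by_label = g.by_label["Person"]                               # direct lookup by label
cars_by_prop = [n for n in g.nodes if n.props.get("type") == "Car"]  # scan every node
```

Both lookups work, which matches the answer given in the talk; the label route avoids touching every node, which is the intuition behind "please use labels".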
cool things about this graph model that
we’re talking about is a property that we would
call whiteboard friendliness. When our field engineers go
out and work with customers, frequently the customers
have not been exposed to graphs before
and they don’t really know how to approach working with
graphs and modeling their data. So, just in a very human open way, we get a whiteboard out, we get some markers and we say, “So, tell me a little bit
about your domain.” So, you’ve got customers and so you draw a little something
on the board and you say, “Well, they buy products. So, let’s draw another
little circle called product and then create
a link between them.” So, you have this
elicitation session if you will where you’re trying to get
them to talk about their domain, what’s important to them, what the data means,
and you draw that out. So, to give you a simple example, if we were talking about movies and actors you might end up with
a whiteboarding session like this. So, you’ve got Tom Hanks, who
acted in Cloud Atlas. Hugo Weaving was also in Cloud Atlas. But I don’t know if
you guys like The Matrix; at Neo4j we love The Matrix. Okay. It’s in the name,
people. All right. We love The Matrix. Anyways, so
Hugo Weaving was Agent Smith, he was in The Matrix. Then Lana Wachowski happens to
direct both of those movies. So you can elicit
this information from the user, get this really rough
whiteboard sketch going. Right? Wow look at
that, that’s a graph. We got a node called
Tom Hanks who acted in Cloud Atlas and
so on and so forth. So, that’s how simple
the translation was. We literally just applied
this on top of it. Then what we’re going to do is
we’re going to slap on some labels and we’re going to “propertify,”
if you will. What do we care about for people? What do we care about for these movies? So, a person who is an actor has
a name and maybe a birth date. When we say that they
acted in a movie, it’s probably important to
know what role they played. So, we’ll give that a role property. Cloud Atlas was definitely a
movie and we’re going to want to track what year it was released
in and so on and so forth. So, when we say
whiteboard friendliness, this is what we’re talking about. Go from, I understand
my domain inside of my own head elicitation session to a rough model that we can query
really quickly. All right. Now, I don’t know about
you. I worked with relational databases for years
and years and years and it’s very easy to get bogged down in these conversations of should
we be third normal form or not? What the data is and
how we think of it often get very radically
separated from one another. Okay. So, this
is a screenshot of what the result of this is going to look like as concrete data
inside of Neo4j. When we get to the demo,
you’ll get to see these springy cool graphs
moving around. So, we’ve talked a lot about graphs, the property graph data model. So, that brings us to Neo4j itself. Hopefully, at this point
in the presentation it’s not going to come as
a big shock or surprise, but Neo4j is a graph database. A couple of properties about Neo4j. We support strong asset transactions, so we are not an eventually
consistent database. You get strong asset guarantees. It’s very very fast and I’m not going to ask you to accept that
as a marketing claim, we’re going to talk about index
free adjacency and I’ll be able to tell you in terms of
data structure why that is. So, we can get two to four million
operations per second per core. It comes with both binary
and HTTP protocols that have a lot of different language supported drivers, we’ll cover that. We have a clustering approach that
provides for high availability. So you can have multi
node clusters and you can survive the failure of
multiple nodes in your cluster and still retain
those strong ACID guarantees and stay in operation,
and no size limit.>>Question.>>Yes.>>You said this is available
in Azure marketplace. So, it’s available as
your own VM that you can run or is it available or do you take this and put it into
your VM as a database? How do you actually do this?>>Or is it like service? Are you offering it as service? Will you offer it here? Yeah. Essentially, you need to install it you need to
take care of it yourself.>>It is not available as
a managed service at this time, that’s something that
we’re actively working on. So, it’s provided as
a VM-based deployment. So yes, you take care of
the VMs once it is launched. Okay. But you can of course, create your own VM
and then install it much as you would any other
software package. But I wouldn’t recommend doing that. We provide it on the Azure Marketplace. It’s way faster to just
launch the version that we offer that’s already configured
nicely and so on and so forth.>>Okay, great.>>Okay. So, it’s a native graph database. It’s schema free. Schema free is a little bit
misleading, it’s schema optional. We have many schema constructs, but you’re not required
to use them. Let’s see. It gives you a really nice developer
workbench that you’ll see. One of our superpowers relative to other graph databases is
the Cypher query language. We’re going to talk a lot about
that and why that is so important. Let’s talk about it right
now. Graph querying. Yes.>>Can we ask a couple of questions?>>Sure.>>Neo4j, how does it
compare with the graph [inaudible] SQL Server that was
recently introduced and with Cosmos DB? Which also supports graph using
different query language, I believe it’s Gremlin.>>I have a very specific
answer for that, but it’s coming a little bit later. Can I park that question
and return to it?>>Okay, because that gets into. I mean, the really short answer is, if you have a graph abstraction
you can do graphs either way. But the underlying
implementation matters a lot in terms of your performance and
scalability expectations. I hope to talk a little bit about how that’s implemented
under the covers, and then when we talk about how
graphs work on top of SQL Server, you’ll see some clear differences.>>Okay.>>Okay. So, don’t let me dodge the question. Okay? It’s just that since we
haven’t talked about Cypher yet, I’m going to give
too much information too soon.>>Okay.>>All right.>>So, Cypher is a pattern matching query
language made for graphs. Now, I’m a big database geek, Neo4j is not the first database
that I’ve ever worked with, I love them all for
different reasons, okay. One of the things that I’m completely unreasonable about at this point in my career is I have to have
a declarative query language. I do not want to write
code that goes and tells the database how to fetch data. I want a declarative language where I express what I want and then it’s the database’s job to
go figure out what’s the best query execution
plan to go do that. Now, if you guys have been using
SQL forever and most people have, you’re used to this. All right. You just express what you
want and you don’t think about which index gets used
first or anything like that. This is an extremely powerful
thing and yet some of the newer NoSQL
databases have trained us to make do with less than that. So, this is the point where
I’m going to be unreasonable, you need a declarative query
language if you’re going to work on a serious database and Cypher is that for graphs. Did
you have a question?>>Yeah. I actually wanted to,
I assume you’re all interested, let you know there’s a little delay. So, especially maybe a little delay. But someone was asking, does
Neo4j support GraphStreams?>>I would ask the questioner
to clarify that. So, yes in the sense that you can ask a query and the result can be a stream of things that you
process as it comes back. But, I’m not sure if I’m getting
to what the questioner is asking. So, Cypher is a pattern matching
Query Language made for graphs. It’s declarative,
hopefully I’ve already convinced you that that’s
a really good thing. It’s expressive, and it’s
focused on pattern matching. Now, if you remember
the whiteboard friendliness point you can probably follow why
pattern matching is important. We want to be able to write a query fluently as we think about
how the data is structured. So, here’s a pattern
in our graph model. We’ve got Dan loves Anne, two nodes and a relationship. What does that look like in
Cypher? It looks like this. A person named Dan loves
a person named Anne. I mean, you can read it
from left to right and it almost looks like
the actual pattern in the graph. So, this :Person is how we tell Cypher we’re
talking about a label. The curly brackets are how we talk about that property
map that we wanted. So, name equals Dan loves
person Anne, labels and properties. You will notice that the things in round brackets, parentheses
if you’re American, are nodes. So, when we ask to
create a pattern, we can do the same thing with labels, properties,
and relationships. So, we can create an entire pattern
in the database just by that, by visually describing it, and just saying, “Hey, go
make me one of those.” We can also match and we
can create these variables. So, a person named Dan
loves whom? Return whom. You guys probably don’t have a whole lot
of Cypher experience, but everybody ought to be able
to tell me what that query does, right? All right, great. So, we’ve got two nodes
and in the second case, we’re creating a match to
a variable on the second node, and then we’re just returning
what that variable should be bound to as a result of
what’s in the database. So, let’s look at a social
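The slides are not part of this transcript, so the Cypher in the comments below is a reconstruction from the spoken description, not a quote. The Python underneath is a toy in-memory imitation of what the MATCH does: filter on a label and a property, follow a typed relationship, and bind the node at the other end to a variable. The data layout and the function name `match_loves` are assumptions for illustration.

```python
# Probable shape of the Cypher described in the talk (reconstruction):
#   CREATE (:Person {name: "Dan"})-[:LOVES]->(:Person {name: "Anne"})
#   MATCH  (:Person {name: "Dan"})-[:LOVES]->(whom) RETURN whom

# Toy store: nodes are dicts, relationships are (start, type, end) triples.
dan = {"labels": {"Person"}, "name": "Dan"}
anne = {"labels": {"Person"}, "name": "Anne"}
RELS = [
    (dan, "LOVES", anne),
    (anne, "LOVES", dan),
]


def match_loves(rels, name):
    """Bind `whom` to every node that a Person with this name LOVES."""
    return [
        end
        for start, rel_type, end in rels
        if "Person" in start["labels"]
        and start["name"] == name
        and rel_type == "LOVES"
    ]


whom = match_loves(RELS, "Dan")
print([node["name"] for node in whom])
```

The difference from Cypher is the point of the talk: here the code spells out how to walk the data, whereas the query only states the pattern and leaves the plan to the database.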
recommendation query example. This is our VP of Products,
his name is Phillip. Here’s one of our product
managers named Andreas. This guy, he’s amazing. I hope you’ll run into him someday. His name is Michael Hunger. So, these guys are friends and they like certain sushi restaurants. So, iSushi serves sushi, Sushi Zam serves sushi. They’re both located in New York. Phillip [inaudible] here
finds himself searching for sushi restaurants in New York
that my friends like. Frequently a lot of these social
recommendation type questions can be phrased as a graph path. So, how you would answer that
in Cypher would look like this. So, I’m looking for a person who
is a friend of somebody else. That friend likes a restaurant, that restaurant is located
in a certain location, and it serves
a certain type of cuisine. The variable bindings that our user has given us is that
the person’s name is Phillip. We’re talking about New York
and he’s interested in sushi. So, these graph patterns with a
couple of variables thrown in, get used to drive
the social recommendation.>>[inaudible] Capital
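As a rough illustration of the recommendation path just described (friend, likes, located in, serves), here is a plain-Python walk over a list of triples. The relationship type names, which friend likes which restaurant, and the "Pizza Place" entry are assumptions added so the filter has something to exclude; the Cypher in the comment is a reconstruction of what such a query could look like, not the slide itself.

```python
# Possible Cypher for this question (reconstruction, names assumed):
#   MATCH (person:Person {name: "Phillip"})-[:IS_FRIEND_OF]->(friend),
#         (friend)-[:LIKES]->(restaurant),
#         (restaurant)-[:LOCATED_IN]->(:Location {name: "New York"}),
#         (restaurant)-[:SERVES]->(:Cuisine {name: "Sushi"})
#   RETURN restaurant.name

EDGES = [
    ("Phillip", "IS_FRIEND_OF", "Andreas"),
    ("Phillip", "IS_FRIEND_OF", "Michael"),
    ("Andreas", "LIKES", "iSushi"),
    ("Michael", "LIKES", "Sushi Zam"),
    ("Michael", "LIKES", "Pizza Place"),  # hypothetical non-sushi restaurant
    ("iSushi", "LOCATED_IN", "New York"),
    ("Sushi Zam", "LOCATED_IN", "New York"),
    ("Pizza Place", "LOCATED_IN", "New York"),
    ("iSushi", "SERVES", "Sushi"),
    ("Sushi Zam", "SERVES", "Sushi"),
    ("Pizza Place", "SERVES", "Pizza"),
]


def out(node, rel_type):
    """Follow outgoing relationships of one type from a node."""
    return [end for start, t, end in EDGES if start == node and t == rel_type]


def recommend(person, city, cuisine):
    """Restaurants liked by the person's friends, filtered by city and cuisine."""
    found = []
    for friend in out(person, "IS_FRIEND_OF"):
        for restaurant in out(friend, "LIKES"):
            in_city = city in out(restaurant, "LOCATED_IN")
            serves = cuisine in out(restaurant, "SERVES")
            if in_city and serves and restaurant not in found:
                found.append(restaurant)
    return found


print(recommend("Phillip", "New York", "Sushi"))
```

The three arguments play the role of the variable bindings mentioned in the talk: the person's name, the location, and the cuisine.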
P, for example.>>Okay, so before the colon is the name of a variable
that’s being bound, after the colon is the label.>>[inaudible].>>Okay, yes. There’s the label on the node, right?>>Simple schema. So, when
we say the label on the node, what does it mean exactly?>>So, we talked a little bit
earlier about how nodes can have a label and that’s like the semantic category
of information it is. So, this node represents
a person and this person, whoever wrote the query, chose the variable name person, so it looks a little bit misleading. But the lowercase person
is the name of a variable, and then the uppercase P
is the label. Basically, what we’re
saying is whatever gets matched to this variable must be labeled person.
Does that make sense?>>If node has multiple labels?>>Then it will match.>>It will match too.>>It will match too. If it’s
labeled both person and enemy, it’s still a person.>>When you put the
names of these cuisines, so are those properties
or what is that?>>Yes they’re properties.>>Okay [inaudible].>>So, when the data was created, this node got created
with that property. So, basically this is
placing a constraint that the only persons
who can match are those having a name property
whose value is Phillip. Okay?>>We got a clarification,
if you want to jump back. So, going back to the [inaudible]
graph streams, they’re asking basically about streams of small graphs, such as
the syntactic structure of sentences, or the graph can come from Twitter, for instance, and a use
case would be extracting patterns in real time [inaudible].>>Okay. Yeah. There are so many ways
I could go with that question. Google Neo4j and NLP. I wrote this long Medium article about natural language
processing with Neo4j, and that one link which I can’t go into that for time
reasons right now has a lot of information
on this topic. On the streaming thing, the person can also
Google Neo4j streams and there’s a Kafka
integration that talks about producing transactions as a stream or consuming streams from Kafka
and putting them into a graph. So, hopefully that’s going to
help without going too deep.>>Okay, cool.>>Okay so, in our earlier example, we had a really tiny graph. Now, imagine this happens in a supermassive graph and you have hundreds and
thousands of friends. Basically, what these queries
are doing is they’re finding the best starting points
and then they’re traversing through the graph
from those starting points. Declarative query for graphs. Sometimes our developer
relations people find these things that
people said on Twitter. That’s like particularly emblematic, like we couldn’t have
said it better if we had our own marketing people do it. So, they cap these things
and then keep them. What I learned in Neo4j training today is that you draw
ASCII Art to code. So, how true is that? Nodes are drawn with parentheses. Relationships are drawn with arrows with additional
details in brackets. For patterns, you connect
nodes and relationships with hyphens, optionally
specifying direction. Now, you’ll notice this is a
relationship going one way. This is the same going
the opposite way. You can traverse it undirected. This is the same either way; it’ll match either way. But the components of a Cypher query basically look a lot like SQL, just with adaptations. So, MATCH and RETURN are keywords, m is a variable, Movie
is a node label. We actually covered
that just a moment ago about how to tell what’s the variable and what is the type of information
you’re trying to match. In this particular query, p, r, and m are variables. Notice that we can bind relationships too and we can return them
as first-class types. We can specify that a
relationship we want to traverse must be the
ACTED_IN relationship. So, yeah this is pretty
straightforward. The only addition here is that
sometimes what we want to match is not the node and not the relationship
but the path itself. So, in this case, what we’re doing is
we’re drawing a pattern, we’re assigning that to a path, and then we’re returning the path. We have a host of built-in functions that allow you to manipulate paths. So, for example, you
can ask how long it is, you can ask which node is in the third position,
so on and so forth. So, graph versus tabular results. If you do MATCH (m:Movie)
RETURN m, okay, basically what you’re going
to get back is a node. If you do RETURN m.title, m.released, you’re going to get a data square and it’s going to be a table
just like any other, right? Properties get accessed by
saying variable.propertyname. So, in this way, you can return graph components,
paths, relationships, nodes, or you can just return tables of information
much as you would with SQL. Not terribly interesting, so I’m
moving quickly through this. Cypher keywords are always case insensitive, and node labels,
relationship types, and property keys are
always case sensitive. So, match on the right with
funky capitalization is fine and ACTED_IN is always
strictly all uppercase with an underscore in
between, no exceptions. So, aggregates in Cypher, they’re a little bit different. We never need to
specify a grouping key, and so in SQL, you have this GROUP BY concept
that does not exist in Cypher. We always group by any non-aggregate
keys in the return statement. So, if for example, you did this, give me all the movies
that this person acted in. You’re going to aggregate that
by the individual actor’s name. Notice there’s no group
by statement here. That’s a thing we want to pull out. This is something that
often trips up people coming from SQL going to Cypher. They’re like, how do I do GROUP
BY, and the answer is you don’t. There’s a bunch of
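To spell out what "grouping by the non-aggregate keys" amounts to, here is a small Python sketch of the actor-and-movies aggregation. The Cypher in the comment is a guess at the kind of query being described (the slide is not in the transcript), and the rows reuse the movie examples from earlier in the talk; `movies_per_actor` is a hypothetical helper, not a Neo4j API.

```python
# A query of roughly this shape (reconstruction):
#   MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
#   RETURN p.name, collect(m.title)
# p.name is not an aggregate, so it acts as the grouping key;
# collect(m.title) is aggregated within each group.

# (actor name, movie title) rows, as the MATCH would produce them.
ACTED_IN = [
    ("Tom Hanks", "Cloud Atlas"),
    ("Hugo Weaving", "Cloud Atlas"),
    ("Hugo Weaving", "The Matrix"),
]


def movies_per_actor(rows):
    """Group by the non-aggregated column (name), collecting titles per group."""
    grouped = {}
    for name, title in rows:
        grouped.setdefault(name, []).append(title)
    return grouped


print(movies_per_actor(ACTED_IN))
```

In SQL terms this is what `GROUP BY name` would do; in Cypher the grouping key is inferred from which returned expressions are not aggregates.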
different aggregate functions at the very back of the presentation. At the end, I’m going
to give a lot of different links and resources. There’s a thing out there
called the Cypher Refcard. If you Google Cypher Refcard, it is like the one-page cheat sheet of everything you could possibly
want to know about Cypher. It’s the 90 percent
solution to most of my problems when I’m
working with Cypher. Let’s talk a little bit about
constraints and indexes. Now, Neo4j doesn’t have
formal schemas as such. But we do support a lot of different kinds of
constraints and indexes. We can create unique constraints. Basically, these allow really fast
look-up of nodes that match by properties and this
is how you would do that in fairly
straightforward English, create constraint on label, assert that a certain
property is unique and so in much the same way as
you’d create a primary key, this is how you would do roughly
the equivalent in Cypher. By the way, that’s unique
with respect to this label, it’s not globally unique
in the database. So, constraints are always
bound to a certain label. So, there are three kinds
of constraints. You have the unique node
property constraint. You have the node property
existence constraints. So, for example, if we
want to create a person, we want to always ensure for data quality that they
always have a name, can never have a person
without a name, and we can create relationship property
existence constraints saying, for example, don’t create
a company record in our database unless you
know who the CEO is. So, company is controlled by CEO. Okay. Company can’t
exist without a CEO. So, in general, indexes
allow fast lookup of nodes just as they
do in other databases. You can create an index like this. This places no particular
constraint on the values, but it drastically improves the selectivity of queries
when they execute, and this is how, in declarative languages, you hint to the database how it’s going to
build a plan and how it’s going to execute
a query efficiently. These predicates all use indexes. So, when you create those
inside of Neo4j too, we have a way of backing
indexes differently. So, you can create indexes backed by Lucene or backed by our native implementation, and there are some other options as well. So, you have some flexibility
with your data types. If you know more about
your data type you can choose a non-standard index type
and improve performance. So, indexes are only used for
finding the starting points for queries and you’ll
find this is really a pattern with graph query overall. Fundamentally, we’re
not scanning through millions and billions of records
and trying to filter that. But rather what we’re doing
is we’re trying to identify starting points and then
traverse out from that. We’ll talk about
index free adjacency. The operation of traversing a relationship is fundamentally
very cheap in Neo4j, very fast. So, that’s why you’re
going to do it this way. We use index scans to look up rows. In relational, you use those index scans to look
up the rows and join, and in graph, you use them just to find the starting points
and then you traverse. Okay. So, one last tricky
thing about Cypher, I want to talk a little bit about
before we move on is MERGE. MERGE is, how many folks
are familiar with upsert? MERGE is
the equivalent of upsert. Okay. So, when you merge, it is create-if-it-does-not-exist. So, say we merge p, a Person
with name Tom Hanks and Oscar true. If there’s not a person node with name Tom Hanks
and Oscar true in the graph, but there is somebody
whose name is Tom Hanks, what do you think’s
going to happen here? It’s going to create the node. If you took off the Oscar is true, it’s going to match entirely on
what is in the MERGE statement. If that exists it does nothing. If it does not exist in its entirety, then it gets created as such. So, one of the biggest
stumbling points with Cypher is somebody runs a query like this
and they already had a Tom Hanks. Now, they have two Tom
Hankses. All right. So, quickly, some write queries. CREATE, that’s pretty much as
straightforward as it gets, right? Okay. We’re going to
create Mystic River, 2003. All right. What we can also do,
if we wanted to modify that but didn’t actually want
to create the record, is match it and then set its tagline to be
this famous quote from the movie. I think I’ve got some co-workers who were real nuts for Mystic River. I was lobbying for a Matrix example, but they went with
Mystic River. Okay.>>You have to be there.>>Absolutely. Okay. So, what if we wanted to create a relationship
between two existing nodes? Well, we would match them both. We’ve got Kevin Bacon, and
we’ve got Mystic River and then we would simply create
a relationship between them. Now, you’ll notice on either
end of that relationship, we’re using a variable which
is already bound to something. So, we’re not saying create a node. That results in only the
creation of the relationship, and what we return from that is p, r, m, all three components. So, we’ve got the merge person, Tom Hanks example, with just the name versus
with Oscar equals true. Suppose you wanted to make
sure Tom Hanks got an Oscar, but you didn’t know whether
he already existed or not, then what you would do is you
would merge just Tom Hanks, guaranteeing that it would
not create one if he already exists and then you would
set p Oscar equals true. This would be the way
that you can get only the Tom Hanks and also
modify him at the same time. Yes?>>So in this scenario, if the [inaudible] Tom Hanks
and Oscar equals true would already be in the database, would this create
a second Tom Hanks node?>>No, it would not. In the merge, it would not create it because
there’s one already existing. That one existing would
be bound to the variable p. And then in the next clause, the p’s Oscar property would
be set to the value true. So, this is shown
specifically to illustrate the difference in
semantics between merge and create. The bottom line is that merge
checks everything that you give it and so you want to merge
only on your key values, if that makes sense, and
then set anything else. In this way, you can do
what you want to do.>>[inaudible].>>Okay, that’s what I’m
trying to illustrate. So, this is the first merge, and this is the second merge. In this merge, what
we’re telling Cypher is, go find me a person who is named Tom Hanks and who has Oscar true. If such a thing does
not exist, create it. That’ll work. That’ll always work. But if there’s just a Tom Hanks who does not have an Oscar property, then you’ll end up
with two Tom Hankses in your database. Make sense?>>Even with merge?>>Even with merge, because
you specify that you only were looking for a Tom
Hanks where Oscar is true. If you do it this way, it will look to see if
there’s any Tom Hanks irrespective of whether or
not he has an Oscar property. It’ll find that Tom
Hanks and then it’ll ensure that whoever that guy
is, he’s got an Oscar. Make sense? In this scenario, you will end up with one Tom Hanks. I point this out because this
is a common stumbling block about merge semantics that is
usually pretty easy to explain, but folks need to
know how that works. So, merge also has
these other two options. You can do On Create and On Match. So, for example, if it
mattered to you whether he was new or not, then you could say, “On Create, give him a timestamp
of when he was created, and specify that as of
his time of creation, he’s never been updated.” But if you actually matched him, then you don’t want to update the created timestamp, because you didn’t create him just now. Instead, you want to
increment his updated counter. In this way, the Cypher planner will tell you whether or not it actually gave you something that already existed, or whether it created something
as a result of a merge. Before we go into data import, any other broad questions
about Cypher? Yes?>>So, can you create relationships
and nodes at the same time?>>Absolutely. If you say create and then you give it a pattern that has any number
of nodes and relationships, it’s going to create all of them at the same time in
the same transaction. Yeah. Yes?>>So, in many examples saying
that person love movie, right? And then return. So what do you for
example a label for a person, the multiple topics. And for movie, the same name of
the movie occurred several times. In this case, what do we return?>>Suppose there are, say, three Tom Hankses or three Mystic Rivers.>>I think they all have
three movies with the same name.>>Yeah. In this case, your variable would be bound
to three different instances. Then if, for example, you said set
the property on this variable, you’d be setting it across all three.>>Not merge. What
will you do on query, just reach without the return?>>Yes, same thing. If you just
did match person named Tom Hanks, and there were actually three, then that variable would be
bound to three different nodes. So if in the next clause, you then set the property
Oscar equals true, it’s going to get set on all three.>>Okay. So, I see. So, for example, a person love movie, movie names show up seven times, and the person name like
there are say five nodes, and then there will be at that
time seven without return? Because each pair of them will
form this love relationship, okay?>>Yes, that’s possible depending on what the data
in your graph is, but we need to get to
a more particular example there. So, it’s like what you’re asking
about is a Cartesian product.>>Yes. I mean, there are many ways because you’re talking about graph. When we talk about graph,
there are many ways to organize your graphs.>>Yes.>>So I’m asking which way do
you choose to organize your graphs?>>I’m not sure I
dodged the question, but it just depends on what you
want out of the database, right?>>That will be
the problem. You really need to infer what I want, right? Now say person love something.
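[Editor’s sketch] The binding behavior discussed in this exchange can be modeled outside the database. What follows is a minimal Python model, not Neo4j code: the `nodes` list, the `match` helper, and the sample data are all hypothetical. It assumes, as described above, that a MATCH variable is bound to every node that fits the pattern, so a following SET touches all of them, and that two patterns with nothing connecting them pair up as a Cartesian product.

```python
from itertools import product

# Toy graph: each node is a dict with a label and properties (hypothetical data).
nodes = [
    {"label": "Person", "name": "Tom Hanks"},
    {"label": "Person", "name": "Tom Hanks"},
    {"label": "Person", "name": "Tom Hanks"},
    {"label": "Movie", "title": "Mystic River"},
    {"label": "Movie", "title": "Mystic River"},
]


def match(label, **props):
    """Return every node with this label whose properties include props."""
    return [
        n for n in nodes
        if n["label"] == label and all(n.get(k) == v for k, v in props.items())
    ]


# Roughly: MATCH (p:Person {name: 'Tom Hanks'}) SET p.oscar = true
# The variable is bound to all three matches, so all three are updated.
people = match("Person", name="Tom Hanks")
for p in people:
    p["oscar"] = True

# Roughly: MATCH (p:Person), (m:Movie) with no relationship between them.
# Every person is paired with every movie: a Cartesian product.
pairs = list(product(match("Person"), match("Movie")))

print(len(people), len(pairs))
```

Anchoring the pattern on a unique starting point, or adding a relationship between the two variables, is what keeps the result from multiplying this way.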
That’s what I really want.>>Well, there’s no
inference that’s happening, right? The database is going to give
you exactly what you asked for, but you need some practice with Cypher to specify precisely
what you’re looking for. So usually, that’s where you kind of go back to what I said earlier, which is you identify the
starting points in your graph, like person named Tom Hanks, and then you traverse out from that. It’s possible to create
Cartesian products using Cypher, but you don’t usually
want to do that.>>What I’m trying to
say is, you can’t really avoid ambiguity in the data. For example, the same table where always show off most of the times. No matter it’s personal [inaudible]. So in this case, whether it’s a precise language, well-defined language or not, you need to deal with this situation.>>What I can say is I’ve worked with Cypher for a couple of years, and I have not run into a situation where I could
not avoid ambiguity. I would welcome
a concrete example of that and then maybe we could work through
how to reformulate the query. Now, it is possible to
express ambiguous queries. But I believe that that’s
true of most query languages. I think we would
really focus on tuning the data model and tuning the
query to get to the specifics. I’m not saying it’s not possible. I’m just saying, if that’s difficult, don’t do it that way. Yes?>>It’s going back to the previous
[inaudible] from the other side. I think you may have covered this [inaudible]>>So, is the question
about match or merge? Or did they not specify?>>What would happen
is exactly if you actually came back with
more than one match?>>Right. Okay. So, if you
use the match keyword, then the variable gets bound to
as many of them as there are. If you use the merge keyword, you’re basically
saying get or create. And so if you say get or create, then it’s going to end up being one.>>So, you could actually find multiple,
or we shouldn’t find [inaudible]?>>Right.>>Okay. That still
applies there. Okay.>>Yeah. Yes?>>So here, the names could be used as identity attributes [inaudible] Is there a way to enforce
uniqueness of identity attributes?>>Yes, absolutely. We
covered that a bit earlier. That’s in the unique constraints, if I can back up to unique node
property constraints. So, there you would say, yeah, you want to create a, where is it? Create constraint on person, assert label.name is unique. Then in this way, if you attempted to insert a second Tom
Hanks, it would fail. So, I mean, it’s really the same as you do it in
many other databases there. All right, let’s see,
moving forward, okay? So, data import. There are so many different options
of loading data into Neo4j, I can’t really cover them all, so we chose to focus on one of the most common,
one of the simplest, that people use the most
frequently, called Load CSV. It lets you take a CSV file
from an HTTP or file URL. It gives you a stream of records and then basically
pipes that stream of records into a
subsequent Cypher query which you can use to create
and update graph structures. It gives you
transactional operations. So, whenever you do Load CSV, it’s happening in the context
of a big transaction. You can transform and
convert stuff as you go and it’s a primary way that when
people are getting started, they insert data into Neo4j. It works up to, say, 10 million
or so nodes and relationships. Because it’s transactionally
bound, if you wanted to put, let’s say, 20 gigabytes
of data into a graph, you wouldn’t want
the overhead of transactions and you would use a separate tool
called Neo4j import, but for simplicity, we’re going to talk about Load CSV as
a simple way to get started. Let’s take an example. Sticking with the movies theme, we have a simple movie CSV file, hopefully this is pretty
self-explanatory. Titles, release years, and taglines. We have a people CSV file that just gives us a name and
when they were born. We have some actors. You’ll notice movies,
roles, and persons. So, this is like implicitly
telling us about an edge in our graph or a relationship that
this person was in this movie. Then you have some directors. So, recalling our data model, a person acts in a movie, a person can also direct a movie because remember they can
be an actor or director. So, for every record in the file, we want to either create a
node for a person or a movie. We want to find a start and end node and then create a
relationship between them. So, we covered those in the basic sections earlier
about how that works. That’s the creating of the
relationship at the bottom. So, how do we actually
read the CSV with Cypher? Using periodic commit,
basically says, batch this up into
smaller transactions, so we don’t like pull in 100 megabytes and then do
all of that all at once. Load CSV is pretty straightforward. With headers, it tells us that the first line of that CSV
will give us metadata. From URL, pretty straight forward. We’re going to call whatever that stream of records
coming back is, we’re going to call those rows, and we’re going to specify
that it’s semicolon delimited, it’s not comma delimited data. Then, that big Cypher
thing that I showed you on the previous slide
where we’re doing all that match and
create comes later. So basically, now
this variable row is bound and we can create graph
patterns with the content of row. If it’s
the movies CSV file, row will have a title attribute
that we can use. Yes.>>Another query, sorry if
that’s taking you back again. So, somebody’s asking,
is there an equivalent of a foreign key in Neo4j?>>No, there is not. Okay, so, that’s
a fantastic question. A foreign key in a
relational database is a key to somebody else’s table and we put that so that we
can do a join, okay? There are no joins in Neo4j, there’s only relationship traversal, and so, foreign keys have no purpose. If I saw a person
writing a Cypher query where they were putting
a foreign key on a node and then they were trying to match a bunch
of other nodes whose ID was equal to this foreign key
that they had on this other node, oh man, that would be a person
who’s working really hard.>>Really hard.>>That’s a person who’s
really trying to defeat the model so that they
can work too hard. So, but the short answer is no. There’s really
no foreign key in Neo4j. Related to this Load CSV stuff; so, in our browser, we have this cool function
where you can type in :play and then give it
a URL and it’ll step you through a nice guide on how to
use some of these features. So, this is how to go through a mini class on importing
data in Cypher. It’s going to step you through
the same example that I showed but it’s going to
make it interactive and executable for you in the
browser. Developing apps. So, we have APIs for most
of the popular languages. When you talk to
a database over the wire, you’re using a binary
protocol we call Bolt. We support Go, Java, Python, JavaScript and a number of others. We also have community support
for a bunch of other languages, I mean, all kinds of
different things. There is a Cypher
transactional HTTP end point, so, you can talk to Neo4j over HTTP. We generally recommend
that people use Bolt much as you would use
JDBC for relational database. There’s a link to language guides
where you can get really simple, just bootstrap me and get me started quickly with Python
type code examples. We have a native Java API as well, where you can actually
launch Neo4j in memory, and you can use the Java API to define user-defined procedures
and access the core API. So, the way that you extend Neo4j
itself is typically in Java, Neo4j is written in Java and you
can extend the Cypher language by writing your own functions and procedures, much as you
can in other databases. Bolt; high-performance,
it’s versioned, it’s based on PackStream,
supports TLS, and it also does a lot of
connection pooling type stuff. So, if you’re talking
to a Neo4j cluster, you would like to talk to just the cluster and not
worry about whether you’re talking to node A or node B and
that is called Bolt plus routing. So, there’s a way that you can set your client app up to use
a Bolt plus routing driver to a cluster and it will worry about routing the query to the right
server automatically for you. So, you just ask a question, get an answer, forget
the cluster topology. You can extend Cypher with
user-defined procedures. We mostly don’t recommend
people do this until they’ve exhausted
the other options with Cypher, so, you can get really
far by reformulating your Cypher query or by putting
the right indexes in place. Sometimes you need to access some third-party API or
you need to do something extremely performance
sensitive and you can write your own function or procedure to do that and then call that from Cypher. That just gets done with
these Java annotations, they’re just Java classes that
are annotated in a certain way. You compile this, you get a JAR file, you drop it into
the plugins directory and you’re pretty much done and your server
has a new extension. Talk a little bit about
the migration of relational to graph because this is
a really big topic for us, because most folks are used to
thinking in terms of relations. Relational, it tends to be simple
until it gets complicated. You end up with all of
these different joins, and how many folks here
have written one of those gigantic monster ugly SQL queries where you’re
joining like eight, nine, 10 tables. Yes.>>[inaudible].>>So, you know relational, now, consider the
relationships themselves. The way we think about the mapping
between these two ideas is, okay, the naive approach. Before you optimize anything, basically think about
all your tables, and think about the table name, turn that into a label, and make the table a set of nodes. Foreign keys become relationships. That’s actually apropos
the question earlier. So, whenever you see
a Primary Foreign Key linkage, you need to be thinking in my model, there’s a relationship there. Link tables that are used
in relational to resolve many-to-many mappings are basically just relationships with
extra properties, typically. Then you throw out all of the primary and foreign keys
that you do not need. So, sometimes, you
need a primary key to look up an item by its identity, but you never need
foreign keys and so you’re storing a lot less data
to begin with. Yes.>>Is there a good approach
to determine whether the data or the problem is more suitable for a relational database or for a graph database, or maybe
for a NoSQL database?>>Yes, we’re getting
to your section, okay? So I can’t defer that much longer. We’re almost there. So we know how
to query a relational database. We just use SQL and
so we do these joins. Most people are familiar with joins. How do we query a graph? What we do is we traverse, we’ve kind of covered this. So, this is starting to get into, when do you use one versus
when do you use the other. So, RDBMSs, it turns out that
they actually don’t handle relationships that well because they can’t model or store
them without complexity. What I mean by that is
they’re introducing artificial extra features
like Foreign Keys, where you are basically propping up the formalism in order to support
this thing that you need to do. The performance degrades
with the number and level of relationships
and database size, and we can reason about
this straightforwardly, in that joins are computed
at query run-time. So, it’s not that
this data is pre-joined, but I have to scan table A, scan table B, match
things up in memory. Now, granted, 30 years of database research has gone
into making that super fast, but you still have to compute
it every single time. Query complexity grows with
the need for more joins, and as you add new kinds
of data and relationships, this is where you get into
the kind of NoSQL thing, where SQL being
fundamentally schema bound, is good in the data integrity sense, and it’s bad in the evolvability
and agility sense, because it tends to be inflexible, it’s not terribly easy to define a new relationship on the fly
and then rewire your graph. So, one of the things
we would say is, when data relationships
are valuable in real-time, traditional databases
aren’t the best choice. The reason for that is
that you’re going to be recomputing these joins
every single time, and you’re going to be doing
typically a lot more scans, of a lot more data than
is strictly necessary. This gets to, when would you
use a graph and not relational? Suppose your question
was, find all reports, and how many people they manage
up to three levels down? If I could give you one slide
that got me into graph databases, this would probably be it. I had this data lineage problem, where I had a directed acyclic graph of data and
the things used to derive that. The question that I
wanted to ask was, give me all reports
that were derived from information sourced by the Air Force. I don’t care if it’s two hops back, or if it’s five hops
back, or if it’s 15. Okay. And oh, man, I spent a week and a half becoming
a SQL paladin learning how to write recursive SQL
using stored procedures, optimizing the hell out of it, not getting very far with it. So, basically, I went
up the mountain, and I consulted with the SQL gods, and I got the absolute best advice. All right? And in the end, it still was terrible. So one of our field engineers, I think they took this. Yeah, this is an example
from a previous customer who ended up coming to the dark side
of graphs, I don’t know. Anyway, somebody actually
wrote this query. I wish I could zoom in. I mean, I don’t want to
waste you guys’ time by reading this thing to you,
but it’s a real query. There it is in SQL, and there it is in Cypher. Now, the magic here in Cypher is that we can specify
a variable number of relationship hops that
we want to traverse. Earlier on, I said that traversing
a relationship is fundamentally cheap and very performant in Neo4j
because of the way it’s set up, and we’re about to get
to how that works. But here, one line of Cypher with some simple constraints and a return clause gets
rid of this much SQL. Why? Because the question and
the data is fundamentally graphy. So, when you take a fundamentally
graphy problem domain, and you put it into tables,
typically pain results. Now if you had a fundamentally
table-based problem where you said, “I have an entry, I just have a customer list
of 300 million customers, and all I ever want to do is pull out which ones are in this
particular zip code.” Relational has been optimized
for 30 years to do that well. I’m not going to try to convince you that that’s fundamentally graphy. But if your problem is
fundamentally graphy, I think we can get
there. So, why graph? This is basically about
modeling your data naturally. Driving the graph model from
the domain and from the use cases, rather than from
your college textbooks that tell you how you need to have it in third normal form in order
to reduce redundancy. Whenever you need to use relationship
information in real time, and whenever you need
this flexibility to add relationships on the fly, you’re probably in
the graph sweet spot. So, relationships are
a first-class citizen, and what we mean by that is, the entire database is focused around relationships
and traversing those. So, it’s not something
that you tack on with scans and with foreign keys
and with primary keys, it’s just baked in. An interesting way that I’ve
heard Neo4j described is, imagine if your database had all
of its Joins pre-materialized, like they had been pre-computed once. Then you’re most of the way
to a graph database. We have query and data locality. Part of the way that we can be faster is that we identify
the starting points, and move out from them rather
than doing these massive scans. Only load what’s needed, aggregate and project as you go and then optimize disk and
memory model for graphs. So, I’m going to get to the index
free adjacency here in a second. If you have a social graph
with a thousand people, and you average
50 friends per person, you end up with
this densely connected graph. If you ask, is there a path from Favad to myself in this social graph, but I never want to go
deeper than four hops. First of all, before we
talk about performance, can we agree that that’s a really ugly SQL query that’s
very difficult to write? Let’s say we warm up the cache, and we eliminate disk I/O
for both databases. These are the observed values. It makes sense if you think about
how these databases are built, and that it is a
fundamentally graphy problem, this shouldn’t be surprising
or a hollow marketing claim. So, this is the secret sauce. This is how it works, and
this is why I’m asking you to believe that this is very much
a more performant database, and it’s not just marketing. So, inside of the database, we use pointers instead of lookups. And so when you have a relationship, it has a pointer on either
end to the place in memory you need to go to find that node. When you traverse a relationship, it does not scan all the nodes and figure out what’s
connected. All right. Second is, we have
fixed-size records, and if you know much about disk I/O, that allows us to rip through
a whole lot very quickly, and to be able to do offset jumps, and index seeks
extremely efficiently. Joins on creation. This is what I meant by
pre-materializing your join. If a relational database is computing it every single
time on the fly, we computed it once when you wrote
the data and then never again. So, there’s fundamentally
a computational cost that we don’t have to pay every
time you traverse the edge. Essentially, the secret sauce
is that you just spin, spin, spin through
that data structure over, and over, and over again, and that is why
traversing relationships is cheap, because it’s pointer
dereferencing in the end.>>So, this whole concept, the secret sauce so to speak. Online, you’ll hear us talk about
this as index free adjacency. So, there are articles you can
look up on index free adjacency, but this is really what’s meant. So, when we say that the secret
sauce is index free adjacency, this is why we are claiming
that relational can’t respond sub-second for n-way joins, and why we claim
that relational is not agile, because it requires changes to queries
in these new data feeds. Now, back to your question. You were asking about SQL Server. So hopefully, I’ve given
an overview about that. Cosmos is a completely
different story because its underlying
architecture is different, but you’d kind of see how it structures its data
differently as well.>>SQL Server recently
added graph community. This is what I was asking about. This is non-relational
as far as I understand. Graph community, graphs, I’m not sure what services they added there. I’m not familiar if that’s part
of it but it’s not relational. It’s supposed to be pure graph. Are you familiar with what that means?>>I’m not, unfortunately. I’d like to take a look at it though.>>It was added recently. I noticed it months
ago. I’m not sure.>>2017.>>Yeah. I noticed it months ago. Maybe it wasn’t- Yeah. So there is something. When it comes to Cosmos DB. Cosmos DB has a way to look
at the data as a graph, and I believe it supports
querying as Gremlin.>>Yeah, that’s right.>>I just wanted to see how
Neo4j compares to Cosmos, especially when it comes to
performance in the case where nodes would have a huge number
of relationships. I mean, current,
millions, relationships. In the past, I was
trying to use Neo4j, what I learned is that, Neo4j does not scale out. Essentially that you can
have a system with cluster, which is a very two or three
or several beefy machines with lots of memory, and that’s all you can have. If you want to add more data, essentially, you will keep them all.>>Okay.>>Is this- has something
changed since that time? I looked at Neo4j in 2014 or 2013.>>So you had like three or four different things
I need to unpack there. So the first is
the supernode pattern, which is the idea of a node having hundreds of
thousands of connections.>>Right.>>Okay. It is true that supernodes in general, within all graph databases, are
considered an anti-pattern. Now, most of the time, when we run into customers
who have this problem, the problem is somewhere
in the modeling layer, and so you may wish to make
a compromise such that you do not end up with nodes
with this crazy degree. Now, sometimes the problem can
be reduced in another way. So it matters a lot for example
whether the node has 200,000 out edges of the same type
or whether they are of different types because that of course affects selectivity
in the database, right? So in general, I wouldn’t
recommend a modeling approach that ends you up with hundreds
of thousands of links per node. Okay? I think you’ll find that common to a lot of
the graph databases. Now on the Gremlin point. So comparatively, between
the Neo4j and Cypher, I’ve used Cypher and
I’ve also used Gremlin. I was beating on this point about declarative graph query languages earlier in part because of Gremlin. Gremlin has a very
imperative feel to it. It has some declarative features
but then generally, you’re telling it how
to traverse things. We find this pretty brittle. So, one of the things that
you’re going to find with Gremlin after you use it for
an extensive period of time, is that as your data
structure changes, you’re going to have
to go back and rewrite your queries because you told it which way to traverse and then
that structure has changed, right? You’re also going to find yourself optimizing the queries
for the database, and so you can do graph
query with Gremlin. I just find it to be a lot
more difficult, sometimes less performant, and less
maintainable as well, right? There was a third point
I think I’m forgetting.>>About scaling out.>>About scaling out, right.
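[Editor’s sketch] The scale-out question comes down to how a cluster divides reads and writes. The answer that follows describes one leader that processes every write, plus followers and read replicas that serve reads. A minimal Python model of that split, under stated assumptions: the `Cluster` class and server names are hypothetical, and the round-robin choice of reader is an illustration only, not a description of how the Neo4j drivers actually balance load.

```python
import itertools


class Cluster:
    """Toy cluster: one leader takes every write, followers serve reads."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = list(followers)
        # Assumption for illustration: rotate reads across followers in order.
        self._next_reader = itertools.cycle(self.followers)

    def route(self, is_write):
        """Return the name of the server that should handle the query."""
        if is_write:
            # All writes funnel through the single leader, so write capacity
            # grows only by making that one machine bigger.
            return self.leader
        # Reads can be spread over as many followers as the cluster has.
        return next(self._next_reader)


cluster = Cluster("leader", ["follower-1", "follower-2"])
writes = [cluster.route(is_write=True) for _ in range(3)]
reads = [cluster.route(is_write=False) for _ in range(4)]
print(writes)
print(reads)
```

Adding a third follower to the list raises read capacity without touching the write path, which is the asymmetry between reads and writes at issue here.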
So the scalability picture is still similar to
what you remember. The way that I describe
scalability in Neo4j is that our cluster architecture
has a leader and followers. You get vertical scalability for writes and you get
horizontal scalability for reads. What that means is, to guarantee
all the ACID properties, you have one leader. Whenever you do a write, it has to be processed
through that leader. So you probably can’t process more writes than one leader can handle, and so you can scale
that leader up. Okay?>>But not out.>>If you want to
scale the graph out, you have the option to add read replicas and additional
followers to your cluster. So, there’s functionally
unlimited scalability for read queries out. Okay?>>Thank you.>>We’ve covered this. So Ambros.>>Yes.>>We’re going to get into
learning resources but we want to start some demos
and begin with Ambros, and then I’ll show some extra stuff.>>All right.>>Come on down.>>All right. So my name is Ambros. I’m from the services
Pentest team of CDG. Today, I’m just going
to show you like a 10-minute demo of
how to import data from CSV into a Neo4j graph
without using Cypher. So it’s a little bit more intuitive. Let me quickly show you
what problem we just deal with. So during the reconnaissance
phase of IR Pentest, we have lots of data
from different tools. So here on the left, we have a direct report to a manager. On the right, we have those aliases back to what
file they recently opened. So that’s from the Delve tool output. Okay? So the thing we
want to do is pretty much just combine them and see
what it looks like in the graph. Okay? So I created a little like helper library console
application that will make it a little bit more
intuitive for you to import data. So let’s take a look at importing
the management data here. So the CSV looks something like this. In my little language
thing that I made here, pretty much you have to define
some metadata about that CSV file. So what nodes do you want to create? So here, I’m saying I want
to create a manager node from that first column, and then I want to create a
report node, the direct report. Then at the bottom here, you specify the relationship, which is manager,
which is this manager, the relationship name
which is manager of, and then the second node,
which is report. Then here, I have the properties. So the ID would be Manager. The name is Manager, and the label,
which is the type, is MS alias. So I’m just gonna
run this real quick. I’ll show you what that
looks like. Hold on. Let me make sure that
the graph is clear. So this is what it
looks like at the end. I’m going to delete it. So here, I have Neo4j running
locally on my laptop. So I’m going to run
the import manager. This, and then you’ll see
that it’s going to take that CSV file and then
load it up over here. Okay. So, here I have the labels
and names applied to these nodes. Okay. So, later on, we might get some more data
about these people, right? So, I’m going to import
the second set of data, which is the Delve data about
what file they recently opened, and then we’re going to
see what that looks like. Put Delve data down here. So, in this case, I made my helper
tool merge on the node IDs, so if it sees a user with the same username, it’s going to
use that node instead of creating a separate node. So, in this case, I’m
going to ID the files by the URL, and then the users are just ID’d by their username
like they were before, and the relationship is “User”
and then “Worked on file”. So, I’m going to run this
“Import Delve Data” down here. Okay. Refresh it now,
then expand here. So, you can see which are the common files
that two users have opened, and this is much easier to
see than looking at CSV files. Yes, and then I guess you can create whatever Cypher queries you want to traverse this graph
if you want, but this solves the initial
hurdle of getting some CSV data into a graph format. One last thing, I wrote
a tool here, so you can find it at aka.ms/csv2graph and you can download the console
application version of this, which is pretty much the same. You just run it like this, you pass it a JSON config file
and then in that JSON config file, here it looks something like this. So, you pretty much say
which folder you want to use and then CSV files with
“direct reports” in the name will have this metadata, which
is similar to what I showed you before, and
then CSV files with “delve” would have
this metadata, and this tells it how to create the Neo4j graph. So, underneath the hood I
do call Cypher queries, but if you’re just learning
this for the first time it may be easier to just define
the properties like so here.>>So you don’t need
to actually go to CSV you have to open
a JSON tool or a Cypher?>>Yes. So, essentially the
library I wrote, it doesn’t necessarily
have to be a CSV, it just has to be an enumerable, this thing, it doesn’t
have to be CSV. So, maybe you can connect to a SQL database or
something and then get your objects in a list and
then import it that way. So, yes, both the library
and the console app are located here. Yes.>>So, as by far the least
technical person in this room, that was super impressive
and super intimidating. So, David, why would I choose to do this method for
getting a CSV into Neo4j as opposed to
the standard dose for your site?>>Let me ask a question
before I answer that. So, did you read the files line by line and then
use, like, create or merge? Or did you take the data from the user and then run
Load CSV in your code?>>No, I didn’t run. It doesn’t
run the Load CSV thing.>>Okay.>>I did with Custom.>>Well, so, a really good reason
why you would want to not use Load CSV would be, for example, if you needed to do something in
your programming language that was well supported in
your programming language but it wasn’t in Cypher, so I’ve seen users do
stuff for example, maybe you get some addresses and you want to run them
through Google’s Geocoding API, and what you want to record in your graph
is latitude and longitude. Okay, you can’t call the
Geocoding API from Cypher, but you can do that
in a C# program. That would be a good reason, right? Another is that you may know more about
the form of the data and you need a really high
performance insert, so, you might choose for example, Load CSV will say that’s
500 records at once. But if you’re doing
a zillion of these, you might want to batch them in a particular way for
performance reasons. But I think the biggest reason would probably be programming-language-specific features, where you’re not really just loading data
but you’re transforming or enriching or cleaning or
doing something else with it. So, like, Load CSV is
a pure Cypher solution, so you can do anything with it that you can do in Cypher, but Cypher is not a general-purpose
programming language; it’s a query language.>>That makes perfect sense. I
mean, that really gets back to the core question of what problem you’re trying to solve, which of course only you can answer.>>Yes. Well, this being
technology there’s always 15 ways, different ways to do the same thing.>>Yes.>>Not necessarily wrong. Do you mind if I ask you, in your learning curve when you
first got started with Neo4j, can you recall anything about it, just for people who are new, that you particularly liked or thought
was good, or sticking points where you particularly had problems or didn’t understand the concept?>>There are. I found the querying by relationships
really helpful. The fact that you can just, I just liked the querying like
you had different levels, you can query down
three levels, that was good. What I didn’t like too much is about, it was around creating
nodes because it’s awkward, you don’t really create nodes much
when you’re using Neo4j graph, so, every time I need to create
something I had to reference back. So, that’s why I created this helper tool so
that I don’t have to remember what the syntax was for creating stuff. Yes. That’s all.>>Half of what engineers do
for a living is automate away the pain and the problems
that they had. So, I mean that’s really cool. The other question is, so,
you have this graph right, and you can put all the CSV in it, have you guys looked at any analysis that you might do on top of that? So, it sounds like your use case
is fundamentally like analysis. If we know all these people
in this department are looking at these
documents, did you ever, have you looked into what are the top three most influential things that are the most widely read? Or you could say, if you knew who wrote these
documents you can say, internally, who were the top five
most read authors at Microsoft?>>Yeah, those are good questions. I’d really have to figure out what
the exact query is for that but.>>It sounds like you
can do that easily?>>Yes.>>David Baldacci or Stephen King
of Microsoft. Right?>>Yes.>>Yeah. So, with that as->>With whom? With probably
significantly lower numbers. I don’t know, I trust
big companies. [inaudible]>>So yeah, that’s
pretty much my demo. You can get it at that URL
and just feel free to reach out to me if
you have any issues.>>That is possible.>>So I’ll hand it back to you. Thank you.>>All right. So I’m
going to take you through a couple of
different demo-related things. We’re going to talk a little bit
about tooling and I’ll show you how the
software actually works. Just before we begin, as I said, this is on the Azure marketplace
and so when I showed up today, I deployed a three node
cluster. Here it is. I’ve got my core node set, got all my Azure resources. I can show you logging
into this later and show you how the cluster
topology works, but basically anybody can do this. You go to the marketplace
right here and I type in Neo4j and we get
lots of different options. 3.5.1 is the latest, so Neo4j enterprise is a single node setup where
you’re going to get one machine and causal cluster
is a multi node cluster setup, and so if you’re just looking
to play you do not need three VMs to do that or nine or 15, it depends on how you want to scale. But Neo4j enterprise will do, if you want the high availability
guarantees then you would go for causal cluster 3.5.1 right
there and launch that. So quickly, I want to
show you this tool, this is called Neo4j desktop, and so inside of Neo4j desktop you
can create lots of local graphs and have multiple instances
of the database running just on your machine if
you don’t want the cloud setup. So here I’ve got
a Microsoft demo graph which I’m going to start right now. Inside of Neo4j desktop you get
these things called graph apps. These little tiles up here at the top are applications that you can
run against the Neo4j Graph. So what Ambros was showing earlier is this application
here called Neo4j Browser. A lot of the times when
users first come to Neo4j, this is their first point of entry. Really, it’s just the cipher
command executions shell with some graphic stuff
built on top of it. So, I saw him execute this query. We’re going to do
the same thing on mine. You’re going to be shocked
to learn that I have movie data in my demo set for today. So, I’ve got this big graph. This tool will allow
you to for example, select nodes and I can for example
change the colors of all of them and I can change how they’re captioned and make some nodes bigger than others and
so on and so forth. It’s a force-directed layout. But basically, it is a command shell. You can run queries and then capture CSV directly as a result of this. One thing if you’re
playing with this, I would recommend that you check
out the play command, colon play. It will run in-browser
guides where you can do learning and
examples and tutorials. So for example, a lot of the stuff
that I’ve shown you today, if you do play movies
and then hit enter, it will step you
through a little guide. We’re not going to do this right now, but this is what
the guide looks like. It’ll tell you how to create this graph and you’ll see
these little play buttons. I can click on the play button and it’ll automatically insert that. So as I step through the tutorial, I can execute stuff and start
to play really quickly. All right. So that is Neo4j browser. I’m going to keep a picture of my graph up because
we might need to compare it when we look at
some of this other stuff. We also have this tool called Bloom. Bloom is for exploratory
visualization. So whereas browser is kind
of for command execution, run Cypher, get a result. See, the trouble with that
is that you have to know Cypher and you have to be
a data engineering sort. Bloom is a more natural-language-focused
tool with better visualization where you’re going to start
in a particular point, not execute an analytic query, but you’re looking for patterns, you’re looking to
identify bigger issues. So I can say for example, a person named Tom Hanks and you’ll see that it interprets
that as a graph pattern. It thinks what I mean is match a person with a name
property Tom Hanks. That is what I mean. Here’s Tom. Let’s command E to expand him and we blow out
just his immediate hops. Then we might say, ooh, You’ve Got Mail was not
a great movie, let’s skip that one. Let’s see, what was a great movie? Joe Versus The Volcano.
Anybody seen that? That’s some knowing laughs
back there, all right. It’s about a man who gets
convinced that he has this fake brain condition and he’s
going to jump into a- anyway. I can’t go into that. But let’s expand Joe Versus The Volcano because
it’s a funny movie. We’ve got Nathan Lane, John Patrick Stanley, and
Meg Ryan who were in that movie. They’re all connected to
Joe Versus The Volcano. We can then explore further and say, I happen to know that Meg Ryan
was also in You’ve Got Mail. So if I expand
Meg Ryan, look at that. Sleepless in Seattle pops
up as a movie they were both in and You’ve Got Mail. So we can see the commonalities. Let’s pick Tom, Meg, and You’ve Got Mail and dismiss
everybody else and then refocus. So this kind of GUI
is basically meant to do these kinds of
exploratory visualizations. A lot of our financial
clients, for example, they might have an idea of how fraud is happening
within a financial graph, but not really be able
to quite nail it. So you can use this for hypothesis exploration where you say I think this is
the pattern that’s happening, now let me go look
for evidence of that and then rinse, lather, repeat to build an overall picture
of how the network works. As Ambros said earlier, it’s just so much easier to actually see this data as a graph sometimes, that it spurs a lot of
discovery in that way. So that’s a quick overview of Bloom. We also have a graph app here called Halin which is for monitoring. Let’s close this one and do
instead our cloud instance, see if we can get that working. Here inside of Neo4j desktop I’ve got the second tile Azure deploy. That is this deploy that I set up at the very beginning of
the meeting and I’m going to activate this as
a remote database not running on my machine and I’m going to open
my little Halin graph app. What I’m going to
see here is a lot of operational metrics about
a multi-node cluster. So along the top you’ll see, these three IP addresses are the machines that are
participating in my cluster. The one with the little star
is the leader of the cluster that takes all the writes and
the other two are followers. Basically, this is
a rather in-depth program, we’re not going to go
all the way through it, but you can hover here
and you can see it’s got this green status,
everything’s looking good. That’s because it’s basically
not doing anything right now, the cluster is idle. Then we can look at individual
machines and for example, take a look at what the memory or the disk on this machine is doing, what plugins do we have installed
and how is it configured. So for example, I can
type in here, oops. I can see what addresses it is
advertising to the broader network. One last thing about Halin
is in the diagnostics tab. You can run diagnostics and it
will gather a lot of information about your cluster and give you recommendations on what looks good, what might be misconfigured, what you might want to
look into and this sort of helps users diagnose some of the most common problems associated with configuring clusters and so on. So any questions about
the graph app concept? We have an API online so that if you wanted to write simple
applications sitting on top of Neo4j, this is a pretty good way to do it. In the end, they’re
just JavaScript programs typically with
a certain JavaScript stub API injected that allows
you to connect to and talk to the graph with
a Bolt driver instance. Okay. Yes.>>Is the same monitor UI
available from the web interface?>>Yes, it’s the same.>>Okay.>>The only difference
here is that it’s wrapped inside of a graph arc,
but it’s the same.>>All right.>>Yes. All right. So that’s kind of Neo4j, the nickel tour of Neo4j desktop. Let’s talk a little bit about the analytics and data science parts. I want to show you a simple
or maybe not so simple query. This is it right here. Let’s get to the right browser tab. So we bootstrapped a whole lot
of knowledge into you today about what graphs are, what Cypher is, how all
of this stuff works. We have this library that
comes packaged as a plugin called Neo4j graph algorithms
or just called graph algos. For data science groups, data engineering groups, really, really recommend you take a look
here because this is all of the graphy data science stuff
here that you’re not going to find in other libraries and
that Neo4j makes really easy. These sorts of
things are going to be quite difficult in other systems. So the graph algorithms
plugin is basically a compilation of a lot of different algorithms across
a lot of different use cases. This documentation, I’ve always
thought was pretty good because not only does it
describe like the syntax of how you’re going to call
this or that algorithm, but it also goes into what is Louvain about and
when would you use this and if your problem looks like this then you probably want
PageRank instead of something else. Some of the most fun things
that I have done working for Neo4j have been
interacting with this. So, a quick story. Neo4j has this data
journalism outreach program where we use graph technology, work with journalists who’re
doing investigations, help them get answers and then they write up the story
and publish it. When I was brand new to Neo4j
I got to work with NBC news. They were doing coverage of Russian Twitter trolls manipulating elections in the United States. Using the graph algorithms package, we loaded a whole lot of
Twitter data in that was given to us by the sources that
the journalists had cultivated. We were looking for community issues. That is, what topics were the trolls most
frequently talking about, how did they break down across
certain demographic lines, and all of that social
network data is fundamentally graphy and lent itself well to the algorithms that
are in this package. Getting to play with that and doing those kinds of data analyses
is really fun. Primarily, I was using at the time the community
detection algorithms. We were looking at all the
trolls and saying, if you got rid of all of
the civilians and you looked at just the people who were involved
in these instigating behaviors, who was talking to who? Who was retweeting who? It turned out that they broke
into multiple communities. Prior to the analysis and
the publishing NBC did, the thought had been there’s
this big group of trolls and they want a particular candidate to win and that’s really
the end of the story. It just wasn’t. At least that’s
not what was in the data. We found that there were a group of trolls who were
aligned with right causes, a group of trolls aligned with
left causes and a third group, and that the overall modus
operandi if you will, or what they were trying to do was
more about social division and less about advocating for a
particular person in that election. So that stuff got written
about in NBC news, and we don’t have to
go too deep into it, but for data science people, I just want you to know
that what went into that reporting was Neo4j
and graph algorithms. This is where you should
start if you want to do data analysis with Neo4j. So, as a simple example, I did a query for harmonic closeness
on my movie graph. This is just computing, whoops. Oh yeah, I didn’t start my database. That’s going to help. Databases are much more
responsive when they’re running. So let’s take a look at this query. Basically, what we’re getting is this metric coming out of
the harmonic closeness algorithm. The metric is centrality. I believe with this
particular algorithm the metric in and of itself
is not meaningful, so it’s not like 0.286
means something, but magnitude and
directionality is meaningful. So this is basically about, when you look at the graph, how central is a node, or
how influential is it in the overall flow of
information throughout the graph. So, the way that this query works is that
we call algo.closeness.harmonic.stream, and so fundamentally, when we use the CALL keyword we’re
calling a stored procedure. We’re giving it
some parameters where we’re looking for nodes and
ACTED_IN relationships, and then basically we
yield some data and then basically match nodes
in the graph that were yielded out of this
harmonic closeness and then return whether it’s
a movie or an actor. We order by centrality in
descending order to find the most central nodes first
and limit to the top 20. Our buddy, Tom Hanks
comes out on top. The reason for that is that when
we created this sample dataset it was centered on Tom Hanks and we
tended to add data around him. So it’s not really surprising in this particular dataset that
he comes out as very central. Also not surprising is a lot
of stuff that he’s in, shows up being central. Because as we expand it out
and add it to this dataset, we tended to tack on co-stars and directors of
the movies that he had been in. So if for example, you did this on
the entirety of Twitter, you would probably find
at the center PewDiePie, Lady Gaga, and all of those, right? Because that’s who
people are retweeting, that’s who’s sending out the most content and that’s who’s
referenced the most often. So one thing I can’t
really do justice in less than a full day
class is the depth of how much is in this
algorithms stuff. So basically, we group these into
families of algos, if you will. So sometimes what you care about is centrality and that’s
what I just showed you. Sometimes what you care
about is community detection. So, I do have a whiteboard, maybe the cameras will follow me. So sometimes if you
have a graph like this, let’s say you have a graph like- Let’s say your graph looks like that. Totally made up, but you can see that while there
are a bunch of nodes and edges, there are three distinct
communities in there. So the family of community
detection algorithms, what they’re trying to
do is give you this, one, two, three. A lot of these algorithms as with many machine-learning algorithms and other things that you’ve used, they come with a lot of
tweakable and tunable parameters. So if you fuzz it enough, hey, it’s all one community. If you make the community strict enough then you have as many
communities as you have nodes here. So as with many other algorithms
that you’re going to work with, there’s some tweaking
and tuning according to the domain that you’re dealing with and how specific
that you want to get. But that’s kind of
community detection. Now, if you look at
this particular graph, if we ask about
the centrality of this graph, you don’t really get much
meaningful out of that because there isn’t
really any node with the possible exception of this guy that’s really
central in this graph. On the other hand, if the pattern
that you’re looking at is like spoke-and-hub like this, and you run a centrality
type approach on that, here we can clearly see even visually even without
our fancy algorithm, what is the most central
node in that graph. I know you guys know
this from data analysis, it’s just that the technical features offer us all of these cool
algorithms that we can run, but then we have to
have a whole lot of domain understanding of what our data is so that we can fit
this abstract math thing that does something really cool
in software and then fit that to our domain so that we know
which algorithm to use and when. That is the real art, that’s why we get paid
the big bucks, right?>>I have a question.>>Yes.>>Just a use case
that we’re looking at, we want to actually color
some of the nodes based on some criteria [inaudible]
to improve the graph. Is there maybe a means to do
something like that through these APIs, or do we maybe need
to do something more custom?>>Both depending. So there
are path finding algorithms, that’s a whole class where that
might be what you want to do. It depends on how
sophisticated your paths are. Sometimes what you want to do is
like an iterative algorithm where you find a path and then you
set a property on that path->>I see.>>-and then you just
iterate on that and then you expand it out time
and time and time again, expanding out which nodes get this magic property each time.
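
As a rough illustration of the iterative idea just described (find what is reachable, set a property, then expand again), here is a minimal Python sketch. It is not the graph algorithms library and it is not Cypher: the function name `propagate_flag`, the edge list, and the seed nodes are all made up for illustration, and a plain in-memory edge list stands in for the graph.

```python
from collections import defaultdict


def propagate_flag(edges, seeds):
    """Iteratively expand a flag from seed nodes along directed edges.

    edges: iterable of (source, target) pairs (hypothetical sample graph).
    seeds: nodes that start out flagged.
    Returns the set of flagged nodes once an iteration adds nothing new.
    """
    downstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)

    flagged = set(seeds)
    frontier = set(seeds)
    while frontier:
        # One iteration: everything directly reachable from the frontier.
        reached = set()
        for node in frontier:
            reached |= downstream[node]
        # Only nodes that were not flagged before expand the next round.
        frontier = reached - flagged
        flagged |= frontier
    return flagged


# Made-up example graph, not from the talk.
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "e"), ("x", "y")]
print(sorted(propagate_flag(edges, {"b", "d"})))  # ['b', 'c', 'd', 'e']
```

In a real graph database the flag would presumably be a property written back onto the nodes on each pass; here it is just membership in a Python set.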
You know what I’m saying?>>Yeah.>>So, it depends on
what you want to do, but the path finding
algorithms are helpful for that, but in many cases, because Cypher’s kind of good
at that out of the box, sometimes the path finding
algorithms are overkill for that and you just
need some simple Cypher. Can you say a little bit more about what kind of paths you want to find and what
meaning do they have for you?>>Yes essentially, I work
on the build-out team, we’re building out
clusters for Azure.>>Okay, cool.>>Each of those build-outs has a certain series of jobs that happen. These jobs also have dependencies that need to go through as
we travel through that. So, sometimes certain jobs
get blocked or delayed, then we wanted to re-figure the critical path based on
what we had to do there.>>Yeah, that’s a dependency graph. So you’re talking about something
like an ARM template. Is that the kind of
job in front here?>>Essentially, you have a cluster live and you’re already doing it, this is still in the initial
build out for VS VVR.>>So, you’ve got some
job one triggers job two, triggers job three, and then you
get to this good place, right?>>Yeah.>>But job one also
triggers J4 which triggers J5 which is needed for
J3 and J5 is blocked.>>Right.>>Right.>>So you’ll color J5 differently.>>Yeah. We’ll just put a green circle around it and
say job five is blocked, right? So, this is very similar to
the network in IT operations use cases where sometimes
what people are doing with graphs here is these
are routers and machines, and what they want to ask is what in this chain if it fails will mess
up my ability to deliver this. Okay. So, basically what
you might want to do is ask a query like start finish. All right. Okay. Then so you
want to enumerate all paths, all distinct paths from
start to finish and then you want to go through
those paths and ask if any component on that
path is blocked. That’s fairly straightforward
to do in Cypher, because we know how to match a path, we know how you can use
the DISTINCT keyword to say, I want distinct paths. Then you can use functions in
Cypher to pull out the nodes, or you could say, and I’m just making this
up as we go, MATCH p = (a)-[*]->(b) WHERE b.blocked = true RETURN p. I’m
just going really fast. Hopefully this is legible. You can say match all paths where I’m going from some starting
point to some node. That node is blocked,
give me that path. Then you might further
this and say, b. Now I’m getting really small, and then say finish. Then that right there would
identify the blocker. Okay.>>So you don’t even
need one of those, you can just [inaudible]?>>Well, yeah that’s
right. That’s why I was thinking that path finding algorithms might be overkill because you can do
this sort of thing. Then you can sort of, you
could then for example color that node by setting
a new property on it.>>Yeah.>>Then if you did that, say, every 30 seconds, then
separately scan for all of my blocked guys and then have
some resolution approach for that.>>Okay.>>Does that answer your question?>>Yeah, it does. Thank you.>>Similarity algorithms and pre-processing functions
and procedures. We’re coming up to the
end of our time here. I want to make sure that we are respectful of everybody’s time
and leave some time for Q and A. I’ve gone through a lot of algorithms and
a lot of stuff. I could talk for hours
about just this piece, but it gets really deep really
fast and as an intro topic, I just want to leave you with
this as a point of exploration. Finally, when you get- this is only
going to take a minute or two. When you get lost where do you go? How do you get help? This is, don’t worry about writing
these links down or anything like this because we’re going to
distribute a copy of the slides, so all this is going to
be clickable for you. Everything that I’ve told you today you’re going to be able to find in written documented form out
here on our developer pages. We have a Graph Academy where you can get self-paced online training, we have a Neo4j certified
program where you can go through an exam process
to get Neo4j certified, become a professional and get
that sweet sweet Neo4j swag, right? I mean this being the tech industry, I know you guys don’t
have any T-shirts at all and you’re desperately
in need of some, and I want you to know
that we’re here to help. Okay. So, the one hour
certification exam covers a lot of the topics that we’ve covered today and goes into
a little bit more depth. We told you earlier about
the Cypher Refcard. Like if you only had one link for how do I do something in Neo4j, a single link, you don’t even want the whole developer manual,
this would be it. This is the cheat sheet of
everything that you need. Obviously we have
developer documentation, operations manuals for
how to run clusters. We have a knowledge base as well. So paying customers, get access
to a lot of extra articles. We have a very large number
of public articles as well as frequently asked questions, they’re version-specific so
you can find out about how this worked in this older
version whatever you need. Lots of sample applications
that you can git clone, compile and go. We are very passionate
about our user community and we have regular meet-ups
all over the world, we have a community site, community.neo4j.com where you
can get connected with some of these and you can meet some of
my favorite fellow graph nerds. That’s also a great point of
entry to just ask a question. We have a fantastic
Developer Relations Team that tends to jump on, helping people who are
out there and wanting to get them started and have
a good early experience with graphs. You know, Neo4j group grew
up in the open source world. So, if you want to know, I don’t know maybe you wake
up late on a Friday night, when you should be
asleep and you say, “How did they implement
index free adjacency? I must go to GitHub and know.” You can do that. We
have Stack Overflow. One of the early ways I got
involved in the community is I answered a zillion
questions out there on Stack Overflow and so a lot of different things that
you might want to do have already been covered. Books, man, it’s endless. Okay. Some colleagues who I
have tremendous respect for, we have two people
Mark Needham and Amy Hodler, who are in the process of publishing
a book on graph algorithms. So, it’s going to be a very deep dive on all of the stuff that I can’t
cover today for time reasons. So, when that O’Reilly book comes out I really highly
recommend you pick up a copy because bigger brains
than me, I’ll tell you that. Anyways. That brings us to Q and A. I want to be here to help you have the smoothest early process and get any of your questions
answered and please don’t spare me the hard ones. Anket.>>So, in the community
has anyone attempted algorithms which are used for probabilistic types
of models on Neo4j, like reinforcement learning
or bandits, those kinds of things?>>Yes, I am myself not
super deep on that. I’m not sure I can answer
that really thoroughly. But yes, there’s a guy named Andrew Jefferson who
has written a couple of posts on this topic and Octavian AI I think is the group
that he’s working with. So, that’s a fairly deep
rabbit hole. Like how does Neo4j
connect to deep learning, and a lot of those related topics. That’s an area that we’re
increasingly getting into. There’s some publishing out there right now and
there’s more to come.>>Second question I had was, are there any plans for Neo4j
connecting with Power BI?>>Power BI. So, I think in community
open source there’s a number of ways that you
can do that right now. There is a JDBC driver where you can write Cypher and expose a table. Okay. There’s not a Power BI specific integration
that I’m aware of, but we are always, I work in partnerships and so on
and we’re always looking to hear about what integrations
are most valuable and why. We’d love to talk to you
more about that offline.>>Hey, you guys are troopers. It’s like five O’clock, and you’ve been listening to
graph stuff for two hours. Yes?>>Right. You already
mentioned the problem or anti-pattern of supernodes. I was wondering
what other problems people are typically running into and what are the early symptoms
of those problems?>>That’s a good one, and that’s not an easy one either. All right. Hey, you took me seriously
when I said ask the hard ones. Okay. All right. The supernodes are
definitely an anti-pattern. Neo4j does not index
relationship types, or sorry, relationship properties, and so an anti-pattern that
I see is over-reliance on really dense property metadata
on the relationships. Okay? So, for example, if the way that you
design your model is with very thin nodes and very fat relationships
with 20 or 30 properties, and then you want to
write a lot of queries that are very selective on
relationship properties, this is generally not a good idea. Okay. That’s one
anti-pattern. Let’s see. Super fat nodes, where you don’t split things out, are another anti-pattern. So, recall that I’ve
another anti-pattern. So, recall that I’ve
been saying all along, relationships are very
cheap to traverse, right? So, let me take
a super simple example. Let’s say that I have a customer. In a relational database, we’re going to give them a
name and a state and a city. Let’s just keep it that simple. Okay. In relational databases, we tend to think about tables, and so what we do is we pile lots and lots and lots and
lots of properties on, and this breeds queries
where you’re going to be searching for all of the customers who are
in the state of Virginia. I’m from Virginia, so I’ll
use that as an example. A better way to do this is to take your low-dimensional
categorical variables and to break them out into nodes. So, for example, I would
not do it this way. I would do it like this. Customer name has state, has city. Okay. So, an anti-pattern is having lots and lots of really fat nodes
and not that many relationships. In this way, you are defeating
what the database does well, and you’re falling back on your old relational scan
and filter skills. Does that make sense? Was there
a second part to your question that I missed, or did I answer it?>>Early symptoms.>>Early symptoms.
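The customer example above can be sketched in a few lines of Python. This is only an in-memory illustration, not Neo4j code: the `Graph` class, the relationship names `HAS_STATE` and `HAS_CITY`, and the sample customers are all assumptions made up for the sketch. It contrasts the scan-and-filter query over fat customer records with a traversal that starts from a state node.

```python
from collections import defaultdict

# Fat-node model: every attribute lives on the customer record (hypothetical data).
fat_customers = [
    {"name": "Alice", "state": "Virginia", "city": "Richmond"},
    {"name": "Bob", "state": "California", "city": "Richmond"},
    {"name": "Carol", "state": "Virginia", "city": "Norfolk"},
]


def customers_in_state_scan(customers, state):
    """Relational-style answer: look at every customer and filter on a property."""
    return sorted(c["name"] for c in customers if c["state"] == state)


class Graph:
    """Tiny adjacency-list graph; nodes are (label, key) tuples."""

    def __init__(self):
        # incoming[target][rel_type] -> set of source nodes
        self.incoming = defaultdict(lambda: defaultdict(set))

    def relate(self, source, rel_type, target):
        self.incoming[target][rel_type].add(source)

    def sources(self, target, rel_type):
        return self.incoming[target][rel_type]


def build_graph(customers):
    """Break the categorical properties out into their own nodes."""
    g = Graph()
    for c in customers:
        customer = ("Customer", c["name"])
        state = ("State", c["state"])
        # City is keyed together with its state so same-named cities stay distinct.
        city = ("City", (c["city"], c["state"]))
        g.relate(customer, "HAS_STATE", state)
        g.relate(customer, "HAS_CITY", city)
    return g


def customers_in_state_traverse(g, state):
    """Graph-style answer: start at the state node and follow its relationships."""
    return sorted(name for _, name in g.sources(("State", state), "HAS_STATE"))


graph = build_graph(fat_customers)
print(customers_in_state_scan(fat_customers, "Virginia"))
print(customers_in_state_traverse(graph, "Virginia"))
```

Both functions return the same customers; the difference is that the traversal only touches the nodes attached to the one state node instead of inspecting every customer.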
Bad query performance and also the other symptom would be your graph model looks exactly the same as
your relational model. That would be a giveaway. So, if you have these different affordances
and different flexibilities, it ought to look at least
a little different. All right. Was there a question on that?>>I was just going to say
you can also define that state has a city in the graph model?>>Yeah, probably so. Actually, yeah. This would help distinguish between Richmond,
Virginia and Richmond, California which I can tell you
are very different places. Yes?>>So in general, you
try to model data [inaudible] but found out
it’s kind of awkward to model the time series data in
some way [inaudible] what’s your recommendation
generally after you adopt a different database model, like a graph kind of thing?>>This, man, you’re really
teeing up the softballs for me. Okay. So, in earlier
versions of Neo4j, it used to be that we recommended this Time Tree approach and that’s probably what
you’re talking about. So, in the Time Tree approach, just for folks who have not seen
the Time Tree approach, you would have a 2019 node
linked to a January node, linked to a, what is today anyway, the 24th day node, and then you would
link that to a customer call, I don’t know, let’s
say a customer called. This would be a Time Tree. So if you wanted to find all the
calls that happened in January, you would match to the January 2019 node and then get all
the nodes from there. Neo4j 3.4 introduced
the temporal data types, so you can have times and datetimes as
a native data type on a property, and we have optimized
indexes for that, and so once the temporal types
and temporal indexing came out, we stopped doing Time Trees. You just put a datetime on, you index it, and that’s that.>>It’s like stacking a value on one property with
a timestamp or something? Like if you have a history
of different values for the same property, do you stack them on the same
property with a timestamp?>>So, if you have
the history of values, there’s a different pattern for that. So, basically, if part of the problem of
your domain is that you have the customer today, but you want to know
every state the customer has ever had in the past, then we’re going to basically model
the customer as a linked list, and the head of the list is always the current version. So, taking our example right
here and sketching it out, see, you’ve got this
customer call on 1/24/19, and then we’ll have a next link, and then we’ll have
the same one on 1/23/19, and so on and so forth. So, when people need to
do revisions in graphs, what they do is they basically
never modify the node. They create a new one
with duplicate data, except for whatever they need to change, and the old one gets pushed back on the linked list. Does that make sense? In this way, you can traverse the chain and it’s
like a time machine, right? You can navigate to that node through the query and
then you can go back, however many iterations you need
until you hit some time point.>>I see. So, the time stamp you
mentioned was basically just an extra attribute,
or property rather.>>Yes, sir, it’s
just another property. It’s just another property that
has the datetime data type.>>Okay. All right.>>Yes. I’m going to get to
your question in just a second. I’m going to just pull
that up because why not. So, yeah, so that’s
what I’m talking about. It’s just a property data type. Yes?>>Just a query. So, in that example
that you just heard of, what would the node be [inaudible].>>Yeah.>>With a different property
for the different dates?>>That’s exactly right.>>Okay, yeah.>>It would be
a different property with a different date but otherwise
the same data, and in this way, we would basically be keeping every revision of the
node that we have, enabling the time
machine aspect here. Call that Git for graphs. Because you can even branch
with it, you really can. I’ve seen people do that.
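The revision chain described above can be sketched as a small linked list in Python. This is a minimal illustration under stated assumptions, not Neo4j code: the names `Revision`, `revise`, `history`, and `as_of`, and the sample customer data, are hypothetical. The `previous` field plays the role of the next link between versions, and the head is always the newest revision.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterator, Optional


@dataclass(frozen=True)
class Revision:
    """One immutable version of a node; older versions hang off `previous`."""
    data: dict
    valid_from: date
    previous: Optional["Revision"] = None


def revise(head: Revision, changes: dict, when: date) -> Revision:
    """Never modify the old node: copy its data, apply the changes, link back."""
    return Revision({**head.data, **changes}, when, head)


def history(head: Revision) -> Iterator[Revision]:
    """Walk the chain from the newest revision back to the oldest."""
    node: Optional[Revision] = head
    while node is not None:
        yield node
        node = node.previous


def as_of(head: Revision, when: date) -> Optional[Revision]:
    """Time machine: the newest revision that was already valid on `when`."""
    for rev in history(head):
        if rev.valid_from <= when:
            return rev
    return None


v1 = Revision({"name": "Alice", "state": "Virginia"}, date(2019, 1, 23))
v2 = revise(v1, {"state": "California"}, date(2019, 1, 24))
print(as_of(v2, date(2019, 1, 23)).data["state"])
```

Branching, in this sketch, would simply mean calling `revise` twice on the same older revision, which gives two heads that share the same history.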
Any other questions? Any online? Right. Thanks a lot, guys. We hit our time, we’re over by two minutes.>>Thank you.

Introduction to Neo4j and Graph Databases

