
Reddit Meme Graph

On Saturday night, after not quite enough drinks, I came across these tweets by @LeFloatingGhost.

That definitely looks like a meme graph.

We can do that too

Session recording

If you want to see how I worked through this live, you can watch the recording of my session.

For an interactive version of this post, have a look at the Graph Gist collection.

Find us some memes

There is a really nice CSV file with the top-rated memes from Reddit.

Also grab an empty Neo4j sandbox from https://neo4jsandbox.com.

What does the data look like?

Inspect the CSV

WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' as url
LOAD CSV WITH HEADERS FROM url AS row
RETURN count(*);
╒══════════╕
│"count(*)"│
╞══════════╡
│"1000"    │
└──────────┘
WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' as url
LOAD CSV WITH HEADERS FROM url AS row
RETURN row limit 3;
╒════════════════════════════════════════════════════════════════════════════════════════════════════╕
│"row"                                                                                               │
╞════════════════════════════════════════════════════════════════════════════════════════════════════╡
│{"over_18":"False","name":"t3_1edsw9","permalink":"https://www.reddit.com/r/memes/comments/1edsw9/can│
│_we_please_start_a_crazy_amy_meme_for_amy_of/","url":"https://www.quickmeme.com/meme/3uer85/","domain│
│":"quickmeme.com","distinguished":null,"score":"1831","downs":"1010","link_flair_css_class":null,"su│
│breddit_id":"t5_2qjpg","thumbnail":"https://b.thumbs.redditmedia.com/qpz4enS1CCFIs8Ys.jpg","id":"1eds│
│w9","author_flair_css_class":null,"link_flair_text":null,"selftext":null,"ups":"2841","num_comments"│
│:"120","edited":"False","title":"Can We Please Start a Crazy Amy Meme For Amy of Amy's Baking Compan│
│y?","created_utc":"1368627364.0","is_self":"False"}                                                 │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
...

Load the memes

WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' as url
LOAD CSV WITH HEADERS FROM url AS row
WITH row LIMIT 10000
CREATE (m:Meme) SET m=row // we take it all into Meme nodes

Added 100 labels, created 100 nodes, set 1700 properties, statement completed in 120 ms.
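To make the effect of `SET m = row` a bit more tangible, here is a rough plain-Python analogue. It is only a sketch: the inline sample and its three columns are hypothetical stand-ins for the real memes.csv, and the helper name `rows_to_nodes` is made up for illustration. It assumes that empty CSV fields come back as null (as in the row shown earlier) and that null values are not stored as properties.

```python
import csv
import io

# Hypothetical three-column sample; the real memes.csv has many more columns.
SAMPLE_CSV = (
    "id,title,selftext\n"
    "abc123,First example title,\n"
    "def456,Second example title,some text\n"
)

def rows_to_nodes(csv_text, limit=10000):
    """Turn CSV rows into property dicts, one per would-be Meme node."""
    nodes = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        if index >= limit:  # mirrors WITH row LIMIT 10000
            break
        # Assumption: an empty field is a null, and a null is not stored as a property.
        nodes.append({key: value for key, value in row.items() if value != ""})
    return nodes

nodes = rows_to_nodes(SAMPLE_CSV)
print(len(nodes), nodes[0])
```

All values stay strings here, just as the scores and counts in the LOAD CSV output are quoted strings.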

Fetch some memes

MATCH (m:Meme) return m limit 25;
MATCH (m:Meme) return m.id, m.title limit 5;
╒════════╤════════════════════════════════════════════════════════════════════════════════╕
│"m.id"  │"m.title"                                                                       │
╞════════╪════════════════════════════════════════════════════════════════════════════════╡
│"1edsw9"│"Can We Please Start a Crazy Amy Meme For Amy of Amy's Baking Company?"         │
├────────┼────────────────────────────────────────────────────────────────────────────────┤
│"1ihc34"│"Given the competitive nature of redditors, I assume you all feel the same way."│
├────────┼────────────────────────────────────────────────────────────────────────────────┤
│"1gmt99"│"This man left this woman..."                                                   │
├────────┼────────────────────────────────────────────────────────────────────────────────┤
│"1ds9y4"│"How to cure bad breath..."                                                     │
├────────┼────────────────────────────────────────────────────────────────────────────────┤

But we want the words!

Let's take the first meme and start working on it.

Split the text into words.

MATCH (m:Meme) WITH m limit 1
RETURN split(m.title, " ") as words;
["Can","We","Please","Start","a","Crazy","Amy","Meme","For","Amy","of","Amy's","Baking","Company?"]

CAN YOU HEAR ME?

MATCH (m:Meme) WITH m limit 1
RETURN split(toUpper(m.title), " ") as words;
["CAN","WE","PLEASE","START","A","CRAZY","AMY","MEME","FOR","AMY","OF","AMY'S","BAKING","COMPANY?"]

Remove punctuation

Splitting a string of punctuation marks on the empty string gives us an array of single characters.

return split(",!?'.","") as chars;
[",","!","?","'","."]

Then replace each of those characters with the empty string ''.

with "a?b.c,d" as word
return word,
       reduce(s=word, c IN split(",!?'.","") | replace(s,c,'')) as no_chars;
╒═════════╤══════════╕
│"word"   │"no_chars"│
╞═════════╪══════════╡
│"a?b.c,d"│"abcd"    │
└─────────┴──────────┘

We got some nice words

MATCH (m:Meme)  WITH m limit 1
// upper-case the title, strip the punctuation, then split the text into words
RETURN split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words;
╒═════════════════════════════════════════════════════════════════════════════════════════════════╕
│"words"                                                                                          │
╞═════════════════════════════════════════════════════════════════════════════════════════════════╡
│["CAN","WE","PLEASE","START","A","CRAZY","AMY","MEME","FOR","AMY","OF","AMYS","BAKING","COMPANY"]│
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
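For readers who find the nested `reduce`/`replace`/`split` expression hard to parse, the same three steps can be sketched in plain Python. This is only an illustration of the logic, not part of the Cypher session; the function name `title_to_words` is made up here.

```python
PUNCTUATION = ",!?'."  # same characters as in the Cypher query

def title_to_words(title):
    """Upper-case the title, strip punctuation, split on single spaces."""
    text = title.upper()
    for char in PUNCTUATION:  # plays the role of reduce(...) over the characters
        text = text.replace(char, "")
    return text.split(" ")

words = title_to_words(
    "Can We Please Start a Crazy Amy Meme For Amy of Amy's Baking Company?"
)
print(words)
```

Like the Cypher version, this splits on a single space, so a title with double spaces would produce empty-string words.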

Enough words, where are the nodes?

Let's create some Word nodes

(MERGE does a get-or-create)

MATCH (m:Meme)  WITH m limit 1
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m
MERGE (a:Word {text:words[0]})
MERGE (b:Word {text:words[1]});

Our first two words

MATCH (n:Word) RETURN n;

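Unwind a range

But we want every word in the array, so let's unwind a range of indexes. The range runs from 0 to size(words)-2 (Cypher ranges are inclusive), so that words[idx+1] never runs past the end of the array.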

MATCH (m:Meme)  WITH m limit 1
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m

UNWIND range(0,size(words)-2) as idx // turn the range into rows of idx

MERGE (a:Word {text:words[idx]})
MERGE (b:Word {text:words[idx+1]});
MATCH (n:Word) RETURN n;
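The index arithmetic in the UNWIND step is easy to get wrong, so here is a small Python sketch of the same pairing. The helper name `adjacent_pairs` is hypothetical. Note that Python's `range` excludes its upper bound while Cypher's includes it, which is why the bound differs by one.

```python
def adjacent_pairs(words):
    """Pair each word with its successor, like words[idx] and words[idx+1]."""
    # Cypher: range(0, size(words)-2) is inclusive, so idx runs 0 .. len(words)-2.
    # Python: range(len(words) - 1) yields exactly the same indexes.
    return [(words[idx], words[idx + 1]) for idx in range(len(words) - 1)]

pairs = adjacent_pairs(["CAN", "WE", "PLEASE", "START"])
print(pairs)
```

A one-word title yields an empty range and therefore no pairs at all.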

No limits

MATCH (m:Meme) WITH m // no limits
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m

UNWIND range(0,size(words)-2) as idx // turn the range into rows of idx

MERGE (a:Word {text:words[idx]})
MERGE (b:Word {text:words[idx+1]});
MATCH (n:Word) RETURN count(*);

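Chain up the memes

Connect the words via :NEXT and store the meme IDs in an ids property on each relationship.

For the first word (idx = 0), let's also connect the Meme node to that first Word.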

MATCH (m:Meme) WITH m
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m
UNWIND range(0,size(words)-2) as idx // turn the range into rows of idx
MERGE (a:Word {text:words[idx]})
MERGE (b:Word {text:words[idx+1]})

// Connect the words via :NEXT and store the meme-ids on each rel in an `ids` property
MERGE (a)-[rel:NEXT]->(b) SET rel.ids = coalesce(rel.ids,[]) + [m.id]

// to later recreate the meme along the next chain
// connect the first word to the meme itself
WITH * WHERE idx = 0
MERGE (m)-[:FIRST]->(a);

Set 546 properties, created 614 relationships, statement completed in 65 ms.
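As a way to picture what the statement above leaves behind, the following Python sketch builds the same structure with dictionaries: one entry per distinct word pair holding the list of meme IDs (the `ids` property on :NEXT), plus the first word of each meme (the :FIRST relationship). The two memes and their IDs are hypothetical, and `build_graph` is an illustrative name, not something from Neo4j.

```python
from collections import defaultdict

PUNCTUATION = ",!?'."

def title_to_words(title):
    """Upper-case, strip punctuation, split on single spaces."""
    text = title.upper()
    for char in PUNCTUATION:
        text = text.replace(char, "")
    return text.split(" ")

def build_graph(memes):
    """memes maps meme id -> title; returns (next_ids, first_word)."""
    next_ids = defaultdict(list)  # (word_a, word_b) -> meme ids, like rel.ids
    first_word = {}               # meme id -> first word, like (m)-[:FIRST]->(a)
    for meme_id, title in memes.items():
        words = title_to_words(title)
        for idx in range(len(words) - 1):
            # MERGE reuses an existing pair; the id is appended either way
            next_ids[(words[idx], words[idx + 1])].append(meme_id)
            if idx == 0:
                first_word[meme_id] = words[0]
    return dict(next_ids), first_word

# Hypothetical sample data, not taken from the CSV
memes = {"m1": "Scumbag brain strikes again", "m2": "Scumbag brain wakes me up"}
next_ids, first_word = build_graph(memes)
print(next_ids[("SCUMBAG", "BRAIN")])
```

Two memes sharing the pair SCUMBAG, BRAIN end up on the same entry, which is why the relationship count stays well below the total number of word pairs.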

Done!

MATCH (m:Meme)-[:FIRST]->(w:Word)-[:NEXT]->(w2:Word)
RETURN * LIMIT 33;

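Which words appear most often

MATCH (w:Word)
WHERE size(w.text) > 4 // size() gives the string length; length() is reserved for paths in current Cypher
RETURN w, size( (w)--() ) as relCount // on Neo4j 5 and later, use COUNT { (w)--() } instead
ORDER BY relCount DESC LIMIT 10;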
╒══════════════════╤══════════╕
│"w"               │"relCount"│
╞══════════════════╪══════════╡
│{"text":"AFTER"}  │"56"      │
├──────────────────┼──────────┤
│{"text":"REDDIT"} │"34"      │
├──────────────────┼──────────┤
│{"text":"ABOUT"}  │"33"      │
├──────────────────┼──────────┤
│{"text":"TODAY"}  │"33"      │
├──────────────────┼──────────┤
│{"text":"SCUMBAG"}│"32"      │
├──────────────────┼──────────┤
│{"text":"EVERY"}  │"31"      │
├──────────────────┼──────────┤
│{"text":"FIRST"}  │"30"      │
├──────────────────┼──────────┤
│{"text":"ALWAYS"} │"28"      │
├──────────────────┼──────────┤
│{"text":"FRIEND"} │"27"      │
├──────────────────┼──────────┤
│{"text":"THOUGHT"}│"24"      │
└──────────────────┴──────────┘

Now let's find our meme again

// first meme
MATCH (m:Meme) WITH m limit 1
// from the :FIRST :Word follow the :NEXT chain
MATCH path = (m)-[:FIRST]->(w)-[rels:NEXT*..15]->() // let's follow the chain of words starting
// from the meme, where all relationships contain the meme-id
WHERE ALL(r in rels WHERE m.id IN r.ids)
RETURN *;

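Show a meme by ID

We can also pick a meme from the CSV listing, for example the one with ID '1kc9p2', whose title says, roughly, that even a stupid meme can make a valid point.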

MATCH (m:Meme) WHERE m.id = '1kc9p2'

MATCH path = (m)-[:FIRST]->(w)-[rels:NEXT*..15]->()
WHERE ALL(r in rels WHERE m.id IN r.ids)

RETURN *;
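The two queries above both walk from the meme's first word along :NEXT relationships whose `ids` list contains the meme's ID. A minimal Python sketch of that walk, over hypothetical data shaped like the graph, might look as follows. The function name `follow_chain` and the sample IDs are assumptions for illustration. Unlike the Cypher query, which returns every matching path, this sketch simply stops when the next step is ambiguous (for example when a word occurs twice in one title).

```python
def follow_chain(meme_id, first_word, next_ids, max_hops=15):
    """Rebuild a meme's word sequence from the :NEXT-style pairs."""
    words = [first_word[meme_id]]
    used = set()  # a relationship is used at most once per path
    for _ in range(max_hops):  # mirrors the *..15 upper bound
        steps = [
            pair for pair, ids in next_ids.items()
            if pair[0] == words[-1] and meme_id in ids and pair not in used
        ]
        if len(steps) != 1:  # end of the chain, or more than one candidate
            break
        used.add(steps[0])
        words.append(steps[0][1])
    return words

# Hypothetical data: two memes sharing their first two words
next_ids = {
    ("SCUMBAG", "BRAIN"): ["m1", "m2"],
    ("BRAIN", "STRIKES"): ["m1"],
    ("STRIKES", "AGAIN"): ["m1"],
    ("BRAIN", "WAKES"): ["m2"],
    ("WAKES", "ME"): ["m2"],
    ("ME", "UP"): ["m2"],
}
first_word = {"m1": "SCUMBAG", "m2": "SCUMBAG"}
print(" ".join(follow_chain("m2", first_word, next_ids)))
```

Filtering on the meme ID is what keeps the walk on the right branch at BRAIN, where the two memes diverge.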

That's it. Enjoy!

P.S.: If you want to connect your own content, you can grab a Neo4j sandbox or run Neo4j on your own machine. If you have any questions, ask me, Michael, on Twitter or on Slack.