Reddit MemeGraph
Find Us Some Memes
There is a very nice CSV file of top memes from Reddit.
Grab an empty Neo4j Sandbox from https://neo4jsandbox.com to follow along.
What Does the Data Look Like?
Inspect the CSV
WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' as url
LOAD CSV WITH HEADERS FROM url AS row
RETURN count(*);
╒══════════╕
│"count(*)"│
╞══════════╡
│"1000"    │
└──────────┘
WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' as url
LOAD CSV WITH HEADERS FROM url AS row
RETURN row limit 3;
╒════════════════════════════════════════════════════════════════════════════════════════════════════╕
│"row"                                                                                               │
╞════════════════════════════════════════════════════════════════════════════════════════════════════╡
│{"over_18":"False","name":"t3_1edsw9","permalink":"https://www.reddit.com/r/memes/comments/1edsw9/can│
│_we_please_start_a_crazy_amy_meme_for_amy_of/","url":"https://www.quickmeme.com/meme/3uer85/","domain│
│":"quickmeme.com","distinguished":null,"score":"1831","downs":"1010","link_flair_css_class":null,"su│
│breddit_id":"t5_2qjpg","thumbnail":"https://b.thumbs.redditmedia.com/qpz4enS1CCFIs8Ys.jpg","id":"1eds│
│w9","author_flair_css_class":null,"link_flair_text":null,"selftext":null,"ups":"2841","num_comments"│
│:"120","edited":"False","title":"Can We Please Start a Crazy Amy Meme For Amy of Amy's Baking Compan│
│y?","created_utc":"1368627364.0","is_self":"False"}                                                 │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
...
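Each row arrives as a map, so the column names can be listed without reading the whole wrapped record. A minimal sketch that reuses the same URL as above; the alias `columns` is just an illustrative name:

```cypher
WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' AS url
LOAD CSV WITH HEADERS FROM url AS row
// keys() returns the keys of the row map, i.e. the CSV header names
RETURN keys(row) AS columns LIMIT 1;
```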
Load the Memes
WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' as url
LOAD CSV WITH HEADERS FROM url AS row
WITH row LIMIT 10000
CREATE (m:Meme) SET m=row // we take it all into Meme nodes
Added 100 labels, created 100 nodes, set 1700 properties, statement completed in 120 ms.
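LOAD CSV reads every field as a string (note the quoted "1831" in the sample row), so numeric columns stay strings after `SET m=row`. If you want to sort or compare by score later, an optional conversion step could look like the sketch below. The choice of these three columns is an assumption; nothing later in this guide depends on it.

```cypher
// Optional: convert a few numeric columns from strings to integers
MATCH (m:Meme)
SET m.score = toInteger(m.score),
    m.ups = toInteger(m.ups),
    m.num_comments = toInteger(m.num_comments);
```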
Get Some Memes
MATCH (m:Meme) return m limit 25;

MATCH (m:Meme) return m.id, m.title limit 5;
╒════════╤════════════════════════════════════════════════════════════════════════════════╕
│"m.id"  │"m.title"                                                                       │
╞════════╪════════════════════════════════════════════════════════════════════════════════╡
│"1edsw9"│"Can We Please Start a Crazy Amy Meme For Amy of Amy's Baking Company?"         │
├────────┼────────────────────────────────────────────────────────────────────────────────┤
│"1ihc34"│"Given the competitive nature of redditors, I assume you all feel the same way."│
├────────┼────────────────────────────────────────────────────────────────────────────────┤
│"1gmt99"│"This man left this woman..."                                                   │
├────────┼────────────────────────────────────────────────────────────────────────────────┤
│"1ds9y4"│"How to cure bad breath..."                                                     │
├────────┼────────────────────────────────────────────────────────────────────────────────┤
But We Want the Words!
Remove Punctuation
Create an array of punctuation characters by splitting a string on the empty string.
return split(",!?'.","") as chars;
[",","!","?","'","."]
Then replace each of those characters with the empty string ''
with "a?b.c,d" as word
return word,
reduce(s=word, c IN split(",!?'.","") | replace(s,c,'')) as no_chars;
╒═════════╤══════════╕
│"word"   │"no_chars"│
╞═════════╪══════════╡
│"a?b.c,d"│"abcd"    │
└─────────┴──────────┘
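To make the `reduce` step more concrete: the accumulator `s` starts as the input string, and each punctuation character `c` is removed from it in turn. A small sketch on a made-up title (the sample string is an assumption, not taken from the CSV):

```cypher
WITH "It's done, right?!" AS title
// s starts as title; every c from the punctuation list is replaced by ''
RETURN reduce(s = title, c IN split(",!?'.", "") | replace(s, c, '')) AS cleaned;
// cleaned: "Its done right"
```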
We Get Some Nice Words
MATCH (m:Meme) WITH m limit 1
// let's split the text into words
RETURN split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words;
╒═════════════════════════════════════════════════════════════════════════════════════════════════╕
│"words"                                                                                          │
╞═════════════════════════════════════════════════════════════════════════════════════════════════╡
│["CAN","WE","PLEASE","START","A","CRAZY","AMY","MEME","FOR","AMY","OF","AMYS","BAKING","COMPANY"]│
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
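One thing to watch for when splitting on a single space: a title with consecutive spaces can yield empty-string entries, which would later turn into an empty Word. A hedged sketch of filtering them out with a list comprehension; the sample string is made up for illustration:

```cypher
WITH "HELLO  WORLD" AS s   // note the two spaces between the words
RETURN split(s, " ") AS raw,
       // keep only the non-empty entries
       [w IN split(s, " ") WHERE w <> ""] AS words;
```

The queries in this guide do not apply this filter; it is only an option if stray empty words show up in your graph.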
Enough, Where Are the Nodes?
Let's create some Word nodes
(MERGE does a get-or-create)
MATCH (m:Meme) WITH m limit 1
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m
MERGE (a:Word {text:words[0]})
MERGE (b:Word {text:words[1]});
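To see the get-or-create behaviour of MERGE in isolation, the sketch below can be run twice. The `DemoWord` label and the `seen` property are hypothetical names used only for this illustration, so the real Word nodes stay untouched.

```cypher
// First run: no match, the node is created and ON CREATE fires.
// Second run: the node is found and ON MATCH fires instead.
MERGE (w:DemoWord {text: 'HELLO'})
ON CREATE SET w.seen = 1
ON MATCH SET w.seen = w.seen + 1
RETURN w.text, w.seen;

// clean up the demo node afterwards
MATCH (w:DemoWord) DELETE w;
```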
Unwind a Range
But we want everything in the array, so let's unwind a range.
MATCH (m:Meme) WITH m limit 1
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m
UNWIND range(0,size(words)-2) as idx // turn the range into rows of idx
MERGE (a:Word {text:words[idx]})
MERGE (b:Word {text:words[idx+1]});
MATCH (n:Word) RETURN n;
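The `range(0, size(words)-2)` bound stops one short of the last index, so that `words[idx+1]` always exists. A small sketch on a literal list shows the pairs this produces, without touching the graph:

```cypher
WITH ["CAN", "WE", "PLEASE", "START"] AS words
UNWIND range(0, size(words) - 2) AS idx
RETURN idx, words[idx] AS a, words[idx + 1] AS b;
// idx | a        | b
// 0   | "CAN"    | "WE"
// 1   | "WE"     | "PLEASE"
// 2   | "PLEASE" | "START"
```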
No Limits
MATCH (m:Meme) WITH m // no limits
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m
UNWIND range(0,size(words)-2) as idx // turn the range into rows of idx
MERGE (a:Word {text:words[idx]})
MERGE (b:Word {text:words[idx+1]});

MATCH (n:Word) RETURN count(*);
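Every MERGE on `:Word {text: ...}` has to look up an existing node first. On larger data sets a uniqueness constraint (which is backed by an index) typically makes that lookup cheaper. This is an optional sketch, not part of the original guide. The constraint name `word_text` is made up, and the exact syntax depends on your Neo4j version.

```cypher
// Neo4j 4.4+ / 5.x syntax (assumption about your version)
CREATE CONSTRAINT word_text IF NOT EXISTS
FOR (w:Word) REQUIRE w.text IS UNIQUE;

// Older 3.x / 4.x releases use this form instead:
// CREATE CONSTRAINT ON (w:Word) ASSERT w.text IS UNIQUE;
```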
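Chain Up the Memes
Connect the words via :NEXT and store the meme IDs on each relationship in an ids property.
For the first word (idx = 0), we also connect the Meme node to the first Word.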
MATCH (m:Meme) WITH m
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m
UNWIND range(0,size(words)-2) as idx // turn the range into rows of idx
MERGE (a:Word {text:words[idx]})
MERGE (b:Word {text:words[idx+1]})
// Connect the words via :NEXT and store the meme-ids on each rel in an `ids` property
MERGE (a)-[rel:NEXT]->(b) SET rel.ids = coalesce(rel.ids,[]) + [m.id]
// to later recreate the meme along the next chain
// connect the first word to the meme itself
WITH * WHERE idx = 0
MERGE (m)-[:FIRST]->(a);
Set 546 properties, created 614 relationships, statement completed in 65 ms.
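Because each :NEXT relationship collects the IDs of every meme that uses that word pair, the size of `ids` tells you how often the pair occurs. A minimal sketch for listing the most common pairs; the column aliases are illustrative names:

```cypher
MATCH (a:Word)-[rel:NEXT]->(b:Word)
// one entry in rel.ids per occurrence of the pair a -> b
RETURN a.text AS first, b.text AS second, size(rel.ids) AS occurrences
ORDER BY occurrences DESC LIMIT 10;
```

Note that a meme repeating the same pair twice adds its ID twice, so this counts occurrences rather than distinct memes.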
Which Words Appear Most Often?
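MATCH (w:Word)
// size() measures string length; newer Neo4j versions reserve length() for paths
WHERE size(w.text) > 4
// on Neo4j 5+, write COUNT { (w)--() } instead of size( (w)--() )
RETURN w.text, size( (w)--() ) as relCount
ORDER BY relCount DESC LIMIT 10;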
╒══════════════════╤══════════╕
│"w"               │"relCount"│
╞══════════════════╪══════════╡
│{"text":"AFTER"}  │"56"      │
├──────────────────┼──────────┤
│{"text":"REDDIT"} │"34"      │
├──────────────────┼──────────┤
│{"text":"ABOUT"}  │"33"      │
├──────────────────┼──────────┤
│{"text":"TODAY"}  │"33"      │
├──────────────────┼──────────┤
│{"text":"SCUMBAG"}│"32"      │
├──────────────────┼──────────┤
│{"text":"EVERY"}  │"31"      │
├──────────────────┼──────────┤
│{"text":"FIRST"}  │"30"      │
├──────────────────┼──────────┤
│{"text":"ALWAYS"} │"28"      │
├──────────────────┼──────────┤
│{"text":"FRIEND"} │"27"      │
├──────────────────┼──────────┤
│{"text":"THOUGHT"}│"24"      │
└──────────────────┴──────────┘
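Strictly speaking, relCount measures how many relationships a word has, not how many times it occurs: each word pair is merged into a single :NEXT relationship, however many memes use it. If you want raw occurrence counts, one option is to count directly from the titles. A sketch using the same cleaning expression as before; `word` and `freq` are illustrative names:

```cypher
MATCH (m:Meme)
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") AS words
UNWIND words AS word            // one row per word occurrence
WITH word, count(*) AS freq     // group by word
WHERE size(word) > 4
RETURN word, freq
ORDER BY freq DESC LIMIT 10;
```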
Now Let's Find Our Meme Again
// first meme
MATCH (m:Meme) WITH m limit 1
// from the :FIRST :Word follow the :NEXT chain
MATCH path = (m)-[:FIRST]->(w)-[rels:NEXT*..15]->() // let's follow the chain of words starting
// from the meme, where all relationships contain the meme-id
WHERE ALL(r in rels WHERE m.id IN r.ids)
RETURN *;
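The same pattern can also rebuild the title text from the chain. This is a sketch under a few assumptions: the title has at most 16 words (the `*..15` bound), and it does not repeat a word pair, since repeated pairs share one relationship and can make the chain ambiguous. The aliases `original` and `rebuilt` are illustrative.

```cypher
MATCH (m:Meme) WITH m LIMIT 1
MATCH path = (m)-[:FIRST]->(w)-[rels:NEXT*..15]->()
WHERE ALL(r IN rels WHERE m.id IN r.ids)
// the variable-length match also returns every shorter prefix, so keep the longest path
WITH m, path ORDER BY length(path) DESC LIMIT 1
// nodes(path)[1..] skips the Meme node and keeps only the Word nodes
RETURN m.title AS original,
       trim(reduce(s = '', n IN nodes(path)[1..] | s + ' ' + n.text)) AS rebuilt;
```

The rebuilt text is upper-cased and without punctuation, because that is how the words were stored.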

Show a Meme by ID
We can also pick a meme from the CSV list, e.g. ID '1kc9p2', whose title says that memes, however stupid, can still make a valid point.
MATCH (m:Meme) WHERE m.id = '1kc9p2'
MATCH path = (m)-[:FIRST]->(w)-[rels:NEXT*..15]->()
WHERE ALL(r in rels WHERE m.id IN r.ids)
RETURN *;
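To find other IDs worth trying, you can search the titles first. A minimal sketch; 'SCUMBAG' is just an example term taken from the frequency table earlier:

```cypher
MATCH (m:Meme)
// compare in upper case so the match is case-insensitive
WHERE toUpper(m.title) CONTAINS 'SCUMBAG'
RETURN m.id, m.title LIMIT 5;
```

Any `m.id` returned here can be substituted for '1kc9p2' in the query above.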

Done. Enjoy!
P.S. If you want to connect your own data, grab a Neo4j Sandbox or use Neo4j on your own machine. If you have questions, feel free to ask me (Michael) on Twitter or Slack.