<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Erin Grand</title>
<link>https://eringrand.github.io/</link>
<atom:link href="https://eringrand.github.io/index.xml" rel="self" type="application/rss+xml"/>
<description>A blog built with Quarto</description>
<generator>quarto-1.8.26</generator>
<lastBuildDate>Fri, 05 Dec 2025 00:00:00 GMT</lastBuildDate>
<item>
  <title>New Blog - Who Dis?</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/new_blog_who_dis/</link>
  <description><![CDATA[ 





<p>I started my blogging journey as a homework assignment in grad school. At the time, Jekyll was new, exciting, and easy to spin up. I found a theme I liked, contributed to it on GitHub to add all the elements I needed, got an internship from the person who ran the theme (Thanks, Barry!), and stuck with it till now.</p>
<p>There’s nothing wrong with my (old?) blog, but I’ve been using Quarto docs at work and wanted to experiment with Quarto powering my blog. I don’t write new posts very often, so I also don’t have <em>that</em> many posts to transition. It seemed like a fun challenge!</p>
<section id="task-1-get-old-blog-file-structure-into-the-new-file-structure" class="level2">
<h2 class="anchored" data-anchor-id="task-1-get-old-blog-file-structure-into-the-new-file-structure">Task 1: Get old blog file structure into the new file structure</h2>
<p>A Quarto blog’s file tree looks like this:</p>
<pre><code>├── 404.html
├── 404.jpg
├── CNAME
├── _quarto.yml
├── about.qmd
├── index.qmd
├── posts
│   ├── _metadata.yml
│   └── my_first_post
│       ├── index.qmd
│       └── stockphoto.png
├── profile.png
└── styles.css</code></pre>
<p>Each post has its own sub-folder under the <em>posts</em> directory. The text of each post is inside an index.qmd file, which contains date and tag metadata in the YAML.</p>
<p>On the other hand, my Jekyll blog posts are Markdown files under the <em>_posts</em> folder, organized by date in the filename.</p>
<pre><code>  ├── 404.md
  ├── CNAME
  ├── _config.yml
  ├── _includes
  ├── _layouts
  ├── _plugins
  ├── _posts
  │   ├── 2015-02-15-my_first_blog.md
  ├── _sass
  ├── about.md
  ├── archive.html
  ├── images
  │   ├── 404.jpg
  │   └── stockphoto.png
  ├── index.html
  ├── style.scss
  ├── tag_index.html</code></pre>
<p>So the first task was to convert all my Markdown blog posts into their own subfolder under the <em>posts</em> directory. I would also like to grab any metadata out of the posts while doing so.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb3-2"></span>
<span id="cb3-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># md posts filenames &amp; location</span></span>
<span id="cb3-4">posts <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list.files</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"eringrand.github.io.raw/_posts/"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">full.names =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb3-5"></span>
<span id="cb3-6">posts_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">post_loc  =</span> posts, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># full filepath</span></span>
<span id="cb3-7">                        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">post_name =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">basename</span>(posts) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># just the file name</span></span>
<span id="cb3-8">                        ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rowwise</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-10">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># use list-frame to read in post text</span></span>
<span id="cb3-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">post =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">txt =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">readLines</span>(post_loc))))</span>
<span id="cb3-12"></span>
<span id="cb3-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># post meta data</span></span>
<span id="cb3-14">posts_info <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> posts_raw <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">date =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_sub</span>(post_name, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>), <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Jekyll post filenames all start with the date</span></span>
<span id="cb3-16">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">author =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Erin Grand"</span> </span>
<span id="cb3-17">         ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-18">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># grab metadata from the text of the post itself</span></span>
<span id="cb3-19">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(post, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(txt, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title:"</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pull</span>(txt), </span>
<span id="cb3-20">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_remove</span>(title, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title:"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-21">           <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_remove_all</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"[[:punct:]]"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-22">           <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_trim</span>(),</span>
<span id="cb3-23">         <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># this may be different depending on how your blog does tags</span></span>
<span id="cb3-24">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">categories =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(post, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(txt, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tags:"</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-25">                             <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pull</span>(txt) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-26">                             <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_remove</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tags:"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-27">                             <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_remove_all</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"[[:punct:]]"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-28">                             <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_trim</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-29">                             <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_c</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">collapse =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">", "</span>),</span>
<span id="cb3-30">         ) </span></code></pre></div></div>
<p>I then cleaned up a bunch of text in my tags/categories, but I ended up rewriting them manually anyway, so I’m going to ignore that code for now.</p>
<p>With all the post information and most of the metadata, I could now write out the new post structure and files.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">fill_between <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) {</span>
<span id="cb4-2">  x_log <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(x, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"---"</span>)</span>
<span id="cb4-3">  </span>
<span id="cb4-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Find the indices of the first two --- lines (the front matter delimiters)</span></span>
<span id="cb4-5">  first_true_index <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">which</span>(x_log)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb4-6">  last_true_index <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">which</span>(x_log)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]</span>
<span id="cb4-7"></span>
<span id="cb4-8">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create the range of indices covering the old front matter</span></span>
<span id="cb4-9">  indices_to_fill <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> first_true_index<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>last_true_index</span>
<span id="cb4-10"></span>
<span id="cb4-11">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Drop every line in this range, keeping only the post body</span></span>
<span id="cb4-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(x[<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>indices_to_fill])</span>
<span id="cb4-13">}</span>
<span id="cb4-14"></span>
<span id="cb4-15">posts_all <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> posts_info <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb4-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">post_txt =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fill_between</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pull</span>(post, txt)))), <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># continue to use list-frames</span></span>
<span id="cb4-17">         <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># new yaml headers, with date and tags</span></span>
<span id="cb4-18">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">yml_txt =</span> (<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">con =</span> glue<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glue</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"---</span></span>
<span id="cb4-19"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                                     title: {title}</span></span>
<span id="cb4-20"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                                     author: {author}</span></span>
<span id="cb4-21"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                                     date: {date}</span></span>
<span id="cb4-22"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                                     categories: [{categories}]</span></span>
<span id="cb4-23"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                                     image: 'image.jpg'</span></span>
<span id="cb4-24"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">                                     ---"</span>)</span>
<span id="cb4-25">                    ),</span>
<span id="cb4-26">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">yml_txt =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">readLines</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">textConnection</span>(yml_txt)))),</span>
<span id="cb4-27">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">new_post_txt =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_rows</span>(yml_txt, post_txt) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pull</span>(x)),</span>
<span id="cb4-28">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dir_title =</span> janitor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">make_clean_names</span>(title)</span>
<span id="cb4-29">         ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb4-30">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(dir_title, new_post_txt)</span></code></pre></div></div>
<p>With the posts written the way I wanted them, I just had to create the new sub-folders and write the posts out into individual index.qmd files.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">walk</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pull</span>(posts_all, dir_title), <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dir.create</span>(glue<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glue</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"posts/{.x}"</span>)))</span>
<span id="cb5-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">walk2</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pull</span>(posts_all, dir_title), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pull</span>(posts_all, new_post_txt), <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">writeLines</span>(.y, glue<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glue</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"posts/{.x}/index.qmd"</span>)))</span></code></pre></div></div>
</section>
<section id="task-2-render-the-blog-and-fix-all-the-errors" class="level2">
<h2 class="anchored" data-anchor-id="task-2-render-the-blog-and-fix-all-the-errors">Task 2: Render the Blog and Fix all the Errors</h2>
<p>A significant number of my posts stopped working because links didn’t go anywhere, and I didn’t grab images from the original posts. (Whoops! Next time, automate image file moving as well.) I didn’t have that many blog posts, so it wasn’t a massive lift for me to manually test each post and edit the text to include correct image links (where possible).</p>
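<p>For next time, here is a minimal sketch of what automating the image move could look like. It is not the code I ran: the old image folder path, the <code>find_images()</code> and <code>copy_post_images()</code> helpers, and the file-extension pattern are all assumptions for illustration.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(purrr)
library(stringr)

# Hypothetical helper: pull image file names out of a post's text
find_images &lt;- function(post_lines) {
  str_extract_all(post_lines, "[A-Za-z0-9_-]+\\.(png|jpg|jpeg|gif)") |&gt;
    unlist() |&gt;
    unique()
}

# Hypothetical helper: copy the referenced images into the post's own folder
# (old_image_dir is an assumed location for the Jekyll images folder)
copy_post_images &lt;- function(post_dir,
                             old_image_dir = "eringrand.github.io.raw/images") {
  post_lines &lt;- readLines(file.path(post_dir, "index.qmd"))
  images &lt;- find_images(post_lines)
  from &lt;- file.path(old_image_dir, images)
  found &lt;- file.exists(from)
  file.copy(from[found], file.path(post_dir, images[found]))
  invisible(images[!found]) # images that still need manual attention
}

# Run over every post sub-folder
walk(list.dirs("posts", recursive = FALSE), copy_post_images)</code></pre>
<p>The image links inside each post would still need to point at the copied file, so this only covers the file-moving half of the problem.</p>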
<p>I also went in and changed most of the <em>categories</em> because my old tags didn’t make as much sense to me anymore. (It’s an R blog! Does every post need a #tidyverse tag? Probably not.)</p>
</section>
<section id="task-3-make-blog-pretty" class="level2">
<h2 class="anchored" data-anchor-id="task-3-make-blog-pretty">Task 3: Make Blog Pretty!</h2>
<p>Erin, this blog isn’t any different from the default theme?! Did you finish the new blog??? Dear reader, no, I have not. I’m trying out some themes and taking a stab at CSS and brand.yml, but none of those things are ready for Prime Time yet.</p>
</section>
<section id="wrap-up" class="level1">
<h1>Wrap Up</h1>
<p>This is very much a trial run for me. I like Quarto, but I got used to Jekyll and change is not always my favorite thing. Do you have tips I should use for this new adventure?</p>
<p>What blog engine do you use? Do you like the Quarto style of blog, or do you still use R Markdown with <code>blogdown</code> or something else? Let’s talk about it on Bluesky! (Find me <span class="citation" data-cites="eringrand">@eringrand</span>, because I have <em>not</em> added comment capabilities to this blog yet.)</p>


</section>

 ]]></description>
  <category>blog</category>
  <category>quarto</category>
  <guid>https://eringrand.github.io/posts/new_blog_who_dis/</guid>
  <pubDate>Fri, 05 Dec 2025 00:00:00 GMT</pubDate>
  <media:content url="https://eringrand.github.io/posts/new_blog_who_dis/3vr9n0.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Best Practices for Cleaning Data in R</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/cleaning_data_in_r/</link>
  <description><![CDATA[ 





<p>A few months ago, I gave a talk at the conference formerly known as the NYC R Conference, now known as <a href="https://rstats.ai/nyc">New York Data Science and AI</a>. (<a href="https://youtube.com/watch?v=Cx-UxNCONaE&amp;embeds_referring_euri=https%3A%2F%2Fdataconf.ai%2F&amp;source_ve_path=Mjg2NjY">Watch it here!</a>) My presentation focused on my favorite topic: handling duplicates in data, and the importance of data cleaning.</p>
<p>The saying “90% of data science is cleaning the data” rings especially true for me. I really love digging into the weeds of cleaning data - figuring out what went wrong, whether the errors were systematic or not, whether there were user input errors (always), etc.</p>
<p>I’ve spent the last 10+ years working in the education space, where messy data is everywhere and data-driven decisions have a real impact on a student’s success. The challenge is to get data as clean as possible to have a correct analysis to base the decisions on.</p>
<p>Some common data challenges I’ve seen are:</p>
<ul>
<li>Missing/Incomplete data</li>
<li>Different data sources without matching IDs</li>
<li>Incorrect/overlapping dates</li>
<li>(Mis)-alignment of data and data processes across all schools and regions</li>
<li>Changing student IDs (not many)</li>
<li>Human data reporting error</li>
<li>Historical data quality</li>
</ul>
<p>Tackling these and more messy data challenges is the 90% of the work that drives meaningful outcomes for students.</p>
<section id="duplicates-oh-no-where-did-they-come-from" class="level2">
<h2 class="anchored" data-anchor-id="duplicates-oh-no-where-did-they-come-from">Duplicates! Oh no! Where did they come from?</h2>
<p>Duplicates in data are everywhere. Any dataset has the potential for some level of duplication, and if you’re not on the lookout, they can persist and cause analysis errors.</p>
<p>Most data duplicates that I’ve seen are caused by inadequate processes. If your organization doesn’t have the right data processes to start with, messy data will continue to flow, no matter how much you code. Creating and training in structures and processes will help reduce errors across the board.</p>
<p>For example, let’s say we have a student named James, who moved from School A to School B mid-year.</p>
<p><strong><em>Bad process example</em></strong>: School B records James entering the school the week before he officially starts, to get started on his course schedule and other paperwork. School A doesn’t record his exit until a week after he left, because they got busy or wanted to wait to see if he changed his mind. As the data person, you don’t know which school James actually attended during the overlapping two weeks in the database.</p>
<p><strong><em>Better process example</em></strong>: Make sure the system of record allows forward and back dating such that School B records James in their system with the correct start date and School A records James as having left on his last day. If James comes back to School A, they will start a new record without overlapping dates. To ensure data uniqueness, the system should verify that James has the same ID in School A and School B.</p>
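<p>A minimal sketch of how overlapping enrollment dates like James’s could be flagged in an audit. The <code>enrollments</code> table and its column names are made up for illustration.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical enrollment records; student 1 overlaps between School A and B
enrollments &lt;- tibble::tibble(
  student_id = c(1, 1, 2),
  school     = c("A", "B", "A"),
  entrydate  = as.Date(c("2024-09-01", "2025-01-06", "2024-09-01")),
  exitdate   = as.Date(c("2025-01-17", "2025-06-20", "2025-06-20"))
)

# Flag any record that starts before the student's previous record ended
overlaps &lt;- enrollments |&gt;
  group_by(student_id) |&gt;
  arrange(entrydate, .by_group = TRUE) |&gt;
  mutate(prev_exit = lag(exitdate)) |&gt;
  filter(!is.na(prev_exit), entrydate &lt; prev_exit) |&gt;
  ungroup()

overlaps</code></pre>
<p>Here only the School B record for student 1 is flagged, because it starts before the School A record ends.</p>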
<p>This example and other data issues can be checked and rechecked through data audits. Even with identifiers, duplicates can still occur (e.g., the same person with two different email addresses), so we use additional fields to audit for duplicates. Names, emails, birthdays, phone numbers, and home addresses are good places to check.</p>
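<p>As a rough sketch of that kind of audit, the snippet below looks for people who share a cleaned-up name and a birthday even though their IDs and emails differ. The <code>people</code> table is invented for illustration, and which fields to match on is a judgment call for your own data.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(stringr)

# Hypothetical records: two IDs that are probably the same person
people &lt;- tibble::tibble(
  id        = c(101, 102, 103),
  name      = c("James Smith", "james smith ", "Ana Lopez"),
  birthdate = as.Date(c("2010-03-04", "2010-03-04", "2011-07-19")),
  email     = c("james@example.com", "jsmith@example.org", "ana@example.com")
)

possible_dupes &lt;- people |&gt;
  # normalize case and whitespace before comparing names
  mutate(name_clean = str_squish(str_to_lower(name))) |&gt;
  add_count(name_clean, birthdate, name = "n_matches") |&gt;
  filter(n_matches &gt; 1)

possible_dupes</code></pre>
<p>Matches like these are candidates for a human to review, not automatic merges.</p>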
<section id="duplicates-you-caused" class="level3">
<h3 class="anchored" data-anchor-id="duplicates-you-caused">Duplicates you caused!</h3>
<p>Duplicates are not always the fault of the data itself. We can cause our own duplicates through incorrectly written code.</p>
<ul>
<li>Joining – using the incorrect fields or edits to fields needed</li>
<li><code>pivot_wider</code> – including too many columns in the select</li>
<li><code>pivot_longer</code> – including too many columns in the pivot</li>
</ul>
<p>Integrating validation steps, unit testing, and code reviews into your work will reduce the number of “coder-caused” duplicates.</p>
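<p>One simple validation step for joins, sketched with made-up <code>roster</code> and <code>schools</code> tables: compare row counts before and after, then look for the keys that repeat.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical data: one row per student
roster &lt;- tibble::tibble(student_id = c(1, 2, 3), grade = c(5, 6, 7))

# The lookup table accidentally has two rows for student 2
schools &lt;- tibble::tibble(
  student_id = c(1, 2, 2, 3),
  school     = c("A", "A", "B", "B")
)

joined &lt;- left_join(roster, schools, by = "student_id")

# A left join onto a one-row-per-student lookup should not add rows
rows_match &lt;- nrow(joined) == nrow(roster)

# Find which keys in the lookup table are responsible
dupe_keys &lt;- schools |&gt;
  count(student_id) |&gt;
  filter(n &gt; 1)</code></pre>
<p>In this example <code>rows_match</code> is <code>FALSE</code> and <code>dupe_keys</code> points at student 2, which is the cue to fix the lookup table before trusting the joined data.</p>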
</section>
</section>
<section id="lets-take-a-look-at-some-r-code" class="level2">
<h2 class="anchored" data-anchor-id="lets-take-a-look-at-some-r-code">Let’s take a look at some R code…</h2>
<blockquote class="blockquote">
<p>Janitor was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. Advanced users can already do everything covered here, but they can do it faster with Janitor and save their thinking for more fun tasks. (<em>Sam Firke</em>)</p>
</blockquote>
<p>If you’re experienced with Tidyverse in general, you should be able to do everything inside Janitor on your own; however, it’s always nice to have a function do it for you.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(janitor)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(readxl)</span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set up fake student data</span></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Your fake data might be different from mine, as it's totally random IDs.</span></span>
<span id="cb1-7">students <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> tibble<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">student_id =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e6</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e7</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">-1</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>), </span>
<span id="cb1-8">                           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">grade =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)),</span>
<span id="cb1-9">                           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">entrydate =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.Date</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>,</span>
<span id="cb1-10">                           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">exitdate =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.Date</span>())</span>
<span id="cb1-11"></span>
<span id="cb1-12">students[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> students[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] </span>
<span id="cb1-13">students[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> students[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># set up duplicate</span></span>
<span id="cb1-14"></span>
<span id="cb1-15">students <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span>  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_dupes</span>(student_id)</span></code></pre></div></div>
<pre><code># A tibble: 2 x 6
  student_id dupe_count grade entrydate  exitdate
       &lt;dbl&gt;      &lt;int&gt; &lt;dbl&gt; &lt;date&gt;     &lt;date&gt;
1    4137115          2     1 2017-12-02 2018-01-01
2    4137115          2     2 2017-12-02 2018-01-01</code></pre>
<p>Combining <code>get_dupes()</code> with <code>verify()</code> from the <strong>assertr</strong> package is a great way to add checks in case the data changes (which it inevitably will).</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">check <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> students <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_dupes</span>(student_id) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">verify</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(.) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span></code></pre></div></div>
<p>If a student ID changes or new duplicates occur, the code will stop at this step.</p>
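<p>To see what "stop at this step" looks like, here is a minimal sketch on made-up data. It assumes <code>get_dupes()</code> comes from the <strong>janitor</strong> package and relies on <code>verify()</code> raising an error when its condition is false. The <code>toy_students</code> table and the <code>tryCatch()</code> wrapper are illustrative assumptions, not part of the real pipeline.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(janitor)
library(assertr)

# Hypothetical toy data: student 2 is enrolled twice
toy_students &lt;- tibble(
  student_id = c(1, 2, 2, 3),
  grade      = c(5, 7, 8, 9)
)

# verify() errors when its condition is FALSE, which would halt a script.
# tryCatch() is only here so the sketch can show the failure without stopping.
result &lt;- tryCatch(
  toy_students |&gt;
    get_dupes(student_id) |&gt;
    verify(nrow(.) == 0),
  error = function(e) "duplicates found"
)

result</code></pre>
<p>In a real script you would leave out the <code>tryCatch()</code> so that the error actually interrupts the run.</p>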
<section id="fixing-the-duplicates" class="level3">
<h3 class="anchored" data-anchor-id="fixing-the-duplicates">Fixing the duplicates</h3>
<p>Option 1:</p>
<p>Correct the dupes individually with <code>if_else</code> or <code>case_when</code>. This method is best as a quick and easy fix for errors in one or two rows.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">correct_students <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> students <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">grade =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">if_else</span>(student_id <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> ______, CORRECT_GRADE, grade)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">distinct</span>() </span></code></pre></div></div>
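<p>As a concrete illustration of Option 1, here is a minimal sketch with the blanks filled in. The toy data, the student ID <code>2</code>, and the "correct" grade <code>7</code> are all made-up values for illustration.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical toy data: student 2 appears twice with conflicting grades
toy_students &lt;- tibble(
  student_id = c(1, 2, 2),
  grade      = c(5, 7, 8)
)

toy_fixed &lt;- toy_students |&gt;
  # Overwrite the grade for the one known-bad ID (7 is an assumed correct value)
  mutate(grade = if_else(student_id == 2, 7, grade)) |&gt;
  # Both rows for student 2 are now identical, so distinct() collapses them
  distinct()

toy_fixed</code></pre>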
<p>Option 2:</p>
<p>Systematic errors can be fixed by summarizing the incorrect column. In this case, we could assume that the lower grade level is the correct one for all duplicate enrollments. This method works better for systematic issues that you know how to correct, such as taking the higher of two homework assignments.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">correct_students <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> students <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb5-2">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(student_id) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb5-3">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarize</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">grade =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">min</span>(grade))</span>
<span id="cb5-4"></span>
<span id="cb5-5">correct_students</span></code></pre></div></div>
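<p>One thing to keep in mind: <code>summarize()</code> returns only the grouping column and the summarized column, so the other columns (like the dates) are dropped. If you need to keep them, a possible alternative is <code>slice_min()</code>, sketched here on made-up data. The <code>toy_students</code> table is an assumption for illustration.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical toy data with an extra column we want to keep
toy_students &lt;- tibble(
  student_id = c(1, 2, 2),
  grade      = c(5, 7, 8),
  entrydate  = as.Date("2017-12-02")
)

toy_fixed &lt;- toy_students |&gt;
  group_by(student_id) |&gt;
  # Keep the whole row with the lowest grade for each student
  slice_min(grade, n = 1, with_ties = FALSE) |&gt;
  ungroup()

toy_fixed</code></pre>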
<p>Option 3:</p>
<p>Output the duplicates and manually choose which version to keep.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1">students <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb6-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_dupes</span>(student_id) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">write_csv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"../data/dupes.csv"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>)</span>
<span id="cb6-4"></span>
<span id="cb6-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create this file in excel manually, often by asking your coworkers for help</span></span>
<span id="cb6-6">dupes_remove <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_csv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"../data/dupes_correct.csv"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb6-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(delete <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb6-8"></span>
<span id="cb6-9">students_correct <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">anti_join</span>(students, dupes_remove) </span></code></pre></div></div>
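<p>Here is a small sketch of how the <code>anti_join()</code> step behaves, using made-up tables in place of the CSV files. The toy tables are assumptions for illustration; the <code>delete</code> column mirrors the one used above. Naming the join columns with <code>by</code> is a choice I would suggest so that the extra <code>delete</code> and <code>dupe_count</code> style columns never sneak into the match.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical toy data: student 2 appears twice
toy_students &lt;- tibble(
  student_id = c(1, 2, 2),
  grade      = c(5, 7, 8)
)

# Stand-in for the hand-edited CSV: delete == 1 marks the row to drop
toy_dupes_correct &lt;- tibble(
  student_id = c(2, 2),
  grade      = c(7, 8),
  delete     = c(0, 1)
)

toy_remove &lt;- toy_dupes_correct |&gt;
  filter(delete == 1)

# Keep only the rows of toy_students that have no match in toy_remove
toy_fixed &lt;- anti_join(toy_students, toy_remove, by = c("student_id", "grade"))

toy_fixed</code></pre>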
</section>
<section id="document-document-document." class="level3">
<h3 class="anchored" data-anchor-id="document-document-document.">DOCUMENT DOCUMENT DOCUMENT.</h3>
<p>Now that you’ve fixed the duplicates, whether in the database and/or code, DOCUMENT what you did and WHY, so that when the data changes and new duplicates are found, it is clear how to handle them and keep the code running.</p>


</section>
</section>

 ]]></description>
  <category>duplicate data</category>
  <guid>https://eringrand.github.io/posts/cleaning_data_in_r/</guid>
  <pubDate>Fri, 14 Nov 2025 00:00:00 GMT</pubDate>
  <media:content url="https://eringrand.github.io/posts/cleaning_data_in_r/clean_data.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Reviewing my old live journal posts</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/reviewing_my_old_live_journal_posts/</link>
  <description><![CDATA[ 





<p>I’m about to BARE MY SOUL to the internet. Well, the soul of my teenage self. Get ready!</p>
<p>Live Journal, you may remember, was (/is - it does still exist) a blogging site before we really knew what blogging was. It was both a place to put diary entries and those quizzes that got passed around back in the day. It was also used for FanFic and community gathering. I did NOT use my (main) account for fan gathering. (Though I did write some excellent/terrible fanfics. Didn’t everyone have a Lord of the Rings self-insert character?)</p>
<p>I don’t really remember when I started this idea, but I thought it would be fun to see just how emo I was in 2006. Let’s go!</p>
<section id="step-0-load-the-libraries" class="level2">
<h2 class="anchored" data-anchor-id="step-0-load-the-libraries">Step 0: Load the libraries</h2>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(lubridate)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidytext)</span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(hunspell)</span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggrepel)</span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(cowplot)</span></code></pre></div></div>
</section>
<section id="step-1-clean-the-data" class="level2">
<h2 class="anchored" data-anchor-id="step-1-clean-the-data">Step 1: Clean the data</h2>
<p>I downloaded all my past Live Journal entries to a folder on my desktop in the same CSV format, so that I could easily load them in for analysis. I am pleasantly surprised that Live Journal made it so easy to download my history like this! I did have to click the same button a ton of times - but I did get all my data.</p>
<p>The next step was to take every journal entry and separate out the individual words using ‘<a href="https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html">tidytext</a>.’</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">lj_words <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> lj_df <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(itemid, eventtime, logtime, subject, current_music, current_mood, event) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">across</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(eventtime, logtime), ymd_hms),</span>
<span id="cb2-4">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">year =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">year</span>(logtime),</span>
<span id="cb2-5">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">month =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">month</span>(logtime),</span>
<span id="cb2-6">         ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">event =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_remove_all</span>(event, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"'"</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unnest_tokens</span>(word, event, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">token =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"words"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">format =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"html"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">strip_url =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) </span></code></pre></div></div>
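<p>If you haven't used <code>unnest_tokens()</code> before, here is a minimal sketch of what the tokenizing step does, run on two made-up entries instead of my real journal. The <code>toy_entries</code> table and its text are assumptions for illustration, and the sketch leaves out the HTML handling used above.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical stand-in for lj_df: one row per journal entry
toy_entries &lt;- tibble(
  itemid = 1:2,
  event  = c("Today was SO boring.", "I can't wait for prom!")
)

toy_words &lt;- toy_entries |&gt;
  # Drop apostrophes first, so "can't" becomes "cant"
  mutate(event = str_remove_all(event, "'")) |&gt;
  # One row per word, lowercased, with punctuation stripped
  unnest_tokens(word, event, token = "words")

toy_words</code></pre>
<p>Each row keeps its <code>itemid</code>, so every word can still be traced back to the entry it came from.</p>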
<p>Not included in this blog post: I also replaced the names of my friends and of locations with placeholders, to protect the privacy of my teenage friends. For example, instead of the name “Linda” you may see “nameofsister”.</p>
<p>I was (and continue to be) terrible at spelling words correctly and also terrible at checking what I’ve typed after the fact. I use ‘Hunspell’ here in an attempt to fix some of the most common issues. Does this spell check get everything? No! But alas, I am a terrible speller and we move on in life.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">lj_words_spell_check <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> lj_words_protect <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb3-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">anti_join</span>(my_stop_words, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb3-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(word) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb3-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rowwise</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb3-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">spell_check =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">hunspell</span>(word)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb3-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(spell_check) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb3-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">suggest =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">hunspell_suggest</span>(spell_check)) </span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">lj_correct <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> lj_words_spell_check <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(suggest) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">suggest_pick =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pluck</span>(suggest, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># just pick the first one because I am lazy</span></span>
<span id="cb4-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unnest</span>(suggest_pick) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(word, suggest_pick) </span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">lj_words_corrected <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> lj_words_protect <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb5-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(lj_correct, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb5-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">word =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coalesce</span>(suggest_pick, word)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb5-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unnest_tokens</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">output =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">input =</span> word) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># used because sometimes the correction is actually 2+ words now </span></span></code></pre></div></div>
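<p>For anyone who wants to try the spell-check idea on a small scale, here is a compact sketch of the same "take the first suggestion" approach on three made-up words. It is a simplified variant, not the exact pipeline above: it uses the vectorized <code>hunspell_check()</code> and <code>hunspell_suggest()</code> functions, and <code>toy_words</code>, <code>ok</code>, and <code>word_fixed</code> are names I made up for illustration. Which suggestion comes first depends on the dictionary installed.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(purrr)
library(hunspell)

# Hypothetical words: one spelled correctly, two not
toy_words &lt;- tibble(word = c("beautiful", "definately", "tomorow"))

toy_fixed &lt;- toy_words |&gt;
  mutate(
    # TRUE when the word is in the dictionary
    ok = hunspell_check(word),
    # List column: one character vector of suggestions per word
    suggest = hunspell_suggest(word),
    # First suggestion, or NA when hunspell offers none
    suggest_pick = map_chr(
      suggest,
      \(s) if (length(s) &gt; 0) s[[1]] else NA_character_
    ),
    # Only replace words that failed the check
    word_fixed = if_else(ok, word, coalesce(suggest_pick, word))
  )

toy_fixed |&gt;
  select(word, ok, word_fixed)</code></pre>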
</section>
<section id="step-2-now-we-move-on-to-analysis" class="level2">
<h2 class="anchored" data-anchor-id="step-2-now-we-move-on-to-analysis">Step 2: Now we move on to analysis!</h2>
<p>The data is clean, or at least as clean as it is going to get today.</p>
<section id="word-counts" class="level3">
<h3 class="anchored" data-anchor-id="word-counts">Word counts</h3>
<p>I start with TF-IDF. The goal here is to see what I was talking about each year and how it may differ as I got older. As a reminder, I have changed the names of all my friends and family for privacy. That way you don’t know who “nameofbestfriend” is and why I stopped mentioning “nameofbestfriend” in 2006. (We had a bit of a falling out at the end of HS.)</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1">tfidf <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> lj_words_corrected <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(year, word) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb6-3">  tidytext<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_tf_idf</span>(word, year, n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">anti_join</span>(stop_words, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(year) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">top_n</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">wt =</span> tf_idf) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) </span></code></pre></div></div>
<p><img src="https://eringrand.github.io/posts/reviewing_my_old_live_journal_posts/2022-06-28-LJ_files/figure-markdown_github/unnamed-chunk-12-1.png" class="img-fluid"></p>
<p>Look at 2009 - clearly my only entries were about my Norse myth college class. I remember I put a few of my class papers on my Live Journal.</p>
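<p>A quick aside on why TF-IDF surfaces words like that. <code>bind_tf_idf()</code> multiplies the term frequency (a word's count divided by all word counts in that year) by the inverse document frequency (the natural log of the number of years divided by the number of years containing the word). A word that shows up in every year gets an IDF of zero. The sketch below shows this on a tiny made-up table; <code>toy_counts</code> and its numbers are assumptions for illustration.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(tidytext)

# Hypothetical word counts per year
toy_counts &lt;- tibble(
  year = c(2005, 2005, 2006, 2006),
  word = c("school", "prom", "school", "college"),
  n    = c(3, 1, 2, 2)
)

toy_tfidf &lt;- toy_counts |&gt;
  bind_tf_idf(word, year, n)

# "school" appears in both years, so idf = log(2 / 2) = 0 and tf_idf = 0.
# "prom" appears only in 2005, so tf_idf = (1 / 4) * log(2 / 1).
toy_tfidf</code></pre>
<p>That is why a one-off topic like a single college class can dominate a year with few entries.</p>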
<p>We can look at the differences between TF-IDF and a regular word count, while accounting for stop words.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1">wordcount <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> lj_words_corrected <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(year, word) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">anti_join</span>(stop_words, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(year) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">top_n</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">wt =</span> n) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>) </span></code></pre></div></div>
<p><img src="https://eringrand.github.io/posts/reviewing_my_old_live_journal_posts/2022-06-28-LJ_files/figure-markdown_github/unnamed-chunk-14-1.png" class="img-fluid"></p>
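<p>A small note for anyone adapting this code: <code>top_n()</code> still works, but it has been superseded in dplyr by <code>slice_max()</code>. Here is a sketch of the equivalent step on made-up counts; <code>toy_counts</code> is an assumption for illustration, and I take the top 2 instead of the top 10 to keep it short.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical word counts per year
toy_counts &lt;- tibble(
  year = c(2005, 2005, 2005, 2006, 2006),
  word = c("school", "prom", "band", "college", "dorm"),
  n    = c(5, 3, 1, 4, 2)
)

toy_top &lt;- toy_counts |&gt;
  group_by(year) |&gt;
  # Top 2 words per year by count; ties are kept by default, like top_n()
  slice_max(order_by = n, n = 2) |&gt;
  ungroup()

toy_top</code></pre>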
</section>
<section id="sentiment" class="level3">
<h3 class="anchored" data-anchor-id="sentiment">Sentiment</h3>
<p>Next I look at sentiment. I remember using Live Journal to be super <em>angsty</em>. I assumed that I would largely see negative sentiment and words across the years.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1">df_plot <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> lj_words_corrected <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_sentiments</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bing"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word"</span> ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bing_sentiment =</span> sentiment) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_sentiments</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"nrc"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word"</span> ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrc_sentiment =</span> sentiment) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(bing_sentiment, nrc_sentiment), </span>
<span id="cb8-7">               <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sentiment_type"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sentiment"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(sentiment_type, year, sentiment) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(sentiment) ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">count =</span> n ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(sentiment_type, year) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">total =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(count)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">percent =</span> count <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> total,</span>
<span id="cb8-15">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">year_month =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_c</span>(year, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"01"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"01"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sep =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"-"</span>))</span>
<span id="cb8-16">         ) </span></code></pre></div></div>
<p>Instead, it seems my words were largely more positive than negative. (The exception is 2009, which is either largely from my anxiety attacks that year or because Norse mythology is just super depressing.) Not as angsty as I remember!</p>
<p><img src="https://eringrand.github.io/posts/reviewing_my_old_live_journal_posts/2022-06-28-LJ_files/figure-markdown_github/unnamed-chunk-16-1.png" class="img-fluid"></p>
<p>Ah, but did I mark my “current mood” / “how are you feeling” field as positively as I wrote? You be the judge.</p>
<p><img src="https://eringrand.github.io/posts/reviewing_my_old_live_journal_posts/2022-06-28-LJ_files/figure-markdown_github/unnamed-chunk-18-1.png" class="img-fluid"></p>
</section>
</section>
<section id="end" class="level2">
<h2 class="anchored" data-anchor-id="end">End :)</h2>
<p>So there you have it. Was teenage Erin as emo as I thought? Maybe not! Or maybe I wrote all the most emo journals in my physical diary. The world will never know (because those diaries have been lost).</p>


</section>

 ]]></description>
  <category>sentiment</category>
  <category>tidytext</category>
  <guid>https://eringrand.github.io/posts/reviewing_my_old_live_journal_posts/</guid>
  <pubDate>Tue, 28 Jun 2022 00:00:00 GMT</pubDate>
  <media:content url="https://static.wikia.nocookie.net/disney/images/1/1c/Profile_-_Eeyore.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Data science jobs in the not-for-profit sector</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/data_science_jobs_in_the_nonprofit_sector/</link>
  <description><![CDATA[ 





<p>I’ve been a data scientist in the non-profit sector for 5+ years. In an earlier conversation with people starting their transition into data science, we talked about how non-profit data can be an excellent place to start. There are some cons to beginning at a non-profit - but I love it.</p>
<p>In my experience, non-profit data roles have similarities with start-up positions. A non-profit data role may involve doing more work than the job description lists (e.g., analysis and project management at the same time). Due to a lack of funding, non-profits cannot hire a separate person for each kind of work, so a new hire can end up filling many roles at once. This early exposure to a wide range of work is a magnificent way to learn on the job. “Personally, I think non-profit DS is the perfect place to start a career :) so much messy data, and so many wins, even from “basic” models.” (Caitlin Hudon) In larger for-profit tech companies, people are not wearing so many hats. In non-profits, you end up being more cross-trained.</p>
<p>On the other hand, a possible con to joining a smaller non-profit OR start-up is that you may be one of the first data people in the company. Data science teams often don’t exist at non-profits. As the first or an early data person, you are forced to grow your skills quickly, but you also won’t have an on-site mentor. “In my current position, my manager is self-taught in coding… He does most things in SQL.” (Kevin Gilds) In chapter 9 of their book, Building a Career in Data Science, Emily Robinson &amp; Jacqueline Nolis talk more about being the first or only data person on the job. When a data team is in place, it is often small, and the day-to-day work is likely to be more aligned with reporting/analysis than machine learning. If you are looking specifically to get involved in modeling ASAP, a non-profit is probably not the best place to start. That said, the messy data difficulties in a non-profit can often lead to quick wins for automation and coding.</p>
<p>In companies without a dedicated data team, data structures and cleaning become crucial, as data is likely in spreadsheets. SQL and Excel skills are more appreciated than complicated programming skills. I have loved working with messy data as it has allowed me to shape the policies and work to create more organized structures. “You will never find better data sets to cut your teeth on than working with non-profit data.” (Caitlin Hudon) Given the often early data stages, you can do quick magic with a spreadsheet. Applying some automation with an Excel macro or a pivot table gives you an early win with the data and stakeholders. Teaching others how to do a VLOOKUP could make the difference between them spending 5 minutes vs.&nbsp;1-2 hours on a task.</p>
<p>The clear win or difference, for me, at a non-profit is that everyone is passionately mission-driven. Non-profit employees are not there just for the money (though, hey - let’s pay people what they’re worth, please). Tech and data skills can be uniquely used in a mission-driven space to do fantastic work to make the world better. For example, in an education company, working with student data can directly influence how a teacher helps students perform even better on the SATs, giving them access to a better college. At a mental health company, data work provides information for counselors to better help their clients - literally saving lives.</p>



 ]]></description>
  <category>data science</category>
  <category>careers</category>
  <guid>https://eringrand.github.io/posts/data_science_jobs_in_the_nonprofit_sector/</guid>
  <pubDate>Wed, 14 Apr 2021 00:00:00 GMT</pubDate>
  <media:content url="https://eringrand.github.io/posts/data_science_jobs_in_the_nonprofit_sector/istockphoto.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Twitter’s Favorite Lesser Known Packages</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/twitters_favorite_lesser_known_packages/</link>
  <description><![CDATA[ 





<p>At the December 2018 NYC R-Ladies meetup (yes, this post has been sitting in my drafts for over a year), a group started talking about how a few tiny functions in a lesser-known package can provide you with serious magic. The problem is finding those packages and functions! With so many amazing packages on CRAN and GitHub, how do you even begin to search? One way: ask all your Twitter followers what they think. Twitter did not disappoint, so here are some examples of <em>amazing</em> packages and functions you might want to learn about.</p>
<p>The types of functions offered seemed to fall into a few buckets: making tasks you do all the time easier (cleaning data, summaries), dealing with data structures that aren’t easy to work with (factors, strings, etc.), visualizations, and so much more.</p>
<section id="data-tasks" class="level2">
<h2 class="anchored" data-anchor-id="data-tasks">Data Tasks</h2>
<p>My favorite lesser-known package is <a href="https://sfirke.github.io/janitor/">Janitor</a> by Sam Firke. This package has basic functions to clean and prep messy data files. Most of the functions are relatively easy to replicate with dplyr, but why write the same thing over and over when Janitor does it for you?</p>
<blockquote class="twitter-tweet blockquote" data-lang="en">
<p lang="en" dir="ltr">
</p><p>Mine are, from janitor…<br>1. clean_names<br>2. get_dupes<br>3. remove_empty<a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a></p>
<p></p>
<p>— Erin Grand (<span class="citation" data-cites="astroeringrand">@astroeringrand</span>) <a href="https://twitter.com/astroeringrand/status/1072325599300071431?ref_src=twsrc%5Etfw">December 11, 2018</a></p>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
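<p>Here is a minimal sketch of those three functions on a made-up data frame. The <code>messy</code> data and its column names are invented for illustration; only the three janitor functions come from the tweet.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(janitor)

# Toy data (hypothetical): messy column names, a duplicated row, an all-NA column
messy &lt;- data.frame(
  `First Name` = c("Ada", "Ada", "Grace"),
  `Score (%)`  = c(90, 90, 85),
  `Empty Col`  = c(NA, NA, NA),
  check.names  = FALSE
)

# clean_names() converts the column names to snake_case
cleaned &lt;- clean_names(messy)
names(cleaned)

# get_dupes() returns the rows that share a first_name, plus a dupe_count column
dupes &lt;- get_dupes(cleaned, first_name)
dupes

# remove_empty() drops columns (or rows) that are entirely NA
trimmed &lt;- remove_empty(cleaned, which = "cols")
names(trimmed)</code></pre>
<p>None of this is hard to write by hand, but having it in one call makes it much easier to run at the top of every analysis.</p>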
<p><a href="https://docs.ropensci.org/skimr/">Skimr</a>, as suggested by Fernando Flores, started at an rOpenSci unconf and provides a better summary function. It creates both a tidy version of the summary table to work with and a visual version to inspect. This is super useful for investigating data issues.</p>
<blockquote class="twitter-tweet blockquote" data-conversation="none" data-lang="en">
<p lang="en" dir="ltr">
</p><p>Couldn’t choose just one package, so here we go:<br>skimr::skim<br>covr::report<br>DT::JS</p>
<p></p>
<p>— Fernando Flores (<span class="citation" data-cites="ds_floresf">@ds_floresf</span>) <a href="https://twitter.com/ds_floresf/status/1072539510448275456?ref_src=twsrc%5Etfw">December 11, 2018</a></p>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
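<p>As a quick sketch, here is <code>skim()</code> on the built-in <code>iris</code> data (my choice of example data, not something from the tweet):</p>
<pre class="sourceCode r"><code class="sourceCode r">library(skimr)

# One row per column of iris, with type-specific summary statistics
summary_tbl &lt;- skim(iris)

# Printing gives the visual version to inspect
summary_tbl

# The same object is also a data frame you can keep working with
tidy_tbl &lt;- tibble::as_tibble(summary_tbl)
tidy_tbl$skim_variable</code></pre>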
</section>
<section id="data-types" class="level2">
<h2 class="anchored" data-anchor-id="data-types">Data Types</h2>
<p>The tidyverse packages for dealing with specific data types are not nearly as widely used as they could be; forcats, lubridate, glue, and stringr can help solve so many problems with factors, dates, and strings.</p>
<blockquote class="twitter-tweet blockquote" data-lang="en">
<p lang="en" dir="ltr">
</p><p>From forcats:<br>1. fct_infreq<br>2. fct_rev<br>3. fct_drop</p>
<p></p>
<p>— Emily Zabor (<span class="citation" data-cites="zabormetrics">@zabormetrics</span>) <a href="https://twitter.com/zabormetrics/status/1073648773014929413?ref_src=twsrc%5Etfw">December 14, 2018</a></p>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet blockquote" data-conversation="none" data-lang="en">
<p lang="en" dir="ltr">
</p><p>forcats::fct_lump<a href="https://t.co/2BboLbdzuS">https://t.co/2BboLbdzuS</a><br><br>glue::glue and glue::glue_data<a href="https://t.co/Bxt20MQGi2">https://t.co/Bxt20MQGi2</a><br><br>Cheated and use 2x packages.</p>
<p></p>
<p>— Thomas Mock 👨🏼 💻 (<span class="citation" data-cites="thomas_mock">@thomas_mock</span>) <a href="https://twitter.com/thomas_mock/status/1072328281741901824?ref_src=twsrc%5Etfw">December 11, 2018</a></p>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet blockquote" data-lang="en">
<p lang="en" dir="ltr">
</p><p>you stole mine! 😉 this is kind of cheating but from lubridate: year(), month(), day()</p>
<p></p>
<p>— Luuuda (<span class="citation" data-cites="ludmila_janda">@ludmila_janda</span>) <a href="https://twitter.com/ludmila_janda/status/1072339517821067264?ref_src=twsrc%5Etfw">December 11, 2018</a></p>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
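<p>To make those suggestions concrete, here is a small sketch using a made-up factor and date. The <code>pets</code> vector is invented, and I am using <code>fct_lump_n()</code>, the newer, more specific sibling of the <code>fct_lump()</code> mentioned in the tweet.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(forcats)
library(glue)
library(lubridate)

# Hypothetical factor: cat x3, dog x2, fish x1, newt x1
pets &lt;- factor(c("cat", "dog", "cat", "fish", "cat", "dog", "newt"))

by_freq &lt;- fct_infreq(pets)         # order levels from most to least common
levels(by_freq)
levels(fct_rev(by_freq))            # same levels, reversed

lumped &lt;- fct_lump_n(pets, n = 2)   # keep the 2 most common levels, lump the rest into "Other"
levels(lumped)

# lubridate pulls date parts out; glue drops them into a string
d &lt;- ymd("2018-12-11")
glue("Asked on day {day(d)} of month {month(d)}, {year(d)}")</code></pre>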
</section>
<section id="plotting-support" class="level2">
<h2 class="anchored" data-anchor-id="plotting-support">Plotting Support</h2>
<p>A few of the recommendations focused on visualizations and plotting. Key shout-outs go to naniar and patchwork. <a href="http://naniar.njtierney.com/">Naniar</a> helps you visualize your missing values. <a href="https://patchwork.data-imaginist.com/">Patchwork</a> allows you to combine plots together.</p>
<blockquote class="twitter-tweet blockquote" data-conversation="none" data-lang="en">
<p lang="en" dir="ltr">
</p><p>From two packages, super handy at first steps after loading dataset: <br>naniar::gg_miss_var<br>summarytools::descr<br>summarytools::freq</p>
<p></p>
<p>— Radoslaw Panczak (<span class="citation" data-cites="RPanczak">@RPanczak</span>) <a href="https://twitter.com/RPanczak/status/1072674486326124544?ref_src=twsrc%5Etfw">December 12, 2018</a></p>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</section>
<section id="other" class="level2">
<h2 class="anchored" data-anchor-id="other">Other</h2>
<p>There were a ton of other amazing suggestions for excellent packages.</p>
<p>The magrittr package has many useful operators outside of the normal %&gt;% pipe.</p>
<blockquote class="twitter-tweet blockquote" data-conversation="none" data-lang="en">
<p lang="en" dir="ltr">
</p><p>I was going to say %&lt;&gt;% , %&lt;&gt;% , and %&lt;&gt;% from magrittr - I use it all the time now thanks to <a href="https://twitter.com/robinson_es?ref_src=twsrc%5Etfw"><span class="citation" data-cites="robinson_es">@robinson_es</span></a> - but now I’m browsing other magrittr functions and the aliases like extract() etc would be v handy when piping</p>
<p></p>
<p>— Sarah R (<span class="citation" data-cites="srhrnkn">@srhrnkn</span>) <a href="https://twitter.com/srhrnkn/status/1072870594314625024?ref_src=twsrc%5Etfw">December 12, 2018</a></p>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
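<p>For anyone who has only used the regular pipe, here is a short sketch of a few of the other magrittr operators and aliases, using a toy vector and the built-in <code>mtcars</code> data (both my own choices for illustration):</p>
<pre class="sourceCode r"><code class="sourceCode r">library(magrittr)

# Assignment pipe: pipe x into sqrt() and write the result back to x
x &lt;- c(9, 16, 25)
x %&lt;&gt;% sqrt()
x

# Exposition pipe: refer to columns by name inside the next call
mtcars %$% cor(mpg, wt)

# Aliases that read nicely in a pipeline
first_two &lt;- mtcars %&gt;% extract(1:2, c("mpg", "wt"))  # same as mtcars[1:2, c("mpg", "wt")]
mpg_col   &lt;- mtcars %&gt;% extract2("mpg")               # same as mtcars[["mpg"]]</code></pre>
<p>One caveat: tidyr also exports a function called <code>extract()</code>, so if the tidyverse is loaded it is safer to write <code>magrittr::extract()</code>.</p>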
<p>If you work with spatial data at all, the <a href="https://r-spatial.github.io/sf/">sf</a> package is a must.</p>
<blockquote class="twitter-tweet blockquote" data-lang="en">
<p lang="en" dir="ltr">
</p><p>The sf package cleared my skin, cleaned my home &amp; cured my anxiety</p>
<p></p>
<p>— Brooke Watson (<span class="citation" data-cites="brookLYNevery1">@brookLYNevery1</span>) <a href="https://twitter.com/brookLYNevery1/status/1072616772870770695?ref_src=twsrc%5Etfw">December 11, 2018</a></p>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet blockquote" data-conversation="none" data-lang="en">
<p lang="en" dir="ltr">
</p><p>I added the <code>conflicted</code> package to my RProfile this summer, and I really love that it warns me about possible name conflicts <em>before</em> I run into problems <a href="https://t.co/46Y88gexP9">pic.twitter.com/46Y88gexP9</a></p>
<p></p>
<p>— Irene Steves (<span class="citation" data-cites="i_steves">@i_steves</span>) <a href="https://twitter.com/i_steves/status/1088884286101573632?ref_src=twsrc%5Etfw">January 25, 2019</a></p>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>What is your favorite lesser-known package or function? Sound off in the comments (or find me on Twitter).</p>


</section>

 ]]></description>
  <category>twitter</category>
  <guid>https://eringrand.github.io/posts/twitters_favorite_lesser_known_packages/</guid>
  <pubDate>Tue, 30 Jun 2020 00:00:00 GMT</pubDate>
  <media:content url="https://eringrand.github.io/posts/twitters_favorite_lesser_known_packages/logo.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Scraping APOD Descriptions</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/scraping_apod_descriptions/</link>
  <description><![CDATA[ 





<section id="orginal-plan---scrape-from-archive" class="level2">
<h2 class="anchored" data-anchor-id="orginal-plan---scrape-from-archive">Original Plan - Scrape from Archive</h2>
<p>A long while ago now, <a href="https://twitter.com/Nujcharee">Nujcharee</a> tweeted about an awesome viz she did using <code>rvest</code> and Power BI.</p>
<blockquote class="twitter-tweet blockquote" data-lang="en">
<p lang="en" dir="ltr">
</p><p>Using rvest + purrr packages to scrap APOD. PowerBI viz it up real nice! <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/powerbi?src=hash&amp;ref_src=twsrc%5Etfw">#powerbi</a>. My learning journal during <a href="https://twitter.com/hashtag/NASADatanauts?src=hash&amp;ref_src=twsrc%5Etfw">#NASADatanauts</a> year of awesomeness. <a href="https://t.co/cnwttLPoIS">https://t.co/cnwttLPoIS</a> <a href="https://t.co/je511h99L9">pic.twitter.com/je511h99L9</a></p>
<p></p>
<p>— Nujcharee (เป็ด) (<span class="citation" data-cites="Nujcharee">@Nujcharee</span>) <a href="https://twitter.com/Nujcharee/status/939257591431036929?ref_src=twsrc%5Etfw">December 8, 2017</a></p>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>I complimented her work and she asked me to look over the code. I jumped at the chance because (1) I don’t know a ton about scraping website data and wanted to see what she started, (2) I could help with the <code>dplyr</code> part of the code, and most importantly (3) I love APOD!</p>
<p>I love APOD so much that for most of my childhood my life goal was “get a picture published to APOD.” To make matters more exciting, in 2009 <a href="https://apod.nasa.gov/apod/ap090917.html">this happened</a>.</p>
<p><img src="https://eringrand.github.io/posts/scraping_apod_descriptions/apod_me.png" class="img-fluid"></p>
</section>
<section id="getting-the-data" class="level2">
<h2 class="anchored" data-anchor-id="getting-the-data">Getting the Data</h2>
<p>To start, we grab the information from the landing page of APOD’s archive and ignore any links that are not pictures of the day. (Luckily, these all start with “ap” so we can use <code>str_detect()</code> to find them.)</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(rvest)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidytext)</span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## scrape the landing page</span></span>
<span id="cb1-6">apod <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_html</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://apod.nasa.gov/apod/archivepix.html"</span>)</span>
<span id="cb1-7"></span>
<span id="cb1-8"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## scrape all URLs</span></span>
<span id="cb1-9">url <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">html_nodes</span>(apod, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"a"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb1-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_chr</span>(xml_attrs) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">url =</span> .) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(url, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ap"</span>), <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(url, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/"</span>))</span></code></pre></div></div>
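<p>One note on that filter: <code>str_detect(url, "ap")</code> matches “ap” anywhere in the link, not only at the start. A stricter version could anchor the pattern, as in the sketch below. The links here are made up, and the rule that daily pages look like “ap” plus six digits is my assumption, based on links such as <code>ap090917.html</code>.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(stringr)

# Hypothetical hrefs standing in for the scraped links
links &lt;- c("ap171208.html", "ap090917.html", "apsubmit2015.html", "lib/about_apod.html")

# Assumption: picture-of-the-day pages are named "ap" + six digits + ".html"
is_daily &lt;- str_detect(links, "^ap\\d{6}\\.html$")

daily_links &lt;- links[is_daily]
daily_links</code></pre>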
<p>Next, we have to go to each of the pages and scrape the underlying page data. There are A LOT of APODs, so this can take a long time. I’ve chosen to only look at the first 1000 images for now. (More on solving this at the end!)</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># read html from url</span></span>
<span id="cb2-2">my_read_html <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(url, ...) {</span>
<span id="cb2-3">  xml2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_html</span>(url, ...)</span>
<span id="cb2-4">}</span>
<span id="cb2-5"></span>
<span id="cb2-6">data_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> url[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, ] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># first 1000 links</span></span>
<span id="cb2-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">full_url =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://apod.nasa.gov/apod/"</span>, url)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">page =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map</span>(full_url, my_read_html),</span>
<span id="cb2-9">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pic =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_chr</span>(page, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">html_node</span>(.x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xpath =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"//*/img"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">html_attr</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"src"</span>)),</span>
<span id="cb2-10">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_chr</span>(page, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">html_nodes</span>(.x, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">html_text</span>()),</span>
<span id="cb2-11">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">description =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_chr</span>(page, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">html_nodes</span>(.x, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"p"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">html_text</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> .[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(., <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Ex"</span>)]) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># descriptions start with "Explanation:"</span></span>
<span id="cb2-12">         )</span></code></pre></div></div>
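<p>With this many pages, a single failed request can stop the whole run. One way to guard against that, sketched below, is to wrap the reader in <code>purrr::possibly()</code> so a failure returns <code>NULL</code> instead of an error. This is not what I ran above; the helper names (<code>safe_read_html</code>, <code>get_description</code>) and the fake page are invented for illustration.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(rvest)
library(purrr)
library(stringr)

# Return NULL instead of stopping when a page cannot be read
safe_read_html &lt;- possibly(read_html, otherwise = NULL)

# Pull out the "Explanation:" paragraph, or NA if the page is missing or has none
get_description &lt;- function(page) {
  if (is.null(page)) return(NA_character_)
  paragraphs &lt;- html_text(html_nodes(page, "p"))
  hit &lt;- paragraphs[str_detect(paragraphs, "Explanation:")]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Made-up HTML standing in for a real APOD page
fake_html &lt;- paste0(
  "&lt;html&gt;&lt;body&gt;",
  "&lt;p&gt;Credit: Someone&lt;/p&gt;",
  "&lt;p&gt;Explanation: A made-up description.&lt;/p&gt;",
  "&lt;/body&gt;&lt;/html&gt;"
)

fake_page    &lt;- safe_read_html(fake_html)
missing_page &lt;- safe_read_html("no_such_file.html")  # fails to read, so this is NULL

get_description(fake_page)
get_description(missing_page)</code></pre>
<p>In a real run you would map <code>safe_read_html</code> over <code>full_url</code> and probably add a short <code>Sys.sleep()</code> between requests to be kind to the server.</p>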
</section>
<section id="data-cleaning" class="level2">
<h2 class="anchored" data-anchor-id="data-cleaning">Data Cleaning</h2>
<p>With the raw data in hand, I move into more specific text cleaning. I want to start with some quick tidy text analysis of the descriptions, so I want to clean those up first.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> data_raw <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb3-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">description =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_replace_all</span>(description, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>),</span>
<span id="cb3-3">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">description =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_replace_all</span>(description, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Explanation:"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>),</span>
<span id="cb3-4">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_replace_all</span>(title, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>),</span>
<span id="cb3-5">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_replace_all</span>(title, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"APOD:"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>),</span>
<span id="cb3-6">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">if_else</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(title, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2017 November 22"</span>), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2017 November 22 - Oumuamua Interstellar Asteroid"</span>, title)</span>
<span id="cb3-7">         ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb3-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate_all</span>(trimws) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb3-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">separate</span>(title, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">into =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"date"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sep =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" - "</span>)</span></code></pre></div></div>
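<p>One thing to watch with that last <code>separate()</code> step: a title that itself contains " - " would be split into more than two pieces. Here is a minimal sketch on made-up rows (the second title is hypothetical, not from the APOD data) showing how <code>extra = "merge"</code> keeps everything after the first separator in the title column.</p>

```r
library(tibble)
library(tidyr)

# Toy rows in the same "date - title" shape as the cleaned APOD titles
toy <- tibble(title = c(
  "2017 November 22 - Oumuamua Interstellar Asteroid",
  "2018 January 01 - Moon - A Made Up Title" # hypothetical title with an extra " - "
))

# extra = "merge" splits only at the first " - " and keeps the rest together
toy_split <- separate(toy, title, into = c("date", "title"),
                      sep = " - ", extra = "merge")
toy_split
```

<p>Without <code>extra = "merge"</code>, <code>separate()</code> warns and drops the extra pieces, which may or may not be what you want.</p>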
<p>Great, now we can do a quick word count using tidytext tools.</p>
</section>
<section id="fun-stuff---word-count" class="level2">
<h2 class="anchored" data-anchor-id="fun-stuff---word-count">Fun Stuff - word count</h2>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">keep_words <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"way"</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># I don't want "way" as in "Milky Way" to be filtered</span></span>
<span id="cb4-2">my_stop_words <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">word =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"image"</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lexicon =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"PERSONAL"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_rows</span>(stop_words) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>word <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%in%</span> keep_words)</span>
<span id="cb4-6">  </span>
<span id="cb4-7"></span>
<span id="cb4-8">data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>date) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">distinct</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unnest_tokens</span>(word, description) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">anti_join</span>(my_stop_words) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(word, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sort =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">top_n</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span></code></pre></div></div>
<pre><code>## # A tibble: 10 x 2
##    word       n
##    &lt;chr&gt;  &lt;int&gt;
##  1 light   1088
##  2 star     725
##  3 stars    656
##  4 galaxy   636
##  5 nebula   627
##  6 moon     522
##  7 sun      497
##  8 earth    496
##  9 bright   461
## 10 sky      411</code></pre>
<p>I love this because it clearly shows the types of objects that make up most pretty astronomy pictures, i.e., stars, galaxies, and nebulae. Very cool!</p>
<p>If I look at bi-grams, is there any doubt that “Milky Way” will have a strong showing?</p>
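<p>If the stop-word juggling above went by too fast, here is a small sketch of the same idea on toy data. The <code>toy_stop_words</code> table is a stand-in I made up for <code>tidytext::stop_words</code> (the real one is much larger), and the tokens are invented.</p>

```r
library(dplyr)
library(tibble)

# Stand-in for tidytext::stop_words (assumption: same word/lexicon columns)
toy_stop_words <- tibble(word = c("the", "way", "of"), lexicon = "TOY")
keep_words <- c("way")

# Add a personal stop word, then rescue the words we want to keep
my_stop_words <- tibble(word = "image", lexicon = "PERSONAL") %>%
  bind_rows(toy_stop_words) %>%
  filter(!word %in% keep_words)

tokens <- tibble(word = c("the", "milky", "way", "image", "of", "stars"))

# anti_join() drops every token that appears in the stop-word table
kept <- anti_join(tokens, my_stop_words, by = "word")
kept$word
```

<p>Spelling out <code>by = "word"</code> is optional, but it silences the message about which column the join used.</p>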
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1">data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>date) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">distinct</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unnest_tokens</span>(word, description, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">token =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ngrams"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(title, word) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">separate</span>(word, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">into =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word2"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sep =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unite</span>(word, word1, word2, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sep =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(word, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sort =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">top_n</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span></code></pre></div></div>
<pre><code>## # A tibble: 10 x 2
##    word                n
##    &lt;chr&gt;           &lt;int&gt;
##  1 milky way         315
##  2 planet earth      205
##  3 way galaxy        133
##  4 million light     118
##  5 solar system      114
##  6 space telescope   106
##  7 star forming      100
##  8 hubble space       88
##  9 spiral galaxy      88
## 10 star cluster       75</code></pre>
<p>…and there it is, clearly winning over “Planet Earth” and “Solar System.”</p>
<p>As a person who studied star formation, I’m also proud of the strong showing of “star forming” in the bi-grams. Yay baby stars!</p>
</section>
<section id="but-wait-isnt-there-an-api" class="level2">
<h2 class="anchored" data-anchor-id="but-wait-isnt-there-an-api">But wait… isn’t there an API?</h2>
<p>This is great and fun, but what I’d really love is to look at the entire APOD archive, or pull a specific date. Luckily, NASA has a great <a href="https://github.com/nasa/apod-api">API</a> to do just that! The API is super easy to use and simple enough to wrap in some R functions. I decided the coolest thing to do with this API was to create a package, and thus my new package, <a href="https://github.com/eringrand/astropic">astropic</a>, was born (available on GitHub)!</p>
<p>The goal of <a href="https://github.com/eringrand/astropic">astropic</a> is to connect R to the NASA APOD API. The APOD API supports one image at a time. To supply more than that, this package also builds time ranges (of less than 1000 days at a time) and includes some historical data in tibble format.</p>
<p>You can install the current version from GitHub to check it out:</p>
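<p>To give a feel for the "less than 1000 days at a time" part, here is a rough sketch of how a long date range could be cut into chunks. This is not astropic's actual code: <code>chunk_dates()</code>, its arguments, and the column names are hypothetical names I picked for illustration.</p>

```r
# Split [start, end] into consecutive chunks of at most max_days days each.
# chunk_dates() is a hypothetical helper, not a function from astropic.
chunk_dates <- function(start, end, max_days = 999) {
  start <- as.Date(start)
  end <- as.Date(end)
  stopifnot(start <= end)
  # A numeric `by` is interpreted as a number of days for Date sequences
  chunk_starts <- seq(start, end, by = max_days)
  # Each chunk ends max_days - 1 days after it starts, capped at `end`
  chunk_ends <- pmin(chunk_starts + max_days - 1, end)
  data.frame(start_date = chunk_starts, end_date = chunk_ends)
}

chunks <- chunk_dates("2012-01-01", "2018-04-21")
chunks
```

<p>Each row could then be passed to whatever function does the actual request, and the results bound together into one tibble.</p>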
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># install.packages("devtools")</span></span>
<span id="cb8-2">devtools<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install_github</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"eringrand/astropic"</span>)</span></code></pre></div></div>
<p><a href="https://github.com/eringrand/astropic">Astropic</a> does not yet contain ANY tests and the documentation is very sparse. It is most definitely a work in progress - I’ll update more as I add more to it.</p>
<p>Next time on the blog, more about the package creation and cool things you can do with it. In the meantime, please feel free to send pull requests and let me know what you’d like from such a package.</p>


</section>

 ]]></description>
  <category>tidytext</category>
  <category>astronomy</category>
  <guid>https://eringrand.github.io/posts/scraping_apod_descriptions/</guid>
  <pubDate>Sat, 21 Apr 2018 00:00:00 GMT</pubDate>
  <media:content url="https://eringrand.github.io/posts/scraping_apod_descriptions/apod-logo.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Faces of rstudioconf</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/faces_of_rstudioconf/</link>
  <description><![CDATA[ 





<p>I was reminded today by <a href="https://twitter.com/d4tagirl">Daniela</a> that everyone should blog - and on top of that you can totally blog something small and simple just to get something out there.</p>
<p>In the spirit of small and simple, last week I tweeted out this cool image…</p>
<blockquote class="twitter-tweet blockquote" data-lang="en">
<p lang="en" dir="ltr">
</p><p>I am writing up my notes from <a href="https://twitter.com/hashtag/rstduiconf?src=hash&amp;ref_src=twsrc%5Etfw">#rstduiconf</a> (blog post coming!) and wanted to have a quick picture to go with my thoughts. Remembering <a href="https://twitter.com/ma_salmon?ref_src=twsrc%5Etfw"><span class="citation" data-cites="ma_salmon">@ma_salmon</span></a> ’s post on Faces of R (here: <a href="https://t.co/C1sRW3hwVL">https://t.co/C1sRW3hwVL</a> ), I decided to make a Faces of Rstudioconf! <a href="https://t.co/vtNVn2RyHV">pic.twitter.com/vtNVn2RyHV</a></p>
<p></p>
<p>— Erin Grand (<span class="citation" data-cites="astroeringrand">@astroeringrand</span>) <a href="https://twitter.com/astroeringrand/status/961466502821052416?ref_src=twsrc%5Etfw">February 8, 2018</a></p>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>I have not yet organized all my thoughts from the conference (spoilers, it was awesome, I learned so much!), but that will not stop me from posting the code I borrowed from <a href="https://twitter.com/ma_salmon">Maelle</a> to create the pretty image. So, here you go!</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(rtweet)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(magick)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(gmp)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">search_terms <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rstudioconf"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rstudioconf2018"</span>)</span>
<span id="cb2-2"></span>
<span id="cb2-3">tweets <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> purrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_df</span>(search_terms, search_tweets, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">include_rts=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">parse=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>) </span>
<span id="cb2-4"></span>
<span id="cb2-5">users_tweets <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lookup_users</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unique</span>(tweets<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>user_id))</span>
<span id="cb2-6"></span>
<span id="cb2-7">users <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> users_tweets <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(user_id, </span>
<span id="cb2-9">         profile_image_url, </span>
<span id="cb2-10">         screen_name,</span>
<span id="cb2-11">         name, </span>
<span id="cb2-12">         followers_count, </span>
<span id="cb2-13">         profile_image_url</span>
<span id="cb2-14">         ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">distinct</span>()</span>
<span id="cb2-16"></span>
<span id="cb2-17">save_image <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(df){</span>
<span id="cb2-18">  image <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">try</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">image_read</span>(df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>profile_image_url), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">silent =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span>
<span id="cb2-19">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">class</span>(image)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"try-error"</span>){</span>
<span id="cb2-20">    image <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-21">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">image_scale</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"50x50"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-22">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">image_write</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pictures/"</span>, df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>screen_name,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">".jpg"</span>))</span>
<span id="cb2-23">  }</span>
<span id="cb2-24">}</span>
<span id="cb2-25"></span>
<span id="cb2-26">users <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(users, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(profile_image_url))</span>
<span id="cb2-27">users_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">split</span>(users, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(users))</span>
<span id="cb2-28"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">walk</span>(users_list, save_image)</span>
<span id="cb2-29"></span>
<span id="cb2-30"></span>
<span id="cb2-31">files <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dir</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pictures/"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">full.names =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb2-32"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb2-33">files <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(files, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(files))</span>
<span id="cb2-34">gmp<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factorize</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(files))</span>
<span id="cb2-35"></span>
<span id="cb2-36">no_rows <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">14</span></span>
<span id="cb2-37">no_cols <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">31</span></span>
<span id="cb2-38"></span>
<span id="cb2-39">make_column <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(i, files, no_rows){</span>
<span id="cb2-40">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">image_read</span>(files[(i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>no_rows<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>((i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>no_rows)]) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-41">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">image_append</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stack =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-42">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">image_write</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cols/"</span>, i, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">".jpg"</span>))</span>
<span id="cb2-43">}</span>
<span id="cb2-44"></span>
<span id="cb2-45"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">walk</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>(no_cols<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">-1</span>), make_column, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">files =</span> files, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">no_rows =</span> no_rows)</span>
<span id="cb2-46"></span>
<span id="cb2-47"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">image_read</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dir</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cols/"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">full.names =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-48"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">image_append</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stack =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-49">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">image_write</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2018-02-7-facesofrstudioconf.jpg"</span>)</span></code></pre></div></div>
<p><img src="https://github.com/eringrand/projects/blob/master/Rstudio%20Conf%20Twitter%20Pictures/2018-02-7-facesofnasadatanauts.jpg?raw=true" class="img-fluid"></p>
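<p>A quick note on the grid math, since it is easy to get off by one. The sketch below repeats the index arithmetic from <code>make_column()</code> without touching any images. The file count of 434 is my assumption here, chosen because it factors into 14 x 31; <code>column_index()</code> is a helper name made up for this example.</p>

```r
n_files <- 434 # assumed count, chosen so that 14 * 31 fits exactly
no_rows <- 14
no_cols <- 31

# Same arithmetic as make_column(): column i uses
# files (i * no_rows + 1) through ((i + 1) * no_rows), with i starting at 0
column_index <- function(i, no_rows) {
  (i * no_rows + 1):((i + 1) * no_rows)
}

range(column_index(0, no_rows))           # first column
range(column_index(no_cols - 1, no_rows)) # last column

# Collect the indices used across all columns
all_idx <- unlist(lapply(0:(no_cols - 1), column_index, no_rows = no_rows))
length(all_idx) == n_files
```

<p>If the number of files is not exactly <code>no_rows * no_cols</code>, the leftover pictures are simply never read, which is why factorizing the file count first is handy.</p>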



 ]]></description>
  <category>twitter</category>
  <category>conference</category>
  <guid>https://eringrand.github.io/posts/faces_of_rstudioconf/</guid>
  <pubDate>Thu, 15 Feb 2018 00:00:00 GMT</pubDate>
  <media:content url="https://eringrand.github.io/posts/faces_of_rstudioconf/2018-02-7-facesofnasadatanauts.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>R in the World of Education</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/r_in_the_world_of_education/</link>
  <description><![CDATA[ 





<p>I recently gave an R-Ladies presentation about my work cleaning and working with really messy education data. This blog post is an attempt to summarize the main points of the talk.</p>
<blockquote class="twitter-tweet blockquote" data-lang="en">
<p lang="en" dir="ltr">
</p><p>Excited for <a href="https://twitter.com/astroeringrand?ref_src=twsrc%5Etfw"><span class="citation" data-cites="astroeringrand">@astroeringrand</span></a>’s <a href="https://twitter.com/RLadiesNYC?ref_src=twsrc%5Etfw"><span class="citation" data-cites="RLadiesNYC">@RLadiesNYC</span></a> talk on <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> in education! <a href="https://t.co/wX3PT7F2KI">pic.twitter.com/wX3PT7F2KI</a></p>
<p></p>
<p>— Emily Robinson (<span class="citation" data-cites="robinson_es">@robinson_es</span>) <a href="https://twitter.com/robinson_es/status/940730589077999617?ref_src=twsrc%5Etfw">December 12, 2017</a></p>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p><em>Look how cool I am. Go me! The slides are <a href="https://github.com/eringrand/eringrand.github.io/tree/master/Presentations/rladies-nyc-edu">here</a> in case you missed them.</em></p>
<section id="uncommon-schools" class="level1">
<h1>Uncommon Schools</h1>
<p>I’ve been an Associate Director of Data Analytics at Uncommon Schools for almost a year and a half now. Part of my job at Uncommon has been working with and teaching R to my fellow data analysts. As such, I’ve developed a sense of what works best for the type of messy data we’re constantly analyzing.</p>
<p>As a bit of background, Uncommon Schools is a Charter Management Organization (CMO), or network of 52 public <a href="http://www.uncommonschools.org/our-approach/faq-what-is-charter-school">charter schools</a> across Massachusetts, New Jersey, and New York. The oldest school (North Star Academy in Newark) was established in 1997 and the CMO was formed in 2005. For a more in-depth history of Uncommon Schools, check out our <a href="http://www.uncommonschools.org/our-approach/our-history">website</a>.</p>
<section id="what-kind-of-data-do-we-work-with" class="level2">
<h2 class="anchored" data-anchor-id="what-kind-of-data-do-we-work-with">What kind of data do we work with?</h2>
<p>The Data Analytics team focuses on the overall picture, so while we don’t work on reporting data directly back to the state (our regional teams have excellent people working on this), we do get to work with most of the data the organization collects.</p>
<p>The data we work with generally fits into one of these buckets…</p>
<ul>
<li><strong>Assessments</strong>: Interim assessments are taken as practice tests throughout the school year.</li>
<li><strong>Exams</strong>: Common Core aligned state exams, SAT, PSAT, APs, etc.</li>
<li><strong>Classroom</strong>: Assignment grades, attendance, suspensions, etc.</li>
<li><strong>Teacher</strong>: student - course - teacher linkage information</li>
<li><strong>Staff Data</strong>: HR and Recruitment</li>
</ul>
<p>Unfortunately, given the amount of data we have and the number of sources it may be coming from, we have a ton of data challenges to overcome in every analysis.</p>
<p>Of course, every piece of data has its own challenges and “messy nature,” but there are patterns.</p>
<ul>
<li>Missing/Incomplete data</li>
<li>Different data sources without matching IDs (e.g., HR to Teacher to Student)</li>
<li>Movement between schools and courses of students and teachers</li>
<li>Alignment of data and data processes across all schools and regions</li>
<li>Student IDs that change</li>
<li>Human data reporting error</li>
<li>Historical data quality</li>
</ul>
<p>Some of these challenges are easy to fix and some are harder. For example, a messy Excel sheet can be cleaned (by hand or by code). We’ve developed (or are developing in some cases) systems to work with most of these types of challenges.</p>
<ul>
<li>Messy Excel sheets (historical or human entered)</li>
<li>Column names that don’t apply anymore</li>
<li>Lack of historical documentation</li>
<li>Finding duplicate tests</li>
<li>Students that take half of one test and the other half of another</li>
<li>Vanishing leading zeros</li>
<li>Tracking of student IDs that change</li>
<li>Lack of common definitions (e.g., “cohort”)</li>
<li>How to refer to school years or school abbreviations</li>
<li>Data audits</li>
</ul>
</section>
</section>
<section id="the-janitor-package" class="level1">
<h1>The <code>janitor</code> Package</h1>
<p>I really like this explanation by package author, Sam Firke. <a href="https://github.com/sfirke/janitor"><em>Janitor</em></a> <em>was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. Advanced users can already do everything covered here, but they can do it faster with janitor and save their thinking for more fun tasks.</em></p>
<p>Meaning, if you’re experienced with the Tidyverse in general, you should be able to do everything inside <code>janitor</code> on your own. However, we don’t always have the time to clean up data without some help.</p>
<div data-align="center">
<p><img src="http://media3.giphy.com/media/3oKIPCSX4UHmuS41TG/giphy-downsized.gif" width="100px"></p>
</div>
<p>I like using <code>janitor</code> over writing my own code because (1) its functions are well tested, (2) I can turn multiple lines of code into one or two, and (3) the <code>janitor</code> functions are written to be pipe-able, so they work in the tidyverse space. It’s a pretty cool bonus that Sam works in the education space, so the functions were created to handle the education data problems I constantly face.</p>
<section id="an-example-using-janitor-to-clean-a-messy-excel-file." class="level2">
<h2 class="anchored" data-anchor-id="an-example-using-janitor-to-clean-a-messy-excel-file.">An example, using <code>janitor</code> to clean a messy excel file.</h2>
<p>I’m often tasked with cleaning roster files, which contain entry and exit data for students. These files can be very messy due to students who moved between schools or were not exited properly from the system, causing duplicates.</p>
<p>To clean this data, first I read it in with <code>read_excel</code> and use <code>janitor</code>’s <code>clean_names()</code> to convert all the column names to something I can use. <code>remove_empty_cols()</code> and <code>remove_empty_rows()</code> (folded into a single <code>remove_empty()</code> in later <code>janitor</code> releases) drop columns and rows that are entirely NA, since Excel sometimes can’t tell where there is data and where there isn’t.</p>
<p>I choose to use <code>col_types = "text"</code> in my <code>read_excel</code> statement, because I sometimes have to deal with leading zeros, NAs that are not written as NAs, or other text fields in my numerical columns. Reading in as text and converting later allows me to find and correct problems before they become NAs. I use <code>mutate_at</code> to convert these columns back to numbers after checking that everything looks good.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1">students <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> readxl<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_excel</span>(filepath, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sheet=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sheet1"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">col_types =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-2">  janitor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">clean_names</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-3">  janitor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">remove_empty_cols</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-4">  janitor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">remove_empty_rows</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-5">  dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate_at</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vars</span>(entrydate, exitdate, student_id, yearsinuncommon), as.numeric) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-6">  dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate_at</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vars</span>(entrydate, exitdate), excel_numeric_to_date) </span></code></pre></div></div>
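<p>Here’s a minimal sketch of the same cleaning steps on a made-up roster, so you can run it without an Excel file. The column names and values are invented for illustration, every column starts as text (as it would with <code>col_types = "text"</code>), and <code>remove_empty()</code> stands in for the two separate calls above.</p>
<pre><code>library(dplyr)
library(janitor)

# Hypothetical roster: messy names, an empty column, and an empty row
raw &lt;- tibble::tibble(
  `Student ID` = c("0012345", "0067890", NA),
  `Entry Date` = c("43050", "43051", NA),
  `Notes`      = c(NA_character_, NA_character_, NA_character_)
)

students_demo &lt;- raw %&gt;%
  clean_names() %&gt;%                           # "Student ID" becomes student_id
  remove_empty(which = c("rows", "cols")) %&gt;% # drop the all-NA row and column
  mutate(entry_date = excel_numeric_to_date(as.numeric(entry_date)))

students_demo</code></pre>
<p>Because <code>student_id</code> is still text at this point, the leading zeros are intact and I can decide how to handle them before converting anything to a number.</p>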
<p>The next step in data cleaning is to look for duplicates. Luckily, <code>janitor</code> has a super helpful <code>get_dupes()</code> function which does just that!</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">students <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb2-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_dupes</span>(student_id)</span></code></pre></div></div>
<pre><code># A tibble: 2 x 6
  student_id dupe_count grade yearsinuncommon  entrydate   exitdate
       &lt;dbl&gt;      &lt;int&gt; &lt;dbl&gt;           &lt;dbl&gt;     &lt;date&gt;     &lt;date&gt;
1    2342675          2    10               1 2017-11-11 2017-12-11
2    2342675          2    11               1 2017-11-11 2017-12-11</code></pre>
<p>In this example data, I have one student with duplicate information. They’re in two different grades, ugh! Now, I have to choose a method to correct this student in my data.</p>
<section id="there-are-three-main-ways-i-use-to-correct-dupes." class="level3">
<h3 class="anchored" data-anchor-id="there-are-three-main-ways-i-use-to-correct-dupes.">There are three main ways I use to correct dupes.</h3>
<section id="correct-the-dupes-individually-with-if_else-or-case_when." class="level4">
<h4 class="anchored" data-anchor-id="correct-the-dupes-individually-with-if_else-or-case_when.">1. Correct the dupes individually with <code>if_else</code> or <code>case_when</code>.</h4>
<p>If there are only a few duplicates, or they’re all in one grade or classroom, a quick set of <em>if</em> statements will do the trick to make the rows perfectly duplicated. From there, use <code>distinct</code> to get only distinct rows and remove the duplicates.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(students, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">grade =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">if_else</span>(student_id <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2342675</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, grade))</span></code></pre></div></div>
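<p>To show the <code>distinct</code> step as well, here is a small sketch on a made-up version of the roster (the second student and the two-column layout are assumptions for the example):</p>
<pre><code>library(dplyr)

# Hypothetical roster: one student listed in two grades
students_demo &lt;- tibble::tibble(
  student_id = c(2342675, 2342675, 1111111),
  grade      = c(10, 11, 9)
)

deduped &lt;- students_demo %&gt;%
  # Make the two rows for this student identical...
  mutate(grade = if_else(student_id == 2342675, 10, grade)) %&gt;%
  # ...so that distinct() collapses them into one
  distinct()

deduped</code></pre>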
</section>
<section id="summarize-by-taking-minimum-date-grade-to-choose-one-row-to-keep." class="level4">
<h4 class="anchored" data-anchor-id="summarize-by-taking-minimum-date-grade-to-choose-one-row-to-keep.">2. Summarize by taking minimum date / grade to choose one row to keep.</h4>
<p>This is helpful if you just need one of the rows, and don’t really care which row is the one you keep. For example, our entry and exit date information is not usually great, so I’m okay with the ‘just pick one’ version as long as the student’s grade and teacher information is the same in both rows.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(students, student_id) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarize</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">grade =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">min</span>(grade))</span></code></pre></div></div>
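<p>One thing to watch: <code>summarize</code> only keeps the grouping column and the columns you compute. If you want the whole row to survive, one option (a sketch on made-up data, not the code from my scripts) is <code>slice_min()</code>, which is available in dplyr 1.0 and later:</p>
<pre><code>library(dplyr)

# Hypothetical roster with an extra column we want to keep
students_demo &lt;- tibble::tibble(
  student_id      = c(2342675, 2342675, 1111111),
  grade           = c(10, 11, 9),
  yearsinuncommon = c(1, 1, 3)
)

one_row_each &lt;- students_demo %&gt;%
  group_by(student_id) %&gt;%
  # Keep the row with the lowest grade; with_ties = FALSE guarantees one row per student
  slice_min(grade, n = 1, with_ties = FALSE) %&gt;%
  ungroup()

one_row_each</code></pre>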
</section>
<section id="output-the-duplicates-and-manually-choose-which-version-to-keep." class="level4">
<h4 class="anchored" data-anchor-id="output-the-duplicates-and-manually-choose-which-version-to-keep.">3. Output the duplicates and manually choose which version to keep.</h4>
<p>This involves the most manual work, so I usually grab help when I need to do this, yay teammates!</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1">dupes_correct <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_csv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dupes_correct.csv"</span>)</span>
<span id="cb6-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(students, dupes_correct) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replace_na</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">keep =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">assert</span>(not_na, keep) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(keep <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span></code></pre></div></div>
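<p>The review file has to come from somewhere. A minimal sketch of the export step might look like this; the file name, the temp-directory location, and the blank <code>keep</code> column for reviewers to fill in are all assumptions for the example:</p>
<pre><code>library(dplyr)
library(janitor)
library(readr)

# Hypothetical roster with one duplicated student
students_demo &lt;- tibble::tibble(
  student_id = c(2342675, 2342675, 1111111),
  grade      = c(10, 11, 9)
)

review_file &lt;- file.path(tempdir(), "dupes_to_review.csv")

students_demo %&gt;%
  get_dupes(student_id) %&gt;%  # only the duplicated rows
  mutate(keep = NA) %&gt;%      # empty column for a teammate to fill in by hand
  write_csv(review_file)</code></pre>
<p>Once someone has filled in the flag, the file can be read back in and joined to the roster as above.</p>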
</section>
</section>
<section id="managing-data-changes" class="level3">
<h3 class="anchored" data-anchor-id="managing-data-changes">Managing Data Changes</h3>
<p>As data is updated, there might be more duplicates to worry about. A great way to check for new duplicates is to use the one-two punch of <strong>janitor’s</strong> <code>get_dupes()</code> and <strong>assertr’s</strong> <code>verify()</code>. This allows you to put checks in place in case the data changes.</p>
<pre><code>check &lt;- students %&gt;% 
  get_dupes(student_id) %&gt;% 
  verify(nrow(.) == 0)</code></pre>
<p>If new duplicates occur, the code will HALT at this step, alerting you that something is wrong.</p>
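<p>To see the check in action, here is a small sketch with two made-up rosters. The <code>check_no_dupes()</code> wrapper is a hypothetical helper written for this example, and the <code>tryCatch()</code> is only there so the demo keeps running after the failure.</p>
<pre><code>library(dplyr)
library(janitor)
library(assertr)

# Hypothetical helper: stop if any student_id appears more than once
check_no_dupes &lt;- function(df) {
  dupes &lt;- get_dupes(df, student_id)
  verify(dupes, nrow(dupes) == 0)
  invisible(df)
}

clean_roster &lt;- tibble::tibble(student_id = c(1, 2, 3))
dirty_roster &lt;- tibble::tibble(student_id = c(1, 2, 2))

# No duplicates: the check passes and the script carries on
check_no_dupes(clean_roster)

# Duplicates: verify() raises an error, which we catch here for the demo
result &lt;- tryCatch(
  check_no_dupes(dirty_roster),
  error = function(e) "halted"
)
result</code></pre>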
</section>
</section>
</section>
<section id="an-example-project-state-test-analysis" class="level1">
<h1>An Example Project: State Test Analysis</h1>
<p>The highest-impact project the data analytics team is in charge of all year is our annual state test analysis. We gather all the raw results data for each of our schools, clean it, and combine the information into one cohesive story.</p>
<p>The old process for this used a lot of Excel workbooks, manual edits, and a very confusing naming system to port the data into Tableau dashboards. With the many steps and points of error, the process took a really long time. A big goal of ours was to go from raw data to a published Tableau dashboard in a <em>few hours</em> without any big hiccups.</p>
<p>This year we completed an overhaul of the process with R scripts that clean and QC the data, add variables we need for analysis, combine with historical state test results, and output Tableau-ready inputs. The entire state test analysis, from raw data to dashboard, can now be completed with the press of a few buttons.</p>
<p><img src="https://eringrand.github.io/posts/r_in_the_world_of_education/{{ site.baseurl }}/Presentations/rladies-nyc/process.png" height="100px"></p>
<p>Using <code>purrr</code> to read and combine multiple files into one data frame has been the saving grace of this analysis. We have each grade/subject/school combination in a separate file, and with just a few lines of code R can bring them together for further cleaning.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1">files <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list.files</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"../Input/"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pattern =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">".xlsx"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">full.names =</span>  <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb8-2">nys <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_dfr</span>(files, prep_nys_files)</span></code></pre></div></div>
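<p>I haven’t shown <code>prep_nys_files()</code> here, so the sketch below uses a hypothetical stand-in, <code>prep_demo_file()</code>, to show the general shape of the pattern. To keep it runnable without real state files, it writes two tiny made-up CSVs to a temp folder instead of reading Excel workbooks.</p>
<pre><code>library(dplyr)
library(purrr)
library(readr)

# Create two small made-up input files (stand-ins for the real workbooks)
tmp_dir &lt;- file.path(tempdir(), "nys_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write_csv(
  tibble::tibble(`Student ID` = c("001", "002"), Score = c("310", "295")),
  file.path(tmp_dir, "school_a_grade3_math.csv")
)
write_csv(
  tibble::tibble(`Student ID` = "003", Score = "322"),
  file.path(tmp_dir, "school_b_grade3_math.csv")
)

# Hypothetical per-file prep: read everything as text, clean names, tag the source
prep_demo_file &lt;- function(path) {
  read_csv(path, col_types = cols(.default = col_character())) %&gt;%
    janitor::clean_names() %&gt;%
    mutate(source_file = basename(path))
}

files &lt;- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
combined &lt;- map_dfr(files, prep_demo_file)  # one row-bound data frame

combined</code></pre>
<p>Tagging each row with its source file makes it much easier to trace a strange value back to the school and grade it came from.</p>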
<p>If you’re curious about other parts of the code that really changed this process (for example <code>assertr</code> or <code>tidyr</code>), please feel free to ask!</p>
</section>
<section id="wrap-up" class="level1">
<h1>Wrap Up</h1>
<p>These changes didn’t come easy for my team. I work with a group of extremely smart people, but before I started with Uncommon most people on my team didn’t code at all, let alone use R. It’s been part of my job to teach everyone to write in R, and make sure we’re all using best practices. One day I will write a longer blog post about some of the learnings I’ve had throughout this process, so stay tuned for that! In the meantime, I offer you some closing remarks on what I’ve found to work best.</p>
<ul>
<li>Choose the packages to teach that are needed every day. (For me, this was <code>janitor</code> and <code>dplyr</code>)</li>
<li>Have someone who is active in the R community, so that you can be on the cutting edge of best practices and new packages.</li>
<li>The more practice someone has, the faster they’ll learn. Pair professional development sessions with coding projects.</li>
</ul>
<p>Introducing the <code>tidyverse</code> and <code>janitor</code> has been a really big help in getting my team members on board and excited about learning and using R. A quick demonstration of <code>group_by()</code> and <code>get_dupes()</code> was really all I needed to motivate big changes in our analysis process.</p>


</section>

 ]]></description>
  <category>education</category>
  <category>rladies</category>
  <category>duplicate data</category>
  <guid>https://eringrand.github.io/posts/r_in_the_world_of_education/</guid>
  <pubDate>Sat, 30 Dec 2017 00:00:00 GMT</pubDate>
  <media:content url="https://eringrand.github.io/posts/r_in_the_world_of_education/DQ4lewSWAAATBry.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Update to Lizzie Bennet Text Analysis Using Plotly</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/update_to_lizzie_bennet_text_analysis_using_plotly/</link>
  <description><![CDATA[ 





<p><a href="https://twitter.com/d4tagirl">Daniela Vázquez</a> recently published her <a href="https://d4tagirl.com/2017/05/how-do-you-feel-about-last-week-tonight">blog post on Last Week Tonight</a>. She used a bunch of code from my previous LBD analysis (THANKS FOR THE LOVE DANIELA! ❤️) and also created this super cool <code>plotly</code> widget.</p>
<p>I had never used <code>ggplot2</code> and <code>plotly</code> together before and wanted to give it a try, recreating a previous plot of sentiment by LBD episode.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(viridis)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(plotly)</span>
<span id="cb1-3"></span>
<span id="cb1-4">p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(lbsentiment, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span>index, sentiment, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.factor</span>(index), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">text=</span>title)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-5">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_bar</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stat =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"identity"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">show.legend =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-6">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-7">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_text</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span>index, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y=</span>plot_sentiment, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">label=</span>plot_index), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.5</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb1-8">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sentiment in Lizzie Bennet Diaries"</span>,</span>
<span id="cb1-9">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sentiment"</span></span>
<span id="cb1-10">         ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-11">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_viridis</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">end =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">discrete=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">direction =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-12">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_discrete</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">expand=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.02</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-13">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">strip.text=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">hjust=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-14">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">strip.text=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">face =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"italic"</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-15">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.title.x=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-16">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.ticks.x=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-17">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.text.x=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-18">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.position =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"none"</span>)</span>
<span id="cb1-19"></span>
<span id="cb1-20"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplotly</span>(p, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">tooltip=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">width=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">750</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">height=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">400</span>)</span></code></pre></div></div>
<p><img src="https://eringrand.github.io/posts/update_to_lizzie_bennet_text_analysis_using_plotly/newplot.png" class="img-fluid"></p>
<iframe width="750" height="400" src="plotly.html">
</iframe>



 ]]></description>
  <category>sentiment</category>
  <category>tidytext</category>
  <category>LBD</category>
  <category>plotly</category>
  <category>rladies</category>
  <guid>https://eringrand.github.io/posts/update_to_lizzie_bennet_text_analysis_using_plotly/</guid>
  <pubDate>Wed, 31 May 2017 00:00:00 GMT</pubDate>
  <media:content url="https://eringrand.github.io/posts/update_to_lizzie_bennet_text_analysis_using_plotly/newplot.png" medium="image" type="image/png" height="77" width="144"/>
</item>
<item>
  <title>Text Analysis of The Lizzie Bennet Diaries</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/text_analysis_of_the_lizzie_bennet_diaries/</link>
  <description><![CDATA[ 





<p>Inspired by <a href="http://juliasilge.com/">Julia Silge’s</a> recent talk on <a href="http://tidytextmining.com/">Tidytext</a> as part of <a href="https://open.nasa.gov/explore/datanauts/">NASA Datanauts</a>, and her blog posts, I decided to try my hand at some text analysis. Julia’s examples focus on the works of Jane Austen. As Jane Austen has been adapted so many times, I decided to “adapt” Julia’s talk for a modern adaptation of Austen’s Pride and Prejudice: the Lizzie Bennet Diaries.</p>
<p><img src="http://www.pemberleydigital.com/wp-content/uploads/2012/04/LBD-FacebookCover-Emmy.png" class="img-fluid"> <a href="http://www.pemberleydigital.com/the-lizzie-bennet-diaries/">Image source: Pemberley Digital</a></p>
<section id="the-lizzie-bennet-diaries" class="level1">
<h1>The Lizzie Bennet Diaries</h1>
<p>The <a href="http://www.pemberleydigital.com/the-lizzie-bennet-diaries/">Lizzie Bennet Diaries</a> is a modern adaptation of Jane Austen’s Pride and Prejudice for YouTube. The story is told through a series of vlogs by Lizzie Bennet as part of a school project. The series, created by Hank Green and Bernie Su, first aired on April 9, 2012, making this year the <strong>5th Anniversary</strong> of the series! Altogether, the series filmed more than 100 video episodes with over 9.5 hours of video, making it the longest adaptation of Pride and Prejudice to date.</p>
<p>Along with the main LBD channel, there are also some supporting channels. These allow other characters to tell parts of the story that Lizzie doesn’t take part in. For example, Lydia’s vlogs include the story of how she meets George Wickham and their budding relationship. While not required viewing, these extra videos help round out the experience.</p>
<p>Since the series ended, two books have come out from the creators and writers of the original videos: one that follows the videos but adds some more detail to Lizzie’s life, and one that focuses on Lydia’s story after the series ends.</p>
<p>In honor of LBD’s 5th Anniversary, let’s do some LBD text analysis! <strong>Happy Anniversary LBD!</strong></p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="http://media3.giphy.com/media/10MjSRjJxjc6XK/giphy-downsized.gif" class="img-fluid figure-img"></p>
<figcaption>celebration lizzie+bennet</figcaption>
</figure>
</div>
</section>
<section id="analysis" class="level1">
<h1>Analysis</h1>
<section id="gathering-data" class="level2">
<h2 class="anchored" data-anchor-id="gathering-data">Gathering Data</h2>
<p>The first part of this analysis is grabbing all the text from YouTube. To access the API, I use the <a href="https://soodoku.github.io/tuber/"><code>tuber</code></a> package by Gaurav Sood.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tuber)</span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">yt_oauth</span>(app_id, app_password)</span></code></pre></div></div>
<p>The first step was to find the channel ID that leads to the LBD playlist. I did a quick search for <code>lizziebennet</code> to find some videos that I know are part of the series.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"></span>
<span id="cb2-2">search <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">yt_search</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lizziebennet"</span>)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, ] </span>
<span id="cb2-3">search <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(title, channelId)</span>
<span id="cb2-4"></span>
<span id="cb2-5"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">##                                                       title</span></span>
<span id="cb2-6"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 1                         My Name is Lizzie Bennet  - Ep: 1</span></span>
<span id="cb2-7"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 2 The Lizzie Bennet Diaries - Episódio 98 (LEGENDADO PT-BR)</span></span>
<span id="cb2-8"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 3                                      Yeah I Know - Ep: 61</span></span>
<span id="cb2-9"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 4                                 Introducing Lizzie Bennet</span></span>
<span id="cb2-10"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 5                                  The Lizzie Trap - Ep: 78</span></span>
<span id="cb2-11"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">##                  channelId</span></span>
<span id="cb2-12"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 1 UCXfbQAimgtbk4RAUHtIAUww</span></span>
<span id="cb2-13"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 2 UCfhdE-vIhW9GD0eGdd300ag</span></span>
<span id="cb2-14"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 3 UCXfbQAimgtbk4RAUHtIAUww</span></span>
<span id="cb2-15"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 4 UCGaVdbSav8xWuFWTadK6loA</span></span>
<span id="cb2-16"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 5 UCXfbQAimgtbk4RAUHtIAUww</span></span></code></pre></div></div>
<p>With the channel ID in hand (the one shared by the numbered episodes in the search results), I can now access the channel’s resources to find the playlist ID, which I will use to access all the videos in that playlist. <code>list_channel_resources</code> from <code>tuber</code> returns a list of channel attributes, and buried in that list is the playlist ID.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"></span>
<span id="cb3-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Channel Information</span></span>
<span id="cb3-3">a <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list_channel_resources</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">filter =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">channel_id=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"UCXfbQAimgtbk4RAUHtIAUww"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">part=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"contentDetails"</span>)</span>
<span id="cb3-4"></span>
<span id="cb3-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Uploaded playlists:</span></span>
<span id="cb3-6">playlist_id <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> a<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>items[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>contentDetails<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>relatedPlaylists<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>uploads</span>
<span id="cb3-7"></span>
<span id="cb3-8">playlist_id</span>
<span id="cb3-9"></span>
<span id="cb3-10"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## [1] "UUXfbQAimgtbk4RAUHtIAUww"</span></span></code></pre></div></div>
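<p>If the nested-list access above looks opaque, here is a minimal sketch of the same idea in base R. The <code>resp</code> object is a hand-built stand-in that only mimics the part of the response used in this post; it is not a real API response, and the real list contains more fields.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical, hand-built stand-in for the channel resource list.
# Only the branch used in this post is included.
resp &lt;- list(
  items = list(
    list(
      contentDetails = list(
        relatedPlaylists = list(uploads = "UUXfbQAimgtbk4RAUHtIAUww")
      )
    )
  )
)

# Walk down the list one level at a time: first item, then its
# content details, then the related playlists, then the uploads ID.
uploads_id &lt;- resp$items[[1]]$contentDetails$relatedPlaylists$uploads
uploads_id

# In this particular output the uploads playlist ID is the channel ID
# with the leading "UC" swapped for "UU". That is an observation about
# this example, not something I would rely on in general.
channel_id &lt;- "UCXfbQAimgtbk4RAUHtIAUww"
identical(sub("^UC", "UU", channel_id), uploads_id)</code></pre></div>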
<p>The YouTube API pages its results, so the most you get per page is 50 videos. I know I need more than that, so I wrote a loop that keeps requesting the next page until no page token comes back. (This way works, but I would love any comments on how to make it better.)</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"></span>
<span id="cb4-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># pass NA as next page to get first page</span></span>
<span id="cb4-3">nextPageToken <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span>
<span id="cb4-4">vid_info <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span>{}</span>
<span id="cb4-5"></span>
<span id="cb4-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Loop over every available page</span></span>
<span id="cb4-7"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">repeat</span> {</span>
<span id="cb4-8">  vids <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_playlist_items</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">filter=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">playlist_id=</span>playlist_id), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">page_token =</span> nextPageToken)</span>
<span id="cb4-9">  vid_ids <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map</span>(vids<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>items, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"contentDetails"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-10">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"videoId"</span>)  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-11">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unlist</span>()</span>
<span id="cb4-12">    </span>
<span id="cb4-13">  vid_info <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> vid_info <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_rows</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ids =</span> vid_ids))</span>
<span id="cb4-14">    </span>
<span id="cb4-15">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get the token for the next page</span></span>
<span id="cb4-16">  nextPageToken <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.null</span>(vids<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>nextPageToken), vids<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>nextPageToken, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>)</span>
<span id="cb4-17">    </span>
<span id="cb4-18">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># if no more pages then done</span></span>
<span id="cb4-19">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(nextPageToken)){</span>
<span id="cb4-20">     <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span>
<span id="cb4-21">  }</span>
<span id="cb4-22"></span>
<span id="cb4-23">}</span>
<span id="cb4-24"></span>
<span id="cb4-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># check that I have all 112 videos</span></span>
<span id="cb4-26"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(vid_info)</span>
<span id="cb4-27"></span>
<span id="cb4-28"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## [1] 112</span></span></code></pre></div></div>
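<p>To make the paging pattern easier to see on its own, here is a small sketch in base R that runs without any API access. <code>fetch_page</code> and <code>fake_ids</code> are hypothetical stand-ins I made up for illustration; they are not part of <code>tuber</code>. The fake function hands back up to 50 IDs at a time plus a token for the next page, and the loop stops when the token is missing.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical data: 112 made-up video IDs.
fake_ids &lt;- sprintf("video_%03d", 1:112)

# Hypothetical stand-in for a paged API call. It returns one page of
# IDs and a token for the next page (NULL when nothing is left).
fetch_page &lt;- function(token = NULL, size = 50) {
  start &lt;- if (is.null(token)) 1 else as.integer(token)
  end &lt;- min(start + size - 1, length(fake_ids))
  next_token &lt;- if (end &lt; length(fake_ids)) as.character(end + 1) else NULL
  list(items = fake_ids[start:end], nextPageToken = next_token)
}

all_ids &lt;- character(0)
token &lt;- NULL

# Keep asking for pages until no next-page token comes back.
repeat {
  page &lt;- fetch_page(token)
  all_ids &lt;- c(all_ids, page$items)
  token &lt;- page$nextPageToken
  if (is.null(token)) break
}

length(all_ids)</code></pre></div>
<p>The design choice here is to accumulate results across pages in one vector (<code>all_ids</code>), so nothing from an earlier page is lost when the next page overwrites <code>page</code>.</p>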
<p>Now that I have a list of video IDs, I can use <code>get_captions</code> to access the text of the videos. I also use <code>xmlTreeParse</code> and <code>xmlToList</code> to convert the captions into easily accessible lines of text. I put the text, video ID, and video title in a tibble so it is ready for tidy text analysis.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"></span>
<span id="cb5-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(XML)</span>
<span id="cb5-3"></span>
<span id="cb5-4">getText <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(id){</span>
<span id="cb5-5">  x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_captions</span>(id, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lang =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"en"</span>)</span>
<span id="cb5-6">  title <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_video_details</span>(id)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>title</span>
<span id="cb5-7">  a <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">xmlTreeParse</span>(x)</span>
<span id="cb5-8">  text <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> a<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>doc<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>children<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>transcript</span>
<span id="cb5-9">  text <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">xmlToList</span>(text, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">simplify =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">addAttributes =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb5-10">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb5-11">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">id =</span> id, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> title)</span>
<span id="cb5-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(text) </span>
<span id="cb5-13">}</span>
<span id="cb5-14"></span>
<span id="cb5-15">text <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_df</span>(vid_info<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>ids, getText) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb5-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set_names</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"vid_id"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>))</span></code></pre></div></div>
<p>I don’t actually want to refer to each video by its full title, so I do some data munging to get each episode’s number (1-100). Notice that the Q&amp;A videos do not get an episode number assigned to them. For the sake of this analysis, I’ve decided to only work with the main 100 episodes.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"></span>
<span id="cb6-2">titles <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> text <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">distinct</span>(title) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(title <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Question and Answers #3 (ft. Caroline Lee)"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Questions and Answers #3 (ft. Caroline Lee)"</span>, title),</span>
<span id="cb6-5">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ep_num =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gsub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"[- .)(+!',/]|[a-zA-Z]*:?"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>, title),</span>
<span id="cb6-6">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ep_num =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(title <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2 + 1 - Ep: 73"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">73</span>, ep_num),</span>
<span id="cb6-7">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ep_num =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(title <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"25 Douchebags and a Gentleman - Ep:18"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>, ep_num),</span>
<span id="cb6-8">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ep_num =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(title <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bing Lee and His 500 Teenage Prostitutes - Ep: 4"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, ep_num),</span>
<span id="cb6-9">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ep_num =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">parse_number</span>(ep_num)</span>
<span id="cb6-10">         ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grepl</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Questions and Answers"</span>, title)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb6-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">arrange</span>(ep_num) </span></code></pre></div></div>
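<p>As an aside, here is a hedged sketch of another way to get at the episode number: capture the digits that follow <code>Ep:</code> instead of stripping everything else away. This is not what I ran for this post. It uses base R only, the four titles are a small sample taken from the ones shown above, and it assumes every main episode title contains <code>Ep:</code> followed by the number.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># A few sample titles, including two that start with digits and one Q&amp;A.
sample_titles &lt;- c(
  "My Name is Lizzie Bennet  - Ep: 1",
  "2 + 1 - Ep: 73",
  "25 Douchebags and a Gentleman - Ep:18",
  "Questions and Answers #3 (ft. Caroline Lee)"
)

# Capture the digits after "Ep:" (with or without a space).
matches &lt;- regmatches(sample_titles, regexec("Ep: ?([0-9]+)", sample_titles))

# Each match holds the full match and the captured group;
# titles without "Ep:" give an empty result and become NA.
ep_num &lt;- vapply(
  matches,
  function(m) if (length(m) == 2) as.integer(m[2]) else NA_integer_,
  integer(1)
)

ep_num</code></pre></div>
<p>With this approach the titles that begin with numbers need no special cases, and the Q&amp;A video comes out as <code>NA</code>.</p>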
<p>One of the problems with using captions is the messy text. I used a simple set of <code>gsub</code> commands to turn the HTML entities left in the captions (for apostrophes, quotation marks, and ampersands) back into plain text. I also pulled the name of the speaking character out of the text itself. I left this column alone in the data set, but might one day go back and focus an analysis on the speaking characters.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"></span>
<span id="cb7-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidytext)</span>
<span id="cb7-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(stringr)</span>
<span id="cb7-4"></span>
<span id="cb7-5">lizziebennet <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> text <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(titles, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"title"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(ep_num)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">arrange</span>(ep_num) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linenumber =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row_number</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">text =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gsub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&amp;#39;"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"'"</span>, text),</span>
<span id="cb7-11">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">text =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gsub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&amp;quot;"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\"</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>, text),</span>
<span id="cb7-12">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">text =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gsub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&amp;amp;"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"and"</span>, text),</span>
<span id="cb7-13">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">character =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_extract</span>(text, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"^[a-zA-Z]*:"</span>),</span>
<span id="cb7-14">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">text =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"^[a-zA-Z]*:"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>, text)</span>
<span id="cb7-15">         ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">arrange</span>(ep_num, linenumber)</span></code></pre></div></div>
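<p>Here is a tiny base R sketch of the two cleaning steps, run on made-up caption lines so the effect is visible. The three lines in <code>caption_lines</code> are invented for illustration and are not quotes from the series.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Made-up caption lines: two with a speaker prefix, one without.
caption_lines &lt;- c(
  "Lizzie: It&amp;#39;s a truth",
  "Charlotte: Hi",
  "just narration"
)

# Step 1: turn the apostrophe entity back into an apostrophe.
decoded &lt;- gsub("&amp;#39;", "'", caption_lines, fixed = TRUE)

# Step 2: pull out the speaker name when a line starts with "Name:".
has_speaker &lt;- grepl("^[A-Za-z]+:", decoded)
speaker &lt;- ifelse(has_speaker, sub("^([A-Za-z]+):.*$", "\\1", decoded), NA_character_)

# Step 3: drop the "Name:" prefix and any leftover whitespace.
spoken &lt;- trimws(sub("^[A-Za-z]+:", "", decoded))

data.frame(speaker = speaker, text = spoken)</code></pre></div>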
<p>Okay, so now the text is <em>mostly</em> in place. The first thing I did was look at word counts. The most common words are not surprising: the list is mostly character names.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"></span>
<span id="cb8-2">lizziebennet <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-3">  tidytext<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unnest_tokens</span>(word, text) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">anti_join</span>(stop_words, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(word, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sort=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">top_n</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb8-7"></span>
<span id="cb8-8"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## # A tibble: 10 × 2</span></span>
<span id="cb8-9"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">##         word     n</span></span>
<span id="cb8-10"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">##        &lt;chr&gt; &lt;int&gt;</span></span>
<span id="cb8-11"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 1     lizzie   460</span></span>
<span id="cb8-12"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 2       jane   301</span></span>
<span id="cb8-13"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 3      darcy   243</span></span>
<span id="cb8-14"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 4       bing   232</span></span>
<span id="cb8-15"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 5    collins   220</span></span>
<span id="cb8-16"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 6      lydia   196</span></span>
<span id="cb8-17"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 7     bennet   194</span></span>
<span id="cb8-18"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 8  charlotte   180</span></span>
<span id="cb8-19"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 9       yeah   178</span></span>
<span id="cb8-20"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 10      time   176</span></span></code></pre></div></div>
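<p>For anyone curious what the tokenize, filter, and count steps are doing, here is a rough base R sketch of the same idea on three made-up sentences. It is not how <code>tidytext</code> works internally: the splitting rule is a simple regular expression I picked for this example, and <code>tiny_stop_list</code> is a hypothetical three-word stand-in for the real <code>stop_words</code> table.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Made-up sentences for illustration only.
toy_text &lt;- c(
  "My name is Lizzie Bennet",
  "Jane and Lizzie talk",
  "Lizzie vlogs"
)

# Tokenize: lower-case, then split on anything that is not a letter
# or an apostrophe, and drop empty strings.
words &lt;- unlist(strsplit(tolower(toy_text), "[^a-z']+"))
words &lt;- words[nzchar(words)]

# Hypothetical, very small stop-word list.
tiny_stop_list &lt;- c("my", "is", "and")

# Remove stop words, then count and sort from most to least common.
kept &lt;- words[!words %in% tiny_stop_list]
counts &lt;- sort(table(kept), decreasing = TRUE)

counts</code></pre></div>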
<p>Not surprisingly, the most common trigrams are from the phrase that begins every episode, “My name is Lizzie Bennet and…”</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"></span>
<span id="cb9-2">lizziebennet <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb9-3">  tidytext<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unnest_tokens</span>(word, text, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">token=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ngrams"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb9-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(word, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sort=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb9-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">top_n</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb9-6"></span>
<span id="cb9-7"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## # A tibble: 10 × 2</span></span>
<span id="cb9-8"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">##                 word     n</span></span>
<span id="cb9-9"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">##                &lt;chr&gt; &lt;int&gt;</span></span>
<span id="cb9-10"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 1         my name is   106</span></span>
<span id="cb9-11"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 2   is lizzie bennet    96</span></span>
<span id="cb9-12"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 3     name is lizzie    96</span></span>
<span id="cb9-13"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 4  lizzie bennet and    84</span></span>
<span id="cb9-14"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 5       i don't know    40</span></span>
<span id="cb9-15"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 6          oh my god    36</span></span>
<span id="cb9-16"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 7           a lot of    33</span></span>
<span id="cb9-17"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 8        going to be    31</span></span>
<span id="cb9-18"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 9       what are you    29</span></span>
<span id="cb9-19"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 10     mr collins oh    28</span></span></code></pre></div></div>
<p>I was also especially amused by <em>So good to see you!</em> and <em>THE MOST AWKWARD DANCE EVER</em> being in the Top 10 5-grams.</p>
<pre><code>## # A tibble: 12 × 2
##                           word     n
##                          &lt;chr&gt; &lt;int&gt;
## 1     my name is lizzie bennet    95
## 2    name is lizzie bennet and    83
## 3       is lizzie bennet and i    19
## 4    is lizzie bennet and this    14
## 5    lizzie bennet and this is    11
## 6           so good to see you     9
## 7       had nothing to do with     5
## 8     is lizzie bennet and i'm     5
## 9       lizzie bennet and i am     5
## 10 the most awkward dance ever     5
## 11     what are you doing here     5</code></pre>
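<p>The code behind the 5-gram table is not shown; it is presumably the trigram pipeline with <code>n = 5</code>. Here is a minimal sketch of that idea on a made-up two-row data frame. <code>toy_transcripts</code> is a hypothetical stand-in for <code>lizziebennet</code>, and <code>slice_max()</code> is swapped in for <code>top_n()</code>.</p>
<pre><code>library(dplyr)
library(tidytext)

# Hypothetical stand-in for the transcript data (not the real lizziebennet data frame)
toy_transcripts &lt;- tibble(
  title = c("Ep 1", "Ep 2"),
  text  = c("my name is lizzie bennet and this is my life",
            "my name is lizzie bennet and i am tired")
)

five_grams &lt;- toy_transcripts %&gt;%
  # same call as the trigram chunk, but with n = 5
  unnest_tokens(word, text, token = "ngrams", n = 5) %&gt;%
  count(word, sort = TRUE)

# keep the most frequent 5-grams; ties with the last kept row are included
top_five_grams &lt;- five_grams %&gt;%
  slice_max(n, n = 2)

top_five_grams</code></pre>
<p>Like <code>top_n()</code>, <code>slice_max()</code> keeps ties by default, which is why a “top 10” table can come back with more than ten rows.</p>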
</section>
<section id="sentiment-analysis" class="level2">
<h2 class="anchored" data-anchor-id="sentiment-analysis">Sentiment Analysis</h2>
<p>I’ve chosen to use the Bing lexicon (because of Bing Lee, get it?). With tidy data, sentiment analysis is easy because you just join the lexicon against your tokenized words.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"></span>
<span id="cb11-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Bing lexicon (columns: word, sentiment). Older tidytext releases exposed it as</span></span>
<span id="cb11-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># sentiments %&gt;% filter(lexicon == "bing") %&gt;% select(-score); current ones use get_sentiments().</span></span>
<span id="cb11-4">bing <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> tidytext<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_sentiments</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bing"</span>)</span>
<span id="cb11-5"></span>
<span id="cb11-6">lbwordcount <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> lizziebennet <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb11-7">  tidytext<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unnest_tokens</span>(word, text) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb11-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">anti_join</span>(stop_words) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb11-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(title)</span>
<span id="cb11-10">  </span>
<span id="cb11-11">lbsentiment <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> lizziebennet <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb11-12">  tidytext<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unnest_tokens</span>(word, text) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb11-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">anti_join</span>(stop_words) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb11-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">inner_join</span>(bing) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb11-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(title, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">index=</span>ep_num, sentiment) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb11-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">spread</span>(sentiment, n, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb11-17">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(lbwordcount) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb11-18">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sentiment =</span> positive <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> negative,</span>
<span id="cb11-19">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sentiment =</span> sentiment <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n)  </span></code></pre></div></div>
<p>Most positive sentiment episodes:</p>
<pre><code>## # A tibble: 5 × 2
##                                       title  sentiment
##                                       &lt;chr&gt;      &lt;dbl&gt;
## 1                    Care Packages - Ep: 58 0.09623431
## 2                         The End - Ep: 100 0.09375000
## 3                   Jane Chimes In - Ep: 12 0.09132420
## 4 My Parents: Opposingly Supportive - Ep: 3 0.08415842
## 5      Wishing Something Universal - Ep: 76 0.08018868</code></pre>
<p>Most negative sentiment episodes:</p>
<pre><code>## # A tibble: 5 × 2
##                            title   sentiment
##                            &lt;chr&gt;       &lt;dbl&gt;
## 1   Turn About the Room - Ep: 32 -0.15217391
## 2        How About That - Ep: 91 -0.09937888
## 3          Staff Spirit - Ep: 59 -0.09745763
## 4 How to Hold a Grudge  - Ep: 74 -0.09352518
## 5      Meeting Bing Lee - Ep: 28 -0.07614213</code></pre>
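<p>The score in these tables is the net number of positive words per non-stop word in an episode: (positive - negative) / n. The code that ranks the episodes is not shown, so here is a minimal sketch of one way to do it. The counts in <code>toy_sentiment</code> are made up, and the table is only a hypothetical stand-in for <code>lbsentiment</code>.</p>
<pre><code>library(dplyr)

# Made-up per-episode counts standing in for lbsentiment
toy_sentiment &lt;- tibble(
  title    = c("Ep A", "Ep B", "Ep C"),
  positive = c(12, 5, 8),
  negative = c(4, 9, 8),
  n        = c(100, 80, 120)  # non-stop-word count per episode
) %&gt;%
  # net sentiment, normalised by episode length
  mutate(sentiment = (positive - negative) / n)

# highest-scoring episode(s)
most_positive &lt;- toy_sentiment %&gt;%
  slice_max(sentiment, n = 1) %&gt;%
  select(title, sentiment)

# lowest-scoring episode(s)
most_negative &lt;- toy_sentiment %&gt;%
  slice_min(sentiment, n = 1) %&gt;%
  select(title, sentiment)</code></pre>
<p>Dividing by the word count keeps long episodes from dominating just because they contain more words. With the real data, <code>n = 5</code> in place of <code>n = 1</code> would give five rows per table.</p>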
<p>The next step was to visualize how the sentiment changes over the course of the episodes.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1"></span>
<span id="cb14-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(viridis)</span>
<span id="cb14-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_set</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>()) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># a theme with a white background</span></span>
<span id="cb14-4"></span>
<span id="cb14-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(lbsentiment, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span>index, sentiment, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.factor</span>(index))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb14-6">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_bar</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stat =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"identity"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">show.legend =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb14-7">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb14-8">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data=</span>plot_text, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span>index, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y=</span>sentiment, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">label=</span>index), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.5</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb14-9">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sentiment in Lizzie Bennet Diaries"</span>,</span>
<span id="cb14-10">             <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sentiment"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb14-11">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_viridis</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">end =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">discrete=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">direction =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb14-12">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_discrete</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">expand=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.02</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb14-13">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">strip.text=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">hjust=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb14-14">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">strip.text=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">face =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"italic"</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb14-15">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.title.x=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb14-16">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.ticks.x=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb14-17">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.text.x=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>())</span></code></pre></div></div>
<p><img src="https://eringrand.github.io/posts/text_analysis_of_the_lizzie_bennet_diaries/{{ site.baseurl }}/images/lizziebennet_textmining_files/figure-markdown_github/unnamed-chunk-17-1.png" class="img-fluid"> <em>Sentiment by main episode of LBD.</em></p>
<p>Julia’s sentiment analysis of the original text is much more positive than my LBD analysis, with two negative portions relating to Darcy proposing to Elizabeth and Lydia running away with Wickham. I had expected a similar “Wickham” negative spike in this plot, and while the Wickham-related episodes (Ep 84 to Ep 89) are surely negative, they are no more negative than some of the introductory episodes.</p>
<p>One could argue that, because most of the Lydia-Wickham storyline happens off screen and in Lydia’s blogs, it makes sense that there is no clear negative spike in the Wickham episodes.</p>
</section>
<section id="more-sentiment" class="level2">
<h2 class="anchored" data-anchor-id="more-sentiment">More sentiment</h2>
<p>Continuing the analysis, I wanted to look at which words were causing the largest effect on the overall sentiment.</p>
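<p>The plotting chunk that follows uses a <code>bing_word_counts</code> table that is not defined in this post. Here is a minimal sketch of how such a table could be built, assuming it holds one row per word with its Bing sentiment and count. <code>toy_transcripts</code> is a made-up stand-in for <code>lizziebennet</code>.</p>
<pre><code>library(dplyr)
library(tidytext)

# Hypothetical stand-in for the transcript data
toy_transcripts &lt;- tibble(
  title = c("Ep 1", "Ep 2"),
  text  = c("i miss charlotte and i miss home but the party was awesome",
            "miss bennet that was a terrible awkward dance")
)

bing_word_counts &lt;- toy_transcripts %&gt;%
  unnest_tokens(word, text) %&gt;%                        # one row per word
  inner_join(get_sentiments("bing"), by = "word") %&gt;%  # keep only words in the Bing lexicon
  count(word, sentiment, sort = TRUE)                  # how often each scored word appears

bing_word_counts</code></pre>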
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1"></span>
<span id="cb15-2">bing_word_counts <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb15-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(sentiment) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb15-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">top_n</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb15-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">word =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">reorder</span>(word, n)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb15-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(word, n, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> sentiment)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_col</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">show.legend =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">facet_wrap</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>sentiment, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">scales =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"free_y"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Contribution to sentiment"</span>,</span>
<span id="cb15-10">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coord_flip</span>()</span></code></pre></div></div>
<p><img src="https://eringrand.github.io/posts/text_analysis_of_the_lizzie_bennet_diaries/{{ site.baseurl }}/images/lizziebennet_textmining_files/figure-markdown_strict/unnamed-chunk-18-1.png" class="img-fluid"></p>
<p>Given that this is a modern adaptation, it’s interesting that, much like in the analysis of the original, “miss” is the top contributor to negative sentiment. I would expect the original text to have a higher count of “Miss Bennet” than the modernized version. However, Lizzie does talk about how she’ll miss Charlotte, or how she misses her home, and so on, so it’s not too surprising to see it make a considerable contribution here.</p>
<p>I did a bit of an investigation into this with bigrams.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb16-1"></span>
<span id="cb16-2">lizziebennet <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb16-3">  tidytext<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unnest_tokens</span>(bigram, text, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">token=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ngrams"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb16-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">separate</span>(bigram, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word2"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sep =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb16-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(word1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"miss"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb16-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">miss_in_name =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(word2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%in%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bennet"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lu"</span>), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Yes"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"No"</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb16-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(miss_in_name)</span>
<span id="cb16-8"></span>
<span id="cb16-9"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## # A tibble: 2 × 2</span></span>
<span id="cb16-10"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">##   miss_in_name     n</span></span>
<span id="cb16-11"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">##          &lt;chr&gt; &lt;int&gt;</span></span>
<span id="cb16-12"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 1           No    26</span></span>
<span id="cb16-13"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 2          Yes    27</span></span></code></pre></div></div>
<p>And oddly enough, the use of the word “miss” is about half and half between “I miss [person/thing]” and “Miss Bennet” type phrases. Interesting! (Anyone want to guess who refers to Lizzie as Miss Bennet the most? Unsurprisingly, it’s Ricky Collins.)</p>
</section>
<section id="more-with-bigrams" class="level2">
<h2 class="anchored" data-anchor-id="more-with-bigrams">More with Bigrams</h2>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb17-1"></span>
<span id="cb17-2">bigrams_separated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> lizziebennet <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb17-3">  tidytext<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unnest_tokens</span>(bigram, text, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">token=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ngrams"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb17-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">separate</span>(bigram, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word2"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sep =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb17-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>word1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%in%</span> stop_words<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>word, </span>
<span id="cb17-6">         <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>word2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%in%</span> stop_words<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>word) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb17-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(word1, word2, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sort =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb17-8"></span>
<span id="cb17-9">bigrams_separated <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb17-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb17-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">top_n</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>) </span>
<span id="cb17-12"></span>
<span id="cb17-13"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## # A tibble: 11 × 3</span></span>
<span id="cb17-14"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">##      word1   word2     n</span></span>
<span id="cb17-15"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">##      &lt;chr&gt;   &lt;chr&gt; &lt;int&gt;</span></span>
<span id="cb17-16"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 1   lizzie  bennet   132</span></span>
<span id="cb17-17"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 2     bing     lee    43</span></span>
<span id="cb17-18"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 3   george wickham    24</span></span>
<span id="cb17-19"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 4      hey  lizzie    24</span></span>
<span id="cb17-20"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 5       de  bourgh    22</span></span>
<span id="cb17-21"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 6    ricky collins    21</span></span>
<span id="cb17-22"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 7      los angeles    20</span></span>
<span id="cb17-23"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 8     miss  bennet    19</span></span>
<span id="cb17-24"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 9     tour  leader    18</span></span>
<span id="cb17-25"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 10   video    blog    17</span></span>
<span id="cb17-26"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## 11 william   darcy    17</span></span></code></pre></div></div>
<p>Not surprisingly, the most common bigrams are characters’ first and last names, but there are also some fun ones like “tour leader” and “video blog.” I guess <em>vlog</em> wasn’t super popular to use on its own yet.</p>
</section>
<section id="network-of-words" class="level2">
<h2 class="anchored" data-anchor-id="network-of-words">Network of Words</h2>
<p>One of my favorite parts of tidytext is the example on making a bigram network. It’s just so fun!</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb18-1"></span>
<span id="cb18-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(igraph)</span>
<span id="cb18-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggraph)</span>
<span id="cb18-4"></span>
<span id="cb18-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb18-6"></span>
<span id="cb18-7">bigrams_separated <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb18-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb18-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">graph_from_data_frame</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb18-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggraph</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">layout =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fr"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_edge_link</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_node_point</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_node_text</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">label =</span> name), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">vjust =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">hjust =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.title.x=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.ticks.x=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.text.x=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-17">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.title.y=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-18">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.ticks.y=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb18-19">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.text.y=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>())</span></code></pre></div></div>
<p><img src="https://eringrand.github.io/posts/text_analysis_of_the_lizzie_bennet_diaries/{{ site.baseurl }}/images/lizziebennet_textmining_files/figure-markdown_github/unnamed-chunk-21-1.png" class="img-fluid"></p>
<p>I especially enjoy the Bennet sister cluster in the left corner.</p>
<hr>
<p>I leave you with this last picture.</p>
<p><img src="https://scontent-lga3-1.xx.fbcdn.net/v/t1.0-9/1908336_10202570761373587_7013966634375610561_n.jpg?oh=1a5119c2ae93bbd9b01060523cc7e43c&amp;oe=59733FEF" class="img-fluid"> Some of the cast of The Lizzie Bennet Diaries and me at VidCon 2014.</p>


</section>
</section>

 ]]></description>
  <category>rladies</category>
  <category>tidytext</category>
  <category>sentiment</category>
  <category>LBD</category>
  <guid>https://eringrand.github.io/posts/text_analysis_of_the_lizzie_bennet_diaries/</guid>
  <pubDate>Tue, 02 May 2017 00:00:00 GMT</pubDate>
  <media:content url="http://media3.giphy.com/media/10MjSRjJxjc6XK/giphy-downsized.gif" medium="image" type="image/gif"/>
</item>
<item>
  <title>New York Rstats Conference</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/new_york_rstats_conference/</link>
  <description><![CDATA[ 





<p>Last weekend was the 2017 RStatsNYC conference. I had a great time talking to friends and meeting new ones throughout the weekend. The speakers covered a variety of topics, from data ethics to cloud computing. I’ve compiled my notes, plus some of the popular tweets from the conference, below.</p>
<section id="day-1" class="level1">
<h1>Day 1</h1>
<section id="how-r-helps-airbnb-make-the-most-of-its-data" class="level3">
<h3 class="anchored" data-anchor-id="how-r-helps-airbnb-make-the-most-of-its-data">How R Helps Airbnb Make the Most of Its Data</h3>
<p>Ricardo Bion, Airbnb</p>
<ul>
<li>Airbnb started in 2008 with 1 city and 1 room; now there are 3M homes in 71K cities</li>
<li>100+ data scientists using a mix of languages: mostly R, but lots of Python</li>
<li>Why use an R package:
<ul>
<li>Passing around functions required duplication of work, whereas a package can include data, tests, add-ins, vignettes, R Markdown and notebook templates</li>
<li>The Airbnb packages have a consistent API, branded visualizations, branded templates, and, of course, functions</li>
</ul></li>
<li>Education:
<ul>
<li>made a difference in confidence of R stats</li>
<li>new hire buddy</li>
<li>intro classes at datacamp if interested, sponsored by airbnb</li>
<li>peer support with office hours, code review, R slack group</li>
<li>learning lunch, journal club, offsites</li>
</ul></li>
<li>Reproducibility:
<ul>
<li>scale knowledge</li>
<li>knowledge repo</li>
<li>posts have tags with topics, date, then served as a web ui</li>
<li>uses github for peer review</li>
<li>branded template</li>
<li>incorporate best practices from academia and software</li>
</ul></li>
</ul>
<blockquote class="twitter-tweet blockquote" data-lang="en">
<p lang="en" dir="ltr">
</p><p>Hearing about reproducibility and R packages at Airbnb from <a href="https://twitter.com/ricardobion"><span class="citation" data-cites="ricardobion">@ricardobion</span></a> at NYR <a href="https://twitter.com/hashtag/rstatsnyc?src=hash">#rstatsnyc</a> <a href="https://t.co/uGSZwtX1XM">pic.twitter.com/uGSZwtX1XM</a></p>
<p></p>
<p>— Julia Silge (<span class="citation" data-cites="juliasilge">@juliasilge</span>) <a href="https://twitter.com/juliasilge/status/855413199344209922">April 21, 2017</a></p>
</blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
</section>
<section id="fine-grained-visual-category-recognition-and-perceptual-embedding" class="level3">
<h3 class="anchored" data-anchor-id="fine-grained-visual-category-recognition-and-perceptual-embedding">Fine Grained Visual Category Recognition and Perceptual Embedding</h3>
<p>Serge Belongie, Cornell University</p>
<ul>
<li>Really interesting talk on using a stochastic neighbor algorithm with crowdsourcing to get a visual similarity of images.</li>
<li>Motivation of humans and computers working together</li>
<li>His talk focused on detecting what type of bird was in an image</li>
<li>Detecting that there is a bird in the picture is getting easy for computers, but detecting the exact name of the bird is much more difficult</li>
</ul>
<blockquote class="twitter-tweet blockquote" data-lang="en">
<p lang="en" dir="ltr">
</p><p>.<a href="https://twitter.com/SergeBelongie"><span class="citation" data-cites="SergeBelongie">@SergeBelongie</span></a> showing a stochastic neighbor algorithm with crowd sourcing to get a visual similarity of images <a href="https://twitter.com/hashtag/rstatsnyc?src=hash">#rstatsnyc</a> <a href="https://t.co/TRuIruTELS">pic.twitter.com/TRuIruTELS</a></p>
<p></p>
<p>— Erin Grand (<span class="citation" data-cites="astroeringrand">@astroeringrand</span>) <a href="https://twitter.com/astroeringrand/status/855420509344993282">April 21, 2017</a></p>
</blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
</section>
<section id="an-r-cloud-computing-lifeline-the-missing-manual-for-running-r-on-amazon-cloud" class="level3">
<h3 class="anchored" data-anchor-id="an-r-cloud-computing-lifeline-the-missing-manual-for-running-r-on-amazon-cloud">An R Cloud Computing Lifeline: The Missing Manual for Running R on Amazon Cloud</h3>
<p>Kelly O’Briant, B23</p>
<ul>
<li>Working with R in the cloud is different from working with RStudio on your computer: you have to install all your favorite packages again every time you start up a new server</li>
<li>Benefits: bigger instance sizes, analysis while sleeping, running multiple R servers at the same time; instances themselves are disposable and renewable resources</li>
<li>able to use tools in a more powerful manner</li>
</ul>
<blockquote class="twitter-tweet blockquote" data-lang="en">
<p lang="en" dir="ltr">
</p><p><a href="https://twitter.com/b23kelly"><span class="citation" data-cites="b23kelly">@b23kelly</span></a> created a custom R package to set up new projects/deployed servers fast without having to reconfigure anew each time <a href="https://twitter.com/hashtag/rstatsnyc?src=hash">#rstatsnyc</a></p>
<p></p>
<p>— Alec Barrett (<span class="citation" data-cites="alecbarrett">@alecbarrett</span>) <a href="https://twitter.com/alecbarrett/status/855424336068530180">April 21, 2017</a></p>
</blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
</section>
<section id="r-makes-the-world-go-round-data-driven-decision-making-at-jetblue" class="level3">
<h3 class="anchored" data-anchor-id="r-makes-the-world-go-round-data-driven-decision-making-at-jetblue">R Makes the World Go ’Round: Data-Driven Decision Making at JetBlue</h3>
<p>Catherine Zhou, JetBlue</p>
<p>One of the interesting takeaways throughout the conference was that you can start coding in R pretty quickly if you start with the right ideas and tools about what R is. Most intro stats classes (mine included) treat R and other programming languages as a calculator. But R is so much more than that! <a href="https://twitter.com/catherinezh">Catherine Z</a> made a point about giving people templates with tidyverse functions to produce their own analyses. In my own work, I’m helping my coworker think of the tidyverse in terms of familiar Excel tasks, i.e., <code>group_by()</code> %&gt;% <code>summarise()</code> is equivalent to a <em>pivot table</em>, and <code>mutate()</code> adds a new column the same way you might by dragging an Excel formula down a column.</p>
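<p>To make the pivot-table comparison concrete, here is a minimal sketch of the same three steps written out by hand. It uses plain Python rather than R, and the flight records and column names are invented for illustration, not taken from the talk.</p>

```python
from collections import defaultdict

# Hypothetical flight records, invented for illustration only.
flights = [
    {"route": "JFK-BOS", "delay_min": 12},
    {"route": "JFK-BOS", "delay_min": 30},
    {"route": "JFK-LAX", "delay_min": 5},
]

# mutate(): add a new column computed from an existing one,
# like dragging a formula down a spreadsheet column.
for row in flights:
    row["delay_hr"] = row["delay_min"] / 60

# group_by(route): collect the values that share a key.
groups = defaultdict(list)
for row in flights:
    groups[row["route"]].append(row["delay_min"])

# summarise(mean_delay = mean(delay_min)): one result per group,
# which is what a pivot table shows.
mean_delay = {route: sum(d) / len(d) for route, d in groups.items()}
print(mean_delay)
```

<p>Each loop here corresponds to one verb in the tidyverse pipeline, which is part of why the verbs are easy to explain to spreadsheet users.</p>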
<p>There were also several good comments on how you can learn slowly by doing something small in R (such as a bit of cleaning) and then porting it back out to Excel or Tableau to finish your analysis. Best takeaway? <strong>Not everyone needs to be fluent in R. - <a href="https://twitter.com/catherinezh">Catherine</a></strong></p>
<blockquote class="twitter-tweet blockquote" data-conversation="none" data-lang="en">
<p lang="en" dir="ltr">
</p><p>.<a href="https://twitter.com/catherinezh"><span class="citation" data-cites="catherinezh">@catherinezh</span></a> Easy sells: Automation and reproducibility, but not everyone needs to be fluent. <a href="https://t.co/EancCU2eWb">pic.twitter.com/EancCU2eWb</a></p>
<p></p>
<p>— Erin Grand (<span class="citation" data-cites="astroeringrand">@astroeringrand</span>) <a href="https://twitter.com/astroeringrand/status/855438176269344769">April 21, 2017</a></p>
</blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
</section>
<section id="theoretical-statistics-is-the-theory-of-applied-statistics-how-to-think-about-what-we-do" class="level3">
<h3 class="anchored" data-anchor-id="theoretical-statistics-is-the-theory-of-applied-statistics-how-to-think-about-what-we-do">Theoretical Statistics is the Theory of Applied Statistics: How to Think About What We Do</h3>
<p>Andrew Gelman, Columbia</p>
<p>Andrew Gelman made the point that p-value statistical testing is actually a pretty bad framework for hypothesis testing most of the time. Time to really brush up on Bayesian stats. I also liked the general point that you should have to state what you expect to find (alternative hypotheses, estimated effect sizes, whatever) before you go barging around looking for anything and then act like whatever you end up with is what you were searching for all along.</p>
</section>
<section id="the-unreasonable-effectiveness-of-empathy---the-killer-skill-needed-for-a-successful-technical-career" class="level3">
<h3 class="anchored" data-anchor-id="the-unreasonable-effectiveness-of-empathy---the-killer-skill-needed-for-a-successful-technical-career">The Unreasonable Effectiveness of Empathy - The killer skill needed for a successful technical career</h3>
<p>JD Long, RenaissanceRe</p>
<ul>
<li>Analysis doesn’t end at result delivery - it ends at developing and proselytizing new business strategies and innovation.</li>
<li>Automating Excel workflows can be a first step towards bringing R to a team.</li>
<li>Agile development tells user stories as an empathy hack.</li>
</ul>
<pre><code>As a ______
I want ______
So I can ______ </code></pre>
<ul>
<li>Person-level stories (the near) are always more meaningful than data stories (the far). We need to balance both as Data Scientists.</li>
<li>In development, we need to have an actual user in mind, rather than a theoretical user who wants everything.</li>
</ul>
<blockquote class="twitter-tweet blockquote" data-lang="en">
<p lang="en" dir="ltr">
</p><p>“As you tell your data stories, think about the individual people in your data and your consumers” - <a href="https://twitter.com/CMastication"><span class="citation" data-cites="CMastication">@CMastication</span></a> <a href="https://twitter.com/hashtag/rstatsnyc?src=hash">#rstatsnyc</a> <a href="https://t.co/f9DZGCprDO">pic.twitter.com/f9DZGCprDO</a></p>
<p></p>
<p>— Emily Robinson (<span class="citation" data-cites="robinson_es">@robinson_es</span>) <a href="https://twitter.com/robinson_es/status/855500182586368002">April 21, 2017</a></p>
</blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
</section>
</section>
<section id="day-2" class="level1">
<h1>Day 2</h1>
<section id="using-human-mobility-data-to-assess-public-circulation-health" class="level3">
<h3 class="anchored" data-anchor-id="using-human-mobility-data-to-assess-public-circulation-health">Using Human Mobility Data to Assess Public Circulation Health</h3>
<p>Michael Kane</p>
<ul>
<li>talking about cell phone tower data -&gt; human circulatory data</li>
<li>anytime you are connecting or handing off to a cell phone tower, the data is logged</li>
<li>security: IDs are all hashed versions of the cell phones</li>
<li>35 TB data set! big! 10 GB every day</li>
<li>touching ALL of the data when doing analysis</li>
<li>use R, with Hadoop</li>
<li>using HMR in cluster (Hadoop)</li>
<li>looks at inflow and outflow counts (flux)</li>
<li>are providing reliable information on human mobility during storms</li>
<li>want the learner that is the most REGULAR, not the one closest to accuracy</li>
</ul>
</section>
<section id="from-agreeing-to-marching-to-organizing-oss-needs-you" class="level3">
<h3 class="anchored" data-anchor-id="from-agreeing-to-marching-to-organizing-oss-needs-you">From Agreeing to Marching to Organizing: OSS Needs You</h3>
<p>Mike Malecki and Neal Richardson</p>
<ul>
<li>Best ways to contribute to open source are to start with improving documentation</li>
<li>Open source contributions: failing test with fix &gt; failing test &gt; bug report</li>
<li>Remember to include sessionInfo()</li>
<li>When releasing a package, release quickly, but also slowly - take time to fix dumb decisions
<ul>
<li>Bring something new to the community, but don’t reinvent the wheel</li>
<li>Tell people about your package (social media), then listen to how they’re using it</li>
<li>When thinking about a package: documentation &gt; usability &gt; performance &gt; features</li>
</ul></li>
</ul>
</section>
</section>
<section id="other-learnings" class="level1">
<h1>Other Learnings</h1>
<p>I spent most of the conference chatting with and meeting other members of the R-Ladies New York chapter.</p>
<p><img src="https://pbs.twimg.com/media/C-CEwTYXoAAqppX.jpg:large" class="img-fluid"></p>
</section>
<section id="packages-to-try-out" class="level1">
<h1>Packages to Try Out</h1>
<ul>
<li>RXKCD: add XKCD cartoons to stuff</li>
<li>trelliscope: many-panel data vis</li>
<li>compareGroups: compare demographics and other aspects across groups</li>
<li>goodpractice: does a variety of checking for good package development practice</li>
<li>lintr: helps check for good code style</li>
</ul>


</section>

 ]]></description>
  <category>conference</category>
  <category>rladies</category>
  <guid>https://eringrand.github.io/posts/new_york_rstats_conference/</guid>
  <pubDate>Sun, 30 Apr 2017 00:00:00 GMT</pubDate>
  <media:content url="https://pbs.twimg.com/media/C-CEwTYXoAAqppX.jpg:large" medium="image"/>
</item>
<item>
  <title>Graphics in Science</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/graphics_and_plots_in_science/</link>
  <description><![CDATA[ 





<p>Graphics and visualizations are often used in promotion and advertising to sell a product or idea. In science, graphics tend to fall into one of two categories: those for education and those for science journals. For information on what makes a good educational graphic, or teaching tool, see an earlier piece on this blog <a href="http://eringrand.github.io/educationgraphics/">here</a>. In academic articles, graphics hold a special role in telling a compelling story of the data and results; however, the editing emphasis is often placed much more on the text than on making interesting and understandable science graphics.</p>
<p><em>DISCLAIMER: I am coming from an astronomy and physics background, and am going to discuss problems found within these contexts.</em></p>
<section id="academic-article-graphics" class="level3">
<h3 class="anchored" data-anchor-id="academic-article-graphics">Academic Article Graphics</h3>
<p>We think of academics, and especially science, as being told through plots and graphs. In fact, Tufte explains that the history of graphics begins with time series plots of the planets and the sun in the night sky. Nowadays, science articles use graphics to tell a story. We let the data speak for itself by representing it in a reproducible graphic.</p>
<p>In my time in academia (in physics and astronomy) I’ve come across several common problems such as:</p>
<ul>
<li>Missing or incorrect error bars (especially on log-log plots)</li>
<li>Missing or incorrect tick marks and axis labels</li>
<li>Too much text - Keep notes and explanations outside the graphic and in the image caption</li>
<li>Overlap of lines or points</li>
<li>Wasting space or not using enough of it</li>
<li>Plots that should have been tables</li>
</ul>
<p>Some of these problems come from trying to show off too much of the data. You want the data to stand out, but you don’t always need to include all of it. This is hard because we spend so much time working with the data that we want to share everything, but the added complexity often takes away from the graph and the point you’re trying to make.</p>
<p>In the remainder of the blog, I will try to address each of these points and introduce a fast and easy way to correct them.</p>
<section id="corrections-to-common-problem-with-academic-graphics-log-log-plots-with-missing-or-symmetric-error-bars-can-be-fixed-by-forcing-asymmetric-error-bars.-when-there-are-small-errors-the-log-can-show-as-a-negative-error-which-often-means-that-plot-wont-do-anything.-in-matplotlib-the-default-for-the-y-axis-is-to-map-all-negative-values-a-very-small-positive-one.-the-code-for-that-is" class="level4">
<h4 class="anchored" data-anchor-id="corrections-to-common-problem-with-academic-graphics-log-log-plots-with-missing-or-symmetric-error-bars-can-be-fixed-by-forcing-asymmetric-error-bars.-when-there-are-small-errors-the-log-can-show-as-a-negative-error-which-often-means-that-plot-wont-do-anything.-in-matplotlib-the-default-for-the-y-axis-is-to-map-all-negative-values-a-very-small-positive-one.-the-code-for-that-is">Corrections to common problems with academic graphics</h4>
<ul>
<li>Log-log plots with missing or symmetric error bars can be fixed by forcing asymmetric error bars. When the lower end of an error bar (value minus error) falls at or below zero, its log is undefined, which often means that the plot won’t draw it at all. In Matplotlib the default (for the y axis) is to map all non-positive values to a very small positive one. The code for that is:</li>
</ul>
<pre><code>plt.yscale('log', nonpositive='clip')  # spelled nonposy='clip' before Matplotlib 3.3</code></pre>
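<p>Clipping only hides the problem, so here is a minimal sketch of the “forcing asymmetric error bars” idea itself. The helper name, the sample numbers, and the <code>floor_fraction</code> knob are all assumptions made for illustration, and the function assumes every value is positive. It shortens only the downward error so the bar never reaches zero.</p>

```python
def asymmetric_log_errors(values, errors, floor_fraction=0.01):
    """Return (lower, upper) error lists that are safe to draw on a log axis.

    The lower error is shortened whenever value - error would be zero or
    negative, so each bar ends at a small positive number instead.
    floor_fraction is a hypothetical tuning knob: a shortened bar bottoms
    out at floor_fraction * value. Values are assumed to be positive.
    """
    lower, upper = [], []
    for value, error in zip(values, errors):
        # Largest downward error that still leaves a positive endpoint.
        max_lower = value * (1.0 - floor_fraction)
        lower.append(min(error, max_lower))
        upper.append(error)  # the upward error is left untouched
    return lower, upper


# Made-up measurements; the last two would reach zero or below.
flux = [10.0, 1.0, 0.5]
flux_err = [2.0, 1.5, 0.5]
lower, upper = asymmetric_log_errors(flux, flux_err)

# With Matplotlib, the pair can be passed as a two-row yerr:
#   plt.errorbar(x, flux, yerr=[lower, upper], fmt="o")
#   plt.yscale("log")
```

<p>Because the upper errors are unchanged, the bars still show the full upward uncertainty; only the part that cannot exist on a log axis is trimmed.</p>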
<ul>
<li>Tick marks: The defaults for tick marks and labels are often too large, too small, facing the wrong direction, or otherwise strange. In ggplot2 in R, these can be adjusted with <code>theme()</code>.</li>
</ul>
<p>For example, to make everything except your points or lines disappear you’d use:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.line=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb2-2">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.text.x=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb2-3">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.text.y=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb2-4">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.ticks=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb2-5">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.title.x=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb2-6">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.title.y=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb2-7">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.position=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"none"</span>,</span>
<span id="cb2-8">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">panel.background=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb2-9">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">panel.border=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb2-10">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">panel.grid.major=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb2-11">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">panel.grid.minor=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb2-12">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot.background=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) </span></code></pre></div></div>
<p>Each of these could also be modified to make the text larger or smaller, change the font, rotate the labels, etc.</p>
<p>To change the direction and size of the tick labels you’d use something like:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.text.x  =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">angle=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">90</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">vjust=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span>))</span></code></pre></div></div>
<ul>
<li><p>You can reduce clutter on the graph by using fewer (labeled) tick marks.</p></li>
<li><p>Always remember to label your axes! This is done in python with:</p></li>
</ul>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb4-2">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'X Label'</span>)</span>
<span id="cb4-3">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Y Label'</span>)</span></code></pre></div></div>
<p>or in ggplot2 in R using:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Plot Title"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb5-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">xlab</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"X Label"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb5-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ylab</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Y Label"</span>)</span></code></pre></div></div>
<ul>
<li>The problem of too many overlapping lines or points can be solved in various ways depending on the data. Sometimes, changing the colors and alpha of the points might be enough. In other cases, it’s best to separate out the information into a table of plots.</li>
</ul>
<p>For example, below is the original plot from my research showing the intensity of different molecules across velocities. The plot places all four molecules on the same graph with a key indicating which is which. In color, this graphic might make more sense, but it is still hard to make out the individual curves. Plus, the key is small, and referring back to it is time-consuming and annoying.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://eringrand.github.io/posts/graphics_and_plots_in_science/spectrapel.png" class="img-fluid figure-img"></p>
<figcaption>Velocity spectra for the Pelican Pillar</figcaption>
</figure>
</div>
<p>In fixing the graphic, while also including more information from my other sources, I separated out each of the molecules and sources into a table of spectra. This un-clutters the plot and allows you to more easily visualize trends in the sources. (Notice how the plot is missing axis labels - shame on me!)</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://eringrand.github.io/posts/graphics_and_plots_in_science/spectra2.png" class="img-fluid figure-img"></p>
<figcaption>Velocity spectra for the pillars show brightness temperature against velocity in km/s. The spectra were taken in the heads of the pillars at the peak brightness and averaged over a beam size.</figcaption>
</figure>
</div>
<p>The code for this plot was written in IDL - a language used mostly by astronomers (after looking at the code, you’ll see why no one else joined in the fun…). If you’re interested, you can check it out <a href="https://github.com/eringrand/idlcodes/blob/master/plotspectra.pro">here</a>.</p>
<p>In the future, I want to try remaking some of my research plots in R, for more practice with R and ggplot2, using something along these lines:</p>
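<pre><code>library(dplyr)    # provides the %&gt;% pipe
library(ggplot2)

# One panel per pillar: brightness temperature (tb) against velocity (vel)
data %&gt;%
  ggplot(aes(x = vel, y = tb)) +
  geom_line() +
  facet_wrap(~pillar)</code></pre>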
<ul>
<li>To avoid wasting space, examine the size and scale of the axes. Wasted space often shows up when an outlier or two expands the axes so far that much of the plot is empty. In these cases, you can crop the plot to the main data and include an arrow to show where the outlier is.</li>
</ul>
<p>Most importantly, don’t display empty plots like <a href="https://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/wittke_thompson_fig1CD.jpg">this infamous plot</a> from Wittke-Thompson JK, Pluzhnikov A, and Cox NJ (2005).</p>
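<p>The cropping idea above can be sketched in a few lines of Python. This is only an illustration with made-up numbers: the function name <code>cropped_limits</code> and the 1.5 × interquartile-range cutoff are my assumptions (a common convention for flagging outliers), not a rule from any particular plotting library.</p>

```python
import statistics


def cropped_limits(values, whisker=1.5):
    """Return axis limits for the main data, plus the outliers left outside them."""
    # Quartiles of the data; the middle value (the median) is not needed here.
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - whisker * spread, q3 + whisker * spread
    inside = [v for v in values if low <= v <= high]
    outliers = [v for v in values if not low <= v <= high]
    # Crop to the data that actually falls inside the fences.
    return (min(inside), max(inside)), outliers


# Made-up measurements with one extreme point.
data = [2.1, 2.4, 2.2, 2.8, 3.0, 2.6, 2.5, 2.9, 2.3, 40.0]
limits, outliers = cropped_limits(data)
print("axis limits:", limits)
print("mark with an arrow:", outliers)
```

<p>The returned limits could then be passed to whatever axis-limit setting your plotting tool offers, with the listed outliers annotated by hand.</p>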
<ul>
<li>Sometimes, to save room or otherwise, it’s best to display the information in a table format instead of bar graphs. For example, this plot from a science article titled <a href="http://arxiv.org/abs/1403.3091">“Studying Gender in Conference Talks – data from the 223rd meeting of the American Astronomical Society”</a> shows the large difference between the number of questions asked by males and the number asked by females, given a male or female chair. This plot displays the most significant finding from the analysis: a strong dependence on session chair gender. Still, this information could have easily been shown in a table instead of a graph. This would be a useful plot for a presentation on the subject, but it is not needed in the article.</li>
</ul>
<p><img src="https://eringrand.github.io/posts/graphics_and_plots_in_science/chairs_questions.png" class="img-fluid"></p>
<hr>
</section>
</section>
<section id="color-in-academic-graphics" class="level3">
<h3 class="anchored" data-anchor-id="color-in-academic-graphics">Color in academic graphics:</h3>
<p>Color can be a huge issue in scientific articles. This is largely because most journals charge more for printing in color, but will present colored versions of plots in the online versions of the articles. This means that authors need to make sure that their plots work well both in color and in black and white, which leads to some graphics that are very hard to read.</p>
<section id="common-color-problems." class="level4">
<h4 class="anchored" data-anchor-id="common-color-problems.">Common color problems.</h4>
<ul>
<li>Eye-piercing bright colors and/or use of rainbow colors. We’ve discussed the problems with the rainbow in class, but as a reminder: the rainbow color scheme includes colors which are hard to see, doesn’t have a universally understood order, artificially exaggerates the differences between some colors while softening the differences between others, and (importantly for print) doesn’t convert well to black and white.</li>
</ul>
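<p>The black-and-white point can be checked numerically. The sketch below is my own illustration, not something from the class: it converts a six-step rainbow and a single-hue blue ramp to an approximate gray level (using the common Rec. 601 weights, which is an assumption about how the conversion is done) and checks whether the gray values stay in order.</p>

```python
import colorsys


def luma(rgb):
    """Approximate grayscale brightness (Rec. 601 weights) of an RGB triple in 0..1."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def is_monotonic(xs):
    """True if the sequence never changes direction."""
    pairs = list(zip(xs, xs[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


steps = 6
# Rainbow: sweep the hue from red to magenta at full saturation and value.
rainbow = [colorsys.hsv_to_rgb(i / steps, 1.0, 1.0) for i in range(steps)]
# Single-hue ramp: keep a blue hue fixed and change only the brightness.
blues = [colorsys.hsv_to_rgb(0.6, 1.0, (i + 1) / steps) for i in range(steps)]

rainbow_gray = [round(luma(c), 3) for c in rainbow]
blues_gray = [round(luma(c), 3) for c in blues]
print("rainbow:", rainbow_gray, "ordered:", is_monotonic(rainbow_gray))
print("blues:  ", blues_gray, "ordered:", is_monotonic(blues_gray))
```

<p>The rainbow’s gray levels jump up and down as the hue changes, so the ordering of the data is lost in print, while the single-hue ramp keeps its order.</p>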
<p>Contrast is one of THE biggest problems I see in academic figures. Things like cyan or yellow on white, red on blue, navy on black… these cause major problems (and headaches) when reading text or trying to discern between lines. Your plot doesn’t have to be pretty, but it does have to be legible!</p>
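<p>One way to put a number on legibility is the contrast ratio defined in the WCAG web accessibility guidelines. Borrowing it for figures is my assumption, since it was designed for text on screens, but it gives a quick sanity check for color pairs like the ones above. The RGB values for each named color are the usual web definitions.</p>

```python
def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB color given as 0-255 integers."""
    def linearize(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(ch) for ch in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """Contrast ratio between two colors, from 1 (identical) to 21 (black on white)."""
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
pairs = {
    "yellow on white": ((255, 255, 0), WHITE),
    "cyan on white": ((0, 255, 255), WHITE),
    "red on blue": ((255, 0, 0), (0, 0, 255)),
    "navy on black": ((0, 0, 128), BLACK),
    "black on white": (BLACK, WHITE),
}
for name, (fg, bg) in pairs.items():
    print(f"{name}: {contrast_ratio(fg, bg):.2f}")
```

<p>For comparison, WCAG asks for a ratio of at least 4.5 for normal body text; the problem pairs listed above all fall well below that.</p>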
<p>Color in astronomy maps often tags along with Color-coded image of the molecular cloud</p>
<hr>
</section>
</section>
<section id="graph-critique-and-fix" class="level3">
<h3 class="anchored" data-anchor-id="graph-critique-and-fix">Graph Critique and Fix</h3>
<p><img src="https://eringrand.github.io/posts/graphics_and_plots_in_science/bad.png" class="img-fluid"></p>
<p>The image, from an article titled “MOLECULAR CLOUDS IN THE NORTH AMERICAN AND PELICAN NEBULAE: STRUCTURES” by Shaobo Zhang, Ye Xu and Ji Yang, displays the locations of clumps, as well as their velocity and size. From the image caption: “The circles indicate the clump positions on the integrated intensity map of <sup>13</sup>CO. The colors of the circles represent the velocities of the clumps, while the circles are scaled according to the sizes of clumps.”</p>
<p>This is a perfect example of trying to show so much in one plot that it’s no longer understandable. A different color scheme would help the eye more easily see the trends in velocity. I would also like to see the circles filled in, and the background map a bit darker. Also, the graph extends too far up: this keeps the color legend clear, but leaves too much empty space in the graph. The axes and tick marks could also be smaller.</p>
<p>I didn’t have their data, but I remade a similar type of plot pulling points, velocities and sizes from normal distributions (see code below).</p>
<p><img src="https://eringrand.github.io/posts/graphics_and_plots_in_science/Rplot.png" class="img-fluid"></p>
<p>This image fixes some of the problems by using ggplot2’s default color scheme, which keeps the hue in blue and changes the brightness. I’ve filled in the circles to make the difference in sizes clearer, and I made sure that the circles are scaled by area, so as not to conflate radius and area.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2) </span>
<span id="cb7-2"></span>
<span id="cb7-3">xvar <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace=</span>T)</span>
<span id="cb7-4">yvar <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace=</span>T)</span>
<span id="cb7-5">v <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span>  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sort</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb7-6">s <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">abs</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>))</span>
<span id="cb7-7"></span>
<span id="cb7-8">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(xvar,yvar,v,s)</span>
<span id="cb7-9">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> data[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">order</span>(data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>v),]</span>
<span id="cb7-10">data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb7-11">data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb7-12"></span>
<span id="cb7-13"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(data,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span>x,<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y=</span>y)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stat_density2d</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha=</span>after_stat(level)), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">geom=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"polygon"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">show.legend=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_alpha_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">limits=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb7-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x=</span>xvar, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y=</span>yvar, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size=</span>s, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color=</span>v), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">show.legend=</span><span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb7-17">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_colour_gradient</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">limits=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-18">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_size_area</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">max_size=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)   <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb7-19">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_bw</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb7-20">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.title=</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) </span></code></pre></div></div>
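<p>To see why scaling by area matters, here is a small Python sketch of the arithmetic. It is my own illustration with made-up sizes, and the helper name <code>radius_for</code> is hypothetical: the radius has to grow with the square root of the value so that the circle’s area, which is what the eye compares, grows in proportion to the value.</p>

```python
import math


def radius_for(value, max_value, max_radius=10.0):
    """Radius that makes the circle's area, not its radius, proportional to value."""
    return max_radius * math.sqrt(value / max_value)


sizes = [1, 2, 4]  # made-up clump sizes
for s in sizes:
    r = radius_for(s, max(sizes))
    print(f"size {s}: radius {r:.2f}, area {math.pi * r ** 2:.1f}")
```

<p>A clump four times larger gets a circle with twice the radius and four times the area. Scaling the radius directly would instead make it look sixteen times larger.</p>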
<hr>
</section>
<section id="conclusions" class="level3">
<h3 class="anchored" data-anchor-id="conclusions">Conclusions:</h3>
<ul>
<li>Always remember to think about the story you’re telling and how your graphic fits in.</li>
<li>Label your plots correctly, but don’t clog the plot with text. Keep your labels short, and rotate them if needed so they can be read.</li>
<li>If displaying all of your data looks cluttered, think about whether you really need to show all of it, and if so, whether there’s a better way to display it.</li>
<li>Watch out for color! We like pretty graphs, but only if we can still read them.</li>
</ul>


</section>

 ]]></description>
  <category>visualization</category>
  <category>data science</category>
  <category>python</category>
  <guid>https://eringrand.github.io/posts/graphics_and_plots_in_science/</guid>
  <pubDate>Tue, 24 Mar 2015 00:00:00 GMT</pubDate>
  <media:content url="https://eringrand.github.io/posts/graphics_and_plots_in_science/spectrapel.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>Educational Graphics in Science</title>
  <dc:creator>Erin Grand</dc:creator>
  <link>https://eringrand.github.io/posts/graphics_edu_in_science/</link>
  <description><![CDATA[ 





<section id="educational-graphics" class="level2">
<h2 class="anchored" data-anchor-id="educational-graphics">Educational graphics:</h2>
<p>Educational graphics are often used as a way to teach one concept. As such, they tend to generalize the information in such a way that leaves out important information.</p>
<p>Take, for example, the standard evolution depiction:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://eringrand.github.io/posts/graphics_edu_in_science/Human-Evolution-460x233.jpg" class="img-fluid figure-img"></p>
<figcaption>Evolution image</figcaption>
</figure>
</div>
<p>We use this image as a teaching tool because it’s easy to visualize evolution this way. However, the graphic isn’t accurate in its depiction of evolution: because it is so simple, it conveys no sense of time or species extinction. A better graphic would be able to easily show all of this information.</p>
<p>The image below does a little better by including time information as well as showing the full evolution tree. This graphic makes clear the complex nature of evolution, but still leaves out many extinction events. (Click <a href="http://www.lucasbrouwers.nl/blog/wp-content/uploads/2009/12/Evo_large.gif"> here </a> to see a larger version of this image.) <img src="http://www.lucasbrouwers.nl/blog/wp-content/uploads/2009/12/Evo_large.gif" alt="Example evolution graphic" width="650px"></p>
<hr>
<p>For an example of a commonly used, well-designed graphic, we have this image showing the structure of the terrestrial planets.</p>
<p><img src="https://eringrand.github.io/posts/graphics_edu_in_science/terrestrial_interiors.jpg" class="img-fluid"></p>
<p>This graphic shows the relative size difference between the planets as well as their inside makeup. It even manages to show where we have incomplete information. There are lots of versions of this graphic, but I think this one works the best because it uses labels and lines sparingly in place of a color-coded legend.</p>


</section>

 ]]></description>
  <guid>https://eringrand.github.io/posts/graphics_edu_in_science/</guid>
  <pubDate>Tue, 24 Mar 2015 00:00:00 GMT</pubDate>
  <media:content url="https://eringrand.github.io/posts/graphics_edu_in_science/Human-Evolution-460x233.jpg" medium="image" type="image/jpeg"/>
</item>
</channel>
</rss>
