{"id":1860,"date":"2023-10-23T10:04:53","date_gmt":"2023-10-23T14:04:53","guid":{"rendered":"https:\/\/www.unliterate.net\/?p=1860"},"modified":"2023-10-23T13:28:11","modified_gmt":"2023-10-23T17:28:11","slug":"today-i-learned-regex-loop","status":"publish","type":"post","link":"https:\/\/www.unliterate.net\/index.php\/2023\/10\/23\/today-i-learned-regex-loop\/","title":{"rendered":"Today I learned: regex > loop"},"content":{"rendered":"\n<p>While writing &#8220;quad-quad&#8221;, a set of four 4-letter speakable words that can be used as a user-friendly &#8220;bookmark&#8221; for easily finding a record, I was writing a &#8220;quick&#8221; program to extract the contents of wikidatawiki-20220820-pages-articles-multistream.xml (<a href=\"https:\/\/dumps.wikimedia.your.org\/wikidatawiki\/\" target=\"_blank\" rel=\"noreferrer noopener\">a Wikidata dump<\/a>) and ran into a large delay in the following loop:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$alphas = 'qwertyuiopasdfghjklzxcvbnm ';\n$newline = '';\nfor ($x = 0; $x &lt; strlen($line); $x++) {\n    $c = substr($line, $x, 1);\n    if (strpos($alphas, $c) !== false) {\n        $newline = $newline . $c;\n    } else {\n        $newline = $newline . ' ';\n    }\n}\n<\/code><\/pre>\n\n\n\n<p>The loop&#8217;s main purpose is to sanitize the input by replacing every character that is not a lowercase letter or a space with a space for later processing. The end result would be words that I could filter down to 4-character words and tally up.<\/p>\n\n\n\n<p>When the program read a line around 1 MB in length, it would &#8220;hang&#8221; for a bit as it chewed through the data. In a nutshell, 25,100,655 bytes of data took <em>24m36s<\/em>. 
It was time to optimize.<\/p>\n\n\n\n<p>Replacing the loop with the following regex improved performance immensely.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$newline = preg_replace('\/&#91;^a-z]\/', ' ', $line);<\/code><\/pre>\n\n\n\n<p>The same amount of data took <em>1.892s<\/em>.<\/p>\n\n\n\n<p>Lesson: If you don&#8217;t know regexes, learn regexes.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While writing &#8220;quad-quad&#8221;, a set of four 4-letter speakable words that can be used as a user-friendly &#8220;bookmark&#8221; for easily finding a record, I was writing a &#8220;quick&#8221; program to extract the contents of wikidatawiki-20220820-pages-articles-multistream.xml (a Wikidata dump) and ran into a large delay in the following loop: The loop&#8217;s main purpose is [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20,23],"tags":[],"class_list":["post-1860","post","type-post","status-publish","format-standard","hentry","category-geek-instructions","category-wikipedia"],"_links":{"self":[{"href":"https:\/\/www.unliterate.net\/index.php\/wp-json\/wp\/v2\/posts\/1860","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.unliterate.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.unliterate.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.unliterate.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.unliterate.net\/index.php\/wp-json\/wp\/v2\/comments?post=1860"}],"version-history":[{"count":4,"href":"https:\/\/www.unliterate.net\/index.php\/wp-json\/wp\/v2\/posts\/1860\/revisions"}],"predecessor-version":[{"id":1865,"href":"https:\/\/www.unliterate.net\/index.php\/wp-json\/wp\/v2\/posts\/1860\/revisions\/1865"}],"wp:attachment":[{"href":"https
:\/\/www.unliterate.net\/index.php\/wp-json\/wp\/v2\/media?parent=1860"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.unliterate.net\/index.php\/wp-json\/wp\/v2\/categories?post=1860"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.unliterate.net\/index.php\/wp-json\/wp\/v2\/tags?post=1860"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}