Regex

A DVS Workshop
http://library.duke.edu/data/news

John Little

2017-01-31

Brief "History"

  • 1950s: American mathematician Stephen Cole Kleene formalized regular languages, the basis of regular expressions

  • Came into common use with Unix text-processing utilities

  • Several syntaxes exist (e.g. POSIX, Perl)

Implementations

  • Search engines, word processors, text editors

  • AWK, grep (UNIX command line)

  • Textpad, Notepad++

  • Google Sheets

  • MS Word

  • OpenRefine

  • Programming languages: often built in, sometimes via libraries

Patterns

  • Syntax for representing a pattern

  • Each character in a regex is either a metacharacter (special meaning)

  • Or, a regular character (literal meaning)

Duke.

  • The wildcard '.' matches any single character except a newline

    • Matches: "Duke ", "Dukes", "Duke1", "Duke0", ...
    • Note the space following "Duke" in the first example
    • Does not match "duke" or "dukes" (matching is case-sensitive)
  • You can "escape" a metacharacter to enforce a literal match

\.
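
The wildcard and its escaped form can be compared side by side. The following is a minimal sketch using Python's built-in re module; the sample strings and the variable names (samples, wildcard_hits, literal_hits) are made up for illustration and are not part of the workshop materials.

```python
import re

# Hypothetical sample strings for illustration
samples = ["Duke ", "Dukes", "Duke1", "duke1", "Duke"]

# "Duke." is the literal text "Duke" followed by any single character
# except a newline. "Duke" alone fails because nothing follows it, and
# "duke1" fails because matching is case-sensitive by default.
wildcard = re.compile(r"Duke.")
wildcard_hits = [s for s in samples if wildcard.search(s)]

# "Duke\." escapes the dot, so only a literal period is accepted.
literal_dot = re.compile(r"Duke\.")
literal_hits = [s for s in ["Duke.", "Dukes"] if literal_dot.search(s)]

print(wildcard_hits)  # ['Duke ', 'Dukes', 'Duke1']
print(literal_hits)   # ['Duke.']
```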

Matches

  • Patterns range from precise to general

  • General: [a-z] matches any one lowercase letter from 'a' to 'z'

  • More general: . (any single character)

  • More precise: j (only the letter 'j')

  • A common use of regex is to locate the same word spelled two different ways

    • match "color" and "colour"

      • /colou?r/
    • match "color" and "Color"

      • /[Cc]olor/
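
The patterns on this slide can be tried out directly. The sketch below assumes Python's re module; the word list and the test string "ja7" are invented for illustration. The surrounding slashes in /colou?r/ are delimiters used by some tools and are not part of the pattern in Python.

```python
import re

# Hypothetical word list for illustration
words = ["color", "colour", "Color", "colr", "colouur"]

# "u?" makes the "u" optional: zero or one occurrence
either_spelling = [w for w in words if re.fullmatch(r"colou?r", w)]

# "[Cc]" matches exactly one character, either "C" or "c"
either_case = [w for w in words if re.fullmatch(r"[Cc]olor", w)]

# From general to precise on the same input
most_general = re.findall(r".", "ja7")      # any single character
general = re.findall(r"[a-z]", "ja7")       # lowercase letters only
precise = re.findall(r"j", "ja7")           # only the letter "j"

print(either_spelling)  # ['color', 'colour']
print(either_case)      # ['color', 'Color']
print(most_general, general, precise)
```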
Cheatsheets

Find and Replace

  • Most people have used regex without knowing it

  • It exists in word processors as "find & replace"

  • Sometimes it's used to find: match a pattern

  • Sometimes it's used to replace: substitute
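
The two uses, find and replace, map onto two separate functions in many languages. As a rough sketch in Python (the sentence in text is made up for illustration), re.search finds a match and re.sub substitutes one:

```python
import re

# Hypothetical input text for illustration
text = "The colour of the Color wheel"

# Find: locate the first occurrence of the pattern, if any
found = re.search(r"[Cc]olou?r", text)
first_match = found.group(0) if found else None

# Replace: substitute every match with a single spelling
normalized = re.sub(r"[Cc]olou?r", "color", text)

print(first_match)  # colour
print(normalized)   # The color of the color wheel
```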

Uses

  • Patterns can get very sophisticated, matching for complex substitutions

    • For example, capture (find):
      • #hashtags (words that begin with "#")
      • all email addresses
      • variant spellings
      • variant capitalizations
      • variant punctuation
  • But ...

Regex is not Machine Learning

  • You specify the pattern

  • This is sometimes challenging
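
To make "you specify the pattern" concrete, here is a small sketch that captures hashtags and email addresses with Python's re module. The post text and addresses are invented, and the email pattern is a deliberately simplified assumption: it handles common address shapes and is not a full validator, which illustrates why specifying the pattern is sometimes challenging.

```python
import re

# Hypothetical post text; the addresses are made up for illustration
post = "Join #regex at #DVS! Questions: help@example.org or data.team@example.com."

# A hashtag here is "#" followed by one or more word characters
hashtags = re.findall(r"#\w+", post)

# Simplified email pattern (an assumption, not a complete specification):
# local part, "@", then a domain made of dot-separated labels
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", post)

print(hashtags)  # ['#regex', '#DVS']
print(emails)    # ['help@example.org', 'data.team@example.com']
```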

XKCD Next

Attribution

This slide deck and the handouts and exercises were strongly influenced by

Data, presentation, and handouts are shareable under CC BY-NC-SA license

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
