Regexing fun

Hi all, yeah it’s been a long time. TL;DR: I got a job 1.5 years ago and haven’t really posted since because I’ve been “really busy”, etcetera.

I figured I’d post about something I did today that I thought was interesting/fun. This post probably won’t make sense unless you have a basic understanding of regex.

Problem description

Let’s say you have a bunch of old URLs for hosted resources that follow an old format and point to an old S3 bucket, or otherwise use a different URL layout.

In the case of S3, there are two ways of representing URLs: http://s3.amazonaws.com/[bucket_name]-accountnumber/ or http://[bucket_name].s3.amazonaws.com/.

Your task is to convert these to a consistent format, so that whichever style a URL uses, you get something like s3://bucketname/dir/dir2/resource.ext that you can convert to the proper URL for locating the resource.

It sounds convoluted, I know, because it sort of is. This is desirable because if you can convert all these URLs to one format, you can serve something that’s easier to look at/parse and is more “consistent”.

In my case I have two URL formats similar to the above, https://coolhost.s3.us-west-2.amazonaws.com/bucket1/dir1/dir3/file.jpeg and https://s3-us-west-2.amazonaws.com/bucket2/dir/filename.png, that I wanted to pass into a function to get a consistent resource location scheme. For the first case it would be s3://coolhost/bucket1/dir1/dir3/file.jpeg and for the second case s3://bucket2/dir/filename.png. Essentially the bucket name has to come before all the key information.

Regexing

The way I decided to do this was to use capture groups and ReplaceAllString in Go’s regexp standard library.

Here’s the code, which includes the regex:

import "regexp"

// S3URLConverter holds the compiled regex so it is only built once
type S3URLConverter struct {
    re *regexp.Regexp
}

// Convert converts the s3 url into a digestible s3 path
func (s *S3URLConverter) Convert(url string) string {
    return s.re.ReplaceAllString(url, "s3://$2$3$4/$7")
}

// NewS3URLConverter creates a new s3 url converter
func NewS3URLConverter() S3URLConverter {
    return S3URLConverter{
        re: regexp.MustCompile(`https:\/\/((coolhost)\.s3\.us-west-2\.amazonaws\.com(\/)|s3-us-west-2\.amazonaws\.com\/)(bucket1|bucket2)(-([0-9]+)|)\/([\w\/\-?=%.]+)`),
    }
}

Working backwards: if you try the regex against both URLs, you’ll notice that both get matched, and groups 2, 3, 4 and 7 can be appended to form a consistent URL (hence Convert).

The idea is to just generate groups that can be appended from left to right, no matter what the URL is, so that they form a consistent URL. Groups are everything enclosed in (), numbered by their opening parenthesis from left to right. A group that doesn’t take part in the match is simply replaced with an empty string.
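To make the group numbering concrete, here is a minimal sketch that prints groups 2, 3, 4 and 7 for one URL of each style. The helper name groups and the two example URLs are my own assumptions, picked so that they fit the pattern; the regex itself is the one from the converter.

```go
package main

import (
	"fmt"
	"regexp"
)

// s3URL is the same pattern the converter compiles.
var s3URL = regexp.MustCompile(`https:\/\/((coolhost)\.s3\.us-west-2\.amazonaws\.com(\/)|s3-us-west-2\.amazonaws\.com\/)(bucket1|bucket2)(-([0-9]+)|)\/([\w\/\-?=%.]+)`)

// groups is a hypothetical helper: it returns the full match at index 0
// followed by capture groups 1-7, or nil when the url does not match.
func groups(url string) []string {
	return s3URL.FindStringSubmatch(url)
}

func main() {
	// Assumed example URLs, one per style.
	urls := []string{
		"https://coolhost.s3.us-west-2.amazonaws.com/bucket1/dir1/dir3/file.jpeg",
		"https://s3-us-west-2.amazonaws.com/bucket2/dir/filename.png",
	}
	for _, u := range urls {
		m := groups(u)
		if m == nil {
			fmt.Println("no match:", u)
			continue
		}
		// Only these four groups are used by the replacement template.
		for _, i := range []int{2, 3, 4, 7} {
			fmt.Printf("$%d=%q ", i, m[i])
		}
		fmt.Println()
	}
}
```

For the first URL, groups 2 and 3 hold coolhost and the slash; for the second they come back empty, which is exactly what lets the same template work for both.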

A few things:

  • the problem boils down to extracting the bucket’s name in both cases and appending it with its subdirectories
  • the (-([0-9]+)|) is a group that just says “a hyphen followed by one or more digits, or nothing at all”. I added this to handle the bucketname-accountnumber format; since groups 5 and 6 aren’t used in the replacement, the suffix gets dropped. You can write this as \d+ if you prefer, which is equivalent in Go. Note that both forms accept account numbers that start with 0; use [1-9][0-9]* if you want to reject those
  • the group ((coolhost)\.s3\.us-west-2\.amazonaws\.com(\/)|s3-us-west-2\.amazonaws\.com\/) is the most important. Note how it captures the trailing forward slash (group 3) in the first case, but not in the second case after the OR. This is needed so that group 3 is a slash when we match coolhost in group 2, and empty for the other URL format. In the coolhost format, where the host is a subdomain of s3.us-west-2.amazonaws.com, the value coolhost is actually the bucket name, so it should come first, followed by a slash and the rest of the path. In the second format there is no bucket in the host and the first path segment is the bucket name, so we don’t want to capture a slash before it. That’s why the OR branch is so large: it evaluates these two cases separately
  • the \/([\w\/\-?=%.]+) part just represents “everything after the forward slash”, aka the bucket’s subdirectories and the file name (the parenthesized part is group 7)
  • the converter is wrapped in a struct because it wraps around the regexp.Regexp object and we don’t want to re-compile it (that’s slow)
  • the unit tests aren’t included in the above code, but they’re incredibly important for verifying the converter behavior and do exist/are exhaustive
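As a rough idea of what such tests check, here is a minimal table-driven sketch. It is not my actual test suite: the function name convert, the example URLs and the account number are assumptions made up for illustration, and it runs as a plain program rather than through go test.

```go
package main

import (
	"fmt"
	"regexp"
)

// s3URL is the same pattern the converter compiles, built once at startup.
var s3URL = regexp.MustCompile(`https:\/\/((coolhost)\.s3\.us-west-2\.amazonaws\.com(\/)|s3-us-west-2\.amazonaws\.com\/)(bucket1|bucket2)(-([0-9]+)|)\/([\w\/\-?=%.]+)`)

// convert mirrors Convert as a plain function (hypothetical name).
func convert(url string) string {
	return s3URL.ReplaceAllString(url, "s3://$2$3$4/$7")
}

func main() {
	// Assumed inputs covering each branch of the regex.
	cases := []struct{ in, want string }{
		// bucket in the host name
		{"https://coolhost.s3.us-west-2.amazonaws.com/bucket1/dir1/dir3/file.jpeg", "s3://coolhost/bucket1/dir1/dir3/file.jpeg"},
		// bucket as the first path segment
		{"https://s3-us-west-2.amazonaws.com/bucket2/dir/filename.png", "s3://bucket2/dir/filename.png"},
		// bucketname-accountnumber: the numeric suffix is dropped
		{"https://s3-us-west-2.amazonaws.com/bucket2-0123456789/dir/filename.png", "s3://bucket2/dir/filename.png"},
		// no match: ReplaceAllString returns the input unchanged
		{"https://example.com/file.png", "https://example.com/file.png"},
	}
	for _, c := range cases {
		got := convert(c.in)
		status := "ok"
		if got != c.want {
			status = "FAIL"
		}
		fmt.Printf("%-4s %s -> %s\n", status, c.in, got)
	}
}
```

The last case is worth keeping in mind: a URL that doesn’t match passes through untouched, so a caller that needs to detect bad input would have to check for a match separately.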

And that’s about it! Hopefully this gave some people some new ideas for how to do “string conversion” because I feel like it’s rare that we actually get to “use regex” to solve problems. Keep in mind that this is a basic problem and isn’t meant to highlight anything other than the premise behind “conversion problems” which require you to use information in the string to build new strings.

You can catch me streaming at twitch.tv/cub.
