Dueling Excel - Extract Numbers - RegEx - Duel 164

If you like this content, please consider visiting the video on YouTube and subscribe to the MrExcel Channel to show your support!
This video has been published on Aug 15, 2014.
Today's question is how to extract the last 3 digits between this and that. While Bill starts down the formula road, he switches to a VBA User Defined Function. Mike creates a formula with 15 different functions. Finally, Charles Williams and his Speed Tools jump in with a cool Regular Expressions (RegEx) function to simplify things. (Update from Oct 1 2014: See Charles Williams blog post about an improvement to SpeedTools to handle this: Extracting Digits from Text: Using Formulas and Designing a Missing Excel Function: GROUPS )
Table of Contents:
Definition of the Problem 0:20
Bill tries a formula 1:09
Bill switches to VBA 2:30
Mike does a formula 4:49
Using RegEx in Excel 11:25
Timing the Results 14:00
maxresdefault.jpg


Transcript of the video:
Hey, alright, welcome back, it's time for another Dueling Excel podcast.
I'm Bill Jelen from MrExcel, I’ll be joined by Mike Girvin from Excel Is Fun.
This is our episode 164.
Strip Out the Last 3 Digits Between Here and There.
Like, you have to wonder, where people come up with this thing sometimes.
I got this request, they said, you know: we have a series of data, like this, and I need to extract this number and I've… the person who sent it to me, highlighted numbers in red - they're not really highlighted in red - and said: is there a formula, that would do that?
Well, you know, kind of the first thing, that we have to do in here, is infer the rules.
All right, so all of these have the word POV, underscore (_), and then the number, that we want to start to get.
But, as soon as we encounter, either another underscore or a dot, we need to stop getting numbers.
And if there's more than three digits in that string, we only want the rightmost regions.
I said: are you sure..?
Yeah, that's exactly what we know.
So hey, look at this, you know, it was a Dueling Excel podcast day, I was just: see, if I could help this person out, and so I started to write these formulas.
Look for POV, underscore, in A2 and that tells me where it is (=FIND(“POV_”,A2)).
And then, used the =MID(A2,B2+4,50), just getting fifty characters to get everything after POV.
And then in column C, find either an _, and if I don't find anything, note it as 999 (so using the =IFERROR).
And then, also in C, find a dot, a period (.), all right.
And it looks like everyone has a period, so I didn't have to worry about IFERROR there.
And then, take the minimum of columns D and E (=MIN(D2,E2)), and then extract the leftmost characters, all right.
So up to that point I was good.
But then I needed a formula, that would give me just the digits from this column.
Just the digits from this column, throwing out all of the upper and lowercase letters.
And that's where things went bad.
I went out to the MrExcel.com message board, I searched, there are some insane formulas there, but they were insane.
And I said, you know, this is just crazy, let's switch over to VBA, where I could do this whole thing in a short, little, custom, user-defined function here.
So the user-defined function, I called it Nums, where we’re going to pass through a variable called MV – my value.
And I use the Application.WorksheetFunction.Find to find “POV_,” I add 4 to it.
(X = Application.WorksheetFunction.Find (“POV_”, MV) + 4).
So right there, those are my two formulas from columns B and C, right at the beginning.
And then, from that point, from that starting position: the X to the length of my value (For i = X To Len(MV)), I loop through.
Look at ThisChar.
If ThisChar is a dot (.) or an underscore (_), I know that I'm done, we go down to FoundIt.
If otherwise, we look and see it's a digit, 0 through 9, if it is, I append it onto the end of the variable called Nums.
And Nums of course is the function number.
That's what gets returned to the spreadsheet at the end.
So we loop through all of those characters and then, when we're done, just make sure that we get the rightmost three characters (Nums = Right(Nums, 3)).
So for me, rather than trying to figure out the array formula, it was easier just to knock out this function here.
We'll come over here and take a look at how it works.
So =Nums(A2) and double click to copy that down, and we're getting the right answers all the time.
But, Mike, I'd be interested, you know, I say this would be an interesting Dueling excel podcast, if you could show me a formula here that would keep only the digits from this column.
And then after that, it's easy enough to take the =RIGHT(H2,3) and solve the problem.
So Mike, let's see what you have.
Mike: thanks MrExcel, hey, the truth is for this one: this is one of these just insane situations, VBA really rules here.
Now, I'm sure there's a single-cell formula, that Aladin, or Dominique, or PG301 or Don Quixote from the MrExcel message board could create, but VBA wins here.
I mean, this is just an insane thing that we have to do here.
MrExcel asked what do we do, once we get to this column, right.
How do we extract just the numbers here.
So I'm going to come back over here.
I've mashed all of the things MrExcel did into a single formula in this column, so we get this.
Now the problem is, and I've actually done a formula based on some things I've seen on a MrExcel message board, of how to extract numbers from a text string, right.
But we have some leading zeros here.
So this is going to get horrible.
Now I'll try and do it here and I'm going start right here, with this longer text string.
I'm going to have to extract all the individual characters and ask the question: are you a number?
So we'll start with the MID and we'll do some sort of an array formula here, =MID(B12, Now, normally we put a start and number of characters, but I want a start number of 1, and 2, and 3, and 4, and 5, and 6, all the way to 13.
But here I want 1, 2, 3, 4, so I'm going to create in the start number and array that creates a dynamic array, either it's going to be 1 to 4 or 1 to 13.
So I'm going say “1:” and join it to the length of this cell.
(=MID(B12,”1:”&LEN(B12)).
Now, this little bit right here, and the start number, that's just the beginning part.
F9, that's just giving a 1 to 13.
It looks like a row reference.
If I were to convert that using INDIRECT, from text back to a reference, well, it's not going to let me, because it's going to try and do all 13 rows.
F9, never going to work.
So we can get just the row numbers by wrapping ROW around this.
Now, this is something I had in my book and it's been around for a long time.
F9, that gives me a dynamic array of sequential numbers, right.
Here it's 1 to 13 and right, there will be 1 to 4, all right.
Now, comma (,), how many characters?
1.
Now, this is a function argument array operation right there, so we're going to get thirteen answers.
F9, there's the thirteen answers.
So we've extracted all the individual characters into a single cell.
Now we need to ask the question: are any of you numbers?
Ctrl+Z, well, first I have to get those.
All of those are text strings, so I'm going to convert the numbers to numbers and the text strings to errors.
F9, notice there's a number right there, Ctrl+Z.
Now I want… I'm looking through here and I'm noticing, there's all these lead zeros, but there's always going to be a number between 1 and 9.
So I'm just going to use the AGGREGTE on this array and say: please give me the biggest 1.
Why AGGREGATE?
Well, one reason is that AGGREGTE allows me to ignore all those errors easily.
So I'm going to say function, not 1 to 13.
1 to 13 - they don't do array operations.
14 - that's large, because I want the max, right?
So 14, comma, options: I love this 6, that'll avoid those errors.
There's the array, I come to the end and I say: ,[ k ]=1, because I want the max.
Don't have to use Ctrl+Shift+Enter, because AGGREGATE can handle array operations without Ctrl+Shift+Enter.
Copy it down and copy it up.
Now I got the number, but now I need no lead zeros there and one lead zero there, and only two lead zeroes there.
Wow, well, you know, I could use something like TEXT, but then for the TEXT argument, I would have to have format of either a zero, two zeros or three zeros, so watch this.
Over to the side, I'm going to figure out one, two and three, because I'm going to need to say how many zeros to repeat.
Actually, I'm going to need this same little bit here, so “boop” on the array, copy, alright, so =, Ctrl+V, well, that gives me those errors and I want to find out how many numbers there are, so I'm going to say ISNUMBER.
How many numbers are there in that array.
So let's see, copy down, maybe right here we could look at this one, an F9, that's as TRUE;TRUE;TRUE, so it's 4.
Now I need to count those TRUEs, so I'm going to use some IFs, don't like array, so I'm going to use SUMPRODUCT, double negative (--), because those are TRUEs and FALSEs, and I need to count the TRUEs and FALSEs, so I have to convert them to 1s and 0s.
Come to the end, there's a close parentheses, Ctrl+Enter, double click and send it down.
So it looks like they're all fine except for the 4.
I wouldn't really want a 3, so I'm going to say: give me the minimum (MIN) of whatever some product spits out in the number argument, or 3 for the second one there, close parenthesis, Ctrl+Enter, double click and send it down.
There we have the number of zeros we need to repeat.
Now, if I were to use the TEXT function, I would use this and I would use the REPEAT function, because the TEXT function would need 0 00 or 000.
You, know I'm going to not do the TEXT function, because TEXT function sometimes can be pretty slow calculating.
I'm just going to join this to the beginning using REPEAT.
I'd have to use REPT anyway inside of text.
So the text I want to repeat is 0, [ number of times ], that right there.
I think that that will give me one too many, let's try it, [ number of times ], close parenthesis.
Now that whole REPT right there.
I need to join it to the front of AGGREGATE.
I already know it's going to give me one too many, so I'm joining it, Ctrl+Enter, double click and send it down.
Yeah, one too many.
So when it says, well, let's see, right here: for the SUMPRODUCT, for the MIN right there, that little bit right there, I'm going to subtract 1, Ctrl+Enter, double click and send it down.
Wow, I can't believe I actually did that.
That is ridiculous, because look, there's this, boom, and this.
There's got to be a better way to do that.
All right, throw it back to MrExcel.
Bill: Mike, that is amazing.
I think that has to be a record, that is the most number of functions, we've ever used in a Dueling Excel podcast.
What an amazing for me.
I have complete respect for that formula, although, I have to tell you: I'm glad that I went with VBA, because it would have just been so much easier, than those formulas.
But after you sent your video back, I started to think about this and I said, you know, we've used Charles Williams’ tools over here in FastExcel to time, the various formulas recently.
I wonder, if this section here has something that would help.
And if we go to Text, he has something called Rgx.SUBSTITUTE.
Now I know just enough about regex to be dangerous.
Regex is a regular expression and it allows us to create some pattern matching.
I used this once in Adobe InDesign, where I had a difficult Find&Replace in a book and I was able to write one successful regular expression, and it solved my problem.
So let's see, if we can do Rgx.SUBSTITUTE here.
Click OK, the text that we want to replace is over here in column G.
And what we are looking for.
We have to put this in square brackets, we're looking for the letters a-z, so anything from A to Z, close the quotes.
And Case_Sensitive, we're going to say is false, that we… if it's anything from A to Z, they're lowercase, or uppercase, click OK.
And that one works great.Well, double click to copy that down and it is getting just the digits.
From there, it's easy enough, the =RIGHT(H2,3) and then putting it all in one big formula, like you did.
I'm just combining everything together into that formula.
So, I think I would still go with VBA but I know, sometimes people don't want to use VBA and this seems like it's a good alternative.
Now, hey, thanks to Charles Williams, he provided a copy of FastExcel for us.
You don't need all the FastExcel to get this text function.
If you come out to the web, here's Charles’ site, Why FastExcel, we look for SpeedTools and then in SpeedTools, it's called SpeedTools Extras.
That gives us these functions, including the Rgx functions right there.
Which really give you some serious, serious power that allowed to solve this problem.
And I think, oh boy, what is it.
The Extras is very inexpensive, $29, £19, €24, what a deal.
So check that out at FastExcel.
All right, well.
Hey, Mike, thanks for putting up with this horrible question, although it was a real question and we'll really had this problem.
So I'm glad we have three different solutions.
Now, I want to thank everyone for stopping by, we’ll see next week for another Dueling Excel podcast from MrExcel and Excel Is Fun.
So well, hey, you know.
So I used Charles Williams’ FastExcel to use that function.
I'm really, really curious about the formula speed here.
So particularly, we have a couple of things.
Mike's formulas are using, you know, 15 different function, there's a lot of different calculations happening there.
My VBA is interpreted code, so it has to, you know, interpret that at runtime.
Charles is using C++ for FastExcels, huh.
I want to see, you know, really, how all of these stack up.
So, again, using FastExcel here, we'll select that whole range.
So this is my range and I'll go back up to the top with the Ctrl+..
And we'll CalcRange.
So 45, CalcRange, 42 and 44.
45, 42 and 44, all right.
Let's try Mike’s formulas.
Which is a heroic formula, right, but with a lot of different functions.
So CalcRange, 56, CalcRange, 59 CalcRange, and 60.
And then finally, let's see what speed advantage we get.
Now this one's using a lot of different functions as well in addition to the C++ functions.
So let's see how that shakes out.
So we'll CalcRange, oh, 98, CalcRange, 97, CalcRange, 99.
So isn't that interesting, that the VBA actually wins out, because it's short directly to the point.
That regex expression, that I'm using in the middle of Charles’ code, you know, is a very generic kind of thing.
And it’ll be interesting, I'm going to dig into regex some more and chose a more regex examples over the next couple of weeks here, just to see what we can do.
Again, as I mentioned, I've done it now successfully twice in my life and I'm sure there's a lot more power there, and I even understand.
So there you go, just for those of you interested in formulas speed comparison of the three.
 

Forum statistics

Threads
1,214,813
Messages
6,121,706
Members
449,049
Latest member
THMarana

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top