How to extract text within a string of text
Solution 1
Here is a very flexible VBA answer using the regex object. The function extracts every sub-group match it finds (the parts of the pattern inside parentheses) and joins them with whatever separator string you want (the default is ", "). You can find info on regular expressions here: http://www.regular-expressions.info/
You would call it like this, assuming that first string is in A1:
=RegexExtract(A1,"gi[|](\d+)[|]")
Since this looks for all occurrences of "gi|" followed by a series of digits and then another "|", for the first line in your question, this would give you this result:
297848936, 297338191
Just run this down the column and you're all done!
Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String, _
                      Optional separator As String = ", ") As String
    Dim allMatches As Object
    Dim RE As Object
    Dim i As Long, j As Long
    Dim result As String

    ' Late-bound VBScript regex object, so no library reference is needed
    Set RE = CreateObject("vbscript.regexp")
    RE.Pattern = extract_what
    RE.Global = True ' find every match, not just the first
    Set allMatches = RE.Execute(text)

    ' Collect every captured sub-group of every match, each prefixed by the separator
    For i = 0 To allMatches.Count - 1
        For j = 0 To allMatches.Item(i).SubMatches.Count - 1
            result = result & (separator & allMatches.Item(i).SubMatches.Item(j))
        Next
    Next

    ' Strip the leading separator
    If Len(result) <> 0 Then
        result = Right$(result, Len(result) - Len(separator))
    End If

    RegexExtract = result
End Function
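As a quick check of the separator argument, here is a small sketch that calls the function from VBA rather than from a cell. It assumes RegexExtract has been pasted into the same standard module; DemoRegexExtract is just an illustrative name, and the sample string is the first row from the question.

```vba
' Illustrative driver only; assumes RegexExtract (above) is in the same module.
Sub DemoRegexExtract()
    Dim sample As String
    sample = "1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom"

    ' Default separator ", "
    Debug.Print RegexExtract(sample, "gi[|](\d+)[|]")       ' 297848936, 297338191

    ' Plain comma, matching the output format requested in the question
    Debug.Print RegexExtract(sample, "gi[|](\d+)[|]", ",")  ' 297848936,297338191
End Sub
```

The results appear in the Immediate window (Ctrl+G in the VBA editor). In a worksheet, the equivalent would be passing "," as the third argument of the formula.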
Solution 2
Here it is (assuming data is in column A)
=VALUE(LEFT(RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2),
FIND("|",RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2)) -1 ))
Not the nicest formula, but it will work to extract the number.
I just noticed that you have two values per row, with the output separated by commas. You will need to check whether there is a second match, a third match, etc. to make it work for multiple numbers per cell.
In reference to your exact sample (assuming 2 values maximum per cell) the following code will work:
=IF(ISNUMBER(FIND("gi|",$A1,FIND("gi|", $A1)+1)),CONCATENATE(LEFT(RIGHT($A1,LEN($A1)
- FIND("gi|",$A1) - 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ),
", ",LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1)
- 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1) - 2))
-1 )),LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2),
FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ))
How's that for ugly? A VBA solution may be better for you, but I'll leave this here for you.
To go up to 5 numbers, well, study the pattern and nest it manually in the formula. It will get long!
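The pattern the formulas repeat is "find the next gi|, then read up to the following |". If the nesting gets out of hand, the same idea can be written as a loop. This is only a sketch: GiNumbers is a made-up name, it uses InStr and Mid$ in place of FIND, LEFT and RIGHT, and it does not check that the text between the pipes is actually numeric.

```vba
' Sketch of the formula's find-and-cut pattern as a loop (hypothetical helper name).
' Returns every piece of text that sits between "gi|" and the next "|".
Function GiNumbers(ByVal text As String, _
                   Optional separator As String = ",") As String
    Dim pos As Long, startPos As Long, endPos As Long
    Dim result As String

    pos = InStr(1, text, "gi|")
    Do While pos > 0
        startPos = pos + 3                       ' first character after "gi|"
        endPos = InStr(startPos, text, "|")      ' closing pipe
        If endPos = 0 Then Exit Do               ' no closing pipe: stop

        If Len(result) > 0 Then result = result & separator
        result = result & Mid$(text, startPos, endPos - startPos)

        pos = InStr(endPos, text, "gi|")         ' look for the next "gi|"
    Loop

    GiNumbers = result
End Function
```

Used as =GiNumbers(A1) in a cell, this handles any number of gi values per row without lengthening the formula.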
Solution 3
As the other guys presented solutions without VBA... I'll present one that does use it. Now, it's your call whether to use it or not.
Just saw that @Issun presented a solution with regex, very nice! Either way, I'll present a 'modest' solution for the question, using only 'plain' VBA.
Option Explicit
Option Base 0

Sub findGi()
    Dim oCell As Excel.Range
    Set oCell = Sheets(1).Range("A1")

    'Loops through every row until an empty cell is found
    While Not oCell.Value = ""
        'Writes the result into the column to the right
        oCell.Offset(0, 1).Value2 = GetGi(oCell.Value)
        Set oCell = oCell.Offset(1, 0)
    Wend
End Sub

Private Function GetGi(ByVal sValue As String) As String
    Dim sResult As String
    Dim vArray As Variant
    Dim vItem As Variant
    Dim iCount As Integer

    vArray = Split(sValue, "|")
    iCount = 0

    'Loops through the array...
    For Each vItem In vArray
        'Searches for the 'gi' marker; the next item must not be the last one,
        'so that the number is followed by a closing "|"
        If vItem Like "*gi" And UBound(vArray) > iCount + 1 Then
            'Concatenates the results...
            sResult = sResult & vArray(iCount + 1) & ","
        End If
        iCount = iCount + 1
    Next vItem

    'And removes the trailing comma
    If Len(sResult) > 0 Then
        sResult = Left(sResult, Len(sResult) - 1)
    End If

    GetGi = sResult
End Function
Solution 4
I'd probably split the data first on the | delimiter using the Convert Text to Columns wizard.
In Excel 2007 that is on the Data tab, Data Tools group; choose Text to Columns, then specify Other: and | as the delimiter.
From the sample data you posted it looks like after you do this the numbers will all be in the same columns so you could then just delete the columns you don't want.
Brandon
Updated on December 13, 2020
Comments
-
Brandon over 3 years
I have a simple problem that I'm hoping to resolve without using VBA but if that's the only way it can be solved, so be it.
I have a file with multiple rows (all one column). Each row has data that looks something like this:
1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom
2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate
etc...
What I want is some way to extract the string of numbers that begin with "gi|" and end with a "|". For some rows this might mean as many as 5 gi numbers, for others it'll just be one.
What I would hope the output would look like would be something like:
297848936,297338191
168010496
etc...
-
Brandon over 12 years
Haha, that did work wonderfully. Thanks for your help. You're right, this is going to get ugly fast. Perhaps I should stick with VBA then? I don't really mind, I just thought people might find VBA answers to be too cumbersome :P To be honest, I'm not sure I have any clue what is going on in that code you included! I'm not sure where I would need to make tweaks for it to go up to 5 or 7 numbers.
-
Brandon over 12 years
I actually originally thought this, but I should mention that there are times when there are numbers after the gb column as well. So within that example string I listed, you could also get something like "randomrandomrandom gb|13151414|". I just changed my original post to reflect that.
-
Brandon over 12 years
Oh man, this is beautiful. Absolutely fabulous. Seriously, why do you do this? It's so helpful, but I'm just curious why people give their time for something like this? It's wonderfully charitable of you all.
-
aevanko over 12 years
You're very welcome! As for why I take the time: I do it because other people do it. I think it's more like the 'paying it forward' thing. I help others because one day they will help me with some code, and the people I help will help others, etc. :)
-
Doug Glancy over 12 years
Regex is a great way to go. +1 For myself, I answer questions because it's fun and a great way to learn/practice. Plus, like Issun says, I've gotten amazing help from generous and very talented people in newsgroups and other forums over the years.
-
Brandon over 12 years
Ah hah, this is a great one as well. I see that VBA can be a really smooth approach to this then; I did not realize that. Thanks again for your help!
-
NetMage almost 9 years
Looks like a typo: Item(j) should be Item(i) - can't correct since it is only one letter wrong (kind of dumb for a coding site!). -- Got it fixed by spelling separator correctly.
-
jule64 over 6 years
Great solution, elegant and works so well, this should be a built-in function! I will add a link to this answer in the function documentation.
-
Jorge González Lorenzo about 6 years
If you choose this approach, better use the MID() function instead of LEFT and RIGHT. That would make the code more readable.
-
ShaneSauce over 5 years
It might be useful to know that this function is actually already baked into Google Sheets. So if you don't want to mess with VBA, you can always import your data there, where RegexExtract is already available.
-
Armin Alibasic over 5 years
Hello @aevanko, thank you for this code, it solved my problem. I added some more logic, so for example now I get results something like this: , , 1, , , , 2, , , , , , , 22, , , , . How can I remove the extra commas so I can sum these three numbers (or more or fewer numbers, depending on the text)? I was trying to change the function itself to return the sum instead of the comma-separated list, but it doesn't work. Any idea how I can sum the results?