Extracting Tuple from Bag in PIG
Solution 1
Here is what I would do.
PIG Script
A = LOAD 'test.txt' USING PigStorage('\t') AS (data1: chararray , data2: chararray , data3: chararray, data4: chararray , data5: chararray , data6: chararray);
B = foreach A generate data3, data4;
C = filter B by data3 matches 'row';
D = foreach C generate data4;
E = foreach D generate REGEX_EXTRACT($0,'value: .([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+).*', 1);
Output
(192.168.1.3)
If needed, you can use a stricter regex to extract the IP address; see "Extract ip addresses from Strings using regex".
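Since Pig's REGEX_EXTRACT is built on Java regular expressions, the pattern from Solution 1 can be tried in plain Java before running the script. The following is only a minimal sketch: the class name `IpRegexSketch` and the method `extractIp` are hypothetical, and the input string is the data4 value from the sample file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IpRegexSketch {
    // Same pattern as the REGEX_EXTRACT call in Solution 1. The single '.'
    // after "value: " skips the one-character prefix (x, y or z).
    private static final Pattern IP =
        Pattern.compile("value: .([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+).*");

    // Returns capture group 1 (the IP address), or null when nothing matches.
    static String extractIp(String field) {
        if (field == null) {
            return null;
        }
        Matcher m = IP.matcher(field);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractIp("value: y192.168.1.3")); // prints 192.168.1.3
    }
}
```

The index 1 passed to REGEX_EXTRACT plays the same role as `m.group(1)` here: it selects the first parenthesized group.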
Solution 2
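You could use the FLATTEN operator to turn the bag into separate rows, then use FILTER to keep only the token that looks like an IP address.
E = foreach C generate flatten(TOKENIZE(data4));
F = filter E by $0 matches '.\\d+\\.\\d+\\.\\d+\\.\\d+';
Note that the surviving token still carries its one-character prefix (for example y192.168.1.3).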
Hope this helps
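The tokenize-then-filter idea in Solution 2 can be sketched in plain Java to see which tokens survive. This is an illustration under assumptions, not Pig's implementation: the class `TokenFilterSketch` and method `ipLikeTokens` are hypothetical, and splitting on whitespace only approximates TOKENIZE, which also splits on a few other delimiters.

```java
import java.util.ArrayList;
import java.util.List;

class TokenFilterSketch {
    // Splits the field on whitespace (a simplification of TOKENIZE) and keeps
    // tokens that fully match the pattern used in the FILTER statement.
    static List<String> ipLikeTokens(String field) {
        List<String> kept = new ArrayList<>();
        for (String token : field.trim().split("\\s+")) {
            // String.matches requires the whole token to match, like Pig's
            // 'matches' operator; the leading '.' accepts the prefix character.
            if (token.matches(".\\d+\\.\\d+\\.\\d+\\.\\d+")) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(ipLikeTokens("value: y192.168.1.3")); // prints [y192.168.1.3]
    }
}
```

Because the whole token must match, dropping the leading '.' from the pattern would reject y192.168.1.3 and return nothing.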
Solution 3
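import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class someClass extends EvalFunc<String>
{
    // Simple dotted-quad pattern; swap in a stricter one if needed.
    private static final Pattern IP = Pattern.compile("(\\d+\\.\\d+\\.\\d+\\.\\d+)");

    public String exec(Tuple input) throws IOException {
        DataBag bag = (DataBag) input.get(0);
        Iterator<Tuple> it = bag.iterator();
        Tuple tup = null;
        // Advance to the second tuple in the bag (no size checks yet).
        for (int i = 0; i < 2; i++)
        {
            tup = it.next();
        }
        String ipString = (String) tup.get(0);
        // Get the IP from the string with a regex.
        Matcher m = IP.matcher(ipString);
        return m.find() ? m.group(1) : null;
    }
}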
Of course, you should add input checks (null inputs, a bag containing only one tuple, etc.) and harden the code.
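One way to cover the "bag with only one tuple" case is to guard every step of the iteration. The helper below is a hedged sketch in plain Java with no Pig dependency: `NthElementSketch` and `nth` are hypothetical names, and it assumes the bag can be treated as an ordinary `Iterable`.

```java
import java.util.Arrays;
import java.util.Iterator;

class NthElementSketch {
    // Returns the element at zero-based position n, or null when the input
    // is null, n is negative, or there are fewer than n + 1 elements.
    static <T> T nth(Iterable<T> items, int n) {
        if (items == null || n < 0) {
            return null;
        }
        Iterator<T> it = items.iterator();
        T current = null;
        for (int i = 0; i <= n; i++) {
            if (!it.hasNext()) {
                return null; // ran out of elements before reaching position n
            }
            current = it.next();
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(nth(Arrays.asList("value:", "y192.168.1.3"), 1)); // prints y192.168.1.3
    }
}
```

Returning null for a short bag lets the UDF produce a null field instead of failing the whole job with an exception.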
Comments
-
Pradeep Bhadani over 1 year
File content (test.txt):
Some specific column value: x192.168.1.2 blah blah
Some specific row value: y192.168.1.3 blah blah
Some specific field value: z192.168.1.4 blah blah
Pig query:
A = LOAD 'test.txt' USING PigStorage('\t') AS (data1: chararray , data2: chararray , data3: chararray, data4: chararray , data5: chararray , data6: chararray);
B = foreach A generate data3, data4;
C = filter B by data3 matches 'row';
D = foreach C generate data4;
E = foreach D generate TOKENIZE(data4);
Output:
((value:), (y192.168.1.3))
Now I want to extract a specific tuple from this output bag, say the second tuple (y192.168.1.3), and then extract the IP address from it. I am trying to do this with UDFs but got stuck.
-
Ray Toal about 12 years
Pig allows regex matching. Have you tried that?
-
Pradeep Bhadani about 12 years
Is regex in Pig the same as in Java?
-
Pradeep Bhadani about 12 years
I tried this: E = foreach D generate REGEX_EXTRACT(message,'Internet:*') As result; but it throws an error: ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.REGEX_EXTRACT as multiple or none of them fit. Please use an explicit cast. My IP in the message is written as Internet:192.x.x.x
-
Pradeep Bhadani about 12 years
Thanks for the reply, but it is not giving the desired result. tup.get(0) returns the complete message ((value:), (y192.168.1.3)), not (y192.168.1.3).
-
Pradeep Bhadani about 12 years
I tried this, but it returns no result. I also tried this: E = foreach D generate REGEX_EXTRACT(message,'Internet:*') As result; but it throws an error: ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.REGEX_EXTRACT as multiple or none of them fit. Please use an explicit cast. My IP in the message is written as Internet:192.x.x.x
-
sheimi about 12 years
I'm sorry, I made a mistake. The first line should be E = foreach C generate flatten(TOKENIZE(data4));
-
sheimi about 12 years
I think you should specify the index, so it should be E = foreach D generate REGEX_EXTRACT($0,'(.*):(.*)', 2); Here is the sample code. But there is an error in the sample: the '\' should be removed.
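The index argument suggested in the last comment selects a capture group, which the earlier two-argument REGEX_EXTRACT calls left out. A rough Java equivalent of '(.*):(.*)' with index 2 is sketched below; `GroupIndexSketch` and `afterColon` are hypothetical names, and the address in the example is an assumed stand-in for the elided Internet:192.x.x.x value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexSketch {
    // Group 1 is the text before the last colon, group 2 the text after it.
    private static final Pattern KEY_VALUE = Pattern.compile("(.*):(.*)");

    // Returns group 2, mirroring an index of 2 in REGEX_EXTRACT, or null
    // when the message contains no colon.
    static String afterColon(String message) {
        if (message == null) {
            return null;
        }
        Matcher m = KEY_VALUE.matcher(message);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // The address is an assumed example value, not data from the thread.
        System.out.println(afterColon("Internet:192.168.1.3")); // prints 192.168.1.3
    }
}
```

Note that 'Internet:*' from the comments means "Internet" followed by zero or more colons and contains no capture group, so there would be nothing for an index to select.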