Parse JavaScript with jsoup

jsoup is an HTML parser and does not interpret JavaScript, so you have two ways to solve this:

A. Use a JavaScript library

  • Pro:

    • Full JavaScript support
  • Con:

    • Additional library / dependencies

B. Use jsoup + manual parsing

  • Pro:

    • No extra libraries required
    • Enough for simple tasks
  • Con:

    • Not as flexible as a JavaScript library

Here's an example of how to get the key with jsoup and some "manual" code:

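import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = ...
Element script = doc.select("script").first(); // first <script> element in the document

// Regex for the value of key; (?is) = case-insensitive, and '.' also matches line breaks
Pattern p = Pattern.compile("(?is)key=\"(.+?)\"");
// Use html() here, NOT text(): text() drops the script content, including the 'key' part
Matcher m = p.matcher(script.html());

while (m.find()) {
    System.out.println(m.group());  // the whole match (key="value")
    System.out.println(m.group(1)); // the value only
}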

Output (using the HTML snippet from your question):

key="pqRjnA"
pqRjnA
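
The pattern above only matches the exact form key="..." with double quotes and no spaces around the equals sign. As a rough sketch of a more tolerant variant (an assumption about what other pages might contain, not something the question requires), the version below also accepts whitespace around = and single quotes. The class name KeyExtractor and the method extractKey are hypothetical. It works on a plain string so that it runs without jsoup; in the example above you would pass it script.html().

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
class KeyExtractor {
    // \bkey        : the variable name as a whole word (so "monkey" does not match)
    // \s*=\s*      : optional whitespace around the equals sign
    // (["'])       : group 1 = the opening quote, single or double
    // (.+?)\1      : group 2 = the value, up to the matching closing quote
    private static final Pattern KEY_PATTERN =
            Pattern.compile("\\bkey\\s*=\\s*([\"'])(.+?)\\1");

    // Returns the value assigned to key, or an empty Optional if no assignment is found.
    static Optional<String> extractKey(String scriptSource) {
        Matcher m = KEY_PATTERN.matcher(scriptSource);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Script body as it appears in the question's HTML snippet.
        String scriptSource = "\n    key=\"pqRjnA\";\n";
        System.out.println(extractKey(scriptSource).orElse("(not found)")); // prints pqRjnA
    }
}
```

Returning an Optional instead of printing inside a loop makes the "no match" case explicit for the caller.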
Author: ravi

Updated on November 15, 2021

Comments

  • ravi (over 2 years ago)

    In an HTML page, I want to pick out the value of a JavaScript variable.
    Below is a snippet of the HTML page:

    <input id="hidval" value="" type="hidden"> 
    <form method="post" style="padding: 0px;margin: 0px;" name="profile" autocomplete="off">
    <input name="pqRjnA" id="pqRjnA" value="" type="hidden">
    <script type="text/javascript">
        key="pqRjnA";
    </script>
    

    My aim is to read the value of the variable key from this page using jsoup.
    Is this possible with jsoup? If so, how?

  • Anil Kumar Pandey (almost 10 years ago)
    Hey, jsoup + manual parsing is a very good solution for this, but it breaks when the JS variable is an array, e.g. keyArray = [1, 2, 3]. Can you please give me a solution for this?
  • ollo (almost 10 years ago)
    You can use this regex instead: (?s)(keyArray)\\s??=\\s??\\[(.*?)\\]. It defines two groups: group 1 = variable name, group 2 = value (the part within [ ]).
  • user79307 (over 9 years ago)
    And what if I have something like abc.xyz.init({requiredJsonObjectAsAnArgument}); inside script tags and I want to parse only requiredJsonObjectAsAnArgument? Can you suggest a suitable regex for this case?
  • ollo (over 9 years ago)
    Please try (?s)\\.init\\(\\{(.+?)\\}\\); where group 1 contains requiredJsonObjectAsAnArgument.
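
To make the two regexes from the comments concrete, here is a minimal sketch that applies them to sample script text. The class ScriptPatterns, its method names, and the sample inputs are hypothetical. The array parser assumes the array holds only integers, as in the keyArray = [1, 2, 3] example. Like the regexes themselves, the code works on the script body as a string, for instance the result of script.html().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
class ScriptPatterns {
    // Regex from the comments: group 1 = variable name, group 2 = text inside [ ].
    private static final Pattern ARRAY =
            Pattern.compile("(?s)(keyArray)\\s??=\\s??\\[(.*?)\\]");

    // Regex from the comments: group 1 = text between ".init({" and "});".
    private static final Pattern INIT_ARG =
            Pattern.compile("(?s)\\.init\\(\\{(.+?)\\}\\);");

    // Assumes the array contains only integer literals; returns an empty list if not found.
    static List<Integer> parseIntArray(String script) {
        List<Integer> values = new ArrayList<>();
        Matcher m = ARRAY.matcher(script);
        if (m.find()) {
            for (String part : m.group(2).split(",")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    values.add(Integer.parseInt(trimmed));
                }
            }
        }
        return values;
    }

    // Returns the object body without the surrounding braces, or null if not found.
    static String initArgument(String script) {
        Matcher m = INIT_ARG.matcher(script);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parseIntArray("keyArray = [1, 2, 3];"));          // prints [1, 2, 3]
        System.out.println(initArgument("abc.xyz.init({\"a\": 1});"));       // prints "a": 1
    }
}
```

Note that group 1 of the .init regex excludes the outer braces, so they have to be added back before the text is handed to a JSON parser.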