wget: recursively retrieve URLs from a specific website

You could also use something like Nutch. I've only ever used it to crawl internal links on a site and index them into Solr, but according to this post it can also follow external links. Depending on what you want to do with the results, it may be a bit of overkill, though.
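
For a lighter-weight route with wget itself, a spider-mode sketch along these lines should print the discovered URLs without saving the pages (example.com is a placeholder, and -l2 caps the recursion depth at two levels):

    # list crawled URLs; wget logs each fetch as "--<timestamp>--  <url>"
    wget --spider -r -l2 http://example.com 2>&1 \
        | grep '^--' | awk '{ print $3 }'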

Comments

  • Oleg Kuts
    Oleg Kuts over 1 year

    First, I'm very new to this, so sorry if this question is dumb. I have two controllers with class-level request mappings, and a form to validate:

    @Controller
    @RequestMapping("/creditcard")
    public class CreditCardController {
        @Autowired
        private CreditCardService creditCardService; // ...other repositories elided

        @RequestMapping(value = "/addnewcard", method = RequestMethod.POST)
        public String addNew(
                @ModelAttribute("newCard") @Valid CreditCard creditCard,
                BindingResult bindingResult, Principal principal, Model model) {
            if (bindingResult.hasErrors()) {
                // render the error view (elided here)
            }
            creditCardService.registerNew(creditCard, principal.getName());
            return "redirect:/account";
        }
    }

    The second controller:

    @Controller
    @RequestMapping("/account")
    public class AccountController {
        @Autowired
        private UserService userService;
        @Autowired
        private PaymentService paymentService; // ...other repositories elided
    
        @RequestMapping("")
        public String showUserProfile(Principal principal, Model model) {
            String username = principal.getName();
            User user = userService.findByUsername(username);
            Account account = user.getAccount();
            List<Payment> payers = paymentService
                    .getAllPaymentsForPayerAccount(account.getId());
            model.addAttribute("user", user);
            model.addAttribute("account", account);
            model.addAttribute("payers", payers);
            return "userprofile";
        }
    }
    

    The form on userprofile.jsp:

    <form:form cssClass="form-horizontal" modelAttribute="newCard" action="creditcard/addnewcard">
    ........
    </form:form>
    

    And all of this works without @Valid. With @Valid it still works fine when validation fails (the error view with messages is shown), but when validation succeeds I get a 404 error due to an incorrect URI - http://localhost:8080/PaymentSystem/creditcard/creditcard/addnewcard. One /creditcard segment is extra; @Valid somehow adds it to my URI. I found two ways to solve this:

    1) I moved the addNew method to AccountController

    2) I just removed @RequestMapping("/creditcard")

    But I still haven't found any explanation for this behaviour. Any ideas?
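
    For what it's worth, a relative action such as creditcard/addnewcard is resolved against the URL of whatever page the form happens to be rendered at, which is one way a path segment can end up doubled. A sketch of the same form with an absolute action, built with the standard pageContext.request.contextPath EL expression (not something from my original code), would be:

    <form:form cssClass="form-horizontal" modelAttribute="newCard"
        action="${pageContext.request.contextPath}/creditcard/addnewcard">
    ........
    </form:form>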

    • Kerrek SB
      Kerrek SB over 12 years
      This question isn't very clear. What do you mean by "all possible URLs"? Do you want to start with one website and then crawl to all its linked websites, recursively? If so, how do you want to achieve that without downloading the actual websites, which you need to parse for further links?
    • steenhulthin
      steenhulthin over 12 years
      What did you try? wget -r is the recursive option. Did you try with that? What problem did you run into?
    • dma_k
      dma_k over 12 years
      Just use wget -r http://site.com. Another nice option is -p, which will also fetch all prerequisites for the page, even if they are external.
    • Admin
      Admin over 12 years
      @Kerrek All possible URLs - yes, URLs that are linked from internal pages (that is, pages on the same domain). And that's a good point: wget could download only the HTML content, at least to find the linked URLs/pages, while ignoring any other file types.
    • Kerrek SB
      Kerrek SB over 12 years
      @abhiomkar: Well, yes, you wouldn't download all the pictures and flash animations, of course. The -r option already does exactly that. If you also want stylesheets that may be linked inside other stylesheets, you have to work a bit harder, but for the semantic content alone, -r is exactly the answer (a combined sketch follows this thread). Did you try any of this before asking the question, by the way? I think wget has pretty decent documentation...
    • giorgio79
      giorgio79 almost 10 years
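
Putting the comments together - a combined sketch using the options suggested above (both are documented in wget's manual; by default wget stays on the starting host, so only internal links are followed):

    # -r: recurse through links; -p: also fetch each page's prerequisites (CSS, images)
    wget -r -p http://example.com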