wget: recursively retrieve URLs from a specific website
You could also use something like Nutch. I've only ever used it to crawl internal links on a site and index them into Solr, but according to this post it can also follow external links. Depending on what you want to do with the results, it may be a bit of overkill, though.
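If the goal is just a list of URLs rather than a local mirror, wget's spider mode can log everything it visits and the URLs can be pulled back out of that log. A minimal sketch: site.com is a placeholder, so the actual crawl command is shown commented out, and the log fixture below only approximates wget's --no-verbose format, which can vary by version.

```shell
# Crawl without saving content, logging every URL visited:
#   wget --spider -r -l 2 --no-verbose --output-file=crawl.log http://site.com
#
# A --no-verbose log contains lines roughly like these (illustrative fixture):
cat > crawl.log <<'EOF'
2012-01-01 00:00:01 URL:http://site.com/ [1024] -> "index.html" [1]
2012-01-01 00:00:02 URL:http://site.com/about [512] -> "about" [1]
EOF

# Pull the unique URLs back out of the log:
grep -oE 'https?://[^] "]+' crawl.log | sort -u
```

With --spider, wget still parses the HTML it fetches (so recursion works) but deletes the files afterwards, which is exactly what you want when only the link graph matters.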
Oleg Kuts
Updated on September 18, 2022

Comments
-
Oleg Kuts over 1 year
First, I'm very new to this, sorry if this question is dumb. I have two controllers with class request mappings and form to validate:
@Controller
@RequestMapping("/creditcard")
public class CreditCardController {

    @Autowired
    // repositories

    @RequestMapping(value = "/addnewcard", method = RequestMethod.POST)
    public String addNew(@ModelAttribute("newCard") @Valid CreditCard creditCard,
                         BindingResult bindingResult, Principal principal, Model model) {
        if (bindingResult.hasErrors()) {
            // render error view
        }
        creditCardService.registerNew(creditCard, principal.getName());
        return "redirect:/account";
    }
}
Another
@Controller
@RequestMapping("/account")
public class AccountController {

    @Autowired
    // repo

    @RequestMapping("")
    public String showUserProfile(Principal principal, Model model) {
        String username = principal.getName();
        User user = userService.findByUsername(username);
        Account account = user.getAccount();
        List<Payment> payers = paymentService
                .getAllPaymentsForPayerAccount(account.getId());
        model.addAttribute("user", user);
        model.addAttribute("account", account);
        model.addAttribute("payers", payers);
        return "userprofile";
    }
}
Form on userprofile.jsp
<form:form cssClass="form-horizontal" modelAttribute="newCard" action="creditcard/addnewcard"> ........ </form:form>
And all this works without @Valid. When I add @Valid it works fine when validation fails (it shows the error view with messages), but when it succeeds I get a 404 error due to an incorrect URI: http://localhost:8080/PaymentSystem/creditcard/creditcard/addnewcard. Here one /creditcard is extra; @Valid somehow adds this to my URI. I found two ways to solve this: 1) I moved the addNew method to AccountController; 2) I just removed @RequestMapping("/creditcard"). But I still have not found any explanation of this behaviour. Any idea?
-
Kerrek SB over 12 years
This question isn't very clear. What do you mean by "all possible URLs"? Do you want to start with one website and then crawl to all its linked websites, recursively? If so, how do you want to achieve that without downloading the actual websites, which you need to parse for further links?
-
steenhulthin over 12 years
What did you try? wget -r is the recursive option. Did you try with that? What problem did you run into?
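The recursive retrieval the commenters describe can be sketched as a single command; site.com is a placeholder for the target site, and the depth limit is an illustrative choice (wget's default is 5 levels).

```shell
# -r recurses through links on the same host; -l 2 limits the depth to two
# levels; -p additionally fetches the prerequisites needed to render each
# page (CSS, images), even when they live on another host.
# site.com is a placeholder for the real site.
wget -r -l 2 -p http://site.com
```

This is a CLI fragment meant to be adapted, not run as-is against the placeholder host.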
-
dma_k over 12 years
Just use wget -r http://site.com. Also a nice option is -p, which will also fetch all prerequisites for the page, even if they are external.
-
Admin over 12 years
@Kerrek all possible URLs - yes, URLs which are linked from internal pages (that is, which have the same domain). And that's a good point: wget could download only the HTML content, at least to find the linked URLs/pages, ignoring any other file types.
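The idea of staying on one domain and fetching only the HTML (enough to discover links) maps onto wget's domain and accept filters. A hedged sketch, again with site.com as a placeholder:

```shell
# -r recurses; -D restricts recursion to the listed domain; -A keeps only
# files whose names match the given suffixes; --no-parent avoids climbing
# above the start URL. site.com is a placeholder for the real site.
wget -r --no-parent -D site.com -A 'html,htm' http://site.com
```

Note that -A filters by filename suffix, so extensionless pages (e.g. /about) may be rejected after being parsed; wget still follows the links it finds in them before discarding the file.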
-
Kerrek SB over 12 years
@abhiomkar: Well, yes, you wouldn't download all the pictures and flash animations, of course. The -r option already does exactly that. If you also want stylesheets that may be linked inside other stylesheets, you have to work a bit harder, but for the semantic content only, -r is exactly the answer. Did you try any of this before asking the question, by the way? I think wget has pretty decent documentation.
-
giorgio79 almost 10 years