Yes! You can extend the harvester as explained in this example:
Extending the Web Harvester · Esri/geoportal-server Wiki · GitHub
For example, with Apache Tika you could extract document information from many different file types.
You would generate the metadata for these docs yourself, in your preferred profile (question 2). I've been using Dublin Core, since there is typically only limited information available when indexing docs.
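To give a rough idea of the "generate metadata yourself" step, here is a minimal sketch in plain Java that turns a handful of already-extracted fields into a simple Dublin Core record. It assumes the fields (title, creator, and so on) have already been pulled out of the document, for instance by Tika; that extraction step is not shown. The class name `DublinCoreSketch`, the method names, and the `<metadata>` wrapper element are my own placeholders, not anything from geoportal-server, and your profile may well need a different root element.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DublinCoreSketch {

    // Escape XML special characters so extracted values are safe inside elements.
    public static String escapeXml(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;")
                    .replace("'", "&apos;");
    }

    // Build a simple Dublin Core record from already-extracted fields.
    // Keys are Dublin Core element names without the prefix (title, creator, ...).
    // Assumption: the <metadata> root is a placeholder wrapper, not a required name.
    public static String toDublinCore(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n");
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            String value = entry.getValue();
            // Skip empty fields; indexed docs often carry very little information.
            if (value == null || value.trim().isEmpty()) {
                continue;
            }
            xml.append("  <dc:").append(entry.getKey()).append('>')
               .append(escapeXml(value.trim()))
               .append("</dc:").append(entry.getKey()).append(">\n");
        }
        xml.append("</metadata>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for whatever the extractor returned.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Annual Report");
        fields.put("creator", "Planning & Zoning Dept.");
        fields.put("format", "application/pdf");
        fields.put("description", "");
        System.out.print(toDublinCore(fields));
    }
}
```

Fields that come back empty are simply left out, which matches what you tend to see in practice: a title, a format, and maybe an author, but not much more.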
I have some code that indexes docs that I could post on GitHub (question 3). Perhaps something to work on together?