SQLTeam.com | Weblogs | Forums

Windows Directory File Paths

sql2012

#1

I am trying to write a VB.NET application that searches a database and returns results based on the search (and opens files that are listed in the results), but I am having a problem with where I can store my data and how to retrieve it. I have searched high and low and it looks like a SQL command called xp_dirtree should work, and it does seem to store the information that I need, but I don't know how to put the data back together once it is broken out (xp_dirtree breaks files and folders down into directories & depth), or whether this is even the correct command I should be using. A lot of the articles on the web are much older, which makes me think that there is a better way to do it. Originally I was storing all the files in text files, but the text files became huge and unworkable.
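For reference, a minimal sketch of capturing xp_dirtree output into a table (note that xp_dirtree is an undocumented extended procedure, so its behaviour is not guaranteed between versions; the share path below is made up, and the column names are what it returns when called with a depth of 0 and the file flag set to 1):

```sql
-- Capture the output of xp_dirtree into a temp table so it can be queried.
CREATE TABLE #DirTree
(
    id           INT IDENTITY(1,1) PRIMARY KEY,  -- preserves row order
    subdirectory NVARCHAR(512),                  -- folder or file name only
    depth        INT,                            -- nesting level, 1 = top
    isfile       BIT                             -- 1 = file, 0 = folder
);

INSERT INTO #DirTree (subdirectory, depth, isfile)
EXEC master.sys.xp_dirtree N'\\server\share\docs', 0, 1;
```

Rebuilding full paths then means walking the rows in order, carrying forward the most recent folder seen at each depth; each file row's path is the chain of folders above it at depths 1 .. depth-1.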

Any help on this topic would be appreciated. Even if you can point me in the right direction for what I should be searching for.


#2

I think it would help if you could explain:

What sort of data the Search could process - is it, for example, keywords from the file and/or dates of the file, or something like that?

Or is it any "Free Text Search" on the contents of the file?

It is possible to store the whole contents of a file in a SQL database (instead of in the file system) and then use Free Text searching tools on that. But that is not always the best way ...

... an alternative is to store the file on Disk, and store the Name & Path of the file in the Database, along with any "Attributes / Properties" that you want to be able to search on.


#3

The program is just going to search file names in a directory. It won't go inside and get the text inside the files. I did have text files that had the paths to all the files, but the text files became huge and hard to work with. I am looking at a 200 MB file and that is just one location of files. My text files total 842 MB :grimacing:


#4

So IIUC you could:

Create a database with a table / column(s) that contain the full path and filename of each file.

You will search that for "suitable/matching filenames"

You will then need to use VB.NET to "open" that file, to retrieve its contents.

I think you will need to open the files directly, rather than being able to get SQL to return the file-content (as part of the resultset)


#5

How do I store the full path names in SQL? Do you know what command I am looking for that will do that? Wouldn't this produce the same large files that my text files did, by having the full path names? I feel like there has to be an easier way. There is this program called Cathy Search that makes a catalog of the file(s), and when you are searching it, it just searches the catalog so the results are fast. Basically what I am trying to do is get rid of Windows Search because it takes too long to find the files. I would use Cathy, but I want something that can be updated and pushed out to the network easily.

Cathy Search


#6

Let's assume that you have a table called [Files] in SQL with the columns [Path] and [FileName] (and maybe some sort of [ID] column that will uniquely identify each row).

You prepare a directory listing of all the paths & filenames that you want to store (various ways to do that, let's assume that it is already done). So now you have the table [Files] populated with lots of rows.
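A sketch of what that table might look like (the schema and the sample row are illustrative only; column sizes and indexing are assumptions you'd tune to your data):

```sql
-- Hypothetical [Files] table matching the description above.
CREATE TABLE dbo.Files
(
    ID         INT IDENTITY(1,1) PRIMARY KEY,
    [Path]     NVARCHAR(1024) NOT NULL,   -- e.g. a UNC folder path
    [FileName] NVARCHAR(512)  NOT NULL
);

-- Populated from whatever directory listing you produce, e.g.:
INSERT INTO dbo.Files ([Path], [FileName])
VALUES (N'\\server\share\docs\2015\', N'FredBlogs-Proposal.DOCX');

-- An index on FileName helps searches; note a leading-wildcard LIKE
-- ('%...') will still scan, but it scans a narrow index rather than the table.
CREATE INDEX IX_Files_FileName ON dbo.Files ([FileName]);
```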

Then you can query SQL:

SELECT [Path], [FileName]
FROM [Files]
WHERE [FileName] LIKE '%FredBlogs%.DOCX'

and from those results you use the [Path] and [FileName] values to locate the file(s).

That's basically it. SQL will query the filenames etc. very quickly - it's good at that! - and can be accessed across the network and so on.

You would need to have either:

a) a means of the Client pulling the files from the Server (once it has the Path/FileName)
b) some sort of APP / Service on the Server that "delivers" the files to the Client

We use a document management system (DMS) here which does that. In addition to the Path / Filename it also has a complete tree structure by Client / Matter within which documents, emails, etc. are stored. Perhaps your Path also simulates a multi-level hierarchy of, for example, Clients and Matters? If so you could "parse" the Path when the data is imported into SQL into some sort of hierarchical structure - then you could have an APP on the Client PCs display all the documents using that hierarchy structure. It's a fair amount of work to build an APP to do all that ...
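As a sketch of that parsing idea, assuming a fixed layout of `\\server\share\Client\Matter\file` (the path here is made up, and this only works if every path really follows that shape):

```sql
-- Pull the first two folder levels out of a UNC path using CHARINDEX,
-- which works on SQL 2012 (no STRING_SPLIT needed).
DECLARE @p NVARCHAR(1024) = N'\\server\share\ClientA\MatterB\Report.DOCX';

-- Strip the leading '\\server\share\' part.
DECLARE @rest NVARCHAR(1024) =
    SUBSTRING(@p, CHARINDEX('\', @p, CHARINDEX('\', @p, 3) + 1) + 1, 1024);

-- @rest is now 'ClientA\MatterB\Report.DOCX'; split off the two levels.
SELECT LEFT(@rest, CHARINDEX('\', @rest) - 1) AS Client,
       SUBSTRING(@rest,
                 CHARINDEX('\', @rest) + 1,
                 CHARINDEX('\', @rest + '\', CHARINDEX('\', @rest) + 1)
                   - CHARINDEX('\', @rest) - 1) AS Matter;
```

You would run something like this once, at import time, storing Client and Matter in their own columns rather than re-parsing on every search.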

The DMS we use stores every version of each document, has all sorts of searchable keywords and other Attributes about the files, has permissions as to who-can-see-what, the ability to "publish" a group of documents to a recipient, is fully integrated into Outlook / Word / etc. so the FILE : SAVE actually stores into the DMS, rather than "to a disk drive", and so on.

I don't know if that is what you are actually wanting, but it might be worth looking at what various Document Management Systems have to offer - if only for a Wish List of ideas for your own APP.

If Cathy does all that you want, it might be better to stick with that. FWIW I insisted that we install a fairly Mickey Mouse document management system at the outset when the company here was formed. That meant that all documents, from day one, were in a DMS with at least rudimentary keywords etc. All incoming documents were scanned and stored. Nowadays we hardly get anything in the post - everything arrives as Email attachments, which avoids having to scan anything - but we were able to migrate the original DMS into something much more sophisticated as the company grew. For most people the real bugbear of a DMS is the initial loading of all the historical documents into the system.


#7

Kristen, thanks for the response. I wasn't sure what DMS was and took a look at it, and our company isn't comfortable with storing our data out in the cloud, at least not yet. We keep a lot of our data in house, and a lot of the live data is replicated throughout the offices. The main problem that we have is that our live data has grown so much, and it doesn't stop growing. So much live data makes the backups take longer, requires more storage space, and if something were to happen to the data it would take longer to restore from scratch. So what we try to push is archiving data that isn't used, but when we archive we take it off the live server and move it to our corporate office. In turn the end users say that it takes too long to search for things across the wire. Cathy was a good option at first because it gave us a quick solution for searching archived data. However, we couldn't put limits on it, like preventing users from making new index files or using it to index our live data, and we didn't want that. So I came in and said Cathy looks easy enough to just rewrite so we can customize it to our needs.

While going back and forth and searching high & low, my coworker found an article that could make SQL work for us. It takes a directory and lists the full UNC path. When an end user searches using the app I wrote in VB.NET, it lists all the files related to the search in a listbox, from which the end user can open the file, open the location or copy the file.

Link to what helped us

Also I meant to ask how come next to your name it says Test?


#8

Yeah, storing stuff in the Cloud seems to be all the rage!! ... we don't store anything there, yet, except our backups. Our Email, SQL Databases, DMS, Intranet Website are all in-house on servers inside our own buildings. So you would be able to host a DMS (or any SQL database and its applications, I suppose) in house for as long as you prefer that method.

It sounds like you have "encapsulated" everything you need in either the UNC path and/or the Filename itself? A DMS, or a Database which stores the UNC and Filename, would be able to do the same job (maybe better, maybe no worse). I'm not seeing a major gain there though if you just store UNC and Filename. For me the major gain comes when you store other useful data in association with your Filename. It sounds like your colleagues know what keywords to use for a search - presumably they form part of either the Filename or the UNC path?

For me the main benefit of a Database is that it allows other data to be stored. Maybe your filenames are "ClientA-ProjectB_yyyymmdd_V999.DOCX" or somesuch, where you encapsulate the Client Code and Project Reference, along with the date and version number or whatever. For me the document could be called "FRED.DOCX" because the Client Code, Project Reference, Created date, Version Number etc. would all be additional columns in the database table. If the project moved from ClientA to ClientB I can just change that value in the database (and I also have an audit of when it changed, by whom, and what it was previously). I can report on all documents for ClientA, or perhaps the narrower search I need is: "All documents for ClientA on ProjectB between date XXX and YYY".

SQL is incredibly fast at running a search like that, whereas anything just using the filesystem, or an index list of the file system, is going to have to use pure horsepower to find matching entries. The difference is that as the size of the index grows, SQL will scale MUCH better than a pure-horsepower solution.
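That narrower search might look something like this (the metadata columns here - ClientCode, ProjectRef, CreatedDate - are hypothetical, standing in for whatever attributes you decide to store):

```sql
-- Search on stored metadata instead of parsing filenames at query time.
SELECT [Path], [FileName]
FROM dbo.Files
WHERE ClientCode  = 'ClientA'
  AND ProjectRef  = 'ProjectB'
  AND CreatedDate >= '20150101'      -- date XXX
  AND CreatedDate <  '20160101';     -- date YYY (exclusive upper bound)
```

With an index on (ClientCode, ProjectRef, CreatedDate) that query is a seek, not a scan, which is where the scaling advantage comes from.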

So for you to want a database solution you probably need to also be wanting to store multiple Attributes or Properties - Meta Data, if you will - about each document. That information would be entered when a document was added to the system. For us, using our DMS, when we "check in" a document it asks us those questions - Which client? Which Project? What Title? and some other "keyword" and "attribute" type things - e.g. a checkbox for whether it is Regulatory related (which is a major thing for us).

Other things that SQL will do:

Not get corrupted. SQL Server is very robust, and uses a twin-file system of Data and Log. Every transaction (i.e. a change to the database [slight oversimplification]) is stored in the log, sequentially, and applied to the data file. It is possible to restore a Full Backup from, say, last night and then replay, forwards, all the Log file backups to a given date/time. So if someone [or a software bug :frowning:] accidentally deletes half the customers you can restore to just-before-that-happened.

Also, if the Data file does become corrupted then you can restore from a Full Backup before the problem (e.g. a disk controller is found to have been corrupting data since Tuesday Night) and then play forward the Log Backups. Most times you can do this right up to "now" and have zero data loss, in particular if the Data and Log files, and their backups, are stored on different disk drives / controllers - because a corruption in one is unlikely to also affect the other.
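A sketch of that point-in-time restore (the database name, backup file paths and timestamp are all examples):

```sql
-- Restore the last good full backup, leaving the database able to accept logs.
RESTORE DATABASE MyDocs FROM DISK = N'D:\Backup\MyDocs_Full.bak'
    WITH NORECOVERY;

-- Replay log backups forward, stopping just before the bad moment.
RESTORE LOG MyDocs FROM DISK = N'D:\Backup\MyDocs_Log1.trn'
    WITH NORECOVERY, STOPAT = '2015-06-02T14:30:00';

-- Bring the database back online.
RESTORE DATABASE MyDocs WITH RECOVERY;
```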

Data is consistent - by that I mean that if you say "Update the Files Table and the Archive Table and the User Statistics Table" then SQL can perform that as a single transaction. Either all three things happen, or none of them; it will never just do the File Table Update on its own. So if you get a power cut / server crash in the middle, or a critical program bug, then SQL will rollback any partially completed parts of the transaction.
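A sketch of that all-or-nothing behaviour (the three table names are the ones used loosely above; the columns and variables are made up for illustration):

```sql
-- Either all three changes happen, or none of them do.
DECLARE @FileID INT = 42, @UserID INT = 7;

BEGIN TRANSACTION;
    UPDATE dbo.Files     SET IsArchived = 1                      WHERE ID = @FileID;
    INSERT INTO dbo.Archive (FileID, ArchivedAt) VALUES (@FileID, GETDATE());
    UPDATE dbo.UserStats SET ArchiveCount = ArchiveCount + 1     WHERE UserID = @UserID;
COMMIT TRANSACTION;
-- A crash or error before COMMIT rolls back all three statements together.
```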

SQL backups can be compressed. That saves about 80% of the disk space and 40% of the I/O (i.e. "time").
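That's a one-option change at backup time (names and paths are examples; note backup compression needs an edition that supports it - the actual saving depends on how compressible your data is):

```sql
-- Full backup with compression; INIT overwrites any existing backup in the file.
BACKUP DATABASE MyDocs TO DISK = N'D:\Backup\MyDocs_Full.bak'
    WITH COMPRESSION, INIT;
```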

You will still have to backup all your files (unless you store those, too, in the SQL database - but there are good reasons for both DOING and NOT doing that).

Back in the old forum when anyone got to 1,000 posts the Zen Master, Merkin, bestowed a Custom Title on the poster. There were lots of great titles - like "SQL Warrior Princess" who subsequently became "Almighty SQL Goddess" :slight_smile: So ... back in 2004 I eventually got to 1,000 posts and was given the unbelievably august title of ... "test" :heart_eyes:... whilst Merkin had been away there was some slightly ... errmmm ... "off topic posting" and I got the blame ...

I'm sure it is of no great interest anymore, but just in case here's a link to the story:

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=61454#213293


#9

Really interesting article.

Thank you for taking the time to write it up for me, and I did take a few looks at the posts about your 1,000th post.

Keep up the good work. Sorry for such a long delay in responding.