I am having trouble with the section: # Find longest match of each STR in DNA sequence.
I don't understand why, when I print(longest_str), all the values are 0: {'AGATC': 0, 'AATG': 0, 'TATC': 0}
Am I calling the longest_match function wrong?
PS: I am new to programming and Python, thanks for your help!
import csv
import sys


def main():

    # TODO: Check for command-line usage
    longest_str = {}
    if len(sys.argv) != 3:
        sys.exit("Usage: python dna.py, data.csv, sequence.txt")

    # TODO: Read database file into a variable
    with open(sys.argv[1]) as f:
        data = csv.DictReader(f)

        # TODO: Read DNA sequence file into a variable
        with open(sys.argv[2]) as f2:
            dna_sequence = csv.DictReader(f2)

            # TODO: Find longest match of each STR in DNA sequence
            subsequences = data.fieldnames[1:]
            for subsequence in subsequences:
                longest_str[subsequence] = longest_match(str(dna_sequence), subsequence)
            print(longest_str)

    # TODO: Check database for matching profiles

    return


def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1

            # If there is no match in the substring
            else:
                break

        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in sequence, return longest run found
    return longest_run


main()
CodePudding user response:
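The DNA sequence file is not a CSV file, so this line is the problem: dna_sequence = csv.DictReader(f2)
Here dna_sequence is a DictReader object, not the text of the file. The longest_match function provided by CS50 won't know what to do with it. It needs a string. Wrapping it in str() does not help: that only produces the object's description text, not the file contents.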
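To see why every count comes out as 0, here is a minimal sketch of what str() gives you for a DictReader. It uses an in-memory io.StringIO object as a stand-in for the real sequence file, and the sequence in it is made up for illustration.

```python
import csv
import io

# Stand-in for a sequence file: one line of DNA, no header row (made-up data).
fake_file = io.StringIO("AGATCAGATCAATG\n")

reader = csv.DictReader(fake_file)

# str() does not read the file; it returns the object's default description.
as_text = str(reader)
print(as_text)  # something like <csv.DictReader object at 0x...>

# The STR never appears in that description, so longest_match finds no run.
print("AGATC" in as_text)
```

So longest_match is being called correctly; it is just searching the wrong text.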
CodePudding user response:
To clarify what @Fuelled_By_Coffee said, csv.DictReader() returns a DictReader object. It is used to iterate over the rows in a CSV file, returning a dictionary for each row of data. So, data and dna_sequence are DictReader objects, NOT the contents of each file.
A DictReader object is appropriate for reading the CSV file. However, you're not done reading that file: creating the reader does not load any rows. Before you start checking DNA sequences, you need to read all of the data from the CSV file into memory. My advice: get this working first, BEFORE you work on the rest of the code.
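One way to do that (a sketch, not the only approach) is to pull the rows into a list while the file is still open. The names and numbers below are made up, and io.StringIO stands in for the file you would normally open with open(sys.argv[1]).

```python
import csv
import io

# Made-up stand-in for a database CSV file.
fake_csv = io.StringIO("name,AGATC,AATG,TATC\nAlice,2,8,3\nBob,4,1,5\n")

reader = csv.DictReader(fake_csv)

# The header row minus the first column ("name") gives the STR names.
subsequences = reader.fieldnames[1:]

# list() consumes the reader, so every row is now an ordinary dict in memory.
people = list(reader)

print(subsequences)                            # ['AGATC', 'AATG', 'TATC']
print(people[0]["name"], people[0]["AGATC"])   # Alice 2
```

Note that the values come back as strings ("2", not 2), so they need int() before you compare them with the counts from longest_match.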
Regarding the dna_sequence data, these files aren't appropriate for DictReader. That object expects a header row with field names. To see what I mean, compare the contents of sequence\1.txt to databases\small.csv. Notice how the CSV has a header line, and the sequence file doesn't? You need a different Python method to read the sequence files.
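One option is the plain read() method of the file object, which returns the whole file as a single string. The sketch below writes a made-up sequence to a temporary file so it runs on its own; in dna.py you would open sys.argv[2] instead.

```python
import os
import tempfile

# Write a made-up sequence to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("AGATCAGATCAATG\n")
    sequence_path = tmp.name

# read() returns the entire file as one string; strip() drops the trailing newline.
with open(sequence_path) as f2:
    dna_sequence = f2.read().strip()

os.remove(sequence_path)

print(dna_sequence)        # AGATCAGATCAATG
print(type(dna_sequence))  # <class 'str'>
```

A string read this way can be passed straight to longest_match(dna_sequence, subsequence), with no str() call needed.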
